SDGR-Net: A Spatiotemporally Decoupled Gated Residual Network for Robust Multi-State HDD Health Prediction

Wu, Zehong; Qin, Jinghui; Lu, Yongyi; Yang, Zhijing

doi:10.3390/electronics15071399



School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China


Electronics 2026, 15(7), 1399; https://doi.org/10.3390/electronics15071399

Submission received: 3 February 2026 / Revised: 9 March 2026 / Accepted: 24 March 2026 / Published: 27 March 2026


Abstract

Accurate prediction of hard disk drive (HDD) health states is critical for enabling proactive data maintenance and ensuring data reliability in large-scale data centers. However, conventional models often suffer from semantic entanglement among heterogeneous SMART attributes and from the masking of incipient failure signatures by stochastic noise. To address these challenges, we propose SDGR-Net, a spatiotemporally decoupled learning framework designed to model the complex degradation dynamics of HDDs. SDGR-Net introduces three synergistic innovations: (1) a spatiotemporally decoupled dual-branch encoder that disentangles longitudinal temporal evolution from cross-variable correlations via parameter-isolated branches, thereby reducing representational interference; (2) a parsimonious dual-view temporal extraction mechanism that captures early-stage anomalies through forward–reverse sequence concatenation, enabling high-fidelity preservation of non-stationary pre-failure patterns; and (3) a cross-branch dynamic gated residual fusion module that functions as an adaptive information bottleneck to emphasize failure-critical features while suppressing redundant noise. Extensive experiments conducted on three heterogeneous HDD datasets, ST4000DM000, HUH721212ALN604, and MG07ACA14TA, demonstrate that SDGR-Net consistently outperforms six state-of-the-art baselines. In particular, SDGR-Net achieves a peak fault detection rate (FDR) of 0.9898 and a 69.6% relative reduction in false alarm rate (FAR) under high-reliability operating conditions. These results, corroborated by comprehensive ablation studies, indicate that SDGR-Net effectively balances detection sensitivity and operational robustness, offering a practical solution for intelligent HDD health monitoring.

Keywords: HDD failure prediction; remaining useful life; dual-view feature extraction; cross-branch dynamic gating

1. Introduction

With the exponential expansion of cloud storage infrastructure and large-scale data centers, hard disk drives (HDDs) continue to serve as a fundamental component of modern data storage systems. Nevertheless, as reported in the widely cited Backblaze drive statistics [1] and corroborated by recent industry analysis [2], disk reliability still cannot be fully guaranteed, and failures are unavoidable. Unexpected disk failures may result in severe data loss and service disruptions, underscoring the need for robust mechanisms for early failure prediction. Accordingly, leveraging Self-Monitoring, Analysis, and Reporting Technology (SMART) data for failure prediction has emerged as a critical research direction for enhancing data center reliability [3,4]. SMART attributes, such as reallocated sector count and raw read error rate, provide a multivariate time-series view of disk health. However, effective utilization of these data remains challenging due to their high dimensionality, severe class imbalance, and substantial noise. Furthermore, the precise definition of a “failure sample” is often ambiguous, as failure signals may appear intermittently before an actual breakdown [5]. Early approaches primarily relied on statistical methods and basic Machine Learning (ML) models. For example, Classification And Regression Trees (CART) [6] and Random Forests [7] were widely adopted to model nonlinear relationships among SMART attributes. To enhance computational efficiency, researchers have also explored two-stage feature selection strategies [8] to eliminate redundant indicators. To mitigate the severe data imbalance problem, where failure samples are scarce relative to healthy ones, methods such as the Synthetic Minority Over-sampling Technique (SMOTE) and cost-sensitive learning were introduced [9]. 
More recently, specialized frameworks like FPTSF [10] have been developed to extract robust time-series features from low-quality datasets, while SiaDFP [11] employs Siamese neural networks to quantify similarity between healthy and failing disks, thereby improving detection performance in few-shot scenarios.

Despite these advancements, traditional ML models often treat SMART data as static snapshots, failing to capture long-term degradation trends. To overcome this, deep learning (DL) has become the dominant paradigm. BiLSTM-based approaches [12] combined with deep feature extractors have proven effective in modeling temporal dependencies. For scenarios involving new disk models with limited historical data, Generative Adversarial Networks (GANs) [13] and transfer learning methods [14] have been applied to bridge the distribution gap between heterogeneous disks. Similarly, temporal convolutional networks (TCNs) integrated with autoencoders [15] and the recent ConvTrans-TPS architecture [16] have demonstrated that jointly capturing local and global temporal correlations substantially improves predictive performance. Furthermore, to address disk heterogeneity in large-scale data centers, methods like HDDse [17] leverage high-dimensional embedding techniques, while ensemble learning strategies [18] combine diverse predictors to enhance generalization. In addition, active semi-supervised learning frameworks [19] have been proposed to reduce reliance on costly manual annotation.

However, a critical review of existing deep learning methodologies reveals three major limitations that impede their efficiency and accuracy in real-time deployment. The first limitation is entangled representation. Most existing models, such as standard LSTMs or TCNs, jointly encode temporal dynamics and inter-variable correlations within a unified encoder. This entanglement prevents the model from explicitly separating the temporal degradation trajectory of individual attributes from the spatial correlations among multiple attributes. The second limitation is redundancy in bidirectional modeling. Although BiLSTM architectures [12] are effective at capturing backward dependencies, including short-term pre-failure anomalies, they inherently double the number of recurrent parameters. This redundancy increases computational overhead and elevates the risk of overfitting, particularly when training data are limited or noisy. The third limitation is ineffective feature fusion. Existing multi-view or multi-modal approaches often rely on static concatenation or simple weighted averaging to combine features. These methods lack adaptive mechanisms to dynamically regulate the relative contributions of temporal and spatial features, resulting in suboptimal information integration.

To address these challenges, we propose SDGR-Net, a novel framework for disk failure prediction. The model incorporates architectural and fusion-level innovations to enable efficient, robust, and scalable modeling of multivariate SMART time-series data. Specifically, SDGR-Net decouples spatiotemporal representations through a dual-branch heterogeneous LSTM architecture. Rather than relying on a parameter-intensive bidirectional LSTM (BiLSTM), we introduce a lightweight bidirectional temporal extraction mechanism based on dual-view concatenation. By constructing forward and reversed temporal views via sequence reversal, this mechanism captures both progressive degradation trends and short-term pre-failure anomalies without introducing additional recurrent units. Two homogeneous unidirectional LSTMs independently process the dual views, and the resulting features are concatenated along the channel dimension to preserve fine-grained bidirectional temporal information. A lightweight linear projection layer is then applied to compress the expanded feature space back to the target hidden dimension. Compared with standard BiLSTM [12], this design reduces the number of gated recurrent parameters by approximately 50%, substantially improves inference efficiency, and is better suited for modeling non-stationary SMART time series. In addition, SDGR-Net incorporates a dynamic gated residual cross-branch fusion module that adaptively integrates temporal and variate features, thereby enhancing critical signal expressiveness and improving training stability.

The main contributions are summarized as follows:

(1) We propose a spatiotemporally decoupled dual-branch architecture that independently models longitudinal temporal dynamics and cross-variable correlations. This design effectively alleviates semantic entanglement and reduces feature interference, resulting in more discriminative representations of disk health states.
(2) We introduce a lightweight dual-view bidirectional temporal extraction mechanism that captures critical pre-failure anomalies via forward–reverse sequence concatenation. This approach preserves transient signals with high fidelity while remaining substantially more parameter-efficient than conventional bidirectional LSTMs.
(3) We develop a dynamic gated residual fusion module that adaptively integrates heterogeneous features while suppressing stochastic noise in SMART data. Extensive experimental results demonstrate that this module significantly reduces the false alarm rate (FAR) while maintaining stable training and robust generalization across large-scale HDD populations.

2. Related Work

Hard disk drive (HDD) failure prediction has evolved from heuristic statistical analysis to data-driven machine learning and, more recently, complex deep learning architectures. Early research primarily focused on characterizing failure patterns through statistical analysis of large-scale datasets, laying the physical foundation for understanding disk degradation. For instance, in a seminal study regarding environmental factors, Sankar et al. [20] analyzed data from Microsoft data centers and established a strong correlation between elevated operating temperatures and increased failure rates, challenging earlier assumptions about cooling costs. Similarly, Schroeder and colleagues [21] conducted a comprehensive analysis of hardware failures across high-performance computing sites, revealing that failure rates often exceed manufacturer specifications and do not strictly follow the traditional “bathtub curve”. To complement these macroscopic observations, other works employed non-parametric statistical methods to detect anomalies by comparing the distribution of current SMART attributes against known healthy baselines. However, while these statistical approaches provide valuable insights into failure triggers such as heat and workload, they typically rely on univariate thresholds and lack the capacity to model the complex, non-linear interactions required for precise, individual-level prediction.

To exploit the multivariate nature of SMART data, researchers subsequently turned to machine learning (ML) techniques, where feature engineering combined with robust classifiers became the industry standard. A landmark study [22] demonstrated that ensemble methods like Random Forest (RF) could effectively filter noise and achieve higher detection rates than threshold-based systems in hyperscale storage environments, a finding further solidified by Li et al. [23] who established RF as a robust baseline for handling noisy SMART data. To address the probabilistic uncertainty of failure dependencies, Wang et al. [24] proposed BaNHFaP, a Bayesian Network-based approach that models the conditional dependencies among different sensor attributes, which was later extended with transfer learning [25] to adapt to new disk models. Furthermore, recognizing the computational constraints in large-scale systems, researchers also explored optimization strategies, such as the two-layer feature selection method proposed in [8], utilizing filter and wrapper methods to eliminate redundant indicators before training. Despite their interpretability and efficiency, these traditional ML models generally treat SMART data as static snapshots or flat vectors, which limits their ability to capture long-term sequential degradation trends inherent in the disk aging process.

Consequently, the research focus has shifted toward deep learning to better model temporal dependencies. Han et al. [4] introduced Temporal Convolutional Networks (TCNs) with dilated convolutions to capture long-range patterns in SMART sequences, effectively overcoming the limited receptive field of standard CNNs. Comprehensive evaluations of Recurrent Neural Networks (RNNs) [26] further demonstrated that architectures with memory cells, such as LSTMs and GRUs, cope with the vanishing gradient problem in long sequences better than traditional MLPs. A critical advancement in this domain was made by Che et al. [27], who utilized Bidirectional LSTMs (BiLSTMs) to estimate the Remaining Useful Life (RUL) of hard drives. Their work highlighted that analyzing data in both forward and reverse directions can reveal subtle pre-failure anomalies that unidirectional models might miss. Additionally, to address the issue of imprecise failure labels, where the exact onset of failure is unknown, Li et al. [28] proposed a Multi-Instance Learning (MIL) framework combined with attention mechanisms, allowing the model to automatically focus on the most discriminative time steps. For resource-constrained environments like Mobile Edge Computing (MEC), lightweight LSTM variants [29] have also been explored to balance accuracy and deployment cost.

In parallel with architectural advancements, recent studies have focused on tackling practical challenges such as data heterogeneity and the scarcity of labeled data in real-world deployments. For new disk models with zero historical failures, Shi et al. [13] proposed DGTL-Net, a deep generative transfer learning network that synthesizes fault features to train diagnostic models. Similarly, Zhang et al. [14] addressed the “minority disk” problem by transferring knowledge from data-rich disk models to data-sparse ones. These approaches are deeply rooted in the broader advancements of deep transfer learning strategies within intelligent machinery fault diagnosis, which emphasize the transition of failure knowledge from source domains to target applications via shared representations and domain adaptation [30]. To manage the complexity of heterogeneous data centers containing various disk models, Xie et al. [31] developed Ome, an optimized modeling engine that groups disks and trains specialized models. Addressing the high cost of manual labeling, Zhou et al. [19] introduced active semi-supervised learning, which selectively queries human experts for the most informative samples. Similarly, Siddique et al. [32] employed Vision Transformers within a semi-supervised framework to capture high-fidelity fault features, demonstrating that uncertainty-aware knowledge distillation can significantly enhance diagnostic reliability when labeled data is scarce. From an industry perspective, Miller et al. [2] emphasized that besides pure accuracy, model maintainability and low false alarm rates are crucial for effective operations.

Notwithstanding these methodological innovations, existing solutions still exhibit structural limitations that hinder optimal performance. First, most deep learning models [4,26,29] process temporal dynamics and attribute correlations within a single encoder branch, leading to entangled representations where dominant temporal trends may suppress subtle inter-variable correlation signals. Second, while BiLSTMs [27] effectively capture backward dependencies essential for detecting short-term anomalies, they inherently double the parameter count and computational cost, making them less ideal for large-scale, real-time monitoring. Third, hybrid approaches often rely on static concatenation or fixed weights to combine features from different views, lacking the adaptability to handle dynamic changes in feature importance during the degradation process. These identified gaps motivate the design of our proposed SDGR-Net, which introduces a spatiotemporally decoupled architecture, a lightweight dual-view concatenation mechanism, and a dynamic gated fusion module to address these challenges systematically.

3. SDGR-Net

The SDGR-Net, illustrated in Figure 1, is designed to efficiently and robustly extract both longitudinal temporal dynamics and cross-variable correlations from multivariate SMART time series, while preserving fine-grained pre-failure anomaly information. It comprises three primary components: a spatiotemporally decoupled dual-branch LSTM encoder, a parsimonious dual-view bidirectional temporal concatenation mechanism, and a cross-branch dynamic gated residual fusion module followed by feature enhancement and classification layers.

Within SDGR-Net, the encoder adopts a hierarchical spatiotemporal decoupling strategy that separates the extraction of longitudinal degradation patterns from transverse variable correlations into parameter-isolated branches. This design mitigates representational interference while maintaining a structured flow of information. Specifically, the temporal branch first captures global degradation trajectories, after which the variable branch exploits these temporal contexts to identify synergistic inter-attribute correlations. To preserve high signal fidelity, SDGR-Net maintains full temporal resolution in both branches by avoiding premature pooling and instead employs a dual-view concatenation strategy to obtain compact representations without sacrificing localized anomaly patterns.

The variable branch employs a parallel transverse architecture that focuses on modeling synergistic correlations among SMART attributes. Each attribute sequence is transformed into dual temporal views and processed independently, yielding robust variable-correlation features that complement temporal features with an orthogonal semantic perspective.

The outputs of the decoupled branches are adaptively integrated through a cross-branch dynamic gated residual fusion module. Acting as a differentiable information bottleneck, this module computes feature-specific weights via a sigmoid-based gating mechanism and fuses variable-branch features into the temporal branch through residual connections. This design amplifies failure-critical signals, improves the signal-to-noise ratio, and stabilizes gradient propagation, thereby enabling effective exploitation of heterogeneous information in data-scarce settings.

The fused representations are further refined using a residual feedforward network and a multi-head attention module to capture higher-order dependencies. Global temporal pooling is then applied, followed by a fully connected classifier that outputs predicted probabilities for four HDD health states: Good, Fair, Warning, and Alert. Overall, SDGR-Net systematically addresses three core challenges in HDD health monitoring: (i) the isolation of orthogonal temporal and variable features, (ii) parsimonious modeling of bidirectional temporal context, and (iii) adaptive cross-branch feature modulation.
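The final classification step described above can be illustrated with a minimal sketch. It assumes that global temporal pooling is a mean over time steps and that the classifier is a single linear layer followed by a softmax; the function names and the toy weights are hypothetical and are not taken from the SDGR-Net implementation.

```python
import math

STATES = ["Good", "Fair", "Warning", "Alert"]

def temporal_mean_pool(seq):
    """Average a list of T feature vectors into one vector (assumed pooling)."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, bias):
    """Pool over time, apply one linear layer, and return (state, probabilities)."""
    pooled = temporal_mean_pool(features)
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return STATES[probs.index(max(probs))], probs

# Toy input: T = 2 time steps, H = 2 features; weights are illustrative only.
features = [[1.0, 0.0], [3.0, 0.0]]
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
state, probs = classify(features, weights, bias)
print(state)
```

In this toy setting only the last output unit responds to the pooled feature, so the predicted state is the fourth class.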

3.1. Spatiotemporally Decoupled Dual-Branch Heterogeneous LSTM Encoder

Multivariate SMART time series serve as digital telemetry of HDD health and inherently encode heterogeneous information with fundamentally distinct statistical properties and physical interpretations. From a longitudinal perspective, individual SMART attributes exhibit continuous, cumulative, and often irreversible degradation patterns that reflect mechanical aging processes. From a transverse perspective, multiple attributes display complex, instantaneous co-evolution and nonlinear interdependencies, which frequently indicate acute anomalies or external stressors. Conventional prediction paradigms typically employ single-branch or shared-encoder architectures that implicitly project these disparate semantic dimensions into a unified latent space. We argue that such coupled modeling is theoretically suboptimal due to representation interference and manifold entanglement. When heterogeneous signals, namely low-frequency monotonic trends and high-frequency volatile correlations, are entangled within a shared parameter space, the resulting optimization landscape becomes ill-conditioned. In particular, dominant temporal trends tend to monopolize gradient updates during backpropagation, leading to gradient domination. As a consequence, subtle yet critical inter-variable correlation features are suppressed, rendering the model insensitive to complex failure modes.

To fundamentally address these limitations, SDGR-Net introduces a spatiotemporally decoupled dual-branch heterogeneous LSTM encoder (as illustrated in Figure 2). The core design principle is to achieve orthogonal feature extraction through hierarchical spatiotemporal decoupling. Rather than enforcing strict physical independence, SDGR-Net separates longitudinal temporal modeling from transverse variable-dependency modeling into two pathways governed by strict parameter isolation. This structural separation prevents gradients associated with global temporal trends from overwhelming delicate cross-attribute signals during early-stage representation learning.

The temporal-wise branch is specifically designed to characterize longitudinal Markovian transitions of disk health while prioritizing the preservation of non-stationary transient signals. In contrast to conventional approaches that prematurely apply global average pooling (GAP) during feature extraction—effectively acting as a low-pass filter that attenuates incipient anomaly patterns—SDGR-Net adopts a deferred pooling strategy. The temporal branch retains full hidden-state sequences throughout the extraction process, ensuring that downstream modules have access to high-resolution temporal fluctuations. To capture bidirectional temporal context, this branch processes both forward and reversed temporal views. Let

$H_{t,\mathrm{fwd}}$ and $H_{t,\mathrm{bwd}}$ be the hidden states derived from these views. The fused temporal representation is formulated as

$$F_{t} = \sigma\left([H_{t,\mathrm{fwd}} \oplus H_{t,\mathrm{bwd}}]\, W_{rt} + b_{rt}\right) \in \mathbb{R}^{B \times T \times H}, \qquad (1)$$

where $\oplus$ denotes concatenation, and $W_{rt}$ is a learnable projection matrix. The dimensions $B$, $T$, and $H$ represent the batch size, the temporal window length, and the hidden dimension, respectively. This operation performs implicit manifold alignment, compressing high-fidelity bidirectional information while preserving localized anomaly patterns that often precede mechanical failures.
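As an illustration of Equation (1), the following sketch applies the concatenation and projection to a single sample. It assumes that the forward and backward hidden states are already computed and aligned step by step, and that the projection matrix has 2H rows and H columns; the helper name `fuse_dual_view` and all numeric values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_dual_view(h_fwd, h_bwd, w_rt, b_rt):
    """Apply the concatenate-project-sigmoid step of Eq. (1) to one sample.

    h_fwd, h_bwd: lists of T hidden vectors, each of length H.
    w_rt: projection matrix with 2H rows and H columns (assumed layout).
    b_rt: bias vector of length H.
    Returns T fused vectors of length H.
    """
    fused = []
    for hf, hb in zip(h_fwd, h_bwd):
        concat = hf + hb  # channel-wise concatenation, length 2H
        out = []
        for j in range(len(b_rt)):
            z = sum(concat[i] * w_rt[i][j] for i in range(len(concat))) + b_rt[j]
            out.append(sigmoid(z))
        fused.append(out)
    return fused

# Toy example: T = 3 time steps, H = 2 hidden units.
h_fwd = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_bwd = [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]]
w_rt = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
b_rt = [0.0, 0.0]
f_t = fuse_dual_view(h_fwd, h_bwd, w_rt, b_rt)
print(len(f_t), len(f_t[0]))  # T and H are preserved
```

The output keeps one H-dimensional vector per time step, which is consistent with the deferred pooling strategy: no temporal averaging takes place at this stage.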

Orthogonal to temporal modeling, the variable-wise branch reformulates the encoding process by treating the attribute index as the sequential dimension. To enable hierarchical decoupling, this branch receives a contextualized input derived from the temporal summary, ensuring that variable correlations are learned within the established longitudinal degradation context. Although information flow between branches is coordinated, the LSTM units in the variable-wise branch operate with an independent parameter set and traverse across attributes to explicitly model the synergistic structure of the HDD. Whereas the temporal branch captures the when of degradation, the variable-wise branch focuses on identifying structural anomalies—the how of failure—through topological relationships among attributes. To maximize mutual information capture across sensors, this branch also employs a dual-view mechanism. Let

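$H_{v,\mathrm{fwd}}$ and $H_{v,\mathrm{bwd}}$ represent the hidden features extracted from the variable sequences; the output is then calculated as

$$F_{v} = \sigma\left([H_{v,\mathrm{fwd}} \oplus H_{v,\mathrm{bwd}}]\, W_{rv} + b_{rv}\right) \in \mathbb{R}^{B \times V \times H}, \qquad (2)$$

where $W_{rv}$ and $b_{rv}$ are the learnable projection parameters of the variable-wise branch and $V$ denotes the number of SMART attributes.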

By enforcing strict parameter isolation and deferring global average pooling until the final classification stage, SDGR-Net ensures that the optimization of directional features remains orthogonal. This design enables the model to learn semantically disentangled feature representations, one emphasizing cumulative degradation and the other highlighting structural anomaly proximity, without the cross-dimensional noise propagation or information dilution commonly observed in coupled, pooling-intensive architectures.
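The change of sequential axis used by the variable-wise branch can be sketched as a simple transposition. The example below assumes a raw window of T daily readings over V attributes and assumes that the dual views are obtained by reversing the attribute order; the contextualized input derived from the temporal summary is omitted, and the helper names and readings are hypothetical.

```python
def to_variable_view(window):
    """Transpose a [T][V] window so that the attribute index becomes the sequence axis."""
    return [list(col) for col in zip(*window)]

def dual_views(sequence):
    """Return the original sequence and its reversed copy."""
    return sequence, sequence[::-1]

# Toy window: T = 4 days, V = 3 attributes (values are illustrative only).
window = [
    [5, 100, 30],
    [6, 100, 31],
    [9, 98, 35],
    [15, 90, 41],
]
attr_seq = to_variable_view(window)  # V steps, each holding T values
fwd_view, bwd_view = dual_views(attr_seq)
print(len(attr_seq), len(attr_seq[0]))
```

After the transposition, a recurrent encoder that steps through `attr_seq` traverses attributes rather than days, which is what allows it to model cross-attribute structure with its own parameter set.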

3.2. Parsimonious Bidirectional Temporal Extraction via Dual-View Concatenation

To implement the feature extraction described in the previous section and to overcome the inherent limitations of conventional unidirectional LSTMs and fully bidirectional LSTM (BiLSTM) architectures, SDGR-Net incorporates a parsimonious bidirectional temporal modeling mechanism. In SMART-based monitoring, disk health evolution is characterized by a complex interplay between long-term gradual deterioration and short-term non-stationary pulses. While unidirectional models effectively capture longitudinal trends, they suffer from information lag, as the hidden state at any given time step lacks visibility of future health transitions. In the context of failure prediction, the future relative to a past observation contains the failure event itself, which serves as the most critical supervision signal. Conversely, standard BiLSTM architectures address this but introduce significant parameter redundancy. They necessitate maintaining two completely separate sets of gating weights, doubling the model complexity. In class-imbalanced HDD datasets, such over-parameterization significantly increases the risk of overfitting to the majority healthy class and complicates the optimization of rare failure manifolds.

The dual-view temporal modeling strategy employed in SDGR-Net resolves this dilemma through a parsimonious design consisting of three sequential stages: sequence inversion, independent directional encoding, and concatenation-based dimensionality reduction. The process begins by generating a reversed temporal view via a time-axis inversion operator. While the original view captures forward evolution trends, the reversed view conceptually anchors the impending crash as a temporal origin ($t = 0$). From a gradient-flow perspective, this inverted perspective provides a future-to-past shortcut for error propagation, allowing the supervision signal from the failure label to reach distal incipient anomalies more effectively during backpropagation.

Crucially, unlike standard BiLSTMs that employ two independent sets of weights ($2P$), SDGR-Net utilizes a single unidirectional LSTM stack to process both the original forward and time-reversed backward sequences in two distinct passes. This means that the directional views share the same set of LSTM gating parameters ($W$ and $b$). This parameter-shared dual-pass design facilitates orthogonal optimization of directional features within a unified parameter space, allowing the model to learn representations for both degradation accumulation and anomaly proximity with significantly higher efficiency.
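The parameter-shared dual-pass idea can be sketched with a small LSTM written in plain Python. This is a minimal illustration under stated assumptions: a single layer, one bias vector per gate, random initial weights, and backward outputs left in reversed time order (whether they are realigned before concatenation is not specified here). The class `TinyLSTM` and the function `dual_pass` are hypothetical names.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Single-layer LSTM with one bias per gate unit (illustrative only)."""

    def __init__(self, d_in, hidden, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        # 4*H rows (input, forget, cell, output gates) over the joined [x, h] input.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(d_in + hidden)]
                  for _ in range(4 * hidden)]
        self.b = [0.0] * (4 * hidden)

    def num_params(self):
        return sum(len(row) for row in self.w) + len(self.b)

    def run(self, seq):
        """Return the hidden state at every step of seq (list of T input vectors)."""
        n = self.hidden
        h, c, outputs = [0.0] * n, [0.0] * n, []
        for x in seq:
            xh = x + h
            z = [sum(wi * vi for wi, vi in zip(row, xh)) + bi
                 for row, bi in zip(self.w, self.b)]
            i_g = [sigmoid(v) for v in z[0:n]]
            f_g = [sigmoid(v) for v in z[n:2 * n]]
            g_g = [math.tanh(v) for v in z[2 * n:3 * n]]
            o_g = [sigmoid(v) for v in z[3 * n:4 * n]]
            c = [f * cp + i * g for f, cp, i, g in zip(f_g, c, i_g, g_g)]
            h = [o * math.tanh(cv) for o, cv in zip(o_g, c)]
            outputs.append(h)
        return outputs

def dual_pass(lstm, seq):
    """Encode the forward and time-reversed views with the same weights."""
    return lstm.run(seq), lstm.run(seq[::-1])

lstm = TinyLSTM(d_in=9, hidden=4)
seq = [[(t + k) / 10 for k in range(9)] for t in range(5)]  # T = 5, D_in = 9
h_fwd, h_bwd = dual_pass(lstm, seq)
print(lstm.num_params(), len(h_fwd), len(h_bwd))
```

Because both passes call the same `run` method on the same object, the second view adds computation but no additional recurrent weights.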

To quantitatively substantiate our efficiency claims, Table 1 presents an explicit parameter count comparison based on our experimental configuration ($D_{in} = 9$, $H = 512$). For an LSTM layer where the parameter count is $P = 4(D_{in} H + H^{2} + H)$, a standard BiLSTM requires $2P$ (approximately 2.13 million parameters). In contrast, our shared dual-pass design requires only $P$ (approximately 1.06 million), confirming a strict 50% reduction in recurrent complexity while maintaining bidirectional modeling capabilities.
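The arithmetic behind this comparison can be reproduced directly from the stated formula. The short sketch below evaluates $P = 4(D_{in} H + H^{2} + H)$ for the reported configuration; the function name `lstm_params` is hypothetical, and the formula assumes a single bias vector per gate, as in the expression above.

```python
def lstm_params(d_in, hidden):
    """Parameter count of one LSTM layer: P = 4 * (D_in*H + H^2 + H)."""
    return 4 * (d_in * hidden + hidden * hidden + hidden)

p_shared = lstm_params(9, 512)  # shared dual-pass design: P
p_bilstm = 2 * p_shared         # standard BiLSTM: 2P
reduction = 1 - p_shared / p_bilstm

print(p_shared)   # 1069056
print(p_bilstm)   # 2138112
print(reduction)  # 0.5
```

Frameworks that store two bias vectors per gate would report slightly larger totals for both designs, but the ratio between them would remain one half.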

Regarding feature integration, SDGR-Net deliberately avoids conventional aggregation strategies such as mean pooling or element-wise summation. Theoretically, such averaging operations act as destructive low-pass filters, which tend to dilute the directional asymmetry and cancel out high-frequency pre-failure pulses. Instead, we employ a concatenation-based fusion strategy to preserve high-fidelity details from both views in a hyper-dimensional latent space. A subsequent linear projection layer then performs feature selection and dimensionality compression, mapping the integrated features back to a compact representation.
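A toy example may help to show why averaging can erase directional asymmetry while concatenation retains it. The feature values below are invented for illustration, and the helper names are hypothetical.

```python
def mean_fuse(a, b):
    """Element-wise average of two directional feature vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def concat_fuse(a, b):
    """Channel-wise concatenation of two directional feature vectors."""
    return a + b

# Two different situations whose forward and backward features have opposite signs.
case_a = ([0.8, -0.2], [-0.8, 0.2])
case_b = ([-0.8, 0.2], [0.8, -0.2])

print(mean_fuse(*case_a), mean_fuse(*case_b))      # identical after averaging
print(concat_fuse(*case_a), concat_fuse(*case_b))  # still distinguishable
```

Both cases collapse to the same zero vector under averaging, whereas the concatenated vectors remain distinct, leaving it to the subsequent projection layer to decide which components to keep.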

Overall, this mechanism serves as a parsimonious bidirectional feature extraction operator. By avoiding information-lossy pooling within the branches and deferring global average pooling (GAP) until the terminal classification tier, SDGR-Net ensures that subtle pre-failure pulses are preserved with high fidelity throughout the temporal and variable encoding phases. This design effectively resolves the trade-off between detection sensitivity and operational robustness, providing a physically plausible and computationally efficient solution for proactive HDD health management.

3.3. Cross-Branch Dynamic Gated Residual Fusion Module

To address the inherent limitations of conventional static feature fusion methods, which lack the ability to adaptively modulate feature contributions in response to evolving disk operating states, SDGR-Net introduces a cross-branch dynamic gated residual fusion module (as illustrated in Figure 3). Traditional fusion strategies, such as static concatenation or fixed-weight summation, are insufficient for discriminating the relative importance of longitudinal temporal trends and transverse variable-wise correlations, particularly under high-temperature or incipient-failure conditions. Moreover, naive gating mechanisms that neglect cross-branch interactions are prone to feature distortion caused by heterogeneous data distributions, while the absence of residual connections can lead to gradient attenuation and unstable optimization. In contrast, the proposed fusion module integrates cross-branch adaptive gating with residual learning, enabling fine-grained feature modulation, stable integration, and uninterrupted gradient propagation.

The fusion module consists of a dynamic gating unit and a residual shortcut. It takes the longitudinal temporal branch output $F_{t} \in \mathbb{R}^{B \times H}$ and the transverse variable branch output $F_{v} \in \mathbb{R}^{B \times H}$ as inputs. Acting as a differentiable information bottleneck, the dynamic gating unit computes an adaptive weight vector $G \in \mathbb{R}^{B \times H}$ via a linear transformation followed by a Sigmoid activation, allowing feature-wise modulation conditioned on the current disk health state:

$$G = \sigma(W_{g} F_{v} + b_{g}), \qquad (3)$$

where $W_{g} \in \mathbb{R}^{H \times H}$ and $b_{g} \in \mathbb{R}^{H}$ are learnable parameters. The final fused representation $F_{fusion}$ is obtained by integrating the gated transverse branch with the longitudinal residual branch:

$$F_{fusion} = G \odot F_{v} + (1 - G) \odot F_{t}, \qquad (4)$$

where ⊙ denotes the element-wise product. Within this framework, the residual component preserves the dominant temporal degradation manifold, effectively mitigating gradient vanishing and ensuring stable training dynamics. Empirically, observed gradient norms are consistently maintained within the optimal range of 0.8–1.2.
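Equations (3) and (4) can be sketched for a single sample as follows. This is a minimal illustration that assumes both branch outputs are already H-dimensional vectors; the function name `gated_fusion` and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(f_t, f_v, w_g, b_g):
    """Compute the gate of Eq. (3) and the fused vector of Eq. (4) for one sample.

    f_t, f_v: temporal and variable branch features, each of length H.
    w_g: H x H gate weights; b_g: gate bias of length H.
    """
    gate = [sigmoid(sum(w * v for w, v in zip(row, f_v)) + b)
            for row, b in zip(w_g, b_g)]
    fused = [g * v + (1.0 - g) * t for g, v, t in zip(gate, f_v, f_t)]
    return gate, fused

f_t = [1.0, 2.0]
f_v = [3.0, 4.0]

# With zero weights and bias the gate is 0.5 everywhere: an even blend.
gate, fused = gated_fusion(f_t, f_v, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
print(gate, fused)

# A large positive bias drives the gate toward 1, favoring the variable branch.
gate_open, fused_open = gated_fusion(f_t, f_v, [[0.0, 0.0], [0.0, 0.0]], [20.0, 20.0])
print(fused_open)
```

Each fused component is a convex combination of the two branch features, so the gate can shift weight toward the variable branch without ever discarding the temporal path entirely.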

This fusion mechanism offers several theoretical and practical advantages. First, the adaptive gating selectively amplifies failure-critical SMART attributes, e.g., temperature, reallocated sector count, and ECC error rate, under abnormal operating conditions, while suppressing redundant stochastic noise during normal operation, thereby substantially improving the signal-to-noise ratio (SNR) for health state classification. Second, the residual connection guarantees the uninterrupted propagation of macroscopic temporal trends, promoting robust generalization across heterogeneous HDD models and diverse operating environments. Third, the fusion layer is computationally parsimonious, with a time complexity of $O(BH)$ and negligible per-sample inference overhead (≤0.002 ms), making it well suited for large-scale, real-time monitoring deployments. Finally, the learned gating weights provide a degree of structural interpretability: under elevated thermal or mechanical stress, failure-relevant SMART attributes consistently receive higher activation weights, demonstrating the physical plausibility and engineering reliability of the proposed fusion strategy. By adaptively balancing longitudinal trend preservation and transverse anomaly amplification, this module delivers a precise, interpretable, and deployment-ready solution for proactive HDD health management.

4. Experiment Settings

4.1. Datasets

The experimental datasets used in this study are obtained from Backblaze, a widely recognized provider of online backup and cloud storage services. Backblaze routinely collects and publicly releases daily snapshots of hard disk drive (HDD) operational data, including Self-Monitoring, Analysis, and Reporting Technology (SMART) attributes and drive status information, spanning from 2013 to the present. To date, these datasets comprise operational records covering more than 2 PB of hosted HDD storage, forming a large-scale and realistic benchmark for HDD reliability analysis. Each quarterly release is distributed in comma-separated values (CSV) format and contains 255 variables, including raw and normalized SMART attributes as well as metadata such as collection date, drive serial number, model identifier, storage capacity, and failure status.

In this study, we select datasets corresponding to three HDD models: ST4000DM000, MG07ACA14TA, and HUH721212ALN604. The ST4000DM000 dataset includes 17,899 drives with a capacity of 4 TB, representing a medium-scale dataset. The MG07ACA14TA dataset comprises 38,101 drives with a capacity of 14 TB, constituting a large-scale dataset. The HUH721212ALN604 dataset contains 90 drives with a capacity of 12 TB, representing a small-scale dataset. These models are deliberately chosen to facilitate comparative analyses across heterogeneous device types and storage scales.

4.2. Implementation Details

All experiments are conducted under a unified software and hardware environment to ensure training stability and reproducibility. The proposed model is implemented in Python 3.9 using PyTorch 1.13, and all experiments are executed on an NVIDIA RTX 3090 GPU. This consistent experimental setup provides a fair and reliable basis for performance evaluation and ablation studies.

For model training and evaluation, we implemented a stratified random splitting strategy at the sequence instance level. Each dataset was partitioned into a training set (70%) and a testing set (30%). Given the extreme class imbalance in HDD failure data, the split was stratified relative to the health state labels. This methodology ensures that the rare minority failure stages are proportionally represented in both the training and testing sets, providing a statistically significant benchmark for calculating sensitivity metrics such as Recall.

By performing the split at the instance level rather than at the individual time-step level, the framework ensures that the entire temporal window for a given HDD is assigned to only one subset. This approach effectively prevents data leakage, as the model cannot access “future” or “neighboring” observations of a test instance during its training phase. This stratified partitioning facilitates a rigorous evaluation of SDGR-Net’s generalization capability across the full spectrum of HDD health degradation.
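The stratified 70/30 split over sequence instances described above can be sketched as follows. This is a stdlib-only illustration; the function name `stratified_split`, the seed, and the toy label counts are assumptions, and the authors' actual splitting code is not given in the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) over sequence instances, stratified by label."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        # Each whole instance (one temporal window) goes to exactly one subset
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx


# Toy, heavily imbalanced label vector (hypothetical counts)
labels = ["Good"] * 100 + ["Fair"] * 10 + ["Warning"] * 10 + ["Alert"] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the proportion of each health state approximately equal in both subsets, so the rare failure stages are present in the test set.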

During training, a unified optimization configuration is adopted to enhance convergence stability and generalization performance. The training objective is defined using the cross-entropy loss function, which is standard for multi-class classification tasks. To further mitigate overfitting in HDD health state prediction and improve robustness to noisy or ambiguous samples, label smoothing is applied with a smoothing factor of 0.05.
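For illustration, cross-entropy with label smoothing can be computed for a single sample as sketched below. The smoothed target used here, (1 − ε) on the true class plus ε/K spread uniformly over all K classes, is one common convention (it is the one used by the `label_smoothing` argument of PyTorch's `torch.nn.CrossEntropyLoss`); the text does not state which variant the authors used, and the function name is hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.05):
    """Cross-entropy of one sample against a label-smoothed target distribution."""
    k = len(logits)
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]

    loss = 0.0
    for c in range(k):
        # Smoothed target: eps / K everywhere, plus (1 - eps) on the true class
        q = eps / k + ((1.0 - eps) if c == target else 0.0)
        loss -= q * log_probs[c]
    return loss


# Four health states; a confident, correct prediction for class index 3
confident = smoothed_cross_entropy([0.0, 0.0, 0.0, 5.0], target=3)
unsmoothed = smoothed_cross_entropy([0.0, 0.0, 0.0, 5.0], target=3, eps=0.0)
```

With smoothing enabled, a very confident prediction incurs a slightly larger loss than without it, which discourages the network from driving logits to extremes on noisy or ambiguous samples.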

The Adam optimizer is employed due to its effectiveness in handling non-stationary gradients and adaptively adjusting learning rates. The initial learning rate is set to

1 \times 10^{- 4}

during the primary training stage. Additionally, weight decay regularization is applied to constrain parameter magnitudes and reduce model complexity, thereby improving generalization. The weight decay coefficient is also set to

1 \times 10^{- 4}

, which helps prevent excessive parameter growth and training instability in later stages. For learning rate scheduling, an exponential decay strategy is adopted via the LambdaLR scheduler. Specifically, the learning rate at the t-th training epoch is updated according to

η_{t} = η_{0} \cdot γ^{t},

(5)

where

η_{0}

denotes the initial learning rate,

γ

represents the decay factor, which is set to 0.98, and t indicates the current epoch index. This strategy enables relatively large parameter updates in the early training phase to accelerate convergence, while gradually reducing the learning rate in later stages to facilitate fine-grained parameter optimization and stable model convergence.
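The schedule in Eq. (5) can be checked numerically with a short sketch. The helper name `exponential_lr` is hypothetical; in PyTorch, `torch.optim.lr_scheduler.LambdaLR` multiplies the initial learning rate by the value returned from its `lr_lambda` callable, so a callable such as `lambda t: 0.98 ** t` would realize the same rule.

```python
def exponential_lr(eta0, gamma, epoch):
    """Learning rate at a given epoch under Eq. (5): eta_t = eta_0 * gamma ** t."""
    return eta0 * gamma ** epoch


# Initial rate 1e-4 and decay factor 0.98, as stated in the text
schedule = [exponential_lr(1e-4, 0.98, t) for t in range(0, 101, 25)]
for t, lr in zip(range(0, 101, 25), schedule):
    print(f"epoch {t:3d}: lr = {lr:.3e}")
```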

4.3. Data Preprocessing and Health State Annotation

4.3.1. Feature Selection

To resolve the high dimensionality and redundancy inherent in raw telemetry, we performed a data-driven feature selection using Information Entropy (IE) evaluation. The choice of IE as a selection criterion is fundamentally motivated by its ability to characterize the stochastic transition of disk health from operational order to terminal disorder. Physically, a healthy HDD operates within a stable statistical equilibrium; however, the onset of failure introduces non-stationary perturbations that increase the system’s informational complexity. Unlike linear correlation metrics, IE quantifies the informational gain by measuring the reduction in uncertainty regarding the hidden health states when observing specific SMART fluctuations. This allows the model to identify attributes that provide the most significant discriminative evidence of failure proximity, even when the degradation signals are highly non-linear or sporadic.

To ensure the model’s generalization across different disk manufacturers, we ranked the attributes based on their Total Information Entropy Score aggregated across the three datasets. As demonstrated in Table 2, we selected the top 9 attributes, encompassing critical dimensions of disk health: surface stability (ID 5, 197), read/write quality (ID 1), usage intensity (ID 9, 12, 192, 193), and environmental stress (ID 194). By focusing on these high-entropy attributes, the framework effectively filters out stochastic sensor noise and isolates the most salient indicators of incipient failure, providing a semantically clarified input for the subsequent decoupled branches.
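The exact scoring formula behind the Total Information Entropy Score is not spelled out in the text. As one plausible reading of "reduction in uncertainty regarding the hidden health states", the sketch below computes the information gain H(Y) − H(Y | X) of a discretized attribute X with respect to the health labels Y. The function names, the binning of the attribute, and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature_bins, labels):
    """H(labels) - H(labels | feature_bins) for an already-discretized attribute."""
    groups = defaultdict(list)
    for b, y in zip(feature_bins, labels):
        groups[b].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example (hypothetical): one informative and one uninformative attribute
labels = ["Good", "Good", "Alert", "Alert"]
informative = [0, 0, 1, 1]      # bins line up with the labels
uninformative = [0, 0, 0, 0]    # constant attribute
scores = {
    "informative": information_gain(informative, labels),
    "uninformative": information_gain(uninformative, labels),
}
```

Ranking attributes by such a score and summing it across datasets would yield an aggregate ranking of the kind described, although the authors' aggregation may differ.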

4.3.2. Data Normalization and Numerical Stability

Subsequent to this informational refinement, we address the numerical disparity among the selected attributes. Although the 9 high-entropy features are physically significant, they exhibit heterogeneous scales—for instance, ID 9 (Power-On Hours) may span thousands of units, whereas ID 5 (Reallocated Sectors) typically fluctuates within a single-digit range. To prevent features with larger magnitudes from dominating the gradient updates and to ensure optimal numerical conditioning, all selected SMART attributes are linearly normalized to the range

[- 1, 1]

using the following formula:

x_{norm} = 2 \cdot \frac{x - x_{min}}{x_{max} - x_{min}} - 1,

(6)

where

x_{min}

and

x_{max}

denote the minimum and maximum values of each attribute within the dataset. This normalization ensures consistent scales across the feature space, facilitating stable convergence within the hierarchical spatiotemporal decoupled branches of SDGR-Net. By standardizing the input manifold, the framework further enhances its capacity to identify subtle, non-stationary pre-failure signals during orthogonal feature extraction, ensuring that the informational purity captured by the entropy evaluation is effectively utilized for robust health state classification.
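Eq. (6) can be sketched per attribute as follows. The function name `minmax_scale` and the handling of a constant attribute (mapped to 0, since Eq. (6) is undefined when the maximum equals the minimum) are assumptions not covered by the text.

```python
def minmax_scale(values, x_min=None, x_max=None):
    """Linearly map values to [-1, 1] following Eq. (6).

    x_min and x_max default to the extremes of `values`; passing them explicitly
    allows statistics from one data subset to be reused on another.
    """
    x_min = min(values) if x_min is None else x_min
    x_max = max(values) if x_max is None else x_max
    span = x_max - x_min
    if span == 0:
        # Constant attribute: Eq. (6) is undefined, so map to 0 (an assumption)
        return [0.0 for _ in values]
    return [2.0 * (x - x_min) / span - 1.0 for x in values]


# Attributes on very different scales end up in the same range
power_on_hours = minmax_scale([0, 5000, 10000])
reallocated = minmax_scale([0, 1, 2, 4])
```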

In SDGR-Net, HDD health monitoring is formulated as a discrete-time health state classification problem, in which health labels serve as coarse but robust proxies for remaining useful life (RUL). This discrete formulation effectively bridges raw degradation measurements and practical maintenance decision-making. Based on domain knowledge, HDD health is categorized into four stages: Good, Fair, Warning, and Alert. Specifically, samples collected from healthy HDDs or samples 31–60 days prior to failure are labeled as Good. Subsequent stages are defined by their proximity to the failure origin: 21–30 days (Fair), 11–20 days (Warning), and 0–10 days (Alert), as detailed in Table 3.
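The day-based annotation rule can be expressed as a small lookup. The function name `health_state` and the convention of passing `None` for drives that never failed are assumptions; the text also does not say how samples more than 60 days before failure are labeled, so this sketch treats everything beyond 30 days as Good.

```python
def health_state(days_to_failure):
    """Map days remaining before failure to one of the four health states.

    days_to_failure is None for a drive with no recorded failure; otherwise it
    is assumed to be a non-negative integer.
    """
    if days_to_failure is None or days_to_failure > 30:
        return "Good"      # healthy drives and samples 31+ days before failure
    if days_to_failure > 20:
        return "Fair"      # 21-30 days before failure
    if days_to_failure > 10:
        return "Warning"   # 11-20 days before failure
    return "Alert"         # 0-10 days before failure


boundaries = {d: health_state(d) for d in (0, 10, 11, 20, 21, 30, 31, 60)}
```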

This partitioning strategy is justified by degradation physics, statistical properties, and operational requirements. Primarily, statistical analysis of SMART telemetry indicates that most critical indicators remain in a stable steady-state until approximately 30 days prior to failure, after which they exhibit exponential or abrupt deviations. This observation remains consistent even for long-life disks; regardless of the total operational duration, observable SMART degradation signatures typically remain latent for the majority of the lifespan and only manifest saliently as the disk approaches the failure point.

Consequently, grouping the 31–60 day range into the Good class prevents the model from attempting to learn failure precursors from statistically insignificant fluctuations, thereby reducing semantic noise and preventing overfitting during training. From an operational perspective, a 30-day early warning window provides a sufficient golden period for hardware replacement and data migration in large-scale data centers. Introducing a finer-grained state in the 31–60 day range would offer diminishing returns for proactive maintenance while potentially increasing the risk of false positives. Finally, by consolidating early-stage samples into a single stable class, the framework maximizes the discriminative boundary between the healthy manifold and the acceleration phase, enhancing class saliency for robust risk identification.

A key challenge in HDD failure prediction is the severe class imbalance, where failure-related samples are vastly outnumbered by instances from the Good class. To alleviate classification bias and ensure sufficient supervision for rare failure transitions, we adopt a Random Oversampling by Replication (ROR) strategy. Specifically, the sample size of the majority class is used as the target cardinality, and minority classes are randomly resampled with replacement until their sizes are matched.
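A minimal sketch of Random Oversampling by Replication is given below: every minority class is padded with randomly drawn duplicates of its own samples until it matches the majority class size. The function name, the seed, and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Replicate minority-class samples (with replacement) up to the majority size."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = max(len(xs) for xs in by_class.values())  # majority-class cardinality
    rng = random.Random(seed)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep every original sample, then add random duplicates from the same class
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (hypothetical): sample ids 0-7 are Good, 8-11 are Alert
samples = list(range(12))
labels = ["Good"] * 8 + ["Alert"] * 4
bal_x, bal_y = random_oversample(samples, labels)
```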

This two-pronged strategy, data-level class balancing via ROR and feature-level saliency enhancement through dynamic gated residual fusion, forms a synergistic mechanism. It prevents gradient updates from being dominated by healthy samples during backpropagation, enabling SDGR-Net to remain highly sensitive to incipient failure precursors even in large-scale storage deployments.

4.4. Baselines

To comprehensively evaluate the effectiveness of the proposed SDGR-Net, several representative deep learning-based methods for hard disk drive (HDD) failure prediction and remaining useful life (RUL) estimation are selected as baselines. These models cover recurrent neural networks, generative data augmentation frameworks, attention-enhanced architectures, and transformer-based approaches, providing a diverse comparison set.

Bidirectional LSTM (BiLSTM) [33] is a classical recurrent neural network architecture that processes temporal sequences in both forward and backward directions. By jointly exploiting past and future contextual information, BiLSTM has been widely adopted in time-series modeling tasks, including HDD health prediction, to capture long-term temporal dependencies in SMART attribute sequences.

TimeGAN-GRU [34] is a disk failure prediction framework that integrates an Attribute-Gated Recurrent Unit (GRU) network with a TimeGAN adversarial network. The GRU network captures long-term temporal dependencies in disk monitoring data sequences, while the TimeGAN component mitigates class imbalance through adversarial generation of realistic synthetic samples.

TimeGAN-LSTM [34] is the counterpart of TimeGAN-GRU in which a Long Short-Term Memory (LSTM) network, rather than a GRU network, is integrated with the TimeGAN adversarial network.

LSTM-Attention [35] integrates an attention mechanism into a standard LSTM architecture to dynamically assign different importance weights to historical time steps. This design allows the model to focus on critical temporal segments that are more relevant to HDD degradation and failure evolution, improving sensitivity to key failure precursors.

Attention-Based Bidirectional LSTM (Attention-Based BiLSTM) [36] further extends BiLSTM by incorporating an attention mechanism over bidirectional hidden states. This architecture enhances the model’s ability to selectively emphasize informative temporal features while suppressing redundant or noisy information, making it suitable for complex multivariate SMART time-series analysis.

DiskTransformer [37] is a transformer-based framework specifically designed for HDD failure prediction. By leveraging self-attention mechanisms, DiskTransformer captures long-range temporal dependencies and global correlations among SMART attributes without relying on recurrent structures. This model represents a state-of-the-art non-recurrent approach for disk health modeling and serves as a strong baseline for comparison.

4.5. Evaluation Criteria

Given that hard disk drive (HDD) health degradation is characterized by long-term progressive deterioration intertwined with short-term burst anomalies, and that multiple SMART attributes exhibit strong temporal and cross-variable dependencies, conventional binary evaluation metrics are insufficient for assessing multi-state health prediction performance. Therefore, we adopt a set of multi-class evaluation criteria, including Accuracy (ACC), False Alarm Rate (FAR), Recall, Precision, and F1 score, to comprehensively evaluate the proposed SDGR-Net from the perspectives of overall accuracy, early failure sensitivity, and false alarm robustness. Each metric is introduced below.

Accuracy (ACC) measures the proportion of correctly classified samples across all health states and serves as a fundamental indicator of global prediction performance. In this work, HDD health conditions are categorized into four discrete states: Good, Fair, Warning, and Alert. For multi-class classification, accuracy is defined as

ACC = \frac{1}{N} \sum_{i = 1}^{N} I ({\hat{y}}_{i} = y_{i}),

(7)

where N denotes the total number of samples in the testing set,

y_{i}

and

{\hat{y}}_{i}

represent the ground-truth and predicted labels of the i-th sample, respectively, and

I (\cdot)

is the indicator function. ACC primarily reflects the effectiveness of the proposed spatiotemporally decoupled dual-branch architecture in jointly modeling temporal degradation patterns and cross-SMART attribute correlations without feature interference.

False Alarm Rate (FAR) quantifies the tendency of the model to incorrectly classify non-critical or non-failure states as failure-related states, which is particularly important for large-scale data center deployment where excessive false alarms may lead to unnecessary maintenance costs. FAR is defined as

FAR = \frac{1}{K} \sum_{k = 1}^{K} \frac{F P_{k}}{F P_{k} + T N_{k}},

(8)

where K is the number of health states, and

F P_{k}

and

T N_{k}

denote the false positives and true negatives for the k-th class, respectively. A low FAR indicates that the model can suppress spurious alarms caused by transient fluctuations in SMART attributes. This metric is particularly influenced by the proposed dynamic gated residual fusion module, which adaptively regulates the contribution of variable-branch features and preserves stable temporal representations.

Recall evaluates the proportion of actual instances of each health state that are correctly identified, reflecting the model’s sensitivity to failure-related states:

Recall = \frac{1}{K} \sum_{k = 1}^{K} \frac{T P_{k}}{T P_{k} + F N_{k}},

(9)

where

T P_{k}

and

F N_{k}

represent the true positives and false negatives of the k-th class, respectively. Recall is closely related to the model’s sensitivity to early failure precursors. The proposed lightweight dual-view bidirectional temporal extraction mechanism, which concatenates forward and reversed temporal representations instead of simple averaging, plays a key role in improving Recall by preserving short-term anomaly details preceding disk failures.

Precision measures the proportion of correctly predicted failure-related samples among all predicted failure samples and is defined as

Precision = \frac{1}{K} \sum_{k = 1}^{K} \frac{T P_{k}}{T P_{k} + F P_{k}} .

(10)

High precision indicates that the model’s failure predictions are reliable and not dominated by false positives, which is essential for practical HDD health monitoring systems.

F1 score provides a balanced evaluation by jointly considering Precision and Recall, and is computed as

F 1 = \frac{2 \times Precision \times Recall}{Precision + Recall} .

(11)

The F1 score comprehensively reflects the trade-off between early failure detection sensitivity and false alarm suppression. Improvements in F1 directly demonstrate the effectiveness of the proposed architecture in capturing both long-term degradation trends and short-term abnormal behaviors across heterogeneous HDD models.

Collectively, these evaluation metrics enable a thorough assessment of the proposed model’s classification accuracy, early warning capability, and robustness against false alarms, providing strong quantitative evidence for its suitability in multi-state HDD health prediction tasks.
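The macro-averaged metrics in Eqs. (7)–(11) can all be derived from a single K × K confusion matrix, as sketched below. The function name `macro_metrics` and the toy matrix are assumptions; the sketch also assumes that every class appears at least once among both the true and the predicted labels, so that no denominator is zero.

```python
def macro_metrics(cm):
    """Compute ACC, FAR, Recall, Precision and F1 from a confusion matrix.

    cm[i][j] is the number of samples whose true class is i and predicted class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[i][i] for i in range(k)) / total            # Eq. (7)

    far = rec = prec = 0.0
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                                 # true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(k)) - tp            # predicted c, true otherwise
        tn = total - tp - fn - fp
        far += fp / (fp + tn)                                # Eq. (8), per-class term
        rec += tp / (tp + fn)                                # Eq. (9), per-class term
        prec += tp / (tp + fp)                               # Eq. (10), per-class term
    far, rec, prec = far / k, rec / k, prec / k
    f1 = 2 * prec * rec / (prec + rec)                       # Eq. (11)
    return {"ACC": acc, "FAR": far, "Recall": rec, "Precision": prec, "F1": f1}


# Toy two-class matrix (hypothetical counts)
toy = macro_metrics([[8, 2],
                     [1, 9]])
```

Note that, following Eq. (11), F1 is computed here from the macro-averaged Precision and Recall rather than by averaging per-class F1 values; the two conventions generally give different numbers.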

5. Results and Analysis

5.1. Performance Comparison

To rigorously evaluate the predictive effectiveness and architectural robustness of SDGR-Net, we conducted a comprehensive comparative study against six state-of-the-art baselines, encompassing recurrent benchmarks (vanilla BiLSTM [33]), generative-augmented frameworks (TimeGAN-LSTM/GRU [34]), and attention-based architectures (LSTM-Attention [35], Attention-Based BiLSTM [36], and DiskTransformer [37]). Performance was systematically assessed on three heterogeneous HDD datasets: ST4000DM000, HUH721212ALN604, and MG07ACA14TA. To ensure the reliability of the empirical findings and account for stochastic variations during training, all experiments for SDGR-Net were executed for 10 independent trials with different random seeds. The final results are reported as Mean ± Standard Deviation in Table 4, providing a quantitative basis for statistical significance analysis.

As summarized in Table 4, SDGR-Net achieves a peak mean recall of

0.9898 \pm 0.0011

on the HUH721212ALN604 dataset, substantially outperforming both generative and attention-based competitors. From a reliability engineering perspective, this exceptional sensitivity, combined with a remarkably low variance (0.0011), reflects the high-fidelity temporal feature preservation enabled by our lightweight dual-view extraction mechanism. While baseline models—particularly those utilizing standard pooling—effectively act as low-pass filters that smooth incipient failure precursors, the dual-view concatenation in SDGR-Net preserves the longitudinal signal integrity. The stability of these results across multiple runs suggests that the failure-origin anchoring strategy significantly stabilizes the optimization landscape, preventing the model from oscillating during the capture of transient pre-failure pulses.

This advantage is further corroborated by the consistently low FAR attained by SDGR-Net, which reaches

0.0220 \pm 0.0014

on the large-scale MG07ACA14TA dataset. Quantitatively, the margin of improvement over the best-performing baseline (a reduction of 0.0377 compared to 0.0597) is approximately 27 times the observed standard deviation, confirming that the superiority of SDGR-Net is statistically significant and not a result of random fluctuation. The cross-branch dynamic gated residual fusion module in SDGR-Net addresses the issue of semantic entanglement by acting as a differentiable information bottleneck. By adaptively down-weighting the transverse variable branch during stable operating phases, the model prevents non-failure-related noise from contaminating the latent health representation. Unlike GAN-based approaches, which may introduce synthetic bias and exhibit unstable FAR, the narrow confidence intervals of SDGR-Net ensure that only failure-critical attribute co-evolutions influence the final prediction, providing a reliable foundation for large-scale proactive maintenance.

On the ST4000DM000 dataset, characterized by pronounced inter-class ambiguity and noisy trajectories, SDGR-Net achieves a mean global accuracy of

0.9007 \pm 0.0028

, outperforming DiskTransformer by a substantial margin. This performance gap provides empirical evidence of the resolved representation interference. In shared latent manifolds, strong gradients from dominant longitudinal trends often overshadow weaker transverse signals. By contrast, the spatiotemporally decoupled design of SDGR-Net enables orthogonal gradient flows through parameter-isolated branches. The resulting statistical robustness confirms that the performance superiority of our framework is a direct consequence of its semantically clarified design, ensuring physically plausible health state classification that generalizes well across heterogeneous HDD populations.

5.2. Confusion Matrix Analysis

To further evaluate the granular discriminative capability and decision-making robustness of SDGR-Net across varying health stages, we analyze the confusion matrices for the HUH721212ALN604, ST4000DM000, and MG07ACA14TA datasets, as illustrated in Figure 4. The rows represent the ground-truth health states (Good, Fair, Warning, and Alert), while the columns signify the predicted labels.

In all three datasets, the main diagonal elements exhibit overwhelming dominance, confirming that SDGR-Net effectively maps heterogeneous SMART telemetry onto semantically distinct health manifolds. For instance, on the HUH721212ALN604 dataset, the model correctly identifies 31,824 out of 32,316 Alert instances and 38,231 out of 38,712 Good instances. This high per-class recall validates the efficacy of the spatiotemporally decoupled dual-branch architecture in extracting orthogonal features, ensuring that longitudinal degradation trends and transverse variable correlations do not interfere with the classification boundaries.

In industrial HDD maintenance, misclassifying an Alert state as Good is the most detrimental failure, as it results in missed detections of imminent crashes. The matrices show that SDGR-Net maintains an exceptionally high safety margin; in the large-scale MG07ACA14TA dataset, only 1.08% of Alert samples leaked into the Good category. This robustness is directly attributable to the parsimonious dual-view temporal extraction mechanism, which utilizes failure-anchored reversed views to amplify the gradients of pre-failure pulses, preventing them from being diluted by stochastic noise.

The majority of classification errors are concentrated in adjacent cells. This is particularly evident in the noisy ST4000DM000 dataset, where 1451 Fair instances were predicted as Good. From the perspective of reliability engineering, this reflects the non-linear and continuous nature of HDD wear-and-tear, where health state transitions exhibit inherent statistical overlap. The fact that errors are confined to neighboring stages rather than across tiers confirms that the internal representations of SDGR-Net correctly preserve the ordinal progression of disk degradation.

The low confusion rate between the Good class and failure-related states across all datasets underscores the effectiveness of the dynamic gated residual fusion module. By acting as a differentiable informational bottleneck, the module adaptively suppresses transient SMART fluctuations during stable operation, thereby minimizing the False Alarm Rate (FAR) and reducing unnecessary maintenance costs in large-scale data center deployments.

In summary, the confusion matrix analysis demonstrates that SDGR-Net achieves an optimal balance between sensitivity to incipient failure precursors and specificity to stable health states, substantiating its readiness for proactive disk reliability management.

5.3. Ablation Study

5.3.1. Impact of Class Imbalance Mitigation Strategies

To clarify whether the performance gains of SDGR-Net originate from its architectural innovations or the data-level balancing strategy, we conducted a comparative analysis between the Random Oversampling by Replication (ROR) used in our framework and two widely adopted loss-level mitigation techniques: Weighted Cross-Entropy (WCE) and Focal Loss. WCE assigns penalty weights inversely proportional to class frequencies to counteract majority-class bias, while Focal Loss introduces a modulating factor to focus training on hard-to-classify samples. These experiments were executed on all three datasets using the proposed SDGR-Net backbone to isolate the effects of sampling methodology versus cost-sensitive learning.
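For reference, the two loss-level alternatives can be sketched for a single sample as follows. The inverse-frequency normalization N / (K · n_c) and the focusing parameter γ = 2 are common defaults chosen here for illustration; the text does not report the exact weighting or γ used in these experiments, and all function names are hypothetical.

```python
import math


def inverse_frequency_weights(class_counts):
    """Class weights inversely proportional to class frequency: N / (K * n_c)."""
    total = sum(class_counts)
    k = len(class_counts)
    return [total / (k * n) for n in class_counts]


def weighted_ce(probs, target, weights):
    """Weighted cross-entropy of one sample given predicted class probabilities."""
    return -weights[target] * math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma to down-weight easy samples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Toy imbalance (hypothetical): 90 Good samples versus 10 Alert samples
weights = inverse_frequency_weights([90, 10])
easy = focal_loss([0.9, 0.1], target=0)   # well-classified sample
hard = focal_loss([0.9, 0.1], target=1)   # misclassified sample
```

Both methods change how much each sample contributes to the gradient without changing the composition of a mini-batch, whereas ROR changes the batch composition and leaves the loss untouched.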

The comparative results summarized in Table 5 demonstrate that the ROR strategy consistently yields the highest Recall and F1-scores across all HDD models. On the large-scale MG07ACA14TA dataset, ROR improves the F1-score by approximately 1.6% over WCE and 0.8% over Focal Loss. From a physical perspective, ROR maintains the inherent distribution of SMART time-series signatures within each mini-batch, providing more stable gradient updates for the bidirectional temporal extraction modules. In contrast, loss-reweighting methods like WCE can sometimes over-amplify stochastic sensor jitter in minority classes, potentially inducing minor fluctuations in the False Alarm Rate (FAR) and compromising the precision of health state transitions.

More importantly, these results provide empirical confirmation that the SDGR-Net architecture remains the primary driver of discriminative power. Even when utilized with computationally “weaker” mitigation strategies like WCE, SDGR-Net still significantly outperforms the baseline models (e.g., BiLSTM or DiskTransformer as reported in Table 4) that utilize the full ROR strategy. For instance, the ACC of SDGR-Net combined with WCE on the ST4000DM000 dataset (0.8842) remains substantially superior to the standard BiLSTM using ROR (0.7913). This significant performance margin demonstrates that our hierarchical spatiotemporal decoupling and dynamic gated fusion provide the fundamental feature sensitivity required for incipient failure detection. Consequently, the ROR strategy acts as a complementary tool that ensures numerical sufficiency, while the architecture provides the semantically clarified health manifolds necessary for robust and physically plausible multi-state prediction.

5.3.2. Efficacy of Spatiotemporally Decoupled Orthogonal Architecture

To rigorously assess the architectural necessity of the spatiotemporally decoupled dual-branch design, we conducted a comprehensive ablation study across three heterogeneous datasets. This study empirically evaluates the theoretical claims of semantic disentanglement and orthogonal encoding by comparing SDGR-Net with four ablated variants: (i) Temporal-only, (ii) Variable-only, (iii) Shared-Params, and (iv) Mixed Input. As reported in Table 6, the consistent performance degradation observed in the Temporal-only and Variable-only variants across all datasets substantiates the premise that HDD health states are encoded in two distinct yet complementary semantic dimensions. Specifically, the longitudinal temporal branch captures the Markovian evolution of disk health—the ‘when’ of degradation—but is insufficient to model the transverse synergistic triggers required to characterize complex co-evolution patterns. Conversely, the Variable-only branch captures the ‘how’ of failure through inter-variable dependencies but lacks the cumulative temporal context necessary for robust prediction. By jointly modeling these dimensions, SDGR-Net achieves informational completeness, mapping HDD health states into a stereoscopic latent space that integrates both global temporal trends and localized inter-variable interactions. This complementarity is evidenced by the substantial F1-score decline (exceeding 3%) on the HUH721212ALN604 dataset when either branch is removed.

Beyond representational completeness, the structural organization of these features is crucial for preserving signal purity. The severe performance degradation of the Mixed Input variant, whose false alarm rate (FAR) on the MG07ACA14TA dataset reaches 0.1510, nearly seven times that of SDGR-Net (0.0220), provides direct empirical evidence of semantic entanglement. As discussed in our methodology, multivariate SMART attributes exhibit heterogeneous statistical characteristics: longitudinal degradation trends are typically low-frequency and smooth, whereas transverse inter-variable anomalies are high-frequency and volatile. When these heterogeneous signals are naively merged into a single input stream, high-frequency stochastic noise propagates across the shared feature space, effectively contaminating the stable longitudinal degradation patterns. In contrast, SDGR-Net functions as a structural firewall, isolating noise within its respective semantic branch and preventing cross-dimensional interference. This design is fundamental to the model’s enhanced reliability and its ability to suppress spurious alarms in large-scale monitoring scenarios.

Moreover, the persistent performance gap between the Shared-Params variant and SDGR-Net, particularly on the noisy ST4000DM000 dataset, further confirms the existence of representation interference in coupled architectures. In shared LSTM blocks, optimization is often dominated by higher-magnitude signals. Because longitudinal temporal trends generally exhibit greater magnitude and lower variance than subtle transverse correlations, gradient updates during backpropagation become biased toward the temporal dimension. This imbalance induces optimization competition, whereby variable-wise dependencies are treated as secondary noise and are effectively suppressed. By enforcing strict parameter isolation, SDGR-Net enables orthogonal gradient propagation, allowing the variable branch to reach its representational capacity without being overwhelmed by the temporal branch.

In summary, the ablation results demonstrate that spatiotemporal decoupling in SDGR-Net is not merely an architectural choice but a deliberate feature purification strategy aligned with the physical characteristics of HDD failure processes. By resolving the trade-off between detection sensitivity and operational specificity through orthogonal encoding and semantic isolation, SDGR-Net mitigates representation interference and noise propagation. Consequently, subsequent bidirectional extraction and gated fusion modules operate on semantically coherent latent representations, establishing SDGR-Net as a robust and reliable framework for proactive HDD health management.

5.3.3. Impact of Bidirectional Dual-View Extraction and Feature Synthesis

To quantitatively assess the contribution of each design component in the lightweight dual-view temporal extraction mechanism, we conducted a systematic ablation study across three heterogeneous HDD datasets, as summarized in Table 7. This study empirically validates the design rationale presented in Section 3.3, with particular emphasis on the necessity of the backward temporal view and the effectiveness of the concatenation-based fusion strategy. As indicated by the performance trends, eliminating the backward temporal view and retaining only a unidirectional forward sequence results in consistent degradation in predictive accuracy and reliability across all datasets. For example, on the MG07ACA14TA dataset, the F1-score decreases from 0.9664 to 0.9248, while the false alarm rate (FAR) increases from 0.0220 to 0.0688. From a reliability engineering perspective, these findings indicate that unidirectional modeling is informationally incomplete: although the forward view captures the longitudinal aging process, it lacks the failure-origin context introduced by the reversed sequence. The backward temporal view effectively anchors the imminent failure as a temporal reference, enabling a ‘future-to-past’ gradient flow that enhances the detection of incipient failure precursors and non-stationary transient patterns that are often underrepresented in forward-only modeling.

The superiority of the parsimonious dual-view strategy in SDGR-Net is further demonstrated through comparison with a conventional BiLSTM implementation. While the BiLSTM partially alleviates the performance loss observed in the single-view setting, it consistently underperforms the SDGR-Net branch, particularly on the noisy ST4000DM000 dataset, where SDGR-Net attains higher overall accuracy (ACC). This performance disparity provides empirical evidence of representation competition and optimization instability inherent in fully bidirectional recurrent architectures when applied to class-imbalanced HDD failure prediction. In standard BiLSTMs, the increased number of gating parameters often leads to over-parameterization, causing gradients associated with dominant temporal trends to overshadow subtle bidirectional cues. In contrast, SDGR-Net processes forward and backward sequences in parameter-isolated branches prior to fusion, thereby enabling orthogonal optimization paths and preserving directional specificity. This design ensures that complementary temporal representations from both views can reach their representational capacity without mutual interference.

Finally, we examined the effectiveness of the feature integration stage by replacing the ‘concatenation followed by linear projection’ with naive mean aggregation. Across all datasets, mean aggregation consistently degrades predictive robustness and detection sensitivity, as evidenced by a pronounced reduction in Recall relative to the full SDGR-Net architecture. This result supports the hypothesis of semantic dilution associated with symmetric fusion strategies. As articulated in the SDGR-Net design philosophy, naive averaging functions as a destructive low-pass filter, attenuating signal amplitude and suppressing the directional asymmetry of pre-failure patterns. In contrast, the concatenation-based fusion in SDGR-Net preserves the distinct characteristics and high-fidelity details of both temporal views, while the subsequent linear projection maintains dimensional consistency and information richness. Collectively, these ablation results demonstrate that the integration of a failure-anchored backward view, parsimonious decoupled modeling, and concatenation-based fusion is essential for resolving the trade-off between detection sensitivity and operational reliability, thereby establishing the lightweight dual-view temporal extraction mechanism as a core component of high-precision HDD health monitoring.
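
The difference between the two integration strategies can be illustrated with a minimal sketch. It assumes each temporal view has already been summarized as a small feature vector; the function names, the toy vectors, and the projection matrix `W` are hypothetical and are not taken from the SDGR-Net implementation. The sketch shows that mean aggregation is symmetric in its inputs (swapping the forward and backward views changes nothing), whereas concatenation followed by a linear projection can retain the directional asymmetry.

```python
def mean_fusion(h_fwd, h_bwd):
    """Symmetric fusion: element-wise average of the two views."""
    return [(a + b) / 2.0 for a, b in zip(h_fwd, h_bwd)]


def concat_projection(h_fwd, h_bwd, weight, bias):
    """Concatenate both views (length 2d) and project back to d dimensions."""
    x = list(h_fwd) + list(h_bwd)
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy view features (illustrative values only).
h_fwd = [1.0, 0.0]
h_bwd = [0.0, 1.0]

# Hypothetical 2x4 projection that computes h_fwd - h_bwd.
W = [[1.0, 0.0, -1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0]]
b = [0.0, 0.0]

mean_ab = mean_fusion(h_fwd, h_bwd)               # [0.5, 0.5]
mean_ba = mean_fusion(h_bwd, h_fwd)               # [0.5, 0.5], view order is lost
proj_ab = concat_projection(h_fwd, h_bwd, W, b)   # [1.0, -1.0]
proj_ba = concat_projection(h_bwd, h_fwd, W, b)   # [-1.0, 1.0], view order is kept
```

Mean aggregation is itself the special case of this projection in which both halves of `W` are fixed to 0.5 times the identity, so a learned projection can always reproduce it but is not restricted to it.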

5.3.4. Impact of Dynamic Gated Fusion

To quantitatively assess the contribution of each functional component in the cross-branch dynamic gated residual fusion module of SDGR-Net, we conducted systematic ablation experiments across three heterogeneous HDD datasets, as summarized in Table 8. This study empirically validates the necessity of both the gated informational bottleneck and the longitudinal residual shortcut for effectively integrating decoupled feature representations. The performance results demonstrate that SDGR-Net consistently outperforms alternative fusion strategies—including fixed-weight fusion, dynamic gating without residual connections, and naive static fusion—across all evaluation metrics. For example, on the HUH721212ALN604 dataset, SDGR-Net increases the F1-score from 0.9578 (static fusion) to 0.9895, while simultaneously reducing the false alarm rate (FAR) to 0.0091. From a representation learning perspective, these findings indicate that simple concatenation or fixed-weight summation fails to capture the heterogeneous contributions of longitudinal temporal trends and transverse variable correlations. Such static strategies implicitly assume semantic homogeneity across branches, leading to representation interference and hindering the isolation of incipient failure precursors from background stochastic noise.

The effectiveness of the dynamic gating mechanism in SDGR-Net is further evidenced by comparison with fixed-weight and static fusion variants. On the large-scale MG07ACA14TA dataset, naive fusion approaches exhibit substantially higher FAR than the gated fusion employed by SDGR-Net. This increase in false alarms provides empirical evidence of semantic entanglement, whereby redundant or non-critical SMART attribute fluctuations propagate into the final decision space. In contrast, the dynamic gating mechanism in SDGR-Net functions as a differentiable informational bottleneck, computing adaptive importance weights that amplify failure-critical attribute interactions while suppressing irrelevant variability. However, the performance degradation observed in the dynamic gating without residual variant, particularly the reductions in ACC and F1-score on the ST4000DM000 dataset, reveals an important limitation: although gating enhances variable selectivity, it can inadvertently disrupt the stable longitudinal degradation trajectory when used in isolation.

The inclusion of the residual connection in SDGR-Net addresses this limitation by preserving the semantic integrity of the primary health representation. By providing an uninterrupted gradient pathway for temporal branch features, the residual shortcut maintains the macroscopic degradation trend, while the gated variable branch contributes localized, synergistic anomaly enhancements. This coordinated interaction improves the signal-to-noise ratio (SNR) and promotes stable gradient propagation, as reflected in consistently superior Precision and Recall across all datasets. Collectively, these ablation results demonstrate that the fusion module in SDGR-Net is not merely a feature aggregation mechanism, but a strategic feature modulation framework aligned with the physical characteristics of HDD failure processes. By adaptively balancing longitudinal trend preservation with transverse anomaly amplification, the gated residual fusion module establishes a robust and physically plausible foundation for multi-state HDD health prediction, achieving high predictive fidelity while minimizing operational overhead.

5.4. Analysis of Training Dynamics and Gradient Stability

To empirically substantiate the optimization benefits of the proposed architecture, we monitor the evolution of the global gradient norm during the training phase. In deep recurrent networks, numerical instability, manifesting as vanishing or exploding gradients, often hinders the model’s ability to learn long-term degradation dependencies within non-stationary SMART telemetry. Figure 5 illustrates the normalized $L_{2}$ gradient norm of SDGR-Net over 200 training epochs, where the raw norm is scaled by the square root of the total number of trainable parameters ($\|g\|_{2} / \sqrt{N}$) to represent the average gradient energy per weight. This metric serves as a critical indicator of the health of the optimization landscape throughout the convergence process.
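
The normalization described above can be sketched as follows. This is a minimal illustration, assuming the gradients are available as flat sequences of floats (one per parameter tensor); the helper name `normalized_grad_norm` is hypothetical and does not come from the authors' code.

```python
import math


def normalized_grad_norm(grads):
    """Return ||g||_2 / sqrt(N) over all gradient entries.

    `grads` is an iterable of flat sequences of floats, one per parameter
    tensor; N is the total number of scalar entries across all tensors.
    """
    sq_sum = 0.0
    n = 0
    for g in grads:
        for v in g:
            sq_sum += v * v
            n += 1
    if n == 0:
        raise ValueError("no gradient entries supplied")
    return math.sqrt(sq_sum) / math.sqrt(n)


# Toy example: every entry has magnitude 1, so the normalized norm is 1.
toy_grads = [[1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
value = normalized_grad_norm(toy_grads)
```

Written this way, the quantity is the root mean square of the individual gradient entries, which is why a value near 1 corresponds to a per-weight gradient of roughly unit magnitude.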

The experimental results reveal that the normalized gradient norm of SDGR-Net remains remarkably stable, strictly oscillating within the optimal range of 0.8–1.2 throughout the optimization process. This concentration of gradient energy near unity indicates that the error signals propagated from the multi-state health labels maintain a consistent magnitude as they flow back through the complex hierarchy of the longitudinal temporal and transverse variable branches. This phenomenon suggests that the integrated architecture effectively acts as a numerical stabilizer, preventing the exponential attenuation of gradients that typically plagues deep LSTM-based models, thereby ensuring that the optimization process remains well-conditioned even in the later stages of convergence.

This observed stability is a direct consequence of the convex fusion logic employed in our gated residual module, defined as $F_{fusion} = G \odot F_{v} + (1 - G) \odot F_{t}$. By utilizing the sigmoid gate $G$ to adaptively balance feature saliency while providing a continuous residual path through $(1 - G)$, SDGR-Net establishes an uninterrupted gradient highway. Even in scenarios where the gating mechanism heavily suppresses redundant transverse noise to minimize false alarms, the longitudinal degradation backbone is preserved via the residual shortcut. This mechanism ensures that the optimization landscape does not become ill-conditioned, allowing the model to reach a more stable local minimum while facilitating faster and smoother convergence across heterogeneous HDD populations.
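
The convex fusion rule can be made concrete with a short element-wise sketch. It assumes the gate pre-activations are supplied directly as `gate_logits`; how SDGR-Net actually computes them from the branch features is not reproduced here, and the function names and toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_t, f_v, gate_logits):
    """Element-wise F = G * F_v + (1 - G) * F_t with G = sigmoid(gate_logits)."""
    if not (len(f_t) == len(f_v) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for t, v, z in zip(f_t, f_v, gate_logits):
        g = sigmoid(z)
        fused.append(g * v + (1.0 - g) * t)
    return fused


# Toy temporal and variable features (illustrative values only).
f_t = [0.2, -1.0, 0.5]
f_v = [1.0, 3.0, -0.5]

closed = gated_fusion(f_t, f_v, [-50.0] * 3)   # gate near 0: temporal path dominates
balanced = gated_fusion(f_t, f_v, [0.0] * 3)   # gate = 0.5: equal mix of both branches
```

Because the gate lies in (0, 1), each fused element is a convex combination of the corresponding temporal and variable features, so it always stays between the two; with a nearly closed gate the output reduces to the temporal features.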

Furthermore, from a signal-processing perspective, this gradient stability is vital for capturing the incipient anomalies and transient pulses that precede mechanical failure. Since these pre-failure signals are often subtle and localized in the high-frequency domain, a stable gradient flow ensures they are not smoothed out or neglected during backpropagation. The consistent maintenance of gradient norms within the 0.8–1.2 range confirms that SDGR-Net preserves the sensitivity required to identify high-order nonlinear correlations, which directly contributes to the superior Recall and F1-scores observed in our benchmarks. Ultimately, these training dynamics provide a solid empirical foundation for the model’s predictive robustness and its readiness for large-scale proactive maintenance deployments.

5.5. Interpretability Analysis via Gating Weight Distribution

To empirically validate the structural interpretability and physical plausibility of SDGR-Net, we analyze the learned gating distributions $G$ across the four health states (Good, Fair, Warning, Alert) for the HUH721212ALN604, ST4000DM000, and MG07ACA14TA datasets. The gating weights, ranging from 0 to 1, dictate the informational flow from the transverse variable branch into the final health manifold. The average gating distributions for the nine selected SMART attributes are summarized in Figure 6.

The quantitative results reveal a consistent and physically sound trend across all datasets. In the MG07ACA14TA dataset, which contains the largest population, we observe a distinct informational bottleneck effect. During the Fair and Warning stages, the gating weights remain relatively low (averaging approximately 0.50–0.52), indicating that the model primarily preserves the longitudinal temporal backbone via the residual path while suppressing potential stochastic noise in the variable branch. However, as the disk enters the Alert state, the average gating weights surge to the 0.61–0.63 range. This adaptive amplification signifies that the model identifies a higher necessity for cross-variable synergistic patterns to confirm imminent failure transitions, aligning with the reliability engineering principle that failure precursors manifest as synchronized deviations across multiple sensors.

A similar dynamic shift is observed in the HUH721212ALN604 dataset. While the weights for the Good and Warning states hover around 0.52–0.55, the Alert state exhibits a sustained increase to approximately 0.59. This consistent re-weighting behavior suggests that SDGR-Net effectively discriminates between steady-state operation and the incipient failure phase. Interestingly, the ST4000DM000 dataset exhibits higher baseline weights (0.57–0.61) across all health categories. This phenomenon can be attributed to the high-frequency noise and inter-class ambiguity inherent in this specific HDD model, necessitating a continuous integration of both temporal and variable features to maintain discriminative power.

Collectively, the visualization of the gating distributions confirms that the cross-branch dynamic gated residual fusion module is not a static aggregator but a strategic feature modulator. By adaptively re-weighting heterogeneous feature branches based on the perceived risk level, SDGR-Net ensures that its decision-making process is semantically clarified and physically grounded. This evidence directly supports our interpretability claim, demonstrating that the model has internalized the complex non-linear degradation logic required for proactive HDD health monitoring.
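
The per-state averaging behind this analysis can be sketched in a few lines. The snippet assumes that gate activations have been logged together with the health-state label of each sample; the record layout, the helper name `mean_gate_by_state`, and the numeric values are hypothetical toy inputs, not measurements from the paper.

```python
from collections import defaultdict


def mean_gate_by_state(records):
    """Average gate activation per health state.

    `records` is an iterable of (state, gate_values) pairs, where
    gate_values is a sequence of floats in [0, 1].
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, gates in records:
        totals[state] += sum(gates)
        counts[state] += len(gates)
    return {state: totals[state] / counts[state] for state in totals}


# Toy records (illustrative values only, not the paper's data).
records = [
    ("Fair", [0.4, 0.6]),
    ("Alert", [0.7, 0.9]),
    ("Alert", [0.8, 0.8]),
]
summary = mean_gate_by_state(records)
```

Comparing the resulting means across states is the kind of summary that Figure 6 reports for the learned gates.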

5.6. Cross-Model Generalization and Transfer Robustness Analysis

To ascertain whether SDGR-Net internalizes universal physical degradation laws or merely overfits specific data distributions, we conducted a zero-shot cross-model generalization evaluation. Although the three heterogeneous datasets (ST4000DM000, MG07ACA14TA, and HUH721212ALN604) share an identical set of 9 SMART attributes as defined in Section 4.1, they exhibit significantly different statistical profiles due to variations in model-specific firmware logic, sensor calibrations, and mechanical tolerances. In this experiment, the framework was trained on a single source dataset and directly evaluated on the remaining two target datasets without any re-training or fine-tuning. This protocol provides a rigorous assessment of the model’s robustness against domain shift in a shared feature space. The quantitative results are summarized in Table 9.
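
The evaluation protocol can be outlined as a simple loop over source and target datasets. This is a sketch under stated assumptions: `train_fn` and `eval_fn` are hypothetical placeholders for the actual training and scoring routines, and the dummy callables below only record which datasets were paired.

```python
from itertools import permutations

DATASETS = ["ST4000DM000", "MG07ACA14TA", "HUH721212ALN604"]


def zero_shot_pairs(datasets):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(datasets, 2))


def run_zero_shot(datasets, train_fn, eval_fn):
    """Train once per source, then score on every other dataset as-is."""
    results = {}
    for source in datasets:
        model = train_fn(source)  # trained on the source data only
        for target in datasets:
            if target != source:
                # No re-training or fine-tuning on the target.
                results[(source, target)] = eval_fn(model, target)
    return results


# Placeholder callables standing in for real training and evaluation.
def dummy_train(source):
    return {"trained_on": source}


def dummy_eval(model, target):
    return {"source": model["trained_on"], "target": target}


results = run_zero_shot(DATASETS, dummy_train, dummy_eval)
```

With three datasets this yields six source-target combinations, each evaluated without touching the target data during training.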

The empirical findings demonstrate that SDGR-Net retains a high degree of predictive fidelity across different HDD models. When trained on the large-scale MG07ACA14TA dataset, the model maintains robust F1-scores of 0.8136 and 0.8429 on the ST4000DM000 and HUH721212ALN604 targets, respectively. This transfer resilience is particularly noteworthy because, even though the input SMART IDs are consistent, the failure thresholds vary drastically between the ST4000DM000 and MG07ACA14TA profiles. The ability of SDGR-Net to maintain an F1-score above 0.80 in these tasks proves that it has effectively decoupled the universal longitudinal aging logic from dataset-specific noise.

The structural foundation of this generalization capability lies in the hierarchical spatiotemporal decoupling strategy. By isolating the longitudinal temporal backbone, the model focuses on the Markovian progression of mechanical wear, which is a fundamentally invariant physical process across different HDD models. While the cross-variable correlations may exhibit different intensities in a new target domain due to varying sensor sensitivities, the dynamic gated residual fusion module functions as an adaptive filter. It automatically re-weights the attributes to prioritize those cross-variable correlations that remain most salient for failure detection in the target environment. Furthermore, the dual-view temporal extraction provides a standardized failure-origin reference, mitigating the impact of varying degradation speeds. In conclusion, the cross-model evaluation substantiates that SDGR-Net yields physically plausible and highly transferable representations by successfully managing the trade-off between stable temporal dynamics and volatile cross-variable correlations.

5.7. Computational Efficiency and Deployment Feasibility

Beyond predictive accuracy, computational efficiency is paramount for real-world deployment in large-scale data centers. To address this, we conducted a comprehensive runtime benchmark comparing the inference latency, computational complexity (GFLOPs), and memory footprint of SDGR-Net against all baselines. All models were evaluated on an NVIDIA GeForce RTX 3090 under identical conditions. The results are detailed in Table 10.

As shown in Table 10, SDGR-Net achieves a remarkable balance between performance and efficiency. While complex architectures like DiskTransformer offer strong theoretical capabilities, they suffer from prohibitive computational costs (22.43 M parameters, 0.78 GFLOPs) and high latency (5.27 ms/sample). In contrast, SDGR-Net maintains a lean parameter profile of 6.19 M, which is comparable to the lightweight BiLSTM baselines (≈5.3 M) but significantly more efficient than the generative and transformer-based models.

Most critically, SDGR-Net demonstrates superior runtime characteristics essential for deployment. It achieves an inference latency of 1.86 ms per sample, which is nearly 3× faster than DiskTransformer and significantly quicker than the TimeGAN-augmented models (2.85–3.12 ms). Furthermore, our optimized parameter-sharing strategy and efficient stack implementation result in an exceptionally low maximum GPU memory usage of 39.87 MB. This is a drastic reduction compared to the >100 MB memory footprint required by all other LSTM/GRU-based baselines and the heavy DiskTransformer. This efficiency advantage confirms that SDGR-Net is not only accurate but also highly scalable and suitable for deployment on resource-constrained edge devices or high-throughput monitoring systems.
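
The ratios quoted above follow directly from the reported figures, as the short calculation below shows. The per-second throughput is a derived estimate that assumes strictly sequential single-sample inference, which is an assumption made here for illustration and not a measurement from the paper.

```python
# Figures as reported in the text for Table 10.
disk_transformer_ms = 5.27
sdgr_net_ms = 1.86
disk_transformer_params_m = 22.43
sdgr_net_params_m = 6.19

# Latency ratio: about 2.83, consistent with "nearly 3x faster".
speedup = disk_transformer_ms / sdgr_net_ms

# Parameter ratio: DiskTransformer has about 3.6 times as many parameters.
param_ratio = disk_transformer_params_m / sdgr_net_params_m

# Assumed sequential processing: roughly 538 samples per second.
samples_per_second = 1000.0 / sdgr_net_ms
```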

6. Conclusions

In this work, we propose SDGR-Net, a robust architectural framework for multi-state HDD health prediction. SDGR-Net systematically addresses key challenges inherent to multivariate SMART time series analysis, including semantic entanglement, noise sensitivity, and the detection of incipient anomalies. By integrating three synergistic architectural innovations, SDGR-Net achieves a balanced trade-off between predictive sensitivity and operational reliability. Specifically, the spatiotemporally decoupled dual-branch architecture enables orthogonal feature encoding, effectively separating longitudinal degradation dynamics from transverse inter-variable correlations to mitigate representation interference. The lightweight dual-view temporal extraction mechanism introduces a parsimonious bidirectional context by anchoring impending failure origins without incurring the parameter redundancy associated with conventional BiLSTMs, thereby preserving non-stationary pre-failure signals with high fidelity. In addition, the cross-branch dynamic gated residual fusion module functions as a differentiable informational bottleneck, adaptively suppressing stochastic noise while retaining essential temporal trajectories through residual shortcuts that stabilize gradient propagation.

Extensive experiments conducted on three heterogeneous HDD datasets, HUH721212ALN604, ST4000DM000, and MG07ACA14TA, demonstrate that SDGR-Net consistently outperforms six state-of-the-art baseline models, including GAN-augmented and Transformer-based approaches. In particular, SDGR-Net achieves a peak Recall of 0.9898 and reduces the False Alarm Rate (FAR) by up to 69.6% in high-reliability operating regimes. Comprehensive ablation studies further confirm that the proposed architectural decoupling and adaptive fusion strategies are critical for alleviating optimization competition among heterogeneous feature representations, resulting in semantically disentangled latent manifolds and improved discriminative capability.

Overall, SDGR-Net exhibits strong generalization performance and physical plausibility in failure prediction, offering a computationally efficient and deployable solution for proactive HDD maintenance in large-scale data center environments. By effectively resolving the fundamental trade-off between detection sensitivity and specificity, this work establishes a solid foundation for next-generation intelligent health monitoring systems, bridging the gap between deep learning representational power and the practical demands of reliability engineering.

Author Contributions

Conceptualization, Z.W. and J.Q.; methodology, Z.W. and J.Q.; software, Z.W.; validation, Z.W.; formal analysis, Z.W.; investigation, Z.W. and J.Q.; resources, Y.L. and Z.Y.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, J.Q. and Z.Y.; visualization, Y.L., J.Q. and Z.Y.; supervision, Y.L., J.Q. and Z.Y.; project administration, Y.L. and Z.Y.; funding acquisition, J.Q. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 62206314, the Guangdong Basic and Applied Basic Research Foundation under Grant Nos. 2025A1515010454 and 2023A1515012561, and the Science and Technology Projects in Guangzhou under Grant No. 2024A04J4388.

Data Availability Statement

The data presented in this study are openly available in Backblaze HDD datasets at https://www.backblaze.com/b2/hard-drive-test-data.html (accessed on 23 March 2026). The processed features and SDGR-Net implementation code are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall architecture of the proposed SDGR-Net, designed to extract and integrate temporal and variable features from SMART time-series data. (1) Temporal Branch: processes forward and reversed temporal views using independent LSTM blocks. Dual-view concatenation and dimensionality reduction produce compact temporal features preserving long-term trends and short-term anomalies; (2) Variable Branch: captures inter-variable correlations via forward and reversed LSTM processing of variable-wise sequences, with dual-view fusion and dimensionality reduction; (3) Classification Tier: Enhances fused features via multi-head attention, followed by terminal global pooling for health state probability output. Note: SDGR-Net omits intermediate pooling during feature extraction to maximize signal fidelity. Key innovations include dual-view concatenation fusion for lightweight bidirectional temporal modeling and gated residual cross-branch fusion for adaptive feature integration.

Figure 2. Detailed architecture of the spatiotemporally decoupled dual-branch heterogeneous LSTM encoder, comprising independent temporal and variable feature extraction branches.

Figure 3. The design of the cross-branch dynamic gated residual fusion module.

Figure 4. Confusion matrix analysis of SDGR-Net performance on three heterogeneous HDD datasets.

Figure 5. Evolution of the normalized global gradient norm during training. The stable range of 0.8–1.2 validates the efficacy of the gated residual fusion in maintaining gradient energy and preventing vanishing gradients.

Figure 6. Visualization of the learned gating weight distributions across different health states. The adaptive intensification of weights for critical attributes during the Alert phase substantiates the physical plausibility and interpretability of the SDGR-Net framework.

Table 1. Explicit Parameter Count Comparison for Temporal Encoding Layers (calculated with Input Dimension $D_{in} = 9$ and Hidden Dimension $H = 512$).

Model Architecture	Encoding Mechanism (Formula)	Exact Parameter Count
Standard BiLSTM	Independent Dual-Path ( $2 \times P$ )	2,138,112
SDGR-Net (Ours)	Shared Dual-Pass ( $1 \times P$ )	1,069,056
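
The counts in Table 1 can be reproduced with a short calculation. The sketch below assumes the single-bias LSTM formula $P = 4\,(H(D_{in} + H) + H)$; the table does not spell the formula out, so this is an assumption, although it matches both reported figures. The function name `lstm_param_count` is illustrative. Frameworks that store two bias vectors per gate (PyTorch's `nn.LSTM`, for example) would report a slightly larger number.

```python
def lstm_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one unidirectional LSTM layer with a single bias per gate."""
    # Four gates (input, forget, cell, output), each with an input-to-hidden
    # matrix, a hidden-to-hidden matrix, and one bias vector.
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return 4 * per_gate


D_IN, H = 9, 512  # values stated in the Table 1 caption

p = lstm_param_count(D_IN, H)
shared_dual_pass = 1 * p  # one weight set reused for both views (1 x P)
standard_bilstm = 2 * p   # two independent directional weight sets (2 x P)

print(f"{shared_dual_pass:,}")  # 1,069,056
print(f"{standard_bilstm:,}")   # 2,138,112
```

Under this formula, the halving in Table 1 comes entirely from reusing one set of recurrent weights for both sequence directions.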

Table 2. Ranking of Selected SMART Attributes based on Total Information Entropy (IE) Score across Three Heterogeneous Datasets.

ID	SMART Attribute	Entropy Score (Set A)	Entropy Score (Set B)	Entropy Score (Set C)	Total Score
5	Reallocated Sectors Count	0.235	0.218	0.241	0.694
197	Current Pending Sector Count	0.212	0.185	0.205	0.602
1	Raw Read Error Rate	0.091	0.076	0.088	0.255
194	Temperature (Celsius)	0.065	0.059	0.062	0.186
193	Load/Unload Cycle Count	0.042	0.048	0.045	0.135
9	Power-On Hours (POH)	0.038	0.035	0.037	0.110
192	Power-off Retract Count	0.025	0.021	0.024	0.070
12	Power Cycle Count	0.016	0.018	0.015	0.049
4	Start/Stop Count	0.011	0.009	0.010	0.030

Note: Set A: ST4000DM000; Set B: HUH721212ALN604; Set C: MG07ACA14TA.

Table 3. Health state labels and their corresponding time ranges.

Health State	Days to Failure
Good	31–60
Fair	21–30
Warning	11–20
Alert	0–10
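
The mapping in Table 3 can be expressed as a small lookup. This is a minimal sketch: the helper name `health_state` and the `HEALTH_BANDS` list are hypothetical, and rejecting values outside the 0–60 day window is an assumption, since the table defines no label for them.

```python
# Upper bound (inclusive) of each days-to-failure band, taken from Table 3.
HEALTH_BANDS = [
    (10, "Alert"),
    (20, "Warning"),
    (30, "Fair"),
    (60, "Good"),
]


def health_state(days_to_failure: int) -> str:
    """Map days-to-failure to a Table 3 health state (hypothetical helper)."""
    if days_to_failure < 0:
        raise ValueError("days_to_failure must be non-negative")
    for upper, label in HEALTH_BANDS:
        if days_to_failure <= upper:
            return label
    # Table 3 defines no label beyond 60 days; rejecting is an assumption.
    raise ValueError("Table 3 defines no label beyond 60 days to failure")


labels = [health_state(d) for d in (0, 10, 11, 25, 31, 60)]
print(labels)
```

Ordering the bands by their upper bound keeps the boundaries (10, 20, 30, 60) in one place, so changing the labelling scheme only requires editing the list.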

Table 4. Comparative performance of SDGR-Net and baseline models on the ST4000DM000, HUH721212ALN604, and MG07ACA14TA datasets. For SDGR-Net, the results are reported as Mean ± Standard Deviation over 10 independent runs to demonstrate the statistical robustness of the proposed architecture.

Dataset	Model	ACC	Precision	F1-Score	Recall	FAR
HUH721212ALN604	BiLSTM	0.9551	0.9653	0.9761	0.9871	0.0299
	TimeGAN-LSTM [34]	0.9714	0.9753	0.9771	0.9789	0.0165
	TimeGAN-GRU [34]	0.9686	0.9753	0.9762	0.9772	0.0224
	LSTM-Attention [35]	0.9723	0.9791	0.9791	0.9789	0.0192
	DiskTransformer [36]	0.9632	0.9792	0.9803	0.9814	0.0177
	Attention-Based BiLSTM [37]	0.9552	0.9654	0.9761	0.9871	0.0299
	SDGR-Net (Ours)	$0.9805 \pm 0.0011$	$0.9892 \pm 0.0009$	$0.9895 \pm 0.0008$	$0.9898 \pm 0.0007$	$0.0091 \pm 0.0006$
ST4000DM000	BiLSTM	0.7913	0.8380	0.8562	0.8752	0.1455
	TimeGAN-LSTM [34]	0.8893	0.9313	0.9115	0.8925	0.0553
	TimeGAN-GRU [34]	0.8738	0.9122	0.9001	0.8883	0.0718
	LSTM-Attention [35]	0.8505	0.9041	0.8782	0.8537	0.0760
	DiskTransformer [36]	0.8780	0.9224	0.9273	0.9323	0.0673
	Attention-Based BiLSTM [37]	0.8661	0.9431	0.9148	0.8882	0.0461
	SDGR-Net (Ours)	$0.9007 \pm 0.0028$	$0.9462 \pm 0.0025$	$0.9373 \pm 0.0022$	$0.9287 \pm 0.0021$	$0.0453 \pm 0.0024$
MG07ACA14TA	BiLSTM	0.8812	0.9335	0.9246	0.9159	0.0553
	TimeGAN-LSTM [34]	0.8846	0.9186	0.9141	0.9097	0.0682
	TimeGAN-GRU [34]	0.8539	0.8810	0.8930	0.9053	0.1034
	LSTM-Attention [35]	0.8719	0.9086	0.9053	0.9021	0.0768
	DiskTransformer [36]	0.8301	0.8921	0.8944	0.8967	0.0919
	Attention-Based BiLSTM [37]	0.8869	0.9297	0.9311	0.9324	0.0597
	SDGR-Net (Ours)	$0.9441 \pm 0.0015$	$0.9735 \pm 0.0013$	$0.9664 \pm 0.0012$	$0.9594 \pm 0.0011$	$0.0220 \pm 0.0014$
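
As a consistency check on Table 4, recall that the F1 score is the harmonic mean of precision and recall, and that a relative FAR reduction is the drop in FAR divided by the baseline FAR. The sketch below applies both to the HUH721212ALN604 rows. The function names are illustrative, and pairing SDGR-Net with the BiLSTM baseline for the FAR comparison is an assumption; it happens to reproduce the 69.6% figure quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# HUH721212ALN604 rows of Table 4 (mean values for SDGR-Net).
bilstm_f1 = f1_score(precision=0.9653, recall=0.9871)
sdgr_f1 = f1_score(precision=0.9892, recall=0.9898)

# FAR: BiLSTM baseline 0.0299 versus SDGR-Net 0.0091.
far_drop = relative_reduction(baseline=0.0299, improved=0.0091)

print(f"{bilstm_f1:.4f} {sdgr_f1:.4f} {far_drop:.1%}")
```

Both recomputed F1 values round to the tabulated 0.9761 and 0.9895, and the FAR reduction evaluates to 69.6%.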

Table 5. Performance evaluation of SDGR-Net across different class imbalance mitigation strategies on three heterogeneous HDD datasets. Results represent the mean performance to ensure a fair comparison between sampling and loss-weighting methods.

Dataset	Strategy	ACC	Precision	F1-Score	Recall	FAR
HUH721212ALN604	SDGR-Net + Weighted CE	0.9634	0.9752	0.9746	0.9741	0.0152
	SDGR-Net + Focal Loss	0.9712	0.9804	0.9796	0.9788	0.0124
	SDGR-Net + ROR (Ours)	0.9805	0.9892	0.9895	0.9898	0.0091
ST4000DM000	SDGR-Net + Weighted CE	0.8842	0.9215	0.9113	0.9014	0.0612
	SDGR-Net + Focal Loss	0.8906	0.9327	0.9239	0.9152	0.0528
	SDGR-Net + ROR (Ours)	0.9007	0.9462	0.9373	0.9287	0.0453
MG07ACA14TA	SDGR-Net + Weighted CE	0.9284	0.9592	0.9503	0.9416	0.0385
	SDGR-Net + Focal Loss	0.9351	0.9664	0.9582	0.9502	0.0294
	SDGR-Net + ROR (Ours)	0.9441	0.9735	0.9664	0.9594	0.0220

Table 6. Impact of the spatiotemporally decoupled dual-branch modeling in SDGR-Net on predictive robustness and feature discriminability across multiple HDD populations.

Dataset	Model	ACC	Precision	F1-Score	Recall	FAR
HUH721212ALN604	Temporal-only	0.9446	0.9563	0.9591	0.9618	0.0382
	Variable-only	0.9236	0.9472	0.9552	0.9633	0.0454
	Shared-Params	0.9543	0.9640	0.9754	0.9667	0.0333
	Mixed Input	0.8105	0.8690	0.8709	0.8729	0.1114
	SDGR-Net (Ours)	0.9805	0.9892	0.9895	0.9898	0.0091
ST4000DM000	Temporal-only	0.8006	0.8809	0.8637	0.8472	0.0985
	Variable-only	0.8307	0.8842	0.8857	0.8872	0.0999
	Shared-Params	0.8658	0.9155	0.9081	0.9008	0.0714
	Mixed Input	0.8256	0.8622	0.8612	0.8643	0.1228
	SDGR-Net (Ours)	0.9007	0.9462	0.9373	0.9287	0.0453
MG07ACA14TA	Temporal-only	0.8465	0.9101	0.9039	0.8979	0.0752
	Variable-only	0.8444	0.9158	0.8929	0.8712	0.0678
	Shared-Params	0.8944	0.9359	0.9310	0.9262	0.0537
	Mixed Input	0.7229	0.8211	0.8195	0.8181	0.1510
	SDGR-Net (Ours)	0.9441	0.9735	0.9664	0.9594	0.0220

Table 7. Ablation analysis of the lightweight dual-view temporal extraction mechanism in SDGR-Net across three heterogeneous HDD datasets.

Dataset	Fusion Strategy	ACC	Precision	F1-Score	Recall	FAR
HUH721212ALN604	Forward View	0.9635	0.9709	0.9706	0.9703	0.0276
	BiLSTM	0.9607	0.9705	0.9694	0.9682	0.0280
	Mean Aggregation	0.9495	0.9581	0.9585	0.9588	0.0402
	Concat + Linear (Ours)	0.9805	0.9892	0.9895	0.9898	0.0091
ST4000DM000	Forward View	0.8703	0.9214	0.9074	0.8939	0.0689
	BiLSTM	0.8869	0.9236	0.9185	0.9135	0.0680
	Mean Aggregation	0.8637	0.9092	0.9004	0.8917	0.0824
	Concat + Linear (Ours)	0.9007	0.9462	0.9373	0.9287	0.0453
MG07ACA14TA	Forward View	0.8998	0.9249	0.9248	0.9246	0.0688
	BiLSTM	0.9274	0.9551	0.9537	0.9524	0.0379
	Mean Aggregation	0.8936	0.9186	0.9176	0.9167	0.0753
	Concat + Linear (Ours)	0.9441	0.9735	0.9664	0.9594	0.0220

Table 8. Ablation Study of the Cross-Branch Dynamic Gated Residual Fusion Module on Three HDD Datasets.

Dataset	Fusion Strategy	ACC	Precision	F1-Score	Recall	FAR
HUH721212ALN604	Global Average Pooling	0.9522	0.9610	0.9617	0.9625	0.0373
	Dynamic gating without Residual connection	0.9471	0.9572	0.9568	0.9564	0.0413
	Concat	0.9482	0.9564	0.9578	0.9593	0.0417
	Dynamic gating + Residual (Ours)	0.9805	0.9892	0.9895	0.9898	0.0091
ST4000DM000	Global Average Pooling	0.8725	0.9196	0.9084	0.8974	0.0733
	Dynamic gating without Residual connection	0.8562	0.9031	0.8944	0.8859	0.0868
	Concat	0.8545	0.9087	0.8920	0.8757	0.0823
	Dynamic gating + Residual (Ours)	0.9007	0.9462	0.9373	0.9287	0.0453
MG07ACA14TA	Global Average Pooling	0.9226	0.9450	0.9480	0.9550	0.0489
	Dynamic gating without Residual connection	0.9219	0.9460	0.9430	0.9430	0.0487
	Concat	0.9154	0.9497	0.9422	0.9349	0.0453
	Dynamic gating + Residual (Ours)	0.9441	0.9735	0.9664	0.9594	0.0220

Table 9. Zero-shot cross-model generalization performance of SDGR-Net. Results highlight the framework’s resilience to dataset-specific distribution shifts and its ability to internalize universal degradation laws across heterogeneous HDD models.

Source (Train)	Target (Test)	ACC	Precision	F1-Score	Recall	FAR
MG07ACA14TA	ST4000DM000	0.8425	0.8241	0.8136	0.8033	0.0814
MG07ACA14TA	HUH721212ALN604	0.8712	0.8657	0.8429	0.8214	0.0652
ST4000DM000	MG07ACA14TA	0.8156	0.7984	0.7814	0.7652	0.0985
ST4000DM000	HUH721212ALN604	0.7942	0.7712	0.7569	0.7431	0.1124
HUH721212ALN604	MG07ACA14TA	0.7588	0.7425	0.7122	0.6843	0.1312
HUH721212ALN604	ST4000DM000	0.7321	0.7214	0.6889	0.6592	0.1543

Table 10. Comprehensive Efficiency Benchmark Comparison. All models were tested on an NVIDIA GeForce RTX 3090. Lower values are better for Params, GFLOPs, Latency, and Memory.

Model	Params (M)	GFLOPs	Latency (ms)	Max Memory (MB)
BiLSTM	5.290	0.0620	1.2694	104.58
TimeGAN + GRU	8.450	0.1850	2.8500	128.40
TimeGAN + LSTM	9.120	0.2105	3.1200	135.60
Attention + LSTM	5.342	0.0628	1.3450	105.12
DiskTransformer	22.431	0.7819	5.2697	210.50
BiLSTM + Attention	5.355	0.0621	1.3270	104.82
SDGR-Net (Ours)	6.192	0.0971	1.8569	39.87

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, Z.; Qin, J.; Lu, Y.; Yang, Z. SDGR-Net: A Spatiotemporally Decoupled Gated Residual Network for Robust Multi-State HDD Health Prediction. Electronics 2026, 15, 1399. https://doi.org/10.3390/electronics15071399


