1. Introduction
Today, hard disks are the primary storage devices in computers, and data centers rely on large numbers of them to store critical information. In large-scale storage systems, such as high-performance computing and internet services [1], hard disk failures occur frequently. According to a survey in the literature [2], 78% of hardware replacements are due to hard disk failures. A failed disk can cause permanent data loss that is difficult to recover from, reducing the reliability of data centers. Predicting disk failures early therefore not only reduces the risk of data loss but also lowers the cost of data recovery.
Hard disk failure prediction is an important proactive maintenance method for storage systems. Unlike traditional software operation and maintenance, which repairs faults after they occur, hard disk failure prediction is a proactive process in which operation and maintenance tools handle potential system failures before they occur, thereby avoiding data loss.
In the related fields of internal combustion engine fault diagnostics and software vulnerability prediction, Srinivaas et al. reviewed internal combustion engine fault detection methods, including model-based and data-driven approaches [3]; fault prediction methods that incorporate advanced deep learning techniques, such as the Transformer, detect faults better than traditional machine learning methods. Zhang et al. [4] introduced a vulnerability prediction method based on multi-level N-gram feature extraction and heterogeneous ensemble learning; because individual classifiers generalize poorly and tend to overfit, the method combines the four best-performing classifiers as base classifiers. Both lines of work account for the impact of features on prediction results, which applies equally to hard disk failure prediction.
In practical industrial applications, the collection, transmission, and storage of hard disk SMART data can suffer partial data loss for various objective or subjective reasons. During collection, sensor errors, transmission errors, and device failures can make the data inaccurate or incomplete. In large-scale data centers, SMART collection may be disabled during holidays, leaving gaps for the shutdown period. Maintenance engineers might even corrupt sample data deliberately to degrade the accuracy of failure prediction tools, hoping to avoid being replaced by intelligent operation and maintenance software. All of these situations produce inaccurate or incomplete hard disk data, i.e., low-quality data, which severely hampers accurate failure prediction, and high-quality datasets are then unavailable for training. If low-quality datasets can be processed so that their prediction results approach those of high-quality datasets, the practical significance is substantial. Moreover, if good prediction performance can be retained even when 99% of the data is lost, hard disk data can be compressed accordingly, greatly saving storage space and cost.
To address this, this paper proposes DFPoLD, a hard Disk Failure Prediction method for Low-quality Datasets. The main contributions of this paper are as follows:
- (1)
This paper constructs and open-sources a low-quality SMART dataset named Backblaze-. We first delete 10–99% of the 2022 data to create Backblaze-, and then extract data, reset labels, select features, fill missing values, add time series features, and split the data into training and testing sets. Given the imbalance between positive and negative samples in hard disk datasets, we set the class_weight parameter during training to balance the sample distribution, and we use stratified sampling when dividing the dataset so that positive and negative samples are proportionally represented in both the training and testing sets (see the sketch after this list).
- (2)
For each SMART attribute, we construct time series statistical features such as the Absolute Sum of First Difference (ASFD) to reflect the absolute fluctuation between adjacent observations of the SMART data. Experimental results indicate that introducing these time series features improves model accuracy and enables earlier prediction of hard disk failures.
- (3)
This paper fully considers the impact of dataset quality on time window selection. Using a 30-day time window and observing how the predicted probability of hard disk failure changes over time, we find that dataset quality affects the choice of time window. We propose a time window selection formula based on the proportion of lost hard disk data, obtained by fitting on datasets with 10%, 50%, and 80% data loss. The experimental results indicate that the poorer the dataset quality, the longer the time window should be; choosing an appropriate time window improves prediction accuracy.
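As a concrete illustration of contributions (1) and (2), the minimal sketch below builds a degraded copy of the data, computes a per-disk ASFD column, and performs a stratified split. The column names (serial_number, date, smart_5_raw, failure) and the file name are illustrative assumptions, not our exact pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_low_quality(df: pd.DataFrame, loss_ratio: float, seed: int = 1) -> pd.DataFrame:
    """Simulate a low-quality dataset by randomly dropping a fraction of daily records."""
    return df.sample(frac=1.0 - loss_ratio, random_state=seed)

def add_asfd(df: pd.DataFrame, attr: str, window: int = 30) -> pd.DataFrame:
    """ASFD: absolute sum of first differences of one SMART attribute,
    accumulated over a sliding window and computed per disk."""
    df = df.sort_values(["serial_number", "date"])
    diffs = df.groupby("serial_number")[attr].diff().abs()
    df[f"{attr}_asfd"] = (
        diffs.groupby(df["serial_number"])
        .rolling(window, min_periods=1)
        .sum()
        .reset_index(level=0, drop=True)
        .fillna(0.0)  # a disk's first record has no preceding observation
    )
    return df

# Illustrative usage: 50% data loss, ASFD for one attribute, stratified 7:3 split.
smart = pd.read_csv("backblaze_2022.csv")  # hypothetical file name
low_q = add_asfd(make_low_quality(smart, 0.5), "smart_5_raw")
X = low_q.drop(columns=["failure", "serial_number", "date"])
y = low_q["failure"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1  # stratified sampling
)
```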
The remainder of this paper is organized as follows: Section 2 examines related research on hard disk failure prediction. Section 3 describes the construction process of low-quality datasets. Section 4 demonstrates the effectiveness of the Absolute Sum of First Difference (ASFD) and Complexity Invariant Distance (CID) time series features. Section 5 uses the LightGBM (LGB) model for prediction and the Optuna method for hyperparameter optimization. Section 6 verifies the effectiveness of the method. Finally, Section 7 summarizes the main findings.
2. Related Work
Over the past few decades, numerous scholars have proposed machine learning and deep learning methods to improve the accuracy of hard disk failure prediction, building prediction models with Long Short-Term Memory (LSTM) networks [5,6,7], Generative Adversarial Networks (GANs) [8], Regularized Greedy Forest (RGF) [9], Temporal Convolutional Networks (TCNs) [10], Recurrent Neural Networks (RNNs) [11], and Convolutional Neural Networks (CNNs) [12]. Unfortunately, these approaches are generally trained on high-quality datasets, such as the Backblaze open-source dataset. Consequently, their applicability in scenarios involving low-quality datasets is limited.
Self-Monitoring Analysis and Reporting Technology (SMART) [13] can monitor various operational metrics of hard disks, such as read/write counts, head load/unload cycles, and seek error rates. Traditional threshold-based failure prediction methods compare the current SMART attributes of a hard disk with preset thresholds; if an attribute crosses its threshold, the disk raises an alert to the operating system. However, this approach only achieves a True Positive Rate (TPR) of 3% to 10% at a False Positive Rate (FPR) of 0.1%.
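A minimal sketch of this threshold rule follows; the attribute names and threshold values are illustrative assumptions (real drives carry vendor-defined thresholds in firmware, and normalized SMART values degrade downward toward the threshold):

```python
# Illustrative vendor-style thresholds on normalized SMART values.
THRESHOLDS = {"smart_5_normalized": 36, "smart_197_normalized": 0}

def threshold_alert(sample: dict) -> bool:
    """Alert when any monitored normalized value falls to or below its threshold."""
    return any(sample.get(attr, 255) <= th for attr, th in THRESHOLDS.items())

print(threshold_alert({"smart_5_normalized": 20, "smart_197_normalized": 80}))  # True
```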
Alessio Burrello et al. [10] utilized Temporal Convolutional Networks (TCNs) for hard disk failure prediction. On real-world datasets, they found that TCNs outperformed Random Forest (RF) and Recurrent Neural Networks (RNNs), improving the Fault Detection Rate (FDR) by approximately 7.5% and reducing the False Alarm Rate (FAR) to 0.052% within a 90-day input window. They also explored the architecture design space and proposed a range of models offering various trade-offs between complexity, memory usage, and performance, demonstrating that TCNs consistently outperform RNNs at any given model size and complexity.
Tianming Jiang et al. approached hard disk failure prediction from the perspective of economic cost, aiming to minimize the cost of failure recovery [14]. They proposed the evaluation metric Mean-Cost-To-Recovery (MCTR) and employed a threshold-moving approach to find the optimal operating point. Evaluation on three real datasets demonstrated that, compared to passive fault-tolerant techniques, the approach reduced MCTR by 86.9%.
To address the poor performance of hard disk failure prediction in heterogeneous data centers, Ji Zhang et al. employed a twin neural network based on Long Short-Term Memory (LSTM) to obtain high-dimensional embedding vectors of hard disks [15]. Combined with a distance-based anomaly detection method, evaluations on both the Tencent and Backblaze datasets demonstrated better applicability and superior performance across different hard disk models, with AUC and F1 scores both exceeding 0.9. Compared to other models, the approach achieved the highest True Positive Rate (TPR) while maintaining a low False Positive Rate (FPR).
To enhance the prediction quality and timeliness of hard disk failure modeling, Han Wang et al. proposed a two-layer classification-based feature selection scheme [16]. The method designs a filter that computes attribute importance to remove attributes insensitive to failure recognition, determines feature correlations from the correlation coefficient, and introduces an attribute classification method. Experimental results show that the technique improves the prediction accuracy of ML/AI-based hard disk failure prediction models and, under optimal conditions, reduces training and prediction latency by 75% and 83%, respectively.
Mingyu Zhang et al. proposed a failure prediction method based on blending ensemble learning, evaluated on the publicly available Backblaze hard disk datasets [17]. Experiments on multiple disk models identified an ensemble model that performs well on most of them, addressing the low robustness and generalization of traditional machine learning methods and demonstrating the effectiveness and broad applicability of the approach.
To address the unsatisfactory prediction performance for long-term failures and small-sample disks, Wenqiang Ge et al. proposed a new hard disk failure prediction method [18]. The framework consists of a time-series feature extraction network and a prediction network. Experimental results on public datasets demonstrate that the method not only predicts long-term failures but also performs reliably on small-sample disk data.
In summary, these methods have several main shortcomings:
- (1)
Failure to consider the temporal aspect of hard drive data. Existing hard drive failure prediction methods mostly treat each day’s sampling of hard drive data as a single sample, neglecting the temporal information of the time series. Models typically only consider the state of the hard drive on a given day when making predictions. In hard drive failure prediction, a hard drive does not suddenly fail at a specific moment but undergoes a gradual deterioration process. Initially, the hard drive exhibits unstable parameters before eventually failing after a period of time. Therefore, capturing information during the period of unstable hard drive states can significantly improve prediction accuracy.
- (2)
Lack of exploration of higher-order features. In hard drive failure prediction, most researchers use the raw SMART dataset for training. The raw SMART dataset includes both the raw and normalized values of SMART attributes, with each value being a number reported by the drive. However, the raw SMART dataset does not capture abrupt changes in the data. Therefore, incorporating higher-order time series features that can reflect sudden changes would improve the prediction model’s accuracy.
- (3)
Poor applicability in industrial scenarios. Most existing research methods utilize high-quality datasets, such as the Backblaze open-source dataset. However, in real-world scenarios, the collected data often contains missing values and is generally of lower quality. Additionally, in practical applications, there is a significant imbalance between the number of positive and negative samples in the dataset. In performance evaluation experiments, current techniques typically randomly split the dataset into training and testing sets, which does not align with the actual application patterns in data centers.
Our previous work considered the temporal characteristics of hard disk data and constructed low-quality datasets [19]. In this study, we make several significant optimizations:
- (1)
The prior work did not account for the impact of dataset quality on time window selection, uniformly setting the time window to 10 days across datasets of varying quality before truncating data and resetting labels. In this study, we extend the time window to 30 days and observe, across datasets of different qualities and under various models, that lower dataset quality necessitates a longer time window. We propose a time window selection formula by evaluating window sizes of 1–30 days on datasets with 10%, 50%, and 80% data loss and selecting the window size according to the data loss ratio (see the fitting sketch after this list).
- (2)
This study constructs an extremely low-quality dataset by removing 99% of the data. Using this dataset, we trained the LGB model and compared its performance against datasets with 0% to 80% data loss. The results show decreases in both AUC score and F1 score, alongside a sharp increase in FPR whose growth accelerates as more data is removed. Although TPR increases on this extremely low-quality dataset, the rapid rise in FPR indicates a higher misclassification rate. We therefore conclude that higher dataset quality raises the attainable upper limit of hard disk failure prediction accuracy, while lower quality reduces it.
- (3)
The previous work did not consider how the probability of a disk being predicted to fail evolves over time. In this study, with the time window set to 30 days, we observe this probability under different models on datasets with 99% data loss, both with and without the added ASFD features. Before the ASFD time series features are added, the predicted failure probability declines as the remaining time to failure grows; after they are added, the probability stays flat over time and remains high even 30 days before failure. This indicates that the ASFD features increase the probability of accurately predicting hard disk failures and provide earlier warning of when failures will occur, further demonstrating that ASFD suits hard disk failure prediction scenarios.
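As a hedged sketch of how such a time window formula can be fitted: a least-squares fit of the best-performing window size against the loss ratio suffices. The window values below are the AUC-maximizing windows read from Tables 8–10, while the linear functional form is an illustrative assumption, not the exact formula proposed in this paper.

```python
import numpy as np

# Best-AUC window sizes observed on the 10%-, 50%-, and 80%-loss datasets
# (read from Tables 10, 9, and 8, respectively).
loss_ratio = np.array([0.10, 0.50, 0.80])
best_window = np.array([18, 18, 19])  # days

# Fit w(p) = a*p + b; the linear form is an assumption for illustration.
a, b = np.polyfit(loss_ratio, best_window, 1)

def suggested_window(p: float) -> int:
    """Suggested time window (days) for a dataset with fraction p of records lost."""
    return int(round(a * p + b))

print(suggested_window(0.99))  # higher loss ratio -> longer window
```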
6. Experiments
This section describes how setting an appropriate time window, adding ASFD and CID time series features, and training with the LGB model can raise the hard disk failure prediction performance on low-quality datasets to that of high-quality datasets. The effectiveness and generality of our model are also verified on a real dataset collected by the Nankai-Baidu Joint Lab at Nankai University.
6.5. Model Comparison and Selection
To observe how the effectiveness of hard disk failure prediction varies across models, this experiment contrasts training results on the drop_SMART dataset with 80% of the data removed, the filled_SMART dataset with missing values filled column-wise, the SMART+CID dataset, the SMART+ASFD dataset, and the SMART+ASFD+CID dataset. Five models, LGB, XGBoost, CatBoost, NGBoost, and AdaBoost, were employed for training. Note that NGBoost and AdaBoost require datasets without missing values, so the figures contain no drop_SMART results for those two models.
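The comparison can be reproduced with a loop of the following shape; this is a sketch assuming train/test frames (X_tr, y_tr, X_te, y_te) prepared as in Section 3 (here the filled_SMART variant), with library defaults apart from class balancing, so scores will differ from Figure 5.

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, recall_score, roc_auc_score

models = {
    "LGB": LGBMClassifier(class_weight="balanced", random_state=1),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=1),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),  # needs a dataset without NaNs
}

for name, model in models.items():
    model.fit(X_tr, y_tr)                   # frames from the filled_SMART dataset
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    print(f"{name}: TPR={recall_score(y_te, pred):.4f}  "
          f"AUC={roc_auc_score(y_te, prob):.4f}  F1={f1_score(y_te, pred):.4f}")
```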
Figure 5 illustrates the performance of the different models across the datasets. The horizontal axis represents the five datasets, while the vertical axis indicates each model's score on the respective metric: Figure 5a shows the effect of the different models on TPR, Figure 5b on the AUC score, and Figure 5c on the F1 score. From the figure, we can discern the following:
- (1)
The effectiveness of the five models varies across different metrics in hard drive failure prediction, with the overall results showing that the LGB model performs the best in terms of TPR, AUC score, and F1 score.
- (2)
Both the LGB and XGBoost models demonstrate good performance in the TPR and AUC score metrics for hard drive failure prediction. In comparison, the CatBoost and NGBoost models exhibit less favorable results.
- (3)
While AdaBoost performs poorly in the TPR and AUC score metrics, it achieves relatively high scores in the F1 score metric.
- (4)
After deleting 80% of the data, the XGBoost model achieves the highest TPR score, reaching 0.6605. However, when the dataset with deleted entries is filled with mode values, the LGB model surpasses XGBoost in TPR. With the addition of time series indicators, the LGB model still performs the best, with the highest TPR reaching 0.8669.
- (5)
Adding CID to the filled dataset degrades the CatBoost model, while it benefits the other models.
- (6)
Adding ASFD to the filled dataset results in a significant improvement in TPR for all models, indicating the effectiveness of ASFD time series indicators in hard drive failure prediction.
- (7)
When ASFD and CID are both applied to the dataset, the TPR is almost identical to when ASFD is used alone, suggesting that ASFD is more suitable for hard drive failure prediction scenarios.
To observe how dataset quality impacts failure prediction performance across models, this experiment compares the training results of LGB, XGBoost, and CatBoost on drop_SMART datasets constructed by removing 10% to 99% of the data. NGBoost and AdaBoost were excluded from the experiment because they require datasets free of missing values.
Figure 6 illustrates the performance of the different models on datasets of different qualities. The horizontal axis represents 11 datasets of different qualities, and the vertical axis indicates each model's score on the respective metric: Figure 6a shows the effect of the different models on TPR, Figure 6b on FPR, Figure 6c on the AUC score, and Figure 6d on the F1 score. From the figure, we can discern the following:
- (1)
Across datasets of varying quality, the three models exhibit different performance levels for hard disk failure prediction. The LGB model consistently achieves the best overall results in terms of TPR and AUC score.
- (2)
For XGBoost and CatBoost, as dataset quality deteriorates, their TPR, AUC score, and F1 score metrics exhibit a clear downward trend.
- (3)
When data loss exceeds 95%, TPR begins to recover; however, FPR simultaneously increases, indicating that the LGB model misclassifies a large number of hard disks as faulty, resulting in no overall improvement in performance.
- (4)
Although CatBoost performs poorly in terms of TPR, AUC score, and F1 score, it achieves excellent and stable results for FPR, remaining close to 0. This suggests that the declining dataset quality does not impact FPR when using the CatBoost model.
- (5)
When data loss exceeds 70%, the FPR of the LGB model rises sharply, suggesting that higher data loss leads to an increased false positive rate in hard disk failure predictions.
- (6)
LGB and XGBoost deliver comparable and strong performance on the AUC metric, whereas the CatBoost model shows relatively inferior results.
- (7)
For datasets with less than 70% data loss, the LGB model outperforms XGBoost and CatBoost across all metrics, making it the preferred choice.
- (8)
A comprehensive analysis of the four metrics indicates that dataset quality has a significant impact on hard disk failure prediction performance. High-quality datasets yield superior results, while low-quality datasets lead to poorer performance.
To observe how the predicted probability of hard disk failure changes over time under different models, we set the time window to 30 days and compare models trained on the SMART dataset with 99% data loss (LGB, CatBoost, and XGBoost) against models trained on the SMART+ASFD dataset with 99% data loss (LGB2, CatBoost2, and XGBoost2). The horizontal axis represents the number of days remaining before failure: a value of n means the disk will fail in n days, so larger values indicate a longer time until failure. The vertical axis represents the average probability that a failing hard disk in the test set is predicted to fail n days before the actual failure. From Figure 7, the following observations can be made:
- (1)
Before incorporating time-series features, the XGBoost model outperforms the other algorithms, but after the features are added its predicted failure probabilities decrease. Conversely, once the ASFD feature is integrated, the LGB model surpasses the feature-free XGBoost model.
- (2)
After adding the ASFD feature to the dataset, the LGB and CatBoost models significantly outperform their pre-integration counterparts, with a marked improvement in predicting failure probabilities for failing hard disks. This indicates that the ASFD time-series feature introduced in this study is effective for hard disk failure prediction, particularly in scenarios with substantial data loss.
- (3)
For the LGB model, before incorporating the ASFD time-series feature, the probability of predicting hard disks as likely to fail decreases steadily as the time before failure increases. After integrating the feature, the failure prediction probabilities stabilize over time and remain high even 30 days prior to failure. This demonstrates that the ASFD time-series feature encodes historical hard disk information into each day's data, improving the likelihood of accurately predicting failures and allowing earlier detection of hard disk failures.
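The per-day curves in Figure 7 can be computed with a grouping like the sketch below: average each model's predicted failure probability over the failing test disks, grouped by the number of days remaining before the actual failure. The meta frame and its will_fail and days_to_failure columns are hypothetical bookkeeping for illustration, not part of the released dataset.

```python
import pandas as pd

def prob_curve(model, X_te: pd.DataFrame, meta: pd.DataFrame) -> pd.Series:
    """Mean predicted failure probability of failing disks, indexed by the
    number of days remaining before the actual failure (1..30)."""
    prob = pd.Series(model.predict_proba(X_te)[:, 1], index=X_te.index)
    failing = meta["will_fail"] == 1  # hypothetical metadata columns
    return prob[failing].groupby(meta.loc[failing, "days_to_failure"]).mean()

# e.g., plot prob_curve(lgb2, X_te, meta) against prob_curve(lgb, X_te, meta)
```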
Author Contributions
Conceptualization, S.W. and H.Y.; data curation, S.W. and X.L.; formal analysis, C.T.; funding acquisition, J.G.; investigation, X.L.; methodology, S.W. and X.L.; project administration, Y.F. and H.Y.; resources, H.S.; software, S.W. and X.L.; supervision, H.Y.; validation, visualization, X.L.; roles/writing—original draft, S.W. and X.L.; and writing—review and editing, Y.F. and H.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by a grant from the Tianjin Manufacturing High Quality Development Special Foundation (20232185), the National Key R&D Program of China (2021YFB3300903), the National Natural Science Foundation of China (U23A20299), the Kingbase Foundation (2024-CN-FW-0287), and the Roycom Foundation (70306901).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original data presented in the study are openly available in OSF at https://osf.io/6xd2a (accessed on 17 January 2025).
Acknowledgments
The authors would like to thank all the anonymous reviewers for their helpful comments and suggestions. We also thank Wenrui Zhang, Jiahua Tong, and Ping Wang for their test work and helpful discussion.
Conflicts of Interest
Author Jiangpu Guo was employed by the company Roycom Information Technology Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
ASFD | Absolute Sum of First Difference |
TPR | True Positive Rate |
FPR | False Positive Rate |
CID | Complexity Invariant Distance |
LGB | LightGBM |
LSTM | Long Short-Term Memory |
GANs | Generative Adversarial Networks |
RGF | Regularized Greedy Forest |
TCNs | Temporal Convolutional Networks |
RNNs | Recurrent Neural Networks |
CNNs | Convolutional Neural Networks |
SMART | Self-Monitoring Analysis and Reporting Technology |
RF | Random Forest |
FDR | Fault Detection Rate |
FAR | False Alarm Rate |
MCTR | Mean-Cost-To-Recovery |
MA | Moving Average |
DPF | Days to Predict Failure |
References
- Schroeder, B.; Gibson, G.A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, 13–16 February 2007; Volume 7, pp. 1–16. [Google Scholar]
- Xu, S.; Xu, X. ConvTrans-TPS: A Convolutional Transformer Model for Disk Failure Prediction in Large-Scale Network Storage Systems. In Proceedings of the 26th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Rio de Janeiro, Brazil, 24–26 May 2023; pp. 1318–1323. [Google Scholar]
- Srinivaas, A.; Sakthivel, N.R.; Nair, B.B. Machine Learning Approaches for Fault Detection in Internal Combustion Engines: A Review and Experimental Investigation. Informatics 2025, 12, 25. [Google Scholar] [CrossRef]
- Zhang, B.; Gao, Y.; Wu, J.; Wang, N.; Wang, Q.; Ren, J. Approach to Predict Software Vulnerability Based on Multiple-Level N-gram Feature Extraction and Heterogeneous Ensemble Learning. Int. J. Softw. Eng. Knowl. Eng. 2022, 32, 1559–1582. [Google Scholar] [CrossRef]
- Shen, J.; Ren, Y.; Wan, J.; Lan, Y.; Yang, X. Hard disk drive failure prediction for mobile edge computing based on an LSTM recurrent neural network. Mob. Inf. Syst. 2021, 2021, 8878364. [Google Scholar] [CrossRef]
- Coursey, A.; Nath, G.; Prabhu, S.; Sengupta, S. Remaining useful life estimation of hard disk drives using bidirectional LSTM networks. In Proceedings of the 2021 IEEE International Conference on Big Data, Orlando, FL, USA, 15–18 December 2021; pp. 4832–4841. [Google Scholar]
- Ahmed, J.; Green, I.I.R.C. Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced SMART sensors data. Eng. Appl. Artif. Intell. 2024, 127, 107339. [Google Scholar] [CrossRef]
- Liu, Y.; Guan, Y.; Jiang, T.; Zhou, K.; Wang, H.; Hu, G.; Zhang, J.; Fang, W.; Cheng, Z.; Huang, P. SPAE: Lifelong disk failure prediction via end-to-end GAN-based anomaly detection with ensemble update. Future Gener. Comput. Syst. 2023, 148, 460–471. [Google Scholar] [CrossRef]
- Gargiulo, F.; Duellmann, D.; Arpaia, P.; Moriello, R.S.L. Predicting hard disk failure by means of automatized labeling and machine learning approach. Appl. Sci. 2021, 11, 8293. [Google Scholar] [CrossRef]
- Burrello, A.; Pagliari, D.J.; Bartolini, A.; Benini, L.; Macii, E.; Poncino, M. Predicting hard disk failures in data centers using temporal convolutional neural networks. In Euro-Par 2020: Parallel Processing Workshops, Proceedings of the 26th International Conference on Parallel and Distributed Computing, Warsaw, Poland, 24–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 277–289. [Google Scholar]
- Xu, C.; Wang, G.; Liu, X.; Guo, D.; Liu, T.-Y. Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 2016, 65, 3502–3508. [Google Scholar] [CrossRef]
- Lu, S.; Luo, B.; Patel, T.; Yao, Y.; Tiwari, D.; Shi, W. Making disk failure predictions SMARTer! In Proceedings of the 18th USENIX Conference on File and Storage Technologies, Boston, MA, USA, 25–27 February 2020; pp. 151–167. [Google Scholar]
- Allen, B. Monitoring hard disks with SMART. Linux J. 2004, 2004, 9. [Google Scholar]
- Jiang, T.; Huang, P.; Zhou, K. Cost-efficiency disk failure prediction via threshold-moving. Concurr. Comput. Pract. Exp. 2020, 32, e5669. [Google Scholar] [CrossRef]
- Zhang, J.; Huang, P.; Zhou, K.; Xie, M.; Schelter, S. HDDse: Enabling High-Dimensional Disk State Embedding for Generic Failure Detection System of Heterogeneous Disks in Large Data Centers. In Proceedings of the 2020 USENIX Annual Technical Conference, Boston, MA, USA, 15–17 July 2020; pp. 111–126. [Google Scholar]
- Wang, H.; Zhuge, Q.; Sha, E.H.-M.; Xu, R.; Song, Y. Optimizing Efficiency of Machine Learning Based Hard Disk Failure Prediction by Two-Layer Classification-Based Feature Selection. Appl. Sci. 2023, 13, 7544. [Google Scholar] [CrossRef]
- Zhang, M.; Ge, W.; Tang, R.; Liu, P. Hard disk failure prediction based on blending ensemble learning. Appl. Sci. 2023, 13, 3288. [Google Scholar] [CrossRef]
- Ge, W.; Liu, P.; Zhang, M.; Zhang, Z.; Lai, Y. DiskTransformer: A Transformer Network for Hard Disk Failure Prediction. In Proceedings of the 7th International Conference on Artificial Intelligence and Big Data (ICAIBD), Beijing, China, 5–7 July 2024; pp. 327–332. [Google Scholar]
- Lu, X.; Tu, C.; Yang, H.; Guo, J.; Sun, H. FPTSF: A Failure Prediction of Hard Disks Based on Time Series Features Towards Low Quality Dataset. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Jinhua, China, 30 August–1 September 2024; pp. 438–447. [Google Scholar]
- Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Han, S.; Wu, J.; Xu, E.; He, C. Robust Data Preprocessing for Machine-Learning-Based Disk Failure Prediction in Cloud Production Environments. arXiv 2019, arXiv:1912.09722. [Google Scholar]
- Wang, H.; Yang, Y.; Yang, H. Hard Disk Failure Prediction Based on Lightgbm with CID. In Proceedings of the 2021 IEEE Symposium on Computers and Communications (ISCC), Athens, Greece, 5–8 September 2021; pp. 1–7. [Google Scholar]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 3149–3157. Available online: https://dl.acm.org/doi/10.5555/3294996.3295074 (accessed on 4 December 2017).
- Priyadarshini, I.; Cotton, C. A novel LSTM–CNN–grid search-based deep neural network for sentiment analysis. J. Supercomput. 2021, 77, 13911–13932. [Google Scholar] [CrossRef] [PubMed]
- Belete, D.M.; Huchaiah, M.D. Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results. Int. J. Comput. Appl. 2022, 44, 875–886. [Google Scholar] [CrossRef]
- Sun, Y.; Ding, S.; Zhang, Z.; Jia, W. An improved grid search algorithm to optimize SVR for prediction. Soft Comput. 2021, 25, 5633–5644. [Google Scholar] [CrossRef]
- Shekhar, S.; Bansode, A.; Salim, A. A comparative study of hyper-parameter optimization tools. In Proceedings of the 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Brisbane, Australia, 8–10 December 2021; pp. 1–6. [Google Scholar]
- Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316. [Google Scholar] [CrossRef]
- Zhang, J.; Wang, Q.; Shen, W. Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library. Chin. J. Chem. Eng. 2022, 52, 115–125. [Google Scholar] [CrossRef]
- Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
- Lai, J.P.; Lin, Y.L.; Lin, H.C.; Shih, C.Y.; Wang, Y.P.; Pai, P.F. Tree-based machine learning models with optuna in predicting impedance values for circuit analysis. Micromachines 2023, 14, 265. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Stones, R.J.; Wang, G.; Liu, X.; Li, Z.; Xu, M. Hard drive failure prediction using decision trees. Reliab. Eng. Syst. Saf. 2017, 164, 55–65. [Google Scholar] [CrossRef]
Figure 1.
ASFD of different time series.
Figure 2.
CID of different time series.
Figure 3.
Comparison of existing models on different datasets.
Figure 4.
Comparison between DFPoLD and other models.
Figure 5.
(a) TPR of different models on each dataset. (b) AUC score of different models on each dataset. (c) F1 score of different models on each dataset.
Figure 6.
(a) The effect of TPR for different models on different dataset qualities. (b) The effect of FPR for different models on different dataset qualities. (c) The effect of AUC score for different models on different dataset qualities. (d) The effect of F1 score for different models on different dataset qualities.
Figure 7.
Change over time in the probability of a failing disk being predicted as failed.
Figure 8.
(a) The probability of different models predicting the failure of hard disks on a dataset with 10% data loss. (b) The probability of different models predicting the failure of hard disks on a dataset with 50% data loss. (c) The probability of different models predicting the failure of hard disks on a dataset with 99% data loss.
Figure 9.
(a) Experimental results of different time windows deleting 80% of data. (b) Experimental results of different time windows deleting 50% of data. (c) Experimental results of different time windows deleting 10% of data.
Table 1.
Initial dataset.
Year | Total Hard Disks | Healthy Disks | Failed Disks |
---|---|---|---|
2022 | 18,611 | 17,978 | 633 |
2023 | 18,238 | 17,678 | 560 |
Total | 36,849 | 35,656 | 1193 |
Table 2.
SMART attributes after feature selection.
ID | SMART Attribute | Raw or Normalized |
---|---|---|
1 | Read Error Rate | Raw and Normalized |
3 | Spin_Up_Time | Normalized |
4 | Start/Stop Count | Raw and Normalized |
5 | Reallocated Sector Count | Raw and Normalized |
7 | Seek Time Performance | Raw and Normalized |
9 | Power-On Hours | Raw and Normalized |
10 | Spin Retry Count | Normalized |
12 | Power Cycle Count | Raw |
183 | SATA Downshift Error Count | Raw and Normalized |
184 | End-To-End Error | Raw and Normalized |
187 | Reported Uncorrectable Errors | Raw and Normalized |
188 | Command_Timeout | Raw |
189 | High_Fly_Writes | Raw and Normalized |
190 | Airflow Temperature | Raw and Normalized |
191 | G-Sense_Error_Rate | Raw and Normalized |
192 | Power-off Retract Count | Raw |
193 | Load Cycle Count | Raw and Normalized |
194 | Temperature_Celsius | Raw and Normalized |
195 | Hardware ECC Recovered | Raw and Normalized |
197 | Current Pending Sector Count | Raw and Normalized |
198 | Uncorrectable Sector Count | Raw and Normalized |
199 | UDMA_CRC_Error_Count | Raw |
240 | Head Flying Hours | Raw |
241 | Total LBAs Written | Raw |
242 | Total LBAs Read | Raw |
Table 3.
Optimal hyperparameters of LGB model.
Hyper Parameters | Optimal Hyperparameters |
---|---|
objective | binary |
metric | binary_error |
max_depth | 20 |
num_leaves | 100 |
n_estimators | 600 |
learning_rate | 0.05 |
verbosity | −1 |
random_state | 1 |
class_weight | balanced |
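For reference, the Table 3 configuration maps directly onto the LightGBM scikit-learn interface, as in the sketch below; the commented Optuna search space is an illustrative assumption rather than the exact space we searched.

```python
from lightgbm import LGBMClassifier

# LGB model instantiated with the optimal hyperparameters of Table 3.
best_lgb = LGBMClassifier(
    objective="binary",
    metric="binary_error",
    max_depth=20,
    num_leaves=100,
    n_estimators=600,
    learning_rate=0.05,
    verbosity=-1,
    random_state=1,
    class_weight="balanced",
)
# A typical Optuna study would propose such values with, e.g.,
#   trial.suggest_int("max_depth", 5, 30)
#   trial.suggest_float("learning_rate", 1e-3, 0.3, log=True)
# and maximize a validation metric over a fixed number of trials.
```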
Table 4.
Experimental results of different quality datasets.
Dataset | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
all_data | 0.9984 | 0.0001 | 0.9992 | 0.9984 |
drop_data_10 | 0.9927 | 0.0001 | 0.9963 | 0.9947 |
drop_data_20 | 0.9661 | 0.0002 | 0.9830 | 0.9804 |
drop_data_30 | 0.9444 | 0.0004 | 0.9720 | 0.9650 |
drop_data_40 | 0.9169 | 0.0011 | 0.9579 | 0.9404 |
drop_data_50 | 0.8992 | 0.0020 | 0.9486 | 0.9188 |
drop_data_60 | 0.8427 | 0.0031 | 0.9198 | 0.8727 |
drop_data_70 | 0.7540 | 0.0054 | 0.8743 | 0.7897 |
drop_data_80 | 0.6548 | 0.0236 | 0.8156 | 0.5598 |
drop_data_90 | 0.5968 | 0.1166 | 0.7401 | 0.2396 |
drop_data_91 | 0.5556 | 0.1232 | 0.7162 | 0.2166 |
drop_data_92 | 0.5452 | 0.1576 | 0.6938 | 0.1782 |
drop_data_93 | 0.5331 | 0.1841 | 0.6745 | 0.1551 |
drop_data_94 | 0.4202 | 0.1116 | 0.6543 | 0.1804 |
drop_data_95 | 0.5968 | 0.2877 | 0.6546 | 0.1200 |
drop_data_96 | 0.6729 | 0.4193 | 0.6268 | 0.0954 |
drop_data_97 | 0.6524 | 0.4618 | 0.5953 | 0.0867 |
drop_data_98 | 0.7208 | 0.5996 | 0.5606 | 0.0739 |
drop_data_99 | 0.8032 | 0.7452 | 0.5290 | 0.0686 |
Table 5.
Experimental results of randomly deleting 50% of data.
Dataset | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
drop_50_1 | 0.8629 | 0.0010 | 0.9310 | 0.9126 |
drop_50_2 | 0.8508 | 0.0012 | 0.9248 | 0.9029 |
drop_50_3 | 0.8323 | 0.0009 | 0.9157 | 0.8951 |
drop_50_4 | 0.8524 | 0.0007 | 0.9258 | 0.9100 |
drop_50_5 | 0.8444 | 0.0009 | 0.9217 | 0.9030 |
drop_50_6 | 0.8556 | 0.0007 | 0.9275 | 0.9119 |
drop_50_7 | 0.8508 | 0.0009 | 0.9249 | 0.9064 |
drop_50_8 | 0.8500 | 0.0007 | 0.9247 | 0.9094 |
drop_50_9 | 0.8363 | 0.0009 | 0.9177 | 0.8978 |
drop_50_10 | 0.8524 | 0.0008 | 0.9258 | 0.9096 |
Average | 0.8488 | 0.0009 | 0.9240 | 0.9059 |
STDEV | 0.0090 | 0.0002 | 0.0045 | 0.0060 |
Table 6.
Experimental results of randomly deleting 80% of data.
Dataset | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
drop_80_1 | 0.7532 | 0.0498 | 0.8517 | 0.4711 |
drop_80_2 | 0.7315 | 0.0444 | 0.8435 | 0.4846 |
drop_80_3 | 0.7419 | 0.0479 | 0.8470 | 0.4736 |
drop_80_4 | 0.7427 | 0.0437 | 0.8495 | 0.4933 |
drop_80_5 | 0.7363 | 0.0449 | 0.8457 | 0.4846 |
drop_80_6 | 0.7484 | 0.0463 | 0.8511 | 0.4842 |
drop_80_7 | 0.7347 | 0.0444 | 0.8451 | 0.4861 |
drop_80_8 | 0.7484 | 0.0490 | 0.8497 | 0.4720 |
drop_80_9 | 0.7589 | 0.0437 | 0.8576 | 0.5012 |
drop_80_10 | 0.7379 | 0.0433 | 0.8473 | 0.4930 |
Average | 0.7434 | 0.0457 | 0.8488 | 0.4844 |
STDEV | 0.0087 | 0.0024 | 0.0041 | 0.0099 |
Table 7.
Experimental results of padding on dataset with 80% loss of data.
Processing | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
- | 0.6548 | 0.0236 | 0.8156 | 0.5598 |
Mode Filling | 0.6621 | 0.0208 | 0.8207 | 0.5845 |
Mean filling | 0.7347 | 0.0513 | 0.8417 | 0.4560 |
MICE | 0.6540 | 0.0288 | 0.8126 | 0.5253 |
Table 8.
Experimental results of different time windows deleting 80% of data.
Time Window (Days) | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
30 | 0.9832 | 0.00020 | 0.9915 | 0.9886 |
29 | 0.9712 | 0.00018 | 0.9855 | 0.9783 |
28 | 0.9852 | 0.00016 | 0.9925 | 0.9902 |
27 | 0.9730 | 0.00021 | 0.9864 | 0.9832 |
26 | 0.9838 | 0.00025 | 0.9918 | 0.9881 |
25 | 0.9777 | 0.00055 | 0.9886 | 0.9807 |
24 | 0.9848 | 0.00059 | 0.9921 | 0.9837 |
23 | 0.9846 | 0.00031 | 0.9921 | 0.9877 |
22 | 0.9791 | 0.00046 | 0.9893 | 0.9827 |
21 | 0.9766 | 0.00093 | 0.9878 | 0.9747 |
20 | 0.9819 | 0.00025 | 0.9908 | 0.9872 |
19 | 0.9886 | 0.00058 | 0.9940 | 0.9858 |
18 | 0.9848 | 0.00062 | 0.9921 | 0.9833 |
17 | 0.9678 | 0.00084 | 0.9835 | 0.9715 |
16 | 0.9774 | 0.00082 | 0.9883 | 0.9767 |
15 | 0.9781 | 0.00122 | 0.9884 | 0.9713 |
14 | 0.9519 | 0.00153 | 0.9752 | 0.9535 |
13 | 0.9636 | 0.00120 | 0.9812 | 0.9642 |
12 | 0.9673 | 0.00073 | 0.9833 | 0.9728 |
11 | 0.9644 | 0.00129 | 0.9815 | 0.9633 |
10 | 0.8669 | 0.00286 | 0.9320 | 0.8892 |
9 | 0.9424 | 0.00182 | 0.9703 | 0.9445 |
8 | 0.9283 | 0.00103 | 0.9636 | 0.9481 |
7 | 0.9477 | 0.00164 | 0.9730 | 0.9498 |
6 | 0.9205 | 0.00351 | 0.9585 | 0.9103 |
5 | 0.8952 | 0.00388 | 0.9457 | 0.8917 |
4 | 0.9465 | 0.00267 | 0.9719 | 0.9354 |
3 | 0.7230 | 0.00665 | 0.8582 | 0.7548 |
2 | 0.8419 | 0.00396 | 0.9190 | 0.8606 |
1 | 0.5984 | 0.03360 | 0.7824 | 0.4663 |
Table 9.
Experimental results of different time windows deleting 50% of data.
Time Window (Days) | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
30 | 0.9878 | 0.00047 | 0.9937 | 0.9870 |
29 | 0.9852 | 0.00019 | 0.9925 | 0.9897 |
28 | 0.9855 | 0.00082 | 0.9923 | 0.9807 |
27 | 0.9829 | 0.00032 | 0.9913 | 0.9866 |
26 | 0.9894 | 0.00042 | 0.9945 | 0.9885 |
25 | 0.9841 | 0.00046 | 0.9918 | 0.9853 |
24 | 0.9818 | 0.00033 | 0.9907 | 0.9860 |
23 | 0.9849 | 0.00013 | 0.9924 | 0.9905 |
22 | 0.9831 | 0.00035 | 0.9914 | 0.9864 |
21 | 0.9773 | 0.00053 | 0.9884 | 0.9807 |
20 | 0.9827 | 0.00036 | 0.9912 | 0.9861 |
19 | 0.9847 | 0.00016 | 0.9923 | 0.9900 |
18 | 0.9906 | 0.00046 | 0.9951 | 0.9886 |
17 | 0.9754 | 0.00063 | 0.9874 | 0.9784 |
16 | 0.9774 | 0.00065 | 0.9884 | 0.9791 |
15 | 0.9727 | 0.00091 | 0.9859 | 0.9730 |
14 | 0.9731 | 0.00082 | 0.9861 | 0.9745 |
13 | 0.9717 | 0.00053 | 0.9856 | 0.9780 |
12 | 0.9720 | 0.00080 | 0.9856 | 0.9743 |
11 | 0.9688 | 0.00090 | 0.9839 | 0.9712 |
10 | 0.9697 | 0.00115 | 0.9843 | 0.9681 |
9 | 0.9504 | 0.00146 | 0.9744 | 0.9537 |
8 | 0.9492 | 0.00144 | 0.9739 | 0.9535 |
7 | 0.9750 | 0.00106 | 0.9870 | 0.9722 |
6 | 0.9417 | 0.00237 | 0.9697 | 0.9368 |
5 | 0.9302 | 0.00334 | 0.9634 | 0.9178 |
4 | 0.9604 | 0.00164 | 0.9794 | 0.9566 |
3 | 0.8285 | 0.00437 | 0.9121 | 0.8475 |
2 | 0.9209 | 0.00232 | 0.9593 | 0.9264 |
1 | 0.8189 | 0.00328 | 0.9078 | 0.8560 |
Table 10.
Experimental results of different time windows deleting 10% of data.
Time Window (Days) | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
20 | 0.9948 | 0.00004 | 0.9974 | 0.9968 |
19 | 0.9953 | 0.00004 | 0.9976 | 0.9970 |
18 | 0.9973 | 0.00008 | 0.9986 | 0.9975 |
17 | 0.9939 | 0.00006 | 0.9969 | 0.9960 |
16 | 0.9940 | 0.00007 | 0.9970 | 0.9960 |
15 | 0.9968 | 0.00011 | 0.9983 | 0.9968 |
14 | 0.9943 | 0.00004 | 0.9971 | 0.9966 |
13 | 0.9945 | 0.00008 | 0.9972 | 0.9960 |
12 | 0.9880 | 0.00018 | 0.9939 | 0.9913 |
11 | 0.9956 | 0.00020 | 0.9977 | 0.9949 |
10 | 0.9936 | 0.00019 | 0.9967 | 0.9940 |
9 | 0.9938 | 0.00012 | 0.9968 | 0.9951 |
8 | 0.9910 | 0.00010 | 0.9955 | 0.9940 |
7 | 0.9841 | 0.00016 | 0.9920 | 0.9897 |
6 | 0.9894 | 0.00036 | 0.9945 | 0.9894 |
5 | 0.9698 | 0.00016 | 0.9848 | 0.9823 |
4 | 0.9842 | 0.00027 | 0.9919 | 0.9881 |
3 | 0.9710 | 0.00027 | 0.9854 | 0.9813 |
2 | 0.9723 | 0.00055 | 0.9859 | 0.9781 |
1 | 0.9528 | 0.00137 | 0.9757 | 0.9565 |
Table 11.
Test set prediction results on the Nankai datasets.
Dataset | TPR | FPR | AUC Score | F1 Score |
---|---|---|---|---|
NanKai-0 | 0.9965 | 0.00027 | 0.9981 | 0.9908 |
NanKai-10 | 0.9966 | 0.00001 | 0.9983 | 0.9979 |
NanKai-50 | 0.9936 | 0.00020 | 0.9967 | 0.9911 |
NanKai-80 | 0.9946 | 0.00037 | 0.9971 | 0.9871 |
Table 12.
Test set prediction results.
Data Removal Rate | Positive Predictions | Positive Prediction Probability | DPF (Days) |
---|---|---|---|
0 | 86,606 | 0.9842 | 9.84 |
50% | 86,736 | 0.9857 | 9.86 |
80% | 85,824 | 0.9753 | 9.75 |
STDEV | | 0.0056 | 0.0561 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).