Article

DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets †

1 School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
2 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China
3 Haojing Cloud Computing Technology Corporation, Nanjing 211153, China
4 Roycom Information Technology Corporation, Tianjin 301721, China
5 China Electronics System Technology Corporation, Beijing 100036, China
6 Beijing Kingbase Technology Inc., Beijing 100006, China
* Author to whom correspondence should be addressed.
† Presented at the 8th APWeb-WAIM Joint International Conference on Web and Big Data (APWeb-WAIM 2024), Jinhua, China, 30 August–1 September 2024.
These authors contributed equally to this work.
Informatics 2025, 12(3), 73; https://doi.org/10.3390/informatics12030073
Submission received: 22 April 2025 / Revised: 7 July 2025 / Accepted: 15 July 2025 / Published: 16 July 2025
(This article belongs to the Section Big Data Mining and Analytics)

Abstract

Hard disk failure prediction is an important proactive maintenance method for storage systems. Recent years have seen significant progress in hard disk failure prediction using high-quality SMART datasets. In industrial applications, however, data loss often occurs during SMART data collection, transmission, and storage, and existing machine learning-based failure prediction models perform poorly on the resulting low-quality datasets. This paper therefore proposes a hard disk failure prediction technique for low-quality datasets. First, starting from the original Backblaze dataset, we construct a low-quality dataset, Backblaze-, by simulating sector damage in real-world scenarios and deleting 10% to 99% of the data. Second, we introduce time series features such as the Absolute Sum of First Difference (ASFD) to amplify the differences between positive and negative samples and to reduce the model's sensitivity to SMART data loss. Third, considering the impact of dataset quality on time window selection, we propose a time window selection formula that chooses the window length according to the proportion of data loss; we find that the poorer the dataset quality, the longer the time window should be. Even with 80% data loss, the proposed model achieves a True Positive Rate (TPR) of 99.46%, an AUC of 0.9971, and an F1 score of 0.9871, with a False Positive Rate (FPR) below 0.04%, maintaining performance close to that on the original dataset.
Keywords: failure prediction; low-quality data; time series feature; time window
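
To make the ASFD feature named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes ASFD is the sum of absolute differences between consecutive readings in a time window, and that missing readings are simply skipped. The helper names `drop_records` and `asfd`, the uniform random deletion, and the sample values are all hypothetical; the paper's Backblaze- dataset is built by simulating sector damage, which this sketch does not reproduce.

```python
import random


def drop_records(series, loss_ratio, seed=0):
    """Replace a random fraction of readings with None to mimic data loss.

    Assumption: uniform random deletion, a simplification of the
    sector-damage simulation described in the abstract.
    """
    rng = random.Random(seed)
    n_drop = int(len(series) * loss_ratio)
    dropped = set(rng.sample(range(len(series)), n_drop))
    return [None if i in dropped else v for i, v in enumerate(series)]


def asfd(window):
    """Absolute sum of first differences over the observed readings.

    Assumption: missing readings (None) are skipped, so differences are
    taken between consecutive *observed* values.
    """
    observed = [v for v in window if v is not None]
    return sum(abs(b - a) for a, b in zip(observed, observed[1:]))


# Hypothetical daily values of one SMART counter over a 10-day window.
healthy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
failing = [0, 0, 1, 1, 3, 6, 6, 10, 15, 24]

print(asfd(healthy))  # 0
print(asfd(failing))  # 24

# With 80% of the readings removed, only two values of the window remain.
lossy_failing = drop_records(failing, loss_ratio=0.8)
print(lossy_failing, asfd(lossy_failing))
```

In this toy window the flat series keeps an ASFD of 0 no matter which readings are dropped, while the rising series keeps a positive ASFD whenever the surviving readings differ. That is one plausible reading of why such a feature separates positive and negative samples under data loss. A short window with heavy loss can leave too few readings to show any change, which is consistent with the abstract's finding that poorer data calls for a longer window.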

Share and Cite

MDPI and ACS Style

Wei, S.; Lu, X.; Yang, H.; Tu, C.; Guo, J.; Sun, H.; Feng, Y. DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets. Informatics 2025, 12, 73. https://doi.org/10.3390/informatics12030073

AMA Style

Wei S, Lu X, Yang H, Tu C, Guo J, Sun H, Feng Y. DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets. Informatics. 2025; 12(3):73. https://doi.org/10.3390/informatics12030073

Chicago/Turabian Style

Wei, Shuting, Xiaoyu Lu, Hongzhang Yang, Chenfeng Tu, Jiangpu Guo, Hailong Sun, and Yu Feng. 2025. "DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets" Informatics 12, no. 3: 73. https://doi.org/10.3390/informatics12030073

APA Style

Wei, S., Lu, X., Yang, H., Tu, C., Guo, J., Sun, H., & Feng, Y. (2025). DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets. Informatics, 12(3), 73. https://doi.org/10.3390/informatics12030073

