Article

Optimizing Efficiency of Machine Learning Based Hard Disk Failure Prediction by Two-Layer Classification-Based Feature Selection

School of Computer Science and Technology, East China Normal University, Shanghai 200063, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7544; https://doi.org/10.3390/app13137544
Submission received: 31 May 2023 / Revised: 21 June 2023 / Accepted: 24 June 2023 / Published: 26 June 2023

Abstract: Predicting hard disk failure effectively and efficiently can prevent the high costs of data loss in data storage systems. Disk failure prediction based on machine learning and artificial intelligence has gained notable attention because of its strong predictive capability. Improving the accuracy and performance of disk failure prediction, however, remains a challenging problem. When disk failure is about to occur, the time available for the prediction process, including building models and predicting, is limited. Faster training improves the efficiency of model updates, while late predictions not only have no value but also waste resources. To improve both the prediction quality and modeling timeliness, a two-layer classification-based feature selection scheme is proposed in this paper. An attribute filter calculating the importance of attributes was designed to remove attributes insensitive to failure identification, where importance is obtained based on the idea of classification tree models. Furthermore, by determining the correlation between features based on correlation coefficients, an attribute classification method is proposed. In the experiments, machine learning and artificial intelligence models were applied, including naïve Bayes, random forest, support vector machine, gradient-boosted decision tree, convolutional neural networks, and long short-term memory. The results showed that the proposed technique could improve the prediction accuracy of ML/AI-based hard disk failure prediction models. Specifically, random forest and long short-term memory combined with the proposed technique showed the best accuracy. Meanwhile, the proposed scheme reduced the training and prediction latency by 75% and 83%, respectively, in the best case compared with the baseline methods.

1. Introduction

Nowadays, the scale of data centers has rapidly increased, because more data are generated every day with the significant development of the internet and computer technologies. Among the limited kinds of storage devices in data centers, hard disk drives (HDDs) represent a large proportion, and they are one of the main causes of server breakdowns: around 82% of server breakdowns in data centers are caused by HDD failures [1], leading to data and economic losses. Consequently, improving HDD reliability is extremely important. Previously, researchers in the storage community have undertaken many works to make HDDs more reliable. Traditionally, data redundancy methods such as replication in GFS [2], erasure codes (EC) [3], and redundant arrays of independent disks (RAID) [4] have been widely deployed in modern storage systems as remedies in response to the occurrence of disk failure. However, these methods are reactive fault-tolerant techniques that reconstruct data when disk failures occur, thus wasting storage space and bandwidth [5].
In recent years, machine learning and artificial intelligence (ML/AI) technologies have attracted many researchers because of their significant success in data-driven applications. Thus, proactive failure prediction based on ML/AI is widely studied and has become a hot spot in the HDD reliability field. Compared to traditional reactive approaches, disk failure prediction allows administrators to take precautions when failures are successfully predicted, resulting in lower maintenance costs. Generally, self-monitoring, analysis, and reporting technology (SMART) [6] data are used for building prediction models; these data are correlated with disk health and indicative of eventual failure. Previously, many works have obtained satisfactory results in HDD failure prediction [7,8,9]; for example, Shen et al. [7] predicted disk failure with more than 95% accuracy based on the random forest (RF) model, and Lu et al. [8] and Zhang et al. [9] obtained an extremely high prediction quality based on long short-term memory (LSTM). However, most of these works did not discuss timeliness, which is also crucial for disk failure prediction. On the one hand, disk state data are updated in real time, so prediction models must be updated to remain suitable for the latest forecast. In this case, building models quickly improves the efficiency of model updates, which makes the latest predictions more accurate. On the other hand, the prediction latency may be too high to replace a failed disk in time, leading to data and economic losses; thus, reducing prediction latency is also critical. To speed up modeling, some works focused on updated hardware architectures, such as GPUs and FPGAs, while other techniques focused on model optimization, such as Adam [10], and model compression methods, such as network pruning.
In this work, another perspective for model acceleration is considered: reducing the input data features. This is proposed based on two observations. First, for neural networks such as convolutional neural networks (CNN), more training epochs always lead to lower training losses but higher latency. Second, using fewer attributes does not decrease a model's performance but reduces the modeling latency. Thus, the purpose of this work was to improve timeliness, without reducing model quality, by selecting as few attributes as possible. To realize this, a feature selection scheme was designed. The basic idea is to first classify attributes into different groups by mining the correlations between them and then to select a representative from each group for modeling. Thus, fewer attributes are used to represent the overall data. However, there remain some challenges. First, some attributes may have negative impacts on modeling; in this case, some groups may be insensitive to failures and reduce the prediction quality. Therefore, an attribute filtering method is designed that removes attributes with low importance based on the idea of classification tree models. Second, the correlation between attributes is sophisticated and hard to mine comprehensively and accurately. Thus, a classification method based on correlation coefficients is proposed. As a result, the proposed scheme selects a small number of attributes to differentiate failures, and these features are also the most crucial ones. It should be noted that, to the best of our knowledge, we are the first to conduct feature selection from the perspective of mining the relationships between attributes in this area. The main contributions of this work are as follows:
  • From the perspective of the data in ML/AI-based hard disk failure prediction, this paper discusses and analyzes improving the prediction accuracy and timeliness of modeling by effectively selecting crucial attributes;
  • An attribute filtering scheme is designed to remove unimportant attributes based on the idea of classification tree models, where entropy and the Gini index are the kernel techniques of these models; thus, both entropy and the Gini index are considered and employed to determine the importance of attributes;
  • An attribute reduction scheme is proposed, to reduce the number of attributes in deep learning. First, it classifies features into different groups by mining their mutual correlation, where Pearson’s correlation coefficient and Spearman’s correlation coefficient are considered to mine the correlation between attributes. Then, it selects the most critical attribute from each group for modeling;
  • We employ real disk health data to evaluate our scheme. The experimental results show that the proposed scheme can improve the prediction accuracy of ML/AI-based prediction models, and it can reduce the training and prediction latency by up to 75% and 83%, respectively, compared with the baseline methods. Our proposed scheme based on the Gini index and Pearson’s correlation coefficient performs best when using RF and LSTM.
This paper is organized as follows: Section 2 introduces the background and related work; Section 3 presents the motivation; Section 4 presents the design; Section 5 evaluates the experimental results; and Section 6 concludes the paper.

2. Background and Related Work

2.1. ML/AI Based HDD Failure Prediction

Traditionally, SMART data are used to monitor disk states and raise an alarm when failure is imminent. In general, SMART attributes are set by manufacturers, who also set a conservative threshold for each attribute, so that an alarm is raised when the data exceed the threshold. However, this approach yields a low false alarm rate (FAR) but also a low failure detection rate (FDR). As reported in [11], this original threshold-based method only achieves a 3–10% FDR. To improve failure prediction performance, many works have proposed ML/AI methods based on SMART attributes. Generally, this task is defined as a binary classification problem, where one class is for healthy disks and the other is for failures. For example, Li et al. [5] used gradient-boosted regression trees (GBRT) [12] to design a residual life prediction model, and their results showed that this method can offer a practical improvement in disk failure prediction. Mahdisoltani et al. [13] compared several methods for sector error prediction; as a result, random forest (RF) [14]-based classifiers obtained the most accurate results. To make better use of time-series SMART data, many works [8,9,15] employed recurrent neural networks (RNN) and long short-term memory (LSTM) [16], which contain memory cells to record the historical information of SMART attributes, and these can achieve outstanding prediction results.
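As a concrete illustration of this binary-classification setup, the sketch below trains a random-forest classifier on synthetic SMART-like data; the feature values, labels, and dimensions are placeholders invented for illustration, not the paper's dataset.

```python
# Hypothetical sketch: disk failure prediction as binary classification
# on synthetic, SMART-style features, using a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_disks, n_attrs = 1000, 12
X = rng.normal(size=(n_disks, n_attrs))          # normalized SMART attributes
y = (X[:, 0] + 0.5 * X[:, 3] > 1.5).astype(int)  # 1 = will fail, 0 = healthy

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # test-set accuracy
```

On real data, the class imbalance and the time-series nature of SMART records make the task considerably harder than this toy setup suggests.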

2.2. Related Works

The most closely related works are the feature selection techniques in the data preprocessing of ML/AI-based disk failure prediction. Previously, some works trained models without feature selection [17]; however, some features are not strongly correlated with drive failures, and using them for failure prediction can have a negative impact on model performance [11]. Thus, feature selection is widely employed in the disk failure prediction field and has a crucial impact on prediction quality. Generally, features are selected based on their relationship with failures. Some works selected features based on expertise [5]. Some works obtained features for modeling using statistical methods; for example, reverse arrangement tests, rank-sum tests, and z-scores were used in [15]. Mahdisoltani et al. [13] used the correlation coefficient and information gain to select features. Zhang et al. applied principal component analysis (PCA) to find the features most related to disk failure for prediction modeling. Lu et al. [8] employed the J-Index [18] for feature selection, where it determined the correlation between an attribute and failure. Zhang et al. [9] and Jiang et al. [19] selected features based on Pearson’s correlation coefficient [20] to obtain the linear correlation between continuous data, and they chose the features most closely correlated with failures. Moreover, some works selected features based on machine learning techniques; for example, Chaves et al. [21] used random forests to find features, and the features with high information gain were selected for modeling. However, most of the previous works mainly focused on improving the prediction accuracy by determining whether the selected features correlated with failures, which can obtain good prediction results, but the trade-off between prediction accuracy and modeling timeliness was not explored further.
In this study, to find a better way to balance prediction accuracy and modeling timeliness, both the correlation between attributes and failures and the correlation between different attributes are considered for feature selection design, and a two-layer classification-based scheme is proposed.

3. Motivation

Previously, many works have shown that deep learning techniques tend to perform better for disk failure prediction [8,9]. Thus, this work was motivated by failure prediction experiments based on deep learning models, from which two observations were drawn. For simplicity, we show some examples using a CNN model.
Observation 1: For deep learning models, more training epochs lead to lower training losses but higher latency. Figure 1a shows the change in modeling loss with the increase in training epochs, where the modeling loss displays a decreasing tendency. This reflects that sufficient epochs are needed to obtain a more accurate model. At the same time, Figure 1b shows the normalized modeling latency; from 10 to 1000 epochs, the latency increases rapidly; therefore, a more accurate model always needs a sufficiently long time for training. Thus, this would impact the efficiency of some applications that require timeliness. Unfortunately, disk failure prediction is one of these, where if timeliness is not considered, the modeling latency may be too long to replace failed disks in time, wasting computing resources and storage, and what is worse, resulting in data and economic losses.
Observation 2: Using fewer attributes does not degrade the prediction quality but reduces the modeling latency. Figure 2a shows the prediction effect for different models trained using different numbers of attributes. To evaluate the effect, the precision, recall, and F1-score were used, which are explained in Section 5; for all of them, the higher the value, the better. The number of epochs of this CNN model was set to 50 for simplicity, and the attributes were selected randomly without replacement. Generally, training models with more attributes might be expected to give a better prediction effect; however, as shown in the figure, the three metrics fluctuate. This reflects that more attributes do not necessarily result in better prediction effects; even with 33 attributes, the result was not better than that with 8 attributes: the precision, recall, and F1-score of the former and the latter were 0.81, 0.24, and 0.38 versus 0.65, 0.39, and 0.49, respectively. Additionally, Figure 2b shows the normalized modeling latency when training with 33 down to 3 attributes; the latency decreased significantly with fewer attributes. Thus, in this example, if only eight attributes were used for training, the prediction results would be close to training with all attributes, while the modeling latency would be reduced significantly.
Therefore, to maintain both timeliness and model quality for disk failure prediction, it is necessary to train models with enough epochs while reducing the modeling latency as much as possible. A good feature selection scheme is needed to improve the efficiency of this task.

4. Design

In this section, a feature selection scheme is proposed to improve the efficiency of the disk failure prediction task. Figure 3 presents an overview of this scheme, which is designed for the data preprocessing block. The basic idea of this design is to distinguish between healthy and failed disks using as few attributes as possible, based on the correlation with failure identification and the relationships between attributes. This can reduce the modeling latency while obtaining good prediction results. However, there are at least two challenges to this basic idea. First, a method must be designed that chooses the attributes that best distinguish healthy and soon-to-fail disks for constructing the prediction models; if the selected attributes are not beneficial for identifying failures, the prediction results will not be accurate. Second, choosing as few attributes as possible is a difficult problem: on the one hand, the relationships between attributes are sophisticated and hard to determine accurately; on the other hand, it is hard to choose a set of attributes that represents the state of a disk comprehensively.
To overcome the above challenges, an efficient feature selection scheme is proposed. First, after data cleaning, normalization, labeling, and subsampling, an attribute filtering method is applied, which determines the features’ importance and removes attributes with low importance values; in other words, it removes the features that are unrelated to failure prediction. Second, to further reduce the attributes, an attribute reduction method is proposed, which classifies features based on the correlation between them and then chooses one attribute from each group for training models. After the steps above, the failure prediction models can be trained and evaluated.

4.1. Problem Description and Data Processing

In this subsection, the data preprocessing designs, except feature selection, are described. In this work, the task is defined as a binary classification problem, as described in Section 2. The input is a series of disk SMART data, and we used 10 days’ data as a sample, as in [8,22,23]. For example, if there is only one attribute a in the SMART data and a_i is its value on the ith day, the input would be [a_1, a_2, a_3, ..., a_10]. Moreover, the label set is {0, 1}: a sample is labeled 1 when the disk will fail within the next 10 days; otherwise, it is labeled 0. Thus, the output is whether the disk will fail in the next 10 days. As there are many attributes with null values in the original SMART data, we deleted these attributes to construct better prediction models. Furthermore, since the range of values of different attributes across different disks and vendors varies widely, it is hard to perform meaningful comparisons. Thus, min–max normalization was used, as shown in Equation (1), where x_i is the original value of the ith variable in X, and x_min and x_max are the minimum and maximum values in X, respectively.
$$x_{norm} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
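As a quick sanity check, Equation (1) can be applied column-wise to a small matrix of attribute values (a minimal sketch; real SMART data would have far more rows and columns):

```python
import numpy as np

def min_max_normalize(X):
    """Apply Equation (1) to each attribute (column) of X."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
print(min_max_normalize(X))  # every column is rescaled into [0, 1]
```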
In addition, an extreme imbalance between healthy (negative) and failure (positive) samples is common in this task, where failure samples are much rarer than healthy samples. If this is not handled well, the prediction models tend to predict disks as healthy and be insensitive to failures, which is not our goal. Thus, we used sub-sampling to reduce this impact, as is common in other works [13,24]. In this work, we kept all failed disks and randomly selected healthy disks. To choose a reasonable ratio of healthy to failed disks, we experimented with ratios from 1:1 to 20:1 and found that the middle of this range gave the best balance between the amount of training data and the prediction accuracy. Thus, we chose 10:1 by default in this paper.
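The sub-sampling step above can be sketched as follows; the disk identifiers are synthetic placeholders, and a 10:1 healthy-to-failed ratio is used as in the paper:

```python
# Hedged sketch of sub-sampling: keep every failed disk and randomly
# draw healthy disks at a 10:1 healthy-to-failed ratio.
import numpy as np

def subsample(healthy_ids, failed_ids, ratio=10, seed=0):
    rng = np.random.default_rng(seed)
    n_healthy = min(len(healthy_ids), ratio * len(failed_ids))
    kept = rng.choice(healthy_ids, size=n_healthy, replace=False)
    return list(kept), list(failed_ids)

# 100,000 healthy disks, 50 failed disks -> keep 500 healthy + all 50 failed
healthy, failed = subsample(list(range(100000)), list(range(100000, 100050)))
print(len(healthy), len(failed))  # 500 50
```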

4.2. Attribute Filtering Based on Feature Importance

The basic idea for the attribute filter is to find the attributes that have a negligible or negative influence on the prediction results. Inspired by the idea of decision tree models that use entropy or the Gini index to decide the split nodes for classification, in this work, both entropy and Gini index are employed to decide the features’ importance, where the Gini index is chosen by default, because it had a relatively better performance than the other approach in our experiments, as shown in Section 5.
Entropy is a quantitative indicator of the uncertainty of random variables: the higher the entropy of a dataset, the more chaotic the dataset. For binary classification, the entropy of a dataset D is given by Equation (2):
$$Entropy(D) = -\left( p \log_2 p + (1 - p) \log_2 (1 - p) \right),$$
where p is the probability of one class, and 1 − p is the probability of the other class. Based on an attribute A, D can be divided into several subsets; in this case, the entropy of D given A is calculated as in Equation (3):
$$Entropy(D|A) = \sum_{i=1}^{k} q_i \times Entropy(D|a_i),$$
where A has k values, and q_i is the probability that the value of A is a_i. Then, Entropy(D) − Entropy(D|A) is the information gain when classifying D using A; the larger this value, the more likely A is to be selected as a split node for the classifier, and thus the more important A is.
The Gini index follows a similar idea to entropy and is employed in classification and regression trees (CART). Equation (4) shows the Gini index of the dataset D for binary classification, where p is the probability of one class, and 1 − p is the probability of the other class:
$$Gini(D) = 2p(1 - p).$$
Based on the attribute A, the dataset may be divided into several subsets; thus, the Gini index of the dataset based on A can be calculated from the Gini index of every subset, as shown in Equation (5). If the Gini index is low, the uncertainty of the dataset is low, and the related attribute is selected as a split node for the classifier; then, based on this attribute, the next split node is found with the same calculation, until the reduction in the Gini index is too small to split further.
$$Gini(D, A) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} Gini(D_i).$$
Then, we can obtain the importance of each feature from the reduction in entropy or the Gini index it provides for the dataset: the more an attribute reduces the uncertainty of the dataset, the more important it is. Thus, the importance value of a feature A is R_A / R, where R_A is the uncertainty reduction contributed by A, and R is the total reduction in the classifier. Finally, the attributes whose importance values are lower than T_i are removed, where T_i is the importance threshold. To choose a reasonable T_i, we experimented with threshold values from 0.005 to 0.02, and the prediction models performed best when this value was around 0.01. Thus, in this work, T_i was set to 0.01. In this step, the attributes that have little influence on separating the negative and positive samples are filtered out.
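A minimal sketch of this filtering layer, using scikit-learn's impurity-based feature importances, which implement exactly this reduction-in-Gini (or entropy) idea; the data, the attribute names, and the single informative feature are synthetic assumptions for illustration:

```python
# Layer one (attribute filtering): fit a classification tree, read its
# impurity-based feature importances, drop attributes below T_i = 0.01.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def filter_attributes(X, y, names, t_i=0.01, criterion="gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    return [n for n, imp in zip(names, tree.feature_importances_) if imp >= t_i]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0.5).astype(int)          # only attribute 0 matters here
names = ["S1", "S5", "S187", "S197", "S198"]  # illustrative SMART IDs
print(filter_attributes(X, y, names))    # attribute "S1" should survive
```

Switching `criterion` to `"entropy"` gives the information-gain variant described above.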

4.3. Attribute Reduction Based on Correlation Classification

To further reduce attributes, in this section, a reduction method based on correlation classification is proposed. The basic idea is that two correlative features have a similar impact on the prediction results because they have similar distributions or trends. Therefore, if the correlation between them can be determined, only one attribute in a correlative attribute set needs to be selected, and a new set of attributes can be combined that would represent all attributes.
The method is divided into three steps. (1) First, to mine the correlation between attributes, Pearson’s correlation coefficient [20] and Spearman’s correlation coefficient [25] are considered. These are classic statistical methods for evaluating the correlation between vectors and are widely employed in disk failure tasks for feature selection. However, in previous works, they were only used to determine the correlation of a single attribute with the final target; in this work, the correlation between every pair of attributes is considered. The values range from −1 to 1: if the value is positive, the two attributes are positively correlated; otherwise, they are negatively correlated; and the higher the absolute value, the closer the correlation. Figure 4 shows an example of the correlation of five attributes. Table 1 shows the IDs and corresponding names of these five attributes, where the ID is the identifier number of each SMART attribute; they are labeled S192, S193, S240, S241, and S242 in Figure 4. In this paper, Pearson’s correlation coefficient is used by default, because it gave the best prediction results, as described in Section 5. (2) Second, based on the correlation values, the attributes can be classified into several groups: if the absolute correlation value between two attributes is higher than a threshold T_c, the two attributes are grouped together. In this work, T_c was set to 0.6, because 0.6 is a common boundary value between moderate and good correlation [26]. It should be noted that the correlation between every pair of attributes in a group should exceed T_c. Thus, in the example above, the attributes may be classified into three groups: {S192, S240, S242}, {S193}, and {S241}. (3) Finally, one attribute from each group is selected for model construction; for simplicity, we choose the one that correlates best with the final target.
In the example above, finally, three attributes would be selected.
The pseudo-code of the two-layer classification-based feature selection method is shown in Algorithm 1, where Data is the original data with n attributes, X is the set of samples after data processing, Y is the set of labels for healthy and failure samples, T_i is the importance threshold for feature filtering, set to 0.01 by default, and T_c is the correlation threshold used to classify attributes for further feature reduction, set to 0.6 by default.
Algorithm 1 Two-layer Classification-Based Feature Selection
Input: Data = ⟨A_1, A_2, …, A_n⟩, X, Y, T_i, T_c.
Output: A set of selected attributes R.
1: (Layer One)
2: Let R1 be a null array;
3: Input X and Y to obtain the list Im of attribute importance values based on classification trees;
4: for i = 1 … n do
5:   if Im_i ≥ T_i then
6:     Add the ith attribute to the array R1;
7:   end if
8: end for
9: return the attributes after filtering, R1.
10: (Layer Two)
11: Filter Data to keep only the attributes in R1 and obtain Data′; let m = |R1| and R1* = R1;
12: Obtain the correlation coefficients between attributes using Pearson’s or Spearman’s method;
13: Let Classes be a null array;
14: while R1* is not null do
15:   Let C be a null array; pop the first element c_1 of R1* and add it to C;
16:   for i = 1 … |R1*| do
17:     if the correlation value between the ith attribute and c_1 is higher than T_c then
18:       Add the ith attribute to C and remove it from R1*;
19:     end if
20:   end for
21:   Add C to Classes;
22: end while
23: From each sub-array in Classes, add the attribute with the highest correlation with failure to R;
24: return R.
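The layer-two grouping can be sketched in Python as below, following Algorithm 1's greedy pass (each candidate is checked against the group's first element); the correlation matrix is an invented illustration shaped after the five-attribute example in Section 4.3:

```python
# Greedy correlation-based grouping (layer two of Algorithm 1).
import numpy as np

def group_attributes(corr, names, t_c=0.6):
    remaining = list(range(len(names)))
    groups = []
    while remaining:
        head = remaining.pop(0)          # c_1: first element of R1*
        group = [head]
        for j in remaining[:]:           # iterate over a copy while removing
            if abs(corr[head][j]) > t_c:
                group.append(j)
                remaining.remove(j)
        groups.append([names[i] for i in group])
    return groups

names = ["S192", "S193", "S240", "S241", "S242"]
corr = np.array([                        # illustrative symmetric correlations
    [1.0, 0.2, 0.9, 0.1, 0.8],
    [0.2, 1.0, 0.3, 0.4, 0.2],
    [0.9, 0.3, 1.0, 0.2, 0.7],
    [0.1, 0.4, 0.2, 1.0, 0.3],
    [0.8, 0.2, 0.7, 0.3, 1.0],
])
print(group_attributes(corr, names))
# → [['S192', 'S240', 'S242'], ['S193'], ['S241']]
```

One representative per group (the one best correlated with failure, per step 23) would then be kept for modeling.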

4.4. Modeling and Prediction

After the above processes, the models can be built. In this work, several ML/AI models were chosen: naïve Bayes (NB) [27], RF, support vector machine (SVM) [28], GBDT, CNN, and LSTM. All of them have been used for disk failure prediction tasks; among them, RF and LSTM, especially LSTM, have been widely studied and have obtained significant results, as described in Section 2. To train and evaluate the models, the ratio of training to test data was set to 7:3, as in other research, including [9]. Additionally, some important parameters were set for these models: the numbers of estimators for RF and GBDT were 10 and 50, respectively, because there was no significant improvement when using more estimators.
Moreover, a CNN model was built with a two-dimensional convolutional layer, a max-pooling layer, another two-dimensional convolutional layer, and a dense layer; and an LSTM model was built with three LSTM layers, with 64, 128, and 64 units, respectively. For these two models, one of the most important parameters is the number of training epochs: as described in Section 3, sufficient epochs can lead to higher accuracy, but when the number of epochs is too high, the prediction performance decreases because of overfitting [29]. To avoid overfitting, the number of epochs was chosen based on the models’ performance on a validation set; we used 20% of the training set as validation data. Figure 5 shows the change in the training and validation loss of the CNN and LSTM with increasing epochs, where the lower the loss, the better. For the CNN, as shown in Figure 5a, the training and validation losses decreased with increasing epochs until around epoch 60; after that, the validation loss no longer decreased, which reflects overfitting. Thus, 60 was chosen as the number of epochs for the CNN. By the same logic, the number of epochs selected for the LSTM was 270, as shown in Figure 5b. In addition, for these two models, the batch size was 72, the dropout rate was 0.25, and the learning rate was 0.0001. These settings are similar to other research, including [8,9].
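For concreteness, an LSTM of the shape described above could be assembled in Keras as follows; the placement of the dropout layers, the 10-day input window, the number of attributes, and the sigmoid output layer are assumptions beyond what the text states:

```python
# Hedged sketch of the three-layer LSTM (64/128/64 units, dropout 0.25,
# Adam with learning rate 0.0001) for binary failure classification.
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(window=10, n_attrs=8):
    model = keras.Sequential([
        layers.Input(shape=(window, n_attrs)),   # 10 days of SMART attributes
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(0.25),
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.25),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),   # P(failure within 10 days)
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_lstm()
model.summary()
```

Training would then call `model.fit(...)` with `batch_size=72` and the epoch count chosen from the validation loss, as described above.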

5. Results and Analysis

In this section, the experimental setup and methodology are described first. Then the experimental results are reported.

5.1. Experimental Setup

The experiments were implemented on a commercial computer equipped with 3.6 GHz processors and 16 GB of main memory. The models were implemented in Python 3.8, using the TensorFlow 2.4.0 [30], Keras 2.4.3 [31], and Scikit-learn [32] libraries. The disk SMART data were from Backblaze [33], which operates one of the biggest data centers in the world, and we chose the data of two disk models (ST4000DM000 and ST12000NM0007) from 2020-01 to 2020-12 to evaluate our methods. To simplify the presentation, DM000 and NM0007 are used to represent the two types, respectively, in the following description, and the specific information of the datasets used is shown in Table 2. Each disk was classified as either "Healthy" or "Failed". In this table, the numbers of healthy and failed disks are shown in the 4th and 5th columns, respectively; "Total" is the total number of disks, equal to the sum of the numbers of healthy and failed disks; "Sample" indicates the number of SMART records, where each healthy or failed disk has many SMART records, which were the input for training the models; and "Attr." is the original number of SMART attributes in the two datasets.
The model quality was evaluated using the precision, recall, and F1-score, calculated using Equations (6)–(8), respectively. Precision indicates the proportion of true positive samples (TP) among all predicted failures and reflects the accuracy of the classifier in identifying failures. Recall represents the proportion of TP among all actually failed disks and reflects the classifier’s ability to recognize failures. The F1-score is the harmonic mean of precision and recall. All three range from 0 to 1, and the higher, the better. Furthermore, the modeling latency, which consists of the training and prediction latency, was recorded to evaluate the timeliness of the schemes.
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
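Equations (6)–(8) reduce to a few lines of arithmetic; the counts below are made-up numbers for illustration:

```python
# Precision, recall, and F1-score from raw confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 2), round(r, 3), round(f1, 2))  # 0.8 0.667 0.73
```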
To evaluate the performance of the proposed methods, two groups of schemes were evaluated. The first group was for the attribute filtering which was the first layer of feature selection, and the schemes were as follows:
  • Original is the baseline, which uses all attributes for modeling;
  • Pearson is the scheme that selects features based on Pearson’s correlation coefficient, as used in [9];
  • Spearman is the scheme that selects features based on Spearman’s correlation coefficient;
  • J-Index is a feature selection method based on J-index, as used in [8];
  • Entropy is the proposed scheme that uses entropy to select features without further attribute reduction;
  • Gini is the proposed scheme without further attributes reduction, which selects features based on the Gini index.
The second group evaluated the performance of our attribute reduction scheme, the second layer of feature selection. As described in Section 4.3, two methods were used to determine the correlation between attributes: Pearson's correlation coefficient and Spearman's correlation coefficient. Thus, the selection results of the first group were further reduced using each of these two methods. When using Pearson's correlation coefficient, the evaluated schemes were named Org_Prs, Prs_Prs, Sprm_Prs, Jidx_Prs, Ent_Prs, and Gini_Prs. Org_Prs applied Pearson-based attribute reduction to Original; Prs_Prs applied the Pearson method for the first-layer feature filtering and then used Pearson's correlation coefficient between the selected attributes for reduction. By the same logic, Sprm_Prs, Jidx_Prs, Ent_Prs, and Gini_Prs applied Pearson-based attribute reduction in the second layer on top of the corresponding filtering method from the first group.
When using Spearman's correlation coefficient, the evaluated schemes were Org_Sprm, Prs_Sprm, Sprm_Sprm, Jidx_Sprm, Ent_Sprm, and Gini_Sprm, constructed analogously: each applied Spearman-based attribute reduction on top of the corresponding first-group filtering method. Since Entropy and Gini are the proposed attribute filtering methods, Ent_Prs, Gini_Prs, Ent_Sprm, and Gini_Sprm are our fully proposed schemes. The name of each second-group scheme is the name of the first-group scheme plus the suffix _Prs or _Sprm; in the following, the second group is therefore denoted Group2_Prs or Group2_Sprm for simplicity.
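The second layer described above, which groups strongly correlated attributes and keeps one representative per group, can be sketched as follows. The 0.8 threshold, the toy correlation matrix, and the importance scores are assumptions for illustration, not values from the paper:

```python
# Second-layer sketch: drop attributes that are highly correlated with a
# more important attribute that has already been kept.
import numpy as np

def reduce_attributes(corr, importance, threshold=0.8):
    """corr: (d, d) correlation matrix; importance: per-attribute scores."""
    d = corr.shape[0]
    order = sorted(range(d), key=lambda j: -importance[j])  # best first
    kept, dropped = [], set()
    for j in order:
        if j in dropped:
            continue
        kept.append(j)
        for k in order:
            if k != j and k not in dropped and abs(corr[j, k]) > threshold:
                dropped.add(k)  # highly correlated with a kept attribute
    return sorted(kept)

corr = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
print(reduce_attributes(corr, importance=[0.5, 0.7, 0.3]))  # [1, 2]
```

Attributes 0 and 1 are correlated above the threshold, so only the more important one (attribute 1) survives, together with the uncorrelated attribute 2.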

5.2. Results

5.2.1. Results after Feature Selection

Table 3 presents the number of attributes retained by each scheme. With the parameters set in this paper, the Gini scheme selected 13 and 12 attributes for DM000 and NM0007, respectively. For a fair comparison, Pearson, Spearman, and J-Index therefore kept the top 13 and 12 attributes with the highest coefficient values for DM000 and NM0007, respectively. Entropy used the same importance threshold as Gini, although it retained more attributes than Gini for NM0007. In the last two rows of the table, the attributes selected by the first-group methods were further reduced using Pearson's correlation coefficient and Spearman's correlation coefficient, respectively.
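For intuition, the first-layer importance idea can be approximated by scoring each attribute with the Gini impurity reduction of its best single threshold split on the failure label, then dropping attributes below a chosen importance threshold. This simplified sketch is not the authors' exact implementation, and the toy data are assumptions:

```python
# Score an attribute by the best achievable Gini impurity reduction of a
# single threshold split, in the spirit of decision-tree node selection.
import numpy as np

def gini(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def gini_gain(x, y):
    best = 0.0
    for t in np.unique(x)[:-1]:              # candidate split points
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        gain = gini(y) - (w * gini(left) + (1 - w) * gini(right))
        best = max(best, gain)
    return best

x_good = np.array([1, 2, 3, 10, 11, 12])     # separates the classes cleanly
x_bad = np.array([5, 1, 9, 2, 8, 3])         # uninformative ordering
y = np.array([0, 0, 0, 1, 1, 1])
print(round(gini_gain(x_good, y), 2), round(gini_gain(x_bad, y), 2))  # 0.5 0.1
```

An attribute like `x_good` would pass the filter, while a weakly informative one like `x_bad` would be removed; replacing the Gini function with an entropy function gives the Entropy variant.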

5.2.2. Prediction Quality

Table 4 shows the prediction quality of all schemes under different ML/AI models. The metrics were precision, recall, and F1-score, as described above, and in this table, they are represented by Pre, Re, and F1, respectively. In each row, Group1 represents the results for the related scheme in the first evaluated group. Group2_Prs and Group2_Sprm are for the second group, where Group2_Prs represents the results after further reduction based on Pearson’s correlation coefficient, and Group2_Sprm represents the results after further reduction based on Spearman’s correlation coefficient. From the results, several conclusions can be drawn:
First, compared with the other schemes in the first evaluated group, Entropy and Gini performed best. For all ML/AI models, their prediction quality was similar to, or even better than, that of Original, and they obtained the best results when using RF and LSTM. This is one of the reasons why we designed our attribute filtering module based on the idea of decision trees. The three other schemes, Pearson, Spearman, and J-Index, showed worse prediction quality than Original for almost all ML/AI models; among them, J-Index performed better than Pearson and Spearman, which behaved similarly to each other. On the one hand, this reflects the effectiveness of decision-tree-based attribute selection and shows that a suitable feature selection method can improve prediction quality; on the other hand, it implies that an unsuitable feature selection method can damage the prediction quality of ML/AI models.
Second, in most cases, the prediction quality after the further attribute reduction was similar to that before it, as can be observed for NB, RF, GBDT, CNN, and LSTM across all schemes. In some cases, the prediction quality even improved after reduction: in DM000, all metrics of Org_Sprm using LSTM were better than those of Original, and in NM0007, all metrics of Org_Prs using LSTM were better than those of Original. In most situations, neither Pearson's nor Spearman's correlation coefficient showed a clear advantage; for some models _Prs performed better, and for others _Sprm did. For example, with SVM, Org_Sprm performed much better than Org_Prs in DM000, whereas Org_Prs performed better in NM0007. However, Gini_Prs obtained the best prediction quality among all schemes when using RF in DM000 (precision, recall, and F1-score up to 1, 0.99, and 0.99, respectively), and it also achieved the best prediction quality when using LSTM in NM0007 (likewise up to 1, 0.99, and 0.99). Thus, the combination of Gini filtering and Pearson-based attribute reduction was the default choice in this paper. This result demonstrates the effectiveness of the proposed attribute reduction method.
Third, across all schemes, RF performed best among the ML/AI models in almost all cases. Under Original, RF performed as well as under Entropy and Gini, which implies that noise attributes have little impact on RF; however, when critical attributes are missing, its prediction quality drops, as observed for Pearson, Spearman, and J-Index. LSTM produced results similar to RF under Entropy and Gini, and it achieved the best results for Gini_Prs, which shows that noise attributes do influence LSTM. Under Pearson and Spearman, LSTM performed much worse than RF: unsuitable attribute selection greatly degraded LSTM, whereas good attribute selection revealed its performance potential. Among the other ML/AI models, NB and SVM performed worst in almost all schemes, with CNN next worst. They also behaved differently on DM000 and NM0007; for example, SVM performed worse under Pearson, Spearman, and J-Index than under Original in DM000, but its quality improved under these schemes in NM0007. The same trend could also be observed for CNN under J-Index.

5.2.3. Modeling Latency

In this section, the modeling latency is divided into training and prediction latency. Figure 6 shows the normalized training latency for all schemes under all ML/AI models in both DM000 and NM0007. The analysis of the results is as follows:
Original showed the highest latency in all situations. In the first evaluated group, for most models, including NB, SVM, and CNN, the training latency decreased as the number of attributes was reduced. For example, for CNN in Figure 6a, Pearson, Spearman, J-Index, Entropy, and Gini reduced the latency by 66%, 67%, 66%, 67%, and 67%, respectively, compared with Original. Comparing against the second evaluated group, the latency of all schemes decreased further after attribute reduction for almost all ML/AI models (except LSTM). For example, in DM000, Org_Prs and Org_Sprm reduced the latency by 20–50% compared with Original; Prs_Prs and Prs_Sprm by 25–49% compared with Pearson; Sprm_Prs and Sprm_Sprm by 10–45% compared with Spearman; Jidx_Prs and Jidx_Sprm by 2–28% compared with J-Index; Ent_Prs and Ent_Sprm by 1–39% compared with Entropy; and Gini_Prs and Gini_Sprm by 13–31% compared with Gini. For NM0007, the trend was similar to that for DM000. Although for some ML/AI models the first-group results showed no clear trend as the number of attributes decreased, the latency did decrease with the number of attributes under our attribute reduction method. On the one hand, this implies that the number of attributes indeed correlates with the modeling latency; on the other hand, it demonstrates the effectiveness of our attribute reduction method. Overall, the proposed scheme Gini_Prs reduced the training latency by up to 75% compared with Original while maintaining relatively high prediction quality.
For the tree-based ML/AI models, RF and GBDT, Pearson and Spearman always showed the lowest latency, with J-Index second lowest. The reason is that, for tree-based models, the modeling latency is closely related to the complexity of the tree structures: if the Gini index or entropy reduction offered by certain attributes is too low, those attributes are not selected as split nodes early, which reduces the computation and yields simpler trees. However, compared with the other schemes, Pearson, Spearman, and J-Index showed lower prediction quality, owing to the lack of some critical attributes.
For LSTM, reducing attributes yielded a relatively small reduction in training latency compared with the other ML/AI models. In the first evaluated group, Pearson, Spearman, J-Index, Entropy, and Gini reduced the training latency by 1–12% compared with Original. In the second group, DM000 and NM0007 showed different trends. In DM000, Org_Prs and Org_Sprm reduced the training latency by 25% and 17% compared with Original; Prs_Prs and Prs_Sprm by around 5% compared with Original; and Ent_Prs and Ent_Sprm by around 7% compared with Entropy. In NM0007, Org_Prs and Org_Sprm reduced the training latency by 13% and 8% compared with Original; Jidx_Prs and Jidx_Sprm by around 9% compared with J-Index; and Gini_Prs by 3% compared with Gini. For the other schemes, the decrease was below 2%, and in some cases the latency even increased slightly. The reasons may relate to the LSTM structure and the system architecture used: the number of attributes was already small, so it may not be the main bottleneck of LSTM modeling.
Figure 7 shows the normalized prediction latency for all schemes under all ML/AI models in both DM000 and NM0007. In all situations, Original was the slowest. In most cases, the trend matched that of the training latency: prediction latency decreased as the number of attributes decreased. For the first group in DM000, with CNN, Pearson, Spearman, J-Index, Entropy, and Gini reduced the prediction latency by 60%, 61%, 56%, 63%, and 61%, respectively. In the second group, a few cases showed a different trend; for example, Prs_Prs and Prs_Sprm increased the latency by around 3% compared with Pearson when using SVM in NM0007, while reducing it by 5% in DM000. In most cases, however, the trend was the same as for training. Overall, the proposed scheme Gini_Prs reduced the prediction latency by up to 83% compared with Original, demonstrating its high efficiency. For LSTM, the prediction latency was similar across all schemes, each reducing the latency by only 1–5% compared with Original; as with training, the number of attributes was small in all schemes, so the improvement was limited.
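Training and prediction latency of the kind reported here can be measured with a simple wall-clock timing pattern; the toy stand-in model below is an assumption for illustration only, not the paper's measurement harness:

```python
# Measure training and prediction latency with a monotonic clock.
import time

def train(samples):                  # stand-in for model.fit(...)
    return sum(samples) / len(samples)

def predict(model, samples):         # stand-in for model.predict(...)
    return [s > model for s in samples]

data = list(range(100_000))

t0 = time.perf_counter()
model = train(data)
train_latency = time.perf_counter() - t0

t0 = time.perf_counter()
preds = predict(model, data)
predict_latency = time.perf_counter() - t0

print(f"train {train_latency:.4f}s, predict {predict_latency:.4f}s")
```

`time.perf_counter()` is preferred over `time.time()` for interval measurement because it is monotonic and has the highest available resolution.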

5.2.4. Computation Latency

Table 5 presents the computation latency of all selection schemes for DM000 and NM0007. First, within the first evaluated group, Spearman always showed the highest latency, with J-Index next highest. Although Pearson was among the fastest methods, its prediction quality was significantly worse than that of the decision-tree-based methods, Entropy and Gini; this again indicates the efficiency of the decision-tree-based methods. Second, the schemes using Spearman's correlation coefficient for attribute reduction consistently showed noticeably higher latency than those using Pearson's; from the perspective of efficiency, Pearson's correlation coefficient was therefore employed for attribute reduction by default. Third, Ent_Prs and Gini_Prs had similar computation latency: Gini_Prs was 14% slower than Ent_Prs in DM000 and 6% faster in NM0007. Considering prediction quality as well, Gini_Prs is the recommended choice. Note that, compared with the modeling latency, the overhead of attribute selection is much lower, especially for deep neural networks such as CNN and LSTM. Moreover, once the critical attributes have been selected, they can be reused across disk prediction tasks, so the overhead is negligible.

6. Discussion

Considering both modeling timeliness and disk failure prediction accuracy, this paper designed a two-layer classification-based feature selection scheme. The main idea is to find the most representative features for modeling, balancing modeling timeliness against prediction accuracy. Both the prediction accuracy and the modeling timeliness improved after feature reduction, demonstrating the effectiveness of the proposed design. This work has some limitations. The dataset used in the experiments was not very large, and the total number of original features was low; we believe that the benefit of our design would be more pronounced on higher-dimensional data. Additionally, this work considered an offline disk failure prediction scheme, and the design of an online failure prediction scheme would differ. Furthermore, the effectiveness was verified only for machine learning modeling on a CPU; other accelerators could be considered. In future work, more kinds of disk failure-related data and other machine learning accelerator architectures could be considered in the design, and other feature selection algorithms could be explored for an online disk prediction scheme.

7. Conclusions

In this paper, to improve both the prediction quality and the modeling timeliness of ML/AI-based hard disk failure prediction, a two-layer classification-based feature selection scheme was proposed. First, it filters out unimportant attributes using a decision-tree-based method. Then, it classifies the remaining attributes into groups based on the correlation between them. Finally, it selects the most critical attribute from each group for training the models. In the evaluation, the proposed scheme improved the quality of the models; based on the Gini index and Pearson's correlation coefficient, it obtained the best prediction results using LSTM and RF. Meanwhile, the proposed scheme significantly reduced the modeling latency, improving the timeliness of the disk failure prediction task.

Author Contributions

Conceptualization, H.W., Q.Z., E.H.-M.S., R.X. and Y.S.; methodology, H.W. and Q.Z.; validation, H.W.; formal analysis, H.W.; investigation, H.W.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, H.W., Q.Z., R.X. and Y.S.; visualization, H.W.; supervision, Q.Z. and E.H.-M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by NSFC 61972154 and Shanghai Science and Technology Commission Project 20511101600.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: https://www.backblaze.com/ (accessed on 2 February 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, G.; Zhang, L.; Xu, W. What can we learn from four years of data center hardware failures? In Proceedings of the DSN, Denver, CO, USA, 26–29 June 2017; pp. 25–36.
  2. Ghemawat, S.; Gobioff, H.; Leung, S.T. The Google file system. In Proceedings of the SOSP, New York, NY, USA, 19–22 October 2003; pp. 29–43.
  3. Huang, C.; Simitci, H.; Xu, Y.; Ogus, A.; Calder, B.; Gopalan, P.; Li, J.; Yekhanin, S. Erasure coding in Windows Azure storage. In Proceedings of the ATC, Boston, MA, USA, 13–15 June 2012; pp. 15–26.
  4. Patterson, D.A.; Gibson, G.; Katz, R.H. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the SIGMOD, Chicago, IL, USA, 1–3 June 1988; pp. 109–116.
  5. Li, J.; Stones, R.J.; Wang, G.; Li, Z.; Liu, X.; Xiao, K. Being accurate is not enough: New metrics for disk failure prediction. In Proceedings of the SRDS, Budapest, Hungary, 26–29 September 2016; pp. 71–80.
  6. Allen, B. Monitoring hard disks with SMART. Linux J. 2004, 74–77.
  7. Shen, J.; Wan, J.; Lim, S.J.; Yu, L. Random-forest-based failure prediction for hard disk drives. Int. J. Distrib. Sens. Netw. 2018, 14, 1550147718806480.
  8. Lu, S.; Luo, B.; Patel, T.; Yao, Y.; Tiwari, D.; Shi, W. Making disk failure predictions smarter! In Proceedings of the FAST, Santa Clara, CA, USA, 24–27 February 2020; pp. 151–167.
  9. Zhang, J.; Huang, P.; Zhou, K.; Xie, M.; Schelter, S. HDDse: Enabling High-Dimensional Disk State Embedding for Generic Failure Detection System of Heterogeneous Disks in Large Data Centers. In Proceedings of the ATC, Nha Trang, Vietnam, 8–10 October 2020; pp. 111–126.
  10. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  11. Murray, J.F.; Hughes, G.F.; Kreutz-Delgado, K.; Schuurmans, D. Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application. J. Mach. Learn. Res. 2005, 6, 783–816.
  12. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  13. Mahdisoltani, F.; Stefanovici, I.; Schroeder, B. Improving storage system reliability with proactive error prediction. In Proceedings of the ATC, Santa Clara, CA, USA, 12–14 July 2017; pp. 391–402.
  14. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
  15. Xu, C.; Wang, G.; Liu, X.; Guo, D.; Liu, T.Y. Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 2016, 65, 3502–3508.
  16. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  17. Han, S.; Lee, P.P.; Shen, Z.; He, C.; Liu, Y.; Huang, T. Toward adaptive disk failure prediction via stream mining. In Proceedings of the 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), Singapore, 29 November–1 December 2020; pp. 628–638.
  18. Fluss, R.; Faraggi, D.; Reiser, B. Estimation of the Youden Index and its associated cutoff point. Biom. J. 2005, 47, 458–472.
  19. Jiang, T.; Zeng, J.; Zhou, K.; Huang, P.; Yang, T. Lifelong disk failure prediction via GAN-based anomaly detection. In Proceedings of the 2019 IEEE 37th International Conference on Computer Design (ICCD), Abu Dhabi, United Arab Emirates, 17–20 November 2019; pp. 199–207.
  20. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson Correlation Coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4.
  21. Chaves, I.C.; de Paula, M.R.P.; Leite, L.G.; Queiroz, L.P.; Gomes, J.P.P.; Machado, J.C. Banhfap: A Bayesian network based failure prediction approach for hard disk drives. In Proceedings of the 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), Recife, Brazil, 9–12 October 2016; pp. 427–432.
  22. Anantharaman, P.; Qiao, M.; Jadav, D. Large scale predictive analytics for hard disk remaining useful life estimation. In Proceedings of the 2018 IEEE International Congress on Big Data (BigData Congress), San Francisco, CA, USA, 2–7 July 2018; pp. 251–254.
  23. Botezatu, M.M.; Giurgiu, I.; Bogojeska, J.; Wiesmann, D. Predicting disk replacement towards reliable data centers. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 39–48.
  24. Zhang, J.; Zhou, K.; Huang, P.; He, X.; Xiao, Z.; Cheng, B.; Ji, Y.; Wang, Y. Transfer learning based failure prediction for minority disks in large data centers of heterogeneous disk systems. In Proceedings of the 48th International Conference on Parallel Processing, Kyoto, Japan, 5–8 August 2019; pp. 1–10.
  25. Myers, L.; Sirois, M.J. Spearman correlation coefficients, differences between. In Encyclopedia of Statistical Sciences; John Wiley & Sons: Hoboken, NJ, USA, 2004.
  26. Schober, P.; Boer, C.; Schwarte, L.A. Correlation coefficients: Appropriate use and interpretation. Anesth. Analg. 2018, 126, 1763–1768.
  27. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46.
  28. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567.
  29. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
  30. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the OSDI, Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
  31. Gulli, A.; Pal, S. Deep Learning with Keras; Packt Publishing Ltd.: Birmingham, UK, 2017.
  32. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  33. Backblaze. Hard Drive Data and Stats. 2021. Available online: https://www.backblaze.com/b2/hard-drive-test-data.html (accessed on 2 February 2023).
Figure 1. Observation 1: (a) The loss of a CNN model decreases with the increase in epochs; (b) The latency increases with the increase in epochs.
Figure 2. Observation 2: (a) The results of model quality with the increase in the number of attributes; (b) The modeling latency.
Figure 3. Overview of the proposed scheme.
Figure 4. An example correlation matrix.
Figure 5. Training and validation loss with increasing epochs for a disk type: (a) CNN; (b) LSTM.
Figure 6. Normalized training latency for the two models: (a) the latency for DM000; (b) the latency for NM0007.
Figure 7. Normalized prediction latency for the two models: (a) the latency for DM000; (b) the latency for NM0007.
Table 1. Attributes in the example correlation matrix.

| ID | S.M.A.R.T. Attribute Name |
|-----|---------------------------|
| 192 | Power-Off Retract Cycles |
| 193 | Load/Unload Cycles |
| 240 | Head Flying Hours |
| 241 | Total LBAs Written |
| 242 | Total LBAs Read |
Table 2. Number of Disks, Samples, and Attributes.

| Model | Duration | Total | Healthy | Failed | Healthy Sample | Failure Sample | Attr. |
|---|---|---|---|---|---|---|---|
| ST4000DM000 | 12 months | 19,241 | 18,972 | 269 | 24,800 | 2344 | 36 |
| ST12000NM0007 | 12 months | 37,255 | 36,916 | 339 | 27,300 | 2413 | 33 |
Table 3. Number of Attributes after Feature Selection.

| Scheme | DM000 | NM0007 |
|---|---|---|
| Original | 36 | 33 |
| Pearson | 13 | 12 |
| Spearman | 13 | 12 |
| J-Index | 13 | 12 |
| Entropy | 13 | 16 |
| Gini | 13 | 12 |
| Org_Prs | 21 | 19 |
| Prs_Prs | 7 | 8 |
| Sprm_Prs | 7 | 9 |
| Jidx_Prs | 7 | 9 |
| Ent_Prs | 10 | 9 |
| Gini_Prs | 10 | 8 |
| Org_Sprm | 21 | 19 |
| Prs_Sprm | 8 | 8 |
| Sprm_Sprm | 8 | 10 |
| Jidx_Sprm | 8 | 10 |
| Ent_Sprm | 10 | 9 |
| Gini_Sprm | 10 | 8 |
Table 4. Disk Failure Prediction Quality (each cell gives Pre/Re/F1).

| Scheme | Model | DM000 Group1 | DM000 Group2_Prs | DM000 Group2_Sprm | NM0007 Group1 | NM0007 Group2_Prs | NM0007 Group2_Sprm |
|---|---|---|---|---|---|---|---|
| Original | NB | 0.69/0.38/0.49 | 0.7/0.37/0.49 | 0.72/0.37/0.48 | 0.79/0.31/0.45 | 0.81/0.27/0.41 | 0.8/0.3/0.43 |
| Original | RF | 0.99/0.98/0.99 | 0.99/0.99/0.99 | 1/0.98/0.99 | 1/0.97/0.98 | 0.99/0.94/0.97 | 1/0.95/0.97 |
| Original | SVM | 0.91/0.54/0.68 | 0.89/0.09/0.17 | 0.94/0.46/0.62 | 0.84/0.31/0.45 | 0.83/0.29/0.43 | 0.77/0.24/0.36 |
| Original | GBDT | 0.97/0.79/0.87 | 0.95/0.76/0.85 | 0.97/0.77/0.86 | 0.93/0.71/0.81 | 0.92/0.61/0.73 | 0.91/0.68/0.78 |
| Original | CNN | 0.9/0.62/0.74 | 0.8/0.54/0.65 | 0.77/0.54/0.64 | 0.82/0.35/0.49 | 0.86/0.36/0.51 | 0.74/0.33/0.46 |
| Original | LSTM | 0.98/0.9/0.94 | 0.96/0.95/0.96 | 0.99/0.91/0.95 | 0.91/0.92/0.92 | 0.94/0.92/0.93 | 0.96/0.81/0.91 |
| Pearson | NB | 0.78/0.36/0.49 | 0.78/0.35/0.48 | 0.78/0.35/0.48 | 0.78/0.3/0.43 | 0.81/0.26/0.4 | 0.79/0.29/0.42 |
| Pearson | RF | 0.95/0.46/0.62 | 0.95/0.46/0.62 | 0.95/0.47/0.63 | 0.95/0.72/0.82 | 0.89/0.57/0.69 | 0.94/0.72/0.82 |
| Pearson | SVM | 0.78/0.24/0.36 | 0.75/0.12/0.21 | 0.77/0.12/0.21 | 0.77/0.32/0.46 | 0.82/0.31/0.45 | 0.73/0.25/0.37 |
| Pearson | GBDT | 0.92/0.43/0.59 | 0.9/0.42/0.57 | 0.92/0.43/0.59 | 0.9/0.62/0.74 | 0.87/0.54/0.67 | 0.9/0.62/0.74 |
| Pearson | CNN | 0.72/0.34/0.47 | 0.77/0.24/0.36 | 0.8/0.2/0.32 | 0.72/0.4/0.51 | 0.77/0.36/0.49 | 0.68/0.37/0.48 |
| Pearson | LSTM | 0.94/0.36/0.52 | 0.87/0.37/0.52 | 0.87/0.37/0.52 | 0.95/0.48/0.64 | 0.88/0.46/0.6 | 0.88/0.46/0.6 |
| Spearman | NB | 0.73/0.35/0.47 | 0.74/0.34/0.47 | 0.74/0.34/0.47 | 0.78/0.3/0.43 | 0.81/0.26/0.4 | 0.79/0.29/0.42 |
| Spearman | RF | 0.99/0.49/0.66 | 0.98/0.48/0.64 | 0.96/0.5/0.65 | 0.96/0.71/0.81 | 0.91/0.56/0.7 | 0.95/0.72/0.82 |
| Spearman | SVM | 0.78/0.23/0.36 | 0.74/0.12/0.2 | 0.76/0.11/0.2 | 0.77/0.32/0.46 | 0.82/0.31/0.45 | 0.73/0.25/0.37 |
| Spearman | GBDT | 0.95/0.42/0.59 | 0.93/0.41/0.57 | 0.95/0.42/0.59 | 0.89/0.62/0.73 | 0.87/0.54/0.67 | 0.9/0.62/0.74 |
| Spearman | CNN | 0.71/0.35/0.47 | 0.77/0.29/0.42 | 0.77/0.24/0.36 | 0.77/0.33/0.46 | 0.73/0.38/0.5 | 0.7/0.33/0.45 |
| Spearman | LSTM | 0.95/0.23/0.37 | 0.68/0.29/0.41 | 0.78/0.29/0.42 | 0.87/0.51/0.65 | 0.91/0.45/0.61 | 0.85/0.54/0.66 |
| J-Index | NB | 0.62/0.36/0.46 | 0.63/0.35/0.45 | 0.71/0.3/0.43 | 0.77/0.32/0.45 | 0.77/0.32/0.46 | 0.75/0.3/0.43 |
| J-Index | RF | 0.98/0.93/0.96 | 0.98/0.91/0.95 | 0.97/0.89/0.93 | 0.98/0.88/0.93 | 0.97/0.84/0.9 | 0.98/0.84/0.91 |
| J-Index | SVM | 0.81/0.3/0.44 | 0.77/0.15/0.25 | 0.79/0.15/0.25 | 0.88/0.34/0.49 | 0.88/0.35/0.5 | 0.8/0.22/0.35 |
| J-Index | GBDT | 0.88/0.68/0.76 | 0.87/0.68/0.76 | 0.87/0.68/0.76 | 0.89/0.69/0.78 | 0.87/0.69/0.77 | 0.9/0.68/0.78 |
| J-Index | CNN | 0.79/0.47/0.59 | 0.54/0.65/0.59 | 0.57/0.62/0.59 | 0.66/0.54/0.6 | 0.71/0.44/0.54 | 0.54/0.5/0.52 |
| J-Index | LSTM | 0.94/0.83/0.88 | 0.94/0.81/0.87 | 0.88/0.75/0.81 | 0.83/0.73/0.78 | 0.88/0.73/0.8 | 0.85/0.67/0.75 |
| Entropy | NB | 0.63/0.37/0.47 | 0.7/0.37/0.49 | 0.71/0.36/0.47 | 0.72/0.43/0.54 | 0.78/0.31/0.45 | 0.75/0.32/0.44 |
| Entropy | RF | 1/0.98/0.99 | 0.99/0.98/0.99 | 0.99/0.99/0.99 | 1/0.97/0.98 | 1/0.97/0.98 | 1/0.97/0.98 |
| Entropy | SVM | 0.94/0.54/0.69 | 0.82/0.21/0.33 | 0.93/0.54/0.68 | 0.91/0.42/0.57 | 0.85/0.33/0.47 | 0.8/0.26/0.4 |
| Entropy | GBDT | 0.96/0.78/0.86 | 0.95/0.75/0.84 | 0.96/0.77/0.85 | 0.93/0.72/0.81 | 0.91/0.6/0.72 | 0.92/0.65/0.76 |
| Entropy | CNN | 0.78/0.76/0.77 | 0.86/0.7/0.77 | 0.86/0.69/0.76 | 0.85/0.39/0.54 | 0.36/0.56/0.44 | 0.63/0.44/0.52 |
| Entropy | LSTM | 0.99/0.98/0.99 | 0.99/0.96/0.97 | 0.96/0.85/0.9 | 0.9/0.92/0.91 | 0.95/0.88/0.91 | 0.99/0.94/0.96 |
| Gini | NB | 0.7/0.37/0.49 | 0.76/0.35/0.48 | 0.77/0.35/0.48 | 0.75/0.4/0.52 | 0.78/0.31/0.44 | 0.75/0.32/0.44 |
| Gini | RF | 0.99/0.99/0.99 | 1/0.99/0.99 | 0.99/0.97/0.98 | 1/0.97/0.98 | 1/0.95/0.97 | 1/0.96/0.98 |
| Gini | SVM | 0.94/0.51/0.66 | 0.92/0.52/0.67 | 0.92/0.51/0.66 | 0.89/0.41/0.56 | 0.84/0.31/0.45 | 0.78/0.24/0.37 |
| Gini | GBDT | 0.96/0.78/0.86 | 0.96/0.77/0.85 | 0.97/0.96/0.85 | 0.91/0.71/0.8 | 0.92/0.6/0.73 | 0.92/0.68/0.78 |
| Gini | CNN | 0.81/0.71/0.76 | 0.87/0.73/0.79 | 0.83/0.78/0.8 | 0.36/0.72/0.48 | 0.77/0.42/0.54 | 0.76/0.36/0.49 |
| Gini | LSTM | 1/0.97/0.98 | 0.99/0.98/0.98 | 1/0.96/0.98 | 1/0.98/0.99 | 1/0.99/0.99 | 0.98/0.92/0.95 |
Table 5. Latency of the Selection Schemes.

| Scheme | DM000 | NM0007 |
|---|---|---|
| Original | – | – |
| Pearson | 2.4 s | 1.6 s |
| Spearman | 7.4 s | 5.5 s |
| J-Index | 5.3 s | 4.2 s |
| Entropy | 1.6 s | 2.2 s |
| Gini | 2.8 s | 3.8 s |
| Org_Prs | 2.2 s | 1.6 s |
| Prs_Prs | 5.4 s | 4.1 s |
| Sprm_Prs | 10.5 s | 8.3 s |
| Jidx_Prs | 8.5 s | 7.1 s |
| Ent_Prs | 5.6 s | 6.7 s |
| Gini_Prs | 6.4 s | 6.3 s |
| Org_Sprm | 7.3 s | 5.2 s |
| Prs_Sprm | 13.1 s | 10.8 s |
| Sprm_Sprm | 12.6 s | 11.3 s |
| Jidx_Sprm | 10.6 s | 13.1 s |
| Ent_Sprm | 14.1 s | 16.2 s |
| Gini_Sprm | 14.4 s | 12.6 s |

Share and Cite

MDPI and ACS Style

Wang, H.; Zhuge, Q.; Sha, E.H.-M.; Xu, R.; Song, Y. Optimizing Efficiency of Machine Learning Based Hard Disk Failure Prediction by Two-Layer Classification-Based Feature Selection. Appl. Sci. 2023, 13, 7544. https://doi.org/10.3390/app13137544
