Evaluating Machine Learning Classification Using Sorted Missing Percentage Technique Based on Missing Data

Missing data are common in industrial sensor readings owing to system updates and unequal radio-frequency periods. Existing methods that address missing data through imputation may not always be appropriate. This study presents a sorted missing percentage technique for filtering attributes when building machine learning classification models from sensor readings with missing data. Signal detection theory was employed to evaluate the distinguishing ability of the resulting models. To evaluate its performance, the proposed technique was applied to a publicly available air pressure system dataset, which was then used to build several classifiers. The experimental results indicated that the proposed technique allowed a logistic regression model to achieve the best accuracy score (99.56%) and a better distinguishing ability (response bias of 0.0013, adjusted response bias of 0.0044, and decision criterion of −1.8994) compared with the methods applied to the same dataset in binary classification studies published between 2016 and March 2019, wherein attributes with more than 20% missing data were filtered out. The proposed technique is suitable for industrial sensor data analysis and can be applied to scenarios dealing with missing data owing to unequal radio-frequency periods or a system being updated with new fields.


Introduction
Industrial-level sensors are employed as standalone devices (e.g., to measure temperature and humidity independently), packed modules (e.g., to measure temperature and humidity synchronously), or even bundled into target systems. With the rapid development of the Internet of Things, the production and procurement costs of sensors now satisfy the economic requirements of customers. In addition, multiple sensors can be configured within a single existing structure [1]. Daily collected data are useful for enterprises, particularly for meeting set objectives, e.g., preventive maintenance plans. Compared with sensors used in the social sciences and medical fields, industrial sensors generate more missing data during data collection. The occurrences of missing data in industrial applications are primarily due to system updates and unequal radio-frequency periods, both of which are expected situations.
In real applications, determining how to address missing data is a common challenge prior to implementing meaningful models for decision-making. Generally, analysts determine possible reasons for missing data based on their industry background and work experience, and then take appropriate steps to address them. This study considers the following two general scenarios: data are missing completely at random or missing at random [2,3], and appropriate strategies for dealing with them are selected accordingly.
The APS dataset included two classes (NEG and POS). The calculated imbalance ratio was 59 in the training set and 41 in the test set. This public dataset has been used for empirical analyses on various topics, which can be divided into two scopes by publication time. Previous studies from 2016 to March 2019 [10][11][12][13][14][15][16] used many dedicated methods of data imputation and machine learning algorithms to solve the high class imbalance and missing data problems [17][18][19][20][21][22][23][24]. However, these studies focused only on the training set; they did not consider whether the assumption of a consistent distribution of missing data across the training and test sets is satisfied. After March 2019, several innovative articles related to artificial intelligence applications and data generation were published. For example, Sjöblom (2019) introduced genetic algorithms as an evolutionary strategy in statistical learning (XGBoost) to automate the optimization procedures [25]. Škrlj et al. (2020) employed a self-attention network to estimate feature importance and then build prediction models [26]; the APS data was one of their nine experimental datasets. Regarding the limited data caused by missing values, and in contrast to existing data generation techniques, Ranasinghe and Parlikad (2019) proposed a methodology capable of generating new and realistic failure data samples [27].
This paper proposes a sorted missing percentage technique for filtering attributes when building machine learning classification models from sensor readings with missing data. Signal detection theory (SDT) was then employed to evaluate the distinguishing ability of the resulting models. The numerical results were compared with those of the studies published from 2016 to March 2019.
The proposed technique applies sorted missing percentages to filter attributes and then determines the control (CTRL) and experimental (EXP) attribute groups during the data preprocessing stage of the machine learning modeling pipeline. Figure 1 summarizes the workflows for machine learning modeling applied in previous (as-is) studies and in the current study. The workflows include the following stages: data acquisition, data preprocessing, modeling, and evaluation. In the evaluation stage, we employed more indicators than the previous studies to evaluate model performance and distinguishing ability.


Sorted Missing Percentage and Recommended Steps
The concept of sorted missing percentages was proposed by Biteus and Lindgren [14], who applied a cutoff value of 10% in their study on the APS data [9]. This study used 10% as the cutoff value to construct a control group of attributes and then explored different cutoff values to construct experimental groups of attributes. Filtering can be called conservative if the cutoff value on the missing percentage is raised above this baseline, so that fewer attributes are filtered out. For example, by applying a cutoff value of 20% or 30%, fewer attributes are removed compared with the control group. In contrast, filtering can be called liberal if the cutoff value is lowered, so that more attributes are filtered out. For example, by applying a cutoff value of 5%, more attributes are removed compared with the control group.
Note that this method does not alter the original data. Furthermore, the included instances can be used more flexibly when the maximum attribute space is established under different filtering criteria. This can be considered an advantage because, under the premise of the same maximum attribute space, the assumption of a consistent distribution between the training and test sets can be satisfied. Through comparison, it was established that the attribute space of the group producing the best model performance is also the best configurable attribute space for the considered dataset. The recommended steps for implementing the sorted missing percentage method for the groups can be summarized as follows.

• Step 1: Sort attributes in descending order of missing percentages.
• Step 2: Set a cutoff value of 10% and select all attributes with less than 10% missing values into the control group.
• Step 3: Set different missing percentages as thresholds and use them to define the experimental groups.
• Step 4: Compare the control and experimental groups. Select the group that yields the best model performance and define the best configurable attribute space.
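The four steps above can be sketched with pandas; this is a minimal illustration on a toy frame (the column names and cutoffs are hypothetical, not the APS attributes):

```python
import pandas as pd

def sorted_missing_groups(df, cutoffs=(0.05, 0.10, 0.20, 0.30)):
    # Step 1: sort attributes in descending order of missing percentage.
    miss = df.isna().mean().sort_values(ascending=False)
    # Steps 2-3: for each cutoff, keep attributes below that missing percentage
    # (the 10% cutoff yields the control group; the others, experimental groups).
    return miss, {c: miss[miss < c].index.tolist() for c in cutoffs}

# Toy frame: 'a' is 50% missing, 'b' is 25% missing, 'c' is complete.
df = pd.DataFrame({"a": [1, None, None, 4],
                   "b": [1, 2, 3, None],
                   "c": [1, 2, 3, 4]})
miss, groups = sorted_missing_groups(df)
# groups[0.30] keeps 'b' and 'c'; groups[0.10] keeps only 'c'
```

Step 4, the comparison, happens downstream once a model has been trained on each attribute group.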

Machine Learning Classifiers
This study used five classifiers, which are among the most widely used and mature algorithms currently available. Each classifier has its own characteristics and limitations with respect to implementation. According to the prediction target and computational approach, all of these classifiers are binary (negative or positive) and discriminative. In terms of the outcomes they produce, logistic regression (LR) is a probabilistic classifier, whereas the other four are deterministic classifiers. Focusing on attribute combinations and the learning process, LR and support vector machines (SVM) can be considered linear and global classifiers. In contrast, k-nearest neighbors, random forests, and stochastic gradient boosting trees are nonlinear, iterative-computing, and local classifiers, respectively [28]. To optimally interpret the provided data, the final classifier should be selected from a list of alternatives, each providing consistent model performance for both the training set (for modeling) and the test set (for verification). A classifier is determined to be the best if it provides the best model performance and the best distinguishing ability.

Logistic Regression (LR)
LR [29][30][31] is based on the regression model and is frequently used for binary classification problems; for example, the negative class is assigned label 0, whereas the positive class is assigned label 1. The general form π(x) and its logit transformation g(x) are given as follows:

π(x) = exp(g(x)) / (1 + exp(g(x))),

g(x) = ln[π(x) / (1 − π(x))] = β0 + β1x1 + β2x2 + . . . + βnxn,

where (x1, x2, . . . , xn) is the vector of predictors and βi (i = 1, 2, . . . , n) are the regression coefficients of the predictors. Here, the output class labels were calculated from the logit-transformed probabilities.
If the probability is close to 1, then the positive class is assigned.
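As a minimal numerical sketch of the logistic form above (the coefficients here are hypothetical, chosen only for illustration):

```python
import numpy as np

def logit_prob(x, beta0, beta):
    # g(x) = beta0 + beta1*x1 + ... + betan*xn; pi(x) = exp(g) / (1 + exp(g))
    g = beta0 + float(np.dot(beta, x))
    return 1.0 / (1.0 + np.exp(-g))

# Hypothetical coefficients for a two-predictor model.
p = logit_prob(np.array([1.0, 2.0]), beta0=-1.0, beta=np.array([0.5, 0.25]))
label = 1 if p >= 0.5 else 0  # assign the positive class when pi(x) is high
```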

Support Vector Machine (SVM)
SVMs [18,29,[31][32][33][34] are based on the concept of building decision planes to define decision boundaries. The decision plane is a hyperplane that separates examples into the most likely categories with the maximum distance between them. For a binary classification task, the decision function Y and objective function g(Y) are given as follows:

Y(x) = w·θ(x) + b,

g(Y) = min (1/2)‖w‖² + C Σi ϕi, subject to yi(w·θ(xi) + b) ≥ 1 − ϕi, ϕi ≥ 0,

where C is the capacity constant, w is the vector of coefficients, ϕ is the slack variable, and b is a constant used to handle non-separable data (i.e., inputs). The index i labels the n training examples, and the kernel θ is used to transform data from the input space to the feature space. Larger C values represent a higher degree of error penalization; when using an SVM, it is therefore important to configure parameter C appropriately to avoid overfitting.
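The role of C can be sketched with scikit-learn's SVC on synthetic data; the cluster parameters and C values below are illustrative, not the settings used in the paper:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (40, 2)),   # negative-class cluster
               rng.normal(+1.0, 0.5, (40, 2))])  # positive-class cluster
y = np.array([0] * 40 + [1] * 40)

# Larger C penalizes margin violations more heavily (risking overfitting);
# smaller C allows a softer margin.
soft = SVC(C=0.1, kernel="rbf").fit(X, y)
hard = SVC(C=10.0, kernel="rbf").fit(X, y)
```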

k-Nearest Neighbors (k-NNs)
k-nearest neighbor (k-NN) [10,16,29,31,32,35] is a memory-based reasoning model (also referred to as instance-based learning) that stores a set of examples with known outcomes. Given a new example (i.e., the query point), the task is to classify the outcome through its k neighbors under the selected distance metric, and the outcome is chosen by the majority vote of the k neighbors. A small k value leads to large variance in predictions, whereas a large k value may lead to a high degree of model bias. For binary classification problems, the typical distance weighting W is formed as follows:

W(x, pi) = (1 / D(x, pi)) / Σj=1..k (1 / D(x, pj)),

where x and pi are the query point and a case from the example sample, respectively, and D(x, pi) is the distance between the query point x and case pi. An optimal k value is estimated using the cross-validation technique, which achieves an appropriate compromise between variance and bias and avoids overfitting. The weights over the k neighbors of a new query point x_new sum to 1.
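The normalized inverse-distance weighting above can be sketched in a few lines (the distances are hypothetical):

```python
import numpy as np

def inverse_distance_weights(dists):
    # W(x, p_i) = (1 / D(x, p_i)) / sum_j (1 / D(x, p_j)); the weights sum to 1,
    # so nearer neighbors contribute more to the vote.
    inv = 1.0 / np.asarray(dists, dtype=float)
    return inv / inv.sum()

w = inverse_distance_weights([1.0, 2.0, 4.0])  # distances to k = 3 neighbors
```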

Stochastic Gradient Boosting Trees (GBTs)
A gradient boosting tree (GBT) [11,19,29,31,36,37] is an ensemble learning algorithm that builds a set of classifiers on weighted examples. The classifiers are built sequentially, and each stage focuses on previously misclassified examples to minimize the target loss function. The general form of the ensemble after stage n + 1, y_E^(n+1), is given as follows:

y_E^(n+1)(x) = y_E^(n)(x) + f_(n+1)(x), with f_(n+1) fitted to the negative gradient −∇L(y_i^T, y_E^(n)(x)),

where y_E is the outcome of the ensemble, f_i(x) are the weak learners, L is the loss function, and y_i^T is the target class. With the GBT approach, several characteristics should be considered carefully: all involved classifiers are weak learners, the examples must be normalized before constructing the classifier, and overfitting can occur easily.
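A stochastic GBT can be sketched with scikit-learn's GradientBoostingClassifier on synthetic data; the hyperparameters below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Shallow trees act as weak learners; each stage fits the negative gradient
# of the loss, and subsample < 1.0 makes the boosting stochastic.
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=2,
                                 subsample=0.8, random_state=0).fit(X, y)
```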

Random Forests (RFs)
Random forest (RF) [12,13,29,31,38] is an ensemble learning algorithm that involves a collection of decision trees (i.e., classifiers), grown from independent identically distributed random vectors. The final class is the one with the most votes (bagging). The decision forest DF and voting criterion d_DF(x) are given as follows:

d_DF(x) = argmax_c N_c(x), with N_c(x) = |{j : d_j(x) = c}|,

where d_j is the jth generated tree, c is a decision class, and N_c(x) is the number of votes for classifying sample x ∈ X into class c.
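The majority-vote criterion d_DF(x) = argmax_c N_c(x) reduces to a vote count over the trees' predictions; a minimal sketch (with hypothetical tree outputs):

```python
import numpy as np

def forest_vote(tree_predictions):
    # d_DF(x) = argmax_c N_c(x): the class receiving the most tree votes wins.
    votes = np.bincount(np.asarray(tree_predictions))
    return int(np.argmax(votes))

label = forest_vote([1, 0, 1, 1, 0])  # three of five trees vote for class 1
```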

Indicators for Model Performance
Accuracy, F-measure, and the Matthews correlation coefficient (MCC) [39] are commonly used to explain a model's performance. The receiver operating characteristic (ROC) curve is a two-dimensional graph presenting the relative tradeoff between the true-positive rate on the Y-axis and the false-positive rate on the X-axis [40]. The area under the curve (AUC) is a metric that summarizes the ROC curve [31]. The ROC curve and calculated AUC were used to visualize the predictive accuracy of the selected model. Furthermore, the criteria adopted in SDT [41,42] were used in this paper. SDT is most commonly used in psychophysics and psychology studies and is applied to various tasks, including detection, identification, recognition, and classification. Regardless of the task, the initial focus in these studies is on how to effectively analyze decision-making under uncertainty and bias, i.e., the model's performance. The secondary focus in SDT is determining how much information is obtained during decision-making and whether the obtained information is useful and applicable, i.e., the model's distinguishing ability.
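Accuracy, F-measure, and MCC can be computed directly with scikit-learn; the toy labels below are for illustration only:

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]  # one false positive and one false negative

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
mcc = matthews_corrcoef(y_true, y_pred)  # balanced even under class imbalance
```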
In SDT, β is defined as the ratio of the neural activity produced by signal and noise; it is used to compute the response bias. β_opt is an adjustment of β used to account for changes in the signal and noise probabilities. β_opt,payoff further involves additional information such as the reward for correct events (Value) and the penalty for incorrect events (Cost).
X_c is the decision criterion, serving as a useful indicator of a criterion shift. An X_c value close to 0 indicates an unbiased decision; in other words, all the information was fully used during the decision-making process. A positive X_c represents a type of liberal decision-making, which can be interpreted as information being overused. A negative X_c indicates conservative decision-making, where information is filtered to reduce the risk of mistakes.
Here, X is the decision variable of the binary classification for either S (signal) or N (noise), C is the criterion against which X is compared, d′ is the distance between the midpoints of the signal and noise distributions, Value is a reward for events that are correctly classified, and Cost is a penalty for events that are incorrectly classified.
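Under the standard equal-variance Gaussian SDT model, these indices can be computed from the hit and false-alarm rates. Note that textbooks differ on the sign convention for the criterion, so the parameterization below is a common sketch rather than necessarily the paper's exact formulation:

```python
import math
from scipy.stats import norm

def sdt_indices(hit_rate, fa_rate):
    zh, zf = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = zh - zf             # separation of signal and noise distributions
    c = -(zh + zf) / 2.0          # criterion location (sign conventions vary)
    beta = math.exp(d_prime * c)  # likelihood ratio at the criterion
    return d_prime, c, beta

# Symmetric hit/false-alarm rates imply an unbiased observer: c = 0, beta = 1.
d_prime, c, beta = sdt_indices(0.90, 0.10)
```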

Dataset
The dataset used in this paper was initially provided by Scania CV AB (Sweden). The data were extracted from historical operational data and include information from on-board sensors showing how each truck has been used on average. The target system was the APS, which generates the pressurized air used in various truck functions such as braking and gear changing. There were 171 attributes: 100 single numerical counters, seven histograms with 10 bins each, and one class label. The attributes comprised the class label and anonymized operational data. Each operational attribute has an identifier and a bin id (Identifier_Bin). For example, attribute aa_000 was collected from class a, its operational identifier is aa, and its bin id is 000.
The dataset was divided into training and test sets by Scania experts (60,000 training examples and 16,000 test examples). The class label could be either POS or NEG, wherein POS indicates detected failures related to a specific APS system component and NEG indicates failures not related to any of the APS system components.

Data Profiling and Filtering
In this paper, the control and experimental attribute groups were defined for modeling and comparison (Figure 1). Table 1 lists selected results of the invariant check on attribute cd_000, representing class c, operational identifier cd, and bin id 000. As shown, attribute cd_000 had the same value for the mean, median, and mode, as well as for the minimum and maximum, which implies that all valid data were identical. Therefore, this attribute should be removed from the list of attributes for predictive modeling; including invariant attributes would only increase the computational cost.
The IR also provides a quick way of determining whether the training and test sets are consistently distributed for a given attribute. First, the examples of an attribute are labeled VALID if they contain real values; blank cells are labeled N/A. If the IR of an individual attribute falls within a specified range centered on the overall IR, the assumption of a consistent distribution between the training and test sets is considered valid. Furthermore, a 2 × 2 cross-tabulation was conducted; the calculated Chi-square results clarify whether the missing-data behavior of an attribute is homogeneous between the NEG and POS classes, regardless of whether the examples come from the training or test set. After performing IR checks on all attributes, the training set IR values ranged between 55 and 61, whereas the test set values ranged between 39 and 43, with a single exception for attribute ac_000 (class a, operational identifier ac, bin id 000). As shown in Table 3, the calculated IRs of ac_000 for the training and test sets were 104.3 and 75.9, respectively, which were larger than the overall IRs of 59.0 and 41.7. The calculated Pearson Chi-square and maximum-likelihood Chi-square results (CLASS × ac_000) indicated statistical significance (both p-values were 0.0000).
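The IR check and the 2 × 2 Chi-square test can be sketched as follows; the cross-tabulation counts here are synthetic, not the APS values:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def imbalance_ratio(labels):
    # IR = majority-class count / minority-class count
    counts = pd.Series(labels).value_counts()
    return counts.max() / counts.min()

ir = imbalance_ratio(["NEG"] * 59 + ["POS"])  # mirrors the training-set IR of 59

# Hypothetical CLASS x attribute-status cross-tabulation (VALID vs. N/A counts).
table = np.array([[900, 100],   # NEG: 10% missing
                  [ 60,  40]])  # POS: 40% missing
chi2, p, dof, _ = chi2_contingency(table)
# A small p-value suggests the missing-data behavior differs between classes.
```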
Given these findings, it is recommended to remove attribute ac_000 from the list of predictive modeling attributes. For the APS dataset, Biteus and Lindgren [14] recommended removing attributes with over 10% missing data; however, they did not explain why the criterion for filtering attributes was set to 10%. Following this concept, we created bins by the sorted missing percentages, namely >30%, 20-30%, 10-20%, and 5-10%. Figure 2 shows the bar chart presenting the bins by different missing percentages. Table 4 shows the details of the attributes contained in each bin. As shown, the first three bins had the same attributes and sorting results for both the training and test sets. The fourth bin, 5-10%, had the same attributes but different sorting results.


5-10% bin (13 attributes), sorted by missing percentage:
Training set: ak_000, ca_000, dm_000, df_000, dg_000, dh_000, dl_000, dj_000, dk_000, eb_000, di_000, bx_000, cc_000
Test set: ca_000, ak_000, df_000, dg_000, dh_000, di_000, dj_000, dk_000, dl_000, dm_000, eb_000, bx_000, cc_000

According to the results in Table 4, the cumulative missing percentages were further used to define the control group (CTRL) and the experimental groups (EXP1, EXP2, and EXP3). After removing the attributes with more than 10% missing data, the remaining 70 attributes were defined as the control group (CTRL). The missing percentage of the attributes contained in EXP1 was less than 30% (88 attributes), in EXP2 less than 20% (74 attributes), and in EXP3 less than 5% (57 attributes). Table 5 lists the results of comparing the control and experimental attribute groups over the training and test sets. During the modeling phase using the training set, k-NN performed best in CTRL (99.66%), EXP2 (99.73%), and EXP3 (99.32%); LR was best in EXP1 (99.82%) and second best in the other groups. During the validation phase using the test set, LR showed the best results in all groups, and k-NN was second best. Thus, in addition to LR, k-NN could be considered a suitable classifier for the APS dataset. Considering the scores obtained for accuracy, F-measure, and MCC, the combination of EXP2 and LR outperformed the other attribute groups and classifiers. The stability of a prediction model must also be considered for practical applications; thus, k-NN can be considered a potential candidate, and if LR fails or underperforms in prediction, it can be replaced by k-NN. To gain further insight into the models' performance, decile-wise lift charts were employed to demonstrate the prediction capability produced by the examples fed into the trained model. If few examples produce a high prediction capability, the calculated lift value is regarded as high.
Figure 3 shows the decile-wise lift charts to demonstrate how LR and k-NN perform by sample size (represented in percentages).

Comparison between the Control and Experimental Groups
It can be noticed from Figure 3a that both LR and k-NN performed well when predicting the NEG class label. After inputting 80% of the examples of the test set, the performance of both the LR and k-NN models decreased. Figure 3b shows the corresponding results for the POS class label. Table 6 lists the β and X_c values obtained for the LR model to demonstrate which of the considered attribute groups provides more stable results. All the β values were very close to zero, which means that the control and experimental groups both returned unbiased outcomes. In addition, all calculated X_c values were negative, which means these unbiased outcomes reflect conservative decision-making. Although the β value was smallest for EXP1 (0.0003) in the training set, it was eight times larger for this group (0.0026) in the test set. EXP2 had the second smallest β value (0.0011) in the training set and the smallest in the test set (0.0013). The small difference between these two values confirmed that the LR model delivers better classification results for EXP2 than for the other groups.

Discussion
The APS dataset has been used for empirical studies on various topics related to classic machine learning; we reviewed those published between 2016 and March 2019. Costa et al. [10] applied mean imputation and the Soft-Impute algorithm to handle missing data and concluded that RF was the best classifier owing to the highest cost-wise ratio of 92.56%, where the FP and FN rates were 3.74% and 3.70%, respectively. Cerquerira et al. [11] removed attributes with over 50% missing data, conducted meta-feature engineering to generate new attributes, implemented the Synthetic Minority Oversampling Technique (SMOTE) to replace the removed examples, and concluded that XGBoost with meta-features yielded the lowest average cost and deviance. Condek et al. [12] applied median imputation to address missing data and concluded that RF used with a cost function provides better results than the naïve approaches of checking every truck or checking no truck until failure. Ozan et al. [13] introduced an optimized k-NN approach to handle missing values and created a tailored k-NN model using a specified HEIM distance. Biteus and Lindgren [14] removed attributes with more than 10% missing values and applied mean imputation to the remaining attributes. They evaluated various classifiers, including RF, SVM, and k-NN, and selected RF, which returned an accuracy score of 0.99.

Rafsunjani et al. [15] used five imputation techniques, including expectation-maximization, mean imputation, Soft-Impute, Multivariate Imputation by Chained Equations (MICE), and iterative singular value decomposition, and applied five classifiers, including naïve Bayes (NB), k-NN, SVM, RF, and GBT. They concluded that NB performs better on the actual, imbalanced dataset, whereas RF performed better on the balanced dataset obtained after applying an under-sampling method. In addition, mean imputation was identified as the best method for imputing missing values. However, if the primary concern is the FP rate rather than accuracy, Soft-Impute was shown to outperform the other imputation techniques, and NB demonstrated the best performance among the classifiers. Jose and Gopakumar [16] employed a k-NN algorithm for missing data imputation, implemented an improved RF algorithm to reduce both the FP and FN misclassification rates, and demonstrated competitive results in terms of precision, F-measure, and MCC.
Table 7 compares the results achieved in this paper for the EXP2-LR combination with those of the previous studies in terms of accuracy, F-measure, and MCC. The previous studies included in the comparison applied imputation techniques directly to the missing data in the training set; however, the process is not discussed, and assumptions related to the consistency of the training and test sets were not provided. In addition, the training and test sets were not evaluated after imputation to determine whether they satisfy the assumption of a consistent distribution. It can be noticed from Table 7 that the recommended model, EXP2-LR, achieved the best accuracy (99.56%), F-measure (73.24%), and MCC (74.30%) results.
Table 8 compares the results achieved in this paper for the EXP2-LR combination with those of the previous studies in terms of β and X_c values, which indicate the effectiveness of the selected classifiers for binary classification. It can be noticed from the table that the previously reported methods all provided unbiased outcomes (β values close to zero) and conservative decision-making (negative X_c). While the β values of the previous methods indicate perfectly unbiased results, the related β_opt values were high, which means that the validated results were not sufficiently generalized. In other words, the previous studies paid more attention to examples with the NEG class label than to those with the POS class label; they may minimize the FN score, but at the cost of increasing the FP score. Furthermore, the X_c values of all previous methods were smaller (i.e., larger negative numbers) than that of the EXP2-LR model, which indicates that the previous methods made more conservative decisions. These results demonstrate that the EXP2-LR model recommended in this paper achieved the best performance in the considered binary classification task.
Two further findings are discussed here. First, according to SDT, when the β values are close to zero, the selected model operates with little bias. This characteristic can be adjusted by applying additional weights such as penalties and rewards (usually predefined by experienced experts). The previous studies published between 2016 and March 2019 implemented various imputation techniques; however, these manipulations did not improve the recognition ability of the reported models. In contrast, adding weight values might lead to more conservative decision-making.
The second finding concerns data representativeness after resampling. An inappropriate resampling method can introduce new problems, e.g., insufficient fitting, even if it alleviates the original class imbalance. Rafsunjani et al. [15] compared the actual data to under-sampled data. As shown in Table 8, the β value calculated for the under-sampled data was more unbiased (β was 0.0000, i.e., perfectly unbiased) than that calculated for the actual data (β was 0.0013, i.e., nearly unbiased). However, the β opt and X c values calculated for the under-sampled data (0.0550 and −2.8358, respectively) were farther from zero than those for the actual data (0.0375 and −2.1683, respectively), which means the under-sampled data led to more conservative decisions. Regardless of the resampling method used (under- or over-sampling), the recommendation for subsequent processing of the transformed data is to first check the consistent distribution assumption. Under such circumstances, advanced metrics such as the ROC curve and AUC should be used to provide a broader understanding of the model's performance over the training and test sets, i.e., during modeling and validation. Figure 4 shows the ROC curves for the selected EXP2-LR model considering the NEG class label, separately for the training and test sets. The calculated AUC values for the training and test sets were 0.974213 and 0.979435, respectively. A high AUC value means that the trained model also performs well during the validation phase and that the selected model is applicable. The trends of the two NEG curves were similar: both curves quickly reached the top, becoming almost fully sensitive, when the 1-specificity value was close to 0.1. Note that 1-specificity is also known as the FP rate and represents the probability of a Type I error or false alarm event. In addition, the NEG curve for the test set was steeper than that for the training set.
Figure 5 shows the ROC curves and AUC values for the EXP2-LR model considering the POS class label, separately for the training and test sets. Compared with the curves for the NEG class label, the curves for the POS class label rose more quickly at the beginning, until a 1-specificity value of 0.2 was reached. The POS curves dipped when the 1-specificity value was between 0.02 and 0.5; after that, the curves tended to flatten.
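AUC values such as those reported for Figures 4 and 5 can be obtained without plotting via the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python sketch, with illustrative labels and scores in the usage note:

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of positive-negative
    pairs in which the positive example receives the higher score (ties
    count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For perfectly separated scores, e.g., `auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2])`, the result is 1.0; a model scoring all examples identically yields 0.5.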
Based on the above results, the selected EXP2-LR model outperformed the machine learning models reported in the previous articles on the APS data. Further, under the data conditions of EXP2, the candidate k-NN model also generated a higher ACC (99.44%) and F-measure (63.64%) than the previous models, with a slightly weaker MCC (65.91%) (Tables 5 and 7). When the presence of missing data is an expected behavior, the proposed sorted missing percentage method can achieve better modeling results. Because the original data structure is largely preserved, these superior results suggest that the proposed method is stable and reliable.
The proposed method also has limitations in practice. First, it cannot fully meet the needs of automation. Typically, single or multiple imputation methods can quickly calculate and fill in missing data using existing software or programs. In contrast, considering the variety of data sources and out of respect for the original data structure, the proposed method must first sort the attributes by missing percentage and must also check whether the distributions of the training and test sets are consistent; both steps require manual, subjective checks. The second limitation is the choice of the cut-off point, which relies on the advice of experienced experts. In the empirical study with the APS data, the cut-off point was defined as attributes with 10% missing data, a concept that comes from Biteus and Lindgren [14]. Overcoming the second limitation is not an easy task, but it also underscores the importance of domain knowledge and industry experience for problem-solving.
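The sorting-and-filtering step described above can be sketched in pure Python as follows; the record layout, the None-as-missing convention, and the 10% default cut-off are illustrative assumptions rather than the exact implementation used in this study:

```python
def sort_by_missing(records, cutoff=0.10):
    """Rank attributes by missing percentage (descending) and keep those
    at or below the cut-off.

    records: list of dicts sharing the same keys; None marks a missing reading.
    Returns (ranked list of (attribute, missing_fraction), kept attribute names).
    """
    n = len(records)
    attrs = records[0].keys()
    # Fraction of missing readings per attribute
    missing = {a: sum(r[a] is None for r in records) / n for a in attrs}
    ranked = sorted(missing.items(), key=lambda kv: kv[1], reverse=True)
    kept = [a for a, frac in ranked if frac <= cutoff]
    return ranked, kept
```

In this sketch, an attribute missing in 2 of 4 records has a missing fraction of 0.5 and is filtered out under any cut-off below 50%.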

Conclusions
In real-world practice, data preprocessing is always tricky for analysts, particularly when dealing with class imbalance, missing data, and inconsistent distributions. This study considered the problem of predicting binary class labels from sensor readings with missing data based on a publicly available APS dataset. First, the IR was introduced to determine whether the assumption of a consistent distribution of the class labels across the training and test sets was satisfied for the considered APS dataset, and the attributes found to be not useful were removed.
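The IR check mentioned above can be sketched as follows, assuming the IR is defined as the majority-to-minority class count ratio; the relative tolerance used to judge "consistent" is an illustrative assumption:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance ratio: majority class count divided by minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def consistent_distribution(train_labels, test_labels, tolerance=0.1):
    """Judge whether the train and test IRs agree within a relative tolerance."""
    ir_train = imbalance_ratio(train_labels)
    ir_test = imbalance_ratio(test_labels)
    return abs(ir_train - ir_test) / ir_train <= tolerance
```

For example, a training set with 9 NEG and 1 POS label has an IR of 9.0, which is consistent with a test set of 18 NEG and 2 POS labels but not with a balanced one.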
Next, a sorted missing percentage technique was proposed to construct a control group and several experimental groups of attributes to be used in training several classifiers for comparison. According to the experimental results, the LR model trained using the EXP2 attribute group (where the attributes with over 20% missing values were filtered out) demonstrated the best performance, achieving an accuracy score of 99.56%, an F-measure of 73.24%, and an MCC of 74.30%. Further, the relative β values and X c indicators based on SDT were used to compare the distinguishing ability of the best-performing LR model across attribute groups. The best values of β = 0.0013, β opt = 0.0044, and X c = −1.8994 were again achieved by the LR and EXP2 combination, which outperformed the other methods reported in the literature and applied to the same APS dataset. These results demonstrated that the proposed sorted missing percentage technique allows one to address the problem of missing data in sensor readings without changing the original data structure and to build a predictive model with a low response bias and a high distinguishing ability.
Although the original APS dataset does not provide rich information, e.g., no complete definitions of the attributes, this empirical study and the presented numerical results demonstrated that a method involving the IR check and sorted missing percentages could provide a flexible way to deal with missing sensor data.
Future work can proceed in two directions. First, the proposed method can be applied to other datasets to verify whether it generalizes well to instances wherein the data source comprises different scenarios, e.g., when missing data are encountered only in the training set. The research materials, including the data files and deployment scripts, were compressed and stored on a cloud drive: https://drive.google.com/drive/folders/1E_Rxt20L6dDRxrdlCGgPhCRGkfafJ9MZ. Second, other attribute selection methods, advanced performance indicators, and newer machine learning modeling techniques, such as artificial neural networks and deep learning, can be tested to determine whether they can generate more reliable and accurate results than those achieved in this paper and thus satisfy the demands of real-world applications.