Classification Performance of Thresholding Methods in the Mahalanobis–Taguchi System

Abstract: The Mahalanobis–Taguchi System (MTS) is a pattern recognition tool that employs the Mahalanobis Distance (MD) and Taguchi's Robust Engineering philosophy to explore and exploit data in multidimensional systems. The MD metric provides a measurement scale for classifying samples (Abnormal vs. Normal) and an approach to measuring the level of severity between classes. An accurate classification result depends on a threshold value, or cut-off MD value, that can effectively separate the two classes. Obtaining a reliable threshold value is therefore crucial: an inaccurate threshold could lead to misclassification and eventually to a misjudged decision, which in some cases has fatal consequences. This paper compares the performance of the four most common thresholding methods reported in the literature in minimizing the misclassification problem of the MTS, namely the Type I–Type II error method, the Probabilistic thresholding method, the Receiver Operating Characteristics (ROC) curve method and the Box–Cox transformation method. The motivation of this work is to find the most appropriate thresholding method for the MTS methodology among these four. The traditional way to obtain a threshold value in MTS is to use Taguchi's Quadratic Loss Function, in which the threshold is obtained by minimizing the costs associated with a misclassification decision. However, obtaining cost-related data is not easy since monetary information is considered confidential in many cases. In this study, a total of 20 different datasets were used to evaluate the classification performance of the four thresholding methods based on classification accuracy. The results indicate that none of the four thresholding methods outperformed the others on most (if not all) of the datasets. Nevertheless, the study recommends the use of the Type I–Type II error method because it is computationally less complex than the other three thresholding methods.


Introduction
The Mahalanobis-Taguchi System (MTS) is a pattern information technology that aids the quantitative decision-making process by constructing a multivariate measurement scale using data analytic methods [1]. It was developed by the renowned Japanese quality guru Dr. Genichi Taguchi. The MTS methodology builds on the theory of Mahalanobis distance (MD) formulated in 1936 by the famous Indian statistician Dr. P.C. Mahalanobis [2], who was motivated by a wish to examine whether Indian people who married European people came from specific caste levels. The formulation of MD was then extended by Dr. Taguchi, who integrated it with his robust engineering concepts so that the MD methodology became a popular tool for pattern recognition and forecasting in multidimensional systems [3]. Numerous applications of MTS have since been reported, in fields ranging from remanufacturing, medical diagnosis, pattern recognition and aerospace to agro-culture, administration, banking and finance [3][4][5][6][7]. One of the prominent functions of MTS is to classify two groups of samples, such as healthy and unhealthy patients, conforming and nonconforming products, normal and abnormal conditions, acceptable and non-acceptable approval terms, as well as other binary discrimination purposes. In MTS, to classify any two or more samples among the sample groups, the MD value of each sample is calculated based on their common feature datasets. The computed MD values are viewed as points in the high-dimensional space, and they represent the distances of the corresponding samples from each other on a univariate scale. If the MD values of two samples are "closer", the two samples can be said to be similar; otherwise, they are different from each other. The question then arises as to how close is "closer". This is where a threshold value, or cut-off value, is required to carry out the classification process effectively.
In the MTS context, Taguchi proposed the use of Taguchi's Quadratic Loss Function (QLF) as the means to determine a threshold value for classifying samples [8]. QLF aims to minimize the monetary loss resulting from wrongly classifying the samples (false alarms). Thus, cost information associated with the misclassification problems is required to determine the threshold value. The fundamental concept of QLF is discussed in further detail in Section 3. However, QLF has been seen as impractical because of the difficulty of estimating the relative cost or monetary loss in each case [9][10][11]. Therefore, several state-of-the-art thresholding methods have been reported in the literature as alternative ways to determine the threshold in the MTS methodology. The four most common thresholding methods deployed in the MTS are the probabilistic thresholding method [12,13], the Type-I and Type-II errors method [14][15][16][17], the ROC curve method [9,13] and the control chart method via Box-Cox transformation [18]; these are also discussed in further detail in Section 3. The aim of this study is to compare the effectiveness of these four common thresholding methods in the MTS methodology. To the best of the authors' knowledge, no comparative work has been conducted to evaluate their effectiveness in MTS. The reports found in the literature mainly focus on demonstrating the use of a thresholding method on the researchers' own case studies. The motivation of this paper is therefore to compare the classification performance of these four common thresholding methods in the MTS across several datasets.
The paper is organized as follows. A theoretical overview of the fundamental concepts of MD and MTS is given in Section 2. The fundamental concepts of the thresholding methods used in the MTS, including the Quadratic Loss Function, the Probabilistic Thresholding method, the Type-I and Type-II Errors method, the ROC curve method and the Box-Cox Transformation method, are briefly discussed in Section 3. Sections 4 and 5 describe the 20 datasets used in the study and present the results and discussion of the comparison of the classification performance of the thresholding methods. Section 6 concludes with the key findings and contributions of this paper.

The Concept of Mahalanobis Distance (MD)
MD is a dimensionless distance measure based on the correlation between features and pattern differences that can be analysed with respect to a reference population [19], as shown in Figure 1. This reference population is called the normal space. The distance measure is termed the Mahalanobis Scale (MS) and aids the discriminant analysis approach by assessing the level of abnormality of datasets against the normal space.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 3 of 23
MD has an elliptical shape (see Figure 1) due to the correlation effect between the features. If there is no correlation, the MD is the same as the Euclidean Distance (ED) that has a circular shape. MD is different from Euclidean Distance since the latter does not consider the correlation among the features of the data points.

Mahalanobis Distance (MD) Formulation
MD is defined as in Equation (1):

$$MD_j = Z_{ij}^{T}\, C^{-1}\, Z_{ij} \tag{1}$$

where:
• k = the total number of features;
• i = the index of the features (i = 1, 2, . . . , k);
• j = the index of the samples (j = 1, 2, . . . , n);
• $Z_{ij}$ = the standardized vector of normalized characteristics of $x_{ij}$;
• $x_{ij}$ = the value of the ith characteristic in the jth observation;
• $m_i$ = the mean of the ith characteristic;
• $s_i$ = the standard deviation of the ith characteristic;
• T = the transpose of the vector;
• $C^{-1}$ = the inverse of the correlation coefficient matrix.
MD has been well deployed in a broad array of applications [20,21] mainly because it is very effective in tracking intervariable correlations in data.

Mahalanobis-Taguchi System (MTS) Procedures
Taguchi extended the MD methodology with his robust engineering concepts to become an efficient and effective strategy for prediction and forecasting in multidimensional systems. In the MTS methodology, the formulation of MD is "scaled": the MD formulation stated in Equation (1) is divided by a term "k" that denotes the number of variables or features of a recognition system. Therefore, the equation for calculating the scaled MD in the MTS methodology becomes:

$$MD_j = \frac{1}{k}\, Z_{ij}^{T}\, C^{-1}\, Z_{ij} \tag{2}$$

From this point onwards, the MD computation will be based on Equation (2). The MD offers a statistical measure to diagnose unknown sample conditions against known samples and provides information to make future predictions.
The fundamental steps in the MTS methodology are explained in the next section.
To construct a measurement scale, a homogeneous dataset from normal observations needs to be collected to build a reference group called the normal group [22]. It is used as a base or reference point in the scale. The collected normal datasets need to be standardized to obtain a dimensionless unit vector, followed by the MD computation. Practically, the MD for unknown data is interpreted as the nearness to the mean of the normal group. As a countercheck, the average value of the MDs for the normal group must always be close to unity; therefore they are called the normal space or Mahalanobis Space (MS) [23].
The steps for the construction of the MS are outlined below:
• Calculate the mean ($m_i$) of each characteristic in the normal dataset.
• Then, calculate the standard deviation ($s_i$) of each characteristic.
• Next, standardise each characteristic to form the normalized data matrix ($Z_{ij}$) and its transpose ($Z_{ij}^{T}$).
• Then, verify that the mean of the normalized data is zero.
• Verify that the standard deviation of the normalized data is one.
• Form the correlation coefficient matrix (C) of the normalized data. Each element ($c_{ij}$) is the correlation coefficient between two different features X and Y, where n is the number of samples, $\bar{X}$ and $\bar{Y}$ are the averages of the data in each variable, and V(X) and V(Y) are the variances of X and Y.
• Finally, calculate $MD_j$ using Equation (2).
To evaluate the measurement scale, observations outside the MS (abnormal datasets) are used. The same calculation is repeated to obtain the MD values of the abnormal samples. However, the abnormal data are normalized using the mean, standard deviation and correlation matrix of the normal group. The normal MDs and abnormal MDs are then compared. An acceptable measurement scale should demonstrate significant discrimination between the normal and abnormal MD values.
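The construction steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names, the synthetic data and the use of the sample (n − 1) denominator for the standard deviation and the correlation matrix are all assumptions. Under that convention the mean scaled MD of the normal group is exactly (n − 1)/n, which is the "close to unity" countercheck mentioned earlier.

```python
import math
import random

def fit_normal_space(normal):
    """Return (means, stds, corr) of the normal group; rows are samples."""
    n, k = len(normal), len(normal[0])
    means = [sum(row[i] for row in normal) / n for i in range(k)]
    # Assumption: sample standard deviation with an (n - 1) denominator.
    stds = [math.sqrt(sum((row[i] - means[i]) ** 2 for row in normal) / (n - 1))
            for i in range(k)]
    z = [[(row[i] - means[i]) / stds[i] for i in range(k)] for row in normal]
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(k)]
            for a in range(k)]
    return means, stds, corr

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))) / m[r][r]
    return x

def scaled_md(sample, means, stds, corr):
    """Scaled MD of Equation (2): (1/k) * z^T C^-1 z."""
    k = len(sample)
    z = [(sample[i] - means[i]) / stds[i] for i in range(k)]
    return sum(zi * yi for zi, yi in zip(z, solve(corr, z))) / k

# Hypothetical normal group: 50 samples, 3 features, features 1 and 2 correlated.
random.seed(0)
normal = []
for _ in range(50):
    a, b, c = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    normal.append([10 + a, 5 + 0.8 * a + 0.6 * b, 2 + c])

means, stds, corr = fit_normal_space(normal)
normal_mds = [scaled_md(x, means, stds, corr) for x in normal]
mean_normal_md = sum(normal_mds) / len(normal_mds)

# A hypothetical abnormal sample is scaled with the NORMAL group's statistics.
abnormal_md = scaled_md([14.0, 1.0, 2.0], means, stds, corr)
```

The abnormal sample breaks the positive correlation between the first two features, so its MD is much larger than the normal average even though each value on its own is only a few standard deviations from the mean.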

STAGE 3: Identify significant features
In the third stage, the system is optimized by selecting only the features that are found to be significant or "useful" for the system. This is where the Orthogonal Array (OA) and signal-to-noise ratio (SNR) are utilized. The features are assigned to a two-level orthogonal array, in which "used" is signified as level 1 and "not used" as level 2. For each experimental run, the MD of each abnormal sample is calculated using only the "used" features. The calculated MD values are recorded according to the experimental run, and the SNR based on the MD values of all samples is then computed.

The Role of the Orthogonal Array (OA) in MTS
An orthogonal array (OA) is a type of fractional factorial design of experiments introduced by C.R. Rao in 1947 [24]. It differs from the traditional fractional factorial DOE in that it tries to balance the combinations or interactions of factors equally with the minimum number of experimental runs. In MTS, the orthogonal array structure is represented in Latin symbology as $L_a(b^c)$, where L denotes the Latin Square, a is the number of runs, b is the number of factor levels and c is the number of main factors. Table 1 illustrates an example of an OA structure for seven factors with eight runs and two factor levels. The name "orthogonal" does not refer to a perpendicular attribute of the structure; rather, it means that in any pair of columns every combination of factor levels is repeated the same number of times [24]. To illustrate further, using the OA in Table 2 as an example, take the pair of column 1 and column 2: each combination of levels in this column pair is repeated the same number of times (twice in this case). The same number of repetitions should be obtained for the rest of the column pairs; thus the $L_8(2^7)$ array depicted in Table 1 can be said to be orthogonal. Table 2 illustrates the number of repetitions of the level combinations for three more column pairs. In MTS, OAs are used to select the features of importance by minimizing the number of different combinations of the original set of features. The features are assigned to the different columns of the array. Since the features have only two levels, a two-level array is used in MTS as illustrated in Table 2. For each run of an OA, MDs corresponding to the known abnormal conditions are computed. The importance of features is judged based on their ability to measure the degree of abnormality on the measurement scale [25]. This is where the signal-to-noise ratio metric is deployed. Further discussion on OA concepts can be found in [24,26,27,28].
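The orthogonality property described above can be checked directly. The sketch below assumes the standard Taguchi $L_8(2^7)$ layout, which may differ in column order from the paper's Table 1; the helper names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Standard Taguchi L8(2^7) array (assumed layout): 8 runs, 7 two-level columns.
L8 = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 2, 2, 2, 2),
    (1, 2, 2, 1, 1, 2, 2),
    (1, 2, 2, 2, 2, 1, 1),
    (2, 1, 2, 1, 2, 1, 2),
    (2, 1, 2, 2, 1, 2, 1),
    (2, 2, 1, 1, 2, 2, 1),
    (2, 2, 1, 2, 1, 1, 2),
]

def pair_counts(array, col_a, col_b):
    """Count how often each level combination occurs in a column pair."""
    return Counter((run[col_a], run[col_b]) for run in array)

def is_orthogonal(array):
    """True if every column pair contains all four combinations equally often."""
    for a, b in combinations(range(len(array[0])), 2):
        counts = pair_counts(array, a, b)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True

def features_used(run):
    """Indices of the features switched on in a run (level 1 means "used")."""
    return [i for i, level in enumerate(run) if level == 1]

columns_1_2 = pair_counts(L8, 0, 1)   # each combination appears twice
orthogonal = is_orthogonal(L8)
```

In an MTS run, `features_used(run)` would pick the feature subset for which the abnormal MDs are recomputed.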

The Role of the SNR in MTS
The signal-to-noise ratio (SNR) concept, which can be considered the core essence of the Taguchi philosophy, was developed by Taguchi, who was inspired while practicing as an engineer in a Japanese telecommunication company in the 1950s. In the telecommunication context, the SNR captures the magnitude of true information (i.e., signals) after making some adjustments for uncontrollable variation (i.e., noise) [26]. In Taguchi's robust engineering concept, the SNR is defined as the measure of the functionality of the system, which exploits the interaction between the control factors and the noise factors. A "gain" in the SNR value denotes a reduction in the variability, hence a reduction in the number of factors associated with the "noise" (factors that are considered insignificant for the classification effort), resulting in a reduction of the time and cost of the classification process. Refs. [27,28] provide a detailed description of SNR concepts and the origin of their formulation.
In the context of MTS, the SNR is defined as the measure of the accuracy of the measurement scale for predicting abnormal conditions [26]. In MTS, a higher value of SNR, which is expressed in decibels (dB), means a lower prediction error. SNR is used as a metric to assess how significantly each variable in the system contributes to the ability to discriminate between normal and abnormal observations. It can also be used to assess the overall performance of a given MTS model and the degree of improvement achieved after it has undergone the optimization process.
The two most commonly used SNRs in MTS are larger-the-better (LTB) and dynamic [23,26,29]. In this study, the larger-the-better SNR will be utilized due to less computational complexity.

Larger-the-Better SNR
LTB is formulated as in Equation (11) below, where t is the number of abnormal conditions and $D_1^2, D_2^2, \ldots, D_t^2$ are the MDs corresponding to the abnormal situations. The SNR (for the larger-the-better criterion) corresponding to the qth run of the OA is given as:

$$\eta_q = -10 \log_{10}\left[\frac{1}{t}\sum_{i=1}^{t}\frac{1}{D_i^2}\right] \tag{11}$$

For each variable $X_i$, $SNR_1$ represents the average SNR of level 1 for $X_i$ while $SNR_2$ represents the average SNR of level 2 for $X_i$ throughout the vertical columns of the OA. The gain of each variable is then:

$$\text{Gain} = SNR_1 - SNR_2 \tag{12}$$

Thus, positive gains from Equation (12) constitute useful features while negative gains constitute otherwise. Table 3 illustrates the assessment made using the SNR to evaluate the significant factors of the $L_8$ OA structure.

Table 3. An example of useful feature selection using OA ($L_8(2^7)$) and signal-to-noise ratio (SNR). Columns: Run; Factor 1 to 7; MD Computation; SNR.

The optimized system is then re-evaluated with the abnormal samples to validate its discriminant power. Once confirmed, the optimized system is used for future applications in diagnosis, classification, or forecasting. Figure 2 summarizes the fundamental stages in MTS. Note that it is in Stage 4 that the optimum threshold value ($MD_T$) of the optimized system is obtained prior to future diagnosis or classification usage.
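The larger-the-better SNR and the gain used for feature selection in Stage 3 can be sketched as follows. The function names, the small $L_4(2^3)$ array and the per-run SNR values are hypothetical and chosen only to make the arithmetic easy to follow; the inputs to the SNR are the scaled MDs ($D^2$ values) of the abnormal samples.

```python
import math

def snr_larger_the_better(abnormal_mds):
    """Larger-the-better SNR in dB: -10 * log10(mean of 1 / D_i^2)."""
    t = len(abnormal_mds)
    return -10.0 * math.log10(sum(1.0 / d2 for d2 in abnormal_mds) / t)

def feature_gains(oa, snr_per_run):
    """Gain per feature: mean SNR at level 1 (used) minus mean SNR at level 2."""
    gains = []
    for col in range(len(oa[0])):
        level1 = [s for run, s in zip(oa, snr_per_run) if run[col] == 1]
        level2 = [s for run, s in zip(oa, snr_per_run) if run[col] == 2]
        gains.append(sum(level1) / len(level1) - sum(level2) / len(level2))
    return gains

# Two abnormal samples with MD = 10 give an SNR of 10 dB.
example_snr = snr_larger_the_better([10.0, 10.0])

# Hypothetical L4(2^3) array and per-run SNR values (dB).
L4 = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]
snr_per_run = [4.0, 2.0, 1.0, -1.0]

gains = feature_gains(L4, snr_per_run)
useful_features = [i for i, g in enumerate(gains) if g > 0]
```

Here the first two features show a positive gain and would be retained, while the third has zero gain and would be dropped.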

Quadratic Loss Function
The Quadratic Loss Function (QLF), introduced by Dr. Genichi Taguchi, aims to quantify the quality lost to society [30]. Taguchi defines "loss to society" not only in terms of operational problems such as rejections, scrap, or rework but also in terms of, among others, pollution added to the environment, products that wear out too quickly while in use, or other negative effects that could occur over the operational life of the products. In the context of Robust Engineering Design, QLF is used to determine the specification limits for a product. Ref. [30] provides a clear discussion of QLF. QLF holds that any deviation on either side of the quality target incurs a monetary loss. This concept helps management understand the importance of the robustness of a design, because the variation is expressed in monetary terms.
The idea of QLF is applied in the MTS to determine the threshold values for the classification problem [8]. Take a medical diagnosis problem as an example: if the MD value of a patient's blood sample exceeds the threshold value, the patient is classified as unhealthy, leading to the decision that the patient should be given a further complete medical examination. In Quadratic Loss Function, the optimal threshold ($MD_T$) is given by: where: The key element in the QLF concept is to balance the cost of treating a patient against the cost of not treating a patient (as in the medical application). However, in practical applications, even outside medical diagnosis, the associated monetary information has been found to be difficult to obtain [10,13,31]. Hence, several alternative approaches to determine the optimal threshold value have been reported in the literature, some of which are discussed as follows.

Probabilistic Thresholding Method
Ref. [13] introduced a probabilistic thresholding method (PTM), grounded in Chebyshev's theorem, to evaluate the classification performance of MTS. Ref. [32] used PTM based on Chebyshev's theorem to reduce the solder paste inspection process in a Surface-Mount Technology (SMT) assembly using MTS. Chebyshev's theorem is useful for estimating the probability of obtaining a value that deviates from the mean by less than a given number of standard deviations, especially when the probability distribution of the dataset is unknown. The optimal threshold ($MD_T$) can be calculated with the following formula: where:
• $\overline{MD}$ is the average of the MDs of the normal group,
• $s_{MD}$ is the standard deviation of the MDs of the normal group,
• λ is a small parameter or the confidence level (typically 5% or 0.05),
• ω is the percentage of the normal examples whose MDs are smaller than the minimum MD of the abnormal examples and do not overlap with the abnormal MDs.
Ref. [32] provides the method to determine ω, as illustrated in Figure 3 for a simple example. The 10 blue boxes represent normal samples on their MD scales while the 7 orange boxes represent abnormal samples, with two boxes of the respective samples overlapping each other. Thus, ω is obtained as the percentage of non-overlapped normal boxes over the total number of normal boxes, which in this case is 7 divided by 10, equal to 70% or 0.7 on a zero-to-one scale.
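The definition of ω above translates directly into code. The sketch below is an illustration with hypothetical MD values, arranged so that 7 of the 10 normal samples lie below the smallest abnormal MD, as in the Figure 3 example; the function name is an assumption.

```python
def omega(normal_mds, abnormal_mds):
    """Fraction of normal samples whose MD is below the smallest abnormal MD."""
    min_abnormal = min(abnormal_mds)
    non_overlapped = sum(1 for md in normal_mds if md < min_abnormal)
    return non_overlapped / len(normal_mds)

# Hypothetical MD values: 10 normal samples and 7 abnormal samples.
normal_mds = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.3, 1.5, 1.8]
abnormal_mds = [1.2, 1.6, 2.0, 2.5, 3.0, 3.5, 4.0]

omega_value = omega(normal_mds, abnormal_mds)  # 7 / 10
```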


Type-I and Type-II Errors Method
Several attempts have been reported in the literature to minimize Type-I and Type-II errors in finding the optimum threshold of the MTS [14,15,18,31]. Generally, a Type-I error is a misclassification error in which true normal samples are classified as abnormal, while a Type-II error occurs when true abnormal samples are predicted as normal. For a two-class classification problem, the normal samples can be regarded as positive and the abnormal samples as negative. Consequently, there are four classification results:
1. TP (True Positive) = an observation is positive and predicted as positive,
2. FP (False Positive) = an observation is negative but predicted as positive,
3. TN (True Negative) = an observation is negative and predicted as negative, and
4. FN (False Negative) = an observation is positive but predicted as negative.
The four classification results can be further understood in the tabular representation shown in Table 4, which is also known as the Confusion Matrix. From Table 4, the Type-I error is derived as $\alpha = \frac{FN}{FN + TP}$ while the Type-II error is expressed as $\beta = \frac{FP}{TN + FP}$. Determining the optimal threshold ($MD_T$) means minimizing the sum $\alpha_{\text{Type-I}} + \beta_{\text{Type-II}}$ such that:

$$MD_T = \arg\min\left(\alpha_{\text{Type-I}} + \beta_{\text{Type-II}}\right)$$

The optimal threshold ($MD_T$) that minimizes the Type-I and Type-II errors is illustrated in Figure 4.
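One simple way to carry out this minimization is an exhaustive search over candidate thresholds. The sketch below is an assumption-laden illustration rather than the procedure of any cited reference: it takes the observed MD values as candidates, treats a sample as abnormal when its MD exceeds the threshold, and uses hypothetical MD values.

```python
def error_rates(threshold, normal_mds, abnormal_mds):
    """Return (alpha, beta) when MD > threshold is classified as abnormal."""
    # Type-I: normal (positive) samples wrongly called abnormal, FN / (FN + TP).
    alpha = sum(md > threshold for md in normal_mds) / len(normal_mds)
    # Type-II: abnormal (negative) samples wrongly called normal, FP / (TN + FP).
    beta = sum(md <= threshold for md in abnormal_mds) / len(abnormal_mds)
    return alpha, beta

def best_threshold(normal_mds, abnormal_mds):
    """Candidate threshold (an observed MD) with the smallest alpha + beta."""
    candidates = sorted(set(normal_mds) | set(abnormal_mds))
    return min(candidates,
               key=lambda t: sum(error_rates(t, normal_mds, abnormal_mds)))

# Hypothetical MD values for the two classes.
normal_mds = [0.5, 0.8, 1.0, 1.2, 1.6]
abnormal_mds = [1.4, 2.0, 2.5, 3.0]

md_t = best_threshold(normal_mds, abnormal_mds)
alpha, beta = error_rates(md_t, normal_mds, abnormal_mds)
```

With these values the search settles on a threshold of 1.2, where one normal sample (MD = 1.6) is misclassified and no abnormal sample is missed.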

ROC Curve Method
The history of Receiver Operating Characteristics (ROC) dates back to the World War II era, when radar operators used this theory to decide whether a blip on the radar receiving screen indicated an enemy battleship, a friendly allied asset, or just "noise". This signal detection theory was first popularized outside the military world by [33] in the area of psychology and, over the years, has been widely used in various disciplines including electronic signal detection, medical prognosis and diagnosis, as well as data mining applications for classification purposes [34].
In the context of the classification problem of MTS, ref. [9] deploys ROC in software defect diagnosis based on a multivariate set of software metrics and attributes, incorporating the sensitivity and specificity metrics of the training dataset to set the threshold value (see Figure 5). Sensitivity is defined as the proportion of the actual positive class that is correctly identified as such, while specificity is the proportion of the negative class that is correctly identified as negative, such that:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Note that in Figure 5, the x-axis of the diagram uses the "1 − specificity" metric to denote the false positive rate of the classifier. The aim is to determine the area under the curve of the model classifier, where a bigger area is better. In other words, the closer the model classifier line (the red dotted curve) is to the perfect classifier shape (the blue dotted line), the better the chance that the model classifies all samples correctly according to their respective classes. Thus, ref. [9] aims to find the $MD_T$ value that maximizes the area under the curve of the MTS classifier.
Using the area under the ROC curve to determine a threshold value, however, could be misleading since two ROC curves may have different shapes yet identical areas under the curve [35]. Thus, ref. [11] proposed that, instead of maximizing the area under the curve, the Euclidean Distance from a point on the classifier curve to the maximum theoretical threshold value (i.e., maximum true positive rate) be minimized. Figure 6 illustrates the approach, taking point "A" as the maximum sensitivity value while points B, C, D and E represent four different points on the MTS classifier curve.
Thus, from Figure 6, the distances from the points (four in this example: $d_{AB}$, $d_{AC}$, $d_{AD}$ and $d_{AE}$) to point A are calculated. The closer the classifier performance is to point A, the better. Changing the threshold changes the point coordinate on the curve. Therefore, the problem of finding the optimum threshold can be reformulated as the problem of finding the point on the curve closest to point A, given:

$$d_{A.MD_T} = \sqrt{\left(FPR_A - FPR_{MD_T}\right)^2 + \left(TPR_A - TPR_{MD_T}\right)^2}$$

The optimum $MD_T$ is thus established by obtaining the shortest Euclidean Distance such that:

$$MD_T = \arg\min_{MD_T}\; d_{A.MD_T}$$

where $d_{A.MD_T}$ is the Euclidean Distance between point A and any point $MD_T$ that lies on the ROC curve, such as points B, C, D or E in the example of Figure 6. $FPR_A$ is the false positive rate at point A, which is equal to zero, and $TPR_A$ is the true positive rate at point A, which is equal to one. $FPR_{MD_T}$ is the false positive rate at the threshold $MD_T$ and $TPR_{MD_T}$ is the true positive rate at the threshold $MD_T$. The $MD_T$ that gives the lowest $d_{A.MD_T}$ value is taken as the optimum threshold value.
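The closest-point search described above can be sketched as follows. This is a minimal illustration under stated assumptions: normal samples are the positive class (as in the Type-I and Type-II section), a sample is predicted normal when its MD does not exceed the threshold, the candidate thresholds are the observed MD values, and the data are hypothetical.

```python
import math

def roc_point(threshold, normal_mds, abnormal_mds):
    """Return (FPR, TPR) with normal = positive and MD <= threshold = normal."""
    tpr = sum(md <= threshold for md in normal_mds) / len(normal_mds)
    fpr = sum(md <= threshold for md in abnormal_mds) / len(abnormal_mds)
    return fpr, tpr

def closest_to_ideal(normal_mds, abnormal_mds):
    """Threshold whose ROC point is nearest to point A = (FPR 0, TPR 1)."""
    def distance(threshold):
        fpr, tpr = roc_point(threshold, normal_mds, abnormal_mds)
        return math.hypot(0.0 - fpr, 1.0 - tpr)

    candidates = sorted(set(normal_mds) | set(abnormal_mds))
    best = min(candidates, key=distance)
    return best, distance(best)

# Hypothetical MD values for the two classes.
normal_mds = [0.5, 0.8, 1.0, 1.2, 1.6]
abnormal_mds = [1.4, 2.0, 2.5, 3.0]

md_t, d_min = closest_to_ideal(normal_mds, abnormal_mds)
```

For these values the nearest point to A is (FPR 0, TPR 0.8), reached at a threshold of 1.2.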

Box-Cox Transformation
The distribution of MD values for all samples contributed to the construction of the MTS classifier does not generally follow a normal distribution. They are always skewed to the left and to the right of the MDs distribution plot since normal and abnormal samples are treated as different sample populations. Ref. [18] attempted to transformed the nonnormal distribution of MDs into a normally distributed MDs distribution using Box-Cox transformation procedures. The motivation of their work comes from their intention to adopt a Control Chart Limit concept to determine the optimal MD T value. In a control limit procedure, the mean (µ) and standard deviation (σ) of the sample population in a normal distribution could be easily determined. Thus, all samples (in MD terms) are transformed using Box-Cox transformation which is defined in Equation (19)-(21) as follows: where MD i is the MD of the ith sample, MD i (λ) is the transformed MD value. The value of λ is obtained, such that it maximizes the logarithm of the likelihood function in Equation (20) as the following: where, MD(λ) = 1 n n ∑ i=1 (MD i (λ)) and n is the total number of samples (both normal and abnormal). And thus, to obtain the optimal threshold value (τ x ) of the transformed samples, one has to minimize the following error (ε) function according to Equation (21) below: where τ is the threshold value (in Box-Cox transformed term), e 1 is the number of samples classified as unhealthy (abnormal) which in fact they were healthy (normal), n h is the total number of healthy (normal) samples, e 2 is the number of samples classified as healthy (normal) which in fact they were unhealthy (abnormal) while n u is the total unhealthy (abnormal) samples in the dataset.
Since the threshold τ is expressed in Box-Cox transformed terms, it is converted back into the non-transformed MD_T form by rearranging Equation (19) for MD and substituting the λ value obtained previously.
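The whole procedure (transform, choose λ, search for τ, convert back) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study. It uses the standard Box-Cox transformation and profile log-likelihood, searches λ over a user-supplied grid, takes the error to be e_1/n_h + e_2/n_u, and treats a transformed MD less than or equal to τ as normal. The names `boxcox`, `inv_boxcox`, `log_likelihood` and `fit_threshold`, and the MD values, are hypothetical.

```python
import math

def boxcox(md, lam):
    """Standard Box-Cox transform of a positive MD value."""
    return math.log(md) if lam == 0 else (md ** lam - 1.0) / lam

def inv_boxcox(value, lam):
    """Map a transformed value back to the original MD scale."""
    return math.exp(value) if lam == 0 else (lam * value + 1.0) ** (1.0 / lam)

def log_likelihood(mds, lam):
    """Profile log-likelihood of lambda for the pooled MD sample."""
    n = len(mds)
    transformed = [boxcox(md, lam) for md in mds]
    mean = sum(transformed) / n
    var = sum((v - mean) ** 2 for v in transformed) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(md) for md in mds)

def fit_threshold(md_normal, md_abnormal, lambdas):
    """Pick lambda by grid search, then the transformed threshold with least error."""
    pooled = list(md_normal) + list(md_abnormal)
    lam = max(lambdas, key=lambda c: log_likelihood(pooled, c))
    t_norm = [boxcox(md, lam) for md in md_normal]
    t_abn = [boxcox(md, lam) for md in md_abnormal]
    best_tau, best_err = None, math.inf
    for tau in sorted(t_norm + t_abn):
        e1 = sum(v > tau for v in t_norm)   # normal samples flagged abnormal
        e2 = sum(v <= tau for v in t_abn)   # abnormal samples flagged normal
        err = e1 / len(t_norm) + e2 / len(t_abn)  # assumed error function
        if err < best_err:
            best_tau, best_err = tau, err
    # Convert the transformed threshold back to the MD scale.
    return lam, best_tau, inv_boxcox(best_tau, lam), best_err

# Hypothetical MD values, for illustration only.
md_normal = [0.5, 0.8, 1.1, 1.4, 2.0]
md_abnormal = [3.5, 4.2, 6.0, 9.0]
grid = [i / 10 for i in range(-20, 21)]
lam, tau, md_t, err = fit_threshold(md_normal, md_abnormal, grid)
print(lam, tau, md_t, err)
```

Because the transformation is monotonically increasing in MD for any λ, the back-transformed MD_T separates the samples exactly as τ does on the transformed scale.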

Datasets
The classification performance of the four thresholding methods, namely the Type-I-Type-II error method, the Probabilistic Thresholding Method, the ROC curve method and the Box-Cox transformation method, is tested on 20 different datasets (refer to Table 5). Eighteen of them are standard benchmark datasets obtained from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [36]. These benchmark datasets originate from the UCI machine learning repository and are widely used in studies of binary (two-class) classification problems (normal vs. abnormal in this case). The samples of each class (normal or abnormal) in a dataset are randomly assigned to either a training set or a testing set, so that each class is divided roughly 50-50 between the two sets [37]. The training sets are used to find the optimized (reduced) set of variables via the MTS procedures and the optimum threshold value (MD_T) via each of the four thresholding methods.
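The per-class random split described above can be sketched as follows. This is a minimal illustration and not the procedure of [37]. The function name `stratified_half_split`, the seed handling and the rule that an odd leftover sample goes to the training set are assumptions made for the example.

```python
import random

def stratified_half_split(samples, labels, seed=0):
    """Randomly assign about half of each class to training and the rest to testing."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label, members in by_class.items():
        members = members[:]            # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = (len(members) + 1) // 2   # assumption: odd leftover goes to training
        train += [(s, label) for s in members[:cut]]
        test += [(s, label) for s in members[cut:]]
    return train, test

# Hypothetical dataset of 6 normal and 4 abnormal samples.
samples = list(range(10))
labels = ["normal"] * 6 + ["abnormal"] * 4
train, test = stratified_half_split(samples, labels, seed=42)
print(len(train), len(test))
```

Splitting each class separately keeps the normal-to-abnormal ratio of the training set close to that of the testing set.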
Two additional datasets, namely the Medical Diagnosis of Liver Disease [38] and Taguchi's Character Recognition [23] datasets, are also included. The following sections briefly describe them.

Medical Diagnosis of Liver Disease Data
The liver disease data were originally collected and used for MTS analysis by Dr. Genichi Taguchi himself during his initial work on MTS. They are well known as a benchmark for MTS, having been applied by various researchers to evaluate and analyse MTS performance in binary classification problems [23,26,30].
The data originate from nearly 30 years ago, when Dr. Genichi Taguchi worked together with Dr. Tatsuji Kanetaka of Tokyo Tenshin Hospital on a joint study of liver disease diagnosis. The result of the study was made public in 1987, and the data have since appeared in various publications and been used in several MTS comparison studies. The data contain observations of a healthy group and an abnormal group on 17 features, as shown in Table 6.
Table 6. Variables in the liver disease diagnosis and notations for the analysis.
The healthy group (MS) is constructed from observations of 200 healthy people who do not have any health problems, together with 17 abnormal conditions (unhealthy). These data act as the training data for the construction of the Mahalanobis Space (MS, the reference group). A further 60 samples (other than the training samples) are taken as the testing samples [38].

Taguchi's Character Recognition
It is a feature selection technique in character recognition proposed by [39], in which the features of a character are extracted from instances of variation and abundance. Figure 7 illustrates an example of the variation and abundance instances of the character "5". Variation is defined as the number of switches from white to grey or from grey to white, represented by the small circles, while abundance is the number of grey boxes that the arrow passes through in each row of the index (see Figure 7). These variation and abundance items act as the variables of interest in MTS for classification purposes. Ref. [23] provides a detailed explanation of these concepts and examples of how they are deployed in the MTS methodology. In this paper, pattern recognition for the character "5" is selected for analysis.
Ref. [23] demonstrated the use of this method in recognizing the numeral "5" among "normal" samples, whose shapes resemble a "5", and "abnormal" samples, whose shapes do not. The data, published in 2012, consist of 14 variables (7 abundance instances and 7 variation instances). A total of 18 normal and 46 abnormal samples were collected for the study.
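The two feature types can be illustrated with a short Python sketch. It is a simplified reading of the definitions above, not the procedure of [23] or [39]. It assumes that the character is stored as a grid of 0 (white) and 1 (grey) cells, that the arrow scans each row from left to right, and that switches are counted only between adjacent cells inside the grid. The 7-row pattern `FIVE` and the function name `row_features` are hypothetical.

```python
def row_features(grid):
    """Return (abundance, variation) lists, one entry per row of a 0/1 grid."""
    abundance, variation = [], []
    for row in grid:
        # Abundance: number of grey cells crossed in this row.
        abundance.append(sum(row))
        # Variation: number of white-to-grey or grey-to-white switches.
        variation.append(sum(a != b for a, b in zip(row, row[1:])))
    return abundance, variation

# Hypothetical 7-row sketch of the numeral "5" (1 = grey, 0 = white).
FIVE = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
abundance, variation = row_features(FIVE)
features = abundance + variation  # 7 abundance + 7 variation = 14 variables
print(features)
```

Seven rows give seven abundance values and seven variation values, which matches the 14 variables of the dataset.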

Results and Discussion
The optimization algorithms for all four thresholding techniques described in Section 3 were implemented in Visual Basic. The programs were compiled and run on a 64-bit high-performance computer with an Intel Core i7-8750H processor and 16 GB of DDR4-2666 memory.

Variable Reduction Using Mahalanobis-Taguchi System
The variables of all 20 datasets were optimized using the MTS procedures. Table 7 shows the number of variables retained for each dataset (After Optimize) against the original number of variables (Before Optimize). The retained (optimized) variables are those that the MTS suggests as significant for future prediction and classification purposes. Figures 8-10 illustrate the optimization results as SNR Plots and SNR Gain Charts. Owing to page limitations, only three datasets are displayed, namely Medical Diagnosis of Liver Disease, Wdbc and Spambase, since they showed larger variable reductions than the rest. The SNR Plots show the average SNR values at each level of the OA. The SNR Gain Charts show the SNR gain between the level averages for each variable in the dataset. Variables with positive SNR gains are considered useful for future purposes, while variables with negative SNR gains are considered insignificant and are discarded.
Table 7 shows that more than half (>50%) of the original variables were removed for the Wdbc, Spambase and Medical Diagnosis of Liver Disease datasets, while almost half (>40%) were removed for the Appendicitis and Coil2000 datasets. These reductions could significantly reduce the classification effort, since far fewer variables need to be processed for those datasets. In contrast, the Banana, Haberman-2, Monk2, Ring and Taguchi Character Recognition datasets showed no reduction in the number of variables when optimized using the MTS. This indicates that all original variables of these five datasets are significant and will be used in full for future classification purposes.
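The gain-based selection rule can be sketched as follows. The sketch takes the SNR level averages as given and does not reproduce the OA experiment or the SNR formula. It assumes the usual MTS convention that level 1 of the OA means "variable included" and level 2 means "variable excluded", and it discards variables with zero gain, a case the text does not specify. The variable names and SNR values are hypothetical.

```python
def select_by_snr_gain(snr_included, snr_excluded):
    """Keep variables whose average SNR is higher when included than when excluded."""
    # Gain = (level-1 average) - (level-2 average), in dB.
    gains = {name: snr_included[name] - snr_excluded[name] for name in snr_included}
    # Assumption: only strictly positive gains mark a useful variable.
    kept = [name for name, gain in gains.items() if gain > 0]
    return gains, kept

# Hypothetical level averages in dB for four variables.
snr_included = {"X1": -2.0, "X2": -5.5, "X3": -3.0, "X4": -4.0}
snr_excluded = {"X1": -4.5, "X2": -3.0, "X3": -3.5, "X4": -4.0}
gains, kept = select_by_snr_gain(snr_included, snr_excluded)
print(gains, kept)
```

Here X1 and X3 have positive gains and are retained, while X2 (negative gain) and X4 (zero gain) are dropped.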

Optimum Thresholds
With the optimized variables obtained via the MTS, the optimum threshold value (MD_T) for each optimized dataset was computed using the four thresholding methods described in Section 3. Table 8 tabulates the threshold values (MD_T) suggested by each method, which serve as the cut-off values for classifying the samples in the testing sets as either normal or abnormal. The optimum λ (λ_opt) and the MD_T in Box-Cox transformed terms are also included in the table, since they are required to obtain the optimum threshold values via the Box-Cox transformation process. In this study, a testing sample whose MD value is less than or equal to MD_T is denoted as normal; otherwise, it is considered abnormal.

Classification Accuracy Results
Table 9 shows the classification accuracy (in %) for each dataset based on the threshold values obtained via the Type-I-Type-II error, ROC curve, Chebyshev's Theorem (Probabilistic Thresholding) and Box-Cox transformation methods. The classification is conducted on the testing sets, which consist of normal and abnormal samples. The results indicate how well each MD_T differentiates the normal samples from the abnormal samples in the testing sets.
In general, the classification process starts by computing the MD values of all samples (both normal and abnormal) in the testing set. Each testing sample is then denoted as normal if its MD value is less than or equal to MD_T, and as abnormal otherwise. These decisions are compared against the true classes of the samples (normal or abnormal) to measure the classification accuracy.
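The decision rule and the accuracy measure can be sketched in a few lines of Python. This is a minimal sketch that assumes accuracy is the percentage of testing samples whose predicted class matches the true class. The function names, MD values and threshold are hypothetical.

```python
def classify(md_values, md_t):
    """Label a sample normal when its MD is at or below the threshold."""
    return ["normal" if md <= md_t else "abnormal" for md in md_values]

def accuracy(predicted, actual):
    """Percentage of samples whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical testing set, for illustration only.
md_test = [0.6, 1.2, 2.4, 0.9, 3.8, 1.9]
actual = ["normal", "normal", "normal", "normal", "abnormal", "abnormal"]
predicted = classify(md_test, md_t=2.0)
acc = accuracy(predicted, actual)
print(predicted, round(acc, 2))
```

In this example one normal sample (MD = 2.4) and one abnormal sample (MD = 1.9) fall on the wrong side of the threshold, so four of the six samples are classified correctly.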
In Table 9, the classification results for each dataset are shown, with bold fonts indicating the best classification performance among the four methods. Interestingly, no single thresholding method outperformed the others on most, let alone all, of the datasets. This finding is consistent with the no free lunch theorem [40], in that there is no single algorithm that suits all datasets. However, all thresholding methods gave an equivalent classification performance (74.21%) on the Titanic dataset. The Type-I-Type-II error, ROC curve and Box-Cox transformation methods also gave equivalent classification accuracies on the Appendicitis, Ionosphere and Spambase datasets, as well as on the Medical Diagnosis of Liver Disease dataset, for which the three methods achieved a perfect classification performance (100%). Despite the complexity of computing the optimum threshold with the Box-Cox transformation method, it produced a nearly perfect classification performance (98.14%) on the Ring dataset, higher than the other three methods, and matched the Type-I-Type-II error method on the Banana and Wisconsin datasets with accuracies of 68.49% and 96.20%, respectively. On the other hand, the Type-I-Type-II error method recorded the highest number of successes across all datasets, with 11 successes compared with the other three methods. Figure 11 illustrates this finding based on the results extracted from Table 8. Furthermore, the Type-I-Type-II error method seems favourable in this case, since computing its optimum threshold value is less complex than for the other three methods.
Figure 11. Frequency of successes over 20 datasets by all threshold methods.
From Table 9, it is interesting to see that different thresholding methods performed best on different datasets. For example, the Chebyshev's Theorem method outperformed the others on the Bupa, Coil2000, Haberman-2, Heart, Magic, Phoneme, Pima and Wdbc datasets. On the other hand, the Box-Cox transformation method was superior on the Ring and Sonar datasets, while the Type-I-Type-II error method was best on Monk2 and Spectfheart. These findings indicate that the suitability of each thresholding method depends on the dataset itself. One could therefore run all thresholding methods on a trial basis to select the most suitable one for a dataset of interest; however, this seems impractical and would increase the classification effort. Further studies should investigate which characteristics of a dataset make a given thresholding method suitable. Perhaps a systematic procedure could be developed to guide the decision process.
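A trial run of this kind amounts to tallying, for each method, the datasets on which it attains the best accuracy. The sketch below shows one way to do that. It is illustrative only: the accuracy values are hypothetical and are not those of Table 9, and the rule that every tied method is credited with a success is an assumption, since the counting rule is not stated here.

```python
def count_successes(results, tol=1e-9):
    """Count, per method, the datasets on which it attains the best accuracy."""
    wins = {}
    for accuracies in results.values():
        best = max(accuracies.values())
        for method, acc in accuracies.items():
            wins.setdefault(method, 0)
            # Assumption: every method tied for the best accuracy is credited.
            if best - acc <= tol:
                wins[method] += 1
    return wins

# Hypothetical accuracies (%), not the values reported in Table 9.
results = {
    "dataset_a": {"TypeI-II": 90.0, "ROC": 85.0, "Chebyshev": 88.0, "Box-Cox": 90.0},
    "dataset_b": {"TypeI-II": 70.0, "ROC": 72.5, "Chebyshev": 75.0, "Box-Cox": 71.0},
    "dataset_c": {"TypeI-II": 80.0, "ROC": 80.0, "Chebyshev": 80.0, "Box-Cox": 80.0},
}
wins = count_successes(results)
print(wins)
```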
Another interesting point is that, out of the 20 datasets, only eight (Coil2000, Ionosphere, Ring, Spambase, Wdbc, Wisconsin, Medical Diagnosis of Liver Disease and Taguchi Character Recognition) produced classification accuracies above 80% across all thresholding methods, where an accuracy above 80% is considered a promising prediction result [9]. The remaining datasets produced classification accuracies below 80% across all thresholding methods. The lowest classification accuracy was observed on the Spectfheart dataset, at only 29.85%, when the testing samples were classified using the threshold value suggested by the ROC curve method. This not only indicates that the ROC method is unsuitable for this dataset, but also suggests that the predictive capability of the MTS is unpromising for certain datasets. This could be due to the validity of the reduced variable set obtained during the optimization procedure of the MTS, in which the Orthogonal Array (OA) is used for feature selection. MD values are sensitive to the choice of variables in the classifier, since the computed MD value varies with different sets of significant variables. Therefore, obtaining the optimal set of significant variables is crucial in the MTS, particularly for the MS (the reference group).
Future studies should investigate the practicality of the OA as an effective scheme for significant feature selection in the MTS. This suggestion agrees with reports in the literature claiming that, for certain datasets, the feature selection search mechanism using the OA for variable reduction in the MTS is inadequate and leads to inaccurate and suboptimal solutions [41-45]. In those studies, the OA failed to explore other potentially optimal combinations of features, since its search structure was seen as insufficient for exploiting higher-order combinations among the variables in the datasets.
The use of Swarm Intelligence (SI)-based algorithms such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Bees Algorithms (BA) and Fish Algorithms (FA), to name a few, could be one alternative for handling this issue, as suggested by [25]. SI-based algorithms are metaheuristic in nature: the search mechanism is tailored to guide a specific optimization problem heuristically toward promising regions of the search space that contain good-quality solutions [46]. Further, the combination of exploration (diversification) and exploitation (intensification) in the SI search mechanism increases the ability to find optimal solutions in a reasonable time [47,48]. Hence, the strategies offered by these algorithms are worth exploring to address the weakness of the OA in this respect. Others have also suggested alternative methods to replace the OA in MTS, such as adaptive One-Factor-at-a-Time (aOFAT) [49] and Rough Set Theory [50]. A modification of the OA matrix structure itself using other orthogonal matrix theories, such as Paley's cyclic matrix or the Hadamard matrix [23], could also be worth considering.
In MTS, Taguchi recommended two types of signal-to-noise ratio (SNR): the "larger-the-better" SNR and the "Dynamic" SNR. The former was used in this work. The latter is another powerful selection metric that takes into account the level of abnormality of the input samples in its computation. Unlike the larger-the-better SNR, the Dynamic SNR formulation is quite complex, which makes its computation challenging; however, it may provide a more promising solution. Thus, exploiting what the Dynamic SNR could offer in improving the feature selection process of the MTS would be an encouraging direction for future research.
Nonetheless, this study focuses on comparing the classification performance of the MTS under the four thresholding methods mentioned previously. The study shows that no single thresholding method produced superior classification performance on all datasets. Nevertheless, the authors recommend the Type-I-Type-II error method over the other thresholding methods owing to its simplicity and lower computational burden. More studies with more datasets should be conducted in the future to support this generalization more strongly.

Conclusions
This paper provides a comparative study that evaluates the classification performance of the MTS and suggests the most appropriate thresholding method for the MTS methodology among four common thresholding methods, namely the Type-I-Type-II error method, the Probabilistic Thresholding Method, the ROC curve method and the Box-Cox transformation method. To the best of the authors' knowledge, no comparative work has been conducted to evaluate the effectiveness of these common thresholding methods on MTS classification performance over several datasets. The outcome of this study could provide initial insight into a general thresholding method that is suitable across several case data. The results show that no single thresholding method outperformed the others on most, let alone all, of the datasets, and that the effectiveness of each of the four thresholding methods in producing promising classification performance is dataset dependent. Hence, further studies to investigate the cause of this dependency are urged. In addition, the study found an unpromising predictive ability of the MTS in classifying several of the datasets. Improving the significant variable selection process of the MTS using several alternative approaches was suggested. PSO-based thresholding could also be considered as another thresholding alternative for improving MTS classification. Nevertheless, from this study, the Type-I-Type-II error method seems favourable due to its lower algorithmic complexity compared with the other three thresholding methods. It is also recommended that the computational time complexities of these algorithms be evaluated in the future to further support the findings.