Article

Robust Multivariate Shewhart Control Chart Based on the Stahel-Donoho Robust Estimator and Mahalanobis Distance for Multivariate Outlier Detection

1 Dammam Community College, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
2 Department of Mathematics and Statistics, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
3 Preparatory Year Mathematics Program, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(21), 2772; https://doi.org/10.3390/math9212772
Submission received: 8 October 2021 / Revised: 25 October 2021 / Accepted: 29 October 2021 / Published: 1 November 2021
(This article belongs to the Special Issue Probability and Statistics in Quality and Reliability Engineering)

Abstract

While researchers and practitioners can readily develop methods for detecting outliers in control charts under a univariate setup, detecting and screening outliers in multivariate control charts poses serious challenges. In this study, we propose a robust multivariate control chart based on the Stahel-Donoho robust estimator (SDRE), with the process parameters estimated from phase-I samples. Through intensive Monte-Carlo simulation, the study shows how parameter estimation and the presence of outliers affect the efficacy of the Hotelling T² chart, and how the proposed outlier detector restores the chart’s efficacy and sensitivity. Run-length properties are used as the performance measures, and they establish the superiority of the proposed scheme over the default multivariate Shewhart control charting scheme. The applicability of the study includes, but is not limited to, the manufacturing and health industries. The study concludes with a real-life application of the proposed chart to a dataset extracted from the manufacturing process of carbon fiber tubes.

1. Introduction

Outliers are observations, typically at the extremes, that do not follow the pattern of the majority of observations in a dataset. Outlier detection is of concern in data analysis across scientific areas, and statistical process control (SPC) is no exception [1]. Outliers have a major influence on any statistical analysis: they increase the error variance, reduce the power of statistical tests, and bias estimates, leading to incorrect inferences and conclusions and, in sectors such as health, sometimes to deadly decisions. Even at a small percentage and magnitude, outliers can grossly distort the analysis of a dataset, whether big or small. Outlier detection is therefore a prominent and important aspect of data analysis, even more so now that more and more data are being analyzed simultaneously, as in multivariate control charting.
Control charts are the most widely used of the seven tools of SPC [2]. Their vast applicability in different fields and sectors gives them an edge over the other SPC tools for process monitoring. Control charts can have a univariate or multivariate setup, be of a memory or memory-less type, and monitor the location or dispersion of an ongoing process. Readers are referred to [2] for more information about control charts and their types. Furthermore, control charts are implemented in two stages: phase-I (the retrospective stage) and phase-II (the prospective, or monitoring, stage). The process parameters are used to set the chart’s control limits in phase-I; if they are unknown, they are estimated from some preliminary samples. The monitoring and correction of unnatural causes of variation occur in phase-II. The choice and number of preliminary samples employed in estimating the unknown parameters in phase-I vary among practitioners and, as a result, affect the performance of the chart in the monitoring stage. These samples often contain unusual observations and outliers, which exert a disproportionate pull on the parameter estimates, making the chart less efficient in detecting anomalies. The multivariate Shewhart chart studied in this paper is a memory-less chart for monitoring location parameters, considered both when the process parameters are known and when they are estimated from phase-I samples. Over the years, SPC researchers have investigated the effect of parameter estimation on control charts in both univariate and multivariate setups. To mention a few, reference [3] reviewed the effects of parameter estimation on control charts. Saleh et al. [4] evaluated the effect of parameter estimation on the run length properties of an exponentially weighted moving average (EWMA) control chart. A similar study was conducted by Jones [5].
Many works in the literature have studied outlier detection in the univariate setup, and some have applied it to univariate control charts. References [6,7,8] independently proposed outlier detection models for univariate control charts, for either location or dispersion monitoring. They found that control charts based on detection models require fewer phase-I samples to detect anomalies, as these charts are quicker and more sensitive to contamination. Guarnieri et al. [9] developed control charts for individual observations and exponentially weighted moving averages based on residuals to detect outliers in autoregressive models. Bakar et al. [10] conducted a comparative study of outlier detection techniques in control charts with application in data mining. Vidmar and Blagus [11] applied different outlier detection approaches to healthcare quality monitoring. Zhang and Albin [12] employed a chi-square chart method for detecting outliers in complex profiles. Other research in this direction includes, among others, [13,14]. While there are models for detecting multivariate outliers, few of them have been applied to SPC. Examples include the robust multivariate control chart for outlier detection by Fan et al. [15] and the work on robust estimates, residuals, and outlier detection with multi-response data by Gnanadesikan and Kettenring [16]. The authors of [17] considered the minimum volume ellipsoid (MVE) and/or the weighted mean vector and mean square successive differences (WD) to decrease the impact of outliers on multivariate control charts. Hubert et al. [18] reviewed the minimum covariance determinant (MCD) methods and their extensions as competent tools for outlier detection. Other researchers have approached the outlier detection problem with robust multivariate estimators. The pioneer of this idea was Stahel [19], who studied the breakdown of covariance estimators; Maronna and Yohai [20] further extended Stahel’s research. Rousseeuw and Hubert [21] also studied robust multivariate location and scatter estimators. Similar studies include, but are not limited to, [22,23,24].
In the aforementioned references, none of the studies that applied multivariate robust estimators to control charts have focused on detecting and screening outliers of the phase-I samples. Therefore, this paper focuses on detecting multivariate outliers in the multivariate Shewhart control chart. It employs a Stahel-Donoho robust estimator incorporated with the Mahalanobis distance for detecting and screening out the outlying observations in the preliminary samples, from which the process parameters are estimated. This paper reports the effect of parameter estimations on a multivariate Shewhart chart’s control limits and performance. Reporting parameter estimations’ effect is not the main goal of this study; however, it helps readers to better understand the positive impact of the outlier detection process.
The remainder of this article is organized as follows. Section 2 presents the methodology, covering the multivariate Shewhart control chart when the process parameters are known and when they are estimated, the presence of outliers in the preliminary samples, and the proposed multivariate outlier detection process. Results and discussion appear in Section 3, while Section 4 gives an illustrative example with a real-life dataset extracted from the manufacturing process of carbon fiber tubes. Section 5 concludes the study with a summary of the findings and future recommendations.

2. Methodology

The aim of this study is to detect and screen outliers in the m preliminary samples employed for parameter estimation, especially when the samples are outlier prone. This section explains in detail the multivariate Shewhart control chart for location monitoring, both when the parameters are known and when they are estimated from phase-I preliminary samples. It then demonstrates how practitioners’ variability in the samples employed for estimation affects the chart’s performance. In addition, this section shows how outliers in those samples distort the chart’s efficacy and make it less sensitive, and it concludes with the proposed outlier detection-based multivariate Shewhart chart. The application to a real-life dataset from the carbon fiber tube manufacturing industry follows in Section 4.

2.1. Multivariate Shewhart Control Chart

Let $X = (X_1, X_2, \ldots, X_p)$ be a vector of p correlated quality characteristics, observed in subgroups of size n and drawn from a p-variate normal distribution, and let it be the characteristic of interest for monitoring in a multivariate process. The probability density function of X is given as follows:
$$f(X) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right), \quad -\infty < X_i < \infty,\ i = 1, 2, \ldots, p. \tag{1}$$
The resulting multivariate Shewhart chart statistic, termed Hotelling T², for monitoring the location parameter of the random process $X \sim N_p(\mu, \Sigma)$ is given as follows:
$$T_i^2 = n(\bar{X}_i - \mu)^{T}\Sigma^{-1}(\bar{X}_i - \mu), \tag{2}$$
where $\bar{X}_i$ is the mean vector of the ith subgroup, n is the sample size, and $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ and
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}$$
are the mean vector and variance-covariance matrix of the process. The chart signals an alarm when the $T_i^2$ statistic plots beyond the upper control limit (UCL) of the chart, i.e., when $T_i^2 > \mathrm{UCL} = \chi^2_{\alpha,p}$. This is the case when the process parameters $(\mu, \Sigma)$ are known. However, when the parameters are unknown, they are estimated from m phase-I preliminary samples. The Hotelling T² statistic then becomes
$$T_i^2 = n(\bar{X}_i - \hat{\mu})^{T}S^{-1}(\bar{X}_i - \hat{\mu}), \tag{3}$$
where $\hat{\mu} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} X_{i,j}$ and $S = \frac{1}{m(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n}(X_{i,j} - \bar{X}_i)(X_{i,j} - \bar{X}_i)^{T}$ are the estimates of the in-control mean vector and variance-covariance matrix obtained from the phase-I samples. It is important to note that the number m of phase-I samples and the choice of estimators employed for estimating the parameters vary amongst practitioners, hence the variability in the efficacy and performance of their charts. Subsequently, the corresponding UCL of the $T_i^2$ statistic in (3), for the monitoring stage, phase-II, is given as follows:
$$\mathrm{UCL} = \frac{p(m+1)(n-1)}{mn-m-p+1}\,F_{\alpha,\,p,\,mn-m-p+1}. \tag{4}$$
Again, if $T_i^2 > \mathrm{UCL}$, the chart sends a signal so that the practitioner can attend to the unnatural cause of variation. The index i of the observation at which the first signal occurs is the run length, i.e., the number of observations plotted until the first out-of-control (OoC) signal is recorded. Over many iterations, the run length becomes a random variable whose properties are used for evaluating the chart.
All this explains the traditional method for constructing the multivariate Shewhart chart for location monitoring. The next section establishes parameter estimation effects on the multivariate Shewhart chart. Section 2.3, on the other hand, reveals how the outliers emanating from the phase-I sample negatively affect the chart’s performance, while Section 2.4 highlights the need for incorporating multivariate robust estimators for outlier detection.
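As a minimal sketch of the statistics in (2) and (3), the following Python code computes Hotelling T² for subgroup means and the pooled phase-I estimates. The function names (`hotelling_t2`, `estimate_phase1`) and the array layout (m subgroups by n observations by p characteristics) are assumptions made for illustration; the study’s own implementation was written in R and is not reproduced here.

```python
import numpy as np


def hotelling_t2(xbar, mu, sigma, n):
    """T^2 = n (xbar - mu)^T sigma^{-1} (xbar - mu) for each row of xbar."""
    d = np.atleast_2d(xbar) - mu
    # Solve sigma * z = d^T instead of forming the explicit inverse.
    z = np.linalg.solve(sigma, d.T).T
    return n * np.sum(d * z, axis=1)


def estimate_phase1(samples):
    """Estimate (mu_hat, S) from phase-I data of shape (m, n, p).

    mu_hat is the grand mean of all m*n observations; S pools the
    within-subgroup scatter and divides by m*(n-1), as in the text.
    """
    m, n, p = samples.shape
    mu_hat = samples.reshape(m * n, p).mean(axis=0)
    centered = samples - samples.mean(axis=1, keepdims=True)
    flat = centered.reshape(m * n, p)
    s = flat.T @ flat / (m * (n - 1))
    return mu_hat, s


# Illustrative use with simulated data (values are not from the study).
rng = np.random.default_rng(1)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
phase1 = rng.multivariate_normal(np.zeros(2), sigma, size=(25, 5))  # m=25, n=5, p=2
mu_hat, s = estimate_phase1(phase1)
t2 = hotelling_t2(phase1.mean(axis=1), mu_hat, s, n=5)  # one T^2 per subgroup
```

Solving the linear system rather than inverting the covariance matrix is a numerical convenience only; it yields the same quadratic form as (2) and (3).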

2.2. Effect of Practitioners’ Variabilities on the Multivariate Shewhart Chart

In this section, the study shows how practitioners’ variability in the choice of m affects the multivariate Shewhart chart’s performance. Through intensive Monte-Carlo simulation, we demonstrate how the number m of phase-I samples used for estimating the unknown parameters plays a vital role in the performance of the multivariate Shewhart chart as compared to the known-parameter case. This study considers m = 25, 100, and 500 to represent small, medium, and large samples, respectively. An algorithm was developed in the R language to simulate the multivariate Shewhart chart defined in (2) for the known-parameter case and in (3) for the unknown case. For the known case, it was assumed that the mean vector was zero, the variances were unity, and the covariances were 0.5 (i.e., $\sigma_{ii} = 1$ and $\sigma_{ij} = 0.5$ for $i \neq j$). With p = 2, 3 and α = 0.0027, the in-control (IC) average run length (ARL₀) corresponded to 370. For the unknown cases, the process parameters were estimated from m = 25, 100, 500 samples with the sample mean vector $\hat{\mu}$ and covariance matrix S. The algorithm also considered the OoC situations, in which the mean vector increased over a range of shifts δ ∈ [0, 5]. The first effect of estimation appeared in the UCL, which varied with m so as to yield the nominal ARL₀ of 370, as in the known case. The simulation results are presented in Table 1 and Table 2 and are discussed in detail in Section 3.
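The correspondence between α = 0.0027 and the nominal ARL₀ of 370 follows from a standard argument, given here as a worked step under the assumption of known parameters and independent plotting statistics: each in-control point then signals with probability α, so the run length (RL) is geometric. The closed-form chi-square quantile for p = 2 is a standard identity, not a value taken from the study’s tables.

```latex
\begin{align*}
P(\mathrm{RL} = k) &= (1-\alpha)^{k-1}\alpha, \quad k = 1, 2, \ldots \\
\mathrm{ARL}_0 &= \frac{1}{\alpha} = \frac{1}{0.0027} \approx 370.4 \\
\mathrm{SDRL}_0 &= \frac{\sqrt{1-\alpha}}{\alpha} \approx 369.9 \\
\chi^2_{\alpha,2} &= -2\ln\alpha \approx 11.83 \quad (p = 2 \text{ only})
\end{align*}
```

When the parameters are estimated, all plotting statistics share the same estimates and are no longer independent, so the geometric argument does not apply directly and the run length properties are obtained by simulation.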

2.3. Effect of Outliers on the Multivariate Shewhart Control Chart with Estimated Parameters

Having noted the estimation effect on the multivariate Shewhart chart’s performance in the previous section, we now demonstrate how outliers in the m phase-I samples worsen the chart’s performance in the monitoring stage. To this end, we generated the m phase-I samples from a mixed distribution, with (1 − θ)100% drawn from the normal distribution and the remaining θ100% contaminated by a chi-square distribution with v degrees of freedom, as follows:
$$X \sim (1-\theta)\,N_p(\mu, \Sigma) + \theta\left[N_p(\mu, \Sigma) + \omega\,\chi^2_{(v)}\right], \tag{5}$$
where θ > 0 represents the percentage of outliers present in the data, ω ≥ 1 is the magnitude of the outliers, and $\chi^2_{(v)}$ represents the outlier added to the normal distribution. The study estimated the parameters $\hat{\mu}$ and S from the m samples and then computed the Hotelling T² statistic as in (3). The same algorithm, process parameters, and control limits employed in Section 2.2 were used to compute the IC run length properties alone, in order to observe the outliers’ effect. The results are presented in Table 3 and Table 4 for magnitudes ω = 1 and ω = 2, respectively. With just 10% of outliers (θ = 0.10), the ARL₀ increased by more than 600% of its expected value when ω = 1 and by close to 3000% when ω = 2.
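The contaminated phase-I sampling in (5) can be sketched as follows. This is an illustration only: the function name `contaminated_phase1` is hypothetical, and the way the chi-square term enters (one draw per contaminated observation, added to every coordinate) is an assumption, since the text does not specify it.

```python
import numpy as np


def contaminated_phase1(rng, m, n, mu, sigma, theta, omega, v):
    """Draw m subgroups of size n from the mixed distribution in (5).

    Returns the data, shape (m, n, p), and a boolean mask, shape (m, n),
    marking which observations were contaminated.
    """
    x = rng.multivariate_normal(mu, sigma, size=(m, n))
    # Each observation is contaminated independently with probability theta.
    is_outlier = rng.random((m, n)) < theta
    # Assumption: one chi-square(v) draw per contaminated observation,
    # scaled by omega and added to all p coordinates.
    bump = omega * rng.chisquare(v, size=(m, n))
    x = x + (is_outlier * bump)[..., None]
    return x, is_outlier


# Illustrative call: p = 3, unit variances, covariances 0.5, 10% outliers.
rng = np.random.default_rng(2021)
mu = np.zeros(3)
sigma = 0.5 * np.eye(3) + 0.5
data, mask = contaminated_phase1(rng, m=25, n=5, mu=mu, sigma=sigma,
                                 theta=0.10, omega=1.0, v=5)
```

Returning the mask alongside the data makes it possible to check afterwards how many of the injected outliers a screening step actually removes.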
The findings from the results in this section and the previous section suggest the following options:
  • The m phase-I sample should be sufficiently increased until results similar to those of the known case are achieved.
  • The process should prevent the occurrence of unnatural variations and outliers when smaller m phase-I samples are used.
Neither option is practical in real-life scenarios: increasing the number of samples is typically uneconomical, and a process cannot be freed from variations due to natural or assignable causes. Hence, there is a need to incorporate robust multivariate estimators for better estimation and for screening of the outliers.

2.4. Proposed Multivariate Shewhart Chart Based on Stahel-Donoho Robust Estimators (SDRE)

From the results in Table 3 and Table 4, it is apparent that increasing m cannot suppress the negative impact of the outliers on the chart. Hence, there is a need to employ robust location and dispersion estimators, which are insensitive to outliers, as substitutes for the default $\hat{\mu}$ and S. Therefore, this study proposes a multivariate Shewhart chart based on the Stahel-Donoho robust estimator. Like any robust estimator, the SDRE retains its efficiency in the presence of outliers. This feature enables it to detect outliers no matter how small or large m is. Readers are referred to [25,26,27] for more information about the merits of robust estimators.
Stahel [19] and Donoho [22] were the first to develop a robust equivariant estimator of multivariate location and dispersion with a considerably high breakdown point for any p-variate data. However, it became well known through the analysis of Maronna and Yohai [20]. Maronna and Yohai [20] assumed $X = \{x_1, x_2, \ldots, x_n\}$ to be a set of n data points in $\mathbb{R}^p$ and defined the “outlyingness” r of any $y \in \mathbb{R}^p$ as $r(y, X) = \sup_{a} r_1(y, a, X)$, where $r_1(y, a, X) = |a^{T}y - \mu(a^{T}X)| / \sigma(a^{T}X)$, a is a projection direction in $\mathbb{R}^p$, and μ(·) and σ(·) are robust univariate location and dispersion statistics. The Stahel-Donoho robust estimators (SDRE), denoted as (t, V), are defined as a weighted mean vector and a weighted covariance matrix, with weights of the form w(r), where the weight $w_i$ of each observation is inversely related to its “outlyingness” r, obtained by considering all univariate projections of the data. Mathematically, the SDRE is written as follows:
$$t(X) = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i} \quad \text{and} \quad V(X) = \frac{\sum_{i=1}^{n} w_i (X_i - t)(X_i - t)^{T}}{\sum_{i=1}^{n} w_i}, \tag{6}$$
where $w_i = w(r(x_i, X))$. The SDRE is then used to estimate the process parameters from the m phase-I samples instead of $\hat{\mu}$ and S. Furthermore, the (t, V) estimators are employed in the Mahalanobis distance to screen out the potential outliers present in the m samples, as in (7):
$$D(X, t) = n(X - t)^{T}V^{-1}(X - t). \tag{7}$$
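The estimator in (6) and the screening distance in (7) can be approximated as in the sketch below. Several choices here are assumptions rather than details given in the text: the supremum over directions is replaced by a maximum over random unit directions, μ(·) and σ(·) are taken to be the median and the scaled median absolute deviation, the weight function is a Huber-type rule with a hypothetical tuning constant `c`, and the screening `cutoff` is left to the user. The function names are likewise hypothetical.

```python
import numpy as np


def stahel_donoho(x, n_dirs=500, c=3.0, rng=None):
    """Approximate SDRE (t, V) for data x of shape (N, p)."""
    rng = np.random.default_rng() if rng is None else rng
    p = x.shape[1]
    # Random unit directions stand in for the supremum over all a.
    a = rng.standard_normal((n_dirs, p))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    proj = x @ a.T                                   # shape (N, n_dirs)
    med = np.median(proj, axis=0)
    mad = 1.4826 * np.median(np.abs(proj - med), axis=0)
    r = np.max(np.abs(proj - med) / mad, axis=1)     # outlyingness per row
    # Assumed weight function: full weight up to c, then (c / r)^2.
    w = np.minimum(1.0, (c / r) ** 2)
    t = w @ x / w.sum()
    d = x - t
    v = (w[:, None] * d).T @ d / w.sum()
    return t, v, r


def screen_outliers(x, t, v, cutoff, n=1):
    """Flag rows whose distance (7), n (x - t)^T V^{-1} (x - t), exceeds cutoff."""
    d = x - t
    dist = n * np.sum(d * np.linalg.solve(v, d.T).T, axis=1)
    return dist > cutoff
```

After screening, the default estimates $\hat{\mu}$ and S would be recomputed from the rows that are not flagged, which mirrors the order of operations described in the text. A larger number of directions gives a closer approximation to the supremum at a higher computational cost.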

2.5. The Algorithm

This section explains in detail the algorithm and performance evaluation adopted in this study. The major performance measures of a control chart are its run length properties: the average run length (ARL) and the standard deviation of the run length (SDRL). Through the Monte-Carlo simulation approach, the run length properties of the scheme were computed for both the IC state (ARL₀ and SDRL₀) and the OoC state (ARL₁ and SDRL₁). The following is the algorithm developed in the R language to achieve this aim:
  1. Generate 10⁶ random observations of the p-variate quality characteristics, each of sample size n = 5, from a multivariate normal distribution to be monitored in the phase-II stage.
  2. (a) Known case: Define the mean vector and covariance matrix, then proceed to step 3.
     (b) Unknown case: Generate m phase-I samples from the same distribution to compute the default mean vector and covariance matrix estimators ($\hat{\mu}$ and S), then proceed to step 3 (see Section 2.2).
     (c) Unknown case with outliers: Generate m phase-I samples from the mixed distribution defined in (5), compute the default mean vector and covariance matrix estimators ($\hat{\mu}$ and S), and then proceed to step 3 (see Section 2.3).
     (d) Unknown case with outliers screened: Generate m phase-I samples from the mixed distribution defined in (5), compute the SDRE (t, V) in (6), employ the SDRE to screen the outliers as explained in (7), and then compute $\hat{\mu}$ and S from the dataset remaining after screening. Then, proceed to step 3 (see Section 2.4).
  3. Calculate the $T_i^2$ statistic in (2) for the known-parameter case or in (3) for the unknown cases, as the case may be.
  4. Plot the $T_i^2$ statistic against the control limit, UCL, until the first observation i plots beyond the UCL. For the known case, $\mathrm{UCL} = \chi^2_{\alpha,p}$, while for the unknown cases, use the UCL defined in (4).
  5. Record the index i of the observation where the signal occurred as the run length.
  6. Repeat steps 1–5 for 10⁵ iterations. Record the run length for each iteration. Then, calculate the average and standard deviation of the run lengths as ARL₀ and SDRL₀, respectively.
The algorithm is summarized in the flowchart presented in Figure 1.
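The steps above can be sketched for the known-parameter case as follows. This is a scaled-down illustration, not the study’s R code: it assumes p = 2 so that the chi-square limit has the closed form −2 ln α, it uses far fewer iterations than the 10⁵ reported, it generates subgroups in chunks until a signal occurs, and it applies the shift δ equally to every component of the mean vector. The function names are hypothetical.

```python
import numpy as np


def simulate_run_length(rng, mu, sigma, n, ucl, delta=0.0, chunk=500):
    """Return the index of the first subgroup whose T^2 in (2) exceeds ucl."""
    p = mu.size
    chol = np.linalg.cholesky(sigma)
    sigma_inv = np.linalg.inv(sigma)
    count = 0
    while True:
        # chunk subgroups of n observations from N_p(mu + delta, sigma).
        z = rng.standard_normal((chunk, n, p))
        x = mu + delta + z @ chol.T
        d = x.mean(axis=1) - mu          # subgroup means minus the IC mean
        t2 = n * np.einsum("ij,jk,ik->i", d, sigma_inv, d)
        hits = np.flatnonzero(t2 > ucl)
        if hits.size:
            return count + hits[0] + 1   # run length counts the signalling point
        count += chunk


def run_length_properties(rng, mu, sigma, n, ucl, delta=0.0, reps=1000):
    """Monte-Carlo estimates of (ARL, SDRL) over reps independent runs."""
    runs = np.array([simulate_run_length(rng, mu, sigma, n, ucl, delta)
                     for _ in range(reps)])
    return runs.mean(), runs.std(ddof=1)


alpha = 0.0027
mu = np.zeros(2)
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
ucl = -2.0 * np.log(alpha)   # exact chi-square upper quantile, valid for p = 2 only
```

Calling `run_length_properties(np.random.default_rng(0), mu, sigma, n=5, ucl=ucl)` returns one pair of in-control estimates; for the unknown-parameter cases, `mu` and `sigma` inside the statistic would be replaced by the phase-I estimates and the limit by (4).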

3. Results and Discussions

This section presents and discusses the results and findings of the methodologies explained in Section 2, in three categories: (a) the effect of practitioners’ variabilities on the chart, (b) the effect of outliers on the chart’s performance, and (c) the improvement of the proposed SDRE-based multivariate chart. The performance measure of this study was the run-length properties. The IC ARL₀ and SDRL₀ were expected to be sufficiently large, close to the nominal ARL₀ = 370, while the OoC ARL₁ and SDRL₁ were expected to be substantially smaller, implying the chart’s ability to quickly detect anomalies in the process.

3.1. Practitioners’ Variability Effect on Multivariate Shewhart Chart

Following the algorithm in Section 2.5, Table 1 and Table 2 contain the run length properties of the multivariate Shewhart chart for the known-parameter case and the estimated-parameter cases with p = 2 and 3, respectively. The parameter estimation effect on the chart’s performance is evident from the ARL and SDRL values. The different values of m represent the variability in practitioners’ choices, ranging from small to medium to large. The larger m is, the closer the chart’s performance comes to that of the known case. When δ = 0, both the ARL₀ and SDRL₀ were expected to cluster around the nominal value of 370. The ARL values did so, but the SDRLs of the estimated-parameter scenarios did not: they departed from 370, and the disparity became even wider as m got smaller. Another major effect was that the charts with estimated parameters were less sensitive to shifts, as their ARL₁ and SDRL₁ values imply. This effect also worsened as m decreased (see Table 1 and Table 2).

3.2. Effect of Outliers on the Multivariate Shewhart Chart’s Performance

Table 3 and Table 4 report the in-control ARL₀ and SDRL₀ of the multivariate Shewhart chart under the mixed distribution in (5), with p = 3, outlier percentages θ ∈ [0%, 10%], and magnitudes ω = 1, 2. Here, all values should approximate the nominal ARL of 370, since the process was IC. θ = 0% implies the absence of outliers in the process, so the ARL values for all m clustered around 370, while the SDRL values did not. It can be easily observed from the two tables that the outliers’ effect on the chart worsened as the percentage and magnitude of outliers increased. Also, the effect on the ARL values was more obvious as m increased, and vice-versa for the SDRL values. In general, there was an increment of more than 600% in the ARL and SDRL values when ω = 1 and of more than 3000% when ω = 2. All of these were due to no more than 10% outliers in the data.

3.3. Improvement of the Proposed SDRE-Based Multivariate Shewhart Chart

Here, we present and discuss the results of the proposed multivariate Shewhart chart based on the SDRE and Mahalanobis distance for detecting and screening out the multivariate outliers, as described in Section 2.4. Table 5 and Table 6 contain the IC ARL and SDRL as a remedy to the results in Table 3 and Table 4, respectively. These results were obtained by applying the algorithm given in Section 2.5 (with part (d) of step 2). The improvement in the multivariate Shewhart chart’s performance is easily noticeable. When the magnitude ω = 1, the ARL values decreased by more than 25% in comparison with the unscreened case, while a decrement of more than 70% was achieved when ω = 2; the recoveries in the SDRL were even better. The SDRE-based multivariate Shewhart chart could not fully restore the chart’s performance to the nominal ARL = 370; however, the recorded improvements are remarkable.
Furthermore, the rate of improvement increased as the percentage and magnitude of outliers increased. Figure 2, Figure 3, Figure 4 and Figure 5 depict the results in Table 3 and Table 4 (outliers without screening) side-by-side with those in Table 5 and Table 6 (SDRE outlier screening) so that the improvements can be observed closely.
The standard errors of the run length properties reported in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6 were between 0.066% and 0.506%. These values support the precision of the ARL and SDRL values. In addition, in Table 1 and Table 2, the results of the known cases are the best and the ideal results, and the unknown-case results improved and converged to those of the known cases as the number m of phase-I samples increased. For the outlier cases in Table 3, Table 4, Table 5 and Table 6, the outliers’ effect became more pronounced as the percentage, θ, and magnitude, ω, of outliers increased. These points further support the precision and consistency of the reported results.

4. Illustrative Example with Real-Life Dataset

In the manufacturing industry, carbon fiber tubes are a crucial and widely used material in numerous applications. They are preferred over many traditional materials such as aluminum, titanium, and steel because of their unique features: resistance to fatigue, high strength-to-weight and stiffness-to-weight ratios, resistance to corrosion, dimensional stability, and many more. This has resulted in carbon fiber gaining vast application in the manufacturing industry. The manufacturing process of carbon fibers is partly chemical and partly mechanical. The fibers are mostly made of carbon atoms that are bonded together in microscopic crystals. The manufacturing process goes through spinning, stabilizing, carbonizing, surface treating, and sizing. The fibers are long, thin strands of material. The minute size of carbon fibers requires close monitoring of the manufacturing process. In this study, we monitored three quality characteristics in the manufacturing process of a specific carbon fiber tubing. The characteristics are the inner diameter, thickness, and length of the tubes in inches.
The data were in two stages: in phase-I, each quality characteristic consisted of m = 25 sample points, each of size n = 5. Phase-II consisted of 20 observations, each of size n = 5, for every quality characteristic. Without any loss of generality, and in conformity with the aim of the study, the illustrative example was divided into three cases:
  • Case 1-Parameter estimation: Here, we employed the phase-I data to compute the default mean vector and covariance matrix ($\hat{\mu}$ and S), assuming the process parameters were unknown, and then used the estimates to compute the plotting statistics $T_i^2$ for monitoring the phase-II data as explained in (3), which were plotted against the UCL.
  • Case 2-Parameter estimation with outliers: We infused θ = 7% of outliers with a magnitude ω = 3 and degrees of freedom v = 5 into the phase-I data to simulate the mixed distribution described in (5), obtained the default parameter estimates ($\hat{\mu}$ and S), and then used the estimates to compute the plotting statistics $T_i^2$ for monitoring the phase-II dataset, which were plotted against the UCL.
  • Case 3-Parameter estimation with outliers and screening: The third case was similar to the second, but we used the SDRE (t, V) as in (6) to estimate the process parameters from the 25 phase-I samples and employed the SDRE in the Mahalanobis distance to detect and screen out the outliers. We then computed the default parameter estimates ($\hat{\mu}$ and S) from the remaining screened data and the plotting statistics $T_i^2$ for monitoring the phase-II dataset, which were plotted against the same UCL.
The summaries of the parameter estimates for these three cases are given in Table 7, while their plotting statistics $T_i^2$ and the corresponding decisions are given in Table 8. In case 1, all observations are IC, as they fall below the UCL = 15.16, except the fourth observation, whose statistic of 19.4183 plots beyond the UCL. This case represents the situation in which the process parameters are estimated from preliminary samples without outliers. In case 2, all the plotting statistics $T_i^2$ were below the control limit despite the presence of outliers. The fourth observation, which plotted beyond the UCL in case 1, was masked by the outliers’ effect. Case 2 reveals the effect of outliers in the preliminary samples. It also shows the inferiority of the default mean vector and covariance matrix for estimating parameters, especially when the samples are prone to outliers. In case 3, the fourth observation, which was OoC in case 1, was again detected as OoC. With the same magnitude and percentage of outliers as in case 2, case 3 was as efficient as case 1, in which there were no outliers. This substantiates the improvement offered by the proposed SDRE and Mahalanobis distance procedure for estimating parameters and detecting outliers, as claimed by the simulation results. Figure 6 gives a visual representation of Table 8.

5. Conclusions

This research paper evaluated the in-control performance of the multivariate Shewhart control chart when the parameters were estimated from phase-I samples that were prone to outliers. The study observed the negative effect of estimation and outliers on the chart’s performance. Hence, we proposed a more efficient and robust multivariate Shewhart chart based on the Stahel-Donoho robust estimators and the Mahalanobis distance to detect and screen outliers from the phase-I samples. Through the Monte-Carlo simulation approach, the ARL and SDRL were computed for small, medium, and large numbers of phase-I samples. The findings show that, in the presence of outliers, the effect on the chart’s performance was severe even with large phase-I samples. The results further show that the proposed chart based on the SDRE and Mahalanobis distance substantially recovered the efficiency of the multivariate Shewhart chart, even with smaller phase-I samples. Therefore, it is rational to incorporate the SDRE and Mahalanobis distance in default multivariate Shewhart structures, especially when the process parameters are estimated from phase-I samples prone to outliers. The findings of this study were substantiated with a real-life application in the manufacturing industry, where three quality characteristics of carbon fiber tubes were monitored. The scope of this study was limited to monitoring the location parameter in a multivariate Shewhart chart. However, the study can be extended to monitoring dispersion parameters in multivariate Shewhart charts and to other charting schemes, such as the multivariate cumulative sum (MCUSUM) and multivariate exponentially weighted moving average (MEWMA) charts.

Author Contributions

Conceptualization, I.A.R. and N.A.; methodology, I.A.R., N.A. and M.R.; software, I.A.R. and M.R.A.; validation, N.A. and M.R.; formal analysis, I.A.R., N.A., M.R.A. and M.R.; investigation, I.A.R. and N.A.; resources, I.A.R.; data curation, I.A.R. and N.A.; writing—original draft preparation, I.A.R.; writing—review and editing, I.A.R., N.A., M.R.A. and M.R.; visualization, N.A., M.R.A. and M.R.; supervision, N.A., M.R.A. and M.R.; project administration, N.A. and M.R.; funding acquisition, N.A. and M.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Deanship of Scientific Research (DSR) at King Fahd University of Petroleum and Minerals (KFUPM) under Project No. SB191047.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulated data used in this article may be generated in R, using parameter values and the algorithm in Section 2.5. The real data set is available from the book referenced in [2].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hawkins, D.M. Identification of Outliers; Chapman and Hall: London, UK, 1980.
  2. Montgomery, D.C. Introduction to Statistical Quality Control, 6th ed.; John Wiley & Sons, Inc.: New York, NY, USA, 2009.
  3. Psarakis, S.; Vyniou, A.K.; Castagliola, P. Some recent developments on the effects of parameter estimation on control charts. Qual. Reliab. Eng. Int. 2014, 30, 1113–1129.
  4. Saleh, N.A.; Mahmoud, M.A.; Jones-Farmer, L.A.; Zwetsloot, I.; Woodall, W.H. Another look at the EWMA control chart with estimated parameters. J. Qual. Technol. 2015, 47, 363–382.
  5. Jones, L.A. The statistical design of EWMA control charts with estimated parameters. J. Qual. Technol. 2018, 34, 277–288.
  6. Abbas, N.; Abujiya, M.R.; Riaz, M.; Mahmood, T. Cumulative sum chart modeled under the presence of outliers. Mathematics 2020, 8, 269.
  7. Raji, I.A.; Lee, M.H.; Riaz, M.; Abujiya, M.R.; Abbas, N. Outliers detection models in Shewhart control charts; An application in photolithography: A semiconductor manufacturing industry. Mathematics 2020, 8, 857.
  8. Abbas, N. A robust S² control chart with Tukey’s and MAD outlier detectors. Qual. Reliab. Eng. Int. 2020, 36, 403–413.
  9. Guarnieri, J.P.; Souza, A.M.; Jacobi, L.F.; Reichert, B.; da Veiga, C.P. Control chart based on residues: Is a good methodology to detect outliers? J. Ind. Eng. Int. 2019, 15, 119–130.
  10. Bakar, Z.A.; Mohemad, R.; Ahmad, A.; Deris, M.M. A comparative study for outlier detection techniques in data mining. In Proceedings of the 2006 IEEE Conference on Cybernetics and Intelligent Systems, Bangkok, Thailand, 7–9 June 2006.
  11. Vidmar, G.; Blagus, R. Outlier detection for healthcare quality monitoring-A comparison of four approaches to over-dispersed proportions. In Quality and Reliability Engineering International; John Wiley and Sons Ltd.: Hoboken, NJ, USA, 2014; Volume 30, pp. 347–362.
  12. Zhang, H.; Albin, S. Detecting outliers in complex profiles using a χ² control chart method. IIE Trans. (Inst. Ind. Eng.) 2009, 41, 335–345.
  13. Manenti, F.; Buzzi-Ferraris, G. Criteria for outliers detection in nonlinear regression problems. Comput. Aided Chem. Eng. 2009, 26, 913–917.
  14. Militino, A.F.; Palacios, M.B.; Ugarte, M.D. Outliers detection in multivariate spatial linear models. J. Stat. Plan. Inference 2006, 136, 125–146.
  15. Fan, S.-K.S.; Huang, H.-K.; Chang, Y.-J. Robust multivariate control chart for outlier detection using hierarchical cluster tree in SW2. Qual. Reliab. Eng. Int. 2013, 29, 971–985.
  16. Gnanadesikan, R.; Kettenring, J.R. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 1972, 28, 81.
  17. Pranata, A.; Sadik, K. Comparison of Hotelling, MVE and WD for Detecting Outlier in Robust Multivariate Control Chart. Int. J. Sci. Eng. Res. 2016, 7, 1138–1142.
  18. Hubert, M.; Debruyne, M.; Rousseeuw, P.J. Minimum covariance determinant and extensions. Wiley Interdiscip. Rev. Comput. Stat. 2018, 10, e1421.
  19. Stahel, W.A. Breakdown of Covariance Estimators; Eidgenössische Technische Hochschule: Zürich, Switzerland, 1981.
  20. Maronna, R.A.; Yohai, V.J. The behavior of the Stahel-Donoho robust multivariate estimator. J. Am. Stat. Assoc. 1995, 90, 330–341.
  21. Rousseeuw, P.; Hubert, M. High-breakdown estimators of multivariate location and scatter. In Robustness and Complex Data Structures: Festschrift in Honour of Ursula Gather; Springer: Berlin/Heidelberg, Germany, 2013; pp. 49–66. ISBN 9783642354946.
  22. Donoho, D.L. Breakdown Properties of Multivariate Location Estimator; Harvard University: Cambridge, MA, USA, 1982.
  23. Ghorbani, H. Mahalanobis distance and its application for detecting multivariate outliers. Math. Inf. 2019, 34, 583–595.
  24. Zuo, Y.; Cui, H.; He, X. On the Stahel-Donoho estimator and depth-weighted means of multivariate data. Ann. Stat. 2004, 32, 167–188.
  25. Abid, M.; Nazir, H.Z.; Riaz, M.; Lin, Z. In-control robustness comparison of different control charts. Trans. Inst. Meas. Control 2017, 40, 3860–3871.
  26. Abid, M.; Nazir, H.Z.; Tahir, M.; Riaz, M.; Abbas, T. A Comparative Analysis of Robust Dispersion Control Charts with Application Related to Health Care Data. J. Test. Eval. 2019, 48, 247–259.
  27. Zwetsloot, I.M.; Schoonhoven, M.; Does, R.J.M.M. Robust point location estimators for the EWMA control chart. Qual. Technol. Quant. Manag. 2016, 13, 29–38.
Figure 1. The flowchart of the methodology.
Figure 2. In-control ARL values of the multivariate Shewhart chart from a mixed distribution with and without SDRE multivariate outliers screening (ω = 1).
Figure 3. In-control SDRL values of the multivariate Shewhart chart from a mixed distribution with and without SDRE multivariate outliers screening (ω = 1).
Figure 4. In-control ARL values of the multivariate Shewhart chart from a mixed distribution with and without SDRE multivariate outliers screening (ω = 2).
Figure 5. In-control SDRL values of the multivariate Shewhart chart from a mixed distribution with and without SDRE multivariate outliers screening (ω = 2).
Figure 6. The multivariate Shewhart charts from real-life data extracted from carbon fiber tubes.
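Table 1. ARL and SDRL values of the multivariate Shewhart control chart with p = 2.
Columns for m = 25, 100, and 500 are the unknown case (parameters estimated); the last two columns are the known case.

| δ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) | ARL (known) | SDRL (known) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 369.38 | 529.24 | 369.101 | 392.986 | 370.2419 | 392.7475 | 370.50 | 370.19 |
| 0.50 | 230.28 | 338.60 | 207.208 | 228.794 | 207.52 | 210.96 | 201.90 | 202.24 |
| 1.00 | 85.73 | 128.35 | 72.408 | 79.334 | 68.79 | 69.71 | 67.28 | 66.57 |
| 1.50 | 29.91 | 42.31 | 25.617 | 27.083 | 23.58 | 23.57 | 23.28 | 22.94 |
| 2.00 | 11.82 | 15.50 | 9.918 | 10.088 | 9.63 | 9.37 | 9.45 | 8.92 |
| 2.50 | 5.20 | 5.84 | 4.733 | 4.482 | 4.66 | 4.12 | 4.59 | 4.12 |
| 3.00 | 2.87 | 2.72 | 2.653 | 2.124 | 2.57 | 2.01 | 2.57 | 1.97 |
| 3.50 | 1.86 | 1.42 | 1.756 | 1.173 | 1.71 | 1.13 | 1.70 | 1.09 |
| 4.00 | 1.39 | 0.80 | 1.351 | 0.693 | 1.33 | 0.66 | 1.32 | 0.65 |
| 4.50 | 1.16 | 0.45 | 1.143 | 0.412 | 1.14 | 0.40 | 1.13 | 0.38 |
| 5.00 | 1.06 | 0.27 | 1.053 | 0.239 | 1.05 | 0.23 | 1.05 | 0.22 |

UCL = 12.27 (m = 25), 11.96 (m = 100), 11.87 (m = 500), and 11.83 (known case).
Note: p is the number of quality characteristics, δ is the shift, m is the number of phase-I samples, UCL is the upper control limit, ARL is the average run length, and SDRL is the standard deviation of the run length.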
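Table 2. ARL and SDRL values of the multivariate Shewhart control chart with p = 3.
Columns for m = 25, 100, and 500 are the unknown case (parameters estimated); the last two columns are the known case.

| δ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) | ARL (known) | SDRL (known) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 369.71 | 511.73 | 370.28 | 404.67 | 370.85 | 375.25 | 370.35 | 368.60 |
| 0.50 | 252.41 | 354.01 | 237.52 | 264.45 | 233.12 | 236.52 | 229.50 | 229.21 |
| 1.00 | 108.87 | 158.98 | 89.98 | 96.39 | 87.09 | 89.19 | 86.11 | 85.38 |
| 1.50 | 40.93 | 57.93 | 33.15 | 36.22 | 31.57 | 32.24 | 30.98 | 30.38 |
| 2.00 | 16.28 | 21.76 | 13.38 | 13.94 | 12.52 | 12.29 | 12.37 | 11.90 |
| 2.50 | 7.20 | 8.63 | 6.12 | 5.94 | 5.80 | 5.29 | 5.71 | 5.21 |
| 3.00 | 3.71 | 3.81 | 3.23 | 2.76 | 3.10 | 2.58 | 3.10 | 2.55 |
| 3.50 | 2.25 | 1.94 | 2.06 | 1.53 | 2.02 | 1.42 | 1.98 | 1.39 |
| 4.00 | 1.57 | 1.05 | 1.49 | 0.86 | 1.46 | 0.83 | 1.45 | 0.81 |
| 4.50 | 1.26 | 0.60 | 1.21 | 0.50 | 1.21 | 0.49 | 1.19 | 0.48 |
| 5.00 | 1.11 | 0.36 | 1.08 | 0.30 | 1.08 | 0.29 | 1.08 | 0.29 |

UCL = 15.16 (m = 25), 14.43 (m = 100), 14.22 (m = 500), and 14.16 (known case).
Note: p is the number of quality characteristics, δ is the shift, m is the number of phase-I samples, UCL is the upper control limit, ARL is the average run length, and SDRL is the standard deviation of the run length.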
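The known-case, in-control entries of Tables 1 and 2 can be cross-checked without simulation. The sketch below assumes that, with known parameters and independent in-control normal observations, the plotting statistic follows a chi-square distribution with p degrees of freedom, so the run length is geometric with signal probability $\alpha = P(\chi^2_p > \mathrm{UCL})$, giving $\mathrm{ARL}_0 = 1/\alpha$ and $\mathrm{SDRL}_0 = \sqrt{1-\alpha}/\alpha$. The function names are hypothetical, and only the closed-form survival functions for p = 2 and p = 3 are implemented.

```python
import math


def chi2_sf(x, p):
    """P(chi-square_p > x); closed forms for p = 2 and p = 3 only."""
    if p == 2:
        return math.exp(-x / 2.0)
    if p == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch only covers p = 2 and p = 3")


def run_length_moments(ucl, p):
    """ARL and SDRL of a geometric run length with signal probability alpha."""
    alpha = chi2_sf(ucl, p)
    return 1.0 / alpha, math.sqrt(1.0 - alpha) / alpha


arl_p2, sdrl_p2 = run_length_moments(11.83, 2)  # known-case UCL from Table 1
arl_p3, sdrl_p3 = run_length_moments(14.16, 3)  # known-case UCL from Table 2
print(round(arl_p2, 1), round(sdrl_p2, 1))
print(round(arl_p3, 1), round(sdrl_p3, 1))
```

Both theoretical ARL values land close to the simulated known-case entries at δ = 0 (370.50 and 370.35), which suggests the tabulated UCLs were calibrated to a nominal in-control ARL of about 370. The estimated-parameter columns have no such closed form, which is why simulation is needed there.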
Table 3. ARL0 and SDRL0 values of the multivariate Shewhart control chart with outliers (ω = 1).

| θ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 370.22 | 506.49 | 369.05 | 406.10 | 370.39 | 376.06 |
| 0.01 | 479.59 | 824.67 | 486.10 | 549.93 | 494.51 | 503.12 |
| 0.02 | 630.37 | 1301.69 | 648.14 | 775.72 | 653.54 | 677.66 |
| 0.03 | 780.66 | 1614.26 | 816.27 | 1045.23 | 832.28 | 866.55 |
| 0.04 | 959.31 | 2445.42 | 1041.62 | 1316.05 | 1041.87 | 1111.20 |
| 0.05 | 1167.80 | 2772.21 | 1267.57 | 1754.63 | 1312.05 | 1400.20 |
| 0.06 | 1492.42 | 3676.36 | 1526.53 | 1987.07 | 1591.60 | 1692.19 |
| 0.07 | 1844.59 | 5011.12 | 1778.28 | 2371.01 | 1841.32 | 1978.55 |
| 0.08 | 2053.30 | 4823.31 | 2098.60 | 2889.33 | 2169.25 | 2256.39 |
| 0.09 | 2349.65 | 5314.44 | 2523.52 | 3514.10 | 2476.13 | 2649.50 |
| 0.10 | 2766.19 | 6210.28 | 2736.63 | 3800.61 | 2749.25 | 2914.35 |

UCL = 15.16 (m = 25), 14.43 (m = 100), and 14.22 (m = 500).
Note: ω and θ are the magnitude and percentage of outliers, respectively; m is the number of phase-I samples; ARL is the average run length; and SDRL is the standard deviation of the run length.
Table 4. ARL0 and SDRL0 values of the multivariate Shewhart control chart with outliers (ω = 2).

| θ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 373.15 | 528.37 | 370.30 | 407.83 | 376.98 | 382.58 |
| 0.01 | 810.48 | 1994.63 | 945.97 | 1420.80 | 1084.37 | 1246.99 |
| 0.02 | 1600.06 | 4562.27 | 2142.58 | 4030.07 | 2449.51 | 2999.48 |
| 0.03 | 2772.41 | 7427.22 | 4021.04 | 7220.47 | 4876.12 | 6078.31 |
| 0.04 | 3957.63 | 9442.19 | 6533.33 | 10,703.47 | 7752.17 | 9110.05 |
| 0.05 | 5523.83 | 11,857.39 | 9510.68 | 14,211.41 | 11,506.65 | 13,324.94 |
| 0.06 | 6896.48 | 13,435.25 | 11,889.15 | 16,306.08 | 15,134.80 | 16,454.69 |
| 0.07 | 8376.87 | 15,180.61 | 14,364.71 | 18,529.89 | 18,707.26 | 19,675.92 |
| 0.08 | 9705.77 | 16,708.27 | 16,011.60 | 19,649.58 | 21,367.60 | 21,353.52 |
| 0.09 | 10,826.71 | 18,000.62 | 17,978.87 | 21,409.00 | 23,325.17 | 22,914.61 |
| 0.10 | 11,452.42 | 18,314.76 | 19,179.15 | 22,606.24 | 25,097.16 | 24,509.93 |

UCL = 15.16 (m = 25), 14.43 (m = 100), and 14.22 (m = 500).
Note: ω and θ are the magnitude and percentage of outliers, respectively; m is the number of phase-I samples; ARL is the average run length; and SDRL is the standard deviation of the run length.
Table 5. ARL0 and SDRL0 values of the proposed SDRE multivariate Shewhart control chart (ω = 1).

| θ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 361.92 | 495.73 | 375.75 | 406.13 | 369.15 | 372.24 |
| 0.01 | 449.23 | 701.48 | 457.44 | 506.28 | 458.50 | 464.16 |
| 0.02 | 517.30 | 786.42 | 566.53 | 646.91 | 559.66 | 569.57 |
| 0.03 | 652.16 | 1304.65 | 648.51 | 751.88 | 689.62 | 714.77 |
| 0.04 | 769.83 | 1581.16 | 803.33 | 948.51 | 802.49 | 821.11 |
| 0.05 | 899.75 | 1853.83 | 943.29 | 1110.27 | 950.73 | 991.70 |
| 0.06 | 1079.96 | 2177.20 | 1126.66 | 1421.11 | 1124.37 | 1179.38 |
| 0.07 | 1260.20 | 2722.94 | 1285.69 | 1573.09 | 1325.64 | 1377.60 |
| 0.08 | 1375.70 | 2832.76 | 1463.11 | 1810.17 | 1532.29 | 1626.18 |
| 0.09 | 1638.25 | 3739.77 | 1686.52 | 2192.34 | 1690.58 | 1752.70 |
| 0.10 | 1858.54 | 4088.50 | 1856.54 | 2425.25 | 1889.24 | 1968.77 |

UCL = 15.16 (m = 25), 14.43 (m = 100), and 14.22 (m = 500).
Note: ω and θ are the magnitude and percentage of outliers, respectively; m is the number of phase-I samples; ARL is the average run length; and SDRL is the standard deviation of the run length.
Table 6. ARL0 and SDRL0 values of the proposed SDRE multivariate Shewhart control chart (ω = 2).

| θ | ARL (m = 25) | SDRL (m = 25) | ARL (m = 100) | SDRL (m = 100) | ARL (m = 500) | SDRL (m = 500) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 361.78 | 499.51 | 370.12 | 400.86 | 375.70 | 384.66 |
| 0.01 | 549.31 | 1196.01 | 568.50 | 655.42 | 565.91 | 573.83 |
| 0.02 | 751.84 | 1868.04 | 815.21 | 1003.31 | 840.93 | 887.36 |
| 0.03 | 1066.50 | 2371.92 | 1147.92 | 1560.90 | 1204.67 | 1248.53 |
| 0.04 | 1461.44 | 3778.42 | 1584.77 | 2122.19 | 1652.91 | 1775.20 |
| 0.05 | 1933.98 | 4755.13 | 2115.68 | 3075.28 | 2239.99 | 2431.66 |
| 0.06 | 2457.79 | 6024.81 | 2795.44 | 3957.07 | 2837.89 | 2999.14 |
| 0.07 | 2997.34 | 7025.49 | 3539.72 | 5263.74 | 3533.50 | 3838.51 |
| 0.08 | 3645.17 | 8120.49 | 4207.78 | 6211.58 | 4222.54 | 4591.99 |
| 0.09 | 4499.18 | 9722.21 | 4962.75 | 7126.19 | 4958.27 | 5442.69 |
| 0.10 | 5206.90 | 10,638.48 | 5948.82 | 8550.87 | 5682.24 | 6336.62 |

UCL = 15.16 (m = 25), 14.43 (m = 100), and 14.22 (m = 500).
Note: ω and θ are the magnitude and percentage of outliers, respectively; m is the number of phase-I samples; ARL is the average run length; and SDRL is the standard deviation of the run length.
Table 7. $\hat{\mu}$ and $S$ estimates from the phase-I sample for the three cases under study.

| Estimate | Case 1 | Case 2 | Case 3 |
| --- | --- | --- | --- |
| $\hat{\mu}$ | (0.9927, 1.0357, 50.0120) | (1.0000, 1.0412, 50.0844) | (0.9946, 1.0406, 50.0172) |
| $S$, row 1 | (0.0022, 0.0026, 0.0040) | (0.0034, 0.0029, 0.0045) | (0.0027, 0.0040, 0.0053) |
| $S$, row 2 | (0.0026, 0.0128, 0.0038) | (0.0029, 0.0140, 0.0023) | (0.0040, 0.0165, 0.0079) |
| $S$, row 3 | (0.0040, 0.0038, 0.0495) | (0.0045, 0.0023, 0.2391) | (0.0053, 0.0079, 0.0507) |
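Table 8. $T_i^2$ values and decisions of the three cases with θ = 0.07, ω = 3, and v = 5.

| i | $T_i^2$ (Case 1) | Decision (Case 1) | $T_i^2$ (Case 2) | Decision (Case 2) | $T_i^2$ (Case 3) | Decision (Case 3) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 4.6350 | IC | 3.1616 | IC | 4.4034 | IC |
| 2 | 2.7626 | IC | 2.8659 | IC | 2.6736 | OoC |
| 3 | 6.5246 | IC | 3.2198 | IC | 6.0763 | IC |
| 4 | 19.4183 | OoC | 12.2758 | IC | 15.9676 | OoC |
| 5 | 2.8439 | IC | 0.8071 | IC | 2.6362 | IC |
| 6 | 2.7068 | IC | 0.6549 | IC | 2.4079 | IC |
| 7 | 4.8002 | IC | 0.9782 | IC | 4.1318 | IC |
| 8 | 0.8486 | IC | 0.5265 | IC | 0.8318 | IC |
| 9 | 1.0873 | IC | 1.1708 | IC | 1.1421 | IC |
| 10 | 1.1025 | IC | 1.5938 | IC | 1.0298 | IC |
| 11 | 0.3968 | IC | 0.2688 | IC | 0.2967 | IC |
| 12 | 1.9768 | IC | 0.9525 | IC | 1.9403 | IC |
| 13 | 7.4164 | IC | 2.8188 | IC | 6.8404 | IC |
| 14 | 12.0136 | IC | 5.5007 | IC | 9.4973 | IC |
| 15 | 3.7087 | IC | 1.9715 | IC | 2.8400 | IC |
| 16 | 2.7188 | IC | 1.3448 | IC | 2.2457 | IC |
| 17 | 5.7081 | IC | 1.4230 | IC | 4.6036 | IC |
| 18 | 3.4934 | IC | 3.5898 | IC | 3.6257 | IC |
| 19 | 10.6969 | IC | 10.1956 | IC | 10.5667 | OoC |
| 20 | 8.1595 | IC | 2.7431 | IC | 7.4650 | IC |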
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Raji, I.A.; Abbas, N.; Abujiya, M.R.; Riaz, M. Robust Multivariate Shewhart Control Chart Based on the Stahel-Donoho Robust Estimator and Mahalanobis Distance for Multivariate Outlier Detection. Mathematics 2021, 9, 2772. https://doi.org/10.3390/math9212772

AMA Style

Raji IA, Abbas N, Abujiya MR, Riaz M. Robust Multivariate Shewhart Control Chart Based on the Stahel-Donoho Robust Estimator and Mahalanobis Distance for Multivariate Outlier Detection. Mathematics. 2021; 9(21):2772. https://doi.org/10.3390/math9212772

Chicago/Turabian Style

Raji, Ishaq Adeyanju, Nasir Abbas, Mu’azu Ramat Abujiya, and Muhammad Riaz. 2021. "Robust Multivariate Shewhart Control Chart Based on the Stahel-Donoho Robust Estimator and Mahalanobis Distance for Multivariate Outlier Detection" Mathematics 9, no. 21: 2772. https://doi.org/10.3390/math9212772
