1. Introduction
Statistical Quality Control (SQC) is essential to the economy, notably for competitiveness and leadership in industry and management. Most enterprises, including small and medium-sized companies, invest in SQC, recognizing its importance for success.
The present report is the result of a consultation with a leading fried potato manufacturer. A new production line, which started operating in October 2024, had many built-in automated feedback and control features, including the means to retrieve frying-oil-temperature readings every 15 s.
Oil temperature is clearly a random variable: introducing a new batch of raw chips instantaneously lowers the temperature, which triggers a response from the rheostats, increasing Joule heating until the desired frying-oil temperature is restored.
Industry guidelines for the large-scale production of potato chips recommend a range of [175–185 °C], which is also the range indicated by the Food Standards Agency (UK). Based on previous experience, the factory had a conservative target of 180° ± 4°. Although this happens very seldom, consistently low temperatures can compromise crispness, de-oiling and seasoning. On the other hand, with rheostat deregulation, overheating may occur, causing starch burning and spoiling several chip batches. Temperatures in the range [176–184 °C] were, therefore, considered negative (N) symptoms, and temperatures <172 °C (underheating) or >188 °C (overheating) were considered positive (P) indicators that the system was deregulating. Temperatures in the ranges [172–176 °C] and [184–188 °C] were given a more sophisticated “fuzzy” treatment, with 175 °C and 185 °C (from the industry guidelines) acting as possible change points; for overheating, for instance, temperatures in the range [184–185 °C] could be classified either as N or P, and those in the range [185–188 °C] as P or N. See details in Section 2.
The factory SQC team members used this Gaussian-based nominal classification into true and false negatives (TN and FN) and false and true positives (FP and TP) to monitor the state of the system, disregarding the quantitative interval-scale data.
In fact, the resulting nominal data were no longer fit for traditional SQC analysis, for instance with otherwise foolproof control charts as described in the classic Montgomery [1] or Aslam [2]. The main issue, the (mis)classification of the system as In Control or Out of Control (InC/OutC) with these messy data, was subject to confusion: the fuzzy classification TN/FN of temperatures in the range [184–185 °C] and TP/FP in the range [185–188 °C] was ambiguous, implying that the fuzzy tools recommended by Hryniewicz [3] should not be used, in view of the poor accuracy of temperature classification into the nominal classes TP, FP, TN, and FN. See Stehman [4], Ting [5], Tharwat [6] or Opitz [7].
Inappropriate recording and monitoring of data can result in heavy losses or even in catastrophic disasters, such as the one at the Chernobyl nuclear power plant. De Veaux and Hand [8] note that “Anyone who has analyzed real data knows that the majority of their time on a data analysis project will be spent ‘cleaning’ the data before doing any analysis. Common wisdom puts the extent of this at 60–95% of the total project effort, and some studies […] suggest that ‘between one and ten percent of data items in critical organizational databases are estimated to be inaccurate’”. Further, they state that the “claims by software vendors that their techniques can produce valid results no matter what the quality of the incoming data” are preposterous, deserving the celebrated Sir Ronald Fisher statement that “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he only may be able to say what the experiment died of”.
From the eighties onward, much progress has been made on the analysis of messy data (cf. Milliken and Johnson [9,10,11]), but faith in the possibility of individuals dealing with bad data like professionals, as claimed by Asboth [12], seems too optimistic. Moreover, aside from cleaning the data, some data wrangling (cf. Petricek et al. [13]) may be needed, and progress in AI has brought interesting new features; see also Megahed et al. [14], Munappy et al. [15], and Mohammed et al. [16]. In the present framework, the goal is to identify change points hinting at a shift from InC to OutC; for information on change-point-detection methods, see Truong et al. [17] and van den Burg and Williams [18], always bearing in mind that SQC tools can be misused. For a critical overview, see Elhabashy et al. [19].
Our multidisciplinary research team was consulted to analyze the data and procedures in order to identify weak and strong points of the SQC benchmark, to recommend alterations that would provide better forecasting and, in particular, to offer counseling on data handling. Less radical than Sir Ronald Fisher, we proposed ways of dealing with the existing messy nominal data, and highlighted that it is inappropriate to apply to such data otherwise foolproof SQC tools such as CUSUM charts, as clearly advised in Elhabashy et al. [19].
As in many nonparametric frameworks, we defined adequate scores, whose moving averages and time series trends establish that 0-upcrossings clearly forecast InC/OutC shifts. The present report complements the findings in Pestana and Brilhante [20], and it is structured as follows:
Section 2 describes the factory SQC routines and the ensuing production halts and downtimes, and discusses possible InC/OutC misclassification using the standard metrics computed from the confusion matrix.
Section 3 addresses the use of scores and their moving averages for diagnosing the clustering of suspicious sequences signaling shifts from InC to OutC. The values of the daily score moving averages determination coefficients, although there are three outliers, are always moderate, supporting the idea that feedback and control are effective in enforcing the independence of sequential readings.
Section 4 discusses the improvements resulting from the SQC routines update.
Section 5 states concluding remarks, advising on how to improve routines so as to give a timely alert that the system is sliding to OutC.
2. Factory Quality Control Routines
The factory SQC team’s main assignment was to continuously monitor the evolution of the frying-oil temperature over each 2-min frying period. The production line automatically made available 8-ary sequences of oil temperatures 15 s apart. The majority of the SQC team members were laymen in Statistics and, in fact, quite averse to numeracy. So, they asked someone from the Informatics department to transform the numerical data furnished by the equipment into nominal color-coded marks visually depicting whether the system was In Control (InC) or shifting to Out of Control (OutC) and needing to be halted for maintenance or repair. This was a routine inherited from the previous production line surveillance team, relying on the belief that was a negative (N) symptom that the system was InC, and was a positive (P) symptom that the system was sliding towards OutC.
Since the target was 180° ± 4°, temperatures in the range [176–184 °C] were considered True Negative (TN) and color-coded ●, and temperatures <172 °C or >188 °C were considered True Positive (TP) and color-coded ●. In the ranges [172–176 °C] and [184–188 °C] (i.e., between one and two standard deviations from the target 180°), the classification rule was more elaborate. For overheating, a temperature in the range [184–185 °C] was classified as TN if the next value was N, and as False Negative (FN) if the ensuing value was P, in which case it was color-coded ●. On the other hand, a temperature in the range [185–188 °C] was considered TP if the next temperature was P, and classified as False Positive (FP) if the next value was N, in which case it was color-coded ●. Similar rules were adopted in the rare case of under-heating:
TN ● if
TP ● if ;
FN ● if ;
FP ● if ,
Each frying batch has a standard duration of 2 min, giving rise to an 8-ary sequence of temperature readings 15 s apart.
For example were classified (TN,TN,FN,FP,TN,TN,TN,TP), and recoded ●●●●●●●●, to appear in the supervisor’s monitor.
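The look-ahead rules described above can be sketched in code. This is an illustrative sketch rather than the factory's actual routine: the function names, the use of 175 °C and 185 °C as the N/P change points for the look-ahead, the mirrored rule for under-heating, the handling of band boundaries, and the treatment of the last reading of a sequence (it has no successor, so it keeps its own symptom) are all assumptions. The readings in the usage lines are hypothetical.

```python
# Sketch of the look-ahead classification of an 8-ary temperature sequence.
# Band limits follow the text (target 180 +/- 4 degrees C); anything marked
# "assumption" is hypothetical and not taken from the factory routine.


def symptom(temp):
    """Return 'N' or 'P' using the 175/185 change points (assumption)."""
    return "N" if 175.0 <= temp <= 185.0 else "P"


def classify(temp, next_temp=None):
    """Classify one reading as TN, FN, FP or TP."""
    if 176.0 <= temp <= 184.0:          # within one standard deviation
        return "TN"
    if temp < 172.0 or temp > 188.0:    # beyond two standard deviations
        return "TP"
    # Fuzzy bands: the label depends on the next reading.
    own = symptom(temp)
    # Assumption: with no next reading, the reading keeps its own symptom.
    following = symptom(next_temp) if next_temp is not None else own
    if own == "N":
        return "TN" if following == "N" else "FN"
    return "TP" if following == "P" else "FP"


def classify_sequence(temps):
    """Classify every reading of a sequence, looking one reading ahead."""
    successors = list(temps[1:]) + [None]
    return [classify(t, nxt) for t, nxt in zip(temps, successors)]


# Hypothetical readings, chosen only to exercise each branch.
readings = [180.0, 181.0, 184.5, 186.0, 180.0, 179.0, 182.0, 189.0]
print(classify_sequence(readings))
```

With these hypothetical readings the sketch yields the tag pattern (TN,TN,FN,FP,TN,TN,TN,TP). In a live setting the last reading of a batch would presumably be resolved with the first reading of the next batch, which `classify` allows through its `next_temp` argument.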
This Kanban-inspired (see Louis [21]) simplicity of inspection is praiseworthy, but downgrading the interval-scale temperature data to the color-coded nominal scale goes too far and hinders deeper statistical analysis.
The raw temperature readings, in Stevens’ [22] interval scale, were discarded in favor of the above nominal-scale 8-ary color-dotted sequences. Further, the ordering TN < FN < FP < TP was not used; hence, there is no reason to consider the data as being on an ordinal scale.
The factory SQC team relied on an unproven layman’s conjecture that inspection of the color-coded data would indicate the state of the system: 8-ary sequences with fewer than five ●/● (FN/TP) supported the likelihood that the system was InC. On the other hand, an accumulation of neighboring 8-ary sequences with five or more ●/● (FN/TP) hinted that the system could enter an OutC state, needing to be halted for maintenance or repair. Another unexplained rule was the decision to consider as suspicious the 8-ary sequences terminating in the codon PPP, which implicitly assumes that no three adjacent P readings exist among the five initial tags of the 8-ary sequence.
As a result of these feedback and control rules, production had been halted for maintenance on 37 occasions in the 69 working days from October 1 to December 21, as displayed in Table A1 in Appendix B; 11 of those interruptions lasted 2 min, 18 interruptions lasted less than 8 min (possibly meaning that half of the interruptions were either False Discoveries (FD) or required very simple maintenance), and the remaining 18 lasted more than 13 min, with a severe outlier of 109 min, as depicted in the bar chart and boxplot in Figure 1. The extremes and quartiles of the production halts are displayed in Table 1.
Considering the total production downtime (11.1 h) relative to the total production hours (1380 h) for those 69 working days, a rough estimate of the probability that the system is OutC is $11.1/1380 \approx 0.008$ (an underestimate, as discussed in Section 3), while the probability that it is InC is approximately $0.992$. This was considered satisfactory, since brand-new equipment should be almost always InC. However, the data provided neither a timely alert that a shift from InC to OutC could occur, nor a rationale to distinguish whether an alarm was a False Discovery (FD) or the absence of an alarm a False Omission (FO); the ensuing losses and unnecessary waste run contrary to modern Lean guidelines on waste.
The data masking inevitably downscaled the original interval-scale quantitative temperature data (see Stevens [22]) to a nominal scale represented by the categories TN, FN, FP, and TP. This, of course, precluded the use of otherwise foolproof standard SQC tools, such as CUSUM charts.
The factory SQC team used the halt durations in Table A1 in Appendix B to classify each instance as a False Discovery (FD) or False Omission (FO) of OutC; sensitivity and specificity are, respectively, the rate of true alerts (hits) when the system is OutC and the rate of correctly raising no alert when the system is InC. The criteria were mainly a combination of halt durations with a qualitative assessment of the recurrence patterns of occurrences on the same day or on adjacent days, namely several alerts in 10 h operating periods.
Sensitivity or true positive rate (TPR) and specificity or true negative rate (TNR), positive predictive value (PPV) and false discovery rate (FDR), negative predictive value (NPV) and false omission rate (FOR), computed from the Confusion Matrix (bold part of Table 2), are cornerstone concepts for assessing SQC. Other important metrics are defined in Table 3 and incorporated in the Performance Metrics Matrix, Table 2.
A comparison of the SQC team’s classifications with the findings of the maintenance department was used to scrutinize the classification-confusion matrix in Table 2 with associated (mis)matching evaluation metrics, as defined in Table 3. For detailed discussions on confusion matrices and related evaluation metrics, refer to [4,5,6,7,23,24,25,26,27].
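As a reminder of how the rates named above follow from the four cells of a 2×2 confusion matrix, here is a minimal sketch using the standard textbook definitions. The function name and the counts in the usage lines are hypothetical; they are not the values of Table 2 or Table 4.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard rates derived from the four cells of a 2x2 confusion matrix.

    tp, fp, fn, tn are counts of true positives, false positives,
    false negatives and true negatives. Every denominator must be non-zero.
    """
    return {
        "TPR": tp / (tp + fn),   # sensitivity (true positive rate)
        "TNR": tn / (tn + fp),   # specificity (true negative rate)
        "PPV": tp / (tp + fp),   # positive predictive value
        "FDR": fp / (tp + fp),   # false discovery rate = 1 - PPV
        "NPV": tn / (tn + fn),   # negative predictive value
        "FOR": fn / (tn + fn),   # false omission rate = 1 - NPV
        "ACC": (tp + tn) / (tp + fp + fn + tn),  # accuracy
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=8, fp=2, fn=2, tn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The pairs PPV/FDR and NPV/FOR are complementary, which is why the text presents them together.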
The factory SQC team classification-confusion matrix is displayed in Table 4.
3. Scores and Their Moving Averages
The nominal data were unfit for the usual SQC diagnosis charts. The belief that 8-ary sequences with fewer than five ●/● (FN/TP) are a symptom that the system is InC, and that 8-ary sequences with five or more ●/● (FN/TP) are a symptom that the system is shifting to an OutC state, can obviously be misleading, since sudden surges are in most cases immediately corrected.
Only the accumulation of clusters of symptoms is meaningful. It was, therefore, decided to attribute scores to each 8-ary sequence using rules that promoted the attribution of negative scores to 8-ary sequences with a predominance of ●/● (TN/FP) temperatures, and of positive scores to 8-ary sequences with a predominance of ●/● (FN/TP) temperatures.
As shifts from InC to OutC should be forecast by clusters of P observations, the weights were chosen so that the 15-period (30-min) moving averages of the scores, which should mainly be negative when the system is InC, upcross 0 when clusters of strong disturbances occur, signaling a shift to OutC. This was achieved by computing the scores as
Observe that (1) can be re-expressed using the simple functions false discovery and false omission rates as multipliers of the number of temperatures exceeding 185°:
The rationale for the definition of scores in (1) and their moving averages for 30-min periods is as follows:
The score of each 8-ary temperature sequence ranges from $-4$, when all temperatures are TN, to 16, when all temperatures are TP. In 30-min averaging periods, if 3 of the 15 8-ary sequences were (TP,TP,TP,TP,TP,TP,TP,TP), their contribution to the moving average would be $3 \times 16/15 = 3.2$. If all the other 12 sequences were (TN,TN,TN,TN,TN,TN,TN,TN), or more generally any combination of TN and FP, contributing $12 \times (-4)/15 = -3.2$, the moving average would be 0. On the other hand, the average would be positive if in at least one of those 12 sequences one of the temperatures was TP or FN.
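The balance argument above can be checked numerically. The sketch below is not definition (1): it assumes per-reading weights of 2 for TP and −0.5 for TN and FP, values inferred from the stated maximum of 16 for an all-TP sequence and from the requirement that three all-TP sequences exactly offset twelve sequences made of TN and FP. The FN multiplier is left as a caller-supplied parameter (`fn_weight`, with an arbitrary placeholder in the usage lines) because it is not needed for this check; all names are hypothetical.

```python
# Weights inferred from the extremes described in the text (assumption):
# 8 * 2 = 16 for an all-TP sequence, 8 * (-0.5) = -4 for any TN/FP mix.
TP_WEIGHT = 2.0
TN_FP_WEIGHT = -0.5


def sequence_score(tags, fn_weight):
    """Score one sequence of tags; fn_weight is a caller-supplied value
    because the FN multiplier is not fixed by this sketch."""
    score = 0.0
    for tag in tags:
        if tag == "TP":
            score += TP_WEIGHT
        elif tag == "FN":
            score += fn_weight
        elif tag in ("TN", "FP"):
            score += TN_FP_WEIGHT
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return score


def window_mean(scores):
    """Plain average of the scores in one moving-average window."""
    return sum(scores) / len(scores)


FN_PLACEHOLDER = 1.0  # arbitrary; no FN tag appears in this check

all_tp = ["TP"] * 8
quiet = ["TN", "FP"] * 4
balanced = ([sequence_score(all_tp, FN_PLACEHOLDER)] * 3
            + [sequence_score(quiet, FN_PLACEHOLDER)] * 12)
print(window_mean(balanced))

# Replacing a single TN by a TP in one quiet sequence tips the mean above 0.
disturbed = ["TP"] + quiet[1:]
tipped = balanced[:-1] + [sequence_score(disturbed, FN_PLACEHOLDER)]
print(window_mean(tipped) > 0)
```

Under these assumed weights the 15-sequence window sits exactly at 0 in the balanced case, and a single additional TP reading is enough to push the window mean above 0.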
On a brand-new production line, we would expect the system to be almost always InC. This would mean that P observations would be rare, and there is a clear indication (see Table 5) that sequences with four or fewer Ps should be expected in InC, and that sequences with five or more Ps should be interpreted as symptoms that there exists some risk of OutC, with the proviso, obviously, that only the accumulation of evidence from clustering of such sequences should be an effective alarm.
In Table 5, we compare the probabilities, under $H_0$ and under $H_1$, of $k$ P observations in an 8-ary sequence, where $H_0$ is the null hypothesis that the model is Gaussian(180,4) and $H_1$ is the alternative hypothesis that the model has shifted at least to Gaussian(190,4). It is assumed that the inbuilt controls continuously work to maintain the system InC, with the side effect of rendering sequential values approximately independent, nearly sub-independent in the sense discussed by Hamedani [28]. The simple approximate probabilities are, therefore, the products of the probabilities of the classification classes under $H_0$ and under $H_1$, respectively. Taking Table 5 into account, it is plausible to expect that the sequence of 30-min moving averages of the scores upcrosses 0 when there is a clustering of 8-ary sequences with 5 or more P temperatures, intuitively tied to a possible shift towards OutC.
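To give a feel for the order of magnitude of such probabilities, the following sketch computes the distribution of the number of P readings in an 8-ary sequence under independence. It is a simplification and does not reproduce Table 5: it assumes that a P reading means a temperature above 185 °C (overheating only), that the second parameter of Gaussian(180,4) is the standard deviation, and that the count of P readings is therefore binomial. The function names are hypothetical.

```python
from math import comb
from statistics import NormalDist

SIGMA = 4.0          # standard deviation, as in Gaussian(180,4)
THRESHOLD = 185.0    # assumption: "P" means a reading above 185 degrees C


def prob_p(mean):
    """Probability that a single reading exceeds THRESHOLD."""
    return 1.0 - NormalDist(mu=mean, sigma=SIGMA).cdf(THRESHOLD)


def count_pmf(p, n=8):
    """Binomial probabilities of k = 0..n P readings among n independent ones."""
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


pmf_h0 = count_pmf(prob_p(180.0))   # in control: Gaussian(180,4)
pmf_h1 = count_pmf(prob_p(190.0))   # shifted:    Gaussian(190,4)

for k, (a, b) in enumerate(zip(pmf_h0, pmf_h1)):
    print(f"k={k}: H0 {a:.5f}   H1 {b:.5f}")
print("P(5 or more | H0) =", sum(pmf_h0[5:]))
print("P(5 or more | H1) =", sum(pmf_h1[5:]))
```

Under these assumptions, five or more P readings in a sequence are very unlikely when the process is centered at 180° and very likely after a shift to 190°, which is consistent with using "five or more Ps" as the mark of a suspicious sequence.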
The Supplementary Materials, in the compressed file DataArchive.zip, contain the working-day .xlsx files displaying the scores of the 8-ary sequences, their 15-period moving averages and the corresponding charts. Moving averages have an interesting prognostic value for shifts from InC to OutC since, whenever 0-upcrossings occurred, production interruptions were needed. Observe, however, that some 2-min halts may reveal an FD, or very minor problems that are easily solved with routine maintenance.
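The alert mechanism described here, a trailing 15-period moving average of sequence scores whose 0-upcrossings raise an alarm, can be sketched as follows. The function names are hypothetical, the score series is synthetic, and an upcrossing is taken (as an assumption) to mean that the moving average goes from a value at or below 0 to a value strictly above 0.

```python
def moving_average(scores, window=15):
    """Trailing moving averages; entry j covers scores[j : j + window]."""
    return [sum(scores[end - window + 1:end + 1]) / window
            for end in range(window - 1, len(scores))]


def zero_upcrossings(scores, window=15):
    """Indices (into scores) at which the trailing moving average
    moves from <= 0 to > 0."""
    averages = moving_average(scores, window)
    alerts = []
    for j in range(1, len(averages)):
        if averages[j - 1] <= 0 < averages[j]:
            alerts.append(j + window - 1)  # index of the newest score in the window
    return alerts


# Synthetic day: quiet sequences scoring -4, with one burst of four
# sequences scoring 16 (values chosen for illustration only).
scores = [-4.0] * 20 + [16.0] * 4 + [-4.0] * 20
print(zero_upcrossings(scores))
```

In this synthetic series the alert is raised at the fourth sequence of the burst, that is, only once disturbed sequences have accumulated within the 30-min window; an isolated disturbed sequence would not trigger it.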
The album of daily half-hour (15-sequence period) moving averages of the scores in Appendix C plainly shows that the clustering of suspicious sequences leading to 0-upcrossings is closely tied to production halts for maintenance or repair. The exceptional situation of the blue-colored chronogram on November 08, where there is a 0-upcrossing and a 38-min production interruption, may be an instance of a false OutC forecast: it was thoroughly investigated by the maintenance team without detecting any need for repair. The tiny values of the determination coefficient $R^2$, as shown in Figure 2, whose extremes and quartiles are displayed in Table 6, support the conviction that feedback and control effectively guarantee the approximate independence of sequential temperature readings.
The finding that 0-upcrossings of the scores’ moving averages are a clear alert that the process is sliding towards OutC indicates that Table A1 should be replaced by Table A2 in Appendix B, which displays the duration from the 0-upcrossing to the end of the production halt, depicted in the bar chart and boxplot in Figure 3.
The production halts’ extremes and quartiles, measured from the 0-upcrossing alert until resumption of production, are displayed in Table 7.
Therefore, instead of the estimates based on a total halt duration of 11:06, the improved estimates of the probabilities that the frying unit is OutC or InC, respectively, should be used.
Table A2 in Appendix B was discussed with the factory SQC and maintenance teams, and we asked them to classify each instance with regard to the true system state and the team’s diagnosis. From this, a new confusion/(mis)matching matrix (Table 8) was computed.
Comparing Table 4 and Table 8 shows a substantial improvement for all the evaluation metrics.
4. Discussion
The rationale for the definition (1) of scores and the use of 30-min windows for moving averages in Section 3 tries to balance the negative contribution of 8-ary sequences with a predominance of temperature values below 185° against the positive contribution of sequences with a predominance of readings above 185°. It is obviously a heuristic approach, since it is impossible to devise an optimal decision.
With regard to the choice of the moving-average period, we experimented with 20-, 30-, 40- and 60-min windows. The 20-min window produced many FDs, whereas the 40-min and 60-min windows produced few alerts, resulting in too many FOs. The 30-min window produced an adequate number of alerts, with a reliable trade-off between PPV (positive predictive value) and sensitivity on the one hand, and NPV (negative predictive value) and specificity on the other.
Obviously, other windows could produce reliable results with different coefficients: for the 20-min window, it would be wiser to use lower multipliers for TP and FN (or greater coefficients for TN and FP) and, for the 40-min window, to use the reverse rule of thumb. The goal should always be to obtain significant upcrossings of some level, hinting that the system is deviating towards OutC. In the 60-min window, the smoothing effect of averaging a large number (30) of scores renders this heuristic useless.
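The smoothing effect of longer windows can be illustrated on synthetic data. The sketch below counts 0-upcrossings of the trailing moving average for the four window lengths mentioned above, converting minutes to periods at one score per 2-min batch. The score series, the function name and the strict definition of an upcrossing (from a value at or below 0 to a value above 0) are assumptions; the sketch does not reproduce the false-discovery behaviour reported for the factory data, only the fact that a short burst is averaged away by long windows.

```python
def count_upcrossings(scores, periods):
    """Number of times the trailing `periods`-score mean goes from <= 0 to > 0."""
    means = [sum(scores[end - periods:end]) / periods
             for end in range(periods, len(scores) + 1)]
    return sum(1 for prev, cur in zip(means, means[1:]) if prev <= 0 < cur)


# Synthetic scores: quiet sequences at -4 with one burst of four at 16.
scores = [-4.0] * 40 + [16.0] * 4 + [-4.0] * 40

for minutes in (20, 30, 40, 60):
    periods = minutes // 2   # one score per 2-min frying batch
    print(f"{minutes}-min window ({periods} scores):",
          count_upcrossings(scores, periods), "alert(s)")
```

With this burst of four disturbed sequences, the 20-min and 30-min windows each raise one alert, whereas the 40-min and 60-min windows raise none, in line with the remark that long windows smooth the disturbance away unless the coefficients are changed.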
Concerning the multipliers used, the rationale in
Section 3 clearly indicates that they achieve an interesting balance of negative and positive scores when the system is InC, and a positive imbalance when it shifts towards OutC. Obviously, other values could be chosen, for instance
, or
, or
would produce more alerts, and
would produce fewer alerts. For the fun of it, recalling Euler’s famous formula tying π, e, i, and −1, we experimented with
, to no avail, producing an excessive number of false alarms.
In view of this small-scale experimentation, the rule of thumb “use 15-period moving averages (30-min window) of scores”, as defined in (1), has been adopted.
5. Conclusions
The recommendation to implement routines for computing scores and their moving averages, and to display alerts when 0-upcrossings occur, was incorporated into the spreadsheet, which is continuously updated on the supervisor’s monitor. From 2 January onward, the quantitative data have been recorded on the interval scale, enabling the use of more sophisticated statistical analysis tools. To the factory SQC team’s satisfaction, the classification algorithm based on the decision tree depicted in Appendix A provides painted displays, as exemplified in Figure 4, which are even more eye-catching than the color dots used in the factory SQC routines in 2024.
Considering only 8-ary sequences terminating in the codon PPP as suspicious was an inadequate criterion since, in view of Table 5, all sequences with five or more Ps are suspicious. However, to a certain extent even this criterion is irrelevant, since feedback and control can overcome a momentary deregulation of temperature. Only the accumulation of suspicious 8-ary sequences in a cluster of adjacent or neighbouring sequences, signaled by 0-upcrossings of the moving averages, clearly indicates OutC situations.
With the integrity of the temperature readings maintained, several statistics of the 8-ary sequences are readily available, namely extreme values, ranges, and the number of values exceeding 185°. SQC has evolved substantially, with methodologies such as Taguchi’s Total Quality Control, Six Sigma and Beyond, or developments in Design of Experiments [29,30,31,32], and the availability of temperatures on the interval scale enables the use of those developments, and of traditional charts as described in Montgomery [1] and Aslam [2], without misusing tools [19]. This also enabled Pestana and Brilhante [20] to treat 8-ary sequences as digital ants, using digital pheromones to forecast shifts from InC to OutC.
Despite the availability of more sophisticated SQC tools when the interval-scale data are kept, we must be aware that their usefulness resides mainly in their ability to confirm a posteriori that SQC performance is high. The main issue of deciding to halt production for maintenance or repair, thus of forecasting possible OutC, is the routine task of the supervision unit, which must decide in real time whether there is a clustering of suspicious 8-ary sequences.
Thus, as the moving averages of the scores can be computed routinely and in a timely manner, we further recommended adding columns to Figure 4 to display, in each line, the score as defined in (1), the latest 30-min moving average of the scores, and the highest temperature in the 8-ary sequence. Further, a 0-upcrossing of the scores’ moving average should trigger a visual and sound alarm. Both recommendations have been implemented.