An Experimental Analysis of Drift Detection Methods on Multi-Class Imbalanced Data Streams

Application: The industrial sensor-based application generates continuous non-stationary data streams which change over time. By analyzing the performance of existing change detection methods, the best performing method can be selected for application in an industrial environment to detect faults or unusual changes early and to reduce maintenance costs. Abstract: The performance of machine learning models diminishes when predicting the Remaining Useful Life (RUL) of equipment or predicting faults due to the issue of concept drift. This issue is aggravated when the problem setting comprises multi-class imbalanced data. The existing drift detection methods are designed to detect certain drifts in specific scenarios. For example, a drift detector designed for binary class data may not produce satisfactory results for applications that generate multi-class data. Similarly, a drift detection method designed for the detection of sudden drift may struggle with detecting incremental drift. Therefore, in this experimental investigation, we seek to investigate the performance of the existing drift detection methods on multi-class imbalanced data streams with different drift types. For this reason, this study simulated streams with various forms of concept drift and the multi-class imbalance problem to test the existing drift detection methods. The findings of the current study will aid in the selection of drift detection methods for use in developing solutions for real-time industrial applications that encounter similar issues. The results revealed that among the compared methods, DDM produced the best average F1 score. The results also indicate that multi-class imbalance causes the false alarm rate to increase for most of the drift detection methods.


Introduction
The difficulty of learning from streaming data with concept drift has received increasing attention in the area of online learning in recent years, since this phenomenon occurs in many real-time applications like predicting RUL [1][2][3], fault detection [4][5][6][7][8][9], and risk management [10][11][12][13][14]. These applications generate continuous non-stationary data streams. The non-stationary nature of the stream causes a change in the statistical properties of the data, which leads to concept drift and class imbalance problems [15,16]. Concept drift includes virtual drift, real drift, or hybrid drift [17]. Let us assume that, at time t, we have xt as an input vector for the Machine Learning (ML) model and yt as the corresponding target output vector. Virtual drift occurs when the distribution of the input data pt(x) changes over time, but the posterior probability of the output pt(y|x), which represents the mapping relationship between xt and yt, does not change with time. In simple words, the data, even after facing changes in input features, still represent the same target output. In real drift, the posterior probability distribution pt(y|x) changes over time, but this change is not caused by changes in pt(x). In contrast, hybrid drift is the combination of both virtual drift and real drift. These major types of drift are further classified into different types based on the frequency of the underlying change with respect to time: sudden, incremental, gradual, and reoccurring [17,18]. Figure 1 illustrates all four types of concept drift. The basic definition of each drift type is as follows:

• Sudden: the data abruptly change from one concept to another.
• Incremental: the concept changes incrementally, at a constant speed, from one concept to another.
• Gradual: similar to incremental drift, but the speed of change from one concept to another is not constant.
• Reoccurring: the drift disappears and, after a certain time period, reappears [17,18].
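To make the sudden/gradual distinction concrete, the following sketch (an illustrative toy generator, not one of the MOA generators used later in the paper) mixes two concepts through a sigmoid whose width controls the transition speed; a width of 1 yields a sudden drift, while a large width yields a gradual one:

```python
import numpy as np

def drift_stream(n=10_000, drift_pos=5_000, width=1, seed=42):
    """Toy 1-D stream whose mean shifts from concept A (mean 0) to
    concept B (mean 5). width=1 gives a sudden drift; a larger width
    spreads the change over roughly that many instances (gradual)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    for t in range(n):
        # Probability of drawing from the new concept follows a sigmoid
        # centred at drift_pos, as in MOA-style drift streams.
        z = np.clip(-4.0 * (t - drift_pos) / width, -50.0, 50.0)
        p_new = 1.0 / (1.0 + np.exp(z))
        mean = 5.0 if rng.random() < p_new else 0.0
        x[t] = rng.normal(mean, 1.0)
    return x

sudden = drift_stream(width=1)       # abrupt switch at instance 5000
gradual = drift_stream(width=1000)   # ~1000-instance mixing period
```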
Besides the drift, changes may occur in the class distribution, and the number of instances of each class may also vary, leading to a class imbalance problem [15,[19][20][21]. In case the stream is multi-class, changes may occur in multiple classes simultaneously; this makes it more challenging for the drift detection methods to detect the new concept and adapt the existing ML model to the latest concept [22]. When data face a class imbalance problem, it signifies that the data are not equally distributed among the classes. Some of the classes may have a very high number of instances and are called majority classes, while, on the contrary, some have very skewed data or a smaller number of instances and are called minority classes. An ML model that is trained on imbalanced data produces results biased toward the majority class and misclassifies the minority class [23][24][25]. Along with that, in non-stationary multi-class streams, the majority class may become the minority class and vice versa [26].
In situations where concept drift and class imbalance both appear concurrently, the drift detection method and data balancing approaches are affected as well as the ML classification model. There is a possibility that only the class imbalance ratio alters at some point in time, but the concept remains the same. In such a situation, the distribution-based drift detection method may trigger a 'drift' alarm which is a false alarm. Similarly, the error rate-based drift detection methods monitor the performance of the ML model and may also generate false alarms in case the performance of ML model is decreased due to class imbalance.
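This false-alarm scenario can be illustrated with a toy distribution-based check (the function and its threshold are hypothetical, for illustration only): it flags a 'drift' purely because the class ratio changed, even though the concept p(y|x) is untouched:

```python
import numpy as np

def ratio_change_alarm(old_labels, new_labels, threshold=0.2):
    """Toy distribution-based check: flag a 'drift' when class
    frequencies differ between two windows, even if the concept
    p(y|x) is unchanged."""
    classes = np.union1d(old_labels, new_labels)
    old_freq = np.array([(old_labels == c).mean() for c in classes])
    new_freq = np.array([(new_labels == c).mean() for c in classes])
    return bool(np.abs(old_freq - new_freq).max() > threshold)

rng = np.random.default_rng(0)
# Same concept throughout, but the imbalance ratio shifts 90:10 -> 50:50.
old = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
new = rng.choice([0, 1], size=1000, p=[0.5, 0.5])
print(ratio_change_alarm(old, new))  # → True: a false alarm, only the ratio changed
```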
Therefore, this study experimentally analyzes the existing drift detection methods in the presence of multi-class imbalanced data. One of the prime objectives of this study is to investigate and find the best existing approach for drift detection in multi-class data streams in the presence of class imbalance and different drift types such as sudden, gradual, and incremental drift.
The foremost contributions of the study are the following:
• Highlighting the effect of concept drift and class imbalance in a multi-class data stream environment.
• Simulation of multi-class data streams with different drift types and class imbalance.
• Analysis of 10 existing state-of-the-art drift detection methods.
• Highlighting the performance difference of an ML model on balanced and imbalanced data streams in the presence of concept drift (sudden, gradual, and incremental).
The rest of the paper is organized as follows: Section 2 highlights the existing work on drift detection methods. Section 3 explains the generated data streams, drift detection methods evaluated, and the performance measures used in this study. Section 4 discusses the results, and Section 5 concludes the overall study.

Related Work
The major goal of concept drift detection methods is to provide an efficient approach that works together with the classification model, indicating drift or novelty when there is a substantial change in the data properties [27]. In this manner, the model is updated to avoid being degraded by the change, thereby boosting its predictive performance.
The existing drift detection methods are generally divided into three categories: error rate-based, data distribution-based, and multiple hypothesis test-based (or statistical test-based) [18]. The error rate-based drift detection methods continuously monitor the performance of the base classifier. The base classifier is used to categorize incoming instances in a concept drift detection system: it generates a class prediction for each instance, which is then compared to the actual class label. The drift detection technique then decides whether or not a drift has occurred based on the classification result. In the end, the instance is used to train the base classifier. This procedure is repeated every time a new instance arrives. The Drift Detection Method (DDM) [28] is a well-known example of an error rate-based approach. It increases or decreases the value of the error rate based on the classification performance of the base classifier: in the case of correct classification of the incoming instance, DDM decreases the error rate and increases it otherwise. When the error rate reaches a certain threshold, DDM either considers it a drift or generates a warning. DDM produces good results for sudden and fast gradual drifts but not for slow gradual drifts.
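A minimal sketch of DDM's error-rate monitoring logic might look as follows (a simplified reimplementation for illustration, using the standard 2σ warning and 3σ drift levels; parameter and attribute names are ours, not MOA's):

```python
import math

class DDM:
    """Simplified sketch of the Drift Detection Method: monitor the
    running error rate p and its std s; warn at p_min + 2*s_min,
    signal drift at p_min + 3*s_min."""
    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0              # running error rate
        self.s = 0.0              # its standard deviation
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the base classifier misclassified, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + 3.0 * self.s_min:
            self.reset()          # forget the old concept's statistics
            return "drift"
        if self.p + self.s > self.p_min + 2.0 * self.s_min:
            return "warning"
        return "stable"

ddm = DDM()
# Stable phase (~20% error rate), then an abrupt jump to 100% error.
statuses = [ddm.update(1 if i % 5 == 0 else 0) for i in range(1500)]
statuses += [ddm.update(1) for _ in range(300)]
```

The detector stays stable while the error rate hovers around 20% and signals a drift shortly after the error rate jumps.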
The Early Drift Detection Method (EDDM) [29], an improved version of DDM, was proposed to increase the performance on slow progressive drifts. Instead of only the error rate, EDDM tracks the average distance between two errors as well as the running standard deviation of that distance. EDDM, like DDM, has two threshold values to designate the warning zone and the drift zone. The Reactive Drift Detection Method (RDDM) [30] periodically reduces the number of instances of a very long concept and recalculates the DDM statistics to overcome the performance loss issue of DDM. The ADWIN [31] algorithm uses sliding windows to compare the means of the output sequences that may vary with time. By analyzing the average of a statistic across two sub-windows, the algorithm determines the window size. Whenever the absolute value of the difference between the two averages exceeds a certain threshold, a change is recognized, and the window is reset.
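ADWIN's full algorithm maintains an exponential histogram for efficiency; the following heavily simplified sketch conveys only the core idea of testing every split of a window against a Hoeffding-style threshold:

```python
import math

def adwin_style_check(window, delta=0.002):
    """Highly simplified ADWIN-style test (the real ADWIN uses an
    exponential histogram): for every split of `window` into W0 | W1,
    compare sub-window means against a Hoeffding-style threshold and
    return the first cut point that exceeds it, else None."""
    n = len(window)
    total = sum(window)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += window[i - 1]
        n0, n1 = i, n - i
        mean0 = left_sum / n0
        mean1 = (total - left_sum) / n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)          # harmonic mean of sizes
        eps = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
        if abs(mean0 - mean1) > eps:
            return i   # change found; ADWIN would drop W0 and keep W1
    return None
```

On a stationary window the function returns None; when the mean shifts inside the window, it returns a cut point near the change.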
The data distribution-based drift detection methods mostly use sliding windows to keep the recent data samples and apply statistics to identify the dissimilarity between the distributions of old and new data samples. Most traditional concept drift detection approaches combine a learner (classifier) with a dissimilarity measurement method (drift detector) to determine changes between the current and previous distributions. Distribution-based approaches may identify both real and virtual drifts, depending on the kind of dissimilarity metrics used. They may, however, be unable to discriminate between the kinds of observed drifts [32]. Dasu et al. [33] presented a distribution-based drift detection method called kdqTree. This approach divides the distributions into smaller substructures. The substructures from the prior and most recent data windows are compared using the Kullback-Leibler distance to assess the drift. The study [34] proposes a concept drift detection approach based on the premise that there is restricted access to class labels or no access at all. When applied to unlabeled data streams, the suggested technique identifies concept drift based on the class label information predicted by the base classifier. The Geometric Moving Average (GMA) [35] gives the most recent observation the greatest weight, with the weights of all previous observations decreasing in geometric progression. A novelty detection algorithm for multi-class streams, MINAS, was proposed in [27]; it separates the learning process into two stages: offline and online. The offline phase trains a decision model based on the problem's known concept and is used only once. Following that, the online phase collects new samples and categorizes them as belonging to one of the existing classes or as unknown. In addition, the algorithm seeks a coherent collection of unknown samples to find new classes or extensions of established classes.
These additional classes and extensions are included in the decision model. Furthermore, the approach enables the model to ignore out-of-date data and automatically adjust to concept drift.
Page-Hinkley (PH) [36] is a sequential analysis test that monitors the difference between the current accuracy and the mean accuracy of the base classifier at the current moment. If the difference between the two accuracies reaches a threshold, it is considered a drift. The Page-Hinkley test was used in [37] to detect the change between normal and abnormal signals. CUSUM [38] is a variant of PH; it is a sequential analysis that uses the principles of the Sequential Probability Ratio Test (SPRT). It monitors the mean of the predicted result values and generates a drift alarm if the mean reaches a certain threshold.
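The Page-Hinkley test described above can be sketched in a few lines (a simplified illustration; the `delta` and `threshold` defaults are arbitrary choices, not the values used in the experiments):

```python
class PageHinkley:
    """Sketch of the Page-Hinkley test over a stream of values, e.g.
    the base classifier's 0/1 error indicator."""
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta            # tolerated magnitude of change
        self.threshold = threshold    # drift threshold (lambda)
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0                # cumulative deviation m_t
        self.cum_min = 0.0            # running minimum of m_t

    def update(self, x):
        """Return True when an upward change in the mean is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold

ph = PageHinkley()
# Mean 0.2 for 1000 steps, then it jumps to 0.8.
flags = [ph.update(0.2) for _ in range(1000)]
flags += [ph.update(0.8) for _ in range(500)]
```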
The hypothesis test-based drift detection can be applied to the input data as well as the output of the ML model. HDDM_A and HDDM_W were proposed in [39]; they are similar to DDM but use Hoeffding's inequalities. HDDM_A calculates the moving average, and HDDM_W calculates the weighted moving average from the incoming input stream. They first identify the relevant cut point in the incoming stream and, based on it, apply either the A-test or the W-test to find the current state of the stream. The possible stream states are stable, warning, or drift. The Statistical Test of Equal Proportions (STEPD) [40] computes the accuracy of the base classifier over the W most recent instances and compares it to its overall accuracy from the beginning of the learning process.
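A simplified sketch of STEPD's test of equal proportions follows (it omits the continuity correction of the original method; the window size and significance level here are illustrative):

```python
import math
from collections import deque

class STEPD:
    """Simplified sketch of STEPD: compare accuracy over the W most
    recent instances with the overall accuracy using a test of equal
    proportions (z-statistic)."""
    def __init__(self, window=30, alpha_drift=0.003):
        self.window = window
        self.alpha_drift = alpha_drift
        self.recent = deque(maxlen=window)
        self.total_correct = 0
        self.total_n = 0

    def update(self, correct):
        """correct: 1 if the base classifier was right, else 0.
        Returns True when a significant accuracy drop is detected."""
        self.recent.append(correct)
        self.total_correct += correct
        self.total_n += 1
        if self.total_n <= self.window:
            return False
        r_n, r_c = len(self.recent), sum(self.recent)
        o_n = self.total_n - r_n                 # older instances
        o_c = self.total_correct - r_c
        p_hat = (r_c + o_c) / (r_n + o_n)        # pooled accuracy
        denom = math.sqrt(p_hat * (1 - p_hat) * (1 / r_n + 1 / o_n))
        if denom == 0.0:
            return False
        z = (o_c / o_n - r_c / r_n) / denom      # drop in recent accuracy
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # one-sided tail
        return p_value < self.alpha_drift

stepd = STEPD()
out = [stepd.update(1) for _ in range(500)]      # perfectly accurate phase
out += [stepd.update(0) for _ in range(100)]     # accuracy collapses
```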
Recently, a window-based drift detection approach was proposed [41]. It stores the most recent samples in a sliding window and removes the old ones. The samples of the sliding window are divided into two sub-windows: one keeps the most recent samples from the main sliding window and treats them as the new concept, whereas the other keeps the older samples and treats them as the old concept. The Kolmogorov-Smirnov [42] test is then applied to compare the absolute distance between the cumulative distributions of the two sub-windows. A window is a buffer memory that stores a limited amount of data; it can hold either input data or the output of the ML model to monitor the behavior of the data or output. Another sliding-window-based drift detection method, called the One-Class Drift Detector (OCDD), was proposed in [43]. At the start, a one-class classifier is trained using the data in the sliding window. The one-class classifier is used to estimate the distribution of the new concept, identifying whether fresh samples belong to the existing concept or are outliers. Outlier samples are recognized as data from the new concept, and OCDD signals a drift based on the proportion of outliers discovered in the sliding window. This procedure is repeated indefinitely as long as fresh data arrive. OCDD's fundamental flaw is that it identifies drifts redundantly (false positives) when there are virtual drifts in which P(X) changes but P(y|X) does not: the classifier is excessively adjusted, and the model loses important knowledge. Furthermore, it is incapable of detecting concept drifts (false negatives) in which changes occur solely in P(y|X). In general, every drift detection method works either on the input data (monitoring dissimilarity in the data distribution) or on the output of the ML model (a drop in accuracy or an increase in the error rate). We diagrammatically illustrate the input-based and output-based drift detectors in Figure 2a,b.
Most of the existing work presents a variety of drift detection methods in general. The experimental study [44] examined eight drift detection approaches in terms of how they perform in the context of sudden and gradual concept drifts. Another work [45] compared four drift detection methods on drifted data streams with noise. The study [46] compared fourteen drift detection methods on binary class data streams. Again, only the sudden and gradual drift types were considered in these studies, and the data streams used were mostly binary class. Another main limitation is that the effect of the class imbalance issue was not considered in their work. Hence, a comprehensive study is required that compares the existing drift detection methods on data streams with various drift types in the presence of a multi-class imbalance problem. A summary of the existing experimental studies that compare drift detection methods is given in Table 1.

Methodology
Concept drift and class imbalance are common problems of applications that generate continuous non-stationary data streams. In real-time data streams, it is hard to know where exactly the drift has occurred, and as a result, drift detection methods cannot be properly tested in such situations. Therefore, we generate synthetic data streams to simulate the features of real-time data streams with specific and controllable drift types as carried out by [18,47]. The following subsections discuss the data generation mechanism and experimental setup in detail.

Data Stream Simulation
The study at hand focuses on testing the existing drift detection methods on multi-class balanced and imbalanced data streams with sudden, gradual, and incremental drifts. Therefore, to know where exactly the drift occurs and what the actual number of samples in each class is, i.e., the class imbalance ratio, we simulated different data streams such that each stream had two versions: a balanced data stream and an imbalanced data stream. The reason for creating two versions of the same stream is to identify the performance difference between drift detection methods on balanced and imbalanced data streams when the drift type is the same. Because we are uninformed of the drift location and drift type in real data streams, evaluating the performance of drift detection methods on such streams is difficult [48]. As a result, synthetically produced data streams are employed in our investigations, in which we know the drift locations, drift type, and class imbalance ratio at the time of drift occurrence. We used the Massive Online Analysis (MOA) [49] framework to generate the streams. The RandomRBFGenerator, RandomTreeGenerator, and LEDGenerator classes were used to generate different multi-class data streams. Every stream contains 100K samples; after every 25K samples, a new concept replaces the old one. The 'sudden' and 'gradual' drifts are easy to generate using the width (w) property. For sudden drift, we set w = 1, and for gradual drift w = 5K, which means that during these 5K instances, the concept changes gradually. The position (p) is set to 25K in all cases, which means it is the center of each drift (i.e., concept drift occurs after every 25K samples). Table 2 shows the summary of the generated streams.
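As a rough analogue of this setup (not the actual RandomRBF/RandomTree/LED generators), the following toy generator produces 100K instances with a new concept every 25K and an optionally skewed class prior; the prior proportions and feature dimensionality are illustrative assumptions:

```python
import numpy as np

def multiclass_stream(n=100_000, concept_len=25_000, imbalanced=True, seed=7):
    """Toy analogue of the simulated streams: n instances with a new
    concept every concept_len samples. Each concept draws a fresh
    random linear mapping from 5 features to 4 class scores; the
    imbalanced variant adds a skewed prior (~70/20/5/5) as a class
    bias so that class 0 tends to dominate."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))
    y = np.empty(n, dtype=int)
    prior = np.array([0.70, 0.20, 0.05, 0.05]) if imbalanced \
        else np.full(4, 0.25)
    for c in range(n // concept_len):
        lo, hi = c * concept_len, (c + 1) * concept_len
        W = rng.normal(size=(5, 4))            # new concept = new mapping
        scores = X[lo:hi] @ W + np.log(prior)  # prior enters as a bias
        y[lo:hi] = scores.argmax(axis=1)
    return X, y

X, y = multiclass_stream()
```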

Experimental Setup
The purpose of this study is to compare the existing drift detection methods on balanced and imbalanced multi-class data streams and to find which of the compared methods contributes more to improving the performance of the base classifier and in what circumstances. Ten well-known drift detection methods are scientifically compared in this study on data streams with various drift types. In these experiments, we used the No Change Detection (NoCD) option to establish the performance of the base classifier without any drift detector; this allows the performance of the base classifier to be analyzed both with and without a drift detector. All the drift detection methods are used with default parameters, and no parameter tuning is performed: in real-time data stream classification, data arrive fast and the concept may change frequently, so parameters tuned for the initial concept lose their worth once the concept changes. The summary of these methods is given in Table 3. All the experiments are performed in MOA [49], using a machine equipped with a 2.3 GHz Intel i5 processor and 8 GB RAM. The results are evaluated in terms of the F1 score metric, which combines both precision and recall into a single measure that captures both properties. For data streams where the drift location is known, such as streams with sudden drift, the methods are also evaluated using correct identification, distance between the actual drift location and the identified location, false detection, and missed detection. The Hoeffding Tree classifier [50] is used as the base classifier because it is an incremental learner capable of learning from massive data streams. With its incremental decision tree approach, the Hoeffding Tree can update the existing tree using only the single input instance, eliminating the requirement to evaluate older instances.
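The evaluation protocol implied here — test-then-train, with a classifier reset whenever the detector fires — can be sketched as follows (the model and detector are deliberately trivial stand-ins, not the Hoeffding Tree or the compared detectors):

```python
from collections import Counter

class MajorityClass:
    """Trivial incremental model: always predicts the most seen label."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

class NaiveDetector:
    """Stand-in detector: signals drift when the error rate over the
    last 100 instances exceeds 0.6."""
    def __init__(self):
        self.window = []
    def update(self, error):
        self.window.append(error)
        self.window = self.window[-100:]
        if len(self.window) == 100 and sum(self.window) / 100 > 0.6:
            self.window = []
            return "drift"
        return "stable"

def prequential_run(stream, model_factory, detector):
    """Test-then-train: predict, feed the 0/1 error to the detector,
    reset the model when a drift is signalled, then train."""
    model = model_factory()
    correct, drifts = 0, []
    for t, (x, y) in enumerate(stream):
        y_hat = model.predict(x)
        correct += int(y_hat == y)
        if detector.update(int(y_hat != y)) == "drift":
            drifts.append(t)
            model = model_factory()   # forget the old concept
        model.learn(x, y)
    return correct / (t + 1), drifts

# A label stream whose majority class flips at instance 1000.
stream = [(None, 0)] * 1000 + [(None, 1)] * 1000
acc, drifts = prequential_run(stream, MajorityClass, NaiveDetector())
```

On this toy stream the detector fires once, shortly after the flip, and the model recovers quickly after the reset.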

Results and Discussion
The main purpose of drift detection methods is to identify the drift in a timely manner and trigger the ML model, which is employed as the base classifier, to adjust itself accordingly so that classification performance degradation can be avoided. Keeping this in view, the predictive performance of the classifier was assessed when combined with each of the 10 drift detection methods. It is hypothesized in the current study that the performance of the base classifier will improve when using drift detection methods. The experiments were performed on data streams with sudden drift, gradual drift, and incremental drift. We discuss the results for each drift type separately in the following sub-sections.

Balanced and Imbalanced Data Stream with Sudden Drift
Results of the experiments in terms of F1 score on balanced and imbalanced data streams are presented in Figure 3. It can be observed from the results that the base classifier produced better results on the balanced data streams compared to the imbalanced data streams. Another notable aspect is that the base classifier without any drift detection method (NoCD) produced far better results than with any of the drift detection methods except GMA. In other words, instead of improving, the performance of the base classifier decreased with most of the compared drift detection methods.

Figure 4 shows the actual effect of each drift detection method on the base classifier's performance in terms of percentage increase or decrease. On the y-axis, a value < 0 means the performance decreased, a value > 0 shows a performance improvement, and 0 shows no effect. The performance of HT as the base classifier decreased on both balanced and imbalanced data streams with most of the drift detection methods. The performance of the base classifier with GMA remained unchanged, i.e., it neither increased nor decreased on the balanced as well as the imbalanced data streams, whereas with DDM and HDDM_A, the performance slightly improved on the imbalanced data stream.
These results raise the question of why the performance of the base classifier decreases when using drift detection methods instead of improving. This has been analyzed from another perspective, i.e., the number of false drift alarms. A false alarm takes place when the detector signals a change when there was none. When the drift detection method generates a false alarm, it triggers a mechanism that resets the base classifier, and as a result, the base classifier forgets the previously learned knowledge. If there is no drift in the data stream and the concept is still the same, but the drift detection method generates a false alarm (considering that there is a drift), the base classifier is reset and starts learning the supposedly new concept, even though in reality the concept is still the same. This situation causes performance degradation. Therefore, any drift detection method that generates a high number of false alarms causes a decrease in performance. Figure 5 shows the number of false alarms generated by each drift detection method on the balanced and imbalanced data streams. It can be observed that HDDM_W, STEPD, and EWMAC generate a high number of false alarms, and these are the methods that cause the most damage to the base classifier's performance. On the imbalanced data stream, both DDM and HDDM_A correctly detected the actual drifts; they did not generate many false alarms, and as a result, they slightly improved the performance of the base classifier.
The performance of the drift detection methods was also measured on a few other metrics such as True Detection, False Alarm, and Delay in Detection, which are described in [23].
The results are presented in Table 4. As can be seen in Table 4, the value 0 in column 'Drift Detected' against GMA highlights that it did not identify any of the drift out of three drifts available in the data stream; hence, it missed all 3 drifts. The distance (T) represents the distance in time stamp from the actual drift location to the drift detection location, whereas the distance (I) is the distance (in a number of instances) from the first instance of the new concept to the instance where drift is detected by the drift detection method. Any drift detection method which produces poor results on any of the metrics in Table 4 will affect the performance of the base classifier.
From Table 4, it can be concluded that false drift detection is the main factor behind the decrease in the performance of a base classifier. EWMAC correctly detected all three drifts without any delay, but it also detected 83 false drifts. Because of each false detection, the classifier is reset and forgets previously learned knowledge, which caused the 11.6% decrease in base classifier performance. HDDM_W and STEPD also correctly detected all three actual drifts but detected 35 and 27 false drifts, respectively. As a result, HT performance decreased by 6.88% with HDDM_W and by 6.49% with STEPD as the drift detection method.
EDDM caused a 4.07% performance decrease in HT: it detected one false drift, missed two of the three actual drifts, and detected the remaining drift only after a delay. These factors contributed to the performance degradation observed with EDDM.
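The metrics above can be computed directly from the positions of actual and detected drifts. The sketch below is an illustrative implementation under our own assumptions; the tolerance window, variable names, and matching rule are ours, not the paper's exact definitions, and a detection is counted as true only if it is the first detection to fall within the window after an actual drift.

```python
def detection_metrics(actual_drifts, detected, tolerance=1000):
    """Classify each detected drift as a true detection or a false alarm.

    actual_drifts: instance indices where the concept actually changes.
    detected:      instance indices reported by a drift detector.
    tolerance:     max delay (in instances) for a detection to count as true.
    Returns (true_detections, false_alarms, delays), where delays[i] is the
    distance (I) from the start of the i-th new concept to its detection,
    or None if that drift was missed.
    """
    delays = [None] * len(actual_drifts)
    false_alarms = 0
    for d in detected:
        matched = False
        for i, a in enumerate(actual_drifts):
            # the first detection inside the window after drift i is "true"
            if a <= d < a + tolerance and delays[i] is None:
                delays[i] = d - a
                matched = True
                break
        if not matched:
            false_alarms += 1
    true_detections = sum(1 for x in delays if x is not None)
    return true_detections, false_alarms, delays
```

For example, with actual drifts at instances 10,000, 20,000, and 30,000, a detector reporting detections at 10,200, 15,000, and 20,050 would score two true detections (delays 200 and 50), one false alarm, and one missed drift.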

Balanced and Imbalanced Data Stream with Gradual Drift
Gradual drift is the gradual change of one concept to another. In this study, this change is completed over 10 timestamps (5000 instances). The performance of the base classifier on the balanced and imbalanced multi-class data streams is presented in Figure 6. In contrast to the data stream with sudden drift, where the base classifier produced good results without any drift detection method and its performance decreased with most of the drift detection methods, on the data stream with gradual drift some of the drift detection methods contributed to improving the performance of the base classifier.

Figure 6. F1 score on balanced and imbalanced data streams with gradual drift.

Figure 7 shows the actual impact on the classifier's performance in percentage. On the balanced data stream, the performance of the base classifier improved with DDM, EDDM, CuSUM, PH, RDDM, and EWMAC by 4.35%, 5.22%, 4.28%, 5.06%, 4.3%, 5.22%, and 6.22%, respectively, whereas HDDM_W and STEPD caused a decrease in the performance of the base classifier by 3.63% and 4.31%. GMA again showed no effect on the base classifier's performance on either stream. On the imbalanced data stream with gradual drift, DDM, EDDM, CUSUM, PH, and RDDM contributed to improving the base classifier's performance, whereas EWMAC badly affected the classifier's performance when the data stream had a class imbalance issue. Again, we looked into the false alarm factor, as we did for the streams with sudden drift. Figure 8 shows the number of false alarms generated by each drift detection method on the balanced and imbalanced data streams. It can be observed that HDDM_W, STEPD, and EWMAC again generated a high number of false alarms on both streams and thus affected the base classifier's performance.
Even though EWMAC generated 63 false alarms, it still produced an improvement on the balanced data stream; on the imbalanced data stream, however, the number of false alarms increased to 87, which caused the decrease in the base classifier's performance. From Figure 8, it can also be observed that DDM, EDDM, CUSUM, PH, and RDDM generated either zero or very few false alarms; hence, they contributed to improving the base classifier's performance.
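A gradual drift of the kind used here can be simulated by mixing two concepts during a transition window in which the probability of sampling from the new concept rises linearly. The sketch below is a minimal illustration under our own assumptions; the concept callables, window length, and function name are ours, not the paper's exact stream generator.

```python
import random

def gradual_stream(concept_a, concept_b, n_before, n_transition, n_after, seed=0):
    """Generate a stream that drifts gradually from concept_a to concept_b.

    concept_a / concept_b: zero-argument callables that sample one instance
    from the old and new concept, respectively.
    """
    rng = random.Random(seed)
    stream = [concept_a() for _ in range(n_before)]
    for i in range(n_transition):
        # probability of the new concept grows linearly from ~0 to 1
        p_new = (i + 1) / n_transition
        source = concept_b if rng.random() < p_new else concept_a
        stream.append(source())
    stream.extend(concept_b() for _ in range(n_after))
    return stream
```

With a 5000-instance transition (the 10 timestamps used in this study), both concepts coexist during the change, which is what makes gradual drift harder to pin to a single detection point than sudden drift.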

Balanced and Imbalanced Data Stream with Incremental Drift
An incremental drift takes place when the concept changes gradually at a constant speed, and this is the most difficult drift type to detect. The performance of the base classifier on the balanced and imbalanced multi-class data streams with incremental drift is presented in Figure 9. It can be observed that it is hard to learn from a data stream with incremental drift, and the classifier produced poorer results compared to the streams with gradual drift and the streams with sudden drift. The class imbalance further decreases the performance compared to the balanced data stream.

Figure 9. F1 score on balanced and imbalanced data streams with incremental drift.

Figure 10 shows the actual effect of each drift detection method on the base classifier's performance in terms of percentage increase or decrease. The performance of HT as the base classifier on the balanced and imbalanced data streams was again decreased by HDDM_W, STEPD, and EWMAC. The performance of the base classifier with GMA again remained unchanged, i.e., it neither increased nor decreased on either the balanced or the imbalanced data stream. DDM and HDDM_A caused a decrease in performance on the balanced data stream, whereas they produced a slightly improved performance on the imbalanced data stream. Overall, PH produced the highest F1 score on the balanced data stream with incremental drift, whereas DDM produced the highest F1 score on the imbalanced data stream with incremental drift. For streams with incremental drift, we cannot calculate the false alarms because the position of incremental drift is always unknown.
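The detect-and-reset loop that drives these results pairs a base classifier with a drift detector watching its error signal. As a self-contained illustration, the sketch below re-implements the DDM rule (monitor the running error rate p and its standard deviation s; signal drift when p + s exceeds the recorded minimum p_min + 3 * s_min). The class name, minimum sample count, and simplifications are our own assumptions, not the paper's exact implementation.

```python
import math

class SimpleDDM:
    """Minimal sketch of the DDM drift rule (after Gama et al.)."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0                      # running error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the classifier misclassified the instance, else 0.
        Returns True when a drift is detected (the detector then resets)."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return False
        # remember the best (lowest) error level seen so far
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        # drift: current error level significantly above the best level
        if self.p + s > self.p_min + 3 * self.s_min:
            self.reset()
            return True
        return False
```

In the evaluation loop, a True return would trigger replacing the base classifier (here, the Hoeffding Tree) with a fresh instance, which is exactly why frequent false detections erase learned knowledge and hurt the F1 score.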
The overall performance of the base classifier on each data stream is presented in Figure 11a-f. It can be observed that the performance of the base classifier decreases at the time of drift occurrence. Another point to observe is that class imbalance further affected the performance of most drift detection methods. Table 5 shows the performance in terms of the overall F1 score on each data stream. The bold values represent the best performances on a particular data stream. The results show that none of these methods showed consistent performance across all data stream types.

Therefore, from Table 5, we generated the ranking of the methods on each data stream and calculated the overall ranking, which can be seen in Table 6. In this study, one base classifier was tested with 10 drift detection methods and once without any drift detection method, so the best possible ranking is 1 and the worst is 11. The rankings were calculated using the RANK.EQ function available in Microsoft Excel. The results show that DDM produced the best average ranking, 3.17. Both EDDM and PH produced the 2nd best average ranking, 3.83, and RDDM produced the 3rd best average ranking, 4.33. STEPD, EWMAC, and HDDM_W remained the poorest performers, with average rankings of 9.83, 9.33, and 9.5, respectively. On the imbalanced data streams specifically, DDM produced the best rankings, i.e., it ranked 2nd on I-S, 1st on I-G, and 1st on I-I. The ranking of GMA is the ranking of the base classifier itself, because GMA showed no effect on the performance of HT as the base classifier.
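Excel's RANK.EQ assigns tied values the same (best) rank and skips the following ranks. The same rule can be reproduced in one line of Python; the F1 scores below are made-up illustrative values, not the paper's Table 5 results.

```python
def rank_eq(values):
    """RANK.EQ-style ranks in descending order: the largest value gets
    rank 1, and tied values share the rank of their best-ranked member."""
    return [1 + sum(1 for other in values if other > v) for v in values]
```

For example, `rank_eq([0.9, 0.8, 0.9, 0.7])` yields `[1, 3, 1, 4]`: the two tied best scores both receive rank 1, and rank 2 is skipped.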
Statistical analysis: The results of all drift detection methods were statistically analyzed to identify the significance of the performance differences. We used the Friedman test [52] to compare the results of all the investigated methods across the six data streams; the test was run using the friedmanchisquare function from Python's SciPy library. The Friedman test result, with statistic = 32.94 and p = 0.00028, rejects the null hypothesis (that all drift detection methods produce the same performance) since p < 0.05. However, it is still not clear which of the compared methods performed significantly differently from one another. To determine this, the Bonferroni-Dunn post hoc test [53] was used to identify which algorithms are statistically distinct. The result of the Bonferroni-Dunn test is shown in Figure 12; the methods which are not significantly different (at p = 0.05) are connected.
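The Friedman test call takes one sample of scores per method, measured across the same set of data streams. The score matrix below is illustrative, not the paper's Table 5; only the calling convention matches what was described above.

```python
from scipy.stats import friedmanchisquare

# Hypothetical F1 scores of three methods across six data streams
scores = {
    "DDM":   [0.81, 0.79, 0.75, 0.78, 0.74, 0.72],
    "EDDM":  [0.80, 0.78, 0.73, 0.77, 0.71, 0.70],
    "STEPD": [0.70, 0.66, 0.62, 0.65, 0.60, 0.58],
}

# Each method's score vector is one argument; streams are the blocks
stat, p = friedmanchisquare(*scores.values())
if p < 0.05:
    print("reject H0: at least one method performs differently")
```

A significant Friedman result only says that some method differs; the Bonferroni-Dunn post hoc comparison of average ranks is what identifies which pairs actually differ.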

Conclusions
In this era of rapid technological advancement, most industrial sensor-based applications generate continuous non-stationary data streams with the inbuilt issues of concept drift and class imbalance. Due to the change in concept, the performance of the predictive model decreases. To counter this issue, a variety of drift detection methods have been proposed to detect concept drift and adapt the model to the new concept. These methods struggle when the data have a multi-class imbalance issue. To examine how these drift detection methods perform on multi-class data streams, this work compared 10 state-of-the-art drift detection methods on synthetic multi-class balanced and imbalanced data streams. We generated these data streams with sudden, gradual, and incremental drifts. Because the data streams also face the issue of class imbalance, two copies of each data stream were generated with the same type of concept drift: one with class imbalance (imbalanced data stream) and the other without (balanced data stream). The performance was evaluated in terms of the F1 score. For sudden and gradual drifts, other performance measures, such as false alarms and delay in drift detection, were also used.
This experimental study yields the following findings:
• Every drift type (sudden, gradual, or incremental) has a different effect on the performance of the base classifier.
• None of the compared methods showed consistent performance on different data streams.
• Most of the drift detection methods were affected by the class imbalance issue.
• DDM produced the best overall ranking among the compared methods but was significantly different from STEPD, HDDM_W, and EWMAC only.
• DDM also produced strong results on data streams with class imbalance.
• STEPD, EWMAC, and HDDM_W produced poor results on both balanced and imbalanced data streams.
• A drift detection method may cause a decrease in base classifier performance when it generates false alarms.
Based on the results of this study, we can conclude that the existing drift detection methods not only lack the ability to detect multiple drift types but also struggle when the data face multi-class imbalance issues. Therefore, there is a need to design adaptive drift detection methods that can quickly learn new concepts and update themselves accordingly without generating false alarms. This is a goal for our future work. We also plan to test more existing drift detection methods, as well as the latest methods, on numerous multi-class data streams with dynamic class imbalance ratios.