Application of Imbalanced Data Classification Quality Metrics as Weighting Methods of the Ensemble Data Stream Classification Algorithms

In the era of a large number of tools and applications that constantly produce massive amounts of data, processing and properly classifying them is becoming both increasingly hard and important. This task is hindered by changes in the data distribution over time, called concept drift, and by the emergence of disproportion between classes, as in network attack or fraud detection problems. In the following work, we propose methods to modify existing stream processing solutions: Accuracy Weighted Ensemble (AWE) and Accuracy Updated Ensemble (AUE), which have demonstrated their effectiveness in adapting to a time-varying class distribution. The introduced changes are aimed at increasing their quality in the binary classification of imbalanced data. The proposed modifications include the use of aggregate metrics, such as F1-score, G-mean and balanced accuracy score, in the calculation of the member classifiers' weights, which affects both the composition of the ensemble and its final prediction. Moreover, the impact of data sampling on the algorithms' effectiveness was also checked. Extensive experiments were conducted to identify the most promising modification type, as well as to compare the proposed methods with existing solutions. Experimental evaluation shows an improvement in the quality of classification compared to the underlying algorithms and other solutions for processing imbalanced data streams.


Introduction
Data stream analysis has recently become an increasingly popular topic in the pattern recognition field [1,2]. A multitude of tools and applications constantly produces huge volumes of data that should be processed, most often in a limited time, to extract valuable information. Examples of such sources include social media and recommendation systems [3] or, in particular, the increased network traffic of the coronavirus and remote-work era [4]. Such data differ significantly from static data sets, introducing additional difficulties in constructing effective models to solve learning tasks. In addition, more and more often, for example in the case of fraud detection [5] or network attacks [6], they introduce an imbalance problem [7,8], which is not negligible even when training on static data sets, making streaming classification all the more challenging.
The problem of imbalanced data occurs when the size of one of the problem classes far exceeds that of the other. There is no precise threshold beyond which we may speak of imbalance, but it is often assumed [8] that a 9:1 ratio constitutes a slight imbalance, while 1000:1 or more denotes a very high imbalance.
Imbalanced data classification is a demanding task, because the dominant majority of recognition algorithms were designed with the assumption of proportional prior class probabilities.
Most traditional recognition models aim to minimize the prediction error while ignoring any disproportion in the class counts, which biases the built model towards the majority class and thus significantly worsens its discriminatory abilities regarding the minority class. In addition, it is very important to carefully select the experimental protocol and the quality assessment metrics used [9], because the most commonly applied classification metrics, such as accuracy, do not take into account the disparities between the problem classes and thus incorrectly assess the quality of the model. One of the available choices is aggregated metrics, such as the F1-score, G-mean or balanced accuracy score [10], which, by taking into account the recognition quality for all the problem classes, are much better suited to the problem of imbalanced data.
Data streams are ordered sequences of information arriving at high speed [11]. They are also potentially infinite and may change over time. One of the most important phenomena distinguishing the classification of data streams from that of static data is the so-called concept drift. It consists of a change in the class distribution of the set: the posterior probability or even the proportion between individual classes [12]. This significantly affects the quality of the prediction, because models often turn out to have been trained on outdated data. One possible taxonomy of this phenomenon divides it into three types according to the dynamics and characteristics of the changes. Sudden drift occurs when the posterior probability at time t + 1 is completely different from that at time t. In the case of gradual drift, the change in the concept is slower, and the data from both concepts (before and after the change) are mixed up. The last type is incremental drift, in which the first concept smoothly changes into the second, without mixing them together.
The characteristics of data streams lead to some indefeasible requirements for classifiers operating in their environment: fast data processing in which each object may be presented for training only once, low memory consumption, the possibility of prediction at any time and the ability to adapt to the changing distribution of problem classes [13]. Data can be provided to the model in two ways: online or in batches. In the first case, the objects are processed individually at the moment they arrive, while in the second case the data are grouped into chunks of the same size and processed together. Online learning allows faster detection of concept drifts [14], while learning in batches is easier to implement and more computationally efficient.
The problem of imbalanced streams can be even more difficult to solve than each of its component problems separately. Not all standard methods for solving the imbalanced data problem are feasible in a streaming environment. If the model is learned incrementally, most of the popular sampling algorithms cannot be used, and even determining the imbalance ratio itself is not trivial [15]. In the case of a data chunk, it is easier to specify at least the temporary proportions between classes, but depending on the size of the chunk and the number of minority patterns, not all sampling methods will be equally effective. In addition, due to the characteristics of streams and the speed at which the data arrive, the computational efficiency of the algorithms should also be taken into account.
There are three main groups of methods for improving model performance on imbalanced data: methods at the data level, methods at the algorithm level, and hybrid methods that most often use an ensemble approach to classification. Data-level methods adapt the training set by changing the number of samples so that standard machine learning algorithms can train and classify correctly. The simplest and most popular approach is random sampling, where objects are duplicated (oversampling) or removed (undersampling) in a random manner. It may, however, lead to the removal of patterns potentially valuable for recognition or the duplication of non-valuable samples (e.g., noise or outliers). More complex methods include the SMOTE (Synthetic Minority Over-sampling Technique) algorithm [16], creating new synthetic samples based on neighboring minority class objects; ADASYN, creating more synthetic samples near objects that are difficult to classify [17]; and the NCL (Neighborhood Cleaning Rule) algorithm, removing majority class objects that contribute to the misclassification of the remaining samples [18]. Algorithm-level methods transform the machine learning model in such a way as to alleviate its bias towards the majority class. One such approach comprises methods that modify the cost function of the model [19] so that it assigns a greater cost to minority class recognition errors. The disadvantage of such methods is the difficulty of choosing the correct error costs in real problems. Another algorithm-level method is the use of one-class classifiers [20]. By building each classifier on only one class, we get rid of the problem of favoring the other classes. However, choosing the right classifiers may be difficult for more complex problems.
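As a minimal illustration of the data-level approach, the random oversampling described above can be sketched in a few lines of pure Python; the function name and the toy chunk below are our own, not taken from any library.

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    """Randomly duplicate minority-class samples until both classes
    are equally represented in the chunk."""
    rng = random.Random(seed)
    minority = [(x, c) for x, c in zip(X, y) if c == minority_label]
    majority = [(x, c) for x, c in zip(X, y) if c != minority_label]
    resampled = list(minority)
    while len(resampled) < len(majority):
        resampled.append(rng.choice(minority))  # duplicate a random minority sample
    data = majority + resampled
    rng.shuffle(data)
    X_out, y_out = zip(*data)
    return list(X_out), list(y_out)

# Toy chunk with a 4:1 imbalance becomes balanced (4:4) after oversampling
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
```

Random undersampling is the mirror image: instead of duplicating minority samples, majority samples are randomly dropped until the classes are balanced.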
In the case of data streams, there are several ways to classify them. The basic single-classifier model is VFDT (Very Fast Decision Tree), a decision tree using the Hoeffding bound to create branches. Other examples are traditional incremental classifiers that have been adapted to the requirements of data streams, such as neural networks, Bayesian methods and minimum-distance algorithms. Another approach is classifier ensembles, which, thanks to their modularity, easily adapt to non-stationary data streams [21]. In batch learning, a new classifier is often created when new instances appear, and it may replace the weakest model in the pool. Examples of ensemble algorithms are AWE (Accuracy Weighted Ensemble) [22], AUE (Accuracy Updated Ensemble) [23,24] and WAE (Weighted Aging Ensemble) [13].
Several approaches have been proposed to solve the problem of imbalanced data streams. One of them is to expand the window with minority class data [25], which reduces imbalance using non-synthetic data (as opposed to artificially increasing their number). This solution, however, does not take into account the possibility of the minority class distribution changing over time, and it also violates the principle stating that each sample should be used only once. Another method is used, for example, by the incremental OOB and UOB algorithms [15]. They are based on online bagging, where for each member classifier the incoming samples are duplicated according to a Poisson distribution, and sampling (oversampling in the case of OOB and undersampling in UOB) is done by controlling the λ parameter. The disadvantage of incremental learning, however, is the problem of determining the class proportions.
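The λ-based resampling used by OOB can be sketched as follows; this is an illustrative simplification of online bagging, and the helper names `poisson` and `online_bagging_copies` are our own, not part of any library.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for drawing k ~ Poisson(lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def online_bagging_copies(label, minority_label, size_ratio, rng):
    """Number of times one member classifier trains on an incoming sample.
    In OOB-style oversampling, minority samples get lambda boosted by the
    current majority/minority size ratio; other samples keep lambda = 1."""
    lam = size_ratio if label == minority_label else 1.0
    return poisson(lam, rng)

rng = random.Random(42)
# In a 9:1 stream a minority sample is replicated ~9 times on average
copies = [online_bagging_copies(1, 1, 9.0, rng) for _ in range(1000)]
```

UOB works analogously but shrinks λ for the majority class instead of boosting it for the minority class.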
The aim of the following work is to propose modifications of popular ensemble models so that they employ imbalanced classification metrics in the weighting of member classifiers, and to compare them with existing data stream processing solutions. The created algorithms may achieve higher classification quality on imbalanced streams, and the proposed methods may slightly improve the currently used algorithms. The paper presents preliminary research on the topic and thus focuses on the binary classification task.

Accuracy Weighted Ensemble
Accuracy Weighted Ensemble is an example of a batch processing classifier that processes data in the form of chunks. Each of the models entering the pool uses the same training procedure, but is built on a different data block.
A significant problem in processing data streams is recognizing the point in time at which the data have become obsolete. The method of deleting the oldest objects is often used. However, this creates another problem: choosing the appropriate time window after which the data will be forgotten. If the window is too large, objects from the previous concept are still used in the prediction of the new concept. On the other hand, if the window size is too small, the classifiers may have insufficient data for proper generalization, which may result in overfitting and poor model quality. For this reason, AWE does not use window mechanics; instead of the time spent in the pool, it evaluates the stored data (in the form of the classifiers trained on them) against the current concept.
It has been proven that an ensemble trained on k blocks, in a manner where each model is built on a different block, achieves better quality (a smaller prediction error) than a single classifier learned on all k blocks. The condition for this, however, is the assumption that each member classifier has a weight assigned in accordance with its fit to the current data distribution. In the case of AWE, this fit is assessed by estimating the error made by each member on the latest block, which is considered to best reflect the current class distribution. In its basic version, each member weight equals the difference between the estimated mean square error of the random classifier and the mean square error of that classifier.
w_i = MSE_r − MSE_i, (1)

where MSE_r equals

MSE_r = Σ_c p(c) (1 − p(c))², (2)

for p(c) being the prior probability of class c. MSE_i is calculated as follows:

MSE_i = (1/|S_n|) Σ_{(x,c) ∈ S_n} (1 − f_c^i(x))², (3)

where S_n is the latest data chunk in which x is a feature vector with label c, |S_n| is the number of patterns building the chunk and f_c^i(x) states the posterior probability of the i-th classifier assigning pattern x to class c.
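The AWE weighting scheme can be sketched in Python under the definitions above; this is a minimal illustration, with all function names being our own.

```python
def mse_r(class_priors):
    """Mean squared error of a random classifier predicting by the priors."""
    return sum(p * (1 - p) ** 2 for p in class_priors)

def mse_i(posteriors, labels):
    """MSE of one member on the newest chunk; posteriors[j] is the list of
    class probabilities the member outputs for the j-th pattern."""
    return sum((1 - post[c]) ** 2
               for post, c in zip(posteriors, labels)) / len(labels)

def awe_weight(posteriors, labels, class_priors):
    """AWE: weight = MSE of the random classifier minus MSE of the member."""
    return mse_r(class_priors) - mse_i(posteriors, labels)

# A member that is always right and certain gets the maximal weight, MSE_r;
# with balanced priors [0.5, 0.5] that maximum equals 0.25
perfect = awe_weight([[1.0, 0.0], [0.0, 1.0]], [0, 1], [0.5, 0.5])
```

A member whose posteriors match random guessing gets a weight of zero, and anything worse receives a negative weight and is effectively discarded.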
Steps of the AWE algorithm in the form of pseudocode are presented in Algorithm 1.

Algorithm 1 AWE pseudocode.
Input: S: new data chunk; K: size of the ensemble; C: ensemble of K classifiers
Output: C: ensemble of K classifiers with updated weights
Train new classifier C′ with S;
Calculate weight of C′ based on (1) using cross-validation on S;
for C_i in C do
    Calculate weight w_i based on (1);
end for
C ← K classifiers with highest weights from C ∪ {C′};
return C;
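The pool-maintenance step of Algorithm 1, keeping the K highest-weighted classifiers from the old ensemble plus the freshly trained candidate, can be sketched as follows (the function name is our own):

```python
def select_top_k(members, weights, candidate, candidate_weight, k):
    """AWE pool maintenance: keep the K highest-weighted classifiers
    from the old ensemble plus the freshly trained candidate."""
    pool = list(zip(members, weights)) + [(candidate, candidate_weight)]
    pool.sort(key=lambda mw: mw[1], reverse=True)  # best weights first
    kept = pool[:k]
    return [m for m, _ in kept], [w for _, w in kept]

# The weakest old member is displaced by the better-weighted candidate
members, weights = select_top_k(["old_a", "old_b"], [0.10, 0.30], "new", 0.20, 2)
```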

Accuracy Updated Ensemble
The second algorithm analyzed in the following work, Accuracy Updated Ensemble, is inspired by AWE, but at the same time addresses some of its disadvantages, namely the problem of selecting the correct chunk size and the weighting function.
The first disadvantage stems from the fact that each member classifier is trained on only one chunk of data and then remains unchanged. If the chunk size is too small, the classifier will not have enough data to build a proper model. On the other hand, if it is too large, it may include data from different concepts. The solution proposed by AUE is to update the models of the classifiers stored in the pool, not just their weights, according to changes in the concept. Thanks to this, if the class distribution between chunks remains unchanged, classifiers well matched to it will improve their quality (as if they had been trained on a larger number of samples from the beginning). As a result, it is possible to reduce the size of the chunk without fear that this will deteriorate the quality of individual members. Training occurs when the weight of an ensemble member is greater than the estimated weight of the random classifier.
The other disadvantage of AWE is its weighting function. By its definition and procedure (cutting off classifiers weaker than the random classifier) it may silence the entire ensemble and make prediction impossible. AUE proposes the following weight function for the i-th ensemble member:

w_i = 1 / (MSE_r + MSE_i + ε), (4)

where MSE_i is calculated according to Equation (3) and the small positive constant ε guarantees that division by 0 can never occur.
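A minimal sketch of this weighting, assuming an illustrative constant ε = 1e-9 (the exact value is an implementation choice, not specified here):

```python
EPS = 1e-9  # small constant keeping the denominator strictly positive

def aue_weight(member_mse, random_mse):
    """AUE weighting: the inverse of the member's error plus the error of
    the random classifier; unlike AWE's difference-based weight, it is
    always positive, so a fully silenced ensemble cannot occur."""
    return 1.0 / (random_mse + member_mse + EPS)

w_perfect = aue_weight(0.0, 0.25)   # error-free member
w_weak = aue_weight(0.5, 0.25)      # member worse than random
```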
In addition to the introduced corrections, AUE retains all the advantages of AWE: weights are assigned when a new chunk arrives, so classifiers modeled on outdated concepts do not have a big impact on the result of the final prediction. As a result, AUE achieves better quality than AWE for streams with a stationary concept or streams including gradual drifts, and for sudden drifts its quality is at least the same.
Pseudocode of the AUE algorithm is presented in Algorithm 2.
Algorithm 2 AUE pseudocode.
Input: S: new data chunk; K: size of the ensemble; C: ensemble of K classifiers
Output: C: ensemble of K updated classifiers with updated weights
Train new classifier C′ on S;
Estimate the weight of C′ based on (4) using cross-validation on S;
for C_i in C do
    Calculate weight w_i based on (4);
end for
C ← K classifiers with the highest weights from C ∪ {C′};
for C_e in C do
    if w_e > 1/MSE_r and C_e ≠ C′ then
        update C_e with S;
    end if
end for
return C;

The presented algorithms are not adapted to the classification of imbalanced data. The main reason lies in their methods of assigning weights to ensemble members. The weights not only affect the fusion of the classifiers (mostly conducted by weighted voting), but also their composition, as the classifiers with the lowest weights are removed. In addition, in AUE, only members with sufficiently high weights are trained. The mean square error on which the weights are based in both AWE and AUE, like the typical accuracy score, is not suitable for assessing the quality of a classifier on imbalanced problems. Its low value, which translates into a high weight, may stem from a significant bias towards the majority class, which is best demonstrated by the case of a model that always predicts the majority class [26].
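The conditional member update of Algorithm 2 can be sketched as follows; `CountingModel` is a hypothetical stand-in used only to show which members get retrained.

```python
def update_members(members, weights, random_weight, newest, X, y):
    """AUE member update: only classifiers whose weight exceeds that of the
    random classifier are retrained on the newest chunk; the freshly added
    member is skipped because it was just trained on that chunk."""
    for member, w in zip(members, weights):
        if w > random_weight and member is not newest:
            member.partial_fit(X, y)

class CountingModel:
    """Tiny stand-in model that only counts partial_fit calls."""
    def __init__(self):
        self.fits = 0
    def partial_fit(self, X, y):
        self.fits += 1

strong, weak, fresh = CountingModel(), CountingModel(), CountingModel()
# random-classifier weight = 2.0: only `strong` qualifies for retraining
update_members([strong, weak, fresh], [4.0, 1.0, 5.0], 2.0, fresh, [[0.0]], [0])
```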

Proposed Changes in AUE and AWE Algorithms to Deal with Imbalanced Classification Problem
For the aforementioned reasons, this paper proposes the application of metrics much better suited to assessing the quality of algorithms aimed at the binary classification of imbalanced data. The first of the selected metrics is the F1-score [27], which aggregates two simple metrics: sensitivity, determining the accuracy of the minority class classification, and precision, indicating the probability of its correct detection:

F1-score = (2 · Precision · Sensitivity) / (Precision + Sensitivity). (5)
The subsequent selected metrics aggregate, using different approaches, the sensitivity and the specificity score, which in the binary case indicates the accuracy of recognizing the negative (majority) class. The first is G-mean [28], the geometric mean of sensitivity and specificity:

G-mean = √(Sensitivity · Specificity), (6)

and the last one is the balanced accuracy score [26], their arithmetic mean:

balanced accuracy score = (Sensitivity + Specificity) / 2. (7)

The advantage of both of these metrics is that they take into account improving the minority class classification while also avoiding deterioration of the majority class classification.

In the proposed models, these metrics were used to calculate the weights of the ensemble members and, in the case of the AUE model, to estimate the weight of a random classifier based on the prior class probabilities.
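All three metrics can be computed from the binary confusion-matrix counts; the sketch below uses our own function name and treats the minority class as positive.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """F1-score, G-mean and balanced accuracy from binary confusion-matrix
    counts, with the minority class taken as the positive class."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # minority-class recall
    specificity = tn / (tn + fp) if tn + fp else 0.0  # majority-class recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    g_mean = math.sqrt(sensitivity * specificity)
    bac = (sensitivity + specificity) / 2
    return f1, g_mean, bac

# A majority-class dummy on a 95:5 chunk never finds a true positive, so all
# three metrics penalize it, while plain accuracy would still report 0.95
f1, g_mean, bac = imbalance_metrics(tp=0, fp=0, tn=95, fn=5)
```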
In addition, the conducted study verified the impact of data sampling on the quality of classification. Random over-and undersampling methods were chosen because of their simplicity and low computational complexity in stream processing. In addition, in the case of large imbalance leading to a small number of minority class objects, they give similar results to other popular sampling methods.
Pseudocodes of AWE and AUE with the added proposed modifications are presented in Algorithms 3 and 4.

Algorithm 3 Modified AWE pseudocode.
Input: S: new data chunk; K: size of the ensemble; C: ensemble of K classifiers
Output: C: ensemble of K classifiers with updated weights
Train new classifier C′ with S;
Estimate weight of C′ with cross-validation on S based on (5), (6) or (7);
for C_i in C do
    Calculate weight w_i of C_i on S based on (5), (6) or (7);
end for
C ← K classifiers with the highest weights from C ∪ {C′};
return C;

Algorithm 4 Modified AUE pseudocode.
Input: S: new data chunk; K: size of the ensemble; C: ensemble of K classifiers
Output: C: ensemble of K updated classifiers with updated weights
Train new classifier C′ on S;
Estimate weight of C′ using cross-validation on S based on (5), (6) or (7);
for C_i in C do
    Calculate weight of C_i on S based on (5), (6) or (7);
end for
Calculate weight w_R of random classifier on S based on (5), (6) or (7) and a priori probabilities;
C ← K classifiers with the highest weights from C ∪ {C′};
for C_e in C do
    if w_e > w_R and C_e ≠ C′ then
        update C_e with S;
    end if
end for
return C;

Experimental Set-Up
When testing the quality of the proposed algorithms, it was decided to use synthetic data streams. Although they do not show how the models would cope with real problems, artificially generated data allow for a more accurate analysis thanks to, among other factors, the fixed location of the concept drifts and the possibility of any number of replications. The data were provided by the generator from the stream-learn module, which employs the Madelon principle [29] of problem synthesis, also present in the popular scikit-learn module, and adds the ability to change the data distribution over time, along with other properties known in the field of stream classification. Additionally, in order to make recognition more difficult, fixed label noise was introduced into 1% of the samples.
In order to thoroughly analyze the behavior of the models, streams with different imbalance levels were created, where the minority class accounts for, respectively, 5%, 10%, 20% and 30% of the entire data stream. For each proportion, five occurrences of a given type of concept drift (sudden or gradual) were included in the stream and evenly distributed over time. The data stream was delivered to the incremental models in the form of 100 chunks, each with 500 patterns. Like in many analyses in this field [15], the stream consisted of two informative features. Each stream type was replicated five times with different random states. Descriptions of the generated stream types are shown in Table 1. For each data stream, ensembles of 10 members were built, with the Hoeffding tree chosen as the base classifier. Combined models were created with each combination of parameters: (1) the base algorithm, (2) the weighting method and (3) the type of sampling, which gave 22 considered solutions, presented in Table 2. In addition, they were compared with the non-modified AWE and AUE algorithms, as well as with the WAE, OOB and UOB approaches. The models were tested using the Test-Then-Train experimental protocol, in which the incoming chunk is used first to evaluate the model and then to train it. The metrics used in model construction, i.e., F1-score, G-mean and balanced accuracy score, were also selected for evaluation. After conducting the experiments, the Wilcoxon test [30] was carried out on the results for observation pairs, with four degrees of freedom and a significance level of 0.05.
The experiments were carried out in the Python environment using the scikit-learn [31], stream-learn [32], imbalanced-learn [33] and scikit-multiflow [34] libraries and own implementations of modified AWE and AUE methods. The source code of used algorithms as well as experimental procedure is published in a public repository on GitHub (https://github.com/w4k2/imbalancedstream-ensembles).

Experimental Evaluation
As may be observed from Tables 3-5, larger differences between the results of individual models occur for streams with a greater imbalance, both in terms of the average of all scores achieved during processing and according to the statistical tests. Only a large disproportion between classes, on the order of, for example, 1:19 or 1:9, seems to pose a proper challenge that significantly differentiates the quality of the presented algorithms.
As expected from the AUE description, algorithms based on AUE achieve better results than methods where AWE is the base ensemble approach. This is due to, in the case of AWE, the use of a limited number of samples for each member, which impairs their discriminatory ability. Classifiers in AUE-based ensembles generally receive more samples from the same concept and thus better recognize the patterns it contains. For a similar reason, in the case of high imbalance, models using oversampling cope better with the problem. This is related to the size of the received chunk, and more specifically to the number of received minority class objects. For the stream with the highest disparity between classes, each chunk contains only 25 samples of the minority class. After undersampling, individual classifiers have very few samples to train on, which results in their lower quality.
The obtained results show that changes in the weighting method have the greatest impact in the case of the F1-score metric (Table 3). What is more, introducing data sampling degrades the quality of ensembles using imbalanced metrics to calculate the weights of member classifiers (Figure 1). Sampling directly affects the frequency of pointing to the minority class, which, while increasing the number of correctly recognized minority samples, also increases the number of samples falsely identified as the positive class, as captured by the precision component of the F1-score. Especially at high imbalance levels, when there are very few minority class samples, even a small percentage of poorly recognized majority class samples rapidly reduces the value of the precision metric. This also explains the significant difference between the values of the F1-score metric and those of the G-mean and balanced accuracy score for the streams with the highest imbalance. The latter two use specificity instead of precision, which, due to the large size of the majority class, responds much more mildly to the incorrect classification of individual samples.
The results for the G-mean (Table 4) and balanced accuracy score (Table 5) metrics show that the mere modification of the method of assigning weights to ensemble members is insufficient: models using sampling alone were statistically significantly better than models without sampling. Both under- and oversampling significantly increased the quality of recognition of minority class objects at the cost of a slight deterioration in the classification of the majority class. Still, however, adding the modified weight allocation further increases the quality of classification, which in some cases is also supported by the statistical tests (Figures 2 and 3).
According to the results, the best method of assigning weights seems to be in proportion to the F1-score, with weighting in proportion to the G-mean metric coming second. Both methods of calculating weights improve the quality of the classifiers not only with respect to their own metrics, but also with respect to all the others. In addition, the models using them are in most cases much better than almost all the others, which is also confirmed by the performed statistical tests (Figure 4).
It is also worth noting that the proposed models with modifications are also suitable for problems with low imbalance and achieve much better quality than models created strictly for the problem of imbalanced data streams.

Figure 4. Comparison of the best proposed models with other methods of stream processing using the Test-Then-Train procedure on the stream with gradual and sudden concept drifts and 5% of minority class samples.

Conclusions
This paper presents a novel proposition extending state-of-the-art streaming data processing methods with modified weighting metrics for member classifiers, taking into account the prior class probabilities present during the flow of a data stream containing various types of the concept drift phenomenon. An in-depth experimental analysis of the proposed methods was carried out, including three standard aggregated metrics used to assess the quality of prediction models constructed on imbalanced classification problems, as well as statistical testing to verify the significance of the differences between the models. Experiments were conducted using various class imbalance levels and drift types to thoroughly study the characteristics of the evaluated algorithms. In comparison with the standard methods of solving the problem of imbalanced data streams, based on resampling of the training set, the greater usefulness potential of the presented proposal has been demonstrated for all the examined imbalance levels and occurring concept drifts. Nonetheless, a considerable limitation of this study was the lack of evaluation on real-life data, which should be included in further research, together with the introduction of the proposed modifications into other stream processing algorithms.
The modifications introduced in the AWE and AUE methods allow a noticeable improvement in the predictive capabilities of ensemble models, both in cases of high imbalance and with relatively small disproportions between the problem classes. The proposed approach only modifies the method of establishing the weights of the individual classifiers in the ensemble pool and therefore creates no additional computational overhead, so it may be recommended without major contraindications for solving problems of imbalanced stream classification with any imbalance ratio.