Concept Drift Adaptation with Incremental–Decremental SVM

Abstract: Data classification in streams where the underlying distribution changes over time is known to be difficult. This problem, known as concept drift, involves two aspects: (i) detecting the drift and (ii) adapting the classifier. Online training considers only the most recent samples, which form the so-called shifting window. Dynamic adaptation to concept drift is performed by varying the width of this window. Defining an online Support Vector Machine (SVM) classifier able to cope with concept drift by dynamically changing the window size, while avoiding retraining from scratch, is currently an open problem. We introduce the Adaptive Incremental–Decremental SVM (AIDSVM), a model that adjusts the shifting window width using the Hoeffding statistical test. We evaluate AIDSVM on both synthetic and real-world drift datasets. Experiments show a significant accuracy improvement when encountering concept drift, compared with similar drift detection models defined in the literature. AIDSVM is efficient, since it is not retrained from scratch after the shifting window slides.


Introduction
An important class of Machine Learning (ML) problems involves dealing with data that have to be classified on arrival. Usually, in these situations, historical data become less relevant for the ML task. Climate change forecasting is such an example: in the past, elaborate models predicted quite well how carbon emissions impact the warming of the environment; however, given the accelerated emission rates, the trends determined from past data have changed [1]. Typically, for these problems, the underlying distribution changes in time. There are many ML models that can approximate a stationary distribution when the number of samples increases to infinity [2]. A classifier that considers its entire history cannot be employed, since it would generalize poorly, not to mention the technical difficulties raised by keeping all the data. This pattern of evolution, for which the intrinsic distribution of the data is not stationary, is called concept drift. As data evolve, the cause may be either noise or genuine change; the distinction between them is made via persistence [3]. Concept drift models must combine robustness to noise and outliers with sensitivity to the concept drift [2].
Methods for coping with concept drift are presented in several comprehensive overviews [4][5][6]. An important topic is the embedding of drift detection into the learning algorithm. According to Farid et al. [7], there are three main approaches: instance-based (window-based), weight-based, and ensembles of classifiers. Window-based approaches usually adjust the window size considering the classification accuracy rate, while weight-based approaches discard past samples according to a computed importance. More recent studies [8,9] divide stream mining in the presence of concept drift into active (trigger-based) and passive (evolving) approaches. The active approaches update the model whenever a drift is detected, whereas the passive approaches continuously update the model as new data arrive. The remainder of the paper compares our method with current approaches in concept drift; Section 6 summarizes the main achievements of our work and discusses further possible extensions.

Related Work: Concept Drift with Adaptive Shifting Windows
In this section, we list some of the most recent concept drift models based on adaptive windows, with a focus on SVM approaches.
Several models addressing concept drift on adaptive windows were proposed in recent years. Detailed overviews are given by the work of Iwashita et al. [5], Lu et al. [6], as well as Gemaque et al. [21]. The Learn++.NSE algorithm [22,23] and its fast version [24] generate a new classifier for each received batch of data, and add the classifier to an existing ensemble. The classifiers are later combined using dynamically weighted majority voting, based on the classifier's age. In [25], the adaptive random forest algorithm, used for the classification of evolving data streams, combines batch algorithm traits with dynamic update methods.
One of the seminal papers in this field is the Drift Detection Method (DDM) described in [26]. It uses the classification error as evidence of concept drift. The classification error decreases as the model learns the newer samples. The model establishes a warning level and a drift level. When the classification error increases over the warning level, newer samples are introduced into a special window. Once the error increases over the drift level, a new model is created, starting from the samples from the special window. A later extension, the Early Drift Detection Method (EDDM) [27], uses the distance between two consecutive errors and its standard deviation instead of the simple error rate used in the DDM. It also follows the same principle of comparing the error rate against warning and drift thresholds. Both methods are designed to operate regardless of the incremental learner used.
The Fast Hoeffding Drift Detection Method (FHDDM) [28] detects the drift point using a constant-size sliding window. It detects a drift if a significant variation is observed between the current and the maximum accuracy. The accuracy difference threshold is determined using Hoeffding's inequality: for a window of $n$ samples and confidence parameter $\delta$, a drift is signalled when $p_{max} - p > \epsilon_d$, with $\epsilon_d = \sqrt{\frac{1}{2n}\ln\frac{1}{\delta}}$, where $p$ is the current in-window accuracy and $p_{max}$ the maximum accuracy observed so far. The FHDDM method is extended by maintaining windows of different sizes in the Stacked Fast Hoeffding Drift Detection Method (FHDDMS) [29], which reduces detection delays. An extensive treatment of these two methods is given in [30,31].
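As a sketch, the FHDDM test just described can be implemented as follows; the class and parameter names are illustrative (the reference implementation lives in the Tornado framework), and the default window size and confidence are assumptions:

```python
import math
from collections import deque

class FHDDM:
    """Minimal sketch of the Fast Hoeffding Drift Detection Method.

    Keeps a constant-size sliding window of prediction outcomes
    (1 = correct, 0 = wrong) and signals drift when the in-window
    accuracy p drops below the best accuracy p_max seen so far by
    more than the Hoeffding bound eps = sqrt(ln(1/delta) / (2n)).
    """

    def __init__(self, n=100, delta=1e-7):
        self.window = deque(maxlen=n)
        self.n = n
        self.eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
        self.p_max = 0.0

    def add(self, correct):
        """Feed one prediction outcome; return True if drift is detected."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.n:
            return False           # not enough samples for the test yet
        p = sum(self.window) / self.n
        if p > self.p_max:
            self.p_max = p
        return (self.p_max - p) > self.eps
```

With `n = 100` and `delta = 1e-7`, the bound is roughly 0.28, so a classifier that goes from fully correct to fully wrong triggers the detector after about 29 wrong predictions enter the window.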
A very recent approach using the error rate is the Accurate Concept Drift Detection Method (ACDDM) [32]. The author analyzes the consistency of the prequential error rate using Hoeffding's inequality. At each step, the difference between the current error rate and the minimum error rate is determined, and compared against the threshold given by the Hoeffding inequality for a desired confidence level. The drift is detected when the error rate difference is greater than the computed deviation. The ACDDM is evaluated using the Very Fast Decision Tree learning algorithm ([32], Section 3).
Recently, SVMs were also used to address concept drift. ZareMoodi et al. [33] proposed an SVM classification model with a learned label space that evolves in time, where novel classes may emerge or old classes may disappear. For the modelling of intricate-shape class boundaries, they used support-vector-based data description methods. Yalcin et al. [34] used SVMs in an ensemble-based incremental learning algorithm to model the changing environments. Learning windows of variable length also appear in the papers of Klinkenberg et al. [17] and Klinkenberg [18]. Klinkenberg's methods use a variable width window, which is adjusted by the estimation of classification generalization error. At each time step, the algorithm builds several SVM models with various window sizes, then it selects the one that minimizes the error estimate. The appropriate window size is automatically computed, and so is the selection of samples and their weights. While the methods used by Klinkenberg et al. can be used online in applications, they are not incremental and the SVMs must be retrained. Another approach proposes an adaptive dynamic ensemble of SVMs which are trained on multiple subsets of the initial dataset [19]. The most recent heuristic approach splits the stream into data blocks of the same size, and uses the next block to assess the performance of the model trained on the current block [35].

Background: The Adaptive Window Model for Drift Detection and the Incremental-Decremental SVM
To make the paper self-contained, we summarize in the following the two techniques incorporated in the AIDSVM method: the statistical test used for concept drift detection, and the incremental-decremental SVM procedure used to discard the obsolete part of the window.

Concept Drift with Adaptive Window
We use the ADWIN adapting window strategy to cope with concept drift. Details can be found in the original paper [20].
During learning, past data, up to the current sample, are stored in a fixed-size window. For every sample x_i in the window, characterized by its class y_i, the trained model predicts a class ŷ_i; we compute the sample error e_i, which is 0 if y_i = ŷ_i and 1 if the predicted label is wrong. Over a set of samples, the prediction error e_i is a random variable that follows a Bernoulli distribution, and the sum of these errors follows a binomial distribution. If the width of the window is n, where x_i (i = 1, ..., n) are the window samples, then the model error rate is the probability p_i of observing 1 in the sequence of errors e_j (j = 1, ..., i). Each p_i is drawn from a distribution D_i. In the ideal case of no concept drift, all D_i distributions are identical. With concept drift, the distribution changes, as the error rate is expected to increase.
The ADWIN strategy successively splits the current window of n elements into two "large enough" sub-windows. If these sub-windows show "different enough" averages of their sample errors, then the expected values corresponding to the two binomial distributions are different. By incrementing the value of i, the approach constructs all possible cuts of the current window into two adjacent splits (W_0, W_1), where W_0 holds the n_0 samples x_1, x_2, ..., x_i and W_1 holds the n_1 samples x_{i+1}, ..., x_n, with n = n_0 + n_1. As the cuts are constructed, they are evaluated against the following Hoeffding test:

$$|\hat{\mu}_0 - \hat{\mu}_1| > \epsilon_{cut}, \quad (1)$$

where $\hat{\mu}_0$ and $\hat{\mu}_1$ are the averages of the error values in W_0 and W_1, and δ ∈ (0, 1) is the required global error. The scaling δ' = δ/n is required by the Bonferroni correction, since we perform multiple hypothesis tests by repeatedly splitting the samples. The statistical test checks whether the observed averages differ by more than a threshold $\epsilon_{cut}$, which depends on the sizes of the window splits.
The null hypothesis H_0 assumes that the mean µ has remained constant along all the "large enough" cuts performed on the sliding window W. Parameter δ tunes the test's confidence; for example, a 95% confidence level corresponds to δ = 0.05. The statistical test for observing different distributions in W_0 and W_1 checks whether the observed averages in the two sub-windows differ by more than the threshold $\epsilon_{cut}$. Given a specified confidence parameter δ, it was shown in [20] that the maximum acceptable difference $\epsilon_{cut}$ is:

$$\epsilon_{cut} = \sqrt{\frac{1}{2m} \ln \frac{4}{\delta'}}, \quad (2)$$

where:

$$m = \frac{1}{1/n_0 + 1/n_1} \ \text{(the harmonic mean of } n_0 \text{ and } n_1\text{)}, \qquad \delta' = \frac{\delta}{n}. \quad (3)$$

However, this approach is too conservative. Based on the Hoeffding bound, it overestimates the probability of large deviations for distributions of small variance, assuming the worst-case variance σ² = 0.25. A more appropriate test used in [20] also takes the window variance $\sigma_W^2$ into consideration:

$$\epsilon_{cut\_adjusted} = \sqrt{\frac{2}{m} \sigma_W^2 \ln \frac{2}{\delta'}} + \frac{2}{3m} \ln \frac{2}{\delta'}. \quad (4)$$

In Equation (4), the square root term adjusts the cut threshold relative to the standard deviation, whereas the additive term guards against the cases where the window splits are too small.
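The two thresholds of Equations (2) and (4) can be computed directly; the sketch below uses illustrative function names:

```python
import math

def eps_cut(n0, n1, delta, n):
    """Hoeffding-bound threshold for the |mu0 - mu1| test (Equation (2)).
    delta is divided by n (Bonferroni correction over the n possible cuts)."""
    delta_p = delta / n
    m = 1.0 / (1.0 / n0 + 1.0 / n1)       # "harmonic mean" of the split sizes
    return math.sqrt(1.0 / (2.0 * m) * math.log(4.0 / delta_p))

def eps_cut_adjusted(n0, n1, delta, n, var):
    """Variance-aware threshold (Equation (4)): the square root term scales
    the cut with the observed window variance, while the additive term
    guards against splits that are too small."""
    delta_p = delta / n
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    return (math.sqrt(2.0 / m * var * math.log(2.0 / delta_p))
            + 2.0 / (3.0 * m) * math.log(2.0 / delta_p))
```

For a balanced cut of a 1000-sample window with δ = 0.05, `eps_cut` gives about 0.15; with a small observed variance, `eps_cut_adjusted` drops well below that, illustrating why the adjusted test is less conservative for low-variance windows.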
An exemplification of these criteria is given in Figure 1. We considered a window of 1000 simulated samples. For all samples x_i inside the sliding window, we constructed the error e_i using several simulated Bernoulli distributions. Afterwards, we computed the average error difference for each window split. A reference classifier with 85% accuracy and no drift is simulated with a Bernoulli distribution of p = 0.15; we simulated 20 such distributions. For drift simulation, we created 20 mixed distributions by concatenating the first 700 samples from a Bernoulli distribution with p = 0.15 with the last 300 samples from another Bernoulli distribution with p = 0.4. Thus, we simulated a sudden drop in the classifier's accuracy from 85% to 60%. Then, we obtained the test margins ε_cut and ε_cut_adjusted from Equations (2) and (4) by successive splits of the shifting window into W_0 and W_1, imposing a limit of at least 41 samples per split (for statistical relevance). In Figure 1, it can be seen that the two margins, ε_cut and ε_cut_adjusted, are somewhat similar. However, the adjusted threshold ε_cut_adjusted is more resilient to false positives on smaller partitions, and more conservative on larger ones.

Kuhn-Tucker Conditions and Vector Migration in Incremental-Decremental SVMs
Among the SVM models suitable for adapting to drifting environments, the incremental SVM learning algorithm of Cauwenberghs and Poggio (CP) [36] (later extended in [37]) is well equipped for handling non-linearly separable input spaces. By design, it is also able to non-destructively forget samples, adapting its statistical model to the remaining data. Retraining from scratch is thus avoided, and the model can learn/unlearn much faster than a traditional SVM. An efficient implementation of the CP algorithm for individual learning was analyzed by Laskov [38], along with a similar algorithm for one-class learning. Practical implementation issues of the CP algorithm were discussed in [39,40]. The algorithm was also adapted for regression [41][42][43]. The incremental approach was revisited more recently in [44], where a linear exponential cost-sensitive incremental SVM was defined. In the following, Equations (5)–(14) are taken from [39]. Our AIDSVM method is based on the CP algorithm; therefore, we briefly review the theoretical framework, with emphasis on the Kuhn-Tucker conditions and the exact vector migration relations, which were presented in full detail in [39].
For a set of samples x_i with associated labels y_i ∈ {−1, +1} (i = 1, ..., N), a linear SVM computes the separation hyperplane as a linear combination of the input samples, given by the function g(x) = w^T x + b; the predicted label is ŷ_i = sign(g(x_i)).
The optimal hyperplane is determined by the following optimization problem:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad (5)$$

where C is the regularization constant tuning the strength of the constraints. We define the penalty function h(x_i) for data samples x_i as:

$$h(x_i) = y_i \, g(x_i) - 1. \quad (6)$$

If x_i is correctly classified with a sufficient margin, h(x_i) is positive and the slack variable ξ_i associated with its constraint is zero. Otherwise, if x_i is incorrectly classified, or is on the right side of the hyperplane but at a distance smaller than the margin 1/||w||, h(x_i) becomes negative and ξ_i > 0. However, Equation (5) enforces small ξ_i penalties. The regularization parameter C tunes the trade-off between increasing the margin and classifying the training samples correctly.
Solving the constrained optimization problem makes use of the Kuhn-Tucker (KT) conditions; two of them are relevant for the incremental-decremental approach:

$$\lambda_i \left[ h(x_i) + \xi_i \right] = 0, \quad (8)$$

$$h(x_i) + \xi_i = 0. \quad (9)$$

Applying the KT conditions also determines the separation hyperplane to be computed as $g(x_i) = \sum_{j=1}^{N} \lambda_j y_j x_j^T x_i + b$. Condition (8) is the complementary slackness condition. If λ_i = 0, then the vector is not part of the solution at all. If λ_i is non-zero, then (9) must hold, and x_i will be part of the solution. When ξ_i = 0 and h(x_i) = 0, sample x_i will be considered a support vector.
The penalty h(x_i) can be:

$$h(x_i) > 0, \ \lambda_i = 0; \qquad h(x_i) = 0, \ 0 < \lambda_i < C; \qquad h(x_i) < 0, \ \lambda_i = C. \quad (10)$$

Based on these conditions, a vector x_i belongs to one of the following sets: (i) support vectors, with h(x_i) = 0 and 0 < λ_i < C, defining the hyperplane; (ii) error vectors, with h(x_i) < 0 and λ_i = C, situated on the wrong side of the separation hyperplane (or inside the separation region); and (iii) rest vectors, with h(x_i) > 0 and λ_i = 0, situated on the correct side of the separation hyperplane.
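The partition into support, error and rest vectors can be checked with a small helper; this is an illustrative sketch (the function name and the floating-point tolerance are assumptions, not part of the published algorithm):

```python
def categorize(lmbda, h, C, tol=1e-8):
    """Assign a vector to a KT-consistent set given its multiplier lambda
    and its penalty h(x); `tol` absorbs floating-point noise."""
    if abs(h) <= tol and 0.0 <= lmbda <= C:
        return "support"    # on the margin: h = 0, 0 <= lambda <= C
    if h < 0.0 and abs(lmbda - C) <= tol:
        return "error"      # inside the margin or misclassified: lambda = C
    if h > 0.0 and abs(lmbda) <= tol:
        return "rest"       # safely classified: lambda = 0
    return "violating"      # KT conditions not yet satisfied
```

During an incremental step, vectors reported as "violating" are exactly the ones whose coefficients must still be moved until every vector falls into one of the three sets.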
We map the input to a multi-dimensional space characterized by a kernel K(·, ·) and use the notation:

$$Q_{ij} = y_i y_j K(x_i, x_j). \quad (11)$$

We generalize the penalty function to $h(x_i) = \sum_j \lambda_j Q_{ij} + b y_i - 1$. Incremental-decremental training comes down to varying the λ_i parameters so that the KT conditions are always fulfilled. These variations determine vector migrations between the previously mentioned sets of vectors. When the coefficient λ_c of the current vector x_c is varied, the support vector coefficients and the bias change proportionally:

$$\Delta \lambda_s = \beta_s \, \Delta \lambda_c, \qquad \Delta b = \beta_b \, \Delta \lambda_c, \quad (12)$$

while the penalties of the non-support vectors change as:

$$\Delta h(x_r) = \gamma_r \, \Delta \lambda_c, \quad (13)$$

where the 's' index stands for support vectors and 'r' is used for both the error and the rest vector sets. By computing the exact increments Δλ, we carefully trace the vectors' migrations among the sets, thus performing the learning/unlearning (which are symmetrical procedures).
Considering the first relation, Δλ_s = β_s Δλ_c, where β_s is the s-th component of vector β, we find that −λ_s ≤ Δλ_s ≤ C − λ_s, and further that −λ_s ≤ β_s Δλ_c ≤ C − λ_s; this means, for the incremental case, that:

$$\Delta \lambda_c \le \frac{C - \lambda_s}{\beta_s} \ \text{if} \ \beta_s > 0, \qquad \Delta \lambda_c \le \frac{-\lambda_s}{\beta_s} \ \text{if} \ \beta_s < 0. \quad (14)$$

Equation (14) is for support vectors only; a similar equation can be written for the rest vectors. The entire discussion is provided in detail in [39].
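Equation (14) translates into a simple bound computation over the support set; the sketch below assumes the coefficients λ_s and the sensitivities β_s are given as parallel lists (the function name is illustrative):

```python
def max_increment(lambdas, betas, C):
    """Largest admissible increment of lambda_c before some support vector
    hits a bound (Equation (14)): for beta_s > 0 the limit is
    (C - lambda_s)/beta_s, for beta_s < 0 it is -lambda_s/beta_s."""
    limit = float("inf")
    for lam, beta in zip(lambdas, betas):
        if beta > 0:
            limit = min(limit, (C - lam) / beta)   # lambda_s would reach C
        elif beta < 0:
            limit = min(limit, -lam / beta)        # lambda_s would reach 0
        # beta == 0: this lambda_s does not move, so it imposes no limit
    return limit
```

Taking λ_c up to exactly this limit is what forces at least one vector to migrate between the support, error and rest sets at every iteration of the learning/unlearning loop.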

Adaptive-Window Incremental-Decremental SVM (AIDSVM)
We are now ready to introduce the AIDSVM algorithm, which is a generalization of the CP algorithm for concept drift, using an adaptive shifting window.
Using the classification terminology [7][8][9], AIDSVM is a window-based active approach. It uses a window of the most recent samples to construct the classifier, and reacts to the concept drift by discarding the oldest samples from the window, until the Hoeffding condition (1) is met.
The AIDSVM method is presented in Algorithm 1. A high-level diagram is also shown in Figure 2. The algorithm starts with an empty window; the samples are added progressively as they arrive. The window should have a minimum length, such that the statistical test could always be performed on a relevant number of samples. Below this minimum, the drift detection is not employed. For every sample added, several tests are performed on the current window. The window is partitioned into two splits, W 0 and W 1 . As the partition moves, the length of split W 0 increases, and the length of split W 1 decreases. For a window width of n data vectors, where we keep at least m elements in the split, there are exactly n − m − 1 possibilities of constructing the W 0 and W 1 window splits.

Algorithm 1 Concept drift AIDSVM learning and unlearning
procedure ADAPTIVESHIFTINGWINDOW(data_stream)      ▷ the data stream is considered continuous
    choose C, ε_cut
    choose min_window_size and max_window_size, with min_window_size < max_window_size
    set initial solutions using (x_1, y_1) and (x_2, y_2)
    W ← ∅                                          ▷ window initialized with empty list
    while incoming data samples exist do
        (x_k, y_k) ← next incoming sample
        extend kernel with (x_k, y_k)              ▷ collect statistics for next vector x_k
        LEARN(x_k)                                 ▷ append vector

procedure LEARN(x_c)
    while sample x_c not yet learned do
        Q ← compute_Q(kernel, y)                   ▷ with Equation (11)
        β_s ← compute_beta(Q, y)                   ▷ with Equation (12)
        γ_s ← compute_gamma(Q, y, β_s)             ▷ with Equation (13)
        Δl_s, Δl_r ← compute_limits_for_support_and_rest_vectors(x_c, C)   ▷ with Equation (14)
        update all λ_s, λ_c using Δl_s, Δl_r, C, β_s and γ_s   ▷ at least one vector will migrate
        reassign_vectors_in_sets()

procedure UNLEARN(x_c)
    while x_c not yet unlearned do
        if x_c removal leaves its class unrepresented then
            return
        Q ← compute_Q(kernel, y)                   ▷ with Equation (11)
        β_s ← compute_beta(Q, y)                   ▷ with Equation (12)
        γ_s ← compute_gamma(Q, y, β_s)             ▷ with Equation (13)
        Δl_s, Δl_r ← compute_limits_for_support_and_rest_vectors(x_c, C)   ▷ with Equation (14)
        update all λ_s, λ_c using Δl_s, Δl_r, C, β_s and γ_s   ▷ at least one vector will migrate
        reassign_vectors_in_sets()

The SVM classifier, trained on the entire window W, is evaluated on every sample x_i ∈ W. The estimated class ŷ_i is compared against the true class label y_i. For every pair of window splits W_0 and W_1, we compute the mean of the sample errors e_i = [y_i ≠ ŷ_i], and then the difference of those means. This difference is compared with the dynamic threshold ε_cut_adjusted given by Equation (4). In the ideal case of a window without concept drift, the difference is close to zero. Once the difference becomes greater than the computed threshold, all samples from the first split W_0 are unlearned by the decremental SVM procedure, and training is resumed.
The algorithm does not apply the statistical test if the current shifting window has fewer samples than min_window_size; this is taken as a measure of precaution. Conversely, the upper size of the shifting window is also limited. In addition, the SVM does not remove a vector that is the only remaining representative of its class.
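The drift test described above, which scans all admissible cuts of the in-window error sequence, can be sketched as follows. This is a didactic sketch, not the AIDSVM implementation: the threshold is passed in as a callable (for example, a function computing ε_cut from Equation (2)), and the errors are a plain 0/1 list rather than an incremental structure.

```python
def find_drift_cut(errors, threshold, min_split=41):
    """Scan all cuts (W0, W1) of the error sequence (0 = correct, 1 = wrong),
    keeping at least `min_split` samples on each side. Return the first cut
    index i whose mean difference |mu0 - mu1| exceeds threshold(n0, n1, n),
    or None when no drift is detected; samples 0..i-1 would then be unlearned."""
    n = len(errors)
    if n < 2 * min_split:
        return None                      # window too small for the test
    total = sum(errors)
    s0 = sum(errors[:min_split - 1])     # running sum of the W0 split
    for i in range(min_split, n - min_split + 1):
        s0 += errors[i - 1]              # s0 == sum(errors[:i])
        n0, n1 = i, n - i
        mu0, mu1 = s0 / n0, (total - s0) / n1
        if abs(mu0 - mu1) > threshold(n0, n1, n):
            return i
    return None
```

A clean before/after error pattern yields a cut near the true change point, while an error sequence with a stable mean yields no cut at all.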
Let us analyze the computational complexity of this algorithm. We consider N to be the width of the shifting window. The ADAPTIVESHIFTINGWINDOW procedure (Algorithm 1) calls the LEARN/UNLEARN procedures, which both iterate the update relations given by Equations (11)-(14). In practice, we observed that discarding the entire W_0 split is sufficient to reinitialize the model, and further accuracy drops do not occur. We can conclude that, for most cases, the execution time is in O(N^3).

Experiments
We experimentally compared the performance of AIDSVM with those of FHDDM, FHDDMS, DDM, EDDM and ADWIN, which were introduced in Sections 2 and 3.1. For these drift detectors, two base learners were employed, namely Naive Bayes (NB) and Hoeffding Trees (HT) [28]. We used the implementations provided by the Tornado framework (sources are available online [45]).
We also compared the performance of AIDSVM against the classic SVM (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html, accessed on 2 October 2021) (C-SVM). This is an SVM trained on a fixed-size window: when a sample arrives, the earliest one is discarded to make room for the new one, and the SVM is retrained from scratch on the updated window.
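The C-SVM baseline can be sketched generically; in our setting `make_clf` would construct an `sklearn.svm.SVC`, but any classifier exposing fit/predict works. This is an illustrative sketch of the scheme just described, not the evaluation code used in the experiments:

```python
from collections import deque

def sliding_window_classify(stream, make_clf, window_size=1000):
    """Fixed-window (C-SVM style) baseline: keep the most recent
    `window_size` labelled samples; on each arrival, retrain from scratch
    on the window and predict the new sample before learning it.
    Yields (prediction, true_label) pairs once predictions are possible."""
    window = deque(maxlen=window_size)    # oldest sample falls off automatically
    for x, y in stream:
        if len(window) >= 2:
            X = [w[0] for w in window]
            Y = [w[1] for w in window]
            if len(set(Y)) > 1:           # an SVC needs at least two classes
                clf = make_clf()          # retrain from scratch on the window
                clf.fit(X, Y)
                yield clf.predict([x])[0], y
        window.append((x, y))
```

The repeated `fit` call on every arrival is exactly the cost that AIDSVM avoids by learning and unlearning incrementally.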

SINE1 Dataset
Following [28,30,46], we used the SINE1 synthetic dataset (https://github.com/alipsgh, accessed on 2 October 2021), with two classes, 100,000 samples, and abrupt concept drift [31]. Additionally, 10% noise was added to the data. The rationale, as given by [46], is also to assess the robustness of the drift detection classifier in the presence of noise. The dataset has only two attributes, (x_a, x_b), uniformly distributed in [0, 1]. A point with x_b < sin(x_a) is classified as belonging to one class, while the rest belong to the other class. Every 20,000 instances, an abrupt drift occurs: the classification is reversed. This has the advantage that we know exactly where the drift occurs; as such, we can evaluate the sensitivity of our classifier.
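A minimal SINE1-style generator follows the rules just described; the drift schedule and the label-noise model match the text, while the seeding is an assumption for reproducibility:

```python
import math
import random

def sine1_stream(n=100000, drift_every=20000, noise=0.10, seed=0):
    """Sketch of the SINE1 generator: two uniform attributes in [0, 1];
    the label tests x_b < sin(x_a) and is reversed at every drift point;
    `noise` flips that fraction of the labels (10% in the paper)."""
    rng = random.Random(seed)
    for i in range(n):
        xa, xb = rng.random(), rng.random()
        label = 1 if xb < math.sin(xa) else 0
        if (i // drift_every) % 2 == 1:   # abrupt drift: reversed concept
            label = 1 - label
        if rng.random() < noise:          # label noise
            label = 1 - label
        yield (xa, xb), label
```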

CIRCLES Dataset
Another dataset used frequently in the literature is the CIRCLES dataset [20,27,31,47]. It is a set with gradual drift; two attributes x and y are uniformly distributed in the interval [0, 1]. The circle function is (x − x_c)² + (y − y_c)² = r_c², where x_c and y_c define the circle center and r_c is its radius. Positive instances are inside the circle, whereas the exterior ones are labelled as negative. Concept drift happens when the classification function (the circle parameters) changes; this happens every 25,000 samples.
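A CIRCLES-style generator can be sketched as below. The four (center, radius) concepts are the ones commonly used in the literature and are an assumption here; the gradual drift of the original dataset is approximated by abrupt concept switches for brevity:

```python
import random

def circles_stream(n=100000, drift_every=25000, seed=0,
                   concepts=(((0.2, 0.5), 0.15), ((0.4, 0.5), 0.20),
                             ((0.6, 0.5), 0.25), ((0.8, 0.5), 0.30))):
    """Sketch of a CIRCLES generator: a point (x, y) is positive iff
    (x - x_c)^2 + (y - y_c)^2 <= r_c^2 for the active circle, and the
    active (center, radius) concept changes every `drift_every` samples."""
    rng = random.Random(seed)
    for i in range(n):
        x, y = rng.random(), rng.random()
        (xc, yc), r = concepts[(i // drift_every) % len(concepts)]
        label = 1 if (x - xc) ** 2 + (y - yc) ** 2 <= r ** 2 else 0
        yield (x, y), label
```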

COVERTYPE Dataset
The Forest Covertype dataset [48] is often used in the data stream mining literature [20,27,31,47]. It describes the forest coverage type for 30 × 30 meter cells, provided by the US Forest Service (USFS) Region 2 Resource Information System. There are 581,012 instances and 54 attributes, not counting the class type. Out of these, only 10 are continuous; the rest (such as wilderness area and soil type) are nominal. The set defines seven classes; we used only the two most represented classes, with a total of 495,141 data samples. The classes are reasonably balanced: 211,840 in class 1 vs. 283,301 in class 2. The dataset was already normalized [49]. For the SVM to work properly in the case of small windows, we detected the sets of temporally consecutive data samples belonging to the same class. We observed that, apart from a set of 5692 consecutive elements of the same class, which was skipped, all other such sequences had fewer than 300 elements. For those, we switched the middle element with the most recent element of a different class, to ensure that no sequence is longer than 150 samples from the same class. This is similar to the SVM keeping its most recent sample of the opposite class in the definition of the hyperplane.
Concept drift in the COVERTYPE case may appear as a result of changes in the forest cover type [31]. There is no hard evidence of concept drift in this case; we do not know whether concept drift occurs and, if it does, where it is positioned within the data stream [31,47,50]. Thus, we cannot compare against a baseline; only a comparison among the employed methods is possible.

Performance Comparison
We evaluated the performance on these three datasets, for two classifiers with five drift detection methods from the Tornado framework [45] (thus a total of 10 models), against AIDSVM and C-SVM. For AIDSVM, different parameters were used; they are dataset dependent and are shown in Table 1, where the window size given for AIDSVM is the maximum allowed. The window size was chosen as a sufficiently large number that provided the best accuracy results when the model was trained from the beginning of the stream and tested on the next 100 samples. The γ parameter was computed as γ = 1/(N · σ²), where N is the dataset size and σ² is the variance over all features, previously rescaled to fit the [0, 1] interval. The C parameter was determined by the incremental training process, sufficiently large so that the initial support vector set would not become empty. As we wanted a 95% confidence level in the Hoeffding inequality (1), we chose δ = 0.05.
The accuracies for the realized experiments are presented in Table 2. On the SINE1 dataset, the most accurate classic drift detection model is HT+FHDDM, with 86.37%. The C-SVM performance is only 84.83%; because C-SVM is not suited for abrupt drift, this poor accuracy is somewhat expected. The AIDSVM model was the best performer, with 88.68%, indicating good adaptability to abrupt drift. On the CIRCLES dataset, the best models are on par: HT+FHDDM with 87.16%, HT+FHDDMS with 87.19%, C-SVM with 87.17%, and AIDSVM with 87.22%. Since this is a dataset with gradual concept drift, C-SVM is expected to behave well, and this is supported by the experiment. For the COVERTYPE dataset, the best classic model is HT+DDM, achieving 89.90%. The C-SVM performance of 91.79% suggests that this is also a gradual drift dataset; however, the better performance of AIDSVM, 92.17%, indicates a rather rapid drift.

We made a time comparison between C-SVM and AIDSVM in Table 3. We recorded the mean time and standard deviation, in milliseconds, after training on the same 1000-sample window, on an Intel Core i5-8400 CPU with 16 GB RAM. The results confirm the advantage of AIDSVM, which is not retrained from scratch.

Figure 3 shows the window size dynamics for the three datasets, trained with the AIDSVM classifier. For SINE1, we can clearly observe the sudden drift changes; the window becomes almost empty. For the CIRCLES dataset, the drift is still visible at samples 25,000 and 50,000, and only slightly at 75,000. We explain this by the gradual drift employed by the dataset and by the fact that using a shifting window is inherently a way to cope with drift. In the case of the COVERTYPE dataset, the drift looks more like a combination of gradual and abrupt drifts; this is also supported by the point-to-point comparison among the drift methods contained in Table 4.

Qualitative Discussion
We represented the instant accuracy of the classifier in Figure 4. This was computed as a mean over the next 100 samples. The C-SVM instant accuracy falls at concept drift and recovers slowly, whereas the AIDSVM accuracy recovers faster. We also computed the Exponentially Weighted Average (EWA) at sample t, V_t = βV_{t−1} + (1 − β)A_t, where A_t is the accuracy at sample t. β is chosen to be 0.9995, equivalent to a weighted average over approximately the last 2000 samples. One can see that the C-SVM EWA drops by about 20%, whereas the AIDSVM EWA drop is below 5%. The last metric, the mean accuracy, shows that, compared with C-SVM, the AIDSVM mean accuracy gain is about 4%; interestingly, our mean accuracy of 88.54% is slightly better than that of the Diversity Measure Drift Detection Method in a semi-supervised environment (DMDDM-S, 87.2%), presented in the most recent work of Mahdi [46], on the same SINE1 dataset.

Table 4. Drift points detected by the compared models. Drift is detected in the same regions, mostly observed for the SINE1 and CIRCLES datasets. Here, we only show the first five detections; an ellipsis indicates that the sequence is longer. Clear concept drift is seen in SINE1 around the theoretical positions 20,000, 40,000 and 60,000, and for CIRCLES at positions 25,000, 50,000 and 75,000. The COVERTYPE dataset seems to have a mixture of abrupt and gradual concept drift.
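The EWA metric defined above can be computed as follows; initializing V_0 with the first accuracy value is our assumption, as the initialization is not stated in the text:

```python
def exp_weighted_average(accuracies, beta=0.9995):
    """Exponentially weighted average V_t = beta * V_{t-1} + (1 - beta) * A_t.
    With beta = 0.9995, this averages roughly the last
    1 / (1 - beta) = 2000 samples."""
    v = accuracies[0]            # assumed initialization: first accuracy value
    out = []
    for a in accuracies:
        v = beta * v + (1.0 - beta) * a
        out.append(v)
    return out
```

Because the weight on old values decays geometrically, a sudden accuracy drop at a drift point shows up in the EWA curve as a smooth dip whose depth reflects how long the classifier stays inaccurate.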

Drifts Signalled (Table 4 columns): SINE1, CIRCLES, COVERTYPE.

Figure 4. Accuracy comparison between the fixed-window C-SVM and the adaptive-window AIDSVM, on the SINE1 dataset. The instant accuracy, evaluated on the next 100 samples, is depicted in blue (C-SVM) and green (AIDSVM). The exponentially weighted accuracy (EWA) is also shown, as well as the mean accuracy.

Conclusions
We introduced AIDSVM, an incremental-decremental SVM concept drift model with an adaptive shifting window. It presents two important advantages: (i) better accuracy, because irrelevant samples are discarded at the appropriate moment, based on the Hoeffding test, and (ii) higher speed than a classic SVM, since no retraining is needed; the model is adapted on the run. The results of the experiments on three frequently used datasets indicate a better adjustment of the AIDSVM model compared with other drift detection methods.
Experimental evaluation indicated that AIDSVM copes better with concept drift and, in general, has similar or better accuracy results compared with classical concept drift detectors. However, the construction of the incremental solution is generally slower; this makes AIDSVM well suited for data streams with moderate throughput, where good accuracy is required in the presence of concept drift. To the best of our knowledge, our implementation is the first online SVM classifier that copes with concept drift through dynamic adaptation of the shifting window while avoiding retraining from scratch.
A further improvement to the current AIDSVM implementation would be to speed up the unlearning process. This can be carried out in two stages. First, one would determine how many samples from the beginning of the window have to be removed. This is achieved by testing the Hoeffding condition (1) on sub-windows formed by successively removing the oldest sample. Second, after finding out which samples must be removed, one would have to decrease all λ c characteristic values for those vectors in a uniform way, and a similar relation to Equation (13) must be derived.
AIDSVM could be modified to support regression problems; the incremental-decremental SVM for regression was previously approached in [41][42][43]. A natural direction would also be to extend AIDSVM to multiple classes, where an ensemble of incremental SVMs with adaptive windows could be trained in parallel.

Conflicts of Interest:
The authors declare no conflict of interest.