1. Introduction
Today, more than ever before, technology plays a central role in daily life, making data a cornerstone of human activity. The amount of data to be exploited has increased tremendously, leading to a new era, that of big data. In addition to their large volume, big data are also characterized by high velocity and substantial variety [1], resulting from their production in continuous flows [2], known as data streams, while they may also arrive in various semi-structured or even unstructured formats.
The mining of these data streams is a challenging process because of their evolving characteristics. Apart from the large volume and possible variability, the distribution of the data may often undergo several changes, known as drifts. Drifts may represent changes that affect specific data variables (typically referred to as data drifts) or even, in a machine learning context, changes that affect the correlation between input and output variables (typically referred to as concept drifts) [3,4]. Either way, the research community has identified the need for machine learning techniques that are capable of adapting to these drifts [5].
Contemporary techniques for data streams, or so-called online machine learning algorithms, typically include mechanisms to identify drifts, which allow the algorithm to adapt to distribution shifts [6,7,8,9,10,11]. Though useful, these methods usually require manual configuration, while their performance is highly dependent on the dataset under analysis. On the other hand, AutoML approaches for data streams are few and rather limited [12,13,14,15,16,17]; most of them either do not adapt to drifts or focus only on concept drifts, while they may also employ a narrow selection of models (e.g., only trees) or introduce considerable latency when integrating distribution changes into the models.
In this work, we propose a methodology that confronts the aforementioned limitations. We design an AutoML Pipeline for Streams (AML4S), which combines different preprocessing techniques and online machine learning models, enabling it to handle a wide variety of data streams. All steps of our pipeline (i.e., preprocessing, model selection, hyperparameter tuning) are fully automated, and the pipeline also integrates a drift detection mechanism that can identify both data drifts and concept drifts, thus allowing it to adapt to any changes in the distribution of the data. Finally, to assess our methodology, we also craft a data stream generator to create synthetic datasets that include both data drifts and concept drifts. Our methodology is then compared with different online learning methods, as well as a contemporary AutoML approach.
The remainder of this paper is organized as follows. Section 2 provides background information and related work on data streams. Section 3 describes our proposed framework for an AutoML pipeline. Section 4 provides the results of our evaluation on different datasets. Section 5 discusses certain threats to validity and limitations, while Section 6 concludes this paper and provides ideas for future work.
2. Background and Related Work
As already mentioned, technological advancements have brought forth a new era in data mining and machine learning research; one where algorithms must be able to adapt to ever-evolving data streams. Click-through data, social media outputs, financial micro-transactions, and even IoT sensor data are only a few of the example applications that generate large amounts of data at high velocity [18]. In this context, a significant challenge faced by traditional (batch) machine learning models is that the distribution of the incoming data may shift at some point; e.g., a new product launch may significantly change the buying behavior of consumers, unexpected news may alter the stock price of a company, etc. These types of changes in distribution are commonly referred to as drifts [3,4]. Drifts can take various forms, as shown in Figure 1.
They can be sudden, where the distribution changes drastically in a small time frame, usually in response to an unexpected event, or gradual, where the distribution changes slowly due to the inertia of the scenario, with the two distributions coexisting temporarily (e.g., a new popular framework may replace an older one in user preference). Incremental drifts are similar; however, they occur in cases where there is no clear transition between the two underlying data distributions, but rather the shift happens progressively and continuously (e.g., electricity/gas use throughout the autumn and winter months as the weather becomes progressively colder). Finally, a drift can even be recurring, meaning that previously observed distributions reappear occasionally, which is a common response to external changes that keep appearing at different points in time (e.g., a weather event may change the traffic conditions whenever it happens).
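To make the first two forms concrete, the snippet below builds two synthetic streams that share the same pair of concepts and differ only in the width of the transition window; this is a minimal sketch using the synthetic generators of the River library (the SEA concepts and parameter values are our own illustrative assumptions, not the generator proposed in this paper).

```python
from river.datasets import synth

# Two concepts drawn from River's SEA generator (illustrative choice).
concept_a = synth.SEA(variant=0, seed=1)
concept_b = synth.SEA(variant=2, seed=1)

# Sudden drift: the stream switches concepts almost instantly at position 5000.
sudden = synth.ConceptDriftStream(stream=concept_a, drift_stream=concept_b,
                                  position=5000, width=1, seed=1)

# Gradual drift: the two concepts coexist over a 2000-instance transition window.
gradual = synth.ConceptDriftStream(stream=concept_a, drift_stream=concept_b,
                                   position=5000, width=2000, seed=1)

for x, y in sudden.take(5):
    print(x, y)
```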
Another important distinction can be made based on the variables that are affected by the distribution change. In the current literature, the two main categories of drifts recognized are concept drifts and data drifts [3,4], as shown in Figure 2 (the literature also provides other names for these drifts; e.g., data drifts are sometimes referred to as virtual drifts or feature drifts, while concept drifts are also known as label drifts [5]). In the context of a classification scenario, concept drifts signify changes in the correlation between the input variables (X) and the target output (y), i.e., a shift in the conditional distribution P(y|X). As a result, the decision boundary initially learned by the model becomes outdated and must be adjusted to reflect the new concept (see top-right panel of Figure 2). Data drifts, on the other hand, signify changes in the input data features (i.e., in the distribution P(X)), which may or may not affect the model's output; thus, the classifier's decision boundary may or may not still be valid (see bottom two panels of Figure 2).
When they occur, concept and data drifts can significantly affect the performance of machine learning models; thus, the challenge of online learning is to build algorithms that are capable of adapting to changes, even in real time. Several approaches have been developed in this area [6,7,8,9], with models that are trained per data instance and may also include drift detectors, such as the Adaptive Windowing (ADWIN) estimator [20]. These approaches are analyzed in the following paragraphs.
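Since ADWIN recurs throughout the methods discussed below, a minimal usage sketch may be helpful; it feeds the detector a univariate signal whose mean shifts halfway through (the signal and the delta value are illustrative assumptions, and the `drift_detected` property reflects recent versions of the River library):

```python
import random

from river import drift

rng = random.Random(7)
adwin = drift.ADWIN(delta=0.002)  # delta: confidence level of the window-cut test

for i in range(2000):
    # The mean of the signal shifts from 0.0 to 3.0 at instance 1000.
    value = rng.gauss(0.0, 1.0) if i < 1000 else rng.gauss(3.0, 1.0)
    adwin.update(value)
    if adwin.drift_detected:
        print(f"Drift detected at instance {i}")
```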
One of the earliest efforts that deals with evolving data streams and adapts to concept drifts is the Hoeffding Adaptive Tree model [6]. The algorithm builds a tree incrementally and uses a sliding window of variable length, which changes over time, maintaining statistics at its nodes. The model has three different versions: HAT-INC, which uses a linear incremental estimator; HAT-EWMA, which uses an Exponentially Weighted Moving Average (EWMA); and HAT-ADWIN, which uses ADWIN for drift detection.
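As a sketch of how such a model is typically evaluated on a stream, the prequential (test-then-train) loop below runs River's Hoeffding Adaptive Tree on a synthetic drifting stream; the dataset and parameter values are illustrative assumptions:

```python
from river import metrics, tree
from river.datasets import synth

stream = synth.ConceptDriftStream(position=5000, width=50, seed=1)
model = tree.HoeffdingAdaptiveTreeClassifier(seed=1)  # HAT with internal ADWIN
acc = metrics.Accuracy()

for x, y in stream.take(10000):
    acc.update(y, model.predict_one(x))  # test on the instance first...
    model.learn_one(x, y)                # ...then use it for training

print(f"Prequential accuracy: {acc.get():.3f}")
```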
Other approaches have adapted ensembles to the online learning scenario. For instance, Oza and Russell [21] designed an ensemble that uses the Poisson distribution to resample the data on which every base model is trained. Building on this approach, Bifet proposed the Leveraging Bagging method [7], an ensemble that further increases resampling by experimenting with larger Poisson lambda values. Moreover, the method uses random output detection codes instead of deterministic codes so that each classifier makes a prediction using a different function (rather than the same one, which is standard practice in ensemble methods). Finally, ADWIN is also used to detect concept drift; when changes are detected in the ensemble error rate, the base classifier with the highest error rate is replaced with a new one (model reset).
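The core resampling mechanism shared by these methods is small enough to sketch directly; below is a minimal online bagging ensemble in the spirit of Oza and Russell, where each base model sees each instance k times with k drawn from a Poisson distribution (the base learner and lambda value are our own illustrative choices; Leveraging Bagging corresponds roughly to raising lambda to 6):

```python
import collections

import numpy as np
from river import tree

class OnlineBaggingSketch:
    """Online bagging: each base model is trained k ~ Poisson(lam) times
    per incoming instance, simulating bootstrap resampling on a stream."""

    def __init__(self, n_models=10, lam=1.0, seed=42):
        self.models = [tree.HoeffdingTreeClassifier() for _ in range(n_models)]
        self.lam = lam  # lam=1.0 for classic online bagging; ~6 for Leveraging Bagging
        self.rng = np.random.default_rng(seed)

    def learn_one(self, x, y):
        for model in self.models:
            for _ in range(self.rng.poisson(self.lam)):
                model.learn_one(x, y)

    def predict_one(self, x):
        votes = collections.Counter(m.predict_one(x) for m in self.models)
        return votes.most_common(1)[0][0]  # majority vote
```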
A similar approach is followed by Gomes et al. [8], who proposed the Adaptive Random Forest method. This method also creates an ensemble model with decision trees as base learners, which are trained on Poisson-resampled instances. The training process is based on the Hoeffding tree algorithm, with the main differences being that there is no early tree pruning and that, whenever a node is created, a random subset of features is selected to decide the split at that point. The Adaptive Random Forest also uses a drift detection mechanism to cope with concept drift; the mechanism detects warnings of drifts, and when a warning occurs, a "background" tree is created and trained along with the ensemble without influencing the ensemble predictions. Then, if a drift is detected, the tree from which the warning signal originated is replaced by its respective background tree.
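The warning/background mechanism can be sketched for a single learner as follows; this is a simplified illustration (the real Adaptive Random Forest applies it per ensemble member, and the two ADWIN delta values here are illustrative assumptions for an early "warning" detector and a stricter "drift" detector):

```python
from river import drift, tree

class WarningBackgroundSwap:
    """Single-learner sketch of ARF-style drift handling: a sensitive
    detector raises warnings that start a background tree; a stricter
    detector confirms the drift and swaps the trees."""

    def __init__(self):
        self.model = tree.HoeffdingTreeClassifier()
        self.background = None
        self.warning = drift.ADWIN(delta=0.01)     # sensitive: early warning
        self.confirmed = drift.ADWIN(delta=0.001)  # strict: confirmed drift

    def learn_one(self, x, y):
        error = int(self.model.predict_one(x) != y)
        self.warning.update(error)
        self.confirmed.update(error)
        if self.warning.drift_detected and self.background is None:
            self.background = tree.HoeffdingTreeClassifier()  # train in parallel
        if self.background is not None:
            self.background.learn_one(x, y)
        if self.confirmed.drift_detected and self.background is not None:
            self.model, self.background = self.background, None  # swap on drift
        self.model.learn_one(x, y)

    def predict_one(self, x):
        return self.model.predict_one(x)  # the background tree never predicts
```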
Another forest-based ensemble has been proposed by Rad et al. [10]. The authors introduced Hybrid Forest, a method using two components: a complete Hoeffding Tree as the main learner and many weak learners (also Hoeffding Trees) to improve prediction when needed. The main learner uses all data features, whereas each weak learner uses a subset of the features. The decision of whether the algorithm uses the main learner or the weak learners is based on the performance on the most recent samples, using a sliding window. When the number of correct answers of the weak learners is above a threshold (usually when drifts occur), their majority decision is used; otherwise, the main learner is used.
The Streaming Random Patches method [9] also creates an ensemble model, where the base models of the ensemble are Hoeffding Trees. The trees are trained on instances resampled via online bagging with the Poisson distribution. The main difference between this method and the Adaptive Random Forest is that random subsets of features are created to train every base model of the ensemble, thus achieving better diversity across models. Apart from that, Streaming Random Patches also integrates a drift detection and adaptation mechanism similar to that of the Adaptive Random Forest.
An interesting combination of online ensemble methods is offered by the Performance-Weighted Probability Averaging Ensemble (PWPAE) [11]. The method uses two classification algorithms, Adaptive Random Forest and Streaming Random Patches, and combines them with two state-of-the-art drift detectors, ADWIN and DDM. A novelty of this method is that the weights of the ensemble models are not predefined; instead, they are modified dynamically according to the performance of each learner. PWPAE also employs k-means to sample the data in order to create a representative subset for the learning process.
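Dynamic performance-based weighting can be sketched as follows; this is a simplified illustration with two River learners whose predicted class probabilities are averaged with weights proportional to each learner's running accuracy (the actual PWPAE additionally pairs the learners with ADWIN and DDM detectors and uses k-means sampling):

```python
from river import ensemble, forest, metrics

class PerformanceWeightedAveraging:
    """Average predicted probabilities, weighting each base learner by its
    running prequential accuracy (simplified PWPAE-style combination)."""

    def __init__(self):
        self.models = [forest.ARFClassifier(seed=1), ensemble.SRPClassifier(seed=1)]
        self.accs = [metrics.Accuracy() for _ in self.models]

    def predict_proba_one(self, x):
        total = sum(acc.get() for acc in self.accs) or 1.0  # avoid division by zero
        combined = {}
        for model, acc in zip(self.models, self.accs):
            weight = acc.get() / total
            for label, proba in model.predict_proba_one(x).items():
                combined[label] = combined.get(label, 0.0) + weight * proba
        return combined

    def learn_one(self, x, y):
        for model, acc in zip(self.models, self.accs):
            acc.update(y, model.predict_one(x))  # track each learner's accuracy
            model.learn_one(x, y)
```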
The aforementioned approaches attempt to solve the problem of the occurrence of drifts in evolving data streams; however, their main drawback is the requirement of manual configuration. Moreover, these algorithms may be effective on certain data streams, but they do not generalize well to others, as they are constrained by the specifics of each classification algorithm. As a result, several researchers have recently attempted to design AutoML approaches for evolving data streams [12,13,15,16,17]. AutoML is a field of machine learning that aspires to automate model selection and hyperparameter optimization, thus providing a configuration that is effective on a variety of data. In the context of online learning, AutoML models are typically coupled with drift detection mechanisms.
One such solution is the adaptation algorithm proposed by Madrid et al. [12]. The authors implemented a modification to the Auto-Sklearn machine learning toolkit [22], which integrates countermeasure mechanisms for concept drifts in data streams. The method receives data in batches. The first batch is used to train the first ensemble model using Auto-Sklearn. This ensemble model is then used to make predictions on the next batch, and subsequently, when the real values arrive, the Fast Hoeffding Drift Detection Method (FHDDM) [23] is used to determine whether there is any concept drift in the data. If there is, the model is updated, either via full model replacement (a new model is created and trained with the data of all past batches) or by changing the weights of the ensemble (using the latest batch or all the stored batches).
Another approach is EvoAutoML [13], an AutoML method that expands on online bagging [21], inspired by genetic algorithms. The method initially creates an ensemble with a random population of algorithm pipelines. After that, the algorithm decides when a mutation should occur in this population, based on a sampling rate. When a mutation is about to happen, the algorithm chooses the best pipeline and changes a random parameter to create a new pipeline, while also determining the worst pipeline, which is removed from the population. After the mutation step, the population is trained on each new instance.
Another AutoML approach that is also based on evolutionary algorithms is AutoClass [14]. Initially, AutoClass creates a population of algorithms with their default parameters (Hoeffding Tree, Hoeffding Adaptive Tree, and kNN with and without ADWIN). After that, a sliding window is used to decide each time which configuration should be removed and which should be added with new parameters. The decision is made using fitness proportionate selection [24], where the fitness of each configuration is defined as its predictive performance.
An alternative approach is followed by ChaCha [15], an AutoML method that is used to determine the best parameters for the prediction model. Upon setting up the initial configuration, the algorithm continuously determines the best configuration as the data arrive. The best configuration (champion) is determined each time by comparing the latest best one with others from memory (challengers) based on the value of the progressive validation loss metric. The challengers are limited to a number provided by the user, while, for efficiency, not all challengers run (and compete with the champion) at the same time. Every time a new sample arrives, it is used by the champion for prediction, and after that (when the real value arrives), it is used to update the champion and some of the challengers (the live models). Subsequently, the algorithm chooses whether it needs to change the champion or to remove any model from the challengers (and replace it with a new one).
ASML [17] is an online AutoML method that further focuses on providing a scalable solution by creating simple pipelines or ensembles (depending on user configuration). A batch of data is needed at the beginning to train the initial pipelines and determine the optimal one. After that, every time a new batch of data arrives, it is used to assess a set of different pipelines. Apart from the best pipeline at the time, the pipelines to be assessed are generated through adaptive random directed nearby search and from random combinations of the search space.
A similar approach is followed by the OAML method [16]. OAML initially requires an amount of data to train a number of different pipelines and evaluate them in order to find the best one to use. Every time a new instance arrives, it is initially used for prediction and subsequently (when the real value arrives) for training the pipelines. The method further involves the Early Drift Detection Method (EDDM) [25], which is used to detect concept drifts. Every time a drift is detected, or a set number of samples arrives without the model having been replaced, the method initiates a retraining phase, where it takes the same steps as in the first training, using the samples in memory.
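The drift-triggered retraining loop common to such methods can be sketched as follows; this is a minimal illustration using River's EDDM and a plain Hoeffding Tree in place of a full pipeline search (the buffer size and the `retrain` helper are our own illustrative assumptions):

```python
from collections import deque

from river import drift, tree

def retrain(buffer):
    """Stand-in for a full AutoML search: refit a fresh model on the buffer."""
    model = tree.HoeffdingTreeClassifier()
    for x, y in buffer:
        model.learn_one(x, y)
    return model

buffer = deque(maxlen=1000)  # bounded memory of recent labeled samples
model = tree.HoeffdingTreeClassifier()
eddm = drift.binary.EDDM()

def process(x, y):
    global model
    y_pred = model.predict_one(x)
    eddm.update(y_pred != y)     # EDDM consumes the binary error signal
    buffer.append((x, y))
    if eddm.drift_detected:
        model = retrain(buffer)  # drift: rebuild from buffered samples
    else:
        model.learn_one(x, y)    # no drift: keep learning online
    return y_pred
```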
Although the aforementioned approaches are effective under specific scenarios, they also have certain limitations. First of all, the approaches that do not employ drift detection methods may not be a good fit for data streams with abrupt changes in the distribution of the data, as they might only adapt to gradual drifts. Other approaches focus only on concept drifts, without adapting the models preemptively in the case of data drifts. Concerning adaptability, online machine learning methods may need manual configuration to optimize their parameters. In contrast, AutoML methods are optimized automatically, although they often lack preprocessing algorithms, and they are sometimes limited by the types of classification models used (e.g., some support only one category of classification models and thus focus mainly on hyperparameter optimization). Moreover, certain approaches have to wait for multiple batches of data to arrive before initialization, as they base their retraining on large time windows. Finally, certain implementations are not available online and/or do not employ state-of-the-practice libraries.
In this work, we design an AutoML algorithm that overcomes all of the above limitations. Table 1 summarizes certain features of the methods analyzed above, while also including the advantages of our method in these aspects. As shown in this comparison, our method supports both concept drift and data drift detection using the ADWIN algorithm. In contrast, all other approaches focus only on concept drifts, while certain AutoML implementations do not even explicitly include a drift detector [13,14,15,17]. Instead, they rely only on the online adaptation capabilities of the underlying models, which may hinder their effectiveness on sudden drifts. Another important distinction is whether the proposed solutions also include data preparation steps, i.e., whether they offer end-to-end pipelines.
As shown in Table 1, there are several approaches that only offer classification models [6,7,8,9,10,11,14,15]; thus, they require significant effort by the user to preprocess the data and possibly perform feature selection. Even AutoML approaches are usually limited only to preprocessing methods [12,13,16], whereas ASML [17] and our proposed approach (AML4S) are the only ones that further support feature selection. As a result, AML4S does not require any knowledge of the data or any machine learning expertise, as it identifies the optimal algorithms and parameters for each pipeline step (preprocessing, feature selection, model selection, hyperparameter optimization) automatically.
Furthermore, concerning the search space of each method, we note that AutoML methods are naturally expected to be more effective than single classifiers, especially when they support a large variety of online classification algorithms with multiple configurations. In this aspect, both the adaptation algorithm of Madrid et al. [12] and AML4S are quite effective, as they support 15 and 12 models, respectively (including different categories of models, as discussed in Section 3.2). However, the adaptation algorithm is more oriented towards efficient AutoML, which is expected, as it employs the Auto-Sklearn library [22], whereas our approach also focuses on the challenges of online learning to a great extent. More specifically, after initialization, our method is fully online, as it is trained per instance and not in batches, while it also performs retraining only upon drift detection (thus not requiring a configurable time parameter). Finally, our pipeline is implemented using River [26], a state-of-the-practice online machine learning library, facilitating the application of our approach to different problems in a straightforward manner.
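As an illustration of this per-instance, drift-triggered mode of operation, the sketch below implements a generic predict-then-train loop over River pipelines; the candidate pipelines, buffer size, and the `select_best_pipeline` helper are hypothetical simplifications for exposition and do not reproduce the actual AML4S search:

```python
from river import drift, metrics, naive_bayes, preprocessing, tree

def candidate_pipelines():
    # Hypothetical search space; AML4S explores far more preprocessors,
    # feature selectors, models, and hyperparameter settings.
    return [
        preprocessing.StandardScaler() | tree.HoeffdingTreeClassifier(),
        preprocessing.MinMaxScaler() | naive_bayes.GaussianNB(),
    ]

def select_best_pipeline(buffer):
    """Prequentially score each candidate on buffered samples; keep the best."""
    best, best_acc = None, -1.0
    for pipeline in candidate_pipelines():
        acc = metrics.Accuracy()
        for x, y in buffer:
            acc.update(y, pipeline.predict_one(x))
            pipeline.learn_one(x, y)
        if acc.get() > best_acc:
            best, best_acc = pipeline, acc.get()
    return best

buffer, detector, model = [], drift.ADWIN(), None

def process(x, y):
    global model
    if model is None:                  # initialization on the first samples
        buffer.append((x, y))
        if len(buffer) >= 500:
            model = select_best_pipeline(buffer)
        return None
    y_pred = model.predict_one(x)
    detector.update(int(y_pred != y))  # the error signal drives drift detection
    buffer.append((x, y))
    if detector.drift_detected:
        model = select_best_pipeline(buffer)  # re-run the search only upon drift
    else:
        model.learn_one(x, y)          # otherwise remain fully online
    return y_pred
```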
5. Discussion and Threats to Validity
We acknowledge certain threats to validity and limitations of our approach and our evaluation. Those threats include the choice of the evaluation dataset and metrics, the choice of tools and algorithms/models used, as well as their limitations with respect to noise (and sensitivity to drifts) and class imbalance, and the limited comparison with other approaches.
Concerning the evaluation datasets, we note that evaluating our approach was challenging, since current datasets and data generators focus mainly on concept drifts. Therefore, we created our own generator to evaluate our method in a more thorough setting. Since our generator includes manually placed drifts, concerns about our method's accuracy on a broader spectrum of problems may arise. To mitigate these concerns, we also evaluated our pipeline on a number of widely used datasets for drift detection (both real and synthetic) [30,31,33,34,35,36], illustrating how it ultimately yields better results, which is a testament to its generalization capabilities and robustness. As for our evaluation metric of choice, (aggregated) accuracy, we opted for the metric most commonly used in earlier work in the field [12,13,16,17]. However, we acknowledge that this may be a limitation of our evaluation, and we plan to devise a more well-rounded evaluation framework in future work.
Regarding our choice of algorithms, we opted for a variety of methods, limited mainly by the options offered by the River framework [26]. For instance, River did not support online implementations of models such as SVMs or neural networks; therefore, we did not include them in our pipeline. However, we included an extensive (but not exhaustive) list of preprocessors (standard/min–max scaler), feature selectors (variance-based, similarity-based), and models (i.e., statistical, probabilistic, forests, instance-based), along with a variety of hyperparameter ranges, to make our method more versatile and ensure enhanced performance across problems of different natures. In future work, we plan to explore the possibility of offering more choices, especially regarding preprocessing and the parameter search space. Moreover, we plan on experimenting with different frameworks for online learning, such as MOA [37] or CapyMOA [38]. These frameworks also support sampling strategies, which could be used to address issues stemming from class imbalance, as well as deep learning models, which can further enhance the performance and applicability of our system. Finally, the potential inclusion of such models also prompts us to investigate the optimal memory size, or even employ automated methods for its tuning, in future work.
Furthermore, the choice of drift detector may have a significant impact on the performance of our pipeline. ADWIN was selected for its robustness and versatility, meaning that it can handle various types of drifts, short-term noise, and evolving data distributions. However, it can also perform poorly in the presence of extreme noise patterns as well as multiple individual outliers. On the one hand, gradual drifts over a large time span may be detected too late or even remain undetected due to the statistical nature of the drift detector. These changes have to be handled by the online model itself, which should be able to effectively adjust to gradual/incremental distribution changes. On the other hand, sharp, random fluctuations may result in inaccurate drift detection, since these types of 'spikes'/outliers in the data may be falsely regarded as drifts. As mentioned in Section 3.3, such phenomena are partially mitigated by the implementation of the buffer memory. However, in future work, we aim to address these issues by exploring alternative mitigation techniques, such as dynamically adjusting our detector's sensitivity based on recent drift patterns, assessing the effectiveness of other noise-resilient drift detectors (e.g., EDDM [25]), or even combining drift detectors.
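One way such dynamic sensitivity adjustment could look is sketched below: when drifts fire implausibly often (suggesting noise), the ADWIN confidence parameter delta is halved, making the window-cut test more conservative; the thresholds and update rule are illustrative assumptions, not part of AML4S:

```python
from river import drift

class AdaptiveSensitivityADWIN:
    """Wrap ADWIN and halve its delta (i.e., demand higher confidence)
    whenever two drifts are reported suspiciously close together."""

    def __init__(self, delta=0.002, min_gap=500):
        self.delta = delta
        self.min_gap = min_gap  # expected minimum spacing between genuine drifts
        self.detector = drift.ADWIN(delta=delta)
        self.since_last_drift = 0

    def update(self, value):
        self.detector.update(value)
        self.since_last_drift += 1
        if not self.detector.drift_detected:
            return False
        if self.since_last_drift < self.min_gap:  # too frequent: likely noise
            self.delta = max(self.delta / 2, 1e-6)
            self.detector = drift.ADWIN(delta=self.delta)  # restart, less sensitive
        self.since_last_drift = 0
        return True
```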
Finally, our comparison with other approaches to automated online machine learning was limited by the fact that such systems are currently few, while their implementations are often not available online. Nevertheless, we chose to at least compare our method with the OAML framework [16], for which we were able to find the results (the source code of the tool itself is not available at the time of writing). Although the OAML framework offers a number of methods for AutoML, we compare our approach to the OAML-basic method, since this is the method that our system most closely resembles. Thus, to ensure a fair comparison, other variants, such as the OAML-ensemble or the OAML-model store, were not considered, as they rely on different prerequisites (e.g., OAML-ensemble builds ensembles of pipelines, while OAML-model store keeps a memory of models). However, as future work, we plan to investigate the possibility of enhancing AML4S using ensembles in order to further improve upon these variants. Finally, to aid researchers working in this direction, all data (including the developed data generator) and scripts required to reproduce our findings are available online at https://github.com/AuthEceSoftEng/automl-data-streams (accessed on 22 August 2025).
6. Conclusions
As data today are produced in continuous flows of large volume and velocity, effective data stream mining has grown to be an important challenge. In this context, several research efforts have been directed towards effectively classifying instances of data streams and, especially, modeling shifts in the distributions of the data, known as drifts. However, contemporary approaches have significant limitations; several of them are not automated and may require manual parameter tuning. Even AutoML solutions typically focus only on concept drifts and do not always offer a complete pipeline that can adapt to different features and data drifts.
In this work, we created an online AutoML method that can be used on evolving data streams and can handle both data and concept drifts. Our method, AML4S, builds multiple pipelines automatically, including preprocessors, feature selectors, and classifiers, and assesses them in order to choose the best-performing one at any given time, thus effectively adapting to distribution changes. It handles gradual changes using online models, while abrupt changes are handled by a drift detection mechanism. Upon evaluating AML4S, we found that it outperforms other online machine learning techniques, while it is also more effective than the state-of-the-art OAML-basic algorithm.
Our method is also not free of limitations, which are mainly centered around the choice of techniques, as well as the assessment of our pipeline. We note as a limitation the choice of algorithms for our pipeline, bounded by the implementations of the River framework [26]. Including more algorithms and more configurations, along with a more extensive search space and advanced methods of hyperparameter tuning, would be an interesting extension to our pipeline. Moreover, our drift detector of choice, ADWIN, though robust and versatile in several cases, may be sensitive to scenarios with significant noise and/or extreme outliers (e.g., detecting multiple drifts that are false positives). Finally, from an assessment perspective, it is important to evaluate the aforementioned techniques in different scenarios. In this context, we have crafted our own data generator to specifically assess the performance of our method on data drift detection scenarios, while we have also included several other datasets in our evaluation, both synthetic and real, to evaluate its effectiveness. To further improve on this aspect and address the aforementioned limitations, it would be interesting to also include scenarios with extreme noise or multiple types of drifts, so that we gain insights for refining our method.
In future work, we plan to further improve our method by taking action in several directions. First of all, it would be interesting to improve our drift detection mechanism by implementing techniques to mitigate occurrences of false positives or delayed drift detection, which are often observed in distributions with very frequent and/or gradual shifts. For instance, the parameters of ADWIN could be automatically adjusted according to the variance (distribution) of the dataset (e.g., the significance threshold discussed in Section 3.3 can be used to control the sensitivity of the algorithm to perceived drifts). Another idea would be to construct sliding windows of multiple feature values, effectively performing multivariate data drift detection, as sketched below.
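A straightforward starting point for such multivariate detection would be to maintain one window per input feature; the sketch below does this with independent ADWIN instances (an illustrative simplification that ignores correlations between features):

```python
from river import drift

class PerFeatureDriftDetector:
    """Maintain one ADWIN detector per input feature and report which
    features' distributions have shifted (a simple proxy for multivariate
    data drift detection)."""

    def __init__(self, delta=0.002):
        self.delta = delta
        self.detectors = {}  # feature name -> ADWIN instance

    def update(self, x):
        drifted = []
        for name, value in x.items():
            detector = self.detectors.setdefault(name, drift.ADWIN(delta=self.delta))
            detector.update(value)
            if detector.drift_detected:
                drifted.append(name)
        return drifted  # names of features whose distribution changed
```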
Moreover, we plan to further improve our AutoML pipeline selector by introducing an optimization strategy for selecting hyperparameters and finding the best pipeline. For that, Bayesian optimization [39] or Tree Parzen Estimators [40] seem to be promising options, since they offer effective and efficient pipeline building. The former focuses on efficiently navigating the algorithm search space (preventing resource allocation to suboptimal configurations), while the latter allows for the simultaneous evaluation of multiple configurations. Furthermore, we aim to expand our search space by including more preprocessing techniques and/or classification algorithms. It would be interesting to further explore the effectiveness of ensemble/deep learning methods [41,42,43], or even methods that could handle recurring drifts by storing previous models in memory. In this aspect, it would also be beneficial to research the necessary optimizations to the buffer memory size and resource handling. Finally, we could further investigate the challenge of class imbalance in online learning [44] and the combination of concept drift detectors with class imbalance handling techniques [45] in streaming data.