Article

Feature-Enhanced Erroneous Outlier Detection in Hydrological Time Series Using Ensemble Methods

1
Faculty of Science and Engineering, Southern Cross University, Gold Coast Campus, Bilinga, QLD 4225, Australia
2
School of Architecture and Civil Engineering, Faculty of Sciences, Engineering and Technology, The University of Adelaide, Adelaide, SA 5005, Australia
*
Author to whom correspondence should be addressed.
Water 2026, 18(4), 446; https://doi.org/10.3390/w18040446
Submission received: 23 December 2025 / Revised: 28 January 2026 / Accepted: 4 February 2026 / Published: 8 February 2026
(This article belongs to the Section Hydraulics and Hydrodynamics)

Abstract

Data quality issues in hydrological time series directly affect hydrological modelling applications, including flood forecasting and water resource management. A critical challenge in hydrological monitoring is distinguishing erroneous outliers caused by sensor malfunctions or data transmission errors from natural extreme events such as floods, which exhibit similar statistical characteristics but require opposite treatments in forecasting models. Current detection practices rely on generic algorithms without systematic validation or adaptation to hydrological temporal dependencies, limiting their effectiveness in operational contexts. This study addresses these gaps through a comprehensive framework for detecting erroneous outliers in daily hydrological time series. We engineered 19 features that capture temporal dependencies and hydrological patterns, and reduced them to six key features covering raw measurements, temporal patterns, and hydrological dynamics. We evaluated 13 detection algorithms across three categories: statistical methods (e.g., Extreme Studentised Deviate and the Hampel filter), machine learning (ML) approaches (e.g., Isolation Forest and Local Outlier Factor), and feature-enhanced variants. Three data-driven ensemble strategies were developed: Accurate (maximising F1-score), Diverse (balancing performance with method diversity), and Fast (prioritising computational efficiency). The framework was validated by injecting controlled outliers into recorded hydrological data from five gauge stations in the Tweed River catchment, Australia. The ensemble methods achieved F1-scores of 0.6–0.9 in detecting the erroneous outliers, and statistical testing identified the top-performing detection algorithms. The framework provides a validated tool for quality control in hydrological analysis, with potential applications in drought monitoring and flood forecasting systems.

1. Introduction

Data quality forms the foundation of reliable flood forecasting systems and other water resource management activities, directly affecting forecasting accuracy and operational effectiveness [1]. Hydrological monitoring networks continuously collect measurements such as water levels, producing hydrological time series. Data-driven forecasting models rely on these historical series to learn system behaviour and make predictions that inform important water management decisions [2]. However, data quality problems often arise from sensor malfunctions, calibration drift, transmission errors, and logging failures [3], resulting in missing data and erroneous measurements that compromise model performance and operational decision-making [4]. Despite extensive research on data quality in hydrological systems, flood forecasting studies often handle outliers with basic statistical rules, without systematically validating those rules or adapting them to the specific characteristics of hydrological systems.
This lack of systematic validation is exacerbated by the difficulty of distinguishing between different types of anomalous observations in hydrological records. This study distinguishes between the following related concepts: (i) Anomalies are observations that deviate from expected behaviour, whether for legitimate or erroneous reasons. (ii) Outliers are observations that are statistically unusual relative to the bulk of the data distribution. (iii) Erroneous outliers refer specifically to measurement failures or data quality issues that should be discounted, as opposed to legitimate extreme events that, while statistically unusual, represent valid hydrological conditions [5]. Legitimate extremes, such as floods and droughts, follow physical laws and must be preserved, as they provide essential information for water resource management, risk assessment, and infrastructure design. This distinction becomes particularly critical in hydrological time series analysis, where extreme events are the primary target for prediction.
Current erroneous outlier detection practices in data-driven hydrological modelling lack systematic methodology and validation. Detection methods range from simple statistical tests [6] to advanced algorithms [7]. However, recent studies in hydrological modelling have applied these methods in a fragmented way. Some use the 3-sigma rule [8] or boxplot methods [6], while others adopt robust loss functions, such as Huber loss [9], or moving averages [10] as a preprocessing step. Numerous studies merely acknowledge the existence of outliers [6,11], even though models are reported to be directly influenced by input data noise during learning [12]. Further, whereas measurement errors degrade the performance of forecasting models [12], the removal of legitimate extremes could eliminate precisely the information that hydrological time series analysis depends on. The challenge is therefore to distinguish sensor errors that need correction from legitimate extremes that must be preserved.
This differentiation is especially difficult given the distinct characteristics of hydrological time series. Hydrological time series exhibit extreme skewness, with rare but critical peak events dominating system behaviour. These peaks are irregular in nature and are driven by complex meteorological patterns rather than the predictable cyclic processes commonly found in industrial or economic time series. The temporal dependencies are also fundamentally different: flood hydrographs show rapid rises and gradual recessions over days, contrasting with the smooth patterns typical of other domains [13]. Seasonal variations create time-varying definitions of normality, where values typical during monsoons would appear anomalous during dry periods. Hydrological persistence also means that erroneous outliers may spread over time, with sensor drift producing sequences of slowly deviating observations that escape detection as point anomalies [14].
Given these challenges, various detection methods have been developed, though statistical threshold methods continue to dominate current practice [15]. The 3-sigma rule remains widely adopted [8], and boxplot-based methods using interquartile range rules are frequently employed [6,16,17]. The Hampel filter, which flags points that deviate from local medians by more than three scaled median absolute deviations (a robust counterpart of three standard deviations), provides an alternative approach [18]. While these methods incorporate temporal context through sliding windows, they apply identical thresholds across all hydrological conditions without accounting for the complex temporal dependencies inherent in catchment responses and rely solely on raw measurements rather than engineered features that could provide additional context. Physical threshold methods offer domain-specific alternatives. Belyakova, Moreido [19] defined erroneous outliers through physically impossible changes, preserving legitimate rapid flood rises that statistical methods might flag. Zhou, Qiao [20] achieved a detection accuracy of 88.3% using hydraulic correlation principles for open-channel systems.
While statistical methods, such as the Hampel filter, demonstrate strong performance, combining them with enhanced machine learning (ML) variants through ensembles provides additional benefits. Isolation Forest (IF) algorithms isolate anomalies through random feature splits [7]. Local Outlier Factor (LOF) algorithms measure local density deviations [21]. LSTM autoencoders detect anomalies by monitoring reconstruction error thresholds, thereby capturing complex temporal patterns [22]. Prediction-based detection represents an emerging paradigm. Zhao, Zhu [23] developed multiple forecasting models to predict expected values, flagging observations that fell outside confidence intervals. Hybrid strategies combine multiple principles: Halicki and Niedzielski [24] validated detections by checking upstream station consistency, while Seasonal-Hybrid Extreme Studentised Deviate (SH-ESD) applies seasonal decomposition before outlier detection. Despite the availability of diverse detection methods, systematic frameworks for combining these approaches through ensemble strategies remain unexplored, as does the potential for feature engineering to enhance detection performance.
Furthermore, despite the variation in detection methods, validation remains a fundamental challenge in unsupervised outlier detection, as it is difficult to compare the effectiveness of outlier detection methods due to the unavailability of ground truth. Without labelled data (erroneous outliers or normal), researchers follow a wide range of evaluation strategies. Synthetic evaluation is dominant [7,25], where researchers inject controlled outliers to evaluate performance. Semi-synthetic approaches preserve temporal structure while enabling evaluation. Hydrological validation, as implemented in [24], ensures that detected outliers maintain consistency between upstream and downstream locations, providing physics-based verification without the need for ground truth labels. Statistical comparison frameworks (e.g., Friedman tests with Nemenyi post hoc analysis) are used to determine significant differences in performance between algorithms [7]. Agreement metrics, such as Jaccard similarity, measure the overlap between the outputs of different detectors, revealing complementary detection capabilities.
To understand the current state of the field, we reviewed studies that have addressed erroneous outlier detection in the context of hydrology. Table 1 illustrates an underlying split in the research on hydrological time series error detection. Dedicated erroneous outlier detection studies show distinct characteristics: five of six implement validation frameworks, ranging from synthetic data injection to hydrological consistency checks. van de Wiel, van Es [7] developed multi-scale temporal windows, while Halicki and Niedzielski [24] incorporated upstream gauge relationships.
Analysis of these studies reveals three important gaps in current practice:
  • Recent flood forecasting studies apply generic outlier detection methods without validation or adaptation to hydrological characteristics;
  • Existing methods rely on raw data or limited features, failing to capture the temporal dependencies and physical processes inherent in hydrological dynamics;
  • No systematic framework exists for selecting and combining detection methods based on empirical performance.
This paper addresses these gaps through a proposed framework for detecting erroneous outliers in hydrological time series at individual monitoring stations using univariate methods for unlabelled data. We make three significant contributions:
  • Feature engineering that enhances detection algorithms with temporal and hydrological context.
  • Systematic evaluation of 13 detection algorithms using semi-synthetic validation across 30 dataset configurations.
  • Data-driven ensemble strategies that automatically select optimal algorithm combinations.
Our approach compares vanilla implementations using only raw data against enhanced variants utilising engineered features, with validation enabled through synthetic outlier injection. While we cannot validate preservation of all legitimate extreme events without ground truth, incorporating hydrological process features represents progress toward reducing false positive detection of legitimate floods.
The remainder of this paper is organised as follows. Section 2 presents our methodology, including feature engineering, synthetic data injection, detection algorithms, and ensemble selection strategies. Section 3 describes the experimental setup, results and discussion. Section 4 concludes with key findings and future research directions.

2. Methodology

This section presents our systematic approach to addressing the three identified gaps in the literature. Figure 1 outlines the methodology for detecting erroneous outliers in hydrological time series. The framework progresses from feature engineering through synthetic data injection, detection algorithms, and ensemble modelling to evaluation. Each phase builds upon the previous one, creating a systematic approach for identifying erroneous outliers. The following subsections detail each component.

2.1. Feature Engineering

To enhance the detection of erroneous outliers, we derived 19 features based on an understanding of hydrological processes and temporal patterns, as outlined in the literature. To the best of our knowledge, this is the most comprehensive feature set applied to erroneous outlier detection in hydrological time series, and several of the features have not previously been used in this field. While traditional outlier detection studies typically relied on raw data, our feature set integrated temporal dependencies, multi-scale patterns, and an understanding of hydrological processes. These features provided detection algorithms with temporal and statistical context, in addition to raw measurements, and therefore allowed them to capture different patterns of erroneous behaviour in the data [27].
The 19 engineered features were designed to reflect different aspects of hydrological behaviour. Temporal difference features (value_diff, value_diff_pct) capture the rate of change, which is critical when looking for a sudden spike in the sensor signal. Statistical features (rolling statistics, deviation measures) define the local baseline to detect contextual erroneous outliers. Hydrological process features (rising_limb, falling_limb, quickflow_approx) incorporate catchment response patterns to help preserve legitimate flood dynamics. Lag features (lag_1 through lag_30) add temporal context to autocorrelated hydrological time series. Through correlation analysis, we reduced these to six key features, minimising redundancy while retaining detection capability.
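As an illustration of how such features can be derived, the following minimal Python sketch computes a handful of them from a daily series. The exact feature definitions are not given above, so the formulas used here (first difference, percentage change relative to the previous day, trailing 7-day minimum, and a rising-limb flag set when the series increases) are assumptions, as is the function name engineer_features.

```python
def engineer_features(values, lags=(1, 14, 30), window=7):
    """Build a few illustrative features for a daily series (list of floats).

    All definitions below are assumptions made for illustration only.
    """
    rows = []
    for t, value in enumerate(values):
        prev = values[t - 1] if t >= 1 else None
        diff = value - prev if prev is not None else None
        # Percentage change is undefined for the first point or a zero baseline.
        diff_pct = 100.0 * diff / prev if prev not in (None, 0) else None
        start = max(0, t - window + 1)
        row = {
            "value": value,
            "value_diff": diff,
            "value_diff_pct": diff_pct,
            f"rolling_min_{window}d": min(values[start:t + 1]),
            "rising_limb": int(diff is not None and diff > 0),
        }
        for k in lags:
            # Lagged value k days earlier; None where history is too short.
            row[f"lag_{k}"] = values[t - k] if t >= k else None
        rows.append(row)
    return rows


features = engineer_features([2.0, 4.0, 3.0], lags=(1,))
```

Each element of features is a dictionary for one day, so the result can be passed to a table library or converted to a matrix for the multi-feature detectors.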

2.2. Semi-Synthetic Data Injection

To facilitate rigorous evaluation on unlabelled hydrological time series, we designed a semi-synthetic data generation framework to introduce controlled erroneous measurements into the original time series whilst maintaining their natural characteristics. This testing approach retains the temporal patterns and physical properties of actual catchment responses whilst providing known ground truth labels for validation. The framework employed a 5% contamination rate grounded in operational monitoring experience and literature [25]. Studies of real-world hydrological networks report error rates ranging from 3 to 10% depending on station maintenance practices, sensor age, and environmental conditions [28,29], positioning 5% as a conservative central estimate. Analysis of Bureau of Meteorology (BoM) quality codes in our study catchment confirmed this selection, revealing that 4.2–7.8% of observations across the five stations were flagged with compromised quality indicators (code 140: “ability to represent parameter not known”; code 210: “not release quality or missing” [30]). This rate balances validation objectives: contamination below 3% produces insufficient outlier samples for robust statistical evaluation, whilst rates exceeding 10% create unrealistically challenging scenarios that may favour overly aggressive detection methods inappropriate for operational deployment.
The contamination was distributed as 50% point, 30% contextual, and 20% collective outliers, reflecting operational failure mode frequencies observed in hydrological monitoring [31,32]. Point outliers (50%), manifesting as isolated erroneous measurements, constitute the most common error type, resulting from transient causes including data transmission bit errors, momentary sensor voltage fluctuations, electromagnetic interference during storms, or brief debris contact with sensors [27]. For each point outlier at time $t$, the injected value was calculated as shown in Equation (1):
$$O_t^{point} = \mu_{local} + \varepsilon \cdot \sigma_{local} \cdot \xi \tag{1}$$
where $\mu_{local}$ and $\sigma_{local}$ are the mean and standard deviation within a 7-day window centred at $t$, $\varepsilon \in \{-1, 1\}$ determines direction, and $\xi \sim U(3, 10)$ provides magnitude as a multiplier of standard deviations. The range of 3–10 standard deviations was selected based on common outlier detection thresholds (3σ rule) as the minimum [33], with the upper bound ensuring that outliers are statistically extreme while remaining possible due to sensor errors. We enforced the physical constraint $O_t^{point} \geq 0$ to maintain non-negative flows.
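Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name point_outlier_value, the use of the population standard deviation, and the truncation of the window at the series boundaries are all assumptions.

```python
import random
import statistics


def point_outlier_value(values, t, rng, half_window=3):
    """Return an Equation (1) style replacement value for index t.

    half_window=3 gives a 7-day window centred at t (truncated at the edges).
    """
    lo = max(0, t - half_window)
    hi = min(len(values), t + half_window + 1)
    window = values[lo:hi]
    mu_local = statistics.fmean(window)
    sigma_local = statistics.pstdev(window)   # assumption: population std
    epsilon = rng.choice((-1, 1))             # direction of the error
    xi = rng.uniform(3.0, 10.0)               # magnitude in standard deviations
    # Physical constraint: injected flows cannot be negative.
    return max(0.0, mu_local + epsilon * sigma_local * xi)


rng = random.Random(42)                       # fixed seed for reproducibility
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
injected = point_outlier_value(series, 3, rng)
```

For this toy series the local mean is 4 and the local standard deviation is 2, so an upward error lands between 10 and 24, while a downward error is clipped to zero by the non-negativity constraint.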
Contextual outliers (30%) typically arise from sensor calibration drift over time, datum changes without corresponding metadata updates, or rating curve extrapolation during moderate floods where stage-discharge relationships deviate from calibration conditions. We identified stable periods where the rolling standard deviation fell below its 30th percentile, $\sigma_{rolling}(t) < P_{30}(\sigma_{rolling})$, then injected values as shown in Equation (2):
$$O_t^{context} = \mu_{stable} \cdot \alpha \tag{2}$$
where $\mu_{stable}$ is the local mean during stable flow, and $\alpha \sim U(2, 4)$ creates moderate deviations. The range of 2–4 times the stable mean was chosen to generate values that are unusual for stable conditions but less extreme than point erroneous outliers, consistent with the contextual anomaly characteristics described by Chandola, Banerjee [27].
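The stable-period rule and Equation (2) can be sketched as follows. The rolling window length (7 days, trailing), the nearest-rank percentile definition, and the function names are assumptions introduced for illustration; the text above does not specify them.

```python
import math
import random
import statistics


def rolling_std(values, window=7):
    """Trailing-window standard deviation; None until the window is full."""
    return [
        statistics.pstdev(values[t - window + 1:t + 1]) if t >= window - 1 else None
        for t in range(len(values))
    ]


def stable_indices(values, window=7, pct=30):
    """Indices whose rolling std lies below the pct-th percentile of rolling stds."""
    sigma = rolling_std(values, window)
    valid = sorted(s for s in sigma if s is not None)
    if not valid:
        return []
    rank = max(1, math.ceil(pct / 100 * len(valid)))   # nearest-rank percentile
    threshold = valid[rank - 1]
    return [t for t, s in enumerate(sigma) if s is not None and s < threshold]


def contextual_outlier_value(mu_stable, rng):
    """Equation (2): scale the stable-period mean by alpha ~ U(2, 4)."""
    alpha = rng.uniform(2.0, 4.0)
    return mu_stable * alpha


series = [5.0] * 10 + [1.0, 9.0] * 5          # flat period followed by oscillation
stable = stable_indices(series)
value = contextual_outlier_value(5.0, random.Random(42))
```

In this toy series only the days whose trailing window is entirely flat qualify as stable, and the injected value falls between two and four times the stable mean of 5.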
Collective outliers (20%) [27], representing sequences of erroneous measurements, occur less frequently but reflect systematic failures such as sensor fouling by algae or sediment accumulation during extended periods, sustained power supply issues, or malfunctioning dataloggers. This distribution prioritises the most operationally relevant failure scenarios whilst ensuring comprehensive evaluation across all error manifestations encountered in practice. For a sequence starting at $t_0$, we modelled it as shown in Equation (3):
$$O_t^{collective} = \begin{cases} \mu_{local} + \dfrac{t - t_0}{L/3} \cdot P_{peak} & \text{if rising limb} \\ \mu_{local} + P_{peak} & \text{if peak} \\ \mu_{local} + P_{peak} \cdot e^{-\lambda (t - t_{peak})} & \text{if recession} \end{cases} \tag{3}$$
where $L$ is the sequence length, $P_{peak} \sim U(3, 6) \cdot \sigma_{local}$ defines peak magnitude, $t_{peak}$ is the time of the peak, and $\lambda = 0.3$ controls the recession rate [34] based on typical hydrograph behaviour.
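The piecewise shape in Equation (3) can be illustrated with a short Python sketch. It assumes that the rising limb occupies the first third of the sequence and that the peak falls on the first time step at or after $L/3$; neither detail is stated above, and the function name collective_profile is hypothetical.

```python
import math


def collective_profile(mu_local, p_peak, length, lam=0.3):
    """Values of a collective outlier of `length` steps, indexed by k = t - t0."""
    rise_len = length / 3                 # duration of the rising limb, L/3
    k_peak = math.ceil(rise_len)          # assumption: first step at or after L/3
    profile = []
    for k in range(length):
        if k < k_peak:                    # rising limb: linear ramp towards the peak
            profile.append(mu_local + (k / rise_len) * p_peak)
        elif k == k_peak:                 # peak
            profile.append(mu_local + p_peak)
        else:                             # recession: exponential decay from the peak
            profile.append(mu_local + p_peak * math.exp(-lam * (k - k_peak)))
    return profile


profile = collective_profile(mu_local=10.0, p_peak=6.0, length=9)
```

With these inputs the sequence ramps from 10 up to a peak of 16 at the fourth step and then decays back towards the local mean, mimicking a small hydrograph.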
The injection process maintained critical hydrological characteristics through three mechanisms to ensure that the synthetic erroneous outliers do not compromise the underlying data structure. First, temporal autocorrelation, the tendency for consecutive measurements to be similar due to catchment memory effects [35], was preserved by ensuring a minimum separation of 3 timesteps between injected outliers. This prevents artificial clustering that would distort the natural persistence in hydrological time series, where recent conditions have a strong influence on today’s flow. Second, seasonal patterns remained intact as outliers were scaled relative to local statistics rather than global values. This ensures that injected errors during wet seasons remain proportionally different from those during dry periods, thus maintaining the natural variability cycles. Third, legitimate extreme events in the original data were protected by only injecting outliers during periods where $|V_t - \mu_{local}| < 2\sigma_{local}$, a criterion that ensures actual floods and droughts [36], which are essential for model training and validation, remain unmodified. With this semi-synthetic data injection established to evaluate detection performance, we present the proposed erroneous outlier detection methods assessed using these synthetic benchmarks.
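The first and third safeguards lend themselves to a simple eligibility check. The sketch below is an assumption-laden illustration: it takes precomputed local means and standard deviations as inputs, treats the minimum separation as a gap of at least three time steps from any previously injected index, and uses the hypothetical name eligible_positions.

```python
def eligible_positions(values, mu_local, sigma_local, taken, min_gap=3):
    """Indices where injection is allowed under the 2-sigma and spacing rules."""
    allowed = []
    for t, v in enumerate(values):
        # Protect legitimate extremes: only inject where the series is near its local mean.
        within_band = abs(v - mu_local[t]) < 2 * sigma_local[t]
        # Preserve autocorrelation: keep a minimum gap from earlier injections.
        far_enough = all(abs(t - s) >= min_gap for s in taken)
        if within_band and far_enough:
            allowed.append(t)
    return allowed


values = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0]
positions = eligible_positions(values, [1.0] * 7, [1.0] * 7, taken=[5])
```

Here index 2 is excluded because it already looks like an extreme event, and indices 3 to 6 are excluded because they sit too close to the outlier previously injected at index 5.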

2.3. Proposed Erroneous Outlier Detection Framework

We developed a framework for outlier detection that combines thirteen individual algorithms and three data-driven ensemble approaches to systematically identify erroneous outliers in hydrological time series. We chose these algorithms because they have demonstrated good performance in hydrological studies [7,18,20] and general time series applications [37,38], and because they represent diverse detection principles, from statistical hypothesis testing to ML methods. This diversity enables the detection of the varied manifestations of erroneous outliers, while the ensemble strategies leverage the complementary strengths of individual methods.

2.3.1. Individual Detection Algorithms

Statistical Methods
(i) The Extreme Studentised Deviate (ESD) [39] method provides rigorous statistical hypothesis testing for outlier detection through the iterative application of Grubbs’ test [27]. For each iteration $i$, we calculated the test statistic for the most extreme value and compared it against the critical value as shown in Equation (4):
$$\lambda_{crit} = \frac{(n - i - 1) \times t_{dist}}{\sqrt{(n - i - 2 + t_{dist}^2) \times (n - i)}} \tag{4}$$
where $n$ is the sample size, and $t_{dist}$ represents the t-distribution critical value. We set the maximum outliers to 10% of the data length and significance level $\alpha = 0.05$, following standard statistical practice.
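Equation (4) is straightforward to evaluate once the t-distribution quantile is known. The sketch below assumes a zero-based iteration index and takes the quantile as an argument supplied by the caller (for example from statistical tables or a statistics library such as SciPy's scipy.stats.t.ppf); the function name is hypothetical.

```python
import math


def esd_critical_value(n, i, t_crit):
    """Equation (4) for sample size n, zero-based iteration i, and t quantile t_crit."""
    numerator = (n - i - 1) * t_crit
    denominator = math.sqrt((n - i - 2 + t_crit ** 2) * (n - i))
    return numerator / denominator


# Illustrative numbers only: n = 10 samples, first iteration, t quantile of 2.0.
lam = esd_critical_value(n=10, i=0, t_crit=2.0)
```

In the usual generalised ESD formulation the quantile is taken at probability $1 - \alpha / (2(n - i))$ with $n - i - 2$ degrees of freedom for a zero-based $i$; whether the study follows exactly this convention is not stated above.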
(ii) The SH-ESD [40] method extends the basic ESD by first decomposing the time series with Seasonal-Trend decomposition using Loess (STL), with period = 365 days to capture annual hydrological cycles. We then applied ESD to the residuals, distinguishing seasonal extremes from erroneous outliers. This is critical for hydrological time series, where high flows during wet seasons represent normal behaviour.
(iii) The Hampel filter [41,42] employs median-based statistics within sliding windows to identify local outliers. For each point, we calculated the local median and Median Absolute Deviation (MAD) within a window size $w = 14$ as shown in Equation (5):
$$\text{Outlier if: } |x_t - median_{local}| > k \times 1.4826 \times MAD_{local} \tag{5}$$
where $k = 3$ provides the equivalent of 3-sigma detection under normality assumptions. We selected a window size of 14 to strike a balance between local sensitivity and stability, capturing weekly patterns in daily data while maintaining computational efficiency.
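The rule in Equation (5) can be written compactly with the Python standard library. This is a minimal sketch: the window is assumed to be centred on each point and truncated at the series boundaries, which is one of several reasonable conventions, and the function name hampel_flags is hypothetical.

```python
import statistics


def hampel_flags(values, window=14, k=3.0):
    """Flag points deviating from the local median by more than k scaled MADs."""
    half = window // 2
    flags = []
    for t, x in enumerate(values):
        local = values[max(0, t - half):t + half + 1]
        med = statistics.median(local)
        mad = statistics.median(abs(v - med) for v in local)
        # 1.4826 scales the MAD to match the standard deviation under normality.
        flags.append(abs(x - med) > k * 1.4826 * mad)
    return flags


series = [10.0, 10.2, 9.9, 10.1, 50.0, 10.0, 9.8, 10.2, 10.1, 9.9]
flags = hampel_flags(series, window=6)
```

Only the spike of 50.0 is flagged in this toy series. Note that when the local MAD is zero (a perfectly flat window) any nonzero deviation is flagged, so practical implementations often add a small floor to the MAD.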
Wavelet-Based Detection
(iv) The Discrete Wavelet Transform (DWT) [43] method decomposes signals into frequency components to identify high-frequency anomalies characteristic of sensor errors [44]. We employed the Daubechies-4 wavelet at decomposition level 4 [45], chosen for its effectiveness with hydrological signals exhibiting smooth trends and sharp transitions (flood peaks). Outlier scores were derived from reconstructed detail coefficients, with MAD-based thresholding providing detection as shown in Equation (6):
$$Threshold = median(d_j) + 3 \times 1.4826 \times MAD(d_j) \tag{6}$$
where $d_j$ represents detail coefficients at level $j$.
Machine Learning Methods
(v) The Isolation Forest (IF) algorithm [46] isolates anomalies through recursive partitioning, based on the principle that outliers take fewer splits to be isolated [47]. We configured IF with 100 estimators [46] and contamination = “auto” so that the algorithm automatically determines the outlier proportion from the data structure. This unsupervised technique proves effective for identifying contextual outliers that violate local patterns rather than global thresholds.
(vi) The Local Outlier Factor (LOF) [48] quantifies local density deviations by comparing each point’s neighbourhood density to its neighbours’ densities [27,47]. We set $k = 20$ neighbours [48] to capture sufficient local context while avoiding over-smoothing in regions of varying data density, which is important for hydrological time series where flow variability changes seasonally. Points with substantially lower density than their neighbours ($LOF \gg 1$) are the outliers [27].
(vii) The LSTM Autoencoder learns normal temporal patterns through reconstruction, identifying outliers via reconstruction error [49]. Our architecture (32–16–16–32 neurons) balanced model capacity with training stability, using 30 epochs with early stopping. We set the erroneous outlier threshold at $median + 3 \times 1.4826 \times MAD$ of reconstruction errors, maintaining consistency with our robust statistical methods. The sequence length of 14 time steps (two weeks of daily data) provides the temporal context needed to distinguish legitimate short-term variations from erroneous outliers.
(viii) The Outlier Robust Extreme Learning Machine (ORELM) implements single hidden layer networks with robust loss functions to minimise the influence of erroneous outliers during hydrological time series analysis [50]. We employed 100 hidden nodes with Huber loss ($\delta = 1.4826 \times MAD$) and iterative reweighting over 10 iterations. This configuration provided rapid training while maintaining consistency, which is critical for operational systems requiring frequent model updates.
Enhanced Multi-Feature Variants
We developed enhanced variants of five algorithms ((ix) e_IF, (x) e_LOF, (xi) e_DWT, (xii) e_LSTM, and (xiii) e_ORELM) utilising our six-feature representation (value, rolling_min_7d, rising_limb, lag_30, lag_14, and value_diff_pct) to capture temporal dependencies and hydrological processes. These variants applied the same detection principles but operated in the expanded feature space, enabling the identification of complex erroneous outlier patterns that are invisible in univariate analysis.
Throughout this study, we adopted a consistent naming convention to distinguish between algorithm variants. Methods prefixed with “v_” (v_ESD, v_IF, v_LOF, v_DWT, v_LSTM, v_ORELM) represent vanilla implementations operating solely on raw time series values. Methods prefixed with “e_” (e_IF, e_LOF, e_DWT, e_LSTM, e_ORELM) represent enhanced variants utilising our six-feature representation. Seasonal-Hybrid ESD (SH-ESD) represents an existing enhanced variant that achieves improvement through STL decomposition rather than feature engineering, while the Hampel filter stands alone as a robust statistical method requiring no enhancement. This nomenclature system facilitates clear comparison between basic and enhanced methods in our results.
Algorithm parameters were selected based on established literature recommendations and alignment with hydrological time scales. Statistical methods followed standard practices: ESD with 5% contamination and $\alpha = 0.05$ [39], Hampel filter with 11-day window capturing weekly patterns and 3-sigma threshold [41,42], and DWT using Daubechies-4 wavelet suitable for hydrological signals [43,44,45]. Machine learning methods adopted parameters from foundational studies: IF with 100 estimators and 0.05 contamination [46,47], LOF with 20 neighbours [21,48], LSTM with 32–16–16–32 architecture balancing model capacity with training stability [22], and ORELM with 100 hidden nodes and Huber robust loss [50]. Enhanced variants maintained identical algorithmic parameters to vanilla counterparts but operated on the six-feature representation (Section 3.2). Contamination parameters (0.05) across density-based methods aligned with our semi-synthetic validation framework (5% injected outliers). Temporal parameters (11–14 day windows, 14–30 day lags) were selected to capture hydrological response time scales.

2.3.2. Ensemble Modelling

We developed three new strategies for ensemble construction that select algorithm combinations based on empirical performance measures rather than relying on predetermined combinations. This adaptive ensemble selection follows a principle similar to dynamic ensemble selection methods [51], but is tailored to erroneous outlier detection in a hydrological context so that it adapts to specific dataset characteristics and operational requirements.
Accurate Ensemble: This ensemble maximised detection performance by selecting the top three methods ranked by F1-score, regardless of computational cost or inter-method correlation.
Diverse Ensemble: This ensemble implemented a greedy forward selection strategy [52] that balances individual method performance with prediction diversity. Starting with the best-performing method, the algorithm iteratively adds the method that minimises correlation with already-selected methods while maintaining acceptable individual performance. The diversity score in Equation (7),
$$DS(m) = \left(1 - \bar{\rho}(m, S)\right) \times F1(m) \tag{7}$$
ensured complementary detection capabilities, where $\bar{\rho}(m, S)$ represents the average Pearson correlation between the candidate method $m$ and the selected set $S$. This approach leverages complementary detection patterns that highly correlated methods might overlook.
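The greedy selection driven by Equation (7) can be sketched as below. The method names, F1-scores, and correlations are invented purely for illustration, and the optional min_f1 floor is an assumption standing in for the "acceptable individual performance" condition, whose exact value is not given here.

```python
def select_diverse(f1, corr, size=3, min_f1=0.0):
    """Greedy forward selection maximising DS(m) = (1 - mean corr with S) * F1(m)."""
    selected = [max(f1, key=f1.get)]              # start from the best F1-score
    while len(selected) < size:
        candidates = [m for m in f1 if m not in selected and f1[m] >= min_f1]
        if not candidates:
            break

        def diversity_score(m):
            mean_rho = sum(corr[m][s] for s in selected) / len(selected)
            return (1 - mean_rho) * f1[m]

        selected.append(max(candidates, key=diversity_score))
    return selected


# Hypothetical scores for four methods A-D (not results from the study).
f1_scores = {"A": 0.90, "B": 0.88, "C": 0.70, "D": 0.60}
pairs = {("A", "B"): 0.95, ("A", "C"): 0.20, ("A", "D"): 0.10,
         ("B", "C"): 0.30, ("B", "D"): 0.20, ("C", "D"): 0.50}
corr = {m: {} for m in f1_scores}
for (x, y), rho in pairs.items():
    corr[x][y] = rho
    corr[y][x] = rho

chosen = select_diverse(f1_scores, corr)
```

In this toy case method B is never picked despite its high F1-score, because it is almost perfectly correlated with A and therefore adds little new information to the ensemble.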
Fast Ensemble: This ensemble prioritised computational efficiency for real-time monitoring applications. We employed a greedy approach [52] that iteratively selects the fastest methods from the candidate pool while maintaining a minimum performance threshold (F1-score > 0.5), ensuring rapid processing while retaining acceptable detection quality. This makes it suitable for operational systems that require real-time outlier identification.
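The F1-ranked and speed-ranked selection rules described above reduce to simple sorting once per-method scores and runtimes are available. The sketch below uses invented method names, scores, and runtimes, and the function names are hypothetical.

```python
def select_accurate(f1, size=3):
    """Top `size` methods ranked by F1-score."""
    return sorted(f1, key=f1.get, reverse=True)[:size]


def select_fast(f1, runtime_s, size=3, min_f1=0.5):
    """The `size` fastest methods whose F1-score exceeds `min_f1`."""
    qualified = [m for m in f1 if f1[m] > min_f1]
    return sorted(qualified, key=runtime_s.get)[:size]


# Hypothetical scores and runtimes in seconds (not results from the study).
f1_scores = {"A": 0.90, "B": 0.40, "C": 0.70, "D": 0.60, "E": 0.55}
runtimes = {"A": 5.0, "B": 0.1, "C": 0.5, "D": 0.2, "E": 9.0}

accurate = select_accurate(f1_scores)
fast = select_fast(f1_scores, runtimes)
```

Method B is the fastest in this example but is excluded from the fast ensemble because its F1-score does not clear the 0.5 threshold.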
These strategies eliminated the need for subjective method selection, ensuring reproducible and optimised ensemble construction tailored to specific operational contexts. The diversity-based selection particularly addressed the limitation of traditional voting ensembles, where highly correlated methods can dominate decisions, potentially missing outlier types that require alternative detection principles.

2.3.3. Ensemble Voting Mechanism

Final erroneous outlier detection for each ensemble was determined through weighted voting among its three selected methods, as shown in Equation (8):
$$P_{ensemble} = \left[ \frac{\sum_{i=1}^{3} w_i \times p_i}{\sum_{i=1}^{3} w_i} > \tau \right] \tag{8}$$
where $w_i$ represents the weight for method $i$ within that ensemble, $p_i$ is its binary prediction (1 for erroneous outlier, 0 for normal), and $\tau$ is the voting threshold. We set equal weights ($w_i = 1$) for all selected methods and $\tau = 0.5$, requiring majority agreement for outlier classification. This conservative approach minimises false positives, which is critical for operational flood forecasting systems where false alarms erode user trust [53].
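A minimal sketch of the voting rule in Equation (8) follows. The function name ensemble_vote and the input layout (one list of 0/1 flags per method) are assumptions made for illustration.

```python
def ensemble_vote(predictions, weights=None, tau=0.5):
    """Weighted vote over per-method 0/1 flags; returns one 0/1 flag per time step."""
    if weights is None:
        weights = [1.0] * len(predictions)    # equal weights, as in the text
    total = sum(weights)
    return [
        int(sum(w * p for w, p in zip(weights, column)) / total > tau)
        for column in zip(*predictions)       # one column per time step
    ]


# Three methods, four time steps.
method_flags = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
combined = ensemble_vote(method_flags)
```

With three equally weighted methods and a threshold of 0.5, a time step is flagged only when at least two methods agree, so the single vote at the last step is not enough.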

2.4. Evaluation Strategy

To test the effectiveness of the detection framework and ensemble strategies, we developed an evaluation framework that combines performance metrics, analysis of agreement, and statistical significance testing. This multi-faceted approach addresses a fundamental challenge in assessing unsupervised approaches in the absence of labelled erroneous outlier data. The evaluation was designed to be applied across multiple station–data-type combinations, as detailed in Section 3.

2.4.1. Performance Metrics

We calculated standard classification metrics using the semi-synthetic ground truth labels. For each method, we computed precision ($P$), recall ($R$), and F1-score ($F_1$), as shown in Equations (9)–(11), respectively:
$$P = \frac{TP}{TP + FP} \qquad (9)$$
$$R = \frac{TP}{TP + FN} \qquad (10)$$
$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (11)$$
where TP represents true positives (correctly identified erroneous outliers), FP represents false positives (normal points flagged as erroneous outliers), and FN represents false negatives (missed erroneous outliers). We used the F1-score as our primary metric, as it balances precision and recall, which matters in hydrological analysis, where both false alarms and missed events carry significant consequences. We also calculated the Receiver Operating Characteristic Area Under the Curve (ROC-AUC), which measures the probability that a method ranks a randomly chosen erroneous outlier higher than a randomly chosen normal point.
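The following sketch computes Equations (9)–(11) from binary labels (1 for erroneous outlier, 0 for normal). The function name and the toy label vectors are illustrative assumptions. Returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
from typing import Sequence, Tuple


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels where 1 marks an outlier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 3 injected outliers, 2 found, 1 missed, 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = classification_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```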

2.4.2. Agreement Analysis

We employed two complementary measures to quantify inter-algorithm relationships. We calculated algorithm agreement using Pearson correlation coefficients [54] to measure overall detection agreement, as shown in Equation (12):
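$$\rho_{ij} = \frac{\sum_{t=1}^{n} \left(p_{it} - \bar{p}_i\right)\left(p_{jt} - \bar{p}_j\right)}{\sqrt{\sum_{t=1}^{n} \left(p_{it} - \bar{p}_i\right)^2}\,\sqrt{\sum_{t=1}^{n} \left(p_{jt} - \bar{p}_j\right)^2}} \qquad (12)$$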
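where $p_{it}$ and $p_{jt}$ are binary predictions (1 for erroneous outlier and −1 for normal) from methods $i$ and $j$ at time $t$, and $\bar{p}_i$ and $\bar{p}_j$ are their respective means. Because Pearson correlation is invariant to linear rescaling, this ±1 encoding yields the same coefficient as the 0/1 encoding used in Equation (8). High correlation ($\rho > 0.7$) indicates that methods agree on both normal and erroneous outlier classifications.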
The Jaccard similarity index focused exclusively on outlier overlap, as shown in Equation (13):
$$J_{ij} = \frac{\left|O_i \cap O_j\right|}{\left|O_i \cup O_j\right|} \qquad (13)$$
where $O_i$ and $O_j$ represent the sets of erroneous outlier indices from methods $i$ and $j$. While Pearson correlation captured agreement on both normal and erroneous outlier classifications, Jaccard similarity specifically measured the overlap in detected erroneous outliers. This distinction proves crucial for ensemble selection, as methods may agree strongly on normal behaviour while identifying different erroneous outlier subsets, which is a desirable property for ensemble diversity. These metrics were selected as they are established measures for comparing binary classifiers in ensemble learning contexts.
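The difference between the two agreement measures can be seen in a short sketch of Equations (12) and (13). The function names and the toy prediction vectors are hypothetical. Returning 0.0 for two empty outlier sets is a convention assumed here.

```python
import math
from typing import Sequence


def pearson_agreement(p_i: Sequence[int], p_j: Sequence[int]) -> float:
    """Pearson correlation between two prediction vectors (Equation (12))."""
    n = len(p_i)
    mean_i = sum(p_i) / n
    mean_j = sum(p_j) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(p_i, p_j))
    ss_i = sum((a - mean_i) ** 2 for a in p_i)
    ss_j = sum((b - mean_j) ** 2 for b in p_j)
    if ss_i == 0 or ss_j == 0:
        return float("nan")  # undefined when a method gives constant output
    return cov / math.sqrt(ss_i * ss_j)


def jaccard_similarity(
    p_i: Sequence[int], p_j: Sequence[int], outlier_label: int = 1
) -> float:
    """Overlap of the flagged index sets (Equation (13))."""
    o_i = {t for t, v in enumerate(p_i) if v == outlier_label}
    o_j = {t for t, v in enumerate(p_j) if v == outlier_label}
    union = o_i | o_j
    return len(o_i & o_j) / len(union) if union else 0.0


# Two hypothetical methods: each flags two of ten points, sharing one of them.
a = [1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
b = [1, -1, 1, -1, -1, -1, -1, -1, -1, -1]
rho = pearson_agreement(a, b)
jac = jaccard_similarity(a, b)
print(round(rho, 3), round(jac, 3))
```

Here the correlation is lifted by the seven points that both methods label as normal, whereas the Jaccard index counts only the flagged points, so the two measures can diverge for the same pair of methods.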

2.4.3. Statistical Significance Testing

We applied the Friedman test to verify that performance differences across methods are statistically significant, rather than due to random variation. The test statistic follows a chi-squared distribution as shown in Equation (14):
$$\chi_F^2 = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1) \qquad (14)$$
where $n = 30$ experiments, $k = 16$ methods, and $R_j$ represents the sum of ranks for method $j$ across all experiments. We ranked methods within each experiment (1 = best, 16 = worst) based on F1-scores, then calculated average ranks across experiments.
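A minimal sketch of Equation (14) is given below, assuming ranks have already been assigned within each experiment and contain no ties (the formula as written carries no tie correction). The function name and the small rank table are illustrative, not the study's data.

```python
from typing import List, Sequence


def friedman_statistic(rank_table: Sequence[Sequence[float]]) -> float:
    """Friedman chi-squared from an n x k table of within-experiment ranks."""
    n = len(rank_table)        # number of experiments
    k = len(rank_table[0])     # number of methods
    # R_j: sum of ranks for method j across all experiments.
    rank_sums: List[float] = [sum(row[j] for row in rank_table) for j in range(k)]
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Hypothetical ranks: 3 experiments (rows) x 3 methods (columns), 1 = best.
ranks = [
    [1, 2, 3],
    [1, 2, 3],
    [1, 3, 2],
]
chi2 = friedman_statistic(ranks)
print(round(chi2, 4))
```

When every experiment produces the same ordering, the statistic reaches its maximum of $n(k-1)$. Mixed orderings, as in the third row here, reduce it.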
Following significant Friedman results ($p < 0.001$), we conducted post hoc analysis using the Nemenyi test to identify statistically equivalent method groups. The critical difference threshold determines when rank differences become significant, as shown in Equation (15):
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}} \qquad (15)$$
where $q_\alpha$ represents the critical value from the Studentised range distribution. We set $\alpha = 0.05$, yielding $q_{0.05} = 3.391$ for $k = 16$ methods. With $n = 30$ experiments, this produces $CD = 4.211$. Methods whose average ranks differed by less than this threshold are considered statistically equivalent, forming performance groups rather than a strict ranking.
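Equation (15) and the resulting equivalence check can be sketched as below. The helper names are assumptions. The critical value is passed in as an argument because evaluating the formula with the quoted $q_{0.05} = 3.391$ gives roughly 4.17, whereas the reported $CD = 4.211$ corresponds to $q_{0.05} \approx 3.426$. The value is therefore best confirmed against a Studentised range table for $k = 16$. The average ranks used in the example are those reported in Section 3.3.4.

```python
import math


def critical_difference(q_alpha: float, k: int, n: int) -> float:
    """Nemenyi critical difference for k methods over n experiments."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


def statistically_equivalent(rank_a: float, rank_b: float, cd: float) -> bool:
    """Two methods are treated as equivalent if their average ranks differ by less than cd."""
    return abs(rank_a - rank_b) < cd


cd_from_formula = critical_difference(3.391, k=16, n=30)
reported_cd = 4.211

# Average ranks reported in Section 3.3.4.
accurate, sh_esd, fast = 1.65, 4.70, 6.43
same_group = statistically_equivalent(accurate, sh_esd, reported_cd)
fast_in_group = statistically_equivalent(accurate, fast, reported_cd)
print(round(cd_from_formula, 3), same_group, fast_in_group)
```

Either threshold leads to the same grouping for these ranks: SH-ESD stays within the critical difference of the Accurate Ensemble, and the Fast Ensemble does not.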

3. Experimental Setup, Results, and Discussion

This section presents the experimental setup and the results of applying our detection framework.

3.1. Study Site and Dataset Description

This study utilised daily data from the BoM, Australia, monitoring stations in the Tweed River catchment, northeastern New South Wales (NSW). The Tweed catchment, spanning 1326 square kilometres, is the northernmost coastal catchment in NSW, bounded by the McPherson Range on the NSW–Queensland border, the Burringbar and Condong Ranges to the southeast, and the Tweed Range to the west. Its highest point is Mount Warning at 1156 metres. The catchment experiences a subtropical climate with hot, humid summers and cool to mild winters, receiving a mean annual rainfall of 1510 mm concentrated between December and March. Average summer maximum temperatures range from 27 to 28 °C at the coast to 28–29.5 °C inland at Murwillumbah [55,56].
The dataset included both water level (water course level, WCL) and discharge (water course discharge, WCD) measurements, expressed in both metres and cubic metres per second. The discharge values were determined using level rating curves rather than directly from sensors. To capture varying hydrological behaviours, three temporal aggregations were used: daily maximum (DMax), daily mean (DMean), and daily minimum (DMin), all of which are available from the BoM. This led to 30 distinct dataset configurations (5 stations × 2 measurement types × 3 aggregations). Each configuration contained between 5890 and 24,933 daily observations after quality filtering, with null values ranging from 0 to 11,547 depending on station continuity and sensor reliability. Figure 2 visualises the temporal coverage and data completeness for the DMean aggregation across all stations, representing 30 dataset configurations, with all station data ending 10 July 2025. Figure 3 illustrates the spatial distribution of the five monitoring stations across the Tweed catchment for DMean for both water level and discharge.
All analyses were conducted on a Windows 11 (Build 26100) system with an Intel Core i7-1065G7 processor (1.30 GHz, four cores, eight logical processors) and 16 GB of RAM. The computational environment comprised Python 3.11.8 within VS Code, utilising scikit-learn for ML algorithms, pandas for data manipulation, numpy for numerical computations, and matplotlib for visualisation. The semi-synthetic outlier injection and detection algorithms were implemented using custom Python scripts, with parallel processing employed where applicable to manage the computational demands across all 30 dataset configurations. The following section presents the results of feature selection analysis across these configurations.

3.2. Feature Selection Results

We reduced the 19 engineered features to a final set of six features through a systematic correlation-based selection process designed to minimise multicollinearity while ensuring generalizability across diverse hydrological conditions. For each of the 30 dataset configurations, we computed pairwise Pearson correlation coefficients [54] between all features and identified highly correlated pairs (|r| > 0.70). For each correlated pair, we calculated the average absolute correlation of each feature with all other features and removed the feature exhibiting higher average correlation (greater global redundancy), retaining the feature with lower average correlation (more unique information). For example, among the lag features (lag_1, lag_7, lag_14, and lag_30) with high inter-correlations (r = 0.65–0.94), lag_1 and lag_7 were removed due to higher average correlations with other features, while lag_14 and lag_30 were retained as they provided complementary temporal scales with lower redundancy. Following this correlation filtering across all configurations, we retained features appearing in ≥60% of configurations to ensure robust performance across different stations, data types, and temporal aggregations. Detailed methodology, including complete correlation matrices and selection pseudocode, is provided in Supplementary Materials S2.
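The two-stage selection described above can be sketched as follows. This is an illustrative reading of the procedure rather than the pseudocode in Supplementary Materials S2. The function names, the order in which pairs are visited, the handling of already-removed features, and the toy correlation values are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def filter_correlated(
    names: Sequence[str],
    corr: Sequence[Sequence[float]],
    threshold: float = 0.70,
) -> List[str]:
    """Within one configuration, drop the more redundant feature of each highly correlated pair."""
    k = len(names)
    # Average absolute correlation of each feature with all other features.
    mean_abs = [
        sum(abs(corr[i][j]) for j in range(k) if j != i) / (k - 1) for i in range(k)
    ]
    removed = set()
    for i in range(k):
        for j in range(i + 1, k):
            if i in removed or j in removed:
                continue  # assumption: pairs involving a removed feature are skipped
            if abs(corr[i][j]) > threshold:
                removed.add(i if mean_abs[i] >= mean_abs[j] else j)
    return [names[i] for i in range(k) if i not in removed]


def stable_features(
    selections: Sequence[Sequence[str]], min_share: float = 0.60
) -> List[str]:
    """Keep features retained in at least min_share of the configurations."""
    counts = Counter(f for sel in selections for f in sel)
    n = len(selections)
    return sorted(f for f, c in counts.items() if c / n >= min_share)


# Toy correlation matrix for one configuration (values are hypothetical).
names = ["value", "lag_1", "lag_14"]
corr = [
    [1.0, 0.9, 0.5],
    [0.9, 1.0, 0.8],
    [0.5, 0.8, 1.0],
]
kept = filter_correlated(names, corr)

# Hypothetical per-configuration selections for the frequency filter.
per_config = [["value", "lag_14"], ["value"], ["value", "lag_14"],
              ["value", "lag_1"], ["value", "lag_14"]]
final = stable_features(per_config)
print(kept, final)
```

In the toy matrix, lag_1 is removed because its average correlation with the other features (0.85) is higher than that of value (0.70). The frequency filter then keeps only features that survive in at least 60% of the configurations.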
Table 2 presents the selection frequency results, revealing six features with consistently high importance: (i) value (100%), (ii) rolling_min_7d (100%), (iii) rising_limb (97%), (iv) lag_30 (86%), (v) lag_14 (83%), and (vi) value_diff_pct (63%). The universal selection of value and rolling_min_7d establishes them as fundamental indicators, with rolling_min_7d providing context for identifying anomalous deviations from recent baseflow conditions. The high frequency of rising_limb (97%) demonstrates the importance of hydrological state transitions for contextual outlier detection, as erroneous values often violate natural recession or rising patterns. Temporal lag features (lag_14 and lag_30, selected in 83–86% of configurations) encapsulate catchment memory effects, enabling detection of values inconsistent with antecedent conditions. The moderate selection frequency of value_diff_pct (63%) reflects its utility in detecting abrupt sensor malfunctions that manifest as unrealistic rate-of-change values, though its importance varies with flow regime stability.
This feature set extends previous multi-scale approaches [7] by explicitly incorporating temporal lags rather than aggregated window statistics, providing more precise temporal context for outlier evaluation. Complete correlation matrices, feature importance scores, and detailed selection methodology are provided in Supplementary Materials S2.

3.3. Erroneous Outlier Detection Performance

3.3.1. Individual Method Performance

Following feature selection, the 16 detection methods were tested on all configurations of the validated datasets using the semi-synthetic validation framework. Figure 4 shows the ROC curves for two representative configurations: Station 201012 water level DMean (Figure 4a) and Station 201005 discharge DMax (Figure 4b), which illustrate the performance variability under varying hydrological conditions.
Configuration 201012 water level DMean showed good performance for the majority of methods. All three ensemble methods achieved AUC values above 0.96, namely the Diverse Ensemble (0.972), Fast Ensemble (0.968), and Accurate Ensemble (0.962), demonstrating the effectiveness of combining multiple detection approaches. Among individual methods, the best results were 0.984 (e_ORELM) and 0.976 (e_IF), although their performance varied between configurations. The Hampel filter (0.890) and SH-ESD (0.973) maintained steady performance, supporting their reliability as statistical approaches. The enhanced variants outperformed their vanilla counterparts, as shown by e_DWT (0.923) versus v_DWT (0.583) and e_LSTM (0.753) versus v_LSTM (0.533).
Configuration 201005 discharge DMax showed lower overall performance across methods. The Diverse Ensemble (0.895) and Accurate Ensemble (0.879) held their performance levels, while the Fast Ensemble dropped to 0.657. Individual methods displayed more variation, with the Hampel filter retaining 0.874. The enhanced versions improved on the vanilla versions in most cases (e_IF 0.674 versus v_IF 0.527; e_ORELM 0.707 versus v_ORELM 0.574). The exception was e_LOF (0.615), which performed below v_LOF (0.710), indicating that feature engineering effects vary by algorithm type. Despite this variability among individual methods, the Diverse and Accurate Ensembles retained higher AUC values through their voting mechanisms, which integrate different detection strategies to ensure more consistent performance.
To understand the complementary nature of these detection methods, we examined the patterns of agreement between the algorithm results using an algorithm agreement matrix and Jaccard similarity scores. The correlation matrices of the algorithm agreement (Figure 5) revealed distinct patterns for different dataset configurations. For Station 201012 water level DMean (Figure 5a), the methods were characterised by moderate to high correlations (0.4–0.9), with v_ESD and SH-ESD exhibiting a correlation of 0.908, indicating a common statistical basis. The ensemble methods showed high intercorrelations (>0.88), whereas v_LOF exhibited a low correlation with the other techniques (<0.1). For Station 201005 discharge DMax (Figure 5b), the correlation patterns differed, with a higher correlation observed between v_IF and v_ESD (0.966), while v_LOF again showed a low correlation with most methods. The ensemble approaches still exhibited moderate correlations (0.3–0.7) with individual methods, but these were lower than for the water level configuration. These differing correlation patterns between configurations support the ensemble approach, as they show that methods recognise different subsets of erroneous outliers, even where the overall predictions are highly correlated.
The Jaccard similarity indices (Figure 6) revealed low to moderate outlier overlap between methods despite their correlation patterns. For Station 201012 water level DMean (Figure 6a), v_ESD and SH-ESD showed the highest similarity (0.834), whilst the Accurate and Diverse Ensembles demonstrated high overlap (0.903). Most method pairs achieved similarities below 0.5, with v_LOF showing minimal overlap with all methods (<0.04). In Station 201005 discharge DMax (Figure 6b), v_IF and v_ESD exhibited high similarity (0.940), yet overall similarities remained lower than the first configuration. These low Jaccard values, contrasting with the higher algorithm agreement correlation coefficients, indicate that whilst methods may agree on normal behaviour classification, they identify different erroneous outlier subsets. This complementary detection pattern validates the ensemble strategy, which combines methods that identify different erroneous outlier subsets rather than redundantly detecting the same erroneous outliers.
Visual comparison of detection outcomes demonstrates how engineered features improve separability between erroneous measurements and natural variability (Figure 7). At Station 201001, during April–June 2012, v_IF identified 21 outliers, including numerous false positives during natural variability episodes, whilst feature enhancement reduced detections to eight with substantially improved precision, a 62% reduction in false positives whilst maintaining true outlier detection (Figure 7a). DWT exhibited similar improvement: the vanilla version flagged 12 outliers predominantly during recession periods, incorrectly interpreting exponential decay as anomalous, whilst the enhanced version eliminated these by incorporating temporal context through features characterising normal recession behaviour (Figure 7b). LOF reduced detections from six outliers to one outlier with enhanced specificity (Figure 7c). SH-ESD maintained the same detection count as its vanilla counterpart (three outliers) (Figure 7d). The consistent reduction in false positives across methods whilst maintaining true outlier identification (purple stars indicating ensemble consensus) supports the view that explicit representation of hydrological processes enables algorithms to identify the erroneous outliers.

3.3.2. Cross-Dataset Performance Analysis

The performance heatmap (Figure 8) provided evidence of method behaviour across all 30 dataset configurations. The three ensemble methods maintained consistently high F1-scores above 0.7 in most dataset configurations, with the Accurate Ensemble showing scores between 0.69 and 0.91, the Diverse Ensemble between 0.66 and 0.94, and the Fast Ensemble between 0.42 and 0.90. This consistency validated the ensemble approach for operational deployment.
Individual method performance varied across dataset configurations. The Hampel filter maintained stable performance with F1-scores ranging from 0.67 to 0.91, indicating its consistency across different hydrological conditions. SH-ESD showed similar stability (0.62 to 0.78), though with a narrower performance range. Among enhanced variants, e_IF showed moderate performance (0.54 to 0.83), while e_LOF demonstrated consistent but modest scores (0.52 to 0.68). The e_LSTM showed the lowest performance among enhanced methods (0.44 to 0.53), particularly struggling with discharge measurements. Vanilla methods exhibited limited performance. v_IF achieved F1-scores between 0.40 and 0.44, showing minimal variation but consistently poor detection. v_LOF performed similarly poorly (0.35 to 0.62). v_LSTM showed limited performance (0.35 to 0.42). v_ORELM maintained moderate consistency (0.49–0.66).
Station-specific patterns emerged from the heatmap. Station 201005 generally showed poorer performance across all methods, suggesting more challenging hydrological characteristics. Station 201012 consistently produced better results, especially for water levels. Discharge measurements consistently yielded lower F1-scores than water level measurements at all stations and for all methods. This systematic underperformance is likely due to the additional uncertainty associated with rating curve conversions. Since discharge values are derived from water level measurements rather than measured directly, they inherit the original measurement errors and also carry additional errors from the stage-discharge relation, especially during extreme events when the rating curves may be less reliable.
The box plot distributions (Figure 9) further illustrated performance patterns across data configurations. For precision metrics, the ensemble methods exhibited low distribution width and high median values, suggesting that their performance is similar across stations. The Accurate and Diverse Ensembles achieved a precision of more than 0.6 in most cases, with higher performance for water level DMax configurations. Individual methods had wider distributions and lower medians. Enhanced variants generally performed better than their vanilla counterparts, with e_IF and e_LOF having higher precision than v_IF and v_LOF, respectively.
Recall patterns differed from the precision results. Ensemble methods consistently had balanced recall scores, ranging from 0.6 to 0.8. The Fast Ensemble had a slightly lower recall than the Accurate and Diverse Ensembles, especially for discharge measurements. Among individual methods, the Hampel filter showed satisfactory recall for water level measurements and decreased performance for discharge data. Vanilla methods had poor recall across all categories, with v_LSTM and v_ORELM mostly remaining around 0.4.
The consistency of box widths gave information about the reliability of the method. Narrow boxes for ensemble methods implied stable performance across the five stations, whereas wide boxes for the vanilla methods were suggestive of high sensitivity to station characteristics. Enhanced methods exhibited intermediate box widths, indicating greater stability than vanilla ones, but not to the same extent as ensemble consistency.

3.3.3. Ensemble Selection and Composition

Analysis of the ensemble composition revealed the efficiency of the ensemble selection mechanism (Table 3). The Hampel filter achieved 100% selection in both Accurate and Diverse Ensembles, and 62.9% in the Fast Ensemble, demonstrating a good balance between the two criteria. SH-ESD showed similarly high selection rates (91.4% Accurate, 88.6% Diverse), which confirms that methods dealing explicitly with seasonality were important in detecting erroneous outliers in hydrological time series.
Enhanced variants dominated the ensemble compositions (60–80% of the ensembles) across the three strategies. e_IF was selected in 74.3% of Accurate Ensembles but only 8.6% of Diverse Ensembles, which suggests high individual performance but low diversity, i.e., its detections largely overlapped with those of other selected methods. e_LOF showed the opposite pattern, with 74.3% selection in Diverse Ensembles despite modest individual performance, suggesting it captured complementary outlier patterns. e_DWT was frequently selected in the Fast Ensemble (80% selection) due to its computational efficiency. Vanilla methods rarely appeared in ensembles. v_ESD achieved 25.7% selection in Accurate and Fast Ensembles but never appeared in the Diverse Ensemble. Other vanilla variants (v_LOF, v_IF) showed minimal selection rates below 6%, confirming that feature engineering contributed to improved performance.

3.3.4. Statistical Validation

Statistical analysis with the Friedman test demonstrated that the 16 methods differed significantly (p < 0.001). Using critical difference analysis (CD = 4.211 at a significance level of 0.05), four methods were found to be in the top-performing group (Accurate Ensemble, Diverse Ensemble, Hampel, and SH-ESD) (Figure 10). While no statistically significant differences existed within this top-performing group, the ensemble methods achieved the best average ranks (Accurate: 1.65 and Diverse: 1.82), demonstrating substantial consistency in performance across configurations. The Hampel filter (3.17) and SH-ESD (4.70) also joined the first tier despite being individual methods, as their rank differences from the ensembles were within the critical difference (statistical equivalence). The Fast Ensemble (rank 6.43) fell outside the statistically best group, confirming a performance trade-off. ML variants without feature engineering ranked poorly: the vanilla variants v_LSTM, v_DWT, v_LOF, and v_IF all fell in the bottom quartile (ranks > 13), suggesting that ML algorithms without appropriate adaptation were limited in their effectiveness in detecting erroneous outliers in hydrological time series.

3.3.5. Operational Detection Rate

Analysis of erroneous outlier detection rates on unlabelled operational data (Figure 11) revealed the methods' practical behaviour. Detection rates ranged from near-zero to over 40%, with distinct patterns across methods and data types. Statistical methods demonstrated moderate detection rates, with the Hampel filter consistently identifying 5–8% of observations as outliers. Vanilla ML methods showed high variability. v_LOF and v_IF detected fewer than 2% outliers in most dataset configurations, suggesting these methods may miss erroneous outliers that other methods identify. v_LSTM exhibited erratic behaviour with detection rates spanning 0–35% depending on the station. Enhanced variants demonstrated more calibrated behaviour, with e_LOF and e_IF maintaining detection rates between 5% and 15%. The three ensemble methods converged on consistent detection rates of 6–10% across all dataset configurations. This convergence supported the consistency of the proposed ensemble approach, as the voting mechanism effectively moderated both over-sensitive and under-sensitive base detectors. Slightly higher detection rates in discharge measurements compared to water level (median 8.2% vs. 6.4%) likely reflected the inherently noisier nature of discharge data.

3.4. Overall Discussion

The evaluation of 30 configurations of the dataset reveals the presence of unique relationships between data quality characteristics and detection performance, offering insights into the practical deployment of erroneous outlier detection methods in operational hydrological analysis.
The performance heatmap (Figure 8) reveals distinct relationships between data quality and detection effectiveness across the five monitoring stations. Station 201012 consistently achieved F1-scores above 0.8 across most methods and measurement types, corresponding with its 99.6% data completeness and minimal temporal gaps (Figure 2). Conversely, Station 201005 exhibited systematically lower performance, with many methods achieving F1-scores below 0.6, particularly evident in the orange-red regions of the heatmap. This station’s fragmented data record (53.6% completeness for water level, 68.5% for discharge) with substantial missing periods directly impacts the framework’s ability to calculate reliable temporal features. The lag features (lag_14, lag_30) and rolling statistics, which proved essential for detection, require consistent historical records.
This explains why methods dependent on temporal context underperformed at stations with incomplete data histories, consistent with established findings that RNNs and sequence-dependent algorithms exhibit performance degradation when applied to irregularly sampled or incomplete time series [22,49]. Quantitatively, temporal methods (LSTM, DWT) at Station 201005 achieved F1-scores of 0.58–0.62 compared to density-based approaches (IF, LOF) achieving 0.75–0.78, demonstrating a performance gap directly attributable to the station’s fragmented temporal record. Density-based methods maintained robust performance by operating on local neighbourhoods rather than extended sequences, which is why ensemble strategies combining diverse algorithmic foundations achieved F1 > 0.70 even at this challenging station.
The performance differential by measurement type appears consistently across all stations and methods in the heatmap (Figure 8), with discharge measurements (left panel) showing more orange-red cells compared to water level measurements (right panel). This pattern is reflected in the semi-synthetic validation, where detection methods achieved higher AUC values for water level configurations. However, the operational detection rates (Figure 11) reveal an interesting contrast: discharge measurements showed higher median outlier detection rates (8.2%) compared to water level (6.4%). This apparent contradiction reflects the framework’s sensitivity to inherent differences in data quality. Discharge values, derived through rating curves rather than direct measurement, contain compound uncertainties from water level measurement errors, hydraulic model assumptions, and curve extrapolation during extreme events. These systematic uncertainties manifest as increased variability, which detection algorithms flag as anomalous patterns, resulting in higher operational detection rates despite lower performance in controlled, semi-synthetic evaluations.
The enhanced variants’ systematic improvement over vanilla implementations demonstrates that temporal contextualisation addresses fundamental challenges in hydrological outlier detection. The box plot distributions (Figure 9) show enhanced methods achieving narrower performance ranges and higher medians across all measurement categories. This improvement stems from feature engineering that encodes an understanding of hydrological processes, enabling algorithms to better distinguish between sensor failures and legitimate extreme events. The consistent selection of rising_limb indicators and temporal lag features across 83–97% of configurations (Table 2) confirms that explicit representation of catchment response patterns enhances detection capabilities beyond raw measurement analysis.
The observed performance patterns fundamentally reflect interactions between algorithmic detection principles and underlying hydrological process characteristics rather than algorithmic superiority alone. Statistical methods (Hampel and SH-ESD) perform optimally during stable baseflow-dominated periods where local deviations indicate sensor anomalies, but struggle during non-stationary flood conditions where legitimate responses exceed historical thresholds, as reflected in their performance decline at Station 201005 compared to the more stable Station 201012. Temporal methods (LSTM and DWT) excel in predictable flow regimes where consistent patterns enable reliable sequence learning. Density-based methods (IF and LOF) demonstrate robustness across diverse conditions by avoiding temporal stationarity assumptions, with local neighbourhood approaches adapting naturally to varying flow regimes. The low-to-moderate Jaccard similarity indices between method pairs (<0.5) demonstrate valuable complementarity rather than inconsistency: different algorithms detect different error manifestations shaped by flow context. Statistical methods identify baseflow sensor drift, density-based methods detect rising limb spikes during transitional periods, and temporal methods capture recession violations. This algorithmic diversity enables ensemble strategies to capture errors across the full flow spectrum. The systematic performance differential between discharge and water level measurements stems from hydrological derivation, where discharge values inherit compounded uncertainties from water level error propagation, rating curve model deviations during extremes, and extrapolation beyond calibration range, explaining higher operational detection rates (8.2% vs. 6.4%) despite lower F1-scores.
The ensemble strategies successfully addressed different operational requirements whilst maintaining performance advantages over individual methods. The three ensemble approaches demonstrated consistent behaviour across the diverse conditions represented in the 30 dataset configurations, with the Accurate and Diverse Ensembles maintaining F1-scores above 0.7 in most cases. The ensemble composition analysis revealed that successful combinations consistently incorporated robust statistical foundations (Hampel filter, SH-ESD) alongside enhanced ML variants, indicating that effective detection requires hybrid approaches rather than relying solely on advanced algorithms. The convergence of ensemble detection rates to expected contamination levels across diverse hydrological conditions provides evidence for operational reliability. However, adaptation to different catchment characteristics and climatic conditions requires further investigation.
Further, to quantitatively demonstrate the advantages of adaptive ensemble selection over conventional fixed combination approaches, we compared our adaptive ensembles against representative static baselines where method composition remains fixed regardless of dataset characteristics. A Fixed-Top3 baseline (always combining Hampel + SH-ESD + e_IF, the three methods with the highest average rank) achieved a mean F1-score of 0.78 ± 0.11 across the 30 configurations, whilst our Accurate Ensemble achieved 0.82 ± 0.07, representing a 5.1% improvement in mean performance and a 36% reduction in variability (0.11 to 0.07). This improved consistency reflects adaptive selection’s capacity to exclude methods poorly suited to specific conditions: for example, at Station 201005 with fragmented temporal records (53.6% completeness), the Diverse Ensemble excluded sequence-dependent methods (LSTM and DWT) that struggled with temporal gaps, instead selecting density-based methods (LOF and IF) that maintained robustness, achieving F1 = 0.72 compared to Fixed-Top3’s 0.63 (14.3% improvement). Conversely, at Station 201012 with high completeness (99.6%), adaptive ensembles matched static performance (0.88 vs. 0.87), indicating minimal penalty for adaptability when conditions are favourable. Similarly, a fixed statistical baseline (Hampel + SH-ESD + v_ESD) optimised for computational efficiency achieved F1 = 0.71 ± 0.13, whilst our Fast Ensemble achieved 0.74 ± 0.14, a higher mean detection quality at comparable processing speed. These quantitative comparisons demonstrate that adaptive selection provides measurable benefits over conventional fixed combinations through station-specific composition tailoring: adjusting to data quality characteristics (completeness and temporal gaps), measurement types (direct vs. derived), and operational constraints, rather than applying universal method combinations that inevitably encounter conditions misaligned with their fixed assumptions.
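The relative changes quoted above follow from simple arithmetic on the reported figures, as the short check below shows. The helper name is an assumption. The 36% figure corresponds to the change in the ± term (0.11 to 0.07).

```python
def relative_change_pct(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Figures reported in the text.
mean_gain = relative_change_pct(0.82, 0.78)      # Accurate Ensemble vs. Fixed-Top3 mean F1
spread_change = relative_change_pct(0.07, 0.11)  # change in the ± term
station_gain = relative_change_pct(0.72, 0.63)   # Station 201005: Diverse vs. Fixed-Top3
print(f"{mean_gain:.1f}% {spread_change:.1f}% {station_gain:.1f}%")
```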

3.5. Limitations and Future Directions

This study has several limitations that suggest directions for future research. First, the validation framework is based on semi-synthetic outlier injection instead of manually labelled ground truth data. While this approach enables systematic evaluation, the 5% contamination rate and type distribution (50% point, 30% contextual, and 20% collective) represent simplified assumptions. Real-world error patterns exhibit station-specific variability influenced by local maintenance practices, sensor age, and environmental conditions. Additionally, the clear temporal separation imposed during synthetic injection (minimum three timesteps) may not fully capture scenarios where multiple failure modes co-occur, such as sensor fouling coinciding with flood events when accurate data is most critical.
Future work should therefore develop benchmark datasets with manually labelled outliers, both to support supervised learning and to validate the preservation of extreme events, which cannot be confirmed without ground truth labels. Second, the evaluation is confined to a single region, the Tweed River catchment in subtropical Australia. Climatic regions with distinct hydrological regimes may require different feature sets and detection thresholds. Testing the framework across a range of climates, from snowmelt-dominated to monsoonal, would be important for establishing its generalisability. Third, the framework operates at a daily temporal resolution, which is appropriate for strategic hydrological analysis such as flood forecasting applications; however, it may miss outlier patterns at sub-daily frequencies, which are critical for flash flood systems. Extending the framework to hourly or sub-hourly resolutions would support urban drainage and flash flood applications. In addition, the feature engineering approach requires relatively complete historical records; for stations with considerable periods of missing data, calculating lag and antecedent condition features is particularly challenging.
Future improvements may include the use of meteorological variables, specifically rainfall data, which would enable better detection during storm events. The integration of spatial features from neighbouring stations could exploit catchment connectivity, whereas graph neural networks could be employed to capture complex spatio-temporal dependencies across monitoring networks. By bridging the gap between generic detection practice and the need for validated procedures specific to hydrological characteristics, this framework provides a basis for strengthening data quality control in flood forecasting systems.
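To make the semi-synthetic assumptions discussed under the first limitation concrete, the following Python sketch injects labelled outliers at a 5% contamination rate with a 50/30/20 point/contextual/collective mix and a minimum separation of three timesteps between events. It is an illustration only, not the study's injection code: the function name `inject_outliers`, the perturbation magnitudes (5, 2, and 3 standard deviations), the collective block length (3 to 5 steps), and the choice to sample the type mix per event are all assumptions made here.

```python
import numpy as np


def inject_outliers(values, rate=0.05, mix=(0.5, 0.3, 0.2), min_gap=3, seed=0):
    """Return (corrupted copy, boolean label mask) for a 1-D series.

    Hypothetical sketch: magnitudes and block lengths are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).copy()
    labels = np.zeros(x.size, dtype=bool)    # True where an outlier was injected
    blocked = np.zeros(x.size, dtype=bool)   # positions within min_gap of an event
    budget = int(round(rate * x.size))       # total number of points to corrupt
    scale = np.nanstd(x)
    kinds = ("point", "contextual", "collective")
    attempts = 0
    while labels.sum() < budget and attempts < 50 * x.size:
        attempts += 1
        kind = rng.choice(kinds, p=mix)      # type mix sampled per event (assumption)
        length = int(rng.integers(3, 6)) if kind == "collective" else 1
        length = min(length, budget - int(labels.sum()))
        start = int(rng.integers(0, x.size - length + 1))
        stop = start + length
        if blocked[start:stop].any():
            continue                         # too close to an earlier event, retry
        sign = rng.choice([-1, 1])
        if kind == "point":
            x[start] += sign * 5 * scale     # large isolated spike
        elif kind == "contextual":
            x[start] += sign * 2 * scale     # moderate shift, unusual only in context
        else:
            x[start:stop] = x[start] + sign * 3 * scale   # flat offset block
        labels[start:stop] = True
        blocked[max(0, start - min_gap):min(x.size, stop + min_gap)] = True
    return x, labels


# Usage on a synthetic seasonal series (not real gauge data).
t = np.arange(1000)
series = 2.0 + np.sin(2 * np.pi * t / 365.0)
corrupted, labels = inject_outliers(series)
print(int(labels.sum()), "of", series.size, "points labelled as injected outliers")
```

Because every event blocks its neighbourhood, injected events never overlap or touch, which is exactly the clean temporal separation that the limitation above warns may not hold when failure modes co-occur in practice.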

4. Conclusions

This study addressed the gap in systematic outlier detection for hydrological time series applications by developing a comprehensive framework tailored to the unique characteristics of hydrological data. Through evaluation across 30 dataset configurations spanning five monitoring stations with diverse data quality conditions, we demonstrated that feature-enhanced detection substantially improves the data quality of hydrological time series. Our contributions include: (i) the development of 19 hydrological features reduced to six core indicators (value, rolling_min_7d, rising_limb, lag_14, lag_30, and value_diff_pct) through systematic correlation-based selection analysis; (ii) the comprehensive evaluation of 13 detection algorithms showing enhanced variants outperform vanilla implementations by 56% on average through explicit temporal and hydrological contextualisation; and (iii) three data-driven ensemble selection strategies (Accurate, Diverse, and Fast) that balance detection accuracy, algorithmic diversity, and computational speed to address different operational requirements.
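As an illustration of what the six core indicators might look like in practice, the pandas sketch below derives them from a daily series. The exact feature definitions are given in the Supplementary Materials (S2) and are not reproduced here, so every formula in this block is an assumption inferred from the feature names: `rolling_min_7d` as a trailing 7-day minimum, `rising_limb` as a flag for a positive day-to-day change, `lag_14` and `lag_30` as shifted values, and `value_diff_pct` as the percentage change from the previous day. The function name `build_core_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def build_core_features(series: pd.Series) -> pd.DataFrame:
    """Derive six illustrative features from a series with a DatetimeIndex."""
    # Reindex to a regular daily grid so that shifts correspond to calendar days;
    # missing days become NaN and propagate into the lag features.
    daily = series.asfreq("D")
    feats = pd.DataFrame({"value": daily})
    # Trailing 7-day minimum as a simple baseflow proxy (assumed definition).
    feats["rolling_min_7d"] = daily.rolling(window=7, min_periods=7).min()
    # 1 when the value increased since the previous day, else 0 (assumed definition).
    feats["rising_limb"] = (daily.diff() > 0).astype(int)
    # Values observed 14 and 30 days earlier.
    feats["lag_14"] = daily.shift(14)
    feats["lag_30"] = daily.shift(30)
    # Day-to-day change in percent; division by zero is mapped to NaN.
    pct = daily.diff() / daily.shift(1) * 100.0
    feats["value_diff_pct"] = pct.replace([np.inf, -np.inf], np.nan)
    return feats


# Usage on a synthetic, steadily rising series (not real gauge data).
idx = pd.date_range("2020-01-01", periods=60, freq="D")
flow = pd.Series(np.linspace(1.0, 2.0, 60), index=idx)
features = build_core_features(flow)
print(features.tail(3))
```

The sketch also shows why fragmented records are problematic: any gap in the daily grid leaves the lag and rolling features undefined for up to 30 subsequent days.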
Evaluation across diverse hydrological conditions revealed critical relationships between data characteristics and detection effectiveness. Stations with high temporal completeness (e.g., Station 201012: 99.6% data coverage) consistently achieved F1-scores above 0.8 across most methods, whilst stations with fragmented records (e.g., Station 201005: 53.6% completeness) exhibited systematically lower performance (F1 < 0.6), particularly for sequence-dependent algorithms such as LSTM and DWT that require consistent historical records to calculate temporal features effectively. This pattern underscores the fundamental importance of data continuity when employing methods dependent on temporal context. Feature engineering proved essential for distinguishing erroneous outliers from natural extreme events, with enhanced variants achieving narrower performance ranges and higher median F1-scores across all measurement categories. The systematic improvement stems from explicit representation of hydrological processes that enables algorithms to leverage temporal dependencies and physical constraints absent in raw data analysis. Rising/falling limb indicators capture state transitions, lag features encode catchment memory effects, and rolling statistics provide baseflow context. Discharge measurements exhibited systematically higher operational detection rates (8.2%) compared to water level (6.4%), reflecting compound uncertainties from rating curve derivation rather than inferior data quality, validating the framework’s sensitivity to inherent measurement characteristics.
The ensemble strategies successfully balanced competing operational requirements whilst maintaining robust performance across diverse conditions. All three ensembles demonstrated F1-scores above 0.7 in most configurations, with the Accurate and Diverse Ensembles maintaining consistent performance even at stations with challenging data characteristics. Composition analysis revealed that effective detection requires hybrid approaches combining robust statistical foundations (Hampel filter and SH-ESD) with enhanced machine learning variants, rather than relying solely on advanced algorithms. The convergence of operational detection rates to 6–10% across ensemble methods aligns with realistic contamination levels, providing evidence for deployment reliability in real-world monitoring contexts.
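The hybrid, vote-based composition described above can be sketched as follows. This is a minimal illustration, not the ensemble pseudocode from the Supplementary Materials (S3): it implements a standard Hampel filter (rolling median with a scaled median absolute deviation) and combines boolean flags by simple majority vote. The second and third ensemble members are stand-ins chosen for brevity (a Hampel filter with a wider window and a global robust z-score), and the names `hampel_flags`, `robust_z_flags`, and `majority_vote` as well as the window sizes and thresholds are assumptions.

```python
import numpy as np
import pandas as pd

MAD_TO_SIGMA = 1.4826  # scales the MAD to the standard deviation under normality


def hampel_flags(series: pd.Series, window: int = 7, n_sigmas: float = 3.0) -> pd.Series:
    """Flag points deviating from the centred rolling median by more than n_sigmas scaled MADs."""
    roll = series.rolling(window, center=True, min_periods=1)
    med = roll.median()
    # MAD computed inside each window around that window's own median.
    mad = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    return (series - med).abs() > n_sigmas * MAD_TO_SIGMA * mad


def robust_z_flags(series: pd.Series, n_sigmas: float = 3.0) -> pd.Series:
    """Stand-in detector: global median/MAD z-score (not one of the study's methods)."""
    med = series.median()
    mad = (series - med).abs().median()
    return (series - med).abs() > n_sigmas * MAD_TO_SIGMA * mad


def majority_vote(flag_table: pd.DataFrame, min_votes: int) -> pd.Series:
    """Flag a timestep when at least min_votes member detectors flag it."""
    return flag_table.sum(axis=1) >= min_votes


# Usage on a synthetic ramp with one injected spike (not real gauge data).
values = np.linspace(1.0, 2.0, 100)
values[50] += 10.0
level = pd.Series(values)
flags = pd.DataFrame({
    "hampel_7": hampel_flags(level, window=7),
    "hampel_15": hampel_flags(level, window=15),
    "robust_z": robust_z_flags(level),
})
consensus = majority_vote(flags, min_votes=2)
print(consensus[consensus].index.tolist())
```

Requiring agreement from at least two members is one simple way to trade recall for precision; the study's Accurate, Diverse, and Fast strategies differ in how members are selected, which this sketch does not attempt to reproduce.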

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/w18040446/s1, S1. Studies applied outlier detection as preprocessing for flood forecasting. S2. Detailed Feature Engineering and Selection Methodology. S3. Pseudocodes for the Ensemble Approaches (Reference [57] is cited in the Supplementary Materials).

Author Contributions

Conceptualization, B.K., G.S., A.R.A. and F.T.; Methodology, B.K.; Validation, B.K.; Formal analysis, B.K.; Investigation, B.K.; Resources, B.K.; Data curation, B.K.; Writing—original draft, B.K.; Writing—review and editing, B.K., G.S., A.R.A. and F.T.; Visualisation, B.K.; Supervision, G.S., A.R.A. and F.T.; Project administration, G.S., A.R.A. and F.T.; Funding acquisition, G.S., A.R.A. and F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Connectivity Innovation Network (CIN): 52251, an initiative of the New South Wales (NSW) Government and the NSW Telco Authority, Australia.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Shao, P.; Feng, J.; Lu, J.; Zhang, P.; Zou, C. Data-Driven and Knowledge-Guided Denoising Diffusion Model for Flood Forecasting. Expert Syst. Appl. 2024, 244, 122908.
  2. Kuhaneswaran, B.; Sorwar, G.; Alaei, A.R.; Tong, F. Evolution of Data-Driven Flood Forecasting: Trends, Technologies, and Gaps—A Systematic Mapping Study. Water 2025, 17, 2281.
  3. Katsouda, M.; Boutsinas, B. Detecting Extreme Values in Time Series Based on Bar Visibility. Int. J. Data Sci. Anal. 2025, 20, 4879–4888.
  4. AlSalehy, A.S.; Bailey, M. Improving Time Series Data Quality: Identifying Outliers and Handling Missing Values in a Multilocation Gas and Weather Dataset. Smart Cities 2025, 8, 82.
  5. Cook, A.A.; Mısırlı, G.; Fan, Z. Anomaly Detection for IoT Time-series Data: A Survey. IEEE Internet Things J. 2019, 7, 6481–6494.
  6. Li, Y.; Su, M.; Duan, Z.; Liu, H. A New Integrated Prediction Method of River Level Based on Spatiotemporal Correlation. Stoch. Environ. Res. Risk Assess. 2024, 38, 1121–1143.
  7. van de Wiel, L.; van Es, D.M.; Feelders, A.J. Real-time Outlier Detection in Time Series Data of Water Sensors. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer Science and Business Media Deutschland GmbH: Berlin/Heidelberg, Germany, 2020.
  8. Ni, C.; Marsani, M.F.; Shan, F.P.; Zou, X. Flood Prediction with Optimized Gated Recurrent Unit-Temporal Convolutional Network and Improved KDE Error Estimation. AIMS Math. 2024, 9, 14681–14696.
  9. Luo, Y.; Zhou, Y.; Chen, H.; Xiong, L.; Guo, S.; Chang, F.-J. Exploring a Spatiotemporal Hetero Graph-Based Long Short-Term Memory Model for Multi-Step-Ahead Flood Forecasting. J. Hydrol. 2024, 633, 130937.
  10. Almazan, N.; Garcia, J.; Laman, S.; Morato, P.; Fabregas, A.; Coronado, A.; Molejon, M.; Violeta, M.L. Floodwatcher: Forecasting Marikina River Level Using Generative Pre-Trained Transformer with Kernel PCA. Procedia Comput. Sci. 2024, 245, 220–228.
  11. Darkwah, G.K.; Kalyanapu, A.; Owusu, C. Machine Learning-Based Flood Forecasting System for Window Cliffs State Natural Area, Tennessee. Geohazards 2024, 5, 64–90.
  12. Kim, D.; Lee, J.; Kim, J.; Lee, M.; Wang, W.; Kim, H.S. Comparative Analysis of Long Short-Term Memory and Storage Function Model for Flood Water Level Forecasting of Bokha Stream in Namhan River, Korea. J. Hydrol. 2022, 606, 127415.
  13. Blázquez-García, A.; Conde, A.; Mori, U.; Lozano, J.A. A Review on Outlier/Anomaly Detection in Time Series Data. ACM Comput. Surv. 2021, 54, 1–33.
  14. Takeuchi, K. Hydrological Persistence Characteristics of Floods and Droughts—Interregional Comparisons. J. Hydrol. 1988, 102, 49–67.
  15. Berendrecht, W.; Van Vliet, M.; Griffioen, J. Combining Statistical Methods for Detecting Potential Outliers in Groundwater Quality Time Series. Environ. Monit. Assess. 2023, 195, 85.
  16. Nguyen, A.D.; Vu, V.H.; Hoang, D.V.; Nguyen, T.D.; Nguyen, K.; Le Nguyen, P.; Ji, Y. Attentional Ensemble Model for Accurate Discharge and Water Level Prediction with Training Data Enhancement. Eng. Appl. Artif. Intell. 2023, 126, 107073.
  17. Pan, M.; Zhou, H.; Cao, J.; Liu, Y.; Hao, J.; Li, S.; Chen, C.-H. Water Level Prediction Model Based on GRU and CNN. IEEE Access 2020, 8, 60090–60100.
  18. Shabbir, M.; Chand, S.; Iqbal, F.; Kisi, O. Hybrid Approach for Streamflow Prediction: LASSO-Hampel Filter Integration with Support Vector Machines, Artificial Neural Networks, and Autoregressive Distributed Lag Models. Water Resour. Manag. 2024, 38, 4179–4196.
  19. Belyakova, P.; Moreido, V.; Tsyplenkov, A.; Amerbaev, A.; Grechishnikova, D.; Kurochkina, L.; Filippov, V.; Makeev, M. Forecasting Water Levels in Krasnodar Krai Rivers with the Use of Machine Learning. Water Resour. 2022, 49, 10–22.
  20. Zhou, L.; Qiao, Y.; Zhang, Z.; Han, Z.; Lei, X.; Qin, Y.; Wang, H. Development of a Novel Outlier Index for Real-time Detection of Water Level Outliers for Open-channel Water Transfer Projects. J. Hydroinformatics 2023, 25, 1072–1083.
  21. Alghushairy, O.; Alsini, R.; Soule, T.; Ma, X. A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data Cogn. Comput. 2020, 5, 1.
  22. Wei, Y.; Jang-Jaccard, J.; Xu, W.; Sabrina, F.; Camtepe, S.; Boulic, M. LSTM-Autoencoder-based Anomaly Detection for Indoor Air Quality Time-series Data. IEEE Sens. J. 2023, 23, 3787–3800.
  23. Zhao, Q.; Zhu, Y.; Wan, D.; Yu, Y.; Cheng, X. Research on the Data-driven Quality Control Method of Hydrological Time Series Data. Water 2018, 10, 1712.
  24. Halicki, M.; Niedzielski, T. A New Approach for Hydrograph Data Interpolation and Outlier Removal for Vector Autoregressive Modelling: A Case Study From the Odra/Oder River. Stoch. Environ. Res. Risk Assess. 2024, 38, 2781–2796.
  25. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Min. Knowl. Discov. 2016, 30, 891–927.
  26. Hussain, I. Outlier Detection Using Nonparametric Depth-based Techniques in Hydrology. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 456–462.
  27. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. 2009, 41, 1–58.
  28. Schlaeger, F.; Natschke, M.; Witham, D. Quality assurance for hydrometric network data as a basis for integrated river basin management. IAHS Publ. 2007, 310, 327.
  29. McMillan, H.; Krueger, T.; Freer, J. Benchmarking observational uncertainties for hydrology: Rainfall, river discharge and water quality. Hydrol. Process. 2012, 26, 4078–4111.
  30. BoM. Water Data Online. Australian Bureau of Meteorology. Available online: https://www.bom.gov.au/ (accessed on 15 August 2025).
  31. Teh, H.Y.; Kempa-Liehr, A.W.; Wang, K.I.-K. Sensor data quality: A systematic review. J. Big Data 2020, 7, 11.
  32. Kundzewicz, Z.W.; Robson, A.J. Change detection in hydrological records—A review of the methodology/Revue méthodologique de la détection de changements dans les chroniques hydrologiques. Hydrol. Sci. J. 2004, 49, 7–19.
  33. Pukelsheim, F. The Three Sigma Rule. Am. Stat. 1994, 48, 88–91.
  34. Tallaksen, L. A Review of Baseflow Recession Analysis. J. Hydrol. 1995, 165, 349–370.
  35. Li, K.; Huang, G.; Wang, S.; Baetz, B.; Xu, W. A Stepwise Clustered Hydrological Model for Addressing the Temporal Autocorrelation of Daily Streamflows in Irrigated Watersheds. Water Resour. Res. 2022, 58, e2021WR031065.
  36. Chebana, F.; Ouarda, T.B. Multivariate Quantiles in Hydrological Frequency Analysis. Environmetrics 2011, 22, 63–78.
  37. Fang, C.; Wang, X.; Hu, W.; He, X.; Huang, Z.; Gu, H. Outlier Identification of Concrete Dam Displacement Monitoring Data Based on WAVLET-DBSCAN-IFRL. Water 2025, 17, 716.
  38. Wijayanto, A.; Sugiharto, A.; Santoso, R. Detection Model for Potential Flooding Areas Using K-Means and Local Outlier Factor (LOF). In Proceedings of the 2024 4th International Conference of Science and Information Technology in Smart Administration (ICSINTESA), Balikpapan, Indonesia, 12–13 July 2024.
  39. Grubbs, F.E. Sample Criteria for Testing Outlying Observations; University of Michigan: Ann Arbor, MI, USA, 1949.
  40. Vieira, R.G.; Leone Filho, M.A.; Semolini, R. An Enhanced Seasonal-Hybrid ESD Technique for Robust Anomaly Detection on Time Series. In Simpósio Brasileiro De Redes De Computadores E Sistemas Distribuídos (SBRC); SBC: Porto Alegre, Brazil, 2018.
  41. Hampel, F.R. The Influence Curve and Its Role in Robust Estimation. J. Am. Stat. Assoc. 1974, 69, 383–393.
  42. Pearson, R.K.; Neuvo, Y.; Astola, J.; Gabbouj, M. Generalized Hampel Filters. EURASIP J. Adv. Signal Process. 2016, 2016, 87.
  43. Thill, M.; Konen, W.; Bäck, T. Time Series Anomaly Detection With Discrete Wavelet Transforms and Maximum Likelihood Estimation. In Proceedings of the Intern. Conference on Time Series (ITISE), Granada, Spain, 18–20 September 2017.
  44. Labat, D. Wavelet Analyses in Hydrology. In Advances in Data-Based Approaches for Hydrologic Modeling and Forecasting; World Scientific: London, UK, 2010; pp. 371–410.
  45. Fekih Romdhane, T.; Ouni, R. Electrocardiogram Analysis Using Discrete Wavelet Transform for Anomalies Detection. SN Comput. Sci. 2023, 4, 348.
  46. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
  47. Najman, K.; Zieliński, K. Outlier Detection With the Use of Isolation Forests. In Conference of the Section on Classification and Data Analysis of the Polish Statistical Association; Springer: Berlin/Heidelberg, Germany, 2020.
  48. Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying Density-based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000.
  49. Park, J.; Seo, Y.; Cho, J. Unsupervised Outlier Detection for Time-series Data of Indoor Air Quality Using LSTM Autoencoder With Ensemble Method. J. Big Data 2023, 10, 66.
  50. Ebtehaj, I.; Bonakdari, H. A Reliable Hybrid Outlier Robust Non-Tuned Rapid Machine Learning Model for Multi-Step Ahead Flood Forecasting in Quebec, Canada. J. Hydrol. 2022, 614, 128592.
  51. Ko, A.H.; Sabourin, R.; Britto, A.S., Jr. From Dynamic Classifier Selection to Dynamic Ensemble Selection. Pattern Recognit. 2008, 41, 1718–1731.
  52. Chen, M. A Greedy Algorithm with Forward-Looking Strategy; INTECH Open Access Publisher: London, UK, 2008.
  53. Weyrich, P.; Scolobig, A.; Patt, A. Dealing With Inconsistent Weather Warnings: Effects on Warning Quality and Intended Actions. Meteorol. Appl. 2019, 26, 569–583.
  54. Garg, A.; Tai, K. Comparison of Statistical and Machine Learning Methods in Modelling of Data With Multicollinearity. Int. J. Model. Identif. Control 2013, 18, 295–312.
  55. Department of Planning and Environment NSW. Tweed Catchment. Learn About Water 2023. Available online: https://water.dpie.nsw.gov.au/about-us/learn-about-water/basins-and-catchments/catchments/tweed (accessed on 23 February 2025).
  56. Environment Heritage NSW. Tweed River Catchment—Landscape Groups. In Index of Estuarine Condition; NSW Department of Planning and Environment: New South Wales, Australia, 2023.
  57. Aiyelokun, O.; Aiyelokun, O.; Agbede, O. Application of Random Forest (RF) for Flood Levels Prediction in Lower Ogun Basin, Nigeria. Nat. Hazards 2023, 119, 2179–2195.
Figure 1. Methodological framework for erroneous outlier detection in hydrological time series. The approach integrates semi-synthetic validation (5% contamination: 50% point, 30% contextual, and 20% collective outliers) with feature engineering (19 features). Thirteen detection algorithms, comprising vanilla models, statistical filters, and enhanced feature-based variants, are evaluated individually and combined through three ensemble strategies (Accurate, Diverse, and Fast). Multi-faceted evaluation includes performance metrics, agreement analysis (Jaccard similarity, correlation matrices), and statistical significance testing (Friedman test with critical difference analysis).
Figure 2. Temporal data availability for daily mean (DMean) measurements across five monitoring stations in the Tweed River catchment. Horizontal bars show discharge (WCD, upper panel) and water level (WCL, lower panel) data coverage from station commencement (red dots) to 10 July 2025. White gaps indicate missing data periods. Percentages show data completeness after quality filtering (53.6–99.8%). Station IDs are colour-coded as indicated. The DMean aggregation shown is representative of all 30 experimental configurations analysed.
Figure 3. Spatial distribution of hydrological monitoring stations in the Tweed River catchment, northeastern New South Wales. Five stations (201001, 201005, 201012, 201015, and 201900) from the BoM are distributed across the 1326 km² catchment and are positioned along the water course network, with station identifiers and locations shown. The scale bar represents 10 km.
Figure 4. Comparative performance of erroneous outlier detection methods using ROC analysis. Sixteen methods were evaluated on semi-synthetic data for (a) Station 201012 water level daily mean (WCL DMean) and (b) Station 201005 discharge daily maximum (WCD DMax). Methods include vanilla implementations (v_), enhanced feature-based variants (e_), statistical approaches (Hampel, SH-ESD), and three ensemble strategies. AUC values demonstrate the higher performance of ensemble methods and enhanced variants over vanilla implementations. Configuration (a) shows better overall performance (AUC 0.533–0.984) than configuration (b) (AUC 0.441–0.895).
Figure 5. Algorithm agreement correlation matrices revealing detection pattern similarities between 16 methods. Pearson correlation coefficients are shown for (a) the Station 201012 water level daily mean and (b) the Station 201005 discharge daily maximum. Values range from −1 (inverse agreement) to 1 (perfect agreement). Dark red indicates strong positive correlation; blue shows negative correlation. Method pairs exhibit varying agreement patterns across different hydrological conditions. High correlations (>0.8) between ensemble methods and statistical approaches indicate similar detection patterns.
Figure 6. Jaccard similarity indices quantifying outlier overlap between detection methods. Matrices show the proportion of commonly identified outliers for (a) the Station 201012 water level daily mean and (b) Station 201005 discharge daily maximum. Values range from 0 (no overlap) to 1 (complete overlap). Low similarities (<0.5) between most method pairs indicate complementary detection capabilities, whilst ensemble methods show high mutual overlap (>0.8) through voting mechanisms.
Figure 7. Visual comparison of vanilla versus enhanced detection at Station 201001 (April–June 2012). Purple stars indicate true outliers (ensemble consensus). (a) Isolation Forest reduces detections from 21 to 8. (b) Discrete Wavelet Transform eliminates 12 recession-period false positives. (c) Local Outlier Factor reduces from 6 to 1. (d) ESD methods maintain 3 detections with improved confidence. Enhanced versions consistently reduce false positives whilst preserving true outlier identification.
Figure 8. Performance evaluation of 16 erroneous outlier detection methods across 30 hydrological dataset configurations. F1-scores displayed as a heatmap, separated into discharge (left) and water level (right) measurements. Horizontal lines delineate ensemble (top), enhanced (middle), and vanilla (bottom) method categories. Station 201012 consistently demonstrates high performance across both measurement types, whereas Station 201005 exhibits lower performance. Ensemble methods consistently demonstrate high performance (F1 > 0.7, green), while vanilla implementations show poor detection capability (F1 < 0.5, orange-red). Statistical methods (Hampel, SH-ESD) maintain moderate performance.
Figure 9. Distribution of precision and recall metrics for 16 erroneous outlier detection methods across six hydrological data categories. Box plots show performance variability across five monitoring stations for discharge and water level measurements at three temporal aggregations (daily maximum, mean, and minimum). Background colours indicate the models: yellow (ensemble methods) and green (enhanced methods). Ensemble methods demonstrate narrow distributions with high medians, indicating consistent performance. Vanilla implementations exhibit broad distributions and low medians, reflecting sensitivity to station characteristics.
Figure 10. Critical difference diagram revealing the statistical significance of performance differences between 16 erroneous outlier detection methods. Average ranks computed from the Friedman test across 30 dataset configurations (lower ranks indicate better performance). Horizontal red bar indicates critical difference threshold (CD = 4.211, α = 0.05); methods within this distance are statistically equivalent. The top-performing group comprises Accurate Ensemble (1.65), Diverse Ensemble (1.82), Hampel (3.17), and SH-ESD (4.70). Vanilla implementations occupy the bottom quartile (ranks > 12), confirming the contribution of feature engineering to detection performance.
Figure 11. Operational outlier detection rates on unlabelled hydrological data across 16 methods. Box plots show the percentage of observations flagged as outliers for five stations grouped by measurement type (water course level/discharge) and temporal aggregation (daily maximum/mean/minimum). Detection rates range from <1% to >40%. Ensemble methods achieve detection rates of 6–10%, aligning with expected contamination levels. Vanilla LSTM exhibits erratic behaviour with extreme variability. Discharge measurements show higher median detection rates than water level.
Table 1. Dedicated erroneous outlier detection studies in hydrological time series.
| Paper | Outlier Detection Techniques | Data Used | Features | Description of Data | Learning Type | Validation Technique |
|---|---|---|---|---|---|---|
| Zhao, Zhu [23] | RNN + SVM optimised by PSO, combined with Adaboost, LSTM optimised by MEA, Wavelet analysis, extreme value check, time-varying check, Inverse Distance Weighting, Kriging Method, Trend Surface Method, Multivariate SVM | Water level, discharge | Historical data patterns, precipitation stations’ spatial coordinates, elevation, distance, correlation coefficients | 1 station; hourly data | Supervised | Metrics: RMSE, maximum error, average error, predictive confidence |
| Hussain [26] | Mahalanobis distance outlyingness, projection depth outlyingness, spatial Mahalanobis outlyingness, Tukey (halfspace) depth outlyingness | Discharge | Peak, volume, principal component scores (z1, z2) | 1 station; 1977–2017 (41 years); daily data | Not mentioned | Not mentioned |
| van de Wiel, van Es [7] | Autoregressive Models, Linear Regression with Lasso Penalty, Quantile Regression Forests (QRFs), QR-MLP, QR Perceptron, QR RNNs (GRU, LSTM), IF | Water heights | Mean values of window sizes [64 h, 128 h, …, 1048 h] | 4 weirs; 2015–2019 (4 years); 15-min-interval data | Unsupervised | Semi-synthetic evaluation (injected drift, jump, extremes), expert evaluation, Friedman and Nemenyi tests |
| Zhou, Qiao [20] | Outlier index based on water level–flow relationship (proposed method), first-order difference method, order of magnitude classification | Water level, discharge | Only raw data | >60 sluice gates; 2017–2021 (4 years); 2-h-interval data | Unsupervised | Semi-synthetic evaluation (random noise (4–9 cm) added to 15 randomly selected non-adjacent monitoring data points) |
| Halicki and Niedzielski [24] | Extreme values method, IF | Water level | Hydrological evaluation using upstream gauge data and time lag calculations between gauges | 27 stations; 2016–2022 (6 years); hourly data | Semi-supervised | Hydrological evaluation criterion (checking hydrological connectivity of outliers), semi-synthetic (artificial outlier insertion experiment) |
| Current Study | Vanilla and enhanced versions of ESD, DWT, LOF, IF, LSTM, ORELM; Hampel Filter; 3 ensemble models (Accurate, Diverse and Fast) | Water level, discharge | 19 engineered features plus raw value (20 total); final model uses raw value, rolling_min_7d, rising_limb, lag_14, lag_30, value_diff_pct (6 features after correlation analysis) | 5 stations × 2 data types × 3 time aggregations = 30 experiments; daily data | Unsupervised | Semi-synthetic evaluation (5% injection of point, contextual and collective outliers), Jaccard similarity index, algorithm agreement correlation analysis, statistical measures (Friedman test and Nemenyi test) |
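The semi-synthetic validation listed for the current study in Table 1 (injecting 5% outliers and scoring detections against the known injected positions) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: it injects point outliers only (contextual and collective outliers are omitted), uses a synthetic seasonal series in place of gauge data, and applies a basic Hampel filter whose window length, threshold, and injection magnitude are all hypothetical choices.

```python
import math
import random
import statistics


def inject_point_outliers(series, fraction=0.05, scale=5.0, seed=0):
    """Return a contaminated copy of the series and the injected indices."""
    rng = random.Random(seed)
    n_out = max(1, round(fraction * len(series)))
    idx = set(rng.sample(range(len(series)), n_out))
    spread = statistics.pstdev(series) or 1.0
    contaminated = [
        v + rng.choice((-1, 1)) * scale * spread if i in idx else v
        for i, v in enumerate(series)
    ]
    return contaminated, idx


def hampel_flags(series, half_window=7, n_sigmas=3.0):
    """Flag points further than n_sigmas scaled MADs from the window median."""
    flagged = set()
    for i, v in enumerate(series):
        window = series[max(0, i - half_window): i + half_window + 1]
        med = statistics.median(window)
        mad = statistics.median(abs(x - med) for x in window)
        if abs(v - med) > n_sigmas * 1.4826 * mad:
            flagged.add(i)
    return flagged


def f1_score(predicted, actual):
    """F1 score of predicted outlier indices against the injected indices."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


# Synthetic one-year daily series (assumption: stands in for real gauge data).
noise = random.Random(42)
clean = [2.0 + math.sin(2 * math.pi * t / 365) + noise.gauss(0, 0.05)
         for t in range(365)]

contaminated, truth = inject_point_outliers(clean, fraction=0.05)
flags = hampel_flags(contaminated)
f1 = f1_score(flags, truth)
print(f"injected: {len(truth)}, flagged: {len(flags)}, F1: {f1:.2f}")
```

Because the injected positions are known, precision and recall can be computed without manually labelled data, which is the main appeal of semi-synthetic evaluation for unlabelled hydrological records.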
Table 2. Feature selection frequency across 30 dataset configurations.
| Feature | Number of Cases |
|---|---|
| value | 30/30 (100%) |
| rolling_min_7d | 30/30 (100%) |
| rising_limb | 29/30 (97%) |
| lag_30 | 26/30 (86%) |
| lag_14 | 25/30 (83%) |
| value_diff_pct | 19/30 (63%) |
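For orientation, the six features in Table 2 could be derived from a daily series roughly as sketched below. The feature names come from the table, but their exact definitions are not given in this section, so every definition here is an assumption: a trailing 7-day window for rolling_min_7d, a simple day-on-day increase flag for rising_limb, plain 14- and 30-day lags, and a day-on-day percentage change for value_diff_pct.

```python
def engineer_features(values):
    """Build one feature row per day; entries are None where history is missing."""
    rows = []
    for t, v in enumerate(values):
        prev = values[t - 1] if t >= 1 else None
        rows.append({
            "value": v,
            # Assumed: minimum over the current day and the six preceding days.
            "rolling_min_7d": min(values[t - 6: t + 1]) if t >= 6 else None,
            # Assumed: 1 when the value rose relative to the previous day.
            "rising_limb": int(prev is not None and v > prev),
            "lag_14": values[t - 14] if t >= 14 else None,
            "lag_30": values[t - 30] if t >= 30 else None,
            # Assumed: percentage change from the previous day (None if undefined).
            "value_diff_pct": 100.0 * (v - prev) / prev if prev else None,
        })
    return rows


# Toy series: a steadily rising level of 1.0, 2.0, ..., 40.0 (not real data).
series = [float(x) for x in range(1, 41)]
features = engineer_features(series)
print(features[30])
```

Rows near the start of the series carry None for features that need more history; how the original study handled these warm-up days is not stated here.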
Table 3. Ensemble composition summary.
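| Method | Accurate Ensemble | Diverse Ensemble | Fast Ensemble |
|---|---|---|---|
| Hampel | 100.00% | 100.00% | 62.90% |
| SH-ESD | 91.40% | 88.60% | 5.70% |
| e_IF | 74.30% | 8.60% | 17.10% |
| v_ESD | 25.70% | 11.40% | 25.70% |
| e_LOF | 5.70% | 74.30% | 77.10% |
| e_DWT | 2.90% | 8.60% | 80.00% |
| e_ORELM | 0.00% | 8.60% | 22.90% |
| v_LOF | 0.00% | 0.00% | 2.90% |
| v_IF | 0.00% | 0.00% | 5.70% |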
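One plausible reading of Table 3 is that each percentage is the share of ensemble builds in which a given method was selected as a member. That interpretation is an assumption, as is everything in the sketch below: the member lists are hypothetical and only show how such an inclusion frequency would be tallied.

```python
from collections import Counter


def inclusion_frequency(selections):
    """Percentage of ensemble builds in which each method appears."""
    counts = Counter(m for members in selections for m in set(members))
    return {m: 100.0 * c / len(selections) for m, c in counts.items()}


# Hypothetical member lists for four ensemble builds (not the paper's data).
accurate_builds = [
    ["Hampel", "SH-ESD", "e_IF"],
    ["Hampel", "SH-ESD", "v_ESD"],
    ["Hampel", "SH-ESD", "e_IF"],
    ["Hampel", "e_IF", "e_LOF"],
]

freq = inclusion_frequency(accurate_builds)
for method, pct in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{method}: {pct:.2f}%")
```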
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Kuhaneswaran, B.; Sorwar, G.; Alaei, A.R.; Tong, F. Feature-Enhanced Erroneous Outlier Detection in Hydrological Time Series Using Ensemble Methods. Water 2026, 18, 446. https://doi.org/10.3390/w18040446
