1. Introduction
Data quality forms the foundation of reliable flood forecasting systems and other water resource management activities, directly affecting forecasting accuracy and operational effectiveness [
1]. Hydrological monitoring networks continuously collect measurements (such as water levels), producing hydrological time series. Data-driven forecasting models rely on such historical time series to learn system behaviour and make predictions that inform important water management decisions [
2]. However, data quality problems often arise directly from sensor packet malfunctions, calibration drift, transmission errors, and log failures [
3], resulting in missing data and erroneous measurements that compromise model performance and operational decision-making [
4]. Despite extensive research on data quality issues in hydrological systems, flood forecasting studies often apply basic statistical rules without systematically validating them or adapting them to the specific characteristics of hydrological systems.
This lack of systematic validation is exacerbated by the difficulty of covering the different types of anomalous observations in hydrological data. This study distinguishes between the following related concepts: (i) Anomalies are observations that deviate from expected behaviour, whether legitimate or erroneous. (ii) Outliers are statistical abnormalities that lie outside the normal data distribution. (iii) Erroneous outliers refer specifically to measurement failures or data quality issues that should be discounted, as opposed to legitimate extreme events that, while statistically unusual, represent valid hydrological conditions [
5]. Legitimate extremes, such as floods and droughts, follow physical laws and must be preserved, as they provide essential information for water resource management, risk assessment, and infrastructure design. This distinction becomes particularly critical in hydrological time series analysis, where extreme events are the primary target for prediction.
Current erroneous outlier detection practices in data-driven hydrological modelling lack systematic methodology and validation. There are different types of detection methods ranging from simple statistical tests [
6] to advanced algorithms [
7]. However, recent studies in hydrological modelling have shown fragmented application of these methods. Some use the 3-sigma rule [
8] or boxplot methods [
6], while others adopt robust loss functions, such as Huber loss [
9], or moving averages [
10], which are used as a preprocessing step. There are numerous cases in the literature that acknowledge the existence of outliers [
6,
11], despite reports indicating that forecasting models are directly influenced by input data noise during learning [
12]. Further, whereas measurement errors degrade the performance of forecasting models [
12], the removal of legitimate extremes could eliminate precisely the information that hydrological time series analysis depends on. This challenge calls for techniques that distinguish sensor errors requiring correction from legitimate extremes that must be preserved.
This differentiation challenge is especially complex given the distinct characteristics of hydrological time series. Hydrological time series exhibit extreme skewness, with rare but critical peak events dominating system behaviour. These peaks are irregular and are driven by complex meteorological patterns rather than the predictable cyclic processes commonly found in most industrial or economic time series. The temporal dependencies are fundamentally different; flood hydrographs show rapid rises and gradual recessions over days, contrasting with the smooth patterns typical of other domains [
13]. Seasonal variations create time-varying definitions of normality, where values typical during monsoons would appear anomalous during dry periods. Hydrological persistence also means that erroneous outliers may spread over time, with sensor drift producing sequences of slowly deviating observations that escape detection as point anomalies [
14].
Given these challenges, various detection methods have been developed, though statistical threshold methods continue to dominate current practice [
15]. The 3-sigma rule remains widely adopted [
8], whereas boxplot-based methods using interquartile range rules are frequently employed [
6,
16,
17]. The Hampel filter, which flags points that deviate from the local median by more than three robust standard deviations (estimated from the median absolute deviation), provides an alternative approach [
18]. While these methods incorporate temporal context through sliding windows, they apply identical thresholds across all hydrological conditions without accounting for the complex temporal dependencies inherent in catchment responses and rely solely on raw measurements rather than engineered features that could provide additional context. Physical threshold methods offer domain-specific alternatives. Belyakova, Moreido [
19] defined erroneous outliers through physically impossible changes, preserving legitimate rapid flood rises that statistical methods might flag. Zhou, Qiao [
20] achieved a detection accuracy of 88.3% using hydraulic correlation principles for open-channel systems.
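To make the contrast between a global threshold and a local, robust threshold concrete, the following minimal sketch implements a global 3-sigma rule and a sliding-window Hampel filter in dependency-free Python. The function names, the window half-width, the toy water levels, and the scale factor 1.4826 (the usual constant relating the median absolute deviation to the standard deviation under normality) are illustrative assumptions, not the configuration used in this study.

```python
from statistics import mean, median, pstdev

def three_sigma_flags(values, k=3.0):
    """Flag points more than k global standard deviations from the global mean."""
    mu, sigma = mean(values), pstdev(values)
    return [sigma > 0 and abs(v - mu) > k * sigma for v in values]

def hampel_flags(values, half_window=3, k=3.0):
    """Flag points deviating more than k robust sigmas from the local median.

    The robust sigma is 1.4826 * MAD over a centred window (assumed settings).
    Note: when the MAD is 0 (flat window), any departure from the median is flagged.
    """
    flags = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(x - med) for x in window)
        flags.append(abs(v - med) > k * 1.4826 * mad)
    return flags

# Hypothetical daily water levels (m): an isolated spike at index 5,
# followed by a sustained flood peak starting at index 12.
levels = [1.0, 1.1, 1.0, 1.2, 1.1, 3.0, 1.0, 1.1, 1.2, 1.1, 1.0, 1.2,
          4.0, 4.5, 4.2, 3.8, 3.5]

hampel_hits = [i for i, f in enumerate(hampel_flags(levels)) if f]
sigma_hits = [i for i, f in enumerate(three_sigma_flags(levels)) if f]
print("Hampel:", hampel_hits, "3-sigma:", sigma_hits)
```

In this toy series the flood peak inflates the global standard deviation, so the 3-sigma rule flags nothing, whereas the local Hampel filter isolates the spike and leaves the sustained flood peak untouched. Real records will not separate this cleanly.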
While statistical methods, such as the Hampel filter, demonstrate strong performance, combining them with enhanced machine learning (ML) variants through ensembles provides additional benefits. Isolation Forest (IF) algorithms isolate anomalies through random feature splits [
7]. Local Outlier Factor (LOF) algorithms measure local density deviations [
21]. LSTM autoencoders detect anomalies by monitoring reconstruction error thresholds, thereby capturing complex temporal patterns [
22]. Prediction-based detection represents an emerging paradigm. Zhao, Zhu [
23] developed multiple forecasting models to predict expected values, flagging observations that fell outside confidence intervals. Hybrid strategies combine multiple principles: Halicki and Niedzielski [
24] validated detections by checking upstream station consistency, while Seasonal-Hybrid Extreme Studentised Deviate (SH-ESD) applies seasonal decomposition before outlier detection. Despite the availability of diverse detection methods, systematic frameworks for combining these approaches through ensemble strategies remain unexplored, as does the potential for feature engineering to enhance detection performance.
Furthermore, validation remains a fundamental challenge in unsupervised outlier detection: without ground truth, the effectiveness of different detection methods is difficult to compare. Lacking labelled data (erroneous outlier or normal), researchers follow a wide range of evaluation strategies. Synthetic evaluation is dominant [
7,
25], where researchers inject controlled outliers to evaluate performance. Semi-synthetic approaches preserve temporal structure while enabling evaluation. Hydrological validation, as implemented in [
24], ensures that detected outliers maintain consistency between upstream and downstream locations, providing physics-based verification without the need for ground truth labels. Statistical comparison frameworks (e.g., Friedman tests with Nemenyi post hoc analysis) are used to determine significant differences in performance between algorithms [
7]. Agreement metrics, such as Jaccard similarity, measure the overlap between the outputs of different detectors, revealing complementary detection capabilities.
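The synthetic evaluation idea described above can be sketched as follows: known errors are injected into a real series so that precision, recall, and F1 can be computed against the injected labels. This is a minimal illustration only. The spike model (an additive jump scaled by the series range), the injection fraction, and the function names are assumptions and do not reproduce the injection scheme of any cited study.

```python
import random

def inject_spikes(values, fraction=0.05, magnitude=3.0, seed=0):
    """Return a corrupted copy of values and 0/1 labels marking injected points."""
    rng = random.Random(seed)
    n_inject = max(1, round(fraction * len(values)))
    positions = set(rng.sample(range(len(values)), n_inject))
    spread = (max(values) - min(values)) or 1.0  # avoid zero-sized spikes
    corrupted = [
        v + magnitude * spread * rng.choice((-1, 1)) if i in positions else v
        for i, v in enumerate(values)
    ]
    labels = [int(i in positions) for i in range(len(values))]
    return corrupted, labels

def precision_recall_f1(labels, flags):
    """Score detector flags against the injected ground-truth labels."""
    tp = sum(1 for y, f in zip(labels, flags) if y and f)
    fp = sum(1 for y, f in zip(labels, flags) if not y and f)
    fn = sum(1 for y, f in zip(labels, flags) if y and not f)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical series of 100 observations with 5% injected spikes.
series = [1.0 + 0.1 * (i % 5) for i in range(100)]
corrupted, labels = inject_spikes(series, fraction=0.05, seed=1)
```

Because the original observations are left in place and only the injected points are labelled, any genuine errors already present in the record count as normal, which is one reason scores from this kind of evaluation should be read as relative rather than absolute.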
To understand the current state of the field, we reviewed studies that have addressed erroneous outlier detection in the context of hydrology.
Table 1 illustrates an underlying split in the research on hydrological time series error detection. Dedicated erroneous outlier detection studies show distinct characteristics: five of six implement validation frameworks, ranging from synthetic data injection to hydrological consistency checks. van de Wiel, van Es [
7] developed multi-scale temporal windows, while Halicki and Niedzielski [
24] incorporated upstream gauge relationships.
Analysis of these studies reveals three important gaps in current practice:
Recent flood forecasting studies apply generic outlier detection methods without validation or adaptation to hydrological characteristics;
Existing methods rely on raw data or limited features, failing to capture the temporal dependencies and physical processes inherent in hydrological dynamics;
No systematic framework exists for selecting and combining detection methods based on empirical performance.
This paper addresses these gaps through a proposed framework for detecting erroneous outliers in hydrological time series at individual monitoring stations using univariate methods for unlabelled data. We make three significant contributions:
Feature engineering that enhances detection algorithms with temporal and hydrological context.
Systematic evaluation of 13 detection algorithms using semi-synthetic validation across 30 dataset configurations.
Data-driven ensemble strategies that automatically select optimal algorithm combinations.
Our approach compares vanilla implementations using only raw data against enhanced variants utilising engineered features, with validation enabled through synthetic outlier injection. While we cannot validate preservation of all legitimate extreme events without ground truth, incorporating hydrological process features represents progress toward reducing false positive detection of legitimate floods.
The remainder of this paper is organised as follows.
Section 2 presents our methodology, including feature engineering, synthetic data injection, detection algorithms, and ensemble selection strategies.
Section 3 describes the experimental setup, results and discussion.
Section 4 concludes with key findings and future research directions.
3. Experimental Setup, Results, and Discussion
This section presents the experimental setup and the results of applying our detection framework.
3.1. Study Site and Dataset Description
This study utilised daily data from Bureau of Meteorology (BoM), Australia, monitoring stations in the Tweed River catchment, northeastern New South Wales (NSW). The Tweed catchment, spanning 1326 square kilometres, is the northernmost coastal catchment in NSW, bounded by the McPherson Range on the NSW–Queensland border, the Burringbar and Condong Ranges to the southeast, and the Tweed Range to the west. Its highest point is Mount Warning at 1156 metres. The catchment experiences a subtropical climate with hot, humid summers and cool to mild winters, receiving a mean annual rainfall of 1510 mm concentrated between December and March. Average summer maximum temperatures range from 27–28 °C at the coast to 28–29.5 °C inland at Murwillumbah [
55,
56].
The dataset included both water level (water course level, WCL) and discharge (water course discharge, WCD) measurements, expressed in metres and cubic metres per second, respectively. The discharge values were derived from water levels using rating curves rather than measured directly by sensors. To capture varying hydrological behaviours, three temporal aggregations were used: daily maximum (DMax), daily mean (DMean), and daily minimum (DMin), all of which are available from the BoM. This led to 30 distinct dataset configurations (5 stations × 2 measurement types × 3 aggregations). Each configuration contained between 5890 and 24,933 daily observations after quality filtering, with null values ranging from 0 to 11,547 depending on station continuity and sensor reliability.
Figure 2 visualises the temporal coverage and data completeness for the DMean aggregation across all stations, representing 30 dataset configurations, with all station data ending 10 July 2025.
Figure 3 illustrates the spatial distribution of the five monitoring stations across the Tweed catchment for DMean for both water level and discharge.
All analyses were conducted on a Windows 11 (Build 26100) system with an Intel Core i7-1065G7 processor (1.30 GHz, four cores, eight logical processors) and 16 GB of RAM. The computational environment comprised Python 3.11.8 within VS Code, utilising scikit-learn for ML algorithms, pandas for data manipulation, numpy for numerical computations, and matplotlib for visualisation. The semi-synthetic outlier injection and detection algorithms were implemented using custom Python scripts, with parallel processing employed where applicable to manage the computational demands across all 30 dataset configurations. The following section presents the results of feature selection analysis across these configurations.
3.2. Feature Selection Results
We reduced the 19 engineered features to a final set of six features through a systematic correlation-based selection process designed to minimise multicollinearity while ensuring generalizability across diverse hydrological conditions. For each of the 30 dataset configurations, we computed pairwise Pearson correlation coefficients [
54] between all features and identified highly correlated pairs (|r| > 0.70). For each correlated pair, we calculated the average absolute correlation of each feature with all other features and removed the feature exhibiting higher average correlation (greater global redundancy), retaining the feature with lower average correlation (more unique information). For example, among the lag features (lag_1, lag_7, lag_14, and lag_30) with high inter-correlations (r = 0.65–0.94), lag_1 and lag_7 were removed due to higher average correlations with other features, while lag_14 and lag_30 were retained as they provided complementary temporal scales with lower redundancy. Following this correlation filtering across all configurations, we retained features appearing in ≥60% of configurations to ensure robust performance across different stations, data types, and temporal aggregations. Detailed methodology, including complete correlation matrices and selection pseudocode, is provided in
Supplementary Materials S2.
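The selection procedure described above can be sketched in a few lines of dependency-free Python. This is an illustrative reading, not the pseudocode of Supplementary Materials S2. In particular, processing the most strongly correlated pair first and recomputing the averages after each removal are assumptions, as are the function names and the toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def prune_correlated(features, threshold=0.70):
    """Drop the more globally redundant feature of each pair with |r| > threshold.

    features maps a feature name to its list of values; returns the kept names.
    """
    kept = list(features)
    while True:
        corr = {(a, b): abs(pearson(features[a], features[b]))
                for i, a in enumerate(kept) for b in kept[i + 1:]}
        high = {pair: r for pair, r in corr.items() if r > threshold}
        if not high:
            return kept
        a, b = max(high, key=high.get)  # assumed order: strongest pair first

        def avg_corr(name):
            rs = [r for pair, r in corr.items() if name in pair]
            return sum(rs) / len(rs)

        kept.remove(a if avg_corr(a) >= avg_corr(b) else b)

def stable_features(selected_per_config, min_share=0.60):
    """Keep features selected in at least min_share of the configurations."""
    n = len(selected_per_config)
    counts = {}
    for kept in selected_per_config:
        for name in kept:
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, c in counts.items() if c / n >= min_share)

# Toy data: "a" and "b" are nearly collinear, "c" is close to independent.
toy = {"a": [1, 2, 3, 4, 5, 6],
       "b": [2, 4, 6, 8, 10, 12.5],
       "c": [5, 1, 4, 2, 6, 3]}
kept = prune_correlated(toy)
```

In the toy example, "a" is removed because its average absolute correlation with the remaining features is slightly higher than that of "b". Applying `prune_correlated` to each configuration and passing the resulting lists to `stable_features` mirrors the ≥60% retention rule.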
Table 2 presents the selection frequency results, revealing six features with consistently high importance: (i) value (100%), (ii) rolling_min_7d (100%), (iii) rising_limb (97%), (iv) lag_30 (86%), (v) lag_14 (83%), and (vi) value_diff_pct (63%). The universal selection of value and rolling_min_7d establishes them as fundamental indicators, with rolling_min_7d providing context for identifying anomalous deviations from recent baseflow conditions. The high frequency of rising_limb (97%) demonstrates the importance of hydrological state transitions for contextual outlier detection, as erroneous values often violate natural recession or rising patterns. Temporal lag features (lag_14 and lag_30, selected in 83–86% of configurations) encapsulate catchment memory effects, enabling detection of values inconsistent with antecedent conditions. The moderate selection frequency of value_diff_pct (63%) reflects its utility in detecting abrupt sensor malfunctions that manifest as unrealistic rate-of-change values, though its importance varies with flow regime stability.
This feature set extends previous multi-scale approaches [
7] by explicitly incorporating temporal lags rather than aggregated window statistics, providing more precise temporal context for outlier evaluation. Complete correlation matrices, feature importance scores, and detailed selection methodology are provided in
Supplementary Materials S2.
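For orientation, the six retained features could be computed from a gap-free daily series roughly as follows. The exact feature definitions are not restated in this section, so every definition in this sketch is an assumption: `rolling_min_7d` is taken as the minimum over the current and six preceding days, `rising_limb` as an indicator that the value exceeds the previous day's value, `lag_k` as the value k days earlier, and `value_diff_pct` as the percentage change from the previous day.

```python
def engineer_features(values):
    """Build one feature row per day from a gap-free daily series (assumed definitions)."""
    rows = []
    for t, v in enumerate(values):
        prev = values[t - 1] if t >= 1 else None
        rows.append({
            "value": v,
            # minimum over the current day and up to six preceding days
            "rolling_min_7d": min(values[max(0, t - 6): t + 1]),
            # 1 when today's value exceeds yesterday's, else 0
            "rising_limb": int(prev is not None and v > prev),
            # lagged values are undefined until enough history exists
            "lag_14": values[t - 14] if t >= 14 else None,
            "lag_30": values[t - 30] if t >= 30 else None,
            # percentage change; None when there is no (or a zero) previous value
            "value_diff_pct": (100.0 * (v - prev) / prev) if prev else None,
        })
    return rows

# Hypothetical steadily rising series of 40 days.
rows = engineer_features([float(i + 1) for i in range(40)])
print(rows[35])
```

The `None` entries show why lag and rolling features need a continuous history: any gap longer than the lag leaves the corresponding feature undefined.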
3.3. Erroneous Outlier Detection Performance
3.3.1. Individual Method Performance
Following feature selection, the 16 detection methods were tested on all configurations of the validated datasets using the semi-synthetic validation framework.
Figure 4 shows the ROC curves for two representative configurations: Station 201012 water level DMean (
Figure 4a) and Station 201005 discharge DMax (
Figure 4b), which illustrate the performance variability under varying hydrological conditions.
Configuration 201012 water level DMean showed good performance for the majority of methods. The ensemble methods yielded consistently high AUC values, with the Diverse Ensemble (0.972), Fast Ensemble (0.968), and Accurate Ensemble (0.962) all exceeding 0.96, demonstrating the effectiveness of combining multiple detection approaches. Among individual methods, the best results were 0.984 (e_ORELM) and 0.976 (e_IF), although their performance varied between configurations. The Hampel filter (0.890) and SH-ESD (0.973) maintained steady performance, supporting their reliability as statistical approaches. The enhanced variants outperformed the vanilla ones, as shown by e_DWT (0.923) versus v_DWT (0.583) and e_LSTM (0.753) versus v_LSTM (0.533).
Configuration 201005 discharge DMax showed lower overall performance across methods. The Diverse Ensemble (0.895) and Accurate Ensemble (0.879) held their performance levels, while the Fast Ensemble dropped to 0.657. Individual methods displayed more variation, with the Hampel filter retaining 0.874. The enhanced versions improved on the vanilla versions in most cases (e_IF 0.674 versus v_IF 0.527; e_ORELM 0.707 versus v_ORELM 0.574). The exception was e_LOF (0.615), which performed below v_LOF (0.710), indicating that feature engineering effects vary by algorithm type. Despite this variability among individual methods, the Diverse and Accurate Ensembles retained higher AUC values through their voting mechanisms, which integrate different detection strategies to ensure more consistent performance.
To understand the complementary nature of these detection methods, we examined the patterns of agreement between the algorithm results using an algorithm agreement matrix and Jaccard similarity scores. The correlation matrices of the algorithm agreement (
Figure 5) revealed distinct patterns for different dataset configurations. For Station 201012 water level DMean (
Figure 5a), the methods were characterised by moderate to high correlations (0.4–0.9), with v_ESD and SH-ESD exhibiting a correlation of 0.908, indicating a common statistical basis among the methods. The ensemble methods showed high intercorrelations (>0.88), whereas v_LOF exhibited a low correlation with the other techniques (<0.1). For Station 201005 discharge DMax (
Figure 5b), the correlation patterns differed, with a higher correlation observed between v_IF and v_ESD (0.966), while v_LOF showed a low correlation with most methods. The ensemble approaches still exhibited moderate correlations (0.3–0.7) with individual methods, although lower than those for the water level configuration. These differing correlation patterns between configurations support the ensemble approach, as they show that methods recognise different subsets of erroneous outliers, even where the overall predictions are highly correlated.
The Jaccard similarity indices (
Figure 6) revealed low to moderate outlier overlap between methods despite their correlation patterns. For Station 201012 water level DMean (
Figure 6a), v_ESD and SH-ESD showed the highest similarity (0.834), whilst the Accurate and Diverse Ensembles demonstrated high overlap (0.903). Most method pairs achieved similarities below 0.5, with v_LOF showing minimal overlap with all methods (<0.04). In Station 201005 discharge DMax (
Figure 6b), v_IF and v_ESD exhibited high similarity (0.940), yet overall similarities remained lower than the first configuration. These low Jaccard values, contrasting with the higher algorithm agreement correlation coefficients, indicate that whilst methods may agree on normal behaviour classification, they identify different erroneous outlier subsets. This complementary detection pattern validates the ensemble strategy, which combines methods that identify different erroneous outlier subsets rather than redundantly detecting the same erroneous outliers.
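The gap between high overall agreement and low Jaccard similarity follows from class imbalance, which a small example makes explicit. In this sketch the agreement measure is a simple matching rate over all observations, used as a stand-in for the correlation-based agreement matrix. The two detectors and their flagged indices are hypothetical.

```python
def jaccard(flags_a, flags_b):
    """Jaccard similarity of the sets of observations flagged by two detectors."""
    a = {i for i, f in enumerate(flags_a) if f}
    b = {i for i, f in enumerate(flags_b) if f}
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def agreement(flags_a, flags_b):
    """Share of observations on which two detectors give the same label."""
    return sum(bool(x) == bool(y) for x, y in zip(flags_a, flags_b)) / len(flags_a)

# Two hypothetical detectors applied to 100 observations.
n = 100
det_a = [i in {3, 10, 50} for i in range(n)]
det_b = [i in {10, 50, 77, 90} for i in range(n)]

print(jaccard(det_a, det_b))    # overlap among flagged points only
print(agreement(det_a, det_b))  # dominated by the many points both call normal
```

Here the detectors agree on 97 of 100 observations, yet share only two of the five points that either of them flags (Jaccard similarity 0.4), because agreement is dominated by the large majority of observations that both label as normal.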
Visual comparison of detection outcomes demonstrates how engineered features improve the separability of erroneous measurements from natural variability (
Figure 7). At Station 201001, during April–June 2012, v_IF identified 21 outliers, including numerous false positives during natural variability episodes, whilst feature enhancement reduced detections to eight with substantially improved precision, a 62% reduction in detections that removed false positives whilst maintaining true outlier detection (
Figure 7a). DWT exhibited similar improvement: the vanilla version flagged 12 outliers predominantly during recession periods, incorrectly interpreting exponential decay as anomalous, whilst the enhanced version eliminated these by incorporating temporal context through features characterising normal recession behaviour (
Figure 7b). LOF reduced detections from six outliers to one, with enhanced specificity (
Figure 7c). SH-ESD maintained the same detection count as its vanilla counterpart (three outliers) (
Figure 7d). The consistent reduction in false positives across methods whilst maintaining true outlier identification (purple stars indicating ensemble consensus) validates that explicit representation of hydrological processes enables algorithms to identify the erroneous outliers.
3.3.2. Cross-Dataset Performance Analysis
The performance heatmap (
Figure 8) provided evidence of method behaviour across all 30 dataset configurations. The three ensemble methods maintained consistently high F1-scores above 0.7 in most dataset configurations, with the Accurate Ensemble showing scores between 0.69 and 0.91, the Diverse Ensemble between 0.66 and 0.94, and the Fast Ensemble between 0.42 and 0.90. This consistency validated the ensemble approach for operational deployment.
Individual method performance varied across dataset configurations. The Hampel filter maintained stable performance with F1-scores ranging from 0.67 to 0.91, indicating its consistency across different hydrological conditions. SH-ESD showed similar stability (0.62 to 0.78), though with a narrower performance range. Among enhanced variants, e_IF showed moderate performance (0.54 to 0.83), while e_LOF demonstrated consistent but modest scores (0.52 to 0.68). The e_LSTM showed the lowest performance among enhanced methods (0.44 to 0.53), particularly struggling with discharge measurements. Vanilla methods exhibited limited performance. v_IF achieved F1-scores between 0.40 and 0.44, showing minimal variation but consistently poor detection. v_LOF performed similarly poorly (0.35 to 0.62). v_LSTM showed limited performance (0.35 to 0.42). v_ORELM maintained moderate consistency (0.49–0.66).
Station-specific patterns emerged from the heatmap. Station 201005 generally showed poorer performance across all methods, suggesting more challenging hydrological characteristics. Station 201012 consistently produced better results, especially for water levels. Discharge measurements consistently yielded lower F1-scores than water level measurements across all stations and methods. This systematic underperformance is likely due to the additional uncertainty associated with rating curve conversions. Since discharge values are derived from water level measurements rather than measured directly, they inherit the original measurement errors while also incurring additional errors from the stage-discharge relation, especially during extreme events when the rating curves may be less reliable.
The box plot distributions (
Figure 9) further illustrated performance patterns across data configurations. For precision metrics, the ensemble methods exhibited low distribution width and high median values, suggesting that their performance is similar across stations. The Accurate and Diverse Ensembles achieved a precision of more than 0.6 in most cases, with higher performance for water level DMax configurations. Individual methods had wider distributions and lower medians. Enhanced variants generally performed better than their vanilla counterparts, with e_IF and e_LOF having higher precision than v_IF and v_LOF, respectively.
Recall patterns differed from the precision results. Ensemble methods consistently had balanced recall scores, ranging from 0.6 to 0.8. The Fast Ensemble had a slightly lower recall than the Accurate and Diverse Ensembles, especially for discharge measurements. Among individual methods, the Hampel filter showed satisfactory recall for water level measurements and decreased performance for discharge data. Vanilla methods showed poor recall across all categories, with v_LSTM and v_ORELM mostly remaining at a mediocre 0.4.
The consistency of box widths gave information about the reliability of the method. Narrow boxes for ensemble methods implied stable performance across the five stations, whereas wide boxes for the vanilla methods were suggestive of high sensitivity to station characteristics. Enhanced methods exhibited intermediate box widths, indicating greater stability than vanilla ones, but not to the same extent as ensemble consistency.
3.3.3. Ensemble Selection and Composition
Analysis of the ensemble composition revealed the efficiency of the ensemble selection mechanism (
Table 3). The Hampel filter achieved 100% selection in both Accurate and Diverse Ensembles, and 62.9% in the Fast Ensemble, demonstrating a good balance between the two criteria. SH-ESD showed similarly high selection rates (91.4% Accurate, 88.6% Diverse), which confirms that methods dealing explicitly with seasonality were important in detecting erroneous outliers in hydrological time series.
Enhanced variants dominated the ensemble compositions (60–80% of the ensembles) across the three strategies. e_IF was selected in 74.3% of Accurate Ensembles but only 8.6% of Diverse Ensembles, suggesting high individual performance but low diversity, i.e., its detections largely overlapped with those of other selected methods. e_LOF showed the opposite pattern, with 74.3% selection in Diverse Ensembles despite modest individual performance, suggesting it captured complementary outlier patterns. e_DWT was frequently selected in the Fast Ensemble (80% selection) due to its computational efficiency. Vanilla methods rarely appeared in ensembles. v_ESD achieved 25.7% selection in Accurate and Fast Ensembles but never appeared in the Diverse Ensemble. Other vanilla variants (v_LOF, v_IF) showed minimal selection rates below 6%, confirming that feature engineering contributed to improved performance.
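Once the members of an ensemble have been selected, their binary outputs must be combined. The sketch below shows a simple majority vote as one plausible combination rule. The exact voting rule and any member weighting used by the three ensemble strategies are not restated in this section, so the rule, the function name, and the toy flags are assumptions.

```python
def majority_vote(member_flags, min_votes=None):
    """Combine per-detector boolean flags; flag a point when enough members agree.

    member_flags is a list of equal-length flag lists, one per ensemble member.
    By default a strict majority of members is required (assumed rule).
    """
    n_members = len(member_flags)
    if min_votes is None:
        min_votes = n_members // 2 + 1
    return [sum(votes) >= min_votes for votes in zip(*member_flags)]

# Three hypothetical members: one over-sensitive, one under-sensitive, one moderate.
over_sensitive = [True, True, False, True, True, False]
under_sensitive = [False, True, False, False, False, False]
moderate = [False, True, False, True, False, False]

ensemble = majority_vote([over_sensitive, under_sensitive, moderate])
print(ensemble)
```

In this toy case the vote keeps the two points supported by at least two members and discards the detections made only by the over-sensitive member, which illustrates how voting can moderate both extremes.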
3.3.4. Statistical Validation
Statistical analysis with the Friedman test demonstrated that the 16 methods were significantly different (
p < 0.001). Using critical difference analysis (CD = 4.211 at a level of significance of 0.05), four methods were found to be in the top-performing group (Accurate Ensemble, Diverse Ensemble, Hampel, and SH-ESD) (
Figure 10). While no statistically significant differences existed within this top-performing group, the ensemble methods performed best in terms of average ranks (Accurate: 1.65 and Diverse: 1.82), demonstrating substantial consistency in performance across configurations. These rank differences were well within the critical difference, indicating statistical equivalence. The Hampel filter (rank 3.17) and SH-ESD (rank 4.70) also joined the first tier, despite being individual methods. The Fast Ensemble (rank 6.43) fell outside the statistically best group, confirming a performance trade-off. ML variants without feature engineering ranked poorly: all vanilla ML methods (v_LSTM, v_DWT, v_LOF, v_IF) fell in the bottom quartile (ranks > 13), suggesting that ML algorithms without appropriate adaptation were limited in their effectiveness in detecting erroneous outliers in hydrological time series.
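The reported critical difference can be checked with the standard Nemenyi formula, CD = q_α · sqrt(k(k + 1) / (6N)), for k = 16 methods and N = 30 dataset configurations. The sketch below assumes the tabulated critical value q_0.05 ≈ 3.426 for k = 16. That constant is taken from standard Studentised-range-based tables rather than from this paper, and the helper names are hypothetical.

```python
from math import sqrt

def nemenyi_cd(n_methods, n_datasets, q_alpha):
    """Critical difference in average rank for the Nemenyi post hoc test."""
    return q_alpha * sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

def statistically_equivalent(rank_a, rank_b, cd):
    """Two methods are not significantly different if their ranks differ by less than CD."""
    return abs(rank_a - rank_b) < cd

Q_ALPHA_K16 = 3.426  # assumed tabulated value for k = 16 at alpha = 0.05

cd = nemenyi_cd(n_methods=16, n_datasets=30, q_alpha=Q_ALPHA_K16)
print(round(cd, 3))

# Average ranks as reported in the text.
best = 1.65  # Accurate Ensemble
print(statistically_equivalent(best, 4.70, cd))  # SH-ESD
print(statistically_equivalent(best, 6.43, cd))  # Fast Ensemble
```

Under this assumption the computed value agrees with the reported CD of 4.211. SH-ESD (rank difference 3.05) falls within one critical difference of the best-ranked method, while the Fast Ensemble (rank difference 4.78) does not, which matches the grouping described above.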
3.3.5. Operational Detection Rate
Analysis of erroneous outlier detection rates on unlabelled operational data (
Figure 11) revealed the methods’ practical behaviour. Detection rates ranged from near-zero to over 40%, with distinct patterns across methods and data types. Statistical methods demonstrated moderate detection rates, with the Hampel filter consistently identifying 5–8% of observations as outliers. Vanilla ML methods showed high variability. Vanilla LOF and IF detected fewer than 2% outliers in most dataset configurations, suggesting these methods may miss erroneous outliers that other methods identify. v_LSTM exhibited erratic behaviour with detection rates spanning 0–35% depending on the station. Enhanced variants demonstrated more calibrated behaviour, with e_LOF and e_IF maintaining detection rates between 5% and 15%. The three ensemble methods converged on consistent detection rates of 6–10% across all dataset configurations. This convergence supported the consistency of the proposed ensemble approach, as the voting mechanism effectively moderated both over-sensitive and under-sensitive base detectors. Slightly higher detection rates in discharge measurements compared to water level (median 8.2% vs. 6.4%) likely reflected the inherently noisier nature of discharge data.
3.4. Overall Discussion
The evaluation across 30 dataset configurations reveals distinct relationships between data quality characteristics and detection performance, offering insights into the practical deployment of erroneous outlier detection methods in operational hydrological analysis.
The performance heatmap (
Figure 8) reveals distinct relationships between data quality and detection effectiveness across the five monitoring stations. Station 201012 consistently achieved F1-scores above 0.8 across most methods and measurement types, corresponding with its 99.6% data completeness and minimal temporal gaps (
Figure 2). Conversely, Station 201005 exhibited systematically lower performance, with many methods achieving F1-scores below 0.6, particularly evident in the orange-red regions of the heatmap. This station’s fragmented data record (53.6% completeness for water level, 68.5% for discharge) with substantial missing periods directly impacts the framework’s ability to calculate reliable temporal features. The lag features (lag_14, lag_30) and rolling statistics, which proved essential for detection, require consistent historical records.
This explains why methods dependent on temporal context underperformed at stations with incomplete data histories, consistent with established findings that RNNs and sequence-dependent algorithms exhibit performance degradation when applied to irregularly sampled or incomplete time series [
22,
49]. Quantitatively, temporal methods (LSTM, DWT) at Station 201005 achieved F1-scores of 0.58–0.62 compared to density-based approaches (IF, LOF) achieving 0.75–0.78, demonstrating a performance gap directly attributable to the station’s fragmented temporal record. Density-based methods maintained robust performance by operating on local neighbourhoods rather than extended sequences, which is why ensemble strategies combining diverse algorithmic foundations achieved F1 > 0.70 even at this challenging station.
The performance differential by measurement type appears consistently across all stations and methods in the heatmap (
Figure 8), with discharge measurements (left panel) showing more orange-red cells compared to water level measurements (right panel). This pattern is reflected in the semi-synthetic validation, where detection methods achieved higher AUC values for water level configurations. However, the operational detection rates (
Figure 11) reveal an interesting contrast: discharge measurements showed higher median outlier detection rates (8.2%) compared to water level (6.4%). This apparent contradiction is consistent with the framework’s sensitivity to inherent differences in data quality. Discharge values, derived through rating curves rather than direct measurement, contain compound uncertainties from water level measurement errors, hydraulic model assumptions, and curve extrapolation during extreme events. These systematic uncertainties manifest as increased variability, which detection algorithms identify as anomalous patterns, resulting in higher operational detection rates despite lower performance in controlled, semi-synthetic evaluations.
The enhanced variants’ systematic improvement over vanilla implementations demonstrates that temporal contextualisation addresses fundamental challenges in hydrological outlier detection. The box plot distributions (
Figure 8) show enhanced methods achieving narrower performance ranges and higher medians across all measurement categories. This improvement stems from feature engineering that encodes knowledge of hydrological processes, enabling algorithms to better distinguish between sensor failures and legitimate extreme events. The consistent selection of rising_limb indicators and temporal lag features across 83–97% of configurations (
Table 2) confirms that explicit representation of catchment response patterns enhances detection capabilities beyond raw measurement analysis.
The observed performance patterns fundamentally reflect interactions between algorithmic detection principles and underlying hydrological process characteristics rather than algorithmic superiority alone. Statistical methods (Hampel and SH-ESD) perform optimally during stable baseflow-dominated periods where local deviations indicate sensor anomalies, but struggle during non-stationary flood conditions where legitimate responses exceed historical thresholds, as reflected in their lower performance at Station 201005 relative to the stable Station 201012. Temporal methods (LSTM and DWT) excel in predictable flow regimes where consistent patterns enable reliable sequence learning. Density-based methods (IF and LOF) demonstrate robustness across diverse conditions by avoiding temporal stationarity assumptions, with local neighbourhood approaches adapting naturally to varying flow regimes. The low-to-moderate Jaccard similarity indices between method pairs (<0.5) demonstrate valuable complementarity rather than inconsistency: different algorithms detect different error manifestations shaped by flow context. Statistical methods identify baseflow sensor drift, density-based methods detect rising limb spikes during transitional periods, and temporal methods capture recession violations. This algorithmic diversity enables ensemble strategies to capture errors across the full flow spectrum. The systematic performance differential between discharge and water level measurements stems from hydrological derivation: discharge values inherit compounded uncertainties from water level error propagation, rating curve model deviations during extremes, and extrapolation beyond the calibration range, which explains the higher operational detection rates (8.2% vs. 6.4%) despite lower F1-scores.
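The Jaccard similarity used above to assess complementarity is the size of the intersection of two methods' flagged sets divided by the size of their union. The following Python sketch shows the calculation on hypothetical flagged timestep indices; the index sets are invented for illustration and are not results from the study.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of flagged indices."""
    if not a and not b:
        return 1.0  # convention: two empty detections agree perfectly
    return len(a & b) / len(a | b)


# Hypothetical flagged timestep indices per method (illustrative only).
flags = {
    "Hampel": {3, 4, 10, 25},
    "e_IF": {10, 25, 40, 41},
    "LSTM": {25, 60, 61, 62},
}

pairwise = {
    (m1, m2): jaccard(flags[m1], flags[m2])
    for m1, m2 in combinations(flags, 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 3))

# Timesteps flagged by at least one method: low pairwise overlap
# means the union is much larger than any single method's detections.
union_flags = set().union(*flags.values())
print(len(union_flags), "timesteps flagged by at least one method")
```

Low pairwise values in such a comparison mean that each method contributes detections the others miss, which is the property the ensemble strategies exploit.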
The ensemble strategies successfully addressed different operational requirements whilst maintaining performance advantages over individual methods. The three ensemble approaches demonstrated consistent behaviour across the diverse conditions represented in the 30 dataset configurations, with the Accurate and Diverse Ensembles maintaining F1-scores above 0.7 in most cases. The ensemble composition analysis revealed that successful combinations consistently incorporated robust statistical foundations (Hampel filter, SH-ESD) alongside enhanced ML variants, indicating that effective detection requires hybrid approaches rather than relying solely on advanced algorithms. The convergence of ensemble detection rates to expected contamination levels across diverse hydrological conditions provides evidence for operational reliability. However, adaptation to different catchment characteristics and climatic conditions requires further investigation.
Further, to quantitatively demonstrate the advantages of adaptive ensemble selection over conventional fixed combination approaches, we compared our adaptive ensembles against representative static baselines where method composition remains fixed regardless of dataset characteristics. A Fixed-Top3 baseline (always combining Hampel + SH-ESD + e_IF, the three methods with the highest average rank) achieved a mean F1-score of 0.78 ± 0.11 across the 30 configurations, whilst our Accurate Ensemble achieved 0.82 ± 0.07, representing a 5.1% improvement in mean performance and a 36% reduction in spread (the ± value, from 0.11 to 0.07). This improved consistency reflects adaptive selection’s capacity to exclude methods poorly suited to specific conditions: for example, at Station 201005 with fragmented temporal records (53.6% completeness), the Diverse Ensemble excluded sequence-dependent methods (LSTM and DWT) that struggled with temporal gaps, instead selecting density-based methods (LOF and IF) that maintained robustness, achieving F1 = 0.72 compared to Fixed-Top3’s 0.63 (14.3% improvement). Conversely, at Station 201012 with high completeness (99.6%), adaptive ensembles matched static performance (0.88 vs. 0.87), indicating minimal penalty for adaptability when conditions are favourable. Similarly, a fixed statistical baseline (Hampel + SH-ESD + v_ESD) optimised for computational efficiency achieved F1 = 0.71 ± 0.13, whilst our Fast Ensemble achieved 0.74 ± 0.14, a higher mean F1-score at comparable processing speed. These quantitative comparisons demonstrate that adaptive selection provides measurable benefits over conventional fixed combinations through station-specific composition tailoring: adjusting to data quality characteristics (completeness and temporal gaps), measurement types (direct vs. derived), and operational constraints, rather than applying universal method combinations that inevitably encounter conditions misaligned with their fixed assumptions.
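The relative changes quoted in this comparison can be reproduced from the reported means and ± values. The short Python check below assumes that the ± values denote standard deviations; under that assumption, the 36% figure corresponds to the reduction in standard deviation, and the equivalent reduction in variance (its square) is shown alongside for reference.

```python
def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old` (0.05 means +5%)."""
    return (new - old) / old


# Reported values: Fixed-Top3 0.78 ± 0.11, Accurate Ensemble 0.82 ± 0.07.
mean_gain = relative_change(0.82, 0.78)
spread_change = relative_change(0.07, 0.11)          # change in the ± value
variance_change = relative_change(0.07**2, 0.11**2)  # change in its square

# Station 201005: Diverse Ensemble 0.72 vs. Fixed-Top3 0.63.
station_gain = relative_change(0.72, 0.63)

print(f"mean F1 gain:          {mean_gain:+.1%}")
print(f"change in ± value:     {spread_change:+.1%}")
print(f"change in variance:    {variance_change:+.1%}")
print(f"Station 201005 gain:   {station_gain:+.1%}")
```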
3.5. Limitations and Future Directions
This study has several limitations that suggest directions for future research. First, the validation framework is based on semi-synthetic outlier injection instead of manually labelled ground truth data. While this approach enables systematic evaluation, the 5% contamination rate and type distribution (50% point, 30% contextual, and 20% collective) represent simplified assumptions. Real-world error patterns exhibit station-specific variability influenced by local maintenance practices, sensor age, and environmental conditions. Additionally, the clear temporal separation imposed during synthetic injection (minimum three timesteps) may not fully capture scenarios where multiple failure modes co-occur, such as sensor fouling coinciding with flood events when accurate data is most critical.
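The injection design described above (5% contamination, a 50/30/20 split between point, contextual, and collective outliers, and a minimum separation of three timesteps) can be sketched as a position-planning step. The Python code below is a hypothetical illustration: the function name plan_injection, the use of a single anchor index per outlier, and the interpretation of the separation as an index difference of at least three are assumptions, and the perturbations applied at each position are deliberately left out.

```python
import numpy as np


def plan_injection(n, contamination=0.05,
                   shares=(("point", 0.5), ("contextual", 0.3), ("collective", 0.2)),
                   min_gap=3, seed=0):
    """Choose anchor positions and outlier types for semi-synthetic injection.

    Assumption: every outlier is represented by one anchor index, and
    consecutive anchors differ by at least `min_gap` timesteps.
    """
    rng = np.random.default_rng(seed)
    k = int(round(n * contamination))
    if k < 1:
        raise ValueError("series too short for the requested contamination")
    # Sample k distinct slots from a shortened range, then stretch them so
    # that neighbouring anchors end up at least `min_gap` apart.
    m = n - (k - 1) * (min_gap - 1)
    if m < k:
        raise ValueError("cannot satisfy the minimum separation")
    base = np.sort(rng.choice(m, size=k, replace=False))
    positions = base + np.arange(k) * (min_gap - 1)

    # Allocate types by share; the last type absorbs any rounding remainder.
    labels = []
    for name, share in shares[:-1]:
        labels += [name] * int(round(k * share))
    labels += [shares[-1][0]] * (k - len(labels))
    labels = rng.permutation(labels)
    return positions, labels


positions, labels = plan_injection(1000)
print(len(positions), "anchors, minimum spacing", int(np.diff(positions).min()))
print({name: int((labels == name).sum()) for name in ("point", "contextual", "collective")})
```

Because the spacing is enforced by construction, no two injected anchors can coincide or overlap, which is precisely the simplification that excludes co-occurring failure modes from the evaluation.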
Future work should develop benchmark datasets with manually labelled outliers to support supervised learning and to validate the preservation of extreme events, which cannot be confirmed without ground truth labels. Second, the evaluation is limited to a regional study of the Tweed River catchment in subtropical Australia. Climatic regions with distinct hydrological regimes may require different feature sets and detection thresholds. Extending the framework across a range of climatic regions, from snowmelt-dominated to monsoon climates, would be important for establishing its generalisability. Third, the framework operates at a daily temporal resolution, which is appropriate for strategic hydrological analysis such as flood forecasting applications; however, it may miss outlier patterns at sub-daily frequencies, which are critical for flash flood systems. Extending the framework to hourly or sub-hourly resolutions would support urban drainage and flash flood applications. In addition, the feature engineering approach requires relatively complete historical records; calculating lag and antecedent condition features is particularly challenging for stations with considerable periods of missing data.
Future improvements may include meteorological variables, specifically rainfall data, which could improve detection during storm events. Integrating spatial features from neighbouring stations could exploit catchment connectivity, whereas graph neural networks could be employed to capture complex spatio-temporal dependencies across monitoring networks. By bridging the gap between basic detection practice and the need for validated procedures specific to hydrological characteristics, this framework provides a basis for strengthening data quality control in flood forecasting systems.
4. Conclusions
This study addressed the gap in systematic outlier detection for hydrological time series applications by developing a comprehensive framework tailored to the unique characteristics of hydrological data. Through evaluation across 30 dataset configurations spanning five monitoring stations with diverse data quality conditions, we demonstrated that feature-enhanced detection substantially improves the data quality of hydrological time series. Our contributions include: (i) the development of 19 hydrological features reduced to six core indicators (value, rolling_min_7d, rising_limb, lag_14, lag_30, and value_diff_pct) through systematic correlation-based selection analysis; (ii) the comprehensive evaluation of 13 detection algorithms showing enhanced variants outperform vanilla implementations by 56% on average through explicit temporal and hydrological contextualisation; and (iii) three data-driven ensemble selection strategies (Accurate, Diverse, and Fast) that balance detection accuracy, algorithmic diversity, and computational speed to address different operational requirements.
Evaluation across diverse hydrological conditions revealed critical relationships between data characteristics and detection effectiveness. Stations with high temporal completeness (e.g., Station 201012: 99.6% data coverage) consistently achieved F1-scores above 0.8 across most methods, whilst stations with fragmented records (e.g., Station 201005: 53.6% completeness) exhibited systematically lower performance (F1 < 0.6), particularly for sequence-dependent algorithms such as LSTM and DWT that require consistent historical records to calculate temporal features effectively. This pattern underscores the fundamental importance of data continuity when employing methods dependent on temporal context. Feature engineering proved essential for distinguishing erroneous outliers from natural extreme events, with enhanced variants achieving narrower performance ranges and higher median F1-scores across all measurement categories. The systematic improvement stems from explicit representation of hydrological processes that enables algorithms to leverage temporal dependencies and physical constraints absent in raw data analysis. Rising/falling limb indicators capture state transitions, lag features encode catchment memory effects, and rolling statistics provide baseflow context. Discharge measurements exhibited systematically higher operational detection rates (8.2%) compared to water level (6.4%), reflecting compound uncertainties from rating curve derivation rather than inferior data quality, validating the framework’s sensitivity to inherent measurement characteristics.
The ensemble strategies successfully balanced competing operational requirements whilst maintaining robust performance across diverse conditions. All three ensembles demonstrated F1-scores above 0.7 in most configurations, with the Accurate and Diverse Ensembles maintaining consistent performance even at stations with challenging data characteristics. Composition analysis revealed that effective detection requires hybrid approaches combining robust statistical foundations (Hampel filter and SH-ESD) with enhanced machine learning variants, rather than relying solely on advanced algorithms. The convergence of operational detection rates to 6–10% across ensemble methods aligns with realistic contamination levels, providing evidence for deployment reliability in real-world monitoring contexts.