1. Introduction
Water scarcity represents a critical global challenge that compromises both the quantitative availability and qualitative suitability of freshwater resources. This dual constraint is intensified by demographic growth, rapid urbanisation, climate change, and increasing anthropogenic pollutant loads [1,2]. When both dimensions are considered simultaneously, the proportion of the global population exposed to severe water stress increases substantially, particularly in high-pressure regions where water-quality degradation further reduces effective supply [3,4]. In semi-arid environments, water scarcity is not determined solely by limited precipitation, but also by the reduced assimilative capacity of fluvial systems, where low-flow conditions amplify the effects of pollutant inputs. Under these conditions, environmental degradation directly affects ecosystem functioning, agricultural productivity, and human health.
This problem is characterised by marked spatial asymmetry, as semi-arid basins often exhibit pronounced upstream–downstream gradients driven by the cumulative interaction of hydrological processes and land-use dynamics [5]. Surface-water degradation is therefore not spatially uniform; rather, it reflects the convergence of point and diffuse pollution sources combined with declining dilution capacity along the fluvial continuum [4,6,7].
The Jipijapa River micro-basin (Manabí, Ecuador) constitutes a representative example of these dynamics. The system is exposed to multiple anthropogenic pressures, including domestic and peri-urban effluents, diffuse agricultural runoff, and localised productive activities, which collectively increase organic load, turbidity, and microbiological contamination while reducing dissolved oxygen concentrations in downstream reaches [8,9,10]. These pressures are spatially structured, with downstream sectors receiving cumulative pollutant loads from upstream settlements and agricultural areas. Consequently, the system displays conditions typical of semi-arid rivers under anthropogenic stress, where water-quality deterioration follows a longitudinal gradient rather than a random spatial distribution [11,12]. In this study, the longitudinal gradient refers to the progressive upstream–downstream variation in water-quality conditions along the river continuum, driven by the cumulative transport of pollutants, changes in land use, reduced dilution capacity, and increasing anthropogenic pressure towards the lower reaches.
In semi-arid basins, the analysis of water-quality dynamics requires approaches capable of capturing spatial heterogeneity and non-linear interactions among hydrological processes, land-use patterns, and anthropogenic pressures. Recent studies have emphasised the importance of integrating physicochemical indicators with biological metrics to obtain a more comprehensive assessment of ecosystem condition [13,14,15].
The concept of spatial asymmetry has gained increasing relevance in this context, as water-quality degradation is rarely uniform across river systems. Instead, it tends to manifest as longitudinal gradients driven by cumulative pollutant inputs, landscape structure, and variations in dilution capacity [16,17]. Along the river continuum, pollutant loads, sediment transport, organic matter accumulation, and microbiological contamination may intensify downstream as the channel receives successive inputs from settlements, agricultural areas, and degraded riparian zones. In semi-arid systems, low-flow conditions further reduce dilution capacity and increase the persistence of contaminants, making upstream–downstream contrasts more pronounced.
Hotspot identification has emerged as a key tool for understanding the spatial organisation of contamination. Previous studies have shown that pollutants tend to cluster in specific river sections where anthropogenic pressures converge, thereby enabling the prioritisation of critical zones for intervention and environmental management [18,19]. Although much of this work has focused on emerging contaminants such as microplastics, the methodological principles of threshold-based screening and spatial clustering are directly transferable to conventional water-quality indicators, including BOD5, turbidity, electrical conductivity, and faecal coliforms.
Comparable patterns have been reported in semi-arid regions of Latin America, particularly in Colombia and Mexico, where high vulnerability associated with steep slopes, reduced vegetation cover, and land-use pressure has been linked to increased salinity, suspended solids, and seasonal water-quality deterioration [20,21,22]. These findings reinforce the role of topographic and land-use gradients as primary controls of water-quality variability, especially under conditions of limited hydrological buffering. Accordingly, the assessment of semi-arid river systems requires analytical frameworks capable of capturing spatial contrasts among reaches, identifying exceedance patterns, and evaluating concordance between physicochemical indicators and biological metrics such as the Biological Monitoring Working Party index adapted for Colombia (BMWP/Col). This integrated perspective is essential for supporting prioritisation and decision-making in heterogeneous environmental systems.
In parallel, biological assessment using macroinvertebrate indices such as BMWP/Col has been widely applied in Latin American river systems to evaluate ecological integrity. These indicators provide sensitive responses to organic pollution and habitat degradation, complementing physicochemical analyses and enabling integrated assessment frameworks [23,24]. In semi-arid contexts, shifts towards pollution-tolerant taxa have been consistently associated with increasing anthropogenic pressure and declining water quality [25,26].
More recently, machine-learning approaches have been incorporated into water-quality studies to improve predictive capacity and capture complex relationships among variables. Models such as Random Forest and Support Vector Machine (ε-SVR) have demonstrated strong performance under non-normal conditions and moderate sample sizes, particularly where traditional deterministic models are constrained by limited data availability [13,27,28].
Despite these advances, a significant gap persists in the literature. Most studies address physicochemical characterisation, biological assessment, or predictive modelling independently, without integrating these components within a unified and reproducible analytical framework. This limitation is particularly evident in semi-arid micro-basins, where data scarcity restricts the application of comprehensive and decision-oriented methodologies.
GIS was employed exclusively for spatial localisation of sampling sites, while all statistical analyses and graphical outputs were generated through reproducible, code-based workflows to ensure traceability and methodological transparency. Despite recent progress, there remains a lack of integrated, code-reproducible approaches that combine physicochemical indicators, biological evidence (BMWP/Col), and predictive modelling in data-limited semi-arid basins; this study explicitly addresses this gap [27,29].
Hydric asymmetry constitutes a major challenge for integrated basin management, as technical, social, and ecological dimensions are often weakly articulated. Recent studies emphasise the need for analytical frameworks that integrate field observations, secondary data sources, and, where available, remote sensing information to strengthen diagnosis and predictive capacity under spatial and temporal variability [15,30].
Within this context, hydrological resilience frameworks provide operational tools to anticipate, diagnose, and respond to socio-environmental pressures. These approaches enable the evaluation of system robustness, adaptive capacity, and recovery potential through indicators, trend analyses, and predictive models, offering quantitative support for decision-making under conditions of environmental stress and change [27,31,32]. In this study, hydrological resilience is not measured as a dynamic recovery rate after discrete disturbance events. Instead, it is interpreted as an inferred spatial resilience gradient derived from the persistence, attenuation, or amplification of water-quality degradation along the river continuum. This structural proxy reflects the capacity of different reaches to maintain favourable physicochemical and ecological conditions under cumulative anthropogenic pressure.
However, the literature still lacks integrated, reproducible approaches that jointly analyse physicochemical conditions, biological responses, and predictive dynamics within a unified framework applicable to semi-arid micro-basins. This limitation constrains the development of robust diagnostic tools capable of supporting prioritisation and decision-making under conditions of spatial heterogeneity and data scarcity.
Accordingly, this study aims to characterise spatial water-quality gradients, identify critical zones of degradation, and support environmental decision-making processes in semi-arid river systems through an integrated analytical framework combining physicochemical indicators, BMWP/Col bioassessment, and exploratory machine-learning analysis of BOD5 behaviour. The framework is used to infer a spatial resilience gradient rather than to measure dynamic recovery after disturbance events.
2. Materials and Methods
2.1. Study Area
The Jipijapa River micro-basin is located in the southern sector of Manabí Province, within the coastal region of Ecuador, in Jipijapa canton. It forms part of the Portoviejo sub-basin, which is integrated into the Chone River watershed. The study area covers approximately 3266 ha and extends across both urban and rural parishes, including Manuel Inocencio Parrales, Guale, La Unión, and El Anegado.
An operational section of the fluvial system was delineated along the main river course, extending from the mountainous headwaters in the La Pita sector to the coastal outlet near Puerto Cayo. Nine monitoring stations (MAQ01–MAQ09) were georeferenced using GPS and distributed longitudinally along this axis.
The study reach is bounded between coordinates 553,482 E–9,849,454 N and 529,051 E–9,850,412 N (UTM WGS84, zone 17S), with elevations ranging from approximately 400 to 7 m a.s.l. Based on altitudinal gradient, slope, and land-use transitions, the system was segmented into three functional reaches (upper, middle, and lower), providing a consistent spatial framework for longitudinal analysis (Figure 1).
2.1.1. Physical and Climatic Characteristics
The system exhibits pronounced geomorphological contrasts, with slopes exceeding 30% in the upper reach and decreasing to less than 3% downstream. This altitudinal gradient exerts a primary control on hydrological processes, including runoff generation, sediment transport, and erosion–deposition dynamics.
Soils are predominantly clay-loam, with reduced infiltration capacity in degraded areas, promoting surface runoff and facilitating contaminant mobilisation. The climate corresponds to a tropical dry semi-arid regime, characterised by mean annual temperatures of approximately 25 °C and precipitation between 200 and 400 mm, concentrated mainly between January and April.
Land use is heterogeneous, consisting of short-cycle agriculture, pasturelands, secondary vegetation, and deforested areas, combined with dispersed rural and peri-urban settlements located in proximity to the river channel. This configuration favours diffuse pollutant inputs and increases the system’s vulnerability to water-quality degradation.
2.1.2. Socio-Environmental Context
The basin supports approximately 2504 inhabitants distributed across 272 dwellings. Sanitation infrastructure is limited, and field surveys conducted among 406 residents confirmed the presence of untreated greywater discharges and domestic activities, including laundry, near the riverbanks. Land use is dominated by small-scale agriculture and cattle grazing, with additional localised agro-processing and small-scale commercial activities contributing to organic loading and microbiological contamination.
2.1.3. Justification for Selection as a Case Study
The Jipijapa River micro-basin represents a typical semi-arid fluvial system under increasing anthropogenic pressure, characterised by limited dilution capacity, intermittent hydrological behaviour, and progressive water-quality deterioration.
These conditions make it a suitable natural laboratory for analysing spatial gradients of contamination, ecological degradation, and hydrological response under coupled climatic and human influences.
Previous studies have documented declining ecological integrity and shifts in macroinvertebrate communities towards pollution-tolerant taxa, associated with untreated domestic discharges and expanding agricultural activities [8,9,10].
From a methodological perspective, the clear altitudinal segmentation into upper, middle, and lower reaches provides a robust framework for analysing longitudinal variation in physicochemical parameters, regulatory exceedances, and biological indicators.
Figure 2 shows the spatial distribution of monitoring stations, contamination sources, and hotspots along the longitudinal gradient of the Jipijapa River micro-basin. The mapped contamination patterns distinguish between organic pollution indicators, represented by BOD5 and dissolved oxygen depletion; microbiological contamination, represented by faecal coliforms; and physicochemical stressors, represented by turbidity and electrical conductivity. This spatial representation supports the identification of downstream sectors where organic loading, microbial contamination, suspended solids, and ionic enrichment converge under cumulative anthropogenic pressure.
Importantly, the availability of multi-year monitoring data (2023–2025) from georeferenced sampling stations enables the identification of persistent spatial patterns, rather than isolated observations, thereby strengthening the analytical robustness and interpretability of the study.
2.2. Methodological Design
A quantitative observational design was adopted, integrating descriptive, inferential, ecological, and predictive modelling components implemented entirely in Python. The methodological framework was structured to: (i) characterise spatial gradients in water quality along the Jipijapa River micro-basin; (ii) identify critical monitoring sites based on multi-parameter exceedance patterns; (iii) assess ecological condition using macroinvertebrate-based bioindicators; (iv) examine relationships between physicochemical and environmental variables; and (v) evaluate the predictive performance of machine-learning models for BOD5 under semi-arid conditions.
The study was based on a multi-year monitoring dataset (2023–2025) comprising 27 observations collected from nine georeferenced sampling stations distributed across three functional reaches (upper, middle, and lower), thereby supporting the analysis of persistent spatial patterns in water quality.
Descriptive analysis was conducted at both the site and reach levels. Reach-level summaries were computed using median and interquartile range (IQR) from pooled observations across monitoring campaigns, preserving intra-site variability and avoiding temporal aggregation. This approach allowed robust characterisation of spatial gradients while maintaining the integrity of the multiyear dataset.
Hotspot identification was performed using an operational multi-year exceedance criterion based on TULSMA/Acuerdo Ministerial 097-A guideline values (Book VI, Annex 1). A monitoring site was classified as a hotspot when it exhibited persistent impairment, defined as either: (i) a mean number of exceedances equal to or greater than two across sampling campaigns, or (ii) exceedances occurring in at least two of the three campaigns. Evaluated parameters for hotspot classification included dissolved oxygen, BOD5, turbidity, and faecal coliforms. Total dissolved solids were retained in the descriptive physicochemical analysis but excluded from the hotspot decision matrix because no monitoring site exceeded the corresponding TULSMA threshold. This criterion enabled the identification of spatially consistent contamination patterns rather than isolated exceedance events. This exceedance-based approach follows the logic of threshold screening and spatial prioritisation used in water-quality and pollution-risk studies, where critical sites are identified by combining regulatory or empirical thresholds with the recurrence or concentration of contamination signals [14,33].
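For each monitoring site $i$, sampling campaign $t$, and water-quality parameter $p$, regulatory exceedance was coded as a binary variable:

$$E_{i,t,p} = \begin{cases} 1, & \text{if the measured value did not comply with the regulatory threshold for parameter } p \\ 0, & \text{otherwise} \end{cases}$$

The number of exceedances per site and campaign was calculated as:

$$N_{i,t} = \sum_{p=1}^{P} E_{i,t,p}$$

where $P$ represents the number of evaluated parameters. The mean exceedance score for each monitoring site was then computed as:

$$\bar{N}_{i} = \frac{1}{T} \sum_{t=1}^{T} N_{i,t}$$

where $T$ corresponds to the number of monitoring campaigns. Finally, hotspot status was assigned according to the following decision rule:

$$H_{i} = \begin{cases} 1, & \text{if } \bar{N}_{i} \geq 2 \ \text{or} \ \sum_{t=1}^{T} I\left(N_{i,t} \geq 2\right) \geq 2 \\ 0, & \text{otherwise} \end{cases}$$

where $H_{i} = 1$ indicates that site $i$ was classified as a hotspot, and $I(\cdot)$ is an indicator function identifying campaigns in which two or more parameters exceeded regulatory thresholds. This decision rule combines exceedance intensity, represented by the mean number of exceedances, and exceedance persistence, represented by recurrence across campaigns. Therefore, the criterion prioritises sites with sustained multi-parameter impairment rather than isolated exceedance events.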
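The decision rule can be illustrated with a minimal Python sketch. It assumes the exceedance flags are already available as a boolean array indexed by site, campaign, and parameter; the function name `classify_hotspots` and the example flags are hypothetical and are not taken from the study's code or data.

```python
import numpy as np

def classify_hotspots(exceed):
    """exceed: boolean array (sites, campaigns, parameters); True = threshold not met."""
    exceed = np.asarray(exceed, dtype=bool)
    n_per_campaign = exceed.sum(axis=2)              # exceedances per site and campaign
    mean_score = n_per_campaign.mean(axis=1)         # mean exceedance score per site
    recurrent = (n_per_campaign >= 2).sum(axis=1)    # campaigns with two or more exceedances
    hotspot = (mean_score >= 2) | (recurrent >= 2)   # intensity OR persistence criterion
    return n_per_campaign, mean_score, hotspot

# Hypothetical flags: 3 sites x 3 campaigns x 4 parameters
# (DO, BOD5, turbidity, faecal coliforms)
flags = np.zeros((3, 3, 4), dtype=bool)
flags[1, 0, :2] = True   # site 1: two exceedances in one campaign only
flags[2, :, :3] = True   # site 2: three exceedances in every campaign
n_per_campaign, mean_score, hotspot = classify_hotspots(flags)
```

In this toy input only the third site is flagged, because a single campaign with two exceedances satisfies neither the intensity nor the persistence condition.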
The adoption of TULSMA thresholds ensures consistency with Ecuador’s national environmental standards for surface waters used for agricultural, ecological, and recreational purposes, providing a robust regulatory benchmark for the interpretation of water quality in semi-arid basins. However, these thresholds are specific to the Ecuadorian regulatory context and should be locally adapted when applying the methodological logic to other basins.
Given the distributional properties of the dataset and the presence of non-normal behaviour in several variables, statistical inference was conducted using non-parametric methods. Differences among reaches were evaluated using the Kruskal–Wallis test. Where significant differences were detected, pairwise comparisons were performed using Dunn’s test with Holm adjustment. Effect sizes were estimated using Cliff’s delta (δ), allowing assessment of the magnitude of spatial differences beyond statistical significance.
Ecological condition was assessed using the BMWP/Col index derived from macroinvertebrate assemblages obtained during a single biological sampling campaign (n = 9), providing a spatial diagnostic of ecological conditions. Accordingly, the BMWP/Col assessment was interpreted as a static spatial diagnosis along the longitudinal gradient, not as an evaluation of temporal variability, ecological seasonality, or interannual biological change. Reach-level comparisons were conducted using the same non-parametric framework, enabling integration of biological and physicochemical evidence.
Associations between physicochemical and environmental variables were analysed using Spearman’s rank correlation, based on the complete dataset (n = 27). This approach was selected due to the presence of skewness and non-linear relationships among variables, ensuring robust estimation of monotonic associations. Given the number of pairwise comparisons, p-values were adjusted using the Benjamini–Hochberg false discovery rate procedure, and statistical relevance was interpreted using FDR-adjusted q-values.
Predictive modelling of BOD5 was performed using two regression approaches: Random Forest (RF) and Support Vector Machine in its regression form (ε-SVR). Predictor variables included dissolved oxygen, turbidity, electrical conductivity, and faecal coliforms. Standardisation was applied exclusively to SVM using a pipeline with StandardScaler, while RF was implemented without prior scaling. For the SVM model, scaling was applied strictly within each LOOCV fold through the scikit-learn pipeline, avoiding information leakage between training and validation data. Model performance was evaluated using leave-one-out cross-validation (LOOCV), with coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE) as evaluation metrics [13,28,34]. The models were used for internal diagnosis and pattern detection within the monitored dataset, not for operational prediction or external forecasting. To account for uncertainty associated with the limited sample size, 95% confidence intervals for R², RMSE, and MAE were estimated using non-parametric bootstrap resampling of the LOOCV observed–predicted pairs.
This integrated methodological framework combines spatial analysis, ecological assessment, and data-driven modelling to capture the complexity of water-quality dynamics in semi-arid fluvial systems, with emphasis on persistent spatial patterns rather than short-term or isolated variability.
2.3. Data Collection
Physicochemical and biological characterisation of the Jipijapa River micro-basin was conducted through a multi-year monitoring scheme (2023–2025) comprising three independent sampling campaigns. The campaigns were interpreted according to the local dry–rainy seasonal regime, characterised by rainfall concentration mainly between January and April and a prolonged dry period during the rest of the year. A total of nine georeferenced sampling stations (MAQ01–MAQ09) were established along the main fluvial axis, with three stations per reach (upper, middle, and lower), resulting in a total of 27 observations (n = 27). The spatial distribution of the stations along the river continuum is shown in Figure 2.
Sampling locations were defined based on altitudinal gradient, accessibility, and observed land-use patterns, ensuring representativeness of upstream, transitional, and downstream conditions. Altitude and population-related pressure were considered relevant spatial factors because downstream sites combine lower elevation, reduced dilution capacity, and greater exposure to domestic and peri-urban activities. Each station was sampled once per campaign, preserving independence between observations.
All samples were collected during daylight hours, immediately stored in insulated containers at approximately 4 °C, and transported to the laboratory on the same day under strict chain-of-custody procedures. Pre-sterilised sampling bottles were used to prevent contamination. The following parameters were measured: pH, electrical conductivity (EC), turbidity, dissolved oxygen (DO), biochemical oxygen demand (BOD5), temperature, nitrates, phosphates, total dissolved solids (TDS), and faecal coliforms. Recorded water temperature showed limited variation across campaigns and stations, ranging approximately from 21.5 to 23.7 °C, with slightly higher values in the lower reach.
Laboratory analyses were conducted in accordance with the Standard Methods for the Examination of Water and Wastewater (APHA) [35], and results were compared against the guideline values reported in the Supplementary Materials. In-situ measurements (pH, EC, turbidity, DO) were performed using portable meters calibrated daily with certified standards. BOD5 was determined using the 5-day incubation method, while nutrients and COD were analysed using standard colourimetric and volumetric procedures. Faecal coliforms were quantified using the multiple-tube fermentation method (MPN).
Quality assurance included field blanks, equipment rinsates, and laboratory duplicates to ensure analytical reliability. In parallel, aquatic macroinvertebrates were sampled using a D-frame net and standardised kick sampling procedures. Samples were preserved in 70% ethanol and identified to family level using regional taxonomic keys. Biological data were subsequently integrated with physicochemical measurements to support ecological assessment.
2.4. Ecological Assessment Using BMWP/Col Index
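Biological water quality was evaluated using the BMWP/Col index (Biological Monitoring Working Party adapted for Colombia) [29], widely applied in Neotropical river systems. For each sampling site $i$, the index was computed as:

$$\mathrm{BMWP/Col}_{i} = \sum_{f=1}^{F_{i}} s_{f}$$

where $s_{f}$ is the tolerance score assigned to macroinvertebrate family $f$ and $F_{i}$ is the number of families recorded at site $i$.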
Ecological classification was interpreted following Roldán [29]:
≥121: Very good
101–120: Good
61–100: Moderate
36–60: Poor
≤35: Very poor
Reach-level values were calculated as the mean of the three sampling stations within each reach. Differences among reaches were evaluated using the same non-parametric framework described below.
Because macroinvertebrate assemblages were sampled once at each station, BMWP/Col results were used to characterise spatial ecological condition across upper, middle, and lower reaches. No inference was made regarding biological seasonality or temporal ecological variability. The interpretation was therefore based on the concordance between the single-campaign biological gradient and the multi-year physicochemical degradation pattern.
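As a minimal sketch of the index calculation and classification, the following Python example sums family scores per site and maps the total to the quality classes listed above. The family scores in `SCORES` are illustrative placeholders, not the published BMWP/Col table, and the function names are hypothetical.

```python
# Illustrative placeholder scores; the published BMWP/Col family table
# should be substituted in practice.
SCORES = {"Perlidae": 10, "Baetidae": 7, "Chironomidae": 2, "Tubificidae": 1}

def bmwp_col(families, scores):
    """Sum the tolerance scores of the distinct families recorded at a site."""
    return sum(scores[family] for family in set(families))

def classify_bmwp(score):
    """Map an index value to the quality classes used in this section."""
    if score >= 121:
        return "Very good"
    if score >= 101:
        return "Good"
    if score >= 61:
        return "Moderate"
    if score >= 36:
        return "Poor"
    return "Very poor"

# Each family counts once per site, regardless of abundance
site_families = ["Perlidae", "Baetidae", "Baetidae", "Chironomidae"]
site_score = bmwp_col(site_families, SCORES)
site_class = classify_bmwp(site_score)
```

Because each family contributes once, the index reflects taxonomic composition rather than the number of individuals collected.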
2.5. Statistical Analysis and Predictive Modelling
All data processing and analysis were conducted using Python 3.8.10, employing pandas, numpy, scipy, scikit-learn, and scikit-posthocs libraries.
2.5.1. Descriptive and Inferential Analysis
Descriptive statistics were computed at both site and reach levels. Reach-level summaries were expressed as median and interquartile range (IQR) based on pooled observations across the three monitoring campaigns, preserving intra-site variability.
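A reach-level summary of this kind could be computed as in the sketch below, which pools all site and campaign records within each reach before taking the median and IQR. The column names and values are hypothetical, and the sketch assumes a long-format table with one row per site and campaign.

```python
import pandas as pd

# Hypothetical long-format records: one row per site and campaign
records = pd.DataFrame({
    "reach":    ["upper", "upper", "upper", "lower", "lower", "lower"],
    "campaign": [1, 2, 3, 1, 2, 3],
    "do_mg_l":  [6.0, 6.2, 6.4, 2.0, 2.1, 2.3],
})

def iqr(series):
    """Interquartile range (75th minus 25th percentile)."""
    return series.quantile(0.75) - series.quantile(0.25)

# Pool observations within each reach; no averaging per campaign beforehand
summary = records.groupby("reach")["do_mg_l"].agg(median="median", iqr=iqr)
```

Pooling before summarising keeps the between-campaign spread inside the IQR instead of averaging it away.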
Differences among reaches were evaluated using the Kruskal–Wallis test, appropriate for non-normal datasets. When significant differences were detected, pairwise comparisons were performed using Dunn’s test with Holm adjustment.
Effect size was estimated using Cliff’s delta (δ), allowing interpretation of the magnitude of differences independently of sample size.
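The global test and the effect-size calculation can be sketched as follows, using `scipy.stats.kruskal` and a direct implementation of Cliff's delta as the difference between the proportions of cross-group pairs in which one value is larger or smaller. The data are hypothetical, the helper `cliffs_delta` is not from the study's code, and Dunn's post hoc test is omitted here.

```python
import numpy as np
from scipy.stats import kruskal

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x = np.asarray(x, dtype=float)[:, None]   # column vector
    y = np.asarray(y, dtype=float)[None, :]   # row vector
    greater = (x > y).sum()
    smaller = (x < y).sum()
    return float((greater - smaller) / (x.size * y.size))

# Hypothetical dissolved oxygen values (mg/L) for three reaches
upper = [6.0, 6.1, 6.3, 6.2, 5.9]
middle = [3.1, 3.3, 3.2, 3.0, 3.4]
lower = [2.0, 2.2, 2.1, 1.9, 2.3]

h_stat, p_value = kruskal(upper, middle, lower)
delta_upper_lower = cliffs_delta(upper, lower)
```

A value of +1 or −1 indicates complete separation between the two groups; the sign depends on the order of the arguments.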
2.5.2. Correlation Analysis
Associations between physicochemical and environmental variables were assessed using Spearman’s rank correlation (ρ), based on the complete dataset (n = 27). This method was selected due to non-normality and the presence of monotonic but non-linear relationships.
Variables analysed included DO, BOD5, turbidity, EC, faecal coliforms, altitude, vegetation cover, and distance to anthropogenic sources. Given the number of pairwise comparisons and the moderate sample size, raw p-values derived from Spearman correlations were adjusted using the Benjamini–Hochberg false discovery rate (FDR) procedure. Statistical relevance was therefore interpreted using FDR-adjusted q-values, while the correlation matrix was treated as an exploratory tool for identifying coherent association patterns rather than as a set of independent confirmatory tests.
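The correlation and adjustment steps can be sketched as below. `scipy.stats.spearmanr` returns the coefficient and raw p-value for one pair of variables, and the Benjamini–Hochberg step is written out with NumPy; the helper name `benjamini_hochberg` and all input values are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)       # p_(k) * m / k
    # Enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# One hypothetical pair of variables
rho, p_raw = spearmanr([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5])

# Hypothetical raw p-values from several pairwise correlations
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In a full analysis the raw p-values from all variable pairs would be collected first and adjusted together, so that the q-values reflect the total number of comparisons.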
2.5.3. Predictive Modelling of BOD5
Two machine-learning regression models were implemented: Random Forest (RF) and Support Vector Machine in its regression form (ε-SVR).
Machine-learning modelling was incorporated to strengthen the replicability and diagnostic value of the proposed analytical framework for semi-arid basins with limited monitoring data. In basins with similar environmental conditions, where field records are often scarce, discontinuous, or spatially heterogeneous, RF and ε-SVR provide a reproducible procedure for analysing the relationship between BOD5 and selected physicochemical predictors. The models were used to examine internal data structure and predictor behaviour within the monitored system, rather than to infer generalisable results beyond the available dataset.
Predictor variables included DO, turbidity, EC, and faecal coliforms.
Standardisation was applied exclusively to SVM using a StandardScaler pipeline, while RF was trained on raw data.
For the SVM model, standardisation was implemented within the modelling pipeline to ensure consistency between preprocessing and cross-validation. Scaling was applied strictly within each fold of the LOOCV through the scikit-learn pipeline, avoiding information leakage between training and validation data.
Model performance was evaluated using leave-one-out cross-validation (LOOCV), suitable for small environmental datasets. LOOCV was selected because it maximises the use of available observations and provides an internal validation strategy appropriate for small environmental datasets, especially when field monitoring is limited by logistical, climatic, or institutional constraints. Accordingly, model outputs were interpreted as internal diagnostic and pattern-detection results, not as operational predictions or external forecasts.
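The fold-wise scaling described above can be sketched with scikit-learn as follows. Wrapping `StandardScaler` and `SVR` in a pipeline means the scaler is refitted on the training portion of each leave-one-out fold. The data are synthetic and the hyperparameters (`n_estimators`, `C`, `epsilon`, kernel) are assumptions, since the text does not report them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-ins for DO, turbidity, EC and faecal coliforms (n = 27)
rng = np.random.default_rng(0)
X = rng.normal(size=(27, 4))
y = 50 - 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=2.0, size=27)  # synthetic BOD5

models = {
    # RF is fitted on unscaled predictors
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    # The scaler is part of the pipeline, so it only sees each fold's training data
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)),
}

loo = LeaveOneOut()
predictions = {
    name: cross_val_predict(model, X, y, cv=loo) for name, model in models.items()
}
```

Each entry of `predictions` holds one out-of-fold prediction per observation, which is the set of observed–predicted pairs on which performance metrics are computed.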
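Performance metrics included:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n} \left(y_{i} - \bar{y}\right)^{2}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_{i} - \hat{y}_{i}\right)^{2}}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|y_{i} - \hat{y}_{i}\right|$$

where $y_{i}$ is the observed BOD5 value, $\hat{y}_{i}$ is the corresponding LOOCV prediction, $\bar{y}$ is the mean of the observed values, and $n$ is the number of observations.
To account for uncertainty associated with the limited sample size, 95% confidence intervals for R², RMSE, and MAE were estimated using non-parametric bootstrap resampling of the LOOCV observed–predicted pairs.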
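A percentile bootstrap of this kind can be sketched as below. The sketch assumes that observed and predicted values are resampled together as pairs with replacement; the number of resamples, the seed, the function names, and the example values are hypothetical choices.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE and MAE for paired observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": float(1.0 - np.sum(resid ** 2) / ss_tot),
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
    }

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile intervals from resampling (observed, predicted) pairs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rng = np.random.default_rng(seed)
    n = y_true.size
    draws = {"r2": [], "rmse": [], "mae": []}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample pairs with replacement
        if np.ptp(y_true[idx]) == 0:       # R2 is undefined for a constant resample
            continue
        for name, value in regression_metrics(y_true[idx], y_pred[idx]).items():
            draws[name].append(value)
    bounds = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    return {name: tuple(np.percentile(vals, bounds)) for name, vals in draws.items()}

# Hypothetical observed and LOOCV-predicted BOD5 values (mg/L)
y_obs = [6.0, 6.5, 20.0, 35.0, 90.0, 110.0, 120.0]
y_hat = [8.0, 7.0, 25.0, 30.0, 80.0, 115.0, 105.0]
point = regression_metrics(y_obs, y_hat)
ci = bootstrap_ci(y_obs, y_hat)
```

With samples this small the intervals are typically wide, which is consistent with treating the models as internal diagnostics rather than forecasting tools.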
Permutation importance was derived from the Random Forest model fitted to the full dataset. This approach estimates the decrease in model performance after randomly permuting each predictor and reports mean importance values with standard deviations across repeated permutations. Importance values were interpreted descriptively and as associative indicators of predictor relevance within the monitored system, without implying causal relationships [13,28,34,36,37].
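The permutation-importance step can be sketched with `sklearn.inspection.permutation_importance`, which returns the mean and standard deviation of the performance drop for each predictor. The data are synthetic, constructed so that only the first predictor drives the response, and the number of trees and repeats are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data in which only the first predictor is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(27, 4))
y = 50 - 20 * X[:, 0] + rng.normal(scale=1.0, size=27)

# Model fitted to the full dataset, as described in the text
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=30, random_state=0)

names = ["DO", "turbidity", "EC", "faecal_coliforms"]
ranking = sorted(
    zip(names, result.importances_mean, result.importances_std),
    key=lambda item: -item[1],   # largest mean importance first
)
```

Because the importances are computed on the same data used for fitting, they describe how the fitted model uses each predictor and should not be read as out-of-sample or causal evidence.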
2.5.4. Seasonal Variability (Contextual Analysis)
Seasonal variability was analysed by grouping observations into rainy and dry seasons across the multiyear dataset (2023–2025). Observations from all monitoring campaigns were classified according to hydrological season, enabling cross-seasonal comparison within the aggregated dataset. This classification was used as a contextual proxy because continuous hydrometeorological records, including precipitation, discharge, and flow variability, were not available for formal seasonal attribution.
Seasonal differences were evaluated using the Mann–Whitney U test. Given the absence of repeated within-season sampling at each site, seasonal effects were interpreted as indicative system responses rather than independent temporal replicates. Therefore, this analysis should be understood as a contextual and descriptive assessment of dry–rainy contrasts within the available dataset, not as a substitute for dedicated hydrological monitoring or process-based seasonal analysis.
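A two-group seasonal contrast of this type can be sketched with `scipy.stats.mannwhitneyu`. The values below are hypothetical and serve only to show the call and its outputs.

```python
from scipy.stats import mannwhitneyu

# Hypothetical turbidity values grouped by hydrological season
dry = [40.0, 42.0, 45.0, 47.0, 50.0]
rainy = [55.0, 60.0, 62.0, 65.0, 70.0]

# Two-sided test of whether one season tends to yield larger values
u_stat, p_value = mannwhitneyu(dry, rainy, alternative="two-sided")
```

The test treats observations as independent, so with repeated sampling of the same sites the p-value is best read as descriptive, in line with the contextual interpretation given above.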
Accordingly, this analysis was used to provide contextual insight into hydrological modulation of water-quality parameters, while primary inference was based on spatial variability across reaches [14,38].
3. Results
3.1. Spatial Variation in Physicochemical Water Quality Across Altitudinal Reaches
Spatial variation in physicochemical water quality was evaluated using a multi-year dataset (2023–2025; n = 27) derived from nine sampling points distributed across three altitudinal reaches: upper (MAQ01–MAQ03), middle (MAQ04–MAQ06), and lower (MAQ07–MAQ09).
Table 1 summarises reach-level conditions using median and interquartile range (IQR) computed from pooled observations across the study period. Dissolved oxygen decreases progressively from the upper reach (6.1 [0.36] mg L⁻¹) to the middle reach (3.21 [0.34] mg L⁻¹) and further to the lower reach (2.1 [0.26] mg L⁻¹). In contrast, BOD5 increases from 6.1 [0.6] mg L⁻¹ in the upper reach to 111.0 [21.0] mg L⁻¹ in the lower reach. Turbidity, electrical conductivity, and total dissolved solids also increase downstream, whereas alkalinity reaches its highest values in the middle reach (910.0 [315.0] mg L⁻¹), indicating localised geochemical or anthropogenic influences.
Non-parametric analyses indicated statistically significant differences among reaches for the main water-quality parameters. Kruskal–Wallis tests showed strong global contrasts for dissolved oxygen (H = 23.15, p < 0.001, ε² = 0.88), BOD5 (H = 19.33, p < 0.001, ε² = 0.72), turbidity (H = 23.16, p < 0.001, ε² = 0.88), electrical conductivity (H = 17.91, p < 0.001, ε² = 0.66), and total dissolved solids (H = 23.16, p < 0.001, ε² = 0.88), indicating large effect sizes and a pronounced spatial gradient. Post hoc comparisons using Dunn’s test with Holm adjustment confirmed that the most consistent differences involved contrasts with the lower reach. Significant differences were observed between upper and lower reaches across all parameters (p < 0.01). Differences between middle and lower reaches were also statistically significant for dissolved oxygen, turbidity, electrical conductivity, and total dissolved solids (p < 0.05). In contrast, comparisons between upper and middle reaches were not statistically significant for BOD5 (p = 0.163) and electrical conductivity (p = 0.458), indicating partial overlap in these parameters.
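The text does not state how the Kruskal–Wallis effect sizes were derived. As a hedged check, the reported values appear consistent with the rank-based estimator (H − k + 1)/(n − k), with n = 27 observations and k = 3 reaches; this is an inference from the reported numbers, not a documented step of the analysis. The short sketch below recomputes the values under that assumption.

```python
def effect_size_from_h(h, n, k):
    """Rank-based effect size (H - k + 1) / (n - k); assumed, not stated in the text."""
    return (h - k + 1) / (n - k)

# (H statistic, reported effect size) as given in the text
reported = {
    "dissolved_oxygen": (23.15, 0.88),
    "bod5": (19.33, 0.72),
    "turbidity": (23.16, 0.88),
    "electrical_conductivity": (17.91, 0.66),
    "total_dissolved_solids": (23.16, 0.88),
}

recomputed = {
    name: round(effect_size_from_h(h, n=27, k=3), 2)
    for name, (h, _) in reported.items()
}
matches = {name: recomputed[name] == reported[name][1] for name in reported}
```

Under this assumption all five recomputed values agree with the reported ones to two decimal places, whereas the alternative H/(n − 1) would give slightly larger values.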
Effect size estimates (Cliff’s δ) indicated strong to very strong differences between reaches. Comparisons involving the lower reach showed very strong separation (|δ| ≈ 1.0) for dissolved oxygen, turbidity, and total dissolved solids. BOD5 also exhibited strong negative effect sizes between upper and lower reaches (δ = −1.0), confirming a marked increase downstream, while electrical conductivity showed large differences between the lower and the other reaches despite moderate separation between upper and middle sections. Graphical representations are based on individual observations rather than distribution-based summaries, ensuring accurate visual interpretation without over-representing distributional properties under moderate sample sizes.
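For readers unfamiliar with Cliff's δ, the statistic can be sketched in a few lines: it is the proportion of cross-group pairs in which the first group is larger minus the proportion in which it is smaller. The values below are hypothetical and serve only to show why non-overlapping groups give |δ| = 1.0.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


# Hypothetical BOD5 values for illustration only (not the study's data)
upper_bod = [5.8, 6.1, 6.4]
lower_bod = [98.0, 111.0, 120.0]

# Every upper value lies below every lower value, so delta is -1.0
print(cliffs_delta(upper_bod, lower_bod))
```

A negative sign therefore means that the first group (here the upper reach) tends to have lower values than the second.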
Figure 3 presents individual observations for each parameter across reaches (n = 9 per reach), with median values indicated. Dissolved oxygen values are consistently higher in the upper reach and decrease markedly towards the lower reach, reflecting progressive oxygen depletion along the fluvial gradient. In contrast, BOD5, turbidity, electrical conductivity, and total dissolved solids exhibit a clear downstream increase, with the highest values concentrated in the lower reach. The separation between reaches is particularly pronounced for dissolved oxygen, turbidity, and total dissolved solids, while electrical conductivity shows a strong downstream increase with moderate variability in the middle reach.
3.2. Spatial Distribution of Contaminants and Identification of Hotspots
Hotspots were identified using the complete multi-year dataset (2023–2025; n = 27) by applying an operational criterion based on exceedance patterns relative to TULSMA thresholds. A monitoring site was classified as a hotspot when it exhibited consistent exceedances across sampling campaigns, defined as either (i) a mean number of exceedances equal to or greater than two, or (ii) exceedances occurring in at least two of the three monitoring campaigns. The hotspot decision matrix included dissolved oxygen, BOD5, turbidity, and faecal coliforms. Total dissolved solids were retained in the descriptive physicochemical analysis but excluded from Table 2 because no monitoring site exceeded the corresponding TULSMA threshold. This combined criterion allows the identification of sites with sustained impairment, even when average exceedance levels are moderate, thereby avoiding the exclusion of consistently impacted locations.
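The two-part hotspot rule can be expressed as a short decision function. The sketch below is one literal reading of the criterion, assuming that each site is described by the number of parameters exceeding their thresholds in each of the three campaigns; the input layout, the function name, and the example counts are hypothetical.

```python
from statistics import mean


def is_hotspot(exceedances_per_campaign, mean_threshold=2.0, min_campaigns=2):
    """Apply the two-part hotspot criterion to one monitoring site.

    exceedances_per_campaign: count of parameters above their threshold
    in each campaign (hypothetical input layout).
    """
    # (i) mean number of exceedances across campaigns >= threshold
    meets_mean = mean(exceedances_per_campaign) >= mean_threshold
    # (ii) at least one exceedance in a minimum number of campaigns
    campaigns_hit = sum(1 for count in exceedances_per_campaign if count > 0)
    meets_persistence = campaigns_hit >= min_campaigns
    return meets_mean or meets_persistence


# Hypothetical per-campaign counts for illustration only
examples = {
    "site_a": [3, 3, 2],  # mean 2.67: meets criterion (i)
    "site_b": [2, 3, 0],  # mean 1.67: meets criterion (ii) only
    "site_c": [1, 0, 0],  # meets neither criterion
}
flags = {site: is_hotspot(counts) for site, counts in examples.items()}
print(flags)
```

Under this reading, with four parameters and three campaigns, a mean of two or more requires at least six exceedances in total and therefore exceedances in at least two campaigns, so criterion (ii) is the broader of the two conditions.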
As presented in Table 2, five monitoring points met the hotspot criterion. The lower reach (MAQ07–MAQ09) exhibits the highest exceedance levels, with mean exceedance values ranging from 2.67 to 3.00, indicating consistent multi-parameter impairment across the study period. In the middle reach, two additional hotspots (MAQ04 and MAQ05) were identified. While MAQ04 meets the mean exceedance threshold (2.00), MAQ05 shows a slightly lower mean value (1.67) but satisfies the persistence criterion, with exceedances occurring in multiple campaigns, indicating recurrent but less intense impairment compared to the downstream section.
The spatial distribution of exceedances shows a clear concentration of critical conditions in the lower reach, where all monitoring sites consistently exceeded multiple regulatory thresholds. Dissolved oxygen remained below 6 mg L−1 across these sites, while BOD5 values exceeded 100 mg L−1 and faecal coliform concentrations were consistently above 200 MPN/100 mL. In the middle reach, exceedances were primarily associated with reduced dissolved oxygen and elevated faecal coliform levels, whereas other parameters generally remained within acceptable limits.
Non-parametric comparisons of exceedance patterns among reaches indicated statistically significant differences. The Kruskal–Wallis test confirmed strong global contrasts among reaches, while pairwise comparisons using Dunn’s test with Holm correction identified significant differences between the lower reach and both upper and middle reaches. Effect size estimates (Cliff’s δ) indicated large magnitudes of difference, supporting the presence of a well-defined spatial gradient in regulatory exceedances across the micro-basin.
3.3. Ecological Assessment and Bioindicators
The BMWP/Col index was derived from a single biological sampling campaign conducted across nine monitoring sites distributed along three altitudinal reaches: upper (MAQ01–MAQ03), middle (MAQ04–MAQ06), and lower (MAQ07–MAQ09), providing one observation per site (n = 9). This dataset represents a spatial diagnostic of ecological conditions rather than a temporal assessment. Therefore, BMWP/Col values should be interpreted as evidence of longitudinal ecological differentiation, not as evidence of seasonal or interannual biological variability. Reach-level summaries are presented in Table 3.
Mean BMWP/Col values decreased progressively from the upper reach (118.3) to the middle reach (66.3) and further to the lower reach (37.0). Corresponding value ranges were 98–135 in the upper reach, 54–80 in the middle reach, and 31–42 in the lower reach, indicating a marked longitudinal decline in ecological quality.
Kruskal–Wallis analysis indicated statistically significant differences among reaches (H = 7.20, p = 0.0273). Post hoc Dunn–Holm comparisons identified a statistically significant difference between the upper and lower reaches (p = 0.0219), whereas differences between upper–middle and middle–lower reaches were not statistically significant (p = 0.3594). Effect size estimation using Cliff’s delta indicated very strong separation between groups (|δ| = 1.0) for all pairwise comparisons, confirming the presence of a well-defined ecological gradient despite the limited sample size. Although based on a single biological campaign, the BMWP/Col pattern is consistent with the multi-year physicochemical degradation gradient observed across the same monitoring network, supporting its interpretation as a complementary spatial bioindicator. Temporal ecological variability remains outside the scope of this analysis.
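Because the three reaches do not overlap, the test statistics above depend only on the ranks and can be reproduced with a short calculation. The sketch below is a simplified implementation, not the software used in the study: it applies no tie correction, uses the chi-square approximation for three groups, and takes two-sided normal p-values for Dunn's test. The site scores are illustrative integers chosen to match the reported ranges and means, and all function names are hypothetical.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_dunn_holm(groups):
    """Kruskal-Wallis H, its p-value, and Holm-adjusted Dunn p-values."""
    names = list(groups)
    if len(names) != 3:
        raise ValueError("the exp(-H/2) p-value below assumes exactly 3 groups")
    pooled = [v for name in names for v in groups[name]]
    n = len(pooled)
    ranks = average_ranks(pooled)
    size, mean_rank, start = {}, {}, 0
    for name in names:
        size[name] = len(groups[name])
        mean_rank[name] = sum(ranks[start:start + size[name]]) / size[name]
        start += size[name]
    h = 12 / (n * (n + 1)) * sum(size[g] * mean_rank[g] ** 2 for g in names) - 3 * (n + 1)
    p_h = math.exp(-h / 2)  # chi-square survival function, 2 degrees of freedom
    raw = {}
    for a, b in combinations(names, 2):
        se = math.sqrt(n * (n + 1) / 12 * (1 / size[a] + 1 / size[b]))
        z = abs(mean_rank[a] - mean_rank[b]) / se
        raw[(a, b)] = math.erfc(z / math.sqrt(2))  # two-sided normal p-value
    # Holm step-down: scale sorted p-values by (m - i) and keep them monotone
    adjusted, running_max, m = {}, 0.0, len(raw)
    for i, (pair, p) in enumerate(sorted(raw.items(), key=lambda kv: kv[1])):
        running_max = max(running_max, min(1.0, (m - i) * p))
        adjusted[pair] = running_max
    return h, p_h, adjusted


# Illustrative scores matching the reported ranges and means (not the raw data)
bmwp = {"upper": [98, 122, 135], "middle": [54, 65, 80], "lower": [31, 38, 42]}
h_stat, p_value, dunn_holm = kruskal_dunn_holm(bmwp)
print(round(h_stat, 2), round(p_value, 4))
print({pair: round(p, 4) for pair, p in dunn_holm.items()})
```

With three sites per reach and complete separation, H = 7.20 is the largest value the statistic can take, so the reported p-values represent the strongest evidence this design can produce rather than a property specific to these scores.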
Figure 4 presents individual BMWP/Col values for each sampling site across reaches (n = 3 per reach), with mean values indicated to support visual interpretation. The upper reach exhibits consistently high scores associated with good ecological conditions, whereas the middle reach shows intermediate values corresponding to moderate ecological quality. The lower reach presents low BMWP/Col values, reflecting degraded ecological conditions. The clear separation among reaches supports the presence of a strong longitudinal ecological gradient. Overall, BMWP/Col values exhibit a consistent decreasing trend from the upper to the lower reach, reflecting progressive ecological degradation along the longitudinal gradient. Although based on a single sampling campaign, the strong agreement between biological and physicochemical patterns reinforces the robustness of the observed spatial structure.
3.4. Correlations Between Physicochemical and Environmental Variables
Spearman’s rank correlations were estimated to evaluate associations between physicochemical and environmental variables using the complete multi-year dataset (n = 27), derived from three sampling campaigns conducted between 2023 and 2025 across nine monitoring sites. Given the non-normal distribution of several variables and the presence of skewness, Spearman’s method was adopted as a robust non-parametric approach, consistent with the analytical framework described in Section 2. To reduce the risk of false positives associated with multiple pairwise comparisons, statistical significance was interpreted using Benjamini–Hochberg FDR-adjusted q-values.
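The Benjamini–Hochberg adjustment used to obtain q-values can be sketched as follows. Each sorted p-value is scaled by m/rank, and the results are then made monotone from the largest p-value downwards. The function name and the raw p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg FDR-adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical raw p-values for illustration only
raw_p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(raw_p)
print(q)
```

A correlation is then treated as statistically relevant when its q-value, rather than its raw p-value, falls below the chosen level.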
The correlation matrix (Figure 5) reveals a coherent correlation structure across the micro-basin. Dissolved oxygen exhibited strong negative correlations with BOD5 (ρ = −0.916, q < 0.001), electrical conductivity (ρ = −0.643, q = 0.00052), and faecal coliforms (ρ = −0.887, q < 0.001), while showing strong positive correlations with altitude (ρ = 0.876, q < 0.001) and vegetation cover (ρ = 0.886, q < 0.001). Additionally, a moderate negative correlation with distance was observed (ρ = −0.449, q = 0.023).
BOD5 displayed positive correlations with turbidity (ρ = 0.603, q = 0.00135), electrical conductivity (ρ = 0.632, q = 0.00068), and faecal coliforms (ρ = 0.784, q < 0.001), and negative correlations with altitude (ρ = −0.830, q < 0.001) and vegetation (ρ = −0.854, q < 0.001). A moderate positive correlation with distance was also identified (ρ = 0.534, q = 0.006).
Turbidity was strongly associated with electrical conductivity (ρ = 0.838, q < 0.001) and distance (ρ = 0.728, q < 0.001), and showed moderate negative correlations with altitude (ρ = −0.502, q = 0.010) and vegetation (ρ = −0.398, q = 0.045). Its relationship with faecal coliforms was weak and not statistically significant after FDR adjustment (ρ = 0.119, q = 0.555).
Electrical conductivity exhibited positive correlations with distance (ρ = 0.705, q < 0.001) and BOD5 (ρ = 0.632, q = 0.00068), and negative correlations with altitude (ρ = −0.660, q < 0.001) and vegetation cover (ρ = −0.439, q = 0.026). Its association with faecal coliforms was not statistically significant after FDR adjustment (ρ = 0.339, q = 0.090).
Faecal coliforms were negatively correlated with altitude (ρ = −0.674, q < 0.001) and vegetation (ρ = −0.778, q < 0.001), while their associations with turbidity and distance were not statistically significant after FDR adjustment (q > 0.05), suggesting that microbial contamination is more strongly linked to localised sources than to sediment transport processes.
Altitude and vegetation cover exhibited a strong positive correlation (ρ = 0.912, q < 0.001), whereas both variables were negatively correlated with distance (ρ = −0.683, q < 0.001; ρ = −0.527, q = 0.0066, respectively), reinforcing the role of landscape structure in regulating water-quality conditions. Overall, most strong and mechanistically coherent correlations remained statistically relevant after FDR correction, supporting the interpretation of an organised spatial degradation gradient while maintaining a cautious exploratory interpretation of the correlation matrix.
3.5. Predictive Modelling of BOD5 Using Machine Learning
Two regression approaches were evaluated to predict BOD5 concentrations from physicochemical variables (DO, turbidity, EC, and faecal coliforms): Random Forest (RF) and Support Vector Machine in its regression form (ε-SVR, hereafter SVM). Model performance was assessed using leave-one-out cross-validation (LOOCV), employing R2, RMSE, and mean absolute error (MAE) as evaluation metrics. Standardisation was applied exclusively to SVM through a scaler–SVM pipeline, while RF was implemented without prior scaling. In each LOOCV iteration, the scaler was fitted only on the training subset and then applied internally to the corresponding validation observation, thereby preventing data leakage. To account for uncertainty associated with the limited sample size, 95% confidence intervals for R2, RMSE, and MAE were estimated using non-parametric bootstrap resampling of the LOOCV observed–predicted pairs.
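The uncertainty step described above can be illustrated with a percentile bootstrap over observed–predicted pairs. The sketch below assumes the percentile method, 2000 resamples, and a fixed seed, none of which is specified in the text; the paired values are hypothetical stand-ins for LOOCV output, and the function names are illustrative.

```python
import math
import random


def r2(obs, pred):
    """Coefficient of determination from paired observations and predictions."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def bootstrap_ci(obs, pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval from resampled observed-predicted pairs."""
    rng = random.Random(seed)
    n = len(obs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        sample_obs = [obs[i] for i in idx]
        sample_pred = [pred[i] for i in idx]
        if len(set(sample_obs)) < 2:
            continue  # R2 is undefined when all resampled observations are equal
        stats.append(metric(sample_obs, sample_pred))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical LOOCV observed-predicted BOD5 pairs (mg/L), for illustration only
observed = [6.0, 6.5, 7.2, 20.1, 25.3, 30.0, 98.0, 111.0, 120.0]
predicted = [6.3, 6.1, 7.9, 22.0, 24.0, 28.5, 95.0, 104.0, 112.0]

for name, metric in (("R2", r2), ("RMSE", rmse), ("MAE", mae)):
    print(name, metric(observed, predicted), bootstrap_ci(observed, predicted, metric))
```

Resampling pairs rather than refitting the models keeps the procedure inexpensive, but the resulting intervals describe the variability of the metrics for a fixed set of predictions only.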
3.5.1. Overall Performance
Both models exhibited high apparent predictive performance under LOOCV. RF yielded R2 = 0.977 (95% CI: 0.957–0.997), RMSE = 7.98 mg L−1 (95% CI: 2.57–12.08 mg L−1), and MAE = 3.73 mg L−1 (95% CI: 1.42–6.79 mg L−1), while SVM obtained R2 = 0.976 (95% CI: 0.940–0.999), RMSE = 8.18 mg L−1 (95% CI: 1.60–13.15 mg L−1), and MAE = 3.31 mg L−1 (95% CI: 1.01–6.44 mg L−1). The nearly identical coefficients of determination indicate comparable explanatory capacity, with RF showing slightly lower RMSE, whereas SVM achieved lower absolute error. The width of the bootstrap confidence intervals, particularly for RMSE and MAE, reflects the uncertainty associated with the moderate sample size and the clustered distribution of high BOD5 values in the lower reach.
As shown in Figure 6, predictions from both models closely follow the 1:1 reference line across the full range of observed values. Slight dispersion is observed at higher BOD5 concentrations (>100 mg L−1), where both models tend to underestimate peak values, although deviations remain moderate.
Given the sample size (n = 27), LOOCV provides an internally consistent validation framework; however, results should be interpreted as indicative of model behaviour rather than definitive evidence of generalisation performance, particularly in the presence of structured spatial gradients. Accordingly, the models are interpreted as tools for internal diagnosis and pattern detection within the monitored dataset, not as operational prediction models for external forecasting.
3.5.2. Variable Importance (Random Forest)
Permutation importance derived from the Random Forest model indicates that dissolved oxygen was the dominant predictor of BOD5 behaviour, followed by turbidity and electrical conductivity, whereas faecal coliforms contributed only marginally to model performance (Figure 7). Mean decreases in R2 were 0.277 ± 0.053 for DO, 0.227 ± 0.044 for turbidity, 0.160 ± 0.031 for EC, and 0.011 ± 0.004 for faecal coliforms.
This hierarchy is consistent with the physicochemical structure of the dataset, where oxygen depletion and suspended solids are closely associated with organic loading. The low importance of faecal coliforms suggests that, within this dataset, microbiological contamination does not substantially improve BOD5 prediction when physicochemical variables are already included.
Although permutation importance reduces the bias associated with impurity-based RF importance, these values are still interpreted as associative indicators of predictor relevance within the monitored dataset and do not imply causal relationships. The associated standard deviations reflect uncertainty across repeated permutations.
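The logic of permutation importance can be sketched independently of any particular model: one predictor column is shuffled, predictions are recomputed, and the resulting drop in R2 is averaged over repeated shuffles. In the example below the model is a toy linear function that ignores its second input, standing in for the fitted Random Forest; the data, the number of repeats, and all names are hypothetical. scikit-learn provides a comparable utility, `sklearn.inspection.permutation_importance`.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=30, seed=0):
    """Mean and standard deviation of the R2 drop when each column is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    results = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only the selected column permuted
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
        mean_drop = sum(drops) / n_repeats
        sd = (sum((d - mean_drop) ** 2 for d in drops) / (n_repeats - 1)) ** 0.5
        results.append((mean_drop, sd))
    return results


def toy_model(row):
    """Hypothetical stand-in model: uses the first input, ignores the second."""
    return 100.0 - 12.0 * row[0] + 0.0 * row[1]


# Hypothetical predictor rows [DO-like value, unrelated value]
X = [[6.1, 3.0], [5.9, 9.0], [3.2, 1.0], [3.0, 7.0], [2.1, 5.0], [1.9, 2.0]]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

The ignored input receives an importance of exactly zero, which mirrors the interpretation given above: a low value indicates that a variable adds little to the predictions once the others are available, not that it is unimportant for water quality.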
3.5.3. Error Analysis
Prediction residuals obtained through LOOCV are presented in Figure 8. Both models exhibit relatively low error magnitudes across most observations, with larger deviations concentrated in the lower reach, where BOD5 values are highest and variability increases.
RF shows greater dispersion in extreme values, particularly underestimating peak concentrations, whereas SVM produces slightly more stable residual patterns, as reflected in its lower MAE. Nevertheless, differences between models remain limited, reinforcing their comparable predictive behaviour.
Given the moderate sample size and the clustering of high values in the lower reach, residual patterns should be interpreted with caution, as they may reflect data structure effects rather than purely model-driven error.
3.6. Seasonal Variability and System Response
Seasonal variability was explored by grouping the multi-year dataset (2023–2025; n = 27) into two hydrological periods (rainy and dry), as defined in Section 2.5.4. This classification is used here as a contextual proxy to examine potential seasonal modulation of water-quality patterns, rather than as a strict climatological separation. Because continuous precipitation, discharge, and flow records were not available, the seasonal analysis is interpreted as descriptive and contextual. It does not replace dedicated hydrological monitoring or process-based studies of seasonal dynamics.
Seasonal averages indicate moderate differences between periods. Dissolved oxygen shows slightly higher values during the rainy period (3.94 mg L−1) compared with the dry period (3.74 mg L−1). In contrast, BOD5 decreases from 45.97 mg L−1 in the dry period to 39.94 mg L−1 in the rainy period, while turbidity follows a similar pattern, decreasing from 32.72 NTU (dry) to 26.88 NTU (rainy).
These patterns suggest a tendency towards lower concentrations of organic matter and suspended solids under wetter conditions, which is compatible with potential dilution and increased hydrological connectivity during the rainy period. However, these mechanisms are not directly confirmed by the available dataset and should be interpreted cautiously.
To evaluate the significance of seasonal differences, a Mann–Whitney U test was applied. Results indicate that differences between rainy and dry periods were not statistically significant for any of the analysed parameters (DO: p = 0.571; BOD5: p = 0.165; turbidity: p = 0.537).
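For reference, the Mann–Whitney U comparison can be sketched as below. This is a simplified version that uses the normal approximation without tie or continuity corrections, so it will not match exact or corrected p-values from statistical packages; the function name and the seasonal values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Mann-Whitney U with a two-sided normal-approximation p-value.

    Simplifications of this sketch: no tie correction in the variance
    and no continuity correction.
    """
    n1, n2 = len(x), len(y)
    # Count pairs where x exceeds y; ties contribute one half
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Hypothetical BOD5 values (mg/L) by period, for illustration only
dry = [6.4, 7.0, 22.5, 31.0, 105.0, 118.0]
rainy = [5.9, 6.8, 20.0, 27.5, 99.0, 110.0]
u_stat, p_val = mann_whitney_u(dry, rainy)
print(u_stat, p_val)
```

Because the test compares ranks across the whole dataset, a strong between-reach spread such as the one illustrated here can mask modest seasonal shifts within each reach.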
The results show moderate seasonal contrasts in mean values, but these differences were not statistically significant within the analysed dataset. Therefore, seasonal variation is treated as a secondary descriptive pattern rather than as a dominant explanatory factor.
The seasonal response exhibits a consistent but limited pattern. Dissolved oxygen remains relatively stable between periods, whereas BOD5 and turbidity tend to decrease under rainy conditions. This behaviour is compatible with increased hydrological connectivity and potential dilution effects during wetter periods, but the absence of continuous hydrometeorological information prevents direct attribution to specific rainfall–runoff or discharge processes.
Importantly, this pattern does not imply uniform improvement across the system. As demonstrated in the spatial analysis, the lower reach maintains elevated levels of BOD5 and turbidity under both seasonal conditions, indicating persistent anthropogenic inputs that are not fully mitigated by seasonal hydrological variation.
The relatively small seasonal contrasts, together with the lack of statistical significance, indicate that seasonal variability was less pronounced than the spatial differences observed among reaches within the monitored dataset.
This is particularly evident in the lower reach, where high BOD5 and turbidity values persist across both periods, suggesting continuous pollutant inputs rather than episodic or climate-driven fluctuations (Figure 9).
Taken together, these results indicate that water-quality variation in the micro-basin was more strongly associated with spatial gradients than with seasonal contrasts within the available dataset. Seasonal variability is therefore retained as a contextual analysis, while future hydrological studies incorporating precipitation, discharge, and flow continuity are required to evaluate seasonal processes more rigorously.
4. Discussion
The results reveal a clear and statistically robust spatial gradient in water quality along the Jipijapa River micro-basin, characterised by progressive deterioration from headwaters to downstream reaches. This pattern is consistent with widely documented dynamics in semi-arid fluvial systems, where limited dilution capacity and cumulative anthropogenic inputs amplify longitudinal degradation processes [1,2,39]. Although this downstream deterioration is hydrologically expected in basins affected by cumulative anthropogenic pressure, its relevance in this study lies in the integrated empirical confirmation of the pattern through physicochemical parameters, regulatory exceedances, hotspot identification, bioindicator response, correlation structure, and exploratory machine-learning analysis. This convergence of evidence strengthens the interpretation of the gradient as a persistent and spatially organised process rather than an isolated descriptive observation.
Unlike single-event assessments, the present study is supported by a multi-year dataset (2023–2025; n = 27), allowing the identification of persistent spatial patterns rather than isolated fluctuations. This temporal consistency strengthens the interpretation of the observed gradients as structurally embedded within the system rather than episodic responses.
Regional evidence indicates that agricultural expansion, deforestation, and peri-urban development contribute significantly to increases in organic load, suspended solids, and microbiological contamination [8,10,25]. The spatial configuration of the Jipijapa micro-basin reflects these drivers, particularly in downstream sectors where cumulative pressures converge.
4.1. Spatial Gradients and Anthropogenic Pressure
The longitudinal gradient observed in dissolved oxygen, BOD5, turbidity, electrical conductivity, and faecal coliforms demonstrates a highly structured spatial organisation of water quality, supported by statistically significant differences among reaches (Kruskal–Wallis, p < 0.001 across variables).
The upper reach exhibits conditions consistent with relatively preserved systems, characterised by higher dissolved oxygen levels, lower organic load, and reduced variability. These patterns align with studies linking headwater integrity to vegetation cover, reduced disturbance, and enhanced self-regulation capacity in semi-arid environments [40,41].
In contrast, the lower reach shows a marked deterioration, with persistent BOD5 exceedances (>100 mg L−1), critically low dissolved oxygen (<3 mg L−1), elevated turbidity and conductivity, and high faecal contamination levels.
The magnitude of these differences, reflected in Cliff’s δ values approaching complete separation (|δ| ≈ 1.0), indicates that the reaches are structurally differentiated rather than overlapping along a smooth continuum, which is compatible with threshold-like behaviour rather than strictly linear deterioration.
This pattern is consistent with the concept of cumulative impact zones, where upstream pollutant transport combines with local inputs from agriculture and settlements, leading to amplified degradation [14,18,20].
The identification of hotspots concentrated in the lower reach (MAQ07–MAQ09), together with intermediate hotspots in the middle reach, reinforces the spatial clustering of contamination, a phenomenon widely reported in river systems where point and diffuse sources overlap [18,19].
The seasonal analysis also showed moderate dry–rainy contrasts, with lower mean BOD5 and turbidity values during the rainy period, which is consistent with dilution processes and increased hydrological connectivity. However, because the classification was based on a contextual seasonal proxy rather than continuous hydrometeorological records, these results should be interpreted descriptively and not as formal evidence of seasonal hydrological processes. These differences were smaller than the contrasts observed among reaches and did not reach statistical significance within the available dataset. Therefore, the absence of statistically significant seasonal differences should not be interpreted as evidence that seasonal processes are absent in the basin. Rather, it indicates that, under the current sampling design, the longitudinal gradient associated with cumulative anthropogenic pressure was more pronounced than the seasonal signal.
This interpretation is particularly relevant in the lower reach, where degraded conditions persisted during both dry and rainy periods. The maintenance of elevated BOD5 and turbidity values across seasons suggests that continuous pollutant inputs may reduce the apparent seasonal contrast, especially where untreated domestic discharges and diffuse agricultural pressures converge. Accordingly, seasonal variability is interpreted as a hydrological modulation factor, while the dominant empirical pattern in this study is the persistent upstream–downstream degradation gradient.
4.2. Ecological Degradation and Bioindicator Response
The ecological assessment based on the BMWP/Col index provides complementary biological evidence supporting the spatial degradation pattern identified through physicochemical parameters. The progressive decline in index values from the upper reach (mean = 118.3) to the middle (66.3) and lower reaches (37.0), together with statistically significant differences (Kruskal–Wallis, p = 0.0273), indicates a clear longitudinal deterioration of ecological integrity within the single-campaign spatial diagnosis.
The upper reach, classified as “good” ecological quality, is characterised by the presence of pollution-sensitive macroinvertebrate families, consistent with relatively stable physicochemical conditions. In contrast, the lower reach falls within the “poor” category, reflecting dominance by tolerant taxa typically associated with organic pollution, low dissolved oxygen, and elevated suspended solids.
The statistically significant difference between upper and lower reaches (Dunn–Holm p = 0.0219), together with complete separation in effect size (|δ| = 1.0), indicates that ecological degradation is not only detectable but also ecologically meaningful, reflecting a structurally differentiated biological response to environmental stress.
These findings are consistent with previous studies in the region, which report shifts in macroinvertebrate communities towards pollution-tolerant assemblages under increasing anthropogenic pressure [8,9,10]. Similar patterns have been documented across semi-arid and Andean systems, where biological indicators respond sensitively to cumulative impacts from untreated wastewater, agricultural runoff, and riparian degradation [23,24].
Importantly, the strong concordance between physicochemical indicators (e.g., BOD5, dissolved oxygen, turbidity) and biological response supports the robustness of the observed spatial gradient and reinforces the validity of integrated assessment frameworks combining chemical and ecological metrics, as recommended in resilience-oriented watershed studies [27,29]. Although the BMWP/Col assessment was based on a single biological campaign, its interpretation is supported by the stability of the multi-year physicochemical gradient observed across the same monitoring network. The consistent downstream deterioration in dissolved oxygen, BOD5, turbidity, faecal coliforms, and regulatory exceedances provides independent evidence that the biological pattern reflects a persistent longitudinal degradation structure rather than an isolated sampling artefact. Therefore, BMWP/Col was used as a complementary spatial bioindicator within the integrated framework, while temporal ecological variability remains outside the scope of the present study.
4.3. Coupling Between Physicochemical Dynamics and Environmental Drivers
The correlation structure derived from Spearman analysis reveals a coherent and mechanistically interpretable network of relationships linking water quality to environmental gradients. Because multiple pairwise correlations were evaluated, statistical relevance was interpreted using Benjamini–Hochberg FDR-adjusted q-values.
The strong negative correlation between dissolved oxygen and BOD5 (ρ = −0.916, q < 0.001) reflects the fundamental coupling between organic load and oxygen depletion, a core process in fluvial biogeochemistry. Similarly, the negative associations of dissolved oxygen with faecal coliforms (ρ = −0.887, q < 0.001) and electrical conductivity (ρ = −0.643, q = 0.00052) indicate that increasing anthropogenic pressure is systematically linked to reduced oxygen availability.
Conversely, the positive correlations between dissolved oxygen and both altitude (ρ = 0.876, q < 0.001) and vegetation cover (ρ = 0.886, q < 0.001) highlight the role of landscape structure as a regulating factor of water quality, reinforcing the importance of riparian integrity in maintaining ecological function.
BOD5 exhibits positive relationships with turbidity (ρ = 0.603, q = 0.00135), electrical conductivity (ρ = 0.632, q = 0.00068), and faecal coliforms (ρ = 0.784, q < 0.001), suggesting that organic pollution, suspended solids, and microbiological contamination are not independent phenomena but rather co-occurring processes driven by shared sources, such as agricultural runoff and domestic effluents.
The strong correlation between turbidity and electrical conductivity (ρ = 0.838, q < 0.001) further supports the interpretation of coupled sediment–solute transport processes, particularly in downstream areas where erosion, runoff, and accumulation of dissolved ions interact.
Environmental variables reinforce this structure. The strong positive correlation between altitude and vegetation cover (ρ = 0.912, q < 0.001) and their negative relationships with distance to anthropogenic sources (altitude: ρ = −0.683, q < 0.001; vegetation cover: ρ = −0.527, q = 0.0066) indicate that topographic and land-use gradients jointly influence water-quality patterns.
Rather than representing independent pairwise associations, the observed correlation network reflects a coupled system in which physicochemical variables, landscape structure, and anthropogenic pressure interact to produce spatially organised patterns of degradation. These findings should be interpreted as exploratory and associative, consistent with the moderate sample size and the correction applied for multiple comparisons.
These findings are consistent with integrated watershed analyses that highlight the importance of combining statistical and spatial approaches to capture non-linear interactions between environmental drivers and water quality [15,42,43].
From a systems perspective, the results support the interpretation of the micro-basin as a coupled socio-hydrological system, where land use, environmental gradients, and hydrological processes jointly regulate water-quality dynamics.
4.4. Predictive Modelling and System Behaviour
The predictive modelling results indicate that BOD5 behaviour in the Jipijapa River micro-basin can be internally approximated using data-driven approaches, with both Random Forest (RF) and Support Vector Machine (ε-SVR) showing high internal consistency under LOOCV (R2 ≈ 0.98; RMSE ≈ 8 mg L−1).
The comparable performance of both models suggests that the relationship between predictor variables and BOD5 is strongly structured within the monitored dataset. However, given the limited sample size (n = 27), the clustered distribution of high BOD5 values in the lower reach, and the absence of independent external validation, these results should be interpreted as indicative of internal diagnostic performance rather than evidence of generalisable predictive capacity.
As shown in Figure 6, predictions from both models closely align with the 1:1 reference line across the observed range. Slight dispersion is observed at higher BOD5 concentrations (>100 mg L−1), where both models tend to underestimate peak values, indicating reduced accuracy under extreme conditions.
This pattern is consistent with the presence of localised non-linearities and heterogeneous pollutant inputs in the lower reach, which introduce additional variability not fully captured by the models.
The similar RMSE values observed for RF and SVM indicate comparable model behaviour, although bootstrap confidence intervals reflect uncertainty associated with the limited sample size and spatial clustering of high BOD5 observations.
Permutation importance derived from RF identifies dissolved oxygen as the dominant predictor, followed by turbidity and electrical conductivity, while faecal coliforms contribute minimally to model performance (Figure 7).
This hierarchy reflects a physically coherent structure: dissolved oxygen acts as an integrative indicator of organic load and metabolic processes; turbidity captures suspended solids and particulate transport; and electrical conductivity reflects dissolved ionic content associated with agricultural inputs and soil processes.
The limited contribution of faecal coliforms suggests that microbiological contamination, while relevant for water-quality assessment, does not independently explain BOD5 variability when physicochemical variables are already considered. Nevertheless, permutation importance values are interpreted as associative indicators of predictor relevance within the monitored dataset and do not imply causal relationships.
Residual analysis (Figure 8) further supports this interpretation. Both models exhibit low error magnitudes across most observations, with slightly larger deviations concentrated in the lower reach, where BOD5 values are highest.
From a systems perspective, these results indicate that BOD5 variability is governed by coupled oxygen–organic matter dynamics, modulated by sediment transport and land-use patterns, rather than purely stochastic processes [27,36,37].
Importantly, LOOCV provides internal validation under limited sample conditions; however, model outputs should be interpreted as analytical tools for pattern detection rather than operational or fully generalisable predictive frameworks, pending validation with independent datasets.
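The leave-one-out procedure itself can be sketched as follows. To keep the example self-contained, a simple least-squares line relating dissolved oxygen to BOD5 is swapped in for the RF and SVM models used in the study; the function names and the data values are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes the x values are not all identical.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_rmse(x, y):
    """Leave-one-out cross-validated RMSE for the linear stand-in model.

    Each observation is held out once, the model is refitted on the
    remaining n - 1 observations, and the held-out value is predicted.
    """
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        prediction = slope * x[i] + intercept
        squared_errors.append((y[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical dissolved oxygen (mg/L) and BOD5 (mg/L) pairs. NOT study data.
do_demo = [7.8, 7.5, 7.1, 6.4, 5.9, 5.2, 4.1, 3.0, 2.2]
bod_demo = [3.0, 4.5, 6.0, 12.0, 18.0, 30.0, 65.0, 110.0, 150.0]

print(f"LOOCV RMSE = {loocv_rmse(do_demo, bod_demo):.2f} mg/L")
```

Every held-out point comes from the same sites and campaigns as the training data, which is why this procedure measures internal consistency rather than performance on an independent basin or period.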
4.5. Cross-Method Convergence and Framework Integration
The analytical components of the framework were not intended to operate as isolated outputs, but as mutually reinforcing diagnostic layers. Physicochemical analysis identified the upstream–downstream degradation gradient; hotspot detection located the sites where regulatory exceedances were recurrent; BMWP/Col provided an independent biological response to the same spatial structure; and machine-learning analysis identified the predictors most closely associated with BOD5 behaviour within the monitored dataset. The qualitative agreement among these components strengthens the internal validity of the framework, because the lower reach was consistently identified as the most degraded sector across regulatory, ecological, statistical, and predictive evidence.
This cross-method convergence is particularly relevant in data-limited semi-arid basins, where each analytical component has specific constraints. Physicochemical indicators provide direct evidence of water-quality impairment, but may not fully capture ecological response. BMWP/Col adds biological meaning to the observed gradient, although it was based on a single campaign. Hotspot analysis supports spatial prioritisation, while machine-learning models help identify the variables most strongly associated with organic pollution dynamics. Therefore, the contribution of the framework lies not in replacing one method with another, but in using their agreement to support a more robust spatial diagnosis.
Future applications could formalise this integration through a composite degradation index combining normalised physicochemical exceedance scores, hotspot persistence, BMWP/Col ecological degradation, and model-derived predictor relevance. For example, an integrated site-level score could be expressed as:
$$
IDI_i = w_1 E_i + w_2 H_i + w_3 B_i + w_4 M_i
$$

where $IDI_i$ represents the integrated degradation index for site $i$, $E_i$ is the normalised exceedance score, $H_i$ is hotspot persistence, $B_i$ is the inverted and normalised BMWP/Col degradation score, $M_i$ is the model-supported predictor contribution, and $w_1$–$w_4$ are weights defined according to management priorities or data availability. This formulation was not implemented in the present study due to sample-size constraints, but it provides a reproducible direction for future monitoring programmes.
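A site-level score of this kind could be computed as in the sketch below. It assumes a weighted linear sum of four components already scaled to [0, 1], with weights summing to one; the function name, the equal default weights, and the example scores are hypothetical, since the index was not implemented in the study.

```python
def integrated_degradation_index(
    exceedance, hotspot, bmwp_degradation, predictor,
    weights=(0.25, 0.25, 0.25, 0.25),
):
    """Weighted sum of four normalised degradation components for one site.

    All components are expected in [0, 1], where 1 is most degraded.
    Equal weights are a placeholder, not a recommendation.
    """
    components = (exceedance, hotspot, bmwp_degradation, predictor)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("components must be normalised to [0, 1]")
    if len(weights) != 4 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("four weights summing to 1 are required")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for three reaches. NOT study results.
sites = {
    "upper": (0.1, 0.0, 0.2, 0.1),
    "middle": (0.4, 0.3, 0.5, 0.4),
    "lower": (0.9, 1.0, 0.8, 0.9),
}

ranking = sorted(
    sites, key=lambda s: integrated_degradation_index(*sites[s]), reverse=True
)
print(ranking)
```

Because the result depends directly on the weights, any application would need to report them and test how sensitive the site ranking is to alternative weightings.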
4.6. Inferred Spatial Resilience Gradient from Empirical Evidence
Hydrological resilience is interpreted here as an inferred spatial resilience gradient rather than as a directly measured dynamic recovery process. Because the study did not assess discrete disturbance events or recovery rates, resilience is treated as a structural proxy derived from the spatial behaviour of physicochemical degradation, biological response, hotspot persistence, and exploratory modelling results.
The spatial gradient observed across the micro-basin suggests a declining inferred resilience profile from headwaters to downstream reaches, supported by convergent evidence from multiple analytical components.
In the upper reach, higher dissolved oxygen, lower BOD5, reduced turbidity, and favourable BMWP/Col values suggest a higher structural capacity to maintain favourable physicochemical and ecological conditions under cumulative pressure. This condition is consistent with resilience frameworks linking ecosystem integrity to vegetation cover, reduced disturbance, and enhanced self-regulation processes [40, 41].
The middle reach represents a transitional zone, where moderate degradation suggests partial reduction in buffering capacity, likely associated with increasing agricultural influence and localised anthropogenic inputs.
In contrast, the lower reach exhibits persistent exceedances of regulatory thresholds, strong statistical separation from other reaches, and degraded biological conditions. These characteristics indicate a system operating near or beyond functional thresholds, where the inferred resilience proxy is lower, reflecting reduced buffering capacity and persistent degradation rather than directly measured recovery limitation.
Rather than representing a binary condition, the inferred resilience gradient in this system emerges as a spatially structured continuum, reflecting the interaction between anthropogenic pressure and environmental buffering capacity.
The correlation structure further supports this interpretation. Strong negative relationships between dissolved oxygen and BOD5, together with positive associations between altitude, vegetation, and oxygen availability, indicate that the inferred resilience gradient is closely linked to landscape configuration and anthropogenic pressure gradients.
Importantly, the consistency of these patterns across multiple analytical approaches (physicochemical gradients, regulatory exceedances, hotspot persistence, BMWP/Col response, correlation structure, and exploratory machine-learning results) reinforces the interpretation of the inferred resilience gradient as a structural proxy derived from convergent spatial evidence, rather than a directly measured dynamic metric.
From a socio-hydrological perspective, the micro-basin can be understood as a coupled human–environment system, where land use, sanitation practices, and hydrological processes interact to determine spatial variability in the inferred resilience gradient [30, 31, 32].
4.7. Management Implications and Transferability
The results provide an empirical basis for prioritising management interventions within the Jipijapa River micro-basin, particularly through the identification of spatially explicit hotspots and the integration of physicochemical, biological, and model-derived indicators.
The concentration of critical conditions in the lower reach highlights the need for targeted actions focused on reducing organic pollution, improving wastewater treatment, and controlling diffuse agricultural inputs. In this context, hotspot identification based on persistent exceedance patterns offers a practical tool for decision-making, enabling resource allocation to areas with the highest environmental risk.
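The idea of ranking sites by persistent exceedance can be sketched as follows. The persistence measure (fraction of campaigns above a limit), the 0.5 cut-off, the limit value, the site labels, and the concentrations are all hypothetical placeholders; they are not the TULSMA thresholds or the exceedance criteria applied in the study.

```python
def exceedance_persistence(values, limit):
    """Fraction of sampling campaigns in which a value exceeds the limit."""
    return sum(v > limit for v in values) / len(values)


def flag_hotspots(site_series, limit, min_persistence=0.5):
    """Return the sites whose exceedance persistence reaches the cut-off."""
    return sorted(
        site
        for site, values in site_series.items()
        if exceedance_persistence(values, limit) >= min_persistence
    )


# Hypothetical BOD5 series (mg/L), one value per campaign. NOT study data.
bod5_series = {
    "S1_upper": [2.0, 3.1, 2.5],
    "S5_middle": [18.0, 35.0, 22.0],
    "S9_lower": [120.0, 95.0, 140.0],
}
PLACEHOLDER_LIMIT = 30.0  # illustrative only, not a regulatory value

for site, values in bod5_series.items():
    share = exceedance_persistence(values, PLACEHOLDER_LIMIT)
    print(f"{site}: {share:.2f}")
print(flag_hotspots(bod5_series, PLACEHOLDER_LIMIT))
```

For parameters where low values indicate degradation, such as dissolved oxygen, the comparison would need to be reversed.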
The strong agreement between physicochemical indicators and BMWP/Col results reinforces the value of combining chemical monitoring with biological assessment, as recommended in integrated water management frameworks [26, 44]. This approach enhances diagnostic accuracy and supports more effective intervention strategies.
The predictive modelling component further contributes to management by providing a reproducible analytical framework for identifying the variables most closely associated with BOD5 behaviour, particularly dissolved oxygen, turbidity, and electrical conductivity. While not intended for direct operational forecasting, these models can support diagnostic interpretation and early recognition of critical conditions.
From a systems perspective, the findings emphasise the importance of addressing water quality through integrated watershed management approaches that consider land use, vegetation cover, and anthropogenic pressure as interconnected drivers [27, 31, 36, 38].
In particular, the strong association between altitude, vegetation, and water quality underscores the need for conservation of upstream areas and restoration of riparian zones as key strategies to enhance system resilience and reduce downstream degradation.
Overall, the study provides a methodological logic that may be adapted to semi-arid basins with limited data availability, where combining spatial analysis, ecological indicators, and data-driven modelling can improve decision support and environmental management. However, transferability should be understood at the level of analytical structure rather than direct regulatory application. TULSMA thresholds are specific to Ecuador, and the selected parameters, exceedance criteria, bioindicator interpretation, and management priorities must be adjusted to the legal, hydrological, ecological, and monitoring conditions of each basin.
4.8. Limitations and Research Directions
Despite the robustness of the analytical framework, several limitations should be acknowledged.
First, although the study is based on a multi-year dataset (2023–2025; n = 27), the number of observations per site remains limited, which constrains the statistical power of some analyses and restricts the generalisability of predictive models. In particular, the use of LOOCV provides strong internal validation but does not substitute for external validation using independent datasets.
Second, the ecological assessment based on BMWP/Col was conducted from a single biological sampling campaign, and therefore represents a spatial diagnostic rather than a temporal evaluation of ecological dynamics. Future studies should incorporate repeated biological sampling to capture seasonal and interannual variability in macroinvertebrate communities.
Third, the classification of seasonal conditions was based on a proxy approach derived from the available dataset, rather than direct hydrometeorological records. Although this approach is methodologically consistent with data-limited contexts, it introduces uncertainty in the interpretation of seasonal effects. Consequently, the seasonal analysis was retained only as a contextual and descriptive component and should not be interpreted as a substitute for continuous hydrological studies based on precipitation, discharge, and flow records.
Fourth, the predictive modelling results, while highly consistent, may be influenced by the relatively small sample size and the presence of clustered high values in the lower reach, which can affect model stability and error distribution.
Future research should prioritise the expansion of monitoring networks, incorporation of continuous hydrological and meteorological data, and validation of predictive models under independent conditions.
Additionally, integrating remote sensing data and spatially explicit land-use information could enhance the understanding of landscape–water interactions and improve predictive capacity.
Finally, the integration of socio-economic variables and governance factors would provide a more comprehensive perspective on the drivers of water-quality degradation, reinforcing the socio-hydrological interpretation of the system.
Future research should also develop explicit mathematical integration among the framework components. This could include composite indices that combine physicochemical exceedance intensity, hotspot persistence, BMWP/Col degradation scores, and machine-learning-derived predictor relevance. Such integration would allow the framework to evolve from qualitative methodological convergence towards a quantitative decision-support tool for prioritising monitoring, restoration, and watershed management actions.
5. Conclusions
This study identifies a strong and statistically significant spatial gradient in water quality along the Jipijapa River micro-basin, characterised by progressive deterioration from headwaters to downstream reaches. The consistency of this pattern across physicochemical parameters, biological indicators, and exploratory machine-learning analysis indicates that spatial heterogeneity is the dominant source of water-quality variability in this semi-arid environment.
Hotspot identification based on multi-year exceedance criteria reveals a clear spatial concentration of contamination in the lower reach, with transitional impacts in the middle reach. These patterns reflect the cumulative effects of untreated domestic discharges, agricultural runoff, and landscape degradation, providing empirical evidence of anthropogenic pressure as the primary driver of water-quality decline.
The integration of biological assessment using the BMWP/Col index with physicochemical analysis demonstrates a high degree of concordance, confirming the reliability of combined diagnostic approaches for detecting ecological degradation in semi-arid fluvial systems. The observed longitudinal decline in ecological integrity highlights the sensitivity of macroinvertebrate communities to sustained environmental stress.
Correlation analysis reveals a coherent and mechanistically interpretable structure governed by the coupling between organic load and oxygen dynamics, modulated by altitude, vegetation cover, and proximity to anthropogenic sources. These results support the interpretation of the micro-basin as a coupled socio-hydrological system in which landscape configuration and human pressures interact to determine water-quality patterns.
Predictive modelling indicates that BOD5 variability can be internally approximated within the monitored dataset using machine-learning approaches, with Random Forest and SVM achieving high internal consistency under LOOCV. The identification of dissolved oxygen, turbidity, and electrical conductivity as dominant predictors provides a physically consistent representation of system dynamics, linking organic pollution processes with hydrological and land-use controls. However, model performance should be interpreted as diagnostic and indicative, due to dataset size, spatial structure, and the absence of independent external validation.
From a resilience perspective, the system exhibits an inferred spatial decline in functional condition, with headwaters maintaining more favourable physicochemical and ecological characteristics and downstream reaches showing persistent degradation and reduced buffering capacity. This inferred resilience gradient reflects the interaction between natural environmental buffering and sustained anthropogenic stress, rather than a directly measured dynamic recovery process.
The findings have direct implications for watershed management, supporting prioritisation of interventions in the lower reach, preventive strategies in the middle reach, and conservation actions in headwaters. The identification of persistent hotspots provides a practical basis for decision-making under limited-resource conditions.
Methodologically, this study contributes a reproducible analytical framework integrating non-parametric statistics, bioindicators, and machine-learning models, addressing a recognised gap in the analysis of data-limited semi-arid basins. The framework should not be interpreted as directly transferable in its regulatory thresholds or parameter values, because TULSMA criteria are specific to Ecuador and the study is based on a localised micro-basin dataset. Rather, its methodological logic may be adapted to other semi-arid basins with similar monitoring constraints, provided that thresholds, indicators, weighting criteria, and interpretation procedures are locally calibrated.
The value of the framework lies in the qualitative convergence among its components: physicochemical deterioration, recurrent hotspot formation, BMWP/Col ecological decline, and model-supported predictor relevance all identified the downstream sector as the most degraded area of the micro-basin.
Future research should focus on expanding temporal coverage, incorporating hydrological variables such as discharge, validating predictive models using independent datasets, and developing composite indices that quantitatively integrate physicochemical, biological, hotspot, and machine-learning components.
Overall, the study demonstrates that, within the available multi-year dataset, water-quality dynamics in the Jipijapa River micro-basin were more strongly associated with spatially structured anthropogenic pressures than with seasonal contrasts. This highlights the need for spatially targeted management strategies to improve ecosystem condition, reduce persistent contamination hotspots, and support resilience-oriented watershed planning in semi-arid environments.