1. Introduction
Issues related to monitoring the environmental status of water bodies in Kazakhstan are becoming increasingly important. This is relevant as it contributes to ensuring the sustainability of aquatic ecosystems and water security in the country.
The Irtysh is one of the largest rivers in Kazakhstan. Its total length, including the Kara-Irtysh, is 4.2 thousand kilometers. It flows through the territories of China, Kazakhstan, and Russia. The Irtysh River plays a significant role in the development of industry and agriculture in the region and, as a result, is subject to anthropogenic impact. This leads to numerous environmental and social problems [1]. In addition, the river is used for fishery purposes. With the increasing anthropogenic load, the need for continuous monitoring of its ecological condition becomes more urgent [2,3,4]. One of the sources of anthropogenic impact is the discharge of domestic wastewater from cities and small settlements. As a rule, treated domestic wastewater is discharged into nearby water bodies. Its reuse is not widespread in Kazakhstan due to technical, regulatory, economic, and social limiting factors, including public acceptance.
As of 1 January 2024, the Republic of Kazakhstan (RK) comprises 17 regions, 188 districts, 89 cities (including 3 of republican significance), 29 towns, 2169 rural districts, and 6256 villages [5,6]. According to the Bureau of National Statistics, the urban population of the republic amounts to 12,727,404 people, while the rural population is 7,516,577. Thus, about 63% of the population resides in urban areas and about 37% in rural areas. The number of small settlements in Kazakhstan is approximately 6324. Currently, in accordance with the strategic development plan “Kazakhstan-2050” [7], state programs are being implemented to protect water resources and to modernize (reconstruct and build) housing and communal infrastructure, as well as heating, water supply, and wastewater systems, including in small settlements. Furthermore, with the development and expansion of urbanized areas, as well as the increase in population and industrial facilities, the volume of wastewater in Kazakhstan is also rising [8,9].
It should be noted that most of the existing treatment facilities are both functionally obsolete and physically worn, are in critical condition, and create an unfavorable sanitary-epidemiological and environmental situation for the water bodies [10,11,12,13]. According to the report of the Minister of Industry and Infrastructural Development of the Republic of Kazakhstan, 27 of the republic's 89 cities have no sewage treatment plants, and in 41 cities the wear of the facilities is 60% or more [14]. This paper focuses on small settlements, which are economically weaker and less developed than cities. Most of the existing wastewater treatment plants (WWTPs) in small settlements were built in the 1970s, have been in operation for a long time, and usually follow the standard wastewater treatment scheme (mechanical and biological) with discharge into water bodies. Vilson et al. [15] propose two ways to solve this problem: first, they consider it appropriate to build new wastewater treatment plants if the existing structures were built before the 1970s; second, structures built in the second quarter of the 20th century may be retrofitted. When using traditional wastewater treatment technology (mechanical and artificial biological treatment), the values of key indicators in the treated water can vary within the following ranges: BODfull (biochemical oxygen demand) 10–20 mg/L; suspended solids 12–20 mg/L; ammonium nitrogen 5–7 mg/L; nitrate nitrogen 12–15 mg/L; petroleum hydrocarbons 1 mg/L [16].
Despite significant hydrological dilution in the Irtysh River, local effects of effluent discharge remain noticeable as elevated values of several parameters (BOD, COD, nitrites, ammonium, and phosphates) near the discharge points [1].
A review of the literature shows that some authors focus on analyzing the condition and operation of treatment facilities, emphasizing proper operation and maintenance features of treatment facilities in small settlements [17,18,19,20]. Other authors address issues related to the intensification and improvement of treatment facility performance [15,21,22,23,24], as well as the reuse of treated domestic wastewater [25]. There are also studies that specifically focus on the Irtysh River. The river was studied by M. Burlybayev [26] over the period from 1986 to 2011, D. Burlybayeva [27] from 1947 to 2013, and I.V. Shenberger [28] from 2006 to 2015. Kolpakova et al. [29] conducted a hydrological assessment of the Kazakhstani section of the Irtysh River under conditions of industrial development and climate change from 2019 to 2023. In the aforementioned studies, the water quality of the Irtysh River was assessed at control sections near major cities (Ust-Kamenogorsk, Semey, Pavlodar).
Beyond conventional compliance-based assessments, hydrological and geochemical research has developed a broad family of tracer-aided approaches to quantify contributions from different water sources to observed stream chemistry [30,31]. A key concept in this family is End-Member Mixing Analysis (EMMA), a statistical and geochemical method widely used to estimate the contribution of various source waters to a mixture and to explain the formation of its current chemical composition [32,33,34,35,36]. EMMA is based on linear mixing theory and multidimensional statistical analysis: it combines water and tracer mass balance with principal component analysis (PCA) to treat several chemical variables jointly as a single multidimensional structure [32,33]. This allows one to reduce the dimensionality of large heterogeneous data sets, identify latent structure, and separate signal from noise in a way that is mathematically rigorous and easily reproducible [30,32].
In classical EMMA, pollutants or conservative solutes (e.g., major ions, selected trace parameters, and stable isotopes) are treated as tracers that distinguish between end-members, and PCA is used to define a low-dimensional mixing space [32,33,35]. Within this framework, stream chemistry is interpreted as a mixture of a finite number of end-members, and their contributions are computed from linear mixing equations [32,33]. The fundamental EMMA equations can be derived directly from the conservation of water and tracer mass, for example, for two components with discharges $Q_1$ and $Q_2$ and concentrations $C_1$ and $C_2$ mixing to produce a stream with discharge $Q_s$ and concentration $C_s$ [32,33]:
$$Q_s = Q_1 + Q_2, \qquad C_s Q_s = C_1 Q_1 + C_2 Q_2$$
The method requires that end-members exhibit sufficiently distinct tracer signatures relative to analytical uncertainty, that the system can be approximated as quasi-stationary over the analysis period, and that no additional unaccounted sources or sinks significantly affect the water or tracer mass balance. Despite practical challenges in defining representative end-member concentrations in space and time, EMMA has been widely applied in small and medium-sized catchments to separate flow components, identify dominant sources of runoff (e.g., rainwater, soil water, groundwater, snowmelt, glacial melt), and test hydrological hypotheses in a way that is relatively independent of specific model structures [34,35,36].
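As a worked illustration of the two-component case, the two balances can be solved for the share of each component in the mixture. The subscript $s$ for the mixed stream and the symbols $f_1$, $f_2$ for the component fractions are notation assumed here for illustration, and the result holds only when $C_1 \neq C_2$:

```latex
\begin{aligned}
Q_s &= Q_1 + Q_2, \qquad C_s Q_s = C_1 Q_1 + C_2 Q_2 \\
C_s &= f_1 C_1 + (1 - f_1)\, C_2, \qquad f_1 = \frac{Q_1}{Q_s} \\
f_1 &= \frac{C_s - C_2}{C_1 - C_2}, \qquad
f_2 = 1 - f_1 = \frac{C_1 - C_s}{C_1 - C_2}
\end{aligned}
```

The fractions are well defined only when the two end-member concentrations differ by clearly more than the analytical uncertainty, which is the practical meaning of the distinct-signature requirement.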
Studies based on EMMA and related tracer-aided modeling are widespread internationally and increasingly used to understand how climate change and human activity modify water resources and water quality [31,35,36,37,38]. Applications include small mountain catchments in Asia and Europe; experimental basins in Russia (e.g., Laninsky Stream in the Baikal region, small mountain catchments in Central Sikhote-Alin); and snow-melt-dominated basins in North America, where EMMA has been used to quantify the proportions of different source waters in river runoff and to develop ensemble approaches to hydrograph separation [36,37,39]. Recent work has also emphasized that tracer-aided modeling (TAM) continues to expand its scope of application and that EMMA remains one of the core tools of hydrograph separation for quantifying the impact of surface- and groundwater sources on streamflow [31,38].
However, the use of EMMA-type approaches in the context of local wastewater plumes discharged from small-settlement treatment facilities into large rivers is still limited. Conceptually, the problem is similar to catchment-scale mixing: downstream river water can be represented as a mixture of upstream river water and wastewater effluent. Yet, compared to classical catchment applications, the system is often simpler in terms of the number of end-members, while effluent concentrations may be highly variable and key indicators such as BOD and COD may undergo rapid in-stream transformations that violate strict conservative-tracer assumptions. This creates a need for simplified, EMMA-inspired mixing diagnostics that can be applied with limited data, explicitly account for dilution and mixing, and provide a bridge between process-based understanding and operational predictive tools for effluent management [36,39].
The literature review showed that the impact of discharges from small-settlement treatment facilities on the ecological condition of the Irtysh River has not been sufficiently studied, which reduces the completeness and objectivity of assessments of the actual ecological impact. Studying these discharges will allow anthropogenic pressures from smaller-scale facilities to be considered more comprehensively and will support an objective assessment of the ecological condition of the river.
Small settlements are numerous along the Kazakhstani reach of the Irtysh, and many rely on aging facilities operating conventional mechanical-biological schemes; assessing their discharges is therefore essential for basin-scale water-quality management. This study addresses a gap in assessments of the Irtysh River, which have focused largely on control sections near major cities, by quantifying the impact of discharges from small-settlement treatment facilities that are widespread along the river corridor. In methodological terms, the work integrates three complementary components:
A conservative-tracer mixing analysis to estimate site-specific effluent fractions and dilution;
A transformation diagnostic (θ) that tests whether reactive indicators depart from the mixing line;
Low-complexity, cross-validated predictive models (regularized regressions and PLS) to forecast BOD and COD at facility effluents from upstream water-quality indicators.
Together, these components provide a physically grounded baseline, reveal biogeochemical departures from simple entrainment, and deliver practical, uncertainty-aware forecasts for operational use. The objective is to construct and compare parsimonious predictive models for BOD and COD under small-sample conditions, and to interpret their performance against dilution-based expectations derived from the tracer analysis.
This study integrates conservative-tracer mixing (f, D), a transformation diagnostic (θ), and low-complexity, cross-validated predictive models to evaluate two small-settlement facilities on the Irtysh River. The novelty lies in combining physically grounded dilution estimates with a statistical diagnostic of departures from the mixing line and with sparse, uncertainty-aware forecasts for BOD and COD under a small-sample regime (N = 10). The aim is to quantify dilution, identify reactive behavior, and select operationally useful predictors for effluent quality.
2. Materials and Methods
2.1. Study Area
The study examines the impact of domestic wastewater discharges from two small settlements on the Irtysh River in the East Kazakhstan region.
Figure 1 shows a schematic map of the study area.
The objects of the study are sewage treatment plants TF1 and TF2. The design capacity of the wastewater treatment facilities (TF1) in Settlement No. 1 (constructed in 1980) is 5000 m³/day, while the actual throughput is 370.9 m³/day.
The design capacity of the wastewater treatment facilities (TF2) in Settlement No. 2 (constructed in 1964) is 1824 m³/day, while the actual throughput is 625 m³/day.
Domestic wastewater from the two settlements undergoes mechanical and artificial biological treatment. The treatment facilities are composed as follows:
TF1 in Settlement No. 1 includes a sewage pumping station; a receiving chamber; a horizontal grit chamber with circular water flow (85% wear); a sand hopper; a Venturi-type flow measuring flume; a primary clarifier (50% wear); two-lane aeration tanks; a secondary clarifier (50% wear); an aerobic mineralizer; a contact tank; a sludge digester; a chlorination unit; a discharge point for treated water into the watercourse; and sludge drying beds.
TF2 in Settlement No. 2 includes a screening unit (70% wear); a grit chamber (85% wear); primary two-tier and vertical clarifiers (50% wear); biological reactors No. 1 (100% wear) and No. 2 (50% wear); secondary vertical clarifiers (50% wear); sludge drying beds; a chlorination unit; and a contact tank.
2.2. Chemical Water Quality Parameters
The study used laboratory data from the treatment facilities of two small settlements, No. 1 and No. 2, for 12 indicators over the last five years (2020–2024). The facilities are located in the East Kazakhstan region.
Water quality was assessed based on 22 indicators from samples taken in June 2024 and 2025 at four locations on the Irtysh River. Background points located 500 m upstream and 500 m downstream of the treatment facility discharge points at settlements No. 1 and No. 2 were selected for monitoring. The selected period reflects the current state of the treatment facilities, the technologies used, and the applicable regulatory requirements. The data for this period therefore show the actual degree of impact of discharges from the sewage treatment plants on the ecological state of the Irtysh River.
During the field studies under real operating conditions, pollutants were examined according to the following conditional groups:
Biogenic substances—ammonium salts (NH4+), nitrite (NO2−) and nitrate (NO3−) salts, phosphates (PO43−);
Organic substances—petroleum products, synthetic surfactants;
Major ions—sulfates (SO42−), chlorides (Cl−), calcium (Ca2+), magnesium (Mg2+);
Heavy metals—copper (Cu2+), zinc (Zn2+), lead (Pb2+), chromium (Cr6+), total iron (Fe2+), cadmium (Cd2+), manganese (Mn2+).
An analysis was conducted on the following indicators of wastewater before and after discharge from the studied treatment facilities: temperature, total mineralization (dry residue), pH, suspended solids, permanganate oxidizability, dissolved oxygen, BOD, and COD.
Laboratory-analytical studies to determine the hydrochemical composition of the water were carried out at the accredited laboratory of LLP “Testing Laboratory of NGO EK-ECO” in the city of Ust-Kamenogorsk. Analytical, systematic, and comparative methods were applied for data processing.
The removal efficiency $E$ for each pollutant was determined by Formula (1):
$$E = \frac{C_{inf} - C_{efl}}{C_{inf}} \times 100\% \qquad (1)$$
where $C_{inf}$ is the concentration of the pollutant at the entrance to the sewage treatment plant, and $C_{efl}$ is the concentration of the pollutant in the effluent of the sewage treatment plant.
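The efficiency calculation can be sketched in a few lines of Python. The function name and the example concentrations below are hypothetical and serve only to illustrate the arithmetic; they are not values from the study.

```python
def removal_efficiency(c_inf: float, c_efl: float) -> float:
    """Percent removal between plant influent and effluent concentrations."""
    if c_inf <= 0:
        raise ValueError("influent concentration must be positive")
    return (c_inf - c_efl) / c_inf * 100.0


# Hypothetical concentrations in mg/L, for illustration only.
example = removal_efficiency(c_inf=120.0, c_efl=6.0)
print(f"removal efficiency: {example:.1f}%")
```

A negative result signals an effluent concentration above the influent concentration, which is worth flagging rather than clipping to zero.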
2.3. Mathematical Methods/Models
Measurements taken at the treatment plant (influent/effluent) and in the river (upstream/downstream) reflect mixing (dilution of wastewater with river water) and transformation processes (decrease/increase in indicators in the river section downstream of the discharge) simultaneously, so on their own they cannot distinguish between the two. A model-based analysis was therefore undertaken to separate the contribution of hydraulic mixing from biogeochemical transformation along the river reach affected by the wastewater treatment plant (WWTP) discharge. An algebraic framework was employed to quantify (i) the mixing fraction and dilution attributable to effluent entrainment; (ii) deviations of reactive water-quality indicators from conservative mixing, consistent with net attenuation (downstream concentrations below the mixing prediction) or net generation/additional inputs (downstream concentrations above the mixing prediction); and (iii) the overall magnitude of multi-indicator change in a unitless form suitable for comparison across sites and periods. The approach complements compliance-based assessment by providing interpretable diagnostics of process dominance.
An overview of the complete field-to-model workflow, covering sampling, data preprocessing/QC, conservative-tracer mixing diagnostics (f, D), θ-based deviations from the mixing line, and the predictive modelling/validation pipeline, is provided in Figure 2.
For each indicator, triplets of concentrations were compiled: upstream river concentration ($C_{up}$), downstream river concentration ($C_{down}$), and WWTP effluent concentration ($C_{efl}$). Indicator names were normalized to consistent labels (e.g., Chlorides, Dry residue/TDS, COD, Ammonium, Phosphates, Nitrite, Nitrate). The analytical endpoints BODfull (effluent) and BOD5 (river) were treated as different measures and were not combined within the same model calculations.
For a conservative tracer (i.e., negligible reaction over the inter-station distance), downstream concentration was represented as a linear mixture of upstream river water and effluent (2):
$$C_{down} = f\,C_{efl} + (1 - f)\,C_{up} \qquad (2)$$
where $f$ denotes the effluent mixing fraction in the downstream cross-section. The fraction and the dilution factor $D$ were obtained as follows (3):
$$f = \frac{C_{down} - C_{up}}{C_{efl} - C_{up}}, \qquad D = \frac{1}{f} \qquad (3)$$
Tracer selection prioritized Chlorides. Feasibility was enforced by the convexity (mixture) condition $\min(C_{up}, C_{efl}) \le C_{down} \le \max(C_{up}, C_{efl})$ and by the physical bounds $0 < f \le 1$. For each site (and date, if applicable) a single reference fraction was selected from feasible tracers following the stated priority; non-feasible records (e.g., sulfate cases violating convexity) were not used for estimation but were retained descriptively.
Chloride (Cl−) was selected as the primary conservative tracer because, over the short inter-station distance considered here, it behaves quasi-conservatively in river water: it is a major dissolved ion that is not subject to rapid biological uptake or redox transformation, and it does not readily sorb or precipitate under typical oxic, circumneutral river conditions. In domestic wastewater, chloride is largely derived from household inputs (e.g., dietary salt) and therefore tends to be elevated relative to upstream river water, providing a strong signal-to-noise ratio for mixing calculations. Consequently, changes in Cl− between the upstream and downstream cross-sections are attributed primarily to physical mixing, enabling estimation of the effluent fraction f and dilution factor D from the linear mass-balance model.
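A minimal sketch of the tracer calculation follows, assuming the linear two-end-member balance in which the downstream concentration equals `f * c_efl + (1 - f) * c_up`. The function name, the strict lower bound on `f`, and the chloride values are hypothetical and chosen only to make the example concrete.

```python
from typing import Optional, Tuple


def mixing_fraction(c_up: float, c_down: float,
                    c_efl: float) -> Optional[Tuple[float, float]]:
    """Return (f, D) for a conservative tracer, or None if the record is infeasible."""
    low, high = min(c_up, c_efl), max(c_up, c_efl)
    # Convexity: the downstream value must lie between the two end-members,
    # and the end-members must differ for the fraction to be defined.
    if c_efl == c_up or not (low <= c_down <= high):
        return None
    f = (c_down - c_up) / (c_efl - c_up)
    if f <= 0.0:
        # No detectable effluent signal, so the dilution factor 1/f is undefined.
        return None
    return f, 1.0 / f


# Hypothetical chloride concentrations in mg/L, for illustration only.
result = mixing_fraction(c_up=2.0, c_down=2.5, c_efl=52.0)
if result is not None:
    f, dilution = result
    print(f"effluent fraction f = {f:.4f}, dilution factor D = {dilution:.0f}")
```

Returning `None` for infeasible records, instead of clipping `f` into range, keeps those records visible for descriptive reporting while excluding them from estimation.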
Given $f$ from Model 1, the extent to which mixing alone explained the observed downstream concentration of a reactive indicator was evaluated using the following (4):
$$\theta = \frac{C_{down} - C_{up}}{f\,(C_{efl} - C_{up})} \qquad (4)$$
Values $\theta \approx 1$ indicate consistency with mixing; $0 < \theta < 1$ indicate net removal (attenuation) between stations in excess of dilution; $\theta > 1$ indicate net generation/additional inputs (i.e., downstream concentrations above the conservative mixing prediction) or an additional source term, or a possible underestimation of $f$ under the sampling conditions; $\theta < 0$ indicate a downstream change opposite in direction to the mixing prediction (downstream concentrations below the mixing prediction), implying strong removal or data misalignment. The diagnostic was not evaluated for the tracer used to obtain $f$, for which $\theta$ equals 1 by construction. Indicators considered included COD, Ammonium, Phosphates, Nitrite, and Nitrate, with matching definitions and units across river and effluent.
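The diagnostic can be sketched as the ratio of the observed upstream-to-downstream change to the change expected from mixing alone. This form is an assumption: it equals 1 on the mixing line and turns negative when the observed change runs against the mixing prediction, which matches the interpretation above. The function names, the tolerance `tol`, and the numbers are hypothetical.

```python
def theta(c_up: float, c_down: float, c_efl: float, f: float) -> float:
    """Observed downstream change divided by the change predicted by mixing."""
    expected_change = f * (c_efl - c_up)
    if expected_change == 0.0:
        raise ValueError("mixing predicts no change; theta is undefined")
    return (c_down - c_up) / expected_change


def interpret(value: float, tol: float = 0.1) -> str:
    """Map theta to a qualitative class; tol is an illustrative tolerance."""
    if value < 0.0:
        return "change opposite to mixing prediction"
    if abs(value - 1.0) <= tol:
        return "consistent with mixing"
    return "net attenuation" if value < 1.0 else "net generation or extra input"


# Hypothetical reactive-indicator values (mg/L) with an assumed f of 0.01.
value = theta(c_up=1.0, c_down=1.02, c_efl=5.0, f=0.01)
print(round(value, 2), interpret(value))
```

Because the expected change scales with `f`, a very small effluent fraction makes the ratio sensitive to concentration differences close to the analytical detection limit.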
To summarize multi-indicator upstream-to-downstream change without requiring effluent data, relative changes were computed for each indicator $i$ in a site/date group as follows (5):
$$r_i = \frac{C_{down,i} - C_{up,i}}{\max(C_{up,i}, \varepsilon)} \qquad (5)$$
with a small constant $\varepsilon > 0$ to avoid division by zero. Let $\bar{r}$ denote the arithmetic mean of available $r_i$ and $r_{\max}$ the maximum (most positive) value. The composite, unitless index (6) was defined from $\bar{r}$ and $r_{\max}$.
Larger values of the index correspond to greater overall downstream deviation (averaged and in the extreme positive direction). The sign of $\bar{r}$ conveys the dominant direction of change. As a sensitivity alternative, $\max_i |r_i|$ may replace $r_{\max}$ to obtain a symmetric variant with equal sensitivity to large negative and positive departures; the primary analysis used the one-sided form above.
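The two summary statistics that feed the composite index can be sketched as follows. The floor `eps` on the denominator, the function names, and the sample values are assumptions made for illustration, and the composite index itself is deliberately not computed here because its exact combination rule is not reproduced in this sketch.

```python
from typing import Dict, Tuple


def relative_changes(pairs: Dict[str, Tuple[float, float]],
                     eps: float = 1e-9) -> Dict[str, float]:
    """Relative upstream-to-downstream change per indicator.

    pairs maps an indicator name to (c_up, c_down); eps floors the denominator.
    """
    return {name: (down - up) / max(up, eps)
            for name, (up, down) in pairs.items()}


def summarize(changes: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (mean change, most positive change, largest absolute change)."""
    values = list(changes.values())
    mean_change = sum(values) / len(values)
    return mean_change, max(values), max(abs(v) for v in values)


# Hypothetical paired concentrations (upstream, downstream) in mg/L.
sample = {"chlorides": (10.0, 11.0), "nitrates": (4.0, 3.0)}
mean_r, max_r, max_abs_r = summarize(relative_changes(sample))
print(mean_r, max_r, max_abs_r)
```

The third returned value corresponds to the symmetric sensitivity variant, which responds equally to large decreases and large increases.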
The objective of further analysis was to build and compare low-complexity predictive models for the treated effluent indicators BOD and COD under a very small sample size (N = 10), with a preference for methods that enforce strong regularization and limit degrees of freedom. The input variables comprised influent physicochemical measurements recorded before treatment: BOD (before), COD (before), Suspended solids, Ammonium salt, Nitrites, Nitrates, Chlorides, Sulfates, Phosphates, Synthetic surfactants, Petroleum hydrocarbons, Total mineralization (dry residue), and site indicator (point). Point was treated as a categorical predictor and encoded with indicator variables after dropping the reference level. The output variables were the two effluent responses measured after treatment: BOD (after) and COD (after). All numeric predictors were standardized to zero mean and unit variance prior to multivariate modelling.
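The preprocessing described above can be sketched with the standard library alone. The use of the sample standard deviation (n − 1 denominator), the function names, and the example values are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Scale a numeric predictor to zero mean and unit (sample) variance."""
    centre, spread = mean(column), stdev(column)
    if spread == 0:
        raise ValueError("constant column cannot be standardized")
    return [(value - centre) / spread for value in column]


def encode_site(labels: Sequence[str], reference: str) -> List[List[float]]:
    """Indicator (dummy) columns for every site level except the reference."""
    levels = sorted(set(labels) - {reference})
    return [[1.0 if label == level else 0.0 for level in levels]
            for label in labels]


# Hypothetical influent values and site labels, for illustration only.
print(standardize([1.0, 2.0, 3.0]))
print(encode_site(["TF1", "TF2", "TF1"], reference="TF1"))
```

With two sites, dropping the reference level leaves a single indicator column, so the site effect costs one degree of freedom.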
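Let $y$ denote the response (either BOD (after) or COD (after)), and let $X$ be the standardized predictor matrix (upstream indicators and, where applicable, site indicator variables), with coefficient vector $\beta$ and intercept $\beta_0$. The general linear formulation is
$$y = \beta_0 \mathbf{1} + X\beta + e$$
where $e$ is the residual vector. For the linear families, estimation can be written as penalized least squares:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \; \frac{1}{2N} \lVert y - \beta_0 \mathbf{1} - X\beta \rVert_2^2 + P_\lambda(\beta)$$
where the intercept $\beta_0$ is not penalized. The penalty term $P_\lambda(\beta)$ differentiates the methods:
Baseline-Linear (OLS): $P_\lambda(\beta) = 0$.
Ridge: $P_\lambda(\beta) = \lambda \lVert \beta \rVert_2^2$.
Lasso: $P_\lambda(\beta) = \lambda \lVert \beta \rVert_1$.
Elastic-Net: $P_\lambda(\beta) = \lambda \left( \alpha \lVert \beta \rVert_1 + \tfrac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)$, with $\alpha \in (0, 1)$ controlling the L1/L2 balance.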
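To make the difference between the four penalties concrete, the sketch below evaluates each one for a given coefficient vector. The scaling constants (no factor on the ridge term, a factor of one half on the L2 part of the elastic net) follow one common convention and are an assumption; the function name and values are hypothetical.

```python
from typing import Sequence


def penalty(beta: Sequence[float], lam: float, method: str,
            alpha: float = 0.5) -> float:
    """Penalty value for a coefficient vector; the intercept is excluded."""
    l1 = sum(abs(b) for b in beta)      # L1 norm
    l2_squared = sum(b * b for b in beta)  # squared L2 norm
    if method == "ols":
        return 0.0
    if method == "ridge":
        return lam * l2_squared
    if method == "lasso":
        return lam * l1
    if method == "elastic_net":
        return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)
    raise ValueError(f"unknown method: {method}")


# Hypothetical coefficients and regularization strength.
beta = [3.0, -4.0]
for name in ("ols", "ridge", "lasso", "elastic_net"):
    print(name, penalty(beta, lam=0.1, method=name))
```

Setting `alpha=1.0` in the elastic-net branch reproduces the lasso value, which is why the two can share one fitting routine.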
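For the Fractional-Logit (Ratio) model, the response is the truncated removal ratio $\rho_i \in (\delta, 1 - \delta)$ with conditional mean $\mu_i = 1 / \left(1 + \exp\left(-(\beta_0 + x_i^{\top}\beta)\right)\right)$ and a small truncation constant $\delta > 0$; parameters are estimated by maximizing the binomial quasi-likelihood (equivalently, minimizing the negative log-likelihood):
$$\ell(\beta_0, \beta) = \sum_{i=1}^{N} \left[ \rho_i \log \mu_i + (1 - \rho_i) \log (1 - \mu_i) \right]$$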
For PLS with $A$ latent components, $X$ is projected to a lower-dimensional score matrix $T = XW$ ($A = 1$ or $A = 2$), and the final regression is obtained by least squares in the latent space:
$$\hat{y} = \hat{\beta}_0 \mathbf{1} + T\hat{q}$$
with $W$ constructed to capture directions in $X$ that are most informative for predicting $y$.
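For a single response and one latent component, the projection and the latent-space regression reduce to a few dot products. The sketch below is a plain-Python illustration of that special case, assuming the predictor columns are already centred; it is not the routine used in the study, and the function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def pls1_one_component(X: Sequence[Sequence[float]],
                       y: Sequence[float]) -> Tuple[float, List[float]]:
    """One-component PLS for a single response; X columns must be centred."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(X[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        raise ValueError("predictors are uncorrelated with the response")
    w = [v / norm for v in w]
    # Scores t = X w, then least squares of y on t in the latent space.
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    # Coefficients expressed back in the original predictor space.
    return y_mean, [q * wj for wj in w]


def predict(intercept: float, beta: Sequence[float],
            row: Sequence[float]) -> float:
    return intercept + sum(b * x for b, x in zip(beta, row))


# Hypothetical centred predictors and responses.
intercept, beta = pls1_one_component([[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]],
                                     [1.0, 2.0, 4.0])
print(intercept, beta, predict(intercept, beta, [0.5, 0.0]))
```

With a single predictor column the result coincides with ordinary least squares, which gives a quick sanity check for the implementation.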
Five model families were estimated. Because the dataset is very small (N = 10) relative to the number of candidate predictors, and because strong collinearity is expected among routine water-quality indicators, we compared a small set of complementary low-complexity frameworks that represent different trade-offs aligned with the study objectives. The baseline linear model serves as an interpretable benchmark; Ridge regression improves prediction stability under multicollinearity; LASSO provides sparse variable selection to identify a minimal predictor set; Elastic Net balances stability and sparsity when predictors are correlated and may act in groups; and PLS offers a latent-variable alternative that reduces dimensionality while preserving covariance with the response. All model types were evaluated under the same LOOCV protocol to select the most accurate yet parsimonious specification for operational forecasting.
First, baseline linear level models related the outputs to their corresponding inputs with an intercept and, when present, a fixed effect for site, that is, BOD (after) ~ BOD (before) (+site) and COD (after) ~ COD (before) (+site). Second, a fractional specification modeled the removal ratio using a binomial generalized linear model with a logit link; ratios were truncated to the open interval $(0, 1)$ to stabilize estimation near the boundaries, and predictions on the level scale were obtained by back-transformation of the predicted ratio. Third, Ridge regression was fit with fitrlinear using least-squares loss, ridge ($L_2$) regularization, an estimated bias term, and the L-BFGS solver; the regularization strength $\lambda$ was selected from a 60-point logarithmic grid based on leave-one-out performance. Fourth, Lasso and Elastic Net were fit with lasso using $\alpha = 1$ and $0 < \alpha < 1$, respectively, cross-validated with leave-one-out by setting the number of folds equal to the sample size, with external standardization disabled because predictors had already been scaled, and with the final $\lambda$ chosen at the minimum cross-validated mean squared error while retaining the intercept returned by the routine. Fifth, PLS regression with one and two latent components was estimated with plsregress, and predictions were formed as $\hat{y} = \hat{\beta}_0 + X\hat{\beta}$ on the standardized design.
Model selection and hyperparameter tuning were carried out using leave-one-out cross-validation (LOOCV) due to the small sample size (N = 10). In each LOOCV fold, the model was trained on N − 1 observations and evaluated on the single held-out observation. Hyperparameters (e.g., the regularization strength for Ridge/Lasso/Elastic-Net and the number of latent components for PLS) were selected to minimize the LOOCV RMSE. Accuracy was quantified by the mean absolute error (MAE), the root mean squared error (RMSE), and the coefficient of determination ($R^2$) computed on the held-out observations, and the primary criterion for choosing a winner for each endpoint was the lowest LOOCV RMSE. In the next stage of the analyses, the best-performing model for each endpoint will be used to generate operational predictions accompanied by prediction intervals; residual diagnostics will be performed with attention to leverage and influential points; sensitivity to the regularization strength and the number of latent components will be assessed; and robustness will be checked with a leave-one-out-of-cluster strategy if site structure is present. If additional observations become available, external validation on new measurements will be conducted, hierarchical specifications with location-level random effects will be considered, and a simple mechanistic mixing benchmark will be used to contextualize level predictions of BOD (after) and COD (after).
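The LOOCV protocol can be sketched for the simplest case of one predictor with a ridge penalty, where the fit has a closed form. This is an illustrative stand-in, not the study's pipeline: the penalty scaling, the function names, the candidate grid, and the data are all assumptions.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple


def fit_ridge_1d(x: Sequence[float], y: Sequence[float],
                 lam: float) -> Callable[[float], float]:
    """Ridge fit with one predictor; the intercept is not penalized."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)
    return lambda x_new: y_bar + slope * (x_new - x_bar)


def loocv_predictions(x: List[float], y: List[float], lam: float) -> List[float]:
    """Predict each observation from a model trained on the other N - 1."""
    preds = []
    for i in range(len(x)):
        model = fit_ridge_1d(x[:i] + x[i + 1:], y[:i] + y[i + 1:], lam)
        preds.append(model(x[i]))
    return preds


def metrics(y: Sequence[float], preds: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MAE, RMSE, R^2) computed on held-out predictions."""
    n = len(y)
    errors = [p - t for p, t in zip(preds, y)]
    sse = sum(e * e for e in errors)
    y_bar = sum(y) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return sum(abs(e) for e in errors) / n, sqrt(sse / n), 1.0 - sse / ss_tot


def select_lambda(x: List[float], y: List[float], grid: Sequence[float]) -> float:
    """Pick the grid value with the lowest LOOCV RMSE."""
    return min(grid, key=lambda lam: metrics(y, loocv_predictions(x, y, lam))[1])


# Hypothetical data and candidate grid, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
best = select_lambda(x, y, grid=[0.0, 0.1, 1.0, 10.0])
print(best, metrics(y, loocv_predictions(x, y, best)))
```

Computing $R^2$ on held-out predictions can yield negative values when a model predicts worse than the sample mean, which is informative at N = 10 and should not be truncated.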
3. Results and Discussion
Biological treatment facilities serve as a barrier protecting surface water bodies from the entry of pollutants, thereby contributing to the preservation of natural hydrobiocenoses. Only under artificially created conditions in biological treatment facilities can the waste products generated by biocenosis communities and large volumes of pollutants in wastewater be rapidly and effectively processed [40]. The average annual indicators of wastewater treatment efficiency at these treatment facilities are presented in Table 1 and Table 2.
According to the water-quality parameters measured at the influent and effluent of TF1 and TF2, the removal efficiencies were determined (Table 3).
The analysis of the data in Table 3 shows that the removal efficiency differs between parameters over the 2020–2024 period of operation, while for each individual parameter it remains similar from year to year. For example, the removal efficiency for BODfull at TF1 ranged from 95.9% (2020) to 94.61% (2024), and at TF2 from 82.79% (2020) to 76.42% (2024). The decrease in removal efficiency for some indicators may be explained by the long period of operation.
Discharge standards for pollutants, approved by the Department of Natural Resources and Environmental Regulation of the East Kazakhstan Region, are established at the discharge points. Based on the data from Table 1 and Table 2, diagrams were constructed showing pollutant concentrations that exceed the permissible concentrations (PC) at the discharge points according to the established discharge standards.
Figure 3 presents data on BODfull, Figure 4 on ammonium salt, Figure 5 on nitrates, Figure 6 on chlorides, Figure 7 on sulfates, and Figure 8 on phosphates. The graphs show data from treatment facilities TF1 and TF2 for the years 2020–2024.
For the BOD5 indicator at treatment facilities No. 1 and No. 2, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2022. At facility No. 1, exceedances were observed in 2023 and 2024 by 1.16 and 1.21 times, respectively. At facility No. 2, exceedances occurred in 2023 and 2024 by 1.21 and 1.39 times, respectively.
For the ammonium salt indicator at treatment facilities No. 1 and No. 2, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2022. At facility No. 1, exceedances were observed in 2023 and 2024 by 1.41 and 1.63 times, respectively. At facility No. 2, exceedances occurred in 2023 and 2024 by 1.15 and 1.26 times, respectively.
For nitrate levels at treatment facility No. 1, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2023, with an exceedance of 1.03 times noted in 2024. At facility No. 2, no exceedances of PC for nitrates were observed.
For chloride levels at treatment facility No. 1, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2022, while exceedances of 1.27 and 1.22 times were noted in 2023 and 2024, respectively. At facility No. 2, no exceedances of PC for chlorides were observed.
For sulfate levels at treatment facility No. 1, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2022, while exceedances of 1.14 and 1.21 times were observed in 2023 and 2024, respectively. At facility No. 2, a sulfate PC exceedance of 1.05 times was recorded in 2024.
For phosphate levels at treatment facility No. 1, no exceedances of permissible concentrations (PC) were recorded from 2020 to 2022, while exceedances of 1.83 and 2.06 times were observed in 2023 and 2024, respectively. At facility No. 2, no exceedances of PC for phosphates were noted.
For monitored parameters at the discharge points of treatment facilities No. 1 and No. 2, such as nitrites, suspended solids, synthetic surfactants, and petroleum hydrocarbons, no exceedances of permissible concentrations (PC) were observed.
Table 4 and Table 5 present the background concentrations at the control sections upstream and downstream of the discharge points of the studied treatment facilities.
The data analysis shows that pollutant concentrations at the control sections do not exceed the maximum permissible concentrations for water bodies designated for fisheries. This can be attributed to the high water volume of the Irtysh River, dilution of discharges, and the river’s self-purification capacity.
The author of [24] notes that during the period 2001–2011, the water of the Irtysh River was at a “normatively clean” level according to domestic and household standards. Krupa et al. [1] also state that the maximum content of phosphates and nitrate nitrogen in the river in 2023 was low. The results of the present study (2024–2025) likewise classify the water of the Irtysh River as “normatively clean.” This is explained by the assimilative properties of river water, as well as the precipitation of pollutants in the cascade of reservoirs.
The discharge of domestic wastewater from small settlements, treated by conventional methods at facilities with a long service life, into low-flow water bodies may lead to adverse environmental conditions. Therefore, it is important to emphasize the need for the reconstruction of existing treatment facilities or the implementation of new, modern water treatment technologies. This will enhance treatment efficiency, reduce anthropogenic impact, and ultimately help preserve the ecological state of the Irtysh River.
Effluent fractions inferred from conservative tracers indicated very strong dilution at both sites. Using chloride as the conservative tracer, reference effluent fractions were obtained for facility TF1 and facility TF2. Candidate tracers were screened for convexity and feasibility; infeasible records (e.g., sulfates at facility TF1) were excluded from the reference fraction (Table 6).
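The tracer screening described above can be illustrated with a minimal sketch. It assumes the standard two-end-member mixing relation, in which the effluent fraction is f = (C_down − C_up) / (C_eff − C_up) and the dilution factor is D = 1/f; the exact formulas used in the study are not restated in this section, so both relations, the function names, and the concentrations below are assumptions for illustration only. Convexity is read here as the downstream value lying between the upstream and effluent values, which is equivalent to 0 < f < 1.

```python
def effluent_fraction(c_up, c_down, c_eff):
    """Fraction of effluent in the downstream sample under two-end-member mixing (assumed relation)."""
    span = c_eff - c_up
    if span == 0:
        raise ValueError("effluent and upstream concentrations coincide; tracer carries no signal")
    return (c_down - c_up) / span


def is_feasible(f):
    """A physically meaningful fraction lies strictly between 0 and 1."""
    return 0.0 < f < 1.0


def dilution_factor(f):
    """Dilution factor as the reciprocal of the effluent fraction (assumed definition)."""
    if not is_feasible(f):
        raise ValueError("dilution factor is only defined for a feasible fraction")
    return 1.0 / f


# Hypothetical tracer concentrations in mg/L (not the study's data).
f_tracer = effluent_fraction(c_up=12.0, c_down=12.05, c_eff=62.0)
print(is_feasible(f_tracer), round(dilution_factor(f_tracer)))

# A record whose downstream value falls below the upstream value is infeasible
# for an effluent that is more concentrated than the river, and would be excluded.
f_bad = effluent_fraction(c_up=40.0, c_down=39.0, c_eff=90.0)
print(is_feasible(f_bad))
```

A record such as the second one would be dropped from the reference fraction rather than forced into the 0 to 1 range, which mirrors the exclusion of infeasible tracer records mentioned in the text.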
Using the chloride-derived reference fraction, the transformation diagnostic θ was computed for reactive indicators (Table 7). For facility TF1, values spanned from the lowest value, obtained for nitrates, to 51.53 (nitrites); selected results were COD = 13.32, ammonium = 42.14, phosphates = 6.18, and dry residue = 2.04. For facility TF2, the range was from a negative minimum (nitrites) to 2.59 (dry residue); selected results were COD = 1.56, ammonium = 0.15, and phosphates = 0.09, with a negative value for nitrates. Values for the tracer used to obtain the reference fraction are not reported (for that indicator θ = 1 by definition).
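How such a diagnostic behaves can be sketched as follows. The definition used here, θ = (C_down − C_up) / (f · (C_eff − C_up)), is an assumption chosen so that θ = 1 on the mixing line; the study's own definition is given elsewhere and may differ in detail. All concentrations are hypothetical.

```python
def theta(c_up, c_down, c_eff, f_ref):
    """Observed downstream increment divided by the increment predicted by pure mixing (assumed form)."""
    predicted_increment = f_ref * (c_eff - c_up)
    if predicted_increment == 0:
        raise ValueError("mixing predicts no change; theta is undefined")
    return (c_down - c_up) / predicted_increment


F_REF = 0.001  # hypothetical reference effluent fraction from a conservative tracer

# Hypothetical indicator: upstream 2.0, effluent 102.0, so mixing predicts +0.1 downstream.
on_line = theta(c_up=2.0, c_down=2.1, c_eff=102.0, f_ref=F_REF)    # close to 1: pure mixing
excess = theta(c_up=2.0, c_down=2.3, c_eff=102.0, f_ref=F_REF)     # above 1: extra input or poor mixing
attenuated = theta(c_up=2.0, c_down=2.05, c_eff=102.0, f_ref=F_REF)  # between 0 and 1: net attenuation
removal = theta(c_up=2.0, c_down=1.9, c_eff=102.0, f_ref=F_REF)    # negative: downstream below upstream

for label, value in [("mixing", on_line), ("excess", excess),
                     ("attenuated", attenuated), ("removal", removal)]:
    print(f"{label:>10}: theta = {value:.2f}")
```

Because the predicted increment scales with a very small fraction, a downstream difference of a few hundredths of a unit is enough to move θ by a large amount in this sketch, which is the sensitivity the text cautions about.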
Using all indicators with paired upstream–downstream values (19 per site), a composite index, the mean relative change, and the largest relative increase among indicators were computed for each facility. The mean relative changes indicated small net increases downstream on average at both facilities.
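The two summary statistics named above, the mean relative change and the largest relative increase, can be sketched as below. The relative change is taken as (downstream − upstream) / upstream, which is an assumption; the composite index itself is not reproduced because its definition is not restated in this section. Indicator names and values are hypothetical.

```python
def relative_changes(upstream, downstream):
    """Relative upstream-to-downstream change per indicator, for indicators present in both samples."""
    changes = {}
    for name, up in upstream.items():
        if name in downstream and up != 0:
            changes[name] = (downstream[name] - up) / up
    return changes


def summarize(changes):
    """Return (mean relative change, largest relative increase)."""
    values = list(changes.values())
    return sum(values) / len(values), max(values)


# Hypothetical paired concentrations (not the study's data).
upstream = {"indicator_a": 2.0, "indicator_b": 4.0, "indicator_c": 10.0}
downstream = {"indicator_a": 3.0, "indicator_b": 3.0, "indicator_c": 10.0}

mean_change, largest_increase = summarize(relative_changes(upstream, downstream))
print(mean_change, largest_increase)
```

A small mean alongside a large maximum, as in this toy case, is the pattern the text describes: modest aggregate change with appreciable departures for individual indicators.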
The tracer-based mixing/dilution calculation indicates that the effluent signal arrived at the downstream cross-sections under very strong dilution at both facilities. Using chloride as the preferred conservative tracer, the reference effluent fractions corresponded to dilution factors of D ≈ 2.0 × 10³ (facility TF1) and D ≈ 4.2 × 10² (facility TF2).
Against this dilution background, departures of reactive indicators from the mixing line were quantified using the transformation diagnostic θ, which compares observed downstream concentrations with values predicted by mixing at the site-specific reference fraction. At facility TF1, several constituents exceeded the mixing expectation substantially, with θ = 13.32 for COD, θ = 42.14 for ammonium, and θ = 6.18 for phosphates. These values imply downstream concentrations higher than can be explained by entrainment alone and are consistent with additional in-reach inputs (e.g., lateral inflows, bank returns), short-term storage–release dynamics (e.g., sediment–water exchanges), or incomplete transverse mixing at the sampling distance. In contrast, nitrate exhibited the lowest θ at this facility (Table 7), indicating concentrations below the mixing prediction, consistent with strong net removal over the monitored reach or with temporal misalignment between river and effluent samples. At facility TF2, values clustered nearer unity: COD showed only a modest excess over mixing (θ = 1.56), whereas ammonium and phosphates were below unity (θ = 0.15 and θ = 0.09, respectively), indicating net attenuation along the reach; nitrite and nitrate displayed negative θ, again suggestive of removal. Two numerical aspects merit attention. First, very small reference fractions (large D) affect the denominator of θ, increasing its sensitivity to small concentration differences; nonetheless, the consistent directional signals across multiple indicators support a process-based interpretation. Second, indicators with mismatched analytical definitions between effluent and river samples were excluded from this diagnostic to avoid bias.
A multi-indicator summary of upstream-to-downstream change was expressed by a unitless composite index, computed separately for facility TF1 and facility TF2. Despite the large θ excursions for selected indicators, the mean relative changes were modest, while the largest relative increases explain why the composite index settles near 0.5. This reconciliation is informative: the plume is strongly diluted overall, yet individual reactive constituents can deviate appreciably, positively or negatively, from the mixing line at local scales. These findings are coherent with a system where hydrodynamic dilution and self-purification dominate the aggregate signal, while site-specific sources/sinks or incomplete mixing shape the behavior of particular indicators. Practical implications follow. Verification of full mixing (e.g., by short-range logging of conductivity/chloride and, if possible, an intermediate station within the nominal mixing zone) would reduce uncertainty in the effluent fraction and, consequently, in θ. Improved temporal synchrony between effluent and river sampling would clarify the negative θ cases for oxidized nitrogen. Where flows are available, converting D to discharge ratios would strengthen the physical interpretation of dilution differences between facilities. Overall, the dilution calculation provides a physically grounded baseline; the transformation diagnostic identifies indicators and sites where biogeochemical processes or additional inputs are likely; and the composite index confirms that, despite notable departures for select constituents, the aggregate downstream impact remains modest.
In this stage, seven model specifications were evaluated under leave-one-out cross-validation (LOOCV): Baseline-Linear, Fractional-Logit (Ratio), Ridge, Lasso (α = 1), Elastic-Net (α = 0.5), PLS-1, and PLS-2. The full set of quality indicators for BOD and COD is reported in Table 8. Briefly, for BOD the Lasso model achieved the lowest cross-validated error (RMSE = 0.626, MAE = 0.459, R² = 0.976), closely followed by Elastic-Net (RMSE = 0.655, MAE = 0.411, R² = 0.974), with Ridge and PLS yielding intermediate accuracy and Baseline-Linear and Fractional-Logit performing worse. For COD, Lasso again ranked first (RMSE = 0.795, MAE = 0.634, R² = 0.997), Elastic-Net placed second, Ridge and PLS formed a middle tier, the Baseline-Linear model was weaker, and the Fractional-Logit specification performed poorly after back-transformation to the level scale (RMSE = 18.978, R² < 0). All R² values reported in Table 8 are based on LOOCV (out-of-sample) predictions, not in-sample fits.
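To make the distinction between out-of-sample and in-sample R² concrete, the sketch below computes RMSE, MAE, and R² from leave-one-out predictions. A one-predictor least-squares line stands in for the seven specifications purely for brevity, and R² is taken as 1 − SS_res/SS_tot with SS_tot measured around the mean of all observations; both choices, the function names, and the ten data points are assumptions and not the study's pipeline or data.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loocv_predictions(xs, ys):
    """Predict each observation from a model fitted on all the other observations."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(intercept + slope * xs[i])
    return predictions


def loocv_metrics(ys, predictions):
    """Return (RMSE, MAE, R2) computed on held-out predictions."""
    n = len(ys)
    errors = [p - y for p, y in zip(predictions, ys)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return sqrt(ss_res / n), sum(abs(e) for e in errors) / n, 1.0 - ss_res / ss_tot


# Ten hypothetical upstream/treated pairs (N = 10 as in the text; values are invented).
upstream = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
treated = [3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.2, 16.9, 19.1, 20.8]

rmse, mae, r2 = loocv_metrics(treated, loocv_predictions(upstream, treated))
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}, R2 = {r2:.3f}")
```

Under this convention R² is not bounded below by zero: held-out predictions that are worse than simply predicting the overall mean give SS_res > SS_tot and hence a negative value, which is how a result such as R² < 0 can arise for a poorly performing specification.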
In the winning Lasso models, only a small subset of upstream indicators carried non-zero coefficients, which clarifies where most of the predictive signal resides. For the BOD model, the retained predictors were Phosphates (β = 3.1202), Chlorides (β = −1.0255), and COD (β = 0.0088). On the standardized input scale this pattern suggests that higher phosphate content co-varies with higher treated BOD, whereas higher chloride, acting as a conservative tracer of dilution, is associated with lower BOD; the small positive weight on COD likely reflects residual correlation between organic load measures. For the COD model, the active predictors were Nitrites (β = 9.8004), Sulfates (β = −3.2327), Suspended Solids (β = −2.0245), Petroleum Hydrocarbons (β = 1.9323), and COD (β = −1.5202). The strong positive coefficient for nitrites is consistent with episodes of incomplete nitrification co-occurring with elevated oxygen demand, while the negative coefficients for sulfates and suspended solids point to conditions where dilution or efficient solids removal reduces effluent COD; the positive petroleum-hydrocarbons coefficient indicates co-movement with COD, as expected for hydrophobic organic fractions. Because predictors were standardized before fitting, coefficient magnitudes provide a relative importance ranking (nitrites ≫ sulfates ≈ suspended solids for COD; phosphates > chlorides for BOD), but given the very small sample (N = 10) and the use of penalization, these coefficients should be interpreted as associations rather than causal effects.
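Why a Lasso fit retains only a few predictors can be seen in a minimal coordinate-descent sketch. It minimizes (1/2n)·‖y − Xβ‖² + λ·‖β‖₁ on standardized predictors, where each update soft-thresholds a coefficient and sets it exactly to zero when its signal falls below λ. This is a generic textbook formulation, not the software used in the study; the penalty strength `lam` is a hypothetical tuning value (the α quoted in the text appears to be the elastic-net mixing weight rather than the penalty strength), and the toy data are invented.

```python
def standardize(columns):
    """Center each predictor column and scale it to unit (population) variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        scaled.append([(v - mean) / sd for v in col])
    return scaled


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become exactly zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(columns, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso; columns must already be standardized."""
    n = len(y)
    mean_y = sum(y) / n
    residual = [v - mean_y for v in y]  # centering y absorbs the intercept
    beta = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual after adding back its own contribution.
            rho = sum(x * (r + beta[j] * x) for x, r in zip(col, residual)) / n
            updated = soft_threshold(rho, lam)  # unit-variance columns need no rescaling
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, col)]
                beta[j] = updated
    return beta


# Two hypothetical predictors: the first has a strong effect, the second a weak one.
predictors = standardize([[1.0, 1.0, 3.0, 3.0], [5.0, 7.0, 5.0, 7.0]])
response = [6.5, 7.5, 12.5, 13.5]

print(lasso_cd(predictors, response, lam=0.0))  # unpenalized: both coefficients non-zero
print(lasso_cd(predictors, response, lam=1.0))  # penalized: the weak predictor drops out
```

In this toy case the penalty removes the weak predictor entirely and shrinks the strong one, which is the mechanism behind the short lists of retained indicators. Since every coefficient is shrunk, magnitudes from a penalized fit rank predictors but understate unpenalized effect sizes.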
The observed-versus-predicted scatterplot (Figure 9) shows a compact cloud of points aligned with the 1:1 identity line, indicating accurate level predictions across the observed range. A slight widening of the cloud at higher concentrations is visible and is consistent with the uncertainty expected under a sample size of ten, but no systematic bias is apparent.
Sensitivity checks confirmed the ranking: Elastic-Net trailed Lasso by a small margin, PLS(1) and PLS(2) showed only modest differences, and the ratio/logit specification under-performed for COD after back-transformation to levels.
Taken together, the mixing analysis and the statistical modeling tell a coherent story. The river’s large assimilative capacity and high dilution factors dominate the aggregate downstream signal, which is why background concentrations remain below regulatory thresholds. At the same time, several reactive constituents exhibit clear deviations from the mixing expectation at facility 1 and, to a lesser extent, at facility 2, consistent with local sources/sinks, short-term storage–release, or incomplete transverse mixing over the sampling distance. The sparse Lasso solutions reinforce this process view by isolating a few chemically meaningful predictors (e.g., oxidized nitrogen species for COD, phosphates and chlorides for BOD) that track effluent variability most closely.
Two limitations should be noted. First, the predictive sample is small (N = 10), so effect signs and magnitudes should be interpreted as associations rather than causal mechanisms; the chosen models address variance control through regularization and cross-validation, but additional data would sharpen estimates and allow external validation. Second, very small reference fractions (large D) affect the denominator of θ, which increases its sensitivity to small concentration differences; nonetheless, the directional consistency across multiple indicators supports the qualitative interpretation.
From a management perspective, three implications follow. Verification of full mixing at the downstream cross-sections (for example, by short-reach logging of conductivity or chloride, ideally with an intermediate station) would reduce uncertainty in the effluent fraction and, consequently, in θ. Synchronizing effluent and river sampling in time would clarify the negative θ cases for oxidized nitrogen. Where discharge data are available, converting D to flow ratios would strengthen the physical interpretation of dilution and help compare facilities. In parallel, modernizing small, long-serving treatment plants remains important: even under strong river dilution, localized departures from the mixing line demonstrate that upgrades that enhance ammonium oxidation and phosphorus removal would further reduce the potential for reach-scale excursions. Overall, the dilution calculation provides a physically grounded baseline, the transformation diagnostic identifies where biogeochemical processes or additional inputs are likely, and the predictive models deliver practical tools for forecasting effluent quality with quantified uncertainty.
4. Conclusions
The article presents an assessment of the impact of discharges from two operating treatment facilities with a conventional treatment scheme serving small settlements. The research results showed that although the anthropogenic load on the water body from small treatment facilities is relatively minor, it is still present.
For treatment facilities, permissible concentrations are established for discharges in accordance with the approved pollutant discharge standards. These standards are authorized by the Department of Natural Resources and Environmental Regulation of the East Kazakhstan Region. Exceedances of permissible concentrations were recorded for BODfull, ammonium salt, nitrates, chlorides, sulfates, and phosphates. No exceedances were observed for nitrites, suspended solids, synthetic surfactants, or petroleum hydrocarbons. This indicates the insufficient efficiency of the existing wastewater treatment technologies used in small-scale sewerage systems.
For BODfull, suspended solids, ammonium salt, phosphates, synthetic surfactants, nitrates, and COD, the purification effect is more than 90% for TF1 and more than 70% for TF2. For nitrites, the purification effect ranges from 6 to 20% for TF1 and from 64 to 77% for TF2. For both TF1 and TF2, the purification effect ranges from 10 to 50% for chlorides and from 12 to 58% for sulfates. Chlorides and sulfates are dissolved in water, and their quantity depends on their content in the drinking water supplied to consumers for household and drinking purposes. In terms of total mineralization, the purification effect ranges from 9 to 20% for TF1 and from 65 to 67% for TF2. For petroleum hydrocarbons, the purification effect ranges from 25 to 50% for TF1 and from 30 to 67% for TF2. These wastewater treatment plants have no dedicated technology for retaining petroleum hydrocarbons, but their concentrations do not exceed the PC at discharge.
A comparative analysis of background concentrations before and after the discharges confirms that the high water volume and significant dilution capacity of the Irtysh River allow for partial compensation of localized impacts. However, despite the river’s natural self-purification ability, maintaining a stable ecological condition in the river basin requires a reduction in anthropogenic load. This can be achieved through the reconstruction of existing treatment facilities with the implementation of modern technological solutions aimed at removing biogenic elements (phosphates, nitrogen compounds) and organic pollutants.
Thus, the modernization of small wastewater treatment facilities is a key condition for the sustainable development of coastal settlements and the preservation of ecological stability in the Irtysh River basin under increasing anthropogenic load.
This study shows that the downstream signal of the two investigated small-settlement facilities arrived under very strong dilution, as indicated by chloride-based reference fractions (D ≈ 2.0 × 10³ and 4.2 × 10²). Against this background, several reactive indicators at facility 1, most notably COD, ammonium, and phosphates, exceeded mixing expectations, whereas nitrate exhibited net removal; facility 2 displayed θ values closer to unity, indicating modest excess or attenuation. These outcomes reconcile a strongly diluted plume at the reach scale with constituent-specific departures that likely reflect in-reach sources/sinks, short-term storage–release, or incomplete transverse mixing at sampling distance.
From an operational perspective, low-complexity predictive models trained on upstream indicators performed strongly. Lasso regression yielded the most accurate forecasts for both BOD_after and COD_after, closely followed by Elastic-Net, while PLS and Ridge delivered intermediate accuracy. The winning models were sparse, relying on a small subset of upstream measurements, which simplifies routine monitoring and supports compact early-warning tools.
Three practical recommendations follow. First, tracer logging (e.g., conductivity/chloride) and, where feasible, an intermediate station would reduce uncertainty in effluent fractions and sharpen θ interpretation. Second, synchronizing river and effluent sampling times would improve diagnostics for oxidized nitrogen, which showed negative θ. Third, targeted modernization of small facilities—especially steps that enhance removal of biogenic elements and organic fractions—remains warranted, consistent with the manuscript’s broader argument on technology upgrades for small systems.