Article

The Operational Nitrogen Indicator (ONI): An Intelligent Index for the Wastewater Treatment Plant’s Optimization

by Míriam Timiraos 1,2,*, Antonio Díaz-Longueira 1,*, Esteban Jove 1,*, Óscar Fontenla-Romero 3 and José Luis Calvo-Rolle 1
1 Department of Industrial Engineering, University of A Coruña, CTC, CITIC, Campus de Esteiro, 15403 Ferrol, Spain
2 Fundación Instituto Tecnológico de Galicia, Department of Water Technologies, National Technological Center, 15003 A Coruña, Spain
3 Department of Computer Sciences and Information Technologies, University of A Coruña, LIDIA, CITIC, Campus de Esteiro, 15403 Ferrol, Spain
* Authors to whom correspondence should be addressed.
Processes 2025, 13(7), 2301; https://doi.org/10.3390/pr13072301 (registering DOI)
Submission received: 15 June 2025 / Revised: 11 July 2025 / Accepted: 16 July 2025 / Published: 19 July 2025
(This article belongs to the Special Issue Novel Recovery Technologies from Wastewater and Waste)

Abstract

In the context of wastewater treatment plant optimization, this study presents a novel approach based on a virtual sensor architecture designed to estimate total nitrogen levels in effluent and assess plant performance using an operational indicator. The core of the system is an intelligent agent that integrates real-time sensor data with machine learning models to infer nitrogen dynamics and anticipate deviations from optimal operating conditions. Central to this strategy is the operational nitrogen indicator (ONI), a weighted aggregation of four sub-indicators: legal compliance (Nreal%), the nitrogen dynamic trend (Tnitr%), removal efficiency (Enitr%), and microbial balance (NP%), each of which captures a critical dimension of the nitrogen removal process. The ONI enables the early detection of stress conditions and facilitates adaptive decision-making by quantifying operational status in terms of regulatory thresholds, biological requirements, and dynamic stability. This approach contributes to a shift toward smart wastewater treatment plants, where virtual sensing, autonomous control, and throttling-aware diagnostics converge to improve process efficiency, reduce operational risk, and promote environmental compliance.

1. Introduction

The contemporary reality of water scarcity, intensified due to the effects of climate change and sustained global population growth, has led to a significant increase in the demand for this resource, which has, in turn, resulted in increased wastewater generation [1,2,3]. This increase in wastewater production often exceeds the treatment capacity of many wastewater treatment plants (WWTPs), posing significant operational and environmental challenges [4,5]. Given this situation, optimizing WWTP operation has become an urgent need to ensure both regulatory compliance and efficient resource management [6].
Depending on the type of sewer network, these plants may receive domestic and industrial wastewater, and in the case of combined networks, also stormwater. This variability in the type and volume of water to be treated generates significant fluctuations in the pollutant load, especially during rainfall events, when the flow rate can rise sharply and exceed the facility’s treatment capacity [7]. The specific characteristics of each WWTP, as well as the temporal variability in the quality and quantity of incoming water, require the intensive and precise monitoring of various processes to optimize their performance and comply with regulatory standards [8,9].
Proper monitoring not only enables the real-time control of processes by tracking key parameters but also allows for energy optimization, the early detection of anomalies, efficient sludge management, operational adaptation to climatic and seasonal changes, and the control of the quality of treated water discharged into receiving bodies [6,7,8]. However, despite its many benefits, achieving a high level of monitoring entails significant financial investment. Therefore, it is crucial to reduce the number of physical sensors by identifying representative variables that maintain treatment quality without incurring high costs [10,11,12].
In this context, the use of virtual sensors emerges as an effective and economical alternative. These sensors are computational models that estimate the value of a variable from other already measured variables, eliminating the need for expensive physical devices. One of the most important parameters to monitor is total nitrogen since its presence in the final effluent is directly linked to water pollution and the environmental impact of discharges [13,14]. This parameter, which encompasses various forms of nitrogen, serves as a key indicator of organic load and the efficiency of biological processes within the plant, especially concerning the removal of nitrogen compounds [15,16]. However, its direct measurement requires sophisticated instruments, trained personnel, and specific analytical methods, which limit its continuous and widespread application in many facilities.
At the same time, the integration of intelligent agents offers an innovative and adaptable solution to these challenges. These agents, capable of interacting with sensors and actuators, analyzing data in real time, making decisions, and executing actions on the system, introduce a new dimension of automation and control in WWTPs [17,18]. Thanks to their ability to dynamically adapt to operational conditions, these intelligent systems can enhance treatment efficiency, optimize resource usage, and support regulatory compliance by monitoring critical variables such as total nitrogen [14]. Furthermore, these agents can operate autonomously based on predictive models built from selected variables or virtual sensors, allowing them to act as decision-support systems or even replace tasks that traditionally require constant human supervision [19,20].
Nevertheless, the most innovative and distinctive feature of this work is the introduction and use of operational indicators as key elements in decision-making for wastewater treatment systems [21]. These indicators are metrics derived from the analysis of multiple system variables, and they enable the identification, anticipation, and classification of critical operational states in which the plant may not be functioning within optimal parameters [22]. Unlike traditional approaches based solely on individual sensor values, stress indicators provide a more holistic and dynamic view of system behavior, facilitating a more intelligent and contextualized response to overload situations, variability in pollutant loads, or sensor failures [22].
This type of indicator not only helps detect anomalous operational conditions but also supports the implementation of preventive control strategies, adapting system resources according to the identified stress level and enabling operational adjustments in anticipation of major failures or efficiency losses [23]. Moreover, their use is essential for validating the reliability of data that feeds predictive models, increasing trust in the actions executed by an intelligent agent [6,9,15].
Although this proposal introduces a novel approach, it is important to compare it with other decision-making models applied in WWTPs. Among these, multi-criteria decision-making frameworks (MCDM), such as AHP (Analytic Hierarchy Process) or TOPSIS, are widely used to evaluate and prioritize treatment alternatives or operational strategies [24,25]. Likewise, artificial intelligence approaches or expert systems have been developed to provide operational recommendations based on historical data analysis [26]. Compared to these models, the approach presented in this work is distinguished by integrating a virtual sensor and a dynamic operational indicator into an autonomous agent capable of real-time adaptation and proactive decision-making. This integration enhances the response capability and reduces the dependence on predefined static rules.
In addition, the proposed model stands out for its simplicity and interpretability, thanks to the use of normalized indicators (0–100%) and measurable variables, making it more accessible to plant operators. While other methods may require large volumes of historical data or complex multi-objective optimization procedures, this framework is oriented toward real-time monitoring, adaptability, and efficient implementation over existing infrastructures.
Using regression and variable analysis techniques, a methodology is proposed that identifies the most relevant variables for estimating total nitrogen in the effluent and, based on this estimation, generates operational indicators that enable adaptive and robust system management. This approach leads to a smart plant model in which an autonomous agent incorporates a virtual sensor composed of a nitrogen prediction model and an operational indicator that quantifies the system’s stress level. This integration enables the agent to analyze the process status in real time, anticipate deviations, and execute or recommend corrective actions, thus contributing to operational optimization, cost reduction, regulatory compliance, and the sustainability of the treated water resource.
The paper is structured as follows: after this introduction, the proposed approach is presented. The case study and implementation of the agent model are then described, including experiments and results. Finally, conclusions are drawn, and future work is proposed.

2. Approach

2.1. Operational Nitrogen Indicator—ONI

The operational nitrogen indicator (ONI) is a composite metric designed to assess nitrogen behavior in wastewater treatment systems in real time. Its purpose is twofold: first, it allows for diagnosing the operational status of the biological process concerning nitrogen removal; second, it acts as an early warning mechanism for potential deviations, failures, or unstable conditions. The ONI is functionally integrated within a virtual sensor embedded in an intelligent agent, providing an adaptive decision-making tool to optimize operations, maintain efficiency, and ensure regulatory compliance.

2.1.1. Indicator Structure

The ONI is constructed from the weighted aggregation of four key sub-indicators, selected for their operational, regulatory, and microbiological relevance. Each of them captures a fundamental aspect of nitrogen behavior in the plant:
  • Nreal% —the percentage of the legal limit reached for total nitrogen in the effluent. Equation (1) shows how it can be calculated.
    $$\mathrm{Nreal}\% = \frac{N_{\mathrm{effluent\_current}}}{N_{\mathrm{legal\_limit}}} \times 100 \tag{1}$$
    This metric evaluates compliance with the effluent nitrogen limits established by environmental legislation. Values approaching 100% indicate proximity to legal thresholds and a heightened risk of regulatory violation. The weight in ONI is 35%.
  • Tnitr%—the temporal trend of total nitrogen. Equation (2) shows how it can be calculated.
    $$\mathrm{Tnitr}\% = \frac{\Delta N_{\mathrm{total}}}{\Delta t} \times 100 \tag{2}$$
    Calculated as the slope of a linear regression over a moving time window (typically 30–60 min), this indicator reflects the temporal evolution of nitrogen concentration. Rising trends suggest accumulation and potential instability in the biological process. The weight in ONI is 20%.
  • Enitr%—nitrogen removal. Equation (3) shows how it can be calculated.
    $$\mathrm{Enitr}\% = \frac{N_{\mathrm{inlet}} - N_{\mathrm{effluent}}}{N_{\mathrm{inlet}}} \times 100 \tag{3}$$
    This sub-indicator quantifies the process’s effectiveness in removing nitrogen from influent. High values (>75–80%) denote efficient operation; values below 60% typically indicate process underperformance or failure. The weight in ONI is 25%.
  • NP%—nitrogen-to-phosphorus ratio as a microbial balance indicator. Equation (4) shows how it can be calculated.
    $$\mathrm{NP} = \frac{N}{P}, \qquad \mathrm{NP}\% = f(\mathrm{NP}) \tag{4}$$
    The N/P ratio is used to assess the stoichiometric balance necessary for optimal microbial growth. Ideal values typically range between 10 and 16, in line with established biological nutrient removal guidelines. Deviations from this range may indicate stress conditions or nutrient imbalance. The weight in ONI is 20%.
From these sub-indicators, the ONI is calculated using Equation (5):
$$\mathrm{ONI} = 0.35 \times \mathrm{Nreal}\% + 0.20 \times \mathrm{Tnitr}\% + 0.25 \times \mathrm{Enitr}\% + 0.20 \times \mathrm{NP}\% \tag{5}$$
The assigned weights reflect the criticality of each parameter from both the regulatory perspective and the dynamic behavior of the system.
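The aggregation can be illustrated with a minimal Python sketch. It assumes the weights exactly as written in Equation (5) and takes the four sub-indicators as already normalized percentages; the function and parameter names are illustrative and do not come from the original work.

```python
# Weights as written in Equation (5); assumed adjustable per plant.
WEIGHTS = {"nreal": 0.35, "tnitr": 0.20, "enitr": 0.25, "np": 0.20}


def nreal_pct(n_effluent: float, n_legal_limit: float) -> float:
    """Equation (1): share of the legal limit reached, in percent."""
    if n_legal_limit <= 0:
        raise ValueError("legal limit must be positive")
    return n_effluent / n_legal_limit * 100.0


def enitr_pct(n_inlet: float, n_effluent: float) -> float:
    """Equation (3): nitrogen removal efficiency, in percent."""
    if n_inlet <= 0:
        raise ValueError("inlet nitrogen must be positive")
    return (n_inlet - n_effluent) / n_inlet * 100.0


def oni(nreal: float, tnitr: float, enitr: float, np_pct: float,
        weights: dict = WEIGHTS) -> float:
    """Equation (5): weighted sum of the four sub-indicators (each in percent)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (weights["nreal"] * nreal
            + weights["tnitr"] * tnitr
            + weights["enitr"] * enitr
            + weights["np"] * np_pct)


# Hypothetical readings: 12 mg/L in the effluent against a 15 mg/L limit,
# 40 mg/L at the inlet, a trend score of 10 and a balanced N/P score of 0.
example = oni(nreal_pct(12.0, 15.0), 10.0, enitr_pct(40.0, 12.0), 0.0)
```

Keeping the weights in a dictionary makes it straightforward to adapt them to a particular plant, which the text explicitly allows.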

2.1.2. Interpretation Ranges

ONI values are classified into three operating ranges that allow inferences to be drawn about the current state of the system and the need for intervention:
  • ONI = 0–30 → optimal state: stable operation, no intervention required.
  • ONI = 31–70 → attention status: Early deviations or suboptimal conditions are detected. Preventive operational adjustments are recommended.
  • ONI > 70 → critical status: Imminent risk of failure or legal noncompliance. Requires immediate or automated intervention.
This classification allows an intelligent agent to generate graded alerts and make decisions based on the level of severity detected.
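The three ranges map naturally onto a small classification function. The sketch below is only an illustration: the status labels are shorthand for the states listed above, and it assumes that fractional values between 30 and 31 belong to the attention range, a boundary the text does not specify.

```python
def classify_oni(oni_value: float) -> str:
    """Map an ONI value to one of the three operating ranges."""
    if oni_value < 0:
        raise ValueError("ONI cannot be negative")
    if oni_value <= 30:
        return "optimal"      # stable operation, no intervention required
    if oni_value <= 70:
        return "attention"    # preventive operational adjustments recommended
    return "critical"         # immediate or automated intervention required
```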

2.1.3. Technical–Operational and Regulatory Justification

The design of the ONI is based on a combination of legal, technical, and microbiological criteria, ensuring both the regulatory validity and operational relevance of the sub-indicators used. Each component of the ONI responds to a specific need of the biological treatment process and has been calibrated according to the standards applicable to WWTPs in the European Union and Spain (as applied to the specific case study).
Nreal%—Regulatory Compliance
This sub-indicator quantifies the percentage of the legal limit for total nitrogen reached in the effluent. The reference values come from Directive 91/271/EEC of the EU Council [27] on urban wastewater treatment and its transposition into Spanish law through Royal Decree 509/1996 [28]. This regulation establishes limits of 10 or 15 mg/L of total nitrogen, depending on the size of the plant and the discharge area. The 100% threshold represents the legal compliance limit; therefore, values close to or above this value reflect an immediate risk of violation.
Tnitr%—Dynamic Process Stability
The temporal gradient of total nitrogen is calculated using a linear regression of concentrations over the last 30–60 min, normalized to a range of 0–100. This metric reflects the cumulative trend of nitrogen and is especially useful as an early indicator of dynamic instability. Increasing Tnitr% values alert to potential failures or overloads in the biological stage before regulatory limits are breached. Its inclusion is aligned with proactive control approaches that aim to avoid using up the number of exceedances allowed over approximately one year of plant operation [28].
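A minimal sketch of this trend calculation follows. The least-squares slope is standard; the normalization to 0–100 is not specified in the text, so the `max_slope` scaling parameter and the choice to map flat or falling trends to 0 are assumptions made for illustration.

```python
def regression_slope(times_min, values):
    """Least-squares slope of values against time (concentration units per minute)."""
    n = len(times_min)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired samples")
    t_mean = sum(times_min) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times_min)
    if sxx == 0:
        raise ValueError("time stamps must not all be equal")
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times_min, values))
    return sxy / sxx


def tnitr_pct(times_min, values, max_slope):
    """Scale the slope to 0-100 (assumed scheme, not taken from the source).

    Flat or falling trends map to 0; slopes at or above max_slope map to 100.
    """
    if max_slope <= 0:
        raise ValueError("max_slope must be positive")
    slope = regression_slope(times_min, values)
    return min(max(slope / max_slope, 0.0), 1.0) * 100.0
```

For a 60 min window sampled every 15 min, `tnitr_pct([0, 15, 30, 45, 60], readings, max_slope)` would be re-evaluated each time a new reading arrives.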
Enitr%—Removal Efficiency
This sub-indicator assesses the fraction of nitrogen removed between the system inlet and outlet. According to Directive 91/271/EEC [27], WWTPs located in sensitive areas must achieve a removal efficiency greater than 70%. In practice, desirable operating values range between 75% and 85%, while values below 60% typically indicate system dysfunction (e.g., oxygen deficiency or nutrient imbalance). This metric is used as a proxy for the overall performance of the biological process.
NP%—Microbiological Balance
The balance between nitrogen and phosphorus (N/P ratio) is a key parameter for the growth and maintenance of microbial populations involved in nitrification and denitrification. Optimal values are in the range of 10–16, according to the Redfield ratio and classic studies on biological treatment [29]. Values outside this range can lead to nutrient limitations (NP < 10) or excessive nutrient accumulation (NP > 16), negatively affecting system stability and efficiency.
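The mapping f from the N/P ratio to NP% is not given explicitly. One possible form, shown purely as an assumption, scores 0 inside the 10–16 band and grows linearly with the distance from the band, saturating at 100; the `span` parameter that sets the saturation distance is hypothetical.

```python
def np_pct(n_total, p_total, low=10.0, high=16.0, span=10.0):
    """Assumed form of f(N/P): 0 inside [low, high], rising linearly outside.

    span is a hypothetical tuning value: the distance from the band at which
    the score saturates at 100.
    """
    if p_total <= 0:
        raise ValueError("phosphorus concentration must be positive")
    ratio = n_total / p_total
    if ratio < low:
        distance = low - ratio
    elif ratio > high:
        distance = ratio - high
    else:
        distance = 0.0
    return min(distance / span, 1.0) * 100.0
```

Under this choice a higher NP% means a stronger imbalance, which matches the convention that high ONI values signal stress.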
Weighting Criteria
The weighting of each sub-indicator within the ONI has been defined based on its direct impact on regulatory compliance, its predictive value in anticipating failures, and its feasibility in real-time calculations. The relative weighting of each component, as used in Equation (5), is as follows: Nreal%, 35%; Enitr%, 25%; NP%, 20%; and Tnitr%, 20%. These weightings can be adjusted based on the particular characteristics of each WWTP or the specific regulatory context [27,28].

2.1.4. Summary of the ONI’s Objectives

In addition to being a diagnostic metric, the ONI is an active element of the operational monitoring system. Its real-time calculation allows for the continuous assessment of the status of the biological process from multiple dimensions (regulatory, dynamic, microbiological, and efficiency), making it a central tool for operational decision-making.
The ONI architecture is modular and based on four weighted sub-indicators, each targeting a key operational aspect of nitrogen behavior in the plant. A summary of these sub-indicators is provided in Table 1.
The ONI enables different levels of response, depending on the detected status, from preventive recommendations in suboptimal conditions to the activation of emergency protocols in critical scenarios. Its modular, sub-indicator-based design facilitates its adaptation to different plant configurations or local conditions, and is also compatible with predictive models or advanced control systems.

2.2. Agent-Based System Approach

The proposed intelligent agent operates in the environment of a wastewater treatment plant, focusing on the effluent conditions. The system comprises three main components: sensors, actuators, and an intelligent decision-making agent. Figure 1 shows the agent approach schema.

2.2.1. Environment Description

The general environment in which the intelligent agent operates is a WWTP, a complex facility dedicated to the treatment of urban and industrial wastewater, to remove contaminants before discharging the treated effluent into the environment. In particular, this case study focuses on a specific WWTP, whose operational and environmental characteristics define the specific context of the agent’s operation (Section 3).
The WWTP environment presents various particularities that influence the behavior and decisions of the intelligent agent. It is a dynamic environment, as the inlet conditions of the wastewater constantly vary due to factors such as changes in the flow rate, the pollutant load, temperature, or specific events such as heavy rainfall or industrial discharges. Furthermore, it is a partially observable environment, given that not all relevant parameters can be measured continuously or directly, and sometimes, only estimates, historical data, or specific sensor readings are available. It is also stochastic, meaning that, even when the same actions are performed, the results may differ due to the unpredictable nature of the influent and the biological behavior of the treatment processes. Finally, in more advanced scenarios, the environment can take on a multi-agent nature, incorporating different intelligent modules that act in a coordinated manner in the different treatment stages, such as pretreatment, biological treatment, or disinfection.

2.2.2. Perception

The perception layer consists of a set of specialized sensors that allow real-time data capture for the characteristics of the effluent treated at the plant. In the WWTP corresponding to the case study, this layer specifically includes a total phosphorus sensor, a total Kjeldahl nitrogen (TKN) sensor, and a total nitrogen sensor. The first two are essential for assessing the presence of nutrients that, at high concentrations, can compromise the quality of the discharged water and generate negative environmental impacts, such as the eutrophication of receiving bodies. These sensors allow continuous measurements that accurately reflect the behavior of the biological and chemical treatment system, constituting a critical source of information for process control.
The data generated by these sensors are processed and normalized to maintain consistency throughout the different stages of analysis and decision-making. Thanks to this structure, the intelligent agent can operate in different scenarios, adapt to environmental variations, or be moved to other facilities with similar characteristics without requiring a complete redefinition of the model.

2.2.3. Decision-Making

At the core of the proposed system is an intelligent agent responsible for analyzing the information provided by the sensors and making decisions aimed at optimizing the operation of the wastewater treatment process. To do so, the agent relies on a virtual sensor, which integrates a nitrogen prediction model and the ONI, developed as a decision-support tool.
This virtual sensor estimates the total nitrogen concentration in the effluent from normalized, real-time data, allowing for an accurate characterization of the system’s operating status. The predictive model used is based on a black-box approach capable of learning complex relationships between multiple variables without requiring an explicit description of the physicochemical mechanisms involved. This adaptive capacity makes it a robust solution in the face of changes in operating conditions.
In addition to nitrogen prediction, the virtual sensor calculates the ONI, an indicator that reflects the system’s operational stress level. High ONI values indicate that the process is moving away from its optimal operating zone, acting as an early warning of potential deviations, failures, or inefficiencies.
With this information, the intelligent agent can anticipate problems and proactively propose or implement corrective measures, thereby improving both compliance with regulatory standards and overall processing efficiency.

2.2.4. Action

The system’s actuators play an essential role in implementing the corrective actions defined by the intelligent agent, as they allow various treatment process parameters to be modified in real time based on the system’s operating status. Among the most relevant actuators are the aeration valves, which regulate the oxygen supply and, therefore, directly influence the nitrification and denitrification processes by controlling bacterial activity. Depending on the stress indicator value, the agent can increase or decrease aeration through these devices, thereby optimizing nitrogen removal efficiency. Furthermore, the system can adjust internal recirculation and hydraulic retention time through flow controls, improving water contact with microorganisms and increasing the time available for biological reactions.
Beyond oxygen control, the system features additional actuators that allow for more specific and adaptive responses. For example, external carbon dosing systems (such as methanol or acetate) can be activated to reinforce denitrification when performance falls below the expected threshold. Inlet valves or bypass systems are also used to temporarily reduce organic loads in the event of critical events, such as nitrogen overload. Phosphate dosing pumps and sludge purge valves are also incorporated, allowing the system’s nutrient balance (N/P) to be adjusted, maintaining optimal ratios for microbial metabolism. Together, these actuators allow for the fine-tuned, dynamic control of the system, ensuring process stability and meeting effluent quality objectives based on the indicator results.
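How the agent might translate the indicator into actuator requests can be sketched as a small rule table. The rules below are an illustration only: the 60% efficiency threshold and the 10–16 N/P band are taken from Section 2.1, whereas the action names, the status labels, and the way the rules are combined are assumptions.

```python
def recommend_actions(status: str, enitr: float, n_to_p: float) -> list:
    """Return candidate actuator actions for the current operating state.

    status  : "optimal", "attention" or "critical" (assumed labels for the ONI ranges)
    enitr   : nitrogen removal efficiency in percent
    n_to_p  : nitrogen-to-phosphorus ratio
    """
    actions = []
    if status == "optimal":
        return actions                              # no intervention required
    actions.append("adjust_aeration")               # aeration valves
    if enitr < 60.0:
        actions.append("dose_external_carbon")      # reinforce denitrification
    if not 10.0 <= n_to_p <= 16.0:
        actions.append("adjust_nutrient_balance")   # phosphate dosing or sludge purge
    if status == "critical":
        actions.append("reduce_inlet_load")         # inlet valves or bypass
    return actions
```

Returning a list rather than acting directly keeps the sketch usable both as a recommendation to operators and as input to automated control.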

2.2.5. System Integration

The integration of these three layers enables a dynamic and adaptive control system. The intelligent agent not only reacts to data in real time but also leverages historical trends stored in a database to optimize decision-making. By incorporating the ONI mechanism, the system proactively adjusts the actuator control strategy through decision-making, ensuring stable and efficient wastewater treatment operations.

2.3. Intelligent Agent Architecture

This section details the internal architecture of the intelligent agent, focusing on the modules that make up its operational logic. It describes the blocks responsible for data preprocessing, the virtual sensor block, and the decision-making block, which determines the corrective actions to be applied. Each of these components collaborates in an integrated manner to allow the agent to act autonomously, efficiently, and adaptively according to the plant’s operating conditions.

2.3.1. Data Preprocessing

Data preprocessing is a critical phase to ensure the quality and consistency of the information fed to the agent. In this stage, the data collected from the environment undergoes rigorous normalization and scaling techniques to homogenize the different magnitudes and units present in the input variables. This transformation is essential to avoid bias and improve the efficiency of the model’s learning process, ensuring that the values are comparable and within optimal ranges for the algorithm.
In addition, data cleaning is performed, which includes the detection and treatment of outliers, as well as the imputation of missing data when necessary. This processing ensures that the final data set is free of inconsistencies and errors, allowing the agent to operate on a solid and reliable basis. This facilitates the agent’s proper integration into the real environment, providing standardized inputs that optimize its performance and accuracy.
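The preprocessing steps can be sketched with the Python standard library. The specific techniques are assumptions, since the text does not name them: median imputation for missing values, interquartile-range clipping for outliers, and min-max scaling for normalization.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values lying more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range 0-1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def preprocess(values):
    """Impute, clip, then scale a single variable."""
    return min_max_scale(clip_outliers(impute_missing(values)))
```

In a deployed agent the scaling bounds would normally be fixed from the training data rather than recomputed on each batch, so that inputs remain comparable over time.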

2.3.2. Virtual Sensor

The virtual sensor consists of a regression model specifically developed to estimate nitrogen concentration in the effluent, based on measurable process variables. This model provides accurate and continuous estimates of a variable that cannot be directly measured in real time, thus facilitating more efficient and cost-effective monitoring. Model construction involves the appropriate selection and transformation of input variables, as well as training with representative historical data from the system.
Subsequently, based on the estimated nitrogen concentration, the performance indicator known as ONI is calculated, and it summarizes effluent quality in relation to environmental and process objectives. The ONI value provides a metric that reflects the status of the system and enables informed decisions for the optimization and regulation of the WWTP.

2.3.3. Database

The agent uses a structured database that functions as a centralized repository for storing all collected data, both from system inputs and from predictions generated by the virtual sensor. This database maintains bidirectional communication with the virtual sensor, allowing both the recording of new information and the retrieval of historical data for analysis and validation. The persistence of these data is essential to ensure the traceability and reproducibility of the control process.
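The bidirectional exchange between the virtual sensor and the repository can be sketched with SQLite from the Python standard library. The database engine, table name, and columns are assumptions; the source does not describe the actual schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the repository; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "ts TEXT PRIMARY KEY, n_total_pred REAL, oni REAL)"
    )
    return conn


def save_record(conn, ts, n_total_pred, oni):
    """Write one prediction and its ONI value (virtual sensor to database)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
            (ts, n_total_pred, oni),
        )


def recent_records(conn, limit):
    """Read the latest records in chronological order (database to virtual sensor)."""
    cur = conn.execute(
        "SELECT ts, n_total_pred, oni FROM records ORDER BY ts DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()[::-1]
```

Time stamps are stored as ISO 8601 strings so that text ordering matches chronological ordering.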

3. Case Study

This case study focuses on a WWTP in southeastern Spain. The main objective is to develop a plant stress indicator directly related to total nitrogen in the effluent and to integrate it into an agent that supports regulation and decision-making.

3.1. WWTP Description

The WWTP analyzed in this study is located in southeastern Spain, in a Mediterranean climate characterized by seasonal torrential rainfall. It serves an estimated population of 15,000 inhabitants and has a nominal treatment capacity of 1,642,500 m³ per year. This plant represents a medium-sized facility, typical of the region, where WWTPs typically have capacities between 680,000 and 3,650,000 m³/year. Given the environmental conditions and the variability of both the hydraulic load and the input of organic matter, the facility is designed with robust and flexible systems capable of adapting to significant fluctuations.
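As a quick check on scale, the annual figures can be converted to daily values. The conversion assumes a 365-day year and spreads the nominal capacity evenly over the served population, both simplifications that are not stated in the source.

```latex
\begin{align*}
Q_{\text{nominal}} &= \frac{1{,}642{,}500\ \text{m}^3/\text{year}}{365\ \text{days/year}} = 4500\ \text{m}^3/\text{day} \\
\frac{Q_{\text{nominal}}}{\text{population}} &= \frac{4500\ \text{m}^3/\text{day}}{15{,}000\ \text{inhabitants}} = 0.3\ \text{m}^3 = 300\ \text{L per inhabitant per day} \\
\text{regional range} &\approx \frac{680{,}000}{365} \text{ to } \frac{3{,}650{,}000}{365} \approx 1863 \text{ to } 10{,}000\ \text{m}^3/\text{day}
\end{align*}
```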
The WWTP is structured into two main operating lines: the water line, which focuses on wastewater treatment, and the sludge line, which is responsible for the treatment and management of solid waste generated in the process. Wastewater entering the plant first undergoes preliminary treatment, consisting of coarse and fine screening, followed by grit removal and degreasing to remove materials that could interfere with downstream biological and mechanical processes [30].
Secondary treatment is carried out using plug-flow activated sludge reactors that include anoxic and aerobic zones. This configuration facilitates the removal of biological nutrients, particularly nitrogen, by alternating between oxygen-containing and oxygen-free environments. In the anoxic zones, denitrifying bacteria reduce nitrate to nitrogen gas, while in the aerobic zones, nitrifying bacteria convert ammonium to nitrate [31]. This alternation is made possible through a carefully designed control system based on a set of automated actuators and valves that regulate key variables throughout the process.
Specifically, the plant is equipped with a range of actuators that allow the real-time adjustment of critical process parameters. These include internal recirculation systems and flow controls that manage the movement of mixed liquor between zones; inlet throttling or early-stage aeration control to adjust oxygen levels before full aerobic treatment begins; and dissolved oxygen control systems, which can be combined with external carbon dosing to optimize the denitrification phase. Phosphate dosing units and sludge discharge valves help maintain chemical balance and manage solids buildup. The control strategy also includes emergency bypasses, hydraulic check control, and the ability to temporarily halt incoming feeds. To respond to sudden variations in influent characteristics, the plant can also inject external carbon immediately and perform intensive aeration.
Following secondary treatment, the wastewater is clarified in two secondary sedimentation tanks, which separate biological solids from the treated water [32]. The clarified effluent then proceeds to the tertiary treatment stage, designed to remove remaining contaminants, such as phosphorus and pathogenic microorganisms. This stage combines coagulation–flocculation, lamellar sedimentation, and sand filtration, followed by disinfection using ultraviolet radiation and sodium hypochlorite [33,34,35]. Once all stages are completed, the treated water is discharged into the receiving aquatic environment.
Meanwhile, the excess sludge separated in secondary clarification is recirculated to biological reactors or diverted to the sludge line [36]. Here, it first undergoes gravitational thickening to concentrate the solids, followed by mechanical dewatering using centrifuges [37]. The dewatered sludge is temporarily stored in hoppers and finally removed for disposal or reuse, for example, in agriculture. To mitigate the impact of odors during sludge handling, the facility is equipped with an air treatment system combining activated carbon and biofiltration.
Figure 2 illustrates a representative diagram of the operating lines and the various treatments involved in the WWTP case study.
Overall, the plant incorporates advanced control strategies and a flexible infrastructure, making it well suited to meet the operational challenges posed by variable flow and load conditions. The integrated use of valves and actuators to manage aeration, recirculation, chemical dosing, and sludge handling plays a critical role in maintaining the WWTP’s biological performance, particularly in achieving efficient nitrogen and phosphorus removal.

3.2. Obtaining the Data Set

During the analysis period, operational data were collected concerning key variables related to total nitrogen behavior at different points in the system. Two data sets were collected: one for model implementation and testing and the other for the validation of the proposed complete agent.

3.2.1. Data Set for Developing the Model

The data set used to implement the prediction model consists of three variables monitored in the plant. The samples comprising the data set were collected over nine months, with a recording frequency of one value per day. Table 2 shows the variables available in the data set used.
The variables listed in Table 2 were monitored at the plant outlet in both cases. The objective was to measure the effluent. The data set includes a total of 423 data points, distributed over a daily sampling period. Figure 3 shows the distribution of the variables.
Figure 3 shows that the three variables follow very similar distributions, except around a value of 6, where total nitrogen has many measurements and the other variables do not.

3.2.2. Data Set for Validating Agent Approach

The data set used to validate the agent approach consists of four variables monitored in the plant. As in the previous data set, the samples were collected over nine months with a recording frequency of one value per day. Table 3 shows the variables available in this data set.
For the complete validation of the agent approach, the variables used to develop the model are required together with the total nitrogen value at the WWTP inlet, which is needed to evaluate one of the indicators proposed in Section 2.1.

4. Agent Model Implementation

4.1. Data Set Preprocessing

To ensure the data set’s quality, which will allow a predictive model with adequate performance to be obtained, a series of preprocessing operations were performed on the data set.

4.2. Applied Methods

In this subsection, the different techniques and algorithms used for the development of the predictive model, responsible for predicting the output nitrogen from the input nitrogen and phosphorus, are briefly presented and detailed. The model is used by the agent to estimate the stress indicators and the most appropriate actions to achieve the proper functioning of the WWTP.
First, the machine learning algorithms used in the research are listed:
  • Linear regression.
  • Polynomial regression.
  • K-nearest neighbors.
  • Decision tree.
  • Random forest.
  • Gradient boosting.
  • Support vector machine.
  • Multi-layer perceptron.
The operation of each of the algorithms is briefly defined in the following.

4.2.1. Linear Regression

Linear regression (LR) is a basic and fundamental statistical method with a strong assumption: that there is a linear relationship between input and output features. Obtaining this linear formula is based on reducing the sum of the squared errors between the actual values and the predictions.
This linear formula expresses the variable to be predicted, y, as a function of the characteristics, X, and the model parameters, w. This relationship is expressed in Equation (6).
$$y(x, w) = w_0 + w_1 \times x_1 + w_2 \times x_2 + \cdots + w_n \times x_n \tag{6}$$
where n is the number of features.
Hyperparameters allow you to configure the creation of the linear equation:
  • Intercept: a boolean value that calculates the independent term (true) or sets it to zero (false).
  • Positive: a boolean value that forces the coefficients to be positive (true) or to take any value (false).
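As a minimal sketch of what the intercept hyperparameter changes, the least-squares fit can be written in closed form for a single feature. The function name `fit_line` and the sample values are illustrative assumptions, not taken from the study, and the one-feature case is a simplification of Equation (6).

```python
def fit_line(xs, ys, intercept=True):
    """Least-squares fit of y = w0 + w1 * x for a single feature.

    With intercept=False the independent term w0 is fixed to zero.
    """
    n = len(xs)
    if intercept:
        x_mean = sum(xs) / n
        y_mean = sum(ys) / n
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        sxx = sum((x - x_mean) ** 2 for x in xs)
        w1 = sxy / sxx
        w0 = y_mean - w1 * x_mean
    else:
        # Minimizing sum((y - w1 * x)^2) gives w1 = sum(x * y) / sum(x^2).
        w1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        w0 = 0.0
    return w0, w1


# Illustrative data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys, intercept=True)
w0_zero, w1_zero = fit_line(xs, ys, intercept=False)
```

With the intercept calculated, the fit recovers the generating line; with the intercept set to zero, the slope has to absorb the offset and therefore changes.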

4.2.2. Polynomial Regression

In polynomial regression (PR), the linear regression method is extended by introducing polynomial terms for the characteristics. This allows non-linear relationships to be taken into account. Equation (7) reflects the second-degree polynomial equation that relates y, the variable to be predicted, to the characteristics, X, and the model parameters, w.
$$y(x, w) = w_0 + w_1 \times x_1 + w_2 \times x_2 + w_3 \times x_1^2 + w_4 \times x_1 \times x_2 \tag{7}$$
where x1 and x2 are the two input features of this example.
Again, hyperparameters allow for a modification of the way the polynomial equation is constructed:
  • Degree: an integer value that determines the degree of the polynomial, that is, the highest power of any variable.
  • include_bias: a boolean value that calculates the independent term (true) or sets it to zero (false).
  • interaction_only: a boolean value that sets whether to include only interaction terms without powers (true).
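The effect of these three hyperparameters on the set of polynomial terms can be sketched as follows. This is an illustrative helper (the name `poly_terms` is an assumption, not the study's code) that enumerates every product of features up to the chosen degree; for two features and degree two it produces the terms 1, x1, x2, x1², x1·x2, and x2².

```python
from itertools import combinations, combinations_with_replacement
from math import prod


def poly_terms(x, degree=2, include_bias=True, interaction_only=False):
    """Expand a feature vector into polynomial terms up to `degree`."""
    # With interaction_only, each feature may appear at most once per term,
    # so pure powers such as x1^2 are left out.
    pick = combinations if interaction_only else combinations_with_replacement
    terms = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        for idx in pick(range(len(x)), d):
            terms.append(prod(x[i] for i in idx))
    return terms


full = poly_terms((2.0, 3.0), degree=2)
interactions = poly_terms((2.0, 3.0), degree=2, interaction_only=True)
no_bias = poly_terms((2.0, 3.0), degree=2, include_bias=False)
```

A linear regression fitted on the expanded terms then yields the polynomial model.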

4.2.3. K-Nearest Neighbors

K-nearest neighbors (KNN) is a non-parametric algorithm that predicts the output as the mean of the target values of the K training samples closest to the input features. The hyperparameters that allow its behavior to be modified are as follows:
  • n_neighbors: a natural value that determines the number K of neighbors used for the prediction.
  • Weights: A function that assigns the importance (weight) of each of the K neighbors in the prediction.
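The two hyperparameters above can be illustrated with a small from-scratch regressor. This is a sketch under stated assumptions: Euclidean distance, inverse-distance weights for the distance-based option, and invented sample values; `knn_predict` is a hypothetical name.

```python
from math import dist


def knn_predict(train_x, train_y, query, n_neighbors=3, weights="uniform"):
    """Predict a value as the (weighted) mean of the K nearest targets."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:n_neighbors]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    if weights != "distance":
        raise ValueError("weights must be 'uniform' or 'distance'")
    # A neighbor at zero distance would get infinite weight, so return
    # the mean of the exact matches instead.
    exact = [y for x, y in nearest if dist(x, query) == 0.0]
    if exact:
        return sum(exact) / len(exact)
    w = [1.0 / dist(x, query) for x, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)


train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
uniform_pred = knn_predict(train_x, train_y, [1.2], n_neighbors=3)
distance_pred = knn_predict(train_x, train_y, [1.2], n_neighbors=3, weights="distance")
```

With uniform weights the three nearest targets count equally, whereas distance weighting lets the closest sample dominate the prediction.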

4.2.4. Decision Tree

Decision tree (DT) is a method that constructs a tree diagram by dividing the input feature space into each of its internal nodes until the output value is obtained. Figure 4 shows the basic representation of a decision tree. Starting from the main node, decisions are made based on the value of a given characteristic, going deeper into the diagram. These decisions are made at the decision nodes, and the diagram ends at the leaf node, where the model obtains its output.
The following hyperparameters influence how this diagram is generated:
  • max_depth: a natural value that determines the maximum depth of the diagram.
  • criterion: the function used to determine the quality of the division performed during the training process.
  • splitter: the strategy used to select the division at each node during the training process.
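To make the criterion and splitter hyperparameters more concrete, the following sketch searches one feature for the threshold that minimizes the summed squared error of the two resulting groups. It assumes a squared-error criterion and an exhaustive "best" split search; the function names and data are illustrative only.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, cost) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


threshold, cost = best_split([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 5.0, 5.0])
```

A tree is grown by applying such a search recursively to each resulting group until the maximum depth is reached; a random splitter would instead draw candidate thresholds at random.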

4.2.5. Random Forest

Random forest (RF) is a method based on decision trees. It is an ensemble method that trains a given number of trees on random subsets of the data set and combines their predictions [38]. The hyperparameters used to configure the model are as follows:
  • n_estimators: a natural value that determines the number of decision trees used.
  • max_depth: a natural value that determines the maximum depth of the diagram.
  • criterion: the function used to determine the quality of the division performed during the training process.

4.2.6. Gradient Boosting

Gradient boosting (GB) is also an ensemble method. In this case, decision trees are built sequentially and additively, with each new tree attempting to reduce the errors made by the previous ones. The model’s behavior can be controlled using the following hyperparameters:
  • Loss: the cost function to be minimized.
  • learning_rate: a floating point number that determines the model’s learning rate. It controls the contribution of each decision tree.
  • n_estimators: a natural value that determines the number of decision trees used.
  • Criterion: the function used to determine the quality of the division performed during the training process.
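The sequential, additive construction can be summarized with the usual gradient boosting update. The notation below is an assumption introduced for illustration: $L$ is the loss, $\nu$ the learning rate, $M$ the number of estimators, and $h_m$ the decision tree fitted at stage $m$.

$$F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c), \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M$$

Each tree $h_m$ is fitted to the negative gradient of the loss with respect to the current predictions $F_{m-1}(x_i)$, which for the squared error reduces to the residuals $y_i - F_{m-1}(x_i)$. A smaller $\nu$ shrinks the contribution of each tree, so more stages are typically needed to reach the same fit.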

4.2.7. Support Vector Machine

The support vector machine (SVM) used in this work is the regression extension of the well-known support vector machines for classification (support vector regression). It seeks a function that deviates as little as possible from the real values within a tolerance range; errors outside this range are penalized. The hyperparameters considered are as follows:
  • Kernel: the kernel function used by the model.
  • C: the regularization coefficient, a floating-point number that controls the penalty.
  • Epsilon: a floating-point number that determines the tolerance margin.
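The role of the tolerance margin and the regularization coefficient can be sketched with the epsilon-insensitive loss commonly used in support vector regression. The function names and numbers are illustrative assumptions; the full optimization problem also includes a term on the model weights that is omitted here.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tolerance band, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_penalty(y_true, y_pred, epsilon, C):
    """Total penalty for errors outside the band, scaled by C."""
    return C * sum(
        epsilon_insensitive_loss(t, p, epsilon) for t, p in zip(y_true, y_pred)
    )


inside = epsilon_insensitive_loss(5.0, 5.25, epsilon=0.5)   # within the band
outside = epsilon_insensitive_loss(5.0, 6.0, epsilon=0.5)   # 0.5 beyond the band
total = svr_penalty([5.0, 5.0], [5.25, 6.0], epsilon=0.5, C=10.0)
```

A larger epsilon ignores more of the small errors, while a larger C makes the remaining errors more costly.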

4.2.8. Multi-Layer Perceptron

A multi-layer perceptron (MLP) is a feedforward artificial neural network consisting of interconnected neurons. Neurons are grouped into layers, such that the neurons in one layer connect to all the neurons in the next layer, from the input layer (where there are as many neurons as inputs) to the output layer (where the predicted value is returned). The intermediate layers are called hidden layers, and the neurons in these layers are called hidden neurons [39]. The hyperparameters considered are as follows:
  • hidden_neurons: a natural number that determines the number of neurons in the hidden layer.
  • Dropout: a floating point number that determines the proportion of neurons that are deactivated during training.
  • activation_function: a function that determines the neuron’s output based on its inputs.
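The forward pass of a network with one hidden layer can be sketched as follows. The weights, names, and inputs are invented for illustration; dropout is not shown because it only acts during training, when a random fraction of the hidden outputs is set to zero.

```python
import math

# Candidate activation functions for the hidden neurons.
ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "linear": lambda z: z,
}


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b, activation="relu"):
    """Single-hidden-layer forward pass returning one regression output."""
    act = ACTIVATIONS[activation]
    hidden = [
        act(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


x = [1.0, -2.0]
hidden_w = [[1.0, 0.0], [0.0, 1.0]]  # two hidden neurons
hidden_b = [0.0, 0.0]
out_w = [1.0, 1.0]
out_b = 0.5
relu_out = mlp_forward(x, hidden_w, hidden_b, out_w, out_b, activation="relu")
linear_out = mlp_forward(x, hidden_w, hidden_b, out_w, out_b, activation="linear")
```

With ReLU the negative hidden input is clipped to zero, whereas the linear activation passes it through unchanged, which is why the two outputs differ.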

4.3. Experiment Setup

Before the agent is developed and implemented, a predictive model is needed that accurately estimates the total nitrogen value in the WWTP outlet effluent from the total phosphorus and Kjeldahl nitrogen values.
To this end, the model-development data set (Table 2) and the machine learning techniques described in the previous section are used. The data set is divided into two subsets. The first, containing 80% of the samples, is used in the first phase of the experiment, which consists of cross-validation. The second, containing the remaining 20%, is used in the second phase to validate the predictive model.
In the first phase, several models are created for each technique by varying its hyperparameters. All the resulting models then undergo 5-fold cross-validation on the 80% subset, and their performance is measured with regression metrics. The objective of this phase is to determine the best model for each technique, which advances to the second phase. In the second phase, each selected model is trained on the cross-validation subset and validated on the 20% of samples that it has never observed.
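The two-phase protocol can be sketched with index bookkeeping alone. This is an illustration under assumptions: the 423 samples reported for the model-development data set, a fixed random seed, and rounding of the 20% share to the nearest integer; the helper names are hypothetical.

```python
import random


def holdout_split(n_samples, holdout_fraction=0.2, seed=0):
    """Shuffle sample indices and set aside a hold-out share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


cv_idx, holdout_idx = holdout_split(423)
splits = list(k_fold(cv_idx, k=5))
```

Every candidate configuration would be scored on the five validation folds, and only the best configuration per technique would then be retrained on all of `cv_idx` and evaluated once on `holdout_idx`.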

4.3.1. Data Set

To ensure the data set’s quality, which will allow obtaining a predictive model with adequate performance, a series of preprocessing operations are performed on the data set.
First, all duplicate records were removed. Next, all variables with constant or duplicated values were discarded. Then, all records that did not include a measurement of any variable were removed. Finally, the data set was shuffled.
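These four operations can be sketched on a list of records as follows. The column names and values are invented for illustration, and the third step is interpreted here as removing records in which every remaining variable is missing, which is an assumption about the intended meaning.

```python
import random


def preprocess(records, columns, seed=0):
    """Deduplicate, drop uninformative columns, drop empty rows, shuffle."""
    # 1. Remove duplicate records, keeping the first occurrence.
    seen, rows = set(), []
    for rec in records:
        key = tuple(rec[c] for c in columns)
        if key not in seen:
            seen.add(key)
            rows.append(rec)
    # 2. Discard constant columns and columns that repeat an earlier one.
    kept, seen_cols = [], set()
    for c in columns:
        values = tuple(r[c] for r in rows)
        if len(set(values)) > 1 and values not in seen_cols:
            seen_cols.add(values)
            kept.append(c)
    # 3. Remove records with no measurement in any remaining column.
    rows = [
        {c: r[c] for c in kept}
        for r in rows
        if any(r[c] is not None for c in kept)
    ]
    # 4. Shuffle the remaining records.
    random.Random(seed).shuffle(rows)
    return rows, kept


columns = ["tn", "tp", "tkn", "flag", "tp_copy"]
records = [
    {"tn": 6.1, "tp": 0.5, "tkn": 2.0, "flag": 1, "tp_copy": 0.5},
    {"tn": 6.1, "tp": 0.5, "tkn": 2.0, "flag": 1, "tp_copy": 0.5},
    {"tn": 7.0, "tp": 0.7, "tkn": 2.5, "flag": 1, "tp_copy": 0.7},
    {"tn": None, "tp": None, "tkn": None, "flag": 1, "tp_copy": None},
]
clean_rows, kept_columns = preprocess(records, columns)
```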

4.3.2. Regression Model Implementation

To determine the best predictive model, a systematic tuning of the hyperparameters of the different machine learning techniques was performed.
Linear Regression
Different configurations were explored by varying the independent term and the coefficient constraints. Specifically, models were created by combining the following options:
  • The independent term is either calculated or set to 0.
  • The coefficients are either forced to be positive or allowed to take any value.
The combination of these hyperparameters results in a total of four different models.
Polynomial Regression
The configurations created were based on adjusting the calculation of the independent term and the degree of the polynomial:
  • The independent term is either calculated or set to 0.
  • The degree of the polynomial ranges from 2 to 7.
The combination of these hyperparameters results in a total of twelve models.
K-Nearest Neighbors
The number of neighbors and the weight assignment functions were modified to create different configurations:
  • Values of n_neighbors from 1 to 99, taking all odd numbers.
  • Uniform and distance-based weight assignment.
The combination of these hyperparameters results in a total of one hundred different models.
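The model counts reported so far follow from taking the Cartesian product of the hyperparameter values. The sketch below reproduces the counts for linear regression, polynomial regression, and KNN; the variable names are illustrative, and the tuples only stand in for model configurations.

```python
from itertools import product

# Linear regression: independent term on/off x positive coefficients on/off.
lr_grid = list(product([True, False], [True, False]))

# Polynomial regression: independent term on/off x degree 2..7.
pr_grid = list(product([True, False], range(2, 8)))

# KNN: odd n_neighbors from 1 to 99 x two weighting schemes.
knn_grid = list(product(range(1, 100, 2), ["uniform", "distance"]))

counts = {"LR": len(lr_grid), "PR": len(pr_grid), "KNN": len(knn_grid)}
```

Each tuple in a grid corresponds to one model that is trained and scored during cross-validation.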
Decision Tree
For the creation of decision tree-based configurations, the maximum diagram depth, the split quality calculation function, and the split selection strategy were adjusted:
  • All odd values from 1 to 59 were taken for the depth of the diagram.
  • The Poisson, absolute error, and mean squared error functions with Friedman’s improvement to calculate splitting quality.
  • The selection of the division according to the best option and randomly.
In total, one hundred and eighty-six models derived from the different combinations of hyperparameters were managed.
Random Forest
The configuration of the models in the random forest technique followed the same sweep as that mentioned in the case of decision trees, adding a fourth parameter, the number of trees used in the model:
  • From 1 to 59 regressors, always odd numbers, were used to create the forest.
Thanks to the combinations of hyperparameters, two thousand seven hundred and ninety models were developed in total.
Gradient Boosting
Setting up gradient boosting models involved tuning four hyperparameters: the cost function, the learning rate, the number of boosting stages to perform, and the split quality function.
  • The squared error, the absolute error, and the Huber function were used as the loss function.
  • The following values were used as learning rates: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, and 10.
  • Values from 10 to 200, in increments of 10, were used for the number of boosting stages to perform.
  • The squared error and mean squared error functions with Friedman’s improvement were used for calculating splitting quality.
A total of one thousand four hundred and forty models were worked with, resulting from the combination of hyperparameters.
Support Vector Machine
For support vector machines’ configurations, the kernel function hyperparameters, the regularization coefficient, and epsilon were modified:
  • The radial basis function (RBF), sigmoidal function, and linear function were used as kernels.
  • The following values were used as regularization coefficient values: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, and 100.
  • The values 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, and 10 were used in the epsilon hyperparameter ( ϵ ).
Tuning the hyperparameters resulted in the creation of two hundred ninety-seven distinct models.
Multi-Layer Perceptron
The configuration of the multi-layer perceptron models involved modifying the number of hidden neurons, the dropout parameter, and the activation function of the network neurons. Given the nature of the task, as a regression problem, the model architecture was restricted to a single hidden layer.
  • 1 to 50 neurons were used in the hidden layer.
  • As dropout values, 0.1, 0.2, and 0.3 were used.
  • The activation functions tested were ReLU, the hyperbolic tangent (tanh), sigmoid, and linear.
The exploration of hyperparameter combinations led to the creation of two hundred and forty-one models.

4.3.3. Regression Model Validation

The different regression metrics used to determine the performance of the predictive model are defined below.
Mean Absolute Error
The mean absolute error (MAE) is a positive float, with a minimum value of 0.0 and a maximum value of +∞. The best possible value, indicating perfect performance with no regression errors, is 0.0. It is calculated according to Equation (8).
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{8}$$
where the following applies:
  • $n$ is the total number of samples.
  • $y_i$ is the actual $i$th value.
  • $\hat{y}_i$ is the predicted $i$th value.
Mean Square Error
The mean square error (MSE) is a positive float, with a minimum value of 0.0 and a maximum value of +∞. The best possible value, indicating perfect performance with no regression errors, is 0.0. It is calculated according to Equation (9).
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{9}$$
where the following applies:
  • $n$ is the total number of samples.
  • $y_i$ is the actual $i$th value.
  • $\hat{y}_i$ is the predicted $i$th value.
Symmetric Mean Absolute Percentage Error
The symmetric mean absolute percentage error (SMAPE) is a positive float, with a minimum value of 0.0 and a maximum value of 100.0. The best possible value, indicating perfect performance with no regression errors, is 0.0. It is calculated according to Equation (10).
$$\mathrm{SMAPE} = \frac{100}{n} \times \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{\dfrac{\left| y_i \right| + \left| \hat{y}_i \right|}{2}} \tag{10}$$
where the following applies:
  • $n$ is the total number of samples.
  • $y_i$ is the actual $i$th value.
  • $\hat{y}_i$ is the predicted $i$th value.
Coefficient of Determination
The coefficient of determination (R2) establishes the proportion of the total variance of the dependent variable that is explained by the independent variables. It can be interpreted as a comparison between the prediction and a constant value equal to the mean of the data. The best value, indicating a perfect prediction, is 1, while the worst possible value is −∞. Its expression is shown in Equation (11).
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{11}$$
where the following applies:
  • $n$ is the total number of samples.
  • $y_i$ is the actual $i$th value.
  • $\hat{y}_i$ is the predicted $i$th value.
  • $\bar{y}$ is the mean value of the actual values.
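The four metrics can be sketched in a few lines of plain Python. The functions follow the definitions in this section, with squared differences in the coefficient of determination as in its standard form and the SMAPE denominator taken as the mean of the two absolute values; treating samples where both values are zero as contributing no error is an assumption, and the sample values are invented.

```python
def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def smape(y, y_hat):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for a, b in zip(y, y_hat):
        denom = (abs(a) + abs(b)) / 2
        if denom:  # both values zero: treated as zero error
            total += abs(a - b) / denom
    return 100.0 * total / len(y)


def r2(y, y_hat):
    """Coefficient of determination."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "SMAPE": smape(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

In this toy case a single error of one unit on three samples gives an MAE and MSE of one third and an R2 of 0.5, since the residual sum of squares is half the total sum of squares.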

4.4. Results

The results obtained during the experiment are presented in this section. Tables and graphs are used to present the information and data.

4.4.1. Regression Model Implementation

This section presents the results obtained during the first phase of the experiment, cross-validation.
Linear Regression
Table 4 shows the average results for the metrics obtained during the cross-validation process. The best value for each metric is emphasized in boldface.
Looking at the table, we can see that the model performance is identical across pairs of configurations. This is because the resulting model is the same; that is, forcing the model coefficients to be positive has no impact. In turn, performance is particularly similar across configurations. The metric values indicate good performance, with a coefficient of determination close to 0.7, an MAE close to 0.45, an MSE close to 0.5, and an SMAPE greater than 5.5%.
To verify the variation between cross-validation iterations, Figure 5 displays the R2 values obtained throughout the process in a violin plot.
Using the violin plot, we can study the variation in the coefficient of determination throughout the cross-validation process. As shown in Table 4, the behavior of all configurations is very similar. The graph shows slight variability, clustering most iterations around the process mean, with some lower outliers.
Polynomial Regression
The average results of the cross-validation process for the polynomial regression-based configurations are shown in Table 5. The best results are emphasized in boldface. For space and presentation reasons, not all PR models are shown in the results table. Only the models with the most interesting results are presented.
According to Table 5, the best polynomial regression models have a low degree (two or three), and correct results are achieved regardless of restrictions on the sign of the coefficients or the independent term. The results are very similar between the models, but two stand out: the degree-two polynomial, with a calculated independent term, and the degree-three polynomial, with an independent term equal to zero and constructed based on combinations of different variables. The first model obtains the highest coefficient of determination, 0.64, two points above the second model. The second model stands out in the MAE, MSE, and SMAPE, obtaining values of 0.43, 0.57617, and 5.0302%, respectively, and outperforming the rest of the models.
Figure 6 shows the violin plot for the R2 values obtained during cross-validation. The model with the greatest variation between process iterations is the degree-3 polynomial. The remaining models exhibit less variability.
K-Nearest Neighbors
Table 6 shows the average value of the metrics in the cross-validation of the KNN models. The best results are emphasized in boldface. Due to space and presentation limitations, only the configurations with the most significant results will be shown.
In the table above, the use of a moderate number of neighbors, between 9 and 19, is notable. In turn, the weight assignment function plays a relevant role in the performance of the models. Models with uniform weight assignment obtain higher values for the coefficient of determination, while configurations with a distance-based function for weight assignment achieve better values in the MAE, MSE, and SMAPE metrics. In any case, all configurations obtain very similar results.
Figure 7 shows the violin plot for the R2 values obtained during cross-validation.
Looking at the violin plot for the coefficient of determination metric, Figure 7, we see that the behavior of all models is very similar. However, analyzing the diagrams, we can see how models with a distance-based weight distribution exhibit a greater deviation compared to the models with a uniform weight assignment.
Decision Tree
Table 7 shows the mean values obtained during the cross-validation process for the decision tree configurations. The best average values for each of the metrics are emphasized in boldface. Due to the large number of configurations, only the models with the most significant results are shown.
The different configurations show a reduced maximum diagram depth, with the most repeated values being five and seven (and with a single higher value, nineteen). One model stands out for its performance, obtaining the best value in three of the four metrics, R2, MAE, and MSE, and being the one with the second best value in the SMAPE. This model, with a maximum depth of five, a calculation of the split quality based on the Poisson function, and a split selection based on the best option, obtains a 5.26% SMAPE, just 0.05% worse than the best model.
When the violin plots of Figure 8 are examined for the coefficient of determination obtained during cross-validation, it can be seen that the models do behave differently. The models involving random split selection show less variability, especially the model with a maximum depth of 19.
Random Forest
The results of the cross-validation process are presented in Table 8, which shows the average values for the metrics. The best values for each metric are emphasized in boldface. Due to the large number of configurations tested, only the models with the best performance are shown.
The performance of the configurations is quite similar, with considerable differences found only in the MSE metric. As with the decision trees, the best-performing models involve a reduced maximum diagram depth, between three and seven. The number of trees used to form the forest varies more, with some models using a small number of trees (such as 3, 5, or 7) and others using a much larger number (such as 35, 43, and 53). In addition, the Poisson function is positioned as the best method for calculating split quality, being present in eight of the ten models.
In the violin plot represented in Figure 9, the variability of the R2 metric during cross-validation can be observed.
The figure above highlights the greater viability of configurations with a larger number of decision trees. Since the result is calculated as the average of the results of each individual tree, it is a statistical measure sensitive to outliers.
Gradient Boosting
Table 9 shows the mean values obtained in the cross-validation for each of the metrics. The best result is emphasized in boldface. Due to space constraints, not all models are shown, but only the most interesting ones.
From the previous table, it can be seen that the best results are obtained for pairs of configurations that differ only in the function used to calculate the split quality, which indicates that this hyperparameter does not affect model performance. The models that achieve the best metrics share several hyperparameters, such as the cost function (squared error) and the learning rate (0.05), varying only in the number of decision trees used, which ranges from 40 to 70.
From the violin plot in Figure 10, it can be seen that, when the learning rate is 0.05, the greater the number of decision trees, the greater the variability in the R2 during cross-validation. With the selected models, it cannot be determined whether the same occurs with a learning rate of 0.01, but, in any case, the variability present is lower.
Support Vector Machine
Table 10 shows the mean values of the metrics in the cross-validation of the SVM-based models. Due to space limitations, only the models with the most interesting results are shown. The best results for each metric are emphasized in boldface.
All models share the linear function as the kernel, varying only the regularization coefficient and epsilon. A higher epsilon value, such as 0.5, yields better results in the coefficient of determination and mean squared error metrics, while a lower epsilon value, such as 0.001, yields better results in the mean absolute error and symmetric mean absolute percentage error metrics.
The variability in the R2 results during the different iterations of the cross-validation, shown in Figure 11, is similar between all the models, with none standing out for its low deviation.
Multi-Layer Perceptron
The average cross-validation results for the MLP configurations are shown in Table 11. The best values for each metric are emphasized in boldface. Due to the large number of configurations used, only the best-performing models are presented.
The best results for all metrics were obtained through the same model, with 43 neurons in the hidden layer, a dropout factor of 0.1, and a linear activation function. Configurations with a ReLU activation function performed slightly worse, especially when the MAE and SMAPE metrics are examined. The linear activation function was more effective. Increasing the dropout factor resulted in poorer model performance.
However, the good results of this model are not accompanied by low variability throughout the cross-validation process. Figure 12 shows that this is the model with the greatest deviation between the coefficient of determination results during cross-validation.

4.4.2. Regression Model Validation

Next, in this section, the best models for each of the presented techniques are validated. To achieve this, the models are trained with the data set used in cross-validation, and the model’s performance is verified by validating it with the remaining data subset, which has never been observed or processed through the model until now. Table 12 collects the results of the metrics during the validation process. The best results for each metric are highlighted in bold.
Gradient boosting performs the best among all models, achieving the best results in all four calculated metrics: R2, MAE, MSE, and SMAPE. This confirms the model’s good performance, with a good predictive capacity and good fit to the data in the validation. Decision trees also perform well, along with KNN and PR, while the RF model obtains the worst results. SVM and MLP, being more complex models than the previous ones, obtain worse results.

4.4.3. Virtual Sensor Validation

Considering the results of cross-validation and model validation, gradient boosting was determined to be the best algorithm for generating the virtual sensor by achieving the best results in the different metrics used in the cross-validation and validation process. The model, configured with a maximum diagram depth of five layers, the Poisson function to calculate the split quality, and the selection of the best split, will be responsible for estimating total nitrogen in the WWTP outlet effluent based on total phosphorus and total Kjeldahl nitrogen.
Figure 13 represents the errors made using the regression model of the virtual sensor in the validation data set. The X-axis represents the ground truth of total nitrogen in the outlet effluent. In contrast, the Y-axis represents the residuals, the difference between the prediction and the actual values, with each point representing an individual prediction. The black dotted line represents where the residuals are zero, i.e., where the prediction matches the actual value.
No patterns or trends are observed in the distribution of the residuals, indicating no bias in the virtual sensor. Lines of points are observed in the residuals, driven by the very nature of the algorithm, which assigns the same output to different inputs. However, for high total nitrogen values, predictions are very far from the ideal value, with residuals that could be considered outliers.
The ONI is then computed from each value estimated by the regression model, yielding the plant’s operating status.

5. Conclusions and Future Works

In this work, a virtual sensor has been developed as a key component in the control agent’s infrastructure for optimizing processes in the WWTP case study. This virtual sensor is composed of two main elements: a regression model that allows the estimation of the nitrogen concentration in effluent and the calculation of the ONI indicator, which summarizes the system’s status with respect to this compound and serves as the main reference metric for the agent in decision-making. Estimating nitrogen using the regression model makes it possible to obtain continuous, real-time values without the need for constant physical measurements, which provides efficiency, savings, and flexibility in process monitoring.
A total of eight different algorithms were evaluated to build the regression model. All of them showed comparable performance, with R2 ranging from 0.68 to 0.80. After a validation process using a specific data set, the model based on the gradient boosting algorithm was selected as the most appropriate. This model achieved an R2 of 0.82, an MAE of 0.39, an MSE of 0.32, and an SMAPE of 4.9%, representing an optimal balance between accuracy and generalization. Based on the predictions of this model, the ONI indicator values were calculated, thus completing the virtual sensor’s functionality as an integrated inference and evaluation system for plant operations.
As a future goal, the implementation of the complete system in a real or simulated WWTP environment is proposed, with the aim of validating its performance under more complex and dynamic operating conditions. This will allow for the evaluation of its robustness, adaptability, and practical utility in real-life scenarios. Likewise, the inclusion of additional sub-indicators or the modification of the ONI indicator itself is being considered to enrich the information provided to the control agent. The incorporation of other relevant compounds present in the process, such as phosphorus, chemical oxygen demand (COD), or other specific contaminants, also appears to be a promising way to improve the system’s ability to make informed and sustainable decisions.

Author Contributions

Conceptualization, M.T. and A.D.-L.; methodology, M.T. and E.J.; software, A.D.-L. and Ó.F.-R.; validation, Ó.F.-R. and J.L.C.-R.; formal analysis, J.L.C.-R. and E.J.; investigation, E.J.; resources, M.T.; data curation, A.D.-L.; writing—original draft preparation, E.J.; writing—review and editing, M.T. and A.D.-L.; supervision, Ó.F.-R.; project administration, J.L.C.-R. All authors have read and agreed to the published version of the manuscript.

Funding

Míriam Timiraos’s research was supported by the “Xunta de Galicia” through industrial PhD grants (http://gain.xunta.gal/, accessed on 10 May 2025), under the “Doutoramento Industrial 2022” grant with reference 04_IN606D_2022_2692965. Antonio Díaz-Longueira’s research was supported by the Xunta de Galicia (Regional Government of Galicia) through PhD grants (http://gain.xunta.gal), under the “Axudas á etapa predoutoral” grant with reference ED481A-2023-072. Grant PID2022-137152NB-I00 was funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU. This work was also supported by Xunta de Galicia grants for the consolidation and structuring of competitive research units, GPC (ED431B 2023/49). CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). This research is the result of the Strategic Project “Critical infrastructures cybersecure through intelligent modeling of attacks, vulnerabilities and increased security of their IoT devices for the water supply sector” (C061/23) as a result of the collaboration agreement signed between the National Institute of Cybersecurity (INCIBE) and the University of A Coruña. This initiative is carried out within the framework of the Recovery, Transformation and Resilience Plan funds, financed by the European Union (Next Generation).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

| Abbreviation | Meaning |
|---|---|
| DT | Decision Tree |
| GB | Gradient Boosting |
| KNNs | K-Nearest Neighbors |
| LR | Linear Regression |
| MAE | Mean Absolute Error |
| MLP | Multi-Layer Perceptron |
| MSE | Mean Squared Error |
| ONI | Operational Nitrogen Indicator |
| PR | Polynomial Regression |
| R² | Coefficient of Determination |
| RBF | Radial Basis Function |
| ReLU | Rectified Linear Unit |
| RF | Random Forest |
| SMAPE | Symmetric Mean Absolute Percentage Error |
| SVM | Support Vector Machine |
| tanh | Hyperbolic Tangent |
| TKN | Total Kjeldahl Nitrogen |
| WWTP | Wastewater Treatment Plant |

References

  1. Şenol, R.; Salman, O.; Kaya, Z. Potable water production from ambient moisture. Appl. Water Sci. 2023, 13, 10.
  2. Brown, T.C.; Mahat, V.; Ramirez, J.A. Adaptation to future water shortages in the United States caused by population growth and climate change. Earth’s Future 2019, 7, 219–234.
  3. Boretti, A.; Rosa, L. Reassessing the projections of the world water development report. NPJ Clean Water 2019, 2, 15.
  4. Safarpour, H.; Tabesh, M.; Shahangian, S.A. Environmental Assessment of a Wastewater System under Water demand management policies. Water Resour. Manag. 2022, 36, 2061–2077.
  5. Spellman, F.R. Handbook of Water and Wastewater Treatment Plant Operations; CRC Press: Boca Raton, FL, USA, 2013.
  6. Ianes, J.; Cantoni, B.; Remigi, E.U.; Polesel, F.; Vezzaro, L.; Antonelli, M. A stochastic approach for assessing the chronic environmental risk generated by wet-weather events from integrated urban wastewater systems. Environ. Sci. Water Res. Technol. 2023, 9, 3174–3190.
  7. Mascher, F.; Mascher, W.; Pichler-Semmelrock, F.; Reinthaler, F.F.; Zarfel, G.E.; Kittinger, C. Impact of Combined Sewer Overflow on Wastewater Treatment and Microbiological Quality of Rivers for Recreation. Water 2017, 9, 906.
  8. Lu, J.Y.; Wang, X.M.; Liu, H.Q.; Yu, H.Q.; Li, W.W. Optimizing operation of municipal wastewater treatment plants in China: The remaining barriers and future implications. Environ. Int. 2019, 129, 273–278.
  9. Bertanza, G.; Boiocchi, R.; Pedrazzani, R. Improving the quality of wastewater treatment plant monitoring by adopting proper sampling strategies and data processing criteria. Sci. Total Environ. 2022, 806, 150724.
  10. Kizgin, A.; Schmidt, D.; Joss, A.; Hollender, J.; Morgenroth, E.; Kienle, C.; Langer, M. Application of biological early warning systems in wastewater treatment plants: Introducing a promising approach to monitor changing wastewater composition. J. Environ. Manag. 2023, 347, 119001.
  11. Longo, S.; d’Antoni, B.M.; Bongards, M.; Chaparro, A.; Cronrath, A.; Fatone, F.; Lema, J.M.; Mauricio-Iglesias, M.; Soares, A.; Hospido, A. Monitoring and diagnosis of energy consumption in wastewater treatment plants. A state of the art and proposals for improvement. Appl. Energy 2016, 179, 1251–1268.
  12. Martínez, R.; Vela, N.; el Aatik, A.; Murray, E.; Roche, P.; Navarro, J.M. On the Use of an IoT Integrated System for Water Quality Monitoring and Management in Wastewater Treatment Plants. Water 2020, 12, 1096.
  13. Thomas, O.; Théraulaz, F.; Cerdà, V.; Constant, D.; Quevauviller, P. Wastewater quality monitoring. TrAC Trends Anal. Chem. 1997, 16, 419–424.
  14. Pehlivanoglu-Mantas, E.; Sedlak, D.L. Wastewater-Derived Dissolved Organic Nitrogen: Analytical Methods, Characterization, and Effects—A Review. Crit. Rev. Environ. Sci. Technol. 2006, 36, 261–285.
  15. Bagherzadeh, F.; Mehrani, M.J.; Basirifard, M.; Roostaei, J. Comparative study on total nitrogen prediction in wastewater treatment plant and effect of various feature selection methods on machine learning algorithms performance. J. Water Process Eng. 2021, 41, 102033.
  16. Ye, G.; Wan, J.; Deng, Z.; Wang, Y.; Chen, J.; Zhu, B.; Ji, S. Prediction of effluent total nitrogen and energy consumption in wastewater treatment plants: Bayesian optimization machine learning methods. Bioresour. Technol. 2024, 395, 130361.
  17. Manami, M.; Seddighi, S.; Örlü, R. Deep learning models for improved accuracy of a multiphase flowmeter. Measurement 2023, 206, 112254.
  18. Farsi, M.; Shojaei Barjouei, H.; Wood, D.A.; Ghorbani, H.; Mohamadian, N.; Davoodi, S.; Reza Nasriani, H.; Ahmadi Alvar, M. Prediction of oil flow rate through orifice flow meters: Optimized machine-learning techniques. Measurement 2021, 174, 108943.
  19. Baggiani, F.; Marsili-Libelli, S. Real-time fault detection and isolation in biological wastewater treatment plants. Water Sci. Technol. 2009, 60, 2949–2961.
  20. Sen, S.; Husom, E.J.; Goknil, A.; Politaki, D.; Tverdal, S.; Nguyen, P.; Jourdan, N. Virtual sensors for erroneous data repair in manufacturing a machine learning pipeline. Comput. Ind. 2023, 149, 103917.
  21. Ko, D.; Norton, J.W., Jr.; Daigger, G.T. Wastewater management decision-making: A literature review and synthesis. Water Environ. Res. 2024, 96, e11024.
  22. Holloway, T.G.; Williams, J.B.; Ouelhadj, D.; Cleasby, B. Process stress in municipal wastewater treatment processes: A new model for monitoring resilience. Process Saf. Environ. Prot. 2019, 132, 169–181.
  23. Lai, J. Research on prediction algorithm of effluent quality and development of integrated control system for waste-water treatment. Sci. Rep. 2025, 15, 1–21.
  24. Sharma, T.; Kumar, A.; Pant, S.; Kotecha, K. Wastewater Treatment and Multi-Criteria Decision-Making Methods: A Review. IEEE Access 2023, 11, 143704–143720.
  25. Goulart Coelho, L.M.; Lange, L.C.; Coelho, H.M. Multi-criteria decision making to support waste management: A critical review of current practices and methods. Waste Manag. Res. 2017, 35, 3–28.
  26. Wang, Y.; Cheng, Y.; Liu, H.; Guo, Q.; Dai, C.; Zhao, M.; Liu, D. A review on applications of artificial intelligence in wastewater treatment. Sustainability 2023, 15, 13557.
  27. Council Directive 91/271/EEC of 21 May 1991 concerning urban waste-water treatment. Off. J. Eur. Communities 1991, L135, 40–52. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31991L0271 (accessed on 6 May 2025).
  28. Real Decreto 509/1996, de 15 de Marzo, por el que se Desarrolla el Real Decreto-Ley 11/1995, de 28 de Diciembre, Sobre Tratamiento de Aguas Residuales Urbanas. Boletín Oficial del Estado, n.º 77, 29 de Marzo de 1996, pp. 11290–11301. 1996. Available online: https://www.boe.es/eli/es/rd/1996/03/15/509 (accessed on 6 May 2025).
  29. Henze, M.; van Loosdrecht, M.C.; Ekama, G.A.; Brdjanovic, D. Biological Wastewater Treatment: Principles, Modelling and Design; IWA Publishing: London, UK, 2008.
  30. Oakley, S. Preliminary treatment and primary sedimentation. In Global Water Pathogen Project; California State University, Chico: Chico, CA, USA, 2021.
  31. Balku, S. Comparison between alternating aerobic–anoxic and conventional activated sludge systems. Water Res. 2007, 41, 2220–2228.
  32. Patziger, M.; Kainz, H.; Hunze, M.; Józsa, J. Influence of secondary settling tank performance on suspended solids mass balance in activated sludge systems. Water Res. 2012, 46, 2415–2424.
  33. Matamoros, V.; Salvadó, V. Evaluation of a coagulation/flocculation-lamellar clarifier and filtration-UV-chlorination reactor for removing emerging contaminants at full-scale wastewater treatment plants in Spain. J. Environ. Manag. 2013, 117, 96–102.
  34. Orhon, D. Evolution of the activated sludge process: The first 50 years. J. Chem. Technol. Biotechnol. 2014, 90, 608–640.
  35. Zagklis, D.P.; Bampos, G. Tertiary Wastewater Treatment Technologies: A Review of Technical, Economic, and Life Cycle Aspects. Processes 2022, 10, 2304.
  36. Ayhan Demirbas, G.E.; Alalayah, W.M. Sludge production from municipal wastewater treatment in sewage treatment plant. Energy Sources Part A Recover. Util. Environ. Eff. 2017, 39, 999–1006.
  37. Christensen, M.L.; Keiding, K.; Nielsen, P.H.; Jørgensen, M.K. Dewatering in biological wastewater treatment: A review. Water Res. 2015, 82, 14–24.
  38. Athey, S.; Tibshirani, J.; Wager, S. Generalized random forests. Ann. Stat. 2019, 47, 1148–1178.
  39. Popescu, M.C.; Balas, V.; Perescu-Popescu, L.; Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 2009, 8, 579–588.

Figure 1. Schema of agent-based system.
Figure 2. Schema of the WWTP case study.
Figure 3. Distribution of variables.
Figure 4. Representation of a decision tree.
Figure 5. Violin plot of R² in LR cross-validation.
Figure 6. Violin plot of R² in PR cross-validation.
Figure 7. Violin plot of R² in KNN cross-validation.
Figure 8. Violin plot of R² in DT cross-validation.
Figure 9. Violin plot of R² in RF cross-validation.
Figure 10. Violin plot of R² in GB cross-validation.
Figure 11. Violin plot of R² in SVM cross-validation.
Figure 12. Violin plot of R² in MLP cross-validation.
Figure 13. Residual errors obtained with the regression model with the virtual sensor.

Table 1. Summary of ONI sub-indicators, regulatory context, and weights.

| Sub-Indicator | Meaning | Reference | Weight |
|---|---|---|---|
| Nreal% | N total vs. legal limit | 91/271/EEC [27]; RD 509/1996 [28] | 35% |
| Tnitr% | N trend (ΔN/Δt) | Dynamic risk proxy; RD 509/1996 [28] | 15% |
| Enitr% | N removal efficiency | [27]; RD 509/1996 [28] | 30% |
| NP% | N/P balance | Redfield; Henze et al. [29] | 20% |

ONI = 0.35 × Nreal% + 0.15 × Tnitr% + 0.30 × Enitr% + 0.20 × NP%
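
The weighted aggregation in Table 1 can be illustrated with a minimal Python sketch. The function name `compute_oni`, the dictionary keys, and the example values are assumptions made for illustration only; the weights are the ones listed in Table 1, and each sub-indicator is assumed to be already expressed as a percentage.

```python
# Weights as listed in Table 1 (they sum to 1.0).
ONI_WEIGHTS = {"Nreal": 0.35, "Tnitr": 0.15, "Enitr": 0.30, "NP": 0.20}


def compute_oni(sub_indicators):
    """Return the weighted sum of the four sub-indicators.

    `sub_indicators` maps each key of ONI_WEIGHTS to a value in percent.
    How each sub-indicator is derived from plant data is outside this sketch.
    """
    missing = set(ONI_WEIGHTS) - set(sub_indicators)
    if missing:
        raise KeyError(f"missing sub-indicators: {sorted(missing)}")
    return sum(weight * sub_indicators[name]
               for name, weight in ONI_WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical values, not taken from the study.
    example = {"Nreal": 80.0, "Tnitr": 60.0, "Enitr": 90.0, "NP": 70.0}
    print(f"ONI = {compute_oni(example):.2f} %")
```

With these hypothetical inputs the sum is 28 + 9 + 27 + 14 = 78, so legal compliance and removal efficiency together account for most of the score, as their combined 65% weight implies.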
Table 2. Data set for developing the model: description, units, and tags of variables.

| Role | Variable | Tag | Unit |
|---|---|---|---|
| Input | Total Phosphorus | Phosp_T | mg/L |
| Input | Total Kjeldahl Nitrogen | TKN | mg/L |
| Output | Total Nitrogen | Nitrogen_T | mg/L |

Table 3. Data set to validate agent approach: description, units, and tags of variables.

| Variable | Tag | Unit |
|---|---|---|
| Total Nitrogen at the Influent | Nitrogen_T_Influ | mg/L |
| Total Phosphorus | Phosp_T | mg/L |
| Total Kjeldahl Nitrogen | TKN | mg/L |
| Total Nitrogen | Nitrogen_T | mg/L |

Table 4. Average values obtained in the LR cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| intercept = True; positive = True | 0.69319 | 0.45383 | 0.52564 | 5.36976 |
| intercept = True; positive = False | 0.69319 | 0.45383 | 0.52564 | 5.36976 |
| intercept = False; positive = True | 0.69691 | 0.45656 | 0.48700 | 5.46750 |
| intercept = False; positive = False | 0.69691 | 0.45656 | 0.48700 | 5.46750 |
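
For readers who want to reproduce the four metrics reported in Tables 4 to 12, the following is a minimal pure-Python sketch. It assumes the standard definitions of R², MAE, and MSE, and one common SMAPE variant in which the absolute error is divided by the mean of |y| and |ŷ|; the SMAPE definition used in the study may differ. The helper name `regression_metrics` is hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Compute R2, MAE, MSE and SMAPE (%) for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n

    # R2 = 1 - SS_res / SS_tot; undefined when the target is constant.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # Assumed SMAPE variant: |y - yhat| / ((|y| + |yhat|) / 2), averaged, in %.
    # Pairs where both values are zero contribute nothing to the sum.
    smape_terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2.0
        smape_terms.append(abs(t - p) / denom if denom > 0 else 0.0)
    smape = 100.0 * sum(smape_terms) / n

    return {"R2": r2, "MAE": mae, "MSE": mse, "SMAPE": smape}


if __name__ == "__main__":
    # Hypothetical total nitrogen values in mg/L, not taken from the study.
    print(regression_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 13.0]))
```

In a cross-validation setting, a function like this would be applied to each fold and the results averaged, which is how the per-configuration rows in these tables are described.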
Table 5. Average values obtained in the PR cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| degree = 2; include_bias = True; interaction_only = False | 0.64424 | 0.44368 | 0.58227 | 5.15816 |
| degree = 2; include_bias = True; interaction_only = True | 0.63789 | 0.47783 | 0.64745 | 5.58262 |
| degree = 2; include_bias = False; interaction_only = True | 0.62430 | 0.49164 | 0.65091 | 5.84100 |
| degree = 3; include_bias = False; interaction_only = False | 0.62215 | 0.43200 | 0.57617 | 5.03020 |

Table 6. Average values obtained in the KNN cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| K = 19; Weight = Uniform | 0.66872 | 0.43268 | 0.70421 | 5.00806 |
| K = 17; Weight = Uniform | 0.66805 | 0.42842 | 0.69923 | 4.94919 |
| K = 13; Weight = Uniform | 0.66792 | 0.42581 | 0.69037 | 4.90991 |
| K = 9; Weight = Distance | 0.64677 | 0.41969 | 0.68203 | 4.83718 |
| K = 11; Weight = Distance | 0.65179 | 0.41802 | 0.67720 | 4.81392 |
| K = 13; Weight = Distance | 0.65542 | 0.41787 | 0.67990 | 4.81108 |
| K = 15; Weight = Distance | 0.65354 | 0.41986 | 0.68707 | 4.83675 |

Table 7. Average values obtained in the DT cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| max_depth = 5; criterion = poisson; splitter = best | 0.60805 | 0.45069 | 0.65623 | 5.26242 |
| max_depth = 5; criterion = absolute; splitter = random | 0.57724 | 0.51034 | 0.88855 | 5.94258 |
| max_depth = 19; criterion = poisson; splitter = random | 0.54750 | 0.50819 | 0.75363 | 6.17061 |
| max_depth = 5; criterion = friedman; splitter = best | 0.52797 | 0.46268 | 0.75902 | 5.37417 |
| max_depth = 5; criterion = absolute; splitter = best | 0.52692 | 0.45604 | 0.74765 | 5.19172 |
| max_depth = 7; criterion = poisson; splitter = best | 0.50978 | 0.46369 | 0.74016 | 5.46085 |

Table 8. Average values obtained in the RF cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| n_estimators = 7; max_depth = 3; criterion = absolute | 0.64720 | 0.42888 | 0.71567 | 4.99111 |
| n_estimators = 3; max_depth = 3; criterion = poisson | 0.64531 | 0.45933 | 0.66468 | 5.35222 |
| n_estimators = 3; max_depth = 5; criterion = poisson | 0.64471 | 0.43306 | 0.62512 | 5.01796 |
| n_estimators = 39; max_depth = 7; criterion = poisson | 0.62694 | 0.41515 | 0.64046 | 4.84066 |
| n_estimators = 27; max_depth = 7; criterion = poisson | 0.63405 | 0.41507 | 0.62985 | 4.84770 |
| n_estimators = 45; max_depth = 5; criterion = poisson | 0.64192 | 0.41484 | 0.63737 | 4.81943 |
| n_estimators = 3; max_depth = 7; criterion = poisson | 0.62942 | 0.43952 | 0.62485 | 5.11903 |
| n_estimators = 5; max_depth = 5; criterion = poisson | 0.63774 | 0.42135 | 0.61801 | 4.90172 |
| n_estimators = 43; max_depth = 5; criterion = poisson | 0.63945 | 0.41521 | 0.63669 | 4.82227 |
| n_estimators = 53; max_depth = 7; criterion = absolute | 0.62498 | 0.41685 | 0.66790 | 4.82728 |

Table 9. Average values obtained in the GB cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| loss = squared; learning_rate = 0.05; n_estimators = 40; criterion = friedman | 0.64161 | 0.44281 | 0.66759 | 5.20053 |
| loss = squared; learning_rate = 0.05; n_estimators = 40; criterion = squared | 0.64161 | 0.44281 | 0.66759 | 5.20053 |
| loss = squared; learning_rate = 0.01; n_estimators = 190; criterion = squared | 0.63964 | 0.44672 | 0.66890 | 5.25352 |
| loss = squared; learning_rate = 0.05; n_estimators = 80; criterion = friedman | 0.62421 | 0.42065 | 0.66698 | 4.88552 |
| loss = squared; learning_rate = 0.05; n_estimators = 70; criterion = friedman | 0.62891 | 0.41988 | 0.66200 | 4.88338 |
| loss = squared; learning_rate = 0.05; n_estimators = 70; criterion = squared | 0.62891 | 0.41988 | 0.66200 | 4.88338 |
| loss = squared; learning_rate = 0.05; n_estimators = 60; criterion = squared | 0.63419 | 0.42131 | 0.65729 | 4.91295 |
| loss = squared; learning_rate = 0.05; n_estimators = 50; criterion = friedman | 0.63930 | 0.42693 | 0.65459 | 4.99443 |
| loss = squared; learning_rate = 0.05; n_estimators = 50; criterion = squared | 0.63930 | 0.42693 | 0.65459 | 4.99443 |

Table 10. Average values obtained in the SVM cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| kernel = linear; C = 0.05; epsilon = 0.5 | 0.71852 | 0.42654 | 0.51055 | 5.09560 |
| kernel = linear; C = 0.1; epsilon = 0.5 | 0.71544 | 0.42706 | 0.51034 | 5.10120 |
| kernel = linear; C = 10; epsilon = 0.5 | 0.71174 | 0.42785 | 0.51101 | 5.10224 |
| kernel = linear; C = 5; epsilon = 0.001 | 0.70795 | 0.41201 | 0.56674 | 4.89631 |
| kernel = linear; C = 0.5; epsilon = 0.001 | 0.70812 | 0.41179 | 0.56656 | 4.89469 |
| kernel = linear; C = 1; epsilon = 0.001 | 0.70801 | 0.41178 | 0.56671 | 4.89457 |

Table 11. Average values obtained in the MLP cross-validation process.

| Configuration | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| hidden_neurons = 43; dropout = 0.1; activation_function = linear | 0.70960 | 0.42787 | 0.48533 | 5.10584 |
| hidden_neurons = 50; dropout = 0.2; activation_function = ReLU | 0.70365 | 0.43664 | 0.48709 | 5.20179 |
| hidden_neurons = 27; dropout = 0.3; activation_function = ReLU | 0.70282 | 0.45303 | 0.51916 | 5.38930 |
| hidden_neurons = 33; dropout = 0.2; activation_function = linear | 0.70035 | 0.43706 | 0.49043 | 5.24328 |

Table 12. Average values obtained in the Gradient Boosting cross-validation process.

| Algorithm | R² | MAE (mg/L) | MSE ((mg/L)²) | SMAPE (%) |
|---|---|---|---|---|
| Linear Regression | 0.75068 | 0.47659 | 0.44287 | 5.79698 |
| Polynomial Regression | 0.75260 | 0.45207 | 0.43945 | 5.61793 |
| K-Nearest Neighbors | 0.75055 | 0.45247 | 0.44310 | 5.43719 |
| Decision Tree | 0.75490 | 0.44085 | 0.43537 | 5.40378 |
| Random Forest | 0.68078 | 0.45277 | 0.56703 | 5.36810 |
| Gradient Boosting | 0.81671 | 0.38896 | 0.32556 | 4.90681 |
| Support Vector Machine | 0.74216 | 0.44631 | 0.45799 | 5.45197 |
| Multi-Layer Perceptron | 0.72242 | 0.46795 | 0.49306 | 5.70925 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Timiraos, M.; Díaz-Longueira, A.; Jove, E.; Fontenla-Romero, Ó.; Calvo-Rolle, J.L. The Operational Nitrogen Indicator (ONI): An Intelligent Index for the Wastewater Treatment Plant’s Optimization. Processes 2025, 13, 2301. https://doi.org/10.3390/pr13072301
