Article

Sustainable Sewage Treatment Prediction Using Integrated KAN-LSTM with Multi-Head Attention

Division of Information and Electronic Engineering, Muroran Institute of Technology, Muroran 050-8585, Japan
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(10), 4417; https://doi.org/10.3390/su17104417
Submission received: 23 April 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 13 May 2025

Abstract

The accurate prediction of sewage treatment indicators is crucial for optimizing management and supporting sustainable water use. This study proposes the KAN-LSTM model, a hybrid deep learning model combining long short-term memory (LSTM) networks, Kolmogorov–Arnold network (KAN) layers, and multi-head attention. The model effectively captures complex temporal dynamics and nonlinear relationships in sewage data, outperforming conventional methods. We applied correlation analysis with time-lag consideration to select key indicators. The KAN-LSTM model then processes them through LSTM layers for sequential dependencies, KAN layers for enhanced nonlinear modeling via learnable B-spline transformations, and multi-head attention for dynamic weighting of temporal features. This combination handles short-term patterns and long-range dependencies effectively. Experiments showed the model's superior performance, achieving a 95.13% R-squared score for FOss (final sedimentation basin outflow suspended solids, one of the indicators predicted in this study) and significantly improving prediction accuracy. These advancements in intelligent sewage treatment prediction modeling not only enhance water sustainability but also demonstrate the transformative potential of hybrid deep learning approaches. This methodology could be extended to optimize predictive tasks in sustainable aquaponic systems and other smart aquaculture applications.

1. Introduction

Accurate time-series prediction has emerged as a critical tool for optimizing sewage treatment systems [1], enabling both proactive process control and sustainable resource management. While deep learning approaches show considerable promise in this domain, current methods frequently fail to address two fundamental challenges inherent to sewage treatment data: the complex nonlinear relationships between treatment indicators [2] and the optimal selection of predictive features from highly correlated temporal measurements [3].
Traditional prediction methodologies, including Autoregressive integrated moving average (ARIMA) [4] and Support Vector Regression (SVR) [5], exhibit well-documented limitations when applied to sewage treatment data. These approaches typically demonstrate poor performance when handling multivariate, non-stationary time series [6], struggle to capture long-range dependencies in treatment processes, and show sensitivity to measurement noise and missing data [7]. While recent advances in deep learning have offered potential solutions, these too present significant challenges. Long short-term memory (LSTM) networks [8], despite their effectiveness in sequence modeling, often encounter difficulties with gradient flow in long sequences [9]. We acknowledge that residual connections can mitigate LSTM’s gradient issues [10]. However, for sewage treatment forecasting with moderate sequence lengths, our tests showed that residual LSTM provided limited accuracy gains while complicating temporal interpretation—a key requirement for operational use. This trade-off motivated our baseline LSTM design, though residual variants may benefit longer-sequence extensions. Similarly, Transformer architectures, while excelling at capturing long-distance dependencies, frequently fail to recognize local temporal patterns that prove critical for accurate treatment monitoring. Importantly, neither approach adequately addresses the systematic feature selection required for handling correlated sewage indicators [11].
Our study makes two distinct yet complementary contributions to address these challenges. First, we develop a rigorous time-lag correlation analysis [12] framework specifically designed for identifying the most predictive sewage treatment indicators. This preprocessing methodology quantitatively evaluates delayed relationships between parameters (such as the 3-day lag observed between pH and FOss levels). It selects features based on statistically significant correlations, while deliberately avoiding any modification of the subsequent modeling approach. Second, we introduce an innovative KAN-LSTM hybrid architecture that effectively combines the sequential pattern recognition capabilities of LSTM layers with the complex nonlinear modeling capacity of Kolmogorov–Arnold network (KAN) [13] transformations through learnable B-splines, further enhanced by multi-head attention mechanisms that dynamically weigh important temporal features [14].
This deliberate separation of feature selection from modeling architecture yields several significant advantages. The framework provides enhanced interpretability through explicit indicator screening while achieving superior predictive accuracy via its specialized architecture and maintaining practical deployment advantages for treatment plant operations. Our experimental results demonstrate consistent improvements over conventional methods without compromising computational efficiency [15].
The remainder of this paper is systematically organized: Section 2 introduces our methodology, beginning with the time-lag correlation analysis for feature selection, followed by detailed architectural explanations of the integrated KAN-LSTM model with multi-head attention. Section 3 presents comprehensive experimental results, including comparative analyses with state-of-the-art models and ablation studies validating each component’s contribution. Section 4 discusses the practical implications of our findings, acknowledges limitations, and outlines future research directions for sustainable sewage management. Finally, we conclude with the broader impacts of this work.

2. Methods

With the exponential growth of time-series data across financial forecasting, environmental monitoring, and industrial operations, we face increasing demands for advanced forecasting models that can effectively capture linear and nonlinear patterns while remaining computationally efficient [16]. Current deep learning approaches often struggle with the complex characteristics of time-series data [17], particularly when confronting non-stationary patterns and dependencies that span longer periods [18]. To address these challenges, our research introduces a time-lag algorithm utilizing Pearson correlation matrices [19]. This time-lag screening method helps to select more meaningful indicators for prediction, and the timeliness of sewage treatment is more consistent with the actual process. Building on this foundation, we developed an integrated KAN-LSTM model, which combines the sequential learning capabilities of LSTM networks with the expressive power of KAN and the selective focus of multi-head attention mechanisms. We tested our hybrid model on a practical real-world challenge: predicting critical parameters in sewage treatment processes. This application domain is an ideal testing ground due to its complex, multifaceted nature and the significant operational benefits that accurate predictions can provide for treatment facility management and environmental compliance.

2.1. Time-Series Lag Analysis of Indicators in Sewage Treatment

This section examines the interrelations among key indicators in sewage treatment and develops predictive models for forecasting future trends of these indicators, focusing on the importance of time-series lag. We focus on pH, final sedimentation basin outflow suspended solids (FOss), treated water transparency (TWT), and treated water total nitrogen (TWTN). We analyze extensive operational data and establish accurate predictive models built on a deep learning network, using advanced statistical analysis and deep learning algorithms. Our results reveal the intrinsic connections between these indicators and confirm the high accuracy of our model in predictions [11].
The data elements are standardized internationally because the sewage treatment system and equipment are based on the activated sludge method, an international standard. We therefore perform a correlation analysis of the indicators on the sewage treatment data to select the four indicators with the strongest correlations. To account for the timeliness of the sewage treatment process, we depart from the traditional method of comparing contemporaneous correlations. In preliminary experiments testing lags from 1 to 5 days, we found that the correlation between the data is highest at a lag of 3 days, so we calculate the correlations based on a 3-day lag between the data points. The specific formula is shown in Equation (1), where k is three in this study, meaning that we analyze the correlation of time-series data separated by 3 days. The appropriate number of lag days depends on the size of the sewage treatment area; this experiment uses data from Fukagawa City:
$$r_{\text{lag},k} = \frac{\sum_{i=k+1}^{n} \left(X_i - \bar{X}\right)\left(Y_{i-k} - \bar{Y}_{\text{lag}}\right)}{\sqrt{\sum_{i=k+1}^{n} \left(X_i - \bar{X}\right)^2}\,\sqrt{\sum_{i=k+1}^{n} \left(Y_{i-k} - \bar{Y}_{\text{lag}}\right)^2}}, \quad (1)$$
where $r_{\text{lag},k}$ is the Pearson correlation coefficient with a lag of $k$, $X_i$ and $Y_{i-k}$ are the individual data points for the two variables (with the $Y$ variable lagged by $k$ periods), $\bar{Y}_{\text{lag}}$ is the mean of the $Y$ data points after applying a lag of $k$ periods, and $n$ is the total number of data points available after applying the lag. The summation starts from $i = k+1$ because the first $k$ data points of the $Y$ variable have no corresponding $X$ values due to the lag. In the correlation matrix of Figure 1, the rows and columns correspond to a variety of metrics that are essential in sewage treatment processes, including IF-BOD (inflow water biochemical oxygen demand), MLSS (mixed liquor suspended solids), and DO (dissolved oxygen), among others. The correlation coefficients displayed by the heat map identify pH, FOss, TWT, and TWTN as exhibiting the most significant relationships (blue frames). Using the cross-correlation with a lag of 3 days, we display only data with absolute correlation values greater than 0.5 in the figure, including both positive and negative correlations. This method explores the relationship between sewage indicators before and after treatment. This analytical approach allows us to examine the interconnections between indicators from various perspectives, uncovering a broader spectrum of relationships and deepening our understanding of the dynamics within sewage treatment systems. The 0.5 threshold used in this preliminary experiment is more stringent than the 0.3 threshold used in related research [20], so it selects only stronger correlations and makes the results more convincing.
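As an illustrative sketch (not the authors' released code), the lagged correlation of Equation (1) can be computed as follows; the function and variable names are our own:

```python
import numpy as np

def lagged_pearson(x: np.ndarray, y: np.ndarray, k: int = 3) -> float:
    """Pearson correlation between X_i and Y_{i-k}, as in Equation (1)."""
    x_lead = x[k:]               # X_i for i = k+1, ..., n
    y_lag = y[:-k]               # Y_{i-k}, the lagged series
    x_c = x_lead - x_lead.mean()
    y_c = y_lag - y_lag.mean()
    return float((x_c * y_c).sum() /
                 np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Example: correlate pH today against FOss shifted back by 3 days
# r = lagged_pearson(ph_series, foss_series, k=3)
```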
Figure 2 illustrates the relationships among pH, FOss, TWT, and TWTN using pink scatter plots and blue linear regression lines. Diagonal graphs display the distribution of each variable, while off-diagonal graphs depict relationships between variables through data points and linear regression fits. These visualizations allow us to observe how the linear regression model captures the relationships between each pair of variables, providing an intuitive display of the relationships and model fit. Due to computing resource limitations, predictions were made by selecting the four most correlated indicators. Figure 2 verifies that the selected four indicators are reasonable [11].

2.2. Integrated KAN-LSTM Model

2.2.1. Time-Series Prediction Framework

The proposed hybrid model processes 10 historical time steps (look_back = 10) to predict the subsequent indicator values ($T+1$), addressing two key sewage treatment challenges:
(1)
Long-term dependency capture via LSTM’s gated memory mechanism.
(2)
Dynamic feature weighting through multi-head attention.
This design (Figure 3) serves as the computational backbone for the subsequent KAN-enhanced transformations (Section 2.2.2), with the empirical results in Section 3 demonstrating its superiority over LSTM-only or attention-only architectures for sewage data forecasting.
Figure 3 illustrates the architecture of the proposed Integrated KAN-LSTM Model, designed for accurate prediction of key sewage treatment indicators—pH, FOss, TWT, and TWTN. The process begins on the left side, where the selected indicators, determined through a 3-day time-lagged Pearson correlation analysis, are input as multivariate time series features. The light blue arrows indicate the data flow direction throughout the model.
First, the input sequence passes through the LSTM layer (depicted in light blue), which captures temporal dependencies using a gated memory mechanism. Next, the LSTM output is forwarded to the KAN layer (green block), which applies nonlinear transformations using learnable B-spline basis functions. Unlike traditional neural networks that use fixed activation functions on nodes, KAN assigns learnable functions on edges, improving the model’s capacity to capture complex nonlinearities in sewage data. Subsequently, the transformed features are passed into the Multi-Head Attention mechanism (orange block), which computes attention weights across different time steps. Each attention head processes the Query (Q), Key (K), and Value (V) vectors, calculated via linear transformations, and applies scaled dot-product attention. The outputs from all heads are concatenated and linearly transformed, allowing the model to dynamically weight temporal features. Finally, the aggregated features are passed to a fully connected output layer, which produces the final predictions for the next time step. A comparative graph in the top right corner visualizes the alignment between predicted and true values, demonstrating the model’s high forecasting accuracy. Overall, this integrated framework leverages the strengths of LSTM in sequence modeling, the nonlinear expressiveness of KAN, and the dynamic temporal weighting capability of attention mechanisms, all while maintaining interpretability and predictive performance.
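To make this data flow concrete, the following PyTorch sketch shows one plausible implementation of the pipeline (LSTM, then a KAN-style B-spline layer, then multi-head attention, then a dense output head). The class names, layer sizes, and the hat-basis KAN layer are our own assumptions based on Section 2.2.2 and Table 2, not the authors' published code:

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """KAN-style layer: a linear base path plus a learnable spline expansion
    built from hat-shaped basis functions on a uniform grid (Section 2.2.2)."""
    def __init__(self, dim: int, grid_size: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        # Grid points g_i uniformly spaced in [0, 1]; inputs are min-max scaled.
        self.register_buffer("grid", torch.linspace(0.0, 1.0, grid_size))
        # Zero init: training starts from the pure linear path.
        self.spline_weight = nn.Parameter(torch.zeros(grid_size, dim, dim))

    def forward(self, x):
        # Basis B_i(x) = max(0, 1 - |x - g_i|), per feature and grid point.
        b = (1.0 - (x.unsqueeze(-1) - self.grid).abs()).clamp(min=0.0)
        spline = torch.einsum("...dg,gde->...e", b, self.spline_weight)
        return self.base(x) + spline

class KANLSTM(nn.Module):
    """LSTM -> KAN layer -> multi-head attention -> dense output head."""
    def __init__(self, n_features: int = 4, hidden: int = 64, heads: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.kan = KANLayer(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                  # x: (batch, look_back, n_features)
        h, _ = self.lstm(x)                # temporal dependencies
        h = self.kan(h)                    # nonlinear B-spline transformation
        a, _ = self.attn(h, h, h)          # dynamic weighting of time steps
        return self.out(a[:, -1])          # next-step prediction

# Example: ten past steps of the four selected indicators -> next values
model = KANLSTM()
pred = model(torch.randn(32, 10, 4))       # (batch=32, look_back=10, features=4)
```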

2.2.2. KAN Layer (Kolmogorov–Arnold Network Layer)

Traditional neural networks (MLPs) [21] use fixed activation functions (e.g., ReLU, tanh) at their nodes, while KAN innovates by employing learnable activation functions on edges. Each weight parameter in KAN is replaced by a univariate function (typically a B-spline [22]), enabling dynamic adaptation to the complex nonlinearities of sewage treatment data. The KAN model excels at capturing complex nonlinear relationships in data while maintaining high interpretability. By leveraging B-spline functions for nonlinear transformations, it enhances expressiveness when handling highly nonlinear data. Additionally, the KAN model has relatively low computational complexity [13], allowing it to achieve high prediction accuracy with fewer parameters and making it well suited for efficient time-series modeling tasks. This edge-based parameterization, usually in the form of spline functions, provides high flexibility and the ability to simulate complex functions with fewer parameters, enhancing the interpretability of the model.
The KAN layer is a nonlinear transformation layer based on B-spline functions, capable of capturing complex nonlinear relationships in data. The mathematical representation of the KAN layer is as follows:
$$y = W_{\text{base}}\, x + \sum_{i=1}^{G} B_i(x) \cdot W_{\text{spline},i},$$

where $W_{\text{base}}$ is the base weight matrix, $B_i(x)$ represents the B-spline basis function, $W_{\text{spline},i}$ is the spline weight, and $G$ is the number of grid points. The B-spline basis function is defined as follows:

$$B_i(x) = \max\left(0,\; 1 - \left|x - g_i\right|\right),$$

where the $g_i$ are grid points uniformly distributed in the $[0, 1]$ interval. The KAN layer computes both the linear output from the base weights and the nonlinear output from the B-spline transformation, which are then summed to obtain the final output.
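A minimal NumPy sketch of this computation, assuming the hat-shaped basis above and treating $B_i$ as acting elementwise on $x$ (our reading of the equation, not the authors' code), would be:

```python
import numpy as np

def kan_forward(x, W_base, W_spline, grid):
    """y = W_base x + sum_i W_spline_i B_i(x), with B_i(x) = max(0, 1 - |x - g_i|)."""
    y = W_base @ x
    for i, g in enumerate(grid):
        B_i = np.maximum(0.0, 1.0 - np.abs(x - g))  # hat basis, elementwise on x
        y = y + W_spline[i] @ B_i                   # spline term for grid point i
    return y

# Example with 4 min-max-scaled inputs and G = 8 grid points in [0, 1]
rng = np.random.default_rng(0)
x = rng.random(4)
grid = np.linspace(0.0, 1.0, 8)
y = kan_forward(x, rng.standard_normal((4, 4)),
                rng.standard_normal((8, 4, 4)), grid)
```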

2.2.3. LSTM

In this study, LSTM plays a central role in modeling time-series data by capturing long-term dependencies in the sewage treatment dataset. LSTM effectively mitigates the vanishing gradient problem as a recurrent neural network variant, making it well suited for learning sequential patterns. The LSTM layer extracts temporal features from the input sequences [23], which are subsequently refined through a KAN layer to enhance nonlinear representation. Additionally, a multi-head attention mechanism is applied to further emphasize key temporal dependencies, improving predictive accuracy. Integrated within a Transformer-based framework, LSTM serves as the foundational feature extraction component, providing essential temporal information for subsequent processing. Its ability to model sequential dependencies ensures that the predictive model effectively learns patterns in the time-series data, contributing to robust and interpretable forecasting performance.
LSTM consists of four main components: forget gate, input gate, cell candidate, and output gate. The forward computation is as follows:
(1)
Forget gate: Determines how much past information should be forgotten:
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right),$$

where $f_t$ is the forget gate output (ranging between 0 and 1), $x_t$ is the input at the current time step, $h_{t-1}$ is the hidden state from the previous time step, $W_f$, $U_f$, and $b_f$ are learnable parameters, and $\sigma$ represents the sigmoid activation function.
(2)
Input gate: Determines the importance of the current input:
$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right),$$
$$\tilde{C}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right),$$

where $i_t$ is the input gate output (ranging between 0 and 1) and $\tilde{C}_t$ is the candidate cell state, with $\tanh$ limiting its range to $[-1, 1]$.
(3)
Cell-state update: Combines information from the forget and input gates:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

where $C_t$ is the updated cell state at time $t$ and $\odot$ denotes the Hadamard (element-wise) product.
(4)
Output gate: Determines which information should be passed to the hidden state:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right),$$
$$h_t = o_t \odot \tanh\left(C_t\right),$$

where $o_t$ is the output gate activation (ranging between 0 and 1) and $h_t$ is the hidden state at the current time step. These equations collectively enable LSTM to capture long-term dependencies in sequential data while effectively mitigating the vanishing gradient problem.
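For reference, one LSTM time step implementing the gate equations above can be sketched in NumPy as follows; the parameter dictionary p is our own illustrative packaging of the W, U, and b matrices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the four gate equations above."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate state
    c = f * c_prev + i * c_tilde                                   # cell-state update
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate
    h = o * np.tanh(c)                                             # new hidden state
    return h, c
```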

2.2.4. Multi-Head Attention Mechanism

The multi-head attention mechanism extends the basic attention mechanism by processing multiple attention distributions in parallel [24], improving the model's expressive ability and learning efficiency; it has become one of the core components of the Transformer architecture. After LSTM processes the input data and the KAN layer extracts nonlinear features, the multi-head attention mechanism allocates different attention weights globally, which yields the following benefits:
(1)
Enhancing long-term dependencies: Compared with LSTM, the attention mechanism can capture long-range dependencies and avoid the gradient vanishing problem.
(2)
Different feature weight allocation: Multiple attention heads can focus on different time steps, improving the model’s representation ability.
(3)
Improving nonlinear modeling capabilities: Combined with the KAN layer, the multi-head attention mechanism further enhances nonlinear mapping and improves prediction capabilities.
The linear transformation projects the input features $X$ into three different representation spaces: query ($Q$), key ($K$), and value ($V$). These transformations are achieved through the trainable weight matrices $W_Q$, $W_K$, and $W_V$, respectively:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$

where $X$ is the feature representation from the LSTM and KAN layers, and $W_Q$, $W_K$, and $W_V$ are trainable weight matrices.
The dot product of $Q$ and $K^T$ determines similarity. Scaling by $\sqrt{d_k}$ prevents excessively large values, and softmax normalizes the attention scores. This attention mechanism computes the weighted sum of values ($V$) based on the relevance determined by queries ($Q$) and keys ($K$), effectively allowing the model to focus on relevant information in the sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V.$$
Multiple attention heads process different feature subspaces. The outputs are concatenated and transformed by $W^O$. This parallel processing allows the model to jointly attend to information from different representation subspaces at different positions, capturing various types of relationships within the data simultaneously:

$$\text{MultiHead}(X) = \text{Concat}\left(\text{head}_1, \ldots, \text{head}_h\right) W^O.$$
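A compact NumPy sketch of scaled dot-product attention over h heads, following the three formulas above (the head-splitting by column slicing is one common convention, assumed here):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Scaled dot-product attention over h heads; X has shape (T, d_model)."""
    T, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # linear projections
    heads = []
    for j in range(h):
        s = slice(j * d_k, (j + 1) * d_k)             # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)   # scaled similarity
        heads.append(softmax(scores) @ V[:, s])       # weighted sum of values
    return np.concatenate(heads, axis=-1) @ W_O       # concatenate and project
```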

2.2.5. Model Interpretability

The proposed integrated KAN-LSTM model with multi-head attention achieves high interpretability through a multi-level design while maintaining predictive accuracy. Its modular architecture decomposes the prediction process into three physically meaningful stages: the LSTM layer captures temporal dynamics, the KAN layer performs nonlinear transformations using interpretable B-spline basis functions, and the attention mechanism identifies critical time steps, ensuring the accurate prediction of sewage treatment indicators and providing a powerful and interpretable explanation.

3. Experimental Results and Analysis

3.1. Experimental Data

The raw time-series data were subjected to a comprehensive preprocessing pipeline comprising three critical stages: normalization to ensure numerical stability, temporal partitioning to maintain chronological integrity, and sequential structuring to accommodate the model’s architectural requirements [25].

3.1.1. Data Normalization

To improve numerical stability and accelerate convergence, min–max scaling was applied to transform the raw values into the range [0, 1]. This transformation preserves the relative relationships between data points while preventing large variations in magnitude from affecting model performance [26,27]. Two independent scalers were used to normalize the input and output values separately, ensuring consistency when performing inverse transformations after prediction.
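A minimal sketch of this two-scaler normalization with scikit-learn; the array names and shapes are placeholders, since the study's actual dataset is private:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 1000 days x 4 indicators
X_raw = np.random.rand(1000, 4) * 10.0
y_raw = X_raw[:, [1]]                           # e.g., FOss as the target

x_scaler = MinMaxScaler(feature_range=(0, 1))   # scaler for the inputs
y_scaler = MinMaxScaler(feature_range=(0, 1))   # independent scaler for the target
X_scaled = x_scaler.fit_transform(X_raw)
y_scaled = y_scaler.fit_transform(y_raw)

# After prediction, restore physical units:
# y_pred = y_scaler.inverse_transform(y_pred_scaled)
```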

3.1.2. Data Partitioning

The dataset was split into three parts: training (70%), validation (10%), and testing (20%). The training set helped adjust the model’s parameters, the validation set was used to fine-tune settings and prevent overtraining, while the test set assessed the final performance. This approach helps the model perform well on new data while avoiding overfitting.
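A chronological 70/10/20 split that preserves temporal order might look like this sketch (the data array is an illustrative placeholder):

```python
import numpy as np

data = np.random.rand(1000, 4)               # placeholder time series
n = len(data)
train = data[: int(0.7 * n)]                 # first 70%: training
val = data[int(0.7 * n) : int(0.8 * n)]      # next 10%: validation
test = data[int(0.8 * n) :]                  # final 20%: testing
```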

3.1.3. Sequence Construction

To prepare the time-series data for the LSTM model, a sliding window technique was applied to generate sequential input–output pairs. Each input consisted of a fixed number of past observations (look_back), and the corresponding next value (T) was used as the target. This approach helps the model recognize patterns in historical data and make accurate future predictions by maintaining the temporal structure of the sequence.
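The sliding-window construction can be sketched as follows, with look_back = 10 as in Section 2.2.1 (the function name is our own):

```python
import numpy as np

def make_sequences(series: np.ndarray, look_back: int = 10):
    """Build (past window, next value) pairs while preserving temporal order."""
    X, y = [], []
    for t in range(len(series) - look_back):
        X.append(series[t : t + look_back])   # input: look_back past observations
        y.append(series[t + look_back])       # target: the next value T
    return np.array(X), np.array(y)
```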

3.1.4. Data Representation

The preprocessed sequences were converted into tensor representations for efficient computation. Additionally, batch processing was implemented to facilitate mini-batch gradient descent during training, reducing computational overhead and improving convergence stability. The training, validation, and test sets were structured to maintain temporal continuity, ensuring that the model effectively learns from past trends while making accurate future predictions.
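One way to realize this tensor conversion and mini-batching in PyTorch, assuming X_train and y_train come from the sliding-window step above (placeholder arrays shown):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

X_train = np.random.rand(700, 10, 4)   # placeholder: (samples, look_back, features)
y_train = np.random.rand(700, 4)       # placeholder targets

X_t = torch.tensor(X_train, dtype=torch.float32)
y_t = torch.tensor(y_train, dtype=torch.float32)
# shuffle=False keeps the windows in chronological order within each epoch
loader = DataLoader(TensorDataset(X_t, y_t), batch_size=32, shuffle=False)
```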

3.2. Data Source

This study utilizes operational data collected from a sewage treatment plant in Fukagawa City, Hokkaido, Japan, spanning the period from April 2015 to July 2019. The dataset comprises 17 standardized process indicators measured through the internationally recognized activated sludge method, ensuring the global relevance of the parameters. The collected data include comprehensive measurements across all treatment stages, including influent characteristics (BOD, inflow volume, and water temperature), process control parameters (MLSS, DO, aeration volume, and return sludge concentration), and effluent quality metrics (TWT, T-P, and T-N). The dataset was partitioned temporally into training (70%), validation (10%), and test (20%) sets while strictly maintaining a chronological order to preserve real-world forecasting conditions. Through systematic time-lag correlation analysis, four key indicators (reaction tank pH, FOss, TWT, and TWTN) were identified as having the strongest predictive relationships and selected for modeling. The dataset’s multi-year coverage captures seasonal operational variations while its standardized measurement protocols ensure consistency [20], making it particularly valuable for developing robust predictive models in sewage treatment applications. Complete parameter specifications and measurement ranges are provided in Table 1, with the selected modeling features indicated. This high-quality data foundation enables the rigorous evaluation of the proposed KAN-LSTM model’s predictive performance under realistic operational scenarios.

3.3. Model Training and Parameter Settings

The proposed hybrid architecture integrates three fundamental components to address the unique challenges of sewage treatment time-series prediction: LSTM networks for capturing temporal dependencies, KAN networks for enhanced nonlinear feature transformation through adaptive B-spline basis functions, and the multi-head attention mechanism that dynamically weights informative temporal features. This combination processes input sequences through successive stages of temporal pattern recognition (LSTM), nonlinear feature enhancement (KAN), and context-aware feature refinement (multi-head attention mechanism), culminating in a fully connected output layer that generates the final predictions. The complete hyperparameter configuration, including the dimensionalities of each component and training parameters, is detailed in Table 2.

3.4. Model Evaluation Metrics

In this study, we used the mean absolute error (MAE), root mean squared error (RMSE), and R-squared (R²) to evaluate the performance of our predictive model. These metrics assess the accuracy and reliability of the model in forecasting sewage treatment data.
The MAE [28] measures the average magnitude of errors between predicted and actual values, providing an intuitive measure of model performance. It is expressed as follows:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

where $n$ is the total number of samples, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
RMSE [29] measures the square root of the average squared differences between predicted and actual values. It penalizes larger errors more than MAE, making it sensitive to outliers.
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}.$$
A lower RMSE signifies better model accuracy, especially for datasets where large deviations are crucial.
$R^2$ [30], also known as the coefficient of determination, evaluates how well the model explains the variance in the actual data. It is defined as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},$$

where $\bar{y}$ is the mean of the actual values. $R^2 = 1$ indicates a perfect fit, $R^2 = 0$ means that the model performs no better than predicting the mean value, and $R^2 < 0$ suggests that the model is worse than simply using the average of $y$. These three metrics together provide a comprehensive assessment of model performance, considering both absolute and relative error magnitudes, as well as the model's explanatory power.
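These three metrics can be computed directly, as in this NumPy sketch:

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray):
    """MAE, RMSE, and R^2 as defined above."""
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return mae, rmse, 1.0 - ss_res / ss_tot
```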

Experimental Results

The predicted results for the four data indicators are shown in Figure 4. The orange and blue lines represent the predicted and true values, respectively. The visualization of the experimental results in Figure 5 makes it convenient and intuitive to judge the performance of each model by its error. Blue represents MAE, pink represents RMSE, and green represents R².
To benchmark the proposed model, we compared it with the original LSTM [31], the original KAN [32], the Neural Basis Expansion Analysis for Time Series (N-BEATS) model [33], and the Temporal Convolutional Network (TCN) model [34], all using the same dataset. To ensure experimental fairness, all comparative tests were conducted using identical computational resources and environments. Each model was trained on the same dataset with an equivalent hyperparameter search space. All models maintained identical iteration counts (epochs), optimizer configurations, and learning rates throughout training to guarantee comparability. All computations were performed exclusively on an Intel Core i7-1370P processor, with strictly controlled computational costs to eliminate potential performance variations caused by hardware differences.

3.5. Analysis of Sewage Treatment Data Results

Table 3 presents a comparison of forecasting errors for four sewage treatment indicators—pH, FOss, TWT, and TWTN—using various models, including LSTM, KAN, N-BEATS, TCN, and the integrated KAN-LSTM model. Across all metrics (MAE, RMSE, and R²), the KAN-LSTM consistently outperforms the individual models. While LSTM and KAN perform reasonably well on certain indicators, they struggle to maintain accuracy across all tasks. Models like N-BEATS and TCN show competitive results in isolated cases but lack overall consistency. Notably, FOss appears relatively easy to predict for all models, with R² values above 90%, whereas the pH and TWT indicators pose more challenges. Among all approaches, the integrated KAN-LSTM demonstrates the most balanced and robust performance, achieving the lowest prediction errors and the highest R² values across all indicators, highlighting its superior ability to model complex, nonlinear patterns in sewage treatment data.

3.6. Ablation Study

We performed detailed ablation studies to assess how each component contributes to our KAN-LSTM model’s performance. By selectively removing or altering different parts of the architecture, we measured their individual effects. This approach helps us understand the model’s behavior and confirms our design choices.

3.6.1. Experimental Setup

We evaluated five model variants to isolate the contribution of each architectural component:
LSTM-only: Baseline model with only LSTM layers.
LSTM + Attention: LSTM with multi-head attention but no KAN layer.
LSTM + KAN: LSTM with KAN layer but no multi-head attention.
Integrated KAN-LSTM: The complete proposed model (LSTM + KAN with multi-head attention).
Each variant was trained and evaluated using identical datasets, preprocessing steps, and hyperparameter settings to ensure fair comparison. The results are shown in Table 4.

3.6.2. Results and Analysis

The ablation study reveals several important findings:
(1)
The comparative analysis reveals that augmenting the baseline LSTM with multi-head attention consistently enhances predictive performance across all evaluation metrics. Most strikingly, the R² score for pH prediction demonstrates a remarkable improvement from 30.13% to 53.67%. These results confirm the attention mechanism’s capability to effectively identify and prioritize critical temporal patterns in sewage treatment process data.
(2)
The comparative analysis between LSTM with attention and LSTM with KAN reveals the latter’s distinct advantage in modeling performance. For FOss prediction, incorporating the KAN layer yields measurable improvements, reducing the MAE from 0.92 to 0.79 and RMSE from 1.30 to 1.08. These results highlight the KAN layer’s exceptional capacity for capturing complex nonlinear patterns through its dynamic B-spline transformations, outperforming the attention mechanism alone in feature representation.
(3)
The comparison between LSTM+KAN and the complete integrated KAN-LSTM (with additional multi-head attention) reveals the added value of attention mechanisms. Most notably, in TWTN prediction, incorporating attention yields measurable gains—the MAE improves from 1.69 to 1.41 while R² increases from 78.56% to 80.12%. These improvements confirm that attention mechanisms provide critical complementary functionality to KAN’s nonlinear modeling, enabling the dynamic weighting of temporal features in sewage treatment processes.
The results demonstrate complementary functional specialization between model components. The KAN layer effectively captures nonlinear relationships among treatment parameters, as confirmed by its independent performance metrics. Simultaneously, the attention mechanism selectively enhances relevant temporal features. The integrated system combines these capabilities: KAN layers process complex variable transformations while attention mechanisms dynamically prioritize informative time steps, collectively achieving superior predictive performance compared to individual component implementations.
In conclusion, the ablation study validates our architectural design choices and confirms that each component makes a significant contribution to the overall performance of the integrated KAN-LSTM model. The results clearly demonstrate that the full integration of LSTM, KAN, and multi-head attention provides the optimal configuration for sewage treatment indicator prediction.

3.7. Discussion

The results indicate that the KAN-LSTM model consistently outperforms standalone models such as LSTM and KAN, as well as other deep learning architectures like N-BEATS and TCN. This superior performance highlights the effectiveness of integrating LSTM’s sequential learning capability with KAN’s ability to capture complex nonlinear relationships. The improved accuracy across all four indicators—pH, FOss, TWT, and TWTN—suggests that hybrid models are better suited for time-series forecasting in sewage treatment applications. However, despite its advantages, KAN-LSTM comes with increased computational complexity compared to standalone models, which may present challenges in real-time or resource-limited environments. Additionally, while the model demonstrates high predictive accuracy, further validation on larger and more diverse datasets is necessary to ensure its generalizability across different sewage treatment plants and environmental conditions.
The KAN-LSTM model achieves strong prediction performance, but its computational demands are higher than traditional methods due to combining multiple components (LSTM networks, KAN, and multi-head attention mechanisms). This increased complexity could limit real-time use in treatment plants with older equipment. In practice, engineers might need to apply optimization methods like model compression or specialized hardware implementations. While we focus on algorithmic development in this paper, these implementation challenges will need to be addressed for widespread industrial adoption.

4. Conclusions and Future Work

In this study, we first applied a time-lag Pearson correlation matrix to analyze the relationships among sewage treatment indicators and performed feature selection, ensuring that only the most relevant indicators were used for forecasting, thereby improving model efficiency and accuracy. We then conducted experiments using the proposed integrated KAN-LSTM model with a multi-head attention mechanism. Compared to traditional deep learning models, our approach demonstrates superior performance in sewage treatment time-series forecasting. The model effectively captures complex dependencies while improving predictive accuracy, offering sustainable solutions for environmental monitoring systems. This enhanced prediction capability supports sustainable water management practices [35]. To some extent, it makes up for the shortage of human resources in Japan’s sewage treatment companies. Furthermore, validating performance across diverse climatic conditions ensures the model’s applicability for global sustainability initiatives in sewage treatment.
To advance sustainable sewage management, future research will focus on optimizing the KAN-LSTM model's computational efficiency through lightweight architectures [36], reducing its environmental footprint while maintaining accuracy. We will enhance long-term dependency modeling using attention mechanisms to better predict seasonal variations, enabling more sustainable treatment plant operations. Integrating domain-specific physicochemical constraints will improve the model's interpretability and alignment with sustainable engineering practices. Ultimately, the real-time deployment of the optimized model could reduce energy and chemical usage by 15–20% in sewage treatment processes, directly contributing to Sustainable Development Goal 6 (SDG 6) [37] targets through data-driven resource optimization and supporting global sustainability initiatives in water infrastructure management [38].
Extending this research, the optimized KAN-LSTM framework could also support sustainable aquaponic systems, where maintaining balanced water quality and efficient nutrient cycling is essential [39]. By applying predictive modeling to real-time aquaculture monitoring, we can better track key parameters like dissolved oxygen, pH, and ammonia levels, helping fine-tune the symbiotic fish–plant relationship and reduce waste. These improvements would support circular economy approaches by cutting water and energy use in closed-loop aquaponic setups [40]. Additionally, insights from wastewater treatment models could be adapted for aquaculture, fostering cross-disciplinary innovations in sustainable water management for integrated farming systems.

Author Contributions

Conceptualization, J.Z., G.S. and H.S.; data curation, J.Z.; funding acquisition, G.S. and H.S.; methodology, J.Z., G.S. and H.S.; software, J.Z.; validation, J.Z., G.S. and H.S.; visualization, J.Z.; writing—original draft, J.Z.; writing—review and editing, G.S. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data is private.

Acknowledgments

This work was supported by JST SPRING, Grant Number JPMJSP2153. The authors would like to thank DATA BASE CO., LTD for engaging in water-related matters. The sewage management data were measured around Fukagawa City in Hokkaido.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LSTM    Long short-term memory;
KAN     Kolmogorov–Arnold network;
FOss    Final sedimentation basin outflow suspended solid;
TWT     Treated water transparency;
TWTN    Treated water total nitrogen;
MAE     Mean absolute error;
RMSE    Root mean squared error;
SDGs    Sustainable Development Goals.

References

  1. Van Haandel, A.C.; Lettinga, G. Anaerobic sewage treatment. In A Practical Guide for Regions with a Hot Climate; John Wiley and Sons: London, UK, 1994. [Google Scholar]
  2. Jin, L.; Zhang, G.; Tian, H. Current state of sewage treatment in China. Water Res. 2014, 66, 85–98. [Google Scholar] [CrossRef]
  3. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28 (NIPS 2015); Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  4. Box, G.E.; Pierce, D.A. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Am. Stat. Assoc. 1970, 65, 1509–1526. [Google Scholar] [CrossRef]
  5. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  6. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 2020. [Google Scholar]
  7. Xu, Y.; Mou, L.; Li, G.; Chen, Y.; Peng, H.; Jin, Z. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1785–1794. [Google Scholar]
  8. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  9. Lipton, Z.C.; Berkowitz, J.; Elkan, C. A critical review of recurrent neural networks for sequence learning. arXiv 2015, arXiv:1506.00019. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  11. Zheng, J.; Li, H.; Suzuki, G.; Shioya, H. Time Series Lag Analysis of Indicators in Sewage Treatment Process. In Proceedings of the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), Kitakyushu, Japan, 29 October–1 November 2024; IEEE: New York, NY, USA, 2024; pp. 377–378. [Google Scholar]
  12. Angeler, D.G.; Viedma, O.; Moreno, J. Statistical performance and information content of time lag analysis and redundancy analysis in time series modeling. Ecology 2009, 90, 3245–3257. [Google Scholar] [CrossRef]
  13. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  15. Wang, R.F.; Su, W.H. The application of deep learning in the whole potato production Chain: A Comprehensive review. Agriculture 2024, 14, 1225. [Google Scholar] [CrossRef]
  16. Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1256–1272. [Google Scholar] [CrossRef]
  17. Pan, C.H.; Qu, Y.; Yao, Y.; Wang, M.J.S. HybridGNN: A Self-Supervised Graph Neural Network for Efficient Maximum Matching in Bipartite Graphs. Symmetry 2024, 16, 1631. [Google Scholar] [CrossRef]
  18. Qin, Y.M.; Tu, Y.H.; Li, T.; Ni, Y.; Wang, R.F.; Wang, H. Deep Learning for Sustainable Agriculture: A Systematic Review on Applications in Lettuce Cultivation. Sustainability 2025, 17, 3190. [Google Scholar] [CrossRef]
  19. Cohen, I.; Huang, Y.; Chen, J.; Benesty, J.; Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
  20. Zhang, Y.; Suzuki, G.; Shioya, H. Prediction and detection of sewage treatment process using N-BEATS autoencoder network. IEEE Access 2022, 10, 112594–112608. [Google Scholar] [CrossRef]
  21. Gardner, M.W.; Dorling, S.R. Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences. Atmos. Environ. 1998, 32, 2627–2636. [Google Scholar] [CrossRef]
  22. Ta, H.T. BSRBF-KAN: A combination of B-splines and Radial Basis Functions in Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2406.11173. [Google Scholar]
  23. Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; de Albuquerque, V.H.C. Activity recognition using temporal optical flow convolutional features and multilayer LSTM. IEEE Trans. Ind. Electron. 2018, 66, 9692–9702. [Google Scholar] [CrossRef]
  24. Tao, C.; Gao, S.; Shang, M.; Wu, W.; Zhao, D.; Yan, R. Get The Point of My Utterance! Learning Towards Effective Responses with Multi-Head Attention Mechanism. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 4418–4424. [Google Scholar]
  25. Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524. [Google Scholar] [CrossRef]
  26. Quackenbush, J. Microarray data normalization and transformation. Nat. Genet. 2002, 32, 496–501. [Google Scholar] [CrossRef]
  27. Ali, P.J.M.; Faraj, R.H.; Koya, E.; Ali, P.J.M.; Faraj, R.H. Data normalization and standardization: A technical report. Mach. Learn Tech. Rep. 2014, 1, 1–6. [Google Scholar]
  28. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  29. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?–Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  30. Nakagawa, S.; Schielzeth, H. A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods Ecol. Evol. 2013, 4, 133–142. [Google Scholar] [CrossRef]
  31. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
  32. Genet, R.; Inzirillo, H. Tkan: Temporal kolmogorov-arnold networks. arXiv 2024, arXiv:2405.07344. [Google Scholar] [CrossRef]
  33. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437. [Google Scholar]
  34. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  35. Cui, K.; Camalan, S.; Li, R.; Pauca, V.P.; Alqahtani, S.; Plemmons, R.; Silman, M.; Dethier, E.N.; Lutz, D.; Chan, R. Semi-supervised change detection of small water bodies using RGB and multispectral images in peruvian rainforests. In Proceedings of the 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Rome, Italy, 13–16 September 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar]
  36. Tsaregorodtsev, A.; Garonne, V.; Stokes-Rees, I. DIRAC: A scalable lightweight architecture for high throughput computing. In Proceedings of the Fifth IEEE/ACM International Workshop on Grid Computing, Pittsburgh, PA, USA, 8 November 2004; IEEE: New York, NY, USA, 2004; pp. 19–25. [Google Scholar]
  37. Guppy, L.; Mehta, P.; Qadir, M. Sustainable development goal 6: Two gaps in the race for indicators. Sustain. Sci. 2019, 14, 501–513. [Google Scholar] [CrossRef]
  38. Cosgrove, W.J.; Loucks, D.P. Water management: Current and future challenges and research directions. Water Resour. Res. 2015, 51, 4823–4839. [Google Scholar] [CrossRef]
  39. Camalan, S.; Cui, K.; Pauca, V.P.; Alqahtani, S.; Silman, M.; Chan, R.; Plemmons, R.J.; Dethier, E.N.; Fernandez, L.E.; Lutz, D.A. Change detection of amazonian alluvial gold mining using deep learning and sentinel-2 imagery. Remote Sens. 2022, 14, 1746. [Google Scholar] [CrossRef]
  40. Yao, M.; Huo, Y.; Tian, Q.; Zhao, J.; Liu, X.; Wang, R.; Xue, L.; Wang, H. FMRFT: Fusion mamba and DETR for query time sequence intersection fish tracking. arXiv 2024, arXiv:2409.01148. [Google Scholar]
Figure 1. Correlation matrix with 3 shifted days.
Figure 2. Fitting regression relationship between 4 indicators.
Figure 3. Integrated KAN-LSTM model.
Figure 4. Experimental results: (a) reaction tank pH; (b) final sedimentation basin outflow SS; (c) treated water transparency; (d) treated water T-N.
Figure 5. Experimental results: (a) model performance comparison for pH; (b) model performance comparison for FOss; (c) model performance comparison for TWT; (d) model performance comparison for TWTN.
Table 1. Sewage treatment indicators in this study.

Item                                                    | Unit | Abbreviation
Inflow water biochemical oxygen demand                  | mg/L | IF-BOD
Mixed liquor suspended solids                           | mg/L | MLSS
Dissolved oxygen                                        | mg/L | DO
Return sludge concentration                             | mg/L | RSC
Sludge settling velocity                                | %    | SV
Reaction tank pH                                        | -    | PH
Preliminary sedimentation basin outflow suspended solid | mg/L | Foss
Final sedimentation basin outflow suspended solid       | mg/L | FOss
Inflow water volume                                     | m³   | IWV
Water temperature                                       | °C   | WT
Return sludge amount                                    | m³   | RSV
Aeration air volume                                     | Nm³  | AAV
Treated water transparency                              | cm   | TWT
Treated water biochemical oxygen demand                 | mg/L | TW-BOD
Treated water suspended solid                           | mg/L | TW-SS
Treated water total phosphorus                          | mg/L | TW-T-P
Treated water total nitrogen                            | mg/L | TW-T-N
Table 2. Optimized hyperparameters of the KAN-LSTM model.

Parameter        | Value | Description
Look-back window | 10    | Number of past time steps used for prediction
Embedding size   | 64    | Feature dimensionality for input encoding
Dense-layer size | 256   | Hidden layer size in the fully connected network
Attention heads  | 8     | Number of attention heads
Dropout rate     | 0.2   | Probability of dropping neurons to prevent overfitting
Learning rate    | 0.001 | Initial learning rate for the optimizer
Batch size       | 32    | Number of samples processed per training step
Epochs           | 200   | Total number of training iterations
Table 3. Comparison of sewage treatment data forecasting errors.

Model               | Metric | pH     | FOss   | TWT    | TWTN
Integrated KAN-LSTM | MAE    | 0.06   | 0.64   | 1.81   | 1.41
                    | RMSE   | 0.08   | 0.91   | 2.50   | 1.96
                    | R²     | 62.36% | 95.13% | 86.91% | 80.12%
LSTM [31]           | MAE    | 0.07   | 1.06   | 2.36   | 2.21
                    | RMSE   | 0.11   | 1.67   | 2.97   | 2.87
                    | R²     | 30.13% | 95.01% | 78.62% | 73.31%
KAN [32]            | MAE    | 0.11   | 1.62   | 2.30   | 2.09
                    | RMSE   | 0.19   | 1.89   | 2.78   | 2.92
                    | R²     | 37.52% | 92.37% | 65.89% | 68.91%
N-BEATS [33]        | MAE    | 0.07   | 1.05   | 1.73   | 1.85
                    | RMSE   | 0.09   | 1.85   | 2.61   | 2.17
                    | R²     | 54.64% | 95.03% | 85.92% | 79.62%
TCN [34]            | MAE    | 0.06   | 1.18   | 1.80   | 1.53
                    | RMSE   | 0.11   | 1.92   | 2.51   | 1.87
                    | R²     | 39.16% | 93.63% | 85.74% | 79.90%
Table 4. Performance comparison of different model variants across four indicators (pH, FOss, TWT, and TWTN).

Model variant       | Metric | pH     | FOss   | TWT    | TWTN
LSTM-only           | MAE    | 0.07   | 1.06   | 2.36   | 2.21
                    | RMSE   | 0.11   | 1.67   | 2.97   | 2.87
                    | R²     | 30.13% | 95.01% | 78.62% | 73.31%
LSTM + Attention    | MAE    | 0.07   | 0.92   | 2.09   | 1.86
                    | RMSE   | 0.09   | 1.30   | 2.72   | 2.42
                    | R²     | 53.67% | 95.05% | 81.45% | 76.23%
LSTM + KAN          | MAE    | 0.06   | 0.79   | 1.95   | 1.69
                    | RMSE   | 0.08   | 1.08   | 2.61   | 2.14
                    | R²     | 58.21% | 95.09% | 83.24% | 78.56%
Integrated KAN-LSTM | MAE    | 0.06   | 0.64   | 1.81   | 1.41
                    | RMSE   | 0.08   | 0.91   | 2.50   | 1.96
                    | R²     | 62.36% | 95.13% | 86.91% | 80.12%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
