Data Structure-Aware Ensemble Modeling for Time-Series Prediction: A Case Study of Sewage Generation

Lee, Jae-Sang; Kim, Chae-Ho; Shin, Jun-Hee; Kim, Dong-Ho; Shin, Dong-Chul

doi:10.3390/app16104842

by Jae-Sang Lee, Chae-Ho Kim, Jun-Hee Shin, Dong-Ho Kim and Dong-Chul Shin *

Department of Smart Environmental Studies, DNA Plus Convergence Technology Graduate School, Daejin University, Pocheon 11159, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 4842; https://doi.org/10.3390/app16104842

Submission received: 13 April 2026 / Revised: 10 May 2026 / Accepted: 12 May 2026 / Published: 13 May 2026

(This article belongs to the Section Environmental Sciences)

Abstract

This study investigates the influence of time-series data structure on model behavior in sewage generation estimation. Annual data from four regions (2017–2023) were analyzed using Random Forest (RF) and Voting Regressor (VR), while structural characteristics were quantified using autocorrelation, mean absolute change rate, and coefficient of variation. The results indicate that model performance varies depending on data structure. Regions with stronger temporal dependency showed more stable model responses, whereas regions with weaker structural consistency exhibited greater variability in outputs. RF tended to be sensitive to localized fluctuations, leading to region-specific variability, while VR maintained more consistent results by reducing individual model bias and variance. These findings demonstrate that model outcomes are influenced not only by algorithmic design but also by the structural properties of the data, emphasizing the importance of incorporating data characteristics into model selection for sewage generation analysis.

Keywords: sewage generation; time-series structure; random forest; voting regressor; autocorrelation

1. Introduction

Sewage generation serves as a fundamental indicator in the design, operation, and maintenance planning of wastewater treatment facilities. Accurate quantification of sewage generation is essential for evaluating the adequacy of treatment capacity and for establishing mid- to long-term infrastructure planning strategies. Traditionally, the unit-load approach, which is based on population and per capita generation coefficients, has been widely adopted due to its simplicity and institutional standardization [1,2]. However, this method has inherent limitations, as it fails to adequately capture complex variations arising from regional economic activities, industrial structures, lifestyle patterns, and environmental conditions [2,3].

With the increasing availability of public data and advancements in computational capabilities, data-driven approaches incorporating diverse socio-economic variables have gained significant attention [4,5,6]. In particular, tree-based models such as Random Forest (RF) and ensemble techniques have demonstrated superior performance in estimating sewage generation [6,7,8], as they can effectively capture nonlinear relationships among variables. These approaches are noteworthy in that they move beyond reliance on empirical coefficients and instead learn underlying relationships directly from data.

Despite these advancements, most previous studies have primarily focused on improving model performance, while relatively limited attention has been given to understanding why model performance varies across regions [9], even when the same modeling framework is applied. In practice, sewage generation time series exhibit region-specific characteristics, including differences in variability, trend behavior, and structural patterns.

In addition to regional differences, sewage generation time series are inherently influenced by various temporal and environmental factors. These include diurnal patterns driven by human activity cycles, seasonal variations associated with climate and water usage, and fluctuations arising from mixed industrial–commercial sewer systems. Furthermore, in combined sewer systems, rainfall events can significantly alter flow characteristics, leading to abrupt increases in sewage volume. Such multi-scale variability introduces complexity into time-series patterns and highlights the need to consider structural characteristics when analyzing and modeling sewage generation.

In this study, “structural consistency” refers to the degree to which a time series exhibits stable and predictable temporal patterns over time. It is characterized by the persistence of temporal dependency (e.g., autocorrelation) and relatively controlled variability, indicating that the underlying data structure maintains a consistent pattern rather than fluctuating irregularly. Regional differences in structural consistency suggest that model performance is influenced not only by the choice of algorithm but also by the intrinsic structural properties of the input data [10]. Nevertheless, discrepancies in model performance are often attributed to differences in model capability rather than to variations in data structure. This perspective limits the interpretability of modeling outcomes and fails to adequately address the uncertainty associated with model selection and application, despite the significant role of data structure in shaping model results [11].

To address this limitation, this study introduces a data structure-aware analytical framework that explicitly incorporates time-series structural characteristics into model evaluation. Specifically, structural indicators such as autocorrelation, variability, and pattern consistency are quantified and systematically linked to model performance. Unlike previous studies that primarily emphasize predictive accuracy, this study focuses on explaining performance variability and interpreting model behavior under different structural conditions.

Accordingly, this study aims to move beyond conventional model performance comparisons by systematically investigating the relationship between data structural characteristics and model performance in sewage generation estimation. By analyzing how structural properties influence modeling outcomes, this study evaluates whether the suitability of a model depends on time-series attributes such as pattern regularity, variability, and structural stability. Ultimately, this work seeks to establish a consistent analytical framework that integrates both data characteristics and model behavior, thereby enabling more informed and context-sensitive model selection.

2. Materials and Methods

2.1. Study Area and Data Structure

The analysis was conducted for four regions (A–D) to investigate the relationship between the time-series structure of regional sewage generation and model performance. Each region exhibits distinct characteristics in terms of population size, level of economic activity, and waste generation patterns, which in turn influence the temporal variability of sewage generation.

The dataset used in this study consists of annual time-series data spanning from 2017 to 2023. The dependent variable was defined as sewage generation, representing the annual volume of sewage produced in each region.

The explanatory variables were selected in accordance with those used in previous studies, including total population, number of business establishments, economically active population, waste generation, and gross regional domestic product (GRDP) [4]. These variables reflect key socio-economic factors that influence sewage generation across regions. The data used in this study were obtained from the Korean Statistical Information Service (KOSIS).

By maintaining the same variable definitions and composition as in prior studies, this study ensures the reproducibility of the modeling framework and enables a consistent comparative analysis of model performance across regions under identical input conditions.

2.2. Ensemble Modeling Framework

Based on the sewage generation estimation model proposed in previous studies, the same modeling framework was reimplemented in this study [4], while certain components were adjusted to align with the objectives of the present analysis. Figure 1 presents a conceptual overview, while detailed model configurations are provided in Table 1.

Random Forest (RF) is an ensemble learning method based on multiple decision trees, capable of effectively capturing nonlinear relationships between input variables and the target variable [6]. RF employs a bootstrap aggregating (bagging) strategy to generate multiple subsets of training data, from which independent decision trees are constructed. The final output is obtained by averaging the predictions of individual trees. This process reduces the variance of individual models and mitigates overfitting, making RF particularly suitable for environmental data analysis characterized by complex nonlinear structures. In addition, RF introduces randomness in the feature selection process, thereby reducing dependency on specific variables and enabling the exploration of diverse feature combinations.

The Voting Regressor is an ensemble technique that combines the outputs of multiple regression models to produce a final estimate, with the objective of reducing individual model bias and improving overall generalization performance. Typically, the voting approach aggregates predictions through simple or weighted averaging, thereby limiting the dominance of any single model while incorporating the strengths of different learners. In this study, the Voting Regressor was constructed by integrating tree-based models with distinct learning mechanisms, including Random Forest (RF), Extra Trees Regressor (ET), and Gradient Boosting Regressor (GBR) [6,12,13]. While RF and ET contribute to variance reduction through randomization, GBR iteratively learns residuals to progressively correct estimation errors. The integration of these complementary learning characteristics enhances the model’s ability to capture diverse data patterns and provides a more robust estimation performance compared to individual models.
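
The combination described above can be sketched with scikit-learn's `VotingRegressor`. This is a minimal illustration under stated assumptions, not the configuration used in this study (that is listed in Table 1): equal weights, `n_estimators=50`, the seed, the helper name `build_voting_regressor`, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)


def build_voting_regressor(seed: int = 0) -> VotingRegressor:
    """Equal-weight average of RF, ET and GBR predictions (assumed weights)."""
    estimators = [
        ("rf", RandomForestRegressor(n_estimators=50, random_state=seed)),
        ("et", ExtraTreesRegressor(n_estimators=50, random_state=seed)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=seed)),
    ]
    return VotingRegressor(estimators=estimators)


# Synthetic stand-in for one region: 7 annual rows, 5 explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 5))
y = 100.0 + 10.0 * X[:, 0] + rng.normal(scale=0.5, size=7)

model = build_voting_regressor().fit(X, y)
pred = model.predict(X)  # in-sample fitted values, one per year

# The ensemble output is the plain mean of the three fitted base learners.
base_mean = np.mean([est.predict(X) for est in model.estimators_], axis=0)
```

Because the final estimate is an average, no single learner can dominate it, which is the mechanism the text credits for the reduced sensitivity of the voting approach.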

2.3. Data Structural Characteristics and Quantification

This study extends beyond simple model performance comparison by quantitatively evaluating the structural characteristics of sewage generation time-series data and their influence on model performance. To this end, temporal patterns and variability were assessed using autocorrelation, mean absolute change rate, and the coefficient of variation (CV) for each regional time series, enabling a multidimensional interpretation of data structure (Figure 2).

While previous studies have primarily focused on the relationship between input variables and model performance, this study emphasizes that model performance may vary depending on the intrinsic characteristics of the data, even under an identical modeling framework. Accordingly, time-series structural indicators were introduced as complementary metrics to support the interpretation of model results.

2.3.1. Autocorrelation

Autocorrelation is a metric that quantifies the influence of past values on current values in a time series and is used to evaluate temporal continuity and pattern persistence [14]. In this study, the lag-1 autocorrelation coefficient was calculated to compare the continuity of sewage generation across regions.

A high autocorrelation value indicates that the data follow a consistent and stable pattern over time, suggesting that machine learning models can effectively learn current values based on historical information. In contrast, low or negative autocorrelation implies increased irregularity in the time series, which may increase the difficulty of model learning [14,15].
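
As an illustration, the lag-1 coefficient can be computed as below. The text does not state which estimator was used, so this sketch assumes the standard sample autocorrelation (deviations from the overall mean, normalized by the total sum of squares); the function name and the example series are hypothetical.

```python
from statistics import fmean


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation using the overall mean of the series."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = fmean(series)
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Products of each deviation with the deviation one step earlier.
    num = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return num / denom


trending = [1, 2, 3, 4, 5, 6, 7]       # steady trend, positive coefficient
alternating = [1, -1, 1, -1, 1, -1]    # sign flips every step, negative coefficient
r_trend = lag1_autocorrelation(trending)
r_alt = lag1_autocorrelation(alternating)
```

With only seven annual observations, as in this study, the estimate rests on six consecutive pairs, so it is best read as a coarse indicator of continuity rather than a precise quantity.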

2.3.2. Mean Absolute Change Rate

The mean absolute change rate quantifies the magnitude of variation between consecutive observations in a time series and is used to evaluate the intensity of data fluctuations [15]. It is calculated by taking the absolute differences between successive time steps and computing their average.

As this metric reflects the magnitude of change regardless of direction, it effectively captures how rapidly the time series varies. A higher mean absolute change rate indicates greater short-term variability and increased irregularity in the data structure, which may hinder the model’s ability to learn stable patterns and contribute to increased uncertainty in the results [15,16].
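
A short sketch of this calculation follows. The text describes averaging absolute differences between successive time steps but does not say whether the differences are normalized, so both forms are shown: one in data units, and a relative variant that divides each difference by the preceding value. The function names and the example values are assumptions made for illustration.

```python
from statistics import fmean


def mean_absolute_change(series):
    """Average of |x_t - x_(t-1)| over consecutive observations, in data units."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return fmean([abs(curr - prev) for prev, curr in zip(series, series[1:])])


def mean_absolute_change_rate(series):
    """Relative variant: average of |x_t - x_(t-1)| / |x_(t-1)| (dimensionless)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    if any(prev == 0 for prev in series[:-1]):
        raise ValueError("relative change is undefined after a zero value")
    return fmean(
        [abs(curr - prev) / abs(prev) for prev, curr in zip(series, series[1:])]
    )


flows = [100.0, 110.0, 99.0]  # hypothetical annual volumes
change = mean_absolute_change(flows)     # (10 + 11) / 2 = 10.5
rate = mean_absolute_change_rate(flows)  # (0.10 + 0.10) / 2, about 0.1
```

The relative form is scale-free, which makes it easier to compare regions of very different size.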

2.3.3. Coefficient of Variation (CV)

The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean and represents the relative variability of a dataset. In this study, the CV was calculated for the sewage generation time series in each region to quantitatively assess the magnitude of variation relative to the mean.

As a unitless metric, CV is particularly useful for comparing variability across regions with different scales. A higher CV indicates that the data are widely dispersed around the mean, reflecting a structure with high variability, which may hinder the model’s ability to consistently learn overall patterns. In contrast, a lower CV suggests that the data are more stably distributed around the mean, under which conditions the model is more likely to achieve improved fitting performance [14,15,16].
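
The ratio can be sketched as follows. Whether the sample or the population standard deviation was used is not stated in the text, so the `sample` flag, like the function name and the example values, is an assumption of this illustration.

```python
from statistics import fmean, pstdev, stdev


def coefficient_of_variation(series, sample=True):
    """Standard deviation divided by the absolute mean (dimensionless)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    mean = fmean(series)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    sd = stdev(series) if sample else pstdev(series)
    return sd / abs(mean)


small_region = [2.0, 4.0, 6.0]           # hypothetical volumes
large_region = [2000.0, 4000.0, 6000.0]  # same shape, 1000 times larger
cv_small = coefficient_of_variation(small_region)
cv_large = coefficient_of_variation(large_region)
```

Both series give the same CV even though their scales differ by a factor of 1000, which is the property that makes the metric suitable for cross-regional comparison.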

2.4. Comparative Analysis Between Structural Characteristics and Model Performance

To validate model performance under the limited annual dataset, an expanding-window validation strategy was applied. In this procedure, the model was trained using only past observations and evaluated on the subsequent unseen observation. Specifically, the initial training window was set to three years, and the window was sequentially expanded by one year to estimate the next-year sewage generation. This approach prevents data leakage from future observations and provides an out-of-sample evaluation framework suitable for short annual time series.
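
The splitting scheme described above can be sketched as a small generator. This is one reading of the procedure, assuming seven annual observations (2017 to 2023), a three-year initial window, and a one-step-ahead test point; the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_obs, initial_train=3):
    """Yield (train_indices, test_index) pairs for one-step-ahead evaluation."""
    if not 1 <= initial_train < n_obs:
        raise ValueError("initial_train must be at least 1 and less than n_obs")
    for end in range(initial_train, n_obs):
        # Train on everything strictly before `end`, test on `end` itself.
        yield list(range(end)), end


years = list(range(2017, 2024))
splits = list(expanding_window_splits(len(years), initial_train=3))
for train_idx, test_idx in splits:
    first, last = years[train_idx[0]], years[train_idx[-1]]
    print(f"train {first}-{last} -> test {years[test_idx]}")
# train 2017-2019 -> test 2020
# train 2017-2020 -> test 2021
# train 2017-2021 -> test 2022
# train 2017-2022 -> test 2023
```

Under these assumptions each region yields four out-of-sample estimates, and no fold ever sees an observation later than the one it is asked to estimate.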

To analyze the relationship between time-series structural characteristics and model performance, a comparative analysis procedure was established by integrating structural indicators with performance evaluation metrics. For each region, the structural characteristics of sewage generation time series were quantified using the autocorrelation coefficient, mean absolute change rate, and coefficient of variation (CV). Subsequently, Random Forest (RF) and Structure-Aware Voting Regressor (VR_SA) models were applied to the same dataset to generate fitted values, and model performance was evaluated from multiple perspectives.

Model performance was assessed using multiple evaluation metrics rather than relying on a single indicator, including the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) [17,18]. The coefficient of determination was used to measure the explanatory power of the model, while RMSE and MAE were employed to quantify absolute error levels [19]. MAPE was additionally used to reflect relative error, allowing performance comparisons that account for differences in scale across regions [20]. This multi-metric evaluation approach was adopted to prevent biased interpretations based on a single metric and to provide a more comprehensive assessment of model performance.
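
For reference, the four metrics can be computed from paired observed and estimated values as in the sketch below. It uses the usual textbook definitions, which are assumed to match those applied in this study; the function name and the example values are hypothetical.

```python
from math import sqrt


def regression_metrics(observed, estimated):
    """Return R2, RMSE, MAE and MAPE (in percent) for paired values."""
    n = len(observed)
    if n == 0 or n != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    errors = [e - o for o, e in zip(observed, estimated)]
    mean_obs = sum(observed) / n
    ss_res = sum(err * err for err in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when observed values are constant")
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": sqrt(ss_res / n),
        "MAE": sum(abs(err) for err in errors) / n,
        "MAPE": 100.0 * sum(abs(err) / abs(o) for err, o in zip(errors, observed)) / n,
    }


observed = [100.0, 200.0, 300.0]   # hypothetical values
estimated = [110.0, 190.0, 300.0]
metrics = regression_metrics(observed, estimated)
# R2 = 0.99, RMSE = sqrt(200 / 3), MAE = 20 / 3, MAPE = 5.0
```

RMSE and MAE are expressed in the units of the data (m³ here), whereas R² and MAPE are dimensionless, which is why the latter two are the ones that can be compared directly across regions of different size.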

The calculated structural and performance indicators were compared on a regional basis, enabling a quantitative analysis of how autocorrelation, variability intensity, and relative variability influence model performance. In particular, by examining the relationships between structural indicators and performance metrics, the effects of temporal patterns on model fit and error characteristics were systematically evaluated.

All computational analyses were performed using Python 3.11.7 (Python Software Foundation, Wilmington, DE, USA). Machine learning models, including Random Forest, Extra Trees Regressor, Gradient Boosting Regressor, and Voting Regressor, were implemented using scikit-learn version 1.4.2 (scikit-learn developers, open-source project). Data preprocessing and numerical analyses were conducted using pandas version 2.2.2 and NumPy version 1.26.4, while graphical outputs were generated using Matplotlib version 3.8.4.

3. Results

3.1. Model Performance Across Structural Differences

3.1.1. Coefficient of Determination (R²)

In this study, the performance of Random Forest (RF) and Structure-Aware Voting Regressor (VR_SA) models for regional sewage generation estimation was compared, with particular emphasis on the consistency of interpretation across regions. Model performance was evaluated using the coefficient of determination (R²), and hyperparameters were selected not to maximize performance in a specific region but to ensure consistent interpretability across regions. The dataset and preprocessing procedures used in this study are identical to those applied in the previous study, ensuring that performance differences are attributable to the modeling framework rather than data handling.

The Random Forest model exhibited notable variability in performance across regions. In Region A, the R² value was approximately 0.81, whereas in Regions B, C, and D, it increased to approximately 0.95–0.96. This result indicates that, even under an identical model structure, model fit can vary depending on the structural characteristics of the data. RF is inherently responsive to data patterns, leading to high performance in certain regions but reduced performance in others, thereby revealing structural limitations in generalizability [21].

In contrast, the Voting Regressor produced consistent R² values of approximately 0.96 across all regions, with minimal performance variation. This outcome is not merely indicative of improved accuracy but rather reflects the suppression of performance variability due to the ensemble structure. The Voting model employed in this study integrates multiple tree-based learners with different learning mechanisms, enabling complementary interactions between bias and variance. As a result, sensitivity to specific data structures is reduced, and stable estimation performance is maintained across regions [21,22].

This consistency was deliberately targeted during hyperparameter tuning. As shown in Table 2, repeated experiments were conducted under various configurations, and the final setting was selected for its ability to produce similar R² values across all regions, rather than to maximize performance in a single region [23]. This reflects the primary objective of the study, which is to establish a consistent analytical framework that allows for meaningful inter-regional comparison.

To further examine the reliability of the model outputs, a year-by-year comparison between observed and estimated sewage generation values was additionally performed. As shown in Table 1, the estimated values closely followed the observed values across all regions, with relative errors generally remaining below 1%. This close agreement supports the high R² values reported in this study and indicates that the proposed ensemble framework effectively captured the underlying temporal relationships within the available dataset.

The findings suggest that model selection for sewage generation analysis should move beyond a purely accuracy-driven approach and instead incorporate considerations of data structure and interpretability. A model with the highest R² is not necessarily the most appropriate; rather, model stability and interpretability should be jointly considered depending on the analytical objective [23,24].

3.1.2. Root Mean Square Error (RMSE)

To quantitatively evaluate the magnitude of estimation errors in regional sewage generation, the root mean square error (RMSE) was applied. As shown in Table 3, the Structure-Aware Voting Regressor (VR_SA) consistently exhibited lower RMSE values than the Random Forest (RF) model across all regions. In Region A, the RMSE of RF was approximately 26,700 m³, whereas VR_SA reduced this value to approximately 12,300 m³, a substantial decrease in error magnitude. A similar trend was observed in Region B, where VR_SA maintained lower RMSE values than RF. The same pattern held in Regions C and D, where VR_SA achieved RMSE values of approximately 1950.5 m³ and 1500.8 m³, respectively, both lower than those of RF.

When compared with previous studies, differences in the absolute magnitude of RMSE were observed in certain regions. In Regions C and D, the Voting model in previous studies exhibited relatively lower RMSE values, whereas slightly higher values were obtained in this study. However, these differences should not be interpreted as a simple decline in model performance, but rather as a result of differences in data configuration and modeling frameworks [4,25]. While previous studies derived performance based on region-specific optimized settings, this study applied a unified model structure and configuration across all regions to ensure comparability.

The Voting Regressor demonstrated stable RMSE values within a consistent range across regions, with relatively smaller variations in error magnitude compared to RF. This suggests that the multi-ensemble structure effectively reduces excessive sensitivity to region-specific data characteristics and contributes to a more balanced error distribution [25,26]. Therefore, the RMSE results in this study reflect not merely the minimization of absolute error, but the effectiveness of a modeling strategy designed to achieve consistent performance under diverse regional conditions. These findings highlight the importance of selecting models that prioritize robustness and interpretability across regions, rather than optimizing performance for a single region [26].

3.1.3. Mean Absolute Error (MAE)

To evaluate the average magnitude of estimation error, the mean absolute error (MAE) was applied. As shown in Table 4, the Structure-Aware Voting Regressor (VR_SA) consistently produced lower MAE values than the Random Forest (RF) model across all regions. In Region A, the MAE of RF was approximately 21,400 m³, whereas VR_SA reduced this value to approximately 9800 m³, a substantial reduction in average error. This improvement reflects not a reduction at specific time points, but a consistent decrease in deviations across the entire time series.

A similar trend was observed in Region B, where VR_SA achieved lower MAE values than RF. Although the difference between the two models was less pronounced than in RMSE, the reduction remained consistent. In Regions C and D, VR_SA recorded MAE values of approximately 1560.4 m³ and 1200.6 m³, respectively, maintaining lower average error levels than RF. In Region D, the absolute difference between the models was relatively small, yet the consistent effect of the model structure was still evident.

Compared with previous studies, the MAE values obtained in this study were generally lower across all regions. In Region A, the MAE values of RF and Voting models in previous studies were approximately 37,028.3 m³ and 18,514.2 m³, respectively, whereas in this study, they were reduced to 21,400 m³ and 9800 m³. A similar pattern was observed in Region B, where MAE values decreased to approximately half of those reported in previous studies. This decreasing trend was consistently observed in Regions C and D as well.

These results indicate not only an improvement in model performance, but also enhanced stability of estimation across the entire time series. The reduction in MAE suggests that the overall estimation accuracy has been structurally improved rather than being driven by localized improvements.

This improvement can be attributed to the modeling configuration and learning strategy employed in this study. The multi-ensemble structure of the Voting Regressor integrates models with different learning characteristics, thereby reducing individual model bias and enabling more balanced estimation across diverse data patterns [27]. Furthermore, by applying a unified model structure across all regions, the framework avoids overfitting to specific regions and maintains a stable level of average error across different regional conditions [27,28].

3.1.4. Mean Absolute Percentage Error (MAPE)

To evaluate relative estimation error, the mean absolute percentage error (MAPE) was applied [29]. As shown in Table 5, the MAPE values ranged from approximately 1% to 5% across all regions. The Structure-Aware Voting Regressor (VR_SA) exhibited MAPE values lower than or comparable to those of the Random Forest (RF) model in all regions, indicating more stable performance in terms of relative error.

In Region A, the Voting model achieved a MAPE of 1.63%, showing a substantial reduction in relative error compared to RF. A similar improvement was observed in Region B, where the MAPE decreased to 2.19%. In Regions C and D, although the differences between the two models were relatively small in terms of absolute error metrics (RMSE and MAE), the Voting Regressor consistently maintained lower relative error values, with MAPE values of 4.40% and 4.13%, respectively.

The MAPE values obtained in this study were higher than those reported in previous studies, which ranged from 0.31% to 0.56%. However, this difference should not be interpreted as a decline in model performance, but rather as a result of differences in data configuration and modeling strategies. Previous studies optimized model settings for individual regions to maximize performance under specific conditions, whereas this study applied a unified modeling framework across all regions to ensure comparability and consistency in interpretation.

Overall, although the MAPE values in this study are relatively higher, this can be attributed to the inherent sensitivity of the metric. Since MAPE is influenced by the magnitude of observed values [29,30], regions with smaller flow volumes tend to exhibit higher relative errors even when absolute errors are similar. In Regions C and D, despite comparable RMSE and MAE values, MAPE values were relatively higher, which can be reasonably explained by the characteristics of the metric.
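
The scale effect can be made concrete with a small worked example. The volumes and the error below are hypothetical and are not taken from Tables 3–5; they only show how the same absolute error translates into different percentage errors.

```python
def absolute_percentage_error(observed, estimated):
    """Absolute error as a percentage of the observed value."""
    if observed == 0:
        raise ValueError("percentage error is undefined for a zero observation")
    return 100.0 * abs(estimated - observed) / abs(observed)


same_abs_error = 1_500.0  # identical absolute error in m3 (hypothetical)
ape_large = absolute_percentage_error(600_000.0, 600_000.0 + same_abs_error)
ape_small = absolute_percentage_error(30_000.0, 30_000.0 + same_abs_error)
print(f"large region: {ape_large:.2f}%")  # large region: 0.25%
print(f"small region: {ape_small:.2f}%")  # small region: 5.00%
```

An error of 1,500 m³ is negligible relative to 600,000 m³ but amounts to 5% of 30,000 m³, so a smaller region can show a higher MAPE without any loss of absolute accuracy.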

These findings indicate that the proposed modeling framework prioritizes stable and consistent performance across regions rather than minimizing relative error for specific cases, thereby supporting its applicability under diverse regional conditions.

3.2. Relationship Between Data Structure and Model Performance

To analyze the relationship between the structural characteristics of sewage generation time series and model performance, the autocorrelation coefficient, mean absolute change rate, and coefficient of variation (CV) were calculated and compared with model performance results [31,32]. As shown in Figure 3 and Table 6, each region exhibits distinct time-series structures, and these structural differences directly influence model performance [33].

Based on the lag-1 autocorrelation coefficient, Region A shows a value of −0.041, which is close to zero, indicating a lack of temporal continuity in the time series. In contrast, Regions B, C, and D exhibit relatively high values of 0.776, 0.833, and 0.870, respectively, suggesting strong temporal dependency in which past values significantly influence current values. As illustrated in Figure 3, this pattern is consistently observed, with Region A lacking identifiable temporal patterns, whereas Regions B–D demonstrate clear time-series continuity.

Structural differences are also evident in terms of the mean absolute change rate and coefficient of variation. Region A shows relatively low values (0.026 for both indicators), which may not indicate stability but rather reflect irregular fluctuations without consistent patterns. In contrast, Regions C and D exhibit higher change rates (0.032 and 0.043) and CV values (0.064 and 0.072), indicating higher variability combined with preserved temporal patterns. Figure 3 further confirms that variability and pattern persistence coexist in these regions [34].

These structural characteristics are directly associated with the performance of the Random Forest (RF) model. In Region A, where autocorrelation is low, the coefficient of determination (R²) of the RF model is relatively low, reflecting the absence of learnable temporal patterns. Conversely, in Regions B–D, higher autocorrelation enables the RF model to capture underlying patterns more effectively [35], resulting in improved model fit. Notably, Regions C and D maintain high performance despite higher variability, indicating that the presence of temporal patterns has a more significant influence on model performance than variability alone.

In contrast, the Structure-Aware Voting Regressor (VR_SA) produces similar R² values across all regions, showing minimal sensitivity to structural differences. This is because the multi-ensemble structure averages the responses of individual models, thereby reducing sensitivity to data structure [35]. While this leads to more stable performance across regions, it also limits the model’s ability to distinguish performance differences arising from structural variability.

Overall, the structural characteristics of time-series data play a critical role in determining model performance. In particular, the presence of temporal patterns, as reflected by autocorrelation, emerges as a key factor influencing model fit.

3.3. Model Performance Evaluation Based on Observed–Estimated Relationships

Figure 4 illustrates the relationship between observed sewage generation and the corresponding estimates produced by each model across regions. Each point represents annual data for a given region, and the reference line represents agreement between observed and estimated values. This visualization enables not only the assessment of error magnitude but also the evaluation of error direction and distribution patterns [36]. In this context, the comparison is based on relative patterns of agreement rather than absolute magnitudes, and no additional scaling or normalization was applied, as both observed and estimated values are expressed in the same units and directly comparable.

3.3.1. Scatter Distribution Between Observed and Estimated Values

As shown in Figure 4a–d, a generally linear relationship is observed between the observed and estimated values across all regions, with most data points distributed around the 1:1 reference line. This indicates that the models effectively capture the overall magnitude and long-term variation in sewage generation.

In Regions A and B (Figure 4a,b), the data exhibit a wider range and greater dispersion, with points distributed over a relatively broad area. In contrast, Regions C and D (Figure 4c,d) show a narrower range of annual variation, with points concentrated within a limited range around the reference line. This pattern is associated with smaller-scale regions, where relative deviations tend to appear more pronounced.

Additionally, in certain periods, clusters of points are observed either above or below the reference line, suggesting the presence of systematic bias in specific time intervals. These distributional characteristics provide intuitive insights into the overall fitting structure of the models, which cannot be fully captured by conventional error metrics alone [37].

3.3.2. Effect of Model Structure on Distribution Characteristics

As shown in Figure 4a–d, Random Forest (RF) and Structure-Aware Voting Regressor (VR_SA) exhibit distinct distributional characteristics across all regions. The RF model shows relatively large deviations from the 1:1 reference line at certain points, with estimates tending to be dispersed in both overestimation and underestimation directions. This behavior is associated with the model’s sensitivity to localized fluctuations in the data [38].

In contrast, the Voting Regressor demonstrates a more stable distribution pattern, with data points consistently clustered around the reference line across all panels. Even in Regions C and D, where the data scale is relatively small, the spread of the distribution remains limited, and large deviations in a specific direction occur less frequently.

The Voting model in this study integrates multiple learners with different learning mechanisms (RF, ET, and GBR; Table 1), which reduces the bias and variance inherent in individual models [39]. As a result, the model tends to avoid excessive responses to short-term fluctuations and instead maintains a balanced distribution across the entire range.
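The weighted combination of base learners can be sketched with scikit-learn's `VotingRegressor`. The sketch below takes the base learners, weights, and the RF and GB settings from Table 1; the single year-index feature and the default Extra Trees settings are assumptions made only for illustration, since the study's actual inputs are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

# Hypothetical input: year index as the only feature (an assumption, not the
# study's feature set). Targets are the Region A values printed in Table 2.
X = np.arange(7, dtype=float).reshape(-1, 1)
y = np.array([4.206, 4.227, 3.948, 4.180, 4.203, 4.238, 4.280])

WEIGHTS = [0.4, 0.3, 0.3]  # RF, ET, GB weights as listed in Table 1

rf = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=42)
et = ExtraTreesRegressor(random_state=42)  # ET settings are assumed defaults
gb = GradientBoostingRegressor(learning_rate=0.03, max_depth=2, random_state=42)

vr_sa = VotingRegressor(
    estimators=[("rf", rf), ("et", et), ("gb", gb)],
    weights=WEIGHTS,
)
vr_sa.fit(X, y)

# The ensemble output is the weighted average of the fitted base learners.
member_preds = np.array([m.predict(X) for m in vr_sa.estimators_])
manual_average = np.average(member_preds, axis=0, weights=WEIGHTS)
ensemble_preds = vr_sa.predict(X)
```

Because each prediction is an average over three learners, an extreme response from any one of them is damped in the combined output, which is the stabilizing behavior discussed above.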

3.3.3. Comparison with the Previous Study

Both the previous study [4] and the present study exhibit a generally linear relationship between observed and estimated values. However, the distribution patterns of the data points differ. In the previous study, the estimates closely align with the reference line in certain years, indicating a tendency to capture specific observations with high precision.

In contrast, the results of this study show a distribution in which deviations are consistently maintained within a certain range across all observations, rather than achieving exact agreement at specific time points. This reflects a modeling strategy designed to suppress overfitting to individual years and to maintain a consistent estimation structure throughout the entire time period [40]. While the differences between the two approaches are relatively small in Regions A and B, where the data range is larger, more pronounced differences in distribution patterns are observed in Regions C and D. In these regions, the models employed in this study limit the spread of estimates and maintain a stable distribution within a bounded range.

Overall, the joint analysis of Random Forest (RF) and Structure-Aware Voting Regressor (VR_SA) in this study allows for a direct comparison of model structure effects, whereas the previous study primarily focused on the Voting model, resulting in differences in the scope of comparison. These discrepancies can be attributed to differences in model configuration and analytical objectives. For time-series data such as sewage generation, where structural characteristics play a critical role, it is important to consider not only point-wise accuracy but also the consistency of model behavior across the entire time series.

4. Conclusions

This study applied machine learning models to regional sewage generation time series and examined the relationship between time-series structure and model behavior. Rather than focusing solely on performance comparison among models, the analysis emphasized how different models respond to variations in data structure. The results indicate that even when the same model is applied, learning outcomes and output patterns can vary depending on the structural characteristics of the input data [41]. When the time series exhibits consistent patterns and strong autocorrelation, the model tends to form relatively stable relationships, whereas increased variability or weak structural consistency leads to greater fluctuations in model outputs. These findings demonstrate that model results are determined not only by the algorithm itself but also by the underlying data structure. Furthermore, models with different learning mechanisms exhibit distinct output characteristics even when applied to the same dataset. Single-structure models tend to respond sensitively to localized variations, producing locally fitted results, whereas ensemble-based models mitigate individual model bias and maintain a more balanced output structure across the entire time series [42,43]. This difference highlights the apparent influence of model architecture on output behavior.

These findings suggest that evaluating models based solely on conventional performance metrics is insufficient when dealing with time-series data such as sewage generation. Even when models achieve similar levels of accuracy, differences in output stability and distribution patterns may influence both interpretation and practical application. From a practical perspective, the results of this study provide guidance for wastewater engineers in selecting appropriate modeling approaches. When the time series exhibits strong autocorrelation and stable temporal patterns, models such as Random Forest can effectively capture underlying relationships. In contrast, when the data show irregular fluctuations or weak structural consistency, ensemble-based approaches such as the Voting Regressor are more suitable, as they provide relatively stable outputs. Therefore, preliminary analysis of structural characteristics—including autocorrelation, variability, and pattern consistency—can serve as an effective decision-support tool for model selection in real-world wastewater management.
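Such a preliminary structural analysis needs only a few lines of code. The sketch below assumes common definitions for the three indicators: lag-1 autocorrelation as the Pearson correlation between the series and its one-step shift, mean absolute change rate as the average relative year-on-year change, and coefficient of variation as the sample standard deviation over the mean. These forms are assumptions for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Region A observed sewage generation, 2017-2023, in 10^6 m^3/year (Table 2).
series = [4.206, 4.227, 3.948, 4.180, 4.203, 4.238, 4.280]


def lag1_autocorrelation(x):
    """Pearson correlation between the series and itself shifted by one step."""
    a, b = x[:-1], x[1:]
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def mean_absolute_change_rate(x):
    """Average of |x_t - x_(t-1)| / x_(t-1) over consecutive years."""
    return mean(abs(curr - prev) / prev for prev, curr in zip(x[:-1], x[1:]))


def coefficient_of_variation(x):
    """Sample standard deviation divided by the mean."""
    return stdev(x) / mean(x)


r1 = lag1_autocorrelation(series)
change_rate = mean_absolute_change_rate(series)
cv = coefficient_of_variation(series)

print(f"lag-1 autocorrelation: {r1:.3f}")
print(f"mean absolute change rate: {change_rate:.3f}")
print(f"coefficient of variation: {cv:.3f}")
```

For Region A, these assumed definitions give values close to the corresponding row of Table 6, although exact agreement is not guaranteed because the inputs are rounded to three decimals.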

One limitation of this study is the small number of observations available for the time-series structure analysis. Because the analysis was based on annual data from 2017 to 2023, only seven data points were available for each region, which may constrain the statistical robustness of structural metrics such as autocorrelation and variability. Accordingly, the results of the time-series structure analysis should be interpreted as exploratory, and the observed differences between models as indicative trends rather than definitive evidence of performance superiority, particularly given the absence of formal statistical testing. Future research should incorporate data of higher temporal resolution, such as monthly or daily observations, to enable more rigorous and reliable structural analysis. Additional validation approaches such as leave-one-out cross-validation (LOOCV) may also be considered to assess model robustness under extremely limited sample conditions. It should be emphasized that the findings of this study are not intended for direct application in revising established engineering design standards or safety factors. Validation over longer temporal scales and more extensive datasets is required before such practical implementation can be considered.
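To make the suggested leave-one-out validation concrete, the sketch below holds out each of the seven annual observations in turn. A simple linear trend is used as a stand-in model purely for illustration; it is not one of the models evaluated in this study, and the function names are hypothetical. Plain leave-one-out splitting also ignores temporal ordering, so it should be read as a robustness check rather than a forecasting test.

```python
from statistics import mean

years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
# Region B observed values in 10^6 m^3/year (Table 2).
values = [5.012, 5.233, 5.169, 5.378, 5.419, 5.462, 5.553]


def fit_linear_trend(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (stand-in model)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def loocv_absolute_errors(xs, ys):
    """Hold out each observation once, fit on the rest, score the held-out point."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_linear_trend(train_x, train_y)
        prediction = intercept + slope * xs[i]
        errors.append(abs(prediction - ys[i]))
    return errors


fold_errors = loocv_absolute_errors(years, values)
loocv_mae = mean(fold_errors)  # average held-out error, in 10^6 m^3/year

print(f"LOOCV MAE: {loocv_mae:.4f} (10^6 m^3/year)")
```

Swapping `fit_linear_trend` for any model with a fit-then-predict step gives the same seven-fold structure.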

Therefore, future studies on sewage generation should not focus solely on improving model performance but should also incorporate prior analysis of structural characteristics, such as autocorrelation, variability, and pattern consistency, and select model structures accordingly. This approach enables the establishment of an analytical framework that supports consistent interpretation and more reliable application across diverse regional conditions. Beyond its application to sewage generation, this study contributes to the broader field of time-series machine learning by demonstrating the importance of incorporating structural characteristics of the data into model evaluation. This structure-aware perspective provides a foundation for improving model interpretability and supports more informed decision-making in complex environmental systems. In this regard, the proposed framework may be extended to other domains where time-series data exhibit heterogeneous structural properties.

Author Contributions

Conceptualization, J.-S.L.; methodology, J.-S.L. and C.-H.K.; software, J.-S.L. and C.-H.K.; validation, J.-S.L., J.-H.S. and D.-C.S.; formal analysis, J.-S.L., D.-H.K., C.-H.K. and D.-C.S.; investigation, J.-S.L.; resources, D.-C.S.; data curation, J.-S.L., C.-H.K., J.-H.S. and D.-H.K.; writing—original draft preparation, J.-S.L.; writing—review and editing, J.-S.L. and D.-C.S.; visualization, J.-S.L. and J.-H.S.; supervision, D.-C.S.; project administration, D.-C.S.; funding acquisition, D.-C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Korean government (Ministry of Land, Infrastructure, and Transport’s DNA+ Convergence Technology Specialized Graduate School Development Project; Project number: RS-2023-00250434).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are not publicly available due to legal restrictions. The datasets are derived from publicly accessible sources, but processed data cannot be shared with third parties in accordance with applicable data usage regulations.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RF	Random Forest
VR	Voting Regressor
ET	Extra Trees Regressor
GBR	Gradient Boosting Regressor
R²	Coefficient of Determination
RMSE	Root Mean Square Error
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
CV	Coefficient of Variation

References

1. Tchobanoglous, G.; Stensel, H.D.; Tsuchihashi, R.; Burton, F. Wastewater Engineering: Treatment and Resource Recovery, 5th ed.; McGraw-Hill Education: New York, NY, USA, 2014.
2. Bertanza, G.; Boiocchi, R. Interpreting per capita loads of organic matter and nutrients in municipal wastewater: A study on 168 Italian agglomerations. Sci. Total Environ. 2022, 819, 153236.
3. Mesdaghinia, A.; Nasseri, S.; Mahvi, A.H.; Tashauoei, H.R.; Hadi, M. The estimation of per capita loadings of domestic wastewater in Tehran. J. Environ. Health Sci. Eng. 2015, 13, 21.
4. Lee, J.-S.; Kim, C.-H.; Shin, D.-C. Machine learning-based estimation of sewage treatment facility capacity and design adequacy: A case study in Korea. Processes 2025, 13, 3995.
5. Wan, K.-Y.; Guo, Z.-W.; Wang, J.-H.; Shen, Y.; Feng, D.; Du, B.-X.; Yu, K.-P. Deep learning-based intelligent management for sewage treatment plants. J. Cent. South Univ. 2022, 29, 1665–1676.
6. Liu, T.; Zhang, H.; Wu, J.; Liu, W.; Fang, Y. Wastewater Treatment Process Enhancement Based on Multi-Objective Optimization and Interpretable Machine Learning. J. Environ. Manag. 2024, 364, 121430.
7. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
8. Mahanna, H.; El-Rashidy, N.; Kaloop, M.R.; El-Sapakh, S.; Alluqmani, A.; Hassan, R. Prediction of wastewater treatment plant performance through machine learning techniques. Desalination Water Treat. 2024, 318, 100424.
9. Lee, J.-S.; Shin, D.-C. Prediction of waste generation using machine learning: A regional study in Korea. Urban Sci. 2025, 9, 297.
10. Willard, J.D.; Varadharajan, C.; Jia, X.; Kumar, V. Time series predictions in unmonitored sites: A survey of machine learning techniques in water resources. Environ. Data Sci. 2025, 4, e7.
11. Parmezan, A.R.S.; Souza, V.M.A.; Batista, G.E.A.P.A. Evaluation of statistical and machine learning models for time series prediction: Identifying the state-of-the-art and the best conditions for the use of each model. Inf. Sci. 2019, 484, 302–337.
12. Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 2020, 109, 1997–2028.
13. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42.
14. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
15. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2015.
16. Srivastava, S.; Wang, J.; Jiang, P. A new loss function for enhancing peak prediction in time series data with high variability. Forecasting 2025, 7, 75.
17. Chaudhari, K.; Thakkar, A. Neural network systems with an integrated coefficient of variation-based feature selection for stock price and trend prediction. Expert Syst. Appl. 2023, 219, 119527.
18. Khoshvaght, H.; Permala, R.R.; Razmjou, A.; Khiadani, M. A critical review on selecting performance evaluation metrics for supervised machine learning models in wastewater quality prediction. J. Environ. Chem. Eng. 2025, 13, 119675.
19. Hodson, T.O. Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. 2022, 15, 5481–5487.
20. Plevris, V.; Solorzano, G.; Bakas, N.P.; Ben Seghier, M.E.A. Investigation of performance metrics in regression analysis and machine learning-based prediction models. In Proceedings of the 8th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2022), Oslo, Norway, 5–9 June 2022.
21. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
22. Mienye, I.D.; Sun, Y. A survey of ensemble learning: Concepts, algorithms, applications, and prospects. IEEE Access 2022, 10, 99129–99149.
23. Schratz, P.; Muenchow, J.; Iturritxa, E.; Richter, J.; Brenning, A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Modell. 2019, 406, 109–120.
24. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316.
25. Marcinkevičs, R.; Vogt, J.E. Interpretable and explainable machine learning: A methods-centric overview with concrete examples. WIREs Data Min. Knowl. Discov. 2023, 13, e1493.
26. Uddin, S.; Lu, H. Dataset meta-level and statistical features affect machine learning performance. Sci. Rep. 2024, 14, 1670.
27. Rane, N.; Choudhary, S.; Rane, J. Ensemble deep learning and machine learning: Applications, opportunities, challenges, and future directions. Stud. Med. Health Sci. 2024, 1, 18–41.
28. Chen, Z.; Zheng, Y. RRMSE-enhanced weighted voting regressor for improved ensemble regression. PLoS ONE 2025, 20, e0319515.
29. Mahajan, P.; Uddin, S.; Hajati, F.; Moni, M.A. Ensemble learning for disease prediction: A review. Healthcare 2023, 11, 1808.
30. Kim, S.; Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 2016, 32, 669–679.
31. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
32. Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications; Springer: Berlin/Heidelberg, Germany, 2017.
33. Brown, C.E. Coefficient of Variation. In Applied Multivariate Statistics in Geohydrology and Related Sciences; Springer: Berlin/Heidelberg, Germany, 1998.
34. Bischl, B.; Mersmann, O.; Trautmann, H.; Weihs, C. Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 2012, 20, 249–275.
35. Kantz, H.; Schreiber, T. Nonlinear Time Series Analysis; Cambridge University Press: Cambridge, UK, 2004.
36. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227.
37. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
38. Taylor, K.E. Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res. 2001, 106, 7183–7192.
39. Zhang, G.; Patuwo, B.E.; Hu, M.Y. Forecasting with artificial neural networks: The state of the art. Int. J. Forecast. 1998, 14, 35–62.
40. Dietterich, T.G. Ensemble methods in machine learning. In Multiple Classifier Systems; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
41. Hawkins, D.M. The problem of overfitting. J. Chem. Inf. Comput. Sci. 2004, 44, 1–12.
42. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87.
43. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms; Chapman & Hall/CRC: Boca Raton, FL, USA, 2012.

Figure 1. Conceptual overview of the ensemble modeling framework. Detailed model configurations and training procedures are provided in Table 2 and Section 2.2.

Figure 2. Analytical framework for time-series structure using autocorrelation, change rate, and CV.

Figure 3. Comparison of structural indicators (autocorrelation, change rate, and CV) across regions.

Figure 4. Observed vs. estimated sewage generation across regions for RF and Voting Regressor.

Table 1. Comparison of model structures and hyperparameter settings between the previous study and this study.

Model	Hyperparameter	Baseline Voting Regressor (VR_Base) [4]	Structure-Aware Voting Regressor (VR_SA)
Random Forest (RF)	n_estimators	500	500
	max_depth	None	None
	min_samples_split	2	2
	min_samples_leaf	1	1
	max_features	sqrt	sqrt
	random_state	-	42
	bootstrap	True	True
Voting Regressor (VR)	Base learners	RF + LR	RF + ET + GB
	Weights	[0.6, 0.4]	[0.4, 0.3, 0.3]
	Learning strategy	Soft Voting	Ensemble averaging
	GB learning_rate	-	0.03
	GB max_depth	-	2
	random_state	-	42

Table 2. Year-by-year comparison between observed and estimated sewage generation values using the Structure-Aware Voting Regressor (VR_SA).

Region	Year	Actual SG (10⁶ m³/Year)	Structure-Aware Voting Regressor (VR_SA) SG (10⁶ m³/Year)	Error (%)
A	2017	4.206	4.203	0.07
	2018	4.227	4.221	0.38
	2019	3.948	3.980	0.82
	2020	4.180	4.162	0.43
	2021	4.203	4.207	0.10
	2022	4.238	4.237	0.01
	2023	4.280	4.272	0.18
B	2017	5.012	5.033	0.42
	2018	5.233	5.220	0.25
	2019	5.169	5.181	0.24
	2020	5.378	5.363	0.28
	2021	5.419	5.425	0.12
	2022	5.462	5.465	0.05
	2023	5.553	5.540	0.22
C	2017	0.761	0.762	0.12
	2018	0.755	0.758	0.38
	2019	0.771	0.772	0.17
	2020	0.824	0.821	0.38
	2021	0.815	0.817	0.26
	2022	0.872	0.870	0.19
	2023	0.883	0.879	0.39
D	2017	0.202	0.205	1.07
	2018	0.221	0.221	0.12
	2019	0.233	0.232	0.45
	2020	0.234	0.234	0.03
	2021	0.245	0.244	0.35
	2022	0.241	0.242	0.26
	2023	0.252	0.251	0.49

Table 3. Regional comparison of RMSE (m³) for RF and Voting Regressor between the previous study and this study.

Region	Model	RMSE (m³), previous study (VR_Base framework) [4]	RMSE (m³), this study (VR_SA framework)
A	RF	49,149.8	26,700
A	Voting	24,574.9	12,300
B	RF	37,634	17,700.3
B	Voting	18,817.2	15,800.4
C	RF	77,709.5	20,000
C	Voting	3854.7	1950.5
D	RF	3282.2	2200
D	Voting	1641.1	1500.8

Table 4. Comparison of MAE (m³) for RF and Voting Regressor across regions (previous vs. this study).

Region	Model	MAE (m³), previous study (VR_Base framework) [4]	MAE (m³), this study (VR_SA framework)
A	RF	37,028.3	21,400
A	Voting	18,514.2	9800
B	RF	32,931.3	14,200.2
B	Voting	16,415.6	12,700.3
C	RF	6793.5	1600
C	Voting	3396.7	1560.4
D	RF	2549.5	1800
D	Voting	1274.8	1200.6

Table 5. Comparison of MAPE (%) for RF and Voting Regressor across regions (previous vs. this study).

Region	Model	MAPE (%), previous study (VR_Base framework) [4]	MAPE (%), this study (VR_SA framework)
A	RF	-	2.16
A	Voting	0.45	1.63
B	RF	-	2.78
B	Voting	0.31	2.19
C	RF	-	5.23
C	Voting	0.41	4.40
D	RF	-	5.01
D	Voting	0.56	4.13
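For reference, the three error metrics reported in Tables 3 to 5 can be computed as in the minimal sketch below, which uses the standard definitions of RMSE, MAE, and MAPE and the rounded Region D values printed in Table 2. Values recomputed this way are not expected to reproduce the tabulated metrics, which may rest on unrounded data or a different evaluation split; the function names and the unit-conversion constant are hypothetical.

```python
from math import sqrt

# Region D, 2017-2023, in 10^6 m^3/year (rounded values as printed in Table 2).
observed = [0.202, 0.221, 0.233, 0.234, 0.245, 0.241, 0.252]
estimated = [0.205, 0.221, 0.232, 0.234, 0.244, 0.242, 0.251]

M3_PER_UNIT = 1e6  # converts 10^6 m^3 to m^3, the unit used in Tables 3 and 4


def rmse(obs, est):
    """Root mean square error: penalizes large deviations more strongly."""
    return sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))


def mae(obs, est):
    """Mean absolute error: average magnitude of the deviations."""
    return sum(abs(e - o) for o, e in zip(obs, est)) / len(obs)


def mape(obs, est):
    """Mean absolute percentage error, in percent of the observed value."""
    return 100.0 * sum(abs(e - o) / o for o, e in zip(obs, est)) / len(obs)


rmse_m3 = rmse(observed, estimated) * M3_PER_UNIT
mae_m3 = mae(observed, estimated) * M3_PER_UNIT
mape_pct = mape(observed, estimated)

print(f"RMSE: {rmse_m3:.1f} m^3, MAE: {mae_m3:.1f} m^3, MAPE: {mape_pct:.2f} %")
```

RMSE and MAE carry the units of the data, so they scale with region size, whereas MAPE is relative and allows comparison between large regions (A, B) and small ones (C, D).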

Table 6. Structural characteristics of regional time series and their variation across regions.

Region	Autocorrelation	Mean Absolute Change Rate	Coefficient of Variation
A	−0.041	0.026	0.026
B	0.776	0.021	0.035
C	0.833	0.032	0.064
D	0.870	0.043	0.072

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
