Urbanization degrades stormwater quality and limits the industrial, agricultural, hydropower, and recreational use of water bodies [1]. Stormwater runoff carries suspended solids, nutrients, trace organic compounds, heavy metals, and pathogens into natural water bodies, impairing ecosystems and human health [2]. Total Suspended Solids (TSS) affect water temperature, dissolved oxygen levels, and clarity. Stormwater discharges that wash over the surface of a construction site can produce a massive volume of suspended solids [4]. Additionally, other pollutants such as heavy metals and phosphorus can attach to TSS and be carried in stormwater runoff [5]. The significance of TSS concentration has therefore led researchers to develop modeling approaches, including empirical models, Physically Based Models (PBMs), and data-driven models. PBMs involve complex numerical techniques and require significant computational time. Although empirical models and PBMs may continue to be used in the near future, data-driven approaches will begin to play a significant role in hydrologic and water quality analysis as big data become available and computing power improves.
Some models use an empirical approach to predict pollutant load. The Source Loading and Management Model for Windows (WinSLAMM) is an example of an empirical model for stormwater quantity and quality [6]. WinSLAMM requires a significant amount of information about the site's geographic location, site development characteristics (e.g., impervious or pervious area), soil type, land use, and rainfall amount. It also requires specifics about the drainage system (e.g., grass swales, infiltration trenches) and the fraction of the area covered by the drainage system. Runoff must be estimated before stormwater quality is evaluated. WinSLAMM uses estimates of particulate solids concentration for each source area (e.g., paved parking) and land use type (e.g., residential and commercial). Pollutant loads are estimated as the product of runoff volume and source area concentration estimates [8].
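The load calculation described above (runoff volume multiplied by a source area concentration) can be illustrated with a minimal sketch. This is not WinSLAMM code: the function name, the source area names, and all volumes and concentrations below are hypothetical values chosen only to show the arithmetic and the unit conversion.

```python
def pollutant_load_kg(runoff_volume_m3, concentration_mg_per_l):
    """Return the pollutant load in kg for one source area.

    load [kg] = volume [m^3] * 1000 [L/m^3] * concentration [mg/L] / 1e6 [mg/kg]
    """
    return runoff_volume_m3 * concentration_mg_per_l / 1000.0


# Hypothetical source areas: (runoff volume in m^3, TSS concentration in mg/L).
# These numbers are illustrative assumptions, not WinSLAMM defaults.
source_areas = {
    "paved_parking": (120.0, 150.0),
    "roofs": (80.0, 20.0),
    "landscaped": (15.0, 300.0),
}

# Load per source area, then the site total as the sum over source areas.
loads = {
    name: pollutant_load_kg(volume, concentration)
    for name, (volume, concentration) in source_areas.items()
}
total_load_kg = sum(loads.values())

for name, load in loads.items():
    print(f"{name}: {load:.2f} kg")
print(f"total: {total_load_kg:.2f} kg")
```

Because the load is a direct product, any error in the runoff estimate carries over proportionally to the load estimate, which is why runoff has to be estimated first.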
Some PBMs that use a physically based approach for runoff estimation apply empirical approaches to estimate pollutant concentration. An example is the commonly applied Storm Water Management Model (SWMM). SWMM can be applied for both event-based and continuous simulation in urban watersheds [10]. Although SWMM is mostly applied to predict stormwater runoff, it has also been used to estimate pollutant concentration [12]. TSS concentration in SWMM is simulated using either an empirical buildup and wash-off model or the Event Mean Concentration (EMC). The buildup and wash-off model represents the accumulation of TSS load during dry periods and its wash-off by the first storm [13]. The buildup process is described using exponential, power, or saturation functions [10]. The lack of generally accepted parameters for the buildup and wash-off functions leads to higher uncertainty in water quality prediction [14].
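As an illustration of the buildup and wash-off concept, the sketch below uses an exponential buildup function, B = C1(1 - exp(-C2 t)), and an exponential wash-off rate, W = C1 q^C2 B, which are the forms commonly documented for SWMM. The function names, parameter values, units, and the simple explicit time stepping are assumptions made for this example and do not reproduce SWMM's internal implementation.

```python
import math


def exponential_buildup(dry_days, max_buildup, rate_constant):
    """Mass accumulated after `dry_days`, approaching `max_buildup` asymptotically."""
    return max_buildup * (1.0 - math.exp(-rate_constant * dry_days))


def simulate_washoff(initial_mass, runoff_rates, coeff, exponent, dt_hours):
    """Step through a storm and return (mass washed per step, mass remaining).

    Wash-off rate per step: coeff * q**exponent * remaining mass (mass per hour).
    The removed mass is capped so the surface store never goes negative.
    """
    remaining = initial_mass
    washed = []
    for q in runoff_rates:
        rate = coeff * (q ** exponent) * remaining
        removed = min(remaining, rate * dt_hours)
        remaining -= removed
        washed.append(removed)
    return washed, remaining


# Hypothetical parameters (not calibrated values).
buildup_mass = exponential_buildup(dry_days=5.0, max_buildup=50.0, rate_constant=0.4)
washed, remaining = simulate_washoff(
    initial_mass=buildup_mass,
    runoff_rates=[2.0, 5.0, 1.0],  # runoff rate per step, e.g. mm/h
    coeff=0.1,
    exponent=1.2,
    dt_hours=0.5,
)
print(f"buildup before storm: {buildup_mass:.2f}")
print(f"washed off: {sum(washed):.2f}, remaining: {remaining:.2f}")
```

The four free parameters (two for buildup, two for wash-off) are exactly the quantities for which generally accepted values are lacking, so they usually have to be calibrated per site.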
These investigations demonstrate that PBMs can provide guidance on the design and management of water resources. However, these models have been criticized as overparameterized, overly complex, data-intensive, and difficult to use, which limits their broader use. Also, PBMs require runoff to be predicted with reasonable accuracy before pollutant concentrations can be simulated [5]. With the emergence of Artificial Intelligence (AI), researchers have attempted to take advantage of AI methods. Machine Learning (ML), one area of application of AI, is increasingly implemented in water resources and environmental engineering [17] and in many other areas such as the medical and economic fields [21]. Given the limitations of empirical models and PBMs, ML techniques should be considered as alternative approaches for the management of urban stormwater and the estimation of runoff and water quality. The use of empirical approaches and a constant concentration (EMC) for water quality prediction involves assumptions that could limit the accuracy of such models. ML approaches are relatively computationally efficient and cost-effective compared to empirical models and PBMs [23]. ML algorithms rely on numerical or categorical relationships between features and target values rather than on physical relations between inputs and outputs. ML approaches make it possible to use historical data on input features and target values collected over several years and to move toward the data-driven prediction of stormwater pollutants.
ML techniques have been frequently applied to estimate runoff from rainfall data. However, their use for predicting water quality parameters such as TSS or nutrient concentration has been very limited. The first step in applying ML methods is to identify the relevant variables, called features, and how these features affect the target value (e.g., water quality). The authors of Refs. [24] investigated the possibility of using an Artificial Neural Network (ANN) and Linear Regression (LR) to estimate TSS concentration. They used Unmanned Aerial Vehicle (UAV) images to extract TSS data from water bodies. The results demonstrated that the ANN estimated TSS with an R² value of 0.84 during the training step and 0.57 during the prediction step. The authors of Ref. [26] used multiple ML algorithms, such as Multiple Linear Regression, Polynomial Regression, Random Forest, Gradient Boosting Algorithm, and Support Vector Machines, to predict a Water Quality Index (WQI) for various lakes. The best results were obtained using the Gradient Boosting Algorithm, with an R² value of 0.74 during the training step. The authors of Ref. [27] used Regression Tree (RT) and Support Vector Regression (SVR) to estimate specific indicators of stormwater quality. They estimated target parameters, such as biochemical oxygen demand (BOD5), chemical oxygen demand (COD), total suspended solids (TSS), and total dissolved solids (TDS), in the stormwater treatment network based on features such as land use and volume of runoff. The SVR method predicted BOD5, COD, TSS, and TDS with better accuracy (R² > 0.8) than the RT algorithm (R² > 0.7). These studies demonstrate the potential of ML techniques for the management of urban water resources. The methods were used to estimate water quality targets (e.g., TSS concentration) based on quantitative relationships between features and targets. ML methods could provide important insights into the quantitative relations between environmental factors or features (e.g., land use and antecedent dry days) and target values (e.g., TSS or nutrient concentration), especially when an extensive database is available [28]. Several previous studies have focused on applying ML algorithms to assess streamflow and water quality in river networks. ML methods have rarely been applied to pollutant concentration in urban stormwater runoff. In this study, we attempted to comprehensively investigate the development and application of data-driven ML algorithms as an alternative to the physically based modeling approach for estimating TSS concentration from urban watersheds. We implemented a new approach using supervised ML techniques with six single and two ensemble models, six influential factors (drainage area, land use, impervious area, rainfall depth, runoff volume, and antecedent dry days), and one target (TSS concentration). We introduced a sensitivity analysis to evaluate the relevance of the six factors and their relative contribution to the prediction accuracy of the ML algorithms. We used indices such as R², Nash-Sutcliffe Efficiency (NSE), and Root Mean Square Error (RMSE) to compare the performance of the ML algorithms and identify the best-fitting algorithm.
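The three performance indices can be computed as in the sketch below. The text does not state which definition of R² was used; this example assumes the squared Pearson correlation between observed and predicted values, which is common in hydrology, and the standard definitions of NSE and RMSE. The function names and the sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the units of the target (e.g., mg/L)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def r_squared(observed, predicted):
    """Squared Pearson correlation (an assumed definition of R^2)."""
    mean_obs = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov * cov / (var_obs * var_pred)


# Hypothetical TSS values (mg/L); the prediction has a constant bias of +10.
observed = [20.0, 55.0, 130.0, 80.0]
predicted = [o + 10.0 for o in observed]

print(f"R2   = {r_squared(observed, predicted):.3f}")
print(f"NSE  = {nse(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
```

Under this definition R² is insensitive to a constant bias while NSE and RMSE are not, which is one reason to report the three indices together.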
This study investigated the application of supervised machine learning algorithms for urban stormwater quality management. Version 4.02 of the National Stormwater Quality Database (NSQD) was used to extract event-based data on TSS concentration and associated site features: drainage area, land use, percent imperviousness, antecedent dry days, rainfall depth, and runoff volume. We compiled 530 data points from the NSQD that contained all the features and target values with no missing data. We used 66% of the dataset for the training step and 34% for the prediction step.
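A 66%/34% split of 530 records can be sketched as follows. How the records were assigned to each subset is not stated, so random shuffling with a fixed seed is an assumption here, as are the function name and the rounding rule. The indices stand in for rows of the compiled dataset.

```python
import random


def split_indices(n_samples, train_fraction=0.66, seed=0):
    """Shuffle row indices and split them into training and prediction subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


# 530 is the number of complete records reported in the text.
train_idx, test_idx = split_indices(530)
print(len(train_idx), len(test_idx))
```

Holding out the prediction subset before any model fitting is what allows overfitting to show up as a gap between training and prediction performance.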
Eight ML algorithms were compared: Linear Regression, Regression Tree, Support Vector Regression, Random Forest (RF), Adaptive Boosting (AdB), variable weighting k-Nearest Neighbor, uniform weighting k-Nearest Neighbor (UW-kNN), and Artificial Neural Network (ANN). Linear Regression failed in both the training and prediction steps. All the other methods performed well in both steps except UW-kNN and ANN, which performed well in the training step but failed in the prediction step. The highest R² and NSE and the lowest RMSE in both the training and prediction steps, indicators of good performance, were obtained by the RF and AdB algorithms. Moreover, a sensitivity analysis demonstrated that the prediction accuracy of AdB was sensitive to all input features.
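The difference between the two k-Nearest Neighbor variants can be shown with a small sketch. The weighting scheme behind "variable weighting" is not specified in the text; inverse-distance weighting is assumed here because it is the most common choice. The function name and the one-feature toy data are hypothetical, and real features would normally be scaled before computing distances.

```python
import math


def knn_predict(train_X, train_y, query, k=3, weighting="uniform"):
    """Predict a target value from the k nearest training points.

    weighting="uniform": plain average of the k neighbors.
    weighting="distance": inverse-distance weighted average (assumed scheme).
    """
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if weighting == "uniform":
        return sum(y for _, y in neighbors) / len(neighbors)
    # An exact match would give an infinite weight, so return its target directly.
    for distance, y in neighbors:
        if distance == 0.0:
            return y
    weights = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Toy data: one feature per sample, target loosely standing in for TSS (mg/L).
train_X = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

uniform_pred = knn_predict(train_X, train_y, (0.9,), k=3, weighting="uniform")
distance_pred = knn_predict(train_X, train_y, (0.9,), k=3, weighting="distance")
print(f"uniform: {uniform_pred:.2f}, distance-weighted: {distance_pred:.2f}")
```

With distance weighting, nearer neighbors dominate the estimate, whereas the uniform variant treats all k neighbors equally regardless of how far away they are.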
It was demonstrated that machine learning methods are plausible approaches to the prediction of TSS concentration. A limitation identified in some of the models, poor performance in the prediction step, is attributed to overfitting and underfitting. These limitations can be addressed by choosing appropriate models and using sufficient data points. The approach could also benefit from the expansion of the NSQD dataset. With further enhancement of the ML methods and data sources, ML methods have the potential for application across regions. Future efforts to enhance model accuracy should consider the use of hybrid methods or ensemble models.