Supervised Machine Learning for Estimation of Total Suspended Solids in Urban Watersheds

: Machine Learning (ML) algorithms provide an alternative for the prediction of pollutant concentration. We compared eight ML algorithms (Linear Regression (LR), uniform weighting k-Nearest Neighbor (UW-kNN), variable weighting k-Nearest Neighbor (VW-kNN), Support Vector Regression (SVR), Artiﬁcial Neural Network (ANN), Regression Tree (RT), Random Forest (RF), and Adaptive Boosting (AdB)) to evaluate the feasibility of ML approaches for estimation of Total Suspended Solids (TSS) using the national stormwater quality database. Six factors were used as features to train the algorithms with TSS concentration as the target parameter: Drainage area, land use, percent of imperviousness, rainfall depth, runoff volume, and antecedent dry days. Comparisons among the ML methods demonstrated a higher degree of variability in model performance, with the coefﬁcient of determination (R 2 ) and Nash–Sutcliffe (NSE) values ranging from 0.15 to 0.77. The Root Mean Square (RMSE) values ranged from 110 mg/L to 220 mg/L. The best ﬁt was obtained using the AdB and RF models, with R 2 values of 0.77 and 0.74 in the training step and 0.67 and 0.64 in the prediction step. The NSE values were 0.76 and 0.72 in the training step and 0.67 and 0.62 in the prediction step. The predictions from AdB were sensitive to all six factors. However, the sensitivity level was variable.


Introduction
Urbanization causes the degradation of stormwater quality and limits the industrial, agricultural, hydropower, and recreational use of water bodies [1]. Stormwater runoff contains suspended solids, nutrients, trace organic compounds, heavy metals, and pathogens discharged into natural water bodies, impairing ecosystems, and human health [2,3]. Total Suspended Solids (TSS) have an effect on water temperature, dissolved oxygen levels, and clarity. The stormwater discharges that wash the surface of the construction site can produce a massive volume of suspended solids [4]. Additionally, other pollutants such as heavy metals and phosphorus can be attached to TSS and be carried through stormwater runoff [5]. Thus, the significance of TSS concentration has led researchers toward the development of modeling approaches, such as empirical models, Physically Based Models (PBMs), and data-driven models. PBMs involve complex numerical techniques and require significant computational time. Whereas empirical models and PBMs may continue to be used in the near future, data-driven approaches will begin to play a significant role in the hydrologic and water quality analysis as big data become available and as computing power improves.
Some of the models use an empirical approach to predict pollutant load. Source Loading and Management Model for Windows (WinSLAMM) is an example of an empirical model for stormwater quantity and quality [6,7]. WinSLAMM requires a significant amount of information about the site's geographic location, site development characteristics (e.g., impervious or pervious area), soil type, land use, and rainfall amount. It also requires COD, TSS, and TDS with better accuracy (R 2 > 0.8) than the RT algorithm (R 2 > 0. 7). These studies demonstrate the potential for the application of ML techniques for the management of urban water resources. The methods were used to estimate water quality targets (e.g., TSS concentration) based on quantitative relationships between features and targets. ML methods could provide important insights about the quantitative relations between environmental factors or features (e.g., land use and antecedent dry days) and target values (e.g., TSS or nutrient concentration), especially when there is an extensive database [28]. Several of the previous studies have focused on the application of ML algorithms to assess streamflow and water quality in the river network. ML methods have rarely focused on pollutant concentration in urban stormwater runoff. In this study, we attempted comprehensively investigate the development and application of data-driven ML algorithms as an alternative to the physically based modeling approach to estimate TSS concentration from urban watersheds. We implemented a new approach where we used comprehensive supervised ML techniques with six single and two ensemble models using six influential factors (drainage area, land use, imperviousness area, rainfall depth, runoff volume, and antecedent days) and one target (TSS concentration). We introduced a sensitivity analysis for evaluating the relevance of the six factors and their relative contribution towards improving the prediction accuracy of the ML algorithms. We used indices such as R 2 , NSE, and RMSE to compare the performance of the ML algorithms and identify the best fitting algorithm.

Data Source
The input datasets were acquired from the National Stormwater Quality Database (NSQD) (Version 4.02, 2015). This database is the result of an effort by the Environmental Protection Agency (EPA) to compile stormwater quality data in the United States. The Center for Watershed Protection at the University of Alabama was responsible for managing and evaluating the data from the National Pollutant Discharge Elimination System (NPDES) and the Municipal Separate Storm Sewer (MS4) System. The first version of this database was released in 2003. The database included data from more than 3700 storm events collected by 66 agencies from 17 different states in the US [29,30]. The latest version of this database (4.02-2015), used in this study, includes the concentration of different pollutant types, such as TSS, phosphorous, and other pollutants, collected from different EPA rainfall zones. Other data types in the database used as input/features in our ML analysis include drainage area, land use categories, percent imperviousness, antecedent dry days, rainfall depth, and runoff volume [30,31]. We identified input features that may affect TSS concentration. However, the major limitation was finding sufficient data on all the features that could be used to develop an ML model for estimation of the target (TSS concentration). Most of the feature dataset had missing data. Thus, the database was sorted with respect to the features (e.g., rainfall depth, land use, and antecedent dry days) and the target value (TSS concentration) to select datapoints with a complete dataset that included all the features with no missing values.

Data Preprocessing
Identification and selection of the most important environmental factors or features (e.g., land use and imperviousness area) for the estimation of TSS concentration is an important step in the application of ML techniques. Different approaches can be used to carry out feature identification and selection, such as a t-test, p-value test, and a review of the literature [32]. This study evaluated the critical factors described in earlier studies to identify and select relevant features and extended the list to include new features. One of the features included in this study is the drainage area. The volume of stormwater runoff [33,34] and percent imperviousness [35][36][37] were also cited as relevant factors [33,[35][36][37]. Imperviousness influences stormwater runoff, temperature, and the rate of pollutant degradation [35]. Rainfall patterns, intensity, depth [38], and antecedent dry days [39,40] were also stated Water 2021, 13, 147 4 of 24 as predictors of TSS concentration. We used rainfall depth as an input feature in this study since the duration and intensity of the rainfall event were not available. Previous studies have demonstrated that antecedent dry days have a crucial effect on the buildup and pollutant concentrations in urban stormwater [35,41]. Thus, this parameter was also included as one of the features in this study. Land use is another parameter that affects water quality. TSS concentration was observed to change significantly with changes in land use [38,40,42]. Land use was categorized into residential, institutional, commercial, industrial, and freeway. A summary of available data considered in this study is provided in Table 1. The collinearity among the features was evaluated. As shown in Table 2, the collinearity among the features was generally low except for the rainfall depth and the runoff volume. The runoff volume was directly related to rainfall depth. The NSQD database has datapoints with respect to the six mentioned features. We selected 530 datapoints with consistent features and the corresponding target value (TSS Water 2021, 13, 147 5 of 24 concentrations) from 7 different states. The selection was based on whether that dataset contained all the six features and the target value (TSS concentration).

ML Model Structure
The ML algorithm within an open-source python-based software, named Orange (Version 3.26), was implemented for the analysis after compiling the input dataset. The workflow in Figure 1 describes the three major stages involved in running ML algorithms. Stage 1 relates to the input data, in which the dataset consisting of the features (e.g., drainage area and land use) and the target value (TSS concentration) was compiled as an input dataset. The next step in stage 1 involved dividing the data into the training and prediction datasets. We used 66% of the dataset for the training step and 34% for the prediction step. Stage 2 represents the core of the ML analysis, where ML algorithms were used to analyze the quantitative relationships between the features and the target value. At this stage, eight different ML approaches were applied to the input data. Details about each ML method are presented in subsequent sections. Stage 3 is the output step, where the training results were tested using a 10-fold cross-validation approach. If reasonable estimates of the target value were obtained in the training step based on the R 2 , NSE, and RMSE values, the trained ML models were further tested using a separate dataset in the prediction step to determine the accuracy and ability of each ML method to make inferences or generalizations. If the training results did not demonstrate good performance, the internal specific parameters, such as weights, number of nodes, number of neighbors, and depth of the tree, for each ML model were readjusted to obtain better results. Additional information about the evaluation process is included in the results and discussion section. Figure 1 shows the workflow of the model. concentrations) from 7 different states. The selection was based on whether that dataset contained all the six features and the target value (TSS concentration).

ML Model Structure
The ML algorithm within an open-source python-based software, named Orange (Version 3.26), was implemented for the analysis after compiling the input dataset. The workflow in Figure 1 describes the three major stages involved in running ML algorithms. Stage 1 relates to the input data, in which the dataset consisting of the features (e.g., drainage area and land use) and the target value (TSS concentration) was compiled as an input dataset. The next step in stage 1 involved dividing the data into the training and prediction datasets. We used 66% of the dataset for the training step and 34% for the prediction step. Stage 2 represents the core of the ML analysis, where ML algorithms were used to analyze the quantitative relationships between the features and the target value. At this stage, eight different ML approaches were applied to the input data. Details about each ML method are presented in subsequent sections. Stage 3 is the output step, where the training results were tested using a 10-fold cross-validation approach. If reasonable estimates of the target value were obtained in the training step based on the R 2 , NSE, and RMSE values, the trained ML models were further tested using a separate dataset in the prediction step to determine the accuracy and ability of each ML method to make inferences or generalizations. If the training results did not demonstrate good performance, the internal specific parameters, such as weights, number of nodes, number of neighbors, and depth of the tree, for each ML model were readjusted to obtain better results. Additional information about the evaluation process is included in the results and discussion section. Figure 1 shows the workflow of the model.

Machine Learning (ML) Methods
Machine learning techniques are broadly classified into three distinct groups: Supervised, semi-supervised, and unsupervised algorithms. The supervised learning algorithm includes two steps: Training and prediction. The dataset is divided into two subsets, where the first subset is used for training, and the second subset is used for the prediction step. The training or labeling step helps the model identify the pattern and relationship between features (e.g., land use and imperviousness) and the target value (TSS concentration). The trained model and the relationships developed between input features and target values in the training step are further assessed in the prediction step using a prediction dataset to check the accuracy and generalization ability of the model [43,44]. In contrast, unsupervised learning does not involve a training step. Semi-supervised learning includes both supervised and unsupervised learning processes. Semi-supervised learning techniques use part of the training data for supervised learning and part of the dataset for unsupervised learning. Supervised machine learning algorithms were applied in this study [45,46].
Under supervised learning, there are two types of learning algorithms, described as eager and lazy learner algorithms. Eager learning algorithms learn the pattern from the training dataset and apply it to the prediction dataset, while lazy learning algorithms can perform the training and prediction steps simultaneously. These types of algorithms construct a general form of a trained model, using the training dataset that can be applied to a prediction dataset [47,48]. Supervised machine learning algorithms can also be categorized into linear and nonlinear learners [22]. LR methods are well-known linear methods. The kNN, DT, RF, AdB, and ANN algorithms are regarded as nonlinear algorithms, while the SVR method can be considered as both linear and nonlinear. The complexity of an ML model is related to the model structure and model parameters (e.g., hidden layers and weighting factors in ANN algorithm, or considering leaves and nodes in RT) used to capture the patterns in target values. Increasing the number of model parameters increases the complexity and improves the ability of the model to find the relation between features and target values [49][50][51].
Complexity improves accuracy in both the training and prediction steps, although in some cases, complex nonlinear models have been observed to decrease the accuracy in the prediction step [51]. In the prediction step, the relation between features and the target values is evaluated by applying the trained model (using a training dataset) to a prediction dataset. The prediction dataset is anticipated to have similar properties, features, and target data structure as the training dataset. Therefore, ML models are expected to detect the pattern and relation between the features and target values, learned in the training step, in the prediction dataset. However, this might not happen for some models, because the algorithm may stick to the details in the training dataset that are absent in the prediction dataset, leading to an overfitting problem. On the contrary, the simpler model (e.g., linear models) might not be able to capture the details in the patterns of the training dataset as accurately as complex models, leading to an underfitting problem and poor performance during the prediction step [45,46]. Thus, it can be challenging to find the optimum level of complexity to prevent underfitting and overfitting problems [49]. ML algorithms can also be applied as a single model or as a combination of ML techniques, called ensemble modeling, presented in Section 2.4.1.

Single-Model ML Methods
Single-learner ML models have one ML method in their structure. In this study, we applied six different single ML algorithms: LR, uniform weighting kNN, variable weighting kNN, RT, SVR, and ANN. The Single-Model ML methods are presented here. The Ensemble-Model methods are presented in the subsequent section.

Linear Regression (LR)
Linear Regression (LR) is a well-known approach used in many different fields [52]. The LR model can be applied for single-or multiple-variable problems. Most of the phenomena in nature may be better explained by multiple variables than a single variable [53]. Thus, this approach is traditionally divided into two different techniques: Simple and multiple regression models [54,55]. LR methods describe the relationship between the dependent and independent variables, referred to as a target value and features, respectively, in the ML literature.

k-Nearest Neighbor (kNN)
The k-Nearest Neighbor (kNN) is a nonparametric ML method for regression problems [56]. The kNN model stores feature values from all the datapoints (target values) in the input dataset and uses those values to find the similarities between the training dataset and prediction dataset. It uses feature similarity between training and prediction datasets to find similar datapoints (target values) and predict the target values for the prediction dataset. The estimated target value is assigned based on how closely it resembles the values of the training dataset. Two approaches are available for estimating the target value in the kNN algorithm: A uniform weighting averaging approach (UW-kNN) and a variable weighting approach (VW-kNN). In the UW-kNN method, the average of the target of the k-Nearest Neighbors is calculated with the same weighting values. In the VW-kNN algorithm, the inverse distance value is assigned as a weight for each of the k-Nearest Neighbors, and the target value is obtained using a weighted average procedure [57,58]. The kNN regression algorithm uses distance functions, such as Euclidean, Manhattan, and Minkowski, to calculate the distance between datapoints [3,59,60].

Support Vector Regression (SVR)
The Support Vector Regression (SVR) was developed based on the support vector machine algorithm using the ε-intensive loss function. The ε-intensive loss function within the SVR process ignores errors that are less than ε, which creates two margins for the fitted function. The error refers to the difference between the calculated and measured values [22,61]. SVR is able to account for any types of nonlinearity mapping using a limited dataset and is able to make a good generalization based on the statistical theories. The method has been frequently applied because of its performance and robust theoretical principles [60]. The SVR method attempts to find the best linear regression for a nonlinear mapping function with multiple features. This algorithm uses a kernel function to map all input data to a high dimensional plane to find the hyperplane with minimum distances to all available datapoints [62].

4.
Artificial Neural Network (ANN) The Artificial Neural Network (ANN) is a powerful computational method capable of capturing nonlinear patterns in functional relationships between features and targets [15,63,64]. The ANN algorithms consist of three basic layers: Input, hidden, and output. Each layer consists of a different number of neurons. The number of neurons in the input layer depends on the number of features, while the number of neurons in the output layer depends on the number of target values [65,66]. Therefore, changing the number of features and target values in the dataset will affect the number of neurons in the input and output layers. The number of hidden layers and neurons needs to be assigned. However, there is no specific method for finding an optimum number of hidden layers and neurons in the hidden layers [67]. Each layer in ANN consists of different numbers of neurons. The neurons in different layers are connected to each other by weighted paths between layers. The multiple layer perceptron (MLP) algorithm, one of the ANN procedures, was used to perform a one-layer efficient neural network in this study [68,69].

5.
Regression Tree (RT) The Regression Tree (RT) is one of the well-known ML algorithms for solving regression problems [70]. The RT algorithm follows two main steps. The first step is a splitting process to subdivide the dataset using a set of decision rules, and the second step is the pruning process [3,71]. The start node in the splitting process, called the root node, consists of all datapoints. Each node in the splitting process selects one of the input features. The splitting process starts using one of the features for the root node and splits the dataset to the left and right branches. Each of the branches indicates a range of values for the selected feature. Thus, the left and right node (operational node) consist of datapoints fitted to the range of each branch. This process continues by considering the least Sum of Squared Residuals (SSR; between the measured and SSR, in which the minimum SSR refers to the most efficient tree) [70,72]. The pruning process is the second method besides SSR, which controls the growing process in the RT algorithm. Two of the well-known criteria for pruning is assigning a maximum depth for the tree and considering a minimum number of datapoints remaining in each node. The final estimation for the target value is the average of datapoints remaining in the last nodes (prediction nodes) [72,73].

Ensemble-Model ML Methods
Ensemble models use multiple single ML methods to perform the training and prediction steps. Ensemble modeling is further divided into three categories: Bagging, boosting, and stacking or blending. Bagging and boosting are two ensemble methods that use multiple similar single algorithms in their structure, while a stacking model can include both single and ensemble methods inside its structure [74]. This study demonstrates the application of bagging and boosting ensemble models for the estimation of TSS concentration.

Bagging Ensemble Models
A bagging ensemble model allows the use of an unlimited number of similar single learners inside its structure. This algorithm uses a bootstrap sampling procedure to create multiple random subsamples from the training dataset with replacement [75,76]. Random Forest (RF) is a nonparametric bagging ML method for performing regression analysis [24,77]. The RF method, proposed by the authors of [76], combines several individual RTs (n "tree") to produce more accurate results compared to a single RT algorithm [78,79]. Because of its ensemble structure, it prevents instabilities in the final results that may arise from overfitting, usually observed in a single RT algorithm [26,76,80]. In the RF method, individual trees randomly choose a subset of features as their roots and nodes through the splitting process described in the RT method. Each learner uses a specific bootstrapped dataset obtained using the bootstrap sampling process to develop an RT structure [76]. Therefore, each one of the single learners produces an estimate of target values. In the RF method, a uniform weighting averaging approach is applied to target values obtained from the single learners and reported as the final estimated target value for each datapoint in the training dataset [75,76,81].

Boosting Ensemble Models
A boosting ensemble model can use multiple similar single learners inside its structure. The boosting algorithm follows the same process of bootstrap sampling as the bagging algorithm [75,82,83]. Adaptive Boosting (AdB) is a boosting algorithm that is able to use multiple RT learners as its basic learners. This algorithm requires assigning the number of basic learners (n). The basic learners within the AdB algorithm are evaluated iteratively based on the outcomes. The AdB starts the process by creating basic learners (RT) with lower accuracy and improves the subsequent learners based on the results obtained by previous learners. During this process, the algorithm defines weights for the datapoints. AdB starts with a uniform distribution of weights for each datapoint in the subsample. Then, as the process continues, different weights are assigned to each of the datapoints based on their importance to the accuracy of the model. The RT learners trained based on bootstrapped datasets are tested using the original training dataset to identify datapoints that were not predicted accurately. Thus, the AdB places preference on those datapoints predicted inaccurately to improve the accuracy of the next learner [84,85]. This systematic process helps to enhance the accuracy of the AdB method. The final estimate of the target value is a weighted average of the values obtained using individual learners [82,83,86].

Training and Prediction Steps
Supervised ML techniques were used for the prediction of TSS concentration in urban watersheds. Supervised ML algorithms can be trained for specific datasets and used for the prediction step. The study utilized a dataset obtained from NSQD. The dataset from NSQD was divided into two subsets. We used 66% of the dataset randomly selected by the Orange software (version 3.26) for the training step and 34% for the prediction step. All eight algorithms were trained using 350 rows. Ten-fold cross-validation was used for testing the results of the training step. The accuracy of the algorithms in the training step was tested based on indices such as the Coefficient of Determination (R 2 ) [45], Nash-Sutcliffe (NSE) [14,87], and Root Mean Square (RMSE) [27,88]. Inference about the ability of the algorithms for generalization and accuracy was tested further by applying the trained model to the prediction dataset with 180 datapoints. If satisfactory results were not obtained during the prediction step, the training step was rerun iteratively by changing specific input parameters in each ML method. Indices, such as R 2 , NSE, and RMSE, were used to assess model performance in both the training and prediction steps.
The R 2 is the square of the correlation coefficient (r) given by The second index, the Nash-Sutcliffe equation, is given by The RMSE, which was used to calculate the difference between the measured values and calculated values, is given by where P i is the predicted value for the i th row, O i is the measured data, O is the average of the measured values, and k is the total number of the datapoints in a dataset. For each ML method, we generated four different plots where the first and the second plot displayed the pattern between model predicted target values and measured data obtained from NSQD for the training and prediction datasets, respectively. The third and the fourth plots displayed the R 2 , NSE, and RMSE values that demonstrated the performance of each ML algorithm in predicting the target value (TSS concentration) for the training and prediction steps.

k-Nearest Neighbors (kNN)
Three neighbors were used for the uniform weighting method, while 45 neighbors were used for the VW-kNN. Increasing the number of neighbors in a uniform weighting method could improve performance. However, the weighting approach also has a significant effect on the accuracy of the outcome. Figure 3a,b show the R 2 , NSE, and RMSE values obtained using the two algorithms. For the training step, the uniform weighting method predicted the target value with R 2 , NSE, and RMSE values of 0.71, 0.70, and 160 mg/L, respectively. The R 2 , NSE, and RMSE values for the variable weighting method were 0.77, 0.75, and 110 mg/L, respectively.

k-Nearest Neighbors (kNN)
Three neighbors were used for the uniform weighting method, while 45 neighbors were used for the VW-kNN. Increasing the number of neighbors in a uniform weighting method could improve performance. However, the weighting approach also has a significant effect on the accuracy of the outcome. Figure 3a,b show the R 2 , NSE, and RMSE values obtained using the two algorithms. For the training step, the uniform weighting method predicted the target value with R 2 , NSE, and RMSE values of 0.71, 0.70, and 160 mg/L, respectively. The R 2 , NSE, and RMSE values for the variable weighting method were 0.77, 0.75, and 110 mg/L, respectively.

Support Vector Regression (SVR)
The SVR method was one of the methods used in this study. The regression cost value was assigned as 1.4, and the value was 0.01. A nonlinear kernel function with a coefficient of 0.63 was used for the mapping process. The scatterplots for training and prediction steps are presented in Figure 5a,b. The model had good performance across all indices in both the training and prediction steps, with R 2 and NSE values of 0.65 and 0.58, respectively. The RMSE values were 140 mg/L and 160 mg/L, respectively, for the training and the prediction steps.

Support Vector Regression (SVR)
The SVR method was one of the methods used in this study. The regression cost value was assigned as 1.4, and the ε value was 0.01. A nonlinear kernel function with a coefficient of 0.63 was used for the mapping process. The scatterplots for training and prediction steps are presented in Figure 5a

Artificial Neural Network (ANN)
MLP backpropagation was applied in the ANN, which means that the previous iteration results were used to improve the results of the next iteration. There is no specific method for finding an optimum number of hidden layers and neurons in the hidden layers [67]. Therefore, a trial-and-error method was used to find the optimum number of hidden layers and neurons. We did preliminary testing to determine the number of hidden layers and neurons by varying the number of hidden layers and neurons within a range of 1 to 100 and 1 to 500, respectively. The best result was obtained using 1 hidden layer and 230 neurons. The ReLu function, used as an activation function, produced good results. A quasi-Newton method-based solver with a value of 0.06 for regulation, which penalizes the worst weights in the process to prevent overfitting and underfitting problem, and a maximum number of iterations of 100,000 was used to reach the optimum weighting values. The R 2 value for the training step was considerably better than the value

Artificial Neural Network (ANN)
MLP backpropagation was applied in the ANN, which means that the previous iteration results were used to improve the results of the next iteration. There is no specific method for finding an optimum number of hidden layers and neurons in the hidden layers [67]. Therefore, a trial-and-error method was used to find the optimum number of hidden layers and neurons. We did preliminary testing to determine the number of hidden layers and neurons by varying the number of hidden layers and neurons within a range of 1 to 100 and 1 to 500, respectively. The best result was obtained using 1 hidden layer and 230 neurons. The ReLu function, used as an activation function, produced good results. A quasi-Newton method-based solver with a value of 0.06 for regulation, which penalizes the worst weights in the process to prevent overfitting and underfitting problem, and a maximum number of iterations of 100,000 was used to reach the optimum weighting values. The R 2 value for the training step was considerably better than the value in the prediction step. The scatterplots in Figure 6 show good performance in the training step and poor performance in the prediction step. The performance indices were R 2 values of 0. 75

Regression Tree (RT)
Regression Tree (RT) is a widely used ML method due to its simplicity. This study implemented a pruning process to help the RT algorithm avoid overfitting problems. Therefore, a maximum depth value of 10 was assigned for the trees. The RT method predicted TSS concentrations with R 2 values of 0.61 and 0.34, NSE values of 0.53 and 0.30, and RMSE values of 160 mg/L and 210 mg/L for the training and the prediction steps, respectively, as shown in Figure 7a,b. The RT method was unable to make a good generalization. Thus, the method could not be regarded as a good algorithm for the estimation of TSS concentration.

Regression Tree (RT)
Regression Tree (RT) is a widely used ML method due to its simplicity. This study implemented a pruning process to help the RT algorithm avoid overfitting problems. Therefore, a maximum depth value of 10 was assigned for the trees. The RT method predicted TSS concentrations with R 2 values of 0.61 and 0.34, NSE values of 0.53 and 0.30, and RMSE values of 160 mg/L and 210 mg/L for the training and the prediction steps, respectively, as shown in Figure 7a,b. The RT method was unable to make a good generalization. Thus, the method could not be regarded as a good algorithm for the estimation of TSS concentration.

Bagging Model-Random Forest (RF)
An RF model with three features ("m" try) in each iteration and 500 ("n" tree) weak learner RTs was used. Generally, increasing the number of basic learners after 100 did not change the result significantly. However, based on the literature, 500 basic learners would be a reasonable number of single learners within the RF structure [76,81]. We did preliminary testing to determine the number of basic learners using 100, 200, 400, 500, and 1000 basic learners. The best results were obtained for 500 RTs.
The R 2 values indicate that the algorithm can reasonably predict the target value at both higher and lower TSS concentrations, as shown in Figure 8. The R 2 , NSE, and RMSE values were 0.77, 0.76, and 110 mg/L, respectively, for the training step ( Figure 8a) and 0.67, 0.62, and 150 mg/L, respectively, for the prediction step (Figure 8b).

Bagging Model-Random Forest (RF)
An RF model with three features ("m" try) in each iteration and 500 ("n" tree) weak learner RTs was used. Generally, increasing the number of basic learners after 100 did not change the result significantly. However, based on the literature, 500 basic learners would be a reasonable number of single learners within the RF structure [76,81]. We did preliminary testing to determine the number of basic learners using 100, 200, 400, 500, and 1000 basic learners. The best results were obtained for 500 RTs.
The R 2 values indicate that the algorithm can reasonably predict the target value at both higher and lower TSS concentrations, as shown in Figure 8. The R 2 , NSE, and RMSE values were 0.77, 0.76, and 110 mg/L, respectively, for the training step ( Figure 8a

Boosting Model-Adaptive Boosting (AdB)
This study implemented an AdB algorithm that used 30 RTs as basic learners with a learning rate of 1, which used the results obtained from the last single learner. Additionally, it used an exponential regression loss function to fit the dataset. The model performed well across all indices, with R 2 values of 0.74 and 0.64, NSE values of 0.72 and 0.62, and RMSE values of 110 mg/L and 150 mg/L, respectively, for the training and prediction steps. However, TSS concentration estimates were relatively more accurate in the training step, as shown in Figure 9a, compared to the prediction step, as shown in Figure 9b.

Boosting Model-Adaptive Boosting (AdB)
This study implemented an AdB algorithm that used 30 RTs as basic learners with a learning rate of 1, which used the results obtained from the last single learner. Additionally, it used an exponential regression loss function to fit the dataset. The model performed well across all indices, with R 2 values of 0.74 and 0.64, NSE values of 0.72 and 0.62, and RMSE values of 110 mg/L and 150 mg/L, respectively, for the training and prediction steps. However, TSS concentration estimates were relatively more accurate in the training step, as shown in Figure 9a, compared to the prediction step, as shown in Figure 9b.

Single, Bagging, and Boosting Models
The Multiple Linear Regression model was used for the training and prediction steps to estimate TSS concentration using six different features. Figure 2a shows the efficiency of linear regression in capturing patterns in the measured values data in the training step. The results demonstrate that the Multiple Linear Regression model failed to mimic the patterns in the measured values of TSS concentration. This could be attributed to the underfitting problem that arose from the simplicity and linearity of the algorithm. The MLR model was not able to find the relationship between the features and the target during the training step. Thus, the model could not predict TSS concentration in the prediction step, as shown in Figure 2b. These results indicate that MLR is not a robust model for the estimation of TSS concentration.

Single, Bagging, and Boosting Models
The Multiple Linear Regression model was used for the training and prediction steps to estimate TSS concentration using six different features. Figure 2a shows the efficiency of linear regression in capturing patterns in the measured values data in the training step. The results demonstrate that the Multiple Linear Regression model failed to mimic the patterns in the measured values of TSS concentration. This could be attributed to the underfitting problem that arose from the simplicity and linearity of the algorithm. The MLR model was not able to find the relationship between the features and the target during the training step. Thus, the model could not predict TSS concentration in the prediction step, as shown in Figure 2b. These results indicate that MLR is not a robust model for the estimation of TSS concentration.
Two different kNN methods were used: One with a fewer number of neighbors with uniform weighting and one with a larger number of neighbors with variable weighting based on the distance of each datapoint to its neighbors. Both approaches demonstrated good performance in the training step, as shown in Figure 3a, mimicking the overall patterns, particularly for high TSS concentrations. However, the performance of the VW-kNN method in the lower concentration ranges was slightly better than the UW-kNN. A fewer number of neighbors forced the algorithm to capture the details in the training dataset. Therefore, this process was more likely to fall into an overfitting problem. Moreover, in UW-kNN, the same weight was assigned to the farthest and the nearest neighbor. Nevertheless, there might have been differences between the farthest and nearest neighbor with respect to all the features. As a result, the algorithm could not mimic the general pattern observed in the training dataset. This inability led to a poor fit in the prediction step.
The VW-kNN process resulted in a relatively better fit than the UW-kNN approach in both the training and prediction steps. Thus, using a larger number of neighbors and a variable weighting approach helped the algorithm capture the general pattern observed in the dataset in the training step. The scatterplots in Figure 4 show that the variable weighting method performed better, not only in predicting low concentrations, but also in high TSS concentrations when compared to the uniform weighting method in the prediction step. The uniform and variable weighting kNN methods were compared using other statistical measures (NSE and RMSE).
The SVR algorithm captured the general pattern in both the training and prediction steps. The scatterplots for the training and prediction steps are presented in Figure 5a,b). The model had good performance across all indices in both the training and prediction steps in terms of R 2 , NSE, and RMSE values. The SVR algorithm could have better performance in high and mid-range TSS concentration compared to the lower value. Although the SVR had higher R 2 and NSE values in the prediction step compared to the VW-kNN, the VW-kNN had better performance in the training step compared to the SVR. Despite the relatively better performance in prediction steps compared to other models, such as MLR and UW-kNN, the performance in the prediction step was not as good as the training step.
The ANN performed well on the training dataset. However, it could not accurately mimic the pattern in the prediction step, which could have been due to an overfitting problem in the training step resulting from the complexity of the ANN model structure.
Overfitting suggests that the ANN model learned and applied details in the training dataset that might have been absent in the prediction dataset. Another factor contributing to the overfitting problem could have been the limited number of datapoints. Having a larger number of datapoints may have reduced the overfitting problem and may have led to better performance in both the training and prediction steps [51].
The next algorithm that was utilized in this study was the RT method. The algorithm captured the general pattern in both the training and prediction steps. However, RT could not capture the pattern for several of the datapoints, especially in the prediction step, which is attributable to an overfitting problem. The RT method can be used inside the structure of ensemble bagging and ensemble boosting algorithm to improve the results obtained in the training and prediction step.
The RF method, which was developed based on the RT, was one of the best-performing algorithms implemented in this study in terms of estimating the TSS concentration. The model was efficient and could capture the patterns in the measured values data for both high and low TSS concentration ranges in the training dataset. In the prediction step, the algorithm captured the pattern but not as accurate as it did in the training step. The algorithm performed slightly better in lower and mid-range TSS concentration compared to higher concentration values in both the training and prediction steps. The reason could have been the relatively larger number of events in the lower and mid-range concentrations. Furthermore, the results show that RF was more efficient than RT, which was used as a basic learner in RF. These results demonstrate that ensemble modeling improved the performance of the RT algorithm compared to the application of the RT algorithm as a single learner.
As stated earlier, AdB can improve the results of single learners that failed to capture patterns in the data due to an overfitting problem. The model was able to recognize the patterns in both the training and prediction datasets. The scatterplots for the training and prediction steps are presented in Figure 9a,b. The scatterplots demonstrate that the AdB method predicted TSS concentrations, with relatively higher R 2 and NSE values compared to the single learners presented earlier. The algorithm did not perform with the same level of accuracy across all the ranges of TSS concentration. The algorithm was able to capture patterns in the TSS concentration for the prediction step for lower, mid-level, and high concentrations of TSS.
All the applied methods in this study were compared by considering the relative R 2 and relative RMSE values in the prediction step, which demonstrated the benefits of using ensemble modeling. The R 2 values were normalized by dividing all the R 2 values by the maximum R 2 value to obtain values between "0" and "1." A value of "1" for a model refers to the highest level of accuracy. A value near "0" represents the lower level of accuracy of ( Figure 10a). The RMSE values were normalized by dividing the minimum RMSE value by all the RMSE values to obtain values between "0" and "1." A value of "1" for a model refers to the highest level of accuracy. A value near "0" represents the lower level of accuracy ( Figure 10b). single learner.
As stated earlier, AdB can improve the results of single learners that failed to capture patterns in the data due to an overfitting problem. The model was able to recognize the patterns in both the training and prediction datasets. The scatterplots for the training and prediction steps are presented in Figure 9a,b. The scatterplots demonstrate that the AdB method predicted TSS concentrations, with relatively higher R 2 and NSE values compared to the single learners presented earlier. The algorithm did not perform with the same level of accuracy across all the ranges of TSS concentration. The algorithm was able to capture patterns in the TSS concentration for the prediction step for lower, mid-level, and high concentrations of TSS.
All the applied methods in this study were compared by considering the relative R 2 and relative RMSE values in the prediction step, which demonstrated the benefits of using ensemble modeling. The R 2 values were normalized by dividing all the R 2 values by the maximum R 2 value to obtain values between "0" and "1." A value of "1" for a model refers to the highest level of accuracy. A value near "0" represents the lower level of accuracy of (Figure 10a). The RMSE values were normalized by dividing the minimum RMSE value by all the RMSE values to obtain values between "0" and "1." A value of "1" for a model refers to the highest level of accuracy. A value near "0" represents the lower level of accuracy (Figure 10b).
The AdB and RF method used multiple RT algorithm as its basic learner. The RT method was unable to perform well when applied individually. However, the boosting and bagging process considerably improved the performance compared to the performance under the individual application. The ensemble structure in AdB and RF allowed the establishment of a robust correlation between a feature and target values in the training step, which led to limiting overfitting and underfitting problems and better results in the prediction step. However, the AdB method performed better than RF, attributable to the boosting process that helped to improve the limitation of each basic learner and subsequent basic learners. The AdB and RF method used multiple RT algorithm as its basic learner. The RT method was unable to perform well when applied individually. However, the boosting and bagging process considerably improved the performance compared to the performance under the individual application. The ensemble structure in AdB and RF allowed the establishment of a robust correlation between a feature and target values in the training step, which led to limiting overfitting and underfitting problems and better results in the prediction step. However, the AdB method performed better than RF, attributable to the boosting process that helped to improve the limitation of each basic learner and subsequent basic learners.

Sensitivity Analysis
Six different features were considered for the application of supervised ML algorithms to reproduce the pattern in TSS concentration observations retrieved from the NSQD. Further investigation was conducted to assess the relative importance of each feature on the performance of the ML model. Thus, a sensitivity analysis was implemented using one of the best-performing methods, the AdB method. We developed a methodology where we excluded one feature at a time during the training step and assessed how the feature affected the performance of the model. Figure 11 shows the sensitivity results for the six features used in the AdB algorithm. The significance of each feature was assessed by comparing the difference in R 2 values with and without the feature. Then, the data were normalized by dividing the differences by the maximum difference in R 2 values to obtain values between "0" and "1," with a value of "1" for a feature indicating the highest level of sensitivity. A value near "0" value represents the lower level of sensitivity of the output to a feature.

Sensitivity Analysis
Six different features were considered for the application of supervised ML algorithms to reproduce the pattern in TSS concentration observations retrieved from the NSQD. Further investigation was conducted to assess the relative importance of each feature on the performance of the ML model. Thus, a sensitivity analysis was implemented using one of the best-performing methods, the AdB method. We developed a methodology where we excluded one feature at a time during the training step and assessed how the feature affected the performance of the model. Figure 11 shows the sensitivity results for the six features used in the AdB algorithm. The significance of each feature was assessed by comparing the difference in R 2 values with and without the feature. Then, the data were normalized by dividing the differences by the maximum difference in R 2 values to obtain values between "0" and "1," with a value of "1" for a feature indicating the highest level of sensitivity. A value near "0" value represents the lower level of sensitivity of the output to a feature. The AdB outcomes were sensitive to several of the features. However, there were differences in the level of sensitivity. The models showed higher sensitivity to land use data, followed by an antecedent number of dry days and drainage areas. The AdB outcomes were sensitive to both rainfall depth and runoff volume, with a sensitivity higher to runoff volume than rainfall depth. Runoff volume was more relevant to TSS concentration than rainfall depth, likely because runoff is directly linked to detachment and transport of sediment [89]. Thus, when available, runoff volume data could be a better predictor of TSS concentration that rainfall data. The results showed that AdB outcomes were sensitive to both land use and imperviousness, with a higher sensitivity to land use. Land use is relatively more sensitive because of the greater potential for the generation of relatively more sediment load from pervious areas than for wash-off from the impervious areas [89]. Percent imperviousness increases runoff volume, which is responsible for detachment, wash-off, and transport. The AdB also showed that the results obtained by this model were sensitive to antecedent dry days. The amount of solids deposited over the impervious area increases with an increasing number of dry days [89]. The drainage area is also an important factor because it is directly related to the runoff volume and peak. The sensitivity analysis demonstrated all of the factors used in our analysis were important to the prediction of TSS. The AdB outcomes were sensitive to several of the features. However, there were differences in the level of sensitivity. The models showed higher sensitivity to land use data, followed by an antecedent number of dry days and drainage areas. The AdB outcomes were sensitive to both rainfall depth and runoff volume, with a sensitivity higher to runoff volume than rainfall depth. Runoff volume was more relevant to TSS concentration than rainfall depth, likely because runoff is directly linked to detachment and transport of sediment [89]. Thus, when available, runoff volume data could be a better predictor of TSS concentration that rainfall data. The results showed that AdB outcomes were sensitive to both land use and imperviousness, with a higher sensitivity to land use. Land use is relatively more sensitive because of the greater potential for the generation of relatively more sediment load from pervious areas than for wash-off from the impervious areas [89]. Percent imperviousness increases runoff volume, which is responsible for detachment, wash-off, and transport. The AdB also showed that the results obtained by this model were sensitive to antecedent dry days. The amount of solids deposited over the impervious area increases with an increasing number of dry days [89]. The drainage area is also an important factor because it is directly related to the runoff volume and peak. The sensitivity analysis demonstrated all of the factors used in our analysis were important to the prediction of TSS.

Conclusions
This study investigated the application of supervised machine learning algorithms for urban stormwater quality management. Version 4.02 of the National Stormwater Quality Database (NSQD) was used to extract event-based data on TSS concentration, and associated site features such as drainage area, land use, percent of impervious, antecedent dry days, rainfall depth, and runoff volume. We compiled 530 datapoints from NSQD containing all the features and target values with no missing data. We used 66% of the dataset as input for the training step and 34% for the prediction step.
Eight different ML algorithms were compared: Linear Regression, Regression Tree, Support Vector Regression, Random Forest, Adaptive Boosting, variable weighting k-Nearest Neighbor, uniform weighting k-Nearest Neighbor, and Artificial Neural Network. Linear Regression failed in both the training and prediction steps. However, all the other methods showed a good performance in both the training and prediction steps except the uniform weighting k-Nearest Neighbor and Artificial Neural Network algorithms. These two methods (UW-kNN and ANN) had a good performance in the training step but failed in the prediction step. The highest R 2 and NSE and the lowest RMSE in both the training and prediction steps, indicators of good performance, were obtained by the RF and AdB algorithms. Moreover, a sensitivity analysis demonstrated that the prediction accuracy of AdB was sensitive to all input features.
It was demonstrated that machine learning methods are plausible approaches to the prediction of TSS concentration. A limitation identified in some of the models, poor performance in the prediction step, is attributed to overfitting and underfitting problems. Thus, these limitations can be addressed with the choice of appropriate models and the use of sufficient data points. The approach could benefit from the expansion of the NSQD dataset. Thus, with further enhancement of the ML methods and data sources, ML methods have the potential for application across regions. Future efforts to enhance model accuracy should consider the use of hybrid methods or ensemble models.