A Comparison of Linear and Non-Linear Machine Learning Techniques (PCA and SOM) for Characterizing Urban Nutrient Runoff

: Urban stormwater runoff represents a signiﬁcant challenge for the practical assessment of diffuse pollution sources on receiving water bodies. Given the high dimensionality of the problem, the main goal of this study was the comparison of linear and non-linear machine learning (ML) methods to characterize urban nutrient runoff from impervious surfaces. In particular, the principal component analysis (PCA) for the linear technique and the self-organizing map (SOM) for the non-linear technique were chosen and compared considering the high number of successful applications in the water quality ﬁeld. To strengthen this comparison, these techniques were supported by well-known linear and non-linear methods. Those techniques were applied to a complete dataset with precipitation, ﬂow rate, and water quality (sediments and nutrients) records of 577 events gathered for a watershed located in Southern Italy. According to the results, both linear and non-linear techniques can represent build-up and wash-off, the two main processes that characterize urban nutrient runoff. In particular, non-linear methods are able to capture and represent better the rainfall-runoff process and the transport of dissolved nutrients in urban runoff (dilution process). However, their computational time is higher than the linear technique (0.0054 s vs. 15.24 s, for linear and non-linear, respectively, in our study). The outcomes of this study provide signiﬁcant insights into the application of ML methods for the water quality ﬁeld.


Introduction
Surface freshwater, among the aquatic ecosystems, is one of the fundamental components of the water cycle for human life worldwide. It is generated from surface water bodies, which provide drinking water, preserve biodiversity, control climate, and maintain phosphorus and nitrate cycling [1]. However, in recent decades, the quality of these water bodies has been threatened by anthropogenic activities (e.g., urbanization, agriculture, water extraction, sewage discharge) [2][3][4][5]. Therefore, in the management plan of pollution control at a watershed scale, a worthwhile initial step is the identification of the pollution sources. They can be point sources (PS) or non-point sources (NPS). PS control is more direct and quantifiable and, in several countries, its mitigation has been linked to water treatment, achieving lower pollutant concentrations before discharge. Instead, NPS pollution occurs when contaminants from diverse and widely spread sources are transported by runoff into water bodies. NPS pollution is more difficult to quantify due to its diffuse nature and the fact that many small sources might contribute to the generation of NPS pollutants [6][7][8]. Stormwater runoff from urban areas has been considered one of the most critical types of NPS pollution [8]. Thus, in this context, efficient and sustainable strategies for urban catchment management play a significant role in the protection of surface water quality.
The dynamic and random nature of urban runoff quality may be related to different local factors [9]. Previous studies have categorized such factors into four main groups: (i) pollutant type (with its characteristics like composition, load, decay rate) [10]; (ii) water quality physical characteristics (temperature, pH, salinity) [11]; (iii) temporal factors (seasonality, antecedent dry days, first flush pattern) [12]; and (iv) spatial factors (land cover, land use, slope, soil type, and other catchment physical characteristics) [1,13]. In addition to the intrinsic random nature of urban runoff quality, most of the works mentioned above have also demonstrated that many multifaceted interactions among these factors have occurred in multi-dimensions. In such conditions, conventional univariate (e.g., ANOVA) and multivariate (e.g., linear regression) statistical techniques have been widely adopted to investigate the correlation among water quality characteristics and influential factors [3,14]. However, the outcomes were often affected by several sources of statistical bias, unless all the analysis requirements (e.g., model selection criteria, parametric assumptions, and interaction terms) were rigidly tackled or fixed [9,15]. Therefore, considering that these issues may prevent an in-depth understanding of water quality changes and their different effects on water bodies in response to various rainfall events, the challenge in evaluating the variability of urban stormwater quality is clear.
In this context, machine learning (ML) methods are used to explore the hidden information in a multidimensional water quality dataset [16,17]. During the last decade, linear and non-linear ML techniques have been well reviewed for surface water quality assessments due to their outstanding capability to overcome the above-mentioned issues by processing and analyzing large amounts of data in a relatively short time. Gorgoglione et al. [7] adopted principal component analysis (PCA) to assess the effect of rainfall, watershed, and drainage network characteristics on urban nutrient runoff in poorly gauged areas. Furthermore, they used hierarchical cluster analysis (HCA) to evaluate the performance of a hydrologic/hydraulic and water quality model implemented in two different study areas. Dutta et al. [18] also used PCA and HCA to investigate the geospatial differences in water quality monitoring locations and identify potential water pollution sources. Liu et al. [19] carried out a comprehensive investigation into the relationship between land-use type and mineral components in river sediments by exploiting the PCA technique.
However, several researchers have stated that the linear multivariate ML techniques are constrained by the assumption of linearity, which is an unverified hypothesis for the urban nutrient runoff process [17,20,21]. For this reason, lately, non-linear multivariate methods have received significant attention from environmental researchers. Among several techniques, the self-organizing map (SOM) is one of the most adopted in the water quality field [20,22,23]. It is a type of artificial neural network (ANN) composed of fully connected neuron arrays, able to describe an environmental phenomenon depending on different physical variables (represented by a high-dimensional space) through a new low-dimensional space (usually two dimensions) [24]. Nevertheless, prior to the network training, the user has to define the number and arrangement of neurons to outline the topology structure, which directly affects the classification results. On its side, the SOM has the advantages of providing a suitable representation of non-linear processes, as well as a large number of parallel distributed structures, and is capable of learning and induction. Ding et al. [4] used the SOM training to improve linear techniques to identify the temporal and spatial patterns of several water quality variables. The authors found that the samplingsite elevation affected the water quality throughout different seasons. In fact, the reservoir water quality was poorer in the rainy season and particularly for reservoirs located on plains than the ones located on the mountains. For a similar purpose, Jiang et al. [17] adopted SOM and the growing hierarchical self-organizing map (GHSOM) in the Songhua river basin (China). These techniques were able to explore spatial and temporal features, the correlation between water quality parameters, and the major contaminants presented in the river (chemical oxygen demand, ammonia nitrogen, total phosphorus, and fecal coliform). Ki et al. [9] applied the SOM technique to the storm water monitored dataset to gain new insights about stream water quality profiles under different precipitation conditions. Among the several outcomes of this study, it was found that, for different monitoring sites and rainfall events, the SOM showed significant variability in trace metal concentrations, with a greater impact of runoff on river water quality at the upstream stations than at the downstream ones, except under low rainfall conditions (≤4 mm).
Taking into account the successful applications of the PCA and SOM, this study aims to compare the results of these two approaches, which respectively belong to the linear and non-linear ML techniques, regarding the characterization of nutrient runoff from impervious surfaces in urban watersheds. In particular, the comparison is carried out following three main aspects: (i) the ability to represent the correlation among the selected variables to represent the system and, therefore, depict the build-up and washoff processes (feature correlation); (ii) the capability to group the dataset, based on the variables that represent the build-up and wash-off processes (data point grouping); (iii) the ability to quantify the importance of each variable (feature importance). PCA and SOM are supported by other linear and non-linear methods to strengthen this comparison. These two techniques are both used for dimension-reduction but, as far as we know in the recent literature, they have never been computed for the same dataset and their results have never been compared.
A watershed located in Southern Italy was used as a case study, where: (i) precipitation, flow rate, and water quality (sediments and nutrients) were monitored; (ii) a hydrologic/hydraulic and water quality model was calibrated and validated; (iii) a model for synthetic rainfall generation was implemented. The findings of this study will provide essential insights into the application of ML techniques for water quality data exploration in urban areas.

Methodology Description
A flowchart that summarizes the methodology adopted in this study to accomplish the main and the specific objectives is presented in Figure 1. Four main steps can be identified. The first one is represented by the dataset creation, including a monitoring campaign, a precipitation-generation model, and a hydrologic/hydraulic/water quality model (see Section 2.3). The second and third steps include data analysis performed by PCA and SOM, respectively (see Section 3). The last step compares the outcomes obtained by the previous two phases (see Section 3). It is worth remarking that the comparison includes not only the physical processes that characterize urban nutrient runoff but also the computational cost (computational time and hardware resource requirements) of the two algorithms.

Study Area
The urban area that we took into account for this work is located in Southern Italy (Puglia region), in Sannicandro di Bari (SB). From the climatic point of view, this region belongs to the southeastern Mediterranean area [25]. In the Koppen classification, the climate is designated as Cs to indicate a sub-tropical climate with dry summers [26]. Mainly, the Cs climate is characterized by rainy winters and dry summers, with peaks of precipitation in the shoulder seasons [27].
SB watershed has a surface equal to 31.24 ha. The average slope is equal to 1.56% and the average elevation is 169 m above sea level. The mean annual temperature is equal to 15.0 • C and the mean annual rainfall is equal to 586 mm. The impervious area represents the dominant land cover (70% of the entire catchment), while only 3.80% of the watershed is covered by green area (source: SIT Puglia) [28]. The stormwater drainage network is 1.96 km long and collects water into a concrete rectangular channel that is 1.20 m × 1.70 m. In Figure 2, the area of the watershed, the drainage network, and the outfall are shown.

Observations
A monitoring campaign was carried out in SB to collect precipitation, flow rate, and water quality records. A rain gauge (ISCO 674 model), installed close to the basin's outlet, was used for recording the precipitation. A bubble flowmeter (ISCO 730 model) was adopted to measure discharge. Water quality data were monitored by collecting samples through an autosampler with 24 bottles of 0.5 L each and evaluated by adopting the standardized methods reported in Eaton et al. [29]. In Di Modugno et al. [30], we provided more details about the equipment used for the monitoring campaign. The monitored water quality variables were total suspended solids (TSS), total nitrogen (TN), and total phosphorus (TP). Data of five rainfall events were collected (11/10/2006, 11/22/2006, 12/17/2006, 01/24/2007, and 02/10/2007).
Summaries of the observed rainfall-runoff (antecedent dry period (ADP), total rainfall, event duration, maximum rainfall intensity, runoff volume, and runoff peak) and water quality (minimum, maximum, and event mean concentration (EMC)) data are presented in Tables 1 and 2. Hereafter, this dataset is called "observations."

Simulations
From now on, we will call "simulations" those events characterized by observed precipitation (rainfall data described in Section 2.3.1) and simulated flow rate and water quality variables (TSS, TP, and TN). The simulations were obtained using the Storm Water Management Model (SWMM), a well-known, worldwide hydrologic/hydraulic and water quality model used to simulate hydrographs and pollutographs in urban areas [31]. It is worth mentioning that recent bibliographic works have used this model for characterizing urban runoff and for estimating pollutant loadings [32][33][34][35][36][37]. For our work, the observed precipitation of each monitored event was the input for the SWMM model, along with the physical characteristics of the watershed and the drainage network. Therefore, five "simulation events" were added to the global dataset adopted in this study. SWMM operates in blocks or units. For the current study, we adopted the runoff and the transport block. In particular, the kinematic wave was implemented to simulate the runoff from impervious surfaces. The water losses considered in the system were represented by the infiltration process (Horton's equation) and the depression storage of the impervious portion of the watershed. Pollutant build-up, pollutant wash-off, and first-order decay were the water quality processes simulated by the model. The first two processes occur at a watershed scale, while the third occurs in the drainage network. Buildup and wash-off were both simulated with an exponential function (Equation (1) and Equation (2), respectively) [31]: where b is the pollutant build-up during the dry period [kg/ha], B max is the maximum asymptotical limit of the build-up curve [kg/ha], k B represents the build-up rate constant is the cumulative mass of constituent washed off at time t, C w is the wash-off coefficient [1/mm], k w represents the wash-off exponent, and q is the runoff rate over the subcatchment [mm/hr]. A comprehensive description of the above-mentioned physical processes can be found in the scientific literature [30,38,39].
The model was implemented in the study area, and the calibration and validation processes for the quantity and quality components were already successfully tackled in our previous work [30]. The calibrated parameters of Eq. 1 and Eq. 2 are summarized in Table 3.

Generations
Thereafter, we will call "generations" those events characterized by synthetic precipitation, produced by the Iterated Random Pulse (IRP) model, and simulated flow rate and water quality variables (TSS, TP, and TN) obtained using the SWMM model. In this case, the synthetic precipitation events were used as input of the SWMM model for generating hydrographs and pollutographs for each event at SB.
The IRP model was proposed by Veneziano and Iacobellis [40] and Veneziano et al. [41]. It adopts the classical depiction of the exterior process of the precipitation as an alternating sequence of dry and wet periods with independent lengths that describe the arrival, duration, and average intensity of rainfall events at the synoptic scale. The dry and wet periods are assumed to follow a Weibull and exponential distribution, respectively. The average precipitation intensities in various wet periods are independent and follow an exponential distribution. Precisely, the wet periods of the exterior model are scattered through the "interior" scheme, where the precipitation is represented as the overlapping of pulses with a hierarchically nested structure of temporal occurrences, with multifractal properties of intensity and location [42].
The IRP model was implemented at SB and provided a 15-year-long precipitation time series with 15 min of aggregation. Considering the regional regulation [43], single rainfall events were identified considering 48 h of the antecedent dry period. Consequently, 567 synthetic rainfall events were defined and introduced in SWMM for getting the simulated flow rate and water quality load and concentration.

Variable Selection
Suitable rainfall and water quality characteristics were chosen to better represent the cause-effect process of nutrient urban runoff.
For the rainfall characteristics, antecedent dry period (ADP), total rainfall (Tot_Rainfall), and runoff volume (Runoff_Vol) were chosen. Respectively, they represent the no-rainy days before the rainfall event (dry period), the input (Tot_Rainfall), and the output (Runoff_Vol) of the hydrologic component (wet period).
For the water quality characteristics, the event mean concentration (EMC) and the event mean load (EML) of TSS, TN, and TP were considered to not overshadow any process related to dissolved and particulate nutrients. In particular, the following equations were adopted [44]: where V is the total runoff volume for each event [L], C i is the average pollutant concentration at time step i [mg/L], V i represents the runoff volume proportional to the flow rate at the time i [L], and n is the total number of samples collected during a rainfall event. EMC and EML were calculated for TSS (EMC_TSS and EML_TSS), TN (EMC_TN and EML_TN), and TP (EMC_TP and EML_TP).

Machine Learning Techniques
The two groups of ML techniques adopted for this study were linear and non-linear. For the linear methods, PCA was chosen and supported by Pearson's correlation coefficient (r). For the non-linear algorithms, SOM was selected and supported by Spearman's rank correlation coefficient (ρ).
The main objective of PCA and SOM is reducing the dimensionality of a dataset that contains a large number of interrelated variables while preserving most of the dataset variance [45]. In this study, Pearson's r and Spearman's ρ were adopted as supporting techniques (linear and non-linear, respectively) to investigate beforehand the correlation among data points to confirm or add further information to the outcomes obtained with both dimension-reduction methods.
Both groups of techniques belong to the family of unsupervised methods, where information about other response variables or group belonging is not used to obtain results. This makes these techniques suitable for exploratory analysis, where the main aim is hypothesis generation rather than hypothesis verification [5,46].

Linear Techniques: PCA and Pearson's r
The PCA decreases the dimensionality and, therefore, the complexity of a given dataset of independent variables. This method generates a new set of variables containing orthogonal-uncorrelated variables. The latter, known as principal components (PCs), are linear combinations of the original features and are arranged by decreasing variance [47,48]. The eigenvalues quantify the importance of the PCs. They are able to expose possible emerging characteristics of the system that may be hidden if we emphasize one original variable at a time [44].
Pearson's r is a measure of linear correlation between two variables. Its value lies between -1 and +1, -1 indicates a total negative linear correlation, 0 shows no linear correlation, and 1 indicates a total positive linear correlation.

Non-Linear Techniques: SOM and Spearman's ρ
The SOM is an ANN proposed by Kohonen [24]. It is a competitive self-organizing network, constituted by fully connected neuron arrays, which can create a two-dimensional space mapping starting from a multidimensional space. In SOM, neurons learn in an unsupervised way since no network is required to provide a specific target or objective. Neurons compete with each other to better describe the input data, with the activation of only one neuron (or one node of neurons) when a data pattern is defined (competition phase) [1,49]. In the training phase, the input values are progressively adjusted to maintain the neighborhood relationship in the given input dataset (adaptation phase). This phase generates a mapping between the multidimensional space input and the two-dimensional space output (co-operation phase). In this way, the SOM clusters alike data close to each other in the 2D space [50,51].
Spearman's ρ is a measure of the monotonic correlation between two variables, and, therefore, works better in catching non-linear monotonic correlations than Pearson's r. Its value ranges between -1 and +1, -1 indicates a total negative monotonic correlation, 0 shows no monotonic correlation and 1 indicates a total positive monotonic correlation.

Dataset Profiling
The resulting dataset was composed of 577 storm events: 5 observed, 5 simulated, and 567 generated. The data profiling process was programmed and run in Python 3.8, using the pandas_profiling library [52]. In Table 4, the quintile statistics (minimum, 5th percentile, median, 95th percentile, and maximum) and the descriptive statistics (standard deviation, coefficient of variation, kurtosis, mean, and variance) of each variable of the dataset (577 storm events) are reported. The histograms that represent the frequency of each variable are reported in the Supplementary Materials (SM-1). The correlation among variables was investigated in advance by calculating the Pearson's r and the Spearman's ρ (Figure 3). From Figure 3a, it is possible to identify a high linear correlation between Runoff_Vol and Tot_Rainfall, EML_TN and ADP, EML_TP, and EML_TSS. It is noteworthy to remark that Pearson's r is not able to comprehensively describe the non-linear rainfall-runoff process represented by the variables Runoff_Vol and Tot_Rainfall, which, instead, is well represented by Spearman's ρ (Figure 3b). Furthermore, it was able to catch the nutrient transport driven by sediments (EML_TP and EML_TSS), showing, in particular, the highest particulate nature of TP compared to TN. However, no information about the dissolved portion of nutrients is provided.
Further information about non-linear processes is given by Spearman's ρ (Figure 3b). In this matrix, not only the correlation between Runoff_Vol and Tot_Rainfall is higher, as mentioned before, but it is also able to represent the dilution process of the dissolved nutrient portion: the higher Tot_Rainfall, and therefore Runoff_Vol, the lower the concen-tration of TP and TN (EMC_TP and EMC_TN). Furthermore, it is possible to see that the phosphorus off-site movement occurs mainly due to sediment transport. In contrast, the nitrogen mobilization occurs primarily due to water surface runoff (EMC_TN has a stronger correlation with Tot_Rainfall and Runoff_Vol compared to EMC_TP).
The plots that represent each of these correlations are reported in the Supplementary Materials (SM-2).

PCA and SOM Run
A (577 × 9) matrix was used as input for the PCA and SOM analysis, where 577 refers to the number of events that occurred at SB (observed + simulated + generated) and 9 are the selected variables. Prior to the analysis, the variables were standard normalized (i.e., mean = 0, standard deviation = 1) to give equal weight to each of them and deal with their various measurement units.
In this work, we coded all the algorithms in Python 3.8 and ran them on a 2.6 GHz Intel i7 PC with 32 GB of memory.
The first two PCs were selected for SB since they represent 66.61% of the total variance (39.19% is represented by PC1 and 27.42% by PC2). PCA was implemented using the scikit-learn library [53]. Regarding the computing time, solving PCA with our dataset on our PC took 0.0054 s.
For evaluating the SOM map size, we calculated the neuron number from the number of data points of the training dataset using the equation proposed by Vesanto et al. [54]: where M represents the number of neurons, rounded to the nearest integer, and N is the number of data points. In this work, N = 577, therefore M ≈ 121; this means a map with 11 × 11 neurons. In this study, SOM implementation was programmed using the minisom package [55]. Solving SOM with the same input of PCA on the same PC took 15.24 s.
It is noteworthy that SOM's computational load increases quadratically with the number of data points. In our study, there is a four orders of magnitude difference, in terms of computational time consumption, between PCA and SOM.

Feature Correlation
The first aspect that was compared between PCA and SOM was the ability to correlate rainfall characteristics to water quality variables, in other words, to correctly represent build-up and wash-off processes. With this aim, the PCA-loading plot and the SOM weight map are represented in Figure 4.
It is important to remark that in the PCA-graphical representation, vectors representing variables that form an acute angle are considered as correlated features, while those that are perpendicular are considered as uncorrelated. In the SOM weight map, variables are correlated if they activate the same neurons in the map (red neurons, also called positive neurons). The phases of the map-weights initialization and the map training were both realized by picking samples at random from the input dataset. After a training phase of 10,000 epochs (all the input data samples were used 10,000 times), a quantization error of 0.57 and a topographic error of 0.042 were obtained, assuring the resulting map's quality.
Regarding the rainfall-related features (ADP, Tot_Rainfall, and Runoff_Vol), in both plots, it is possible to detect the strong relationship between Tot_Rainfall and Runoff_Vol., while ADP is independent of these two variables. In particular, from Figure 4b, it is possible to observe that ADP activates exactly the symmetric neurons of Runoff_Vol (neurons (1,11) and (1,10)) and slightly activates neuron (1,9), whose symmetric neuron is activated by Tot_Rainfall. Considering the water-quality-related variables, in the biplot (Figure 4a), we can identify two groups of variables: pollutant EMCs and EMLs. For both groups, one of the most significant results is represented by the reliable correlation between TSS and TP, and a weaker relationship between TSS and TN. These correlations suggest that sediment transport is critical in the process of nutrient mobilization from impervious surfaces. Notably, in our study watershed, phosphorus had a higher particle-bound component than nitrogen. It is possible to also identify these patterns in Figure 4b, where the stronger correlation between EML_TSS and EML_TP is represented by the same dark red neurons.
Furthermore, this graphical representation shows that the variable EML_TSS covers the highest variance of the system since it activates the highest number of neurons.
If we look at all the features (rainfall and water quality characteristics), Tot_Rainfall and Runoff_Vol are inversely correlated to EMC_TP and EMC_TN; this means that the more it rains, the higher the runoff volume from impervious surfaces and the smaller the nutrient concentration due to a dilution process. In both plots, this effect is more evident for EMC_TN, confirming the hypothesis mentioned above about the dissolved TN and the particle-bound TP. In particular, in Figure 4b, EMC_TN activates the two opposite neurons to Runoff_Vol that is the variable that represents the wash-off process. Furthermore, only in the SOM weight map, the actual strong relationship between EMC_TSS and EMC_TP is clear that, instead, is not clearly visible in the PCA biplot. Again, the assumption of dominant particulate TP at SB is confirmed. Another aspect that can be better depicted in the SOM weight map is the high correlation between ADP and pollutant loads, particularly EML_TN: the longer the no-rainy period, the more significant the amount of pollutant load accumulated on the impervious surfaces. However, this process is usually characterized by an asymptotic superior limit and degradation processes (e.g., street cleaning, wind) that cannot be described by these techniques.
Another important aspect to highlight in this comparison is that, to make a humanreadable-graphical representation, it is common to plot only the first two (or three) PCs that, in our case, cover 66.61% of the variance of the system under study. SOM, instead, considers 100% of the variance in two dimensions.

Data Point Grouping
The second aspect that was considered in this comparison was the ability of the PCA and SOM techniques to group data points based on a specific feature. In particular, their capability to graphically represent data point similarities was discussed. Since we are representing nutrient build-up and wash-off (respectively depicted by dry and wet days), we tested the two methods by grouping the data points based on the ADP and the Tot_Rainfall values. We identified three categories for both variables: In Figures 5 and 6, the PCA biplot and the SOM distance frequency map for both variables are shown. The PCA biplot is a combination of the PCA loading plot (shown in Figure 4a) and the score plot. The latter represents the original data points in the new rotated coordinated system. The grouping was applied to the scores. The SOM distance map is a tool that visualizes how much neurons differ from each other in a twodimensional space. When two neurons correspond to different sets of data points, they would be separated by a larger distance, represented by a lighter color. On the contrary, neurons representing similar data points are separated by shorter distances, symbolized by a darker color. In the SOM frequency map, data points are plotted in the cell of the corresponding activated neuron, i.e., the winner of the competition phase.  Overall, it is possible to appreciate that both techniques are able to group data points well based on ADP and Tot_Rainfall features. It is interesting to see that for both PCA and SOM the groups are better defined with Tot_Rainfall ( Figure 6) than with ADP ( Figure 5).
The PCA groups data points following the same direction of the vector chosen for labeling them. This can be seen for both ADP and Tot_Rainfall vectors: data points characterized by a high value of ADP (or Tot_Rainfall) are closer to the head of the ADP vector (or Tot_Rainfall vector), while the points with a low ADP (or Tot_Rainfall) value are located in the opposite quadrant of the ADP vector (or Tot_Rainfall vector).
From the visual point of view, SOM provides the distance map that helps in better identifying the grouping borders just by checking the shade of the neurons: the darker the neuron, the more different the data points that belong to that neuron are from their neighbors. Furthermore, it is possible to combine different results to get further information. For instance, by overlapping the distance map with the frequency map, we can identify the storm events that belong to each neuron. In this case, knowing that data points that belong to the same neuron are very similar, it is possible to identify more substantial similarities (or dissimilarities) within groups that cannot be detected in the PCA. Moreover, it is easier to detect events that belong to different groups but that are similar to each other since they belong to the same neuron. Additionally, we can understand which is the feature that characterizes each neuron (the most important one) by combining the feature-importance map (see Section 3.5) with the previous maps. Consequently, only with the SOM can we identify the most significant variable for each data point.
It is clear that one of the advantages of the SOM technique is the possibility to overlap maps that contain different information and combine their insights to get new results.

Feature Importance
The third aspect considered for this comparison study is the ability of both methods to represent feature importance. To correctly compare this capability, it is worth noting that the PCA is able to calculate the meaningful features for each PC with reference to the entire system (or, at least, for the percentage of the system represented by the PCs chosen). While, the SOM, even though it represents 100% of the system variance, identifies the most critical features only for each neuron (local feature importance) and not for the entire system.
The PCA-feature importance can be visualized by looking at the eigenvalues (loadings) showed in Table 5. For the SOM, it is possible to map the n most important features in each neuron (Figure 7). For this example, we use the first four PCs, whose variance is respectively equal to 39.19%, 27.42%, 13.53%, and 7.65%.  PCA loadings with an absolute value higher or equal to 0.5 are usually considered to have a significant influence on the related factor [18,56] (highlighted in blue in Table 5). It is worth noting that PC1 is described by EML_TSS and EML_TP, and partially by EML_TN. This component can be interpreted as representative of the particle-bound nutrient wash-off. PC2 and PC3 can be mainly explained by EMC_TP and EML_TSS − EMC_TN, respectively. These two components represent the nutrient dilution process occurring during wash-off. It is noteworthy that for the first time, the linear technique is able to detect this process well. PC4 symbolizes the runoff process since it is represented by Runoff_Vol. In conclusion, it is possible to state that nutrient runoff in our study area is well represented by the first three PCs since EMCs and EMLs are the most important features for the entire system. However, this is not always true. In fact, in some cases, more than three PCs may be needed to represent this process. In such a case, even though the PCA loadings table is readable, the corresponding biplot would not be a human-readable-graphical representation anymore, and a non-linear technique has to be adopted.
In the feature-importance map computed by SOM (Figure 7), it is possible to arbitrarily decide the number of the most important variables to plot per neuron (three in our case). This representation allows identifying areas characterized by particular features. For instance, the bottom part, from left to right, is represented by EMC_TN, EMC_TP, and EMC_TSS. ADP, Tot_Rainfall, and Runoff_Vol, from left to right, characterize the upper part of the map. EMLs contribute to the middle area. These results confirm the previous findings, particularly those obtained from the weight map ( Figure 4b). As previously mentioned, the SOM technique's advantage is represented by the possibility of coupling these different maps and detecting further hidden information.

Further Discussion
Wash-off from impervious polluted surfaces generates transport phenomenon from a range of pollutants (i.e., nutrients) such as TN and TP. Therefore, a comprehensive understanding of urban nutrient runoff is essential for water managers and environmental engineers for an efficient stormwater-treatment design in the context of sustainable urban watershed management and surface water quality protection. In this work, we aimed to assess the capability of the adopted ML methods to characterize the main processes of urban nutrient runoff. Particularly, three main aspects were analyzed: (i) the ability to represent the correlation among water quality and water quantity variables that describe the build-up and wash-off processes aiming at finding interesting insights; (ii) the ability to group the dataset to detect similarities or dissimilarities among data points; (iii) the capability to quantify the importance of each variable. In this section, the main findings are compared to the previous scientific literature related to urban runoff.
In the recent literature, some authors confirmed strong relationships between TSS and different common pollutants that can be found in urban runoff (i.e., nutrients, metals, pesticides) [57][58][59][60]. This is in accordance with the correlations highlighted in Figure 3 between TSS and TP (obtained by using Pearson's and Spearman's coefficients). This is also confirmed by Borda et al. [61]. They evaluated agronomic management's effect on the potential risk of P losses from soil to water bodies, where P losses were estimated using a simple dispersion test and the amount of suspended solids. Viviano et al. [62] found different relationships between turbidity and TP concentration in the investigated urban watershed, distinguishing between the TP from point (domestic wastewaters) and diffuse (surface runoff) sources.
Considering the results reported in Figure 4a,b, a reliable correlation between TSS and TP is evident. This confirms that phosphorus off-site movement occurs mainly due to sediment transport that is able to trigger the nutrient mobilization from impervious surfaces. In contrast, a weaker relationship between TSS and TN suggests that the nitrogen mobilization occurs primarily due to water-surface runoff (EMC_TN has a stronger correlation with Total_Rainfall than EMC_TP). In this context, Ciaponi et al. [60] corroborated a stronger correlation between TP and TSS rather than TN and TSS. Moreover, different recent studies confirmed the N transport carried out by surface runoff [63][64][65][66].
Looking at Figure 3a,b, it is possible to recognize a strong dependence between nutrient load (EML_TN and EML_TP) and runoff volume. This is confirmed by the detailed study conducted by De Girolamo et al. on the Celone river, located in the same region of our study area (Puglia Region-Southern Italy) [67]. This work quantified the nutrient loads delivered to the downstream reservoir (Capaccio dam) on a seasonal and annual time scale. In particular, De Girolamo et al. [67] demonstrated the importance of flood event contribution to the annual nutrient load, stating that "nitrogen and phosphorus loads tend to be substantially higher during years of high precipitation, because of increased erosion and transport of the nutrients to stream channels." They also found that in the winter season, the high level of nutrient load is primarily due to surface runoff.
Another aspect that can be better depicted in the SOM weight map (Figure 4b) is the high correlation between ADP and pollutant loads, particularly EML_TN: the longer the no-rainy period, the more significant the amount of pollutant load accumulated on the impervious surfaces. This is confirmed by Gorgoglione et al. [7,20]. Moreover, in this context, Li et al. [68] found that the antecedent dry weather period and runoff volume were the determining factors in the generation of urban pollution runoff. Bian et al. [69] presented a significant positive correlation between water quality parameters and the ADP.
Considering the relationship between ADP and EMC_TSS, well represented in Figures 3 and 4, Lee et al. [70], analyzing four events in South Korea, found that the ADP and rainfall intensity were the main factors affecting TSS and COD concentrations and the loading mass of highway runoff in urban areas.
Further information about non-linear processes is given by Spearman's ρ (Figure 3b), where the dilution process of the dissolved nutrient portion is represented: the higher Tot_Rainfall, and therefore Runoff_Vol, the lower the concentration of TP and TN, meaning that the more it rains, the higher the runoff volume from impervious surface and the smaller the nutrient concentration due to dilution process. This is confirmed by Gorgoglione et al. [7,20].

Conclusions
The primary purpose of this work was the comparison of linear and non-linear ML techniques, PCA and SOM, respectively, to characterize urban nutrient runoff. In particular, this comparison was based on three main aspects: (i) the ability to represent the correlation among the variables chosen to represent the system and, therefore, depict the build-up and wash-off processes (cause-effect process) (feature correlation); (ii) the ability to group the dataset based on the two variables that symbolize build-up and wash-off processes (ADP and Tot_Rainfall) (data point grouping); (iii) the ability to identify and quantify the importance of each variable (feature importance). To strengthen this comparison, these techniques were supported by other linear (Pearson's r) and non-linear (Spearman's ρ) methods. The main results can be summarized as follows: • Pearson's r was able to represent the main urban nutrient runoff processes detected in the study area: rainfall-runoff and phosphorus transport driven by sediments. Spearman's ρ, by strengthening the rainfall-runoff process, was also able to depict the transport of dissolved nutrients in urban runoff.

•
Regarding feature correlation, both PCA and SOM methodologies captured the primary process that symbolizes nutrient build-up and wash-off. Notably, both were able to represent the critical role played by TSS in the nutrient mobilization from impervious surfaces. This was proved particularly for phosphorus, which dominantly was particle-bound, while nitrogen transport mainly occurred through water (dissolved). The latter was better depicted by the SOM analysis.

•
Regarding datapoint grouping, both techniques were able to group data points well. The PCA groups data points following the same direction of the vector chosen for labeling them. The SOM better delineates the groups by assigning different shades to the neurons: the lighter, the more similar to its neighbors (distance map). Furthermore, by overlapping distance and frequency map, we can identify similarities (or dissimilarities) among data points that belong to the same group. • Concerning feature importance, the main difference between the two techniques is that the PCA can compute the meaningful variables for the system, while the SOM can only provide the feature importance for each neuron. PCA loadings are able to detect the dilution process that was never well detected by previous linear techniques. The SOM outcomes can detect the main processes under study by confirming the previous results. Furthermore, SOM maps can be coupled to extract further information.
To conclude, according to the outcomes of this work, we suggest that the SOM technique can provide a useful complementary tool to other methods, such as PCA, and can be successfully adopted for water quality research. Although both techniques can be run with the same hardware resources, it must be considered that the benefits of SOM, regarding data insights, come at a high computational cost, particularly when compared to PCA. In fact, in our study, there was a four orders of magnitude difference in terms of computational time.
The results presented in this work are expected to assist researchers and water managers in improving their water quality assessment ability. Furthermore, they support decision-making in the design of management strategies to reduce pollution impacts on receiving waters and, consequently, protect the surrounding ecological environment. An interesting aspect that will support the findings of this study is a more extensive monitoring campaign at the study area to enlarge the observation dataset. However, it is important to highlight that by exploiting accurate models that were properly calibrated and validated (IRP and SWMM), we were able to successfully characterize the nutrient urban runoff with both PCA and SOM. Based on these considerations, further steps to be considered in future works include an integrating field campaign, planned for considering more rainfall events and various pollutants.