3.1. Phase 1 Results: Identification of Urban Planning Variables
Phase 1 concerned the identification of urban planning variables. The confirmatory model validation provides empirical support for the theoretical structure proposed in
Section 2. The CFA confirms the adequacy of the theoretical constructs, represented by the latent variables RS, CP, and RACC (see
Table 1).
As shown in
Table 1, all factor loadings were significant and exceeded the minimum thresholds, indicating an adequate link between the observable indicators and their related constructs. The standardized factor loadings presented in
Table 1 show that RACC indicators ranged from 0.74 to 0.85, CP indicators from 0.70 to 0.75, and RS indicators from 0.67 to 0.83. The overall model fit indices (CFI, TLI, and RMSEA) were within the recommended ranges described in the methodology, supporting the internal consistency of the theoretical model [
22]. These results indicate that the chosen constructs reflect the views on the importance of amenities and their possible link to urban resilience to climate change. Additionally,
Table 1 shows that the internal consistency metrics (Cronbach’s α, rhoA, and composite reliability) were all at or above 0.70 and that the average variance extracted (AVE) exceeded 0.50, supporting the convergent validity and robustness of the measurement model.
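As an illustration of how the convergent validity metrics relate to the standardized loadings, the following minimal sketch computes AVE and composite reliability with the usual textbook formulas. It assumes standardized loadings and uncorrelated measurement errors; the function names and the example loadings are hypothetical and are not taken from Table 1.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of one construct."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """Composite reliability, assuming standardized loadings and
    uncorrelated measurement errors (error variance = 1 - loading^2)."""
    sum_loadings = sum(loadings)
    error_variance = sum(1 - l ** 2 for l in loadings)
    return sum_loadings ** 2 / (sum_loadings ** 2 + error_variance)


# Hypothetical loadings, chosen only to fall inside the RACC range reported above.
example_loadings = [0.74, 0.80, 0.85]
ave = average_variance_extracted(example_loadings)
cr = composite_reliability(example_loadings)
print(f"AVE = {ave:.3f} (threshold 0.50), CR = {cr:.3f} (threshold 0.70)")
```

With loadings in this range, both metrics exceed the thresholds used in the study, which is in line with the pattern reported for the three constructs.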
To assess the discriminant validity of the constructs, this study used two well-established methods: the Fornell–Larcker criterion and the Heterotrait–Monotrait (HTMT) ratio [
16]. The Fornell–Larcker criterion, a common approach, is based on the idea that the square root of the average variance extracted for each construct should be greater than its correlation with other constructs, ensuring that each construct is conceptually distinct and measured separately [
23]. However, recent research has highlighted certain limitations of this method in detecting violations of discriminant validity in Structural Equation Models [
24]. In contrast, the HTMT provides a more reliable assessment by comparing the average correlation between constructs (heterotrait) with the average correlation within a construct (monotrait). HTMT values below 0.85 are generally regarded as evidence of discriminant validity, and this approach has been shown to have better sensitivity and specificity than the Fornell–Larcker criterion [
19].
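The two discriminant validity checks described above can be sketched as follows. This is a simplified illustration, not the software routine used in the study: it assumes an item correlation matrix given as a nested list, at least two indicators per construct, and the original HTMT definition based on plain (not absolute) correlations. All names and numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def htmt(corr, items_a, items_b):
    """HTMT ratio for two constructs.

    corr    : item correlation matrix (nested list)
    items_a : indices of the indicators of construct A (at least two)
    items_b : indices of the indicators of construct B (at least two)
    """
    # Heterotrait: correlations between indicators of different constructs.
    hetero = [corr[i][j] for i in items_a for j in items_b]
    # Monotrait: correlations between indicators of the same construct.
    mono_a = [corr[i][j] for i, j in combinations(items_a, 2)]
    mono_b = [corr[i][j] for i, j in combinations(items_b, 2)]
    return mean(hetero) / sqrt(mean(mono_a) * mean(mono_b))


def fornell_larcker_ok(ave_a, ave_b, construct_corr):
    """True if sqrt(AVE) of both constructs exceeds their correlation."""
    r = abs(construct_corr)
    return sqrt(ave_a) > r and sqrt(ave_b) > r


# Hypothetical 4-item example: items 0-1 measure A, items 2-3 measure B.
corr = [
    [1.0, 0.6, 0.3, 0.3],
    [0.6, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.5],
    [0.3, 0.3, 0.5, 1.0],
]
ratio = htmt(corr, [0, 1], [2, 3])
print(f"HTMT = {ratio:.3f}, below 0.85: {ratio < 0.85}")
```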
The results of this study, summarized in
Table 2, confirm a clear conceptual separation between the measured constructs. All HTMT ratios were below the 0.85 threshold, which supports the discriminant validity of the constructs [
25]. This is consistent with earlier work on the utility of the HTMT for establishing discriminant validity when the Fornell–Larcker criterion is not strongly satisfied [
26].
Combining and comparing both criteria strengthens the evidence for construct validity (as evaluated in this article) and follows contemporary best practice in structural modeling. Such consistency between methods is essential to ensure reliability and validity. Regarding overall fit, the model achieved acceptable indices, with SRMR = 0.073, CFI = 0.921, TLI = 0.907, and RMSEA = 0.072, supporting the validity of the factorial structure. Additionally, the R² values indicated moderate-to-high explanatory power, with the RACC construct showing the highest R² (0.783) [
22].
Finally,
Table 3 presents the tests of the structural hypotheses using the SEM-PLS method, allowing the relationships among the constructs in the proposed model to be evaluated. The results show that RS positively impacts CP (β = 0.205,
p < 0.05), while RACC positively influences both RS (β = 0.710,
p < 0.001) and CP (β = 0.526,
p < 0.001).
Figure 3 depicts these validated causal relationships, confirming the importance of the proposed model [
27].
In summary, the results from
Table 3 meet the requirements detailed in
Section 2 for validating the sample size of the research. The model therefore reaches adequate statistical power to identify moderate-to-strong effects, fulfilling the empirical robustness, reliability, and validity criteria expected in this type of structural modeling, as discussed in
Section 2. These results establish a strong basis for moving into the second phase of the study.
3.2. Phase 2 Results: Definition and Criteria Weighting
In this phase, the validated constructs were translated into decision criteria, weighted using the AHP methodology with the experts’ input. The three constructs outlined in Phase 1 (RS, RACC, and CP) were translated into operational criteria for objective assessment within the SUDS framework (see
Table 4). This conversion of conceptual dimensions into measurable criteria ensured methodological traceability and connected the theoretical level with practical evaluation. Further information about the original constructs, criteria, and sub-criteria can be found in the
Supplementary Materials section of this manuscript.
The eight criteria listed in
Table 4 were evaluated and weighted by 35 international experts using the Analytic Hierarchy Process (AHP). Participants received detailed guidance for completing the pairwise comparisons of criteria on a scale from 1 to 9, where 1 indicated equal importance and 9 signified extreme importance. Instructions included explanations of the intermediate importance levels (3: slight, 5: moderate, 7: strong) and visual examples to improve clarity and ensure consistent responses [
28]. The initial overall matrix, containing 35 assessments, resulted in a consistency ratio (CR) of 0.136 (
Table 5).
Results from
Table 5 indicated initial inconsistency, which required a methodological adjustment. To resolve this, a filter was applied based on Jato-Espino et al. [
29], retaining only the 15 individual matrices that met the CR consistency criterion of less than 0.10. The final consensus matrix, derived from the geometric mean of the 35 consistent assessments, demonstrates greater statistical robustness than that obtained by weighting the criteria of all experts (N = 35). The maximum eigenvalue λmax is 8.04, close to the matrix order (n = 8), indicating almost perfect consistency (see
Table 5 for further details). These results lead to a final CR of 0.0044, well below the 0.10 threshold. This outcome formally validates the experts’ assessments and establishes the basis for subsequently computing the weight vector (w). As a consequence, all expert opinions were considered for the final weighting and ranking of the eight proposed criteria. The CI values remain well below the accepted threshold of 0.10, confirming the reliability of the experts’ judgments. The random index (RI = 1.41), corresponding to the size of the matrix, was used to standardize the consistency assessment. Based on these results, the normalized priority vector was derived, resulting in the weightings and final ranking of the criteria presented in
Table 6.
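The consistency check and the derivation of the priority vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation performed by the authors: the principal eigenvector is approximated by power iteration, the random index values are the commonly cited Saaty values (of which only RI = 1.41 for n = 8 appears in this study), and the 3 × 3 example matrix is hypothetical.

```python
# Commonly cited Saaty random index values; only n = 8 (1.41) is used in the study.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def priority_vector(matrix, iterations=200):
    """Approximate the principal eigenvector (normalized to sum to 1)
    and lambda_max of a positive reciprocal matrix by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # lambda_max is estimated as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    return w, lambda_max


def consistency_ratio(lambda_max, n):
    """CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical, perfectly consistent 3 x 3 matrix built from weights 0.6, 0.3, 0.1.
true_w = [0.6, 0.3, 0.1]
example = [[a / b for b in true_w] for a in true_w]
weights, lam = priority_vector(example)
print([round(x, 3) for x in weights], round(consistency_ratio(lam, 3), 6))

# Using the rounded lambda_max = 8.04 and n = 8 reported above.
print(round(consistency_ratio(8.04, 8), 4))
```

With the rounded value λmax = 8.04, the formula gives a CR of roughly 0.004, the same order of magnitude as the reported 0.0044; the small difference is presumably due to rounding of λmax.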
The results in
Table 6 show that the criteria related to RESCLIM and RESIL together account for more than 40% of the total criteria weight. Although conceptually linked, each criterion addresses a different design aspect: immediate system performance, long-term adaptability, and the provision of complementary urban functions. This weighting pattern may suggest a preference among experts for SUDS typologies that combine technical efficiency with adaptability and complementarity to address future impacts related to climate change and urban development, as well as for those that help reduce climate comfort needs in public spaces.
3.3. Phase 3 Results: Evaluation and Prioritization of SUDS Typologies
The final phase consisted of testing the instrument. Phase 3 therefore presents the results for the different SUDS typologies, which were systematically classified according to the criteria and weightings defined in the previous phase and categorized according to their significance. A rating scale was developed for each criterion using the Saaty scale: “Very High” with intensity 9, “High” with intensity 7, “Medium” with intensity 5, “Low” with intensity 3, and “Does Not Contribute” with intensity 0.
This rating scale, which each expert applied, was used to assess the contribution level of each SUDS. The importance ranking of each SUDS was then calculated by multiplying every criterion’s intensity value by the weight assigned to it. In this step, the SUDS typologies were categorized according to the criteria and weightings and then scored by importance. Every expert’s perspective was critical to obtaining the final solution, which ensured that each SUDS was ranked according to its relevance for the urban environment under study. A radar chart of the various SUDS types is depicted in
Figure 4, according to the assessment criteria used in the study. The chart provides a visual overview and serves as a pragmatic decision-making aid for both urban planning and environmental control. The concentric circles represent the typologies (infiltration ponds, green roofs, rain gardens, vegetated swales, and permeable pavements), while the colored segments represent the eight evaluation categories: LEGIB, ASPVIS, EDAM, MANTCOM, SEGUR, MULTIF, RESIL, and RESCLIM. The numbers within each cell represent the normalized weight for each criterion–typology pair; a higher value means that the typology performs better on, or is a better fit for, that specific criterion. This visualization illustrates the multidimensional performance of SUDSs and enables the comparison of strong and weak performance among existing types. For instance, infiltration ponds and rain gardens receive strong scores on most of the criteria (perhaps due to their multifunctionality and sustainability roles). Permeable pavements, as well as filter drains, appear to score at the lower end for both ecological and resilience purposes. Overall, the chart is helpful for understanding how each SUDS option relates to each element of sustainability and for identifying where urban implementation strategies could be improved.
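The scoring step described above can be sketched as a weighted sum of rating intensities. The sketch below makes two assumptions that are not stated explicitly in the text: the weighted intensities are summed over the criteria, and the total is divided by the maximum intensity (9) so that scores fall between 0 and 1. The weights and ratings are hypothetical and use only three of the eight criteria for brevity.

```python
# Rating scale as described in the text.
INTENSITY = {
    "Very High": 9,
    "High": 7,
    "Medium": 5,
    "Low": 3,
    "Does Not Contribute": 0,
}


def suds_score(ratings, weights, normalize=True):
    """Weighted sum of rating intensities for one SUDS typology.

    ratings : {criterion: rating label}
    weights : {criterion: AHP weight}, assumed to sum to 1
    normalize : divide by the maximum intensity (9); this is an assumption
    """
    total = sum(INTENSITY[ratings[c]] * weights[c] for c in weights)
    return total / 9 if normalize else total


# Hypothetical weights and ratings, for illustration only.
weights = {"RESCLIM": 0.5, "RESIL": 0.3, "SEGUR": 0.2}
ratings = {"RESCLIM": "Very High", "RESIL": "High", "SEGUR": "Medium"}
score = suds_score(ratings, weights)
print(f"score = {score:.3f}")
```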
Moreover, the results of the weighting analysis for each SUDS typology are depicted in
Figure 5. The typologies with the highest relative weights (infiltration ponds 0.82, green roofs 0.81, and rain gardens 0.80) were classified as very-high- or high-performing. Such systems have clear potential to enhance infiltration, retention, and environmental quality while contributing to the social and esthetic quality of urban public spaces. By contrast, storage tanks (0.42), filter drains (0.44), and attenuation storage tanks (0.47) were ranked lowest, perhaps reflecting lower multifunctionality and relatively small contributions to social interaction. The distribution of weights therefore reveals a strong inclination among practitioners towards NBS options that combine engineering efficiency with ecological and social benefits. This is consistent with a gap identified in the literature and with the emerging perception that stormwater management systems transcend technical performance, potentially embracing multifunctional benefits, including the social amenity design pillar.
The analysis was supplemented by a non-hierarchical clustering method (k-means), applied to the criteria weights with the aim of identifying common patterns in the experts’ judgements. The average silhouette value across all points provides an overall measure of cluster quality and helps define the optimal number of clusters.
Figure 6 shows the hierarchical clustering of the 35 experts based on their preferences for the evaluation criteria, using Ward’s linkage method with Euclidean distance. The red dashed vertical line indicates the cut-off point (Euclidean distance = 0.518) that defines five distinct clusters (k = 5). Branch colors indicate cluster membership: Cluster 1 (brown,
n = 18), Cluster 2 (purple,
n = 4), Cluster 3 (red,
n = 6), Cluster 4 (green,
n = 4), and Cluster 5 (orange,
n = 3). Gray branches beyond the cut-off line show hierarchical relationships among clusters. Each cluster represents a group of experts with similar priority patterns across evaluation criteria.
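The average silhouette value mentioned above can be computed as in the following minimal sketch. It assumes Euclidean distance, at least two clusters, and the common convention that a point in a singleton cluster scores 0. The points and labels are hypothetical and do not reproduce the experts' weight vectors.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette value over all points (Euclidean distance).

    Requires at least two distinct cluster labels.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(f"good partition: {good:.3f}, poor partition: {bad:.3f}")
```

Comparing this value across candidate numbers of clusters is one way to support the choice of k, with higher averages indicating better-separated groups.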
Combining hierarchical and non-hierarchical strategies improved the interpretation of the dendrogram. It enabled a clear segmentation of the data into well-defined groups, capturing both the hierarchical relationships between cases and the direct similarities in the experts’ judgments. This combination strengthens confidence in the resulting segmentation.
For each cluster, the priority vectors derived from the pairwise comparison matrices were averaged using the geometric mean (
Figure 6). This technique preserved the multiplicative logic of the AHP and the hierarchy of comparisons (see
Table 7). The consistent and adjusted matrices showed similar hierarchical relations, allowing the algorithm to identify typical patterns and group them into the same clusters while preserving the overall integrity of the results. The experts’ assessments were stable throughout the process, so adjusting the matrices for improved consistency did not affect the validity of the results.
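The cluster-level aggregation can be illustrated with an element-wise geometric mean of priority vectors. This is a sketch under assumptions: all weights are strictly positive, and the result is renormalized to sum to 1, a step the text does not describe explicitly. The function name and the two example vectors are hypothetical.

```python
from math import prod


def geometric_mean_vector(vectors):
    """Element-wise geometric mean of priority vectors, renormalized to sum to 1.

    All entries are assumed to be strictly positive.
    """
    n_experts = len(vectors)
    n_criteria = len(vectors[0])
    raw = [
        prod(v[k] for v in vectors) ** (1.0 / n_experts)
        for k in range(n_criteria)
    ]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical priority vectors of two experts in the same cluster.
cluster_vectors = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
]
cluster_weights = geometric_mean_vector(cluster_vectors)
print([round(w, 3) for w in cluster_weights])
```

Unlike the arithmetic mean, the geometric mean keeps the ratio between two aggregated weights equal to the geometric mean of the individual ratios, which is the multiplicative property referred to above.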
In addition,
Figure 7 shows a bar chart of the distribution of the importance of the eight evaluation criteria across four clusters derived from an AHP-based clustering analysis. Each color identifies an evaluation criterion, and each segment within a cluster’s bar represents that criterion’s relative weight. At the top, a red dashed line indicates that the weights are normalized to 1.0 for each cluster.
The sensitivity analysis indicates that the final ranking of the alternatives remained remarkably stable, especially at the highest priority levels. Despite the marked heterogeneity of opinions among the five groups, with Group 1 strongly prioritizing EDAMB (0.38) and Group 4 focusing on SEGUR (0.39), the main hierarchical structure did not undergo critical alterations. Groups 2, 3, 4, and 5 assign priority to the main criteria RESCLIM and RESIL, leading to the results in
Table 8. The key findings of this analysis are as follows:
Infiltration ponds maintained top performance across all simulated scenarios (Consensus and Clusters 1–5), with scores ranging from 0.80 to 0.84. This invariance demonstrates the high robustness of the AHP model in the face of diverse perspectives.
Moderate sensitivity was observed between positions 2 and 3. Under the Cluster 1 (social–educational) and Cluster 4 (safety) approaches, rain gardens ranked second, displacing green roofs. This suggests that while both are leading solutions, their relative preference may vary depending on prevailing political or community objectives.
The model’s robustness is supported by consistency metrics, with the Global Consensus showing a CR of 0.004380. Even in the most “stressful” scenario (Cluster 1, with a CR of 0.104119), the core of the decision remained stable, thereby validating the legitimacy of the prioritization process.
The lowest-performing typologies, such as storage tanks (SUDS 13) and filter drains (SUDS 12), consistently ranked at the bottom across most scenarios, reinforcing the reliability of excluding these alternatives in the urban context analyzed.
This convergence of results across radically different weighting scenarios (
Figure 7) allows us to conclude that the hierarchy obtained is not merely the result of a statistical average but rather a resilient technical solution that satisfies multiple urban sustainability criteria.
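The logic of this sensitivity analysis can be sketched by re-ranking the alternatives under each weighting scenario and checking whether the top position changes. The example below is purely illustrative: the performance values, the two criteria, and the scenario weights are hypothetical and were chosen only to mimic the pattern described above (a stable first place and a swap between second and third).

```python
def rank_alternatives(performance, weights):
    """Return alternatives sorted by weighted score, best first."""
    totals = {
        alt: sum(weights[c] * values[c] for c in weights)
        for alt, values in performance.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


def top_is_stable(performance, scenarios):
    """True if the same alternative ranks first under every scenario."""
    winners = {rank_alternatives(performance, w)[0] for w in scenarios.values()}
    return len(winners) == 1


# Hypothetical normalized performance values (0 to 1) on two criteria.
performance = {
    "infiltration pond": {"RESIL": 0.90, "SEGUR": 0.80},
    "green roof": {"RESIL": 0.85, "SEGUR": 0.60},
    "rain garden": {"RESIL": 0.70, "SEGUR": 0.75},
}
# Hypothetical weighting scenarios.
scenarios = {
    "consensus": {"RESIL": 0.6, "SEGUR": 0.4},
    "safety-focused": {"RESIL": 0.3, "SEGUR": 0.7},
}

for name, w in scenarios.items():
    print(name, rank_alternatives(performance, w))
print("top alternative stable:", top_is_stable(performance, scenarios))
```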