Decoding Urban Riverscape Perception: An Interpretable Machine Learning Approach Integrating Computer Vision and High-Fidelity 3D Models

Tang, Yuzhen; Chen, Shensheng; Xu, Wenhui; Ren, Jinxuan; Luo, Junjie

doi:10.3390/ijgi15020091

by Yuzhen Tang ^1,2, Shensheng Chen ^2,3, Wenhui Xu ^1, Jinxuan Ren ^4,5 and Junjie Luo ^1,*

^1 College of Landscape and Architecture, Zhejiang A&F University, No. 666 Wusu Street, Lin’an District, Hangzhou 311300, China
^2 Logistics Service Center, Zhejiang A&F University, No. 666 Wusu Street, Lin’an District, Hangzhou 311300, China
^3 Zhejiang Provincial Institute of Rural Revitalization, No. 666 Wusu Street, Lin’an District, Hangzhou 311300, China
^4 School of Architecture, Tianjin University, No. 92 Weijin Street, Nankai District, Tianjin 300100, China
^5 Taishun County Bureau of Agricultural and Rural Affairs, No. 202 Pingxi Street, Wenzhou 325500, China
^* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(2), 91; https://doi.org/10.3390/ijgi15020091

Submission received: 30 November 2025 / Revised: 13 February 2026 / Accepted: 18 February 2026 / Published: 20 February 2026

Abstract

Visual perception serves as a crucial interface connecting human psychology with the built environment. However, current studies on urban riverscapes often rely on static 2D imagery, failing to capture the spatial depth and immersive experience essential for ecological validity. Furthermore, the “black box” nature of traditional machine learning models hinders the understanding of how specific environmental features drive public perception. To address these gaps, this study proposes an innovative framework integrating high-fidelity 3D models, computer vision (CV), and interpretable artificial intelligence (XAI). Using the River Thames (London) and the River Seine (Paris) as diverse case studies, we constructed high-precision 3D digital twins to quantify 3D spatial metrics (e.g., Viewshed Area, H/W Ratio) and applied the SegFormer model to extract 2D visual elements (e.g., Green View Index) from water-based panoramic imagery. Subjective perception data were collected via immersive Virtual Reality (VR) experiments. Random Forest models combined with SHAP were employed to decode the non-linear driving mechanisms of perception. The results reveal three universal principles: (1) Sense of Affluence and Vibrancy are primarily driven by high building density and vertical enclosure, challenging the traditional preference for openness in waterfronts; (2) Scenic Beauty is determined by a synergy of high Green View Index and quality artificial interfaces, suggesting a preference for nature-culture integration; (3) Sense of Boredom is significantly positively correlated with Viewshed Area, indicating that empty prospects without visual foci lead to monotony. This study demonstrates the efficacy of integrating Digital Twins and XAI in revealing robust perception mechanisms across different urban contexts, providing a scientific, evidence-based tool for precision urban planning and riverside regeneration.

Keywords:

visual perception; digital twins; immersive virtual reality; semantic segmentation; interpretable artificial intelligence; riverscapes

1. Introduction

Quantifying and modeling human perception of the built environment serves as a critical bridge between urban morphology and environmental psychology, providing a scientific basis for evidence-based urban design [1,2,3]. In this domain, vision acts as the primary channel for human interaction with the external world, constituting the cornerstone of perception research [4]. However, a persistent scientific challenge lies in ensuring the ecological validity of research paradigms—specifically, the extent to which the experimental environment represents the real world and elicits human behaviors and psychological responses consistent with reality [5,6]. While the mainstream “Computational Paradigm” utilizing 2D static image representations (such as street view images) and computer vision has revolutionized large-scale analysis [7,8,9], its inherent dimensionality reduction (from 3D to 2D) systematically strips the environment of critical spatial attributes, such as depth cues and the sense of enclosure [3].

In the specific context of urban riverscapes, this methodological evolution is particularly pertinent. As unique linear blue–green corridors, urban rivers possess irreplaceable value for urban ecological resilience and public well-being [8,10,11]. Distinct from land-based spaces such as streets, rivers offer a unique water-based perspective and a continuous moving observation mode via boat travel, forming a dynamically changing landscape sequence [12]. Research on this unique spatial typology currently faces two specific methodological gaps that generic 2D approaches cannot resolve. First, the “Spatial Gap”: Existing studies are predominantly confined to land-based views or static 2D images [13,14,15], failing to capture the “river canyon effect”—a key psychological sensation defined by 3D morphological factors such as the height-to-width ratio and volumetric viewshed [16,17]. Neglecting these 3D attributes results in an incomplete understanding of how the spatial enclosure of a river corridor drives perception. Second, the “Analytical Gap”: A “black box” dilemma persists in predictive modeling. While machine learning algorithms can fit complex non-linear relationships [18,19], their opacity restricts the transition from “prediction” to “understanding,” failing to reveal how specific environmental features influence perception to inform actionable design strategies.

To bridge these specific gaps, this study proposes a framework that integrates high-fidelity 3D modeling, Immersive Virtual Reality (IVR), and Explainable Artificial Intelligence (XAI) as a targeted solution [20]. By substituting static imagery with 3D reality models within IVR, we restore the spatial dimensionality and experiential authenticity (“sense of presence”) essential for valid riverscape evaluation, directly addressing the spatial gap [21,22,23]. Simultaneously, by integrating XAI, we overcome the black box dilemma, enabling a transparent decoding of how specific environmental features drive public perception [24,25]. Using urban morphologies of the River Thames (London) and the River Seine (Paris) as empirical cases, this study aims to validate this integrated paradigm, capturing robust perceptual patterns that transcend local contexts and providing actionable, evidence-based guidelines for waterfront regeneration.

2. Background and Related Work

2.1. Visual Perception Theory and Indicators in Built Environments

How the physical form of the built environment shapes human psychological perception and behavior is a core topic at the intersection of urban science and environmental psychology. Early studies established the theoretical foundations of this field. For instance, Kaplan and Kaplan’s Attention Restoration Theory (ART) systematically elucidates the critical role of natural environments, particularly vegetation and water bodies, in restoring directed attention and alleviating mental fatigue [26]. Appleton’s Prospect-Refuge Theory, grounded in evolutionary psychology, reveals the profound influence of the spatial combination of an open view (prospect) and a secure hiding place (refuge) on human aesthetic preferences [27]. To bridge these abstract theoretical frameworks with empirical urban assessment, scholars have operationalized perceptual dimensions into quantifiable physical indicators. For instance, drawing on ART, the Green View Index (GVI) was developed to quantify the visible proportion of vegetation, serving as a proxy for restorative potential [28]. Similarly, the spatial configurations described in Prospect-Refuge Theory have been translated into metrics such as the Sky View Factor (SVF) and Building View Index (BVI), which measure spatial openness and the degree of enclosure, respectively [29]. These indicators have been widely confirmed to correlate significantly with public mental health, sense of safety, and aesthetic preference.

While urban waterfront quality assessment is a multidimensional construct involving water quality, acoustic landscapes, and physical accessibility [30], visual perception remains the primary interface for human–environment interaction [4]. When the research focus shifts from general urban spaces to specific urban riverscapes, the uniqueness of the perceptual experience becomes evident. As a “linear blue–green interface,” the visual quality of riverscapes is determined by the synergistic coupling of natural elements (e.g., vegetation) and artificial structures (e.g., bridges) [31]. Crucially, unlike the static viewing in parks or the pedestrian experience on streets, the water-based perspective offers a continuous, dynamic observation mode. This moving perspective generates a unique landscape sequence characterized by fluid spatiotemporal changes [8,32].

However, current indicator systems often fail to capture the specific spatial attributes of this water-based experience. Most existing studies adopt a land-based perspective, treating the river merely as a static object to be viewed from the shore [33,34]. This approach neglects the “river canyon effect”—a critical spatial sensation determined by the height-to-width ratio of the channel and the verticality of the urban interface [8]. Furthermore, traditional 2D indicators (e.g., pixel ratios in static images) struggle to quantify the “sense of spatial depth” and “3D viewshed,” which are essential for defining the prospect and enclosure in a linear river corridor [35,36,37]. Consequently, there is a theoretical need to expand the indicator system from 2D visual elements to 3D spatial configurations to fully decode the perception of urban riverscapes.

2.2. Evolution of Environmental Representation: From Static 2D Imagery to Immersive 3D Environments

Methodologies for representing the built environment in perception research have undergone profound evolution, driven by the quest to balance ecological validity with research scalability. Traditional methods, such as on-site questionnaires and interviews, have long been considered the “gold standard” for validity as they capture the multisensory authenticity of real-world experiences. However, they are inherently limited by high costs, low controllability, and restricted spatial scale, making them difficult to support large-scale, cross-cultural comparative studies [8,34].

To overcome these scalability limitations, the field witnessed a “digital turn” characterized by the widespread adoption of 2D static imagery, particularly large-scale street view datasets (e.g., Google Street View) [38,39]. While this paradigm successfully expanded the scale and efficiency of urban research, it introduced a critical trade-off: the sacrifice of ecological validity [40,41,42]. 2D images are inherently dimensionality-reduced representations of the 3D world. They systematically strip the environment of depth cues and spatial volume, failing to simulate the “sense of presence” essential for ecological perception [7]. Furthermore, static imagery restricts the observer to a passive viewing mode, severing the link between body movement (e.g., head rotation, locomotion) and visual exploration, which is fundamental to how humans perceive space in reality [5].

To bridge this gap, built environment perception research is undergoing an “immersive turn.” The combination of high-fidelity 3D models, particularly those constructed via UAV oblique photogrammetry, with IVR technology is regarded as the frontier solution for addressing ecological validity [43,44,45]. This integrated approach offers multiple advantages: first, spatial fidelity, as 3D models provide precise geometric information, serving as a solid foundation for quantifying 3D spatial indicators; second, experiential immersion, as IVR isolates the participant’s visual system from the real world via a Head-Mounted Display (HMD), creating a psychological experience of “being there” that elicits physiological and psychological responses closer to real situations [46]; and finally, interaction freedom, allowing participants to freely rotate their perspective within the virtual environment. This user-driven interactive behavior is an integral part of the perceptual process, unmatched by the passive viewing mode of static images [47,48,49]. Recent studies have begun applying IVR technology to explore urban design evaluation, historical district revitalization, and environmental health effects, confirming its validity and immense potential as a scientific research tool [50,51].

2.3. Analytical Approaches: From Linear Prediction to Interpretable Machine Learning

Establishing robust mathematical models linking objective environmental features to subjective perception is the critical analytical step following data representation. Historically, this domain has been dominated by traditional statistical models, such as multiple linear regression [8]. While these models offer intuitive interpretability (e.g., coefficients directly indicate the direction of influence), their strict “linearity assumption” struggles to capture the complex, non-linear realities of human–environment interaction. For instance, the positive effect of the Green View Index on scenic beauty often follows a “diminishing returns” curve rather than a straight line, and interaction effects—such as an open view only eliciting positive feelings under specific weather conditions—are difficult to model via simple regression [52].

To address these limitations and handle the high-dimensional data generated by modern urban sensing, the field has shifted towards ensemble learning models, represented by Random Forest [53]. Known for their ability to capture complex non-linear relationships and interaction effects without pre-assumptions, these models have significantly improved predictive accuracy. However, this superior performance comes at the cost of a “black box” dilemma. While these models can accurately predict perception scores, their internal decision-making processes are opaque. They provide coarse-grained “feature importance” rankings but fail to answer refined questions crucial for design: Does a specific feature promote or inhibit perception? What is the threshold for its marginal effect? This opacity restricts the transition from “high-precision prediction” to “design-oriented understanding”.

To resolve this conflict between predictive power and model transparency, Explainable Artificial Intelligence (XAI) has emerged as a bridge [54]. Among various XAI methods, SHAP (SHapley Additive exPlanations), grounded in cooperative game theory, stands out for its solid theoretical foundation [55]. The core advantage of SHAP lies in its ability to fairly decompose complex model predictions into the contribution values of each input feature, achieving a seamless unification of global and local explanations. Through SHAP, researchers can not only obtain robust global importance rankings but also visualize non-linear dependence plots to identify specific thresholds (e.g., the exact percentage of greenery where aesthetic benefits plateau) [56]. While SHAP has been applied in housing price and traffic prediction [54], its application in riverscape perception remains in its infancy. This study introduces this tool to decode the “black box,” enabling a shift from merely predicting scores to deeply understanding the driving mechanisms of perception.
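To make the additive decomposition described above concrete, the following minimal sketch computes exact Shapley values for a toy three-feature model by enumerating every feature coalition. It is illustrative only: `toy_model`, the feature names, the input values, and the single-point baseline are hypothetical and are not the study's model or data. In practice, SHAP libraries estimate these values far more efficiently for tree ensembles and average over a background dataset rather than a single baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance x, relative to a baseline.

    Features absent from a coalition are set to their baseline value
    (a simplifying assumption made for this sketch).
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        inputs = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(inputs)

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_model(f):
    # Hypothetical non-linear perception score with an interaction term.
    return 2.0 * f["gvi"] + 1.0 * f["bvi"] * f["hw_ratio"]


x = {"gvi": 0.4, "bvi": 0.5, "hw_ratio": 0.8}
baseline = {"gvi": 0.1, "bvi": 0.2, "hw_ratio": 0.3}
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that allows local explanations to be aggregated into global importance rankings.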

2.4. Summary and Research Gaps

Synthesizing the literature reviewed above, while significant progress has been made in quantifying built environment perception, three specific challenges remain to be addressed in the context of urban riverscapes:

Beyond 2D Indicators: Existing indicator systems, predominantly based on 2D visual elements, have yet to fully incorporate 3D morphological metrics—such as the river canyon ratio and 3D viewshed—which are essential for quantifying the unique spatial experience of water-based perspectives.
Enhancing Ecological Validity: Mainstream research relies heavily on static 2D imagery. While efficient, this approach faces limitations in representing depth and immersion. Adopting High-Fidelity 3D Reality Models and IVR offers a pathway to better simulate the spatial depth and “sense of presence” required for ecologically valid perception research.
From Prediction to Interpretation: Current analytical tools often struggle to balance non-linear fitting capability with interpretability. There is a need for an integrated analytical framework that not only predicts perception outcomes accurately but also transparently reveals the driving mechanisms to inform specific urban design interventions.

To systematically address these gaps, this study constructs an integrated research framework that synergizes 3D spatial quantification, immersive VR experiments, and interpretable machine learning. Specifically, this study aims to validate this integrated paradigm by answering two core research questions:

RQ1 (Quantification and Distribution): How are the multi-dimensional objective visual characteristics (incorporating both 2D visual elements and 3D spatial metrics) and subjective visual perceptions (e.g., Scenic Beauty, Vibrancy) spatially distributed across different urban riverscape contexts?

RQ2 (Driving Mechanisms): What are the underlying non-linear mechanisms linking these objective environmental characteristics to public subjective perception outcomes?

3. Methodology

3.1. Study Areas and Data Sources

3.1.1. Rationale for Site Selection and Study Areas

To construct a robust perception model capable of capturing diverse urban riverscape typologies, this study selected core sections of the River Thames in London and the River Seine in Paris as empirical cases. Critically, the selection rationale extends beyond their global iconic status to their distinct morphological paradigms, which ensures maximum variance in the independent variables (objective features) required for the regression models.

The study section of the River Thames extends approximately 5 km from the London Eye in the west to Cumberland Wharf Park in the east. This section represents a typical “Organic Evolution” urban waterfront, characterized by high spatial heterogeneity and significant statistical variance in vertical scale. Morphologically, it features a dramatic skyline where historic structures (e.g., Houses of Parliament) coexist with ultra-modern skyscrapers (e.g., The Shard), resulting in substantial fluctuations in the Height-to-Width (H/W) Ratio and creating a spatial rhythm defined by “nodal mutations.” In contrast, the study section of the River Seine spans approximately 5 km from the Musée de l’Orangerie to the Gare d’Austerlitz. This section exemplifies a “Planned Regularity” classical waterfront, embodying the Haussmannian philosophy of axial symmetry. Morphologically, it is characterized by low variance in building height and a stable, continuous H/W Ratio, creating a regulated and visually harmonious urban interface defined by “rhythmic continuity” (Figure 1).

By integrating data from these distinct morphological contexts—high variance (heterogeneity) versus low variance (homogeneity)—this study establishes a diversified dataset that reflects the broad spectrum of built environments faced by global waterfronts.

3.1.2. Data Acquisition and Processing

To comprehensively characterize these riverscapes, we constructed a multi-source dataset integrating High-Fidelity 3D Reality Models with High-Resolution 2D Panoramic Imagery. A summary of the acquired data and sources is presented in Table 1.

The high-precision 3D models served as the spatial database for quantifying 3D morphological indicators and establishing the immersive VR environment. To balance visual fidelity with data acquisition efficiency across the extensive river corridors, we adopted a multi-scale hybrid modeling approach. First, the immediate riverfront interface (including water bodies, revetments, and first-line buildings), which constitutes the primary focus of visual attention, was captured using a DJI Mavic 3 drone. Oblique photography was conducted at a flight altitude of approximately 100–120 m, yielding a Ground Sampling Distance (GSD) finer than 5 cm/pixel. This provided high texture quality for the close-up visual perception of building facades and vegetation [57,58,59]. Second, to complete the broader urban skyline and background depth cues beyond the drone’s flight corridor, we integrated 3D mesh data derived from Google Earth Studio. Third, the drone-based models were georeferenced using the aircraft’s onboard high-precision GNSS positioning system. While traditional Ground Control Points (GCPs) were not deployed due to strict urban flight regulations, we utilized the Google Earth 3D model as a geometric reference baseline to validate the scale and orientation of the drone models. The alignment process, conducted within the WGS84 coordinate system, revealed that the structural dimensions (e.g., building heights) in our model aligned closely with the baseline, confirming that the relative geometric accuracy is robust enough for visual perception analysis and morphological metric extraction.

Complementing the 3D models, water-based panoramic images were acquired to quantify detailed 2D visual elements (e.g., greenery, sky). We systematically collected 360-degree panoramic images via the Google Street View API. First, sampling points were generated at fixed 50-m intervals along the centerline of each river (Figure 2). This interval was selected to balance landscape continuity with data processing efficiency [18,19]. Second, at each point, images were downloaded at the highest available resolution (16,384 × 8192 pixels). After manual screening to remove samples with stitching errors or severe overexposure, a total of 100 high-quality sampling points were retained for each river (200 total). Third, to ensure the validity of the combined analysis, strict spatial alignment was performed. The GPS coordinates of the 2D panoramas were matched directly to the 3D model within the WGS84 system.
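The fixed-interval sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the centerline is a hypothetical polyline in projected metric coordinates (not WGS84 degrees), and `sample_along_polyline` is a helper name introduced here.

```python
from math import hypot


def sample_along_polyline(vertices, spacing):
    """Return points every `spacing` units along a polyline, starting at its first vertex."""
    points = [vertices[0]]
    next_dist = spacing   # distance along the line of the next sample
    travelled = 0.0       # distance along the line at the start of the current segment
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        seg_len = hypot(x1 - x0, y1 - y0)
        while seg_len > 0 and next_dist <= travelled + seg_len:
            t = (next_dist - travelled) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_dist += spacing
        travelled += seg_len
    return points


# Hypothetical 5 km centerline made of two straight reaches (metres).
centerline = [(0.0, 0.0), (3000.0, 0.0), (3000.0, 2000.0)]
samples = sample_along_polyline(centerline, 50.0)
print(len(samples))
```

For a 5 km line sampled every 50 m this yields roughly one hundred candidate points, which is the same order as the number of sampling points retained per river after screening.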

3.2. Analysis of Objective Visual Characteristics

The multi-source dataset acquired in Section 3.1.2 serves as the foundational data layer for calculating the objective indicators defined in this section. Specifically, the high-fidelity 3D reality models provided the geometric basis for computing 3D spatial configuration metrics (e.g., Viewshed Area, H/W Ratio) via parametric modeling algorithms. Simultaneously, the water-based panoramic imagery dataset was processed using the SegFormer deep learning model to quantify 2D visual composition elements (e.g., Green View Index, Sky View Index). To ensure precise spatial correspondence, all indicators were calculated at the same 100 sampling locations where the panoramic images were captured. A comprehensive summary of these indicators, including their operational definitions, calculation sources, and selection rationale, is presented in Table 2.

3.2.1. Quantification of 2D Visual Composition

2D visual indicators quantify what the observer sees. We employed the SegFormer deep learning model (Transformer-based architecture) [60,61], fine-tuned on a custom “Urban Riverscape Imagery Dataset” (URID), to perform semantic segmentation. The model was trained to recognize 15 distinct categories: Water, Sky, Terrain, Traditional Building, Modern Building, Revetment, Bridge, Car, Truck, Bicycle, Boat, Tree, Grass, People, and Void [8], achieving a Mean Pixel Accuracy (MPA) of 93.69%.

Based on the segmentation results, five core visual perception indices were derived. Each index is the percentage of pixels belonging to the target categories (N_target) relative to the total number of pixels in the image (N_total):

1. Green View Index (GVI): The sum of Tree and Grass pixels, serving as a key proxy for “naturalness” and psychological restoration [37].
2. Sky View Index (SVI): The proportion of Sky pixels, quantifying “spatial openness” to assess potential oppressive feelings in high-density areas [2].
3. Building View Index (BVI): The sum of Traditional Building and Modern Building pixels, reflecting the “visual density” and dominance of artificial interfaces [7].
4. Revetment View Index (RVI): The proportion of Revetment pixels, determining the visual hardness of the immediate water-land interface.
5. Dynamic Object Index (DOI): The aggregate proportion of mobile elements (Boat, Car, Truck, Bicycle, People), quantifying the “visual vibrancy” of the scene [8].
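The five indices can be computed from a segmentation mask as in the minimal sketch below. The class names follow the 15 categories listed in Section 3.2.1, but the integer ordering of the class ids, the helper name `view_indices`, and the tiny example mask are assumptions made for illustration. A production pipeline would more likely operate on array-based masks produced by the segmentation model.

```python
from collections import Counter
from itertools import chain

# Class names follow the 15 categories in the text; the id ordering is assumed.
CLASSES = ["Water", "Sky", "Terrain", "Traditional Building", "Modern Building",
           "Revetment", "Bridge", "Car", "Truck", "Bicycle", "Boat", "Tree",
           "Grass", "People", "Void"]

INDEX_CLASSES = {
    "GVI": ["Tree", "Grass"],
    "SVI": ["Sky"],
    "BVI": ["Traditional Building", "Modern Building"],
    "RVI": ["Revetment"],
    "DOI": ["Boat", "Car", "Truck", "Bicycle", "People"],
}


def view_indices(mask):
    """Compute each index as 100 * N_target / N_total from a 2D mask of class ids."""
    counts = Counter(chain.from_iterable(mask))
    n_total = sum(counts.values())
    return {
        name: 100.0 * sum(counts[CLASSES.index(c)] for c in targets) / n_total
        for name, targets in INDEX_CLASSES.items()
    }


# Hypothetical 2 x 5 mask.
# Row 1: Sky, Sky, Tree, Grass, Modern Building
# Row 2: Water, Water, Water, Revetment, Boat
mask = [[1, 1, 11, 12, 4],
        [0, 0, 0, 5, 10]]
indices = view_indices(mask)
print(indices)
```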

3.2.2. Quantification of 3D Spatial Configuration

The quantification of 3D spatial morphological indicators fully utilized the precise geospatial information from the high-fidelity 3D reality models. By deploying spatial analysis algorithms within the Rhinoceros-Grasshopper parametric platform, we extracted morphological features critical to spatial experience that 2D images cannot directly reflect. These indicators include: River Width, the shortest perpendicular distance between the two riverbanks at each sampling point, defining the fundamental scale of the spatial experience; Building Height, the average vertical scale of buildings surrounding the sampling point; the Height-to-Width Ratio (H/W Ratio), the ratio of average building height to river width, a classic urban design metric used here to quantify the “canyon effect” or openness of the river space [8]; and Viewshed Area, calculated via 3D viewshed analysis as the total unobstructed visible area within a 360-degree horizontal field of view for the observer, comprehensively assessing spatial visual permeability [2]. By combining these cross-dimensional indicators, this study constructed a comprehensive and complementary system for quantifying the built environment, laying a solid data foundation for the subsequent in-depth exploration of the complex relationships between the objective environment and subjective perception.
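As a small worked illustration of the H/W Ratio defined above, the sketch below divides the mean height of the surrounding buildings by the river width at a sampling point. The function name `hw_ratio` and all numeric values are hypothetical. The study derives its heights and widths from the 3D reality models in Rhinoceros-Grasshopper, which this sketch does not reproduce.

```python
from statistics import mean


def hw_ratio(building_heights_m, river_width_m):
    """Ratio of mean surrounding building height to river width (both in metres)."""
    if river_width_m <= 0:
        raise ValueError("river width must be positive")
    return mean(building_heights_m) / river_width_m


# Hypothetical values for two sampling points.
enclosed = hw_ratio([60.0, 90.0, 120.0], 200.0)    # tall frontage, narrower channel
open_reach = hw_ratio([18.0, 20.0, 22.0], 250.0)   # low frontage, wider channel
print(enclosed, open_reach)
```

A higher ratio corresponds to a more enclosed, canyon-like section, and a lower ratio to a more open one.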

3.3. Subjective Visual Perception

To obtain robust public subjective perception evaluations, this study designed and executed a laboratory experiment based on Immersive Virtual Reality (IVR). This method was chosen to acquire perception data closer to real-world contexts than traditional 2D image assessments by simulating a “being there” experience.

3.3.1. Experimental Apparatus and Environmental Standardization

To ensure the reliability of the perceptual data, strict standardization was applied to the experimental environment and equipment. The PICO NEO 3 head-mounted display (HMD) was selected as the primary device due to its high resolution (4K) and ergonomic comfort. To simulate the experience of a boat passenger, participants viewed the scenes from a fixed viewpoint but were granted 3 degrees of freedom. This allowed for omnidirectional visual exploration (360-degree head rotation horizontally and vertically) via the HMD, while hand controllers enabled interactions such as confirming scores. This setup ensures that participants can perceive the spatial enclosure and depth cues of the riverscape without the motion sickness often associated with virtual locomotion [62]. The experiment was conducted in a quiet, sound-proofed, and temperature-controlled laboratory with constant lighting conditions. This isolation eliminated external auditory and visual distractions, ensuring that participants’ attention was focused solely on the virtual stimuli.

3.3.2. Selection of Perceptual Indicators and Rationale for IVR

The selection of subjective perception indicators was grounded in the widely validated MIT Place Pulse framework [63], adapted to capture the unique phenomenological characteristics of the water-based perspective. Crucially, the use of Immersive Virtual Reality (IVR) was deemed necessary for these specific indicators because static 2D images fail to convey the spatial depth and volumetric enclosure cues required to validly assess complex psychological constructs. For instance, the oppression or grandeur of a “canyon-like” CBD cannot be fully perceived without the 3D immersion provided by VR.

Four perceptual constructs were operationally defined specifically for this experiment to ensure clarity and relevance:

Sense of Affluence: In this study, this is defined as the perceived level of economic prosperity and urban prestige conveyed by the built environment. The use of IVR is critical here, as the perception of wealth in high-density waterfronts is often driven by the “vertical scale” and “spatial enclosure” of skyscrapers. 2D images tend to flatten this verticality, potentially leading to misinterpretation as mere crowding, whereas VR restores the imposing atmospheric quality of the skyline.
Vibrancy: Defined as the perceived intensity of human activity and dynamic energy. This construct captures the “social liveliness” of the waterfront, distinct from static aesthetics. IVR enhances the validity of this metric by immersing the observer in the scene, making the presence of dynamic elements (e.g., moving boats, flowing water, crowds on the banks) feel physically proximal and engaging rather than just visual symbols.
Scenic Beauty: Defined as the aesthetic harmony and attractiveness of the visual composition, focusing on the synergy between natural and artificial elements. Unlike framed photographs that can be carefully composed to hide eyesores, IVR forces a holistic evaluation of the 360-degree environment, ensuring that the “beauty” score reflects the authentic, uncurated reality of the riverscape.
Sense of Boredom: Defined as the level of psychological under-stimulation or monotony induced by the environment. This serves as a negative control variable to identify river sections lacking visual foci or rhythmic variation. VR is uniquely suited to measure this, as the sensation of boredom often arises from the “emptiness” of spatial scale (e.g., a vast, featureless water surface) that is difficult to convey through a cropped 2D image.

All indicators were assessed using a standardized 7-point Likert scale (1 = Lowest perception, 7 = Highest perception), ensuring quantitative consistency across all participants and scenes.

3.3.3. Participants and Experimental Procedure

To ensure statistical validity and sample diversity, a total of 120 participants were recruited for this study, comprising 94 professionals (students and faculty in landscape architecture or urban planning) and 26 non-professionals. Crucially, to minimize potential “place attachment” bias and ensure the evaluation reflected purely visual responses, individuals with prior residency or long-term travel experience in London or Paris were excluded. The cohort was randomly divided into two independent groups of equal size (n = 60) to facilitate consistency testing. Group A had an average age of 28.5 years, with 61.7% female representation and 91.7% possessing prior VR experience, while Group B had an average age of 27.3 years, with 56.7% female representation and 85% prior experience. No statistically significant differences in demographics were observed between the groups (p > 0.05).

The experiment was conducted in a strictly standardized laboratory environment featuring sound-proofing, constant artificial lighting, and a fixed temperature (24 °C) to eliminate external distractions. Participants viewed the virtual environments using a PICO NEO 3 head-mounted display. To simulate the authentic experience of a boat passenger, participants were seated in a swivel chair with a 3-Degree-of-Freedom interaction mode. This setup allowed for omnidirectional visual exploration (360-degree head rotation) while fixing the viewpoint to the boat’s trajectory, thereby enabling immersive observation while minimizing motion sickness commonly associated with virtual locomotion [62].

The experimental procedure followed a three-stage protocol to ensure data quality. First, participants underwent a training session to familiarize themselves with the device and the indicator definitions, including a warm-up with five non-experimental scenes. Second, each participant performed the formal evaluation of 20 scenes. To reduce order effects associated with fatigue or learning, the presentation sequence of these scenes was randomized for each participant, and the entire session was limited to 20–25 min to prevent visual fatigue. Third, after the visual perception task was completed, the scores of the two groups were compared by correlation analysis to assess the consistency and reliability of the data; a high correlation between the two groups’ results would indicate good consistency and inter-group reliability.
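The per-participant randomization of scene order described above can be illustrated with a minimal Python sketch. The participant identifiers, the fixed seed, and the function name `presentation_orders` are hypothetical and are not taken from the study's software.

```python
import random

def presentation_orders(participant_ids, scene_ids, seed=0):
    """Return an independent random scene order for every participant."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is repeatable
    orders = {}
    for pid in participant_ids:
        order = list(scene_ids)  # copy, so every participant starts from the same list
        rng.shuffle(order)       # in-place Fisher-Yates shuffle
        orders[pid] = order
    return orders

participants = [f"P{i:03d}" for i in range(1, 121)]  # 120 participants (hypothetical IDs)
scenes = list(range(1, 21))                          # 20 scenes per session
orders = presentation_orders(participants, scenes)
```

Each participant sees every scene exactly once, and only the sequence differs between participants.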

3.4. Statistical Analysis and Interpretable Machine Learning Strategy

To decode the complex mechanisms driving riverscape perception, this study adopted a layered analytical strategy progressing from global correlation checks to deeper mechanistic decoding. Initially, Pearson correlation analysis was employed to examine linear associations between the nine objective environmental indicators and four subjective perception scores [7], providing a statistical baseline for understanding the fundamental direction of influence. However, given the inherent complexity of human–environment interactions, which are often characterized by threshold effects and non-linearities, linear models alone are insufficient. Therefore, we constructed Random Forest (RF) regression models to predict perception scores. As an ensemble learning method, Random Forest is relatively robust against overfitting and multicollinearity, making it well-suited for high-dimensional urban data [64]. The models were implemented using the scikit-learn library in Python v3.14.3. To tune the models systematically and reproducibly, we employed Grid Search with 5-fold Cross-Validation to optimize key hyperparameters (e.g., number of trees, maximum depth). Model accuracy was subsequently evaluated using the Coefficient of Determination (R²), Mean Squared Error (MSE), and Mean Absolute Error (MAE) [65].
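The tuning and evaluation workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration and not the authors' script: the data are synthetic, the hyperparameter grid is deliberately small, and the 80/20 hold-out split is an assumption that the text does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: 200 sampling points x 9 indicators (NOT the study's data).
X = rng.uniform(0.0, 1.0, size=(200, 9))
# Hypothetical non-linear target with a threshold effect on the first indicator.
y = 4.0 + 1.5 * (X[:, 0] > 0.5) - 1.0 * X[:, 1] ** 2 + rng.normal(0.0, 0.2, 200)

# Assumed 80/20 hold-out split; the paper does not state its split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # illustrative grid
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,          # 5-fold cross-validation, as described in the text
    scoring="r2",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out points.
y_pred = search.best_estimator_.predict(X_test)
metrics = {
    "R2": r2_score(y_test, y_pred),
    "MSE": mean_squared_error(y_test, y_pred),
    "MAE": mean_absolute_error(y_test, y_pred),
}
print(search.best_params_, metrics)
```

In the study, one such model would be fitted for each of the four perception scores.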

While Random Forest offers superior predictive power, its “black box” nature often hinders the extraction of actionable design insights. To bridge the gap between prediction and understanding, this study integrated the SHAP (SHapley Additive exPlanations) framework. Grounded in cooperative game theory, SHAP assigns an importance value to each feature for a given prediction, ensuring a fair distribution of contributions [55]. Unlike traditional feature importance metrics that only provide global rankings, SHAP also supports dependence plots, which reveal how specific changes in landscape features (e.g., variations in the Green View Index) quantitatively shift predicted perception scores. It is important to note that SHAP interprets the model’s predictive logic based on learned associations within the dataset. While it effectively identifies strong predictors and potential driving mechanisms, these insights represent model-based attributions rather than direct causal proofs in the physical world [66]. By combining RF’s non-linear fitting capability with SHAP’s interpretability, this framework provides a robust, evidence-based basis for deriving urban design guidelines.
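To make the game-theoretic idea concrete, the following sketch computes exact Shapley values for a small toy function by enumerating all feature coalitions. It is a didactic simplification, not the SHAP library's algorithm: absent features are replaced by a single baseline value, and `toy_model`, the feature values, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline input.

    Features outside a coalition are set to their baseline value, a
    simplification of averaging over a background dataset.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Weighted marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy "perception model" with an interaction term (not the paper's RF).
def toy_model(z):
    gvi, bvi, svi = z
    return 4.0 + 2.0 * gvi + 1.0 * bvi - 1.5 * svi + 3.0 * gvi * bvi

x = [0.30, 0.40, 0.20]         # scene to explain (made-up values)
baseline = [0.10, 0.20, 0.40]  # reference scene (made-up values)
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the "fair distribution" property referred to above. The interaction term is split equally between the two features involved.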

4. Results

4.1. Analysis of Objective Characteristics

4.1.1. Model Performance and Data Summary

The SegFormer computer vision model trained in this study achieved a Mean Pixel Accuracy (MPA) of 93.69% and a Mean Intersection over Union (mIoU) of 49.53% in the image semantic segmentation task, which was considered sufficient for the purposes of this experiment. Based on the quantitative analysis results from image semantic segmentation and the digital twin model (Figure 3), this paper conducted statistical analysis and an overall comparison of multiple objective visual characteristic indicators across the study areas.
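For readers unfamiliar with the two segmentation metrics, the sketch below computes them from a class confusion matrix using their common definitions (per-class accuracy and per-class intersection over union, each averaged over classes). Whether the study used exactly these definitions is an assumption, and the three-class matrix is invented for illustration.

```python
def segmentation_metrics(confusion):
    """Return (MPA, mIoU) from a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    A class with no ground-truth pixels is skipped for accuracy; a class with
    neither ground-truth nor predicted pixels is skipped for IoU.
    """
    n = len(confusion)
    accs, ious = [], []
    for c in range(n):
        tp = confusion[c][c]
        true_total = sum(confusion[c])                       # TP + FN
        pred_total = sum(confusion[r][c] for r in range(n))  # TP + FP
        union = true_total + pred_total - tp                 # TP + FN + FP
        if true_total > 0:
            accs.append(tp / true_total)
        if union > 0:
            ious.append(tp / union)
    return sum(accs) / len(accs), sum(ious) / len(ious)

# Hypothetical 3-class example (sky, building, vegetation); the counts are made up.
conf = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 20, 30],
]
mpa, miou = segmentation_metrics(conf)
```

Because IoU also penalizes false positives, mIoU is typically lower than pixel accuracy for the same predictions, which is consistent with the gap between the two values reported above.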

To facilitate a clear comparison between the distinct urban morphologies of the two cases, the descriptive statistics for indicators are summarized in Table 3. The complete spatial distribution patterns are visualized in Figure 4.

4.1.2. Comparative Analysis of Urban Morphologies

As evidenced by Table 3, the two rivers exhibit differences in urban form and landscape composition, reflecting their distinct planning histories.

At the level of 2D visual composition, the River Seine presents a more continuous and ecological interface. Its Green View Index (9.0%) is nearly three times that of the Thames (3.3%), corresponding to the well-preserved tree-lined avenues characterizing the Parisian waterfront. Furthermore, the Seine exhibits higher visual vibrancy, indicated by higher values in both the Dynamic Object Index and Revetment View Index, suggesting a closer interaction between the water and active urban life. In contrast, the River Thames is defined by its spatial openness, with a markedly higher Sky View Index (45.5%). Regarding the Building View Index (BVI), although the mean values are similar, the standard deviation for the Seine (SD = 2.3%) is considerably larger than for the Thames (SD = 1.4%). This indicates that while the Parisian building interface is generally regulated and harmonious, significant visual surges occur at specific monumental nodes (e.g., the Louvre complex), creating rhythmic visual focal points.

Differences in 3D spatial morphology more profoundly reveal the underlying urban design philosophies: “Organic Evolution” (Thames) versus “Planned Regularity” (Seine). The River Thames is generally wider (M = 83.6 m) with a smaller standard deviation in width (SD = 8.8 m), indicating a consistent channel morphology. However, its vertical dimension is highly volatile; the standard deviation of Building Height reaches a striking 44.7 m (over three times that of the Seine). This reflects the juxtaposition of historic low-rise structures and super-tall modern skyscrapers (e.g., The Shard). The combination of these factors results in a Height-to-Width Ratio (H/W) for the Thames that spans an extremely wide range (0.13 to 4.15), forming visually impactful “urban canyons” in localized sections. In sharp contrast, the Seine’s H/W Ratio remains stably within a concentrated range (0.25 to 1.52), creating a scale-appropriate and unified spatial experience. Consequently, the mean Viewshed Area of the Thames (0.70 km²) is more than three times that of the Seine, quantifying a grander, more permeable, but less enclosed spatial pattern.
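The descriptive statistics compared in this paragraph can be reproduced for any set of sampling points with a few lines of Python. The sketch assumes that the H/W Ratio at a sampling point is the bankside building height divided by the river width at that point; the heights and widths below are invented and are not values from Table 3.

```python
from statistics import mean, stdev

def hw_summary(heights_m, widths_m):
    """Summarize Height-to-Width ratios over paired sampling points."""
    ratios = [h / w for h, w in zip(heights_m, widths_m)]
    return {
        "mean": mean(ratios),
        "sd": stdev(ratios),  # sample standard deviation
        "min": min(ratios),
        "max": max(ratios),
    }

# Made-up values for three sampling points (metres).
heights = [15.0, 40.0, 200.0]
widths = [100.0, 80.0, 50.0]
summary = hw_summary(heights, widths)
```

A single very tall building on a narrow reach is enough to stretch the range, which is the pattern described for the Thames.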

4.1.3. Spatial Sequence and Evolutionary Patterns

Beyond aggregate statistical differences, the longitudinal evolution of environmental indicators along the river channels reveals distinctly different internal rhythms of the urban fabric (as visualized in the trend lines of Figure 3). The River Thames is characterized by “High Heterogeneity” and “Sectional Differentiation,” where the spatial sequence exhibits dramatic fluctuations driven by the juxtaposition of distinct urban tissues. Specifically, its Building View Index (BVI) and Height-to-Width Ratio (H/W) peak significantly in the central section (Sampling Points 40–50), creating a zone of intense spatial enclosure corresponding to the landmark-dense Central Business District. Conversely, the Sky View Index (SVI) presents a distinct “U-shaped” distribution, reaching a trough in this central “urban canyon” while expanding rapidly at both the upstream and downstream ends. Furthermore, the Green View Index (GVI) displays a “high-middle, low-ends” pattern, with peaks concentrated in the middle-upper reaches where large riparian parks break the urban continuity. These asynchronous and dramatic changes in indicators collectively shape a spatial experience defined by “nodal mutations,” where the observer moves between contrasting sequences of extreme enclosure and vast openness.

In sharp contrast, the River Seine demonstrates strong “Morphological Continuity” and “Rhythmic Harmony.” Although its BVI and H/W Ratio also show local peaks at key monumental nodes (e.g., Sampling Points 50–60), the overall transition is notably gentler, lacking the extreme outliers seen on the Thames. The SVI remains at a relatively stable and low level throughout the study section, reflecting the persistent sense of enclosure maintained by the strict height controls and continuous cornice lines characteristic of Haussmannian planning. Notably, the Seine’s GVI exhibits a “pulse-like” pattern distinct from the Thames; instead of broad green belts, peaks appear as rhythmic intervals corresponding to specific waterfront nodes (e.g., the Tuileries Garden). This sequence embodies a “regulated rhythm,” offering a unified and coherent visual narrative that stands in contrast to the dramatic collage of the Thames.

4.2. Analysis of Subjective Visual Perception Results

The validity of the subjective perception data was first checked through a consistency test between the two independent participant groups. The Pearson correlation analysis yielded a coefficient of 0.773, indicating good inter-group reliability and suggesting that the ratings reflect a shared public response rather than random variation.
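One plausible way to carry out such a consistency check is to average the ratings per scene within each group and correlate the two sets of scene means. The sketch below assumes this aggregation, which the text does not specify, and uses made-up 7-point ratings.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def inter_group_r(group_a, group_b):
    """Correlate per-scene mean ratings of two groups.

    Each group is a dict {scene_id: [scores from each participant]}.
    """
    scenes = sorted(set(group_a) & set(group_b))
    means_a = [mean(group_a[s]) for s in scenes]
    means_b = [mean(group_b[s]) for s in scenes]
    return pearson_r(means_a, means_b)

# Made-up Likert ratings for four scenes, three raters per group.
group_a = {1: [5, 6, 5], 2: [3, 2, 3], 3: [4, 4, 5], 4: [6, 7, 6]}
group_b = {1: [5, 5, 5], 2: [2, 3, 3], 3: [5, 5, 5], 4: [6, 6, 7]}
r = inter_group_r(group_a, group_b)
```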

Figure 5 compares the distribution characteristics of subjective perception results between the River Thames and the River Seine. Examining specific perception indicators, for Sense of Affluence, the score for the River Seine in Paris (Mean M = 4.71, Standard Deviation SD = 0.31) was significantly higher than that for the River Thames in London (M = 4.04, SD = 0.63). Furthermore, the distribution of scores for the Seine was more concentrated, suggesting a high public consensus regarding its perception as a symbol of wealth and prosperity. This may be closely related to the regular, magnificent, and well-maintained classical architectural complexes along its banks. Regarding Scenic Beauty, the Seine also demonstrated a clear advantage, with a mean score (M = 4.94) higher than that of the Thames (M = 4.31), further confirming that its harmonious and unified landscape character elicits higher aesthetic evaluations.

For the other two indicators, the rivers exhibited more complex differences. In terms of Vibrancy, the score for the Seine (M = 4.73) was slightly higher than that for the Thames (M = 4.48), which may be attributed to the higher Dynamic Object Index observed in the Seine’s raw data. However, a notable contrast emerged in the negative indicator, Sense of Boredom. The boredom score for the Thames (M = 3.56) was significantly higher than for the Seine (M = 3.21), with a wider range of scores (1.83 to 4.08). This suggests that while the Thames is vibrant overall, the heterogeneity of its landscape sequence leads to stronger feelings of monotony in certain sections, likely those with sparse or monotonous architecture downstream. In contrast, the lower and more concentrated boredom scores for the Seine reflect the continuity and high quality of its landscape experience. These significant differences in subjective perception provide a clear direction for the subsequent in-depth investigation into the underlying objective environmental drivers.

Subjective perception evaluations differed not only in overall means but also in their spatial sequences along the river channels, exhibiting evolutionary patterns highly coupled with their respective objective environmental characteristics.

The subjective perception sequences for Sense of Affluence, Vibrancy, and Scenic Beauty along the River Thames all displayed a “high in the middle, low at the ends” characteristic, whereas the Sense of Boredom indicator showed the inverse. Specifically, the evaluation results for Sense of Affluence showed considerable fluctuation, with relatively higher scores between sampling points 37–50 and lower scores in the 23–37 and 73–81 regions. Vibrancy scores were higher at point 21 and between points 41–47, while remaining relatively lower in other areas. Sense of Boredom scores exhibited an “overall high, locally low” pattern, with lower scores between points 41–51 and relatively higher values elsewhere. Additionally, the evaluation results for Scenic Beauty presented a spatial pattern largely opposite to that of Sense of Boredom, with higher scores mainly appearing between points 40–50 and relatively lower scores at points 23–30 and 89–93.

In comparison, the subjective perception sequence of the River Seine showed gentler overall changes and a more regular rhythm. Sense of Affluence scores for the Seine were higher between points 53–59 and relatively lower in the 1–15 and 77–89 regions. The highest score for Sense of Safety appeared at sampling point 63, while the lowest values occurred at points 1 and 17. Vibrancy scores were higher at point 27 and between points 71–77, with other areas scoring relatively lower. As a negative evaluation indicator, Sense of Boredom showed higher values in the 9–13 and 51–59 intervals, while scores in the 3–7 and 13–25 intervals were relatively lower. Regions with higher Scenic Beauty scores were mainly concentrated between points 33–45 and 59–81, while other areas scored relatively lower and exhibited some fluctuation.

Although significant spatial heterogeneity exists in the distribution of landscape characteristics between the River Thames and the River Seine, this diversity provides a rich sample space for constructing a perception prediction model with strong generalization capabilities. By merging these two datasets representing different urban morphological paradigms, we aim to transcend the limitations of single-case studies and capture the universal laws driving riverscape perception hidden beneath different surface appearances.

4.3. Correlation Analysis and Model Construction Results

4.3.1. Correlation Analysis Results

This study employed Pearson Correlation Analysis to perform an exploratory examination of the linear relationships between objective environmental indicators and subjective perception scores across all sampling points (Table 4). The results reveal a series of significant and complex association patterns between built environment characteristics and human perception.

The analysis indicates that Sense of Affluence is predominantly defined by indicators reflecting “spatial enclosure and artificial construction density.” It exhibits highly significant positive correlations with Building View Index, Height-to-Width Ratio, and Building Height (r > 0.63, p < 0.01), while showing strong negative correlations with Sky View Index and Viewshed Area, which embody spatial openness. This suggests that the public tends to associate high-density, highly enclosed waterfront spaces with wealth and prosperity.

The association pattern for Scenic Beauty reveals a preference for the “fusion of nature and artifice.” Green View Index is a significant positive correlate (r = 0.560), yet its correlation is slightly exceeded by that of Building View Index (r = 0.595), whereas Sky View Index is negatively correlated with Scenic Beauty. This implies that in the context of urban riverscapes, purely open nature is not the sole source of scenic beauty; a well-designed artificial interface is equally critical.

Vibrancy is primarily driven by vertical spatial scale and dynamic elements. It is significantly positively correlated with Building Height and Dynamic Object Index, while showing no significant association with Green View Index, indicating that a soaring skyline and abundant water-based activities are core sources of vibrancy. Conversely, Sense of Boredom is negatively correlated with most of the aforementioned positive indicators (such as Building View Index), whereas it is positively associated with Viewshed Area, suggesting that open environments lacking spatial variation and visual focal points are the primary cause of monotony.

4.3.2. Model Construction Results

In this study, objective environmental characteristics served as independent variables and the four subjective perception scores as dependent variables to train four separate Random Forest regression models. Following hyperparameter optimization (see Table 5), all models demonstrated robust predictive performance, indicating that objective environmental indicators effectively explain the variance in subjective perception evaluations.

To gain a deeper understanding of the key environmental factors driving perception and their non-linear influence mechanisms, we utilized the SHAP framework to provide global explanations for the models. Figure 6 presents the SHAP Summary Plots for each perceptual dimension. The beeswarm plots on the left reveal the direction of impact (positive or negative) of feature values on model output, while the bar charts on the right quantify the global importance of each feature.
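The two panels of a SHAP summary plot condense a matrix of per-sample SHAP values in different ways. As a rough sketch, the bar chart corresponds to the mean absolute SHAP value per feature, and the beeswarm's direction of impact can be approximated by the sign of the covariance between feature values and SHAP values. The function names and the small matrices below are hypothetical.

```python
from statistics import mean

def global_importance(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value (the bar-chart quantity)."""
    scores = {
        name: mean(abs(row[j]) for row in shap_matrix)
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def direction(feature_column, shap_column):
    """Sign of the covariance between feature values and their SHAP values.

    +1: high feature values tend to push predictions up; -1: down; 0: neither.
    """
    mf, ms = mean(feature_column), mean(shap_column)
    cov = sum((f - mf) * (s - ms) for f, s in zip(feature_column, shap_column))
    return (cov > 0) - (cov < 0)

# Made-up values for four scenes and two features.
names = ["SVI", "GVI"]
features = [[0.2, 0.05], [0.3, 0.10], [0.5, 0.02], [0.6, 0.08]]
shap_vals = [[0.40, 0.05], [0.20, 0.10], [-0.25, -0.05], [-0.35, 0.02]]

ranking = global_importance(shap_vals, names)
svi_direction = direction([r[0] for r in features], [r[0] for r in shap_vals])
```

In this invented example the sky view feature ranks first and has a negative direction, which is the kind of pattern read off the plots in the paragraphs that follow.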

First, Sense of Affluence is dually influenced by spatial enclosure and vertical scale. The analysis results indicate that the Sky View Index is the most important feature predicting Sense of Affluence and is significantly negatively correlated with predicted values (in the left plot, red dots representing high Sky View Index are distributed on the negative side of the SHAP value axis, while blue dots representing low values are on the positive side). This implies that less visible sky (i.e., greater obstruction by tall buildings) corresponds to a higher perceived level of affluence. This is followed by Viewshed Area, which also shows a negative correlation. Conversely, Building View Index and Building Height rank third and fourth, respectively, both exhibiting a positive correlation. This pattern indicates that the public tends to associate high-density, towering waterfront landscapes that convey a strong sense of spatial enclosure with economic prosperity.

Second, for Vibrancy, Building Height occupies an overwhelmingly dominant position, with its importance far exceeding other features. The beeswarm plot shows that the higher the building height (red dots), the greater the positive contribution to Vibrancy. Sky View Index ranks second and is negatively correlated, further confirming that urban canyons lined with high-rises are key scenes for stimulating vitality. Notably, the Height-to-Width Ratio and Dynamic Object Index also show positive influences, suggesting that the compactness of spatial form and the richness of water-based activities jointly shape the perception of vitality.

In the Sense of Boredom model, Viewshed Area emerges as the most critical predictor, showing a significant positive correlation. This means that the wider and emptier the field of view, the more likely it is to induce monotony. Building Height and River Width follow, both showing negative correlations (i.e., lower buildings and wider rivers are associated with stronger feelings of boredom). This finding suggests that open water areas lacking vertical visual focal points and spatial variation are the primary causes of negative experiences. Furthermore, the alleviating effects of Green View Index and Sky View Index on boredom are relatively limited.

The Scenic Beauty model embodies the core value of green ecology, and its driving mechanism is the most distinctive of the four. The Green View Index is the most important predictor by a wide margin, showing a strong positive correlation. This supports the central role of natural vegetation in enhancing landscape aesthetic value. Interestingly, River Width and Building View Index rank second and third, respectively, both showing positive correlations. This indicates that while greenery is paramount, expansive water surfaces and high-quality architectural interfaces are also indispensable for constructing a complete “scenic beauty”.

In summary, the SHAP analysis not only validates the conclusions of linear correlation but also more precisely quantifies the weights and directions of different landscape elements within specific perceptual dimensions [67], revealing a diverse perceptual logic ranging from “high-density prosperity” to “green aesthetics”.

5. Discussion

By integrating high-fidelity 3D reality data, Immersive Virtual Reality (IVR), and Explainable Artificial Intelligence (XAI), this study constructed an analytical framework for riverscape perception that transcends specific urban contexts. The empirical investigation of the River Thames in London and the River Seine in Paris not only validated the efficacy of this framework in quantifying environment-perception associations but also revealed a set of universal perceptual driving mechanisms underlying these seemingly distinct urban fabrics.

5.1. Critical Reflection on Model Performance and Predictability

Before delving into specific driving mechanisms, it is crucial to interpret the divergence in predictive performance across the four perceptual models. As evidenced by the Random Forest regression results (Table 5), the models exhibited unequal explanatory power: Vibrancy (R² = 0.667) and Scenic Beauty (R² = 0.619) achieved relatively high accuracy, whereas Sense of Affluence (R² = 0.509) and Sense of Boredom (R² = 0.425) were comparatively more difficult to predict based solely on visual and spatial metrics.

This disparity reveals a fundamental distinction in how different perceptions are constructed. “Vibrancy” and “Scenic Beauty” appear to be predominantly stimulus-driven (bottom-up). They rely heavily on explicit visual cues—such as the physical presence of mobile elements (Dynamic Object Index), the density of vegetation (Green View Index), and the scale of the skyline (Building Height)—which are precisely captured by our 2D and 3D objective indicators. Consequently, the machine learning model can accurately map these morphological features to human ratings.

In contrast, “Sense of Boredom” and “Sense of Affluence” involve more complex cognitive processing (top-down) and semantic associations. For Affluence, while vertical scale is a strong predictor, the perception of “wealth” also depends on subtle cues like architectural detailing, material quality, and brand prestige, which are difficult to fully capture through broad semantic segmentation or simple geometry. Similarly, Boredom had the lowest predictability (R² = 0.425), suggesting it is heavily influenced by unmeasured latent variables such as cultural meaning, historical familiarity, and individual emotional states. For instance, a visually empty riverbank might be classified as “boring” by the model due to a lack of features (high Viewshed, low DOI), yet a human observer might perceive it as “poetic” or “historically significant” if they understand its context. This finding highlights a critical boundary of morphological analysis: while visual/spatial features are strong predictors of urban intensity and aesthetics, they cannot fully account for the semantic depth of human emotional experience.

5.2. Core Driving Mechanisms of Riverscape Perception

This study goes beyond simple correlation analysis to show that public perception of urban riverscapes is driven by a complex interplay between 3D Spatial Configuration and 2D Visual Composition. By synthesizing the statistical results across the contrasting cases of the River Thames (organic evolution) and the River Seine (planned regularity), three core driving mechanisms emerge, offering insights into the “Blue–Green–Grey” interaction.

(1): The “Water-Buffered” Canyon Effect: Redefining Density and Prosperity. Our findings challenge the traditional urban design assumption that spatial enclosure invariably leads to psychological oppression. The SHAP analysis identifies a counter-intuitive mechanism where high “spatial enclosure” (high Building View Index, low Sky View Index) and “vertical scale” (Building Height) act as primary drivers for Sense of Affluence and Vibrancy. In terrestrial street canyons, high H/W ratios are often associated with stress and pollution [8]. However, our results suggest that in riverscapes, the expansive water surface acts as a “Blue Buffer,” mitigating the oppressive feeling of density while framing the skyline as a visual symbol of economic power. This is exemplified by the heterogeneous skyline of the Thames: the juxtaposition of historic landmarks and ultra-modern skyscrapers (e.g., The Shard) creates a dramatic “urban canyon” that is interpreted by the public not as crowdedness, but as a manifestation of vitality and prosperity. This finding refines the application of “Street Canyon Theory” in waterfront contexts, suggesting that moderate enclosure is not a defect but a catalyst for shaping the image of a Global City [7].
(2): Synergistic Aesthetics: The Interdependence of Nature and Artifact. While “green ecology” remains the cornerstone of aesthetic experience, this study reveals that Scenic Beauty in urban riverscapes is not derived from wilderness alone but from the synergy between high-quality artificial interfaces and natural elements. The dominance of the Green View Index supports Kaplan’s Attention Restoration Theory [68]. However, the significant positive contribution of the Building View Index highlights a critical nuance: the “riverscape beauty” aspired to by the public is a “civilized nature.” This is vividly illustrated by the River Seine, where the Haussmannian planning philosophy creates a continuous, rhythmic architectural façade interwoven with orderly tree-lined embankments. Unlike the fragmented green patches of the Thames, the Seine’s beauty stems from this “Nature–Artifact Synergy,” where the built environment is not an intrusion but a harmonious frame for the water. This aligns with Nasar’s theory of “Order and Complexity,” suggesting that visual coherence in the built interface significantly amplifies the aesthetic value of blue–green spaces [69].
(3): The Paradox of Prospect: Why Openness Can Be Boring. A crucial theoretical contribution of this study is identifying the root cause of negative perception through the Sense of Boredom. Our model reveals a “Paradox of Prospect”: while openness is generally valued, “unfocused emptiness” (excessive Viewshed Area without visual focal points) is the primary driver of boredom. This finding adds a critical layer to Appleton’s Prospect-Refuge Theory [70]. While humans desire “prospect” (open views), an unobstructed view loses its psychological appeal if it lacks “content” or “complexity” to engage visual attention. This explains the spatial heterogeneity observed in the results: the River Thames, despite its grandeur, suffers from high boredom scores in its downstream sections where the spatial fabric becomes loose and lacks vertical landmarks. In contrast, the River Seine maintains a low boredom level throughout due to its continuous visual rhythm and appropriate H/W ratio. Thus, avoiding “visual vacuums” is as important as creating openness in riverscape design.

5.3. Practical Implications for Urban Waterfront Renewal

While derived from the specific morphological contexts of London and Paris, the findings of this study offer evidence-based insights for the renewal of high-density urban waterfronts, particularly those balancing heritage conservation with modernization. By translating the “environment-perception” mechanisms revealed by the SHAP analysis into spatial interventions, we propose a transition from intuition-based to evidence-based design strategies.

First, planners should strategically leverage the “Water-Buffered Canyon Effect” to shape distinct urban identities. Contrary to the conventional wisdom of strictly limiting building heights to preserve openness, our results suggest that in core functional zones (e.g., Central Business Districts), moderate spatial enclosure is essential for stimulating the public’s Sense of Affluence and Vibrancy. For “organic” riverscapes akin to the Thames, where the urban fabric is heterogeneous, designers should not shy away from verticality. Instead, they can utilize the vast water surface as a buffer to introduce high-quality vertical skylines. This approach creates dramatic “visual anchors” that define the city’s economic image without inducing the psychological oppression typically associated with narrow street canyons [71]. However, this density must be clustered rather than continuous to maintain the rhythmic “nodal mutations” that prevent visual monotony.

Second, ecological restoration must be integrated with “Visual Complexity” to enhance aesthetics and mitigate boredom. The study confirms that while a high Green View Index is the baseline for Scenic Beauty, the ideal riverscape is not a pure wilderness but a “civilized nature.” Drawing from the “Nature–Artifact Synergy” observed in the Seine, design strategies should foster visual corridors that guide sightlines from riparian vegetation to high-quality architectural facades, rather than obscuring the city entirely [2,34]. Furthermore, to address the “Paradox of Prospect”—where excessive openness leads to a Sense of Boredom—designers must actively intervene in monotonous river sections. For wide, featureless channels, the focus should shift from merely “opening up views” to “enriching content.” This can be achieved by introducing dynamic elements (e.g., increasing the Dynamic Object Index via water-based activities) or installing vertical landmarks and tall canopy trees to segment overly broad viewsheds. These interventions effectively break the visual vacuum, transforming “empty prospect” into “engaging scenery” [72].

Finally, this study supports a methodological shift towards “AI-Aided Pre-evaluation.” The “Digital Twin + Explainable AI” framework demonstrates that subjective perception can be quantified and, to a meaningful degree, predicted from objective indicators. This offers a useful tool for future urban design governance: planners can estimate key indicators (e.g., Green View Index, Sky View Index) during the schematic phase and use the trained Random Forest models, interpreted through SHAP, to estimate likely public reactions. This capability enables a shift from reactive post-occupancy evaluation to proactive “Perception-Driven Optimization,” so that design decisions, whether preserving the rhythmic continuity of a historic waterfront or reshaping the skyline of a modern district, are grounded in objective data and aligned with human perceptual needs [2].

5.4. Limitations and Future Prospects

Despite the rigorous methodological framework established in this study, several limitations inherent to the experimental design and case selection warrant candid discussion.

First, the generalizability of the identified “universal principles” requires cautious interpretation across different geo-cultural contexts. This study derived perceptual mechanisms based on two iconic rivers in the Global North (London and Paris). While they represent contrasting morphologies (organic vs. planned), they both sit within Western, temperate, and highly developed urban frameworks. Consequently, these findings may not fully apply to riverscapes in Asian, African, or Global South contexts. For instance, in high-density Asian cities, the extreme verticality of riverbanks might push the “enclosure-affluence” correlation beyond the thresholds observed here. Similarly, in many cities of the Global South, the river functioning as a resource for daily living (e.g., washing, fishing) creates a functional bond distinct from the aesthetic appreciation focused on in this study. Future research must expand the typology to include non-Western riverscapes to test the cross-cultural validity of these mechanisms.

Second, regarding participant selection, the control for “place attachment” introduced a specific perspectival bias. By recruiting participants with low familiarity with the sites, this study isolated visual and spatial responses from memory-based biases. However, this approach inherently frames the evaluation from a “Tourist Perspective” or “Outsider Gaze.” Participants judged the riverscapes as spectacles to be viewed, rather than as habitats to be lived in. This limits the applicability of the findings for resident-centered community design, as local residents often prioritize functional accessibility and social memory over pure scenic beauty. Future studies should include comparative groups of local residents to disentangle the differences between the “tourist’s eye” and the “resident’s heart”.

Third, the “static snapshot” nature of the data limits the representation of the riverscape’s dynamic essence. Urban rivers are temporally fluid systems characterized by seasonal phenology, tidal variations, and diurnal lighting changes. The current study, relying on Google Street View and 3D models captured at specific times (mostly summer days), fails to reflect how perception might shift under different conditions (e.g., the potential oppressiveness of an “urban canyon” at night or in winter). Additionally, the exclusion of non-visual sensory channels is a major constraint. Riverscapes are multisensory environments where the sound of flowing water, the humidity, and even the smell significantly influence psychological restoration [73,74]. Future research should leverage advancements in VR to integrate dynamic lighting, temporal changes, and high-fidelity soundscapes (e.g., traffic noise vs. water sounds) to build a more ecologically valid perceptual model [75].

6. Conclusions

This study addressed the challenge of decoding the complex, non-linear mechanisms linking objective urban morphology to subjective public perception. By constructing and validating an integrated research framework combining High-Fidelity 3D Reality Models, Immersive Virtual Reality (IVR), and Explainable Artificial Intelligence (XAI), the research moves beyond the limitations of traditional 2D-based assessments and offers a methodologically rigorous protocol for capturing the “sense of presence” in riverscape studies.

Theoretically, this research challenges the conventional design orthodoxy that prioritizes absolute openness in waterfront planning. Our empirical findings from the Thames and Seine reveal a counter-intuitive “Water-Buffered Canyon Effect”: in high-density urban contexts, “Spatial Enclosure” and “Vertical Scale” do not induce oppression but are decoded by the public as positive symbols of Sense of Affluence and Vibrancy. Furthermore, the study refines the understanding of urban aesthetics by identifying the “Nature–Artifact Synergy,” demonstrating that the optimal aesthetic experience stems from the harmonious interweaving of high-quality architectural interfaces and green elements, rather than the pursuit of pure wilderness. Additionally, the identification of the “Paradox of Prospect”—where unfocused openness leads to a Sense of Boredom—adds a critical layer to Prospect-Refuge Theory, emphasizing the necessity of visual focal points in linear landscape sequences.

Methodologically and practically, the study demonstrates the potential of the “Digital Twin + Explainable AI” framework. Applying SHAP to the Random Forest models visualized the non-linear threshold effects of landscape indicators, which a traditional linear regression cannot capture directly when interpreting complex human–environment interactions. This supports the framework not merely as an analytical tool but as a predictive instrument for evidence-based design. It enables urban planners to move from intuition-driven decisions to data-driven simulations, allowing public psychological responses to be pre-assessed during the schematic design phase.

In conclusion, this study provides scientific evidence for waterfront revitalization. By clarifying the core driving mechanisms of perception, it lays a solid foundation for future research to construct holistic, multi-sensory “Perception Knowledge Graphs,” ultimately fostering more human-centric, vibrant, and sustainable urban riverscapes.

Author Contributions

Conceptualization, Yuzhen Tang and Junjie Luo; methodology, Yuzhen Tang and Junjie Luo; software, Yuzhen Tang; validation, Yuzhen Tang, Shensheng Chen, Wenhui Xu and Junjie Luo; formal analysis, Yuzhen Tang and Junjie Luo; investigation, Yuzhen Tang; resources, Yuzhen Tang, Shensheng Chen and Jinxuan Ren; data curation, Yuzhen Tang; writing – original draft preparation, Yuzhen Tang and Junjie Luo; writing – review & editing, Yuzhen Tang and Junjie Luo; visualization, Yuzhen Tang; supervision, Junjie Luo; project administration, Junjie Luo; funding acquisition, Junjie Luo. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang A&F University Scientific Research Development Fund (Project Title: Visual Perception Analysis of Urban Riverscapes Using Computer Vision), grant number 2023LFR076.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We gratefully acknowledge the participants of the experiment.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, X.; Wang, X.; Jiang, X.; Han, J.; Wang, Z.; Wu, D.; Lin, Q.; Li, L.; Zhang, S.; Dong, Y. Prediction of riverside greenway landscape aesthetic quality of urban canalized rivers using environmental modeling. J. Clean. Prod. 2022, 367, 133066.
2. Luo, J.; Yuan, Z.; Xu, L.; Xu, W. Assessing the Impact of Waterfront Environments on Public Well-Being Through Digital Twin Technology. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4536–4553.
3. Luo, J.; Liu, P.; Kong, X.; Shen, J.; Wu, Q.; Xu, D. Urban digital twins for citizen-centric planning: A systematic review of built environment perception and public participation. Int. J. Appl. Earth Obs. Geoinf. 2025, 143, 104746.
4. Gu, Y.; Quintana, M.; Liang, X.; Ito, K.; Yap, W.; Biljecki, F. Designing effective image-based surveys for urban visual perception. Landsc. Urban Plan. 2025, 260, 105368.
5. Wronkowski, A. Towards discovering human urban activity—Tactics of human spatial behavior. J. Hum. Behav. Soc. Environ. 2025, 35, 624–642.
6. Tavakoli, A.; Douglas, I.P.; Noh, H.Y.; Hwang, J.; Billington, S.L. Psycho-behavioral responses to urban scenes: An exploration through eye-tracking. Cities 2025, 156, 105568.
7. Luo, J.; Zhao, T.; Cao, L.; Biljecki, F. Semantic Riverscapes: Perception and evaluation of linear landscapes from oblique imagery using computer vision. Landsc. Urban Plan. 2022, 228, 104569.
8. Luo, J.; Zhao, T.; Cao, L.; Biljecki, F. Water View Imagery: Perception and evaluation of urban waterscapes worldwide. Ecol. Indic. 2022, 145, 109615.
9. Chiciudean, V.; Florea, H.; Beche, R.; Oniga, F.; Nedevschi, S. Data augmentation for environment perception with unmanned aerial vehicles. IEEE Trans. Intell. Veh. 2024, 10, 2334–2348.
10. Pradilla, G.; Hack, J. An urban rivers renaissance? Stream restoration and green–blue infrastructure in Latin America–Insights from urban planning in Colombia. Urban Ecosyst. 2024, 27, 2245–2265.
11. Zhou, M.; Wang, F. The driving factors of recreational utilization of ecological space in urban agglomerations: The perspective of urban political ecology. Ecol. Indic. 2024, 158, 111409.
12. Grzyb, T.; Kulczyk, S. How do ephemeral factors shape recreation along the urban river? A social media perspective. Landsc. Urban Plan. 2023, 230, 104638.
13. Yuan, Z.; Luo, J.; Liu, Y.; Xu, L.; Chen, L.; Xu, W. Toward climate-adaptive waterfront walking spaces: Thresholds and cross-modal mechanisms of multi-sensory comfort. Urban For. Urban Green. 2026, 118, 129278.
14. Garau, E.; Torralba, M.; Pueyo-Ros, J. What is a river basin? Assessing and understanding the sociocultural mental constructs of landscapes from different stakeholders across a river basin. Landsc. Urban Plan. 2021, 214, 104192.
15. Kim, B.C.; Kim, S.N.; Joo, Y. Visual impact control of urban waterfront development on the background mountain view: Examining its justifiability through two types of immersive virtual reality experiments. Environ. Impact Assess. Rev. 2024, 106, 107500.
16. Jiang, F.; Ma, J.; Webster, C.J.; Chiaradia, A.J.; Zhou, Y.; Zhao, Z.; Zhang, X. Generative urban design: A systematic review on problem formulation, design generation, and decision-making. Prog. Plan. 2024, 180, 100795.
17. Puspitasari, A.W.; Kwon, J. Analysis of the visual quality of riverfront skyline through the feature of height and spatial arrangement of tall building. Archit. Res. 2019, 21, 91–98.
18. Wang, Z.; Ito, K.; Biljecki, F. Assessing the equity and evolution of urban visual perceptual quality with time series street view imagery. Cities 2024, 145, 104704.
19. Ogawa, Y.; Oki, T.; Zhao, C.; Sekimoto, Y.; Shimizu, C. Evaluating the subjective perceptions of streetscapes using street-view images. Landsc. Urban Plan. 2024, 247, 105073.
20. Xu, Z.; Jiang, T.; Zheng, N. Developing and analyzing eco-driving strategies for on-road emission reduction in urban transport systems-A VR-enabled digital-twin approach. Chemosphere 2022, 305, 135372.
21. D’Urso, F.; Santoro, C.; Santoro, F.F. An integrated framework for the realistic simulation of multi-UAV applications. Comput. Electr. Eng. 2019, 74, 196–209.
22. Liu, Y.; Yuan, Z.; Luo, J.; Shen, Y.; Yao, X.; Xu, W. The Impact of 3D Urban Landscapes on Multidimensional Human Experience: An Interpretable Machine Learning Approach. Build. Environ. 2026, 293, 114320.
23. Ardiny, H.; Beigzadeh, A.; Mahani, H. Applications of unmanned aerial vehicles in radiological monitoring: A review. Nucl. Eng. Des. 2024, 422, 113110.
24. Zhu, J.; Wang, S.; Ma, H.; Shan, T.; Xu, D.; Sun, F. Nonlinear effect of urban visual environment on residents’ psychological perception—An analysis based on XGBoost and SHAP interpretation model. City Environ. Interact. 2025, 27, 100202.
25. Li, C.; Managi, S. Impacts of community attachment and community livability on environmental activity according to XGBoost and SHAP. Cities 2025, 156, 105559.
26. Collins, S.A.; McDonnell, A.S.; Scott, E.E.; McNay, G.D.; Shannon, M.F.; Augustin, L.; Hoffmann, J.N.; Johnson, S.; Strayer, D.; LoTemplio, S. Nature imagery’s influence on ERN amplitude: An examination of Attention Restoration Theory using EEG. Front. Hum. Neurosci. 2025, 19, 1567689.
27. Sniderman, S. The Design of Emotionally Resonant Learning Experiences: Prospect-Refuge, Framing, and Friction. Int. J. Adv. Corp. Learn. 2025, 18, 82.
28. Haeri, S.; Masnavi, M.R. Analyzing and Developing Strategies for the Ecological Restoration of Urban Rivers in the Framework of Ecological Urbanism. MANZAR Sci. J. Landsc. 2023, 15, 54–71.
29. Wantzen, K.M. River culture: How socio-ecological linkages to the rhythm of the waters develop, how they are lost, and how they can be regained. Geogr. J. 2024, 190, e12476.
30. Li, X.; Li, Y.; Zhang, S.; Lin, R.; Chen, M.; Feng, L. Driving effects of land use and landscape pattern on different spontaneous plant life forms along urban river corridors in a fast-growing city. Sci. Total Environ. 2023, 876, 162775.
31. Bush, J.; Doyon, A. Planning a just nature-based city: Listening for the voice of an urban river. Environ. Sci. Policy 2023, 143, 55–63.
32. Çınar, E.; Tuncal, A. The Future of UAVs in Urban Air Mobility: Public Perception and Concerns. Türkiye İnsansız Hava Araçları Derg. 2023, 5, 50–58.
33. Kamarulzaman, A.M.M.; Jaafar, W.M.; Said, M.; SNM; Saad; Mohan, M. UAV Implementations in Urban Planning and Related Sectors of Rapidly Developing Nations: A Review and Future Perspectives for Malaysia. Remote Sens. 2023, 15, 2845.
34. Chen, C.; Li, H.; Luo, W.; Xie, J.; Yao, J.; Wu, L.; Xia, Y. Predicting the effect of street environment on residents’ mood states in large urban areas using machine learning and street view images. Sci. Total Environ. 2022, 816, 151605.
35. Li, X.; Huang, K.; Zhang, R.; Chen, Y.; Dong, Y. Visual Perception Optimization of Residential Landscape Spaces in Cold Regions Using Virtual Reality and Machine Learning. Land 2024, 13, 367.
36. Clay, G.R.; Daniel, T.C. Scenic landscape assessment: The effects of land management jurisdiction on public perception of scenic beauty. Landsc. Urban Plan. 2000, 49, 1–13.
37. Wu, C.; Ye, Y.; Gao, F.; Ye, X. Using street view images to examine the association between human perceptions of locale and urban vitality in Shenzhen, China. Sustain. Cities Soc. 2023, 88, 104291.
38. Liu, Y.; Abbasabadi, N. Enhancing urban building energy models with Vision Transformers: A Case study in material classification from Google street view. Energy Build. 2025, 333, 115457.
39. Danish, M.; Labib, S.M.; Ricker, B.; Helbich, M. A citizen science toolkit to collect human perceptions of urban environments using open street view images. Comput. Environ. Urban Syst. 2025, 116, 102207.
40. Liu, M.; Gou, Z.; Zhao, Q. Enhancing building performance evaluation through Google street View: A multimodal transfer learning framework. Energy Build. 2025, 344, 115968.
41. Huang, Y.; Sanatani, R.P.; Liu, C.; Kang, Y.; Zhang, F.; Liu, Y.; Duarte, F.; Ratti, C. No “true” greenery: Deciphering the bias of satellite and street view imagery in urban greenery measurement. Build. Environ. 2025, 269, 112395.
42. Nathvani, R.; Cavanaugh, A.; Suel, E.; Bixby, H.; Clark, S.N.; Metzler, A.B.; Nimo, J.; Moses, J.B.; Baah, S.; Arku, R.E.; et al. Measurement of urban vitality with time-lapsed street-view images and object-detection for scalable assessment of pedestrian-sidewalk dynamics. ISPRS J. Photogramm. Remote Sens. 2025, 221, 251–264.
43. Mehta, D.; Gopalakrishnan, P. Real Environment or Virtual Environment?: Perception Bias Evaluation of Perceived Security in Urban Parks. J. Park Recreat. Adm. 2025, 43, 22–38.
44. Li, F.; Zhang, Z.; Xu, L.; Yin, J. The effects of professional design training on urban public space perception: A virtual reality study with physiological and psychological measurements. Cities 2025, 158, 105654.
45. Alkhresheh, M. Effects of Levels of Realism on Perceived Distance in Computer-Simulated Urban Spaces. Buildings 2025, 15, 3565.
46. Ju, H.; Bulbul, T.; Yang, X.; Withers, J. Risk awareness assessment of construction workers wearing AR head-mounted displays using EEG signals. J. Saf. Res. 2025, 95, 237–250.
47. Figueiredo, M.; Eloy, S.; Marques, S. Age-friendly cities and active mobility: A thematic analysis based on immersive 360-degree video elicitation. Cities 2025, 166, 106260.
48. Jiang, Y.; Yao, Y.; Yu, Q.; Jiang, Z.; Liu, X.; Li, J.; Zhang, H. Integrating Virtual Reality-based eye-tracking with urban digital twin for unveiling pedestrian visual attention in wayfinding tasks. Travel Behav. Soc. 2026, 42, 101120.
49. Juan, Y.K.; Chen, Y. Not or Yes in My Back Yard? A physiological and psychological measurement of urban residents in Taiwan. Landsc. Urban Plan. 2025, 255, 105256.
50. Li, T.; Hu, H.; Ma, H.; Ma, J.; Li, Q. Using Virtual Reality to Enhance Learning Performance and Address Educational Resource Disparities in Architectural History Courses. Sustainability 2025, 17, 866.
51. Huang, D.; Gong, W.; Wang, X.; Liu, S.; Zhang, J.; Li, Y. A Cognition–Affect–Behavior Framework for Assessing Street Space Quality in Historic Cultural Districts and Its Impact on Tourist Experience. Buildings 2025, 15, 2739.
52. Hong, T.; Yim, S.H.; Heo, Y. Interpreting complex relationships between urban and meteorological factors and street-level urban heat islands: Application of random forest and SHAP method. Sustain. Cities Soc. 2025, 126, 106353.
53. Qiao, Y.; Sun, H.; Qi, J.; Liu, S.; Li, J.; Ji, Y.; Wang, H.; Peng, Y. Examining water bodies’ cooling effect in urban parks with buffer analysis and random forest regression. Urban Clim. 2025, 59, 102301.
54. Eshraghi, P.; Talami, R.; Dehnavi, A.N.; Mirdamadi, M.; Zomorodian, Z.S. Adopting explainable-AI to investigate the impact of urban morphology design on energy and environmental performance in dry-arid climates. Adv. Build. Energy Res. 2025, 19, 497–531.
55. Sheng, T.; Zhang, Z.; Qian, Z.; Ma, P.; Xie, W.; Zeng, Y.; Zhang, K.; Sun, Z.; Yu, J.; Chen, M. Examining urban agglomeration heat island with explainable AI: An enhanced consideration of anthropogenic heat emissions. Urban Clim. 2025, 59, 102251.
56. Liu, Y.; Wang, H.; Guan, X.; Meng, Y.; Xu, H. Urban flood depth prediction and visualization based on the XGBoost-SHAP model. Water Resour. Manag. 2025, 39, 1353–1375.
57. Zhang, X.; Zhao, P.; Hu, Q.; Ai, M.; Hu, D.; Li, J. A UAV-based panoramic oblique photogrammetry (POP) approach using spherical projection. ISPRS J. Photogramm. Remote Sens. 2020, 159, 198–219.
58. Alvarez-Vanhard, E.; Corpetti, T.; Houet, T. UAV & satellite synergies for optical remote sensing applications: A literature review. Sci. Remote Sens. 2021, 3, 100019.
59. Ding, L.; Zhou, J.; Meng, L.; Long, Z. A practical cross-view image matching method between UAV and Satellite for UAV-based geo-localization. Remote Sens. 2020, 13, 47.
60. Larkin, A.; Huang, T.; Chen, L.; Lin, P.I.; Hart, J.E.; Zhang, W.; Coull, B.A.; Yi, L.; Suel, E.; Hankey, S.; et al. Developing Nationwide Estimates of Built Environment Quality Characteristics Using Street-View Imagery and Computer Vision. Environ. Sci. Technol. 2025, 59, 13638–13646.
61. Li, S.; Tan, Y.; Zhou, Z.; Chen, P.; Li, C.; Zhou, C. Enhancing 3D building reconstruction quality using UAV multi-view photogrammetry and multi-modal large models. Autom. Constr. 2026, 181, 106664.
62. Maniee, S.; Zomorodian, Z.S.; Tahsildoost, M. Assessing Lighting Design in Urban Open Spaces: A Virtual Reality Experiment. Build. Environ. 2025, 283, 113362.
63. Zhang, H. Understanding Human Perceptions of Street View Images Using MIT Place Pulse 2.0 Dataset. Ph.D. Thesis, Waseda University, Shinjuku, Japan, 2023.
64. Reza, S.A.; Hasan, M.S.; Amjad, M.H.; Islam, M.S.; Rabbi, M.M.; Hossain, A.; Shovon, M.S.; Jakir, T. Predicting energy consumption patterns with advanced machine learning techniques for sustainable urban development. J. Comput. Sci. Technol. Stud. 2025, 7, 265–282.
65. Zhang, A.; Tariq, A.; Quddoos, A.; Naz, I.; Aslam, R.W.; Barboza, E.; Ullah, S.; Abdullah-Al-Wadud, M. Spatio-temporal analysis of urban expansion and land use dynamics using google earth engine and predictive models. Sci. Rep. 2025, 15, 6993.
66. Li, X.; Wang, Z.; Chen, Y.; Wang, Z.; Kuang, D. Exploring the impact of land use on bird diversity in high-density urban areas using explainable machine learning models. J. Environ. Manag. 2025, 374, 124080.
67. Hao, J.; Ho, T.K. Machine learning made easy: A review of scikit-learn package in python programming language. J. Educ. Behav. Stat. 2019, 44, 348–361.
68. Neilson, B.N.; Craig, C.M.; Travis, A.T.; Klein, M.I. A review of the limitations of Attention Restoration Theory and the importance of its future research for the improvement of well-being in urban living. Vis. Sustain. 2019, 11, 59–67.
69. Akcelik, G.N.; Choe, K.W.; Rosenberg, M.D.; Schertz, K.E.; Meidenbauer, K.L.; Zhang, T.; Rim, N.; Tucker, R.; Talen, E.; Berman, M. Quantifying urban environments: Aesthetic preference through the lens of prospect-refuge theory. J. Environ. Psychol. 2024, 97, 102344.
70. Dosen, A.S.; Ostwald, M.J. Evidence for prospect-refuge theory: A meta-analysis of the findings of environmental preference research. City Territ. Archit. 2016, 3, 4.
71. de Faria, L.C. Airflow in urban canyons: A comparison of CFD calculated results with values from physical models with reduced scale tested in a wind tunnel. Arte 21 2023, 20, 63–112.
72. Li, J.; Huang, Z.; Zheng, D.; Zhao, Y.; Huang, P.; Huang, S.; Fang, W.; Fu, W.; Zhu, Z. Effect of landscape elements on public psychology in urban park waterfront green space: A quantitative study by semantic segmentation. Forests 2023, 14, 244.
73. Han, T.; Tang, L.; Liu, J.; Jiang, S.; Yan, J. The influence of multi-sensory perception on public activity in urban street spaces: An empirical study grounded in landsenses ecology. Land 2024, 14, 50.
74. Foellmer, J.; Tyler, N. Multisensory approaches to urban pollution: A controlled study of sound-odour interactions and their impact on physiology and perception. Build. Environ. 2025, 285, 113506.
75. Soleimanpour, M.; Alizadeh, O.; Sabetghadam, S. Analysis of diurnal to seasonal variations and trends in air pollution potential in an urban area. Sci. Rep. 2023, 13, 21065.

Figure 1. The research areas and survey points.

Figure 2. Integrated methodological framework.

Figure 3. Quantification of multi-dimensional environmental features. (A) The high-fidelity 3D model enables precise calculation of spatial metrics such as viewshed area (red zone). (B) Semantic segmentation results illustrate distinct visual elements (e.g., vegetation, buildings, sky) across the two study areas. The color legend at the bottom corresponds to the 15 semantic categories identified by the SegFormer model.

Figure 4. Spatial distribution patterns of objective environmental metrics along the sampling points.

Figure 5. Comparative violin plots of subjective perception scores between the River Thames and the River Seine across four dimensions. The dashed lines within the violin plots represent the median (middle line) and the interquartile range (upper and lower lines indicate the 75th and 25th percentiles, respectively).

Figure 6. SHAP summary plots illustrating the feature importance and impact on model output for each perceptual dimension. Each sub-figure consists of a beeswarm plot (left) showing the distribution of SHAP values and feature values (color-coded), and a bar chart (right) ranking the global importance of features based on mean absolute SHAP values.
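
As a reading aid for the quantities plotted in Figure 6, the following is a minimal sketch of how a Shapley attribution and the "mean absolute SHAP value" ranking can be computed. It is not the authors' pipeline: the function names, the toy model `toy_perception_model`, and the single-baseline treatment of absent features are assumptions made for illustration (SHAP implementations for tree models average over background data rather than using one baseline vector).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample; absent features take the baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def global_importance(shap_rows, names):
    """Rank features by mean absolute Shapley value across samples (the bar-chart quantity)."""
    n = len(shap_rows)
    means = {name: sum(abs(row[j]) for row in shap_rows) / n for j, name in enumerate(names)}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def toy_perception_model(v):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * v[0] + v[1] * v[2]


feature_names = ["H/W", "GVI", "SVI"]  # labels borrowed from Table 2; values are made up
samples = [[1.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
baseline = [0.0, 0.0, 0.0]
rows = [shapley_values(toy_perception_model, s, baseline) for s in samples]
ranking = global_importance(rows, feature_names)
```

For each sample the attributions sum to the difference between the prediction and the baseline prediction, which is the property that makes a beeswarm plot readable as a per-point decomposition of the model output.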

Table 1. Summary of acquired data sources and specifications.

Data Category	Specific Data Type	Source/Instrument	Resolution/Scale	Extent/Quantity	Purpose
3D Spatial Data	Oblique Photography Images	DJI Mavic 3 Drone	GSD = 5 cm	Core Riverfront Zone	High-fidelity texture for visual focus
	Contextual 3D Data	Google Earth	-	Peripheral Urban Background	Skyline context & Geometric validation
	Integrated Reality Model	ContextCapture v4.4.10	Relative Accuracy Verified	2 Models (Thames & Seine)	3D Metric Extraction
2D Visual Data	Water-based Panoramic Images	Google Street View API	16,384 × 8192 pixels	200 Sampling Points	Segmentation of Visual Elements

Table 2. Summary of objective environmental indicators: definitions, data sources, and rationale.

Dimension	Indicator Name (Abbr.)	Operational Definition/Formula	Data Source	Brief Rationale
2D Visual	Green View Index (GVI)	Ratio of pixel area identified as ‘Tree’ and ‘Grass’ to total image pixels.	SegFormer (Panoramic Image)	Proxy for “naturalness” and ecological visual quality.
	Sky View Index (SVI)	Ratio of pixel area identified as ‘Sky’ to total image pixels.	SegFormer (Panoramic Image)	Indicator of spatial openness and potential psychological relief.
	Building View Index (BVI)	Sum of ‘Traditional’ and ‘Modern Building’ pixel ratios.	SegFormer (Panoramic Image)	Represents the visual density and dominance of the artificial interface.
	Revetment View Index (RVI)	Ratio of pixel area identified as ‘Revetment’ (riverbank walls).	SegFormer (Panoramic Image)	Quantifies the hardness of the immediate water-land boundary.
	Dynamic Object Index (DOI)	Aggregate ratio of mobile element pixels (People, Boats, Cars, etc.).	SegFormer (Panoramic Image)	Proxy for visual vibrancy and social activity intensity.
3D Spatial	River Width (W)	Shortest perpendicular distance between riverbanks at the sampling point.	Grasshopper (3D Model)	Defines the fundamental horizontal scale of the corridor.
	Building Height (H_avg)	Average vertical height of the first-line buildings along the banks.	Grasshopper (3D Model)	Key metric for the vertical scale of the urban interface.
	Height-to-Width Ratio (H/W)	Ratio of average building height to river channel width at the cross-section.	Grasshopper (3D Model)	Quantifies the “Canyon Effect” and degree of spatial enclosure.
	Viewshed Area (V_3D)	Total visible area (km²) from the observer’s viewpoint.	Grasshopper (3D Model)	Measures 3D spatial permeability and visual scope.
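
The pixel-ratio and H/W definitions in Table 2 can be sketched in a few lines. This is an illustrative reading of the operational definitions only: the function names, the class label strings, and the tiny label map are hypothetical, and the real inputs would be SegFormer label maps and measurements taken from the 3D model.

```python
from collections import Counter


def view_index(label_map, classes):
    """Share of pixels whose label is in `classes`, relative to all image pixels."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return sum(counts[c] for c in classes) / total


def height_to_width(building_heights_m, river_width_m):
    """H/W ratio: average first-line building height divided by river width."""
    h_avg = sum(building_heights_m) / len(building_heights_m)
    return h_avg / river_width_m


# Hypothetical 2 x 5 label map standing in for a segmented panorama.
labels = [
    ["Sky", "Sky", "Tree", "Modern Building", "Water"],
    ["Sky", "Grass", "Revetment", "Traditional Building", "Water"],
]

gvi = view_index(labels, {"Tree", "Grass"})
svi = view_index(labels, {"Sky"})
bvi = view_index(labels, {"Traditional Building", "Modern Building"})
hw = height_to_width([40.0, 60.0], 80.0)
```

Here `gvi` and `bvi` are 0.2, `svi` is 0.3, and `hw` is 0.625. The sketch counts every pixel equally, following the "ratio to total image pixels" wording of the table.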

Table 3. Comparative summary of descriptive statistics (Mean and SD) for objective visual characteristics between the River Thames and the River Seine.

Category	Indicator	River Thames	River Seine	Characteristic Difference
2D Visual	Green View Index (GVI)	M = 3.3%, SD = 3.1%	M = 9.0%, SD = 4.3%	Seine is significantly greener.
	Sky View Index (SVI)	M = 45.5%, SD = 4.9%	M = 38.6%, SD = 3.1%	Thames is more open.
	Building View Index (BVI)	M = 5.7%, SD = 1.4%	M = 5.4%, SD = 2.3%	Thames is uniform; Seine varies at nodes.
	Revetment View Index (RVI)	M = 2.8%, SD = 0.8%	M = 5.0%, SD = 1.9%	Seine has higher revetment visibility.
	Dynamic Object Index (DOI)	M = 0.5%, SD = 0.7%	M = 1.2%, SD = 0.9%	Seine has higher vibrancy/activity.
3D Spatial	River Width (W)	M = 83.6 m, SD = 8.8 m	M = 79.5 m, SD = 17.7 m	Thames is wider and more uniform.
	Building Height (H_avg)	M = 57.1 m, SD = 44.7 m	M = 45.8 m, SD = 13.6 m	Thames has extreme vertical fluctuations.
	Height-to-Width Ratio (H/W)	M = 0.7, SD = 0.6	M = 0.6, SD = 0.2	Thames creates dramatic “urban canyons”.
	Viewshed Area (V_3D)	M = 0.7 km², SD = 0.18 km²	M = 0.2 km², SD = 0.05 km²	Thames has significantly higher permeability.

Note: M = Mean; SD = Standard Deviation.

Table 4. Pearson Correlation Analysis Between Objective features and Subjective Perception Results.

	Building View Index	Green View Index	Sky View Index	Revetment View Index	Dynamic Object Index	Viewshed Area	H/W Ratio	Building Height	River Width
Sense of Affluence	0.719 **	0.407 **	−0.715 **	0.249 **	0.112	−0.523 **	0.683 **	0.637 **	−0.369 **
Vibrancy	0.367 **	−0.005	−0.280 **	0.155 *	0.592 **	0.386 **	0.421 **	0.638 **	0.082
Sense of Boredom	−0.173 **	0.250	0.130 **	0.238	−0.392 **	−0.499 **	−0.200	−0.410	−0.262
Scenic Beauty	0.595 **	0.560 **	−0.688 **	−0.238 *	0.286 *	−0.287 *	0.181	0.216 **	−0.097

Note: * p < 0.05 ** p < 0.01.
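
For readers who want to reproduce coefficients of the kind reported in Table 4, the sketch below computes Pearson's r and attaches significance stars using the same thresholds as the note. The p-value uses the Fisher z-transform, a large-sample normal approximation chosen here to keep the example dependency-free; the article does not state which test was used, and the function names and sample data are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (approximation, needs n > 3)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def stars(p):
    """Significance markers matching the table note: * p < 0.05, ** p < 0.01."""
    return "**" if p < 0.01 else "*" if p < 0.05 else ""


# Made-up indicator and perception scores for five sampling points.
indicator = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(indicator, score)
label = f"{r:.3f} {stars(approx_p_value(r, len(indicator)))}".strip()
```

With only five points the coefficient of 0.8 carries no star, which illustrates why the significance of a given r depends on the sample size as well as on its magnitude.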

Table 5. Random Forest Models’ Performance.

		Sense of Affluence	Vibrancy	Sense of Boredom	Scenic Beauty
Model optimal parameters	max_depth	10	10	10	5
	min_samples_split	6	2	6	2
	n_estimators	100	50	50	100
R²		0.509	0.667	0.425	0.619
MSE		0.080	0.107	0.274	0.134
MAE		0.234	0.238	0.415	0.281

Note: R² = Coefficient of Determination; MSE = Mean Squared Error; MAE = Mean Absolute Error.
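
The three scores in Table 5 can be computed from held-out predictions as sketched below. The hyperparameter values are copied from the table, and their names match keyword arguments of scikit-learn's `RandomForestRegressor`; the `regression_metrics` helper and the example values are hypothetical and stand in for whatever evaluation code the authors used.

```python
# Tuned hyperparameters as listed in Table 5, keyed by perceptual dimension.
RF_PARAMS = {
    "Sense of Affluence": {"max_depth": 10, "min_samples_split": 6, "n_estimators": 100},
    "Vibrancy": {"max_depth": 10, "min_samples_split": 2, "n_estimators": 50},
    "Sense of Boredom": {"max_depth": 10, "min_samples_split": 6, "n_estimators": 50},
    "Scenic Beauty": {"max_depth": 5, "min_samples_split": 2, "n_estimators": 100},
}


def regression_metrics(y_true, y_pred):
    """Return R², MSE and MAE for paired observed and predicted scores."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": ss_res / n,
        "MAE": sum(abs(e) for e in errors) / n,
    }


# Made-up observed and predicted perception scores.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

For these made-up values the helper returns R² = 0.9, MSE = 0.125 and MAE = 0.25. Because MSE squares the errors, it is not on the same scale as MAE, so the two columns of the table should not be compared directly.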

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

