Article

Public Perception of Urban Recreational Spaces Based on Large Vision–Language Models: A Case Study of Beijing’s Third Ring Area

School of Architecture, Tianjin University, Tianjin 300072, China
* Author to whom correspondence should be addressed.
Land 2025, 14(11), 2155; https://doi.org/10.3390/land14112155
Submission received: 21 September 2025 / Revised: 20 October 2025 / Accepted: 27 October 2025 / Published: 29 October 2025

Abstract

Urban recreational spaces (URSs) are pivotal for enhancing resident well-being, making the accurate assessment of public perceptions crucial for quality optimization. Compared to traditional surveys, social media data provide a scalable means for multi-dimensional perception assessment. However, existing studies predominantly rely on single-modal data, which limits comprehensive capture of complex perceptions, and the resulting analyses lack interpretability. To address these gaps, this study employs cutting-edge large vision–language models (LVLMs) and develops an interpretable model, Qwen2.5-VL-7B-SFT, through supervised fine-tuning on a manually annotated dataset. The model integrates visual-linguistic features to assess four perceptual dimensions of URSs: esthetics, attractiveness, cultural significance, and restorativeness. Crucially, it generates textual evidence for its judgments by identifying the key spatial elements and emotional characteristics associated with specific perceptions. By integrating multi-source built environment data with Optuna-optimized machine learning and SHAP analysis, we further decipher the nonlinear relationships between built environment variables and perceptual outcomes. The results are as follows: (1) Interpretable LVLMs are highly effective for urban spatial perception research. (2) URSs within Beijing’s Third Ring Road fall into four typologies: historical heritage, commercial entertainment, ecological-natural, and cultural spaces, with significant correlations observed between physical elements and emotional responses. (3) Historical heritage accessibility and POI density are identified as key predictors of public perception. Positive perception significantly improves when a block’s POI functional density exceeds 4000 units/km² or when its 500 m radius encompasses more than four historical heritage sites. Our methodology enables precise quantification of multidimensional URS perceptions, links built environment elements to perceptual mechanisms, and provides actionable insights for urban planning.

1. Introduction

Urban recreational spaces (URSs) are defined as publicly accessible open spaces, buildings, and related facilities that primarily serve leisure, social interaction, and tourism experiences [1]. As a vital component of urban public space systems, URSs not only function as an important indicator of urban social civilization, but also embody the core dimension of residents’ quality of life, highlighting significant potential in enhancing urban livability and resident well-being [2]. URS perception refers to a comprehensive subjective evaluation derived from an individual’s direct experience, cognitive processing, and behavioral interactions within recreational spaces [3]. As a fundamental metric for assessing recreational experience satisfaction, it directly influences an individual’s intention to revisit and recommend the location, as well as the formation of place attachment [4]. Therefore, accurate evaluation of public perception is crucial for optimizing URS quality and for realizing its profound social value.
Despite the growing recognition of the importance of URS perception assessment, the spatial diversity of URSs poses significant evaluation challenges. Current research mainly focuses on specific URS types. Studies on ecological-natural recreational spaces (e.g., parks, green spaces, and scenic areas) constitute the majority of current perception research [5], focusing primarily on core dimensions such as esthetics, attractiveness, restorativeness, accessibility, and usability [6,7,8]. Considerable research has also utilized the cultural ecosystem services (CES) framework [9] to analyze human–environment affective bonds, where recreational activities, esthetic value, social interaction, and sense of place emerge as the most frequently perceived CES categories [10]. Concurrently, historical heritage spaces have gained scholarly attention as significant URSs that synergistically blend recreational functions with cultural significance. Research efforts have specifically concentrated on visual perception within heritage spaces [11] and cultural meaning [12,13]. Additionally, scholars have investigated public perception in urban pedestrian spaces [14], cultural venues [15], and shopping centers [16].
The advent of social media platforms in recent years has precipitated the emergence of distinctive phenomena characterized by “check-in” behaviors and “Internet Celebrity” effects. These developments have catalyzed the redesign and optimization of urban spaces, and have substantially transformed URS usage patterns [17], comprehensive space evaluations [18], and visitor–resident interaction dynamics [19]. In this context, “Internet Celebrity” check-in destinations have evolved beyond their physical attributes into media that convey spatial narratives and serve as “associative places,” bridging public cognition and emotional responses to urban environments [20]. Concurrently, extensive user-generated content from platforms such as Weibo, Xiaohongshu, and Douyin is characterized by voluntary participation, openness, and shareability. These characteristics make it a valuable resource for analyzing public spatial perception by providing genuine reflections of users’ spatial experiences [21].
In contrast to conventional survey methodologies, social media data (SMD) offer a transformative approach that overcomes traditional limitations in acquiring large-scale, multidimensional perceptual information. Previous methodological approaches employing SMD in urban spatial perception research can be classified into three principal analytical frameworks: text-based analysis, image-based analysis, and integrated text–image analysis. The text-based approach predominantly utilizes computational techniques such as word frequency analysis, topic modeling, and sentiment analysis [22,23] to identify key discussion themes regarding urban spaces from large-scale textual datasets. This approach demonstrates advantages in processing efficiency, rapid assessment of public attitudes, and strong temporal relevance [24]; however, it encounters challenges stemming from textual characteristics including internet slang, sarcastic expressions, and information noise [6]. The image-based approach concentrates on implementing computer vision techniques to automatically classify and characterize visual features in social media images. Particularly effective for visual representation of spatial information, this approach has demonstrated value in quantifying preferences for recreational cultural ecosystem services [25]. However, its limitations stem from the inherent inability of image data to directly convey emotional context, which complicates the precise differentiation of user perception tendencies [26]. Integrated text–image approaches offer a more comprehensive perspective by synergistically analyzing textual and visual information. Early multi-modal fusion studies typically analyzed text and image data independently before correlating the results. Although achieving basic information complementarity, these approaches essentially constitute simple superimpositions of single-modal analyses, failing to reveal deeper semantic relationships between textual and visual data [22,27].
The emergence of large language models (LLMs) has revolutionized natural language processing. Representative models like GPT-4 and LLaMA demonstrate advanced reasoning capabilities in text generation and language understanding. LLMs have demonstrated exceptional performance in task-specific applications through few-shot learning and prompt engineering [28,29]. However, LLMs are inherently limited in processing visual information, which is crucial for urban spatial perception. Large vision–language models (LVLMs) represent a groundbreaking advancement of LLMs. By pretraining on large-scale multi-modal corpora comprising text, images, and other data types, LVLMs demonstrate enhanced cross-modal semantic understanding and joint representation capabilities, opening new pathways for advancing built environment perception assessment through the deep integration of textual and visual information [30]. Following this trend, researchers have increasingly used vision–language tasks to study urban spatial perception. For instance, recent studies have leveraged ChatGPT’s visual reasoning capabilities to automate the evaluation of walkability [31], attractiveness [32], and perceived safety [23] in urban street environments using streetscape imagery. Sun et al. [33] have integrated LLaMA and LLaVA to analyze social media image–text data, identifying the key perceptual factors associated with the popularity of coastal spaces. These studies provide preliminary evidence of LVLMs’ efficacy in urban spatial perception assessment, demonstrating their potential to process complex multi-source data and generate comprehensive insights, which offer valuable technical support for assessing URS perception in this study [34].
Despite these advancements, three critical research gaps remain. First, there is a notable lack of LVLMs specifically designed for multidimensional URS perception assessment. Owing to challenges in data acquisition and measurement complexity, current studies predominantly focus on isolated perceptual dimensions within specific types of spaces (such as parks, green spaces, or historic districts) and fail to establish comprehensive frameworks for systematic, multidimensional URS perception analysis. Furthermore, much of the existing literature relies on generic sentiment analysis APIs derived from natural language processing (NLP) platforms. While these models are optimized for general-purpose applications, they are ill-suited to capturing the subtle and context-dependent aspects of human perception [35]. As a result, the development of domain-specific annotation datasets and specialized deep learning architectures for multidimensional URS perception has been largely overlooked. Second, limited attention has been paid to explainability in identifying perceptual drivers. When examining environmental features that influence perceptual outcomes, many studies employ rudimentary text analysis techniques, such as high-frequency keyword extraction. This approach does not effectively capture contextual semantics, syntactic structures, or inter-word relationships, and is susceptible to irrelevant lexical interference, resulting in limited explanatory power. Explainable LVLMs present a promising alternative by enabling the simultaneous evaluation of perceptual dimensions and the generation of interpretable rationales, thus elucidating the underlying spatial characteristics that shape public perception. Nevertheless, few studies have investigated the integration of established public perception assessment frameworks with explainable LVLMs.
Third, empirical research on the complex nonlinear relationships between public perception and environmental attributes remains insufficient. Most current approaches rely on correlation analyses and linear regression models, which are inherently limited in their ability to capture nonlinear patterns and threshold effects [36]. Moreover, by focusing on data from discrete locations, these studies often overlook the broader urban spatial context and its influence on perceptual responses. Research employing interpretable machine learning methods, such as ensemble models combining Light Gradient Boosting Machine (LightGBM) and Random Forest (RF) with SHapley Additive exPlanations (SHAP) interpretation techniques, to investigate the intricate associations between environmental factors and public perception remains scarce.
To address these gaps, this study employs cutting-edge large vision–language models (LVLMs). We developed an interpretable model, Qwen2.5-VL-7B-SFT, by performing supervised fine-tuning (SFT) on a manually annotated perception dataset. The model is designed to achieve a precise assessment of multi-dimensional perceptions of URSs and to simultaneously generate textual explanations for the spatial features that trigger different public perceptions based on the prediction results. Furthermore, this study employs explainable machine learning techniques and integrates multi-source geospatial data (e.g., POIs, street view imagery, road networks) to investigate nonlinear relationships and threshold effects in public perceptions. This study aims to answer three core research questions:
  • How can advanced LVLMs be optimized for precise multidimensional evaluation of URS perceptions?
  • What typologies characterize URSs, and how do key spatial elements relate to specific affective responses (e.g., pleasure, interest, comfort, nostalgia) across different perceptual dimensions?
  • What nonlinear relationships and threshold effects exist between built environment features and URS perceptions?

2. Materials and Methods

2.1. Study Area

This study focuses on Beijing’s Third Ring Road area as the case study area (Figure 1). As China’s capital and political-cultural center, Beijing represents both a 3000-year urban development legacy and a contemporary showcase of China’s modernization. The city’s rapid urbanization has produced a distinctive urban morphology, characterized by the symbiotic coexistence of historic preservation and modern functional development [37]. The study area encompasses the principal section of Beijing’s “Capital Function Core Area” as defined in the Beijing Urban Master Plan (2016–2035), converging diverse URS typologies ranging from historically significant districts and ecological leisure zones to contemporary commercial hubs. The spatial concentration of resident daily activities and tourist recreational patterns within this area generates particularly rich social media data streams. This unique combination of spatial heterogeneity and behavioral data density makes the Third Ring Road an exemplary research setting for examining differential public perceptions of URSs in global megacities.

2.2. Research Framework

The research framework is illustrated in Figure 2. During the data collection and model training phase, we systematically collected and processed 100,389 user check-in records related to URSs from two major Chinese social media platforms, following rigorous screening and data cleaning procedures. Esthetics, attractiveness, restorativeness, and cultural significance were identified as four core dimensions for evaluating public perceptions of URSs. These dimensions were selected based on a comprehensive assessment of data availability, conceptual distinctiveness, and their established relevance in both academic literature and practical applications. From this dataset, a random sample of 5050 comments was annotated manually to construct a labeled perception dataset across the four dimensions. This annotated dataset was subsequently used for SFT of the advanced large vision–language model Qwen2.5-VL-7B. The resulting fine-tuned model (Qwen2.5-VL-7B-SFT) was then applied to perform a large-scale automated evaluation of perceptual attributes and semantic content across all 100,389 user comments. In the recreational perception interpretation stage, we utilized the interpretive outputs generated by Qwen2.5-VL-7B-SFT to accurately identify spatial features and their associated emotional connotations underlying public perceptions across different dimensions. Furthermore, to examine nonlinear effects on URS perceptions, we adopted the well-established “5D model” from urban studies as the theoretical foundation to systematically quantify characteristics of the built environment. After applying feature selection methods to determine salient predictive variables, multiple machine learning models were developed and optimized using the Optuna hyperparameter tuning framework. The best-performing model, combined with SHAP interpretability techniques, enabled a robust analysis of the nonlinear relationships between built environment attributes and multidimensional URS perceptions.

2.3. Data Acquisition

This study utilizes SMD sourced from two platforms: Xiaohongshu and Weibo. Xiaohongshu is renowned for its user-generated, experience-based rich text and image sharing, while Weibo is distinguished by its real-time updates and extensive user base [38]. The user demographics of these platforms, within the Chinese context, exhibit notably youthful and highly urbanized characteristics. Consequently, the perceptual patterns examined in this study primarily reflect the perspectives of this young and active demographic. Notably, “check-in” behavior is strongly associated with both platforms, where recreational “check-in” content not only reflects high user engagement, but also often includes detailed geolocation data.
Data collection was conducted from May to July 2024 using a custom Python script (integrating browser simulation and Mitmproxy packet capture), following a standardized workflow: (1) keyword-based search, (2) note detail collection, (3) geographic coordinate extraction, and (4) image link retrieval. Through systematic analysis of prevalent recreational tags on the platform, five representative keywords were identified for iterative retrieval: “Beijing Check-in”, “Beijing Weekend Destinations”, “Beijing Tourism”, “Beijing Citywalk”, and “Beijing Shop Exploration”. This methodology yielded a preliminary dataset of 225,524 geotagged entries spanning the entire Beijing metropolitan area. To ensure data quality and representativeness, we implemented a three-tier filtering process: First, we geographically clipped the raw dataset to the study area. Second, we filtered out contradictory data where the same user appeared at different locations simultaneously. Finally, to focus on spontaneous public check-ins, we excluded work-related check-ins and frequent advertisements from merchants. The exclusion criteria were defined as a user checking in at the same location more than three times per week or posting more than three identical notes per week. Following this rigorous filtering process, we obtained 100,389 validated recreational check-in records with precise geographic coordinates within Beijing’s Third Ring Road area. Subsequently, corresponding image data were systematically collected. To optimize processing efficiency and minimize content redundancy, given that multiple images per note often exhibit high similarity, we selectively downloaded the first two images from each post, yielding a final curated image dataset of 172,556 images (see Table 1 for spatial distribution).
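The weekly exclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the record layout `(user_id, location_id, note_text, posting_date)` is hypothetical, "per week" is interpreted as an ISO calendar week, and a user who exceeds either limit is dropped entirely. The text does not specify these details.

```python
from collections import Counter
from datetime import date


def iso_week(day):
    """Return (ISO year, ISO week number) so posts can be grouped by week."""
    year, week, _ = day.isocalendar()
    return (year, week)


def flag_non_spontaneous_users(records, max_per_week=3):
    """Return ids of users who exceed either weekly limit.

    records: iterable of (user_id, location_id, note_text, posting_date);
    this layout is a hypothetical one chosen for the example.
    """
    same_place = Counter()  # (user, location, week) -> number of check-ins
    same_note = Counter()   # (user, note text, week) -> number of identical notes
    for user, location, text, day in records:
        week = iso_week(day)
        same_place[(user, location, week)] += 1
        same_note[(user, text, week)] += 1
    flagged = {key[0] for key, n in same_place.items() if n > max_per_week}
    flagged |= {key[0] for key, n in same_note.items() if n > max_per_week}
    return flagged


def keep_spontaneous(records, max_per_week=3):
    """Drop every record posted by a flagged user (an assumed policy)."""
    records = list(records)
    flagged = flag_non_spontaneous_users(records, max_per_week)
    return [r for r in records if r[0] not in flagged]
```

With this sketch, a user who checks in at the same shop four times within one week is flagged, while a user with a single park check-in is kept. A rolling seven-day window would be an equally plausible reading of the rule.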

2.4. Development of Interpretable LVLMs

2.4.1. Manually Annotated Dataset

From the collected dataset comprising 100,389 URS check-in records, we initially randomly selected 3000 posts for annotation. Using a Python-based labeling platform, we assessed each post’s textual and visual content across four dimensions, where “0” denotes a negative evaluation, “1” indicates a positive evaluation, and “2” represents irrelevant content. To improve LVLM performance and ensure label balance, we augmented the evaluation labels for underrepresented samples in the original annotated dataset, yielding a final dataset of 5050 posts (Table 2). The annotation process was performed collaboratively by two researchers, with discrepancies resolved through iterative discussions to ensure consistency and scientific validity of the training data. Detailed specifications and illustrative examples of the four URS perception themes are provided in Appendix A.

2.4.2. Model Structure

This study developed a customized Qwen2.5-VL-7B-SFT based on Qwen2.5-VL for multidimensional perception analysis of URSs. Qwen2.5-VL, an LVLM released by Alibaba Group on 28 January 2025, exhibits exceptional capabilities in visual recognition, object localization, document parsing, and long-video comprehension. Its native dynamic resolution processing capability, which accommodates images of arbitrary resolutions, renders it suitable for both static and dynamic tasks. The flagship Qwen2.5-VL-72B model has demonstrated comparable or superior performance to leading models such as GPT-4o and Claude 3.5 Sonnet in document and chart understanding, while maintaining robust capabilities in text-based tasks [39]. In this study, we utilized the more compact Qwen2.5-VL-7B variant to balance computational efficiency and model performance.
The architecture of Qwen2.5-VL-7B-SFT comprises three hierarchical layers: data layer, prediction layer, and interpretation layer (Figure 3). The data layer constructs model training inputs through systematic integration of the image–text dataset developed in Section 2.4.1 with carefully engineered prompts. These prompts guide the model to generate outputs meeting predefined quality standards. The prompt design in this layer incorporates two key components: a task definition, which specifies the model’s required computational tasks, and a perceptual feature definition, which provides descriptive guidance for target perceptual characteristics (more details in Appendix A).
The prediction layer takes as input the structured information output by the data layer. The training process utilizes Low-Rank Adaptation (LoRA) for fine-tuning Qwen2.5-VL, which introduces trainable low-rank parameter layers to capture task-specific features while preserving the majority of pretrained model weights. This approach significantly reduces the number of trainable parameters and computational costs, thereby facilitating efficient task transfer [40]. During training, the visual encoder and cross-modal fusion layers remain frozen, whereas trainable LoRA structures are incorporated into both the language model backbone and cross-modal interaction modules for targeted refinement of joint visual–textual representations. As detailed in Table 3, this layer ultimately generates emotional classification labels for the input image–text pairs across specified perceptual dimensions.
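To make the parameter saving of LoRA concrete, the following pure-Python sketch compares the number of trainable parameters for a full weight update with that of a rank-r update, and shows the adapted forward pass y = Wx + scale · B(Ax). The layer size (4096 × 4096) and rank (8) are illustrative assumptions and are not the settings reported in Table 3.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full d_out x d_in update vs. factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)
    return full, low_rank


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)                # frozen pretrained path
    update = matvec(B, matvec(A, x))   # low-rank path
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    full, low_rank = lora_param_counts(4096, 4096, 8)
    print(full, low_rank, full // low_rank)
```

For the illustrative layer, the low-rank factors hold 65,536 parameters against 16,777,216 for the full matrix, a 256-fold reduction. When B is initialized to zero, as is conventional for LoRA, the adapted layer initially reproduces the pretrained output exactly.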
The interpretation layer is designed to provide post hoc text-based rationales for the classification results generated using the prediction layer. Unlike intrinsic interpretability approaches that focus on a model’s internal mechanisms, post hoc interpretation does not require specific constraints on model architecture or training interventions. This offers greater flexibility and generalizability for applications across various pre-trained complex models [41]. Building on this principle, the interpretation layer in this study shifts its analytical focus from the direct parsing of raw image–text data to joint reasoning based on predicted emotion labels and their corresponding textual content. This shift significantly reduces the complexity of interpretation generation and alleviates the model’s computational burden. Methodologically, this layer leverages the language model backbone within the LVLM and incorporates few-shot learning along with Chain-of-Thought (CoT) prompt engineering to generate justifications, thereby producing corresponding explanatory text for each predicted label. The CoT prompt template guides the model through a structured, two-phase reasoning process: text segment localization and keyword extraction. The text segment localization phase identifies comment segments that manifest specific URS perceptual characteristics, while the subsequent keyword extraction phase isolates relevant theme words and their associated modifiers from the identified segments. Theme words represent concrete urban spatial entities or environmental elements that elicit particular perceptions (e.g., “temple”, “park”, “sunset glow”, “lake surface”), whereas modifier words characterize inherent environmental qualities or typical activities within these spaces (e.g., “breathtaking”, “leisurely”, “shop exploration”, “social gathering”). The layer ultimately produces structured explanatory outputs formatted as “text segment + modifier word + theme word”.
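The structured output of the interpretation layer can be pictured as a small record type. The sketch below is hypothetical: the class name, field names, and the example comment segment are invented for illustration, and only the "text segment + modifier word + theme word" layout and the example theme and modifier words come from the description above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    """One rationale produced for a predicted label (hypothetical container)."""
    dimension: str       # perceptual dimension, e.g. "esthetics"
    text_segment: str    # comment segment located in phase one
    modifier_word: str   # quality or activity extracted in phase two
    theme_word: str      # spatial entity or element extracted in phase two

    def render(self):
        """Format as 'text segment + modifier word + theme word'."""
        return f"{self.text_segment} + {self.modifier_word} + {self.theme_word}"


def pair_counts(explanations):
    """Count (theme word, modifier word) pairs across many explanations."""
    return Counter((e.theme_word, e.modifier_word) for e in explanations)
```

For example, `Explanation("esthetics", "the sunset glow over the lake was breathtaking", "breathtaking", "sunset glow").render()` yields the three parts joined by plus signs, and `pair_counts` gives the theme–modifier co-occurrence counts that later clustering and word-cloud steps could draw on.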

2.4.3. Model Evaluation

The prediction layer uses the cross-entropy loss function to compute the training loss; see Equation (1):
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \times \log(p_i) + (1 - y_i) \times \log(1 - p_i)\right) \tag{1}$$
where $N$ represents the sample size, $y_i$ denotes the ground truth label, and $p_i$ indicates the predicted probability for the positive class. The model evaluation employs four performance metrics: Accuracy (ACC), Precision, Recall, and F1 score.
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\ Score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}$$
TP, TN, FP, and FN correspond to true positives, true negatives, false positives, and false negatives, respectively. To ensure rigorous comparative analysis, we selected InternVL-3.0 as our primary baseline, an LVLM with a parameter count comparable to that of Qwen2.5-VL. Developed by Shanghai AI Laboratory and released on 16 April 2025, InternVL-3.0 attains state-of-the-art results on 12 standardized benchmarks through its multi-stage pretraining architecture, including specialized evaluations for expert-level and multi-modal assessments. The model shows particular efficacy in specialized applications, including graphical user interface (GUI) systems, architectural plan analysis, and spatial cognition tasks [42]. The control model’s training protocol exactly replicated our experimental parameters: LoRA fine-tuning with AdamW optimization (learning rate = 0.0001, batch size = 16) and frozen visual encoder weights. For reference, we incorporated the original Qwen2.5-VL base model in zero-shot benchmarking to assess performance improvements from SFT on target tasks.
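The loss and the evaluation metrics of this subsection can be computed with a few lines of standard-library Python. The sketch uses the standard definitions of binary cross-entropy, accuracy, precision, recall, and F1; the function names, the clipping constant `eps`, and the zero-division fallbacks are assumptions added for the example.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels given positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}
```

As a quick check with made-up counts TP = 8, TN = 7, FP = 2, FN = 3, the accuracy is 15/20 = 0.75, the precision is 8/10 = 0.8, the recall is 8/11, and the F1 score is 16/21, which equals 2TP / (2TP + FP + FN).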

2.4.4. Topic Analysis Based on the Interpretation Layer

To further uncover latent explanatory sub-themes within each perceptual dimension, we implemented a thematic clustering workflow after acquiring modifier words and theme words associated with specific perceptions from each comment:
(1) Synonym consolidation: High-frequency keywords were semantically integrated using a predefined synonym dictionary to eliminate terminological inconsistencies and improve analytical accuracy. For instance, variants including “coffee shop”, “café”, and “coffee bar” were standardized as the canonical form “coffee shop”.
(2) Keyword vectorization: We employed Tencent AI Lab’s Chinese Word and Sentence Embedding Corpus to convert both keyword categories into 200-dimensional semantic vectors. This comprehensive corpus provides vector representations for over 8 million Chinese lexical items, effectively establishing a high-dimensional semantic space for subsequent cluster analysis while preserving nuanced linguistic relationships.
(3) Thematic term clustering: The K-means clustering algorithm was implemented on the vectorized thematic terms, with the optimal cluster number determined through a combined evaluation of the elbow method and silhouette coefficient. For visualization, we applied UMAP to reduce the high-dimensional vectors to two-dimensional space, generating scatter plots where point diameters correspond to term frequencies and color gradients represent distinct thematic clusters.
(4) Modifier word cloud generation: Within each perceptual cluster, we generated weighted word clouds based on modifier term frequency distributions, using cluster-specific color schemes. This approach effectively reveals the systematic relationships between spatial elements represented by theme terms and their associated emotional qualities captured by modifier terms, enabling comprehensive perception deconstruction.
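The first step of this workflow, synonym consolidation followed by frequency counting, can be illustrated as follows. The dictionary contents and function names are hypothetical, and English terms stand in for the Chinese keywords used in the study; the resulting frequencies are the kind of weights that could size scatter-plot points or word-cloud entries.

```python
from collections import Counter

# Hypothetical synonym dictionary mapping each variant to its canonical form.
SYNONYMS = {
    "café": "coffee shop",
    "coffee bar": "coffee shop",
}


def consolidate(keywords, synonyms=SYNONYMS):
    """Normalize case and whitespace, then map variants to canonical forms."""
    cleaned = (k.strip().lower() for k in keywords)
    return [synonyms.get(k, k) for k in cleaned]


def term_frequencies(keywords, synonyms=SYNONYMS):
    """Count canonical terms after consolidation."""
    return Counter(consolidate(keywords, synonyms))
```

Applied to `["Coffee shop", "café", "coffee bar", "terrace"]`, the three coffee-related variants collapse into a single "coffee shop" entry with a count of three, and "terrace" keeps a count of one.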

2.5. Exploring the Nonlinear Influencing Mechanisms of the Built Environment on URS Perception

2.5.1. Variable Selection

Considering the scale effect and spatial heterogeneity of urban spaces, this study adopts the “urban street block” as the fundamental unit of analysis, defined as the smallest urban plot enclosed by main roads, secondary roads, and branch roads. Grounded in the actual urban fabric, this unit effectively captures the internal morphological characteristics of the block and provides a consistent basis for the quantitative comparison of multi-dimensional spatial perceptions. Using the spatial density of positive perceptions across various dimensions within URSs at the block level as the dependent variable, we selected features based on the “5D model” [43]. This model encompasses density, diversity, distance to transit, design, and destination accessibility; detailed calculation methods are provided in Table 4. For design-related indicators, we obtained batch street view imagery within Beijing’s Third Ring Road using the Baidu Map API and analyzed streetscape composition using the Mask2Former semantic segmentation model pre-trained on the Cityscapes dataset. This approach facilitated pixel-level quantification of 19 environmental elements (e.g., sky, buildings, walls, roads, sidewalks, trees) to calculate their proportional coverage within the streetscape imagery.
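Once a segmentation model has assigned a class id to every pixel, the proportional coverage of each streetscape element reduces to a counting step. The sketch below assumes the mask is available as a nested list of integer class ids; the three class names in the usage note are only an illustrative subset of the 19 elements, and the function name is hypothetical.

```python
from collections import Counter


def class_proportions(label_mask, class_names):
    """Share of pixels per class in one segmented street view image.

    label_mask: 2D list of integer class ids, one per pixel.
    class_names: list where index i is the name of class id i.
    """
    counts = Counter(pixel for row in label_mask for pixel in row)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in class_names}
    return {name: counts.get(idx, 0) / total for idx, name in enumerate(class_names)}
```

For a toy 2 × 3 mask `[[0, 0, 1], [2, 2, 2]]` with class names `["road", "sidewalk", "building"]`, the proportions are 2/6, 1/6, and 3/6. Block-level indicators could then be obtained by averaging these proportions over the images sampled within each block, although the aggregation rule is not stated in this section.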

2.5.2. Machine Learning Modeling

To investigate the key influencing factors and underlying mechanisms associated with multidimensional URS perceptions, we adopted a four-stage analytical framework. In the initial stage, we performed feature selection using both the Boruta algorithm and recursive feature elimination (RFE) on the 22 variables for each perceptual dimension. Subsequently, the selected feature subsets were incorporated into four machine learning algorithms: eXtreme Gradient Boosting (XGBoost), Gradient Boosted Decision Trees (GBDT), LightGBM, and RF. These algorithms have proven effective for nonlinear modeling and overfitting prevention in urban perception studies [44,45,46]. During model training, the 2026 street block samples were divided into training and test sets at an 8:2 ratio, while hyperparameter optimization was automated using the Optuna framework. Model performance was assessed through three evaluation metrics, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²), which guided the selection of optimal prediction models. In the final stage, SHAP was employed to interpret the complex relationships between built environment characteristics and URS perceptions (more details in Appendix B).
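The 8:2 split and the three regression metrics can be sketched with the standard library alone. This is an illustration rather than the authors' pipeline: the shuffling seed, the rounding of the test-set size, and the function names are assumptions, and the model fitting, Optuna tuning, and SHAP steps are deliberately left out.

```python
import math
import random


def train_test_split_indices(n, test_ratio=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_ratio)  # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R-squared for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }
```

With observations `[1, 2, 3, 4]` and predictions `[1, 2, 3, 6]`, the RMSE is 1.0, the MAE is 0.5, and R² is 1 − 4/5 = 0.2. Note that R² is undefined when all observations are identical, a case this sketch does not guard against.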

3. Results

3.1. Training and Evaluating the Qwen2.5-VL-7B-SFT

Table 5 reports the performance of Qwen2.5-VL-7B-SFT in URS perception evaluation tasks. The model attained high ACC scores across the four perceptual dimensions: esthetics (0.935), attractiveness (0.872), culturality (0.937), and restorativeness (0.915). These scores represent significant improvements over the base Qwen2.5-VL model’s zero-shot performance, demonstrating the effectiveness of SFT. Notably, the model consistently surpassed InternVL3-8B-SFT in all metrics: Accuracy, Precision, Recall, and F1 score. These results collectively indicate that Qwen2.5-VL-7B-SFT provides reliable URS perception assessment capabilities.
Our model’s automated analysis of 100,389 social media posts revealed that 45,308 comments (45.1%) were classified as Category 2 (irrelevant content) because they did not relate to the predefined perceptual dimensions. This pattern reflects fundamental characteristics of social media check-in behavior, where many users primarily engage in location tagging or social interaction rather than environmental evaluation. These posts typically contain objective scene descriptions (e.g., “crowded on weekends”, “Beijing Hutongs”) or event records (e.g., “dinner with friends”, “successful check-in”), without subjective environmental assessments. Following data filtering, the analysis retained 55,081 valid perception-related comments, constituting 54.9% of the original dataset. As shown in Table 6, positive evaluations were most prevalent for attractiveness at 44.3% and restorativeness at 38.1%, indicating strong user appreciation of these URS qualities. Positive assessments were less common for esthetics at 18.4% and culturality at 17.6%, likely because these dimensions require more sophisticated esthetic discernment and cultural interpretation. Negative evaluations remained consistently low across all dimensions, ranging from 0.4% to 3.9%. Notably, restorativeness showed the highest negative evaluation rate at 3.9%, which is 4.9 to 9.8 times the rates of the other dimensions. This suggests that shortcomings in restorative environmental features are particularly noticeable to users. Given the low absolute and relative frequency of negative evaluations, subsequent analyses focused exclusively on positive assessment data.
Figure 4 illustrates the block-scale spatial distribution patterns of positive evaluation density for the four perceptual dimensions through ArcGIS mapping. The esthetics and attractiveness dimensions exhibit significant spatial clustering. Their density peaks are strongly correlated with prominent historic districts such as Shichahai, Nanluoguxiang, and Dashilan. Notably, major urban open spaces, including Temple of Heaven Park and Yuyuantan Park, serve as focal points for esthetic perceptions. The attractiveness dimension forms dual cores in both the traditional commercial-tourism zone of Wangfujing-Qianmen and the modern consumption area of Sanlitun. In comparison, culturality and restorativeness display relatively uniform distributions within Beijing’s old city, and their spatial patterns closely correspond to traditional hutong network configurations. These findings collectively demonstrate that the old city functions as the primary spatial container for multidimensional perceptual values. Meanwhile, specialized urban functional zones and ecological-natural spaces create distinct spatial differentiation patterns across various perceptual dimensions.

3.2. Thematic Analysis of URS Perception

The interpretation layer’s integration of CoT prompting and few-shot learning effectively filters irrelevant information, allowing for the accurate identification of urban spatial features that correlate strongly with users’ specific perceptual responses. Based on this framework, we performed thematic clustering analysis of positive evaluations across all perceptual dimensions.
The esthetics dimension revealed four distinct clusters (Figure 5). Historic districts that present the ancient capital’s nocturnal charm, including “Shichahai”, “Wudaoying”, and “Guozijian” (Cluster #1), along with traditional architectural elements such as “vermilion wall”, “traditional architecture”, and “glazed tile” (Cluster #4), collectively demonstrated the dominant role of heritage-driven visual appreciation in esthetic perception. This form of appreciation stems from the historical sense and esthetic uniqueness conveyed by traditional neighborhoods and buildings that function as cultural symbols. Natural weather phenomena, including “sunset”, “evening glow”, and “dusk”, together with seasonal flora such as “ginkgo”, “magnolia”, and “peach blossoms” (Cluster #2), although not permanent built environment features, significantly enhanced esthetic experiences derived from urban blue–green infrastructure [47,48]. Furthermore, specific typologies of urban spaces represented by “Monet Garden”, “coffee shops”, and “terrace” (Cluster #3), frequently characterized as “atmospheric”, “beautiful”, and “charming”, indicate their significant potential for creating distinctive visual experiences. These spaces essentially represent a form of commercially curated visual consumption, catering to contemporary urban dwellers’ pursuit of heterogeneous and distinctive visual experiences.
The attractiveness dimension revealed six distinct clusters (Figure 6). Modern consumption spaces with distinctive stylistic identities or functional attributes, including cultural venues offering “immersive” and “awe-inspiring” experiences (Cluster #2), “cafe shops”, “bookstores”, and “bars” described as “internet celebrity” or “industrial style” (Cluster #4), “gardens”, “malls”, and “hotels” characterized as “design-forward”, “large”, and “luxurious” (Cluster #6), and event-based elements or emerging entertainment spaces that create “joyful” and “youthful” experiences, such as “bazaar” and “talk shows” (Cluster #3), all effectively enhance venue attractiveness and stimulate exploration. The core appeal of these spaces lies in their ability to craft distinctive place imagery through strong stylistic identities or deeply engaging experiential programs, effectively addressing younger audiences’ aspirations for self-expression and novel sensory engagement. Simultaneously, high-traffic areas such as “Sanlitun”, “Nanluoguxiang”, and “Houhai” (Cluster #1) attract crowds for “photo-taking” and “check-in” behaviors due to their social media propagation value. The act of checking in at these recognized “internet-famous landmarks” reflects both the public’s need for leisure experiences and daily enjoyment and their desire for social recognition and inclusion within group practices [49]. Additionally, historic built environments that carry historical memories (Cluster #5), including “Siheyuan”, “drum and bell tower”, and the “church” described as “mysterious”, “antique”, or “magnificent”, become important sources of attractiveness by creating unique place imagery. These heterogeneous environments, which contrast markedly with contemporary urban routines, fulfill deep-seated psychological needs for exploration and nostalgia.
The culturality dimension identified five primary cultural sources and their spatial manifestations based on Chen et al.’s [12] cultural cognition framework for Beijing’s historic urban areas (Figure 7). Among these, Beijing-style culture (Cluster #4), predominantly manifested in the historical street network texture, and ancient capital culture (Cluster #5), represented by historical relics such as “temples”, “central axis”, and “former residences”, constitute significant sources of culturality. Both are rooted in the city’s long historical accumulation and regional characteristics, forming the core of local identity. Celebrity culture (Cluster #1), associated with historical figures such as “Mei Lanfang”, “Lao She”, and “Li Dazhao”, has established a more emotionally warm cultural connection by concretizing abstract history into vivid character narratives and relying on their former activity sites. This cultural form is often closely intertwined with historical evaluations and collective memories such as “loyal spirit”, “reputation”, and “humiliation”. Furthermore, Red culture (Cluster #2), centered on national symbols such as the “Great Hall of the People”, the “Five-Star Red Flag”, and “flag-raising ceremonies”, evokes transcendent collective emotions and national identity through the sublime nature of its political landmarks and ritual activities. These experiences are frequently expressed through terms like “prosperity”, “greatness”, and “pride”. Meanwhile, innovative culture (Cluster #3), characterized by “intangible cultural heritage”, “exhibitions”, and “museums”, provides a contemporary interpretation and display of traditional cultural resources. It thereby enables the public to gain systematic cultural cognition through evaluations such as “precious”, “artistic”, and “profound”.
The restorativeness dimension identified three characteristic clusters (Figure 8). Commercial spaces, including “coffee shops”, “bars”, and “terraces” (Cluster #1), function as “third places” by providing comfortable and secure environments, enabling individuals to obtain social support and stress relief through activities described as “chatting”, “gathering”, and “drinking”. Outdoor environments comprising physical spaces such as “rooftops”, “courtyards”, and “gardens” and integrated with natural elements like “autumn”, “sunshine”, and “breeze” (Cluster #2) create tranquil, relaxing, and psychologically safe settings through their open and bright vistas. These environments are posited to reduce cognitive load, thereby enhancing psychological restorativeness [23], and frequently evoke emotional experiences described as “pleasant”, “romantic”, and “peaceful”. Furthermore, in areas such as historical alleys and heritage parks (Cluster #3), terms such as “Citywalk”, “strolling”, “leisurely”, and “quiet” are frequently mentioned. This pattern suggests a tendency among participants to establish a positive interaction between physical activity and mental wandering through walking. This process allows them to perceive and understand the city’s cultural heritage, artistic ambiance, and daily life scenes in a more relaxed state.

3.3. Analysis of Influencing Factors on URS Perception

This study integrated optimal feature subsets identified through both Boruta and RFE screening methods with four machine learning algorithms. Employing the Optuna framework for hyperparameter optimization, we determined the optimal models for each perceptual dimension. The comparative analysis revealed that RFE-selected feature subsets generally surpassed Boruta-selected subsets in model performance, with R² improvements ranging from 0.008 to 0.066 across most models (see Appendix B). When the RFE-optimized feature subsets were applied uniformly, predictive performance still varied across perceptual dimensions. LightGBM emerged as the superior algorithm for modeling the esthetics, attractiveness, and culturality dimensions, while XGBoost achieved optimal performance for restorativeness. Consequently, this study adopted the LightGBM model, which exhibited more robust overall fitting performance, to examine influencing factors and nonlinear effects in public perception based on the RFE-optimized feature subsets.
SHAP analysis was employed to quantify the contribution of various built environment characteristics to each perceptual dimension (Figure 9). For esthetics, AHH and PFD demonstrated the highest importance, followed by NTAD and EI. Attractiveness was primarily governed by PFD, with secondary influences from AHH, RND, and PR. Culturality showed a similar pattern to esthetics, dominated by AHH and PFD but supplemented by RND and PR. Restorativeness maintained PFD and AHH as predominant factors while incorporating significant contributions from RND and EI. To characterize these nonlinear relationships, we implemented SHAP dependence plots with Locally Weighted Scatterplot Smoothing curves for enhanced interpretation.
Regarding density indicators, PFD demonstrated a crucial role across all perceptual dimensions, exhibiting significant nonlinear positive correlations. The analysis revealed a threshold effect: when the PFD fell below approximately 4000 units/km², it had negative effects on perceptual evaluations; however, above this threshold, the marginal effects transitioned to positive.
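One way to read such a threshold off a SHAP dependence plot is to smooth the SHAP values along the feature axis and locate where the smoothed curve changes sign. The sketch below is illustrative only: it substitutes a simple centered moving average for the Locally Weighted Scatterplot Smoothing used in the study, it runs on synthetic values rather than the study's data, and the function names `moving_average` and `zero_crossing` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def zero_crossing(points: Sequence[Tuple[float, float]], window: int = 3) -> Optional[float]:
    """Feature value where the smoothed SHAP curve first turns from negative to non-negative."""
    ordered = sorted(points)
    xs = [x for x, _ in ordered]
    ys = moving_average([y for _, y in ordered], window)
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if y0 < 0 <= y1:
            # Linear interpolation between the two bracketing points.
            return x0 + (x1 - x0) * (0 - y0) / (y1 - y0)
    return None


# Synthetic (feature value, SHAP value) pairs, not taken from the study.
synthetic = [(x, (x - 4000) / 10000) for x in range(1000, 8000, 1000)]
threshold = zero_crossing(synthetic)
print(threshold)
```

The function returns `None` when the smoothed curve never changes sign, which is a useful signal that a feature has no threshold of this kind in the observed range.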
Regarding distance to transit indicators, RND showed a nonlinear positive correlation with public perception. When RND was below approximately 50 km/km², it had negative effects on perception; however, above this threshold, the effects became positive. NBSD exhibited a weak positive correlation with culturality, possibly because scenic areas with higher perception levels tend to be located in larger neighborhood units, resulting in greater distances between block centers and bus stops.
Regarding design indicators, SVF showed a positive correlation with esthetics. When SVF values were low, they suppressed public perception; however, when SVF exceeded approximately 0.15, the marginal effects became positive, leading to significant improvements in esthetic perception. For GVI, values below 0.1 had positive effects on attractiveness, likely because traditional alleys in the study area generally exhibited low greening coverage; however, this did not inhibit overall positive perception in these areas. When GVI exceeded 0.3, the marginal effects shifted from negative to positive, indicating that sufficient visible greenery actively enhances environmental perception. PR exhibited an initial positive and then negative correlation with both attractiveness and culturality. When PR was below 1.5, it showed positive effects; between 1.5 and 2.5, the positive effects gradually weakened; and beyond 2.5, the marginal effects turned negative. EI demonstrated an initial positive and then negative correlation with both esthetics and restorativeness. It showed prominent positive effects within the 3.0–6.0 range, while excessive enclosure did not further enhance positive perception and could even induce negative impacts such as oppression and monotony.
Regarding destination accessibility indicators, AHH emerged as the dominant factor across all perceptual dimensions, with an average contribution rate of 21.1%. When the number of historic heritage sites within a 500 m walking distance of a block was below four, it exerted a negative influence on public perception. SHAP values increased sharply with additional historic heritage sites, stabilizing when approximately seven heritage sites were present nearby. NTAD showed positive correlations with esthetics, attractiveness, and restorativeness. At lower values of NTAD, negative effects were observed across these three dimensions, potentially due to disturbances common in commercial districts, such as noise, crowding, and pedestrian interference, which degraded the public experience; when NTAD exceeded approximately 1.0 km, the marginal effects transitioned from negative to positive. Notably, NTAD exhibited less pronounced positive effects on attractiveness and restorativeness than on esthetics.
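An accessibility count of this kind can be approximated by counting heritage sites within a fixed radius of a block centroid. The sketch below assumes straight-line (great-circle) distance rather than network walking distance, which may differ from how AHH was computed in the study, and the function names and coordinates are hypothetical placeholders rather than the study's data.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def count_within(center: Point, sites: Iterable[Point], radius_m: float = 500.0) -> int:
    """Count sites whose distance from the center does not exceed radius_m."""
    return sum(1 for site in sites if haversine_m(center, site) <= radius_m)


# Placeholder coordinates for illustration only.
block_centroid = (39.9400, 116.3900)
heritage_sites = [
    (39.9410, 116.3900),  # roughly 111 m north
    (39.9440, 116.3900),  # roughly 445 m north
    (39.9450, 116.3900),  # roughly 556 m north, outside the radius
]
n_sites = count_within(block_centroid, heritage_sites)
print(n_sites)
```

Swapping `haversine_m` for a network-distance function would bring the count closer to a true walking-distance measure without changing `count_within`.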
Regarding diversity indicators, PFM demonstrated no statistically significant impact on public perception in this study. SHDI showed negative correlations with culturality and restorativeness, although its explanatory power reached only 6.52%. While these findings initially appear inconsistent with Jacobs’ diversity theory, when considered in conjunction with the distinctive characteristics of the study area, the relatively low visual complexity in the old city may enhance cultural identity and the sense of place through strengthened spatial narrative coherence and overall ambiance.
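For readers unfamiliar with the index, the sketch below computes a Shannon diversity value. It assumes that SHDI denotes the standard Shannon index, H = -sum(p_i * ln p_i) over category proportions; the definition is not restated in this section, so this is an assumption, and the category counts are invented for illustration.

```python
import math
from typing import Sequence


def shannon_diversity(counts: Sequence[float]) -> float:
    """Shannon diversity H = -sum(p * ln p) over non-zero category proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Invented category counts: an even mix is more diverse than a dominated one.
uniform_mix = shannon_diversity([25, 25, 25, 25])
dominated_mix = shannon_diversity([85, 5, 5, 5])
print(uniform_mix > dominated_mix)
```

Under this definition, a lower value corresponds to a scene or block dominated by fewer element types, which is the sense in which lower diversity can coincide with stronger visual coherence.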

4. Discussion

4.1. Methodological Contribution

SMD, characterized by its large-scale availability, cost-effectiveness, and voluntary generation, offers multidimensional insights for URS perception research, encompassing macro-level patterns through micro-level experiences. Such data have become a primary source for analyzing public behavioral preferences and emotional responses [19,50]. Utilizing extensive datasets from two leading Chinese social media platforms, Xiaohongshu and Weibo, this study employs interpretable LVLMs to conduct a comprehensive analysis of multidimensional public perceptions toward URSs.
Previous studies on urban space perception predominantly relied on single-modal data analysis, focusing exclusively on either textual or visual data. Text-based approaches focused on semantic mining of unstructured comment data, whereas image-based approaches primarily employed computer vision techniques to extract scene characteristics. However, single-modal approaches fail to capture the synergistic effects between textual and visual data, limiting their ability to comprehensively characterize the public’s complex perceptual experiences. Recent studies have demonstrated the growing potential of vision–language fusion methodologies in urban space perception research [32,51,52]. This study utilizes a high-quality dataset comprising 5050 manually annotated entries and employs cutting-edge LVLM technology to develop the Qwen2.5-VL-7B-SFT model through LoRA fine-tuning, facilitating automated evaluation and in-depth semantic analysis of 100,389 image–text pairs. Through cross-modal semantic alignment, the proposed model effectively identifies complex urban space perception scenarios and accurately evaluates four distinct types of public perceptions regarding URSs, achieving an average accuracy of 91.4%.
To evaluate the cross-city applicability of the Qwen2.5-VL-7B-SFT model, we designed an additional few-shot comparative experiment, selecting Tianjin—geographically adjacent to Beijing yet distinct in urban character—as a comparative case. Tianjin is renowned for its unique modern colonial architecture, vibrant local life, and open port-city culture. Following consistent data preprocessing standards, we collected and annotated 500 image–text data samples from recreational spaces within Tianjin’s historic urban area to form an independent test set. Inference with the Qwen2.5-VL-7B-SFT model achieved an average accuracy of 91.2% on the Tianjin dataset (see Appendix C for details), a result highly comparable to the 91.4% accuracy on the Beijing Third Ring Road test set. This outcome indicates that, despite significant differences in built environment characteristics and cultural contexts between the two cities, the model’s learned multi-modal perceptual logic exhibits strong robustness and cross-regional generalization capabilities, enabling its adaptation to diverse urban spatial contexts.
Although contemporary advanced LVLMs can process multi-modal data concurrently, their decision-making processes are frequently characterized as “black boxes” owing to their limited traceability, which makes enhanced model interpretability essential [53]. In analyzing model predictions, this study capitalizes on LVLMs’ robust textual sentiment analysis capabilities. The interpretation layer incorporates few-shot learning and two-stage CoT prompt engineering to construct a comprehensive post hoc explanation framework. This framework first identifies key textual segments (e.g., “The Hall of Prayer for Good Harvests looks absolutely stunning when illuminated”) and then decomposes them into perceptual subjects (“Hall of Prayer for Good Harvests”) and their associated emotional descriptors (“illuminated”, “stunning”). This methodology effectively circumvents the semantic noise prevalent in conventional word-frequency statistics or topic modeling [54]. It achieves a clear separation between keywords denoting physical spatial attributes and those representing subjective sensory experiences. Consequently, it clarifies the semantic relationships between spatial elements and emotional features. These methodological innovations furnish novel technical pathways and analytical perspectives for future urban space perception research utilizing multi-modal data.

4.2. Thematic Composition of URS Perception

Based on the spatial elements identified through URS perception analysis, the primary recreational spaces within Beijing’s Third Ring Road area can be classified into four distinct categories: historic heritage spaces, commercial entertainment districts, ecological-natural spaces, and cultural facilities. These URS categories significantly influence public engagement behaviors and experiential perceptions through their unique environmental characteristics and functional orientations, creating differentiated perceptual landscapes. Historic heritage spaces, represented by “Hutongs”, “Forbidden City”, and “Drum and Bell Tower”, demonstrate the highest levels of public recognition and attention. Their architectural style, color schemes, and light–shadow effects constitute key attributes affecting their esthetic quality, consistent with the well-established esthetic value of urban historic environments [55,56]. The frequent occurrence of user-generated descriptors including “check-in”, “photogenic”, and “interesting” confirms the strong appeal of these spaces as premier sightseeing destinations. Concurrently, recurring expressions of identification with Beijing’s local culture and imperial heritage strongly demonstrate the pivotal role of historic heritage spaces in shaping local cultural identity [57,58]. Notably, the public strongly prefers to experience these historic environments through spontaneous walking activities, popularly termed “city walks”, from which they derive significant enjoyment and satisfaction. Nevertheless, this study also reveals that negative evaluations regarding attractiveness and esthetics are frequently associated with physical environmental conditions described as “dilapidated” and “desolated”. Criticism concerning cultural value stems from urban renewal practices characterized as “vanished” and “demolished”, particularly in cases involving demolition of “Hutongs”, resulting in disruptions to collective memory and cultural continuity [59]. Additionally, excessive commercialization that compromises authenticity has emerged as another significant focus of public concern [60]. Negative perceptions related to restorative quality are predominantly linked to overcrowding, reduced thermal comfort, and physical fatigue (see Appendix C).
Commercial entertainment spaces constitute another important category of recreational spaces that evoke positive public perceptions, particularly in terms of attractiveness, esthetics, and restorativeness. Descriptors including “atmospheric”, “internet celebrity”, and “design-forward” are strongly associated with modern consumption spaces, reflecting contemporary consumers’ preference for personalized, high-quality environments [22]. These spaces have transformed from purely transactional venues into “third places” [61], serving multiple functions such as social leisure, temporary workspaces, and relaxation. Keywords such as “comfortable”, “chatting”, and “resting” demonstrate these spaces’ capacity to facilitate psychological and emotional restoration. Beyond functionally defined URSs, ephemeral urban elements and emerging entertainment spaces [62] offer users joyful, liberating, and highly interactive experiences. Negative perceptions of commercial entertainment spaces mainly arise from overcrowding, noise pollution, monotonous content or formats, and excessively high pricing.
Ecological and natural recreational spaces, including parks, green spaces, and waterfront areas, serve as significant sources of esthetic perception and demonstrate notable psychological benefits through their restorative qualities. Analysis reveals that keywords such as the “20min park effect” frequently appear in reviews, reflecting a common public desire for “being away”—that is, to temporarily detach from daily routines and immerse themselves in environments characterized by a soft fascination with natural elements such as the sunset, breeze, and sunlight. The coherent organization of these spaces provides sufficient scope for immersive experience. The observed harmony between individual behaviors and environmental rhythms demonstrates high compatibility, often manifesting as emotional feedback described as “peaceful”, “healing”, and “enjoyable”. This empirically corroborates the fundamental premise of Attention Restoration Theory, which posits that natural spaces, with their low cognitive-load esthetic appeal, can effectively facilitate the restoration of mental states [63]. Consequently, these ecological spaces function as indispensable “psychological buffers” in high-density urban contexts. Conversely, factors such as vegetation withering or inclement weather may diminish environmental coherence and soothing charm, thereby impairing their restorative function by disrupting the essential components of a restorative experience.
Cultural spaces, including museums, memorial halls, and art galleries, predominantly shape public perceptions of culturality and attractiveness. Contemporary public cultural venues have transformed from traditional educational exhibition spaces into visitor experience-centered public cultural environments [64]. Thoughtfully designed immersive environments and meaningful exhibition content effectively evoke visitors’ emotional engagement and substantially improve cultural assimilation [65]. Negative experiences primarily arise from insufficient exhibition interactivity or unengaging content, leading to visitor disinterest.

4.3. Effects of Built Environmental Characteristics on URS Perceptions

This study reveals nonlinear relationships and threshold effects between the built environment and URS perception. AHH and PFD were identified as the factors most strongly associated with URS perception. Clusters of historical and cultural heritage strengthen place identity by evoking cultural recognition and collective memory, consistent with Jacobs’ assertion regarding historic buildings’ essential role in fostering vibrant neighborhoods [66]. Existing research similarly demonstrates strong correlations between historic structures and both pedestrian activity intensity [67] and neighborhood spatial vitality [68]. Concurrently, our thematic analysis of URS perception confirms that the distinctive architectural features and spatial environments of historic heritage sites constitute important sources of positive public perception. As a highly influential characteristic, PFD exhibits a threshold effect, with URS perception showing significant improvement once the density exceeds 4000 units/km². This indicates that high functional density enhances both service accessibility and positive spatial experiences through facilitating diverse social interactions and activity opportunities [69].
Regarding spatial morphological elements, elevated RND demonstrates robust correlations with favorable public perceptions, a relationship consistently confirmed in empirical studies [70]. Optimized RND contributes to enhanced spatial vitality and perceptual quality through improved intra-zonal circulation efficiency and superior spatial connectivity. Correspondingly, SVF manifests a nonlinear positive association with URS esthetic evaluation, corroborating established research demonstrating that open spaces with moderate building heights generate expansive visual fields that enhance perceptual comfort [46,69]. Although the importance of GVI is emphasized in several studies, its contribution in our research is relatively limited, showing a noticeable influence only in the dimension of spatial attractiveness. Our analysis further reveals that moderate EI positively influences both esthetics and restorativeness. Conversely, PR of the block surpassing 3.0 negatively affects attractiveness and cultural identity metrics. These findings substantiate the existing theoretical frameworks which associate excessive density with increased psychological stress and reduced spatial clarity and visual permeability [71].
Contrary to previous studies [72], NTAD exhibited no significant effect on public perception measures. Dependence plot analysis indicated that greater distances from core commercial zones were associated with enhanced perceptual experiences. This implies that cultural experience values in historic districts may create spatial premium effects that supersede conventional commercial convenience requirements. Moreover, diversity metrics show generally weak correlations with URS perception, while higher SHDI values demonstrated negative associations with public perception. These results suggest that visual coherence may more effectively support cultural narrative continuity than excessive element diversity.

4.4. Research Limitations and Future Directions

First, the SMD sample exhibits inherent representational biases. The data sources primarily rely on Weibo and Xiaohongshu, platforms whose user demographics exhibit pronounced youth-oriented and metropolitan characteristics. This makes it difficult to fully capture the perceptual experiences of elderly populations, children, and individuals from diverse socioeconomic backgrounds. This limitation also precludes an in-depth examination of how individual user attributes—such as age, income, and residential status (e.g., resident versus tourist)—moderate perception formation. For instance, tourists may prioritize landmarks and cultural appeal, whereas local residents might place greater value on daily restorativeness and practical functionality. Future research should therefore aim to integrate multi-source data to construct a more comprehensive perception map. This could include combining satisfaction surveys released by urban management authorities, cross-platform travel reviews (e.g., from Mafengwo or TripAdvisor), and structured interviews and questionnaires targeting specific groups to obtain detailed user profile data [21,73]. Such approaches would enable a finer-grained analysis of the heterogeneous mechanisms underlying perception formation at the individual attribute level.
Second, the data collection process faced challenges related to insufficient geotag coverage. Of the 189,400 Xiaohongshu posts initially retrieved through predefined keyword searches, only 70,807 contained valid geolocation tags, yielding an effective tagging rate of merely 37.38%. This limitation partially constrains both data collection efficiency and spatial coverage in geographic analysis. To address this issue, future research could explore spatial localization enhancement methods that integrate natural language processing and computer vision techniques. For instance, spatial coordinates for non-geotagged data could be inferred by identifying toponyms, road segments, or prominent landmarks mentioned in textual content. Alternatively, similarity comparisons could be performed between non-geotagged images and street-view image libraries with known geographical ranges. These methods would improve the utilization rate and completeness of spatial data.
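The toponym-based option can be sketched as a gazetteer lookup over post text. This is a deliberately simple illustration, not the localization method of any particular system: the gazetteer entries, their rough coordinates, and the function name `infer_location` are hypothetical, and a practical pipeline would need disambiguation and a far larger place-name list.

```python
from typing import Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)

# Hypothetical gazetteer with rough placeholder coordinates.
GAZETTEER: Dict[str, Coordinate] = {
    "shichahai": (39.94, 116.38),
    "sanlitun": (39.93, 116.45),
    "temple of heaven": (39.88, 116.41),
}


def infer_location(
    text: str, gazetteer: Dict[str, Coordinate] = GAZETTEER
) -> Optional[Tuple[str, Coordinate]]:
    """Return the longest gazetteer name found in the text, with its coordinate."""
    lowered = text.lower()
    matches = [name for name in gazetteer if name in lowered]
    if not matches:
        return None
    best = max(matches, key=len)  # prefer the most specific (longest) name
    return best, gazetteer[best]


post = "Sunset walk around Shichahai, then coffee on a terrace"
located = infer_location(post)
print(located)
```

Posts that mention no known place name return `None` and would remain candidates for the image-similarity route described above.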
Third, the scope of influencing factor analysis requires further expansion. Although this study primarily examines the impact mechanisms of built environment characteristics on public perception, the perceptual experience of urban spaces constitutes a complex socio-spatial process inevitably shaped by socioeconomic background factors. Future research could incorporate variables such as neighborhood housing price gradients, resident income levels, and age structure into the analytical framework to explore how the interaction between these factors and the built environment collectively shapes public spatial perception.
Finally, the geographical generalizability of the research findings requires further validation. The distinctive perception patterns identified in this study are shaped by the built environment characteristics of Beijing’s Third Ring Road area, which is characterized by dense historical districts and prominent cultural symbols. However, varying urban cultural contexts, spatial structures, and resident behavioral patterns may result in differentiated mechanisms. Subsequent research could apply the analytical framework developed in this study to other types of urban cases by creating targeted SMD datasets with relevant labels and fine-tuning LVLMs such as GPT-4o and Qwen2.5-VL, thereby enabling a systematic distinction between universal mechanisms in urban spatial perception and context-specific patterns highly dependent on local conditions. Furthermore, while the current study primarily relies on image and text modalities, future work could integrate video, audio, and dynamic geospatial data to construct more comprehensive multi-modal large models, enabling interdisciplinary, multidimensional automated perception evaluation of the built environment.

5. Conclusions

This study utilizes cutting-edge LVLMs to develop an interpretable model, Qwen2.5-VL-7B-SFT. Using a LoRA fine-tuning strategy, the model achieves multi-modal fusion analysis of social media image–text data, enabling precise evaluation of public perception across four dimensions of URSs: esthetics, attractiveness, culturality, and restorativeness. Simultaneously, it systematically identifies the objective spatial elements and subjective emotional characteristics associated with each perceptual dimension from the public experience perspective. Furthermore, this study introduces the Optuna hyperparameter optimization framework to iteratively train multiple mainstream machine learning models and combines them with SHAP interpretability techniques to investigate the nonlinear impacts of built environment features on multidimensional public perception. Our findings include the following: (1) Interpretable LVLMs demonstrate significant efficacy in urban spatial perception research. The supervised fine-tuned Qwen2.5-VL-7B-SFT effectively addresses the limitations of traditional single-modal data sources, achieving notable performance improvements in SMD sentiment analysis tasks. This model integrates a post hoc interpretation framework with few-shot learning and two-stage CoT prompting, which provides clear reasoning processes and justifications for perceptual predictions, thereby significantly enhancing the interpretability and credibility of the outputs. (2) URSs within Beijing’s Third Ring Road are classified into four categories: historical heritage spaces, commercial entertainment spaces, ecological-natural spaces, and cultural spaces. SMD analysis reveals specific spatial elements and their associated emotional features that trigger distinct perceptual dimensions: esthetics is shaped by architectural styles, color schemes, and light–shadow effects in historical heritage spaces, floral and celestial landscapes in ecological spaces, and the unique esthetic appeal of commercial spaces. Attractiveness correlates strongly with functional diversity, design quality, and social dissemination potential in commercial entertainment spaces, as well as cultural experiences and place imagery embedded in historical architecture. Culturality derives from the “ancient capital culture” and “Beijing-style culture” embodied in historical heritage spaces, “innovative cultural” expressions in designated cultural spaces, “red culture” through national symbolism elements, and “celebrity culture” intertwined with historical contexts. Restorativeness manifests through physiological and psychological relaxation experiences, primarily originating from environmental comfort and tranquility in ecological spaces, along with recreational value from strolling through historical spaces. (3) AHH and PFD are the core built environment factors most strongly associated with positive public recreational perception. When PFD of a block exceeds 4000 units/km² or when the number of historical heritage sites within a 500 m radius surpasses 4, their positive effects on public perception become particularly pronounced.
In summary, this study innovatively integrates social media multi-modal data with vision–language model technology to achieve the precise quantification of public perception in URSs, while elucidating linkage mechanisms between built environment elements and perceptual dimensions. Our workflow demonstrates notable scalability and adaptability, supporting diverse applications from recreational space planning to urban management. Methodologically, this research expands analytical dimensions of urban spatial studies and establishes a user perception-centric evaluation paradigm, providing scientific evidence for “people-oriented” urban governance to enhance urban management precision.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W. and W.F.; formal analysis, Y.W.; investigation, Y.W.; resources, X.H.; data curation, Y.W. and W.F.; writing—original draft preparation, Y.W.; writing—review and editing, X.H.; visualization, Y.W.; supervision, X.W.; project administration, X.W.; funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Education Humanities and Social Sciences Planning Fund Project. Project: Research on the “Space Sharing” Model, Mechanism and Guidance Control of Historical Districts [23YJA630032].

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| URSs | Urban recreational spaces |
| LVLMs | Large vision-language models |
| SMD | Social media data |
| SFT | Supervised fine-tuning |
| LoRA | Low-Rank Adaptation |
| CoT | Chain-of-Thought |
| SHAP | SHapley Additive exPlanations |
| RFE | Recursive feature elimination |
| XGBoost | eXtreme Gradient Boosting |
| GBDT | Gradient Boosted Decision Trees |
| LightGBM | Light Gradient Boosting Machine |
| RF | Random Forest |
| MAE | Mean Absolute Error |
| RMSE | Root Mean Square Error |
| R² | R-squared |
| ACC | Accuracy |

Appendix A. Urban Recreational Spaces Perception Dimensions Description

Table A1. Label description and examples of URS perception dimensions.

| Dimension | Polarity | Description | Example |
|---|---|---|---|
| Esthetics | Positive | Favorable visual experience and appreciation of spatial quality. Emphasizes explicit expressions of admiration for visual beauty. | Beihai Park · White Pagoda Peach Blossoms and Willow Banks are called Spring and Jingming, still remembered in the children’s song: “Let’s swing our sculls, and the boat pushes away the waves.” The sea surface reflects the beautiful White Pagoda, surrounded by green trees and red walls…… |
| Esthetics | Negative | Unfavorable visual experience and criticism of spatial quality. Emphasizes explicit expressions of dissatisfaction with visual defects and environmental discordance. | Wintersweet near Qingyin Pavilion has bloomed late, and recently it has not been the flowering period of Taoranting. The slightly overcast sky, the dead wood, and the residual lotus are somewhat dark and windy. The Chinese National Garden reproduces many famous gardens in the south. It is estimated that it will look good when the north winter is gone and there are green trees and flowing water…… |
| Attractiveness | Positive | Spatial characteristics that inspire interest and participation. Emphasizes environments with captivating, worth-exploring qualities. | Huguo Temple \| The online celebrity ladder in the courtyard was evaluated some days ago, and grass has been planted. It is located in the courtyard of Huguo Temple, where there is a scenic courtyard and Z03 coffee…… |
| Attractiveness | Negative | Spatial characteristics lacking appeal or participatory value. Emphasizes environments perceived as dull or not worth exploring. | Is this the online celebrity punch-in place? It is really not so good; it is too ordinary. It is recommended to go all the way north to Wangfujing. |
| Culturality | Positive | Successful preservation and presentation of cultural heritage. Emphasizes well-conserved cultural traditions or historical features. | Shijia Hutong is said to be “one hutong, half of China”. Many celebrities and scholars once lived in this hutong, and Prince William of England also visited the first Hutong museum in China. Shu Yi, the signature on the plaque, is the son of writer Lao She. |
| Culturality | Negative | Loss or compromise of cultural-historical values. Emphasizes damaged or inappropriately modified cultural elements. | For many years, this street recorded many memories of my life in the hutong: joys, sorrows, and further joys. Now it is full of commerce and no fireworks and no smell. |
| Restorativeness | Positive | Positive experiences of physical or mental restoration. Emphasizes environments conducive to stress relief and mental recovery. | Beijing Hutong! This is a small shop overlooking the Lama Temple. It is so good to drink coffee in the alley in autumn. I met a little stray cat by chance. Cat lovers are friendly. When the weather is fine, I come to the terrace with my friends to drink coffee and bask in the sun. |
| Restorativeness | Negative | Negative experiences causing physical or mental discomfort. Emphasizes environments with adverse psychological or physiological impacts. | 11 May 2024: The last day of traveling to Beijing with my roommate. Let me be the first to say that the conclusion is very bad. It is terrible. In particular, when I walked in the Guomao Building myself, I found that I felt depression instead of prosperity. Maybe I should not have come…… |

Appendix A.1. Prompt for Qwen2.5-VL-7B-SFT

  • [Task Description]
To identify four perceptual dimensions of urban recreational spaces (URSs) based on social media “image–text” data and conduct sentiment polarity annotation. Specific requirements:
Perception Dimensions: Esthetics, Attractiveness, Culturality, Restorativeness.
Sentiment Annotation Standards: Positive = 1, Negative = 0, Irrelevant = 2
  • [URS Perception Description]
  • Esthetics:
Positive Perception: Favorable visual experience and appreciation of spatial quality. Emphasizing explicit expressions of admiration for visual beauty.
Positive Keywords: picturesque, beautiful, stunning, magnificent, etc.
Negative Perception: Unfavorable visual experience and criticism of spatial quality. Emphasizing explicit expressions of dissatisfaction with visual defects and environmental discordance.
Negative Keywords: messy, dilapidated, dirty, etc.
Irrelevant Perception: Content that does not pertain to visual esthetic evaluation.
  • Attractiveness:
Positive Perception: Spatial characteristics that inspire interest and participation. Emphasizing environments with captivating, worth-exploring qualities.
Positive Keywords: interesting, vibrant, fun, lively, etc.
Negative Perception: Spatial characteristics lacking appeal or participatory value. Emphasizing environments perceived as dull or not worth exploring.
Negative Keywords: boring, deserted, monotonous, uninteresting, dead, etc.
Irrelevant Perception: Content unrelated to the space’s appeal or ability to attract visitors.
  • Culturality:
Positive Perception: Successful preservation and presentation of cultural heritage. Emphasizing well-conserved cultural traditions or historical features.
Positive Keywords: cultural, historical, artifacts, etc.
Negative Perception: Loss or compromise of cultural-historical values. Emphasizing damaged or inappropriately modified cultural elements.
Negative Keywords: over-commercialized, gentrified, demolished, artificial, etc.
Irrelevant Perception: Content without cultural/historical relevance.
  • Restorativeness:
Positive Perception: Positive experiences of physical or mental restoration. Emphasizing environments conducive to stress relief and mental recovery.
Positive Keywords: comfortable, leisurely, healing, pleasant, relaxed, etc.
Negative Perception: Negative experiences causing physical or mental discomfort. Emphasizing environments with adverse psychological/physiological impacts.
Negative Keywords: crowded, stuffy, oppressive, irritable, overwhelming, etc.
Irrelevant Perception: Content unrelated to restorative experiences.
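A model prompted in this way must return one code per dimension. The sketch below shows one way such a reply could be validated before analysis. The JSON reply format, the lowercase dimension keys, and the function name `parse_annotation` are illustrative assumptions and are not taken from the study.

```python
import json

# Label codes as defined in the prompt: Positive = 1, Negative = 0, Irrelevant = 2.
LABELS = {1: "positive", 0: "negative", 2: "irrelevant"}
DIMENSIONS = ("esthetics", "attractiveness", "culturality", "restorativeness")


def parse_annotation(reply: str) -> dict:
    """Validate a model reply and map each code to its label name.

    Assumes (hypothetically) that the reply is a JSON object such as
    {"esthetics": 1, "attractiveness": 2, "culturality": 0, "restorativeness": 1}.
    """
    raw = json.loads(reply)
    parsed = {}
    for dim in DIMENSIONS:
        if dim not in raw:
            raise ValueError(f"missing dimension: {dim}")
        code = raw[dim]
        # Accept only the plain integers 0, 1, 2 (rejects booleans, strings, lists).
        if type(code) is not int or code not in LABELS:
            raise ValueError(f"invalid code {code!r} for {dim}")
        parsed[dim] = LABELS[code]
    return parsed
```

Rejecting malformed replies early keeps invalid codes out of the downstream accuracy and sentiment statistics.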

Appendix B. Machine Learning Model Performance Enhancement Technology

Appendix B.1. Feature Screening

Feature selection plays a pivotal role in identifying key predictors and enhancing the model’s capability to capture complex data patterns. This study employed two feature selection approaches. The first used the Boruta algorithm, which operates within a random forest framework: it generates randomly permuted copies (shadow features) of each original feature and iteratively compares the importance scores of the original features against those of the shadows. Features that show lower importance than their shadow counterparts are progressively eliminated until every variable is either confirmed or rejected [74]. The second approach was recursive feature elimination (RFE), which first calculates feature importance using trained XGBoost regression models and then iteratively removes the least important features while evaluating model performance (R²) within a cross-validation framework. The optimal feature subset is determined from the resulting cross-validation performance curve.
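The shadow-feature idea behind Boruta can be illustrated with a deliberately simplified sketch. It is not the Boruta algorithm itself: importance is measured here by the absolute Pearson correlation with the target (a stand-in for random forest importance), importances are computed once instead of being re-estimated each round, and a simple majority vote replaces Boruta's statistical test. The function names `abs_corr` and `shadow_screen` are hypothetical.

```python
import random
from statistics import mean


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_screen(features, target, rounds=20, seed=0):
    """Keep features that beat the best shadow in more than half the rounds.

    features: dict mapping feature name -> list of values.
    Importance is |Pearson r|, an assumed stand-in for the random forest
    importance used by Boruta.
    """
    rng = random.Random(seed)
    importance = {name: abs_corr(col, target) for name, col in features.items()}
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        best_shadow = 0.0
        for col in features.values():
            shadow = col[:]        # permuted copy: any link to the target is broken
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_corr(shadow, target))
        for name in features:
            if importance[name] > best_shadow:
                hits[name] += 1
    return [name for name in features if hits[name] > rounds / 2]
```

A feature is retained only if it repeatedly outperforms the strongest permuted copy, which is the core comparison that Boruta performs with a random forest.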

Appendix B.2. Optuna Hyperparameter Tuning

Optuna is an automated hyperparameter optimization framework that iteratively evaluates parameter configurations using objective function calls to identify optimal solutions. Employing an enhanced Bayesian optimization algorithm, it terminates explorations of low-probability parameter spaces and allocates greater computational resources to more promising regions, thereby improving search efficiency [75]. In this study, we integrated the optimal feature subsets obtained from both Boruta and RFE feature selection methods with four machine learning models within the Optuna framework to determine each model’s optimal hyperparameter configuration.
Table A2 presents the optimal feature subsets identified by Boruta and RFE feature selection methods. Using these subsets, we conducted hyperparameter tuning for each model with Optuna (300 iterations), evaluating model performance using five-fold cross-validation for each hyperparameter combination.
Table A2. Feature subsets for the four URS perception dimensions derived from Boruta and RFE.

| Perception Dimension | Feature Selection | Feature Subset |
|---|---|---|
| Esthetics | Boruta | BD, NTAD, NBSD, WI, W, AHH, E, SVF, GVI, SHDI, PFD, RND, HSD |
| Esthetics | RFE | AHH, PFD, NTAD, E, SVF, RND, NMSD, SHDI, GVI, WI, AOS |
| Attractiveness | Boruta | BD, NTAD, NMBD, WI, W, AHH, E, GVI, SHDI, PR, PFD, RND, HSD |
| Attractiveness | RFE | PFD, AHH, RND, PR, GVI, NTAD, NMSD, E, SHDI, BD, W |
| Culturality | Boruta | BD, BC, NTAD, NMBD, SSI, SSC, WI, W, AHH, GVI, RND, SHDI, PR, PFD, E, HSD |
| Culturality | RFE | AHH, PFD, RND, PR, NMSD, SHDI, NTAD, E, GVI, WI, HSD |
| Restorativeness | Boruta | BD, NTAD, NMBD, WI, W, AHH, E, GVI, MSD, SHDI, PFD, RND, HSD |
| Restorativeness | RFE | PFD, AHH, RND, E, NTAD, SHDI, NMSD, GVI, W, WI, HSD |

Appendix B.3. Comparison of Machine Learning Models

This study compared four machine learning models: eXtreme Gradient Boosting (XGBoost), Gradient Boosting Decision Tree (GBDT), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF). We evaluated predictive accuracy using three metrics: R-squared (R²), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). R² measures the proportion of variance in the dependent variable explained by the independent variables, typically ranging from 0 to 1, with higher values indicating better goodness-of-fit. Both MAE and RMSE quantify prediction errors as non-negative real numbers, where lower values indicate higher predictive accuracy. The calculation formulas are as follows:

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{A1}$$

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{A2}$$

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \tag{A3}$$

where $y_i$ represents the true values, $\hat{y}_i$ denotes the predicted values from the regression model, $\bar{y}$ is the mean of the true values, and $n$ indicates the number of samples in the dataset. The evaluation results are presented in Table A3, which demonstrates that the combination of the optimal feature subset from RFE and the LightGBM model achieved the best overall performance.
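Table A3. Evaluation metrics for the four machine learning models.

| Perception Dimension | Metric | Feature Screening | XGBoost | RF | GBDT | LightGBM |
|---|---|---|---|---|---|---|
| Esthetics | R² | Boruta | 0.5487 | 0.4942 | 0.5358 | 0.5436 |
| Esthetics | R² | RFE | 0.5593 | 0.5066 | 0.5575 | 0.5959 |
| Esthetics | RMSE | Boruta | 1700.5394 | 1800.2408 | 1724.6631 | 1710.006 |
| Esthetics | RMSE | RFE | 1680.3977 | 1777.8764 | 1683.7445 | 1609.1301 |
| Esthetics | MAE | Boruta | 841.5249 | 890.3494 | 841.0683 | 860.2127 |
| Esthetics | MAE | RFE | 839.9607 | 894.5046 | 825.5100 | 792.0527 |
| Attractiveness | R² | Boruta | 0.7000 | 0.5584 | 0.6444 | 0.7089 |
| Attractiveness | R² | RFE | 0.6951 | 0.51628 | 0.6382 | 0.7239 |
| Attractiveness | RMSE | Boruta | 6278.7616 | 7618.2421 | 6836.3824 | 6184.9220 |
| Attractiveness | RMSE | RFE | 6330.2019 | 7973.0699 | 6895.2056 | 6023.8856 |
| Attractiveness | MAE | Boruta | 2902.8784 | 3381.3959 | 3104.4383 | 2839.2208 |
| Attractiveness | MAE | RFE | 3007.9538 | 3514.4672 | 3244.3208 | 2868.1066 |
| Culturality | R² | Boruta | 0.5785 | 0.5692 | 0.5769 | 0.5661 |
| Culturality | R² | RFE | 0.6311 | 0.5768 | 0.6154 | 0.6319 |
| Culturality | RMSE | Boruta | 2541.9314 | 2569.6580 | 2546.5204 | 2578.9323 |
| Culturality | RMSE | RFE | 2378.0264 | 2547.0606 | 2427.8537 | 2375.5058 |
| Culturality | MAE | Boruta | 1271.4136 | 1290.9579 | 1277.9152 | 1319.6692 |
| Culturality | MAE | RFE | 1167.6808 | 1260.8989 | 1175.0645 | 1166.0874 |
| Restorativeness | R² | Boruta | 0.620294 | 0.54867 | 0.6084 | 0.60921 |
| Restorativeness | R² | RFE | 0.64803 | 0.55655 | 0.6275 | 0.63753 |
| Restorativeness | RMSE | Boruta | 7813.3988 | 8518.4787 | 7935.1976 | 7926.6090 |
| Restorativeness | RMSE | RFE | 7522.5223 | 8443.7944 | 7739.2791 | 7633.9517 |
| Restorativeness | MAE | Boruta | 3414.1766 | 3584.0825 | 3445.5327 | 3425.2580 |
| Restorativeness | MAE | RFE | 3361.6974 | 3579.3412 | 3407.0083 | 3330.4188 |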
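For readers who wish to recompute the three metrics reported in Table A3, the following is a minimal sketch using their standard definitions. The function names are illustrative, and the code is not the authors' implementation.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Because RMSE squares the residuals before averaging, it penalizes large individual errors more heavily than MAE, which is why both are usually reported together.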

Appendix B.4. Shapley Additive Explanation (SHAP)

This study employs SHAP interpretability models to analyze the complex relationships between built environment features and URS perceptions. Based on cooperative game theory’s Shapley value concept, this method quantifies each feature’s marginal contribution by calculating its average impact across all possible feature combinations [76]. The approach provides both global interpretation of aggregate feature impacts across the study population and local explanation of individual feature contributions for specific observations, effectively transforming “black-box” models into interpretable machine learning systems. The SHAP value function is formally expressed in Equation (A4):
$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{j\}}\left(x_{S \cup \{j\}}\right) - f_S\left(x_S\right) \right] \tag{A4}$$

where $\phi_j$ is the SHAP value for feature $j$, $F$ is the set of all features, $S$ is a subset of features excluding feature $j$, $|S|$ is the number of features in subset $S$, $|F|$ is the total number of features, $f_{S \cup \{j\}}(x_{S \cup \{j\}})$ is the prediction of the model using the features in subset $S$ along with feature $j$, and $f_S(x_S)$ is the prediction of the model using only the features in subset $S$.
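The subset-weighted sum in the SHAP value function can be made concrete with a brute-force sketch that enumerates every subset. It is only practical for a handful of features, and it assumes that "the model using only the features in S" is approximated by replacing the remaining features with a fixed baseline value; that choice, along with the function name `shapley_values`, is an assumption of this sketch and not necessarily what the SHAP library does for tree models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x by enumerating all subsets.

    f_S(x_S) is approximated by evaluating predict with the features
    outside S set to their baseline value (an assumption of this sketch).
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi
```

For a toy model `2*v[0] + 3*v[1] + v[0]*v[2]` evaluated at `[1, 2, 3]` with a zero baseline, the interaction term is split equally between the first and third features, and the values sum to the difference between the prediction and the baseline prediction.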

Appendix C. Negative Comments in Urban Recreational Space

Table A4. The ACC, Precision, Recall, and F1 score of Qwen2.5-VL-7B-SFT on the Tianjin dataset.

| Metric | Model | Esthetics | Attractiveness | Culturality | Restorativeness |
|---|---|---|---|---|---|
| ACC | Qwen2.5-VL-7B-SFT | 0.9226 | 0.8607 | 0.9691 | 0.8994 |
| Precision | Qwen2.5-VL-7B-SFT | 0.9265 | 0.8604 | 0.9684 | 0.9013 |
| Recall | Qwen2.5-VL-7B-SFT | 0.9226 | 0.8607 | 0.9691 | 0.8994 |
| F1 score | Qwen2.5-VL-7B-SFT | 0.9243 | 0.8591 | 0.9669 | 0.9002 |
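The averaging scheme behind the Precision, Recall, and F1 values in Table A4 is not stated here. The identical ACC and Recall rows are consistent with support-weighted averaging over the three label classes, because weighted recall always equals accuracy. The sketch below computes the four metrics under that assumption; the function name `weighted_scores` is illustrative.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # number of true samples per label
    predicted = Counter(y_pred)    # number of predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        w = count / n              # weight each class by its share of true samples
        precision += w * p
        recall += w * r
        f1 += w * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

With labels coded as in the annotation scheme (1, 0, 2), calling `weighted_scores(y_true, y_pred)` on any pair of label lists returns a recall equal to the accuracy, mirroring the pattern in the table.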
Figure A1. Negative comment word clouds in four dimensions.

References

  1. Wu, B.; Dong, L.; Tang, Z. A study on categories and attributes of public urban recreation space. Chin. Landsc. Archit. 2003, 5, 48–50.
  2. Yu, L.; Liu, J.; Li, T. Important Progress and Future Prospects for Studies on Urban Public Recreational Space in China. J. Geogr. Sci. 2019, 29, 1923–1946.
  3. Li, H. Evaluation and optimization countermeasures for service functions of urban ecological recreation space. City Plan. Rev. 2015, 39, 63–69.
  4. Kang, L.; Yang, Z.; Han, F. The Impact of Urban Recreation Environment on Residents’ Happiness—Based on a Case Study in China. Sustainability 2021, 13, 5549.
  5. Li, J.; Guo, X.; You, J.; He, Z.; Yang, Z.; Wang, L. Perception and Drivers of Cultural Ecosystem Services in Waterfront Green Spaces: Insights from Social Media Text Analysis. Anthropocene 2025, 50, 100477.
  6. Huang, W.; Zhao, X.; Lin, G.; Wang, Z.; Chen, M. How to Quantify Multidimensional Perception of Urban Parks? Integrating Deep Learning-Based Social Media Data Analysis with Questionnaire Survey Methods. Urban For. Urban Green. 2025, 107, 128754.
  7. Li, Y.; Zhao, B.; Jiang, B.; Jia, X.; Li, H.; Zhang, J. Beyond Visits: Investigating the Restorative Pathways and Cumulative Effects of Park Engagement and Sustained Exposure on Psychological Well-Being with Park Type as a Moderator. Environ. Res. 2025, 276, 121520.
  8. Zhao, X.; Lu, Y.; Huang, W.; Lin, G. Assessing and Interpreting Perceived Park Accessibility, Usability and Attractiveness through Texts and Images from Social Media. Sustain. Cities Soc. 2024, 112, 105619.
  9. Reid, W.V.; Mooney, H.A.; Cropper, A.; Capistrano, D.; Carpenter, S.R.; Chopra, K.; Dasgupta, P.; Dietz, T.; Duraiappah, A.K.; Hassan, R.; et al. Ecosystems and Human Well-Being—Synthesis: A Report of the Millennium Ecosystem Assessment; Island Press: Washington, DC, USA, 2005; ISBN 978-1-59726-040-4.
  10. Li, J.; Gao, J.; Zhang, Z.; Fu, J.; Shao, G.; Zhao, Z.; Yang, P. Insights into Citizens’ Experiences of Cultural Ecosystem Services in Urban Green Spaces Based on Social Media Analytics. Landsc. Urban Plan. 2024, 244, 104999.
  11. Mundher, R.; Al-Sharaa, A.; Al-Helli, M.; Gao, H.; Abu Bakar, S. Visual Quality Assessment of Historical Street Scenes: A Case Study of the First “Real” Street Established in Baghdad. Heritage 2022, 5, 3680–3704.
  12. Chen, S.; Meng, B.; Liu, N.; Qi, Z.; Liu, J.; Wang, J. Cultural Perception of the Historical and Cultural Blocks of Beijing Based on Weibo Photos. Land 2022, 11, 495.
  13. Jiang, S.; Liu, J. Comparative Study of Cultural Landscape Perception in Historic Districts from the Perspectives of Tourists and Residents. Land 2024, 13, 353.
  14. Chen, X.; Sun, Y.; Ibrahim, F.I.B.; Kamarazaly, M.A.B.; Abidin, S.N.B.Z.; Tang, S. Social Media Interaction and Built Environment Effects on Urban Walking Experience: A Machine Learning Analysis of Shanghai Citywalk. PLoS ONE 2025, 20, e0320951.
  15. Kim, J.; Lee, J. An Analysis of Spatial Accessibility Changes According to the Attractiveness Index of Public Libraries Using Social Media Data. Sustainability 2021, 13, 9087.
  16. Du, X.; Zhang, Y.; Lv, Z. Investigations and Analysis of Indoor Environment Quality of Green and Conventional Shopping Mall Buildings Based on Customers’ Perception. Build. Environ. 2020, 177, 106851.
  17. Nguyen, T.V.T.; Han, H.; Sahito, N. Role of Urban Public Space and the Surrounding Environment in Promoting Sustainable Development from the Lens of Social Media. Sustainability 2019, 11, 5967.
  18. Wang, Y.; Feng, D. History, Modernity, and City Branding in China: A Multimodal Critical Discourse Analysis of Xi’an’s Promotional Videos on Social Media. Soc. Semiot. 2023, 33, 402–425.
  19. Tang, Y.; Li, L.; Gan, Y.; Xie, S. Investigating Resident–Tourist Sharing of Urban Public Recreation Space and Its Influencing Factors. ISPRS Int. J. Geo-Inf. 2024, 13, 305.
  20. Ding, J.; Liu, N. How can social media construct the spatial image of a “internet-famous city”. News Writ. 2021, 9, 87–91.
  21. Huang, J.; Obracht-Prondzynska, H.; Kamrowska-Zaluska, D.; Sun, Y.; Li, L. The Image of the City on Social Media: A Comparative Study Using “Big Data” and “Small Data” Methods in the Tri-City Region in Poland. Landsc. Urban Plan. 2021, 206, 103977.
  22. Guo, C.; Yang, Y. A Multi-Modal Social Media Data Analysis Framework: Exploring the Complex Relationships among Urban Environment, Public Activity, and Public Perception—A Case Study of Xi’an, China. Ecol. Indic. 2025, 171, 113118.
  23. Zhang, Z.; Wang, X.; Jiang, M. Empirical Study on Emotional Perception and Restorative Effects of Suzhou Garden Landscapes: Text Mining and Statistical Analysis. Land 2025, 14, 122.
  24. Huang, Y.; Li, Z.; Huang, Y. User Perception of Public Parks: A Pilot Study Integrating Spatial Social Media Data with Park Management in the City of Chicago. Land 2022, 11, 211.
  25. Marine, N.; Arnaiz-Schmitz, C.; Santos-Cid, L.; Schmitz, M.F. Can We Foresee Landscape Interest? Maximum Entropy Applied to Social Media Photographs: A Case Study in Madrid. Land 2022, 11, 715.
  26. Tieskens, K.F.; Van Zanten, B.T.; Schulp, C.J.E.; Verburg, P.H. Aesthetic Appreciation of the Cultural Landscape through Social Media: An Analysis of Revealed Preference in the Dutch River Landscape. Landsc. Urban Plan. 2018, 177, 128–137.
  27. Yang, C.; Zhang, Y. Public Emotions and Visual Perception of the East Coast Park in Singapore: A Deep Learning Method Using Social Media Data. Urban For. Urban Green. 2024, 94, 128285.
  28. Parnami, A.; Lee, M. Learning from Few Examples: A Summary of Approaches to Few-Shot Learning. arXiv 2022, arXiv:2203.04291.
  29. Luo, H.; Zhang, Z.; Zhu, Q.; Houda Ben Ameur, N.E.; Liu, X.; Ding, F.; Cai, Y. Using Large Language Models to Investigate Cultural Ecosystem Services Perceptions: A Few-Shot and Prompt Method. Landsc. Urban Plan. 2025, 258, 105323.
  30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021.
  31. Leung, T.M.; Miao, S.; Lin, M.; Hou, H.; Sun, M. Tourist Walkability in Traditional Villages: The Role of Built Environment, Shareability, and Personal Attributes. Sustainability 2025, 17, 5311.
  32. Malekzadeh, M.; Willberg, E.; Torkko, J.; Toivonen, T. Urban Attractiveness According to ChatGPT: Contrasting AI and Human Insights. Comput. Environ. Urban Syst. 2025, 117, 102243.
  33. Sun, P.; Zhao, H.; Zhong, J.; Cao, S.; Gao, M. Popularity Influence Mechanism of Coastal Spaces in Urban Areas: Insights from Multi-Modal Large Language Models. Cities 2025, 161, 105909.
  34. Martí, P.; Serrano-Estrada, L.; Nolasco-Cirugeda, A. Social Media Data: Challenges, Opportunities and Limitations in Urban Studies. Comput. Environ. Urban Syst. 2019, 74, 161–174.
  35. Zhou, S.; Wang, H.; Li, D.; Ng, S.T.; Wei, R.; Zhao, Y.; Zhou, Y. Revealing Public Attitudes toward Mobile Cabin Hospitals during Covid-19 Pandemic: Sentiment and Topic Analyses Using Social Media Data in China. Sustain. Cities Soc. 2024, 107, 105440.
  36. Yang, Y.; Du, S.; Xiao, Y. Identification of Spatial Influencing Factors and Enhancement Strategies for Cultural Tourism Experience in Huizhou Historic Districts. Buildings 2025, 15, 1568.
  37. Wu, W.; Gaubatz, P. The Chinese City, 2nd ed.; Routledge: London, UK, 2020; ISBN 978-0-429-82955-0.
  38. Wang, Z.; Huang, W.-J.; Liu-Lastres, B. Impact of User-Generated Travel Posts on Travel Decisions: A Comparative Study on Weibo and Xiaohongshu. Ann. Tour. Res. Empir. Insights 2022, 3, 100064.
  39. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923.
  40. Li, Q. Parameter Efficient Fine-Tuning on Selective Parameters for Transformer-Based Pre-Trained Models. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6.
  41. Post-Hoc Interpretability for Neural NLP: A Survey | ACM Computing Surveys. Available online: https://dl.acm.org/doi/10.1145/3546577 (accessed on 16 October 2025).
  42. Zhu, J.; Wang, W.; Chen, Z.; Liu, Z.; Ye, S.; Gu, L.; Tian, H.; Duan, Y.; Su, W.; Shao, J.; et al. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv 2025, arXiv:2504.10479.
  43. Huang, B.; Zhou, Y.; Li, Z.; Song, Y.; Cai, J.; Tu, W. Evaluating and Characterizing Urban Vibrancy Using Spatial Big Data: Shanghai as a Case Study. Environ. Plan. B Urban Anal. City Sci. 2020, 47, 1543–1559.
  44. Hou, X.; Chen, P. Analysis of Road Safety Perception and Influencing Factors in a Complex Urban Environment—Taking Chaoyang District, Beijing, as an Example. ISPRS Int. J. Geo-Inf. 2024, 13, 272.
  45. Yang, D.; Lin, Q.; Li, H.; Chen, J.; Ni, H.; Li, P.; Hu, Y.; Wang, H. Unraveling Spatial Nonstationary and Nonlinear Dynamics in Life Satisfaction: Integrating Geospatial Analysis of Community Built Environment and Resident Perception via MGWR, GBDT, and XGBoost. ISPRS Int. J. Geo-Inf. 2025, 14, 131.
  46. Zhu, J.; Wang, S.; Ma, H.; Shan, T.; Xu, D.; Sun, F. Nonlinear Effect of Urban Visual Environment on Residents’ Psychological Perception—An Analysis Based on XGBoost and SHAP Interpretation Model. City Environ. Interact. 2025, 27, 100202.
  47. Wohlwill, J.F. Environmental Aesthetics: The Environment as a Source of Affect. In Human Behavior and Environment; Altman, I., Wohlwill, J.F., Eds.; Springer: Boston, MA, USA, 1976; pp. 37–86. ISBN 978-1-4684-2552-9.
  48. Gugulica, M.; Burghardt, D. Mapping Indicators of Cultural Ecosystem Services Use in Urban Green Spaces Based on Text Classification of Geosocial Media Data. Ecosyst. Serv. 2023, 60, 101508.
  49. Yan, F.; Shu, B.; Zhao, X.; Li, X.; Wu, W.; Huang, M. Secular Experience or Spiritual Pursuit? The Attribution of Checking into Internet-famous Places in the Consumerism Context. Tourism Tribune 2022, 37, 94–105.
  50. Liu, Z.; Wang, A.; Weber, K.; Chan, E.H.W.; Shi, W. Categorisation of Cultural Tourism Attractions by Tourist Preference Using Location-Based Social Network Data: The Case of Central, Hong Kong. Tour. Manag. 2022, 90, 104488.
  51. Rui, J.; Xu, Y.; Cai, C.; Li, X. Leveraging Large Language Models for Tourism Research Based on 5D Framework: A Collaborative Analysis of Tourist Sentiments and Spatial Features. Tour. Manag. 2025, 108, 105115.
  52. Zhang, J.; Li, Y.; Fukuda, T.; Wang, B. Urban Safety Perception Assessments via Integrating Multimodal Large Language Models with Street View Images. Cities 2025, 165, 106122.
  53. Parekh, J.; Khayatan, P.; Shukor, M.; Newson, A.; Cord, M. A Concept-Based Explainability Framework for Large Multimodal Models. arXiv 2024, arXiv:2406.08074.
  54. Khaledi, H.J.; Khakzand, M.; Faizi, M. Landscape and Perception: A Systematic Review. Landsc. Online 2022, 97, 1098.
  55. Cerasi, M. The Urban and Architectural Evolution of the Istanbul Dïvanyolu: Urban Aesthetics and Ideology in Ottoman Town Building. Muqarnas 2005, 2, 189–232.
  56. Tveit, M.; Ode, Å.; Fry, G. Key Concepts in a Framework for Analysing Visual Landscape Character. Landsc. Res. 2006, 31, 229–255.
  57. Jia, M.; Feng, J.; Chen, Y.; Zhao, C. Visual Analysis of Social Media Data on Experiences at a World Heritage Tourist Destination: Historic Centre of Macau. Buildings 2024, 14, 2188.
  58. Zhu, H.; Chang, J.; An, X.; Li, S. Global and Local Feature Extraction of Urban Historical Spatial Perception Using Large Language Models: A Case Study of Harbin Central Street District. Cities 2025, 165, 106183.
  59. González Martínez, P. Authenticity as a Challenge in the Transformation of Beijing’s Urban Heritage: The Commercial Gentrification of the Guozijian Historic Area. Cities 2016, 59, 48–56.
  60. Luo, L.; Chen, J.; Cheng, Y.; Cai, K. Empirical Analysis on Influence of Authenticity Perception on Tourist Loyalty in Historical Blocks in China. Sustainability 2024, 16, 2799.
  61. Al-Shami, H.W.; Al-Alwan, H.A.S.; Abdulkareem, T.A. Cultural Sustainability in Urban Third Places: Assessing the Impact of “Co-Operation in Science and Technology” in Cultural Third Places. Ain Shams Eng. J. 2024, 15, 102465.
  62. Li, G.; Zhao, C.; Ling, S.; Lu, L. A study on the urban creative space production based on the perspective of “stage-interaction”: A case study of Lei street in Hefei. Hum. Geogr. 2024, 39, 106–117.
  63. Kaplan, S. The Restorative Benefits of Nature: Toward an Integrative Framework. J. Environ. Psychol. 1995, 15, 169–182.
  64. Redaelli, E.; Hansen, L.E.; Djupdræt, M.B. Museums as Public Spaces in the City: Insights from Aarhus, Denmark. Cities 2025, 159, 105778.
  65. González Martínez, P. Curating the Selective Memory of Gentrification: The Wulixiang Shikumen Museum in Xintiandi, Shanghai. Int. J. Herit. Stud. 2021, 27, 537–553.
  66. Kutsche, P. The Death and Life of Great American Cities. Jane Jacobs. Am. Anthropol. 1962, 64, 907–909.
  67. Sung, H.; Lee, S. Residential Built Environment and Walking Activity: Empirical Evidence of Jane Jacobs’ Urban Vitality. Transp. Res. Part D Transp. Environ. 2015, 41, 318–329.
  68. Li, X.; Li, Y.; Jia, T.; Zhou, L.; Hijazi, I.H. The Six Dimensions of Built Environment on Urban Vitality: Fusion Evidence from Multi-Source Data. Cities 2022, 121, 103482.
  69. Liu, W.; Li, D.; Meng, Y.; Guo, C. The Relationship between Emotional Perception and High-Density Built Environment Based on Social Media Data: Evidence from Spatial Analyses in Wuhan. Land 2024, 13, 294.
  70. Wu, J.; Lu, Y.; Gao, H.; Wang, M. Cultivating Historical Heritage Area Vitality Using Urban Morphology Approach Based on Big Data and Machine Learning. Comput. Environ. Urban Syst. 2022, 91, 101716.
  71. Wu, T.; Chen, Z.; Li, S.; Xing, P.; Wei, R.; Meng, X.; Zhao, J.; Wu, Z.; Qiao, R. Decoupling Urban Street Attractiveness: An Ensemble Learning Analysis of Color and Visual Element Contributions. Land 2025, 14, 979.
  72. Wu, W.; Ma, Z.; Guo, J.; Niu, X.; Zhao, K. Evaluating the Effects of Built Environment on Street Vitality at the City Level: An Empirical Research Based on Spatial Panel Durbin Model. Int. J. Environ. Res. Public. Health 2022, 19, 1664.
  73. Rasoolimanesh, S.M.; Seyfi, S.; Hall, C.M.; Hatamifar, P. Understanding Memorable Tourism Experiences and Behavioural Intentions of Heritage Tourists. J. Destin. Mark. Manag. 2021, 21, 100621.
  74. Kursa, M.B.; Jankowski, A.; Rudnicki, W.R. Boruta—A System for Feature Selection. Fundam. Informaticae 2010, 101, 271–285.
  75. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019.
  76. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
Figure 1. The study area.
Figure 2. Research framework.
Figure 3. Structure of the Qwen2.5-VL-7B-SFT.
Figure 4. Density distribution of perceived positive evaluations in four dimensions.
Figure 5. Clusters related to URS esthetics. The left column displays the spatial elements influencing URS esthetics, while the right column shows their corresponding emotional characteristics.
Figure 6. Clusters related to URS attractiveness. The left column presents the spatial elements influencing URS attractiveness, while the right column shows their corresponding emotional characteristics.
Figure 7. Clusters related to URS culturality. The left column presents the spatial elements influencing URS culturality, while the right column shows their corresponding emotional characteristics.
Figure 8. Clusters related to URS restorativeness. The left column presents the spatial elements influencing URS restorativeness, while the right column shows their corresponding emotional characteristics.
Figure 9. Importance of core environmental indicators and nonlinear effects on URS perception. The left column displays the results of the relative importance of environmental attributes on URS perception, while the right column shows the nonlinear effects of the top six environmental attributes.
Table 1. Overview of collected data distribution.
| Source | Geographic Coordinate Data of the Entire City | Third Ring Geographic Coordinate Data | Third Ring Picture Data |
|---|---|---|---|
| Xiaohongshu | 70,807 | 23,046 | 45,571 |
| Weibo | 154,717 | 77,343 | 126,985 |
| Total | 225,524 | 100,389 | 172,556 |
Table 2. Overview of annotated data distribution.
| Label | Esthetics | Attractiveness | Culturality | Restorativeness |
|---|---|---|---|---|
| 0 | 107 | 280 | 125 | 267 |
| 1 | 880 | 1,607 | 953 | 1,298 |
| 2 | 4,065 | 3,165 | 3,974 | 3,497 |
Table 3. Experimental parameters.
| Category | Parameter | Value | Description/Function |
|---|---|---|---|
| LoRA Configuration | rank | 8 | Dimension of low-rank matrices |
| | alpha | 32 | Scaling factor for weight adjustment |
| | dropout | 0.05 | Dropout rate to prevent overfitting |
| Training | learning rate | 0.0001 | Initial learning rate for optimization |
| | train_val_split_ratio | 9:1 | Ratio of training set to validation set |
| | epochs | 4 | Number of passes over the training set |
| | batch size | 16 | Samples processed per GPU per iteration |
| Hardware | gradient checkpointing | Enabled | Reduces GPU memory usage by recomputing activations in the backward pass instead of storing them |
| | mixed precision (BF16) | Enabled | Accelerates training with bfloat16 precision |
| Vision | max pixels | 602,112 | Maximum image pixels (448 × 1344) |
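The values in Table 3 can be collected in a small configuration object for sanity checks. The following is a minimal sketch rather than the authors' training script: the class names `LoraSettings` and `TrainSettings` and the helper `split_sizes` are hypothetical, and the `alpha / rank` scaling assumes the standard LoRA formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """LoRA values from Table 3 (hypothetical container, not a library class)."""
    rank: int = 8
    alpha: int = 32
    dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Assumption: standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


@dataclass(frozen=True)
class TrainSettings:
    """Training and vision values from Table 3 (hypothetical container)."""
    learning_rate: float = 1e-4
    epochs: int = 4
    batch_size_per_gpu: int = 16
    train_parts: int = 9   # 9:1 train/validation split
    val_parts: int = 1
    max_pixels: int = 448 * 1344


def split_sizes(n_samples: int, train_parts: int, val_parts: int) -> tuple[int, int]:
    """Return (n_train, n_val) for a train_parts:val_parts split."""
    n_val = n_samples * val_parts // (train_parts + val_parts)
    return n_samples - n_val, n_val


lora = LoraSettings()
train = TrainSettings()
print(lora.scaling)       # effective multiplier on the low-rank update
print(train.max_pixels)   # pixel budget per image
print(split_sizes(1000, train.train_parts, train.val_parts))
```

With rank 8 and alpha 32, the update is scaled by a factor of 4 under this assumption, and the pixel budget matches the 602,112 value listed in the table.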
Table 4. Calculation and statistics of influencing factors.
| Domains | Variables | Formula | Description |
|---|---|---|---|
| Density | BD (Building Density) | $BD = \sum_{j}^{n} m_j / S_i$ | $j$ is a building in the block, $m_j$ is the building base area of the $j$-th building, and $S_i$ is the area of the block (km2). |
| | PFD (POI Functional Density) | $PFD = C_i / S_i$ | $C_i$ is the total count of POI points within block $i$, and $S_i$ is the area of the block (km2). |
| | PR (Plot Ratio) | $PR = \sum_{j}^{n} m_j \times f_j / S_i$ | $j$ is a building in the block, $m_j$ is the building base area of the $j$-th building, $f_j$ is the number of floors of the building, and $S_i$ is the area of the block (km2). |
| | HHD (Historical Heritage Density) | $HHD = H_i / S_i$ | $H_i$ is the number of historical heritage sites in block $i$, and $S_i$ is the area of the block (km2). |
| | PFM (POI Functional Mixture) | $PFM = -\sum_{i}^{n} P_i \times \ln P_i$ | $i$ indexes the $n$ POI types in the block, and $P_i$ is the ratio of the number of POIs of the $i$-th type in the block to the total number. |
| | SHDI (Shannon’s Diversity Index) | $SHDI = -\sum_{i}^{n} p_i \times \ln p_i$ | $i$ indexes the $n$ visual feature types computed in the semantic segmentation task within the block, and $p_i$ is the proportion of visual feature $i$ to the total pixels. |
| Distance to transit | BSD (Bus Stop Density) | $BSD = B_i / S_i$ | $B_i$ is the number of bus stops in block $i$, and $S_i$ is the area of the block (km2). |
| | MSD (Metro Station Density) | $MSD = M_i / S_i$ | $M_i$ is the number of metro stations in block $i$, and $S_i$ is the area of the block (km2). |
| | NBSD (Nearest Bus Stop Distance) | | Straight-line distance from the centroid of block $i$ to the nearest bus stop (km). |
| | RND (Road Network Density) | $RND = R_l / S_i$ | $R_l$ is the sum of the lengths of all types of roads in the block, and $S_i$ is the area of the block (km2). |
| Design | SVF (Sky View Factor) | $SVF = \sum_{k=1}^{m} \left( \frac{1}{4} \sum_{j=1}^{4} s_{jk} \right) / m$ | SVF represents the proportion of the sky visible in the field of vision, $s_{jk}$ is the pixel ratio of the sky in the $j$-th street view picture of the $k$-th sample point in block $i$, and $m$ is the number of street view sample points in block $i$ (the same below). |
| | EI (Enclosure Index) | $EI = \sum_{k=1}^{m} \left[ \frac{\sum_{j=1}^{4} \left( b_{jk} + w_{jk} + t_{jk} \right)}{\sum_{j=1}^{4} \left( p_{jk} + r_{jk} \right)} \right] / m$ | $b_{jk}$/$w_{jk}$/$t_{jk}$/$p_{jk}$/$r_{jk}$ is the pixel ratio of buildings/walls/trees/sidewalks/roadways in the $j$-th street view picture of the $k$-th sample point in block $i$. |
| | GVI (Green View Index) | $GVI = \sum_{k=1}^{m} \left( \frac{1}{4} \sum_{j=1}^{4} g_{jk} \right) / m$ | $g_{jk}$ is the proportion of green plant pixels in the $j$-th street view picture of the $k$-th sample point in block $i$. |
| | WAI (Walkability Index) | $WAI = \sum_{k=1}^{m} \left[ \frac{\sum_{j=1}^{4} p_{jk}}{\sum_{j=1}^{4} \left( p_{jk} + r_{jk} \right)} \right] / m$ | $p_{jk}$/$r_{jk}$ is the proportion of sidewalk/roadway pixels in the $j$-th street view picture of the $k$-th sample point in block $i$. |
| | BC (Building Continuity) | $BC = \sqrt{ \frac{1}{m} \sum_{k=1}^{m} \left( \frac{1}{4} \sum_{j=1}^{4} b_{jk} - \frac{1}{4m} \sum_{k=1}^{m} \sum_{j=1}^{4} b_{jk} \right)^{2} }$ | BC represents the standard deviation of the proportion of buildings in block $i$, and $b_{jk}$ is the proportion of building pixels in the $j$-th street view image of the $k$-th sample point in block $i$. |
| | GSR (Green Space Ratio) | $GSR = G_r / S_i$ | $G_r$ is the sum of the areas of the various types of green spaces in the block, and $S_i$ is the area of the block (km2). |
| | WI (Water Index) | WI = K NEAR_DIST | K takes the value of the 3000 m search radius, and NEAR_DIST is the closest distance to the water. |
| Destination accessibility | NTAD (Nearest Trade Area Distance) | | Straight-line distance from the centroid of block $i$ to the nearest business district (km). |
| | AHH (Accessibility of Historical Heritage) | | Number of historical heritage sites within 500 m walking distance of block $i$, including temples, memorials, churches, etc. (units/500 m). |
| | AOS (Accessibility of Open Space) | | Number of open spaces within 500 m walking distance of block $i$, including green spaces, city parks, squares, etc. (units/500 m). |
| | SSC (Space Syntax Choice) | $SSC = \sum_{j} \sum_{k} d_{jk}(i) / d_{jk}$ | $d_{jk}$ is the shortest path between line segment $j$ and line segment $k$, and $d_{jk}(i)$ is the shortest path between line segment $j$ and line segment $k$ that contains line segment $i$. r = 1000 m. |
| | SSI (Space Syntax Integration) | $SSI = \frac{n \left[ \log_2 \left( \frac{n+2}{3} \right) - 1 \right]}{(n-1) \times \left( MD_i - 1 \right)}$ | $n$ is the number of units in the street network and $MD_i$ is the average depth of segment $i$. r = 1000 m. |
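Several indicators in Table 4 share the same arithmetic, so a compact sketch may help make the definitions concrete. The code below is illustrative only and is not the authors' implementation: the function names are hypothetical, each sample point is assumed to carry a list of per-view pixel ratios (four views in the table), and the diversity indices use the Shannon form with a leading minus sign.

```python
import math
from statistics import pstdev


def shannon_index(counts):
    """-sum(p * ln p) over categories with non-zero count (form used for PFM and SHDI)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mean_view_ratio(samples):
    """Average over sample points of the mean pixel ratio across views (form used for SVF and GVI)."""
    return sum(sum(views) / len(views) for views in samples) / len(samples)


def walkability_index(sidewalk, roadway):
    """Mean over sample points of sidewalk / (sidewalk + roadway) pixel ratios (WAI).

    Assumes every sample point sees some sidewalk or roadway (non-zero denominator).
    """
    ratios = [
        sum(p_views) / (sum(p_views) + sum(r_views))
        for p_views, r_views in zip(sidewalk, roadway)
    ]
    return sum(ratios) / len(ratios)


def building_continuity(building):
    """Population standard deviation of the per-point mean building ratio (BC)."""
    per_point = [sum(views) / len(views) for views in building]
    return pstdev(per_point)


# Toy block with two sample points and four street views each (made-up numbers).
green = [[0.2, 0.4, 0.6, 0.8], [0.0, 0.0, 0.0, 0.0]]
print(mean_view_ratio(green))
print(shannon_index([10, 10]))
print(building_continuity([[0.2] * 4, [0.4] * 4]))
```

A block whose POIs fall evenly into two types yields a mixture value of ln 2, while a block with a single POI type yields 0, which is the behaviour expected of an entropy-style diversity measure.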
Table 5. Comparative performance of the LVLMs based on ACC, Precision, Recall, and F1 score.
| Metric | Model | Esthetics | Attractiveness | Culturality | Restorativeness |
|---|---|---|---|---|---|
| ACC | Qwen2.5-VL-7B-SFT | 0.9348 | 0.8715 | 0.9368 | 0.9150 |
| | InternVL3-8B-SFT | 0.9187 | 0.8515 | 0.9146 | 0.8972 |
| | Qwen2.5-VL-7B | 0.6653 | 0.5257 | 0.6858 | 0.6735 |
| Precision | Qwen2.5-VL-7B-SFT | 0.9358 | 0.8703 | 0.9382 | 0.9169 |
| | InternVL3-8B-SFT | 0.9187 | 0.8518 | 0.9148 | 0.8978 |
| | Qwen2.5-VL-7B | 0.6663 | 0.5244 | 0.6868 | 0.6764 |
| Recall | Qwen2.5-VL-7B-SFT | 0.9348 | 0.8715 | 0.9368 | 0.9150 |
| | InternVL3-8B-SFT | 0.9187 | 0.8515 | 0.9146 | 0.8972 |
| | Qwen2.5-VL-7B | 0.6653 | 0.5257 | 0.6858 | 0.6735 |
| F1 score | Qwen2.5-VL-7B-SFT | 0.9348 | 0.8701 | 0.9365 | 0.9155 |
| | InternVL3-8B-SFT | 0.9183 | 0.8509 | 0.9144 | 0.8973 |
| | Qwen2.5-VL-7B | 0.6653 | 0.5245 | 0.6854 | 0.6739 |
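In Table 5 the Recall rows coincide with the ACC rows, which is the pattern produced by support-weighted averaging over the three labels. The averaging scheme is not stated in this section, so the sketch below assumes weighted averaging; the function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 for multi-class labels."""
    n = len(y_true)
    support = Counter(y_true)
    predicted = Counter(y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label, true_count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        prec = tp / predicted[label] if predicted[label] else 0.0
        rec = tp / true_count
        f1_label = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        weight = true_count / n  # weight each class by its share of true labels
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f1_label

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy three-level labels (0, 1, 2), made up for illustration.
y_true = [2, 2, 2, 1, 1, 0]
y_pred = [2, 2, 1, 1, 2, 0]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Under this scheme weighted recall reduces to the fraction of correctly classified samples, so it always equals accuracy, whereas weighted precision and F1 can differ slightly, as they do in the table.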
Table 6. Quantitative results of URS perception in four dimensions.
| Label | Esthetics | Attractiveness | Culturality | Restorativeness |
|---|---|---|---|---|
| 0 | 227 (0.4%) | 1175 (2.1%) | 450 (0.8%) | 2128 (3.9%) |
| 1 | 10,112 (18.4%) | 24,390 (44.3%) | 9703 (17.6%) | 20,978 (38.1%) |
| 2 | 44,742 (81.2%) | 29,516 (53.6%) | 44,928 (81.6%) | 31,975 (58.1%) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wang, Y.; Hou, X.; Wang, X.; Fan, W. Public Perception of Urban Recreational Spaces Based on Large Vision–Language Models: A Case Study of Beijing’s Third Ring Area. Land 2025, 14, 2155. https://doi.org/10.3390/land14112155
