Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images
Abstract
1. Introduction
- (1) It introduces an automated framework for assessing urban landscape attractiveness based on GPT-4o, a multimodal large language model. By leveraging GPT-4o’s capabilities in image comprehension and linguistic reasoning, the method replaces labour-intensive aesthetic evaluations with an efficient, standardised process [21]. The open-ended question “Which image is more attractive?” is transformed into a structured decision task, yielding clear categorical outcomes (“more attractive,” “indistinguishable,” and “less attractive”). This structured, replicable workflow improves evaluation efficiency while maintaining consistency and interpretability, illustrating the practical potential of MLLMs for subjective, cognitively demanding tasks.
- (2) Based on the pairwise comparison methodology of MIT Place Pulse 2.0, we constructed a dataset of 1020 real-world street-view images for visual attractiveness assessment. Each image was compared to 50 others, resulting in a model-generated relative ranking of aesthetic appeal. Simultaneously, we collected human preference data for the same image set via a crowdsourcing platform. The resulting dataset enables rigorous comparison between model and human judgments and serves as a standardised reference for future research in urban visual analysis and aesthetic model development. Figure 1 illustrates the research workflow, from model-based and human aesthetic judgments to a comprehensive attractiveness evaluation of the Yuzhong District in Chongqing, China, highlighting the method’s scalability and potential for broad application.
- (3) The MLLM-driven framework proposed in this study offers a novel quantitative decision-support tool for urban design. By analysing the relationship between visual elements in street-view imagery and corresponding attractiveness judgments, the model identifies key visual features (such as greenery, building façades, and street openness) that significantly influence perceptions of visual attractiveness. This approach not only enhances understanding of public aesthetic preferences but also provides data-driven insights for policymakers and urban planners seeking to optimise cityscapes. For instance, when designing new urban districts or renovating ageing streetscapes, this framework can serve as an assistive system to evaluate the potential perceptual impact of proposed interventions, thereby facilitating the creation of more human-centred urban environments.
2. Related Work
2.1. Definition of Visual Attractiveness
2.2. Applications of Multimodal Large Language Models in Urban Perception
3. Materials and Methods
3.1. Research Area
3.2. Human Annotation
- (1) Compositional structure and visual balance of the image;
- (2) Colour harmony and saturation;
- (3) Spatial layering and architectural detail richness;
- (4) Overall visual appeal and aesthetic value.
- If the target image (e.g., image #1) was deemed more attractive, it received a score of +1.
- If the two images were indistinguishable in aesthetic appeal, the score was 0.
- If the target image was less attractive, the score was -1.
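The scoring rule above yields one value per comparison, which still has to be combined into a per-image score. A minimal sketch, assuming each comparison is stored as a `(target_id, score)` pair and that an image's overall score is the mean of its comparison scores; the record layout, the averaging step, and the function name are illustrative assumptions, not the authors' stated procedure:

```python
from collections import defaultdict


def aggregate_pairwise_scores(comparisons):
    """Average the +1 / 0 / -1 outcomes recorded for each target image.

    `comparisons` is an iterable of (target_id, score) pairs; the record
    layout and the use of a simple mean are assumptions for illustration.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for target_id, score in comparisons:
        if score not in (-1, 0, 1):
            raise ValueError(f"unexpected score {score!r} for image {target_id!r}")
        totals[target_id] += score
        counts[target_id] += 1
    # Mean outcome per image, in the range [-1, 1].
    return {image_id: totals[image_id] / counts[image_id] for image_id in totals}
```

Under this assumption, an image that wins every comparison scores 1.0 and one that loses every comparison scores -1.0.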
3.3. Model-Based Scoring
- If the target image is judged more attractive than the comparison image, the score is +1;
- If the images are indistinguishable, the score is 0;
- If the target image is judged less attractive, the score is -1.
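The mapping from a categorical judgment to a numeric score can be sketched as follows. The three labels are the categorical outcomes named in the Introduction; the normalisation of the reply text (whitespace, trailing full stop, case) and the function name are assumptions about how a free-text model reply might be handled, not a description of the authors' pipeline:

```python
# Categorical outcomes named in the text, mapped to the scores listed above.
LABEL_TO_SCORE = {
    "more attractive": 1,
    "indistinguishable": 0,
    "less attractive": -1,
}


def score_from_label(label):
    """Convert a categorical judgment into +1, 0, or -1.

    Trimming whitespace, a trailing full stop, and letter case is an
    assumption about the shape of the reply, made for illustration.
    """
    key = label.strip().rstrip(".").lower()
    if key not in LABEL_TO_SCORE:
        raise ValueError(f"unrecognised judgment: {label!r}")
    return LABEL_TO_SCORE[key]
```

Rejecting unrecognised replies, instead of silently scoring them as 0, keeps malformed outputs from being counted as "indistinguishable".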
4. Results
4.1. Image Classification
- Low score (0–3),
- Medium score (3–7),
- High score (7–10) (see Figure 4).
- The harmony of colour and architectural proportions;
- The continuity and layering of building silhouettes;
- The spatial comfort influenced by the D/H ratio.
- Low-scoring images (0–3) predominantly featured visually cluttered scenes with disjointed building outlines and a lack of spatial coherence.
- Medium-scoring images (3–7) represented streetscapes that were generally coordinated but lacked strong visual appeal or architectural refinement.
- High-scoring images (7–10) showcased streetscapes with soothing colour palettes, harmonious proportions, and rich spatial layering—serving as exemplars of visual attractiveness.
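The three score bands can be expressed as a small classifier. The text gives the bands as 0–3, 3–7, and 7–10 without saying which band owns the shared boundaries, so this sketch assumes half-open intervals [0, 3), [3, 7), and [7, 10]; that boundary convention and the function name are assumptions:

```python
def classify_score(score):
    """Assign a 0-10 score to the low / medium / high band.

    Assumes half-open intervals [0, 3), [3, 7) and [7, 10]; the source
    does not state how scores of exactly 3 or 7 are handled.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score {score!r} is outside the 0-10 range")
    if score < 3:
        return "low"
    if score < 7:
        return "medium"
    return "high"
```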
4.2. Score Comparison
- x_i is the original aesthetic score of image i (ranging from 0 to 100);
- μ is the mean of all image scores;
- σ is the standard deviation of the scores;
- z_i is the standardised score for image i.
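The definitions above match the usual z-score standardisation, z_i = (x_i − μ) / σ, writing x_i for the original score of image i and z_i for its standardised score (these symbol names are assumed here). A minimal sketch, assuming the population standard deviation is intended; the text does not say whether the population or the sample form was used:

```python
from statistics import fmean, pstdev


def standardise(scores):
    """Return z-scores (x - mean) / std for a list of raw scores."""
    mu = fmean(scores)
    sigma = pstdev(scores)  # population standard deviation: an assumption
    if sigma == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in scores]
```

Standardising each score set this way puts sets recorded on different scales onto a common footing before they are compared.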
- GPT-4o scores and human scores;
- Place Pulse 2.0 scores and human scores.
- GPT-4o vs. human scores: R² = 0.695.
- Place Pulse 2.0 vs. human scores: R² = 0.385.
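An R² between two score sets can be obtained in more than one way. A minimal sketch, assuming the values come from a simple linear fit of one score set on the other, in which case R² equals the squared Pearson correlation; the fitting procedure is not stated in this excerpt, and the function name is illustrative:

```python
def r_squared(xs, ys):
    """Squared Pearson correlation of two equal-length score lists.

    For a simple linear regression of ys on xs this equals the
    coefficient of determination R^2.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either list is constant")
    return (sxy * sxy) / (sxx * syy)
```

Read this way, a value of 0.695 means that about 70% of the variance in one score set is explained by a linear function of the other, against about 39% for a value of 0.385.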
- Symmetry;
- Open skies;
- Unobstructed greenery;
- Balanced lighting;
- Enclosing yet cohesive building outlines.
4.3. Spatial Syntax and Visualisation Analysis
4.4. Joint Analysis of Streetscape Aesthetic and Accessibility
- 1. High Visual Attractiveness and High Accessibility:
- 2. High Visual Attractiveness but Low Accessibility:
- 3. Low Visual Attractiveness but High Accessibility:
- 4. Low Visual Attractiveness and Low Accessibility:
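The four categories above amount to a two-way split on attractiveness and accessibility. A minimal sketch, assuming each street segment has one numeric value per dimension and that "high" means at or above a caller-supplied threshold; the thresholds, the argument names, and the function name are assumptions, since the cut-offs used are not stated in this excerpt:

```python
def quadrant(attractiveness, accessibility, attr_threshold, access_threshold):
    """Label a street segment with one of the four categories listed above.

    The thresholds separating "high" from "low" are caller-supplied
    assumptions; the source does not state the cut-offs it used.
    """
    attr = "High" if attractiveness >= attr_threshold else "Low"
    access = "high" if accessibility >= access_threshold else "low"
    # "and" when both dimensions agree, "but" when they diverge.
    joiner = "and" if attr.lower() == access else "but"
    return f"{attr} visual attractiveness {joiner} {access} accessibility"
```

One plausible choice is to pass the median of each dimension as its threshold, which splits the segments into four groups relative to the study area.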
5. Discussion
5.1. Differentiated Development Strategies for Chongqing Streetscapes
5.2. Design Strategies and Limitations of MLLMs in Urban Aesthetic Evaluation
6. Conclusions
6.1. Research Contributions and Limitations
6.2. Future Research Directions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Luo, S.; Xie, J.; Furuya, K. Assessing the Preference and Restorative Potential of Urban Park Blue Space. Land 2021, 10, 1233.
- De Vries, S.; Van Dillen, S.M.E.; Groenewegen, P.P.; Spreeuwenberg, P. Streetscape Greenery and Health: Stress, Social Cohesion and Physical Activity as Mediators. Soc. Sci. Med. 2013, 94, 26–33.
- Hou, X.; Chen, P. Analysis of Road Safety Perception and Influencing Factors in a Complex Urban Environment—Taking Chaoyang District, Beijing, as an Example. ISPRS Int. J. Geo-Inf. 2024, 13, 272.
- Wan, R.; Zhang, J.; Huang, Y.; Li, Y.; Hu, B.; Wang, B. Leveraging Diffusion Modeling for Remote Sensing Change Detection in Built-Up Urban Areas. IEEE Access 2024, 12, 7028–7039.
- Yi, K.; Xu, Z. Exploring the Aesthetic Principles of Traditional Lingnan Architecture in Guangzhou Influencing Economic Development and Socio-Economic Perspective—A Notch from Public Well-Being and Modernity. J. Inf. Syst. Eng. 2023, 8, 22838.
- Zawadzka, A.K. Architectural and Urban Attractiveness of Small Towns: A Case Study of Polish Coastal Cittaslow Towns on the Pomeranian Way of St. James. Land 2021, 10, 724.
- Zhang, J.; Fang, J.; Zhang, C.; Zhang, W.; Ren, H.; Xu, L. Geographic Named Entity Matching and Evaluation Recommendation Using Multi-Objective Tasks: A Study Integrating a Large Language Model (LLM) and Retrieval-Augmented Generation (RAG). ISPRS Int. J. Geo-Inf. 2025, 14, 95.
- Tang, F.; Zeng, P.; Wang, L.; Zhang, L.; Xu, W. Urban Perception Evaluation and Street Refinement Governance Supported by Street View Visual Elements Analysis. Remote Sens. 2024, 16, 3661.
- He, H.; Xiong, W.; Zhou, F.; He, Z.; Zhang, T.; Sheng, Z. Topology-Aware Multi-View Street Scene Image Matching for Cross-Daylight Conditions Integrating Geometric Constraints and Semantic Consistency. ISPRS Int. J. Geo-Inf. 2025, 14, 212.
- Salesses, P.; Schechtner, K.; Hidalgo, C.A. The Collaborative Image of The City: Mapping the Inequality of Urban Perception. PLoS ONE 2013, 8, e68400.
- Naik, N.; Philipoom, J.; Raskar, R.; Hidalgo, C. Streetscore—Predicting the Perceived Safety of One Million Streetscapes. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; IEEE: New York, NY, USA, 2014; pp. 793–799.
- Zheng, S.; Zhang, J.; Zu, R.; Li, Y. Visual Perception Differences and Spatiotemporal Analysis in Commercialized Historic Streets Based on Mobile Eye Tracking: A Case Study in Nanchang Wanshou Palace, China. Buildings 2024, 14, 1899.
- Zhou, G.; Zhi, H.; Gao, E.; Lu, Y.; Chen, J.; Bai, Y.; Zhou, X. DeepU-Net: A Parallel Dual-Branch Model for Deeply Fusing Multiscale Features for Road Extraction From High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9448–9463.
- Ali, M.B.; Jamal, S. Modelling the present and future scenario of urban green space vulnerability using PSR based AHP and MLP models in a Metropolitan city Kolkata Municipal Corporation. Geol. Ecol. Landsc. 2024, 8, 1–19.
- Pan, J.; Deng, Y.; Yang, Y.; Zhang, Y. Location-Allocation Modelling for Rational Health Planning: Applying a Two-Step Optimization Approach to Evaluate the Spatial Accessibility Improvement of Newly Added Tertiary Hospitals in a Metropolitan City of China. Soc. Sci. Med. 2023, 338, 116296.
- Zhou, G.; Qian, L.; Gamba, P. A Novel Iterative Self-Organizing Pixel Matrix Entanglement Classifier for Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 62, 1–21.
- Han, Y.; Liu, J.; Luo, A.; Wang, Y.; Bao, S. Fine-Tuning LLM-Assisted Chinese Disaster Geospatial Intelligence Extraction and Case Studies. ISPRS Int. J. Geo-Inf. 2025, 14, 79.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Zhang, J.; Li, Y.; Fukuda, T.; Wang, B. Urban Safety Perception Assessments via Integrating Multimodal Large Language Models with Street View Images. Cities 2025, 165, 106122.
- Liang, H.; Zhang, J.; Li, Y.; Wang, B.; Huang, J. Automatic Estimation for Visual Quality Changes of Street Space via Street-View Images and Multimodal Large Language Models. IEEE Access 2024, 12, 87713–87727.
- Zhang, J.; Xiang, R.; Kuang, Z.; Wang, B.; Li, Y. ArchGPT: Harnessing Large Language Models for Supporting Renovation and Conservation of Traditional Architectural Heritage. Herit. Sci. 2024, 12, 220.
- Kim, J.; Kim, S. Finding the Optimal D/H Ratio for an Enclosed Urban Square: Testing an Urban Design Principle Using Immersive Virtual Reality Simulation Techniques. Int. J. Environ. Res. Public Health 2019, 16, 865.
- Salahi, S.; Moztarzadeh, H. Providing Design Solutions of Urban Facades Based On the Aesthetics Principles of Colors Case Study: Afifabad Street Shiraz. Space Ontol. Int. J. 2023, 12, 61–76.
- Salingaros, N.A. Design Patterns and Living Architecture; Levellers Press: Amherst, MA, USA, 2017.
- Zhang, J.; Hu, J.; Zhang, X.; Li, Y.; Huang, J. Towards a Fairer Green City: Measuring Unfairness in Daily Accessible Greenery in Chengdu’s Central City. J. Asian Archit. Build. Eng. 2023, 23, 1–20.
- Zhang, H.; Ao, M.; Ardabili, N.G.; Xu, Z.; Wang, J. Impact of Urban Sunken Square Design on Summer Outdoor Thermal Comfort Using Machine Learning. Urban Clim. 2024, 58, 102214.
- Zhang, J.; Yu, Z.; Li, Y.; Wang, X. Uncovering Bias in Objective Mapping and Subjective Perception of Urban Building Functionality: A Machine Learning Approach to Urban Spatial Perception. Land 2023, 12, 1322.
- Tan, R.; Wu, Y.; Zhang, S. Walking in Tandem with the City: Exploring the Influence of Public Art on Encouraging Urban Pedestrianism within the 15-Minute Community Living Circle in Shanghai. Sustainability 2024, 16, 3839.
- Wang, R.; Zhao, J.; Meitner, M.J.; Hu, Y.; Xu, X. Characteristics of Urban Green Spaces in Relation to Aesthetic Preference and Stress Recovery. Urban For. Urban Green. 2019, 41, 6–13.
- Jaglarz, A. Perception of Color in Architecture and Urban Space. Buildings 2023, 13, 2000.
- Li, L.; Chung, W. Application of Artificial Intelligence in Visual Communication of Green Urban Rural Integration Landscape Design. Ecol. Chem. Eng. S 2024, 31, 583–597.
- Husselman, T.-A.; Filho, E.; Zugic, L.W.; Threadgold, E.; Ball, L.J. Stimulus Complexity Can Enhance Art Appreciation: Phenomenological and Psychophysiological Evidence for the Pleasure-Interest Model of Aesthetic Liking. J. Intell. 2024, 12, 42.
- Bianchi, I.; Actis-Grosso, R.; Ball, L.J. Grounding Cognition in Perceptual Experience. J. Intell. 2024, 12, 66.
- Zhang, F.; Zhou, B.; Liu, L.; Liu, Y.; Fung, H.H.; Lin, H.; Ratti, C. Measuring Human Perceptions of a Large-Scale Urban Region Using Machine Learning. Landsc. Urban Plan. 2018, 180, 148–160.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Naik, N.; Kominers, S.D.; Raskar, R.; Glaeser, E.L.; Hidalgo, C.A. Computer Vision Uncovers Predictors of Physical Urban Change. Proc. Natl. Acad. Sci. USA 2017, 114, 7571–7576.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
- Kang, Y.; Abraham, J.; Ceccato, V.; Duarte, F.; Gao, S.; Ljungqvist, L.; Zhang, F.; Näsman, P.; Ratti, C. Assessing Differences in Safety Perceptions Using GeoAI and Survey across Neighbourhoods in Stockholm, Sweden. Landsc. Urban Plan. 2023, 236, 104768.
- Ma, H.; Li, J.; Ye, X. Deep Learning Meets Urban Design: Assessing Streetscape Aesthetic and Design Quality through AI and Cluster Analysis. Cities 2025, 162, 105939.
- Verma, D.; Mumm, O.; Carlow, V.M. Assessing Visual Similarity of Neighbourhoods with Street View Images and Deep Learning Techniques. J. Urban Des. 2024, 30, 1–12.
- Vinchon, F.; Gironnay, V.; Lubart, T. GenAI Creativity in Narrative Tasks: Exploring New Forms of Creativity. J. Intell. 2024, 12, 125.
- Sternberg, R.J. Do Not Worry That Generative AI May Compromise Human Creativity or Intelligence in the Future: It Already Has. J. Intell. 2024, 12, 69.
- Vicovaro, M. Grounding Intuitive Physics in Perceptual Experience. J. Intell. 2023, 11, 187.
- Li, M.; Fan, Z. Constructing High-Quality Livable Cities: A Comprehensive Evaluation of Urban Street Livability Using an Approach Based on Human Needs Theory, Street View Images, and Deep Learning. Land 2025, 14, 1095.
- Suel, E.; Bhatt, S.; Brauer, M.; Flaxman, S.; Ezzati, M. Multimodal Deep Learning from Satellite and Street-Level Imagery for Measuring Income, Overcrowding, and Environmental Deprivation in Urban Areas. Remote Sens. Environ. 2021, 257, 112339.
- Wu, T.; Lin, D.; Chen, Y.; Wu, J. Integrating Street View Images, Deep Learning, and sDNA for Evaluating University Campus Outdoor Public Spaces: A Focus on Restorative Benefits and Accessibility. Land 2025, 14, 610.
- Li, X.; Wen, C.; Hu, Y.; Yuan, Z.; Zhu, X.X. Vision-Language Models in Remote Sensing: Current Progress and Future Trends. IEEE Geosci. Remote Sens. Mag. 2024, 12, 32–66.
- Peng, Y.; Li, Z.; Shah, A.M.; Lv, B.; Liu, S.; Liu, Y.; Li, X.; Song, H.; Chen, Q. Decoding the Role of Urban Green Space Morphology in Shaping Visual Perception: A Park-Based Study. Land 2025, 14, 495.
- Jiang, R.; Chen, C.W. Multimodal LLMs Can Reason about Aesthetics in Zero-Shot. arXiv 2025, arXiv:2501.09012.
- Cai, C.; Kuriyama, K.; Gu, Y.; Biljecki, F.; Herthogs, P. Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels. arXiv 2025, arXiv:2504.21040.
Aesthetic Appeal | Definitions and Interpretations |
---|---|
Proximity Relationships | Buildings that exhibit continuity and are directly connected to streets, with open facades facing the street, create a more inviting and human-centred streetscape, enhancing its aesthetic appeal and fostering engagement. |
D/H Ratio | The ratio of street width (D) to building height (H) influences spatial perception. When D/H > 1, the sense of detachment increases with the ratio. Conversely, when D/H < 1, the sense of intimacy intensifies. D/H = 1 marks a pivotal point in spatial perception, shaping the scale and perspective of individuals within the streetscape. |
Building Contours | The “primary contour” refers to the inherent form of the building, while the “secondary contour” involves protrusions or added elements. Minimising and integrating secondary contours into the primary form enhances the visual harmony of streetscapes. |
Shadowed Spaces | Indented areas or “shadowed spaces” create enclosed, intimate, and comforting environments. These spaces also highlight the geometric forms of public squares, enhancing visual clarity and usability. |
Sunken Spaces | Depressed or below-ground spaces, such as sunken gardens, add depth and visual interest to urban areas while making efficient use of space. |
Street Sculptures | Sculptures contribute to the collective urban identity, acting as public assets that restore attractiveness to the community and foster cultural engagement. |
Street Greening | Urban greenery not only fulfils ecological requirements but also provides a calming and restorative atmosphere. The blue of the sky and the green of vegetation, classified as tranquil colours in colour psychology, soothe and refresh human emotions. |
Urban Colour Palette | The colour schemes of buildings and street structures profoundly affect mood. Warm tones like red, orange, and yellow evoke energy and enthusiasm, while cool tones like blue and green promote calm and comfort. |
Pedestrian Area Scale | A clearly defined pedestrian path approximately three metres wide ensures ease of navigation and fosters walkability. |
Safety | Factors such as excessive vehicular traffic, nearby construction, and limited sky visibility (openness of the skyline) significantly impact the perceived safety and comfort of streetscapes. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhou, Q.; Zhang, J.; Zhu, Z. Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images. Buildings 2025, 15, 2970. https://doi.org/10.3390/buildings15162970