Article

Artificial Intelligence-Simulated Cognition of a Pedestrian Assessing a Built Environment

by Rachid Belaroussi 1,* and Nikos A. Salingaros 2,3
1 COSYS-GRETTIA, University Gustave Eiffel, F-77447 Marne-la-Vallée, France
2 Department of Mathematics, The University of Texas at San Antonio, San Antonio, TX 78249, USA
3 Thrust of Urban Governance and Design, Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
* Author to whom correspondence should be addressed.
AI 2026, 7(3), 110; https://doi.org/10.3390/ai7030110
Submission received: 23 December 2025 / Revised: 4 March 2026 / Accepted: 10 March 2026 / Published: 13 March 2026

Abstract

How closely do the subjective perceptions simulated by Artificial Intelligence align with the subjective perceptions of human participants when evaluating an urban environment? This study serves as a pilot investigation to explore how far multimodal Large Language Models can effectively model human responses to visual stimuli based on subjective criteria. This exploratory research is intended to test the feasibility of the methodology rather than to provide a definitive standard. By focusing on a small set of detailed audits, a small-scale experiment performs an in-depth, qualitative examination of how machine and human assessments compare in specific situations. To conduct the comparison, ratings of urban scenes were collected from human participants and two multimodal Large Language Models: ChatGPT and Gemini. After being shown an image of each sidewalk, these appraisers used a set of proposed statements to rate three sidewalks on a Likert scale. The investigation focuses on seven statements that subjectively characterize walkability factors, overall friendliness of an area, and the environment’s influence on well-being. Each participant rated each image once for all statements to establish a human baseline. The algorithms’ scores were generated using the exact same prompt, repeated multiple times to account for non-determinism. We then compared the AI’s scores to the humans’ distribution of scores and evaluated their alignment according to different experiential qualities across diverse visual environments.

1. Introduction

This article evaluates the attraction and suitability of an urban environment based on human responses. The human body is used as a measuring instrument that integrates obvious response features with several subtle channels of informational communication. Based on the processed information, a user reacts by making decisions on everyday actions, such as movement along a street. Incoming information is strongly biased toward the visual channel because informational input to the brain is predominantly visual. For this reason, we expect reasonably accurate results when using photos in a survey. Of course, a photo cannot replace the actual site because it lacks the real-time experience of other channels of sensory information (sound, smell, and touch).
Beneath this apparently simple diagnostic tool lies a methodological and philosophical approach to urban design that we make explicit in this paper. We follow the work of Christopher Alexander [1,2] in accepting that the physical environment shapes human physiological states and, consequently, people’s behavior. Yet this influence is in large part subconscious. For this reason, trying to judge urban design by means of standard criteria is limiting at best, since there are so many other factors that contribute to a user’s experience of the built environment. This line of reasoning, originally deemed contrary to abstract urban design based on formalist principles, is now supported by recent experimental measurements based on neurophysiology. Wang et al. [3] offer a clear introduction to neuro-architecture, a branch of neuroscience that examines how the human brain responds to and interacts with the built environment, along with its limitations. Karakas and Yildiz [4] recently proposed a survey that systematically examined the corresponding emerging concepts at the intersection of architecture and neuroscience. Other authors proposed surveys of neuroarchitecture assessment, such as Ghamari et al. [5] or Higuera-Trujillo et al. [6]. One aim of the present study is to make readers aware of this work by other researchers who are establishing the neurophysiological basis for how the body reacts to urban geometry and all its visual components.
The human body reacts instinctively to every piece of information presented in its immediate environment: built structures, changes of level, vegetation, street furniture, wall surfaces, and enveloping geometries both horizontally (curved walls and entrances) and vertically (overhangs) [5,6]. In an urban setting, kinetic information of moving vehicles and other pedestrians adds to evaluating the place for the observer’s own safety. For example, the presence of bollards lining the street can make a sidewalk feel instinctively safer, even though the user does not notice the reason immediately. We can understand all of those reactions, acting unconsciously, because the body reacts to evolved neurocircuits that developed for human (and animal) survival in natural environments. Geometrical and other information from the environments determines bodily states and influences decisions unconsciously. People find positive valence reactions from natural settings, a topic that is now being investigated in the context of nature-induced healing, such as Seresinhe et al. [7], who investigate the features that make outdoor spaces aesthetically pleasing and explore whether places perceived as beautiful differ from those that are simply natural. It is worth pointing out that traditional healthcare included the influence of natural settings in promoting well-being, while new experiments establish a strong nature-health correlation.
Historically, personal surveys and questionnaires were the standard means of assessing user reactions to urban environments, whether related to geometry or other features of usability. However, that method fails to separate opinion (due to socio-cultural norms) from actual physiological responses. Nowadays artificial intelligence (AI) is rapidly replacing humans for many evaluative assessments, as it proves easier to disentangle pure body reactions from learned preferences. One aim of this paper is to compare how humans and AI—specifically, Large Language Models (LLMs)—can evaluate an urban setting for its attractiveness to users. This work is focused on the user feeling at ease and is not directed towards any aesthetic or “design” goal. The results are compelling and are intended to provoke further research.
A key aim of this investigative comparison is to establish criteria by which a large-scale survey could be duplicated by the clever use of LLMs. The reason is that statistically significant public surveys depend upon a large number of participants. This makes them both cumbersome and expensive to set up and administer. By contrast, LLMs can be used immediately with the appropriate set of prompts. The result is instantly available and infinitely repeatable. Although LLM simulation of human subjects is not yet developed to a sufficiently high level, promising developments in AI will make this an increasingly approachable research goal. At some point in the future, therefore, we expect to be able to evaluate the human-centered qualities of urban environments using Large Language Models. That is, to use generative AI to predict how humans will react to a built environment based upon their human, emotional sensing. Once this goal is achieved, opportunities for both research and applied design become apparent. Several authors have already applied LLMs to evaluate the “human” qualities of urban environments, with unexpected success. Some points for investigation can be listed here:
  • Recursion in evaluating streetscapes and open urban space can identify features to be improved and also test possible solutions virtually for their efficacy. Feedback is almost immediate.
  • Urban designs can be evaluated by presenting a spectrum of variations, while an LLM selects from among them. The process of variation and selection can be repeated indefinitely, leading to convergence on an ideal human-centered design solution.
It would be useful to develop an evaluative tool that requires little overhead and can be applied recursively without costs. We draw the analogy with traditional urban design features that developed over generations of trial and error, paralleling the classic model of organismic evolutionary development through trial selection. The drastic difference is that historical urban fabric evolved towards optimal form over years and centuries, whereas we hope to duplicate this process through AI acting on the very short time scale of a standard design project.
Investigations of LLMs for urban planning from street view pictures are recent: their applications range from the analysis of road safety, comparison of walkability preferences from pairwise images, perception of streetscape changes, saliency modeling, and architectural aesthetics or urban attractiveness. On the application of road safety, Zhang et al. [8] input two street view images simultaneously into an LLM to determine which image is perceived as safer, Tang et al. [9] unveiled road safety factors with LLM, whereas Cheng et al. [10] proposed to integrate LLM for safety within a digital twin framework. Wedyan et al. [11] used ChatGPT to compare pairs of images to rank walkability preference. Xiao et al. [12] as well as Liang et al. [13] investigate how LLMs perceive changes in streetscapes, assessing whether they interpret these transformations as improvement or deterioration. Other authors, such as Zhu et al. [14] or Zhang et al. [15,16], proposed methods for saliency prediction. In a more holistic approach, Malekzadeh et al. [17] assessed the performance of ChatGPT when scoring urban attractiveness from a single street view image, while Zhou et al. [18] proposed to use ChatGPT to rank a set of images based on attractiveness and accessibility.
The possibilities of applying AI to urban planning are endless, as described by Laurini [19]. They comprise distinct tools at very different scales. On the scale of a city, distinct overlapping flow networks can be optimized through AI. Here, the focus is on the human dimensions of urban space and, in particular, street space. Analyzing the emotional responses of pedestrians on a sidewalk gives a much clearer picture of which urban design elements contribute towards—versus which detract from—the urban experience. Again, we emphasize our focus on emotional and physical well-being, as opposed to being concerned strictly with the mechanical efficiency of urban functions that is now standard, as described by Mouratidis [20].
In contrast to top-down urban planning, where streets are drawn on a plan to optimize some traffic functions, the physical and emotional experience of a street setting is shaped by complex visual and spatial interactions that are not yet part of standard planning practice. These interactions of the human body with the geometry of the environment occur on a much more intimate range of scales: these are defined by the physical scales of the human body, from its immediate reach to its height to its components (arm, hand, and eye, etc.). Here, we observe the principle of welcoming space: pedestrian use will increase only where people feel an emotionally welcoming atmosphere. Among the major factors contributing towards this positive state, the complex geometry of the environment plays an important, though often neglected, role. This is what we try to measure in this paper.
Attempting to assess the users’ reaction to a multitude of informational signals in real time proves to be an impossible task in physical situations. Nevertheless, recent technical tools such as portable sensors and virtual reality dynamic analysis in the laboratory make such measurements feasible. We hope to use image analysis carried out by Large Language Models to approximate human reactions to the urban information field. Towards this goal, the prompts to the LLM are phrased in terms of human emotions, not abstract geometrical concepts. Since the training base for LLMs covers open-source data on how humans react to a multitude of urban elements and settings, the LLM can indeed gather distributed information to give an accurate answer.
An analysis of AI-simulated cognition is proposed: not just testing an LLM’s language skills, but its ability to integrate a visual stimulus with an adopted point of view to produce a subjective response that mimics human perception. The following two research questions (RQ) can be formulated:
  • RQ1: Can an AI agent’s evidence-based rating be used as a substitute for human subjective perception? More precisely, we measure how closely the subjective perceptions simulated by Gemini and ChatGPT align with the median subjective perceptions of a sample of human participants when evaluating an urban environment.
  • RQ2: In the subjective assessment of an urban environment, which multimodal Large Language Model—Gemini or ChatGPT—demonstrates a higher degree of correlation with the average subjective ratings of human participants across statements requiring a mix of objective visual analysis and simulated aesthetic and emotional perception? The second research question is about the performance differences between ChatGPT and Gemini when tackling a subjective, visual perception task. How do the ratings differ when comparing human raters, Gemini, and ChatGPT on a 5-point Likert-scale assessment of specific aesthetic, safety, and functional attributes of a visual urban environment?
In this study, our research questions focus on the analysis of an urban environment by two AIs, based on photographs of the city of Santa Cruz de Tenerife, Spain. First, we analyze four factors of Alfonzo’s hierarchy of walking needs [21] related to the urban form. For each of these walking needs, human raters and the AI chatbots assessed a statement selected from the protocol proposed by Lindelow et al. [22]. Second, we evaluate a more global descriptor of the overall feeling of the urban scenes following a methodology previously proposed in the literature. Third, we analyze the auditing of two factors related to the influence of the built environment on well-being.
This investigation is designed as a pilot study to test the feasibility of using multimodal LLMs in urban auditing. Rather than providing an exhaustive technical standard, this exploratory research utilizes a small-scale experiment to evaluate the nuances of AI-human alignment within a highly controlled setting. While the results must be interpreted with caution due to the focused selection of three illustrative cases, by prioritizing the depth of consensus and the qualitative mechanisms of agreement over a large representative corpus, this work serves as a starting point for determining how AI might eventually supplement human auditors in the field. This targeted approach ensures that the human participants could provide the high level of attention required for detailed inspections, thereby establishing a high-fidelity baseline for this preliminary comparison.
Human-led urban auditing is a bottleneck for creating healthy cities, and AI “simulated cognition” is a necessary solution. Traditional urban audits require physical site visits or manual surveys. This makes it impossible to audit entire cities or compare thousands of street-level designs in real-time. Without scalable tools, urban planning remains reactive rather than proactive. Poorly designed environments lead to lower well-being and reduced walkability, but we lack the “eyes” to evaluate every street. The present research is not just a test; it is a preliminary validation of AI tools that could theoretically audit a million street views in hours, using a psychologically grounded framework. Current automated systems excel at objective feature extraction but lack the capacity for subjective appraisal—the simulated cognition required to evaluate an environment’s influence on well-being and safety. Recent approaches testing the use of an LLM remain on a holistic assessment level (general safety or visual appeal), failing to capture the subjective lived experience that determines whether people actually choose to walk, linger, or avoid a place. Without scalable methods to evaluate this experiential dimension, urban design risks remaining optimized for vehicles and abstract efficiency—rather than for human comfort, safety, and joy. This research represents an initial step in investigating whether emerging LLMs might eventually bridge the gap between automated data collection and subjective human experience. By testing AI-simulated perceptions against human consensus in this focused context, we offer a preliminary exploration into the feasibility of more scalable, psychologically informed urban auditing.
The novelty of the approach is that we adopt a specific human-centric lens (e.g., a pedestrian) to analyze visual stimuli, testing the AI’s ability to bridge low-level visual features with high-level cognitive appraisals. Instead of general image tagging, we propose a prompting framework based on environmental psychology theory. We map AI responses to three distinct psychological dimensions: functional hierarchy (accessibility, safety, and comfort) based on Alfonzo’s Walking Needs, holistic synthesis representing an overall streetscape perception, and affective well-being (place and distance) measuring the environment’s emotional draw. We thus provide an empirical test of Alfonzo’s walking needs hierarchy within the context of LLMs. We also identify the specific urban attributes where AI intuition fails compared to human lived experience, highlighting the current limits of LLM visual reasoning in environmental psychology under selected, specific conditions. We examine which model architecture more effectively correlates with the average human observer when the task requires an affective rather than a purely descriptive response.
The comparative analysis of this exploratory research rests upon four fundamental evidentiary pillars: median-based measures of central tendency, interquartile range for repeatability assessment, violin and boxplot distribution visualizations, and intraclass correlation coefficients. This analysis goes beyond simple correlation and addresses the stochastic nature of AI outputs. Median values provide a robust consensus measure for subjective Likert data. The interquartile range serves as a measure of both intra-model stability and the spread among human ratings, identifying which factors are the most subjective and therefore less likely to be captured by the AI. Violin plots are employed to indicate the density of human and AI sentiments for each aspect that is measured, while the intraclass correlation coefficient provides a descriptive statistic of absolute agreement. Together, these tools provide a detailed and reproducible metric for determining how far a human audit can agree with an impartial AI prompt. These results are exploratory and context-dependent. They provide an illustrative foundation for the potential integration of AI tools in sidewalk assessments, paving the way for more extensive studies to validate these initial observations across diverse urban environments.
The remainder of this paper is structured as follows: Section 2 describes the methods, audited aspects, and data used in the study. Section 3 presents the experimental results obtained from the comparative analysis of human ratings and AI output. Section 4 provides a framework for discussing the results, while Section 5 outlines the limitations of the work. Finally, Section 6 offers the conclusions and outlines directions for future research.

2. Materials and Methods

This section describes the input used for the audit and the proposed methodology, including a description of the human sample and the conversational AI agents assessed. The input is composed of the visual scenes and the audited aspects: the evaluated images are presented with their characteristics, and the audited aspects are listed along with the statements to be rated and their origins in the literature. The methodology explains the step-by-step protocol required to implement the experiment and describes the rater groups. The demographics of the human raters are specified, as well as the base model versions of the two AI systems.

2.1. The Stimulus and Audited Aspects

The case study is based on photographs of three specific sidewalks selected in the city of Santa Cruz de Tenerife, Spain. The human raters do not have any residential experience of these areas. Figure 1a shows the spatial location of the three selected sidewalks. Three sites were chosen from the city center: the first one, Calle San Martin, is in a residential neighborhood; the second one, Calle Serís, is in a predominantly commercial area; and the third one, Calle Méndez Núñez, is in the City Hall historical area. The photos of the urban scene were acquired with a smartphone; these images are available upon request to the authors. They were selected because the camera has a similar angle, the buildings are of similar size, and the weather is similar—particularly the color of the sky—to avoid bias.
Sidewalk 1 is shown in Figure 1b. The ambiance is somewhat Art Deco, with, on the left, some buildings predominantly painted in a vibrant terracotta color and some vivid green window frames and doors on the right side of the street. A solid yellow line separates the road from the sidewalk. It is paved with white stones and is marked by an absence of vegetation. Sidewalk 2 is in a part of Calle Serís illustrated in Figure 1c: it is a more tropical urban street combining tall city buildings. The street is marked with a solid yellow line, similar to the previous image. A row of parked scooters lines the curb on the left side of the street, making the scene slightly cluttered. In the distance, a distinctive rounded corner building is visible, painted in a bright, contrasting orange. Sidewalk 3 is pictured in Figure 1d: it was acquired in a more historic street of the city. It is paved with large, unpolished stone blocks and runs alongside a plain, light-colored wall of a large, older building. The wall appears aged, with some stains and minor graffiti. A row of trees separates the sidewalk from the road, providing a strong natural buffer and shade. The roadway is separated from a bicycle path with a lane marking and poles, and two cars are moving down the street. Across the street, the town hall of Santa Cruz de Tenerife, a more ornate, classical white building with multiple windows and balconies, is visible.
Each of these three urban scenes is used as a set of stimuli for a quality assessment process: Table 1 summarizes the audited aspects and the statement formulated to the raters who had to evaluate them. There are three groups of audited aspects in Table 1: the first is related to the walking needs hierarchy introduced by Alfonzo [21], the second is a more general assessment of the streetscape, and the third consists of two essential components of well-being in an urban space.
The hierarchy of walking needs [21] refers to a conceptual framework that ranks pedestrian needs from basic feasibility to higher-order comfort and enjoyment, based on the fact that higher-level needs are only relevant once lower-level considerations (related to access and safety) are satisfied. These include factors of urban form and design related to the choice of affordances for enabling and encouraging walking, namely:
  • Accessibility is the fundamental aspect of urban form, reflecting the presence or absence of walking infrastructure;
  • Safety measures the absence/presence of threat or feeling of being in a secure place to walk;
  • Comfort designates urban design facilitating walking, such as barriers protecting people from vehicular traffic;
  • Pleasurability is associated with stress-free walking, including aesthetic appeal and urban liveliness.
To measure how well an urban environment satisfies these needs, we selected or adapted specific questions from the protocol defined by Lindelow et al. [22], in research addressing perceptions of the built environment using the conceptual framework of Alfonzo.
These factors are related to the needs that must be fulfilled to make the choice of walking possible. A more general impression of the perception of the streetscape can be evaluated globally by asking a person directly how friendly they perceive the environment to be. We adapted a statement previously proposed in the literature, in a research project proposing a streamlined evaluation using a single descriptor, “friendliness”, to compare pairs of architectural design proposals. This criterion offered a rapid screening measure that corroborated a tailored prompt instructing the LLM to use specific emotional and geometrical criteria for separate evaluations of image pairs distributed over 25 questions.
The last set of inquiry points relates to the impact of urban environment on the well-being of people. Two descriptors were selected: the attractiveness of the place and the extent to which the distant view encourages forward movement. Together, they capture how the environment supports either lingering or comfortable progression. These descriptors are referred to as place and distance:
  • Place describes the extent to which an individual feels comfortable remaining in a given location, contributing to psychological well-being.
  • Distance characterizes the ability of the streetscape to provide an appealing visual goal motivating forward movement along the street rather than discouraging it.
Together, these factors are essential for good urban design because they balance the psychological and physiological need for immediate comfort with the innate desire for exploration.

2.2. Methodology

2.2.1. Procedure

Figure 2 illustrates the experimental protocol proposed to execute the comparison methodically; the test plan can be summed up as follows:
  • Collect human data: gather scores from a statistically significant sample of human participants, to establish a human baseline.
  • Evaluate internal consistency (or repeatability): measure how much each LLM varies from one run to the next when given the same prompt.
    Collect AI data: submit the same prompt to the AI $N_{it}$ times for each statement and average the scores.
    Metric: interquartile range (IQR) of the scores produced by Gemini and ChatGPT for each statement. A high IQR means the model is highly inconsistent (low repeatability), which would make its scores unreliable for comparison.
  • Measure cross-group alignment (AI versus human) by comparing central values, i.e., the AI’s median scores to the human median scores. Complete the analysis using intraclass correlation to assess the reliability and agreement of each conversational AI agent with human raters, which provides a statistical descriptive measure of how interchangeable AI and humans are as raters.
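The repeatability and alignment steps above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function name `summarize`, the sample ratings, and the choice of the inclusive (linear-interpolation) quartile method are assumptions made for illustration only.

```python
from statistics import median, quantiles


def summarize(scores):
    """Return (median, IQR) for a list of Likert ratings."""
    # Assumption: quartiles use linear interpolation over the observed
    # range ("inclusive" method); the study may use another convention.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1


# Hypothetical ratings for one statement and one image (not study data).
human = [2, 3, 3, 4, 4, 4, 5, 5]           # one rating per participant
llm_runs = [4, 4, 4, 4, 3, 4, 4, 4, 4, 4]  # one rating per independent chat

human_median, human_iqr = summarize(human)
llm_median, llm_iqr = summarize(llm_runs)

# Cross-group alignment on central values: AI median minus human median.
gap = llm_median - human_median

print(f"human: median={human_median}, IQR={human_iqr}")
print(f"LLM:   median={llm_median}, IQR={llm_iqr}, gap={gap}")
```

In this made-up example the LLM runs have an IQR of zero (high repeatability) while the human ratings are more spread out, which is the kind of contrast the IQR metric is meant to expose.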
Table 2 summarizes the data collection process. The human questionnaire starts with a short description of the experiment, stated as follows:
In this study, we are focused on auditing urban environments through selected scenes. The scenes are photos acquired on different streets. The participation is anonymous, and the whole survey will take you about 4 min.
You will be shown three images of areas of a city and will then be asked questions related to the quality and walkability of the sidewalk.
This opinion survey may be published in a research paper; by completing it, you voluntarily consent to the processing and publication of your data.
No specific definitions or specifications related to the way of assessing an urban environment are given. The questionnaire is composed of three identical parts: in each, an image of a sidewalk is displayed, and the participant is instructed to “Imagine you are a person walking on this sidewalk”. They are then asked to rate the four statements related to the walking needs on a 5-point Likert scale, ranging from strongly disagree to strongly agree. The image is displayed again, with the instruction “Now, imagine you are standing there on the street”. They are asked to rate the two statements related to environmental influence on well-being and to provide an overall score for the friendliness of the street out of 10.
A new independent session is opened to collect data from the AI chatbots for each run and for each image. The image is uploaded, and the following prompt is entered:
First, task: Imagine you are a person walking on this sidewalk. Rate the importance of the following statements regarding this urban environment on a 5-point Likert scale, from Strongly disagree (1) to Strongly agree (5). (A) The environment along the route is beautiful and attractive. (B) The route feels planned for me as a pedestrian. (C) I do not worry about the traffic when I walk along this route. (D) It is a practical path to walk. Second, task: (E) Give an overall pedestrian-friendly score (e.g., out of 10) for this street. Third task: Imagine you are standing there on the street. Rate the importance of the following statements regarding this urban environment, on a 5-point Likert scale, from Strongly disagree (1) to Strongly agree (5). (F) Visual information makes it a pleasure to have to wait here for someone. (G) The distant view is interesting and draws me to walk in that direction. No need for justification: return just the scores in a table with seven values (row A to G).
$N_{it} = 10$ independent new chats are opened to eliminate any memory effect in the LLM; otherwise, the LLM would adjust its previous estimates on each run to accommodate the prompter’s supposed desire.
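As an illustration of how the replies from the independent chats could be collected, the following Python sketch parses a seven-row score table and takes the per-statement median across runs. The reply layout (a Markdown-style table), the names `parse_scores`, `aggregate`, and `SAMPLE_REPLY`, and the sample values are all hypothetical; the actual chatbot output format is not specified in the text and may differ.

```python
import re
from statistics import median

LABELS = "ABCDEFG"

# Matches a row label A-G at the start of a line, followed by its score.
# Assumption: each reply lists one statement per line, e.g. "| A | 4 |".
ROW = re.compile(r"^[^\w\n]*\(?([A-G])\)?[^\w\n]+(\d+(?:\.\d+)?)", re.MULTILINE)


def parse_scores(reply):
    """Extract {label: score} from one chatbot reply; fail if a row is missing."""
    scores = {label: float(value) for label, value in ROW.findall(reply)}
    missing = set(LABELS) - scores.keys()
    if missing:
        raise ValueError(f"missing rows: {sorted(missing)}")
    return scores


def aggregate(replies):
    """Median score per statement across independent runs."""
    runs = [parse_scores(reply) for reply in replies]
    return {label: median(run[label] for run in runs) for label in LABELS}


# Hypothetical reply (illustrative values only, not study data).
SAMPLE_REPLY = """| Statement | Score |
|---|---|
| A | 4 |
| B | 3 |
| C | 2 |
| D | 4 |
| E | 7/10 |
| F | 3 |
| G | 4 |"""

print(aggregate([SAMPLE_REPLY, SAMPLE_REPLY.replace("| A | 4 |", "| A | 2 |")]))
```

Raising an error on incomplete tables, rather than silently skipping rows, keeps a malformed reply from biasing the per-statement statistics.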

2.2.2. Rater Groups

The study involves two categories of raters: human participants and AI systems. Human raters were recruited from a diverse pool of volunteers with no residential experience in the evaluated areas. Both AI models were used in their publicly available versions at the time of the study.
A questionnaire was distributed online through an academic e-mail invitation to participate. Among the N = 68 human participants, 70% are researchers or professors, 13% are students, 10% are administrative staff, and 7% are retired or unemployed, as shown in Figure 3a. Gender is slightly biased toward males, with 38 men (57%) and 28 women responding (42%), and one person who declined to declare their gender. Figure 3b shows the number of persons who declared their age: 7% of respondents are above 60 years old, and the remainder is equally divided between young 20–40-year-old adults and more mature 40–60-year-old persons. Considering the subjective character of the statements proposed and the large variety of people represented in this group, a widespread range of expressed opinions is to be expected.
Version 5.1 of ChatGPT was used (chatgpt.com accessed on 18 November 2025). Due to its extensive training in conversational nuance and subjective human feedback, ChatGPT should show a better ability to emulate the human persona and therefore achieve higher alignment on purely subjective statements, despite a potentially less integrated visual processing framework.
Gemini 2.5 Pro was used (gemini.google.com accessed on 18 November 2025). Gemini has a natively multimodal architecture, processing images and text in a unified framework. Supposedly, it should demonstrate a higher alignment with human scores on statements requiring complex, visual contextual inference, such as safety and aesthetic judgment, compared to ChatGPT.

3. Experimental Results

This section reports the results obtained based on the proposed statements, broken down into the three categories of audited aspects: walking needs, overall impressions, and environmental aspects influencing well-being. Results are reported statement by statement. Based on visual comparisons of the rating distributions, the median rating differences and the statistical dispersion of the distribution are measured by the interquartile range. Finally, the reliability and agreement of each conversational AI agent with human raters are assessed using intraclass correlation, which provides a statistical descriptive measure of how interchangeable AI and humans are as raters. Summary statistics of ratings can be found in Table A1 and Table A2.
It is important to note that these findings represent an exploratory analysis of the three specific urban environments selected for this pilot study. The resulting claims should be viewed as illustrative rather than exhaustive: they are intended to provide an in-depth examination of these specific, well-defined contexts. The scope of these conclusions is inherently linked to the specific sidewalk characteristics investigated. Consequently, they serve as a methodological proof-of-concept that warrants further investigation across a broader range of urban settings.

3.1. Urban Form: Walking Needs

The yellow violin plots in Figure 4 show the distribution of human ratings for the four criteria abbreviated as Pleasure, Comfort, Safety, and Access. The distribution of ChatGPT output for each criterion and each sidewalk is drawn in gray. For each case, the third distributions in light blue come from Gemini responses. Boxplots of each distribution are also drawn in order to show the interquartile range—the difference between the 75th and 25th percentiles of the data—as well as the spread of the data. The median of each distribution is the red segment. Comparing the location of this segment across the three groups for each image and each walking needs item is the most straightforward way to assess the central tendency of their ratings.
There are three sidewalks, and for each one, four aspects of walking needs are rated, resulting in twelve distributions for each rater. Out of twelve, four human rating distributions have an IQR = 2, which shows a large spread of opinions, while the remaining eight have an IQR = 1, which is a small spread in the opinions. Most of the cases of uncertainty are human opinions on a statement oscillating between neutral and agree or neutral/agree/strongly agree, with the notable exception of pleasurability of Sidewalks 1 and 2, for which the opinions showed three modes: disagree, neutral, and agree.
Eight out of twelve ChatGPT distributions, illustrated in gray in Figure 4, have an IQR of 0, showing extreme confidence and repeatability in its output. Three ChatGPT distributions have an IQR = 1 for rulings oscillating between Neutral and Agreement within walking needs statements. Gemini is less confident or more nuanced with half of its distributions with an IQR = 0, and 4 out of 12 distributions with an IQR 1. Gemini also shows more outliers than ChatGPT, making it a little less consistent a priori.
To compare the three appraisers (human, ChatGPT, and Gemini), using the median of the responses is preferable to the mean for several reasons. Subjective rating data such as Likert-type scores are ordinal rather than interval-scaled, they often contain outliers, and they frequently show unequal variance across raters. In particular, human ratings are rarely normally distributed but mostly bimodal or trimodal, as seen in nearly all twelve yellow distributions of Figure 4. Some respondents can be strict or lenient: the median is unaffected by extreme values, whereas the mean is pulled by them. The median therefore provides a more robust representation of the “typical” score of each appraiser, which is the gauge of interest in this study.
The first distribution in Figure 4a shows a substantial variability in human ratings of the pleasurability need for Sidewalk 1: 30% judge it negatively, 30% neutrally, and 40% more positively. ChatGPT deemed this aspect as either neutral (4 runs out of 10) or positive (6 out of 10), whereas Gemini selected neutral most of the time. The median rating for both humans and Gemini is 3, so that the difference median(Gemini) − median(humans) is equal to zero, as reported in Table 3. The median difference with ChatGPT is +1, since the median of ChatGPT ratings of pleasurability for Sidewalk 1 is equal to 4.
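The median difference and the IQR used throughout this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the human sample is invented to reproduce only the 30%/30%/40% split, and the ChatGPT runs assume that "positive" corresponds to a rating of 4. Quartile conventions differ between tools, so IQR values may not match the figures exactly.

```python
from statistics import median, quantiles

def iqr(ratings):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return q3 - q1

def median_difference(ai_ratings, human_ratings):
    """Delta = median(AI) - median(humans) for one statement and one image."""
    return median(ai_ratings) - median(human_ratings)

# Illustrative data only. The ChatGPT runs follow the 4 neutral / 6 positive
# split reported in the text, assuming "positive" means a rating of 4.
chatgpt_runs = [3] * 4 + [4] * 6
# Hypothetical human sample with a 30% / 30% / 40% split (not the real data).
human_sample = [2] * 3 + [3] * 3 + [4] * 4

delta = median_difference(chatgpt_runs, human_sample)
print("median difference:", delta)
print("IQR of ChatGPT runs:", iqr(chatgpt_runs))
```

A positive difference means the AI rates the aspect more favorably than the typical human; an IQR of 0 means every run returned the same rating.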
For this aspect of pleasurability evoked by Sidewalk 1, it is noteworthy that both LLMs can have comparable justifications when giving the same rating. ChatGPT explained its choice of neutral by assessing Sidewalk 1 as “The street has colorful, traditional-style buildings that add charm, but some facades appear worn and could use maintenance. It is visually interesting but not highly polished”. Gemini explained that “It is not “beautiful,” but it is not ugly either. The colorful buildings (like the red one) have some character, but other parts (like the gray wall on the right) look a bit worn down.”
ChatGPT rated the aspect of comfort as neutral (rating = 3) and the aspect of accessibility as good (rating = 4) with total confidence (IQR = 0), leading to differences from the human median ratings of these two aspects of −1 and zero, respectively. On the aspect of comfort, when rating Sidewalk 1 as neutral, ChatGPT stated, “There is a sidewalk, it is continuous, and the curb is well-defined. However, the sidewalk is narrow, and the pole in the middle further restricts the walking space. The street seems designed primarily for vehicles.” On the aspect of accessibility, ChatGPT used justifications such as “The route looks straightforward, flat, and continuous, making it practical for walking, though the limited sidewalk width might be inconvenient at times”.
Gemini justified its disagreement with the comfort statement (rating = 2) with the following explanation: “The sidewalk is a decent width, but there is a large pole placed right in the middle of it, which is a major obstacle. It forces me to walk around it and does not feel like a well-planned path”. Both LLMs detected the pole, but only Gemini deemed it as a major inconvenience. Comfort and safety of Sidewalk 1 were largely underestimated by Gemini, with a difference of −2 with human median ratings. On the aspect of safety, Gemini stated “The sidewalk is right next to the active traffic lane with no buffer at all—no parked cars, no grass strip, nothing. I would be very conscious of cars passing closely.”, while ChatGPT justified a ruling of safe sidewalk by declaring, “Traffic appears light, with only one lane for vehicles and a clear pedestrian zone. The narrow street likely limits vehicle speed, which helps safety.”
The close alignment of median ratings displayed on Figure 4b indicates that Sidewalk 2 was fairly well modeled by the two AIs across the four criteria. However, the safety need statement was rated with low confidence by both AIs (IQR = 2), their judgments oscillating between 2 (disagree), 3 (neutral), and 4 (agree) across the 10 independent runs.
Figure 4c shows that, on the criterion of pleasurability, Sidewalk 3 was judged by humans either as beautiful and attractive (54%) or neutral (35%). The other three criteria were judged more positively, especially accessibility, with 80% of the raters considering Sidewalk 3 a practical path to walk. ChatGPT was less positive, judging pleasurability, comfort, and safety as mostly neutral. On accessibility, ChatGPT aligned with human raters, judging the sidewalk as a practical path to walk. Gemini was more positive on comfort and accessibility, assigning ratings of 5 out of 5 with strong confidence. Gemini’s assessment of pleasure aligns with human perception. Gemini deemed the safety of Sidewalk 3 as neutral to very safe, while the humans expressed a strong sense of safety.
As reported in Table 3, ChatGPT gives the same median rating as humans in one-third of the twelve walking needs statements ($\Delta_{\mathrm{ChatGPT}} = 0$). In nearly all other cases, it underestimates human evaluation by one unit ($\Delta_{\mathrm{ChatGPT}} = -1$), opting for a neutral assessment when humans are more positive about an aspect. ChatGPT is typically more conservative than Gemini but also more confident in its judgments, as reflected by its lower IQR values. Gemini matches the human median rating in half of the twelve cases ($\Delta_{\mathrm{Gemini}} = 0$). It overestimates human evaluation by one unit ($\Delta_{\mathrm{Gemini}} = +1$) in 25% of the cases, strongly agreeing (rating = 5) with a statement when humans only agree with it (rating = 4). However, in two instances its output directly contradicts the human ratings ($\Delta_{\mathrm{Gemini}} = -2$), deeming Sidewalk 1 uncomfortable and unsafe when humans judged it safe and comfortable. Gemini is therefore more frequently aligned with human ratings than ChatGPT, but, with a larger IQR, it is also more creative yet occasionally entirely off-target.

3.2. Pedestrian Friendliness

Figure 5 presents the distribution of ratings for the three sidewalks of interest. These scores reflect a streamlined evaluation using a single descriptor, “friendliness”, which gathers greater consensus than the walking needs statements. The legend remains the same as in previous figures, but the scale of the possible scores now ranges from 1 to 10: the human ratings are in yellow, ChatGPT’s ratings are in gray, and Gemini’s are in blue. All raters assigned relatively high scores, especially human ratings that concentrate in scores between 6 and 9: 80% of persons for Sidewalk 1, 85% for Sidewalk 2 and Sidewalk 3.
ChatGPT’s ratings remain consistent, with IQR 1 for the three sidewalks, and its median scores align closely with the human evaluations, as reported in Table 4. Gemini’s performance falls slightly below that of ChatGPT, especially because it underestimated the quality of Sidewalk 1, rating it with a median score of five out of ten. To justify such a low grading, Gemini invoked the lack of any buffer from moving traffic and the pole obstructing the middle of the street.
Interestingly, ChatGPT and Gemini produced almost identical median friendliness scores for Sidewalk 2, yet with opposite explanations. Gemini invoked the wide, flat sidewalk and the addition of greenery as beneficial features while noting the lack of a sufficient buffer from road traffic. ChatGPT, on the other hand, deemed the road fairly walkable and safe but pointed out that the sidewalk could be wider and include more greenery; the drawback was the prioritizing of parked scooters.
The separate arguments of the two AIs for judging Sidewalk 3 were complementary: Gemini underlined the safety and practicality of the walkway, whereas ChatGPT based its judgment on space constraints and some visual clutter.

3.3. Environmental Aspects Influencing Well-Being

Two descriptors were chosen to characterize the influence of the built environment on well-being: the attractiveness of the place and how the distant view draws a person toward forward motion. They reflect how the environment supports or discourages movement or makes one feel comfortable resting in place. This second factor underlies the success of all pedestrian environments, yet it is often insufficiently emphasized in mainstream planning. We called these descriptors place and distance:
  • Place characterizes that we feel comfortable standing for a period of time in that spot; hence, it is ultimately good for our psychological and physiological health.
  • Distance validates the entire street since it offers an attractive goal instead of discouraging us from moving forward along the street. This is our body reacting.
Figure 6a shows the distributions of ratings of the “Place” factor, and Figure 6b contains the results for the “Distance” factor.
According to human perception of the Place statement, Sidewalk 3 appears predominantly as a good place to stay and linger—probably due to the presence of historical features—while Sidewalk 2 was judged as good by 40% of voters but neutral or bad by 50% of them. Sidewalk 1 was the least popular, with 50% of human voters judging that its visual information does not make it a pleasure to have to wait there for someone, and the rest of the voters equally judging it as either a neutral or attractive place. As reported in Table 5, ChatGPT ruled all three sidewalks as neutral regarding their visual attractiveness and propensity to be a nice spot to stay and linger (rating = 3). The place factor of Sidewalk 1 was rated as 2 by humans and slightly overestimated by ChatGPT and Gemini (rating = 3); meanwhile, Sidewalk 3 was underestimated by both AI with a rating of 3 against a human median rating of 4. On this aspect, compared to human ratings, Sidewalk 2 was correctly rated by ChatGPT and slightly underestimated by Gemini (rating = 2).
The statement to rate for evaluating the distance factor was “The distant view is interesting and draws me to walk in that direction”. Humans did not reach a consensus on this aspect for Sidewalk 1, according to the distribution illustrated in Figure 6b, their votes being equally distributed between disagree, neutral, and agree, leading to a median score of 3. Sidewalks 2 and 3 were judged as more interesting to walk along due to the distant view, reaching a median score of 4. As reported in Table 5, ChatGPT’s median scores of the distance factor align almost perfectly with those of humans for the three sidewalks. Gemini failed to align with the human rating on Sidewalk 3 for this criterion by underestimating it, rating it 2, meaning a bad walkway for movement, when the median rating of humans was 4—meaning that Sidewalk 3’s distant view was perceived as drawing the person to walk in that direction.
Without going into an exhaustive qualitative analysis of model reasoning, we can specify some elements on which the LLMs based their decisions. ChatGPT based its assessment of visual information on the presence of colorful façades, building textures, seating, greenery at eye level, shopfront displays, and public art, whereas negative scoring is based on the narrow sidewalk. Its assessment of the distant view is based on the clear linear perspective of the street, human-scale buildings, visible activity ahead, and the presence of people walking, which create a sense of continuity and curiosity about what lies further along. Gemini indicated that staying in a place is favored by geometric details, historic charm, the presence of seating, and visual interest at eye level such as shop windows or greenery, and disadvantaged by narrow sidewalks and visual obstructions such as a black utility pole in Sidewalk 1. For the distant view, Gemini also mentioned the other pedestrians further down the path, suggesting that the street leads to a more active area, as well as an urban canyon effect produced by the buildings framing the road, which, “combined with the light at the end of the corridor creates a strong sense of curiosity”. The low rating of Sidewalk 3 by Gemini comes from the transition from a sidewalk to a wider multilane roadway that can be “intimidating rather than inviting”, and also from the poor-looking facade with metal shutters and lack of windows, creating a boring effect that does not encourage someone to walk toward the distance.

3.4. Intraclass Correlation Analysis

When comparing how humans and conversational AI agents perceive an urban environment, it is important to know whether they give similar ratings for each attribute, which can be done by comparing median ratings item by item, and how reliable the ratings are, which can be done by IQR analysis. But what matters most is whether the two types of raters are consistent in how they evaluate different scenes across multiple criteria: the question is about agreement between raters, not just similarity in central tendency. To address this question, one can use a statistical tool called the intraclass correlation coefficient (ICC), a measure of agreement among a set of ratings. The variant best suited to this study is the ICC(2,1) model, also called the two-way random effects model for absolute agreement. It is two-way because it accounts for both the variance in the targets (the urban scenes) and the variance in the raters (humans vs. AI). It is random effects because AI and the human group are treated as representative samples of a larger population of potential raters, suggesting that if the AI aligns well with humans on the case scenarios, it would align with them on other urban images. The 1 refers to a single measure, in this case the median of the 10 AI responses compared to the median of the 68 human responses, for the 21 items (3 images × 7 statements). The ICC(2,1) model estimates how much of the total variance in ratings is due to differences between the input stimuli (the sidewalk images and statements) versus differences between raters. The main question addressed is, “Are AI and humans interchangeable as raters?”. ICC(2,1) measurements indicate whether one could swap a human rater for an AI rater and expect similar evaluations, not just similar rankings, which is what Pearson’s r would capture.
Median and IQR alone cannot tell whether two systems agree across multiple items: ICC aggregates agreement across all items and provides a single interpretable number between 0 and 1 that summarizes how well AI aligns with human judgment as a system, not just item by item or image by image.
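The ICC(2,1) computation described above can be sketched with the standard Shrout and Fleiss mean-square formulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `icc_2_1` and the two small score tables are hypothetical, each row stands for one rated item, and each column for one rater (for example, the human median and the AI median).

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `scores` is a list of rows, one per rated item; each row holds one
    score per rater, e.g. [human_median, ai_median].
    """
    n = len(scores)       # number of rated items (targets)
    k = len(scores[0])    # number of raters
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical medians for four items (not data from the study).
identical = [[2, 2], [3, 3], [4, 4], [5, 5]]   # raters agree exactly
offset = [[2, 3], [3, 4], [4, 5], [5, 6]]      # second rater always +1

print("identical raters:", icc_2_1(identical))
print("constant offset: ", icc_2_1(offset))
```

In the second table the two raters rank the items identically, so a correlation coefficient would be perfect, yet the ICC drops below 1 because the rater variance enters the denominator. That is the sense in which absolute agreement penalizes a rater who is consistently higher or lower.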
Table 6 provides the intraclass correlation coefficients obtained when comparing the typical score of humans with those of the AI agents. ICC values range from 0 to 1: an ICC below 0.5 indicates poor reliability, between 0.5 and 0.75 moderate reliability, between 0.75 and 0.9 good reliability, and above 0.9 excellent reliability. For human vs. ChatGPT, the ICC value of 0.87 indicates good reliability, close to the excellent range. With a 95% confidence interval starting at 0.67, well above the 0.5 lower bound of moderate reliability, and a highly significant p-value, one can conclude that ChatGPT’s subjective perception aligns closely with the human median across all the conditions. This result indicates that the judgment of ChatGPT of urban scenes is largely interchangeable with the human median for the identified audited aspects. Gemini reaches an ICC of 0.76, which falls into the “Good” reliability range, though it sits at the lower end of that bracket as compared to ChatGPT. The narrower confidence interval (CI) for ChatGPT indicates that ChatGPT’s alignment with humans is much more robust and statistically stable. The alignment of Gemini with humans varies strongly across the 21 distinct items, with a 95% CI = [0.51, 0.90]. This is an effect of ICC(2,1), which penalizes consistently higher scores. ChatGPT’s ratings were substantially more aligned with aggregated human ratings than Gemini’s in terms of absolute agreement rather than just rank ordering. We will later indicate the items where Gemini used an incorrect logic to evaluate a score, leading to a low ICC.
Notice, however, that, as a preliminary study based on three illustrative cases, these findings should be viewed as context-dependent observations rather than a definitive technical standard. While these patterns are promising, they do not imply that AI can currently substitute for human auditors in a professional capacity. Instead, this empirical comparison serves as a focused starting point, demonstrating the feasibility of using specific LLMs as supplemental tools for identifying nuances in urban perception. Broader claims regarding the reliability of AI as a proxy for human sentiment would require a more extensive dataset to confirm these initial trends across a wider variety of environments.

3.5. Comparative Descriptive Statistics

While the ICC analysis focused on rater interchangeability, this section evaluates the global performance of Gemini and ChatGPT against the human rulings. Table 7 summarizes the correlation (ρ), reliability (α), and agreement (κ) coefficients for both models. Spearman’s rank correlation (ρ) characterizes ordinal trend, evaluating whether the AI agrees with humans on the relative ranking of sidewalks. Weighted Cohen’s kappa (κ) characterizes categorical accuracy, that is, how often the AI picks the exact same Likert category as the humans. It measures agreement beyond what would happen by random guessing. The weighting gives partial credit for near misses, for instance, when the AI chooses agree when the human picks strongly agree. Krippendorff’s alpha (α) characterizes general reliability and is used to assess whether the humans and the AI are in consensus: α determines whether the variation in the scores comes from the actual differences between the sidewalks (signal) or from disagreement between the raters. Table 7 reports the metrics computed by analyzing the distribution of ratings between the LLMs and human participants.
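To make the "partial credit for near misses" idea concrete, the following sketch computes a quadratic weighted Cohen's kappa from two paired rating sequences. It is an illustration under assumptions: the function name and the ratings are hypothetical, and pairing one human score with one AI score per item is a simplification that may differ from how Table 7 was produced. Established libraries offer the same metric, for instance scikit-learn's `cohen_kappa_score` with `weights="quadratic"`.

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with quadratic disagreement weights.

    `categories` is the ordered list of possible ratings, e.g. [1, 2, 3, 4, 5].
    """
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed contingency table of paired ratings.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Penalty grows with the squared distance between categories.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected

likert = [1, 2, 3, 4, 5]
human = [1, 2, 3, 4, 5]        # hypothetical ratings, not study data
near_miss = [2, 2, 3, 4, 5]    # one rating off by a single point
far_miss = [5, 2, 3, 4, 5]     # one rating off by four points

print("near miss:", quadratic_weighted_kappa(human, near_miss, likert))
print("far miss: ", quadratic_weighted_kappa(human, far_miss, likert))
```

Both alternative raters disagree with the reference on exactly one item, but the one-point miss is penalized far less than the four-point miss, which is the behavior the weighting is meant to provide.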
Both models show a strong, significant positive correlation (p < 0.01). Gemini (ρ = 0.628) actually has a slightly better sense of order than ChatGPT (ρ = 0.614) in this specific dataset. Both models are highly successful at ranking which sidewalk items are better or worse in a way that mirrors human trends. According to Krippendorff’s alpha, ChatGPT (α = 0.864) is in the highly reliable range (>0.80). Gemini (α = 0.756) is in the tentatively reliable range (0.667–0.80). This metric penalizes the magnitude of disagreement: ChatGPT is more reliable as a scientific instrument because its deviations from the human ratings are smaller than Gemini’s. According to the quadratic weighted kappa, both models achieve substantial agreement, with κ within the range 0.61–0.80. ChatGPT (κ = 0.782) is very close to the “Almost Perfect” threshold (0.81). This means that when ChatGPT disagrees with a human, it is almost always by only one point on the scale. Gemini’s slightly lower score (κ = 0.739) indicates some larger misses. While Gemini slightly outperformed ChatGPT in rank correlation (ρ = 0.63), its lower alpha score (0.76) suggests that its absolute ratings are more volatile when compared to the human baseline. ChatGPT demonstrates superior performance as a proxy for human consensus, achieving high reliability and substantial agreement, yet comparative analysis reveals that both AI models are significantly aligned with human judgment in terms of ranking sidewalk aspects (p < 0.01).
In addition, a paired-samples t-test was conducted to further evaluate the alignment between human consensus (N = 68) and AI ratings across 21 street-walkability items; mean ratings for each item are listed in Table A3. Both AI models yielded nearly identical mean normalized ratings. No significant difference was found between human normalized ratings (M = 0.63, SD = 0.09), ChatGPT’s normalized ratings (M = 0.59, SD = 0.10), and Gemini’s (M = 0.591, SD = 0.20). Both LLMs demonstrated strong alignment with human consensus (p > 0.05): the paired-samples t-test revealed no significant systematic bias for ChatGPT (t(20) = 1.59, p = 0.13) or Gemini (t(20) = 0.98, p = 0.34). Notably, ChatGPT’s variability (SD = 0.09) closely mirrored human response patterns (SD = 0.09), while Gemini exhibited considerably higher variability (SD = 0.20), suggesting a more polarized evaluation of the urban features that avoids the “middle ground” that humans and ChatGPT usually occupy. However, Gemini might be useful for identifying extreme architectural or environmental features that trigger strong positive or negative reactions, even if those reactions are exaggerated.
The degrees of freedom (df = 20) are a direct result of the study’s focused design, where 21 distinct data points were generated from 3 highly detailed audits across 7 specific criteria. While the number of scenes is small, the number of evaluated items provides enough granularity for a meaningful exploratory analysis. These statistical outputs suggest that LLMs can accurately shadow human intuition in specific, well-defined contexts, the intent being to demonstrate the feasibility of the methodology rather than to establish a universal technical standard for AI-assisted auditing.
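The relationship between the number of items and the degrees of freedom can be illustrated with a short paired t-statistic computation. This is a sketch with hypothetical normalized ratings and a hypothetical function name; it returns the statistic and its degrees of freedom only, and the p-value would still have to be read from a t distribution (for example with a statistics package).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(human_means, ai_means):
    """Paired-samples t statistic and degrees of freedom.

    Each position is one rated item; the test is run on the per-item
    differences, so df = number of items - 1.
    """
    diffs = [h - a for h, a in zip(human_means, ai_means)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical normalized mean ratings for four items (not study data).
human_means = [0.6, 0.7, 0.5, 0.8]
ai_means = [0.5, 0.7, 0.6, 0.6]

t_stat, df = paired_t(human_means, ai_means)
print(f"t({df}) = {t_stat:.2f}")
```

With 21 items instead of four, the same function yields df = 20, matching the design described above.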

3.6. Comparison with Related Works

Directly benchmarking results across AI-driven visual evaluations of the built environment remains a significant challenge due to the high degree of heterogeneity in datasets and methodologies. Many existing studies utilize pairwise image comparisons as their primary input, which differs fundamentally from our approach. Furthermore, the specific research objectives and the nature of the inquiries—both for human participants and AI models—vary considerably. Additionally, as a pilot investigation, this work prioritizes the fine-grained examination of specific sidewalk environments, so the claims are limited to context-dependent cases. In contrast, more extensive datasets often prioritize volume at the potential expense of the cognitive engagement required for complex panoramic image assessments.
While some studies share thematic similarities with our work, such as Wedyan et al. [11] regarding the hierarchy of walking needs, or Malekzadeh et al. [17] and Zhou et al. [18] regarding urban attractiveness, they operate under different parameters. A critical distinction lies in the image acquisition: most research employing Street View Imagery (SVI) utilizes road-level perspectives captured from vehicles, whereas our data are acquired directly from the sidewalk to better reflect the pedestrian experience. Other work, such as Xiao and Tang [12], focuses on longitudinal changes rather than static quality assessment, evaluating whether the changes made a location worse, better, or left it stable. Despite these divergences, Table 8 provides a comparative summary with which to situate our findings within the current literature as accurately as possible.
By comparison, Wedyan et al. [11] assessed several walkability aspects across eight participant groups, with 6 to 38 participants per group. Their methodology employed a pairwise comparison where participants and ChatGPT rated two images describing different urban environments on a 1–10 scale. To evaluate alignment, the authors matched cases where both the AI and humans identified the same image as the “superior” one in a pair, excluding instances of disagreement from their statistical analysis. For the matched sets, their results showed no statistically significant difference between human and GPT-4o ratings for Image 1 (t(N) = 1.91, p = 0.063) or Image 2 (t(N) = 0.64, p = 0.526), suggesting the model’s ratings were generally consistent with human perception when their qualitative preferences aligned. In the present study, both LLMs demonstrated strong alignment with human consensus (p > 0.05) across a broader range of 21 environmental items, without excluding instances of individual disagreement. Despite the inclusion of these potential outliers, our results similarly demonstrate no significant systematic bias for ChatGPT (t(20) = 1.59, p = 0.13) or Gemini (t(20) = 0.98, p = 0.34). When comparing these results, the t-values provide a metric for the closeness of the models to the human baseline. Wedyan et al.’s highest t-value (t = 1.91) approached the threshold of significance (p = 0.063), indicating a more pronounced, though still non-significant, deviation from human scores. In contrast, our findings for Gemini (t = 0.98) and ChatGPT (t = 1.59) yield lower t-values, suggesting an even closer alignment with the human mean within our experimental framework.
Wedyan et al. compare preferences (which image is better) while we compare consensus (the absolute score of an environment). Both studies agree that GPT models do not exhibit a systematic bias, which is the primary point of comparison. Another key distinction between the two studies lies in the granularity of the evaluation per aspect measured. In Wedyan et al. [11], each walkability aspect was evaluated across eight pairs of images, providing a larger sample of visual stimuli per category. In contrast, our exploratory study evaluated each aspect across three representative images. The larger number of visual samples in Wedyan et al. allows for a more granular capture of AI-human drift across diverse urban contexts, which is reflected in their higher t-value (1.91) approaching significance. Our approach, by focusing on a smaller set of highly detailed audits, prioritizes the depth of consensus for specific environments. While both studies conclude that no systematic bias exists (p > 0.05), the higher p-values in our study (0.13 and 0.34) may be attributed to the stability of the AI’s mean when applied to a more focused set of stimuli.
In another recent study, Malekzadeh et al. [17] compared ChatGPT’s ratings against a human cohort consisting of 13 local residents and 11 non-residents. Participants were tasked with evaluating the visual appeal and functionality of approximately 2000 panoramic street view images on a 1–7 scale. While the study provides a large-scale data set, it is worth noting that participants provided an average of 1014 ratings each. In the context of panoramic images, which require active interaction and significant cognitive load to inspect thoroughly, such an extensive task volume may introduce concerns regarding participant fatigue, potentially affecting the granularity of the human baseline. Another significant concern in the methodology of Malekzadeh et al. [17] lies in the standardization of both human and AI ratings. While the authors justify this post-processing as a means to handle subjectivity and varying evaluator “scales”, this step potentially introduces a fundamental bias into the study. The authors state they standardized ratings but do not specify the mathematical transformation used. Without knowing the formula, it is impossible to determine if the transformation artificially compressed the variance of the GPT-4 responses to match the human distribution. This is particularly problematic given that LLMs are known to have centrist tendencies (clustering around the mean), whereas humans utilize the full scale. By adjusting the data across different evaluators and prompts before performing the t-test, the authors may have inadvertently manufactured the very alignment they were testing for. In addition, their study relied on Google Street View imagery captured from a vehicle’s roof at a height of approximately 2.5 m, which lacks the ecological validity of a pedestrian’s eye-level perspective. By using a vast dataset of vehicle-centric images, their study risks measuring visual appeal from a detached, bird’s-eye view.
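The concern about standardization can be illustrated with a toy example. Because the transformation used in [17] is not specified, the sketch below assumes a per-evaluator z-score purely for illustration; it does not claim to reproduce that study's processing, and the ratings are invented.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one evaluator's ratings to mean 0 and unit spread."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]

# Hypothetical 1-7 ratings (not data from any study).
human = [1, 3, 4, 6, 7, 7]   # uses the full scale
model = [4, 4, 5, 5, 5, 4]   # clusters around the middle

print("raw spread:", pstdev(human), "vs", pstdev(model))

z_human, z_model = z_scores(human), z_scores(model)
print("standardized means:  ", mean(z_human), "vs", mean(z_model))
print("standardized spreads:", pstdev(z_human), "vs", pstdev(z_model))
```

Under this assumed transformation, any difference in mean or spread between the two raters is removed by construction, so a subsequent comparison of means would find none regardless of the raw data.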
Despite these precautions, their results indicated no significant differences between the groups, with p-values of p = 0.33 (t = 0.95) for residents and p = 0.36 (t = 0.89) for non-residents. This led the authors to conclude that GPT-4’s distributions align closely with human perception for broad aesthetic and functional audits. Comparing these findings to our own requires careful consideration of the differences in evaluation aspects, i.e., broad “visual appeal” vs. our seven specific walkability criteria, and sample density. However, the t-statistics offer a useful point of comparison regarding the centering of the AI models: Malekzadeh et al. reported t-values (0.95 and 0.89) that are remarkably similar to our result for Gemini (t = 0.98, p = 0.34). Both studies show that the AI mean is positioned less than one standard error away from the human mean. This suggests that for general urban evaluation, LLMs tend to converge on a “middle-of-the-road” consensus that effectively mimics human averages. Malekzadeh et al. used a significantly larger number of images compared to our three representative locations. In statistics, a larger N typically increases the power to detect even small differences. The fact that their t-values remained low (<1.0) despite such a massive sample size suggests that LLMs do not have a systematic bias even when tested across thousands of varied environments. While the t-values are comparable, the interpretation of the human baseline differs. Because our participants rated only three images, we captured a high-depth consensus with lower risk of fatigue-induced noise. In contrast, the similarity in Malekzadeh et al.’s results, where residents and non-residents provided nearly identical t-scores relative to the AI, suggests that GPT-4 may be capturing a generic aesthetic standard that transcends local residency, a phenomenon we also observed in our models’ alignment with the general human consensus.
In a large-scale assessment of urban aesthetics, Zhou et al. [18] utilized a pairwise comparison methodology to evaluate a subset of 1020 street-view images. Each image was paired with 50 randomly selected counterparts, with human evaluators and ChatGPT asked to identify which image is the more appealing in each instance. This process generated a cumulative relative score for each image, reflecting its aesthetic standing within the specific dataset. An image’s score of visual appeal is then its win rate against 50 other images. Their results demonstrated a strong correlation between GPT-4o and human rankings, yielding an R² of 0.695, suggesting that GPT-4o is proficient at ranking urban beauty. While the pairwise comparison method used is designed to simplify the subjective task of aesthetic judgment, the scale of their implementation introduces a potential limitation regarding evaluator fatigue. Each auditor was tasked with assessing 100 images, with each image paired 50 times, resulting in a total of 5000 evaluations per auditor. Even though selecting the more attractive image in a pair is cognitively less demanding than assigning a precise numerical score to a single image, the repetitive nature of 5000 consecutive judgments raises several concerns: diminishing discriminatory power, vigilance, and engagement, especially in the case of panoramic images, which require some interactivity at each pair comparison.
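The win-rate scoring described above can be sketched as follows. This is an illustrative reading rather than the procedure of Zhou et al.: the function name and the input format of (winner, loser) pairs are assumptions, and every appearance of an image counts toward its denominator.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Map each image to its share of pairwise comparisons won.

    `comparisons` is an iterable of (winner, loser) pairs.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {image: wins[image] / appearances[image] for image in appearances}


# Hypothetical judgments, for illustration only.
judgments = [
    ("img_a", "img_b"),
    ("img_a", "img_c"),
    ("img_c", "img_b"),
    ("img_b", "img_a"),
]
print(win_rates(judgments))

# Workload arithmetic quoted in the text.
per_auditor = 100 * 50    # comparisons per auditor
all_pairs = 1020 * 50     # pairs across the whole subset
print(per_auditor, all_pairs)
```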
Nevertheless, our results complement Zhou et al.’s findings by showing that the model is also capable of replicating the specific magnitude of that beauty, as evidenced by our low MAE and non-significant t-values (p > 0.05). A direct numerical comparison between our results and those of Zhou et al. is limited by the differing nature of the data (relative ranking vs. absolute Likert ratings). Zhou et al. achieve a stable human baseline through the sheer volume of comparisons (51,000 pairs). In contrast, our study achieves stability through a high-depth audit of fewer images. The convergence of both studies, despite these vastly different scales, suggests that GPT-4o’s aesthetic “judgment” is not an artifact of specific dataset sizes but instead a robust reflection of average human preference. By reporting an R² near 0.70, Zhou et al. establish a high benchmark for AI-human correlation in aesthetics. Our findings complement this result by showing that even without the “corrective” power of massive pairwise datasets, the AI’s absolute ratings remain statistically indistinguishable from the human mean (t = 1.59 for ChatGPT).

4. Discussion

In Figure 4, Figure 5 and Figure 6, each “violin” is a distribution plot that shows where the ratings for a given sidewalk are concentrated. Pleasure, comfort, safety, and access are plotted separately coming from humans (yellow), ChatGPT (gray), and Gemini (blue). The higher parts of the violin represent higher Likert scores. A wide “belly” around a score means that many respondents (or many model runs) produced that score. The violin shape gets wider where more responses pile up and narrower where few responses occur. A bulge where the violin gets noticeably wide marks a peak in the distribution. A tall and wide human violin rating usually means that humans disagree and responses are spread across several categories. A very thin AI violin rating typically means high repeatability across different runs. If ChatGPT’s red median sits above the human red median, then ChatGPT is rating that item more positively than humans on average (for that sidewalk). Different heights in the figures for the three sidewalks mean that the group is judging one sidewalk as more pleasant/comfortable/safer/more accessible than another. Modality is just the number of main piles of answers, which show as one, two, or three bulges in a violin diagram. In the unimodal case, most answers cluster around one main score. In the bimodal case, answers split into two main clusters. In the trimodal case, answers cluster into three main piles. One key point in this paper’s results is that humans can show bi- or tri-modality (different people genuinely experience the same sidewalk differently), whereas an LLM’s repeated runs often collapse to unimodality (it keeps giving essentially the same answer). Furthermore, the paper notes that ChatGPT distributions have high consistency, while Gemini is “less confident or more nuanced”, with more outliers.
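The notion of modality used above can be illustrated with a small sketch that counts the bulges in a Likert histogram. The peak rule and the `min_share` threshold (a peak must hold at least 15% of the answers) are assumptions introduced here for illustration; they are not the criterion used to read the violin plots in this paper, and the ratings are hypothetical.

```python
from collections import Counter


def count_modes(ratings, scale=(1, 2, 3, 4, 5), min_share=0.15):
    """Count the main peaks (bulges) in a Likert rating distribution."""
    counts = Counter(ratings)
    heights = [counts[score] for score in scale]
    # Collapse plateaus so that a flat-topped bulge is counted once.
    collapsed = [h for i, h in enumerate(heights) if i == 0 or h != heights[i - 1]]
    padded = [-1] + collapsed + [-1]
    threshold = min_share * len(ratings)
    # A peak is strictly higher than both neighbours and large enough to matter.
    return sum(
        1
        for left, mid, right in zip(padded, padded[1:], padded[2:])
        if mid > left and mid > right and mid >= threshold
    )


# Hypothetical ratings, for illustration only.
split_humans = [1, 1, 1, 2, 3, 4, 5, 5, 5, 5]   # answers pile up at 1 and at 5
repeatable_ai = [3, 3, 3, 3, 3, 3, 3, 3, 3, 4]  # runs collapse onto one score

print(count_modes(split_humans), count_modes(repeatable_ai))
```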
Figure 7 illustrates the relative errors, computed as AI median minus human median: positive values mean that the AI rated higher than humans, and negative values mean that it rated lower. Most boxes for both models fall below the 0 line (the red dashed human median). This suggests that both AIs tend to rate these statements more conservatively (lower) than the human sample. Gemini exhibits a much higher spread, especially for aspects E1 and G3; Gemini’s boxes are significantly taller and reach lower deviations (down to −3) compared to ChatGPT’s more stable, narrower boxes. The gray shaded area ([−0.5, 0.5]) represents the zone of high human alignment. ChatGPT sits comfortably in this zone or exactly on the 0 line more frequently (e.g., D1, E1, G1, A2, B2, D2, F2, G2, D3, and G3). Significant disagreement occurs in aspects B1, C1, E1, and G3, where Gemini is much more critical than both humans and ChatGPT.
To quantify the performance of each model relative to the human baseline, the mean absolute error (MAE) was calculated by comparing the median of the 10 AI iterations against the human median for each of the 21 items. Table A4 reports the MAE per sidewalk, showing how the context of the statements affected the AI’s accuracy. Sidewalk 2 yielded the highest alignment for both models, suggesting the phrasing in this section was the least controversial for the AI. Gemini struggled significantly with Sidewalk 1, while ChatGPT saw its largest error margin in Sidewalk 3.
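The MAE computation described in this paragraph can be sketched as below. The function follows the textual description only (median of the AI runs against the human median, averaged over items); the item labels and ratings are hypothetical, and the sketch is not claimed to reproduce the values in Tables A4 and A5.

```python
from statistics import median


def median_mae(human_ratings, ai_runs):
    """Mean absolute error between AI and human medians, averaged over items.

    Both arguments map an item label (e.g. "A1") to a list of ratings.
    """
    errors = [
        abs(median(ai_runs[item]) - median(human_ratings[item]))
        for item in human_ratings
    ]
    return sum(errors) / len(errors)


# Hypothetical ratings for two items, for illustration only.
humans = {"A1": [3, 3, 4, 2, 3], "B1": [4, 4, 5, 3, 4]}
model = {"A1": [4, 4, 3, 4, 4], "B1": [3, 3, 3, 4, 3]}

print(median_mae(humans, model))
```

Restricting the dictionaries to the seven items of one sidewalk, or to the three items of one aspect, would give per-sidewalk and per-aspect values in the same way.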
Table A5 reports the MAE across the seven evaluated aspects. ChatGPT demonstrates superior alignment with the human baseline, achieving a lower MAE in five categories and reaching its peak precision in Aspect D (MAE = 0.10). In contrast, Gemini exhibits more specialized strengths, outperforming ChatGPT in Aspect A (MAE = 0.30) and Aspect F, though it struggles with higher volatility and error margins reaching 1.10 in Aspect C. Overall, while both models find Aspect C challenging, ChatGPT’s narrower range of error suggests a more consistent cross-aspect performance, whereas Gemini’s accuracy is more sensitive to the specific nature of the statement being assessed.
To discuss the results in light of research questions RQ1 and RQ2, our analysis is based on the alignment of the human median ratings with the median computed from the two AI programs. We classify the alignment into three categories according to the difference Δ in medians: a median AI rating within ±0.5 of the human rating is labeled as correct alignment; a difference of ±1 is labeled as fair alignment; and a difference of ±2 or more as incorrect alignment. Table 9 summarizes the results obtained on the three sidewalks for the seven statements as evaluated. The use of specific thresholds for the difference between human and AI medians, i.e., ±0.5, ±1, and ±2, is grounded in the resolution of the 5-point Likert scale and the statistical nature of central tendency in ordinal data. A difference of ±0.5 is defined as “Correct” alignment, as it represents the smallest possible step between an integer rating and a halfway point, indicating that the AI is effectively within the same perceptual category as the human majority. A difference of ±1 is classified as “Fair” alignment; while this represents a shift of one full Likert point (e.g., from “Agree” to “Neutral”), it remains within the same general half of the spectrum and often reflects the common variance found among human raters themselves. Finally, a difference of ±2 or more is categorized as “Incorrect”, as it signifies a fundamental shift in perception (moving from agreement to disagreement or from a strong sentiment to neutrality), thereby failing to capture the human contextual intent.
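The three-way labeling can be written as a short function. This is a minimal sketch of the thresholds stated above; the function name is hypothetical, and assigning an intermediate difference of ±1.5 to the fair category is an assumption, since the text does not define that case. The Δ values used in the example are those listed in Table 3 for Sidewalk 1, statements A to D.

```python
def classify_alignment(delta):
    """Label a median difference (AI minus human) using the thresholds in the text."""
    magnitude = abs(delta)
    if magnitude <= 0.5:
        return "correct"
    if magnitude < 2:
        # Covers the +/-1 case; treating 1.5 as "fair" is an assumption.
        return "fair"
    return "incorrect"


# Delta values for Sidewalk 1, statements A-D, as listed in Table 3.
chatgpt_deltas = {"A": 1, "B": -1, "C": -1, "D": 0}
gemini_deltas = {"A": 0, "B": -2, "C": -2, "D": 0}

for name, deltas in (("ChatGPT", chatgpt_deltas), ("Gemini", gemini_deltas)):
    labels = {statement: classify_alignment(d) for statement, d in deltas.items()}
    print(name, labels)
```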
Globally, ChatGPT showed a correct alignment with human ratings in 12 cases out of the 21 proposed statements and a fair alignment in the other 9 cases. Gemini was correct in 11 cases, fairly aligned in 6 cases, and incorrectly aligned in 4 cases out of 21. Overall, ChatGPT shows consistently moderate to strong agreement with human ratings, achieving either correct or fair alignment with no instances of incorrect alignment. Gemini demonstrates a more variable performance, less consistent due to the presence of misalignments across the evaluated statements. Sidewalk 2 was the best-modeled streetscape evaluated, with both LLMs showing correct alignment with human ratings for nearly all the criteria investigated. ChatGPT performed similarly on Sidewalk 1 and Sidewalk 3 with fair to correct alignments, whereas Gemini’s performance was notably poor on Sidewalk 1, with approximately half of the statements misjudged. Model performance is sensitive to the specific urban context: Sidewalk 2 seems to represent a well-defined case for pedestrian perception modeling. Both models can perform well under favorable conditions; however, ChatGPT generalizes more reliably across varying urban settings, whereas Gemini’s performance appears to be more context dependent. Regarding the experiential criteria investigated, ChatGPT demonstrated stronger coverage by perfectly modeling three out of seven factors: accessibility, overall friendliness, and the impact of the distant view on well-being (statements D, E, and G). Gemini was always in agreement only regarding the pleasurability factor (statement A).
The bottom line is that this focused experimentation establishes a methodological foundation for the following observations:
  • RQ1: AI is able to substitute for human subjective perception of a walking environment, to an extent ranging from Fair to Correct. AI can function as a reliable proxy for human perception when evaluating a visual stimulus based on subjective criteria but still requires human supervision.
  • RQ2: Across statements requiring a mix of objective visual analysis and simulated aesthetic and emotional perception, ChatGPT demonstrates a higher degree of correlation with the median subjective ratings of human participants. On average, ChatGPT 5.1 performs better than Gemini 2.5 Pro.
  • Indeed, both LLMs aligned well with human ratings in some environments but not uniformly across all streetscapes: Gemini showed weaknesses in more challenging or ambiguous environments, whereas ChatGPT demonstrated greater consistency across different sidewalks. Ultimately, while these findings are limited to the three illustrative cases, they demonstrate the feasibility of the methodology and provide an exploratory starting point for future research aimed at scaling AI-assisted sidewalk auditing.

5. Limitations of the Work

An important limitation of this research concerns the human corpus used, which is an academic population. The study is not an exploration of the university community, but it is common in research to rely on academic participants for reasons of feasibility: large-scale public surveys are costly, while students and staff are the most accessible population within a university setting. As for the occupation of the participants, the main inclusion criterion is self-selection to take part in the study. This is relevant for two reasons: professional practices and academic norms. Professional audit practices usually involve fewer than 20 people in public space auditing, recruited through self-selection, often adults engaged in community development and generally over 30 years old. Similarly, many academic studies draw conclusions on this type of experiment with a limited number of participants sharing similar occupations as in our study: student (bachelor’s/licence to master’s degree) or professor/researcher (including PhD students).
As for the age representation, our corpus is mainly composed of people between 20 and 60 years old. The older population (>70 years old) is a particularly interesting group to survey, as their walking needs are often not adequately met in today’s cities. However, they are also a difficult population to reach. We believe they deserve a separate dedicated study, as do people with disabilities. This outreach is beyond the scope of the present study. A more socio-demographically diverse sample might lead to different results: this study is exploratory and should therefore be interpreted with caution.
The number of case studies is relatively small, as only three sidewalks are investigated. It is a deliberate choice, one that allows for an in-depth, qualitative examination of how AI and human assessments compare to each other. Our goal was not to build a large and representative corpus but to explore the mechanisms and nuances of agreement and disagreement in specific, well-defined situations. Expanding the number of cases would have required a different study design and would have reduced the level of detail we could provide for each example. This is a preliminary study based on three illustrative cases, so broader claims would require a more extensive dataset.
Another factor that was not included in this study is the time of week or the time of day: it significantly alters the walkability of the spaces studied, but we kept to the most common street view imagery, which is generally daytime imagery. Weather conditions (rain, sun, and snow) and the seasons also have a strong impact, a phenomenon that is exacerbated by climate change. Other potential confounding variables, such as lighting and shooting angles, were not controlled. Those factors may affect the evaluation; future research could enhance the rigor of experimental control through standardized image processing.
Finally, we decided that an ablation study using the present data would not add any significant insight to our results. In the context of LLM evaluation, temperature settings influence the stochasticity of responses, and prompt engineering can significantly alter model performance. While we maintained fixed parameters to simulate a standardized professional use case, we acknowledge that from an AI-methodological perspective, the lack of an ablation study limits the generalizability of the findings. Our decision follows from the relatively small number of human data points, but more importantly, an ablation study would deviate from the present focus on comparing the AI to the human results. This remains a critical area for future research, where the sensitivity of urban-perceptual outputs to specific hyperparameter tuning should be systematically mapped.
Ideally, a technical sensitivity analysis could increase the number of iterations of AI (20 runs, 50 runs, and 100 runs), run three temperatures (0.0, 0.5, and 1.0) on all sidewalks, and try two or three alternative prompt wordings or different prompt structures [17], and present the results for each alternative to show what affects the performance.
We have not implemented them as new experiments for multiple reasons. We wanted to preserve the scope of the study, as stated by the two research questions RQ1 and RQ2, for a limited exploratory study: the core objective of this research is an empirical comparison of AI and human urban perception. Our current protocol (10 iterations at a fixed temperature) was established to provide a stable baseline for reliability. The signed deviation charts and the IQR analysis demonstrate that this 10-run distribution already provides a clear picture of models’ stability for a pilot study. As a pilot investigation, this work is only designed to test the feasibility of the methodology: the lack of sensitivity testing is a deliberate limitation, a necessary boundary to maintain a strict proportionality between the experimental scale and the resulting claims.
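For reference, the size of the sensitivity analysis outlined above (which was not run in this study) can be enumerated with a short sketch. The prompt labels are hypothetical placeholders, three prompt variants are assumed where the text says "two or three", and one model call is assumed to rate all seven statements for one image.

```python
from itertools import product

run_counts = [20, 50, 100]
temperatures = [0.0, 0.5, 1.0]
prompt_variants = ["prompt_v1", "prompt_v2", "prompt_v3"]  # hypothetical labels
sidewalks = ["Sidewalk 1", "Sidewalk 2", "Sidewalk 3"]

# One configuration per combination of settings.
grid = [
    {"runs": n, "temperature": t, "prompt": p, "sidewalk": s}
    for n, t, p, s in product(run_counts, temperatures, prompt_variants, sidewalks)
]

# Total model calls per LLM under the one-call-per-image assumption.
total_calls = sum(config["runs"] for config in grid)

print(len(grid))     # 81 configurations
print(total_calls)   # 4590 calls per model
```

Under these assumptions, the grid is roughly 150 times larger per model than the current protocol of 10 runs on each of three images, which illustrates why it falls outside the scope of a pilot study.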

6. Conclusions and Perspectives

This study compares how two multimodal AI models and human participants evaluate subjective qualities of urban environments using identical image-based rating tasks. By collecting Likert-scale judgments about walkability needs, friendliness, and well-being across diverse sidewalk scenes, we established a human baseline and measured the degree to which LLM-generated perceptions align or not with human score distributions. This is a pilot study based on a small-scale dataset, so the statements implying that AI can substitute for human auditors have to be moderated. The results are exploratory and context-dependent: AI’s opinion changes based on the specific data and prompt it is given, and its accuracy depends on the diversity of the content of the image. This is a preliminary study based on three illustrative cases: broader generalizations would necessitate a more extensive dataset to confirm these patterns across a wider variety of urban contexts.
The findings suggest that multimodal models may serve as consistent representatives for collective human sentiment when assessing visual urban scenes. The results point to a practical way of making streets more pleasant for pedestrians without relying upon large, expensive surveys every time. With the correct prompting, multimodal LLMs can come close to the typical (median) human judgment across multiple factors for walkability and well-being. LLM results are instantly repeatable; hence, “pedestrian comfort” can be estimated quickly, while at the same time alternatives are compared in real time. In other words, if a proposed change (more tree buffer, less clutter, better separation from traffic, fewer obstacles, and more interesting destination cues) consistently pushes the AI’s scores in the same direction as human preferences, that change is likely to increase the pleasurable walking experience.
More importantly, this paper suggests a design feedback loop that can directly improve the streetscape: generate multiple visual variations of a street (simulated alternatives), score them with the same human-centered prompts, and iterate until the design converges towards higher pedestrian pleasure. This “variation and selection” can run fast enough to fit inside a normal project timeline. Used in this way, the method proposed here does not just rate existing sidewalks; it becomes a quality control and optimization tool. Implementing this design approach will help to ensure that new proposals can evolve virtually in the direction of what most people actually feel is welcoming and enjoyable at walking speed.
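The “variation and selection” loop can be sketched as a simple greedy search. Everything in this example is hypothetical: `propose_variations` and `score` stand in for an image-generation step and an LLM scoring step that are not specified in this paper, and the toy stand-ins below replace them with plain numbers so that the sketch runs on its own.

```python
def refine_design(initial, propose_variations, score, max_rounds=10, min_gain=0.05):
    """Greedy variation-and-selection loop: keep the best-scoring variant each round."""
    best, best_score = initial, score(initial)
    for _ in range(max_rounds):
        candidates = propose_variations(best)
        if not candidates:
            break
        challenger = max(candidates, key=score)
        challenger_score = score(challenger)
        if challenger_score - best_score < min_gain:
            break  # converged: no variant improves the score enough
        best, best_score = challenger, challenger_score
    return best, best_score


# Toy stand-ins: a "design" is just a tree-buffer width in metres, and the
# scorer is a made-up function peaking at 2.0 m (not a real model call).
def toy_variations(width):
    return [round(width + step, 1) for step in (-0.5, 0.5)]


def toy_score(width):
    return 5.0 - abs(width - 2.0)


print(refine_design(0.5, toy_variations, toy_score))
```

In practice the scoring step would average several runs of the same prompt per variant, as in the rating protocol used here, before variants are compared.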
One novel feature of the present analysis is to apply recent, non-standard evaluators of pedestrian urban space. Departing from an industrial planning model that focuses on efficient structure and traffic flow, we instead asked specifically whether a standing spot on a pavement was perceived as inviting to stay in and linger. This quality is found in traditional parts of pedestrian urban fabric but no longer in built environments dating from the end of World War II, except for neotraditionalist and new urbanist developments today. Asking generative AI to estimate this essential human quality—which it readily and accurately did—comes as a welcome tool.
Another notable feature is to respect human emotional attraction above and beyond strict mechanical movement. That is, we seek to discover whether a person is drawn to walk along a sidewalk based on unconscious criteria: not because they have to in order to reach a destination, but because the visual prospect is attractive enough to explore. This is a primary force behind a tourist’s random pedestrian exploration for pleasure but a neglected factor in everyday movements and transactions. Gaining ambulatory pleasure from urban design contributes to long-term health. These two factors underlie the pedestrian urban experience, a missing basis for urban design because they were hitherto difficult to measure. Also, human responses to urban visual settings were thought to be entirely subjective, while our results reveal consistent objective effects.
This study focuses on walkability, environmental friendliness, and well-being, but many other experiential qualities can be inferred from urban photographs. Future studies could examine perceptions such as vibrancy, enclosure, or organized complexity using a similar comparative framework. Another perspective of this work is concerned with how human perceptions can vary depending on culture, age, and lived experience. By collecting ratings from diverse demographic groups, researchers can evaluate whether AI aligns more closely with certain populations than others. Future research can also compare different model architectures, training regimes, and prompting strategies to measure how much accuracy improves over time. For instance, LLM-as-a-judge seems to be a relevant technique to improve AI performance on assessing experiential qualities across diverse visual environments by checking one LLM against another.

Author Contributions

Conceptualization, R.B. and N.A.S.; methodology, R.B. and N.A.S.; software, R.B.; validation, R.B.; formal analysis, R.B. and N.A.S.; investigation, R.B.; resources, R.B.; data curation, R.B.; writing—original draft preparation, R.B. and N.A.S.; writing—review and editing, R.B. and N.A.S.; visualization, R.B.; supervision, R.B.; project administration, R.B.; funding acquisition, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received support under the program “France 2030” launched by the French Government and implemented by ANR, with the reference ANR-21-EXES-0007.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to legal regulations: Article L1121-1, Code de la santé publique (public health code defining the three categories of research involving humans), https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000046125746 (accessed on 24 November 2025).

Informed Consent Statement

Informed consent for publication was obtained from all human participants.

Data Availability Statement

Image data are available upon request from the authors.

Acknowledgments

The authors would like to thank Irène Sitohang for giving permission to use her pictures of Ténérife.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Summary statistics of rating for aspects rated on a scale of 1–5.
Group | Mean | Std | Min | 25% | 50% | 75% | Max
Humans | 3.50 | 1.06 | 1 | 3 | 4 | 4 | 5
ChatGPT | 3.34 | 0.62 | 2 | 3 | 3 | 4 | 5
Gemini | 3.36 | 1.08 | 1 | 2 | 3 | 4 | 5
Table A2. Summary statistics of rating for overall scores rated on a scale of 1–10.
Group | Mean | Std | Min | 25% | 50% | 75% | Max
Humans | 6.95 | 1.61 | 1 | 6 | 7 | 8 | 10
ChatGPT | 6.66 | 0.61 | 6 | 6 | 7 | 7 | 8
Gemini | 6.33 | 1.67 | 3 | 5 | 7 | 7.75 | 9
Table A3. Mean ratings per aspect: Human vs. ChatGPT vs. Gemini.
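Image | Statement | Human | ChatGPT | Gemini
Sidewalk 1 | A: Pleasurability | 3.1 | 3.6 | 2.9
Sidewalk 1 | B: Comfort | 3.8 | 3.0 | 2.5
Sidewalk 1 | C: Safety | 3.8 | 3.4 | 2.0
Sidewalk 1 | D: Accessibility | 4.0 | 4.0 | 3.8
Sidewalk 1 | E: Friendliness (1–10) | 6.8 | 7.0 | 4.8
Sidewalk 1 | F: Place | 2.7 | 3.0 | 2.8
Sidewalk 1 | G: Distance | 2.9 | 3.3 | 3.0
Sidewalk 2 | A: Pleasurability | 3.2 | 3.0 | 3.1
Sidewalk 2 | B: Comfort | 3.4 | 3.1 | 3.3
Sidewalk 2 | C: Safety | 3.5 | 3.4 | 3.2
Sidewalk 2 | D: Accessibility | 3.5 | 4.1 | 4.5
Sidewalk 2 | E: Friendliness (1–10) | 6.7 | 6.4 | 6.2
Sidewalk 2 | F: Place | 3.4 | 3.7 | 3.9
Sidewalk 2 | G: Distance | 3.4 | 3.3 | 3.9
Sidewalk 3 | A: Pleasurability | 3.6 | 3.2 | 3.7
Sidewalk 3 | B: Comfort | 3.8 | 3.4 | 4.8
Sidewalk 3 | C: Safety | 4.0 | 3.2 | 4.1
Sidewalk 3 | D: Accessibility | 4.0 | 4.2 | 5.0
Sidewalk 3 | E: Friendliness (1–10) | 7.3 | 6.6 | 8.0
Sidewalk 3 | F: Place | 3.6 | 2.6 | 3.1
Sidewalk 3 | G: Distance | 3.5 | 3.5 | 2.4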
Table A4. Mean Absolute Error (MAE) by sidewalk context.
Sidewalk | ChatGPT MAE | Gemini MAE
Sidewalk 1 | 0.56 | 1.20
Sidewalk 2 | 0.37 | 0.57
Sidewalk 3 | 0.81 | 0.78
Table A5. Mean Absolute Error by statement (A–G).
Aspect | ChatGPT MAE | Gemini MAE | Best Model
A: Pleasurability | 0.47 | 0.30 | Gemini
B: Comfort | 0.70 | 1.00 | ChatGPT
C: Safety | 1.00 | 1.10 | ChatGPT
D: Accessibility | 0.10 | 0.83 | ChatGPT
E: Friendliness (1–10) | 0.80 | 1.13 | ChatGPT
F: Place | 0.87 | 0.77 | Gemini
G: Distance | 0.37 | 0.83 | ChatGPT

References

  1. Alexander, C.; Ishikawa, S.; Silverstein, M.; Jacobson, M.; Fiksdahl-King, I.; Angel, S. A Pattern Language: Towns, Buildings, Construction; Oxford University Press: Oxford, UK, 1977.
  2. Alexander, C. The Nature of Order, Book 1: The Phenomenon of Life: An Essay on the Art of Building and the Nature of the Universe; The Centre for Environmental Structure: Berkeley, CA, USA, 2002.
  3. Wang, S.; Sanches de Oliveira, G.; Djebbara, Z.; Gramann, K. The Embodiment of Architectural Experience: A Methodological Perspective on Neuro-Architecture. Front. Hum. Neurosci. 2022, 16, 833528.
  4. Karakas, T.; Yildiz, D. Exploring the influence of the built environment on human experience through a neuroscience approach: A systematic review. Front. Archit. Res. 2020, 9, 236–247.
  5. Ghamari, H.; Golshany, N.; Naghibi Rad, P.; Behzadi, F. Neuroarchitecture Assessment: An Overview and Bibliometric Analysis. Eur. J. Investig. Health Psychol. Educ. 2021, 11, 1362–1387.
  6. Higuera-Trujillo, J.L.; Llinares, C.; Macagno, E. The Cognitive-Emotional Design and Study of Architectural Space: A Scoping Review of Neuroarchitecture and Its Precursor Approaches. Sensors 2021, 21, 2193.
  7. Seresinhe, C.I.; Preis, T.; Moat, H.S. Using deep learning to quantify the beauty of outdoor places. R. Soc. Open Sci. 2017, 4, 170170.
  8. Zhang, J.; Li, Y.; Fukuda, T.; Wang, B. Urban safety perception assessments via integrating multimodal large language models with street view images. Cities 2025, 165, 106122.
  9. Tang, Y.; Qu, A.; Yu, X.; Deng, W.; Ma, J.; Zhao, J.; Sun, L. From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models. arXiv 2025, arXiv:2506.02242.
  10. Cheng, Y.; Yin, Z.; Li, D.; Li, Z. Assessing urban safety: A digital twin approach using streetview and large language models. In Proceedings of the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), Washington, DC, USA, 7–10 October 2024; pp. 1–5.
  11. Wedyan, M.; Yeh, Y.C.; Saeidi-Rizi, F.; Peng, T.Q.; Chang, C.Y. Urban walkability through different lenses: A comparative study of GPT-4o and human perceptions. PLoS ONE 2025, 20, e0322078.
  12. Xiao, Y.; Tang, Y. Can ChatGPT-4o assess the perceptions of streetscape change? Evidence from Shanghai, China. Sustain. Cities Soc. 2025, 130, 106674.
  13. Liang, H.; Zhang, J.; Li, Y.; Wang, B.; Huang, J. Automatic Estimation for Visual Quality Changes of Street Space Via Street-View Images and Multimodal Large Language Models. IEEE Access 2024, 12, 87713–87727.
  14. Zhu, Y.; Duan, H.; Min, X.; Zhai, G.; Le Callet, P. Exploring The Potential of Vision-Language Models for Pure-Image and Text-Guided-Image Saliency Prediction. In Proceedings of the 2025 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 14–17 September 2025; pp. 1780–1785.
  15. Zhang, Y.; Xiao, Y.; Zhang, Y.; Zhang, T. Video saliency prediction via single feature enhancement and temporal recurrence. Eng. Appl. Artif. Intell. 2025, 160, 111840.
  16. Zhang, Y.; Wang, T.; Xue, L.; Lian, W.; Tao, R. ORSI Salient Object Detection via Progressive Interaction and Saliency-Guided Enhancement. IEEE Geosci. Remote Sens. Lett. 2026, 23, 6002105.
  17. Malekzadeh, M.; Willberg, E.; Torkko, J.; Toivonen, T. Urban attractiveness according to ChatGPT: Contrasting AI and human insights. Comput. Environ. Urban Syst. 2025, 117, 102243.
  18. Zhou, Q.; Zhang, J.; Zhu, Z. Evaluating urban visual attractiveness perception using multimodal large language model and street view images. Buildings 2025, 15, 2970.
  19. Laurini, R. Promises of Artificial Intelligence for Urban and Regional Planning and Policymaking. In Knowledge Management for Regional Policymaking; Laurini, R., Nijkamp, P., Kourtit, K., Bouzouina, L., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 3–20.
  20. Mouratidis, K. Urban planning and quality of life: A review of pathways linking the built environment to subjective well-being. Cities 2021, 115, 103229.
  21. Alfonzo, M.A. To walk or not to walk? The hierarchy of walking needs. Environ. Behav. 2005, 37, 808–836.
  22. Lindelöw, D.; Svensson, Å.; Sternudd, C.; Johansson, M. What limits the pedestrian? Exploring perceptions of walking in the built environment and in the context of every-day life. J. Transp. Health 2014, 1, 223–231.
Figure 1. (a) Localization of the three cases studied (cartography: OpenStreetMap). Photos used as stimulus: (b) Calle San Martin, (c) Calle Serís, (d) Calle Méndez Núñez. (Credits: Irène Sitohang).
Figure 2. Comparative analysis of human participants and conversational AI agents in urban scene audits: overview of the experimental protocol.
Figure 3. Human sample: (a) occupation and (b) age.
Figure 4. Violin plot showing the distribution of walking needs (pleasure, comfort, safety, and access) for the three sidewalks. The median score is represented as a red segment for the human corpus (yellow), ChatGPT (gray), and Gemini (blue).
Figure 5. Environment friendliness for a walker: human results versus AI output for each sidewalk.
Figure 6. Environment influence: place (a) and distance (b). Human results versus AI output for each sidewalk.
Figure 7. Deviation from human consensus: bar charts comparing human versus AI ratings for each aspect A to G and for Sidewalk 1 to 3. Red and blue lines indicate the nominal value of the relative errors ΔMedian (AI − Human) for ChatGPT and Gemini, respectively.
Table 1. Audited aspects of each urban scene: walking needs, overall walkability, and environment influence on well-being.
Aspect | Selected Statement
Pleasurability | The environment along the route is beautiful and attractive
Comfort | The route feels planned for me as a pedestrian
Safety | I do not worry about the traffic when I walk along this route
Accessibility | It is a practical path to walk
Friendliness | Give an overall pedestrian-friendly score (e.g., out of 10) for this street
Place | Visual information makes it a pleasure to have to wait here for someone
Distance | The distant view is interesting and draws me to walk in that direction
Table 2. Target groups and evaluation of each urban scene. The auditing gauge is the 5-point Likert scale (1 = strongly disagree to 5 = strongly agree) and the seven statements.
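| Rater Group | Sample Size | Instruction |
|---|---|---|
| Human participants | N_p = 68 people | Each person rates the image once for all seven statements |
| ChatGPT | N_it = 10 runs | The exact same prompt is run multiple times to account for non-determinism |
| Gemini | N_it = 10 runs | The exact same prompt is run multiple times to account for non-determinism |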
Table 3. Hierarchy of walking needs results. Median ratings are reported for the human corpus. Value Δ expresses the relative difference between AI rating and human rating.
| Walking Need Statement | Humans Rating | Δ ChatGPT | Δ Gemini |
|---|---|---|---|
| Sidewalk 1 | | | |
| (A) Environment is beautiful/attractive | 3 | +1 | 0 |
| (B) Route feels planned for pedestrians | 4 | −1 | −2 |
| (C) I do not worry about traffic | 4 | −1 | −2 |
| (D) It is a practical path to walk | 4 | 0 | 0 |
| Sidewalk 2 | | | |
| (A) Environment is beautiful/attractive | 3 | 0 | 0 |
| (B) Route feels planned for pedestrians | 3.5 | −0.5 | 0 |
| (C) I do not worry about traffic | 4 | −1 | −0.5 |
| (D) It is a practical path to walk | 4 | 0 | +1 |
| Sidewalk 3 | | | |
| (A) Environment is beautiful/attractive | 4 | −1 | 0 |
| (B) Route feels planned for pedestrians | 4 | −1 | +1 |
| (C) I do not worry about traffic | 4 | −1 | 0 |
| (D) It is a practical path to walk | 4 | 0 | +1 |
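As a minimal sketch of how a Δ value of the kind reported in Table 3 could be derived, the snippet below subtracts the median of the human ratings from the median of a model's repeated runs for one statement. The function name `median_delta` and the sample ratings are hypothetical and are not the study's data; the sign convention (AI minus human) is inferred from the table, where a negative Δ means the model rated lower than the human consensus.

```python
from statistics import median

def median_delta(ai_runs, human_ratings):
    """Return median(AI runs) - median(human ratings) for one statement."""
    return median(ai_runs) - median(human_ratings)

# Made-up Likert ratings (1-5) for illustration only, not data from the study.
human_ratings = [4, 4, 3, 5, 4]   # one rating per participant
ai_runs = [3, 3, 3, 4, 3]         # one rating per repeated prompt

delta = median_delta(ai_runs, human_ratings)
print(delta)
```

With an even number of ratings, `statistics.median` returns the mean of the two middle values, which is one way half-point medians such as 3.5 can arise.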
Table 4. Overall pedestrian-friendliness scores by corpus and by street.
| Street | Humans | ChatGPT | Gemini |
|---|---|---|---|
| Sidewalk 1 | 7 | 7 | 5 |
| Sidewalk 2 | 7 | 6 | 6.5 |
| Sidewalk 3 | 8 | 7 | 8 |
Table 5. Place and distance influencing well-being: median score by each group of raters.
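| Statement / Sidewalk | Humans | ChatGPT | Gemini |
|---|---|---|---|
| Visual information makes waiting a pleasure | | | |
| Sidewalk 1 | 2 | 3 | 3 |
| Sidewalk 2 | 3 | 3 | 2 |
| Sidewalk 3 | 4 | 3 | 3 |
| Distant view is interesting and draws me | | | |
| Sidewalk 1 | 3 | 3 | 3 |
| Sidewalk 2 | 4 | 4 | 4 |
| Sidewalk 3 | 4 | 3.5 | 2 |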
Table 6. Inter-rater reliability between human consensus and LLMs across audited aspects.
| Model | ICC(2,1) | 95% CI | p-Value | Agreement Level |
|---|---|---|---|---|
| Human vs. ChatGPT | 0.87 | [0.67, 0.95] | <0.001 | Good to Excellent |
| Human vs. Gemini | 0.76 | [0.51, 0.9] | <0.001 | Good |
Table 7. Statistical comparison of LLMs' and participants' rating distributions.
| Metric | Human vs. ChatGPT | Human vs. Gemini |
|---|---|---|
| Spearman ρ | 0.614 (p = 0.003) | 0.628 (p = 0.002) |
| Krippendorff’s α | 0.864 | 0.756 |
| Weighted Cohen’s quadratic κ | 0.782 | 0.739 |
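To illustrate one of the agreement metrics in Table 7, the following is a minimal pure-Python sketch of Cohen's κ with quadratic weights for two raters. It assumes integer ratings on a 1 to 5 scale; how the study handled half-point medians or the 10-point friendliness score is not stated here, so this is not a reproduction of the reported values. The function name and the sample ratings are hypothetical.

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=5):
    """Cohen's kappa with quadratic weights for two equal-length integer rating lists."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    span = max_rating - min_rating
    observed = Counter(zip(a, b))           # joint counts of (rating_a, rating_b)
    count_a, count_b = Counter(a), Counter(b)
    num = 0.0   # weighted observed disagreement
    den = 0.0   # weighted disagreement expected by chance
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            w = ((i - j) / span) ** 2       # quadratic penalty, 0 on the diagonal
            num += w * observed[(i, j)] / n
            den += w * (count_a[i] / n) * (count_b[j] / n)
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - num / den

# Made-up ratings for illustration only, not data from the study.
human = [3, 4, 4, 4, 2]
model = [4, 3, 3, 4, 3]
print(quadratic_weighted_kappa(human, model))
```

The quadratic weights penalize a two-point disagreement four times as much as a one-point disagreement, which suits ordinal Likert data where near misses matter less than large gaps.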
Table 8. Comparison of AI-based built environment visual evaluations.
| Study | Primary Aspect | Input Source | Methodology | Key Findings |
|---|---|---|---|---|
| Wedyan et al. [11] | Hierarchy of walking needs | Sidewalk photos | Pairwise comparison and Likert-scale rating | AI-human alignment on safety/comfort |
| Malekzadeh et al. [17] | Visual appeal | SVI (Google Street) | Still image, normalized Likert-scale rating | High correlation with human ratings |
| Zhou et al. [18] | Attractiveness | SVI (Baidu Map) | Multiple pairwise comparison | High correlation with human ratings |
| Xiao & Tang [12] | Temporal change detection | Multi-year SVI (Baidu Map) | Pairwise categorical classification | Accuracy in detecting urban decay/growth |
| Current study | Walkability, friendliness, health | Sidewalk photos | Still image, Likert-scale rating, comparison of LLMs | Fair to correct alignment, ChatGPT > Gemini |
Table 9. Alignment of AI median rating to human perception. Three alignment levels are possible: correct (|Δ| ≤ 0.5), fair (|Δ| = 1), and incorrect (|Δ| = 2).
| Urban Environment Statement | Humans Rating | Alignment ChatGPT | Alignment Gemini |
|---|---|---|---|
| Sidewalk 1 | | | |
| (A) Environment is beautiful/attractive | 3 | Fair | Correct |
| (B) Route feels planned for pedestrians | 4 | Fair | Incorrect |
| (C) I do not worry about traffic | 4 | Fair | Incorrect |
| (D) It is a practical path to walk | 4 | Correct | Correct |
| (E) Overall friendliness | 7 | Correct | Incorrect |
| (F) Visual information makes waiting a pleasure | 2 | Fair | Fair |
| (G) Distant view is interesting and draws me | 3 | Correct | Correct |
| Sidewalk 2 | | | |
| (A) Environment is beautiful/attractive | 3 | Correct | Correct |
| (B) Route feels planned for pedestrians | 3.5 | Correct | Correct |
| (C) I do not worry about traffic | 4 | Fair | Correct |
| (D) It is a practical path to walk | 4 | Correct | Fair |
| (E) Overall friendliness | 7 | Correct | Correct |
| (F) Visual information makes waiting a pleasure | 3 | Correct | Fair |
| (G) Distant view is interesting and draws me | 4 | Correct | Correct |
| Sidewalk 3 | | | |
| (A) Environment is beautiful/attractive | 4 | Fair | Correct |
| (B) Route feels planned for pedestrians | 4 | Fair | Fair |
| (C) I do not worry about traffic | 4 | Fair | Correct |
| (D) It is a practical path to walk | 4 | Correct | Fair |
| (E) Overall friendliness | 8 | Correct | Correct |
| (F) Visual information makes waiting a pleasure | 4 | Fair | Fair |
| (G) Distant view is interesting and draws me | 4 | Correct | Incorrect |
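The alignment rule in the caption of Table 9 can be sketched as a small classifier. The function name `alignment_level` is hypothetical, and the handling of values between the listed thresholds (for example, treating any |Δ| above 1 as incorrect) is an assumption, since the caption only names |Δ| ≤ 0.5, |Δ| = 1, and |Δ| = 2. The Δ values below are those listed in Table 3 for Sidewalk 1, statements (A) to (D).

```python
def alignment_level(delta):
    """Map a median difference (AI minus human) to an alignment label."""
    gap = abs(delta)
    if gap <= 0.5:
        return "Correct"
    if gap <= 1:       # assumption: values in (0.5, 1] count as fair
        return "Fair"
    return "Incorrect"  # assumption: anything above 1 counts as incorrect

# Δ values for Sidewalk 1, statements (A) to (D), as listed in Table 3.
chatgpt_deltas = [+1, -1, -1, 0]
gemini_deltas = [0, -2, -2, 0]

print([alignment_level(d) for d in chatgpt_deltas])
print([alignment_level(d) for d in gemini_deltas])
```

Under these assumptions, the labels produced for the two lists match the Sidewalk 1 entries (A) to (D) of Table 9 for ChatGPT and Gemini.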
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Belaroussi, R.; Salingaros, N.A. Artificial Intelligence-Simulated Cognition of a Pedestrian Assessing a Built Environment. AI 2026, 7, 110. https://doi.org/10.3390/ai7030110
