Article

Architectural Ambiance: ChatGPT Versus Human Perception

by Rachid Belaroussi 1,* and Jorge Martín-Gutierrez 2

1 COSYS-GRETTIA, University Gustave Eiffel, F-77447 Marne-la-Vallée, France
2 Higher School of Engineering and Technology, Universidad de La Laguna, 38071 San Cristóbal de La Laguna, Spain
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2184; https://doi.org/10.3390/electronics14112184
Submission received: 15 April 2025 / Revised: 17 May 2025 / Accepted: 26 May 2025 / Published: 28 May 2025
(This article belongs to the Special Issue Artificial Intelligence-Driven Emerging Applications)

Abstract

Architectural ambiance refers to the mood perceived in a built environment, assessed through human reactions to virtual drawings of prospective spaces. This paper investigates the use of a ready-made artificial intelligence model to automate this task. Based on professional BIM models, videos of virtual tours of typical urban areas were built: a business district, a strip mall, and a residential area. GPT-4V was used to assess the aesthetic quality of the built environment based on keyframes of the videos and to characterize these spaces through subjective attributes. The spatial qualities analyzed through subjective human experience include space and scale, enclosure, style, and overall feelings. These factors were assessed with a diverse set of mood attributes, ranging from balance and protection to elegance, simplicity, and nostalgia. Human participants were surveyed with the same questions based on the videos. The answers were compared and analyzed according to these subjective attributes. Our findings indicate that, while GPT-4V demonstrates adequate proficiency in interpreting urban spaces, there are significant differences between the AI and human evaluators. In nine out of twelve cases, the AI’s assessments aligned with the majority of human voters. The business district environment proved more challenging to assess, while the green environment was effectively modeled.

Graphical Abstract

1. Introduction

1.1. Assessing a Built Environment with AI

Streetscapes—the appearance and design of streets—are essential to creating safe and healthy urban environments [1]. They encompass the exterior aspects of public spaces that shape the experiential perception of an area. Exploring architectural ambiance helps identify specific features provoking feelings tied to a particular atmosphere [2,3]. Some features can be quantified to establish guidelines for designing vibrant and animated living spaces.
The literature relies on subjective methods to assess the perception of individuals: streetscape factors that influence whether a particular area is considered safe and comfortable are usually evaluated. Auditing public spaces often involves time-intensive field censuses or focus-group surveys; using images alleviates the burden of the task and provides a more frugal method to evaluate the aesthetics of a public space [4,5].
However, conducting in-person assessments requires substantial time and effort, especially when recruiting a sufficient number of participants within a reasonable timeframe, particularly on topics that may not align with their interests. Artificial intelligence, particularly multimodal large language models (MLLMs), offers a potential alternative, as these models can process both text and images and generate textual interpretations of their analysis. A key question when applying MLLMs to architectural ambiance assessment is their performance in tasks requiring highly subjective interpretations. While these ready-made models are not specifically fine-tuned for such tasks, as we will demonstrate, their capabilities can be remarkably effective.
GPT-4V (short for GPT-4 Vision) is an extension of OpenAI’s GPT-4 architecture of ChatGPT [6]: it is a multilingual, multimodal deep-learning AI model created by OpenAI. It is specifically designed to handle both text- and image-related tasks. It represents a significant advancement in AI technology by allowing the model to understand and generate responses based on visual information, in addition to traditional text processing. We investigate here how this new ChatGPT extension assesses architectural ambiance by comparing its interpretations of various urban scenes to the subjective evaluations of a group of human participants.
Adopting the methodology proposed by Gomez et al. [7], this study is based on visual-related factors of various parts of a large real-estate project. The protocol is based on a set of experiential aspects, including space and scale, enclosure, architectural style, and overall feelings.

1.2. Contributions of the Work

The contribution of the work is threefold: advancing AI-based environmental assessment, human versus AI perception analysis, and establishing a framework for AI-driven urban studies.
The study provides insights into how AI models, such as ChatGPT, can interpret and characterize urban environments in comparison to human perception. It demonstrates AI’s potential in analyzing architectural ambiance and subjective spatial qualities.
By comparing AI-generated responses with human assessments, the study explores potential biases, similarities, and differences in interpreting urban spaces, contributing to a deeper understanding of how subjective and computational approaches align.
The study establishes a structured methodology for incorporating AI into urban and architectural research, providing a replicable framework for future investigations into AI-human comparative analysis.

Research Questions

Two research questions are the main focus of the article:
  • How do AI-generated assessments compare to human perceptions in characterizing urban spaces based on spatial, architectural, and emotional attributes?
  • To what extent can AI models replicate human subjective interpretations of urban ambiance, and what factors influence discrepancies between AI and human responses?
We will provide elements of response using three urban scenes of diverse character: a business district, a strip mall, and a green residential area. Four aspects of architectural ambiance are investigated—space and scale, enclosure, architectural style, and overall feelings—with each one characterized by a set of subjective attributes covering the feelings evoked by an urban environment.
The paper is organized as follows, providing a comprehensive breakdown of the methodology, analysis, and findings. Section 2 presents a literature review on architectural ambiance and the use of AI in urban scene analysis. Section 3 presents the virtual environment materials used, the methodology proposed with the AI model, and the architectural assessment criteria. In Section 4, the experimental protocol is presented, followed by a detailed aspect-by-aspect comparison, along with demographic insights. Section 5 discusses the alignment between AI and human evaluators, as well as the advantages and limitations of the ChatGPT model. In Section 6, a conclusion is provided, along with some reflections on the creativity of artificial intelligence.

2. Literature Review

2.1. Assessment of Architectural Ambiance

Perceptions of a living environment influence the active lifestyles of citizens [8]: the subjects of walkability and cyclability are largely covered in the scientific literature, but architectural ambiance is barely studied.
Assessing the qualities associated with public spaces—places and sidewalks—involves the active participation of human groups [9]. It can use real street-view images [10] for the assessment of existing infrastructure or a virtual environment for a prospective study on regeneration projects [11].
Different approaches exist for analyzing the aesthetic qualities of architecture. For instance, Brown and Gifford [12] submitted images of real buildings to architect participants and asked them to rate them on a scale ranging from “terrible” to “excellent” architecture. Li et al. [13] used electroencephalogram, heart rate, and electrodermal sensors to measure physiological responses to exposure to panoramic videos of three different public spaces. Kashani and Pazhouhanfar [14] investigated the preference for building facades based on a questionnaire related to images of virtual five-story residential apartments.
The use of virtual reality (VR) can be helpful in the definition of ambiance in architecture [15] and the specification of determining factors on ambiance. For instance, research on the role of water in ambiance underscores its significance in connecting place memory, emotion, and architecture [5,16,17].
Gomez et al. [15] used human participants to assess ephemeral architecture elements, based on a protocol gauging their feelings about dimensions such as scale and size, construction materials, style, related activity, or degree of enclosure. Belaroussi et al. [3] adapted this experimental protocol to compare the perception of laypersons to architecture experts on a planned real estate project: the present paper constitutes an extension of this study to the subject of artificial intelligence-aided design.

2.2. AI in Urban Scene Analysis

Natural language processing (NLP) can be used to process urban data sources, yet the recent review by Cai [18] shows that its application to city analytics is still in its early phase. Very few approaches have applied NLP to urban design from textual inputs. Sentiment analysis can be performed over such data, mainly from geotagged social media: for instance, by processing hashtags from Instagram photos [19] or social reviews of businesses [20].
Using computer vision for the semantic segmentation of street-view imagery is another example of AI-based urban analytics, as reported in a recent review by Biljecki and Ito [10]. Computer vision algorithms [21] are widely exploited to process still images for this purpose: for instance, Tang et al. [22] considered factors such as enclosure, greenery, and openness.
Large language models (LLMs), pretrained on extensive text datasets, exhibit advanced feature extraction capabilities that effectively capture complex data patterns. Their in-context learning ability leveraged through prompts enhances their problem-solving skills, allowing them to achieve performance that can surpass fine-tuned models in certain tasks [23,24].
The recent advent of multimodal large language models (MLLMs) such as GPT-4, DeepSeek [25], or Gemini [26] and vision language models (VLMs) such as CLIP [27], BLIP-2 [28], or Flamingo [29] presents a promising new avenue for automated urban analytics by combining the strengths of both textual and visual inference. They integrate the NLP capabilities of LLMs with computer vision, offering exciting possibilities for tasks requiring a seamless interpretation of visual and textual data. Nonetheless, their utilization for urban environment analytics is still in its early stages.
Liang et al. [30] investigated the proficiency of GPT-4 in discerning temporal street variations for the evaluation of the visual quality of streetscapes. The prompts asked GPT-4 to analyze some aspects of a scene illustrated with images taken from different directions of the street at two different time periods. Nine aspects were evaluated independently, such as buildings, road signs, vegetation, and public equipment. The answer choices given to GPT-4 were between positive, negative, and no changes.
More notably, Malekzadeh et al. [31] investigated the application of ChatGPT to generate estimates of urban attractiveness using simple text prompts with Street View Imagery. They asked GPT-4 to rate the overall visual quality of scenes on a Likert-like scale. Their study found broad agreement between the AI model ratings and human assessments; however, significant contextual discrepancies were observed.
The capability of a ready-made MLLM to answer a categorical characterization of the human sensing of its built environment, instead of producing a scale rating or binary answers, has not yet been investigated.

3. Materials and Methods

3.1. Materials

The urban scenes analyzed were extracted from a large real estate construction project that began in 2021 and is scheduled for completion in 2027: Figure 1 displays some areas of the future district. We collaborated with the project’s general contractor, “SEMOP Châtenay-Malabry Parc Centrale”, within the framework of a university–industry–city academic initiative. At the time of the study, the built environment was not completed, so we created a virtual environment based on 3D BIM data designed by several architecture firms: Atelier M3, Leclerc Associés, and Agence Pietri for buildings and BASE for landscapes.
The 3D city models were obtained from the BIM360 collaborative platform and further integrated using Revit 2020 and TwinMotion 2020.2 software, which allowed the inclusion of elements such as trees, stones, and urban furniture.
Three guided tours were selected to represent distinct urban identities: a business-oriented office district, a shopping mall with vibrant street life, and a green residential area offering leisure opportunities. The first tour focuses on the business district, featuring extravagant, modern buildings along a pedestrian-only pathway. The second tour begins at a public square and continues through a commercial street lined with shops. The final tour showcases a residential area with homes set in a greener environment, including a partially pedestrian-only sidewalk for enhanced accessibility.

3.2. Methodology

3.2.1. Evaluation by GPT-4

We asked ChatGPT for an analysis combining visual and textual understanding. For each tour, we uploaded three keyframes of the video of the guided virtual visit to ChatGPT and input the following criteria-based prompt:
“These three images illustrate a small part of an urban district. Imagine visiting the urban environment shown in these three images. We ask you to characterize this space according to four factors: (1) Space and scale, (2) Enclosure, (3) Architectural style and (4) General Feelings.
Space and scale: what feeling would you feel there? Choose between the three answers: Restlessness, Balance or Grandeur.
Enclosure: What degree of enclosure would you feel inside the scene? Choose between one of these four: Protection, Calmness, Freedom, Animation.
Architectural style: What do the buildings inspire you? Choose between one the three couples: Elegance/Satisfaction or Simplicity/Serenity or Eccentricism/Surprise.
General feelings: What kind of feeling describe the most your sensation inside the scene? Choose between one of the four couples: Joy/Theatricality or Sadness/Nostalgia or Emotion/Spirituality or Indifference/Unnoticed.”
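In practice, such a request can be scripted rather than typed into the chat interface. The sketch below only assembles the text-plus-images message payload in the format accepted by OpenAI’s chat API for image inputs; the abridged prompt constant, the placeholder data URLs, and the commented-out call are illustrative assumptions, not the exact pipeline used in this study.

```python
# Illustrative sketch: combine the criteria-based prompt with three keyframes
# into a single multimodal message. Images are assumed to be pre-encoded as
# base64 data URLs.

PROMPT = (
    "These three images illustrate a small part of an urban district. "
    "We ask you to characterize this space according to four factors: "
    "(1) Space and scale, (2) Enclosure, (3) Architectural style and "
    "(4) General Feelings. ..."  # abridged; full wording as quoted above
)

def build_messages(image_data_urls, prompt=PROMPT):
    """Return a single-turn user message carrying the prompt and the keyframes."""
    content = [{"type": "text", "text": prompt}]
    content += [
        {"type": "image_url", "image_url": {"url": url}}
        for url in image_data_urls
    ]
    return [{"role": "user", "content": content}]

# Three keyframes per tour (placeholder data URLs shown here).
messages = build_messages(["data:image/jpeg;base64,..." for _ in range(3)])

# A real call would then look like (hypothetical client setup omitted):
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

Scripting the request this way makes the evaluation repeatable across tours, since the same prompt text is guaranteed to accompany each triplet of keyframes.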
Figure 2 shows how the process works and some technical aspects of the algorithm used. It consists of a vision-analysis stage that analyzes objects, layouts, and stylistic cues from the images; a natural language-processing stage that structures responses based on the predefined categories provided; and a multimodal-integration stage that synthesizes insights from both modalities (images and text).
The three images are processed to extract visual features like building dimensions, spacing, materials, tree placement, and urban layout. These details inform the evaluation of factors such as human scale, enclosure, and style. Image segmentation is first performed, assigning a class label (e.g., “building”, “tree”, “sky”) to each pixel in the image, and object detection makes it possible to differentiate instances of the same object class (e.g., separating two buildings). GPT-4V likely uses advanced transformer-based vision architectures or hybrid deep learning methods (e.g., combining segmentation with object detection), but the exact algorithm or architecture remains undisclosed by OpenAI.
The key components analyzed include building structure, their scale, shape, and design (e.g., rectangular facades and modern balconies), the urban layout made of the relation between buildings, streets, and natural elements like trees, and environmental cues such as sky, lighting, and vegetation indicating openness and mood.
In parallel, natural language processing of the input prompts is performed. GPT-4 uses an advanced transformer-based NLP method [32]. GPT-4V is fine-tuned on the specific tasks of answering questions about images or generating captions. This text processing guides the output categories. For example, the options provided, such as “Restlessness” and “Balance”, create structured outputs. Semantic categorization extracts visual features and maps them to the categories in the questions. For example, space and scale are determined through the relative proportions of buildings, streets, and open areas, while enclosure is based on the openness of the scene and the balance between natural and built elements.
GPT-4V matches visual observations from the images to the provided category choices using contextual reasoning. For instance, if the images showed densely packed buildings with little open space, it might have chosen “Restlessness” as a response. By combining spatial relationships from the image (e.g., tree-lined streets and building facades) with predefined descriptive categories, it is able to generate a coherent answer with explanations such as the following:
“Space and Scale: The environment feels Balanced. The scale of the buildings and their repetition create a sense of proportion and uniformity, without overwhelming grandeur or creating restlessness.”
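The contextual matching just described is opaque inside GPT-4V, but its outward behavior can be caricatured with a toy rule. The heuristic below is purely illustrative: the input statistics and thresholds are invented for this sketch and do not describe the model’s actual mechanism.

```python
def space_and_scale_label(built_ratio: float, max_height_m: float) -> str:
    """Toy heuristic mimicking the category matching described above.

    built_ratio  -- fraction of the scene occupied by buildings (0..1)
    max_height_m -- tallest building visible in the keyframes (metres)

    Thresholds are invented for illustration only; GPT-4V's actual
    reasoning is not disclosed.
    """
    if built_ratio > 0.7:
        return "Restlessness"  # densely packed, little open space
    if max_height_m > 60:
        return "Grandeur"      # towering structures dominate the view
    return "Balance"           # proportionate mix of built and open space

# A mid-rise street with generous open space maps to "Balance" under this rule.
```

Even such a crude mapping makes clear why a single-answer model tends to return “Balance” for any moderately proportioned scene, a tendency visible in the results below.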

3.2.2. Architectural Aspects and Mood Attributes

Based on the preliminary work of Gomez et al. [7,15], four aspects were selected, namely space and scale, the degree of enclosure, architectural style, and overall feelings. They are well grounded in capturing the essence of architectural ambiance:
  • Space and scale are fundamental in shaping perceptions and experience of a built environment. The proportions of spaces influence whether an area feels grand and imposing, intimate and human-scaled, or claustrophobic. Mood attributes that can characterize these aspects are balance, grandeur, and restlessness.
  • Enclosure determines openness versus containment in a space. An environment with high enclosure can feel protected, private, or even restrictive, while highly open spaces convey freedom, exposure, and accessibility. This aspect is key to defining how people interact with the space emotionally. Attributes related to this aspect can include the following: protection, calmness, freedom, and animation.
  • Style embodies historical, cultural, and aesthetic values. Whether a space is modern and minimalist, ornate, or industrial and raw, it immediately affects ambiance and perception. The materials, geometry, and detailing play a role in reinforcing a mood—whether eccentric, elegant, or simplistic.
  • Overall feelings encapsulate subjective experience, integrating physical attributes with emotional response. Architecture is not just about form; it influences psychological comfort. Whether a space feels nostalgic, spiritual, theatrical, or indifferent, these impressions result from the interplay of the previous three aspects, making this the final necessary piece to complete ambiance characterization.
Other aspects, such as imageability, transparency, and complexity [33,34], would be needed to comprehensively cover built-environment assessment; we chose not to investigate them, since our focus was on analyzing AI responses rather than on evaluating urban design characteristics.

4. Experiment and Results

4.1. Experimental Protocol

Figure 3 explains the experimental protocol implemented. The architectural aspects of three urban scenes were evaluated by a sample of laypeople on the one hand and by ChatGPT on the other. The answers of the human cohort were compared to the moods generated by the ChatGPT AI.
Human recruitment was performed online. An online questionnaire presented a video of each tour, followed by questions about the feelings evoked by the virtual experiences according to different aspects of architecture. Personal information was limited to age, gender, and occupation, as the questionnaire remained anonymous.
The experiment started with a very general presentation and asked for voluntary consent to the publication of the collected data in a research paper. We chose to follow a protocol initially proposed by Gomez et al. in [7,15]: no definitions or explanation of any notions of architecture was proposed as a preamble to the questionnaire. This paper exploits part of previous results collected during a larger experiment: see [3] for the detailed protocol used and the list of questions submitted.
Participants were mostly people with a university background—students and professors—from French and Spanish universities. A total of n = 118 people participated in the study: half men, half women, with 50% of them under the age of 26 and 20% over 50 years old.

4.2. Analysis of Results by Architectural Aspects

4.2.1. Space and Scale Criteria

Three attributes were proposed to assess the space and scale aspects of the scenes: grandeur, balance, and restlessness. Figure 4 displays the resulting polls from the participants on the first line and the outputs of GPT-4 on the second line.
ChatGPT felt balance in all scenes. The human participants’ perceptions showed more variety, with the additional choices of grandeur or restlessness. Remember that, for each scene, GPT-4 was prompted for a unique answer, whereas the participants’ results could show a variety of answers reflecting each personal choice. Therefore, the feelings evoked by space and scale are more contrasted in the participants’ poll and unique in ChatGPT’s opinion. ChatGPT gave a single dominant answer, missing the subtle diversity present in human perception, but the question was whether ChatGPT’s opinion would capture the overall trends well.
The business district was judged as a grandeur scene by 55% of the participants and as balanced by 38%. With these criteria of space and scale, this virtual tour indicated the main contradiction between ChatGPT and human responses: if one decided to follow the majority of voters, the scene would be labeled grandeur, whereas if one used ChatGPT, it would be characterized as balanced.
The shopping street stimulated a feeling of balance in approximately 60% of human participants, while 25% felt restlessness and almost 20% felt grandeur. ChatGPT strictly categorized it as balanced, in accordance with the majority of voters.
In the virtual tour of the green residential units, about two-thirds of the participants reported a sense of balance. The remaining third was evenly split between feelings of restlessness and grandeur, though both groups represented a clear minority. Humans largely leaned towards the balance attributes, and again, ChatGPT categorized this scene similarly. Overall, ChatGPT closely matched the dominant sentiment in two tours out of the three.
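The agreement check used throughout this section reduces to comparing the AI’s single label with the human plurality vote. The sketch below replays it for the space-and-scale criterion, using the approximate vote shares reported above (the exact splits are shown in Figure 4).

```python
# Approximate human vote shares per scene for the space-and-scale attributes,
# as reported in the text; exact values appear in Figure 4.
human_votes = {
    "business district": {"grandeur": 0.55, "balance": 0.38, "restlessness": 0.07},
    "strip mall":        {"balance": 0.60, "restlessness": 0.25, "grandeur": 0.15},
    "green residences":  {"balance": 0.66, "restlessness": 0.17, "grandeur": 0.17},
}

# GPT-4V returned a single label ("balance") for every scene.
gpt_label = {scene: "balance" for scene in human_votes}

def matches_plurality(votes, ai_label):
    """True if the AI label coincides with the most voted human attribute."""
    return max(votes, key=votes.get) == ai_label

agreement = {s: matches_plurality(v, gpt_label[s]) for s, v in human_votes.items()}
# agreement -> business district: False, strip mall: True, green residences: True
```

The tally reproduces the conclusion above: the AI matches the dominant human sentiment in two of the three tours, disagreeing only on the business district.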

4.2.2. Enclosure Feelings

The degree of enclosure reflects the feeling of being surrounded in a defined space shaped by buildings and streetscape elements. Rather than using overly technical terms, we proposed categorizing this feeling on a spectrum of sensations, allowing users to express their experience in the most intuitive way. Figure 5 illustrates the results obtained from the human participants and from GPT-4, showing a good share of differences.
In the business district, the most voted sensations were calmness for 42% of people and freedom for 28% of the cases, with animation and protection coming in last. In the green residencies, calmness also garnered a majority of votes with 45%, with the remaining votes being approximately equally distributed between the three other choices. For these two environments, GPT-4 agreed with the characterization of calmness.
There was more variety of feelings expressed in the strip mall environment, with freedom, animation, and calmness roughly equal among human participants, and with protection being in the minority. It seems that it was difficult to categorize this place with this choice of descriptions. GPT-4 chose calmness, giving the following convincing argument:
The combination of continuous facades and the open street with tree planting offers a blend of spatial definition and visual breathing room.
The visual analysis of the scene is indeed correct, even though the model did not explain why the other characterizations fit less with the ambiance.

4.2.3. Architecture Style

The architectural style was evaluated using three pairs of attributes: a sense of elegance paired with satisfaction, a feeling of simplicity accompanied by serenity, and more eccentric styles that evoke surprise. Public opinions were collected based on these three pairs of descriptors, with participants allowed to select only one. As shown in Figure 6, the three chosen urban identities primarily elicited two main types of reactions from the audience.
The virtual tour of the business office area predominantly evoked feelings of elegance and satisfaction among participants, while the simplicity/serenity and eccentricity/surprise options were selected less frequently. The other two virtual tours—the shopping street and the residential area—primarily evoked a general sense of simplicity and serenity. A secondary response was a feeling of elegance and satisfaction.
In Tour 1 (business district), the majority of humans felt elegance/satisfaction, and ChatGPT categorized it similarly. In Tour 2 (strip mall) and Tour 3 (green residences), humans leaned towards simplicity/serenity, which was also ChatGPT’s response.
ChatGPT captures overall trends fairly well, but it gives a single dominant answer, missing the subtle diversity present in human perception. It does not reflect the full range of subjective impressions that humans express in response to architecture: human answers are distributed across multiple categories. However, ChatGPT’s responses closely match the dominant sentiment expressed by human participants, leading to the belief that this aspect of architectural ambiance is well modeled using the MLLM.

4.2.4. Overall Feelings

To characterize the overall feelings sensed in the urban scene, four pairs of attributes were chosen between joy/theatricality, sadness/nostalgia, emotion/spirituality, and indifference/unnoticed.
As Figure 7 shows, human participants expressed a range of emotions, often splitting their responses between indifference in green, joy/theatricality in blue, and emotion/spirituality in yellow. Humans displayed varying emotional reactions across the scenes, with indifference being the most commonly expressed feeling across all three urban identities—particularly in the strip mall, where nearly 60% of participants reported it, and over 40% did so in each of the other tours. Sadness/nostalgia, shown in red, represented a clear minority in all cases.
The business district, for example, had significant percentages of both indifference (46%) and sadness/nostalgia (32%), but ChatGPT classified it as joy/theatricality, which differs significantly. The strip mall was overwhelmingly classified as indifferent by humans (57%), which ChatGPT correctly identified. However, for green residences, humans showed a balanced mix between indifference (41%) and joy/theatricality (36%), while ChatGPT strictly categorized it as indifference, missing some emotional richness but still aligning with the majority of voters.
The predominant feeling of indifference among participants can be attributed to an unclear understanding of what constitutes overall ambiance. Participants struggled to make a choice given the four options because they did not align with their actual experiential perceptions. Emotions such as joy and sadness are intense, and it can be challenging to express more nuanced or subdued feelings.
ChatGPT agreed with this indifference/unnoticed characterization of the strip mall and green residencies but voted for joy/theatricality in the case of the business district. Its arguments were as follows:
The sleek design, interplay of glass and sunlight, and the vibrant urban environment contribute to a lively and dynamic atmosphere.
This is convincing argumentation, and it fits well with the visual scene. But this choice qualifies as a form of hallucination, since it significantly deviates from the dominant human perception. ChatGPT categorized the scene as joy/theatricality, which contradicts the prevailing sentiment and suggests a disconnect in its contextual understanding. ChatGPT may associate business districts with vibrancy, energy, and modernity, leading it to overemphasize positive emotions. In the two other tours, GPT-4 correctly chose indifference because the tours lacked standout features that might evoke a stronger emotional or memorable response, an opinion that seemed to be shared by the majority of the public audience.

4.2.5. Overall Feelings: Cross-AI Comparison

The set of attribute pairs—joy/theatricality, sadness/nostalgia, emotion/spirituality, and indifference/unnoticed—captures a range of feelings. It touches upon positive, negative, and neutral emotional responses to urban environments, and it connects feeling to a quality: pairing an emotion with a descriptive quality (e.g., Sadness/Nostalgia) adds a layer of understanding as to why that feeling might arise. It also offers distinct categories: the four pairs provide relatively separate ways of experiencing a city. However, some feelings can overlap, which may explain why humans found it difficult to classify a scene as anything other than indifference/unnoticed. For example, a “theatrical” scene might also evoke “emotion” beyond just joy. Similarly, “nostalgia” can be tied to emotions other than sadness. As mentioned in [3], the overall feelings aspect of architectural ambiance is a hard one to grasp, and this AI comparison confirms the difficulty of labeling such an aspect.
To investigate the overall-feelings aspect of architectural ambiance more systematically, we turned to the AI-aided design of urban environments. The latest text-to-image model of Google DeepMind—Imagen 3—is able to generate photorealistic images based on textual prompts. We submitted the following prompt to Imagen:
Let us suppose that the general feeling of an urban scene can be described by one of the following four pairs of attributes: Joy/Theatricality, Sadness/Nostalgia, Emotion/Spirituality, or Indifference/Unnoticed. Please generate four urban scene images illustrating these types of built environments, all under the same conditions: clear weather, during daytime (at noon), with the sun and lighting not influencing these feelings. Could you ensure that the setup features modern buildings in an urban area, with no people, no cars, and no fountains?
Figure 8 shows the resulting generated environments. Some elements on which Imagen based its generation can be inferred. For instance, in Figure 8a, the use of light pink tones and the orderly, open courtyard space evokes a feeling of calm positivity and visual harmony that suggests joy. The warm tones on the buildings, combined with the rich blue sky, evoke a feeling of lightness and positivity. In Figure 8b, the color palette is very muted and de-saturated, giving it a washed-out, almost melancholic appearance favorable to a nostalgic experience. Somewhere between atmospheric minimalism and brutalism, the scenes of Figure 8c display a soft haze, muted tones, and pale lighting, creating a serene, contemplative mood that leans toward the introspective. In the four scenes depicted in Figure 8d, the buildings are stark, geometric, and nearly identical, creating a sterile and depersonalized environment. There is no architectural detail or color variation to provoke emotional engagement, suggesting an impersonal, utilitarian design.
The classification of the generated architectural spaces would benefit from human verification, but we arbitrarily used these sixteen generated environments as ground truth for our visual characterization of overall feelings of architectural ambiance in an AI-versus-AI comparison. Table 1 presents the confusion matrix of the ChatGPT evaluation of the generated architectural spaces. The rows (joy, sadness, emotion, and indifference) show the target labels for the environments; the columns depict how ChatGPT predicted the emotions for each type of generated environment.
The first row shows that joy was predicted correctly once but misclassified as sadness twice and as emotion once. The second row indicates that sadness environments were misclassified as emotion or indifference, resulting in four incorrect predictions. The four emotion/spirituality environment predictions were evenly spread across labels, suggesting some ambiguity or overlap in emotional cues. The only environments that ChatGPT classified perfectly were the indifference/unnoticed environments (Figure 8d), with four correct classifications. This matrix shows challenges in emotional interpretation, particularly in distinguishing closely related emotions like spirituality and nostalgia. With seven scenes out of sixteen classified as indifference/unnoticed, it also confirms ChatGPT’s tendency toward conservatism and caution, a tendency shared by humans when the proposed attributes are not discriminative enough, as can be seen in Figure 7. Emotion and spirituality were the second most predicted attributes, as in the human evaluation of urban identities, with four scenes out of sixteen, but only one of these predictions was correct.
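The counts discussed above can be recovered directly from the confusion matrix of Table 1. The following sketch reproduces that arithmetic with NumPy; the matrix values are exactly those reported in the table:

```python
import numpy as np

# Confusion matrix from Table 1 (rows: actual label, columns: predicted label),
# label order: joy, sadness, emotion, indifference.
labels = ["joy", "sadness", "emotion", "indifference"]
cm = np.array([
    [1, 2, 1, 0],   # joy
    [0, 0, 2, 2],   # sadness
    [1, 1, 1, 1],   # emotion
    [0, 0, 0, 4],   # indifference
])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / 16 scenes
predicted_counts = cm.sum(axis=0)         # how often each label was predicted
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(f"overall accuracy: {accuracy:.3f}")               # 6/16 = 0.375
for lab, n in zip(labels, predicted_counts):
    print(f"{lab} predicted {n} times")                  # indifference: 7, emotion: 4
```

The column sums confirm the tendencies noted in the text: indifference/unnoticed is the most frequent prediction (seven of sixteen), followed by emotion/spirituality (four).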
The overlap and lack of clarity between the attributes seem to have contributed to the ambiguity in the results. To improve the protocol, researchers might consider clearly defining each attribute to minimize overlap and using less nuanced feelings than the ones proposed in this paper. Using diverse environments for evaluation could highlight areas where ambiguity arises and reveal whether certain attributes are truly distinct: the difficulty is finding such a dataset, which, our extensive research suggests, does not yet exist in the field of architectural ambiance. This could give our proposed protocol sharper boundaries and improve interpretability.

4.3. Overview Evaluation of AI-Generated Outputs

4.3.1. Analysis by Urban Identity

A summary of the results is presented in Table 2, showing the agreement between ChatGPT’s output and public audience votes on attributes for the investigated ambiance aspects. Three percentages are displayed in bold: these are the cases where ChatGPT disagreed with the relative majority of voters. For instance, in the business district, the attribute balance chosen by GPT-4 received 38% of the votes, but for a relative majority of 55% of voters, this urban identity evoked grandeur. For the same urban identity, ChatGPT chose joy/theatricality as the overall feeling, while the public audience voted for indifference or emotion/spirituality, as can be seen in Figure 7. Only 15% of participants felt joy/theatricality in the business district virtual tour: this ambiance aspect and this urban identity constituted the major difference between human perception and GPT-4. Indeed, GPT-4 based its assessment on the interplay between glass and sunlight, judging that it contributed to a lively and dynamic atmosphere. The human response is more difficult to analyze and would require a more direct investigation.
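The per-question comparison described above can be sketched as follows. The vote shares are illustrative values patterned on the business district space and scale case (balance 38%, relative majority grandeur 55%); the third attribute and its residual share are hypothetical placeholders:

```python
# Human vote distribution for one question (illustrative shares; "other"
# stands in for the remaining attribute, whose exact share is not reported).
votes = {"balance": 0.38, "grandeur": 0.55, "other": 0.07}
ai_choice = "balance"   # the attribute GPT-4 selected for this question

relative_majority = max(votes, key=votes.get)   # attribute with most votes
agreement_rate = votes[ai_choice]               # share of humans matching the AI
ai_matches_majority = ai_choice == relative_majority

print(relative_majority, agreement_rate, ai_matches_majority)
```

With these figures, the AI’s choice gathers 38% agreement but misses the relative majority, which is exactly how the bold cells of Table 2 are defined.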
The two other urban identities were modeled remarkably well: even though agreement on some aspects was below 50%, GPT-4 felt the same way as the relative majority of people in seven of the eight reported cases. In particular, the green residencies identity seems to be less difficult to appraise through AI, probably because the presence of vegetation tends to evoke balance, calmness, and serenity in humans.

4.3.2. Analysis by Gender and Age

Looking for demographic-specific findings, we measured how well ChatGPT’s output aligns with subsets of human responses, focusing on two sociodemographic attributes: gender and age. The sample of human participants was approximately evenly split between men and women. Approximately half of the sample was under 26 years old, while the other half was 26 or older. The concordance between ChatGPT and subsets of human responses was measured using the agreement rate for each question. For instance, regarding the space and scale factor, ChatGPT’s output “Balance” matched 38% of human annotators in the business district, 58% in the strip mall, and 67% in the green residences, as shown in the first row of Table 2. Considering only women participants, the agreement rates on the space and scale factor were 36% on Tour 1, 60% on Tour 2, and 69% on Tour 3; the average of these three values is 55%. The agreement scores for the four architectural factors segmented by gender are displayed in Figure 9a, averaged across the three virtual tours. For instance, on the space and scale factor, ChatGPT agreed with 52% of the men on average. Figure 9b illustrates the average agreement rate for the two age groups: under 26 and 26 or older.
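As a minimal sketch, the subgroup averaging behind Figure 9 works as follows, using the women’s space and scale rates reported above:

```python
# Per-tour agreement rates between ChatGPT's choice and a participant subset,
# averaged across the three virtual tours (space and scale factor, women).
tours = ["business district", "strip mall", "green residences"]
agreement_women = [0.36, 0.60, 0.69]   # rates reported in the text

average = sum(agreement_women) / len(agreement_women)
print(f"average agreement (women, space and scale): {average:.0%}")   # ≈ 55%
```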
The concordance between AI-generated responses and human judgments was similar for women and men, with a maximum difference of 7% for the enclosure and overall aspects. Style perceptions were relatively similar across both categories, suggesting a shared aesthetic understanding of urban environments.
The difference in AI–human judgment concordance was more significant when age groups were compared: older individuals generally exhibited higher agreement rates, particularly in space/scale (10% more) and overall perception (15% more). However, for enclosure and architectural style, the AI was slightly more in accordance with younger participants.
Since the largest difference occurred between age groups, rather than gender, one could cautiously say that older participants are slightly closer to AI-generated responses. However, this difference is not strong enough to make definitive claims—AI–human agreement seems relatively balanced across demographics.

5. Discussion

5.1. General Findings of the Work

Overall, GPT-4 chose an attribute in accordance with the relative majority of the public audience in nine cases out of twelve, corresponding to four architectural ambiance aspects across the three investigated urban identities. In two of the three cases where GPT-4 disagreed with people, it made a choice equal or close to the second-best attribute characterizing ambiance. Its rate of accordance with human voters was, most of the time, above 40%.
The relatively strong alignment between GPT-4’s responses and public audiences underscores its practical usefulness and potential for a wide range of applications. This alignment suggests that AI tools like ChatGPT can democratize access to complex assessments for users with no knowledge of AI methods, such as municipal workers and urban planners. By eliminating the need for users to train models—which typically require labeled data, as well as significant human and computational resources—ChatGPT offers a practical and efficient solution. Its widespread accessibility and increasing familiarity among users further enhance its usability, making the proposed approach both user-friendly and reproducible. Users only need to focus on crafting appropriate prompts and selecting relevant assessment criteria, streamlining the evaluation process.
Despite its advantages, ChatGPT has several limitations that must be considered. As a proprietary model, it operates with constraints on API access and associated costs, which may limit its adoption in budget-constrained organizations. Moreover, understanding the underlying mechanisms and decision-making processes of GPT-4 remains a challenge, making it difficult for users to fully interpret or validate the outputs. These factors could pose barriers to trust and transparency, particularly for decision-makers who require a clear understanding of AI-based recommendations.

5.2. Limitations of the Work

The work presented in this paper also faced some limitations. The proposed approach offers only a limited view of the environment, as it is primarily based on visual aspects without accounting for the full range of multi-sensory experiences, including auditory stimuli. Future studies should aim to incorporate a more comprehensive assessment of aesthetic qualities to better reflect real-life experiences; MLLMs are expected to be capable of processing audio as well.
Some other aspects of architectural ambiance such as greenery, imageability, transparency, or complexity could also be investigated, but we chose a limited set of factors and attributes to focus on the analysis of AI responses.
We focused our work on only three urban identities that represented a variety of urban environments: for a broader generalization of the results, more data and more experiments need to be conducted. Our study intentionally focused on selected case scenarios to deepen the methodological approach, allowing for a more precise evaluation of AI–human alignment in architectural ambiance assessment.
Another key aspect of AI that warrants further investigation is the consistency of its responses. Across multiple accounts and varying environments, some degree of variation in AI-generated outputs is likely to occur. We did not explore this factor, as our approach simulated a typical practitioner who would engage with the chatbot only once to obtain a quick response, rather than conducting repeated trials. While the scope is limited, the results serve as a preliminary contribution to this emerging field, providing insights that can inform future, broader investigations.
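A repeated-trial consistency check of the kind alluded to above, left here to future work, could be as simple as the following sketch; the run outputs are hypothetical examples of what the same prompt might return across several sessions:

```python
from collections import Counter

# Hypothetical outputs of the same prompt submitted in five separate sessions.
runs = ["balance", "balance", "grandeur", "balance", "balance"]

# Consistency = share of runs returning the modal (most frequent) answer.
modal_answer, modal_count = Counter(runs).most_common(1)[0]
consistency = modal_count / len(runs)

print(modal_answer, consistency)   # balance 0.8
```

A consistency score near 1.0 would support the single-query usage simulated in this study; a low score would indicate that practitioners should aggregate several responses instead.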

6. Conclusions

In this paper, we evaluated the capability of a widely used AI platform, ChatGPT, with its latest architecture, GPT-4, in the task of assessing architectural ambiance. Four aspects of this field were investigated—human scale, enclosure, style, and overall feelings—each characterized by three or four intuitive attributes. Virtual environments were designed to investigate three different urban identities: a business district, a strip mall, and a green residential area. For each scene, 118 human participants voted for one attribute among the subjective propositions for each architectural aspect.
Comparing the outputs of GPT-4 with the public samples, we found that, in nine out of twelve cases, the AI’s assessments aligned with the majority of human voters. The business district environment proved more challenging to assess, while the green environment was effectively modeled. Contextual visual information was largely taken into account in the AI model, but its responses need nuancing with a human perspective to capture the full range of experiential feelings. Our findings indicate that, while GPT-4V demonstrates adequate proficiency in interpreting urban spaces, there are significant differences between the AI and human evaluators, especially when it comes to subjective aspects such as emotional resonance and the perception of grandeur.
It is important to consider the current limitations, errors, and issues of generative AI, namely the hallucinations, biases, and data errors that such systems may produce. By hallucinations, we refer to inaccurate outputs—instances where the AI generates invented information or responds in an overly creative manner detached from factual reality. While this creativity can be interesting in certain contexts, such as writing poems or fictional literary works, in other contexts we need the AI to behave objectively and precisely, avoiding these “hallucinations”. Bias is a reflection of the data with which the AI has been trained: the training dataset may contain dominant information that carries more weight than objective scientific data, overshadowing the latter. In other words, the AI provides results based on the information and data it has been trained on.

Author Contributions

Conceptualization, R.B. and J.M.-G.; methodology, R.B. and J.M.-G.; software, R.B.; validation, R.B.; formal analysis, R.B. and J.M.-G.; investigation, R.B.; resources, R.B.; writing—original draft preparation, R.B. and J.M.-G.; writing—review and editing, R.B. and J.M.-G.; visualization, R.B.; supervision, R.B. and J.M.-G.; project administration, R.B.; funding acquisition, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly funded by the E3S project, a partnership between Eiffage and the I-SITE FUTURE consortium. FUTURE bénéficie d’une aide de l’État gérée par l’Agence Nationale de la Recherche (ANR) au titre du programme d’Investissements d’Avenir (référence ANR-16-IDEX-0003) en complément des apports des établissements et partenaires impliqués. As part of the City-Fab project, this work has received support under the program “France 2030” launched by the French Government and implemented by ANR, with the reference ANR-21-EXES-0007.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to Legal Regulations. Article L1121-1 Code de la santé publique (public health code defining the three categories of research involving humans): https://www.legifrance.gouv.fr/codes/article_lc/LEGIARTI000046125746, accessed on 1 March 2025.

Informed Consent Statement

Informed consent for publication was obtained from all human participants.

Data Availability Statement

Image data are available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ogawa, Y.; Oki, T.; Zhao, C.; Sekimoto, Y.; Shimizu, C. Evaluating the subjective perceptions of streetscapes using street-view images. Landsc. Urban Plan. 2024, 247, 105073. [Google Scholar] [CrossRef]
  2. Gomez-Tone, H.C.; Alpaca Chávez, M.; Vásquez Samalvides, L.; Martin-Gutierrez, J. Introducing Immersive Virtual Reality in the Initial Phases of the Design Process–Case Study: Freshmen Designing Ephemeral Architecture. Buildings 2022, 12, 518. [Google Scholar] [CrossRef]
  3. Belaroussi, R.; González, E.D.; Dupin, F.; Martin-Gutierrez, J. Appraisal of Architectural Ambiances in a Future District. Sustainability 2023, 15, 13295. [Google Scholar] [CrossRef]
  4. Luo, J.; Liu, P.; Xu, W.; Zhao, T.; Biljecki, F. A perception-powered urban digital twin to support human-centered urban planning and sustainable city development. Cities 2025, 156, 105473. [Google Scholar] [CrossRef]
  5. Luo, J.; Zhao, T.; Cao, L.; Biljecki, F. Water View Imagery: Perception and evaluation of urban waterscapes worldwide. Ecol. Indic. 2022, 145, 109615. [Google Scholar] [CrossRef]
  6. OpenAI. GPT-4: OpenAI’s Multimodal Large Language Model. 2023. Available online: https://openai.com/gpt-4 (accessed on 15 January 2025).
  7. Gómez-Tone, H.C.; Bustamante Escapa, J.; Bustamante Escapa, P.; Martin-Gutierrez, J. The drawing and perception of architectural spaces through immersive virtual reality. Sustainability 2021, 13, 6223. [Google Scholar] [CrossRef]
  8. Bornioli, A. The walking meeting: Opportunities for better health and sustainability in post-COVID-19 cities. Cities Health 2023, 7, 556–562. [Google Scholar] [CrossRef]
  9. Zhu, Y.; Zhang, Y.; Biljecki, F. Understanding the user perspective on urban public spaces: A systematic review and opportunities for machine learning. Cities 2025, 156, 105535. [Google Scholar] [CrossRef]
  10. Biljecki, F.; Ito, K. Street view imagery in urban analytics and GIS: A review. Landsc. Urban Plan. 2021, 215, 104217. [Google Scholar] [CrossRef]
  11. Corticelli, R.; Pazzini, M.; Mazzoli, C.; Lantieri, C.; Ferrante, A.; Vignali, V. Urban Regeneration and Soft Mobility: The Case Study of the Rimini Canal Port in Italy. Sustainability 2022, 14, 14529. [Google Scholar] [CrossRef]
  12. Brown, G.; Gifford, R. Architects predict lay evaluations of large contemporary buildings: Whose conceptual properties? J. Environ. Psychol. 2001, 21, 93–99. [Google Scholar] [CrossRef]
  13. Li, F.; Zhang, Z.; Xu, L.; Yin, J. The effects of professional design training on urban public space perception: A virtual reality study with physiological and psychological measurements. Cities 2025, 158, 105654. [Google Scholar] [CrossRef]
  14. Hashemi Kashani, S.; Pazhouhanfar, M.; van Oel, C. Role of physical attributes of preferred building facades on perceived visual complexity: A discrete choice experiment. Environ. Dev. Sustain. 2023, 26, 13515–13534. [Google Scholar] [CrossRef]
  15. Gómez-Tone, H.C.; Martin-Gutierrez, J.; Bustamante-Escapa, J.; Bustamante-Escapa, P. Spatial Skills and Perceptions of Space: Representing 2D Drawings as 3D Drawings inside Immersive Virtual Reality. Appl. Sci. 2021, 11, 1475. [Google Scholar] [CrossRef]
  16. Sioui, G.B. Ambiantal Architecture—Defining the role of water in the aesthetic experience of sensitive architectural ambiances. In SHS Web of Conferences; EDP Sciences: Les Ulis, France, 2019. [Google Scholar]
  17. Gascon, M.; Zijlema, W.; Vert, C.; White, M.P.; Nieuwenhuijsen, M.J. Outdoor blue spaces, human health and well-being: A systematic review of quantitative studies. Int. J. Hyg. Environ. Health 2017, 220, 1207–1221. [Google Scholar] [CrossRef]
  18. Cai, M. Natural language processing for urban research: A systematic review. Heliyon 2021, 7, e06322. [Google Scholar] [CrossRef]
  19. Wagiri, F.; Wijaya, D.C.; Sitindjak, R.H.I. Embodied spaces in digital times: Exploring the role of Instagram in shaping temporal dimensions and perceptions of architecture. Architecture 2024, 4, 948–973. [Google Scholar] [CrossRef]
  20. Olson, A.W.; Calderón-Figueroa, F.; Bidian, O.; Silver, D.; Sanner, S. Reading the city through its neighbourhoods: Deep text embeddings of Yelp reviews as a basis for determining similarity and change. Cities 2021, 110, 103045. [Google Scholar] [CrossRef]
  21. Lee, M.; Kim, H.; Hwang, S. Virtual audit of microscale environmental components and materials using streetscape images with panoptic segmentation and image classification. Autom. Constr. 2025, 170, 105885. [Google Scholar] [CrossRef]
  22. Tang, J.; Long, Y. Measuring visual quality of street space and its temporal variation: Methodology and its application in the Hutong area in Beijing. Landsc. Urban Plan. 2019, 191, 103436. [Google Scholar] [CrossRef]
  23. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and short papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  25. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  26. Fu, C.; Zhang, R.; Wang, Z.; Huang, Y.; Zhang, Z.; Qiu, L.; Ye, G.; Shen, Y.; Zhang, M.; Chen, P.; et al. A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise. arXiv 2023, arXiv:2312.12436. [Google Scholar]
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  28. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  29. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  30. Liang, H.; Zhang, J.; Li, Y.; Wang, B.; Huang, J. Automatic Estimation for Visual Quality Changes of Street Space Via Street-View Images and Multimodal Large Language Models. IEEE Access 2024, 12, 87713–87727. [Google Scholar] [CrossRef]
  31. Malekzadeh, M.; Willberg, E.; Torkko, J.; Toivonen, T. Urban attractiveness according to ChatGPT: Contrasting AI and human insights. Comput. Environ. Urban Syst. 2025, 117, 102243. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. Available online: https://dl.acm.org/doi/10.5555/3295222.3295349 (accessed on 13 April 2025).
  33. Xiao, Y.; Song, M. How are urban design qualities associated with perceived walkability? An AI approach using street view images and deep learning. Int. J. Urban Sci. 2024, 1–26. [Google Scholar] [CrossRef]
  34. Ewing, R.; Clemente, O.; Neckerman, K.M.; Purciel-Hill, M.; Quinn, J.W.; Rundle, A. Measuring Urban Design: Metrics for Livable Places; Springer: Berlin/Heidelberg, Germany, 2013; Volume 200. [Google Scholar] [CrossRef]
Figure 1. Case study: a 3D city model with various urban environments.
Figure 2. Principle diagram of GPT-4V: the algorithm uses multimodal capabilities to integrate the processed image features extracted from a tour, with the semantic structure of the questions.
Figure 3. Workflow diagram: comparing AI-generated ratings and human participants’ perspectives.
Figure 4. Space and scale: a feeling of balance was predominant among human raters, except for the business district, which primarily evoked grandeur; GPT-4 classified the scenes uniformly.
Figure 5. Degree of enclosure: human participants versus GPT-4 classification.
Figure 6. Architecture style: simplicity and elegance were the most prized choices for AI and humans.
Figure 7. Overall feelings: What kind of feelings describe most your sensation inside the scene?
Figure 8. AI-generated architectural ambiance: for each of the four overall-feelings attribute pairs, four urban environments were generated via the Imagen text-to-image model (Google DeepMind).
Figure 9. Comparison of average agreement rates between AI-generated attributes and participants’ perceptions: (a) by gender (women vs. men); (b) by age group (under 26 vs. 26 or older).
Table 1. Confusion matrix of ChatGPT predictions over Imagen-generated environments. Hot colors correspond to stronger values.
                          Predicted
Actual          Joy   Sad.   Emot.   Indiff.
Joy              1     2      1       0
Sadness          0     0      2       2
Emotion          1     1      1       1
Indifference     0     0      0       4
Table 2. Agreement rate between ChatGPT outputs and public audience votes. In bold are shown the cases where ChatGPT made a choice different than the relative majority of human voters.
Ambiance Aspect       Business District   Strip Mall   Green Residencies
Space and scale       38%                 58%          67%
Enclosure             42%                 27%          45%
Architecture style    55%                 64%          58%
Overall feelings      15%                 58%          40%

Share and Cite

MDPI and ACS Style

Belaroussi, R.; Martín-Gutierrez, J. Architectural Ambiance: ChatGPT Versus Human Perception. Electronics 2025, 14, 2184. https://doi.org/10.3390/electronics14112184
