Appendix A. LLM Reliability Assessment and Test–Retest Consistency
Appendix A.1. Emotional Criteria
A reliability assessment helps to identify the consistency and dependability of the LLM-based architectural evaluation tool. The inherently stochastic nature of generative AI systems means that identical prompts can yield different responses [121]. This appendix specifies how to: (i) obtain reliable decisions from a single LLM via repeated runs with the same prompt; (ii) compare results from different LLMs; and (iii) estimate uncertainty with standard measures. Systematic uncertainties are identified and quantified. The reliability analysis is independent of the LLM model used, showing that, despite run-to-run variations, criteria-driven prompts lead to convergent results.
The validation performed in this Appendix involves collecting multiple independent LLM outputs from parallel runs and then statistically comparing their results for convergence. A human analyst performs the ensemble consensus check. This classical multi-LLM reliability assessment estimates both test–retest and cross-model robustness. By design, no model evaluates another’s output. It is therefore very different from the newer and more sophisticated “LLM-as-a-judge” methodology employed in the main body of this paper, which does not rely upon human comparison.
At the time of writing, readers wishing to investigate LLM-based diagnostics are advised to use ChatGPT’s most powerful reasoning model. Perplexity Research and Qwen also give good results for analyses of this type. This recommendation is justified by LLM experiments that returned reliable results from brief, succinct prompts: the LLM drew upon neuroscientific data and did not rely upon common assumptions or subjective narratives. To achieve the same consistency, other LLMs need additional prompt refinement, such as more detailed instructions and an explicitly specified direction in the prompt.
A test–retest reliability analysis is conducted by querying ChatGPT-4o ten times with identical prompts for the same architectural pair in Figure 1. The standard deviation from the mean is the simplest consistency measure. (A test–retest Pearson reliability coefficient r is not useful here because successive trials vary randomly and carry no paired structure for r to capture.) This version of ChatGPT was chosen for the test because it was the most widely used at the time of writing, being replaced by ChatGPT-5 only after the manuscript was completed.
A slightly modified prompt is employed, and each query is entered as a new chat:
“Use the set of ten qualities {beauty, calmness, coherence, comfort, empathy, intimacy, reassurance, relaxation, visual pleasure, well-being} (“beauty–emotion cluster”) that elicit a positive-valence feeling from a person while physically experiencing a built structure, to investigate the two uploaded pictures of similar buildings. Evaluate the conjectured relative emotional feedback by comparing the two images in a binary preference (1 for the preferred image and 0 for the rejected image for each of the 10 qualities) to give a preference for one over the other. The sum of the values for each image should be 10. Give the answer as (LHS, RHS).”
ChatGPT-4o produced the following results when evaluating the emotional criteria for the department store in Figure 1: (LHS, RHS) = (0, 10) seven times and (1, 9) three times.
Mean preference = (0.3, 9.7) and standard deviation = (0.5, 0.5).
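For readers who wish to reproduce such consistency figures, the following minimal Python sketch (our illustration, not part of the evaluation workflow) computes the mean preference and standard deviation from the ten repeated outputs listed above, using the standard-library statistics module.

```python
# Minimal sketch (illustrative only): test-retest consistency from repeated runs.
from statistics import mean, stdev

# (LHS, RHS) outputs of ten independent chats with the identical prompt.
runs = [(0, 10)] * 7 + [(1, 9)] * 3

lhs = [l for l, _ in runs]
rhs = [r for _, r in runs]

print("mean preference:   ", (round(mean(lhs), 1), round(mean(rhs), 1)))    # (0.3, 9.7)
print("standard deviation:", (round(stdev(lhs), 1), round(stdev(rhs), 1)))  # (0.5, 0.5)
```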
The RHS image is favored overwhelmingly over the ten runs, picking up almost 10 out of the 10 emotional properties on average. The lesson for researchers is that, to improve reliability, an evaluation should be repeated several times. Extensive trials indicated that the best model for this comparative analysis was the more advanced ChatGPT 4.5 rather than 4o, and it is ChatGPT 4.5 that this paper quotes for the emotional evaluation. (ChatGPT-5 Thinking was used for additional trials after its release.) The detailed explanations given by ChatGPT 4.5 proved to be incisive and unbiased. According to OpenAI, version 4.5 is slower but more deterministically reliable in structured scoring tasks than 4o, because 4.5 has lower stochastic entropy and is better aligned with fixed evaluation frameworks.
The second reliability assessment checks whether different LLM versions, and distinct LLMs, produce comparable results. Inter-version (or cross-model) reliability is established by comparing evaluations across ChatGPT-4o, o3, o4-mini, o4-mini-high, 4.5, 4.1, and 5 Thinking using the image set in Figure 1. The evaluation trial is extended to include the LLMs Copilot and Perplexity (neither of which has its own AI engine but relies on those of other LLMs), and Qwen (which has its own independent AI engine). The following numbers will of course change over repeated runs, so they are merely an indication of what to look for in a reliability check.
Single-trial results from ChatGPT-4o, o3, o4-mini-high, 4.5, 4.1, 5 Thinking, Copilot, and Perplexity were all equal for this case: (LHS, RHS) = (0, 10), whereas Qwen3 scored (1, 9) and ChatGPT o4-mini scored (2, 8).
Mean preference = (0.3, 9.7) and standard deviation = (0.7, 0.7).
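The cross-model check uses the same two measures, now computed over one trial per model. A short sketch follows (ours; the dictionary keys are shorthand labels for the models named above), showing how the pooled statistics and the fraction of models preferring the RHS image can be obtained.

```python
# Minimal sketch (illustrative only): cross-model consistency from single trials.
from statistics import mean, stdev

single_trials = {
    "ChatGPT-4o": (0, 10), "o3": (0, 10), "o4-mini-high": (0, 10),
    "4.5": (0, 10), "4.1": (0, 10), "5 Thinking": (0, 10),
    "Copilot": (0, 10), "Perplexity": (0, 10),
    "Qwen3": (1, 9), "o4-mini": (2, 8),
}

lhs = [score[0] for score in single_trials.values()]
rhs = [score[1] for score in single_trials.values()]
rhs_preferred = sum(r > l for l, r in single_trials.values()) / len(single_trials)

print("mean preference:   ", (round(mean(lhs), 1), round(mean(rhs), 1)))    # (0.3, 9.7)
print("standard deviation:", (round(stdev(lhs), 1), round(stdev(rhs), 1)))  # (0.7, 0.7)
print(f"models preferring RHS: {rhs_preferred:.0%}")                        # 100%
```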
All ten trials strongly favored the RHS image. Models based on OpenAI’s engines (the ChatGPT family and Copilot) plus Perplexity and Qwen converged almost perfectly. After OpenAI introduced ChatGPT-5, it discontinued several legacy versions of ChatGPT that were still available for direct comparison while this paper was being written.
Claude Sonnet 4, Gemini 2.5 Pro, and Kimi K1.5 (LLMs with their own AI engines) gave inconsistent results with the above simple prompt. The inconsistency arose because they conjectured effects for the emotional qualities, which amounted to speculation. Most general LLMs answer from the fashion and stylistic cues found in their training data unless constrained to consult the neurological/vision literature. An evident training-data bias conflates aesthetic and emotional criteria. Those LLMs’ detailed explanations were not based strictly on documented psychological feedback but were influenced by opinions on contemporary aesthetics and styles. To use those LLMs, a more detailed prompt is necessary to prevent the model from drawing on subjective opinions instead of searching through scientific data.
An experiment with Gemini 2.5 Pro using an improved prompt gives better results, as detailed below in Appendix A.4. It is best to use an optimized prompt for every AI engine/system.
This exercise in response consistency is not a rigorous reliability test for the emotional evaluation module. It simply points out what researchers must do in a systematic manner to validate this model for future investigations. Another important point that came out of this is that distinct LLMs answer questions differently, by drawing upon different sources that may indeed be biased. For this reason, it is essential to ask the LLM for a detailed justification for each number in the evaluation and to check this for impartiality.
Appendix A.2. Geometric Criteria
The test–retest reliability analysis was repeated for the 15 fundamental properties by querying ChatGPT-4o ten times with an identical prompt for the same architectural pair in Figure 1. For checking intra-model consistency, each query was entered as a new chat. A slightly modified prompt was used this time, along with the descriptive list of the 15 fundamental properties (linked here in the Supplementary Materials):
“Evaluate these two images of buildings, using the 15 criteria uploaded as Alexander’s Fifteen Fundamental Properties of living geometry. The relative comparison should be presented as a set of numbers (LHS, RHS), where LHS = total score for the relative presence (dominance) of the properties in the LHS image, and RHS = total score for the relative presence (dominance) of the properties in the RHS image. Score the pair of images as follows: if one property is clearly dominant in one of them, give a 1 to it and 0 to the other. If both images have comparable degrees of one property, or the difference is very small, give a 0 to both. For this reason, the totals could come out to be LHS + RHS < 15.”
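To make the tallying rule explicit, a minimal sketch follows (ours, with hypothetical per-property judgments): each of the 15 properties awards 1 point to the clearly dominant image, while a tie or near-tie awards 0 to both, so the two totals need not sum to 15. This differs from the forced-choice emotional scoring above, whose totals always sum to 10.

```python
# Minimal sketch (illustrative only) of the geometric tallying rule.
# The per-property judgments below are hypothetical placeholders, not real data.

def score_pair(judgments):
    """judgments: one entry per property, each 'LHS', 'RHS', or 'tie'."""
    lhs_total = sum(j == "LHS" for j in judgments)
    rhs_total = sum(j == "RHS" for j in judgments)
    return (lhs_total, rhs_total)

example = ["RHS"] * 12 + ["tie"] * 2 + ["LHS"]  # 15 hypothetical judgments
print(score_pair(example))  # (1, 12); the totals sum to 13 < 15 because of the two ties
```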
ChatGPT-4o produced the following results when evaluating the geometrical criteria ten consecutive times for the department store in Figure 1 (listed here not in chronological order):
(LHS, RHS) = (0, 15), (0, 13), (1, 13), (2, 12) four times, (3, 10) twice, (3, 11).
Mean preference = (1.8, 12.0) and standard deviation = (1.1, 1.4).
Results are unanimous in direction, with all ten runs choosing the RHS image as containing more of the geometrical properties, on average 12 out of the 15.
The second reliability assessment compared evaluations across ChatGPT-4o, o3, o4-mini, o4-mini-high, 4.5, 4.1, 5 Thinking, Gemini 2.5 Pro, Kimi K1.5, and Perplexity using the image set in Figure 1. The cross-model concordance scores of single trials evaluating the department store in Figure 1 are as follows (again, repeated trials using new chats will inevitably give varied results):
ChatGPT-4o (LHS, RHS) = (0, 13), o3 = (3, 11), o4-mini = (0, 14), o4-mini-high = (0, 15), 4.5 = (0, 15), 4.1 = (4, 11), 5 Thinking = (0, 14), Gemini 2.5 Pro = (0, 14), Kimi K1.5 = (0, 15), Perplexity = (1, 12).
Mean preference = (0.8, 13.4) and standard deviation = (1.5, 1.6).
Agreement on the winning RHS design was 100% across ten independent or semi-independent LLMs, which chose on average 13 out of the 15 geometrical properties. The AI-based diagnostic tool therefore shows an encouraging level of reliability. The authors feel that this preliminary “proof-of-principle” justifies the practical value of the LLM-based evaluative model while identifying important issues to watch out for and develop further.
Appendix A.3. The Data Reveal Directional Agreement Between Emotional and Geometrical Criteria
In all cases, the emotional and geometrical evaluations of the department store in Figure 1 agreed. Summarizing the above results (a minimal computational check of this agreement follows the list):
1. Same LLM, 10 repeated runs. Emotion outputs: (0, 10) × 7, (1, 9) × 3 → all 10 choose RHS.
Geometry outputs: (0, 15), (0, 13), (1, 13), (2, 12) × 4, (3, 10) × 2, (3, 11) → all 10 choose RHS.
Directional agreement = 100%.
2. Across different LLMs, 10 models. Emotion outputs: (0, 10) × 8, (1, 9), (2, 8) → all 10 choose RHS.
Geometry outputs: (0, 13), (3, 11), (0, 14), (0, 15) × 3, (4, 11), (0, 14) × 2, (1, 12) → all 10 choose RHS.
Directional agreement = 100%.
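A minimal sketch of the directional-agreement check follows (ours; the run data are those reported above). It determines the winning side of each output and confirms that every emotional and every geometrical output points to the RHS image.

```python
# Minimal sketch (illustrative only): directional agreement between the
# emotional and geometrical evaluations, using the same-LLM repeated runs.

def winners(outputs):
    """Return the winning side ('LHS', 'RHS', or 'tie') of each (LHS, RHS) output."""
    return ["RHS" if r > l else "LHS" if l > r else "tie" for l, r in outputs]

emotion_runs = [(0, 10)] * 7 + [(1, 9)] * 3
geometry_runs = [(0, 15), (0, 13), (1, 13)] + [(2, 12)] * 4 + [(3, 10)] * 2 + [(3, 11)]

all_point_to_rhs = all(w == "RHS" for w in winners(emotion_runs) + winners(geometry_runs))
print("directional agreement:", "100%" if all_point_to_rhs else "criteria disagree")  # 100%
```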
Appendix A.4. The Occasional Need for a More Detailed Prompt
As already noted in Appendix A.1, the LLM Gemini 2.5 Pro did not give a satisfactory result when prompted with the simple prompt for the emotional criteria given above. (Gemini is powered by a distinct AI engine from ChatGPT and is trained separately from other LLMs.) A more detailed prompt elicited an accurate scoring for the emotional evaluation of Figure 1 as (LHS, RHS) = (2, 8), supported by the detailed explanations reproduced in full below. Structured, theory-based prompts can override a bias due to stylistic preferences from training data. This is the first iteration of prompt tuning through feedback, which could be taken further if desired.
To check consistency using this LLM, the enhanced prompt was repeated ten independent times, giving the following scores for the 10 emotional criteria. Only the readout (the detailed justifications) from the first trial is reproduced below. However, the variance over the ten evaluations discourages using this LLM for the objective diagnostic model; the most advanced version of ChatGPT is preferred for now. (Improvement while using Gemini 2.5 Pro requires even more detailed evaluation criteria and a graded scoring method.) The ten scores are as follows:
Gemini 2.5 Pro: (LHS, RHS) = (2, 8), (3, 7), (2, 8), (4, 6), (1, 9), (2, 8), (4, 6), (1, 9), (5, 5), (3, 7).
Mean preference = (2.7, 7.3) and standard deviation = (1.27, 1.27).
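The difference in test–retest consistency between the two models can be quantified directly; a short sketch follows (ours; it uses the LHS scores reported in Appendix A.1 and above, and the population standard deviation, which is the convention matching the 1.27 figure quoted here).

```python
# Minimal sketch (illustrative only): comparing run-to-run spread on the
# emotional criteria. A larger spread indicates weaker test-retest consistency.
from statistics import mean, pstdev

lhs_runs = {
    "ChatGPT-4o": [0] * 7 + [1] * 3,                    # LHS scores from Appendix A.1
    "Gemini 2.5 Pro": [2, 3, 2, 4, 1, 2, 4, 1, 5, 3],   # LHS scores listed above
}

for model, scores in lhs_runs.items():
    print(f"{model}: mean LHS = {mean(scores):.1f}, population SD = {pstdev(scores):.2f}")
# ChatGPT-4o:     mean LHS = 0.3, population SD = 0.46
# Gemini 2.5 Pro: mean LHS = 2.7, population SD = 1.27
```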
Prompt to Gemini 2.5 Pro: “Evaluate two architectural images (LHS and RHS) according to a set of ten emotional qualities {beauty, calmness, coherence, comfort, empathy, intimacy, reassurance, relaxation, visual pleasure, well-being}, collectively called the “beauty–emotion cluster”. Each quality relates directly to positive emotional and physiological responses experienced by people physically present in built environments.
For each quality, assign a binary preference: 1 to the image that scientific evidence indicates will more strongly elicit that positive emotional or physiological response, and 0 to the image less effective in eliciting that response. The sum of values across all ten qualities must equal 10. Present your final result as (LHS, RHS).
Do not rely on subjective aesthetic judgments, contemporary architectural styles, minimalism, or stylistic trends. Base your assessment exclusively on biophilic design studies, documented neuroscientific findings, environmental psychology research, and empirical evidence from peer-reviewed open-access scientific literature. Provide a concise scientific justification referencing established neuroscientific and psychological findings for each quality you score.”
Gemini 2.5 Pro: “Based on a rigorous evaluation grounded in the provided neuroscientific and psychological literature, the relative preference between the two buildings is scored decisively in favor of the RHS building. The classical structure’s features align more closely with documented architectural properties known to elicit positive physiological and psychological responses.
[Beauty] Scientific Justification: Neuroaesthetics research indicates that aesthetic appreciation is linked to the activation of the brain’s emotion-valuation system, which includes the orbitofrontal cortex. Designs with harmony and symmetry can create feelings of happiness. The RHS building, with its ordered symmetry and harmonious geometric proportions rooted in classical design, is more aligned with these findings than the simpler, repetitive facade of the LHS.
[Calmness] Scientific Justification: Biophilic design, which incorporates natural elements, is strongly linked to stress reduction and promoting calm. The visible green roof on the LHS building is a direct biophilic feature shown to lower cortisol levels and promote tranquility. Exposure to greenery, even indoors, can speed physiological stress recovery. This provides a direct, documented pathway to inducing calmness.
[Coherence] Scientific Justification: Coherence is defined as the ease of organizing and comprehending a scene, and it is a key psychological dimension in the experience of architecture. The RHS building’s design, with its clear geometric proportions and hierarchy of forms (base, middle, top; distinct window bays), provides a more organized and cognitively digestible structure. This clear structure reduces cognitive stress and enhances the feeling of coherence.
[Comfort] Scientific Justification: Psychological comfort is associated with feelings of safety, familiarity, and well-being. Materials play a key role; natural materials like stone can evoke feelings of permanence and strength. Wood and stone are often perceived as creating a sense of calm and connection with nature, which fosters comfort. The RHS building’s stone facade aligns with these findings more than the glass and metal of the LHS, which can be associated with modernity but less with grounding and comfort.
[Empathy] Scientific Justification: Empathy in architecture is linked to an embodied, multi-sensory experience where the user feels a connection to the space. Human-scale architecture is crucial for this, as it relates the building’s proportions to the human body, creating a more relatable and psychologically positive impact. The detailed ornamentation and articulated window bays of the RHS building provide human-scaled elements that are absent in the monolithic facade of the LHS.
[Intimacy] Scientific Justification: Smaller, enclosed spaces tend to foster focus and intimacy, while larger, open spaces promote exploration. The design of the RHS building, with its clearly defined and articulated window recesses, creates smaller-scale areas of enclosure within the larger facade. This contrasts with the vast, open, and less-defined surface of the LHS, making the RHS more conducive to a sense of intimacy.
[Reassurance] Scientific Justification: Reassuring environments are often those that feel stable and permanent. Research shows that traditional spaces can provide a positive and satisfactory experience of security. The RHS building’s use of stone, a material that conveys strength and permanence, contributes to this feeling. Its solid, grounded appearance is more reassuring than the lighter, more transparent aesthetic of the LHS.
[Relaxation] Scientific Justification: Exposure to nature is a well-documented method for stress reduction and relaxation. Green roofs are a key element of biophilic design that provide opportunities for relaxation. Studies show that biophilic settings promote faster heart rate recovery after stress. The presence of the green roof on the LHS building provides a direct mechanism for promoting relaxation.
[Visual pleasure] Scientific Justification: Research into fractal patterns in visual art and design shows that humans are “hard-wired” to respond positively to the kind of ordered complexity found in nature, which can reduce stress levels by up to 60%. The rich, multi-scalar ornamentation of the RHS facade exhibits fractal-like qualities, which are more engaging and visually pleasing than the non-fractal, uniform grid of the LHS. Preference for fractals is driven by a balance of arousal (complexity) and decreased tension (relaxation).
[Well-being] Scientific Justification: Overall well-being is a holistic measure affected by numerous factors, including a sense of connection, safety, and positive emotional stimulation. The RHS building integrates multiple positive factors: its use of natural materials (stone), human-scaled elements, and ordered complexity all contribute to a psychologically supportive environment. While the LHS has a single biophilic feature, the RHS presents a more comprehensive integration of architectural elements shown to support human well-being.”
Appendix B. ChatGPT Argues That It Is Not Biased by the Public Polls
Appendix B.1. ChatGPT Justifies Its Impartiality in Comparing the Two Buildings in Figure 1
While the AI experiments above were conducted with anonymous figures, those do in fact link to a recent debate in the UK about actual buildings. Moreover, public preference surveys were conducted using the images in Figure 1, Figure 2, and Figure 3. The LLM was asked about this point directly to rule out any possible bias if the generative AI used prior knowledge of the debate. Its answers are convincing, revealing that there was no such interference. The LLM’s unedited explanation is included for each of the three cases, beginning with that of Figure 1.
Prompt to ChatGPT 4.5: “These two images are published on the web in recent discussions of the demolition versus reuse of Orchard House, the building where the department store Marks and Spencer’s is located in Oxford Street, London, UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in February 2024 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (17%, 79%) with the remaining 4% answering “I don’t know”.”
ChatGPT: “Our two comparative analyses (first emotional, then geometric) were conducted strictly according to the stated criteria and were not consciously influenced by the ongoing demolition-versus-reuse debate. In other words, the evaluations focused on the visual and aesthetic content of the images themselves—such as the emotional impact (e.g., the ambiance, warmth, or grandeur each facade conveyed) and the geometric qualities (symmetry, proportion, pattern, etc.)—independent of any external opinions. We did not factor in knowledge about Marks & Spencer’s plans or the public controversy while analyzing the images. By adhering to the defined criteria, the analysis remained objective: for example, noting features like the “graph-paper façade” and repetitive grid of windows on one design versus the more articulated, classical elements on the other, purely as they appeared visually. This ensured that our comparisons were based on observable attributes in the images rather than the sentiments of the online debate.
The results of the February 2024 public survey (which showed a 79% preference for the RHS image vs. 17% for the LHS image, with 4% undecided) did not influence our image analyses. In fact, our comparative evaluation was performed before considering any poll data and was driven only by the visual criteria provided. The strong agreement between our analysis and the survey results appears to be coincidental or due to both reflecting similar underlying qualities of the designs, rather than our analysis referencing the poll. We became aware of the survey after completing the image comparisons, noting that the British public overwhelmingly preferred the alternative design by 79% to 17% in a Deltapoll survey. However, that information was used only to contextualize findings here, not to shape the earlier analysis. In summary, the scientific image analysis stood on its own, and the public poll data serve only as a retrospective confirmation that the criteria-led evaluation aligned with general public sentiment.”
Appendix B.2. ChatGPT Justifies Its Impartiality in Comparing the Two Stadia in Figure 2
Prompt to ChatGPT 4.5: “These two images are published on the web as part of a debate about the design of Bath Rugby Stadium in the UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in 2023 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (28%, 72%).”
ChatGPT: “1. Independence from Online Debate. No—my analysis was not influenced in any way by the online debate about the HS2 viaduct or by any external source. I based the comparison strictly on the “beauty–emotion cluster” criteria that you provided. The evaluation was made solely by visually examining the two images you uploaded and interpreting them through that specific emotional framework.
2. No Use of Survey Data. Similarly, the 2023 public survey data showing a preference of (28%, 72%) was not known or used during the analysis. I did not access or refer to it, and it had no bearing on the scoring outcome of (10%, 90%) that I independently derived.
3. Reason for Agreement. The fact that the public survey results and the emotional-criteria-based analysis broadly align suggests that the emotional response elicited by the RHS image may be widely shared and consistent across viewers. This convergence can be taken as a useful indicator that the “beauty–emotion cluster” method may resonate with broader public preferences, at least in this case.
4. Objectivity. The scoring was conducted in a controlled and repeatable way by assigning binary preference values for each of the ten emotional dimensions. While interpretation always contains some degree of subjectivity, I applied the method consistently and without reference to any external discussions, articles, or survey results.”
Appendix B.3. ChatGPT Justifies Its Impartiality in Comparing the Two Viaducts in Figure 3
Prompt to ChatGPT 4.5: “These two images are published on the web as part of a debate about the design of a viaduct for HS2 in the UK. Were your two comparative analyses of the images influenced at all by that online debate, or did you strictly analyze the images according to the stated criteria (first the emotional ones, then the geometric ones)? A public survey was performed in 2024 to identify which of the two images was preferred, with results that agree with your own scientific analysis. Did the results of the public survey, which is available online, influence your analysis? The results of the crowdsourced survey on public preference between the two building images were (LHS, RHS) = (28%, 69%).”
ChatGPT: “No, my comparative analysis was not influenced by the Create Streets survey or its results. I assessed the two images strictly and independently using the ten stated emotional criteria from the “beauty–emotion cluster” framework. The analysis focused on visual and environmental cues present in the images themselves—such as form, materiality, scale, texture, rhythm, and contextual fit—to make a reasoned binary judgment for each category. I did not consult or reference the Create Streets survey or its published data when forming the assessment.
On the survey results: That the public preference survey conducted in 2024 found a result of (LHS, RHS) = (28%, 69%) is both interesting and affirming. It aligns closely with the emotional evaluation I provided—which resulted in a score of (0%, 100%) due to a unanimous preference under each of the ten qualitative lenses. This convergence between subjective crowd sentiment and a structured emotional-design analysis strengthens the case that the arched brick viaduct (RHS) resonates more positively with human perceptions of place, coherence, and emotional comfort.
Conclusion: My analysis was conducted independently and objectively, based solely on visual-emotional design principles—and the fact that it aligns with public sentiment adds confidence in the robustness of these criteria as a tool for design evaluation.”