1. Introduction
The Architecture, Engineering, and Construction (AEC) industry is undergoing a digital transformation, with a growing emphasis on automation to enhance efficiency and performance across the building lifecycle. This trend is particularly critical for complex environments like Urban Underground Spaces (UUS), a vital strategy for sustainable urban growth where enclosed and dynamic conditions pose distinct challenges to user well-being and operational efficiency [1,2,3]. In this context, Post-Occupancy Evaluation (POE) serves as a crucial user-centered approach for assessing how people interact with these spaces, identifying environmental stressors and psycho-physiological responses to inform better design [4]. However, while significant progress has been made in automating design (e.g., Building Information Modeling (BIM)) and construction (e.g., robotics), the automation of POE remains a critical challenge. This gap prevents the creation of a closed-loop feedback system where operational data can dynamically inform future designs and real-time building control. The limitation is especially pronounced in complex underground environments, where it hinders the development of truly adaptive and health-oriented building systems [5].
Traditional POE, while valuable, exemplifies this gap. Its methodologies, typically relying on static surveys and discrete observations, are labor-intensive and cannot be integrated into the continuous, automated workflows of modern building systems. This static approach fails to capture the evolving interplay between environmental conditions and user responses (such as stress variations due to shifting pedestrian density) and overlooks the cumulative effect of emotions [6]. Consequently, the insights derived are often delayed and too generalized to support the fine-grained, dynamic adjustments required by intelligent building technologies.
To bridge this gap, this paper introduces a framework that automates dynamic, human-centric POE. We leverage multimodal sensing and deep learning to create a predictive model that translates complex environmental and behavioral data into actionable performance metrics. This represents a key step towards the automation of building diagnostics and the creation of truly adaptive built environments. By predicting Galvanic Skin Response (GSR) with second-by-second precision along dynamic spatial paths, our model establishes an automated mapping between the objective environment and physiological responses. Importantly, while GSR fundamentally measures general autonomic arousal rather than stress specifically, within the confined and often adverse context of UUS, such elevated arousal is predominantly associated with negative spatial experiences. Therefore, this study explicitly operationalizes these high-arousal states as “environmental stress”. In doing so, our model reveals spatial–temporal changes in user stress and investigates the cumulative effect of emotions in underground environments.
We present three key contributions: (1) Data: We introduce a pioneering multi-modal dataset for UUS, encompassing three functional typologies (transit, commercial, and office). The dataset includes six environmental parameters (temperature, humidity, noise, VOC, PM2.5, and foot traffic) and 45 h of high-resolution, spatially and temporally aligned data collected via wearable sensors, environmental monitors, and behavioral logs. The dataset provides a foundational resource for training automated environmental perception models. (2) Model: We develop a dual-branch deep learning model for GSR prediction. The first branch, based on the ResNet-18 architecture, extracts spatial and semantic features from visual data. The second branch captures temporal dynamics through two parallel components: a Bidirectional LSTM (Bi-LSTM) for modeling visual sequence patterns, and a 1D Convolutional Neural Network (1D-CNN) for analyzing trends in environmental sensor data. (3) Application: Using the trained model, we infer GSR from synchronized visual and environmental data, enabling estimation of users’ stress levels along dynamic spatial trajectories. These predictions allow us to quantify the cumulative effect of emotions in underground environments without relying on direct physiological measurements. The framework automates building diagnostics by mapping objective environmental conditions to subjective stress responses, thereby enabling evidence-based design interventions aimed at enhancing user comfort.
This study develops a human-centric, automated performance evaluation framework that spans the entire building lifecycle. Before detailing these lifecycle applications, it is crucial to distinguish between our immediate modeling objective of predicting physiological arousal from multimodal inputs and our longer-term engineering objective of making these predictions actionable for building management. At the modeling level, the framework first achieves the Automation of the Evaluation Process by transforming traditionally labor-intensive POE, which relies on manual surveys and interviews, into an automated data acquisition, fusion, and analysis pipeline. Furthermore, it serves as a proof-of-concept for Automation in Design Validation, enabling the rapid, quantitative assessment of the potential psycho-physiological impacts of various design schemes at the early design stage. Beyond these immediate predictive capabilities, our longer-term engineering objective conceptualizes how this framework can be utilized during the operational phase. While we do not claim immediate operational readiness for real-time control, we envision that our model can eventually be integrated as a “soft sensor” into Building Management Systems (BMS) to support Intelligent and Automated Building Operations through adaptive environmental control. Ultimately, this research facilitates the Enrichment of Digital Twins by proposing a critical “dynamic human-factors layer”. This conceptual pathway allows future digital twins to simulate and predict complex human-building interactions, laying the groundwork for advanced automated facility management.
To systematically address these objectives, this study commits to a single, testable primary research question: To what extent can a multimodal deep learning framework reliably predict continuous, second-by-second (1 Hz) human physiological arousal—operationalized as environmental stress—using objective visual and environmental data collected in naturalistic, field-based UUS? Complementing this, our secondary research question investigates the model’s generalization capacity: How robustly does this predictive model transfer across different spatial typologies within the UUS, and to what extent can it generalize to unseen individuals and novel sites not represented in the training data? By explicitly answering these questions, this study clarifies the conditions under which data-driven human-centric POE can be reliably deployed.
The remainder of this paper is organized as follows. Section 2 introduces the related research. Section 3 describes the methodology, including the dataset and model. Section 4 selects a typical case for the dynamic analysis of the model’s prediction results. Section 5 provides a discussion. Section 6 presents practical recommendations based on the findings. Section 7 outlines the study’s limitations and future work. Finally, Section 8 presents the conclusions.
2. Related Research Works
This review is structured to systematically build the rationale for our proposed methodology. First, we position our work within the broader context of Automation in Construction, examining the recent advancements and remaining gaps in automating POE (Section 2.1). We then narrow our focus to the specific challenges of evaluating User Experience in UUS, dissecting the traditional assessment methods, from subjective evaluations to objective physiological metrics, and analyzing the key environmental and spatial factors that influence user perception (Section 2.2). Finally, we synthesize these two perspectives to identify a critical research gap and establish the necessity for an automated, data-driven predictive model capable of mapping dynamic environmental inputs to continuous physiological responses (Section 2.3).
2.1. Automating POE in the Construction Lifecycle
While automation in the AEC industry has advanced design and construction, human-centric performance evaluation remains a challenge. POE is a critical but traditionally manual, static process, creating a digital gap in the automated building lifecycle. Recent advancements aim to transform POE into a proactive, automated tool.
One stream focuses on predictive automation through Pre-Occupancy Evaluation (PrOE), leveraging simulation to forecast user behavior [7]. This includes using BIM to evaluate layouts against user requirements [8], agent-based modeling (ABM) to simulate complex social patterns [9], and Virtual Reality (VR) to predict spatial cognition [10,11,12]. Automated modeling of user space preferences also optimizes spatial efficiency from the design phase [13].
A second stream concentrates on data-driven automation of post-occupancy analysis. Wireless Sensor Networks provide an automated, non-disruptive method for collecting performance data for system diagnostics [14]. Advanced systems employ computer vision and deep learning to automate real-time occupancy sensing and human motion analysis, offering objective data that validates POE findings and can automate environmental control [15].
Despite these advances, a significant gap remains. Current methods automate either predictive simulations or objective data collection. An integrated framework that can autonomously map dynamic, objective environmental inputs to the continuous, subjective psycho-physiological responses of users is still lacking. Bridging this gap is the next frontier for building evaluation, requiring a system that not only observes but also understands and predicts the real-time human experience.
2.2. POE in UUS
2.2.1. Subjective Evaluation for UUS
Questionnaire-based techniques have been applied to explore user perceptions in various UUS [16,17,18]. However, the pursuit of more continuous data has led to the integration of biosensing technologies in environmental research. Wearable physiological sensors can capture real-time bodily responses, offering a more direct reflection of a user’s condition. While these physiological signals are objectively measurable, they reflect subjective internal emotions. For instance, electroencephalography (EEG) has been used to investigate the neural mechanisms of thermal adaptation [19] and to quantify the cognitive and comfort impacts of underground office environments [20]. This shift toward physiological metrics signifies a move from static feedback to dynamic assessment of the human-environment interaction.
This shift toward physiological metrics is heavily informed by foundational advancements in affective computing and recent progress in continuous physiological sensing. Affective computing, initially conceptualized by Picard [21], seeks to systematically recognize, interpret, and process human emotional states through robust computational models. In the context of the built environment and urban studies, emerging stress detection studies leverage wearable biosensors to continuously monitor the autonomic nervous system [22]. This provides a real-time, objective window into how humans process spatial and environmental stimuli, moving beyond discrete self-reports to capture transient emotional fluctuations in actual architectural spaces [23]. Among various physiological metrics, GSR has become a key indicator for evaluating emotional and stress responses to environmental stimuli. GSR measures fluctuations in the electrical conductivity of the skin, which is directly influenced by sweat gland activity tied to the sympathetic nervous system [24]. Consequently, it can reliably reflect an individual’s physiological arousal and stress levels (such as anxiety or tension) in real time.
GSR is particularly well-suited for dynamic environmental assessment due to several advantages over other metrics like heart rate variability (HRV) or EEG. It is non-invasive, can capture transient stress fluctuations with high sensitivity, and its signal amplitude directly correlates with physiological arousal. Furthermore, the portability of GSR sensors allows for data collection in authentic, real-world settings, overcoming the ecological validity limitations of laboratory simulations. The validity of GSR in quantifying the emotional impact of an environment has been corroborated in field experiments, where it has shown a significant correlation with the perceived quality of a space, thereby providing an objective physiological basis for design evaluation [12,25]. This makes GSR a powerful tool for understanding users’ subconscious psycho-physiological reactions within complex environments like UUS.
2.2.2. Physical Environmental Factors in UUS
The assessment of the physical environment in UUS adapts principles from established green building evaluation systems (e.g., WELL, LEED, CASBEE, BREEAM, DGNB) [26,27,28,29,30] while emphasizing challenges unique to subterranean settings. Key factors that are particularly critical underground include thermal comfort, air quality, and the acoustic environment. These settings often intensify risks such as pollutant accumulation, limited natural light, and other stressors that impact psychological well-being.
Thermal comfort, governed by temperature and humidity, is a foundational element of user experience in UUS. Imbalances can lead not only to immediate discomfort but also to health risks. Research has evolved beyond static measurements, revealing that thermal perception is a dynamic process influenced by users’ activity levels, physiological adaptation, and neurobehavioral responses [19,31,32]. This highlights the need to evaluate thermal comfort not as a fixed state but as an interactive relationship between the user and the environment.
The atmospheric and acoustic environments are also critical determinants of well-being. Due to enclosed geometries that trap sound energy, noise levels in UUS can be significantly higher than in surface-level buildings, leading to increased acoustic stress and discomfort [33,34,35]. Similarly, air quality directly impacts physiological health [36]. Limited natural ventilation can heighten sensitivity to odors from volatile organic compounds (VOC) [37] and increase exposure to pollutants like PM2.5, which are linked to respiratory and cardiovascular strain [38,39,40].
Beyond these ambient conditions, the spatial–temporal dynamic of pedestrian flow significantly shapes user experience and safety. High-density crowds, particularly during peak hours, can cause physical discomfort and psychological anxiety [41]. While the impacts of these individual physical factors are often confirmed by correlating environmental sensor data with user surveys, such approaches typically assess them in isolation.
Furthermore, integrating these disparate environmental metrics into a holistic assessment framework presents a significant methodological challenge. As established by the OECD/JRC Handbook on Constructing Composite Indicators [42], the aggregation and equal weighting of diverse variables is not a neutral technical choice, but a normative one that implicitly assigns equivalent conceptual importance to distinct environmental stressors. This theoretical complexity highlights the limitations of traditional linear indices, as they often overlook the synergistic, non-linear, and cumulative effects that arise from the dynamic interplay of multiple environmental variables as a user moves through the space.
2.2.3. Spatial Design and Visual Environments
Spatial design is a critical factor in shaping user experience within UUS. The inherently enclosed nature of UUS can trigger feelings of isolation and stress due to a lack of natural stimuli. To counteract this, design strategies focus on enhancing environmental affinity. Key among these are increasing visual permeability to improve spatial cognition [43], integrating green landscapes to simulate nature, and optimizing natural or artificial lighting to support physiological rhythms [44]. Other morphological characteristics, such as cleanliness, color choice, and material selection, also significantly influence user satisfaction [45,46]. These design challenges are rooted in fundamental aspects of human visual perception [47,48].
Emerging computational methods now allow for the objective and scalable quantification of these design attributes, marking a paradigm shift in environmental evaluation [49]. Computer vision (CV) techniques can systematically analyze the visual characteristics of a space across multiple levels [50,51,52,53,54]. This computational approach transforms qualitative design assessment into a dynamic, data-driven analysis, creating a pathway for the multi-modal POE paradigm that our study aims to establish [55].
2.3. Synthesis and a Call for an Automated Predictive Model
Existing research establishes that objective environmental factors and visual features significantly impact subjective emotional states. However, conventional statistical methods in POE are often insufficient for the environmental complexity of UUS. There is a notable absence of effective methods to systematically map the pathway from objective environmental inputs to subjective perceptual outcomes.
With the recent advancements in artificial intelligence, a significant research question has emerged: whether it is feasible to leverage objective observational data to predict perceived emotional changes in underground spaces through data-driven, model-based approaches [56,57]. This approach can not only address the challenge of dynamically mapping objective environmental data to subjective emotional responses but also circumvent the issues associated with directly acquiring physiological data like GSR, such as the high cost of equipment, the error margins between different groups, and the time and effort required for collection.
While emerging studies have successfully applied machine learning (ML) to create comprehensive digital models [58], quantify subjective perceptions such as visual comfort [59] and pedestrian space quality [60], and differentiate user needs across various functional zones [61], a significant methodological limitation persists. As theoretically established by Pratschke et al. in their ecological analysis of urban environments [62], interpreting patterns across different spatial zones requires a rigorous distinction between compositional effects (i.e., the aggregate baseline characteristics of the individuals present) and true spatial-morphological drivers. Traditional applications predominantly rely on static, discrete datasets that often conflate these factors, thereby failing to capture the continuous, dynamic interactions between users and their environment. This gap underscores the need for a dynamic analytical paradigm capable of isolating the genuine structural impacts of the UUS from individual compositional variance.
Therefore, this study aims to develop a deep learning framework that performs time-series prediction of GSR by processing environmental indicators and visual features. Ultimately, this framework provides a tangible path to automate building performance analysis, redefining POE as a dynamic, data-driven feedback loop for the construction lifecycle.
3. Methodology
3.1. Dataset
3.1.1. Experimental Design and Data Collection
This study constructed the first multi-modal dynamic perception dataset for UUS. The data collection was conducted in 30 real-world, operational UUS scenarios in Wuhan, China. As illustrated in Figure 1, these sites included three primary functional types (transportation hubs, commercial complexes, and cultural facilities).
A total of 30 volunteers (15 male, 15 female, aged 22–45) participated in the study. Each participant was equipped with a custom-designed, integrated sensor system and traversed a preset path. This approach ensured the ecological validity of the data by capturing user experiences in authentic environments. In total, the experiment yielded approximately 45 h of synchronized, multi-modal data.
3.1.2. Multi-Modal Data Acquisition
As depicted in Figure 2, we constructed a comprehensive dataset by simultaneously collecting three types of data:
Environmental Data: Six key physical environmental indicators were monitored in real-time: temperature, humidity, noise, VOC, PM2.5, and foot traffic. A customized integrated sensor device recorded these metrics at 15 s intervals. The measurement ranges and precision of the sensors comply with the instrumentation requirements specified in ASHRAE standards for indoor environmental quality (e.g., ASHRAE 55 for thermal conditions) [61].
Visual Data: This study used a DJI Action 5 Pro camera (DJI, Shenzhen, China) to capture omnidirectional visual features of UUS. Volunteers wore the camera on their chests to record first-person perspective videos, with one frame captured per second as real-time visual features of the UUS.
3.1.3. Data Processing and Synchronization
Following data collection, the raw multimodal streams underwent a rigorous preprocessing pipeline. As depicted in Figure 3, the data were organized, standardized, and precisely aligned in both time and space. A critical step in this alignment was synchronizing the three disparate data streams, namely first-person video (1 fps), environmental sensor data (sampled every 15 s), and GSR signals (4 Hz), to a unified 1 Hz temporal resolution. Specifically, the 4 Hz GSR data were downsampled to 1 Hz by calculating the arithmetic mean within each 1 s window, effectively reducing high-frequency noise. For the lower-frequency environmental data, we applied linear interpolation to estimate the second-by-second values. While this interpolation mathematically introduces temporal smoothness, it is physically justified, as ambient environmental parameters in large underground spaces inherently exhibit gradual, continuous variations.
Furthermore, to account for inter-individual differences in baseline physiological states, a subject-level Z-score normalization was applied to the GSR data prior to model training. This temporal and subject-level alignment enables joint training on visual, environmental, and physiological data. For instance, as illustrated by the synchronized points in the figure, an anomaly detected in one data stream (e.g., an acute spike in normalized GSR) can be accurately cross-referenced with corresponding visual and environmental data to identify potential environmental correlates. Finally, we performed a data cleaning procedure to remove abnormal segments caused by sensor failures, video data interruptions exceeding 30 s, or motion artifacts in physiological signals. This comprehensive process resulted in a final, high-quality dataset with an effectiveness rate of 91.2%.
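The three alignment steps described above (mean downsampling of GSR, linear interpolation of environmental readings, and subject-level Z-scoring) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy values, and the use of the population standard deviation are all hypothetical choices.

```python
from statistics import mean, pstdev


def downsample_mean(samples, factor):
    """Average consecutive groups of `factor` samples (4 Hz -> 1 Hz when factor=4)."""
    usable = len(samples) - len(samples) % factor  # drop an incomplete trailing window
    return [mean(samples[i:i + factor]) for i in range(0, usable, factor)]


def interpolate_linear(values, step_s):
    """Linearly interpolate readings taken every `step_s` seconds onto a 1 s grid."""
    out = []
    for left, right in zip(values, values[1:]):
        for k in range(step_s):
            out.append(left + (right - left) * k / step_s)
    out.append(values[-1])  # keep the final reading itself
    return out


def zscore(values):
    """Z-score one subject's series (assumption: population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy data: 2 s of 4 Hz GSR and two temperature readings taken 15 s apart.
gsr_4hz = [1.0, 1.2, 1.4, 1.6, 2.0, 2.0, 2.0, 2.0]
gsr_1hz = downsample_mean(gsr_4hz, 4)
temp_1hz = interpolate_linear([20.0, 23.0], 15)
gsr_norm = zscore(gsr_1hz)
```

In this sketch `zscore` would divide by zero for a constant series, so a real pipeline would need a guard for flat recordings.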
3.2. Multimodal Spatial–Temporal Perception Model
To dynamically map objective environmental inputs to subjective stress responses, Figure 4 outlines our proposed multimodal spatial–temporal fusion deep learning model, which is designed for the high-precision, real-time prediction of GSR. The model employs a dual-branch architecture that processes visual–temporal and environmental data branches in parallel before fusing their features to generate a unified prediction. The core innovation lies in its ability to couple semantic analysis of visual sequences with differential feature extraction from environmental sensor waveforms, enabling a comprehensive understanding of the user’s experience.
The model’s architecture is inspired by the human perceptual process. The ResNet-18 [63] backbone acts as the ‘visual cortex’ to extract spatial features, while the Bi-LSTM [64] models the ‘memory’ of the journey, capturing how past experiences influence current perception. Simultaneously, the 1D-CNN [65] processes ambient environmental data, akin to subconscious sensory inputs. This dual-branch fusion allows for a holistic and neurologically inspired interpretation of the user’s state [66].
3.2.1. Visual–Temporal Feature Branch
This branch is designed to interpret the dynamic visual experience of a user moving through the UUS. It consists of two main components operating in sequence:
Spatial Feature Extraction: We use a pre-trained ResNet-18 network as the backbone for spatial feature extraction. For each frame $t$ in the input video sequence, the ResNet-18 model processes the image and outputs a 128-dimensional feature vector, denoted as $x_t$.
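Temporal Dynamics Modeling: The sequence of frame-level feature vectors ($x_1, x_2, \ldots, x_t$) is then fed into a Bi-LSTM network. The core of the LSTM unit consists of three gates (input $i_t$, forget $f_t$, output $o_t$) that regulate information flow. At each time step $t$, the cell state $c_t$ and hidden state $h_t$ are updated as follows:
$$
\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}; x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $W$ and $b$ are the weight matrices and bias vectors for each gate, and $\odot$ denotes element-wise multiplication. The Bi-LSTM processes the sequence in both forward ($\overrightarrow{h}_t$) and backward ($\overleftarrow{h}_t$) directions, and the final output for the visual branch at time $t$, $h_t^{vis}$, is the concatenation of these two hidden states.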
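To make the gating mechanism concrete, the following sketch traces a scalar LSTM cell and a bidirectional pass in plain Python. It assumes the standard LSTM formulation; the one-dimensional state, the parameter values, and the function names are illustrative assumptions and do not reflect the trained model, whose hidden size is not reported here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params[gate] = (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre("i"))        # input gate
    f_t = sigmoid(pre("f"))        # forget gate
    o_t = sigmoid(pre("o"))        # output gate
    c_tilde = math.tanh(pre("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def run(sequence, params):
    """Unroll the cell over a sequence, starting from zero states."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


def bi_lstm(sequence, fwd_params, bwd_params):
    """Pair forward and backward hidden states per time step (the concatenation)."""
    forward = run(sequence, fwd_params)
    backward = run(sequence[::-1], bwd_params)[::-1]
    return list(zip(forward, backward))


# Hypothetical parameters, chosen only for illustration.
params = {"i": (0.5, 0.1, 0.0), "f": (0.2, 0.3, 1.0),
          "o": (0.4, 0.2, 0.0), "c": (1.0, 0.5, 0.0)}
states = bi_lstm([0.1, 0.9, 0.3], params, params)
```

Each entry of `states` pairs a forward state that has seen only earlier inputs with a backward state that has seen only later ones, which is why the concatenated representation carries context from both directions of the trajectory.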
3.2.2. Environmental Data Processing Branch
Running in parallel to the visual branch, this component processes the time-series data from the environmental sensors. We employ a one-dimensional Convolutional Neural Network (1D-CNN) to analyze the synchronized multi-variate sensor data. This network is adept at capturing local trends and characteristic patterns, outputting a 64-dimensional temporal feature vector, $h_t^{env}$, for each time step.
3.2.3. Feature Fusion and GSR Prediction
In the final stage, the model integrates information from both modalities. For each time step $t$, the high-level features from the visual branch ($h_t^{vis}$) and the environmental branch ($h_t^{env}$) are concatenated to form a unified feature vector, $h_t^{fused}$:
$$
h_t^{fused} = [h_t^{vis}; h_t^{env}]
$$
where [;] denotes the concatenation operation. This fused vector is then passed through a multi-layer perceptron (MLP), which acts as a regression head to generate the final predicted GSR value, $\hat{y}_t$, for that time step:
$$
\hat{y}_t = \mathrm{MLP}(h_t^{fused}) = W_{out} z_t + b_{out}
$$
where $W_{out}$ and $b_{out}$ are the weight and bias of the final linear layer in the MLP, and $z_t$ denotes the activation of its last hidden layer. This architecture generates an independent prediction for each time step, achieving frame-level estimation of the user’s emotional stress.
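The fusion step described above amounts to a concatenation followed by a small regression head. The sketch below shows that data flow with toy dimensions; the single ReLU hidden layer, all weights, and the function names are assumptions made for illustration, since the layer sizes of the MLP are not specified here.

```python
def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def predict_gsr(h_vis, h_env, hidden_w, hidden_b, out_w, out_b):
    """Concatenate both branch features and regress one GSR value."""
    fused = h_vis + h_env                             # list concatenation = [h_vis ; h_env]
    hidden = relu(linear(fused, hidden_w, hidden_b))  # assumed single hidden layer
    return linear(hidden, [out_w], [out_b])[0]        # final linear layer -> scalar


# Toy example: 2 visual features, 1 environmental feature, 2 hidden units.
y_hat = predict_gsr(
    h_vis=[1.0, -1.0],
    h_env=[0.5],
    hidden_w=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    hidden_b=[0.0, 0.0],
    out_w=[2.0, 3.0],
    out_b=0.1,
)
```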
3.3. Implementation and Training Details
3.3.1. Data Preprocessing and Leakage-Free Splitting
To prepare the data for training, we implemented a custom data loader that uses a sliding window mechanism. Each training sample consists of a sequence of 10 consecutive image frames (representing 10 s) and the corresponding synchronized sensor data. Crucially, because this sliding window approach creates a significant overlap between adjacent samples, a simple random split would introduce severe data leakage and artificially inflate model performance. To rigorously prevent this and evaluate true spatial generalization, the dataset was partitioned using a strict site-level split with an allocation ratio of 20:5:5. Specifically, data from 20 distinct sites were assigned to the training set, 5 to the validation set, and 5 to the unseen testing set. This trajectory-independent strategy guarantees that the model learns generalized environment-driven physiological patterns rather than memorizing intra-trajectory autocorrelations.
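A minimal sketch of the windowing and site-level partitioning described above is given below. The stride of one frame, the random seed, the site identifiers, and the function names are assumptions; only the 10-frame window and the 20:5:5 site allocation come from the text.

```python
import random


def sliding_windows(n_frames, window=10, stride=1):
    """Return (start, end) index pairs covering `window` consecutive frames."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]


def split_sites(site_ids, n_train=20, n_val=5, n_test=5, seed=0):
    """Shuffle whole sites, then assign each site to exactly one partition."""
    sites = sorted(set(site_ids))
    if len(sites) != n_train + n_val + n_test:
        raise ValueError("site count does not match the requested allocation")
    random.Random(seed).shuffle(sites)
    return (set(sites[:n_train]),
            set(sites[n_train:n_train + n_val]),
            set(sites[n_train + n_val:]))


# Hypothetical site identifiers for the 30 locations.
sites = [f"site_{i:02d}" for i in range(30)]
train, val, test = split_sites(sites)

# Windows are built per trajectory, after the split, so overlapping windows
# from the same site can never land in two different partitions.
windows = sliding_windows(n_frames=25)
```

Splitting before windowing is the design choice that matters here: if windows were generated first and shuffled, neighbouring windows sharing nine of ten frames could end up on both sides of the split.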
To enhance model robustness, visual data undergoes data augmentation during training, including random cropping, horizontal flipping, minor rotation, and color jittering. During validation and testing, only center cropping and standardization are applied. Environmental sensor data is normalized using the StandardScaler from the scikit-learn library (v1.3.0), with the parameters fitted exclusively on the training set and subsequently applied to the validation and test sets to ensure consistent data distribution without temporal data snooping.
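The train-only fitting rule can be illustrated with a small plain-Python stand-in for a standard scaler. The class name and the toy readings are hypothetical; the sketch uses the population standard deviation, which is also what scikit-learn's StandardScaler does, and is meant only to show where the statistics come from.

```python
from statistics import mean, pstdev


class ColumnScaler:
    """Per-column standardization with statistics learned from one data split."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.means = [mean(c) for c in cols]
        self.stds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
        return self

    def transform(self, rows):
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
                for row in rows]


# Toy readings: columns are temperature and noise level.
train_rows = [[20.0, 50.0], [22.0, 54.0], [24.0, 58.0]]
test_rows = [[26.0, 50.0]]

scaler = ColumnScaler().fit(train_rows)       # statistics come from training data only
train_scaled = scaler.transform(train_rows)
test_scaled = scaler.transform(test_rows)     # reuses the training mean and std
```

Because the test rows are scaled with training statistics, their values are not forced to zero mean and unit variance; a test reading above the training range simply maps to a large positive value.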
3.3.2. Model Training and Optimization
The model is trained end-to-end using a mini-batch gradient descent strategy. We utilize the Mean Squared Error (MSE) as the loss function to quantify the discrepancy between the predicted GSR values ($\hat{y}$) and the ground-truth values ($y$). The loss ($L_{MSE}$) for a sequence of length $T$ is calculated as
$$
L_{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left(\hat{y}_t - y_t\right)^2
$$
For parameter optimization, the AdamW algorithm is employed, supplemented with gradient clipping to ensure training stability. The training process incorporates an early stopping mechanism and a dynamic learning rate scheduler.
3.3.3. Evaluation Metric
The model’s performance was initially evaluated using the Pearson correlation coefficient (r), which measures the linear relationship between the predicted GSR values and the ground-truth measurements. On the completely unseen test set (under the strict site-level split), the optimal model achieved a Pearson correlation coefficient of 0.72. To ensure this metric is robust against within-trajectory autocorrelation, we computed the 95% confidence interval (CI) using a block bootstrap resampling method (block size = 10 frames, 1000 iterations) at the site level, yielding a 95% CI of [0.67, 0.76]. While Pearson’s r demonstrates the model’s capacity to capture relative arousal trends, evaluating its potential integration into threshold-based BMS requires an assessment of absolute predictive error and scale bias. Therefore, we supplemented the correlation analysis with several absolute error metrics based on the Z-score normalized GSR data. The model achieved a Root Mean Square Error (RMSE) of 0.684, a Mean Absolute Error (MAE) of 0.542, and a normalized RMSE (nRMSE) of 12.4%. Together, these metrics support the feasibility of the proposed approach for UUS stress assessment.
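The metrics reported above can be computed as in the following sketch, which also shows one simple form of block bootstrap. It is an illustration under assumptions: the synthetic series, the non-overlapping block scheme, the percentile interval, and the seed are hypothetical, and the sketch resamples within a single sequence rather than reproducing the site-level procedure.

```python
import math
import random


def pearson_r(pred, true):
    n = len(pred)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def block_bootstrap_ci(pred, true, block=10, iters=1000, seed=0):
    """Percentile 95% CI for r, resampling contiguous blocks with replacement."""
    rng = random.Random(seed)
    starts = list(range(0, len(pred) - block + 1, block))
    stats = []
    for _ in range(iters):
        p, t = [], []
        for s in rng.choices(starts, k=len(starts)):
            p.extend(pred[s:s + block])
            t.extend(true[s:s + block])
        stats.append(pearson_r(p, t))
    stats.sort()
    return stats[int(0.025 * iters)], stats[int(0.975 * iters) - 1]


# Synthetic stand-ins for ground-truth and predicted normalized GSR.
true = [math.sin(0.1 * i) for i in range(200)]
pred = [v + 0.3 * math.cos(0.7 * i) for i, v in enumerate(true)]

r_hat = pearson_r(pred, true)
rmse_hat = rmse(pred, true)
mae_hat = mae(pred, true)
ci_low, ci_high = block_bootstrap_ci(pred, true)
```

Resampling whole blocks rather than single frames keeps short-range autocorrelation intact inside each block, which is the reason a block scheme gives wider and more honest intervals than a frame-level bootstrap on time-series data.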
3.3.4. Ablation Study and Comparative Baselines
To rigorously justify the architectural complexity of the proposed multimodal framework, we conducted a comprehensive ablation study and compared our model against simpler statistical baselines. This evaluation was specifically designed to isolate and quantify the contributions of multimodal fusion and temporal sequence modeling. As presented in Table 1, the full multimodal Bi-LSTM framework (r = 0.720, RMSE = 0.684) outperforms all alternative configurations, empirically supporting our architectural design choices.
To establish a traditional machine learning baseline, we trained a Random Forest regressor utilizing hand-engineered semantic and environmental features. Instead of raw images, the visual data for this baseline was processed using a semantic segmentation model to extract the pixel-level proportions of key environmental elements (e.g., pedestrians, walls, ceilings, and lighting) within each frame. These semantic ratios were averaged over each 10 s window and concatenated with the corresponding statistical features (mean and variance) of the environmental metrics. While this baseline leverages robust mid-level semantic information, it achieved markedly lower predictive performance than the full model (r = 0.514 versus r = 0.720). This substantial performance gap suggests that the traditional model lacks the capacity to capture the complex spatial topologies and non-linear spatiotemporal dynamics that our deep learning architecture extracts.
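The hand-engineered feature vector used by this baseline can be sketched as follows. The function name, the toy class proportions, and the use of the population variance are assumptions for illustration; the sketch covers feature construction only, not the segmentation model or the Random Forest itself.

```python
from statistics import mean, pvariance


def window_features(semantic_ratios, env_readings):
    """Build one baseline feature vector for a single window.

    semantic_ratios: per-frame lists of class proportions (e.g. pedestrians, walls)
    env_readings:    per-second lists of environmental metrics
    """
    # Average each semantic class proportion over the frames in the window.
    sem = [mean(col) for col in zip(*semantic_ratios)]
    # For each environmental metric, append its mean and then its variance.
    env = [stat(col) for col in zip(*env_readings) for stat in (mean, pvariance)]
    return sem + env


# Toy window with 2 frames, 2 semantic classes, and 2 environmental metrics.
features = window_features(
    semantic_ratios=[[0.2, 0.8], [0.4, 0.6]],
    env_readings=[[20.0, 60.0], [22.0, 60.0]],
)
```

A real 10 s window would contain ten frames and six environmental metrics, giving one averaged ratio per semantic class plus twelve environmental statistics.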
Furthermore, we conducted modality ablations to evaluate the necessity of multimodal data fusion. Relying solely on the Vision-only branch (r = 0.582) or the Environment-only branch (r = 0.375) resulted in considerable performance degradation compared to the fused model. The higher correlation of the visual branch relative to the environmental branch suggests that immediate spatial characteristics and visual crowdedness exert a more rapid influence on physiological arousal. However, the optimal performance of the combined framework confirms that physiological stress in UUS is a multifaceted response driven by both explicit visual stimuli and invisible ambient factors, necessitating a synchronized multimodal approach.
Finally, to clarify the specific contribution of sequence modeling to predictive accuracy, we ablated the temporal module by replacing the Bi-LSTM with a simple Mean Temporal Pooling layer. This modification reduced the correlation coefficient to 0.625 and increased the absolute error. This decline underscores the critical necessity of advanced sequence modeling in physiological prediction. Electrodermal activity possesses inherent temporal inertia, characterized by accumulation and decay phases, which a simple averaging mechanism fails to capture. The Bi-LSTM effectively models these contextual temporal dependencies, ensuring that the framework recognizes not just the average environmental state, but its sequential evolution over the trajectory.
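The limitation of mean pooling noted above can be illustrated with a toy example. The sketch below does not reproduce the Bi-LSTM; `leaky_accumulate` is a hypothetical stand-in for any order-sensitive state with accumulation and decay, and the decay constant is an arbitrary assumption.

```python
def mean_pool(sequence):
    """Order-insensitive summary: the plain average over time."""
    return sum(sequence) / len(sequence)


def leaky_accumulate(sequence, decay=0.8):
    """Toy order-sensitive state: each step decays the state, then adds input."""
    state = 0.0
    for x in sequence:
        state = decay * state + x
    return state


# Two stimulus sequences with identical averages but opposite ordering.
rising = [0.0, 0.0, 1.0, 1.0]
falling = [1.0, 1.0, 0.0, 0.0]
```

Here `mean_pool` returns the same value for both sequences, whereas `leaky_accumulate` ends higher for `rising`, because recent stimuli have had less time to decay. A pooled representation cannot tell the two trajectories apart.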
4. Model Application Analysis—A Case Study of the Wuhan Optics Valley Square Comprehensive UUS
4.1. Model Application and Case Study Selection
To evaluate the model’s real-world applicability and generalization capabilities, we conducted a preliminary field test across 30 diverse UUS in Wuhan, which were not part of the initial training dataset. In 20 of these locations, the model generated GSR sequences based solely on first-person-view video and environmental data inputs. In the remaining 10 locations, ground-truth GSR data was simultaneously recorded using wearable wristbands to validate the model’s predictive accuracy across different environments. The results of this multi-site field test supported the model’s robustness and reliability.
Based on this successful validation, the Wuhan Optics Valley Plaza underground integrated space was selected as a representative case for an in-depth analysis (Figure 5). This site was chosen due to its large scale, complex spatial configuration, and high volume of user traffic, making it an ideal environment to test the model’s performance in a dynamic and challenging real-world scenario. The Wuhan Optics Valley Plaza Comprehensive UUS is one of the largest underground transportation hubs in Central China, with a total area of approximately 120,000 square meters distributed across three levels (B1 to B3). The space integrates multiple functions, including subway transit, commercial services, and cultural exhibitions, and serves an average daily passenger flow exceeding 300,000. Its complex spatial structure (comprising long corridors, commercial districts, dining areas, and exhibition halls), combined with dense crowds and significant environmental fluctuations during peak hours, provides a rich context for assessing the model’s dynamic perception capabilities.
4.2. Data Acquisition and Model-Generated GSR Sequence
A participant was equipped with a first-person-view action camera and a suite of portable environmental sensors to conduct a comprehensive walking survey through five distinct zones within the Optics Valley Plaza UUS: straight passages, the central hall, the underground commercial zone, the subway platform, and circular corridors. During the survey, a total of 1765 s of data were collected. The collected visual data, along with six key environmental parameters (temperature, humidity, noise, VOC, PM2.5, and foot traffic), were subsequently input into our pre-trained model.
Figure 6 visualizes the continuous, second-by-second predicted GSR sequence dataset generated by this process, dynamically representing the participant’s physiological stress throughout the journey.
To verify the model’s capacity to capture spatial and environmental impacts on physiological states within the specific context of the Optics Valley UUS, a concurrent validation experiment was conducted. Five additional participants, wearing synchronized GSR wristbands, traversed the same route simultaneously with the main data collection participant. The model’s predicted GSR sequence exhibited a strong and statistically significant positive correlation with the averaged ground-truth GSR data recorded from the five validation participants (Pearson’s r = 0.86). It is important to clarify the interpretive scope of this finding: rather than constituting an absolute individual-level predictive validation, this strong correlation robustly demonstrates the model’s ability to capture an environment-driven shared arousal pattern. It confirms that the architectural and environmental stimuli along this specific UUS trajectory elicit a highly consistent, collective physiological response trend. This high degree of correlation validates the framework’s reliability in extracting environment-induced stress patterns, thereby justifying its use for the subsequent spatial analysis.
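The comparison described above, a predicted sequence against the point-wise average of several participants' recordings, can be sketched as follows. The function names are illustrative, and the sketch assumes all series are already time-aligned and equally long. The paper does not state how alignment was performed.

```python
from math import sqrt


def average_across_participants(series_list):
    """Point-wise mean of equally long, time-aligned GSR series."""
    return [sum(vals) / len(vals) for vals in zip(*series_list)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Averaging before correlating removes participant-specific fluctuations, which is why the resulting coefficient reflects a shared trend rather than individual-level accuracy.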
4.3. Analysis of Results
4.3.1. Correlation Between Spatial Type and GSR Levels
The analysis presented in Figure 7 reveals a strong correlation between spatial type and GSR levels, effectively indicating varying degrees of physiological arousal.
Specifically, straight and circular corridors were associated with the highest median GSR levels and the greatest variance, with several high-value outliers observed in the straight corridor. This suggests that narrow, enclosed spatial configurations induced a state of heightened physiological arousal in participants. In stark contrast, the central hall and commercial zone exhibited the lowest and most stable GSR levels. Notably, values in the central hall approached baseline levels, reflecting a potential stress-alleviating effect associated with open, expansive environments.
Intermediate between these extremes were the subway platforms, where GSR levels were more moderate. This finding suggests that environments with a medium level of stimulation may induce a stable, rather than stressful, state of physiological arousal. Overall, these results quantify a clear gradient of physiological stress corresponding to the degree of spatial enclosure.
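The per-zone comparison of median level and spread can be sketched as below. The `(zone, value)` input layout and the choice of population variance are assumptions for illustration only.

```python
from statistics import median, pvariance


def zone_summary(samples):
    """Summarize GSR per spatial zone.

    samples: iterable of (zone_label, gsr_value) pairs (hypothetical layout).
    Returns a dict: zone -> {"median": ..., "variance": ...}.
    """
    by_zone = {}
    for zone, value in samples:
        by_zone.setdefault(zone, []).append(value)
    return {
        zone: {"median": median(values), "variance": pvariance(values)}
        for zone, values in by_zone.items()
    }
```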
4.3.2. Overall GSR Time-Series Analysis
As depicted in Figure 8, the time-series data reveals a distinct “U-shaped” trajectory for physiological arousal during the participants’ journey through the UUS. In the initial phase of the path (0 to ~550 s), GSR levels declined precipitously from a high state of arousal, likely reflecting an initial alert response to the novel subterranean environment and the onset of movement, followed by a rapid period of psychological adaptation.
Subsequently, during the latter segment of the journey (~550 to 1765 s), GSR levels began a sustained and gradual increase, culminating in a final peak. This secondary rise in arousal is likely associated with the cumulative effects of physiological fatigue from prolonged walking, increased environmental complexity, and sustained exposure to enclosed spaces. The high GSR values at the conclusion of the journey suggest that participants experienced significant states of high arousal, operationalized in this study as environmental stress. This finding has important implications, indicating that after initial novelty subsides, prolonged exposure to complex underground environments may elicit a secondary rise in arousal, potentially diminishing the overall spatial experience and perception of safety.
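One simple way to locate the turning point of such a U-shaped trajectory is to smooth the series and take the index of its minimum. The sketch below is an assumption about how this could be done, not the procedure used in the study. The centered moving average and its window length are illustrative choices.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks at the series edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def turning_point(series, window=5):
    """Index of the minimum of the smoothed series."""
    smooth = moving_average(series, window)
    return min(range(len(smooth)), key=smooth.__getitem__)
```

With samples taken once per second, the returned index corresponds directly to elapsed seconds along the route.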
Superimposed on this primary U-shaped trend were fluctuations at different temporal frequencies. Low-frequency variations corresponded to transitions between functional zones, observed as “stair-step” changes, such as the stable low-arousal state in the central hall and the gentle increase through the commercial area. In contrast, high-frequency fluctuations manifested as transient peaks and troughs, often synchronized with acute environmental stimuli (e.g., the spike at 240 s). Notably, autocorrelation analysis revealed no significant periodic oscillations, suggesting that GSR dynamics were closely coupled with environmental changes rather than solely governed by intrinsic biological rhythms.
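The autocorrelation check mentioned above can be sketched with a standard sample estimator. The paper does not specify the estimator or the significance test used, so the function below is only an assumed, minimal form.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a non-negative integer lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom
```

A periodic signal yields recurring peaks in this function at multiples of its period. The absence of such peaks is what supports the reading that the GSR dynamics follow environmental changes rather than a fixed rhythm.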
4.3.3. Dynamic Evolution of GSR Across Spatial–Temporal Zones
Analyzing the spatial–temporal trends revealed distinct GSR patterns corresponding to each environmental zone. Upon entering the straight corridor, participants’ GSR levels peaked rapidly and then progressively declined, suggesting that an initial arousal was followed by gradual environmental adaptation. The transition into the open central hall was associated with a further decline to near-baseline levels, suggesting the potential arousal-moderating role of increased spatial openness.
In the commercial zone, GSR values exhibited a slight increase upon entry before stabilizing, reflecting the achievement of a homeostatic physiological state under moderate environmental stimulation. Conversely, a slow but steady upward trend in GSR was observed on the subway platform, likely linked to sustained environmental stimuli such as noise and crowding, alongside the confounding factor of cumulative fatigue. Finally, entering the circular corridor corresponded to a sharp and continuous rise in GSR. This sustained intensification of physiological arousal highlights the potential compounding association between navigating a monotonous, enclosed environment and the accumulated effects of physical and mental fatigue.
4.3.4. Contextual Analysis of Acute High-Arousal Events
To systematically identify acute environmental stress events, by referring to Boucsein’s physiological thresholds [67], we defined a precise mathematical criterion based on the pre-normalized, artifact-filtered GSR amplitude. For each 10 s sliding window $W_t$, the amplitude ratio $R_t$ is defined as the ratio of the maximum peak amplitude to the minimum baseline amplitude within that specific window:
$$R_t = \frac{\max_{i \in W_t} A_i}{\min_{i \in W_t} A_i}$$
where $i$ indexes the individual data points within the defined sliding time window $W_t$. The variable $A_i$ represents the physiological amplitude at data point $i$, specifically defined as the signal that has been artifact-filtered but remains pre-normalized.
An acute stress event (indicating sudden arousal) is triggered when Rt > 1.5, and a rapid relaxation event when Rt < 0.5.
Crucially, to ensure that these detected events represent genuine psycho-physiological arousal rather than physical movement or sensor noise, a rigorous artifact exclusion protocol was applied prior to this calculation (as outlined in Section 3.1.3). High-frequency spikes (duration < 1 s) characteristic of motion artifacts were removed using a low-pass filtering technique, and any segments with sensor contact failures were entirely excluded.
Furthermore, we conducted a sensitivity analysis to evaluate the robustness of the chosen Rt = 1.5 threshold. Lowering the threshold to 1.3 increased the event count to 54, which introduced potential false positives caused by natural physiological baseline drifts. Conversely, raising the threshold to 1.7 reduced the count to 18, missing several visual-corroborated environmental stressors. The selected 1.5 threshold yielded the 32 identified stress events, offering an optimal balance for isolating high-confidence, environment-driven acute stress responses while maintaining reproducibility.
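The windowed amplitude ratio and the threshold sweep can be sketched as follows. Several choices here are assumptions rather than details given in the text: one sample per second, strictly positive amplitudes, a window that slides one sample at a time, and the merging of consecutive above-threshold windows into a single event. Because a max/min ratio of positive amplitudes is never below 1, the sketch covers only the Rt > threshold criterion.

```python
def amplitude_ratios(amplitudes, window=10):
    """Max/min amplitude ratio for every sliding window of `window` samples.

    Assumes 1 Hz sampling and strictly positive, artifact-filtered values.
    """
    return [
        max(amplitudes[s:s + window]) / min(amplitudes[s:s + window])
        for s in range(len(amplitudes) - window + 1)
    ]


def count_events(ratios, threshold):
    """Count runs of consecutive windows whose ratio exceeds the threshold.

    Merging adjacent windows into one event is an assumption of this sketch.
    """
    count, inside = 0, False
    for r in ratios:
        above = r > threshold
        if above and not inside:
            count += 1
        inside = above
    return count


def sweep(amplitudes, thresholds=(1.3, 1.5, 1.7)):
    """Event count for each candidate threshold."""
    ratios = amplitude_ratios(amplitudes)
    return {th: count_events(ratios, th) for th in thresholds}
```

Lowering the threshold can only keep or increase the number of windows above it, which matches the direction of the sensitivity results reported above.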
From these, we selected three representative events for a detailed contextual analysis. As detailed in Figure 9, this examination reveals that sharp increases in GSR frequently co-occur with abrupt spatial–temporal transitions.
The transition from a narrow corridor to an open atrium was associated with a threefold increase in GSR. This paradoxical physiological response in a theoretically relaxing space can be interpreted as reflecting a sensory adaptation conflict. The sudden shift from dim to bright light can elicit an instantaneous visual stress response, while the abrupt expansion of spatial scale may contribute to temporary disorientation and heightened alertness.
Entering a bustling commercial zone from a quieter passageway also corresponded to a significant spike in GSR. This is interpreted as multimodal sensory overload, where the combined stimuli of noise, temperature changes, odors, and crowding exceed an individual’s processing capacity, eliciting a sharp increase in physiological arousal to navigate a complex environment.
Similarly, a sharp GSR increase was recorded at the threshold between the underground and ground levels. We classify this as an environmental boundary effect. The intense contrast in lighting and airflow creates a strong physical stimulus, while the cognitive load required to reorient to the above-ground environment (e.g., different temperatures, new spatial cues) may further elevate the physiological arousal response.
4.3.5. Causal Inference and Identifying Assumptions
While our deep learning framework demonstrates a strong predictive capability mapping multimodal UUS environments to physiological arousal, it is crucial to clarify that this study primarily establishes predictive associations rather than strict causal relationships. To interpret our predicted GSR signals as environmentally induced stress, several identifying assumptions must be acknowledged. We must assume that unobserved confounding variables—such as inter-individual baseline differences in electrodermal activity, varying levels of physical exertion (e.g., walking speed), and cognitive load derived from navigation anxiety—do not disproportionately bias the observed environmental effects. Given the naturalistic, ambulatory design of our in situ experiments, perfectly isolating the architectural stimuli from these behavioral and physiological confounders remains challenging. Therefore, the terms “impact” or “effect” used in our analysis should be interpreted strictly through a predictive and correlational lens.
5. Discussion
The findings from our case study demonstrate more than just a correlation between environment and stress; they represent a fundamental methodological shift for UUS assessment, transitioning from static evaluation to a dynamic, automated perception system. This transition from ‘what is’ to ‘what is felt in real-time’ unlocks significant implications for the construction lifecycle, starting with a deeper understanding of the environmental–physiological interplay.
5.1. Decoding Environmental–Physiological Interplay: Spatial Type and Cumulative Environmental Stress
Our empirical findings reveal a profound link between spatial morphology and autonomic arousal. The elevated and volatile GSR levels observed in enclosed, linear corridors stand in stark contrast to the suppressed, stable levels in open areas like central halls and commercial zones. This aligns with neurobiological principles: confined spaces with limited lines of sight and repetitive visual fields impose a continuous cognitive load for navigation, potentially activating sympathetic nervous system pathways. Conversely, the volumetric openness and visual diversity of central halls likely promote parasympathetic engagement, an effect consistent with “restorative environments” literature but previously unquantified dynamically in subterranean contexts.
Beyond static spatial features, our model excels at capturing the cumulative burden of sequential spatial experiences. Its ability to model temporal dependencies is evident in scenarios where prolonged exposure to a series of suboptimal conditions is associated with escalating environmental stress. For instance, a sequence involving a dimly lit corridor, a noisy food court, and a crowded transit platform corresponds to a compounded high-arousal trajectory distinct from the sum of isolated environmental stimuli. This extends findings on thermal stress dynamics [31, 32] to the multimodal underground realm. Moreover, the sensor fusion mechanism clarifies how abrupt environmental shifts, such as a rapid increase in VOCs from cleaning agents, manifest as acute physiological “jolts”, visible as transient GSR spikes superimposed on baseline trends. These findings challenge conventional single-parameter optimization and advocate for a holistic design approach that considers temporal sequencing and transition zones to mitigate cumulative physiological load.
Finally, the strong correlation (Pearson r = 0.72) between model-predicted GSR and ground-truth measurements validates the model’s capacity to capture environment-driven patterns of shared autonomic arousal. It confirms that the extracted spatial semantics (e.g., visual clutter detected by ResNet) and sensor features are robust predictors for arousal fluctuations. This opens a significant new avenue for non-contact, scalable well-being assessment in UUS environments. While further controlled experiments are required to account for individual physiological baselines before achieving full operational readiness, this framework serves as a robust proof-of-concept for integrating human-centric perception into future intelligent building systems.
5.2. Automated Design Validation: Integrating Human-Centric Metrics into Digital Workflows
A primary implication of this research is the automation of a critical, yet traditionally manual, part of the performance-based design validation process. In current practice, evaluating the human-centric performance of design alternatives relies heavily on subjective interpretation, static standards, or time-consuming physical mock-ups. Our framework transforms this process into a rapid, data-driven workflow that can be seamlessly integrated into digital design environments, such as those based on BIM.
By inputting design schemes into our model, architects and engineers can automatically predict the potential physiological stress levels that different spatial configurations, material choices, or lighting strategies would induce in users. The framework generates dynamic stress heatmaps and temporal stress trajectories, providing a quantitative and intuitive basis for comparison. For example, a designer could test two different corridor layouts and receive an automated report quantifying which design is likely to result in lower cumulative stress during peak hours. This shifts the evaluation from a qualitative guess to an evidence-based decision, allowing for the iterative optimization of designs for user well-being before any physical construction begins. This capability automates the crucial feedback loop between design choices and their human impact, a cornerstone of next-generation, performance-driven architectural design.
5.3. Enabling Intelligent and Automated Building Operations
Beyond the design phase, our framework provides a direct pathway to enhance the intelligence and automation of building operations. The predictive model can be deployed as a ‘soft sensor’ within the BMS, translating readily available data into a high-value, previously unmeasurable metric: real-time occupant well-being. By processing data streams from existing environmental sensors (e.g., for temperature, noise, VOC) and visual sensors, the system can proactively identify zones or conditions that are causing heightened user stress.
This automated diagnostic capability is the first step in creating a closed-loop, adaptive control system. Upon detecting a high-stress signature in a specific area, the system could automatically trigger targeted interventions through the BMS. For instance, it could modulate lighting spectra to warmer tones in monotonous corridors, adjust ventilation rates to mitigate a surge in VOC, or reroute pedestrian flow via digital signage to alleviate congestion-induced anxiety. This transforms building operation from a reactive, schedule-based model to a proactive, physiology-informed paradigm. Instead of waiting for complaints or relying on static setpoints, the building automates its response to the occupants’ real-time, subconscious needs, continuously optimizing for both comfort and health.
A practical application of this conceptual framework lies in envisioning adaptive environmental control for transient spaces like underground passages. Facility managers often receive subjective complaints of “stuffiness” or “oppressiveness” during peak hours, even when traditional BMS metrics like temperature and CO2 are within standard limits. Our framework proposes a pathway to address this challenge by functioning as an intelligent “soft sensor”.
For instance, in an underground passage, the model could theoretically identify a correlation between high pedestrian density, subtle increases in VOC levels, and a corresponding rise in the predicted collective GSR—a clear signature of environmental stress. Once this stress signature surpasses a clinically validated threshold, the framework envisions a future scenario where a closed-loop intervention via the BMS could be triggered. Instead of a crude, energy-intensive, 24/7 increase in ventilation, a future integrated system might issue a localized, automated command to adjust the Variable Air Volume (VAV) system, dynamically increasing fresh air delivery to the affected zone. However, transitioning this conceptual scenario into a validated intervention protocol requires a rigorous future research agenda. Subsequent controlled trials are necessary to determine the precise parameters of such interventions (e.g., the exact optimal ventilation rates and duration) and to evaluate their actual efficacy in mitigating occupant stress. Furthermore, the lightweight nature of our architecture presents significant practical advantages for this long-term vision. Its suitability for deployment on edge devices has the potential to facilitate real-time feedback loops for future facility management systems.
Beyond the technical validation of our multimodal predictive framework, the real-world deployment of such human-centric systems in UUS fundamentally intersects with broader institutional and governance dimensions. UUS are complex socio-technical infrastructures governed by intricate institutional arrangements, such as regulatory standards, operational protocols, and procurement frameworks. While our deep learning model demonstrates what can be technically predicted (i.e., physiological arousal), its actual integration into BMS raises a critical institutional question: what are building managers realistically incentivized to act upon? To fully realize the potential of automated physiological feedback, technical capabilities must be aligned with institutional motivations. Drawing on Ostrom’s Institutional Analysis and Development framework and literature addressing urban policy governance [68], it is essential to distinguish between “regulation by incentives” (using incentive schemes as direct management tools within the BMS) and the “regulation of incentives” at a meta-level (governing the broader structure of incentives for facility managers). Acknowledging these institutional constraints grounds our ambitious BMS integration scenario in a realistic understanding of the operational conditions required for such automated human-centric technologies to be adopted and acted upon in practice.
5.4. Building the Foundation for Human-Centric Digital Twins
Perhaps the most significant long-term contribution of this work is its potential to enrich and evolve the concept of Digital Twins in the AEC industry. Current digital twins primarily excel at modeling a building’s physical assets, systems, and energy performance. They can answer “What is the temperature?” or “Is the HVAC system running efficiently?” However, they are largely silent on the human experience, lacking the capability to answer “How does this space feel to its occupants?”
Our framework adds this crucial ‘human dynamics layer’ to the digital twin. By integrating our predictive model, a digital twin can reflect not just how the building is, but how it is experienced. This enables a new class of advanced simulations and predictions fundamental for automating facility management and long-term strategic planning. For example, a facility manager could use the human-centric digital twin to simulate the psycho-physiological impact of changing a lighting retrofit plan or altering cleaning schedules before implementation, predicting potential downstream effects on occupant well-being and productivity. This layer transforms the digital twin from a passive repository of asset data into an active, predictive tool for human-environment interaction, laying the essential groundwork for the next generation of truly autonomous and health-optimized buildings.
For instance, the human-centric digital twin serves as a powerful tool for proactive renovation and planning, de-risking large-scale investments by quantifying their potential human impact. Consider an asset owner planning to convert a traditional office into an open-plan collaborative space. The key question is whether the new design will foster collaboration or inadvertently increase stress and reduce productivity due to acoustic and visual distractions.
Our framework automates the validation process to answer this. Designers can model the proposed layout within the digital twin, and the ‘human dynamics layer’ will automatically simulate the resultant human-centric factors, such as pedestrian flows, sightlines, and noise propagation. The model then generates a predictive psycho-physiological impact report, quantifying the expected average GSR levels and spatially identifying potential stressors, such as workstations with high exposure to traffic. By comparing these automated reports across multiple design alternatives, stakeholders can make a data-driven decision, selecting the optimal layout that balances collaboration with individual well-being before committing capital. This transforms the design validation process from a costly, post hoc remediation effort into a proactive and automated scientific evaluation. This creates a powerful digital feedback loop: the digital twin predicts the human impact, informs a better design, the real-world renovated space generates new operational data, which in turn continuously refines the digital twin’s predictive accuracy. This self-improving cycle is the hallmark of a truly intelligent automated system.
6. Recommendations
Based on the research findings regarding the association between UUS environments and GSR, this study proposes systematic recommendations for the design, construction, and management of UUS. The findings indicate that both static attributes (e.g., openness and enclosure) and dynamic changes (e.g., environmental transitions) of spatial characteristics significantly influence participants’ physiological arousal. Specifically, narrow, enclosed spaces like straight and circular passages were found to consistently induce high GSR levels, indicative of a high arousal state. In contrast, open spaces such as central halls significantly suppressed GSR, promoting relaxation. Moderately stimulating environments, including commercial zones and subway platforms, maintained a moderate level of arousal. Furthermore, spatial transition events triggered abrupt changes in GSR, particularly when moving from open to enclosed spaces, resulting in sharp increases in GSR values. These phenomena are attributed to the combined effects of environmental stimulus intensity, fatigue accumulation, and perceptual shifts. Therefore, this study recommends constructing physiologically friendly UUS by focusing on four key dimensions: spatial design, environmental control, transition optimization, and intelligent management, all aimed at reducing users’ physiological load and psychological stress.
6.1. Spatial Design Strategies
In spatial design, mitigating the adverse effects of enclosed areas appears to be a priority. The high level of enclosure in straight and circular passages, which led to sustained high GSR values, suggests that avoiding long, monotonous paths is advisable. We recommend dividing narrow corridors into shorter segments and installing visual buffer facilities (such as small green courtyards or art installations) at nodes to reduce the sense of oppression. Additionally, utilizing curved walls instead of right-angle turns and integrating virtual windows (e.g., LED screens simulating natural landscapes) could expand perceptual dimensions and help alleviate sustained activation of the sympathetic nervous system. For open spaces, the observed GSR suppression effect of the central hall highlights its value as a core node for stress reduction. Using translucent materials on the ceiling to introduce natural light, and integrating green walls and quiet zones, would further enhance the parasympathetic-dominated state of relaxation. While commercial zones induced moderate arousal, their rapid adaptability suggests that a steady state can be maintained by controlling stimulus sources, such as by limiting the density of dynamic advertising screens and using soft lighting. As semi-open transition zones, subway platforms exhibited moderate GSR levels; their design could balance functionality and comfort, for instance, by installing diversion channels to mitigate cumulative stress from crowd congestion.
6.2. Environmental Control and Management
Environmental control strategies could be tailored to the dynamic changes in stimulus intensity. The research revealed that GSR exhibited a “biphasic” trend over time: an initial decrease due to adaptation, followed by an increase with fatigue and accumulated stimuli. Consequently, management strategies could focus on suppressing the accumulation of stressors, especially in high-load areas such as circular corridors. This can be achieved by deploying intelligent ventilation systems that automatically increase air exchange rates when PM2.5 exceeds control standards, preventing stuffy environments from exacerbating fatigue. Laying shock-absorbing materials (such as rubber composite flooring) could also reduce walking-related energy consumption. Concurrently, establishing a graded environmental complexity system may prove effective. For example, time-segmented broadcast strategies could be implemented on subway platforms (e.g., flow control reminders during peak hours, soothing music during off-peak hours), and real-time crowd monitoring could be installed in commercial areas to trigger diversion when density exceeds predefined thresholds. Such measures could interrupt stress accumulation pathways and prevent GSR values from surging in later stages of exposure.
6.3. Optimization of Transition Interfaces
The design of transition interfaces is critical for mitigating environmental stress during spatial changes. The analysis of sudden GSR changes indicates that abrupt shifts in environmental perceptual intensity (e.g., transitioning from an open to an enclosed space) are strongly associated with arousal spikes. It is therefore recommended to incorporate gradual transition zones at spatial junctions. For instance, at the exit from an enclosed corridor to a central hall, a 10–15 m gradually widening area with adjustable lighting gradients and acoustic barriers (sound-absorbing partitions) could mitigate GSR fluctuations. High-risk transition nodes, such as the passage from a subway platform to a circular corridor, may require special optimization. Adding a pre-adaptation zone before the entrance, equipped with environmental warning screens, blue-green lighting, and tactile guidance paths, could help reduce the anxiety frequently associated with disorientation. Interventions of this nature have the potential to buffer the physiological fluctuations linked to abrupt spatial transitions.
6.4. Data-Driven Intelligent Management
An intelligent management system enables data-driven, dynamic optimization, creating a pathway toward truly adaptive environments. Our research emphasizes that GSR levels are dominated by dynamic events and environmental parameters rather than a fixed rhythm, which underscores the necessity of a closed-loop feedback mechanism for real-time environmental management.
The implementation of such a system involves two critical steps. First, to enable large-scale, non-intrusive monitoring, the system could deploy non-contact physiological sensing technologies at key nodes. For instance, high-resolution infrared thermal imaging cameras could potentially serve as a non-contact proxy for GSR, an avenue that warrants further investigation. While not measuring skin conductance, these devices can detect subtle changes in facial temperature patterns caused by shifts in blood flow, which are highly correlated with the sympathetic nervous system activity that also governs GSR. By training a “proxy model” that learns the mapping between thermal data and ground-truth GSR, it becomes possible to estimate the real-time arousal state of a crowd without requiring wearable sensors. Second, this real-time data stream would feed into the environmental parameter-GSR correlation model developed in this study, creating a closed-loop system. When the estimated crowd arousal index exceeds a predefined threshold, the system could trigger automated, adaptive controls. For example, upon detecting a high-arousal cluster in a corridor, the BMS might automatically modulate the environment to promote relaxation by shifting the lighting color temperature towards warmer tones or increasing the fresh air exchange rate.
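The threshold-triggered step of this closed loop can be sketched as a small state update. Everything specific here is hypothetical: the function name, the numeric thresholds, and the use of separate switch-on and switch-off levels (hysteresis), which is added only to keep the intervention from toggling rapidly when the index hovers near a single threshold.

```python
def control_step(arousal_index, active, on_threshold=1.0, off_threshold=0.8):
    """Decide whether an intervention should be active after this reading.

    arousal_index: estimated crowd arousal for one zone (arbitrary units).
    active:        whether the intervention is currently switched on.
    Thresholds are placeholder values, not validated settings.
    """
    if not active and arousal_index > on_threshold:
        return True   # e.g. warm the lighting or raise the fresh air rate
    if active and arousal_index < off_threshold:
        return False  # arousal has subsided; release the intervention
    return active


def run(readings):
    """Apply control_step over a sequence of readings, starting inactive."""
    states, active = [], False
    for value in readings:
        active = control_step(value, active)
        states.append(active)
    return states
```

In a deployed system the boolean would map onto a concrete BMS command. The appropriate thresholds and actions would need to come from the controlled trials discussed earlier.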
This approach facilitates spatial–temporal differentiated strategies: expanding the capacity of central hall rest areas during peak weekday hours; controlling the total volume of sensory stimuli in commercial areas during holidays; or closing redundant corridors at night to reduce negative experiences. Ultimately, this transforms UUS from being mere “functional carriers” to becoming “health-promoting environments”.
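The threshold-triggered step of the closed loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the control logic of any deployed BMS: the names `ZoneState` and `plan_adjustment`, the z-scored arousal index, and all numeric defaults (threshold, step sizes, clamping limits) are hypothetical placeholders rather than validated intervention thresholds.

```python
from dataclasses import dataclass


@dataclass
class ZoneState:
    """Snapshot of one monitored zone (all fields are illustrative)."""
    zone_id: str
    arousal_index: float         # estimated crowd arousal, assumed z-scored
    color_temp_k: int            # current lighting color temperature (kelvin)
    air_changes_per_hour: float  # current fresh air exchange rate


def plan_adjustment(state, threshold=1.0, step_k=500, min_color_temp_k=3000,
                    step_ach=1.0, max_ach=8.0):
    """Return new setpoints for one zone, or an empty dict if no action is needed.

    All defaults are placeholders, not validated intervention thresholds.
    """
    if state.arousal_index <= threshold:
        return {}  # below the threshold: leave the zone unchanged
    return {
        # A lower color temperature means warmer light; clamp at a floor.
        "color_temp_k": max(min_color_temp_k, state.color_temp_k - step_k),
        # Raise the fresh air exchange rate; clamp at a ceiling.
        "air_changes_per_hour": min(max_ach,
                                    state.air_changes_per_hour + step_ach),
    }


corridor = ZoneState("corridor_B", arousal_index=1.6, color_temp_k=4000,
                     air_changes_per_hour=4.0)
print(plan_adjustment(corridor))
```

Returning setpoints rather than acting on devices keeps the decision rule separate from actuation, so the same rule could be reviewed or simulated before any automated control is enabled.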
This research, therefore, points towards a new set of performance-based design goals and management heuristics. Instead of prescriptive standards, future strategies could focus on achieving a high segmentation rate for enclosed passages, ensuring a substantial provision of open spaces at core nodes, and systematically implementing pre-adaptive interventions in all high-risk transition zones. The effectiveness of these measures would then be continuously validated and refined by a physiological big data platform, realizing a truly data-driven and human-centric approach to UUS design and operation.
7. Limitations and Future Work
While this study introduces a novel framework for dynamic POE in UUS, it is critical to acknowledge its methodological limitations, which in turn define a rigorous agenda for future research.
First, regarding dataset scale and model generalization, the collected field data (30 adults and approximately 45 h of multimodal data in Wuhan, China) provide a solid empirical foundation for this proof-of-concept, but the scale remains relatively limited for deep learning architectures. Physiological signals inherently exhibit substantial inter-individual variability. Although we employed rigorous subject-level Z-score normalization and a site-level data split to prevent temporal leakage and demonstrate spatial generalization, we have not yet conducted exhaustive participant-level cross-validation. Consequently, the current model primarily validates an environment-driven shared arousal pattern rather than absolute individual-level prediction. Future work must build a larger, demographically and culturally diverse dataset to enhance the model’s robustness and cross-individual generalizability.
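The three data-handling ideas named in this paragraph can be sketched in a few lines. This is a minimal illustration, assuming each sample is a dictionary with hypothetical keys `subject`, `site`, and `gsr`, and assuming the Z-score is applied to GSR per participant. The function names are illustrative, and the leave-one-subject-out generator shows the participant-level validation that has not yet been conducted in this study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_subject(records):
    """Add 'gsr_z': GSR standardized within each participant."""
    by_subject = defaultdict(list)
    for r in records:
        by_subject[r["subject"]].append(r["gsr"])
    stats = {s: (mean(v), pstdev(v)) for s, v in by_subject.items()}
    out = []
    for r in records:
        mu, sigma = stats[r["subject"]]
        # A participant with a constant signal carries no variation to scale.
        z = (r["gsr"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**r, "gsr_z": z})
    return out


def split_by_site(records, test_sites):
    """Site-level split: whole sites are held out for testing."""
    train = [r for r in records if r["site"] not in test_sites]
    test = [r for r in records if r["site"] in test_sites]
    return train, test


def leave_one_subject_out(records):
    """Yield (held_out_subject, train, test), one fold per participant."""
    for held_out in sorted({r["subject"] for r in records}):
        train = [r for r in records if r["subject"] != held_out]
        test = [r for r in records if r["subject"] == held_out]
        yield held_out, train, test
```

Holding out entire sites tests spatial generalization, whereas holding out entire participants tests cross-individual generalization; the two splits answer different questions and are not interchangeable.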
Second, it is essential to clarify the epistemological boundary of our proposed framework regarding physiological validation. The current model relies solely on GSR as the indicator of arousal. While GSR reliably reflects sympathetic nervous system activity, it can be confounded by physical exertion (e.g., walking) or internal cognitive processes. Furthermore, the framework technically predicts a continuous physiological signal (autonomic arousal) rather than directly measuring explicit human psychological perception. It predicts a general stress magnitude but does not distinguish emotional valence (e.g., excitement vs. anxiety). Without concurrent independent psychological validation (such as continuous ecological momentary assessments), the predicted signals serve as an objective proxy for environment-driven stress rather than a definitive psychological measure. Future research must integrate complementary physiological measurements (e.g., HRV, ECG, EEG) and contextual semantic features (e.g., facial expressions) to establish a multi-dimensional psychological validation.
Third, a structural limitation exists within the environmental branch of our methodology. In the current framework, the six physical environmental indicators are integrated using an equal-weighting scheme. As established by the OECD/JRC Handbook on Constructing Composite Indicators [42], equal weighting is a normative choice that implicitly assigns equivalent conceptual importance to distinct variables (e.g., noise vs. VOC). For this initial study, this was adopted as a practical simplification to establish a baseline. Future iterations will engage with advanced data-driven approaches (e.g., Principal Component Analysis) or expert-driven weighting to more accurately reflect the disproportionate physiological impacts of different environmental stimuli.
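To make the contrast between equal weighting and a data-driven alternative concrete, the sketch below computes both sets of weights for a matrix of indicator readings. It is an illustration under stated assumptions, not the weighting procedure of this study: the function names are hypothetical, the PCA variant shown (squared loadings of the first principal component of the correlation matrix, obtained by power iteration) is only one simple option, and constant indicator columns are assumed to be absent.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each indicator column (assumes no constant column)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def correlation_matrix(z_rows):
    """Correlation matrix of already standardized rows."""
    n, p = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / n for j in range(p)]
            for i in range(p)]


def equal_weights(n_indicators):
    """Baseline scheme: every indicator receives the same weight."""
    return [1.0 / n_indicators] * n_indicators


def pca_weights(rows, iterations=500):
    """Weights from squared loadings of the first principal component."""
    corr = correlation_matrix(standardize(rows))
    p = len(corr)
    # Asymmetric start vector, so it is unlikely to be orthogonal to the
    # dominant eigenvector.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Squared components of a unit vector are non-negative and sum to one.
    return [x * x for x in v]


def composite(rows, weights):
    """Weighted sum of standardized indicators, one value per sample."""
    return [sum(w * x for w, x in zip(weights, z)) for z in standardize(rows)]
```

With six indicators, `equal_weights(6)` assigns 1/6 to each, whereas `pca_weights` concentrates weight on indicators that co-vary with the dominant pattern in the data. Neither scheme by itself encodes physiological impact, which is why expert-driven weighting remains a complementary option.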
Finally, while our findings present a compelling vision for integrating human-centric perception into digital twins and BMS, we must distinguish demonstrated feasibility from operational readiness. Our current application is primarily diagnostic. Transitioning this conceptual scenario into a prescriptive, closed-loop intervention protocol (e.g., autonomously adjusting ventilation or lighting based on predicted stress) requires a rigorous future research agenda. Subsequent controlled trials are strictly necessary to establish standardized, clinically validated intervention thresholds before such automated, physiology-informed systems can be deployed in practice.
8. Conclusions
This study developed and validated a deep learning-based framework that transforms Post-Occupancy Evaluation from a manual, retrospective analysis into a dynamic, automated process. By fusing multimodal data to continuously predict second-by-second changes in user autonomic arousal (GSR), this research has demonstrated a new method for embedding human-centric feedback directly into the digital workflows of the AEC industry. The creation of a unique, spatially and temporally aligned dataset for Urban Underground Spaces provided the empirical foundation for this advancement, and the robust performance of the model supports its feasibility as a rigorous proof-of-concept in complex, real-world environments.
The primary contribution of this work lies in its implications for Automation in Construction. Our framework provides an automated engine that could be deployed across the building lifecycle to close critical feedback loops. In the design phase, it could enable the automated validation of design schemes, providing quantitative, evidence-based metrics on physiological responses that can inform BIM-based workflows. During operation, the model could function as an intelligent soft sensor for Building Management Systems, supporting a shift from static, schedule-based control to proactive, automated environmental adjustments based on the real-time physiological dynamics of occupants, once the intervention thresholds discussed in Section 7 have been validated.
Most significantly, this research establishes the foundation for human-centric digital twins. By adding a crucial ‘human dynamics layer’, it enriches digital replicas with the ability to simulate and predict not just how a building performs, but how it is experienced. This capability is essential for automating advanced facility management and long-term strategic planning. Ultimately, this study provides a novel tool and a new paradigm for the AEC industry, paving the way for the automated design and operation of buildings that are not only efficient and sustainable, but are also autonomously and intelligently optimized for human health and well-being.