3.1. Participants
To investigate users’ emotional experiences in a cultural game under different interaction modalities, participants were recruited via the Wenjuanxing v2.2.6 program on WeChat v1.0.9 between April and August 2025. This study was approved by the Beijing University of Technology Ethics Committee. Written informed consent was obtained from all participants prior to data collection. Participants who successfully completed the experiment received a monetary reward of 20 RMB. A total of 66 volunteers participated in the study. Due to unexpected instrument interruptions and the accidental loss of certain channels in the fNIRS recordings, complete and analyzable datasets were ultimately obtained from only 61 participants. We also administered a pre-experiment questionnaire to collect participants’ background information. In addition to gender, age, and academic major, the questionnaire included items assessing participants’ familiarity with Chinese traditional culture and their digital gaming experience, capturing both cultural cognition and operational competence. Items were scored on a 5-point Likert scale (1 = not at all, 5 = very much).
We conducted a basic descriptive statistical analysis of the questionnaire results; see Table 1. The results show that all participants are full-time college students, ranging in age from 18 to 28 years, with undergraduate, master’s, and doctoral degrees. Their majors include electronic engineering, computer science, industrial design, mechanical engineering, and others. This diversity is intended to ensure a wide range of perspectives and evaluations during the cultural game experience. We also calculated the mean (M) and standard deviation (SD) of key variables. Participants report an average familiarity with Chinese traditional culture of M = 2.18, SD = 0.74, indicating a moderate cultural background. Their average weekly gameplay time is M = 11.37 h, SD = 10.46 h. About 75% report prior experience with PC or console games, but only 18.03% have used Leap Motion or similar gesture-based devices. These results show that participants generally possess sufficient gaming skills to complete the tasks. Meanwhile, cultural familiarity reduces the risk of misunderstanding the content of the game. Limited exposure to gesture interaction also minimizes familiarity bias in the Leap Motion condition. Therefore, participants’ background variables exert minimal influence on the internal validity of this study. As most participants have a comparable baseline in cultural understanding and operational ability, observed emotional and physiological differences are more likely attributable to the interaction modality than to individual variation.
The experiment follows a single-blind design, in which participants are unaware of the research hypotheses and expected outcomes, thereby minimizing subjective bias. This participant setup provides a stable and representative data foundation for the subsequent multimodal data collection and comparison across interaction modalities.
3.2. Materials
In this study, the Quanzhou String Puppet is selected as the cultural carrier based on the digital interactive design of ICH. As an ancient form of traditional folk art, Quanzhou string puppetry is characterized by profound cultural attributes and distinctive artistic features [
37]. The structure of a typical string puppet consists of the puppet head, torso, limbs, control rods, and strings. Each puppet is equipped with more than a dozen to several dozen strings, meticulously connected to various joint parts, forming a complex and precise string control system. This system enables flexible joint movement and allows for highly refined motion control through string manipulation [
38]. The operational uniqueness of Quanzhou string puppetry lies in the intricate string system and the highly skilled string-handling techniques. During performance, puppeteers must precisely modulate the tension and rhythm of each string using delicate coordination of fingers and wrists, thereby enabling the puppet to exhibit a lifelike gait and expressive movements [
39].
Unlike most static or visually dominant forms of ICH, string puppetry emphasizes dynamic manipulation and embodied performance, relying on fine motor skills and long-term embodied memory. The mapping between the puppeteer’s hand movements and the puppet’s actions is fundamental to its cultural expression mechanism. Given the unique operational logic embedded in this traditional heritage, its digital reinterpretation necessitates an interaction modality that maintains both cultural fidelity and operational authenticity. The goal is to construct a digital interaction system capable of fostering physical engagement and re-enacting traditional manipulation techniques, thereby unlocking the embodied cultural experience embedded in this art form.
Based on these cultural performance characteristics, a Leap Motion gesture recognition device is introduced into the experiment as a culturally consistent gesture interaction paradigm. This setup enables users to control the digital puppet through natural hand and wrist movements, simulating traditional string manipulation. Through micro-operations of fingers such as opening and closing, rotating, pushing, and pulling, key movements such as puppet walking, jumping, and waving can be achieved. To allow for comparative analysis, two additional interaction conditions are implemented: keyboard-based interaction and non-interactive viewing. This comparative framework enables us to explore the relationship between interaction modality and user experience.
All interaction prototypes are developed using the Unity3D engine (version 2023.3.0f1c1). The tasks and visual content are kept consistent across all interaction conditions to minimize confounding effects from content variation. The core mechanic requires users to control puppet actions to perform designated cultural movements, preserving the symbolic cultural logic of traditional puppetry while enabling controlled manipulation of interaction variables.
As illustrated in
Figure 1, the three interaction modalities vary in terms of their logical and cultural congruence:
Gesture Interaction: Gesture interaction allows players to engage with the game using intuitive, natural gestures that simulate real-world puppet manipulation. It offers high immersion and strong cultural alignment.
Keyboard Interaction: Keyboard interaction employs traditional key inputs to control the puppet’s arm movements. While moderately immersive, it lacks the physicality and cultural resonance of traditional manipulation.
No Interaction: In the non-interaction mode, participants simply watch a prerecorded gameplay video without providing any input, representing a passive mode of cultural reception with low engagement and limited cultural transmission.
To enable smooth and natural gesture interaction, the Leap Motion controller is used for hand-tracking. This device employs high-precision infrared sensors to capture users’ hand movements in real time without noticeable latency. Specifically, the height of the user’s index and ring fingers is mapped to the puppet’s right and left arm movements, as shown in the red circle in
Figure 1(a). The position of the hand determines the puppet’s spatial coordinates within the game environment. Compared to conventional motion-tracking solutions, Leap Motion offers advantages such as portability, high responsiveness, and intuitive operation, making it particularly suitable for interactive scenarios that require detailed hand motion, such as string puppetry. In contrast, the keyboard-based control uses the Shift key for left-arm movement, the Space bar for right-arm movement, and the “A” and “D” keys for directional navigation. The non-interactive condition involves a timed playback of a prerecorded gameplay session.
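To make the control logic concrete, the following Python sketch illustrates the mapping described above under stated assumptions. The actual prototype is built in Unity3D with the Leap Motion device; the `Puppet` class, the normalization range, and the finger-height values here are hypothetical placeholders rather than the project’s real implementation.

```python
from dataclasses import dataclass

@dataclass
class Puppet:
    left_arm_lift: float = 0.0   # 0 = arm lowered, 1 = fully raised
    right_arm_lift: float = 0.0
    x: float = 0.0
    y: float = 0.0

def normalize(height_mm: float, low: float = 80.0, high: float = 250.0) -> float:
    """Map a fingertip height above the sensor (mm) onto a 0-1 arm-lift value (assumed range)."""
    return max(0.0, min(1.0, (height_mm - low) / (high - low)))

def update_from_gesture(puppet: Puppet, index_tip_mm: float, ring_tip_mm: float,
                        palm_x: float, palm_y: float) -> None:
    """Gesture condition: finger heights drive the arms, palm position moves the puppet."""
    puppet.right_arm_lift = normalize(index_tip_mm)   # index finger -> right arm
    puppet.left_arm_lift = normalize(ring_tip_mm)     # ring finger  -> left arm
    puppet.x, puppet.y = palm_x, palm_y               # hand position -> puppet coordinates

def update_from_keyboard(puppet: Puppet, keys: set, step: float = 0.05) -> None:
    """Keyboard condition: Shift/Space raise the arms, A/D move the puppet left/right."""
    puppet.left_arm_lift = 1.0 if "shift" in keys else 0.0
    puppet.right_arm_lift = 1.0 if "space" in keys else 0.0
    puppet.x += step * (("d" in keys) - ("a" in keys))

# Example: one tracking sample with the index finger raised high and the ring finger low.
p = Puppet()
update_from_gesture(p, index_tip_mm=220.0, ring_tip_mm=100.0, palm_x=0.3, palm_y=0.1)
print(p)  # right arm nearly fully raised, left arm only slightly raised
```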
To ensure that the gesture design and narrative expression of the digital puppet system avoid cultural misrepresentation, a validation process is conducted through expert review and literature-based justification.
Semi-structured interviews are used in this study because they are both flexible and broadly applicable. An interview outline can be drawn up in advance, but it need not be followed rigidly and can be adjusted according to the interviewee, the flow of the conversation, and its content [40]. This study therefore adopts semi-structured interviews with highly representative stakeholders to obtain the most direct and candid perceptions of cultural experts regarding the current status and shortcomings of the digital design of puppetry. Three categories of stakeholders are selected: a puppetry expert performer, a puppetry show organizer and manager, and a researcher focused on cultural heritage. The details of the respondents are shown in
Table 2.
The interviews are conducted through the Tencent Meeting online platform, and each lasts around 30 min. First, we ask about background information. Then, after showing the digital game developed by our team, which is also the study material used in our experiment, we pose questions about the puppetry gestures, overall narrative, and digital form to learn whether the gesture–action mappings and the narrative structure of the digital puppetry system align with traditional string puppetry. The specific questions are shown in
Table 3.
All three experts gave positive feedback on these questions. They confirm that the design retains the expressive qualities of traditional string puppetry and represents a culturally recognizable form of performance, particularly in its gestures and manner of operation. Although the game is presented in a relatively simple form, it offers novel insights and a creative route for developing traditional puppetry, especially with respect to the needs of younger audiences. In actual performances, the number of strings is usually larger and the manipulation methods are more complex; nevertheless, the experts consider this digital game sufficient for testing the form and a valuable way to promote the communication of traditional puppetry. Their responses reflect a high degree of cultural consistency and endorsement of our digital approach. They also express approval of the game’s narrative and welcome new repertoire stories rooted in contemporary society. All three experts highly praise the new digital approach to redesigning puppetry culture. They further suggest diversifying the types of puppet shows by incorporating glove puppet and rod puppet shows, since string puppetry is popular mainly in the southern regions of China, whereas rod puppetry is more commonly disseminated in the north. Designing distinct digital interaction methods for different types of puppets would enhance cultural dissemination and broaden audience reach.
Additionally, the design approach is supported by prior research on digital puppetry systems. Previous studies confirm that well-designed interactive and gesture-based systems are capable of preserving essential cultural elements [
41]. For instance, Zhang demonstrates that hand gestures such as opening and clenching can replicate traditional manipulation techniques like twisting and rubbing, thereby maintaining performative authenticity [
42]. Antonijoan et al. show that tangible puppets controlling virtual avatars help expand narratives without distorting cultural meaning [
43]. Wang further concludes that digital puppetry systems support embodied learning and contribute to the transmission of traditional skills and knowledge [
44]. These findings reinforce the cultural validity of the proposed system design.
To ensure that the gesture-based interaction does not introduce discomfort or excessive task demands that could confound the interpretation of physiological responses (e.g., mistaking discomfort for emotional arousal), a usability evaluation is conducted. The evaluation framework is informed by gestural interaction usability heuristics [
45] and adapts quantitative measures from established gesture interface evaluation studies [
46,
47]. Three critical dimensions are examined: learnability, cognitive load, and fatigue. See
Table 4 for the usability evaluation scale. A total of fourteen 5-point Likert scale items are used, with scores ranging from 1 (“strongly disagree”) to 5 (“strongly agree”). Positive items are reverse-coded to ensure that lower scores consistently indicate better usability. The median reference value is set to 3 for subsequent non-parametric testing.
Internal consistency of the scales is satisfactory to excellent, with Cronbach’s α of 0.746 for learnability, 0.750 for cognitive load, and 0.814 for fatigue (overall α = 0.832), as shown in
Table 4. Normality tests (Shapiro–Wilk) indicate that all item distributions significantly deviate from normality (
p ≤ 0.05); thus, Wilcoxon signed-rank tests with continuity correction (10,000 iterations) are applied to compare each item’s distribution against the median.
Table 5 summarizes descriptive statistics and test results for each item.
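For transparency, a minimal sketch of this scoring and testing pipeline is given below, assuming item responses are held in a NumPy array and a recent SciPy version is available. The example data, the reverse-coding flags, and the use of the normal approximation with continuity correction (rather than the 10,000-iteration procedure reported above) are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

MEDIAN_REF = 3                    # neutral point of the 5-point scale
rng = np.random.default_rng(0)

# Illustrative responses: one row per participant (n = 61), one column per item.
responses = rng.integers(1, 6, size=(61, 3)).astype(float)
reverse_coded = [True, False, False]   # e.g., positively worded learnability items

for j in range(responses.shape[1]):
    item = responses[:, j]
    if reverse_coded[j]:
        item = 6 - item            # reverse-code so lower scores = better usability

    # Shapiro-Wilk: p <= 0.05 indicates deviation from normality,
    # motivating a non-parametric test against the scale midpoint.
    w, p_norm = stats.shapiro(item)

    # Wilcoxon signed-rank test of the item against the median reference value,
    # with continuity correction (normal approximation).
    res = stats.wilcoxon(item - MEDIAN_REF, correction=True, method="approx")
    direction = "below" if np.median(item) < MEDIAN_REF else "not below"
    print(f"item {j}: Shapiro W={w:.3f} (p={p_norm:.3f}), "
          f"Wilcoxon p={res.pvalue:.4f}, median {direction} {MEDIAN_REF}")
```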
Results indicate that all learnability items score significantly below the median (all p < 0.05 and all Z < 0; the Z-score is a standardized test statistic indicating the direction and strength of the systematic deviation). Because the learnability items are reverse-coded (1 = most learnable, 5 = not learnable at all) to stay consistent with the other two dimensions, this suggests that participants found the gesture interaction intuitive and easy to understand from the outset. Within the cognitive load dimension, all items except mental demand (C1) and time demand (C3) are significantly below the median (p < 0.05 and Z < 0), indicating that the gesture interface imposes a low cognitive load during use. All fatigue-related items are significantly below the median (all p < 0.05 and all Z < 0), with particularly low scores for energy depletion (F4) and motivation drop (F5), suggesting that the gesture interface does not cause notable physical or mental fatigue even during sustained use.
Taken together, the findings demonstrate that the gesture interface achieves high learnability, imposes minimal physical and mental strain, and maintains ergonomic comfort. This supports the conclusion that participants’ physiological responses in the main experiment are unlikely to be confounded by discomfort or usability issues, reinforcing the ecological validity of the emotional engagement results.
Overall, the design aims to simulate real-world cultural practice processes while controlling for experimental consistency. This approach maximizes the comparability of emotional, cognitive, and physiological responses elicited by different interaction modalities. The findings serve as an empirical foundation for future studies on the role of cultural congruence and interactional immersion in shaping user emotional experiences in culturally themed digital interaction design.
3.3. Experimental Procedure
This study adopts a within-subject experimental design, in which each participant is asked to evaluate their personal gameplay experience with the Leap Motion system, keyboard interaction, and non-interactive viewing in order to determine individual preferences. To counterbalance potential carryover effects inherent in repeated-measures designs, participants are randomly assigned to different groups and exposed to the three systems in varied sequences, as illustrated in
Figure 2. Specifically, a randomized sequence list of the three interaction modes was generated using an online random sequence generator prior to the experiment, and each participant was assigned a corresponding experience order accordingly. This approach effectively mitigates potential fatigue or learning effects associated with fixed sequences in within-subject designs, thereby improving the validity of the data and ensuring the fairness of cross-condition comparisons.
Prior to the experiment, administrators assist participants with the setup and calibration of physiological monitoring equipment, including an fNIRS device, EDA sensors, and an infrared thermal imaging camera. The entire experimental procedure is conducted under the supervision of two administrators and consists of the following six steps, as shown in
Figure 3.
Step 1: Participants are asked to provide informed consent for the use of fNIRS, EDA sensors, and an infrared imaging camera to record their physiological responses throughout the experiment.
Step 2: Administrators introduce the task for each cultural experience game: “Use your hands or the keyboard to control the character’s movements. When the character reaches the designated position, the level is completed.” For the non-interactive condition, the instruction is: “Simply watch the video. No action is required.”
Step 3: Participants engage with the first assigned cultural heritage interaction game for 50 s, followed by a 50 s rest period. This process is repeated three times. During this phase, the fNIRS and EDA sensors continuously and automatically record participants’ physiological activity, while the infrared imaging is operated manually and recorded by the experimenter using specialized software.
Step 4: After completing the first task, participants rest for 5 min before proceeding to the next interaction condition, repeating the procedures outlined in Step 3.
Step 5: After each interaction session, participants are given a 5 min interval to complete a subjective gameplay experience questionnaire, assessing their perceived experience during the game.
Step 6: Upon completing all three interaction conditions, participants are asked to indicate which cultural heritage experience game they prefer and explain why. The rationale for asking a comparative preference question rather than an absolute evaluation is to elicit more authentic insights into user preferences through relative judgments, which tend to produce more reliable results.
Throughout the experiment, two categories of data are collected: physiological signal data and subjective rating data. The physiological signals comprise three modalities: fluctuations in cerebral oxygenation obtained through fNIRS, variations in skin conductance recorded by EDA sensors, and thermal distribution images of the facial regions captured using an infrared thermal imaging camera. All physiological data are recorded as continuous time-series signals. The subjective data consist of questionnaire scale scores.
Figure 4 presents schematic representations and sample waveforms corresponding to the three physiological signal types.
The synchronized acquisition and analysis of both physiological and subjective data ensures multidimensional evidence for evaluating the effectiveness of different interaction modes, thereby providing a robust foundation for subsequent assessments.
3.5. Measurement
3.5.1. fNIRS
In this study, the brain oxygen β-values are primarily measured to reflect users’ cultural experience and their perception of ICH. For this experiment, the Photon Cap C20 system (Cortivision, Lublin, Poland) is utilized to collect real-time brain oxygen signals from participants’ prefrontal cortex regions. The sampling frequency is 10 Hz, with wavelengths of 760 nm and 850 nm. Using the 10–20 coordinate system, 21 probes (11 light sources and 10 detectors) are placed on the participants’ left and right prefrontal areas, covering three major functional regions: the orbitofrontal cortex (OFC), the ventrolateral prefrontal cortex (VLPFC), and the dorsolateral prefrontal cortex (DLPFC) [
48], as shown in
Figure 5. The final measure of neural activation strength is quantified by the β-value of changes in oxygenated hemoglobin concentration (HbO). Changes in HbO concentration, especially in the prefrontal cortex, are used to reflect affective responses during cultural interaction tasks, such as attention, engagement, and emotional regulation, which represent the emotional experience that culture brings to users.
The system’s accompanying software, Cortiview (version 1.11.1), is used to record the entire process, including both the interactive task segments and the corresponding resting segments. To minimize motion artifacts, we conduct IMU (Inertial Measurement Unit) calibration through the Cortiview software. This calibration captures real-time head motion parameters (accelerations and angular velocities across three axes) and adjusts the signal quality accordingly. The IMU module helps ensure that only stable head positions are accepted at the start of each task. During recording, participants are also instructed to minimize unnecessary movement, especially during gesture-based tasks. Through this combination of IMU-based signal correction and behavioral control, the potential impact of motion artifacts is effectively reduced. For each interaction modality, three task repetitions are conducted, and the value averaged across the three trials is calculated. To control for individual baseline variability, we subtract the corresponding resting-state baseline from the averaged task-state value. This resting-state measurement serves as the participant’s emotional baseline, yielding a baseline-normalized activation measure that more accurately reflects task-induced neural responses. This method ensures that the final values represent activation changes specifically attributable to the interaction task, rather than individual emotional or physiological differences at baseline.
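The trial-averaging and baseline-subtraction logic can be summarized in a few lines, assuming the per-block values for one channel are already available; the numbers and variable names below are purely illustrative.

```python
import numpy as np

# Illustrative per-channel values for one participant and one interaction modality.
task_trials = np.array([0.42, 0.51, 0.47])   # e.g., mean HbO-derived value per 50 s task block
rest_trials = np.array([0.30, 0.28, 0.33])   # matching 50 s resting blocks

task_mean = task_trials.mean()               # average across the three repetitions
rest_baseline = rest_trials.mean()           # participant-specific emotional baseline

# Baseline-normalized activation: task-induced change relative to rest.
activation = task_mean - rest_baseline
print(f"baseline-corrected activation = {activation:.3f}")
```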
3.5.2. EDA
Galvanic skin response (GSR) sensors are employed in this study to monitor participants’ emotional physiological reactions. Because of their sensitivity to sympathetic nervous system activity, GSR signals are widely used in research on emotional arousal, stress perception, and interactive experiences. In the experiment, the ErgoLAB Human Factors Experimentation Platform and the ErgoLAB EDA Wireless Galvanic Skin Sensor (Kingfar, Beijing, China) are used to collect participants’ skin conductance signals. For the EDA indicator, the collection range is 0 to 3 μS with an accuracy of 0.01 μS, providing high temporal precision to ensure synchronization with events. The EDA sensors are attached to the index and middle fingers of the participant’s non-dominant hand, continuously recording the dynamic changes in skin conductance. The primary metrics recorded include skin conductance (SC) and its tonic and phasic components, which are used to assess the intensity of emotional arousal and the activation level of the sympathetic nervous system during the gaming experience, reflecting the real-time intensity of emotional arousal during cultural interaction. After the experiment begins, EDA data are collected synchronously during both the resting and interactive phases of each task, aligned with the brain oxygen timeline. The final EDA index is obtained by subtracting the resting-state baseline from the task-state average value. This baseline-corrected EDA reflects sympathetic nervous system activation specifically induced by the interaction experience, minimizing the influence of individual baseline arousal levels.
3.5.3. IRT
Infrared thermography (IRT) is used to track subtle temperature changes in facial areas, which are associated with emotional states such as stress, excitement, and engagement. The infrared camera used in our experiment is the KIR-2008z (Huajingkang Optoelectronics Technology Co., Ltd., Wuhan, China), which features a high-sensitivity, uncooled infrared focal plane detector, excellent imaging circuit components, and optical and display systems, providing superior infrared imaging performance. The camera’s optical resolution is 384 × 288 pixels, with a measurement range from 30 °C to 42 °C and a temperature measurement accuracy of ±0.3 °C. The field of view is 44.3° × 34.0°. In the experiment, participants are instructed to face the thermal camera, which is positioned 0.5 m away from them.
The application used for data collection is the KIR-2008Z Infrared Thermal Imaging Health Management System (version 5.6.0), which receives data from the thermal camera. Using this software, thermal images are acquired according to the following procedure. During the resting period before each game session, a thermal image is captured every 10 s, resulting in 4 thermal images per game session’s resting phase, for a total of 12 images across the three game sessions. The average temperature of the region of interest (ROI) in this state is taken as the participant’s baseline temperature. After the game begins, during each of the three 50 s gameplay periods, a thermal image is captured every 10 s, resulting in 4 thermal images per 50 s period and 12 thermal images per game session. The temperature of the ROI during these periods is recorded as the experimental temperature. The final IRT measure is derived by subtracting the resting-state baseline temperature from the average task-state temperature. This baseline correction isolates the thermal responses specifically induced by the interactive task and minimizes the impact of inter-individual variability.
3.6. Data Preprocessing
In this study, to improve the quality of the fNIRS brain oxygen data and ensure its applicability for subsequent statistical analysis, the NIRS-KIT V3.0 Beta (a MATLAB toolbox) developed by the State Key Laboratory of Cognitive Neuroscience and Learning at Beijing Normal University is utilized for preprocessing task-related data [
49]. This toolbox supports graphical operation and various standardized data processing workflows, making it suitable for near-infrared brain imaging data in brain activation research. First, raw data exported from the Cortview system is imported into MATLAB R2024a and converted into a data structure compatible with NIRS-KIT V3.0 Beta. Then, the following preprocessing steps are performed on each participant’s data: (1) The raw oxygenated HbO concentration time series is trimmed to remove irrelevant time intervals. The data is segmented based on the recorded start and end timestamps of each trial. Only the time periods corresponding to the resting phase and the task phase are retained, resulting in a total of 24 min and 30 s of usable data per participant. All unrelated intervals, such as instruction time and system initialization, are excluded to avoid introducing noise into the analysis. (2) A first-order polynomial regression model is applied to estimate the underlying linear trend in the time series. The estimated trend is then subtracted from the original HbO concentration data to eliminate slow drifts and preserve task-evoked hemodynamic fluctuations. (3) To minimize motion-related artifacts, the Temporal Derivative Distribution Repair (TDDR) algorithm is applied. This method corrects sudden spikes in the data by modeling the distribution of temporal derivatives, effectively restoring signal continuity while preserving underlying neural signals. (4) Artifacts unrelated to the experimental data are eliminated by using the filtering module in NIRS-KIT, which preserves low-frequency brain activity related to the task and suppresses high-frequency physiological noise. A 3rd-order Butterworth Infinite Impulse Response (IIR) bandpass filter is applied with a frequency range set between 0.01 Hz and 0.1 Hz. This filtering process removes high-frequency physiological artifacts such as cardiac signals and motion noise, as well as low-frequency drift. It ensures the retention of meaningful hemodynamic signals that reflect task-related neural activation, consistent with standard frequency characteristics in task-based fNIRS research [
50]. After preprocessing, the data is saved in the NIRS-KIT standard format (.mat file), containing three types of hemoglobin concentration data (oxyData, dxyData, totalData), channel information, and task reference wave information. Following the preprocessing of the task-related fNIRS data, we perform individual-level statistical analysis based on the General Linear Model (GLM), extracting the β-values corresponding to task activation for each channel, which serves as an indicator of neural activation strength. The model is formulated as in (1):
$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_L x_L + \varepsilon$ (1)
In this model, y represents the dependent variable, which is the fNIRS signal from a specific observation channel. x_1, x_2, …, x_L denote the independent variables, which can be understood as the individual hemodynamic responses elicited by the different task conditions. β_1, β_2, …, β_L represent the model coefficients for the independent variables, indicating the extent to which each variable contributes to the observed fNIRS signal. The portion of the fNIRS signal that cannot be explained by the explanatory variables is referred to as the residual ε. Considering all observation time points of the fNIRS signal y_1, y_2, …, y_T, where T is the total number of observation points, the equation can be expressed in matrix form, as in (2):
$Y = X\beta + \varepsilon$ (2)
In this formula, Y is the observed data matrix, X is the design matrix, β is the parameter vector to be estimated, and ε is the residual vector. Given the design matrix X and the observed data Y, the model parameters are estimated using the ordinary least squares method, as in (3):
$\hat{\beta} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} Y$ (3)
Here X^T represents the transpose of X, and the estimated parameters β̂ combine the independent variables to produce predicted values Ŷ = Xβ̂ that approximate the observed data, thereby minimizing the residual ε. The design matrix composed of all independent variables is the core of fNIRS data modeling and plays a decisive role in the quality of the modeling and the accuracy of the estimated individual hemodynamic response indicators [50,51]. In this study, the β_1, β_2, …, β_L derived from the model are used as the primary indicators of task-evoked neural activation. This standardized preprocessing workflow ensures the temporal consistency and comparability of the fNIRS data.
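As a rough illustration of preprocessing steps (2) and (4) and of Equations (1)–(3), the following sketch applies first-order detrending, a 3rd-order Butterworth band-pass filter (0.01–0.1 Hz), and an ordinary-least-squares fit to a synthetic single-channel signal. The actual analysis is performed in NIRS-KIT; the synthetic data, the simplified boxcar regressor (no HRF convolution), and the omission of TDDR are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 10.0                              # fNIRS sampling rate (Hz)
n = int(FS * 600)                      # illustrative 10 min of one channel
t = np.arange(n) / FS

rng = np.random.default_rng(1)
hbo = 0.002 * t + rng.normal(0, 0.05, n)       # slow drift + noise (synthetic HbO trace)

# Simplified boxcar task regressor: 50 s task / 50 s rest, repeated (no HRF convolution).
task = (((t // 50).astype(int)) % 2 == 0).astype(float)
hbo += 0.3 * task                               # inject a synthetic task response

# Step (2): remove the first-order polynomial trend.
coeffs = np.polyfit(t, hbo, 1)
hbo_detrended = hbo - np.polyval(coeffs, t)

# Step (4): 3rd-order Butterworth band-pass filter, 0.01-0.1 Hz, applied forward and backward.
b, a = butter(3, [0.01, 0.1], btype="bandpass", fs=FS)
hbo_filtered = filtfilt(b, a, hbo_detrended)

# GLM: Y = X beta + eps, solved by ordinary least squares as in Equation (3).
X = np.column_stack([task, np.ones(n)])         # task regressor + constant term
beta, *_ = np.linalg.lstsq(X, hbo_filtered, rcond=None)
print(f"estimated task beta = {beta[0]:.3f}")
```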
For the EDA data, a standardized preprocessing pipeline comprising signal denoising, feature extraction, and baseline correction is implemented using the ErgoLAB Human Factors Experimentation Platform to ensure data quality and comparability. Under synchronized experimental conditions, ErgoLAB v3.17.16 records the time points of task onset, task offset, and rest states, ensuring that the physiological signals correspond accurately to the experimental conditions and that EDA data are mapped precisely onto task events. The preprocessing of the raw EDA data is conducted as follows: (1) To remove high-frequency noise and enhance signal clarity, a Gaussian filter with a window size of 5 samples is applied for smoothing. Gaussian filtering is a linear smoothing technique designed to reduce Gaussian noise in EDA signals; because such noise typically resides in the high-frequency domain, the Gaussian filter effectively suppresses noise while preserving the underlying physiological signal. Its core function is the Gaussian kernel, as in (4):
$G(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{x^{2}}{2\sigma^{2}}\right)$ (4)
where σ determines the width of the smoothing window.
(2) The original SC signal comprises two components: a tonic component and a phasic component. The tonic component is represented by the Skin Conductance Level (SCL), which reflects the participant’s overall arousal level during a given task or resting period. For a given time interval containing N samples, SCL is computed as the arithmetic mean of the SC samples in that interval, as in (5):
$\mathrm{SCL} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{SC}_{i}$ (5)
(3) SCR features are extracted using the SCR analysis module in ErgoLAB, with peak detection sensitivity set to medium, a maximum rise time of 4 s, a half-recovery time of 4 s, and a minimum response amplitude threshold of 0.03 μS. The SCR amplitude is computed as the peak value minus the baseline value, as in (6):
$\mathrm{SCR}_{\mathrm{amplitude}} = \mathrm{SC}_{\mathrm{peak}} - \mathrm{SC}_{\mathrm{baseline}}$ (6)
where SC_peak is the maximum SC value within the post-stimulus window and SC_baseline is the average SC value within the pre-stimulus baseline window. Event-related analysis windows are set to 1–4 s after the onset of each stimulus to ensure the temporal relevance of SCR extraction. (4) To control for inter-individual variability, the SCL and SCR values for each task condition are baseline-corrected by subtracting the corresponding resting-state averages, yielding ΔSCL and ΔSCR. All feature values are expressed as the difference between each task round and the baseline phase to capture the relative changes induced by the interaction task. This enables cross-subject comparison across conditions, revealing how the different interaction modalities affect users’ physiological arousal levels.
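The following sketch illustrates steps (1)–(4) on a synthetic skin-conductance trace. The sampling rate, the kernel width (σ = 1 sample for a 5-sample window), the event timing, and the resting-state value are illustrative assumptions; the actual extraction is performed by ErgoLAB’s analysis modules.

```python
import numpy as np

FS = 32                       # assumed EDA sampling rate (Hz); ErgoLAB's actual rate may differ
rng = np.random.default_rng(2)

# Synthetic skin-conductance trace (microsiemens): tonic level + noise + one phasic response.
n = FS * 30
sc = 1.2 + 0.01 * rng.standard_normal(n)
onset = FS * 10                               # stimulus onset at t = 10 s
sc[onset + FS:onset + 4 * FS] += 0.2          # response within the 1-4 s post-stimulus window

# (1) Gaussian smoothing with a 5-sample window (kernel shape as in eq. 4).
x = np.arange(-2, 3)
kernel = np.exp(-x**2 / (2 * 1.0**2))
kernel /= kernel.sum()
sc_smooth = np.convolve(sc, kernel, mode="same")

# (2) SCL: arithmetic mean of the smoothed signal over the interval of interest (eq. 5).
scl = sc_smooth.mean()

# (3) SCR amplitude: post-stimulus peak minus pre-stimulus baseline (eq. 6), thresholded at 0.03 uS.
baseline = sc_smooth[onset - FS:onset].mean()
peak = sc_smooth[onset + FS:onset + 4 * FS].max()
scr = peak - baseline
scr = scr if scr >= 0.03 else 0.0

# (4) Baseline correction against the resting phase gives delta-SCL / delta-SCR per task round.
rest_scl = 1.15                               # illustrative resting-state SCL
delta_scl = scl - rest_scl
print(f"SCL={scl:.3f} uS, SCR={scr:.3f} uS, dSCL={delta_scl:.3f} uS")
```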
For preprocessing the infrared temperature data, we first define the ROI for analysis. According to previous studies, the nasal tip and bilateral cheeks in the facial region exhibit higher physiological sensitivity and stability in emotional regulation and autonomic nervous system responses [
52]. Among them, the nasal tip area, in particular, shows a significant response to sympathetic nervous system activation, with a notable decrease or increase in temperature under stress, pleasure, or alertness. Therefore, this study selects the nasal tip, left cheek, and right cheek as the primary temperature analysis areas, which have good emotional indicator validity and signal stability. A threshold of approximately 1.3 °C change in facial temperature is considered metrologically significant and consistent with prior studies in emotion thermography [
53]. During the preprocessing of the temperature images, the optical detector first captures the infrared radiation signal from the target area and converts it into Analog-to-Digital (AD) data. The AD data are processed in the camera’s internal processing unit using non-uniformity correction, image filtering, sharpening, and related algorithms to transform the raw digital signal into calibrated temperature data in thermal images. In the infrared thermal images, color represents temperature, with red indicating higher temperatures and blue indicating lower temperatures. Subsequently, we use infrared thermal imaging analysis software to manually select the three ROI areas in each thermal image, with each region set as a 5 × 5 pixel window to represent the target areas, as shown in
Figure 6.
Subsequently, the average temperature value of the pixels within the window is extracted to quantify the thermal change in the region. The temperature change (ΔT_ROI) is defined as a representative indicator of emotional activation level, as in (7):
$\Delta T_{\mathrm{ROI}} = T_{\mathrm{task}} - T_{\mathrm{baseline}}$ (7)
where T_task represents the average temperature of the ROI during the interaction task, and T_baseline is the average temperature during the resting state. The sign of ΔT_ROI reflects the activation level of the autonomic nervous system (particularly the sympathetic nervous system) and is used to indirectly assess the emotional arousal and psychological stress levels of the participants under different interaction conditions.
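To illustrate how the ROI means and ΔT_ROI in Equation (7) can be computed, the sketch below processes synthetic thermal frames. The ROI centre coordinates, frame counts, and array layout are illustrative assumptions; the actual analysis is performed with the thermal imaging software described above.

```python
import numpy as np

rng = np.random.default_rng(3)

def roi_mean(frame: np.ndarray, center_row: int, center_col: int, size: int = 5) -> float:
    """Mean temperature of a size x size pixel window centred on (center_row, center_col)."""
    half = size // 2
    window = frame[center_row - half:center_row + half + 1,
                   center_col - half:center_col + half + 1]
    return float(window.mean())

# Illustrative calibrated temperature maps (384 x 288 pixels, degrees Celsius):
# 12 resting-phase frames and 12 task-phase frames for one game session.
rest_frames = 34.0 + 0.1 * rng.standard_normal((12, 288, 384))
task_frames = 34.3 + 0.1 * rng.standard_normal((12, 288, 384))

# Assumed ROI centres (row, col) for the nose tip, left cheek, and right cheek.
rois = {"nose_tip": (160, 192), "left_cheek": (150, 140), "right_cheek": (150, 244)}

for name, (r, c) in rois.items():
    t_baseline = np.mean([roi_mean(f, r, c) for f in rest_frames])
    t_task = np.mean([roi_mean(f, r, c) for f in task_frames])
    delta_t = t_task - t_baseline          # eq. (7): task-induced temperature change
    print(f"{name}: dT_ROI = {delta_t:+.2f} degC")
```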