1. Introduction
Autism spectrum disorder (ASD) encompasses a heterogeneous set of neurodevelopmental conditions characterized by persistent difficulties in social communication and interaction, often accompanied by restricted or repetitive patterns of behavior and variable profiles of cognitive, motor, and linguistic functioning [
1,
2,
3,
4,
5]. Language-related difficulties are especially relevant in educational contexts, as they may affect expressive communication, pragmatic competence, verbal comprehension, and participation in classroom interaction [
6,
7,
8]. Given this heterogeneity, educational and therapeutic interventions for students with ASD require flexible and personalized approaches capable of adapting to diverse communicative and behavioral profiles [
3,
9].
Moreover, schools face increasing pressure to respond to the growing prevalence of autism while working with limited human, material, and organizational resources [
10,
11]. This mismatch has reinforced interest in technology-enhanced educational solutions that can support inclusion, personalize intervention, and provide objective evidence about students’ responses and progress [
12,
13,
14,
15]. Within this context, socially assistive robots have gained relevance as mediating tools for autism intervention because they can offer highly structured, predictable, and repeatable interactions, characteristics that are often especially appropriate for learners with ASD [
16,
17,
18].
The educational potential of social robotics is particularly significant in the field of communicative and linguistic intervention. Previous research has shown that robot-assisted activities may promote attention, imitation, joint engagement, spontaneous language use, and aspects of dialogic interaction in children with ASD [
19,
20,
21,
22]. This is relevant because communication training for individuals with autism often depends on the interactional conditions under which the task is performed. In this sense, robots can reduce social complexity while maintaining an interactive format by providing interaction patterns that are more predictable, more consistent, and less ambiguous than those of human partners [
16]. Specifically, robot-mediated interaction is characterized by simplified verbal input, reduced variability in social cues, controlled timing of responses, and highly structured turn-taking, which can facilitate processing and reduce cognitive load in individuals with ASD [
17]. This creates an intermediate space between fully human-led instruction and more individualized, technologically mediated support.
This emphasis on predictability is particularly relevant for individuals with ASD, as difficulties in anticipating events and understanding environmental demands have been linked to increased anxiety and challenges in learning processes [
23]. In this context, structured and predictable environments have been shown to reduce behavioral difficulties and support more stable engagement in activities [
24] while also contributing to improved emotional regulation and reduced sensory overload. Furthermore, routines and familiar contexts play a central role in the daily functioning of individuals with ASD, helping to provide stability and reduce stress [
25]. Experimental evidence also suggests that environmental predictability can mitigate anxiety-related responses in autism models [
26]. For these reasons, conducting sessions in a known educational setting has been considered essential for facilitating participation, interaction, and reliable data collection.
Previous research has highlighted that user experience (UX) must be specifically adapted to the cognitive and behavioral characteristics of users with ASD, as standard evaluation methods are often insufficient to capture their interaction needs [27]. Moreover, the design of the interface and interactions directly influences usability, engagement, and acceptance, with inadequate UX potentially leading to rejection of otherwise beneficial systems [
28]. Studies have also shown that individuals with ASD interact with digital interfaces in distinct ways, requiring tailored design strategies to support efficient and meaningful interactions [
29].
From this perspective, teachers’ feedback becomes a key component in interpreting UX within real educational contexts, complementing quantitative data and contributing to the refinement of intervention designs.
The present study was conducted within the framework of the DivInTech project (divintech.es) [
30], which investigates the use of social robots in educational activities designed to support the social and linguistic development of students with ASD. Within this framework, students engage in structured robot-mediated tasks, while behavioral and contextual data are collected to characterize their interaction and progress. Previous project-related work has highlighted the relevance of personalized profiling and data extraction models for adapting interventions to the characteristics of each student [
31,
32]. Although prior studies have suggested that robot-assisted interventions may promote the generalization of social and communicative skills to human interaction contexts [
33], further research is needed to understand how this transfer operates in real educational environments and whether it extends to linguistic performance during classroom activities. In addition, limited attention has been given to how interaction design and UX factors may influence both task performance and the emergence of facilitation patterns suggestive of transfer in these contexts. In this regard, the value of social robots lies in their potential to act as scaffolding tools that may support the transition toward teacher-led interaction.
Therefore, this study presents a multimodal mixed-methods analysis that contributes to the field of human–robot interaction in education in terms of methodology and design. Specifically, the study contributes the following:
- A multimodal evaluation approach that integrates behavioral performance, response time, physiological activation, and teacher-based UX feedback in a naturalistic school intervention;
- An analysis of how interaction design features (e.g., repetition, timing, visual support, interface constraints) shape participation and task execution in children with ASD;
- Preliminary evidence of a context-bound pattern consistent with facilitation, whereby prior robot-mediated structured interaction may be associated with more efficient subsequent teacher-led task performance, particularly in terms of response efficiency.
Accordingly, the value of the paper lies in informing the iterative design and evaluation of social-robot-supported educational practices for children with ASD through the analysis of performance efficiency, usability, and interaction dynamics in structured communicative tasks.
In line with this, the first research question (RQ1) asks to what extent performance in structured communicative–linguistic tasks changes across the intervention and how it differs between robot-led and teacher-led sessions. Since the intervention progressively substitutes the robot with the teacher, this comparison provides a way to explore whether performance differences between robot-mediated and teacher-led interactions may be consistent with a facilitation pattern while acknowledging the possible contribution of practice and order effects.
The second research question (RQ2) involves assessing whether teachers’ qualitative feedback is coherent with the data collected during the activities and whether this feedback contributes to the identification of possible improvements in the intervention design. This perspective can also be understood in terms of UX, which plays a critical role in the effectiveness of technology-based interventions for individuals with ASD.
Taken together, this study addresses the need to better understand how robot-mediated interventions can be effectively integrated into real educational contexts for students with ASD. By simultaneously examining task-related communicative performance and user experience, it moves beyond approaches that focus solely on engagement or isolated behavioral outcomes. The proposed framework combines multimodal data with teacher feedback to provide a more comprehensive view of interaction processes, allowing for the identification of both learning effects and design-related constraints. In doing so, this work contributes to bridging the gap between controlled experimental evidence and classroom implementation, highlighting the role of social robots as structured mediators that may support the transition toward meaningful human interaction while informing the iterative design of more effective and adaptive educational interventions.
2. Materials and Methods
2.1. Intervention Framework and Design
This study was conducted within the framework of the DivInTech project (divintech.es) [
30], which aims to enhance the communicative and linguistic skills of students with ASD through structured interactions with social robots in educational contexts. In the broader DivInTech intervention, a humanoid robot (NAO) was used because of its demonstrated potential to support communication, engagement, and language-related activities in individuals with ASD [
34,
35,
36]. The project is conducted in a familiar school environment to ensure ecological validity and reduce the potential stress associated with novel contexts. This decision is supported by previous research indicating that individuals with ASD show a strong preference for predictability and may experience discomfort in uncertain or unfamiliar situations [
37]. Each activity includes the NAO robot positioned in front of the participant, a tablet for task interaction, a facilitator interface operated by educators, audio–video recording systems, and a wearable biometric device (NoWatch) for physiological data collection. A dedicated network configuration was established to ensure stable communication between devices and prevent technical disruptions during the sessions.
The intervention followed a structured yet adaptable design based on repeated task execution. Sessions were organized as short, game-based activities, which have been shown to provide a natural and engaging context for supporting communication and social skill development in children with ASD [
38]. In addition, evidence from meta-analytic and experimental studies suggests that game-based approaches related to communicative and language skills can contribute to improvements in social behavior, cognition, and engagement in this population [
39]. Each session consisted of four phases: (i) a welcome phase, in which the participant was greeted and invited to begin; (ii) a core activity phase, in which a task was performed and repeated three times; (iii) a final motivational activity, adapted to the participant’s preferences and encouraging spontaneous interaction; and (iv) a closing phase.
The activities were structured following a pattern aligned with the different domains of cognitive processing [
40,
41,
42,
43] and with the modular interaction with attentional and memory systems. These included:
- (a) Quick response tasks aimed at lexical access and categorization.
- (b) Image description tasks focused on vocabulary and sentence construction.
- (c) Storytelling activities promoting narrative production and turn-taking.
- (d) Emotion classification tasks addressing emotional recognition and verbalization.
- (e) Memory-based tasks targeting attention and recall.
These activities were conceived as functional probes to explore how different task demands influence performance and interaction. They were designed in coordination with the pedagogical teams and diversity support staff of each school, the teachers of the autism support classrooms, two pedagogues and psychologists specializing in ASD, and the technical team that implemented the activities on the tablet in synchrony with the robot.
The DivInTech intervention was structured as a sequence of repeated sessions (morning–afternoon) distributed across five days in the first week and repeated in the second week, allowing for the analysis of performance changes over time. A key element of the design was the progressive transition from robot-mediated to teacher-led interaction. Initially, all sessions were conducted with the robot (the five sessions of the first week and the first two sessions of the second week). In later stages, the robot was gradually replaced by the educator, who conducted the same or comparable activities (the last three sessions of the second week).
This design of the DivInTech activities allows for comparisons between robot-assisted and human-mediated interactions, in line with previous research highlighting differences in how information is processed depending on the instructional agent [
20,
44]. Furthermore, the progressive substitution of the robot by the educator reflects the aim of supporting the transition of learned behaviors to more natural social contexts, as sustained communicative behaviors are more likely to be consolidated when reinforced within real human interactions [
45].
The study was approved by the Research Ethics Committee of Ramon Llull University. Written informed consent was obtained from the participants’ legal guardians. All procedures complied with ethical standards for research involving minors and vulnerable populations. Participation was voluntary, and the sessions were adapted to the participants’ emotional state to ensure their well-being throughout the intervention.
2.2. Data Collection, Instrumentation, and Participants
A multimodal data collection strategy was implemented to capture behavioral, linguistic, and physiological responses during the sessions. Behavioral and linguistic data were obtained from audio–video recordings, which were transcribed and analyzed to extract variables such as response accuracy, response time, speech production, and task performance patterns. These measures provided direct indicators of communicative and linguistic performance. The selected behavioral variables were used as functional indicators of performance during structured communicative–linguistic tasks, capturing dimensions such as response efficiency, task completion, verbal production during the activity, spontaneity, and support needs (see [
43]). Accordingly, the results should be interpreted as reflecting changes in task-related communicative performance under specific interaction conditions, not as direct evidence of generalized language acquisition.
Physiological data were collected using a wearable device (NoWatch), focusing on heart rate (HR), heart rate variability (HRV), and electrodermal activity (EDA), which were used as indicators of emotional activation and stress [
46,
47,
48]. Biometric data were segmented according to the start and end timestamps of each session to ensure alignment with the experimental protocol. Heart rate (HR) values were extracted directly from the processed output provided by the wearable device and segmented according to session boundaries. Heartbeat data were additionally used to compute heart rate variability (HRV) using the root mean square of successive differences (RMSSD). Electrodermal activity (EDA) signals were preprocessed to remove artifacts using a threshold-based method applied to the first-order difference in the signal. Specifically, samples exhibiting abrupt changes greater than ±0.2 units between consecutive time points were identified as artifacts and removed. This procedure was applied independently within each session to prevent cross-session contamination. No interpolation was performed after artifact removal (additional data can be found in [
43]).
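The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: it assumes that heartbeat data are available as inter-beat intervals in milliseconds, that signals are stored as (timestamp, value) pairs, and that the later sample of each offending pair is the one discarded. All function names are hypothetical.

```python
import math

def segment(samples, start, end):
    """Keep the (timestamp, value) samples recorded within one session."""
    return [(t, v) for t, v in samples if start <= t <= end]

def rmssd(ibi_ms):
    """Root mean square of successive differences between inter-beat intervals."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    if not diffs:
        raise ValueError("need at least two inter-beat intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def remove_eda_artifacts(eda, threshold=0.2):
    """Drop samples whose change from the previous raw sample exceeds the threshold.

    Assumption: the first-order difference is taken on the raw signal and the
    later sample of each pair is removed. No interpolation is performed.
    """
    if not eda:
        return []
    kept = [eda[0]]
    for prev, cur in zip(eda, eda[1:]):
        if abs(cur - prev) <= threshold:
            kept.append(cur)
    return kept
```

Under this convention an isolated spike removes two samples (the jump up and the jump back down), because both first-order differences exceed the threshold. Applying the functions to one session at a time, after `segment`, mirrors the per-session processing described in the text.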
Interaction data were obtained through the robot’s camera, which allowed for the tracking of head movements and gaze direction. These data were used to estimate attentional focus during the activities, distinguishing between attention directed toward the task, the educator, or other elements in the environment. All the data sources were synchronized to enable integrated analysis across modalities.
The collected data were obtained from interactions with three educational centers, La Salle La Seu d’Urgell (School 1 (S1)), La Salle San Ildefonso (S2) in Santa Cruz de Tenerife, and La Salle Sagrado Corazón (S3) in Madrid, all of which are located in Spain. The participants with ASD were anonymized using alphanumeric codes (U# for S1, N# for S2, and M# for S3), as were the neurotypical participants (NTU#, NTN#, and NTM#, respectively). The number indicates the selected participants within the total pool of candidates proposed by each school, following the protocol published in [
31].
The initial sample consisted of 12 ASD participants from S1, five from S2, and seven from S3, all of whom were enrolled in Spanish primary or secondary education. This sample was predetermined to include participants with a prior diagnosis and validated access to the ISIE (the classroom for Intensive Support of Inclusive Education). Schools were selected according to two criteria: (i) having a program that addresses the diversity of students with ASD and supports the personalization of learning exercises, monitoring, and support; and (ii) having a dedicated space where these students can interact with their teachers and peers, engage in personalized activities, and take part in future interactions with robots. If a school met both criteria, the next step was to categorize the participants. Symptom intensity was systematically categorized and prioritized across three core domains: social communication, social interaction, and restricted and repetitive behaviors (DSM-5-TR).
Following the protocol described above [
31], potential participants were prioritized, enabling the selection of the focal case study and the identification of suitable alternatives. Finally, seven participants were selected (see Table 1 for a description). Additionally, neurotypical participants were recruited as part of the broader project protocol to provide matched reference cases for task calibration and future comparative analyses. This approach is consistent with autism research, where comparisons with typically developing peers are commonly used to contextualize communicative, pragmatic, and interactional profiles and to support the interpretation of observed differences [
49,
50,
51]. However, neurotypical data were not included in the analyses reported in this manuscript, which focuses exclusively on the sampled participants with ASD.
2.3. Teacher Feedback and User Experience Evaluation
The interventions were qualitatively evaluated through interviews with the observing teachers given the characteristics of the participants and the methodological constraints associated with the direct assessment of usability and user experience in approximately 10-year-old children with ASD. In this context, obtaining stable and comparable self-reports may be particularly challenging because of the verbal, communicative, and introspective demands involved, which may limit the validity of measures based exclusively on the child’s own responses. For this reason, applied research with neurodivergent populations has frequently relied on proxy informants, such as parents, caregivers, or professionals, especially when the aim is to collect evidence on observable behaviors, interactions with the environment, or functional aspects of performance. Although this strategy does not replace user-centered assessment, it represents a methodologically sound alternative when direct measurement poses substantial barriers or when the analysis focuses on the observable implementation of the intervention [
52].
In the present study, teachers were considered especially appropriate informants because they continuously observed the participants’ interactions with the activity and simultaneously acted as mediators of the intervention by guiding tasks, identifying difficulties, interpreting responses, and adjusting support throughout the process. Their perspective therefore provided access to relevant information that would have been difficult to obtain through conventional questionnaires administered directly to the children. This rationale is also consistent with recent user experience research in the context of ASD, which stresses the need to adapt methods and instruments to the characteristics of this population and highlights the value of including professionals and caregivers when direct evaluation is limited [
53].
Given the limited number of teachers involved, a qualitative approach was considered appropriate, as its strength lies in generating rich and detailed accounts of the phenomena under study rather than in achieving statistical representativeness. In interview-based qualitative research, sample adequacy is typically justified in terms of informational richness, alignment with the analytical purpose, thematic saturation, and the pragmatic constraints of the study context [
54,
55].
Accordingly, qualitative feedback was collected by means of a semistructured interview administered after the intervention and designed according to the Pocket Bipolar Laddering (Pocket-BLA) approach, derived from the Bipolar Laddering Assessment proposed by [
56]. This method structures participants’ responses around positive and negative poles of experience, enabling the systematic identification of strengths, weaknesses, and suggestions for improvement. Its use with a small number of teachers is methodologically defensible because, in usability research, the main goal is often not to estimate population parameters but to detect problems, understand interaction patterns, and identify opportunities for refinement.
Qualitative usability methods have been shown to be effective with reduced samples when the purpose is exploratory or formative [
57]. In autism and educational intervention research, teachers’ perceptions of the acceptability, feasibility, and usefulness of applied practices have been found appropriate for judging their practical implementation and informing subsequent adjustments, particularly in natural classroom settings [
58].
The use of this method is justified by its capacity to capture both favorable and unfavorable aspects of the intervention in a structured yet flexible manner [
55,
56], making it particularly suitable for exploratory evaluations in educational and human–technology interaction contexts to obtain overall perceptions of the activities, including task difficulty, engagement, usability issues, emotional reactions, and suggestions for improvement.
Table 2 presents the positive and negative elements identified by five teachers (one from S1, two from S2, and two from S3). Table 3 presents the categorization of the main negative findings (coded by a single BLA expert), grouped by key findings and treated as structured exploratory usability feedback [59,60,61,62], and Table 4 summarizes, in simplified form, the main participant-related issues identified during data collection.
Across the different intervention contexts, teachers consistently reported that repetition could reduce engagement and lead to anticipation of the robot’s instructions when interaction patterns remained unchanged. They also identified usability challenges, such as difficulties with drag-and-drop interactions (especially in the emotion classification activity, activity 4) or insufficient visual support, which affected task performance, particularly among the participants with ASD. Furthermore, aspects such as task duration, timing, and cognitive load were reported to influence attention and participation.
From this perspective, teacher feedback was considered an indicator of UX within the intervention, complementing the quantitative measures.
2.4. Data Analysis
The data analysis was based on quantitative performance measures along with qualitative teacher feedback to evaluate the consistency between observed outcomes and teachers’ perceptions and to identify potential improvements in the UX of the intervention. This approach is particularly relevant in the context of ASD, where interaction design, usability, and predictability significantly influence engagement, performance, and emotional responses [
27,
63].
Teacher feedback, collected through a semistructured interview at the end of the intervention, was analyzed thematically and contrasted with the quantitative data obtained during the sessions. This comparison aimed to determine whether perceived task difficulty, engagement, and usability issues were supported by objective measures, such as accuracy, response time, and physiological indicators. In this sense, teacher feedback was used as a contextualized indicator of UX, enabling the identification of consistencies and discrepancies between subjective evaluations and measured performance.
In parallel, quantitative analyses were conducted to examine changes in task-related communicative performance across conditions, particularly between robot-mediated and teacher-led interaction contexts. On the basis of the intervention design and the qualitative findings, three main aspects were examined:
- Agent effect, comparing performance efficiency between robot-led and teacher-led sessions, as measured through behavioral, verbal, and physiological metrics. This corresponds to RQ1.
- Effect of repetition, analyzing differences across repeated executions of the same activities (related to RQ2).
- Activity-specific effect, comparing the activity for which UX issues were identified with the remaining activities (also related to RQ2).
To operationalize these analyses, several quantitative metrics were extracted from the recorded data. For each activity, the analysis window was defined from the participant’s first button press to the participant’s last response to the main interlocutor. Behavioral and linguistic performance was assessed through response time (time elapsed between the main interlocutor prompt and the participant’s response), accuracy (percentage of correct responses), frequency (number of responses per activity), average duration (mean duration of individual responses), total duration (total time spent on the activity), production (amount of verbal output), and spontaneity (proportion of unprompted or self-initiated responses). In addition, interaction-related variables such as the number of interventions from the robot or the educator were considered as indicators of required support.
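A minimal sketch of how these metrics could be derived from coded responses is given below. The record layout (`Response`), its field names, and the choice to express accuracy and spontaneity as percentages of all responses are assumptions made for illustration; verbal production is omitted because its unit of measurement is not specified here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class Response:
    onset_s: float                    # when the participant started responding (s)
    offset_s: float                   # when the participant stopped responding (s)
    correct: bool                     # whether the response was scored as correct
    prompt_s: Optional[float] = None  # prompt time; None marks a self-initiated response

def activity_metrics(responses, window_start_s, window_end_s):
    """Summarize one activity over its analysis window (hypothetical layout)."""
    if not responses:
        raise ValueError("no responses in the analysis window")
    n = len(responses)
    prompted = [r for r in responses if r.prompt_s is not None]
    return {
        # Latency is only defined for responses that follow a prompt.
        "response_time": mean(r.onset_s - r.prompt_s for r in prompted) if prompted else None,
        "accuracy": 100 * sum(r.correct for r in responses) / n,
        "frequency": n,
        "average_duration": mean(r.offset_s - r.onset_s for r in responses),
        "total_duration": window_end_s - window_start_s,
        "spontaneity": 100 * (n - len(prompted)) / n,
    }
```

Here the window bounds would correspond to the participant's first button press and last response, as defined above.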
Physiological measures included heart rate, heart rate variability (HRV), and electrodermal activity (EDA), which were used as indicators of emotional activation and cognitive load. Prior to statistical analysis, normality assumptions were assessed and were not met for any of the analyzed variables. Therefore, nonparametric tests were employed, as they do not require the assumption of normality and are appropriate for small samples and for ordinal or nonnormally distributed data [
64,
65]. To evaluate the effect of repetition across multiple iterations, Friedman tests were conducted as a nonparametric alternative to repeated-measures ANOVA for related samples [
66,
67], followed by post hoc pairwise comparisons using two-sided Wilcoxon signed-rank tests.
The resulting p-values were adjusted for multiple comparisons using the Benjamini–Hochberg false discovery rate (FDR) correction [68].
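The repetition analysis can be illustrated with a short, dependency-free sketch. It assumes four repetitions per unit (hence 3 degrees of freedom), applies no tie correction to the Friedman statistic, and uses hypothetical function names; statistical libraries may differ in these details, so it should be read as an outline of the procedure rather than the exact computation used in the study.

```python
import math

def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]

def friedman_chi2(units):
    """Friedman statistic for units that each hold k related measurements."""
    n, k = len(units), len(units[0])
    rank_sums = [0.0] * k
    for unit in units:
        for j, r in enumerate(midranks(unit)):
            rank_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A typical use would compute `friedman_chi2` per variable, convert it with `chi2_sf_df3`, and pass the collected post hoc p-values through `benjamini_hochberg`.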
For the activity-specific and agent comparisons, two-sided Wilcoxon signed-rank tests were used, given the paired nature of the data, as all conditions involved the same set of participants [
69]. The Wilcoxon test is appropriate for comparing two related samples when normality cannot be assumed [
63].
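For very small paired samples, the two-sided Wilcoxon signed-rank p-value can be obtained exactly by enumerating every sign pattern. The sketch below assumes the common convention of reporting W as the smaller of the positive and negative rank sums and of discarding zero differences; the function name is hypothetical and the source does not state which variant was used.

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Return (W, exact two-sided p) for paired samples x and y."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(d)
    mags = [abs(v) for v in d]
    # Midranks of the absolute differences (ties share their average rank).
    ranks = [sum(v < t for v in mags) + (sum(v == t for v in mags) + 1) / 2
             for t in mags]
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    w = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n
```

With seven pairs there are only 2^7 = 128 sign patterns, so the smallest attainable two-sided p-value is 2/128 ≈ 0.016, which bounds how small a p-value a sample of this size can produce.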
To ensure comparability across conditions, performance data were first aggregated at the participant level. Specifically, for each variable, all observations corresponding to a given condition were averaged within each participant, resulting in a single representative value per participant and condition. This approach reduces the within-condition variability and preserves the within-subject structure of the data.
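The aggregation step can be sketched as follows, assuming observations are stored as (participant, condition, value) triples. The function names, the participant codes, and the values in the usage example are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_participant(observations):
    """Average all values of one variable per (participant, condition) cell."""
    cells = defaultdict(list)
    for participant, condition, value in observations:
        cells[(participant, condition)].append(value)
    return {key: mean(values) for key, values in cells.items()}

def paired_values(aggregated, condition_a, condition_b):
    """Aligned value lists for participants observed under both conditions."""
    participants = sorted({p for p, _ in aggregated})
    a, b = [], []
    for p in participants:
        if (p, condition_a) in aggregated and (p, condition_b) in aggregated:
            a.append(aggregated[(p, condition_a)])
            b.append(aggregated[(p, condition_b)])
    return a, b

# Hypothetical response times (s) for two participants.
observations = [
    ("U1", "robot", 2.0), ("U1", "robot", 3.0), ("U1", "teacher", 1.0),
    ("N2", "robot", 4.0), ("N2", "teacher", 2.0), ("N2", "teacher", 3.0),
]
aggregated = aggregate_by_participant(observations)
robot, teacher = paired_values(aggregated, "robot", "teacher")
```

The two aligned lists are what a paired test would receive, one value per participant and condition.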
In the activity-specific analysis, the emotion classification task, in which UX-related issues were identified, was contrasted with the remaining activities. For each participant, performance values were aggregated separately for the emotion classification condition and for the set of remaining activities, and paired comparisons were conducted between these two conditions.
Similarly, for the agent effect, performance values were aggregated by participant and interaction condition (robot vs. teacher), and paired comparisons were conducted between the two conditions.
Descriptive statistics, including the mean, median, and standard deviation, were also computed to support the interpretation of the results.
Furthermore, effect sizes were calculated to quantify the magnitude of the observed effects independently of statistical significance. Kendall’s W was used for Friedman tests as a measure of the degree of agreement or consistency across repeated measures, while rank-biserial correlation (r) was used for Wilcoxon comparisons to estimate the strength of pairwise differences. Reporting effect sizes is particularly important in studies with small sample sizes, as it provides information about the practical relevance of the findings beyond p-values [70].
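Both effect sizes have simple closed forms under their usual definitions, sketched below. The text does not state the exact formulas used, so these are assumptions: Kendall's W is taken as the Friedman statistic divided by its maximum, and the rank-biserial correlation as the matched-pairs version based on the positive and negative rank sums. Function names are hypothetical.

```python
def kendalls_w(chi2, n_units, k_conditions):
    """Kendall's W from a Friedman statistic: W = chi2 / (n * (k - 1))."""
    return chi2 / (n_units * (k_conditions - 1))

def rank_biserial(w_plus, w_minus):
    """Matched-pairs rank-biserial correlation from signed-rank sums.

    The sign follows the direction of the differences; the value ranges
    from -1 (all differences negative) to +1 (all differences positive).
    """
    total = w_plus + w_minus
    if total == 0:
        raise ValueError("no non-zero differences")
    return (w_plus - w_minus) / total
```

For seven pairs the rank sums add up to 28, so a smaller rank sum of 1 corresponds to |r| = 26/28 ≈ 0.929 and a smaller rank sum of 0 to |r| = 1.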
3. Results
3.1. Agent Effect: Teacher vs. Robot
To evaluate differences between robot-mediated and teacher-led interactions, participants first engaged in robot-mediated sessions and later performed comparable activities with a teacher. All statistical comparisons were conducted at the participant level using paired observations (n = 7).
Wilcoxon signed-rank tests revealed a significant effect on response time, with faster responses observed during teacher-led sessions than during robot-mediated sessions (W = 0.00, p = 0.016, r = 1.00, n = 7). This large effect size indicates a substantial difference in response efficiency when interacting with the teacher. This pattern was consistent when comparing the robot-only and mixed conditions, where the response time was again significantly lower in the teacher’s condition (W = 1.00, p = 0.031, r = 0.929, n = 7). As shown in Figure 1, this difference is consistently observed across participants, with lower aggregated response times in the teacher’s condition.
In contrast, spontaneity was significantly lower in teacher-led sessions than in robot-mediated interactions (W = 1.00, p = 0.031, r = 0.929, n = 7), indicating that participants produced fewer spontaneous or self-initiated responses when interacting with the teacher. This large effect size suggests a substantial shift in interaction dynamics, with more efficient but less spontaneous behavior in the teacher’s condition.
No statistically significant differences were found for frequency, average duration, total duration, or accuracy, indicating that the overall interaction structure remained comparable across conditions. However, several of these variables showed moderate-to-large effect sizes (average duration: r = 0.50; total duration: r = 0.714; accuracy: r = 0.714), suggesting potential differences that may not reach statistical significance because of the limited sample size.
Similarly, physiological measures did not differ significantly between the conditions. Nevertheless, moderate effect sizes were observed for some variables (heart rate: r = −0.64), indicating a tendency toward greater physiological activation during teacher-led sessions.
To further examine whether the observed differences could be attributed to practice or order effects, an additional analysis was conducted to compare performance changes across weeks under different interaction conditions. Changes between Week 1 and Week 2 were computed separately for sequences in which participants continued with robot-led activities and for those in which they transitioned to teacher-led interaction.
The results revealed that the response time decreased in both cases, but the reduction was more pronounced when participants interacted with the teacher (median Δ = −0.97) than when they continued robot interaction (median Δ = −0.59), with a large effect size (r = 0.73), although the difference did not reach statistical significance (p = 0.188).
Overall, these results suggest that task-response efficiency is greater in teacher-led interactions than in robot-mediated interactions under the conditions of this study, particularly in terms of response speed, for which a significant and large effect was observed. Moreover, this pattern is accompanied by a reduction in spontaneity, indicating a shift toward more efficient but less self-initiated interaction patterns. Several additional behavioral and physiological variables showed moderate-to-large effect sizes, despite not reaching statistical significance. This pattern suggests that the differences between robot-led and teacher-led sessions may extend beyond response time alone, although these effects should be interpreted cautiously given the limited sample size.
3.2. Effect of Repetition
Teachers reported that repeated exposure to the activities made them appear monotonous and overly simple. In line with these observations, repetition effects were examined quantitatively to assess changes in performance and physiological responses across the four executions of each activity. Friedman tests were conducted on complete repeated-measure units (sequences with valid observations across all four repetitions). The number of units varied across variables depending on data availability.
Friedman tests revealed a significant effect of repetition on response time (χ²(3) = 34.20, p < 0.001; Kendall’s W = 0.38, n = 30), spontaneity (χ²(3) = 10.40, p = 0.015; Kendall’s W = 0.12, n = 29), and heart rate (χ²(3) = 17.85, p = 0.002; Kendall’s W = 0.17, n = 35). As shown in Table 5, the response time decreased from the first to the fourth repetition (median: 2.53 to 1.34), indicating improved performance across iterations, with a moderate-to-large effect size. In contrast, spontaneity decreased across repetitions (median: 58.70 to 42.86), suggesting a reduction in spontaneous responses as participants became more familiar with the tasks. Heart rate increased over repetitions (median: 92.84 to 95.00), although the associated effect size was small, indicating a limited physiological impact.
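Kendall’s W can be recovered from a Friedman statistic through the standard relation W = χ² / (n(k − 1)), where n is the number of complete units and k the number of repeated conditions. The short sketch below applies that relation to the values reported in the text, with k = 4 repetitions; the helper name is illustrative.

```python
def kendalls_w(chi2: float, n_units: int, k_conditions: int) -> float:
    """Kendall's W from a Friedman chi-square: W = chi2 / (n * (k - 1))."""
    return chi2 / (n_units * (k_conditions - 1))


# (Friedman chi-square, n) pairs as reported in the text; k = 4 repetitions
reported = {
    "response time": (34.20, 30),
    "spontaneity": (10.40, 29),
    "heart rate": (17.85, 35),
}
for name, (chi2, n) in reported.items():
    print(name, round(kendalls_w(chi2, n, 4), 2))
```

The computed values (0.38, 0.12, and 0.17) agree with the reported effect sizes.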
No statistically significant effects were observed for the remaining behavioral or physiological variables after FDR correction. Although some measures showed directional changes between the first and fourth repetitions, these differences did not reach statistical significance.
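The FDR procedure is not named in this section. A common choice is the Benjamini-Hochberg step-up adjustment, and the sketch below assumes that procedure purely for illustration. The input p values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# hypothetical raw p values for four outcome variables
raw = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(raw))
```

A variable is then declared significant when its adjusted p value falls below the chosen threshold (for example, 0.05).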
Post hoc Wilcoxon signed-rank tests comparing the first and fourth repetitions confirmed the findings for response time and spontaneity. These comparisons were conducted on the same complete repeated-measure units used in the Friedman tests, with sample size varying by variable. The response time significantly decreased (p < 0.001, FDR corrected, n = 30), with a large effect size (r = 0.89). In addition, spontaneity significantly decreased (p = 0.006, FDR corrected, n = 29), with a moderate-to-large effect size (r = 0.57), reinforcing the observation that repeated exposure may reduce spontaneous responses. However, heart rate did not significantly differ between the first and fourth repetitions (p = 0.38, n = 35), with a small effect size (r = 0.16). As illustrated in Figure 2a, the evolution of heart rate across repetitions was not monotonic and showed intermediate fluctuations rather than a consistent increase. Notably, higher heart rate values were observed during the second week (iterations 3 and 4) than during the first week (iterations 1 and 2), although a consistent within-day pattern was also evident, with lower values in afternoon sessions than in morning sessions. This non-monotonic pattern likely explains the discrepancy between the significant Friedman test and the nonsignificant pairwise comparison.
Figure 2b illustrates the evolution of response time across repetitions, showing a clear decreasing trend as participants became more familiar with the tasks.
Overall, these results indicate that repetition is associated with improved response efficiency, particularly in terms of response speed. The effect size for response time was moderate to large (Kendall’s W = 0.38; r = 0.89), indicating a robust learning effect across repetitions. In contrast, the increase in heart rate was associated with a smaller effect size, suggesting more limited physiological changes.
3.3. Activity-Specific Effect: UX Issues
On the basis of teacher feedback and session recordings, a usability issue was identified in the emotion classification activity that appeared to interfere with task execution in some participants, particularly those with ASD. For example, one participant (M7) repeatedly expressed difficulty (“It doesn’t work”) and showed signs of disengagement during the task.
To examine the impact of this issue, pairwise comparisons were conducted between this activity and the remaining activities using Wilcoxon signed-rank tests. These comparisons were performed at the participant level using paired observations, including only participants with valid data for both the emotion classification activity and the comparison activities (n = 6). One participant (U8) was excluded from this analysis because he did not complete the emotion classification activity. This analysis should be interpreted as exploratory, as the emotion classification task was selected based on usability issues identified during the intervention rather than as a predefined comparison.
The results indicated that, compared with the other activities, the emotion classification activity tended to show altered performance patterns, although most differences did not reach statistical significance. The response time and average duration were lower for this activity, but these differences were not significant (response time: W = 3.00, p = 0.156, r = −0.71, n = 6; average duration: W = 2.00, p = 0.094, r = −0.81, n = 6). Despite the lack of statistical significance, both variables showed large effect sizes, suggesting potentially meaningful differences in task efficiency.
Similarly, no significant differences were observed for accuracy and production, and the effect sizes were small to moderate, indicating a limited impact on performance outcomes.
In terms of support assistance, teacher intervention was notably greater for the emotion classification activity (mean = 34.90 vs. 9.69), as illustrated in Figure 3. Although this difference did not reach statistical significance (W = 3.00, p = 0.3125, r = 0.60, n = 6), the moderate-to-large effect size suggests an increased need for support during this task.
With respect to physiological measures, a significant difference was found in heart rate variability (HRV), which was lower during the emotion classification activity (W = 0.00, p = 0.031, r = −1.00, n = 6), indicating a large effect. In contrast, heart rate (HR) and electrodermal activity (EDA) showed higher values in this activity, although these differences were not statistically significant. However, EDA exhibited a large effect size (r = 0.71), suggesting increased physiological activation.
Overall, these results suggest that the emotion classification activity is associated with a less favorable interaction pattern, including signs of frustration, increased teacher support, and heightened physiological activation relative to the other activities. While statistical significance was limited because of the small sample size, the consistently large effect sizes across several variables indicate that the observed differences may be practically meaningful.
4. Discussion
The present study examined whether robot-mediated interaction was associated with more efficient execution of structured communicative–linguistic tasks among children with ASD and whether this behavior extended to teacher-led interaction. Additionally, it explored the consistency between teacher feedback and objective behavioral and physiological data to inform improvements in the UX of the intervention.
Overall, the results indicate that (i) performance improves with repetition, (ii) task-specific usability issues can influence both behavioral and physiological responses, and (iii) interaction with a robot may be associated with improved performance in human-led conditions, particularly in terms of response efficiency. Given the limited sample size, effect sizes were considered alongside statistical significance to better assess the magnitude and practical relevance of the observed effects.
The results provide partial support for RQ1, regarding changes in performance across structured communicative–linguistic tasks and interaction conditions. Participants showed significantly faster response times in teacher-led sessions than in robot-mediated sessions, suggesting improved performance efficiency when they were transitioning to human interaction. However, this improvement was accompanied by a significant decrease in spontaneity, indicating that participants produced fewer self-initiated responses in the teacher-led condition. Although this behavior may be consistent with facilitation, the sequential design of the study does not allow this improvement to be interpreted as evidence of a robot-to-teacher transfer effect, as practice and order effects may also contribute.
Differences in response time were also observed within mixed sessions, where teacher-led phases consistently elicited lower response times than robot-led phases did. Furthermore, week-to-week comparisons indicated that improvements in response time were more pronounced when participants transitioned to teacher-led interactions than when they continued with robot-led sessions. Taken together, these results suggest that the observed differences cannot be fully explained by repetition alone, although they should be interpreted cautiously.
These findings are consistent with previous research indicating that robots can act as mediators for social interaction in individuals with ASD, providing a simplified and predictable interaction framework that can later be generalized to more complex human contexts [16, 33]. Moreover, the observed reduction in spontaneity suggests that this transition to human interaction may involve a trade-off between efficiency and self-initiated communication, highlighting the complexity of interaction processes in participants with ASD [71, 72, 73]. However, no significant differences were observed in other behavioral or physiological measures.
While these findings suggest that differences are most clearly reflected in response speed, several variables showed moderate-to-large effect sizes, indicating that the influence of robot-mediated interactions may extend beyond response time. These effects did not reach statistical significance, likely because of the limited sample size, and should therefore be interpreted with caution.
The effect of repetition was also examined in terms of learning versus physiological load, and the results revealed a clear improvement in performance across repetitions. This finding indicates a robust learning effect as participants became progressively more familiar with the tasks and interaction dynamics [74]. However, this improvement was not accompanied by a reduction in physiological activation. On the contrary, heart rate increased across repetitions, although the effect was small, indicating that improved behavioral performance does not necessarily correspond to reduced cognitive or emotional load. This dissociation aligns with previous research suggesting atypical physiological regulation in individuals with ASD, where performance gains may coexist with sustained or even increased autonomic activation [67]. Recent work on digital twin-driven human–robot collaboration likewise highlights that efficient task execution can coexist with sustained system load and dynamic coordination requirements; in the context of participants with ASD, this coexistence may also reflect heightened arousal or differences in physiological regulation [75].
Interestingly, these findings contrast with the perceptions of teachers (providing a partial answer to RQ2), who described the activities as becoming easier and more monotonous over time. The teachers’ interpretation is nonetheless supported by the observed decrease in spontaneity across repetitions, which suggests that increased familiarity with the tasks may reduce engagement in self-initiated communication. Overall, this discrepancy highlights the importance of combining subjective evaluations with objective measures, as perceived task simplicity does not necessarily reflect the underlying physiological or interactional demands experienced by participants. It also suggests that the perception of monotony may be influenced by the participant’s profile.
The emotion classification activity emerged as a critical point in the intervention. Although most performance differences were not statistically significant, several variables showed large effect sizes, including response time and average duration, suggesting potentially meaningful performance differences. In addition, HRV showed both statistical significance and a large effect size, reinforcing the interpretation of increased cognitive or emotional demand during this activity. Importantly, teacher feedback and session recordings revealed usability issues (specifically difficulties in interacting with the tablet interface), which likely contributed to these effects (answering RQ2). Moreover, the level of teacher support provided during this activity was much higher than that during the other activities, indicating that this activity was performed less smoothly than the others.
These findings reinforce the importance of UX design in interventions involving individuals with ASD. Even minor technical or interaction difficulties can disrupt task execution, increase frustration, and ultimately affect performance. In this sense, the results highlight how interaction mechanics play a crucial role in shaping UX and outcomes. These findings should not be interpreted as evidence of robust generalized linguistic improvement but rather as preliminary evidence of improved response efficiency and stable task-related communicative performance under a structured robot-mediated intervention.
5. Conclusions
This study investigated task-related performance and user experience in robot-assisted communicative–linguistic activities for students with ASD. In addition, the consistency between teacher feedback and quantitative behavioral and physiological data was examined to inform improvements in the user experience of the intervention.
A key contribution lies in the integration of teacher feedback with quantitative data to inform improvements in the user experience (UX) of the intervention. The results show that teacher observations can reliably identify usability issues, such as interaction difficulties, loss of attention, or frustration, which are also reflected in objective performance and physiological measures (RQ2). These findings reinforce the importance of incorporating practitioner perspectives into the evaluation of technological interventions for individuals with ASD, particularly when users may not explicitly verbalize their internal states. Moreover, the observed discrepancies between perceived and measured outcomes, such as in the case of repetition, highlight the need for data-driven approaches in the design and evaluation of these systems. Previous research has emphasized that UX in ASD populations must be grounded in both behavioral evidence and user-centered evaluation methods, as traditional usability approaches may fail to capture the specific needs of these users [76]. In this context, combining subjective and objective data provides a more comprehensive understanding of the user experience.
From a design perspective, the findings align with established UX principles for individuals with ASD, which emphasize the importance of predictability, structured interactions, and frustration-free interfaces [27]. The findings of this study highlight the dual role of repetition in ASD interventions. On the one hand, repeated and predictable interaction patterns were associated with clear improvements in performance, particularly in response speed, supporting their value as mechanisms for facilitating learning, task familiarization, and task efficiency (RQ1). On the other hand, these gains were accompanied by a reduction in spontaneity, indicating that increased familiarity with the tasks may come at the cost of diminished spontaneous engagement. Taken together, these results emphasize the need to carefully balance repetition with variability in activity design to sustain engagement while preserving opportunities for spontaneous interaction across diverse user profiles.
Furthermore, the comparison between the robot- and teacher-led sessions revealed faster responses in the teacher-led condition, suggesting that prior interaction with the robot may be associated with improved performance in subsequent human interactions, particularly in terms of response efficiency. However, this improvement was accompanied by a reduction in spontaneity, suggesting a shift toward more efficient but less self-initiated interaction dynamics in the teacher-led condition. Nevertheless, given the sequential design, this effect cannot be fully disentangled from practice or order effects.
Moreover, the usability issues identified in the emotion classification activity highlight the critical role of interaction design. This activity was associated with altered performance patterns and increased physiological activation, emphasizing the impact that interface design can have on engagement and task execution. In this regard, the integration of teacher feedback with quantitative data proved to be a valuable approach for identifying design limitations and guiding improvements. Prior work has shown that even minor usability barriers can significantly impact engagement and learning outcomes in individuals with ASD, particularly when they introduce uncertainty or increase cognitive load [57]. In this sense, ensuring smooth, intuitive, and error-tolerant interaction mechanisms is essential for maintaining engagement and preventing frustration.
Overall, the findings should be interpreted with caution. Rather than demonstrating a broadly generalizable robot-to-teacher transfer effect, this study identifies a context-bound pattern consistent with facilitation, suggesting that, within this specific school-based intervention and participant profile, prior robot-mediated interaction may be associated with more efficient subsequent teacher-led performance, particularly in terms of response efficiency. Therefore, the contribution of the study lies less in claiming generalizable effectiveness and more in identifying a promising, context-sensitive interaction pattern that warrants further validation with larger samples, baseline human-led comparisons, and more diverse educational settings.
Building on these findings, the study identifies specific improvements to be implemented in future iterations of the intervention. In particular, the emotion classification activity should be refined to address the interaction difficulties observed. Increasing the size of the drag-and-drop target area and improving the tolerance of the interaction are expected to facilitate task execution and reduce user frustration. Additionally, although repetition was associated with improved performance, the intervention suggests the need to introduce slight variations within repeated activities to maintain an adequate level of challenge and stimulation. These adaptations could include modifying stimuli, adjusting task parameters, or varying feedback while preserving the overall structure of the activity. These proposed improvements reinforce the importance of iterative, user-centered design processes in robot-assisted interventions, where usability issues identified through both behavioral and physiological data can be directly translated into concrete design refinements.
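As an illustration of the proposed refinement, the sketch below shows one possible way to make a drag-and-drop target more forgiving by growing its accepted region by a tolerance margin. The class, function, coordinates, and margin are all hypothetical and do not describe the interface actually used in the intervention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DropTarget:
    x: float       # left edge, in pixels (hypothetical layout)
    y: float       # top edge, in pixels
    width: float
    height: float


def accepts_drop(target: DropTarget, drop_x: float, drop_y: float,
                 tolerance: float = 0.0) -> bool:
    """Return True if the drop point lies inside the target
    grown by `tolerance` pixels on every side."""
    inside_x = target.x - tolerance <= drop_x <= target.x + target.width + tolerance
    inside_y = target.y - tolerance <= drop_y <= target.y + target.height + tolerance
    return inside_x and inside_y


emotion_box = DropTarget(x=100, y=100, width=80, height=80)
print(accepts_drop(emotion_box, 190, 140))                # near miss, strict bounds
print(accepts_drop(emotion_box, 190, 140, tolerance=24))  # same drop, tolerant bounds
```

In this example, a drop that lands 10 pixels outside the strict bounds is rejected without a margin and accepted with a 24-pixel margin, which is the kind of error tolerance the proposed refinement aims for.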
These findings contribute to the understanding of how social robots may be integrated into educational interventions for individuals with ASD as structured mediators of learning processes that, under specific conditions, may facilitate subsequent human interaction. Moreover, they highlight the need for user-centered and data-driven design approaches, where continuous evaluation and refinement of interaction mechanisms play key roles in optimizing outcomes. Future work should further explore the long-term effects of these interventions and continue refining UX design strategies tailored to the specific needs of individuals with ASD.
Despite the contributions of this study, several limitations should be acknowledged. The sample size is limited, and the analysis is based on a within-subject design, which constrains the generalizability of the findings beyond the specific participants and context studied. In addition, the intervention was conducted within a controlled educational setting using a predefined set of activities. Notably, no baseline measurement of teacher-led interaction was collected prior to the robot-mediated intervention. As a result, the comparison between robot- and teacher-led sessions is limited to postintervention data, which restricts the ability to determine the extent to which observed differences are attributable to the robot-mediated phase.
An additional limitation concerns the generalizability of the findings related to the transition from robot-mediated to teacher-led interaction. Because the study was based on a small sample, a within-subject design, and a highly specific educational setting, the observed facilitation pattern should be interpreted as context-bound and preliminary rather than as evidence of broad generalizability to other students with ASD, school environments, intervention protocols, or robotic platforms [76]. In this sense, the findings are better understood as an indication that, under the conditions of this intervention, prior interaction with a structured and predictable robotic agent may support subsequent task performance in teacher-led sessions. More broadly, as noted in human–robot collaboration research, the effectiveness and transferability of human–robot processes depend strongly on task structure, agent roles, and contextual constraints, which limits direct extrapolation across domains and settings [77].
Another aspect to consider is the interpretation of physiological measures. Although heart rate and electrodermal activity provide valuable insights into participants’ internal states, they are indirect indicators of cognitive and emotional processes and should therefore be interpreted with caution, particularly in populations such as individuals with ASD, where physiological responses may not always follow typical patterns.
An additional limitation of the study concerns the potential social desirability bias in teacher feedback. Because teachers were directly involved in the implementation of the intervention, their evaluations may have been influenced by a tendency to emphasize favorable aspects of the activities or to align their judgments with the expected goals of the project. This is a common risk in interview-based and self-reported data. However, in the present study, teacher feedback was contextualized as a source of usability evidence and used to identify practical issues in the interaction between children with ASD and the robot within a natural classroom setting. This limitation was partially mitigated through the structure of the Pocket-BLA interview, which explicitly elicited both positive and negative aspects of the experience, and through the triangulation of teacher feedback with behavioral and physiological data. In fact, teachers reported specific usability problems, such as interaction difficulties, insufficient visual support, delays in robot response, and reduced engagement due to repetition. Some of these observations were consistent with the quantitative results, particularly in the case of the emotion classification activity, where increased support needs and altered physiological responses were also observed. Therefore, teacher feedback should be interpreted as a valuable but nonneutral source of evidence that is useful for identifying design limitations, although it is still potentially affected by expectancy effects and social desirability bias.
Future research should directly address the limitations identified in the present study. Incorporating baseline measurements of teacher-led interaction prior to robot-mediated sessions and adopting counterbalanced or randomized designs would allow for a clearer disentanglement of facilitation-related patterns from practice and order effects. Expanding the sample size and including more diverse educational settings will be essential to improve the generalizability of the findings.
Longitudinal designs would also be valuable for assessing the stability of the observed effects over time and determining whether performance differences between robot- and teacher-led interactions persist beyond the intervention period. In addition, further work is needed to refine the interaction design and UX guidelines tailored to ASD users, building on the usability issues identified in this study. This includes improving task mechanics, adapting interaction complexity, and systematically integrating quantitative data with practitioner feedback to support the development of more robust and adaptable interventions.