EEG Correlates of Distractions and Hesitations in Human–Robot Interaction: A LabLinking Pilot Study

Abstract: In this paper, we investigate the effect of distractions and hesitations as a scaffolding strategy. Recent research points to the potential beneficial effects of a speaker's hesitations on the listeners' comprehension of utterances, although results from studies on this issue indicate that humans do not make strategic use of them. The role of hesitations and their communicative function in human-human interaction is a much-discussed topic in current research. To better understand the underlying cognitive processes, we developed a human–robot interaction (HRI) setup that allows the measurement of the electroencephalogram (EEG) signals of a human participant while interacting with a robot. We thereby address the research question of whether we find effects on single-trial EEG based on the distraction and the corresponding robot's hesitation scaffolding strategy. To carry out the experiments, we leverage our LabLinking method, which enables interdisciplinary joint research between remote labs. This study could not have been conducted without LabLinking, as the two involved labs needed to combine their individual expertise and equipment to achieve the goal together. The results of our study indicate that the EEG correlates in the distracted condition are different from the baseline condition without distractions. Furthermore, we could differentiate the EEG correlates of distraction with and without a hesitation scaffolding strategy. This proof-of-concept study shows that LabLinking makes it possible to conduct collaborative HRI studies in remote laboratories and lays the first foundation for more in-depth research into robotic scaffolding strategies.


Introduction
This research is rooted in the vision that robots will adequately support humans in their everyday tasks and that they will continuously learn and adapt to human needs over time. Engaging in joint action to achieve a task requires a common understanding of the ongoing interaction [1,2]. This can be established through grounding [3] or alignment processes [4], which typically rely on verbal and non-verbal communication. Allwood and colleagues [5] identified four basic requirements for human communication: (1) willingness to communicate, (2) willingness to perceive, (3) ability to understand, and (4) ability to react attitudinally or behaviorally. The willingness to communicate and to perceive has been investigated and modeled in the context of HRI in terms of engagement and joint attention [6-8], which is mostly measured based on gazing behavior. Problems of communication generally arise in one of these areas and need to be identified and remedied. In order to achieve a shared understanding between robots and humans, it is, therefore, necessary for the robot to monitor the state of the interaction partner regarding these four levels and to provide appropriate scaffolding strategies in the event of problems such as inattentiveness or non-understanding.
Here, we focus on hesitations as a specific scaffolding strategy, that is, a measure to support learners in acquiring new knowledge and skills. Our previous investigations have focused on the benefits of hesitations based on behavioral data such as correctly memorized information or correctly oriented gaze [9,10]. However, the previous approach focused on interaction results but did not shed light on the processes that actually lead to better memorization.
The role of hesitations in human-human interaction is a much-discussed topic in current research [11,12], and there is evidence for the hypothesis that they have a communicative function and influence the interaction partner. As hesitations occur more often prior to long utterances [13] and infrequent words [14], the frequency distribution of hesitations in spontaneous speech is assumed to be caused by the level of cognitive load of the speaker. The strategic use of hesitations as a scaffolding strategy for the listener has also been discussed by some authors [15]. Yet, results from studies on this issue indicate that humans do not make strategic use of hesitations [11], although several studies point toward beneficial effects on the listeners' comprehension of utterances [16,17].
To better understand how scaffolding strategies, such as hesitations, affect the cognitive processes of a human interaction partner and to work toward an adaptive robot behavior that takes the attentional state of its user into consideration, we developed an HRI setup that allows the measurement of EEG signals of a human participant while interacting with a robot. The proper measurement of high-dimensional EEG biosignals from humans in interaction and the development of a robot system capable of interacting with people in a contingent way are both complex undertakings that require dedicated hardware and software as well as specific expertise and knowledge. The two are rarely found together in a single lab.
However, in today's digitally connected world, close collaboration across disciplines, distances, languages, and cultures has become the rule rather than the exception. For example, in research and development, tightly interconnected interdisciplinary groups benefit from each other's diverse experiences and perspectives to jointly create innovations. With the technological invention of LabLinking, we take the concept of close collaboration to the next level. LabLinking is a technology-based interconnection of experimental laboratories with a defined level of connection tightness (LabLinking Level, LLL) [18]. We argue that linked labs provide a unique platform for a continuous exchange between scientists and experimenters. They thereby enable a time-synchronous execution of experiments performed with and by decentralized users and researchers, improve the outreach and ease of human participant recruitment, allow new experimental designs to be established jointly, and make it possible to incorporate a panoply of complementary biosensors, devices, and hardware and software solutions to capture human behavior [18]. Furthermore, LabLinking supports the increasing demand for sustainability and hybrid events in the post-COVID-19 world.
The following study could not have been conducted without LabLinking, since it builds on the complementary expertise and equipment of two laboratories: the Medical Assistance Systems Group (MAS) at Bielefeld University with its rich expertise in social robotics based on robots such as Pepper, Nao, or Flobi [19-21], and the Cognitive Systems Lab (CSL) at the University of Bremen with vast experience in biosignal-adaptive cognitive systems [22] based on multimodal biosignal acquisition [23] and processing using machine learning methods [24], including the recording and interpretation of spoken communication [25] and high-density EEG in the context of intelligent robots and systems [26].

State of the Art
The use of EEG as a method to measure the impact of distractions and hesitations on the user is motivated by related work, in which some researchers found an effect of a speaker's hesitations on the listener's EEG while listening to continuous speech. Corley et al. [27] showed that event-related potentials (ERPs) associated with the meaningful processing of language are affected by a preceding hesitation. In their experiment, the N400 effect (measuring difficulties in the integration of a word into its linguistic context) was found for unpredictable in contrast to predictable words. They compared fluent to disfluent utterances and found that the N400 effect was reduced by a hesitation before the unpredictable word, indicating that linguistic integration difficulties were reduced. In addition, a memory test indicated that words preceded by a disfluency were more likely to be remembered [27].
Collard [16] showed that ERPs associated with attention (mismatch negativity (MMN) and P300 effect) are affected by a preceding hesitation vowel in a similar experiment. Infrequently occurring, acoustically manipulated target words resulted in typical MMN and P300 components compared to a non-manipulated baseline. Furthermore, a prolonged pause between the hesitation vowel and continuing speech appeared to impair covert attention to the post-disfluent content and the subsequent memory performance for this content. This indicates an immediate effect of hesitations on the listener's overt attention to the upcoming speech [16].
For a robot to respond to lapses in attention, e.g., through a hesitation strategy, it must first detect the lapse. In many situations, such lapses are caused by external, exogenous shifts of attention [28], which we refer to as distractions. Distractions can be detected from EEG signals based on a few seconds of data by applying machine learning techniques that are commonly used in Brain-Computer Interfaces (BCIs). While average-case analysis of EEG, as discussed above, can often identify subtle effects from ERP analysis, such single-trial classification in interactive scenarios more often relies on frequency-based features, which are not as susceptible to small latency shifts as ERPs. Vortmann and Putze [29] showed, for example, how such a detection (in their case of visual distractions) can be incorporated beneficially into an interactive system to adapt its behavior to the attentional state of the user. In the field of driver safety, multiple studies have investigated the detection of auditory distractions via EEG-based BCIs, for example, by Beltrán et al. [30] and Salous et al. [31]. Another field in which the detection of distractions has been applied successfully is the medical field, which requires long periods of sustained attention, for example, during rehabilitation exercises [32].
These results indicate that hesitations may serve as a good scaffolding strategy that, on the one hand, provides time for processing while, on the other hand, guides the attention to the upcoming difficult part when explaining new information to an interaction partner. They also show that the EEG signal likely contains relevant information that could help us to identify the need for such guidance and measure the impact of hesitations on neural processing.
However, it is still unclear if this positive effect of hesitations can also be used as a communicative tool in human-agent interaction. In our research, we have already shown that hesitations can be used as a non-intrusive intervention strategy for dealing with inattentive interaction partners [33]. In our smart-home setting, a robotic virtual agent used hesitations in an explanation scenario whenever it lost the visual attention of the explainee. We found that in short interactions, without a change in the discourse, unfilled pauses based on missing mutual gaze have a positive effect on the gazing behavior, i.e., the visual focus of attention, of the interlocutor [34]. In further studies, we could show that such a hesitation intervention strategy could also lead to higher task performance of the human at the cost of less positive subjective ratings regarding the artificial agent [33,35].
Using (1) mutual gaze and task-related features to detect inattentiveness due to missing engagement or difficulties in understanding and (2) different strategies to deal with these improves the task performance without negative side effects on the interaction [33]. However, besides these positive effects of the explainer's hesitations on the task performance of the explainee, it is still an open question if they also affect the EEG responses of the listener and, therefore, provide further information about the understanding process.
This study investigates EEG responses to distractions and hesitations in human-robot interaction. It integrates previous findings from the MAS Lab with the work of CSL and colleagues within the DFG CRC 1320 Everyday Activity Science and Engineering (EASE), where we provide unique and critical contextual background for robots based on the recording, processing, modeling, and interpretation of human activities, perceptions, and feedback [26]. For this purpose, biosignals resulting from the activity of the brain, the muscles, and the eyes, which are correlated to motion, communication, and other mental tasks, are recorded and interpreted to provide insight into diverse aspects of human behavior that enable them to masterfully perform everyday activities with little effort or attention [24]. Furthermore, these findings are brought into the context of AI explanations, as investigated in the DFG TRR318 Constructing Explainability.
The LabLinking concept, as outlined above, of course, draws inspiration from related work. For example, a development in recent years has been the distribution of standardized, well-documented experiments across multiple labs to ensure the reproducibility of the reported results. Lücking et al. [36] is an example from the robotics domain, while the EEGManyLabs initiative by Pavlov et al. [37] is a similar endeavor from the neuroscience perspective. This approach shares similarities with LabLinking in that it conducts experiments in multiple labs and requires a level of formalized documentation to enable the replication of experiments. A key difference is that for reproducibility efforts, every lab on its own needs to be able to perform the experiment independently and asynchronously. A similar role in the research landscape is played by multicentre studies (e.g., [38]), which also focus on the harmonization of experiment protocols, but with the goal of creating a uniform data set. An alternative approach, which considers real-time interaction between robots and EEG setups at different sites, is teleoperation. Several examples [39-41] show that EEG-based Brain-Computer Interfaces can be used to control robots across a distance, with real-time transmission of the detected control signals. In contrast to LabLinking, the focus in teleoperation is on establishing a tight control loop between the EEG user and robot, while LabLinking supports a wide range of other scenarios (e.g., verbal HRI) and basic research (analyzing the EEG signal instead of using it for control purposes).

Scenario
We investigate the impact of a robot's hesitations in response to distractions on the EEG signals of a human listener in the context of an explaining scenario. This involves everyday actions, such as laying out dishes or building blocks on a table in unusual configurations. To investigate how cognitive processes are affected by scaffolding signals, we address situations where humans receive instructions and explanations from a robot. Our envisioned scenario corresponds to a robot scaffolding a human partner who is impaired, for example, by dementia and needs support to carry out sequential everyday tasks. To avoid the influence of prior knowledge on the processing of the instructions, we used a scenario consisting of fictitious new rules for setting a table with standard dishes and cutlery and a more abstract scenario with building blocks. Furthermore, we introduce a distraction signal to impair attention and understanding processes at predefined points in time during the interaction. This will allow us to measure brain signal responses to the scaffolding effects of hesitations.

Research Questions
We address the following research questions in our HRI scenario:
• EEG responses to scaffolding in HRI
  - SRQ1: Can we find effects on single-trial EEG based on the distraction or non-understanding in HRI?
  - SRQ2: Can we find effects on single-trial EEG of scaffolding strategies in HRI?
• LabLinking Evaluation
  - MRQ1: Does LabLinking enable interdisciplinary joint research approaches between multiple labs?
  - MRQ2: What are the best practices to achieve benefits through LabLinking?

LabLinking Method
Combining a social robot and a high-density EEG setup within the same experiment poses a significant challenge for the proposed research project. Both components are expensive and require extensive expertise to be properly operated. Performing such an experiment often requires two or more groups to combine their unique technical equipment and their respective expertise. However, collaborations like this are difficult to realize because it is often not feasible to move delicate and unique equipment between two spatially distributed labs or to keep important items from one lab at a different location for an extended period of time.
To overcome this issue, we employ the LabLinking paradigm [18] for our experiment, depicted in Figure 1. In this paradigm for experimental research, it is possible for the robot Pepper to remain at its usual location at Bielefeld University while interacting in real-time with a human participant at the BioSignals Lab at the University of Bremen. To realize this setup, we implemented a technical infrastructure with the following main capabilities: (1) streaming of audio, video, and other data for human-robot interaction, (2) synchronized recording of multimodal data streams, and (3) control of experiment flow in a multi-site experiment. For the cross-site communication of events, we used the Robot Operating System (ROS) [42], which was also employed to control the robot Pepper at the Bielefeld lab. ROS provides a flexible messaging interface that allows us to establish a consistent and robust data flow between multiple machines across the two sites. A graphical user interface at each lab allowed the respective experimenters to communicate the state of the experiment (e.g., whether a trial was completed successfully) to the other side. Furthermore, all events were logged in ROS bags for later analysis of the temporal structure of the experiment (e.g., to identify trial beginnings). All involved machines at both sites were synchronized to the same NTP server to ensure a reliable alignment of timestamps. For reproducibility, the code for controlling the robot and parts of the LabLinking infrastructure can be installed using a distribution in the cognitive interaction toolkit (CITK) [43].

Streaming Study-Relevant Data
For streaming video data, we used OpenCV-based plugins in ROS (video_stream_opencv, image_view) [44]. For streaming audio data, we used the GStreamer software [45], which supports highly configurable, low-latency streaming pipelines. A video stream capturing visual components from the robot in Bielefeld was streamed to Bremen, while video and audio components of the participant and a video of the table were streamed back to Bielefeld for the operators of the robot. Furthermore, we implemented a custom GStreamer plugin to store accurate timestamps of the beginning and end of audio recordings. All video frames and other event data were assigned timestamps within the ROS framework. This allowed us to precisely align all collected data types and modalities during analysis. The EEG recorder used a different middleware (Lab Streaming Layer), for which we implemented a custom bridge component to convert the respective data packages into ROS messages.
Instead of streaming the generated voice output of Pepper directly, we used an array microphone to record the voice output of Pepper and streamed this recording from Bielefeld to Bremen. This produced an acoustic setting reminiscent of a real video meeting, as opposed to a cleanly generated, text-to-speech audio experience.

Synchronized Recording
Pepper itself was connected to ROS via a customized version of the C++ naoqi ROS driver [46], as the original implementation does not support timestamps. We added timestamps at the beginning and end of speech output to be able to synchronize the different modalities used at the two sites of the LabLinking setup. This way, the integrated animated speech functionality of Pepper, which analyzes the text and tries to produce contextually appropriate movements of Pepper, could also be used. A fixed offset resulting from a delay of audio recording in the two locations was calculated, along with a transmission delay of the spoken sentences from Bielefeld to Bremen.

Experiment Control Flow
To maintain a robust experiment flow in a multi-site experiment, we formalized the steps of one experiment trial into a state machine that was implemented by two different applications, which communicated between the sites. Through a graphical user interface, the experimenters could trigger steps of the experiment or mark them as completed. Marking a step as completed would then notify the other site and unlock the following stages of the experiment. This procedure ensured that experimenters at both sites had a matching understanding of the state of the experiment, avoiding premature or redundant steps. Besides this formalized interaction, a video conference channel was kept open during the experiment to coordinate in the case of unforeseen events.
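The following is a minimal sketch of how such a shared trial state machine could look. The state names, the assignment of steps to sites, and the class `TrialStateMachine` are hypothetical illustrations; the actual states used in the experiment are not listed here, and in the real setup the notification of the other site would travel over ROS rather than within one process.

```python
from enum import Enum, auto


class TrialState(Enum):
    """Hypothetical trial steps (assumed for illustration only)."""
    IDLE = auto()
    INSTRUCTION_PLAYING = auto()
    AWAITING_FEEDBACK = auto()
    EXECUTING = auto()
    TABLE_CHECKED = auto()


# For each state: the follow-up state and the site whose experimenter
# is allowed to mark the current step as completed (assumed mapping).
TRANSITIONS = {
    TrialState.IDLE: (TrialState.INSTRUCTION_PLAYING, "bielefeld"),
    TrialState.INSTRUCTION_PLAYING: (TrialState.AWAITING_FEEDBACK, "bielefeld"),
    TrialState.AWAITING_FEEDBACK: (TrialState.EXECUTING, "bremen"),
    TrialState.EXECUTING: (TrialState.TABLE_CHECKED, "bremen"),
    TrialState.TABLE_CHECKED: (TrialState.IDLE, "bremen"),
}


class TrialStateMachine:
    """Keeps both sites in agreement about the current trial step."""

    def __init__(self):
        self.state = TrialState.IDLE
        self.log = []  # (old_state, new_state, site) tuples for later analysis

    def complete_step(self, site):
        """Mark the current step as completed from `site` and unlock the next."""
        next_state, responsible_site = TRANSITIONS[self.state]
        if site != responsible_site:
            # Rejecting the request prevents premature or redundant steps.
            raise PermissionError(
                f"{site} cannot complete {self.state.name}; "
                f"waiting for {responsible_site}"
            )
        self.log.append((self.state, next_state, site))
        self.state = next_state
        return next_state
```

The design choice illustrated here is that a step can only be completed by the site responsible for it, so a mismatch between the two experimenters' views surfaces as an explicit error instead of a silently skipped step.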

Experimental Setup
To collect data on the neural responses to distractions in HRI as well as potential strategies to remedy such distractions, we conducted a LabLinking experiment.

Participants
Participants were recruited via postings on online message boards and through paper flyers. On average, participants were 22.92 years old (SD = 3.82). Four participants identified as female, and eight participants identified as male. Fluent German language skills were an inclusion criterion for participant selection. As compensation, participants received 30 EUR after completing the experiment (participants coming through a dedicated course on empirical human-computer interaction methods also received partial class credit). Each participant was informed about the procedure of the data collection and signed a consent form.

Setup
Participants were seated in front of an empty table at the CSL lab, with the EEG recording equipment on a smaller table on the left side and an assortment of items to place on the table on the right side. In front of the participant (behind the table) was the projection of the live video stream of the Pepper robot. The image was placed and scaled in a way that participants had the impression of Pepper standing behind the table. A microphone hanging from the ceiling recorded an audio stream from the CSL lab, and two cameras captured a top-view of the table as well as a frontal recording of the participant.

Instructions
At the beginning of the experiment (after all sensors were set up, see next section), participants received instructions about the interaction with the robot and the task procedure and then went through one trial block to clear up any misunderstandings and to familiarize them with the voice of the robot and the style of instructions. In the main experiment, participants went through a number of blocks. In each block, they were asked to set up the table in a unique way, as instructed by the robot. Each block used items from one of two item sets: One set (KIT) included typical everyday kitchen objects, such as cups, plates, cutlery, or napkins. The other set (BLC) contained colored wooden blocks of different shapes. For each block, we created a custom sequence of instructions. Each instruction asked the participant to place one item in relation to one of the previously placed items or the table (e.g., "place the green cube left of the blue cylinder"). In the middle of the experiment, participants took a 15-min break. Instructions (especially in the case of the kitchen items) were purposefully designed to not resemble a realistic table layout to minimize the influence of assumptions on likely item locations. The duration of speech content in each instruction varied between 3 and 5 s. We piloted and adjusted all instructions to ensure that all words were intelligible in the synthesized voice.

Distractions
Pepper's voice came from the front (where the robot was displayed), while distractions were played from the front-left, back-left, front-right, and back-right locations. Distractor sounds were sampled from freely available radio and television documentaries with the keyword "robot" in their title. The sampled segments usually contained a mixture of narration, interviews, music, and sound effects. For each block, distractors were taken continuously from a single recording to give a sense of an ongoing conversation. For instructions without distraction, we played ambient road noise from the same directions as the distractors. This ensured that the main difference between the distracted and non-distracted conditions was not merely the presence or absence of source separation in the brain. The volume level of distractors and ambient sound was kept constant during all experiments.

Hesitations
Pepper's hesitations were predefined and generated during the normal synthesis process with Pepper's ALAnimatedSpeech interface. We adapted the hesitation strategy for synthetic speech proposed by Betz et al. [47], which has already been tested in an HRI scenario [10]. To make sure the hesitations are recognized as hesitations and not as a normal break, we decided to use an additional silent pause of 1500 ms before the filler and 1000 ms after it. However, the actual total pause in Pepper's speech synthesis was about 3.5 s before the filler and 2.5 s after it (this included Pepper's normal speech pauses, synthesis processing delays, and delay in feedback on the current status of the end of synthesis). Pepper kept gesturing during the breaks. The German filler word "ähm" was played back 50% slower and with a pitch of 80% of the normal voice. In addition, the speech rate of the word before the first silent pause was also reduced (50%) to lengthen the word as an initiation of the hesitation.

Behavioral Data
After each instruction, participants replied to the robot by indicating that they had "understood" the instruction, that they had "not understood" it and needed the robot to repeat the instruction, or that they were "uncertain" about the instruction (but would still try to execute it). Following a potential repetition of the instruction (if requested), the participants executed it by picking up one item and placing it. After the execution of one instruction, the resulting table was checked by the experimenter at the CSL for correctness by comparing the table layout (as seen through the overhead camera) to a reference picture for each step. In case of a deviation, this trial was marked, and the table setup was manually corrected by the experimenter to the expected position to avoid conflicts with subsequent instructions. Due to the deliberate ambiguity of some instructions, we did not correct every mistake, only those which were not compatible with the given instruction.

Conditions
Within each block, we derived three conditions from the combination of two factors (cf. Table 1): (i) distraction (present/absent) and (ii) hesitation (employed/not employed). From the four different combinations, we excluded the combination of absent distraction and employed hesitation, as this does not reproduce the expected robot behavior and removing it still allowed us to study the most relevant comparisons while dedicating more trials to the remaining three combinations. We call these combinations NODIST (distraction absent), DISTNOHES (distraction present, hesitation not employed), and DISTHES (distraction present, hesitation employed). The participants performed two sets of interactions, one with the kitchen objects and one with the wooden blocks. Each set consisted of five interaction blocks, and each block consisted of nine instructions. In total, each participant thus carried out 90 instructions (30 per condition, cf. Table 2). In order for Pepper to behave consistently over a certain period of time, it reacted either with or without hesitations per set (kitchen objects (KIT), wooden blocks (BLC)) and changed its behavior for the second set. This resulted in four different interaction sequences: (1) KIT with hesitations, BLC without hesitations; (2) KIT without hesitations, BLC with hesitations; (3) BLC with hesitations, KIT without hesitations; (4) BLC without hesitations, KIT with hesitations. The participants were randomly assigned to one of these four scripts to balance the order in which hesitations appeared and the order of the presented sets, and to reduce interaction effects between them.
Figures 2 and 3 show the experimental setup. After the interaction, the participants filled out a questionnaire, which we used to gain further insights. The questionnaire consisted of six parts, including questions regarding (i) general demographics, (ii) the self-reported distractibility of the participants, (iii) the perception of the distractions, (iv) the intelligibility of the instructions, (v) the perception of the hesitations, and (vi) the synthesis quality.
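The design arithmetic and the script assignment can be sketched as follows. The four scripts and the block counts are taken from the description above; the balanced shuffle in `assign_scripts` and the fixed seed are assumptions for illustration, since the exact randomization procedure is not specified.

```python
import random

BLOCKS_PER_SET = 5
INSTRUCTIONS_PER_BLOCK = 9
CONDITIONS = ("NODIST", "DISTNOHES", "DISTHES")

# The four interaction sequences: ordered (item set, hesitations employed?) pairs.
SCRIPTS = {
    1: [("KIT", True), ("BLC", False)],
    2: [("KIT", False), ("BLC", True)],
    3: [("BLC", True), ("KIT", False)],
    4: [("BLC", False), ("KIT", True)],
}


def total_instructions(n_sets=2):
    """Number of instructions one participant carries out."""
    return n_sets * BLOCKS_PER_SET * INSTRUCTIONS_PER_BLOCK


def assign_scripts(participant_ids, seed=0):
    """Randomly assign participants to scripts with (near-)equal group sizes.

    Assumption: balance is achieved by shuffling the participants and then
    cycling through the script numbers.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: (i % len(SCRIPTS)) + 1 for i, pid in enumerate(ids)}
```

With two sets, `total_instructions()` yields 90, which divides evenly into 30 instructions for each of the three conditions.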

EEG Processing and Classification
In traditional, strongly controlled experiment setups, we would analyze the EEG for event-related effects, such as Event-Related Potentials in the time domain or Event-Related Spectral Perturbations in the frequency domain. However, the uncontrolled nature of our approach (which we chose deliberately to study a realistic HRI scenario) makes it difficult to align events exactly, as onset, content, and acoustic properties of speech and distractors varied from trial to trial. A machine learning-based approach is more flexible in capturing these differences and also prepares us to eventually support the real-time adaptation of the robot.

Preprocessing
Throughout the 10 blocks, continuous 64-channel EEG was recorded at a sampling rate of 512 Hz using a g.HIAMP 256 Biosignal Amplifier (g.tec). Two participants were excluded from the analysis due to technical issues in EEG recording. To obtain the EEG data during the participants' listening to Pepper, the audio recorded in Bielefeld was utilized as a precise reference for identifying speech onsets and offsets, corresponding to each of the nine instructions in the block. The fixed recording delay between the audio file in Bremen and the start of the EEG recording was subtracted, as detailed in Section 3.3.2. Furthermore, to account for the transmission delay, a correction was added to the onsets and offsets, calculated as the average time lag across each block based on the cross-correlation between the clear Pepper audio and the mixed Bremen audio. This yielded 30 trials of EEG data for each of the three conditions with a duration of 3-5 s for normal instructions and distracted instructions and 8-11 s for trials with hesitations. The EEG data were first re-referenced to the average of the two reference electrodes on the left and right earlobes, which were then excluded from further analysis, reducing the number of electrodes from 64 to 62.
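The lag estimation via cross-correlation can be illustrated with a short NumPy sketch. The function names `estimate_lag_seconds` and `correct_events` are hypothetical, and the sketch assumes both audio signals are mono arrays at the same sampling rate; it is not the analysis code used in the study.

```python
import numpy as np


def estimate_lag_seconds(reference, delayed, sample_rate):
    """Estimate by how many seconds `delayed` lags behind `reference`.

    reference: clean audio (e.g., the Pepper recording at the sending site).
    delayed:   audio containing a delayed, possibly noisy copy of `reference`.
    A positive result means that `delayed` arrives later.
    """
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)
    reference = reference - reference.mean()
    delayed = delayed - delayed.mean()
    # Full cross-correlation; index k corresponds to lag k - (len(reference) - 1).
    xcorr = np.correlate(delayed, reference, mode="full")
    lags = np.arange(-(len(reference) - 1), len(delayed))
    return lags[np.argmax(xcorr)] / sample_rate


def correct_events(event_times, lag_seconds):
    """Shift speech onsets/offsets (in seconds) by the estimated lag."""
    return [t + lag_seconds for t in event_times]
```

Note that `np.correlate` scales quadratically with signal length; for recordings of several minutes, an FFT-based correlation would likely be preferable.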
Subsequently, the data were bandpass filtered between 1 and 32 Hz using an FIR filter from the MNE Python library [48], designed as a one-pass, zero-phase, non-causal bandpass filter, followed by a downsampling of the signal to 64 Hz. The filter was created using the firwin method and a Hamming window, with a passband ripple of 0.0194 and a stopband attenuation of 53 dB. The lower passband edge was set to 1.00 Hz with a transition bandwidth of 1.00 Hz and a lower −6 dB cutoff frequency of 0.50 Hz. The upper passband edge was set to 32.00 Hz with a transition bandwidth of 8.00 Hz and an upper −6 dB cutoff frequency of 36.00 Hz. The filter length was 1691 samples, equivalent to 3.303 s. To allow for consistent comparisons across the conditions and varying trial lengths, the EEG data of each trial were segmented into 1 s decision windows with a 0.5 s overlap. For the hesitation condition, only the EEG data following the hesitation phase were taken into consideration to allow for a direct comparison between the cognitive state of the participants while listening to Pepper with distracting background speech and the same condition after their attention was redirected to Pepper. The pre-hesitation audio and silent phases were excluded from further analysis. This procedure resulted in approximately 150 windows per condition. For each of the 1 s windows, spectral power features were computed in 2 Hz bins for each channel using the Welch method for power spectral density estimation, yielding a 62 × N dimensional feature vector, where N is determined by the number of binned features (which was varied for different setups).
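The windowing and feature extraction steps can be sketched with NumPy and SciPy as follows. The function names are hypothetical. The Welch segment length of `sfreq / bin_hz` samples and the inclusive selection of frequency bins are assumptions: they are one setting that produces 2 Hz bins and yields 62 × 5 = 310 features for 4-12 Hz and 62 × 9 = 558 features for 4-20 Hz, but the exact Welch parameters are not stated in the text.

```python
import numpy as np
from scipy.signal import welch


def segment_windows(trial, sfreq, win_s=1.0, step_s=0.5):
    """Cut one trial (n_channels, n_samples) into overlapping windows.

    Returns an array of shape (n_windows, n_channels, window_length).
    """
    win = int(round(win_s * sfreq))
    step = int(round(step_s * sfreq))
    starts = range(0, trial.shape[1] - win + 1, step)
    return np.stack([trial[:, s:s + win] for s in starts])


def spectral_features(window, sfreq, f_lo, f_hi, bin_hz=2.0):
    """Welch power per channel in `bin_hz` steps between f_lo and f_hi (inclusive).

    window: array of shape (n_channels, n_samples).
    Returns a flat vector of length n_channels * n_bins.
    """
    nperseg = int(round(sfreq / bin_hz))  # gives a frequency resolution of bin_hz
    freqs, psd = welch(window, fs=sfreq, nperseg=nperseg, axis=-1)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[:, mask].ravel()
```

For a 4 s trial sampled at 64 Hz, `segment_windows` returns seven windows of 64 samples each, and each window is mapped to one feature vector.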

Classification
In accordance with research questions SRQ1 and SRQ2, we evaluated the impact of auditory distractions on cognitive processing by comparing the control condition (NODIST) to the distraction condition (DISTNOHES), and the distraction condition (DISTNOHES) to the hesitation condition (DISTHES). Our goal was to assess the quantifiable effects of these distractions on cognitive processing and to determine whether the robot's hesitation strategy leads to any significant changes in the EEG signal, which could indicate a shift in the participant's perception of the robot. The discrimination between NODIST and DISTHES was omitted from this analysis, as potential findings could not be attributed unambiguously to one of the two manipulations, namely distraction or hesitation as an intervention.
To answer our research questions, we utilized a Random Forest model from the scikit-learn library [49] to discriminate the EEG windows for the two comparisons. Through a shallow grid search, we obtained optimized results for all participants with a parameter setting of 500 trees, a maximum tree depth of 10, a minimum sample split of 3, and a minimum of 4 samples per leaf. Further, we tested combinations of features from the delta (1-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), low beta (12-20 Hz), and high beta (20-30 Hz) bands, binned in 2 Hz steps. The optimal performance was obtained for 4-12 Hz in the first classification task (ambient vs. distraction) and for 4-20 Hz in the second classification task (distraction vs. hesitation), resulting in 62 × 5 = 310 and 62 × 9 = 558 dimensional feature vectors, respectively.
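As a minimal sketch, the reported hyperparameters map onto scikit-learn's `RandomForestClassifier` as shown below. The random seed, the helper name `make_classifier`, and the synthetic data are assumptions for illustration only; they do not reproduce the study's data or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_classifier(seed=0):
    """Random Forest with the hyperparameters reported in the text (seed is an assumption)."""
    return RandomForestClassifier(
        n_estimators=500,      # 500 trees
        max_depth=10,          # maximum tree depth of 10
        min_samples_split=3,   # minimum sample split of 3
        min_samples_leaf=4,    # minimum of 4 samples per leaf
        random_state=seed,
    )

# Synthetic stand-in for 310-dimensional feature vectors of two conditions.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 310))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 2.0  # inject some class structure so the model has something to learn

clf = make_classifier().fit(X, y)
predictions = clf.predict(X)
```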

EEG
The classification between the undistracted baseline condition and the distraction condition is depicted in Figure 4. The Random Forest model was trained and tested in a person-dependent manner using a stratified ten-fold cross-validation, making sure that no overlapping windows appeared in both the training and test sets. To account for random factors, the cross-validation was run 20 times and the results were averaged for each participant.
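One way to guarantee that overlapping windows never end up on both sides of a split is to assign whole trials to folds, since windows only overlap within a trial. The sketch below illustrates this idea under that assumption; it is not necessarily the procedure used in the study, and the function name and the trial identifiers are hypothetical.

```python
import random
from collections import defaultdict

def trialwise_stratified_folds(trial_ids, labels, n_splits=10, seed=0):
    """Return one fold index per window; all windows of a trial share a fold,
    and the trials of each class are spread evenly over the folds."""
    rng = random.Random(seed)
    label_of_trial = dict(zip(trial_ids, labels))
    trials_by_class = defaultdict(list)
    for trial, label in label_of_trial.items():
        trials_by_class[label].append(trial)
    fold_of_trial = {}
    for label in sorted(trials_by_class):
        trials = trials_by_class[label]
        rng.shuffle(trials)
        for i, trial in enumerate(trials):
            fold_of_trial[trial] = i % n_splits
    return [fold_of_trial[t] for t in trial_ids]

# Example: 2 conditions x 30 trials x 5 windows per trial.
trial_ids, labels = [], []
for cond in ("NODIST", "DISTNOHES"):
    for trial in range(30):
        for _ in range(5):
            trial_ids.append(f"{cond}-{trial}")
            labels.append(cond)

folds = trialwise_stratified_folds(trial_ids, labels)
```

Repeating the assignment with 20 different seeds and averaging the accuracies would mirror the repeated runs described above.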
Using a combination of the theta and alpha bands (4-12 Hz) yielded the best results for SRQ1 (NODIST vs. DISTNOHES). The average accuracy across all participants was approximately 60% (STD 5%). A two-tailed t-test comparing our results to the baseline accuracy of 50% revealed a statistically significant difference for all participants (t(11) = 2.719, p < 0.05). The effect size, as measured by Cohen's d, was 0.43, indicating a medium effect. Using a combination of the theta, alpha, and low-beta bands (4-20 Hz) yielded the best results for SRQ2 (DISTNOHES vs. DISTHES). The average accuracy across all participants was approximately 73% (STD 10%), significantly outperforming the baseline for each participant (t(11) = 7.757, p < 0.001), with robust classification results for more than half of the participants. The effect size, as measured by Cohen's d, was 1.7, indicating a large effect. The confusion matrix in Figure 5 shows consistent predictions for both classifications over all participants, with no clear preference toward one class.
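For readers who want to retrace such a comparison against chance level, the following sketch shows the textbook formulas for a one-sample t statistic and Cohen's d. The accuracy values are hypothetical and chosen only to make the arithmetic easy to follow; they are not the per-participant results of this study. The p-value is omitted because it requires the t distribution (available, for example, through `scipy.stats.ttest_1samp`).

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t_and_d(values, mu0):
    """One-sample t statistic (df = n - 1) and Cohen's d against a fixed reference mu0."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)            # sample standard deviation (n - 1 in the denominator)
    t = (m - mu0) / (sd / sqrt(n))
    d = (m - mu0) / sd
    return t, n - 1, d

# Hypothetical accuracies of three participants, compared with a 50% chance level.
t_stat, df, cohens_d = one_sample_t_and_d([0.55, 0.60, 0.65], mu0=0.5)
```

For these illustrative values, the mean is 0.60 and the sample standard deviation is 0.05, so d = 0.10 / 0.05 = 2.0 and t = 2.0 × √3 ≈ 3.46 with 2 degrees of freedom.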

Behavioral Data
After each instruction, participants replied to the robot by indicating that they had "understood" the instruction, that they had "not understood" it and needed the robot to repeat it, or that they were "uncertain" about the instruction (but would still try to execute it). Figure 6 visualizes the participants' self-assessment of their understanding of Pepper's instructions. The participants' self-reported non-understanding of the instructions differed between the conditions. In the NoDist condition, the median number of instructions that were not understood was Mdn = 1; in the DistHes condition, Mdn = 1.5; and in the DistNoHes condition, Mdn = 2.5. However, this difference did not reach statistical significance, F(2, 22) = 2.0, p = 0.157, η² = 0.06.
Figure 7 depicts the understanding (problems) for each participant on the left and over all participants on the right. The blue portion of the bar plots represents all trials in which the participant "understood" Pepper and correctly performed the corresponding action. The red area subsumes all cases in which the interaction was "unsuccessful", that is, an understanding problem occurred: the person was unsure whether they had understood the actual instruction, asked for a repetition, or performed the action incorrectly. The number of "unsuccessful" interactions differed significantly between the three conditions, F(2, 22) = 4.41, p = 0.025, η² = 0.08. In the NoDist condition, the participants had a median of Mdn = 2.5 "unsuccessful" instructions, whereas in the DistHes and DistNoHes conditions, the number of "unsuccessful" instructions was higher (Mdn DistHes = 4.5, Mdn DistNoHes = 4.5). The post-hoc test showed a significant difference only between the ambient condition and the conditions with distraction (whether with or without hesitation), and no difference between the distraction conditions with and without hesitation (see Table 3). The general self-reported distractibility of the participants was measured with the Mind-Wandering-Questionnaire [50] on a 6-point Likert scale with five items. The mean distractibility of all participants was 3.81 (SD = 0.66, Cronbach's α = 0.78).
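The reported degrees of freedom, F(2, 22), are consistent with a one-way repeated-measures ANOVA over 12 participants and 3 conditions. Assuming that this is the test used, the F statistic can be sketched as follows; the function name and the small data matrix are hypothetical and serve only to show the decomposition of the sums of squares.

```python
def repeated_measures_f(data):
    """One-way repeated-measures ANOVA.

    `data` holds one row per participant and one column per condition.
    Returns (F, df_condition, df_error).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj   # residual after removing subject effects
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error

# Hypothetical counts for 3 participants (rows) in 3 conditions (columns).
example = [[1, 2, 3], [2, 4, 3], [3, 3, 6]]
f_value, df_cond, df_error = repeated_measures_f(example)
```

In this toy example, the condition and subject sums of squares are both 6, the total is 16, and the residual is 4, so F = (6 / 2) / (4 / 4) = 3 with (2, 4) degrees of freedom.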

Perception of the Distractions
The perception of the distraction was measured on a 6-point Likert scale with four items (see Figure 8). The participants rated the provided distraction (background speech) as more disruptive than the general background noise (M NoDist = 2.5, M Dist = 4; V = 0, p < 0.01, r = −0.5). In addition, they stated that they could understand Pepper's voice less well during the provided distractions (M NoDist = 5, M Dist = 4; V = 0, p < 0.05, r = 0.47). As the participants stated that they found the background speech more disturbing than the general background noise, the manipulation test was successful.
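The statistic V is the name that R's `wilcox.test` gives to the Wilcoxon signed-rank statistic, that is, the sum of the ranks of the positive paired differences. Assuming that this test was used here, the sketch below shows how V is obtained; the function name and the ratings are hypothetical. A value of V = 0 means that no non-zero difference was positive for the chosen order of subtraction.

```python
def signed_rank_v(x, y):
    """Sum of the ranks of positive differences x - y (zeros dropped, ties get average ranks)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)

# Hypothetical ratings: every participant rates the second condition higher.
v_all_lower = signed_rank_v([2, 3, 2, 1], [4, 4, 5, 3])
# Hypothetical ratings with mixed signs: differences 2, -1, 3.
v_mixed = signed_rank_v([5, 1, 4], [3, 2, 1])
```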

Intelligibility of Pepper's Instructions
The intelligibility of Pepper's instructions was rated on a 6-point Likert scale with seven items (see Figure 9). Pepper's information was mostly perceived as appropriate with respect to length, timing, and comprehensibility. Only one-third of the participants perceived the instructions as rather untimely. The instructions were felt to be a bit too slow and slightly delayed, which could be attributed to the hesitations.

Perception of the Hesitations
At first, participants were asked, "Pepper didn't always react in the same way to the appearance of the background voices. What did you notice?" to find out whether they had noticed the hesitations. Two participants stated that they had not noticed anything. Another two participants noticed the hesitations but perceived them as errors in the synthesis (words "swallowed") rather than as hesitations. Most participants recognized the hesitations (n = 7). The last participant only stated that the background voices were heavily demanding her concentration. Afterward, the hesitations were explained and rated on a 6-point Likert scale with seven items (see Figure 10). The results indicate that most participants noticed the hesitations, but most of them rated the hesitations as rather unnatural and too long. We decided to use an additional silent pause of 1500 ms before the filler and 1000 ms after it so that the pauses would be recognized as hesitations and not as normal breaks. Finding the right length of unfilled pauses is still an open research topic and should be addressed in further research.
Two-thirds of the participants stated that the hesitation did not cause them to stop listening to Pepper. Additionally, one-third of the participants said that the hesitation redirected their attention to Pepper's speech when they were distracted. Interestingly, none of the hesitation ratings correlated significantly with task performance.
Splitting the results by interaction script suggests that the participants who ended with the hesitation block (scripts 2 and 4) rated the naturalness rather low (see Figure 12). It should be noted here that only three participants are available per script, and, therefore, a statistical evaluation is not possible. However, it could indicate that the hesitation leads to a less natural synthesis. This would be consistent with our previous research (e.g., [35]), as the natural synthesis of hesitations in human-robot interaction is still an important field of research and should be addressed in further research.

Discussion
In our study, we conducted an experiment to investigate distractibility in a human-robot interaction setting by simulating daily activities instructed by a robot in a remote live setting. The results of our experiment provide some insights regarding EEG responses to scaffolding in HRI and the methodology of LabLinking.

SRQ1: Effects of Distraction
Participants self-reported that they perceived the provided distraction (background voices) as more disruptive than the general ambient background noise. Thus, the manipulation check for our experiment was successful. In addition, the participants had significantly more understanding problems in the disruptive condition. This was shown by the participants' own uncertainty about whether they had understood correctly what was said, by requests for repetition, or by incorrect task execution.
Furthermore, we were able to classify between the undistracted baseline and the distracted condition. Hence, we could find effects on single-trial EEG based on the distraction or non-understanding condition in the human-robot interaction (SRQ1). The single-trial EEG analyses of neural responses during speech perception showed a statistically significant average discriminability of 60% between trials where background voices were used as a source of distraction and the control trials with ambient noise. These findings were supported by the feedback of the participants, who reported greater difficulty in understanding and maintaining focus on the instructions in the distraction condition. Together, these results provide additional validation of the effect of distraction through speech, as compared to a noisy environment, on a neural level [52]. Our analysis demonstrated that the best classification results were achieved through a combination of the theta and alpha band power features. Notably, an increase in theta band power and a decrease in alpha band power are associated with reduced attention and increased distractibility, as well as with the cortical tracking of speech [53][54][55]. Therefore, our results imply that the auditory distraction was successful and could potentially be detected using neural signals, particularly when using small window sizes. While the classification accuracies achieved in our study may still be considered quite low and not yet sufficient for real-world applications, they do open up the possibility of building better systems and implementing adaptive approaches via real-time feedback to enhance human-robot interaction in dynamic environments. Critically, it cannot be excluded that the effects arise from the participants' non-understanding rather than from distraction. Further analyses are needed to identify the underlying cognitive processes as reflected by the EEG data. In this first approach, we compared the trials without disruption with trials with disruption (but no hesitation). As a next step, we will analyze the successfully understood and correctly performed trials and compare them with the unsuccessful interaction trials.

SRQ2: Effects of Hesitation Scaffolding Strategy
Our second research question (SRQ2) addresses whether it is possible to find EEG responses related to scaffolding strategies in HRI, in our example, the hesitation strategy. Again, we were able to classify the conditions of distraction vs. hesitation. However, in the hesitation condition, participants had no fewer understanding problems than in the distraction condition without hesitations. Thus, this experiment was unable to reproduce the positive effects of hesitations from previous studies (e.g., [33]). The results of the understanding for each participant in Figure 7 show that the success of the hesitation scaffolding strategy could be very person-dependent. The hesitations only seem to be beneficial for some individuals; other participants seem more likely to be further distracted by the hesitations. This is also reflected in the subjective ratings of the hesitations. The possible benefit of hesitation as a scaffolding strategy will be addressed in further studies, where the subjective ratings should additionally be assessed between blocks with different conditions. In doing so, we could gain more insights into the subjective perception of Pepper's hesitations. Furthermore, the hesitation strategy itself needs to be investigated further. Finding the right length of unfilled pauses is still an open research topic and should be addressed in further research. To make sure the hesitations were recognized as hesitations and not as normal breaks, we decided to use additional silent pauses before and after the filler. The actual total pause in Pepper's speech synthesis was about 3.5 s before the filler and 2.5 s after it. Pepper kept gesturing during the breaks. However, most participants rated this as too long. In addition, some participants did not recognize the hesitations as such but perceived them as errors in the synthesis. Further research is needed to synthesize hesitations in a robot's live system.
The EEG data analysis revealed significant differences in neural responses between the two conditions. Averaging over all participants, the single-trial classification accuracy was 73% for trials following the scaffolding strategy compared to those where no such strategy was employed. These findings are consistent with the perception of the participants in Figure 10, which reflects that the robot's hesitation was clearly noticeable for every participant. However, the majority of participants reported perceiving Pepper's hesitation as unnatural and even distracting. A limitation of the current study is that the specific cognitive processes underlying the observed neural responses cannot be identified. It remains unclear whether the observed difference is a direct result of the scaffolding strategy, hence an improvement in attention, or simply a reflexive response to the preceding action of the robot. In particular, the participants provided rather negative feedback regarding the effectiveness of redirecting attention. To answer this question definitively, more in-depth analyses of the EEG signals are required to find evidence for the involvement of specific cognitive processes that may be affected by the robot's scaffolding strategies. Nevertheless, the robust detection of the robot's intervention in the neural responses supports the use of scaffolding strategies such as hesitation for enhancing human-robot interaction with neural assistance.

MRQ1: Benefits of LabLinking
Besides the research regarding EEG responses to scaffolding in HRI, we investigated the LabLinking method (MRQ1). It enabled interdisciplinary joint HRI research between our two laboratories: the Medical Assistance Systems Group (MAS) at Bielefeld University and the Cognitive Systems Lab (CSL) at the University of Bremen. The method allowed the presented HRI study to be carried out without having the expensive hardware in one place. This saves resources in many ways. First, (i) no new hardware had to be bought in either Bremen or Bielefeld. Additionally, (ii) the hardware did not have to be transported to the other laboratory, which always involves a risk of damage during transport. Furthermore, since the hardware did not have to be taken to the other laboratory, (iii) it could be used for further on-site studies in the meantime. Moreover, (iv) travel resources were saved since the labs are connected via LabLinking. It was possible to carry out the presented study successfully, although we only visited each other's laboratory once. We had approximately 25 meetings over one year (excluding the study itself) and 9 days on which the study was conducted. It was not necessary to travel to the other laboratory to test the joint setup, which saved travel costs and CO2 emissions. In addition, we were able to conduct the research despite the (travel) restrictions imposed by the COVID-19 pandemic. The connection of the two laboratories further allows (v) an economical use of human resources with regard to time. Experts in the individual areas were able to share their knowledge more quickly and easily. This enabled an interdisciplinary research exchange in a large team.

MRQ2: Challenges of LabLinking
In addition to these advantages, however, the LabLinking method also revealed some challenges (MRQ2), which required us to develop a number of best practices during iterations of the experiment development. First, it is important to maintain a joint experiment state and an unambiguous communication protocol between the experimenters at both sites. This can be supported through the combination of multiple channels: live monitors of the video and audio recordings at the different sites, explicit terminology, and an explicit flowchart, which determines which site is responsible for confirmations or for taking the initiative to advance the experiment. We also formalized this flowchart in experiment control programs on both ends, which enforce this protocol programmatically. Second, it is important to take care of temporal synchronization from a technical perspective. For this purpose, we use a unified messaging protocol (ROS in our case) through which all information is passed, including timestamps for each event. These also allow us to monitor data transmission latency. The latency between the transmission of the speech synthesis of the Bielefeld robot and its arrival at the microphones in the Bremen laboratory was measured to be approximately 415 ± 170 ms. This process of synchronization creates a single repository of all data occurring within the experiment, all aligned according to a common clock.
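How timestamped events allow latency monitoring can be illustrated with a small sketch. It does not use the ROS API; the `Event` record, its field names, and the example timestamps are hypothetical, and the sketch assumes that both sites stamp events against a common clock, since otherwise clock offsets would be mistaken for latency.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Event:
    name: str
    sent_s: float       # timestamp when the sending site emitted the event
    received_s: float   # timestamp when the receiving site registered it

def latency_summary_ms(events):
    """Mean and sample standard deviation of the transmission latency in milliseconds."""
    latencies = [(e.received_s - e.sent_s) * 1000.0 for e in events]
    return mean(latencies), stdev(latencies)

# Hypothetical events; the timestamps are made up for illustration.
events = [
    Event("instruction_1_onset", sent_s=10.0, received_s=10.4),
    Event("instruction_2_onset", sent_s=25.0, received_s=25.5),
]
mean_ms, std_ms = latency_summary_ms(events)
```

A summary of this kind, computed over all speech events of a session, would give a figure in the form "mean ± standard deviation".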

Further Research
Apart from the already mentioned future work, we want to address the following research directions.

Early detection of non-understanding: If it is possible to detect problems of understanding early in the interaction based on the EEG correlates, the Pepper robot could use these to identify the unsuccessful interactions during the interaction and correct them through linguistic strategies (e.g., hesitations).
Adaption: Hesitations only seem to be beneficial for some individuals, whereas other participants seem more likely to be further distracted by the hesitations. A more detailed analysis of the data could provide information here. If, for example, the EEG data show differences in the groups, an appropriate (hesitation) strategy for the particular person could be selected.
Pipeline Automatization: To facilitate the experiment process, we want to further automatize the pipeline for experiment execution and data analysis. In particular, we want to add automatic object detection to identify and locate the objects on the table. This will enable an automatic, fine-grained scoring of item placement and will, furthermore, provide additional context information to the robot.

Conclusions
In this paper, we presented the first results of a LabLinking pilot study investigating EEG correlates of distractions and hesitations in human-robot interaction. We were able to show that (i) the EEG correlates in the distracted condition are different from the baseline condition without distractions, and we can classify them. In addition, (ii) we could differentiate the EEG correlates of distraction with and without hesitations. Finally, (iii) we presented the benefits and challenges of the LabLinking method for enabling interdisciplinary joint HRI research experiments between multiple labs. This proof-of-concept study showed that it is possible to conduct HRI studies via LabLinking and lays a first foundation for more in-depth research into robotic scaffolding strategies.

Figure 1. The LabLinking setup for conducting HRI studies between the laboratory of the Medical Assistance System Group (MAS) at Bielefeld University and the Cognitive Systems Lab (CSL) at the University of Bremen: The Pepper robot gives instructions to the human interaction partner, who provides feedback on understanding. The laboratories are connected via a GStreamer pipeline for audio and additional communication via ROS for video and further system events.

Figure 2. Setup from the MAS lab perspective with the Pepper robot.

Figure 4. Classification results for 1-second EEG windows of the ambient and distraction conditions (red) and the distraction and hesitation conditions (blue) with standard error bars (black) calculated between the 10 splits of the cross-validation and the random runs.

Figure 5. Confusion matrix for 1-second EEG windows of the ambient and distraction conditions and the distraction and hesitation conditions averaged over all folds of the 10-fold cross-validation and all participants. The color bar denotes the number of 1-second windows.

Figure 6. Self-assessment of understanding the instruction for each participant (left) and over all participants (right).

Figure 7. Understanding (problems) for each participant (left) and over all participants (right).

Table 1. Three conditions with a number of utterances utilized in the experiment (DIS: distraction, HES: hesitation, KIT: kitchen item, BLC: wooden block item).

Table 2. Four different scripts to mitigate order effects. Each participant was assigned one script with 90 utterances (DIS: distraction, HES: hesitation, KIT: kitchen item, BLC: wooden block item).

Table 3. Post-hoc pairwise comparison with Bonferroni correction of "unsuccessful" interaction between the conditions.