Prediction Models of Collaborative Behaviors in Dyadic Interactions: An Application for Inclusive Teamwork Training in Virtual Environments

Collaborative virtual environment (CVE)-based teamwork training offers a promising avenue for inclusive teamwork training. The incorporation of a feedback mechanism within virtual training environments can enhance the training experience by scaffolding learning and promoting active collaboration. However, an effective feedback mechanism requires a robust prediction model of collaborative behaviors. This paper presents a novel approach using hidden Markov models (HMMs) to predict human behavior in collaborative interactions based on multimodal signals collected from a CVE-based teamwork training simulator. The HMM was trained using k-fold cross-validation, achieving an accuracy of 97.77%. The HMM was evaluated against expert-labeled data and compared against a rule-based prediction model, demonstrating the superior predictive capabilities of the HMM, with the HMM achieving 90.59% accuracy compared to 76.53% for the rule-based model. These results highlight the potential of HMMs to predict collaborative behaviors that could be used in a feedback mechanism to enhance teamwork training experiences despite the complexity of these behaviors. This research contributes to advancing inclusive and supportive virtual learning environments, bridging gaps in cross-neurotype collaborations.


Introduction
Human-computer interaction (HCI) technologies have become prevalent tools for facilitating skill learning, offering engaging interactions and replicable solutions to benefit and enhance learning experiences [1][2][3]. These systems teach a range of skills, including cognitive abilities such as visual-spatial, auditory, and verbal skills [4]; affective learning such as emotion regulation [5]; and collaboration skills [6][7][8]. Real-time prompts and feedback through visual, audio, and tactile cues are integral features of these systems that help enhance user engagement and learning experiences [9,10].
In our prior work, we developed a series of virtual collaborative tasks as a teamwork training simulator within a collaborative virtual environment (CVE), focusing on facilitating dyadic interaction [11]. While participants found the training paradigms engaging, we identified a need for real-time feedback mechanisms to support participants during these collaborative tasks. Recognizing human behavior in feedback mechanisms is crucial for effective training outcomes [12]. Recent studies reported that incorporating human behavior recognition, along with task performance, in adaptive training paradigms can significantly enhance participants' engagement and learning outcomes [7,8,13]. Current methods for behavior recognition rely heavily on behavioral experts to observe and evaluate human behavior either in real-time during experimental sessions [14] or through analysis of video recordings post-experiment [15][16][17]. While manual labeling of human behavior is reliable, it lacks real-time accessibility, is resource-intensive, and is prone to bias [18][19][20], highlighting the necessity for an automated human behavior prediction model.
Various machine learning methods have been employed to predict human behavior in computer-based interactions [21,22], aiming to address the inherent uncertainty in recognizing human behaviors [19]. Traditional machine learning approaches have been pivotal, with several studies achieving notable success [23][24][25][26]. For instance, Abdelrahman et al. [27] used deep learning and neural network methods to predict engagement and disengagement in human-robot interaction by extracting and scoring engagement-related features from human participants, such as gaze, head pose, and body posture, to infer engagement. The model achieved a 93% F1 score. Many of these studies used multimodal signals for the classification of behavior to improve the classification accuracy due to the complex nature of human behavior. According to a meta-analysis review of 30 studies that compared affect detection accuracy between multimodal and unimodal signals, researchers reported that accuracies based on multimodal data fusion were consistently better than those based on unimodal signals [28]. This is further supported by the findings in a study by Mallol-Ragolta et al. [29], where they reported the best agreement score using multimodal signals compared to a unimodal signal in a robotic empathy recognition system. In another study by Okada et al. [30], signals from speech and head movement were captured and both verbal and non-verbal features were extracted from the signals to assess the collaborative behavior in discussions.
To handle the stochastic nature of human behavior, hidden Markov models (HMMs) have been widely used [31,32], offering robustness and flexibility in analyzing temporal patterns, which are necessary for predicting behavioral patterns (e.g., emotions, activities, and learning behaviors such as engagement, disengagement, confusion, frustration, and distress) [33][34][35][36][37][38]. Mihoub et al. [39] demonstrated the effectiveness of an incremental discrete hidden Markov model (IDHMM) to recognize and generate multimodal joint actions in face-to-face interactions. The results reported that the classification accuracy of IDHMM was 92%, while a support vector machine (SVM)'s classification accuracy was 81%. Another study compared HMM performance against traditional classification models that included support vector machine (SVM), random forest (RF), linear regression (LR), and deep neural network (DNN) in predicting students' learning behavior in e-learning environments [37]. The study utilized early assessment data, and results indicated that HMM outperformed the other models for 5 of the 6 courses with accuracies above 90%. Similarly, Sharma et al. [34] utilized an HMM to predict effortful behavior in adaptive learning environments using both performance and physiological data, enabling real-time feedback based on predicted behavioral patterns. In this work, our first contribution is the design of an HMM-based model to recognize collaborative behaviors in dyadic interactions using multimodal signals, aiming to provide real-time feedback and scaffold learning in computer-based interactions. As a baseline for comparison, we implemented a rule-based behavior recognition method designed in consultation with experienced behavior analysts. We compared the HMM model against the rule-based method with expert-labeled data as the ground truth.
While previous works have focused on behavior recognition in single-user interactions, recognizing human behavior in dyadic interactions, such as teamwork, remains underexplored. Teamwork, defined as collaborative work between individuals toward a common goal, has garnered increased research attention in recent years, particularly within organizational contexts [40,41]. In addition to the benefits teamwork brings to a company, teamwork also leads to increased satisfaction in the workplace, which can fulfill personal growth [42]. As society embraces inclusive workplaces with neurodiverse individuals, research interest in cross-neurotype collaboration, i.e., collaboration between neurotypical (people with "normal" neurotypes) and neurodiverse individuals, has increased [43]. Studies have shown that cross-neurotype collaboration can be less effective than collaborations between individuals within the same neurotype [44]. This phenomenon, driven by the double empathy problem [45], underscores that challenges in cross-neurotype communication and social interaction are the responsibility of both neurotypical and autistic individuals [46,47]. Our second contribution is the implementation of the prediction model in our previously designed teamwork training simulator to support cross-neurotype teamwork training.
Additionally, we contribute to the field by creating a novel dataset containing multimodal signals from dyadic interactions labeled with collaborative behaviors by expert annotators. The outcome of this work can motivate future research to incorporate a robust behavioral prediction model in a feedback mechanism within virtual training environments that could enhance training experiences by scaffolding learning and promoting active collaboration. By creating a novel prediction model and a multimodal dyadic interaction dataset, our work seeks to advance the integration of robust behavioral prediction mechanisms into computer-based teamwork training, ultimately enhancing collaborative skill development.

Experimental Design
We conducted a preliminary study to gather multimodal signals from participants interacting with each other to complete various collaborative tasks in a CVE. The signals were then processed, analyzed, and labeled based on defined collaborative behaviors. The labeled signals were used to (i) train an HMM prediction model, (ii) design a rule-based prediction model, and (iii) evaluate both models.

Collaborative Tasks Description
The collaborative task selection was driven by employment-related studies for autistic individuals [48]. We then designed the activities within the tasks based on input from stakeholders, including human resource personnel from several companies, certified behavioral analysts, career counselors, and autistic adults. They provided suggestions and feedback to encourage teamwork in a workplace environment between an autistic individual and a neurotypical (non-autistic) partner, which was discussed in detail in our previous work [11]. Multiple discussion sessions with the stakeholders were conducted to select tasks that are collaborative and include interactions that were translatable to workplace environments.
The first task was a PC assembly task in which two participants were located on opposite ends of a table in the virtual environment, giving them different views of the workspace. They both were given written instructions and different hardware to collaboratively build a single computer. They would use a keyboard and mouse to move the components into the correct location within a set amount of time. Participants were required to take turns and communicate with each other when assembling the PC. The next task was a furniture assembly task in which participants were placed in a virtual living room and worked together to assemble various furniture pieces within a set amount of time. They used a haptic device to move and assemble the furniture parts to the target area. The final task was a fulfillment center task in which participants would drive virtual forklifts with varying height capacities to transport crates from a warehouse to a drop-off location. Participants used a gamepad to drive the forklift in this task. These collaborative tasks, as illustrated in Figure 1, were designed in Unity, a multi-platform game development software [49].
Three design strategies were embedded within the tasks to encourage communication and collaboration between the participants: (a) PC assembly: incomplete installation instructions were given to each participant to encourage them to exchange information to progress in the task; (b) furniture assembly: participants were given only an image of the assembled furniture, without written instructions, to encourage them to divide the task and coordinate their actions; and (c) fulfillment center: the list of crates was different for each participant and the locations of the crates varied to allow participants to practice turn-taking.

Participants and Protocol
We recruited six autistic (ASD) and six neurotypical (NT) participants to form six cross-neurotype (ASD-NT) participant pairs. The demographics for the participants are shown in Table 1. Participants with ASD were recruited through an existing university-based clinical research registry and the NT participants were recruited from the local community through regional advertisement. All study procedures were approved by the Vanderbilt University Institutional Review Board (IRB) with associated procedures for informed assent and consent. Figure 2 illustrates the setup of the experiment.

Prediction Models Workflow
We describe four main processes involved in designing, training, and evaluating behavior prediction models in collaborative interactions, as seen in Figure 3. First, we captured multimodal data from both participants and performed signal processing to design the prediction models. Then, the multimodal signals together with video recordings were used by annotators to label the participants' collaborative behavior, which we defined as either "Engaged", "Waiting", or "Struggling", to establish ground truth. These labeled data were used to design a rule-based prediction model and train an HMM. We then evaluated both prediction models' performances. The following subsections explain each step of the process in detail.

Multimodal Signal Processing
The multimodal signals were captured from three devices integrated into the collaborative system. We used the signals from (i) task-dependent controllers, (ii) a microphone headset, and (iii) a Tobii EyeX eye tracker that were set up for each participant to extract seven binary features used to recognize the behavior of the participants in collaborative interactions. As an initial approach to analyzing the signals, we chose to represent the features as binary, as this allows for simplified analysis while still yielding dependable results [4,50]. The diagram in Figure 4 shows that we derived one feature, Speech Presence, from the microphone headset as a measure of communication between the participants. Then, from the eye tracker, we extracted the Gaze Presence feature and Gaze on Object feature to measure the participant's focus on the task. Finally, we extracted four features from the controllers: two based on the presence of the controller, represented by Controller Presence and Controller Manipulation, and two based on the distance of the virtual object from a target location as a measure of task progression, represented by Object Move Closer and Object Move Away. The feature values were either 1 or 0, representing the presence or absence of the feature. We describe the selection of the feature values in more detail in Table 2.

| Device | Binary Feature | Feature Description |
|---|---|---|
| Microphone headset | Speech Presence | Feature is set to "1" when participant is speaking and "0" otherwise. |
| Eye tracker | Gaze Presence | Feature is set to "1" when participant's gaze is detected on screen and "0" otherwise. |
| Eye tracker | Gaze On Object | Feature is set to "1" when gaze is on a virtual object or within the defined "focus area" as depicted in Figure 5. |
| Controller | Controller Presence | Feature is set to "1" when an input is detected from the controller (keyboard button, mouse clicks, haptic presses) and "0" otherwise. |
| Controller | Controller Manipulation | Feature is set to "1" when controller is actively moving an object, and "0" otherwise. |
| Controller | * Object Move Closer | Feature is set to "1" when the distance of the object from the target location is decreasing, and "0" otherwise. |
| Controller | * Object Move Away | Feature is set to "1" when the distance of the object from the target location is increasing, and "0" otherwise. |

All the features were collected with a sampling rate of 1 Hz. These binary features were concatenated to form a feature vector (e.g., [0 1 0 1 0 1 0]) for the HMM design, while individual binary values were used as input to the rule-based model. A similar concatenation of the features was also used by Khamparia et al. [51] in their HMM application to investigate psychological and environmental factors to help improve learners' performance. As an example, based on the description in Table 2, the combination of the features [0 1 0 1 0 1 0] would represent speech absence, gaze presence, controller manipulated, and object moving closer to the target.
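The concatenation step can be illustrated with a short Python sketch. The feature order follows Table 2. The dictionary keys, the function names, and the mapping of each 7-bit vector to an integer symbol in 1..128 are assumptions made for illustration. The text does not state how the vectors were encoded as discrete observations.

```python
# Feature order follows Table 2; the key names are hypothetical.
FEATURES = (
    "speech_presence",
    "gaze_presence",
    "gaze_on_object",
    "controller_presence",
    "controller_manipulation",
    "object_move_closer",
    "object_move_away",
)


def to_vector(sample):
    """Concatenate the seven binary features into an ordered list of 0/1 values."""
    vector = [int(sample[name]) for name in FEATURES]
    if any(bit not in (0, 1) for bit in vector):
        raise ValueError("all features must be binary (0 or 1)")
    return vector


def to_symbol(vector):
    """Map a 7-bit vector to an integer symbol in 1..128.

    Assumed encoding: the first feature is the most significant bit,
    and symbols are 1-based so they can index an emission matrix.
    """
    value = 0
    for bit in vector:
        value = value * 2 + bit
    return value + 1


# One 1 Hz sample matching the example vector in the text.
sample = {
    "speech_presence": 0,
    "gaze_presence": 1,
    "gaze_on_object": 0,
    "controller_presence": 1,
    "controller_manipulation": 0,
    "object_move_closer": 1,
    "object_move_away": 0,
}
vector = to_vector(sample)
symbol = to_symbol(vector)
```

With seven binary features there are at most 2^7 = 128 distinct vectors, so a discrete HMM over these symbols would need an emission matrix with at most 128 columns.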

Collaborative Behavior Coding Scheme
A literature review on collaborative learning showed that the most frequent and prevalent behaviors that could influence collaborative interactions were engagement [52], struggling [53,54], and boredom [55,56]. Engagement could represent positive collaborative interactions, while struggling and boredom could indicate a negative collaborative experience that would require intervention. Using this literature review and discussions with the stakeholders and behavioral analysts, we chose the following three behaviors that would be the most useful in recognizing the initial collaboration level in our teamwork training simulator, which will henceforth be referred to as collaborative behaviors: Engaged, Struggling, and Waiting. Note that boredom was replaced with waiting in our application as it can be indicative of a negative collaborative experience. By focusing on when participants are waiting or struggling, the focus can be shifted to prevent boredom or disengagement in the system. These three behaviors represent essential stages of teamwork, allowing the system to provide informed and meaningful feedback. Engaged captures the behavior of the participant when performing the task and collaborating with their partner [57], allowing the system to provide positive feedback, such as "Good job!" or "Keep up the good work!". Struggling represents the behavior of the participant when they were not progressing in the task (e.g., the task object was moving away from the target), were not interacting with their partner, or were disengaged from the task (e.g., looking outside the focus area for some time) [58]. The system would then use the Struggling behavior as an indicator to prompt the participants to collaborate, for example, "Ask your partner to help you with the task" and, to the other participant, "Your partner seems to be struggling, offer them help". Turn-taking is part of teamwork and collaborative interaction. As such, we used the Waiting behavior to represent the behavior when the participant was on standby while their partner was performing a task [59]. This Waiting behavior is different from when a participant is not progressing in the task due to being distracted or disinterested (which is categorized under Struggling). In the Waiting behavior, the system would allocate some time for the participants to wait without prompting them. Although only three collaborative behaviors are discussed in this work, other behaviors could be added in the future based on the need and understanding of collaboration and teamwork. The collaborative behaviors were defined in consultation with a certified behavioral analyst to ensure the consistency of the manual labeling, as shown in Table 3.

| # | Collaborative Behavior | Definition | Condition |
|---|---|---|---|
| 1 | Engaged | The participant is focused on the task, communicating, and progressing well. | Participant could be talking to their partner. Participant is using the controller and virtual object is moving closer to the target. |
| 2 | Struggling | The participant is not progressing with the task due to difficulty performing the task, not communicating with their partner, distracted, or disinterested with the task. | Participant is not talking to their partner while: (i) manipulating the controller but virtual object moving away from the target, or (ii) not manipulating the controller and not looking at the screen (virtual objects, focused area). |
| 3 | Waiting | The participant is on standby for their partner in a turn-taking task, not moving. | Participant is not talking to their partner, not using the controller, and not moving virtual objects, but is looking at an object or focus area. |

Hand Labelling to Establish Ground Truth
Two annotators trained by a certified behavioral analyst used the collaborative behaviors defined in Table 3 to label the participants' behavior as either Engaged, Struggling, or Waiting based on the extracted features discussed in Section 2.2.1 and on video recordings of the sessions. The annotators individually labeled 10 min of interactions from each of the six experimental sessions, resulting in 4976 hand-labeled datapoints. They achieved 98% agreement, and the remaining 2% of disagreements were reconciled through discussion until both annotators agreed on a label. From the labeled data, the class distributions of the three behaviors were as follows: Engaged (19.9%), Struggling (28.0%), and Waiting (52.1%). The hand-labeled data distribution shows that the behaviors are not equally distributed, and the majority of the labeled behavior was Waiting since the tasks mainly involved turn-taking. When designing the HMM prediction model, the imbalance in the data distribution was taken into consideration to avoid overfitting and bias in the prediction model. To achieve this, we used k-fold cross-validation to reduce the effect of the data imbalance. This is explained in more detail in Sections 2.2.5 and 3.1.
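The agreement and class-distribution figures can be reproduced from two label sequences with a few lines of Python. This is a minimal sketch that assumes the reported agreement is simple percent agreement, since the text does not name the statistic. The function names and the toy label sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Percentage of datapoints on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def class_distribution(labels):
    """Percentage of datapoints assigned to each behavior."""
    counts = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in counts.items()}


# Toy example only; these are not the study's labels.
annotator_1 = ["Engaged", "Waiting", "Struggling", "Waiting"]
annotator_2 = ["Engaged", "Waiting", "Waiting", "Waiting"]
agreement = percent_agreement(annotator_1, annotator_2)
distribution = class_distribution(annotator_2)
```

Percent agreement does not correct for chance agreement, which matters when one class dominates, as Waiting does here.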

Rule-Based Prediction Model Design
We consolidated the inputs and feedback gathered from the behavioral analyst when developing the coding scheme in Section 2.2.3 into a set of rules for each collaborative behavior based on the binary features. The rules were constructed to closely replicate the role of human annotators. The seven binary features that were discussed in Section 2.2.1 were used to drive the categorization of the collaborative behaviors defined in Table 3. In the rule-based model, we begin by checking the presence of speech. Since speech accounts for 20-30% of the entire collaborative interaction, in this initial design, we treated any utterance while performing the task as an indication of engagement. Further analysis of the speech in future work would allow us to categorize the behavior more accurately (i.e., positive utterances as Engaged, and negative utterances as Struggling). If speech was detected, the rule would assign the collaborative behavior as Engaged. If speech was not detected, the second rule was to check for keypresses, which were based on the Controller Manipulation feature. If controller manipulation was present, the rule would set the keypresses as true and move to the next rule to check the object distance based on the Object Move Closer and Object Move Away features. If the Object Move Away feature was true, the object was moving away from the target, and the rule would assign the collaborative behavior as Struggling. If the Object Move Closer feature was true, the object was moving closer to the target, and the rule would assign the collaborative behavior as Engaged. If neither Object Move Closer nor Object Move Away was true, the object was not moving, so the rule would assign the collaborative behavior as Waiting. In the case when neither speech nor keypresses were present, the rule would check for eye gaze presence. If eye gaze was present, the participant's collaborative behavior was assigned as Waiting. However, if eye gaze was absent, the collaborative behavior was assigned as Struggling. We then consolidated these rules into a rule-based model as shown by the flow chart in Figure 6.
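The decision sequence described above can be sketched as a small Python function. This is an illustrative reading of the prose, not the authors' implementation. The function and parameter names are hypothetical. Checking Object Move Away before Object Move Closer follows the order in which the text lists the rules.

```python
def rule_based_behavior(speech_presence, controller_manipulation,
                        object_move_closer, object_move_away, gaze_presence):
    """Classify one 1 Hz sample as 'Engaged', 'Struggling', or 'Waiting'."""
    # Rule 1: any utterance is treated as engagement in this initial design.
    if speech_presence:
        return "Engaged"
    # Rule 2: keypresses present, so decide from the object's motion.
    if controller_manipulation:
        if object_move_away:
            return "Struggling"
        if object_move_closer:
            return "Engaged"
        return "Waiting"  # the object is not moving
    # Rule 3: no speech and no keypresses, so decide from gaze.
    return "Waiting" if gaze_presence else "Struggling"


# Silent participant moving an object toward the target.
label = rule_based_behavior(
    speech_presence=0,
    controller_manipulation=1,
    object_move_closer=1,
    object_move_away=0,
    gaze_presence=1,
)
```

Each sample is classified independently of its neighbors, so the rules cannot use temporal context, in contrast to the HMM.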

HMM Design and Training
A Hidden Markov Model (HMM) is a probabilistic graphical model used to represent systems that evolve over time, offering flexibility and scalability compared to deterministic predictive models [60]. It comprises five main elements [61], shown in Table 4. The first is the set of hidden states (N), which signifies the unobservable underlying conditions or states within the system. In our application, these are our three defined collaborative behaviors: Engaged, Struggling, and Waiting. Second, there are observations (M) associated with each state. In our application, these observations are the seven binary features explained in Section 2.2.1. The model's dynamics are governed by state transition probabilities, represented as a state transition matrix (A). The matrix encodes the probabilities of transitioning from one state to another at each time step, reflecting how the system evolves over time. This matrix is generated when training the model. In addition to the transition matrix, there is an emission matrix (B) that defines the likelihood of generating a particular observation given the current state. Finally, the model requires an initial probability distribution (π), which specifies the initial likelihood of beginning the sequence in each hidden state.
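The five elements can be written compactly in standard HMM notation. In the sketch below, the symbols q_t (hidden state at time t), o_t (observation at time t), s_i (the i-th state), and v_k (the k-th observation symbol) are introduced for illustration and do not appear in the original text. N and M are read here as the numbers of hidden states and of distinct observation symbols, which is an assumption about the notation.

```latex
\begin{align}
\lambda &= (A, B, \pi) \\
a_{ij} &= P(q_{t+1} = s_j \mid q_t = s_i), & \sum_{j=1}^{N} a_{ij} &= 1 \\
b_j(k) &= P(o_t = v_k \mid q_t = s_j), & \sum_{k=1}^{M} b_j(k) &= 1 \\
\pi_i &= P(q_1 = s_i), & \sum_{i=1}^{N} \pi_i &= 1
\end{align}
```

In this application N = 3 (Engaged, Struggling, Waiting), so A is a 3 x 3 matrix whose rows each sum to one.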
Mathematically, HMMs address two fundamental problems relevant to this work: the evaluation problem, solved using the Forward Algorithm [62], which quantifies the likelihood that the HMM generated a specific sequence of observations, and the decoding problem, solved using the Viterbi Algorithm [63], which determines the most probable sequence of hidden states that generated a given sequence of observations. When the hidden states of the training data are unknown, the model parameters can instead be learned with the Baum-Welch Algorithm [64]. In our application, the states of the training data are known, so we train the HMM by calculating the maximum likelihood estimate of the transition (A) and emission (B) probabilities for a sequence of distinct observations (M) with known states (N). Using the estimated transition and emission probabilities, we then use the Viterbi Algorithm to determine the most probable sequence of hidden states for the remainder of our observations. An ergodic state transition model was designed for our model, as we assumed that the collaboration state can change from one state to any of the other states. Figure 7 shows a possible diagram of the HMM.
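When the state sequence is known, the maximum likelihood estimate described above reduces to counting transitions and emissions and normalizing each row. The Python sketch below illustrates that idea under stated assumptions. It is not a reimplementation of any toolbox function. The function names, the 1-based integer coding of states and symbols, and the toy sequences are hypothetical.

```python
def normalize_rows(counts):
    """Turn each row of counts into probabilities; all-zero rows stay zero."""
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total if total else 0.0 for c in row])
    return rows


def estimate_hmm(states, symbols, n_states, n_symbols):
    """Estimate transition (A) and emission (B) matrices from labeled data.

    states and symbols are equal-length sequences of 1-based integers.
    """
    if len(states) != len(symbols):
        raise ValueError("states and symbols must have the same length")
    trans_counts = [[0] * n_states for _ in range(n_states)]
    emis_counts = [[0] * n_symbols for _ in range(n_states)]
    # Count each observed transition between consecutive states.
    for prev, nxt in zip(states, states[1:]):
        trans_counts[prev - 1][nxt - 1] += 1
    # Count each symbol emitted while in a given state.
    for state, symbol in zip(states, symbols):
        emis_counts[state - 1][symbol - 1] += 1
    return normalize_rows(trans_counts), normalize_rows(emis_counts)


# Toy sequences with 2 states and 2 symbols; not data from the study.
A, B = estimate_hmm([1, 1, 2, 2, 1], [1, 2, 2, 2, 1], n_states=2, n_symbols=2)
```

Because the estimate depends on consecutive pairs of states, shuffling the datapoints would change A, which is why the temporal order of the labeled data has to be preserved during training.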
The HMM training was done in MATLAB [65] using the Statistics and Machine Learning Toolbox [66]. The MATLAB function hmmestimate was used to generate estimated transition and emission matrices for the model by calculating the maximum likelihood estimate, and the MATLAB function hmmviterbi was used to predict the collaborative behaviors. We used the k-fold cross-validation method to improve the training outcome of the HMM. As illustrated in Figure 8, we used 70% of the hand-labeled data as the HMM training set, while the remaining 30% of the hand-labeled data was used as the hold-out test set to evaluate the selected HMM.
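When the hidden states are known, the maximum likelihood estimate reduces to normalized counting. The Python sketch below illustrates that idea under stated assumptions: observations and states are already encoded as integers starting at 0, and `estimate_hmm` is a hypothetical helper, not a reimplementation of hmmestimate.

```python
def estimate_hmm(observations, states, n_states, n_symbols):
    """Estimate A and B from one labeled sequence by normalized counting."""
    if len(observations) != len(states):
        raise ValueError("observations and states must have equal length")
    trans = [[0] * n_states for _ in range(n_states)]
    emit = [[0] * n_symbols for _ in range(n_states)]
    for t, (s, o) in enumerate(zip(states, observations)):
        emit[s][o] += 1                      # state s emitted symbol o
        if t + 1 < len(states):
            trans[s][states[t + 1]] += 1     # transition s -> next state

    def normalize(rows):
        # Divide each row by its total; rows with no counts stay all zero.
        out = []
        for row in rows:
            total = sum(row)
            out.append([c / total if total else 0.0 for c in row])
        return out

    return normalize(trans), normalize(emit)


# Tiny labeled sequence (illustrative values only).
obs = [0, 1, 1, 0]
labels = [0, 0, 1, 0]
A_hat, B_hat = estimate_hmm(obs, labels, n_states=2, n_symbols=2)
print(A_hat)
print(B_hat)
```

A state or symbol that never appears in the training portion yields an all-zero row or column, which is one reason every state and observation needs to be present in each training split.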
In the k-fold cross-validation, the imbalance in the labeled collaborative behaviors meant that we needed to split the data so that every possible observation and state was included in each training instance. Since the order of the observation datapoints matters when generating the transition and emission matrices, we could not use random selection or stratification of the datapoints. As such, we opted to treat the data as a continuous sequence, splitting it by percentage rather than by a fixed number of datapoints for each fold. We found that by splitting each fold into 80% for training and 20% for validation, we were able to generate sets with all observations and behaviors included in each split. The validation set was then shifted upwards in each instance of the k-fold until every datapoint from the training set had been used for validation, in the same sequence and without shuffling datapoints into random positions. This is illustrated in the top portion of Figure 8. As part of the splitting of the data, we labeled each datapoint as either training or validation to ensure that the same datapoint would not end up in both sets.
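One way to read the splitting procedure above is as a contiguous validation window that slides along the ordered training data. The Python sketch below follows that reading; the even spacing of window start positions, the helper names `sliding_folds` and `all_classes_present`, and the default fraction are assumptions for illustration, not details confirmed by the description.

```python
def sliding_folds(n, k, val_fraction=0.2):
    """Yield (train_idx, val_idx) with one contiguous, ordered validation window."""
    if k < 2 or not 0 < val_fraction < 1:
        raise ValueError("need k >= 2 and 0 < val_fraction < 1")
    window = max(1, round(n * val_fraction))
    if window >= n:
        raise ValueError("validation window must be smaller than the data")
    last_start = n - window
    for fold in range(k):
        # Spread the window start evenly from the first to the last position.
        start = round(fold * last_start / (k - 1))
        val = list(range(start, start + window))
        # Training indices keep their original order; nothing is shuffled.
        train = list(range(0, start)) + list(range(start + window, n))
        yield train, val


def all_classes_present(labels, indices, n_classes):
    """Check that every class 0..n_classes-1 occurs among the given indices."""
    return {labels[i] for i in indices} == set(range(n_classes))


for train, val in sliding_folds(n=100, k=5):
    print(val[0], val[-1], len(train))
```

Note that joining the two training segments around the window introduces one artificial transition at the seam. With k larger than 5 and a fixed 20% window, consecutive validation windows overlap under this reading.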
The function hmmestimate used the maximum likelihood estimate to generate the state transition and emission matrices based on the binary observation sequences and hand-labeled hidden states. The matrices were then optimized using the function hmmtrain, which applies the Baum-Welch Algorithm to refine the probabilities. With the optimized matrices, we validated the model on the remaining 20% of the datapoints using the function hmmviterbi. This function used the Viterbi Algorithm to predict the most likely collaborative behaviors based on the sequence of observations and the probability matrices. The predicted collaborative behaviors were compared to the actual hand-labeled behaviors to find the accuracy of each fold and the average accuracy for every k value from 5 through 10.
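The decoding and scoring step can be illustrated with a short Python sketch of the Viterbi recursion in log space, followed by a per-datapoint accuracy. This is a generic textbook version under the assumption of an explicit initial distribution `pi`; the function names and toy matrices are hypothetical and do not reproduce the hmmviterbi implementation.

```python
import math


def viterbi(obs, A, B, pi):
    """Return the most probable hidden state path for a non-empty sequence."""
    n = len(pi)

    def safe_log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[i] = best log probability of any path ending in state i.
    score = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + safe_log(A[i][j]))
            pointers.append(best_i)
            score.append(prev[best_i] + safe_log(A[best_i][j]) + safe_log(B[j][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def accuracy(predicted, actual):
    """Fraction of positions where the predicted and actual labels agree."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy two-state, two-symbol model (illustrative values, not from the study).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
predicted = viterbi([0, 0, 1, 1], A, B, pi)
print(predicted, accuracy(predicted, [0, 0, 1, 0]))
```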

Evaluating Prediction Models Performance
The evaluation of the prediction models was conducted using the remaining 30% of the hand-labeled data, which had been set aside as the hold-out test set. For the rule-based prediction model evaluation, the collaborative behaviors were predicted using the rules defined in Section 2.2.4. For the HMM prediction model evaluation, we selected the transition and emission matrices from the HMM with the highest accuracy in the k-fold cross-validation and used the MATLAB function hmmviterbi to generate the predicted collaborative behaviors from the observations in the hold-out test set. We then compared the predicted collaborative behaviors to the hand-labeled collaborative behaviors. The results are presented in the next section.
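The comparison between predicted and hand-labeled behaviors can be summarized with a confusion matrix and the usual accuracy, precision, and recall measures. The Python sketch below shows one way to compute them, with rows as real labels and columns as predictions. The helper names and the six example labels are hypothetical placeholders, not study data.

```python
def confusion_matrix(truth, pred, n_classes):
    """cm[r][c] counts datapoints with real label r predicted as c."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def summarize(cm):
    """Return overall accuracy plus per-class precision and recall."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[i][i] for i in range(n)) / total
    precision, recall = [], []
    for c in range(n):
        predicted_c = sum(cm[r][c] for r in range(n))  # column total
        actual_c = sum(cm[c])                          # row total
        precision.append(cm[c][c] / predicted_c if predicted_c else 0.0)
        recall.append(cm[c][c] / actual_c if actual_c else 0.0)
    return acc, precision, recall


# Placeholder labels: 0 = Engaged, 1 = Struggling, 2 = Waiting.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(truth, pred, n_classes=3)
acc, precision, recall = summarize(cm)
print(cm, acc, precision, recall)
```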

HMM Training and Validation Results
We trained the HMMs using 70% of the hand-labeled data with the k-fold cross-validation method to optimize the training output. As mentioned in Section 2.2.3, due to the imbalance in the data, we used an 80%-20% split for the training and validation of the HMM, respectively, to avoid missing observations and behaviors when training the HMM.
We present the accuracies for all folds with different k values as a boxplot in Figure 9. The boxplot shows the distribution of the accuracy for each k value, with an overlaid scatter plot representing the individual accuracy of each fold that generated an HMM. From the plot, we can see that the accuracies across all k values ranged from 90.35% to 97.77%, and the average accuracies across all folds were around 93%. The highest accuracy occurred when k = 9 and Fold = 6. We chose the optimized transition and emission matrices from this fold for evaluation against the hold-out test set.

Prediction Models Evaluation Results
The evaluation was performed for both the rule-based prediction model and the HMM prediction model using the data from the hold-out test set. In the rule-based prediction model, the collaborative behaviors were predicted by evaluating the observations using the rules defined in Section 2.2.4.
For the HMM prediction model, we chose the HMM that achieved the highest validation accuracy of 97.77%. The collaborative behaviors were predicted in MATLAB using hmmviterbi, which applies the Viterbi Algorithm to generate the most likely collaborative behaviors based on the sequence of observations.
In both cases, the predicted collaborative behaviors were compared against the hand-labeled collaborative behaviors. Table 5 compares the performance of the rule-based prediction model and the HMM prediction model, and Figure 10 illustrates the confusion matrices of both prediction models. Overall, the HMM provided higher accuracy, precision, and recall for the participants' collaborative behavior than the rule-based model. Looking at the individual behaviors in Figure 10, both models performed best for the Engaged behavior, since the conditions for the Engaged behavior were simple enough for both models to provide a reliable prediction. However, for the Waiting and Struggling behaviors, the rule-based model performed poorly, predicting most of the Struggling behavior as Waiting. The inflexibility of the rule-based model could have caused this. Rule-based models only allow one behavior for one set of conditions, whereas the hand-labeled data contain instances where the same condition produced different outcomes depending on the context of the task (or the previous sequence of events). In such cases, if we kept the rule-based model to predict participants' collaborative behavior, the feedback that participants received would not reflect their actual behavior. A participant who is Struggling would not be prompted to seek assistance, as the system would assume they are Waiting for their partner to complete a turn. On the other hand, the HMM prediction results for Waiting and Struggling were reliable, since the temporal information learned during training was embedded within the state transition and emission probability matrices. This is consistent with the results reported by another study that implemented a semi-supervised model using the same dataset as this study [67]. That study compared the performance of its semi-supervised automated labeling of behaviors to supervised and unsupervised models. In that study, a fully supervised support vector machine (SVM) achieved 86.1% accuracy, while a semi-supervised self-training model with 2.5% of the data labelled achieved 84.5% accuracy. Based on this observation, the probabilistic nature of HMMs would fit better in a dynamic interaction, as it offers more flexibility than a rule-based model and traditional machine learning algorithms, without requiring a large amount of labeled data for training.
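The point about context can be illustrated with a toy Python sketch in which the same ambiguous observation is decoded differently depending on its neighbors, while a fixed rule always returns one answer. Everything here is hypothetical: the two-state model, the three symbols, the probabilities, and the `RULES` table are invented for illustration and are not the rules or matrices from this work.

```python
def viterbi(obs, A, B, pi):
    """Compact Viterbi that carries the best path per end state.

    Uses raw probabilities, which is fine for short toy sequences;
    long sequences should use log probabilities instead.
    """
    n = len(pi)
    paths = [([i], pi[i] * B[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        paths = [
            max(
                ((p + [j], s * A[p[-1]][j] * B[j][o]) for p, s in paths),
                key=lambda item: item[1],
            )
            for j in range(n)
        ]
    return max(paths, key=lambda item: item[1])[0]


WAITING, STRUGGLING = 0, 1
A = [[0.9, 0.1], [0.1, 0.9]]      # behaviors tend to persist over time
B = [[0.50, 0.45, 0.05],          # Waiting: symbol 1 typical, symbol 0 ambiguous
     [0.50, 0.05, 0.45]]          # Struggling: symbol 2 typical, symbol 0 ambiguous
pi = [0.5, 0.5]

# A fixed rule must commit the ambiguous symbol 0 to a single behavior.
RULES = {0: WAITING, 1: WAITING, 2: STRUGGLING}


def rule_based(obs):
    return [RULES[o] for o in obs]


seq_a = [1, 0, 1]   # ambiguous symbol surrounded by Waiting-like evidence
seq_b = [2, 0, 2]   # ambiguous symbol surrounded by Struggling-like evidence
print(viterbi(seq_a, A, B, pi), rule_based(seq_a))
print(viterbi(seq_b, A, B, pi), rule_based(seq_b))
```

In this toy setting the middle observation is identical in both sequences, yet the HMM assigns it to Waiting in the first and Struggling in the second, whereas the rule table labels it Waiting both times.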

Conclusions and Future Work
HCI technologies have become integral in skills learning, offering engaging interactions and replicable solutions that enhance learning experiences. These HCI-based systems often incorporate real-time feedback based on user performance to boost user engagement and learning outcomes. By including human behavior in addition to user performance in the feedback mechanism, we can further increase skills learning and engagement. Manual labeling or annotation of data by experts is often used for offline analysis; however, evaluating human behavior using this method can be resource-intensive, hindering real-time feedback capability. Thus, the ability to accurately and autonomously recognize human behavior using computational classification methods is crucial for effective feedback mechanisms.
Among various machine learning methods used to predict human behavior [68], probabilistic models, such as Hidden Markov Models (HMMs), have been commonly used to predict or extract behavior patterns [34,36-38]. Leveraging probabilistic classification methods, HMMs analyze temporal patterns, making them suitable for predicting human behavior in collaborative interactions [33]. Our work contributes to this field by developing an HMM-based model tailored to recognize collaborative behaviors in dyadic interactions using multimodal signals. We additionally designed a rule-based method of behavior prediction for a baseline comparison.
Building on our previous work, which developed a teamwork training simulator in a CVE [11] and showed acceptability in dyadic interactions between autistic and neurotypical participants, we designed prediction models that could be used to recognize collaborative behaviors in dyadic teamwork interactions. In future work, we want to explore the impact of using the predicted collaborative behaviors as an input for a real-time feedback mechanism and how that could improve dyadic interactions, specifically cross-neurotype collaboration.
The results of the preliminary study indicated that our HMM prediction model was able to recognize collaborative behaviors with 90.59% accuracy, outperforming the rule-based model. While both models excelled in predicting Engaged behavior, the HMM demonstrated greater flexibility, particularly in predicting the Waiting and Struggling behaviors, due to its ability to leverage temporal information learned during training. This underscores the advantage of the HMM, which did not require extensive labeled data for training.
Additionally, our creation of a novel dataset, comprising multimodal signals from dyadic interactions labeled with collaborative behaviors by expert annotators, opens avenues for further research and experimentation in this domain. By integrating robust behavioral prediction mechanisms into computer-based teamwork training, we anticipate a significant enhancement in collaborative skills development, ultimately advancing the efficacy of virtual training environments.
Although the results are promising, it is important to acknowledge limitations in the HMM design and suggest key improvements for future studies. First, extracting more complex features from the multimodal data would allow researchers to better understand and observe a wider range of human behavior related to collaborative interactions. For example, adding a dialogue act classification [69] for the speech feature would better indicate whether a participant spoke because they needed help or because they were sharing information, which correspond to different behaviors. Second, the number of human behaviors used to capture participants' collaborative behaviors was limited. Expanding the range of behaviors, particularly by dividing Waiting into more distinguishable behaviors, would allow researchers to better understand what is taking place in the collaboration. Third, the imbalance in the distribution of the collaborative behaviors introduced challenges in training the HMM. Using k-fold cross-validation was a preliminary method for generating an HMM capable of predicting collaborative behaviors. Future work would benefit from exploring other methods, which could include supervised or semi-supervised learning methods. Despite these limitations, results from the evaluation highlight the advantages of HMMs over rule-based prediction models in a dyadic collaborative interaction between autistic and neurotypical individuals, even with a small labeled dataset. Future work can continue to bridge the gap in effective teamwork training, ensuring a more inclusive and supportive learning experience.

Figure 1. Collaborative tasks to support collaborative interaction between autistic individuals and neurotypical partners.

Figure 2. System setup where two participants in separate rooms perform virtual collaborative tasks together.

Figure 3. Workflow for the design and evaluation of the prediction models.

Figure 4. Feature extraction from multimodal signals coming from three peripheral devices.

Figure 5. Example of virtual objects and focus areas defined for the eye tracker.

Figure 6. Flowchart for the rule-based prediction model.

Figure 7. HMM diagram of all elements.

Figure 10. Confusion matrix for: (a) HMM; (b) rule-based model. For both confusion matrices, the column on the left lists the real labels of the collaborative behaviors, while the labels at the top represent the predicted behaviors. The boxes highlighted in yellow indicate the highest number of predictions for each of the behaviors. The boxes highlighted in green represent a positive performance measure of recall, precision, or accuracy, and the boxes highlighted in orange represent a negative performance measure of recall, precision, or accuracy.

Table 3. Definition of Participant's Collaborative Behavior.

Table 5. Evaluation results for rule-based and HMM prediction models.