Incorporating Interpersonal Synchronization Features for Automatic Emotion Recognition from Visual and Audio Data during Communication

During social interaction, humans recognize others' emotions via both individual features and interpersonal features. However, most previous automatic emotion recognition techniques have used only individual features and have not tested the importance of interpersonal features. In the present study, we asked whether interpersonal features, especially time-lagged synchronization features, benefit the performance of automatic emotion recognition. We explored this question in a main experiment (speaker-dependent emotion recognition) and a supplementary experiment (speaker-independent emotion recognition) by building an individual framework and an interpersonal framework for the visual, audio, and cross-modality settings, respectively. Our main experiment showed that the interpersonal framework outperformed the individual framework in every modality. Our supplementary experiment showed that the interpersonal framework led to better performance even for unknown communication pairs. Therefore, we conclude that interpersonal features are useful for boosting the performance of automatic emotion recognition, and we hope this study raises attention to interpersonal features.


Introduction
During communication, emotion recognition skills help us understand the attitudes, feelings, and intentions of our partner, and therefore guide our behavior to make the communication successful. However, emotion recognition ability differs from person to person, and we sometimes fail to recognize the emotion of an interlocutor. Such mistakes can lead to mutual misunderstandings, impeded communication, and deteriorating relationships [1]. To avoid such failures and improve communication, one solution is to use the power of machine learning.
Thanks to significant developments in the field of machine learning, many software programs that automatically recognize human emotion have recently become available [2][3][4][5]. Although methods for automatic emotion recognition continue to emerge, their performance is still unsatisfactory [6,7]. Therefore, we propose a possible method to achieve better performance.
As illustrated in Figure 1, humans recognize others' emotions through both individual features and interpersonal features. Studies [8][9][10] have shown that individual features such as facial expression [11,12], gesture [13,14], and tone of voice [15] help us to recognize others' emotions. For example, if a man clenches his fist, it may mean he is angry; if he frowns, it may mean sorrow. Studies have also shown that interpersonal features such as mutual gaze [16,17], body synchronization [18], and the synchronization of speech [19] help us to recognize others' emotions. The interpersonal features used in this study are defined as the interpersonal interaction activities (verbal or nonverbal) that happen consciously or unconsciously during communication. They matter for the emotion recognition task because first- and third-person emotion recognition is influenced by these features [20]. For example, during an interaction, mutual gaze and touching your partner are associated with greater positive emotion [21], and if the partner synchronizes with your action, positive emotion increases [18]. Furthermore, interpersonal features sometimes play a crucial role in recognizing emotion. For example, when one interlocutor is not very expressive, it is hard to recognize his or her emotion from individual features alone; however, the synchronization of body movement with the interlocutor may help humans recognize the emotion (e.g., if the synchronization is high, the probability of positive emotion is high; see [22] for a review). However, to the best of our knowledge, most current automatic emotion recognition technologies either use only individual features or simply combine individual features to capture interpersonal ones (see Related Work below), overlooking the importance of synchronization features.
Therefore, we aim to explore the following question in this study: are interpersonal features, especially time-lagged synchronization features, beneficial for automatic emotion recognition tasks? Here, time-lagged synchronization includes both concurrent (i.e., zero-lag) interpersonal features, such as mutual gaze and mirroring of facial expressions, and action-reaction (i.e., lagged) interpersonal features, such as utterance-and-response or smile-to-smile exchanges.
We addressed this question using the K-EmoCon dataset [23], a publicly available multimodal dataset of naturalistic conversations with continuous emotion annotations by the participants themselves as well as by external raters. Using visual, audio, and audio-visual cross-modal features, respectively, we built two types of emotion recognition models: an individual model and an interpersonal model. The individual model serves as a control condition, using only individual features. The interpersonal model serves as an experimental condition, using both individual and interpersonal features. We compared the performance of the models to judge whether interpersonal features are beneficial. Based on the findings on the importance of interpersonal features, we hypothesized that the interpersonal models would outperform the individual models with either unimodal or cross-modal features.

Related Work
Emotion recognition is a challenging task due to the difficulty of discrimination [24] and the diversity of expression modalities [25]. To address the abstract nature of emotion, researchers have tried to use different features to discriminate between emotions; however, most of these are individual features.
A common feature in the visual modality is facial expression [26][27][28][29][30][31][32]. Given a raw image, researchers first use face detection methods [33][34][35] to locate the face. They then crop the face and extract facial expression features. Finally, they feed these features into a classifier [36,37] to obtain the emotion. Popular methods include DTAGN [38], FN2EN [39], and LPQ-SLPM-NN [32]. In addition to facial expression, gestures are also a common feature [40][41][42][43]: researchers first use pose estimation methods [44][45][46] to obtain the human pose, then feed the pose into a classifier to obtain the emotion. Pupil size [47] and gaze [48] are also important features for recognizing emotion.
For the audio modality, speech features [49][50][51][52][53] include qualitative features, such as voice quality, harshness, tenseness, and breathiness; continuous features, such as energy, pitch, formants, zero-crossing rate (ZCR), and speech rate; spectral features, such as Mel-frequency cepstral coefficients (MFCC), linear predictor coefficients (LPC), perceptual linear prediction (PLP), and linear predictive cepstral coefficients (LPCC); and Teager energy operator (TEO)-based features, such as TEO-decomposed frequency modulation variation (TEO-FM-Var), normalized TEO autocorrelation envelope area (TEO-Auto-Env), and critical-band-based TEO autocorrelation envelope (TEO-CB-Auto-Env). As in the visual modality, given the raw speech signal, researchers first extract features such as the above and then feed them into a classifier. Unlike these individual-feature methods, Lin [54], Lee [55], and Yeh [56] used interpersonal features in the audio modality to boost the performance of automatic emotion recognition. However, they did not explore whether synchronization is beneficial, which is the main target of this study.
Although researchers have spent decades on emotion recognition tasks using unimodal features, the performance is still not satisfactory. To achieve a better performance, researchers proposed to fuse visual and audio modalities [57,58]. To further improve the performance, others tried to fuse not only the audio and visual modality but also the context modality [59,60]. This fusing strategy improved the performance of the emotion recognition task further, because multimodality can give mutually supplementary information that is missed in the unimodal approaches.
Despite all these efforts, we believe that there is still room for improvement. We were motivated by psychological studies indicating that humans also use interpersonal features to recognize others' emotions [16][17][18][19]. To the best of our knowledge, although previous automatic emotion recognition research paid great attention to individual features, most studies did not pay attention to interpersonal features, especially time-lagged synchronization. Therefore, we constructed an interpersonal model in the present study to explore whether interpersonal features are beneficial for emotion recognition.

Methods
The present study has two aims. First, we aimed to establish the usefulness of interpersonal features for an emotion recognition task. To achieve this, we constructed two models for comparison: an individual model using only individual features, and an interpersonal model using both individual and interpersonal features. The only structural difference between the two models is that the interpersonal model includes the synchronization model (the red block in Figure 2). Second, we aimed to show the power of interpersonal features in multiple modalities. Therefore, we built models that use visual, audio, and audio-visual cross-modality features, respectively. Figure 2 shows the general framework of our individual and interpersonal models using visual (Figure 2a), audio (Figure 2b), and cross-modality (Figure 2c) features. We note that we detected the emotions of both individuals (persons A and B) in dyadic communication using all three modality settings. However, to explain our methods concisely, we use the scenario of predicting person A's emotion as an example.

K-EmoCon Dataset
To compare the individual model with the interpersonal model in different modalities, and to perform our experiments with minimum human intervention, we decided to use the K-EmoCon dataset [23] to test the usefulness of interpersonal features because, to our best knowledge, the K-EmoCon is the only dyadic dataset in which the subjects show spontaneous emotions during naturalistic conversations.
Other datasets are not suitable for our experiments due to posed or induced emotions and limited situations. For example, IEMOCAP [61] is a popular dataset for emotion recognition; however, it contains emotions induced and posed by actors. As we aim to recognize natural (spontaneous) emotions during dialogue communication, induced and especially posed emotions violate our purpose. Unlike IEMOCAP, the content of K-EmoCon is natural debate between individuals without professional acting training, which makes it closer to an in-the-wild setting. Figure 3 shows the scenario and a sample image from the K-EmoCon dataset. The original K-EmoCon dataset includes 32 participants; however, complete audiovisual recordings are available for 16 participants (person IDs: 3, 4, 7, 8, 9, 10, 19, 20, 21, 22, 23, 24, 25, 26, 29, and 30), paired into eight sessions. The original K-EmoCon dataset contains emotion annotations by the subjects themselves, by the partner, and by external raters. Since our purpose in this study was to test the utility of interpersonal features in recognizing subjectively experienced emotions rather than emotions observed or inferred by others, we used the self-reported annotations as labels. Although the K-EmoCon dataset also contains the labels "cheerful", "happy", "angry", "nervous", and "sad", their values are heavily imbalanced (see Figure 3 in [23]) compared to the more normally distributed arousal and valence. Furthermore, arousal and valence are the two affective dimensions of the well-known circumplex model of emotion by James Russell [62], which can cover more subtle changes in emotion. Thus, we used the arousal and valence labels for our emotion recognition task.
Specifically, we used the self-reported arousal and valence ratings, given on a five-level scale (from 1: very low to 5: very high) for every 5 s, as emotion labels. Therefore, for the recognition of each 5 s segment of emotional state, the original size of the individual video data was [5 × 30, 3, 112, 112] (150 frames at 30 fps, RGB channels, and 112 × 112 pixels), and the original size of the individual speech data ([T A, C A, F A]) was [5, 2, 22050]. When extracting MFCC features from the audio data, we framed the audio into the same temporal size as the visual data; that is, the temporal dimension of the audio data after MFCC extraction was 150, equal to T V × N V (the visual temporal dimension). We formalized emotion recognition as a classification task, similar to [63], because the annotated emotion labels in K-EmoCon are limited to a five-level scale rather than continuous values in an interval. Moreover, the labels change stepwise every 5 s instead of continuously frame by frame, which makes the task qualitative rather than quantitative.
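The temporal alignment between the two modalities can be verified with a little arithmetic. The following sketch assumes an MFCC hop length chosen to reproduce the 150 frames stated above; the hop value is our inference, not a number given in the paper.

```python
# Illustrative check of the segment shapes described above:
# a 5 s segment at 30 fps gives 150 video frames, and the MFCC hop
# length is (by assumption) chosen so the audio yields the same count.
seconds, fps = 5, 30
n_video_frames = seconds * fps            # 150 frames per segment

sample_rate = 22050
n_samples = seconds * sample_rate         # 110250 samples per channel
hop_length = n_samples // n_video_frames  # 735 samples per MFCC frame
n_audio_frames = n_samples // hop_length

print(n_video_frames, n_audio_frames)     # 150 150
```

With this choice, every MFCC frame lines up with exactly one video frame, which is what allows the frame-wise cross-modal comparisons later in the paper.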

Individual Model
Let us begin with the individual model for the visual modality. In general, our individual model includes three stages (Figure 2a).

• The first stage is to feed the individual video clips (I A Video or I B Video) into the backbone to extract spatial information and obtain individual features;
• The second stage is to feed the individual features into the Temporal Net to extract temporal information;
• The final stage is to feed the output from the Temporal Net into a fully connected layer to predict the value of arousal or valence.

Now, we explain the details of each component.
The backbone (Figure 4) for the visual modality includes a convolutional neural network (CNN) [64] and a transpose CNN [65], in a structure similar to ResNet [66]. A CNN was used to extract local information first. A transpose CNN was used to extract further information and reshape the output so that its size equals that of the input. To obtain global information, max-pooling was used to down-sample and summarize the local information. In the backbone, the CNN plus transpose CNN combination was used a total of four times. The first three were used in the ResNet structure (purple line in Figure 4) to deepen our network, because the mapping from input features to emotional states requires heavy nonlinear transformation. The fourth differs slightly: the CNN is not connected to the transpose CNN directly; instead, max-pooling is inserted between them to reduce computational complexity. The Temporal Net (Figure 5) is a structure similar to a temporal CNN [67]; dilated CNNs and ResNet blocks were used to extract temporal information. As the backbone for the visual modality is deep enough, only one layer was used in the Temporal Net to prevent overfitting.
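The key ingredient of a temporal CNN is the dilated causal convolution, which lets a shallow layer see a wide temporal context. The following is a minimal numpy sketch of that building block for intuition; it is not the authors' implementation, and the kernel sharing across features is a simplification.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=2):
    """Causal 1D convolution with dilation, as used in temporal CNNs.

    x: [T, F] feature sequence; kernel: [K] weights shared across F.
    Each output step t only sees inputs at t, t - dilation, t - 2*dilation, ...
    """
    T, F = x.shape
    K = len(kernel)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, F)), x], axis=0)  # causal left-pad
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += kernel[k] * xp[pad + t - k * dilation]
    return out

x = np.ones((150, 4))                    # e.g., 150 time steps, 4 features
y = dilated_conv1d(x, np.array([0.5, 0.5]))
print(y.shape)                           # (150, 4)
```

The causal padding preserves the temporal length [T, F] → [T, F], mirroring how the Temporal Net keeps the segment's 150-step structure intact.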

Interpersonal Model
Next, we use the task of predicting person A's emotion as an example to explain the process of obtaining the interpersonal feature Y I Video. (When predicting person B's emotion, the process is symmetrical.) In general, the interpersonal feature Y I Video was obtained by feeding the respective individual features (X A Video and X B Video) into the synchronization model M S, as shown in Figure 6. Specifically, we obtained X A Video by feeding person A's video clips into the backbone model M A Video, as shown in Equation (1). As for person B, we obtained X B Video with a different backbone model M B Video, as shown in Equation (2). Then, the pair of individual features (X A Video and X B Video) was fed into the synchronization model M S (Equation (3)) to obtain the interpersonal features Y I Video. Finally, the interpersonal features were combined with the individual features and fed into the fully connected layer to predict the emotion value.
We note that M Video processes only the spatial dimension, which means it conserves the temporal order of the video clips. For example, the size of the original input I Video is [T, C, W, H], where T represents the time length of the clip, C the RGB channels, W the frame width, and H the frame height. After processing by M Video, the size of X Video is [T, F], where T remains the same and F is the length of the individual feature vector. The specific values used in the experiment are given in Section 3.1. The synchronization model consists of two parts. The first part computes time-lagged synchronization, similar to the time-lagged detrended cross-correlation analysis (DCCA) cross-correlation coefficient [68]. The second part uses a 1D CNN to extract further information.
The detailed algorithm is shown in Algorithm 1. When computing the time-lagged synchronization Y S Video, the individual features (X A Video and X B Video) were first divided into several temporal blocks (R and R′); the number of blocks is n_block. Then, the cosine similarity was computed between R and R′, as shown in Equation (4). During this computation, the decay weight β was used to emphasize the time-lagged feature. We designed α = 1 − (i_τ / n_block)β because, as i_τ increases, α decreases, which resembles human memory forgetting information over time. Finally, the mean of the outputs (Out block) was calculated.
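The block-wise, lag-weighted similarity described above can be sketched as follows. This is our reading of Algorithm 1 under the decay rule α = 1 − (i_τ / n_block)β, not the authors' exact code; block sizes and the averaging order are assumptions.

```python
import numpy as np

def time_lagged_sync(xa, xb, n_block=5, beta=0.5):
    """Block-wise time-lagged synchronization between two [T, F] features.

    For each lag i_tau, blocks R of person A are compared (cosine
    similarity) with blocks R' of person B shifted by i_tau blocks,
    weighted by alpha = 1 - (i_tau / n_block) * beta, then averaged.
    """
    ra = np.array_split(xa, n_block)          # temporal blocks R
    rb = np.array_split(xb, n_block)          # temporal blocks R'
    outs = []
    for i_tau in range(n_block):
        alpha = 1 - (i_tau / n_block) * beta  # larger lag -> smaller weight
        sims = []
        for i in range(n_block - i_tau):
            a, b = ra[i].ravel(), rb[i + i_tau].ravel()
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            sims.append(alpha * cos)
        outs.append(np.mean(sims))            # mean over valid block pairs
    return np.array(outs)                     # one value per lag

xa = np.random.default_rng(0).normal(size=(150, 8))
sync = time_lagged_sync(xa, xa)               # self-sync: lag 0 is maximal
print(sync.shape)                             # (5,)
```

Comparing a sequence with itself gives a lag-0 synchronization near 1, which is a convenient sanity check for the weighting.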
To further extract the information between each Out block , the 1D CNN was used to obtain interpersonal features Y I Video .

Preprocess
The mixture of speech from two speakers in a single audio file poses a challenge for the audio modality. To overcome this, several preprocessing steps were performed to obtain individual speech features. Given the obtained speech features, we shifted them by different time lengths, because we aimed to capture the action-reaction relationship between the speaker and the interlocutor. We again use predicting person A's emotion as an example. (When predicting person B's emotion, the process is symmetrical.) Specifically, we first manually segmented the raw audio data (I Audio) to obtain each speaker's data (I A Audio and I B Audio), as shown in Equation (5).
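The time-shifting step can be sketched in a few lines. The zero-padding scheme below is an assumption for illustration; the paper does not specify how shifted boundaries are handled.

```python
import numpy as np

def shift_features(x, lag):
    """Shift a [T, F] feature sequence forward by `lag` steps, zero-padding.

    A positive lag delays the interlocutor's features so the model can
    relate a speaker's action at time t to a reaction at time t + lag.
    """
    if lag == 0:
        return x.copy()
    out = np.zeros_like(x)
    out[lag:] = x[:-lag]       # content moves later in time
    return out

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 time steps, 2 features
shifted = shift_features(x, 2)
print(shifted[2].tolist())                     # [0.0, 1.0] (old first row)
```

Computing similarities against several such shifted copies is what produces the family of lagged vectors X B τ i Audio used below.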

Individual Model
Similar to the visual modality, we built both an individual model and an interpersonal model (Figure 2b). The individual model resembles that of the visual modality: rough information is extracted by the backbone, and temporal information is then further extracted by the Temporal Net (Figure 5). The backbone here is a CNN plus Transformer [69], as shown in Figure 7. Specifically, we used a kernel size of (3, 1) for the first layer and (1, 1) for the second layer; the different kernel sizes help the model extract local information at different scales. After the two-layer CNN, max-pooling was used to summarize the local information, and the output was fed into the Transformer to extract further information. The Temporal Net is the same as in the visual modality.

Interpersonal Model

For the interpersonal model (Figure 2b), the cosine similarity between the speaker feature vector X A Audio and different time-lagged interlocutor vectors X B τ i Audio was computed to obtain similarity vectors Y S τ i Audio, as shown in Equation (6).
These similarity vectors were combined into Y S Audio with decay weights α, as shown in Equation (8). Y S Audio was fed into the CNN to obtain our target interpersonal features Y I Audio. The decay weight α was calculated from the decay parameter β as in Equation (7): as τ i increases, α i decreases. We also used 1 − β to represent a priori knowledge, which ensures that at least a 1 − β share of each Y S τ i Audio contributes.
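Since Equation (7) is not reproduced here, the following sketch shows one plausible form of the lag-dependent weight consistent with the description: α decreases with the lag and never falls below 1 − β, so every lagged similarity contributes at least a 1 − β share. The exact functional form is our assumption.

```python
import numpy as np

def decay_weights(lags, beta=0.5):
    """Hypothetical decay weight: linear in lag, floored at 1 - beta."""
    lags = np.asarray(lags, dtype=float)
    return 1.0 - beta * lags / lags.max()

lags = np.array([0, 1, 2, 3, 4])
alpha = decay_weights(lags, beta=0.5)
print(alpha.tolist())        # [1.0, 0.875, 0.75, 0.625, 0.5]

# Combine lagged similarity vectors Y^{S, tau_i} with the weights:
Y = np.random.default_rng(1).normal(size=(5, 150))   # 5 lags x 150 steps
Y_S = (alpha[:, None] * Y).sum(axis=0)               # weighted combination
print(Y_S.shape)                                     # (150,)
```

Whatever the exact form of Equation (7), the qualitative behavior is the same: recent reactions are weighted more heavily than distant ones.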
Finally, the interpersonal features were combined with extracted individual features and fed into a fully connected layer to obtain the emotion value of the target speaker.

Individual Model
The cross-modality is similar to visual and audio modality, including individual models and interpersonal models (Figure 2c). For the individual model, individual features were extracted from the structure of visual and audio modality as in Sections 3.2 and 3.3. Then the two modality features were combined and fed into a fully connected layer to predict emotion value.

Interpersonal Model
For the interpersonal model, we incorporated both the audio-visual interpersonal feature Y AV and visual-audio interpersonal feature Y VA . We used the prediction of Person A's emotion as an example.
The audio-visual interpersonal features Y S τ i AV were obtained by computing the cosine similarity between the interlocutor's visual features X B Video and the time-lagged speaker audio features X A τ i Audio, as shown in Equation (9). Then, the Y S τ i AV were combined with decay weights α to obtain the audio-visual interpersonal feature Y S AV, as shown in Equation (10).
The visual-audio interpersonal features Y S τ i VA were obtained by computing the cosine similarity between the speaker's visual features X A Video and the time-lagged interlocutor audio features X B τ i Audio, as shown in Equation (11). Then, the Y S τ i VA were combined with decay weights α to obtain the visual-audio interpersonal feature Y S VA, as shown in Equation (12).
Finally, these two interpersonal features (Y S AV and Y S VA) were combined with the individual features (X A Video and X A Audio) and fed into the final layer to predict the emotion value. We note that computing the cosine similarity requires the data to share the same shape in both the temporal and feature dimensions. For example, the size of a visual individual feature of the speaker X A Video is [T V, F V], and the size of an audio individual feature X B Audio is [T A, F A]. The temporal dimensions of the individual features are the same (T V = T A) by construction of the neural networks we built. The feature dimensions of the individual features of the different modalities were reshaped to the same size F with interpolation, as in Equation (13). Thus, the sizes of the features (X A Video and X B Audio) satisfied the required conditions.

Table 1 shows the recommended hyperparameters used in the experiment. Specifically, we used stochastic gradient descent (SGD) with momentum as the optimizer. For the learning schedule, we used cosine annealing (maximum learning rate 0.01, minimum learning rate 0.00001). β is the decay parameter explained in Section 3. To avoid overfitting, we also used a trick called the flood level [70], denoted by b. All models were trained from scratch. Both the main experiment and the supplementary experiment were conducted on a laptop with an Intel Core i7-9750H CPU at 2.60 GHz, 16 GB RAM, and an NVIDIA GeForce RTX 2070 with Max-Q Design, running Ubuntu.
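The shape-matching step can be sketched as follows. Here `np.interp` stands in for whatever interpolation Equation (13) specifies; the feature sizes are illustrative placeholders.

```python
import numpy as np

def match_feature_dim(x, F):
    """Interpolate a [T, F_old] feature to [T, F] along the feature axis."""
    T, F_old = x.shape
    old = np.linspace(0.0, 1.0, F_old)
    new = np.linspace(0.0, 1.0, F)
    return np.stack([np.interp(new, old, row) for row in x])

xv = np.random.default_rng(2).normal(size=(150, 64))  # visual [T_V, F_V]
xa = np.random.default_rng(3).normal(size=(150, 32))  # audio  [T_A, F_A]
xa_resized = match_feature_dim(xa, 64)                # now [150, 64]

# Frame-wise cosine similarity across modalities
num = (xv * xa_resized).sum(axis=1)
den = np.linalg.norm(xv, axis=1) * np.linalg.norm(xa_resized, axis=1)
sim = num / (den + 1e-8)
print(xa_resized.shape, sim.shape)                    # (150, 64) (150,)
```

Once both modalities share the shape [T, F], the cross-modal similarity reduces to the same frame-wise cosine computation used within each modality.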

Setup
In the main experiment, we randomly split the dataset (985 segments) into training and test sets, stratified by emotion value. The training set comprises 70% of the data, and the number of emotion values is five. The main experiment is formulated as a speaker-dependent task, similar to [71,72].
During usual communication, the emotion value commonly stays in a moderate range and rarely reaches extreme states, which leads to an imbalanced data problem in our experiment. Several methods address imbalanced distributions, such as data re-sampling approaches [73][74][75] and class-balanced losses [76][77][78]. The literature shows that resampling can improve accuracy on class-imbalanced datasets [79]. Therefore, we used data re-sampling to obtain a total of 1000 segments (200 segments × 5 scale values) as training data, a size similar to the original dataset. For example, if the number of first-class samples in the total dataset is D Total C1 = 100, then the number in the training dataset is D Train C1 = 70 and in the test dataset D Test C1 = 30. After random re-sampling, the number of first-class samples in the training dataset becomes 200.
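The balancing step above amounts to random oversampling with replacement. The sketch below mirrors the 200-per-class target; the class counts are made-up numbers, not the dataset's actual distribution.

```python
import numpy as np

def oversample_to(labels, target=200, seed=0):
    """Draw `target` indices per class with replacement (random oversampling)."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        idx.append(rng.choice(members, size=target, replace=True))
    return np.concatenate(idx)

# Hypothetical imbalanced training labels on the five-level scale
labels = np.array([1] * 70 + [2] * 150 + [3] * 250 + [4] * 150 + [5] * 80)
train_idx = oversample_to(labels, target=200)
print(len(train_idx))   # 1000 = 200 segments x 5 classes
```

Sampling with replacement is what allows a minority class (e.g., 70 segments) to reach the 200-segment target.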
To compare the interpersonal model with the individual model, we used accuracy as an evaluation metric, as in most studies. However, due to the imbalance of the test dataset, accuracy alone cannot always serve as a reliable basis for comparing performance. Therefore, we also used the macro-f1-score and unweighted average recall (UAR) as additional evaluation metrics.
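Both metrics average per-class scores with equal weight, so a classifier that mostly predicts the majority class is penalized. A plain-numpy sketch of their definitions, with a toy majority-biased example:

```python
import numpy as np

def macro_f1_and_uar(y_true, y_pred, n_classes):
    """Macro-F1 and unweighted average recall (UAR) from scratch."""
    f1s, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        recalls.append(rec)
    return np.mean(f1s), np.mean(recalls)

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1])   # majority-biased classifier
f1, uar = macro_f1_and_uar(y_true, y_pred, 2)
print(round(f1, 3), round(uar, 3))       # 0.778 0.75
```

Note that this classifier scores 5/6 ≈ 0.83 in accuracy, while its UAR of 0.75 exposes the missed minority class.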

Baseline
Although our main purpose is to show the benefit of including interpersonal features, we hoped to evaluate our proposed models comprehensively. However, as K-EmoCon is a new dataset, we could not find suitable methods for direct comparison. Therefore, we re-implemented a popular method called Hierarchical Fusion (HFusion), proposed by Majumder et al. [59]. We compared our individual models with HFusion in the visual, audio, and cross (visual-audio) modalities, respectively. To make the comparison fair, the setup for HFusion was the same as for our models. Table 2 shows the comparison between our individual models and HFusion. Except for audio valence accuracy, all results of the individual model were better than those of HFusion. Moreover, the better f1-score and recall of the individual model show that HFusion's higher audio valence accuracy arose because it classified most samples into the majority class; the individual model was thus generally better even for predicting valence from audio. Therefore, we conclude that our proposed models are effective on the K-EmoCon dataset.

Result

Table 3 shows the test performance of the individual and interpersonal models using the visual modality to predict arousal and valence. The interpersonal model was better than the individual model for all target variables and performance metrics. More specifically, the superiority of the interpersonal model was not restricted to one emotion dimension, as its performance was better for both arousal and valence; nor was it restricted to one evaluation metric, as it was better in terms of accuracy, f1-score, and recall. Therefore, we conclude that interpersonal features are beneficial for emotion recognition in the visual modality.
Table 4 shows the performance of the individual and interpersonal models using audio features. As the audio modality included many silent segments, which provide no useful information for the recognition task, the overall performance with audio was lower than with the visual modality. However, all performances of the interpersonal model were higher than those of the individual model, regardless of evaluation metric and emotion dimension. Therefore, we conclude that interpersonal features are beneficial for emotion recognition in the audio modality. Table 5 shows the performance of the individual and interpersonal models using audio-visual cross-modality. The results again show that the interpersonal model performed better. However, some cross-modality results were lower than those of the visual or audio modality, which may seem counterintuitive. We see two possible reasons: one is overfitting, because the training accuracy and f1-score for cross-modality were higher than for the other modalities; the other is the flaw in the audio data, which included too much silence.

Discussion
To statistically test whether the interpersonal model significantly outperformed the individual model, we used a two-tailed Wilcoxon signed-rank test, as in [80]. As shown in Figure 8, we pooled the accuracy and f1-score values across all modalities and emotion dimensions and compared them between the interpersonal and individual models. The p-values for accuracy and f1-score were less than 0.001, and the p-value for recall was less than 0.01. Thus, we conclude that the outperformance of the interpersonal model is significant. Taken together, we found that interpersonal features are beneficial for automatic emotion recognition regardless of modality, emotion dimension, and evaluation metric. However, in the main experiment, the same individuals contributed to both the training and test data (speaker-dependent task), so we do not know whether interpersonal features are beneficial for new, unknown speakers (speaker-independent task). Therefore, to test the generalization of our hypothesis that interpersonal features are beneficial even for unknown communication groups, we describe the supplementary experiment in the following section.
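The pooled paired comparison can be sketched as follows. The score values are made up for illustration only; the paper's actual pooled scores are in Figure 8.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired scores of the two models pooled across modalities and emotion
# dimensions (hypothetical numbers, one pair per modality/dimension cell).
individual = np.array([0.41, 0.38, 0.35, 0.44, 0.36, 0.40])
interpersonal = np.array([0.46, 0.43, 0.39, 0.47, 0.42, 0.45])

# Two-tailed Wilcoxon signed-rank test on the paired differences
stat, p = wilcoxon(interpersonal, individual, alternative="two-sided")
print(p < 0.05)
```

The test is non-parametric and paired, which suits a small set of matched model scores better than a t-test would.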

Setup
In the supplementary experiment, the training dataset comprised around 75% of the data. Specifically, the training dataset consisted of data from twelve people (participant IDs 3, 4, 7, 8, 9, 10, 19, 20, 21, 22, 23, and 24), and the test dataset consisted of the remaining four (IDs 25, 26, 29, and 30). We note that there was no particular rule in assigning IDs to participants in the K-EmoCon dataset, so there is no obvious source of selection bias.
We faced an imbalanced data problem in the supplementary experiment because the distribution of emotion labels differed between the training and test data. In an extreme case not encountered in the main experiment, some emotion values appeared in only one of the two sets; for example, arousal value 1 appeared in the training data but not in the test data. To address this, we collapsed the emotion values into two levels (values 1 to 3 became the low level; values 4 and 5 became the high level). However, even after collapsing into two levels, imbalance remained, so we again used re-sampling to obtain 800 samples as training data (400 segments × 2 levels). In addition, we used the f1-score and UAR as evaluation metrics here, because accuracy cannot reflect the true performance of a classifier on such imbalanced data. Table 6 shows that all f1-score results of the interpersonal model are better than those of the individual model, regardless of modality. Table 7 shows that the recall results of the interpersonal model are better than those of the individual model except for Visual-Valence and Cross-Arousal. We discuss these results by modality below. For the visual modality, although the interpersonal model outperformed the individual model in terms of f1-score (Table 6), it showed a worse recall result for the valence dimension (Table 7). We inspected the dataset distributions to explore the cause and found that the valence labels were less balanced between the training and test data.
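The label collapsing is a simple thresholding of the five-level ratings, as the following sketch shows (encoding low as 0 and high as 1 is our convention for illustration):

```python
import numpy as np

# Five-level self-reported ratings: 1-3 -> "low" (0), 4-5 -> "high" (1)
ratings = np.array([1, 2, 3, 4, 5, 3, 4])
levels = (ratings >= 4).astype(int)
print(levels.tolist())   # [0, 0, 0, 1, 1, 0, 1]
```

Collapsing guarantees that every level present in the test data also occurs in the training data, at the cost of a coarser prediction target.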

Result
For the audio modality, as Table 6 shows, all results of the interpersonal model are better than those of the individual model. However, this time the difference between the valence and arousal dimensions was negligible. This can be explained by the audio data containing too many silent parts, which would have suppressed the performance of the interpersonal model. The many silent parts may also have affected the recall results in Table 7; specifically, the variance of the interpersonal model's recall for arousal was very high.
For the cross-modality, Table 6 again shows the power of interpersonal features. The performance boost for the valence dimension was larger than for the arousal dimension. We also found, as in the main experiment, that the cross-modality results were sometimes lower than those of the visual or audio modality; this could be due to overfitting and the flaw in the audio data. These two problems may also explain the slightly lower recall of the interpersonal model for arousal compared to the individual model in Table 7.

Discussion
The lower recall of the interpersonal model for Visual-Valence and Cross-Arousal may raise the question of whether the interpersonal model truly outperformed the individual model. We tested this with a two-tailed Wilcoxon signed-rank test, as shown in Figure 9. When we pooled the recall values across all modalities and emotion dimensions and compared the two models, the p-value was 0.076: slightly greater than 0.05 but less than 0.1. Comparing the f1-scores between the interpersonal and individual models, the p-value was less than 0.001, meaning that, by f1-score, the interpersonal model significantly outperformed the individual model. Therefore, we conclude that the interpersonal model was overall better than the individual model, which means that interpersonal features are beneficial for automatic emotion recognition even with unknown communication pairs.

Conclusions
Inspired by the fact that humans recognize emotion via both individual and interpersonal features, we explored whether interpersonal features are beneficial for automatic emotion recognition. Specifically, we constructed an individual model and an interpersonal model in the visual, audio, and cross-modality settings, respectively. We then compared these two models on the K-EmoCon dataset in a main experiment and a supplementary experiment. Our main experiment showed that the interpersonal model performed better than the individual model. Our supplementary experiment showed that the interpersonal model outperformed the individual model even for unknown communication pairs. Therefore, we advocate incorporating interpersonal features for automatic emotion recognition in communication settings.
The framework used in this study was a "black box": we cannot identify which specific synchronization contributed to the better emotion recognition performance. This "black box" nature impedes us from further improving the algorithm and, more importantly, from understanding the mechanism by which humans naturally recognize emotion.
In the future, we hope to resolve this issue with the eXplainable Artificial Intelligence (XAI) approach [81].

Conflicts of Interest:
The authors declare no conflict of interest.