Article

Facial and Speech-Based Emotion Recognition Using Sequential Pattern Mining

1 Department of Computer Science, Kyonggi University, Suwon-si 16227, Gyeonggi-do, Republic of Korea
2 Division of AI Computer Science and Engineering, Kyonggi University, Suwon-si 16227, Gyeonggi-do, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(20), 4015; https://doi.org/10.3390/electronics14204015
Submission received: 28 August 2025 / Revised: 2 October 2025 / Accepted: 6 October 2025 / Published: 13 October 2025
(This article belongs to the Special Issue Application of Data Mining in Social Media)

Abstract

We propose a multimodal emotion recognition framework that integrates facial expressions and transcribed speech, with a particular focus on modeling the continuous changes and transitions of emotional states during conversation. Existing studies have primarily relied on a single modality (text or facial expressions) and typically perform static emotion classification at isolated time points, which limits their ability to capture abrupt emotional shifts or the structural patterns of emotional flow within dialogues. To address these limitations, this paper uses the MELD dataset to construct emotion sequences based on the order of utterances and introduces an analytical approach using Sequential Pattern Mining (SPM). Facial expressions are analyzed with DeepFace, while speech is transcribed with Whisper and passed through a BERT-based emotion classifier to infer emotions. The proposed method fuses the multimodal results through a weighted voting scheme to generate an emotion label sequence over the utterances. These sequences are then used to construct an emotion transition matrix, apply change-point detection, perform SPM, and train an LSTM-based classification model to predict the overall emotional flow of the dialogue. This approach goes beyond single-point judgments by capturing the contextual flow and dynamics of emotions, and experimental validation demonstrates superior performance compared to existing methods.

1. Introduction

The advancement of modern artificial intelligence technologies is rapidly shifting toward enabling machines to understand human emotional states and respond accordingly. In particular, Multimodal Emotion Recognition (MER), which leverages non-verbal cues such as speech and facial expressions, plays a pivotal role in various application domains, including affective dialogue systems, smart healthcare, and human–robot interaction [1]. MER captures emotional cues that may be overlooked in single-modality approaches and provides higher Accuracy and interpretability by integrating complementary information across modalities. In real-world applications, the quality of user experience depends not only on the Precision of emotion recognition but also on contextual coherence and temporal stability—achievable only through the integrated use of multimodal information.
Most previous research on emotion recognition has focused on estimating emotional states at the level of individual utterances or static images [2,3]. For example, one approach uses Whisper to transcribe speech signals into text and then applies a BERT-based classifier for emotion analysis on the transcribed text, while another infers emotions from facial images using models such as DeepFace [4]. These approaches achieve high performance within their respective modalities but are largely confined to single-point emotional judgments, failing to adequately reflect the contextual flow of actual conversations or situations. Furthermore, in real-world settings, factors such as cross-modality asynchrony, signal quality fluctuations, and speaker switching make predictions based on a single modality and single time point unstable. Thus, aligning and integrating signals from different modalities along the temporal axis becomes indispensable.
These limitations are particularly evident in three aspects. First is the absence of temporal context. Human emotions continuously evolve with conversational flow, interlocutor responses, and environmental changes, showing not only momentary fluctuations but also long-term tendencies. Existing approaches, however, fail to capture this temporal continuity and change. Second is the difficulty in detecting abrupt emotional shifts. In actual conversations, transitions can occur suddenly—for instance, from a neutral state to anger or surprise—yet independent per-utterance predictions struggle to detect these sharp changes effectively. Third is the limitation of modality fusion methods. Many existing MER studies employ simple voting or averaging strategies that fail to capture cross-modal interactions and do not fully exploit the strengths of individual modalities. Moreover, since the reliability of each modality varies across time, fusion without temporal weighting or uncertainty handling weakens prediction consistency and robustness.
To overcome these limitations, MER requires the introduction of Sequential Pattern Mining (SPM) techniques, which can effectively analyze the sequential structure and transition tendencies of emotions. SPM provides a powerful tool to examine the order in which emotions occur along the temporal axis and to analyze which emotions frequently transition into others, moving beyond static classification to capture structural patterns in emotional flow. In particular, algorithms such as PrefixSpan can efficiently discover frequent patterns in emotional transition sequences, providing crucial insights for subsequent emotion prediction and dialogue flow analysis. Moreover, abrupt emotional shifts can be detected based on changes in sequence patterns, allowing sensitive reflection of sudden affective transitions within conversations.
Accordingly, this study employs the MELD (Multimodal EmotionLines Dataset), performing Whisper-based speech transcription, BERT-based text emotion analysis, and DeepFace-based facial emotion analysis. The extracted multimodal emotional information is reordered according to utterance sequences while preserving temporal intervals to construct emotion sequences. To alleviate cross-modality asynchrony, emotional signals are synchronized at the utterance level, while modality mismatches are handled through weighted fusion that considers temporal reliability and by allowing composite emotion labels. Subsequently, analyses of emotion transition patterns, duration, and change-point detection are performed to enable more refined emotion inference. This sequential emotion analysis captures the overall affective flow of conversations, enhances consistency, contextuality, and robustness in emotion prediction, and further contributes to real-time user state monitoring and proactive intervention strategy design.
The key contributions of this paper are as follows:
  • Development of a multimodal emotion recognition framework: By integrating Whisper-based speech transcription, BERT-based text emotion analysis, and DeepFace-based facial emotion analysis, the proposed framework overcomes the limitations of unimodal approaches and achieves more accurate and reliable emotion recognition.
  • Construction of emotion sequences and application of SPM: Based on MELD utterance sequences, emotion sequences are generated, and SPM-based techniques are applied to quantitatively analyze dynamic emotional characteristics such as transition patterns, duration, and change-point detection.
  • Emotion flow visualization and empirical analysis: By visually representing emotion transition matrices and change-point detection results, the study provides intuitive insights into trends of affective shifts and sudden changes within conversations, offering design guidance for affective systems and interface improvements.
The remainder of this paper is organized as follows. Section 2 explains multimodal data preprocessing, emotion extraction, fusion, and sequence construction methods. Section 3 applies SPM to emotion sequences, analyzes emotion flow patterns, and introduces feature extraction and an LSTM-based classification model. Section 4 compares the performance of the proposed method with existing approaches and evaluates the contribution of each component through an ablation study. Finally, Section 5 presents the conclusion and discusses directions for future research.

2. Related Work

2.1. Multimodal Emotion Recognition

Multimodal Emotion Recognition (MER) is a technology that integrates various input modalities (e.g., text, speech, and video) to recognize human emotions, enabling a more accurate capture of the complex nature of emotions. This has led to active research in the field [5]. Zadeh et al. [6], for example, proposed the Tensor Fusion Network (TFN) based on large-scale emotion datasets such as CMU-MOSEI, explicitly integrating inter-modality interactions to improve MER performance.
Recently, large pre-trained language models (e.g., BERT) and vision-based deep learning models (e.g., VGG, ResNet) have been employed to maximize the representational power of each modality, and such approaches have been shown to outperform unimodal methods [7]. In particular, previous studies have quantitatively demonstrated that multimodal approaches are superior to unimodal ones in MER. Farhadipour et al. showed that combining text and speech modalities in multiparty conversational environments improved emotion recognition Accuracy by more than 9.3% compared to unimodal methods [8]. Zhao et al., in the MER2024 Challenge, demonstrated that multimodal models achieved significantly higher F1-scores than unimodal audio/text models [9]. Ge et al. experimentally verified that early fusion strategies can more effectively learn emotion transitions [10]. Furthermore, SUMMER [11] and Lei [12], through distillation-based MER and comparative analyses across various datasets (SAVEE, RAVDESS, CREMA-D), showed that multimodal models also outperform unimodal ones in terms of generalization performance and robustness.
Unlike most prior MER studies, which mainly focused on emotion classification at single time points, this paper integrates time-series mining techniques to analyze emotional flows and transitions across the temporal dimension. Considering that emotional states evolve continuously and contextually, this approach provides greater practical applicability.
Whisper [13] (large-v2 model), developed by OpenAI, is a powerful speech recognition model known for its ability to provide accurate transcriptions across multiple languages and noisy environments. Whisper ensures high transcription quality even in natural conversational settings such as MELD, and its transcription outputs serve as inputs for subsequent text-based emotion analysis. For text emotion analysis, pre-trained BERT variants are commonly used, with fine-tuned models specialized for emotion labeling—such as bhadresh-savani/bert-base-uncased-emotion—being representative examples [14]. These models significantly outperform traditional LSTM or SVM-based approaches in terms of emotion classification Accuracy.
Meanwhile, DeepFace [15] is an open-source library for face recognition and emotion inference, supporting real-time emotion estimation through multiple backend models such as VGG-Face, ArcFace, and Dlib. In this study, DeepFace is applied to video-frame-based emotion analysis to supplement affective cues. These results are then fused with Whisper- and BERT-based emotion outputs to construct sequential emotion sequences along the temporal axis.

2.2. Sequential Emotion Flow Analysis

Sequential emotion flow analysis, which examines how emotions change over time, is an important research area for enhancing both the Precision and contextuality of emotion prediction. Zhang et al. [16] modeled transitions in emotional states from movie dialogues using emotion transition models, while Kshirsagar et al. [17] employed HMM-based models to probabilistically estimate emotion flows. Prior studies have also highlighted that emotional fluctuations themselves are critical signals, applying various time-based analysis methods such as change-point detection, emotion trend extraction, and emotion peak analysis [18].
Conventional approaches to emotion flow analysis have mainly focused on analyzing linear transition patterns between emotional states based on static text data, which has limited their ability to capture nonlinear temporal changes or overlapping emotional phenomena. In conversational contexts, however, emotions often evolve not only through gradual shifts but also through abrupt transitions, making it crucial to precisely capture the subtle, continuous structures of emotional flows.
Moreover, emotional expression manifests not only in text but also in non-verbal cues such as intonation, speech rate, facial expressions, and gaze. Thus, unimodal approaches to emotion flow analysis are inherently limited in reflecting these multifaceted expressions. For instance, even with identical text content, the direction and intensity of the conveyed emotion can differ dramatically depending on intonation or facial expressions. Consequently, recent studies have increasingly attempted to align multimodal emotional data—including speech and visual signals—along temporal sequences and analyze their flows.
For example, Yu et al. [19] proposed a Dual-Stage Cross-Modal Network that dynamically integrates facial expressions and acoustic features through Temporal Convolution and Gated LSTM, effectively modeling the continuous variation in emotional mimicry intensity. Meng et al. introduced a Masked Graph Learning with Recurrent Alignment (MGLRA) framework to address modality distortion and asynchrony issues in conversational multimodal emotion recognition. This model represents temporal features of each modality as graphs, reinforces cross-modal interactions and alignment through recursive alignment processes, and employs GCN-based masked learning to extract the core components of emotional representation. As a result, it achieved superior Precision and stability compared to existing models on the IEMOCAP and MELD datasets [20]. Meanwhile, Sun et al. proposed the HiCMAE (Hierarchical Contrastive Masked Autoencoder) for audio-visual self-supervised emotion recognition. By combining hierarchical feature extraction with contrastive learning across multimodal inputs, HiCMAE enhances shared affective representations while suppressing cross-modal noise. Experimental results showed that HiCMAE outperformed not only traditional supervised models but also a range of self-supervised approaches across nine emotion recognition benchmarks [21].
Previous research has also developed Conditional Random Field (CRF)-based methods for sequential classification of affective adjectives [22]. This study established bipolar adjective axes such as Adult–Childish and Retro–Modern, modeling how the distribution of adjectives shifts along the flow of utterances. CRF, as a sequential labeling method incorporating contextual information, effectively captures gradual shifts or abrupt transitions of affective expressions in text—capabilities beyond simple static classification [23]. Figure 1 below illustrates an example visualization of adjective distribution changes within the affective space proposed in that study.
The CRF-based approach is contextually aligned with the emotion flow analysis of this study. Just as previous research sequentially captured the transition patterns of affective adjectives within text, the present work tracks the temporal dynamics of multimodal emotional states—integrating speech, facial expressions, and text—within conversational contexts. Both lines of research emphasize that “emotions are not static states at a single point in time but processes that evolve continuously over time,” while this study extends that perspective into a multimodal setting.
This approach provides a foundation for detecting key points of emotional transition (change-points), as well as patterns of stability and abrupt shifts, thereby enhancing the practical applicability of emotion analysis systems designed for conversation and situational awareness. For instance, during user interaction, if emotions fluctuate unstably within specific segments, the system can detect such changes in real time and enable the design of affective support systems capable of timely intervention [24]. Building upon MELD conversational sequences, this study organizes multimodal emotion outcomes along the temporal flow and analyzes emotion transition matrices and flow patterns, thereby addressing limitations of prior work [25].

3. Facial and Speech Emotion Sequence Modeling Using Sequential Pattern Mining

This study proposes a method that recognizes emotions based on multimodal data, generates an emotion sequence by arranging the recognized emotional states in temporal order, and models emotional changes through SPM and flow-based analysis techniques. The proposed methodology consists of three stages.
In the first stage, multimodal data are preprocessed, and emotion information is extracted from text and facial expressions. Speech is transcribed into text through the Whisper model, and this text is classified into various emotion classes such as joy, sadness, anger, and surprise using a pre-trained BERT-based emotion classifier. The emotion prediction results are output as probability values through Softmax. In the case of facial video, emotions are extracted on a frame basis using the DeepFace model, and these are integrated at the utterance level to be aggregated into a single facial emotion label.
In the second stage, the multimodal emotion information extracted in this way is arranged and fused according to the chronological order of utterances to generate an emotion sequence that reflects the flow of emotions. The text and facial expression emotion results are integrated by a probability-based voting method, and the emotion is determined based on the dominant emotion. In cases where the results across modalities are inconsistent or show similar probabilities, they are recorded as fused emotions, thereby reflecting the overlap and ambiguity of emotions. Since the generated emotion sequence is constructed while maintaining the time interval between utterances, it can capture the natural flow of emotional development and transition.
In the third stage, various sequential analysis techniques are applied based on the constructed emotion sequence. First, SPM is used to extract frequently appearing emotion transition patterns (e.g., joy → neutral → sadness) within conversations. This contributes to discovering the regularity and tendency of transitions between emotional states. Next, an Emotion Transition Matrix is constructed to visually analyze the transition probabilities and structural flows between emotions. In addition, a CUSUM-based change-point detection technique is applied to identify sections where abrupt emotional transitions occur at an early stage. Finally, an LSTM (Long Short-Term Memory)-based model is used to learn the patterns of the emotion sequence, and based on this, a structure is designed that can predict or classify future emotional states.
Such analysis allows the dynamic characteristics of emotions to be reflected more precisely, improves the Accuracy of emotion recognition by considering temporal context, and enables sensitive responses to changes in user emotions in actual interactive systems. Figure 2 below visually represents the entire multimodal emotion recognition and analysis procedure proposed in this paper.

3.1. Multimodal Data Preprocessing and Emotion Extraction

In the first stage of this study, a preprocessing process is performed to extract emotion information from two main modalities, speech and facial expressions, based on MELD (Multimodal EmotionLines Dataset) [26]. MELD is a dataset constructed from the dialogues of the TV sitcom Friends, and it contains about 1400 dialogues and more than 13,000 utterances. Each utterance is composed of a video file and includes various metadata such as text, speaker information, emotion labels, and timestamps. Table 1 shows the statistics of the MELD dataset.
Figure 3 shows the multimodal preprocessing and emotion extraction pipeline performed in this study. First, video data were processed with OpenCV to extract frames at 1 frame per second (FPS); resizing to 224 × 224 and channel normalization were applied as needed. Facial expression emotions were then classified into six classes (happy, sad, angry, surprise, fear, and neutral) using DeepFace [27], and frames with low face detection confidence or extreme poses (e.g., rotation > 45°) were excluded. In multi-person scenes, speaker candidates were selected through a simple tracking method, and probability vectors were aggregated at the utterance level using frame confidence and track duration as weights, after which exponential moving average (EMA) smoothing was applied. The right panel of the figure shows an example of the analysis results for each modality: (a) the emotion distribution from facial analysis, (b) the corresponding speech waveform, and (c) the emotion distribution inferred from the transcribed text.
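To make the frame-level facial step concrete, the following minimal sketch samples frames at roughly 1 FPS with OpenCV, classifies them with DeepFace, and averages the frame-level probability vectors into an utterance-level vector. It assumes the opencv-python and deepface packages; the function names, the simple averaging, and the handling of undetected faces are illustrative choices rather than the exact implementation used in this study.
# Minimal sketch of the frame-level facial emotion step (illustrative, not the exact implementation).
import cv2
import numpy as np
from deepface import DeepFace

FACE_CLASSES = ["happy", "sad", "angry", "surprise", "fear", "neutral"]

def extract_face_emotions(video_path, fps_target=1):
    """Sample frames at roughly 1 FPS and return per-frame emotion probability vectors."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / fps_target)), 1)
    frame_probs, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, (224, 224))
            try:
                result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=True)
                result = result[0] if isinstance(result, list) else result
                emo = result["emotion"]  # dict: emotion name -> score
                vec = np.array([emo.get(c, 0.0) for c in FACE_CLASSES], dtype=float)
                frame_probs.append(vec / (vec.sum() + 1e-9))
            except ValueError:
                pass  # no face detected in this frame; skip it
        idx += 1
    cap.release()
    return frame_probs

def utterance_face_probs(frame_probs):
    """Aggregate frame-level vectors into a single utterance-level probability vector."""
    return np.mean(frame_probs, axis=0) if frame_probs else None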
At the same time, speech signals were sliced according to the utterance start and end times provided in MELD, and silent segments were removed using Voice Activity Detection (VAD). The Whisper (base) model was executed in a GPU environment to perform transcription, and word-level timestamps were obtained to enable fine synchronization with facial expression changes. The transcribed text was classified into seven classes—joy, anger, sadness, fear, surprise, love, and neutral—using the bhadresh-savani/bert-base-uncased-emotion [28] model provided by HuggingFace. This model converts input sentences into BERT embedding vectors, which are then passed to an output layer for emotion classification, and the probabilities of each class are calculated using the Softmax function as shown in Equation (1).
P(y_i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad (1)
In Equation (1), zi represents the BERT output logits for class i, and K is the total number of emotion classes (7). The predicted emotion probabilities are then utilized in a weighted voting scheme or a complex emotion combination process. Finally, to resolve class mismatches between modalities, we normalized the labels into a common space: U = {joy, anger, sadness, fear, surprise, love, neutral}. For instance, “happy” from the facial modality was mapped to “joy,” and the “love” class, which is absent in the facial modality, was padded with a value of 0. Additionally, video and audio timestamps were realigned to within ±0.25 s of the utterance boundaries. If only one modality was valid, the others were replaced with either a uniform distribution or a conservative distribution biased towards “neutral.” Utterances with an insufficient number of valid samples were marked as “uncertain” to lower their weight in the subsequent fusion process. The entire preprocessing pipeline ensured reproducibility through random seed fixing and the caching of intermediate results.
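As a companion to the facial pipeline, the following sketch outlines the speech-to-text emotion step and the normalization into the common label space U described above. It assumes recent releases of the openai-whisper and transformers packages and the HuggingFace model named in the text; the helper names, the zero-padding of the missing "love" class, and the mapping dictionary are illustrative.
# Sketch of the speech-to-text emotion step and label-space normalization into U.
import numpy as np
import whisper
from transformers import pipeline

COMMON_LABELS = ["joy", "anger", "sadness", "fear", "surprise", "love", "neutral"]
# Map facial classes into the common label space; "love" has no facial counterpart and stays 0.
FACE_TO_COMMON = {"happy": "joy", "angry": "anger", "sad": "sadness",
                  "fear": "fear", "surprise": "surprise", "neutral": "neutral"}

asr = whisper.load_model("base")
text_clf = pipeline("text-classification",
                    model="bhadresh-savani/bert-base-uncased-emotion",
                    top_k=None)

def text_emotion_probs(audio_path):
    """Transcribe one utterance and return (transcript, probability vector over COMMON_LABELS)."""
    transcript = asr.transcribe(audio_path, word_timestamps=True)["text"]
    scores = {d["label"]: d["score"] for d in text_clf([transcript])[0]}
    vec = np.array([scores.get(lbl, 0.0) for lbl in COMMON_LABELS])
    return transcript, vec / (vec.sum() + 1e-9)

def face_probs_to_common(face_vec, face_classes):
    """Project a facial probability vector into the common seven-class space."""
    common = np.zeros(len(COMMON_LABELS))
    for cls, p in zip(face_classes, face_vec):
        common[COMMON_LABELS.index(FACE_TO_COMMON[cls])] = p
    return common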

3.2. Emotion Fusion and Time Series Construction

The second stage involves fusing the emotional information extracted from speech and facial expressions into a single, unified emotion label and arranging these labels chronologically to form an emotion sequence. In multimodal emotion recognition, each modality possesses distinct signal characteristics and noise levels, making an effective fusion technique a critical factor in determining the overall system performance [29]. In this study, we experimented with and compared both a simple voting method and a probability-based weighted voting method, ultimately adopting the approach with superior quantitative performance as our final fusion strategy.
First, the emotion results from each modality are represented by probability values (Softmax scores). Text transcribed by Whisper is classified into categories such as joy, sadness, anger, and fear using a BERT-based emotion classifier, which also provides prediction probabilities for each class. Similarly, facial emotions extracted via DeepFace have corresponding probability values for each emotion class per frame. These are aggregated, either by averaging or taking the maximum value, to derive utterance-level emotion probabilities.
The emotion fusion process follows three rules:
  • Dominant Emotion Selection: If the same emotion has the highest probability in both modalities, it is set as the final emotion.
  • Probability-Based Weighted Voting: If the modalities indicate different emotions, the one with the higher Softmax probability value is selected. If the probability difference is not significant (e.g., difference <0.15), both emotions are noted to represent a complex emotion (e.g., joy + sadness).
  • Handling Uncertain Cases: If the confidence of all emotion predictions for a specific utterance is low (confidence <0.4), that segment is labeled as “uncertain” or “neutral.”
Through this process, a single unified emotion label is determined for each utterance. Arranging these labels in the order of utterance completes the emotion sequence. Since this sequence reflects the temporal flow of emotions, subsequent steps involve constructing an Emotion Transition Matrix to analyze transition patterns, applying change-point detection methods (like CUSUM and Windowed Entropy) to identify abrupt emotional shifts, and using Frequent Pattern Mining to identify recurring transition structures [30]. This analysis allows for the quantitative inference of emotional stability, transition paths, and phenomena like emotional dissemination or convergence, which is key to understanding the broader emotional context of a conversation beyond isolated judgments.
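The three fusion rules can be summarized in a short, illustrative routine such as the sketch below, which uses the 0.15 margin and 0.4 confidence threshold quoted above; the function names, the assumed 'start_time' key, and the "+"-joined notation for complex emotions are assumptions for illustration only.
# Illustrative routine for the three fusion rules (margin 0.15, confidence threshold 0.4).
import numpy as np

def fuse_utterance(text_probs, face_probs, labels, margin=0.15, min_confidence=0.4):
    """Return a single (possibly complex) emotion label for one utterance."""
    t_idx, f_idx = int(np.argmax(text_probs)), int(np.argmax(face_probs))
    t_conf, f_conf = text_probs[t_idx], face_probs[f_idx]

    # Rule 3: handling uncertain cases.
    if max(t_conf, f_conf) < min_confidence:
        return "uncertain"

    # Rule 1: dominant emotion selection when both modalities agree.
    if t_idx == f_idx:
        return labels[t_idx]

    # Rule 2: probability-based weighted voting; keep both labels when the margin is small.
    if abs(t_conf - f_conf) < margin:
        return f"{labels[t_idx]}+{labels[f_idx]}"  # complex emotion, e.g. "joy+sadness"
    return labels[t_idx] if t_conf > f_conf else labels[f_idx]

def build_emotion_sequence(utterances, labels):
    """Order utterances chronologically (assumed 'start_time' key) and fuse each one."""
    ordered = sorted(utterances, key=lambda u: u["start_time"])
    return [fuse_utterance(u["text_probs"], u["face_probs"], labels) for u in ordered]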
Furthermore, this emotion sequence enables the quantitative analysis of emotional transitions, stability, and shifts, offering richer insights than static, single-point emotion analysis. For example, if the emotional flow within a single conversation is identified as joy → joy → sadness → sadness → anger → joy, the following analysis is possible: joy → joy indicates emotional continuity and represents the duration of that emotion. The subsequent transitions, joy → sadness and sadness → anger, suggest a gradual emotional descent into negative states. Finally, anger → joy represents a dramatic reversal, or a sharp transition. Such flows can be used to analyze emotional volatility or resilience, and the frequency and probability of each transition are quantified through an emotion transition matrix [31]. This provides a valuable foundation for assessing the persistence or emotional contagion of specific emotions, particularly when measuring engagement or emotional empathy in a conversation. Table 2 below shows an example of an emotion sequence constructed from the actual MELD dataset.
Table 2 illustrates how emotions unfold over time within a single dialogue, and the presence of complex (fused) emotions reinforces the need for multimodal analysis. Subsequently, this study applies various time-series mining techniques to this emotion sequence to conduct an in-depth analysis of dynamic emotional changes. Figure 4 provides a visualized example of the emotion sequence from the MELD dataset, as presented in Table 2. Timestamps are displayed above the face frames extracted for each segment, while the Text Emotion, Face Emotion, and Fused Emotion are sequentially arranged below. This allows for an intuitive understanding of how a single utterance changes emotionally over time and how emotion information from different modalities is complementarily combined [32]. In particular, cases where joy and sadness appear simultaneously, recorded as a complex emotion (joy + sad), demonstrate that multimodal fusion analysis can effectively capture emotional contexts that are difficult to grasp through single-modality analysis alone.
Figure 5 visually represents an actual dialogue scene from the MELD dataset, intuitively illustrating the changes in emotional states as the conversation between characters unfolds. Each utterance is displayed with the character’s facial expression, the corresponding text, and an emotion label. This helps to understand how expression-based and text-based emotion analyses are integrated to form the emotion sequence. For example, ‘surprise’ is identified as the emotion in the first utterance, followed by a transition to ‘joy’, ‘neutral’, ‘anger’, and ‘sadness’ over time.
This visual flow empirically demonstrates the process of temporal emotion fusion and sequence construction, effectively conveying the core concepts of the proposed multimodal emotion analysis framework.

3.3. Emotion Sequence Mining and Classification

The emotion probability vectors for the face and text modalities, obtained through preprocessing, are converted into a final utterance-level emotion sequence after label space normalization and Weighted Fusion. First, Temperature Scaling is applied to the logits z_face and z_text of each modality. This is a process that calibrates the model’s output confidence by preventing the probability distribution from becoming overly sharp or flat.
\tilde{p}_m = \mathrm{softmax}\left( z_m / t_m \right), \quad m \in \{\text{face}, \text{text}\} \quad (2)
In Equation (2), tm is the temperature parameter for the given modality. If its value is greater than 1, the distribution becomes softer, and if it is less than 1, it becomes sharper. This process aligns the output probability distributions of the two modalities to the same scale, preventing one from overly dominating during the subsequent fusion. Next, a reliability score is calculated for the facial modality (rf) and the text modality (rt). The facial modality utilizes the average frame-by-frame face detection score and the track duration. The text modality uses the transcription segment length, the VAD (Voice Activity Detection) utterance ratio, and the average ASR log-probability. Based on these reliability scores, weights are calculated and are defined as shown in Equation (3).
w_f = \frac{r_f}{r_f + r_t}, \quad w_t = \frac{r_t}{r_f + r_t}, \quad w_f + w_t = 1 \quad (3)
Equation (3) is a method that assigns weights based on the relative reliability ratio of the two modalities. For example, the more stable the face detection is, the larger the value of wf becomes, and the higher the transcription quality is, the larger the value of wt becomes. These calculated weights, wf and wt, proportionally adjust the influence of each modality in the subsequent fusion stage.
Finally, the probabilities of the two modalities are fused in Log-space. This approach prevents numerical instability (underflow) that can occur with very small probability values, enabling stable calculations.
l_{\text{fusion}} = w_f \cdot \log(\tilde{p}_{\text{face}} + \epsilon) + w_t \cdot \log(\tilde{p}_{\text{text}} + \epsilon), \quad p_{\text{fusion}} = \mathrm{softmax}(l_{\text{fusion}}) \quad (4)
In Equation (4), ε is a stabilization constant on the order of 10^{-6}, which prevents values close to 0 from entering the logarithm. The probability vectors of each modality, \tilde{p}_{face} and \tilde{p}_{text}, are proportionally combined according to the weights w_f and w_t. Finally, a re-normalized final probability vector, p_fusion, is obtained through a Softmax function.
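A minimal sketch of Equations (2)–(4) is given below, assuming the reliability scores r_face and r_text have already been computed from the detection, VAD, and ASR statistics described above; the default temperature values and function names are illustrative.
# Sketch of the calibrated log-space fusion in Equations (2)-(4).
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float) - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fuse_modalities(z_face, z_text, r_face, r_text, t_face=1.0, t_text=1.0, eps=1e-6):
    # Equation (2): temperature-scaled probabilities per modality.
    p_face = softmax(np.asarray(z_face) / t_face)
    p_text = softmax(np.asarray(z_text) / t_text)

    # Equation (3): weights from the relative reliability of the two modalities.
    w_face = r_face / (r_face + r_text)
    w_text = r_text / (r_face + r_text)

    # Equation (4): weighted combination in log-space, then re-normalization.
    l_fusion = w_face * np.log(p_face + eps) + w_text * np.log(p_text + eps)
    return softmax(l_fusion)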
If the facial modality is undetected or the text transcription quality is below a threshold, the weight for that modality is set to 0, and the distribution of the missing modality is replaced with a conservative probability distribution biased towards neutral. The pfusion sequence generated in this way is then used as input for time-series mining. The emotion flow analysis proposed in this study consists of four stages.

3.3.1. Emotion Transition Matrix

An Emotion Transition Matrix is a technique for quantitatively modeling the transition probabilities between emotional states in time-series data. Since emotions tend to change gradually over time or repeat specific patterns, predicting emotion classes at individual time points alone cannot fully explain the flow of a conversation. Calculating this matrix offers the following advantages:
  • Persistence Analysis: Measures how long a single emotion is maintained.
  • Directionality: Identifies the primary transition paths between emotions (i.e., which emotions tend to lead to which other emotions).
  • Predictive Modeling: Predicts the next emotional state based on past patterns.
Specifically, if we define the emotional state space as S = {joy, sadness, anger, fear, neutral, surprise} and the state at time t is st, the transition probability is defined as shown in Equation (5) below.
T_{ij} = P(s_{t+1} = j \mid s_t = i) \quad (5)
Table 3 summarizes the mapping of the emotion class categories used in this study to their corresponding integer indices. For efficient computation in tasks such as sequence analysis, transition matrix construction, and the creation of time-series input features, each emotion is represented by a number from 0 to 6. This allows for consistent referencing throughout all subsequent algorithms and analysis processes.
Algorithm 1 defines the procedure for calculating a transition matrix from an input emotion sequence. First, it creates an N × N transition count matrix C, initialized with all elements as 0, where N is the number of emotional states. It then sequentially traverses the emotion sequence, identifying the transition from the current state st to the next state st+1 and incrementing the count at the corresponding position by 1. Once the transition counts are accumulated, the transition probability matrix T is generated by dividing each row by its sum. If no transitions are observed from a particular state (i.e., the row sum is 0), that row is replaced with a uniform probability or a predefined distribution.
Algorithm 1 Emotion Transition Matrix Calculation
1: Input: An emotion sequence O = (o_1, o_2, …, o_T), a set of unique emotion states S = {s_1, s_2, …, s_N}
2: Output: An emotion transition probability matrix T of size N × N
3:
4: ▷ Initialize a matrix to store transition counts
5: Initialize an N × N matrix C with all elements set to 0
6:
7: ▷ Iterate through the sequence to count observed transitions
8: for t ← 1 to T − 1 do
9:     i ← indexOf(o_t, S)                    ▷ Get index of current emotion
10:     j ← indexOf(o_{t+1}, S)     ▷ Get index of next emotion
11:     C[i][j] ← C[i][j] + 1
12: end for
13: ▷ Normalize the count matrix row by row to get probabilities
14: Initialize an N × N probability matrix T
15: for i ← 1 to N do
16:     total_transitions ← Σ_{j=1}^{N} C[i][j]
17:     if total_transitions ≠ 0 then
18:          for j ← 1 to N do
19:               T[i][j] ← C[i][j]/total_transitions
20:          end for
21:     else
22:          for j ← 1 to N do
23:               T[i][j] ← 1/N          ▷ Replace empty rows with a uniform distribution
24:          end for
25:     end if
26: end for
27: return T
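For reference, a compact Python rendering of Algorithm 1 might look as follows, assuming 0-based integer emotion indices as in Table 3; the uniform fallback for unobserved rows follows the description given above, and the example sequence is toy data.
# Compact Python rendering of Algorithm 1 with 0-based emotion indices (cf. Table 3).
import numpy as np

def emotion_transition_matrix(sequence, n_states=7):
    """Estimate T[i][j] = P(s_{t+1} = j | s_t = i) from an integer emotion sequence."""
    counts = np.zeros((n_states, n_states))
    for cur, nxt in zip(sequence[:-1], sequence[1:]):
        counts[cur, nxt] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions fall back to a uniform distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

# Example: joy(0) -> joy(0) -> sadness(2) -> anger(1) -> neutral(6)
T = emotion_transition_matrix([0, 0, 2, 1, 6])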
In this study, the utterance-level emotion sequence obtained through Weighted Fusion is used as the input for Algorithm 1. Specifically, the emotion probability vectors from the facial and text modalities are fused to generate a final sequence of emotion class labels. This sequence is then used to calculate the transition count matrix, which is subsequently normalized into a probability format. The resulting transition matrix is used to quantitatively analyze the persistence and transition patterns of emotions within the dialogue and is subsequently utilized as an input feature for dialogue flow analysis, change-point detection (CUSUM), and LSTM-based emotion flow classification.

3.3.2. Change-Point Detection

Emotions in a dialogue change continuously over time, but distinct transitions can also occur at specific moments. Capturing these abrupt emotional transition points is crucial for interpreting conversational context and detecting emotion-based events. In this study, we apply a change-point detection technique based on the CUSUM (Cumulative Sum) method. Originally developed for early anomaly detection in statistical process control, CUSUM identifies a change-point by accumulating the deviation of time-series data from its mean and flagging when this cumulative sum exceeds a certain threshold. The advantage of this method is its ability to consider accumulated changes over time, allowing it to effectively detect meaningful pattern shifts while ignoring noise and minor fluctuations. The CUSUM procedure is as follows. First, for an input sequence x_t (e.g., emotion scores, transition probabilities), the overall mean μ is calculated. The deviation at each time point is defined as the difference between its value and this mean. A drift parameter, δ, is set to suppress minor fluctuations. To consider only the positive cumulative sum, the current sum is reset to 0 if it becomes negative. If the cumulative sum exceeds a threshold h, that point is considered a change-point, and the sum is reinitialized to search for new changes in the subsequent segment. This process makes it possible to extract only the significant, abrupt change-points across the entire sequence.
The pseudocode for Algorithm 2 outlines the procedure described above in a step-by-step manner. First, the cumulative sum variable is initialized to 0, and a list C is prepared to store the indices of the change-points. The mean of the input sequence is then calculated and set as a baseline. Through a loop, each time point is sequentially examined, and the difference between the current value and the baseline mean, adjusted by the drift, is added to the cumulative sum. If the calculated cumulative sum is negative, it is reset to 0 to prevent the decreasing effect of a previous segment from carrying over to the next. This method ensures that the threshold is only exceeded when changes accumulate continuously, thereby filtering out minor oscillations and capturing only distinct pattern shifts. At the moment the threshold is exceeded, the corresponding time point is added to the list as a change-point, and the cumulative sum is reset to 0. By repeating this process, the indices of all change-points within the emotion sequence are ultimately produced.
Algorithm 2 CUSUM Change-Point Detection
1: Input: A numeric sequence X = (x_1, x_2, …, x_T), threshold h, drift δ
2: Output: A list of change-point indices C
3: ▷ Initialization
4: S ← 0
5: C ← []                         ▷ List for storing detected change-point indices
6: μ ← mean(X)          ▷ Calculate the mean of the sequence
7: ▷ Iterate through the sequence to detect changes
8: for t ← 1 to T do
9:     S ← max(0, S + x_t − μ − δ)
10:     if S > h then
11:          Append t to C
12:           S ← 0                         ▷ Reset sum after a change-point is detected
13:     end if
14: end for
15: return C
In this study, to prepare the input for the CUSUM algorithm, we first constructed a sequence of utterance-level emotion labels. From this sequence, we calculated the master Emotion Transition Matrix as described in Algorithm 1. To analyze the temporal dynamics of these emotional shifts, we then generated a transition probability sequence. This was achieved by iterating through the emotion label sequence: for each transition from the emotion at utterance t (st) to the emotion at utterance t + 1 (st + 1), we looked up the corresponding probability Tij in the master Emotion Transition Matrix, where i is the index for emotion st and j is the index for st + 1. For example, if a dialogue transitioned from ‘joy’ (i = 0) to ‘neutral’ (j = 6), the value recorded in the transition probability sequence for that time step would be the pre-calculated probability T0,6. The resulting numeric sequence, where each element represents the likelihood of an observed transition, served as the direct input X for the CUSUM change-point detection algorithm outlined in Algorithm 2. By applying CUSUM, we were able to automatically extract the points within a dialogue where emotions shift abruptly, and this was utilized in two ways. First, in the dialogue analysis stage, the change-points were interpreted as emotional transition segments, which could be correlated with significant events or reaction points in the conversational context. Second, in the subsequent LSTM-based emotion classification model, the change-point information was used as an auxiliary input (feature) to reflect not only temporal continuity but also abrupt emotional changes. This approach is significant in that it models the dynamic flow of emotions more precisely compared to previous studies that primarily utilized only the static characteristics of emotion distributions.
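The construction of the CUSUM input sequence and the detection step itself can be sketched as follows; the threshold h and drift δ shown here are illustrative, since the text does not fix their values.
# Sketch of the CUSUM input construction and detection step (h and drift are illustrative).
import numpy as np

def transition_prob_sequence(label_seq, T):
    """Map each observed transition s_t -> s_{t+1} to its probability T[i][j]."""
    return np.array([T[i, j] for i, j in zip(label_seq[:-1], label_seq[1:])])

def cusum_change_points(x, h=0.5, drift=0.05):
    """Algorithm 2: one-sided CUSUM with drift; returns indices where the sum exceeds h."""
    mu = np.mean(x)
    s, points = 0.0, []
    for t, value in enumerate(x):
        s = max(0.0, s + value - mu - drift)
        if s > h:
            points.append(t)
            s = 0.0  # reset and keep searching for further change-points
    return points

# Usage: X = transition_prob_sequence(labels, T); change_points = cusum_change_points(X)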

3.3.3. Emotion Sequential Pattern Mining

Emotions tend to occur in a specific order rather than independently over time. To effectively capture this emotional transition structure, this study applies SPM. Specifically, we propose an analysis method that utilizes the PrefixSpan algorithm to extract structural information from the emotional flow within dialogues, with the goal of deriving frequently occurring emotion sequence patterns. PrefixSpan is a frequent sequence mining algorithm based on prefix-projection and is well-suited for analyzing sequential label structures such as emotions. In this study, the analysis is based on the emotion class sequence fused from the facial and text modalities. This sequence is constructed as an integer sequence using the previously defined emotion class indices. An example of such a sequence is provided in Table 4 below, which lists the chronologically occurring emotion class indices for each utterance clip (Video ID), allowing for a visual confirmation of the emotional flow.
Each value in Table 4 represents the emotion class index per time frame, and this sequence matrix serves as the foundational data for the SPM analysis. Through this, the repetition of specific emotions, the directionality of their transitions, and the emotional structure at various points in time can be clearly identified. By applying the PrefixSpan algorithm to this matrix, emotion transition patterns that use each emotion index as a prefix can be derived.
Table 5 shows the frequent sequential patterns that start with each emotion index. For example, <0, 0, 1> indicates a tendency for the ‘Joy’ state to be repeated before transitioning to ‘Anger’, while <2, 0, 3> suggests a flow from ‘Sadness’ through ‘Joy’ to ‘Fear’. Such patterns are effective for understanding the typical transition structures between emotions. These emotion sequence patterns go beyond single-point-in-time classification to reflect the contextual emotion flow and transition tendency. They can capture higher-order structures such as emotional repetition, abrupt shifts, and transitions. Furthermore, by quantitatively analyzing which emotions frequently appear after a specific state, these patterns also offer predictability for the next emotional state. This sequence-based information effectively summarizes the structure of the emotional flow and is proposed as a crucial input feature for the LSTM-based emotion classification model in a later stage. By learning the patterns and frequencies of emotional changes within these sequences, the model can move beyond simple classification to more precisely recognize types of emotional progression at the dialogue level.
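A small sketch of this mining step is shown below, assuming the open-source prefixspan package; the toy sequence database and the absolute support threshold are illustrative and are not the values used in the experiments.
# Sketch of frequent emotion-sequence mining with the open-source prefixspan package.
from prefixspan import PrefixSpan

# Each row is one dialogue's fused emotion-index sequence (cf. Table 4); values are toy data.
emotion_db = [
    [0, 0, 1, 6, 2],  # joy, joy, anger, neutral, sadness
    [2, 0, 3, 6],     # sadness, joy, fear, neutral
    [0, 0, 1, 2],
    [6, 0, 0, 1],
]

ps = PrefixSpan(emotion_db)
# Patterns occurring in at least two dialogues (absolute support count; threshold is illustrative).
for support, pattern in sorted(ps.frequent(2), key=lambda x: -x[0]):
    print(support, pattern)  # e.g. 3 [0, 0, 1]: joy tends to repeat before anger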

3.3.4. Time Series Classification (LSTM)

Once the emotion sequence has been constructed and significant transition points identified through change-point detection in the preceding stages, the final step involves classifying the type of emotional flow in the dialogue based on this sequence. For this purpose, this study applies the Long Short-Term Memory (LSTM) model. LSTM is a variant of the Recurrent Neural Network (RNN) designed to overcome the vanishing gradient and long-term dependency problems that occur during the learning of long sequences. While basic RNNs are prone to information loss or the exploding gradient problem with longer inputs, LSTM introduces three control mechanisms (the input gate, forget gate, and output gate) to enable the selective memorization and deletion of information. This allows the model to maintain long-term contextual information while effectively filtering out what is unnecessary.
This characteristic is particularly important in a conversational context, as emotional states are strongly dependent on the preceding flow of the dialogue, not just the immediately previous utterance. For example, if a sudden negative reaction appears in a conversation that has been generally positive, it is inappropriate to interpret it merely as an independent utterance. Instead, the accumulated context and the contrast effect must be considered together. Therefore, LSTM can be considered a structure optimized for reflecting the temporal continuity of emotions. The LSTM sequentially processes an input sequence X = {x1, x2, …, xT} and outputs a hidden state vector ht at each time step. Internally, this involves operations with a cell state and gates; with each new input, necessary parts of past information are retained while unnecessary parts are forgotten. Ultimately, a hidden vector summarizing the entire sequence is generated. This vector is then passed to a softmax layer, which outputs a probability distribution for a specific dialogue type, the presence of a dramatic emotional shift, or an emotion class. This enables the model to learn dialogue-level emotional patterns, moving beyond simple single-utterance classification.
In this study, the LSTM model was trained for 300 epochs on the MELD training set using the Adam optimizer, with performance validated on the dev and test sets. To enable the model to learn the complex temporal dynamics of conversations, we designed a comprehensive input feature vector for each utterance (time step t). The input to the LSTM model at each time step t was a feature vector constructed by concatenating three types of information:
  • Emotion Label: The primary emotion label for the utterance, represented as a one-hot encoded vector (e.g., ‘joy’ as [1, 0, 0, 0, 0, 0, 0]).
  • Change-Point Feature: A binary feature indicating an abrupt emotional shift. This value was set to ‘1’ if the CUSUM algorithm detected a change-point at utterance t, and ‘0’ otherwise. This allows the model to explicitly recognize moments of sharp transition.
  • SPM Pattern Features: Features derived from the mined frequent sequential patterns. To generate these, we first determined the minimum support threshold (min_support) for the PrefixSpan algorithm empirically by evaluating a range of values on the validation set, selecting the one that yielded the most meaningful patterns. Then, for each dialogue, we created a multi-hot binary vector where each dimension corresponded to one of the top-k frequent patterns. A ‘1’ was marked in the vector if the dialogue’s emotion sequence contained that specific pattern. This dialogue-level SPM feature vector was then concatenated with the utterance-level features for every time step in that dialogue.
By aligning and fusing these distinct sources of information, the LSTM model could learn not only from the immediate emotion but also from the broader temporal context, including gradual flows, abrupt shifts, and recurring structural patterns. This integrated approach allows the model to move beyond simple single-utterance classification and learn dialogue-level emotional patterns, thereby achieving a higher level of emotion prediction capability.
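The feature construction and classifier described above can be sketched as follows, assuming a TensorFlow/Keras implementation; the paper does not specify the framework, hidden size, or exact output target, so those choices are illustrative.
# Sketch of the LSTM input construction and classifier (framework and sizes are illustrative).
import numpy as np
import tensorflow as tf

N_EMOTIONS, TOP_K_PATTERNS = 7, 10

def utterance_features(emotion_idx, is_change_point, spm_vector):
    """Concatenate one-hot emotion, CUSUM change-point flag, and dialogue-level SPM multi-hot."""
    one_hot = np.eye(N_EMOTIONS)[emotion_idx]
    return np.concatenate([one_hot, [float(is_change_point)], spm_vector])

def build_dialogue_tensor(emotion_seq, change_points, spm_vector):
    cp = set(change_points)
    return np.stack([utterance_features(e, t in cp, spm_vector)
                     for t, e in enumerate(emotion_seq)])

feature_dim = N_EMOTIONS + 1 + TOP_K_PATTERNS
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, feature_dim)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(N_EMOTIONS, activation="softmax"),  # dialogue-level emotion target
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev), epochs=300)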

4. Results and Performance Evaluation

4.1. Results

This study validated the effectiveness of the proposed multimodal emotion recognition and time-series mining techniques using the MELD (Multimodal EmotionLines Dataset). MELD is constructed from video data of the sitcom ‘Friends’ and includes 1433 dialogues, 13,708 utterances, and various emotion labels (e.g., joy, sadness, anger, fear, neutral). For performance evaluation, this study used the official train/dev/test splits of the MELD dataset without modification.
All experiments were conducted in an Ubuntu 20.04 environment equipped with an Intel Core i9 CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The proposed methodology was implemented using Python (version 3.8.10). For the specific models, OpenAI’s Base model was used for Whisper, and the bhadresh-savani/bert-base-uncased-emotion model from HuggingFace was used for text-based emotion classification. Facial expression recognition was performed using a pre-trained model from the DeepFace library [33]. The time-series classification was implemented based on an LSTM, which was trained for 300 epochs using the Adam optimizer with a learning rate of 0.001.
Figure 6 visually demonstrates the first stage of our research methodology—the multimodal emotion extraction process—using a sample from the MELD dataset. This stage involves independently extracting emotional information from two primary modalities: face and speech. On the left, the ‘Face Frames’ section illustrates the results of the visual modality analysis. Continuous face frames extracted from the MELD dataset’s video clips are analyzed using the DeepFace library, and the expression in each frame is classified into one of six emotion classes, such as ‘Happy’, ‘Surprise’, or ‘Neutral’. This process captures the speaker’s non-verbal, visual cues during the conversation. On the right, ‘Text Analysis’ shows the results of the speech modality analysis. First, the speech from each utterance is automatically transcribed into text using OpenAI’s Whisper model. Subsequently, the transcribed text is used as input for a BERT-based emotion classification model (bhadresh-savani/bert-base-uncased-emotion) and is classified into one of seven classes, including ‘Joy’, ‘Sadness’, and ‘Love’.
This figure is important as it shows the process of generating the foundational data for the subsequent stages of Emotion Fusion and Time Series Analysis. Notably, as can be seen in some rows, the emotion recognized from the face sometimes differs from the emotion recognized from the text, which clearly highlights the limitations of a single-modality approach. The subsequent fusion stage of our research aims to resolve precisely this type of incongruity and integrate the complementary information to perform a more accurate and contextually appropriate emotion analysis.

4.2. Performance Evaluation

To quantitatively evaluate the experimental performance, we used Accuracy, Precision, Recall, and Macro F1-score, which are defined by the following formulas.
Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \quad (6)
Precision = \frac{TP}{TP + FP} \quad (7)
Recall = \frac{TP}{TP + FN} \quad (8)
F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (9)
The Macro F1-score is computed as the unweighted mean of the per-class F1 scores.
For these metrics, TP (True Positive) and TN (True Negative) represent the counts of accurate predictions for the positive and negative classes, respectively. Conversely, FP (False Positive) indicates a Type I error, while FN (False Negative) signifies a Type II error. The Macro F1-score was employed as the principal evaluation criterion to mitigate the potential bias arising from the significant class imbalance inherent to the MELD dataset.
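In practice, these metrics can be computed directly with scikit-learn, where the macro average is the unweighted mean of the per-class scores; the helper below is an illustrative sketch.
# These metrics can be computed with scikit-learn; the macro average is the unweighted
# mean of the per-class scores, which mitigates MELD's class imbalance.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "macro_f1": macro_f1}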
The Emotion Transition Matrix is defined as a matrix that quantifies the transition probabilities between emotional states. Each element Pij represents the probability of transitioning from emotion i to emotion j, and is defined as shown in Equation (10).
P_{ij} = \frac{N_{ij}}{\sum_{k} N_{ik}} \quad (10)
The Emotion Transition Matrix allows for the quantitative understanding of specific emotional flows, making it possible to discover certain patterns (e.g., the transition from ‘joy’ to ‘neutral’).
The CUSUM (Cumulative Sum) algorithm is a technique for detecting points within time-series data where statistically significant changes occur. It is defined as shown in Equation (11).
S_t = \sum_{i=1}^{t} (x_i - \mu) \quad (11)
For a dataset with a mean μ and standard deviation σ, the cumulative sum is calculated. If the CUSUM value exceeds a predefined threshold (h), it is considered a change-point.
To clearly analyze the performance of the proposed model, this study conducted a comparative evaluation by setting up the following various baseline models:
  • Unimodal (Text): A unimodal model that recognizes emotions using only the BERT model based on text data transcribed by Whisper. As text data has been a primary modality in previous multimodal research, it serves as a useful performance baseline.
  • Unimodal (Face): A unimodal model that performs emotion recognition using only the DeepFace-based facial expression recognition model, recognizing emotions from video frame data. This serves as a baseline for evaluating the standalone performance of the visual modality.
  • Unimodal (Text) + Transition Matrix: To better isolate the individual contributions of multimodal fusion and temporal modeling, we introduced this baseline. It constructs an emotion transition matrix solely from the text modality and uses this temporal information as a feature for classification, allowing for a more direct comparison against our proposed multimodal temporal models.
  • Multimodal (Late Fusion): A multimodal fusion model that performs final emotion prediction by combining the emotion results extracted from the text and facial modalities using a simple voting method. This is a basic multimodal approach without any additional time-series analysis techniques applied.
  • Multimodal + Transition Matrix: A model that considers temporal features by applying an Emotion Transition Matrix to the results of the Late Fusion-based multimodal approach. This setup is for evaluating whether utilizing emotion transition characteristics improves model performance.
  • Multimodal + Change Detection: An approach that aims to improve emotion prediction Accuracy by adding CUSUM change-point detection to the multimodal emotion sequence to detect and reflect abrupt emotional changes.
  • Multimodal + SPM (Proposed): The final model proposed in this study, which includes all the methods (multimodal fusion, emotion transition matrix, change-point detection, Emotion SPM, and LSTM-based time-series classification).
The following Table 6 details the performance evaluation results obtained for these various approaches on the MELD dataset.
As shown in Table 6, the text-based Unimodal model achieved a performance of 62.1% on the Macro F1-score, which is superior to the 56.3% of the face-based Unimodal model. This indicates that for the MELD dataset, text is a modality with relatively higher predictive power for emotions.
To better isolate the contributions of multimodal fusion and temporal modeling, we introduced the “Unimodal (Text) + Transition Matrix” baseline. This model, which incorporates temporal information into the strongest unimodal baseline, achieved a Macro F1-score of 67.7%. This result demonstrates a significant performance gain over the static text-only model, confirming that modeling temporal dynamics is beneficial even within a single modality.
Indeed, when we applied the Emotion Transition Matrix to the multimodal results (“Multimodal + Transition Matrix”), the performance increased further to 70.2% on the Macro F1-score. This proves that emotional transition patterns and temporal dependencies play a significant role in emotion prediction, and their impact is amplified when applied to richer, multimodal data. Furthermore, the “Multimodal + Change Detection” model saw a further increase in its F1-score to 71.3% by effectively modeling abrupt emotional changes. Finally, the “Multimodal + Sequence Pattern Mining (Ours)” model, which integrates all the aforementioned approaches, recorded the highest Macro F1-score of 72.9%. This is an improvement of approximately 10.8 percentage points over the initial unimodal baseline and about 5.2 percentage points over the basic multimodal fusion model. This validates that explicitly analyzing the temporal patterns of multimodal data makes multimodal emotion recognition models more robust.
Figure 7 visually represents these results, confirming that the proposed Multimodal + Time-Series model recorded the highest values across all four metrics. In particular, the distinct performance improvement in Accuracy and Macro F1 allows for an intuitive confirmation that the proposed method more effectively captures emotional flow and enhances classification performance. In summary, the experimental results of this study quantitatively prove that the proposed multimodal time-series-based emotion recognition methodology can achieve excellent performance even in complex emotional expression environments like MELD. It particularly shows that time-series analysis, which effectively reflects temporal characteristics, can make a significant contribution to improving emotion recognition Accuracy. These results are highly significant because, unlike previous multimodal approaches that mainly focused on the fusion of individual modalities, this study proposes an integrated approach that explicitly incorporates temporal patterns and points of change. The introduction of the Emotion Transition Matrix and change-point detection techniques, in particular, enabled the simultaneous consideration of both the continuity and abrupt transitions of emotions within a conversational context, allowing for a more sophisticated modeling of the complex emotional changes that occur in real dialogue environments.
Furthermore, the proposed method achieved balanced performance improvements across Accuracy, Precision, Recall, and Macro F1. This demonstrates the overall robustness of the model rather than improvement on a single metric and increases its potential for use in practical applications such as real-time emotion analysis, conversational AI, and emotion-based personalized services. Consequently, this study provides empirical evidence for the need to consider temporal dynamics in an integrated manner, beyond the simple fusion stage, in multimodal emotion recognition research. This suggests an important direction for the future design of multimodal learning and emotion recognition systems and can serve as a foundation for achieving human-like emotional understanding in complex conversational situations.
Furthermore, to provide a more intuitive view of the per-class classification performance, the Confusion Matrix of the final proposed model (Multimodal + SPM) is visualized in Figure 8. The matrix shows the agreement between the true emotion labels (horizontal axis) and the predicted emotion labels (vertical axis) on the MELD test set. As can be seen in Figure 8, the model predicts major emotion classes such as ‘joy’, ‘neutral’, and ‘sadness’ with relatively high Accuracy, with the ‘neutral’ class showing the highest correct classification rate. In contrast, minority classes such as ‘fear’ and ‘disgust’ tended to be confused with ‘sadness’ or ‘neutral’, which we attribute to the class imbalance of the MELD dataset and the ambiguity of some emotional expressions. The Confusion Matrix provides a basis for analyzing inter-class confusion patterns beyond average performance metrics and offers useful guidance for future improvements through fine-tuning or data augmentation.
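As a reproducibility aid, this kind of per-class analysis can be generated with a standard confusion-matrix routine. The snippet below is a minimal sketch using scikit-learn and matplotlib; the label list and the y_true/y_pred arrays are placeholders rather than our actual MELD predictions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

LABELS = ["joy", "anger", "sadness", "fear", "surprise", "love", "neutral"]

# Placeholder predictions; in practice these come from the fused model
# evaluated on the MELD test split.
y_true = ["joy", "neutral", "sadness", "fear", "neutral", "anger"]
y_pred = ["joy", "neutral", "sadness", "sadness", "neutral", "neutral"]

cm = confusion_matrix(y_true, y_pred, labels=LABELS)
disp = ConfusionMatrixDisplay(cm, display_labels=LABELS)
disp.plot(xticks_rotation=45, cmap="Blues")
plt.tight_layout()
plt.show()
```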
An Ablation Study was conducted to quantify how much each time-series mining component contributes to the overall performance of the proposed multimodal emotion recognition system. By observing how performance changes as each component is sequentially added, the effectiveness and contribution of each stage can be verified. The results are summarized in Table 7.
The initial configuration, the “Fusion only” model, classifies emotions at the individual utterance level by simply fusing the results from the face and text modalities; no time-series information is incorporated. This model achieved a Macro F1-score of 67.7%. When the Emotion Transition Matrix was introduced to incorporate the structural information of emotion transitions between utterances, the performance increased to 70.2%. This indicates that using transition probabilities between emotions is effective for overcoming the limitations of static classification and for reflecting the contextual flow of emotions.
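To make the Emotion Transition Matrix concrete, the following minimal sketch counts consecutive emotion pairs in per-dialogue label sequences and row-normalizes the counts into transition probabilities; the sample sequences are toy data, not MELD dialogues.

```python
import numpy as np

NUM_CLASSES = 7  # integer indices as in Table 3 (0 = Joy ... 6 = Neutral)

def transition_matrix(sequences, num_classes=NUM_CLASSES):
    """Estimate P(next emotion | current emotion) from emotion-index sequences."""
    counts = np.zeros((num_classes, num_classes), dtype=float)
    for seq in sequences:
        for cur, nxt in zip(seq[:-1], seq[1:]):
            counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for emotions that never appear as a source state.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy per-dialogue emotion-index sequences (illustrative only).
dialogues = [[0, 0, 6, 2, 1], [6, 6, 0, 0, 2], [2, 1, 6, 0]]
P = transition_matrix(dialogues)
print(P[0])  # transition probabilities out of "Joy"
```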
To visualize this more intuitively, Figure 9 shows the emotion transition probabilities as a heatmap, displaying the probability of moving from major emotional states such as joy, sadness, anger, and neutral to the next emotional state. Notably, joy shows a high probability of leading to joy again or to neutral, while sadness often transitions into anger or neutral. Such a transition matrix plays a fundamental role in predicting the persistence or abrupt change of emotions and serves as foundational information for modeling the time-series flow of emotions. Next, adding the Change-Point Detection technique to emphasize features at points of abrupt emotional change increased the Macro F1-score to 71.3%, which demonstrates that discontinuous transition segments within the emotion sequence act as important clues for improving the model's performance. Incorporating the SPM-derived pattern features raised the score further to 72.3% (Table 7). Finally, integrating an LSTM-based sequence classification model allowed the temporal dependencies and long-term patterns of the emotional flow to be learned; this most complete configuration achieved the highest performance with a Macro F1-score of 72.9%. This result empirically demonstrates that a deep learning model reflecting the time-series structure can significantly enhance quantitative Accuracy in multimodal emotion analysis. Moreover, this hierarchical improvement supports the point that time-based features such as the flow, transition, and discontinuous changes of emotions, which cannot be captured by static fusion alone, have a substantial impact on emotion recognition performance. Therefore, this study confirms that time-series mining techniques can be an essential and practical means of performance improvement in multimodal emotion analysis.
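The LSTM stage can be illustrated with a compact PyTorch module that embeds the integer emotion indices of a dialogue, encodes them with an LSTM, and classifies the whole sequence. This is a minimal sketch: the layer sizes and the number of emotional-flow classes are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class EmotionFlowLSTM(nn.Module):
    """Classify a whole dialogue from its per-utterance emotion indices."""

    def __init__(self, num_emotions=7, embed_dim=16, hidden_dim=32, num_flows=4):
        super().__init__()
        self.embed = nn.Embedding(num_emotions, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_flows)

    def forward(self, seq):                 # seq: (batch, seq_len) of emotion indices
        x = self.embed(seq)                 # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)          # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])           # (batch, num_flows) logits

# Toy batch: two dialogues of five utterances each (indices as in Table 3).
batch = torch.tensor([[0, 0, 6, 2, 1], [6, 6, 0, 2, 2]])
model = EmotionFlowLSTM()
print(model(batch).shape)                   # torch.Size([2, 4])
```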
First, the Emotion Transition Matrix was used to quantitatively analyze the structure of emotion transitions within a dialogue, serving as key data for understanding the overall atmosphere and the structure of the emotional flow from the transition probabilities between specific emotions. Subsequently, a CUSUM-based change-point detection technique was applied to automatically identify points where emotional changes occur abruptly, thereby capturing significant turning points or dramatic mood shifts during the conversation. Above all, this study was designed around the SPM technique, which analyzes emotion sequence patterns based on this transition and change-point information and integrates them into a learnable structure. Specifically, the PrefixSpan algorithm was used to extract frequently occurring emotion transition patterns, and on this basis a model capable of classifying and predicting the types of emotional flow in a dialogue was constructed.
This approach differs from conventional static emotion analysis in that it reflects the continuity, transition characteristics, and patterns of change in emotions; by combining change-point detection with SPM-based analysis, it can capture the complexity and contextuality of the emotional flow in detail. For example, it could be applied to strengthen response strategies by detecting moments of negative emotional transition early in customer service, or it could contribute to narrative structure analysis by automatically identifying dramatic scene changes in dramas and films from their emotional flow. Moreover, the proposed methodology is not limited to the MELD dataset and can be extended to various multimodal conversational data. Based on emotion sequences constructed from inputs such as text, speech, and facial expressions, the SPM technique can effectively extract the regularities and features of emotion transitions, which can serve as a foundational technology for future emotion-based AI systems to understand and respond to user context more deeply.
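As an illustration of the SPM stage, the sketch below mines frequent sequential patterns from integer-coded emotion sequences with the open-source prefixspan package (an assumed third-party dependency, installable via pip install prefixspan); the sequences and the minimum-support value are toy placeholders.

```python
from prefixspan import PrefixSpan

# Toy emotion-index sequences, one per dialogue (0 = Joy ... 6 = Neutral, see Table 3).
sequences = [
    [0, 0, 1, 2, 6],
    [0, 1, 2, 2, 6],
    [6, 0, 1, 2, 0],
    [2, 2, 0, 1, 6],
]

ps = PrefixSpan(sequences)

# Patterns supported by at least 3 of the 4 sequences, returned as (support, pattern).
for support, pattern in ps.frequent(3):
    if len(pattern) >= 2:  # keep transition patterns, not single emotions
        print(support, pattern)
```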

4.3. Computational Cost and Scalability for Real-Time Systems

While our proposed framework demonstrates strong performance, its practical application in real-time interactive systems requires consideration of its computational cost. The framework integrates several deep learning models, including Whisper for speech transcription, BERT for text emotion analysis, and DeepFace for facial analysis, each with considerable computational demands. In its current implementation using base-sized models, the framework is better suited for the offline analysis of conversational data rather than live, low-latency interactions.
For real-time deployment, several optimization strategies could be pursued. Key bottlenecks include the inference times of the BERT and Whisper models. Employing smaller, distilled versions of these models (e.g., TinyBERT, DistilWhisper) or applying model quantization techniques could significantly reduce latency with a manageable trade-off in Accuracy. Similarly, a more efficient frame-sampling strategy for facial analysis, rather than processing every frame, could lessen the computational load. Future work will focus on developing a lightweight version of this framework to balance high performance with the stringent requirements of real-time scalability.
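As one example of the optimizations mentioned above, dynamic quantization can be applied to the linear layers of the BERT-based text classifier for faster CPU inference. The following is a minimal sketch assuming the Hugging Face checkpoint from [14]; it is not a benchmarked configuration from our experiments.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint (see [14]); any fine-tuned BERT emotion classifier works here.
name = "bhadresh-savani/bert-base-uncased-emotion"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Quantize Linear layers to int8 weights for faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("I can't believe this happened!", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits.argmax(dim=-1))  # predicted emotion class index
```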

5. Conclusions

This study proposed a new approach to enhancing the contextuality and consistency of emotion recognition by combining the SPM technique with a multimodal emotion recognition system to quantitatively analyze the flow and transition structure of emotions in dialogue. Most existing emotion recognition models have been limited to statically classifying emotions at the level of individual utterances or images, failing to adequately reflect complex environments like actual conversations where emotions change continuously and interactively.
To overcome this, the present study extracted emotional information from two modalities—text and facial expression—integrated them to form an emotion class sequence, and then utilized a PrefixSpan algorithm-based SPM technique to discover frequently occurring emotional patterns. Through this, it was possible to structurally identify the flow of emotions that repeatedly appear after a certain emotional state, as well as transition tendencies and abrupt emotional shifts.
Furthermore, by quantifying the transition probabilities between emotions through an emotion transition matrix and applying a change detection technique to identify points of rapid emotional state change, important aspects of emotional variation that are easily overlooked in single-point-in-time analysis were effectively captured. The LSTM-based emotion prediction model learned the contextual information of these emotional patterns, demonstrating a more sophisticated and consistent emotion classification performance. Experimental results showed higher Precision and F1-scores than conventional simple multimodal approaches.
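For readers who wish to experiment with the change detection step summarized here, the following is a minimal CUSUM-style sketch applied to a numeric per-utterance signal (for example, a negative-emotion probability); the threshold and drift values are arbitrary placeholders, and the exact statistic used in our pipeline may differ.

```python
import numpy as np

def cusum_change_points(signal, threshold=1.0, drift=0.05):
    """Two-sided CUSUM-style detector: report indices where the cumulative
    deviation from the running mean exceeds the threshold."""
    x = np.asarray(signal, dtype=float)
    mean = x[0]
    pos = neg = 0.0
    changes = []
    for i, value in enumerate(x[1:], start=1):
        mean += (value - mean) / (i + 1)         # running mean of x[0..i]
        pos = max(0.0, pos + value - mean - drift)
        neg = max(0.0, neg - value + mean - drift)
        if pos > threshold or neg > threshold:
            changes.append(i)
            pos = neg = 0.0                       # restart after a detection
    return changes

# Toy signal: negative-emotion probability per utterance; the level jumps at index 5.
probs = [0.1, 0.12, 0.08, 0.11, 0.09, 0.7, 0.75, 0.72, 0.68, 0.71]
print(cusum_change_points(probs, threshold=0.8))
```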
Through an Ablation Study, it was shown that the use of SPM-based emotion flow information makes a substantial contribution to performance improvement. In addition, the interpretability of the model was enhanced through the visualization of the Confusion Matrix and the Emotion Transition Matrix, which contributes to a clear understanding of confusion patterns and transition paths between emotions and has strong potential for use in application fields such as personalized emotional response systems, emotion-based conversational systems, and mental health monitoring.
Future research plans include extending this work into a more precise emotion prediction system through the analysis of long-term trends within emotion sequences, integration with Transformer-based models, and the incorporation of external contextual information such as the interlocutor’s emotions or situational background. Research will focus on verifying the robustness of our framework through extensive cross-dataset validation on other conversational emotion datasets (e.g., IEMOCAP). Additionally, securing multilingual datasets will be an important task to confirm its broader generalization performance.

Author Contributions

Conceptualization, K.C.; methodology, Y.S.; software, Y.S.; validation, Y.S. and K.C.; formal analysis, Y.S.; investigation, Y.S.; resources, K.C.; data curation, Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S. and K.C.; visualization, Y.S.; supervision, K.C.; project administration, K.C.; funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Kyonggi University Research Grant 2025.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: [https://github.com/declare-lab/MELD, accessed on 20 August 2025], cited as [26].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, Y.; Mi, Q.; Gao, T. A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions. Biomimetics 2025, 10, 418. [Google Scholar] [CrossRef]
  2. Maithri, M.; Raghavendra, U.; Gudigar, A.; Acharya, U.R. Automated Emotion Recognition: Current Trends and Future Perspectives. IEEE Trans. Cogn. Dev. Syst. 2021, 13, 1140–1152. [Google Scholar] [CrossRef] [PubMed]
  3. Pan, B.; Hirota, K.; Jia, Z.; Dai, Y. A Review of Multimodal Emotion Recognition from Datasets, Preprocessing, Features, and Fusion Methods. Neurocomputing 2023, 561, 126866. [Google Scholar] [CrossRef]
  4. Lian, H.; Lu, C.; Li, S.; Zhao, Y.; Tang, C.; Zong, Y. A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. Entropy 2023, 25, 1440. [Google Scholar] [CrossRef]
  5. Xiang, P.; Lin, C.; Wu, K.; Bai, O. MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition. arXiv 2024, arXiv:2404.18327. [Google Scholar]
  6. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7–11 September 2017; pp. 1103–1114. [Google Scholar]
  7. Tsai, Y.H.H.; Bai, S.; Yamada, M.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar]
  8. Farhadipour, A.; Qaderi, K.; Akbari, M. Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts. arXiv 2025, arXiv:2503.06805. [Google Scholar] [CrossRef]
  9. Zhao, Z.; Xu, C.; Zhang, W. Improving Multimodal Emotion Recognition via Acoustic Adaptation and Visual Alignment. arXiv 2024, arXiv:2409.05015. [Google Scholar]
  10. Ge, M.; Hou, Y.; Song, Y. Early Joint Learning of Emotion Information for Multimodal Conversation. arXiv 2024, arXiv:2409.18971. [Google Scholar]
  11. Li, J.; Yu, R.; Huang, H.; Yan, H. Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion. arXiv 2025, arXiv:2503.23721. [Google Scholar] [CrossRef]
  12. Lei, Y. In-depth Study and Application Analysis of Multimodal Emotion Recognition Methods: Multidimensional Fusion Techniques Based on Vision, Speech, and Text. Appl. Comput. Eng. 2024, 107, 73–80. [Google Scholar] [CrossRef]
  13. Zhang, W.; Liu, J.; Fu, Y. Learning from a Progressive Network of Multigranular Interactions for Multimodal Emotion Recognition. In Proceedings of the 26th ACM international conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1512–1520. [Google Scholar]
  14. Savani, B. Bert-Base-Uncased-Emotion. HuggingFace Model Card. 2020. Available online: https://huggingface.co/bhadresh-savani/bert-base-uncased-emotion (accessed on 27 August 2025).
  15. Serengil, S.; Ozpinar, A. LightFace: A Hybrid Deep Face Recognition Framework. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–5. [Google Scholar]
  16. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  17. Kshirsagar, R.; Morency, L.P.; Palm, R. Multimodal dialog systems with context modeling using hierarchical encoder-decoder architecture. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5680–5684. [Google Scholar]
  18. Hassan, A.; Mahmood, A.; Durrani, T.S. Change-point detection in emotional speech signals using RNNs. Pattern Recognit. Lett. 2021, 143, 40–47. [Google Scholar]
  19. Yu, J.; Li, B.; Zhang, Z.; Wang, X. Dual-Stage Cross-Modal Network for Dynamic Emotion Trajectory Modeling in Conversations. Front. Psychiatry 2025, 16, 1579543. [Google Scholar]
  20. Meng, T.; Zhang, F.; Shou, Y.; Shao, H.; Ai, W.; Li, K. Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation. arXiv 2024, arXiv:2407.16714. [Google Scholar] [CrossRef]
  21. Sun, L.; Lian, Z.; Liu, B.; Tao, J. HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition. arXiv 2024, arXiv:2401.05698. [Google Scholar] [CrossRef]
  22. Chung, K.Y.; Na, Y.J.; Lee, J.H. Development of Fashion Design Recommender System using Textile based Collaborative Filtering Personalization Technique. J. KIISE Comput. Pract. Lett. 2003, 9, 541–550. [Google Scholar]
  23. Kim, J.C.; Cho, E.B.; Chang, J.H. Construction of Dataset for the 5 Major Violent Crimes Through Collection and Preprocessing of Judgment. J. Artif. Intell. Converg. Technol. 2025, 5, 11–16. [Google Scholar]
  24. Lee, S.E.; Yoo, H.; Chung, K. Pose Pattern Mining Using Transformer for Motion Classification. Appl. Intell. 2024, 54, 3841–3858. [Google Scholar] [CrossRef]
  25. Jo, S.M. A Study on Technical Analysis of Efficient Recommendation Systems. J. Artif. Intell. Converg. Technol. 2025, 5, 17–22. [Google Scholar]
  26. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Mihalcea, R.; Cambria, E. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019. [Google Scholar]
  27. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  28. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  29. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  30. Kim, M.J.; Chung, K. T5-Based Anomaly-Behavior Video Captioning Using Semantic Relation Mining. Appl. Soft Comput. 2025, 185, 113923. [Google Scholar] [CrossRef]
  31. Kim, G.I.; Chung, K. ViT-Based Multi-Scale Classification Using Digital Signal Processing and Image Transformation. IEEE Access 2024, 12, 58625–58638. [Google Scholar] [CrossRef]
  32. Lim, S. Trends in Automation Technologies for Computer Graphics Using Artificial Intelligence. J. Artif. Intell. Converg. Technol. 2025, 5, 37–42. [Google Scholar]
  33. Serengil, S.I.; Ozpinar, A. DeepFace: A Lightweight Face Recognition and Facial Attribute Analysis (Age, Gender, Emotion and Race) Framework for Python. arXiv 2021, arXiv:2101.06938. [Google Scholar]
Figure 1. Example of affective analysis for fabric patterns using adjective pairs.
Figure 2. Process flow of the multimodal emotion recognition and analysis.
Figure 3. Overview of multimodal preprocessing and emotion extraction pipeline.
Figure 4. Visualization of emotion fusion and time-series construction.
Figure 5. MELD dialogue flow and example of multimodal emotion labels.
Figure 6. Multimodal emotion extraction from facial frames and transcribed text in the MELD dataset.
Figure 7. Performance comparison of unimodal, multimodal, and proposed methods.
Figure 8. Confusion matrix of the proposed model on the MELD test set.
Figure 9. Emotion transition matrix visualized as a heatmap, showing probabilities of moving from one emotion to another.
Table 1. Summary Statistics of the MELD Dataset.
Category | Value
Total Dialogues | 1433
Total Utterances | 13,703
Average Utterance Length | Approx. 9 s
Average Emotion Length | 2.7 s
Number of Emotion Classes | 7
Frames per Video | Approx. 30–60
Table 2. Example of Emotion Sequence from MELD dataset.
Dialogue ID | Utterance ID | Text Emotion | Face Emotion | Fused Emotion | Timestamp (s)
757 | 17 | Joy | Sad | Joy + Sad | 0–3
757 | 18 | Joy | Joy | Joy | 3–6
757 | 19 | Anger | Neutral | Anger | 6–9
757 | 20 | Sadness | Sadness | Sadness | 9–12
757 | 21 | Neutral | Joy | Joy | 12–15
Table 3. Emotion Class and Integer Index Mapping.
Emotion Class | Integer Index
Joy | 0
Anger | 1
Sadness | 2
Fear | 3
Surprise | 4
Love | 5
Neutral | 6
Table 4. Emotion sequence matrix.
Video ID | 1 | 2 | 3 | 4 | 5 | 6 | 7
V001 | 0 | 0 | 1 | 1 | 0 | 0 | 0
V002 | 1 | 1 | 1 | 0 | 0 | 0 | 0
V003 | 0 | 1 | 0 | 0 | 0 | 0 | 0
V004 | 0 | 0 | 0 | 0 | 0 | 0 | 0
V005 | 1 | 1 | 1 | 0 | 0 | 0 | 0
V006 | 0 | 1 | 0 | 0 | 0 | 0 | 0
V500 | 1 | 1 | 1 | 0 | 0 | 0 | 0
Table 5. Sequential Patterns by Emotion Class.
Prefix | Subset of Sequential Patterns
<0> | <0, 0, 1>, <0, 1, 2>, <0, 0, 3>, <0, 2, 2>, <0, 3, 0>, <0, 0, 1, 0>, <0, 2, 4>, <0, 1, 1>, <0, 0, 0>, <0, 4, 2>, <0, 2, 5>, <0, 3, 1>, <0, 2, 0>, <0, 6, 6>, <0, 4, 0>, <0, 0, 6>, …
<1> | <1, 2, 2>, <1, 0, 1>, <1, 4, 4>, <1, 3, 3>, <1, 1, 0>, <1, 2, 5>, <1, 6, 1>, <1, 0, 0>, <1, 3, 1>, <1, 5, 0>, <1, 1, 2>, <1, 6, 3>, <1, 6, 0>, <1, 1, 6>, <1, 2, 1>, <1, 3, 1>, …
<2> | <2, 0, 3>, <2, 2, 0>, <2, 1, 6>, <2, 3, 0>, <2, 0, 0>, <2, 4, 2>, <2, 6, 3>, <2, 2, 1>, <2, 3, 3>, <2, 5, 0>, <2, 2, 4>, <2, 1, 0>, <2, 6, 1>, <2, 2, 6>, <2, 0, 2>, <2, 2, 0>, …
<3> | <3, 3, 2>, <3, 0, 0>, <3, 2, 5>, <3, 4, 0>, <3, 1, 3>, <3, 0, 2>, <3, 3, 1>, <3, 0, 4>, <3, 5, 2>, <3, 2, 3>, <3, 6, 0>, <3, 1, 2>, <3, 4, 6>, <3, 6, 2>, <3, 3, 0>, <3, 0, 2>, …
<4> | <4, 0, 3>, <4, 4, 3>, <4, 2, 1>, <4, 0, 0>, <4, 1, 4>, <4, 3, 2>, <4, 6, 1>, <4, 0, 2>, <4, 2, 4>, <4, 5, 3>, <4, 1, 0>, <4, 3, 5>, <4, 0, 6>, <4, 6, 0>, <4, 1, 2>, <4, 0, 1>, …
<5> | <5, 0, 6>, <5, 2, 3>, <5, 0, 0>, <5, 4, 1>, <5, 1, 2>, <5, 3, 0>, <5, 2, 5>, <5, 0, 3>, <5, 5, 1>, <5, 6, 2>, <5, 1, 1>, <5, 2, 0>, <5, 0, 1>, <5, 6, 0>, <5, 5, 0>, <5, 5, 1>, …
<6> | <6, 6, 2>, <6, 0, 1>, <6, 3, 4>, <6, 2, 6>, <6, 1, 1>, <6, 4, 0>, <6, 0, 2>, <6, 5, 3>, <6, 1, 5>, <6, 3, 0>, <6, 2, 2>, <6, 0, 0>, <6, 1, 0>, <6, 2, 0>, <6, 6, 0>, <6, 6, 2>, …
Table 6. Proposed Methods Performance Evaluation.
Model | Accuracy (%) | Precision (%) | Recall (%) | Macro F1 (%)
Unimodal (Text) | 64.3 | 63.2 | 61.8 | 62.1
Unimodal (Face) | 58.9 | 57.7 | 55.1 | 56.3
Unimodal (Text) + Transition Matrix | 68.7 | 68.1 | 67.4 | 67.7
Multimodal (Late Fusion) | 68.7 | 68.1 | 67.4 | 67.7
Multimodal + Transition Matrix | 70.8 | 70.2 | 69.1 | 70.2
Multimodal + Change Detection | 71.5 | 71.0 | 70.2 | 71.3
Multimodal + SPM (Ours) | 74.2 | 73.5 | 72.1 | 72.9
Table 7. Ablation study results.
Model | Macro F1 (%)
Multimodal (Fusion only) | 67.7
Emotion Transition Matrix | 70.2
Change-Point Detection | 71.3
SPM Features | 72.3
LSTM Classification (Full) | 72.9