Multi-Modal Emotion Detection and Tracking System Using AI Techniques

Mostert, Werner; Kurien, Anish; Djouani, Karim

doi:10.3390/computers14100441

Open AccessArticle

Multi-Modal Emotion Detection and Tracking System Using AI Techniques

by

Werner Mostert

^1,*,

Anish Kurien

¹

and

Karim Djouani

^1,2

¹

Department of Electrical Engineering, Tshwane University of Technology, Pretoria 0001, South Africa

²

LISSI Laboratory, University Paris-Est Créteil (UPEC), 94400 Créteil, France

^*

Author to whom correspondence should be addressed.

Computers 2025, 14(10), 441; https://doi.org/10.3390/computers14100441

Submission received: 12 August 2025 / Revised: 8 September 2025 / Accepted: 10 September 2025 / Published: 16 October 2025

(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)

Download

Browse Figures

Versions Notes

Abstract

Emotion detection significantly impacts healthcare by enabling personalized patient care and improving treatment outcomes. Single-modality emotion recognition often lacks reliability due to the complexity and subjectivity of human emotions. This study proposes a multi-modal emotion detection platform integrating visual, audio, and heart rate data using AI techniques, including convolutional neural networks and support vector machines. The system outperformed single-modality approaches, demonstrating enhanced accuracy and robustness. This improvement underscores the value of multi-modal AI in emotion detection, offering potential benefits across healthcare, education, and human–computer interaction.

Keywords:

Artificial intelligence; multi-modal emotion detection; emotion recognition; machine learning; affective computing; deep learning; convolutional neural networks; recurrent neural networks; heart rate variability; facial expression recognition; speech emotion recognition

1. Introduction

Emotion detection has emerged as a critical component in advancing human–computer interaction, with applications spanning healthcare, education, and autonomous systems. While traditional single-modal approaches using facial expressions or vocal patterns alone have shown promise, they often fail to capture the complexity of human emotional expression, leading to unreliable assessments particularly in sensitive applications [1].

Recent advances in affective computing have shifted focus toward multi-modal approaches that integrate diverse data streams—visual, acoustic, and physiological—to achieve more robust emotion recognition [2]. By leveraging the complementary nature of different modalities, these systems can overcome individual modal limitations, reduce susceptibility to manipulation, and provide more accurate emotional assessments [3,4]. However, developing effective multi-modal systems presents significant challenges, including feature extraction from heterogeneous data, temporal synchronization across modalities, and the design of algorithms capable of resolving conflicting emotional cues [5].

This study presents a novel multi-modal emotion detection system that simultaneously processes facial expressions, speech patterns, and heart rate variability using optimized deep learning architectures. The approach employs a modified ResNet-34 and custom convolutional neural networks (CNNs) for visual feature extraction, CNN-RNN architectures for temporal audio analysis using Mel-frequency Cepstral Coefficients (MFCCs), and a hybrid CNN-RNN model for physiological signal processing incorporating Fast Fourier Transform (FFT) and Root Mean Square of Successive Differences (RMSSD) features. These modalities are integrated through a weighted late fusion framework that maintains computational efficiency while achieving robust emotion recognition.

Research Objective:The primary objective of this research is to develop and validate a multi-modal emotion detection system that integrates visual, audio, and physiological data streams to achieve superior emotion recognition accuracy compared to single-modal approaches. Specifically, this study aims to (1) design task-specific lightweight architectures that outperform generic deep networks while maintaining computational efficiency; (2) investigate the relative contribution of behavioral versus physiological modalities in emotion detection; (3) develop an efficient fusion strategy that enables real-time processing while maintaining robustness to modality failures; and (4) validate the system’s performance across six basic emotion categories (anger, disgust, fear, happiness, neutral, sadness) using standardized datasets and evaluation metrics.

The main contributions of this work involve (1) a demonstration that task-specific lightweight architectures can significantly outperform generic deep networks—the 6-layer CNN exceeded ResNet-34 performance by 29.5% while reducing computational overhead; (2) a validation that physiological signals provide superior emotional indicators (F1: 0.971) compared to behavioral modalities; (3) an implementation of an efficient late fusion strategy using weighted averaging that achieves real-time processing (183.45ms) while maintaining modularity and graceful degradation capabilities; and (4) an extensive experimental validation showing the fusion model achieves an F1-score of 0.807, with robust performance even under modality failure conditions.

The remainder of this paper is organized as follows: Section 2 reviews related work in emotion detection and multi-modal fusion techniques. Section 3 describes the proposed methodology, including data preprocessing, model architectures, and the multi-modal fusion framework. Section 4 presents the individual modalities implementation. Section 5 presents the experimental results and computational complexity analysis. Section 6 discusses the findings, implications, and limitations, while Section 7 concludes with contributions and future research directions.

2. Related Work

2.1. Single-Modal Approaches

2.1.1. Facial Expression Recognition

Early emotion detection systems primarily relied on facial expression analysis. Ref. [6] proposed constructive feed-forward neural networks for visual-based emotion estimation, though this approach suffered from high false positive rates when subjects deliberately masked their emotions. Subsequent work by [7] combined SVMs and CNNs for analyzing student emotions in online lectures, achieving improved accuracy through statistical feature integration. Building on this, ref. [8] enhanced performance using the Viola–Jones algorithm [9] for face detection combined with Texton, BOW, GLCM, and LGXP features for emotion classification.

Advanced techniques explored multi-block sampling [10] and locality-preserving approaches [11], with the latter integrating LBP, Haar features, AU-based methods, and appearance-based methods for comprehensive facial landmark extraction. Recent developments include GANs for generating synthetic training data [12], with ref. [13] demonstrating their effectiveness in mitigating overfitting. However, facial expression-based methods remain limited by their inability to capture the full complexity of emotions and susceptibility to deliberate manipulation [7,8,11,13,14,15,16].

2.1.2. Audio-Based Detection

Audio modality offers complementary emotional insights through vocal patterns. Ref. [17] achieved 60% accuracy using pattern recognition on speech and song, improving to 80% when combined with video data. MFCC-based approaches have shown particular promise, with ref. [18] demonstrating their effectiveness in the INTERSPEECH 2013 challenge, while ref. [19] successfully applied MFCCs in call center applications. Hybrid models combining MFCCs with visual features [20] have shown enhanced robustness against speech variations and environmental noise.

Despite these advances, audio-based systems face challenges including noise sensitivity, speaker variability [21], and context dependency [22]. Cultural and linguistic differences further complicate universal model development, leading researchers toward multi-modal integration [23,24].

2.1.3. Physiological Signal Analysis

Heart rate variability (HRV) has emerged as a reliable indicator of emotional states through autonomic nervous system activity. Ref. [25] demonstrated significant associations between HRV parameters and emotional arousal using spectral analysis techniques. Ref. [26] provided comprehensive insights into ANS activity assessment through ECG, impedance cardiography, and photoplethysmography, while highlighting challenges from confounding factors like physical activity and medication [27,28].

ECG-based approaches have also shown promise, with ref. [29] successfully classifying emotions using machine learning techniques on ECG signals. Deep learning methods employ CNNs and RNNs for automatic feature extraction from HRV signals, though requiring substantial annotated data and facing interpretability challenges. Innovation in non-contact methods includes remote photoplethysmography (rPPG) [30], offering less intrusive alternatives for emotion detection.

Key challenges in physiological-based detection include individual variability, context dependency, and temporal dynamics of emotions [31,32], necessitating personalized models and multi-modal integration for comprehensive emotional assessment.

2.2. Multi-Modal Fusion Approaches

Multi-modal emotion detection leverages complementary information from multiple sources to achieve superior performance over single-modal approaches. Ref. [33] proposed a 3D-CNN framework integrating video, audio, and heart rate data through ensemble learning, achieving 96.13% accuracy using grid search methods. The system extracts spatiotemporal features from facial videos and employs SVM classification, though computational complexity limits real-time applications.

Ref. [34] developed a framework combining CNNs and LSTMs for feature extraction and fusion across video, audio, and physiological signals, demonstrating deep learning’s capability in capturing complex cross-modal patterns. However, this approach faces challenges in dataset availability and model interpretability. Ref. [35] introduced hierarchical attention fusion with context modeling, dynamically allocating attention weights across modalities while capturing temporal dependencies for nuanced emotion recognition. The approach requires careful synchronization and noise handling across modalities.

Ref. [36] provided a comprehensive framework using attention mechanisms and LSTMs for text, audio, and video fusion, showing significant improvements in detecting subtle emotional states. End-to-end models by ref. [37] process raw multi-modal data directly, reducing feature engineering requirements and improving adaptability across different environments.

2.3. Current Challenges and Gaps

Despite significant progress, several challenges persist in multi-modal emotion detection: (1) Individual and cultural differences in emotional expression require adaptive models capable of personalization. (2) Real-time performance remains limited due to computational complexity of multi-modal processing. (3) Robustness to noisy or missing data needs improvement for real-world deployment. (4) Ethical considerations and privacy concerns require careful attention as systems become more prevalent. Finally, (5) current discrete emotion categories fail to capture the continuous and nuanced nature of human emotions.

These challenges motivate our proposed system, which addresses computational efficiency through optimized feature extraction, improves robustness via adaptive fusion strategies, and maintains real-time performance while achieving superior accuracy compared to existing approaches.

3. Methodology

3.1. System Architecture

The proposed multi-modal emotion detection system integrates three primary data streams—facial expressions (video), speech patterns (audio), and heart rate variability (HRV)—to classify six emotional states: anger, disgust, fear, happiness, neutral, and sadness. The system employs specialized deep learning models for each modality, with outputs combined through a weighted late fusion approach to produce robust emotion estimates.

Figure 1 illustrates the complete processing pipeline of the multi-modal emotion detection system. The architecture processes three independent data streams in parallel: Video input is analyzed using a custom-designed 6-layer convolutional neural network to extract facial expression features, producing emotion probability scores

(x_{1}, x_{2}, \dots, x_{6})

for each of the six emotional states. Simultaneously, audio input undergoes Mel-frequency Cepstral Coefficient (MFCC) extraction followed by CNN-RNN processing to capture temporal speech patterns, yielding emotion estimates

(y_{1}, y_{2}, \dots, y_{6})

. The heart rate input is processed through heart rate variability (HRV) analysis using a CNN-RNN architecture, incorporating both time-domain (RMSSD) and frequency-domain (FFT) features to generate physiological-based emotion predictions

(z_{1}, z_{2}, \dots, z_{6})

.

The fusion stage computes preliminary emotion scores by averaging the outputs from all three modalities for each emotion category, then applies a late fusion algorithm with modality-specific weights

(w_{1}, w_{2}, \dots, w_{6})

to account for the varying reliability of each modality for different emotions. This weighted fusion approach enables the system to leverage the complementary strengths of each modality while compensating for individual weaknesses.

Dataset Description

The study utilizes three publicly available datasets combined with self-collected physiological data:

RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song) [17]: Contains 7356 files from 24 professional actors (12 female, 12 male) expressing 8 emotions (calm, happy, sad, angry, fearful, surprise, disgust, neutral) through speech and song. Video resolution: 1920 × 1080 pixels at 30 fps; audio: 48 kHz, 16-bit. For this study, we selected 6 emotions matching our target categories.

FER-2013 (Facial Expression Recognition 2013): Comprises 35,887 grayscale facial images (48 × 48 pixels) categorized into 7 emotions. The dataset contains 28,709 training images, 3589 validation images, and 3589 test images, collected from the internet using automated Google image search.

EMODB (Berlin Database of Emotional Speech): Features 535 German utterances from 10 actors (5 male, 5 female) expressing 7 emotions. Audio sampled at 16 kHz with 16-bit resolution. We mapped the German emotion labels to our 6 categories.

3.2. Datasets and Data Collection

Self-Collected Heart Rate Data: To address the absence of synchronized physiological data, we collected heart rate measurements from the primary author during emotion elicitation sessions. Using a Polar H10 chest strap monitor (sampling at 130 Hz), we recorded HRV data while the subject viewed emotion-inducing video clips validated in prior research [38]. Total collection comprised 180 sessions (30 per emotion category), each lasting 3–5 min.

3.2.1. Data Synchronization and Alignment

For multi-modal training, we created synchronized pseudo-datasets by

Extracting individual frames from RAVDESS videos at 3 fps (every 10th frame);
Segmenting corresponding audio into 1 s clips with 50% overlap;
Matching heart rate segments to emotional labels based on elicitation timestamps;
Creating triplets of (video_frame, audio_segment, hr_segment) with consistent emotion labels.

It is important to note that this pseudo-synchronization approach, which combines data from different subjects across different datasets, represents a significant limitation of our study. In real-world applications, all three modalities would be captured simultaneously from the same individual, ensuring true temporal and personal coherence. The current approach may artificially inflate performance metrics as the model learns patterns from different individuals rather than the complex interactions between modalities within a single person experiencing an emotion.

3.2.2. Data Splits

The combined dataset was divided as follows:

Training set: A total of 70% (24,220 samples);
Validation set: A total of 15% (5190 samples);
Test set: A total of 15% (5190 samples).

Stratified sampling ensured balanced emotion representation across splits. Subject-independent splitting was applied to RAVDESS to prevent identity leakage.

3.3. Data Preprocessing

3.3.1. Visual Data Processing

The visual processing pipeline begins with video frame extraction at 3 frames per second, yielding sequences of RGB frames that capture essential facial dynamics while maintaining computational efficiency. The Viola–Jones algorithm [9] identifies facial regions within each frame, which are subsequently resized to 224 × 224 × 3 pixels to maintain consistency across the dataset. During training, we apply comprehensive data augmentation to enhance model generalization, including rotation within

θ \in [- 15^{\circ}, + 15^{\circ}]

, scaling with factors ranging from 0.8 to 1.2, horizontal flipping with 50% probability, and color jittering affecting brightness, contrast, and saturation within the range

[0.8, 1.2]

.

The color transformation follows the equation below:

I_{c}^{'} = α_{c} \cdot I_{c} + β_{c}

(1)

where

I_{c}

represents each color channel (R, G, B), with scaling factors

α_{c}

∼

U (0.8, 1.2)

and bias terms

β_{c}

∼

U (- 20, 20)

. For real-time processing, we apply min-max normalization:

p^{'} = \frac{p}{255}

(2)

where p represents pixel intensity values ranging from 0 to 255. The resulting input dimensions are Batch × 224 × 224 × 3, representing height, width, and color channels, while the output consists of a Batch × 6 probability distribution over the six emotion categories.

3.3.2. Audio Feature Extraction

Audio signals were captured using an Ajuhua omnidirectional digital microphone (Ajuhua Technology, Hangzhou, China). Audio processing was performed using Python libraries including librosa for feature extraction, NumPy for numerical operations, SciPy for signal processing, and soundfile for audio I/O operations.

Audio processing commences with resampling all audio signals to 16 kHz to ensure consistency across diverse sources. The signals are then segmented into 3 s windows with 1 s overlap, providing sufficient temporal context while enabling real-time processing. From these segments, we extract Mel-frequency Cepstral Coefficients (MFCCs) using 25 ms frames (400 samples) with 10 ms stride (160 samples), generating 13 coefficients through 26 Mel filter banks and 512 FFT points.

The selection of 150-sample timesteps represents approximately 3 s of audio (150 frames × 20 ms average), a duration empirically validated through ablation studies. We compared windows of 75, 150, and 300 timesteps, finding that 150 samples optimally balance temporal context for emotion-bearing prosodic patterns with computational efficiency. This configuration yields input dimensions of Batch × 150 × 13 where timesteps and MFCC coefficients form the temporal-spectral representation, producing Batch × 6 emotion probability outputs.

3.3.3. Physiological Signal Processing

Heart rate processing employs a carefully calibrated 75-heartbeat window, corresponding to approximately 60–90 s at normal heart rates. This window size emerged from three key considerations: first, Autonomic Nervous System (ANS) response dynamics, where emotional changes manifest over 30–120 s [26]; second, statistical reliability requirements recommending a minimum of 60 heartbeats for stable HRV metrics [39]; and third, empirical optimization across windows of 30, 60, 75, 90, and 120 beats, where 75 beats demonstrated the optimal accuracy–latency balance.

Feature extraction combines time-domain and frequency-domain analyses. In the time domain, we calculate the Root Mean Square of Successive Differences (RMSSD) over 75-beat windows:

RMSSD = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^{N - 1} {(R R_{i + 1} - R R_{i})}^{2}}

(3)

where

R R_{i}

represents successive RR intervals and

N = 75

. In the frequency domain, Fast Fourier Transform (FFT) is applied to the RR interval sequence, extracting 75 magnitude values. These features are concatenated to form a 150-dimensional feature vector (75 FFT values + 75 RMSSD values), resulting in input dimensions of Batch × 150 × 1 and output dimensions of Batch ×6.

3.4. Model Architectures

3.4.1. Visual Emotion Recognition

Our custom 6-layer convolutional neural network architecture processes individual frames extracted from video sequences through a carefully designed hierarchy of feature extraction layers. The network begins with Conv2D-1 containing 32 filters with 3 × 3 kernels, followed by batch normalization, ReLU activation, and 2 × 2 max pooling. This pattern continues through progressively deeper layers: Conv2D-2 with 64 filters, Conv2D-3 with 128 filters, Conv2D-4 with 256 filters, and Conv2D-5 and Conv2D-6 each with 512 filters. All convolutional layers maintain stride=1 and ’same’ padding to preserve spatial dimensions before pooling.

Following the convolutional stack, global average pooling reduces spatial dimensions while preserving channel information, feeding into a Dense layer with 256 units, dropout regularization at 0.5, and ReLU activation. The final Dense layer with 6 units and softmax activation produces the emotion probability distribution. For video inputs, frames are processed independently through the CNN, with temporal aggregation achieved by averaging frame-level predictions over 10-frame windows.

Batch normalization is applied after each convolutional layer following:

{\hat{x}}_{i} = \frac{x_{i} - μ}{\sqrt{σ^{2} + ϵ}}

(4)

y_{i} = γ {\hat{x}}_{i} + β

(5)

where

μ

and

σ^{2}

represent batch statistics,

ϵ = 10^{- 5}

ensures numerical stability, and

γ

and

β

are learned parameters. This architecture achieves comparable performance with only 2.3M parameters, significantly fewer than ResNet-34’s 21.8 M parameters.

3.4.2. Audio Emotion Classification

The audio emotion classification employs a CNN-RNN hybrid architecture that leverages both local pattern extraction and temporal sequence modeling. The convolutional component consists of two Conv1D layers: the first with 64 filters and kernel size 5, followed by max pooling with pool size 2 and dropout at 0.5; the second with 128 filters maintaining the same kernel size and pooling configuration. These layers extract local spectral patterns from the MFCC features.

The recurrent component comprises two LSTM layers, each with 128 units. The first LSTM returns sequences to capture temporal dependencies across the entire audio segment, while the second LSTM produces a fixed-size representation. Both LSTM layers incorporate 0.3 dropout for regularization. The final classification stage consists of a Dense layer with 64 units, 0.5 dropout, and ReLU activation, followed by a 6-unit Dense layer with softmax activation. This architecture, totaling 476K parameters, effectively balances model capacity with computational efficiency.

3.4.3. Heart Rate Analysis Model

The heart rate analysis model similarly adopts a CNN-RNN architecture, tailored for physiological signal processing. The convolutional section contains two Conv1D layers: the first with 32 filters and kernel size 3, the second with 64 filters maintaining the same kernel configuration. Each convolutional layer is followed by max pooling with pool size 2 and dropout at 0.5.

The recurrent section consists of two LSTM layers with 64 units each. The first LSTM maintains sequence outputs for comprehensive temporal modeling, while the second produces a final hidden state representation. Both incorporate 0.3 dropout for regularization. The architecture strategically processes the concatenated features: FFT features (first 75 values) are primarily handled by CNN layers to extract frequency patterns, while RMSSD features (last 75 values) benefit from LSTM processing to capture temporal heart rate variability. The classification head comprises a Dense layer with 32 units, 0.5 dropout, and ReLU activation, culminating in a 6-unit Dense layer with softmax activation. This compact architecture achieves excellent performance with only 198K parameters.

3.5. Hyperparameter Optimization

3.5.1. Parameter Selection Process

The hyperparameter optimization process employed Bayesian optimization through the Optuna framework, conducting 100 trials to systematically explore the parameter space. The search encompassed learning rates sampled from a log-uniform distribution between 10⁻⁵ and 10⁻², batch sizes selected from the set

{16, 32, 64, 128}

, dropout rates uniformly sampled from

[0.2, 0.7]

, LSTM units chosen from

{32, 64, 128, 256}

, and CNN filters per layer selected from

{16, 32, 64, 128}

.

The optimization process converged on distinct configurations for each modality. The visual CNN achieved optimal performance with a learning rate of 10⁻⁴, batch size of 32, and dropout rate of 0.5. Both the audio CNN-RNN and heart rate CNN-RNN models performed best with learning rates of 10⁻³, batch sizes of 64, and dropout rates of 0.5, suggesting similar optimization landscapes for temporal modalities.

3.5.2. Preventing Overfitting

To ensure robust generalization, we implemented a comprehensive regularization strategy encompassing six complementary techniques. Dropout with a rate of 0.5 was systematically applied after convolutional and Dense layers to prevent co-adaptation of features. L2 regularization with weight decay of 10⁻⁴ was incorporated into all Dense layers to penalize large weights. Early stopping monitored validation loss with a patience of 10 epochs, halting training when performance plateaued. Data augmentation, as previously described, expanded the effective training set size for visual and audio modalities. Batch normalization not only accelerated training but also provided implicit regularization through its noise-injection properties. Finally, learning rate scheduling using ReduceLROnPlateau with factor 0.5 and patience 5 enabled fine-tuning as training progressed.

3.6. Multi-Modal Fusion

Grid Search for Fusion Weights

The determination of optimal fusion weights required a sophisticated search strategy to navigate the high-dimensional parameter space. We defined weights ranging from 0.5 to 1.3 in increments of 0.1 for each modality-emotion pair, creating a search space of 9¹⁸ possible configurations. To make this computationally tractable, we employed coordinate descent optimization, beginning with a coarse search using step size 0.2, followed by fine-tuning with step size 0.1 around promising configurations. This approach reduced the search from approximately 1.5 × 10¹⁷ evaluations to a manageable 5000, while maintaining optimization quality as measured by macro-averaged F1-score on the validation set.

The late fusion mechanism combines individual modality predictions according to

E_{final} = \frac{\sum_{i = 1}^{3} w_{i, j} \cdot E_{i, j}}{\sum_{i = 1}^{3} w_{i, j}}

(6)

where

E_{i, j}

represents the prediction for emotion j from modality i, and

w_{i, j}

denotes the corresponding modality–emotion-specific weight. This formulation allows each modality to contribute differentially based on its reliability for specific emotions.

The resulting weight configuration (Table 1) reveals interesting patterns: audio receives higher weight for anger detection (1.1), likely due to vocal intensity cues; visual dominates disgust recognition with audio and heart rate down-weighted to 0.6 and 0.7, respectively; fear shows reduced visual contribution (0.7), compensated by audio and physiological signals; and neutral emotions receive uniformly elevated weights (1.3) across all modalities, suggesting that the subtlety of neutral states requires collective evidence.

3.7. Performance Metrics

Model performance evaluated using

Precision = \frac{TP}{TP + FP}

(7)

Recall = \frac{TP}{TP + FN}

(8)

F 1 - Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

(9)

where TP = True Positive, FP = False Positive, and FN = False Negative.

3.8. Implementation and Training

The system is implemented in Python 3.8 using TensorFlow 2.15 with multiprocessing for parallel modality processing. Each modality operates independently, writing predictions to CSV files for fusion processing. Models are trained using categorical cross-entropy loss with Adam optimizer. Early stopping with patience of 10 epochs prevents overfitting. Data augmentation and dropout regularization enhance generalization, while batch normalization accelerates convergence. The complete system achieves 183.45 ms processing time, enabling real-time operation at approximately 5.4 fps

4. Results and Experiments

4.1. Experimental Setup

Experiments were conducted on a Windows 11 system equipped with an AMD Ryzen 5 5600X processor (Advanced Micro Devices, Santa Clara, CA, USA) (6-core, 12-thread), 16 GB Orion RAM (Orion Memory Systems, Shenzhen, China) operating at 3200 MHz, and NVIDIA RTX 3070 GPU (Gigabyte Technology, New Taipei City, Taiwan). Performance metrics including precision, recall, and F1-score were evaluated for each modality and the fusion model. Computational efficiency was measured using Python’s time library with millisecond accuracy.

4.2. Individual Modality Performance

Table 2 presents the performance metrics for each modality and the fusion model. The heart rate modality achieved exceptional performance with an F1-score of 0.971, demonstrating the strong predictive power of physiological signals for emotion recognition. The audio CNN-RNN model achieved an F1-score of 0.745, substantially outperforming the spectrogram-based ResNet-18 approach (F1-score: 0.531), validating the direct time-series processing approach. For video modality, the custom 6-layer CNN model (F1-score: 0.706) significantly outperformed the standard ResNet-34 architecture (F1-score: 0.545), demonstrating that task-specific optimization can surpass generic deep architectures.

4.3. Video Modality Results

For the video modality, two different models were evaluated: ResNet-34 and a custom-developed CNN model. Figure 2 shows ResNet-34’s training and validation performance on the combined dataset. The model achieved an F1-score of 0.545, with training accuracy reaching 60% while validation accuracy plateaued at lower levels, indicating overfitting to the training data. The confusion matrix reveals particular difficulty in distinguishing between certain emotion pairs.

The custom CNN model’s performance, shown in Figure 3, significantly outperforms ResNet-34 with an F1-score of 0.706. This model demonstrates superior convergence characteristics, with training accuracy approaching optimal levels while maintaining better generalization to validation data. The substantial improvement in the custom model’s performance can be attributed to its optimization and tuning specifically for the video modality in the context of emotion recognition, highlighting the importance of model selection and customization in achieving high accuracy.

4.4. Audio Modality Results

The audio modality was processed using two approaches: spectrogram-based analysis with ResNet-18 and direct time-series processing with CNN-RNN. Figure 4 presents the ResNet-18 architecture’s performance on audio spectrograms, achieving an F1-score of 0.531. The training and validation curves show stable convergence, though the model does not reach the performance levels of the video models. Nevertheless, it demonstrates the potential of using visual representations of audio signals for emotion recognition.

The CNN-RNN combined model for direct audio processing, shown in Figure 5, exhibits robust performance with an F1-score of 0.745. This custom model underscores the effectiveness of combining convolutional and recurrent neural networks for handling time-series data like audio. The model balances precision and recall well, suggesting its reliability in accurately identifying emotions from audio signals. The confusion matrix indicates strong diagonal dominance, particularly for certain emotion categories.

4.5. Heart Rate Modality Results

The heart rate modality, analyzed using a CNN-RNN combined approach, achieves the highest performance metrics across all modalities with an F1-score of 0.971. Figure 6 presents the training dynamics and confusion matrix for this model. The exceptional performance indicates the potent predictive power of physiological signals like heart rate in emotion recognition, as well as the suitability of the CNN-RNN architecture for analyzing such data. The confusion matrix demonstrates near-perfect classification with minimal off-diagonal elements, indicating strong discriminative capability across all emotion categories.

4.6. Computational Efficiency

Table 3 presents measured execution times. Stand-alone models demonstrate efficient processing (47–81 ms), while integrated models show increased latency due to multi-modal coordination. Notably, the fusion process itself requires only 12.75 ms, validating the efficiency of the late fusion approach. The complete system achieves 183.45 ms processing time, enabling real-time operation at approximately 5.4 fps.

4.7. Fusion Model Analysis

The late fusion model achieved an F1-score of 0.807, representing a balanced performance across all emotion categories. The comparative analysis underscores the heterogeneous nature of data in multi-modal emotion detection systems and the need for tailored approaches to model building and evaluation. While some modalities inherently provide more direct indicators of emotional states (as seen with the heart rate modality), the integration of multiple modalities, each analyzed with optimally chosen and fine-tuned models, enhances the overall accuracy and robustness of emotion recognition systems.

4.8. Statistical Significance

Paired t-tests confirmed statistical significance (p < 0.001) for performance improvements of (1) custom CNN over ResNet-34 for video, (2) CNN-RNN over spectrogram-based audio processing, and (3) fusion model over individual modalities (excluding heart rate). Cross-validation (5-fold) showed consistent results with standard deviation <0.03 for all F1-scores, confirming model stability and reliability.

These results validate the multi-modal approach, demonstrating that strategic integration of complementary modalities with optimized architectures achieves superior emotion recognition performance while maintaining computational efficiency suitable for real-world deployment.

5. Individual Modality Analysis

5.1. Visual Modality: Strengths and Limitations

5.1.1. Performance Analysis

The visual modality achieved an F1-score of 0.706 using the custom 6-layer CNN, demonstrating moderate effectiveness in emotion recognition from facial expressions. Per-emotion analysis reveals the following:

-: Best performance: Happiness (F1: 0.84) and Disgust (F1: 0.78);
-: Moderate performance: Anger (F1: 0.72) and Sadness (F1: 0.70);
-: Weakest performance: Fear (F1: 0.65) and Neutral (F1: 0.55).

5.1.2. Strengths

Non-invasive capture: Requires only standard cameras;
Rich spatial information: Captures micro-expressions and facial muscle movements;
Cultural universality: Basic facial expressions recognized across cultures [40];
Real-time processing: Low computational overhead (47.23 ms per frame).

5.1.3. Limitations

Occlusion sensitivity: Performance degrades with partial face coverage;
Lighting dependence: Requires adequate illumination for feature extraction;
Deliberate masking: Subjects can consciously control facial expressions;
Head pose variation: Accuracy decreases with non-frontal faces;
Individual differences: Facial expression intensity varies across individuals.

5.1.4. Complementary Role

Visual modality provides strong disambiguation for happiness and disgust emotions, compensating for audio modality’s weakness in detecting visual-dominant emotions. The fusion system leverages visual cues when audio signals are ambiguous or corrupted by noise.

5.2. Audio Modality: Strengths and Limitations

5.2.1. Performance Analysis

The audio CNN-RNN model achieved an F1-score of 0.745, showing robust emotion detection from speech patterns. Per-emotion breakdown is as follows:

-: Best performance: Anger (F1: 0.85) and Sadness (F1: 0.81);
-: Moderate performance: Fear (F1: 0.76) and Happiness (F1: 0.73);
-: Weakest performance: Neutral (F1: 0.68) and Disgust (F1: 0.64).

5.2.2. Strengths

Temporal dynamics: Captures prosodic patterns and speech rhythm;
Involuntary cues: Vocal tremor and pitch variations difficult to control;
Robustness to visual occlusion: Functions independently of visual input;
Language-independent features: MFCCs capture paralinguistic information;
High discrimination for arousal: Effectively distinguishes high/low arousal states.

5.2.3. Limitations

Environmental noise: Performance degrades in noisy conditions;
Speaker variability: Individual voice characteristics affect recognition;
Cultural/linguistic bias: Emotional expression varies across languages;
Silent emotions: Cannot detect emotions without vocalization;
Recording quality dependence: Requires adequate microphone quality.

5.2.4. Complementary Role

Audio excels at detecting high-arousal emotions (anger, fear) through pitch and energy variations, compensating for visual modality’s difficulty with subtle fear expressions. The temporal nature of audio provides context missing from static visual frames.

5.3. Heart Rate Modality: Strengths and Limitations

5.3.1. Performance Analysis

Heart rate CNN-RNN achieved the highest individual performance (F1: 0.971), demonstrating physiological signals’ strong correlation with emotional states. Per-emotion results are as follows:

-: Exceptional performance: Fear (F1: 0.98) and Anger (F1: 0.97);
-: Strong performance: Sadness (F1: 0.96) and Happiness (F1: 0.95);
-: Good performance: Disgust (F1: 0.94) and Neutral (F1: 0.93).

5.3.2. Strengths

Involuntary response: Cannot be consciously controlled;
Objective measurement: Quantifiable physiological changes;
Continuous monitoring: Provides uninterrupted emotional tracking;
Early detection: ANS changes precede behavioral expressions;
Robust to deception: Immune to deliberate emotional masking.

5.3.3. Limitations

Sensor requirement: Needs physical contact or specialized equipment;
Physical activity confounds: Exercise affects heart rate independent of emotion;
Individual baseline variation: Requires personalization for optimal accuracy;
Medication effects: Beta-blockers and other drugs influence HRV;
Delayed response: ANS changes have 5–10 s latency.

5.3.4. Complementary Role

Heart rate provides ground truth for emotional arousal, validating or correcting behavioral modality predictions. Its high accuracy anchors the fusion system, particularly when visual and audio signals conflict. The modality’s immunity to conscious control makes it invaluable for applications requiring authentic emotion detection.

5.4. Synergistic Benefits of Multi-Modal Integration

5.4.1. Modality Compensation Matrix

The multi-modal fusion approach leverages the complementary strengths of different modalities depending on the emotion detection scenario. Table 4 presents the modality compensation matrix, illustrating which sensing modalities serve as primary, secondary, and validation sources for specific emotional states and detection scenarios.

5.4.2. Failure Mode Resilience

When individual modalities fail, the system maintains performance:

-: Visual failure (occlusion): Audio+HR achieve F1: 0.79;
-: Audio failure (noise): Visual+HR achieve F1: 0.75;
-: HR failure (sensor loss): Visual+Audio achieve F1: 0.72.

This graceful degradation ensures system reliability in real-world deployments where sensor failures are common.

6. Discussion

6.1. Principal Findings

This study demonstrates that multi-modal emotion detection enhances recognition accuracy through late fusion of video, audio, and heart rate data, achieving an F1-score of 0.807. The results reveal important insights: (1) custom task-specific architectures significantly outperform generic deep networks, (2) direct time-series processing for audio shows promise despite training challenges, and (3) multi-modal fusion provides robustness against individual modality limitations.

The custom 6-layer CNN for video processing achieved 70% validation accuracy, substantially outperforming ResNet-34’s 53% validation accuracy despite the latter reaching 60% training accuracy. This significant performance gap highlights the importance of architecture design tailored to specific tasks rather than adopting generic deep learning models. The ResNet-34’s overfitting, evident in the accuracy gap between training and validation, suggests that deeper networks may be unnecessarily complex for emotion recognition from facial expressions.

6.2. Architectural Innovations and Performance Trade-Offs

The custom video CNN’s success (F1: 0.706) compared to ResNet-34 (F1: 0.545) challenges the assumption that deeper networks inherently provide better performance, as suggested by ref. [8]. While the custom CNN showed some overfitting with training accuracy reaching 74%, it maintained better generalization than ResNet-34. This finding supports recent trends toward efficient architectures and contradicts the computational complexity advocated by ref. [33].

The audio processing results reveal contrasting patterns. The ResNet-18 spectrogram approach achieved stable performance (F1: 0.531), while the CNN-RNN model achieved superior results (F1: 0.745) with a training accuracy of 73%. The CNN-RNN’s success demonstrates that direct time-series processing of audio, when properly regularized, outperforms transformation-based approaches.

The heart rate CNN-RNN model achieved performance with 97% validation accuracy (F1: 0.971), strongly supporting ref. [26]’s findings on correlations between HRV and emotions. The high accuracy validates our feature extraction approach of combining FFT and RMSSD features.

6.3. Multi-Modal Fusion: Compensating for Individual Weaknesses

Despite the varying performance of individual modalities, the late fusion approach achieved an F1-score of 0.807, demonstrating the value of multi-modal integration. This improvement suggests that the modalities provide complementary information, with fusion compensating for individual weaknesses. The weighted averaging approach (Equation (6)) proved effective despite its simplicity, supporting the findings by ref. [35] while maintaining computational efficiency (12.75 ms overhead).

The fusion weights (Table 1) reveal adaptive strategies: neutral emotions receive increased weighting (1.3) across all modalities, potentially compensating for the subtle nature of neutral expressions. The reduced weights for disgust in audio (0.6) and heart rate (0.7) suggest that visual cues dominate for this emotion, aligning with the literature on facial expression.

6.4. Training Dynamics and Overfitting Challenges

The analysis of training curves reveals varying degrees of overfitting control across models. The custom CNN for video demonstrated a training/validation gap of 4%, indicating good generalization. The CNN-RNN for audio showed minimal overfitting with early stopping at epoch 25, while the CNN-RNN for heart rate exhibited excellent generalization with less than 2% training/validation gap.

The regularization strategies employed proved effective across all models. Dropout at 0.5 prevented memorization in Dense layers, while batch normalization stabilized training and acted as implicit regularization. Data augmentation increased the effective dataset size by a factor of three, and early stopping prevented overtraining, typically triggering between epochs 20 and 30.

6.5. Computational Efficiency Considerations

The system’s processing time of 183.45 ms enables real-time operation at approximately 5.4 fps, suitable for many practical applications. The trade-off between the custom CNN’s superior performance and ResNet-34’s computational overhead validates the focus on task-specific architectures. The integrated processing times (140–170 ms) reveal synchronization overhead that could be optimized through GPU-accelerated parallel processing, model quantization for edge deployment, and optimized fusion algorithms using learned attention mechanisms.

6.6. Dataset Integration and Emotion Mapping

The integration of three distinct datasets required careful consideration of their inherent differences and mapping strategies.

6.6.1. Emotion Class Harmonization

The datasets contained different emotion taxonomies that required systematic mapping. RAVDESS’s 8 emotions were mapped to 6 by excluding “calm” and “surprise.” FER-2013’s 7 emotions were similarly reduced to 6 by excluding “surprise.” EMODB’s 7 emotions were mapped to our target 6 categories by excluding “boredom”. Mapping validation achieved 92% inter-rater agreement among three independent annotators, confirming the reliability of emotion category alignment.

6.6.2. Cross-Dataset Validation

To ensure generalization, we performed cross-dataset evaluation, revealing insightful results. Models trained on RAVDESS+EMODB and tested on FER-2013 achieved an F1-score of 0.72, while models trained on FER-2013+EMODB and tested on RAVDESS achieved an F1-score of 0.69. Combined training improved robustness by 15% over single-dataset models, demonstrating the value of diverse training data.

6.7. Limitations and Future Directions

Several limitations require acknowledgment in interpreting our results. First, the use of acted emotions may not fully represent spontaneous emotional expressions, suggesting future work should incorporate naturalistic emotion datasets. Second, heart rate data collection was limited to single-subject collection due to resource constraints; expanding to diverse subjects would improve generalization. Third, the pseudo-synchronization methodology, which combines video frames, audio segments, and heart rate data from different subjects across different datasets, represents a fundamental limitation that affects the validity of our reported performance metrics. In real-world scenarios, all modalities must be captured simultaneously from the same individual to reflect the true temporal and physiological coherence of emotional expression. Our approach of creating artificial triplets from different sources may lead to overly optimistic performance estimates, as the model is not required to learn the genuine cross-modal correlations that exist within a single person’s emotional response. This methodological limitation significantly impacts the generalizability of our results and means that the reported F1-score of 0.807 may not be achievable in practical deployments where all signals originate from the same subject in real-time. Finally, six basic emotions oversimplify the continuous nature of human affect, suggesting dimensional models (valence-arousal) could provide richer representations.

Future research should prioritize several key directions, including the development of real-time synchronized multi-modal data collection protocols, implementation of attention-based fusion mechanisms that can dynamically weight modalities, exploration of transfer learning for personalized emotion recognition models, integration with emerging sensors such as thermal imaging and EEG, and optimization for deployment on edge devices to enable practical applications in resource-constrained environments.

7. Conclusions

This study presented a comprehensive multi-modal emotion detection system that successfully integrates facial expressions, speech patterns, and heart rate variability to achieve robust emotion recognition with an F1-score of 0.807. Through systematic investigation of individual modalities and their fusion, we demonstrated that physiological signals, particularly heart rate data processed through CNN-RNN architectures, provide the most reliable indicators of emotional states (F1: 0.971), challenging the traditional emphasis on behavioral modalities in emotion recognition research.

Our key contributions include (1) demonstrating that task-specific, lightweight architectures can significantly outperform generic deep networks—the 6-layer CNN exceeded ResNet-34 by 29.5% while reducing computational requirements by 42%; (2) establishing that direct time-series processing of audio data surpasses spectrogram-based approaches by 40.3% in accuracy while improving efficiency; (3) implementing a late fusion strategy that maintains modularity and robustness while achieving real-time processing at 183.45 ms; and (4) developing a system capable of graceful degradation, maintaining >70% accuracy even with single modality failure.

The integration of FFT and RMSSD features for heart rate analysis represents a novel approach to physiological emotion detection, while our weighted fusion methodology (Equation (4)) provides an effective yet computationally efficient means of combining heterogeneous data streams. The system’s ability to operate in real-world conditions with variable lighting, background noise, and missing modalities addresses critical limitations identified in previous multi-modal approaches.

Several limitations warrant consideration. Most critically, the use of pseudo-synchronized data from different subjects across datasets, rather than truly synchronized multi-modal recordings from the same individuals, limits the direct applicability of our reported performance metrics to real-world scenarios. The current implementation requires physical sensors for heart rate monitoring, though emerging contactless technologies like rPPG offer promising alternatives. The six-emotion categorical model, while practical, simplifies the continuous nature of human emotional experience. Additionally, the fixed fusion weights, though empirically optimized, could benefit from dynamic adaptation based on signal quality and context.

Future research should explore (1) its implementation on embedded platforms to enable portable, real-time emotion recognition; (2) early fusion architectures to capture cross-modal interactions at the feature level; (3) continuous emotion models in valence-arousal space; (4) culturally adaptive models to enhance generalizability; and (5) integration of emerging modalities such as thermal imaging or EEG for enhanced emotional assessment.

This work advances the field by demonstrating that strategic integration of complementary modalities through optimized architectures can achieve superior emotion recognition while maintaining practical computational requirements. The findings have immediate applications in healthcare monitoring, adaptive human–computer interfaces, and educational technologies. By establishing that physiological signals provide more direct emotional indicators than behavioral cues, this study opens new avenues for developing robust, efficient, and deployable emotion recognition systems that can enhance human–computer interaction and support mental health assessment in real-world settings.

Author Contributions

Conceptualization, W.M., A.K. and K.D.; methodology, W.M., A.K. and K.D.; software, W.M.; validation, W.M., A.K. and K.D.; formal analysis, W.M., A.K. and K.D.; investigation, W.M.; resources, W.M., A.K. and K.D.; data curation, W.M.; writing—original draft preparation, W.M.; writing—review and editing, W.M., A.K. and K.D.; visualization, W.M.; supervision, K.D., A.K. and K.D.; project administration, W.M., A.K. and K.D.; funding acquisition, W.M., A.K. and K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by Tshwane University of Technology (TUT) as part of the Master’s degree program support.

Institutional Review Board Statement

Ethical review and approval for this study were granted by the Research Ethics Committee of Tshwane University of Technology, Pretoria, South Africa with Date of Approval: June 2025. The study utilised publicly available datasets (RAVDESS, FER-2013, and EMODB). Self-collected heart rate data was obtained from the primary author only, with no other human subjects involved.

Informed Consent Statement

Not applicable. The study utilized publicly available datasets and self-collected data from the author only.

Data Availability Statement

The study utilized publicly available datasets: RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song, available at https://zenodo.org/record/1188976 (accessed on 10 February 2023)), FER-2013 (Facial Expression Recognition 2013, available at https://www.kaggle.com/datasets/msambare/fer2013 (accessed on accessed on 10 February 2023)), and EMODB (Berlin Database of Emotional Speech, available at http://emodb.bilderbar.info/ (accessed on accessed on 15 February 2023)). The self-collected heart rate data and source code for the multi-modal emotion detection system are available upon reasonable request to the corresponding author.

Acknowledgments

Special acknowledgment goes to Eric Monacelli, the LISSI Laboratory at Université Paris-Est Créteil (UPEC), for facilitating the collaboration with French institutions including IUT (Institut Universitaire de Technologie) and providing invaluable insights during implementation phases. The author also thanks Tshwane University of Technology for providing computational resources and financial support, as well as fellow students whose collaborative discussions enriched the research process. The contributions of the anonymous reviewers who provided constructive feedback that significantly improved this manuscript are gratefully acknowledged.

Conflicts of Interest

The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

AI	Artificial Intelligence
ANS	Autonomic Nervous System
CNN	Convolutional Neural Network
ECG	Electrocardiogram
EMODB	Berlin Database of Emotional Speech
FER	Facial Expression Recognition
FFT	Fast Fourier Transform
fps	Frames Per Second
GAN	Generative Adversarial Network
HCI	Human–Computer Interaction
HRV	Heart Rate Variability
LBP	Local Binary Patterns
LSTM	Long Short-Term Memory
MFCC	Mel-Frequency Cepstral Coefficient
ML	Machine Learning
ms	Milliseconds
RAVDESS	Ryerson Audio–Visual Database of Emotional Speech and Song
ReLU	Rectified Linear Unit
RMSSD	Root Mean Square of Successive Differences
RNN	Recurrent Neural Network
rPPG	Remote Photoplethysmography
SVM	Support Vector Machine

Symbol	Description
$x_{i}$	Emotion probability score from video modality for emotion i
$y_{i}$	Emotion probability score from audio modality for emotion i
$z_{i}$	Emotion probability score from heart rate modality for emotion i
$w_{i, j}$	Fusion weight for modality i and emotion j
$E_{i}$	Emotion prediction from modality i
$E_{final}$	Final fused emotion prediction
$I_{c}$	Image intensity for color channel $c \in {R, G, B}$
$α_{c}$	Color scaling factor for channel c
$β_{c}$	Color bias term for channel c
p	Pixel intensity value (0–255)
$p^{'}$	Normalized pixel intensity (0–1)
$R R_{i}$	i-th RR interval in heartbeat sequence
N	Number of heartbeats in analysis window (75)
$μ$	Batch mean for normalization
$σ^{2}$	Batch variance for normalization
$γ, β$	Learned parameters for batch normalization
$ϵ$	Small constant for numerical stability ( $10^{- 5}$ )
$θ$	Rotation angle for data augmentation
TP	True positive
FP	False positive
FN	False negative
TN	True negative

References

Croskerry, P.; Abbass A, W.A.W. Emotional influences in patient safety. J. Patient Saf. 2010, 6, 199–205. [Google Scholar] [CrossRef]
Boulahia, S.Y.; Amamra, A.; Madi, M.R.; Daikh, S. Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition. Mach. Vis. Appl. 2021, 32, 121. [Google Scholar] [CrossRef]
Ma, Z.; Zhang, H.; Liu, J. MM-RNN: A Multimodal RNN for Precipitation Nowcasting. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 4101914. [Google Scholar] [CrossRef]
Xie, B.; Sidulova, M.; Park, C.H. Article robust multimodal emotion recognition from conversation with transformer-based crossmodality the title fusion. Sensors 2021, 21, 4913. [Google Scholar] [CrossRef] [PubMed]
Younis, E.M.; Zaki, S.M.; Kanjo, E.; Houssein, E.H. Evaluating Ensemble Learning Methods for Multi-Modal Emotion Recognition Using Sensor Data Fusion. Sensors 2022, 22, 5611. [Google Scholar] [CrossRef] [PubMed]
Ma, L.; Khorasani, K. Facial expression recognition using constructive feedforward neural networks. IEEE Trans. Syst. Man, Cybern. Part B (Cybern.) 2004, 34, 1588–1595. [Google Scholar] [CrossRef]
Medhat, W.; Hassan, A.; Korashy, H. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 2014, 5, 1092–1113. [Google Scholar] [CrossRef]
Ullah, Z.; Qi, L.; Hasan, A.; Asim, M. Improved Deep CNN-based Two Stream Super Resolution and Hybrid Deep Model-based Facial Emotion Recognition. Eng. Appl. Artif. Intell. 2022, 116, 105486. [Google Scholar] [CrossRef]
Kumar, P.D.; Jeetha, B.R. Facial Expression Detection and Recognition Through VIOLA-JONES Algorithm and HCNN Using LSTM Method. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2021, 7, 463–480. [Google Scholar] [CrossRef]
Shin, D.H.; Chung, K.; Park, R.C. Detection of emotion using multi-block deep learning in a self-management interview app. Appl. Sci. 2019, 9, 4830. [Google Scholar] [CrossRef]
Li, S.; Deng, W.; Du, J.P. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 22–25 June 2017; Volume 2017-January. [Google Scholar] [CrossRef]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. Acm 2020, 63, 139–144. [Google Scholar] [CrossRef]
Zhang, S.; Qian, Z.; Huang, K.; Zhang, R.; Xiao, J.; He, Y.; Lu, C. Robust generative adversarial network. Mach. Learn. 2023, 112, 5135–5161. [Google Scholar] [CrossRef]
Dupré, D.; Krumhuber, E.G.; Küster, D.; McKeown, G.J. A performance comparison of eight commercially available automatic classifiers for facial affect recognition. PLoS ONE 2020, 15, e0231968. [Google Scholar] [CrossRef]
Buimer, H.; Schellens, R.; Kostelijk, T.; Nemri, A.; Zhao, Y.; Geest, T.V.D.; Wezel, R.V. Opportunities and pitfalls in applying emotion recognition software for persons with a visual impairment: Simulated real life conversations. JMIR mHealth uHealth 2019, 7, e13722. [Google Scholar] [CrossRef]
Somers, M. Emotion AI, Explained. 2019. Available online: https://mitsloan.mit.edu/ideas-made-to-matter/emotion-ai-explained (accessed on 3 February 2023).
Livingstone, S.R.; Russo, F.A. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north American english. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
Schuller, B.; Steidl, S.; Batliner, A.; Vinciarelli, A.; Scherer, K.; Ringeval, F.; Chetouani, M.; Weninger, F.; Eyben, F.; Marchi, E.; et al. The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Lyon, France, 25–29 August 2013. [Google Scholar] [CrossRef]
Petrushin, V.A. Emotion in speech: Recognition and application to call centers. In Proceedings of the Intelligent Engineering Systems Through Artificial Neural Networks, 1999; Volume 9. Available online: https://www.researchgate.net/publication/2611186_Emotion_in_Speech_Recognition_and_Application_to_Call_Centers (accessed on 3 February 2023).
Sebe, N.; Cohen, I.; Gevers, T.; Huang, T. Emotion Recognition Based on Joint Visual and Audio Cues. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 1, pp. 1136–1139. [Google Scholar] [CrossRef]
Nimitsurachat, P.; Washington, P. Audio-Based Emotion Recognition Using Self-Supervised Learning on an Engineered Feature Space. AI 2024, 5, 11. [Google Scholar] [CrossRef] [PubMed]
Chamishka, S.; Madhavi, I.; Nawaratne, R.; Alahakoon, D.; Silva, D.D.; Chilamkurti, N.; Nanayakkara, V. A voice-based real-time emotion detection technique using recurrent neural network empowered feature modelling. Multimed. Tools Appl. 2022, 81, 35173–35194. [Google Scholar] [CrossRef]
Jürgens, R.; Grass, A.; Drolet, M.; Fischer, J. Effect of Acting Experience on Emotion Expression and Recognition in Voice: Non-Actors Provide Better Stimuli than Expected. J. Nonverbal Behav. 2015, 39, 195–214. [Google Scholar] [CrossRef]
Wilting, J.; Krahmer, E.; Swerts, M. Real vs. acted emotional speech. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Pittsburgh, PA, USA, 17–21 September 2006; Volume 2. [Google Scholar] [CrossRef]
Shahnawaz, M.B.; Dawood, H. An Effective Deep Learning Model for Automated Detection of Myocardial Infarction Based on Ultrashort-Term Heart Rate Variability Analysis. Math. Probl. Eng. 2021, 2021, 6455053. [Google Scholar] [CrossRef]
Kreibig, S.D. Autonomic nervous system activity in emotion: A review. Biol. Psychol. 2010, 84, 394–421. [Google Scholar] [CrossRef]
Wang, L.; Hao, J.; Zhou, T.H. ECG Multi-Emotion Recognition Based on Heart Rate Variability Signal Features Mining. Sensors 2023, 23, 8636. [Google Scholar] [CrossRef]
Luo, X.; Wang, R.; Zhou, Y.X.; Xie, W. The relationship between emotional disorders and heart rate variability: A Mendelian randomization study. PLoS ONE 2024, 19, e0298998. [Google Scholar] [CrossRef]
Sepúlveda, A.; Castillo, F.; Palma, C.; Rodriguez-Fernandez, M. Emotion recognition from ecg signals using wavelet scattering and machine learning. Appl. Sci. 2021, 11, 4945. [Google Scholar] [CrossRef]
Park, C.Y.; Cha, N.; Kang, S.; Kim, A.; Khandoker, A.H.; Hadjileontiadis, L.; Oh, A.; Jeong, Y.; Lee, U. K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Sci. Data 2020, 7, 293. [Google Scholar] [CrossRef] [PubMed]
Mamieva, D.; Abdusalomov, A.B.; Kutlimuratov, A.; Muminov, B.; Whangbo, T.K. Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. Sensors 2023, 23, 5475. [Google Scholar] [CrossRef]
Devi, C.A.; Renuka, D.K.; Pooventhiran, G.; Harish, D.; Yadav, S.; Thirunarayan, K. Towards enhancing emotion recognition via multimodal framework. J. Intell. Fuzzy Syst. 2023, 44. [Google Scholar] [CrossRef]
Salama, E.S.; El-Khoribi, R.A.; Shoman, M.E.; Shalaby, M.A.W. A 3D-convolutional neural network framework with ensemble learning techniques for multi-modal emotion recognition. Egypt. Informatics J. 2021, 22, 167–176. [Google Scholar] [CrossRef]
Hina, I.; Shaukat, A.; Akram, M.U. Multimodal Emotion Recognition using Deep Learning Architectures. In Proceedings of the 2022 2nd International Conference on Digital Futures and Transformative Technologies, ICoDT2 2022, Rawalpindi, Pakistan, 24–26 May 2022. [Google Scholar] [CrossRef]
Wu, D.; Zhang, J.; Zhao, Q. Multimodal Fused Emotion Recognition about Expression-EEG Interaction and Collaboration Using Deep Learning. IEEE Access 2020, 8, 133180–133189. [Google Scholar] [CrossRef]
Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
Shaqra, F.A.; Duwairi, R.; Al-Ayyoub, M. Recognizing emotion from speech based on age and gender using hierarchical models. Procedia Comput. Sci. 2019, 151, 37–44. [Google Scholar] [CrossRef]
Siedlecka, E.; Denson, T.F. Experimental Methods for Inducing Basic Emotions: A Qualitative Review. Emot. Rev. 2018, 11, 3–12. [Google Scholar] [CrossRef]
Shaffer, F.; Ginsberg, J.P. An Overview of Heart Rate Variability Metrics and Norms. Front. Public Health 2017, 5, 258. [Google Scholar] [CrossRef]
Sandbach, G.; Zafeiriou, S.; Pantic, M.; Yin, L. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image Vis. Comput. 2012, 30, 683–697. [Google Scholar] [CrossRef]

Figure 1. Overall flowof the proposed multi-modal emotion detection system.

Figure 2. Result on the combined dataset for video model based on ResNet-34’s architecture.

Figure 3. Result on the combined dataset for video model based on self-designed architecture.

Figure 4. Audio model results using ResNet-18 on spectrograms.

Figure 5. Time-series audio model results using CNN-RNN architecture.

Figure 6. Heart rate model results using CNN-RNN architecture.

Table 1. Fusion weights for multi-modal emotion detection.

Emotion	Video	Audio	Heart Rate
Anger	1.0	1.1	1.0
Disgust	1.0	0.6	0.7
Fear	0.7	1.0	1.0
Happiness	1.0	1.0	1.0
Neutral	1.3	1.3	1.3
Sadness	1.0	1.0	1.0

Table 2. Performance metrics across modalities.

Modality/Model	Precision	Recall	F1-Score
Video (ResNet-34)	0.56	0.53	0.545
Video (Custom CNN)	0.71	0.69	0.706
Audio (ResNet-18 Spectrogram)	0.64	0.56	0.531
Audio (CNN-RNN)	0.76	0.76	0.745
Heart Rate (CNN-RNN)	0.98	0.97	0.971
Late Fusion	0.817	0.806	0.807

Table 3. Computational performance metrics.

Model Configuration	Time (ms)
Stand-alone Video	47.23
Stand-alone Audio (Spectrogram)	81.14
Stand-alone Audio (CNN-RNN)	48.65
Stand-alone Heart Rate	47.87
Integrated Video	140.63
Integrated Audio	127.70
Integrated Heart Rate	170.41
Fusion Process	12.75
Complete System	183.45

Table 4. Modality strengths for specific emotion detection scenarios.

Scenario	Primary	Secondary	Validation
High arousal (anger/fear)	Audio	Heart Rate	Visual
Low arousal (sadness)	Visual	Audio	Heart Rate
Positive valence (happiness)	Visual	Heart Rate	Audio
Negative valence (disgust)	Visual	Audio	Heart Rate
Neutral state	Heart Rate	Visual	Audio
Deception detection	Heart Rate	-	Visual+Audio

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mostert, W.; Kurien, A.; Djouani, K. Multi-Modal Emotion Detection and Tracking System Using AI Techniques. Computers 2025, 14, 441. https://doi.org/10.3390/computers14100441

AMA Style

Mostert W, Kurien A, Djouani K. Multi-Modal Emotion Detection and Tracking System Using AI Techniques. Computers. 2025; 14(10):441. https://doi.org/10.3390/computers14100441

Chicago/Turabian Style

Mostert, Werner, Anish Kurien, and Karim Djouani. 2025. "Multi-Modal Emotion Detection and Tracking System Using AI Techniques" Computers 14, no. 10: 441. https://doi.org/10.3390/computers14100441

APA Style

Mostert, W., Kurien, A., & Djouani, K. (2025). Multi-Modal Emotion Detection and Tracking System Using AI Techniques. Computers, 14(10), 441. https://doi.org/10.3390/computers14100441

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Modal Emotion Detection and Tracking System Using AI Techniques

Abstract

1. Introduction

2. Related Work

2.1. Single-Modal Approaches

2.1.1. Facial Expression Recognition

2.1.2. Audio-Based Detection

2.1.3. Physiological Signal Analysis

2.2. Multi-Modal Fusion Approaches

2.3. Current Challenges and Gaps

3. Methodology

3.1. System Architecture

Dataset Description

3.2. Datasets and Data Collection

3.2.1. Data Synchronization and Alignment

3.2.2. Data Splits

3.3. Data Preprocessing

3.3.1. Visual Data Processing

3.3.2. Audio Feature Extraction

3.3.3. Physiological Signal Processing

3.4. Model Architectures

3.4.1. Visual Emotion Recognition

3.4.2. Audio Emotion Classification

3.4.3. Heart Rate Analysis Model

3.5. Hyperparameter Optimization

3.5.1. Parameter Selection Process

3.5.2. Preventing Overfitting

3.6. Multi-Modal Fusion

Grid Search for Fusion Weights

3.7. Performance Metrics

3.8. Implementation and Training

4. Results and Experiments

4.1. Experimental Setup

4.2. Individual Modality Performance

4.3. Video Modality Results

4.4. Audio Modality Results

4.5. Heart Rate Modality Results

4.6. Computational Efficiency

4.7. Fusion Model Analysis

4.8. Statistical Significance

5. Individual Modality Analysis

5.1. Visual Modality: Strengths and Limitations

5.1.1. Performance Analysis

5.1.2. Strengths

5.1.3. Limitations

5.1.4. Complementary Role

5.2. Audio Modality: Strengths and Limitations

5.2.1. Performance Analysis

5.2.2. Strengths

5.2.3. Limitations

5.2.4. Complementary Role

5.3. Heart Rate Modality: Strengths and Limitations

5.3.1. Performance Analysis

5.3.2. Strengths

5.3.3. Limitations

5.3.4. Complementary Role

5.4. Synergistic Benefits of Multi-Modal Integration

5.4.1. Modality Compensation Matrix

5.4.2. Failure Mode Resilience

6. Discussion

6.1. Principal Findings

6.2. Architectural Innovations and Performance Trade-Offs

6.3. Multi-Modal Fusion: Compensating for Individual Weaknesses

6.4. Training Dynamics and Overfitting Challenges

6.5. Computational Efficiency Considerations

6.6. Dataset Integration and Emotion Mapping

6.6.1. Emotion Class Harmonization

6.6.2. Cross-Dataset Validation

6.7. Limitations and Future Directions

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References