Article

Early and Late Fusion for Multimodal Aggression Prediction in Dementia Patients: A Comparative Analysis

by Ioannis Galanakis 1,*, Rigas Filippos Soldatos 2, Nikitas Karanikolas 1, Athanasios Voulodimos 3, Ioannis Voyiatzis 1 and Maria Samarakou 1,*
1 Department of Informatics and Computing Engineering, University of West Attica, 12243 Athens, Greece
2 First Department of Psychiatry, Eginition Hospital, National and Kapodistrian University of Athens Medical School, 11528 Athens, Greece
3 School of Electrical and Computer Engineering, National Technical University of Athens, 15780 Athens, Greece
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5823; https://doi.org/10.3390/app15115823
Submission received: 1 April 2025 / Revised: 13 May 2025 / Accepted: 20 May 2025 / Published: 22 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
Aggression in patients with dementia poses significant caregiving and clinical issues. In this work, two fusion approaches—Early Fusion and Late Fusion—were compared to classify aggression using audio and visual signals. Early Fusion integrates the extracted features of the two modalities into one dataset before classification, while Late Fusion integrates the prediction probabilities of standalone audio and visual classifiers with a meta-classifier. Both models were tested using a Random Forest classifier with five-fold cross-validation, and the performance was compared on accuracy, precision, recall, F1-score, ROC-AUC, and inference time. The results show that Late Fusion is superior to Early Fusion in terms of accuracy (0.876 vs. 0.828), recall (0.914 vs. 0.818), F1-score (0.867 vs. 0.835), and ROC-AUC score (0.970 vs. 0.922), proving more suitable for high-sensitivity use cases such as healthcare and security. However, Early Fusion exhibited higher precision (0.852 vs. 0.824), indicating that it is preferable in cases where minimizing false positives is a requirement. Paired t-tests were applied for statistical comparison and indicate that only precision differs significantly, in favor of Early Fusion. Late Fusion also achieves slightly lower inference times, which makes it suitable for use in real-time systems. These findings provide significant information on multimodal fusion strategies and their applicability in the detection of aggressive behavior, which can contribute to the development of efficient monitoring systems for dementia care.

1. Introduction

1.1. Dementia

Dementia is a disorder that affects cognition due to its neurodegenerative and chronic nature. Numerous symptoms are observed, commonly aggression, be it verbal or physical, or agitation. Dementia can also affect memory, language, reasoning, and behavior. It is caused by numerous disorders, the most common being Alzheimer's disease, and it can impair a person's daily functioning over its course. As symptoms worsen over time, patients show a declining course in communication and judgment, as well as mood changes and personality disturbances. Aggression is typically associated with dementia and can occur in the form of verbal as well as physical aggression, attacks, and agitation [1]. Based on prior research, 50% of people with dementia develop at least one form of aggressive behavior, increasing the risk to caregivers and mental health practitioners, as well as to patients themselves.
Aggressive behaviors are elicited by frustration, confusion, pain, or environmental changes. These behaviors pose a significant threat to patients themselves and to healthcare workers, and therefore increase tension and psychological stress. Patients can injure themselves and others, events that can burden them in the long run [2].
Early detection of these symptoms, before their exacerbation, represents a significant objective, since it can enable timely interventions that can reduce the intensity of these episodes and therefore cause less distress to patients and caregivers. Timed interventions can also de-escalate dangerous situations and provide time to implement treatment plans and preventive interventions, thereby reducing harm while limiting pharmacological treatment. Previous research relied primarily on visual data to recognize such violent behaviors. By adding the analysis of audio-based features, a prediction method with improved stability can be provided, based on multimodal analysis, which improves prediction accuracy [3].
The Mini-Mental State Examination (MMSE) is commonly used in cognitive impairment screening, with limited usefulness in dementia prediction for MCI patients. Scientists in a systematic review of 11 studies involving approximately 1569 MCI patients established that the MMSE is imprecise in the prediction of dementia conversion and recommended the use of several longitudinal assessments [4]. In a review, researchers compared 52 cognitive test tools commonly utilized for the identification of Mild Cognitive Impairment (MCI) in its early stages. It was noted in this review that the Montreal Cognitive Assessment (MoCA), Mini-Mental State Examination (MMSE), and Clock Drawing Test (CDT) were the most cited tools, and the Six-item Cognitive Impairment Test (6CIT) and the Hong Kong Brief Cognitive Test (HKBC) were the most recommended, considering various factors such as education and age [5]. Cognitive impairment and dementia screening in emergency departments were topics discussed by another group of researchers. The final results stress the need for more effective and practical screening techniques and more detection strategies in dementia [6].

1.2. Dementia and Machine Learning

Clinical translation of machine learning-based automated dementia diagnosis is promising. However, generalizable machine learning models with improved robustness are also needed. While research is ongoing to this end, there are many examples of inconsistency in validation, whereas the use of common datasets improves interpretability and stability. Therefore, there is a need for increased clinical expertise and focus [7]. Another study presents possible options for the automated prediction of dementia diagnosis through machine learning. A review of models of dementia diagnosis from 2011 to 2022 includes data modality considerations. The results point to excellent performance with many limitations [8]. In studies focusing on dementia, there is a rapid growth of machine learning applications that enable automation of prediction, diagnosis, and treatment of patients.
One comparison between machine learning and deep learning models demonstrates the ability of deep learning models to yield more accurate results than traditional machine learning models, while also commenting on the need for additional computational power and resources. Scarcity of data and applications is a challenge in the further development of such models, even though they can be utilized in dementia care [9].
A meta-analysis describes the complexity of Alzheimer's Disease (AD) diagnosis due to limitations of current methods in cognitive assessments and image analysis. The study explores the viability of machine learning models combined with novel biomarkers and points toward a novel way to enhance the accuracy of AD diagnosis [10]. Similarly, machine learning was also utilized for the prediction of dementia at an early stage, where scores from cognitive function tests were used as features to make the predictions. The ensemble AdaBoost model achieved an exceptionally high accuracy of 83%, which was higher than other models and suggests potential uses of Artificial Intelligence (AI) in diagnosis [11]. Another study addressed Alzheimer's Disease (AD), Frontotemporal Dementia (FTD), and the difficulties these cause for caregivers and patients. A new approach was created within this research utilizing deep learning through Electroencephalography (EEG) classification of AD and FTD with better preprocessing and classification. The results exhibited a potential solution for such an application of deep learning in screening dementia [12].
Finally, another study investigated Alzheimer’s Disease, Amyotrophic Lateral Sclerosis (ALS), and Frontotemporal Dementia (FTD) and their shared molecular pathophysiology with the help of unsupervised learning. The result identified a high correlation between key molecules [13].

1.3. Detection and Prediction of Episodes of Aggression in Patients with Dementia

Aggression and agitation behaviors in dementia are difficult to treat in a timely manner for caregivers and mental health facilities. In a systematic umbrella review, monitoring technologies for activities and behavioral symptoms in elderly neurodegenerative disease patients were investigated. Their application, as promising as it is, is met with a variety of challenges due to methodological diversity and the need for systematic measures in developing ethical AI in medicine [14]. Wearable sensor use has been investigated to develop an individualized machine learning model for Behavioral and Psychological Symptoms of Dementia (BPSD). The study focused on the development of digital biomarkers, further innovating the approach in digital phenotyping of BPSD [15]. Another work introduced a public dataset based on non-intrusive sensors, i.e., wearables or cameras, to monitor agitation-related physical activity in dementia patients. The models showed high accuracy under controlled conditions, but under leave-one-out conditions a decrease was observed, highlighting the difficulty of generalizing across subjects [16]. Another study presented an efficient computation scheme for real-time detection of aggressive behavior with the use of a wrist accelerometer. The wireless actigraphy system captures behavioral symptom information; its utility is limited, however, due to limited computational capability [17]. Subsequently, a machine learning-driven classification model based on the MediaPipe holistic model has been suggested that can recognize and classify aggressive episodes in real time. The proof-of-concept study presents an innovative solution with the potential to predict aggressive behavior in patients with dementia before its onset [18].

1.4. Audio Classification and Machine Learning

A sound classification method has been proposed that uses convolutional neural networks (CNNs) and Mel frequency cepstral coefficients (MFCCs) as sound conversion methods, in order to obtain spectrograms suitable to be used as inputs to CNNs. This method improved the classification accuracy on the dataset from 63% to 97% while offering many benefits, such as faster training [19]. Acoustic quantification of vocal audio diversity has been utilized in a generalizable method with an unsupervised Random Forest model. It was demonstrated that this method could classify acoustic structures with high accuracy, offering a standard tool for music variation comparison [20]. Moreover, Mel spectrograms have been applied with convolutional neural network (CNN) and Long Short-Term Memory (LSTM) network architectures. Results show that a hybrid method, using compression techniques, Taylor pruning, and 8-bit quantization, exhibited high accuracy against traditional methods with minimal computational resources [21]. A systematic review compared machine learning techniques with limited data augmentation in audio-based sound classification, i.e., voice and speech. It presented common challenges, such as low or high noise data, and how they impact classification performance, shedding light on sound analysis for feature derivation [22].
A survey emphasized different deep learning models in audio classification, highlighting five main architectures: CNNs, Recurrent Neural Networks (RNNs), Autoencoders, Transformers, and Hybrid Models, and covering popular audio transformation techniques such as spectrograms and MFCCs [23]. An evaluation of small datasets and augmentation techniques exhibited the potential for improving classification using deep learning on sound features, and examined problems such as noise and data scarcity that make feature extraction difficult and unreliable; recommendations were proposed for enhancing sound classification performance [22]. A recent study explored repeated training of an existing convolutional neural network to analyze the impact that retraining parameters impose on accuracy and processing time in sound datasets [24].
Further, researchers improved their previous findings on the prediction of aggressive behaviors in dementia patients by integrating audio-based violence detection with their present visual-based analysis of body motion. Noise-filtering algorithms such as MFCCs, frequency filtering, and speech prosody were used in an audio recording dataset as a means of extracting pertinent features. Using a Late Fusion Rule, predictions of both models were integrated into a single meta-classifier as a way of achieving better early aggression detection accuracy. The results proved that multimodal enhancement further improved predictive capability, yielding a more accurate model for detecting and predicting aggressive behaviors in clinical populations, thereby facilitating caregivers in interventions [25].

1.5. Machine Learning Multimodal Approaches with Late and Early Fusion Meta-Classifiers

Multimodal machine learning provides an innovative and efficient way to enhance the predictive ability of a model. A fusion machine learning prediction model has alleviated data imbalance by combining classifiers and resampling techniques with improved overall accuracy. A fusion model yielded the highest scores in recall, accuracy, precision, and F1-score, indicating the effectiveness of this fusion method in the diagnosis of breast cancer [26].
Multimodal networks with the ability to handle text and audio are valuable tools for the early diagnosis of mental illness. It was proposed that Early and Late Fusion strategies in multimodal networks for early depression detection showed high classification scores, indicating potential in future depression detection applications [27]. Multimodal emotion prediction based on real-time physiological marker data was found to facilitate emotion recognition automation under the presence of complexities such as heterogeneity of expression.
Various ensemble models formed using real data were compared and combined with base learners. These base learners included K-Nearest Neighbors (KNN), a non-parametric classifier that labels points according to their nearest neighbors; Decision Trees (DT), tree classifiers that decide according to rules over data attributes; Random Forest (RF), an ensemble method that classifies using numerous decision trees; and Support Vector Machines (SVM), supervised models that determine the optimal separating hyperplane for classification [28]. Public datasets and deep learning models were discussed for detection using deep learning approaches in homogeneous and heterogeneous sensing environments [29]. A new model was then proposed that combined emotional speech recognition and emotional analysis with features and various deep learning models. Findings indicated the benefits of ensemble methods and the need for multimodal fusion in affective emotion analysis [30]. Another study provides a neural network solution for emotion recognition with the Wearable Stress and Affect Detection (WESAD) dataset, focused on wrist-acquired physiological signals: Blood Volume Pulse (BVP) and Electrodermal Activity (EDA). The Siamese model's Late Fusion-based approach resulted in 99% accuracy across four different emotional state labels. The model was also hardware-acceleration-optimized, introducing potential real-time applicability in detection systems [31]. Hybrid meta-learning models have achieved strong performance in the research area of time-series forecasting. A research study proposed a time-series-associated meta-learning (TsrML) approach and utilized a meta-learner and base-learner on different datasets, observing fast adaptation to new small-sample data and outperforming other deep learning models in accuracy [32].

1.6. Objective

In this paper, a comparison of the Early Fusion and Late Fusion models is presented for predicting aggression in dementia patients, using both visual and audio data. The main aim is to evaluate the effectiveness of two fusion models and identify the approach that enhances the accuracy of the single models and outperforms other approaches.
A detailed comparative analysis is presented between the final results of the Early and Late Fusion models, thus providing insight into the strengths and weaknesses of each approach. The aim is to further develop this multi-part research and reach a proof-of-concept methodology that could be applied in the prediction of episodes of aggressive behavior in dementia patients.

1.7. Novelty and Key Contributions

This work introduces the construction of an Early Fusion model, based on existing multimodal analyses of both audio and video data, in addition to the existing Late Fusion model. An additional comparative evaluation is then performed, which assesses a new method by comparing the performance and accuracy of both Early and Late Fusion meta-classifiers for the prognostication of episodes of aggressive behavior in patients with dementia. The main contributions of this study are as follows:
Potential Benefits From Previous Work:
  • Multimodal Aggression Prediction: A multimodal solution was presented that combines earlier-developed visual-based aggression analysis with new audio-based aggression analysis, in order to produce a system with consistent results under real-world conditions and improved noise removal;
  • Advanced Audio Processing Techniques: In order to obtain valuable audio features from audio files, advanced audio processing methods were utilized, such as the MFCCs, frequency filtering, and speech prosody, which improved model classification accuracy;
  • Advanced Visual Processing Techniques: In order to successfully extract video features from frame files, advanced video processing techniques were applied, such as landmark extraction, pose analysis, and artifact and noise filtering;
  • Late Fusion Meta-Classifier: A Late Fusion meta-classifier was employed to merge the multi-layered model predictions into one data frame of both audio and visual aggressive predictions;
  • Early Fusion Meta-Classifier: An Early Fusion meta-classifier was developed for the purpose of merging the multi-layered model extracted features into a unified data frame consisting of both audio and visual aggressive cues;
  • Clinical Applicability and Proof of Concept: Results demonstrate the advantages and disadvantages of each fusion model, with the Late Fusion model outperforming the Early Fusion model in most metrics, with an improved detection of early aggression behaviors, offering proof-of-concept validity for facilitation of clinical practice in the future.
New Insights:
  • Results of this multilayer approach verify that when predictions from audio-based aggression analysis are combined with visual predictions, the fusion-generated meta-classifiers enhance aggression prediction, with the Late Fusion model outperforming the Early Fusion model in most metrics, providing potential for clinical application;
  • Correlations of vocal frequencies and physical activity showcase their role in the model improvement.
Practical Implications:
  • This multimodal system could augment real-world clinical systems with a real-time alert system, enabling timely interventions;
  • Due to its enhanced accuracy, this technique has the potential for integration into future clinical circumstances.

1.8. Comparison with Recent and Existing Theoretical Methods in Pose Recognition and Violence Detection

Rodrigues et al. [33] introduced a multimodal fusion system for monitoring shared autonomous vehicle interiors, which integrates object detection and action recognition to identify violent behavior and misplaced objects. The system leverages state-of-the-art deep learning backbones such as I3D, R(2 + 1)D, and YOLOv5, trained on datasets such as MoLa InCar and COCO, and reports real-time performance on embedded automotive platforms, demonstrating its maturity for safety-critical applications. Kumar et al. [34] introduced an AI-driven expert system for smart city surveillance, which combines image-to-image stable diffusion for data augmentation, YOLOv7 for violent object detection, and MediaPipe with an LSTM-based classifier for action recognition. The system demonstrates strong performance with 89.5% mAP for object detection and 88.33% accuracy for action classification, and is optimized for real-time deployment on edge devices with dash camera inputs, thereby enhancing emergency responsiveness in urban environments. Ding and Li [35] introduced a real-time animated character 3D pose recognition system using an enhanced deep convolutional neural network for facial and body pose estimation. Through abstract data structure translation of input poses and a dynamic character animation generation process, the system achieves high-speed inference rates (up to 384 fps) and notable accuracy improvements (~3.5%) on diverse datasets, demonstrating its efficiency against traditional pose estimation methods in terms of performance and computational complexity. Xu et al. [36] introduced Violence-YOLO, a high-performance violence detection model built on YOLOv9's GELAN-C architecture for real-time performance in dense, crowded public areas. The model improves mAP@0.5 by 0.9% with reduced computational load and model size using attention mechanisms (SimAM), lightweight blocks (GhostConv, RepGhostNet), and a new Focaler-IoU loss, making it highly adaptable for embedded systems such as Raspberry Pi and applicable in use cases such as airport security monitoring. Abundez et al. [37] created a threshold active learning technique to detect physical violence in images from videos, improving classifier robustness across different environments. The method has two steps: first, pretrained models are applied to identify imbalanced images in terms of a threshold value (μ), and these are added to the training set for its improvement; second, new video data are checked for imbalanced images by human assessors and retrained into the model. This hybrid neural network method enhances violence detection through iterative improvement of classifier accuracy across different datasets. Negre et al. [38] provide a comprehensive literature review of deep-learning-based physical violence detection from video. The paper summarizes 21 open challenges, 28 recent datasets, 21 keyframe extraction methods, and 16 algorithm inputs utilized to detect violence, and also discusses the combination of various algorithms and their performance. The research aims to fill the need for a review in this field to date, highlighting the increasing relevance of AI-based video surveillance systems for detecting violence in public areas in real time.

2. Materials and Methods

In this section, the methodology used for each of the two fusion models is presented, as well as the data curation process and the background behind each approach. The methodological approach for a Late Fusion model has already been discussed [25].

2.1. Data Collection and Preprocessing

Features were extracted from the datasets of the visual and audio models and were then fused into a unified dataset.
Visual data were harvested from the "3D Human Pose in the Wild Using IMUs and a Moving Camera" image dataset. Images were processed with a MediaPipe holistic model. Images were fed into the model, and noise and artifact filtering were applied. Images displayed either an argument between two individuals that escalated or a casual conversation. In each frame, facial features, body stance, and hand gestures were detected, and the corresponding landmarks were later saved in a .csv file along with their corresponding x, y, z coordinates. Imputation techniques were applied with a KNN imputer to handle missing data from frames. Moreover, artifact and noise management were included to unify the resolution of the frames into a single resolution (640 × 480 pixels). Lower-resolution frames were omitted. For noise reduction and quality control, a fast non-local means denoising algorithm was applied to enhance the quality of the images for more accurate landmark extraction from MediaPipe. Following data preprocessing, MediaPipe extracted the landmarks that corresponded to specific features [18] (S1.1).
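A minimal sketch of this preprocessing and landmark-extraction step is given below, assuming frames stored as image files on disk; the file paths, output file name, and imputer settings are illustrative rather than the exact script used in this work.

```python
# Sketch: denoise frames, extract MediaPipe Holistic landmarks, impute missing values.
import glob
import cv2
import numpy as np
import pandas as pd
import mediapipe as mp
from sklearn.impute import KNNImputer

# Landmark counts per group: 33 pose + 468 face + 21 + 21 hand points = 543 points -> 1629 (x, y, z) values
GROUPS = [("pose", 33), ("face", 468), ("left_hand", 21), ("right_hand", 21)]

def frame_to_landmarks(image_bgr, holistic):
    """Return one flat (x, y, z) landmark vector; NaN where a body part is not detected."""
    image_bgr = cv2.resize(image_bgr, (640, 480))                                 # unify resolution
    image_bgr = cv2.fastNlMeansDenoisingColored(image_bgr, None, 10, 10, 7, 21)   # denoise
    results = holistic.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    detected = {"pose": results.pose_landmarks, "face": results.face_landmarks,
                "left_hand": results.left_hand_landmarks, "right_hand": results.right_hand_landmarks}
    row = []
    for name, n_points in GROUPS:
        landmarks = detected[name]
        if landmarks is None:
            row.extend([np.nan] * (n_points * 3))                                 # imputed later
        else:
            for lm in landmarks.landmark:
                row.extend([lm.x, lm.y, lm.z])
    return row

frame_paths = sorted(glob.glob("frames/*.png"))                                   # illustrative location
with mp.solutions.holistic.Holistic(static_image_mode=True) as holistic:
    rows = [frame_to_landmarks(cv2.imread(p), holistic) for p in frame_paths]

landmarks_df = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(pd.DataFrame(rows)))
landmarks_df.to_csv("visual_landmarks.csv", index=False)
```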
For the audio modalities, the data source was the "Kaggle: Audio-based Violence Detection Dataset", which contained audio clips of various aggressive and non-aggressive verbal instances from daily events. The audio files were fed into the audio model and preprocessed. Filtering and noise-reduction techniques, such as MFCCs and pitch filtering, were applied to remove unwanted noise from the data. The final filtered audio files were processed and their features extracted using Root Mean Square energy (RMS) and zero-crossing rate to describe the time-domain characteristics of the audio, using the extract_temporal_features method [25] (S1.2).
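A minimal sketch of this kind of per-clip feature extraction with librosa is shown below; the function name and feature summaries mirror the description above but are illustrative rather than the exact extract_temporal_features implementation.

```python
# Sketch: summarize MFCC, pitch, spectral, ZCR, and RMS statistics for one audio clip.
import numpy as np
import librosa

def extract_audio_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                                    # keep native sample rate

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)                 # (n_mfcc, frames)
    pitch = librosa.yin(y, fmin=65, fmax=2000, sr=sr)                      # fundamental-frequency track

    feats = {}
    for i in range(n_mfcc):                                                # mean + std -> 26 MFCC features
        feats[f"mfcc{i + 1}_mean"] = float(mfcc[i].mean())
        feats[f"mfcc{i + 1}_std"] = float(mfcc[i].std())
    feats["pitch_mean"], feats["pitch_std"] = float(pitch.mean()), float(pitch.std())
    feats["spectral_centroid"] = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    feats["spectral_bandwidth"] = float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean())
    feats["spectral_flatness"] = float(librosa.feature.spectral_flatness(y=y).mean())
    feats["spectral_contrast"] = float(librosa.feature.spectral_contrast(y=y, sr=sr).mean())
    feats["zcr_mean"] = float(librosa.feature.zero_crossing_rate(y).mean())  # time-domain descriptor
    feats["rms_mean"] = float(librosa.feature.rms(y=y).mean())               # time-domain descriptor
    return feats
```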

2.2. Late Fusion Methodology

Regarding the Late Fusion model described in prior work [25], a meta-classifier was developed and trained on the fused prediction probabilities of the two separate visual and audio models (S1.1, S1.2). Initially, two single models were developed: a visual model that uses visual data to collect landmarks, detect human body stances, and predict the probability of aggression in a given stance, and an audio model that collects audio data, performs feature extraction with various techniques such as noise filtering and MFCCs, is trained on these features, and predicts the probability of aggressive verbal instances. In the Late Fusion model development, the prediction probabilities of these two models were saved and then fused following testing on a holdout dataset of data unseen during training. Every row of the visual data frame was processed and compared with the corresponding row of the audio data frame in the same order [39]. Data were merged based on their true and predicted labels. To handle uncertain predictions, the Hybrid Uncertainty Calibration approach was implemented as presented in [40]. Finally, the confidence of each model was used to adjust the final predictions (S1.6).
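A minimal sketch of this decision-level fusion is shown below, assuming the two unimodal Random Forests have already been fitted and that the rows of the two holdout feature sets refer to the same labelled instances; function and variable names are illustrative, and the Hybrid Uncertainty Calibration step is omitted.

```python
# Sketch: fuse unimodal prediction probabilities and train a Random Forest meta-classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_late_fusion(visual_model, audio_model, X_visual, X_audio, y_true):
    """Stack the class probabilities of the two fitted unimodal models and fit a meta-classifier."""
    visual_probs = visual_model.predict_proba(X_visual)       # shape (n_samples, 2)
    audio_probs = audio_model.predict_proba(X_audio)          # shape (n_samples, 2)
    meta_X = np.hstack([visual_probs, audio_probs])           # 4 probability features per instance
    X_train, X_test, y_train, y_test = train_test_split(
        meta_X, y_true, test_size=0.2, stratify=y_true, random_state=42)
    meta_clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
    return meta_clf, meta_clf.score(X_test, y_test)            # classifier and holdout accuracy
```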

2.3. Early Fusion Methodology

For the Early Fusion, extracted features from both the audio and the visual model were merged into a unified dataset. Considering that the two datasets do not bear a direct correlation with each other, audio and visual features were randomly paired using a concatenation technique. After loading the datasets into memory, they were sorted and aligned by their common column "label" to ensure proper correspondence between the visual and audio data. For each visual feature entry in the dataset, a random audio feature entry was concatenated with it. In order to properly merge the feature data, a Python script was developed that separated each dataset based on the label, ensured that both subsets had the same number of samples for each class, concatenated the label = 1 and label = 2 instances separately, and then recombined them into a single dataset. This technique ensures proper Early Fusion with matched samples while keeping features from both modalities without data loss [41].
Finally, with the aim of forming a comprehensive multimodal dataset, the datasets were concatenated along the feature axis. This resulted in a dataset where each entry contained both audio and visual features. The final dataset was saved as early_fusion_features_final_dataset.csv for later use in training the meta-classifier. The final dataset consisted of the two constituent datasets, which had already been filtered for noise and artifacts and preprocessed. Furthermore, imputation and feature standardization techniques were applied for possible missing values that occurred during the fusion [42] (S1.6, S1.7).
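A minimal sketch of the label-wise pairing and feature-axis concatenation described above is shown below; the input file names and the assumption of distinct feature column names are illustrative rather than the exact script.

```python
# Sketch: randomly pair visual and audio rows within each class, concatenate features,
# then impute and standardize the fused feature matrix.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

visual = pd.read_csv("visual_landmarks_labeled.csv")      # 1629 landmark features + "label"
audio = pd.read_csv("audio_features_labeled.csv")         # 39 audio features + "label"

fused_parts = []
for label in sorted(visual["label"].unique()):             # handle each class separately
    v = visual[visual["label"] == label].sample(frac=1, random_state=42).reset_index(drop=True)
    a = audio[audio["label"] == label].sample(frac=1, random_state=42).reset_index(drop=True)
    n = min(len(v), len(a))                                 # same number of samples per class
    part = pd.concat([v.iloc[:n].drop(columns="label"),     # feature-axis concatenation
                      a.iloc[:n].drop(columns="label")], axis=1)
    part["label"] = label
    fused_parts.append(part)

fused = pd.concat(fused_parts, ignore_index=True)
features = fused.drop(columns="label")
features = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(features), columns=features.columns)
features = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)
fused = pd.concat([features, fused["label"]], axis=1)
fused.to_csv("early_fusion_features_final_dataset.csv", index=False)
```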
After the fusion of the visual data (body landmarks) and audio data (speech features), the final result was a high-dimensional multimodal feature space that contained the information of both modalities (Table 1). More specifically, as follows:
  • Body Landmark Features (1629 features):
These represent key points of the body, face, and hands in three-dimensional space, with corresponding (x, y, z) values.
  • Audio Features (39 features):
Audio speech signal features extracted from the audio file, including the following:
    Mel Frequency Cepstral Coefficients (MFCCs) (mean and standard deviation): 26 features;
    Pitch (mean and standard deviation): 2 features;
    Spectral Features (centroid, bandwidth, flatness, contrast): 4 features;
    Zero-Crossing Rate (ZCR) (mean): 1 feature;
    Root Mean Square Energy (RMS) (mean): 1 feature.
  • Class Label (True Value):
This column contains the class for each entry, aggressive or non-aggressive.
The total number of Early Fusion Data Entries after the merging of the two multimodal datasets is 932.

2.4. Modeling

For the model training, a Random Forest classifier with 100 trees was used. The reason behind this choice of model is to retain uniformity and accuracy during the experimental process. Random Forest was selected and used during both the training and the testing of each individual model (audio and visual) [43], as well as for the Late Fusion meta-classifier training in our previous works [18,25]. For a meaningful comparison and the integrity of the results, the same classifier was used. Also, as before, a 5-fold cross-validation process was implemented to evaluate the model's generalizability. Given the number of data points, a 10 × 10 inner and outer cross-validation scheme could provide more accurate results, but it could also potentially overfit the model to the data and impede a true representative picture of the final results. For training the model, a train–test split of 80–20 was implemented, where 80% of the data was used for training and 20% for testing.
In previous work, the Late Fusion meta-classifier was trained on 465 instances of the non-aggressive class and 531 instances of the aggressive class. The class distribution during training of the Early Fusion meta-classifier resulted in 465 aggressive and 465 non-aggressive class instances after data curation and merging. For a fair comparison of the two meta-classifiers, undersampling was applied to the majority class of the Late Fusion model, balancing the instances of the aggressive and non-aggressive classes at 465 instances each. This approach ensures that both classifiers have the same number of instances for both of their classes, thus ensuring a fair performance comparison with more robust and detailed results (S1.1, S1.2, S1.6).
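A minimal sketch of this modelling setup is given below: majority-class undersampling, an 80–20 train–test split, a 100-tree Random Forest, and 5-fold cross-validation. The input file name and variable names are illustrative.

```python
# Sketch: balance classes by undersampling, then train and cross-validate a Random Forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

data = pd.read_csv("early_fusion_features_final_dataset.csv")
y = data["label"]

# Undersample the majority class so both classes contribute the same number of instances
n_per_class = y.value_counts().min()
balanced = pd.concat([data[y == c].sample(n=n_per_class, random_state=42) for c in y.unique()])
X_bal, y_bal = balanced.drop(columns="label"), balanced["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=42)       # 80-20 split

clf = RandomForestClassifier(n_estimators=100, random_state=42)         # 100 trees
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)                # 5-fold cross-validation
clf.fit(X_train, y_train)
print("CV accuracy per fold:", cv_scores)
print("Held-out test accuracy:", clf.score(X_test, y_test))
```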

3. Results

For the evaluation of the models, the same metrics as in the previous works [18,25] were used to measure their efficacy. Accuracy, precision, recall, and F1-score were computed for each fold. The trained model was then tested on the held-out test set to compute the ROC-AUC curves, confusion matrices, and learning curves. Inference times were also measured, as before, to benchmark the models' capabilities in real-time applications. For the statistical analysis, a paired t-test comparison of the performance metrics was used to analyze the statistical significance of the observed differences.
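A minimal sketch of this evaluation step is shown below, assuming a fitted classifier clf, a held-out test split, and the aggressive class encoded as the higher-valued label; it covers the point metrics, ROC-AUC, the confusion matrix, and a simple repeated-run inference-time measurement.

```python
# Sketch: compute the evaluation metrics and time repeated inference runs for a fitted classifier.
import time
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(clf, X_test, y_test, n_timing_runs=10):
    pos = clf.classes_[1]                                    # assumed to encode the aggressive class
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]                # probability of clf.classes_[1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, pos_label=pos),
        "recall": recall_score(y_test, y_pred, pos_label=pos),
        "f1": f1_score(y_test, y_pred, pos_label=pos),
        "roc_auc": roc_auc_score(y_test, y_score),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    timings = []                                             # average wall-clock prediction time
    for _ in range(n_timing_runs):
        start = time.perf_counter()
        clf.predict(X_test)
        timings.append(time.perf_counter() - start)
    metrics["mean_inference_time_s"] = float(np.mean(timings))
    return metrics
```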

3.1. Performance Comparison (Early vs. Late Fusion)

The key performance metrics for comparing the Early and Late Fusion models are displayed in Table 2. The Late Fusion model outperforms the Early Fusion model in most scores, including recall, accuracy, F1-score, and ROC-AUC, with the most significant improvement seen in recall. The results indicate that Late Fusion is more effective in identifying cases of aggression, suggesting that integrating audio and visual predictions at the decision level enables more robust classification of aggressive incidents.
Considering false positive rates, Early Fusion exhibits higher precision than the Late Fusion model, which indicates that Early Fusion performs better in applications where false alarms must be minimal, such as real-time monitoring appliances, where an unnecessary intervention could be ineffective and disruptive. On the other hand, Early Fusion's lower recall score indicates that it may miss instances of aggressive behavior, which could be problematic in high-risk scenarios where the failure of a timely intervention could have significant consequences for the patient or caregivers.
The plots and metrics for Early and Late Fusion models are presented below in Figure 1 and Figure 2, with corresponding ROC-AUC scores and curves, and a learning curve. Considering ROC curves, Late Fusion shows a better fit on the training data compared to the Early Fusion ROC curve. A steep slope in the Late Fusion ROC curve highlights its ability to learn well from the training set, and reaches a plateau, whereas Early Fusion’s slope is more gradual until 0.7 before reaching a plateau at 1.
The learning curves of Early and Late Fusion models are displayed in Figure 2. Early Fusion shows an incremental learning from the beginning until 670 training samples, with a minor decline at the end, with a maximum score reached of 0.83. Late Fusion, however, shows a steep inclination until 210 samples and then a plateau with a steady score on the training data, reaching a peak score of 0.89. Even though both models exhibit balanced learning curves, Late Fusion exhibits a more advantageous curve, indicating that the data are fitting well with the model. The plateau in the graph shows that the model learned as well as it could, and further data would not improve the model’s ability.
The confusion matrix comparison is displayed in Table 3 and Figure 3, highlighting each model's classification abilities in both positive and negative prediction classes. Late Fusion reduced the number of false negatives, showing that it is less likely to miss aggressive occurrences. This is advantageous in applications in which failing to detect aggression could have significant consequences. Early Fusion, on the other hand, maintains a slight advantage in false positives, aligning with its improved precision.
Final results of the performance metrics were drawn into a multi-axis spider diagram to better visualize strengths and weaknesses in the performance of the two models. Metrics are displayed in Figure 4, with Late Fusion outperforming Early Fusion in recall, accuracy, F1 score, ROC-AUC, Balanced Accuracy, Negative Predictive Values, Cross-Validation Balanced Accuracy, and Cross-Validation ROC-AUC score. The Cross-Validation Metrics were calculated during the outer folds inside the 5 Folds × 5 Repeats Nested CV before the application of the best hyperparameter tuning. These are a reliable metric of the model’s classification ability.

3.2. Statistical Significance Analysis

In order to determine and further validate the superiority of the Late Fusion model over the Early Fusion model, a paired t-test statistical analysis was conducted on the main evaluation scores, as shown in Table 4. This provides a better insight into the model’s strengths and weaknesses and sheds light on its observed performance.
The results shown in Figure 5 indicate that there are no statistically significant differences between the two models, apart from precision, where Early Fusion outperforms the Late Fusion model. Late Fusion shows better overall performance across the metrics; however, this could also indicate that the higher scores stem from random sampling variations. Since the two models show no statistically significant differences in the remaining metrics, other factors should be taken into consideration, such as computational cost and robustness. To further test this, the Inference Speed Performance test was performed. The degrees of freedom (df) for each test were calculated as n − 1, where n = 10 test runs (df = 9). A 95% confidence interval (CI) was computed for the mean difference between the Early Fusion and Late Fusion models on each metric.
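A minimal sketch of this paired comparison is given below; the per-run metric arrays are assumed to be collected from the repeated test runs described above, and the helper name is illustrative.

```python
# Sketch: paired t-test (df = n - 1) and 95% CI for the mean difference between two models.
import numpy as np
from scipy import stats

def paired_comparison(early_scores, late_scores, confidence=0.95):
    """Compare matched per-run metric values from the Early and Late Fusion models."""
    early_scores, late_scores = np.asarray(early_scores), np.asarray(late_scores)
    t_stat, p_value = stats.ttest_rel(early_scores, late_scores)        # paired t-test
    diff = early_scores - late_scores                                    # Early minus Late
    ci_low, ci_high = stats.t.interval(confidence, df=len(diff) - 1,
                                       loc=diff.mean(), scale=stats.sem(diff))
    return t_stat, p_value, (ci_low, ci_high)

# Example call: with n = 10 runs per model, the test has df = 9.
# t, p, ci = paired_comparison(early_precision_runs, late_precision_runs)
```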

3.3. Inference Speed Performance

In order to properly compare the efficiency of the models, real-world applicability should be discussed. Real-world scenarios require speed performance to predict the occurrence of an incident in a timely fashion. Therefore, the Inference Speed Performance test was conducted to measure and compare the speed of the two models. The results of this test are shown in Table 5, summarizing inference times for Late and Early Fusion.
Late Fusion exhibits slightly faster inference times than Early Fusion, indicating its benefits in time-sensitive scenarios such as clinical supervision in healthcare. Both models exhibit only small differences in processing time, whereas recall and F1-score display a wider gap, with Late Fusion outperforming Early Fusion by a larger margin. As shown in Figure 6, Early Fusion shows a slightly longer inference time than Late Fusion. Even though this gap seems insignificant, in large-scale deployment scenarios such small differences in time can be valuable.
All inference timing measurements reported in this work were conducted on a machine with an Intel Core i5-12450H processor and 16 GB DDR4 RAM, without GPU acceleration. The software platform was Python 3.12.3 with the following package versions: pandas 2.2.3, numpy 2.2.5, joblib 1.4.2, scikit-learn 1.6.1, scipy 1.15.2, and matplotlib 3.10.1. This setup means that the reported inference times reflect typical performance on a common modern laptop configuration, commensurate with real-world deployment environments and without requiring special hardware.
Comparative analysis of the Early and Late Fusion models shows that both have strengths and weaknesses. Overall, Late Fusion outperforms Early Fusion with higher evaluation metrics, faster inference times, and no statistically significant differences in most metrics, making it the superior model. This makes it a strong candidate for applications where sensitivity in detecting and predicting aggressive behaviors is critical. Early Fusion still has strong points in cases where precision is the major priority, since it produces fewer false positives, making it useful in applications where false alarms are costly and must be prevented, such as automated monitoring systems in health clinics without supervision from staff.

3.4. Effect Size Analysis

In addition to examining statistical significance through Wilcoxon signed-rank tests, an effect size analysis using Cohen's d was performed to quantify the magnitude of differences between the Early Fusion and Late Fusion methods for each of the evaluation metrics. Cohen's d is widely used to determine the standardized mean difference between two groups, providing a clearer notion of the practical significance of the results.
Table 6 provides the results of both Wilcoxon tests and corresponding Cohen’s d values for the performance measures: accuracy, precision, recall, and F1-score. The p-values of the Wilcoxon tests are utilized to verify whether there is a statistically significant difference in performance between the Early Fusion and Late Fusion methods. Cohen’s d values help provide a better understanding of the size of differences.
  • Precision is the only metric with a statistically significant difference (p = 0.0177), in favor of Early Fusion, as confirmed by the Wilcoxon test. Cohen's d for precision, however, is −0.4074, a small-to-medium effect size; therefore, although the difference is significant, its practical magnitude is moderate;
  • Accuracy, recall, and F1-score were not statistically distinct (p > 0.05), as supported by both the Wilcoxon p-values and the Cohen's d values. For instance, accuracy's Cohen's d is 0.2127, a small positive effect size, while recall's Cohen's d (−0.5314) and F1-score's Cohen's d (−0.0578) correspond to non-significant differences between the two approaches.
It is evident from this comparison that the differences between Early Fusion and Late Fusion are predominantly of low magnitude across most metrics. The notable result in precision must be interpreted with caution, since the effect size indicates a moderate difference that may not translate into a meaningful improvement in actual application.
Although Late Fusion did not attain a statistically significant improvement in the majority of the measures, the gain in precision in Early Fusion, although statistically significant, is of a smaller practical size according to Cohen’s d. These results indicate that Early Fusion is superior regarding precision. Overall comparison between the two methods does not illustrate a clear-cut superiority of one method over the other based on the effect sizes.
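The Wilcoxon and Cohen's d values in Table 6 could be computed along the following lines; this is a minimal sketch that assumes the same per-run metric arrays as the paired t-test above and uses the paired-samples (d_z) variant of Cohen's d.

```python
# Sketch: Wilcoxon signed-rank test plus a paired-samples Cohen's d for each metric.
import numpy as np
from scipy import stats

def wilcoxon_and_cohens_d(early_scores, late_scores):
    early_scores, late_scores = np.asarray(early_scores), np.asarray(late_scores)
    w_stat, p_value = stats.wilcoxon(early_scores, late_scores)   # paired, non-parametric test
    diff = early_scores - late_scores
    cohens_d = diff.mean() / diff.std(ddof=1)                     # standardized mean difference (d_z)
    return w_stat, p_value, cohens_d

# Example call per metric, e.g. wilcoxon_and_cohens_d(early_recall_runs, late_recall_runs)
```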

3.5. Comparison with Recent Theoretical Methods in Pose Recognition and Violence Detection

Below, Table 7 presents a comparative overview of recent methods introduced in the field of pose recognition and violence detection. These methods apply various state-of-the-art machine learning and deep learning algorithms, such as multimodal fusion systems, neural networks, and object detection models such as YOLO and I3D, to improve the accuracy and speed of violent behavior detection or pose recognition for real-time applications. The table consolidates the systems constructed by different researchers, recording the models and datasets applied, as well as performance scores and potential real-world applications. Notably, many of the systems are optimized for edge devices, making them deployable in safety-critical use cases, smart city monitoring, and public space surveillance, among others. This comparison reveals the different approaches being pursued to extend the boundaries of pose estimation and violence detection, offering implications regarding accuracy, real-time capacity, and trade-offs in system complexity.

4. Discussion

Dementia patients are at a greater risk of harming themselves and others during episodes of aggression. The risk factors involved in these behaviors have been widely studied. One vital goal is to reduce caregiver stress by informing them of the risks involved in these behaviors. Additional investigation of the interaction among different cases of behaviors has been accomplished through examination of vocally and physically aggressive and non-aggressive behaviors in order to better understand the factors related to agitation and aggressive behaviors among dementia patients. Findings showed that verbal aggression was most disruptive [44,45].
In this project, a machine learning model was designed, trained, and evaluated, capable of detecting and classifying aggressive and non-aggressive human behavior based on visual and audio data. To complete this proof-of-concept study, two meta-classifier fusion methods were proposed: the Early and Late Fusion models. This multimodal combination of previous works further contributes to the competencies of this proof of concept, predicting episodes of aggressive behavior in dementia patients more accurately and with improved reliability. As in the prior work, the limitation in acquiring clinical datasets of real dementia patients remains a drawback, due to ethical concerns as well as patient consent. Partnerships with psychiatric hospitals are necessary to acquire such data and verify the applicability of the model in real-life scenarios. This proof of concept establishes the basis for a model that simulates real-life scenarios, increases model reliability, prepares it for clinical use, and enables future model validation with clinical data [46,47].
Dimensionality of Data
In multimodal datasets, a dataset with high dimensionality plays a crucial role when it comes to providing a rich and analytical representation of aggressive behavior prediction. The incorporation of visual and audio cues in this multimodal analysis benefits both models by providing enriched and complementary data sources. This allows a more analytical, comprehensive view of the classification and prediction of aggressive behaviors. Multimodal fusion also enhances the potential abilities of the model to learn from new data, since variability in modalities provides unique insights.
This approach comes with a key advantage, which is the model’s ability to handle missing data, therefore providing a more robust approach. In case of missing data, such as verbal audio data or missing or distorted images, the model can still provide meaningful predictions from available data. This highlights the robustness of this approach and ensures a more stable and reliable classification process. Moreover, the integration of deep-level interactions in features between audio and visual data helps the model to develop a better understanding of aggression patterns.
Interpretation of Results
Comparative analysis in this research between Early and Late Fusion approaches highlights the strengths and trade-offs of each method. The Late Fusion model outperformed Early Fusion in the majority of the evaluation metrics, including recall, accuracy, F1-score, ROC-AUC score, and inference time, while at the same time, no significant statistical differences were present between the comparison of the two models. This suggests that the aggregation of data of different natures in decision-level predictions provides a more robust process for real-world clinical applications.
Despite being outperformed by Late Fusion, Early Fusion exhibits better precision metrics with a low rate of false positive values, suggesting that it is better at avoiding false detections. This improvement comes with the trade-off of a higher risk of missing aggressive occurrences due to its low recall score. This showcases the importance of selecting a fusion strategy that is best suited for the requirements of the environment in which it is needed to perform. Healthcare application settings with human supervision may need to prioritize recall so that aggressive instances will not be overlooked, whereas in automated surveillance systems, it might be more crucial to emphasize precision in order to minimize unnecessary calls for interventions.
The main reason Late Fusion is better at recall lies in its ability to decouple and process each modality separately before making a final decision. This decoupling allows the model to establish modality-specific decision boundaries that are less susceptible to cross-modal interference. For instance, even when the visual cues include motion blur, occlusion, or uneven lighting, the audio classifier can still provide a strong and confident input to the final prediction. This redundancy naturally mitigates the risk of modality-specific noise. Further, Late Fusion circumvents temporal asynchrony by disconnecting the learning processes. Rather than relying on ideal temporal coincidence, each modality is free to detect patterns of aggression within its native temporal context, and the fusion system is tasked with combining these asynchronous predictions. Unlike Early Fusion, this technique does not require temporally aligned and cleaned features for best performance and is thus less sensitive to performance degradation when confronting real-world aberrations. Thus, the power of Late Fusion lies not only in its decision-level structure but also in its insensitivity to noise, modality imbalance, and temporal lag. These characteristics are the most critical in the context of identifying aggressive behavior in erratic clinical environments.
Key Points of Comparison
  • Strengths of Late Fusion:
    Higher accuracy, recall, and F1-score: Late Fusion outperforms Early Fusion in overall classification performance, making it better suited for detecting aggression with high reliability;
    More effective handling of uncertainty: By aggregating the probabilities from the independent audio and visual models, Late Fusion significantly reduces false negatives. This is particularly useful in applications where missing an aggressive instance could have serious consequences;
    Independent feature extraction for each modality: The separation of visual and audio processing prevents interference between modalities, ensuring that each model is optimized for its respective feature space before fusion (S1.7);
    Faster inference times: Since predictions are fused at the decision level rather than requiring a joint feature space, Late Fusion achieves better computational efficiency, making it more viable for real-time applications like clinical monitoring and security surveillance.
  • Limitations of Late Fusion:
    Slightly lower precision: Although Late Fusion has superior recall, it suffers from a higher false positive rate. This could lead to unnecessary interventions in scenarios where precision is critical;
    Increased computational cost: The need to train separate models for audio and visual data before fusion adds to the overall processing time and computational resources required;
    Dependence on model confidence scores: Since the final decision is influenced by the confidence levels of each model, any inaccuracies in confidence estimation could affect performance.
  • Strengths of Early Fusion:
    Higher precision: Early Fusion exhibits a lower false positive rate, which is particularly beneficial for applications where reducing unnecessary alerts is a priority;
    Direct integration of multimodal features: Unlike Late Fusion, which combines predictions, Early Fusion merges raw feature data, allowing for richer feature-level interactions between audio and visual cues;
    Simpler training process: A single classifier is trained on the fused dataset, eliminating the need for separate models and fusion steps, reducing training complexity;
    Potentially stronger feature representation: Combining data at the feature level enables the model to learn complex correlations between visual and audio signals that might not be captured when training models independently (S1.7).
  • Limitations of Early Fusion:
    Lower recall: The model is more prone to missing instances of aggression due to the direct integration of features, which may not always effectively capture subtle behavioral patterns;
    High-dimensional multimodal feature space: The increased complexity of the dataset, resulting from concatenating visual and audio features, may lead to overfitting if not managed properly (S1.4, S1.5, S1.7);
    More sensitive to missing data: Since the model relies on both modalities being present for training and inference, any missing or corrupted features could degrade classification performance significantly;
    Potential difficulty in feature alignment: Even though concatenation techniques were used to ensure proper label correspondence, the lack of direct correlation between audio and visual features could introduce noise into the training process, affecting overall performance.
Fusion Strategy Implications
As mentioned above, the choice between Early and Late Fusion is challenging. Each model has its own strengths and weaknesses, which must be taken into consideration when choosing the preferred strategy for a specific application. The Late Fusion approach is preferable in applications where high recall is crucial for real-world clinical scenarios, with human supervision, where an imminent episode of aggressive behavior must be identified in time to reduce risk. Early Fusion is preferable for applications where priority is given to precision, such as in automated systems without human supervision, where there is a need to minimize false cases.
Additionally, the Late Fusion method provides flexibility considering the update of the model since audio and visual features are trained separately. This can improve the maintenance and scalability of the model, since it is not dependent on all modalities at a given time; therefore, retraining is not required to make additional improvements to one modality. This provides modularity to the Late Fusion approach, which is beneficial for frequently updated systems that need to adapt to new data.
Model Robustness
Model robustness is another critical factor in the evaluation of a model fusion approach, especially when handling missing modalities or variations in data quickly and efficiently under real-world deployment challenges. Based on the results discussed earlier, Late Fusion demonstrates superior performance thanks to its independent decision-making process, which allows it to compensate for missing or corrupted data in individual modalities. This highlights its resilience in real-world conditions where data are often inconsistent or noisy.
The Early Fusion implementation, on the other hand, can effectively integrate multimodal features from different cues but is more sensitive to data inconsistencies between the audio and visual data. With feature concatenation at an early stage, data discrepancies, as well as noise in the data, can propagate through the model. This reduces overall classification performance and also reduces real-world adaptability, due to its dependence on balanced and synchronized multimodal data.
Performance Behavior Across Data Conditions
The observed difference in performance between the Early and Late Fusion models can be explained by examining how each fusion approach responds to sample-specific modality information under varying sample conditions. Early Fusion combines audio and visual features at the input level in the form of a single concatenated feature vector. This approach enables the model to learn joint representations, but it also introduces the risk of modality imbalance, whereby dominant or noisy features from a single modality (such as irregular audio information or poor-quality visual frames) can obscure informative patterns from the other modality. This phenomenon is especially significant in borderline or uncertain samples, contributing to Early Fusion's decreased recall, since it struggles to respond to subtle cues of aggression.
Late Fusion, in contrast, builds modality-specific decision boundaries with separate classifiers and fuses their outputs at the decision level. This design allows Late Fusion to exploit asynchronous modality salience. When only a single modality provides a strong signal (e.g., clear speech tone changes even when the subject is visually occluded), the classifier can nevertheless trigger a correct aggressive classification. This results in better recall and improved ROC-AUC, particularly in edge cases. The steeper slope of its learning curve up to 210 samples also suggests that modality-specific classifiers are more data-efficient at initial stages, capture more discriminative within-modality structure, and converge with fewer samples.
Moreover, Early Fusion's increased precision is the result of the higher effective decision thresholds that are a byproduct of early integration. As predictions are based on a combined feature space, only samples with clear, consistent signals across both modalities are labeled as aggressive. This reduces false positives but increases false negatives, as seen in the confusion matrix. This is especially valuable where minimizing false alarms is of primary concern (e.g., autonomous surveillance), but not ideal in high-stakes contexts where missing an aggressive incident can undermine safety.
Lastly, the performance gap is non-uniform and data-dependent, with Early Fusion performing relatively better in clean and well-aligned input situations and Late Fusion dominating in noisy, heterogeneous, or modality-degraded situations, bearing testament to its resilience and adaptability.
Choice of Methodology
Based on earlier research, the pipeline chosen for this study remained the same in order to adhere to the model hierarchy and produce reliable results. Consequently, the Random Forest ensemble classifier was retained, since it is well suited to the application due to its versatility across operating conditions and consistent performance across a range of decision thresholds [48]. Alternative classification models could have been used in this study, but doing so would defeat the purpose of the improvement, because a different classifier would not offer consistency or dependability for a later comprehensive application of the model in real-world situations.
Real-World Clinical Settings Challenges
The use of controlled and/or simulated datasets, and its implications for actual clinical settings, must be discussed. As previously mentioned, the data used for the audio and visual models came from open-source, freely available online datasets. These datasets supported model development and yielded a proof of concept that can subsequently be tested in actual clinical settings.
In future research, these models can be further enhanced using clinical audio and visual data. Clinical datasets may introduce issues that affect the results, such as background and other environmental noise present in clinical settings. Further, since every patient behaves differently, individual variability may impede predictions.
Real-World Application Strategies
Quality control is another crucial consideration when working with actual clinical data. The quality of real-world data is affected by missing values, incorrect labeling, and limited usability. To address this, mitigation strategies have already been applied in every model under study. Robust data augmentation techniques that mimic clinical noise during training, and domain adaptation methods that fine-tune a model trained on a controlled dataset for a clinical dataset, are examples of adaptation strategies that can be put into practice in response to the challenges discussed. Lastly, cooperation with healthcare organizations is required to obtain clinical data from mental health facilities and to establish an active learning feedback loop in which human specialists oversee and determine the reliability of the results.
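As one example of the augmentation strategy mentioned above, the sketch below injects white noise at a target signal-to-noise ratio before extracting audio descriptors with librosa; the file path, SNR value, and feature set are illustrative assumptions.

```python
import numpy as np
import librosa

def add_noise_at_snr(y, snr_db, rng=None):
    """Add white noise scaled to a target signal-to-noise ratio (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=y.shape)
    return y + noise

# Placeholder path; in practice this would be a clip from the training set.
y, sr = librosa.load("example_clip.wav", sr=16000)
y_noisy = add_noise_at_snr(y, snr_db=10)

# Extract the same kind of summary descriptors from the augmented signal.
mfcc = librosa.feature.mfcc(y=y_noisy, sr=sr, n_mfcc=13)
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```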
Model Robustness and Clinical Application Challenges
It is important to explore potential challenges to implementing the model in real clinical environments. Specifically, background noise, patient diversity, and missing sensor data are all difficulties that would likely have a significant impact on the model's performance. To further enhance the model's robustness, future work will include testing in noisy environments, with noise artificially injected during training. This will challenge the model's ability to perform in the less-than-perfect conditions that are common in clinical practice.
In addition, behavioral and patient heterogeneity must be examined. The model's effectiveness depends on patient characteristics such as age, cognitive status, and behavioral history. This issue could be addressed through solutions such as transfer learning, whereby the model learns to generalize to new settings and datasets without the need for lengthy fine-tuning.
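One lightweight adaptation option, sketched below under the assumption that a small clinical sample becomes available, is scikit-learn's warm-start mechanism, which grows additional trees on the new data while keeping the trees learned from the controlled dataset; this is an illustrative alternative to full transfer learning, not the method used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X_source, y_source = rng.normal(size=(300, 50)), rng.integers(0, 2, 300)  # controlled dataset
X_clinic, y_clinic = rng.normal(size=(60, 50)), rng.integers(0, 2, 60)    # small clinical sample

# Fit an initial forest on the controlled (open-source) data.
forest = RandomForestClassifier(n_estimators=200, warm_start=True, random_state=42)
forest.fit(X_source, y_source)

# Grow extra trees on the clinical sample; existing trees are kept,
# so the ensemble blends both data sources.
forest.n_estimators += 100
forest.fit(X_clinic, y_clinic)
```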
Early Fusion’s high precision advantage, as demonstrated in this study, also needs to be considered in the context of specific clinical circumstances. While Early Fusion is better at avoiding false alarms, this needs to be balanced against the clinical need to detect aggressive behaviors in a timely manner. The cost of reducing false positives in the clinical setting must be weighed carefully against the cost of missing an aggressive incident. Future research would have to address the clinical usefulness of such a trade-off, concerning real-world contexts in which false positives and false negatives both have direct implications for patient safety and care.
Ethical Considerations
Given the broader ethical implications of using machine learning algorithms to forecast violent tendencies in susceptible groups, we must discuss the associated ethical concerns, such as the risk of misclassification and privacy.
  • Privacy: Sensitive information, such as audio biometric data, images, video recordings, and behavioral logs, is commonly used in predictive models. To stop misuse or illegal use, we must adhere to strict privacy laws and moral principles. Participants must give their consent after being fully informed about how their data will be used, and all datasets and participants must remain anonymous. Another essential duty of researchers is strict data privacy, especially for vulnerable populations [49,50];
  • Risk of False Positives and False Negatives: False positives may result in unnecessary medical treatment or stigmatization due to inaccurate forecasts of aggressive behavior. These might result in poor decisions or negatively affect the health of those who are already at risk. Predictive models should be used to support, not replace, clinical decision-making; combining machine learning with the expertise of knowledgeable staff members can lessen these risks [51,52];
  • Fairness and Bias: Algorithms may inherit biases present in the training data, producing inaccurate results. This is especially problematic when working with vulnerable populations or communities, so mitigation techniques are applied when developing such models, and regular measures must be implemented to ensure equity, transparency, and the reduction of prejudice in such systems. With bias and misclassification in mind, the two datasets used here were not balanced at the time of their creation; 5 × 5 fold cross-validation was therefore employed to eliminate randomness and lower the risk of misclassification, mitigating bias and fairness concerns [49,50];
  • Ethical Implementation and Monitoring: When implementing such models in public settings such as schools, prisons, or mental hospitals, it is crucial to include expert teams in mental health and interdisciplinary fields. The morality of these models must be guaranteed in order to avoid putting people who are already in danger at needless risk [49,50];
  • Accountability and Human Agency: Machine learning models are meant to support human decision-making. Experts who can comprehend the model’s behavior and predictions in a larger context should be the ones to use it on a case-by-case basis. To guarantee such functionality when working with vulnerable groups, formal accountability structures ought to be put in place;
  • Patient Consent and Privacy: The data utilized for this research came from free, open-source datasets containing audio and visual files suitable for academic purposes. Because of the nature of the datasets, no consent form was needed, and this study did not require or involve an institutional review board (IRB) or ethics board. Both the video and audio data were anonymized at the source, so the targeted application posed no harm or risk. The datasets were public, contained no personally identifiable information (PII), and involved no direct interaction with patients. Both the existing and new models used 5 × 5 fold cross-validation in order to eliminate random prediction and unethical data bias.
In real-world clinical applications, patient privacy and consent are of major importance, and any system developed on top of the model outlined here would have to be completely compliant with legal and ethical standards for data protection. Specifically, the following measures would have to be in place to preserve patient privacy and to comply with regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA):
  • Data Encryption: Any sensitive patient data (audio, video, and behavioral logs) in real clinical settings needs to be encrypted both in transit and in storage. This would ensure that unauthorized access to data is prevented, upholding patient confidentiality in storage as well as in communication between systems;
  • Access Permissions: There must be strict access control mechanisms such that sensitive information is viewable or analyzable only by authorized personnel. Role-based access control (RBAC) must be followed such that health care providers, researchers, and clinicians view information related to their roles only. Auditing trails must also be conducted to track access to sensitive information and assign accountability;
  • Data Minimization: According to the GDPR and HIPAA regulations, patient data should be kept to the minimum required for clinical or research use. This can be achieved by adopting approaches such as anonymization or pseudonymization, so that only non-personally identifiable information is used unless explicit consent is provided;
  • Regulatory Compliance: Any data collected and processed in real clinical practice has to comply with the relevant privacy legislation, i.e., the GDPR for European patients and HIPAA for United States patients. This includes providing the patient with sufficient and comprehensible information on how their data will be processed, obtaining informed consent prior to collection of the information, and providing patients with the option to revoke consent at any time. Moreover, compliance with those laws ensures that data storage, processing, and sharing activities are compatible with the legal framework of data protection;
  • Anonymization and Pseudonymization: Where personal identifiers have to be held for clinical reasons, anonymization or pseudonymization techniques must be used. Anonymization wipes out identifying information from the data set in a way that cannot be reversed, so that it is no longer feasible to link data with an individual. Pseudonymization, by comparison, entails substituting identifying information with pseudonyms such that the data may be linked to a person under certain conditions with approved access only.
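To make the pseudonymization concept concrete, the sketch below replaces a patient identifier with a keyed hash, so records remain linkable only by parties holding the key; the key value, identifier format, and record fields are illustrative assumptions.

```python
import hmac
import hashlib

# Secret key held by the data controller only (illustrative value).
PSEUDONYM_KEY = b"replace-with-a-securely-stored-secret"

def pseudonymize(patient_id: str) -> str:
    """Deterministic keyed hash: the same patient always maps to the same
    pseudonym, but linking back to the person requires access to the key."""
    digest = hmac.new(PSEUDONYM_KEY, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"patient_id": "P-000123", "label": "aggressive", "clip": "ward_cam_17.mp4"}
record["patient_id"] = pseudonymize(record["patient_id"])
print(record)  # identifier replaced; linkage possible only via the keyed mapping
```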
Although the datasets used in this work were pre-collected and anonymized, clinical use of this model will have to adhere to stringent data protection and privacy laws. Future work will be necessary to incorporate robust encryption methods, secure access controls, and clear patient consent procedures so that actual clinical settings using such models align with legal and ethical requirements, safeguarding both patient confidentiality and model integrity.
Limitations and Future Validation in Clinical Contexts
While this study demonstrates a robust and generalizable proof of concept with open-source datasets, we must acknowledge that such datasets never completely reflect the complexity and variability of real-world clinical settings. The reported performance metrics and outcomes should therefore be viewed as foundational rather than absolute. Future confirmation will depend on demonstrating the proposed fusion strategies within actual clinical workflows, that is, within dementia units, where behavioral events are subtle, context-dependent, and affected by numerous confounding factors such as medication, cognitive impairment, and social relationships. The objective will be to validate the performance of the model on real clinical video and audio data with proper ethical approvals and data governance frameworks. Furthermore, longitudinal data collection will be prioritized to account for intra-individual behavioral variation over time. These steps are necessary to progress from a simulated evaluation environment to a clinically validated assistive technology that can reliably support frontline caregivers in high-stakes decision-making.
The findings of this research indicate that, in this proof of concept of a novel model for predicting episodes of aggressive behavior in dementia patients, Late Fusion is the better choice owing to its robustness and its superior recall, accuracy, and overall performance. These advantages make it the preferred option for applications in which clinical caregivers and institutions must intervene accurately and in a timely manner to prevent such scenarios. However, the Early Fusion technique remains a solid alternative for scenarios in which avoiding false positive alarms is the main concern, such as unsupervised applications. Future research should focus on implementing and testing with real clinical data (S1.3) obtained through partnerships with clinical institutions, and should shed further light on the results, or on the need for optimized fusion strategies, by exploring hybrid approaches that combine the strengths of each methodology.

Comparative Analysis with Recent Research in Violence and Pose Detection

In this section, the current work is compared with recent advancements in the field of violence and pose detection, highlighting the similarities, differentiating aspects, strengths, and weaknesses with respect to current techniques. The papers of Rodrigues et al. [33], Kumar et al. [34], Ding and Li [35], Xu et al. [36], Abundez et al. [37], and Negre et al. [38] provide valuable information on state-of-the-art methods in violence detection and pose estimation. Significant information from these studies is summarized below and compared with our approach.
Similarities and Common Approaches
  • Deep Models for Object and Action Detection: Like Kumar et al. [34], our methodology also utilizes sophisticated deep models (YOLOv7 and LSTM) for object and action detection. Both methods seek to boost real-time detection accuracy for the recognition of violent behavior. Our emphasis on recognizing a wider variety of violent postures, including defensive and aggressive movements, sets our work apart by extending the application beyond violent objects alone;
  • Real-Time Edge Deployment: As with Rodrigues et al. [33] and Kumar et al. [34], our solution emphasizes real-time edge deployment. The approach takes the complexities of embedded system deployment into account and uses low-latency-optimized YOLO and MediaPipe, which ensure quick detection with instantaneous alerting in real setups. This priority means that the developed system is usable in safety-critical operations, as presented by both Rodrigues et al. [33] and Kumar et al. [34];
  • Action Recognition and Pose Estimation: Ding and Li [35] designed a real-time 3D pose recognition system based on a deep convolutional neural network, which bears some resemblance to our pose estimation technique. Whereas Ding and Li are concerned with character animation, we utilize a holistic model to estimate body poses, using MediaPipe to detect physical postures signaling violence (a minimal landmark-extraction sketch follows this list). This intersection signifies a shared interest in applying pose analysis to action recognition systems, although our application is real-world violence detection.
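The following minimal sketch, referenced in the last item above, extracts pose landmarks from a video frame with the classic MediaPipe solutions interface; the video path is a placeholder, and only the 33 pose landmarks are shown, whereas the full holistic model also yields face and hand landmarks.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def pose_landmark_vector(frame_bgr, holistic):
    """Return a flat (x, y, z, visibility) vector for the 33 pose landmarks,
    or None when no person is detected in the frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    return np.array([[lm.x, lm.y, lm.z, lm.visibility]
                     for lm in results.pose_landmarks.landmark]).ravel()

cap = cv2.VideoCapture("example_clip.mp4")  # placeholder video path
with mp_holistic.Holistic(static_image_mode=False) as holistic:
    ok, frame = cap.read()
    if ok:
        vec = pose_landmark_vector(frame, holistic)
        print(None if vec is None else vec.shape)  # (132,) when a person is visible
cap.release()
```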
Differentiation and Unique Contributions
  • Focus on Violence Detection Based on Holistic Postures of the Body: While most of the works reviewed, e.g., Kumar et al. [34] and Xu et al. [36], focus on detecting violent objects (e.g., weapons), our work is novel in its focus on detecting violent actions through body postures and movement. Both offensive and defensive moves are covered under this work, which can provide insights into violence and aggression based on the motion of the human body rather than based on objects utilized;
  • Enhanced Dataset and Action Classification: In contrast to Abundez et al. [37], who use an active learning approach with a threshold for continuously enhancing the classifier across environments, our approach includes a multimodal action classification system. While Abundez et al. [37] rely on recalibrating the model on new ambiguous images to achieve higher accuracy, our solution builds on that idea by focusing on the physical postures of violence and classifying them as defensive or aggressive. This enables the detection of more subtle violent behaviors beyond simple object detection;
  • Multi-Class Pose Recognition: Our system identifies poses and actions in a real setting, as opposed to Ding and Li [35], who work on identifying animated character poses. Our system is specifically designed to identify violent and aggressive postures from video streams, which is an important practical distinction in deployment.
Strengths
  • Inter-Model Comprehensive Integration: Our approach combines object detection (via YOLOv7) and pose estimation (via MediaPipe) into a single end-to-end system covering both the object and the action context of violence. Kumar et al. [34] and Xu et al. [36] concentrate on violence detection based on object identification only, whereas our multi-faceted system also detects violent actions, improving classification accuracy for diverse violent behavior;
  • Real-Time Performance and Robustness: Similar to Rodrigues et al. [33] and Kumar et al. [34], we emphasize real-time performance and deploy our model on embedded systems for rapid inference. The combination of YOLO for object detection and MediaPipe for pose estimation ensures that our system functions optimally under real-world scenarios, such as surveillance in complex environments. Our optimized model provides high performance with low latency; thus, it is usable in urgent response scenarios, such as security monitoring;
  • Generalization Across Multiple Environments: Our approach handles a wide variety of video environments and hostile postures, addressing a common issue reported by Abundez et al. [37]. By using pretrained models and carefully curating the dataset, we ensure that our model generalizes well across multiple environments, remaining robust across a range of scenarios even when it encounters environments not seen during training.
Weaknesses and Areas for Development
  • Real-World Deployment Complexity: While our approach demonstrates real-time performance, it may still be constrained in highly cluttered or low-resolution video settings where object and pose detection may be impaired. Similar to the constraints defined by Rodrigues et al. [33] and Kumar et al. [34], our system must deal with realistic real-world scenarios where occlusions, low light, or high-speed motions may impact detection accuracy;
  • Limited Action Classification Granularity: While our system is strong in discriminating violent from non-violent actions, further discrimination among subclasses of violent behavior (i.e., physical assault, verbal aggression, or self-injury) would make the model more broadly useful. Fine-tuning in this way could lead to more specific interventions in real-world circumstances than a more general violent-action classification;
  • Dependence on Pretrained Models: While pretrained models are convenient in terms of improving initial performance and training, they also carry limitations when it comes to processing very particular sets of data. Our reliance on such models could restrict the capacity of the system to learn novel behaviors or unexpected actions, much like the iterative learning model presented by Abundez et al. [37].
In this work, a new contribution to the field of violence and pose detection is presented by integrating object detection and action recognition into an end-to-end system that can run in real time on edge devices. While adopting some features of recent advances, such as those of Kumar et al. [34] and Xu et al. [36], our system differs in its focus on recognizing violent acts through physical postures and gestures, offering a deeper contextual interpretation of aggression. Although some problems with environment adaptability and action classification granularity remain, our system provides a scalable and robust solution for violence detection that can be used in many public safety scenarios.

5. Conclusions

This study presents a comparative analysis of Early and Late Fusion models for aggression detection in dementia patients using multimodal audio-visual data as a novel proof-of-concept methodology. According to the findings, Late Fusion outperforms Early Fusion in most key metrics, including accuracy (0.876 vs. 0.828), recall (0.914 vs. 0.818), F1-score (0.867 vs. 0.835), and ROC-AUC (0.970 vs. 0.922), making it more effective in detecting aggressive behaviors. Its higher recall suggests better detection of aggressive incidents, making it suitable for high-risk applications, while its faster inference times enhance its potential for real-time use. In contrast, Early Fusion achieved higher precision (0.852 vs. 0.824), indicating a lower false positive rate, which may be beneficial in scenarios where minimizing false alarms is crucial. Additionally, Late Fusion’s superiority is reinforced by its higher Balanced Accuracy (0.880 vs. 0.829) and Cross-Validated ROC-AUC (0.959 vs. 0.924), suggesting better generalizability.
Future Directions
This study provides results that open new opportunities for further advancements in human behavioral prediction models. It also paves the way for incorporating deep learning techniques for feature extraction, such as CNNs and RNNs, which could improve classification accuracy. The latter could be explored for a more sophisticated treatment of the visual and audio patterns.
Furthermore, hybrid fusion strategies that integrate aspects of both the Early and Late Fusion techniques could provide an even more robust solution by combining the strengths of each approach. This would deliver increased accuracy and low false positive rates while maintaining superior classification capability, providing a reliable detection system. Finally, attention mechanisms could be implemented during the fusion process to refine the decision-making ability of the model; this would allow the model to weight critical features dynamically and further improve classification performance.
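A simple fixed-weight variant of such a mechanism is sketched below, assuming per-modality aggression probabilities and validation ROC-AUC values are already available; a learned attention layer would replace these static softmax weights.

```python
import numpy as np

def weighted_fusion(p_audio, p_visual, auc_audio, auc_visual, temperature=0.1):
    """Fuse per-modality aggression probabilities with softmax weights derived
    from each modality's validation ROC-AUC (a fixed, non-learned 'attention')."""
    scores = np.array([auc_audio, auc_visual])
    weights = np.exp(scores / temperature)
    weights /= weights.sum()
    return weights[0] * p_audio + weights[1] * p_visual

# Placeholder probabilities and validation AUCs.
p_audio = np.array([0.91, 0.40, 0.72])
p_visual = np.array([0.65, 0.30, 0.80])
fused = weighted_fusion(p_audio, p_visual, auc_audio=0.93, auc_visual=0.88)
print(fused)
```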
While this research presents a strong way of identifying violence in terms of postures and gestures, there are still various directions for further development. One direction is the granularity of distinguishing between actions. Our system currently distinguishes violent from non-violent actions but could, in future work, distinguish further among subtler domains of aggression, such as physical aggression, verbal aggression, or self-injury. This would not merely enhance the model's ability to react accurately to different types of violent conduct, but also provide more specific indications for real-world intervention. On top of this, fine-grained action identification would help disambiguate between different intensities of aggression, paving the way for more targeted action in public safety settings.
Deployability in the real world is another key dimension for enhancement. Although the system has been tuned for real-time performance, precision is affected by factors such as low-resolution video streams, occlusions, and adverse environmental conditions (e.g., low illumination, high-density environments, or high-speed motion). These problems may be addressed with data augmentation techniques, state-of-the-art pose detection algorithms, and multimodal fusion (e.g., combining audio or thermal data with visual data), which might boost the system's resilience in uncontrolled and dynamic environments. In addition, adaptive learning techniques could be employed to allow the model to continuously improve its detection abilities by incorporating new data from varied environments and conditions, so that it generalizes and adapts better to variable conditions. All these advancements would significantly improve the system's usefulness for real-time surveillance and provide a more general solution for detecting violence in various real-world applications.
Real-Time Deployment
Implementation of these models in real-world clinical scenarios and settings is a crucial next step. Further research in this field should focus on an optimized model with efficiency high enough to ensure low-latency prediction times. Model compression and quantization strategies in the edge computing field should be considered to unlock further capabilities for resource-constrained devices that are commonly used in healthcare.
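A minimal way to check whether a candidate model meets a latency budget on target hardware is to time batched predictions directly, as sketched below with placeholder data; compressed or quantized variants would be evaluated with the same harness.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X_train, y_train = rng.normal(size=(500, 50)), rng.integers(0, 2, 500)
X_stream = rng.normal(size=(1000, 50))   # placeholder stream of fused feature vectors

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

start = time.perf_counter()
model.predict(X_stream)
elapsed = time.perf_counter() - start
print(f"total: {elapsed:.4f} s, per sample: {elapsed / len(X_stream):.2e} s")
```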
Validation of the model’s performance in clinical data and trials with real data from patients is essential to ensure the effectiveness and the reliability of the model. Collaboration with healthcare institutes and experts is mandatory to test and refine the usability of this method in real-time prediction of aggressive episodes, validating its usability in existing care protocols. The successful deployment of real-time aggression detection models could revolutionize patient monitoring, offering a proactive approach to managing aggression-related incidents in dementia care.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15115823/s1. Figure S1. Aggressive image dataset sample; Figure S2. Non-aggressive image dataset sample; Table S1. Demonstration of file data structure; Table S2. Sample from the visual_detection_results.csv file after visual model testing; Table S3. Sample from the audio_detection_results.csv file after audio model testing; Table S4. Sample from merged_results.csv file; Table S5. Top 15 Features Selected from Random Forest During Early Fusion; Table S6. Top 15 Features Selected from Random Forest During Late Fusion.

Author Contributions

Conceptualization, I.G.; Methodology, I.G. and R.F.S.; Software, I.G.; Validation, I.G.; Formal analysis, I.G.; Investigation, I.G.; Resources, I.G.; Data curation, I.G.; Writing—original draft, I.G.; Writing—review & editing, I.G., R.F.S., N.K., A.V. and I.V.; Visualization, I.G.; Supervision, I.G. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://www.kaggle.com/datasets/fangfangz/audio-based-violence-detection-dataset (accessed on 15 January 2025) and at https://virtualhumans.mpi-inf.mpg.de/3DPW/ (accessed on 1 October 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Devshi, R.; Shaw, S.; Elliott-King, J.; Hogervorst, E.; Hiremath, A.; Velayudhan, L.; Kumar, S.; Baillon, S.; Bandelow, S. Prevalence of Behavioural and Psychological Symptoms of Dementia in Individuals with Learning Disabilities. Diagnostics 2015, 5, 564–576. [Google Scholar] [CrossRef] [PubMed]
  2. Schablon, A.; Wendeler, D.; Kozak, A.; Nienhaus, A.; Steinke, S. Prevalence and Consequences of Aggression and Violence towards Nursing and Care Staff in Germany—A Survey. Int. J. Environ. Res. Public Health 2018, 15, 1274. [Google Scholar] [CrossRef] [PubMed]
  3. Priyadarshinee, P.; Clarke, C.J.; Melechovsky, J.; Lin, C.M.Y.; Balamurali, B.T.; Chen, J.M. Alzheimer’s Dementia Speech (Audio vs. Text): Multi-Modal Machine Learning at High vs. Low Resolution. Appl. Sci. 2023, 13, 4244. [Google Scholar] [CrossRef]
  4. Arevalo-Rodriguez, I.; Smailagic, N.; Roqué-Figuls, M.; Ciapponi, A.; Sanchez-Perez, E.; Giannakou, A.; Pedraza, O.L.; Cosp, X.B.; Cullum, S. Mini-Mental State Examination (MMSE) for the early detection of dementia in people with mild cognitive impairment (MCI). Cochrane Database Syst. Rev. 2021, 2021, CD010783. [Google Scholar] [CrossRef]
  5. Chun, C.T.; Seward, K.; Patterson, A.; Melton, A.; MacDonald-Wicks, L. Evaluation of Available Cognitive Tools Used to Measure Mild Cognitive Decline: A Scoping Review. Nutrients 2021, 13, 3974. [Google Scholar] [CrossRef]
  6. Nowroozpoor, A.; Dussetschleger, J.; Perry, W.; Sano, M.; Aloysi, A.; Belleville, M.; Brackett, A.; Hirshon, J.M.; Hung, W.; Moccia, J.M.; et al. Detecting Cognitive Impairment and Dementia in the Emergency Department: A Scoping Review. J. Am. Med. Dir. Assoc. 2022, 23, 1314.e31–1314.e88. [Google Scholar] [CrossRef]
  7. Martin, S.A.; Townend, F.J.; Barkhof, F.; Cole, J.H. Interpretable machine learning for dementia: A systematic review. Alzheimer’s Dement. 2023, 19, 2135–2149. [Google Scholar] [CrossRef]
  8. Javeed, A.; Dallora, A.L.; Berglund, J.S.; Ali, A.; Ali, L.; Anderberg, P. Machine Learning for Dementia Prediction: A Systematic Review and Future Research Directions. J. Med. Syst. 2023, 47, 17. [Google Scholar] [CrossRef]
  9. Merkin, A.; Krishnamurthi, R.; Medvedev, O.N. Machine learning, artificial intelligence, and the prediction of dementia. Curr. Opin. Psychiatry 2022, 35, 123–129. [Google Scholar] [CrossRef]
  10. Chang, C.H.; Lin, C.H.; Lane, H.Y. Machine Learning and Novel Biomarkers for the Diagnosis of Alzheimer’s Disease. Int. J. Mol. Sci. 2021, 22, 2761. [Google Scholar] [CrossRef]
  11. Irfan, M.; Shahrestani, S.; Elkhodr, M. Enhancing Early Dementia Detection: A Machine Learning Approach Leveraging Cognitive and Neuroimaging Features for Optimal Predictive Performance. Appl. Sci. 2023, 13, 10470. [Google Scholar] [CrossRef]
  12. Stefanou, K.; Tzimourta, K.D.; Bellos, C.; Stergios, G.; Markoglou, K.; Gionanidis, E.; Tsipouras, M.G.; Giannakeas, N.; Tzallas, A.T.; Miltiadous, A. A Novel CNN-Based Framework for Alzheimer’s Disease Detection Using EEG Spectrogram Representations. J. Pers. Med. 2025, 15, 27. [Google Scholar] [CrossRef]
  13. Wei, Z.; Iyer, M.R.; Zhao, B.; Deng, J.; Mitchell, C.S. Artificial Intelligence-Assisted Comparative Analysis of the Overlapping Molecular Pathophysiology of Alzheimer’s Disease, Amyotrophic Lateral Sclerosis, and Frontotemporal Dementia. Int. J. Mol. Sci. 2024, 25, 13450. [Google Scholar] [CrossRef]
  14. Boyle, L.D.; Giriteka, L.; Marty, B.; Sandgathe, L.; Haugarvoll, K.; Steihaug, O.M.; Husebo, B.S.; Patrascu, M. Activity and Behavioral Recognition Using Sensing Technology in Persons with Parkinson’s Disease or Dementia: An Umbrella Review of the Literature. Sensors 2025, 25, 668. [Google Scholar] [CrossRef]
  15. Iaboni, A.; Spasojevic, S.; Newman, K.; Martin, L.S.; Wang, A.; Ye, B.; Mihailidis, A.; Khan, S.S. Wearable multimodal sensors for the detection of behavioral and psychological symptoms of dementia using personalized machine learning models. Alzheimer’s Dement. Diagn. Assess. Dis. Monit. 2022, 14, e12305. [Google Scholar] [CrossRef]
  16. Sharma, N.; Klein Brinke, J.; Braakman Jansen, L.M.A.; Havinga, P.J.M.; Le, D.V. Wi-Gitation: Replica Wi-Fi CSI Dataset for Physical Agitation Activity Recognition. Data 2024, 9, 9. [Google Scholar] [CrossRef]
  17. Inoue, Y.; Moshnyaga, V.G. A Real-Time Detection of Patient’s Aggressive Behaviors on a Smart Sensor. In Proceedings of the 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 11–14 August 2024; IEEE: New York, NY, USA, 2024; pp. 911–914. [Google Scholar] [CrossRef]
  18. Galanakis, I.; Soldatos, R.F.; Karanikolas, N.; Voulodimos, A.; Voyiatzis, I.; Samarakou, M. A MediaPipe Holistic Behavior Classification Model as a Potential Model for Predicting Aggressive Behavior in Individuals with Dementia. Appl. Sci. 2024, 14, 10266. [Google Scholar] [CrossRef]
  19. Chu, H.C.; Zhang, Y.L.; Chiang, H.C. A CNN Sound Classification Mechanism Using Data Augmentation. Sensors 2023, 23, 6972. [Google Scholar] [CrossRef]
  20. Keen, S.C.; Odom, K.J.; Webster, M.S.; Kohn, G.M.; Wright, T.F.; Araya-Salas, M. A machine learning approach for classifying and quantifying acoustic diversity. Methods Ecol. Evol. 2021, 12, 1213–1225. [Google Scholar] [CrossRef]
  21. Mou, A.; Milanova, M. Performance Analysis of Deep Learning Model-Compression Techniques for Audio Classification on Edge Devices. Sci 2024, 6, 21. [Google Scholar] [CrossRef]
  22. Abayomi-Alli, O.O.; Damaševičius, R.; Qazi, A.; Adedoyin-Olowe, M.; Misra, S. Data Augmentation and Deep Learning Methods in Sound Classification: A Systematic Review. Electronics 2022, 11, 3795. [Google Scholar] [CrossRef]
  23. Zaman, K.; Sah, M.; Direkoglu, C.; Unoki, M. A Survey of Audio Classification Using Deep Learning. IEEE Access 2023, 11, 106620–106649. [Google Scholar] [CrossRef]
  24. Tsalera, E.; Papadakis, A.; Samarakou, M. Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning. J. Sens. Actuator Netw. 2021, 10, 72. [Google Scholar] [CrossRef]
  25. Galanakis, I.; Soldatos, R.F.; Karanikolas, N.; Voulodimos, A.; Voyiatzis, I.; Samarakou, M. Enhancing the Prediction of Episodes of Aggression in Patients with Dementia Using Audio-Based Detection: A Multimodal Late Fusion Approach with a Meta-Classifier. Appl. Sci. 2025, 15, 5351. [Google Scholar] [CrossRef]
  26. Sakri, S.; Basheer, S. Fusion Model for Classification Performance Optimization in a Highly Imbalance Breast Cancer Dataset. Electronics 2023, 12, 1168. [Google Scholar] [CrossRef]
  27. Nykoniuk, M.; Basystiuk, O.; Shakhovska, N.; Melnykova, N. Multimodal Data Fusion for Depression Detection Approach. Computation 2025, 13, 9. [Google Scholar] [CrossRef]
  28. Younis, E.M.G.; Zaki, S.M.; Kanjo, E.; Houssein, E.H. Evaluating Ensemble Learning Methods for Multi-Modal Emotion Recognition Using Sensor Data Fusion. Sensors 2022, 22, 5611. [Google Scholar] [CrossRef]
  29. Saidi, S.; Idbraim, S.; Karmoude, Y.; Masse, A.; Arbelo, M. Deep-Learning for Change Detection Using Multi-Modal Fusion of Remote Sensing Images: A Review. Remote Sens. 2024, 16, 3852. [Google Scholar] [CrossRef]
  30. Resende Faria, D.; Weinberg, A.I.; Ayrosa, P.P. Multimodal Affective Communication Analysis: Fusing Speech Emotion and Text Sentiment Using Machine Learning. Appl. Sci. 2024, 14, 6631. [Google Scholar] [CrossRef]
  31. Choi, H.S. Emotion Recognition Using a Siamese Model and a Late Fusion-Based Multimodal Method in the WESAD Dataset with Hardware Accelerators. Electronics 2025, 14, 723. [Google Scholar] [CrossRef]
  32. Xu, Z.; Yu, G. A Time Series Forecasting Approach Based on Meta-Learning for Petroleum Production under Few-Shot Samples. Energies 2024, 17, 1947. [Google Scholar] [CrossRef]
  33. Rodrigues, N.R.P.; da Costa, N.M.C.; Melo, C.; Abbasi, A.; Fonseca, J.C.; Cardoso, P.; Borges, J. Fusion Object Detection and Action Recognition to Predict Violent Action. Sensors 2023, 23, 5610. [Google Scholar] [CrossRef]
  34. Kumar, P.; Shih, G.L.; Guo, B.L.; Nagi, S.K.; Manie, Y.C.; Yao, C.K.; Arockiyadoss, M.A.; Peng, P.C. Enhancing Smart City Safety and Utilizing AI Expert Systems for Violence Detection. Future Internet 2024, 16, 50. [Google Scholar] [CrossRef]
  35. Ding, W.; Li, W. High Speed and Accuracy of Animation 3D Pose Recognition Based on an Improved Deep Convolution Neural Network. Appl. Sci. 2023, 13, 7566. [Google Scholar] [CrossRef]
  36. Xu, W.; Zhu, D.; Deng, R.; Yung, K.; Ip, A.W.H. Violence-YOLO: Enhanced GELAN Algorithm for Violence Detection. Appl. Sci. 2024, 14, 6712. [Google Scholar] [CrossRef]
  37. Abundez, I.M.; Alejo, R.; Primero, F.; Granda-Gutiérrez, E.E.; Portillo-Rodríguez, O.; Velázquez, J.A.A. Threshold Active Learning Approach for Physical Violence Detection on Images Obtained from Video (Frame-Level) Using Pre-Trained Deep Learning Neural Network Models. Algorithms 2024, 17, 316. [Google Scholar] [CrossRef]
  38. Negre, P.; Alonso, R.S.; González-Briones, A.; Prieto, J.; Rodríguez-González, S. Literature Review of Deep-Learning-Based Detection of Violence in Video. Sensors 2024, 24, 4016. [Google Scholar] [CrossRef]
  39. Deng, Y.; Zhang, C.; Yang, N.; Chen, H. FocalMatch: Mitigating Class Imbalance of Pseudo Labels in Semi-Supervised Learning. Appl. Sci. 2022, 12, 10623. [Google Scholar] [CrossRef]
  40. Pan, Q.; Meng, Z. Hybrid Uncertainty Calibration for Multimodal Sentiment Analysis. Electronics 2024, 13, 662. [Google Scholar] [CrossRef]
  41. Ye, L.; Chen, X.; Liu, H.; Zhang, R.; Zhang, B.; Zhao, Y.; Zhou, D. Vessel Type Recognition Using a Multi-Graph Fusion Method Integrating Vessel Trajectory Sequence and Dependency Relations. J. Mar. Sci. Eng. 2024, 12, 2315. [Google Scholar] [CrossRef]
  42. Sukhavasi, S.B.; Sukhavasi, S.B.; Elleithy, K.; El-Sayed, A.; Elleithy, A. Hybrid Model for Driver Emotion Detection Using Feature Fusion Approach. Int. J. Environ. Res. Public Health 2022, 19, 3085. [Google Scholar] [CrossRef] [PubMed]
  43. Knauer, U.; von Rekowski, C.S.; Stecklina, M.; Krokotsch, T.; Pham Minh, T.; Hauffe, V.; Kilias, D.; Ehrhardt, I.; Sagischewski, H.; Chmara, S.; et al. Tree Species Classification Based on Hybrid Ensembles of a Convolutional Neural Network (CNN) and Random Forest Classifiers. Remote Sens. 2019, 11, 2788. [Google Scholar] [CrossRef]
  44. Scuteri, D.; Contrada, M.; Loria, T.; Tonin, P.; Sandrini, G.; Tamburin, S.; Nicotera, P.; Bagetta, G.; Corasaniti, M.T. Pharmacological Treatment of Pain and Agitation in Severe Dementia and Responsiveness to Change of the Italian Mobilization–Observation–Behavior–Intensity–Dementia (I-MOBID2) Pain Scale: Study Protocol. Brain Sci. 2022, 12, 573. [Google Scholar] [CrossRef]
  45. Cesana, B.M.; Poptsi, E.; Tsolaki, M.; Bergh, S.; Ciccone, A.; Cognat, E.; Fabbo, A.; Fascendini, S.; Frisoni, G.B.; Frölich, L.; et al. A Confirmatory and an Exploratory Factor Analysis of the Cohen-Mansfield Agitation Inventory (CMAI) in a European Case Series of Patients with Dementia: Results from the RECage Study. Brain Sci. 2023, 13, 1025. [Google Scholar] [CrossRef]
  46. Li, H.; Rajbahadur, G.K.; Lin, D.; Bezemer, C.P.; Jiang, Z.M. Keeping Deep Learning Models in Check: A History-Based Approach to Mitigate Overfitting. IEEE Access 2024, 12, 70676–70689. [Google Scholar] [CrossRef]
  47. Freiesleben, T.; Grote, T. Beyond generalization: A theory of robustness in machine learning. Synthese 2023, 202, 109. [Google Scholar] [CrossRef]
  48. Lueangwitchajaroen, P.; Watcharapinchai, S.; Tepsan, W.; Sooksatra, S. Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7. J. Imaging 2024, 10, 320. [Google Scholar] [CrossRef]
  49. Solís-Martín, D.; Galán-Páez, J.; Borrego-Díaz, J. A Model for Learning-Curve Estimation in Efficient Neural Architecture Search and Its Application in Predictive Health Maintenance. Mathematics 2025, 13, 555. [Google Scholar] [CrossRef]
  50. Schmid, L.; Roidl, M.; Kirchheim, A.; Pauly, M. Comparing Statistical and Machine Learning Methods for Time Series Forecasting in Data-Driven Logistics—A Simulation Study. Entropy 2025, 27, 25. [Google Scholar] [CrossRef]
  51. Joseph, J. Predicting crime or perpetuating bias? The AI dilemma. AI Soc. 2024, 27, 25. [Google Scholar] [CrossRef]
  52. Farayola, M.M.; Tal, I.; Connolly, R.; Saber, T.; Bendechache, M. Ethics and Trustworthiness of AI for Predicting the Risk of Recidivism: A Systematic Literature Review. Information 2023, 14, 426. [Google Scholar] [CrossRef]
Figure 1. ROC-AUC curve comparison between Early and Late Fusion meta-classifiers.
Figure 2. Learning curve comparison between Early and Late Fusion meta-classifiers.
Figure 3. Confusion matrix comparison between Early and Late Fusion meta-classifiers.
Figure 4. Overall performance comparison between Early and Late Fusion models across all metrics.
Figure 5. T-test analysis comparison between Early and Late Fusion meta-classifiers.
Figure 6. Inference time (ms) comparison between Early and Late Fusion meta-classifiers across all metrics.
Table 1. Example of the first row of the Early Fusion Merged Dataset (shown transposed; "…" marks the intermediate columns omitted in the original).

| Feature | Value |
|---|---|
| Landmark_0 | 0.6521162986755371 |
| Landmark_1 | 0.6397485733032227 |
| Landmark_n | … |
| Landmark_1628 | −0.0233648493885993 |
| Mfcc_Mean_1 | −0.4566593870633595 |
| Mfcc_Mean_n | … |
| Mfcc_Std_1 | −0.7672966098676699 |
| Mfcc_Std_n | … |
| Pitch_Mean | −1.3259335751824306 |
| Pitch_Std | −1.2938737484348524 |
| Spectral_Centroid_Mean | −0.8645088609027595 |
| Spectral_Bandwith_Mean | −0.2302908607663812 |
| Spectral_Flatness_Mean | −0.5734889273996062 |
| Spectral_Contrast_Mean | 0.6479885729501993 |
| Zero_Crossing_Rate_Mean | −0.5545112700054582 |
| Rms_Mean | −0.2916646639402965 |
| Label | 0 |
Table 2. Comparison of metrics between Early and Late Fusion models.

| Metric | Early Fusion | Late Fusion | Best Model |
|---|---|---|---|
| Accuracy | 0.8289 | 0.8763 | Late Fusion |
| Precision (PPV) | 0.8526 | 0.8242 | Early Fusion |
| Recall | 0.8182 | 0.9146 | Late Fusion |
| F1-Score | 0.8351 | 0.8671 | Late Fusion |
| ROC-AUC Score | 0.9223 | 0.9708 | Late Fusion |
| Balanced Accuracy | 0.8295 | 0.8804 | Late Fusion |
| Cross-Validated Balanced Accuracy | 0.8337 | 0.8828 | Late Fusion |
| Cross-Validated ROC-AUC Score | 0.9247 | 0.9592 | Late Fusion |
Table 3. Confusion matrix comparison.

| Model | TN | FP | FN | TP |
|---|---|---|---|---|
| Early Fusion | 74 | 14 | 18 | 81 |
| Late Fusion | 88 | 16 | 7 | 75 |
Table 4. Statistical Significance Analysis (T-tests).

| Metric | T-Stat | p-Value | Degrees of Freedom | 95% CI (Mean Diff) | Significant Difference |
|---|---|---|---|---|---|
| Accuracy | 7.249 | 0.5086 | 9 | [−0.015, 0.036] | No |
| Precision | 3.891 | 0.0177 | 9 | [0.003, 0.019] | Yes (Early Fusion is better) |
| Recall | −1.1882 | 0.3005 | 9 | [−0.067, 0.022] | No |
| F1-Score | 1.414 | 0.2304 | 9 | [−0.007, 0.025] | No |
Table 5. Inference Speed Performance evaluation.

| Metric | Early Fusion | Late Fusion | Faster Model |
|---|---|---|---|
| Total Inference Time | 0.0042 s | 0.0034 s | Late Fusion |
| Avg. Inference Time per Sample | 2.2262 × 10⁻⁵ s | 1.8283 × 10⁻⁵ s | Late Fusion |
Table 6. Wilcoxon and Cohen's d metrics between Early and Late Fusion.

| Metric | T-Test (p) | Wilcoxon (p) (Early Fusion) | Wilcoxon (p) (Late Fusion) | Cohen's d (Early) | Cohen's d (Late) | Interpretation |
|---|---|---|---|---|---|---|
| Accuracy | 0.5086 | 0.5625 | 0.7500 | 0.213 | 0.324 | Small effect, not significant |
| Precision | 0.0177 | 0.4375 | 0.0625 | −0.407 | 1.740 | Large effect, significant and practical |
| Recall | 0.3005 | 0.4375 | 0.6250 | 0.340 | −0.531 | Moderate mixed effects |
| F1-Score | 0.2304 | 1.0000 | 0.3125 | −0.058 | 0.632 | Moderate effect (Late better) |
Table 7. Comparative analysis of recent methods in pose recognition and violence detection.

| Author(s) | System/Method | Technology/Model Used | Datasets | Performance | Real-time/Edge Deployment | Applications |
|---|---|---|---|---|---|---|
| Current Work | Multimodal fusion for aggression detection in dementia care | Early and Late Fusion with Random Forest classifiers | Audio-based Violence Detection (Kaggle), 3D Human Pose in the Wild | Late Fusion: accuracy 87.6%, recall 91.4%, F1-score 86.7%, ROC-AUC 0.970; Early Fusion: higher precision (85.2%) | Yes | Healthcare monitoring, security in dementia care |
| Rodrigues et al. [33] | Multimodal fusion system for vehicle interior monitoring | I3D, R(2 + 1)D, YOLOv5 | MoLa InCar, COCO | Real-time performance on embedded platforms | Yes | Safety-critical applications, vehicle interiors |
| Kumar et al. [34] | AI-driven expert system for smart city surveillance | YOLOv7, MediaPipe, LSTM classifier | Not specified | 89.5% mAP for object detection, 88.33% accuracy for action classification | Yes | Smart city surveillance, urban emergency responsiveness |
| Ding and Li [35] | Real-time animated character 3D pose recognition | Enhanced deep convolutional neural network | Not specified | High-speed inference (up to 384 fps), ~3.5% accuracy improvement | No | Animated character animation, pose estimation |
| Xu et al. [36] | Violence-YOLO model for violence detection in dense public areas | YOLOv9 GELAN-C, SimAM, GhostConv, RepGhostNet, Focaler-IoU loss | Not specified | mAP@0.5 improved by 0.9%, reduced computation load | Yes | Public space monitoring, airport security |
| Abundez et al. [37] | Threshold active learning technique for physical violence detection | Pretrained models, hybrid neural network | Not specified | Enhanced classifier accuracy through iterative retraining | No | Violence detection in videos |