3.1. Trained Model’s Performance
The performance of the audio model was evaluated using cross-validation on the training data together with the F1-score, accuracy, precision, recall, confusion matrices, balanced accuracy, and PPV and NPV values. In addition, paired t-tests and the inference time were computed to gain further insight into the evaluation, as shown in Table 5.
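For reference, the following minimal sketch (assumed variable names y_test, y_pred, y_score and an already-fitted classifier; not the exact evaluation script used in this work) shows how the reported metrics can be computed with scikit-learn:

import time
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score, roc_auc_score,
                             confusion_matrix)

def evaluate(model, X_test, y_test):
    # Time the predictions to obtain an inference-time estimate.
    start = time.time()
    y_pred = model.predict(X_test)
    inference_time = time.time() - start
    y_score = model.predict_proba(X_test)[:, 1]   # positive-class probabilities for ROC-AUC

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "ppv": tp / (tp + fp),     # positive predictive value (equals precision for the positive class)
        "npv": tn / (tn + fn),     # negative predictive value
        "inference_time_s": inference_time,
    }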
3.1.1. Classification Performance in the Training Set Using Random Forest
Our model shows excellent performance on the held-out test split, with high scores across all metrics. This level of performance indicates that the dataset used during training helps the classifier learn a clear distinction between the classes. Although such high training-phase performance might suggest overfitting, the choice to train on only a small percentage of the overall dataset (10%) and the consistency of the results support the conclusion that the model does not overfit and generalizes beyond the training set. Cross-validation was also used to mitigate this risk and prevent the model from merely memorizing patterns in a specific split of the data, which supports good generalization. As with the prior visual model, the low standard deviations across the folds combined with the high performance metrics show that the model generalizes well to the data (Figure 1).
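As an illustration of this procedure, the minimal sketch below (using synthetic stand-in data of the same size as the 10% training split, not the actual audio features) shows how the per-fold and mean cross-validation scores discussed next can be obtained:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 285-entry audio training split (assumed shape and balance).
X, y = make_classification(n_samples=285, n_features=20, weights=[0.45, 0.55],
                           random_state=42)

scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print("Per-fold accuracies:", scores.round(2))      # reported range in this work: 0.81-0.93
print("Mean CV accuracy:", scores.mean().round(2))  # reported mean in this work: 0.88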
Cross-validation scores (five folds): These are the model accuracies for the five folds of the data. The scores, ranging from a low of 0.81 to a high of 0.93, indicate that the model's performance varies somewhat across splits of the data, which is acceptable considering the imbalance in the data.
Mean cross-validation score: This represents the average accuracy for all five folds during the cross-validation and gives a more robust estimation of the performance of the model for unseen data. The score of 0.88 indicates solid generalizability, showcasing a robust model.
Inference Time: This is a crucial metric for real-time applications, representing the time the model needs to make predictions on new data. The value of 0.09 indicates the model's suitability for real-time use.
Balanced Accuracy Before Cross-validation: This metric is useful for imbalanced classes, as it is the average of the recall obtained for each class. The score of 0.92 is an excellent result for a dataset with class imbalance, like ours.
ROC-AUC Score: This measures the model’s ability to distinguish between positive and negative classes, with a score of 1 being a perfect performance. The score of 0.9 highlights a strong model capable of distinguishing between classes.
Precision: This is the percentage of instances the model labeled as positive that are actually positive. The score of 0.93 means that the model's positive predictions are correct 93% of the time.
Recall: This is the percentage of the actual positive cases that the model correctly identifies. The score of 0.93 means that the model successfully identifies 93% of the actual positive cases.
F1-Score: This is the harmonic mean of precision and recall, balancing the two metrics so that both false positives and false negatives are taken into account. The score of 0.93 indicates that the model keeps both of these error types low.
Accuracy: This is the correctly classified instances’ overall percentage. The score of 0.92 shows that the model performs well in the overall classification.
Positive Predictive Value (PPV): This is the percentage of actually positive instances out of all the instances the model evaluated as positive. The score of 0.93 shows that most of the positive predictions are correct.
Negative Predictive Value (NPV): This is the percentage of actually negative instances out of all the instances the model evaluated as negative. The score of 0.92 shows that the model is predicting negative instances correctly 92% of the time.
Balanced Accuracy After Cross-validation: The score of 0.92 indicates that the pre- and post-cross-validation accuracies are consistent, meaning that no significant bias or instability was introduced into the model, further enhancing its reliability.
True Positives (TPs): 30—Instances of the positive class correctly identified by the model.
True Negatives (TNs): 23—Instances of the negative class correctly identified by the model.
False Positives (FPs): 2—Instances of the negative class incorrectly identified by the model.
False Negatives (FNs): 2—Instances of the positive class incorrectly identified by the model.
Learning Curve: This depicts the performance of the model over the course of training on the training dataset. The training score shows that the model performs extremely well on the training data, as expected, since these are the data it was fitted on. The cross-validation score remains within a small gap of the training score (0.9–1.0) and plateaus, indicating that the model generalizes well (Figure 3).
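The learning curve in Figure 3 can be reproduced with scikit-learn's learning_curve utility; the sketch below uses synthetic stand-in data and assumed parameters, not the exact plotting code of the study:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data of the same size as the training split.
X, y = make_classification(n_samples=285, n_features=20, random_state=42)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    cv=5, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(sizes, cv_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Training samples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()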
3.1.2. Paired t-Test Analysis
We performed a paired t-test analysis to evaluate whether there is a statistically significant difference between the cross-validation metrics and the corresponding test-set metrics. The results are presented in Table 6 and Figure 4 below.
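A minimal sketch of this comparison with scipy.stats.ttest_rel is shown below; the score arrays are illustrative placeholders and an assumed pairing scheme, not the values reported in Table 6:

from scipy.stats import ttest_rel

# Illustrative placeholder scores: per-fold cross-validation accuracies paired with
# accuracies measured on corresponding test-set evaluations (assumed pairing).
cv_accuracy = [0.81, 0.86, 0.90, 0.91, 0.93]
test_accuracy = [0.90, 0.91, 0.92, 0.93, 0.94]

t_stat, p_value = ttest_rel(cv_accuracy, test_accuracy)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value > 0.05:
    print("No statistically significant difference at the 0.05 level.")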
Paired t-test results: This compares the model's performance obtained through cross-validation with its performance on the test set. A p-value above 0.05 indicates that there is no statistically significant difference between them, i.e., that cross-validation did not change the apparent performance of the model significantly.
T-test for Accuracy: The p-value is greater than 0.05, meaning that there is no statistically significant difference between the accuracy of the cross-validation and the accuracy of the test set.
T-test for Precision: The p-value is greater than 0.05, meaning that there is no statistically significant difference between the precision of the cross-validation and the precision of the test set.
T-test for Recall: The p-value is high, meaning that the difference in recall between the cross-validation and the test set is attributable to random chance and is not statistically significant; the test set's recall is therefore consistent with the cross-validation's recall.
T-test for the F1-Score: There is no statistically significant difference between the F1-scores acquired from the cross-validation and the test set, indicating that the F1-score is similar across both datasets.
3.2. Model’s Performance After Testing in an External Dataset
After training our model on a small portion of the data (10%, or 285 entries), we saved the trained model and then introduced data that the model had not seen during training. Even though a training–testing split was already used during training to evaluate the model's performance, testing a model trained on a small portion of the data against a large portion of unseen data is crucial for verifying that overfitting has been avoided and that the model generalizes well. The model showed exemplary performance on the new dataset (90% of the data, or 2573 entries), further confirming its robustness, generalizability, and suitability for real-world applications, as shown in Table 7.
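A hedged sketch of this save-then-test workflow is given below; the file names, the 'label' column, and the use of joblib for persistence are assumptions for illustration, not the exact scripts of this work:

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training phase: fit on the small split (10% of the data, 285 entries) and persist the model.
train_df = pd.read_csv("audio_train_10pct.csv")            # assumed file name
X_train, y_train = train_df.drop(columns=["label"]), train_df["label"]
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
joblib.dump(model, "audio_rf_model.joblib")

# External-test phase: reload the saved model and evaluate on the unseen 90% (2573 entries).
model = joblib.load("audio_rf_model.joblib")
test_df = pd.read_csv("audio_test_90pct.csv")              # assumed file name
X_test, y_test = test_df.drop(columns=["label"]), test_df["label"]
print("External-set accuracy:", model.score(X_test, y_test))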
Performance of the Trained Model in the Test Set (External Dataset)
The trained model's overall performance remains very high: it classifies the data instances correctly and distinguishes well between the positive and negative classes. The shape of the curve indicates that the model achieves a high true-positive rate while maintaining a low number of false positives. A minor drop in the scores is observed, which is acceptable considering that the model was trained on a very small, class-imbalanced portion of the data and still maintains its reliability [48,49] (Figure 5).
Balanced Accuracy in the External Dataset: The score of 0.88 suggests that the model generalizes well to the unseen data, although some noise or feature variability may have been introduced, causing the slight drop in the score.
ROC-AUC Score: The score of 0.95 shows that although the balanced accuracy dropped, the model’s overall ability for predictions remains reliable.
Precision: The score of 0.87 means that the model predicts correct positive classes 87% of the time.
Recall: The score of 0.87 means that the model successfully identifies 87% of the actual positive cases.
F1-Score: The score of 0.87 indicates a good balance between precision and recall.
Accuracy: The score of 0.87 shows that the model performs well in the overall classification and is consistent with the F1-, precision, and recall scores, reinforcing the reliability of the model.
Positive Predictive Value (PPV): The score of 0.88 shows that most of the positive predictions are correct.
Negative Predictive Value (NPV): The score of 0.85 shows that the model is predicting negative instances correctly 85% of the time.
True Positives (TPs): 1327—Instances of the positive class correctly identified by the model.
True Negatives (TNs): 914—Instances of the negative class correctly identified by the model.
False Positives (FPs): 175—Instances of the negative class incorrectly identified by the model.
False Negatives (FNs): 157—Instances of the positive class incorrectly identified by the model.
Accuracy for Aggression Detections: 90%—Of all the instances truly labeled as aggressive, the model identified 90% as aggressive.
Accuracy for Argument Detections: 84%—Of all the instances truly labeled as non-aggressive, the model identified 84% as non-aggressive (see the arithmetic sketch after this list).
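These per-class accuracies correspond to the per-class recall derived from the confusion matrix above (assuming aggression is the positive class); the short check below reproduces the arithmetic:

# Per-class accuracy (recall per class) from the external-set confusion matrix.
tp, tn, fp, fn = 1327, 914, 175, 157

aggression_accuracy = tp / (tp + fn)        # 1327 / 1484 ≈ 0.894 -> reported as 90%
non_aggression_accuracy = tn / (tn + fp)    # 914 / 1089 ≈ 0.839 -> reported as 84%

print(f"Aggression: {aggression_accuracy:.3f}, Non-aggression: {non_aggression_accuracy:.3f}")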
The performance of the model tested on an external dataset shows promising results, with a good prediction balance between the two classes and per-class accuracies falling between 0.8 and 0.9. These values are taken directly from the output of our code after the testing was finalized, saved in a ‘detection_results.csv’ file. The final results are shown below in Figure 7.
3.3. Meta-Classifier Performance Evaluation After Late Fusion
After applying the late fusion rule to the visual_detection_results.csv and audio_detection_results.csv files, we loaded the merged_detection_results.csv file, which contains the merged results, and applied a pipeline identical to those of the two previous models to develop a meta-classifier. The meta-classifier reads the data, imputes missing values (if any), standardizes the features, and trains the random forest classifier with 5-fold cross-validation and an 80–20 training–testing split to make predictions on the combined detection results. The final results are shown in Table 8, Table 9 and Table 10.
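A hedged sketch of this meta-classifier pipeline is shown below; the shared 'clip_id' key, the 'label' column, and the simplified column merge are assumptions for illustration, not the exact fusion code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Merge the per-modality detection results (ground-truth label assumed to be
# carried in the visual file) and save the fused dataset.
visual = pd.read_csv("visual_detection_results.csv")
audio = pd.read_csv("audio_detection_results.csv")
merged = visual.merge(audio, on="clip_id", suffixes=("_vis", "_aud"))
merged.to_csv("merged_detection_results.csv", index=False)

X = merged.drop(columns=["clip_id", "label"])   # fused per-modality detection scores
y = merged["label"]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill missing values, if any
    ("scale", StandardScaler()),                  # standardize the features
    ("rf", RandomForestClassifier(random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
cv_scores = cross_val_score(pipeline, X_tr, y_tr, cv=5, scoring="balanced_accuracy")
pipeline.fit(X_tr, y_tr)
print("CV balanced accuracy:", cv_scores.mean())
print("Test accuracy:", pipeline.score(X_te, y_te))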
ROC-AUC Score: 95.52%—An indication that the model's ability to discriminate between the classes is excellent, as shown in Figure 8.
Accuracy: 88.0%—A high value, indicating the ability of the model to predict correctly.
Precision: 87.60%—When the model predicts the positive class (aggressive), it is correct 87.6% of the time.
Recall: 92.17%—An indication that the model identifies most of the true positives and that the missed predictions are few.
F1-Score: 89.83%—The harmonic mean of the model's precision and recall is high, showcasing a good balance between the two.
Balanced Accuracy: 87.26%—An indication that the model performs well across both classes and not only in the dominant class.
Inference Time: The overall inference time is 0.0030 s, which means that the model ensures quick predictions in real-world scenario applications.
Positive Predictive Value (PPV): 87.60%—When the model predicts the positive class (aggressive), it is correct 87.6% of the time.
Negative Predictive Value (NPV): 88.61%—This indicates a high prediction accuracy in the non-aggressive class.
Cross-Validated Balanced Accuracy: 86.82%—This score is similar to the balanced accuracy of the test set, proving the reliability of the model.
Cross-validated ROC-AUC score: 95.52%—This depicts the good generalization ability and robustness of the model.
True Positives (TPs): 106—Instances of the positive class correctly identified by the model.
True Negatives (TNs): 70—Instances of the negative class correctly identified by the model.
False Positives (FPs): 15—Instances of the negative class incorrectly identified by the model.
False Negatives (FNs): 9—Instances of the positive class incorrectly identified by the model.
Accuracy for Aggression Detections: 92%—Of all the instances truly labeled as aggressive, the model identified 92% as aggressive.
Accuracy for Argument Detections: 82%—Of all the instances truly labeled as non-aggressive, the model identified 82% as non-aggressive.
Learning Curve: The training score shows that the model performs extremely well on the training data, while the cross-validation performance drops slightly between roughly 460 and 550 samples and then increases steadily before plateauing. This behavior is expected because the cross-validation data have not been seen by the model and the classes are not balanced, with the non-aggression class having fewer entries than the aggression class; the drop is therefore insignificant. This is also why the training score remains stable at 1.0 while the cross-validation score gradually approaches it (Figure 10).
Paired t-test Analysis
We performed a paired t-test analysis on the meta-classifier to evaluate whether there is a statistically significant difference between the cross-validation metrics and the test-set metrics. For all our metrics, the p-value is greater than 0.05, indicating no statistically significant differences in the performance of the model between the cross-validation and the test set. The results are presented in Table 11 and Figure 11 below.
T-test for Accuracy: The t-stat indicates small differences in accuracy between the cross-validation and the test set. The p-value is higher than the typical 0.05 threshold, indicating no statistically significant difference between the accuracy in the cross-validation and the test set.
T-test for Precision: The t-stat indicates small differences in precision between the cross-validation and the test set. The p-value is higher than the typical 0.05 threshold, indicating no statistically significant difference between the precision in the cross-validation and the test set. This also indicates that the precisions in the test set and the cross-validation are consistent.
T-test for Recall: The t-stat indicates a more noticeable difference in recall between the cross-validation and the test set than for the other metrics, without reaching extreme values. The corresponding p-value is the closest to the 0.05 threshold, suggesting that the difference in recall may be partly systematic rather than purely due to random chance, even though it does not reach conventional statistical significance.
T-test for the F1-Score: The t-stat indicates very small differences in the F1-score between the cross-validation and the test set. The p-value suggests that the F1-score in the test set is consistent with the F1-score observed during the cross-validation.
In
Figure 12, we can see the performance metrics of the meta-classifier after the final testing. In
Figure 13, we can see the meta-classifier’s count of predictions per probability range. The results show an improvement in certainty, as the probabilities increase from 0.6 to 1.0, which is a good indication. The lower non-aggression prediction count is acceptable because the aggression class had more entries than the non-aggression class.
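The probability-range counts in Figure 13 can be reproduced by binning the meta-classifier's predicted probabilities; the sketch below uses a synthetic stand-in for the predicted probabilities, since it is illustrative only:

import numpy as np
import pandas as pd

# Stand-in for pipeline.predict_proba(X_te)[:, 1] on the 200-sample test split.
rng = np.random.default_rng(42)
proba = rng.beta(5, 2, size=200)

bins = np.arange(0.0, 1.01, 0.1)   # 0.0-0.1, 0.1-0.2, ..., 0.9-1.0
counts = pd.Series(pd.cut(proba, bins=bins, include_lowest=True)).value_counts().sort_index()
print(counts)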
In summary, in the final meta-classifier evaluation, we observe that the combination of audio and visual data further enhanced the model, providing a more robust application of the multimodal approach for predicting aggressive behaviors in real-world applications. Although the recall for the non-aggression class has some room for improvement, this is an acceptable outcome, considering the class imbalance (606 instances in the aggression class vs. 390 non-aggression-class instances) [
50,
51]. However, the balanced accuracy indicates that the model is robust and that it handles class imbalances well [
52].
The effect of the class imbalance on our model is visible in Figure 9, Figure 12, and Figure 13: with the model favoring the majority class (aggression), recall in the non-aggression class is slightly lower (0.84), while it is higher in the aggression class (0.98). The balanced accuracy of 87.26% reflects this bias while still indicating that the model handles both classes reasonably well. Class-balancing and bias-mitigation strategies could be applied to the model, such as oversampling techniques like the synthetic minority oversampling technique (SMOTE), or the random forest’s class_weight = ‘balanced’ attribute, which increases the weight of the minority class; a brief sketch of these options is given below. Nonetheless, to keep the comparison among the audio model, the visual model, and the meta-classifier unbiased, no class-balancing strategy was used, because none of these techniques had been applied to the previous model either. To mitigate the imbalance, a 5 × 5-fold cross-validation was used in the visual model from the prior work, in the audio classifier, and in the meta-classifier, providing methodological class-imbalance mitigation and reducing the chance that the results stem from a particular random split. In spite of this bias, the meta-classifier still allows a head-to-head comparison between the previous model and the enhanced prediction model.
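For completeness, the sketch below shows how either mitigation option could be wired in; it is illustrative only, since, as noted above, no balancing was applied in the reported experiments (SMOTE requires the imbalanced-learn package):

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Option 1: reweight the minority (non-aggression) class inside the random forest itself.
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)

# Option 2: oversample the minority class with SMOTE before fitting the classifier.
smote_pipeline = ImbPipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),    # synthesizes minority-class samples
    ("rf", RandomForestClassifier(random_state=42)),
])
# Neither option was used in this work, to keep the comparison with the prior models unbiased.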
3.4. Comparison of the Initial Visual Model and the Late Fusion Model
Following the testing phase of the meta-classifier with the late-fusion merging technique, the evaluation metrics of the meta-classifier, which combines predictions from both visual and audio features, were compared with the performance of the visual model. These results are presented in Figure 14.
The multimodal meta-classifier indicates stability, as shown in Figure 14, with its performance metrics evenly distributed across the spider diagram. Significant improvements in recall, accuracy, F1-score, accuracy for aggression detections, and NPV are shown. However, minor reductions in precision and PPV are observed, as well as a drop in the accuracy for non-aggression detections. These drops result from the class imbalance in the late fusion merging and should be viewed in the context of a meta-classifier whose metrics are otherwise well balanced.
As shown in Figure 15, the initial visual model has a shorter inference time (3.40 ms) than the meta-classifier (95.50 ms); the meta-classifier is roughly 28 times slower because it has to process both audio and visual features, which increases the computational complexity. A tradeoff can be observed in the metrics, where the meta-classifier outperforms the visual-only model in all the scores. This improved performance highlights that the new model significantly enhances the quality of the classification by incorporating audio features. Although speed is key in real-world applications, this tradeoff in favor of classification quality is acceptable.
The statistical testing of the results is shown in Figure 16. The t- and p-values were used to measure the significance of the differences. The t-value indicates the size of the difference relative to the variability between the two models; for a difference to be significant, its absolute value must be large. A p-value below 0.05 is the threshold for a statistically significant difference.
Both comparisons have high p-values, well above the threshold value (0.05), indicating that there is no significant difference between the two models. Model 1 has a larger negative t-value and a lower p-value, pointing toward a greater, though still non-significant, difference than in the Model 2 comparison. Model 2 exhibits a minimal precision difference, with a high p-value of 0.7008.
According to these findings, Model 1 has better precision, while Model 2 shows a balanced performance with a higher F1-score. No statistically significant difference was observed between the two models.