4.3. Experiments and Evaluations
The machine used for the experiments was equipped with an Intel® Core™ i9-9900K CPU @ 3.60 GHz, 64 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory, running a 64-bit Windows 10 operating system. All experiments were conducted in a 64-bit version of MATLAB R2022b.
To increase the generalization and reliability of the results, a larger number of cross-validation folds would have been desirable; however, this was not possible due to hardware limitations. The models were therefore validated using a three-fold cross-validation method, i.e., each model was trained and tested three times, with one fold used for testing and the remaining two used for training and validation. Of the training and validation data, around 85% were used for training and the rest for validation.
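As an illustration, a minimal MATLAB sketch of this partitioning scheme is given below; the variable names are placeholders, not part of the original implementation, and the holdout fraction is chosen to match the fold sizes reported in the next paragraph.

```matlab
% Minimal sketch of the 3-fold cross-validation split (illustrative only;
% variable names are placeholders, not the original implementation).
labels = categorical(repmat(["fall"; "nonfall"], 3196, 1));  % 6392 sequence labels (placeholder)

cv = cvpartition(labels, 'KFold', 3);          % stratified three-fold partition
for k = 1:cv.NumTestSets
    testIdx     = test(cv, k);                 % one fold (~2130 sequences) for testing
    trainValIdx = training(cv, k);             % remaining two folds (~4262 sequences)

    % Split the remaining data again: 3836 for training, 426 for validation.
    ho = cvpartition(labels(trainValIdx), 'HoldOut', 426/4262);
    % training(ho) / test(ho) then index into the train+validation subset.
end
```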
The experiments conducted in this work evaluated the classification performance of the fine-tuned pre-trained models (i.e., GoogleNet, SqueezeNet, ResNet18, and DarkNet19) and the proposed multi-stream 3D CNN model. Using such pre-trained models facilitates a fair comparison with the proposed multi-stream 3D CNN model. Applying the three-fold cross-validation technique to every model, each fold used 2130 3D image sequences for testing, while the remaining 3836 and 426 sequences were used for training and validation, respectively. All models were trained with the Adam optimization algorithm, which is widely regarded as one of the most efficient available optimization methods, using a gradient decay factor of 0. The initial learning rate was set to 0.001, and the regularization factor was set to 0.0001. Due to memory limitations, the models were trained for 100 epochs with a minibatch size of 8.
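The corresponding MATLAB (Deep Learning Toolbox) training configuration might look like the sketch below; dsTrain and dsVal are hypothetical datastore names, and the name-value pairs mirror the settings stated above.

```matlab
% Training configuration mirroring the reported settings (sketch).
% dsTrain and dsVal are hypothetical training/validation datastores.
options = trainingOptions('adam', ...
    'GradientDecayFactor', 0, ...              % as stated above (Adam's default is 0.9)
    'InitialLearnRate',    1e-3, ...
    'L2Regularization',    1e-4, ...           % regularization factor
    'MaxEpochs',           100, ...
    'MiniBatchSize',       8, ...              % limited by GPU memory
    'ValidationData',      dsVal, ...
    'Shuffle',             'every-epoch');

% net = trainNetwork(dsTrain, layers, options);  % layers: the model's layer graph
```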
The test evaluation is based on accuracy, sensitivity, specificity, and precision, which are among the most commonly used performance indicators. With three-fold cross-validation, the mean and standard deviation of each measurement were reported over the three testing folds. As shown in Table 3, the performance measurements were derived from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
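For completeness, these four measurements are computed from the confusion-matrix counts using the standard definitions:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN},
\]
\[
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}.
\]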
Our proposed method for human fall detection was evaluated using a three-fold cross-validation approach to ensure the generalization and reliability of the results. As described in Section 3.2, every input video sequence was preprocessed by converting the frames to grayscale, resizing them to 128 × 128 pixels in the height and width dimensions, and fusing every 16 frames into four fused images. These four fused images were used as the input to every evaluated model.
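A sketch of this preprocessing step is given below; note that the actual fusion operator is the one defined in Section 3.2, and the simple averaging of four consecutive frames used here is only an assumption for illustration. The array videoFrames is a hypothetical H × W × 3 × 16 buffer holding one input sequence.

```matlab
% Preprocessing sketch: 16 frames -> grayscale -> 128x128 -> 4 fused images.
% ASSUMPTION: the group fusion is shown as a mean of four frames; the paper's
% actual fusion operator (Section 3.2) may differ.
fused = zeros(128, 128, 4, 'single');
for g = 1:4                                    % four groups of four frames each
    acc = zeros(128, 128, 'single');
    for f = 4*(g-1)+1 : 4*g
        gray = rgb2gray(videoFrames(:, :, :, f));          % RGB -> grayscale
        acc  = acc + single(imresize(gray, [128 128]));    % resize to 128 x 128
    end
    fused(:, :, g) = acc / 4;                  % fused image g (assumed: group mean)
end
% 'fused' (128 x 128 x 4) is the input to every evaluated model.
```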
The results of the three-fold cross-validation experiments for each of the fine-tuned pre-trained CNN models (i.e., GoogleNet, SqueezeNet, ResNet18, and DarkNet19), in addition to the proposed multi-stream 3D CNN model, are shown in Table 4. These results are reported as the average of each test-set metric, together with its standard deviation (std), over the three cross-validated folds. The proposed multi-stream 3D CNN model achieved the best results among the compared state-of-the-art models, with an average accuracy, sensitivity, specificity, and precision of 99.03%, 99.00%, 99.68%, and 99.00%, respectively. The standard deviations of these metrics were 0.44%, 0.46%, 0.15%, and 0.46%, respectively. This demonstrates the potential of the proposed multi-stream 3D CNN model to improve the detection of human falls. In addition, Figure 6 shows the ROC curves, which indicate that the proposed model outperforms the other state-of-the-art models.
Information density measures how a model's accuracy relates to its size. It is defined as the accuracy (in percent) obtained by the model divided by its number of parameters in millions (M); a good model has a high information density.
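Formally, for a model with accuracy \(\mathrm{Acc}\) (in percent) and \(P\) parameters (in millions),

\[
\mathrm{ID} = \frac{\mathrm{Acc}}{P} \;\; \left[\%/\mathrm{M}\right],
\]

so that, for example, a hypothetical model reaching 98% accuracy with 2 M parameters has an information density of 49 %/M.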
Table 5 shows the accuracy, number of parameters, number of layers, and information density of the models. From the table, we can see that the information density of the proposed model is the highest by a wide margin: the proposed model is lightweight, yet achieves high accuracy.
4.4. Discussion
The proposed method aims to detect human falls using a vision-based multi-stream 3D CNN model with fusion of video frame sequences to capture consecutive spatial and temporal information. From the results in Table 4, it can be seen that GoogleNet, SqueezeNet, and DarkNet19 performed similarly, with mean accuracies of 97.06%, 97.62%, and 97.45% and standard deviations of 1.09%, 0.29%, and 0.12%, respectively. Likewise, for the other metrics, these models had sensitivities of 97.45%, 97.57%, and 97.35%, specificities of 99.19%, 99.22%, and 99.15%, and precisions of 97.51%, 97.55%, and 97.39%, respectively. Although SqueezeNet slightly outperformed both GoogleNet and DarkNet19 in all metrics, with differences of 0.56% and 0.17% in accuracy, 0.12% and 0.22% in sensitivity, 0.38% and 0.07% in specificity, and 0.04% and 0.16% in precision, respectively, DarkNet19 exhibited lower (i.e., better) standard deviations across the three cross-validated folds.
ResNet18 performed better still, with a mean accuracy, sensitivity, specificity, and precision of 98.75%, 98.73%, 99.59%, and 98.71% and standard deviations of 0.09%, 0.08%, 0.03%, and 0.09%, respectively. The proposed 4S-3DCNN model outperformed the other fine-tuned state-of-the-art models on every metric, with values of 99.03%, 99.00%, 99.68%, and 99.00% for accuracy, sensitivity, specificity, and precision, respectively. Although its standard deviations were higher than those of all models except GoogleNet, it achieved gains over ResNet18 of 0.28% in mean accuracy, 0.27% in mean sensitivity, 0.09% in mean specificity, and 0.29% in mean precision. This indicates the effectiveness of our proposed method in comparison with the four state-of-the-art fine-tuned models (i.e., GoogleNet, SqueezeNet, ResNet18, and DarkNet19).
Furthermore, a comparison with similar works is given in Table 6. The reported results are those obtained by each work's authors when evaluating their proposed method on the Le2i dataset. This comparison includes four non-CNN-based methods and four CNN-based methods, along with our proposed method.
The work of Chamle et al. [73] utilized a background subtraction approach to detect and mark movement between video frames with rectangular and elliptical bounding boxes. A gradient boosting classifier was then used to classify falls based on the aspect ratio, fall angle, and silhouette height extracted from the marked bounding box. Their method, reported without specifying the sizes of the training, testing, and validation sets, achieved an accuracy of 79.30%, sensitivity of 84.30%, specificity of 73.07%, and precision of 79.40%. Alaoui et al. [74] computed human silhouettes using optical flow to produce motion vectors for fall detection, applying a directional distribution method based on the von Mises distribution. They achieved an accuracy of 90.90%, sensitivity of 94.56%, specificity of 81.08%, and precision of 90.84%. Poonsri et al. [75] used principal component analysis (PCA) to extract aspect and area ratios, as well as the orientation of the human silhouette, to determine fall events. Their results reached 86.21%, 93.18%, 64.29%, and 91.11% in accuracy, sensitivity, specificity, and precision, respectively. In [76], the authors proposed a fall detection method based on human body skeleton features, computing similarity scores between sequences with the dynamic time warping (DTW) algorithm. For the classification between fall and non-fall sequences under the leave-one-out protocol, an SVM classifier with a linear kernel was used, resulting in an accuracy, sensitivity, specificity, and precision of 93.67%, 100.00%, 87.00%, and 83.62%, respectively.
Núñez et al. [56] introduced a CNN-based fall detection method that included pre-training on the ImageNet dataset, tuning on an action-motion dataset (the UCF101 dataset), and final fine-tuning for the fall detection task. Their method constructed an optical flow block from every 10 consecutive frames and used it as the input. Their best-reported results, obtained with lighting manipulations of the training sets, were 97.00% accuracy, 99.00% sensitivity, and 97.00% specificity, with no precision reported. Zou et al. [77] presented a 3D-CNN-based method for the detection of human fall incidents. From a sequence of 16 frames, 3D features were extracted using a 3D-CNN model followed by a tube anchor generation layer and a softmax classification layer. Their results on the Le2i dataset reached 97.23% accuracy, 100.00% sensitivity, and 97.04% specificity. Vishnu et al. [78] proposed a fall motion mixture model (FMMM) constructed from a Gaussian mixture model (GMM) to detect fall and non-fall events. For feature extraction, they adopted a histogram of optical flow (HOF) and a motion boundary histogram (MBH) to form a fall motion vector from 15 consecutive frames. Their experiments achieved an accuracy of 78.50%, sensitivity of 93.00%, and precision of 81.50%, without reporting the specificity.
Our proposed method, based on a multi-stream 3D CNN with a four-branch architecture (4S-3DCNN), greatly outperforms similar works in accuracy, specificity, and precision. Although the works of Zou et al. [77] and Youssfi et al. [76] show a better sensitivity, with only a 1% difference compared to our method, Zou et al. [77] did not consider any type of cross-validation in their evaluation, while Youssfi et al. [76] applied only the leave-one-out cross-validation protocol. Given that our evaluation applies the three-fold cross-validation approach, this supports the generalization capability of our 4S-3DCNN model and its low susceptibility to overfitting.
Table 6.
Comparison of the proposed 4S-3DCNN with similar works on the Le2i dataset.
| Model | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|
| Chamle et al. [73] | 79.30% | 84.30% | 73.07% | 79.40% |
| Núñez et al. [56] | 97.00% | 99.00% | 97.00% | - |
| Alaoui et al. [74] | 90.90% | 94.56% | 81.08% | 90.84% |
| Poonsri et al. [75] | 86.21% | 93.18% | 64.29% | 91.11% |
| Zou et al. [77] | 97.23% | 100.00% | 97.04% | - |
| Vishnu et al. [78] | 78.50% | 93.00% | - | 81.50% |
| Youssfi et al. [76] | 93.67% | 100.00% | 87.00% | 83.62% |
| Proposed 4S-3DCNN | 99.03% | 99.00% | 99.68% | 99.00% |