The following section evaluates the proposed technique. First, we briefly describe the datasets used in the experimental protocol. Next, we define the metrics used to measure the framework's performance. Finally, an extensive validation of the proposed solution follows.
4.3. Metrics
We used accuracy, sensitivity, and specificity as metrics to validate the proposed fall detector; these are defined by Equations (3)–(5), respectively. For testing, precision, as given in Equation (6), is also adopted. All four are based on true positive (TP), false positive (FP), true negative (TN), and false negative (FN) detections. TPs are the cases where the system correctly identifies a fall event, FPs are incorrectly detected falls, TNs are correct non-fall detections, and FNs are falls that the framework should have recognized but did not. If more events are identified in a time series than the single fall or non-fall it contains, the FPs increase accordingly.
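For reference, the standard definitions of these metrics in terms of TP, FP, TN, and FN, which Equations (3)–(6) are assumed to express, are:

\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Sensitivity (recall)} &= \frac{TP}{TP + FN}, \\
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{Precision} &= \frac{TP}{TP + FP}.
\end{align}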
Accuracy measures the fraction of correct detections across both fall and non-fall cases. While it is one of the most common metrics, it should be treated cautiously, as it depends on the class balance of the applied dataset. For instance, if a pipeline labels every event as a fall when 95 falls and 5 non-falls exist, the accuracy would be 95%. Sensitivity, also known as recall, reveals how the system performs on positive detections. Specificity, on the other hand, reflects the algorithm's ability to handle negative cases by measuring the rate of correct non-fall detections. These metrics should be considered together, as each one alone can lead to wrong conclusions. Finally, precision quantifies the system's ability to avoid FPs and is complementary to sensitivity. A high score on both sensitivity and specificity is desirable, as their balance is essential; however, a higher score on the former is preferred over the latter, since it is better to raise a few false alarms than to miss an actual fall.
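The class-imbalance caveat noted above for accuracy can be made concrete with a minimal Python sketch; the confusion counts are hypothetical and simply reproduce the 95 falls / 5 non-falls example, where a detector that labels everything as a fall still reaches 95% accuracy with 0% specificity:

```python
def metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, and precision from a confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on falls
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on non-falls
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, sensitivity, specificity, precision

# A detector that predicts "fall" for everything on 95 falls and 5 non-falls:
# all 95 falls become TPs, all 5 non-falls become FPs.
acc, sens, spec, prec = metrics(tp=95, fp=5, tn=0, fn=0)
print(acc, sens, spec, prec)   # 0.95, 1.0, 0.0, 0.95
```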
4.4. Validation
Various experiments were conducted to determine the thresholds and assess the algorithm’s performance by applying different combinations of values.
Table 1 presents five of the eight thresholds that were tested; the three remaining thresholds are set to 50, 100, and 100, respectively. Our validation is divided into two parts. In the first part, two of the thresholds are initially fixed, one of them at 100, while the remaining three are tested (see Table 2). The second part evaluates the previous contrast values (see Table 3), with the first three thresholds kept at the best values achieved during the first part. Considering that the lowest magnitudes on KFall and UR are 0.11 and 0.05 and the greatest are 67.89 and 111.81, we set the initial validation values of the two magnitude thresholds to 2 and 100. The time-related threshold is set to 100, which corresponds to about one second.
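A minimal sketch of this validation procedure, under the assumption of a grid search over the tested thresholds, is given below; the threshold names (low_thr, high_thr, window_thr), the detector interface, and the candidate grids are hypothetical placeholders drawn only from the value ranges reported in the text:

```python
import itertools

def evaluate(detector, dataset, thresholds):
    """Run the detector on every sequence and return (sensitivity, specificity).

    Placeholder interface: `dataset` yields (magnitude series, is_fall) pairs and
    `detector` returns True when it declares a fall for the given thresholds.
    """
    tp = fp = tn = fn = 0
    for magnitudes, is_fall in dataset:
        detected = detector(magnitudes, **thresholds)
        if is_fall and detected: tp += 1
        elif is_fall and not detected: fn += 1
        elif not is_fall and detected: fp += 1
        else: tn += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec

# First validation part: two thresholds fixed, the remaining three swept on a grid.
# Candidate values are illustrative, taken from those mentioned in the text.
grid = {
    "low_thr": [2, 5, 6, 6.5],        # lower magnitude threshold (hypothetical name)
    "high_thr": [25, 35, 40, 100],    # upper magnitude threshold (hypothetical name)
    "window_thr": [100, 105],         # samples; 100 samples correspond to about one second
}

def sweep(detector, dataset):
    """Evaluate every combination of the grid and collect the scores."""
    results = []
    for combo in itertools.product(*grid.values()):
        thresholds = dict(zip(grid.keys(), combo))
        results.append((thresholds, evaluate(detector, dataset, thresholds)))
    return results
```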
Table 2 gives the results on the validation sets; the first line of scores corresponds to the initial threshold values. When the lower threshold is less than or equal to 5 and the upper one is larger than 40, very high specificity and very low sensitivity are observed. This imbalance occurs because highs are then characterized only by the most elevated points, while lows are classified solely by the very low points; as a result, the connections between them that correspond to falls are restricted, and the system fails to detect falls while still performing well on non-falls. On KFall this behavior is expected once the upper threshold exceeds 60, since the highest magnitude in its validation set is 67.89; higher values were nevertheless tested to evaluate the performance on UR, whose highest magnitude is 111.81. The imbalance between sensitivity and specificity is reduced as the lower threshold increases and the upper one decreases. More specifically, on KFall, the most balanced performance, i.e., 85.16% sensitivity and 87.95% specificity, is reached when the three tested thresholds are set to 5, 35, and 100, respectively. On UR, we achieved 80.00% sensitivity and 85.00% specificity with values of 6, 25, and 100. Because these results are attained at different thresholds, in subsequent experiments the upper magnitude threshold receives a value that depends on the average score of all the magnitudes, so that the algorithm adapts each time to the data it receives.
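A possible realization of this data-dependent setting is sketched below; the function name, the use of the arithmetic mean, and the additive offset are illustrative assumptions (the offset of 10 mirrors the "average + 10" combination reported later), not the paper's exact rule:

```python
import numpy as np

def adaptive_upper_threshold(magnitudes, offset=10.0):
    """Derive the upper magnitude threshold from the average of the incoming signal.

    Hypothetical sketch: the detector is assumed to recompute this value for each
    recording so that the threshold adapts to the data it receives.
    """
    return float(np.mean(magnitudes)) + offset

# Example: a window of acceleration magnitudes (synthetic values).
window = np.array([9.8, 10.1, 9.7, 35.2, 9.9])
print(adaptive_upper_threshold(window))  # mean of the window plus the offset
```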
Furthermore, Figure 12 presents the outcomes of Table 2 as AUC–ROC curves. In these representations, each combination of thresholds is depicted as a point on the plot, where the y-axis represents the true positive rate (TPR), i.e., the sensitivity, and the x-axis corresponds to the false positive rate (FPR), calculated as FPR = FP/(FP + TN), i.e., 1 − specificity. In an AUC–ROC curve, a model positioned closer to 1 in TPR and closer to 0 in FPR is considered a better classifier. Regarding KFall (Figure 12, left), it is observed that the algorithm's performance is enhanced when the upper threshold is derived from the average magnitude score. Additionally, a balance between high TPR and low FPR is established when the lower threshold is set to 6.5 and another remains close to 100. On the contrary, very low values for the lower threshold and high values for the upper one lead to reduced FPR and TPR (Figure 12). It is worth noting that the highest TPR is attained when the three tested thresholds are set to 6.5, the average score + 10, and 105, respectively, as shown in Figure 12, while the FPR remains at 0.05. Consequently, these values are chosen in the initial validation phase due to their superior and balanced performance in terms of sensitivity and specificity. This selection is substantiated by the data presented in Table 2 and the AUC–ROC curves of Figure 12, which demonstrate their effectiveness on both KFall and UR.
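For illustration, the mapping from (sensitivity, specificity) pairs such as those in Table 2 to points in ROC space can be reproduced with a few lines of Python; the two listed combinations are the KFall and UR operating points quoted above, and matplotlib is assumed to be available:

```python
import matplotlib.pyplot as plt

# (label, sensitivity, specificity) per threshold combination; the two example
# rows are the KFall and UR scores quoted in the text.
combos = [
    ("KFall: 5 / 35 / 100", 0.8516, 0.8795),
    ("UR: 6 / 25 / 100", 0.8000, 0.8500),
]

for label, sens, spec in combos:
    tpr = sens
    fpr = 1.0 - spec          # FPR = FP / (FP + TN) = 1 - specificity
    plt.scatter(fpr, tpr, label=label)

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```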
Next, the two remaining thresholds are examined while the first three are kept constant at the values and variables defined previously. As depicted in Table 3, our metrics on UR are maintained at 90.00% and 95.00% until the first of these thresholds drops below 75. This is because that threshold is applied to data where continuous peaks arise, e.g., when the subject runs or jumps; in contrast with KFall, however, such actions are not included in UR. Furthermore, the second threshold is initially set equal to the first and was tested only with higher values, since lower ones would not affect the algorithm's detections: this threshold specifically pertains to neighboring peaks, where the fall peak must be greater in order to be accepted. Nevertheless, as depicted in the first five lines of Table 3, where higher values were tested, the performance declined. When the first threshold receives very low values (70, 65, and 60), the sensitivity decreases on both datasets, as the system becomes more stringent in accepting a fall.
Correspondingly, the AUC–ROC curves depicted in Figure 13 show that changes to the thresholds of the second validation phase affect the algorithm's performance less than those of the first part, as the various versions are placed close to each other. In particular, in the UR curve (Figure 13, right), all combinations fall at exactly the same point except for the three cases where the first threshold is less than 70, for which the TPR is reduced. Additionally, the KFall curve (Figure 13, left) shows that as the second threshold is increased beyond the first, the FPR also increases without the TPR being enhanced. As for the first threshold, the algorithm's performance remains stable for values either lower or higher than 100; however, when it falls below 70, there is a noticeable but minor impact on both TPR and FPR.
Finally, when this threshold is set to 85, 80, and 75, the sensitivity on KFall remains unchanged at 89.04%, while the specificity improves from 83.74% to 84.03%. This prompts us to choose 85 for it, aiming to accept more falls, and to keep the remaining threshold equal to it.
4.5. Method’s Outcome and Comparative Results
After our validation process, the final values of the eight thresholds are fixed: the five thresholds examined above receive the values selected during validation (6.5, the average score + 10, and 105 from the first part; 85 and an equal value for the neighboring-peak threshold from the second part), while the remaining three keep their preset values of 50, 100, and 100 and were not evaluated.
Table 4, which depicts the final results on the test sets, shows a sensitivity of 90.40% and 91.56% on MMsys and KFall, respectively, demonstrating the improved performance of our human fall detector. Additionally, the system proves robust to non-fall actions, achieving a specificity of 93.96% on MMsys and 85.90% on KFall. At the same time, the balance between sensitivity and specificity indicates that our framework distinguishes a fall from a daily activity. Moreover, the proposed pipeline outperforms the heuristic-based approach of [45] on MMsys in terms of sensitivity, and the rule-based method of [7] on KFall. Regarding precision, both other pipelines achieve higher rates, implying fewer FPs; yet it is preferable for a system to detect some false events while identifying a correspondingly higher number of TPs, especially if the balance between these metrics is maintained. The comparison between the proposed algorithm, the logistic regression described in [29], and the CNN-based approach in [7] indicates that our framework does not outperform the machine learning approaches. Finally, regarding the placement of the sensor on the human body, the results on MMsys show that our system fails to reach high performance when the sensor is positioned on the thigh instead of the chest: the scores of 76.88%, 62.28%, 81.28%, and 50.00% for accuracy, sensitivity, specificity, and precision, respectively, confirm this.
Table 5 and Table 6 present the TPs and FPs for each sub-class separately when tested on MMsys and KFall. These results allow us to understand where the algorithm performs well and where it does not. In Table 7, the results of Table 5 and Table 6 are condensed by grouping the sub-classes into ranges of false rates. Out of a total of 50 sub-classes across both test sets, the proposed pipeline makes no mistakes in 15 sub-classes, keeps the false rate very small in another 16, performs well in six more with a somewhat higher false rate, and shows a moderately higher rate in another eight. The remaining five sub-classes constitute the weak aspect of our system: these are "Near fall" on MMsys and, on KFall, "Stumble while walking", "Forward falls while jogging caused by a trip", "Sit a moment trying to get up and collapse into a chair", and "Gently jump when trying to reach an object", which present the highest false rates.
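Assuming the false rate of a sub-class is the share of wrong detections among all its trials, the grouping of Table 7 can be derived from the per-sub-class counts of Table 5 and Table 6 with a short script; the counts below are hypothetical examples, not the reported values:

```python
def false_rate(correct, wrong):
    """False rate of a sub-class: wrong detections over all trials of that sub-class
    (an assumed definition for illustration)."""
    total = correct + wrong
    return wrong / total if total else 0.0

# Hypothetical counts (correct, wrong) for two sub-classes.
subclasses = {"Near fall (MMsys)": (10, 40), "Walk normally (KFall)": (50, 0)}
for name, (correct, wrong) in subclasses.items():
    print(f"{name}: {false_rate(correct, wrong):.2%}")
```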
Lastly, in “False-2” at
Table 5 and
Table 6, we present every wrong detection that occurred before our last check. When it is not applied, the performance is reduced, particularly in the non-fall sub-classes, where the data patterns are similar to those of falls. It is worth noting that in KFall, when the last check is missing, the action “Gently jump trying to reach an object” presents false detections ranging from 99 to 109. Similarly, for “Jog normally with turn (4 m)”, these increase from 3 to 60; for “Jog quickly with turn (4 m)”, from 5 to 76; and for “Stumble while walking”, they rise from 59 to 84. Similarly, in MMsys, when the last check is missing, the “Near fall” sub-class, shows a score ranging from 85 to 92. Additionally, for “Ascending and Descending a staircase”, the wrongs increase from 4 to 19.