Our approach solves the fall detection problem with an end-to-end solution based on two steps: person detection and fall classification. The person detection algorithm localizes all persons in an image; its output is the enclosing bounding boxes and the confidence scores that reflect how likely it is that each box contains a person. Fall classification then estimates whether the detected person has fallen or not.

#### 3.2. Deep Learning-Based Person Detection

CNNs are one of the most popular machine-learning algorithm types at present, and it has been decisively proven over time that they outperform other algorithms in accuracy and speed for object detection [28].

Algorithms for object detection using CNNs can be broadly categorized into two-stage and single-stage methods. Two-stage algorithms, based on classification, first generate many proposals or regions of interest from the image (body) and then classify those regions using the CNN (head). In other words, the network does not check the complete image; instead, it only checks the parts of the image with a high probability of containing an object. Region-CNN (R-CNN), proposed by Ross Girshick in 2014 [29], was the first of this series of algorithms, which was later modified and improved as, for example, Fast R-CNN [30], Faster R-CNN [31], R-FCN [32], Mask R-CNN [33] and Light-Head R-CNN [34]. In contrast, single-stage algorithms, based on regression, do not use regions to localize the object within the image; they predict bounding boxes and class probabilities over the whole image. The best-known examples of this type of algorithm are the Single Shot Detector (SSD), proposed by Liu et al. [35], and ‘you only look once’ (YOLO), proposed by Joseph Redmon et al. in 2016 [36]. YOLO has since been updated to YOLOv2/YOLO9000 [37] and YOLOv3 [38]. In this paper, we apply the real-time object detection system YOLOv3 for person detection, as it has proven to be an excellent competitor to other algorithms in terms of speed and accuracy.

The YOLO network takes an image and divides it into S × S grid cells. Each grid cell predicts B bounding boxes $\left\{bi\right\}$, $i=1,\cdots ,B$, and provides a confidence score $Con{f}_{bi}$ for each of them, which reflects how likely it is that the box contains an object. Bounding boxes with a confidence above a threshold value are selected and used to locate the object, a person in our case. The bounding box position is the output of this stage of our algorithm.
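As an illustrative sketch (not the authors' implementation), selecting person boxes above a confidence threshold could look as follows; the detection tuple layout and the person class index are assumptions:

```python
# Sketch: filter YOLO detections by confidence (hypothetical data layout).
# Each detection is (x_left, y_top, x_right, y_down, confidence, class_id);
# class_id 0 is assumed to be "person", as in the COCO label set.

def select_person_boxes(detections, conf_threshold=0.5, person_class=0):
    """Keep only boxes likely to contain a person."""
    boxes = []
    for x_left, y_top, x_right, y_down, conf, cls in detections:
        if cls == person_class and conf >= conf_threshold:
            boxes.append((x_left, y_top, x_right, y_down, conf))
    return boxes

detections = [
    (120, 40, 180, 220, 0.91, 0),   # confident person detection -> kept
    (300, 60, 340, 120, 0.30, 0),   # low-confidence person -> discarded
    (10, 10, 60, 50, 0.88, 56),     # another object class -> discarded
]
print(select_person_boxes(detections))
```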

#### 3.3. Learning-Based Fall/Nonfall Classification

The effectiveness of SVM-based approaches for classification has been widely tested [39,40,41]. The SVM algorithm defines a hyperplane or decision boundary that separates the classes and maximizes the margin (the maximum distance between data points of the classes). Support vectors are the training data points that define the decision boundary [42]. To find the hyperplane, a constrained minimization problem has to be solved, which requires optimization techniques such as the Lagrange multiplier method.

In the case of nonlinearly separable data, data points from the initial space ${R}_{d}$ are mapped into a higher-dimensional space $Q$, where it is possible to find a hyperplane that separates the points. With this, the classification-decision function becomes

$f(x)=\mathrm{sign}\left({\sum }_{i=1}^{{N}_{s}}{\alpha }_{i}{y}_{i}K(x,{s}_{i})+b\right)$

where training data are represented by $\left\{{x}_{i},{y}_{i}\right\}$, $i=1,\cdots ,N$, ${y}_{i}\in \left\{-1,1\right\}$, $b$ is the bias, ${\alpha}_{i}$, $i=1,\cdots ,N$ are the Lagrange multipliers obtained during the optimization process [43], ${s}_{i}$, $i=1,\cdots ,{N}_{s}$ are the support vectors, for which ${\alpha}_{i}\ne 0$, and $K(x,{x}_{i})$ is a kernel function. A Radial Basis Function (RBF) was used as the kernel in this study:

$K(x,{x}_{i})=\mathrm{exp}\left(-\gamma {\Vert x-{x}_{i}\Vert }^{2}\right)$

where $\gamma $ is the parameter controlling the width of the Gaussian kernel.
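The RBF kernel and the resulting decision rule can be sketched numerically as follows; the support vectors, multipliers, labels and label convention are hypothetical values for illustration, not the trained model:

```python
import numpy as np

def rbf_kernel(x, xi, gamma):
    """K(x, xi) = exp(-gamma * ||x - xi||^2), the Gaussian/RBF kernel."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def svm_decision(x, support_vectors, alphas, labels, b, gamma):
    """sign( sum_i alpha_i * y_i * K(x, s_i) + b ) -- SVM decision function."""
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1

# Toy support set (hypothetical values for illustration only)
svs = [[0.8, 0.2, 0.1], [0.1, 0.3, 0.6]]
alphas = [1.0, 1.0]
labels = [1, -1]          # +1 = fall, -1 = nonfall (convention assumed)
print(svm_decision([0.75, 0.25, 0.15], svs, alphas, labels, b=0.0, gamma=2.0))
```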

The accuracy of the SVM classifier depends on the regularization parameter $C$ and on $\gamma $. $C$ controls the penalty associated with misclassified training samples, while $\gamma $ defines how far the influence of a single training point reaches. Both parameters must therefore be optimized for each particular task, for example, by using cross-validation.
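A minimal sketch of this cross-validated parameter search, assuming scikit-learn and synthetic data (the paper does not specify its tooling or parameter ranges):

```python
# Sketch of C / gamma selection via cross-validated grid search, using
# scikit-learn; the data and search grid here are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((80, 3))                    # 3 features per sample
y = (X[:, 0] > 0.5).astype(int)            # synthetic fall/nonfall labels

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)                 # chosen (C, gamma) pair
```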

The selection of the right features or input parameters to the SVM plays an important role in achieving a high-performance classification algorithm. Several features are widely used in the literature, such as aspect ratio (AR), change in AR (CAR), fall angle (FA), center speed (CS) or head speed (HS) [21,44,45]. However, after analyzing the parameters that provide the best performance trade-off for the goals of our approach, using the bounding box data of a detected person, we defined the input feature vector for the SVM classifier as:

Aspect ratio of the bounding box, $A{R}_{i}$:

$A{R}_{i}=\frac{{W}_{bi}}{10{H}_{bi}}$

Normalized bounding box width, $N{W}_{i}$:

$N{W}_{i}=\frac{{W}_{bi}}{{W}_{image}}$

Normalized bounding box bottom coordinate, $N{B}_{i}$:

$N{B}_{i}=\frac{{H}_{image}-Ydow{n}_{bi}}{{H}_{image}}$

where ${W}_{bi}=Xrigh{t}_{bi}-Xlef{t}_{bi}$ and ${H}_{bi}=Ydow{n}_{bi}-Yto{p}_{bi}$ are the width and height of bounding box $\left\{bi\right\}$, respectively, calculated from the bounding box position provided by YOLOv3 {$Xlef{t}_{bi}$, $Xrigh{t}_{bi}$, $Yto{p}_{bi}$, $Ydow{n}_{bi}$}, and ${W}_{image}$, ${H}_{image}$ are the width and height of the overall image. Point $(0,0)$ is at the top-left corner of the overall image. Parameter $N{B}_{i}$ measures the normalized distance from the bottom of the image to the lower edge of the bounding box. As the values of the $N{B}_{i}$ and $N{W}_{i}$ parameters are between 0 and 1, we needed to adjust the value of $A{R}_{i}$ to give it a similar weight as input to the SVM. Analyzing the data, ${W}_{bi}$ was lower than $10{H}_{bi}$ in all cases, so we normalized $A{R}_{i}$ by 10 to obtain a feature in [0,1]. Therefore, we only considered detections for which ${W}_{bi}<10{H}_{bi}$.
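A small sketch of this feature computation under the stated conventions (origin at the top-left, $A{R}_{i}$ scaled by 10, detections with ${W}_{bi}\ge 10{H}_{bi}$ discarded); function and variable names are illustrative:

```python
# Sketch: feature vector (AR, NW, NB) from a YOLOv3 bounding box, following
# the conventions in the text; the exact formulas in the paper may differ.

def features(x_left, x_right, y_top, y_down, img_w, img_h):
    w = x_right - x_left                 # bounding box width  W_bi
    h = y_down - y_top                   # bounding box height H_bi
    if w >= 10 * h:                      # out-of-range detection -> discard
        return None
    ar = w / (10.0 * h)                  # aspect ratio scaled into [0, 1]
    nw = w / img_w                       # normalized width
    nb = (img_h - y_down) / img_h        # normalized distance to image bottom
    return ar, nw, nb

# Example: wide, low box near the floor (fall-like geometry)
print(features(100, 400, 350, 450, img_w=640, img_h=480))
```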

Parameter $A{R}_{i}$ is the most significant feature characterizing a fall. As can be seen from the examples in Figure 4a,b, a person standing upright has a small $A{R}_{i}$, while this ratio is large for a person lying in a horizontal body orientation. However, this parameter alone is not enough: there are cases where the person is in a lying position but this parameter does not show it, namely lying in a vertical body orientation, as shown in Figure 4c.

One of the main goals of the algorithm is the ability to differentiate between fallen people and resting situations. Figure 5 shows an example of how optical perspective works in cameras. The object size in the image (in pixels) depends on the real object size (in mm) and the distance from the camera to the object [46]: objects of the same size at different distances from the camera (object planes) appear with different sizes (in pixels) in the image plane, with the closest one appearing larger (Figure 5a); objects of the same size at the same distance from the camera appear with the same size in the image plane (Figure 5b). If objects are at different heights in the object plane, the same happens in the image plane.

When we compare a fallen person and a resting person at the same distance from the camera, the situation is the one observed in Figure 5b: the resting person is the one in the higher position. As shown in Figure 6a, the $A{R}_{i}$ and $N{W}_{i}$ parameters are the same in both cases (same bounding box size); however, the $N{B}_{i}$ parameter differs ($N{B}_{1}$ vs. $N{B}_{2}$). For the same value $N{B}_{1}$, the bounding box of a fallen person would be the red one (see Figure 6b).

Therefore, the proposed parameters $A{R}_{i}$, $N{W}_{i}$ and $N{B}_{i}$ provide the information needed to differentiate these situations, and during the training stage the SVM learns the relation between them in both cases (fall and resting position).

Figure 7 illustrates the previous explanation with real images. It contains three pairs of images in which fallen and resting persons are at the same distance from the camera (1.5, 2 and 3 m away). Table 1 shows the parameters provided to the SVM in those situations. As can be seen in the table, each pair of images has a similar $N{W}_{i}$ parameter (slight differences are due to the persons not being at exactly the same position relative to the camera). However, parameter $N{B}_{i}$ has a larger value in the nonfall situation because the body is in a higher position in the image.