Anthropometric Ratios for Lower-Body Detection Based on Deep Learning and Traditional Methods

Abstract: Lower-body detection can be useful in many applications, such as the detection of falls and injuries during exercise. However, it can be challenging to detect the lower-body, especially under various lighting and occlusion conditions. This paper presents a novel lower-body detection framework using proposed anthropometric ratios and compares the performance of deep learning (convolutional neural networks and OpenPose) and traditional detection methods. According to the results, the proposed framework successfully detects accurate boundaries of the lower-body under various illumination and occlusion conditions for lower-limb monitoring. The proposed framework of anthropometric ratios combined with convolutional neural networks (A-CNNs) achieves high accuracy (90.14%), while the combination of anthropometric ratios and traditional techniques (A-Traditional) shows satisfactory performance with an average accuracy of 74.81%. Although the accuracy of OpenPose (95.82%) is higher than that of A-CNNs for lower-body detection, A-CNNs has lower complexity than OpenPose, which is advantageous for lower-body detection and implementation on monitoring systems.


Introduction
For daily exercises, many people tend to focus dominantly or solely on cardiovascular exercises to burn calories. Lower-body strength, however, is also important not only for achieving perfect physical condition but also for maintaining total body health. Moreover, by strengthening the lower-body, one can improve one's agility and balance, helping to avoid falls and injuries during both daily activities and workouts. In addition, many studies [1,2] have found that lower-body strength and power are correlated and required for performing high-intensity, short-duration activities, such as jumping, sprinting, or carrying a load. Nevertheless, proper exercises should be performed to minimize the risk of injury from strengthening the lower-body. To prevent falls and injuries from these activities, the ability to detect the lower-body is crucial for monitoring the postures of participants during workouts.
With advances in computer technology, human body detection has become crucial in diverse applications such as surveillance systems, vehicle navigation, and posture recognition. Human body detection can also be applied to study human behavior and activities of daily living (ADLs) [3]. Thus, this human body detection can observe unusual signs in an activity sequence [4]. Moreover, it is a valuable indicator and threshold for monitoring systems in a workplace to identify inappropriate tasks and enhance injury prevention [5]. It is even used to automatically control home devices such as light sources and air conditioning to maintain suitable living conditions [6]. However, variations in body. The detected lower-body will then be indicated on the output image, as shown in Figure 1. The rest of this paper is organized as follows: Section 2 briefly describes the related work in the literature. The proposed method is introduced in Section 3. Experimental results are reported in Section 4 and discussed in Section 5. Finally, the conclusion is presented in Section 6.

Related Work
Several human detection algorithms have been studied and developed for many applications, e.g., person identification [27], automatic vehicles [28], and human gait analysis [29]. This section discusses the two main approaches to human body detection: traditional methods and deep learning methods.

Traditional Methods
In traditional methods, a high-dimensional image is transformed into low-dimensional data in the form of a feature vector. Then, the feature vectors extracted in this way are used to train a classification algorithm. In general, the traditional methods for human detection can be divided into two types: background subtraction and object-based methods.
Background subtraction [30] was introduced for object detection to identify moving objects based on the differences between the current frame and a background frame in either a pixel-by-pixel or window-by-window fashion [31]. Background subtraction has also been combined with depth information to extract objects more reliably [32]. A human bounding box can be detected by a refinement algorithm that matches the contour of the shadow of the human body. In addition, a Gaussian mixture model (GMM)-based background learning technique was applied to separate the human object from the background [33]. Iazzi et al. [34] also applied background subtraction with a support vector machine (SVM) classifier to detect falls in elderly people. The results showed that this method can achieve high fall detection accuracy. However, the main problem with background subtraction is that it is not robust against changes in brightness or camera motion in the case of a non-stationary frame because the background frame is not updated [35,36]. Thus, Chiu et al. [37] proposed the idea of color category entropy to approximate the number of essential background groups and initiate acceptable representative background groups to accommodate dynamic backgrounds.
An object-based method for detecting faces was introduced by Mena et al. [38], who presented the Viola-Jones (VJ) algorithm. The VJ algorithm relies on an integral image representation and simple rectangular features such as Haar-like features, based on which cascaded classifiers are used to detect faces. Adeshina et al. [39] also applied Haar-like features with local binary patterns (LBP) to customize classroom face classification. The resulting algorithm showed a lower false-negative rate (FNR) for face classification than other methods. However, the VJ algorithm still has disadvantages when confronted with varying lighting conditions and occlusion. The histograms of oriented gradients (HOG) features [11], proposed in 2005, are high-dimensional edge-based features that can be used in combination with an SVM classifier to detect human regions. When an SVM classifier is trained on the HOG features of both positive (human) and negative (non-human) images, the resulting HOG-SVM method can successfully detect human targets even in dark or occluded images. Moreover, Patel et al. [40] proposed a fusion of HOG features for human action recognition in video. The results showed that this fusion of HOG features with a meta-cognitive neural network classifier achieved high accuracy in detecting human actions. However, extracting HOG features over the entire input image is slow because of the sliding and scaling windows required. To deal with these problems, He et al. [41] proposed a fully convolutional neural network for semantic regions of interest to detect pedestrians based on HOG and an SVM classifier, which also increased the speed of the algorithm. Additionally, Yang et al. [42] presented a parallel feature fusion based on the Choquet integral between HOG and LBP features for pedestrian detection. This algorithm improved detection accuracy and reduced detection time.

Deep Learning Methods
Deep learning methods consider both local and global features of the input image by using kernel filters for human body detection. Jammalamadaka et al. [43] presented a human body recognition approach using deep learning. They found that convolutional neural networks (CNNs) can successfully extract and detect human body parts. They also constructed a suitable pose estimation method by using a similar mapping between dimensions. However, this system requires optimizing a low-dimensional space for the human pose search. Qin and Josef [44] proposed a pedestrian detection method using improved CNNs with multi-feature fusion selection. Local parts of the human body were considered individually for the extraction of local features. Then, these local features were merged for full-body human detection. The detection ability achieved through multi-feature fusion was superior to what could be achieved based on the individual original features. However, the human anatomical proportions used to divide the parts of the human body were designed without referring to anthropometric data drawn from real human body measurements. Furthermore, the complexity of the multi-feature fusion process remained higher than that of using only one feature. Considering the human parts, Cao et al. [45] proposed the OpenPose method, which applies a CNN model to map between Part Affinity Fields (PAFs) and body joints. The results showed high accuracy of body part detection in multi-person scenes. Lin et al. [46] also deployed the OpenPose method to detect human movement by tracking changes in the keypoints of human joints. They applied recurrent neural network, long short-term memory (LSTM), and gated recurrent unit (GRU) models to detect human falls. Nonetheless, the lengths between keypoints might be related to human anatomical proportions derived from anthropometric data of real human bodies.
While previous approaches can achieve adequate pedestrian detection performance, they may be limited to full-body detection based on human images. Consequently, in some applications, such as lower-body activities, these previous algorithms cannot perform well due to insufficient information. Our work investigates an indirect detection method for lower-body detection using anthropometric ratios applied to human body images.

Materials and Methods
This section describes the data set used in this study, the feature extraction processes with conventional and deep learning methods, the classifiers used, and our experimental evaluations. The experimental framework is illustrated in Figure 2. There are three main phases: image input, feature extraction, and classification. This paper presents a framework for lower-body detection using proposed anthropometric ratios, as illustrated in Figure 2. A person is captured or recorded by a camera, producing the input image. Then, the human body is detected and scaled using the proposed anthropometric ratios in combination with either traditional techniques (A-Traditional), such as the VJ and HOG-SVM algorithms, or the CNNs technique (A-CNNs). Finally, the lower-body area in the image is detected. For human body detection, there are two popular methods based on sliding windows for overall positioning and scaling in images [11,12,16]: traditional methods and deep learning methods.

Traditional Methods
As the basis for the application of the proposed anthropometric ratios, two traditional methods are used in this study to detect the human body: the VJ algorithm and the HOG-SVM algorithm. The VJ algorithm is used to perform frontal face or upper-body detection. The HOG algorithm is also used for frontal full-body detection.

Viola-Jones Algorithm (VJ)
The VJ algorithm [12,38] extracts simple features based on the notion of cascaded classifiers. Some instances of contrast detection are performed in cells in specific locations in an image, such as the human eye. The VJ algorithm solves the complex learning problem by using an enormous number of positive and negative training images together with a cascade of simple classifiers. The corresponding process is summarized in Figure 3.
As shown in Figure 3, the first step of the VJ algorithm (Algorithm 1) for face detection is to convert the input image into a greyscale image. Next, the integral image representation is rapidly generated by calculating the value at pixel (x, y) as the sum of the pixels above and to the left of (x, y). Then, the sum of the values of all pixels in rectangle D, as shown in Figure 4, can be computed as 4 + 1 − (2 + 3). Subsequently, the entire image is scanned to calculate Haar-like features by subtracting the sum of the pixels under white rectangles from the sum of the pixels under black rectangles in patterns similar to those shown in Figure 5. Adaptive boosting (AdaBoost) is then used as a machine learning algorithm to find the best such rectangle features among the approximately 160,000 possible features in a window of 24 × 24 pixels in order to construct a linear combination of corresponding classifiers. In the final phase of the VJ algorithm, the input image is fed into these cascaded classifiers; if the input image passes all stages of the classifier cascade, it is identified as a human face image, whereas if it fails any stage, it is rejected as a non-face image, as shown in Figure 6.

Algorithm 1:
...
9 Accumulate filter outputs within this stage;
10 if accumulation fails to pass per-stage threshold then
11 Reject this window as a face;
12 Break the while loop;
13 if this detection window passes all $N_{cascade}$ thresholds then
14 Accept this window as a face;
15 else
16 Reject this window as a face;
where $P$ is the pixel size of the image, $N_{scale}$ is the number of scales in the image pyramid, $N_{sliding}$ is the number of sliding detection windows, $N_{cascade}$ is the number of stages in the cascade classifier, and $N_{filter}$ is the number of filters in a stage.
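The integral image lookup and the Haar-like feature computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments: the zero-padded table layout, the function names, and the 4 × 4 test image are assumptions made for the example.

```python
import numpy as np


def integral_image(gray):
    """Summed-area table with one zero row and column prepended (assumed layout)."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of a rectangle from four lookups, i.e. 4 + 1 - (2 + 3)."""
    bottom, right = top + height, left + width
    # corner 4 = bottom-right, 1 = top-left, 2 = top-right, 3 = bottom-left
    return ii[bottom, right] + ii[top, left] - (ii[top, right] + ii[bottom, left])


def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: black (right half) minus white (left half)."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white


# Hypothetical 4 x 4 greyscale patch with values 0..15.
gray = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(gray)
inner = rect_sum(ii, 1, 1, 2, 2)        # pixels 5, 6, 9, 10
feature = haar_two_rect(ii, 0, 0, 4, 4)
print(inner, feature)
```

Once the table is built, every rectangle sum costs four lookups regardless of the rectangle size, which is what makes scanning many Haar-like features per window affordable.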

Support Vector Machine Classification Based on Histograms of Oriented Gradients (HOG-SVM)
Dalal and Triggs [11] proposed the HOG method for human detection, as demonstrated in Figure 7. In this method, the HOG features of both positive images (human images) and negative images (non-human images) are extracted and used to fine-tune a pre-trained linear SVM classifier for human detection. The overall process of HOG-SVM detection is summarized in Figure 8. The HOG-SVM algorithm for human detection is presented in Algorithm 2. The HOG-SVM process is initialized with configuration parameters such as the sizes of cells, blocks, and bins of a sliding window. The image is then converted to greyscale. HOG features are then computed for each sliding window over the whole image.
For the sliding window, the gradients on the x-axis and y-axis are calculated using Equations (1) and (2), and the edge angles are computed using Equation (3). In Equations (1) and (2), $G_x(x, y)$ and $G_y(x, y)$ are the gradients along the x-axis and y-axis, respectively, and $f(x, y)$ is the pixel value of the greyscale image at coordinates $(x, y)$. In Equation (3), $Direction(x, y)$ is the angle of the gradient at $(x, y)$. The magnitude of the gradient at $(x, y)$, $Magnitude(x, y)$, is given by Equation (4). The edge histogram is created by gradient voting, as shown in Equations (5)-(7). In Equation (5), $\alpha$ is the weight of the gradient vote, $N_{bin}$ is the number of bins, and $Direction(x, y)$ is the gradient angle from Equation (3). In Equation (6), $m_n$ is the magnitude of the gradient vote at bin $n$, computed from the weight $\alpha$ of Equation (5) and the gradient magnitude $Magnitude(x, y)$ of Equation (4). In Equation (7), $m_{nearest}$ is the magnitude of the gradient vote assigned to the bin nearest to bin $n$, computed from the same $\alpha$ and $Magnitude(x, y)$.
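The gradient and voting steps described above can be sketched as follows. The centered-difference gradient, the unsigned orientation range of [0, π), and the linear split of each vote between the two nearest bins are assumptions of this sketch; they follow a common HOG formulation and may differ in detail from Equations (1)-(7). For brevity, the whole input is treated as a single cell.

```python
import numpy as np


def gradients(gray):
    """Centered differences along x and y (assumed form of the gradient equations)."""
    f = gray.astype(np.float64)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:-1] = f[:, 2:] - f[:, :-2]
    gy[1:-1, :] = f[2:, :] - f[:-2, :]
    return gx, gy


def cell_histogram(gx, gy, n_bin=9):
    """Orientation histogram of one cell with votes weighted by gradient magnitude."""
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx) % np.pi      # unsigned angle in [0, pi)
    pos = direction / (np.pi / n_bin) - 0.5     # position relative to bin centers
    lower = np.floor(pos).astype(int)
    alpha = pos - lower                         # share given to the next bin
    hist = np.zeros(n_bin)
    np.add.at(hist, lower % n_bin, (1.0 - alpha) * magnitude)
    np.add.at(hist, (lower + 1) % n_bin, alpha * magnitude)
    return hist


# Hypothetical 4 x 4 patch whose intensity increases from left to right.
gray = np.tile(np.array([0.0, 10.0, 20.0, 30.0]), (4, 1))
gx, gy = gradients(gray)
hist = cell_histogram(gx, gy)
print(hist)
```

Splitting each vote between two neighbouring bins avoids abrupt changes in the histogram when an edge angle lies close to a bin boundary.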

Algorithm 2: Support vector machine classification based on histograms of oriented gradients
Data: An image whose pixel size $P$ is greater than zero
Result: Result of human detection
1 Configure the sizes of cells, blocks, and bins, and the percentage of overlap;
2 Convert the color image to a grey image;
3 while number of scales in image pyramid do
4 Downsample the image by one scale;
...
8 Calculate the gradient magnitudes along the x-axis and y-axis;
9 Calculate the edge angle;
10 while $N_{bin}$ > 0 do
11 Build a histogram from the edge orientations and gradient levels;
12 Vote the gradient level in each edge orientation;
13 Normalize the histogram by neighbouring cells;
14 Flatten the 2D features into a feature vector;
15 Test this vector in the SVM classifier;
16 if detection window passes the thresholds then
17 Accept this window as a human;
18 else
19 Reject this window as a human;
where $P$ is the pixel size of the image, $N_{scale}$ is the number of scales in the image pyramid, $N_{block}$ is the number of blocks in each window, $N_{cell}$ is the number of cells in each block, and $N_{bin}$ is the number of orientation bins in each cell.
Subsequently, the HOG features are normalized as shown in Equation (8) to make them robust to a variety of lighting conditions [11]. In Equation (8), $M_i$ is the normalized magnitude of the gradient vote at bin $i$, for $i = 1$ to $K$, where $K$ is the number of cells in one block multiplied by the number of bins ($N_{bin}$), and $e$ is a small constant. The 2D features of the sliding window extracted in this way are flattened into a single feature vector. Finally, this feature vector is classified by an SVM classifier. If the sliding window passes the threshold, it is detected as a human.
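The block normalization step can be illustrated with the L2 scheme commonly used with HOG features. The exact form of Equation (8) is assumed here to be a division by the L2 norm of the block, stabilized by the small constant $e$; the function name and the sample votes are hypothetical.

```python
import numpy as np


def l2_normalize_block(block_votes, e=1e-6):
    """Divide the K votes of one block by their L2 norm (assumed normalization form)."""
    v = np.asarray(block_votes, dtype=np.float64).ravel()
    return v / np.sqrt(np.sum(v ** 2) + e ** 2)


votes = np.array([3.0, 4.0])                 # hypothetical gradient votes
normalized = l2_normalize_block(votes)
brighter = l2_normalize_block(2.0 * votes)   # same block under doubled contrast
flat = l2_normalize_block(np.zeros(4))       # the constant e avoids division by zero
print(normalized, brighter, flat)
```

Because the votes are divided by the block norm, scaling the image contrast leaves the normalized features almost unchanged, which is the source of the robustness to lighting.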

Deep Learning
Deep learning [16] is a technique for machine learning that can consider both the low-level and high-level information in a large data set. A deep learning architecture is generally similar to that of an artificial neural network but has greater numbers of hidden layers and nodes. In this study, a CNN is used for frontal full-body detection. A CNN model typically consists of convolutional layers, pooling layers, and fully connected layers, as shown in Figure 9:

1.
Convolutional layers: These layers are the core of the model, consisting of filters or kernels to calculate image features such as lines, edges, and corners. Generally, a filter consists of a mask matrix of numbers moved over the input image to calculate specific features. The convolution operations of filters consist of dot products and summations between the filters and the input image. The output of these operations is usually passed through an activation function designed for a particular purpose, such as the rectified linear unit (ReLU) activation function, which introduces non-linearity.

2.
Pooling layers: These layers generally reduce the dimensionality of the features. They represent pooled feature maps or new sets of features, moving from the local scale to the global scale. There are several possible pooling operations, including taking the maximum, average, or summation of each corresponding cluster of data from the previous layer.

3.
Fully connected layers: A fully connected layer, in which every neuron in the previous layer is connected to every neuron in the current layer, is typically used as a final layer. The softmax activation function is commonly used in a fully connected output layer to classify the input image into one of several classes based on the training images.

Moreover, the proposed framework was compared with the OpenPose method [45], a pre-trained model for human detection based on the PAFs relating to human body joints. In the OpenPose technique, there are three main procedures for human detection:

1.
Keypoints localization: All possible keypoints, corresponding to human body joints, are located and predicted in the input image based on a confidence map. This map is also useful for single-person pose estimation.

2.
Part Affinity Fields: The keypoints are mapped to a 2-dimensional vector field that encodes the location and orientation of the associated human limbs.

3.
Greedy Inference: The pose keypoints for all the people in the image are generated from the 2-dimensional vector field.
To maintain runtime performance, the OpenPose method [45] limits computation to a maximum of six stages, allocated differently across the part affinity field and keypoint localization procedures.
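The convolution, activation, pooling, and softmax operations described for the CNN layers above can be illustrated with a small NumPy sketch. The function names, the hand-written edge kernel, and the 6 × 6 input are hypothetical; a trained CNN learns its kernels, and practical implementations use an optimized deep learning library rather than explicit loops.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output is a dot product and summation."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    """Rectified linear unit: negative responses are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Non-overlapping max pooling, which reduces the feature map dimensions."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))


def softmax(z):
    """Convert class scores into probabilities that sum to one."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


image = np.zeros((6, 6))
image[:, 3:] = 1.0                                  # dark-to-bright vertical edge
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)           # hand-written vertical edge filter
features = max_pool(relu(conv2d_valid(image, kernel)))
probs = softmax(np.array([2.0, 0.5]))               # hypothetical human / non-human scores
print(features, probs)
```

The 6 × 6 input shrinks to a 4 × 4 response map after the 3 × 3 convolution and to a 2 × 2 feature map after pooling, showing how pooling moves from local to more global features.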

Proposed Lower-Body Detection Framework Using Anthropometric Data
In this section, anthropometric data [21,25,47] representing scaling relations for the human body are introduced. This section also illustrates a method of using anthropometric data to transform three regions of interest (ROIs) of the human body, namely, the full-body, the upper-body, and the face, into the lower-body ROI.

Anthropometric Data
In this section, anthropometric data [48] representing human body information are introduced. A survey of anthropometric data is generally related to the size, motion, and mass of the human body. Such survey data can be applied to design suitable clothing, ergonomic devices, or workspaces. In this study, human body size data from the NASA Anthropometry and Biomechanics Facility [21] are selected to provide information on the height and width of various body parts in a standing posture from the frontal view. This information was collected from healthy adults with an average age of approximately 40 years and from a wide range of ethnic and racial backgrounds. An example of anthropometric dimensional data is shown in Figure 10. Three dominant parts of the human body are considered in this study:

1.
Full-body: the width of the full-body is similar to the width of the upper-body, and the height of the full-body is measured from the foot to the top of the head.

2.
Upper-body: the width of the upper-body is recorded from the edge of the left hand to the edge of the right hand with the hands resting on the body, and the height of the upper-body is measured from the waist to the top of the head.

3.
Head: the width of the head is measured from the left ear to the right ear, and the height of the head is measured from the chin to the top of the head.
To address occlusion problems affecting the lower-body, this research aims to detect the lower-body indirectly by using anthropometric data. To apply anthropometric data for lower-body detection, a suitable human ROI ratio can be used to transform an ROI corresponding to any other part of the body into the lower-body ROI. In this study, three main ROIs are considered for transformation to the lower-body:
• Full-body: the HOG-SVM or CNNs algorithm is used to detect the full-body of the target, as shown in Figure 11.
• Upper-body: the VJ algorithm for upper-body detection ($VJ_{Upper}$) is applied to detect the upper-body of the target, as illustrated in Figure 12.
• Head: the head of the target is detected by using the VJ algorithm for face detection ($VJ_{Face}$), under the assumption that the head ROI is close to the face ROI, as demonstrated in Figure 13.
The anthropometric data ratios are constructed from the median scaled sizes of the head, upper-body, lower-body, and full-body in a standing posture, as shown in Tables 1 and 2. However, the anthropometric data collected from female subjects are not sufficiently comprehensive; therefore, in this study, only male anthropometric data are selected for lower-body detection.

Transformation of the Full-Body ROI into the Lower-Body ROI
The HOG-SVM or CNNs algorithm can be used to directly detect the full-body ROI of a human in an image. This full-body ROI can then be cropped to obtain the lower-body ROI as shown in Equations (9)-(11). This process is also illustrated in Figure 14. The anthropometric data for the upper-body are shown in Table 1.
$$R_{HU} = \frac{H_U}{H_F} \tag{9}$$
where $R_{HU}$ is the ratio of the upper-body height to the full-body height, $H_U$ is the height of the upper-body, and $H_F$ is the height of the full-body.
$$H_{Lower} = H_F - (H_F \times R_{HU}) \tag{10}$$
where $H_{Lower}$ is the height of the lower-body ROI, $H_F$ is the height of the full-body ROI, and $R_{HU}$ is the ratio of the upper-body height to the full-body height, as shown in Table 1.
$$W_{Lower} = W_F \tag{11}$$
where $W_{Lower}$ is the width of the lower-body ROI and $W_F$ is the width of the full-body ROI.
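The cropping of a full-body ROI into a lower-body ROI can be sketched as below. This is an illustration under stated assumptions: the ROI is stored as (x, y, w, h) with the origin at the top-left corner, the lower-body height is taken as $H_F - H_F \times R_{HU}$, and the ratio value 0.4 is a hypothetical placeholder rather than the value in Table 1.

```python
from typing import NamedTuple


class ROI(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def lower_body_from_full_body(full: ROI, r_hu: float) -> ROI:
    """Keep the part of the full-body ROI below the upper-body; the width is unchanged."""
    upper_h = round(full.h * r_hu)
    return ROI(full.x, full.y + upper_h, full.w, full.h - upper_h)


# Hypothetical detection result and ratio (not taken from Table 1).
full_body = ROI(x=10, y=20, w=60, h=200)
lower_body = lower_body_from_full_body(full_body, r_hu=0.4)
print(lower_body)
```

Because the lower-body ROI is derived from the full-body box rather than detected directly, it can still be estimated when the legs themselves are partly occluded.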

Transformation of the Upper-Body ROI into the Lower-Body ROI
As illustrated in Figure 15, $VJ_{Upper}$ is used to detect the ROI of the upper-body; then, this ROI is converted into the full-body ROI by using $R_{HU}$ and $R_{WU}$, as shown in Equations (9) and (12)-(14).
$$H_F = \frac{H_{upper}}{R_{HU}} \tag{12}$$
where $H_F$ is the height of the full-body ROI, $H_{upper}$ is the height of the upper-body ROI detected by the VJ algorithm, and $R_{HU}$ is the ratio of the upper-body height to the full-body height, as shown in Table 1.
$$R_{WU} = \frac{W_U}{W_F} \tag{13}$$
where $R_{WU}$ is the ratio of the upper-body width to the full-body width, $W_U$ is the width of the upper-body, and $W_F$ is the width of the full-body.
$$W_F = \frac{W_{upper}}{R_{WU}} \tag{14}$$
where $W_F$ is the width of the full-body ROI, $W_{upper}$ is the width of the upper-body ROI detected by the VJ algorithm, and $R_{WU}$ is the ratio of the upper-body width to the full-body width, as shown in Table 1. Subsequently, the estimated full-body ROI is cropped to obtain the lower-body ROI as shown in Equations (10) and (11).
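As a worked illustration of the upper-body transformation, the arithmetic below scales a detected upper-body ROI to a full-body ROI and then to a lower-body ROI. The pixel sizes and the ratio values ($R_{HU} = 0.45$, $R_{WU} = 1.0$) are hypothetical placeholders chosen for easy arithmetic, not the values in Table 1, and the relation $H_F = H_{upper} / R_{HU}$ is assumed from the definition of $R_{HU}$ as upper-body height per full-body height.

```latex
\begin{align*}
H_F &= \frac{H_{upper}}{R_{HU}} = \frac{90}{0.45} = 200 \text{ px}, \\
W_F &= \frac{W_{upper}}{R_{WU}} = \frac{60}{1.0} = 60 \text{ px}, \\
H_{Lower} &= H_F - (H_F \times R_{HU}) = 200 - 90 = 110 \text{ px}, \\
W_{Lower} &= W_F = 60 \text{ px}.
\end{align*}
```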

Transformation of the Face ROI into the Lower-Body ROI
When $VJ_{Face}$ is used to find the ROI of the human face, as demonstrated in Figure 15, the face ROI is converted into the full-body ROI by means of $R_{HH}$ and $R_{WH}$, as shown in Equations (15)-(18). The anthropometric data for the head are shown in Table 2.
$$R_{HH} = \frac{H_H}{H_F} \tag{15}$$
where $R_{HH}$ is the ratio of the head height to the full-body height, $H_H$ is the height of the head, and $H_F$ is the height of the full-body.
$$R_{WH} = \frac{W_H}{W_F} \tag{16}$$
where $R_{WH}$ is the ratio of the head width to the full-body width, $W_H$ is the width of the head, and $W_F$ is the width of the full-body.
$$H_F = \frac{H_{Face}}{R_{HH}} \tag{17}$$
where $H_F$ is the height of the full-body ROI, $H_{Face}$ is the height of the face ROI detected by the VJ algorithm, and $R_{HH}$ is the ratio of the head height to the full-body height, as shown in Table 2.
$$W_F = \frac{W_{Face}}{R_{WH}} \tag{18}$$
where $W_F$ is the width of the full-body ROI, $W_{Face}$ is the width of the face ROI detected by the VJ algorithm, and $R_{WH}$ is the ratio of the head width to the full-body width, as shown in Table 2.
Then, this full-body ROI is cropped to obtain the lower-body ROI as shown in Equations (10) and (11). To summarize, diagrams of the frameworks for using the HOG-SVM or CNNs algorithm and the $VJ_{Face}$ or $VJ_{Upper}$ algorithm for lower-body detection are shown in Figures 14 and 15, respectively.
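The indirect path from a detected face to a lower-body ROI can be sketched in code. Several details are assumptions of this sketch rather than statements from the text: the full-body ROI is taken to share its top edge with the detected part and to be horizontally centered on it, the ROI is stored as (x, y, w, h), and all ratio values are hypothetical placeholders, not the values in Tables 1 and 2.

```python
from typing import NamedTuple


class ROI(NamedTuple):
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width in pixels
    h: int  # height in pixels


def full_body_from_part(part: ROI, r_h: float, r_w: float) -> ROI:
    """Scale a face or upper-body ROI to a full-body ROI by dividing by the ratios."""
    full_h = round(part.h / r_h)
    full_w = round(part.w / r_w)
    center_x = part.x + part.w / 2          # assumed: body centered on the part
    return ROI(round(center_x - full_w / 2), part.y, full_w, full_h)


def lower_body_from_full_body(full: ROI, r_hu: float) -> ROI:
    """Crop away the upper-body portion; the width is unchanged."""
    upper_h = round(full.h * r_hu)
    return ROI(full.x, full.y + upper_h, full.w, full.h - upper_h)


# Hypothetical face detection and ratios.
face = ROI(x=90, y=20, w=30, h=25)
full_body = full_body_from_part(face, r_h=0.125, r_w=0.25)
lower_body = lower_body_from_full_body(full_body, r_hu=0.4)
print(full_body, lower_body)
```

The same `full_body_from_part` helper applies to an upper-body detection by passing the upper-body ratios instead of the head ratios.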

Dataset
Experiments were conducted using the INRIA Person Dataset [11], which consists of upright human images (positive images) and general background images (negative images). This data set is challenging for lower-body detection methods because it consists of images captured under various lighting conditions and containing occluding objects, such as vehicles and furniture, close to the human targets of interest.
In these experiments, 2416 positive images and 1218 negative images were used for training, where the negative images were obtained by randomly cropping the background images. Similarly, the data set used for testing included 1126 positive images and 453 randomly cropped negative images.
To analyse the results under different image conditions, five cases of human detection were investigated. Example images for cases 1-5 are shown in Figures 16-20, respectively. The five cases [49-53] are described as follows:

1.
Case of challenging lighting conditions: The light level in an image may not be sufficient to clearly reveal the presence of humans [49,51]. In particular, this may occur in indoor and night-time scenes, resulting in low image quality.

2.
Case of occlusion: Occlusion refers to overlapping either between a human and another human or between a human and another object in the image [49-51]. This can affect the ability to identify complete human shapes, such as in the case of a group of standing people.

3.
Case of multiple people: There may be more than one person in an image [49,51], such as in public sightseeing images or shopping mall images. Some algorithms can support multiple detection [11,12].

4.
Case of a difference in pose between the training and test images: A pose refers to the gesture or posture of a human in an image. For a test image depicting a person in a pose that does not appear in the training images [52], it may be difficult to detect whether the ROI is human or not human because it is not sufficiently similar to the training images [11].

5.
Case of different clothes: People in images may wear clothes of many different colors, sizes, and styles as well as different accessories [53]. Sometimes, certain clothing characteristics may make it difficult to identify a human shape.

Evaluation
To evaluate the lower-body detection performance of the frameworks, the confusion matrix [54,55] and complexity were used as performance measures. The different image cases listed above were also analysed to investigate their influence on the detection ability. Table 3 presents the confusion matrix used for the evaluation of the frameworks. The columns represent a framework's detection results, and the rows represent the actual class. The entries in Table 3 are defined as follows:
• TP denotes the number of images in the human data set that are correctly detected to contain at least one lower-body ROI.
• FN denotes the number of images falsely identified as non-human images in the human data set.
• FP denotes the number of images in the non-human data set that are falsely detected to contain at least one lower-body ROI.
• TN denotes the number of images correctly identified as non-human images in the non-human data set.
The performance of a framework on detection problems can be measured based on the confusion matrix. This paper focuses on three measures: sensitivity, specificity, and accuracy. Their equations are given in Table 4. Accuracy is the most common measure of detection performance and can be used to evaluate the overall efficiency of a framework. Meanwhile, the sensitivity measures the accuracy of human detection in human images (positive images), whereas the specificity measures the accuracy of non-human detection (negative images).

Table 4. Measures of detection performance based on the confusion matrix [54,55].
Measure | Formula
Sensitivity | TP / (TP + FN)
Specificity | TN / (TN + FP)
Accuracy | (TP + TN) / (TP + TN + FP + FN)
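The three measures can be computed directly from the confusion matrix counts, as in the sketch below. The counts used here are illustrative assumptions: they were back-calculated from the OpenPose percentages reported in this paper and the test set sizes (1126 positive and 453 negative images), and they are not taken from the paper's tables.

```python
def detection_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and accuracy from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Assumed counts: 1126 human images and 453 non-human images in total.
metrics = detection_metrics(tp=1123, fn=3, fp=63, tn=390)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
print(percentages)
```

With these counts the three measures round to 99.73%, 86.09%, and 95.82%, matching the OpenPose figures quoted in the results. A high sensitivity can therefore coexist with a lower specificity when false positives occur on background images.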
The complexity of each framework, reflecting the complexity of the algorithm used for detection, was also investigated.
The parameters used for VJ detection in these experiments were in the same scale range as in a previous experiment [12]. For face detection, the minimum window size was 20 × 20 pixels, and for upper-body detection, the minimum window size was 60 × 60 pixels. For the extraction of HOG features, the window size was 64 × 128 pixels. The CNNs model was adapted from the model of Chakrabarty and Chatterjee [56]. The customized model comprised two pairs of convolutional and pooling layers, each with 32 filters with dimensions of 3 × 3. A final fully connected layer with 64 neurons was used to classify each input image as human or non-human based on the softmax function, as illustrated in Figure 21. For OpenPose, a pre-trained model with six stages was deployed for human pose estimation, as in a previous experiment [45].

Experimental Results
In this section, three perspectives are considered for the evaluation of lower-body detection with the proposed anthropometric ratios: the performance of different frameworks, their complexity, and their sensitivity to different image conditions such as lighting conditions, occlusion, multiple people, a difference in pose between the training and test images, and challenging clothes (the image conditions are detailed in Section 3.4). Table 5 summarizes the performance of different frameworks for lower-body detection with the proposed anthropometric ratios. In terms of sensitivity on the human images, the HOG-SVM framework shows the highest performance among the traditional algorithms (72.22%), while $VJ_{Face}$ and $VJ_{Upper}$ achieve a sensitivity of less than 36%. Regarding the deep learning techniques, OpenPose achieves higher sensitivity (99.73%) than the A-CNNs method (75.94%). On the non-human data set, all frameworks achieve a specificity of more than 85% for detecting background images. The $VJ_{Face}$ method achieves the highest specificity (99.56%), while the specificity of HOG-SVM is the lowest (85.43%). For the deep learning methods, the specificity of A-CNNs (99.30%) is higher than that of OpenPose (86.09%). The specificity of OpenPose is decreased by its false positive detections on background images, as shown in Figure 22. The average accuracy of the A-Traditional methods is approximately 74.81%. The HOG-SVM algorithm also provides higher overall accuracy (80.30%) than the other traditional methods, while the accuracy of $VJ_{Face}$ is higher than that of $VJ_{Upper}$. In terms of deep learning, both the A-CNNs and OpenPose methods achieve overall accuracies of greater than 90%. Table 6 ... $(n_{l-1} C_l))$, but A-CNNs has only one stage of $N_T$, while OpenPose is configured with six stages of $N_T$.

Table 6. Complexity of the algorithms for lower-body detection.

| Category | Algorithm | Complexity |
|---|---|---|
| A-Traditional | $VJ_{Face}$, $VJ_{Upper}$ | $O((P + 4N_r)MT)$ |
| A-Traditional | HOG-SVM | $O(N_r W N_f^2)$ |
| Deep Learning | OpenPose, A-CNNs | $O(N_T \sum_{l=1}^{d} n_{l-1} C_l)$ |

Note: $P$ is the number of pixels in an image, $N_r$ is the number of regions of interest, each covering $W$ pixels, $M$ is the number of stages in the cascade classifier ($N_{cascade}$) in Algorithm 1, $T$ is the number of filters per stage ($N_{filter}$) in Algorithm 1, $N_f$ is the number of features, calculated as $N_{cell} \cdot N_{block} \cdot N_{bin}$ in Algorithm 2, $N_T$ is the number of stages of the CNNs or OpenPose, $n_l$ is the number of convolution layers, $C_l$ is the convolution layer complexity, which equals $s_l^2 \cdot n_l \cdot m_l^2$, $s_l$ is the size of the kernel filter, $m_l$ is the size of the pooling layer, and $d$ is the total number of deep learning layers.
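As a reminder of how the sensitivity, specificity and accuracy values quoted in this section relate to one another, the following minimal sketch computes them from confusion-matrix counts. The function name `detection_metrics` and the counts used in the example are hypothetical and are not taken from Table 5; they only illustrate the standard definitions.

```python
def detection_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1].

    tp/fn: human images that were detected / missed.
    tn/fp: background images that were rejected / falsely detected.
    """
    sensitivity = tp / (tp + fn)                # share of human images detected
    specificity = tn / (tn + fp)                # share of backgrounds rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # share of all images classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts, chosen only to illustrate the calculation.
sens, spec, acc = detection_metrics(tp=80, fn=20, tn=95, fp=5)
print(f"sensitivity={sens:.2%} specificity={spec:.2%} accuracy={acc:.2%}")
```

Because accuracy pools both data sets, a method with modest sensitivity can still reach a high overall accuracy when its specificity is high and background images are numerous.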

Different Image Conditions
Examples of results for cases 1-5 are illustrated in Figures 23-27, respectively. In most of these cases, detection is achieved by A-CNNs, OpenPose and HOG-SVM; however, the HOG-SVM framework cannot detect the person in the non-standing pose depicted in Figure 26. $VJ_{Face}$ and $VJ_{Upper}$ are able to achieve detection in cases 3-5, while in case 1, neither version of the VJ algorithm can detect any humans, as shown in Figure 23.

Discussion
In this section, lower-body detection with the proposed anthropometric ratios is discussed from three perspectives: accuracy, complexity, and different image conditions.
According to the results in Table 5, the proposed anthropometric ratios can be used to scale other detected parts of the human body to obtain lower-body ROIs. In addition, the A-CNNs, OpenPose and HOG-SVM methods succeed in lower-body detection, with overall accuracies of more than 80%, because they can detect and transform human shapes under various lighting and occlusion conditions. Regarding specificity, the $VJ_{Face}$ algorithm provides higher specificity than the other methods on the non-human data set because most background images consist of scenes such as sightseeing locations and mountains; therefore, the Haar-like rectangular templates rarely match these backgrounds. Regarding the performance of A-CNNs, it is sometimes not fair to use such background data sets in deep learning unless the background images are further categorized into subclasses, such as trees, appliances and buildings. A similar problem has been solved, with enhanced detection performance, by using a one-class classifier based on CNNs [57]. OpenPose was trained on the COCO data set [58], which contains images of 250,000 people with keypoints [45]. Consequently, it achieved the highest detection accuracy on the INRIA data set in this experiment. However, our A-CNNs model, which was trained on human body ROIs, achieved higher specificity than OpenPose, which was trained on keypoints. According to Figure 22, OpenPose seems to be more suitable for human detection in a plain room than for exercise monitoring in an outdoor environment. For the purpose of lower-body detection, the lower-body ROIs are based on the proposed anthropometric ratios, while OpenPose focuses on body keypoints, which need more optimization than lower-body ROIs based on the NASA Anthropometry and Biomechanics data [21].
Moreover, the proposed anthropometric ratios can also be modified for use in locating different parts of the body, such as the thigh, leg, and foot, to monitor lower-body activities without any need to retrain the A-CNNs model. In contrast, OpenPose would need to be retrained on a data set containing new keypoints for the detection of different body parts.
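The idea of relocating the ROI by changing only the ratios can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Box` type, the `part_roi` helper and, in particular, every value in `PART_RATIOS` are hypothetical placeholders rather than the anthropometric ratios proposed in this paper. The sketch assumes a detected full-body box of a standing person and cuts out a horizontal band of it.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


# HYPOTHETICAL (top, bottom) fractions of the body height, measured from the
# top of the detected body box. Placeholder values, not the paper's ratios.
PART_RATIOS = {
    "lower_body": (0.50, 1.00),
    "thigh": (0.50, 0.72),
    "leg": (0.72, 0.95),
    "foot": (0.95, 1.00),
}


def part_roi(body: Box, part: str, ratios=PART_RATIOS) -> Box:
    """Scale a detected body box into the ROI of one body part."""
    top, bottom = ratios[part]
    return Box(body.x, body.y + top * body.h, body.w, (bottom - top) * body.h)


body = Box(x=40, y=10, w=60, h=200)  # hypothetical detector output
for name in PART_RATIOS:
    print(name, part_roi(body, name))
```

Switching from the whole lower body to the thigh or foot only changes the dictionary entry that is looked up; the detector that produced `body` is left untouched, which is the property the paragraph above relies on.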
According to the results in Table 6, the $VJ_{Face}$ and $VJ_{Upper}$ algorithms have relatively low complexity because the integral image calculation reduces the complexity of the general algorithm from $O(N_r W)$ to $O(P + 4N_r)$. The complexity of the cascaded classifiers (with $M$ stages and $T$ filters per stage) is then $O(MT)$. Therefore, the final complexity of the VJ algorithm is $O((P + 4N_r)MT)$. In the case of HOG-SVM, the complexity of the HOG feature calculation is $O(N_r W)$, which is multiplied by $O(N_f)$ (the number of features). The $O(N_f)$ factor is high because of the iterations over $N_{cell}$, $N_{block}$ and $N_{bin}$. If the dimensionality of the HOG features is not reduced to limit $O(N_f)$, this algorithm might be too complex for on-line detection. The linear SVM subsequently contributes a further factor of $O(N_f)$. Hence, the total complexity of HOG-SVM is $O(N_r W N_f^2)$. The A-CNNs method has a high classifier complexity, which depends on the number of convolutions. It also requires customizing a model to achieve high accuracy. The OpenPose method uses a CNNs model in its three main procedures for human detection. Moreover, the number of stages of each procedure influences the complexity of OpenPose. The number of stages should be optimized to achieve a suitable trade-off between accuracy and complexity [59].
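To make the relative scale of these expressions concrete, the sketch below evaluates them as plain operation counts. Several points are assumptions: the function names are hypothetical, all parameter values are invented for illustration, the deep learning cost is read as $N_T$ times the sum of $n_{l-1} C_l$ over the layers, and $n_l$ is treated as the number of filters in layer $l$. Big-O expressions hide constant factors, so the numbers indicate only how each cost scales.

```python
def vj_cost(P, N_r, M, T):
    """Viola-Jones: integral image term (P + 4*N_r) times an M-stage, T-filter cascade."""
    return (P + 4 * N_r) * M * T


def hog_svm_cost(N_r, W, N_cell, N_block, N_bin):
    """HOG-SVM: N_r regions of W pixels, N_f HOG features, linear SVM over N_f."""
    N_f = N_cell * N_block * N_bin
    return N_r * W * N_f ** 2


def cnn_cost(N_T, n_0, layers):
    """CNN-based detector with N_T stages (assumed reading of Table 6).

    layers is a list of (n_l, s_l, m_l) tuples; n_0 is the input channel count.
    """
    total, n_prev = 0, n_0
    for n_l, s_l, m_l in layers:
        C_l = s_l ** 2 * n_l * m_l ** 2  # per-layer term from the Table 6 note
        total += n_prev * C_l            # n_{l-1} * C_l
        n_prev = n_l
    return N_T * total


# Hypothetical sizes, for illustration only.
layers = [(8, 3, 32), (16, 3, 16)]
print("VJ      :", vj_cost(P=64 * 128, N_r=100, M=10, T=5))
print("HOG-SVM :", hog_svm_cost(N_r=100, W=64, N_cell=4, N_block=2, N_bin=9))
print("1 stage :", cnn_cost(N_T=1, n_0=3, layers=layers))
print("6 stages:", cnn_cost(N_T=6, n_0=3, layers=layers))
```

With identical layers, the six-stage configuration costs exactly six times the one-stage configuration, which mirrors the comparison between OpenPose and the A-CNNs; the quadratic dependence of HOG-SVM on $N_f$ is also visible when `N_bin` or `N_block` is increased.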
In the case of different image conditions, although the proposed anthropometric ratios can be used to crop the lower-body regions of human images, this approach is limited to images of humans in a standing posture. In addition, the five cases of human detection are discussed as follows:

1.
Case of challenging lighting conditions: HOG-SVM, A-CNNs and OpenPose yield better detection results than $VJ_{Face}$ and $VJ_{Upper}$. The former methods remain insensitive to lighting conditions even when the relevant features lie in dark images.

2.
Case of occlusion: The HOG-SVM, A-CNNs and OpenPose methods can detect overlapping humans in images. $VJ_{Upper}$ is able to detect some of the human targets in the image considered in this example, but $VJ_{Face}$ is not, because the faces of the humans in this image are rotated around the vertical axis.

3.
Case of multiple people: Most methods can detect the lower bodies of the people in this image because the other characteristics of this image are beneficial, such as good lighting, full visibility of the upper bodies and a frontal view of the faces.

4.
Case of a difference in pose between the training and test images: HOG-SVM cannot detect the lower-body of the person in this image because it depicts a human sitting on a bicycle and thus is not similar to the positive training images, i.e., standing human images. The A-CNNs uses the softmax function for classification, so the result is expressed as a probability value indicating how close the input image is to the training images, whereas OpenPose can still detect human keypoints because of the variety of postures in the COCO data set [58] used for training.

5.
Case of challenging clothes: The HOG-SVM, A-CNNs and OpenPose methods can detect the lower bodies of the people in this image because they still have human-looking shapes. $VJ_{Face}$ can also detect the lower-body regions because there is no occlusion of the faces, while $VJ_{Upper}$ can detect one of the two humans in the image.
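The remark in case 4 that a softmax output is expressed as a probability value can be illustrated with a minimal sketch. The two-class layout (human, background) and the raw scores below are hypothetical and do not come from the A-CNNs model; only the softmax function itself is standard.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the classes [human, background].
p_human, p_background = softmax([2.0, 0.5])
print(f"P(human)={p_human:.3f}  P(background)={p_background:.3f}")
```

An input that resembles the standing-human training images yields a larger score gap and therefore a probability closer to 1, whereas an unfamiliar pose tends to give a less decisive value.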

Conclusions
This paper proposes anthropometric ratios for use in combination with either deep learning or traditional methods for lower-body detection in images captured under various environmental conditions. As seen from the results, the proposed framework can be beneficial for transforming some parts of the human body into corresponding lower-body ROIs; however, it is limited to images of humans in a standing posture captured from a frontal view only. Furthermore, among the deep learning methods, A-CNNs (90.14%) and OpenPose (95.82%) achieve higher accuracy than the average of the A-Traditional methods (74.81%) despite challenging illumination and occlusion conditions. However, the complexity of OpenPose, which depends on the number of nodes, layers, and stages, is higher than that of A-CNNs. In future work, anthropometric ratios suitable for various human postures will be studied. A dedicated data set covering image conditions such as illumination, occlusion, multiple people, differences in posture, and a variety of clothes will be tested. Furthermore, the parameters of the A-CNNs model will be optimized for human body detection in a wide variety of scenarios. Additionally, the detection framework will be combined with a tracking system for faster monitoring of lower-body activities.

Data Availability Statement:
Publicly available INRIA Person Dataset [11] was analyzed in this study. This data can be found here: http://lear.inrialpes.fr/data, accessed on 2 May 2020.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.