Human Height Estimation by Color Deep Learning and Depth 3D Conversion

Abstract: In this study, a method for estimating human height is proposed using color and depth information. Color images are used for deep learning by mask R-CNN to detect a human body and a human head separately. If color images are not available for extracting the human body region because of a low-light environment, then the human body region is extracted by comparing the current frame of the depth video with a pre-stored background depth image. The topmost point of the human head region is extracted as the top of the head and the bottommost point of the human body region as the bottom of the foot. The depth value of the head-top point is corrected to the value of a nearby pixel that has high similarity to its neighboring pixels. The position of the foot-bottom point is corrected by calculating the depth gradient between vertically adjacent pixels. The head-top and foot-bottom points are converted into 3D real-world coordinates using depth information. Human height is estimated as the Euclidean distance between the two real-world coordinates. Estimation errors are reduced by averaging the accumulated heights. In the experiments, the estimation errors of human height in a standing state are 0.7% and 2.2% when the human body region is extracted by mask R-CNN and by the background depth image, respectively.


Introduction
Physical measurements of a person, such as height, body width and stride length, are an important basis for identifying a person from video. For example, the height of a person captured in a surveillance video is important evidence for identifying a suspect. Physical quantities are also used as important information for continuously tracking a specific person in a video surveillance system consisting of multiple cameras [1]. A specific behavior such as falling down can be recognized by detecting changes in human height. Various studies have been conducted to estimate human height from color video. Human height is estimated by obtaining 3D information of the human body from color video [2][3][4][5][6]. Both the position and the pose of the camera are required in order to obtain 3D information of the human body. Human height can also be estimated by calculating the ratio between the length of the human body and that of a reference object whose length is already known [7][8][9][10][11][12]. The estimation methods based on color video have the disadvantage that the camera parameters or information about a reference object are required.
Depth video stores depth values, which are the distances between subjects and the camera. The pixels of depth video are converted to 3D coordinates using the depth values. Object detection [13][14][15] and behavior recognition [16][17][18] from depth video are possible by extracting the 3D features of the objects. Recently, smartphones have been equipped with time-of-flight (TOF) sensors that recognize a human face in order to verify the identity of a person. Object lengths can also be measured from depth video without additional information, so the problems of human-height estimation based on color video can be solved by using depth video.
The field of artificial intelligence has made significant progress through research on neural network structures that consist of multiple layers. In particular, the convolutional neural network (CNN) [19] considerably improves object detection, which categorizes objects and detects their boundary boxes and pixels [20][21][22][23][24][25][26].
In this study, a human-height estimation method is proposed using depth and color information. The human-height estimation is improved by extracting a human body and a human head from color information and by measuring human height from depth information. The human body and the human head in the current frame of the color video are extracted through mask R-CNN [26]. If color images are not available because of a low-light environment, then the human body region is extracted by comparing the current frame of the depth video with a pre-stored background depth image. The topmost point of the human head region is extracted as the head-top and the bottommost point of the human body region as the foot-bottom. The head-top and foot-bottom points are converted to 3D real-world coordinates using their image coordinates and depth pixel values. Human height is estimated by calculating the Euclidean distance between the two real-world coordinates.
The proposed method improves the human-height estimation by using both color and depth information and by applying mask R-CNN, which is a state-of-the-art algorithm for object detection. In addition, by using depth information, the proposed method removes the need for the camera pose or the length of another object in the human-height estimation.
This study is organized as follows: In Section 2, the related works for object detection by CNN and for the human-height estimation based on color or depth video are described. In Section 3, the human-height estimation by depth and color videos is proposed. The experimental results of the proposed method are presented in Section 4. Finally, a conclusion for this study is described in Section 5.

Object Detection from Color Information by Convolutional Neural Network
Object detection problems in color images can generally be classified into four categories: classification, localization, detection and object segmentation. First, classification determines the object category of a single object in the image. Second, localization finds the boundary box of a single object in the image. Third, detection finds the boundary boxes and determines the object categories of multiple objects. Finally, object segmentation finds the pixels that belong to each object. CNN can solve all of these categories of object detection problems. CNN replaces the weights of the neural networks with kernels, which are rectangular filters. Generally, object detection methods based on CNN are classified into 1-stage and 2-stage methods [27]. The 1-stage method performs the localization and the classification at once. The 2-stage method performs the classification after the localization. The 1-stage method is faster than the 2-stage method but is less accurate.
R-CNN [20] is the first method proposed for detection through CNN. R-CNN applies a selective search algorithm to find the boundary boxes where an object exists with a high probability. The selective search algorithm constructs the boundary boxes by connecting adjacent pixels with similar texture, color and intensity. The feature map of the boundary boxes is extracted through AlexNet, and the object is classified through a support vector machine (SVM). R-CNN has the disadvantages that the object detection is very slow and that the SVM must be trained separately from the CNN. Fast R-CNN [21] has higher performance than R-CNN. Fast R-CNN applies a RoIPool algorithm and introduces a softmax classifier instead of the SVM, so the feature map extraction and the classification are integrated into one neural network. Faster R-CNN [22] replaces the selective search algorithm with a region proposal network (RPN), so the whole process of object detection is performed in one CNN.
YOLO [23,24] is a 1-stage method for object detection. YOLO defines the object detection problem as a regression problem. YOLO divides an input image into grid cells of a certain size. The boundary boxes and the reliability of the object are predicted for each cell at the same time. YOLO detects objects more quickly than the 2-stage methods but is less accurate. SSD [25] allows grid cells of various sizes in order to increase the accuracy of object detection. Mask R-CNN [26], unlike the other R-CNNs, is proposed for object segmentation.

Object Length Measurement from Color or Depth Information
Length estimation methods based on color video are classified into length estimations by camera parameters [2][3][4][5][6], by vanishing points [7][8][9][10][11][12], by prior statistical knowledge [28,29], by gaits [30,31] and by neural networks [32,33]. The length estimation methods by the camera parameters generate a projection model onto a color image from the focal length, the height and the pose of a camera. The object length is estimated by converting the 2D coordinates of the pixels of the image into 3D coordinates through the projection model. These methods have the disadvantage that accurate camera parameters must be provided in advance. In order to overcome this disadvantage, Liu [2] introduces an estimation method for the camera parameters using prior knowledge about the distribution of relative human heights. Cho [6] proposes an estimation method for the camera parameters by tracking the poses of the human body over a sequence of frames.
The length estimation methods by the vanishing points use the principle that several parallel lines in 3D space meet at one point in a 2D image. The vanishing point is found by detecting the straight lines in the image. The length ratio between two objects can be calculated using the vanishing points. If the length of one object is known in advance, then the length of the other object can be calculated from the length ratio. Criminisi [7] introduces a length estimation method using the given vanishing points of the ground. Fernanda [8] proposes a method for detecting the vanishing points by clustering the straight lines iteratively without camera calibration. Jung [9] proposes a method for detecting the vanishing points in color videos captured by multiple cameras. Viswanath [10] proposes an error model for the human-height estimation by the vanishing points, and the error of the human-height estimation is corrected by the error model. Rother [11] detects the vanishing points by tracking specific objects such as traffic signs over a sequence of frames. Pribyl [12] estimates the object length by detecting specific objects.
Human height can also be estimated from prior statistical knowledge of human anthropometry [28,29] or from gaits [30,31]. In recent years, estimation studies in various fields have achieved great success by applying neural networks, which are also applied to the human-height estimation. Gunel [32] proposes a neural network for predicting the relationship between the proportions of human joints and human height. Sayed [33] estimates human height by CNN using the length ratio between the human body width and the human head size.
Since depth video has distance information from the depth camera, the distance between two points in a depth image can be measured without the camera parameters or the vanishing points. Many studies [34][35][36] extract a skeleton, which is the connection structure of human body parts, from the depth image for the human-height estimation. However, the human body region extraction is somewhat inaccurate because of noise in the depth video.

Proposed Method
In this study, we propose a human-height estimation method using color and depth information. It is assumed that a depth camera is fixed in a certain position. Color and depth videos are captured by the depth camera. Then, a human body and a human head are extracted from the current frame of the color or depth video. A head-top and a foot-bottom are found in the human head and the human body, respectively. The head-top and foot-bottom points are converted into 3D real-world coordinates using the corresponding pixel values of the frame in the depth video. Human height is estimated by calculating the distance between the two real-world coordinates. The flow of the proposed method is shown in Figure 1.

Human Body Region Extraction
It is important to detect the human body region accurately in order to estimate human height precisely. Most frames in depth video contain much noise, so it is difficult to extract an accurate human body region from the depth frame. In contrast, color video allows the human body region to be detected accurately by CNN. In the proposed method, mask R-CNN [26] is applied to extract an accurate human body region from the current frame of the color video. Then, the human body region is mapped to the current frame of the depth video. If color video is not available for extracting the human body region, then the human body region is extracted directly from the current frame of the depth video. In this case, the human body region is extracted by comparing the current depth frame with a pre-captured background depth image. Figure 2 shows the flow of extracting the human body region in the proposed method.

Human Body Region Extraction Using Color Frames
Mask R-CNN [26] consists of three parts: a feature pyramid network (FPN) [37], a residual network (ResNet) [38,39] and an RPN. The FPN detects the categories and the boundary boxes of objects in color video. ResNet extracts an additional feature map from each boundary box. Figure 3 shows the processes of extracting the human body and the human head using mask R-CNN.
FPN condenses the scale of the input frame through bottom-up layers and expands the scale through top-down layers, so objects of various scales can be detected. ResNet introduces skip connections, by which the output of each layer feeds not only into the next layer but also directly into layers two or more hops away. The skip connections reduce the amount of values to be learned for the weights of the layers, so the learning efficiency of ResNet is improved. The feature map is extracted from the frame through FPN and ResNet. RPN extracts the boundary boxes and the masks, which describe the object area as a rectangle and as pixels, respectively. Compared with faster R-CNN, which applies RoIPool, mask R-CNN extends RPN to extract not only the boundary boxes but also the masks, and it applies RoIAlign instead of RoIPool. RoIPool rounds the coordinates of the boundary box to integers. In contrast, RoIAlign allows floating-point coordinates. Therefore, mask R-CNN detects object areas more precisely than faster R-CNN. Figure 4 shows an example of calculating the coordinates of the regions of interest (RoIs) for detecting the boundary box and the masks by RoIPool and RoIAlign. Non-max-suppression (NMS) removes overlapping areas between the boundary boxes. The size of the overlapping area is calculated for each boundary box. Two boundary boxes are merged if the size of the overlapping area is more than 70%.
Table 1 shows the performance of mask R-CNN when various types of backbones are applied. When X-101-FPN is applied as the backbone, the average precision of the boundary boxes (Box AP), which is a metric for detecting the boundary box, is the highest. However, its training and detection times are the slowest. In consideration of the tradeoff between the accuracy and the time for the detection, the proposed method applies ResNet-50 FPN, whose ResNet has 50 layers, as the backbone.
The human body and the human head are detected by mask R-CNN. Mask R-CNN is trained using 3000 images of the COCO dataset [40] with information about the human body and the human head. In training mask R-CNN, the learning rate and the number of epochs are set to 0.001 and 1000, respectively. The threshold for detecting the human body and the human head is set to 0.7. If the detection score of an RoI is higher than the threshold, then the corresponding RoI is detected as the human body or the human head. The process of extracting the human body and human head regions through mask R-CNN is as follows.

1. Resizing a color image to a certain size
2. Extracting a feature map through FPN and ResNet50
3. Extracting RoI boundary boxes from feature map by RoIAlign
4. Boundary box regression and classification for boundary boxes through RPN
5. Generating boundary box candidates by projecting the boundary box regression results onto the color frame
6. Detecting a boundary box for each object by non-max-suppression
7. Adjusting the boundary box area through RoIAlign
8. Finding pixels in boundary boxes to obtain a mask for each boundary box
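The score threshold and the non-max-suppression step can be illustrated with a small Python sketch. It is not the implementation used in the study: it assumes that overlap is measured by intersection over union (IoU), and it uses the common greedy form of NMS, which discards the lower-scored box of an overlapping pair rather than merging the two boxes. The box layout `(x1, y1, x2, y2)` and the names `iou` and `nms` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        score_thr: float = 0.7, overlap_thr: float = 0.7) -> List[int]:
    """Return the indices of the kept boxes, highest score first."""
    # Keep only detections whose score exceeds the detection threshold.
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # A box survives only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) <= overlap_thr for k in keep):
            keep.append(i)
    return keep
```

For example, with two nearly identical boxes and one distant box, only the better-scored box of the overlapping pair and the distant box are kept.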

Human Body Region Extraction Using Depth Frames
If the depth camera is fixed in a certain position, then the pixels of the human body region in the depth frame have different values from the depth pixels of the background. Therefore, the body region can be extracted by comparing the depth pixels of the depth frame with those of the background depth image, which holds the depth information of the background. In order to extract the human body region accurately, the background depth image should be generated from several depth frames that capture the background, because the depth video includes temporal noise. The depth value at each position of the background depth image is determined as the minimum value among the pixels at the corresponding position of the depth frames capturing the background.
The human body region is extracted by comparing the pixels of the depth frame with those of the background depth image. A binarization image $B$ is generated for the human body region extraction as follows:
$$B(x, y) = \begin{cases} 1, & d_b(x, y) - d(x, y) > T_b \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $d_b(x, y)$ and $d(x, y)$ are the depth pixel values of the background depth image and the depth frame at position $(x, y)$, respectively, and $T_b$ is a threshold for the binarization.
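The background generation and the binarization can be sketched in a few lines of Python. The sketch assumes that a pixel belongs to the body when it is closer to the camera than the background by more than the threshold, and it stores depth images as nested lists of values in row-major order; the function names are hypothetical.

```python
from typing import List

DepthImage = List[List[int]]  # rows of depth values, e.g. in millimeters


def build_background(frames: List[DepthImage]) -> DepthImage:
    """Per-pixel minimum over several frames that show only the background."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[min(f[y][x] for f in frames) for x in range(cols)]
            for y in range(rows)]


def binarize(frame: DepthImage, background: DepthImage, t_b: int) -> DepthImage:
    """Mark a pixel 1 where the frame is closer than the background by more than t_b."""
    return [[1 if bg - d > t_b else 0 for d, bg in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]
```

In practice a depth sensor may report invalid pixels (often as 0), which would need to be excluded before taking the minimum; that handling is left out here.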

Extraction of Head Top and Foot Bottom Points
The topmost point of the human head region is extracted as the head-top, $(x_h, y_h)$, and the bottommost point of the human body region as the foot-bottom, $(x_f, y_f)$. If horizontally continuous pixels exist as shown in Figure 5, then the head-top or the foot-bottom is extracted as the center point among these pixels. If the person stands with legs apart as shown in Figure 6, then two separate regions may be found at the bottom of the human body region. In this case, the center points of the two regions are the candidates for the foot-bottom. The candidate whose depth value is closer to the depth pixel value of the head-top point is selected as the foot-bottom point.
Appl. Sci. 2020, 10, 5531
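The point extraction described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: masks and depth images are nested lists indexed as `[row][column]`, points are returned as `(x, y)`, and when several runs of pixels appear in the top row of the head the widest one is used, which is an assumption.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]
Depth = List[List[int]]
Point = Tuple[int, int]  # (x, y) = (column, row)


def runs(row: List[int]) -> List[Tuple[int, int]]:
    """Inclusive (start, end) column ranges of consecutive 1s in a mask row."""
    out, start = [], None
    for x, v in enumerate(row + [0]):  # the sentinel 0 closes a trailing run
        if v and start is None:
            start = x
        elif not v and start is not None:
            out.append((start, x - 1))
            start = None
    return out


def head_top(head_mask: Mask) -> Optional[Point]:
    """Center of the topmost run of head pixels."""
    for y, row in enumerate(head_mask):
        r = runs(row)
        if r:
            s, e = max(r, key=lambda se: se[1] - se[0])  # widest run (assumption)
            return ((s + e) // 2, y)
    return None


def foot_bottom(body_mask: Mask, depth: Depth, head_depth: int) -> Optional[Point]:
    """Center of the bottommost run whose depth is closest to the head depth."""
    for y in range(len(body_mask) - 1, -1, -1):
        r = runs(body_mask[y])
        if r:
            centers = [((s + e) // 2, y) for s, e in r]
            return min(centers, key=lambda p: abs(depth[p[1]][p[0]] - head_depth))
    return None
```

With legs apart, `runs` returns two ranges for the bottom row, and the candidate whose depth is nearer to the head-top depth is returned.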

Human Height Estimation
Human height is estimated by measuring the length in the 3D real world between the head-top and foot-bottom points. In order to measure the length in the real world, the 2D image coordinates of the head-top and the foot-bottom are converted into 3D real-world coordinates by applying a pinhole camera model [41] as follows:
$$X = \frac{(x - W/2)\,Z}{f}, \quad Y = \frac{(y - H/2)\,Z}{f}, \quad Z = d(x, y) \tag{2}$$
where $X$, $Y$, $Z$ are the real-world coordinates, $f$ is the focal length of the depth camera, which is a parameter of the depth camera, and $W$ and $H$ are the horizontal and vertical resolutions of the depth image, respectively. In (2), the origin of the image coordinate system is the top-left of the image, but the origin of the 3D camera coordinate system is the camera center. In order to compensate for the difference in the position of the origin between the two coordinate systems, the coordinates of the image center are subtracted from the image coordinates. The real-world coordinates of the head-top $(X_h, Y_h, Z_h)$ and of the foot-bottom $(X_f, Y_f, Z_f)$ are calculated by substituting the image coordinates and the depth values of the head-top and the foot-bottom into (2), respectively, as follows:
$$X_h = \frac{(x_h - W/2)\,d_h}{f}, \quad Y_h = \frac{(y_h - H/2)\,d_h}{f}, \quad Z_h = d_h \tag{3}$$
$$X_f = \frac{(x_f - W/2)\,d_f}{f}, \quad Y_f = \frac{(y_f - H/2)\,d_f}{f}, \quad Z_f = d_f \tag{4}$$
where $d_h$ and $d_f$ are the depth values of the head-top and the foot-bottom, respectively. Human height is estimated by calculating the Euclidean distance between the real-world coordinates of the head-top and the foot-bottom as follows:
$$H = \sqrt{(X_h - X_f)^2 + (Y_h - Y_f)^2 + (Z_h - Z_f)^2} \tag{5}$$
where $H$ in (5) denotes the estimated human height rather than the vertical resolution in (2). The unit of the human height estimated by (5) is the same as the unit of the depth pixels. If the pixels of the depth video store the distance in millimeters, then the unit of $H$ is the millimeter.
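As a worked illustration of the conversion and the distance computation, the following Python sketch applies the pinhole model to two image points. It assumes that the focal length is expressed in pixels; the values used in the usage note (a 640 × 480 depth image and a focal length of 500) are hypothetical and are not taken from the study.

```python
import math
from typing import Tuple


def to_world(x: float, y: float, d: float,
             f: float, w: int, h: int) -> Tuple[float, float, float]:
    """Pinhole back-projection of image point (x, y) with depth d.

    The image center (w / 2, h / 2) is subtracted so that the origin
    moves from the top-left corner to the optical axis.
    """
    return ((x - w / 2) * d / f, (y - h / 2) * d / f, d)


def estimate_height(head_px: Tuple[float, float], head_d: float,
                    foot_px: Tuple[float, float], foot_d: float,
                    f: float, w: int, h: int) -> float:
    """Euclidean distance between the back-projected head-top and foot-bottom."""
    xh, yh, zh = to_world(head_px[0], head_px[1], head_d, f, w, h)
    xf, yf, zf = to_world(foot_px[0], foot_px[1], foot_d, f, w, h)
    return math.sqrt((xh - xf) ** 2 + (yh - yf) ** 2 + (zh - zf) ** 2)
```

For a head-top at (320, 40) and a foot-bottom at (320, 440), both at a depth of 2000 mm, `estimate_height((320, 40), 2000, (320, 440), 2000, 500, 640, 480)` gives 1600 mm, since the two points are 800 mm above and below the optical axis.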
The human height estimated by (5) may contain an error. One source of the error is noise in $d_h$. In general, $(x_h, y_h)$ may lie in the hair area, where the depth values are very noisy because hair diffusely reflects the infrared light emitted by the depth camera. Therefore, $d_h$ should be corrected to the depth value of a point that is close to the head top but is not in the hair area. The depth value of such a point is highly similar to the depth values of its neighboring pixels. The similarity is measured by the variance of the depth values of the pixels located within $r$ pixels to the left, right and bottom of the corresponding pixel, including the pixel itself, as follows:
$$\sigma_r^2 = \frac{1}{|W_r|}\sum_{(u,v)\in W_r}\bigl(d(u,v)-\mu_r\bigr)^2,\qquad \mu_r = \frac{1}{|W_r|}\sum_{(u,v)\in W_r} d(u,v), \tag{6}$$
where $W_r$ is this set of pixels and $d(u, v)$ is the depth value of pixel $(u, v)$. If $\sigma_r^2$ is smaller than $T_\sigma$, then $d_h$ is corrected to the depth value of the corresponding pixel, as shown in Figure 7. Otherwise, a point is searched for among the pixels one pixel below, and the similarity of the found point is calculated by (6). In (6), $r$ becomes smaller as $d_h$ becomes larger, because an object appears wider in the image as it gets closer to the camera, as follows [42]:
$$\frac{P_1}{P_2} = \frac{d_2}{d_1}, \tag{7}$$
where $P_1$ and $P_2$ are the pixel lengths of the object width when the depth values are $d_1$ and $d_2$, respectively. Therefore, $r$ is determined from the depth value as follows:
$$r = \frac{d_0 \times r_0}{d_h}. \tag{8}$$
In (8), $d_0$ and $r_0$ are constants, so $d_0 \times r_0$ can be regarded as a single parameter. If $d_0 \times r_0$ is denoted by $\gamma$, (6) is modified as follows:
$$\sigma_{\gamma/d_h}^2 = \frac{1}{|W_{\gamma/d_h}|}\sum_{(u,v)\in W_{\gamma/d_h}}\bigl(d(u,v)-\mu_{\gamma/d_h}\bigr)^2. \tag{9}$$
Mask R-CNN occasionally detects a human body region that is slightly wider than the actual region. In particular, a region detection error in the lower part of the human body may cause a point on the ground to be extracted as the foot-bottom. Assuming the ground is flat, the depth gradient changes little between two vertically adjacent pixels in the ground area. The depth gradient of a pixel $(x, y)$ is defined as follows:
$$g(x, y) = d(x + 1, y) - d(x, y). \tag{10}$$
If a point is on the ground, then its depth gradient is the same as that of the pixel one pixel down. In order to determine whether the foot-bottom is extracted correctly, the two depth gradients are compared as follows:
$$D = \begin{cases} 0, & g(x, y) = g(x + 1, y) \\ 1, & \text{otherwise.} \end{cases} \tag{11}$$
If $D$ is 0, then the point is removed from the human body region. The comparison of the depth gradients by (11) is applied to the bottommost pixels of the human body region in order to correct the foot-bottom. If all of the bottommost pixels are removed, then this process is repeated for the points of the human body region that are one pixel up. The foot-bottom is extracted as the center pixel among the bottommost pixels that are not removed. Figure 8 shows the correction of the position of the foot-bottom point.
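The two corrections described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the depth map is a list of rows with the row index increasing downward, the population variance is used as the similarity measure, the head-top search moves straight down one pixel at a time in the same column, the window radius is rounded to at least one pixel, and the `tol` parameter (default 0, i.e., exact equality of gradients) is a hypothetical addition for noisy data. All function names are hypothetical.

```python
from statistics import pvariance


def window_variance(depth, row, col, r):
    """Variance of depth values within r pixels left, right and below (row, col)."""
    rows, cols = len(depth), len(depth[0])
    values = [
        depth[i][j]
        for i in range(row, min(row + r, rows - 1) + 1)
        for j in range(max(col - r, 0), min(col + r, cols - 1) + 1)
    ]
    return pvariance(values)


def correct_head_depth(depth, row, col, gamma=4000.0, t_sigma=50.0):
    """Move down from the head-top pixel until the local variance is below t_sigma."""
    for i in range(row, len(depth)):
        d = depth[i][col]
        if d <= 0:  # assumption: non-positive values mark invalid depth
            continue
        r = max(1, round(gamma / d))  # window radius shrinks with distance
        if window_variance(depth, i, col, r) < t_sigma:
            return d
    return depth[row][col]  # fallback: keep the uncorrected value


def depth_gradient(depth, row, col):
    """Depth difference between the pixel below and the pixel itself."""
    return depth[row + 1][col] - depth[row][col]


def is_ground(depth, row, col, tol=0.0):
    """True when the gradient matches that of the pixel below (flat ground)."""
    diff = depth_gradient(depth, row, col) - depth_gradient(depth, row + 1, col)
    return abs(diff) <= tol


def correct_foot_bottom(mask, depth, tol=0.0):
    """Return (row, col) of the foot-bottom after dropping ground pixels."""
    # Start two rows above the image bottom so both gradients are defined.
    for row in range(len(mask) - 3, -1, -1):
        body_cols = [c for c, inside in enumerate(mask[row]) if inside]
        kept = [c for c in body_cols if not is_ground(depth, row, c, tol)]
        if kept:
            return row, kept[len(kept) // 2]  # center of the remaining pixels
    return None
```

With real sensor data, exact equality of two gradients is rarely satisfied, so a small positive `tol` would probably be needed in practice.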
Noise in the depth video may cause temporary errors in the human-height estimation. These temporary errors are corrected by averaging the estimated human heights over the sequence of depth frames as follows:
$$\bar{H}(n) = \frac{1}{n}\sum_{i=1}^{n} H(i), \tag{12}$$
where $n$ is the order of the captured depth frame and $H(n)$ and $\bar{H}(n)$ are the estimated and corrected human heights in the $n$th frame, respectively.
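The cumulative averaging can be illustrated with a short Python sketch. It assumes that frames are numbered from 1 and that every frame contributes with equal weight; the function name is hypothetical.

```python
def cumulative_average(heights):
    """Return the running mean of the per-frame height estimates."""
    corrected = []
    total = 0.0
    for n, h in enumerate(heights, start=1):
        total += h                    # sum of H(1) .. H(n)
        corrected.append(total / n)   # corrected height for frame n
    return corrected


# Example: noisy per-frame estimates in cm settle toward their mean.
print(cumulative_average([170.0, 180.0, 175.0]))
```

Because each new frame only changes the average by a factor of 1/n, a single noisy frame has less and less influence as more frames are accumulated.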

Experiment Results
An Intel RealSense D435 is used as the depth camera in the experiments. The focal length $f$ and the frame rate of the depth camera are 325.8 mm and 30 Hz, respectively. The resolution of the depth video is 640 × 480. The thresholds $T_b$ for (1) and $T_\sigma$ for (6) are set to 100 and 50, respectively. The parameter $\gamma$ in (9) is 4000, which means that $r$ is 2 when $d_h$ is 2000 mm. Figures 9 and 10 show the human body regions extracted by mask R-CNN and by the background depth image, respectively. Both methods accurately extract the human body region not only in a standing state but also in a walking state, regardless of the body posture. In Figure 9, the areas painted in green and red are the human body and human head regions, respectively. The human head regions are accurately found even when a hand is raised above the head. The extraction by the background depth image yields larger regions than the extraction by mask R-CNN, so some part of the background is included in the human body region. In addition, the bottom area of the human body is not included in the human body region because the depth values of this area are similar to the depth values of the ground.
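As a worked illustration of the relation between $\gamma$, $d_h$ and $r$ stated above, the window radius can be evaluated at a few distances. The values below assume $r = \gamma / d_h$ with $\gamma = 4000$; how a non-integer radius is rounded to a whole number of pixels is not specified here, so the values are left unrounded.

```latex
r = \frac{\gamma}{d_h}:\qquad
d_h = 2000\ \text{mm} \;\Rightarrow\; r = \frac{4000}{2000} = 2,\qquad
d_h = 2500\ \text{mm} \;\Rightarrow\; r = \frac{4000}{2500} = 1.6,\qquad
d_h = 4000\ \text{mm} \;\Rightarrow\; r = \frac{4000}{4000} = 1.
```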
Figure 11 shows the correction of human height by (12) when the body region is extracted by mask R-CNN. Without the correction, the estimated human heights are widely spread because of the noise in the depth frames. After the correction by (12) is applied, the estimated human heights converge to stable values after about 20 frames.
Figure 11. Results of height estimation after correction through cumulative average for each of three persons.
Figure 12 shows the results of the human-height estimation depending on the method of human body region extraction. The actual height of the person is 177 cm. For the extraction through the background depth image, the first 50 frames are accumulated to generate the background depth image. The person stays at a distance of 3.5 m from the camera. The body height is estimated as 176.2 cm when the human body region is extracted by mask R-CNN and as 172.9 cm when the human body region is extracted by the background depth image.
Figure 14 shows the results of the human-height estimation when a person whose height is 180 cm is standing, walking toward the camera and walking laterally. The person stays at a distance of 2.5 m from the camera while standing and while walking laterally. While the person is walking toward the camera, the distance from the camera ranges from 2.5 m to 4 m. In the standing state, the human height is estimated as 178.9 cm. The human height is estimated as 177.1 cm and 174.9 cm in lateral walking and in walking toward the camera, respectively. The magnitude of the estimation error in the lateral walking state is similar to that in the standing state. The estimation error in walking toward the camera is larger than the others. The reason is that the vertical length of the human body is reduced because the knees are bent to a certain degree while walking.
Figure 15 shows the positions of $d_h$ and $(x_f, y_f)$ before and after the correction of the head-top and the foot-bottom, respectively. The green and red points in Figure 15 represent the head-top and foot-bottom points, respectively. In Figure 15a, the position of $d_h$ is in the hair area, so $d_h$ has some error and the changes in $d_h$ are large, as shown in Figure 16. After $d_h$ is corrected, the changes are smaller. The changes in the estimated human body height are reduced after the foot-bottom point is corrected, as shown in Figure 17. Two persons whose actual heights are 182 cm and 165 cm are estimated as 188.6 cm and 181.5 cm, respectively, before correcting $d_h$, as 172.2 cm and 163.3 cm, respectively, after the correction of the head-top point, and as 181.5 cm and 163.1 cm, respectively, after the correction of both the head-top and foot-bottom points.
Figure 18 shows the results of the human-height estimation depending on $r_0$ and $T_\sigma$, which are the parameters for (8) and (9), when $d_0$ is 2000. The estimated height drops sharply when $r_0$ is less than or equal to 2 and decreases smoothly when $r_0$ is larger than 2. In addition, the estimated height increases linearly when $T_\sigma$ is less than or equal to 250 and increases slowly when $T_\sigma$ is larger than 250. Body height is estimated most accurately when $r_0$ is 2 and $T_\sigma$ is 125.
Tables 2-4 show the results of the human-height estimation depending on human body postures for 10 persons. All of the persons are captured within a range of 2.5 m to 4 m from the camera. For each person, 150 frames are captured. The error of the human-height estimation is calculated as follows:
$$\text{error} = \frac{\left|\bar{H} - H_{actual}\right|}{H_{actual}} \times 100\ (\%), \tag{13}$$
where $\bar{H}$ and $H_{actual}$ are the average of the human heights corrected by (12) and the actual height of a person, respectively. When the human body region is extracted by mask R-CNN, the errors of the human-height estimation for standing, lateral walking and walking towards the camera are 0.7%, 1.3% and 1.8%, respectively. The correct foot-bottom for the human-height estimation is the point of the heel that is on the ground. However, in the proposed method, the bottommost pixel of the body region, which is extracted as the foot-bottom point, is usually a point on the toe. The position difference between the heel and toe points may introduce an error into the human-height estimation. When the body region is extracted by the background depth image, the errors for standing, lateral walking and walking towards the camera are 2.2%, 2.9% and 4.6%, respectively. The human-height estimation using only depth frames has a larger error than the estimation using both color and depth frames.
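The error measure can be illustrated with the single-person values reported for Figure 12 (actual height 177 cm, estimates of 176.2 cm and 172.9 cm). The sketch below assumes that the error is the absolute difference relative to the actual height, expressed in percent, which matches the percentage figures reported above; the function name is hypothetical.

```python
def height_error_percent(estimated_cm, actual_cm):
    """Absolute height error relative to the actual height, in percent."""
    return abs(estimated_cm - actual_cm) / actual_cm * 100.0


# Values reported for Figure 12 (one person, actual height 177 cm).
for label, estimate in [("mask R-CNN", 176.2), ("background depth", 172.9)]:
    print(f"{label}: {height_error_percent(estimate, 177.0):.2f}%")
```

These single-person values are not the same quantity as the averages over 10 persons in Tables 2-4, so they are only expected to be of a similar magnitude.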

Conclusions
In this study, a human-height estimation method using color and depth information was proposed. The human body region was extracted by applying the pre-trained mask R-CNN to color video. A method for extracting the human body region from depth video by comparison with a background depth image was also proposed. Human height was estimated from depth information by converting the head-top and foot-bottom points into two 3D real-world coordinates and by measuring the Euclidean distance between them. Human height was accurately estimated even when the person was not facing the camera or was walking. In the experiment results, the errors of the human-height estimation by the proposed method in the standing state were 0.7% and 2.2% when the human body region was extracted by mask R-CNN and by the background depth image, respectively. The proposed method improves the human-height estimation by combining color and depth information. The proposed method can be applied to estimate not only body height but also the height of other object types such as animals. It can also be applied to gesture recognition and body posture estimation, which require the types and the 3D information of objects.