Geometric Recognition of Moving Objects in Monocular Rotating Imagery Using Faster R-CNN

Abstract: Moving object detection and tracking from image sequences has been extensively studied in a variety of fields. Nevertheless, observing geometric attributes and identifying the detected objects for further investigation of moving behavior has drawn less attention. The focus of this study is to determine moving trajectories, object heights, and object classes using a monocular camera configuration. This paper presents a scheme that conducts moving object recognition with three-dimensional (3D) observation using a faster region-based convolutional neural network (Faster R-CNN), a stationary and rotating pan-tilt-zoom (PTZ) camera, and close-range photogrammetry. The camera motion effects are first eliminated to detect objects that contain actual movement, and a moving object recognition process is employed to recognize the object classes and to facilitate the estimation of their geometric attributes. This information can further contribute to the investigation of object moving behavior. To evaluate the effectiveness of the proposed scheme quantitatively, an experiment with an indoor synthetic configuration is first conducted, and then outdoor real-life data are used to verify the feasibility based on recall, precision, and the F1 index. The experiments have shown promising results and have verified the effectiveness of the proposed method in both laboratory and real environments. The proposed approach calculates height and speed estimates of the recognized moving objects, including pedestrians and vehicles, and shows application potential with acceptable errors through existing PTZ camera images at a very low cost.


Introduction
In the field of computer vision, detecting and tracking moving objects has been widely studied for decades. A survey of the challenges and the latest methods of moving object detection in video sequences captured by a moving camera is presented in [1]. Closed-Circuit Televisions (CCTVs) provide a large number of images for video surveillance that involves various machine learning technologies [2]. Emerging applications in artificial intelligence, for example, [3][4][5][6][7][8][9], attract research attention to three-dimensional (3D) information acquisition from imagery to recognize objects and also to perceive their behaviors. Nevertheless, robustly detecting, tracking, and identifying moving objects is still a challenge, since a large number of variables and possible geometric and dynamic ambiguities are involved in the computation [10][11][12][13], and moving objects must be precisely separated from image backgrounds. This paper contributes a scheme to execute moving object recognition and further derives the geometric attributes from a single stationary and rotating PTZ camera configuration. Considering that both the foreground and background change while the camera is rotating, the scheme begins with the rectification of camera motion and proceeds with moving object detection and identification by leveraging background subtraction and recognition techniques. Finally, the foreground pixels of the moving objects are refined and used to observe the geometric attributes of each object using a combination of single- and multi-view solutions. Under the assumption that all moving objects are located on a known plane, the geometric attributes of the objects are determined from a monocular camera configuration, providing important clues for further intelligent applications.
In addition, the proposed method is implemented on street-view images acquired using a SAMPO PTZ camera (Sampo Corporation, Taoyuan, Taiwan) in a stationary and rotating configuration; the common objects in context (COCO) dataset and the KITTI dataset [45] are used for performance evaluation.

Methodology
The proposed scheme comprises the following four modules to identify, track, and perceive moving objects: (1) camera motion rectification, (2) motion segmentation, (3) moving object recognition, and (4) geometric observing. The block diagram of the proposed scheme is shown in Figure 1.

Figure 1. Block diagram of the proposed scheme: image sequences → camera motion rectification → motion segmentation → moving object recognition → geometric observing → recognized objects and their 3D attributes.


Camera Motion Rectification
Since the movements of a rotating camera critically degenerate the accuracy and reliability of the motion segmentation, the proposed scheme begins with rectification to eliminate the camera motion by estimating the camera poses at each epoch, which is the critical process to find actual moving objects. Referring to the evaluation of state-of-the-art image features, speeded up robust features (SURF) method [46] has shown good accuracy regarding generic invariance properties. Although no best feature descriptor can tackle all kinds of deformation at present, SURF has shown its effectiveness and efficiency. Thus, SURF correspondences refined by random sample consensus (RANSAC) [47] were employed to construct the essential matrix for the relative camera pose estimation. The object function describing epipolar geometry for estimating camera poses can be read as: where x L = x L i y L i 1 and x R = x R i y R i 1 indicate the image coordinates in left and right images, respectively and C is the matrix conveying interior parameters of the camera. E, an essential Remote Sens. 2020, 12, 1908 4 of 18 matrix, which can be expressed as Equation (2), can be solved linearly and used as approximations for nonlinear least squares adjustment [48].
where y, e, d, ξ, and P denote an observation vector, a residual vector, a constant vector, the unknowns, and a weight matrix, respectively. Rearranging Equation (2) leads to a condition form in which w = d − By are the discrepancy vectors. Thus, the unknowns can be derived by Equation (5), and the a posteriori standard deviation of unit weight can be computed via Equation (6), $\hat{\sigma}_0 = \sqrt{\tilde{e}^{\top} P \tilde{e} / r}$, in which r is the number of degrees of freedom (redundancy). Consequently, the photo coordinate system of the current image frame can be transformed into the coordinate system of the previous one via the relative camera poses. On the basis of the same coordinate system, the average movements between feature correspondences are estimated by:

$$\Delta x_{mean} = \frac{1}{n}\sum_{i=1}^{n}\left(x_{Q_i} - x_{R_i}\right),\qquad \Delta y_{mean} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{Q_i} - y_{R_i}\right)$$

where (Δx_mean, Δy_mean) indicates the average translation between corresponding feature points, and (x_{Q_i}, y_{Q_i}) and (x_{R_i}, y_{R_i}) are the photo coordinates of corresponding features in the query and reference frames, respectively. Thus, the rectified photo coordinates (x_i^{rectified}, y_i^{rectified}) of the current frame with respect to the reference one can be derived and then transformed to the image coordinates (row_i^{rectified}, col_i^{rectified}) for transmitting their color attributes, where (x_0, y_0) is the principal point of the reference frame, and (n_x, n_y) and (l_x, l_y) represent the number of pixels in the x and y directions and the size of the image frames, respectively. Finally, the original spectral information can be conveyed to the rectified images.
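As a concrete sketch of the relative pose step, the linear essential-matrix estimate can be written with the classical eight-point algorithm. The snippet below is a minimal numpy illustration, assuming the correspondences have already been normalized by the interior parameter matrix C and refined by RANSAC (both performed upstream in the scheme); it is not the paper's exact adjustment, only the linear approximation stage.

```python
import numpy as np

def eight_point_essential(x_left, x_right):
    """Linear eight-point solution for E such that x_right_i^T E x_left_i = 0.
    Inputs are (n, 3) arrays of normalized homogeneous image coordinates, n >= 8."""
    # Each correspondence contributes one row kron(x_right, x_left) * vec(E) = 0.
    A = np.stack([np.kron(xr, xl) for xl, xr in zip(x_left, x_right)])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)          # null vector of A, reshaped row-major
    # Enforce the essential-matrix constraint: two equal singular values, one zero.
    U, s, Vt = np.linalg.svd(E)
    sigma = (s[0] + s[1]) / 2.0
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```

The linear estimate would then serve as the approximation for the nonlinear least squares refinement described above.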

Motion Segmentation
Since this study applied a stationary and rotating camera configuration, the background and foreground changed simultaneously. If the foreground object and the camera moved in the same direction, the camera motion would counteract the movement of the object. On the contrary, the object movement would be magnified if the object and the camera moved in opposite directions. Therefore, as shown in Figure 2, without eliminating the interference of camera motion, it could lead to false positives of motion segmentation and a misinterpretation of the moving behavior.
Figure 2. The camera motion rectification between t and t + 1 frames.
The segmentation process retrieves the actual moving objects from consecutive image frames, which was realized by a recursive background subtraction technique in this study. The red, green, and blue (RGB) color space can be converted to hue, saturation, and intensity (HSI) space to ease the lighting influence and to enhance the segmentation quality:

$$I = \frac{R+G+B}{3},\qquad S = 1 - \frac{3\min(R,G,B)}{R+G+B},\qquad H = \cos^{-1}\!\left[\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^{2}+(R-B)(G-B)}}\right]$$

where H, S, and I indicate hue, saturation, and intensity, whereas R, G, and B indicate the values of the three color channels. In addition, an exponent subtraction factor Θ based on saturation is used to distinguish moving objects from the background, where Img_i indicates the i-th image input, Img_{i−1} is treated as the reference frame, and e is the exponent. A pixel is deemed foreground if the subtraction result is larger than a threshold that can be set adaptively according to the 3σ of the average difference, i.e., FG = {Θ > µ + 3σ}, where FG indicates the foreground and µ and σ are the mean and the standard deviation of the differences, respectively. Furthermore, to eliminate salt-and-pepper noise, a median filter was used to polish the foreground [49].
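The segmentation step above can be sketched as follows. This is a minimal numpy illustration in which the subtraction factor is taken as an exponentiated saturation difference (the exact form of Θ is not fully spelled out here, so that choice is an assumption), thresholded at µ + 3σ, and cleaned with a 3×3 median filter.

```python
import numpy as np

def segment_foreground(sat_cur, sat_ref, exponent=1.0):
    """Background subtraction on the saturation channel with an adaptive
    3-sigma threshold and a dependency-free 3x3 median filter."""
    # Assumed subtraction factor: |S_i - S_{i-1}| ** e.
    diff = np.abs(sat_cur - sat_ref) ** exponent
    thresh = diff.mean() + 3.0 * diff.std()      # mu + 3*sigma of the differences
    fg = (diff > thresh).astype(np.uint8)
    # 3x3 median filter to suppress salt-and-pepper noise.
    p = np.pad(fg, 1, mode='edge')
    stack = np.stack([p[r:r + fg.shape[0], c:c + fg.shape[1]]
                      for r in range(3) for c in range(3)])
    return (np.median(stack, axis=0) > 0.5).astype(np.uint8)
```

In practice a production pipeline would use a library median filter, but the logic is the same: threshold, then remove isolated foreground speckles.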

Moving Object Recognition
To identify multiple classes from the detected moving objects, the model combining Faster R-CNN [41] with neural architecture search (NAS) [50] was leveraged in this study. In Figure 3, Faster R-CNN integrates feature extraction, region proposal, classification, and bounding box regression into a unified network, and reveals the best recognition accuracy but the lowest efficiency. The details of the recognition model can be referred to in [41].

It is worthwhile noting that the process of bounding box regression in this model can give precise estimates of the object regions, since the region proposals are regressed based on the convolutional neural network. Therefore, the initial foreground pixels determined by the motion segmentation can be refined based on the regressive bounding boxes. The foreground pixels that fall outside the bounding boxes should be excluded from the subsequent geometric computation process. As shown in Figure 4, the initial foreground pixels are determined by the yellow bounding boxes and can be further rectified by the regressive green boxes. The lighting or shadow interference was eased, and thus more reliable geometric estimation could be achieved.
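The refinement described above reduces to masking the segmented foreground with the regressive boxes. A minimal sketch, in which the (x_min, y_min, x_max, y_max) box format is an assumption for illustration (the boxes themselves would come from the Faster R-CNN detector):

```python
import numpy as np

def refine_foreground(fg_mask, boxes):
    """Keep only foreground pixels inside at least one regressive bounding box,
    so that pixels caused by shadow or lighting outside the boxes are excluded."""
    keep = np.zeros_like(fg_mask, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        keep[y0:y1, x0:x1] = True          # rows = y, columns = x
    return fg_mask * keep
```

Only the refined pixels then enter the geometric computation of the next section.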


Geometric Observing
This step collects the 3D geometric attributes of the recognized moving objects with respect to moving trajectories, object heights, and moving velocity. The locations of the moving objects in 3D space need to be determined first. Most existing methods, such as visual simultaneous localization and mapping (SLAM) and visual odometry, for example, [51,52], usually deal with stereo- or multi-view images for the better intersecting geometry of 3D positioning. However, considering a single stationary and rotating camera configuration, these methods are not suitable, even though slight motion and shift exist among the camera poses due to the deviation between the perspective center and the rotating axis.
In addition, considering that only a slight translational discrepancy exists between two perspective centers at different timestamps, multiple view solutions would raise problems in dealing with weak intersecting geometry. In view of this, in this study, we determined the 3D locations of objects by combining single and multiple view solutions. The initial position was estimated in a single view manner, and then the estimate was treated as approximations to stabilize the computation of the multiple view estimation. For this purpose, the first image frame was selected to define the reference coordinate system, and a ground plane on which all objects should move was given. As shown in Figure 5, the image ray constructed by the camera center and the image point of B was used to intersect with the ground plane for determining the location B of the object. The image ray of a point can be described by the well-known collinearity equations as:

$$x = x_0 - f\,\frac{m_{11}(X-X_c)+m_{12}(Y-Y_c)+m_{13}(Z-Z_c)}{m_{31}(X-X_c)+m_{32}(Y-Y_c)+m_{33}(Z-Z_c)},\qquad y = y_0 - f\,\frac{m_{21}(X-X_c)+m_{22}(Y-Y_c)+m_{23}(Z-Z_c)}{m_{31}(X-X_c)+m_{32}(Y-Y_c)+m_{33}(Z-Z_c)}$$

where (X_c, Y_c, Z_c) and (X, Y, Z) indicate the 3D coordinates of the camera and the object, respectively; (x, y) are the image coordinates; (x_0, y_0, f) are the interior orientation parameters; and m_{ij} are the elements of the rotation matrix. In cases where the ground plane is expressed as Z = 0, the bottom point B of the object can be solved based on the simultaneous equations of the collinearity and plane formulae. In fact, the height of a moving object should be perpendicular to the ground, and the top point T of the object can be determined by solving the intersection of the image ray and the 3D line derived from the normal vector and the bottom point. Furthermore, to refine the positioning quality of B and T, the estimates derived from the single view computation were treated as approximations for the multiple view estimation. The approximations stabilized the nonlinear calculation, even though the baselines between perspective centers of consecutive frames were relatively short.
Finally, the refined T and B points were used to compute the object height, h. The bottom points of the object among frames describe its moving trajectory, and the velocity over time can be derived as well. It should be noted that the lowest and highest pixels crossing the object centroid and perpendicular to the ground plane are deemed to be the bottom and top points in this study.
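Under the single-view solution above, B follows from a ray-plane intersection with Z = 0, and T from the closest point between the image ray of the top pixel and the vertical line through B. A numpy sketch, assuming the photogrammetric convention in which R rotates object-space vectors into camera space:

```python
import numpy as np

def ground_point(cam_xyz, R, xy_img, f, principal=(0.0, 0.0)):
    """Bottom point B: intersect the image ray with the ground plane Z = 0."""
    x0, y0 = principal
    d = R.T @ np.array([xy_img[0] - x0, xy_img[1] - y0, -f])  # ray in object space
    lam = -cam_xyz[2] / d[2]                                  # parameter reaching Z = 0
    return cam_xyz + lam * d

def top_point(cam_xyz, R, xy_img, f, B, principal=(0.0, 0.0)):
    """Top point T: closest point on the vertical line through B to the image
    ray of the top pixel; T[2] is then the object height h."""
    x0, y0 = principal
    d = R.T @ np.array([xy_img[0] - x0, xy_img[1] - y0, -f])
    ez = np.array([0.0, 0.0, 1.0])          # ground normal direction
    w = cam_xyz - B
    a, b = d @ d, d @ ez
    # Closest-point parameter along the vertical line (assumes the ray is not vertical).
    t = (a * (ez @ w) - b * (d @ w)) / (a - b * b)
    return B + t * ez
```

These single-view estimates would then seed the multi-view refinement as described above.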
Apart from determining the coordinate estimates, in this study, we further assessed the accuracy of the estimation from the related observations based on the theory of error propagation. Let σ_x and σ_y indicate the accuracy of the image coordinates of a point; σ_{X_c}, σ_{Y_c}, and σ_{Z_c} report the accuracy of the camera position; and σ_ω, σ_φ, and σ_κ denote the accuracy of the image orientation parameters. The accuracy of the unknown X and Y of a point can then be acquired as Σ_XY = D Σ_P D^⊤, where Σ_XY indicates the variance-covariance matrix of the point coordinates, D is the coefficient matrix with respect to the observations, and Σ_P is the variance-covariance matrix of the observations. In cases where the accuracies of the B and T points are computed, the quality of h can be estimated in a similar way. Let h = ||T − B||_2, with the accuracy of B and T given by Σ_B = diag(σ_{x_i}², σ_{y_i}²) and Σ_T = diag(σ_{x_j}², σ_{y_j}²), respectively. The variance of h then follows by propagating Σ_B and Σ_T through the norm.
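For the height quality, a simplified first-order propagation for h = ||T − B|| can be sketched as follows. Here Σ_B and Σ_T are taken as 3×3 object-space covariances of the two points, assumed uncorrelated; this is a simplification of the image-space matrices above, since the full chain would first propagate image observations into point coordinates via Σ_XY = D Σ_P D^⊤.

```python
import numpy as np

def height_variance(B, T, cov_B, cov_T):
    """h = ||T - B||; first-order error propagation gives
    sigma_h^2 = u^T (cov_T + cov_B) u, with u the unit vector along T - B."""
    diff = np.asarray(T, float) - np.asarray(B, float)
    h = np.linalg.norm(diff)
    u = diff / h                       # Jacobian of the norm w.r.t. T (and -u w.r.t. B)
    return h, u @ (cov_T + cov_B) @ u
```

For a vertical object the unit vector is (0, 0, 1), so only the Z variances of B and T contribute, which matches the intuition that height accuracy is governed by vertical positioning quality.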

Results and Discussion
As mentioned above, in contrast to existing methods designed for stationary cameras or camera systems with motion prerequisites, this study concentrates on acquiring the geometric observations of moving objects detected from a stationary and rotating monocular camera. To evaluate the effectiveness of the proposed scheme quantitatively, an experiment with a synthetic configuration is first conducted, and then real-life data are used to verify the feasibility. In this study, the following three indices, namely recall, precision, and F1, are employed to assess the quality of foreground pixel detection:

recall = num. of correct foreground pixels / num. of exact foreground pixels
precision = num. of correct foreground pixels / num. of detected foreground pixels
F1 = (2 × precision × recall) / (precision + recall)
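The three indices can be computed directly from binary masks; a minimal sketch:

```python
import numpy as np

def foreground_indices(detected, truth):
    """Pixel-level recall, precision, and F1 for binary foreground masks,
    where 'correct' pixels are those detected AND present in the ground truth."""
    correct = np.logical_and(detected, truth).sum()
    recall = correct / truth.sum()
    precision = correct / detected.sum()
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```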

Quantitative Evaluation with Synthetic Configuration
In this case, a calibrated Canon EOS 650D (Canon Inc., Tokyo, Japan) is used to acquire sequential images with a size of 5184 × 3456 pixels. To verify the effectiveness of the camera motion rectification, the background subtraction is implemented for images acquired by rotating and static camera configurations, respectively. To assess the robustness to illumination change, the simulation is realized in an indoor environment for the convenience of lighting control. A rigid chair plays the role of a moving object. The rotating angle of the camera is four degrees per step, and the depth of this test field is 6.5 m. Figure 6 shows a fraction of the captured images under different lighting conditions, in which the image data captured by the static camera configuration are treated as the reference for the following assessment. Figure 7 shows the motion segments of the object obtained before and after camera motion rectification. Noticeably, without rectification, the camera motion counteracts the movement of the object when the foreground object and the camera move in the same direction. On the contrary, the object movement would be magnified if the object and the camera move in opposite directions. This would lead to false positives of motion segmentation and a misinterpretation of the moving behavior. Furthermore, a quantitative evaluation is given in Table 1, which provides insight into the effectiveness of the proposed method.
The detection results obtained from the static and rotating camera configurations exhibit comparable quality regardless of the camera motion and the illumination change, proving the validity of the rectification. Although the recall rates of the rotating camera configuration are slightly lower than those of the static one, the precision rates reveal that the rotating configuration, on the contrary, yields more accurate detection, which is also shown in the resulting aggregative indices. Moreover, the illumination change certainly affects the foreground determination, and therefore the recall rate of the rotating camera configuration with lighting change drops to 60% in images c and d. The proposed method yields satisfactory performance, achieving a level of up to 0.90 in the aggregative index.

Street View Surveillance of a Rotating PTZ Camera
The proposed method is implemented on street-view images acquired using a SAMPO PTZ camera in a stationary and rotating configuration. The focal length of the SAMPO PTZ camera is 2.8 mm, and it has an image size of 1080 × 1920 pixels. The field of view is approximately 140 degrees, with the capability of rotating 355 and 90 degrees in the horizontal and vertical directions, respectively. The first image of the camera is set as the reference coordinate system, and the equation of the ground plane is given as Z = 0 accordingly. The recognition model has been trained on the common objects in context (COCO) dataset [53], which contains over 2.5 million labeled instances in 330,000 images. Figure 9 shows a fraction of the street-view image sequence and their timestamps, in which the PTZ camera is set on a footbridge at a height of 5.1 m above the given ground plane. The minimum blob area in motion segmentation is set as 1000 pixels to banish trivial patches. A1 to A3 show the images acquired when the camera is static, whereas B1 to B3 depict the images acquired when the camera is rotating. Figure 10 demonstrates the motion segmentation and the recognition results of the sequences. In light of the red bounding boxes, the resulting foreground pixels of the motion segmentation are not reliable and are sensitive to shadow and reflection influence.
However, the regressive bounding boxes derived from the recognition process can be used to improve the description of the moving object boundaries. Only the foreground pixels of an object surrounded by the regressive bounding boxes are used to estimate the geometric attributes of the object. In this study, the lowest foreground pixel in the middle of the regressive bounding box is defined as the foot point, and the height of the object is computed accordingly.
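The foot-point rule above can be sketched as follows, with image rows growing downward so the "lowest" pixel has the largest row index (the (x_min, y_min, x_max, y_max) box format is an assumption for illustration):

```python
import numpy as np

def foot_point(fg_mask, box):
    """Lowest foreground pixel in the middle column of the regressive bounding
    box; returns (row, col) or None if no foreground lies in that column."""
    x0, y0, x1, y1 = box
    mid = (x0 + x1) // 2                          # middle column of the box
    rows = np.nonzero(fg_mask[y0:y1, mid])[0]
    if rows.size == 0:
        return None
    return (y0 + rows.max(), mid)                 # largest row index = lowest pixel
```

The returned image point is what the single-view ray-plane intersection uses to locate the object on the ground plane, from which the height is computed.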
Table 2 shows the statistic geometric attributes in terms of the recognized object classes, object heights, and moving velocity. In this case, the keyframes are selected every 10 frames for the computation of the object heights and velocity. This information can further contribute to the identification and prediction of object behavior. Referring to the general specification of the objects, the geometric estimates of the recognized objects in Table 2 seem promising. The camera motion rectification adjusts the relative motion between the rotating camera and objects. By combining the motion segmentation and the recognition process, the regions of the moving objects in images can be assigned properly, and therefore facilitates the determination of the moving object locations over time. The statistics of velocity also reveal the statuses of Objects 1 and 3 correctly, showing that they were accelerating rapidly when starting the movement at the intersection. The accuracy of the geometric attributes, however, is highly correlated to the quality of the foreground object detection.
If the foreground pixels of an object cannot describe the object completely, obvious errors would be induced in estimating the object's location and height. Currently, the object height is measured based on the height displacement of the object in the image. In cases where a moving object comprises a depth, the height estimate would convey a conspicuous error, which can be seen in the standard deviations of Objects 1 and 3 in Table 2, since vehicles are the main class of moving objects with a depth effect. By contrast, the height estimate of the pedestrian is promising due to the nature of the body shape.
As demonstrated in Figure 11, a super-pixel segmentation [54] is performed on the recognized result of a vehicle to derive its subregions. These regions are superimposed onto the foreground pixels to eliminate the depth interference in estimating the object height. The segmenting process assumes that the top of the vehicle is located in the center segment of the regressive bounding box, and the object height is determined from the foot point to the highest foreground pixel in the center segment along the direction of the ground normal vector. In light of Table 3, the modified height estimates of the recognized Objects 1 and 3 show promising results and approach the official specifications of these two vehicles. In addition, the standard deviation of the estimates is improved by up to around 50% compared with those in Table 2. However, the integrity of the foreground pixels, the heading poses, and the image appearances of vehicles still limit the effectiveness of the modified height estimation for vehicle classes.
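A minimal sketch of the center-segment refinement, assuming a segment label map has already been produced by a super-pixel method such as [54]; for simplicity, the height here is measured along the image vertical rather than the projected ground normal, and the function name is illustrative.

```python
import numpy as np

def refined_height_pixels(mask, segments, box):
    """Pixel height of a vehicle using only the super-pixel segment at the
    centre of the regressive bounding box, as described in the text.

    mask     : 2-D bool foreground mask.
    segments : 2-D int label map from any super-pixel method (e.g. SLIC).
    box      : (x_min, y_min, x_max, y_max) regressive bounding box.
    Assumes the vehicle roof falls in the centre segment.
    """
    x0, y0, x1, y1 = box
    xc, yc = (x0 + x1) // 2, (y0 + y1) // 2
    centre_label = segments[yc, xc]
    # Foreground pixels belonging to the centre super-pixel only.
    top_rows = np.nonzero(mask & (segments == centre_label))[0]
    # Foot point: lowest foreground pixel in the middle column of the box.
    foot_rows = np.nonzero(mask[:, xc])[0]
    if top_rows.size == 0 or foot_rows.size == 0:
        return None
    return int(foot_rows.max() - top_rows.min())
```

Restricting the top search to the centre segment is what suppresses the depth effect: foreground pixels on the hood or trunk, which belong to other segments, no longer inflate the height.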

Performance Evaluation of Various Networks
To gain insight into the effectiveness of different model networks, including Faster R-CNN, mask region-based convolutional neural network (Mask R-CNN) [55], and the improved You Only Look Once (YOLOv3) [56], this study carried out a comparison of object detection by adopting PTZ camera images in an indoor environment, a corridor, and a construction site, and further assessed the accuracy of the estimated geometric measurements using the KITTI benchmark. Figure 12 shows the image sequences along with the camera height setup used for estimation. For each dataset, the proposed method is integrated with each of the three model networks to produce the estimates of the detected objects. These images contain various illumination conditions, object types, and view angles. In this case, the keyframes were selected every three frames from 30 sequential images. It should be noted that the evaluation focuses on the accuracy of the geometric measurement rather than on the completeness or correctness of object recognition. Therefore, the labels of person, bicycle, car, and truck are selected, and a detected object is introduced into the geometric analysis only if the similarity score of a specific label is higher than 70%. The quantitative results reflect the adaptability of these models for surveillance and geometric measurement tasks. Figure 13 shows the object detection and recognition results of each model, while Table 4 reports the height estimates of the selected objects. On the one hand, in Figure 13, the detection results are similar among the three models in most scenes. However, when illumination conditions deteriorate or obstruction occurs, the similarity scores and completeness of each label decrease, especially for Mask R-CNN, whose deterioration in object recognition can be observed in the image sequences of the construction site. The completeness and correctness of each model also degenerate at nightfall.
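The label and similarity filtering and the keyframe selection described above could look like the following sketch; the dictionary layout of a detection and the function names are assumptions for illustration.

```python
# Only these labels take part in the geometric evaluation.
KEPT_LABELS = {"person", "bicycle", "car", "truck"}

def select_detections(detections, score_thresh=0.70):
    """Filter raw detector output before the geometric analysis.

    detections : iterable of dicts like
        {"label": "car", "score": 0.91, "box": (x0, y0, x1, y1)}
    Returns only detections of the kept labels whose similarity score
    reaches the 70% threshold used in the evaluation.
    """
    return [d for d in detections
            if d["label"] in KEPT_LABELS and d["score"] >= score_thresh]

def keyframes(frames, step=3):
    """Every third frame of a sequence, as in the evaluation setup."""
    return frames[::step]
```

Selecting every third frame from 30 sequential images yields 10 keyframes per sequence.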
On the other hand, the estimates in Table 4 agree with the visual results, showing similar heights among the three models, where "object ID" refers to the legends in the first column of Figure 13. The mean and standard deviation are calculated from the estimates of all keyframes. In view of Equations (14)-(17), the height estimate of the person in the indoor image set achieves an error of 3.5 cm by using a keyframe pair based on Faster R-CNN, where the true value of 177 cm lies within the estimated range of 176.8 ± 3.5 cm.
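Equations (14)-(17) are not reproduced in this excerpt. As a hedged illustration only, a textbook single-view relation under the same assumptions (known camera height, object standing on a flat ground plane, image rows growing downward) is the cross-ratio formula below; it is not the paper's derivation.

```python
def object_height(H_cam, y_foot, y_top, y_horizon):
    """Single-view object height under a flat-ground assumption.

    Standard cross-ratio relation for a camera at height H_cam (m) above
    the ground plane, given the image rows of the object's foot point,
    its top, and the horizon line:
        h = H_cam * (y_foot - y_top) / (y_foot - y_horizon)
    An object whose top reaches the horizon row is exactly camera height.
    """
    return H_cam * (y_foot - y_top) / (y_foot - y_horizon)
```

This relation makes the stated limitation explicit: the estimate scales linearly with the assumed camera height, so an error in H_cam propagates proportionally into the object height.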

[Table 4: object height estimates of Faster R-CNN, Mask R-CNN, and YOLOv3 for the indoor, corridor, and construction site scenes.]

Moreover, this study leverages the image sequences of the KITTI benchmark (last row in Figure 12) to compare the estimated object heights with those provided by KITTI's specifications. In this case, the labels of person and bicycle were selected for evaluation, where the chosen objects are noted in the first row of Figure 14. It should be noted that YOLOv3 reveals slightly poor instance segmentation and detection results in image Sequence 1 due to its weighting strategy. Table 5 shows the evaluation results. Among Faster R-CNN, Mask R-CNN, and YOLOv3, Faster R-CNN results in relatively low errors and low standard deviations.

Figure 14. Results of object detection applying various models to the KITTI dataset.

Regarding the evaluation of the object height, it is apparent that most of the estimates are higher than the values provided by KITTI. This could result from an inaccurate setting of the camera height, a mismatched assumption, or a discrepancy in measurement aspects. Nevertheless, all the differences are less than 10 cm, which is acceptable in some practical applications. Additionally, this evaluation shows that YOLOv3 demonstrated low performance in object detection in terms of completeness and correctness, whereas Faster R-CNN demonstrated the best performance in accuracy and precision. This evaluation also reflects a limitation of the proposed method: a precise camera height is indispensable for an accurate height estimate.
Nevertheless, in most cases, a surveillance camera can be set up under a priori known conditions, and the reliability of the estimates can be further reviewed via their theoretical accuracy computed using Equations (14)-(17).
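The per-object statistics reported in Tables 4 and 5 (mean, sample standard deviation, and error against a reference height) can be computed as in this minimal sketch; the function name and dictionary layout are illustrative.

```python
import statistics

def summarize_heights(estimates, reference):
    """Mean, sample standard deviation, and signed error of per-keyframe
    height estimates against a reference value (all in metres)."""
    mean = statistics.fmean(estimates)
    std = statistics.stdev(estimates)       # sample (n-1) standard deviation
    return {"mean": mean, "std": std, "error": mean - reference}
```

For example, keyframe estimates of 1.76 m, 1.78 m, and 1.80 m against a 1.77 m reference give a mean of 1.78 m, a standard deviation of 2 cm, and an error of 1 cm.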

Conclusions
This paper contributes a scheme to acquire 3D geometric attributes of moving objects by using Faster R-CNN with a stationary and rotating PTZ camera configuration, which is rarely discussed in the literature. The effectiveness of the proposed method in yielding moving distances, moving velocity, object heights, and object recognition from a monocular camera has been validated through synthetic and real datasets. Regarding the specific camera configuration in this study, the 3D positions are determined by combining single- and multiple-view solutions to render accurate estimates. Inevitably, interference such as shadow effects and occlusions deteriorates the reliability and completeness of the motion segmentation. However, by leveraging the deep learning recognition technique, the regressive bounding boxes resulting from Faster R-CNN facilitate the refinement of the object boundaries, which directly improves the quality of the geometric estimation. Moreover, a super-pixel segmentation process is specifically applied to the vehicle class to further improve its object height estimation by reducing the depth effect. The proposed approach calculates the height and speed estimates of the recognized moving objects, including pedestrians and vehicles, and shows promising results and application potential through existing CCTVs at a very low cost. A continued investigation into enhancing the computational efficiency and exploring object moving behavior should be addressed in future work.