Moving Object Detection in Traffic Surveillance Video: New MOD-AT Method Based on Adaptive Threshold

Abstract: Previous research on moving object detection in traffic surveillance video has mostly adopted a single threshold to eliminate the noise caused by external environmental interference, resulting in low accuracy and low efficiency of moving object detection. Therefore, we propose a moving object detection method that considers the difference of image spatial threshold, i.e., a moving object detection method using an adaptive threshold (MOD-AT for short). In particular, based on the homography method, we first establish the mapping relationship between the geometric-imaging characteristics of moving objects in the image space and the minimum circumscribed rectangle (BLOB) of moving objects in the geographic space to calculate the projected size of moving objects in the image space. With this projected size, we can set an adaptive threshold for each moving object to precisely remove noise interference during moving object detection. Further, we propose a moving object detection algorithm called GMM_BLOB (GMM denotes Gaussian mixture model) to achieve high-precision detection and noise removal of moving objects. The case-study results show the following: (1) compared with the existing object detection algorithms, the median error (MD) of the MOD-AT algorithm is reduced by 1.2–11.05%, and the mean error (MN) is reduced by 1.5–15.5%, indicating that the accuracy of the MOD-AT algorithm is higher in single-frame detection; (2) in terms of overall accuracy, the performance and time efficiency of the MOD-AT algorithm are improved by 7.9–24.3%, reflecting the higher efficiency of the MOD-AT algorithm; (3) the average accuracy (MP) of the MOD-AT algorithm is improved by 17.13–44.4%, the average recall (MR) by 7.98–24.38%, and the average F1-score (MF) by 10.13–33.97%. In general, the MOD-AT algorithm is more accurate, efficient, and robust.


Introduction
With the rapid growth of information technology, surveillance cameras have been widely used because of their advantages of real-time performance, low cost, and efficiency. They have become an indispensable technical means of urban safety management. Massive, real-time video data contain abundant spatiotemporal information and provide essential support for real-time supervision, case investigation, and natural resource monitoring [1,2]. However, video data have the disadvantages of low value, redundancy, and noise. Although significant progress has been made in low-level video understanding, high-precision and high-level video understanding technology is still in the research stage [3]. Moving objects, such as vehicles and pedestrians in traffic videos, are of great interest to traffic supervision departments, and the moving object is the key to video information mining. Setting reasonable thresholds to detect moving objects with high precision from surveillance video is one of the current hot issues in video geographic-information-system (GIS) research [4].

1. Moving object detection methods based on traditional single threshold
These methods mainly include the frame difference [16–18], optical flow [19,20], and background difference methods [21–23], among others. For example, Zuo et al. [24] improved the accuracy of moving object detection based on the background frame difference method. Luo et al. [25] combined the background difference and frame difference methods to detect moving objects and remove external-environment interference. Akhter et al. [26] realized contour detection and feature-point extraction of moving objects through the optical flow method. Li et al. [27] used the background difference algorithm to obtain the foreground and background images, and then extracted the moving object in the surveillance video. The above algorithms use a unified threshold value across the image space to filter the interference of the external environment. Such a threshold setting is unreasonable and reduces the accuracy of moving object detection.

2. Moving object detection methods based on pixels or regions
Compared with the region-based moving object detection method, the pixel-based moving object detection method is fast and straightforward, which makes it suitable for the rapid monitoring of video objects. Its typical methods include the ViBe algorithm [28–30], non-parametric model [31], and Gaussian mixture model (GMM) [32], among others. For example, Liu et al. [33] proposed the three-frame difference algorithm of the adaptive GMM to suppress the external environment's interference effectively. Zuo et al. [34] improved the accuracy of the moving object detection algorithm based on the improved GMM (IGMM). The aforementioned algorithms only set a fixed threshold range according to the geometric-imaging characteristics of moving objects in image space and do not consider the suitability of the detection threshold for each moving object. According to the perspective-imaging characteristics of the camera, when the object is close to the camera, the area of the object in the image is larger; conversely, when the object is far away, the area of the object in the image is smaller. The actual size of the object, however, does not change in geographic space [35].
ISPRS Int. J. Geo-Inf. 2021, 10, 742

3. Moving object detection method based on the segmented threshold
This method attempts to calculate thresholds in different regions of the image space. For example, Chan et al. [36] proposed linear segmentation of video frames to obtain weight maps at different locations and calculated the threshold range based on the position of the object in the video frame. Chang et al. [37] proposed a method of spatial-imaging area normalization, in which the area weight of the object in the image space is obtained to filter part of the moving object detection interference. Lin et al. [38] established a single mapping relationship between video image space and geographic space and realized interference filtering from the external environment based on the non-linear-perspective correction model algorithm (NPCM). However, the above algorithms only consider the linear or non-linear characteristics of the object on the image and do not consider the projection distortion of the object size caused by the camera-imaging mechanism. They also ignore the differences in the geometric-imaging characteristics of moving objects at different positions of the video frame. As a result, the threshold setting is not specific enough, which affects the accuracy of moving object detection.
In this paper, we propose a moving object detection method that considers the difference of image spatial threshold (named MOD-AT). First, based on the perspective characteristics of the camera and the smallest bounding rectangle of the object (dynamic block, or BLOB for short), an algorithm is designed to calculate the denoising threshold range at each pixel position. On this basis, a moving object detection algorithm (named GMM_BLOB) is designed, and the correspondence between the BLOB centroid and the pixel-position threshold range is constructed. Finally, high-precision detection of moving objects based on the adaptive threshold is realized. The purpose of this study is to provide a new perspective for object detection, improve the accuracy and efficiency of object detection, and provide technical support for the integration of video and GIS.

General Idea and Technical Process
A new moving object detection method, MOD-AT (Figure 1), is proposed in this paper. The method comprises two parts: adaptive threshold calculation based on camera-imaging characteristics, and moving object detection based on the adaptive threshold.


1. Adaptive threshold calculation based on camera-imaging characteristics
Previous research mostly adopted a single threshold to eliminate the noise caused by external-environment interference, resulting in low accuracy and low efficiency of moving object detection. We consider the orientation of moving objects in geographic space to obtain the adaptive threshold for a moving object. First, we calculate the homography matrix by selecting homonymous (corresponding) points between the surveillance video and the online remote sensing image of the actual scene. Then, based on the homography, we establish the mapping relationship between the geometric BLOB in the image space and the smallest bounding rectangle in the geographic space. Finally, we calculate the actual size of the object based on the mapping relationship, and then obtain both the imaging area of the BLOB in image space and the range of the adaptive threshold according to the projection size of the object.

2. Moving object detection based on adaptive threshold
In moving object detection, the existing object detection algorithms do not fully consider the perspective-imaging characteristics of the moving objects and the interference of external environmental noise. We abstract the irregular moving objects in image space as a set of minimum circumscribed rectangles (BLOBs). Using the above adaptive threshold calculation method, we design a moving object detection algorithm, GMM_BLOB, consisting of background reconstruction, background-difference calculation, and moving-block (BLOB) set acquisition. The adaptive threshold then allows high-accuracy detection and noise removal of moving objects.

Mapping Relation Calculation Based on Homography Method
For the convenience of threshold calculation, it is necessary to establish the mapping relationship between the actual orientation of the object in geographic space and the geometric-imaging characteristics in image space. Compared with the traditional mapping method based on the camera model, the homography method is simpler and does not require the camera's internal and external parameters [39]. Therefore, we use the homography method to construct the mapping relationship between image space and geographic space.
First, four or more control points $q\{q_1(x_1, y_1), q_2(x_2, y_2), \ldots, q_n(x_n, y_n)\}$ in the video are selected, and their pixel coordinates are obtained. Then, in the high-precision remote sensing image, the corresponding points $Q\{Q_1(X_1, Y_1), Q_2(X_2, Y_2), \ldots, Q_n(X_n, Y_n)\}$ are selected to obtain the geographic coordinates. Finally, based on these control points, the mapping matrix, $H$, from the camera video to geographic space is obtained according to Equation (1). Its inverse, $H^{-1}$, is the mapping matrix from geographic space to image space, as in Equation (2):

$$[X, Y, 1]^{T} = H\,[x, y, 1]^{T} \qquad (1)$$

$$[x, y, 1]^{T} = H^{-1}\,[X, Y, 1]^{T} \qquad (2)$$

where $(x, y)$ is the image coordinate of a point in the image space, $(X, Y)$ is the geographic coordinate of the corresponding point in geographic space, and $H^{-1}$ is the inverse of the 3 × 3 homography matrix, $H$. Both relations are written in homogeneous coordinates and hold up to a scale factor.
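The control-point procedure above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exactly four control point pairs and fixes the last entry of $H$ to 1, so that the eight remaining entries follow from one 8 × 8 linear system (with more than four pairs, a least-squares solution would be needed instead). The function names and the sample coordinates are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_homography(pixels, geos):
    """Estimate H (image -> geographic) from exactly four point pairs.

    Assumption: the entry h33 is fixed to 1.
    """
    if len(pixels) != 4 or len(geos) != 4:
        raise ValueError("this sketch expects exactly four control point pairs")
    a, b = [], []
    for (x, y), (X, Y) in zip(pixels, geos):
        a.append([x, y, 1, 0, 0, 0, -X * x, -X * y])
        b.append(X)
        a.append([0, 0, 0, x, y, 1, -Y * x, -Y * y])
        b.append(Y)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def invert_3x3(m):
    """Invert a 3 x 3 matrix through its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def apply_homography(h, x, y):
    """Map a point with a 3 x 3 homography and divide by the scale term."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical control points: pixel corners and their geographic positions.
pixel_pts = [(0, 0), (100, 0), (100, 100), (0, 100)]
geo_pts = [(10.0, 20.0), (60.0, 20.0), (54.5, 63.6), (9.1, 63.6)]

H = estimate_homography(pixel_pts, geo_pts)
H_inv = invert_3x3(H)
geo = apply_homography(H, 50, 50)        # image -> geographic, as in Equation (1)
pixel = apply_homography(H_inv, *geo)    # geographic -> image, as in Equation (2)
print(geo, pixel)
```

Fixing the last entry to 1 keeps the sketch dependency-free; it fails only in the uncommon case where that entry of the true matrix is zero.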

Calculation of Object Projected Size Based on Mapping Relationship
To calculate the projection size of the object in geographic space, it is necessary to know the geographic location of the camera, $C(X_{cam}, Y_{cam}, H_{cam})$, and the homography matrix, $H$. We can calculate the homography matrix, $H$, as in Section 3.2.1. However, for moving objects of different heights that do not lie in a plane, it is further necessary to obtain the camera's internal parameter matrix, $K$, rotation matrix, $R$, and translation matrix, $T$, as well as high-precision digital terrain model (DTM) and digital surface model (DSM) data. Because our existing data do not include these camera parameters and there are no high-precision DTM and DSM data of the camera area, we choose video data of flat terrain and focus on moving objects on the flat surface. We assume that the video resolution is $i \times j$ and that the image coordinates of the pixel in row $u$ ($0 \le u \le i - 1$) and column $v$ ($0 \le v \le j - 1$) are $C_{uv}(x, y, 0)$. According to the homography matrix, $H$, we transform the image-coordinate point, $C_{uv}(x, y, 0)$, into the geographic coordinate, $R_{uv}(X, Y, 0)$.
Because the distance and orientation of the object relative to the camera vary as the object moves, the geometric-imaging characteristics of the object at different pixel points in the image space are constantly changing. Therefore, it is necessary to obtain the projection length and width of the object's outer contour in geographic space based on the object-mapping relationship and then set different threshold ranges for each pixel position based on the camera-imaging characteristics. As shown in Figure 2, $C(X_{cam}, Y_{cam}, H_{cam})$ is the center of the camera in geographic space; the object dimensions are height, $H_{uv}$, width, $W^{l}_{uv}$, and length, $T_{uv}$. The upper midpoint of the BLOB is the top point, and its geographic coordinates are $B_{uv}(X, Y, H_{uv})$. The lower midpoint of the BLOB is the touch point, and its geographic coordinates are $P_{uv}(X, Y, 0)$. Rays LT and LJ from the camera position, $C(X_{cam}, Y_{cam}, H_{cam})$, point to $B_{uv}(X, Y, H_{uv})$ and $P_{uv}(X, Y, 0)$, respectively. The angles between these rays and the ground are $\alpha$ and $\beta$. The object-ground projection area is a trapezoid; its height and its upper-side and lower-side lengths are calculated, which correspond to the object-ground projection length, $h_{uv}$, and the ground projection width, $w^{u}_{uv}$ (one for the near side and one for the far side).

1. Calculation of the object's projected length, $h_{uv}$, on the ground
According to the principle that the corresponding sides of similar triangles are proportional, we calculate the coordinates of the intersection point between the ray, LT, and the ground plane, $z = 0$, according to Equations (3) and (4). Then, the projection length, $h_{uv}$, of the object on the ground is calculated, as in Equation (5).

2. Calculation of the object's projected width, $w^{u}_{uv}$, on the ground
During the object movement, the orientation of the moving object relative to the camera will change. In the horizontal plane, however, the object can be approximated as a cylinder regardless of its direction relative to the camera. In geographic space, if the height and width of the cylinder do not change at a certain position, the projected width, $w^{u}_{uv}$, in geographic space will not change. As shown in Figure 3, the geographic coordinates of the object (Obj) are assumed to be $R_{uv}(X, Y, 0)$, and $W^{l}_{uv}$ represents the width of the object. Two rays are drawn from the camera position, $C(X_{cam}, Y_{cam}, H_{cam})$, to points $J_L(X - W^{l}_{uv}/2,\ Y,\ H_{uv})$ and $J_R(X + W^{l}_{uv}/2,\ Y,\ H_{uv})$, respectively. They intersect the ground plane at $Geo^{3}_{uv}$ and $Geo^{4}_{uv}$, respectively. The equations of the lines $CJ_L$ and $CJ_R$ are shown in Equations (6) and (7), and their intersection points with the ground plane are $Geo^{3}_{uv}(X_3, Y_3, 0)$ and $Geo^{4}_{uv}(X_4, Y_4, 0)$, respectively, i.e., Equations (8)–(11), from which the values of $Geo^{3}_{uv}$, $Geo^{4}_{uv}$, and $w^{u}_{uv}$ can be derived.
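The two calculations above both reduce to intersecting a ray from the camera with the ground plane $z = 0$. The following sketch shows that step using the similar-triangles proportion $H_{cam} / (H_{cam} - z)$. It is an illustration under stated assumptions, not a transcription of Equations (3)–(11), whose exact form is not reproduced here: the function names are hypothetical, the width offset is taken along the X axis as in the points $J_L$ and $J_R$, and the camera position is invented.

```python
import math


def ground_intersection(cam, point):
    """Intersect the ray from the camera through `point` with the plane z = 0.

    cam   = (X_cam, Y_cam, H_cam), camera center in geographic space
    point = (X, Y, z), a point below the camera and on or above the ground
    """
    xc, yc, hc = cam
    x, y, z = point
    if not 0 <= z < hc:
        raise ValueError("point must lie on or above the ground and below the camera")
    # Similar triangles: the ray is stretched by H_cam / (H_cam - z) to reach z = 0.
    t = hc / (hc - z)
    return (xc + t * (x - xc), yc + t * (y - yc), 0.0)


def projected_length(cam, touch_xy, height):
    """Ground distance between the touch point and the projection of the top point."""
    x, y = touch_xy
    gx, gy, _ = ground_intersection(cam, (x, y, height))
    return math.hypot(gx - x, gy - y)


def projected_width(cam, touch_xy, width, height):
    """Ground distance between the projections of the two upper side points."""
    x, y = touch_xy
    left = ground_intersection(cam, (x - width / 2, y, height))
    right = ground_intersection(cam, (x + width / 2, y, height))
    return math.hypot(right[0] - left[0], right[1] - left[1])


# Hypothetical camera 8 m above the origin; object 12 m away, 1.75 m tall, 0.8 m wide.
camera = (0.0, 0.0, 8.0)
print(projected_length(camera, (0.0, 12.0), 1.75))
print(projected_width(camera, (0.0, 12.0), 0.8, 1.75))
```

Both quantities grow as the object moves away from the camera or as the camera is lowered, which is the position dependence the adaptive threshold is meant to capture.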


Adaptive Threshold Calculation Based on Object-Ground Projection Size
According to the algorithm described in Section 3.2.2, based on the minimum bounding rectangle of the object (Obj) in geographic space, the width, $w^{u}_{uv}$, and height, $h_{uv}$, of Obj in geographic space can be calculated. In the image space, the size of a moving object differs from position to position. Therefore, the object size and camera posture information must be considered to calculate the threshold range at different pixels. When the BLOB appears at different pixel positions in the video frame, the corresponding threshold range can be used to filter the interference of the external environment. This yields a high-precision moving object detection method that considers threshold differentiation in the image space.
The adaptive threshold calculation includes the calculation of the quadrilateral coordinates of the object projected to the ground and the calculation of the area range of the object in the image plane.

1. Calculation of the quadrilateral coordinates of the object projected to the ground
In geographic space, the coordinates of the object projected to the ground can be calculated from the camera center and the object location and size. As shown in Figure 4, the coordinates of the camera center point are known as $C(X_{cam}, Y_{cam}, H_{cam})$, the object coordinates are $R_{uv}(X, Y, 0)$, and the dimensions are $w^{u}_{uv}$ and $h_{uv}$. The four points projected by the object in geographic space are $Geo^{1}_{uv}(X_1, Y_1, 0)$, $Geo^{2}_{uv}(X_2, Y_2, 0)$, $Geo^{3}_{uv}(X_3, Y_3, 0)$, and $Geo^{4}_{uv}(X_4, Y_4, 0)$, respectively. According to the geometric relationship between the minimum bounding rectangle $(w^{u}_{uv}, h_{uv})$ of the object and the center point, $C(X_{cam}, Y_{cam}, H_{cam})$, of the camera, the geographic-coordinate values of the four points can be calculated, i.e., Equations (12)–(17).
In Equations (14)–(16), each coordinate pair is obtained by offsetting a reference coordinate by half of a projected dimension multiplied by the sine and the cosine of an angle.

2. Calculation of the area range of the object in the image plane
According to the algorithm presented in Section 3.2.1, the inverse matrix, $H^{-1}$, is obtained, and the image coordinates $Pic^{1}_{uv}(x_1, y_1, 0)$, $Pic^{2}_{uv}(x_2, y_2, 0)$, $Pic^{3}_{uv}(x_3, y_3, 0)$, and $Pic^{4}_{uv}(x_4, y_4, 0)$ on the video frame corresponding to $Geo^{1}_{uv}(X_1, Y_1, 0)$, $Geo^{2}_{uv}(X_2, Y_2, 0)$, $Geo^{3}_{uv}(X_3, Y_3, 0)$, and $Geo^{4}_{uv}(X_4, Y_4, 0)$, respectively, are solved. From the coordinates of the four points on the video frame, the minimum area, $S_{uv}$, of the external rectangle of the object on the image, that is, the minimum value, $Min_{uv}$, of the threshold, can be calculated, as shown in Equation (18). According to Section 3.2, the object size remains unchanged at the same position, but the size changes at different positions. In the experimental setting, the empirical threshold, $Max_{uv}$, is obtained after many experiments, as shown in Equation (19). The image coordinates of the object are $mPic^{1}_{uv}(mx_1, my_1, 0)$, $mPic^{2}_{uv}(mx_2, my_2, 0)$, $mPic^{3}_{uv}(mx_3, my_3, 0)$, and $mPic^{4}_{uv}(mx_4, my_4, 0)$. As shown in Figure 5, when traversing each point within the field of view, the coordinates of the midpoint, the coordinates of the pixel, and the threshold range compose a set that can be expressed as Formula (20). This set contains the image coordinates, $C_{uv}(x, y, 0)$, the corresponding geographic coordinates, $R_{uv}(X, Y, 0)$, and the maximum, $Max_{uv}$, and minimum, $Min_{uv}$, threshold values of the object Obj at that pixel.
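As a rough illustration of the per-pixel record described above, the sketch below derives a threshold range from four projected image points. Several choices are assumptions rather than the paper's definitions: the external rectangle is taken as axis-aligned, the maximum is modeled as a fixed scale factor times the minimum (the experiments mention a factor of 3/2, but the exact form of Equation (19) is not reproduced here), and the names `ThresholdEntry`, `bounding_rect_area`, and `make_entry` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdEntry:
    """One element of the per-pixel set: coordinates plus the area range."""
    image_xy: tuple   # pixel position C_uv (x, y)
    geo_xy: tuple     # geographic position R_uv (X, Y)
    min_area: float   # Min_uv, in square pixels
    max_area: float   # Max_uv, in square pixels


def bounding_rect_area(points):
    """Area of the axis-aligned rectangle enclosing the given image points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def make_entry(image_xy, geo_xy, corner_pixels, scale=1.5):
    """Build the threshold entry for one pixel.

    Assumption: Max_uv = scale * Min_uv, with scale = 3/2 by default.
    """
    min_area = bounding_rect_area(corner_pixels)
    return ThresholdEntry(image_xy, geo_xy, min_area, scale * min_area)


# Hypothetical image coordinates of the four projected points at one pixel.
corners = [(100, 200), (120, 198), (124, 260), (98, 262)]
entry = make_entry((110, 230), (35.2, 71.8), corners)
print(entry.min_area, entry.max_area)
```

Repeating `make_entry` for every pixel in the field of view would give a lookup table keyed by pixel position, which is one plausible way to store the set expressed by Formula (20).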


Adaptive Threshold Calculation and Application
The application of the adaptive threshold in moving object detection mainly includes three processes. First, each pixel in the video frame is traversed to obtain its corresponding BLOB threshold range. Then, based on the center of the moving object BLOB, the relationship between its area and the pixel threshold range is judged. Finally, as $Obj_i$ moves, different threshold ranges are automatically used to filter interference from the external environment. As shown in Figure 6a, moving objects A, B, C, etc. are distributed in different locations in geographic space. The quadrilaterals in Figure 6b are the regions of these objects on the image plane. When the centroid of the moving object, A, is located at pixel $R_5$, the area of object A is compared with the threshold range stored at $R_5$. If the area of object A is less than the minimum threshold or greater than the maximum threshold, it is regarded as noise.

Moving Object Detection Based on GMM_BLOB Algorithm
Previous moving object detection algorithms mainly set the dynamic threshold based on the depth map or the normalized pixel value without considering how the threshold should adapt as the object moves. The key to object detection is to build a robust background image. The current background modeling methods are the GMM [40,41] and the ViBe algorithm [42–46]. Owing to parameter-setting and background-template updating problems, the ViBe algorithm leads to missed detection, residual shadow, and ghost phenomena in object detection. The traditional GMM method, in turn, is slow and significantly affected by illumination. Adding a balance coefficient and merging redundant Gaussian distributions in the traditional GMM algorithm can improve the real-time performance and accuracy of the algorithm [47–50]. However, the traditional GMM algorithm ignores the influence of the external environment, which leads to an increase in the number of false moving object detections. Meanwhile, the traditional object detection and tracking algorithms are slow, and it is difficult for them to meet the requirements of real-time surveillance video processing. To address these problems, an improved object detection algorithm (GMM_BLOB) is designed. Based on the GMM algorithm, this algorithm abstracts irregular moving objects in image space as dynamic blocks (BLOBs) and adds BLOB-threshold-filter conditions to improve the accuracy of moving object detection.
As shown in Figure 7, the foreground image, F(x, y, i), and background image, B(x, y, i), of the video are extracted based on the background-mixture method [48]. On this basis, the background-subtraction method is used to extract the difference image, R(x, y, i), and the candidate BLOB set is further obtained. Because the BLOB set contains both real moving objects and noise, it is necessary to further filter the noise in the BLOB set to improve the accuracy of object detection. According to the method detailed in Section 3.3.1, the imaging area and threshold range of each BLOB change as it moves in geographic space. The area of each BLOB in the candidate object set is therefore compared with the threshold range of its centroid pixel. When the BLOB area is less than the minimum value of the threshold or greater than the maximum value, it is regarded as noise. Finally, the accurate moving object set, $BLOB_i$ $(i = 1, 2, \ldots, n)$, is obtained.
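The candidate-BLOB stage of this pipeline can be illustrated with a deliberately simplified sketch. It does not implement a Gaussian mixture model: a single static background frame stands in for the GMM background, the difference image is a thresholded absolute difference, and BLOBs are found by 4-connected component labeling. The names `difference_mask` and `extract_blobs`, the difference threshold of 30, and the toy frames are all hypothetical.

```python
from collections import deque


def difference_mask(frame, background, diff_threshold=30):
    """Mark pixels whose gray value differs from the background by more than the threshold."""
    return [[abs(f - b) > diff_threshold for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def extract_blobs(mask):
    """Label 4-connected foreground regions and return their bounding rectangles (x, y, w, h)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            rmin = rmax = r
            cmin = cmax = c
            while queue:
                cr, cc = queue.popleft()
                rmin, rmax = min(rmin, cr), max(rmax, cr)
                cmin, cmax = min(cmin, cc), max(cmax, cc)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            blobs.append((cmin, rmin, cmax - cmin + 1, rmax - rmin + 1))
    return blobs


# Toy 6 x 8 gray frames: an empty background, then a 3 x 2 object and one noisy pixel.
background = [[0] * 8 for _ in range(6)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2, 3):
        frame[r][c] = 200
frame[4][6] = 200

candidates = extract_blobs(difference_mask(frame, background))
print(candidates)
```

The rectangles returned here play the role of the candidate BLOB set; the single-pixel region is the kind of candidate that a position-dependent area range is intended to reject.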
Object detection based on adaptive threshold


Experimental Design
The software environment used in the experiments in this work is Visual Studio (VS), C#, Emgu CV, and ArcEngine, and the hardware environment is a GTX-1660Ti GPU and an i7-10750H CPU with 16 GB of memory. The experimental data are a playground video (designated video #1) and a traffic scene video (video #2), as shown in Table 1. Both videos are outdoor scenes, and the image resolution is 1280 × 720. The differences between videos #1 and #2 are the following: (1) the moving objects in video #1 are only people, whereas video #2 contains people and cars; (2) the camera for video #1 is mounted higher than the camera for video #2, and its field of view is wider; and (3) the frame rate for video #1 is 25 frames per second (fps) and that for video #2 is 30 fps.
Precision, recall, and F1 score were used as the evaluation indexes to verify the accuracy of MOD-AT, and their calculation formulas are Equations (21), (22), and (23), respectively. The test data for the four moving object detection algorithms are recorded every 30 frames. The results of moving object detection were compared at two scales, namely single-frame accuracy and overall accuracy. In addition, the mean precision (MP), mean recall (MR), mean F1-score (MF), variance of precision (VP), variance of recall (VR), and variance of F1 score (VF) of the three indexes, together with the mean error number (MN), were used to evaluate the robustness of the algorithm. The calculation formulas are the following:

$$Precision = \frac{TP}{TP + FP} \qquad (21)$$

$$Recall = \frac{TP}{TP + FN} \qquad (22)$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (23)$$

where TP is the number of foreground objects correctly identified, FP is the number of video background regions recognized as foreground objects, FN is the number of foreground objects recognized as background, TN is the actual number of objects in the experimental scene, and N is the total number of detection frames.
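The evaluation indexes can be computed per sampled frame and then aggregated, as in the short sketch below. The precision, recall, and F1 definitions are the standard ones; the aggregation is an assumption, since the formulas for the mean and variance indexes are not reproduced in the text: means are plain averages over the sampled frames and variances are population variances. The function names and the sample counts are hypothetical.

```python
from statistics import mean, pvariance


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from per-frame detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def summarize(per_frame_counts):
    """Aggregate (TP, FP, FN) triples recorded at each sampled frame.

    Assumption: population variance over the N sampled frames.
    """
    scores = [precision_recall_f1(*counts) for counts in per_frame_counts]
    p, r, f = zip(*scores)
    return {
        "MP": mean(p), "MR": mean(r), "MF": mean(f),
        "VP": pvariance(p), "VR": pvariance(r), "VF": pvariance(f),
    }


# Hypothetical counts for two sampled frames: (TP, FP, FN).
summary = summarize([(8, 2, 0), (6, 0, 2)])
for name, value in summary.items():
    print(name, round(value, 4))
```

Lower variances indicate that an algorithm behaves consistently across sampled frames, which is the sense in which these indexes measure robustness.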

Adaptive Threshold Calculation
Using the algorithm presented in Section 3.2, the true threshold of a moving object at any position can be calculated. A moving object with a height of 1.75 m and a width of 0.8 m in the experimental scene was taken as an example, and the threshold ranges of moving objects in videos #1 and #2 were calculated separately. To obtain the maximum threshold value in Section 3.2.3, we analyzed the size-change range of the object at different positions in videos #1 and #2 (Figure 8a). Setting the threshold range too large or too small will affect the accuracy of moving object detection. Therefore, we chose 3/2 as the scale factor for determining the maximum threshold. The algorithm detailed in Section 3.2 was used to obtain the maximum and minimum threshold values of the object (width, $w_{uv}$, and height, $h_{uv}$) at different pixel points. As shown in Table 2, (1) the projected height of the object changes constantly as the object moves in geographic space, and it differs from the actual height of the object; and (2) the threshold range of the object changes constantly with its position. Therefore, an adaptive threshold range should be used in the process of object detection.
The minimum bounding contour of the object corresponding to each pixel of videos #1 and #2 is mapped to the corresponding position in the geographic space and the image space, respectively. As shown in Figure 9a,b, the projected height and width of the object in the geographic space are used to realize the mapping from the geographic space to the image space. As shown in Figure 9c,d, the threshold range of the object at each pixel position in the image space enables moving object detection based on the adaptive threshold.
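The idea of a position-dependent threshold range can be sketched as follows. This is a simplified Python illustration, not the GMM_BLOB implementation: it assumes that the 3/2 scale factor multiplies the projected size to give the upper bound, uses a hypothetical lower-bound factor `MIN_SCALE`, and replaces the homography-based projection of Section 3.2 with a toy function `toy_expected_size` that only mimics objects appearing larger toward the bottom of the image.

```python
from dataclasses import dataclass

MAX_SCALE = 1.5  # the 3/2 scale factor mentioned in the text
MIN_SCALE = 0.5  # hypothetical lower-bound factor, not from the paper


@dataclass
class Blob:
    u: int         # image column of the blob's reference pixel
    v: int         # image row of the blob's reference pixel
    width: float   # blob width in pixels
    height: float  # blob height in pixels


def toy_expected_size(u, v):
    """Stand-in for the projected size (w_uv, h_uv) of a 0.8 m x 1.75 m object.

    A real system would derive this from the homography; here the height
    simply grows linearly with the image row (an assumption for illustration).
    """
    h = 20.0 + 0.1 * v
    w = h * 0.8 / 1.75
    return w, h


def filter_blobs(blobs, expected_size):
    """Keep blobs whose size lies inside the threshold range at their position."""
    kept = []
    for b in blobs:
        w_exp, h_exp = expected_size(b.u, b.v)
        width_ok = MIN_SCALE * w_exp <= b.width <= MAX_SCALE * w_exp
        height_ok = MIN_SCALE * h_exp <= b.height <= MAX_SCALE * h_exp
        if width_ok and height_ok:
            kept.append(b)
    return kept


candidates = [
    Blob(u=0, v=100, width=14, height=30),  # plausible size far from the camera
    Blob(u=0, v=100, width=3, height=4),    # too small at this position: noise
    Blob(u=0, v=600, width=14, height=30),  # same size, but too small near the camera
]
kept = filter_blobs(candidates, toy_expected_size)  # keeps only the first blob
```

The first and third candidates have the same pixel size, yet only the first is kept, which is the behavior a single global threshold cannot reproduce.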

Single-Frame Accuracy Verification
The object detection results for videos #1 and #2 were recorded every 30 frames, and 2400 records were obtained. At fixed intervals, 40 samples were randomly selected to verify the single-frame accuracy. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. The results indicate that the TP value of the MOD-AT algorithm is closer to TN and that its FP value is smaller than those of the existing object detection algorithms. This shows that, compared with the existing algorithms, the MOD-AT algorithm eliminates most of the noise generated by the external environment.
The number of correctly detected objects (TP) is compared with the number of real objects (TN). Figure 10a,c compares the TP and TN values of the four detection algorithms at different time points. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box plots in Figure 10b,d show the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, relative to the number of TN detections. These indicators show the following:
1. The median value of object detection error (MD) is reduced by 1.2–6.8% for video #1 and by 1.65–11.05% for video #2.
2. The MN of the MOD-AT object detection results for videos #1 and #2 decreases by 1.5–10.5% and 1–15.5%, respectively, on the whole; by 3.5–5% and 3–6.5%, respectively, compared with the IGMM algorithm; by 7–10.5% and 8–15.5%, respectively, compared with the GVIBE algorithm; and by 1.5–2% and 1–1.5%, respectively, compared with the NPCM algorithm.

As shown for the randomly selected time points in Table 4, the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, and NPCM algorithms, indicating that the MOD-AT algorithm has high precision.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. TN=15 TN=17 #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. TN=15 TN=17 #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. TN=15 TN=17 #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.

TN=15 TN=17
Background #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. TN=15 TN=17 #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.  TP=2,FP=5  TP=10,FP=26   TP=16,FP=11   IGMM   TP=2,FP=9  TP=2,FP=6   TP=11,FP=29   TP=16 FP=9   TN=2  TN=2   TN=15  TN=17 verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.  TP=2,FP=5  TP=10,FP=26   TP=16,FP=11   IGMM   TP=2,FP=9  TP=2,FP=6   TP=11,FP=29   TP=16 FP=9   TN=2  TN=2   TN=15  TN=17 verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.  TP=2,FP=5  TP=10,FP=26   TP=16,FP=11   IGMM   TP=2,FP=9  TP=2,FP=6   TP=11,FP=29   TP=16 FP=9   TN=2  TN=2   TN=15 TN=17 GVIBE verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.  TP=2,FP=5  TP=10,FP=26   TP=16,FP=11   IGMM   TP=2,FP=9  TP=2,FP=6   TP=11,FP=29   TP=16 FP=9   TN=2  TN=2   TN=15  TN=17 verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision.

TN=15 TN=17
IGMM verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The MN of MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm. As shown in the randomly selected time points in Table  4, it can be seen that the precision, recall, and F1 score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, NPCM, and MOD-AT, indicating that the MOD-AT algorithm has high precision. verify the accuracy of a single frame. Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. This indicates that the TP value of the MOD-AT algorithm is closer to TN, and that the FP value is smaller than the existing object detection algorithm. It is proven that the MOD-AT algorithm eliminates most of the noise generated by the external environment, compared with the current algorithm. The number of correctly detected objects (TP) is compared with real objects (TN). As shown in Figure 10a,c, the TP and TN values of the four detection algorithms at different time points are compared. The result shows that the moving object detection result, TP, of the MOD-AT algorithm is closer to TN. The box diagram in Figure 10b,d, shows the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show that 1. The mean value of object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
To verify single-frame accuracy, Table 3 shows the object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. The TP value of the MOD-AT algorithm is closer to TN, and its FP value is smaller than that of the existing object detection algorithms, which indicates that the MOD-AT algorithm eliminates most of the noise generated by the external environment.
[Table 3. Object detection results of video #1 at 9:20 and 9:22 and of video #2 at 9:40 and 9:42. Values recoverable from the extracted table, in their original order: (algorithm label lost) TP=2, FP=5; TP=10, FP=26; TP=16, FP=11. IGMM: TP=2, FP=9; TP=2, FP=6; TP=11, FP=29; TP=16, FP=9. Real objects: TN=2, TN=2, TN=15, TN=17.]
The number of correctly detected objects (TP) is then compared with the number of real objects (TN). Figure 10a,c compares the TP and TN values of the four detection algorithms at different time points; the TP of the MOD-AT algorithm is closer to TN. The box plots in Figure 10b,d show the average variability of the four algorithms, GVIBE [39], IGMM [34], NPCM [38], and MOD-AT, compared with the number of TN detections. These indicators show the following.
1. The median object detection error (MD) is reduced by 1.2-6.8% for video #1 and by 1.65-11.05% for video #2.
2. The mean error (MN) of the MOD-AT object detection results for videos #1 and #2 decreases by 1.5-10.5% and 1-15.5%, respectively, on the whole; by 3.5-5% and 3-6.5%, respectively, compared with the IGMM algorithm; by 7-10.5% and 8-15.5%, respectively, compared with the GVIBE algorithm; and by 1.5-2% and 1-1.5%, respectively, compared with the NPCM algorithm.
As shown for the randomly selected time points in Table 4, the precision, recall, and F1-score of the MOD-AT algorithm for single-frame detection are all higher than those of the GVIBE, IGMM, and NPCM algorithms, indicating that the MOD-AT algorithm has high precision.
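The single-frame indicators discussed here can be illustrated with a minimal sketch. It assumes the standard definitions (precision = TP / (TP + FP), recall = TP divided by the number of real objects, which this paper denotes TN, and F1 as their harmonic mean); the paper's exact formulas are not restated in this section, and the class name, function names, and sample counts below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameCounts:
    tp: int    # correctly detected moving objects
    fp: int    # false detections (noise reported as objects)
    real: int  # real objects in the frame (denoted TN in this paper)


def precision(c: FrameCounts) -> float:
    """Share of detections that are real objects (assumed definition)."""
    detected = c.tp + c.fp
    return c.tp / detected if detected else 0.0


def recall(c: FrameCounts) -> float:
    """Share of real objects that were detected (assumed definition)."""
    return c.tp / c.real if c.real else 0.0


def f1_score(c: FrameCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one frame, for illustration only.
frame = FrameCounts(tp=8, fp=2, real=10)
print(precision(frame), recall(frame), f1_score(frame))
```

Under these assumed definitions, a large FP count lowers precision even when TP equals the number of real objects, which is why removing noise detections matters for the single-frame scores.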

Verification of Overall Accuracy
The overall accuracy was verified on 2400 random records, and the precision, recall, and F1-score were calculated for each frame. The radar-map distribution in Figure 11 shows that the precision, recall, and F1-score of the MOD-AT algorithm are greater than those of the GVIBE, IGMM, and NPCM algorithms. The mean (MP, MR, MF) and variance (VP, VR, VF) of each indicator over multiple frames were also obtained; the results are shown in Table 5, as follows. (1) For different video data, the F1-score of the MOD-AT algorithm is maintained above 90%, which shows that the MOD-AT algorithm can maintain good object detection performance for videos with different views and heights. (2) Compared with the other object detection algorithms, the MOD-AT algorithm improves MP by 17.13-44.4%, MR by 7.98-24.38%, and MF by 10.13-33.97%. (3) The VP, VR, and VF of the MOD-AT algorithm are lower than those of the other object detection algorithms, showing that the precision, recall, and F1-score of the MOD-AT algorithm are stable and maintain high accuracy.
Meanwhile, the MOD-AT algorithm reduced the time consumption per frame by 7.9-24.3%, as shown in Table 6, indicating optimized efficiency and performance.
To verify the impact of different frame rates on the algorithm's performance, videos #1 and #2 were each converted to 10, 20, 30, 40, and 50 fps, and time efficiency and CPU usage were compared. As can be seen in Figure 12, videos #1 and #2 show the following patterns: (1) for every 10-fps increase in frame rate, CPU usage increases by 15.4-47.5% and time efficiency increases by 18.7-43.7%; and (2) for the same frame rate, the more targets each frame contains, the longer the processing time and the higher the CPU consumption.
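The aggregation of per-frame scores into a mean (MP, MR, MF) and a variance (VP, VR, VF) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes population variance (the paper does not say whether population or sample variance was used), and the function name and sample scores are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population variance) of per-frame scores.

    Population variance is an assumption; statistics.variance would
    give the sample variance instead.
    """
    if not scores:
        raise ValueError("at least one per-frame score is required")
    return fmean(scores), pvariance(scores)


# Hypothetical per-frame precision values, for illustration only.
per_frame_precision = [0.90, 0.92, 0.94]
mp, vp = summarize(per_frame_precision)
print(f"MP={mp:.4f}, VP={vp:.6f}")
```

The same function can be applied to the per-frame recall and F1-score lists to obtain MR, VR, MF, and VF; a lower variance corresponds to the more stable behavior described in point (3).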

Conclusions and Discussion
Previous moving object detection algorithms do not consider the influence of camera-imaging characteristics, which results in low target-detection accuracy. To address this problem, this work designed a moving object detection method called MOD-AT that considers the difference in image spatial threshold. MOD-AT achieves high-precision detection of moving objects at different positions in the image by applying a different threshold to each.
Experimental results show that the MOD-AT algorithm has higher accuracy in both single-frame and overall accuracy evaluation. We report the following.
1. In terms of single-frame accuracy, compared with the existing object detection algorithms, the median error (MD) of the MOD-AT algorithm is reduced by 1.2-11.05% and the mean error (MN) is reduced by 1-15.5%, which shows that the MOD-AT algorithm has high accuracy in single-frame detection.
2. In terms of overall accuracy, (a) the F1-score of the MOD-AT algorithm is above 90% for different experimental scenarios, demonstrating the stability of the MOD-AT algorithm; and (b) compared with the existing object detection algorithms, the MOD-AT algorithm improves MP by 17.13-44.4%, MR by 7.98-24.38%, and MF by 10.13-33.97%, which shows that the MOD-AT algorithm has high precision.
3. The time consumption per frame of the MOD-AT algorithm was reduced by 7.9-24.3% compared with the other algorithms, reflecting its efficiency.
This algorithm has several shortcomings. First, due to the limitation of the video data and the difficulty of obtaining high-precision digital terrain model (DTM) and digital surface model (DSM) data, the threshold calculation is limited to moving objects on a plane. Second, the maximum threshold range was determined through multiple experiments on only two videos, so more experimental data are needed to verify its universality. Third, for more complex monitoring scenarios, such as group objects, the current algorithm needs to be combined with object detection, semantic segmentation, deep learning, and other methods. Finally, how to balance the efficiency of the current method with the high accuracy of deep-learning methods also requires further study. These problems remain the focus of planned follow-up research.