High-Accuracy Recognition and Localization of Moving Targets in an Indoor Environment Using Binocular Stereo Vision

Abstract: To obtain effective indoor moving target localization, a reliable and stable moving target localization method based on binocular stereo vision is proposed in this paper. A moving target recognition and extraction algorithm, which integrates displacement pyramid Horn-Schunck (HS) optical flow, Delaunay triangulation, and Otsu threshold segmentation, is presented to separate a moving target from a complex background, called the Otsu Delaunay HS (O-DHS) method. Additionally, a stereo matching algorithm based on deep matching and stereo vision is presented to obtain dense stereo matching point pairs, called stereo deep matching (S-DM). The stereo matching point pairs of the moving target were extracted from the moving target area and the stereo deep matching point pairs, and then the three-dimensional coordinates of the points in the moving target area were reconstructed according to the principle of binocular vision's parallel structure. Finally, the moving target was located by the centroid method. The experimental results showed that this method can better resist image noise and repeated texture, can effectively detect and separate moving targets, and can match stereo image points in repeated-texture areas more accurately and stably. This method can effectively improve the accuracy and robustness of three-dimensional moving target coordinate estimation.


Introduction
With the development of indoor localization technology [1][2][3], moving target localization based on computer vision technology [4][5][6] is becoming more and more widely used in indoor environments. Indoor localization technology [7] is used to obtain the location information of people and objects in indoor environments. High-precision outdoor localization methods, such as the Global Positioning System (GPS), other Global Navigation Satellite Systems (GNSS), and cellular networks, have difficulty locating targets in non-line-of-sight conditions and therefore do not work in indoor environments. Consequently, indoor scene localization technology has become a research hot spot in recent years [8,9]. However, the majority of indoor localization technologies need plenty of additional infrastructure and have a high construction cost, and many problems in indoor localization remain to be solved. Indoor localization technologies [10][11][12][13] such as Wireless Local Area Network (WLAN), Wi-Fi, Ultra Wide Band (UWB), Bluetooth, and Radio Frequency Identification (RFID) all need signal transmitters and communication networks deployed in the indoor scene, which increases the labor intensity and results in high costs. Moreover, technologies based on radio signals require the moving target to carry a corresponding sensor to receive and transmit signals; they are therefore active localization technologies that depend on the target's cooperation. Sometimes, moving targets do not carry a sensor; therefore, we propose a localization method based on computer vision to obtain the position of a moving target in a passive way, which does not require cooperation from the moving target. Target localization based on computer vision has many advantages, such as high efficiency, high accuracy, low cost, and no requirement for prior knowledge.
General vision target localization techniques include monocular vision localization [14], binocular vision localization [15][16][17], and omnidirectional vision localization [18]. Monocular vision localization has low accuracy and cannot obtain depth information, while omnidirectional vision localization has a complex structure, low image resolution, and serious distortion problems; compared to these two forms of localization, binocular vision localization [19] has advantages such as a simple structure and high accuracy and efficiency, so we chose the parallel structure of binocular vision to estimate target locations. Binocular vision localization, also called stereo vision localization, is a comprehensive technology that uses two cameras and processors to simulate human eyes observing the surrounding scene, and then analyzes and understands the acquired information. To be specific, by calculating the relative relationship between the positions of the same spatial point in the two images acquired by the two cameras at the same time, the disparity values of the corresponding matching points in the left and right images are obtained; the known internal and external parameter matrices are then used to reconstruct the coordinates of the three-dimensional space points according to the triangulation principle, thus obtaining the 3D coordinate values of said space points.
To obtain indoor moving target localization, we chose binocular stereo vision and followed three steps: (1) finding and separating the moving target in the image sequence; (2) matching the moving target pixels in the stereo image pairs captured at the same time; (3) translating the two-dimensional pixel coordinates to three-dimensional world coordinates. To extract the moving target, the main moving target detection algorithms at present are optical flow [20], inter-frame difference [21], and background subtraction [22]. Traditional moving target detection methods are extremely susceptible to interference from background areas and changes of light and shadow, resulting in inaccurately detected moving target areas. In recent years, many methods based on the above three algorithms have been proposed to solve these problems, and the optical flow method has become the main target detection method because of its better precision and robustness. The authors of [23] proposed optical flow with motion occlusion detection based on triangulation, addressing the problem of motion occlusion in optical flow estimation. The authors of [24] presented a PatchMatch framework for optical flow, adopting a coarse-to-fine PatchMatch with generated sparse seeds to obtain sparse matching, thereby improving the computational efficiency of optical flow. Stereo matching often uses area- and feature-based matching. Area-based matching is often used for three-dimensional information reconstruction because the detailed information is kept intact, but it takes a long time. The authors of [25] reconstructed 3D surfaces by alternating between reprojection error minimization and mesh denoising for the recovery of 3D geometrical surfaces from calibrated 2D multi-view images, effectively preserving the fine-scale details of a reconstructed surface.
Feature-based matching fuses ordinary feature point extraction and matching methods, such as the Fast [26], Harris [27], Sift [28], and Surf [29] methods, with the camera's epipolar constraint principle to perform line-by-line stereo matching. Feature-based matching can only obtain sparse point pairs, is prone to mismatches, and may even fail to detect feature points when matching repeated-texture regions on non-rigid targets. Recently, the authors of [30] proposed a method that converts the 2D Fourier transform into 1D form and derives a 1D Fourier transform fast image matching model, which can effectively achieve large-scale and real-time detection, improve the robustness of stereo matching, and reduce the complexity of vehicle segmentation. The authors of [31] presented a novel super-pixel-based regression framework and introduced CRF with RRF into stereoscopic camera depth maps, making it more applicable for generating accurate data from an inexpensive stereo camera. The authors of [32,33] integrated a binocular stereo vision system in hardware and software.
We analyzed the factors that affect the accuracy of binocular localization and found that the main ones are the accuracy of target area extraction and the accuracy of stereo matching. Therefore, in this paper, a binocular moving target localization algorithm is proposed that fuses the improved Horn-Schunck (HS) large displacement pyramid optical flow method (O-DHS) and high-precision stereo deep matching (S-DM). First, the improved large displacement HS pyramid optical flow method is combined with Delaunay triangulation [34] to obtain a precise optical flow field in the current frame of the left camera image sequence. Second, Otsu's method [35] provides the optical flow threshold value used to obtain the moving target area range. These two steps solve two problems of the traditional HS optical flow: its inability to obtain the optical flow vector field of a fast-moving target, and the spurious optical flow values in the moving target's occluded area, which enlarge the apparent area of the moving target in the image. Then, according to the stereo deep matching algorithm, the left and right stereo image pairs captured at the same time are matched to obtain dense corresponding points, which solves the traditional matching algorithms' problems of sparse matches, low accuracy, and high error rates when matching non-rigid targets or targets with repeated textures. Finally, the 3D point coordinates of the target area are reconstructed based on the target segmentation area and the dense stereo matching points, and the target location is calculated from these 3D point coordinates according to the centroid method. The algorithm we propose can more accurately locate non-rigid moving targets with repeated-texture areas or a certain range of deformation. In Sections 3 and 4, the experimental results verify that the algorithm has higher accuracy and practicality than other algorithms.

Overview of the Methods
To obtain the localization information of the moving target, this paper first used the theory of binocular stereo imaging to find the corresponding relationship between the coordinate information in the stereo image pair and the three-dimensional coordinates. Then, by separating the moving target area range in the left image sequence and detecting the point-to-point stereo matching relationship of the stereo image pairs, we could obtain the 3D coordinates of the dense points in the moving target area. At last, we calculated the moving target's 3D location according to the centroid method. The work mainly included three aspects: (1) Analysis of the parallel binocular stereo vision model. We established a backward resection model based on the left and right image pairs captured at the same time and derived the conversion relationship between the pixel coordinates on the image planes of the stereo image pairs and the world coordinates in the real three-dimensional coordinate system. (2) Extraction of the moving target area (O-DHS). We detected the moving target area in the left camera image sequence by using the variational HS large displacement optical flow, and we used Delaunay triangulation to remove the occlusion area in the optical flow field to obtain a more accurate moving target optical flow field. Otsu was used to determine the optical flow threshold value to extract the moving foreground. (3) Dense stereo deep matching of stereo images (S-DM). To determine the matching relationship between the left and right images, the stereo deep matching algorithm was used to obtain the dense point matching relationship of the stereo image pairs; this matching algorithm fuses deep matching with the stereo matching constraint principles. The algorithm framework of this paper is shown in Figure 1.
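The final centroid step above can be sketched as follows; this is a minimal illustration (the function name and the use of a plain NumPy mean are our own, not from the paper):

```python
import numpy as np

def locate_by_centroid(points_3d):
    """Estimate the moving target's position as the centroid (mean) of the
    reconstructed 3D points inside the detected target area. The function
    name is ours; the centroid rule is the one described in the paper."""
    pts = np.asarray(points_3d, dtype=float)
    if pts.ndim != 2 or pts.shape[1] != 3:
        raise ValueError("expected an (N, 3) array of 3D points")
    return pts.mean(axis=0)
```

Averaging over all reconstructed target-area points makes the estimate robust to individual mismatched points, which is why a dense, evenly distributed set of matches matters for localization accuracy.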

Binocular Visual Localization Measurement Method
Binocular parallel stereo vision imitates the human perception of the depth of the surrounding environment with two eyes to obtain the three-dimensional information of a scene. According to the principle of triangulation, two cameras with coplanar imaging surfaces image the same scene from different angles to obtain the disparity and recover three-dimensional information. As shown in Figure 2, O_l and O_r are defined as the optical center positions of the left and right cameras; O_l-X_lY_lZ_l and O_r-X_rY_rZ_r are the left and right camera coordinate systems; b is the distance between the two optical centers O_l and O_r, called the baseline distance; the focal length of the cameras is f; the left camera coordinate system is used as the binocular camera coordinate system; P(X, Y, Z) is a three-dimensional space point in the binocular camera coordinate system; and its projections in the left and right camera imaging coordinate systems are P_l(x_l, y_l) and P_r(x_r, y_r). Projecting the model onto the XOZ coordinate plane (as shown in Figure 3), C is defined as the distance between the intersection point of the vertical line from point P to the camera's imaging plane and the projection point P_r of point P in the right imaging plane.
Based on similar triangles, we can obtain:

x_l / f = (b + C) / Z (1)

x_r / f = C / Z (2)

x_l / f = X / Z (3)

From Equations (1) and (2), we can get C = b x_r / (x_l − x_r), and substituting it into Equation (1), we can calculate Z:

Z = b f / (x_l − x_r) (4)

Then, d = x_l − x_r is defined as the disparity, so Equation (4) can be rewritten as:

Z = b f / d (5)

This is then substituted into Equation (3):

X = Z x_l / f = b x_l / d (6)

Moreover, we can project the model onto the YOZ projection surface to calculate Y:

Y = Z y_l / f = b y_l / d (7)

Therefore, in order to obtain the 3D coordinates of the target, the projection point coordinates P_l(x_l, y_l) and P_r(x_r, y_r) in the left and right camera imaging coordinate systems need to be calculated, and these projection coordinates can be obtained by image pixel coordinate transformation.
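The parallel-rig triangulation described above reduces to the standard relations Z = bf/d, X = Z x_l / f, Y = Z y_l / f, which can be sketched directly (function name and error handling are ours):

```python
def triangulate(x_l, y_l, x_r, b, f):
    """Parallel-rig triangulation: with disparity d = x_l - x_r,
    Z = b*f/d, X = Z*x_l/f, Y = Z*y_l/f. The units of the baseline b and
    focal length f must match those of the image-plane coordinates."""
    d = x_l - x_r  # disparity
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    Z = b * f / d
    return Z * x_l / f, Z * y_l / f, Z
```

For example, with a 100 mm baseline, a 4 mm focal length, and image-plane coordinates in millimetres, a 1 mm disparity places the point 400 mm from the rig.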
According to the pinhole model, the transformation relationship between the camera coordinates p(x, y, z) and the image pixel coordinates M(m, n) is:

S [m n 1]^T = [[f_x 0 δ_x], [0 f_y δ_y], [0 0 1]] [R T] [x y z 1]^T (8)

where S is the scale factor; f_x and f_y represent the effective focal lengths in the X and Y directions; δ_x and δ_y represent the image center coordinates; f_x, f_y, δ_x, and δ_y are the internal parameters of the stereo color camera; and R and T represent the position and orientation of the camera in the real world, which are the external parameters of the stereo camera. The above parameter values can be obtained by stereo color camera calibration (see Section 4). Additionally, the coordinate plane coincides with the spatial plane, so the z value is taken as 0, which simplifies Equation (8) to:

S [m n 1]^T = [[f_x 0 δ_x], [0 f_y δ_y], [0 0 1]] [r_1 r_2 T] [x y 1]^T (9)

where r_1 and r_2 are the first two columns of R. Respectively substituting the pixel coordinates of the left and right images to obtain P_l(x_l, y_l) and P_r(x_r, y_r) and relating Equations (5)-(7) enables calculation of the three-dimensional coordinates (X, Y, Z) in the camera coordinate system.
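As a small worked example of this pixel-to-world conversion, the sketch below assumes an ideal rectified parallel rig in which both cameras share the same intrinsics; the function names and this shared-intrinsics assumption are ours:

```python
def pixel_to_image_coords(m, n, fx, fy, dx, dy):
    """Normalized image-plane coordinates from pixel coordinates (m, n),
    using the intrinsics f_x, f_y and principal point (delta_x, delta_y)."""
    return (m - dx) / fx, (n - dy) / fy

def pixel_disparity_to_xyz(m_l, n_l, m_r, b, fx, fy, dx, dy):
    """3D point in the left-camera frame from a rectified pixel match:
    with x = (m - delta_x)/f_x, triangulation reduces to
    Z = b*f_x / (m_l - m_r). Assumes both cameras share these intrinsics."""
    Z = b * fx / (m_l - m_r)
    X = Z * (m_l - dx) / fx
    Y = Z * (n_l - dy) / fy
    return X, Y, Z
```

With f_x = 1000 px, principal point (640, 360), baseline 0.1 m, and a 100-pixel disparity, the point lies 1 m in front of the left camera.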

Motion Detection and Extraction
In this paper, the HS pyramid optical flow and Delaunay triangulation [36] fusion algorithms were used to detect the moving target in the adjacent frame images I1 and I2 acquired by the left camera. First, the data term of the traditional HS optical flow model was improved, reducing the errors caused by lighting changes and image noise and improving the calculation accuracy of the HS optical flow model. Then, Delaunay triangulation (described in Section 2.3.2) was used to determine whether a pixel's optical flow belongs to the background or the target area. Occlusion regions all occur at the boundary of the moving target area, but not every boundary region is occluded; therefore, Delaunay triangulation was used to detect the occlusion area among the edge areas of the image, thereby optimizing the optical flow of the occlusion area, improving the precision of the HS optical flow field, and hence improving the accuracy of the moving target detection. Finally, Otsu was employed to obtain the optical flow threshold value used to extract the moving target area range. We call this method of detecting and extracting a moving target O-DHS.

HS Optical Flow Model Construction
(1) Data term. The spatial gradient is defined as ∇3I = (I_x, I_y, I_t)^T; the partial derivatives of the spatial gradient with respect to the x and y directions are ∇3I_x = (I_xx, I_xy, I_xt)^T and ∇3I_y = (I_xy, I_yy, I_yt)^T; and the optical flow vector is ω = (u, v, 1)^T. From the assumption of constant brightness, (∇3I)^T ω = 0. The data term can be defined as:

E_data(ω) = ∫_Ω Λ ψ(ω^T J_0 ω) dx, with J_0 = ∇3I (∇3I)^T (11)

where Λ is the occlusion coefficient, a binary function based on prior occlusion information: Λ = 1 means that the point is not occluded and Λ = 0 means that the point is occluded (the occlusion determination is explained in detail in Section 2.3.2); ψ(x) = sqrt(x² + ξ²) is a non-square penalty function, where 0 < ξ ≪ 1, and ξ in this paper is 0.001. Because HS optical flow has a poor ability to suppress noise, Weickert [37] proposed introducing the local consistency of Lucas-Kanade (LK) optical flow to reduce noise. The HS and LK models can be merged by convolution with the Gaussian function K_ρ, where K_ρ represents a Gaussian kernel function with radius ρ. Gradient constancy is introduced to solve the problem of lighting changes, which cannot be handled by gray-value constancy alone. The data term becomes:

E_data(ω) = ∫_Ω Λ [ψ(ω^T (K_ρ ∗ J_0) ω) + γ ψ(ω^T (K_ρ ∗ (J_x + J_y)) ω)] dx (12)

where γ is the coefficient balancing the gray term and the gradient term. With J_0 = ∇3I (∇3I)^T, J_x = ∇3I_x (∇3I_x)^T, and J_y = ∇3I_y (∇3I_y)^T, the function is simplified as:

E_data(ω) = ∫_Ω Λ ψ(ω^T K_ρ ∗ (J_0 + γ(J_x + J_y)) ω) dx (13)

(2) Smoothness term. We adopted the isotropic smoothness strategy of the global smoothness term, which is:

E_smooth(ω) = ∫_Ω ψ(|∇u|² + |∇v|²) dx (14)

(3) Multi-resolution multilayer refinement. To solve the problem of HS optical flow not being able to handle large displacement motion, we introduced multi-resolution pyramid layering to refine the optical flow, and downsampled the continuous frame images from the bottom up by bilinear interpolation to build the pyramid.
The pyramid layers from the bottom to the top are I^k, k = 0, 1, 2, ..., n. The optical flow value calculated at a coarser layer serves as the initial optical flow value ω_0^k of the next layer k; it is added to the layer-k HS optical flow increment dω^k = (du^k, dv^k)^T to obtain the initial optical flow value of the next layer k + 1, continuing until the bottom of the pyramid is reached, which yields the final optical flow value. The optical flow value at the top of the pyramid is initialized as ω_0 = (0, 0)^T. The downsampling factor of the bilinear interpolation used in this paper was 0.95, and the number of layers was n = 40. The update function is:

ω_0^{k+1} = ω_0^k + dω^k (15)
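The coarse-to-fine refinement described above can be sketched as follows. This is a structural sketch only: the per-level HS solver is passed in as a caller-supplied callback (not from the paper), and nearest-neighbour resizing stands in for the bilinear interpolation:

```python
import numpy as np

def resize_nn(img, shape):
    """Nearest-neighbour resize (a stand-in for bilinear interpolation)."""
    r = (np.arange(shape[0]) * img.shape[0] / shape[0]).astype(int)
    c = (np.arange(shape[1]) * img.shape[1] / shape[1]).astype(int)
    return img[np.ix_(r, c)]

def pyramid_flow(I1, I2, solve_increment, factor=0.95, levels=5):
    """Coarse-to-fine refinement: the flow from a coarser level initializes
    the next finer level, where an incremental flow d_omega is estimated
    and added. `solve_increment(I1, I2, flow)` is any one-level HS-style
    solver supplied by the caller (a hypothetical callback)."""
    shapes = [I1.shape]
    for _ in range(levels - 1):
        h, w = shapes[-1]
        shapes.append((max(2, int(h * factor)), max(2, int(w * factor))))
    prev = shapes[-1]
    flow = np.zeros(prev + (2,))                   # omega_0 = (0, 0)^T at the top
    for shape in reversed(shapes):                 # coarsest -> finest
        scale = shape[1] / prev[1]                 # rescale flow magnitudes
        flow = np.stack([resize_nn(flow[..., k], shape) for k in (0, 1)],
                        axis=-1) * scale
        P1, P2 = resize_nn(I1, shape), resize_nn(I2, shape)
        flow = flow + solve_increment(P1, P2, flow)   # omega += d_omega
        prev = shape
    return flow
```

The paper's settings (factor 0.95, n = 40 layers) give a very gentle pyramid; the sketch exposes both as parameters.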

Delaunay Triangulation Occlusion Area Determination
In this paper, according to the Delaunay geometric occlusion detection method proposed by Kennedy et al. [37], the method introduced in [35] was applied to the occlusion detection of moving targets. We further applied it to the detection and extraction of moving targets by deleting the erroneously detected optical flow values in the occlusions mixed into the moving target foreground area. This improved the accuracy of segmentation and extraction of moving targets, and thereby the target localization accuracy.
For a pixel point at position B = (i, j)^T in I1, the optical flow ω = (u, v)^T at that point is added to obtain the corresponding point (i + u, j + v) of B, so the gray difference ∆I between I1 and I2 is:

∆I(B) = I1(i, j) − I2(i + u, j + v)

When ∆I = 0, the gray values at the corresponding positions of the consecutive frames are consistent, indicating that the point is not occluded, and Λ = 1 in Equations (11)-(13). When ∆I ≠ 0, Delaunay triangulation is constructed on the gray difference map for occlusion determination. The specific method is as follows: each point B1(i, j) in the previous frame image I1 and its neighborhood coordinate points B2(i + 1, j) and B3(i, j + 1) form a region triangle; Equation (16) is used to calculate the corresponding triangle optical flow value in I2, and then the gray change ∆I_∆ of this triangle area is calculated:

ω_∆ = ϕ1 ω(B1) + ϕ2 ω(B2) + ϕ3 ω(B3) (16)

where ϕ1, ϕ2, and ϕ3 represent the weights of the three points; we took ϕ1 = ϕ2 = ϕ3 = 1/3. We compared the grayscale difference ∆I_B of each pixel in the 20 × 20 neighborhood around the triangle with the triangle's gray change ∆I_∆ to determine the occlusion of the points in the neighborhood. The specific judgment is: ∆I_B > ∆I_∆ means the point is occluded and Λ = 0; ∆I_B ≤ ∆I_∆ means the point is not occluded and Λ = 1.
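A simplified stand-in for this occlusion test is sketched below; it uses the fixed right-angle triangle neighbourhood with np.roll at the borders instead of a full Delaunay triangulation, so it only illustrates the thresholding logic:

```python
import numpy as np

def occlusion_mask(I1, I2, flow, phi=(1/3, 1/3, 1/3)):
    """Occlusion coefficient Lambda from the warped gray difference
    dI = I1(i, j) - I2(i + u, j + v). A point is kept (Lambda = 1) when its
    difference does not exceed the weighted difference of the triangle
    B1 = (i, j), B2 = (i+1, j), B3 = (i, j+1). Using np.roll at the image
    borders is a simplification of the Delaunay construction."""
    h, w = I1.shape
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    u, v = flow[..., 0], flow[..., 1]
    ti = np.clip(np.rint(ii + u).astype(int), 0, h - 1)  # corresponding row i + u
    tj = np.clip(np.rint(jj + v).astype(int), 0, w - 1)  # corresponding col j + v
    dI = np.abs(I1 - I2[ti, tj])
    # weighted gray change of the local triangle (phi_1 = phi_2 = phi_3 = 1/3)
    dI_tri = (phi[0] * dI
              + phi[1] * np.roll(dI, -1, axis=0)
              + phi[2] * np.roll(dI, -1, axis=1))
    return (dI <= dI_tri).astype(int)  # 1 = not occluded, 0 = occluded
```

When the two frames agree under the flow (dI = 0 everywhere), every point is classified as not occluded, matching the ∆I = 0 case above.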

Moving Target Area Extraction
We used Otsu to determine the threshold of the moving target optical flow field; thus, the foreground area of the moving target could be separated from the background area in the image, and the coarse range of the moving target area was extracted. Because the image was affected by noise, the obtained optical flow field was not uniformly distributed, which caused the obtained binary foreground image to have uneven target edges, porosity, and unnecessary small foreground areas besides the moving target. In order to obtain a moving foreground target with smooth edges, as shown in Figure 4, we analyzed the morphological operations of erosion, dilation, opening, and closing, thereby determining that the moving target foreground area could best be obtained by a closing operation followed by an opening operation.
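The threshold-and-morphology step can be illustrated with plain NumPy (Otsu implemented directly; the 3 × 3 cross-shaped structuring element and the function names are our simplifications):

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                 # class-0 probability
    mu = np.cumsum(p * centers)       # class-0 cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * w0 - mu) ** 2 / (w0 * (1 - w0))
    sigma_b[~np.isfinite(sigma_b)] = 0
    return centers[np.argmax(sigma_b)]

def dilate(m):
    """Binary dilation with a 3x3 cross-shaped structuring element."""
    out = m.copy()
    out[1:] |= m[:-1]; out[:-1] |= m[1:]
    out[:, 1:] |= m[:, :-1]; out[:, :-1] |= m[:, 1:]
    return out

def erode(m):
    """Binary erosion, the dual of dilation."""
    return ~dilate(~m)

def extract_foreground(flow_mag):
    """Otsu-threshold the optical-flow magnitude, then apply the paper's
    closing-then-opening sequence to smooth the foreground mask."""
    mask = flow_mag > otsu_threshold(flow_mag.ravel())
    mask = erode(dilate(mask))   # closing: dilation followed by erosion
    mask = dilate(erode(mask))   # opening: erosion followed by dilation
    return mask
```

The closing fills small holes inside the target region first; the subsequent opening then removes the small spurious foreground specks, which matches the order chosen above.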

Stereo Deep Matching
Deep matching (DM), proposed by Weinzaepfel et al. [38,39], is a dense, fast image matching algorithm with a structure similar to a convolutional neural network. It describes local image features by combining the optical flow algorithm with dense sampling, a pyramid structure, cross convolution, and max-pooling of local feature descriptors. It can match non-rigid deformations and regions with repeated textures and can obtain dense matching pairs. Compared to other algorithms, DM has the advantages of high computing efficiency and high accuracy. The DM algorithm uses the pixel block area around a point as that point's descriptor and builds a correlation pyramid between the left and right camera images acquired at the same moment according to the similarity measure between their block areas; it then establishes the matching relationship by iterating from the top of the pyramid down to the bottom to obtain the correspondence between the left and right camera images. This comprises two steps: (1) calculation of the correlated tile areas from the bottom up to establish the correlation pyramid; (2) top-down matching transfer. This paper integrated five constraints of stereo matching (the epipolar, uniqueness, consistency, continuity, and order constraints) into the deep matching algorithm to match the stereo image pairs, and we call the result stereo deep matching (S-DM).

Establishing an Association Pyramid
Since the HOG (Histogram of Oriented Gradients) [40] algorithm can describe the local features of an image well, HOG was used to describe the features of all pixels in the image. The pyramid structure can solve the problem of incomplete image feature information at a single scale, and it is easier and more accurate to use small-scale images to match regions with repeated textures. Therefore, in order to obtain accurate and dense correspondences between the left and right camera images, an association pyramid uniting HOG and the pyramid structure is used to match the left and right camera images accurately and comprehensively.
Images I_l and I_r represent the images of the left and right cameras at the same time, with L rows and R columns. Grayscale processing is performed on I_l and I_r. Image I_l is divided into non-overlapping segments of size 4 × 4, and the segmented 4 × 4 block images are called initial block images, so as to reduce the influence of the deformation regions in I_l and I_r caused by different angles or displacements. Each 4 × 4 initial block image is divided into four same-size subregions, and each subregion may rotate internally within the range [−26°, 26°] and deform in scale within the [1/2, 3/2] range. The block similarity can be regarded as the average of the similarities of the four quadrant units. Therefore, the similarity between blocks of I_l and I_r is:

sim(H_l, H_r) = (1/4) Σ_{i=1}^{4} d(H_{l,i}, H_{r,i}) (17)

where H_l and H_r are 4 × 4 initial blocks of the same row in the left camera image I_l and the right camera image I_r; H_{l,i} and H_{r,i} are the corresponding single-quadrant feature descriptors; and d(·) is a similarity metric function with a value range of [0, 1].
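A toy version of this quadrant-average block similarity is sketched below, with cosine similarity on raw pixel quadrants standing in for the HOG-based metric d(·); all names are ours:

```python
import numpy as np

def quadrant_descriptors(block):
    """Split a 4x4 block into four 2x2 quadrants, flattening each into a
    descriptor vector (raw pixels stand in here for the per-quadrant HOG)."""
    return [block[:2, :2].ravel(), block[:2, 2:].ravel(),
            block[2:, :2].ravel(), block[2:, 2:].ravel()]

def cosine_sim(a, b):
    """A similarity metric d(.,.) with range [0, 1] for non-negative vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def block_similarity(H_l, H_r):
    """Similarity of two 4x4 blocks as the average similarity of their
    four quadrant descriptors."""
    return sum(cosine_sim(a, b)
               for a, b in zip(quadrant_descriptors(H_l),
                               quadrant_descriptors(H_r))) / 4.0
```

Averaging per-quadrant similarities rather than comparing the whole block at once is what tolerates the small independent rotation and scaling of each quadrant.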
Then, Gaussian smoothing with parameter σ1 = 1 is performed on each segmented initial block H to reduce the distortion caused by image compression. A histogram of the gradient directions of each block is calculated in two steps: (a) the Laplacian operator is used to convolve block H to obtain the gradient values ∇_x, ∇_y, and the gradient value of each pixel is projected non-negatively onto the eight directions (cos(πi/4), sin(πi/4)), i = 1, 2, ..., 8. (b) In order to reduce the influence of light and local shadows in the image, the correction (1 − e^{−τx})/(1 + e^{−τx}) with τ = 0.2 is applied. Then, Gaussian smoothing is used again to obtain the histogram of gradient directions. By reducing the weight of smaller direction values and increasing the weight of larger ones, the influence of the dominant direction value is increased. The similarity of the feature descriptors of two pixels is then measured by the normalized inner product of their gradient histograms:

d(H^l_{i,j}, H^r_{i,j}) = ⟨H^l_{i,j}, H^r_{i,j}⟩ / (‖H^l_{i,j}‖ ‖H^r_{i,j}‖) (18)

Then, the bottom-level association graph of the pyramid is calculated by convolution according to d(H_l, H_r). H_{i,j} represents the histogram of gradient directions, and I^r_{m,(p,q)} denotes an associated block image of the association graph, where {(p, q) | p ∈ {2, 6, 10, ..., L − 2}, q ∈ {2, 6, 10, ..., R − 2}} means that the center point of the block image is (p, q), and m means the size of the block is m × m (m = 4 · 2^t, where t is the current number of iterations). I^r_q represents the search area on the same horizontal line in the right image (the epipolar constraint). The association graph C_{4,p} can be obtained from the convolution of I^l_{4,(p,q)} and I^r_q according to Equation (19). For each point p^r in the right imaging plane I^r, C_{4,p}(p^r) represents the matching degree between I^l_{4,p} and I^r_{4,p^r}.
An association graph of the next layer up is obtained by aggregating the smaller block images at the bottom of the association-graph pyramid into larger block images; that is, an m × m block is aggregated from four block images of size m/2 × m/2. The obtained aggregated correlation graph is subsampled to a correlation graph 1/2 the size of the previous layer, and the max-pooling method is used to propagate the local maximum; the pyramid is thus halved at each step from the bottom up, which is α : C((p, q)^r) → max_{S ∈ {−1,0,1}²} C((p, q)^r + S).
As the number of subsampling iterations increases, the spatial size of the association graph decreases. To eliminate the effect of the factor-of-two decrease in each iteration, a constant coefficient of 2 is applied in the iterated function, as β : C((p, q)^r) → C(2(p, q)^r). This keeps the block area movement amount S within {−1, 0, 1}². Because max-pooling only retains the maximum correlation value, in order to maintain the robustness of the larger correlation values during the iteration process, a non-linear mapping γ is introduced: γ : C((p, q)^r) → C((p, q)^r)^μ, and the final correlation iteration function is the composition of the aggregation with α, β, and γ:

C_{2m,(p,q)} = γ(β(α((1/4) Σ_{i=1}^{4} C_{m,(p,q)_i})))
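One bottom-up iteration of the correlation pyramid can be sketched as below; the exact operation order and the exponent value mu = 1.4 are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def max_pool_shift(C):
    """alpha: replace each response by the maximum over the {-1,0,1}^2 shifts."""
    padded = np.pad(C, 1, mode="edge")
    h, w = C.shape
    return np.max([padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)], axis=0)

def aggregate_level(children, mu=1.4):
    """One bottom-up iteration: average the four children's correlation maps
    after max-pooling (alpha), subsample by two (beta), and apply the power
    non-linearity (gamma: C -> C^mu)."""
    C = sum(max_pool_shift(c) for c in children) / 4.0
    C = C[::2, ::2]          # beta: move to the coarser (halved) grid
    return np.power(C, mu)   # gamma: soften the winner-take-all of max-pooling
```

Each parent correlation map is thus half the resolution of its four children but covers a block twice the size, which is exactly the pyramid growth described above.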

Association Pyramid Matching Transfer
By using the possibly associated block images on the upper layers of the pyramid as entry block images, the matching correspondence is transmitted to the lower layers of the pyramid and returned through the lower-layer association graphs. The optimal matching of the same image on different pyramid levels is selected to form the bottom-level matching result. The entry block I_{m,(p,q)} is composed of four subquadrants I_{(m,(p,q)),i}, i = 1, 2, 3, 4; due to max-pooling sampling, I_{(m,(p,q)^r),i} corresponds to the block area I_{m/2, 2((p,q)^r + s_i) + ∂_i} in the next layer, where ∂_i represents the displacement:

∂_i = argmax_{∂ ∈ {−1,0,1}²} C_{m/2,(p,q)_i}(2((p, q)^r + s_i) + ∂), s_i ∈ {−1, 0, 1}² (22)

(p, q)_i is defined as (p, q) + s_i, and (p, q)^r_i as 2((p, q)^r + s_i) + ∂_i. Let R((p, q), (p, q)^r, m, C) represent the association score, with m × m the block size, (p, q) and (p, q)^r the center pixel coordinates, and C the matching degree. When m = 4, R is taken directly as the bottom-level entry K = ((p, q), (p, q)^r, m, C); otherwise, considering the matching situation at the entry block, the score is accumulated from the child blocks C_{m/2,(p,q)_i}((p, q)^r_i) and backtracked toward the bottom of the pyramid:

R((p, q), (p, q)^r, m, C) = ∪_{i=1}^{4} R((p, q)_i, (p, q)^r_i, m/2, C_{m/2,(p,q)_i}((p, q)^r_i)) (23)

Matching transfer produces many overlapping blocks; herein, we only backtracked the block with the highest association score, while those with lower scores were eliminated. According to Equation (23), the stereo deep matching result is obtained as the set K_E of backtracked bottom-level entries. Finally, the matching result K_E is filtered to find the best matching pairs. In detail, according to the K_E matching results, the 4 × 4 neighborhoods in images I_l and I_r centered on (p, q)^l and (p, q)^r are returned as the candidate matching pairs. Optimum((p, q)^l) = (p, q)^r means that (p, q)^l and (p, q)^r are the best matching pair, where (p, q)^l = (p, q) is the initial block center in the left camera imaging plane corresponding to the associated block, so the matching result can be expressed as:

K_M = {((p, q)^l, Optimum((p, q)^l))} (24)

Consistency Test
According to the consistency constraint of stereo matching, the correspondence found by matching from the left image to the right image should agree with the correspondence obtained by back-matching from the right image to the left image. Based on this, we took the right image as the source image and the left image as the target image, repeated Sections 2.4.1 and 2.4.2 to determine the matching point pairs, and selected the matches that agreed with the previous matching results, taking them as the final best matching results.
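The left-right consistency check can be expressed compactly; representing match lists as (source, target) pixel-coordinate pairs is our assumed data format:

```python
def consistency_filter(matches_lr, matches_rl):
    """Keep only pairs found in both directions: a left-to-right match
    (p_l, p_r) survives iff matching right-to-left maps p_r back to p_l.
    Matches are assumed to be lists of ((row, col), (row, col)) tuples."""
    back = {p_r: p_l for p_r, p_l in matches_rl}
    return [(p_l, p_r) for p_l, p_r in matches_lr if back.get(p_r) == p_l]
```

This cross-check discards one-sided matches cheaply, at the cost of running the matcher twice.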

Motion Detection and Extraction Accuracy Analysis
We took four sets of continuous frame images with a resolution of 326 × 1024, simulated in the MPI Sintel data set [41]. We compared the algorithm in this paper to the LK [42], HS pyramid [43], Black and Anandan (B&A) [44], and Brox [45] methods. The detection results of these common optical flow methods were compared to verify the precision of our algorithm. As shown in Figure 5, the HS large displacement optical flow method fused with Delaunay triangulation achieved better detection results. The omission rate and false rate of the foreground target were used as the evaluation indices of the target detection performance. The omission rate is e_OR = N_omission / N × 100%, representing the ratio of the number of undetected moving target pixels to the total number of pixels. The false rate is e_FR = N_false / N × 100%, representing the ratio of the number of non-moving-target pixels falsely detected as target to the total number of pixels. The comparison results of the omission rate and false rate are shown in Figure 6. In Figure 6a, the missed detection rate of our algorithm is lower than that of the other algorithms. In Figure 6b, the false rate of our algorithm on the ambush sequence is slightly higher than that of the HS pyramid and B&A optical flow methods, but its missed detection rate on the ambush sequence is significantly lower than that of the other algorithms, as is its false rate on the remaining sequences. In summary, the method of moving target detection and extraction in this paper is effective and is able to reduce the final target positioning error caused by target region extraction error.
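The two evaluation indices can be computed from binary masks as follows (the mask representation and function name are assumptions for illustration):

```python
import numpy as np

def detection_rates(detected, truth):
    """Omission rate e_OR = N_omission / N x 100% (ground-truth target
    pixels that were missed) and false rate e_FR = N_false / N x 100%
    (background pixels wrongly detected as target), with N the total
    number of pixels."""
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    N = truth.size
    e_or = np.count_nonzero(truth & ~detected) / N * 100.0
    e_fr = np.count_nonzero(~truth & detected) / N * 100.0
    return e_or, e_fr
```

Note that both rates are normalized by the full pixel count N, so on images where the target is small, even a few percent of omission corresponds to a large fraction of the target being missed.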

Stereo Deep Matching Analysis
In order to verify the effectiveness of the stereo deep matching, we used 10 stereo image pairs from the general Middlebury Stereo Datasets (2014) [46] as the training set to evaluate accuracy. The results of a comparison against the S-Fast, S-Harris, S-Sift, and S-Surf methods (representing the Fast, Harris, Sift, and Surf matching methods after filtering and correction under stereo matching constraints) are shown in Figure 7. The number of matching points and the correct matching rate are compared in Table 1, where accuracy means the proportion of correct matches among the total detected matching points. It can be seen from Figure 7 that the matching results of the other methods are concentrated in certain areas, while the matching point pairs of our method are densely and uniformly distributed throughout the image. Because the target localization adopts the average of the coordinate positions of the target area points, our method is better at improving the localization accuracy. As can be seen from the table, the number of matching points of our method is much higher than that of the other algorithms, and the accuracy is generally higher. The Pipes matching accuracy is 2.96% lower than that of S-Fast, but S-Fast only achieved 68 relatively concentrated point pairs, far fewer than the 3406 point pairs of our method, and the accuracy of S-Fast on PlayTable is only 73.33%, which is 18.96% lower than ours. The accuracy on Teddy is 1.69% lower than that of S-Harris, but the accuracy of S-Harris on Pipes is only 45.33%. Therefore, compared to the other methods, the stereo deep matching in this paper has higher accuracy and stability, and the matching point pairs are evenly dense, which is more suitable for locating moving targets.

Experimental Results on Real Stereo Image Pairs
In this paper, the localization accuracy was verified by calculating the positions of moving targets in actual scenes, comprising comprehensive experiments on camera calibration, moving target detection and extraction, stereo image pairs, stereo deep matching, and target positioning. The initial sequence of left and right image pairs was captured at 1280 × 720 pixels and 30 fps by a parallel-structure binocular camera consisting of two fixed-focus lenses. The image pair data were captured by the stereo camera (shown in Figure 8). A Lenovo computer with an Intel Core i5 2.50 GHz processor processed the data, and the binocular camera communicated with the image processing computer through a USB 3.0 cable.
To obtain correct localization, the binocular camera needs to be calibrated before stereo matching so that the images to be matched can be rectified. In this paper, the binocular camera was calibrated using the MATLAB 2016b calibration toolbox to obtain the internal parameters f_x, f_y, δ_x, and δ_y and the external parameters R and t. The experimental image sequences were obtained using a binocular USB sensor with a resolution of 1280 × 720. First, the binocular camera was calibrated, and the internal parameter matrices of the left and right cameras were obtained. After calibration, the camera's distortion parameters and the internal parameter matrix were used to correct the images.
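The role of the internal parameters can be illustrated with the standard pinhole model, where f_x, f_y are the focal lengths in pixels and δ_x, δ_y are the principal-point coordinates. A minimal sketch of back-projecting a pixel to normalized image coordinates (the inverse of applying the internal parameter matrix) follows; function and variable names are our own.

```python
import numpy as np

def pixel_to_normalized(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to normalized image coordinates using the
    pinhole intrinsics f_x, f_y and principal point (c_x, c_y), i.e. invert
    the internal parameter matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    return (u - cx) / fx, (v - cy) / fy
```

For instance, with f_x = f_y = 800 and principal point (640, 360), the pixel (800, 360) maps to the normalized coordinates (0.2, 0.0).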
After pre-processing the acquired frame images, the optical flow field computed by the improved large-displacement HS optical flow method was combined with Otsu binarization and closing-then-opening morphological operations to extract the moving foreground target. The target region extraction results are shown in Figure 9.
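The Otsu binarization step selects, without manual tuning, the flow-magnitude threshold that best separates moving foreground from static background. A self-contained sketch of Otsu's criterion (maximizing between-class variance over a histogram) is shown below; applying it to optical-flow magnitudes is our illustrative framing of the paper's step.

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Otsu's threshold on a 1-D array (e.g. optical-flow magnitudes).

    Builds a histogram, then returns the bin edge that maximizes the
    between-class variance sigma_b^2 = (mu_T * omega - mu)^2 / (omega * (1 - omega)).
    """
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(float) / hist.sum()          # bin probabilities
    omega = np.cumsum(p)                         # class-0 cumulative probability
    mu = np.cumsum(p * np.arange(nbins))         # class-0 cumulative mean (bin index units)
    mu_t = mu[-1]                                # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    k = np.nanargmax(sigma_b)                    # bin with maximal between-class variance
    return edges[k + 1]                          # threshold at the upper edge of that bin
```

Pixels whose flow magnitude exceeds the returned threshold form the binary foreground, which is then cleaned by the closing-then-opening morphology.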
Then, stereo matching was performed on the corrected stereo image pairs, and the extracted moving target area was combined with the matches to obtain the stereo matching point pairs of the moving target area. The stereo matching results were compared to those of Fast, Harris, Sift, and Surf stereo matching. As shown in Figure 10 and Table 2, Harris stereo matching did not detect any matching point pairs in the moving target area in frames 80–84; Fast stereo matching detected only one matching point each in the 80th and 83rd frames, located in the upper half of the moving target area, which made the measured y value larger, and only those two frames yielded matching points in the target area at all. A large number of the matching point pairs detected by the Sift and Surf stereo matching algorithms were concentrated in a small region in the upper part of the target area, and hardly any matching point pairs were detected over most of the moving target area, which caused large errors in the measured x and y values. The stereo matching algorithm in this paper achieved high robustness, even in the repeated-texture areas of the moving target, and the detected stereo matching point pairs were evenly distributed over the moving target area, thereby yielding more accurate coordinate values.
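Combining the extracted target region with the dense matches amounts to keeping only those match pairs whose left-image point falls inside the target mask. A minimal sketch of this selection step follows; the function name and interface are our own illustrative choices.

```python
import numpy as np

def matches_in_target(pts_left, pts_right, target_mask):
    """Keep only stereo match pairs whose left-image point lies inside the
    extracted moving-target mask.

    pts_left, pts_right: (N, 2) arrays of (x, y) pixel coordinates.
    target_mask: boolean H x W array, True = moving-target pixel.
    """
    pts_left = np.asarray(pts_left, float)
    pts_right = np.asarray(pts_right, float)
    cols = np.round(pts_left[:, 0]).astype(int)  # x -> column
    rows = np.round(pts_left[:, 1]).astype(int)  # y -> row
    h, w = target_mask.shape
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    inside[inside] = target_mask[rows[inside], cols[inside]]  # keep in-mask points
    return pts_left[inside], pts_right[inside]
```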

Discussion and Conclusions
In this paper, a binocular parallel-structure optical camera was used to acquire binocular image data for measuring the 3D localization information of moving objects. Using the corrected image data, the moving target area was detected from the continuous frames of the left camera, which overcame the problem of the distance extracted from the shadow area of the target detection being larger than the real distance. Then, according to the principle of stereo vision 3D coordinate measurement and the related calibration parameters, the 3D positions of the matching points in the moving object region were calculated. From this lattice of reconstructed points, the three-dimensional coordinates of the moving object were computed by the centroid method. The experiments showed that the proposed algorithm can realize the three-dimensional localization of moving targets with high precision and robustness.
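The final reconstruction-and-centroid step can be sketched with the standard parallel-stereo triangulation model, where for rectified cameras with baseline B and focal length f_x (in pixels), Z = f_x · B / d with disparity d = u_left − u_right. The sketch below assumes this standard model; the function name and parameters are our own.

```python
import numpy as np

def locate_target(pts_left, pts_right, fx, fy, cx, cy, baseline):
    """Reconstruct 3-D points from rectified stereo matches with the parallel
    binocular model, then locate the target as the centroid of those points.

    pts_left, pts_right: (N, 2) arrays of (u, v) pixel coordinates.
    baseline: camera baseline B (same unit as the returned coordinates).
    """
    pts_left = np.asarray(pts_left, float)
    pts_right = np.asarray(pts_right, float)
    d = pts_left[:, 0] - pts_right[:, 0]          # disparity per match
    z = fx * baseline / d                         # depth: Z = f_x * B / d
    x = (pts_left[:, 0] - cx) * z / fx            # X = (u - c_x) * Z / f_x
    y = (pts_left[:, 1] - cy) * z / fy            # Y = (v - c_y) * Z / f_y
    pts3d = np.stack([x, y, z], axis=1)
    return pts3d.mean(axis=0)                     # centroid of the target points
```

For example, with f_x = f_y = 700, principal point (640, 360), and B = 0.12 m, a match (740, 360) ↔ (670, 360) has disparity 70 and reconstructs to Z = 1.2 m.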
The goal of our research was to improve the segmentation accuracy and stereo matching accuracy of moving targets to enhance the localization accuracy, addressing the low localization accuracy of binocular parallel vision cameras. Although our algorithm proved to be more accurate in indoor environments, its practical implementation may need more research. First, the algorithm is not efficient, because the extraction of the optical flow field based on the HS method takes a long time; we will explore how to improve the efficiency of producing high-precision optical flow fields in subsequent studies. Second, we plan to build a stereo vision system for moving target localization based on our algorithm. Finally, we plan to conduct localization studies in complex indoor environments such as shopping malls, museums, and galleries.

Data Availability Statement:
The general Middlebury Stereo Dataset is available at https://vision.middlebury.edu/stereo/data/. The MPI Sintel Flow Dataset can be found at http://sintel.is.tue.mpg.de/. The data used to support the findings of this study are available from the corresponding author upon request.