Article

A Trinocular System for Pedestrian Localization by Combining Template Matching with Geometric Constraint Optimization

1 School of Electronics, Peking University, Beijing 100871, China
2 School of Integrated Circuits, Shandong University, Jinan 250100, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(19), 5970; https://doi.org/10.3390/s25195970
Submission received: 23 August 2025 / Revised: 20 September 2025 / Accepted: 24 September 2025 / Published: 25 September 2025
(This article belongs to the Section Navigation and Positioning)

Abstract

Pedestrian localization is a fundamental sensing task for intelligent outdoor systems. To overcome the limitations of accuracy and efficiency in conventional binocular approaches, this study introduces a trinocular stereo vision framework that integrates template matching with geometric constraint optimization. The system employs a trinocular camera configuration arranged in an equilateral triangle, which enables complementary perspectives beyond a standard horizontal baseline. Based on this setup, an initial depth estimate is obtained through multi-scale template matching on the primary binocular pair. The additional vertical viewpoint is then incorporated by enforcing three-view geometric consistency, yielding refined and more reliable depth estimates. We evaluate the method on a custom outdoor trinocular dataset. Experimental results demonstrate that the proposed approach achieves a mean absolute error of 0.435 m with an average processing time of 3.13 ms per target. This performance surpasses both the binocular Semi-Global Block Matching (SGBM) algorithm (0.536 m) and RAFT-Stereo (0.623 m for the standard model and 0.621 m for the real-time model, both without fine-tuning). When combined with the YOLOv8-s detector, the system localizes a single pedestrian in 7.52 ms per frame and maintains real-time operation (>30 Hz) for up to nine individuals, with a total end-to-end latency of approximately 32.56 ms.

1. Introduction

Reliable pedestrian localization is a fundamental requirement in a wide range of domains, including autonomous driving [1], service robotics [2,3], and assistive navigation [4,5]. In urban sidewalk environments, pedestrians move dynamically among buildings, street infrastructure, and other mobile agents. Even minor localization errors in such safety-critical contexts can have severe consequences, whether for autonomous vehicles maneuvering safely in traffic [6] or wearable assistive devices guiding visually impaired users [7]. Accurate spatial perception is therefore indispensable, and stereo vision has been widely adopted for this purpose [8,9,10]. By emulating human binocular vision, it provides dense depth cues without external infrastructure, offering a flexible and scalable solution for spatial perception [11,12,13].
Over the past two decades, binocular stereo vision has developed rapidly [14,15]. Early methods, such as Block Matching (BM) [16] and Semi-Global Matching (SGM) [17], established the foundation for efficient disparity estimation. More recently, deep learning approaches, such as GC-Net [18] and PSMNet [19], established the paradigm of deep cost aggregation. Subsequent works, including RAFT-Stereo [20] and its variants [21,22], have substantially improved disparity accuracy and robustness under challenging conditions such as occlusion or low-texture regions. However, inherent limitations remain. The geometric configuration of binocular systems confines estimation to a single baseline plane, resulting in depth uncertainty that increases quadratically with distance [23]. This limitation severely compromises long-range pedestrian localization [24]. Moreover, deep learning–based solutions, while powerful, often impose high computational costs that hinder deployment in lightweight, real-time systems [25,26,27]. These challenges underscore the need for stereo vision methods that are both more robust and computationally efficient, particularly in outdoor pedestrian localization.
A common strategy for pedestrian localization is to combine object detection networks with depth information derived from stereo vision systems [28,29,30]. For instance, Zuo et al. [30] utilize the Semi-Global Block Matching (SGBM) algorithm to acquire a depth matrix of the platform scene and employ the YOLOX object detection algorithm to identify the positions of passengers on subway platforms. Although effective for scene understanding, dense stereo methods are often computationally redundant for target-specific localization tasks, as they devote significant resources to background regions irrelevant to the object of interest. To achieve efficient target positioning, researchers have developed feature-based methods in binocular stereo vision, which involve extracting and matching feature points. These approaches are computationally efficient and have been applied to pedestrian localization [31,32,33]. For example, Wei et al. [32] proposed a remote distance ranging method based on an improved YOLOv5. In another study, Guo et al. [33] developed a positioning system incorporating edge detection and the scale-invariant feature transform. However, these binocular approaches depend on a limited number of feature points within the target region. For distant pedestrians, these features can be sparse or unreliable, leading to noisy depth estimates and frequent mismatches. The trade-off between the computational overhead of dense methods and the instability of sparse methods has motivated research into hybrid strategies. Leveraging compact attention mechanisms has emerged as a promising alternative to processing dense cost volumes, offering a path toward solutions that are both lightweight and accurate. Attention mechanisms improve matching accuracy by effectively fusing spatial context with geometric correspondence cues [34,35]. Moreover, adopting a multi-scale architecture can further improve the fidelity of details [36,37]. Despite these algorithmic advances, the fundamental geometry of a two-camera system still entails depth uncertainty that grows quadratically with distance.
Trinocular stereo vision has recently been proposed as a promising extension to overcome these shortcomings [38,39,40]. By introducing a third viewpoint vertically offset from the horizontal baseline, trinocular systems form a more complete perception cone [41]. This configuration enriches disparity estimation, improves depth accuracy at longer ranges, and increases resilience against occlusions [42,43]. Researchers have utilized trinocular systems for applications such as scene reconstruction and robotic navigation, where the additional view has improved the robustness of visual odometry [44] and enabled more accurate 360-degree 3D reconstruction [45]. In the field of metrology, trinocular vision has been successfully applied to specialized measurement tasks, such as determining the ground clearance of transmission lines [46], measuring spatially encoded artifacts [47], and building intelligent object measurement systems [48].
Although trinocular vision holds considerable potential, research in this area remains limited, and few methods directly address the demands of pedestrian localization in outdoor sidewalk environments. One reason is the added algorithmic complexity and computational overhead that come with processing a third viewpoint. Many existing studies emphasize holistic tasks such as dense disparity estimation or large-scale scene reconstruction [42,43,45], which introduce high costs unsuited to real-time pedestrian localization. Others [46,48] treat trinocular vision simply as multiple binocular pairs, overlooking the geometric advantages of the full system. In addition, current trinocular solutions are often designed for specialized metrology applications, relying on well-defined object edges or controlled environments. Consequently, these pipelines do not adapt well to dynamic and unstructured settings, where challenges such as long-range pedestrian localization remain unresolved. Despite the theoretical advantages of trinocular geometry, its practical potential is still largely unrealized, leaving a critical gap in safety-critical applications like pedestrian localization.
In this work, we present a trinocular stereo vision method designed specifically for pedestrian localization in outdoor sidewalk scenes. The proposed system arranges three cameras in an approximately equilateral triangle configuration, which broadens the effective baseline and introduces a vertical viewpoint that contributes complementary disparity information. Building on this geometry, we first employ a multiscale template matching algorithm to rapidly estimate the pedestrian’s position from the primary binocular pair, ensuring efficient computation without sacrificing accuracy. This initial estimate is then refined by exploiting the third vertical viewpoint through an optimization strategy that combines three-view geometric constraints with similarity measures. By uniting these elements, the proposed framework achieves both high depth accuracy and computational efficiency, offering a practical pathway for deploying stereo vision systems in safety-critical pedestrian localization tasks.
This paper is organized as follows: Section 2 details the proposed methodology. Section 3 evaluates the performance of our method on a custom trinocular vision dataset captured on outdoor sidewalks. Section 4 discusses the results and their potential applications. Finally, Section 5 concludes this study.

2. Methodology

In this paper, we present a trinocular stereo vision system for pedestrian localization, wherein the cameras are arranged at the vertices of an equilateral triangle with parallel optical axes. We also introduce a trinocular stereo vision method based on local template matching. First, a multi-scale stereo matching algorithm is applied to the left–right pair to establish the initial disparity and corresponding target position. Subsequently, a three-view constraint optimization refines the estimate by incorporating information from all three views.
The overall architecture of the proposed method is illustrated in Figure 1 and the methodology is summarized in Appendix A.

2.1. Camera Settings and Symbols

The trinocular camera system is composed of three cameras, designated as left ($C_l$), right ($C_r$), and top ($C_t$), which are arranged in an approximately equilateral triangle. The principal optical axes of the cameras are parallel. The $C_l$ and $C_r$ cameras constitute the primary binocular stereo pair, and $C_t$ is positioned vertically above the midpoint of the $C_l C_r$ baseline.
The process of camera calibration and image rectification involves the following steps:
  • Intrinsic calibration: Each camera ($C_l$, $C_r$, $C_t$) was calibrated independently to obtain its intrinsic parameters: the intrinsic matrix $K$, which encodes the focal lengths $f_x$, $f_y$ and the principal point coordinates $c_x$, $c_y$, and the distortion coefficient vector $D$, which accounts for the primary radial and tangential lens distortions. The intrinsic parameters are denoted as $(K_l, D_l)$, $(K_r, D_r)$, and $(K_t, D_t)$.
  • Extrinsic calibration: The spatial relationship between the cameras was determined by performing extrinsic calibration. With $C_l$ serving as the reference coordinate frame, we computed the rotation matrix $R$ and translation vector $t$ for the $C_l$–$C_r$ and $C_l$–$C_t$ pairs, denoted as $(R_{lr}, t_{lr})$ and $(R_{lt}, t_{lt})$.
  • Image Rectification: The stereo image pair $(I_l, I_r)$ was rectified using the extrinsic parameters $(R_{lr}, t_{lr})$ to align the epipolar lines, yielding the reprojection matrix $Q_{lr}$ and the rectification rotation matrices $R_{rect\_l}$ and $R_{rect\_r}$, which transform the original camera coordinate systems into their rectified counterparts. The top image $I_t$ was undistorted using its intrinsic parameters $(K_t, D_t)$. A minimal code sketch of this step follows the list.
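As an illustration of the calibration and rectification pipeline above, the following Python/OpenCV sketch shows how the rectification outputs ($Q_{lr}$, $R_{rect\_l}$, $R_{rect\_r}$) could be produced. The intrinsics, extrinsics, and image size below are placeholder values; in practice they come from the checkerboard calibration described in Section 3.1, and the actual implementation may differ in details.

```python
import cv2
import numpy as np

# Placeholder calibration results (illustrative values only; the real parameters
# come from checkerboard calibration as described in Sections 2.1 and 3.1).
image_size = (1280, 720)
K_l = K_r = K_t = np.array([[1000.0, 0.0, 640.0],
                            [0.0, 1000.0, 360.0],
                            [0.0, 0.0, 1.0]])
D_l = D_r = D_t = np.zeros(5)                                 # radial + tangential coefficients
R_lr = np.eye(3); t_lr = np.array([[-0.5], [0.0], [0.0]])     # ~50 cm horizontal baseline
R_lt = np.eye(3); t_lt = np.array([[-0.25], [-0.43], [0.0]])  # top camera above baseline midpoint

# Rectify the primary (left-right) pair. R_rect_l / R_rect_r rotate the original
# camera frames into their rectified counterparts; Q_lr is the disparity-to-depth
# reprojection matrix used later for triangulation.
R_rect_l, R_rect_r, P_rect_l, P_rect_r, Q_lr, _, _ = cv2.stereoRectify(
    K_l, D_l, K_r, D_r, image_size, R_lr, t_lr, alpha=0)

map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, D_l, R_rect_l, P_rect_l, image_size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K_r, D_r, R_rect_r, P_rect_r, image_size, cv2.CV_32FC1)

def rectify_frames(raw_left, raw_right, raw_top):
    """Warp the raw captures into the rectified pair (I_l, I_r) and undistort I_t."""
    I_l = cv2.remap(raw_left, map_lx, map_ly, cv2.INTER_LINEAR)
    I_r = cv2.remap(raw_right, map_rx, map_ry, cv2.INTER_LINEAR)
    I_t = cv2.undistort(raw_top, K_t, D_t)   # the top view is only undistorted
    return I_l, I_r, I_t
```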

2.2. Multi-Scale Template Matching for Initialization

This section introduces an improved multi-scale template matching algorithm for rapid correspondence initialization from a rectified stereo pair $(I_l, I_r)$. The method employs a two-stage matching strategy to efficiently locate the match. In the first stage, a coarse estimate is established using an image pyramid-based search with a preprocessed template $T$. Subsequently, the second stage refines this estimate by matching the original template $T_0$ within a local search window centered on the result of the coarse search.

2.2.1. Template Preprocessing

Prior to the matching process, a template preprocessing step is applied. An original template $T_0$ is first acquired from the target pedestrian's region of interest (ROI) in $I_l$, parameterized by its center $(u_l, v_l)$ and dimensions $(w_0, h_0)$. The template's area is calculated as $A_0 = w_0 \cdot h_0$. To handle variations in scale, a cropped template $T$ with dimensions $(w, h)$ is then defined. The dimensions of $T$ are determined from the area $A_0$ of the original template $T_0$, anchored at the same center point $(u_l, v_l)$, according to the following conditions:
  • If $A_0 < A_{min}$, the template is expanded by a factor $r_{exp}$ to incorporate richer contextual information from the surrounding region. The resulting dimensions are $(w, h) = (r_{exp} w_0, r_{exp} h_0)$.
  • If $A_0 > A_{max}$, the template is cropped by a factor $r_{cut}$ to remove redundant information that could degrade matching stability. The resulting dimensions are $(w, h) = (r_{cut} w_0, r_{cut} h_0)$.
  • If $A_{min} \le A_0 \le A_{max}$, the template is considered adequate, and the dimensions remain $(w, h) = (w_0, h_0)$.
In the subsequent steps, the cropped template $T$ is used for the image pyramid-based search, while the original template $T_0$ is used for the precise refinement within the local search window.
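The area-based rule above can be expressed compactly in code. The sketch below is a minimal illustration, assuming the thresholds and ratios take the default values listed in Appendix B.2 ($A_{min} = 4096$, $A_{max} = 65{,}536$, $r_{exp} = 1.5$, $r_{cut} = 0.5$); it only computes the adjusted template dimensions and leaves the actual cropping of $I_l$ to the caller.

```python
def preprocess_template_size(w0, h0, A_min=4096, A_max=65536, r_exp=1.5, r_cut=0.5):
    """Return the dimensions (w, h) of the cropped/expanded template T.

    The template stays centered at (u_l, v_l); only its size changes:
      - small ROIs are expanded to capture more context,
      - large ROIs are cropped to suppress redundant background.
    Default thresholds follow Table A2 of the paper.
    """
    A0 = w0 * h0
    if A0 < A_min:                       # too small: expand by r_exp
        return int(round(r_exp * w0)), int(round(r_exp * h0))
    if A0 > A_max:                       # too large: crop by r_cut
        return int(round(r_cut * w0)), int(round(r_cut * h0))
    return w0, h0                        # adequate: keep the original size


# Example: a distant pedestrian with a 48 x 80 px ROI (area 3840 < A_min)
# is expanded to a 72 x 120 px template.
print(preprocess_template_size(48, 80))   # -> (72, 120)
```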

2.2.2. Matching Strategy

To accelerate the search and matching process, we employ an image pyramid strategy. Specifically, we construct an image pyramid for the search image $I_r$. Each level of the pyramid is generated by down-sampling the preceding one, as illustrated in Figure 2.
The search proceeds from the top level $L_N$ down to the bottom level $L_0$. At each level $k$, the search for the template $T_k$ is performed within the down-sampled image $I_{r,k}$. Within the search area of $I_{r,k}$, candidate windows of the same size as $T_k$ are extracted. We compute the grayscale similarity between the template $T_k$ and each candidate window $W_{u,v}$ using the zero-normalized cross-correlation (ZNCC) score, which is robust to affine (gain and offset) illumination changes and is thus well suited for template matching. The score is defined as:
\mathrm{ZNCC}(u, v) = \frac{\sum_{i,j} \big(T_k(i,j) - \bar{T}_k\big)\big(W_{u,v}(i,j) - \bar{W}_{u,v}\big)}{\sqrt{\sum_{i,j} \big(T_k(i,j) - \bar{T}_k\big)^2} \cdot \sqrt{\sum_{i,j} \big(W_{u,v}(i,j) - \bar{W}_{u,v}\big)^2}}
where $T_k$ is the template at level $k$, $\bar{T}_k$ is its mean intensity, and $\bar{W}_{u,v}$ is the mean intensity of the candidate window in $I_{r,k}$ at position $(u, v)$.
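For reference, a direct NumPy implementation of the ZNCC score defined above might look as follows; it is a plain sketch (no vectorized sliding-window optimization), with a small epsilon added to guard against zero variance in flat patches.

```python
import numpy as np

def zncc(template, window, eps=1e-8):
    """Zero-normalized cross-correlation between two equally sized gray patches.

    Returns a score in [-1, 1]; invariant to gain/offset illumination changes,
    which is why it is used for template matching here.
    """
    t = template.astype(np.float64) - template.mean()
    w = window.astype(np.float64) - window.mean()
    denom = np.sqrt((t * t).sum()) * np.sqrt((w * w).sum())
    return float((t * w).sum() / (denom + eps))
```

In practice, OpenCV's `cv2.matchTemplate` with the `TM_CCOEFF_NORMED` mode computes the same quantity over all window positions with a highly optimized kernel, which is one way the sliding-window evaluation can be accelerated.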
Since the image pair $(I_l, I_r)$ has been epipolarly rectified, the search is restricted to a local region. For each level $k$, the search region is centered on an initial position $(u_{0,k}, v_{0,k})$, and the position with the maximum score is selected as the optimal location:
(u_k^*, v_k^*) = \arg\max_{|u_k - u_{0,k}| \le d_u,\; |v_k - v_{0,k}| \le d_v} \mathrm{ZNCC}(u_k, v_k)
where $(u_k^*, v_k^*)$ denotes the optimal coordinates at level $k$, and $d_u$ and $d_v$ define the horizontal and vertical search ranges, respectively. At the initial level $L_N$, the horizontal search is performed globally. The optimal position $(u_k^*, v_k^*)$ is then scaled by a factor $s$ (typically $s = 2$) to serve as the initial position for the next finer level $k-1$, according to the relation $(u_{0,k-1}, v_{0,k-1}) = (s \cdot u_k^*, s \cdot v_k^*)$.
Subsequently, in the second-stage matching, we define a local search window at the finest level $L_0$, which corresponds to the original image $I_r$. The rectangular region $\Omega_{LSW}$ is centered at the first-stage matching point $p_r = (u_r, v_r)$ and is defined as:
\Omega_{LSW} = \{ (u, v) \;:\; |u - u_r| \le D_u,\; |v - v_r| \le D_v \}
where $D_u$ and $D_v$ determine the size of $\Omega_{LSW}$. Within $\Omega_{LSW}$, the position yielding the maximum score against the original template $T_0$ is taken as the final match $p_r = (\hat{u}_r, \hat{v}_r)$, as shown in Figure 3.
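Putting the two stages together, the coarse-to-fine search could be organized as in the following sketch. It assumes the `zncc` helper from the previous snippet, a pyramid with $N = 3$ levels and scale factor $s = 2$, and the default search-range parameters from Appendix B.2; it works with top-left corners rather than patch centers and is illustrative rather than a literal transcription of the authors' implementation.

```python
import cv2

def best_in_region(image, template, u0, v0, du, dv):
    """Exhaustive ZNCC search for `template` around the top-left corner (u0, v0)."""
    th, tw = template.shape
    best, best_uv = -2.0, (u0, v0)
    for v in range(max(0, v0 - dv), min(image.shape[0] - th, v0 + dv) + 1):
        for u in range(max(0, u0 - du), min(image.shape[1] - tw, u0 + du) + 1):
            s = zncc(template, image[v:v + th, u:u + tw])
            if s > best:
                best, best_uv = s, (u, v)
    return best_uv

def two_stage_match(I_r, T, T0, v_l, N=3, s=2, d_u=10, d_v=2, D_u=20, D_v=20):
    """Stage 1: pyramid search with the preprocessed template T.
       Stage 2: refinement with the original template T0 in a local window."""
    pyr_img, pyr_tpl = [I_r], [T]
    for _ in range(N):
        pyr_img.append(cv2.pyrDown(pyr_img[-1]))
        pyr_tpl.append(cv2.pyrDown(pyr_tpl[-1]))

    # Coarse level L_N: epipolar rectification lets us search the whole row band.
    u, v = best_in_region(pyr_img[N], pyr_tpl[N],
                          u0=pyr_img[N].shape[1] // 2, v0=v_l // (s ** N),
                          du=pyr_img[N].shape[1], dv=d_v)
    # Finer levels: scale the estimate by s and refine locally.
    for k in range(N - 1, -1, -1):
        u, v = best_in_region(pyr_img[k], pyr_tpl[k], s * u, s * v, d_u, d_v)

    # Stage 2: local search window around the coarse match with the original template.
    return best_in_region(I_r, T0, u, v, D_u, D_v)
```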

2.3. Three-View Constraint Optimization

Using the matching pixel coordinates $p_l = (u_l, v_l)$ in image $I_l$ and $p_r = (\hat{u}_r, \hat{v}_r)$ in image $I_r$, we first compute an initial point to establish a depth prior. The depth estimate is then refined using a similarity measure constrained by all three views. This process consists of two main parts: (1) initial point triangulation and search space construction, and (2) maximization of the three-view consistency along the defined depth search axis.

2.3.1. Initial Point Triangulation and Search Space Construction

The initial point can be computed from the pair of matching pixel coordinates using the reprojection matrix $Q_{lr}$ and is expressed in the left camera coordinate system. Given $p_l = (u_l, v_l)$, $p_r = (\hat{u}_r, \hat{v}_r)$, and the disparity $d = u_l - \hat{u}_r$, the initial point $P_0 = (X_0, Y_0, Z_0)^T$ is calculated as:
P_0 = [X_0, Y_0, Z_0, 1]^T = \frac{1}{\omega} Q_{lr} [u_l, v_l, d, 1]^T
where $\omega$ is the normalization coefficient of the homogeneous coordinates.
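The initial triangulation reduces to a single matrix-vector product with the reprojection matrix. A minimal sketch, reusing the $Q_{lr}$ produced by `cv2.stereoRectify` in the earlier snippet (variable names are assumptions carried over from that sketch):

```python
import numpy as np

def triangulate_initial_point(p_l, p_r_hat, Q_lr):
    """Back-project the matched pixel pair into the rectified left camera frame.

    p_l     : (u_l, v_l) center of the detection in the left image
    p_r_hat : (u_r_hat, v_r_hat) refined match in the right image
    Q_lr    : 4x4 disparity-to-depth reprojection matrix from stereo rectification
    """
    u_l, v_l = p_l
    u_r_hat, _ = p_r_hat
    d = u_l - u_r_hat                                # horizontal disparity
    X, Y, Z, w = Q_lr @ np.array([u_l, v_l, d, 1.0])
    return np.array([X, Y, Z]) / w                   # P_0 = (X_0, Y_0, Z_0)
```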
The search space is then reduced to a one-dimensional depth axis, whose direction is defined by the unit vector pointing from the left camera's optical center $O_l$ to the initial point $P_0$. The unit vector $\alpha$ is given by:
\alpha = \frac{P_0 - O_l}{\lVert P_0 - O_l \rVert} \in \mathbb{R}^3
The search extent along this axis ($\Delta z$) is determined by the disparity uncertainty. We use the heuristic that objects appearing larger in the image may possess greater depth uncertainty. Therefore, we associate the disparity uncertainty $\delta_d$ with $w_0$, the width of the initial ROI, such that $\delta_d = \lambda_z \cdot w_0$, where $\lambda_z$ is a scaling factor that controls the search range. Since the rectified stereo geometry gives $Z = f B_{lr} / d$, differentiating with respect to $d$ yields the first-order relation between this disparity uncertainty and the depth search range, $\Delta z \approx \frac{f \cdot B_{lr}}{d^2} \delta_d$, where $f$ is the focal length of the left camera, $B_{lr}$ is the baseline distance (which can be derived from $Q_{lr}$), and $d$ is the initial disparity estimate of the target.
For a candidate depth offset $z \in [-\Delta z, \Delta z]$ along the search axis, the corresponding candidate point $P_z$ is calculated as:
P_z = P_0 + z \cdot \alpha

2.3.2. Depth Optimization

Within the defined depth search segment, we sample a set of candidate 3D points $P_z$ with a uniform step size, denoted as $z_{step}$. Each candidate point $P_z$ is then reprojected onto the left, right, and top image planes to obtain the corresponding points $p_l^z$, $p_r^z$, and $p_t^z$, given by
p_l^z = \pi_l(P_z)
p_r^z = \pi_r(R_{lr} P_z + t_{lr})
p_t^z = \pi_t\big(R_{lt} \cdot R_{rect\_l}^{-1} P_z + t_{lt}\big)
Here, $\pi_l$ and $\pi_r$ represent the projective transformations of the rectified left and right cameras, while $\pi_t$ is the projective transformation of the top camera. The term $R_{rect\_l}^{-1}$ is the rotation matrix that transforms points from the rectified left camera coordinate system back to the original left camera coordinate system.
The total similarity function is defined by the photometric consistency of image patches across the different views, as shown in Figure 4. Specifically, it aggregates the patch similarity scores from the $(I_l, I_r)$ and $(I_l, I_t)$ image pairs. The maximum score corresponds to the refined depth estimate, as it indicates the highest degree of similarity among the corresponding patches in all three views.
The depth optimization involves the following steps (a code sketch follows the list):
  • The reference patch, $\mathrm{Patch}_{ref}$, of size $M \times M$ pixels, is extracted from the left view $I_l$, centered at the point $p_l = (u_l, v_l)$. The patch size $M$ is determined by $M = r_m \cdot \min(w_0, h_0)$, where $w_0$ and $h_0$ are the width and height of the initial ROI, respectively, and $r_m$ is a scaling factor that relates the patch size to the size of the ROI.
  • For each candidate point $P_z$, we extract the corresponding patches, $\mathrm{Patch}_r$ and $\mathrm{Patch}_t$, from the right view $I_r$ and the top view $I_t$, centered on their respective projected coordinates, $p_r^z$ and $p_t^z$.
  • The similarity scores between $\mathrm{Patch}_{ref}$ and the patches from the right and top views ($\mathrm{Patch}_r$ and $\mathrm{Patch}_t$) are then calculated as:
    \mathrm{ZNCC}_{lr} = \mathrm{ZNCC}(\mathrm{Patch}_{ref}, \mathrm{Patch}_r)
    \mathrm{ZNCC}_{lt} = \mathrm{ZNCC}(\mathrm{Patch}_{ref}, \mathrm{Patch}_t)
  • The total similarity score for a candidate point $P_z$ is defined as the weighted sum of the individual scores:
    C(z) = \lambda_r \cdot \mathrm{ZNCC}_{lr} + \lambda_t \cdot \mathrm{ZNCC}_{lt}, \qquad \lambda_r + \lambda_t = 1
    where $\lambda_r$ and $\lambda_t$ are the weighting coefficients for the right and top views, respectively.
  • By iterating through all candidate points $P_z$, the point that yields the maximum total score is selected as the optimal estimate, $P^* = P_{z^*}$ with $z^* = \arg\max_z C(z)$.
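The optimization loop can be summarized in a few lines of Python. The sketch below assumes the helpers and calibration variables introduced in the previous snippets (`zncc`, `K_l`, `K_r`, `K_t`, `R_lr`, `t_lr`, `R_lt`, `t_lt`, `R_rect_l`), treats the projection operators $\pi$ as simple pinhole projections on already rectified/undistorted images, and uses hypothetical `project` and `extract_patch` helpers; it is illustrative rather than a literal transcription of the authors' implementation.

```python
import numpy as np

def project(K, P_cam):
    """Pinhole projection of a 3D point given in the camera's own frame."""
    p = K @ P_cam
    return p[:2] / p[2]

def extract_patch(img, center, M):
    """Crop an (approximately) M x M patch centered at `center` = (u, v)."""
    u, v = int(round(center[0])), int(round(center[1]))
    r = M // 2
    return img[v - r:v + r + 1, u - r:u + r + 1]

def refine_depth(P_0, O_l, images, cams, M, dz, z_step=0.1, lam_r=0.5, lam_t=0.5):
    """Scan candidate points along the viewing ray and keep the most consistent one."""
    I_l, I_r, I_t = images
    K_l, K_r, K_t, R_lr, t_lr, R_lt, t_lt, R_rect_l = cams

    alpha = (P_0 - O_l) / np.linalg.norm(P_0 - O_l)          # unit search direction
    patch_ref = extract_patch(I_l, project(K_l, P_0), M)     # reference patch in I_l

    best_score, best_P = -np.inf, P_0
    for z in np.arange(-dz, dz + 1e-9, z_step):
        P_z = P_0 + z * alpha
        p_r = project(K_r, R_lr @ P_z + t_lr.ravel())
        p_t = project(K_t, R_lt @ np.linalg.inv(R_rect_l) @ P_z + t_lt.ravel())
        patch_r = extract_patch(I_r, p_r, M)
        patch_t = extract_patch(I_t, p_t, M)
        if patch_r.shape != patch_ref.shape or patch_t.shape != patch_ref.shape:
            continue                                         # candidate fell outside an image
        score = lam_r * zncc(patch_ref, patch_r) + lam_t * zncc(patch_ref, patch_t)
        if score > best_score:
            best_score, best_P = score, P_z
    return best_P, best_score
```

Here `O_l` is the left camera's optical center, i.e., `np.zeros(3)` when working in the left camera frame, and `dz` corresponds to the search half-range $\Delta z$ derived in Section 2.3.1.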

3. Experimental Results

To evaluate the performance of the method proposed in this paper, we constructed a real-world trinocular stereo vision dataset captured from two sidewalk scenes. The dataset we constructed contains 12 sequences of pedestrians moving within a range of 2 to 18 m, totaling 3943 valid frames.

3.1. Dataset Collection and Parameters

3.1.1. Dataset Collection

As shown in Figure 5, the experimental platform integrates our trinocular vision system with a commercial depth camera (Orbbec-335L, Orbbec Inc., Shenzhen, China) and the three cameras are arranged in an approximately equilateral triangle configuration with a baseline length of about 50 cm. The trinocular system consists of three IMX415 (Sony, Beijing, China) camera modules connected to a computing platform via USB 2.0 interfaces. We implemented synchronized image acquisition using Python 3.9. All images were captured at a resolution of 1280 × 720 pixels and a frame rate of 30 FPS.
The dataset we constructed contains 12 sequences of pedestrians moving within a range of 2 to 18 m, captured from two sidewalk scenes. The primary scene includes 8 sequences of moving pedestrians, totaling 2956 valid frames. To facilitate cross-scene validation, a supplementary scene was captured, contributing an additional 4 sequences, totaling 987 valid frames, and featuring different lighting conditions and backgrounds. To capture the ground truth for localization, an additional reference camera was positioned to overlook the entire measurement area. The precise world coordinates of the target pedestrian were acquired by post-processing the video from this camera. Specifically, this approach leverages a homography transformation to map annotated 2D pixel coordinates from the static reference camera’s image plane to the 2D world ground plane, which is a standard methodology for such evaluations [49]. These coordinates were used to calculate the true distance between the target and the trinocular system (Figure 6). Figure 7 shows sample frames in the left camera view that illustrate the two sidewalk scenes for data collection.
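For context, mapping annotated pixel coordinates from the static reference camera to the ground plane with a homography can be done as in the sketch below; the four point correspondences and the rig position are hypothetical placeholders standing in for surveyed ground markers, not the values used in our evaluation.

```python
import cv2
import numpy as np

# Hypothetical correspondences between image pixels of surveyed ground markers
# (reference camera view) and their 2D world coordinates on the ground plane, in meters.
img_pts = np.array([[312, 655], [1020, 640], [880, 310], [420, 320]], dtype=np.float32)
world_pts = np.array([[0.0, 2.0], [3.0, 2.0], [3.0, 18.0], [0.0, 18.0]], dtype=np.float32)

H, _ = cv2.findHomography(img_pts, world_pts)

def pixel_to_ground(u, v):
    """Map an annotated pedestrian foot point (u, v) to ground-plane coordinates."""
    pt = cv2.perspectiveTransform(np.array([[[u, v]]], dtype=np.float32), H)
    return pt[0, 0]   # (X_world, Y_world) in meters

# Ground-truth range = Euclidean distance from the trinocular rig's ground position.
rig_xy = np.array([1.5, 0.0])           # hypothetical rig location on the ground plane
gt_distance = np.linalg.norm(pixel_to_ground(700, 480) - rig_xy)
```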

3.1.2. Parameters

We compare the method proposed in this paper with several baseline methods. First, we adopt the SGBM-based positioning approach [30], utilizing the left and right camera views from our trinocular module as input. This allows us to compare the performance improvement of a trinocular system over a binocular one under an identical baseline of 50 cm. SGBM is a classic and widely used binocular method, which we implemented in OpenCV. The key parameter configurations for the SGBM algorithm and our proposed method, which contains no learnable parameters, are provided in Appendix B.1 and Appendix B.2, respectively. For our second comparison, we selected the Orbbec-335L, a commercial depth camera with a baseline of 9.5 cm, to provide a practical comparison against our wide-baseline system. For our third comparison, we adopt RAFT-Stereo [20], a widely adopted deep learning-based stereo model, again utilizing the left and right camera views from our trinocular module as input. For this, we utilize two officially pre-trained models: the RAFT-Stereo Middlebury model, which we refer to as the standard model, and the Realtime model, both of which were applied directly without any fine-tuning on our dataset to ensure a fair assessment of their generalization capabilities. For the real-time version, we use the configuration with the number of down-sampling stages set to 3, the number of GRU layers set to 2, and the number of iterative refinements set to 7.
The cameras of the trinocular system are calibrated using the checkerboard calibration method available in the MATLAB 2025a Toolbox, following the procedure described in Section 2.1. The calibrated intrinsic and extrinsic parameters (relative to the left camera coordinate system) are presented in Table 1 and Table 2.

3.2. Accuracy Evaluation

3.2.1. Comparison of Accuracy

For accuracy evaluation, we compare our proposed method (denoted as Match + Trino) with four baseline methods: the SGBM algorithm, the Orbbec stereo camera, RAFT-Stereo, and RAFT-Stereo (Realtime). To ensure a fair comparison, all methods are evaluated under a consistent framework. Specifically, the initial Regions of Interest (ROIs) for our method, SGBM, and both RAFT-Stereo versions are generated by a single detector operating on the left camera view. For the Orbbec stereo camera, the ROI is extracted from its own RGB image. For these baseline methods, the target’s depth is then determined by extracting the closest valid depth value near the center of its ROI from their respective disparity or depth maps.
The quantitative evaluation of localization accuracy is presented in Figure 8. Figure 8a illustrates the mean absolute error of the five methods, and Figure 8b quantifies the performance gain of our method relative to the four baselines.
As shown in Figure 8a, the localization error for all methods tends to increase with distance. Compared to the other binocular baseline methods, our proposed method consistently demonstrates a lower error across the evaluated range (2 to 18 m). The performance gap is particularly evident at longer ranges. For instance, at approximately 18 m, our method's depth estimation error is 1.02 m, substantially lower than that of SGBM (1.31 m), RAFT-Stereo (Realtime) (1.42 m), RAFT-Stereo (1.47 m), and the Orbbec camera (2.44 m).
Figure 8b quantifies this improvement by showing the relative gain of our method over the other methods, representing the percentage reduction in error achieved by our approach. Beyond the 9 m threshold (indicated by the gray dashed line), our method reduces the depth estimation error by over 54% compared to Orbbec and typically by 20% to 40% compared to SGBM, RAFT-Stereo, and RAFT-Stereo (Realtime).

3.2.2. Ablation Study

To further validate the effectiveness of the three-view constraint optimization module proposed in Section 2.3, we conducted an ablation study. We compare the initial results obtained from multi-scale template matching (denoted as Match) with the final results after optimization (denoted as Match + Trino). The quantitative results of the ablation study are presented in Figure 9.
As shown in Figure 9a, the error after three-view optimization (Match + Trino) is consistently lower than the error of the initial results (Match) across the entire distance range. This result shows that introducing the third viewpoint as a geometric constraint effectively corrects the initial depth estimate and is important to enhancing the accuracy of our method. The gain curve in Figure 9b further quantifies this trend, indicating that the optimization yields progressively more benefit at longer distances. Beyond the 9 m threshold (indicated by the gray dashed line), the optimization consistently reduces the error by 30% to 44%, underscoring the Trino module’s critical role in long-range accuracy.
Table 3 provides a comprehensive performance comparison of each method on the entire dataset, summarizing key error statistics including mean absolute error (MAE), root mean square error (RMSE), and standard deviation (STD).
As shown in Table 3, our method achieves an MAE of 0.435 m, an RMSE of 0.615 m, and an STD of 0.434 m on the dataset. All three error statistics are lower than those of the other methods, indicating that our approach has better consistency and stability.
Furthermore, we evaluate our method's accuracy under different parameter settings. Specifically, we vary four parameter groups relative to the default configuration defined in Appendix B.2: (i) the view weights $\lambda_r$ and $\lambda_t$, (ii) the patch size ratio $r_m$, (iii) the depth step size $z_{step}$, and (iv) the depth range scale $\lambda_z$. The following tables report the MAE across three distance bins (2–6 m, 6–12 m, and 12–18 m), with 95% confidence intervals. The sample counts for the bins are 744, 1619, and 1550, respectively.
The results in Table 4 show that reducing the top-view weight ($\lambda_r = 0.75$, $\lambda_t = 0.25$) increases the error in all distance bins. This confirms the particular importance of the vertical geometric constraint provided by the top view, especially at longer distances. Conversely, increasing the top-view weight ($\lambda_r = 0.25$, $\lambda_t = 0.75$) beyond the default configuration ($\lambda_r = \lambda_t = 0.5$) yields comparable performance with no significant improvement.
Table 5 demonstrates that the patch size is a key factor affecting accuracy. Reducing the patch size ratio ($r_m$) leads to a clear increase in MAE, with long-distance errors rising from 0.70 m to 1.17 m. This indicates that excessively small patches fail to capture sufficient texture information from the target, resulting in less reliable similarity estimates. Conversely, increasing $r_m$ to 0.7 achieves performance comparable to the default setting of 0.5.
The results in Table 6 and Table 7 show the impact of the depth search step size ($z_{step}$) and the depth range scale ($\lambda_z$). The results for the different configurations are consistent, confirming that the multi-scale template matching stage provides effective initial depth estimates. This allows the subsequent depth optimization to refine them without requiring an excessively large search range. The default settings of $z_{step} = 0.1$ m and $\lambda_z = 0.15$ therefore limit the computational overhead of the optimization phase while maintaining accuracy.

3.3. Computational Performance

3.3.1. Comparison of Computational Efficiency

We evaluated the computational efficiency of the algorithms on 1280 × 720 resolution images using an experimental platform equipped with an Intel Core i7-14700KF CPU and an NVIDIA GeForce RTX 4090 GPU. The evaluation included our proposed method (Match + Trino) and its individual components, as well as three baseline algorithms: SGBM, RAFT-Stereo, and RAFT-Stereo (Realtime). Table 8 presents the average run time and resulting throughput for each.
As shown in Table 8, our method exhibits high computational efficiency for localizing individual targets. Unlike baseline methods with a fixed per-frame cost, our method scales linearly with the number of targets. For a single target ( N = 1 ), our complete pipeline requires about 3.13 ms, achieving a throughput of approximately 319 FPS, which exceeds the performance of RAFT-Stereo (Realtime) (47.0 FPS), SGBM (19.6 FPS), and the standard RAFT-Stereo (1.35 FPS). Separately, the Match step requires about 0.98 ms and the Trino optimization step takes approximately 2.15 ms, showing that the Trino optimization is the more computationally intensive component in our method.
Furthermore, the relationship between algorithm performance and target size is illustrated in Table 9, where target size is measured by the pixel area of the target’s ROI.
As shown in Table 9, the runtimes of SGBM, RAFT-Stereo, and RAFT-Stereo (Realtime) are all nearly independent of the target size, as they perform dense computations across the entire image. In contrast, the computational cost of our proposed method trends upward as the target's size decreases (i.e., as the target becomes more distant). It is most efficient for medium-sized targets (2.24 ms for the 128² px² bin) and increases for smaller, more distant targets.

3.3.2. Time Breakdown

Figure 10 shows the runtime trends for the total algorithm and its two main components (Match and Trino) as a function of the target's pixel area. Results in Figure 10 are binned by theoretical bin centers (e.g., 32², 64², 128² px²), and points are plotted at the empirical mean of each bin.
As shown in Figure 10, the runtime of the Match stage remains relatively stable at approximately 1.5 ms and is less affected by the target's size. In contrast, the Trino step is the primary factor influencing the method's total runtime, causing it to increase as the target's size decreases. Thus, the total runtime reflects the combined contributions of these two components. For smaller targets (approaching the 1024 px², i.e., 32², bin), the total runtime reaches approximately 4.0 ms.
Furthermore, Table 10 presents a breakdown of the computational cost for the Match and Trino stages, including the average patch size $M$, depth search range $\Delta z$, GFLOPs, and latency for each target size bin. For the Match stage, GFLOPs are computed as the cumulative cost of all ZNCC evaluations over the multi-scale search and refinement, where the cost of a single ZNCC on a $w \times h$ template is approximated as $10\,wh$ FLOPs. For the Trino stage, FLOPs are estimated as $N_z \cdot (2 \cdot 10 \cdot M^2 + \kappa)$, where $N_z = 2 \lfloor \Delta z / z_{step} \rfloor + 1$ is the number of depth samples, $M$ is the patch size, and the factor of 2 accounts for the two ZNCC evaluations per sample. The term $\kappa$ denotes a fixed non-pixel-level overhead per depth sample, which was calibrated from the measured latencies using a simple linear model to cover the costs of the Python loop, camera projections, and patch extraction.
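To make the cost model concrete, the short script below evaluates the Trino-stage estimate for a hypothetical bin; the patch size, search range, and the per-sample overhead $\kappa$ are illustrative stand-ins rather than the calibrated values behind Table 10.

```python
# Illustrative evaluation of the Trino-stage cost model (placeholder values,
# not the calibrated numbers used for Table 10).
M = 64            # patch size in pixels (M x M)
delta_z = 1.5     # depth search half-range in meters
z_step = 0.1      # depth sampling step in meters
kappa = 5.0e5     # assumed fixed per-sample overhead in FLOP-equivalents

N_z = 2 * int(delta_z / z_step) + 1          # number of depth samples (here 31)
flops = N_z * (2 * 10 * M**2 + kappa)        # two ZNCC evaluations per sample + overhead
print(f"N_z = {N_z}, estimated cost = {flops / 1e9:.3f} GFLOPs")
```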
The results in Table 10 show performance characteristics for the two stages. For the Match stage, the required computation (GFLOPs) and the resulting latency increase with target area. In contrast, the Trino stage shows higher latency, indicating a lower effective throughput of approximately 1.9–2.3 GFLOPs/s, compared to the Match stage’s median effective throughput of about 210 GFLOPs/s. This gap arises because the Match stage benefits from a highly optimized sliding-window kernel, while the Trino stage is limited by per-depth sample overheads and less efficient memory access.
To complete the pedestrian localization pipeline, we integrated our method with the YOLOv8-s object detector [50] for initial ROI generation. On our experimental dataset, the end-to-end system achieved an average inference time of 7.52 ms per frame for a single pedestrian, which consists of 4.39 ms for the YOLOv8-s detector and 3.13 ms for our proposed trinocular localization algorithm. The detector was configured with the pretrained yolov8s.pt weights. During testing, the input image size was set to 1280, the confidence threshold to 0.5, and the intersection over union (IoU) threshold for non-maximum suppression was 0.70.
To evaluate the system’s viability for real-time deployment with multiple pedestrians, we consider that the YOLOv8-s detector processes the entire image once, while our localization algorithm is applied sequentially to each of the N detected pedestrians. Consequently, the total processing time per frame can be estimated as 4.39 + N × 3.13 ms. This indicates that for a standard 30 Hz video stream (which provides a 33.3 ms budget per frame), our system can maintain real-time performance for up to N = 9 pedestrians, with a total processing time of approximately 32.56 ms.

3.4. Qualitative Analysis of Matching and Detection Inaccuracies

We present two qualitative examples illustrating matching and detection inaccuracies. In these cases, we manually introduced offsets to simulate errors in both the matching stage (Figure 11) and the initial ROI (Figure 12).
Figure 11 demonstrates the capability of the Trino stage to correct inaccurate input depth computed by the Match stage. To simulate a matching inaccuracy, we manually added a 15-pixel rightward offset to the initial disparity obtained from the Match stage. As shown in Figure 11a, the matched bounding box in the right view is shifted relative to the input detection in the left view (both boxes are shown in yellow). The initial 3D point corresponding to this incorrect match is then projected back onto the three views, with the extracted patches highlighted in red. Figure 11b shows the result after applying the three-view optimization in the Trino stage. The corresponding patches (highlighted in green) appear in the correct positions on the target, indicating that the matching inaccuracy has been corrected.
Figure 12 presents a simulated detection inaccuracy, where the initial ROI in the left view (shown as a yellow box) is slightly offset upward and to the right of the target. Consequently, the matching result yields a similarly offset correspondence in the right view (shown as a yellow box), since it relies on the inaccurate ROI in the left view as the template. After applying the Trino stage, the resulting 3D point is projected onto the three views. The corresponding patches are highlighted in green, which are correctly located on the target pedestrian. Since our optimization extracts patches from the ROI center, it is tolerant to small offsets in the initial detection from the left view.

4. Discussion

Experimental results show that the method proposed in this paper outperforms the baseline approaches in both accuracy and computational efficiency. The comparison with the SGBM and RAFT-Stereo methods was conducted under an identical baseline, as their input was sourced from the left and right cameras of our trinocular module. This configuration provides a direct validation of the performance advantage of trinocular over binocular systems.
In terms of positioning accuracy, our method achieves a lower error than the binocular methods (SGBM, the RAFT-Stereo methods, and the depth camera). The main reason is that our method is target-oriented, focusing computational resources on the target ROI. In contrast, the SGBM algorithm is a dense matching algorithm, which is prone to generating errors in low-texture areas, such as a pedestrian’s clothing. Similarly, the RAFT-Stereo methods, despite being state-of-the-art deep learning approaches, also exhibit higher errors. This is primarily because we utilized models pre-trained on standard datasets without any fine-tuning on our specific data. Consequently, the features learned by the pre-trained models do not generalize well to our dataset, as the gap between the training data and outdoor pedestrian scenes leads to degraded performance. The depth camera’s performance also degrades outdoors, particularly at long distances.
Furthermore, the experimental results indicate that the accuracy advantage of our method increases with distance, which can be attributed to the additional geometric constraint introduced by the third viewpoint from the top camera. In traditional binocular stereo vision, the horizontal disparity of a given point decreases as the distance increases. At long distances, the binocular system becomes more sensitive to disparity errors, causing small pixel estimation errors to be amplified into significant depth errors.
Our trinocular system introduces a vertical viewpoint, providing a non-collinear geometric constraint that complements horizontal disparity. The observation from the top camera effectively constrains the target's position along the depth axis. This mitigates the ambiguity inherent in the pixel estimation of horizontal stereo systems at long ranges. This additional constraint addresses a key limitation of binocular systems, enabling precise long-range measurements. The results of the ablation study (Table 4 and Figure 9) also support the critical role of the third camera.
In terms of computational efficiency, our method outperforms both the classic SGBM algorithm and RAFT-Stereo (Realtime) in runtime. The standard RAFT-Stereo, requiring 739.43 ms per frame, is ill-suited for real-time applications. As detailed in Table 8, our complete localization process handles a single target in an average of 3.13 ms. This advantage stems from the fact that the runtime of dense-matching methods, such as SGBM and RAFT-Stereo, is strongly coupled with image resolution, resulting in a significant computational load for our high-resolution (1280 × 720) dataset.
In contrast, our method mitigates this dependency. First, the initial matching process efficiently narrows the search space via a multi-scale strategy. Subsequently, the computational cost of the depth optimization under the three-view constraint is determined not by image resolution, but by the depth search range. When the multi-scale template matching stage provides a relatively accurate initial depth estimate, the subsequent optimization can refine it without requiring an excessively large search range. This targeted approach remains more efficient than the full-image computation required by both the SGBM and RAFT-Stereo methods.
The methodology proposed in this study demonstrates its potential for accurate localization in outdoor environments. It is suitable for applications such as autonomous driving, robot navigation, and intelligent surveillance. Moreover, the trinocular system can be further customized and optimized by adjusting physical deployment configuration, ranging from deployments on urban transport infrastructure and buildings to lightweight wearable devices. Nevertheless, several limitations remain, offering valuable directions for future research.
The proposed method operates as a localization-after-detection pipeline, which makes its performance fundamentally dependent on the success and accuracy of the initial 2D object detector. This dependency introduces a potential point of failure: a missed detection interrupts localization entirely, while an imprecise bounding box propagates errors and degrades the final estimate. To address this, future work should explore a tighter and more synergistic integration of the detection and localization modules. For example, an end-to-end network that jointly optimizes both tasks could reduce reliance on the accuracy of the initial bounding box. Alternatively, embedding the localization algorithm within a tracking framework would allow position estimates from previous frames to inform and refine object detection in subsequent frames, thereby creating a more robust and resilient system.
Furthermore, the choice of similarity measure method is largely determined by practical hardware deployment and application scenarios. In this work, we employed ZNCC as the similarity metric. While ZNCC is effective and efficient, its performance is constrained in challenging scenarios. The reliability of its matching score degrades significantly under occlusions, severe non-linear illumination changes, or when targets exhibit low-texture surfaces.
To extend the applicability of our framework, particularly for safety-critical systems, future research should explore more advanced matching strategies. Deep learning-based descriptor heads pre-trained on large-scale datasets are promising alternatives, as they excel at handling complex appearance variations and have demonstrated superior performance in challenging conditions. However, their adoption requires a careful consideration of the trade-off between their enhanced robustness and significant computational cost, especially for real-time applications. It is also important to explicitly address the occlusion challenge in crowded environments. Promising approaches include part-based matching, which reasons about visible object components, and robust aggregation methods like sub-block voting, which build consensus from partially matched regions. With these further improvements, we believe the proposed trinocular framework can be developed into a widely applicable solution across diverse application scenarios.

5. Conclusions

This paper presents a trinocular stereo vision system for pedestrian localization, with cameras arranged at the vertices of an equilateral triangle and parallel optical axes. Experimental results on our custom dataset show that our method achieves high accuracy, with a mean absolute error of 0.435 m, and high computational efficiency, with an average processing time of 3.13 ms per target. This performance exceeds that of the binocular SGBM algorithm (0.536 m and 19.6 FPS), RAFT-Stereo (Realtime) (0.621 m and 47.0 FPS), and the standard RAFT-Stereo (0.623 m and 1.35 FPS). Empirical results show that the performance advantage becomes more pronounced at longer ranges. For distances beyond 9 m, our method achieves a relative error reduction of 20% to 40% compared to the binocular methods under equivalent baseline settings. When integrated with a YOLOv8-s object detector, the end-to-end system is capable of maintaining real-time performance (>30 Hz) for up to nine pedestrians, with a total processing time of approximately 32.56 ms, underscoring its practical deployment viability.

Author Contributions

Conceptualization, J.Z., J.X. and S.X.; methodology, J.Z.; software, J.Z.; validation, J.Z., S.H. and Y.L.; formal analysis, J.Z., S.H. and Y.L.; investigation, J.Z. and S.H.; resources, J.X. and S.X.; data curation, J.Z.; writing—original draft preparation, J.Z.; writing—review and editing, S.H., Y.L. and J.X.; visualization, J.Z.; supervision, S.X.; project administration, S.X.; funding acquisition, J.X. and S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Key R&D Program of China (Grant No. 2024YFC3406302), Peking University, the Natural Science Foundation of China (Grant No. 12204273), and the Instrument Improvement Funds of Shandong University Public Technology Platform (ts20230111).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The overall methodology is summarized in Algorithm A1. The algorithm details the complete computational process, from the input images $(P_l, P_r, P_t)$, the original target template $T_0 = (u_l, v_l, w_0, h_0)$, and the camera calibration parameter set $S$, through the multi-scale template matching and the three-view constraint optimization, to the refined target position $P^*$ that is finally returned.
Algorithm A1 The proposed trinocular stereo vision localization algorithm
Input: Left image $P_l$, right image $P_r$, top image $P_t$, original target template (from the left image) $T_0 = (u_l, v_l, w_0, h_0)$, camera parameter set $S = \{K_l, K_r, K_t, R_{lr}, t_{lr}, R_{lt}, t_{lt}, Q_{lr}\}$.
Output: The refined target position $P^*$ (in the left camera coordinate system).

1: (I_l, I_r, I_t) = ImageRectification(P_l, P_r, P_t)
// Obtain the initial match from (I_l, I_r)
2: T = Preprocessing(T_0)
3: Build image pyramids for I_r and T.
4: for k from L_N down to 0 do
5:     if k == L_N then
6:         Initialize v_{0,k} = v_l
7:         Search (u_k*, v_k*) = argmax_{|v_k − v_{0,k}| ≤ d_v} ZNCC(u_k, v_k)   // global horizontal search
8:     else
9:         Initialize (u_{0,k}, v_{0,k}) = (s · u_{k+1}*, s · v_{k+1}*)
10:        Search (u_k*, v_k*) = argmax_{|u_k − u_{0,k}| ≤ d_u, |v_k − v_{0,k}| ≤ d_v} ZNCC(u_k, v_k)
11:    end if
12: end for
13: Use T_0 for the refinement search within
14:     Ω_LSW = {(u, v) : |u − u_r| ≤ D_u, |v − v_r| ≤ D_v}
15: Find the best matching point p_r = (û_r, v̂_r) for p_l = (u_l, v_l)
// Optimization based on (I_l, I_r, I_t)
16: P_0 = Triangulation(p_l, p_r, Q_lr)
17: for z in [−Δz, Δz] with step z_step do
18:     Candidate point P_z = P_0 + z · α
19:     Extract Patch_r, Patch_t at the projected points:
20:         (p_l^z, p_r^z, p_t^z) = Projection(P_z, S)
21:     Compute C(z) = λ_r · ZNCC_lr + λ_t · ZNCC_lt
22: end for
23: P* = argmax_z C(z)
24: return P*

Appendix B

Appendix B.1

The detailed parameter configurations for the SGBM algorithm, which serves as a baseline for comparison, are provided in Table A1. Specifically, the fundamental search range for disparity values is defined by minDisparity and numDisparities, with the latter set to 256. The matching process operates on a pixel window of blockSize = 5. The penalty coefficients P1 and P2, which control the smoothness constraint, are set to 1176 and 4704, respectively. To improve the quality of the resulting disparity map, post-processing is performed using disp12MaxDiff = 1, uniquenessRatio = 10, speckleWindowSize = 100, and speckleRange = 32 to filter out noise and unreliable matches. The mode is set to MODE_SGBM_3WAY, which aggregates matching costs along three path directions to balance accuracy and runtime.
Table A1. Parameters for the SGBM algorithm in OpenCV.

Parameter           Value
minDisparity        0
numDisparities      256
blockSize           5
P1                  1176
P2                  4704
disp12MaxDiff       1
uniquenessRatio     10
speckleWindowSize   100
speckleRange        32
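For reproducibility, the configuration in Table A1 corresponds to the following OpenCV call (a sketch of the baseline setup only; the surrounding detection and depth-lookup logic is omitted):

```python
import cv2

# SGBM baseline configured with the parameters of Table A1.
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=256,       # must be divisible by 16
    blockSize=5,
    P1=1176,
    P2=4704,
    disp12MaxDiff=1,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=32,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY,
)

# disparity = sgbm.compute(I_l_gray, I_r_gray).astype(float) / 16.0  # fixed-point to pixels
```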

Appendix B.2

The detailed parameter configurations for our proposed method are provided in Table A2. The parameters are organized into two main components: the multi-scale matching (Match) and the trinocular optimization (Trino).
In the initial matching process, the number of image pyramid levels is set to N = 3. The pyramid is constructed via OpenCV, which applies a fixed 5 × 5 Gaussian kernel for smoothing prior to each down-sampling step. The parameters governing template handling, including the expansion ratio ($r_{exp} = 1.5$), the cropping ratio ($r_{cut} = 0.5$), and the area thresholds ($A_{min} = 4096$, $A_{max} = 65{,}536$), were determined empirically on our custom dataset to balance the retention of sufficient contextual information for small targets with the need for stable matching in large ones. For the pyramid levels, the horizontal and vertical search ranges $d_u$ and $d_v$ are 10 and 2 pixels, respectively. At the finest level $L_0$, the local search window parameters $D_u$ and $D_v$ are both set to 20 pixels. In the depth optimization process, the disparity uncertainty ratio $\lambda_z$ is set to 0.15, and the depth search is sampled with a discrete step size $z_{step}$ of 0.1 m. The patch size ratio $r_m$ is set to 0.5, and the weighting coefficients for the right- and top-view matching scores are both set to 0.5, giving equal weight to the evidence from the two views.
Table A2. Parameters for the proposed method.

Component   Parameter   Value
Match       N           3
            A_min       4096
            A_max       65,536
            r_exp       1.5
            r_cut       0.5
            d_u         10
            d_v         2
            D_u         20
            D_v         20
Trino       λ_z         0.15
            z_step      0.1
            r_m         0.5
            λ_r         0.5
            λ_t         0.5
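The same defaults can be collected into a single configuration object, which is how the parameter values above are referenced in the illustrative snippets of Section 2 (the dictionary itself is a convenience of this sketch, not part of the authors' code):

```python
# Default parameters of the proposed method (Table A2), grouped by stage.
DEFAULT_PARAMS = {
    "match": {
        "pyramid_levels": 3,               # N
        "A_min": 4096, "A_max": 65536,
        "r_exp": 1.5, "r_cut": 0.5,
        "d_u": 10, "d_v": 2,               # per-level search ranges (px)
        "D_u": 20, "D_v": 20,              # local search window at L_0 (px)
    },
    "trino": {
        "lambda_z": 0.15,                  # depth range scale
        "z_step": 0.1,                     # depth sampling step (m)
        "r_m": 0.5,                        # patch size ratio
        "lambda_r": 0.5, "lambda_t": 0.5,  # view weights
    },
}
```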

References

  1. Choi, J.; Chun, D.; Kim, H.; Lee, H.-J. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October 2019; pp. 502–511. [Google Scholar]
  2. Trulls, E.; Corominas Murtra, A.; Pérez-Ibarz, J.; Ferrer, G.; Vasquez, D.; Mirats-Tur, J.M.; Sanfeliu, A. Autonomous navigation for mobile service robots in urban pedestrian environments. J. Field Robot. 2011, 28, 329–354. [Google Scholar] [CrossRef]
  3. Ahmed, D.B.; Díez, L.E.; Diaz, E.M.; Domínguez, J.J.G. A survey on test and evaluation methodologies of pedestrian localization systems. IEEE Sens. J. 2019, 20, 479–491. [Google Scholar] [CrossRef]
  4. Patel, I.; Kulkarni, M.; Mehendale, N. Review of sensor-driven assistive device technologies for enhancing navigation for the visually impaired. Multimed. Tools Appl. 2024, 83, 52171–52195. [Google Scholar] [CrossRef]
  5. Hsu, Y.; Wang, J.; Chang, C. A wearable inertial pedestrian navigation system with quaternion-based extended Kalman filter for pedestrian localization. IEEE Sens. J. 2017, 17, 3193–3206. [Google Scholar] [CrossRef]
  6. Charroud, A.; El Moutaouakil, K.; Palade, V.; Yahyaouy, A.; Onyekpe, U.; Eyo, E.U. Localization and mapping for self-driving vehicles: A survey. Machines 2024, 12, 118. [Google Scholar] [CrossRef]
  7. Li, G.; Xu, J.; Li, Z.; Chen, C.; Kan, Z. Sensing and navigation of wearable assistance cognitive systems for the visually impaired. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 122–133. [Google Scholar] [CrossRef]
  8. Rai, A.; Mounier, E.; de Araujo, P.R.M.; Noureldin, A.; Jain, K. Investigation and Implementation of Multi-Stereo Camera System Integration for Robust Localization in Urban Environments. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2025, 48, 1255–1262. [Google Scholar] [CrossRef]
  9. Llorca, D.F.; Sotelo, M.A.; Parra, I.; Ocaña, M.; Bergasa, L.M. Error analysis in a stereo vision-based pedestrian detection sensor for collision avoidance applications. Sensors 2010, 10, 3741–3758. [Google Scholar] [CrossRef]
  10. Keller, C.G.; Enzweiler, M.; Rohrbach, M.; Llorca, D.F.; Schnorr, C.; Gavrila, D.M. The benefits of dense stereo for pedestrian detection. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1096–1106. [Google Scholar] [CrossRef]
  11. Wang, J.; Meng, X.; Xu, H.; Pei, Y. Neural Modeling and Real-Time Environment Training of Human Binocular Stereo Visual Tracking. Cogn. Comput. 2023, 15, 710–730. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Tao, W. Pedestrian detection in binocular stereo sequence based on appearance consistency. IEEE Trans. Circuits Syst. Video Technol. 2015, 26, 1772–1785. [Google Scholar] [CrossRef]
  13. Xie, Q.; Long, Q.; Li, J.; Zhang, L.; Hu, X. Application of intelligence binocular vision sensor: Mobility solutions for automotive perception system. IEEE Sens. J. 2023, 24, 5578–5592. [Google Scholar] [CrossRef]
  14. Yang, L.; Wang, B.; Zhang, R.; Zhou, H.; Wang, R. Analysis on location accuracy for the binocular stereo vision system. IEEE Photonics J. 2017, 10, 1–16. [Google Scholar] [CrossRef]
  15. Poggi, M.; Tosi, F.; Batsos, K.; Mordohai, P.; Mattoccia, S. On the synergies between machine learning and binocular stereo for depth estimation from images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5314–5334. [Google Scholar] [CrossRef] [PubMed]
  16. Brown, M.Z.; Burschka, D.; Hager, G.D. Advances in computational stereo. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 993–1008. [Google Scholar] [CrossRef]
  17. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef]
  18. Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar]
  19. Chang, J.-R.; Chen, Y.-S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418. [Google Scholar]
  20. Lipson, L.; Teed, Z.; Deng, J. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 218–227. [Google Scholar]
  21. Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, J.; Fan, H.; Liu, S. Practical stereo matching via cascaded recurrent network with adaptive correlation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16263–16272. [Google Scholar]
  22. Xu, G.; Wang, X.; Ding, X.; Yang, X. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21919–21928. [Google Scholar]
  23. Szeliski, R. Computer Vision: Algorithms and Applications, 2nd ed.; Springer Nature: Cham, Switzerland, 2022; pp. 595–612. [Google Scholar]
  24. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  25. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 402–419. [Google Scholar]
  26. Shamsafar, F.; Woerz, S.; Rahim, R.; Zell, A. Mobilestereonet: Towards lightweight deep networks for stereo matching. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2417–2426. [Google Scholar]
  27. Yang, L.; Tang, W. A lightweight stereo depth estimation network based on mobile devices. In Proceedings of the Seventh International Conference on Computer Graphics and Virtuality, Hangzhou, China, 23–25 February 2024; pp. 41–52. [Google Scholar]
  28. Qian, W.; Hu, C.; Wang, H.; Lu, L.; Shi, Z. A novel target detection and localization method in indoor environment for mobile robot based on improved YOLOv5. Multimed. Tools Appl. 2023, 82, 28643–28668. [Google Scholar] [CrossRef]
  29. Wang, L.; Li, L.; Wang, H.; Zhu, S.; Zhai, Z.; Zhu, Z. Real-time vehicle identification and tracking during agricultural master-slave follow-up operation using improved YOLO v4 and binocular positioning. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2023, 237, 1393–1404. [Google Scholar] [CrossRef]
  30. Zuo, J.; Wang, Y.; Wang, R. Method for Acquiring Passenger Standing Positions on Subway Platforms Based on Binocular Vision. IEEE Access 2025, 13, 32971–32980. [Google Scholar] [CrossRef]
  31. Ding, J.; Yan, Z.; We, X. High-accuracy recognition and localization of moving targets in an indoor environment using binocular stereo vision. ISPRS Int. J. Geo-Inf. 2021, 10, 234. [Google Scholar] [CrossRef]
  32. Wei, B.; Liu, J.; Li, A.; Cao, H.; Wang, C.; Shen, C.; Tang, J. Remote distance binocular vision ranging method based on improved YOLOv5. IEEE Sens. J. 2024, 24, 11328–11341. [Google Scholar] [CrossRef]
  33. Guo, J.; Chen, H.; Liu, B.; Xu, F. A system and method for person identification and positioning incorporating object edge detection and scale-invariant feature transformation. Measurement 2023, 223, 113759. [Google Scholar] [CrossRef]
  34. Nguyen, U.; Heipke, C. 3d pedestrian tracking using local structure constraints. ISPRS J. Photogramm. Remote Sens. 2020, 166, 347–358. [Google Scholar] [CrossRef]
  35. Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  36. Hayat, M.; Gupta, M.; Suanpang, P.; Nanthaamornphong, A. Super-resolution methods for endoscopic imaging: A review. In Proceedings of the 2024 12th International Conference on Internet of Everything, Microwave, Embedded, Communication and Networks (IEMECON), Jaipur, India, 24–26 October 2024; pp. 1–6. [Google Scholar]
  37. Tosi, F.; Bartolomei, L.; Poggi, M. A survey on deep stereo matching in the twenties. Int. J. Comput. Vis. 2025, 133, 4245–4276. [Google Scholar] [CrossRef]
  38. Chuang, K.; Lin, W. Improve the Applicability of Trinocular Stereo Camera in Navigation and Obstacle Avoidance. In Proceedings of the 2025 1st International Conference on Consumer Technology, Matsue, Japan, 29–31 March 2025; pp. 1–4. [Google Scholar]
  39. Bi, S.; Gu, Y.; Zou, J.; Wang, L.; Zhai, C.; Gong, M. High precision optical tracking system based on near infrared trinocular stereo vision. Sensors 2021, 21, 2528. [Google Scholar] [CrossRef]
  40. Liu, C.; Lin, W. A Novel Stereo Vision Universality Algorithm Model Suitable for Tri-PSMNet and InvTri-PSMNet. In Proceedings of the 2025 1st International Conference on Consumer Technology, Matsue, Japan, 29–31 March 2025; pp. 1–4. [Google Scholar]
  41. Zhang, Y. Multi-ocular Stereovision. In 3D Computer Vision: Foundations and Advanced Methodologies; Springer: Singapore, 2024; pp. 203–235. [Google Scholar]
  42. Wang, J.; Peng, C.; Li, M.; Li, Y.; Du, S. The study of stereo matching optimization based on multi-baseline trinocular model. Multimed. Tools Appl. 2022, 81, 12961–12972. [Google Scholar] [CrossRef]
  43. Wang, H.; Li, M.; Wang, J.; Li, Y.; Du, S. A Discussion of Optimization about Stereo Image Depth Estimation Based on Multi-baseline Trinocular Camera Model. In Proceedings of the 2021 International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, USA, 15–17 December 2021; pp. 1716–1720. [Google Scholar]
  44. Roghani, S.E.S.; Koyuncu, E. Canonical Trinocular Camera Setups and Fisheye View for Enhanced Feature-Based Visual Aerial Odometry. IEEE Access 2024, 12, 134888–134901. [Google Scholar] [CrossRef]
  45. Pathak, S.; Hamada, T.; Umeda, K. Trinocular 360-degree stereo for accurate all-round 3D reconstruction considering uncertainty. Adv. Robot. 2024, 38, 1038–1051. [Google Scholar] [CrossRef]
  46. Zhou, Y.; Li, Q.; Wu, Y.; Ma, Y.; Wang, C. Trinocular vision and spatial prior based method for ground clearance measurement of transmission lines. Appl. Opt. 2021, 60, 2422–2433. [Google Scholar] [CrossRef]
  47. Isa, M.A.; Leach, R.; Branson, D.; Piano, S. Vision-based detection and coordinate metrology of a spatially encoded multi-sphere artefact. Opt. Lasers Eng. 2024, 172, 107885. [Google Scholar] [CrossRef]
  48. Ma, Y.; Li, Q.; Xing, J.; Huo, G.; Liu, Y. An intelligent object detection and measurement system based on trinocular vision. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 711–724. [Google Scholar] [CrossRef]
  49. Oh, S.; Hoogs, A.; Perera, A.; Cuntoor, N.; Chen, C.-C.; Lee, J.T.; Mukherjee, S.; Aggarwal, J.K.; Lee, H.; Davis, L. A large-scale benchmark dataset for event recognition in surveillance video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 3153–3160. [Google Scholar]
  50. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 21 August 2025).
Figure 1. The overall architecture of the proposed trinocular vision-based localization framework. The pipeline begins with rectified trinocular images $I_l$, $I_r$, $I_t$ and an initial region of interest (ROI) in $I_r$. Subsequently, multi-scale template matching is performed to obtain an initial depth estimate. Finally, this estimate is refined through a similarity evaluation of image patches that leverages information from all three views to determine a more precise target location.
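To make this data flow concrete, the following is a minimal, hypothetical Python sketch of how such a pipeline could be orchestrated with the YOLOv8 detector of ref. [50]. The helper functions, the `calib` container, and the fixed ±15% depth search range are illustrative assumptions rather than the authors' implementation; the two helpers are sketched after Figures 2 and 4 below.

```python
from ultralytics import YOLO  # YOLOv8 detector of ref. [50]

# Hypothetical orchestration of the pipeline in Figure 1, shown only to make the
# data flow concrete. coarse_pyramid_match() and refine_depth() are the
# illustrative sketches given after Figures 2 and 4 below; `calib` is an assumed
# container for the parameters of Tables 1 and 2 plus the focal length f (px)
# and horizontal baseline B (m). This is not the authors' released code.
detector = YOLO("yolov8s.pt")

def localize_pedestrians(I_l, I_r, I_t, calib):
    depths = []
    boxes = detector(I_l, classes=[0], verbose=False)[0].boxes.xyxy.cpu().numpy()
    for x1, y1, x2, y2 in boxes.astype(int):
        template = I_l[y1:y2, x1:x2]                  # pedestrian ROI in the left view
        u_r, _ = coarse_pyramid_match(template, I_r)  # stage 1: initial binocular match
        disparity = max(x1 - u_r, 1)                  # left-minus-right column offset (px)
        z0 = calib["f"] * calib["B"] / disparity      # initial depth estimate (m)
        z = refine_depth(I_l, I_r, I_t, *calib["cameras"],
                         px_ref=((x1 + x2) // 2, (y1 + y2) // 2),
                         z_min=(1 - 0.15) * z0,       # +/-15% search range, cf. Table 7
                         z_max=(1 + 0.15) * z0)       # stage 2: three-view refinement
        depths.append(z)
    return depths
```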
Figure 2. Schematic of the first-stage matching process. The template $T$ is cropped from the left image $I_l$. An image pyramid is constructed for the right image $I_r$. At each pyramid level $L_k$, we compute the similarity between the down-sampled template $T_k$ and the down-sampled image $I_{r,k}$ within a local search range (highlighted in orange).
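For illustration, this coarse-to-fine search can be sketched in a few lines of Python/OpenCV. The snippet is a minimal, hypothetical rendition of the first stage only: the pyramid depth, the search margin, and the normalized cross-correlation score (cv2.TM_CCOEFF_NORMED) are assumptions, not details taken from the paper.

```python
import cv2

def coarse_pyramid_match(template, right_img, levels=3, margin=16):
    """Sketch of the first matching stage (Figure 2): locate `template`
    (cropped from the left view) in `right_img` by matching at the coarsest
    pyramid level and refining level by level within a local search range."""
    # Build image pyramids (index 0 = full resolution, last = coarsest).
    pyr_img, pyr_tpl = [right_img], [template]
    for _ in range(levels):
        pyr_img.append(cv2.pyrDown(pyr_img[-1]))
        pyr_tpl.append(cv2.pyrDown(pyr_tpl[-1]))

    # Exhaustive NCC match at the coarsest level.
    res = cv2.matchTemplate(pyr_img[-1], pyr_tpl[-1], cv2.TM_CCOEFF_NORMED)
    _, _, _, (x, y) = cv2.minMaxLoc(res)

    # Propagate the estimate down the pyramid, searching only a small window.
    for k in range(levels - 1, -1, -1):
        x, y = 2 * x, 2 * y
        img, tpl = pyr_img[k], pyr_tpl[k]
        th, tw = tpl.shape[:2]
        x0, y0 = max(x - margin, 0), max(y - margin, 0)
        x1 = min(x + tw + margin, img.shape[1])
        y1 = min(y + th + margin, img.shape[0])
        res = cv2.matchTemplate(img[y0:y1, x0:x1], tpl, cv2.TM_CCOEFF_NORMED)
        _, _, _, (dx, dy) = cv2.minMaxLoc(res)
        x, y = x0 + dx, y0 + dy
    return x, y  # top-left corner of the best match at full resolution
```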
Figure 3. Schematic of the second-stage matching process. The original template $T_0$ is used to refine the matching in the original image $I_r$ within a local search window, which is defined by a width of $D_u$ and a height of $D_v$.
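For orientation, the horizontal disparity returned by this refinement step maps to the initial depth through the standard rectified-stereo relation (the textbook formula, not something specific to this paper), where $f$ is the focal length in pixels, $B$ the horizontal baseline, and $u_l$, $u_r$ the matched column coordinates in the left and right views:

$$ z_0 = \frac{f\,B}{d}, \qquad d = u_l - u_r . $$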
Figure 4. Schematic of the three-view constraint optimization. A candidate point $P_z$ is projected onto three image planes. Then, matching patches ($\mathrm{Patch}_{ref}$, $\mathrm{Patch}_r$, $\mathrm{Patch}_t$) are extracted, centered on the projected points ($p_l^z$, $p_r^z$, $p_t^z$), to evaluate the similarity scores across the views.
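The three-view constraint can likewise be sketched as a plain depth sweep. The snippet below is a hypothetical Python/NumPy illustration under stated assumptions: points are transferred between views as $X_{view} = R\,X_{left} + t$ with translations in metres, zero-mean NCC serves as the patch similarity, and the two auxiliary views are combined with weights $\lambda_r = \lambda_t = 0.5$ (the default reported in Table 4). It mirrors the structure of Figure 4 but is not the authors' implementation.

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = a.astype(np.float32).ravel(); a -= a.mean()
    b = b.astype(np.float32).ravel(); b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-6))

def extract_patch(img, u, v, half):
    """Return a square patch of side 2*half+1 centred on (u, v), or None at borders."""
    u, v = int(round(u)), int(round(v))
    if (u - half < 0 or v - half < 0 or
            u + half + 1 > img.shape[1] or v + half + 1 > img.shape[0]):
        return None
    return img[v - half:v + half + 1, u - half:u + half + 1]

def refine_depth(I_l, I_r, I_t, K_l, K_r, K_t, R_lr, t_lr, R_lt, t_lt,
                 px_ref, z_min, z_max, z_step=0.1, half=15, lam_r=0.5, lam_t=0.5):
    """Depth sweep over candidate points P_z (Figure 4). px_ref = (u, v) is the
    reference pixel in the left view; R/t are assumed to map left-camera
    coordinates into the right/top frames as X_view = R @ X_left + t, with
    translations expressed in metres (the same unit as z)."""
    u, v = px_ref
    ray = np.linalg.inv(K_l) @ np.array([u, v, 1.0])   # unit-depth ray in the left frame
    patch_ref = extract_patch(I_l, u, v, half)
    best_z, best_score = None, -np.inf
    for z in np.arange(z_min, z_max, z_step):
        P = ray * z                                    # candidate 3D point P_z
        score = 0.0
        for img, K, R, t, w in ((I_r, K_r, R_lr, t_lr, lam_r),
                                (I_t, K_t, R_lt, t_lt, lam_t)):
            X = R @ P + t                              # point in the other camera frame
            uvw = K @ (X / X[2])                       # perspective projection
            patch = extract_patch(img, uvw[0], uvw[1], half)
            if patch_ref is None or patch is None or patch.shape != patch_ref.shape:
                score = -np.inf
                break
            score += w * ncc(patch_ref, patch)         # weighted cross-view similarity
        if score > best_score:
            best_z, best_score = z, score
    return best_z
```

In practice, the candidate range $[z_{min}, z_{max}]$ would be centred on the first-stage estimate, and the sweep granularity corresponds to the $z_{step}$ ablated in Table 6.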
Figure 5. The experimental platform, which includes a camera module and a laptop for data acquisition. The camera module consists of our proposed trinocular vision system and a commercial depth camera (Orbbec-335L).
Figure 6. An illustration of the experimental setting. The left, right, and top views are captured by the trinocular system, and an additional reference camera is used to calculate the ground truth distance to the detected target.
Figure 7. Sample frames in the left camera view that illustrate the two sidewalk scenes (Scenario 1 and Scenario 2) for data collection.
Figure 8. Quantitative comparison of localization accuracy with 95% confidence intervals (shown as shaded areas). Data are aggregated into 1-meter-wide distance bins centered at values from 2 m to 18 m, with sample counts typically around 240 per bin. (a) The mean absolute error of our method (Match+Trino), SGBM, the Orbbec camera, RAFT-Stereo, and RAFT-Stereo (Realtime). (b) The relative performance gain of our method over the SGBM, the Orbbec camera, RAFT-Stereo, and RAFT-Stereo (Realtime).
Figure 9. Quantitative results of the ablation study with 95% confidence intervals (shown as shaded areas). Data are aggregated into 1-meter-wide distance bins centered at values from 2 m to 18 m, with sample counts typically around 240 per bin. (a) The mean absolute error of the initial (Match) and final (Match + Trino) results. (b) The relative performance gain of the final results over the initial results.
Figure 10. Impact of target size on the runtimes of the proposed method (Match + Trino) and its individual components (Match and Trino). Data are categorized into the same pixel-area bins as in Table 9 (e.g., $32^2$, $64^2$, $128^2$), which also reports the mean area and sample counts for each bin.
Figure 11. Correction of a simulated matching inaccuracy by the Trino stage. A 15-pixel rightward offset was manually added to the initial disparity from the Match stage. (a) The erroneous 3D point projected back onto the three views, with extracted patches highlighted in red and bounding boxes outlined in yellow. (b) After applying the Trino stage, the corresponding patches (highlighted in green) are correctly aligned with the target.
Figure 12. Correction of a simulated detection inaccuracy by the Trino stage. The initial ROI in the left view (yellow box) was manually shifted upward and to the right of the target, resulting in an offset correspondence in the right view. After applying the Trino stage, the projected patches (highlighted in green) are correctly located on the pedestrian, demonstrating tolerance to small initial detection errors.
Table 1. Intrinsic matrices of the trinocular camera system.
Camera | $f_x$ | $f_y$ | $c_x$ | $c_y$
$K_l$ | 711.9746 | 710.7513 | 656.8128 | 370.2462
$K_r$ | 709.7325 | 708.6549 | 639.3270 | 338.5957
$K_t$ | 720.5992 | 721.1380 | 626.4810 | 372.2269
Table 2. Extrinsic parameters (relative to the left camera coordinate system).
Pair | Rotation $R$ (rows separated by semicolons) | Translation $t$ (mm)
($R_{lr}$, $t_{lr}$) | [0.9995, 0.0174, 0.0280; 0.0176, 0.9998, 0.0035; 0.0280, 0.0040, 0.9996] | (468.9070, 5.3564, 4.5990)
($R_{lt}$, $t_{lt}$) | [0.9999, 0.0117, 0.0106; 0.0125, 0.9965, 0.0831; 0.0096, 0.0833, 0.9965] | (226.8953, 439.3381, 29.7137)
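As a usage note, the calibration in Tables 1 and 2 is exactly what the projection step of the three-view optimization consumes. Below is a minimal sketch of assembling a $3 \times 4$ projection matrix $P = K\,[R \mid t]$ from such entries, assuming the tabulated $(R, t)$ map left-camera coordinates into the other views and converting translations from millimetres to metres; the variable names are illustrative.

```python
import numpy as np

def projection_matrix(K, R=np.eye(3), t_mm=np.zeros(3)):
    """P = K [R | t], with the translation converted from millimetres to metres."""
    t_m = np.asarray(t_mm, dtype=float).reshape(3, 1) / 1000.0
    return K @ np.hstack([R, t_m])

def project(P, X_left):
    """Project a 3D point given in the left-camera frame (metres) into pixel coordinates."""
    uvw = P @ np.append(X_left, 1.0)
    return uvw[:2] / uvw[2]

# Left-camera intrinsics from Table 1 (fx, fy on the diagonal, cx, cy in the last column).
K_l = np.array([[711.9746, 0.0, 656.8128],
                [0.0, 710.7513, 370.2462],
                [0.0, 0.0, 1.0]])
P_l = projection_matrix(K_l)  # the left camera is the reference: R = I, t = 0

# P_r and P_t are built the same way from K_r, K_t and the (R_lr, t_lr), (R_lt, t_lt)
# pairs of Table 2, once their sign convention is confirmed against the calibration output.
```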
Table 3. Comparison of localization errors on the dataset.
Method | MAE (m) | RMSE (m) | STD (m)
Orbbec 335L | 1.221 | 1.549 | 0.953
SGBM | 0.536 | 0.734 | 0.501
RAFT-Stereo | 0.623 | 0.800 | 0.501
RAFT-Stereo (Realtime) | 0.621 | 0.781 | 0.473
Match (Ours, w/o Opt.) | 0.638 | 0.815 | 0.508
Match + Trino (Ours) | 0.435 | 0.615 | 0.434
Table 4. Results on the impact of the view weights ($\lambda_r$, $\lambda_t$) on MAE (m) *.
Parameter ($\lambda_r$, $\lambda_t$) | MAE (2–6 m) | MAE (6–12 m) | MAE (12–18 m)
(0.25, 0.75) | 0.161 ± 0.009 | 0.309 ± 0.012 | 0.695 ± 0.026
(0.5, 0.5) (default) | 0.154 ± 0.009 | 0.313 ± 0.013 | 0.702 ± 0.026
(0.75, 0.25) | 0.167 ± 0.011 | 0.330 ± 0.014 | 0.735 ± 0.027
* All MAE values are reported as mean ± 95% confidence interval.
Table 5. Results on the impact of the patch size ratio ($r_m$) on MAE (m) *.
Parameter $r_m$ | MAE (2–6 m) | MAE (6–12 m) | MAE (12–18 m)
0.3 | 0.208 ± 0.011 | 0.521 ± 0.021 | 1.169 ± 0.045
0.5 (default) | 0.154 ± 0.009 | 0.313 ± 0.013 | 0.702 ± 0.026
0.7 | 0.150 ± 0.008 | 0.290 ± 0.011 | 0.699 ± 0.023
* All MAE values are reported as mean ± 95% confidence interval.
Table 6. Results on the impact of the depth search step ($z_{step}$) on MAE (m) *.
Parameter $z_{step}$ | MAE (2–6 m) | MAE (6–12 m) | MAE (12–18 m)
0.05 | 0.155 ± 0.009 | 0.315 ± 0.013 | 0.709 ± 0.026
0.1 (default) | 0.154 ± 0.009 | 0.313 ± 0.013 | 0.702 ± 0.026
0.2 | 0.160 ± 0.009 | 0.314 ± 0.013 | 0.695 ± 0.026
* All MAE values are reported as mean ± 95% confidence interval.
Table 7. Results on the impact of the depth range scale ($\lambda_z$) on MAE (m) *.
Parameter $\lambda_z$ | MAE (2–6 m) | MAE (6–12 m) | MAE (12–18 m)
0.1 | 0.156 ± 0.009 | 0.311 ± 0.013 | 0.712 ± 0.029
0.15 (default) | 0.154 ± 0.009 | 0.313 ± 0.013 | 0.702 ± 0.026
0.2 | 0.157 ± 0.009 | 0.312 ± 0.013 | 0.709 ± 0.026
* All MAE values are reported as mean ± 95% confidence interval.
Table 8. Average run time of algorithms/components.
Algorithm/Component | Average Time (ms) | Throughput (FPS) @ N = 1 *
SGBM | 51.02 | 19.60
RAFT-Stereo | 739.43 | 1.35
RAFT-Stereo (Realtime) | 21.26 | 47.04
Part 1: Match (Ours) | 0.98 | -
Part 2: Trino (Ours) | 2.15 | -
Total: Match + Trino (Ours) | 3.13 | 319.49
* Throughput for baseline methods is measured per frame. For our method, throughput is reported per target (for N = 1), as its runtime scales with the number of targets.
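The throughput column is simply the reciprocal of the average time, which provides a quick consistency check on the table:

$$ \mathrm{FPS} = \frac{1000\ \mathrm{ms/s}}{\bar{t}\ (\mathrm{ms})}, \qquad \frac{1000}{51.02} \approx 19.6, \qquad \frac{1000}{21.26} \approx 47.0, \qquad \frac{1000}{3.13} \approx 319.5 . $$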
Table 9. Average runtime (ms) for different target sizes.
Bin Center (px²) * | Mean Area (px²) | SGBM (ms) | RAFT-Stereo (ms) | RAFT-Stereo (Realtime) (ms) | Ours (ms)
$32^2$ = 1024 | 1875.85 | 51.59 | 739.04 | 21.30 | 4.02
$64^2$ = 4096 | 5342.40 | 50.87 | 740.13 | 21.26 | 3.01
$128^2$ = 16,384 | 18,990.66 | 50.73 | 738.74 | 21.20 | 2.24
$256^2$ = 65,536 | 62,398.28 | 50.21 | 738.53 | 21.25 | 2.83
* Bin center refers to the pre-defined center of the area bins (e.g., $32^2$, $64^2$, $128^2$). Mean area reports the actual average area of samples within each bin. The sample counts for each bin are 1164, 1732, 852, and 195, respectively.
Table 10. GFLOPs and latency for the Match and Trino stages.
Bin Center (px²) * | Patch Size $M$ (px) | $\Delta z$ (m) | Match GFLOPs | Match Latency (ms) | Trino GFLOPs | Trino Latency (ms)
$32^2$ = 1024 | 13.5 | 2.71 | 0.034 | 0.890 | 0.0058 | 3.126
$64^2$ = 4096 | 22.1 | 1.82 | 0.096 | 0.928 | 0.0041 | 2.081
$128^2$ = 16,384 | 40.9 | 1.00 | 0.342 | 1.079 | 0.0027 | 1.158
$256^2$ = 65,536 | 69.5 | 0.57 | 1.12 | 1.591 | 0.0023 | 1.235
* Bin center refers to the pre-defined center of the area bins (e.g., $32^2$, $64^2$, $128^2$). Patch size $M$ and $\Delta z$ report the actual average values of samples within each bin. The Trino stage GFLOPs calculation uses $z_{step} = 0.1$ m (according to Appendix B.2) and a fixed overhead of $\kappa \approx 1.037 \times 10^{5}$ FLOPs per sample.