1. Introduction
Computer-assisted surgical (CAS) navigation systems provide real-time spatial localization to support precision interventions and robotic-assisted procedures [1,2]. Accurate tracking of surgical instruments is essential in these applications [3]. The optical tracking system (OTS) is generally regarded as the gold standard for intraoperative surgical navigation owing to its high tracking accuracy, robustness, non-contact operation, and strong environmental adaptability [4,5]. In practice, however, reflective markers often return excessive intensity, leading to sensor saturation and halo artifacts that degrade localization accuracy and stability [6,7].
Traditional techniques, such as the Circular Hough Transform (CHT), struggle to filter these optical interferences due to their reliance on rigid geometric rules [7]. Furthermore, widely used intensity-based methods, such as the Weighted Centroid algorithm, are fundamentally limited by the dynamic range of infrared sensors [8,9]. In surgical scenarios, the high-intensity reflection from reflective markers often exceeds the sensor’s saturation threshold, resulting in a flat-top intensity profile where the Gaussian peak is clipped [10]. This saturation destroys the intensity gradient information essential for sub-pixel weighting, causing the calculated centroid to drift towards the geometric center of the asymmetric blooming artifact rather than remaining at the true physical center of the reflective marker [11]. In binocular optical tracking systems, this 2D centroid deviation severely degrades downstream tracking performance. Based on the principles of stereo triangulation, a minute 2D localization shift on the image plane is non-linearly amplified into a depth estimation error in 3D space. Therefore, mitigating optical blooming to achieve highly stable 2D centroid extraction is a fundamental physical prerequisite for reliable six-DOF (degrees of freedom) rigid-body pose estimation in surgical navigation.
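The amplification mentioned above can be made explicit for an idealized, rectified stereo pair. The following is a minimal sketch under a pinhole model; the symbols $f$ (focal length in pixels), $B$ (baseline), $d$ (disparity), and $Z$ (depth) are introduced here for illustration and are not taken from the calibration of the system described in this paper.

```latex
Z = \frac{fB}{d}
\quad\Longrightarrow\quad
\frac{\partial Z}{\partial d} = -\frac{fB}{d^{2}} = -\frac{Z^{2}}{fB},
\qquad
|\delta Z| \approx \frac{Z^{2}}{fB}\,|\delta d| .
```

Under these assumptions, a fixed disparity error $\delta d$ caused by a 2D centroid shift produces a depth error that grows with the square of the working distance, which is why small image-plane deviations matter more at larger depths.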
Generic deep learning detectors often treat reflective markers as simple geometric patterns and ignore optical effects such as halo and intensity spread, which can shift the predicted centers [12,13,14]. Such purely data-driven approaches neglect the underlying optical characteristics of the imaging process, in particular the physical blooming effect, in which overexposure causes charge to spill into adjacent pixels. This limits the model’s ability to distinguish the true signal of a reflective marker from complex optical interference. Consequently, the model may perceive an enlarged, asymmetric halo as the actual marker boundary, leading to a sub-pixel shift in the calculated center. By neglecting the intensity-gradient physics, such models lack an intrinsic mechanism to decouple the stable signal from optical artifacts.
In this study, we propose a physics-prior-guided detection framework that incorporates the optical properties of the infrared reflective markers directly into the network architecture [15]. We model the imaging pattern of a reflective marker not just as a circular object, but as a point source convolved with the system’s Point Spread Function (PSF), resulting in a Gaussian-like intensity profile. Based on this physical principle, we introduce a Laplacian-based brightness prior within the attention mechanism to enhance the intensity inflection points around reflective marker boundaries. The design is inspired by the edge-enhancement principle of the Laplacian-of-Gaussian (LoG) operator, while the practical implementation adopts a lightweight discrete Laplacian kernel to avoid additional Gaussian smoothing overhead. By combining optical physics with deep learning, the proposed method achieves high-precision sub-pixel localization suitable for optical tracking, forming a foundation for potential downstream applications such as rigid-body pose estimation in surgical navigation or other precision tracking scenarios [16,17]. This approach maintains real-time inference capability while improving robustness against optical saturation, providing a practical solution for reflective marker localization in challenging imaging environments.
2. Materials and Methods
2.1. Image Data Acquisition
The dataset used in this research was acquired using a custom-designed image acquisition platform. The platform consists of an active infrared binocular camera (Aimuyi Technology Co., Ltd., Guangzhou, China), a calibration tool, and a computer. To obtain high-quality image data, the active infrared camera captured images of a rigid body tool carrying reflective markers from different angles and distances. The working distance was set to approximately 1 m to 1.5 m from the rigid body tool to achieve optimal imaging results. The computer controlled the shooting process and data collection. The dataset consists of 8000 images with a resolution of 2472 × 2064 pixels. The labels b1, b2, b3, and b4 each correspond to one of the four spherical reflective markers of a standard tetrahedral rigid body used for six-DOF surgical tracking. In clinical practice, most navigation tools employ three to five markers to ensure visibility under partial occlusion. While our current model is trained on this standard four-marker configuration, the proposed architecture is inherently scalable to accommodate images containing more than five targets without architectural modification.
2.2. Dataset Construction
2.2.1. Image Preprocessing and Optical Calibration
To ensure data consistency and sub-pixel edge integrity, all raw images underwent a preprocessing pipeline prior to dataset annotation. This preprocessing involved the following steps: (1) stereo rectification was performed using a checkerboard-optimized relative pose, with rectified images remapped via Lanczos interpolation to minimize geometric distortion; (2) Contrast Limited Adaptive Histogram Equalization (CLAHE) with a clip limit of 3.0 and a 6 × 6 tile grid was applied to enhance marker visibility while suppressing background noise; and (3) adaptive contrast stretching was implemented based on the 1st and 99th intensity percentiles, followed by a 3 × 3 Laplacian-based sharpening kernel to accentuate the gradient at the marker–background interface. These steps were performed before image annotation to facilitate precise manual labeling and were applied consistently to all images used for model training and inference, ensuring a uniform data distribution for the network.
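Step (3) above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the sharpening weights (identity plus a 4-neighbour Laplacian) are an assumption because the text only specifies a "3 × 3 Laplacian-based" kernel, and the rectification and CLAHE steps are omitted.

```python
import numpy as np

def percentile_stretch(img, lo_pct=1.0, hi_pct=99.0):
    """Linearly map the [lo_pct, hi_pct] intensity percentiles to [0, 1]."""
    img = img.astype(np.float64)
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    if hi <= lo:  # flat image: nothing to stretch
        return np.zeros_like(img)
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

# Assumed sharpening kernel: identity plus a 4-neighbour Laplacian.
# The exact weights used in the paper are not stated.
SHARPEN = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]], dtype=np.float64)

def filter3x3(img, kernel):
    """Correlate a 2D image with a 3x3 kernel using edge-replicated borders."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def preprocess(img):
    """Percentile stretch followed by Laplacian-based sharpening, clipped to [0, 1]."""
    stretched = percentile_stretch(img)
    return np.clip(filter3x3(stretched, SHARPEN), 0.0, 1.0)
```

Because the assumed kernel sums to one, flat regions pass through unchanged and only intensity transitions, such as the marker–background interface, are amplified.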
2.2.2. High-Precision Ground Truth Establishment
Establishing a reliable sub-pixel reference is challenging under specular saturation, as standard automated methods often shift the centroid toward asymmetric blooming artifacts. To reduce this bias, we adopted a geometrically guided annotation strategy combining manual labeling and ellipse fitting.
First, all 8000 images were manually annotated to generate coarse bounding boxes for the reflective markers. The annotation was performed using a zoom-assisted labeling interface to ensure accurate placement around the marker regions. Next, an ellipse fitting algorithm was applied within each localized region to extract the geometric centroid based on the intensity gradients of the marker edges. This provided sub-pixel coordinates for the majority of reflective markers under clear visibility. For more challenging cases, such as partial occlusion, strong saturation, or large viewing angles, the ellipse fitting could occasionally diverge from the true marker center. In these situations, the centroid was refined manually by visually aligning the ellipse to the visible arc segments of the marker boundary, ensuring that the estimated center remained consistent with the physical marker geometry rather than the optical blooming artifacts. To further improve reliability, the annotation process included a cross-verification step by two independent annotators. Discrepancies between the two annotations were re-examined and corrected through consensus. In practice, ellipse fitting produced stable centroid estimates for the majority of samples, while only a small proportion of images (approximately 8%) required manual adjustment. This human-in-the-loop process ensures that the ground truth remains robust and consistent across the dataset, providing a consistent baseline for evaluating model performance under saturated imaging conditions.
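The ellipse-fitting step can be illustrated with a plain algebraic least-squares conic fit. This is a sketch under stated assumptions, not the algorithm used by the authors (which is not specified): the function name is hypothetical, the input is assumed to be a set of edge points already extracted from the visible marker boundary, and the fit does not enforce the ellipse constraint, so it should only be trusted when the points genuinely lie on an elliptical arc.

```python
import numpy as np

def ellipse_center(xs, ys):
    """Least-squares conic fit; returns the centre (x0, y0) of the fitted conic."""
    xs = np.asarray(xs, dtype=np.float64)
    ys = np.asarray(ys, dtype=np.float64)
    # Shift to the data mean to improve numerical conditioning.
    mx, my = xs.mean(), ys.mean()
    x, y = xs - mx, ys - my
    # Conic model: a x^2 + b xy + c y^2 + d x + e y + f = 0
    design = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    # The smallest right singular vector minimises ||design @ p|| with ||p|| = 1.
    _, _, vt = np.linalg.svd(design)
    a, b, c, d, e, _ = vt[-1]
    # The centre is the point where the gradient of the conic vanishes.
    x0, y0 = np.linalg.solve([[2 * a, b], [b, 2 * c]], [-d, -e])
    return x0 + mx, y0 + my
```

The fit needs at least five boundary points. Since it uses only the coordinates of edge points, it can recover the center from a partial arc, which mirrors the idea of aligning the ellipse to the visible, unsaturated arc segments.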
Finally, following the YOLO dataset format, the dataset was randomly split into training, validation, and test sets in a ratio of 8:1:1, giving 6400 images for the training set, 800 images for the validation set, and 800 images for the test set.
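The 8:1:1 random split can be reproduced with a seeded shuffle. The sketch below is illustrative only; the function name and the seed are assumptions, since the paper does not state how the random selection was implemented.

```python
import random

def split_indices(n_images=8000, ratios=(8, 1, 1), seed=0):
    """Shuffle image indices and split them into train/val/test by ratio."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    total = sum(ratios)
    n_train = n_images * ratios[0] // total
    n_val = n_images * ratios[1] // total
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Fixing the seed keeps the three subsets disjoint and identical across runs, so that validation-based model selection never sees test images.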
2.3. Improved YOLOv8 Algorithm
The proposed method is built upon the YOLOv8 architecture [18]. As illustrated in Figure 1, it introduces two improvements while retaining the original YOLOv8 framework. First, the C2f_GhostV2 module replaces the final C2f module in the YOLOv8 Backbone network and all C2f modules within the Neck architecture. By integrating GhostNetV2’s lightweight convolutions with its long-range attention mechanism, C2f_GhostV2 reduces the model’s computational complexity and parameter count while maintaining effective feature extraction capabilities. Second, to address the characteristics of reflective markers in infrared images, such as high luminosity, small size, and susceptibility to halo interference, the Brightness-Prior-Enhanced Spatial Attention (BPESA) module was designed and inserted after the P3 and P4 feature maps within the Neck. BPESA integrates channel attention, spatial attention, and a brightness prior branch. Through these structural optimizations, the proposed method addresses the limitations of generic detectors, particularly regarding limited localization precision and susceptibility to specular reflection noise in infrared small object detection [19]. Experimental results demonstrate that this approach not only improves inference speed but also reduces the Root Mean Square (RMS) error in reflective marker centroid localization. Furthermore, it improves detection metrics such as P, R, mAP50, and mAP50-95, thereby providing accurate marker coordinates for potential downstream tracking applications.
2.3.1. Lightweight Feature Extraction Block
To further reduce the number of model parameters and the computational complexity, and thus facilitate deployment in high-speed optical tracking tasks, the C2f_GhostV2 module was designed to replace the final C2f module in the backbone network and all C2f modules within the Neck architecture, as depicted in Figure 1. C2f_GhostV2 is an efficient and lightweight feature aggregation module that aims to replace traditional convolutional operations, reducing computational redundancy and improving inference efficiency. As shown in Figure 2, the core of this module is the integrated GhostBottleneckV2. GhostBottleneckV2 draws on the GhostNet concept, generating numerous ghost feature maps from a few intrinsic feature maps through inexpensive linear operations, thereby reducing the computational cost and parameters associated with standard convolutions [20]. Building upon this, GhostNetV2 further incorporates a decoupled fully connected (DFC) attention mechanism (DFC_Attn). DFC_Attn captures long-range dependencies by down-sampling the feature maps before applying channel-wise grouped convolutions, and then up-sampling the attention map back to the original size for feature enhancement [20]. This design enables C2f_GhostV2 to achieve an effective balance between computational efficiency and global receptive field while preserving rich feature representation capabilities.
Compared to the original YOLOv8’s C2f module, the integration of C2f_GhostV2 reduces the model’s computational complexity and parameter count, consequently boosting the model’s real-time performance during infrared image inference. Particularly in the deeper layers of the backbone and the Neck, where feature maps often possess a high number of channels, these regions represent computational hotspots. The introduction of C2f_GhostV2 effectively curtails the computational redundancy in these areas. This optimization allocates valuable computational budget for the subsequent integration of more refined, albeit slightly more computationally intensive, attention mechanisms like BPESA, allowing the overall model to maintain high efficiency.
2.3.2. A Physics-Prior-Guided Feature Enhancement Module
To mitigate the localization errors caused by saturation blooming, the Brightness-Prior-Enhanced Spatial Attention (BPESA) module is implemented within the high-resolution P3 and P4 stages of the Neck. Unlike generic attention mechanisms that rely solely on learned statistics, BPESA functions as a domain-specific matched filter, explicitly encoding the optical characteristics of reflective markers. As depicted in Figure 3, the architecture employs a parallel multi-branch design, integrating a Brightness Prior Branch with standard channel and spatial attention mechanisms.
The Brightness Prior Branch is designed to recover the geometric integrity of reflective markers from saturated inputs. Assuming that the reflective marker projection approximates the convolution of the physical aperture with the system PSF, saturation blooming mainly manifests as a low-frequency expansion that obscures the true centroid. To counter this effect, a hybrid signal enhancement strategy based on a discrete Laplacian operator is introduced. The design is inspired by the edge-enhancement principle of the Laplacian-of-Gaussian (LoG) operator, while adopting a computationally efficient discrete formulation that avoids the additional Gaussian smoothing stage. First, a contrast convolution layer is initialized with a fixed 3 × 3 Laplacian kernel of center weight 8 and neighbors −1. This operator functions as a high-pass matched filter designed to maximize the response at the inflection points of the intensity gradient, corresponding to the physical edge of the reflective marker. By doing so, it explicitly suppresses the flat gradients found in the saturation plateau and the diffuse low-frequency background of the halo artifacts. To accommodate variable PSF scales caused by depth changes, a learnable branch processes the brightness map directly to model the scale-dependent optical blooming.
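The behaviour of the fixed kernel described above can be checked on a toy saturated marker. The sketch below uses NumPy rather than the PyTorch layer used in the model, and the flat-top disc is a synthetic stand-in for a saturated marker image, so it illustrates the operator only and not the BPESA module itself; the names `LAPLACIAN_8` and `laplacian_response` are hypothetical.

```python
import numpy as np

# Fixed 3x3 Laplacian kernel: centre weight 8, all eight neighbours -1.
LAPLACIAN_8 = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=np.float64)

def laplacian_response(brightness):
    """Correlate a 2D brightness map with the kernel (edge-replicated borders)."""
    padded = np.pad(brightness.astype(np.float64), 1, mode="edge")
    h, w = brightness.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN_8[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

# Toy saturated marker: a flat-top disc of radius 10 on a dark background.
yy, xx = np.mgrid[0:41, 0:41]
disc = ((xx - 20) ** 2 + (yy - 20) ** 2 <= 10 ** 2).astype(np.float64)
resp = laplacian_response(disc)
```

Because the kernel weights sum to zero, the response vanishes inside the saturation plateau and in the uniform background, and is non-zero only along the disc boundary, which is the property the brightness prior relies on.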
A critical challenge is balancing this physical prior with spatial features. BPESA employs an adaptive soft-gating mechanism: two learnable scalar weights are introduced in the fusion layer and optimized jointly with the rest of the network during training. In the optimal working volume, where the reflective marker’s PSF is sharp, the network assigns a higher weight to the brightness prior. Conversely, at extreme distances where scale mismatch occurs, the network adaptively shifts its reliance toward the multi-scale spatial attention branch. This ensures that the physical prior assists the model without imposing rigid constraints that would fail under scale variation. Finally, the features from the spatial and brightness branches are fused via a convolution to generate the final attention map. This gray-box design enhances perceptual capacity by directly leveraging the target’s salient physical traits while maintaining robustness against the top-hat distortion of saturation artifacts.
2.4. Test Platform
To systematically validate the performance of the proposed method, the dataset described in Section 2.1 was used for model training. All experiments were conducted on a workstation equipped with an Intel i9-13900K CPU and an NVIDIA RTX 4080 GPU (NVIDIA, Santa Clara, CA, USA), using PyTorch v2.5.1 and CUDA v12.1. The batch size was set to 16, and training lasted for 150 epochs, each comprising one full pass over the training set. To maintain consistency and reduce computational load, all input images were scaled to a uniform size of 800 × 800 pixels, which preserves sufficient image detail while keeping computation feasible. All raw images were rectified for lens distortion using pre-calibrated intrinsic parameters before being resized to 800 × 800 pixels. Since the aspect ratio was maintained and the sub-pixel coordinates were mapped back to the original 2472 × 2064 resolution for final error calculation, the resizing process did not introduce geometric bias or impact the physical localization accuracy.
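The mapping between the 800 × 800 model input and the 2472 × 2064 sensor frame can be sketched as follows. The text states that the aspect ratio was maintained, which for a square input implies some form of padding; the sketch therefore assumes centered letterbox padding with a single scale factor. This is an assumption and not confirmed by the source, and all function names are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=800):
    """Scale and padding for an aspect-preserving resize into a dst x dst canvas."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) / 2, (dst - new_h) / 2  # assumed centred padding
    return scale, pad_x, pad_y

def to_model(x, y, scale, pad_x, pad_y):
    """Map sensor coordinates to model-input coordinates."""
    return x * scale + pad_x, y * scale + pad_y

def to_sensor(x, y, scale, pad_x, pad_y):
    """Map model-input coordinates back to sensor coordinates."""
    return (x - pad_x) / scale, (y - pad_y) / scale

scale, pad_x, pad_y = letterbox_params(2472, 2064)
```

Under these assumptions one model-input pixel spans 2472/800 = 3.09 sensor pixels, so any residual error in the 800 × 800 frame is multiplied by roughly three when expressed at native resolution.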
2.5. Metrics for Model Evaluation
The algorithm performance is evaluated by comparing the detection results of the models before and after improvement under identical experimental conditions. Mean Average Precision (mAP), Recall (R), Precision (P), GFLOPs, and RMS localization error were employed as evaluation metrics for the reflective marker detection model. Precision is defined as the ratio of correctly predicted positive samples to the total number of predicted positives, while Recall represents the ratio of correctly predicted positives to the total number of actual positive samples.
Precision and Recall are usually calculated by the relationship between True Positives (TP), False Positives (FP), and False Negatives (FN). TP is the number of positive samples correctly identified by the algorithm, FP is the number of negative samples incorrectly identified as positive by the algorithm, and FN is the number of positive samples incorrectly identified as negative by the algorithm.
Average Precision (AP) is the area under the Precision–Recall (P-R) curve and represents the average precision for a single object category. Mean Average Precision (mAP) is the mean of the APs over all categories and is used as a measure of the overall detection performance of the network. mAP50 denotes the mAP computed at an Intersection over Union (IoU) threshold of 0.5. mAP50-95 denotes the mAP averaged over IoU thresholds ranging from 0.5 to 0.95.
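In formula form, the definitions above correspond to the standard expressions below. The symbol $C$ (number of categories, four in this dataset) and the set $T$ of IoU thresholds are notation introduced here; the step of 0.05 between thresholds follows the usual COCO-style convention and is an assumption, since the text only gives the range.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{C}\sum_{c=1}^{C} AP_{c},

mAP_{50\text{-}95} = \frac{1}{|T|}\sum_{t \in T} mAP_{t},
\qquad T = \{0.50, 0.55, \ldots, 0.95\}.
```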
GFLOPs (Giga Floating Point Operations) measure the computational complexity of the model.
The root-mean-square (RMS) center-localization error is adopted as the primary metric to quantify the accuracy of the proposed method. For each reflective marker in a test image, the localization error is defined as the Euclidean distance between the predicted center $\hat{\mathbf{p}}$ and the corresponding ground-truth coordinate $\mathbf{p}$. The overall performance is reported as the mean RMS computed over the entire evaluated set, accompanied by the standard deviation. In this study, the ground-truth coordinate $\mathbf{p}$ is defined as the precise geometric centroid derived from the ellipse fitting refinement (as described in Section 2.2). The predicted center $\hat{\mathbf{p}}$ is calculated as the geometric center of the predicted bounding box output by the model. Unlike traditional grid-based detectors that output discrete pixel indices, the regression head of the YOLOv8 architecture uses Distribution Focal Loss (DFL) to predict continuous floating-point coordinates. This continuous representation yields the bounding box center with sub-pixel resolution directly from the feature maps. The predicted continuous coordinates are subsequently mapped back from the 800 × 800 model input to the native 2472 × 2064 sensor resolution to calculate the absolute physical tracking error.
Let $N$ denote the number of images in the evaluated set; in this study, the validation and test sets each consist of 800 images. Let $M_i$ denote the number of reflective markers in the $i$-th image. Since the dataset encompasses four classes of reflective markers, the number of reflective markers in any single image is constrained such that $M_i \le 4$.
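One reading of the metric that is consistent with the description above is to compute an RMS value per image over its matched markers and then report the mean and standard deviation across images. The sketch below implements that reading; it is an assumption about the exact aggregation, and the function names are hypothetical. Predictions and ground-truth centers are assumed to be already matched and expressed at native sensor resolution.

```python
import math

def image_rms(pred, gt):
    """RMS of the Euclidean centre errors over the matched markers of one image."""
    sq = [(px - gx) ** 2 + (py - gy) ** 2
          for (px, py), (gx, gy) in zip(pred, gt)]
    return math.sqrt(sum(sq) / len(sq))

def dataset_rms(per_image_pairs):
    """Mean and (population) standard deviation of the per-image RMS values."""
    values = [image_rms(p, g) for p, g in per_image_pairs if p]  # skip empty images
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std
```

For example, an image with one marker predicted 5 pixels from its ground truth and an image with two markers at errors of 0 and 2 pixels give per-image RMS values of 5 and $\sqrt{2}$, which are then averaged.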
The experiments employed mAP50-95 and RMS as the primary evaluation metrics.
3. Results and Analysis
3.1. Improved YOLOv8 Model Based on Ablation Experiment
This study evaluates the effectiveness of the proposed improvements to the YOLOv8 framework through a series of ablation experiments. The baseline model corresponds to the original YOLOv8 architecture without any modifications. Two intermediate variants were constructed by introducing the C2f_GhostV2 module and the BPESA module individually in order to assess their respective contributions. All ablation experiments were performed on the validation set, which was used during model development to optimize module configurations and training parameters without introducing information leakage from the test set. Therefore, the results reported in Table 1 correspond to performance measured on the validation set.
As shown in Table 1, incorporating the C2f_GhostV2 module alone provides a dual advantage. It reduces model complexity by 23.4% in parameters and 13.6% in GFLOPs, while simultaneously improving the core localization metric, reducing the RMS error from 1.23 to 0.69 pixels. Similarly, introducing the lightweight BPESA module improves localization accuracy with minimal computational overhead, lowering the RMS to 0.88 pixels. When both components are combined, the proposed method achieves the best overall performance. The model attains the highest mAP50-95 of 86.5%, together with Precision (P) of 97.7% and Recall (R) of 88.5%, while reducing the RMS to 0.64 pixels. Compared with the baseline model, this corresponds to a 48.0% reduction in RMS, demonstrating that the two modules contribute complementary improvements in localization accuracy.
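The relative reductions quoted above follow directly from the RMS values in the text. The short check below recomputes them; the helper name is hypothetical and only the numbers stated in this section are used.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

# RMS errors (pixels) on the validation set, as quoted in the text.
rms_baseline, rms_ghost, rms_bpesa, rms_full = 1.23, 0.69, 0.88, 0.64

reduction_bpesa = round(relative_reduction(rms_baseline, rms_bpesa), 1)
reduction_full = round(relative_reduction(rms_baseline, rms_full), 1)
```

With these inputs, the BPESA-only variant corresponds to a reduction of 28.5% and the full model to 48.0%, matching the figures reported for the validation set.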
3.2. Comparisons with Different Attention
As presented in Table 2, a comprehensive comparison was conducted between the proposed BPESA module and several mainstream attention mechanisms on the same baseline model. All experiments reported in Table 2 were evaluated on the validation set to assess the relative effectiveness of different attention modules during model development. The experimental results indicate that traditional channel-wise attention, such as SE [21], and the position-aware CA module [22] perform suboptimally on this high-precision localization task, even negatively impacting the core RMS error metric. Recent methods that enlarge the receptive field, such as LSK [23], improve the mAP50-95 to 86.3%; however, the RMS remains at 1.14 pixels, and the method introduces a noticeable increase in model complexity, with a 14.8% rise in GFLOPs. In contrast, the proposed BPESA module achieves the best overall performance with negligible additional computational overhead (a mere 1.2% increase in GFLOPs). It not only raises the mAP50-95 to the highest value of 86.4% but, more importantly, reduces the RMS to 0.88 pixels, a 28.5% reduction compared to the baseline. These results demonstrate that, unlike generic attention mechanisms, BPESA effectively balances lightweight design and localization accuracy by explicitly incorporating brightness-prior information into the feature extraction process, allowing the network to focus more reliably on the structural characteristics of reflective markers.
To evaluate the efficacy of different attention mechanisms in reflective marker detection, we conducted a visual comparative analysis under various challenging scenarios. Figure 4 illustrates model performance under common challenging conditions, including missing reflective markers, spatial offsets, rotation, and inversion. While all tested attention mechanisms achieve precise localization under slight rotations, their performance diverges significantly when markers are missing or inverted. Among the tested models, CA exhibited a higher rate of false negatives. In contrast, the proposed BPESA is robust, maintaining the lowest missed detection rate across all tested conditions.
The qualitative advantages of BPESA are further substantiated by the signal profile analysis in Figure 5. In standard infrared imaging, raw marker profiles are often corrupted by specular spikes and high-frequency sensor noise (indicated by the red dashed line), which can match the intensity of the true signal. This saturation-induced flat-top effect, coupled with noise-driven jagged peaks, typically renders traditional thresholding or weighted centroid methods unstable.
As a physics-informed component, the BPESA module acts as a robust matched filter rather than a simple edge detector. As evidenced by the blue solid line, the module exhibits selective signal restoration: it suppresses chaotic background interference, maintaining a low response in non-target regions, while simultaneously reconstructing a smooth, unimodal peak aligned with the marker’s geometric center. Unlike the raw input, which fluctuates unpredictably due to saturation, the BPESA output yields a consistent Gaussian-like envelope. This response confirms that the model effectively leverages the Point Spread Function prior to distinguish between the coherent structural signal of valid targets and random optical noise, directly contributing to the achievement of sub-pixel tracking precision.
3.3. Comparison with Different Methods
To evaluate the generalizability and practical efficacy of the proposed framework, a final benchmark was conducted on the test set. This evaluation involved a comparison against state-of-the-art YOLO series models and classical algorithms [18,25,26]. Table 3 summarizes the performance metrics obtained exclusively from the test set. As the data in Table 3 indicate, the proposed method maintains a clear competitive edge across all evaluation dimensions. Notably, the model achieves this superior accuracy while simultaneously being more lightweight; with 23.1% fewer parameters and 12.3% lower GFLOPs than the baseline YOLOv8n, our model still obtains the highest mAP50-95 of 87.2%.
To validate the robustness of the proposed framework, we benchmarked it against classical optical metrology algorithms on the test set. While the Weighted Centroid algorithm achieves low latency (2.59 ms), it suffers from a critically low recall (63.9%). This deficiency is physically attributed to the flat-top saturation phenomenon. As marker intensity exceeds the sensor’s dynamic range, the Gaussian profile is clipped into a plateau. Although the necessary thresholding removes background noise, it also isolates this flat plateau, stripping the algorithm of the intensity gradient information required for accurate sub-pixel weighting. Consequently, the calculated centroid drifts towards the geometric center of the asymmetric blooming artifact rather than the true marker center.
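The drift mechanism described above can be reproduced with a one-dimensional toy profile. The sketch below is purely illustrative: the profile (a sharp Gaussian core plus a weaker, offset halo), all parameter values, and the function names are assumptions chosen for the demonstration and are not measurements from the dataset.

```python
import math

def profile(center, n=128, peak=10.0, sigma=3.0,
            halo=1.0, halo_sigma=8.0, halo_shift=6.0):
    """1D marker profile: sharp Gaussian core plus a weaker, offset halo."""
    return [
        peak * math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
        + halo * math.exp(-((x - center - halo_shift) ** 2) / (2 * halo_sigma ** 2))
        for x in range(n)
    ]

def weighted_centroid(values):
    """Intensity-weighted centroid of a 1D profile."""
    total = sum(values)
    return sum(x * v for x, v in enumerate(values)) / total

def clip(values, saturation):
    """Simulate sensor saturation by clipping at the given level."""
    return [min(v, saturation) for v in values]

true_center = 64.0
raw = profile(true_center)
err_unclipped = abs(weighted_centroid(raw) - true_center)
err_clipped = abs(weighted_centroid(clip(raw, 1.0)) - true_center)
```

In this toy setting, clipping removes most of the weight carried by the bright core, so the offset halo dominates the weighted sum and the centroid error increases. A perfectly symmetric profile would not drift under clipping, which is why the asymmetry of the blooming artifact is the critical factor.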
The Circular Hough Transform maintains high recall (99.8%), but exhibits significant localization jitter, with an RMS error of 2.73 pixels. This instability arises from the diffuse nature of the PSF in infrared imaging. Unlike the sharp step-edges of calibration targets, reflective markers produce soft brightness transitions. The gradient-based voting mechanism of the Hough transform is easily confused by these diffuse edges, leading to unstable radius estimation and subsequent centroid deviations. In addition to the above classical approaches, two widely used sub-pixel localization methods, 2D Gaussian Fitting [27] and Radial Symmetry Tracking [28], were also evaluated. Gaussian fitting achieves an RMS error of 0.91 pixels, but its performance degrades when strong infrared reflections clip the Gaussian peak into a flat-top saturation profile. Radial symmetry tracking produces a larger RMS error of 1.40 pixels with reduced precision and recall, mainly due to the asymmetric halo caused by optical blooming that distorts the gradient symmetry used for center estimation. In contrast, the proposed method bypasses these intensity-level pitfalls by extracting high-level semantic features, leveraging the BPESA optical prior to distinguish between the physical marker edge and the optical halo.
Note that the physical resolution varies with depth. Based on our camera calibration, at the reference working distance of 1.0 m, the ground sample distance is approximately 0.22 mm/pixel. Therefore, the RMS error of 0.52 pixels achieved by the proposed method corresponds to a physical estimation error of approximately 0.11 mm at this specific depth. This sub-millimeter precision is critical for microsurgical navigation tasks, where minimizing tracking error is essential for avoiding damage to vital tissues. Furthermore, the Precision and Recall of the proposed method reached 98.6% and 89.3%, respectively. In conclusion, these results indicate that the proposed method enhances the localization accuracy for reflective markers while reducing model size, showcasing its potential for high-precision optical tracking scenarios.
In terms of computational efficiency, we evaluated the end-to-end inference speed on the RTX 4080 platform. The test involved high-resolution inputs (2472 × 2064 pixels) resized to the model’s input dimension of 800 × 800 pixels. As shown in the experimental logs, the proposed method achieves an average end-to-end latency of 7.43 ms, corresponding to a frame rate of 134.6 FPS. This performance exceeds the standard real-time requirement of 30 FPS. The low latency is attributed to the lightweight C2f_GhostV2 backbone, which effectively reduces the floating-point operations without compromising feature extraction capabilities. This sub-10 ms latency ensures that the surgical tracking system provides immediate visual feedback with negligible lag.
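The two unit conversions used in this section, pixel error to physical error and latency to frame rate, are simple enough to verify directly. The helper names below are hypothetical; the inputs are the values quoted in the text.

```python
def pixels_to_mm(error_px, mm_per_px):
    """Convert an image-plane error to a physical error at a given sample distance."""
    return error_px * mm_per_px

def latency_to_fps(latency_ms):
    """Convert an end-to-end latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

error_mm = round(pixels_to_mm(0.52, 0.22), 2)  # at the 1.0 m reference distance
fps = round(latency_to_fps(7.43), 1)
```

These reproduce the reported 0.11 mm and 134.6 FPS. The millimeter figure holds only at the 1.0 m reference distance, since the ground sample distance changes with depth.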
4. Discussion
The contribution of this study is the transition from standard feature extraction to physics-prior-guided enhancement. Architecturally, the BPESA module functions as a hybrid optical filter. By integrating a fixed 3 × 3 Laplacian operator with a learnable PSF-modeling branch, the network explicitly separates the signal components: the fixed high-pass filter isolates the high-frequency physical edge, while the learnable branch models the low-frequency intensity leakage associated with optical blooming. Importantly, the adaptive soft-gating mechanism, mediated by the two learnable weights, allows the model to dynamically balance this rigid physical constraint against learned spatial contexts. This ensures that the brightness prior assists localization within the optimal working volume without imposing brittle constraints when the marker scale varies significantly due to depth changes.
Beyond accuracy, the replacement of the standard backbone with C2f_GhostV2 addresses the latency bottleneck in high-frequency tracking. Surgical navigation requires immediate visual feedback, and any processing lag translates to hand–eye coordination errors for the surgeon. Our method achieves a frame rate >130 FPS, providing a substantial safety margin over the standard 30–60 Hz tracking requirement. The results confirm that the global receptive field provided by the DFC attention mechanism effectively compensates for the parameter reduction, maintaining feature discrimination even under motion blur.
Despite these advancements, several limitations warrant further discussion. First, while the dual-branch design accommodates moderate scale variations, the fixed 3 × 3 Laplacian kernel possesses a constant receptive field. When the working distance deviates significantly from the reported 1.0–1.5 m range, the marker’s projection scale changes; at extreme distances, the physical edge signal may become too compressed for the fixed operator, whereas at closer, defocused ranges, the bloom may exceed the kernel support. Second, although our detection framework is theoretically marker-agnostic, the current validation is based on a standard four-marker configuration. In more complex clinical layouts with higher marker density or irregular geometries, performance may be influenced by inter-marker occlusions or proximity artifacts not fully captured in this study.
Finally, the validation relies on geometrically refined manual annotations. While this introduces a dependency on human annotators, it serves as a reliable reference in the presence of sensor saturation. Standard automated algorithms are inherently biased by optical blooming, whereas human perception can reconstruct the physical geometry from the unsaturated marker edges. This manual geometric refinement effectively injects 3D physical priors into the 2D detection training pipeline. Because a spherical marker strictly projects as an ellipse regardless of the viewing angle, fitting the ground truth to the visible unsaturated arc segments forces the network to learn this invariant physical geometry, rather than overfitting to the unpredictable and asymmetric flat-top saturation patterns. Therefore, the reported RMS error of 0.52 pixels represents the deviation from the best available geometric estimate under current imaging constraints.
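For reference, an RMS localization error of this kind can be computed as follows. This is a minimal sketch that assumes the error is the Euclidean distance between each predicted center and its annotated reference center; the exact definition used for the reported value is not restated in this section, and the coordinates below are made-up illustrative data.

```python
import math


def rms_error_px(predicted, reference):
    """RMS Euclidean distance (in pixels) between paired 2D centers."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(px - rx) ** 2 + (py - ry) ** 2
               for (px, py), (rx, ry) in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative data only: three annotated centers and three predictions.
reference_centers = [(10.0, 10.0), (20.0, 15.0), (30.0, 25.0)]
predicted_centers = [(10.3, 10.4), (20.0, 14.5), (29.6, 25.3)]
error = rms_error_px(predicted_centers, reference_centers)
```

Each prediction in this example lies 0.5 pixels from its reference, so the resulting RMS error is 0.5 pixels.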
5. Conclusions
This study presents a novel YOLOv8-based method and evaluates its capability for efficient and accurate detection of reflective markers. By incorporating C2f_GhostV2 and BPESA, the proposed method achieved mAP50-95, R, and P values of 87.2%, 89.3%, and 98.6%, respectively. Crucially, the model demonstrated real-time performance with an inference speed of 134.6 FPS (7.43 ms latency), using 2.312M parameters and achieving a low RMS error of 0.52 pixels. Compared to the baseline YOLOv8n model, P improved by 2.1%, while the number of model parameters was reduced by 23.1% and the RMS error by 54.4%. Thus, the proposed method not only achieves accurate and efficient reflective marker detection but also yields a more lightweight model. Future work will focus on further optimizing the model structure to reduce its parameter count and improve its detection accuracy, as well as exploring its application across a broader range of real-world scenarios.
Author Contributions
Y.X.: writing—review and editing, writing—original draft, methodology, investigation, formal analysis, data curation, conceptualization. H.T.: validation, formal analysis, data curation, supervision. Z.Z.: writing—review and editing, validation, formal analysis, data curation. W.L.: supervision, conceptualization. A.H.: formal analysis. N.Z.: data curation, conceptualization. X.M.: validation, investigation. S.Z.: resources, investigation, funding acquisition. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Anhui provincial Key Research and Development Project (Grant No. 2023s07020008) and supported by the Institute of Energy, Hefei Comprehensive National Science Center (Grant No. 25KZS211). The APC was funded by the Institute of Energy, Hefei Comprehensive National Science Center.
Data Availability Statement
The private dataset underlying the results presented in this paper is not publicly available at this time but may be obtained from the authors upon reasonable request.
Acknowledgments
The authors acknowledge the support of the Anhui provincial Key Research and Development Project (Grant No. 2023s07020008) and the Institute of Energy, Hefei Comprehensive National Science Center (Grant No. 25KZS211).
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- Liu, Y.; Zhao, Z.; Shi, P.; Li, F. Towards Surgical Tools Detection and Operative Skill Assessment Based on Deep Learning. IEEE Trans. Med. Robot. Bionics 2022, 4, 62–71.
- Lu, S.; Yang, J.; Yang, B.; Yin, Z.; Liu, M.; Yin, L.; Zheng, W. Analysis and Design of Surgical Instrument Localization Algorithm. Comput. Model. Eng. Sci. 2023, 137, 669–685.
- Chen, L.; Ma, L.; Zhang, F.; Yang, X.; Sun, L. An intelligent tracking system for surgical instruments in complex surgical environment. Expert Syst. Appl. 2023, 230, 120743.
- Zhang, T.; Wang, J.; Song, S.; Meng, M.Q.-H. Wearable Surgical Optical Tracking System Based on Multi-Modular Sensor Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 5006211.
- Ma, G.; McCloud, M.; Tian, Y.; Narawane, A.; Shi, H.; Trout, R.; McNabb, R.P.; Kuo, A.N.; Draelos, M. Robotics and optical coherence tomography: Current works and future perspectives [invited]. Biomed. Opt. Express 2025, 16, 578–602.
- Xu, L.; Zhang, H.; Wang, J.; Li, A.; Song, S.; Ren, H.; Qi, L.; Gu, J.J.; Meng, M.Q.-H. Information loss challenges in surgical navigation systems: From information fusion to AI-based approaches. Inf. Fusion 2023, 92, 13–36.
- Ercan, M.F.; Qiankun, A.L.; Sakai, S.S.; Miyazaki, T. Circle detection in images: A deep learning approach. In Global Oceans 2020: Singapore—U.S. Gulf Coast; IEEE: New York, NY, USA, 2020; pp. 1–5.
- Shortis, M.R.; Clarke, T.A.; Short, T. Comparison of some techniques for the subpixel location of discrete target images. In Videometrics III; SPIE: Bellingham, WA, USA, 1994; pp. 239–250.
- Vahid, M.R.; Chao, J.; Kim, D.; Ward, E.S.; Ober, R.J. State space approach to single molecule localization in fluorescence microscopy. Biomed. Opt. Express 2017, 8, 1332–1355.
- Chen, L.; Ma, L.; Zhan, W.; Zhang, Y.; Sun, L. Research on the method of anti-occlusion of surgical instrument tracking based on multi-camera module information fusion. Measurement 2025, 239, 115480.
- Luhmann, T.; Robson, S.; Kyle, S.; Boehm, J. Close-Range Photogrammetry and 3D Imaging, 2nd ed.; De Gruyter: Berlin, Germany, 2015.
- Hussain, S.M.; Brunetti, A.; Lucarelli, G.; Memeo, R.; Bevilacqua, V.; Buongiorno, D. Deep Learning Based Image Processing for Robot Assisted Surgery: A Systematic Literature Survey. IEEE Access 2022, 10, 122627–122657.
- Wang, C.; Calle, P.; Ton, N.B.T.; Zhang, Z.; Yan, F.; Donaldson, A.M.; Bradley, N.A.; Yu, Z.; Fung, K.; Pan, C.; et al. Deep-learning-aided forward optical coherence tomography endoscope for percutaneous nephrostomy guidance. Biomed. Opt. Express 2021, 12, 2404–2418.
- Han, W.; Dong, X.; Wang, G.; Ding, Y.; Yang, A. Application and improvement of YOLO11 for brain tumor detection in medical images. Front. Oncol. 2025, 15, 1643208.
- Harper, D.M.; Chen, L.J.; McKay, R.T.; Nguyen, S.L.; Fontaine, M.A. Physics-informed neural networks for real-time deformation-aware AR surgical tracking. bioRxiv 2025.
- Barbastathis, G.; Ozcan, A.; Situ, G. On the use of deep learning for computational imaging. Optica 2019, 6, 921–943.
- Nehme, E.; Freedman, D.; Gordon, R.; Ferdman, B.; Weiss, L.E.; Alalouf, O.; Naor, T.; Orange, R.; Michaeli, T.; Shechtman, Y. DeepSTORM3D: Dense 3D localization microscopy and PSF design by deep learning. Nat. Methods 2020, 17, 734–740.
- Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
- Li, Z.; Cui, M.; Hu, D.; Gong, J.; Weng, J.; Zhang, Z.; Tian, L.; Li, M.; Huang, K. A two-stage method for specular highlight detection and removal in medical images. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2025; Gee, J.C., Alexander, D.C., Hong, J., Iglesias, J.E., Sudre, C.H., Venkataraman, A., Golland, P., Kim, J.H., Park, J., Eds.; Springer Nature: Cham, Switzerland, 2026; pp. 23–33.
- Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance cheap operation with long-range attention. In Advances in Neural Information Processing Systems; (supplementary material); Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 9969–9982.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 7132–7141.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 13708–13717.
- Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 16748–16759.
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2023; pp. 1–5.
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 1–21.
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 107984–108011.
- Ye, F.; Inman, J.T.; Hong, Y.; Hall, P.M.; Wang, M.D. Resonator nanophotonic standing-wave array trap for single-molecule manipulation and measurement. Nat. Commun. 2022, 13, 77.
- Parthasarathy, R. Rapid, accurate particle tracking by calculation of radial symmetry centers. Nat. Methods 2012, 9, 724–726.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.