Article

Precision Without Complexity: A Comparative Study of YOLO26 Pose Variants for Distal Arm Landmark Detection

1
Industry 4.0 Convergence Bionics Engineering, Pukyong National University, Busan 48513, Republic of Korea
2
Digital Healthcare Research Center, Pukyong National University, Busan 48513, Republic of Korea
3
College of Korean Medicine, Kyung Hee University, Seoul 02453, Republic of Korea
4
College of Korean Medicine, Dongshin University, Naju 58245, Republic of Korea
5
Division of Smart Healthcare, College of Information Technology and Convergence, Pukyong National University, Busan 48513, Republic of Korea
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2026, 16(8), 3968; https://doi.org/10.3390/app16083968
Submission received: 26 March 2026 / Revised: 7 April 2026 / Accepted: 15 April 2026 / Published: 19 April 2026

Abstract

Accurate anatomical landmark localization in clinical images requires millimeter-level spatial precision, yet whether increasing model scale improves such precision in structured medical imaging tasks remains unclear. Five YOLO26 pose-estimation variants (N, S, M, L, and X) were evaluated on 3679 RGB distal-arm images from 262 participants under a standardized overhead imaging protocol, with five anatomical landmarks annotated across the proximal forearm, mid-forearm, and hand. Localization error was quantified in millimeters using ArUco-marker-based pixel-to-millimeter calibration; all models were initialized from COCO-pretrained weights, fine-tuned under identical conditions, and assessed using COCO-style detection metrics and physically grounded localization error. Detection performance saturated across all scales (mAP@0.5 = 99.5%), while localization performance differed substantially; YOLO26N achieved the lowest mean error (2.76 ± 0.96 mm) and the highest proportion of predictions within 4 mm (88.0%), whereas YOLO26X produced the highest mean error (4.08 ± 2.59 mm) despite a 26.9× higher computational cost. Landmark-wise analysis revealed a consistent proximal-to-distal error gradient, with the largest degradation at anatomically ambiguous proximal landmarks in larger models. These findings suggest that increasing model capacity does not improve clinically meaningful localization precision in structured distal-arm imaging, and lightweight models may offer the most favorable accuracy-efficiency trade-off in resource-constrained clinical settings.

1. Introduction

Automated anatomical landmark detection is a core task in medical computer vision, underpinning clinical applications including surgical planning, orthopedic assessment, radiographic measurement, rehabilitation monitoring, and treatment guidance [1,2]. One of the major challenges in this domain is the limited availability of annotated medical imaging data for model training, compounded by the sensitivity of learned representations to inter-annotator variability in annotation quality, which introduces label noise that can adversely affect model performance [3,4]. Beyond data constraints, these tasks require not only reliable detection of landmark presence but precise spatial localization at the millimeter scale, a level of accuracy that distinguishes models capable of clinical deployment from those that remain confined to research settings [5,6]. Clinical deployment of AI systems for anatomical landmark detection also raises important data security considerations. Systems handling sensitive biometric imaging data must address risks related to data protection, including encryption-based approaches to securing medical image data across edge devices and cloud-connected platforms [7].
Deep learning (DL) has substantially advanced the state of the art in anatomical landmark detection over the past decade. Early convolutional neural network (CNN) approaches established the viability of data-driven landmark localization in medical imaging, demonstrating that CNNs could perform automated anatomical landmark detection at accuracy levels approaching clinical inter-observer variability [8]. Subsequent architectures, including deep residual networks, further improved representational capacity, while methods that integrate spatial configuration into heatmap-based CNN frameworks demonstrated that exploiting the structured spatial relationships between anatomical sites yields substantial gains in localization accuracy, even under limited training data conditions [1,9]. Multi-scale feature fusion and high-resolution representation learning further advanced spatial precision, and the development of more sophisticated detection pipelines reduced dependency on large labeled datasets [10,11,12]. DL-based models for anatomical landmark localization have consistently demonstrated that accuracy varies systematically by anatomical region, with landmarks overlying distinctive bony prominences outperforming those in regions of greater soft-tissue variability, a pattern reported across cephalometric [5], spinal [2], and multi-body-region evaluations [13].
Pose estimation frameworks originally developed for human body keypoint detection have increasingly been adapted for anatomical landmark localization, exploiting their capacity to jointly detect object regions and predict keypoint coordinates in a single forward pass [14,15]. The standard benchmark for training and evaluating pose estimation models is the Microsoft COCO dataset, which provides large-scale keypoint annotations across diverse real-world images and defines the primary evaluation protocol used across the field, including mean Average Precision (mAP) computed under the Object Keypoint Similarity (OKS) metric at multiple OKS thresholds [16]. Two-dimensional pose estimation requires less computational power, is more data-accessible, and is easier to deploy on consumer-grade hardware, making it well-suited for both research and practical clinical applications [17].
Among pose estimation frameworks, the YOLO series has become widely adopted for single-stage, real-time keypoint detection. Originally proposed as a unified, regression-based object detector, the YOLO family has evolved through successive generations from YOLOv7 through YOLOv8 and YOLOv9 [17,18,19,20] to the most recent YOLO26 [21], each introducing architectural innovations that improve the balance between detection speed, accuracy, and computational efficiency [22]. High-resolution network architectures such as HRNet [10] established the importance of maintaining spatial precision throughout feature extraction, a principle that has influenced the design of subsequent YOLO-based pose estimators. These frameworks have been applied in medical imaging contexts, including patient monitoring, bedside assessment, and anatomical landmark localization [14,15,17]. Fiducial markers such as ArUco [23] have been adopted in several medical imaging pipelines to provide reliable physical scale references, enabling pixel-to-millimeter conversion and grounding model outputs in clinically interpretable units.
The most directly relevant prior work is that of Malekroodi et al. [24], who applied a YOLOv8-based pose estimation framework to detect five distal arm anatomical landmarks (LI11, LI10, TE5, LI4, and TE3) from a dataset of controlled arm images, achieving a mean Average Precision (mAP) of 0.99 at OKS 50% and mean localization errors below 5 mm relative to expert annotations. Yuan et al. [25] proposed YOLOv8-ACU, a modified YOLOv8-pose framework incorporating lightweight attention modules for facial landmark detection, achieving an mAP@0.5 of 97.5% and mAP@0.5:0.95 of 76.9% while reducing model parameters and computational load relative to the baseline. Wang et al. [26] addressed hand landmark localization using a cascaded network that combines YOLOv5 with HRNet and a dual-attention mechanism, achieving accurate localization of 21 hand landmarks with strong real-time performance. Seo et al. [27] evaluated HRNet and ResNet architectures for hand landmark detection on 2D images, providing comparative baselines for pose estimation in constrained distal extremity imaging. Yang et al. applied structure-guided deep learning with bone-measuring constraints for back landmark localization, achieving a normalized mean error of 0.6% and a failure rate of 1.2% at 1 cm on a dataset of 430 back images with 19 annotated landmarks while maintaining real-time inference [28].
Across this body of work, the YOLO framework follows the established convention of offering multiple model variants spanning a wide range of architectural complexity. The YOLO26 family covers five scale points, YOLO26N through YOLO26X, ranging from 2.9 M to 57.6 M parameters and 7.5 to 201.7 GFLOPs [21]. This scaling convention is motivated by findings on large-scale benchmarks, where larger variants consistently achieve higher mAP by exploiting increased representational capacity to handle diverse scenes, occlusions, and viewpoints [22]. Whether this benefit transfers to structured medical imaging tasks, however, is an open and practically important question. In clinical imaging settings characterized by constrained acquisition protocols, limited pose variability, and small annotated datasets, the task complexity may be substantially lower than in general-purpose benchmarks. Data augmentation techniques have been widely adopted to partially address this limitation and improve model generalization under constrained training conditions [29]. Nevertheless, increased model complexity heightens the risk of overfitting, particularly in the medical domain, where datasets are often limited, and annotations are expensive to acquire [3,30]. Under these conditions, larger models may fail to improve or may actively degrade localization precision relative to lightweight counterparts.
This question has direct practical significance for clinical deployment. In clinical applications, pose estimation systems must balance accuracy with real-time responsiveness and limited computational resources, making efficiency a primary design consideration alongside precision [17,19]. Lightweight architectures offer significant advantages: lower memory footprint, faster inference, and suitability for edge environments [31]. If lightweight models achieve localization accuracy equivalent to or better than that of larger models in structured tasks, the assumption that larger models are inherently preferable in clinical AI systems warrants reconsideration.
Despite strong benchmark performance gains from architectural scaling in natural-image pose estimation, it remains unknown whether the same scaling behavior benefits structured clinical imaging tasks characterized by constrained acquisition, limited anatomical variability, and relatively small annotated datasets. In such settings, detection may become trivial while sub-centimeter localization remains challenging, and excessive model capacity may amplify overfitting rather than improve precision. To address this question, we compared five YOLO26 pose-estimation variants spanning a 26.9× range of computational cost on 3679 distal-arm images with five expert-annotated anatomical landmarks. Performance was assessed using both standard detection metrics and physically calibrated millimeter-scale localization error. We hypothesized that, in this structured imaging setting, larger models would not necessarily yield better localization accuracy and that lightweight models might provide the optimal balance between precision and computational efficiency.

2. Materials and Methods

2.1. Study Design and Objective

This study was designed to evaluate whether increasing architectural scale improves millimeter-level keypoint localization accuracy for automated anatomical landmark detection on distal arm images. Although larger deep learning models often outperform smaller ones on large-scale benchmark datasets, it remains unclear whether this advantage persists in structured clinical imaging tasks with constrained pose variability and limited training data [3,22,30]. Because clinical deployment also requires efficient inference under practical hardware constraints, model comparison in this setting should consider both spatial precision and computational cost [17,19,31]. We therefore evaluated five YOLO26 pose estimation variants under matched training and evaluation conditions and compared their performance using standard detection metrics, along with physically grounded localization errors expressed in millimeters.
To examine this, five variants of the YOLO26 pose estimation framework, spanning a 26.9× range in computational cost (7.5–201.7 GFLOPs), were trained and evaluated under identical experimental conditions on a real-world dataset of distal-arm images. Expressing localization error in millimeters, alongside the standard detection metrics, enables direct assessment of clinical relevance.

2.2. Image Acquisition and Landmark Annotation

The dataset comprised 3679 RGB images from 262 healthy adult participants (age range: 18–68 years), with images of both left and right arms captured per participant. For each image, participants placed one forearm in a standardized pronated position on a flat surface, and images were captured from a fixed, top-down perspective at 1488 × 837 pixels, ensuring consistent framing and minimal perspective distortion across sessions. Each image included an ArUco fiducial marker (100 × 100 mm) to establish a physical scale reference, enabling pixel-to-millimeter conversion and compensating for minor inter-session variation in camera-to-subject distance, thereby allowing all localization errors to be reported in physically meaningful units.
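The pixel-to-millimeter conversion described above can be sketched as follows. This is a minimal illustration rather than the authors' calibration code: it assumes the four marker corners have already been detected and are given in pixel coordinates, ordered around the marker. The function name `mm_per_pixel` and the sample corner values are hypothetical; only the 100 mm side length comes from the text.

```python
import math

MARKER_SIDE_MM = 100.0  # physical side length of the ArUco marker stated in the text


def mm_per_pixel(corners, side_mm=MARKER_SIDE_MM):
    """Return a scale factor in mm per pixel from four marker corners.

    corners: four (x, y) pixel coordinates ordered around the marker.
    Averaging the four side lengths is an assumption of this sketch; it
    dampens small detection noise on any single corner.
    """
    if len(corners) != 4:
        raise ValueError("expected exactly four corners")
    sides_px = [math.dist(corners[i], corners[(i + 1) % 4]) for i in range(4)]
    mean_side_px = sum(sides_px) / 4.0
    if mean_side_px <= 0:
        raise ValueError("degenerate marker")
    return side_mm / mean_side_px


# Hypothetical corners: a marker that spans 200 px per side in the image.
example_corners = [(100.0, 100.0), (300.0, 100.0), (300.0, 300.0), (100.0, 300.0)]
scale = mm_per_pixel(example_corners)  # 100 mm / 200 px = 0.5 mm per pixel
```

Computing the factor per image, rather than once for the whole dataset, is what allows small changes in camera-to-subject distance between sessions to be absorbed.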
Five anatomical landmarks were annotated per image: LI11 and LI10 (proximal forearm), TE5 (mid-forearm), and LI4 and TE3 (hand region), as illustrated in Figure 1. All annotations were produced in COCO format following a standardized protocol developed under the supervision of experienced Traditional East Asian Medicine (TEAM) practitioners. From a pool of trained candidate annotators, the annotator with the highest agreement with expert reference labels was selected to annotate the entire dataset, minimizing systematic annotation bias. To prevent subject-level data leakage, images were partitioned strictly by participant identity: 2963 images for training (201 participants), 666 for validation (43 participants), and 50 for final evaluation (18 participants) (Figure 2). The test set consists exclusively of images from participants not present in either the training or validation splits.
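A participant-level split of the kind described above can be sketched as below. The authors' actual assignment procedure is not described, so the random shuffling, the function name `split_by_participant`, and the toy file names are all assumptions made for illustration; the only property taken from the text is that no participant contributes images to more than one split.

```python
import random
from collections import defaultdict


def split_by_participant(image_to_participant, n_val, n_test, seed=0):
    """Assign whole participants to train/val/test so no participant spans two splits."""
    participants = sorted(set(image_to_participant.values()))
    rng = random.Random(seed)
    rng.shuffle(participants)
    test_ids = set(participants[:n_test])
    val_ids = set(participants[n_test:n_test + n_val])
    splits = defaultdict(list)
    for image, pid in sorted(image_to_participant.items()):
        if pid in test_ids:
            splits["test"].append(image)
        elif pid in val_ids:
            splits["val"].append(image)
        else:
            splits["train"].append(image)
    return dict(splits)


# Hypothetical toy data: 10 participants with 3 images each.
toy = {f"p{p:02d}_img{k}.jpg": f"p{p:02d}" for p in range(10) for k in range(3)}
toy_splits = split_by_participant(toy, n_val=2, n_test=2)
```

Splitting on participant identity rather than on individual images is the step that prevents left and right arms of the same person from appearing on both sides of the train/test boundary.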

2.3. YOLO26 Pose-Estimation Framework

Landmark localization was performed using the YOLO26 pose estimation framework, a single-stage architecture that jointly predicts bounding regions and keypoint coordinates in a single forward pass. The network comprises three functional components: a hierarchical feature-extraction backbone, a multi-scale feature-aggregation neck, and a dedicated keypoint-prediction head, as illustrated in Figure 3. Five model variants were evaluated (YOLO26N, YOLO26S, YOLO26M, YOLO26L, and YOLO26X) spanning 2.9–57.6 million parameters and 7.5–201.7 GFLOPs, with detailed architectural specifications summarized in Table 1. This range provides a controlled basis for analyzing the impact of architectural scaling on localization precision under consistent training conditions.

2.4. Model Training, Configuration, and Evaluations

All five YOLO26 pose-estimation variants (N, S, M, L, and X) were initialized from COCO-pretrained weights provided by the Ultralytics framework and fine-tuned under identical conditions on a single NVIDIA GeForce RTX 4090 GPU (24 GB VRAM) using PyTorch 2.0.1, CUDA 11.8, and the Ultralytics 8.4.6 framework. The input images were resized to 640 × 640 pixels, and the same augmentation strategy (horizontal flipping, HSV photometric distortion, scaling, translation, mosaic, and random erasing) was applied to all variants. Models were optimized using SGD with momentum 0.937, weight decay 0.0005, and an initial learning rate of 0.01, trained for up to 300 epochs with a batch size of 16 and early stopping based on validation performance. All hyperparameters were held constant across variants, and final performance was reported on the participant-independent held-out test set. A Spearman rank correlation between computational cost (GFLOPs) and mean localization error was computed across the five model variants as a non-parametric supplement to the ordinary least squares (OLS) linear trend.
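The shared training configuration can be summarized as one set of keyword arguments applied to every variant. This is a sketch, not the authors' script. The dataset YAML path, the early-stopping patience, and the weight-file naming pattern are assumptions, because the text does not state them; the remaining values are taken from the paragraph above. The argument names follow the Ultralytics training settings as we understand them and should be checked against the installed version.

```python
# Values stated in the text, held constant across all five variants.
TRAIN_KWARGS = {
    "imgsz": 640,            # input resolution
    "epochs": 300,           # upper bound; early stopping may end training sooner
    "batch": 16,
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}

# Assumptions of this sketch (not given in the text).
ASSUMED_KWARGS = {
    "data": "distal_arm_pose.yaml",  # hypothetical dataset config path
    "patience": 50,                  # hypothetical early-stopping patience
}

VARIANTS = ["n", "s", "m", "l", "x"]


def weights_name(variant):
    """Assumed naming pattern for the COCO-pretrained pose weights."""
    return f"yolo26{variant}-pose.pt"


def train_all():
    """Fine-tune every variant with identical settings (requires ultralytics)."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    for variant in VARIANTS:
        model = YOLO(weights_name(variant))
        model.train(**TRAIN_KWARGS, **ASSUMED_KWARGS)
```

Keeping the arguments in a single dictionary makes it easy to confirm that the only thing varying between runs is the weight file.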
Detection Performance: Detection accuracy was assessed using mean Average Precision at IoU threshold 0.5 (mAP@0.5) and mAP@0.5:0.95, following the standard COCO evaluation protocol.
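For keypoints, COCO-style mAP is built on the Object Keypoint Similarity (OKS), which plays the role that IoU plays for boxes. The sketch below follows the COCO definition, in which each keypoint contributes exp(-d² / (2 · area · k²)) with k = 2σ. The per-landmark constants in `sigmas` and all coordinates in the example are hypothetical placeholders, since the text does not state which values were used for the five arm landmarks.

```python
import math


def oks(pred, truth, area, sigmas, visible=None):
    """Object Keypoint Similarity between one prediction and one annotation.

    pred, truth: lists of (x, y) pixel coordinates, one per keypoint.
    area: object area in pixels squared (the squared object scale).
    sigmas: per-keypoint falloff constants; k = 2 * sigma as in COCO.
    visible: optional flags; keypoints marked not visible are skipped.
    """
    if area <= 0:
        raise ValueError("area must be positive")
    if visible is None:
        visible = [True] * len(truth)
    total, count = 0.0, 0
    for (px, py), (tx, ty), sigma, vis in zip(pred, truth, sigmas, visible):
        if not vis:
            continue
        d2 = (px - tx) ** 2 + (py - ty) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


# Hypothetical single-keypoint example: d^2 = 200, area = 10000, sigma = 0.05,
# so the exponent is -200 / (2 * 10000 * 0.01) = -1.
example_oks = oks([(10.0, 10.0)], [(0.0, 0.0)], area=10000.0, sigmas=[0.05])
```

A prediction counts as a match at mAP@0.5 when its OKS is at least 0.5. Because the similarity is normalized by object area, it tolerates offsets of several pixels, which helps explain how detection scores can saturate while millimeter-level errors still differ.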
Localization Error in Millimeters: For each landmark k, the localization error in pixel units was computed as the Euclidean distance between the predicted coordinate $(\hat{x}_k, \hat{y}_k)$ and the reference annotation $(x_k, y_k)$, as shown in Equation (1).
$$e_k^{\mathrm{px}} = \sqrt{(\hat{x}_k - x_k)^2 + (\hat{y}_k - y_k)^2} \qquad (1)$$
Pixel-space distances were converted to millimeters using the ArUco-derived scale factors, as defined in Equation (2).
$$e_{ik}^{\mathrm{mm}} = e_{ik}^{\mathrm{px}} \times s_i \qquad (2)$$
The mean per-image error across all five landmarks serves as the primary localization metric. Reporting error in millimeters rather than pixels grounds model evaluation in physical space and enables direct clinical interpretation.
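Equations (1) and (2) and the per-image mean can be sketched in a few lines. The function names, the dictionary layout, and every coordinate in the example are hypothetical; the scale factor is assumed to be expressed in millimeters per pixel for the image in question.

```python
import math


def landmark_errors_mm(pred_px, ref_px, scale_mm_per_px):
    """Euclidean pixel error per landmark (Eq. 1), scaled to millimeters (Eq. 2)."""
    return {
        name: math.dist(pred_px[name], ref_px[name]) * scale_mm_per_px
        for name in ref_px
    }


def mean_image_error_mm(pred_px, ref_px, scale_mm_per_px):
    """Mean error over all landmarks of one image: the primary metric."""
    errors = landmark_errors_mm(pred_px, ref_px, scale_mm_per_px)
    return sum(errors.values()) / len(errors)


# Hypothetical reference and predicted coordinates for one image (pixels).
ref = {"LI11": (100, 100), "LI10": (200, 120), "TE5": (400, 150),
       "LI4": (700, 200), "TE3": (720, 260)}
pred = {"LI11": (103, 104), "LI10": (206, 128), "TE5": (400, 150),
        "LI4": (700, 205), "TE3": (725, 260)}

errors = landmark_errors_mm(pred, ref, scale_mm_per_px=0.5)
image_error = mean_image_error_mm(pred, ref, scale_mm_per_px=0.5)
# Pixel errors are 5, 10, 0, 5, 5, so the mean is 5 px, or 2.5 mm at 0.5 mm/px.
```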

3. Results and Analysis

3.1. Keypoint Detection Performance

All five YOLO26 variants achieved near-perfect landmark detection, with mAP@0.5 reaching 99.5% across all models (Table 1). While the same variants achieve mAP@0.5 scores of 83.3–91.6% on the COCO benchmark, the near-perfect detection observed on the distal arm dataset reflects the constrained and structured nature of the imaging protocol, with a fixed viewpoint, controlled lighting, and limited pose variability, rather than a ceiling effect of model capacity. This uniformity provides no discriminating power between model scales; all subsequent analysis, therefore, focuses on spatial localization accuracy.

3.2. Effect of Model Scale on Localization Accuracy

Table 2 summarizes per-model localization performance. The results reveal a counterintuitive pattern: localization accuracy does not improve with increasing model size. YOLO26N, the smallest variant with 2.9 M parameters and 7.5 GFLOPs, achieves the lowest mean error (2.76 ± 0.96 mm) and the highest proportion of predictions within 4 mm (88.0%). YOLO26L produces a comparable mean error of 2.96 mm despite requiring 12× more compute. The largest model, YOLO26X (57.6 M parameters, 201.7 GFLOPs), yields the worst mean error of 4.08 ± 2.59 mm, a 47.8% increase over YOLO26N, alongside a markedly elevated standard deviation indicating unreliable tail performance.
No monotonic trend is observed across the scaling sequence. YOLO26S underperforms YOLO26M, which underperforms YOLO26L, and the transition from YOLO26L to YOLO26X produces the sharpest accuracy degradation in the series. These findings are consistent with model scaling providing no systematic localization benefit in this setting. Neither the OLS trend (r = 0.82, p = 0.086) nor the Spearman rank correlation (ρ = 0.60, p = 0.28) reaches conventional significance at n = 5. However, three metrics (lowest mean error, lowest 90th-percentile error, and highest 4 mm compliance) all favor YOLO26N, and this convergent pattern supports the interpretation (Figure 4).
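The reported Spearman coefficient can be reproduced from the ordering described above alone. The sketch assumes only that GFLOPs increase from N to X and that mean error ranks as stated in the text (YOLO26N lowest, then L, M, S, with X highest); no per-model error values beyond those already reported are used. The helper name `spearman_from_ranks` is ours.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(rank_x) != len(rank_y):
        raise ValueError("rank lists must have equal length")
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Variants in order N, S, M, L, X.
gflops_rank = [1, 2, 3, 4, 5]  # compute grows with model scale
error_rank = [1, 4, 3, 2, 5]   # N best; S worse than M worse than L; X worst

rho = spearman_from_ranks(gflops_rank, error_rank)
# Rank differences are 0, 2, 0, 2, 0, so sum(d^2) = 8 and rho = 1 - 48/120 = 0.6.
```

The result agrees with the reported ρ = 0.60. With five points, swapping a single pair of ranks changes the coefficient noticeably, which is one reason the test does not reach significance.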

3.3. Landmark-Error Distribution and Analysis

Table 2 presents the full distribution of localization errors for each model. The contrast between YOLO26N and YOLO26X is instructive. While their medians are comparable (2.65 mm vs. 3.05 mm), the distributions diverge substantially in the upper tail: the 90th-percentile error for YOLO26X (9.18 mm) is 2.3× that of YOLO26N (4.04 mm), and YOLO26X’s maximum error (10.44 mm) is nearly double YOLO26N’s (5.53 mm). This indicates that YOLO26X produces acceptable predictions under favorable conditions but is prone to severe localization failures on more challenging images, a critical liability for clinical use.
YOLO26N’s distribution is compact and consistent: 66.0% of predictions fall below 3 mm, 88.0% below 4 mm, and 96.0% below 5 mm, with an interquartile range of 2.09–3.30 mm. The violin plot shows a narrow, unimodal distribution for YOLO26N with negligible upper-tail spread, in contrast to YOLO26X, whose wide violin shape and nine individual outliers (7.6–10.4 mm) reflect highly variable and unreliable predictions. For YOLO26N, this combination of low central tendency and well-controlled tail behavior indicates reliable, clinically relevant precision across the test set (Figure 5).
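The threshold compliance rates quoted above are simple fractions of the 50 test images. The sketch below shows the computation on synthetic values: the `errors` list is not the study data, only a hypothetical set built so that 33, 44, and 48 of 50 values fall below 3, 4, and 5 mm. Whether the authors used a strict or non-strict comparison is not stated; a strict one is assumed here.

```python
def compliance_rate(errors_mm, threshold_mm):
    """Fraction of errors strictly below the threshold (strictness is an assumption)."""
    if not errors_mm:
        raise ValueError("no errors given")
    return sum(e < threshold_mm for e in errors_mm) / len(errors_mm)


# Synthetic per-image mean errors in mm, constructed only to match the reported counts.
errors = [2.5] * 33 + [3.5] * 11 + [4.5] * 4 + [5.2] * 2

rates = {t: compliance_rate(errors, t) for t in (3.0, 4.0, 5.0)}
# 33/50 = 0.66, 44/50 = 0.88, 48/50 = 0.96
```

On a 50-image test set each image contributes two percentage points, so differences of a few points between models correspond to only one or two images.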
Table 3 disaggregates mean localization error by anatomical landmark. A consistent proximal-to-distal accuracy gradient is observed across all model variants. Distal hand landmarks (LI4 and TE3) consistently yield the lowest errors, with LI4 achieving the best single-landmark result of 2.07 mm (YOLO26M). Proximal forearm landmarks LI11 and LI10 produce the highest and most variable errors across all models, reaching 5.54 mm and 5.35 mm, respectively, under YOLO26X. Figure 6a shows the cumulative distribution of per-image mean localization error, with YOLO26N leading compliance rates at every threshold (66% at 3 mm, 88% at 4 mm, 96% at 5 mm). Figure 6b presents the mean localization error, disaggregated by anatomical landmark and model variant, as a grouped bar chart with ±1 SD error bars, confirming a proximal-to-distal accuracy gradient across all models.
This gradient reflects the underlying anatomy. Distal landmarks correspond to well-defined bony prominences with stable visual texture and sharp spatial boundaries, facilitating reliable detection. Proximal landmarks overlie regions of greater soft tissue variability, lower surface contrast, and less visually distinctive local features. These characteristics increase localization difficulty, particularly for high-capacity models that are prone to overfitting under limited data. Notably, the proximal–distal performance gap is narrowest for YOLO26N and widest for YOLO26X, reinforcing the finding that larger models generalize less effectively in anatomically ambiguous regions.
Figure 7 provides qualitative support for these findings on a representative test image, where YOLO26N achieves the largest improvement over all competing variants. Across all five landmarks, YOLO26N (2.27 mm mean error) consistently places predictions closest to the ground truth. The zoomed insets reveal that the largest inter-model differences occur at the proximal landmarks LI11 and LI10, where YOLO26X produces severe mislocalizations of 17.5 mm and 16.5 mm, respectively. In comparison, YOLO26N remains within 3.3 mm and 2.6 mm. At the more distal landmarks TE5, LI4, and TE3, predictions from all models cluster near the ground truth, consistent with the lower localization difficulty toward the hand region. These observations confirm that the proximal–distal accuracy gradient identified in the quantitative analysis is spatially systematic and not an artifact of image aggregation.

3.4. Computational Efficiency

YOLO26X consumes 26.9× more compute than YOLO26N while yielding a 47.8% higher mean localization error. Given its lower memory footprint and faster inference, YOLO26N is well suited to edge-based, real-time clinical deployment. Taken together, these results establish a clear and practically important finding: in structured anatomical landmark localization tasks with limited training data, model scaling is not a productive optimization strategy. Under these conditions, lightweight architectures are not a compromise; they are the optimal design choice.
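Both headline ratios follow directly from figures reported earlier in the text, as the short check below shows. The variable names are ours; the four input values are the reported GFLOPs and mean errors for YOLO26N and YOLO26X.

```python
# Reported values for the smallest and largest variants.
gflops_n, gflops_x = 7.5, 201.7
mean_err_n_mm, mean_err_x_mm = 2.76, 4.08

# Compute ratio: 201.7 / 7.5 is about 26.9.
compute_ratio = gflops_x / gflops_n

# Relative increase in mean error: (4.08 - 2.76) / 2.76 is about 0.478.
relative_error_increase = (mean_err_x_mm - mean_err_n_mm) / mean_err_n_mm

summary = (round(compute_ratio, 1), round(100 * relative_error_increase, 1))
```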

4. Discussion

This study shows that increasing the pose-model scale does not necessarily improve clinically meaningful localization of anatomical landmarks in structured distal-arm imaging. Although all five YOLO26 variants achieved near-identical and near-ceiling detection performance, clear differences emerged when performance was assessed using physically calibrated localization error. The smallest model, YOLO26N, achieved the lowest mean error (2.76 ± 0.96 mm) and the most stable error distribution, whereas the largest model, YOLO26X, showed the poorest overall precision (4.08 ± 2.59 mm) despite requiring 26.9× greater computational cost. These findings suggest that, in this setting, the key challenge is not landmark detection itself, but the ability to localize anatomical points with consistent millimeter-level accuracy. Table 4 contextualizes these findings within related work on anatomical landmark localization, demonstrating that the present study achieves the lowest reported physically calibrated mean localization error across comparable distal-arm acupoint detection tasks, while using the smallest single-stage model variant.
This distinction is central to the clinical interpretation of the results. In many computer vision studies, model comparison is dominated by benchmark-style detection metrics such as mAP. However, for anatomical landmark analysis, detection and localization should not be regarded as equivalent endpoints. In the present study, mAP@0.5 was effectively saturated across all models, indicating that each variant could identify the relevant landmark configuration under the standardized acquisition protocol. Yet this apparent equivalence disappeared once performance was expressed in millimeters. The clinical value of an anatomical landmark model lies not simply in whether it detects a point, but in whether it places that point with sufficient spatial precision to support downstream interpretation, guidance, or intervention. From this perspective, physically grounded error measures are more informative than benchmark metrics alone, and the present findings illustrate how reliance on detection performance can obscure differences that are highly relevant in practice.
The absence of benefit from architectural scaling is also biologically and technically plausible. Larger pose-estimation models often perform well in unconstrained natural-image environments because they must accommodate substantial variation in viewpoint, occlusion, background clutter, and object appearance. By contrast, the current task was defined within a tightly controlled imaging environment, with fixed overhead positioning, consistent framing, and limited background variability. Under such conditions, the recognition problem is relatively simple, and landmark presence becomes easy to establish even for lightweight models. What remains difficult is the finer problem of precise coordinate regression from subtle anatomical surface cues. In that context, additional model capacity may add little useful information. On the contrary, excessive capacity may increase sensitivity to training-specific visual patterns that do not generalize well across participants, resulting in reduced robustness rather than improved precision. The behavior of YOLO26X is consistent with this interpretation, particularly because its disadvantage was evident not only in average error but also in the broader tail of large-error predictions, a failure mode we characterize more precisely below as task-capacity mismatch overfitting.
The performance degradation of YOLO26X can be understood in terms of task-capacity mismatch overfitting, which is mechanistically distinct from classical overfitting. In classical overfitting, failure is detectable via training-validation divergence. Here, validation performance remained acceptable, yet test-set tail errors were severely elevated, with 9 outliers ranging from 7.6 to 10.4 mm and a 90th-percentile error of 9.18 mm (2.3× YOLO26N’s 4.04 mm). From a feature learning perspective, detection saturates trivially across all scales (mAP@0.5 = 99.5%), leaving precise coordinate regression as the real challenge, a low-information-content problem relative to YOLO26X’s 57.6 M parameters and ~3000 training images. From a network fitting perspective, when capacity far exceeds task complexity, gradient descent fits spurious participant-specific correlations rather than a better regression function, amplifying prediction variance at the tail rather than average error [3,30]. From an attention distribution perspective, larger models appear to exploit non-generalizable local texture patterns more aggressively at anatomically ambiguous regions. This is consistent with YOLO26X’s failures being concentrated almost entirely at the proximal landmarks LI11 (5.35 mm) and LI10 (5.54 mm), versus 3.16 mm and 3.27 mm for YOLO26N, with negligible inter-model differences at distal landmarks. Notably, since all variants were initialized from identical COCO-pretrained weights, this disadvantage cannot be attributed to initialization, making the capacity-mismatch explanation more compelling.
The landmark-specific findings further support this explanation. Across all models, localization error followed a clear proximal-to-distal gradient, with the proximal forearm landmarks LI11 and LI10 showing greater error than the more distal landmarks in the hand region. This pattern is anatomically credible. Distal landmarks are typically associated with more visually distinctive contours and firmer underlying structural cues. In contrast, proximal forearm landmarks are more susceptible to soft-tissue variation and have less sharply defined external boundaries. Such regions are intrinsically more ambiguous in two-dimensional surface imaging. Importantly, the performance penalty of the larger models was most evident in these anatomically ambiguous proximal landmarks, suggesting that increased capacity did not improve discrimination where anatomical uncertainty was greatest. Instead, the smaller model appeared to generalize more stably across regions with less explicit visual structure.
These findings also have practical significance for the deployment of clinical AI. In translational settings, the most useful model is rarely the one with the greatest absolute complexity, but rather the one that offers the best balance of accuracy, consistency, and efficiency. Here, the lightweight model was not merely faster or cheaper to run; it was also the best-performing model on the outcome that matters most clinically. This has important implications for real-time or resource-constrained implementations, where lower latency, reduced hardware requirements, and more predictable behavior are often essential. The results therefore argue for a task-matched approach to model selection in medical image analysis, rather than an assumption that performance will improve monotonically with scale. In structured imaging tasks with limited visual variability, a lightweight architecture may not represent a compromise, but the most appropriate solution.
The present study also reinforces the importance of framing evaluation around clinically interpretable endpoints. Once landmark detection reaches ceiling performance, the central issue becomes whether a model is sufficiently reliable at real-world distances. A model that performs well in aggregate but occasionally produces large spatial errors may be less useful than one whose average performance is similar but whose predictions remain consistently within an acceptable tolerance. The superiority of YOLO26N in both mean error and distributional stability, therefore, strengthens the argument that robustness should be considered alongside central accuracy in anatomical localization studies.
Several algorithmic directions are promising for reducing residual errors at proximal landmarks. First, structured loss functions incorporating anatomical spatial constraints derived from TEAM expertise could penalize predictions that violate plausible inter-landmark distance ranges, directly targeting the ambiguity at LI11 and LI10 that standard regression loss cannot capture. Second, auxiliary supervision mechanisms such as region-of-interest weighting or attention-guided feature extraction for proximal regions would provide a stronger gradient signal where local visual cues are weakest. Third, graph-based relational modeling that explicitly encodes inter-landmark spatial relationships as structural constraints could improve regression stability at ambiguous sites by leveraging the more reliably localized distal landmarks as spatial anchors. These approaches are more principled than scaling model capacity, which our results show not only fails to resolve proximal landmark precision but may actively worsen it.
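The first of these directions, penalizing implausible inter-landmark distances, can be illustrated with a simple hinge-style term. This is a sketch of one possible formulation, not a method from the study: the function name `distance_penalty`, the choice of a linear hinge, and the distance ranges and coordinates in the example are all hypothetical. In practice the ranges would have to come from anatomical expertise and the term would be added to the regression loss inside a differentiable framework.

```python
import math


def distance_penalty(points_mm, constraints_mm):
    """Sum of hinge penalties for landmark pairs whose distance leaves [lo, hi].

    points_mm: mapping from landmark name to (x, y) in millimeters.
    constraints_mm: mapping from (name_a, name_b) to an allowed (lo, hi) range.
    The penalty is zero inside the range and grows linearly outside it.
    """
    total = 0.0
    for (a, b), (lo, hi) in constraints_mm.items():
        d = math.dist(points_mm[a], points_mm[b])
        total += max(0.0, lo - d) + max(0.0, d - hi)
    return total


# Hypothetical plausible range for the LI11-LI10 distance, in millimeters.
constraints = {("LI11", "LI10"): (40.0, 60.0)}

plausible = {"LI11": (0.0, 0.0), "LI10": (0.0, 50.0)}    # 50 mm apart: inside the range
implausible = {"LI11": (0.0, 0.0), "LI10": (0.0, 30.0)}  # 30 mm apart: 10 mm too short

penalty_ok = distance_penalty(plausible, constraints)
penalty_bad = distance_penalty(implausible, constraints)
```

A term like this leaves well-placed predictions untouched and only pushes back on configurations that violate the assumed anatomical range, which is where the proximal landmarks currently fail.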
Several limitations should be acknowledged. The test set, although participant-independent, comprised 50 images from 18 unique participants acquired under a standardized protocol with controlled positioning and fixed illumination, which may limit generalizability to less controlled clinical settings where variation in limb posture, shooting angle, and illumination, as well as partial occlusion, is present. The 50 test images yield 250 landmark-level predictions under strict subject-level separation, with all 18 test participants entirely unseen during training and validation. This design compares favorably with Malekroodi et al. [24], who reported no separate participant-independent test set, and Seo et al. [27], who used 180 test images without subject-level separation. The sample size nonetheless limits statistical power, as reflected in the non-significant trend in Section 3.2. Additionally, formal inter-rater reliability metrics such as ICC were not computed, which prevents decomposition of localization errors into model prediction noise versus annotation noise and leaves the lower bound on achievable error unquantified. This limitation is common in the acupoint localization literature [24,27] but warrants formal intra- and inter-rater reliability assessment in future work.
In summary, these results challenge the implicit assumption in medical computer vision that larger models are inherently preferable. When structured imaging tasks reduce scene complexity to the point where detection becomes trivial, the relevant performance axis shifts from detection accuracy to spatial regression precision, a dimension on which model capacity provides no systematic benefit and may actively harm reliability. The present findings argue for a more deliberate, task-matched approach to architecture selection in clinical AI, prioritizing distributional stability and efficiency alongside mean performance.

5. Conclusions

This study evaluated five YOLO26 pose-estimation variants for automated localization of anatomical landmarks on distal-arm images, demonstrating that increased architectural scale does not necessarily translate into improved millimeter-level localization accuracy in structured clinical imaging tasks. Among the evaluated models, YOLO26N, the smallest variant, achieved the lowest mean error (2.76 ± 0.96 mm) and the highest clinical compliance rate (88.0% within 4 mm) at a computational cost 26.9× lower than that of YOLO26X, which produced the highest mean error (4.08 ± 2.59 mm) despite being the largest model evaluated. Notably, all variants were initialized from COCO-pretrained weights and fine-tuned under the same conditions, indicating that YOLO26N’s advantage is unlikely to stem from differences in initialization or training procedure. A consistent proximal-to-distal accuracy gradient was observed across all models, with performance deteriorating most notably at anatomically ambiguous proximal landmarks. Collectively, these findings suggest that, under conditions of limited and highly structured medical imaging data, increasing model capacity alone is insufficient to improve localization precision.
These conclusions should be interpreted in light of several limitations. The test set comprised 50 images from 18 unique participants entirely unseen during training and validation, yielding 250 landmark-level predictions under strict subject-level separation; while this exceeds the participant-independent rigor of directly comparable prior work, it nonetheless limits statistical power and generalizability to less controlled acquisition settings. Additionally, the analysis was restricted to a single anatomical region, leaving open the question of whether similar scaling behavior would be observed in other medical landmark localization tasks. Future work should evaluate larger, more diverse datasets across varied clinical conditions, extend the analysis to other anatomical domains, and investigate principled approaches to reducing proximal landmark error, including structured anatomical loss functions and graph-based inter-landmark modeling, as more targeted alternatives to architectural scaling. Overall, the present results support task-matched model selection, rather than default architectural scaling, as a more appropriate design principle for clinically deployable landmark localization systems.

Author Contributions

Conceptualization: N.M.; Data curation: P.P., H.M.K.K.M.B.H. and N.M.; Formal analysis: P.P., H.M.K.K.M.B.H. and N.M.; Funding acquisition: B.-i.L.; Investigation: M.Y. and B.-i.L.; Methodology: P.P., H.M.K.K.M.B.H. and N.M.; Project administration: M.Y. and B.-i.L.; Resources: H.-J.P., C.-S.N., M.Y. and B.-i.L.; Software: P.P.; Supervision: N.M., H.-J.P., C.-S.N., M.Y. and B.-i.L.; Validation: H.-J.P., C.-S.N., M.Y. and B.-i.L.; Visualization: P.P. and H.M.K.K.M.B.H.; Writing – original draft: P.P., H.M.K.K.M.B.H. and N.M.; Writing – review and editing: N.M., H.-J.P., C.-S.N., M.Y. and B.-i.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF) and funded by the Ministry of Science and ICT (No. 2022M3A9B6082791).

Institutional Review Board Statement

This study involved human subjects. The research protocol was reviewed and approved by the Institutional Review Board of Pukyong National University, Republic of Korea (Approval No. 1041386-202207-HR-41-02), in compliance with the Declaration of Helsinki. All participants provided written informed consent prior to enrollment.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The dataset used for this study can be obtained from the corresponding authors upon reasonable request.

Acknowledgments

The authors thank the volunteer participants who contributed to the real image dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
mAP: mean Average Precision
IoU: Intersection over Union
OKS: Object Keypoint Similarity
COCO: Common Objects in Context
RGB: Red, Green, Blue
GFLOPs: Giga Floating-Point Operations
VRAM: Video Random Access Memory
SGD: Stochastic Gradient Descent
HSV: Hue, Saturation, Value
IQR: Interquartile Range
SD: Standard Deviation
TEAM: Traditional East Asian Medicine
AI: Artificial Intelligence
2D: Two-Dimensional

References

  1. Payer, C.; Štern, D.; Bischof, H.; Urschler, M. Integrating Spatial Configuration into Heatmap Regression Based CNNs for Landmark Localization. Med. Image Anal. 2019, 54, 207–219.
  2. Noh, S.H.; Lee, G.; Bae, H.J.; Han, J.Y.; Son, S.J.; Kim, D.; Park, J.Y.; Choi, S.K.; Cho, P.G.; Kim, S.H.; et al. Deep Learning Method for Precise Landmark Identification and Structural Assessment of Whole-Spine Radiographs. Bioengineering 2024, 11, 481.
  3. Tajbakhsh, N.; Jeyaseelan, L.; Li, Q.; Chiang, J.N.; Wu, Z.; Ding, X. Embracing Imperfect Datasets: A Review of Deep Learning Solutions for Medical Image Segmentation. Med. Image Anal. 2020, 63, 101693.
  4. Yang, F.; Zamzmi, G.; Angara, S.; Rajaraman, S.; Aquilina, A.; Xue, Z.; Jaeger, S.; Papagiannakis, E.; Antani, S.K. Assessing Inter-Annotator Agreement for Medical Image Segmentation. IEEE Access 2023, 11, 21300.
  5. Serafin, M.; Baldini, B.; Cabitza, F.; Carrafiello, G.; Baselli, G.; Del Fabbro, M.; Sforza, C.; Caprioglio, A.; Tartaglia, G.M. Accuracy of Automated 3D Cephalometric Landmarks by Deep Learning Algorithms: Systematic Review and Meta-Analysis. Radiol. Med. 2023, 128, 544–555.
  6. Deep Learning-Based Human Pose Estimation: A Survey. Available online: https://www.researchgate.net/publication/347881067_Deep_Learning-Based_Human_Pose_Estimation_A_Survey (accessed on 23 March 2026).
  7. Lin, Y.; Liao, Y.; Zeng, W.; Wei, Y.; Chen, D.; Yuan, X.; Li, Y.; Erkan, U.; Toktas, A.; Zhang, C.; et al. 3D Non-Degenerate Hyperchaos: Design, Analysis, and Application in Image Encryption. IEEE Trans. Consum. Electron. 2026, 1.
  8. Arik, S.Ö.; Ibragimov, B.; Xing, L. Fully Automated Quantitative Cephalometry Using Convolutional Neural Networks. J. Med. Imaging 2017, 4, 014501.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  10. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 5686–5696.
  11. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Computer Vision – ECCV 2016; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2016; Volume 9912, pp. 483–499.
  12. Zhang, J.; Liu, M.; Shen, D. Detecting Anatomical Landmarks From Limited Medical Imaging Data Using Two-Stage Task-Oriented Deep Neural Networks. IEEE Trans. Image Process. 2017, 26, 4753–4764.
  13. Noothout, J.M.H.; De Vos, B.D.; Wolterink, J.M.; Postma, E.M.; Smeets, P.A.M.; Takx, R.A.P.; Leiner, T.; Viergever, M.A.; Išgum, I. Deep Learning-Based Regression and Classification for Automatic Landmark Localization in Medical Images. IEEE Trans. Med. Imaging 2020, 39, 4011–4022.
  14. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2017; pp. 1302–1310.
  15. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022.
  16. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755.
  17. Dong, C.; Du, G. An Enhanced Real-Time Human Pose Estimation Method Based on Modified YOLOv8 Framework. Sci. Rep. 2024, 14, 8012.
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  19. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
  20. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision – ECCV 2024; Springer: Cham, Switzerland, 2024.
  21. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653.
  22. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
  23. Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.J.; Marín-Jiménez, M.J. Automatic Generation and Detection of Highly Reliable Fiducial Markers under Occlusion. Pattern Recognit. 2014, 47, 2280–2292.
  24. Malekroodi, H.S.; Seo, S.D.; Choi, J.; Na, C.S.; Lee, B.I.; Yi, M. Real-Time Location of Acupuncture Points Based on Anatomical Landmarks and Pose Estimation Models. Front. Neurorobot. 2024, 18, 1484038.
  25. Yuan, Z.; Shao, P.; Li, J.; Wang, Y.; Zhu, Z.; Qiu, W.; Chen, B.; Tang, Y.; Han, A. YOLOv8-ACU: Improved YOLOv8-Pose for Facial Acupoint Detection. Front. Neurorobot. 2024, 18, 1355857.
  26. Wang, H.; Liu, L.; Wang, Y.; Du, S. Hand Acupuncture Point Localization Method Based on a Dual-Attention Mechanism and Cascade Network Model. Biomed. Opt. Express 2023, 14, 5965.
  27. Seo, S.-D.; Madusanka, N.; Malekroodi, H.S.; Na, C.-S.; Yi, M.; Lee, B. Accurate Acupoint Localization in 2D Hand Images: Evaluating HRNet and ResNet Architectures for Enhanced Detection Performance. Curr. Med. Imaging 2024, 20, e15734056315235.
  28. Wang, Y.; Lan, T.; Dou, W.; Chen, Z.; Zhang, S.; Chen, G. Structure-Guided Deep Learning for Back Acupoint Localization via Bone-Measuring Constraints. Front. Physiol. 2025, 16, 1662104.
  29. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
  30. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88.
  31. Ji, Y.C. Improving the Lightweight Pose Detection Model Based on YOLOpose. In Proceedings of the 3rd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM2024) Lecture Notes in Electrical Engineering; Springer: Singapore, 2025; Volume 1326, pp. 31–41.
Figure 1. Representative distal-arm images and anatomical landmark annotations. Standardized distal-arm images acquired in the pronated position with the ArUco fiducial marker used for physical calibration. The five annotated landmarks (LI11, LI10, TE5, LI4, and TE3) are shown.
Figure 2. Subject-wise data splitting. Flow diagram showing the split of 3679 images from 262 participants into training (2963 images, 201 participants), validation (666 images, 43 participants), and held-out test (50 images, 18 participants) subsets. No participant appeared in more than one subset.
Figure 3. Overview of the YOLO26 pose estimation architecture. Input images are processed through a hierarchical backbone (Conv and C3k2 blocks at pyramid levels P1–P5, followed by SPPF and C2PSA modules), a multi-scale neck (upsampling and concatenation), and a keypoint prediction head producing detections at three scales. Five model variants (N, S, M, L, and X) are available; architectural specifications are provided in Table 1.
Figure 4. Mean localization error (mm) as a function of computational cost (GFLOPs) across YOLO26 model variants. Each point represents the mean per-image localization error over the 50-image test set; error bars indicate ±1 standard deviation. The dashed line shows the ordinary least-squares linear trend (r = 0.82, p = 0.086), with the shaded band denoting the 95% confidence interval; a supplementary Spearman rank correlation yielded ρ = 0.60 (p = 0.28), consistent with a positive trend but not reaching significance at n = 5. The green zone (<3.1 mm) denotes the high-precision region. Despite a 26.9× increase in computational cost, YOLO26X yields a 47.8% higher mean error than YOLO26N, suggesting that architectural scaling does not improve localization accuracy in this task.
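The correlation coefficients quoted in this caption can be reproduced from the five (GFLOPs, mean error) pairs reported in Tables 1 and 2. The sketch below is a plain-Python check offered for illustration, not the authors' analysis code; the helper names `pearson_r`, `ranks`, and `spearman_rho` are introduced here, and the p-values and confidence band are not recomputed.

```python
import math

# Values transcribed from Table 1 (FLOPs, G) and Table 2 (mean error, mm),
# ordered N, S, M, L, X.
gflops = [7.5, 23.9, 73.1, 91.3, 201.7]
mean_error_mm = [2.76, 3.35, 3.11, 2.96, 4.08]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

r = pearson_r(gflops, mean_error_mm)
rho = spearman_rho(gflops, mean_error_mm)
cost_ratio = gflops[-1] / gflops[0]
error_increase_pct = 100.0 * (mean_error_mm[-1] / mean_error_mm[0] - 1.0)

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
print(f"Cost ratio X/N = {cost_ratio:.1f}x, error increase = {error_increase_pct:.1f}%")
```

With these inputs the sketch gives r ≈ 0.82 and ρ = 0.60, together with the 26.9× cost ratio and the 47.8% error increase stated in the caption. The gap between r and ρ arises because YOLO26S and YOLO26L break the monotonic ordering, which the rank-based statistic penalizes more heavily at n = 5.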
Figure 5. Distribution of per-image mean localization error (mm) across YOLO26 model variants, displayed as violin plots with embedded box plots. The violin shape represents the kernel density estimate of the error distribution; the inner box represents the interquartile range (Q1–Q3); the horizontal line within the box denotes the median; whiskers extend to 1.5× the IQR; diamond markers indicate the mean; and circles represent individual image predictions (n = 50 per model). The dashed horizontal line marks the 4 mm clinical threshold. Model variants are ordered by increasing architectural complexity, from YOLO26N (2.9 M parameters, 7.5 GFLOPs) to YOLO26X (57.6 M parameters, 201.7 GFLOPs). Nine outliers observed for YOLO26X (7.6–10.4 mm) are displayed individually.
Figure 6. Localization error analysis across YOLO26 model variants. (a) Cumulative distribution of per-image mean localization error (mm). Each curve represents the proportion of test images falling below a given error threshold; dots indicate compliance rates at the 4 mm clinical threshold. The shaded green region (<3.1 mm) denotes the high-precision zone. (b) Mean localization error (mm) disaggregated by anatomical landmark and model variant, displayed as a grouped bar chart. Error bars indicate ±1 SD. Asterisks (*) mark the lowest error per landmark across all model variants. Landmark groupings (proximal forearm: LI11, LI10; mid-forearm: TE5; hand: LI4, TE3) reflect the anatomical regions assessed, with a consistent proximal-to-distal accuracy gradient observed across all models.
Figure 7. Qualitative evaluation of anatomical landmark localization on a representative test image. The overview panel displays all five landmarks with ground-truth positions (gold stars) and model predictions (colored circles) overlaid on the original image; lines indicate the displacement from ground truth to each prediction. The orange bounding box delineates the region of interest encompassing all landmarks. Zoomed insets show landmark-level detail for each of the five anatomical sites (LI11, LI10, TE5, LI4, and TE3), with accompanying horizontal bar charts reporting the per-landmark localization error (mm) for each model variant. The summary panel reports the overall mean error per model across all landmarks. The selected image corresponds to the case in which YOLO26N achieves the largest margin of improvement over all competing variants.
Table 1. Detection performance of YOLO26 model variants across two evaluation settings. COCO benchmark results are reported under standard evaluation conditions (input size 640 × 640). Distal arm dataset results are computed on the 50-image held-out test set.
| YOLO Variants | Params (M) | FLOPs (G) | COCO mAP@0.5 (%) | COCO mAP@0.5:0.95 (%) | Distal Arm mAP@0.5 (%) | Distal Arm mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|---|
| YOLO26N | 2.9 | 7.5 | 83.3 | 57.2 | 99.5 | 99.2 |
| YOLO26S | 10.4 | 23.9 | 86.6 | 63.0 | 99.5 | 99.4 |
| YOLO26M | 21.5 | 73.1 | 89.6 | 68.8 | 99.5 | 99.3 |
| YOLO26L | 25.9 | 91.3 | 90.5 | 70.4 | 99.5 | 99.2 |
| YOLO26X | 57.6 | 201.7 | 91.6 | 71.6 | 99.5 | 99.2 |
mAP: mean Average Precision; IoU: Intersection over Union.
Table 2. Per-model localization error distribution (mm) across the 50-image held-out test set. All values are computed over per-image mean errors across five landmarks. Bold values indicate the best result per column.
| YOLO Variants | Mean (mm) | Median (mm) | SD (mm) | P75 (mm) | P90 (mm) | <4 mm (%) |
|---|---|---|---|---|---|---|
| YOLO26N | **2.76** | **2.65** | **0.96** | **3.30** | **4.04** | **88.0** |
| YOLO26S | 3.35 | 3.05 | 1.46 | 4.12 | 5.84 | 74.0 |
| YOLO26M | 3.11 | 3.02 | 1.05 | 3.86 | 4.25 | 80.0 |
| YOLO26L | 2.96 | 2.78 | **0.96** | 3.61 | 4.15 | 86.0 |
| YOLO26X | 4.08 | 3.05 | 2.59 | 4.41 | 9.18 | 72.0 |
SD: standard deviation; P75: 75th percentile; P90: 90th percentile; <4 mm (%): proportion of per-image mean errors falling below 4 mm.
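The columns of Table 2 can be computed from a list of per-image mean errors as sketched below. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names `percentile` and `summarize` are introduced here, the percentile rule (linear interpolation) and the use of the sample standard deviation are assumptions, and the input values are synthetic.

```python
import math
import statistics

def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) on pre-sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def summarize(per_image_errors_mm, threshold_mm=4.0):
    """Summary statistics in the layout of Table 2."""
    values = sorted(per_image_errors_mm)
    within = sum(v < threshold_mm for v in values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1); an assumption
        "p75": percentile(values, 75),
        "p90": percentile(values, 90),
        "within_pct": 100.0 * within / len(values),
    }

# Synthetic per-image mean errors (mm), not data from this study.
demo = [2.0, 2.5, 3.0, 3.5, 6.0]
stats = summarize(demo)
print(stats)
```

In the synthetic example, the single 6.0 mm image leaves the median at 3.0 mm but raises the mean to 3.4 mm and the P90 to 5.0 mm. The same pattern appears in Table 2 for YOLO26X, whose median matches that of YOLO26S while its mean, SD, and P90 are markedly higher.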
Table 3. Mean localization error (mm) by anatomical landmark and model variant. Landmarks are ordered proximal (LI11, LI10) to distal (LI4, TE3). Bold values indicate the minimum error per landmark across all models.
| YOLO Variants | LI11 | LI10 | TE5 | LI4 | TE3 |
|---|---|---|---|---|---|
| YOLO26N | **3.16** | **3.27** | **2.44** | 2.21 | 2.75 |
| YOLO26S | 4.31 | 4.25 | 3.07 | 2.16 | 2.96 |
| YOLO26M | 3.55 | 3.85 | 2.91 | **2.07** | 3.15 |
| YOLO26L | 3.36 | 3.58 | 2.89 | 2.26 | **2.72** |
| YOLO26X | 5.35 | 5.54 | 3.09 | 2.98 | 3.46 |
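As a simple worked check, the per-landmark values in Table 3 can be averaged to recover the overall means in Table 2 and to quantify the difference between the proximal forearm and hand landmarks. The sketch below uses only values transcribed from Table 3; the grouping into "proximal" (LI11, LI10) and "hand" (LI4, TE3) follows the landmark groupings in Figure 6, and the variable names are introduced here for illustration.

```python
# Per-landmark mean errors (mm) transcribed from Table 3,
# in the order LI11, LI10, TE5, LI4, TE3.
table3 = {
    "YOLO26N": [3.16, 3.27, 2.44, 2.21, 2.75],
    "YOLO26S": [4.31, 4.25, 3.07, 2.16, 2.96],
    "YOLO26M": [3.55, 3.85, 2.91, 2.07, 3.15],
    "YOLO26L": [3.36, 3.58, 2.89, 2.26, 2.72],
    "YOLO26X": [5.35, 5.54, 3.09, 2.98, 3.46],
}

def mean(values):
    return sum(values) / len(values)

summary = {}
for model, errs in table3.items():
    proximal = mean(errs[0:2])  # LI11, LI10
    hand = mean(errs[3:5])      # LI4, TE3
    summary[model] = {
        "overall": mean(errs),
        "proximal": proximal,
        "hand": hand,
        "gap": proximal - hand,
    }

for model, s in summary.items():
    print(f"{model}: overall {s['overall']:.2f}, proximal {s['proximal']:.2f}, "
          f"hand {s['hand']:.2f}, gap {s['gap']:.2f} mm")
```

The overall means obtained this way agree with Table 2 to within rounding, and the proximal-minus-hand gap is widest for YOLO26X (about 2.2 mm) and narrowest for YOLO26N (about 0.7 mm).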
Table 4. Comparative Summary of Related Studies on Anatomical Landmark Localization.
| Study | Year | Body Region | Model/Method | Dataset (Images) | mAP@0.5 | Localization Error Metric | Best Reported Result | Physical Calibration (mm) |
|---|---|---|---|---|---|---|---|---|
| Wang et al. [26] | 2023 | Hand (21 keypoints) | SC-YOLOv5 + HRNet (cascade, dual-attention) | Custom (real scene) | 97.15% | Average offset error (AOE) | AOE = 0.0269 (>40% lower than others) | No (normalized units; d = 18 cm denominator, result is dimensionless) |
| Malekroodi et al. [24] | 2024 | Distal arm (LI11, LI10, TE5, LI4, TE3) | YOLOv8l-pose (transfer learning, fine-tuned on custom dataset) | 5997 images/194 participants | 0.99 | Euclidean distance, reported in mm | Mean error <5 mm (mm-calibrated) | Partial (fixed global conversion factor via 80 cm reference sheet) |
| Seo et al. [27] | 2024 | Hand (forearm acupoints) | HRNet-w48 vs. ResNet (top-down) | 940 images/94 participants (PK dataset); test set = 180 images | No single mAP@0.5 value stated | Mean distance error (pixels) | HRNet-w48 surpassed expert annotators | No (ArUco used for perspective correction; error reported in pixels only) |
| Yuan et al. [25] | 2024 | Face (facial acupoints) | YOLOv8-ACU (ECA attention + Slim-neck + GIoU loss) | Self-constructed (facial acupoint) | 97.5% | mAP@0.5:0.95 = 76.9% (validation); 80.7% on external test set | mAP@0.5 = 99.5% on external test set | No (no mm reported) |
| Present Study (YOLO26N) | 2026 | Distal arm (LI11, LI10, TE5, LI4, TE3) | YOLO26N (smallest variant; single-stage) | 3679 images/262 participants | 99.5% | Physical mm (ArUco-calibrated) | Mean error 2.76 ± 0.96 mm (88% within 4 mm) | Yes (ArUco marker) |
Notes: Physical calibration (mm) indicates whether the localization error was expressed in physically meaningful millimeter units using a calibration reference. Among the studies compared here, the present study is the only one to apply ArUco marker-based per-image pixel-to-millimeter calibration, rather than a fixed global conversion factor, for physically grounded error reporting across a systematic model-scaling comparison.
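Per-image pixel-to-millimeter calibration of this kind can be sketched as follows. This is a minimal illustration under stated assumptions rather than the calibration code used in the study: the marker corners are assumed to have been detected already (for example, with OpenCV's ArUco module), the marker is assumed to lie in the imaging plane under an overhead camera so that a single scale factor applies, and the 50 mm side length, the coordinates, and the function names `mm_per_pixel` and `error_mm` are hypothetical.

```python
import math

def mm_per_pixel(marker_corners_px, marker_side_mm):
    """Estimate the image scale from the four corners of one detected marker.

    marker_corners_px: four (x, y) corners listed in order around the marker.
    marker_side_mm: printed side length of the marker (known by construction).
    """
    sides_px = [
        math.dist(marker_corners_px[i], marker_corners_px[(i + 1) % 4])
        for i in range(4)
    ]
    # Averaging the four sides damps small corner-detection errors.
    return marker_side_mm / (sum(sides_px) / 4.0)

def error_mm(pred_px, truth_px, scale_mm_per_px):
    """Euclidean localization error converted from pixels to millimeters."""
    return math.dist(pred_px, truth_px) * scale_mm_per_px

# Hypothetical values: a 50 mm marker that spans 200 px in the image.
corners = [(100.0, 100.0), (300.0, 100.0), (300.0, 300.0), (100.0, 300.0)]
scale = mm_per_pixel(corners, marker_side_mm=50.0)

print(scale)                                            # 0.25
print(error_mm((412.0, 250.0), (400.0, 245.0), scale))  # 3.25
```

Because the scale is re-estimated from the marker in every image, variations in camera height or zoom between acquisitions do not bias the reported millimeter error, which is the practical difference from a fixed global conversion factor.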
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Padmanabha, P.; Herath, H.M.K.K.M.B.; Madusanka, N.; Park, H.-J.; Na, C.-S.; Yi, M.; Lee, B.-i. Precision Without Complexity: A Comparative Study of YOLO26 Pose Variants for Distal Arm Landmark Detection. Appl. Sci. 2026, 16, 3968. https://doi.org/10.3390/app16083968
