1. Introduction
Anatomical landmark localization in medical imaging involves identifying specific anatomical points across modalities such as X-ray, CT, and MRI, and is essential for clinical tasks including diagnosis, disease progression assessment, quantitative analysis, and treatment planning [1,2,3,4]. However, manual annotation of these landmarks by clinicians is both time-consuming and labor-intensive, and the resulting annotations often exhibit interobserver variability due to differences in image quality, acquisition conditions, and patient-specific anatomical variations. These factors make it difficult to maintain consistent annotation quality and highlight the need for reliable and reproducible automated landmark detection methods.
To robustly address such variability, deep learning-based approaches have become standard in medical image analysis [5,6,7]. Among these, landmark detection based on heatmap regression has become a widely adopted approach [8,9,10]. This approach represents each landmark as a spatial probability distribution, providing better spatial generalization and more stable localization than direct coordinate regression [11,12].
U-Net has been widely used in medical imaging because its encoder–decoder architecture effectively integrates multi-scale features [13] and, when paired with heatmap-based landmark representations, has demonstrated strong performance [10,11,12,14]. However, because heatmap representations rely primarily on local cues around each landmark, they are limited in their ability to capture anatomical variability and long-range structural relationships among landmarks [7,15,16,17]. In particular, the predicted heatmaps emphasize only the immediate neighborhood of each landmark and therefore cannot explicitly encode higher-level structural information such as bilateral symmetry, inter-landmark distances, or positional relationships.
To address this issue, several studies have highlighted the importance of incorporating global anatomical constraints, including geometric relationships among landmarks and their relative spatial organization, to ensure structurally consistent predictions [16,17]. Furthermore, graph-based or geometric formulations represent landmarks as nodes and use iterative strategies to refine their spatial and structural relationships, thereby incorporating global anatomical priors into the prediction process [8,16,17,18]. Although these methods have achieved notable progress in independently improving local representations and global structural relationships, research that uses these components in a mutually complementary manner remains limited.
The goal of this study is to develop a unified landmark detection framework that can more reliably learn anatomical relationships between landmarks in chest X-ray images. To this end, we adopt a U-Net-based heatmap regression network as the baseline architecture and incorporate structural losses that impose symmetry and vertical positional relationships during training. We further integrate a GNN-based landmark refinement module to adjust the initial heatmap predictions from a global structural perspective. This design simultaneously leverages local features, anatomical rules, and global structural information, addressing limitations present in previous approaches that considered these components separately.
For the experiments, we evaluated the performance of the proposed model using a chest X-ray dataset of 279 images annotated with six pulmonary anatomical landmarks. Figure 1 visualizes the dataset along with the ground-truth placement of each landmark. To examine the effects of the structural components (symmetry, ordering, and the GNN-based refinement), we conducted additional experiments using various combinations of these components. These evaluations allowed quantitative verification of how each component contributes to the stability of the overall landmark configuration.
The remainder of this paper is organized as follows: Section 2 provides an overview of related work, Section 3 describes the proposed method in detail, Section 4 presents and analyzes the experimental results, Section 5 discusses limitations and future directions, and Section 6 concludes the paper.
2. Related Works
This section presents an overview of the major research directions in anatomical landmark detection for medical imaging. We focus on five representative categories: early coordinate-regression-based landmark detection approaches [19], heatmap-based inference methods [8,9,10], techniques that incorporate anatomical priors [16,17], graph-based approaches that model global relationships among landmarks [18], and finally, transformer-based architectures [20] and CNN–Transformer hybrid models [21,22].
In the early stages of deep learning, direct coordinate regression was widely used, where convolutional neural networks (CNNs) or regression-based models directly predicted the $(x, y)$ coordinates of each landmark [19]. This approach offers simplicity and computational efficiency. However, regressing coordinates as scalar values introduces fundamental limitations, including the inability to model spatial uncertainty, the lack of explicit anatomical relationship modeling, and vulnerability to shape variability and structural overlap that frequently occur in X-ray images [3,4].
Heatmap regression was introduced to overcome these limitations. Modeling each landmark as a 2D probability distribution allows heatmap regression to provide far more stable localization than direct regression, and it has since become the dominant approach for landmark detection [5,7,10,11,12,15,23]. Multi-scale fully convolutional architectures such as U-Net, Hourglass Networks, and Feature Pyramid Networks (FPNs) [13,24,25,26] further enhance this paradigm by integrating information across multiple resolutions, allowing the models to capture fine-grained local structures around each landmark.
However, heatmap-based methods still rely primarily on local appearance cues. Consequently, they face inherent difficulty in representing global anatomical configurations among landmarks or in incorporating anatomical priors in an end-to-end manner. To address the limitations of local appearance-based predictions, several studies have attempted to integrate global anatomical constraints into heatmap-based outputs. Payer et al. [7] enhanced landmark consistency by augmenting heatmap regression CNNs with a spatial configuration model. Ham et al. [14] leveraged soft-argmax coordinates to learn interactions among landmarks and explicitly incorporated these structural constraints through a loss function. In 3D imaging, O’Neil et al. [27] improved robustness by combining atlas-based coordinate alignment with relative spatial relationships between landmarks, enabling the integration of long-range spatial context.
Alongside these efforts, graph-based approaches have also been explored, motivated by the difficulty of explicitly modeling global structure using heatmaps alone. In this approach, landmarks are represented as nodes and anatomical relationships as edges, allowing graph-based models or priors to capture global structural dependencies [18,28,29]. These studies use graph neural networks (GNNs) to model structural dependencies among landmarks, or to refine heatmap-based predictions during a post-processing stage. However, because these methods rely on heatmap-derived coordinates, their performance can be influenced by the accuracy of the initial heatmaps, and inaccuracies may propagate through subsequent refinement steps.
Recent studies have increasingly explored transformer-based architectures and CNN–Transformer hybrid models for anatomical landmark detection. For instance, Zhao et al. [21] developed SpineHRformer, a transformer-based model designed to capture long-range dependencies for precise spinal landmark localization. Furthermore, Jin et al. [20] proposed STLD, a selection-free self-training framework that utilizes a transformer-based regressor to effectively model the global relationships between landmarks while mitigating confirmation bias. In particular, Huang et al. [22] proposed CASEMark, a hybrid framework that integrates Swin Transformers with convolutional networks to jointly model global spatial context and fine-grained local details in multi-structure X-ray images. These advances reflect a broader research shift toward leveraging expansive spatial context to achieve clinically reliable landmark localization. Building on these observations, our work proposes a new framework that integrates local appearance cues, anatomical priors, and global structural information within a single unified learning process. The following sections describe the detailed components of the proposed method.
3. Methods
Figure 2 illustrates the overall pipeline of the proposed framework. We introduce a landmark localization approach that integrates probabilistic heatmap regression, graph-based structural refinement, and anatomical regularization to simultaneously enhance spatial accuracy and structural consistency. The pipeline follows a staged refinement strategy: a U-Net backbone first predicts heatmaps that encode local appearance and spatial probability for each landmark. A lightweight graph neural network (GNN) then models global relationships among landmarks and refines the initial coordinates through context-aware message passing. Finally, structural regularization terms based on anatomical symmetry and vertical ordering are incorporated during training to enforce anatomically plausible landmark configurations.
By combining explicit structural reasoning with conventional heatmap-based regression, the proposed framework goes beyond minimizing coordinate error to enable landmark predictions that are both precise and anatomically coherent.
3.1. U-Net for Heatmap Regression
Figure 3 presents the U-Net architecture employed in this study, adapted from Ronneberger et al. [13], which serves as the heatmap regression backbone of the proposed framework. The U-Net is responsible for analyzing the input chest X-ray image and generating landmark-specific probability heatmaps. The model follows an encoder–decoder structure in which each block consists of two 3 × 3 convolutional layers, followed by instance normalization and ReLU activation. In the encoder, spatial resolution is progressively reduced through 2 × 2 max-pooling while the number of feature channels increases from 32 to 256. In the decoder, transposed-convolution-based upsampling restores spatial resolution, and skip connections link feature maps of matching scales to preserve fine structural details. Finally, a 1 × 1 convolution layer outputs six heatmap channels, each representing a two-dimensional spatial response corresponding to a single anatomical landmark.
3.2. Feature Embedding Using Soft-Argmax
For each anatomical landmark $k$, the U-Net outputs a heatmap $H_k$. To convert this probabilistic representation into continuous coordinates, we employ a differentiable soft-argmax operation. The conventional argmax selects only the single highest-response pixel and is therefore non-differentiable, making it unsuitable for end-to-end coordinate learning. In contrast, the soft-argmax computes an expectation over all pixel locations weighted by their probabilities, enabling smooth gradient flow from coordinate-based losses back to the heatmap representation.
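Following prior work [18], the predicted coordinates $\hat{\mathbf{p}}_k = (\hat{x}_k, \hat{y}_k)$ are obtained as:
$$\hat{\mathbf{p}}_k = \sum_{(x, y)} (x, y)\, \frac{\exp\left(H_k(x, y) / T\right)}{\sum_{(x', y')} \exp\left(H_k(x', y') / T\right)}$$
where $H_k$ is the heatmap predicted for landmark $k$ and $T$ is a temperature parameter that controls the sharpness of the distribution: lower values of $T$ produce sharper and more peaked heatmaps, while higher values yield smoother probability maps. This differentiable formulation effectively bridges pixel-level information and spatial reasoning.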
The resulting coordinates serve as compact anatomical descriptors and provide the input representation for the graph-based refinement module described in Section 3.3, while Section 3.4 details the associated supervision losses.
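As a rough illustration of the soft-argmax step, the sketch below computes the softmax-weighted expectation of pixel coordinates in plain Python. The function name `soft_argmax`, the nested-list heatmap layout, and the default temperature are illustrative assumptions; an actual implementation would presumably operate on batched PyTorch tensors.

```python
import math

def soft_argmax(heatmap, temperature=0.01):
    """Return the expected (x, y) position under softmax(heatmap / temperature).

    heatmap: list of rows (H x W) of raw scores for one landmark.
    """
    peak = max(max(row) for row in heatmap)  # subtract the max for numerical stability
    total = exp_x = exp_y = 0.0
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            weight = math.exp((score - peak) / temperature)
            total += weight
            exp_x += weight * x
            exp_y += weight * y
    return exp_x / total, exp_y / total

# A single-row toy heatmap whose peak is at x = 2.
scores = [[0.0, 0.0, 2.0]]
sharp_x, _ = soft_argmax(scores, temperature=0.01)    # concentrates on the peak
smooth_x, _ = soft_argmax(scores, temperature=100.0)  # drifts toward the grid center
```

With a low temperature the estimate sits essentially on the peak pixel, while a very high temperature pulls it toward the center of the grid, which matches the sharpness behavior described above.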
3.3. Graph-Based Landmark Refiner
Although heatmap-based soft-argmax prediction can accurately localize each landmark at a local level, it treats landmarks independently and hence fails to capture anatomical structures or inter-landmark relationships. To address this limitation, the predicted landmark coordinates obtained from the soft-argmax step are interpreted as nodes in a graph, and a lightweight GNN-based refinement module exchanges information among neighboring landmarks to further adjust these initial estimates, as illustrated in Figure 4.
Each landmark node $v_k$ is constructed by combining three components: the 32-dimensional feature vector sampled at the predicted coordinate $\hat{\mathbf{p}}_k$ on the final decoder feature map, the normalized coordinate output, and the softmax probability sampled at the same location (the heatmap confidence score). As Figure 4 shows, the resulting node embedding integrates local appearance information, global positional cues, and the softmax-based heatmap confidence score, providing a richer representation for subsequent graph refinement.
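The node construction can be sketched as a simple concatenation. This is only an illustration: the function name `build_node_embedding`, nearest-pixel sampling, and scaling coordinates to [0, 1] by the image extent are assumptions, since the sampling scheme and normalization are not specified here.

```python
def build_node_embedding(feature_map, prob_map, coord):
    """Concatenate appearance, position, and confidence for one landmark.

    feature_map: H x W grid of per-pixel feature vectors (lists of floats).
    prob_map:    H x W grid of softmax probabilities for this landmark.
    coord:       continuous (x, y) estimate from soft-argmax, in pixels.
    Requires H > 1 and W > 1.
    """
    height, width = len(prob_map), len(prob_map[0])
    # Assumption: nearest-pixel sampling; bilinear sampling is a natural alternative.
    col = min(max(int(round(coord[0])), 0), width - 1)
    row = min(max(int(round(coord[1])), 0), height - 1)
    appearance = list(feature_map[row][col])
    # Assumption: coordinates are scaled to [0, 1] by the image extent.
    position = [coord[0] / (width - 1), coord[1] / (height - 1)]
    confidence = [prob_map[row][col]]
    return appearance + position + confidence
```

With a 32-channel feature map this yields a 35-dimensional vector per landmark (32 appearance values, 2 coordinates, 1 confidence score).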
The structural relationships among landmarks are encoded in an adjacency matrix $A$, which incorporates left–right symmetry, superior–middle–inferior anatomical ordering, and cross-connections that reflect the partial asymmetry of real thoracic structures, as illustrated in Figure 4. Since the left and right lungs possess nearly symmetrical anatomical structures, the bilateral connections allow the model to calibrate unstable predictions by leveraging contralateral information, preventing bilateral imbalance. However, because real thoracic anatomy is not perfectly symmetrical, we avoided overly restrictive symmetry assumptions, which can induce prediction errors, by incorporating cross-connections that reflect realistic partial asymmetry. This matrix is then normalized to enable stable information propagation across the graph, following the symmetric normalization proposed for graph convolutional networks [30]:
$$\hat{A} = D^{-1/2} A D^{-1/2}$$
In this formulation, $D$ denotes the degree matrix of $A$. Based on the normalized adjacency matrix, two graph convolution layers propagate information across anatomically related nodes. Specifically, as depicted in Figure 4, each landmark node maintains a feature embedding that encodes local appearance information (the 32-dimensional feature vector), normalized coordinates, and the heatmap confidence score derived from the peak activation of the corresponding heatmap. In this module, local appearance features indicate whether a landmark is positioned on the correct anatomical texture, while coordinates enable the model to compute relative distance relationships between landmarks. Prediction confidence reflects the sharpness of the heatmap response, allowing the model to identify reliable landmarks as reference anchors during structural refinement. During message passing, each node updates its representation by aggregating embeddings from the neighbors defined by the adjacency matrix. Through this aggregation, each landmark compares its predicted position with the surrounding spatial configuration, enabling the GNN to learn how to correct structurally implausible deviations based on learned anatomical relationships.
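The symmetric normalization and one neighbor-aggregation step can be sketched as follows. The node ordering, the specific cross-connection edges in `EDGES`, and the addition of self-loops (as in the standard GCN formulation) are hypothetical choices made for illustration; the learned layer weights and nonlinearities of the actual graph convolutions are omitted.

```python
import math

# Hypothetical node order: 0=R-top, 1=L-top, 2=R-mid, 3=L-mid, 4=R-bottom, 5=L-bottom
EDGES = [
    (0, 1), (2, 3), (4, 5),          # left-right symmetry
    (0, 2), (2, 4), (1, 3), (3, 5),  # superior-middle-inferior ordering
    (0, 3), (1, 2), (2, 5), (3, 4),  # cross-connections (assumed layout)
]

def normalized_adjacency(num_nodes, edges, self_loops=True):
    """Return D^{-1/2} A D^{-1/2} as a nested list (every node must have degree > 0)."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    if self_loops:  # assumption: self-loops added, following the usual GCN recipe
        for i in range(num_nodes):
            adj[i][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]

def propagate(a_hat, features):
    """One aggregation step: each node takes a weighted sum of its neighbors' features."""
    n, dim = len(a_hat), len(features[0])
    return [[sum(a_hat[i][j] * features[j][k] for j in range(n))
             for k in range(dim)] for i in range(n)]

a_hat = normalized_adjacency(6, EDGES)
```

A full graph convolution layer would follow `propagate` with a learned linear map and an activation; only the structure-dependent part is shown here.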
The resulting node representations form 128-dimensional embeddings enriched with anatomical context. A fully connected layer then regresses the coordinate offsets $\Delta\mathbf{p}_k$, and the refined landmark positions are obtained as:
$$\tilde{\mathbf{p}}_k = \hat{\mathbf{p}}_k + \Delta\mathbf{p}_k$$
where $\hat{\mathbf{p}}_k$ denotes the initial soft-argmax estimate for landmark $k$. In this way, the GNN acts as a global structural refinement module that enhances the anatomical coherence of the heatmap-based predictions. The individual contribution and necessity of each feature component are quantitatively validated through the ablation study.
3.4. Loss Functions
To train the proposed framework in an end-to-end manner, we decompose the objective into a localization loss and a structure-aware loss. The localization loss supervises individual landmark positions through heatmap regression and coordinate refinement, while the structure-aware loss enforces anatomically plausible configurations.
3.4.1. Localization Loss
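The U-Net decoder outputs a heatmap $H_k$ for each landmark $k$, which is converted into a probability map $P_k$ by a spatial softmax. The heatmaps are supervised using a pixel-wise cross-entropy loss against the normalized Gaussian targets $G_k$:
$$\mathcal{L}_{\mathrm{heat}} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{(x, y)} G_k(x, y) \log P_k(x, y)$$
where $K$ is the number of landmarks. Continuous landmark coordinates $\hat{\mathbf{p}}_k$, obtained by the soft-argmax operator described in Section 3.2, are also supervised using an L1 regression loss against the ground-truth coordinates $\mathbf{p}_k$:
$$\mathcal{L}_{\mathrm{coord}} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert \hat{\mathbf{p}}_k - \mathbf{p}_k \right\rVert_1$$
This coordinate-level supervision enables sub-pixel accuracy beyond the native heatmap resolution and stabilizes the learning of continuous spatial representations. The overall localization loss $\mathcal{L}_{\mathrm{loc}}$ combines the heatmap cross-entropy term and the coordinate L1 term.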
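To make the two localization terms concrete, the following plain-Python sketch builds a normalized Gaussian target and evaluates a cross-entropy and an L1 coordinate loss. The function names, the averaging conventions, and the way the two terms would be weighted against each other are assumptions for illustration only.

```python
import math

def gaussian_target(height, width, cx, cy, sigma):
    """Gaussian centered at (cx, cy), normalized to sum to 1 over the grid."""
    grid = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def log_softmax_2d(logits):
    """Log of the spatial softmax taken over all pixels of one heatmap."""
    peak = max(max(row) for row in logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for row in logits for v in row))
    return [[v - log_norm for v in row] for row in logits]

def heatmap_cross_entropy(logits, target):
    """Cross-entropy between a normalized target map and softmax(logits)."""
    log_prob = log_softmax_2d(logits)
    return -sum(t * lp
                for t_row, lp_row in zip(target, log_prob)
                for t, lp in zip(t_row, lp_row))

def l1_coordinate_loss(pred, gt):
    """Mean L1 distance between predicted and ground-truth (x, y) pairs."""
    return sum(abs(px - gx) + abs(py - gy)
               for (px, py), (gx, gy) in zip(pred, gt)) / len(pred)
```

For a uniform (all-zero) heatmap on a 5 × 5 grid the cross-entropy equals log 25, and it decreases as the predicted distribution concentrates around the target center.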
3.4.2. Structure-Aware Loss
In addition to local accuracy, we impose anatomical plausibility by incorporating two structural constraints: left–right symmetry and superior–middle–inferior ordering. Figure 5 illustrates these constraints. These structural losses, $\mathcal{L}_{\mathrm{sym}}$ and $\mathcal{L}_{\mathrm{ord}}$, function as anatomical guardrails that stabilize the convergence of the GNN refiner. Since GNNs are sensitive to the quality of the initial landmark estimates, noisy heatmaps in the early training phase can lead the network to model erroneous spatial relationships. By enforcing explicit geometric priors, these losses constrain the search space to anatomically plausible configurations, so that the learned graph features and the structural constraints reinforce each other. For paired left–right landmarks $(l, r)$, $\mathcal{L}_{\mathrm{sym}}$ is defined as:
$$\mathcal{L}_{\mathrm{sym}} = \frac{1}{|\mathcal{P}|} \sum_{(l, r) \in \mathcal{P}} \left[ \left| y_l - y_r \right| + \max\left(0,\; d_{\min} - \left| x_l - x_r \right| \right) \right]$$
where $(x, y)$ denote the refined coordinates, $d_{\min}$ (lateral spacing) is the minimum allowed horizontal separation, and $\mathcal{P}$ is the set of symmetric landmark pairs. This loss encourages symmetric landmark pairs to share identical vertical ($y$) coordinates while simultaneously enforcing a minimum horizontal spacing between them. The spacing term is designed to prevent landmark collapse, where the model clusters both points near the midline because that trivially satisfies the symmetry term. By maintaining this lateral separation, the model preserves an anatomically plausible symmetric configuration.
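One plausible form of such a symmetry loss is sketched below: an absolute vertical mismatch plus a hinge penalty on horizontal separations smaller than the minimum spacing. The exact norms and weighting used in the paper may differ; the function name and the value of `d_min` are hypothetical.

```python
def symmetry_loss(coords, pairs, d_min):
    """Mean over pairs of |y_l - y_r| + max(0, d_min - |x_l - x_r|).

    coords: list of (x, y) refined landmark positions.
    pairs:  list of (left_index, right_index) symmetric pairs.
    d_min:  minimum allowed horizontal separation (hypothetical value below).
    """
    total = 0.0
    for left, right in pairs:
        (xl, yl), (xr, yr) = coords[left], coords[right]
        vertical_mismatch = abs(yl - yr)
        spacing_violation = max(0.0, d_min - abs(xl - xr))
        total += vertical_mismatch + spacing_violation
    return total / len(pairs)

# Two points collapsed onto the midline are penalized by the spacing term.
collapsed = symmetry_loss([(200.0, 50.0), (200.0, 60.0)], [(0, 1)], d_min=50.0)
```

In this toy case the collapsed pair incurs a vertical mismatch of 10 and a spacing violation of 50, whereas a level, well-separated pair would incur no penalty.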
To maintain anatomical consistency along the vertical axis, we enforce a top–mid–bottom hierarchy (in image coordinates, with $y$ increasing downward):
$$\mathcal{L}_{\mathrm{ord}} = \max\left(0,\; \bar{y}_{\mathrm{top}} - \bar{y}_{\mathrm{mid}} + \delta\right) + \max\left(0,\; \bar{y}_{\mathrm{mid}} - \bar{y}_{\mathrm{bot}} + \delta\right)$$
where $\delta$ is a small margin that ensures sufficient separation between hierarchical levels. The terms $\bar{y}_{\mathrm{top}}$, $\bar{y}_{\mathrm{mid}}$, and $\bar{y}_{\mathrm{bot}}$ denote the averaged vertical positions of each left–right landmark pair. This loss encodes the sequential relationship that landmarks must follow along the anatomical vertical axis. It prevents reversals of landmark order, thereby suppressing outlier predictions during training and promoting global structural consistency.
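A hinge-style ordering penalty consistent with this description can be sketched as follows. It assumes image coordinates in which $y$ grows downward, so the top level should have the smallest mean $y$; the function name, the `levels` dictionary layout, and the margin value are illustrative.

```python
def order_loss(coords, levels, delta):
    """Penalize violations of top < mid < bottom (by at least delta) in mean y.

    coords: list of (x, y) refined landmark positions.
    levels: dict mapping 'top', 'mid', 'bottom' to a (left_index, right_index) pair.
    delta:  margin between consecutive levels (hypothetical value below).
    """
    def mean_y(pair):
        return 0.5 * (coords[pair[0]][1] + coords[pair[1]][1])

    y_top, y_mid, y_bot = (mean_y(levels[name]) for name in ("top", "mid", "bottom"))
    return max(0.0, y_top - y_mid + delta) + max(0.0, y_mid - y_bot + delta)

coords = [(100.0, 50.0), (300.0, 50.0), (100.0, 150.0),
          (300.0, 150.0), (100.0, 250.0), (300.0, 250.0)]
levels = {"top": (0, 1), "mid": (2, 3), "bottom": (4, 5)}
ordered = order_loss(coords, levels, delta=10.0)
```

A correctly ordered configuration yields zero loss, while swapping two levels produces a penalty that grows with the size of the reversal.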
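The structure-aware loss combines the two constraints:
$$\mathcal{L}_{\mathrm{struct}} = \lambda_{\mathrm{sym}} \mathcal{L}_{\mathrm{sym}} + \lambda_{\mathrm{ord}} \mathcal{L}_{\mathrm{ord}}$$
where $\lambda_{\mathrm{sym}}$ and $\lambda_{\mathrm{ord}}$ are the weighting coefficients of the symmetry and ordering losses, respectively.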
3.4.3. Total Loss
The full training objective is the sum of the localization and structure-aware losses; this formulation jointly optimizes pixel-level heatmap quality, continuous coordinate accuracy, and global anatomical consistency:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{loc}} + \mathcal{L}_{\mathrm{struct}}$$
4. Results
4.1. Dataset
The dataset contains 279 publicly available chest X-ray images, each annotated with six landmarks [31]. It is split into 204 training images and 75 testing images [32]. Since physical spacing metadata is not provided, performance is measured using pixel distances. All images are resized to 512 × 512 pxl.
4.2. Implementation Details
All experiments were implemented in PyTorch version 2.5.1 and trained on an NVIDIA RTX 3090 GPU. Data augmentation included random rotation (±18°), translation (±12%), isotropic scaling by a factor of 0.88–1.15, random adjustments of brightness, contrast, and gamma, as well as occasional blurring and noise injection, to model diverse anatomical and acquisition variations.
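The geometric augmentation ranges above could be sampled as in the sketch below. Uniform sampling, the interpretation of translation as a fraction of the image size, and the function name `sample_augmentation` are assumptions; the photometric, blur, and noise augmentations are not shown.

```python
import random

def sample_augmentation(rng):
    """Draw one set of geometric augmentation parameters (uniform sampling assumed)."""
    return {
        "rotation_deg": rng.uniform(-18.0, 18.0),
        "translate_x": rng.uniform(-0.12, 0.12),  # fraction of image width (assumed)
        "translate_y": rng.uniform(-0.12, 0.12),  # fraction of image height (assumed)
        "scale": rng.uniform(0.88, 1.15),         # isotropic scale factor
    }

params = sample_augmentation(random.Random(0))
```

Whatever transform is drawn for an image must also be applied to its landmark coordinates so that the annotations stay aligned.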
Training was performed for 120 epochs using the AdamW optimizer with a learning rate of and a weight decay of . A warm-up strategy was applied during the first 10 epochs, after which a cosine schedule was used to gradually adjust the structural loss weights, the soft-argmax temperature, and the Gaussian sigma. The weighting coefficients of the structural losses were chosen to provide sufficient anatomical guidance while avoiding over-constraint of the coordinate regression. Specifically, the two structural loss weights were set to $1.2 \times 10^{-3}$ and $4 \times 10^{-3}$, such that the structural losses act as regularizing terms that stabilize training and guide the GNN toward anatomically plausible configurations without dominating the primary localization loss. This balance ensures that structural priors function as soft constraints, improving convergence stability and structural consistency while preserving the model’s ability to adapt to realistic anatomical variations.
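A warm-up followed by a cosine schedule can be sketched as below. How the warm-up acts on each quantity is not specified, so this sketch simply holds the value at `start` for the first 10 epochs and then moves it to `end` along a half cosine; the function name and the example start and end values are hypothetical.

```python
import math

def warmup_cosine(epoch, start, end, warmup_epochs=10, total_epochs=120):
    """Hold `start` during warm-up, then follow a half-cosine from `start` to `end`."""
    if epoch < warmup_epochs:
        return start
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))

# Hypothetical example: shrinking a Gaussian sigma from 3.0 to 1.5 pixels over training.
sigma_schedule = [warmup_cosine(e, start=3.0, end=1.5) for e in range(121)]
```

The same helper could drive the loss weights or the temperature by choosing different `start` and `end` values.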
4.3. Evaluation Metrics
We adopt two widely used evaluation metrics in anatomical landmark detection: the Mean Radial Error (MRE) and the Success Detection Rate (SDR) [33]. MRE quantifies the average Euclidean distance between the predicted landmarks and their corresponding ground-truth annotations, whereas SDR measures the percentage of landmarks whose prediction error falls within a predefined tolerance. For a set of $N$ landmarks, the MRE is defined as:
$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2}$$
where $(x_i, y_i)$ denote the ground-truth coordinates, and $(\hat{x}_i, \hat{y}_i)$ denote the predicted coordinates.
The SDR represents the proportion of landmarks whose radial error does not exceed a given threshold $\tau$, and is defined as:
$$\mathrm{SDR}_{\tau} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(d_i \le \tau\right) \times 100\%$$
where $d_i$ is the Euclidean distance between the predicted and ground-truth landmarks, and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition is satisfied, and 0 otherwise. In this study, SDR was evaluated using thresholds $\tau = 3$, $6$, and $9$ pxl.
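Both metrics follow directly from the per-landmark radial errors, as the short sketch below illustrates. The function names are illustrative, and the toy coordinates are not taken from the dataset.

```python
import math

def radial_errors(pred, gt):
    """Euclidean distance between each predicted and ground-truth (x, y) pair."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

def mean_radial_error(pred, gt):
    """Average radial error over all landmarks, in pixels."""
    errors = radial_errors(pred, gt)
    return sum(errors) / len(errors)

def success_detection_rate(pred, gt, threshold):
    """Percentage of landmarks whose radial error is at most `threshold` pixels."""
    errors = radial_errors(pred, gt)
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)

# Toy example: one exact prediction and one that is 5 pixels off.
pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
mre = mean_radial_error(pred, gt)
sdr_3 = success_detection_rate(pred, gt, threshold=3.0)
sdr_6 = success_detection_rate(pred, gt, threshold=6.0)
```

In this toy case the 5-pixel error counts as a failure at the 3-pixel threshold but as a success at 6 pixels, which is why SDR is reported at several tolerances.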
4.4. Performance Comparison
The performance of the proposed method was evaluated on a chest X-ray six-landmark dataset using the MRE and SDR metrics described in Section 4.3. We report SDR values at tolerance levels of 3, 6, and 9 pxl, where the 3-pxl threshold represents a clinically demanding high-precision criterion. For a fair and meaningful performance comparison, we selected representative heatmap-based landmark detection methods that have been widely adopted in prior studies on landmark detection, covering diverse architectural designs and modeling strategies. Furthermore, to assess the robustness and stability of the proposed framework, all experiments were repeated five times with different random seeds, and the reported results correspond to the average performance across runs. During testing, initial landmark coordinates were obtained via soft-argmax from the predicted heatmaps, and the GNN refinement was subsequently applied to produce the final refined landmark coordinates.
Table 1 compares the proposed method with existing heatmap-based landmark detection approaches. Figure 6 presents a visualization of the predicted landmarks, demonstrating that the proposed method produces consistently aligned and anatomically coherent landmark positions. Table 1 shows that the proposed method achieves improved performance over existing approaches, including an MRE of 3.82 pxl and an SDR at 3 pxl of 52.44%. These results indicate that the proposed framework not only reduces the MRE, but also achieves superior performance under the most stringent evaluation criterion (SDR at 3 pxl), demonstrating its effectiveness for high-precision and clinically demanding landmark localization tasks.
4.5. Ablation Study
An ablation study was conducted on the test dataset to examine the contribution of each core component and hyperparameter in the proposed framework. Table 2 summarizes the performance variations when the Symmetry Loss, Order Loss, and GNN refinement module are selectively included or removed. The configuration incorporating all three components achieves the best overall performance, yielding the highest SDR scores and the lowest MRE (3.82 pxl). In contrast, removing any component generally lowers performance, as reflected in the SDR and MRE results.
A detailed comparison reveals the distinct yet complementary roles of each module. The combination of Symmetry and Order losses without the GNN provides a strong anatomical baseline with an MRE of 3.82 pxl, and the addition of the GNN refinement module further raises the SDR at the most stringent threshold (3 pxl) from 51.78% to 52.44%. This indicates that the structural losses impose global anatomical constraints to maintain overall structural consistency, whereas the GNN module performs fine-grained and adaptive geometric corrections through relational message passing. These results indicate that each component contributes meaningfully to improving structural consistency and coordinate accuracy.
To further analyze the role of each structural cue used by the GNN-based landmark refiner, we conduct an input ablation in which the GNN receives either (1) the full input representation or (2) the representation with one of the three components removed: the 32-dimensional feature vector (feat), the coordinate output (coord), or the heatmap confidence score (prob). In all variants, the Symmetry Loss and Order Loss are kept active to ensure a consistent structural prior across experiments. Removing the node feature vectors (w/o feat) tests whether appearance-level local descriptors contribute to refinement accuracy; removing the normalized coordinates (w/o coord) evaluates the impact of explicit spatial encoding on graph message passing; and removing the probability cues (w/o prob) measures the sensitivity of the GNN to local peak sharpness and confidence.
A quantitative analysis of Table 3 reveals that each input component plays a distinct yet essential role in the refinement process. Specifically, the removal of spatial coordinates leads to the most significant increase in MRE (from 3.91 to 4.01 pxl), highlighting that explicit coordinate information is the primary driver for maintaining global geometric consistency. Conversely, the exclusion of confidence scores (prob) results in the sharpest decline in 3 pxl SDR. This underscores that reliability-aware weighting is critical for high-precision localization, as it allows the GNN to effectively prioritize stable spatial anchors over noisy initial predictions. These findings justify the synergistic integration of visual, geometric, and confidence signals, which collectively ensure anatomically robust landmark detection.
Table 4 presents the model performance with respect to different temperature values used in the soft-argmax operation. The temperature parameter controls the sharpness of the predicted heatmap distribution: larger values produce smoother, flatter distributions, whereas excessively small values result in overly peaked responses. The experimental results show that T = 0.01 yields the best overall performance, achieving the highest SDR (3 pxl), while maintaining a low MRE (3.82 pxl).
In contrast, when T is set too high (e.g., 0.10), the heatmap becomes overly flattened, leading to performance degradation. Conversely, extremely small values (e.g., 0.005) reduce the MRE slightly while also decreasing the SDR, revealing a trade-off that ultimately lowers overall localization quality. These findings indicate that when the heatmap distribution becomes either too diffuse or too concentrated, the precision of soft-argmax-based coordinate estimation can be compromised.
5. Discussions
In our current study, the scale of the available dataset was constrained by the inherent requirement for expert-level annotation in medical imaging. Due to this limitation, the model’s performance may be restricted when encountering rare pathological cases or atypical anatomical variations. However, the proposed methodology effectively models global structural information through the synergy between structural losses and GNN-based refinement. This design can improve prediction accuracy even within atypical anatomical datasets, and we expect that the integration of more extensive and diverse multi-center datasets in future research will further enhance the framework’s robustness under such challenging conditions.
Furthermore, while the present framework was specifically designed for the unique structural patterns of chest X-rays, its applicability to other anatomical regions such as the knee or cranium remains a crucial area for exploration. Automatically adapting and transforming the proposed structural constraints to suit the distinct characteristics of various anatomical domains—and facilitating their interaction with the GNN-based refinement mechanism—will be instrumental in establishing a scalable framework that can consistently deliver high landmark localization accuracy across a wide range of clinical specialties.
6. Conclusions
In this study, we proposed a novel structure-aware framework to accurately detect six anatomical landmarks in chest X-rays. To explicitly model the spatial dependencies among landmarks, we introduced a GNN-based refinement module, together with structural losses derived from anatomical priors. This structure-aware learning strategy is implemented as a single pipeline (Figure 2), in which a U-Net backbone produces landmark probability maps, from which initial coordinates are extracted via a soft-argmax operation and subsequently refined through a lightweight GNN that captures global inter-landmark relationships. Rather than using coordinates alone, each GNN node embedding combines positional information, local appearance features, and confidence scores, enabling the refinement stage to reason jointly over geometric and appearance-based cues. The ablation results in Table 3 further show that removing any of these components degrades performance, highlighting the importance of this enriched node representation.
Quantitative evaluations show that our method outperforms conventional approaches in both MRE and SDR metrics (Table 1), demonstrating that the incorporation of structural priors and GNN refinement effectively alleviates the limitations of independent single-heatmap predictions by improving spatial reliability and anatomical consistency. Ablation studies further confirm that symmetry constraints, ordering constraints, and the GNN module each contribute meaningfully to the model accuracy and stability, with the full configuration achieving the best performance (Table 2).
Accurate identification of anatomical landmarks is crucial in medical imaging and diagnostic applications, as these landmarks serve as key reference points for clinicians in diagnosis, treatment planning, and patient monitoring. The proposed framework is therefore expected to substantially improve localization accuracy when integrated into medical systems that require precise anatomical landmark identification. In future work, beyond improving accuracy, we plan to explore more efficient inference strategies to facilitate rapid clinical diagnosis.