Learning Structural Relations for Robust Chest X-Ray Landmark Detection

Choi, Su-Bin; Ham, Gyu-Sung; Oh, Kanghan

doi:10.3390/electronics15030589



by Su-Bin Choi ¹, Gyu-Sung Ham ² and Kanghan Oh ¹,*

¹ Department of Computer and Software Engineering, Wonkwang University, Iksan 54538, Republic of Korea

² AI Convergence Research Institute, Wonkwang University, Iksan 54538, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(3), 589; https://doi.org/10.3390/electronics15030589

Submission received: 23 December 2025 / Revised: 15 January 2026 / Accepted: 26 January 2026 / Published: 29 January 2026

(This article belongs to the Section Artificial Intelligence)


Abstract

Accurate anatomical landmark localization is essential for automating chest X-ray analysis and improving diagnostic reliability. Recognizing global context is important in medical imaging, but the inherently high resolution of these images has long made it difficult. Although U-Net-based heatmap regression methods show strong performance, they still lack explicit modeling of the global spatial relationships among landmarks. To address this limitation, we propose an integrated structural learning framework that captures anatomical correlations across landmarks. The model generates probabilistic heatmaps with a U-Net and derives continuous coordinates via soft-argmax. These coordinates, along with their corresponding local feature vectors, are then fed into a Graph Neural Network (GNN) that refines the final positions by learning inter-landmark dependencies. Anatomical priors, such as bilateral symmetry and vertical hierarchy, are incorporated into the loss function to enhance spatial consistency. The experimental results show that our method outperforms the compared state-of-the-art models in MRE and in SDR at the 3, 6, and 9 pxl thresholds. This precision demonstrates the framework’s potential to enhance the accuracy and robustness of clinical diagnostic systems.

Keywords: medical landmark localization; structural spatial priors; graph-based refinement

1. Introduction

Anatomical landmark localization in medical imaging involves identifying specific anatomical points across modalities such as X-ray, CT, and MRI and is essential for various clinical tasks that include diagnosis, disease progression assessment, quantitative analysis, and treatment planning [1,2,3,4]. However, manual annotation of these landmarks by a doctor is both time-consuming and labor-intensive, and the resulting annotations often exhibit interobserver variability due to differences in image quality, acquisition conditions, and patient-specific anatomical variations. These factors make it difficult to maintain consistent annotation quality and further highlight the need for reliable and reproducible automated landmark detection methods.

To robustly address such variability, deep learning-based approaches have become standard in medical image analysis [5,6,7]. Among these, heatmap regression has become the most widely adopted approach to landmark detection [8,9,10]. Rather than predicting coordinates directly, it represents each landmark as a spatial probability distribution, which provides better spatial generalization and more stable localization [11,12].

U-Net has been widely utilized in medical imaging because its encoder–decoder architecture can effectively integrate multi-scale features [13] and, when paired with heatmap-based landmark representations, has demonstrated strong performance [10,11,12,14]. However, because heatmap representations rely primarily on local cues around each landmark, they are limited in their ability to capture anatomical variability and long-range structural relationships among landmarks [7,15,16,17]. Because the predicted heatmaps emphasize only the immediate neighborhood around each landmark, they cannot explicitly encode higher-level structural information, such as bilateral symmetry, inter-landmark distances, or positional relationships.

To address this issue, several studies have highlighted the importance of incorporating global anatomical constraints—including geometric relationships among landmarks and their relative spatial organization—to ensure structurally consistent predictions [16,17]. Furthermore, graph-based or geometric formulations represent landmarks as nodes and utilize iterative refinement strategies to refine their spatial and structural relationships, thereby incorporating global anatomical priors into the prediction process [8,16,17,18]. Although these methods have achieved notable progress in independently improving local representations and global structural relationships, research that utilizes these components in a mutually complementary manner remains limited.

The goal of this study is to develop a unified landmark detection framework that can more reliably learn anatomical relationships between landmarks in chest X-ray images. To this end, we adopt a U-Net-based heatmap regression network as the baseline architecture and incorporate structural losses that impose symmetry and vertical positional relationships during training. We further integrate a GNN-based landmark refinement module to adjust the initial heatmap predictions from a global structural perspective. This design simultaneously leverages local features, anatomical rules, and global structural information, addressing limitations present in previous approaches that considered these components separately.

For the experiments, we evaluated the performance of the proposed model using a chest X-ray dataset of 279 images annotated with six pulmonary anatomical landmarks. Figure 1 visualizes the dataset along with the ground-truth placement of each landmark. To examine the effects of the structural constraints—symmetry, ordering, and the GNN-based refinement—we conducted additional experiments using various combinations of these components. These evaluations allowed quantitative verification of how each structural constraint contributes to the stability of the overall landmark configuration.

The remainder of this paper is organized as follows: Section 2 provides an overview of related work, and Section 3 describes the proposed method in detail. Section 4 presents and analyzes the experimental results, Section 5 discusses limitations and future directions, and Section 6 concludes the paper.

2. Related Works

This section presents an overview of the major research directions in anatomical landmark detection for medical imaging. We focus on five representative categories: early coordinate-regression-based landmark detection approaches [19], heatmap-based inference methods [8,9,10], techniques that incorporate anatomical priors [16,17], graph-based approaches that model global relationships among landmarks [18], and finally, transformer-based architectures [20] and CNN–Transformer hybrid models [21,22].

In the early stages of deep learning, direct coordinate regression was widely used, where convolutional neural networks (CNNs) or regression-based models directly predicted the $(x, y)$ coordinates of each landmark [19]. This approach offers simplicity and computational efficiency. However, regressing coordinates as scalar values introduces fundamental limitations, including the inability to model spatial uncertainty, the lack of explicit anatomical relationship modeling, and vulnerability to shape variability and structural overlap that frequently occur in X-ray images [3,4].

Heatmap regression was introduced to overcome these limitations. By modeling each landmark as a 2D probability distribution, it provides far more stable localization than direct regression and has since become the dominant approach for landmark detection [5,7,10,11,12,15,23]. Multi-scale fully convolutional architectures, such as U-Net, Hourglass Networks, and Feature Pyramid Networks (FPNs) [13,24,25,26], further enhance this paradigm by integrating information across multiple resolutions, allowing the models to capture fine-grained local structures around each landmark.

However, heatmap-based methods still rely primarily on local appearance cues. Consequently, they have inherent difficulty representing global anatomical configurations among landmarks or incorporating anatomical priors in an end-to-end manner. To address the limitations of local appearance-based predictions, several studies have attempted to integrate global anatomical constraints into heatmap-based outputs. Payer et al. [7] enhanced landmark consistency by augmenting heatmap regression CNNs with a spatial configuration model. Ham et al. [14] leveraged soft-argmax coordinates to learn interactions among landmarks and explicitly incorporated these structural constraints through a loss function.

In 3D imaging, O’Neil et al. [27] improved robustness by combining atlas-based coordinate alignment with relative spatial relationships between landmarks, enabling the integration of long-range spatial context.

Alongside these efforts, graph-based approaches have also been explored, motivated by the difficulty of explicitly modeling global structure using heatmaps alone. In this approach, landmarks are represented as nodes and anatomical relationships as edges, allowing graph-based models or priors to capture global structural dependencies [18,28,29]. These studies use graph neural networks (GNNs) to model structural dependencies among landmarks, or to refine heatmap-based predictions during a post-processing stage. However, because these methods rely on heatmap-derived coordinates, their performance can be influenced by the accuracy of the initial heatmaps, and inaccuracies may propagate through subsequent refinement steps.

Recent studies have increasingly explored transformer-based architectures and CNN–Transformer hybrid models for anatomical landmark detection. For instance, Zhao et al. [21] developed SpineHRformer, a transformer-based model designed to capture long-range dependencies for precise spinal landmark localization. Furthermore, Jin et al. [20] proposed STLD, a selection-free self-training framework that utilizes a transformer-based regressor to effectively model the global relationships between landmarks while mitigating confirmation bias. In particular, Huang et al. [22] proposed CASEMark, a hybrid framework that integrates Swin Transformers with convolutional networks to jointly model global spatial context and fine-grained local details in multi-structure X-ray images. These advances reflect a broader research shift toward leveraging expansive spatial context to achieve clinically reliable landmark localization. Building on these observations, our work proposes a new framework that integrates local appearance cues, anatomical priors, and global structural information within a single unified learning process. The following sections describe the detailed components of the proposed method.

3. Methods

Figure 2 illustrates the overall pipeline of the proposed framework. We introduce a landmark localization approach that integrates probabilistic heatmap regression, graph-based structural refinement, and anatomical regularization to simultaneously enhance spatial accuracy and structural consistency. The pipeline follows a staged refinement strategy: a U-Net backbone first predicts heatmaps that encode local appearance and spatial probability for each landmark. A lightweight graph neural network (GNN) then models global relationships among landmarks and refines the initial coordinates through context-aware message passing. Finally, structural regularization terms based on anatomical symmetry and vertical ordering are incorporated during training to enforce anatomically plausible landmark configurations.

By combining explicit structural reasoning with conventional heatmap-based regression, the proposed framework goes beyond minimizing coordinate error to enable landmark predictions that are both precise and anatomically coherent.

3.1. U-Net for Heatmap Regression

Figure 3 presents the U-Net architecture employed in this study, adapted from Ronneberger et al. [13], which serves as the heatmap regression backbone of the proposed framework. The U-Net analyzes the input chest X-ray image and generates landmark-specific probability heatmaps. The model follows an encoder–decoder structure in which each block consists of two 3 × 3 convolutional layers, followed by instance normalization and ReLU activation. In the encoder, spatial resolution is progressively reduced through 2 × 2 max-pooling while the number of feature channels increases from 32 to 256. In the decoder, transposed-convolution-based upsampling restores spatial resolution, and skip connections link feature maps of matching scales to preserve fine structural details.

Finally, a 1 × 1 convolution layer outputs six heatmap channels, each representing a two-dimensional spatial response corresponding to a single anatomical landmark.

3.2. Feature Embedding Using Soft-Argmax

After generating the spatial probability maps for each anatomical landmark, the U-Net outputs a heatmap $H_{k}(x, y) \in \mathbb{R}^{S \times S}$ for each landmark $k$. To convert this probabilistic representation into continuous coordinates, we employ a differentiable soft-argmax operation. The conventional argmax selects only the single highest-response pixel and is therefore non-differentiable, making it unsuitable for end-to-end coordinate learning. In contrast, the soft-argmax computes an expectation over all pixel locations weighted by their probabilities, enabling smooth gradient flow from coordinate-based losses back to the heatmap representation.

Following prior work [18], the predicted coordinates $(x_{k}, y_{k})$ are obtained as:

$$(x_{k}, y_{k}) = \sum_{x, y} \mathrm{softmax}\left(\frac{H_{k}(x, y)}{T}\right) (x, y) \tag{1}$$

where $T$ is a temperature parameter that controls the sharpness of the distribution: lower values of $T$ produce sharper and more peaked heatmaps, while higher values yield smoother probability maps. This differentiable formulation effectively bridges pixel-level information and spatial reasoning.

The resulting coordinates $(x_{k}, y_{k})$ serve as compact anatomical descriptors and provide the input representation for the graph-based refinement module described in Section 3.3, while Section 3.4 details the associated supervision losses.
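As an illustration of Equation (1), the following minimal sketch computes the soft-argmax of a single heatmap in plain Python. It is not the authors' implementation: the function name `soft_argmax`, the nested-list heatmap layout (rows indexed by y, columns by x), and the toy heatmap values are assumptions made for the example.

```python
import math

def soft_argmax(heatmap, temperature=0.01):
    """Expected (x, y) under softmax(heatmap / T) for one S x S heatmap.

    `heatmap` is a list of rows (assumed layout: heatmap[y][x]).
    """
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    # Probability-weighted mean of the column (x) and row (y) indices.
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

# Toy heatmap (hypothetical values) with two equal peaks stacked vertically.
heatmap = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 2.0, 0.0],
           [0.0, 0.0, 2.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]

x_sharp, y_sharp = soft_argmax(heatmap, temperature=0.01)   # low T: follows the peaks
x_flat, y_flat = soft_argmax(heatmap, temperature=100.0)    # high T: drifts to the centre
```

With a low temperature the estimate falls halfway between the two peak pixels, which is how the operator yields sub-pixel coordinates. With a very high temperature the distribution is almost uniform and the estimate moves toward the image centre.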

3.3. Graph-Based Landmark Refiner

Although heatmap-based soft-argmax prediction can accurately localize each landmark at a local level, it treats landmarks independently and hence fails to capture anatomical structures or inter-landmark relationships. To address this limitation, the predicted landmark coordinates obtained from the soft-argmax step are interpreted as nodes in a graph, and a lightweight GNN-based refinement module exchanges information among neighboring landmarks to further adjust these initial estimates, as Figure 4 illustrates.

Each landmark node $N_{k}$ is constructed by combining three components: the 32-dim feature vector sampled at the predicted coordinate $(x_{k}, y_{k})$ on the final decoder feature map $F$, the normalized coordinate output $(x_{k} / W, y_{k} / H)$, and the softmax probability sampled at the same location (the heatmap confidence score). As Figure 4 shows, the resulting node embedding integrates local appearance information, global positional cues, and the softmax-based heatmap confidence score, providing a richer representation for subsequent graph refinement.
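The node construction described above can be sketched as a simple concatenation. The snippet below is an illustrative assumption rather than the published code: the helper name `build_node_embedding` is hypothetical, nearest-pixel sampling stands in for whatever interpolation the authors use, and the feature map is assumed to share the heatmap's width and height.

```python
def build_node_embedding(feature_map, prob_map, x, y):
    """Concatenate appearance, normalized position, and confidence for one landmark.

    feature_map: C x H x W nested lists (C = 32 in the paper).
    prob_map:    H x W softmax probabilities for this landmark.
    (x, y):      soft-argmax coordinates in pixels.
    """
    height, width = len(prob_map), len(prob_map[0])
    # Nearest-pixel sampling, clamped to the map bounds (an assumption of this sketch).
    col = min(max(int(round(x)), 0), width - 1)
    row = min(max(int(round(y)), 0), height - 1)
    appearance = [channel[row][col] for channel in feature_map]
    return appearance + [x / width, y / height, prob_map[row][col]]

# Toy inputs: 32 channels on a 4 x 4 grid, channel c filled with the value c.
feature_map = [[[float(c)] * 4 for _ in range(4)] for c in range(32)]
prob_map = [[0.0] * 4 for _ in range(4)]
prob_map[1][2] = 0.9
node = build_node_embedding(feature_map, prob_map, x=2.2, y=1.4)
```

Under these assumptions each node carries 32 appearance values, two normalized coordinates, and one confidence score.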

The structural relationships among landmarks are encoded in an adjacency matrix $A$, which incorporates left–right symmetry, superior–middle–inferior anatomical ordering, and cross-connections that reflect the partial asymmetry of real thoracic structures, as illustrated in Figure 4. Since the left and right lungs possess nearly symmetrical anatomical structures, these bilateral connections allow the model to calibrate unstable predictions by leveraging contralateral information, effectively preventing bilateral imbalance. However, recognizing that real thoracic anatomy is not perfectly symmetrical, we avoided overly restrictive symmetry assumptions (which can induce prediction errors) by incorporating cross-connections to reflect realistic partial asymmetry. This matrix is then normalized to enable stable information propagation across the graph, following the symmetric normalization proposed in graph convolutional networks [30]:

$$A_{norm} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{2}$$

In this formulation, $D$ denotes the degree matrix. Based on the normalized adjacency matrix, two graph convolution layers propagate information across anatomically related nodes. Specifically, as depicted in Figure 4, each landmark node maintains a feature embedding that encodes local appearance information (32-dim feature vector), normalized coordinates, and the heatmap confidence score derived from the peak activation of the corresponding heatmap. In this module, local appearance features indicate whether a landmark is positioned on the correct anatomical texture, while coordinates enable the model to compute relative distance relationships between landmarks. Prediction confidence reflects the sharpness of the heatmap response, allowing the model to identify reliable landmarks as reference anchors during structural refinement. During message passing, each node updates its representation by aggregating embeddings from neighbors defined by the adjacency matrix. Through this aggregation, each landmark compares its predicted position with the surrounding spatial configuration, enabling the GNN to learn how to correct structurally implausible deviations based on learned anatomical relationships.
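To make the normalization in Equation (2) and the neighbor aggregation concrete, here is a small standard-library sketch. Several details are assumptions and not taken from the paper: the landmark indexing (0 to 2 for one lung from top to bottom, 3 to 5 for the other), the specific cross-connection edges, and the optional self-loops (used in the graph convolutional network formulation, but not stated in the text). The aggregation step omits the learned weights and nonlinearity of a real graph convolution layer.

```python
import math

# Hypothetical edge list for six landmarks.
EDGES = [(0, 1), (1, 2), (3, 4), (4, 5),   # superior-middle-inferior chains
         (0, 3), (1, 4), (2, 5),           # left-right symmetric pairs
         (0, 4), (1, 3), (1, 5), (2, 4)]   # cross-connections (assumed)

def build_adjacency(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix from an undirected edge list."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return adj

def normalize_adjacency(adj, add_self_loops=True):
    """Compute D^{-1/2} A D^{-1/2}; self-loops are an assumption of this sketch."""
    n = len(adj)
    a = [row[:] for row in adj]
    if add_self_loops:
        for i in range(n):
            a[i][i] = 1.0
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in a]
    return [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def propagate(a_norm, features):
    """One weight-free aggregation step: each node mixes its neighbors' features."""
    n, dim = len(features), len(features[0])
    return [[sum(a_norm[i][j] * features[j][c] for j in range(n)) for c in range(dim)]
            for i in range(n)]

a_norm = normalize_adjacency(build_adjacency(6, EDGES))
mixed = propagate(a_norm, [[float(k), 1.0] for k in range(6)])
```

Because the normalization divides each entry by the square roots of both endpoint degrees, highly connected landmarks do not dominate the aggregated messages.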

The resulting node representations form 128-dimensional embeddings enriched with anatomical context. A fully connected layer then regresses the coordinate offsets $(\Delta x_{k}, \Delta y_{k})$, and the refined landmark positions are obtained as:

$$(x'_{k}, y'_{k}) = (x_{k} + \Delta x_{k}, y_{k} + \Delta y_{k}) \tag{3}$$

In this way, the GNN acts as a global structural refinement module that enhances the anatomical coherence of the heatmap-based predictions. The individual contribution and necessity of each feature component are quantitatively validated through the ablation study.

3.4. Loss Functions

To train the proposed framework in an end-to-end manner, we decompose the objective into a localization loss and a structure-aware loss. The localization loss supervises individual landmark positions through heatmap regression and coordinate refinement, while the structure-aware loss enforces anatomically plausible configurations.

3.4.1. Localization Loss

The U-Net decoder outputs a heatmap $H_{k}(x, y)$ for each landmark $k$, which is converted into a probability map $P_{k}(x, y) = \mathrm{softmax}(H_{k}(x, y))$. The heatmaps are supervised using a pixel-wise cross-entropy loss against the normalized Gaussian targets $G_{k}(x, y)$:

$$L_{CE} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{x, y} G_{k}(x, y) \log P_{k}(x, y) \tag{4}$$

where $K$ is the number of landmarks. Continuous landmark coordinates $(x_{k}, y_{k})$, obtained by the soft-argmax operator described in Section 3.2, are also supervised using an L1 regression loss:

$$L_{xy} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{L1}\left((x_{k}, y_{k}), (x_{k}^{GT}, y_{k}^{GT})\right) \tag{5}$$

This coordinate-level supervision enables sub-pixel accuracy beyond the native heatmap resolution and stabilizes the learning of continuous spatial representations. The overall localization loss is defined as:

$$L_{local} = L_{CE} + \lambda_{xy} L_{xy} \tag{6}$$
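The localization loss of Equations (4) to (6) can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the L1 term is taken as the sum of the absolute x and y differences (the paper does not say whether it sums or averages them), and `lambda_xy=1.0` and `sigma=1.0` are placeholders because the text does not report these values.

```python
import math

def gaussian_target(size, cx, cy, sigma):
    """Normalized Gaussian target G_k on a size x size grid (sums to 1)."""
    g = [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    total = sum(map(sum, g))
    return [[v / total for v in row] for row in g]

def spatial_log_softmax(heatmap):
    """log P_k: log-softmax taken over all pixel locations of one heatmap."""
    flat = [v for row in heatmap for v in row]
    peak = max(flat)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in flat))
    return [[v - log_z for v in row] for row in heatmap]

def cross_entropy(heatmaps, targets):
    """Equation (4): mean over landmarks of -sum G_k * log P_k."""
    total = 0.0
    for h, g in zip(heatmaps, targets):
        log_p = spatial_log_softmax(h)
        total -= sum(gv * lv for g_row, l_row in zip(g, log_p)
                     for gv, lv in zip(g_row, l_row))
    return total / len(heatmaps)

def l1_coordinate_loss(pred, gt):
    """Equation (5), with L1 assumed to be |dx| + |dy| per landmark."""
    return sum(abs(px - gx) + abs(py - gy)
               for (px, py), (gx, gy) in zip(pred, gt)) / len(pred)

def localization_loss(heatmaps, targets, pred, gt, lambda_xy=1.0):
    """Equation (6); lambda_xy is a placeholder value."""
    return cross_entropy(heatmaps, targets) + lambda_xy * l1_coordinate_loss(pred, gt)

target = gaussian_target(8, cx=3.0, cy=4.0, sigma=1.0)
matching = [[math.log(v) for v in row] for row in target]   # softmax equals the target
uninformative = [[0.0] * 8 for _ in range(8)]                # softmax is uniform
ce_matching = cross_entropy([matching], [target])
ce_uniform = cross_entropy([uninformative], [target])
```

A heatmap whose softmax reproduces the Gaussian target attains the minimum cross-entropy (the entropy of the target), whereas an uninformative heatmap gives the logarithm of the number of pixels.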

3.4.2. Structure-Aware Loss

In addition to local accuracy, we impose anatomical plausibility by incorporating two structural constraints: left–right symmetry and superior–middle–inferior ordering. Figure 5 illustrates these constraints. These structural losses, $L_{sym}$ and $L_{ord}$, function as anatomical guardrails that stabilize the convergence of the GNN refiner. Since GNNs are sensitive to the quality of initial landmark estimates, noisy heatmaps in the early training phase can lead the network to model erroneous spatial relationships. By enforcing explicit geometric priors, these losses constrain the search space to anatomically plausible configurations, ensuring a robust synergistic interaction between the learned graph features and structural constraints. For paired left–right landmarks $(i, j)$, $L_{sym}$ is defined as:

$$L_{sym} = \sum_{(i, j) \in pairs} \left[ \mathrm{SmoothL1}(y'_{i}, y'_{j}) + w_{width} \cdot \mathrm{ReLU}(d_{min} - |x'_{i} - x'_{j}|) \right] \tag{7}$$

where $(x'_{i}, y'_{i})$ denote the refined coordinates, $d_{min}$ (lateral spacing) is the minimum allowed horizontal separation, $w_{width}$ weights the spacing penalty, and $pairs$ is the set of symmetric landmark pairs. This loss function encourages symmetric landmark pairs to share identical vertical ($y$) coordinates while simultaneously enforcing a minimum horizontal spacing between them. The spacing term is designed to prevent landmark collapse, in which the model clusters both points near the midline purely to satisfy the symmetry term. Maintaining this lateral separation preserves an anatomically plausible symmetric configuration.
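A minimal sketch of the symmetry loss in Equation (7) is given below. The function names are hypothetical, the Smooth L1 transition point `beta=1.0` is an assumption (it matches the PyTorch default, but the paper does not state it), and the values of `d_min` and `w_width` are placeholders because the text does not report them.

```python
def smooth_l1(a, b, beta=1.0):
    """Smooth L1 distance: quadratic below beta, linear above (beta is assumed)."""
    d = abs(a - b)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def symmetry_loss(coords, pairs, d_min, w_width):
    """Equation (7): vertical agreement plus a minimum-spacing hinge per pair."""
    loss = 0.0
    for i, j in pairs:
        (xi, yi), (xj, yj) = coords[i], coords[j]
        loss += smooth_l1(yi, yj)                            # same height on both sides
        loss += w_width * max(0.0, d_min - abs(xi - xj))     # ReLU spacing penalty
    return loss

# Well-separated pair with a half-pixel height mismatch (toy values).
plausible = [(100.0, 120.0), (400.0, 120.5)]
# Pair that has collapsed toward the midline at identical height.
collapsed = [(250.0, 120.0), (260.0, 120.0)]

loss_plausible = symmetry_loss(plausible, [(0, 1)], d_min=50.0, w_width=1.0)
loss_collapsed = symmetry_loss(collapsed, [(0, 1)], d_min=50.0, w_width=1.0)
```

The collapsed pair satisfies the vertical term perfectly yet is penalized by the spacing hinge, which is the behavior the lateral-spacing constraint is meant to enforce.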

To maintain anatomical consistency along the vertical axis, we enforce a top–mid–bottom hierarchy:

$$L_{ord} = \max(0, y'_{top} - y'_{mid} + \delta) + \max(0, y'_{mid} - y'_{bot} + \delta) \tag{8}$$

where $\delta$ is a small margin that ensures sufficient separation between hierarchical levels. The terms $y'_{top}$, $y'_{mid}$, and $y'_{bot}$ denote the averaged vertical positions of each left–right landmark pair. Each hinge term vanishes once $y'_{top} + \delta \leq y'_{mid}$ (and likewise for the middle and bottom levels), so only violations of the ordering are penalized. This loss encodes the requirement that landmarks follow a top–mid–bottom arrangement along the anatomical vertical axis. It prevents the reversal of landmark orders, thereby suppressing potential outlier predictions during training and ensuring global structural consistency.
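The ordering constraint can be illustrated with a short sketch that treats each term of Equation (8) as a plain hinge of the form max(0, y_upper − y_lower + δ). The helper names, the landmark indexing, and the margin `delta=10.0` are assumptions for the example; the paper does not report the margin value.

```python
def level_mean_y(coords, pair):
    """Average vertical position of one left-right landmark pair."""
    i, j = pair
    return 0.5 * (coords[i][1] + coords[j][1])

def ordering_loss(y_top, y_mid, y_bot, delta):
    """Hinge penalty that is zero when top, mid, bottom are separated by >= delta."""
    return max(0.0, y_top - y_mid + delta) + max(0.0, y_mid - y_bot + delta)

# Hypothetical refined coordinates (x, y) for six landmarks; y grows downward.
coords = [(150.0, 100.0), (140.0, 250.0), (130.0, 400.0),
          (360.0, 104.0), (370.0, 246.0), (380.0, 404.0)]
levels = [(0, 3), (1, 4), (2, 5)]   # assumed top, mid, bottom pairs

y_top, y_mid, y_bot = (level_mean_y(coords, pair) for pair in levels)
loss_ordered = ordering_loss(y_top, y_mid, y_bot, delta=10.0)
loss_swapped = ordering_loss(210.0, 200.0, 300.0, delta=10.0)   # top below mid
```

A correctly ordered configuration incurs no penalty, while a top level that slips 10 pxl below the middle level is charged that violation plus the margin.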

The structure-aware loss combines the two constraints:

$$L_{struct} = \lambda_{sym} L_{sym} + \lambda_{ord} L_{ord} \tag{9}$$

3.4.3. Total Loss

The full training objective is the sum of the localization and structure-aware losses; this formulation jointly optimizes pixel-level heatmap quality, continuous coordinate accuracy, and global anatomical consistency:

$$L_{total} = L_{local} + L_{struct} \tag{10}$$

4. Results

4.1. Dataset

The dataset contains 279 publicly available chest X-ray images, each annotated with six landmarks [31]. It is split into 204 training images and 75 testing images [32]. Since physical spacing metadata is not provided, performance is measured using pixel distances. All images are resized to 512 × 512 pxl.

4.2. Implementation Details

All experiments were implemented in PyTorch version 2.5.1 and trained on an NVIDIA RTX 3090 GPU. Data augmentation included random rotation (±18°), translation (±12%), isotropic scaling (0.88–1.15×), random adjustments of brightness, contrast, and gamma, as well as occasional blurring and noise injection, to model diverse anatomical and acquisition variations.

Training was performed for 120 epochs using the AdamW optimizer with a learning rate of $7 \times 10^{-4}$ and a weight decay of $5 \times 10^{-5}$. A warm-up strategy was applied during the first 10 epochs, after which a cosine schedule was used to gradually adjust the structural loss weights ($\lambda_{sym}$, $\lambda_{ord}$), the soft-argmax temperature, and the Gaussian sigma. The weighting coefficients of the structural losses were chosen to provide sufficient anatomical guidance while avoiding over-constraint of the coordinate regression. Specifically, $\lambda_{sym} = 1.2 \times 10^{-3}$ and $\lambda_{ord} = 4 \times 10^{-3}$ were selected such that the structural losses act as regularizing terms that stabilize training and guide the GNN toward anatomically plausible configurations, without dominating the primary localization loss. This balance ensures that structural priors function as soft constraints, improving convergence stability and structural consistency while preserving the model’s ability to adapt to realistic anatomical variations.

4.3. Evaluation Metrics

We adopt two widely used evaluation metrics in anatomical landmark detection: the Mean Radial Error (MRE) and the Success Detection Rate (SDR) [33]. MRE quantifies the average Euclidean distance between the predicted landmarks and their corresponding ground-truth annotations, whereas SDR measures the percentage of landmarks whose prediction error falls within a predefined tolerance. For a set of $N$ landmarks, the MRE is defined as:

$$MRE = \frac{1}{N} \sum_{i=1}^{N} \sqrt{(x_{i} - \hat{x}_{i})^{2} + (y_{i} - \hat{y}_{i})^{2}} \tag{11}$$

where $(x_{i}, y_{i})$ denote the ground-truth coordinates, and $(\hat{x}_{i}, \hat{y}_{i})$ denote the predicted coordinates.

The SDR represents the proportion of landmarks whose radial error does not exceed a given threshold $t$, and is defined as:

$$SDR = \frac{1}{N} \sum_{i=1}^{N} I(d_{i} \leq t) \tag{12}$$

where $d_{i}$ is the Euclidean distance between the predicted and ground-truth landmarks, and $I$ is the indicator function that returns 1 if the condition is satisfied, and 0 otherwise. In this study, SDR was evaluated using thresholds of $t$ = 3, 6, and 9 pxl.
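Both metrics in Equations (11) and (12) are straightforward to compute. The sketch below uses only the Python standard library; the function names and the toy coordinates are illustrative assumptions, and the SDR is multiplied by 100 so that it reads as a percentage, as in the reported results.

```python
import math

def radial_errors(pred, gt):
    """Euclidean distance d_i between each predicted and ground-truth landmark."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

def mean_radial_error(pred, gt):
    """Equation (11): average radial error in pixels."""
    errors = radial_errors(pred, gt)
    return sum(errors) / len(errors)

def success_detection_rate(pred, gt, threshold):
    """Equation (12), expressed as a percentage of landmarks with d_i <= threshold."""
    errors = radial_errors(pred, gt)
    return 100.0 * sum(d <= threshold for d in errors) / len(errors)

# Toy example with radial errors of 5, 0, 6, and 10 pxl.
gt = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
pred = [(3.0, 4.0), (10.0, 10.0), (20.0, 26.0), (30.0, 40.0)]

mre = mean_radial_error(pred, gt)
sdr = {t: success_detection_rate(pred, gt, t) for t in (3, 6, 9)}
```

In this toy case the MRE is 5.25 pxl, while the SDR is 25% at 3 pxl and 75% at both 6 and 9 pxl, showing how a single large error raises the MRE without changing the SDR at looser thresholds.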

4.4. Performance Comparison

The performance of the proposed method was evaluated on a chest X-ray six-landmark dataset using the MRE and SDR metrics described in Section 4.3. We report SDR values at tolerance levels of (3, 6, and 9) pxl, where the 3-pxl threshold represents a clinically demanding high-precision criterion. For a fair and meaningful performance comparison, we selected representative heatmap-based landmark detection methods that have been widely adopted in prior studies on landmark detection, covering diverse architectural designs and modeling strategies. Furthermore, to assess the robustness and stability of the proposed framework, all experiments were repeated five times with different random seeds, and the reported results correspond to the average performance across runs. During testing, initial landmark coordinates were obtained via soft-argmax from the predicted heatmaps, and the GNN refinement was subsequently applied to produce the final refined landmark coordinates. Table 1 compares the proposed method with existing heatmap-based landmark detection approaches.

Figure 6 presents a visualization of the predicted landmarks, demonstrating that the proposed method produces consistently aligned and anatomically coherent landmark positions. Table 1 shows that the proposed method achieves improved performance over existing approaches, including an MRE of 3.82 pxl and an SDR of 52.44% at the 3 pxl threshold.

These results indicate that the proposed framework not only reduces the MRE, but also achieves superior performance under the most stringent evaluation criterion (SDR 3 pxl), demonstrating its effectiveness for high-precision and clinically demanding landmark localization tasks.

4.5. Ablation Study

In this section, an ablation study was conducted on the test dataset to examine the contribution of each core component and hyperparameter in the proposed framework.

Table 2 summarizes the performance variations when the Symmetry Loss, Order Loss, and GNN refinement module are selectively included or removed. The configuration incorporating all three components achieves the best overall performance, yielding the highest SDR scores and the lowest MRE (3.82 pxl). In contrast, removing any component generally lowers performance, as reflected in the SDR and MRE results.

A detailed comparison reveals the distinct yet complementary roles of each module.

The combination of Symmetry and Order losses without the GNN provides a strong anatomical baseline with an MRE of 3.82 pxl, and the addition of the GNN refinement module further boosts the SDR at the most stringent threshold (3 pxl) from 51.78% to 52.44%. This indicates that the structural losses impose global anatomical constraints to maintain overall structural consistency, whereas the GNN module performs fine-grained and adaptive geometric corrections through relational message passing. These results indicate that each component contributes meaningfully to improving structural consistency and coordinate accuracy.

To further analyze the role of each structural cue used by the GNN-based landmark refiner, we conduct an input ablation in which the GNN receives either (1) the full input representation or (2) the representation with one of the three components removed: the 32-dim feature vector (feat), the coordinate output ($xy$), or the heatmap confidence score (prob). In all variants, the Symmetry Loss and Order Loss are kept active to ensure a consistent structural prior across experiments. Removing node feature vectors (w/o feat) tests whether appearance-level local descriptors contribute to refinement accuracy; removing the normalized coordinates (w/o $xy$) evaluates the impact of explicit spatial encoding on graph message passing; and removing probability cues (w/o prob) measures the sensitivity of the GNN to local peak sharpness and confidence.

A quantitative analysis of Table 3 reveals that each input component plays a distinct yet essential role in the refinement process. Specifically, the removal of spatial coordinates ($xy$) leads to the most significant increase in MRE (from 3.91 to 4.01 pxl), highlighting that explicit coordinate information is the primary driver for maintaining global geometric consistency. Conversely, the exclusion of confidence scores (prob) results in the sharpest decline in 3 pxl SDR. This underscores that reliability-aware weighting is critical for high-precision localization, as it allows the GNN to effectively prioritize stable spatial anchors over noisy initial predictions. These findings justify the synergistic integration of visual, geometric, and confidence signals, which collectively ensure anatomically robust landmark detection.

Table 4 presents the model performance with respect to different temperature values used in the soft-argmax operation. The temperature parameter controls the sharpness of the predicted heatmap distribution: larger values produce smoother, flatter distributions, whereas excessively small values result in overly peaked responses. The experimental results show that T = 0.01 yields the best overall performance, achieving the highest SDR (3 pxl), while maintaining a low MRE (3.82 pxl).

In contrast, when T is set too high (e.g., 0.10), the heatmap becomes overly flattened, leading to performance degradation. Conversely, extremely small values (e.g., 0.005) reduce the MRE slightly while also decreasing the SDR, revealing a trade-off that ultimately lowers overall localization quality. These findings indicate that when the heatmap distribution becomes either too diffuse or too concentrated, the precision of soft-argmax-based coordinate estimation can be compromised.

5. Discussions

In our current study, the scale of the available dataset was constrained by the inherent requirement for expert-level annotation in medical imaging. Due to this limitation, the model’s performance may be restricted when encountering rare pathological cases or atypical anatomical variations. However, the proposed methodology effectively models global structural information through the synergy between structural losses and GNN-based refinement. This design can improve prediction accuracy even within atypical anatomical datasets, and we expect that the integration of more extensive and diverse multi-center datasets in future research will further enhance the framework’s robustness under such challenging conditions.

Furthermore, while the present framework was specifically designed for the unique structural patterns of chest X-rays, its applicability to other anatomical regions such as the knee or cranium remains a crucial area for exploration. Automatically adapting and transforming the proposed structural constraints to suit the distinct characteristics of various anatomical domains—and facilitating their interaction with the GNN-based refinement mechanism—will be instrumental in establishing a scalable framework that can consistently deliver high landmark localization accuracy across a wide range of clinical specialties.

6. Conclusions

In this study, we proposed a novel structure-aware framework to accurately detect six anatomical landmarks in chest X-rays. To explicitly model the spatial dependencies among landmarks, we introduced a GNN-based refinement module, together with structural losses derived from anatomical priors. This structure-aware learning strategy is implemented as a single pipeline (Figure 2), in which a U-Net backbone produces landmark probability maps, from which initial coordinates are extracted via a soft-argmax operation and subsequently refined through a lightweight GNN that captures global inter-landmark relationships. Rather than using coordinates alone, each GNN node embedding combines positional information, local appearance features, and confidence scores, enabling the refinement stage to reason jointly over geometric and appearance-based cues. The ablation results in Table 3 further show that removing any of these components degrades performance, highlighting the importance of this enriched node representation.

Quantitative evaluations show that our method outperforms the compared methods in both MRE and SDR (Table 1), indicating that incorporating structural priors and GNN refinement alleviates the limitations of independent single-heatmap predictions by improving spatial reliability and anatomical consistency. Ablation studies further confirm that the symmetry constraint, the ordering constraint, and the GNN module each contribute to model accuracy and stability: the full configuration achieves the highest or joint-highest SDR at all three thresholds, although the Order + GNN variant attains a marginally lower MRE (3.80 vs. 3.82 pxl; Table 2).

Accurate identification of anatomical landmarks is crucial in medical imaging and diagnostic applications, as these landmarks serve as key reference points for clinicians in diagnosis, treatment planning, and patient monitoring. The proposed framework is therefore expected to substantially improve localization accuracy when integrated into medical systems that require precise anatomical landmark identification. In future work, beyond improving accuracy, we plan to explore more efficient inference strategies to facilitate rapid clinical diagnosis.

Author Contributions

Conceptualization, S.-B.C. and K.O.; methodology, S.-B.C.; software, S.-B.C.; validation, S.-B.C.; formal analysis, S.-B.C. and G.-S.H.; investigation, S.-B.C. and K.O.; writing—original draft preparation, S.-B.C.; writing—review and editing, G.-S.H. and K.O.; visualization, S.-B.C.; supervision, K.O. and G.-S.H.; project administration, K.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Chest X-ray Landmark Dataset. Available online: https://github.com/ngaggion/Chest-xray-landmark-dataset (accessed on 11 December 2025).

Acknowledgments

This paper was supported by Wonkwang University in 2024.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

1. Shen, D.; Wu, G.; Suk, H.-I. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248.
2. van Ginneken, B.; ter Haar Romeny, B.; Vergeer, M. Computer-Aided Diagnosis in Chest Radiography: A Survey. IEEE Trans. Med. Imaging 2001, 20, 1228–1241.
3. Wang, C.; Huang, C.; Hsieh, M.; Li, C.; Chang, S.; Li, W.; Vandaele, R.; Marée, R.; Jodogne, S.; Geurts, P. Evaluation and comparison of anatomical landmark detection methods for cephalometric X-ray images: A grand challenge. IEEE Trans. Med. Imaging 2015, 34, 1890–1900.
4. Wang, C.; Huang, C.; Lee, J.; Li, C.; Chang, S.; Siao, M.; Lai, T.; Ibragimov, B.; Vrtovec, T.; Ronneberger, O. A benchmark for comparison of dental radiography analysis algorithms. Med. Image Anal. 2016, 31, 63–76.
5. Chen, R.; Ma, Y.; Chen, N.; Lee, D.; Wang, W. Cephalometric Landmark Detection by Attentive Feature Pyramid Fusion and Regression-Voting. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11766, pp. 873–881.
6. Noothout, J.M.; De Vos, B.D.; Wolterink, J.M.; Postma, E.M.; Smeets, P.A.; Takx, R.A.; Leiner, T.; Viergever, M.A.; Išgum, I. Deep learning-based regression and classification for automatic landmark localization in medical images. IEEE Trans. Med. Imaging 2020, 39, 4011–4022.
7. Payer, C.; Štern, D.; Bischof, H.; Urschler, M. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med. Image Anal. 2019, 54, 207–219.
8. Zhou, X.-Y.; Lai, B.; Li, W.; Wang, Y.; Zheng, K.; Wang, F.; Lin, C.; Lu, L.; Huang, L.; Han, M.; et al. Scalable Semi-supervised Landmark Localization for X-ray Images using Few-shot Deep Adaptive Graph. In Deep Generative Models, and Data Augmentation, Labelling, and Imperfections; DGM4MICCAI; Springer: Berlin/Heidelberg, Germany, 2021; Volume 13003, pp. 145–153.
9. Fard, A.; Ferrantelli, J.; Dupuis, A.; Mahoor, M. Sagittal Cervical Spine Landmark Point Detection in X-ray Using Deep Convolutional Neural Networks. IEEE Access 2022, 10, 59413–59427.
10. Stern, A.; Sharan, L.; Romano, G.; Koehler, S.; Karck, M.; De Simone, R.; Wolf, I.; Engelhardt, S. Heatmap-based 2D Landmark Detection with a Varying Number of Landmarks. Proc. Ger. Workshop Med. Image Comput. 2021, 13008, 22–27.
11. Chen, J.; Che, H.; Sun, J.; Rao, Y.; Wu, J. An automatic cephalometric landmark detection method based on heatmap regression and Monte Carlo dropout. In Proceedings of the 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4.
12. Zhong, Z.; Li, J.; Zhang, Z.; Jiao, Z.; Gao, X. An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms. In Medical Image Computing and Computer Assisted Intervention – MICCAI; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11769, pp. 540–548.
13. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
14. Ham, G.; Oh, K. Learning Spatial Configuration Feature for Landmark Localization in Hand X-rays. Electronics 2023, 12, 4038.
15. McCouat, J.; Voiculescu, I. Contour-Hugging Heatmaps for Landmark Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20565–20573.
16. Li, W.; Lu, Y.; Zheng, K.; Liao, H.; Lin, C.; Luo, J.; Cheng, C.-T.; Xiao, J.; Lu, L.; Kuo, C.-F.; et al. Structured Landmark Detection via Topology-Adapting Deep Graph Learning. In Proceedings of the European Conference on Computer Vision–ECCV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 23–28.
17. Urschler, M.; Ebner, T.; Štern, D. Integrating geometric configuration and appearance information into a unified framework for anatomical landmark localization. Med. Image Anal. 2018, 43, 23–36.
18. Xiao, J.; Dang, K.; Ding, X. Anatomical Landmark Localization for Knee X-ray Images via Heatmap Regression Refined with Graph Convolutional Network. In Proceedings of the 2023 16th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Taizhou, China, 28–30 October 2023; IEEE: New York, NY, USA, 2023; pp. 1–6.
19. Zhang, J.; Liu, M.; Shen, D. Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks. IEEE Trans. Image Process. 2017, 26, 4753–4764.
20. Jin, H.; Che, H.; Chen, H. Rethinking Self-Training for Semi-Supervised Landmark Detection: A Selection-Free Approach. IEEE Trans. Image Process. 2024, 33, 4952–4965.
21. Zhao, M.; Meng, N.; Cheung, J.P.Y.; Yu, C.; Lu, P.; Zhang, T. SpineHRformer: A Transformer-Based Deep Learning Model for Automatic Spine Deformity Assessment with Prospective Validation. Bioengineering 2023, 10, 1333.
22. Huang, Z.; Zhou, X.; He, X.; Wei, Y.; Yang, W.; Wang, S.; Sun, X.; Li, H. CASEMark: A Hybrid model for robust anatomical landmark detection in multi-structure X-rays. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 37.
23. Kang, J.; Oh, K.; Oh, I. Accurate landmark localization for medical images using perturbations. Appl. Sci. 2021, 11, 10277.
24. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition CVPR; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3431–3440.
25. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Computer Vision – ECCV; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9912, pp. 483–499.
26. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
27. O’Neil, A.Q.; Kascenas, A.; Henry, J.; Wyeth, D.; Shepherd, M.; Beveridge, E.; Clunie, L.; Sansom, C.; Šeduikytė, E.; Muir, K.; et al. Attaining human-level performance with atlas location autocontext for anatomical landmark detection in 3D CT data. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; Volume 11131, pp. 470–484.
28. Chen, C.; Yang, X.; Huang, R.; Shi, W.; Liu, S.; Lin, M.; Huang, Y.; Yang, Y.; Zhang, Y.; Luo, H.; et al. Region proposal network with graph prior and IoU-balance loss for landmark detection in 3D ultrasound. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI); IEEE: New York, NY, USA, 2020; pp. 1–5.
29. Wang, Y.; Lam, H.K. Landmark-based graph convolutional neural network for interpretable catheter malposition detection. Knowl.-Based Syst. 2026, 333, 114993.
30. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
31. Gaggion, N.; Mansilla, L.; Mosquera, C.; Milone, D.H.; Ferrante, E. Improving Anatomical Plausibility in Medical Image Segmentation via Hybrid Graph Neural Networks: Applications to Chest X-Ray Analysis. IEEE Trans. Med. Imaging 2023, 42, 544–556.
32. Lu, C.; Yang, G.; Qiao, X.; Chen, W.; Zeng, Q. UniverDetect: Universal landmark detection method for multidomain X-ray images. Neurocomputing 2024, 600, 128157.
33. Di Via, R.; Santacesaria, M.; Odone, F.; Pastore, V.P. Is In-Domain Data Beneficial in Transfer Learning for Landmarks Detection in X-ray Images? In Proceedings of the IEEE International Symposium on Biomedical Imaging, ISBI, Athens, Greece, 27–30 May 2024; pp. 1–5.
34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Springer: Berlin/Heidelberg, Germany, 2018; pp. 7132–7141.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
36. Gao, Z.; Gao, L.; Xu, S.; Shen, Y.; Bai, J.; Xu, R.; Sang, N. Global Second-Order Pooling Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Springer: Berlin/Heidelberg, Germany, 2019; pp. 3024–3033.
37. Fang, P.; Zhou, J.; Roy, S.K.; Ji, P.; Petersson, L.; Harandi, M. Attention in Attention Networks for Person Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4626–4641.
38. Hassani, A.; Walton, S.; Li, J.; Li, S.; Shi, H. Neighborhood attention transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Springer: Berlin/Heidelberg, Germany, 2022; pp. 6185–6194.

Figure 1. Visualization of boundary landmarks on chest X-ray images. Green dots denote ground-truth landmark points. L: left, R: right, PA: posterior-anterior view.

Figure 2. Overall architecture of the proposed framework. U-Net produces heatmaps from input X-rays, followed by soft-argmax to obtain the output coordinates. Landmark relations are refined through a graph-based refiner and regularized by structural constraints.

Figure 3. U-Net based heatmap regression for boundary landmark point detection.

Figure 4. Graph-based Landmark Refiner. The refiner integrates soft-argmax coordinates and sampled local features into a graph structure, where each node represents an anatomical landmark. Through message passing across anatomically related nodes, the model predicts coordinate offsets $(\Delta x, \Delta y)$ to produce spatially consistent and refined landmark positions.

Figure 5. Illustration of the structural constraints. (a) Symmetry Loss enforces vertical alignment and a minimum lateral spacing between left–right landmark pairs. (b) Order Loss ensures an anatomically valid top–mid–bottom hierarchy of landmark positions.

Figure 6. Visualization results of the proposed model in the Chest X-ray dataset. Green dots are the ground-truth, while blue dots are the prediction results. L: left, R: right, PA: posterior-anterior view, AP: anterior-posterior view.

Table 1. Quantitative comparison with the existing methods on the Chest X-ray dataset.

Method	SDR 3 pxl (%)	SDR 6 pxl (%)	SDR 9 pxl (%)	MRE ± SD (pxl)
U-Net [13]	50.85 ± 0.65	82.33 ± 0.60	91.68 ± 0.30	4.30 ± 0.07
SE [34]	51.16 ± 0.62	81.89 ± 0.58	91.63 ± 0.28	4.21 ± 0.06
CBAM [35]	50.90 ± 0.60	82.55 ± 0.55	92.03 ± 0.27	4.25 ± 0.06
GSOP [36]	51.88 ± 0.58	82.71 ± 0.52	92.61 ± 0.25	4.12 ± 0.05
AiA [37]	52.09 ± 0.55	82.11 ± 0.50	92.38 ± 0.24	4.05 ± 0.05
NAT [38]	52.24 ± 0.52	81.66 ± 0.48	92.46 ± 0.23	4.08 ± 0.06
UniverDetect [32]	50.81 ± 0.63	82.52 ± 0.56	92.27 ± 0.26	4.06 ± 0.05
The proposed method	52.44 ± 0.41	83.11 ± 0.58	93.11 ± 0.22	3.82 ± 0.04

Bold indicates the best performance.
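For readers reproducing the SDR and MRE columns of Table 1, the two metrics can be sketched as follows. This is a minimal sketch, assuming the conventional definitions (MRE as the mean Euclidean distance between predicted and ground-truth landmarks, SDR as the percentage of landmarks whose error is within a threshold); the function names, the inclusive `<=` comparison, and the toy coordinates are assumptions and not values from the paper.

```python
import math


def radial_errors(pred, gt):
    """Euclidean distance (in pixels) between each predicted and true landmark."""
    return [math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(pred, gt)]


def mean_radial_error(errors):
    """MRE: average radial error over all landmarks."""
    return sum(errors) / len(errors)


def successful_detection_rate(errors, threshold):
    """SDR: percentage of landmarks with error within `threshold` pixels."""
    hits = sum(1 for e in errors if e <= threshold)
    return 100.0 * hits / len(errors)


# Toy predictions and ground truth for four landmarks (hypothetical values).
pred = [(10, 10), (20, 24), (30, 30), (47, 40)]
gt = [(10, 12), (20, 20), (30, 30), (40, 40)]

errors = radial_errors(pred, gt)
print("MRE:", mean_radial_error(errors))
for thr in (3, 6, 9):
    print(f"SDR@{thr} pxl:", successful_detection_rate(errors, thr))
```

Because SDR counts landmarks under a threshold while MRE averages all errors, a setting can lower MRE slightly yet lose SDR, which is the kind of trade-off visible between neighbouring rows of Tables 2 and 4.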

Table 2. Performance comparison with respect to different combinations of the Structural Loss (Symmetry, Order) and GNN refinement module.

Method	SDR 3 pxl (%)	SDR 6 pxl (%)	SDR 9 pxl (%)	MRE (pxl)
Symmetry, Order, GNN	52.44	83.11	93.11	3.82
Symmetry, Order	51.78	81.56	92.22	3.82
Symmetry, GNN	51.33	83.11	92.44	3.93
Order, GNN	51.78	82.89	92.44	3.80
GNN	51.56	81.78	92.67	3.91

Bold indicates the best performance.

Table 3. Impact of GNN input components on landmark detection performance.

Method	SDR 3 pxl (%)	SDR 6 pxl (%)	SDR 9 pxl (%)	MRE (pxl)
GNN	51.56	81.78	92.67	3.91
GNN (w/o feat)	51.33	81.11	91.78	3.99
GNN (w/o xy)	51.11	80.67	91.33	4.01
GNN (w/o prob)	50.44	80.67	91.11	3.97

Bold indicates the best performance.

Table 4. Performance comparison in relation to the temperature parameter T.

Temperature T	SDR 3 pxl (%)	SDR 6 pxl (%)	SDR 9 pxl (%)	MRE (pxl)
0.10	49.78	80.44	91.56	3.97
0.05	51.11	80.67	93.33	3.87
0.02	50.00	82.67	92.89	3.83
0.01	52.44	83.11	93.11	3.82
0.005	52.00	81.33	92.89	3.78

Bold indicates the best performance.

Share and Cite

MDPI and ACS Style

Choi, S.-B.; Ham, G.-S.; Oh, K. Learning Structural Relations for Robust Chest X-Ray Landmark Detection. Electronics 2026, 15, 589. https://doi.org/10.3390/electronics15030589
