Part Affinity Fields and CoordConv for Detecting Landmarks of Lumbar Vertebrae and Sacrum in X-ray Images

With the prevalence of degenerative diseases due to the increase in the aging population, we have encountered many spine-related disorders. Since the spine is a crucial part of the body, fast and accurate diagnosis is critically important. Generally, clinicians use X-ray images to diagnose the spine, but X-ray images are commonly occluded by the shadows of some bones, making it hard to identify the whole spine. Therefore, recently, various deep-learning-based spinal X-ray image analysis approaches have been proposed to help diagnose the spine. However, these approaches did not consider the characteristics of frequent occlusion in the X-ray image and the properties of the vertebra shape. Therefore, based on the X-ray image properties and vertebra shape, we present a novel landmark detection network specialized in lumbar X-ray images. The proposed network consists of two stages: The first step detects the centers of the lumbar vertebrae and the upper end plate of the first sacral vertebra (S1), and the second step detects the four corner points of each lumbar vertebra and two corner points of S1 from the image obtained in the first step. We used random spine cutout augmentation in the first step to robustify the network against the commonly obscured X-ray images. Furthermore, in the second step, we used CoordConv to make the network recognize the location distribution of landmarks and part affinity fields to understand the morphological features of the vertebrae, resulting in more accurate landmark detection. The proposed network was evaluated using 304 X-ray images, and it achieved 98.02% accuracy in center detection and 8.34% relative distance error in corner detection. This indicates that our network can detect spinal landmarks reliably enough to support radiologists in analyzing the lumbar X-ray images.


Introduction
Recently, due to the global population's aging, there has been a gradual increase in the proportion of the elderly. As a result, many degenerative diseases, such as osteoporosis, arthritis, and muscular regression, are becoming more common. Furthermore, many spine-related disorders are occurring because of these degenerative diseases. The spine is important in our bodies because it protects the central and peripheral nerves. Furthermore, once a spinal problem starts, it is hard to recover completely. Thus, it is critical to diagnose it early and to receive appropriate therapy. In this diagnosis, clinicians can use Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and X-ray images. However, obtaining CT or MRI images consumes much time and money. Therefore, for initial diagnosis, many clinicians use X-ray images, which are relatively inexpensive and easy to acquire. In the case of X-ray images, however, tissues or shadows of other bones overlap, making it difficult to diagnose. As a result, numerous deep-learning-based vertebral analysis methods have been proposed that assist in diagnoses, such as automatic vertebral landmarks' detection or vertebral segmentation. In particular, the detection of vertebral landmarks is important for the quantitative analysis of spine alignments, along with the diagnosis of spondylolisthesis, scoliosis, compression factor, and degenerative change.
Among the previous methods, some works [1-4] adopted a one-stage process to perform detection or segmentation in spine X-ray images. However, because X-ray images are frequently occluded, identifying all landmarks in a single pass is exceptionally challenging. Therefore, most of the recently proposed methods [5][6][7] adopted a two-stage process that first determines the vertebral regions to crop and then performs spinal detection or segmentation. Kim et al. [5] used the predicted center of each lumbar vertebra to determine the vertebral region and then performed segmentation. However, Kim et al. did not sufficiently account for the frequent occlusion in X-ray images; occasionally, the center of the first lumbar vertebra was not predicted adequately. Additionally, since Kim et al. did not segment the sacrum, diagnosing spondylolisthesis at the fifth lumbar vertebra (L5)-S1 level is impossible with their method. In contrast to Kim et al., Cina et al. [6] performed landmark detection in both S1 and the lumbar vertebrae and determined the vertebral region using an expanded bounding box generated from roughly predicted landmarks of each vertebra. However, the rough landmark prediction of Cina et al. suffers from the same difficulty as one-stage detection, and once a single landmark was improperly detected, the bounding box could not be constructed adequately.
In this paper, we propose a network that detects four corner points of the upper end plate and lower end plate of each lumbar vertebra and two corner points of the upper end plate of S1 in lateral lumbar X-ray images (Figure 1). Here, we excluded the lower end plate of S1 since it is often difficult to localize in the X-ray image. Our proposed network detects the center of each vertebra based on the confidence maps, and we performed landmark detection in cropped vertebral images using the detected centers. For detecting centers, we adopted Pose-Net [5] and increased the kernel size of some convolution filters to enable accurate center detection with broad receptive fields. In addition, a novel augmentation technique called random spine cutout was applied, which randomly cuts the vertebra employing a conventional cutout [8] to precisely localize the center, even when the X-ray image is obscured. In the landmark detection process, we utilized M-Net [9] and used CoordConv [10] instead of conventional convolution in the encoding layers, which helps M-Net understand the distribution of positions between each landmark, enabling more accurate detection. Further, we let M-Net recognize the appearance characteristics of the vertebra by making M-Net predict part affinity fields (PAFs) when detecting landmarks. At this time, we utilized two M-Nets, which share a similar structure, for detecting landmarks of the lumbar vertebrae and S1 because of the different shapes between the S1 and lumbar vertebrae. As a result, the proposed network achieved 98.02% accuracy in center detection and an 8.34% relative error rate in landmark detection. Our contributions are summarized as follows:
• We propose a novel network to detect landmarks of the lumbar vertebrae and S1 specialized in X-ray images.
• The proposed network has a wide range of receptive fields, and it is capable of precise vertebral center localization.
• In X-ray images, we demonstrate that random spine cutout is more efficient than the conventional cutout.
• We show that using CoordConv helps the network to localize an invisible landmark by learning the location distribution of landmarks.
• Furthermore, we demonstrate that learning the morphological properties of the vertebra by additionally predicting PAFs is effective in improving spinal landmark detection's accuracy with a slightly greater computational cost.

Pose Estimation
Pose estimation is the task of localizing the joints of the body. DeepPose [11] was the first approach that utilized deep learning in pose estimation. DeepPose used a regressor that predicts the target joints through a sequential process. However, DeepPose did not consider the spatial connection between joints and performed pose estimation sequentially, which resulted in poor prediction accuracy and slow learning and inference. Tompson et al. [12] adopted a heat map (i.e., a confidence map of each joint) for pose estimation to solve the problem of inaccuracy in the high-precision region and made the estimation faster. Furthermore, their method prevents improper estimations by using the correlation between joints. However, it is limited to predicting only one person's pose in the image. Cao et al. [13] enabled multi-person pose estimation within an image in a bottom-up manner, which first predicts all joints in the image and then connects the joints of each person. To properly link the joints of each person based on the relationship between joints, Cao et al. predicted part affinity fields. Moreover, Cao et al. achieved real-time pose estimation through fast inference. HRNet [14] employed a top-down approach, first recognizing a person and then detecting all joints of that person. The top-down method is slower than the bottom-up method because it recognizes humans before estimating the pose, but it can estimate poses with higher accuracy. In addition, HRNet maintained the resolution of the input image, used multiple resolutions in parallel, and enabled detailed pose estimation with a high-resolution image. Because of these advantages, it showed good performance not only in pose estimation, but also in detection and segmentation [14].

Spinal Segmentation and Detection
In the lumbar X-ray images, various deep-learning-based segmentation approaches [3][4][5] for spinal diagnosis have been proposed. Cho et al. [3] presented a method for automatically calculating the lumbar lordosis angle by segmenting the sacrum and lumbar vertebrae using U-Net [15] and the DSC loss [16]. However, it was unsuccessful for occlusions in X-ray images since the segmentation process was carried out simply in the single stage. MBNet [4] recovered the lost information that occurred in the downsampling process during the upsampling process by utilizing a feature fusion module [17] with U-Net. Additionally, it improved the segmentation performance by predicting the parameters required for lumbar vertebra inspection. However, there is a problem that network learning is possible only when there is a ground truth for the parameters. Kim et al. [5] proposed a hierarchical segmentation network that detects the centers of lumbar vertebrae based on the confidence map frequently used in pose estimation [12][13][14], splits the vertebral region, and then performs segmentation and fine-tuning.
Along with many proposed segmentation methods, many landmark detection methods [1, 2,6,7,18] have also been proposed. Yi et al. [1] detected the spinal centers and then localized the landmarks by predicting an offset between the center and the landmarks. Yeh et al. [2] received the whole-spine local view radiographs and ensembled two models to detect all spinal landmarks. However, both Yi et al. and Yeh et al. performed detection in the downscaled full spinal image rather than in each vertebral image, making landmark localization difficult. Cina et al. [6] proposed a network that detects the landmarks of the lumbar vertebrae and nearby vertebrae (T9-12, S1) after cropping each vertebra based on the previously obtained rough landmarks. However, proper cropping of each vertebra is hard when roughly detected landmarks are not clearly localized. Zhang et al. [18] detected spinal landmarks and used part affinity fields to calculate a more accurate Cobb angle. Khanal et al. [7] split each vertebra image using Faster-RCNN [19] and then performed landmark detection. At this time, however, bottlenecks occurred during the region proposal process of Faster-RCNN, resulting in a delay in landmark detection.
Numerous methods using MRI [20,21] and CT [22] images have also been proposed in addition to those using X-ray images. DeepSPINE [21] detects the centers of the lumbar vertebrae and segments the lumbar vertebrae so that it can support the diagnosis of lumbar spinal stenosis. Furthermore, SpineOne [20] determines the centers of the lumbar vertebrae and the discs in MRI images and can assist in diagnosing degenerative discs and vertebrae. Payer et al. [22] presented a method that performs vertebral center detection and segmentation in CT images. However, MRI and CT images require much time and money to acquire. In addition, since the spine is relatively easier to observe in these modalities than in X-ray images, methods that assist X-ray image analysis are in greater need.

Method
In this section, we first explain the two main components of our network and then describe the learning strategy for our network. In summary, our network consists of the following components: (1) Pose-Net, which predicts the centers of the lumbar vertebrae and the upper end plate of S1 (first sacral vertebra), and (2) M-Net, which detects the landmarks of each vertebra. Throughout this paper, we refer to the upper end plate of S1 as the sacrum, and we refer to the lumbar vertebrae, from top to bottom, as L1 to L5.

Network Structure
It is highly challenging to accurately identify landmarks from X-ray images using a straightforward convolutional inference procedure since the X-ray image is generated in a 2-dimensional image, overlapping the tissues or shadows of numerous bones. Therefore, first, we chose the 2-stage method, which finds the centers of the lumbar vertebrae and sacrum first, then crops each vertebra based on the center and, finally, detects the landmarks for each cropped image. Figure 2 represents the overall flow of our proposed network. On a high level, it consists of pre-processing the input image before applying Pose-Net, after which we obtained the center coordinates of each vertebra using Pose-Net and post-processing. Then, we cropped each vertebra image from a zero-padded input image to feed as the input for M-Net to achieve landmark detection. These locally detected landmarks are then mapped to the input image using coordinate mapping.

Detecting Centers of the Lumbar Vertebrae and the Sacrum
Pre-Processing: To save computing costs, we resized the input images to 512 × 512 pixels. At this point, we applied zero padding to handle the multiple resolutions of the input images while ensuring that the aspect ratio remains the same. Additionally, all images were subjected to Gaussian blurring and contrast limited adaptive histogram equalization (CLAHE) [23] to minimize noise and to better distinguish the spine from the background.
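The geometry of the aspect-preserving resize with zero padding can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the choices to resize first and to center the content on the canvas are assumptions, since the text only states that zero padding preserves the aspect ratio.

```python
def letterbox_geometry(width, height, target=512):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving
    resize followed by zero padding onto a target x target canvas."""
    scale = target / max(width, height)  # the longer side becomes `target`
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Assumption: the resized image is centered; the remainder is zero padding.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def to_original(x, y, width, height, target=512):
    """Map a point from the padded canvas back to original image coordinates."""
    new_w, new_h, pad_left, pad_top = letterbox_geometry(width, height, target)
    return (x - pad_left) * width / new_w, (y - pad_top) * height / new_h
```

A helper such as `to_original` illustrates the kind of coordinate mapping needed to report predicted centers in the coordinates of the original radiograph.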
Detecting the center: In pose estimation, estimating the probability of the target joint location in the form of a confidence map [12][13][14] showed outstanding performance compared to utilizing a regressor [11]. Therefore, following Kim et al. [5], we adopted Pose-Net to detect the centers of the vertebrae based on the confidence map. We modified Pose-Net by expanding the receptive fields of the convolutional layers, increasing the size of the convolution filter from 7 × 7 to 13 × 13. Consequently, this decreased the observed outliers, since the increased visibility of nearby vertebrae helps the network when predicting the centers of the vertebrae. Furthermore, we also modified Pose-Net to detect the center of the sacrum. Kim et al. produced the center coordinates of the lumbar vertebrae by post-processing a generated confidence map of 1 channel. However, in our case, we detected not only the centers of the lumbar vertebrae, but also the center of the sacrum, and due to the proximity of S1 and L5, it is challenging to extract each center coordinate from a confidence map of 1 channel. Thus, by dividing the confidence maps $C \in \mathbb{R}^{6 \times 64 \times 64}$ for the centers of the lumbar vertebrae and the sacrum from 1 channel into 6 channels ($C_{i=1,\ldots,6}$), we made it simpler to extract each center coordinate. $C_{i=1,\ldots,6}$ represents the confidence maps of the centers of L1-L5 and the sacrum, in order. Additionally, we changed the vanilla convolutional blocks (convolution, normalization, and activation) of the previous Pose-Net to the pre-activation design (normalization, activation, and convolution), where we used instance normalization [24] for the normalization layer and ReLU for the activation layer. The detailed structure of Pose-Net is shown in Figure 3. All the abstracted parts are the same as in the structure of the previous Pose-Net.

(Figure 3 labels: input; pre-predicted center of L5, $\tilde{C}_5$; output, $C_{i=1,\ldots,6}$.)
Training procedures: To robustly train our Pose-Net with a limited dataset, we propose a random spine cutout augmentation technique, which randomly cuts the vertebra, to account for the frequent occlusion in X-ray images. Furthermore, we used two existing augmentation techniques. First, to deal with the changes in visual quality, such as brightness and contrast, we employed random brightness and contrast adjustment augmentations. Second, since each vertebra has a different appearance, we randomly applied rotation, scaling, and translation augmentations.
Cutout [8] is primarily used in classification tasks to enhance a classifier's performance by masking a portion of an image while training the classifier. However, this conventional cutout cuts the arbitrary region in the image, and it is unsuitable for use in the lumbar X-ray image, which has a large background area, rather than the spine. Therefore, we propose a new augmentation method named random spine cutout (RSC), which randomly cuts the lumbar vertebrae utilizing the conventional cutout.
In RSC, as shown in Figure 4, a vertebra is randomly selected from the lumbar vertebrae (L1-L4) and masked with zero values. We performed RSC only among the often occluded vertebrae (L1, L2, L3, L4), whose average pixel values inside the vertebra are low, as shown in Table 1. Additionally, if the whole vertebra region is cut, it might be infeasible to find the center of that vertebra. Thus, we randomly cut a region covering only 60% of the vertebra's width and height, which was an experimentally determined value.
Our Pose-Net pre-predicts a confidence map $\tilde{C}_5$ for the center of L5 in an internal layer of Pose-Net, similar to Kim et al. [5], and concatenates it with the feature of Pose-Net's internal layer to output the confidence maps $C_{i=1,\ldots,6}$ for the centers of the lumbar vertebrae and sacrum.
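The random spine cutout (RSC) step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name, the bounding-box input format `(x0, y0, x1, y1)`, and the uniformly random placement of the cut window inside the vertebra box are all hypothetical; only the restriction to L1-L4, the zero masking, and the 0.6 ratio come from the text.

```python
import random


def random_spine_cutout(image, vertebra_boxes, ratio=0.6, rng=random):
    """Zero out a window inside one randomly chosen vertebra box.

    image: list of rows (lists of pixel values); a modified copy is returned.
    vertebra_boxes: dict mapping a name ("L1".."L4") to (x0, y0, x1, y1),
        with x1 and y1 exclusive (hypothetical format, e.g. from annotations).
    """
    name = rng.choice(sorted(vertebra_boxes))
    x0, y0, x1, y1 = vertebra_boxes[name]
    cut_w = max(1, round((x1 - x0) * ratio))  # 60% of the vertebra width
    cut_h = max(1, round((y1 - y0) * ratio))  # 60% of the vertebra height
    # Assumption: the window position is uniform inside the vertebra box.
    left = rng.randint(x0, x1 - cut_w)
    top = rng.randint(y0, y1 - cut_h)
    out = [row[:] for row in image]  # leave the input image untouched
    for y in range(top, top + cut_h):
        for x in range(left, left + cut_w):
            out[y][x] = 0
    return out, name, (left, top, cut_w, cut_h)
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which can help when comparing cutout ratios.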
To train Pose-Net, we constructed the ground truth of the center confidence maps $\hat{C}_{i=1,\ldots,6}$ from the ground truth of the center coordinates $\hat{c}_{i=1,\ldots,6}$ of the lumbar vertebrae and sacrum. In this construction, $x$ represents a pixel position in $\hat{C}$ and $\sigma$ is given as 1/2 of the L5 height, which is the value at which Pose-Net performs best. The loss function $L_{Pose}$, computed from $\hat{C}_{i=1,\ldots,6}$, $\tilde{C}_5$, and $C_{i=1,\ldots,6}$, was used to train our Pose-Net.
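As an illustration of such a ground-truth confidence map, the sketch below builds a Gaussian-shaped map around a center. The exact functional form is an assumption: it uses $\exp(-\lVert x - \hat{c}_i \rVert^2 / \sigma^2)$, the form used for joint confidence maps by Cao et al. [13], and may differ from the normalization used in this work. The function names are hypothetical.

```python
import math


def gaussian_confidence_map(center, sigma, size=64):
    """Build a size x size map that peaks at `center` = (x, y).

    Assumption: value = exp(-squared_distance / sigma**2), as in Cao et al.
    """
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / sigma ** 2)
         for x in range(size)]
        for y in range(size)
    ]


def argmax2d(conf):
    """Return the (x, y) position of the largest value in a 2D map."""
    value, y, x = max(
        (v, y, x) for y, row in enumerate(conf) for x, v in enumerate(row)
    )
    return x, y
```

With this form, the map equals 1 at the annotated center and falls to $e^{-1}$ at a distance of $\sigma$, so tying $\sigma$ to the L5 height scales the target to the size of the vertebrae.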
Post-processing: As shown in Figure 5, when Pose-Net incorrectly predicts the center of the vertebra above L1 (T12) rather than L1 in channel $C_1$, where L1 should have been predicted, the successive target centers from $C_2$ to $C_4$ are also not properly detected. In such cases, if we only selected the maximum point of each confidence map ($max_1, max_2, max_3, max_4, max_5, max_6$) as the center of each vertebra, we would inaccurately extract the centers of some lumbar vertebrae, as shown in Figure 6a,c. To solve this problem, we used post-processing that takes advantage of the fact that confidence maps exist for the centers of all lumbar vertebrae and the sacrum. First, we computed the distances $d_{i=1,\ldots,5}$ between the maximum points of consecutive confidence maps to determine whether $C_{i=1,\ldots,6}$ corresponds to the case of Figure 6a,c. Each $d_i$ is calculated as $\lVert \max(C_i) - \max(C_{i+1}) \rVert_2$ using the maximum points of $C_i$ and $C_{i+1}$. Then, we calculated the average value $\mathrm{mean}(d_{\mathrm{except}(\max(d))})$ of $d_{i=1,\ldots,5}$, excluding the maximum value of $d_{i=1,\ldots,5}$. If $\max(d) > \mathrm{mean}(d_{\mathrm{except}(\max(d))}) \times 1.4$, it was determined that some of the center coordinates of the vertebrae were not extracted exactly, and post-processing was carried out. Here, $\mathrm{mean}(d_{\mathrm{except}(\max(d))}) \times 1.4$ was the maximum distance between two local maximum points of the center confidence map in our results. The post-processing is detailed in Algorithm 1 and Figure 7. In Algorithm 1, 0.4 in the expression $p_y > \max(C_{j+1}) - 0.4 \times d_{max}$ was an experimentally obtained value, which can properly derive the center confidence map of the vertebra whose center was not extracted. As a result, thanks to the post-processing, we can remove the T12 center point from the improperly extracted center points and properly extract the unextracted center from the channelwise summed confidence map.
Algorithm 1 Post-processing.
  Obtain $C_{sum}$ by channelwise summation of all confidence maps of centers
  Calculate the y-coordinate distance $d_{max}$ between the furthest maximum points ($\max(C_j)$ and $\max(C_{j+1})$)
  for each pixel $p$ of $C_{sum}$ ($p_y$: the y-coordinate value of each pixel) do
    if ... $p_y > \max(C_{j+1}) - 0.4 \times d_{max}$ then
      The value of $p$ ← 0
    end if
  end for
  Obtain a new maximum point $max_{new}$ in $C_{sum}$
  Remove the top of the maximum point from $\{max_1, \ldots, max_6\}$
  Insert $max_{new}$ between the maximum point of $C_j$ ($max_j$) and the maximum point of $C_{j+1}$ ($max_{j+1}$) in the maximum point set
  Obtain the final center locations $\{max_2, \ldots, max_j, max_{new}, max_{j+1}, \ldots, max_6\}$
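The post-processing above can be sketched in plain Python as follows. This is a hedged sketch rather than the authors' code: the function names are hypothetical, and the suppression band is an assumption. The text only states the lower condition $p_y > \max(C_{j+1}) - 0.4 \times d_{max}$, so the symmetric upper condition used here (zeroing pixels with $p_y < \max(C_j) + 0.4 \times d_{max}$) is a guess at the elided part.

```python
import math


def argmax2d(conf):
    """Return the (x, y) position of the largest value in a 2D map."""
    value, y, x = max(
        (v, y, x) for y, row in enumerate(conf) for x, v in enumerate(row)
    )
    return x, y


def largest_gap(points, factor=1.4):
    """Return (flagged, j): whether the largest distance between consecutive
    maxima exceeds `factor` times the mean of the others, and its index."""
    d = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    j = max(range(len(d)), key=d.__getitem__)
    rest = d[:j] + d[j + 1:]
    return d[j] > factor * (sum(rest) / len(rest)), j


def postprocess_centers(conf_maps, margin=0.4, factor=1.4):
    """conf_maps: six 2D maps ordered L1..L5, sacrum. Returns six (x, y)."""
    points = [argmax2d(c) for c in conf_maps]
    flagged, j = largest_gap(points, factor)
    if not flagged:
        return points
    height, width = len(conf_maps[0]), len(conf_maps[0][0])
    c_sum = [[sum(c[y][x] for c in conf_maps) for x in range(width)]
             for y in range(height)]
    y_top, y_bottom = points[j][1], points[j + 1][1]
    d_max = y_bottom - y_top
    for y in range(height):
        # Assumption: keep only a band between the two far-apart maxima.
        if y < y_top + margin * d_max or y > y_bottom - margin * d_max:
            c_sum[y] = [0.0] * width
    new_point = argmax2d(c_sum)
    # Drop the topmost maximum (e.g. T12) and insert the recovered center.
    return points[1:j + 1] + [new_point] + points[j + 1:]
```

The sketch relies on the missed vertebra still producing a weaker local response in some channel, which the channelwise sum exposes once the pixels near the two neighboring maxima are suppressed.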

Detecting Landmarks of the Lumbar Vertebrae and the Sacrum
Pre-processing: Using the center coordinates $c_{i=1,\ldots,6}$ of each vertebra determined by post-processing the Pose-Net results, we cropped each vertebral region from the zero-padded original image. The cropped images were centered on the center $c_i$ corresponding to each vertebra and were cropped into a square form by calculating the height $H$ and width $W$ using the y-axis distance between the center coordinates of the neighboring vertebrae, as given in Equation (5), where $(c_i)_y$ denotes the y-coordinate of $c_i$. Due to the extremely wide variance of the distance between the centers of the sacrum and L5, the $H$ and $W$ of the sacrum area were determined using the y-axis distance between the L4 and L5 centers. After cropping, we resized the cropped image to 256 × 256 pixels and applied Gaussian blurring so that it could be used as an input for M-Net.
Detecting landmarks: Based on U-Net [15], which is widely utilized in the segmentation of medical images, M-Net [9] can use a range of receptive fields through multi-scale inputs and shows superior performance compared to U-Net [5,9]. High segmentation performance implies strong spatial comprehension of a given object, which means that M-Net can also be expected to perform well in the task of detecting landmarks. Therefore, we utilized the M-Net structure for the landmark detection of vertebrae, and Figure 8 shows the detailed design. In Figure 8, we modified all convolution blocks to the same pre-activation design as our Pose-Net. During landmark detection, we used two M-Nets with identical structures, except for the last layer, because the appearances of the lumbar vertebrae and the sacrum differ considerably. In the case of the sacrum, which has two landmarks, M-Net outputs a 3-channel result that includes the confidence maps of the landmarks and the part affinity map.
The CoordConv block in Figure 8 is the convolution block that uses CoordConv [10] instead of the conventional convolution. CoordConv performs the convolution operation after simply concatenating, with the input feature, the coordinates of each position in the input feature normalized to values in (−1, 1). This enables the otherwise translation-invariant convolution process to use pixel location information, leading to substantial performance gains in tasks such as object detection [10] and segmentation [25]. Therefore, we exploited the fact that landmarks are similarly positioned in each cropped vertebra image, as shown in Figure 9, to allow M-Net with CoordConv to predict the exact location of a landmark based on the learned location distribution, even when the landmark is occluded.
Part affinity fields (PAFs) are a concept introduced in the pose estimation task [13], which enables the network to connect all the joints of a particular person in an image by recognizing the connectivity of the joints found during pose estimation. We made M-Net learn the morphological information of vertebrae by predicting the PAFs. When training M-Net to learn PAFs, we made it estimate the probability of a region containing a linking segment $\overline{L_t R_t}$ from the top-left to the top-right lumbar vertebral landmark and a linking segment $\overline{L_b R_b}$ from the bottom-left to the bottom-right lumbar vertebral landmark. Similarly, in the case of the sacrum, the PAF is defined by the line segment that connects the two landmarks of the sacrum. Learning PAFs allows M-Net to estimate the shape of the vertebrae from the learned morphological information and to recognize relations within PAFs (i.e., the landmarks of the lumbar vertebrae and the sacrum lie at the ends of the PAFs; in each lumbar vertebra, $\overline{L_t R_t}$ and $\overline{L_b R_b}$ are similarly far apart; and $\overline{L_t R_t}$ is generally parallel to $\overline{L_b R_b}$), so that it can detect the proper landmarks when occlusion occurs in the X-ray image. Furthermore, PAFs can be employed in place of landmarks through post-processing.
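Returning to the CoordConv block, the coordinate concatenation it performs before the convolution can be sketched as follows. This is a minimal illustration on nested lists, not the layer used in M-Net: the function name and the channel order (x channel first, then y channel) are assumptions, and the subsequent convolution is omitted.

```python
def add_coord_channels(feature):
    """Append two coordinate channels, normalized to [-1, 1], to a feature map.

    feature: list of C channels, each an H x W nested list.
    Returns a list of C + 2 channels (original channels, x coords, y coords).
    """
    height, width = len(feature[0]), len(feature[0][0])
    xs = [2.0 * x / (width - 1) - 1.0 if width > 1 else 0.0
          for x in range(width)]
    ys = [2.0 * y / (height - 1) - 1.0 if height > 1 else 0.0
          for y in range(height)]
    x_channel = [xs[:] for _ in range(height)]        # varies along columns
    y_channel = [[ys[y]] * width for y in range(height)]  # varies along rows
    return feature + [x_channel, y_channel]
```

Because every crop is centered on a vertebra, a fixed position in these coordinate channels corresponds to roughly the same anatomical location across crops, which is what lets a convolution on the augmented input learn where landmarks tend to lie.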
Training procedures: For effective M-Net learning, we generated the input images of M-Net by cropping the zero-padded original input image using the ground truth of the center coordinates, while, during inference, we used the center coordinates calculated through Pose-Net. In addition, we employed random scaling, rotation, brightness, and contrast adjustment augmentations, similar to the training of Pose-Net. Additionally, random translation augmentation was used to make M-Net robust to the difference between the center coordinates used in inference and those used in training.
To train M-Net ($M_L$), which detects the landmarks of the lumbar vertebrae, we constructed the ground truth of the confidence maps $\hat{C}^L_{i=1,2,3,4}$ of each landmark as given in Equation (6) using the four ground truth landmarks $\hat{l}^L_{i=1,2,3,4}$ of each vertebra. Similarly, to train M-Net ($M_S$), which detects the landmarks of the sacrum, we constructed the ground truth of the confidence maps $\hat{C}^S_{i=1,2}$ as given in Equation (7) using the two ground truth landmarks $\hat{l}^S_{i=1,2}$ of each sacrum, where $x$ represents a pixel position in $\hat{C}$. In Equation (6), $\sigma$ is given as 1/10 of the average length of the two diagonals connecting the landmarks of each vertebra, and in Equation (7), $\sigma$ is given as 1/6 of the distance between the two landmarks of each sacrum; both values were determined experimentally. Additionally, Equations (8) and (9) show the ground truth of the PAFs we used for training $M_L$, $\hat{P}^L \in \mathbb{R}^{W \times H}$, and for training $M_S$, $\hat{P}^S \in \mathbb{R}^{W \times H}$, respectively, where $\hat{P}(x)$ is the value at an arbitrary position $x$ in $\hat{P}$.
$M_L$ receives the vertebra image to estimate the PAFs $P^L$ and the confidence maps $C^L_{i=1,2,3,4}$ for the landmarks. Likewise, $M_S$ predicts the confidence maps $C^S_{i=1,2}$ and the PAFs $P^S$ for the landmarks of the sacrum. Then, through the loss functions $L_{M_L}$ and $L_{M_S}$, we trained $M_L$ and $M_S$, respectively.
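One way to picture a single-channel PAF ground truth of this kind is as a mask over the pixels close to the linking segments. The sketch below is an assumption-laden illustration, not the definition in Equations (8) and (9): the binary values, the `half_width` threshold, and the function names are all hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def paf_ground_truth(segments, width, height, half_width=2.0):
    """Return an H x W map that is 1.0 near any linking segment, else 0.0.

    segments: list of ((x, y), (x, y)) landmark pairs, e.g. the top and
        bottom end plates of a lumbar vertebra, or the single sacrum segment.
    Assumption: a hard threshold `half_width` defines the PAF region.
    """
    return [
        [1.0 if any(point_segment_distance((x, y), a, b) <= half_width
                    for a, b in segments) else 0.0
         for x in range(width)]
        for y in range(height)
    ]
```

For a lumbar vertebra, `segments` would hold the two end-plate segments; for the sacrum, only the one segment between its two landmarks.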

Experiments
This section defines the dataset used for our experiments and provides the experimental settings, along with the ablation studies on the proposed methods. Subsequently, the proposed network is compared to the previous work [5], which was modified to fit landmark detection.

Experimental Setup
We constructed a lateral view of the lumbar X-ray image dataset utilizing NHANES II's lumbar X-ray dataset [26] from the National Library of Medicine and the BUU Spine Dataset [27] from Burapha University. Our dataset consisted of 1524 images, from which we used 976 images for training, 244 for validation, and 304 for testing, where we converted all images to grayscale. Moreover, we used the test set to yield all experimental results.
For training Pose-Net, we used the Adam optimizer [28] with a learning rate of $1 \times 10^{-4}$ and a batch size of 16. Moreover, for stable learning, we used a learning rate scheduler that reduced the learning rate linearly. We trained for up to 200 epochs, but stopped early if the validation loss did not decrease for 30 epochs. For training M-Net, we used a batch size of 32. Similar to Pose-Net, we trained M-Net for up to 250 epochs and stopped early if the validation loss did not decrease for 25 epochs. All other settings were the same as for Pose-Net training. We used the PyTorch v1.11.0 framework and CUDA v11.3, with a single NVIDIA RTX 3090 device, for conducting all experiments.

Experiments of Center Detector
In this section, we generated all the experimental results by mapping the center coordinates created through Pose-Net to each input image ($\in \mathbb{R}^{512 \times 512}$) of Pose-Net. First, we performed an experiment on changing the kernel size of the convolution layer before applying random spine cutout augmentation. We increased the 7 × 7 convolution filter to 13 × 13 in the existing Pose-Net structure [5] to reduce the imprecise detection of vertebral centers by making the convolution filter scan more widely, including nearby vertebrae. Therefore, we performed a quantitative and qualitative comparison of the effect of changing the size of the convolution filter on detecting the centers of the lumbar vertebrae and sacrum. In all quantitative comparisons of center detection, the outlier ratio refers to the proportion of images in which some landmarks are not included in the image after pre-processing (i.e., cropping) using the center coordinates generated by Pose-Net. The values in Tables 2-4 represent the average pixel distance errors (and standard deviations) between the predicted center coordinates and the ground truth center coordinates. In addition, (Inlier) in the first column of each table denotes the results excluding outlier cases, while (All) denotes all results including outlier cases.
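The evaluation quantities described above can be sketched as follows. This is an illustrative sketch, not the authors' evaluation script: the function names and input layout are hypothetical, the outlier flags are assumed to be computed beforehand from the cropping check, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def center_errors(predicted, ground_truth):
    """Pixel distance between each predicted and ground truth center."""
    return [math.dist(p, t) for p, t in zip(predicted, ground_truth)]


def summarize(per_image_errors, outlier_flags):
    """Aggregate errors as (mean, std) for All and Inlier, plus outlier ratio.

    per_image_errors: one list of center distance errors per image.
    outlier_flags: one bool per image, True if cropping with the predicted
        centers leaves some landmark outside the crop.
    """
    all_errors = [e for errors in per_image_errors for e in errors]
    inlier_errors = [
        e
        for errors, is_outlier in zip(per_image_errors, outlier_flags)
        if not is_outlier
        for e in errors
    ]
    return {
        "all": (statistics.mean(all_errors), statistics.pstdev(all_errors)),
        "inlier": (statistics.mean(inlier_errors),
                   statistics.pstdev(inlier_errors)),
        "outlier_ratio": sum(outlier_flags) / len(outlier_flags),
    }
```

Reporting both rows separates two failure modes: the (Inlier) row reflects localization precision on usable crops, while the outlier ratio reflects how often a crop is unusable for the second stage.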
Kernel size: Table 2 shows the center distance error according to the kernel size. In Table 2, comparing the 7 × 7 and 13 × 13 kernel sizes of Pose-Net, the average distance error of the latter is slightly higher than that of the former, but the outlier ratio of the latter is lower by 1%. In this work, we used Pose-Net to locate rough areas for cropping the target vertebra by detecting its center. Accordingly, to ensure a lower outlier ratio rather than a lower distance error on average, we adjusted the kernel size from 7 × 7 to 13 × 13.
Figure 10a shows an outlier result that occurred by incorrectly predicting the center of L2 in the channel as the center of L3. Furthermore, in Figure 10c, Pose-Net incorrectly predicted the middle of L3 and L4 as the center of L3 by recognizing L3 and L4, which were deformed due to compression fractures, as one vertebra. On the other hand, Figure 10b,d show that Pose-Net using the modified kernel size rather than 7 × 7 accurately predicted the center of each vertebra. Here, the resolution of the Pose-Net output was 64 × 64 pixels, and the average distance between the ground truth center coordinates of adjacent vertebrae was around 7 pixels. Therefore, when the kernel size is 7 × 7, the convolution filter barely sees the nearby vertebrae when scanning the center of each vertebra, and the ability to infer centers from surrounding information is insufficient. Consequently, for the following experiments, we increased the kernel size in Pose-Net from 7 × 7 to 13 × 13, which is around twice 7 × 7, enabling most of the neighboring vertebrae to be seen and allowing better center detection performance even if the shape of the vertebra was deformed.

Random spine cutout (RSC): Before verifying the effectiveness of RSC, we compared the performance of Pose-Net across different sizes of the RSC region, because RSC may have a more significant effect when the size of the RSC region is more similar to that of an actually occluded vertebra. Specifically, we compared the performance of Pose-Net when the width and height of the RSC region were 1.0, 0.8, 0.6, and 0.4 times the width and height of each vertebra, respectively.
In Table 3, the distance error is typically low when the RSC region is large. Cutting most of the vertebra area made the training of Pose-Net stricter by increasing the loss value, which led Pose-Net to detect the centers more precisely; however, black background pixels might then be mistaken for a masked vertebra, increasing the outlier ratio. When the RSC region was too small (e.g., a ratio of 0.4), both the distance errors and the outlier ratio increased. Therefore, we selected a ratio of 0.6, which had a relatively low distance error and the lowest outlier ratio.
We quantitatively and qualitatively compared the performance of the conventional cutout [8] and RSC. When using the conventional cutout, we set the cutout region to a size similar to that of the RSC region. Table 4 shows the benefit of RSC. In Table 4, training Pose-Net with either RSC or the conventional cutout reduced the distance error. However, unlike RSC, the conventional cutout alone produced a higher outlier ratio than training without any cutout. Since occlusion is a common feature of X-ray images, both RSC and the conventional cutout were effective in reducing the distance errors. However, our RSC on the lumbar vertebrae (L1-L4) was more effective, drastically decreasing both the distance error and the outlier ratio.
However, about 2% of outliers still existed even with RSC. These are cases in which it is difficult for both Pose-Net and radiologists to localize the vertebrae precisely, as shown in Figure 11. In Figure 11a, Pose-Net predicted L5 as the sacrum (S1), which made all predictions incorrect. In this case, because L5 can appear as S1 due to lumbosacral transitional vertebrae [29], radiologists also cannot easily differentiate one from the other. Similarly, in Figure 11b, Pose-Net predicted S1 as L5, because S1 is difficult to distinguish from L5 in the X-ray image.
We also compared RSC and the conventional cutout qualitatively in Figure 12. As illustrated in the first row of Figure 12, the centers of L3 and L4 were localized more precisely when either RSC or the conventional cutout was employed than without cutout augmentation. Where the center area was occluded, as for L1, the predicted center of L1 was skewed to the left when using the conventional cutout. In contrast, with RSC, the center of L1 was accurately predicted. Moreover, only RSC enabled Pose-Net to accurately detect L1 to L3 in sequence, even when it was difficult to visually distinguish the boundary between L2 and L3, as shown in the second row of Figure 12. Therefore, RSC is superior to the conventional cutout for lumbar X-ray images: by cutting only the lumbar vertebrae, it makes Pose-Net more robust in circumstances where the vertebrae are not clearly visible.
Figure 12. Results of the ablation study with the conventional cutout and RSC (panels: w/o RSC and cutout, w/ RSC, w/ cutout). The red circles denote the predicted centers of L1, L2, L3, L4, L5, and the sacrum in order from the top, and the blue circles and labels denote the ground truth of the centers.

Experiments of Landmark Detector
In this section, we validated whether learning location and morphological information through CoordConv [10] and PAFs helped M-Net predict landmarks accurately. We obtained all landmark detection results by mapping M-Net's inference results back to the original input image. In all quantitative comparisons of landmark detection, we excluded the outlier results of Pose-Net and, in addition to the pixel distance error (D), used a relative distance error (RD) to evaluate the prediction accuracy relative to the vertebra size. The relative distance error of the lumbar vertebrae is formulated by Equations (12)-(14), where gt is the ground truth of the target landmark, pred is the detected position of the corresponding landmark, pred_x is the x-coordinate of pred, and h and v are the lengths of the horizontal and vertical lines that connect gt to its neighboring landmarks, respectively. For the sacrum, it is formulated by Equation (15), where l is the length of the sacrum.
$$\text{relative distance error}(\%) = 100 \times \sqrt{r_x^2 + r_y^2} \tag{14}$$
$$\text{relative distance error}(\%) = 100 \times \frac{\lVert \text{pred} - \text{gt} \rVert_2}{l} \tag{15}$$
CoordConv: We utilized CoordConv in M-Net's encoding layers and analyzed its effectiveness along with the optimal number of CoordConv layers. Table 5 shows that, compared with using no CoordConv, utilizing one CoordConv in the encoding layer of M-Net (CC(1)) significantly increased the detection accuracy of the sacrum landmarks while decreasing the detection accuracy of the lumbar vertebrae. When utilizing CoordConv in two encoding layers (CC(2)), the detection accuracy of both the lumbar vertebrae and the sacrum was higher than without CoordConv. Thus, CC(1) is practical only for the sacrum, which has few landmarks, whereas CC(2) is beneficial to both the sacrum and the lumbar vertebrae. Consequently, in subsequent experiments, we employed CC(2).
Table 5. Ablation study of CoordConv (CC). CC(1) means using CoordConv in the first encoding layer of M-Net, and CC(2) means using CoordConv in the first two encoding layers, as shown in Figure 8. The lowest value for each column in distance error (D) is marked in bold and underlined in relative distance error (RD).
Figure 13 shows the effect of CoordConv. The boundary between L5 and the sacrum is hard to identify in the first row of Figure 13, and without CoordConv, M-Net incorrectly predicted the top-left landmark of L5 as the left landmark of the sacrum. Furthermore, when the boundaries of L1 and L5 are invisible, as in Rows 2 and 3 of Figure 13, respectively, M-Net without CoordConv predicted incorrect landmark locations because it relies solely on visual information. With CoordConv, however, M-Net can use not only visual data, but also location data to predict the landmarks. Consequently, utilizing CoordConv allows the network to learn the location distribution and detect landmarks accurately even when they are invisible.
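To make concrete how CoordConv supplies location data, the sketch below appends normalized x and y coordinate channels to a feature map before it would be passed to a convolution. It follows the general CoordConv idea of [10]; the [-1, 1] scaling, the channel order, and the function name are assumptions and are not taken from M-Net.

```python
import numpy as np


def add_coord_channels(features):
    """Append x and y coordinate channels to a (C, H, W) feature map.

    Returns an array of shape (C + 2, H, W). The coordinates are scaled
    to [-1, 1] (assumed convention); a regular convolution applied to the
    result can then condition on absolute position.
    """
    features = np.asarray(features, dtype=np.float32)
    _, height, width = features.shape
    ys = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    xs = np.linspace(-1.0, 1.0, width, dtype=np.float32)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")  # both of shape (H, W)
    return np.concatenate([features, xx[None], yy[None]], axis=0)


feature_map = np.zeros((3, 4, 5), dtype=np.float32)
with_coords = add_coord_channels(feature_map)
print(with_coords.shape)
```

Because the extra channels encode where each pixel lies in the crop, a filter can learn that, for instance, a certain corner tends to appear in a certain region, even when the local appearance is uninformative.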
Figure 13. Results of landmark detection depending on CoordConv (panels: w/o CoordConv, w/ CoordConv). The blue circles represent the ground truth, while yellow circles with labels represent predicted results and the green lines represent the line connecting the predicted landmarks of an endplate.

Part affinity fields (PAFs):
We let M-Net predict PAFs to learn appearance information about the vertebrae and examined, quantitatively and qualitatively, whether this benefits landmark detection. Furthermore, we conducted an ablation study to determine which PAF width is suitable for the ground truth.
We measured the performance of M-Net for different ground truth widths of PAFs, as depicted in Figure 14 (Table 6). When the PAFs were thin, such as 2 pixels wide, there was no significant performance improvement, because the PAF loss value was low and M-Net could not adequately learn the morphological information of the vertebrae. However, when the PAFs were thick, such as 6 pixels wide, the PAF loss value became large, and M-Net was trained toward PAF prediction rather than landmark detection, which lowered performance. Therefore, we selected PAFs with a width of 4 pixels as the ground truth, which improved the overall performance significantly. Comparing the performance with and without PAFs demonstrated that predicting PAFs improved the accuracy of landmark detection.
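To illustrate what the width parameter controls, the following sketch builds a ground truth field for a single segment between two landmarks using the widely used vector-field formulation of part affinity fields. It is an assumption-laden illustration: the exact encoding used for M-Net may differ (for example, in the number of channels), and `paf_ground_truth` and the distance threshold of half the width are hypothetical choices.

```python
import numpy as np


def paf_ground_truth(p_start, p_end, shape, width=4.0):
    """Unit-vector field along the segment from p_start to p_end.

    p_start, p_end: (x, y) landmark coordinates.
    shape: (H, W) of the output map.
    width: pixels within width / 2 of the segment are set (assumed rule).
    Returns an array of shape (2, H, W) holding the x and y components.
    """
    height, wid = shape
    start = np.asarray(p_start, dtype=np.float64)
    end = np.asarray(p_end, dtype=np.float64)
    length = np.linalg.norm(end - start)
    field = np.zeros((2, height, wid), dtype=np.float64)
    if length == 0:
        return field

    unit = (end - start) / length
    ys, xs = np.mgrid[0:height, 0:wid]
    dx, dy = xs - start[0], ys - start[1]
    along = dx * unit[0] + dy * unit[1]           # projection onto the segment
    across = np.abs(dx * unit[1] - dy * unit[0])  # perpendicular distance
    mask = (along >= 0) & (along <= length) & (across <= width / 2.0)
    field[0][mask] = unit[0]
    field[1][mask] = unit[1]
    return field


for w in (2, 4, 6):
    field = paf_ground_truth((2, 5), (8, 5), (12, 12), width=w)
    print(w, int(np.any(field != 0, axis=0).sum()), "nonzero pixels")
```

A wider field has more nonzero pixels and therefore contributes more to a pixel-wise loss, which is consistent with the trade-off between PAF prediction and landmark detection described above.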

Figure 14. Example of the ground truth of PAFs by width: (a) 2 pixels, (b) 4 pixels, and (c) 6 pixels.
Figure 15 shows the effect of predicting PAFs. Without PAFs, M-Net makes an anomalous prediction when L1 is blackened, as shown in the first row of Figure 15. Moreover, in the second row of Figure 15, the lengths of L1a and L1b are wildly different, and the two lines are not parallel. Such predictions might cause a misdiagnosis of spine alignment. However, when utilizing PAFs, M-Net predicts more accurately based on the learned vertebral shape (i.e., L1a and L1b have similar lengths and are parallel). Without PAFs, M-Net relies only on the visual content of the input image, so it is hard to detect landmarks precisely when the corner points of the vertebra are not distinct. When M-Net predicts PAFs, accurate detection becomes possible because the morphological features used to predict PAFs are also available to the landmark detection process.

CoordConv and PAFs:
We analyzed, both quantitatively and qualitatively, whether M-Net takes full advantage of both CoordConv and PAFs. Table 7 shows the landmark detection errors with and without each of them.
Table 7. Ablation study of CoordConv (CC) and PAFs. The lowest value for each column in distance error (D) is marked in bold and underlined in relative distance error (RD).
In Table 7, we observe that the accuracy was notably high for the vertebrae located in the upper region of the lumbar vertebrae (L1 and L3) when using PAFs and for the lower half of the lumbar vertebrae (L4 and L5) when using CoordConv. PAFs, which are based on morphological information, outperformed CoordConv for the upper section of the lumbar vertebrae, where the landmark locations differ substantially from body to body. In contrast, CoordConv outperformed PAFs for the lower lumbar area, where the landmarks are located at similar positions for all bodies. Employing both CoordConv and PAFs improved performance compared to using CoordConv alone for the upper section of the lumbar vertebrae and compared to using PAFs alone for the lower section. Furthermore, it showed the lowest total average distance error. Consequently, M-Net detected the landmarks more accurately by utilizing the benefits of both CoordConv and PAFs.
The effect of employing both CoordConv and PAFs can be seen more intuitively in Figure 16. In the first column, the spine area is very faint, and the left and right sides of L3 are asymmetric with an abnormal shape. In this case, there was no significant performance improvement when using either CoordConv or PAFs alone. When using only CoordConv, L3b was detected more accurately, but L2a, L2b, and L3a were still incorrectly detected. Due to the abnormal shape of L3, the locations of the landmarks were unusual, so position information alone was ineffective for landmark detection. When using PAFs alone, the prediction of L2a worsened, and L2b, L3a, and L3b were still not properly detected. Again, due to the abnormal shape of L3, the PAFs are challenging to predict, and inaccurately predicted PAFs negatively affect the prediction of the L3 landmarks. In the second column, L2a was predicted closer to the ground truth when using only CoordConv than when using neither CoordConv nor PAFs, but the prediction of L2b did not improve. When only PAFs were employed, the PAFs of L2 were incorrectly predicted, resulting in worse detection results than those without PAFs, as shown in the purple box of the third column. However, when utilizing both CoordConv and PAFs, L2b was detected more accurately, because M-Net recognized the morphological correlation between L2a and L2b based on the position information. Consequently, when employing both CoordConv and PAFs, all landmarks were predicted well, since PAFs can be predicted more precisely with CoordConv. Therefore, the problem of incorrectly predicted PAFs can be overcome by using CoordConv, and landmarks were precisely detected by combining the benefits of both methods.
We also compared the inference times of the methods. Table 8 shows the average time for detecting all landmarks in each test set image. When using only PAFs, the inference time increased by only 0.2%, because M-Net only needed to produce one additional output channel for predicting PAFs.
Furthermore, when employing PAFs, the inference time rose only slightly, while the landmark detection accuracy was greatly enhanced, as seen in Table 7. Employing CoordConv was also effective for detecting landmarks, with an average inference time of 21.87 ms, which is not a problem in practice.
Finally, we show how well our network performed using a box plot and an outlier ratio table based on relative distance errors. We define a case as an outlier when the average relative distance error of all landmarks in a vertebra is higher than 20%, which is farther than the typical distance between neighboring landmarks of adjacent vertebrae (e.g., the distance between the left bottom landmark of L1 and the left top landmark of L2). Table 9 shows the average outlier ratio of each vertebra. In Table 9, the outlier ratios are low for L2-L5, but very high for L1 and the sacrum. L1 is frequently occluded by thoracic structures or deformed by a severe compression fracture, which results in a higher outlier ratio than for the other lumbar vertebrae. L5 and S1 are easily confused with each other when lumbosacral transitional vertebrae exist. Accordingly, the outlier ratio of the sacrum landmarks is exceptionally high, and even radiologists cannot easily find the exact position of the sacrum.
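The evaluation rule above can be sketched in a few lines. Equations (14) and (15) are implemented as stated; the definitions of r_x and r_y (the x and y offsets between pred and gt divided by h and v, respectively) are an assumption, since Equations (12) and (13) are not reproduced in this section, and all function names are hypothetical.

```python
import math


def relative_distance_lumbar(pred, gt, h, v):
    """Relative distance error (%) of a lumbar landmark, as in Equation (14).

    Assumes r_x = (pred_x - gt_x) / h and r_y = (pred_y - gt_y) / v.
    """
    r_x = (pred[0] - gt[0]) / h
    r_y = (pred[1] - gt[1]) / v
    return 100.0 * math.hypot(r_x, r_y)


def relative_distance_sacrum(pred, gt, l):
    """Relative distance error (%) of a sacrum landmark, as in Equation (15)."""
    return 100.0 * math.dist(pred, gt) / l


def is_outlier(rd_values, threshold=20.0):
    """True if the mean relative distance error of a vertebra exceeds 20%."""
    return sum(rd_values) / len(rd_values) > threshold


corners = [relative_distance_lumbar((13, 24), (10, 20), h=30, v=40)] * 4
print(round(corners[0], 2), is_outlier(corners))
```

Normalizing by h, v, or l makes the error comparable across vertebrae and patients of different sizes, which a raw pixel distance does not.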
Several of these outlier cases are shown in Figure 17. In Figure 17a, M-Net detected some T12 landmarks as L1 landmarks because the L1 compression fracture is so severe that the vertebra does not have a normal shape. In Figure 17b, M-Net incorrectly predicted the sacrum because the L5 and sacrum areas are not clearly visible. Despite many such challenging cases, our network achieved an average landmark detection accuracy of 98.38%. The relative distance error box plot for each vertebra, excluding outliers, is presented in Figure 18. This shows that our network had a low average relative distance error across every vertebra.

Comparison to the Previous Work
We compared the performance of our method with that of Kim et al. [5], the previous method, quantitatively and qualitatively. For a fair comparison, the network of Kim et al. was modified from a segmentation task to a landmark detection task, and we considered only the results for the lumbar vertebrae, since Kim et al. reported no result for the sacrum.
Center detector: Table 10 shows that our overall center detection ability was superior to that of Kim et al., particularly in terms of the outlier ratio and the distance error of L1. When localizing the center of each lumbar vertebra, Kim et al. generated a one-channel center confidence map and extracted the center coordinates of the lumbar vertebrae from it. For L1, the confidence score was low when the vertebra was obscured by thoracic structures, as shown in the Discussion Section of [5], making it challenging to extract the local maximum coordinates through the post-processing of Kim et al. Moreover, Kim et al. eliminated the confidence map when the confidence score was too low, and because of this elimination, there were many outlier cases in which the center coordinates of L1 were not extracted, as shown in Figure 19. In addition, the predicted locations of the centers were slightly different from the ground truth. In contrast, our method detects the centers more accurately by using a wide convolution filter and RSC. Furthermore, even when some of the target center coordinates were not extracted, we used the multi-channel confidence map, including the channels with low confidence scores, during the post-processing step to ensure that all center coordinates were extracted.
Table 10. Comparison of center detection performance between ours and Kim et al. [5]. The value having the lowest distance error for each column in (Inlier) is marked in bold and underlined for (All).
Figure 19. Results of center detection. Our results are (a,c), and the results of Kim et al. [5] are (b,d). The red circles denote the predicted centers of L1, L2, L3, L4, L5, and the sacrum in order from the top, and the blue circles and labels denote the ground truth of the center of each vertebra.
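One simple way to guarantee that every center is extracted from a multi-channel confidence map is to take the maximum of each channel, regardless of how low its score is. The sketch below illustrates this idea only; it is not the post-processing actually used in Pose-Net, whose details are not given in this section, and `extract_centers` is a hypothetical name.

```python
import numpy as np


def extract_centers(conf_maps):
    """Return one (x, y, score) tuple per channel of a (K, H, W) array.

    Each channel is assumed to correspond to one vertebra. No confidence
    threshold is applied, so low-scoring channels still yield a center.
    """
    centers = []
    for channel in conf_maps:
        flat_index = int(np.argmax(channel))
        y, x = np.unravel_index(flat_index, channel.shape)
        centers.append((int(x), int(y), float(channel[y, x])))
    return centers


maps = np.zeros((2, 64, 64))
maps[0, 10, 20] = 0.9   # confident channel
maps[1, 30, 40] = 0.05  # weak channel, e.g., an occluded vertebra
print(extract_centers(maps))
```

With a single shared channel and a confidence threshold, the weak peak in this example would be discarded, which corresponds to the failure mode for L1 described above.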
Landmark detector: As shown in Table 11, our method outperformed Kim et al. in landmark detection for every vertebra except L3. In particular, our network showed lower distance errors of 16.3259 pixels and 10.3488 pixels, respectively, for L1 and L2 compared to Kim et al. When landmarks were barely visible, such as L5 in the first row and L1-L2 in the second row of Figure 20, our method performed much better than Kim et al., which detected landmarks solely based on visual information, because it predicted landmarks using location and morphological information through CoordConv and PAFs. Furthermore, Kim et al. often incorrectly predicted vertebral landmarks even when the landmarks were only slightly occluded, as in the third and fourth rows. As a result, the standard deviation of its distance errors was very high, as shown in Table 11, indicating that its detection was not reliable. In contrast, thanks to CoordConv and PAFs, our method achieved better landmark detection accuracy and reliability.

Figure 20. Example of landmark detection results of ours and Kim et al. [5] (panels: Ours, Kim et al.). The blue circles represent the ground truth, while yellow circles with labels represent predicted results, and the green lines represent the line connecting the predicted landmarks of an endplate.
Table 11. Comparison of landmark detection performance between our method and Kim et al. [5]. The lowest value for each column in distance error (D) is marked in bold and underlined in relative distance error (RD).

Figure 21. Loss graphs of ours and Kim et al. [5]. The first row is a loss graph of M-Net predicting the landmark of the lumbar vertebrae, and the second row is a loss graph of M-Net predicting the landmark of the sacrum.

Conclusions
In this paper, we presented a novel two-stage network for detecting the landmarks of the lumbar vertebrae and sacrum on X-ray images. The proposed network detected landmarks from the vertebral images, which were cropped from the zero-padded input image using the detected center of each vertebra. In the center detection process, we expanded the receptive fields of the network to perform more accurate detection. Moreover, our proposed random spine cutout augmentation technique made the network perform detection more robustly on X-ray images, reflecting the properties of X-rays, which are often partially obscured. Additionally, we used CoordConv and part affinity fields to improve the accuracy of landmark detection by learning the distribution of landmark positions and the structural features of the vertebrae.
Our experiments showed that random spine cutout, which directly cuts a random vertebra, was more effective than the conventional cutout in increasing the center detection accuracy of each vertebra in lumbar X-ray images. They also demonstrated that learning location information through CoordConv helped to detect occluded landmarks, and that predicting PAFs enhanced the detection performance by recognizing the shape features of the vertebrae. Finally, the landmark detection accuracy of our proposed network was 98.38%.
However, there were some failure cases caused by severe compression fractures. These might be addressed by an image transform augmentation technique that makes a normal vertebra look like one with a compression fracture, which remains as future work.
The proposed network can be utilized to diagnose spondylolisthesis at the L5-S1 level since it detects landmarks on the upper end plate of the S1, in addition to quantifying spinal alignment, scoliosis, the compression factor, and degenerative change.