Accurate Landmark Localization for Medical Images Using Perturbations

Recently, various studies have been proposed to learn rich representations of images during deep learning. In particular, the perturbation method is a simple way to learn rich representations that has shown significant success. In this study, we present effective perturbation approaches for medical landmark localization. To this end, we report an extensive experiment using the perturbation methods of erasing, smoothing, binarization, and edge detection. The hand X-ray dataset and the ISBI 2015 cephalometric dataset are used to evaluate the perturbation effect. The experimental results show that the perturbation method forces the network to extract richer representations of an image, leading to performance increases. Moreover, in comparison with existing methods, and without any complex algorithmic change to the network, specific perturbation methods achieve superior performance.


Introduction
Landmark localization has been widely used in diverse applications. An early application can be found in face recognition [1]. Thanks to the success in facial analysis, its application has expanded to medical tasks such as the analysis of fractures of the humerus [2], precise navigation in magnetic resonance imaging (MRI) scans during neurosurgery [3], and aortic valve localization for pre-operative surgical planning [4]. Other important medical fields include landmark localization for hand X-ray images and cephalometric images. In hand X-ray images, doctors estimate bone age to diagnose growth abnormalities and endocrine disorders [5]. For bone age estimation, they usually use a method such as TW3 [6] that requires accurate landmark localization. Cephalometric landmark localization is an essential task for orthodontic analysis and treatment planning. Based on the geometric features of distances and angles between specific landmarks, dentists create treatment plans. Landmark localization by humans has several drawbacks. First, it is a laborious, time-demanding process. Second, there are high intra-person and inter-person variations [7]. Recently, to solve these problems, research to automate the landmark localization process with deep learning has been actively conducted.
In the early days, landmark localization was studied using probabilistic models, such as the active appearance model (AAM). However, deep learning methods achieved much higher landmark localization accuracy than the classical probabilistic models [8,9]. We introduce two recent papers that achieved state-of-the-art accuracy using deep learning. Payer et al. used two cascaded neural networks [10]. The first network extracts the candidate region of landmarks, whereas the second network determines the exact location of landmarks. The results of the two networks are combined to decide the final landmark locations. In contrast, Oh et al. used a single neural network working entirely in an end-to-end manner [11]. The authors also improved the accuracy by applying the smoothing perturbation during training.

Related Works
In the first part, we review the efforts that have been developed in the deep learning area to obtain richer representations. The second part gives an overview of landmark localization techniques for various images, such as hand X-ray, cephalometric, and face.

Rich Representation Learning
It is well known that the generalization ability of deep learning originates from the rich representations extracted through numerous hierarchical layers in the network [12]. Many studies have attempted to improve the generalization ability of the network by developing a better network structure. Network in network, inception, and residual networks are typical of this approach [15–17].
Recent studies have proposed a new approach that forces the learner to solve an additional task. A denoising convolutional auto-encoder (CAE) has been widely employed [18]. It forces the network to learn representations through the task of eliminating noise that was artificially added on purpose. In recent years, there have been attempts to make models understand the context of the entire image, to extract rich representations for semantic segmentation or object localization problems [19–21]. These methods force networks to seek global representations, rather than focusing on the most discriminative local part. To this end, they randomly hide important image patches, and then train the deep learning model on the perturbed images to minimize classification error. Some studies [22,23] have employed an inpainting-based self-supervised learning strategy to learn rich representations. This strategy randomly eliminates image patches and minimizes pixel-level reconstruction error. Refs. [24,25] used image patch-wise reassembly and matching tasks: they divided an image into several patches and solved the problem of aligning the shuffled patches. The above-mentioned papers proved that richer representations improve the performance of image recognition tasks.
We note that similar approaches can be applied to the landmark localization to improve the accuracy. However, application to landmark localization can only be found in one paper. Oh et al. applied the smoothing perturbation to a cephalometric image during the learning process of landmark localization [11]. The paper showed that the perturbation improved the accuracy of landmark localization using the ISBI dataset.

Landmark Localization
Landmark localization was first developed to analyze face images, and was then expanded to medical images, such as hand X-ray and cephalometric images. Ref. [1] provides an excellent review of face landmark localization. Our review is focused on medical images.
As for classical approaches to medical landmark detection, random forest-based methods [15,26] have been proposed. They employed a regression-voting strategy and showed outstanding performance. Region proposal approaches [12,15,27–29] have applied a cascade process that consists of region proposal detection and landmark point localization. Although these approaches have shown success for the medical landmark localization problem, the computational complexity is increased, and an architecture modification is sometimes required. Moreover, the hyper-parameters of such complex architectures make the problem heuristic. In recent years, the deep learning-based end-to-end learning strategy [11], which employs the whole image as input, has been outperforming region proposal approaches. This approach incorporates region proposal and landmark point detection without feature engineering based on human knowledge, which has led to outstanding performance. When considering output designs, the approaches can be divided into coordinate regression [29–31] and heatmap regression [10,11,32–35]. A coordinate regression method returns the x and y coordinates, i.e., one pixel per landmark, using a CNN with fully connected layers. In contrast, heatmap regression approaches estimate a heatmap for each landmark, centered on the landmark. Typically, a Gaussian distribution is used to draw the heatmap shape, with the ground-truth landmark point as the centroid of the Gaussian. Recently, heatmap regression approaches have outperformed coordinate regression methods, and our method is also based on heatmap regression. Figure 1 illustrates the overall framework of the proposed model. The input image is perturbed by randomly sampling a patch, applying an operation on the patch, and feeding the resulting image into the network. The landmark points are converted into heatmaps.
The U-net is trained using the perturbed image as input, and the heatmaps as output.

Materials and Methods

Datasets
Our model was tested with the public hand radiograph dataset in Table 1, the Digital Hand Atlas. To compare with the state-of-the-art [10], we followed the same process as [10,26,33]. We used 895 images of the dataset, with an average size of 1563 × 2169 pixels and 37 landmarks. To evaluate the network, we assumed a wrist width of 50 mm determined by two of the annotated landmarks at the wrist, and performed a three-fold cross validation setup by splitting the 895 images into approximately 600 training images and 300 test images per fold. In addition, to show that our model works well for landmark localization on X-ray images, we used the cephalometric X-ray data in Table 1 from the IEEE International Symposium on Biomedical Imaging 2015 challenge (ISBI 2015) [36]. This dataset consists of 400 cephalometric X-ray images of 400 patients with 19 landmarks. Two experienced medical doctors produced the landmarks, and their average point was used as the ground truth (GT). The resolution of the images was 1935 × 2400, with a pixel size of 0.1 mm × 0.1 mm. The dataset was divided into 150 training, 150 test1 (validation), and 100 test2 (on-site competition) images. We trained the network for 2000 epochs using the training data.

Overall Algorithms
Algorithm 1 illustrates a meta-algorithm for the proposed landmark localization method. The label L_i contains the information about k landmark positions, in the form ((x_1, y_1), (x_2, y_2), . . . , (x_k, y_k)). A hand X-ray image has 37 landmarks, so k = 37. A cephalometric image has 19 landmarks, so k = 19. Excluding line 4, Algorithm 1 is the same as the standard BP (back-propagation) learning algorithm. The output of Algorithm 1 is the learned neural network weights.

Algorithm 1. Network training with local feature perturbation
Input: Train set I_i (image) and L_i (label set) for i = 1, 2, . . . , n; perturbation operation q.
Output: Learned network weights.
1: Initialize network weights.
2: While network is not converged
3:   For i = 1 to n
4:     I_i = random perturbation of I_i using Algorithm 2 with the operation q.
5:     I_i = random augmentation of I_i.
6:     Perform forward computation with I_i and compute error e_i.
7:     Back-propagate e_i and update weights.

Line 4 deserves attention. The perturbation can be any operation that reduces the information of the input image. In our implementation, we used smoothing, blackout, whiteout, binarization, or edge detection as the perturbation. The perturbation operation is applied to a patch whose location and size are decided randomly, as explained in Algorithm 2, which is described in detail in Section 3.3. After altering the patch in line 4, augmentation is applied to the perturbed image in line 5. Rotation, translation, rescale, and color-jitter are applied as augmentation operations. The augmented image is used as input to the network in line 6. Figure 2 illustrates the altered image under each of the five perturbation operations. Note that Algorithm 1 works in the online mode in generating a train sample.
The online mode generates a train sample just before inputting the sample into the network. The online mode has the advantage that the size of the train set can be considered unlimited.
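The training procedure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation; `perturb`, `augment`, and `step` are hypothetical caller-supplied callables standing in for Algorithm 2, the augmentation pipeline, and one forward/backward network update, respectively.

```python
import numpy as np

def train_with_perturbation(images, labels, perturb, augment, step, epochs=1):
    """Meta-algorithm of Algorithm 1: online perturbation before each update.

    `step(x, lab)` performs one forward pass, back-propagates the error,
    updates the weights, and returns the training error for that sample.
    """
    errors = []
    for _ in range(epochs):                 # line 2: until convergence (fixed epochs here)
        for img, lab in zip(images, labels):
            x = perturb(img)                # line 4: random local perturbation
            x = augment(x)                  # line 5: rotation/translation/rescale/color-jitter
            errors.append(step(x, lab))     # lines 6-7: forward, back-propagate, update
    return errors
```

Because the perturbation and augmentation run online, every epoch sees a freshly generated variant of each training image.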

Perturbation Operations
In lines 1 and 2 of Algorithm 2, a patch is generated by random sampling of its location and size. In line 3, one of the perturbation operations q is applied to the patch. We summarize the parameter values used in our experiment in Table 2. The values were determined empirically; we performed cross validation on the ISBI 2015 training set to determine them. The same values were applied to both the ISBI 2015 and Digital Hand Atlas datasets.
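A sketch of Algorithm 2's patch sampling and perturbation application is shown below. The patch-size fractions are illustrative assumptions only, not the actual values of Table 2, and a grayscale 2D image is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(img, min_frac=0.1, max_frac=0.5):
    """Lines 1-2 of Algorithm 2: sample a random patch size and location.
    The size fractions are assumed values for illustration."""
    h, w = img.shape[:2]
    ph = int(h * rng.uniform(min_frac, max_frac))
    pw = int(w * rng.uniform(min_frac, max_frac))
    y = int(rng.integers(0, h - ph + 1))
    x = int(rng.integers(0, w - pw + 1))
    return y, x, ph, pw

def perturb(img, q):
    """Line 3 of Algorithm 2: apply perturbation operation q to the patch."""
    y, x, ph, pw = random_patch(img)
    out = img.copy()
    out[y:y + ph, x:x + pw] = q(out[y:y + ph, x:x + pw])
    return out
```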

Blackout and Whiteout
The blackout changes all the pixels in patch p to black (value 0). The whiteout changes all the pixels in p to white (value 1). These operations are the extreme cases, in the sense that they remove all the information within the patch. The network should look at neighboring regions outside of patch p to learn the landmarks within p. Figure 2b,c show examples of the blackout and whiteout, respectively.
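These two operations reduce to constant fills, assuming pixel intensities scaled to [0, 1]:

```python
import numpy as np

def blackout(patch):
    """Set every pixel in the patch to 0 (black)."""
    return np.zeros_like(patch)

def whiteout(patch):
    """Set every pixel in the patch to 1 (white); assumes intensities in [0, 1]."""
    return np.ones_like(patch)
```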

Smoothing
Smoothing can be implemented in several ways. We chose the method that decreases the resolution of patch p, and then increases the resolution back to the original size. The scaling factor α is sampled from a normal distribution N(µ, σ²), where µ and σ² are the mean and variance. In our experiment, we used µ = 0.2 and σ = 0.15. This operation reduces the information in p to some extent, according to the randomly sampled α. Therefore, the network should learn the representation using both the local information in patch p and the global information outside p. Figure 2d illustrates an example image perturbed by smoothing.
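A minimal sketch of this down-then-up rescaling is given below. The paper does not specify the interpolation method, so nearest-neighbour resampling is used here for simplicity, and the clipping range for α is our assumption to keep the scaled size valid; a 2D grayscale patch is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth(patch, mu=0.2, sigma=0.15):
    """Down-scale patch p by a randomly sampled factor alpha, then up-scale back."""
    h, w = patch.shape[:2]
    alpha = float(np.clip(rng.normal(mu, sigma), 0.05, 1.0))  # clipping is an assumption
    sh, sw = max(1, int(h * alpha)), max(1, int(w * alpha))
    # down-scale: keep evenly spaced rows/columns
    small = patch[np.ix_(np.arange(sh) * h // sh, np.arange(sw) * w // sw)]
    # up-scale back to the original patch size by repeating pixels
    return small[np.ix_(np.arange(h) * sh // h, np.arange(w) * sw // w)]
```

A small α throws away most of the patch detail; α near 1 leaves it almost intact, so the amount of information loss varies per sample.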

Binarization
Binarization converts pixels in p into black (0) or white (1). The threshold value is obtained from the Otsu algorithm [37]. The threshold in the Otsu algorithm is determined such that it maximizes the inter-class variance obtained by separating the pixels into two classes, background and foreground. In our perturbation, a value generated from the normal distribution N(µ, σ²), where µ = 0.2 and σ = 0.15, is added to the threshold. The resulting threshold is applied to binarize the patch p. Figure 2e illustrates an example image perturbed by binarization. As was the case for smoothing, the network should learn the representation using both the reduced local information in patch p and the global information outside p.
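The step can be sketched as below, assuming intensities in [0, 1]. Otsu's threshold is re-implemented here in NumPy for self-containment; in practice a library routine such as `skimage.filters.threshold_otsu` would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(0)

def otsu_threshold(patch, bins=256):
    """Otsu's threshold: maximise the between-class variance over a histogram."""
    hist, edges = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                  # cumulative class probability
    m = np.cumsum(p * centers)         # cumulative first moment
    mt = m[-1]                         # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        var_between = (mt * w0 - m) ** 2 / (w0 * (1 - w0))
    var_between = np.nan_to_num(var_between)
    return float(centers[int(np.argmax(var_between))])

def binarize(patch, mu=0.2, sigma=0.15):
    """Binarise with an Otsu threshold perturbed by Gaussian noise (the paper's
    N(0.2, 0.15^2) shift)."""
    t = otsu_threshold(patch) + rng.normal(mu, sigma)
    return (patch > t).astype(patch.dtype)
```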

Edge Detection
Edge detection is performed using the Canny edge detection algorithm [38], resulting in pixels with value 0 (non-edge) or 1 (edge). The sigma value for the Canny edge detector is determined by random sampling from N(µ, σ²), where µ = 3.5 and σ = 1.5. A larger value of sigma results in coarser-scale edges and a greater noise suppression effect.
Unlike other operations that only degrade the information in the image, the edges have two opposite facets, degrading or upgrading the information. The image is upgraded in the sense that the positions that change the most in the image are highlighted by being converted to value 1, indicating the edge. As a result, it is expected that the landmarks lying close to the edges will be more easily detected. In contrast, the shading information disappears by edge detection. Therefore, the landmarks far from the edges should rely on global information. We expect that the two opposite effects will make the network more robust to image variations.
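For illustration only: the paper uses the Canny detector (e.g. `skimage.feature.canny` with a randomly sampled sigma); to keep this sketch dependency-free, a plain Sobel gradient magnitude with a fixed relative threshold stands in for it. It likewise produces a 0/1 edge map but lacks Canny's smoothing, non-maximum suppression, and hysteresis.

```python
import numpy as np

def sobel_edges(patch, thresh=0.25):
    """Binary edge map from Sobel gradient magnitude (Canny stand-in)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    pad = np.pad(patch, 1, mode="edge")
    h, w = patch.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):                  # correlate with the two Sobel kernels
        for j in range(3):
            win = pad[i:i + h, j:j + w]
            gx += kx[i, j] * win
            gy += ky[i, j] * win
    mag = np.hypot(gx, gy)
    return (mag > thresh * max(mag.max(), 1e-9)).astype(float)
```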

Hybrid
The hybrid method uses all the previously described perturbation methods probabilistically. Each perturbation is chosen with equal probability. Each perturbation has different effects, so the hybrid scheme provides more active perturbation.
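The hybrid scheme amounts to a uniform random choice among the operations, which can be sketched as:

```python
import random

rng = random.Random(0)

def hybrid(patch, operations):
    """Hybrid perturbation: pick one operation with equal probability and apply it."""
    return rng.choice(operations)(patch)
```

In our setting, `operations` would hold the blackout, whiteout, smoothing, binarization, and edge detection operators.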

Heatmap Regression
Algorithm 1 illustrates that the input of the neural network is an image I, and the output is a label set L. The label set provides the information about the landmark positions denoted by L = ((x_1, y_1), (x_2, y_2), . . . , (x_k, y_k)), where k is the number of landmarks.
Two approaches are available for representing L. The first approach uses the coordinate values x_i and y_i as they are. In this case, the network output layer has 2k output nodes. The second approach converts the coordinate values of l_i = (x_i, y_i) into a heatmap, where a Gaussian distribution is formed centered at l_i. The heatmap size is the same as the input image. The heatmap is normalized in the range 0-1. Since each of the landmarks has its own heatmap, k heatmaps are made. The network output layer should output an m × n × k tensor, where m × n is the size of the input image, and k is the number of landmarks. The second approach is termed heatmap regression, an approach that has been proven to be superior [10,32,33]. This paper adopts the heatmap regression method.
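Generating one such Gaussian heatmap can be sketched as follows; the sigma value is an illustrative assumption, and one heatmap would be generated per landmark to form the m × n × k target tensor.

```python
import numpy as np

def gaussian_heatmap(landmark, shape, sigma=5.0):
    """Heatmap for one landmark (x, y): a Gaussian centred there, normalised to [0, 1]."""
    x0, y0 = landmark
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (xs - x0) ** 2 + (ys - y0) ** 2
    h = np.exp(-d2 / (2 * sigma ** 2))
    return h / h.max()   # peak value 1 at the landmark
```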
The literature adopting heatmap regression usually uses the Gaussian distribution to mark a landmark in the heatmap. Figure 3a shows a hand X-ray image, and Figure 3b illustrates the heatmap using the Gaussian distribution for the landmark in Figure 3a. Another option is to use the Laplace distribution, which is sharper at the center than the Gaussian. Since the goal of the neural network is to determine a landmark position accurately, the Laplace distribution appears better. This paper adopts the Laplace distribution in forming heatmaps, as illustrated in Figure 3c. For the test stage, we should devise a method to convert a predicted heatmap into coordinate values. Figure 3d shows the predicted heatmap for the landmark. To calculate landmark coordinates, we select pixels whose values are larger than 0.85 of the maximum heatmap pixel value. The averages of these pixel positions are regarded as the landmark coordinates on the heatmap.
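The test-stage decoding step described above can be written compactly:

```python
import numpy as np

def heatmap_to_coords(hm, frac=0.85):
    """Decode a predicted heatmap: average the positions of all pixels whose
    value is at least `frac` of the heatmap maximum (0.85 in the paper)."""
    ys, xs = np.nonzero(hm >= frac * hm.max())
    return float(xs.mean()), float(ys.mean())
```

Averaging over the near-maximum pixels, rather than taking the single argmax, gives a sub-pixel estimate that is robust to small plateaus in the predicted heatmap.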

Network Architecture
Our architecture is based on U-net [39] and U-net with an attention module [40]. The contracting and expanding paths consist of the repeated application of two 3 × 3 convolutions, each followed by LeakyReLU, and a 2 × 2 max pooling (for the contracting path) or upsampling (for the expansive path). The number of feature channels was increased from 64, 128, 256, to 512 in the contracting path, and decreased from 512, 256, 128, to 64 in the expansive path. Attention gates filter the features propagated through the skip connections [40]. We used AC loss as the cost function and minimized it using the Adam optimizer. AC loss, proposed by [11], calculates the loss by considering the distances and angles between landmarks. The learning rate is initialized to 0.0001 and annealed by a cosine annealing schedule.
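A heavily reduced PyTorch sketch of this design is shown below: a two-level U-net with the described double-conv/LeakyReLU blocks and a 1 × 1 head producing one heatmap per landmark. It is not the paper's network; the full model uses four levels (64-128-256-512 channels) and attention-gated skip connections, which are omitted here, and `padding=1` is our assumption to preserve spatial size.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Minimal two-level U-net sketch; input height/width must be even."""
    def __init__(self, in_ch=1, n_landmarks=19, base=64):
        super().__init__()
        self.down1 = double_conv(in_ch, base)
        self.down2 = double_conv(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = double_conv(base * 2 + base, base)
        self.head = nn.Conv2d(base, n_landmarks, 1)   # one heatmap per landmark

    def forward(self, x):
        s1 = self.down1(x)                  # contracting path, level 1
        s2 = self.down2(self.pool(s1))      # contracting path, level 2
        u = self.up(s2)                     # expansive path
        return self.head(self.dec(torch.cat([u, s1], dim=1)))  # plain skip (no attention gate)
```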

Results
The proposed method was evaluated on hand radiograph data and cephalometric X-ray data, for comparison with the papers [10,11] that achieved the state-of-the-art. The network was trained and tested in PyTorch with an Intel Xeon Gold 6126 2.6 GHz CPU, 64 GB of memory, and an RTX 2080 Ti GPU with 11 GB of memory. For all datasets, we resized input images to 640 × 800 using bilinear interpolation. Since our network is based on U-net, it produces heatmap outputs of the same 640 × 800 size.
The performance of the landmark localization methods in Table 3 was evaluated with point-to-point error (PE) and error detection rate (EDR). PE is defined as the Euclidean distance between the ground truth landmark point and the predicted landmark point. The metric ">n mm" in EDR means the ratio of landmarks whose PE is greater than n millimeters (mm).

Table 3 compares the results of the perturbation methods on the Digital Hand Atlas dataset. All methods were tested based on U-net. In Table 3, all perturbators except whiteout and smoothing increased performance. Among them, the hybrid perturbator shows a significant performance improvement. U-net achieved a 2 mm range EDR of 4.27% and a mean PE of 0.66. When the hybrid perturbator was applied, the 2 mm EDR decreased to 3.96%, and the mean PE decreased to 0.64. Table 4 compares the results of the perturbation methods on the ISBI 2015 test1 dataset. U-net achieved a 2 mm range EDR of 15.47% and a median PE of 12.03. A significant performance improvement was observed when using the whiteout, binarization, and hybrid perturbators. When using the edge detection perturbator, the network achieved a 2 mm range EDR of 14.42% and a median PE of 11.72.
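The two metrics are straightforward to compute; a sketch follows, where `mm_per_px` converts pixel distances to millimeters (for the hand dataset it would be derived from the assumed 50 mm wrist width):

```python
import numpy as np

def pe(pred, gt):
    """Point-to-point error: Euclidean distance between prediction and ground truth."""
    return float(np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float)))

def edr(preds, gts, n_mm, mm_per_px=1.0):
    """Error detection rate: fraction of landmarks whose PE exceeds n millimeters."""
    errs = [pe(p, g) * mm_per_px for p, g in zip(preds, gts)]
    return sum(e > n_mm for e in errs) / len(errs)
```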

Quantitative Performance Comparison with Existing Studies
In this section, we perform a quantitative performance comparison with existing studies by using the best perturbation method for each dataset. We trained two networks, U-net and Attention U-net. In Table 5, using the Digital Hand Atlas dataset, the existing study of Payer et al. [33] achieved a 2 mm range EDR of 5.01% and a mean PE of 0.66. Our hybrid perturbator based on U-net achieved a 2 mm range EDR of 3.96% and a mean PE of 0.64. The hybrid perturbator based on Attention U-net showed poorer performance than U-net: it achieved a 2 mm range EDR of 4.76% and a mean PE of 0.71. In Table 6, using the ISBI 2015 dataset (test1), Chen et al. [35] achieved a 2 mm range EDR of 13.33% and a median PE of 11.7. Our edge detection perturbator based on Attention U-net achieved a 2 mm range EDR of 13.26% and a median PE of 11.48. The same perturbator based on U-net achieved a 2 mm range EDR of 14.42% and a median PE of 11.78. Table 7 shows the results on the ISBI 2015 on-site competition data. The best existing study is Oh et al. [11], which achieved a 2 mm range EDR of 24.11% and a median PE of 14.45. Our edge detection perturbator based on U-net achieved a 2 mm range EDR of 25.58% and a median PE of 14.82. The same perturbator based on Attention U-net achieved a 2 mm range EDR of 25.21% and a median PE of 15.30. The ISBI 2015 test1 data is different from the test2 data. Previous studies [11,42] have reported that landmarks 3 (Orbitale), 6 (Supramentale), 13 (Upper lip), and 16 (Soft tissue pogonion) have a large performance gap between the test1 and test2 data. Because of this gap, experiments in the same environment are not guaranteed to produce similar results on the test2 dataset.

Figure 4a shows mean radial error (MRE) curves for the Digital Hand Atlas dataset according to different perturbations. As Table 3 illustrates, the hybrid perturbation, depicted with a black dashed line, shows the best performance. Figure 4b shows MRE curves for the ISBI 2015 test1 dataset.
As Table 4 illustrates, the edge perturbation, depicted with a red line, leads the other perturbations. The MRE curves in Figure 5 are shown according to different epochs. Between the 20th and 50th epochs, the network performance improved dramatically. After the 500th epoch, the network performance improved only slightly.

Discussions and Conclusions
The performance improvement by the perturbations can be explained by the fact that a perturbation functions as a regularizer with data augmentation. Over the entire training phase, the sizes and positions of the patches are chosen randomly, so each training sample is different from the others. The various perturbation parameters described in Section 3.3 enforce a further augmentation effect. We suggested several perturbation algorithms and tested them. The perturbation operations cause a loss of local information in the image, which forces the models to search for global features. The perturbations differ in the amount of information loss. Figure 7 shows the diverse attributes of the perturbators. The blackout and whiteout perturbations erase all local information; thus, the network has no choice but to localize the landmarks by learning global features outside the patch. Smoothing reduces the information in the patch, so the network can rely on both the inside and outside of the patch. Binarization and edge detection have two opposite facets, weakening or strengthening the local information of the patch. The weakening facet occurs because the details of the patch disappear; the strengthening facet occurs because the operations sharpen the region boundaries. So, binarization and edge detection can be regarded as guiding the feature learning. Guided feature learning, it is worth noting, is an effective way of enriching the feature representation. Many researchers have shown that guided feature learning improves the accuracy of object detection. A notable result can be found in [43], where dilated convolution layers played the role of feature guiding. In that work, the feature guidance is learned in an end-to-end manner. In contrast, our perturbators accomplish feature learning by using hand-crafted features as a preprocessor.
One of our important future works is to embed the perturbation module in the network architecture and learn the perturbator automatically. This extension is expected to improve the accuracy of landmark localization. Another benefit comes from removing the necessity of manually setting the parameters in Table 2.
In Tables 3 and 4, the edge detection perturbator, which plays the role of guided feature learning, generally leads the other operations. The hybrid perturbator, which randomly chooses edge detection among other operations, also resulted in significant improvement. These improvements benefit dentists in planning treatment. Geometric features such as distances and angles are used to classify patients into several groups, and accurate classification is very important for treatment. We argue that the accuracy improvement by the edge detection and hybrid perturbators has a significant impact on building a dental treatment system.
The most important future work is to apply the perturbators to the other medical tasks described in the Introduction, such as analyzing fractures of the humerus and aortic valve localization for pre-operative surgical planning.

Conflicts of Interest:
The authors declare no conflict of interest.