Multi-Task Learning for Medical Image Inpainting Based on Organ Boundary Awareness

Distorted medical images can significantly hamper medical diagnosis, notably in the analysis of Computed Tomography (CT) images and organ segmentation. Therefore, improving the accuracy of diagnostic imagery and reconstructing damaged portions are important for medical diagnosis. Recently, these issues have been studied extensively in the field of medical image inpainting. Inpainting techniques are emerging in medical image analysis since local deformations in medical modalities are common because of various factors such as metallic implants, foreign objects, or specular reflections during image capture. The completion of such missing or distorted regions is important for the enhancement of post-processing tasks such as segmentation or classification. In this paper, a novel framework for medical image inpainting is presented by using a multi-task learning model for CT images, targeting the learning of the shape and structure of the organs of interest. This novelty has been accomplished through simultaneous training for the prediction of edges and organ boundaries along with the image inpainting, while state-of-the-art methods still focus only on the inpainting area without considering the global structure of the target organ. Therefore, our model reproduces medical images with sharp contours and exact organ locations. Consequently, our technique generates more realistic and believable images compared to other approaches. Additionally, in quantitative evaluation, the proposed method achieved the best results in the literature so far: a PSNR value of 43.44 dB and an SSIM of 0.9818 for square-shaped regions, and a PSNR value of 38.06 dB and an SSIM of 0.9746 for arbitrary-shaped regions. The proposed model generates sharp and clear images for inpainting by learning the detailed structure of organs.
These results show how promising the method is for medical image analysis, where the completion of missing or distorted regions is still a challenging task.


Introduction
Computed Tomography (CT) has been one of the essential medical imaging systems and is utilized for expert diagnoses. However, CT images are often distorted by reflections from metallic implants or foreign objects such as pacemakers, catheters, and drainage tubes. Moreover, medical images are sometimes degraded due to sudden movements of the patient during the scanning phase. Many approaches have been proposed for the restoration of deformed images, including research on noise reduction, image translation, and inpainting. Among these methods, inpainting has emerged as a reasonably effective and popular method today. Several studies on medical image inpainting have been proposed, including techniques for handling damaged square-shaped regions [1,2]. However, in real situations, the defects are mostly not squares but arbitrary-shaped regions, which launched the study of medical image inpainting with arbitrary damaged forms [3], enabling the restoration of practical failures with any deformation. These techniques still suffer from incomplete restorations, such as blurred boundaries and loss of detail. The main contributions of this paper are summarized as follows:

• We propose a framework based on edge and organ boundary awareness to reconstruct deformed regions in CT images.

• We newly introduce the use of organ boundaries in addition to edges to establish enough structural knowledge for the inpainting of damaged regions, including parts of the organs. Specifically, multi-task learning is employed to train the network simultaneously for the prediction of edges and organ boundaries. To the best of our knowledge, the use of organ boundaries for learning structural information in medical image inpainting is adopted here for the first time in the literature.

• Our method generates more realistic and believable images compared to other approaches. In both quantitative and qualitative evaluation, the proposed method outperforms the state-of-the-art methods.
The rest of the paper is organized as follows. In Section 2, we introduce the related literature in the fields of general inpainting and medical inpainting. The details of our architecture are presented in Section 3. The experimental results are given in Section 4. Finally, conclusions are drawn in Section 5.

Inpainting in General Field
Image inpainting methods can be put into two main groups: traditional and learning-based approaches. The traditional methods employ diffusion-based or patch-based techniques with low-level features, while the learning-based approaches try to understand the semantics of the image to fulfill the inpainting task. The success of deep learning has made the second approach effective and very popular in recent years. We introduce the details of studies based on both approaches in the following sections.

Traditional Approach
With conventional methods, the algorithms try to find components from the background area, then compute similarity levels and fill in the hole. Such methods use the information available in the image containing the deformed part to generate the missing area [9,10], which provides a simple algorithm that responds relatively well to the inpainting of small areas when the scene is not too complicated. Conventional methods also do not require a large amount of training data. However, in some cases, such as large or arbitrary-shaped holes, possibly with a background of complex structures, these methods can fail to produce a good recovery.

Learning-Based Approach
In learning-based approaches, different types of features can be learned from a large spectrum of sample images, leading to better predictions compared to conventional methods. The deep learning approaches have been studied extensively, and the results showed a significant improvement in the performance. The context encoder (CE) network [11] uses adversarial training [12] with a novel autoencoder. Most of the early deep learning methods use standard convolutional networks over the corrupted image, using convolutional filter responses on the pixels in the masked holes, which often lead to artifacts such as color discrepancy and blurriness. Partial convolution [13] was proposed, where the convolution is masked and renormalized to be conditioned on only valid pixels. In this model, an updated mask was automatically generated for the next layer as part of the forward pass. Later, partial convolution has been generalized to a gated convolution [14] by providing a learnable dynamic feature selection mechanism for each channel and each spatial location for free-form image inpainting. In early studies using deep learning networks, the missing parts were predicted by propagating the surrounding convolutional features into the missing region to produce semantically plausible images, but they often resulted in blurry images. Spatial attention has been applied to consider the contextual relationship between the background and the hole region. The Shift-Net model [15] introduced a special shift layer to the U-Net architecture to shift the encoder feature of the known region for an estimation of the missing parts, resulting in sharper images with detailed textures. A learnable bidirectional attention map module (LBAM) [16] learned feature re-normalization on both the encoder and decoder of the U-net [17] architecture. A recurrent feature network (RFN) [18] was proposed to exploit the correlation between adjacent pixels and strengthen the constraints for estimating deeper pixels. 
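As a rough illustration of the partial-convolution idea above, the following sketch (our own simplified single-channel NumPy version, not the implementation of [13]) masks the input, renormalizes each window by its fraction of valid pixels, and produces the updated mask for the next layer:

```python
import numpy as np

def partial_conv2d(image, mask, kernel):
    """Single-channel partial convolution (illustrative sketch).

    image:  (H, W) array with holes
    mask:   (H, W) binary array, 1 = valid pixel, 0 = hole
    kernel: (k, k) convolution weights
    Returns the filtered image and the updated mask for the next layer.
    """
    k = kernel.shape[0]
    pad = k // 2
    img_p = np.pad(image * mask, pad)   # zero out holes before convolving
    msk_p = np.pad(mask, pad)
    H, W = image.shape
    out = np.zeros((H, W), dtype=float)
    new_mask = np.zeros((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            m = msk_p[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                x = img_p[i:i + k, j:j + k]
                # renormalize by the ratio of window size to valid pixels
                out[i, j] = (x * kernel).sum() * (k * k / valid)
                new_mask[i, j] = 1.0    # the hole shrinks layer by layer
    return out, new_mask
```

Because the response is conditioned only on valid pixels, a constant image with a small hole is reproduced exactly, and the updated mask marks formerly missing pixels as valid for the next layer.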
However, these studies have not fully utilized the structural knowledge of the image. Several approaches exploit the inherent structural information in the input images by using edges or object boundaries for inpainting [4-6,8]. EdgeConnect [4] is a two-stage adversarial model that comprises an edge generator followed by an image completion network. The edge generator hallucinates the edges in the missing area (either square-shaped or arbitrary-shaped), and the image completion network fills in the distorted regions using the hallucinated edges as a prior for the inpainting. Edge structures and color-aware maps are fused in a two-stage generative adversarial network (GAN) [5]. In the first stage, edges with the missing regions are used to train an edge structure generator; meanwhile, distorted input images with the missing part are transformed into a global color feature map by the content-aware fill algorithm. In the second stage, the edge map and the color map are fused to generate the refined image. The authors in [6] proposed a foreground-aware image inpainting system that explicitly disentangles structure inference and content completion: the foreground contours are predicted first, and then the inpainting is performed using the predicted contours as guidance. A multi-task learning framework with auxiliary tasks of edge finding and gradient map prediction is used to incorporate knowledge of the image structure to assist inpainting [8].

Inpainting in Medical Field

Traditional Approach
The use of inpainting techniques is emerging in medical image analysis since local deformations in medical modalities are common because of various factors such as metallic implants or specular reflections during image capture. The completion of such missing or distorted regions is important to enhance post-processing tasks such as segmentation or classification. Traditional approaches for medical image inpainting focus on interpolation, non-local means, diffusion techniques, and texture synthesis [19-23]. However, the conventional methods are confined to a single image and do not learn from images with similar features.

Learning-Based Approach
These days, medical image inpainting has been studied extensively with deep learning models [1-3,24,25]. A GAN incorporating two patch-based discriminator networks with style and perceptual losses is used for inpainting missing information in positron emission tomography-magnetic resonance imaging (PET-MRI) [1]. A generative framework is proposed to handle the inpainting of arbitrary-shaped regions without prior localization of the regions of interest [3]. Several improvements to deep learning models have been reported with better performance than conventional methods. However, these methods do not use the inherent structural information in medical images, resulting in blurry images that often lack detail. The authors in [7] proposed a method using structural information represented by the edges of the image. The network decouples image repair into two separate stages: edge connection and contrast completion. The first stage predicts the edges inside the missing region, and the resulting edge map is used for inpainting. Even though the use of edges succeeded in improving performance, it does not provide deeper knowledge of the organ structures in the body, resulting in still-poor restoration quality. Recently, a deep neural network for medical inpainting was proposed in [26]. This framework generates 3D images from sparsely sampled 2D images, employing an inpainting deep neural network based on a U-net-like structure and DenseNet sub-blocks. However, because boundary information is ignored in training, their method suffers from boundary artifacts. Additionally, since [26] was trained and tested on a dataset that is not publicly available, it is hard to compare its performance with this study.
In this paper, we propose a multi-task learning model based on auxiliary tasks of edge reconstruction, and organ boundary prediction with the main task of CT image inpainting. The proposed method is more consistent and effective for image inpainting through simultaneous training of the prediction of edges and organ boundaries.

Network Architecture
There have been several methods for boundary detection, such as sketch generation using GANs [27,28]. A contour generation algorithm outputs contour drawings of arbitrary input images [27], and an application for face photo-sketch synthesis based on a composition-aided GAN is introduced in [28]. The proposed network consists of three GANs for edge reconstruction, organ boundary prediction, and image inpainting. Our multi-task learning model is built on an adversarial framework in which the three discriminators feed their discrimination results back to the generators. Figure 1 shows the detailed architecture of our model. After encoding the input image, three decoder networks predict the edge map, the organ boundary, and the completed image simultaneously. These results are fed into the discriminator networks, whose feedback is directed back to the generators. The generator network is a modified autoencoder with one shared encoding part and three decoding parts. The Dilated Residual Network (DRN) block, an upgraded ResNet block [4], is constructed by replacing the first convolutional layer with a dilated convolutional layer. Dilated convolutions with a dilation factor of two are used instead of the original convolutions in the residual layers to effectively expand the receptive field without losing resolution in subsequent layers [29-31]. Figure 2 shows the detailed architecture of the DRN block. During training, the decoding part usually struggles to generate feature maps with enough detailed information. Therefore, we employ a super-resolution module (SRM) inside the decoding parts to help the network learn features efficiently and produce feature maps with more detail. SRM is a modification of the fast super-resolution convolutional neural network (FSRCNN) [32], which makes our model faster with better-reconstructed image quality. Our model is based on the pix2pix GAN [33].
The proposed network takes images from one domain as input and outputs the corresponding image in the other domain, rather than a fixed-size vector. Unlike the initially proposed architecture which classifies a whole image as real or fake, the pix2pix GAN-based model tries to classify patches of an image as real or fake. Therefore, the output is a matrix of values instead of a single value.
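The DRN block described above can be sketched in PyTorch as follows; the channel count and the use of instance normalization are our illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class DRNBlock(nn.Module):
    """Dilated residual block sketch: a ResNet-style block whose first
    convolution is dilated (factor 2) to enlarge the receptive field
    without downsampling. The channel count is illustrative."""

    def __init__(self, channels=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            # first conv replaced by a dilated conv, padding chosen to
            # keep the spatial resolution unchanged
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # residual connection preserves resolution and eases optimization
        return torch.relu(x + self.body(x))
```

Because the dilated convolution pads by its dilation factor, the block maps an input feature map to an output of the same spatial size, so it can be stacked freely inside the generator.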

Discriminator
The discriminator is the network that distinguishes whether data come from the dataset or are generated by the generators. Thanks to the discriminator, the model learns the association between input and output, so the generated images are better and more plausible in detail. We use three separate discriminators during training to learn features better. Each discriminator consists of several convolution layers followed by a sigmoid activation function.
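A minimal PatchGAN-style discriminator in this spirit might look as follows; the layer widths and kernel sizes are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

def patch_discriminator(in_ch=1):
    """PatchGAN-style discriminator sketch (layer sizes illustrative):
    stacked stride-2 convolutions ending in a sigmoid, so the output is
    a grid of per-patch real/fake scores rather than a single value."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(256, 1, 4, stride=1, padding=1),  # per-patch score map
        nn.Sigmoid(),
    )
```

For a 256 × 256 input, this network outputs a 31 × 31 grid of scores in (0, 1), one per receptive-field patch, which is the "matrix of values instead of a single value" described above.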

Loss Function
The loss function returns a non-negative real number representing the difference between two quantities: the predicted label and the correct label. It forces the model to pay a penalty for every mistaken prediction, with the penalty proportional to the severity of the error. In all supervised learning problems, the goal includes minimizing the total penalty; ideally, the loss function returns its minimum value of zero. During training, we used several different types of losses for various purposes.
In our network, the input is the distorted image I_image_in = I_gt ⊙ (1 − M), where I_gt is the ground-truth image and M is the mask image with value 1 for the missing region and 0 for the background. The symbol ⊙ denotes the Hadamard product. Similarly, we have I_edge_in = I_edge_gt ⊙ (1 − M), where I_edge_gt is the edge map extracted from the ground-truth image by the Canny edge detector. Our network generates three images with the missing regions filled in: the completed image I_image_pred, the organ boundary map I_organs_pred, and the edge map I_edge_pred. These images have the same resolution as the input image. Let G be the generator and D_1, D_2, and D_3 the discriminators of the image generator, edge generator, and organ boundary generator, respectively.
First, we analyze the decoding branch that generates the completed image I_image_pred. We employed two losses proposed in [34,35], commonly known as perceptual loss Loss_image_perceptual and style loss Loss_image_style. Perceptual loss is defined as follows:

Loss_image_perceptual = E[ Σ_i (1/N_i) ||δ_i(I_gt) − δ_i(I_image_pred)||_1 ]

where δ_i is the activation map in the i-th layer of a pre-trained network and N_i is the number of elements in δ_i. These activation maps are also employed to calculate the style loss, which measures the differences between the covariances of the activation maps. Given feature maps of size C_j × H_j × W_j, style loss is calculated by:

Loss_image_style = E_j[ ||G_j^δ(I_gt) − G_j^δ(I_image_pred)||_1 ]

where G_j^δ is a C_j × C_j Gram matrix generated from the activation maps δ_j. We also used the reconstruction loss Loss_image_reconstruction in our model, for which we chose the l_1 loss. In addition, we used a discriminator for our image completion part. Typically, generator gradients vanish quickly in generative adversarial networks [12]. To fix this problem, we employed the Hinge loss [36], which is useful for classifiers. These loss functions are defined as:

Loss_image_disc = E[max(0, 1 − D_1(I_gt))] + E[max(0, 1 + D_1(I_image_pred))]
Loss_image_gen = −E[D_1(I_image_pred)]

The branches that generate the completed edge map and the completed organ boundary map have similar structures; their outputs are denoted as I_edge_pred and I_organs_pred, respectively. We likewise use the perceptual losses Loss_edge_perceptual and Loss_organs_perceptual, the style losses Loss_edge_style and Loss_organs_style, and the reconstruction losses Loss_edge_reconstruction and Loss_organs_reconstruction for training the whole model. Finally, our overall loss function is calculated by:

Loss_total = ε_1 Loss_image_perceptual + ε_2 Loss_image_style + ε_3 Loss_image_reconstruction + ε_4 Loss_image_gen + ε_5 Loss_edge_perceptual + ε_6 Loss_edge_style + ε_7 Loss_edge_reconstruction + ε_8 Loss_edge_gen + ε_9 Loss_organs_perceptual + ε_10 Loss_organs_style + ε_11 Loss_organs_reconstruction + ε_12 Loss_organs_gen    (6)

where each ε_i is the weight of the corresponding loss component.
From the experiment, we choose ε 1 = ε 5 = ε 9 = 0.1, ε 2 = ε 6 = ε 10 = 250, ε 3 = ε 7 = ε 11 = 1, ε 4 = ε 8 = ε 12 = 0.1 for the training of our model.
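With these weights, the overall objective of Equation (6) is a weighted sum over the three task branches. A minimal sketch (the loss values passed in are placeholders; in training they come from the actual perceptual/style/reconstruction/adversarial computations):

```python
# Per-loss-type weights shared across the image/edge/organs branches,
# matching the epsilon values chosen in the experiments.
WEIGHTS = {
    "perceptual": 0.1,
    "style": 250.0,
    "reconstruction": 1.0,
    "gen": 0.1,
}

def total_loss(losses):
    """Combine the twelve loss terms of Equation (6).

    losses: {task: {loss_name: value}} for tasks "image", "edge", "organs".
    Returns the scalar weighted sum.
    """
    return sum(
        WEIGHTS[name] * value
        for task_losses in losses.values()
        for name, value in task_losses.items()
    )
```

For example, a single branch contributing perceptual 1.0, style 0.01, reconstruction 2.0, and adversarial 0.5 adds 0.1 + 2.5 + 2.0 + 0.05 = 4.65 to the total.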

Experimental Environment and Datasets
We have used masks of arbitrary-shaped regions and square-shaped regions in this study. One hundred random mask images are created for each mask type for training, and 50 mask images are generated for testing. To make the comparison fair, we use the same masks for training and testing across all methods. We evaluated our methodology on the publicly available StructSeg2019 medical dataset [37], which contains 50 3D CT scans from 50 patients. The 50 volumes are converted into 4775 2D images, of which 1000 are used for testing and the rest for training. The input size for training and testing is uniformly set to 256 × 256 pixels. The Canny edge detector [38] is used to generate the edge-map ground truth from the input images, and the organ-boundary ground truth is derived from the organ segmentations given in the dataset. We employed the Adam optimizer with a batch size of 4 to optimize the network. The proposed method was trained for 30 epochs with an initial learning rate of 0.0002. During training, we use two types of augmentation: rotation (by 90, 180, and 270 degrees) and horizontal reflection. Our method was implemented in Python with the PyTorch framework. Table 1 shows the details of our experimental environment and the configuration of the training model.
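The mask generation and augmentation steps can be sketched as follows; the hole size and random seed are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is illustrative

def square_mask(size=256, hole=96):
    """Binary mask (1 = missing region, 0 = background) with one random
    square hole; the hole size is an illustrative assumption."""
    m = np.zeros((size, size), dtype=np.uint8)
    y, x = rng.integers(0, size - hole, 2)
    m[y:y + hole, x:x + hole] = 1
    return m

def augment(img):
    """Rotations by 0/90/180/270 degrees, each with a horizontal
    reflection, giving eight variants of one slice."""
    out = []
    for k in (0, 1, 2, 3):            # number of 90-degree rotations
        r = np.rot90(img, k)
        out.extend([r, r[:, ::-1]])   # rotated + horizontally reflected
    return out
```

Arbitrary-shaped masks would be generated analogously, e.g. from random strokes, and the same fixed mask sets are then reused for every compared method.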

Evaluation Criterion
Following research [3] in the medical inpainting field, we use common evaluation metrics: the structural similarity index (SSIM) [39], peak signal-to-noise ratio (PSNR), mean squared error (MSE), and the universal image quality index (UQI) [40] to quantify the performance of the models. We conducted experiments in both settings of square-shaped and arbitrary-shaped holes. The MSE and PSNR metrics are defined as:

MSE = (1/Z) Σ_{i=1}^{Z} (I_gt_i − I_pred_i)^2

PSNR = 10 log_10 (k_max^2 / MSE)

where Z is the number of pixels and k_max is the maximum pixel value; in particular, for 8-bit images k_max = 255. The better the image quality, the higher the PSNR. The structural similarity (SSIM) is considered a stronger parameter for evaluating image consistency. It lies within the range [0, 1], with a score close to 1 indicating better conservation of the structure, and is based on the visual perception characteristics of humans. The SSIM is calculated between two equally sized windows ω_1 and ω_2:

SSIM(ω_1, ω_2) = (2 μ_ω1 μ_ω2 + c_1)(2 σ_ω1ω2 + c_2) / ((μ_ω1^2 + μ_ω2^2 + c_1)(σ_ω1^2 + σ_ω2^2 + c_2))

where μ_ωi and σ_ωi^2 are the average and the variance of window ω_i, respectively, σ_ω1ω2 is the covariance, and c_1, c_2 are numerical stabilizing parameters. We also used the UQI metric, the predecessor of SSIM, to compare our proposed method with other methods. Let I_gt = {I_gt_i | i = 1, 2, ..., Z} and I_pred = {I_pred_i | i = 1, 2, ..., Z} be the ground truth and the predicted image, respectively. UQI is the product of a correlation term, a luminance term, and a contrast term:

UQI = (σ_{I_gt I_pred} / (σ_I_gt σ_I_pred)) · (2 μ_I_gt μ_I_pred / (μ_I_gt^2 + μ_I_pred^2)) · (2 σ_I_gt σ_I_pred / (σ_I_gt^2 + σ_I_pred^2))
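Under these definitions, the metrics can be sketched directly in NumPy (a simplified global version; SSIM is omitted here because it is computed over local windows):

```python
import numpy as np

def mse(gt, pred):
    """Mean squared error over all pixels."""
    return np.mean((gt.astype(float) - pred.astype(float)) ** 2)

def psnr(gt, pred, k_max=255.0):
    """Peak signal-to-noise ratio in dB; k_max = 255 for 8-bit images."""
    return 10.0 * np.log10(k_max ** 2 / mse(gt, pred))

def uqi(gt, pred):
    """Universal image quality index: correlation x luminance x contrast."""
    x, y = gt.astype(float).ravel(), pred.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (cov / np.sqrt(vx * vy)) \
        * (2 * mx * my / (mx ** 2 + my ** 2)) \
        * (2 * np.sqrt(vx * vy) / (vx + vy))
```

For identical ground-truth and predicted images, UQI evaluates to 1, and PSNR grows as the MSE shrinks.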

Results
Our results for the PSNR and MSE metrics are presented in the graphs of Figures 3 and 4. Figure 3 shows the PSNR scores of our method for square-shaped and arbitrary-shaped masked images compared with the others; higher values are better. Figure 4 presents the corresponding MSE scores; lower values are better. For square-shaped inpainting, the qualitative and quantitative results are presented in Figure 5 and Table 2, respectively. Partial convolution [13] produced the worst inpainting results from both a quantitative and a qualitative perspective, while the proposed method achieved the highest performance. In Figure 6 and Table 3, the qualitative and quantitative results for arbitrary-shaped inpainting are given, respectively. Our approach again outperforms the other methods. The PSNR reached 43.44 and 38.06 dB for square-shaped and arbitrary-shaped masks, respectively, demonstrating the effectiveness of the proposed method for both mask types. Table 4 compares the results from different types of loss functions used in the discriminators. Table 5 shows the quantitative results of the proposed method with and without SRM. Tables 6 and 7 present the quantitative comparison of PSNR/SSIM/MSE/UQI between the multi-task and mono-task frameworks for arbitrary-shaped and square-shaped regions, respectively. In Figures 5 and 6, panels (b)-(g) show the results obtained using the public source code of methods [15], [16], [4], [13], [14], and [18], respectively; panel (h) shows the result of our method and panel (i) the ground-truth image. Tables 2 and 3 show the experimental results compared to other methods for both square-shaped and arbitrary-shaped masks. Compared to recent inpainting studies, our method produced promising results. In particular, we compared it to methods [13,15] proposed in 2018, methods [4,14,16] introduced in 2019, and method [18] presented in 2020.

Multi-Task with Edge and Boundary Information (Ours)
From Tables 2 and 7, for square mask shapes, using only the mono-task framework, our result is 40.85 dB PSNR, lower than that of LBAM [16], whose PSNR reaches 43 dB. When we integrated either the edge knowledge learning task or the organ boundary awareness task alone, the results show only a small performance increase, with the PSNR rising to 42.32 and 42.77 dB, respectively. Nevertheless, this proves the positive value of adding an auxiliary task to the network. We then evaluated the inpainting results using the multi-task learning framework based on both edge awareness and organ boundary knowledge, and achieved promising results: the PSNR increased to 43.44 dB, surpassing all remaining methods. Tables 3 and 6 show that our results also obtained the best values for arbitrary-shaped masks. The proposed method achieved a PSNR of 38.06 dB, compared with the highest value among the remaining methods of 37.20 dB, belonging to the EdgeConnect method [4]. These quantitative comparisons show the outstanding practicality and performance of our approach compared to recent inpainting methods. We also present qualitative comparisons in Figures 5 and 6. With a square-shaped mask in Figure 5, we reproduce images with very high authenticity: the structure of the right lung is preserved relatively intact, while the remaining methods reproduce less plausible images that lose much of the detailed information. The left lung in Figure 6 is heavily degraded when we use an arbitrary-shaped mask. However, thanks to the multi-task framework based on edge awareness and organ boundary knowledge, our method reproduces the image with a plausible result; in particular, the structures of the right and left lungs are reconstructed quite well.
The left lung boundary area was reproduced quite sharply, without blurring or distortion, while the other methods were not capable of this. Table 4 compares the results from different types of loss functions used in the discriminators. Using binary cross-entropy loss, our model can only generate results of 37.00 and 41.78 dB PSNR for the arbitrary-shaped and square-shaped masks, respectively; these results are lower than those of methods [4,16]. When we replace the binary cross-entropy loss with mean-square-error loss, our inpainting results improve slightly, but for the square-shaped mask they are still lower than those of method [16]. Finally, we replace the mean-square-error loss with the Hinge loss, and our inpainting results outperform the others, with 43.44 and 38.06 dB for the square-shaped and arbitrary-shaped masks, respectively. Figure 7 shows the qualitative comparison of inpainting results between using binary cross-entropy loss, L2 (mean-square-error) loss, and Hinge loss in the discriminators. For the best generated results, we chose the Hinge loss for our discriminators during training. Figure 8 presents the qualitative comparison of inpainting results between the mono-task framework, the multi-task framework with organ boundary knowledge, the multi-task framework with edge knowledge, and the multi-task framework with boundary combined with edge knowledge. We also examine the effect of SRM on our model. Table 5 shows the quantitative results of the proposed method with and without SRM. The results show that the images generated by the model with SRM are considerably better for both square-shaped and arbitrary-shaped masks, which proves the positive effect of SRM in making the model generate better high-resolution features with more useful details.
From the above comparison and analysis, we find that the proposed model has superior performance for inpainting on medical CT images compared with other recent studies. Table 8 shows the detailed structure of the encoding part. The components of the discriminator network are introduced in Table 9. The detailed information on the decoding parts is given in Tables 10-12.

Ablation Study
Binary cross-entropy (BCE), mean square error (MSE), and Hinge loss are popular loss functions for classification. Cross-entropy calculates a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1; the score is minimized, with a perfect cross-entropy value of 0. MSE computes the mean of the squared distances between the ground-truth and predicted values. The Hinge loss emphasizes that examples should have the correct sign, adding more error when the ground-truth and predicted values differ in sign. Figure 7 shows the qualitative comparison of inpainting results between using binary cross-entropy loss, L2 loss, and Hinge loss in the discriminators. Table 4 compares the results from different types of loss functions used in the discriminators, for both square-shaped and arbitrary-shaped regions. Our results outperformed the others when we chose the Hinge loss for our discriminators during training. We also validate the effectiveness of SRM in our model. Table 5 shows the quantitative results with and without SRM for square-shaped and arbitrary-shaped regions. The results show that the images generated by the model with SRM perform better, which proves the positive effect of SRM in making the model generate features with useful details. Tables 6 and 7 show the quantitative comparison of PSNR/SSIM/MSE/UQI between the multi-task and mono-task frameworks for arbitrary-shaped and square-shaped regions. When we integrated the edge learning task or the organ boundary task alone, the PSNR increased only slightly. Our method outperformed the rest of the methods when we used the multi-task learning framework based on both edge and organ boundary learning, which proves the positive value of adding auxiliary tasks to the network.
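The discriminator losses compared in this ablation can be sketched as follows (standard textbook forms; the paper's exact implementation may differ):

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Discriminator hinge loss on raw scores: penalize real scores
    below +1 and fake scores above -1."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def hinge_g_loss(d_fake):
    """Generator hinge loss: push discriminator scores of generated
    images upward."""
    return -np.mean(d_fake)

def bce_d_loss(p_real, p_fake, eps=1e-7):
    """Binary cross-entropy alternative on sigmoid outputs in (0, 1)."""
    return (-np.mean(np.log(p_real + eps))
            - np.mean(np.log(1.0 - p_fake + eps)))
```

Unlike BCE, the hinge formulation stops penalizing confidently correct scores (real > 1, fake < -1), which keeps gradients useful and matches the training behavior reported in Table 4.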
Figure 8 presents the qualitative comparison of inpainting results between the mono-task framework, the multi-task framework with organ boundary knowledge, the multi-task framework with edge knowledge, and the multi-task framework with boundary combined with edge knowledge.

Conclusions
This paper presented an efficient multi-task learning network for medical image inpainting based on organ boundary awareness. We utilized the auxiliary tasks of edge and organ boundary prediction to make the model generate sharp and clear inpainted images by learning the detailed structure of organs. Our model proved efficient in the reconstruction of degraded or distorted organs and generates plausible boundaries for the inpainting. Based on a detailed experimental evaluation, we demonstrated that the proposed method outperforms the state-of-the-art methods for medical image inpainting, achieving the best results in the literature so far, with the highest PSNR and lowest MSE for both arbitrary-shaped and square-shaped regions. These results show how promising the method is for medical image analysis, where the completion of missing or distorted regions is still a challenging task. This research is a good foundation for future medical imaging analysis and supports the diagnostic and prognostic capabilities of medical experts. We hope to extend the proposed method to handle other medical imaging modalities such as X-ray, magnetic resonance, or ultrasound images. Additionally, many datasets in medical analysis are quite small; therefore, it is essential to optimize the model for small datasets to achieve good results. Some research directions use relatively small datasets but still achieve good performance, such as [41-43]. In the future, we will optimize the proposed model to apply it to small datasets.