End-to-End Deep Learning CT Image Reconstruction for Metal Artifact Reduction

Abstract: Metal artifacts are common in CT-guided interventions due to the presence of metallic instruments. These artifacts often obscure clinically relevant structures, which can complicate the intervention. In this work, we present a deep learning CT reconstruction called iCTU-Net for the reduction of metal artifacts. The network emulates the filtering and back projection steps of the classical filtered back projection (FBP). A U-Net is used as post-processing to refine the back projected image. The reconstruction is trained end-to-end, i.e., the inputs of the iCTU-Net are sinograms and the outputs are reconstructed images. The network does not require a predefined back projection operator or the exact X-ray beam geometry. Supervised training is performed on simulated interventional data of the abdomen. For projection data exhibiting severe artifacts, the iCTU-Net achieved reconstructions with SSIM = 0.970 ± 0.009 and PSNR = 40.7 ± 1.6. The best reference method, an image-based post-processing network, only achieved SSIM = 0.944 ± 0.024 and PSNR = 39.8 ± 1.9. Since the whole reconstruction process is learned, the network was able to fully utilize the raw data, which benefited the removal of metal artifacts. The proposed method was the only studied method that could eliminate the metal streak artifacts.


Introduction
The presence of high attenuation objects in the scanning field leads to artifacts in computed tomography (CT) imaging, which substantially decrease the image quality. The generic term for these kinds of artifacts is metal artifacts, which are a combination of beam hardening, scattering, photon starvation, and edge effects [1]. Metal artifacts are common in CT-guided interventions due to the presence of metallic instruments such as biopsy needles [2][3][4] or catheters [5]. In many interventions, iodine contrast agent is used, leading to additional beam hardening [6]. These artifacts often obscure clinically relevant structures, which can complicate the intervention. For example, the visibility of liver lesions is significantly reduced during liver biopsy [2] or during transarterial chemoembolization (TACE) [7,8], where catheters are used in combination with contrast agents.
Several CT reconstruction methods have been developed to improve image quality in the presence of metal objects. Statistical iterative reconstruction techniques can be used to correct beam hardening and thus mitigate metal artifacts [9]. Furthermore, dual-energy CT allows one to reconstruct virtual monoenergetic images at high kiloelectron volt levels, which substantially reduces metal artifacts [5,10]. The most common type of metal artifact reduction (MAR) method is based on inpainting projection data that has been affected by metal. In these approaches, the metal objects are first automatically detected (e.g., via thresholding) in the uncorrected CT image. The metal objects are then forward-projected into the sinogram domain to obtain a metal trace. The projection data in this metal trace are treated as missing data and are interpolated, e.g., via linear interpolation (LIMAR) [11]. Meyer et al. proposed a modification of the LIMAR approach called normalized MAR (NMAR) [12]. NMAR uses a forward projection of an image prior to flatten the uncorrected sinogram before interpolation. This additional step smooths the sinogram, which reduces the streak artifacts caused by interpolation. In NMAR, the image prior is obtained by identifying air, soft tissue, and bone in either the uncorrected CT image or a pre-corrected LIMAR image.
With the rapidly increasing popularity of deep learning in medical imaging in recent years [13], a plethora of novel MAR methods have emerged. Deep learning networks are mostly trained in a supervised manner and thus require a metal-free and a corresponding metal-affected dataset. These metal-affected data are commonly synthesized by inserting metallic objects into the metal-free data. Zhang et al. presented a convolutional neural network (CNN) called CNN-MAR, which outputs an improved image prior [14]. This image prior is forward-projected, and the resulting sinogram data are used to fill in the metal trace in the original sinogram. Several CNN approaches that operate in the sinogram domain have been introduced [15][16][17]. Lossau et al. developed a sophisticated sinogram inpainting approach that works in the presence of motion. A segmentation network identifies the metal trace in the projection domain; a second network fills in the missing sinogram data; and, after reconstruction, a third network reinserts the metal objects in the corrected image [18]. Image-based CNNs are a popular class of deep learning MAR techniques. They take the uncorrected images as input and learn a direct mapping either to the artifact-free images [14,19,20] or to the artifact residuals [21]. These image-based methods often rely on input data that has already been pre-corrected to produce reasonable results [14,19]. Another option for MAR in the image domain is unsupervised image-to-image translation, which has the advantage that no synthesized metal artifacts are necessary and thus training can be conducted with unaltered clinical data [22][23][24]. Compared to supervised models, unsupervised models can achieve similar performance on synthetic data [22]. Lin et al. recently proposed an end-to-end trainable network called Dual Domain Network (DuDoNet) [25]. It consists of a sinogram enhancement network and an image enhancement network, which are connected by a Radon Inversion Layer (RIL). The RIL reconstructs the CT images using the filtered back projection (FBP) and allows gradient propagation during training.
In this work, we present an end-to-end deep learning CT reconstruction called iCTU-Net for the correction of metal artifacts. The network learns the mapping from the metal-affected sinograms to the artifact-free images. It consists of three parts, which are trained simultaneously: sinogram refinement, back projection, and image refinement. To our knowledge, we are the first to train a single end-to-end deep learning network with a learnable back projection operation for the task of reducing metal artifacts. Since the whole reconstruction process, including the back projection, is learned, the network is able to freely adapt the reconstruction to the imperfections of the sinogram data. The reconstruction is trained in a supervised manner with simulated interventional training data. We focus on liver interventions; thus, we generate abdominal liver data, including metal objects. We compare our iCTU-Net to the classical NMAR algorithm and to a sinogram refinement and an image refinement deep learning network. Both of these networks employ the same U-Net architecture that is used in our network, which allows for a fair comparison. These reference networks were selected to investigate the performance of deep learning MAR approaches in three different domains: sinogram pre-processing, image post-processing, and reconstruction.

iCTU-Net
The design of our iCTU-Net displayed in Figure 1a is based on the iCT-Net by Li et al. [26], which in turn is inspired by the classical FBP. The reconstruction is trained end-to-end, i.e., the inputs of the iCTU-Net are sinograms and the outputs are reconstructed images. The network includes pre-processing layers and aims to emulate the filtration of the sinograms and the back projection into the image domain. Post-processing layers were used to further refine the reconstruction. The network performs the complete CT image reconstruction and does not require a predefined back projection operator or the exact X-ray beam geometry.
In a first step, the refining layers, which use 3 × 3 convolutions, are intended to suppress disturbances in the raw measurement data, such as excessive noise. The corrected sinogram is then filtered via 10 × 1 convolutions (filtering layers). By using 1 × 1 convolutions after the refining and filtering layers and by applying padding in all convolutions, the refined and filtered sinogram maintains the same size as the input sinogram. The convolutions in the refining layers employ a shrinkage activation function with a threshold of 0.0001 [26]. For the filtering layers, a tanh activation function is used. Afterwards, the refined and filtered sinogram is projected into the image space in a back projection step. This is realized by a d × 1 convolution with N² output channels without padding, where d is the number of detector elements and N is the output image size. This convolution connects every detector element with every pixel in the image space. Since the back projection is learned, sinograms acquired with different beam geometries, such as parallel beam and fan beam, can be used to train the network. Then, the results for each view angle v are reshaped to images of size N × N and rotated according to the acquisition angle. The acquisition angle of the projections is the only geometrical information provided to the network. The rotated images are linearly interpolated and cropped to maintain an image size of N × N. The back projected image is then obtained by combining all views with a 1 × 1 convolution using a leaky Rectified Linear Unit (ReLU) activation function [27]. Finally, the image output is further refined by a U-Net. The U-Net is a popular choice for post-processing to reduce artifacts in CT imaging [28].
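The activation functions and the size of the learned back projection can be made concrete with a short sketch. It assumes that the shrinkage activation is the standard soft-thresholding function, which the text does not spell out, and it plugs in d = 736 and N = 512 from the data generation section purely for illustration; the network may operate at a different resolution. All function names are hypothetical.

```python
import math


def soft_shrink(x: float, threshold: float = 1e-4) -> float:
    """Soft-thresholding: values within +/- threshold become zero and
    larger magnitudes are pulled toward zero by the threshold.
    Assumed form of the shrinkage activation in the refining layers."""
    if x > threshold:
        return x - threshold
    if x < -threshold:
        return x + threshold
    return 0.0


def filtering_activation(x: float) -> float:
    """tanh activation, as used in the filtering layers."""
    return math.tanh(x)


def backprojection_weight_count(d: int, n: int) -> int:
    """Number of weights in a d x 1 convolution with n * n output channels
    (assumption: one input channel, bias terms ignored)."""
    return d * n * n


if __name__ == "__main__":
    print(soft_shrink(5e-5))                       # below the threshold
    print(soft_shrink(0.5))                        # above the threshold
    print(backprojection_weight_count(736, 512))   # illustrative sizes only
```

Under these illustrative sizes, the back projection layer alone would hold roughly 193 million weights, since every detector element is connected to every image pixel.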

Reference MAR Networks
To compare our iCTU-Net to other methods, we implement two deep learning MAR algorithms similar to those of Gjesteby et al. Both networks use pre-corrected NMAR inputs [17,19]. One is based in the projection domain (U-Net Sino), and the other one in the image domain (U-Net Image). To ensure comparability, we use the same U-Net architecture in the iCTU-Net, U-Net Sino, and U-Net Image. In the U-Net Sino, the sinograms are first refined by a U-Net, and the result is then reconstructed using the FBP [17]. In the U-Net Image, the sinograms are first reconstructed with the conventional FBP and then refined with a U-Net [19]. These reference networks were chosen to allow a comparison of sinogram pre-processing, image post-processing, and reconstruction deep learning MAR techniques.
The U-Net architecture is shown in Figure 1d and is similar to the original U-Net by Ronneberger et al. [29]. It has four encoding and four decoding blocks consisting of 3 × 3 convolutions, which are connected via skip connections. Zero-padding is used in the convolutions to ensure that the network output is the same size as the network input. The blocks of the top level have 32 channels, and the number of channels is doubled with each encoding block until the lowest block has 512 channels. Downsampling in the contracting path is performed via 2 × 2 max-pooling with stride 2, while upsampling in the expansive path is accomplished using 3 × 3 transposed convolutions with stride 2. All convolutional layers are followed by a ReLU activation function.
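The channel and resolution progression described above can be tabulated with a few lines of code. This is a sketch under two assumptions: that the network has five resolution levels, which is what doubling from 32 to 512 channels implies, and that the input is 512 × 512, as for the image-domain data. The function name is hypothetical.

```python
def unet_layout(base_channels: int = 32, levels: int = 5, input_size: int = 512):
    """Return (channels, spatial_size) for each resolution level, top to bottom.

    Channels double and the spatial size halves (2 x 2 max-pooling with
    stride 2) at every step down the contracting path.
    """
    layout = []
    channels, size = base_channels, input_size
    for _ in range(levels):
        layout.append((channels, size))
        channels *= 2
        size //= 2
    return layout


if __name__ == "__main__":
    for channels, size in unet_layout():
        print(f"{channels:4d} channels at {size} x {size}")
```

Because zero-padding keeps the convolutions size-preserving, the expansive path mirrors this table in reverse and the output matches the input size.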

Data Generation
To simulate the training data, we use the XCAT phantom, which provides highly detailed whole-body anatomies [30]. The phantom includes female and male models of different ages, providing a wide variety of patient geometries. Further customization of anatomies by changing organ sizes is possible. We create 40 different XCAT models for training and 10 additional models for testing, resulting in 3964 training slices and 991 test slices, each of size 512 × 512 pixels with an in-plane resolution of 1 × 1 mm² and a slice thickness of 2 mm. Because we choose liver interventions as a use case, we generate abdominal XCATs that include the whole liver.
Organ masks can be easily obtained within the XCAT framework. Utilizing these organ masks, we insert metal structures inside the veins of the XCAT phantoms, emulating contrast agents or interventional instruments, such as catheters. Metal objects are only placed inside thicker blood vessels and have a uniform size, independent of the blood vessel size. This is realized by first eroding the blood vessel masks of the XCAT phantom, using a disk with a radius of 3 pixels as a structuring element. The erosion is performed to exclude the smallest blood vessels. To obtain the final metal mask, we skeletonize the mask and then increase the thickness via dilation using a disk with a radius of 3 pixels. An example is shown in Figure 2, with the metal mask in red, the initial blood vessels in white, and the liver in green. Most of the metal structures are placed inside the liver or in the portal vein beneath the liver. Our data generation pipeline is shown in Figure 3, which starts with the generation of the ground truth data in the first row. First, we create sinograms by forward projecting the XCAT image data using a parallel beam geometry with 736 projection beams and 360 projection angles. A polychromatic X-ray spectrum and the energy dependence of the absorption coefficients are considered in the forward projection (Equation (1)), in which each energy bin E_i is weighted by the energy spectrum η(E_i). An incident flux of I₀ = 4 · 10⁶ photons is used, which is slightly increased compared to clinical levels [31], to combat photon starvation due to the presence of the metal objects. The X-ray energy spectrum is generated using the SpekCalc software with a tube peak voltage of 100 kVp and a 1 mm aluminium filter [32]. We use 91 energy bins from 10 keV to 100 keV with a uniform size of 1 keV. The organ masks provided by the XCAT framework make it possible to assign an energy-dependent attenuation coefficient µ(x, E_i) to each organ. The sinograms are then reconstructed via FBP.
Since the energy dependence of the attenuation coefficients is accounted for in the forward projection, beam hardening is present in the ground truth data. To simulate data affected by metal, we utilize the previously mentioned metal mask to insert the attenuation coefficient of iron. Afterwards, metal sinograms are created via forward projection using Equation (1). Noise is then added, the projection data are normalized, and the negative logarithm is applied (Equation (2)). The photon production, attenuation, and detection are described by a Poisson distribution. Electronic noise of the detector is simulated with a Gaussian distribution N with a mean value of zero and a variance of σ² = 40 [33,34]. A subsequent FBP results in an image containing metal artifacts. As input for the training of our networks, we do not use this artifact image; instead, we use data pre-corrected with the NMAR algorithm, as shown in the third row of Figure 3. The prior image used for the normalization in NMAR is obtained by segmentation of soft tissue and bone in a LIMAR image [11,12].
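Equations (1) and (2) are not reproduced in this text, so the following single-ray sketch rests on assumptions: it uses the standard polychromatic Beer-Lambert form, I = I₀ Σ_i η(E_i) exp(−Σ_k µ_k(E_i) l_k), and it replaces the Poisson noise by its normal approximation, which is only reasonable at high photon counts. The spectrum weights and attenuation values in the demo are placeholders, not SpekCalc output or real material data, and all function names are hypothetical.

```python
import math
import random


def polychromatic_intensity(i0, spectrum_weights, mu_by_energy, path_lengths):
    """Expected photon count behind the object for one ray.

    Assumed form: I = I0 * sum_i eta(E_i) * exp(-sum_k mu_k(E_i) * l_k),
    where k runs over the materials crossed by the ray.
    """
    total = 0.0
    for eta, mus in zip(spectrum_weights, mu_by_energy):
        line_integral = sum(mu * length for mu, length in zip(mus, path_lengths))
        total += eta * math.exp(-line_integral)
    return i0 * total


def log_projection(counts, i0):
    """Normalize by the incident flux and apply the negative logarithm."""
    return -math.log(counts / i0)


def noisy_counts(expected, rng, electronic_variance=40.0):
    """Add quantum and electronic noise to an expected photon count.

    Quantum noise: normal approximation to Poisson (mean = variance = expected).
    Electronic noise: zero-mean Gaussian with the given variance.
    """
    quantum = rng.gauss(expected, math.sqrt(expected))
    electronic = rng.gauss(0.0, math.sqrt(electronic_variance))
    return max(quantum + electronic, 1.0)  # avoid the log of a non-positive value


if __name__ == "__main__":
    i0 = 4e6
    energies_kev = list(range(10, 101))  # 91 bins of 1 keV from 10 to 100 keV
    weights = [1.0 / len(energies_kev)] * len(energies_kev)  # flat placeholder spectrum
    # Placeholder attenuation (1/cm) that decreases with energy; not real data.
    mu = [[0.5 * (30.0 / e)] for e in energies_kev]
    expected = polychromatic_intensity(i0, weights, mu, [10.0])
    rng = random.Random(0)
    print(log_projection(expected, i0))
    print(log_projection(noisy_counts(expected, rng), i0))
```

With a single energy bin, the log projection reduces to the plain line integral µ · l. With several bins, it falls below the spectrum-averaged line integral, which is the beam hardening effect present in both the ground truth and the metal data.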

Training
The networks are trained with the SSIM loss function using the Adam optimizer with a learning rate of 0.001 [35]. We apply L2 regularization to the network weights, with a weighting factor of 10⁻⁶. Each network is trained for 25 epochs. The training data in the image domain are windowed to [−1000, 1000] HU and then mapped to the interval [−1, 1]. The whole image slices are used for training, and no patches are extracted. The sinogram training data are neither windowed nor normalized. The input and label images for the iCTU-Net (green), U-Net Sino (blue), and U-Net Image (red) are noted in Figure 3.
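The image-domain preprocessing amounts to a clip followed by a rescaling. The sketch below assumes that the mapping to [−1, 1] is linear, which the text does not state explicitly; the function names are hypothetical.

```python
def window_and_normalize(hu: float, lo: float = -1000.0, hi: float = 1000.0) -> float:
    """Clip a Hounsfield value to [lo, hi] and map it linearly to [-1, 1]."""
    clipped = min(max(hu, lo), hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0


def denormalize(value: float, lo: float = -1000.0, hi: float = 1000.0) -> float:
    """Inverse mapping from [-1, 1] back to Hounsfield units."""
    return (value + 1.0) * (hi - lo) / 2.0 + lo


if __name__ == "__main__":
    for hu in (-2000.0, -1000.0, 0.0, 300.0, 3000.0):
        print(hu, window_and_normalize(hu))
```

Values outside the window, such as dense metal far above 1000 HU, saturate at the interval limits, so the inverse mapping recovers them only up to the window boundary.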

Evaluation
The reconstructions are evaluated by calculating the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) for the test data. We set the background values of ground truth and reconstructions to −1000 HU to focus the analysis on the body region where the relevant anatomy is located. For the evaluation, the slices of the test dataset are divided into three categories: no metal, moderate metal artifacts, and severe metal artifacts, with 106, 748, and 137 slices, respectively. This separation allows one to evaluate the reconstructions when no metal is present. A slice containing metal is assigned to the severe metal artifact category if the FBP yields an SSIM value of less than 0.7, and to the moderate metal artifact category if the SSIM value of the FBP is greater than or equal to 0.7. The SSIM threshold is chosen such that the number of slices with severe metal artifacts is similar to the number of slices without metal.
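The categorization rule and the PSNR metric can be sketched as follows. Two points are assumptions rather than statements from the text: whether a slice contains metal is taken as a known flag (for example, from the metal mask), and the data range used in the PSNR must be supplied by the caller because the text does not specify it. The function names are hypothetical.

```python
import math


def artifact_category(contains_metal: bool, fbp_ssim: float, threshold: float = 0.7) -> str:
    """Assign a test slice to one of the three evaluation categories."""
    if not contains_metal:
        return "no metal"
    return "severe" if fbp_ssim < threshold else "moderate"


def psnr(reference, reconstruction, data_range: float) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    print(artifact_category(True, 0.65))
    print(artifact_category(True, 0.70))
    print(psnr([0.0, 1.0], [0.0, 0.0], data_range=2.0))
```

The three category sizes given above, 106, 748, and 137, add up to the 991 test slices.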

Experiments
We conduct three experiments. First, we configure our iCTU-Net in an ablation study. Then, we investigate the impact of different sinogram input data for training in an input study. Finally, we compare our best network configuration with state-of-the-art MAR algorithms.
In the ablation study, we investigate different post-processing layers and loss functions. The purpose of the ablation study is to find settings for the iCTU-Net that yield the best reconstructions. The resulting network configuration will be used in following studies.
We train three networks with different post-processing layers after the back projection: no post-processing, three convolution layers, and a U-Net. All of these networks are trained with the SSIM loss and with pre-corrected NMAR sinograms as input. To investigate the influence of the loss function, we additionally train the U-Net post-processing network with the MSE loss. Both SSIM and MSE are commonly used loss functions in CT artifact reduction and CT reconstruction [28].
In the input study, we train the network with different sets of training input data in addition to the previously used pre-corrected NMAR sinograms. The idea behind the input study is to find out how the network behaves for different kinds of sinogram input data. We use sinograms without metal (ground truth sinogram in Figure 3 but with additional noise added via Equation (2)) to investigate the network's performance if no metal is present. In this way, the reconstruction performance and the ability to mitigate metal artifacts can be evaluated separately. We calculate the evaluation metrics for different categories of artifact severity, even though none of the test data contain any metal. Nevertheless, the categories are used to allow fair comparisons to the other networks. We also train a network with uncorrected metal sinograms (noisy metal sinogram in Figure 3), to see if an NMAR pre-correction is necessary.
Finally, in the comparison study, we compare our iCTU-Net with the NMAR sinogram inpainting algorithm and the U-Net Sino and U-Net Image networks described earlier.

Ablation Study
The results of the evaluation metrics for the ablation study are shown in Table 1, and reconstructed images are shown in Figure 4. We first investigate the impact of the different post-processing layers. Compared to using the U-Net for post-processing, using no post-processing or three convolutional layers generally performs worse, especially for severe artifacts. When using no post-processing, clear streak and extinction artifacts are present. Using the three convolutional layers for post-processing improves SSIM and PSNR, and the streak and extinction artifacts disappear. However, the geometry of some soft tissue organs such as the liver is not reconstructed correctly, which is particularly evident for severe metal artifacts. Using the U-Net as the final layers of the network substantially improves the evaluation metrics, completely eliminates artifacts, and reconstructs organs more accurately. When no artifacts are present, the iCTU-Net underperforms compared to the FBP, especially in terms of PSNR. As shown by the arrows in the zoomed regions in Figure 4, the iCTU-Net is not capable of resolving small structures of only a few millimeters in size. From now on, we will only use the U-Net for post-processing as it yields the best results.
Finally, we train the iCTU-Net with the MSE loss. For no artifacts and moderate artifacts, the SSIM and PSNR evaluation metrics for the SSIM and MSE losses are similar. However, the SSIM metric for the MSE loss is considerably worse for severe artifacts, and the reconstructions of the MSE iCTU-Net in Figure 4 look grainy. Thus, the network with U-Net post-processing layers combined with the SSIM loss performs best. Only this network configuration is referred to as iCTU-Net in this work.
Figure 4. Results of the ablation study, where the ground truth and FBP are compared to different iCTU-Net settings. All networks are trained with pre-corrected NMAR sinograms and the SSIM loss, except for the MSE iCTU-Net, which is trained with the MSE loss. A slice without metal artifacts, with moderate metal artifacts, and with severe metal artifacts is shown. The scans are windowed to [−300 HU, 300 HU] to increase the visibility of the artifacts. The arrows in the zoomed regions indicate small structures that the iCTU-Net cannot resolve accurately.

Input Study
In the input study, we investigate different sinogram inputs for the iCTU-Net. The results of the evaluation metrics for the input study are shown in Table 2, and reconstructed images are shown in Figure 5. The SSIM and PSNR in Table 2 show that the network performs similarly independent of the input. The network trained without metal in the input sinogram achieves the best PSNR, and the network trained with the pre-corrected NMAR sinograms achieves the best SSIM. However, these differences in SSIM and PSNR are not significant. For the metal input, some reconstruction inaccuracies close to metal objects can be observed, as indicated by the arrows in the zoomed images in Figure 5. Apart from this, the reconstructions in Figure 5 show no noticeable differences in image quality. Therefore, we continue to use the pre-corrected NMAR sinograms for the iCTU-Net. This allows for a fairer comparison with the deep learning reference methods, since they also use NMAR inputs.
Table 2. SSIM and PSNR evaluation metrics for the input study. The differentiation of artifact severity is not meaningful for No Metal Input because none of the test data contain metal. Since this network is not trained with any metal data, it is not suitable for artifact reduction. However, to allow a reasonable comparison to the other methods, we keep the categories, meaning the same slices are used for evaluation. The best result for each metric is marked bold.
Figure 5. Results of the input study, where the ground truth and FBP are compared to iCTU-Nets trained with different input sinograms. All networks are trained with the U-Net post-processing layers and the SSIM loss, which yield the best results in the ablation study. No Metal Input, Metal Input, and iCTU-Net are, respectively, trained with metal-free, metal, and NMAR pre-corrected sinograms. A slice without metal artifacts, with moderate metal artifacts, and with severe metal artifacts is shown. The scans are windowed to [−300 HU, 300 HU] to increase the visibility of the artifacts. The arrows in the zoomed images indicate an anatomy that the Metal Input network cannot resolve accurately.

Comparison Study
The results of the evaluation metrics for the comparison study are shown in Table 3, and reconstructed images are shown in Figure 6. The deep learning reference methods U-Net Sino and U-Net Image both perform better than NMAR in terms of SSIM, especially for severe artifacts. In terms of PSNR, they perform worse when artifacts are not present and similarly when artifacts are present. The U-Net Image achieves a slightly higher SSIM than the U-Net Sino, but the performance of both methods is very similar. In Figure 6, no substantial removal of metal artifacts can be observed for the U-Net Sino and U-Net Image; only a smoothing of the streak artifacts is observed for the U-Net Image.
Figure 6. Results of the comparison study, where the ground truth and FBP are compared to NMAR, U-Net Sino, U-Net Image, and iCTU-Net. A slice without metal artifacts, with moderate metal artifacts, and with severe metal artifacts is shown. The scans are windowed to [−300 HU, 300 HU] to increase the visibility of the artifacts. The arrows and circles in the zoomed images indicate anatomies that could only be recovered by the iCTU-Net.
Without artifacts, the iCTU-Net is outperformed by all methods in terms of PSNR and SSIM, as they are all FBP-based and the FBP already outperformed the iCTU-Net in the ablation study. For moderate artifacts, the iCTU-Net achieves competitive SSIM values compared to the reference methods but performs worse in terms of PSNR. Nevertheless, the iCTU-Net is the only method capable of completely removing moderate metal artifacts, as shown in Figure 6. As indicated by the arrows in the zoomed images in Figure 6, the iCTU-Net is also the only method that can restore a blood vessel into which a metal object has been inserted. For severe artifacts, the iCTU-Net performs better than all reference methods with SSIM = 0.970 ± 0.009 and PSNR = 40.7 ± 1.6. The second best method, the U-Net Image, only achieved SSIM = 0.944 ± 0.024 and PSNR = 39.8 ± 1.9. Averaged over all images, the SSIM of the iCTU-Net is competitive with the U-Net Image, but a worse PSNR is achieved. The iCTU-Net is able to remove severe metal artifacts completely, whereas for the other methods strong streak artifacts are still present over the whole image. The iCTU-Net can not only efficiently remove severe artifacts but also reliably restore the anatomy that is obstructed by these artifacts. This is especially evident inside the circles shown in the zoomed images in Figure 6. All other methods fail to restore the anatomy in this region.

Discussion
We trained the iCTU-Net with metal-affected data to investigate its ability to mitigate metal artifacts. The iCTU-Net outperformed the reference methods for reconstructions with severe metal artifacts. Similar results were found for the application of the iCTU-Net to sparse-angle CT reconstruction, where the iCTU-Net showed good performance for a small number of projections [28]. However, the iCTU-Net was not able to resolve small structures of only a few millimeters in size. The reconstructions were slightly blurred, which is probably the reason why the iCTU-Net could not match the quality of the FBP when no metal was present. In the ablation study, it was found that the loss function and the post-processing layers have a major impact on the quality of the reconstruction. We attempted to sharpen the reconstructed image by combining the SSIM loss with an additional gradient difference loss [36], but no substantial improvements were observed. In the future, we will investigate alternatives to the U-Net as post-processing layers to further optimize the network. The iCTU-Net was trained with a dataset of 3964 slices, of which only 310 (about 8%) contained no metal. Due to this small fraction of metal-free training data, the network might not be able to learn how to properly reconstruct metal-free sinograms.
In the input study, we trained the reconstruction network exclusively with metal-free data to test this hypothesis. We found that the network trained with metal-free raw data did not perform better than the iCTU-Net for the no artifact category. Therefore, we can conclude that training the network with mainly metal-affected data does not degrade the quality of the reconstructions. Interestingly, the evaluation metrics for the moderate and severe artifact categories also did not differ substantially. Thus, the network trained and tested with metal-affected data reconstructs images just as well as the network trained and tested with metal-free data. This shows that the iCTU-Net reliably reduces metal artifacts. This is confirmed by the fact that all networks in the input study performed very similarly for all severities of artifacts. The network seems to handle metal objects in the raw data very well.
The input study showed that the iCTU-Net performs similarly regardless of the sinogram input data used. Training the network with uncorrected metal sinograms revealed similar performances compared to the network trained with pre-corrected NMAR sinograms. This means that reconstruction without pre-correction is feasible, which reduces the complexity of the algorithm.
In the comparison study, a sinogram pre-processing and an image post-processing approach were investigated. We have found that the image-based post-processing deep learning approach provides better results than the sinogram pre-processing approach. This is consistent with the findings of Arabi et al. [37]. Since the reference methods are all FBP-based, they are superior to the iCTU-Net in the absence of artifacts due to the aforementioned blurring. However, the artifacts introduced by the FBP cannot be completely mitigated by the reference methods. The iCTU-Net is the only method that removes all metal artifacts and yields the best results of all methods for severe metal artifacts. Since the iCTU-Net is trained end-to-end, the network can fully utilize the raw data and learn to reconstruct an artifact-free image. The U-Net Sino learns to mitigate disturbances in the sinogram with the raw data as input. However, small errors in the sinogram can lead to significant deviations in the reconstruction [28], which the U-Net Sino cannot correct. The U-Net Image only mitigates the artifacts in the image domain introduced by the FBP. In doing so, the network no longer has the original raw data to learn from.
The usage of digital XCAT phantom data for metal data simulation instead of real patient data has several advantages. First of all, with the organ masks provided by the XCAT, metal objects can automatically be inserted in specific body regions. In this work, we inserted iron into the blood vessels. For future studies it would be better to insert attenuation coefficients of materials that are commonly used for contrast agents and catheters. Moreover, for the simulation of polychromatic projections, it is not necessary to segment the images into soft tissue, bone, and metal to assign the corresponding attenuation coefficients, as is done in several other works [14,21,37]. Instead, the organ masks of the XCAT allow for the insertion of energy-dependent attenuation coefficients for every organ. In the future, it will be desirable to test the iCTU-Net on experimental raw data instead of simulated data. However, this requires the iCTU-Net to be adapted to work with the raw data of multirow detector CT scanners. The two-dimensional projection data might lead to restrictions due to GPU memory limitations. Since dual-energy CT has been shown to help reduce metal artifacts [5,10], the iCTU-Net should benefit from the additional spectral information. Photon-counting CT is another spectral technology that can be used to reduce metal artifacts [38]. The energy of individual photons can be measured by energy-resolving detectors [39]. The iCTU-Net is readily applicable to energy-resolved raw data by including the energy information in separate input channels. The additional spectral information in the raw data is expected to mitigate beam hardening artifacts.
We will also investigate the ability of the iCTU-Net to simultaneously mitigate different kinds of artifacts. This is achievable by using a training dataset that contains a combination of artifacts. Promising results for the isolated mitigation of artifacts with the iCTU-Net in low-dose CT and sparse-angle CT have already been shown [28].

Conclusions
The presented end-to-end deep learning CT reconstruction algorithm was trained with simulated interventional data to mitigate metal artifacts during reconstruction. We showed that the iCTU-Net reconstruction MAR approach is better suited to mitigate metal artifacts than commonly used sinogram pre-processing and image post-processing deep learning approaches. The iCTU-Net is the only studied method that can eliminate the metal streak artifacts. However, the end-to-end reconstruction approach performs worse than the other approaches when no artifacts are present. Reconstructions without any metal showed that the iCTU-Net is prone to blurring. Because the whole reconstruction is learned, the network is able to fully utilize the raw data, which benefits the removal of metal artifacts. In the future, we will try to improve the network architecture by investigating alternative loss functions and post-processing layers to avoid blurring. We will also train networks with data including different kinds of artifacts to investigate simultaneous mitigation of several types of artifacts.