Accurate Transmission-Less Attenuation Correction Method for Amyloid-β Brain PET Using Deep Neural Network

The lack of physically measured attenuation maps (µ-maps) for attenuation and scatter correction is an important technical challenge in brain-dedicated stand-alone positron emission tomography (PET) scanners. The accuracy of calculated attenuation correction is limited by the nonuniformity of tissue composition due to pathologic conditions and the complex structure of facial bones. The aim of this study is to develop an accurate transmission-less attenuation correction method for amyloid-β (Aβ) brain PET studies. We investigated the validity of a deep convolutional neural network trained to produce a CT-derived µ-map (µ-CT) from the activity and attenuation maps simultaneously reconstructed using the MLAA (maximum likelihood reconstruction of activity and attenuation) algorithm for Aβ brain PET. The performance of three different structures of U-net models (2D, 2.5D, and 3D) was compared. The U-net models generated less noisy and more uniform µ-maps than the MLAA µ-maps. Among the three U-net models, the patch-based 3D U-net model reduced noise and cross-talk artifacts most effectively. The Dice similarity coefficients between the µ-map generated using the 3D U-net and µ-CT in the bone and air segments were 0.83 and 0.67, respectively. All three U-net models showed better voxel-wise correlation of the µ-maps compared to MLAA, and the patch-based 3D U-net model was the best. While the uptake values of MLAA yielded a high percentage error of 20% or more, the uptake values of the 3D U-net yielded the lowest percentage error, within 5%. The proposed deep learning approach, which requires no transmission data, anatomic image, or atlas/template for PET attenuation correction, remarkably enhanced the quantitative accuracy of the simultaneously estimated MLAA µ-maps from Aβ brain PET.


Introduction
Among the many different physical and technical factors affecting the image quality and quantitative accuracy of positron emission tomography (PET) images, attenuation of annihilation photon pairs due to photoelectric absorption and Compton scattering is the single largest factor [1]. In older PET scanners not combined with computed tomography (CT) or magnetic resonance imaging (MRI), transmission sources with long-lived radioisotopes, such as 68Ga/68Ge and 137Cs, were used to acquire the transmission and blank scans needed for attenuation correction (AC) in PET [2,3]. In PET/CT, CT images are converted into the PET attenuation map (µ-map) using a piecewise linear relationship between the CT Hounsfield unit and the linear attenuation coefficient for 511 keV photons [4,5]. Although PET quantification errors due to artifacts in CT images and spatiotemporal mismatch between CT and PET still exist [6-10], CT-based AC is the current standard in PET/CT.

Data Acquisition

PET/CT data were acquired using a Biograph mCT40 scanner (Siemens Healthcare, Knoxville, TN, USA) with a timing resolution of 580 ps. PET/CT imaging was performed for 10 min in a single PET bed position, 90 min after an intravenous injection of 18F-Florbetaben (305.9 MBq on average). The head of each participant was positioned in a head holder attached to the patient bed, and the PET/CT scan was conducted following the routine clinical protocol for brain studies (topogram, CT, and emission PET scans). All datasets were reconstructed using ordered-subset expectation maximization (OSEM, 3 iterations, 21 subsets, 5 mm Gaussian post-filter) with the CT-derived attenuation map (µ-CT) and using maximum likelihood reconstruction of activity and attenuation (MLAA, 6 iterations, 21 subsets, 5 mm Gaussian post-filter) with time-of-flight (TOF) information [56-59]. To mitigate the non-unique global scaling problem in MLAA, a boundary constraint was applied during the attenuation image estimation. The initial µ-map estimate of MLAA was a uniform image filled with 1.0. The scatter sinogram was estimated from µ-CT using single scatter simulation. The dimensions and voxel size of the reconstructed PET images were 200 × 200 × 109 and 2.04 mm × 2.04 mm × 2.03 mm, respectively. To obtain µ-CT, the CT images, which had dimensions of 512 × 512 × 149 and a voxel size of 0.59 mm × 0.59 mm × 1.5 mm, were resampled to the same dimensions and voxel size as the PET images.
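For illustration, the piecewise linear (bilinear) HU-to-µ conversion mentioned above can be sketched as follows. This is a minimal sketch: the breakpoint and bone slope shown here are generic textbook assumptions, not the calibration used by the scanner in this study.

```python
import numpy as np

def hu_to_mu_511kev(ct_hu, breakpoint_hu=0.0, mu_water=0.096,
                    bone_slope=5.0e-5):
    """Piecewise linear (bilinear) conversion of CT Hounsfield units to
    linear attenuation coefficients (cm^-1) at 511 keV. Breakpoint and
    bone slope are illustrative; scanners calibrate them to the CT kVp."""
    ct_hu = np.asarray(ct_hu, dtype=np.float32)
    # Below the breakpoint (air to soft tissue): scale water attenuation.
    mu = mu_water * (ct_hu + 1000.0) / 1000.0
    # Above the breakpoint (bone): steeper, kVp-dependent slope.
    bone = ct_hu > breakpoint_hu
    mu[bone] = mu_water + bone_slope * (ct_hu[bone] - breakpoint_hu)
    return np.clip(mu, 0.0, None)
```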

Network Architecture
As mentioned previously, three different U-net models (2D, 2.5D, and 3D) were designed, trained, and tested. The U-net models were trained to produce µ-maps equivalent to µ-CT once the activity and attenuation maps obtained from MLAA (λ-MLAA and µ-MLAA) are given to the U-net as input (Figure 1). All input and output image sizes were 200 × 200 for the 2D U-net model (slice-to-slice translation), which is the most straightforward structure. For the 3D U-net model, input voxel patches with a size of 32 × 32 × 32 were extracted from the head region of λ-MLAA and µ-MLAA with a stride of 4. The three-dimensional (3D) U-net is expected to provide more accurate attenuation maps with reduced axial discontinuities by using extensive local information from the 3D patches. The 2.5D U-net was designed to produce a central slice once three neighboring slices with a size of 200 × 200 were provided to the network (slab-to-slice translation). The 2.5D U-net was a compromise between the 2D and 3D U-nets, employing whole transaxial planes to reduce in-plane discontinuities while providing additional information from neighboring slices. The input image intensity was normalized to a range of 0-1.

Figure 1. U-net architectures used to learn µ-CT from λ-MLAA and µ-MLAA, which were concatenated and passed in two input channels. The first row in the right box shows the network architecture. The dimensions of the input/output images and feature maps differ depending on the dimension of the network model (2D, 2.5D, or 3D). The dimensions of the feature maps in each layer are shown in the second row in the right box.
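A minimal sketch of the 32 × 32 × 32 patch extraction with a stride of 4 described above is given below; the `head_mask` argument is a hypothetical placeholder, as the paper does not state how the head region was delineated.

```python
import numpy as np

def extract_patches_3d(lam_mlaa, mu_mlaa, patch=32, stride=4,
                       head_mask=None):
    """Extract paired 32x32x32 patches from the λ-MLAA and µ-MLAA
    volumes with a stride of 4; λ and µ are stacked as two channels."""
    nz, ny, nx = lam_mlaa.shape
    patches = []
    for z in range(0, nz - patch + 1, stride):
        for y in range(0, ny - patch + 1, stride):
            for x in range(0, nx - patch + 1, stride):
                sl = (slice(z, z + patch), slice(y, y + patch),
                      slice(x, x + patch))
                # Skip patches that do not intersect the head region.
                if head_mask is not None and not head_mask[sl].any():
                    continue
                patches.append(np.stack([lam_mlaa[sl], mu_mlaa[sl]],
                                        axis=-1))
    return np.asarray(patches)  # shape: (N, 32, 32, 32, 2)
```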
The U-net models consisted of convolution layers, rectified linear units (ReLU), 2 × 2 max pooling layers (2 × 2 × 2 in the 3D network), deconvolution layers, and a 1 × 1 (1 × 1 × 1 in the 3D network) convolution layer. In the contracting path, 3 × 3 convolutions (3 × 3 × 3 in the 3D network) followed by the ReLU function were repeated twice in each layer; the image dimension was then halved while the number of features was doubled by applying 2 × 2 max pooling (2 × 2 × 2 in the 3D network) with a stride of 2. In the expanding path, the image dimension was doubled and the number of features was halved by applying a 2 × 2 up-convolution (2 × 2 × 2 in the 3D network) that used nearest-neighbor interpolation. Skip connections concatenated the cropped feature maps from before each max pooling in the contracting path to the corresponding level of the expanding path. We implemented the networks using the TensorFlow library.
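As a concrete reference, the 2D variant described above can be sketched in Keras as follows. This is a minimal sketch under stated assumptions: the paper does not report the base channel count or network depth, so `base_filters=32` and `depth=3` are illustrative choices (a depth of 3 keeps the 200 × 200 slices evenly divisible through the pooling stages).

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, n_filters):
    """Two 3x3 convolutions, each followed by ReLU."""
    for _ in range(2):
        x = layers.Conv2D(n_filters, 3, padding="same",
                          activation="relu")(x)
    return x

def build_unet_2d(input_shape=(200, 200, 2), base_filters=32, depth=3):
    """2D U-net sketch: contracting path with 2x2 max pooling, expanding
    path with nearest-neighbor up-sampling and a 2x2 "up-convolution",
    skip concatenations, and a final 1x1 convolution producing the
    µ-map. λ-MLAA and µ-MLAA enter as two input channels."""
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    for d in range(depth):                        # contracting path
        x = conv_block(x, base_filters * 2 ** d)
        skips.append(x)
        x = layers.MaxPooling2D(2, strides=2)(x)
    x = conv_block(x, base_filters * 2 ** depth)  # bottleneck
    for d in reversed(range(depth)):              # expanding path
        x = layers.UpSampling2D(2, interpolation="nearest")(x)
        x = layers.Conv2D(base_filters * 2 ** d, 2, padding="same",
                          activation="relu")(x)
        x = layers.Concatenate()([skips[d], x])   # skip connection
        x = conv_block(x, base_filters * 2 ** d)
    outputs = layers.Conv2D(1, 1)(x)              # 1x1 conv -> µ-map
    return tf.keras.Model(inputs, outputs)
```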

Training and Validation
The cost function was the L1-norm between the U-net output (µ-CNN) and µ-CT. The cost function was minimized using the adaptive moment estimation method (Adam optimization). The network weights were initialized using the Xavier method, which assigns random weights considering the number of inputs and outputs [60]. The batch size was 30 in the 2D and 2.5D networks and 100 in the 3D network. The number of epochs was 50 in the 2D and 2.5D networks and 20 in the 3D network. The learning rate was initially 0.001 for all the networks and was reduced after each epoch by factors of 0.5, 0.5, and 0.8 in the 2D, 2.5D, and 3D networks, respectively.
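The training configuration can be sketched as follows. The helper below is a hypothetical wrapper (the dataset objects and steps-per-epoch value are assumed to come from the slice or patch pipeline above); note that Keras convolution layers already default to Xavier (glorot_uniform) initialization.

```python
import tensorflow as tf

def compile_and_train(model, train_ds, val_ds, steps_per_epoch,
                      batch_size=30, epochs=50, decay_rate=0.5):
    """L1 loss between the U-net output (µ-CNN) and µ-CT, Adam
    optimization, and a per-epoch learning-rate decay starting from
    0.001 (x0.5 per epoch for 2D/2.5D; x0.8, batch size 100, and
    20 epochs for the 3D network)."""
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3,
        decay_steps=steps_per_epoch,  # one decay step per epoch
        decay_rate=decay_rate,
        staircase=True)
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
                  loss=lambda y, y_pred: tf.reduce_mean(tf.abs(y - y_pred)))
    # Reshuffle before each epoch to randomize the batch composition.
    model.fit(train_ds.shuffle(10_000, reshuffle_each_iteration=True)
                      .batch(batch_size),
              epochs=epochs,
              validation_data=val_ds.batch(batch_size))
    return model
```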
The training data were shuffled before each epoch to randomize the batch composition. Using a Ryzen 1700X CPU with a GTX 1080 GPU, each epoch took approximately 12, 23, and 180 min for the 2D, 2.5D, and 3D networks, respectively.
We performed five-fold cross-validation to evaluate the performance of the networks. The 100 datasets were randomly divided into 5 groups of 20 datasets each. In each cross-validation fold, 3 groups (60 datasets) were used for training the networks, 1 group (20 datasets) was used for training validation, and the remaining group (20 datasets) was used for evaluation. Supplementary Figure S1 shows the validation error learning curves for the 2D, 2.5D, and 3D networks along the training epochs.
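A sketch of the five-fold partitioning is given below; the rotation of which group serves as validation versus evaluation in each fold is an assumption, as the paper only states the 60/20/20 split.

```python
import numpy as np

def five_fold_splits(n_subjects=100, n_folds=5, seed=0):
    """Shuffle 100 datasets into 5 groups of 20; in each fold, 3 groups
    (60 datasets) train the network, 1 group (20) monitors validation
    error, and the remaining group (20) is held out for evaluation."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(n_subjects), n_folds)
    for k in range(n_folds):
        test = groups[k]
        val = groups[(k + 1) % n_folds]
        train = np.concatenate([groups[(k + i) % n_folds]
                                for i in range(2, n_folds)])
        yield train, val, test
```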

Image Analysis
The µ-MLAA and µ-CNN maps were compared with µ-CT, the ground truth. The similarity of the µ-maps was evaluated using the Dice similarity coefficient, voxel-wise correlation, normalized root mean square error (NRMSE), and peak signal-to-noise ratio (PSNR) [41,52]. The Dice similarity coefficient was calculated for the air and bone segments. Voxels with µ-values greater than 0.1134 cm⁻¹ (=300 Hounsfield units) were regarded as the bone segment; those with µ-values less than 0.0475 cm⁻¹ (=−500 Hounsfield units) were regarded as the air segment [41]. Voxels with µ-values between these limits were regarded as soft tissue. The Dice similarity coefficient was calculated using the following equation:

Dice = 2·N(µ-CT ∩ µ-PET) / (N(µ-CT) + N(µ-PET)),

where N(µ-CT) and N(µ-PET) are the numbers of bone (or air) voxels in the µ-maps derived from CT (µ-CT) and PET (µ-MLAA or µ-CNNs) data, respectively, and N(µ-CT ∩ µ-PET) is the number of overlapping voxels between the µ-maps from CT and PET. The NRMSE enables comparison between datasets of different scales, and the PSNR is a quantitative measure of the difference between two images.
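The segment thresholds and Dice computation map directly to code; a minimal numpy sketch under those definitions:

```python
import numpy as np

BONE_THRESHOLD = 0.1134  # µ > 0.1134 (≈300 HU)  -> bone segment
AIR_THRESHOLD = 0.0475   # µ < 0.0475 (≈−500 HU) -> air segment

def dice_coefficient(mu_ct, mu_pet, segment="bone"):
    """Dice similarity of the bone (or air) segment between the
    CT-derived µ-map and a PET-derived µ-map (µ-MLAA or µ-CNN)."""
    if segment == "bone":
        a, b = mu_ct > BONE_THRESHOLD, mu_pet > BONE_THRESHOLD
    else:  # "air"
        a, b = mu_ct < AIR_THRESHOLD, mu_pet < AIR_THRESHOLD
    overlap = np.logical_and(a, b).sum()
    return 2.0 * overlap / (a.sum() + b.sum())
```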
The NRMSE and PSNR were computed as

NRMSE = √MSE / (Max − Min), PSNR = 20·log₁₀((Max − Min) / √MSE),

where Max and Min are the maximum and minimum intensities, respectively, of the reference image µ-CT; MSE is the average squared error between µ-CT and the PET-derived µ-maps (µ-MLAA or µ-CNNs); and m and n are the width and height of the image in pixels, over which the MSE is averaged. The PSNR measurement was also performed on the activity images reconstructed with the OSEM algorithm using the different µ-maps. Regional analysis was performed in the same manner as in our previous study with 18F-FP-CIT brain PET [41]. The ground-truth PET activity images, corrected for attenuation using µ-CT, were spatially normalized using statistical parametric mapping software (http://www.fil.ion.ucl.ac.uk/spm, accessed on 1 August 2019), and the same transformation parameters were applied to the images corrected using µ-MLAA and µ-CNNs. The standard uptake value (SUV) and the SUV ratio relative to the reference region (SUVr) were measured using automatic regions of interest (ROIs) predefined by statistical probabilistic anatomic maps [61,62].
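These similarity measures can be sketched as below; using the reference dynamic range (Max − Min) as the PSNR peak term is an assumption consistent with the definitions given above, and other PSNR conventions exist.

```python
import numpy as np

def nrmse_psnr(reference, test):
    """NRMSE and PSNR of a test image (µ-MLAA, µ-CNN, or a reconstructed
    activity image) against the reference (µ-CT or the µ-CT-corrected
    activity image)."""
    mse = np.mean((reference - test) ** 2)             # mean squared error
    dynamic_range = reference.max() - reference.min()  # Max - Min
    nrmse = np.sqrt(mse) / dynamic_range
    psnr = 20.0 * np.log10(dynamic_range / np.sqrt(mse))
    return nrmse, psnr
```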

Comparison between 2D, 2.5D, and 3D U-Net
Training the 2D network is fast and requires less memory than training the 3D network. However, stacking 2D network outputs to generate 3D attenuation maps leads to discontinuities in the axial dimension. On the other hand, training the 3D network is computationally expensive and requires much more memory. As training a 3D network on whole PET volumes is generally difficult, training with 3D patches extracted from the whole volume is usually employed. However, 3D patch-based inference can also lead to discontinuities at the borders of patches. The 2.5D network has been proposed as a compromise between 2D and 3D, using multi-slice inputs. The 2D and 3D networks were evaluated in previous studies using different tracers (18F-FP-CIT and 18F-FDG, respectively) [40,41], but there has been no direct comparison of 2D, 2.5D, and 3D networks using the same tracer. In this study, we compared these architectures using the evaluation measures stated above.

Comparison between Single-Image Input and Multi-Input
As the noise patterns and cross-talk artifacts in λ-MLAA and µ-MLAA are highly correlated, our networks were designed to exploit useful features from the combination of λ-MLAA and µ-MLAA when generating attenuation maps. We therefore expect better performance than when employing only λ-MLAA or only µ-MLAA as input. To explore the relative importance of the λ-MLAA and µ-MLAA inputs, the 3D network was additionally trained and tested with each of the two inputs separately.
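The multi-input versus single-input configurations amount to a change in the channel dimension. Assuming `lam_mlaa` and `mu_mlaa` are co-registered volumes, the inputs can be formed as follows:

```python
import numpy as np

def make_network_input(lam_mlaa, mu_mlaa, mode="combined"):
    """Form the network input: λ-MLAA and µ-MLAA stacked as two
    channels (proposed), or a single-channel input for the ablation."""
    if mode == "combined":
        return np.stack([lam_mlaa, mu_mlaa], axis=-1)  # (..., 2)
    if mode == "lambda":
        return lam_mlaa[..., np.newaxis]               # (..., 1)
    return mu_mlaa[..., np.newaxis]                    # mode == "mu"
```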

Results
The µ-MLAA appeared very different from µ-CT, mainly owing to severe cross-talk artifacts between the activity and attenuation maps generated by MLAA. On the other hand, the U-net models generated less noisy and more uniform µ-maps (Figure 2). The µ-maps estimated using the original MLAA were considerably underestimated, which was mitigated by applying the deep neural networks. The performance of the patch-based 3D learning was superior to those of the 2D slice-to-slice and 2.5D slab-to-slice translations (Figures 2-4 and Table 1). The 2.5D U-net model did not yield a performance improvement over the 2D model. Figure 4 shows plots of the NRMSE between µ-maps along the axial slice location in a representative patient (the same patient as in Figures 2 and 3). As shown in this figure, the error relative to µ-CT (the ground-truth µ-map) was dramatically reduced by applying the deep neural networks to the MLAA output images. The NRMSE decreased as the dimension of the model increased from 2D to 3D; however, the incremental gain from 2D to 2.5D was not as large as that from 2.5D to 3D. The 3D patch-based learning was also useful in eliminating the discontinuities of pixel values in the axial direction that appear in the output of the 2D networks (Figure 2). Figure 4 also shows that the NRMSE between the µ-CNNs and µ-CT is larger in the facial bone areas (smaller-numbered slices in Figure 4), such as the nasal cavity, than in the cranial regions (larger-numbered slices). The deep neural networks resulted in a significant improvement from µ-MLAA to µ-CNNs in the Dice similarity coefficients relative to µ-CT (Table 1). The inaccuracy of µ-MLAA prevented proper segmentation of the bone and air regions based on the µ-values.
The Dice similarity coefficients between µ-MLAA and µ-CT in the bone and air segments obtained here with 18F-Florbetaben PET data (0.073 and 0.055) were smaller than those obtained in our previous study with 18F-FP-CIT PET (0.374 and 0.317) [41]. In both the bone and air segments, the mean Dice similarity coefficients increased, and their standard deviations decreased, by applying the deep neural networks. The 3D patch-based learning yielded higher similarity to the ground-truth bone and air segmentation than the 2D or 2.5D learning. Although the 2D U-net yielded lower Dice similarity coefficients (0.718 and 0.400 in bone and air, respectively) than the previous 18F-FP-CIT PET study in which the 2D U-net was also applied (0.787 and 0.575), the 3D U-net achieved better results (0.826 and 0.674).
The voxel-wise correlations of the µ-maps are summarized in Table 2, showing the improved accuracy of the µ-maps generated by the CNNs. All three U-net models showed better voxel-wise correlation of the µ-maps compared to MLAA, and the patch-based 3D U-net model was the best. The PSNR evaluation summarized in Table 3 also shows the improved µ-map generation and PET quantification achieved by applying the 3D patch-based learning. There were PSNR improvements of at least 18 dB in both the µ-map and the PET activity data by applying deep learning to the MLAA outputs, and the improvement was largest with the 3D U-net. Figure 5 shows the activity images corrected for attenuation using the different µ-maps and reconstructed using the OSEM algorithm, together with the SUV error maps relative to the ground truth (image corrected using µ-CT).

Table 3. Peak signal-to-noise ratio (PSNR) and normalized root mean square error (NRMSE) relative to the ground truth (µ-CT and λ-CT).

Figure 5. Activity images in standard uptake value (SUV) corrected for attenuation using different µ-maps and SUV error maps relative to the ground truth (image corrected using µ-CT). The smallest SUV error was obtained using µ-3D.

ROI analysis was conducted in four regions (putamen, caudate head, cerebellum, and occipital cortex) using activity maps that were attenuation-corrected using the proposed µ-maps, and the percentage errors of SUV and SUVr were calculated (Figure 6). While the uptake values of MLAA yielded high percentage errors of 20% or more, the uptake values of the 3D U-net yielded the lowest percentage errors, within 5%. Furthermore, the standard deviation of the MLAA errors was larger than that of the CNN outcomes. In both the SUV and SUVr quantification, the 3D U-net showed smaller bias and dispersion than the 2D or 2.5D U-nets.
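For reference, the regional percentage-error computation can be sketched as follows; `roi_mask` and `ref_mask` are hypothetical boolean masks standing in for the predefined probabilistic ROIs.

```python
import numpy as np

def suv_and_suvr_percent_error(activity, activity_gt, roi_mask, ref_mask):
    """Percentage errors of regional SUV and SUVr relative to the ground
    truth (activity image corrected using µ-CT)."""
    suv, suv_gt = activity[roi_mask].mean(), activity_gt[roi_mask].mean()
    suvr = suv / activity[ref_mask].mean()
    suvr_gt = suv_gt / activity_gt[ref_mask].mean()
    suv_err = 100.0 * (suv - suv_gt) / suv_gt
    suvr_err = 100.0 * (suvr - suvr_gt) / suvr_gt
    return suv_err, suvr_err
```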

The results of network training using only λ-MLAA or only µ-MLAA are shown in Figure 7. The errors in attenuation map generation were much smaller than those of the original MLAA. However, the combined input (λ-MLAA and µ-MLAA) yielded the smallest error, as shown in Figure 7. The amount of error in the networks trained using only λ-MLAA and only µ-MLAA was almost identical.

Discussion
In this study, we compared the performance of three different CNN schemes based on the U-net structure to predict µ-CT from the outcomes of MLAA simultaneous activity and attenuation reconstruction for 18F-Florbetaben Aβ PET. As shown in the results, the 3D patch-based learning used in our previous work on whole-body 18F-FDG PET [40] outperformed the 2D slice-to-slice translation strategy that was used in the 18F-FP-CIT brain PET study [23]. Although the addition of neighboring slices in the input (2.5D slab-to-slice translation) improved the µ-map prediction performance, the improvement was not remarkable. The general performance of the original MLAA in predicting the µ-map for 18F-Florbetaben was inferior to that for 18F-FP-CIT [41]. However, the 3D patch-based learning could overcome this limitation of MLAA and enabled accurate PET quantification. The performance gain of the 3D patch-based learning over the 2.5D slab-to-slice translation was much larger than that of the 2.5D slab-to-slice translation over the 2D slice-to-slice translation (Figure 4 and Tables 1-3). The extensive local information from a 3D patch would allow the CNN to better extract the features of the attenuation map, whereas only three slices from the 2.5D slab would not be enough to yield a significant improvement in CNN performance.
As shown in Figure 4, the error in predicting µ-CT was higher in the facial bone regions than in the cranial bone regions. This implies that the deep learning-based approach used in this study was not sufficiently good at reconstructing the fine anatomical details in the facial bone regions. However, this inferior result mainly appears to originate from the poor anatomical information provided by the MLAA reconstruction. Improved MLAA performance in more advanced PET systems with better timing resolution will be useful in overcoming this limitation of the proposed method [63-65]. Another useful approach that we should try in future studies is the modification of the loss function to obtain better-trained networks. Recently, Shi et al. showed an improvement over the present method for whole-body 18F-FDG PET studies by adding a loss function in the projection space (integration of attenuation coefficients along the line of response) to the image fidelity loss [43]. This modification is highly relevant because PET AC is performed in the projection space, not in the reconstructed image space. Further investigations of advanced loss functions for Aβ brain PET data are required.
As shown in Figure 7, training with the combined input (λ-MLAA and µ-MLAA) achieved better convergence and yielded smaller errors than training with only λ-MLAA or only µ-MLAA. This implies that both λ-MLAA and µ-MLAA help the network suppress noise and reduce cross-talk artifacts in attenuation map generation. In addition, the accuracy of training using only λ-MLAA and that using only µ-MLAA were almost identical, indicating that the two inputs are of almost equal importance to the networks. The superior performance of the proposed method may thus be mainly attributed to the utilization of combined inputs for generating enhanced attenuation maps, whereas conventional approaches focused on manipulating only µ-MLAA by applying a background prior and using constrained Gaussian mixture models [58,66].
There are a growing number of PET AC studies that have reported AC performance improvement by introducing generative adversarial networks (GAN) [38,39,67]. We have also attempted to apply a conditional GAN method that was effective in our previous studies on PET to MRI transformation and Aβ PET template generation [68,69] to the present task. However, we could not achieve significant performance improvement in the µ-map prediction.
Another active transmission-less approach for PET AC is the deep learning-based conversion of non-attenuation-corrected PET images to the µ-map or attenuation-corrected PET image [19]. Initial studies on this approach have also shown promising results in 18 F-FDG brain and whole-body PET studies. However, studies on tracers other than 18 F-FDG have rarely been conducted.
When compared with the MRI-based AC methods applied to an amyloid and tau PET dataset [47], our 3D patch-based learning with no MRI input shows results comparable to a CNN employing the Dixon sequence and slightly inferior to a CNN employing the integrated UTE/multi-echo Dixon sequence (Dice coefficient: ours = 0.83, CNN-Dixon = 0.84, CNN-integrated UTE/multi-echo Dixon = 0.87). However, this comparison serves only as a reference because the radiotracers and enrolled patients differ.
Our method has been validated using various tracers, including 18F-FP-CIT, 18F-FDG, 68Ga-DOTATOC [40,41,70], and 18F-Florbetaben (this study). A limitation of this study is that the network currently needs to be re-trained from scratch to apply this method to a new radiopharmaceutical; thus, a large number of scans is required for each radiopharmaceutical. To overcome this limitation, transfer learning that reuses a model developed for "old" tracers as the starting point for a model for "new" tracers should be considered in future work.

Conclusions
We have proposed an attenuation correction method combining MLAA with deep learning. The proposed deep learning approach, which requires no transmission data, anatomic image, or atlas/template for PET attenuation correction, remarkably enhanced the quantitative accuracy of the simultaneously estimated MLAA µ-maps from Aβ brain PET. The benefit of the 2.5D slab-to-slice translation over the 2D slice-to-slice translation was not significant, whereas the 3D patch-based approach outperformed the others. The combined input of λ-MLAA and µ-MLAA reduced the error of the generated attenuation maps compared to the single input of λ-MLAA or µ-MLAA, and both inputs contributed almost equally to the accuracy of the proposed method. Further study is needed to apply the proposed method to "new" tracers by utilizing previously trained networks. The proposed attenuation map generation could also be applied to other diagnosis-support tasks beyond attenuation correction.