Validation of a Saliency Map for Assessing Image Quality in Nuclear Medicine: Experimental Study Outcomes

Simple Summary: Since modern positron-emission tomography images are reconstructed with many nonlinear corrections, there is a need for a comprehensive evaluation method based on human vision instead of the conventional method using the count number. Image quality evaluation metrics related to human vision have been actively studied in the field of natural imaging, but there have been few reports in the field of nuclear medicine. This study's aim was to verify the appropriateness of an image quality assessment using a saliency map by comparing it with the gaze data obtained during evaluation. We calculated the Pearson's correlation coefficient between the gaze data and the saliency map. The correlation between the two was high, indicating that saliency mapping is a valid evaluation method.

Abstract: Recently, the use of saliency maps to evaluate the image quality of nuclear medicine images has been reported. However, that study only compared qualitative visual evaluations and did not perform a quantitative assessment. The study's aim was to demonstrate the possibility of using saliency maps (calculated from intensity and flicker) to assess nuclear medicine image quality by comparison with the evaluator's gaze data obtained from an eye-tracking device. We created 972 positron emission tomography images by changing the position of the hot sphere, imaging time, and number of iterations in the iterative reconstructions. Pearson's correlation coefficient between the saliency map calculated from each image and the evaluator's gaze data during image presentation was calculated. A strong correlation (r ≥ 0.94) was observed between the saliency map (intensity) and the evaluator's gaze data. This trend was also observed in images obtained from a clinical device. For short acquisition times, the gaze to the hot sphere position was higher for images with fewer iterations during the iterative reconstruction. However, no differences in iterations were found when the acquisition time increased. Saliency by flicker could be applied to clinical images without preprocessing, although compared with the gaze image, it increased slowly.


Introduction
Image quality evaluation in the field of nuclear medicine is based on objective physical and subjective visual evaluations. Both evaluation methods have advantages and disadvantages, and the information obtained from each is different. In some cases, it is impossible to obtain a high correlation between physical and visual evaluations [1], which may be because the evaluation criteria and tasks are different, and human visual characteristics are nonlinear [2]. The visual evaluation of medical image quality is important because diagnosis is based on the subjective judgment of the physician. However, since the visual evaluation depends on the evaluator, it is often hard to construct an evaluation environment that yields accurate results. Therefore, the benefits of establishing a physical evaluation method that correlates well with the visual evaluation would be significant. Many human vision-based image quality metrics have been proposed, and saliency, which represents how readily a region attracts human visual attention, is one such metric [3]. Although there are many applications of saliency maps in the medical field, such as lesion detection [4,5] and segmentation [6,7], there are few examples of their use for image quality evaluation in nuclear medicine, such as Hosokawa's study [8]. Hosokawa et al. showed that image quality evaluation using saliency maps can provide an objective evaluation close to subjective assessments. However, since that was a basic study that used a rectangular phantom in which cold signals were placed, whether the same results can be obtained under clinical conditions, such as with an anthropomorphic phantom or an actual device, has not been verified. Additionally, the validity of the evaluation method using the saliency map was assessed by comparing it with a qualitative visual evaluation (three-point scale), and no quantitative evaluation was performed.
To obtain subconscious information that cannot be verbalized, various biometric data, such as heartbeat, sweating, and brain measurements, have been used. Gaze measurements reveal attention and interest. Therefore, comparison with gaze data is commonly used to validate saliency maps [9,10]. Information on potential visual attention is widely used in marketing [11], sports [12], and the diagnosis of cognitive disorders [13]. A recently reported application in the field of radiological technology is the comparison of gaze behavior during mammography reading between skilled and novice users [14]. We believe that the degree of visual attention drawn by a target signal in a uniform phantom image reflects the image quality and does not depend on experience or knowledge.
This study's aim was to assess the validity of using saliency maps to evaluate the image quality of positron emission tomography (PET) images obtained by imaging a phantom simulating a human body. We compared the saliency maps with the gaze data of the evaluator obtained from an eye-tracking device.

Materials and Methods
First, we generated PET images by Monte Carlo simulation using GATE version 8.2 (OpenGATE collaboration, http://www.opengatecollaboration.org, accessed on 30 June 2022) [15]. The simulated PET system was the Discovery ST Elite (GE Healthcare). The imaging object was a NEMA/IEC body phantom in which one hot sphere with a diameter of 10 mm was placed. The hot sphere was placed in 18 patterns at different distances from the phantom center (Figure 1). The background radioactivity concentration was set to 2.65 kBq/mL, and the hot sphere was set to four times that concentration. The acquisition time ranged from 10 to 180 s. The obtained sinograms were reconstructed by the three-dimensional ordered-subset expectation maximization (3D-OSEM) method using Customizable and Advanced Software for Tomographic Reconstruction (CASToR, open-source, https://castor-project.org, accessed on 30 June 2022) version 3.0.1 [16]. The number of subsets was fixed at 20, and the number of iterations was varied from 1 to 3. Scatter and attenuation corrections were applied, but time-of-flight and point spread function corrections were not. The field of view (FOV) of the reconstructed image was 320 × 320 mm, and the matrix size was 128 × 128. Gaussian filters with a full width at half maximum (FWHM) of 4 mm were applied in the axial and trans-axial directions. The imaging was limited to one bed position where the hot sphere was located at the axial center. We obtained 972 datasets from the combination of acquisition times increasing in 10 s steps (18 patterns), hot sphere positions (18 patterns), and iteration numbers (3 patterns). Each PET dataset had 47 slice images with a thickness of 3.27 mm.

We also obtained actual PET images with a Discovery ST Elite (GE Healthcare, Milwaukee, WI, USA). Eighteen acquisition times were used, ranging from 10 to 1800 s. The PET images were reconstructed by 3D-OSEM (iteration 2, subset 20). The FOV was 320 mm × 320 mm, and the matrix size was 128 × 128. Gaussian filters with a 2-mm FWHM were used.
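The simulated condition grid and the in-plane pixel size follow directly from the numbers given above. The short Python sketch below enumerates them as a sanity check; the sphere position labels 1 to 18 are hypothetical placeholders, since the text gives only the number of positions.

```python
from itertools import product

# Values taken from the text: acquisition times of 10-180 s in 10 s steps,
# 18 hot-sphere positions, and 1-3 OSEM iterations with 20 subsets.
acquisition_times_s = list(range(10, 181, 10))
sphere_positions = list(range(1, 19))  # hypothetical labels for the 18 positions
iterations = [1, 2, 3]
subsets = 20

# Every combination of the three factors is one simulated PET dataset.
conditions = list(product(acquisition_times_s, sphere_positions, iterations))

# In-plane pixel size of the reconstructed image.
fov_mm = 320.0
matrix_size = 128
pixel_size_mm = fov_mm / matrix_size

# Number of OSEM updates = iterations x subsets.
updates = {it: it * subsets for it in iterations}

print(len(conditions))   # 972
print(pixel_size_mm)     # 2.5
print(updates)           # {1: 20, 2: 40, 3: 60}
```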
For the physical evaluation, the percent of contrast ($Q_{10\,\mathrm{mm}}$), percent of background variability ($N_{10\,\mathrm{mm}}$), and the ratio of the two ($Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$) were calculated [17]. These indices were calculated using Equations (1) and (2):

$$Q_{10\,\mathrm{mm}} = \frac{C_{H,10\,\mathrm{mm}}/C_{B,10\,\mathrm{mm}} - 1}{a_H/a_B - 1} \times 100\ (\%) \tag{1}$$

$$N_{10\,\mathrm{mm}} = \frac{SD_{10\,\mathrm{mm}}}{C_{B,10\,\mathrm{mm}}} \times 100\ (\%) \tag{2}$$

where $SD_{10\,\mathrm{mm}}$ is the standard deviation of the background area, $C_{B,10\,\mathrm{mm}}$ is the average pixel value of the background area, and $C_{H,10\,\mathrm{mm}}$ is the average pixel value of the hot sphere placement position. The ratio of $a_H$ to $a_B$ ($a_H/a_B$) is the ratio of the radioactivity concentration in the hot sphere to the radioactivity concentration in the background region, which was set to four in this study.
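As a worked illustration of these indices, the following Python sketch computes the percent contrast and percent background variability, assuming the usual NEMA-style definitions (contrast normalized by the activity ratio, variability as the background standard deviation over the background mean). The function names and the ROI statistics are hypothetical and are not values from this study.

```python
def percent_contrast(c_hot, c_bkg, activity_ratio=4.0):
    """Percent contrast Q (%), assuming the NEMA-style definition."""
    return (c_hot / c_bkg - 1.0) / (activity_ratio - 1.0) * 100.0


def percent_background_variability(sd_bkg, c_bkg):
    """Percent background variability N (%)."""
    return sd_bkg / c_bkg * 100.0


# Hypothetical ROI statistics (mean in hot sphere, mean and SD in background).
c_hot, c_bkg, sd_bkg = 2.5, 1.0, 0.125

q = percent_contrast(c_hot, c_bkg)
n = percent_background_variability(sd_bkg, c_bkg)
print(q, n, q / n)  # 50.0 12.5 4.0
```

With an activity ratio of four, a hot sphere recovered at its full contrast would give a Q of 100%; partial-volume effects and smoothing lower this value for a 10 mm sphere.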
The iLab C++ Neuromorphic Vision Toolkit (iNVT) version 3.1 was used to calculate the saliency map [18]. The input formats available in iNVT are limited; therefore, the matrix size of the PET images was changed to 256 × 256, and the images were converted to 8-bit JPEG format with ImageJ version 1.52a (National Institutes of Health, Bethesda, MD, USA) [19] for use as the input to iNVT. The calculation of the saliency map is based on several features, but in the simulation study, the features of intensity and flicker were used. Since the salience by intensity increases where the change in pixel values is large, we preprocessed the body phantom by filling its periphery with the pixel values of the background region (Figure 2). The pixel values of the background region were obtained from the slices before and after the slice in which the hot sphere was clearly depicted. The processed PET image was used as the input, and the pixel values of the hot sphere position in the saliency map were used. The feature of flicker is used to compute the saliency of a video and responds to the change in intensity from the previous frame [20]. Therefore, the process of filling the outside of the phantom was not necessary. The calculation was performed by treating a series of 2D images as a video. Salience by intensity indicates prominence within a slice, whereas evaluation by flicker implies prominence in the slice direction. The PET images obtained from the actual device were processed in the same way. However, only the intensity features were used to calculate the salience.
Figure 2. Preprocessing to calculate saliency. Since the salience was calculated to be higher in areas where the change in pixel value in the image was large, the blank area around the body phantom was preprocessed to be filled with the pixel value of the background area. Four slices before and after the slice in which the hot sphere was depicted were used as the background region pixel values. A binarized mask image (mask image) was used to combine the slices in which the hot sphere was depicted (processed image). Longdash lines were used to prevent misinterpretation of where lines intersect.
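The preprocessing for the intensity feature can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions, not the authors' ImageJ procedure: the area outside the phantom is filled with the mean in-phantom value of the neighbouring slices, the resize to 256 × 256 uses nearest-neighbour pixel repetition, and the 8-bit conversion is a min-max rescale. The function names and the synthetic slice are hypothetical.

```python
import numpy as np


def fill_outside_phantom(hot_slice, neighbour_slices, phantom_mask):
    """Fill pixels outside the phantom with a background value.

    Assumption: the fill value is the mean in-phantom pixel value of the
    neighbouring slices that do not show the hot sphere.
    """
    background_value = np.mean([s[phantom_mask].mean() for s in neighbour_slices])
    return np.where(phantom_mask, hot_slice, background_value)


def to_8bit_256(image):
    """Upsample 128 x 128 to 256 x 256 and rescale to 8-bit (assumed min-max)."""
    lo, hi = image.min(), image.max()
    scaled = (image - lo) / (hi - lo) if hi > lo else np.zeros(image.shape)
    upsampled = np.repeat(np.repeat(scaled, 2, axis=0), 2, axis=1)
    return np.round(upsampled * 255).astype(np.uint8)


# Synthetic example: circular phantom (value 1.0) on an empty field (0.0),
# with a small hot region (value 4.0) in one slice.
yy, xx = np.mgrid[:128, :128]
mask = (yy - 64) ** 2 + (xx - 64) ** 2 <= 50 ** 2
plain = np.where(mask, 1.0, 0.0)
hot = plain.copy()
hot[60:64, 80:84] = 4.0

filled = fill_outside_phantom(hot, [plain] * 4, mask)
image8 = to_8bit_256(filled)
print(image8.shape, image8.dtype, image8.min(), image8.max())
```

After filling, the phantom edge no longer produces a large intensity step, so an intensity-based saliency computation is not dominated by the phantom outline.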
A Tobii Eye Tracker 4C (Tobii, Sweden) was used as the eye-tracking device. Six radiographers with 1-15 years of experience working in the nuclear medicine department were asked to participate in this study, and the method of acquiring gaze data was explained to them. Each evaluator was instructed to find and gaze at the hot sphere, and training was conducted with 10 images. Each evaluator calibrated the device before the evaluation to ensure that the gazing location was within the circle of the estimated gazing area, which corresponded to a diameter of 55 mm in the PET image. The estimated gazing area was hidden during the image quality evaluation. The images presented to the evaluator contained equal proportions of slices with and without hot spheres, for a total of 1944 images. Each image was displayed on a full screen for 0.5 s in a random order. In consideration of the evaluator's fatigue, the evaluation was divided into 36 sessions, and 54 images were displayed consecutively per session. The study using clinical images was conducted in one session due to the small number of images. The gaze data were acquired at intervals of about 10 ms, and the acquisition time and X and Y coordinates of the gazing point were recorded. The program used to acquire the raw data of the gazing points was written in Python 3.6 using the software development kit provided by Tobii. A 128 × 128 gaze image was created from the frequency of the gazing points at each coordinate. The pixel values were expressed as Z-scores. Regions of interest (ROIs) with a diameter of 55 mm were placed on the gaze image (320 × 320 mm) centered on the coordinates where the hot sphere was placed, and the average Z-score was calculated.
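The construction of the gaze image and its ROI average can be illustrated with the following Python sketch. It assumes that gaze coordinates have already been mapped to millimetres in the 320 × 320 mm image and that Z-scores are computed over all pixels of the gaze image; the source does not specify either detail, and the function names and simulated gaze samples are hypothetical.

```python
import numpy as np


def gaze_image(x_mm, y_mm, fov_mm=320.0, matrix=128):
    """Count gaze samples per pixel (rows = y, columns = x)."""
    edges = np.linspace(0.0, fov_mm, matrix + 1)
    counts, _, _ = np.histogram2d(y_mm, x_mm, bins=[edges, edges])
    return counts


def z_score(image):
    """Standardize pixel values over the whole image (assumed normalization)."""
    return (image - image.mean()) / image.std()


def roi_mean(image, centre_xy_mm, diameter_mm=55.0, fov_mm=320.0):
    """Mean pixel value inside a circular ROI centred at (x, y) in mm."""
    n = image.shape[0]
    centres = (np.arange(n) + 0.5) * (fov_mm / n)
    xx, yy = np.meshgrid(centres, centres)
    inside = (xx - centre_xy_mm[0]) ** 2 + (yy - centre_xy_mm[1]) ** 2 <= (diameter_mm / 2) ** 2
    return image[inside].mean()


# Simulated gaze samples clustered around a hypothetical sphere at (200, 120) mm.
rng = np.random.default_rng(0)
x = np.clip(rng.normal(200.0, 10.0, 500), 0.0, 320.0)
y = np.clip(rng.normal(120.0, 10.0, 500), 0.0, 320.0)

z = z_score(gaze_image(x, y))
on_target = roi_mean(z, (200.0, 120.0))
off_target = roi_mean(z, (60.0, 250.0))
print(on_target > off_target)  # True
```

Averaging over the ROI rather than taking the maximum pixel makes the indicator less sensitive to small eye movements around the fixation point.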
R (open-source, http://www.R-project.org, accessed on 30 June 2022) version 4.1.1 [21] was used for statistical processing to obtain Pearson's product-moment correlation coefficient between each pair of indicators. The significance level for all statistical tests was set at 5%.
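For reference, Pearson's product-moment correlation coefficient and the t statistic commonly used to test it against zero can be written out as in the Python sketch below. The study itself used R; this sketch only illustrates the formula, and the paired values are hypothetical, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical paired indicators: saliency at the hot sphere vs. mean gaze Z-score.
saliency = [12.0, 35.0, 60.0, 88.0, 110.0, 121.0]
gaze_z = [0.1, 0.9, 1.8, 2.9, 3.6, 4.1]

r = pearson_r(saliency, gaze_z)
t = t_statistic(r, len(saliency))
print(round(r, 3), round(t, 1))
```

The t statistic is compared with the t distribution with n - 2 degrees of freedom to decide significance at the chosen level.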

Results
Some of the reconstructed PET images, saliency maps (intensity), and black-and-white inverted gaze images are shown in Figure 3. The pixel values at the hot sphere location in the saliency map and gaze image were low at an acquisition time of 10 s but became higher as the acquisition time increased to 30 s and 180 s. $Q_{10\,\mathrm{mm}}$, $N_{10\,\mathrm{mm}}$, and $Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$, as well as the maximum pixel values (intensity and flicker) at the hot sphere position in the saliency map and the average Z-scores in the ROIs of the gaze image, are shown in Figure 4. The respective correlation coefficients are presented in Table 1. The $Q_{10\,\mathrm{mm}}$ value was higher for images with more iterations but was almost independent of the acquisition time. However, its standard deviation tended to decrease as the acquisition time increased. The $N_{10\,\mathrm{mm}}$ value was lower in the PET images reconstructed with fewer iterations and decreased with increasing acquisition time. The $Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$ values were higher in the images reconstructed with iterations 2 and 3 than in the images reconstructed with iteration 1. The pixel value at the hot sphere position in the saliency map (intensity) was higher in the images with fewer iterations when the acquisition time was short and tended to increase and saturate when sufficient counts were obtained. The salience by flicker was higher for fewer iterations and increased with increasing acquisition time. Error bars are not shown in Figure 4 because the standard deviation of the gaze images was large due to large individual differences. When the acquisition time of the PET image was short (≤90 s), the average Z-score within the ROI was higher for fewer iterations. When the acquisition time was >90 s, the Z-score was constant regardless of the number of iterations. In the correlation between the mean Z-score of the gaze images and each evaluation indicator, only $Q_{10\,\mathrm{mm}}$ did not show a significant correlation in iterations 2 and 3, and $N_{10\,\mathrm{mm}}$ showed the strongest correlation.

The results of the study using clinical images are shown in Figure 5 and Table 2. The trend was similar to that in the simulation study, although the results varied widely because the number of images was small. In the study using the actual device, the gaze data and saliency showed a high correlation.

Discussion
This study differs from the work by Hosokawa et al. [8] in the following respects. The phantom used was not a rectangular phantom but a NEMA/IEC body phantom that simulated the human body. In addition to the simulation study, PET images obtained from the actual device were used. Flicker was used in addition to intensity as a feature to calculate the saliency map. To demonstrate the validity of the image quality evaluation using the saliency map, we compared it with the evaluator's gaze data. The human viewpoint is not fixed to a single point but instead moves slightly. Therefore, the maximum pixel value at the gazing point in the gaze image is not repeatable. In addition, the amount of gaze data decreased with the duration of blinking during the evaluation. For these reasons, we adopted the average Z-score at the hot sphere position, calculated from the number of fixations at each position, as our evaluation indicator. The results from the simulation study showed that the correlation between the gaze image and saliency map (intensity) was >0.94, indicating that the two agreed closely.
To reduce the influence of individual differences in the pixel values of the gaze image, it was necessary to obtain the average of many samples, so a Monte Carlo simulation was used to create the PET images. Those images had better reproducibility than those from a phantom study using clinical equipment and were less likely to contain procedural errors. However, a limitation of that approach is that it does not take the patient table into account, and its scatter correction differs from that of the clinical machine. The results obtained from the actual device showed a similar trend to the results of the simulation experiment.
The PET images were displayed for only 0.5 s to minimize the influence of various factors, such as experience and knowledge. Reportedly, the initial gaze position immediately after image presentation follows bottom-up attention [22]. Furthermore, the gaze immediately after image presentation has been shown to correlate with a saliency map calculated from the bottom-up in mammographic lesion detection [23]. Our study results also support that finding.
The saliency map based on intensity was created by filling the area outside the body phantom with background pixel values. We also proposed a method using the saliency calculated by flicker, which did not require this preprocessing. However, the correlation between saliency by flicker and the gaze information was lower than the correlation between saliency by intensity and the gaze information. Filling in the area outside the body is difficult when evaluating clinical images. Although some studies have used saliency mapping in clinical imaging, the most prominent location in many clinical PET images is not necessarily the lesion [24]. Normal tissue, inflammation, and benign lesions may also accumulate FDG and affect the saliency. Not only is quantitative evaluation difficult, but the salience of the lesion may disappear if the normal areas are very prominent. Therefore, it is difficult to evaluate the ability of the saliency maps calculated from the intensity features in this software to depict lesions present on clinical images. A method that computes saliency maps from flicker features may solve this problem, but further improvement is needed. A more complete solution would need to consider top-down attention. However, methods using top-down attention fall within the field of computer-aided detection (CADe). Recently, CADe using a deep convolutional neural network (DCNN) has been actively studied [25]. Models trained with the evaluator's eye data are reportedly more accurate than the general-purpose models proposed for natural images [26]. Unless the algorithm used is fixed, however, it is impossible to determine whether a change in results is due to a difference in image quality or in the algorithm. CADe using a DCNN is in its infancy and changes quickly. We therefore preprocessed the input images and did not make any changes to the established algorithms.
In this study, conventionally used image quality evaluation indices were also calculated for comparison. $N_{10\,\mathrm{mm}}$ had a high negative correlation with the average Z-score of the gaze image, but this index is based on the amount of noise in the background region and does not consider the visibility of the hot sphere. Therefore, changing the radioactivity concentration of the hot sphere would not change its value. The value of $Q_{10\,\mathrm{mm}}$ was constant regardless of the acquisition time, and hence the image quality could not be evaluated by $Q_{10\,\mathrm{mm}}$ alone. $Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$ changed differently from the average Z-score of the gaze image and from the saliency, possibly because the gaze image and saliency map were obtained from 8-bit images, whereas $Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$ was calculated from 32-bit float images. Since saliency maps have been actively researched using natural images, algorithms that use 8-bit images as their input are common. Even though high dynamic range images have been used as input in some studies [27,28], 8-bit images are still commonly used [6,7]. It is unclear whether the $Q_{10\,\mathrm{mm}}/N_{10\,\mathrm{mm}}$ calculated from a 32-bit image represents the quality of the medical images a doctor sees.
Recently, the no-reference image quality assessment (NR-IQA) concept has been extensively studied. Unlike full-reference IQAs, such as the normalized mean squared error, the NR-IQA is characterized by its ability to perform absolute evaluations. The NR-IQA was initially studied in the field of natural imaging and has subsequently been applied in many medical magnetic resonance imaging settings [29,30]. Applications in the field of PET are also expected. However, NR-IQA evaluates the noise and distortion of the entire image and not the ability to accurately depict the lesion, so its purpose differs from that of our study. The most primitive method of Itti's model was used in this study, but other established algorithms are also worth trying [31].

Conclusions
In this study, we used the saliency map calculated by Itti's algorithm for image quality evaluation in nuclear medicine. Itti's algorithm is an established algorithm with a clear calculation method. The validity of the proposed method was demonstrated by comparing it with the gaze data of the evaluator. Even though the algorithm was designed to calculate the saliency of natural images, the saliency maps of the low-resolution gray-scale nuclear medicine images showed the same trend as the gaze images. A strong correlation was observed between the two, suggesting that salience can be used to evaluate the image quality when a uniform phantom is used. Further work is needed before this approach can be applied to clinical images, but the flicker feature shows potential for that purpose.

Institutional Review Board Statement:
We confirmed in advance that no ethical approval from our institutional review board was required to conduct this study as it did not use any clinical data.