Few-Shot Personalized Saliency Prediction Based on Adaptive Image Selection Considering Object and Visual Attention

A few-shot personalized saliency prediction method based on adaptive image selection considering object and visual attention is presented in this paper. Since general methods for predicting personalized saliency maps (PSMs) require a large number of training images, a framework that works with only a small number of training images is needed. Finding persons whose visual attention is similar to that of a target person is an effective way to tackle this problem, but it conventionally requires all persons to gaze at many images in common, which is an unrealistic burden. This paper therefore introduces a novel adaptive image selection (AIS) scheme that focuses on the relationship between human visual attention and objects in images. AIS considers both the diversity of objects in images and the variance of PSMs for those objects. Specifically, AIS selects images containing various kinds of objects to maintain diversity, and it favors images whose PSMs have high variance across persons, since regions that many persons commonly gaze at or commonly ignore are already represented by a universal saliency map. By selecting images with high diversity and high variance, the proposed method enables similar persons to be identified from only a small number of viewed images; this is the technical contribution of this paper. Experimental results show the effectiveness of our personalized saliency prediction including the new image selection scheme.


Introduction
Many researchers have attempted to predict a saliency map that indicates the image components that attract more attention than their neighbors [1][2][3][4]. Since a saliency map reflects human visual attention, it has been expected to contribute to image processing tasks including image re-targeting [5,6], image compression [7,8], and image enhancement [9,10]. The purpose of those studies is the prediction of instinctive human visual attention, that is, the regions of images that are common to humans. Such a saliency map is called a Universal Saliency Map (USM). However, visual attention can differ between persons if individual backgrounds are taken into account [11][12][13]. In fact, since it has been reported that each person views regions of images reflecting personalized interests [14][15][16], the prediction of person-specific visual attention, called a Personalized Saliency Map (PSM), has been needed [17,18].
In order to accurately predict PSMs, Xu et al. constructed a PSM dataset and proposed a PSM prediction method [14,19]. Their PSM dataset includes a large number of images and the corresponding gaze data obtained from many persons. To the best of our knowledge, their PSM dataset is the first dataset focusing on PSM prediction. Xu's method, which is based on a multi-task Convolutional Neural Network (multi-task CNN) [20], needs a large amount of training data for PSM prediction. Thus, to predict a PSM with this method for a new person not included in the PSM dataset, a large amount of gaze data, which involves a heavy burden, must be obtained for retraining the multi-task CNN. Furthermore, in a real-world situation, PSM prediction for new images not included in the PSM dataset is necessary. Thus, a PSM prediction method without such large-scale data acquisition is desirable. Our previous study revealed that the use of gaze data obtained from similar persons, who view regions in images similar to those viewed by the target person, is effective for PSM prediction [21]. Since the previous study assumes that similar persons have already gazed at the new image, their actual gaze data can be utilized. However, since the new image is not always gazed at by similar persons, a method that can estimate a PSM of the target person from PSMs predicted for similar persons is a much more practicable approach. From the above discussion, the construction of such a method is a challenging but indispensable task.
For predicting a PSM for a new target person, he/she needs to view several images so that similar persons can be found. Before this procedure, we need to select images from the PSM dataset for calculating the person similarities between the target person and the persons included in the PSM dataset. However, if the selected images are visually similar to each other, the calculated person similarities are not reliable. In order to realize robust PSM prediction while reducing the number of selected images, an adaptive image selection scheme solving this problem is necessary. Specifically, we focus on the following two aspects: 1) the diversity of images and 2) the variance of PSMs. Since the PSM dataset consists of images with high diversity, we should select images while maintaining that diversity. Moreover, the variance of the PSMs of the persons included in the PSM dataset should be high, since the regions in images that many persons commonly gaze at or do not gaze at can already be represented by a USM. Thus, by introducing an adaptive image selection scheme focusing on these two aspects, it is expected that PSM prediction for the new target person can be realized with high accuracy. This paper presents a few-shot PSM prediction (FPSP) method using a small amount of training data based on adaptive image selection (AIS) considering object and visual attention. Figure 1 shows an illustration of the problem we tackle. First, we construct and train a multi-task CNN on the PSM dataset for predicting the PSMs of the persons included in the PSM dataset [20]. Next, the person similarity is calculated by using selected images included in the PSM dataset. These images are chosen by AIS focusing on the diversity of images and the variance of PSMs. For guaranteeing the high diversity of the selected images, AIS focuses on the kinds of objects included in the training images in the PSM dataset by using a deep learning-based object detection method.
Then, objects that have high variances of PSMs are detected, and we can adaptively select images including such objects, shown in the orange area in Figure 1. Finally, FPSP of a target image for the new target person is realized on the basis of the person similarity and the PSMs predicted by the multi-task CNN trained for the persons in the PSM dataset. Consequently, FPSP based on AIS for the new target person can be realized from a small amount of training data with high accuracy. It should be noted that this paper is an extended version of [22]. Specifically, we newly enable the prediction of the target person's PSM from the PSMs predicted for similar persons based on the multi-task CNN. Furthermore, we newly introduce AIS into the above PSM prediction approach.

Few-shot PSM Prediction Based on Adaptive Image Selection
In this section, we explain the variables used in this section, which are summarized in Table 1, and our proposed method, which is shown in Figure 2. In our method, the multi-task CNN is trained on the PSM dataset for saliency prediction of the persons included in the PSM dataset (see Section 2.1). Then, we choose images from the PSM dataset based on AIS (see Section 2.2), and the target person needs to view only the selected images for his/her PSM prediction. Finally, we predict the target person's saliency map for the new target image by using the predicted saliency maps of similar persons in the PSM dataset (see Section 2.3).
Table 1. Variables used in this section.

S_out(p, X_tgt): PSM predicted by the multi-task Convolutional Neural Network (CNN) for image X_tgt and person p
P: Number of persons
N: Number of images
∆(p, X_n): Difference map between USM S_USM(X_n) and PSM S_PSM(p, X_n)
∆̂_l(p, X_n, S_USM(X_n)): Difference map calculated at the lth decoding layer for image X_n and person p
n: Index of images
p: Index of persons
l: Index of decoding layers
d_3: Number of color channels

The rest of this paper is organized as follows. In Section 2, FPSP including the AIS scheme is explained in detail. In Section 3, the effectiveness of our proposed method is shown from experimental results. Finally, in Section 4, we conclude this paper.

Construction of a Multi-Task CNN for PSM Prediction
In this subsection, we explain the construction of a multi-task CNN for PSM prediction. This multi-task CNN is constructed for calculating P PSMs, where P is the number of persons included in the PSM dataset [20]. In the proposed method, the input data, consisting of images X_n ∈ R^(d_1 × d_2 × d_3) (n = 1, 2, . . . , N; N being the number of training images, d_1 × d_2 being the number of pixels, and d_3 being the number of color channels) and USMs S_USM(X_n) ∈ R^(d_1 × d_2), are used for training the multi-task CNN. The USM represents the areas that many persons gaze at. In our method, the USM S_USM(X_n) can be obtained by an arbitrary method; its calculation is not our contribution, and the method used is described in Section 3. Given PSMs S_PSM(p, X_n) ∈ R^(d_1 × d_2) for P persons, where S_PSM(p, X_n) is obtained from the pth person's gaze data for image X_n and is included in the PSM dataset [20], we calculate a difference map ∆(p, X_n) between the USM and the PSM of each person as ∆(p, X_n) = S_PSM(p, X_n) − S_USM(X_n) by following [14]. The multi-task CNN has one encoding part and P decoding parts, each consisting of three layers. Its output layer provides P results of ∆(p, X_n) (p = 1, 2, ..., P). The details of the multi-task CNN are shown in Figure 3. Moreover, we train the multi-task CNN by minimizing the following loss function:

Loss = Σ_{n=1}^{N} Σ_{p=1}^{P} Σ_{l=1}^{3} || ∆(p, X_n) − ∆̂_l(p, X_n, S_USM(X_n)) ||_F^2, (1)

where ∆̂_l(·) is a difference map calculation function that applies a 1 × 1 convolution layer to the outputs obtained from the lth decoding layer, and || · ||_F^2 denotes the squared Frobenius norm. Given a new target image X_tgt, by using the above trained network, we predict the PSM of person p from the final (third) decoding layer as follows:

S_out(p, X_tgt) = S_USM(X_tgt) + ∆̂_3(p, X_tgt, S_USM(X_tgt)). (2)

Therefore, the PSMs of multiple persons can be predicted by a single model based on the multi-task CNN. Figure 3. The details of the multi-task CNN used in our method. In this figure, "Conv" and "MaxPool" mean applying a convolution layer and a max-pooling layer to each input, respectively.
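As a minimal sketch, the difference-map formulation above can be illustrated in NumPy. The function names and the toy 4 × 4 maps are ours for illustration only; in the actual method, ∆̂ is produced by the trained multi-task CNN rather than computed directly:

```python
import numpy as np

def difference_map(psm, usm):
    """Difference map Delta(p, X_n) = S_PSM(p, X_n) - S_USM(X_n)."""
    return psm - usm

def frobenius_loss(delta_true, delta_pred):
    """Squared Frobenius norm ||Delta - Delta_hat||_F^2, the per-term loss."""
    return float(np.sum((delta_true - delta_pred) ** 2))

def reconstruct_psm(usm, predicted_delta):
    """Recover a person's PSM from the USM and a predicted difference map,
    clipped to the valid saliency range [0, 1]."""
    return np.clip(usm + predicted_delta, 0.0, 1.0)

# Toy 4x4 maps standing in for d_1 x d_2 saliency maps.
usm = np.full((4, 4), 0.5)
psm = np.zeros((4, 4))
psm[1, 2] = 1.0  # one strongly fixated pixel for this person
delta = difference_map(psm, usm)
recovered = reconstruct_psm(usm, delta)
```

With a perfectly predicted difference map, adding it back to the USM recovers the person's PSM exactly, which is the intuition behind training the network on ∆ rather than on the PSM directly.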

Adaptive Image Selection for Reduction of Viewed Images
In this subsection, we explain the AIS scheme for reducing the number of images that the target person views for predicting his/her PSMs. Given a new person p_new not included in the PSM dataset, the multi-task CNN cannot learn the new person's PSM since the target person has not gazed at all of the images in the PSM dataset. Therefore, from the target person, we obtain several seed PSMs for images in the PSM dataset [20]. Note that the number of images viewed by the target person should be small to reduce his/her burden. Thus, the influence of each image on training is large, and the diversity of the images significantly depends on the selection scheme. Although the PSM dataset has a high diversity of images, the selected images do not necessarily inherit this diversity. We therefore propose a novel image selection method for maintaining the diversity of images with consideration of the kinds of objects in images and the variance of PSMs, as shown in Figure 4. For maximizing the kinds of objects included in the selected images, we apply YOLO-v3 [23], a state-of-the-art object detection method, to the images included in the PSM dataset. Moreover, we use the gaze data of the persons included in the PSM dataset in order to consider the variance of PSMs. In AIS, we select images based on the detected objects and their PSMs. Specifically, we select objects that have high variances of PSMs, since objects that have low variances are expected to be represented by a USM. Finally, we select images that include many kinds of objects having a high variance of PSMs. First, we detect objects O_(n,m) (m = 1, . . . , M; M being the number of kinds of objects in all images) in the images by using YOLO-v3 [23]. Each detected object is represented by a bounding box of size d^h_(n,m) × d^w_(n,m). Then, we calculate the object variance v_(n,m) as the across-person variance of the PSMs averaged over the bounding box as follows:

v_(n,m) = (1 / (d^h_(n,m) d^w_(n,m))) Σ_(j,k) (1 / P) Σ_{p=1}^{P} ( S_PSM(p, O_(n,m))(j, k) − S̄_PSM(O_(n,m))(j, k) )^2,

where S_PSM(p, O_(n,m)) is the PSM of person p for the object O_(n,m), S̄_PSM(O_(n,m)) is its average over the P persons, and (j, k) denotes the pixel location.
Note that we treat v_(n,m) = 0 if image X_n does not include the mth object, and we adopt the largest v_(n,m) if image X_n includes multiple instances of the mth object. For selecting images, we calculate the sum of variances, v̄_n, of the PSMs for each image as follows:

v̄_n = Σ_{m=1}^{M} v_(n,m). (5)

Finally, we select the C images that have the highest values in Equation (5) from the PSM dataset. It is known that human visual attention depends on objects, and our method, which explicitly uses this relationship, is simple but useful for maintaining both the diversity and the variance. In particular, AIS adopts YOLO-v3, which has achieved remarkably high performance in recent object recognition, and enables the variance of visual attention to be calculated pixel-wise. Thus, AIS, which focuses on the combined use of object detection and visual attention, is effective.
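The AIS scoring step can be sketched as follows. The data layout (per-image detections mapped to per-person PSM crops) and the function names are our assumptions for illustration; in the actual method the crops come from YOLO-v3 bounding boxes on the PSM dataset:

```python
import numpy as np

def object_variance(psm_crops):
    """psm_crops: array of shape (P, h, w) holding P persons' PSM values
    inside one object's bounding box. Returns the across-person variance
    averaged over the d_h x d_w box, i.e. v_(n,m)."""
    return float(np.var(psm_crops, axis=0).mean())

def select_images(detections, c):
    """detections: {image_id: {class_id: [crop arrays of shape (P, h, w)]}}.
    Per image, take the largest variance for each object class (as when a
    class appears more than once), sum over classes to get v_bar_n, and
    return the c image ids with the highest scores."""
    scores = {}
    for img, by_class in detections.items():
        scores[img] = sum(max(object_variance(crop) for crop in crops)
                          for crops in by_class.values()) if by_class else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:c]

rng = np.random.default_rng(0)
flat = np.full((3, 2, 2), 0.5)    # all 3 persons agree -> variance 0
varied = rng.random((3, 2, 2))    # persons disagree -> high variance
dets = {"img_a": {0: [flat]}, "img_b": {0: [varied], 1: [varied]}}
chosen = select_images(dets, 1)
```

Here "img_b" is chosen: its objects carry across-person disagreement, while "img_a" contains only an object every person attends to identically, which a USM would already explain.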

FPSP Based on Person Similarity
In this subsection, we explain FPSP using the person similarity and the PSMs predicted by the multi-task CNN. We predict the PSM of the new person p_new based on the similar persons' PSMs predicted by the multi-task CNN. First, we predict S_out(p, X^sel_c) by inputting each selected image into the multi-task CNN and using Equation (2), where X^sel_c (c = 1, 2, . . . , C) are the C images selected in Section 2.2. Next, from the predicted PSMs S_out(p, X^sel_c), we calculate the cross correlation as a similarity score β_p between the target person p_new and person p included in the PSM dataset as follows:

β_p = (1 / C) Σ_{c=1}^{C} corr( S_out(p, X^sel_c), S_PSM(p_new, X^sel_c) ),

where corr(·, ·) calculates the cross correlation. Note that S_PSM(p_new, X^sel_c) is obtained from the gaze data of the target person p_new. This means that the new person p_new needs to view only the selected C images to obtain the gaze data for calculating the PSMs S_PSM(p_new, X^sel_c). Then, to eliminate the influence of dissimilar persons, we select only similar persons based on the selection coefficient a_p as follows:

a_p = 1 if β_p ≥ τ, and a_p = 0 otherwise,

where τ is a pre-determined threshold value. Finally, by using the similarity score and the selection coefficient, we calculate the person similarity between the new person p_new and person p as follows:

w_p = a_p β_p / Σ_{p'=1}^{P} a_{p'} β_{p'}.

By using the person similarities w_p and the similar persons' PSMs predicted by the multi-task CNN, we can simply predict the PSM S_FPSP(p_new, X_tgt) of the new person p_new for the target image X_tgt as follows:

S_FPSP(p_new, X_tgt) = Σ_{p=1}^{P} w_p S_out(p, X_tgt).

Therefore, by using the person similarity w_p, the proposed method enables the prediction of the PSM of the new person from a small amount of training gaze data.
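The similarity-and-fusion step can be sketched in NumPy as follows. This is a simplified stand-in, not the exact experimental code: Pearson correlation approximates corr(·, ·), the weight normalization is one natural reading of the person similarity, and the fallback for the case where no person passes the threshold is our addition:

```python
import numpy as np

def similarity_score(pred_psms, new_psms):
    """beta_p: average correlation over the C selected images between person
    p's predicted PSMs and the new target person's measured PSMs."""
    return float(np.mean([np.corrcoef(a.ravel(), b.ravel())[0, 1]
                          for a, b in zip(pred_psms, new_psms)]))

def fuse_psms(betas, candidate_psms, tau=0.7):
    """Apply the selection coefficient a_p (1 if beta_p >= tau, else 0),
    normalize the surviving scores into weights w_p, and return the
    weighted sum of the candidates' predicted PSMs for the target image."""
    betas = np.asarray(betas, dtype=float)
    keep = betas >= tau
    if not keep.any():                 # fallback (ours): keep the best match
        keep = betas == betas.max()
    w = np.where(keep, betas, 0.0)
    w /= w.sum()
    return np.tensordot(w, np.asarray(candidate_psms), axes=1)

rng = np.random.default_rng(0)
new_gaze = [rng.random((4, 4)) for _ in range(3)]            # p_new's seed PSMs
person1 = [m + 0.01 * rng.random((4, 4)) for m in new_gaze]  # similar person
person2 = [rng.random((4, 4)) for _ in range(3)]             # dissimilar person
b1 = similarity_score(person1, new_gaze)
b2 = similarity_score(person2, new_gaze)
target_preds = [rng.random((4, 4)), rng.random((4, 4))]      # S_out for X_tgt
fused = fuse_psms([b1, b2], target_preds)
```

Because person1's maps are nearly identical to the new person's, b1 dominates and the fused map is driven by person1's predicted PSM for the target image.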

Experiment
In this section, the effectiveness of the FPSP based on AIS is shown from results of experiments. Section 3.1 shows the experimental settings, and Section 3.2 shows the performance evaluation and discussion.

Experimental Settings
In this subsection, we explain our experimental settings. We used the PSM dataset [20], which consists of 1600 images and their corresponding gaze data for 30 persons who have normal or corrected-to-normal vision. The gaze data were obtained while each person gazed at each image for three seconds under free-viewing conditions. Moreover, we calculated PSMs from the gaze data by following [24]. We randomly chose 500 images as test images and used the remaining 1100 images as training images. Moreover, we selected C images from the training images and changed the number of selected images, C, in {10, 20, . . . , 100}. In this experiment, we randomly chose 10 persons as the new target persons described in Section 2 and used the remaining 20 persons for training. We used the PSM calculated on the basis of the gaze data as the Ground Truth (GT). Moreover, the multi-task CNN was optimized on the basis of stochastic gradient descent [25], and we set the mini-batch size, learning rate, momentum, and number of iterations to 9, 0.00003, 0.9, and 1000, respectively. We experimentally set the threshold value τ to 0.7; its determination will be investigated in future work. In the proposed method, S_USM(X_n) is calculated as the average of the visual attention of the 20 training persons.
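The map construction used in the settings above can be sketched as follows. The Gaussian-at-fixations construction is a common convention in the saliency literature and is our assumption here (the exact procedure of [24] may differ), while the USM-as-average definition follows the last sentence above:

```python
import numpy as np

def psm_from_fixations(fixations, shape, sigma=1.5):
    """Place an isotropic Gaussian at each fixation point (y, x) and
    normalize the resulting map to [0, 1]. A common construction; not
    necessarily the exact procedure of [24]."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros(shape, dtype=float)
    for fy, fx in fixations:
        m += np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2.0 * sigma ** 2))
    return m / m.max() if m.max() > 0 else m

def usm_from_psms(psms):
    """As in the settings above: the USM is the average of the training
    persons' visual attention maps."""
    return np.mean(np.asarray(psms), axis=0)

psm_a = psm_from_fixations([(2, 2)], (5, 5))   # one person fixates the center
psm_b = np.zeros((5, 5))                       # another person with no fixation
usm = usm_from_psms([psm_a, psm_b])
```

Averaging over persons naturally suppresses idiosyncratic peaks, which is exactly why high-variance regions are the informative ones for AIS.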
For confirming the effectiveness of FPSP including the image selection scheme, we performed qualitative and quantitative evaluations. In the quantitative evaluation, we measured the difference between the predicted PSM and its GT using Pearson's correlation coefficient (CC), Kullback-Leibler divergence (KLdiv), and histogram intersection (Sim) [26] by following [27]. We also performed two kinds of comparative experiments. In the first comparative experiment, for revealing the effectiveness of our PSM prediction method based on a small amount of gaze data, we compared our method with four comparative methods that predict the USM, chosen from the MIT saliency benchmark [28], including one of the state-of-the-art USM prediction methods based on deep learning (SalGAN) [4], which was trained on the SALICON dataset [29].
Moreover, we compared our method with the following two PSM prediction methods using a small amount of gaze data:
• A PSM prediction method based on visual similarities (Baseline1) [30]
• A PSM prediction method based on visual similarities and spatial information (Baseline2) [22]
It should be noted that the above comparative methods were trained using the selected images, since we assume that the target person views only the selected images. In the second comparative experiment, for revealing the effectiveness of our image selection method, we compared our method with the following image selection methods:
• Image selection based on visual features (ISVF): images whose visual features have a low similarity to those of the other images were selected. We adopted the outputs of the final convolution layer of a pre-trained DenseNet201 [31] as visual features.
• Image selection focusing on the variance of PSMs (ISPSM): images having a high variance of PSMs in the PSM dataset were selected.
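The three evaluation indices used above can be implemented as follows. These are the common definitions from the saliency evaluation literature [26,27]; the exact normalization choices here are our assumptions, not necessarily the code used in the experiments:

```python
import numpy as np

def cc(pred, gt):
    """Pearson's correlation coefficient (CC) between two saliency maps."""
    return float(np.corrcoef(pred.ravel(), gt.ravel())[0, 1])

def kldiv(pred, gt, eps=1e-12):
    """KL divergence of the predicted distribution from the GT distribution,
    with both maps normalized to sum to one (lower is better)."""
    p = gt.ravel() / gt.sum()
    q = pred.ravel() / pred.sum()
    return float(np.sum(p * np.log(p / (q + eps) + eps)))

def sim(pred, gt):
    """Histogram intersection (Sim): sum of elementwise minima of the two
    normalized maps; 1 for identical maps, 0 for disjoint ones."""
    return float(np.minimum(pred / pred.sum(), gt / gt.sum()).sum())

rng = np.random.default_rng(0)
a = rng.random((8, 8)) + 0.1  # strictly positive map, safe for the KL term
```

For identical prediction and GT maps, CC and Sim reach their maximum of 1 and KLdiv is (numerically) 0, which matches the (↑)/(↓) directions reported in Table 2.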

Performance Evaluation and Discussion
In this subsection, we confirm and discuss the experimental results. Figures 5-12 and Table 2 show the experimental results. First, Figure 5 shows the predicted results for one person and reveals that the FPSP method predicts the PSM that is the most similar to the GT among all of the PSMs predicted by the comparative methods. Table 2 shows the average results, from which it can be confirmed that FPSP based on AIS is the most effective for PSM prediction in all evaluation indices. Therefore, by comparing the averages, we confirm the effectiveness of FPSP.

Table 2. Comparison of performance in multiple evaluation indices. The mark (↑) means that the higher the index, the higher the performance; similarly, the mark (↓) means that the lower the index, the higher the performance. Note that 100 (= C) selected images were used for training in Baselines 1 and 2 and FPSP. Bold font represents the highest value in each evaluation index.

We show the results predicted by FPSP based on AIS and by the USM prediction methods for each subject in Figures 6-8. Note that we denote the 10 target persons as Subs 1-10 in these figures. These figures show that FPSP enables person-specific prediction for most persons more successfully than the USM prediction methods. Specifically, FPSP outperforms SalGAN, which is one of the state-of-the-art USM prediction methods. Thus, we confirm the effectiveness of constructing a prediction model for each person. Furthermore, we show the results predicted by FPSP based on AIS and by the PSM prediction methods for each subject in Figures 9-11. These figures show that the results predicted by FPSP are higher than those of the other PSM prediction methods. Thus, FPSP enables more accurate prediction than the baseline PSM prediction methods. Therefore, the effectiveness of FPSP is verified in the first experiment.

Next, we discuss the difference between AIS, ISVF, and ISPSM in the second experiment. Focusing on the baselines in Table 2, we can confirm that the use of AIS is the most effective image selection method.
Furthermore, Figure 12 shows the performance of FPSP with changes in the number of training images when the training images are selected by AIS, ISVF, and ISPSM for the calculation of the person similarity. In CC and KLdiv, the results of FPSP based on AIS are robust to changes in the number of training images and consistently higher than those based on ISVF and ISPSM. In other words, FPSP based on AIS accurately predicts the PSM of the target person even when the target person gazes at only 10 images included in the PSM dataset. This confirms that our image selection method, AIS, is also effective for FPSP. Therefore, the effectiveness of FPSP based on AIS is verified by the experimental results. We summarize the discussions as follows. We confirmed the effectiveness of the proposed PSM prediction method, FPSP, in Figure 5 and Table 2 from the perspective of qualitative and quantitative evaluations by focusing on the average results. Moreover, by comparing FPSP with the USM prediction methods and the baseline PSM prediction methods for each person in Figures 6-11, it was verified that FPSP enables accurate prediction for each person. Finally, Figure 12 confirms the robustness and effectiveness of AIS for FPSP. Therefore, we reveal that FPSP based on AIS enables accurate prediction with a small number of training images and reduces the burden on persons of obtaining their gaze data for PSM prediction.

Conclusions
In this paper, we have proposed a few-shot personalized saliency prediction method based on adaptive image selection considering object and visual attention. FPSP enables accurate PSM prediction with a small number of training images, and AIS reduces the number of images that the new person must view. Consequently, FPSP based on AIS enables accurate prediction with a small number of training images and reduces the burden on persons of obtaining their gaze data for PSM prediction. Experimental results showed the effectiveness of our proposed method.

Funding: This work was partly supported by the MIC/SCOPE #18160100.

Conflicts of Interest:
The authors declare no conflict of interest.