Introduction
The subjective values and meanings of images have received considerable attention from the image understanding research community. However, such values are in the eye of the beholder, so to speak, and are quite difficult to assess from images alone. Recent advances in machine learning techniques allow us to tackle such an ambiguous task in a data-driven manner, and there have been several research attempts to estimate the subjective values of images, such as their aesthetic quality (Datta, Joshi, Li, & Wang, 2006; Ke, Tang, & Jing, 2006; Luo & Tang, 2008; Nishiyama, Okabe, Sato, & Sato, 2011; Marchesotti, Perronnin, Larlus, & Csurka, 2011), using human-labeled datasets. While these approaches have achieved a certain level of success, it is not clear whether an objective ground-truth measure actually exists for such subjective values.
On the other hand, there is a long history of research on eye movements and their relationship to the human mind. The use of gaze input has recently been receiving more and more attention amid the increasing demand for natural user interfaces, and casual gaze sensing techniques are becoming readily available. However, in most application scenarios, gaze is treated simply as an alternative pointing input of a different modality. Several research attempts incorporating the concept of cognitive state recognition have recently been proposed to extend the potential of gaze interaction. In these works, eye movements are used to indirectly infer the cognitive states of users, e.g., task and contextual cues (Bulling & Roggen, 2011; Bulling, Weichel, & Gellersen, 2013), intention (Bednarik, Vrzakova, & Hradis, 2012), user characteristics (Toker, Conati, Steichen, & Carenini, 2013; Steichen, Carenini, & Conati, 2013), cognitive load (Bailey & Iqbal, 2008; Chen, Epps, & Chen, 2013), and memory recall (Bulling, Ward, Gellersen, & Troster, 2011). While the view that a person's eye movement patterns while viewing images can reflect his or her complex mental state has been widely shared among researchers (Yarbus & Riggs, 1967), it has also been pointed out that classification tasks based on eye movements are often very challenging (Greene, Liu, & Wolfe, 2012). Therefore, it remains important to investigate what can be practically inferred from eye movements.
In this work, we focus on preference estimation while a user is comparing a pair of natural images. Shimojo et al. (Shimojo, Simion, Shimojo, & Scheier, 2003) reported on the gaze cascade effect and showed that people tend to fixate longer on the preferred stimulus when asked to compare two stimuli and make a two-alternative forced choice. Based on this finding, several methods have been proposed to predict preferences from eye movements (Bee, Prendinger, Nakasone, André, & Ishizuka, 2006; Glaholt, Wu, & Reingold, 2009). However, the main focus of these studies is comparison between stimuli of the same category, such as faces or product images, and, more importantly, the target task is early detection of decision-making events. The estimation is done while the users are making preference decisions, and it is therefore unclear whether it is also possible to estimate their preference between two natural images during free viewing. Although eye movements during comparative visual search have also been widely studied (Pomplun et al., 2001; Atkins, Moise, & Rohling, 2006), comparison between two unrelated images has not been fully investigated.
The goal of this research is to explore the possibility of gaze-based image preference estimation, and we make two contributions in this paper. First, we take a data-driven approach to the image preference estimation task using eye movements. A classifier that outputs image preference labels is trained using a dataset of eye movements recorded while users compare pairs of images. The training uses an algorithm that can exploit the features most beneficial to the classification task. In this way, we can identify the important features for preference estimation and assess how they differ among different people. More importantly, we also investigate whether preference estimation based on eye movements is still possible in a scenario in which users freely view image pairs with no instruction. Most of the prior work focused on preference decision making, whose application scenario is quite limited, so investigating the free-viewing scenario is of practical importance.
Second, we present a quantitative comparison with an image-based preference estimation technique. As briefly mentioned above, it has been demonstrated that aesthetic image quality can be estimated in a data-driven manner. However, it is not yet clear whether the same approach can be taken for highly subjective values such as personal preference. Another purpose of this work is therefore to validate whether the standard framework of aesthetic quality classification is also beneficial for image preference estimation. In particular, it is unclear whether gaze-based classification remains competitive with image-based classification in the free-viewing scenario. In this study, we quantitatively compare the classification performance of these two approaches.
Methodology
Gaze-based Preference Estimation
The input to our method is a gaze data sequence {(g_n, t_n)}, i.e., N gaze positions g_n associated with their time stamp values t_n. The time stamps are normalized so that t_0 = 0.0 indicates the time when the image pair appeared on the display and t_{N−1} = 1.0 the time when the pair disappeared. Our goal is to classify which image the user prefers from the eye movement patterns during the comparative viewing.
As discussed earlier, prior work has pointed out that humans tend to look at a preferred stimulus longer (Shimojo et al., 2003). In this study, we are interested in investigating whether other kinds of features are also beneficial to the preference estimation task. Therefore, various fixation and saccade statistics are considered as input features, in a similar way as in (Castelhano, Mack, & Henderson, 2009; Mills, Hollingworth, Van der Stigchel, Hoffman, & Dodd, 2011; Greene et al., 2012). The use of a random forest algorithm (Breiman, 2001) allows us to automatically select the features most effective for the classification task, and their contribution can be quantitatively evaluated as feature weights.
Eye Movement Features. We first follow a standard procedure to extract fixation and saccade events from these data: if the angular velocity exceeds a threshold of 30 degrees/second, the gaze data is classified as a saccade, and we regard {(g_n, t_n), …, (g_m, t_m)} as data belonging to a fixation if their angular velocities are below this threshold. The first fixation is discarded because its position is highly affected by the previous stimulus. We define three attributes for each fixation event F: the position p, duration T, and time t. If the i-th fixation F_i lasts from t_n to t_m, p_i is defined as the median of the gaze positions, T_i = t_m − t_n, and t_i = t_n. Assuming that the areas in which the two paired images are displayed are known, the fixations {(p_i, T_i, t_i)} can be divided into two subsets, i.e., fixations on the image on the left, F_L, and those on the right, F_R. At the same time, the fixation positions are normalized according to the display area of each image so that the x and y coordinates lie in [0, 1].
Saccade events are defined only when two successive fixations F_i and F_{i+1} happen on the same side of the image pair. Four attributes are defined for each saccade event: direction d, length l, duration T, and time t. Given a saccade vector s = p_{i+1} − p_i, the length l is defined as its norm |s| and the direction d as the normalized vector s/l. The duration and time are defined in the same way as for fixation events. As a result, two sets of saccade events, S_L and S_R, are defined for the two sides of the image pair.
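As an illustration, the following Python sketch implements this kind of velocity-threshold segmentation and attribute computation. It is not the implementation used in the experiments: the function name, the assumption that gaze positions are given in degrees of visual angle with time stamps in seconds, and the handling of edge cases are ours, and the restriction of saccades to same-side fixation pairs is left to the caller after the left/right split.

    import numpy as np

    def extract_events(gaze, t, velocity_threshold=30.0):
        # gaze: (N, 2) array of gaze positions (assumed to be in degrees of
        # visual angle); t: (N,) array of time stamps (assumed seconds).
        dt = np.maximum(np.diff(t), 1e-6)
        speed = np.linalg.norm(np.diff(gaze, axis=0), axis=1) / dt
        # A sample is part of a saccade if the velocity leading into it
        # exceeds the threshold (30 degrees/second); otherwise a fixation.
        is_saccade = np.concatenate([[False], speed > velocity_threshold])

        fixations = []
        start = 0
        for i in range(1, len(gaze) + 1):
            if i == len(gaze) or is_saccade[i] != is_saccade[start]:
                if not is_saccade[start]:                       # fixation segment
                    fixations.append({
                        "p": np.median(gaze[start:i], axis=0),  # median position
                        "T": t[i - 1] - t[start],               # duration
                        "t": t[start],                          # onset time
                    })
                start = i

        # The first fixation is discarded (biased by the previous stimulus).
        fixations = fixations[1:]

        # Saccade attributes between successive fixations; the caller keeps
        # only those whose endpoints fall on the same side of the image pair.
        saccades = []
        for f0, f1 in zip(fixations[:-1], fixations[1:]):
            s = f1["p"] - f0["p"]
            l = np.linalg.norm(s)
            saccades.append({
                "d": s / max(l, 1e-6),                 # direction (unit vector)
                "l": l,                                # length
                "T": f1["t"] - (f0["t"] + f0["T"]),    # duration
                "t": f0["t"] + f0["T"],                # onset time
            })
        return fixations, saccades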
We compute various statistics for each attribute from these fixation and saccade sets.
Table 1 summarizes the attribute and statistical operation combinations. The means and variances are computed for all the attributes, and the covariances between x and y are additionally computed for the vector attributes (fixation position and saccade direction). The sums are computed for the scalar quantities other than time t, and the total counts of the fixation and saccade events are also computed and normalized so that the sum between the left and right images becomes 1.0. There are a total of 25 computed values for each side (11 from the fixations and 14 from the saccades), and they are concatenated to form the 50-dimensional feature vector x_f of a paired image, with the 25 left-side values followed by the 25 right-side values.
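A sketch of how these per-side statistics could be assembled is given below. The helper name side_features and the ordering of the elements are hypothetical; only the set of statistics (normalized counts, means, variances, x-y covariances, and sums) follows Table 1.

    import numpy as np

    def side_features(fixations, saccades, n_fix_total, n_sac_total):
        P = np.array([f["p"] for f in fixations])     # fixation positions
        Tf = np.array([f["T"] for f in fixations])    # fixation durations
        tf = np.array([f["t"] for f in fixations])    # fixation times
        D = np.array([s["d"] for s in saccades])      # saccade directions
        L = np.array([s["l"] for s in saccades])      # saccade lengths
        Ts = np.array([s["T"] for s in saccades])     # saccade durations
        ts = np.array([s["t"] for s in saccades])     # saccade times

        return np.array([
            len(fixations) / n_fix_total,              # normalized fixation count
            *P.mean(0), *P.var(0), np.cov(P.T)[0, 1],  # position mean/var/xy-cov
            Tf.mean(), Tf.var(), Tf.sum(),             # duration mean/var/sum
            tf.mean(), tf.var(),                       # time mean/var (11 values)
            len(saccades) / n_sac_total,               # normalized saccade count
            *D.mean(0), *D.var(0), np.cov(D.T)[0, 1],  # direction mean/var/xy-cov
            L.mean(), L.var(), L.sum(),                # length mean/var/sum
            Ts.mean(), Ts.var(), Ts.sum(),             # duration mean/var/sum
            ts.mean(), ts.var(),                       # time mean/var (14 values)
        ])

    # The 50-dimensional feature vector of a pair concatenates both sides:
    # x_f = np.concatenate([side_features(F_L, S_L, n_fix, n_sac),
    #                       side_features(F_R, S_R, n_fix, n_sac)])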
Preference Classification. The task is to output, from the input feature vector x_f, a preference label y ∈ {1, −1} indicating whether the preferred image is the one on the left (1) or the right (−1). As discussed above, we assume that the ground-truth labels of the image preference are given, and we train a classifier that maps x_f to y using the labeled data.
Due to the symmetric nature of the problem definition, a labeled pair of images and its corresponding eye movement data provide two training samples. If the user prefers the image on the left, for example, the feature vector x_f is associated with label y = 1, while the left-right flipped feature vector, in which the left-side and right-side sub-vectors are swapped, can also be used with label y = −1 for training.
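For example, this flipping augmentation can be expressed as in the following minimal sketch (the function and variable names are ours, not from the original implementation):

    import numpy as np

    def augment_pair(x_left, x_right, prefers_left):
        # Each labeled pair yields two samples: the original feature vector
        # and its left-right flipped copy with the opposite label.
        y = 1 if prefers_left else -1
        return [(np.concatenate([x_left, x_right]), y),
                (np.concatenate([x_right, x_left]), -y)]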
Random forest (Breiman, 2001) is a supervised classification method that uses a set of decision trees. Given a set of training samples, the random forest algorithm trains each decision tree on a random subset of the samples. Each tree is grown by selecting, at each node, the feature element and threshold value that most accurately split the samples into the correct classes. After the training, an unknown input feature is classified based on a majority vote over the trees. In addition to its accuracy and computational efficiency, the random forest algorithm has the advantage that it can provide feature importances by evaluating the fraction of the training samples that are correctly classified using each element. The classifiers used in the experiments are implemented using the scikit-learn library (http://scikit-learn.org/) (Pedregosa et al., 2011). The number of trees was empirically set to 1000, and the depth of each tree was restricted to 3.
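Using scikit-learn, this setup corresponds to something like the snippet below, where X_train, y_train, and X_test are placeholder arrays holding the feature vectors and preference labels described above:

    from sklearn.ensemble import RandomForestClassifier

    # Hyper-parameters stated in the text: 1000 trees with a maximum depth of 3.
    clf = RandomForestClassifier(n_estimators=1000, max_depth=3)
    clf.fit(X_train, y_train)       # X_train: (n_samples, 50), y_train in {1, -1}
    y_pred = clf.predict(X_test)    # predicted preference labels for unseen pairs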
Image-based Preference Estimation
An alternative approach to image preference estimation is to use features extracted from the image pairs themselves. In addition to the method using eye movement features described above, we also examine image features in the same classification framework for comparison. In this section, we briefly describe the image features, which are defined following a state-of-the-art method for aesthetic quality estimation. The input is a pair of images, i.e., image I_L displayed on the left side and I_R on the right side. By extracting image features v_L and v_R from each of the images, we can use the concatenated feature vector x_v of v_L and v_R in the same way as for the classification using the eye movement features.
It has traditionally been considered that several rules define the aesthetic quality of images, such as color harmony theory and the rule of thirds. While such rules can serve as a rough guideline, it is not an easy task to quantify subjective image quality. More recently, data-driven learning approaches to aesthetic quality estimation have become popular. In these works, an aesthetic quality estimator is trained on a large dataset of images with community-provided quality scores obtained from websites such as Photo.Net (Datta et al., 2006) and DPChallenge.com (Ke et al., 2006). They used several image-related features, including generic image descriptors that are not explicitly related to image quality, and showed that the community scores can be predicted well by the learned estimator.
Image Features. Following (Marchesotti et al., 2011), we adopt two generic image features. The first is the GIST feature (Oliva & Torralba, 2001; Douze, Jégou, Sandhawalia, Amsaleg, & Schmid, 2009), which is commonly used in scene recognition tasks. The GIST feature represents the overall layout and structure of an image as a set of local histograms of Gabor filter responses. In our setting, an input image is resized to 64 × 64 pixels and then divided into a 4 × 4 regular grid. The filter responses at six orientations are computed at each level of a two-level image pyramid, and histograms are extracted from each grid cell for each color channel to form the 192-dimensional GIST feature vector.
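The listing below is a rough, GIST-like sketch rather than the reference GIST implementation: it computes Gabor magnitude responses at six orientations on a two-level pyramid and averages them over a 4 × 4 grid. The filter frequency, the per-channel handling, and the resulting dimensionality are assumptions and may differ from the exact 192-dimensional configuration used in the experiments.

    import numpy as np
    from skimage.transform import resize
    from skimage.filters import gabor

    def gist_like(image, orientations=6, grid=4, frequency=0.25):
        # image: (H, W, 3) float array; resized to 64x64 and to a coarser
        # 32x32 level to form a two-level pyramid.
        levels = [resize(image, (64, 64)), resize(image, (32, 32))]
        feats = []
        for level in levels:
            for ch in range(level.shape[-1]):             # per color channel
                for k in range(orientations):
                    real, imag = gabor(level[..., ch], frequency=frequency,
                                       theta=np.pi * k / orientations)
                    mag = np.hypot(real, imag)            # Gabor magnitude
                    h, w = mag.shape
                    feats += [mag[i*h//grid:(i+1)*h//grid,
                                  j*w//grid:(j+1)*w//grid].mean()
                              for i in range(grid) for j in range(grid)]
        return np.array(feats)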
The second feature is based on the bag-of-features (BoF) representation of local descriptors (Sivic & Zisserman, 2003). Inspired by the bag-of-words representation used in natural language processing, the BoF representation describes an image by the frequencies of visual codewords. Local descriptors are first extracted from the training images, and a visual codebook, i.e., a discrete set of representative descriptors, is learned. Then, each local descriptor of an input image is assigned to one of the codewords, and the image is represented as a histogram of codewords.
The BoF representation, which is typically based on scale- and rotation-invariant local descriptors such as the scale-invariant feature transform (SIFT) (Lowe, 2004), is widely used in various image recognition tasks. We use two local descriptors, SIFT and color, as in (Marchesotti et al., 2011). The SIFT descriptor is a rotation-invariant histogram of local gradients defined relative to the most prominent orientation in the local region. Unlike the original method (Lowe, 2004), which extracts SIFT descriptors at sparse keypoint locations, these descriptors are densely extracted on regular grids (Jurie & Triggs, 2005). The grid points are placed every 64 pixels, and 64 × 64 local image patches are extracted. The 128-dimensional SIFT descriptors are computed from the local patches at four scales. The color descriptors are also extracted from the same local patches. Each patch is divided into a 4 × 4 grid, and the mean and standard deviation per color channel of each cell are computed to form the 96-dimensional color descriptor.
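As a simple example, the 96-dimensional color descriptor can be computed as in the following sketch (the function name is ours; the patch is assumed to be a 64 × 64 RGB array):

    import numpy as np

    def color_descriptor(patch, grid=4):
        # patch: (64, 64, 3) array.  Each of the 4x4 cells contributes the
        # per-channel mean and standard deviation: 4*4*3*2 = 96 dimensions.
        h, w, _ = patch.shape
        desc = []
        for i in range(grid):
            for j in range(grid):
                cell = patch[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                desc.extend(cell.mean(axis=(0, 1)))   # mean per channel
                desc.extend(cell.std(axis=(0, 1)))    # std per channel
        return np.array(desc)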
The dimensionality of each of these two descriptors is reduced to 64 by principal component analysis (Jolliffe, 2005). Then, the codebooks of the two descriptors are obtained by clustering the features extracted from the training data into 100 clusters. The clustering is done by fitting Gaussian mixture models using the EM algorithm (Dempster, Laird, & Rubin, 1977). The descriptors extracted from an input image are assigned to their nearest codewords, and the resulting 100-dimensional histograms of both descriptor types are concatenated to form the 200-dimensional BoF feature vector.
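A possible realization of this codebook and histogram step with scikit-learn is sketched below. The diagonal covariance for the Gaussian mixture and the hard assignment via the most likely component are our assumptions, and the dense SIFT / color descriptors are assumed to be extracted beforehand.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def build_codebook(train_descriptors, n_dims=64, n_words=100):
        # PCA to 64 dimensions followed by a 100-component GMM fitted with EM.
        pca = PCA(n_components=n_dims).fit(train_descriptors)
        gmm = GaussianMixture(n_components=n_words, covariance_type="diag")
        gmm.fit(pca.transform(train_descriptors))
        return pca, gmm

    def bof_histogram(descriptors, pca, gmm, n_words=100):
        # Assign each descriptor to a codeword and return the codeword histogram.
        words = gmm.predict(pca.transform(descriptors))
        hist = np.bincount(words, minlength=n_words).astype(float)
        return hist / max(hist.sum(), 1.0)

    # The SIFT and color histograms are concatenated into the 200-dimensional
    # BoF feature, e.g.:
    # x_bof = np.concatenate([bof_histogram(sift_desc, pca_s, gmm_s),
    #                         bof_histogram(color_desc, pca_c, gmm_c)])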
Results
In this section, we discuss our experimental results to validate the gaze-based preference estimation method. The experiments have three purposes: 1) to see whether the data-driven training is an improvement over simple classification approaches, 2) to assess the difference between gaze-based and image-based classifiers, and 3) to test the performance of the gaze-based estimation in a free-viewing scenario.
Image-based Estimation
Figure 3 shows a comparison between the gaze-based and image-based classifiers. The first and second graphs show the classification accuracies using the two image features, GIST and BoF (SIFT and color descriptors). As described earlier, the same random forest framework as for the gaze-based classifier (the third graph) was used for both features. The fourth graph additionally shows the mean accuracy of a classifier using a combined image and gaze feature. In this case, the BoF and gaze feature vectors were concatenated, and the random forest classifier was trained in the same way as above. All of the classifiers were evaluated by conducting a within-subject leave-one-out cross validation.
The image-based classifiers performed better than the metadata-based baseline methods discussed in the previous section; however, the gaze-based classifier significantly outperformed them all (Wilcoxon signed-rank test: p < 0.01). The results using the joint feature showed slightly better accuracy, but we did not observe any significant difference. Although prior work claimed that aesthetic image quality can be estimated in a similar data-driven manner, these results show that inferring personal image preference is a much more difficult task. Eye movements can tell us a lot about personal preferences, and this indicates the potential of gaze information in the context of media understanding.
Personal Differences
In the previous section, training was conducted using each participant's personal dataset. While this follows the standard procedure for supervised classification, it is not always possible to collect the most appropriate training data from the target user. The objective in this section is to confirm whether it is possible to use training data obtained from different people for the classification task.
Figure 4 shows an accuracy comparison between the within-subject and cross-subject training conditions for both the image-based and gaze-based classifiers. The within-subject condition corresponds to the leave-one-out setting discussed in the previous section. In the cross-subject condition, the training and testing were done in a leave-one-subject-out manner; the classifier was trained for each person using the data from the other participants. Each graph in Figure 4 corresponds to a participant (s1 to s14), and the rightmost graphs show the mean accuracy over all the participants.
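With scikit-learn, the two evaluation conditions can be set up roughly as follows; X_subject / y_subject, X_all / y_all, and subject_id are placeholder arrays holding one participant's samples, all samples, and the per-sample participant index, respectively.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneOut, LeaveOneGroupOut, cross_val_score

    clf = RandomForestClassifier(n_estimators=1000, max_depth=3)

    # Within-subject condition: leave-one-out over one participant's pairs.
    within_scores = cross_val_score(clf, X_subject, y_subject, cv=LeaveOneOut())

    # Cross-subject condition: leave-one-subject-out, i.e., train on the other
    # participants and test on the held-out one.
    cross_scores = cross_val_score(clf, X_all, y_all,
                                   groups=subject_id, cv=LeaveOneGroupOut())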
While the within-subject training improves the accuracy for some participants, such as s4, the cross-subject training generally achieved a comparable level of accuracy, and there was no statistically significant difference in the mean scores. This indicates that the learning-based framework could successfully capture discriminative eye movements that are commonly observed among different people.
Feature Importances
It is also important to visualize the differences between the within-subject and cross-subject conditions and to quantitatively assess how each element of the feature vector contributed to the classification task. The feature importances of the gaze features obtained during the random forest training process are shown in Figure 5. In our implementation, the feature importances are computed as the fraction of the samples that each element contributed to in the final prediction; a higher value thus means a larger contribution to the classification.
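In scikit-learn, such importances can be read directly from the trained forest (note that scikit-learn's default definition is impurity-based and may differ slightly from the sample-fraction description above):

    # After training, the per-element feature importances of the forest are
    # available from the classifier; higher values indicate a larger
    # contribution to the classification.
    importances = clf.feature_importances_      # shape: (50,), sums to 1.0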
Our 50-dimensional gaze feature vector consists of 25 statistical measures computed from each of the two image regions of a pair. However, as discussed earlier, the definition of the classification task is symmetric, and the labeled training data was duplicated to create left-right flipped training samples. Therefore, two corresponding elements (e.g., fixation counts on the left side and the right side) theoretically have the same importance throughout the training process, and the sums of the two values are shown in Figure 5. The graphs correspond to the importances of the 25 features listed in Table 1 and are color-coded according to the training data used. s1 to s14 indicate the within-subject training condition, i.e., the feature importances obtained when personal training datasets were used, and All indicates the case in which the data from all 14 participants were used for training.
The three most contributing features are fixation-count, fixation-duration-sum, and saccade-count in most cases, and this agrees with the gaze cascade effect. Compared to these three elements, the contribution of saccade-duration-sum is not very high. The time stamp statistics (time-mean and time-variance of both fixations and saccades) showed a certain amount of contribution, and saccade-length-sum also contributed for some participants. It can be seen that participant s4, who showed the largest performance improvement from within-subject training in Figure 4, had a unique distribution compared to the other participants, and the fixation position was the key to the improvement. The random forest algorithm can only indicate that the combination of these features led to the performance gain; it cannot provide any reasoning behind each factor. Gathering further evidence toward a better understanding of the mechanisms behind preference decisions will be an important direction for future work.
Free Viewing
The results discussed in the previous sections were based on the dataset obtained during the labeling phase, in which the participants were instructed to assign preference labels. This setting is the same as in prior work (Bee et al., 2006; Glaholt et al., 2009); however, as discussed in (Shimojo et al., 2003), the labeling task itself can affect eye movements, and the gaze cascade effect is not strongly observed during free viewing. From a practical point of view, the applicability of preference estimation is severely limited if it can be done only when users are instructed to judge their preferences.
Figure 6 shows the performance of the gaze-based and image-based classifiers on the data recorded during the free viewing phase of the experiments. We used 400 pairs from the labeling phase as the training data for the target person, and the classifier was tested on 80 pairs from the free viewing phase. The first two graphs show the mean accuracies of the two image-based classifiers, and the rightmost graph shows the mean accuracy of the gaze-based classifier.
While it was less accurate than on the test data from the labeling phase, the mean accuracy of the gaze-based classifier was 61%, still significantly higher than the results of the metadata-based baseline methods (Wilcoxon signed-rank test: p < 0.01). However, it must be pointed out that the difference from the image-based classifiers was much smaller than in the previous cases, and no statistically significant difference was found between the image-based and gaze-based classifiers (Wilcoxon signed-rank test: p = 0.70). This indicates an important limitation of the gaze-based preference estimation method; i.e., the performance gain over image-based estimation depends highly on the presence of a preference decision task, and its performance is almost equivalent to that of image-based estimation in a free-viewing scenario.
For comparison, the third graph shows the results when the data from the free viewing phase was used for both training and testing. The mean accuracy was evaluated by conducting a within-subject leave-one-out test. The results are less accurate than when using the training data from the labeling phase; however, since the amount of training data from the free viewing phase was much smaller than that from the labeling phase, a direct comparison is not possible. A detailed investigation using more training data will be important future work.
Conclusion
We presented a data-driven approach to image preference estimation from eye movements. A labeled dataset of eye movements was collected from 14 participants who compared two images shown side by side under two conditions, free viewing and preference labeling. The feature vectors were composed of a set of fixation and saccade event statistics, and the random forest algorithm was used to build a set of decision trees. This allowed us to not only build image preference classifiers but also assess the contribution of each statistic to the classification task.
The proposed classifier was more accurate than the metadata-based baseline methods, and the data-driven training was shown to improve accuracy over a simple classification strategy based on fixation duration alone. While the training was shown to be effective even when using training data from different people, variations could be observed in the feature importances obtained during the training process.
We also compared the gaze-based preference estimation technique with image-based methods based on generic image features. The classification performance of the gaze-based method was significantly better than that of the image-based methods, indicating the effectiveness of the data-driven approach for classification tasks that use eye movements. However, we observed a lower level of accuracy under the free viewing condition than under the labeling condition, and the performance was almost equivalent to that of the image-based estimation technique. This strongly suggests that the characteristic eye movements are caused by the preference decision activity itself, and further investigation will be required to improve the accuracy of preference estimation under free viewing.
With our approach, image preferences can be inferred from eye movements during image browsing. This allows us to explore new applications of eye movements, e.g., automatic image organization and summarization. Our future work will include extending the learning-based preference estimation approach to single images. Since our experimental setting implies a two-item comparison task even without instruction, there can still be a task-related eye movement bias. On the other hand, the relationship between eye movements and subjective preference for single images is less clear, and it will be increasingly important to investigate machine learning-based techniques more thoroughly.