Tensor-Based Emotional Category Classification via Visual Attention-Based Heterogeneous CNN Feature Fusion.

This paper proposes a method of visual attention-based emotion classification through eye gaze analysis. Concretely, tensor-based emotional category classification via visual attention-based heterogeneous convolutional neural network (CNN) feature fusion is proposed. Based on the relationship between human emotions and changes in visual attention with time, the proposed method constructs a new gaze-based image representation that reflects the characteristics of the changes in visual attention with time. Furthermore, since emotions evoked in humans are closely related to objects in images, our method uses CNN models to obtain CNN features that can represent their characteristics. To improve the representation ability for the emotional categories, we extract multiple CNN features from our novel gaze-based image representation and enable their fusion by constructing a novel tensor consisting of these CNN features. Thus, this tensor construction realizes the visual attention-based heterogeneous CNN feature fusion, which is the main contribution of this paper. Finally, by applying logistic tensor regression with general tensor discriminant analysis to the newly constructed tensor, the emotional category classification becomes feasible. Experimental results show that the proposed method achieves emotional category classification with an F1-measure of approximately 0.6, an improvement of about 10% over comparative methods including state-of-the-art methods, which verifies the effectiveness of the proposed method.


Introduction
Due to the increasing number of images on the Web, the demand for image understanding has increased [1][2][3]. Image understanding mainly focuses on two types of information: image-based information and human-based information. By using image-based information such as textures and luminance gradients, many researchers have tried to investigate semantic segmentation and object recognition [4][5][6][7][8][9]. Moreover, by using human-based information such as brain activities and gaze movements, many researchers have tried to investigate image emotion recognition and interest level estimation [10][11][12][13][14]. Therefore, we divide image understanding into image-based understanding and human-based understanding corresponding to the first and second types of information, respectively. Although the recent development of convolutional neural networks (CNNs) [4] has enabled the realization of image-based understanding with high performance [4][5][6][7][8][9], human-based understanding is still difficult since it is closely related to abstract semantics perceived by humans [15]. Image emotions lie on the highest level of abstract semantics, which can be defined as semantics describing the intensities and types of feelings, moods, affections, or sensibility evoked in humans viewing images [16]. In this study, we focus on the classification of images into emotional categories.
In studies on the estimation of emotions evoked in humans gazing at images, the effectiveness of several bio-signals has been reported [17][18][19][20]. It has been shown in the fields of psychology and neuroscience that human emotions are evoked by objects included in images [21,22]. Moreover, there is a relationship between the emotional properties of an image and visual attention, i.e., the changes with time in visual attention are closely related to human emotions [23]. Therefore, in the same manner as emotion estimation, it is expected that information on the objects gazed at and information on changes in visual attention with time are effective for emotional category classification.
In order to use the information on objects gazed at and information on changes in visual attention with time, we should obtain gaze data including the gazed locations of images and their duration times. Moreover, the objects in gazed areas need to be characterized by using CNNs, which have been successfully applied to object recognition. Therefore, the acquisition of gaze data and the training of CNNs for object recognition are needed for emotional category classification. Because collecting a large amount of training gaze data places a heavy burden on users, the number of images with gaze information is limited. On the other hand, CNNs need a large amount of training data. Thus, when eye gaze data are used, CNNs trained from scratch are not suitable for emotional category classification. It is necessary to use CNNs that are pre-trained on datasets from other domains and to extract the outputs of an intermediate layer of the pre-trained CNN as CNN features. Extraction of CNN features is well known as one of the transfer learning approaches [24]. In addition to consideration of objects that are gazed at, information on changes in visual attention with time is effective for emotional category classification as described above. Thus, this information should be dealt with together with consideration of objects gazed at. Then, in order to extract CNN features, we simply represent images superimposed with the changes in visual attention with time. The superimposed representation is simple but effective for emotional category classification [25]. Therefore, for the collaborative use of CNN features and changes in visual attention with time, we adopt the new image representation based on the superimposition and extract its CNN features.
Although CNN features have high representation ability for categories of the source domain, they do not necessarily have this ability for our target domain. Thus, for obtaining more semantic features and improving the representation ability for the image emotional category, it is desirable to use multiple CNN features calculated from multiple CNN models. Then, we have to consider a heterogeneous feature fusion method. For fusing heterogeneous CNN features, we should deal with not only changes over time but also interactions between CNN features. However, since CNN features have high dimensions, the fusion and analysis of the information are difficult. We therefore focus on tensor-based feature fusion rather than vector concatenation. The dimension of each mode of the constructed tensor is lower than the dimension obtained by vector concatenation. Thus, tensor-based feature fusion enables analysis of the changes over time and interactions between CNN features. However, we need to handle high-order information including information on CNN features themselves, the number of CNN features, and the changes over time. Consequently, for emotional category classification, a learning methodology with tensor analysis is strongly needed.
In this paper, we propose a new method for tensor-based emotional category classification via visual attention-based heterogeneous CNN feature fusion. In the proposed method, the new gaze-superimposed image representation [25] is adopted for associating images with eye gaze data as shown in Figure 1. Moreover, we extract multiple CNN features from each frame of the image representation. Note that a frame in the proposed method means the pair of the image and the visual attention at each time unit obtained by dividing the total gaze time in this image representation, although the term frame is generally used for a movie. Furthermore, from these CNN features, we construct a new CNN feature-based tensor (CFT) for considering the interactions of CNN features.
Since each feature of the CFT is calculated from the gaze-based image representation, it can be expected that the proposed method will enable visual attention-based heterogeneous CNN feature fusion and that it will lead to improvement of the representation ability. Therefore, this CNN feature fusion based on a CFT is the main contribution of this paper. Finally, for the newly derived novel CFT, we perform supervised feature transformation based on general tensor discriminant analysis (GTDA) [26], which can transform original features into highly discriminant features, and realize emotional category classification based on logistic tensor regression (LTR) [27]. Consequently, accurate emotional category classification via the new feature fusion approach becomes feasible. The rest of this paper is organized as follows. Related works and placement of this study are described in Section 2. In Section 3, tensor-based emotional category classification via visual attention-based heterogeneous CNN feature fusion is explained. The effectiveness of the proposed method verified from experimental results is shown in Section 4. In Section 5, we summarize this paper and present some discussions. Note that, in Appendix A, the mathematical notations, e.g., the tensor algebra, in this paper are shown.

Related Works
In this section, we introduce related works that focus on emotional category classification. Many researchers have focused on the dominant emotional category (DEC) when they classified images into emotional categories [28][29][30]. DEC means the emotional category that is evoked in many humans when they gaze at an image. There are several methodologies for tackling the DEC classification problem [28,31,32]. Furthermore, for constructing classifiers of emotional categories in these methods, image datasets with emotional categories have been published [29,30]. In [30], the dataset consists of images collected from Flickr, and the images are mainly realistic photos. Moreover, an abstract painting dataset for classifying different types of images closely related to emotional categories has been published [29]. In contrast to realistic photos, abstract paintings do not consist of clear objects and uniform colors. Thus, classification of such images by using only features directly calculated from the images is difficult.
Pasupa et al. proposed a classification method [28] using both eye gaze data, which are closely related to human emotions, and simple handcrafted visual features. On the other hand, since CNN features, which carry more semantic information, have been used as visual features in recent years [24], Chen et al. proposed a CNN feature-based DEC classification method [32], and Rao et al. handled the outputs of some layers of a CNN trained from scratch [31]. In the DEC classification problem, CNN features are expected to be effective, and the collaborative use of eye gaze data and CNN features is expected to improve performance. Although these CNN-based methods certainly classify images into DECs with high performance, e.g., approximately 70% classification accuracy [31,32], they require a large number of images pre-classified by humans for training from scratch. Thus, CNN-based DEC classification methods are effective when a dataset with a large number of images already labeled with emotional categories exists [33], but images obtained from a domain different from that dataset cannot be classified with high performance due to the lack of labeled data. For images obtained from such a new domain, our method can help with the label assignment problem since it can be trained from a small number of training images. Therefore, human-based information such as gaze information is needed. In particular, since it is difficult to extract emotion-related characteristics, gaze information is suitable for DEC classification.
For improving the performance of the emotional category classification, it was reported in [10,34] that the use of multiple visual features is effective. Zhao et al. focused on the common factor between emotion features and each visual feature for predicting emotion distribution [10,34]. Based on the assumption that many images are pre-given to the emotion distribution, they extracted emotion features based on the emotion distribution from these images. Then, even though different visual features represent different semantics, they considered the relationships between emotion features and visual features but did not consider the relationships between visual features. Thus, although they used multiple visual features, their method cannot consider the interactions between visual features.
From the above discussion, we focus on the collaborative use of eye gaze data and multiple CNN features for image emotional category classification. In order to use multiple CNN features with consideration of their interactions, we newly introduce their feature fusion.

Tensor-Based Emotional Category Classification
In this section, we explain the proposed method. Our method classifies images into emotional categories via tensor-based analysis that enables realization of visual attention-based heterogeneous feature fusion suitable for our target problem. An overview of the proposed method is shown in Figure 2. Construction of the new gaze-based image representation for relating images and visual attention with changes over time is shown in Section 3.1. CNN feature extraction and construction of the CFT are shown in Section 3.2. Feature transformation based on GTDA and LTR-based emotional category classification using the transformed CFT are shown in Sections 3.3 and 3.4, respectively. Since the effectiveness of the use of the combination of GTDA and LTR has been confirmed in [35], we adopt them in our method.

Construction of Gaze-Based Image Representation
In order to construct the new gaze-based image representation, the proposed method associates images with eye gaze data. We denote training images as $X^{\mathrm{image}}_n \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ ($n = 1, 2, \cdots, N$; $N$ being the number of training images). Note that the dimensions $d_1$, $d_2$, and $d_3$ correspond to the width and the height of an image and the number of color channels, i.e., three. In our method, a fixation map of each frame $f$ ($= 1, 2, \cdots, d_4$; $d_4$ being the number of frames) is constructed on the basis of eye gaze data, and a Gaussian filter is applied to the obtained fixation map to obtain $W^{\mathrm{gaze}}_{n,f} \in \mathbb{R}^{d_1 \times d_2}$. Eye gaze data include data of gazed locations and their duration times, and we construct the fixation map by voting for pixel locations based on gazed locations. Then, a gaze and image weight (GIW) matrix $W_{n,f} \in \mathbb{R}^{d_1 \times d_2}$ of each frame $f$ is calculated as follows: where $O \in \mathbb{R}^{d_1 \times d_2}$ is a matrix for which the elements are all one. Finally, the image representation $X^{\mathrm{4th}}_n \in \mathbb{R}^{d_1 \times d_2 \times d_3 \times d_4}$ is calculated by using GIW as follows: where $X^{\mathrm{4th}}_{n,\mathrm{col},f} \in \mathbb{R}^{d_1 \times d_2}$ and $X^{\mathrm{image}}_{n,\mathrm{col}} \in \mathbb{R}^{d_1 \times d_2}$ ($\mathrm{col} = 1, 2, \cdots, d_3$) are respectively obtained by matricizing $X^{\mathrm{4th}}_n$ and $X^{\mathrm{image}}_n$ for the mode of color channels. The operator "•" means the calculation of the Hadamard product. The fourth-order GIT can reconstruct the original images as follows: Thus, this representation consists of the image and the visual attention. By adopting this image representation, we extract CNN features with consideration of the changes in visual attention with time. In this way, construction of the new image representation, which is the input for the emotional category classification, becomes feasible.
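The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the fixation record layout `(x, y, start, duration)`, the duration-weighted vote, and the function names `fixation_maps` and `hadamard` are assumptions made for illustration. The Gaussian filter and the exact GIW formula are deliberately left out, so `hadamard` only shows the elementwise product between one color channel and a given weight matrix.

```python
def fixation_maps(fixations, width, height, num_frames):
    """Vote gaze samples into one fixation map per frame.

    fixations: list of (x, y, start, duration) tuples. This record layout is
    an assumption for illustration, not the eye tracker's actual format.
    Returns maps[f][y][x], the gaze time spent on pixel (x, y) in frame f.
    """
    total = max(start + dur for _, _, start, dur in fixations)
    frame_len = total / num_frames  # normalize the gazing time to num_frames
    maps = [[[0.0] * width for _ in range(height)] for _ in range(num_frames)]
    for x, y, start, dur in fixations:
        if not (0 <= x < width and 0 <= y < height):
            continue  # ignore samples that fall outside the image
        end = start + dur
        first = int(start // frame_len)
        last = min(int(end // frame_len), num_frames - 1)
        for f in range(first, last + 1):
            # Vote with the part of the fixation that overlaps frame f
            # (duration weighting is an assumption of this sketch).
            lo = max(start, f * frame_len)
            hi = min(end, (f + 1) * frame_len)
            if hi > lo:
                maps[f][y][x] += hi - lo
    return maps


def hadamard(channel, weight):
    """Elementwise (Hadamard) product of a color channel and a weight matrix."""
    return [[c * w for c, w in zip(c_row, w_row)]
            for c_row, w_row in zip(channel, weight)]


# Two hypothetical fixations on a 4x3 image, split into 4 frames.
fixations = [(1, 1, 0.0, 2.0), (3, 2, 2.0, 2.0)]
maps = fixation_maps(fixations, width=4, height=3, num_frames=4)
weighted = hadamard([[1, 2], [3, 4]], [[0.5, 1], [0, 2]])
```

In this toy input, the first fixation contributes to frames 1 and 2 and the second to frames 3 and 4, which is the per-frame structure that the fourth mode of the image representation is meant to capture.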

Extraction of CNN Features and Construction of CFT
The proposed method extracts CNN features from the outputs of the last pooling layer of pre-trained CNNs. Specifically, we extract CNN features by using three kinds of state-of-the-art CNNs: DenseNet201 [36], InceptionResNet-v2 [37], and Xception [38]. It should be noted that the kinds and the number of CNNs are set experimentally in Section 4, since the purpose of this paper is to reveal the effectiveness of the use of multiple CNN features for emotional category classification. In this paper, we therefore simply choose the above CNNs as state-of-the-art methodologies. The dimensions of these CNN features are 1920, 1536, and 2048, respectively. In the proposed method, we construct the CFT by aligning these features. However, since the dimensions of these CNN features are different, their direct concatenation is difficult. Thus, we apply supervised dimensionality reduction to these CNN features to unify their dimensions to the lowest one, i.e., 1536. In the proposed method, we simply adopt Fisher discriminant analysis (FDA) [39], which is one of the most well-known supervised dimensionality reduction methods. Finally, by aligning the CNN features, the proposed method constructs the third-order CFT $V^{\mathrm{3rd}}_n$. In the proposed method, we adopt multiple CNN features for improving the representation ability. Moreover, our novel representation, the CFT, can consider the dimensions of each CNN feature, the changes in visual attention with time, and the kinds of CNN features. In this way, the proposed method simultaneously enables consideration of the interactions of multiple CNN features. Therefore, the proposed heterogeneous CNN feature fusion, i.e., the construction of the CFT, is expected to have high representation ability.
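As a rough illustration of the alignment step, the sketch below stacks per-frame features from several CNNs into a third-order array. It assumes the features have already been reduced to a common dimension, and it assumes the mode order (feature dimension, kind of CNN, frame); the name `build_cft` and the small sizes are hypothetical, and random numbers stand in for real CNN features.

```python
import numpy as np


def build_cft(features_per_cnn):
    """Stack per-CNN feature matrices into a third-order tensor.

    features_per_cnn: list of arrays, one per CNN, each of shape (d_f, d_4),
    i.e., (feature dimension, number of frames). All arrays must already
    share one shape, e.g., after dimensionality reduction.
    Returns an array of shape (d_f, n_cnn, d_4); this mode order is an
    assumption of the sketch.
    """
    shapes = {f.shape for f in features_per_cnn}
    if len(shapes) != 1:
        raise ValueError("all CNN features must share one shape")
    return np.stack(features_per_cnn, axis=1)


# Hypothetical sizes: 8-dimensional features, 3 CNNs, 5 frames.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 5)) for _ in range(3)]
cft = build_cft(feats)
```

Changing the order of `feats` permutes the second mode, which is where the combination order of the CNN features (e.g., D-I-X versus D-X-I) would enter a mode-wise analysis.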

Feature Transformation Based on GTDA
We apply GTDA to $V^{\mathrm{3rd}}_n$ to obtain discriminative features that are suitable for emotional category classification. We define the class label $y_n \in \{0, 1\}$ annotated to an image $X^{\mathrm{image}}_n$. Then, $y_n = 1$ means that the $n$-th image $X^{\mathrm{image}}_n$ includes a target label, i.e., a target emotional category. Note that, since each image has multiple emotional categories, the proposed method has to deal with multi-label problems, and the binary classification for each emotional category is thus adopted. In order to calculate the projection matrix of each mode $l$, we solve the following optimization problem: where $\eta_l$ is obtained as the largest eigenvalue of $(S^{w}_{l})^{-1} S^{b}_{l}$ as shown in [26]. In addition, $S^{b}_{l}$ and $S^{w}_{l}$ are defined as follows: Note that $M_y$ ($y \in \{0, 1\}$) is the mean tensor of class $y$, and $M$ is the total mean tensor of all training tensors. Note that $n_y$ is the number of images belonging to class $y$. Moreover, $M_y$ and $M$ are all third-order tensors of the same size as the CFT. Finally, we obtain a tensor $\hat{V}^{\mathrm{3rd}}_n$ by transforming the CFT $V^{\mathrm{3rd}}_n$ as follows: Therefore, we can calculate highly discriminative features by using GTDA considering the categorical information.
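The final transformation step, in which a third-order tensor is projected along each of its modes, can be sketched as follows. This is only an illustration of multi-mode projection: the projection matrices are random stand-ins rather than matrices learned by GTDA, and the names `mode_product` and `project` as well as all sizes are hypothetical.

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` (new_dim x old_dim)."""
    moved = np.moveaxis(tensor, mode, 0)                 # bring the mode to the front
    out = np.tensordot(matrix, moved, axes=([1], [0]))   # contract over old_dim
    return np.moveaxis(out, 0, mode)                     # put the new axis back


def project(tensor, projections):
    """Apply U_l^T along every mode l; projections[l] has shape (d_l, r_l)."""
    for mode, u in enumerate(projections):
        tensor = mode_product(tensor, u.T, mode)
    return tensor


# Hypothetical sizes: a (6, 3, 5) tensor reduced to (4, 2, 3).
rng = np.random.default_rng(0)
v = rng.standard_normal((6, 3, 5))
us = [rng.standard_normal((6, 4)),
      rng.standard_normal((3, 2)),
      rng.standard_normal((5, 3))]
v_hat = project(v, us)
```

In GTDA itself, each projection matrix would be obtained from the mode-wise between-class and within-class scatter matrices; the sketch covers only how such matrices, once available, are applied to a tensor.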

Emotional Category Classification Based on LTR
In order to construct the LTR-based classifier, we use the transformed CFT $\hat{V}^{\mathrm{3rd}}_n$ as the input of LTR. Given $\hat{V}^{\mathrm{3rd}}$ from a test image, we try to estimate its class label $y_{\mathrm{test}}$. The LTR model used in the proposed method is formulated as follows: where $Z$, which is a parameter tensor of regression coefficients, is the same size as that of the transformed CFT $\hat{V}_n$. In order to obtain the optimal parameter tensor $\hat{Z}$ of $Z$, we solve the following maximum log-likelihood problem: We can solve the above maximization problem by adding $L_1$-norm regularization of $Z$ based on the idea of [27]. Finally, the proposed method estimates the class label as follows: In this way, the proposed method realizes the heterogeneous CNN feature fusion and the tensor-based analysis with consideration of the changes in visual attention with time.
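A minimal sketch of how a logistic model with a tensor of coefficients can score a transformed tensor is given below. It assumes the common logistic form, in which the score is the inner product between the coefficient tensor and the input; the optional bias term, the 0.5 decision threshold, and the names `ltr_probability` and `ltr_predict` are assumptions and may differ from the formulation in [27]. Estimating the coefficients with $L_1$-norm regularization is not covered.

```python
import math

import numpy as np


def ltr_probability(z, v, bias=0.0):
    """Probability of the positive class under a logistic model.

    z: coefficient tensor, v: input tensor of the same shape.
    The bias term is an assumption of this sketch.
    """
    score = float(np.sum(z * v)) + bias  # tensor inner product <Z, V>
    return 1.0 / (1.0 + math.exp(-score))


def ltr_predict(z, v, bias=0.0, threshold=0.5):
    """Binary label: 1 if the probability reaches the (assumed) threshold."""
    return int(ltr_probability(z, v, bias) >= threshold)


# Hypothetical 2x2x2 coefficient and input tensors.
z = np.ones((2, 2, 2))
v = 0.5 * np.ones((2, 2, 2))
label = ltr_predict(z, v)
```

Because each emotional category is handled as a separate binary problem, one such coefficient tensor would be needed per category.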

Experimental Results
We show experimental results in this section in order to verify the effectiveness of the proposed method. The experimental conditions are shown in Section 4.1 and the performance evaluation is shown in Section 4.2.

Experimental Conditions
A dataset of abstract paintings that contains 280 images [29] was used in the experiment. Each image was annotated with at least one emotion label among eight emotional categories (awe, amusement, contentment, anger, excitement, sad, disgust, and fear) [29] (images were rated by at least 14 persons in a web survey performed by Machajdik et al.). It should be noted that these emotional categories were defined by the psychological study on affective images [40]. We used these annotations as ground truths, and most of the images have several emotion labels. Thus, we trained our method and the comparative methods for each emotional category and each subject. From the 280 images, 224 images were randomly selected as training images and the remaining 56 images were used as test images to evaluate the performance of our emotional category classification method. For the evaluation measure, we adopted the F1-measure ($F$) obtained as follows: $F = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$, where Recall and Precision are calculated by using the obtained classification results as follows: $\mathrm{Recall} = \frac{TP}{TP + FN}$ and $\mathrm{Precision} = \frac{TP}{TP + FP}$. TP, FN, and FP mean the numbers of images estimated to be true positive, false negative, and false positive, respectively. Since the number of images in the dataset was limited, evaluation was performed with a statistical test, Welch's t-test [41], between our method and other methods, and the results are shown with the F1-measure. Thirteen able-bodied subjects (eleven healthy males and two healthy females), aged between 22 and 26 years (mean age: 23.5 ± 1.2 years), participated in the experiment. These subjects had normal or corrected-to-normal vision, and their eye gaze data were collected. The eye gaze data were obtained through Tobii Eye tracker 4C (https://tobiigaming.com/eye-tracker-4c/). Each subject gazed at images until some emotions were evoked (this human research was conducted with the approval of the ethical committee of Hokkaido University).
These subjects just gazed at the images, and we did not collect their evoked emotions; i.e., the ground truths were not their evoked emotions but the emotion labels provided by [29]. Their gaze was adjusted to the center of the display in one second before a new image was shown. The gazing time length was normalized in such a way that it became $d_4$.
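The F1-measure used as the evaluation measure can be computed from binary labels as in the following sketch. The function name `f1_measure`, the example labels, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
def f1_measure(y_true, y_pred):
    """F1-measure for one emotional category from binary labels (1 = present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    # Returning 0.0 for empty denominators is a convention of this sketch.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Hypothetical ground truths and predictions for six test images.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_measure(y_true, y_pred)  # TP = 3, FN = 1, FP = 1
```

Since a separate binary classifier is trained for each emotional category and each subject, this computation would be repeated per category and per subject before averaging.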
For comparison with the proposed method (PM), we adopted eight comparative methods as shown in Table 1. Comparative method 1 (CM1) does not use changes in visual attention with time. Therefore, in CM1, $d_4 = 1$ in the new image representation. Furthermore, CM1 uses only one CNN feature among the three kinds of CNN features shown in Section 3.2. CM1 was adopted for evaluating the novel approaches introduced in this paper. We also adopted comparative method 2 (CM2), which uses only eye gaze features extracted on the basis of the state-of-the-art method [42], and we performed emotional category classification based on an extreme learning machine (ELM) [43]. CM2 was adopted for evaluating the use of the combination of image information and gaze information in our method. We also performed a comparison with the following three methods. We adopted a recently proposed method [28] that collaboratively uses eye gaze information and hand-crafted visual features as comparative method 3 (CM3). Since it uses multi-modal features such as gaze features and visual features, this fusion method is considered to be suitable for comparison. Qiu et al. proposed an emotional category classification method [44] by performing fusion of bio-information based on deep canonical correlation analysis (Deep CCA) [45]. Thus, we used the above state-of-the-art method as comparative method 4 (CM4) by using gaze features [42] and CNN features. Comparative method 5 (CM5) classifies images into emotional categories by applying feature fusion based on CCA [46] to both CNN features and gaze features [42]. Comparative method 6 (CM6) and comparative method 7 (CM7) use CNN feature fusion based on vector concatenation. Concretely, CM6 makes a second-order CFT whose modes are the dimension of the CNN features and the change over time. CM6 concatenates multiple CNN features at each time and applies GTDA to the constructed second-order CFT.
Then, CM7 concatenates all of the CNN features; that is, CM7 treats a vector whose dimension is the dimension of the CNN features times the number of kinds of CNN features. We averaged the CNN features over time to keep the dimension from becoming even higher. In order to handle the vector, CM7 applies FDA [39] instead of GTDA. CMs 6 and 7 classify these features into emotional categories based on a support vector machine (SVM) [47], which is one of the simplest classifiers, and ELM. Finally, comparative method 8 (CM8) fuses CNN features based on late fusion. In CM8, we first constructed a second-order CFT that consists of CNN features with consideration of the change over time for each CNN feature. Then, we applied GTDA and SVM or ELM to each second-order CFT and determined the final emotional category based on a softmax function, which is one of the simple but effective late fusion methods. Indeed, the late fusion used in CM8 has been applied when multiple modalities are used [48,49].

Table 1. The differences between the proposed method (PM) and the comparative methods (CMs). The mark 'X' means that the corresponding method does not consider the time change, and the other mark means that it does. Moreover, "Softmax" means that we applied the softmax function to the outputs of several classifiers and obtained probabilities. Then, classification was performed based on the value obtained by multiplying these probabilities. Furthermore, "Hand-crafted feature" means that CM3 extracted hand-crafted visual features such as Gabor filter-based and Sobel filter-based visual features from images obtained by superimposing original images and fixation maps.

Columns: Time Change; Gaze Feature; Fusion.
CM2: X; Novel gaze feature [42]; Only gaze feature.
CM3: X; Hand-crafted feature [...]

Tables 2 and 3 show the results of the experiment. Table 2 shows the average of the F1-measures of all emotional categories calculated for each subject. Table 3 shows the average of the F1-measures of all subjects calculated for each emotional category. "D", "I", and "X" represent DenseNet201, InceptionResNet-v2, and Xception, respectively. A comparison of PM (D-I-X), PM (D-X-I), and PM (X-D-I) shows that the combination order of CNN features influences the emotional category estimation performance, and PM (D-I-X) outputs the best results on average. This is related to the mode expansion in the second mode of GTDA adopted in our method. PM (D-X-I), which has the worst results among the PMs, still outperforms all of the comparative methods. Thus, the effectiveness of PM is verified regardless of the combination order of CNN features. The influence of this order is an interesting characteristic, and a method for deciding the order should be considered. However, since this paper focuses on heterogeneous CNN feature fusion and analysis, we will tackle this decision problem in future work.

From the obtained results, the PM outperforms the comparative methods in the average of the F1-measure. The comparison of PM with CM1 verifies the effectiveness of the novel approach adopted in our method. Moreover, the use of multiple CNN features is more effective for emotional category classification than the use of only one CNN feature. It is therefore expected that the greater the number of CNN features, the higher the accuracy of the emotional category classification. However, if the number of CNN features is four or more, the number of combinations of CNN features increases. Thus, we simply used three CNN features in this experiment. A method that determines the optimal kinds and number of CNN features should be investigated in future work.

Performance Evaluation
Comparing PM with CM1 and CM2 verifies that the new gaze-based image representation and the CFT, i.e., the collaborative use of image and gaze information, are effective. Furthermore, since PM has a higher F1-measure than those of CM3 and CM4, which are recent and state-of-the-art frameworks, PM can classify images into emotional categories with high performance. A comparison of PM and CM5 indicates that the combined use of gaze information and images via both the new gaze-based image representation and the CFT is more effective for emotional category classification than the baseline fusion method. Moreover, a comparison of PM with CMs 6, 7, and 8 shows that the proposed heterogeneous CNN feature fusion and its analysis are more effective than the vector-based concatenation methods for emotional category classification. The tendency of the experimental results of CM8 is different from that of the other methods owing to the difference in the fusion method [50]. Concretely, CM8 adopted the late-fusion method, which generally provides high performance when the performances of the methods to be fused are close.
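The late-fusion rule attributed to CM8, in which a softmax is applied to the outputs of several classifiers and the resulting probabilities are multiplied, can be sketched as follows. The score layout (one `[negative, positive]` pair per classifier), the example scores, and the function names are assumptions; how the actual SVM or ELM outputs were scaled is not specified here.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(classifier_scores):
    """Fuse classifiers by multiplying their softmax probabilities.

    classifier_scores: one [negative_score, positive_score] pair per
    classifier (an assumed layout). Returns (label, fused_products).
    """
    fused = [1.0, 1.0]
    for scores in classifier_scores:
        probs = softmax(scores)
        fused = [f * p for f, p in zip(fused, probs)]
    return fused.index(max(fused)), fused


# Hypothetical outputs of three classifiers, one per CNN feature.
label, fused = late_fusion([[0.2, 1.0], [0.1, 0.3], [1.0, 0.5]])
```

With a product rule, one classifier that is confidently wrong can outweigh several mildly correct ones, which is consistent with the remark that late fusion works best when the fused methods perform similarly.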
We also show the results of Welch's t-test between PM (D-I-X) and the comparative methods. Since the p-value is lower than 0.05, PM is statistically superior to CMs 1-5 and 7. On the other hand, the results for CMs 6 and 8 have higher p-values than those of other comparative methods since CMs 6 and 8 utilize the change in CNN features with time in the same manner as that in PM. The differences between PM, CM 6, and CM 8 are only the concatenation method and when to concatenate heterogeneous modalities. In other words, CMs 6 and 8 are similar to PM. However, these differences are considered to cause the slight improvement of the classification performance.
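For reference, the statistic behind Welch's t-test can be computed as in the sketch below, which returns the t value and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances. The function name `welch_t` and the sample values are hypothetical and are not results from the experiment. Obtaining the p-value additionally requires the t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-subject F1-measures for two methods (illustrative only).
method_a = [0.60, 0.62, 0.58, 0.61, 0.59]
method_b = [0.50, 0.55, 0.45, 0.52, 0.48]
t_stat, dof = welch_t(method_a, method_b)
```

Unlike Student's t-test, this form does not pool the two variances, which suits comparisons between methods whose per-subject scores vary by different amounts.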
In addition to the quantitative evaluations, we show one of the experimental results in Figure 3. The areas that the subjects gazed at are shown in white at frames 1, 50, and 100, and PM (D-I-X) classifies this image into some categories from these gaze data. In Figure 3, if the classified category is the same as the ground truth (GT), the corresponding category is indicated in red. If the classified category is false, the corresponding category is indicated in black. Although the gaze-based image representations of Subs 2 and 7 are classified into four categories including all of the GTs, that of Sub 8 is classified into three categories including one of the GTs. Concretely, although Subs 2 and 7 gazed at almost the same area at each frame in the image shown, Sub 8 gazed at a different area. This difference causes the difference in the classified emotional categories. Thus, we confirmed that the change in visual attention with time is related to human emotions.

Discussion and Conclusions
In this paper, we presented an emotional category classification method based on tensor analysis that realizes the visual attention-based heterogeneous CNN feature fusion. In order to improve the classification performance, the PM constructs the new tensor, CFT, that integrates the outputs from multiple CNN architectures with consideration of the changes in visual attention with time. Consequently, emotional category classification becomes feasible by using GTDA and LTR. Experimental results verified the effectiveness of the PM. In the experiment, we used only one image dataset in consideration of the burden on the subjects. Obtaining eye gaze data is a great burden for the subjects, and such a burden may prevent the effectiveness from being verified correctly. However, an abstract painting dataset is more suitable than realistic image datasets for emotional category classification. Thus, the number of images in this dataset does not undermine the verified effectiveness of the PM. In future work, we will use other datasets in order to verify the robustness of our method.