Multi-Output Learning Based on Multimodal GCN and Co-Attention for Image Aesthetics and Emotion Analysis

Abstract: With the development of social networks and intelligent terminals, sharing and acquiring images has become increasingly convenient. The massive growth in the number of social images raises the demand for automatic image processing, especially from the aesthetic and emotional perspectives. Both aesthetics assessment and emotion recognition require the computer to simulate high-level visual perception and understanding, which belongs to the field of image processing and pattern recognition. However, existing methods often ignore the prior knowledge of images and the intrinsic relationships between the aesthetic and emotional perspectives. Recently, machine learning and deep learning have become powerful tools for solving mathematical problems in computing, such as image processing and pattern recognition. Both images and abstract concepts can be converted into numerical matrices, and the mapping relations between them can then be established mathematically on computers. In this work, we propose an end-to-end multi-output deep learning model based on a multimodal Graph Convolutional Network (GCN) and co-attention for aesthetic and emotion conjoint analysis. In our model, a stacked multimodal GCN encodes the features under the guidance of the correlation matrix, and a co-attention module helps the aesthetics and emotion feature representations learn from each other interactively. Experimental results indicate that our proposed model achieves competitive performance on the IAE dataset. Promising results on the AVA and ArtPhoto datasets also support the generalization ability of our model.


Introduction
Mass media and modes of interaction have been developing continuously since the Internet revolution. Today, people tend to enjoy, share, and store images of higher aesthetic quality on social networks [1]. At the same time, they often want to convey emotional information to others through images [2]. Both aesthetics and emotion are abstract, high-level semantic information of images and involve a great deal of subjectivity [3]. Aesthetic assessment and emotion recognition have a wide variety of applications, such as photo album management [4], online photo suggestion [5], image retrieval [6], digital monitoring [7], and news analysis [8]. Therefore, analyzing the aesthetics and emotion of images is of significant value [9].
The study of image aesthetics is typically cast as a classification or regression problem in which the visual content is mapped to the aesthetic ratings provided by human annotators [10,11]. Image emotion recognition tries to predict the human emotion aroused by a particular piece of visual content [12,13]. Many early works focused on manually crafted methods, which rely on various pixel-level features as well as feature representations such as color histograms [14], texture descriptors [15], and SIFT [16]. In recent years, many deep neural network models, especially Convolutional Neural Networks (CNNs), have achieved state-of-the-art performance on many computer vision tasks [17][18][19][20]. An important reason for the success of CNNs is the availability of many large-scale datasets, such as AVA [21], MS_COCO [22], T4SA [23], and ArtPhoto [24]. These datasets have greatly promoted progress in many areas of visual research, for example, image semantic segmentation [25,26], aesthetics assessment [27,28], and emotion recognition [29,30].
However, most studies treat image aesthetics assessment and emotion recognition as two independent problems, ignoring the fact that both are people's mental responses to visual stimuli and are correlated: an image can touch one's heart not only because of its visual content but also because of its aesthetic composition. Images with higher aesthetic quality usually arouse positive emotions in people. Psychological studies have shown that human aesthetics and emotion can be aroused by visual content at the same time and affect each other [31,32].
Besides that, the composition of different visual elements and the semantic correlations between them define the harmony of an image [33]. Both aesthetics assessment and emotion recognition can benefit from capturing and exploring such dependencies, which exist everywhere in our life in visual and textual form [34,35]. Constructing a new large-scale multimodal dataset for this purpose would be helpful but too costly. An alternative approach is to combine existing datasets with common large-scale corpora.
Motivated by all of this, in this paper, we propose a multi-output learning model based on multimodal GCN and co-attention for aesthetics and emotion analysis. A summary of our work is as follows:

1. Assisted by target and scene recognition, we extract semantic labels from images. By combining the co-occurrence frequency of the tags in the image dataset with their similarity in the external large-scale textual corpus, we construct two correlation matrices for aesthetics assessment and emotion recognition, respectively. Besides that, we also produce a 2D marked feature map to map the correlation similarity onto the image.

2. We convert an image into a regional graph in which each node denotes one region, and any two nodes are connected by a weighted edge. The weight is obtained by combining their visual similarity and correlation similarity. We perform reasoning on the graph with stacked graph convolution and acquire a composition-aware re-encoded feature representation.

3. A co-attention module is designed to capture the mutual impacts between aesthetics and emotion for further enhancing the feature representations.
The main contributions of our work are as follows:
• We propose a novel end-to-end trainable multi-output learning framework for image aesthetics and emotion analysis.
• We extract semantic labels from images, construct a multimodal correlation matrix, and propose a regional feature encoding mechanism that can capture potential aesthetic and emotional relationships.
• We use a co-attention mechanism to jointly generate aesthetics-guided and emotion-guided attention, so that the understanding of image aesthetics and of emotion improve each other interactively.
• We evaluate our method on three datasets, and our proposed method consistently achieves superior performance over previous approaches.
The remainder of this work is organized as follows: in Section 2, we discuss previous relevant works on image aesthetics assessment and emotion recognition. Section 3 details our proposed multi-output learning model based on multimodal GCN and co-attention for aesthetic and emotion analysis. In Section 4, we describe the datasets and implementation details in our experiments, and then discuss the experimental results. Finally, in Section 5, we conclude our work and present the future direction.

Related Work
In this section, we discuss work related to our proposed method. Before CNNs became popular, most emotion recognition studies were dominated by traditional methods using manually crafted features or shallow classifiers, such as local binary patterns (LBPs) [13,15,36], the Facial Action Coding System (FACS) with action units (AUs) [22,23], and sparse learning [13]. In the last few years, deep neural networks have brought substantial performance improvements over other popular learning algorithms, such as Naive Bayes and SVM [37]. In particular, Long Short-Term Memory networks (LSTMs) [38][39][40] and Convolutional Neural Networks (CNNs) [29,41,42] have been used for many emotion recognition tasks, such as on Twitter data [39] and other image datasets [43]. Most existing visual sentiment classifiers are trained on images whose emotion category is annotated by crowd-sourcing. Researchers propose deep networks that detect emotions by encoding visual features and mapping them to the ground-truth labels.
As for aesthetics assessment, the estimation of image style, aesthetics, and quality has been actively investigated over the past few decades. Similar to emotion recognition, work on image aesthetics evaluation has progressed from traditional methods [21] to deep learning models [44,45]. Sheng et al. [46] presented an attention-based multi-patch aggregation network that enhances training signals by assigning relatively larger weights to misclassified image patches. Pan et al. [47] proposed a multi-task deep convolutional rating network to learn the aesthetic score and attributes simultaneously through a generative adversarial network.
However, researchers have rarely analyzed these two high-level and abstract concepts, image aesthetic quality and emotional expression, in an interactive and unified way, even though this has proved effective in computer science [3,9,48] and psychology [31,32]. Yu et al. [49] extended a large-scale emotion dataset by having volunteers rate the aesthetic scores and proposed a hybrid multi-task learning model for unified aesthetics and emotion prediction. Nevertheless, they simply concatenate the final output vectors, which may lead to inadequate learning.
In view of the significant progress in cross-modal learning that the co-attention mechanism has made [50,51], we propose a co-attention model to encode aesthetics and emotion features in an interactive way simultaneously.
Besides that, most previous work focuses on modeling the image itself to improve performance and neglects intrinsic, high-level properties such as the composition of semantic elements. Recently, GCN has been widely used for natural language processing and text emotion recognition [52,53] owing to its powerful capacity for exploring relationships among semantic and emotional elements. GCN operates on graph-structured data and encodes node features based not only on the nodes themselves but also on their relationships with each other. Some researchers have applied GCN to semantic vision, as CNNs cannot reason about the rich dependencies among different visual contents. Chen et al. [33] propose a multi-label image classification model based on GCN. They build a directed graph over the object labels and map the label graph into a set of object classifiers. Liu et al. [34] treat the image as a graph of regional nodes and compute aesthetic properties by reasoning on the graph. However, they only consider relationships that exist within their dataset. Combining textual content to facilitate visual understanding is usually a very effective approach, but one limitation is the lack of multimodal data.
To address this issue, we transfer knowledge from two large-scale textual corpora and use it to construct the relation graph. In this paper, we present a multi-output learning model integrating GCN and co-attention to predict the aesthetics and emotion of images.

Proposed Model
In this section, we propose a model for image aesthetics and emotion conjoint analysis assisted by a multimodal GCN and a co-attention mechanism. The pipeline of our proposed framework is shown in Figure 1. First, we conduct textual multi-label semantic recognition and construct a correlation matrix for these labels with the help of external knowledge transfer. Taking an image as the input, we extract the regional feature map from the deep network and carry out information propagation on the regional graph via GCN. After that, we acquire a processed regional feature map. We train two parallel branches for aesthetics assessment and emotion recognition, respectively, and then send the two refined regional feature maps to the co-attention module. The co-attention module generates two attention mask matrices: one for the aesthetics feature and the other for the emotion feature. Finally, we obtain two attentional feature descriptors and use them to perform prediction on each of the two tasks.

Textual Multi-Label Semantic Recognition
Image aesthetics assessment and emotion recognition can profit from multimodal content, which provides more vivid and adequate information. For that purpose, we extract multiple textual properties from each input image. We send the input image into a DeepLabv3+ [54] model, which has shown state-of-the-art performance in semantic segmentation, multi-target detection, and scene recognition research. As shown in Figure 2, the corresponding labels for targets (with a confidence ≥ 0.7) and scenes in each image are acquired simultaneously from DeepLabv3+ pre-trained on the MS_COCO [22] (for targets) and ADE20K [55] (for targets and scenes) datasets. Thus, we obtain a group of semantic labels on the entire dataset, i.e., $L = \{L_1, L_2, \cdots, L_c\}$, where $c$ denotes the number of categories.
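The label selection step above can be illustrated with a minimal sketch. It assumes the segmentation model has already produced `(label, confidence)` pairs for targets and a list of scene labels; the function name `collect_semantic_labels` and the example detections are hypothetical and only show how the 0.7 confidence threshold would be applied.

```python
def collect_semantic_labels(target_detections, scene_labels, min_confidence=0.7):
    """Keep target labels at or above the threshold and merge them with scene labels."""
    kept_targets = {
        label
        for label, confidence in target_detections
        if confidence >= min_confidence
    }
    # Sorting gives a deterministic label order for later indexing.
    return sorted(kept_targets | set(scene_labels))


# Hypothetical detections for one image: (label, confidence) pairs.
detections = [("person", 0.93), ("ship", 0.71), ("tree", 0.42)]
scenes = ["sea", "sky"]
labels = collect_semantic_labels(detections, scenes)
```

Here the low-confidence "tree" detection is dropped, while the scene labels are kept unconditionally because the text states a threshold only for targets.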

Correlation Matrix for Textual Semantic Labels
To make better use of the textual semantic attributes of images, we transfer external textual correlation knowledge to construct a correlation matrix that guides information propagation among the regional nodes in the GCN. For aesthetics assessment, we generate a dataset that contains all user comments crawled from the page links provided in the AVA dataset [21] on https://www.dpchallenge.com (accessed on 19 June 2021) [56]. Each image is commented on and scored by users, commenters, participants, and non-participants on the website. Figure 3 shows some examples from the AVA-Comments dataset, where semantic labels are marked in red. For emotion recognition, we use the Twitter sentiment dataset [23]. After training on these two corpora, we transform each semantic label into a task-specific word vector $v^t$, where $t$ denotes the aesthetics or emotion prediction task, via the GloVe word embedding [57] model.
To construct the correlation matrix, we define the correlation between the semantic labels by mining their co-occurrence patterns within the image dataset, and we model the semantic label correlation dependency for each task $t$ as $C^t_{ij}$ (Equation (1)), where $n^t_{ij}$ is the number of times label $L_i$ and label $L_j$ co-occur in the same image, $n^t_i$ (or $n^t_j$) is the number of images that contain label $L_i$ (or $L_j$), $N^t$ is the total number of images in the dataset, and $s(v^t_i, v^t_j)$ is the cosine similarity of the semantic word vectors $v^t_i$ and $v^t_j$. The cosine similarity is expressed in the form:

$$s(v^t_i, v^t_j) = \frac{v^t_i \cdot v^t_j}{\|v^t_i\| \, \|v^t_j\|} \quad (2)$$

The correlation matrix is asymmetric, as $C^t_{ij}$ is not equal to $C^t_{ji}$. After that, we perform L2-normalization on $C^t_{ij}$ and obtain $\hat{C}^t_{ij}$ (Equation (3)). Finally, we construct the correlation matrix $\hat{C}^t \in \mathbb{R}^{N_L \times N_L}$ for each of the two prediction tasks, where $N_L$ denotes the number of textual semantic labels.
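As a rough illustration of how co-occurrence counts and word-vector similarity can be combined into an asymmetric, normalized correlation matrix, consider the sketch below. It is not the paper's Equation (1) or (3): it assumes the correlation is the conditional co-occurrence frequency $n_{ij}/n_i$ multiplied by the cosine similarity, and that L2-normalization is applied per row. The function names and the toy data are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


def correlation_matrix(image_labels, num_labels, word_vectors):
    """Assumed form: C_ij = (n_ij / n_i) * s(v_i, v_j), then row-wise L2 normalization."""
    n_single = np.zeros(num_labels)
    n_pair = np.zeros((num_labels, num_labels))
    for labels in image_labels:
        present = sorted(set(labels))
        for i in present:
            n_single[i] += 1
            for j in present:
                if i != j:  # self-correlation is left at zero in this sketch
                    n_pair[i, j] += 1
    # Conditional co-occurrence frequency; rows of unseen labels stay zero.
    cond = np.divide(n_pair, n_single[:, None],
                     out=np.zeros_like(n_pair), where=n_single[:, None] > 0)
    corr = cond * cosine_similarity_matrix(word_vectors)
    row_norm = np.linalg.norm(corr, axis=1, keepdims=True)
    return np.divide(corr, row_norm, out=np.zeros_like(corr), where=row_norm > 0)


# Toy data: three images, three labels, 2-D stand-ins for GloVe vectors.
image_labels = [[0, 1], [0, 1, 2], [0]]
word_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
C = correlation_matrix(image_labels, num_labels=3, word_vectors=word_vectors)
```

Dividing by $n_i$ rather than by a symmetric quantity is what makes the result asymmetric, which matches the property stated in the text.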

Visual Regional Feature Extraction
In this part, we view an image as a graph composed of local regions and perform reasoning over the regional graph to improve image aesthetics assessment and emotion recognition. For this purpose, we send the image to DeepLabv3+ to produce a 3D feature map in which each spatial position represents a local area of the image, and the arrangement of all regional features describes the spatial composition of the various visual components of the image. In addition, we produce a 2D semantic map in which each region is marked by an integer corresponding to one of the textual semantic labels. For visualization, we overlay the marked map on the original image in Figure 4, where semantic labels are displayed as colored numbers, for example, 3 is Sky, 5 is Tree, 13 is Person, 17 is Mountain, 27 is Sea, and 104 is Ship. The 3D feature map has the same spatial size ($h \times w$) as the 2D semantic map; that is, the latter serves as an additional label group for the former.
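The pairing of the 3D feature map with the 2D semantic map can be sketched as a simple reshape into regional nodes. The helper name `to_regional_nodes` and the toy sizes are assumptions made for illustration; only the shapes ($h \times w \times d$ features, $h \times w$ integer labels) come from the text.

```python
import numpy as np


def to_regional_nodes(feature_map, semantic_map):
    """Flatten an (h, w, d) feature map and an (h, w) label map into m = h * w nodes."""
    h, w, d = feature_map.shape
    if semantic_map.shape != (h, w):
        raise ValueError("semantic map must match the spatial size of the feature map")
    nodes = feature_map.reshape(h * w, d)      # one d-dimensional vector per region
    node_labels = semantic_map.reshape(h * w)  # one integer label per region
    return nodes, node_labels


# Toy sizes; the experimental setting in the paper uses a 28 x 28 x 256 feature map.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
label_map = rng.integers(0, 5, size=(4, 4))
nodes, node_labels = to_regional_nodes(features, label_map)
```

Because both arrays are flattened in the same row-major order, node `k` and label `k` always refer to the same spatial region.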

GCN for Visual Regional Feature Encoding
To capture correlations between visual image contents and explore these interdependencies, we construct a multimodal regional correlation graph guiding composition-aware feature encoding. In this part, we adopt graph convolution operation under the interactive guidance of textual semantic properties as the regional dependency modeling mechanism to learn a more competitive feature representation for image aesthetics assessment and emotion recognition.
To begin, given the feature map of dimensions $h \times w \times d$, we denote the set of $m$ regions as $R = \{r_1, r_2, \cdots, r_m\}$, where $m = h \times w$ is the total number of regional feature vectors. To construct the graph, we represent each local region as a node, which is a $d$-dimensional vector, and we connect each pair of nodes with a weighted edge. Considering the multimodal content, we calculate the weight $A_{ij}$ as follows:

$$A_{ij} = \hat{C}_{\tilde{i}\tilde{j}} \, \hat{V}_{ij} \quad (4)$$

where $\hat{C}$ is the correlation matrix given in Equation (3), $\tilde{i}$ and $\tilde{j}$ are the labels corresponding to $i$ and $j$ in the 2D semantic map, and $\hat{V}_{ij}$ is the visual similarity normalized by softmax (Equations (5) and (6)), in which $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{d \times d}$ are two transformation matrices [58] that are updated during network training. For the weight matrix $A$, we filter out very small values with a threshold of 0.1 and obtain a new weight matrix $\hat{A}$ (Equation (7)). As the adjacency matrix of the relationships between nodes, $\hat{A} \in \mathbb{R}^{m \times m}$ describes the interdependence of regional areas in the image. In the graph reasoning stage, we use a multi-layer GCN to propagate the structural information of nodes through the adjacency matrix (Equation (8)), where $H^l$ is the updated node feature of the $l$-th layer, $B^l \in \mathbb{R}^{d \times d}$ is the learnable transformation matrix of the $l$-th layer, and ReLU is the activation function. Figure 5 is a visualization of the multi-layer GCN, where regional nodes with different semantic labels are marked with different colors, and edges denote the relationships between nodes with different multimodal contents. During the stacked graph reasoning, each node feature is updated only by itself and by the nodes correlated with it through multimodal content.
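The graph construction and stacked reasoning described above can be sketched in NumPy as follows. Several details are assumptions rather than the paper's exact equations: the visual similarity is taken to be a bilinear score between transformed region features, normalized by a row-wise softmax; entries below the 0.1 threshold are set to zero; and each GCN layer is taken to have the common form $H \leftarrow \mathrm{ReLU}(\hat{A} H B)$. All function names are hypothetical, and the matrices are random stand-ins for learned parameters.

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable softmax over each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def multimodal_adjacency(nodes, node_labels, corr, U, V, threshold=0.1):
    """Combine visual and label-correlation similarity, then drop small weights."""
    # Assumed visual similarity: bilinear score, softmax-normalized per row.
    visual = softmax_rows((nodes @ U) @ (nodes @ V).T)
    # Look up the label correlation for every pair of regions.
    textual = corr[np.ix_(node_labels, node_labels)]
    weights = textual * visual
    return np.where(weights >= threshold, weights, 0.0)


def gcn_forward(nodes, adjacency, layer_weights):
    """Stacked graph convolution: H <- ReLU(A H B) for each layer matrix B."""
    hidden = nodes
    for B in layer_weights:
        hidden = np.maximum(adjacency @ hidden @ B, 0.0)
    return hidden


rng = np.random.default_rng(0)
m, d = 6, 4
nodes = rng.normal(size=(m, d))
node_labels = np.array([0, 0, 1, 2, 1, 2])
corr = rng.uniform(size=(3, 3))          # stand-in for the normalized correlation matrix
U = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
A_hat = multimodal_adjacency(nodes, node_labels, corr, U, V)
layers = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]  # three GCN layers
encoded = gcn_forward(nodes, A_hat, layers)
```

Thresholding makes the adjacency sparse, so a node only aggregates from regions whose combined visual and textual weight is large enough.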

Co-Attention Module for Aesthetics and Emotion Analysis
In this part, we model the mutual interaction between the regional image features of the two tasks based on a collaborative attention mechanism. Specifically, given the task-specific visual feature representation $X_t \in \mathbb{R}^{m \times d}$, we construct the attention matrix and generate two attentional feature representations for aesthetics and emotion.
First, similar to [59], we construct the attention matrix $M \in \mathbb{R}^{m \times m}$, which measures the similarity between $X_{aes}$ and $X_{emo}$ (Equation (9)), where $W_s \in \mathbb{R}^{d \times d}$ and $b_s \in \mathbb{R}^{m \times m}$ are both learnable matrices, and tanh is the activation function.
With the affinity matrix $M$, two task-specific attentional feature representations $\hat{X}_t \in \mathbb{R}^{m \times k}$ can be calculated, with $\hat{X}_{aes}$ given by Equation (10) and $\hat{X}_{emo}$ by:

$$\hat{X}_{emo} = \tanh\left(X_{emo} W_e + X_{emo} W_e \odot M^{T} X_{aes} W_a + b_a\right) \quad (11)$$

where $W_a \in \mathbb{R}^{d \times k}$, $W_e \in \mathbb{R}^{d \times k}$, $b_a \in \mathbb{R}^{m \times k}$, and $b_e \in \mathbb{R}^{m \times k}$ are all learnable matrices, and $\odot$ denotes the element-wise product. Finally, we feed the attentional feature into a three-layer network consisting of a convolutional layer and two FC layers to obtain a 1D vector for aesthetic and emotional evaluation.

Figure 5. Visualization of multi-layer GCN, a framework for stacked graph reasoning with the regional map based on relationships between nodes. Nodes with different semantic labels are marked with different colors, and edges denote multimodal relationships between nodes. Each node feature is gradually updated only by itself and the nodes with edges connected to it.
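A minimal NumPy sketch of the co-attention step in this section is given below. It relies on assumptions: the affinity is taken as $M = \tanh(X_{aes} W_s X_{emo}^{T} + b_s)$, the emotion branch follows Equation (11), and the aesthetics branch is assumed to be its mirror image using $M$ and $b_e$. The function name `co_attention` is hypothetical, and the parameters are random stand-ins for learned weights.

```python
import numpy as np


def co_attention(x_aes, x_emo, W_s, b_s, W_a, b_a, W_e, b_e):
    """Return (aesthetics, emotion) attentional features, each of shape (m, k)."""
    # Affinity between every aesthetics region (rows) and emotion region (columns).
    M = np.tanh(x_aes @ W_s @ x_emo.T + b_s)   # (m, m)
    proj_aes = x_aes @ W_a                      # (m, k)
    proj_emo = x_emo @ W_e                      # (m, k)
    # Each branch keeps its own projection and adds a term gated by the other branch.
    att_aes = np.tanh(proj_aes + proj_aes * (M @ proj_emo) + b_e)
    att_emo = np.tanh(proj_emo + proj_emo * (M.T @ proj_aes) + b_a)
    return att_aes, att_emo


rng = np.random.default_rng(0)
m, d, k = 5, 4, 3
x_aes = rng.normal(size=(m, d))
x_emo = rng.normal(size=(m, d))
W_s = rng.normal(size=(d, d))
W_a = rng.normal(size=(d, k))
W_e = rng.normal(size=(d, k))
b_s = np.zeros((m, m))
b_a = np.zeros((m, k))
b_e = np.zeros((m, k))
att_aes, att_emo = co_attention(x_aes, x_emo, W_s, b_s, W_a, b_a, W_e, b_e)
```

Using $M$ for one branch and $M^{T}$ for the other lets a single affinity matrix route information in both directions between the two tasks.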

Intermediate Supervision
Our model architecture contains three stages: (1) learning image features via deep networks; (2) graph reasoning with the regional map; and (3) collaborative attention encoding. To avoid vanishing gradients during repeated backpropagation, we apply intermediate supervision, which has demonstrated strong performance in methods with multiple iterative stages [60,61], to produce a loss at each stage. We feed the feature into a three-layer network consisting of a convolutional layer and two FC layers to obtain a 1D vector and use the cross-entropy loss as the objective function (Equation (12)). The total loss is expressed as follows:

$$\mathrm{Loss}_{total} = \mathrm{Loss}_{aes} + \mathrm{Loss}_{emo} \quad (13)$$
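The idea of intermediate supervision can be illustrated with a small sketch that sums a cross-entropy loss over the three stages for both tasks. The equal weighting of stages is an assumption, since the per-stage formulation is not reproduced here, and the function names and the uniform toy predictions are hypothetical.

```python
import math


def cross_entropy(probabilities, target_index):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probabilities[target_index])


def total_loss(stage_outputs, aes_target, emo_target):
    """Sum the aesthetics and emotion losses over all supervised stages."""
    loss_aes = sum(cross_entropy(aes_probs, aes_target) for aes_probs, _ in stage_outputs)
    loss_emo = sum(cross_entropy(emo_probs, emo_target) for _, emo_probs in stage_outputs)
    return loss_aes + loss_emo


# Toy predictions for three stages: 2 aesthetic classes and 8 emotion classes, all uniform.
stage_outputs = [([0.5, 0.5], [0.125] * 8) for _ in range(3)]
loss = total_loss(stage_outputs, aes_target=1, emo_target=3)
```

Because every stage contributes its own term, gradients reach the early feature extractor directly instead of only through the later GCN and co-attention stages.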

Experiments and Results
In this section, we evaluate our proposed multi-output learning model integrating multimodal GCN and co-attention. First, we introduce the datasets and the experimental settings in our work. Then the results of our proposed method are compared against several baselines. Finally, we discuss the performance of our model.

Datasets
We conduct experiments on the Images with Aesthetics and Emotion (IAE) dataset [49], a new dataset associated with both aesthetic and emotional labels. Specifically, IAE is an extension of the earlier work in [43], where more than 22,000 images are manually divided into eight emotion categories, i.e., amusement, anger, awe, contentment, disgust, excitement, fear, and sadness. Each category consists of more than 1100 images. These images were then rated by ten volunteers from the aesthetic perspective in [49]. As a result, the IAE dataset contains 11,550 high-aesthetic and 10,536 low-aesthetic images. Figure 6 shows the distribution of the eight emotion categories in high-aesthetic and low-aesthetic images. As we can see, most positive emotions, especially awe and contentment, appear more frequently with high aesthetics, while negative emotions tend to come with low aesthetics. This provides empirical support that aesthetics and emotion are correlated and can be studied interactively. For a fair comparison, we follow the same data partition used in their work, i.e., 70% of the images for training, 10% for validation, and the rest for testing. To verify the generalization ability on the single prediction tasks for aesthetics and emotion, we also choose two benchmark datasets, AVA [21] and ArtPhoto [24]. AVA is the largest publicly available dataset for aesthetic visual analysis and contains scored images collected during the digital photography contests on https://www.dpchallenge.com (accessed on 19 June 2021). ArtPhoto is a set of artistic photographs downloaded from an art sharing site, https://www.deviantart.com (accessed on 19 June 2021) [62], using the emotion categories as search terms.
To capture interdependencies between visual image contents, we construct a correlation matrix that guides composition-aware feature encoding by transferring external textual knowledge. For aesthetics assessment, we crawl the user comments for the images in AVA from https://www.dpchallenge.com (accessed on 19 June 2021) to form the AVA-Comments dataset. All quotes and extra HTML tags such as links are removed. For emotion recognition, we use the Twitter sentiment dataset [23], in which more than 3 million tweets containing both text and images are collected. Following their work, we select only the tweets classified with a confidence ≥ 0.85 and marked as positive or negative, which yields more than 550K tweets. After training on these two corpora, we transform each semantic label into a task-specific word vector via the GloVe word embedding model, one for each task.

Experimental Settings
The DeepLabv3+ network with Xception [63] as its backbone is used for feature encoding. The raw input image is resized to 224 × 224, and the output feature map is 28 × 28 × 256 after the Atrous Spatial Pyramid Pooling (ASPP) [64] module followed by a 1 × 1 convolution layer. The number of GCN layers is three, and the dimension $k$ of the feature space in the co-attention module is 128. The model is initialized with weights pre-trained on ImageNet and then fine-tuned end-to-end with images for aesthetics and emotion multi-output prediction. For network optimization, SGD is used as the optimizer. The base learning rate is $10^{-4}$ and is reduced by a factor of 10 every ten epochs. The momentum, weight decay, total number of epochs, and batch size are set to 0.9, $10^{-5}$, 50, and 32, respectively. Our networks are implemented in PyTorch.
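The learning-rate schedule stated above can be written as a small step-decay function. This is a plain-Python sketch with a hypothetical function name and zero-based epoch indexing assumed; in a PyTorch training loop the same behavior would typically come from a step scheduler attached to the SGD optimizer.

```python
def learning_rate(epoch, base_lr=1e-4, step=10, factor=10.0):
    """Step decay: divide the base rate by `factor` once every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Learning rate for each of the 50 training epochs (epochs are zero-indexed here).
schedule = [learning_rate(epoch) for epoch in range(50)]
```

With 50 epochs and a decay every ten epochs, the rate takes five distinct values over the course of training.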

Comparison Evaluation with Baselines
In this section, we first present a comparison evaluation on IAE to demonstrate the effectiveness of our method. Then we evaluate the performance of our model on the single prediction tasks for aesthetics and emotion on AVA and ArtPhoto, respectively. Table 1 shows the performance of our method on the IAE dataset along with the results of other methods. ResNet50 [65], WRN [66], MLSP [67], RGNet [34], CycleEmotionGAN [68], and APSE [69] are single-task learning models. SSNet [70], CSNet [71], AENet-FL, and AENet [49] are multi-task learning models. The experimental results in Table 1 show that our proposed model outperforms the other methods. Compared with single-task learning methods, our method outperforms the two base networks ResNet50 and WRN by 9.78% and 9.7% for aesthetics assessment, and by 9.05% and 7.49% for emotion recognition. We also include four single-task learning models as baselines, of which MLSP and RGNet are for aesthetics assessment, while CycleEmotionGAN and APSE are for emotion recognition. Against them, our method exhibits at least 5.18% and 3.95% performance improvement in the respective prediction tasks. It is also interesting to see that the best previous multi-task learning model (AENet) beats those single-task learning models, which supports the value of aesthetic and emotion conjoint analysis. Among multi-task learning models, our method achieves 6.84% and 9.91% higher accuracy than SSNet for the aesthetics and emotion tasks, respectively, and outperforms CSNet by 6.93% and 6.97%. AENet is an extension of AENet-FL that adds extra fusion layers and achieves 81.05% and 66.23% accuracy, improving on AENet-FL. In comparison, our method achieves 85.63% and 70.14%, surpassing both by at least 4.58% and 3.91% for aesthetics and emotion, respectively.
The competitive results demonstrate the effectiveness of our proposed model, both in aesthetics assessment and emotion recognition.
As mentioned above, to verify the generalization ability on the single prediction tasks for aesthetics and emotion, we also train our model on IAE and test it on AVA and ArtPhoto. The comparison results can be found in Table 2. As we can see, there is a sharp decline in performance for all methods, which is due to the heterogeneity between the different datasets. In comparison, our method achieves 80.2% and 40.77% accuracy, surpassing the other methods by a large margin for both prediction tasks, especially for emotion prediction. This may be because we pre-extract semantic labels from the images in the test set and construct a correlation matrix based on their co-occurrence frequency and textual similarity. By transferring the correlation, the information gap can be alleviated. The competitive results suggest that the multimodal GCN module based on external knowledge transfer improves the prediction performance and generalization ability for both aesthetics assessment and emotion recognition.
Figure 7 shows the normalized confusion matrices of the eight emotion categories on the IAE and ArtPhoto datasets, which we use to analyze the performance of our proposed method in each emotion category. Both datasets contain eight emotion categories, i.e., amusement, anger, awe, contentment, disgust, excitement, fear, and sadness. On the IAE dataset, the anger and fear categories have lower accuracy than the others, and many anger and fear images are misclassified as sadness. In addition, the excitement category is easily mistaken for the amusement category, which may be attributed to the similar emotional features of the two categories. Moreover, when we test on ArtPhoto, our model does not perform as well as on IAE: only the fear category is correctly identified with a probability of more than 50%. Figure 8 is a visualization of intermediate supervision, as mentioned in Section 3.6.
As we can see, in both the aesthetics and emotion branches, the loss in all three stages decreases as training continues. This indicates that the gradient is back-propagated through the whole end-to-end network and that all parameters are updated. Figure 9 summarizes the results of the different stages on IAE, as mentioned in Section 3.6. For aesthetics assessment and emotion recognition, both the GCN module and the co-attention module in our model boost performance. Moreover, it is interesting to find that our co-attention module is more effective in enhancing the prediction of the awe and disgust categories, which may be attributed to the consistency and correlation between these two emotion categories and aesthetic perception.
We conduct several experiments to evaluate the performance of the GCN module. We design five variants of our model, without GCN or with different GCNs. In the first variant, we remove the GCN module and send the features to the co-attention module directly. In the second variant, we set all values in the weight matrix $\hat{A}$ to 1, which turns the GCN into a fully connected network (FCN). For the remaining variants, we take $\hat{C}$, $\hat{V}$, and $\hat{A} = \hat{C} \odot \hat{V}$ ($\odot$ denotes the element-wise product) as the textual, visual, and multimodal weight matrices for the GCN. The performance of these models is shown in Table 3. For a fair comparison, we use the same experimental settings for all the variants. From the results, we can see that the variant with FCN does not perform better than the variant without GCN. Both perform worse than the remaining variants with GCNs of different modalities, which means that simply increasing the number of layers does not help and that the GCN does explore the dependencies among visual elements. As for single-modality GCN, visual and textual correlation reasoning achieve similar accuracy on both the aesthetics and emotion prediction tasks. Both are outperformed by the multimodal GCN, which achieves at least 1.67% and 1.33% higher accuracy on the aesthetics and emotion tasks, respectively, providing evidence for the validity of our multimodal GCN module.
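The five ablation variants differ only in which weight matrix is passed to the GCN, which can be made explicit with a small selector. The variant names and the function `variant_adjacency` are hypothetical labels for this sketch; `textual` and `visual` stand for the region-level label-correlation and visual-similarity matrices.

```python
import numpy as np


def variant_adjacency(variant, textual, visual):
    """Return the (m, m) weight matrix for an ablation variant, or None for no GCN."""
    if variant == "none":
        return None                      # features skip the GCN entirely
    if variant == "fcn":
        return np.ones_like(visual)      # every pair of regions fully connected
    if variant == "textual":
        return textual
    if variant == "visual":
        return visual
    if variant == "multimodal":
        return textual * visual          # element-wise product of both modalities
    raise ValueError(f"unknown variant: {variant}")


# Toy 2 x 2 similarity matrices.
textual = np.array([[1.0, 0.5], [0.2, 1.0]])
visual = np.array([[0.6, 0.4], [0.3, 0.7]])
multimodal = variant_adjacency("multimodal", textual, visual)
```

Keeping everything else fixed and swapping only this matrix is what allows the accuracy differences in the ablation to be attributed to the graph weights.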

Conclusions
Every day, more and more people share their experiences and communicate their moods and states of mind with images on social networks. Meanwhile, they look forward to enjoying higher-quality images, which are more likely to trigger their emotions. Therefore, aesthetics assessment and emotion recognition are two fundamental problems in understanding user perception. Although the two tasks are correlated and mutually beneficial, they are usually solved separately in existing studies.
Image aesthetics and emotion analysis belongs to image processing and pattern recognition in computational mathematics. Machine learning and deep learning methods have provided tremendous assistance in mathematical problem-solving in recent years. Using a deep neural network, we map images and abstract concepts into numerical matrices. Then, through a series of mathematical operations, we build relationships between visual data and labels from the aesthetic and emotional perspectives.
In this paper, we propose an end-to-end multi-output deep learning model based on multimodal GCN and co-attention for image aesthetics and emotion conjoint analysis. We extract semantic labels from the image and construct a correlation matrix from the label distribution in the dataset and from external textual information obtained through corpus-based knowledge transfer. With the correlation matrix mapped onto the image, we perform stacked graph reasoning on the regional image map and obtain a composition-aware re-encoded feature representation. After that, we send these features into the co-attention module and let them learn from each other interactively and selectively. To avoid vanishing gradients during repeated backpropagation, we apply intermediate supervision to produce a loss at each stage.
Experimental results on multiple datasets demonstrate the performance of our proposed method for aesthetics and emotion analysis. On the Images with Aesthetics and Emotion (IAE) dataset, our method achieves 85.63% and 70.14% accuracy for the aesthetics and emotion tasks, respectively. Compared with the baselines, our method outperforms both single- and multi-task learning models by at least 4.58% and 3.91% in accuracy. We also test on AVA and ArtPhoto by performing a cross-dataset evaluation with IAE to verify the generalization ability of our method on the two prediction tasks. Our method achieves 80.2% and 40.77% accuracy, surpassing the baselines by a large margin on both tasks. However, the result for emotion prediction needs further improvement. In addition, the visualized results and the comparison results show that the intermediate supervision, GCN module, and co-attention module boost the performance of our model.
As future work, we plan to introduce some more complex and nuanced relationships, such as camera style and emotion cause, to help the model simulate high-level visual perceptions.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.