Inferring Emotion Tags from Object Images Using Convolutional Neural Network

Abstract: Emotions are a fundamental part of human behavior and can be stimulated in numerous ways. In our daily routine, we come across different types of objects, such as cakes, crabs, televisions, and trees, which may excite certain emotions. Likewise, object images that we see and share on different platforms are also capable of expressing or inducing human emotions. Inferring emotion tags from these object images has great significance as it can play a vital role in recommendation systems, image retrieval, human behavior analysis, and advertisement applications. The existing schemes for emotion tag perception are based on visual features, like the color and texture of an image, which are adversely affected by lighting conditions. The main objective of our proposed study is to address this problem by introducing a novel idea of inferring emotion tags from images based on object-related features. In this aspect, we first created an emotion-tagged dataset from the publicly available object detection dataset (i.e., "Caltech-256") using subjective evaluation from 212 users. Next, we used a convolutional neural network-based model to automatically extract high-level features from object images for recognizing nine (09) emotion categories, namely amusement, awe, anger, boredom, contentment, disgust, excitement, fear, and sadness. Experimental results on our emotion-tagged dataset endorse the success of our proposed idea in terms of accuracy, precision, recall, specificity, and F1-score. Overall, the proposed scheme achieved accuracy rates of approximately 85% and 79% using top-level and bottom-level emotion tagging, respectively. We also performed a gender-based analysis for inferring emotion tags and observed that male and female subjects differ in their emotion perception of different object categories.


Introduction
Emotions represent the mental and psychological state of a human being and are a crucial element of daily human behavior [1,2]. They are a vital means of perceiving, conveying, and expressing a particular feeling, and can stimulate the human mind 3000 times more strongly than rational thoughts [3]. With the rapid growth of digital media, emotion communication through images has increased [4-8]. Hence, exclusive information about a person's behavior can be acquired by interpreting their emotion perception towards different real-life object categories. In this aspect, user demographic information can also be taken into account for emotion recognition. Psychological and behavioral studies have shown that human perception of emotion varies with user demographics [8]. Only a few research studies have addressed classifying emotions based on user demographics, which generally refer to user attributes such as age, gender, marital status, and profession. In [37,38], the authors reported a gender-discrimination scheme for image-based emotion recognition. Dong et al. [39] analyzed the demographics that are widely used to characterize customers' behaviors. Likewise, Fischer et al. [38] reported that different genders and age groups have different emotion perceptions. Moreover, men more often express powerful emotions (such as anger), whereas women more often express powerless emotions (such as sadness and fear). Likewise, the authors in [37] mentioned that females are more emotionally expressive than males. Therefore, it can be seen that multiple-emotion classification based on object-related features and demographic analysis is a challenging task.
For addressing the challenges discussed above, in this study, we explore the implications of inferring emotion tags from images based on object analysis and categorization. Human-object interaction (either in real life or through object images) entails varying emotion perceptions for different users, as shown in Figure 1. Different objects may evoke varying emotions in viewers depending upon their judgment. Hence, it seems advantageous to infer a viewer's emotions based on visual object interaction through images. In this aspect, based on the majority voting principle, we can establish a standard way to annotate emotions associated with different object categories. Thus, our proposed scheme is built on the hypothesis that we can accurately infer emotion tags from images using object-related features, alleviating the existing challenges associated with visual features (such as image contrast, brightness, and color information). To validate the proposed idea, we used the publicly available Caltech-256 dataset, which consists of a hierarchy of object categories, including 20 top-level and 256 bottom-level object categories. We pre-processed this dataset to assign emotion tags to these object categories. To the best of our knowledge, no prior work has inferred emotion tags from images based on object analysis. The main contributions of this paper are as follows:

• We proposed a novel idea for inferring emotion tags from images based on object classification. In this aspect, we used a public-domain object detection dataset, i.e., Caltech-256, for the implementation and validation of our proposed idea using different supervised machine learning classifiers.
• We produced an emotion-tagged dataset from Caltech-256, where we assigned nine emotion tags to different object categories based on subjective evaluation from 212 users. Thus, our proposed scheme can recognize nine (09) different emotions, including amusement, excitement, awe, anger, fear, boredom, sadness, disgust, and contentment.
• We employed a two-level (i.e., top-level and bottom-level) emotion tagging approach for our proposed scheme. We provide a detailed analysis of how human emotions are affected by different levels of information. Also, we investigated the effect of user demographics, i.e., gender, on inferring emotion tags.
The rest of this research study is organized as follows: Section 2 describes the methodology adopted to accomplish the proposed idea. Section 3 entails the performance analysis, results, and discussion of inferring emotion tags using CNN-based features. Finally, Section 4 provides the conclusions of this research work and recommendations for future work.

Proposed Methodology
The proposed method for inferring emotion tags based on object categories consists of four steps, i.e., dataset acquisition, emotion annotation/tagging, feature extraction, and classification of emotion tags as shown in Figure 2. The detailed description related to each step is given below.

Data Acquisition
The proposed work mainly emphasizes the concept of inferring emotion tags from object categories. Thus, we required a dataset that contains the real-life object categories that we encounter in our daily lives. For this purpose, we utilized a state-of-the-art object-detection dataset named Caltech-256 [40]. This dataset comprises 256 object categories and a total of 30,608 images, where each category has a maximum of 827 and a minimum of 80 images. Specifically, this dataset was chosen for two reasons: 1) it contains most of the real-life living and non-living object categories with which a person might interact in their routine life, and 2) it includes the types of images that are extensively uploaded and shared on different social media platforms and may induce certain emotions in viewers. Hence, this dataset is appropriate for validating the classification performance of inferring emotion tags based on object categories. Besides this, the Caltech-256 dataset consists of a hierarchy of object categories: twenty (20) top-level (node-level) object categories, including animal, transport, electronics, nature, music, and food, and 256 bottom-level (leaf-level) object categories, which are derived from the twenty top-level object categories and entail no further sub-categories. As an example, the top-level animal object category contains bottom-level categories such as frog, snake, ostrich, and crab. Figure 3 presents some example images of bottom-level object categories (represented by black blocks), which are derived from the top-level object categories of the Caltech-256 dataset (represented by red blocks).

Data Annotation and Tagging
The Caltech-256 dataset consists of object images that are generally used for object detection or recognition purposes [41]. According to the proposed scheme, the raw data acquired from the dataset need to be pre-processed to create an emotion-tagged dataset from the existing object categories in Caltech-256. Nine (09) emotion tags are targeted in this study: amusement, anger, awe, boredom, contentment, disgust, excitement, fear, and sadness. For assigning a particular emotion tag to the hierarchy of objects, i.e., top-level and bottom-level object categories, a subjective evaluation method is devised. This method depends on viewers' opinions, using a majority voting criterion to build a strongly labeled emotion dataset from the heterogeneous object categories of the Caltech-256 dataset. For this purpose, we conducted an online survey of more than 250 subjects using Google Forms, from which we received complete responses from 212 subjects. The proportions of male and female candidates were nearly the same, i.e., 51.8% and 48.2%, respectively. The survey participants mostly included university-level students and teachers; a few other subjects, such as doctors and private-sector employees, also took part. We asked these participants to view random images from each object category and label what kind of emotion they typically perceive when interacting with objects of that specific category in real life or on social media platforms. More precisely, we requested the viewers to assign a specific emotion tag to each object category based on their real-life interaction with such objects as included in the Caltech-256 dataset. We first adopted the subjective evaluation method to label emotion tags for the bottom-level object categories. Then, we used those tags to assign emotion tags to the twenty (20) top-level object categories by applying the majority voting principle to the sub-category labels.
The detailed description of emotion tag assignment based on subjective evaluation (for both top and bottom-level object categories) is given in the next sections.

Emotion Tag Assignment to Bottom-Level Object Categories
To consider all possible emotion variations within the top-level object categories, we first performed emotion labeling at the bottom level. In this regard, we labeled each of the 256 leaf object categories in the Caltech-256 dataset with an emotion tag. All users were asked to assign only one emotion tag to each category based on their perception. After collecting emotion responses for the bottom-level object categories from all users, we applied a maximum-user-response criterion to assign a single emotion tag to each category. For example, object categories like blimp, boom-box, comet, playing card, and dice were assigned the amusement emotion tag as a result of the combined response from all users. Likewise, categories such as picnic-table, raccoon, and Saturn were marked with the contentment emotion tag. In contrast, object categories like knife, bear, and bat belong to the fear emotion class, as labeled by most of the users. Subsequently, all the bottom-level object categories were marked with one of the nine emotion tags to build a strongly labeled object category dataset. Overall, our emotion-tagged dataset based on the bottom-level object categories includes a total of 6177 images for amusement, 277 for anger, 3292 for awe, 5792 for boredom, 4270 for contentment, 1569 for disgust, 6336 for excitement, 1674 for fear, and 356 for the sad emotion category.
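The maximum-user-response criterion described above amounts to a simple majority vote over per-user labels. A minimal sketch follows; the category names and individual votes shown here are hypothetical stand-ins for the actual survey responses:

```python
from collections import Counter

def majority_emotion_tag(user_labels):
    """Return the emotion tag chosen by the largest number of users for one category."""
    return Counter(user_labels).most_common(1)[0][0]

# Hypothetical per-user responses for two bottom-level categories
responses = {
    "knife": ["fear", "fear", "awe", "fear", "disgust"],
    "dice":  ["amusement", "excitement", "amusement", "amusement"],
}

category_tags = {cat: majority_emotion_tag(votes) for cat, votes in responses.items()}
print(category_tags)  # {'knife': 'fear', 'dice': 'amusement'}
```

Applying this per category over all 212 responses yields one tag per leaf category, which is how the strongly labeled dataset is built.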

Emotion Tag Assignment to Top-Level Object Categories
We employed top-level emotion tag assignment to analyze the overall emotion response for the twenty primary object categories included in Caltech-256. In this regard, we examined the responses for all emotion tags of the bottom-level categories in relation to their parent categories, i.e., the top-level categories. Finally, we marked each top-level category with an emotion tag based on the most frequent emotion among its respective bottom-level object categories. For example, the top-level animal object category encompasses three different bottom-level categories: air, water, and land animals. Further, each of these animal categories consists of more animal categories of their kind, e.g., ostrich, crab, frog, and duck. Accordingly, when we observed all the bottom-level object categories in the animal class, most of them were marked with the fear emotion tag; thus, the top-level animal object category was assigned the fear emotion tag. Similarly, the insect category was tagged with the disgust emotion, as most of its bottom-level object categories were labeled with the same emotion tag by the users. Likewise, the food and extinct object categories were marked with the excitement and sad emotion tags, respectively. In this way, emotion tags were assigned to all top-level object categories to produce an emotion-tagged dataset based on object categorization. The final dataset includes a total of 1456 images for amusement, 609 for anger, 991 for awe, 6814 for boredom, 2005 for contentment, 813 for disgust, 12,127 for excitement, 4698 for fear, and 190 for the sad emotion. Table 1 presents the emotion tags assigned to the 256 bottom-level object categories derived from the 20 top-level object categories.
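The top-level assignment can then be derived by taking the most frequent tag among each parent's child categories. The hierarchy and bottom-level tags below are an illustrative fragment, not the actual Caltech-256 tree or the labels collected in the survey:

```python
from collections import Counter

# Hypothetical fragment: bottom-level categories with tags already assigned by majority vote
bottom_tags = {"frog": "disgust", "snake": "fear", "ostrich": "awe", "crab": "fear"}

# Hypothetical parent-to-children mapping for one top-level category
hierarchy = {"animal": ["frog", "snake", "ostrich", "crab"]}

# Roll the child tags up: each parent gets the most frequent tag among its children
top_tags = {
    parent: Counter(bottom_tags[child] for child in children).most_common(1)[0][0]
    for parent, children in hierarchy.items()
}
print(top_tags)  # {'animal': 'fear'}
```

With two of the four children tagged fear, the animal parent receives the fear tag, mirroring the rollup described in the text.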

Gender-Based Emotion Tag Assignment
For gender-based emotion analysis, we separated the emotion tags reported by male and female participants and performed a new annotation of emotion tags based on gender information. The primary purpose of gender-based emotion tagging is to investigate the benefits of employing user demographics for inferring emotion tags based on object categories. The subjective labeling consisted of 51.8% male and 48.2% female responses for all object categories. However, the responses differed between male and female candidates. For example, the bottom-level necktie object category was tagged with the boredom emotion based on the combined responses from all females, whereas males most often marked the necktie object category with the excitement emotion tag. Similarly, females marked spaghetti as excitement, while males tagged it with boredom. Hence, there seems to be an apparent advantage of gender-based emotion analysis, namely an improvement in emotion classification performance. For assigning gender-based emotion tags at both the top and bottom levels, we followed the same procedure that we used for the combined emotion responses from all users.

Feature Extraction Using Off-the-Shelf CNN: Resnet-50
For accurately inferring emotion tags from different image categories based on object analysis, the selection of appropriate features is crucial. However, it is hard to know which features can produce good results, particularly in this case, where heterogeneous images are present within a single emotion class. Each object category, tagged with a single emotion label, entails various types of objects with different appearances and shapes. Hence, it is not effective to utilize hand-crafted object-related features for emotion recognition, as multiple object types may have diverse features within the same category. Similarly, visual features like the color and texture of an image are generally extracted based on domain knowledge, i.e., human observations and assumptions; hence, they do not capture all the visual details [42]. Consequently, a major gap arises between what humans perceive and what we get from these hand-crafted features. Therefore, it is necessary to extract deep high-level features from different object images for a better realization and representation of emotion. For our proposed scheme, we preferred CNN-based automatically extracted features for inferring emotion tags from object images. In this regard, we utilized a CNN model pre-trained on the benchmark ImageNet dataset [28,31], which entails 1000 object categories with 1.2 million training images. From this large collection of images, a CNN can learn rich feature representations for a wide range of images, which can outperform hand-crafted features. One such pre-trained state-of-the-art CNN architecture is ResNet-50, which consists of 50 layers [43]. This residual network makes the training phase easier compared to other CNN models despite its substantially deeper network [44]. The first layer of the ResNet-50 network utilizes filters for capturing blob and edge features.
These primitive features are then processed by deeper network layers, which combine the early features to form higher-level image features, as shown in Figure 4. These higher-level features are better suited for recognition tasks as they combine all the primitive features into a richer image representation [45]. Thus, selecting a deep layer just before the classification layer is an effective design choice, as it incorporates the richest information and is a good place to start. Therefore, in our proposed model, we used the last fully-connected layer of ResNet-50, i.e., 'fc1000', to automatically learn features from the emotion-tagged data, obtaining a 1 × 1000 feature vector per sample. The main reason for adopting the ResNet architecture is its minimal error rate (i.e., 3.57% on the ImageNet test set) compared to other models.

Feature Classification
To infer emotion tags and validate the performance of the features automatically learned via the CNN, we selected three different classifiers: Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbor (KNN). The SVM classifier is a supervised learning algorithm that uses labeled samples to find the boundary between classes [46]. KNN is also a widely used classifier in decision-making systems, as it is a very simple algorithm with stable performance; it classifies a sample using its nearest training samples in the feature domain [47]. Finally, RF is a powerful classifier based on a combination of decision trees that jointly predict a specific class. These three classifiers are reported to be foremost in producing high accuracies for both balanced and imbalanced datasets due to their efficient prediction [48]. We selected a k-fold cross-validation scheme with k = 10 to maximize the use of the available data for training and testing the selected classifiers. The entire dataset is split into ten equal folds, of which nine splits are used for training and the remaining fold is used for testing.
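The classification stage with 10-fold cross-validation can be sketched with scikit-learn as follows. Synthetic 1000-dimensional vectors stand in for the real ResNet-50 features, and the hyperparameters shown are illustrative defaults, not the values reported in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 500 samples of 1000-d CNN features across 9 emotion classes
X, y = make_classification(n_samples=500, n_features=1000, n_informative=50,
                           n_classes=9, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
}

# 10-fold cross-validation, as in the proposed scheme: nine folds train, one tests
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

On the real emotion-tagged features, the same loop produces the per-classifier accuracies reported in Section 3.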

Results and Discussion
This section presents a series of analyses, along with comprehensive performance and empirical results, to evaluate our proposed scheme for inferring emotion tags based on object categories. Besides, it explores the impact of user demographics on inferring emotion tags. Details of each analysis are described in the following sections.

Method of Analysis and Classifiers
The proposed scheme for inferring emotions is evaluated on the Caltech-256 dataset using three baseline classifiers, i.e., K-Nearest Neighbor (K-NN), Random Forest (RF), and Support Vector Machine (SVM). These classifiers were chosen due to their efficient performance in visual sentiment analysis. Their performance is validated using the widely used metrics of accuracy, F1-measure, precision, recall, and specificity. The proposed scheme focuses on object categories with multi-level classification at two different levels: 1) top-level emotion tags, i.e., the emotion tag responses collected for top-level object categories like animal, food, and nature, and 2) bottom-level emotion tags, i.e., the emotion tag responses collected for sub-object categories like air, water, and land animals. The subsequent sections provide detailed experimental results and performance analysis for both levels.
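These metrics can be computed with scikit-learn; specificity has no direct scikit-learn function, so it is derived from the confusion matrix. The labels below are a small hypothetical binary example (0/1 in place of emotion classes) chosen only to illustrate the computation:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical ground-truth and predicted labels (binary for brevity)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)        # sensitivity / true-positive rate

f1 = f1_score(y_true, y_pred)

# Specificity = TN / (TN + FP), read off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

For the nine-class setting of the proposed scheme, the same quantities are computed per class (one-vs-rest) and averaged.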

Performance Analysis for Inferring Emotion Tags at Top-Level
For the analysis of top-level emotion inference, the chosen classifiers are trained to infer nine emotion categories. As discussed earlier, the responses collected from subjective labeling are divided into three types of analyses for the top-level object categories, illustrated as follows:

• Combined analysis of inferring emotion tags: the analysis based on the combined responses from both genders for the Caltech-256 top-level object categories.
• Analysis of inferring emotion tags for males: the emotion responses collected from male participants only.
• Analysis of inferring emotion tags for females: the emotion responses collected from female participants only.
After grouping the emotion responses into these three categories, we measured the proposed scheme's performance for each category. Table 2 provides the average performance analysis for inferring emotion tags using the K-NN, RF, and SVM classifiers. These results were obtained for the three different analyses, i.e., male, female, and combined emotion responses. It is observed from Table 2 that, for all three types of analysis at the top level, the K-NN classifier outperforms the RF and SVM classifiers. SVM performs well when the margin between different class boundaries, i.e., hyperplanes, is sufficiently large; however, this is not the case in our proposed scheme, which has small inter-class variations. In addition, the SVM classifier does not perform well if the dataset contains noise, such as labeling noise, in which case the target classes appear to overlap. Moreover, RF is an ensemble classifier that builds a collection of random decision trees (with random data samples and subsets of features), each of which individually predicts the class label; the final decision is made based on the majority voting principle. RF is also inherently less interpretable than an individual decision tree. Hence, in the case of noisy-labeled and imbalanced data, RF may produce somewhat random results. For our proposed scheme, the best accuracy is achieved by the K-NN classifier, and the worst performance is obtained with the RF classifier for top-level emotion tag inference. It can also be observed from Table 2 that, when inferring emotion tags based on male responses, we achieved the highest accuracy of 85.01%, which is 0.15% and 2.78% higher than that of the female and combined emotion tag responses, respectively. Similarly, the female emotion tag responses yield better inference than the combined emotion tag responses. Likewise, the RF and SVM classifiers showed similar performance for the individual male, female, and combined emotion tag responses.
Hence, gender information is effective for inferring emotion tags. Figure 5 shows the F1-measure values obtained for inferring emotion tags using the K-NN classifier. It can be observed from Figure 5 that, for emotion classes such as boredom and awe, the male and female emotion responses outperform the combined emotion tag results; for the other emotion classes, the gender-based and combined emotion tag responses show quite similar F1-measure values. It can be noted from Table 2 that, for the top-level emotion tag responses of males, the highest F1-measure is achieved using the K-NN classifier, which is 0.01% and 0.92% higher than the F1-measures of the SVM and RF classifiers, respectively. Moreover, Figure 6 provides the confusion matrices for the best-case top-level emotion tag inference obtained by K-NN for the male, female, and combined emotion tag responses. It can be observed from Figure 6 that, for the male-based emotion tag analysis, 2318 amusement instances are correctly classified out of 2868, while for the combined emotion tag responses, 1182 out of 1440 amusement instances are correctly classified. Similarly, 153 out of 187 instances are correctly classified when inferring the sad emotion tag from the male emotion responses; for the female and combined emotion tags, 151 and 155 instances, respectively, are correctly classified out of 187 for the sad emotion class. Inferring emotions using the K-NN classifier indicates that it is feasible to predict the amusement, awe, boredom, excitement, disgust, fear, contentment, anger, and sad emotion tags based on object categories. Moreover, it is observed from Figure 5 that inferring gender-based emotions from different object categories is more effective than combined emotion inference; the combined results are not as strong as those based on the female and male emotion responses separately. However, the overall emotion inference performance is practicable and sufficient.
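Per-class counts like those read from Figure 6 correspond to the diagonal of a confusion matrix, and per-class recall is the diagonal divided by the row sums. The 3 × 3 matrix below is hypothetical except for the row totals of the amusement and sad classes, which are chosen to match the counts quoted above (2318/2868 and 153/187):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
labels = ["amusement", "awe", "sad"]
cm = np.array([
    [2318, 400, 150],   # 2318 of 2868 amusement instances correctly classified
    [120,  900,  80],
    [20,    14, 153],   # 153 of 187 sad instances correctly classified
])

correct = np.diag(cm)            # correctly classified instances per class
totals = cm.sum(axis=1)          # total true instances per class
per_class_recall = correct / totals

for name, c, t, r in zip(labels, correct, totals, per_class_recall):
    print(f"{name}: {c}/{t} correctly classified (recall {r:.3f})")
```

Reading the matrix this way makes it straightforward to compare, per emotion class, the male, female, and combined analyses.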

Performance Analysis for Inferring Emotion Tags at Bottom-Level
This section provides the performance analyses of the male, female, and combined emotion tag responses for bottom-level emotion inference, as shown in Table 2. It is observed that the K-NN classifier gives better results than the SVM and RF classifiers for all three types of emotion tag responses. Likewise, the SVM classifier also gives reasonable results for the individual male, female, and combined responses. On the other hand, the worst accuracy across the three scenarios is obtained with the RF classifier, which reaches only 65% for inferring emotion tags based on male responses. We find that the gender-based emotion tag responses perform better than the combined emotion tag responses for the K-NN, SVM, and RF classifiers. This indicates that splitting the combined emotion tag responses into gender-wise responses leads to better performance for inferring emotions. The best accuracies of 79% and 78% are achieved with the male and female emotion tag responses, respectively. The SVM and RF classifiers show similar performance for the male, female, and combined emotion tag responses. For emotion tags at the bottom level, Figure 7 compares the best-case F1-measure values for each emotion class achieved using K-NN. The F1-measure achieved for bottom-level emotion inference with respect to male subjects is 0.789 using the K-NN classifier, which is 0.042 and 0.147 higher than that of the SVM and RF classifiers, respectively. As shown in Figure 7, the F1-measures achieved for the boredom, contentment, disgust, and sad emotion classes with the male-based emotion responses outperform those of the female and combined emotion tag responses. Similarly, for the amusement, awe, and fear emotion classes, the F1-measures based on female responses are higher than those of the male and combined emotion tag responses.
On the other hand, the F1-measures achieved for the anger and sad emotion classes with the female-based emotion tag responses are low compared to the male and combined emotion tag responses. Generally, females responded with the anger and sad emotion tags less frequently than males for the object categories included in the Caltech-256 dataset. The results for inferring emotion tags using the K-NN classifier indicate that it is easier to predict the amusement, awe, boredom, excitement, disgust, fear, and contentment emotions than the anger and sad tags based on object categories. In particular, the sad and anger responses vary from object to object for both males and females, and the bottom level of the Caltech-256 dataset contains only a few categories that evoke anger and sad emotions in both female and male subjects during labeling. Moreover, for both males and females, emotion perception varies for the same object categories included in the Caltech-256 dataset. Thus, the results for combined emotion inference are not as effective as those obtained from the male- and female-based emotion tag responses separately. However, the overall emotion inference performance is practicable and sufficient. Furthermore, it can be noted that the emotion tag inference accuracy achieved at the top level is higher than that of bottom-level emotion inference. The reason is that when we go into the details of an object category, we obtain more varied emotional responses, as the emotion response differs across sub-object categories. This indicates that emotion tag inference performance varies as we perceive emotions from top-level to bottom-level object categories.

Comparison with the State-of-the-Art Study
This section provides a comparative analysis of the proposed scheme for inferring emotion tags with state-of-the-art techniques. A comparison of the proposed method with well-known existing approaches is presented in Table 3. It can be noted from Table 3 that all the previous studies are based on self-developed datasets, such as images obtained from Flickr, to accomplish a desirable emotion tag inference performance. Moreover, these studies are based on visual features, such as context- and content-wise analyses of visual stimuli. Consequently, these schemes perform poorly in real-world scenarios. It is also hard to perform human behavioral analysis with such schemes because, most of the time, visual information like the color and texture of an image is not enough to perceive a specific emotion tag. In contrast, the proposed scheme considers the object category of an image as the visual feature for perceiving an emotion tag. Thus, it is capable of efficiently inferring nine emotion classes based on three types of analysis, i.e., combined, female, and male emotion responses. The effectiveness of the proposed scheme can be demonstrated by comparing its emotion inference results with the existing state of the art. It can be observed from Table 3 that the accuracy rates achieved by the proposed scheme outperform the reported results of the existing emotion recognition studies. The apparent advantages of our proposed scheme offer more effective decision making than the existing schemes, which is crucial for real-time applications. Despite its efficient recognition performance compared to the existing studies, our proposed system entails certain limitations. For example, it only takes gender information into consideration for emotion perception and analysis; however, emotions vary among human beings with respect to their regions, traditions, and other demographics.
To be more precise, as presented in Section 2.2.3, the object category "dog", for example, is classified under the "contentment" emotion tag, although this perception is highly diverse among individuals belonging to different regions, age groups, and professions. We simply used majority voting in this case for assigning the emotion tags. Thus, the classification of emotion tags from images based on the proposed scheme suffers from the problem of subjectivity. Therefore, to accurately model human emotions based on object interactions, the detailed user demographics must be incorporated into the proposed model so that it generalizes to a large, diverse population.

Conclusions
In this research work, we presented an efficient scheme for inferring emotion tags from object images. We pre-processed the Caltech-256 dataset and applied subjective evaluation for emotion labeling, after which we adopted a CNN model (ResNet-50) to extract deep features for emotion classification. A novel method was devised to infer emotion tags from visual stimuli based on image object categories rather than color and texture features. We trained on and inferred nine emotion classes, including amusement, anger, awe, boredom, contentment, disgust, excitement, fear, and sadness, from object images. Moreover, we used gender-based analysis to see how emotion perception varies between males and females. Overall, the proposed method achieved accuracies of approximately 85% for top-level and 79% for bottom-level emotion tag recognition, which shows the effectiveness of the proposed idea. In the future, the proposed scheme can be expanded to take more user demographic information into consideration to better model and infer human emotions. The proposed study and its results can be utilized in future studies for a wide range of applications, such as recommender systems and emotion-based content generation and retrieval. Furthermore, considering the adaptability of the proposed mechanism, it can be used to depict or infer the mental state of a person, which in turn can be quite useful in the medical field for patient treatment. Moreover, the proposed scheme can be used to model human behavior based on human-object interactions and emotion recognition. With the increasing use of visual content on social media platforms, the proposed scheme can help in analyzing the behavior of different communities towards particular issues.