An Integrated Deep Learning and Belief Rule-Based Expert System for Visual Sentiment Analysis under Uncertainty

Abstract: Visual sentiment analysis has become more popular than textual sentiment analysis in various domains for decision-making purposes. On account of this, we develop a visual sentiment analysis system that can classify the expression of an image. The system classifies images into six expressions: anger, joy, love, surprise, fear, and sadness. In our study, we propose an expert system that integrates a Deep Learning method with a Belief Rule Base (known as the BRB-DL approach) to assess an image's overall sentiment under uncertainty. This BRB-DL approach combines data-driven and knowledge-driven techniques to determine the overall sentiment. Our integrated expert system outperforms the state-of-the-art methods of visual sentiment analysis with promising results. The integrated system can classify images with 86% accuracy. The system can be beneficial for understanding the emotional tendency and psychological state of an individual.


Introduction
In modern communication, people use different social media platforms (e.g., Facebook, Twitter, Instagram, and Flickr) to express their opinions on various issues and activities of their daily life. On these platforms, users can share visual content alongside text to communicate with others. It is easier to express emotions intuitively through images [1]. As the saying goes, "A picture is worth a thousand words". Figure 1 shows how an image can convey sentiment without any text. In Figure 1a, the cat is in a happy mood as it is enjoying the fruit. On the other hand, Figure 1b represents the storm forecast. Thus, visual sentiment analysis has become a part of our daily lives [2].
The accurate prediction of users' sentiment from the images they upload to social media has become an important research challenge [3,4]. However, image data may be inconsistent, missing, or duplicated, which leads to various types of uncertainty (e.g., ignorance, incompleteness, imprecision, ambiguity, and vagueness). These uncertainties can reduce prediction accuracy. For image classification, deep learning methods are widely used as they can learn accurate and robust image features [5,6]. To handle the different types of uncertainty in image data, the Belief Rule-Based Expert System (BRBES) is more applicable [7,8]. Since an integrated model performs better than a stand-alone model [9,10,11], we propose the integration of a deep learning method with BRBES to improve the prediction accuracy in visual sentiment analysis. This is the key contribution of our research work. As a Deep Learning method can process raw data directly, it is used effectively to solve various classification and regression problems. However, as a data-driven approach, it is limited in its ability to address different types of uncertainty [12]. In contrast, as a knowledge-driven approach, BRBES can address these types of uncertainty [13]. However, it cannot integrate associative memory in its inference procedure. For example, multiplication and division operators are used to calculate the activation weight of a rule in the BRBES inference framework, and these can yield inaccurate activation values for the rules. To solve this issue, a Deep Neural Network (DNN) model can be used to calculate the rule activation weights, providing more accurate values.
Therefore, by integrating the Deep Learning method within the BRBES inference framework, our proposed system provides more accurate rule activation weights, which results in an accurate prediction of sentiment under uncertainty.
BRBES is composed of a set of rules and provides results based on those rules. Each rule consists of an antecedent part and a consequent part. The antecedent part of a rule is based on the input, and the consequent part contains the output. Generally, there are two types of BRBES: Conjunctive BRB [14] and Disjunctive BRB [15]. The AND logical operator connects the antecedent attributes in the Conjunctive BRB, whereas the OR logical operator is used in the Disjunctive one. Because of the AND operator, the Conjunctive BRB takes more computation time and requires a large number of rules in the rule base. In our experiments, we use the Disjunctive one as it needs less computation time and has fewer rules. We explore a total of eight belief rule bases for the BRBES, which take less computational time.
In this paper, we address the following research questions: (1) Why do we use the Deep Learning model for visual sentiment analysis? (2) What is the advantage of utilizing BRBES in our proposed system? (3) Why and how do we combine Deep Learning with BRBES? The remainder of this paper is organized as follows: Section 2 surveys related work on visual sentiment analysis. Section 3 provides an overview of our proposed BRB-DL approach. Section 4 discusses the procedure of the experiments. Section 5 reports the experimental results and the evaluation of BRB-DL compared to different models such as SVM, Naive Bayes classifier, Decision Tree classifier, VGG16, VGG19, and ResNet50. Finally, Section 6 concludes the paper with some future plans.

Related Work
This section presents a literature review on visual sentiment analysis. Siersdorfer and Hare [16] mainly focused on the bag-of-visual word representation and color distribution of images. They estimated the polarity of sentiment in images by extracting the discriminative sentiment-related features and deployed a machine learning approach. Machajdik and Hanbury [17] considered a method that extracted and combined low-level features of images. These features were used for emotion classification. They considered awe, amusement, contentment, excitement as positive emotions and anger, fear, disgust, and sadness as negative ones.
To generate a large-scale Visual Sentiment Ontology (VSO), Borth et al. [18] presented a method based on psychological theories and web mining. They proposed SentiBank, a visual concept detector library. The research work was tested on a dataset of 807 artistic photographs depicting eight emotions: amusement, awe, contentment, excitement, anger, disgust, fear, and sadness. Moreover, Chen et al. [19] introduced DeepSentiBank for detecting the emotion of an image. Vasavi and Aditi [20] adopted a Deep Learning approach to predict emotions depicted in images. They conducted their experiment on a popular Flickr image dataset and predicted five emotions: love, happiness, violence, fear, and sadness.
However, some of the available data contain various kinds of noise. To reduce the noise of large-scale image data, You et al. [5] proposed a progressive CNN (PCNN) model. In addition, to reduce over-fitting in visual sentiment analysis, Islam and Zhang [4] adopted the transfer learning approach. In this study, they utilized the hyper-parameters learned from a Deep Convolutional Neural Network (DCNN). Wang et al. [21] presented a visual sentiment analysis framework where an adjective and a noun are jointly learned by using deep neural networks. To train a visual sentiment classifier, Vadicamo et al. [22] applied the sentiment polarity of the textual contents and proposed a cross-media learning approach. In addition, Campos et al. [6] trained an AlexNet model adapted for visual sentiment prediction.
Fengjiao and Aono [23] considered a merged method where both hand-crafted and CNN features were incorporated. They employed hand-crafted features to extract the local visual information and CNN models to get the global visual information. To label the emotions of painting images, Tan et al. [24] proposed a method where the painting features were considered. They developed a classification model based on VGG16 and ResNet50. Moreover, Paolanti et al. [25] analyzed the sentiment of social images related to cultural heritage and compared the results of VGG16, ResNet, and Inception models. Recently, Chowdhury et al. [26] adopted an ensemble of transfer learning models and employed three pre-trained deep CNN models: VGG16, Xception, and MobileNet. A summary of the prior research on visual sentiment analysis is shown in Table 1. None of these studies applied an integrated approach of Deep Learning and BRBES. In our proposed method, we focus on the integration of a Deep Learning method with a BRBES inference framework, which helps to predict the sentiment of images effectively.

Table 1. Summary of prior research on visual sentiment analysis.

| Reference | Description | Model | Limitation |
| --- | --- | --- | --- |
| [4] | Proposed a visual sentiment analysis framework using transfer learning approach. | Deep CNN | Cannot predict multi-label multi-class sentiment. |
| [5] | Employed a progressive technique to fine-tune the deep neural network. | Progressive CNN | Multimodality models are not applied for sentiment analysis. |
| [6] | Trained an AlexNet model for visual sentiment prediction. | CaffeNet CNN | Other CNN architectures are not used. |
| [16] | Estimated the sentiment polarity of an image by extracting the discriminative sentiment related features. | SentiWordNet | Cannot predict multi-label multi-class sentiment. |
| [17] | Developed a method by extracting and combining low-level features for image emotion classification. | Naive Bayes classifier | More feasible ground truth is required. |
| [18] | Introduced a visual concept detector library named SentiBank. | Plutchik's Wheel | Aesthetic features and facial expression features are not used. |
| [21] | Proposed a visual sentiment analysis framework considering both an adjective and a noun deep neural networks. | DCAN | Cannot focus on automatically discovering middle-level representation. |
| [22] | Leveraged on a cross-media learning approach to predict the sentiment polarity of an image. | Deep CNN | Cannot predict multi-label multi-class sentiment. |
| [23] | Proposed a merged method where both hand-crafted and CNN features were incorporated. | VGG16 | Cannot predict multi-label multi-class sentiment. |
| [24] | Proposed a classification model based on VGG16 and ResNet50. | VGG16, ResNet50 | Other models are not considered. |
| [25] | Introduced a deep learning method to estimate the sentiment of cultural heritage related images. | VGG16, ResNet, and InceptionV2 | Cannot predict multi-label multi-class sentiment. |
| [26] | Introduced TLEnsemble method by using three deep CNN models. | | |

Proposed Framework
In this research, an integrated model of a Convolutional Neural Network (CNN) [27] and a Belief Rule Base (BRB) is developed to classify visual sentiments. The system flow chart is illustrated in Figure 2. As Figure 2 shows, the integrated model first fetches data from the dataset and sends it to the data augmentation step. After augmentation, the data is sent to the preprocessing steps, where each image is resized to 150 × 150 pixels and then converted from RGB to grayscale. The processed image is then sent to the CNN model. The result of the CNN model is fed into the BRB model, which predicts the final sentiment label of the image.

Convolutional Neural Network Model
The architecture of the Convolutional Neural Network (CNN) is shown in Figure 3. The input shape of this model is (150, 150, 1), where the first 150 is the height of the input image, the second 150 is its width, and 1 signifies that the image is grayscale. A max pooling layer with a 2 × 2 pool size follows each convolution layer. The max pooling layer selects the highest value from each window of a rectified feature map, which shrinks the feature map and so reduces the data size and the number of parameters in the subsequent layers. Along with the max pooling layer, a ReLU activation and a dropout layer are also included with each convolution layer. ReLU serves as the activation function, while the dropout layer randomly deactivates neurons to avoid overfitting. A global average pooling layer is introduced as the last layer before the dense layer. Since the model is classifying eight sentiments, the output layer has eight nodes. Softmax is used as the activation function of the output layer, as shown in Equation (2):
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \quad (2)$$
Here, $z$ is the input vector and $e^{z_i}$ is the standard exponential function applied to its $i$th element. The input vector $z$ is the output of the Fully Connected (FC) layer of the CNN model. The FC layer produces raw prediction values, which are known as logits [28]. Logits are real numbers (−∞ to +∞). The softmax activation function turns these logits into the probabilities of each class.
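The softmax step described above can be illustrated with a minimal sketch in plain Python. The logit values are hypothetical and chosen only for illustration; subtracting the maximum logit is a common numerical-stability trick and is an assumption here, not something stated in the text.

```python
import math


def softmax(logits):
    """Turn a list of real-valued logits into class probabilities."""
    # Subtracting the largest logit does not change the result,
    # but it keeps exp() from overflowing for large inputs.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, one per class (illustrative values only).
example_logits = [2.0, 0.5, -1.0, 0.0]
probabilities = softmax(example_logits)
```

The returned values are non-negative and sum to one, so they can be read as the per-class probabilities that the BRB later receives as input.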
The Adam optimizer has been used to optimize the integrated model. As a loss function, Categorical Cross-entropy is used to reduce the validation loss. The architecture of the CNN model is shown in Table 2.
The input image is an array of pixels. The convolution layer consists of multiple kernels with multiple weights. The variation of the kernel weights helps to handle different scales of the images. These kernels are used to extract features from the input image. The features of an image (edges, interest points, etc.) provide very rich information on its content. When a kernel is slid over the input image, it produces a feature map, where each value is computed from the kernel weights and the neighboring pixels. This feature map is then passed through the ReLU activation function, which introduces nonlinearity by converting the negative values of the feature map to zero. The pooling layer merges semantically similar features into one. The max pooling layer computes the maximum value from the portion of the feature map covered by the pooling window. For image segmentation, the layers extract two types of features (full region feature and foreground feature) for each region. Thus, the convolution layer and the max pooling layer generate different feature maps for different images. These feature maps are used to train and validate the model.
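To make the kernel, ReLU, and max pooling steps concrete, the following is a small sketch in plain Python on a toy 5 × 5 image. The image, the edge-detecting kernel, and the function names are hypothetical; as in most CNN libraries, the "convolution" here is computed without flipping the kernel, with stride 1 and no padding.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative values with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [
        [
            max(feature_map[2 * r][2 * c], feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c], feature_map[2 * r + 1][2 * c + 1])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy image: dark left half, bright right half (hypothetical data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

feature_map = relu(conv2d_valid(image, kernel))
pooled = max_pool_2x2(feature_map)
```

The feature map is non-zero only where the kernel straddles the edge, and pooling halves each spatial dimension while keeping that response.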

Belief Rule-Based Expert System
A belief rule is an extended form of an IF-THEN rule. It consists of an antecedent part and a consequent part. The antecedent part contains the antecedent attributes, and the consequent part contains the consequent attribute. Referential values are assigned to the antecedent attributes, and belief degrees are attached to the referential values of the consequent attribute. The relation can be shown as Equation (3):
$$R_k: \text{IF } (I_1 \text{ is } Q_1) \text{ AND/OR } (I_2 \text{ is } Q_2) \text{ AND/OR } \dots \text{ AND/OR } (I_{T_k} \text{ is } Q_{T_k}) \text{ THEN } \{(O_1, \beta_{1k}), (O_2, \beta_{2k}), \dots, (O_n, \beta_{nk})\} \quad (3)$$
where $I_1, I_2, \dots, I_{T_k}$ are the antecedent attributes of the $k$th rule ($k = 1, 2, \dots, L$) and $Q_1, Q_2, \dots, Q_{T_k}$ are their referential values. $O_1, O_2, \dots, O_n$ are the referential values of the consequent attribute, $\beta_{1k}, \beta_{2k}, \dots, \beta_{nk}$ are the belief degrees of these referential values, and $\sum_{j=1}^{n} \beta_{jk} \le 1$. The attribute weights are $\delta_{k1}, \delta_{k2}, \dots, \delta_{kT_k}$, and the rule weight is $\theta_k$.
Generally, a group of belief rules is called a Belief Rule Base (BRB). In a Belief Rule-Based Expert System (BRBES), the BRB forms the initial knowledge base, and Evidential Reasoning (ER) serves as the inference engine. The knowledge representation parameters include the rule weight [29], belief degrees [30], and attribute weight [31]. These are used to capture uncertainty in data. The inference procedure includes input transformation [32], rule activation [29], belief update [33], and rule aggregation [34]. The working process of a BRBES is shown in Figure 4.
The activation weight, $w_k$, of a rule in a disjunctive BRB is calculated with Equation (4), where $\alpha_i$ is the matching degree and $\theta_k$ is the rule weight. The belief degrees are then updated with Equation (5), where $\bar{\beta}_{ik}$ is the original belief degree and $\beta_{ik}$ is the updated belief degree. Rule aggregation is calculated using Equation (6), where $\beta_j$ is the ER (Evidential Reasoning) aggregated belief degree. The outputs of the rule aggregation process are fuzzy values [7]. The crisp value [8] is calculated from these fuzzy outputs as shown in Equation (8):
$$\text{Crisp value} = \sum_{i=1}^{n} u(S_i)\, \beta_i \quad (8)$$
Here, $u(S_i)$ is the utility score of each referential value, while $\beta_i$ is the ER aggregated belief degree. Figure 5 illustrates the Belief Rule Base tree of our experiment. X2, the root node of this tree, represents the "Overall Sentiment Score". In a BRB, such a node corresponds to the consequent attribute of the rule. As mentioned earlier, this consequent attribute consists of a number of referential values, each associated with a belief degree related to the overall sentiment. The matching degrees are presented in Table 3.
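For orientation, the rule activation, belief update, and rule aggregation steps are commonly written in the BRB literature in the form below. This is a hedged restatement under assumptions and may differ in detail from Equations (4) to (7) of this work: the combined matching degree $\alpha_k$ of a disjunctive rule is assumed to be the sum of the matching degrees of its antecedents, and the symbols $J_t$ (number of referential values of the $t$th antecedent attribute), $N$ (number of consequent referential values), and $\mu$ (normalization factor) are introduced here for illustration.

```latex
% Rule activation weight (assumed disjunctive form)
w_k = \frac{\theta_k \alpha_k}{\sum_{l=1}^{L} \theta_l \alpha_l},
\qquad \alpha_k = \sum_{i=1}^{T_k} \alpha_i

% Belief degree update; \tau(t,k) = 1 if attribute t is used in rule k, else 0
\beta_{ik} = \bar{\beta}_{ik}\,
  \frac{\sum_{t=1}^{T_k} \left( \tau(t,k) \sum_{j=1}^{J_t} \alpha_{tj} \right)}
       {\sum_{t=1}^{T_k} \tau(t,k)}

% Analytical ER aggregation over L rules and N consequent referential values
\beta_j = \frac{\mu \left[ \prod_{k=1}^{L} \left( w_k \beta_{jk} + 1 - w_k \sum_{i=1}^{N} \beta_{ik} \right)
          - \prod_{k=1}^{L} \left( 1 - w_k \sum_{i=1}^{N} \beta_{ik} \right) \right]}
         {1 - \mu \prod_{k=1}^{L} \left( 1 - w_k \right)}

\mu = \left[ \sum_{j=1}^{N} \prod_{k=1}^{L} \left( w_k \beta_{jk} + 1 - w_k \sum_{i=1}^{N} \beta_{ik} \right)
      - (N - 1) \prod_{k=1}^{L} \left( 1 - w_k \sum_{i=1}^{N} \beta_{ik} \right) \right]^{-1}
```

When every rule uses all antecedent attributes and the input is complete, the fraction in the belief update equals one, so the belief degrees stay at their original values.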
The activation weights for this experiment are calculated with Equation (4). The rule weight ($\theta_k$) is set to 1 for our experiment [35]. The resulting activation weights are shown in Table 4. Equation (5) is used to update the belief degrees. The initial belief degrees for this experiment are presented in Table 5. Since all antecedent attributes are used to define this rule base, $\tau(t, k) = 1$ in this experiment [36]. Following the same process, we have calculated the values of $\beta_{0,1}$ to $\beta_{5,2}$. Equations (6) and (7) are used to calculate the aggregated belief degrees. In this experiment, the calculated aggregated belief degrees for positive, neutral, and negative are shown in Table 6.
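The inference chain described above (activation weights, ER aggregation, crisp score) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes one rule per sentiment class with the class probability as its matching degree, uses the analytical form of the ER algorithm, and the probabilities and belief degrees are hypothetical placeholders rather than the values of Tables 3 to 5. The rule weight of 1 and the utility scores 1.0, 0.5, and 0.0 follow the text.

```python
def activation_weights(matching_degrees, rule_weights):
    """w_k = theta_k * alpha_k / sum_l(theta_l * alpha_l)."""
    weighted = [t * a for t, a in zip(rule_weights, matching_degrees)]
    total = sum(weighted)
    return [w / total for w in weighted]


def er_aggregate(weights, beliefs):
    """Analytical evidential reasoning; beliefs[k][j] is beta_jk of rule k."""
    n = len(beliefs[0])
    # One product per consequent referential value j.
    prod_j = []
    for j in range(n):
        p = 1.0
        for w, b in zip(weights, beliefs):
            p *= w * b[j] + 1.0 - w * sum(b)
        prod_j.append(p)
    prod_rest = 1.0        # product of (1 - w_k * sum_i beta_ik)
    prod_unassigned = 1.0  # product of (1 - w_k)
    for w, b in zip(weights, beliefs):
        prod_rest *= 1.0 - w * sum(b)
        prod_unassigned *= 1.0 - w
    mu = 1.0 / (sum(prod_j) - (n - 1) * prod_rest)
    return [mu * (p - prod_rest) / (1.0 - mu * prod_unassigned) for p in prod_j]


def crisp_score(aggregated, utilities):
    """Weighted sum of utility scores and aggregated belief degrees."""
    return sum(u * b for u, b in zip(utilities, aggregated))


# Hypothetical CNN probabilities for anger, joy, love, surprise, fear, sadness.
class_probabilities = [0.05, 0.60, 0.20, 0.10, 0.03, 0.02]
# Hypothetical belief degrees per rule: (Positive, Neutral, Negative).
belief_table = [
    (0.0, 0.1, 0.9),  # anger
    (0.9, 0.1, 0.0),  # joy
    (1.0, 0.0, 0.0),  # love
    (0.4, 0.5, 0.1),  # surprise
    (0.0, 0.2, 0.8),  # fear
    (0.0, 0.1, 0.9),  # sadness
]

weights = activation_weights(class_probabilities, [1.0] * 6)
aggregated = er_aggregate(weights, belief_table)
score = crisp_score(aggregated, [1.0, 0.5, 0.0])
```

With these placeholder inputs the positive belief dominates, so the score lands in the upper half of the 0 to 1 range; a real run would substitute the trained CNN's output and the belief degrees of the actual rule base.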

Integrated Framework
Our proposed integrated approach is used to predict the sentiment label and class of an image. To select an image file from the directory, we use the filedialog.askopenfilename() method from the tkinter package. Since the user selects the image from the file gallery, its size is not fixed, which may reduce the accuracy of the model. Hence, the image is converted to grayscale and resized to 150 × 150 pixels by interpolation, which can up-size or down-size N-dimensional images. This operation is done with a built-in function of the scikit-image [37] library. After that, the processed image is convolved by each of the convolution layers of the integrated framework. The filters of each layer create a map of different features. This map is then sent to the max pooling layer, which selects the greatest pixel value in each pooling window. The output map of the max pooling layer is delivered to the trained hidden layers, where matrix chain multiplication is performed using the optimized weights. The output is forwarded to the output layer.
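The grayscale conversion and resizing described above can be sketched without external libraries as shown below. This is a simplified stand-in, not the scikit-image routine the system uses: the luminance weights are an assumed convention, nearest-neighbour sampling replaces proper interpolation, and the function names and toy image are hypothetical.

```python
def to_grayscale(rgb_image):
    """Collapse an H x W image of (R, G, B) tuples into H x W luminance values."""
    # Assumed luminance weights; they sum to 1 so white stays white.
    return [
        [0.2125 * r + 0.7154 * g + 0.0721 * b for (r, g, b) in row]
        for row in rgb_image
    ]


def resize_nearest(gray, out_h=150, out_w=150):
    """Resize by nearest-neighbour sampling (a stand-in for interpolation)."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy 2 x 3 RGB image (hypothetical pixel values).
rgb_image = [
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 255), (128, 128, 128)],
]
gray = to_grayscale(rgb_image)
resized = resize_nearest(gray)
```

Whatever the size of the selected file, the result always has the 150 × 150 single-channel shape that the CNN input layer expects.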
The Softmax activation function is used in the output layer to calculate the probability that an image belongs to a specific class. In our experiment, the "Anger", "Joy", "Love", "Surprise", "Fear", and "Sadness" classes are used as the referential values. "Sentiment Class" is considered as the antecedent attribute in the BRBES, and the probability of each class is used as the corresponding referential value of the antecedent attribute. Moreover, the consequent attribute is the "Overall Sentiment Score" with referential values "Positive", "Neutral", and "Negative". The utility scores for these referential values are chosen as "1.0", "0.5", and "0.0", respectively. The belief rule used for this integrated system is shown in Table 5. The inference procedure is then carried out, and the final results are calculated using these belief rules. The process of calculating the matching degree in BRB is shown in Algorithm 1.

Algorithm 1. Calculation of the matching degree.
    ...
    if string(i) < length(date) then
        matching_degree ← pow(input[i], relativeWeight)
        sumMatchingDegree ← sumMatchingDegree + matching_degree
        goto loop
    close
close
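The visible steps of Algorithm 1 (raise each input to the power of the relative weight and accumulate the sum) can be mirrored in a short Python sketch. Only those steps are reproduced; the function name, the default relative weight of 1, and the input values are assumptions made for illustration.

```python
def matching_degrees(inputs, relative_weight=1.0):
    """Return each matching degree and their running sum."""
    degrees = []
    sum_matching_degree = 0.0
    for value in inputs:
        # Mirrors: matching_degree <- pow(input[i], relativeWeight)
        matching_degree = pow(value, relative_weight)
        sum_matching_degree += matching_degree
        degrees.append(matching_degree)
    return degrees, sum_matching_degree


# Hypothetical class probabilities produced by the CNN.
degrees, total = matching_degrees([0.7, 0.2, 0.1])
squared_degrees, squared_total = matching_degrees([0.7, 0.2, 0.1], relative_weight=2.0)
```

With a relative weight of 1 the matching degrees equal the input probabilities; larger weights sharpen the contrast between the dominant class and the rest.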

Dataset Collection
The proposed CNN model is trained by using the dataset collected from [38]. There are 5732 image files in six categories including anger, joy, love, surprise, fear, and sadness. Table 7 presents the statistics of the dataset.
The images are augmented by using image augmentation functions such as rotating, scaling, flipping, zooming, and shifting. Due to the augmentation, the dataset grew from 5732 images to 18,358 images. The larger dataset obtained after augmentation helps to increase the accuracy. An example of data augmentation is shown in Figure 6.
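As a rough illustration of how such transformations multiply the number of training images, the sketch below applies three simple operations to a toy grayscale image stored as nested lists. The helper names and the image are hypothetical, and the augmentation library used in the experiments offers richer transformations than these.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[: len(row) - pixels] for row in image]


# Toy 2 x 3 grayscale image (hypothetical pixel values).
image = [[1, 2, 3], [4, 5, 6]]
augmented = [
    image,
    flip_horizontal(image),
    rotate_90_clockwise(image),
    shift_right(image, 1),
]
```

Each original image yields several variants, which is how a dataset can grow by a factor of roughly three, as reported above.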

Evaluation Measures
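Evaluation metrics are used to quantify the performance of a model. Our proposed model is evaluated by using the confusion matrix, precision, recall, F1-score, and accuracy. A confusion matrix is a table layout that helps to visualize the performance of a classification model. The structure of the confusion matrix is shown in Figure 7. According to Figure 7, the y-axis of the confusion matrix represents the actual values, while the x-axis represents the predicted values. Precision, recall, F1-score, and accuracy are calculated as shown in Equations (9)-(12), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (9)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (10)$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (11)$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (12)$$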
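A minimal sketch of how these measures can be read off a multi-class confusion matrix is given below. It follows the layout described above (rows are actual classes, columns are predicted classes); the 3 × 3 matrix is hypothetical and does not reproduce any result of this work.

```python
def per_class_metrics(confusion, k):
    """Precision, recall, and F1-score of class k (rows actual, columns predicted)."""
    tp = confusion[k][k]
    fp = sum(row[k] for row in confusion) - tp  # predicted k, actually another class
    fn = sum(confusion[k]) - tp                 # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def overall_accuracy(confusion):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical confusion matrix for three classes.
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
precision_0, recall_0, f1_0 = per_class_metrics(confusion, 0)
accuracy = overall_accuracy(confusion)
```

Per-class values like these are what a table of evaluation metrics reports, while the overall accuracy summarizes all classes in one number.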

Implementation
The system is developed in the Spyder IDE using the Python programming language. The model is trained on the Google Colab cloud system. Various libraries are used for our experiment, such as TensorFlow, Keras, sklearn, NumPy, matplotlib, and nlpaug. TensorFlow serves as the backend of the system. Keras is used to build the model through its built-in components, such as activation functions, optimizers, layers, and the ModelCheckpoint callback. The sklearn library is used to generate the confusion matrix. NumPy is used for numerical analysis. In addition, matplotlib is used for graphical representation, such as the accuracy vs. epoch graph, the loss vs. epoch graph, and the confusion matrix. The nlpaug API is used for data augmentation.

Results and Discussion
The learning curves of our model (accuracy and loss) are shown in Figure 8a,b. In Figure 8a,b, the x-axis represents the number of epochs, while the y-axis represents the accuracy and loss, respectively. From Figure 8a, it can be noticed that the validation accuracy increases from 0.22 to 0.65, and the training accuracy increases from 0.2 to 0.64 after the first 50 epochs. After that, it reaches 0.87 for validation and 0.81 for training in the last epoch. From Figure 8b, it is seen that the validation loss decreased from 1.99 to 0.9 and the training loss decreased from 2.00 to 0.95 after the first 50 epochs. After that, it decreased gradually and became 0.4 for validation and 0.55 for training in the last epoch. The accuracy curve follows an upward trend, while the loss curve follows a downward trend. These trends help the model to achieve an average accuracy of 93.23% for training and 87.17% for testing at the end of the training phase.
The evaluation metrics for this model are shown in Table 8. According to Table 8, it can be observed that the accuracy of each class is equal to or more than 80%, while the accuracy of joy and sadness is equal to or more than 90%.
The integrated framework is designed to support real-time validation. For real-time validation, we first select an image from the local storage. After that, the model calculates the overall sentiment of the image by analyzing the values of "Positive", "Neutral", and "Negative". The process of the real-time validation is shown in Figure 9a,b. We have performed 5-fold cross validation to validate the model performance. From our previous study [11], we have seen that the deep learning model works better with a 70:20:10 split ratio. Therefore, we split the dataset in a 70:20:10 ratio, where 70% of the total images are used for training the model, 20% are used for validation, and the remaining 10% are used for testing. These images are selected randomly from the dataset. The results of the 5-fold cross validation and the average accuracies, along with the standard deviations, are shown in Table 9. From Table 9, it can be seen that the highest training accuracy, 0.95, was found in the fourth fold. However, the highest validation accuracy (0.89) and testing accuracy (0.87) were achieved in the third fold.
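The random 70:20:10 split can be sketched as follows. The function name and the fixed seed are assumptions made so that the example is reproducible; repeating the call with a different seed would give a different random partition for each fold.

```python
import random


def split_70_20_10(items, seed=0):
    """Shuffle the items and split them into train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding of the cut points.
    n_train = len(shuffled) * 7 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical image identifiers standing in for image files.
image_ids = list(range(100))
train_set, validation_set, test_set = split_70_20_10(image_ids)
```

Every image ends up in exactly one of the three subsets, so the test set remains unseen during training and validation.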

Comparison to Different Models
The dataset considered in this research [38] was also used to train various machine learning models (Support Vector Machine, SVM with HOG features, Decision Tree classifier, Naive Bayes classifier) and pre-trained CNN models (VGG16, VGG19, ResNet50). The performance of the proposed model has been compared against these seven models. Table 10 presents the comparison among the eight models, including our proposed model, in terms of Precision, Recall, F1-Score, and Accuracy. From the table, it can be observed that our proposed BRB-DL model outperforms the other seven models, as its Precision, Recall, F1-Score, and Accuracy values are higher than theirs.

Conclusions and Future Direction
The objective of this research is to calculate the overall sentiment of an image file from six classes. We have applied a very effective method for visual sentiment analysis that integrates a Convolutional Neural Network (CNN) with a Belief Rule-Based Expert System (BRBES). The CNN model is used to calculate the class-wise prediction probability, while the BRB triggers particular rules for estimating the overall sentiment of the image. This integrated framework can be used to analyze users' sentiment on social media platforms. In addition, this model can help in treating patients with neurological disorders.
Although the prediction accuracy of each class in our experiment is promising, the model can be improved by increasing the accuracy of love and surprise classes.
In the future, we plan to improve our system by adopting the BRBES-based adaptive Differential Evolution (BRBaDE) approach [39]. This can improve the prediction accuracy by using parameter and structure optimization.