Prediction of Public Trust in Politicians Using a Multimodal Fusion Approach

Abstract: This paper explores the automatic prediction of public trust in politicians through the use of speech, text, and visual modalities. It evaluates the effectiveness of each modality individually, and it investigates fusion approaches for integrating information from each modality for prediction in a multimodal setting. A database was created consisting of speech recordings, Twitter messages, and images representing fifteen American politicians, and labeling was carried out per a publicly available ranking system. The data were distributed into three trust categories, i.e., the low-trust, mid-trust, and high-trust categories. First, unimodal prediction using each of the three modalities individually was performed using the database; then, using the outputs of the unimodal predictions, multimodal prediction was performed. Unimodal prediction was carried out by training three independent logistic regression (LR) classifiers, one each for speech, text, and images. The prediction vectors from the individual modalities were then concatenated before being used to train a multimodal decision-making LR classifier. We report that the best performing modality was speech, which achieved a classification accuracy of 92.81%, followed by images, achieving an accuracy of 77.96%, whereas the best performing model for the text modality achieved an accuracy of 72.31%. With the multimodal approach, the highest classification accuracy of 97.53% was obtained when all three modalities were used for trust prediction. In a bimodal setup, the best performing combination was speech and images, achieving an accuracy of 95.07%, followed by speech and text with an accuracy of 94.40%, whereas the text and images combination resulted in an accuracy of 83.20%.


Introduction
Trust is an important social characteristic, guiding interactions between people in society. In general terms, it is the willingness to subject oneself to the actions of other individuals. The popularity of politicians, for example, is a measure of public trust in specific individuals. In continuation of our previous work [1,2], we performed experiments for the automatic prediction of public trust using audio data. In these works, we found a statistically significant difference between the speech characteristics of politicians deemed highly trustworthy and those trusted less. Moreover, we also observed important gender-based differences [1]. A multilayer perceptron was used to classify speech data into low-trust and high-trust politicians, achieving an average accuracy of 81%. From these findings, we concluded that speech characteristics could be used for the prediction of public trust in politicians. In the subsequent study [2], we proposed a multimodal framework for predicting trust using the speech and text modalities, and we evaluated trust prediction using the two modalities individually, as well as in combination. The results showed that using more than one modality for trust prediction led to significant improvements in classification performance. The current study extends our investigation into the multimodal prediction of trust: we proposed an enhanced multimodal prediction framework and extended the experiments by including a mid-trust category, as well as a third modality (images).
Here, we further investigated objective indicators for the prediction of public trust manifested through social signals, and we extended our dataset by including the image modality, which complements the two modalities already available, i.e., speech and text. We employed advanced machine learning (ML) techniques used in social signal processing (SSP) [3,4] for speech analysis, representing the speech data with standard acoustic feature-sets from the OpenSMILE toolkit [5]. Text features were represented using natural language processing (NLP) techniques, such as bag of words/term frequency-inverse document frequency (BoW/TF-IDF) [6] and document to vector (doc2vec) [7], as well as variations of the BERT [8] and DistilBERT [9] models. Image features were represented using computer vision models such as ResNet50 [10], VGG16 [11], and Xception [12]. These formats/models represent state-of-the-art techniques for each of the respective modalities.
In this study, a novel multimodal framework for classification was proposed, and its performance was evaluated through extensive experimentation (Figure 1). To validate the experiments conducted for the prediction of trust, a baseline was first established using the speech, text, and image modalities individually. The results from the proposed multimodal fusion of the three modalities were then compared to these baseline results. The multimodal fusion was a two-stage process. The first stage involved predicting trust via each modality individually (Figure 1, Stage 1). The class probability values and labels generated in this step were stored for use in the next stage. In the second stage, the confidence values or labels of all individual modalities from the first stage were combined to form a single fusion vector. This fused vector was passed as input for the training of a separate fusion classifier, which was the ultimate decision maker for the final prediction of trust (Figure 1, Stage 2).

The rest of the paper is structured as follows. Section 2 provides a review of the literature on tasks utilizing speech, text, or images in social signal processing applications. Section 3 discusses the database preparation and feature representation process for the speech, text, and image databases. Section 4 elucidates the methodology. The experiments and their results are discussed in Section 5, and a conclusion is provided in Section 6.

Related Works
Advancements in computing have led to the development of sophisticated technologies for the analysis of human behavior. Aspects of human behavior, commonly referred to as social signals, can be expressed in the form of spoken or written words, body language, or facial expressions [13]. Our behavior is a manifestation of the conscious and subconscious aspects of cognitive processes. Amongst the various artefacts of human conduct, speech, text, and visual attributes have been widely used across research areas within the larger domain of behavioral analysis.
The most prominent forms of human communication are text and speech, and understandably, both forms have been used extensively in research to determine traits related to human behavior. Our mental state is directly related to the manner in which we produce speech. Research studies have indicated that the same part of the human brain controls social interactions and speech [14]; therefore, analyzing speech data can provide insights into social behavior.
Most engineering studies investigating trust have been based on speech analysis [37], where trust has often been contrasted with deception [15,38,39]. Some studies have also investigated textual/lexical features [40,41], as well as facial expressions [42]. The authors in [43] converted speech recordings into RGB spectrogram images and then applied a computer vision model for trust prediction; trust was thus predicted from images, but those images represented speech. These studies suggest that speech, text, and visual attributes may all provide insights into trustworthiness. Whereas these studies utilized unimodal prediction, we proposed a multimodal system for trust prediction.
It has been reported that fusing different modalities has the potential to provide more information than a single modality [32][33][34]. In fact, some research works have explored complementary and cross-modality information fusion, as reported in [44,45]. The authors in [46] provided a review of methods for multimodal data fusion. To this end, our study explored two approaches for fusing data from multiple modalities (speech, text, and images). The predictions resulting from multiple combinations of these three modalities were compared to the results produced by each modality individually.

Database Generation and Feature Computation
A diverse range of social signals have long been investigated to find clues and gain insights into human behavior. Examples include speech, text, facial expressions, and gestures, which have been widely used by researchers for a variety of behavioral analysis tasks. Even though the prediction of trust from speech, text, and images has been extensively studied by psychologists, the field is in its nascent state within engineering. Furthermore, to the best of our knowledge, no publicly available dataset exists for trust prediction. Therefore, we created our own dataset for the task at hand, making use of the publicly available results of an online ranking of public trust in USA politicians [47]. Publicly available speech recordings, tweets (text), and pictures of the politicians on the ranking list were collected, and trust labels were generated from their ranking scores. The collected data were recorded no more than 12 months prior to the publication of the ranking scores in January 2018. The final dataset consisted of speech, text, and image data for fifteen well-known politicians. The individuals were divided into three groups: the first included politicians perceived to have high trust, the second included politicians in the mid-trust category, and the third included low-trust politicians. Each class consisted of data from five individuals, two males and three females. In addition, only politicians who had received a minimum of 100 votes were considered in this study. The three trust categories were formed based on the ratio of positive votes to the total number of votes cast.
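As a minimal illustration (not part of the original study), this labeling rule can be sketched as follows; the column names, vote counts, and tercile split are all hypothetical, as the paper does not state the exact thresholds separating the three trust categories.

```python
# Illustrative sketch of the labeling rule; all names and values are
# hypothetical stand-ins for the public ranking data.
import pandas as pd

rankings = pd.DataFrame({
    "politician": ["A", "B", "C", "D", "E", "F"],
    "positive_votes": [820, 640, 450, 300, 200, 120],
    "total_votes": [1000, 900, 900, 800, 700, 400],
})

# Only politicians with at least 100 votes are considered.
rankings = rankings[rankings["total_votes"] >= 100].copy()

# Trust score: ratio of positive votes to the total number of votes cast.
rankings["trust_ratio"] = rankings["positive_votes"] / rankings["total_votes"]

# Hypothetical tercile split into the low-, mid-, and high-trust categories.
rankings["trust_label"] = pd.qcut(rankings["trust_ratio"], q=3,
                                  labels=["low", "mid", "high"])
print(rankings)
```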

Speech Modality
The speech database consisted of 30 audio files for each of the fifteen politicians included in the study. These audio files had an average length of 12 min and were extracted from audiovisual recordings of these politicians. Acoustic speech features were calculated using the OpenSMILE toolkit [5], which has become an industry standard for speech acoustic research. The speech modality was represented as acoustic parameters using five parameter sets commonly used in paralinguistic research, including the Prosody, IS10, and ComParE feature-sets [5]; the full list is given in Table 1.
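As an illustration, acoustic functionals can be extracted with the opensmile Python wrapper as sketched below; the specific configuration (ComParE_2016 functionals) and the file name are assumptions, and the study's exact OpenSMILE configurations may differ.

```python
# Sketch of acoustic feature extraction with the opensmile Python wrapper;
# the configuration shown here is an assumption for illustration.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,    # 6373 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Returns a pandas DataFrame with one row of acoustic features per file.
features = smile.process_file("politician_speech_01.wav")
print(features.shape)
```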

Text Modality
The text database consisted of 30,325 tweets (High Trust: 10,755; Mid Trust: 9804; Low Trust: 9766). It was created by processing tweets (from Twitter) of the same politicians included in the speech database. A number of preprocessing tasks were performed to make the text suitable for classification experiments, chief among them removing all formatting elements, punctuation marks, and special characters.
Text data were represented using a variety of NLP techniques. These included features represented as bag of words (BoW) [6] and document to vector (doc2vec) [7,52], which are among the most basic but widely used text representation techniques. The other sets of features were derived using pretrained text embeddings from several variations of Bidirectional Encoder Representations from Transformers (BERT) and a distilled version of BERT [8,9]. The names of these features/models are as follows: 1. Bag of Words/Term Frequency-Inverse Document Frequency (BoW/TF-IDF); 2. doc2vec; 3. BERT_base_uncased; 4. BERT_large_uncased; and 5. distilbert_uncased (see Table 2).
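As a brief, hedged illustration of two of these representations (not the study's exact pipeline), the sketch below builds BoW/TF-IDF vectors with scikit-learn and mean-pooled DistilBERT embeddings with the transformers library; the tweet contents are placeholders, and a simple unmasked mean over token states is used for brevity.

```python
# Sketch of two text representations: BoW/TF-IDF trained from scratch and
# mean-pooled DistilBERT embeddings; the tweets are placeholder strings.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

tweets = ["example tweet one", "example tweet two"]

# BoW/TF-IDF: sparse document-term matrix with TF-IDF weighting.
X_tfidf = TfidfVectorizer(lowercase=True).fit_transform(tweets)

# DistilBERT: one 768-dimensional embedding per tweet.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # (batch, tokens, 768)
X_bert = hidden.mean(dim=1)                       # (batch, 768)
```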

Image Modality
A visual database was created by downloading images of the subjects from Google using an image scraping tool. A total of 2196 images (High Trust: 752, Mid Trust: 741, and Low Trust: 703) were downloaded and processed. Classification experiments were performed using three pretrained image processing models, which represent the current state of the art in computer vision: ResNet50 [10], VGG16 [11], and Xception [12].
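For illustration, pooled deep features can be extracted from a pretrained ResNet50 as sketched below; the file name is a placeholder, and the study's exact preprocessing may differ.

```python
# Sketch of deep image feature extraction with a pretrained ResNet50 from
# Keras; the pooled 2048-dimensional output can be fed to an LR classifier.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("politician_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = extractor.predict(x)                   # shape: (1, 2048)
```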

Methodology
The proposed framework performs "end-to-end" tasks right from data preparation and feature extraction up to data assembling and the final prediction of trust. As mentioned earlier, the multimodal framework predicts trust in two stages (Figure 1). In the first stage, trust is predicted on the basis of each modality individually. This stage is a pre-requisite to the final prediction of trust, which is based on using multiple modalities, and we refer to it as the second stage.
Stage 1: Moving on to a "step-by-step" description, raw data for each modality are fed into the system. In the first step, we represent each modality in a suitable format, and we have used a variety of state-of-the-art models for each modality. In the next step, we perform a unimodal prediction of trust using each modality individually. Here, we save the classification labels, as well as the confidence values, which we shall later use for information fusion. In the final step, we fuse the input values (confidence and labels) from Stage 1 to create a new feature vector, which will be used to train the final decision-making classifier.
Stage 2: The fusion of the modalities is performed using two techniques: confidence-based fusion and label-based fusion. In confidence-based fusion, the final trust category is determined by the fusion classifier trained on the concatenated class-confidence scores of the unimodal classifiers, whereas, in label-based fusion, we simply perform a majority vote over the prediction outcomes (labels) of the individual modalities.
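A minimal sketch of the two fusion strategies follows, assuming the per-modality class-probability arrays and label arrays from Stage 1 are already available; all names are illustrative.

```python
# Minimal sketch of the two Stage-2 fusion strategies; inputs are assumed
# to be the Stage-1 outputs (probability arrays of shape (N, n_classes)
# and integer label arrays of shape (N,) per modality).
import numpy as np
from scipy.stats import mode
from sklearn.linear_model import LogisticRegression

def confidence_fusion(probas, y_true):
    """Concatenate the per-modality class probabilities and train the
    decision-making LR classifier on the fused vectors."""
    fused = np.hstack(probas)            # e.g., (N, 3 + 3 + 3) for 3 classes
    return LogisticRegression(max_iter=1000).fit(fused, y_true)

def label_fusion(labels):
    """Majority vote over the unimodal predictions (odd number of modalities)."""
    stacked = np.stack(labels, axis=1)   # (N, n_modalities)
    return mode(stacked, axis=1).mode.ravel()
```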
The block diagram (Figure 1) shows the general framework of the data processing and classification steps applied in our experiments.
Classification Performance Measure

The classification performance of the framework was measured using the accuracy $a$, defined as

$$a = \frac{t_p + t_n}{t_p + t_n + f_p + f_n} \quad (1)$$

where $t_p$ and $t_n$ are the numbers of true-positive and true-negative classification outcomes, respectively, and $f_p$ and $f_n$ denote the numbers of false-positive and false-negative classification outcomes, respectively. The final accuracy is computed as an average over fivefold cross-validation.
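As a small worked sketch (with synthetic stand-in data, since the study's features are not public), the accuracy measure and the fivefold protocol can be expressed as follows:

```python
# Worked sketch of the accuracy measure (Equation (1)) and fivefold
# cross-validation, using synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def accuracy(tp, tn, fp, fn):
    """Equation (1): fraction of correct outcomes among all outcomes."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=45, tn=40, fp=10, fn=5))    # 0.85

# Reported accuracies are averaged over five folds.
X, y = make_classification(n_samples=300, n_informative=8, n_classes=3,
                           random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean())
```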

Experiments and Results
The experiments were divided into three categories: (i) unimodal prediction of trust, (ii) bimodal prediction of trust, and (iii) trimodal prediction of trust. Unimodal prediction refers to the use of a single modality (speech, text, or images) to predict trust; bimodal prediction refers to the use of any two modalities; and trimodal prediction refers to the use of all three modalities. We discuss the specific aspects of the experimental design and the obtained results for each experiment separately, as follows.

Unimodal Prediction of Trust
The proposed multimodal fusion framework uses the confidence values and prediction labels generated by single-modality classifiers, each working with one of the three modalities. Therefore, the first stage of our experiments implemented the unimodal classification. Three separate logistic regression (LR) classifiers were trained and tested, one each for speech, text, and images.

Trust Prediction from Speech
The speech dataset was divided into training (80%) and testing (20%) subsets, and an LR classifier was trained to differentiate between individuals with low, mid, and high trust, using the acoustic speech features computed previously (Section 3.1). The class probability vectors and labels generated during the testing stage were saved for later processing in the multimodal classification, as described in Section 4. Table 1 shows the speech-based classification results, given as accuracy, for the five standard acoustic feature-sets. The results are presented as an average over five runs through fivefold cross-validation. The highest accuracy of 92.81% on the test partition was achieved with the ComParE feature-set, closely followed by the IS10 feature-set with an accuracy of 92.32%. It is interesting to note that although IS10 has a much smaller number of features than ComParE, the two performed similarly. The choice of feature-set can thus be made on the basis of the experimental goal: the highest accuracy or a relatively lightweight but still high-performing feature-set. In our study, we opted for the best performing feature-set in terms of classification accuracy, i.e., ComParE.
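A hedged sketch of this Stage-1 classifier is given below, with synthetic stand-in features in place of the real acoustic feature matrix; the saved probabilities and labels feed the fusion stage of Section 4.

```python
# Sketch of the Stage-1 speech classifier; make_classification stands in
# for the real acoustic feature matrix, which is not publicly available.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_speech, y = make_classification(n_samples=450, n_features=100,
                                  n_informative=10, n_classes=3,
                                  random_state=0)

# 80%/20% train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X_speech, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)   # class-confidence vectors, kept for fusion
labels = clf.predict(X_test)        # predicted labels, kept for label voting
np.save("speech_proba.npy", proba)
np.save("speech_labels.npy", labels)
print("test accuracy:", clf.score(X_test, y_test))
```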

Trust Prediction from Text
Trust prediction from text was performed using the same LR setup as for speech, with features computed using each of the embeddings and representation formats mentioned in Section 3.2. The dataset of text messages was divided into training (80%) and testing (20%) subsets, and the classifier was trained to distinguish between individuals with low, mid, and high trust, using the text features. The results are shown in Table 2. As can be seen, the highest accuracy of 72.31% on the test partition was achieved with the BERT_large_uncased model, followed closely by the distilbert_uncased model with an accuracy of 72.26%. Comparing the top four best performing NLP formats, one can see that their performance was largely similar. However, there were major underlying differences amongst these models. Foremost, the first three techniques use pretrained word embeddings to represent the data, whereas BoW/TF-IDF generates feature vectors from scratch. Comparing the three pretrained NLP models, one notes that they differ considerably in architecture. For simplicity, we compared only the number of parameters in each model: BERT_large_uncased comprises 336 million parameters, BERT_base_uncased has 110 million, and distilbert_uncased has 66 million. It is interesting to note that BoW/TF-IDF, for which we trained the model from scratch, also provided results comparable to the pretrained models.

Trust Prediction from Images
Trust prediction from images was performed using the same LR setup as for the speech and text analyses. In this case, the images were fed to the networks listed in Section 3.3. The dataset of images was divided into training (80%) and testing (20%) subsets, and the classifier was trained to distinguish between individuals with low, mid, and high trust. The results are shown in Table 3. ResNet50 provided the highest classification accuracy of 77.96%, considerably outperforming the other two computer vision models used for this task.

Comparison of Individual Modalities
A comparison between the three modalities showed that speech on its own led to a higher overall performance compared to text and images. The best performing features for each modality are presented in Table 4.

Multimodal Prediction of Trust
As shown above, any one of the speech, text, or image modalities can be used on its own to predict trust. The next question is whether trust prediction can be improved by combining modalities. To answer this question, we created a new set of multimodal features by concatenating the probability vectors and labels generated by the unimodal classifiers and used them to train a decision-making classifier operating on multiple modalities, as shown in Figure 1. This third (bimodal) or fourth (trimodal) classifier acted as the decision maker determining the final trust label. It is important to mention that only the best performing features of each modality were used for the multimodal prediction of trust.
Assuming that the vectors $p_{ai} = \{p_{a1i}, p_{a2i}, p_{a3i}\}$, for $i = 1, \ldots, N$ (where $N$ is the number of data samples), represent the class probability vectors generated by the LR classifier trained on speech data, and $p_{bi} = \{p_{b1i}, p_{b2i}, p_{b3i}\}$ are the class probability vectors generated by the LR classifier trained on text data, the bimodal feature vectors were generated as

$$p_{abi} = \{p_{a1i}, p_{a2i}, p_{a3i}, p_{b1i}, p_{b2i}, p_{b3i}\}, \quad i = 1, \ldots, N. \quad (2)$$

Similarly, if the image modality, represented as $p_{ci} = \{p_{c1i}, p_{c2i}, p_{c3i}\}$, is added to the above system, the resultant vector for multimodal prediction using three modalities is generated as

$$p_{abci} = \{p_{a1i}, p_{a2i}, p_{a3i}, p_{b1i}, p_{b2i}, p_{b3i}, p_{c1i}, p_{c2i}, p_{c3i}\}, \quad i = 1, \ldots, N. \quad (3)$$

Having the ground-truth labels for each sample $i$, we were able to train a further LR classifier to perform trust recognition based on the multimodal feature vectors: $p_{abi}$ when any two of the speech, text, and image modalities are combined, or $p_{abci}$ when all three modalities are used. On this basis, we investigated the following combinations of modalities for multimodal prediction: (i) speech and text; (ii) speech and images; (iii) text and images; and (iv) speech, text, and images.

As the number of examples differed across the modalities, an equal number of samples from each modality was fed to the prediction framework so that the fusion vectors could be formed. As discussed earlier, the multimodal prediction of trust was achieved in two stages: the first stage predicted trust using a single modality, and the second stage predicted trust from multiple modalities via vector concatenation (Equations (2) and (3)). As the size of the dataset changed, we repeated the unimodal prediction of trust on the truncated dataset to maintain the consistency of the classification process. The results of Stage 1 are shown in Table 5.

When combining the outputs of multiple modalities, several approaches to multimodal prediction are possible. One is confidence-level fusion, in which the confidences (class probabilities) of each modality are concatenated to create a new feature vector, as shown in Equations (2) and (3). Another is to obtain the final prediction through voting on the labels of the individual modalities, with the final prediction determined by the label receiving the most votes. The latter technique cannot be applied when only two modalities are at play, as a majority cannot be guaranteed with an even number of modalities. The choice of technique therefore depends on the number of modalities used: for multimodal prediction involving two modalities, the final prediction was obtained using confidence-level fusion, whereas for three modalities, the final prediction was obtained using both confidence-level fusion and majority voting of labels.
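As a small illustrative sketch of Equations (2) and (3) (with random stand-in probabilities, since the actual Stage-1 outputs are study-specific), the fusion vectors can be formed by simple concatenation:

```python
# Illustrative construction of the fusion vectors in Equations (2) and (3);
# p_a, p_b, and p_c stand in for the Stage-1 class probabilities of the
# speech, text, and image classifiers, respectively.
import numpy as np

N = 5                                       # number of (truncated) samples
rng = np.random.default_rng(0)
p_a = rng.dirichlet(np.ones(3), size=N)     # speech: (N, 3), rows sum to 1
p_b = rng.dirichlet(np.ones(3), size=N)     # text:   (N, 3)
p_c = rng.dirichlet(np.ones(3), size=N)     # images: (N, 3)

p_ab = np.hstack([p_a, p_b])                # Equation (2): bimodal, (N, 6)
p_abc = np.hstack([p_a, p_b, p_c])          # Equation (3): trimodal, (N, 9)
```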

Multimodal Prediction of Trust Using Two Modalities (Bimodal Prediction)
Bimodal prediction refers to the use of any two modalities to achieve a multimodal prediction of trust. The results of using different combinations of the three modalities in a bimodal setup are given in Table 6. The best performing combination was "image and speech," which provided the highest accuracy of 95.07% on the test partition. This was closely followed by "speech and text," resulting in an accuracy of 94.40% on the test partition. These results outperformed the highest accuracy of 92.73% achieved with speech data only in unimodal prediction (Stage 1 results; see Table 5). As can be observed from Table 6, speech appeared in both of the best performing combinations in the bimodal setup, suggesting that it is a very strong modality for trust prediction.

To further investigate the combination of different representation formats of each modality, we performed additional experiments using the top 2 and top 3 performing models of each modality to determine how different combinations of features affect the trust prediction outcomes. The results are given in Tables 7 and 8, respectively. It can be observed from the two tables that, as in the previous experiment, speech was present in the best performing results. Furthermore, the experiments including the text modality produced the worst results, which follows from its poor performance at the individual level. It is also notable that using the top 3 models for each modality (Table 8) produced a slight reduction in performance compared to the top 2 model experiment (Table 7). This can be attributed to diminishing returns, where adding more models does not improve the results.

Multimodal Prediction of Trust Using Three Modalities (Trimodal Prediction)

Trimodal prediction refers to the use of all three modalities for the prediction of trust. The results are shown in Table 9. As discussed earlier, we used two techniques, i.e., confidence-based fusion and label-based fusion, to achieve the final prediction of trust. It can be seen that the label fusion (majority voting) technique results in an accuracy of 95.33% on the test partition, considerably outperforming the accuracy of 91.87% achieved with confidence fusion. Comparing the results shown in Tables 4, 6, and 7-9, we can conclude that combining multiple modalities improved the classification accuracy. This is in line with the findings of our previous studies. To further explore the trimodal prediction of trust, experiments were conducted using the top 2 and top 3 performing models for all three modalities. The results are given in Tables 10 and 11, respectively. It can be observed that using the top 2 models provided a marginally better result compared to using the top 3 models, as was observed in Section 5.2.1.

Conclusions
This study provides a comprehensive investigation into the suitability of three modalities of social signals, i.e., speech, text, and images, for the prediction of trust. It not only evaluated the effectiveness of each modality individually but also explored multiple combinations of these modalities for trust prediction, proposing a multimodal framework that performs "end-to-end" tasks. The results revealed that using multiple modalities improves the classification accuracy compared to a single modality. The multimodal classification approaches presented in this paper enabled us to compare and contrast different combinations of the three modalities (and their respective representation formats) to determine the most suitable approach for trust prediction.
The proposed framework for the multimodal prediction of trust was shown to effectively predict public trust in politicians using three different data modalities (speech, text, and images). Through the experiments, we ascertained that a relatively high prediction accuracy could be achieved with each of the three modalities, i.e., 92.81% for speech, followed by 77.96% for images and 72.31% for text. When these modalities were combined, the accuracy increased to 97.53%, achieved by using the top 2 models for all three modalities combined. Moreover, when comparing the results of the different combinations of modalities in the multimodal experiments, the highest accuracies were achieved whenever speech was one of the modalities used. This suggests that speech is a very robust modality for the task at hand, which can be attributed to the fact that speech is one of the oldest modalities used for behavioral analysis and social signal processing tasks, and subsequent advancements in speech analysis techniques have resulted in many mature technologies. Nevertheless, an assessment of trustworthiness based on more than one modality is the natural choice, as multiple modalities make the prediction framework more robust and the assessment more reliable. This is also true for people in a real-world environment; when we can see and hear a person and read their texts, we form a more comprehensive opinion of them than when assessing them only on the basis of their speech, their written text, or even just an image.
One critical limitation of the methodology employed here is that it applies only to a closed set of politicians. In future studies, we intend to investigate the utility of the proposed framework by extending it to a more general dataset.