A Comparative Analysis of Modeling and Predicting Perceived and Induced Emotions in Sonification

Abstract: Sonification is the utilization of sounds to convey information about data or events. There are two types of emotions associated with sounds: (1) "perceived" emotions, in which listeners recognize the emotions expressed by the sound, and (2) "induced" emotions, in which listeners feel emotions induced by the sound. Although listeners may widely agree on the perceived emotion for a given sound, they often do not agree about the induced emotion, so it is difficult to model induced emotions. This paper describes the development of several machine and deep learning models that predict the perceived and induced emotions associated with certain sounds, and it analyzes and compares the accuracy of those predictions. The results revealed that models built for predicting perceived emotions are more accurate than ones built for predicting induced emotions. However, the gap in predictive power between such models can be narrowed substantially through the optimization of the machine and deep learning models. This research has several applications in automated configurations of hardware devices and their integration with software components in the context of the Internet of Things, for which security is of utmost importance.


Introduction
The Internet of Things (IoT) has enabled a rich landscape of interconnected ubiquitous devices capable of offering a variety of services and applications. The IoT supports a gamut of sensors that are capable of recording and transmitting data from a wide variety of sources. To ensure the reliability of these interconnected devices when interoperating, extensive monitoring and alarming systems are needed. Techniques and approaches such as textual warning messages, visualization (e.g., DataDog [1]), and alarming through sounds are the mainstream channels employed for communication purposes in different hardware/software platforms, including the IoT. For instance, Flight Guardian [2], a flight deck warning system designed for older airplanes lacking digital warning systems, can improve flight safety by monitoring a pilot's situational awareness using real-time video analysis and underlying knowledge to generate timely speech warnings. While the use of textual data and visualizations has been explored in typical cyber-physical systems (CPSs) and the IoT, the use of sounds in these contexts is accompanied by additional complexity and may require further analysis and comprehension before becoming a main avenue for communication. An example of such complexity is whether certain types of sounds induce certain types of emotions in the system operator. The answer to this question is important for ensuring the effectiveness of communication in such complex systems.

Sonification Applications in the IoT
The literature regarding the use of sonification in the IoT describes the many facets and versatility of sonification across several application domains. One of the primary uses of sonification in the IoT is in medical applications. IoT sensors can continuously record and monitor data from different parts of the body. Researchers have proposed the use of the IoT for remotely monitoring elderly patients' health [3]. Measurements such as heart rate, blood pressure, and body temperature [4] can be collected remotely. In the event of an accident, such as a fall, such a system can quickly alert doctors, the patient's caretakers, or both, and can sound an alarm that could alert anyone in the patient's general vicinity [5]. Researchers have also proposed a sonification system for asthma patients that can inform a patient's emergency contacts when a sudden asthma attack occurs and can activate a buzzer to alert nearby people who may be able to help [6].
Sonification has also been used as an alternate modality to learn about bodily movements and functions. Danna and Velay [7] proposed the use of sonifications of hand movements made while writing to help researchers understand the motor control needed to perform that task, which may help patients with disabilities. Likewise, Turchet [8] proposed the use of interactive sonifications to help during the therapy of patients with limited bodily movement and control. Shoes enabled with IoT sensors can collect data and be monitored remotely to give patients feedback on their gait and body movements. The authors argued that the use of sonification in therapy can help patients with motor disabilities to walk better. Researchers have also suggested sonification of electroencephalogram (EEG) data as part of brain-computer interfaces and to help understand the brain's response to auditory stimuli as a supplement to brain imaging [9].
Sonification can also be useful in promoting overall wellbeing in IoT systems through music. Quasim et al. [10] proposed an emotion-based music recommendation and classification framework (EMRCF) to recommend songs to individuals based on their mood and previous listening history. The authors proposed analyzing facial features to predict a person's mood, from which the system would recommend songs pre-sorted into one of six categories: joyful, inspired, enthusiastic, emotional, silent, and depressed.
Timoney et al. [11] presented a summary of research in the area of IoT and music known as the Internet of Musical Things (IoMusT). The authors also proposed a framework for utilizing IoT sensors and machine learning algorithms to help patients create music that helps during therapy. The authors contended that such a framework could also enable remote therapy from the comfort of the patient's home.
In addition to these medical applications, sonification can be used in the IoT for safety-critical applications. For example, a smart helmet can detect harmful gases in the environment during mining operations [12] or detect gas leakage in a home [13,14]. The use of sonification has also been proposed to alert users when someone is detected via thermal imagery or other IoT sensors at critical border crossings, which can help counteract illegal border crossings [15]. Sonification in combination with IoT sensors can also prove essential in devising safety equipment for blind people. Saquib et al. [16] proposed a smart IoT device called "BlinDar", which uses ultrasonic sensors and the global positioning system (GPS) to ease navigation for blind people. GPS can also allow blind users to share their location with others in real time.
Sonification can also be used in combination with IoT sensors in smart city applications, such as waste collection and monitoring [17,18]. Such systems can enable efficient waste management by monitoring waste levels and can direct personnel to collect trash in high-traffic areas.

Research Problem: Modeling and Predicting Perceived and Induced Emotion
Emotions play an essential role in human behavior, and music and emotions have been studied for many years. The American Psychological Association [19] defines emotion as "a complex reaction pattern, involving experiential, behavioral, and physiological elements, by which an individual attempts to deal with a personally significant matter or event". "Affective computing" is a multidisciplinary field comprising computer science, cognitive science, and psychology [20]. Using AI, affective computing can enable robots and computers to understand and respond to humans on a much deeper level. This intersection of AI and affective computing, also called "artificial emotional intelligence", aids the development of tools for recognizing affective states and expressing emotions [21].
Affective computing enables emotion recognition in various types of multimedia, such as text, pictures, audio, and video, to create and improve user-friendly interfaces capable of parsing human emotions. Affective datasets contain lists of human annotations concerning the emotions recognized in the stimuli, which are then used to train machine learning models.
Humans can also experience emotions from music, speech, and audio files. Induced emotion refers to emotions that involve introspective perception of psychophysiological changes, whereas perceived emotion refers to listeners recognizing the emotions expressed by the external environment [22]. It is important to distinguish between induced and perceived emotions because a stimulus may invoke a different emotional response than the one it represents. For example, listening to a cheerful song may not necessarily induce a happy emotion in the listener, despite the listener correctly perceiving the song to be a happy one.
Audio emotion recognition (AER), a subfield of emotion recognition, involves emotion recognition from music, speech, and sound events. In particular, the music industry has extensively studied the effects of soundtracks on individuals' emotions. Conventionally, emotion recognition models can be categorical or dimensional. Categorical models consider emotions with discrete labels (such as happiness, sadness, anger, fear, surprise, and disgust [23]), whereas dimensional models characterize emotions along one or more dimensions (such as arousal and valence [24]). The Geneva Emotional Music Scales (GEMS) [25] model has been widely used for measuring emotions induced by music, and the arousal-valence dimensional model has been used in studies of perceived and induced emotions [26][27][28].
To our knowledge, there is no comprehensive study of the performance of the prediction of perceived and induced emotions from acoustic features. In this paper, we explore emotion recognition using two datasets, IADSE [29] and Emosoundscape [30], which each represent emotions in a two-dimensional space (i.e., arousal and valence). Further, we try to identify the significant acoustic features for arousal and valence, as well as for perceived and induced emotions. The IADSE is a set of sounds for which induced emotions have been measured. The Emosoundscape dataset is a set of sounds for which perceived emotions have been measured. Analysis and modeling of these two datasets enable us to investigate and find the best models for predicting perceived and induced emotions with high accuracy.

Research Questions
This article primarily addresses the following research questions:
RQ1. How well do machine learning models perform when predicting arousal and valence?
RQ2. How different are the models that are built for predicting perceived and induced emotions?
RQ3. What are the significant acoustic features for predicting arousal and valence?
RQ4. How do the significant features vary for predicting perceived and induced emotions?

Contributions of This Work
The purpose of this paper is to compare and contrast induced and perceived emotions from sounds with the help of various machine learning and deep learning models. We study these two types of emotions through features that characterize different aspects of emotions. More specifically, given a set of acoustic features of sounds, we would like to model emotional characteristics, such as "arousal" and "valence". To build such models, we use two datasets, IADSE [29] and EmoSoundscape [30], which are already tagged with arousal and valence. IADSE concerns induced emotions, and EmoSoundscape concerns perceived emotions. We believe that the results of this research can further our understanding of emotions and, thus, help improve current IoT systems by reducing cognitive load. The key contributions of this paper are as follows:
- We present a small-scale survey of the literature related to emotion recognition, along with the features and datasets used.
- We build machine learning models to predict perceived and induced emotions.
- We compare and contrast the features used to build the best prediction models for different emotional dimensions (i.e., arousal, valence, and dominance).
- We report the significant acoustic features identified when building the best prediction models for both perceived and induced emotions.
Our results show that the machine learning models built for predicting perceived (i.e., intended) emotions are more accurate than the models built for estimating induced (i.e., felt) emotions. We also report that the accuracy of the models can be improved through acoustic feature selection and engineering, as well as through hyper-parameter tuning. Regarding the latter, machine learning techniques based on ensemble learning (e.g., Random Forests) outperform the other machine and deep learning algorithms that we evaluated.
This paper is organized as follows: Section 2 reviews the literature. The methodology and materials of the study are presented in Sections 3 and 4, respectively. Section 5 presents the results and analysis. Section 6 concludes the paper and highlights future research directions.

Related Work
The state of the art of machine learning techniques in automatic audio emotion recognition depends on the characteristics of the input, the output, and the problem domain (the types of techniques and research questions). On the input side, acoustic sounds, such as music, natural sounds, and non-speech sounds, can both elicit and convey emotions. Research concerning emotion induction has received comparatively less attention than emotion perception [22,31,32]. Perceived emotion is the emotion that the sound stimulus is intended to convey, whereas induced emotion is the emotion felt by the listener after introspection and processing of the sound [22,30]. Thus, perceived and induced emotions may not be the same. Table 1 shows a summary of music and audio emotion recognition in the literature.
Machine learning algorithms that perform audio emotion recognition require appropriate features to recognize emotions. Speech emotion recognition using Hidden Markov models (HMMs), Gaussian mixture models (GMMs), and support vector machines (SVMs) has categorized speech acoustic features with a high degree of accuracy [33][34][35]. Table 2 lists the features used for emotion recognition in the literature.
Automatic emotion recognition in music has been a topic of interest for many researchers. The aim is to easily categorize music with similar emotions without labor-intensive human annotation. Music emotion recognition research has been conducted using regression, classification, and deep learning models.

Music Emotion Recognition
Yang et al. [36] used regression analysis to predict arousal and valence ratings for 195 music samples composed of popular songs from English, Chinese, and Japanese albums. The authors reported R² values of 58.3% for arousal and 28.1% for valence using an SVM with 114 acoustic features, including loudness and sharpness.
Yang and Chen [37] carried out an experiment to recognize emotions in music signals so that similar music could be retrieved and classified. The authors developed a custom ranking algorithm, RBF-Listnet, to optimize the retrieval of similar music samples based on the underlying emotion. The authors argued that automated retrieval reduced the human annotation effort needed to fetch similar music samples. The authors reported a gamma statistic of 0.326 for valence recognition.
Eerola et al. [38] proposed a model for predicting perceived emotions in a music dataset called Soundtrack110 that contained 110 samples. The authors used a set of 29 features extracted using MIRToolbox to predict arousal and valence ratings. The authors reported an explained variance of 58% to 85% using linear regression models. The authors also reported R² statistics for the prediction of various categorical emotions (angry, scary, happy, sad, and tender).
Seo and Huh [39] used machine learning and deep neural networks to recognize induced emotions, with the ultimate goal being to classify similar music samples. The authors used 100 music samples from Korean pop music. The authors reported a best match rating of 73.96% via an SVM, which was slightly greater than that of the deep neural network, i.e., 72.90%.
Liu et al. [40] classified the emotions in music samples by using their spectrograms as features in a deep learning model. Spectrograms contain both time and frequency information, and the authors used them to classify similar music samples using convolutional neural networks (CNNs). The authors used a publicly available dataset called 1000-Song [41] to test the proposed model. The authors reported an average accuracy of 72.4% using the CNN model.
Fan et al. [42] proposed the use of a ranking algorithm called smoothed RankSVM (SRSVM) for ranking music with similar emotions. The authors created a corpus of 100 music clips from different musical genres. The authors utilized 56 features generated via the MIRToolbox and reported gamma statistics of 0.801 and 0.795 for arousal and valence, respectively.

Sound Emotion Recognition
In addition to music, researchers have investigated emotion recognition for other sound stimuli, namely non-speech audio samples, also called sound events or soundscapes. Schafer [43] categorized soundscapes into six categories (natural sounds, human sounds, sounds and society, mechanical sounds, quiet and silence, and sounds as indicators). The categories are based on the origin of the sound source and the context in which the sound is heard [30]. As in music emotion recognition, machine learning algorithms require labels to establish the ground truth for training. Audio emotion recognition thus combines human annotation and machine learning to recognize emotions.
Schuller et al. [44] compared human annotations of emotions to those of regression with a sound dataset that contained 390 audio samples of different sounds, such as nature, animal, and musical instrument sounds. The authors reported correlation values of 0.61 for arousal and 0.49 for valence between regression and human annotations.
Drossos et al. [45] investigated the use of rhythmic sound features for arousal prediction. The authors utilized 26 rhythm features, which were derived by applying the MIRToolbox to the IADS dataset [46]. They reported a highest accuracy of 88.37% in arousal recognition. Furthermore, the fluctuation feature was found to be the best individual feature for predicting arousal values.
Fan et al. [30] created a dataset called EmoSoundscape, which contains 1213 six-second-long sounds, for soundscape emotion recognition. The authors compared the results of emotion ratings from 1182 human annotators against regression. The authors used 39 features extracted using MIRToolbox, as well as YAAFE [47]. The authors reported results under two protocols: A and B. Protocol A involved shuffling the sound database 10 times and then selecting sounds for training and testing (80% and 20%, respectively). Protocol B used the leave-one-out method, wherein one sound at a time was held out for testing, and the remaining sounds were used for training in each iteration. For Protocol A, the R² and MSE were 0.853 and 0.049 for arousal and 0.623 and 0.128 for valence, respectively. For Protocol B, the R² and MSE were 0.855 and 0.048 for arousal and 0.629 and 0.124 for valence, respectively.
Sundaram and Schleicher [48] developed an audio-based retrieval system to retrieve similar sounds by querying the system. The authors selected sounds from the BBC sound effects library (http://bbcsfx.acropolis.org.uk (accessed on 1 July 2021)) and the IADS dataset to build the system. The authors also collected human annotations for these sounds to compare them against the emotional ratings of the sounds retrieved by the system. For each query, the system retrieved the top five similar sounds by using MFCC features with similar features in the latent space. The average RMSE between the queried and retrieved sounds was found to be between 1.2 and 2.6.
Researchers have also used neural networks and deep neural networks for predicting emotions in sounds. Fan et al. [49] evaluated the use of deep learning models, such as CNNs and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), for sound emotion recognition using the EmoSoundscapes [30] dataset. The authors compared the performance of five deep learning architectures in predicting arousal and valence ratings. The authors used two sets of techniques to extract features. The first method used a pretrained deep neural network created by Hershey et al. [50], whereas the second method involved 54 features extracted using MIRToolbox and YAAFE. The best performance for arousal was achieved by the CNN, with an R² of 0.832 and an MSE of 0.035, whereas the best performance for valence was achieved by VGGish (a deep CNN model), with an R² of 0.759 and an MSE of 0.078. The authors also investigated arousal and valence prediction for various sounds using Schafer's categories.
Ntalampiras and Potamitis [51] used a deep learning model called the echo state network to study the similarities between music and sound datasets in eliciting emotions. The authors used three feature sets (Mel-spectrum (MFCC), temporal modulation, and Perceptual Wavelet Packets (PWP)), each extracted from the IADS and 10,000-song datasets. The authors first trained the network on the music dataset and then applied the trained network to sounds in the IADS dataset to determine whether the arousal-valence prediction would improve. The best performance was achieved using the temporal modulation features, with a mean square error of 3.13 for arousal and 3.10 for valence when using GMM clustering as the regressor.
Ntalampiras [52] compared emotion prediction using two CNNs that were designed to individually predict arousal and valence. The authors used the EmoSoundscapes data. They extracted features by employing a sample window of the audio files and then applying Fourier transformation to yield 23 features that were similar to the MFCC obtained from MIRToolbox. The authors reported an MSE value of 0.0168 for arousal and 0.0107 for valence. The authors also predicted arousal-valence ratings for sound categories as per Schafer's taxonomy.
Cunningham et al. [53,54] used shallow neural networks and regression to predict emotion using the IADS dataset. The authors employed 76 MFCC features extracted using the MIRToolbox. Using regression, the authors reported an RMSE of 0.989 and an R² of 0.28 for arousal, as well as an RMSE of 1.645 and an R² of 0.12 for valence. Using a neural network, they achieved an RMSE of 0.987 and an R² of 0.345 for arousal, and an RMSE of 0.514 and an R² of 0.269 for valence.
Researchers have also studied the effect of sound manipulation on arousal-valence emotion prediction. Drossos et al. [55,56] created a sound dataset called BEADS, which contains binaural sound clips. The dataset is publicly available and consists of 32 sounds annotated with emotion labels. These sounds were adjusted across five spatial positions (0, 45, 90, 135, and 180 degrees). The authors also reported a comparison of BEADS with the IADS dataset, observing maximum differences of 2.47 and 2.07 for arousal and valence, respectively, between IADS and BEADS. Additionally, sounds at a 0-degree spatial angle elicited a higher arousal rating and a lower valence rating than those at other angles.
Asutay et al. [57] conducted an experimental study to understand whether distorting a sound to reduce its identifiability caused any changes in the perceived emotions. Three different studies with participants were undertaken. Participants in each study rated both the distorted and original sound recordings from the IADS dataset using the Self-Assessment Manikin (SAM) scale; the recordings were introduced either one after the other or in a random order [58]. The third group (i.e., the control group) was presented with the original sounds and their textual descriptions before being asked to rate them. The authors contended that the processed sounds were emotionally neutral, but the participants were still able to identify them with the help of priming. Thus, the authors argued that sound designers should focus not just on the physical properties of sounds, but also on psycho-acoustical features in order to evoke the desired emotions.

Datasets and Psychoacoustic Features
To conduct our experiment, we utilized two datasets: IADSE [29] and EmoSoundscape [30]. These datasets contain sound samples with their annotated emotions. EmoSoundscape contains ratings for perceived emotions and uses a two-dimensional space (arousal/valence). IADSE contains ratings for induced emotions and uses a three-dimensional space (arousal/valence/dominance). Therefore, we chose these two datasets to compare induced and perceived emotion predictions.
To extract the features from these datasets, we used the MIRToolbox [59], which extracts (psycho)acoustic and musically related features from databases of audio files for statistical analysis [59]. Following Lange and Frieler [60], a total of 68 features were extracted from each stimulus, representing either the arithmetic mean or the sample standard deviation of the frame-based features computed over default window sizes (typically 50 ms for low-level features and 2-3 s for medium-level features) with a 50% overlap. The selected features represent the following families:
- Dynamics: intensity of the signal, such as the root mean square (RMS) of the amplitude;
- Rhythm: articulation, density, and temporal periodicity of events, such as the number of events per second (event density);
- Timbre/Spectrum: brightness, noisiness, dissonance, and shape of the frequency spectrum, such as the spectral center of mass (centroid);
- Pitch: presence of harmonic sounds, such as the proportion of frequencies that are not multiples of the fundamental frequency (inharmonicity);
- Tonality: presence of harmonic sounds that collectively imply a major or minor key, such as the strength of a tonal center (key clarity).
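As a rough illustration of this frame-based extraction scheme, the sketch below computes two of the feature families (dynamics via RMS and timbre via the spectral centroid) over 50 ms windows with 50% overlap and summarizes them by mean and sample standard deviation. Note that the study itself uses the MATLAB MIRToolbox; this numpy function is a hypothetical analogue, not the toolbox's implementation.

```python
import numpy as np

def frame_features(signal, sr, win_s=0.050, overlap=0.5):
    """Frame-based RMS (dynamics) and spectral centroid (timbre),
    summarized by their mean and sample standard deviation."""
    win = int(win_s * sr)
    hop = int(win * (1 - overlap))
    hann = np.hanning(win)                         # reduce spectral leakage
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    rms, centroid = [], []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        rms.append(np.sqrt(np.mean(frame ** 2)))
        spec = np.abs(np.fft.rfft(frame * hann))
        centroid.append(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
    rms, centroid = np.asarray(rms), np.asarray(centroid)
    return {"rms_mean": rms.mean(), "rms_std": rms.std(ddof=1),
            "centroid_mean": centroid.mean(), "centroid_std": centroid.std(ddof=1)}

# Example: a 1 s, 440 Hz tone sampled at 22,050 Hz
sr = 22050
t = np.arange(sr) / sr
feats = frame_features(np.sin(2 * np.pi * 440 * t), sr)
```

For a pure 440 Hz tone, this yields an RMS mean near 0.707 and a spectral centroid near 440 Hz, as expected.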
Although most features represent relatively low-level acoustic or auditory attributes (e.g., RMS), some are based on perceptual models (e.g., roughness), and yet others are based on cognitive models that presume long-term exposure to the stimulus domain (e.g., key clarity). A summary of the features is shown in Table 3.

We computed the pairwise correlations for the 68 features in each dataset. The seven pairs of features shown in Table 4 are the features in the EmoSoundscape dataset with correlations greater than 90%. In addition, the four pairs of features shown in Table 5 are the features in the IADSE dataset with correlations greater than 90%.

EmoSoundscape Dataset: A Dataset for "Perceived" Emotion
We used the EmoSoundscape dataset [30], which consists of two subsets. The first subset contains 600 audio samples categorized into 6 groups of 100 samples each according to Schafer's soundscape taxonomy; these groups are natural sounds, human sounds, sounds and society, mechanical sounds, quiet and silence, and sounds as indicators. The second subset contains 613 samples; each is a mix of soundscapes from two or three of the first subset's classes. All of these soundscapes are annotated with their perceived emotions, including arousal and valence. The first subset of this dataset was used for our experiment.
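The pairwise correlation screening behind Tables 4 and 5 can be sketched with pandas. The three-column feature table below is a hypothetical stand-in for the 68 extracted features; the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Stand-in feature table; in the study this would hold the 68
# MIRToolbox features per sound (the column names are hypothetical)
df = pd.DataFrame({"rms_mean": rng.normal(size=n),
                   "event_density": rng.normal(size=n)})
df["loudness_mean"] = 0.95 * df["rms_mean"] + rng.normal(scale=0.1, size=n)

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b) for a in upper.index for b in upper.columns
         if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.9]
```

Here only the engineered rms_mean/loudness_mean pair exceeds the 90% threshold and would be flagged as redundant.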
The EmoSoundscape dataset contains arousal and valence ratings for each sound sample; a scatter plot of valence versus arousal for the dataset is shown in Figure 1a. Note that both variables are z-normalized.

IADSE Dataset: A Dataset for "Induced" Emotion
We also used the IADSE dataset, which contains 935 sounds, with each sound rated by at least 100 listeners on 9-point Likert scales for the dimensions of felt arousal, valence, and dominance (induced emotions). In addition, a scatter plot of the data points in the IADSE dataset is shown in Figure 1b.

Evaluation Metrics for Analysis
To measure the performance of the regression models, several common metrics can be utilized, including the mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), R-squared (R²), median absolute error, max error, and explained variance. These metrics are widely used in the machine learning literature for evaluating regression models. RMSE, MSE, and R² were chosen to evaluate the performance of each regression model here. It is important to note that the prime objective of this study was to compare the performance of various machine learning models in predicting emotion dimensions (i.e., valence, arousal, and dominance), not to conduct controlled experiments and perform statistical significance tests.
The mean squared error (MSE) is the average of the squared errors; larger values indicate larger prediction errors.
The root mean square error (RMSE) can be considered as the standard deviation of the prediction errors. Because it applies a high penalty for large errors, it is beneficial when large errors are undesirable.
Finally, R² is one minus the ratio of the residual sum of squares of the prediction errors to the total sum of squares around the mean; the closer the value of R² is to 1, the better the regression model fits.
It should be mentioned that R² is a less commonly used metric for assessing non-linear models [61].
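As a concrete check of these definitions, the three chosen metrics can be computed with scikit-learn; the ratings and predictions below are hypothetical values, not data from the study.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.2, -0.5, 0.9, 0.1, -0.3])   # hypothetical valence ratings
y_pred = np.array([0.1, -0.4, 0.7, 0.2, -0.2])   # hypothetical model outputs

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
```

For these values, MSE is 0.016, RMSE is about 0.126, and R² is about 0.93.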

Methodology
In our previous work with the IADS and EmoSoundscape datasets [62], we reported that Random Forest outperformed other models in arousal/valence prediction using a 1D (psycho)acoustic feature set, while other models mostly suffered from overfitting. This result is somewhat expected because ensemble models, which combine the prediction results of several base models, reduce the risk of overfitting. In addition, among ensemble models, Random Forest is preferable for overfitting problems [63]: unlike some other ensemble methods, adding more trees to a Random Forest model does not increase the risk of overfitting. Therefore, we chose Random Forest as one of the prediction models for these datasets in this article. Random Forest (RF) is an ensemble method that averages the prediction results of several decision trees. To compare the prediction results of the ensemble model (RF) with deep models, we developed a multilayer perceptron model and a 1D convolutional neural network model. For all of the models, we used 30% of the data as the test data and also applied 5-fold cross validation (CV). To compute the training and testing errors, we averaged the RMSE values over these 5 folds.
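A minimal sketch of this evaluation protocol, assuming scikit-learn and a synthetic stand-in for the 68-feature tables (the actual study fits the IADSE and EmoSoundscape data instead):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 68 psychoacoustic features and an arousal target
X, y = make_regression(n_samples=400, n_features=68, noise=10.0, random_state=0)

# 70/30 train/test split, with 5-fold CV on the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X_tr, y_tr, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()   # RMSE averaged over the 5 folds, as in the text
```

The held-out 30% is reserved for the final test error, while the averaged fold RMSE tracks the training-time performance.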

Feature Selection
Feature selection techniques can be divided into two main types. Filter methods are usually applied as a preprocessing step, using the underlying properties of the features measured with various univariate statistics. Wrapper methods instead use an estimator to perform feature selection, choosing features based on the performance of a model. Filter-based methods are faster, whereas wrapper methods are more computationally expensive.
In our previous work [62], we used a filter-based method called the "univariate linear regression test" (KBest) for selecting the k best features. In contrast, in this work, we used a wrapper method called Recursive Feature Elimination (RFE) [64]. Using RFE, we applied Random Forest as the estimator to be fitted to the datasets. The features were then ranked based on their weights, and the feature with the lowest weight was removed. This process was repeated until the desired number of features remained. Because we did not have prior knowledge of the best number of features, we treated it as a hyper-parameter ranging from 1 to 68 during the tuning phase, along with selecting the best parameters for each Random Forest regressor.
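This wrapper procedure can be sketched with scikit-learn's RFE and a Random Forest estimator. The data below are synthetic and smaller than the real feature tables (the study runs RFE over the 68 MIRToolbox features and searches k from 1 to 68):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical stand-in data with a known number of informative features
X, y = make_regression(n_samples=200, n_features=30, n_informative=10,
                       random_state=0)

# Fit the RF estimator, rank features by importance, drop the weakest,
# and repeat until the requested number of features remains
selector = RFE(RandomForestRegressor(n_estimators=30, random_state=0),
               n_features_to_select=10, step=1)
selector.fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
```

The surviving column indices in `kept` can then feed the downstream regression models.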

Hyper-Parameter Tuning
Hyper-parameter tuning is the process of selecting the best parameters for a model to obtain optimal results. Grid search is a technique for finding the optimal parameters of a model by examining all combinations of the candidate values. We performed an exhaustive search for hyper-parameter tuning on 90 parameters overall, including 22 Random Forest parameters and 68 RFE parameters, in order to find the optimal values. The following parameters were tuned:
- n_estimators: (50, 100, 150, 200, 250, 300), the number of trees in the forest;
- max_depth: (5, 10, 20, 30, 50), the maximum number of levels in each decision tree;
- min_samples_split: (2, 3, 4, 5, 6, 7), the minimum number of data points placed in a node before the node is split;
- min_samples_leaf: (1, 2, 3, 5), the minimum number of data points allowed in a leaf node;
- k: range(1, 68), the number of features selected using RFE with the RF estimator.
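A reduced illustration of this grid search with scikit-learn's GridSearchCV, using synthetic data and a smaller grid than the one listed above (the study additionally searches the RFE feature count k):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data for illustration only
X, y = make_regression(n_samples=150, n_features=20, random_state=0)

# A trimmed version of the Random Forest grid above
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)   # evaluates every combination with 5-fold CV
best = search.best_params_
```

Every combination in the grid is fitted and scored with cross validation, and the best-scoring parameter set is retained for the final model.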

Results and Analysis
This section reports the performance and the results of the analysis by comparing predictions for both perceived and induced emotions.

Performance of Prediction Models
The arousal, valence, and dominance predictions on the IADSE dataset and the arousal and valence predictions on the EmoSoundscape dataset using the tuned Random Forest models are shown in Table 6. The best arousal prediction for the EmoSoundscape dataset was achieved with 15 features. The best evaluation metrics were 0.24, 0.05, and 0.86 for the test RMSE, MSE, and R², respectively. On the other hand, for the IADSE dataset, the best arousal prediction was achieved with 25 features. The best evaluation metrics were 0.78, 0.61, and 0.56 for the RMSE, MSE, and R², respectively.
Regarding valence, the best prediction on the EmoSoundscape dataset was achieved using 14 features. The best evaluation metrics were 0.37, 0.14, and 0.59 for the test RMSE, MSE, and R², respectively. On the other hand, for the IADSE dataset, the best valence prediction was achieved with nine features. The best evaluation metrics were 1.16, 1.34, and 0.37 for the test RMSE, MSE, and R², respectively. For the dominance prediction, we achieved 0.83, 0.70, and 0.26 for the test RMSE, MSE, and R², respectively, using seven features.
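The three evaluation metrics used throughout this section (MSE, RMSE, and R²) can be computed as follows; the vectors here are illustrative, not taken from either dataset.

```python
# Computing the evaluation metrics reported in Table 6 on toy data.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # ground-truth ratings (illustrative)
y_pred = np.array([1.1, 1.9, 3.2, 3.8])   # model predictions (illustrative)

mse = mean_squared_error(y_true, y_pred)  # mean squared error
rmse = np.sqrt(mse)                       # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)             # coefficient of determination

print(mse, rmse, r2)  # → 0.025 0.1581... 0.98
```

Lower RMSE/MSE and higher R² indicate a better fit, which is how the models above are ranked.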
To compare the performance of the RF model as an ensemble method with deep models, two deep neural networks were developed to predict perceived and induced emotions. Specifically, a four-layer perceptron and a one-layer convolutional neural network followed by a four-layer perceptron were utilized. These models fit the data using all features and using the selected features identified by the exhaustive search. Table 7 shows the emotion predictions using these deep models. Although the performance of the deep models was close to that of the tuned RF, the RF achieved better emotion predictions for both datasets; the best test errors using the tuned RF are shown in bold in Table 7. In addition, in most cases, the performance of the deep models using the selected features was better than their performance using all features, which indicates the effectiveness of our selected features in predicting emotions in each dataset.
To compare the performance of our work with that of similar works in the literature, we found a few papers reporting arousal and valence prediction results on the EmoSoundscape dataset. Fan et al. [42] reported MSE values of 0.049 and 0.128 for predicting arousal and valence, respectively. Converted into RMSE, their results were 0.22 and 0.36 for arousal and valence prediction using a Support Vector Regressor (SVR), which is close to our results obtained using the tuned Random Forest. They further improved their results by augmenting the dataset and applying a tuned convolutional neural network (CNN); since that work was performed on the augmented dataset, those results are not directly comparable with ours.
Part of the work performed by Ntalampiras [52] was on the EmoSoundscape dataset, and they used CNN models. The MSE values reported for arousal and valence prediction were around 0.049 and 0.11, respectively, which are equivalent to 0.22 and 0.33 for RMSE. These results are close to other reported performances on the EmoSoundscape dataset, with a slight improvement in valence prediction.
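Since RMSE is simply the square root of MSE, the conversions used in the comparisons above can be checked directly:

```python
# Verifying the MSE-to-RMSE conversions cited in the text
# (Fan et al. [42] and Ntalampiras [52]).
import math

conversions = [(0.049, 0.22), (0.128, 0.36), (0.11, 0.33)]
for mse, rmse in conversions:
    assert round(math.sqrt(mse), 2) == rmse
print("all conversions check out")
```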
Ntalampiras [52] also applied a CNN to the data collected from both subsets of the EmoSoundscape dataset and achieved better performance for both arousal and valence. It must be noted that the second subset of EmoSoundscape contains sound events that are a mix of soundscapes from two or three of the sound events in the first subset. We could not identify any work on the IADSE dataset with which to compare our results.
With respect to research question RQ1 and according to Table 7, we observe that the selected machine learning prediction models perform similarly, except that the model built on the optimized Random Forests outperforms the other models in reducing the RMSE values for both arousal and valence. Regarding RQ2, Table 7 indicates that the models built for predicting perceived emotions (i.e., EmoSoundscape) are associated with lower RMSE values than the models predicting induced emotions (i.e., IADSE). The average RMSE values for both the training and testing datasets computed for the induced emotion dimensions (i.e., arousal and valence) are substantially higher than those computed for the perceived emotion dimensions. This observation implies that modeling induced emotion is harder than building models for predicting perceived emotion.

Significant Features
The significant features used in the tuned Random Forest models for each emotion prediction in both datasets are listed in Tables 8 and 9. In addition, Table 10 provides better insight into the common features among different emotion predictions and among/within the IADSE and EmoSoundscape datasets. In Table 10, significant features for emotion prediction using the tuned Random Forest are indicated by '*' in the IADSE dataset and by '+' in the EmoSoundscape dataset. Furthermore, if any of these significant features have highly correlated features in their peer datasets, these correlated features are marked as (*) and (+) for the IADSE and EmoSoundscape datasets, respectively. Considering the induced emotion predictions for the IADSE dataset, 25 features were considered significant for predicting induced arousal, whereas nine features were considered for induced valence. There were five common features for predicting induced arousal and valence: (1) dynamics rms (std), (2) tonal keyclarity (mean), (3) spectral roughness (mean), (4) spectral brightness (mean), and (5) spectral spread (mean). We can also consider spectral rolloff95 (mean) as the sixth common feature because it had a high correlation with spectral spread (mean) in the IADSE dataset. Furthermore, for the dominance prediction, seven significant features were identified, and three of these features were also indicated as significant features for the induced arousal and valence predictions.
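The significant-feature tables are derived from the fitted selector and forest. The sketch below, on synthetic data with hypothetical feature names standing in for the acoustic descriptors (e.g., "dynamics_rms_std", "spectral_roughness_mean"), shows one plausible way to extract and rank the retained features.

```python
# Sketch: extracting the features retained by RFE and ranking them by
# Random Forest importance, mirroring how significant-feature tables
# can be produced. Names and data are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

names = np.array([f"feat_{i}" for i in range(12)])  # hypothetical descriptors
X, y = make_regression(n_samples=150, n_features=12, noise=0.1, random_state=1)

selector = RFE(RandomForestRegressor(n_estimators=50, random_state=1),
               n_features_to_select=4).fit(X, y)

# Refit the forest on the retained features, then sort them by importance.
rf = RandomForestRegressor(n_estimators=50, random_state=1)
rf.fit(X[:, selector.support_], y)

kept = names[selector.support_]
order = np.argsort(rf.feature_importances_)[::-1]  # most important first
print(list(kept[order]))
```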
On the other hand, in predicting perceived emotions in the EmoSoundscape dataset, 15 features were considered significant for predicting perceived arousal, whereas 14 features were considered for perceived valence. There were five common features for predicting perceived arousal and valence: (1) dynamics rms (mean), (2) timbre lowenergy (std), (3) timbre spectralflux (mean), (4) spectral roughness (mean), and (5) spectral rolloff95 (mean). We can also treat spectral spread (mean) as the sixth common feature because it had a high correlation with spectral rolloff95 (mean) in the EmoSoundscape dataset.
In addition, comparing across datasets, nine features were considered significant for predicting induced valence (IADSE), whereas 14 features were considered for perceived valence (EmoSoundscape). There was only one common feature for predicting induced and perceived arousal, which was spectral roughness (mean). We can also treat spectral spread (mean) and spectral rolloff95 (mean) as the second and third common features because these two features demonstrated a high correlation with each other in both datasets, and each of them was selected as a significant feature for predicting induced and perceived valence.
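The "high correlation" used above to treat spectral spread (mean) and spectral rolloff95 (mean) as interchangeable can be checked with a Pearson correlation. The data below are synthetic and only illustrate the check, not the actual feature values.

```python
# Illustrative Pearson-correlation check between two strongly related
# features (standing in for spectral spread and spectral rolloff95).
import numpy as np

rng = np.random.default_rng(0)
spread = rng.normal(size=100)
rolloff = 0.9 * spread + rng.normal(scale=0.2, size=100)  # strongly related

r = np.corrcoef(spread, rolloff)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))
```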
With respect to RQ3 and according to Tables 8 and 9, we observe that the number of significant features for predicting arousal is greater than the number of significant features for predicting valence. This observation implies that predicting arousal-based emotions, such as excitement, is harder than predicting valence-based emotions, such as positiveness. Therefore, to model these two dimensions, different numbers and sets of features would be required. With respect to RQ4 and according to Table 8 for perceived emotions and Table 9 for induced emotions, we observe that the number of significant features for predicting arousal in induced emotion (i.e., the IADSE dataset in Table 9) is substantially greater than the number of significant features listed for arousal prediction in perceived emotion (i.e., the EmoSoundscape dataset in Table 8). In an analogous way, this observation indicates that modeling induced emotions is substantially harder than building models for predicting perceived emotions.

Conclusions and Future Work
It is important to monitor devices and their activities when connected to a network and to detect possible threats or events that occur in the system. In addition to conventional communication channels, such as textual descriptions and visualization, sonification is an effective technique for quickly directing users' attention to interconnected devices [65,66]. One of the major advantages of using sounds rather than textual descriptions and visualizations to alert users is that operators can listen to sounds and use visual displays at the same time without significantly increasing cognitive workload.
Although informative communication through sounds (i.e., sonification) is very promising, implementing sonification comes with its own challenges and problems. One of the key challenges is the design of proper and representative sounds for specific events. Several issues arise when designing a sound to represent an event: whether the sound should convey the semantics and meaning of the event, whether it should carry the spatial information needed to trace the event, and whether it should communicate the impact of the event to the user. Furthermore, the sounds selected for sonifying these events or semantics need to be tested for their usability in order to determine whether they convey the required information.
In addition to the above issues, the psychological impact of each sound is also a key issue when designing sonifications for a large and complex system, such as the IoT. More specifically, it is important to have a clear understanding of the impact of each sound on users. As such, it is important to understand perceived (i.e., expressed emotion) and induced (i.e., felt) emotions.
This paper investigates whether it is possible to build machine-learning-based models to predict perceived and induced emotion, where emotion is defined along three dimensions: (1) arousal, (2) valence, and (3) dominance. To perform the research and analysis, we utilized two datasets: one that concerns perceived emotions and another that concerns induced emotions. The EmoSoundscape dataset measures a user's perceived emotion, whereas the IADSE dataset quantifies a user's induced emotion. Our initial assumption was that it would be more difficult to model and predict induced emotion in comparison with perceived emotion.
Our findings confirm our assumption in that it is relatively more difficult to predict induced emotion than perceived emotion. As highlighted in Table 7, the RMSE values obtained for training and testing of models built for the IADSE (i.e., induced emotion) are greater than those calculated for the EmoSoundscape dataset (i.e., perceived emotion). We also observed that the models built for both induced and perceived emotion are of moderate accuracy, which indicates that identifying the optimal and best models for predicting these emotions is generally a difficult task.
The research reported in this paper needs further improvement and more comprehensive analysis. In particular, given the strong performance of ensemble learning approaches (more specifically, Random Forests), other ensemble-learning-based approaches should be explored with the intention of optimizing the best models. We also need to conduct additional research on perceived and induced emotion in specific contexts, such as security and monitoring of the IoT, and to determine whether certain emotions cause the operator of a system to react in a consistent and predictable way. More precisely, it is important to understand how perceived and induced emotions trigger the actions we expect when certain events occur in the IoT.