1. Introduction
Recent advances at the intersection of Artificial Intelligence (AI) and the Internet of Things (IoT) have enabled the development of intelligent systems capable of adapting to human behavior and contextual conditions. Within this paradigm, Artificial Emotional Intelligence (AEI) has emerged as a promising approach for recognizing and interpreting human affective states through multimodal signals, including speech, facial expressions, and physiological data. Such capabilities are increasingly relevant in application domains such as healthcare, education, smart environments, and human–computer interaction, where responsive and context-aware systems are essential. In this context, authenticity refers to the consistency and reliability of emotional inference results under deployment-oriented operational constraints. In other words, authenticity represents the model’s ability to produce stable and credible emotional predictions in practical deployment scenarios rather than solely maximizing classification accuracy. It is important to note that the deployment-aware perspective adopted in this study refers to the evaluation of model architectures under representative resource-constrained conditions rather than direct physical deployment on edge hardware platforms. The objective is to provide a controlled comparative framework for analyzing model behavior in edge-oriented scenarios [
1,
2].
Among these modalities, speech-based emotion recognition (SER) plays a central role due to its rich expressive content and ease of acquisition in real-world environments. In recent years, deep learning-based SER models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated strong performance in controlled experimental settings. However, a significant limitation of existing studies is their reliance on centralized cloud infrastructures, which introduces latency, increases communication overhead, and raises privacy concerns. Unlike many existing speech emotion recognition studies that primarily focus on improving model architectures, the present work emphasizes a deployment-aware evaluation perspective. The proposed approach aims to examine the feasibility of commonly used deep learning models under IoT edge constraints by considering both classification performance and practical deployment conditions [
3].
Edge computing has emerged as a viable paradigm to address these limitations by enabling data processing closer to the data source. Deploying SER models at the IoT edge allows for low-latency inference, reduced bandwidth usage, and enhanced privacy preservation. Despite these advantages, the feasibility and effectiveness of deep learning-based emotion recognition under edge constraints remain insufficiently explored. In particular, there is a lack of systematic, deployment-oriented evaluation frameworks that jointly consider model performance, computational efficiency, and real-time applicability [
4,
5].
Furthermore, many existing SER studies focus primarily on accuracy metrics while overlooking critical aspects such as reproducibility, standardized evaluation protocols, and robustness across different configurations. This gap limits the practical applicability of proposed approaches, especially in real-time IoT systems where resource constraints and reliability are essential considerations.
To address these challenges, this study proposes a deployment-aware evaluation framework for speech emotion recognition in IoT edge environments. Rather than introducing a new model architecture, the work focuses on systematically evaluating widely used deep learning models—CNN, LSTM, and dense neural networks—under a unified and reproducible experimental setup. The evaluation incorporates standardized training protocols, a consistent train–test evaluation protocol, and performance comparison across multiple metrics.
The main contributions of this study can be summarized as follows:
A standardized and reproducible experimental protocol based on a fixed train–test evaluation strategy.
A comparative analysis of CNN, LSTM, and Dense architectures in terms of both accuracy and computational feasibility.
An assessment of real-time applicability through latency-aware evaluation in edge computing scenarios.
The primary contribution of this study is the introduction of a deployment-aware evaluation framework that analyzes the feasibility of widely used deep learning architectures within IoT edge environments. This perspective helps bridge the gap between laboratory-level performance evaluation and practical deployment requirements.
The findings of this study provide practical insights into selecting appropriate model architectures for real-time emotion recognition systems and contribute to bridging the gap between theoretical SER performance and real-world edge deployment requirements. Unlike prior studies, this work explicitly prioritizes deployment feasibility and evaluation standardization over architectural novelty.
2. Related Works
Studies in the fields of IoT and emotion recognition reveal efforts to integrate interdisciplinary knowledge and explore the impact of technological advancements on understanding and representing human emotions. For example, a 2022 study introduced the concept of the Emotional Knowledge Graph (EmoKG), developed using technologies such as Artificial Intelligence, Knowledge Graphs, and Semantic Reasoning, to organize emotional information from various domains. This approach demonstrates how IoT technologies can be utilized in collecting physiological data and how EmoKG can be applied within a naturopathy recommendation system to suggest foods that could enhance emotional well-being [
6]. Chen and colleagues discussed the applications of Big Data processing technologies in IoT environments. Their study provides a comprehensive review of processing, storing, and analyzing data from IoT devices. It also focuses on real-time data processing, data accuracy, and reliability, all of which play a critical role in developing artificial emotional intelligence systems [
7].
Although originally applied in material science, similar machine learning evaluation strategies have also been adopted in other domains. Wang et al. reviewed recent advances in the application of machine learning (ML) techniques to metal–organic framework (MOF) research. Their study summarizes widely used ML approaches, including regression, classification, clustering, deep learning, and reinforcement learning, and highlights the importance of data acquisition and preprocessing for improving model performance. In addition, the authors discuss model evaluation metrics and methods aimed at enhancing model interpretability. The review also demonstrates how ML-driven approaches contribute to MOF research in several areas, including material design, high-throughput screening, structure–property relationship analysis, and performance prediction [
8]. Similarly, Al-Fuqaha and colleagues examined the core technologies, protocols, and applications of IoT, exploring the data collection and processing capabilities of IoT devices. Their study emphasized methods that improve data accuracy and real-time processing, providing an essential infrastructure for testing artificial emotional intelligence systems [
9].
Rashid and Chaturvedi discussed theories of artificial emotional intelligence (AEI) and its applications in human–computer interaction. They also examined the technological solutions required to enhance AEI accuracy and to process emotional data in real time, focusing on the role of IoT devices in AEI’s real-time data collection and analysis [
10]. Another study analyzed the features and classification methods of speech emotion recognition (SER) systems, which can be integrated with IoT devices to detect users’ real-time emotional states and respond accordingly [
11]. Moreover, recent research introduced capsule-based architectures and intelligent fusion frameworks that enhance robustness and cross-domain generalization in SER tasks, outperforming traditional CNN and LSTM approaches (e.g., sparse temporal-aware capsule network, intelligent fusion network). Other studies have also investigated time-series and biometric signal modeling for behavior recognition tasks, such as vaping detection from wearable heart-rate data, demonstrating the generalizability of deep learning across modalities [
12]. Recent studies have increasingly investigated the feasibility of deploying speech emotion recognition models in resource-constrained environments, highlighting the need to balance classification accuracy with computational efficiency and latency constraints [
13]. Another study aimed to present a reliable and efficient speech-based emotion recognition layer capable of operating in real time. Paralinguistic characteristics extracted from speech signals were employed as input features for supervised learning approaches in emotion classification. A range of machine learning techniques, including Gaussian naïve Bayes, random forest, k-nearest neighbors, support vector machines, and multilayer perceptrons, were evaluated through comparative experiments. Among these methods, support vector machines and multilayer perceptrons exhibited the strongest classification performance, achieving accuracy levels of 77.8% and 79.6%, respectively [
14].
3. IoT and Emotional Intelligence
3.1. Internet of Things (IoT)
The Internet of Things (IoT) refers to interconnected physical objects—such as sensing devices and mobile systems—that exchange data and cooperate within their environment to achieve shared operational goals [
15]. It can also be described as a platform enabling seamless communication between sensors and digital devices in intelligent environments [
16]. In broader terms, IoT represents an integrated framework where multiple subsystems coordinate information exchange, communication, and decision-making among physical entities [
17]. The layered structure of IoT edge systems, illustrated in Figure 1, provides a practical context for this theoretical explanation. Continuous connectivity between devices is supported by wireless technologies such as Bluetooth, Wi-Fi, ZigBee, WSN, LPWAN, and cellular networks, allowing devices to interact with the physical world through computer-based systems and enabling remote monitoring and automation. It is projected that tens of billions of devices will be interconnected via the Internet using various wireless and radio-frequency technologies [
18]. The IoT ecosystem typically consists of interconnected devices, infrastructures, services, and applications organized into four functional layers.
The sensing layer includes smart sensors, RFID technologies, and IoT endpoint devices that collect data from the physical environment [
19]. The network layer provides connectivity and enables data transmission between devices through internet-based infrastructures [
20]. The service layer manages and delivers services to users or applications. Finally, the interface layer provides interaction between users and system services.
3.2. Emotional Intelligence
Emotional intelligence refers to cognitive and affective abilities that enable individuals to recognize and regulate their own emotions while understanding and responding appropriately to the emotions of others [
21]. It is generally conceptualized as a multidimensional construct including self-awareness, emotional regulation, motivation, empathy, and social interaction skills [
22]. Individuals with high emotional intelligence tend to communicate more effectively, manage conflicts successfully, and perform better in collaborative environments [
23]. In recent years, emotional intelligence has gained increasing importance in fields such as leadership, education, and healthcare, and its integration with emerging technologies—including Artificial Intelligence and IoT systems—has expanded significantly [
24,
25]. As illustrated in Figure 2, smart sensors in the IoT sensing layer collect biometric and behavioral data, which can be analyzed by machine learning models to detect users’ emotional states through signals such as facial expressions, speech, and activity patterns. These systems enable the creation of emotion datasets and support applications for monitoring and improving users’ emotional and psychological well-being.
3.3. Relationship Between Big Data and IoT
Big Data and the Internet of Things (IoT) are complementary technologies in modern digital ecosystems. IoT enables physical objects—such as sensors, devices, and machines—to connect to the Internet and continuously generate large volumes of data [
26]. These high-volume, high-velocity, and heterogeneous data streams form what is known as Big Data, including information related to user behavior, environmental conditions, and machine status [
27]. Big Data technologies process and analyze these datasets to extract meaningful insights, often using machine learning algorithms to develop predictive models that support data-driven decision-making [
28]. The integration of IoT and Big Data is widely applied in domains such as Industry 4.0 and smart cities, where IoT sensors collect large-scale data on traffic, air quality, and energy consumption to support efficient urban planning and sustainable management [
29]. This synergy enables organizations and governments to optimize operations, improve services, and develop innovative solutions through advanced data analytics [
30]. As illustrated in Figure 3, the IoT–Big Data relationship can be structured into three layers: sensor-based data sources; the Big Data domain, characterized by volume, velocity, and variety; and application layers employing analytics platforms such as MapReduce, Spark, Storm, Flink, and Kafka, where statistical and machine-learning methods are commonly applied [
31]. Due to the continuous growth of data volume and complexity, Big Data analytics requires high processing speed and efficient computational strategies to manage large-scale datasets effectively [
32].
4. Material and Methodology
4.1. Dataset
The Berlin Emotional Speech Database (EMODB) is a widely recognized dataset in the field of speech emotion recognition, consisting of audio recordings of ten professional German actors (five male and five female) who simulate seven distinct emotional states: anger, boredom, disgust, fear, happiness, sadness, and neutral. Each actor was asked to speak ten different German sentences, specifically designed to be phonetically rich and emotionally neutral, but delivered in various emotional tones.
The dataset comprises a total of 535 utterances, each labeled with the corresponding emotion. The recordings were digitized at a sampling rate of 16 kHz with 16-bit resolution, ensuring high-quality audio data suitable for acoustic analysis. The dataset has been annotated and validated through a subjective listening test, achieving a 79.6% agreement among human evaluators, which further supports its reliability for emotion recognition tasks. In this study, the MFCC filter complexity was set to 50 and the number of coefficients to 60 to balance computational efficiency with representational richness. These parameter choices affect feature granularity and model comparability; using the same values across all models ensures a fair evaluation while allowing differences in performance to emerge.
The EMODB dataset’s diverse emotional expressions and high-quality audio make it particularly valuable for research in speech emotion recognition. Its balanced representation of male and female voices, along with the inclusion of multiple emotions, provides a robust foundation for training and evaluating machine learning models aimed at classifying emotions in speech [
33,
34].
4.2. Methodology
The architecture used in this study is shown in Figure 4. The diagram presents the integration of edge computing and cloud computing for real-time emotion recognition. The architecture consists of two main components: the edge location and the cloud infrastructure. For practical deployment, representative edge devices such as Raspberry Pi 4 and Jetson Nano were considered in this study, as they are widely adopted in edge AI research.
All experiments were conducted in Python 3.11 using the TensorFlow deep learning framework (version 2.15.0). Training used speaker-independent splits to reduce bias, with a fixed 70/30 train–test split to maintain consistency across model evaluations. To ensure reproducibility, a fixed random seed was used during model initialization and training. The MFCC feature representation was configured with 60 coefficients to provide sufficient spectral resolution while maintaining computational feasibility for edge-oriented speech processing. The models were trained using the Adam optimizer with a learning rate of 0.001, a batch size of 32, and a maximum of 100 training epochs. This configuration was applied to all evaluated architectures to ensure a fair comparison.
To support the interpretation of the results, statistical significance analysis was performed using a standard threshold (p < 0.05). This analysis assesses whether the observed performance differences across models represent meaningful variations rather than random fluctuations under identical experimental conditions.
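The training protocol described above can be sketched as a small configuration plus a speaker-level split. This is only an illustration under stated assumptions: the seed value is not reported in the text, the function name `speaker_independent_split` is hypothetical, and assigning whole speakers to one side is one plausible way to combine a speaker-independent split with a 70/30 ratio (with ten speakers, seven for training and three for testing), not necessarily the procedure the study used.

```python
import random

# Hyperparameters as stated in the text; the seed value itself is an assumption.
CONFIG = {
    "seed": 42,               # placeholder, the paper only says "fixed"
    "train_fraction": 0.70,
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 32,
    "max_epochs": 100,
}


def speaker_independent_split(speaker_ids, train_fraction=0.70, seed=42):
    """Split utterance indices so that no speaker appears on both sides.

    speaker_ids[i] is the speaker of utterance i. Whole speakers are
    assigned to the training or test side, so the utterance-level ratio
    only approximates train_fraction when speakers differ in size.
    """
    rng = random.Random(seed)              # local RNG keeps the split repeatable
    speakers = sorted(set(speaker_ids))    # sort first so shuffling is deterministic
    rng.shuffle(speakers)
    n_train = round(len(speakers) * train_fraction)
    train_speakers = set(speakers[:n_train])
    train_idx = [i for i, s in enumerate(speaker_ids) if s in train_speakers]
    test_idx = [i for i, s in enumerate(speaker_ids) if s not in train_speakers]
    return train_idx, test_idx
```

Calling the function twice with the same seed returns the same indices, which is the property the fixed-seed protocol relies on.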
In the first step, data is collected by sensors at the edge location. These sensors, attached to the human body, gather various types of data, such as biometric data (e.g., heart rate, skin temperature) and voice data. This data is used to detect individuals’ emotional states. The collected data is processed in real-time by software components running in the edge runtime environment. The edge runtime provides local computing capacity, allowing data to be analyzed quickly and results to be obtained with low latency.
In the second step, the data processed at the edge location is transmitted to the cloud environment through a Device Gateway. The Device Gateway serves as a communication channel that allows the data to be published and accessed by other devices. At this stage, various security protocols are applied to ensure the secure transmission of data. The Device Gateway enables bidirectional data flow between edge devices and cloud resources, facilitating the transfer of data to the cloud and updates from the cloud to the edge devices.
In the third step, analysis platforms and other cloud resources in the cloud environment process and analyze the data received from the edge location. Cloud resources provide high computational power and extensive storage capacity, enabling the processing of large datasets. At this stage, the data to be analyzed in our study consists of audio recordings obtained from the EMODB. As previously mentioned, these data are analyzed using deep learning techniques such as CNN, LSTM, and Dense networks to perform emotion recognition.
In the fourth step, virtual representations of the edge devices, known as Shadow Devices, are stored and synchronized in the cloud environment. This allows the real-time status of the edge devices to be monitored and updated as needed. The Shadow Device provides a continuous state exchange and monitoring mechanism between the edge and cloud environments. This architecture ensures the management and protection of devices while also enabling the consumption and exchange of data between the edge and cloud environments. The bidirectional flow of data allows for the transfer of data from edge devices to the cloud as well as the transmission of updates and commands from the cloud to the edge devices.
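The state exchange between an edge device and its Shadow Device can be illustrated with a minimal sketch. The text does not name a specific IoT platform, so the functions `shadow_delta` and `apply_delta` and the state fields (`model_version`, `sampling_rate_hz`) are hypothetical names chosen only for illustration.

```python
def shadow_delta(desired, reported):
    """Return the settings whose desired value differs from what the device reported."""
    return {key: value for key, value in desired.items() if reported.get(key) != value}


def apply_delta(reported, delta):
    """Simulate the edge device applying the delta and reporting its new state."""
    updated = dict(reported)   # copy so the previous report is left untouched
    updated.update(delta)
    return updated


# Cloud side: the state the operator wants (hypothetical fields).
desired = {"model_version": "cnn-v2", "sampling_rate_hz": 16000}
# Edge side: the state the device last reported.
reported = {"model_version": "cnn-v1", "sampling_rate_hz": 16000}

delta = shadow_delta(desired, reported)   # only the changed field is sent to the edge
reported = apply_delta(reported, delta)   # the device reports back after updating
```

Once the device has applied the delta, a second call to `shadow_delta` returns an empty dictionary, meaning the shadow and the device are synchronized.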
4.2.1. Modeling
Artificial neural networks are widely used in Speech Emotion Recognition (SER) studies and are frequently emphasized in the literature as an effective method. Neural networks learn from past and present data, uncover hidden relationships between data, and use this information to predict and classify future states. In this study, these models were employed to recognize the emotional states represented in the EMODB, such as happiness, sadness, anger, fear, disgust, and boredom. Among the various artificial neural network models used in SER systems, CNNs, LSTM networks, and Dense networks are commonly preferred and were utilized in our study. These models were chosen due to their established effectiveness in SER tasks and their feasibility for deployment in IoT edge contexts, unlike more advanced or hybrid models, which often require higher computational resources [35].
A comparative analysis was conducted to determine the most suitable model for real-time emotion recognition in an edge computing environment. By integrating edge computing with cloud computing, this study aimed to achieve emotion recognition with low latency and high efficiency, thereby making IoT devices more responsive to human emotions.
CNN (Convolutional Neural Networks)
Convolutional Neural Networks represent a class of deep learning models that have demonstrated strong performance in pattern recognition tasks, particularly in image and audio analysis [36,37,38]. They capture spatial and temporal relationships in data, creating higher-level representations using filters. While various CNN architectures exist, their core structure typically includes three types of layers: convolutional, pooling, and fully connected [39,40]. In CNNs, the convolutional layer serves as the fundamental building block, extracting features from the input signal. This layer uses a set of neurons, each with trained weights and biases, to create feature maps, which are then passed to the next layer. The pooling layer reduces the dimensionality of the feature maps, making the data more manageable while preserving essential features. After several convolutional and pooling layers, a fully connected layer combines the features to classify the data. This layer connects each neuron in the previous layer to each neuron in the current layer and produces the class probabilities.
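To make the convolution and pooling steps concrete, the following pure-Python sketch applies one filter, a ReLU activation, and 2 × 2 max pooling to a tiny input. It is a teaching illustration and not the implementation used in the experiments; the function names are chosen here, and the operation is the unflipped cross-correlation that deep learning frameworks conventionally call convolution.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Set negative activations to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 6 x 6 input and a 3 x 3 kernel that simply copies the centre pixel.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
pooled = max_pool2d(relu(conv2d_valid(image, kernel)))
```

The 6 × 6 input shrinks to 4 × 4 after the convolution and to 2 × 2 after pooling, which is the dimensionality reduction the paragraph describes.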
LSTM (Long Short-Term Memory Networks)
Long short-term memory (LSTM) architectures are designed to model sequential information by effectively capturing temporal patterns across extended input sequences. Compared to conventional recurrent structures, they provide improved stability when learning long-range dependencies. In contrast, standard recurrent neural networks tend to perform adequately only over short temporal spans, as their training process is often hindered by instability issues arising from diminishing or excessively growing gradient values when handling prolonged contextual information [
41,
42]. These limitations have led to the preference for LSTM models, which incorporate memory cells protected by gates that control information flow.
LSTM models consist of three main gates: the Forget Gate, which determines what information should be discarded from the memory based on previous states and current input; the Input Gate, which controls how much new information enters the memory cell; and the Output Gate, which regulates the amount of information that flows out of the memory cell [43]. The integration of these gates allows LSTMs to effectively capture and learn long-term dependencies in data, making them highly suitable for tasks involving sequential information [44].
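The gate mechanism described above is commonly written as the following update equations. The notation is the standard textbook form and is introduced here for illustration: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the memory cell, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and $W_\ast$, $b_\ast$ the learned weights and biases. The text does not state which LSTM variant was used, so this should be read as the conventional formulation rather than the exact one in the experiments.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Because $c_t$ is updated additively, gradients can flow across many time steps when $f_t$ stays close to one, which is why the gated design mitigates the vanishing-gradient problem of standard recurrent networks.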
Dense Network
Dense or fully connected networks are one of the most fundamental and widely used artificial neural networks. In these networks, each neuron is fully connected to each neuron in the preceding layer, enabling high learning capacity and applicability across various data types [
45].
A Dense network typically consists of an input layer, one or more fully connected hidden layers, and an output layer. The model optimizes weights and biases during training to improve accuracy [
46]. The input layer represents the data fed into the network, with each node corresponding to a feature. The hidden layers process these inputs to extract higher-level features, and the output layer produces the final prediction.
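The input, hidden, and output layers described above can be sketched as a forward pass in plain Python. The weights, layer sizes, and the softmax output are illustrative assumptions; they are not the trained parameters or the exact configuration of the Dense model in this study.

```python
import math


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: every output neuron sees every input."""
    outputs = [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]
    if activation == "relu":
        outputs = [max(0.0, value) for value in outputs]
    return outputs


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three input features, two hidden neurons, three output classes (all made up).
features = [0.5, -1.0, 2.0]
hidden = dense(features, [[0.2, -0.1, 0.4], [-0.3, 0.8, 0.1]], [0.0, 0.1], "relu")
probs = softmax(dense(hidden, [[1.0, -1.0], [-0.5, 0.5], [0.3, 0.3]], [0.0, 0.0, 0.0]))
```

The class with the highest probability is taken as the prediction; training adjusts the weights and biases so that this class matches the label more often.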
In speech processing, Dense networks play a critical role in tasks such as speech recognition, emotion analysis, and voice command systems [47]. Sound signals are processed using feature extraction techniques such as MFCC, and the resulting features are then classified by the Dense network, whose fully connected structure supports learning complex feature combinations [48].
4.2.2. Privacy and Security Considerations in Edge-Based SER Systems
The proposed Speech Emotion Recognition (SER) framework operates on biometric audio signals processed directly at the IoT edge, where both emotional and physiological cues constitute highly sensitive personal data. Although edge-based processing significantly reduces reliance on centralized cloud infrastructures and mitigates large-scale data leakage risks, affective computing systems inherently introduce complex privacy and security challenges. In particular, speech signals used for emotion recognition may implicitly encode demographic attributes, behavioral tendencies, and psychological states, thereby necessitating robust protection mechanisms throughout the data lifecycle [
49]. In the current experimental configuration, emotional audio data are captured and analyzed locally on edge devices to minimize external exposure. However, the system does not yet implement a fully integrated end-to-end security framework, which represents a practical limitation for real-world deployment scenarios. Comprehensive protection mechanisms are required to safeguard data integrity and confidentiality across acquisition, local inference, and optional cloud synchronization stages [
50].
Recent advances in privacy-preserving analytics for IoT and edge environments offer promising directions for addressing these challenges. Privacy-preserving approaches, including homomorphic encryption, federated learning, and secure multi-party computation, support decentralized learning paradigms by allowing analytical processes to be performed without exposing raw data or personally identifiable information [
51]. Complementary approaches, including differential privacy, further reduce the risk of sensitive information leakage by introducing controlled perturbations during model training or inference. While existing studies primarily focus on encrypted computation and collaborative learning paradigms, the present work contributes a complementary perspective by emphasizing real-time authenticity and accuracy validation of emotional intelligence outputs under strict latency constraints [
52]. For future deployments, integrating encrypted communication channels, device-level authentication, and fine-grained access control mechanisms within the device gateway will be essential to ensure secure bidirectional data exchange between edge and cloud components. Adopting a privacy-by-design strategy aligned with emerging ethical and regulatory frameworks will be critical for enabling trustworthy, user-centric emotion-aware IoT systems in practical application contexts [
53].
4.3. Limitations
This section outlines the key limitations of the present study. Although the evaluated models perform well under the edge-oriented conditions considered here, several methodological constraints must be acknowledged.
In particular, the reported latency and power consumption characteristics are derived from benchmark-driven estimations based on representative IoT edge device specifications, rather than empirical measurements obtained through physical deployment. While this approach enables a controlled and reproducible evaluation setting, it may not fully capture device-specific variations, hardware-level constraints, and real-world operational conditions.
Furthermore, the experimental validation is limited to a single acted speech dataset (EMODB) and unimodal acoustic features, which may restrict the generalizability of the findings to spontaneous, multilingual, or multimodal emotional contexts.
To address these limitations, future work will focus on real-device deployment and empirical performance evaluation on widely used edge hardware platforms such as Jetson Nano and Raspberry Pi. In addition, extended validation using larger, more diverse, and multimodal datasets will be conducted to further assess robustness, scalability, and real-world applicability.
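For the planned on-device evaluation, inference latency could be measured with a simple timing harness such as the sketch below. The function `measure_latency` and the stand-in `dummy_infer` are hypothetical; on a real device, `dummy_infer` would be replaced by the deployed model's prediction call, and the warm-up and run counts are arbitrary illustrative choices.

```python
import statistics
import time


def measure_latency(infer, sample, warmup=5, runs=50):
    """Time repeated calls to infer(sample) and summarise them in milliseconds."""
    for _ in range(warmup):          # warm-up calls are excluded from the statistics
        infer(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def dummy_infer(features):
    """Stand-in for a real model call; it only does some arithmetic."""
    return sum(value * value for value in features)


stats = measure_latency(dummy_infer, [0.1] * 1000)
```

Reporting the median together with a high percentile, rather than the mean alone, makes occasional slow inferences visible, which matters for real-time use.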
5. Experimental Results
In this study, emotion analysis was conducted using the EMODB dataset, with the Mel-Frequency Cepstral Coefficients (MFCC) method employed for feature extraction [54,55]. The filter complexity was set to 50, ensuring a detailed representation of the audio features. MFCC has been widely reported to provide advantages in both accuracy and computational efficiency compared to other feature extraction methods [56]. The extracted MFCC features were used as input for training deep learning models, including CNN, LSTM, and Dense networks, to classify the emotional states present in the dataset.
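For readers unfamiliar with MFCC, the core computation can be summarized by two standard formulas: a mapping from frequency in Hz to the mel scale, and a discrete cosine transform of the log filterbank energies. The symbols are introduced here for illustration: $M$ is the number of mel filters, $E_m$ the energy in filter $m$ for one frame, and $N$ the number of coefficients kept. The text does not specify how the reported settings (50 and 60) map onto $M$ and $N$, so no values are substituted, and the mel formula shown is one common convention among several.

```latex
\mathrm{mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)

c_n = \sum_{m=1}^{M} \log\left(E_m\right)\,
      \cos\!\left[\frac{\pi n}{M}\left(m - \tfrac{1}{2}\right)\right],
      \qquad n = 0, 1, \dots, N-1
```

Each audio frame therefore yields a vector $(c_0, \dots, c_{N-1})$, and stacking these vectors over time gives the two-dimensional input on which the CNN and LSTM models operate.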
During the training process, the models learned from the examples in the dataset, with model weights and parameters iteratively optimized to improve the classification of emotional states. The performance of these models was then evaluated on a separate test dataset, where their ability to generalize and accurately recognize emotions was assessed. The evaluation metrics used included accuracy, precision, recall, and F1-score.
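The four evaluation metrics can be computed from the true and predicted labels as in the sketch below. The label lists are made-up toy data used only to show the calculation; they are not results from this study, and the function name `classification_metrics` is chosen here.

```python
def classification_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, F1, and support."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,   # number of true samples of this class
        }
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, report


# Toy labels for illustration only.
y_true = ["anger", "anger", "neutral", "boredom", "neutral", "sadness"]
y_pred = ["anger", "anger", "boredom", "boredom", "neutral", "sadness"]
accuracy, report = classification_metrics(y_true, y_pred)
```

In this toy example one neutral utterance is predicted as boredom, which lowers the precision of boredom and the recall of neutral while leaving the other classes unaffected. This shows why per-class metrics are reported alongside overall accuracy.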
The results indicate that the models used the MFCC features to capture relevant acoustic characteristics of the audio data and classify the emotional states. The deep learning models exhibited consistent performance across evaluation metrics, supporting their applicability in speech emotion recognition tasks. Statistical validation, including confidence intervals and hypothesis testing, was conducted to assess the significance of the observed performance differences across the CNN, LSTM, and Dense models. This testing indicated that the CNN’s performance was significantly higher than that of the LSTM (p < 0.05).
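The text does not state which interval or test was used, so the following is only one possible way to attach a confidence interval to a test accuracy: the Wilson score interval for a proportion. The counts in the example (135 correct out of 161) are hypothetical and chosen only to be of the same order as a 30% test split of 535 utterances.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only.
low, high = wilson_interval(correct=135, total=161)
```

With a test set of this size the interval spans roughly ten percentage points, so differences of a few points between models should be interpreted with care unless a paired test on the same test utterances supports them.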
5.1. Modeling of CNN Layers
The CNN method used for emotion recognition from MFCC-based audio signals was executed with 70% training data and 30% test data. The model initially consists of a convolutional layer with 32 filters and a kernel size of 5 × 5. Following this layer, the ReLU activation function was applied. A max-pooling layer with a size of 2 × 2 was then added. To flatten the feature map obtained from the CNN layers, a Flatten layer was incorporated. Finally, a dense layer with 128 neurons was employed.
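The memory footprint of the layers described above can be estimated with simple arithmetic. The sketch below assumes an unpadded, stride-1 convolution on a single-channel input of 60 MFCC coefficients by 100 frames; the frame count, the padding mode, and the helper names are assumptions, since the text does not report them.

```python
def conv2d_output(height, width, kernel=5, filters=32, in_channels=1):
    """Shape and parameter count of a stride-1, unpadded ('valid') convolution."""
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    params = kernel * kernel * in_channels * filters + filters   # weights + biases
    return out_shape, params


def max_pool_output(shape, pool=2):
    """Shape after non-overlapping pool x pool max pooling."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


# Assumed input: 60 coefficients x 100 frames x 1 channel (frame count is hypothetical).
conv_shape, conv_params = conv2d_output(60, 100)            # 32 filters of 5 x 5
pool_shape = max_pool_output(conv_shape)                    # 2 x 2 max pooling
flat_size = pool_shape[0] * pool_shape[1] * pool_shape[2]   # Flatten layer
hidden_params = dense_params(flat_size, 128)                # Dense layer, 128 neurons
```

Under these assumptions the convolution has 832 parameters while the 128-neuron dense layer has about 5.5 million, so the flattened feature map, not the convolution, dominates the model size. This is the kind of consideration that matters on memory-limited edge devices.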
During model training, the CNN architecture achieved peak accuracy levels of up to 98%, while the lowest observed training accuracy was approximately 17.9%. Across the validation phase, accuracy values varied between 31.7% and 84%, indicating variability in generalization performance across training epochs. With respect to optimization behavior, training loss values spanned from 0.0038 to 4.5, whereas validation loss was observed within the range of 0.58 to 2.4.
Performance metrics indicated that the CNN model achieved a maximum accuracy of 84%, demonstrating its effectiveness in classifying emotional states from audio features.
Table 1 presents the emotion recognition performance of the CNN model on the EMODB dataset in terms of precision, recall, F1-score, and support. According to these metrics, the model performed best on anger and sadness (84% and 94% precision, respectively; 98% and 100% recall, respectively). Lower performance was observed for boredom and neutral (50% and 60% precision, respectively; 68% and 32% recall, respectively), suggesting potential class overlap and feature similarity. The 84% accuracy cited above corresponds to the peak validation accuracy observed during the training phase, whereas the value reported in Table 1 represents the final accuracy obtained on the held-out test dataset.
Figure 5 shows the confusion matrix, which illustrates how the CNN model misclassifies different emotions. For instance, neutral is confused with boredom and happiness, while boredom is confused with disgust and neutral. Anger and sadness are almost entirely correctly classified, whereas fear and happiness show relatively more confusion. The frequent misclassifications between neutral and boredom underscore the need for targeted augmentation strategies. In the figure, the most frequent misclassifications are highlighted in color-coded cells.
Overall, the results indicate that the CNN model recognizes anger and sadness reliably, with high precision and recall in both categories, but is prone to confusing neutral and boredom, where precision and recall are lower. Although the model generally performs well, improvements are needed in recognizing neutral and boredom.
5.1.1. Modeling of LSTM Layers
In this study, LSTM was the second method used for emotion recognition from audio signals on the EMODB dataset. MFCC was used as the feature extractor, with the number of MFCC coefficients set to 60. The dataset was divided into 70% training data and 30% test data.
The LSTM network is composed of three sequential layers: the first layer has 128 neurons, the intermediate layer has 256 neurons, and the final layer again has 128 neurons.
The ‘ReLU’ activation function was used between layers, and a Dropout layer with a 60% dropout rate was added. Multi-class classification was performed in the final layer using the ‘softmax’ activation function. The return_sequences parameter was set to True so that each LSTM layer passes its full output sequence, rather than only its last output, to the next layer.
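The effect of stacking sequence-returning LSTM layers can be sketched by tracing output shapes and parameter counts with the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The frame count of 128 is a hypothetical value, and the helper names are illustrative only.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one standard LSTM layer.

    Four gates, each with input weights, recurrent weights, and a bias vector.
    """
    return 4 * (units * (input_dim + units) + units)


def stacked_lstm_trace(n_frames, n_features, layer_units):
    """Return per-layer (output_shape, params) for LSTMs that all return sequences."""
    trace = []
    input_dim = n_features
    for units in layer_units:
        trace.append(((n_frames, units), lstm_params(input_dim, units)))
        input_dim = units  # the next layer consumes this layer's per-frame output
    return trace


# Assumed input: 128 frames (hypothetical) x 60 MFCC coefficients per frame.
trace = stacked_lstm_trace(n_frames=128, n_features=60, layer_units=[128, 256, 128])
total_lstm_params = sum(params for _, params in trace)
```

Because every layer keeps the time axis, the last layer still emits one 128-dimensional vector per frame. Some reduction over time (taking the last step, pooling, or flattening) is therefore needed before the softmax layer; the text does not specify which was used.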
The LSTM model demonstrated a maximum training accuracy of 92% and a minimum training accuracy of 25%. The validation accuracy ranged from a minimum of 35% to a maximum of 66%. These results indicate that while the model has a strong capacity for learning from training data, it exhibits limited generalization capability, suggesting potential overfitting. Training loss values ranged from 0.13 to 1.8, while validation loss values varied between 1.3 and 1.9. These results further support the presence of a generalization gap between training and validation performance.
Performance metrics showed that the LSTM model achieved a maximum accuracy of 63%, indicating comparatively lower effectiveness in classifying emotional states from audio features when compared to the CNN and Dense models.
Table 2 presents the emotion recognition performance of the LSTM model on the EMODB dataset in terms of precision, recall, F1-score, and support. According to these metrics, the model performed comparatively well on anger and sadness (76% and 54% precision, respectively; 60% and 93% recall, respectively). Significantly lower performance was observed for fear and disgust (20% and 21% precision, respectively; 9% and 43% recall, respectively), indicating difficulty in capturing discriminative features for certain emotional classes.
Figure 6 shows the confusion matrix, which illustrates how the LSTM model misclassifies different emotions. For instance, neutral is confused with boredom and sadness, while boredom is confused with disgust and neutral. Anger is relatively well classified, but more confusion is observed for fear and happiness. The recurrent confusions among the neutral, boredom, and sadness categories point to limitations in class separability. The labels in the figure were clarified to improve the readability of class distributions.
5.1.2. Modeling of Dense Layers
Another method used for emotion recognition from audio signals in this study involved a three-layer Dense network architecture. Mel-Frequency Cepstral Coefficients were employed to represent acoustic features, with the number of coefficients fixed at 60. The available data were partitioned into training and testing subsets using a 70:30 ratio. The Dense neural network architecture consists of three fully connected layers configured with 125, 250, and 125 neurons, respectively. Rectified Linear Unit (ReLU) activations were applied between successive layers, and regularization was supported through the inclusion of a Dropout layer with a rate of 0.6. For the final classification stage, a softmax activation function was utilized to enable multi-class emotion prediction.
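The three non-linear operations named above can be written out in a few lines. The sketch below is a plain-Python illustration, not the training code used in the study; it assumes the common "inverted" form of dropout, in which surviving activations are rescaled during training and the layer is an identity at inference. The input values are arbitrary.

```python
import math
import random


def relu(values):
    """Rectified Linear Unit: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: drop each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def softmax(logits):
    """Numerically stable softmax that maps logits to class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden = relu([-1.2, 0.5, 2.0, -0.3, 1.1])
train_out = dropout(hidden, rate=0.6, rng=rng, training=True)
infer_out = dropout(hidden, rate=0.6, rng=rng, training=False)
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, -0.5])  # seven arbitrary logits
```

With a rate of 0.6, on average only 40% of the units are active in each training step, which is a fairly strong regularizer for layers of 125 to 250 neurons.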
The Dense model demonstrated a maximum training accuracy of 92.99% and a minimum training accuracy of 12.38%. Validation accuracy ranged from a minimum of 15.89% to a maximum of 81.31%. These results indicate that while the model has a strong capacity for learning from training data, it exhibits variability in generalization performance, suggesting potential overfitting. Training loss values ranged from 0.2147 to 55.7179, while validation loss values varied between 0.7118 and 8.3821. These results further indicate a noticeable gap between training and validation performance. Additional optimization strategies may be required to improve validation accuracy and reduce error rates.
Performance metrics showed that the Dense model achieved a maximum accuracy of 81.3%, demonstrating competitive performance in classifying emotional states from audio features relative to other evaluated models.
Table 3 presents the emotion recognition performance of the Dense model on the EMODB dataset in terms of precision, recall, F1-score, and support. According to these metrics, the model performed best on anger, boredom, and sadness (86%, 71%, and 92% precision, respectively; 86%, 88%, and 100% recall, respectively). Lower performance was observed for happiness and fear (50% and 62% precision, respectively; 50% and 73% recall, respectively), indicating challenges in distinguishing certain emotional classes.
Figure 7 shows the confusion matrix, which illustrates how the Dense model misclassifies different emotions. For instance, neutral is confused with boredom and disgust, while boredom is confused with fear and neutral. Anger is relatively well classified, but more confusion is observed for fear and happiness. Confusion occurs particularly among the boredom, fear, and neutral categories, pointing to limitations in class separability. In the figure, legends were standardized and fonts scaled to improve readability.
Overall, the Dense model is quite successful in recognizing anger and sadness, with high precision and recall in both categories, but it is prone to confusion among neutral, fear, and disgust, where precision and recall are lower. The model therefore generally performs well but requires improvements in recognizing fear, neutral, and disgust.
Specifically, according to the confusion matrix, anger is well identified with 86% precision, though it is sometimes confused with neutral. Fear shows relatively low performance, with 62% precision and 73% recall. Boredom has 71% precision and performs well with 88% recall, but it is sometimes confused with neutral and sadness. Disgust has 83% precision but shows some confusion, with 71% recall. Happiness exhibits low performance, with 50% precision and 50% recall. Neutral has 94% precision but only 68% recall, being confused particularly with boredom and sadness. Sadness shows excellent performance, with 92% precision and 100% recall. This assessment indicates that while the Dense model recognizes some emotional states successfully, improvements in distinguishing specific emotions are needed to enhance overall performance.
6. Discussion
The present study examines IoT-based systems, including IoT middleware, that automatically detect human emotional states through communication channels such as tone of voice, facial expressions, and body language. Studies in the IoT field often adopt a classic four-layer architecture comprising sensing, network, service, and application layers. A data processing model is proposed here to mitigate problems encountered in systems designed according to this classic architecture and to contribute to IoT-based sensing systems.
Although the present study does not include physical deployment on edge hardware, the proposed framework aims to provide a deployment-aware evaluation perspective that considers the constraints and operational requirements of IoT edge environments. Future work will extend this framework through empirical validation on representative edge platforms. The objective of this study is not to maximize classification accuracy but rather to analyze the deployment feasibility of widely used deep learning architectures within edge-oriented processing environments.
In contrast to conventional IoT frameworks, the proposed architecture introduces an intermediate processing stage situated between the sensing and network layers. Within this framework, data streams generated at the sensing layer are processed locally before transmission, enabling a comparative evaluation of algorithmic performance at this intermediate stage. The study assesses the emotion recognition capabilities of three deep learning approaches—dense neural networks, LSTM, and CNN—while also examining the potential benefits introduced by integrating edge computing into the overall system design. The algorithms allow for real-time classification of data from the sensing layer before reaching the target system. Performance evaluation related to data processing is carried out using precision, recall, and F1-score metrics. This real-time process enables the efficient processing and interpretation of data generated by the IoT sensing layer.
The dataset used in the study is EMODB, which is widely used in the field of emotion recognition. EMODB includes seven emotional states: happiness, sadness, anger, fear, disgust, boredom, and neutral. Each model was trained and tested on EMODB. Performance results in speech emotion recognition can vary significantly depending on dataset characteristics, preprocessing pipelines, and experimental protocols. The EMODB dataset contains acted emotional speech recorded under controlled laboratory conditions, which may limit direct comparability with studies using larger or multimodal datasets.
The analysis results show that the CNN model performed the best, with an accuracy rate of 84%. This finding is consistent with previous studies. Although the Dense architecture achieves slightly higher overall test accuracy, the CNN model demonstrates more stable performance across emotion classes and more consistent behavior during training and validation. From a deployment-oriented perspective, such stability and balanced class performance may be preferable to marginal gains in overall accuracy. Latif et al. (2018) and Hershey et al. (2017) reported that CNN models perform effectively in speech-based emotion recognition tasks [57, 58], while Levi and Hassner (2015) also emphasized the success of convolutional approaches in emotion recognition [59]. These findings support the effectiveness of CNN-based architectures in extracting discriminative acoustic features. The lower recognition accuracy observed for neutral and boredom categories can be attributed to the acoustic similarity of these emotional states with other low-energy emotions. Neutral speech often lacks distinctive prosodic variations, while boredom may share acoustic characteristics with low-arousal emotional expressions. These similarities may increase classification ambiguity for convolutional architectures.
The lower performance of the LSTM model suggests that, despite its capacity for processing time-dependent data, it may fall short in certain emotion categories. Tokozume and Harada (2017) highlighted that although LSTM models perform well on time-series data, they may show lower performance in complex tasks such as emotion recognition [60]. Wu et al. (2020) further noted that while LSTM is effective in learning long-term dependencies, it may struggle to capture short-term emotional variations [61]. This observation is consistent with the generalization limitations identified in the present study.
The performance of the Dense model is generally strong; although it does not outperform CNN, it demonstrates competitive results. Costa et al. (2017) showed that Dense networks perform well in related tasks such as music classification [62], suggesting their applicability in emotion recognition problems. However, Purwins et al. (2019) indicated that while Dense architectures are effective in audio signal processing, their performance may be limited compared to more advanced deep learning models [63].
An important finding of this study is that the integration of these models with edge computing can provide advantages such as reduced latency, lower dependency on centralized infrastructures, and improved bandwidth efficiency. Previous studies have highlighted these benefits; for instance, Palanisamy et al. (2020) and Baucas and Spachos emphasized the role of edge computing in reducing latency and improving energy efficiency [64, 65]. Similarly, Lieskovská et al. demonstrated that edge-based processing enhances responsiveness in IoT applications [66, 67].
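The bandwidth argument can be illustrated with a back-of-the-envelope comparison between transmitting a raw utterance and transmitting only the locally inferred label. All values below are assumptions for illustration: 16 kHz mono 16-bit PCM audio, a 3-second utterance, and a hypothetical JSON message format with `emotion` and `confidence` fields.

```python
import json


def raw_audio_bytes(duration_s, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio for the given duration."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def label_message_bytes(label, confidence):
    """Size of a small JSON message carrying only the inference result."""
    payload = json.dumps({"emotion": label, "confidence": round(confidence, 2)})
    return len(payload.encode("utf-8"))


audio_size = raw_audio_bytes(3.0)                # raw utterance sent to the cloud
message_size = label_message_bytes("anger", 0.91)  # result sent from the edge
reduction_factor = audio_size / message_size
```

Under these assumptions the payload shrinks by more than three orders of magnitude. Audio compression on the cloud path and protocol overhead on both paths would narrow the gap in practice.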
Although no physical deployment was conducted in this study, the proposed framework is designed to be compatible with edge environments and supports real-time processing requirements. Compared to cloud-based systems, edge-based approaches are expected to reduce latency and bandwidth consumption, making them suitable for real-time emotion recognition applications. In practical deployments, overall system latency may also include additional processing stages such as voice activity detection (VAD), MFCC feature extraction, and streaming pipeline delays. Future work will extend the proposed framework by incorporating these components in order to provide a comprehensive evaluation of real-time speech emotion recognition performance in edge computing environments.
Techniques such as GAN-based data augmentation and class balancing have been shown to improve robustness in emotion recognition tasks [68, 69, 70]. However, these techniques were not implemented in the present study and are considered as potential directions for future work.
Although the present study evaluates model feasibility using benchmark-driven simulation rather than direct deployment on physical edge hardware, such an approach enables controlled and reproducible comparison of multiple deep learning architectures.
Future work may extend this framework by implementing the evaluated models on real edge hardware platforms such as Raspberry Pi or NVIDIA Jetson devices. Frameworks such as TensorFlow Lite or ONNX Runtime could be used to measure real inference latency, energy consumption, and memory footprint in practical deployment scenarios.
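A latency measurement of this kind could follow the pattern sketched below: warm-up runs, repeated timed runs, and summary statistics. The function `measure_latency` and the stand-in `dummy_infer` are hypothetical; on a real device `dummy_infer` would be replaced by a call into the chosen runtime, and no runtime-specific API is shown here.

```python
import statistics
import time


def measure_latency(infer_fn, sample, warmup=5, runs=50):
    """Time repeated calls to infer_fn(sample) and summarize them in milliseconds."""
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        infer_fn(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, round(0.95 * (len(timings_ms) - 1)))
    return {
        "mean_ms": statistics.fmean(timings_ms),
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }


def dummy_infer(features):
    """Hypothetical stand-in for a real model call; it only sums the input."""
    return sum(features)


stats = measure_latency(dummy_infer, [0.0] * 60)
```

Reporting a tail percentile alongside the mean is useful on embedded devices, where thermal throttling and background tasks can produce occasional slow inferences that an average hides.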
Limitations: One limitation of this study is the use of the EMODB dataset, which contains acted emotional speech recorded under controlled laboratory conditions and includes a relatively small number of samples [71, 72, 73]. Future research may address this limitation by exploring few-shot learning or domain adaptation techniques to improve model generalization in real-world speech emotion recognition scenarios, particularly in resource-constrained IoT edge environments where collecting large-scale labeled datasets remains challenging. Another limitation is the reliance on unimodal acoustic features from this single dataset. Future research will extend the proposed framework to larger datasets and multimodal emotion recognition scenarios, incorporating additional signals such as facial expressions and physiological measurements.
7. Conclusions
This study examined the performance of three deep learning architectures—convolutional neural networks (CNN), long short-term memory networks (LSTM), and dense neural networks—for speech-based emotion recognition using the EMODB dataset, with a particular emphasis on edge-oriented deployment scenarios. The proposed framework provides a deployment-aware comparative evaluation perspective for speech emotion recognition models under representative edge constraints. The experimental results indicate that, under the considered configuration, CNN-based models provide the most consistent and reliable classification performance, while Dense architectures also achieve comparatively competitive accuracy. In contrast, the LSTM model exhibits lower effectiveness in this setting, suggesting that temporal modeling alone may be insufficient for robust emotion recognition when constrained by limited feature representations and computational resources. Future research may extend this work by incorporating real hardware-level validation on embedded AI platforms.
Beyond model-level performance, the findings highlight the practical advantages of integrating emotion recognition pipelines into IoT edge environments. Localized inference supports low-latency processing and reduces reliance on centralized infrastructures, which is especially relevant for real-time and context-aware applications. These findings underline the potential of edge-enabled emotion recognition systems in latency-sensitive scenarios. However, these benefits must be interpreted in light of the controlled experimental conditions under which the evaluation was conducted.
Several limitations of the present study should be acknowledged. First, the evaluation is based on a single, acted speech dataset with a limited number of emotional categories, which may restrict generalizability to spontaneous or culturally diverse emotional expressions. Second, the experiments focus on unimodal speech features, without incorporating complementary modalities such as facial expressions or physiological signals. Finally, while edge deployment feasibility is explored at a conceptual level, broader assessments involving heterogeneous hardware platforms and real-world usage conditions remain necessary.
Future research should therefore investigate the effects of data augmentation strategies and hybrid model architectures on recognition performance, as well as validate the proposed framework using larger, more diverse, and multilingual datasets. Expanding the analysis to multimodal emotional AI systems and examining deployment trade-offs on resource-constrained edge devices will further enhance practical applicability. In addition, ethical considerations, including privacy preservation, data security, and algorithmic bias, should be systematically addressed to support the responsible adoption of emotion-aware IoT systems in real-world environments.