4.1. Parameter Settings and Implementation Details
All experiments were conducted on a machine equipped with an AMD Ryzen 7 5800HS CPU, 40 GB of RAM, and an NVIDIA GeForce RTX 4060 GPU. All results were obtained using Python 3.10.12 and TensorFlow 2.15.0.
The DALDL model was trained using empirically tuned hyperparameters. It uses 32 filters and a growth rate of 32, and the architecture includes four dense blocks and two attention modules (HCA and CSA). A dropout rate of 0.2 was applied to reduce overfitting. The batch size was set to 64 to balance training speed and generalization. The learning rate was set to 0.001, chosen after testing a range of candidate values, and a weight decay of 0.0005 was used for regularization. The Adam optimizer and the cross-entropy loss function were used for training, with ReLU6 as the activation function. Input images were resized to a fixed resolution, and data augmentation included flipping, brightness adjustment, and cropping. The dataset was divided into 70% for training and 30% for validation. A sketch of this training configuration is given below.
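The following minimal sketch assembles this training configuration in TensorFlow 2.15 from the values reported above. The DALDL architecture itself is not reproduced here, so the stand-in model, the 224 × 224 input resolution, and the brightness-adjustment factor are hypothetical placeholders rather than the authors' exact setup.

```python
import tensorflow as tf

IMG_SIZE = 224      # assumption: the exact input resolution is not stated in the text
NUM_CLASSES = 7     # CK+ has seven expression classes

# Data augmentation named in the text: flipping, brightness adjustment, cropping.
# (Random cropping is only meaningful if the source frames are larger than IMG_SIZE.)
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomBrightness(factor=0.2),   # factor value is an assumption
    tf.keras.layers.RandomCrop(IMG_SIZE, IMG_SIZE),
])

def build_stand_in() -> tf.keras.Model:
    """Hypothetical stand-in for DALDL (the real model uses four dense blocks
    with 32 filters, growth rate 32, and HCA/CSA attention modules)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
        tf.keras.layers.Conv2D(32, 3, padding="same",
                               activation=tf.nn.relu6),   # ReLU6, as reported
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.2),                     # dropout rate from the text
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_stand_in()
model.compile(
    # Adam with lr = 0.001 and weight decay = 0.0005, as reported; how the paper
    # applies weight decay (decoupled vs. L2) is not specified, so this is one guess.
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, weight_decay=5e-4),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, batch_size=64, ...)
```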
To evaluate the effectiveness of the proposed DALDL model, we compare it against two distinct baselines: AlexNet and SqueezeNet. These baselines were chosen to highlight the trade-offs between model complexity, computational efficiency, and recognition performance. AlexNet represents a high-capacity deep learning model. On the other hand, SqueezeNet is a highly compressed architecture designed for real-time applications. By comparing DALDL with these two models, we assess whether it successfully balances accuracy and efficiency while remaining suitable for real-world deployment in ADASs.
AlexNet was selected as a baseline because it serves as a reference for large-scale CNNs with a high capacity for feature extraction. With 60 million parameters and large convolutional filters (e.g., 11 × 11 and 5 × 5 kernels), AlexNet has demonstrated state-of-the-art performance on large datasets such as ImageNet. However, its high computational cost and large memory footprint make it impractical for embedded ADAS applications that require real-time inference. Additionally, AlexNet exhibits performance degradation when trained on smaller datasets like CK+, as it was originally designed for large-scale data. This comparison is crucial because if DALDL achieves similar or superior performance at a significantly lower computational cost, it validates its efficiency in learning meaningful representations without relying on extensive model depth or parameter size.
On the other end of the complexity spectrum, SqueezeNet was chosen as a lightweight baseline due to its emphasis on model compression and efficient computation. SqueezeNet reduces the number of parameters to 1.24 million, achieving a roughly 50× smaller footprint than AlexNet while retaining a reasonable level of accuracy. It employs Fire modules, which squeeze feature maps with 1 × 1 convolutions before expanding them with a mix of 1 × 1 and 3 × 3 filters, minimizing computational complexity. This design makes it highly suitable for low-power, real-time applications such as embedded ADASs. However, the aggressive compression in SqueezeNet leads to a loss of discriminative power, particularly in fine-grained emotion recognition tasks where subtle facial expressions must be detected. By comparing DALDL against SqueezeNet, we evaluate whether it retains computational efficiency without sacrificing accuracy, ensuring it remains a viable option for deployment in resource-constrained environments.
The choice of these two baselines effectively frames DALDL’s contributions in terms of accuracy, computational efficiency, and model complexity trade-offs. If DALDL surpasses SqueezeNet in recognition accuracy while maintaining a lightweight architecture, it demonstrates that the proposed model preserves critical features while avoiding unnecessary complexity. If it performs comparably to or better than AlexNet while requiring fewer computational resources, it establishes DALDL as a superior alternative to traditional deep learning architectures for emotion recognition in ADASs.
Overall, the inclusion of AlexNet and SqueezeNet as baselines provides a comprehensive evaluation of DALDL’s capabilities, ensuring that its superiority is demonstrated both in terms of accuracy-driven deep learning and real-time feasibility. This comparative analysis reinforces DALDL’s positioning as a practical and effective solution for real-world driver emotion recognition, where both high accuracy and low latency are paramount.
4.2. Performance of the DALDL Architecture with CK+ Dataset
The Extended CK+ dataset contains 327 labeled sequences from 118 subjects, representing seven facial expressions: anger, contempt, disgust, fear, happiness, sadness, and surprise. The dataset is imbalanced, with surprise (285 images) being the most frequent class and contempt (18 images) the least frequent. Each sequence consists of frames transitioning from a neutral to a peak expression, and the last three frames of each sequence are labeled, yielding a total of 981 images for training and evaluation. This class imbalance makes it challenging for models to generalize well across all expressions.
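As a minimal sketch, one way to realize the 70/30 train/validation split from Section 4.1 under this class imbalance is shown below. Stratification (preserving per-class ratios) and the fixed random seed are our own assumptions; the paper states only the split proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_ck_plus(images: np.ndarray, labels: np.ndarray):
    """70/30 split of the 981 CK+ images, keeping class ratios intact."""
    return train_test_split(
        images, labels,
        test_size=0.30,     # 30% for validation, per Section 4.1
        stratify=labels,    # assumption: preserve class ratios despite imbalance
        random_state=42,    # hypothetical seed for reproducibility
    )
```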
Table 3 summarizes the description of the CK+ dataset.
Figure 4 presents the confusion matrices of the proposed DALDL architecture and the baseline algorithms. For the Anger class, DALDL performs best, correctly classifying 34 of 38 samples; AlexNet and SqueezeNet follow closely with 33 and 32 correct classifications, respectively. Most misclassifications involve disgust and fear, as the models struggle to differentiate subtle variations in facial tension. Contempt was the most difficult class to classify: DALDL achieves the best recall with 5 of 11 correct classifications, while AlexNet and SqueezeNet classify only four correctly. Most errors involve anger, disgust, and fear, since the subtlety of contempt creates high confusion. For the Disgust class, DALDL achieves 92% accuracy (46/50 correct), outperforming AlexNet (86%) and SqueezeNet (84%); most misclassifications involve fear and anger, which are hard to separate from disgust due to similar facial muscle activations.
For the Fear class, DALDL classifies 19 of 27 samples correctly, slightly better than AlexNet and SqueezeNet (18 correct each). Fear is often confused with anger or disgust, as these expressions share overlapping eyebrow and mouth movements. Happiness is the best-classified emotion across all models: DALDL performs best with 53 of 57 correct classifications, followed by AlexNet with 51 and SqueezeNet with 49; the few errors occur mainly with surprise due to expressive facial similarities. For the Sadness class, DALDL correctly classifies 22 of 29 samples, while AlexNet and SqueezeNet classify 21 and 20, respectively; most errors stem from confusion with fear and disgust, as the subtle nature of sadness makes it harder to detect than more intense emotions. Lastly, the Surprise class has the highest accuracy: DALDL achieves 96.2% (76/79 correct), AlexNet follows with 72 correct, and SqueezeNet slightly lags with 69. The few misclassifications occur mostly with happiness due to shared facial openness.
Finally, Figure 5a presents a class-wise comparison between the proposed method and the baselines in terms of the F1-score; the proposed method outperforms the baselines for every class. Figure 5b compares the methods in terms of average accuracy and F1-score, where the proposed DALDL method again achieves superior performance.
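For reference, the following minimal sketch shows how the per-class figures quoted above follow from a confusion matrix: recall is the diagonal count divided by the row sum (e.g., 46/50 = 92% for Disgust), precision is the diagonal count divided by the column sum, and the F1-score is their harmonic mean. The function name and zero-division handling are our own; the paper does not prescribe an implementation.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    diag = np.diag(cm).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        recall = diag / cm.sum(axis=1)      # per-class accuracy quoted in the text
        precision = diag / cm.sum(axis=0)
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: a class with 46 of its 50 samples on the diagonal yields recall 0.92.
```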
4.3. Performance of the DALDL Architecture with KMU-FED Dataset
In this paper, we also consider the KMU-FED dataset, a collection of 1,106 images captured in real driving environments. It contains facial expressions from 12 subjects under varying lighting conditions and with partial occlusions caused by accessories such as sunglasses and hair. The dataset is labeled with six basic facial expressions: anger, disgust, fear, happiness, sadness, and surprise. Owing to its real-world setting, KMU-FED is particularly well suited for developing robust driver emotion recognition models for ADASs.
Figure 6 presents the confusion matrices of the proposed DALDL architecture and the baseline architectures. In these matrices, the labels 1 through 6 correspond to the six facial expressions in the KMU-FED dataset: anger, disgust, fear, happiness, sadness, and surprise, respectively.
The confusion matrices provide a detailed comparison between the proposed DALDL model and the two baseline architectures, AlexNet and SqueezeNet, on the KMU-FED dataset. The accuracy trends observed highlight the strengths and weaknesses of each model in recognizing driver emotions under real-world conditions. DALDL achieved the highest accuracy, approximately 89.5%, demonstrating superior performance in classifying subtle facial expressions. AlexNet followed with an accuracy of around 85.4%, showing competitive results but with a noticeable drop in correctly classified instances compared with DALDL. SqueezeNet performed the worst, achieving an accuracy of approximately 82.78%, indicating that while it is a lightweight model, its feature extraction capabilities may not be sufficient for handling complex facial expressions affected by varying lighting conditions and occlusions in a driving environment.
A deeper examination of individual expression classification reveals that DALDL outperforms both AlexNet and SqueezeNet in correctly distinguishing most emotions, particularly in differentiating between fear, sadness, and surprise, which are often confused due to their visual similarities. AlexNet exhibited a higher misclassification rate for expressions such as anger and disgust, which can share overlapping facial features, leading to greater ambiguity in classification. SqueezeNet, in comparison, struggled the most, particularly with fear, sadness, and surprise, suggesting that its limited feature extraction capability hampers its ability to recognize fine-grained emotional variations.
The variability in misclassifications across the three models further highlights their relative strengths and weaknesses. In DALDL's confusion matrix, the diagonal values, which represent correct classifications, are high across classes with only minor variation, indicating strong and stable predictions. AlexNet shows greater fluctuations along the diagonal, implying that while it maintains reasonable classification strength, it lacks the robustness provided by DALDL's dual attention approach. SqueezeNet exhibits the lowest diagonal consistency, with misclassifications spread across multiple classes, reinforcing its weaker generalization ability for real-world driver monitoring scenarios.
One of the key differentiating factors influencing the accuracy of these models is their underlying architecture. DALDL's improved performance can be attributed to its dual attention design (HCA and CSA), which effectively captures spatial and contextual features. This enables it to handle challenging real-world scenarios, such as occlusions and lighting variations, better than its counterparts. AlexNet, despite its deeper network, does not incorporate an optimized attention mechanism, leading to a decline in performance compared with DALDL. SqueezeNet, designed primarily for computational efficiency, sacrifices feature extraction capability by reducing model parameters, which negatively impacts its ability to distinguish between closely related emotions.
Next, we present a class-wise F1-score comparison for drivers' emotion recognition. DALDL again shows superior performance, outperforming AlexNet and SqueezeNet in every class (see Figure 7a). The average accuracy and F1-score obtained by the proposed DALDL method are also higher than those of the baselines, as shown in Figure 7b.
To validate the real-time suitability of the proposed DALDL model, we evaluated its inference speed and computational efficiency on an NVIDIA RTX 4060 GPU. DALDL achieved an average inference time of 3.9 ms per image, outperforming SqueezeNet (4.3 ms) and significantly surpassing AlexNet (15.2 ms). Despite having the smallest parameter count and model size among the compared architectures, DALDL maintains high accuracy while ensuring low latency. This demonstrates that the proposed model is not only lightweight but also well suited for real-time deployment in ADAS environments. Please refer to Table 4 for more details.
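For context, a minimal timing sketch consistent with how such per-image latencies are typically measured is given below. The warm-up and iteration counts, the batch size of 1, and the input resolution are our assumptions; the paper does not detail its measurement protocol.

```python
import time
import tensorflow as tf

def mean_inference_ms(model: tf.keras.Model, img_shape=(224, 224, 3),
                      warmup: int = 20, iters: int = 200) -> float:
    """Average per-image inference time in milliseconds at batch size 1."""
    x = tf.random.uniform((1, *img_shape))
    for _ in range(warmup):                    # exclude JIT/cuDNN warm-up cost
        _ = model(x, training=False).numpy()
    start = time.perf_counter()
    for _ in range(iters):
        _ = model(x, training=False).numpy()   # .numpy() forces host sync
    return (time.perf_counter() - start) / iters * 1e3
```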
In real-world ADAS deployments, several practical challenges must be considered beyond model performance. Embedded automotive hardware often has strict limitations on memory, processing power, and thermal dissipation, making it difficult to deploy conventional deep learning models. Power consumption is also a critical factor, particularly in electric vehicles, where energy efficiency directly impacts range. Additionally, environmental factors such as varying lighting conditions, partial facial occlusions (e.g., sunglasses, hair, hands), and camera placement within the vehicle cabin can significantly affect recognition accuracy. While the proposed DALDL model addresses computational efficiency through its lightweight architecture, it also demonstrates robustness to occlusions and lighting variations by integrating coordinate and channel attention mechanisms. Further optimization and adaptive techniques may be needed to handle extreme real-world variability, which we leave for future work.
While the proposed model integrates both HCA and CSA for enhanced feature extraction, an ablation study to evaluate their individual contributions was not conducted in this version. We acknowledge its importance and plan to include this analysis in future work to better understand each module’s role in performance improvement.
Finally, Table 5 provides a numerical comparison of the proposed deep learning architecture against the others. Note that the accuracy and F1-score values for the baseline models are based on either our reimplementation or the best reported results in comparable FER settings, and inference times were measured on an NVIDIA RTX 4060 GPU.
Lastly, we compare the proposed DALDL architecture with related works that have used the same dataset as ours. Details of the comparison can be found in Appendix B.