Article

Emotion Recognition in Autistic Children Through Facial Expressions Using Advanced Deep Learning Architectures

by Petra Radočaj 1,* and Goran Martinović 2
1 Layer d.o.o., Vukovarska Cesta 31, 31000 Osijek, Croatia
2 Faculty of Electrical Engineering, Computer Science and Information Technology, Josip Juraj Strossmayer, University of Osijek, Kneza Trpimira 2B, 31000 Osijek, Croatia
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9555; https://doi.org/10.3390/app15179555 (registering DOI)
Submission received: 1 August 2025 / Revised: 27 August 2025 / Accepted: 28 August 2025 / Published: 30 August 2025

Abstract

Atypical and subtle facial expression patterns in individuals with autism spectrum disorder (ASD) pose a significant challenge for automated emotion recognition. This study evaluates and compares the performance of convolutional neural networks (CNNs) and transformer-based deep learning models for facial emotion recognition in this population. Using a labeled dataset of emotional facial images, we assessed eight models across four emotion categories: natural, anger, fear, and joy. Our results demonstrate that transformer models consistently outperformed CNNs in both overall and emotion-specific metrics. Notably, the Swin Transformer achieved the highest performance, with an accuracy of 0.8000 and an F1-score of 0.7889, surpassing all CNN counterparts. While CNNs failed to detect the fear class, transformer models showed a measurable capability in identifying complex emotions such as anger and fear, suggesting an enhanced ability to capture subtle facial cues. Analysis of the confusion matrices further confirmed the transformers’ superior classification balance and generalization. Despite these promising results, the study has limitations, including class imbalance and its reliance solely on facial imagery. Future work should explore multimodal emotion recognition, model interpretability, and personalization for real-world applications. This research also demonstrates the potential of transformer architectures in advancing inclusive, emotion-aware AI systems tailored for autistic individuals.

1. Introduction

Autism spectrum disorder (ASD) is a complex neurodevelopmental condition that affects approximately 1 in 54 children and is marked by persistent challenges in social communication, emotional understanding, and behavioral regulation [1,2]. While the manifestations of ASD vary widely among individuals, a hallmark characteristic across the spectrum is difficulty with recognizing and interpreting facial expressions—skills that are critical for establishing and maintaining social relationships [3]. The inability to accurately perceive and respond to emotional cues in others can lead to breakdowns in communication, limited peer interactions, and significant social isolation over time [4,5]. Understanding and enhancing emotion recognition in autistic children is therefore of paramount importance for improving their quality of life and supporting their developmental outcomes [3,6]. Research has shown that autistic individuals tend to exhibit atypical gaze patterns when viewing human faces [7]. Unlike neurotypical individuals, who instinctively focus on emotionally salient regions of the face such as the eyes and mouth, autistic individuals may direct their attention to less informative areas or distribute their gaze in ways that miss key social cues [8]. This divergence in facial processing has been linked to measurable deficits in emotional interpretation and is believed to contribute to the broader social difficulties experienced by individuals with ASD. Consequently, accurate assessment and training in facial emotion recognition are becoming a foundational element of therapeutic interventions [9].
Traditionally, assessments of emotional recognition in ASD populations have relied heavily on behavioral observations, structured interviews, and standardized diagnostic questionnaires administered by clinicians or caregivers [10,11]. While these tools offer valuable insights, they suffer from several limitations: they are time-consuming to administer, highly dependent on the observer’s interpretation, and often lack the granularity needed to capture subtle emotional processing differences. Furthermore, these assessments are not easily scalable for large populations or routine screening in clinical or educational settings [11]. The variability in expressive behaviors among individuals with ASD further complicates consistent evaluation, underscoring the need for more objective, efficient, and scalable assessment methods.
Recent advancements in artificial intelligence and deep learning have opened new frontiers for emotion recognition, particularly through the use of image-based techniques. Deep learning algorithms, and in particular CNNs, have demonstrated exceptional performance in visual classification tasks, including facial expression recognition [5,6]. These models are capable of automatically learning hierarchical feature representations from raw pixel data, making them ideally suited for tasks that require fine-grained interpretation of facial expressions [12,13,14]. Unlike traditional machine learning methods, which require extensive manual feature engineering, CNNs can identify relevant facial structures and emotional indicators without human intervention, thus increasing accuracy while reducing bias and subjectivity [3].
In addition to CNNs, transformer-based architectures such as Vision Transformers (ViTs) have emerged as powerful tools in the domain of image analysis. ViTs utilize self-attention mechanisms to capture long-range dependencies and global contextual relationships across the entire image, providing complementary advantages to the localized pattern recognition capabilities of CNNs [15]. In the context of emotion recognition, ViTs may offer enhanced sensitivity to subtle, distributed changes in facial expressions that are not easily captured by spatially localized filters alone [16]. The use of these two approaches (CNNs for detailed, localized feature extraction and ViTs for capturing broader contextual patterns) represents a promising direction for emotion recognition in complex, real-world scenarios.
Despite their demonstrated success in general-purpose emotion recognition tasks, the application of deep learning models to ASD-specific datasets remains underdeveloped. Most existing systems are trained on data collected from neurotypical populations, which may not adequately represent the facial expressiveness or emotional display patterns of autistic individuals [17,18]. This misalignment can result in biased models that perform poorly when deployed in real-world settings involving autistic users. The heterogeneity of ASD further complicates model training, as individuals may differ significantly in how they display emotions—both in terms of facial muscle activation and temporal dynamics [19]. Consequently, there is a critical need to develop models that are robust to this variability and capable of generalizing across the spectrum of expressive behaviors seen in autism. Another major barrier to progress in this area is the lack of representative training data. Many widely used facial expression datasets, such as FER2013 or AffectNet, contain limited examples of autistic individuals and often focus only on exaggerated, prototypical emotions [20,21]. This lack of inclusivity limits the generalizability of models trained on these datasets. To overcome this limitation, recent research has begun to explore the use of ASD-specific datasets that include images or videos of autistic children displaying a range of emotional expressions [22,23,24]. Such datasets enable the development of models that are better attuned to the subtle and diverse ways emotions are expressed within the ASD population.
In this study, a deep learning-based approach has been employed to improve the classification of facial emotions in autistic children. Recognizing the essential role of emotion interpretation in social development, the study utilizes state-of-the-art deep learning techniques, including CNNs and transfer learning, to build specialized models for emotion recognition. By leveraging pre-trained architectures—such as ResNet, DenseNet, or Inception networks—and fine-tuning them on autism-relevant datasets, the models have been adapted for multi-class classification tasks involving four fundamental emotions: natural, anger, fear, and joy [25]. These emotions were selected based on their social relevance and frequency of occurrence in early childhood interactions. The use of transfer learning significantly accelerates the training process and improves performance, especially when dealing with limited or imbalanced datasets [26]. Pre-trained models, originally trained on large-scale image databases, retain a broad understanding of general visual features that can be refined to specialize in emotion recognition. Fine-tuning these models on ASD-specific data helps to bridge the gap between general vision tasks and the domain-specific challenges of emotion classification in autistic children. By optimizing the performance of these models through careful selection of hyperparameters, data augmentation techniques, and model architecture adjustments, this study demonstrates the practical viability of deploying deep learning for emotion recognition in autism contexts. The findings support the notion that transfer learning not only improves classification accuracy but also enhances the model’s sensitivity to less pronounced emotional cues that are common in ASD populations.
The implementation of such systems holds substantial promise for real-world applications. Emotion-aware tools can be integrated into educational software, therapeutic games, and communication aids, providing real-time feedback and support to children during social interactions [6,11]. These tools can also assist therapists and educators by offering quantitative, objective metrics for monitoring emotional understanding and social engagement over time [4]. In addition, mobile or tablet-based implementations allow for widespread deployment in schools, clinics, or home environments—especially in regions where access to specialized care may be limited [27].
Despite progress in ASD-related emotion recognition using deep learning, several research gaps remain. First, many existing emotion classifiers are not designed with the variability of autistic expression in mind and may underperform when tested on diverse ASD populations [21,28]. Second, while deep learning architectures such as CNNs and transformers offer powerful modeling capabilities, there has been limited exploration into how different architectures—individually or in combination—perform specifically in autism-related contexts [17]. Understanding the trade-offs between spatial and contextual feature extraction remains an open question. Third, most emotion recognition systems are limited to a narrow set of basic emotions. Future research should examine the recognition of more nuanced or compound emotions, as well as the temporal dynamics of facial expressions [10,28]. The unfolding of emotion over time may provide critical cues, particularly in autism, where expression timing and response delays can differ from neurotypical patterns. Finally, computational efficiency is an ongoing concern. To enhance accessibility and real-world applicability, emotion recognition models must be optimized for deployment on low-power, resource-constrained devices without significant loss of accuracy [29].
This research contributes to the growing field of autism-related emotion recognition and deep learning in several key ways:
  • It compares CNN and transformer-based architectures for emotion classification in autistic children, evaluating their strengths and limitations in capturing subtle facial cues.
  • It assesses the role of dataset composition and transfer learning in optimizing model accuracy and generalizability across diverse ASD expressions.
  • It demonstrates the potential for scalable, real-time, and objective emotion-aware systems that can augment therapeutic interventions, social skills training, and assistive communication technologies.
The remainder of this paper is organized as follows: Section 2 reviews relevant work in deep learning, autism research, and emotion recognition. Section 3 describes the methodology, including dataset selection, model architectures, and training protocols. Section 4 presents the experimental results and model comparisons. Section 5 concludes with key findings and directions for future work.

2. Related Works

Recent advancements in deep learning have considerably improved the ability of automated systems to analyze human behavior, including complex tasks such as emotion recognition. Within the context of ASD, the accurate and timely identification of emotions through facial expressions is critical for interpreting the distinctive communication styles of autistic individuals and informing personalized interventions. Despite progress in this area, the development of deep learning models that are both accurate and generalizable—while remaining practical for deployment—remains an ongoing research challenge. This is particularly true given the atypical and often subtle nature of emotional expression in autistic children.
A substantial portion of the existing literature has explored the application of deep learning techniques for the diagnosis and early screening of ASD using facial image analysis. For example, Li et al. [30] introduced a two-stage transfer learning framework incorporating MobileNetV2 and MobileNetV3-Large, achieving an area under the curve (AUC) of 96.32% and an accuracy of 90.5% on ASD screening tasks, with optimizations tailored for mobile deployment. Ranjana and Muthukkumar [31] proposed a DenseResNet-based architecture that combines DenseNet and ResNet models, achieving a classification accuracy of 97.07% on a large dataset. Similarly, Akter et al. [32] employed MobileNetV1 within a transfer learning paradigm, reporting 90.67% accuracy for ASD classification. Although these contributions demonstrate the effectiveness of deep learning in extracting ASD-related facial features, their primary focus is on detecting ASD presence rather than on the more granular task of emotional state recognition. This distinction is significant, as the methodologies employed in ASD diagnosis may not directly translate to the domain of emotion recognition, where subtler distinctions are required.
More directly relevant to emotion analysis, several studies have focused on the recognition of facially expressed emotional or affective states, including pain, in autistic children. Talaat et al. [28] presented a real-time system for emotion identification that integrates a deep convolutional neural network (DCNN) with a kernel autoencoder for feature extraction. Their system demonstrated strong performance, particularly when employing the Xception model, achieving an accuracy of 95.23% across six emotional categories. The integration of fog and Internet of Things (IoT) technologies further supported real-time implementation. Afrin et al. [33] developed a hybrid architecture that fuses DenseNet121 and MobileNetV2 to identify four emotional states (joy, anger, natural, and fear) in autistic children. Notably, they addressed significant limitations in existing datasets, such as class overlap and image duplication, by curating a modified dataset, FERAC (Facial Emotion Recognition-Autistic Children), consisting of 770 images. This dataset was constructed through observational data collection under expert supervision, leading to the exclusion of problematic classes like ‘Sadness’ and ‘Surprise’. The hybrid model trained on FERAC achieved an accuracy of 75%, outperforming several baseline architectures; however, the overall performance still indicates room for enhancement in generalizability and robustness. In a related line of work, Sandeep and Kumar [34] proposed a system for pain recognition in autistic children by combining ResNeXt and MediaPipe with a convolutional neural network to assess facial indicators of pain in real time. Although highly specific in scope, this study exemplifies the potential of deep learning for recognizing complex affective states in ASD populations.
Multimodal frameworks have also been investigated to enhance emotion recognition performance by integrating visual and auditory features. Wang et al. [35], for instance, proposed a Convolutional Vision Transformer model that incorporates both facial image data and speech-derived features. Their system achieved facial-only emotion recognition accuracy of 79.12%, which improved to 90.73% with multimodal feature fusion using an attention mechanism. While such approaches yield higher performance, they also introduce considerable computational overhead and practical limitations related to real-time data synchronization, particularly in settings where visual cues are the primary or sole modality of interest. Moreover, the suboptimal performance of unimodal facial emotion recognition reported in such studies indicates that further advancements are needed to improve the efficacy of visual-only models.
Collectively, these works highlight several ongoing challenges and research gaps. First, a majority of studies in this field emphasize ASD classification rather than comprehensive emotion recognition. Among those that do address emotional states, there is often a lack of consistent accuracy and model robustness when applied to the inherently diverse and sometimes ambiguous facial expressions observed in autistic individuals. The quality, diversity, and class balance of training datasets also remain critical bottlenecks, as unbalanced or insufficiently representative datasets hinder the development of generalized models. Furthermore, the computational complexity of many current models—especially those employing multimodal or ensemble methods—poses constraints on their applicability in clinical or real-time environments, where efficiency and low-latency performance are essential. The atypical nature of affective expression in autistic children further necessitates model architectures that are finely attuned to detecting subtle and non-standard emotional cues.
In response to these challenges, the present study proposes a deep learning-based approach specifically designed for emotion recognition in autistic children using static facial images. The goal is to develop models that are both accurate and computationally efficient, capable of recognizing a diverse array of emotions while being suitable for deployment in real-world contexts. By addressing dataset limitations, refining model architectures, and targeting a broader emotional spectrum, this work aims to advance the state of emotion recognition within ASD-focused research and applications.

3. Materials and Methods

We evaluated deep learning approaches for emotion recognition in individuals with ASD, focusing on the comparative performance of CNNs and transformer-based architectures. Our workflow consists of three main steps, as shown in Figure 1: (1) preprocessing facial expression data into four target emotion classes (natural, anger, fear, and joy) based on a curated ASD-specific dataset; (2) training and validating CNN and transformer models independently to assess their performance in emotion classification; and (3) evaluating classification accuracy and standardized metrics.

3.1. Data Preprocessing and Experimental Setup

We utilized the Facial Emotion Recognition-Autistic Children (FERAC) Dataset, which consists of 770 images of children’s faces [25]. This dataset was developed by modifying an existing one, in collaboration with a medical professional who is the director of the Autism Development Centre at Ma Shishu O General Hospital, Chattogram, Bangladesh. Initially, all black and white and duplicate images were removed to ensure high data quality. The dataset comprises four emotional classes: natural, fear, joy, and anger [25]. For our experimental setup, we partitioned the dataset into 691 images for the training directory and 79 images for the testing directory. Figure 2 shows sample images, while Table 1 displays the dataset’s distribution.
For model input consistency, all images were first resized to 224 × 224 pixels. The deep learning models were then implemented and trained in a Python 3.10 (Python Software Foundation, Wilmington, DE, USA) environment on Kaggle using the Keras-GPU and TensorFlow-GPU frameworks. Training was performed on an NVIDIA Tesla P100 GPU. The training protocol consisted of 10 epochs with a batch size of 16. The Adam optimizer was used for parameter updates, and an adaptive learning rate mechanism was included to facilitate convergence and limit overfitting.
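The text does not say which adaptive learning rate mechanism was used. One common choice is to lower the rate when the validation loss stops improving, and the sketch below illustrates that idea in plain Python. The function name `reduce_on_plateau`, the starting rate, the factor, the patience, and the loss values are all illustrative assumptions, not settings from this study.

```python
def reduce_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` whenever the validation loss has not
    improved for `patience` consecutive epochs (all values are assumptions).
    """
    best = float("inf")
    stale = 0          # epochs since the last improvement
    schedule = []
    for loss in val_losses:
        schedule.append(lr)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr = max(lr * factor, min_lr)
                stale = 0
    return schedule


# Hypothetical validation losses over 10 epochs (the epoch count used here).
losses = [1.30, 1.10, 0.95, 0.96, 0.97, 0.90, 0.91, 0.92, 0.93, 0.94]
rates = reduce_on_plateau(losses)
```

With these made-up losses the rate is halved twice during the run, after each two-epoch plateau, so later epochs take smaller steps.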

3.2. CNN and Transfer Learning for Emotion Classification

CNNs have become the foundation of most modern computer vision applications due to their ability to detect hierarchical patterns in image data [36]. For emotion recognition, CNNs are effective in identifying local features such as eye movement, mouth shape, and facial muscle patterns [37,38].
To enhance the generalization capability and reduce the training time, transfer learning was adopted. Instead of training a CNN from scratch, pre-trained models such as ResNet or DenseNet, originally trained on large-scale image datasets like ImageNet, were used as a base [32]. These models were then fine-tuned on the emotion dataset to adapt the learned features to the specific task of emotion recognition in children.
The training process involved standard preprocessing steps, including resizing, normalization, and data augmentation. Augmentation techniques such as rotation, flipping, and scaling were applied to improve model robustness and mitigate overfitting. The final layers of the CNN were customized to include dense layers and a softmax classifier suited to the multi-class nature of the task. The whole process is shown in Algorithm 1.
Algorithm 1: CNN with Transfer Learning
function EmotionRecognition_CNN(Input_Images, Pretrained_Model)
  Input:
    Input_Images: Set of training images with emotion labels
    Pretrained_Model: A CNN model pre-trained on ImageNet
  Output: Trained_CNN_Model
  Preprocess Input_Images (resize, normalize, augment)
  Load Pretrained_Model without top classification layers
  Freeze early layers of Pretrained_Model to retain learned features
  Add custom classification layers:
    Dense layer with ReLU activation
    Dropout for regularization
    Final Dense layer with Softmax activation for emotion classes
  Compile model using Adam optimizer and cross-entropy loss
  Train model on Input_Images with validation split
  Evaluate model performance on test set
  return Trained_CNN_Model
end function
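The preprocessing step of Algorithm 1 (resize, normalize, augment) can be illustrated on a toy image. The following is a minimal sketch in plain Python rather than the Keras pipeline used in the study: images are nested lists, only a horizontal flip is shown as an augmentation (rotation and scaling are omitted), and the helper names and pixel values are hypothetical.

```python
def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[px / 255.0 for px in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right (a label-preserving augmentation)."""
    return [row[::-1] for row in image]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize; real pipelines typically interpolate."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 2 x 3 single-channel image with hypothetical values.
img = [[0, 51, 102],
       [153, 204, 255]]

resized = resize_nearest(img, 4, 6)          # stand-in for the 224 x 224 target
augmented = horizontal_flip(normalize(img))  # flipped copy with values in [0, 1]
```

A flip leaves the emotion label unchanged, which is why it is a common way to enlarge a small training set without collecting new images.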

3.3. Transformer-Based Approach for Emotion Classification

Transformers, originally designed for natural language processing tasks, have recently shown impressive performance in computer vision problems [39]. Vision Transformers (ViTs) process images by dividing them into smaller patches and interpreting these patches as a sequence of tokens, similar to how words are processed in sentences [40]. This allows the model to capture global contextual relationships that CNNs might miss.
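To make the patch-to-token idea concrete, the sketch below splits a 224 × 224 image into non-overlapping patches in plain Python. The 16 × 16 patch size is an assumption (the text does not state the patch size of the models used), and `split_into_patches` is a hypothetical helper operating on a single-channel image.

```python
def split_into_patches(image, patch):
    """Split an H x W image (nested lists) into flattened, non-overlapping patches.

    Patches are returned in row-major order, which is the token order a
    Vision Transformer would see before position embeddings are added.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# A 224 x 224 input with an assumed 16 x 16 patch size gives 14 * 14 tokens.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = split_into_patches(image, 16)
num_tokens = len(tokens)        # 196
token_length = len(tokens[0])   # 256 values per single-channel patch
```

In a real model each flattened patch would then be linearly projected to an embedding vector; that learned projection is left out here.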
In this study, pre-trained transformer models were used. These are specialized Vision Transformers fine-tuned for facial emotion recognition. Each model processes high-resolution image inputs and uses multi-head self-attention mechanisms to capture complex emotional patterns distributed across different facial regions.
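The self-attention mechanism mentioned above can be sketched in a few lines. This is a simplified, single-head version with identity query, key, and value projections, written in plain Python for illustration; the models in the study use learned projections and multiple heads, and the three toy embeddings are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Real models learn separate query, key, and value projections and use
    several heads; they are omitted here to keep the mechanism visible.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights


# Three hypothetical patch embeddings; tokens 0 and 2 are similar but far apart.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(patches)
```

Token 0 attends more strongly to token 2 than to its immediate neighbour, token 1, because attention weights depend on content similarity rather than position. This is the sense in which attention can relate distant facial regions.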
The feature extraction stage involved passing preprocessed images through the frozen encoder to generate deep feature representations. These features were then used to train a custom classifier composed of dense layers, as shown in Algorithm 2. This two-stage approach allowed for leveraging powerful representations while reducing computational cost during training.
Algorithm 2: Transformer-Based Emotion Classification
function EmotionRecognition_Transformer(Input_Images, Transformer_Model)
  Input:
    Input_Images: Set of facial images labeled with emotions
    Transformer_Model: Pretrained model for emotion analysis
  Output: Trained_Transformer_Model
  Preprocess Input_Images:
    Convert grayscale to RGB if necessary
    Normalize pixel values and apply standard augmentation
  Use Transformer_Model to extract deep features from Input_Images
  Store extracted features and corresponding labels
  Define custom classifier head:
    Dense layer with ReLU activation
    Dropout for regularization
    Final Dense layer with Softmax activation
  Compile model with suitable optimizer and loss function
  Train classifier head using extracted features and labels
  Evaluate model using accuracy, F1-score, and AUC
  return Trained_Transformer_Model
end function
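The two-stage structure of Algorithm 2 (extract features once with a frozen encoder, then train only a classifier head) can be illustrated with a toy example. The sketch below is plain Python under strong simplifying assumptions: `frozen_encoder` is a hypothetical stand-in for the pretrained transformer, the head is a single linear softmax layer without the ReLU and dropout layers, and the data are synthetic.

```python
import math
import random


def frozen_encoder(image):
    """Stand-in for a pretrained encoder: a fixed, non-trainable feature map."""
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread, 1.0]  # last entry acts as a bias feature


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Class probabilities of the linear softmax head for one feature vector."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def mean_loss(weights, feats, labels):
    """Average cross-entropy of the head over the cached features."""
    return -sum(math.log(predict(weights, x)[y])
                for x, y in zip(feats, labels)) / len(feats)


def train_head(feats, labels, n_classes, lr=0.5, epochs=200):
    """Fit only the classifier head; the encoder is never updated."""
    dim = len(feats[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in range(n_classes)]
        for x, y in zip(feats, labels):
            probs = predict(weights, x)
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for i, xi in enumerate(x):
                    grads[c][i] += err * xi / len(feats)
        for c in range(n_classes):
            for i in range(dim):
                weights[c][i] -= lr * grads[c][i]
    return weights


random.seed(0)
# Hypothetical toy "images": class 0 is dark, class 1 is bright.
images = [[random.uniform(0.0, 0.4) for _ in range(8)] for _ in range(10)]
images += [[random.uniform(0.6, 1.0) for _ in range(8)] for _ in range(10)]
labels = [0] * 10 + [1] * 10

features = [frozen_encoder(img) for img in images]   # stage 1: extract once, cache
initial_loss = mean_loss([[0.0] * 3, [0.0] * 3], features, labels)
head = train_head(features, labels, n_classes=2)     # stage 2: train head only
final_loss = mean_loss(head, features, labels)
```

Because the encoder output is computed once and cached, each training epoch touches only the small head, which is where the reduction in training cost comes from.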

3.4. Performance Assessment

We evaluated the performance of our deep learning models for multi-class classification across four distinct classes. We analyzed the training and validation accuracy, loss, and key classification metrics derived from the confusion matrix. These metrics—precision, recall, F1-score, and accuracy, as defined in Equations (1)–(4)—allowed us to perform a comprehensive evaluation of the models’ capability across all classes.
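$$\text{Precision} = \frac{TP}{TP + FP}, \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN}, \tag{2}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{3}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}. \tag{4}$$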
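For a four-class problem, these quantities are typically computed per class in a one-vs-rest fashion from the confusion matrix and then averaged. The sketch below shows one way to do this in plain Python, with TP, FP, and FN denoting true positives, false positives, and false negatives for a given class. The averaging scheme is not stated in the text; support-weighted averaging is assumed here because the reported overall recall equals the accuracy (for example, 0.8000 for the Swin Transformer), which is what that scheme produces. The confusion matrix is hypothetical and is not taken from Figure 3.

```python
def per_class_metrics(cm):
    """Precision, recall, F1, and support per class.

    `cm[i][j]` counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp
        fn = sum(cm[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": sum(cm[k])})
    return results


def weighted_average(results, key):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(r["support"] for r in results)
    return sum(r[key] * r["support"] for r in results) / total


# Hypothetical 4-class confusion matrix (rows: true, columns: predicted);
# these counts are illustrative and are not taken from Figure 3.
cm = [
    [8, 1, 0, 1],   # natural
    [2, 3, 0, 0],   # anger
    [2, 1, 0, 0],   # fear: never predicted, so precision and F1 fall back to 0
    [1, 0, 0, 19],  # joy
]
results = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
weighted_recall = weighted_average(results, "recall")
```

In this toy matrix the fear class is never predicted, so its F1-score is 0 even though overall accuracy stays near 0.79. Aggregate metrics can hide a class that the model misses entirely.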

4. Results and Discussion

In terms of overall model performance, transformer-based architectures consistently outperformed CNN-based models across all major evaluation metrics, as shown in Table 2. Among the CNNs, ResNet152V2 achieved the highest accuracy at 0.7484, along with the best F1-score of 0.7048 and precision of 0.7328 within this category. The next best-performing CNN was DenseNet201, which reached an accuracy of 0.7355 and an F1-score of 0.7007, showing a modest trade-off between recall of 0.7355 and precision of 0.6755. InceptionV3 and InceptionResNetV2 showed slightly lower performance, with InceptionV3 achieving an accuracy of 0.7290 and an F1-score of 0.6868, while InceptionResNetV2 trailed with 0.7161 accuracy and a 0.6774 F1-score.
In contrast, transformer models exhibited notably stronger performance. The Swin Transformer outperformed all other models, achieving the highest accuracy of 0.8000, an F1-score of 0.7889, precision of 0.7903, and recall of 0.8000. These results suggest that Swin not only excels in correctly identifying the target classes but also maintains a strong balance between precision and recall. ViT and CvT also performed well, both reaching an accuracy of 0.7613. While ViT demonstrated a higher F1-score of 0.7341, CvT had the advantage in precision, at 0.7525. ViT-DeiT, a distilled variant of the Vision Transformer, matched ResNet152V2 in accuracy (0.7484) but exceeded it in F1-score (0.7395 versus 0.7048), indicating more balanced predictions.
These findings highlight the relative superiority of transformer-based models over traditional CNNs in the context of emotion recognition in autistic children, especially when considering the overall balance between the evaluated metrics.
Further analysis of emotion-specific F1-scores provides insight into how each model performs across the individual emotion categories: natural, anger, fear, and joy, as shown in Table 3.
For the joy emotion, which had the highest overall recognition rates across models, transformer architectures again led the performance. CvT achieved the best result with an F1-score of 0.9286, followed closely by Swin (0.9192), ViT (0.9118), and ViT-DeiT (0.9016). Among CNNs, the top performer was ResNet152V2 with an F1-score of 0.8844. This suggests that joy-related facial expressions are relatively easier to classify, particularly for transformer-based models. The fear category posed the greatest challenge for all models, with CNNs failing to recognize this emotion entirely with an F1-score of 0.0000. In contrast, transformer models showed some capability, albeit limited. Swin achieved the highest F1-score at 0.4000, followed by ViT-DeiT (0.3000), ViT (0.2667), and CvT (0.1818). While the performance remains suboptimal, the transformers’ ability to detect fearful expressions, even to a small extent, is notable and may be a starting point for further refinement. Recognition of anger was moderate across models, with transformer models again demonstrating superiority. Swin reached the highest F1-score of 0.6400, followed by ViT-DeiT (0.6000) and ViT (0.4444). CNNs performed relatively poorly, with scores ranging from 0.2353 for ResNet152V2 to 0.4167 for InceptionResNetV2. These results suggest that anger expressions are more difficult to recognize than joy but easier than fear, with transformer models showing a clear advantage. In the natural category, performance was somewhat comparable between model types. ResNet152V2 was the top-performing CNN with an F1-score of 0.6341, while the highest among transformers was Swin (0.6269), closely followed by CvT (0.6076). These results indicate that both CNNs and transformers are capable of identifying neutral expressions with moderate accuracy.
The confusion matrices, as shown in Figure 3, provide further insight into model behavior by showing the distribution of predicted versus actual labels. Correct predictions are found along the diagonal, with misclassifications represented by off-diagonal values.
Among CNNs, ResNet152V2 showed the most balanced performance, correctly classifying 26 natural, 5 anger, 9 fear, and 88 joy instances. Most CNNs performed well on joy (≥87 correct) but struggled significantly with fear—frequently misclassifying these instances as natural or anger. For example, InceptionV3 misclassified 7 fear instances as natural and failed to predict any correctly.
Transformer models showed marked improvement. The Swin Transformer correctly identified 91 joy, 8 anger, and 1 fear instances, along with 21 natural. Notably, it distributed misclassifications more evenly and with lower frequency, which aligns with its comparatively high F1-scores across the emotion categories. ViT-DeiT also performed well in anger recognition (9 correct predictions), while CvT achieved strong natural (24) and joy (91) predictions but misclassified many fear instances. ViT demonstrated strong joy detection with 93 correct predictions and modest improvements in fear and anger compared to CNNs.
A key observation is that all models—even top-performing transformers—continued to struggle with the fear category, reflecting its inherent complexity and possibly lower representation in the dataset. However, transformers displayed greater capacity to generalize across emotions and maintain balanced classification, suggesting they are more suited to the nuanced affective states often exhibited by autistic individuals.
The results indicate that transformer-based architectures outperform traditional CNN models in recognizing emotions in autistic children. Transformers matched or exceeded the CNNs on nearly all aggregate metrics, with the Swin Transformer achieving the best overall results. This can be attributed to the inherent architectural differences between transformers and CNNs [41,42]. Transformers utilize self-attention mechanisms that allow them to capture long-range dependencies and global contextual information within images, which is crucial for accurately interpreting subtle and complex facial expressions [43]. In contrast, CNNs rely on localized convolutional filters that focus on spatially limited regions, potentially missing important global cues necessary for nuanced emotion recognition, especially in autistic children whose expressions may be atypical or less exaggerated [44]. Among CNNs, ResNet152V2 performed best, likely due to its deeper architecture and residual connections that enable better feature extraction and gradient flow [45,46]. However, it still lagged behind the best transformers, suggesting that simply increasing depth may not fully compensate for CNNs’ limited ability to model complex relational features across the entire face.
Emotion-specific performance differences further elucidate why transformers have an advantage. Joy was the easiest emotion to detect because joyful expressions typically involve distinct, strong visual cues, such as smiling, that are easier to learn and recognize [17]. Transformers’ capacity to integrate diverse facial regions likely enhances their ability to capture these consistent features robustly [47]. Fear was the most challenging emotion across all models. Fear expressions can be subtle, often involving micro-expressions or slight muscle tensions that vary widely between individuals, especially among autistic children who may express fear differently or less overtly [37,44]. Transformers showed some capability here, plausibly because their holistic attention approach can aggregate weaker, dispersed signals better than CNNs’ local filters. Nevertheless, the limited data on fear expressions and their inherent subtlety constrained overall detection performance.
Anger detection fell between joy and fear in difficulty. Angry expressions can vary significantly in intensity and presentation, which likely requires models to generalize across diverse facial cues, a task that transformers are better equipped to handle given their global context modeling [41]. The confusion matrices confirm that misclassifications predominantly involve fear and anger being mistaken for natural or other emotions. This suggests that these emotions either share overlapping visual features or are underrepresented in the training data, emphasizing the need for richer datasets.
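The contrast between global self-attention and local filtering discussed above can be illustrated with a minimal sketch in plain Python. This is not the architecture used in the study: the six two-dimensional patch vectors, the identity projections (queries, keys, and values all equal to the input), and the helper names `self_attention` and `local_average` are assumptions chosen purely for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections.

    tokens: list of equal-length feature vectors (one per image patch).
    Returns (outputs, weights); weights[i][j] is how much patch i attends to patch j.
    """
    d = len(tokens[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
              for q in tokens]
    weights = [softmax(row) for row in scores]
    outputs = [[sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
               for row in weights]
    return outputs, weights

def local_average(tokens, radius=1):
    """Stand-in for a local filter: each patch only sees neighbours within `radius`."""
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - radius): i + radius + 1]
        out.append([sum(v[c] for v in window) / len(window) for c in range(d)])
    return out

# Hypothetical 1-D strip of six patches; patches 0 and 5 carry a similar cue.
patches = [[1.0, 0.2], [0.0, 0.1], [0.1, 0.0], [0.0, 0.2], [0.1, 0.1], [0.9, 0.3]]
attended, attn = self_attention(patches)
local = local_average(patches)

print("patch 0 attention weights:", [round(w, 3) for w in attn[0]])
print("patch 0 local output:     ", local[0])
```

In this toy setting every patch receives a non-zero attention weight from patch 0, and the distant but similar patch 5 is weighted more heavily than the adjacent patch 1, whereas the local average for patch 0 is computed from patches 0 and 1 only. Real CNNs enlarge their receptive field with depth, so the sketch exaggerates the difference for a single layer.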
An important consideration in interpreting these results is the dataset composition and its similarity to previous studies on autistic emotion recognition. In this work, the dataset comprised 770 total images, split into 615 training and 155 validation samples, with the validation set containing natural (37), anger (15), fear (10), and joy (93) images. This mirrors the FERAC study by Afrin et al. [33], which also used 770 images in four classes but with a 90:10 train–validation split, leaving only 5 test samples for fear and 8 for anger. In the study by Talaat [28], the class imbalance was even more pronounced, with some categories, such as anger and fear, having only 3 test samples each in a 758/72 train–validation configuration across six classes. While Afrin et al. [33] applied targeted data augmentation (rotations, flips, shifts, zoom, brightness adjustments) to partially balance the training data, Talaat [28] relied on model robustness without explicit balancing. In all three cases, imbalance persisted in the evaluation set, meaning that aggregate performance metrics could mask substantial performance disparities in minority classes.
From a performance perspective, our top-performing model, the Swin Transformer, achieved an accuracy of 0.8000, F1-score of 0.7889, precision of 0.7903, and recall of 0.8000. This exceeds the best result in Afrin et al.’s work, where the DenseNet121 + MobileNetV2 hybrid achieved 0.7500 accuracy, 0.6000 precision, 0.5700 recall, and 0.5600 F1-score [33]. In contrast, Talaat reported substantially higher nominal performance, with the Xception model reaching 0.9523 accuracy, 0.9320 sensitivity, 0.9421 specificity, and 0.9134 AUC [28]. However, these results were obtained on a validation set in which several classes, including anger and fear, contained only three samples each, rendering per-class accuracy highly unstable and limiting the reliability of aggregate measures.
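The point that aggregate metrics can mask minority-class disparities can be checked against the reported numbers. The sketch below assumes that the aggregate F1-scores in Table 2 are support-weighted averages of the per-class F1-scores in Table 3; the text does not state this explicitly, but it is consistent with recall equalling accuracy in every row of Table 2. The function names `weighted_f1` and `macro_f1` are illustrative.

```python
# Validation-set sizes per class (Table 1) and per-class F1 for Swin (Table 3).
support = {"natural": 37, "anger": 15, "fear": 10, "joy": 93}
swin_f1 = {"natural": 0.6269, "anger": 0.6400, "fear": 0.4000, "joy": 0.9192}

def weighted_f1(f1, support):
    """Average per-class F1, weighting each class by its number of samples."""
    total = sum(support.values())
    return sum(f1[c] * support[c] for c in f1) / total

def macro_f1(f1):
    """Average per-class F1 with every class counted equally."""
    return sum(f1.values()) / len(f1)

print(f"weighted F1: {weighted_f1(swin_f1, support):.4f}")
print(f"macro F1:    {macro_f1(swin_f1):.4f}")
```

Under that assumption the weighted figure comes out at about 0.789, in line with the 0.7889 reported for Swin, while the unweighted macro average is noticeably lower (about 0.65) because fear and anger count as much as joy. Reporting both would make the minority-class gap visible at a glance.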
Across all three studies, dominant classes such as joy and natural consistently achieved the highest recognition rates, whereas minority and visually subtle categories like fear remained challenging. This recurring trend underscores the need for larger, more balanced datasets and for evaluation strategies, such as cross-validation or external dataset testing, that provide a more reliable assessment of model generalizability.
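As a rough illustration of the cross-validation strategy mentioned above, the following sketch builds stratified folds from the class totals in Table 1 using only the Python standard library. It was not part of this study's pipeline; the helper `stratified_folds`, the choice of five folds, and the synthetic `labels` list (which stands in for the real image labels) are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Class totals from Table 1; the labels list stands in for the real image labels.
counts = {"natural": 184, "anger": 74, "fear": 49, "joy": 463}
labels = [name for name, n in counts.items() for _ in range(n)]

folds = stratified_folds(labels, k=5)
for i, held_out in enumerate(folds):
    per_class = Counter(labels[j] for j in held_out)
    print(f"fold {i}: n={len(held_out)} {dict(per_class)}")
```

Each model would then be trained k times, validating on a different fold each time, so that all 49 fear images contribute to evaluation rather than only the 10 in a single validation split. The cost is k training runs per model, which is the trade-off any such scheme has to weigh.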
Despite promising results, several limitations remain. The poor recognition of fear likely stems from dataset imbalance and a lack of distinctive features for this emotion, indicating a need for more fear samples or additional data modalities (e.g., audio or physiological signals) to improve detection. The dataset’s size and diversity may also restrict the models’ generalizability, as autistic children display a wide range of emotional expressions influenced by individual differences not fully captured here. Increasing participant diversity and emotional contexts would strengthen robustness. The dataset’s limited scale also prevented the application of computationally intensive validation schemes such as k-fold cross-validation for both CNNs and ViTs; instead, we mitigated this by employing extensive data augmentation and preprocessing. Furthermore, focusing solely on facial expressions overlooks other important emotional cues, especially since autistic children may express emotions differently. Incorporating multimodal data and context could further enhance recognition performance and clinical applicability. Finally, the absence of multiple training runs with varying seeds precluded formal statistical testing; future work will address this through repeated experiments on larger, multi-center datasets.
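In the absence of repeated runs, one simple way to gauge the uncertainty of a single-split accuracy is a binomial confidence interval on the 155 validation images. The sketch below uses the Wilson score interval; it is a back-of-the-envelope check rather than an analysis performed in this study, and the correct-prediction counts (124 and 116) are inferred by multiplying the reported accuracies by 155.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

n_val = 155  # validation images (Table 1)
# Counts inferred from reported accuracies: 0.8000 * 155 = 124; 0.7484 * 155 is about 116.
swin = wilson_interval(124, n_val)
resnet = wilson_interval(116, n_val)

print(f"Swin        95% CI: {swin[0]:.3f} to {swin[1]:.3f}")
print(f"ResNet152V2 95% CI: {resnet[0]:.3f} to {resnet[1]:.3f}")
```

The interval for Swin spans roughly 0.73 to 0.86 and overlaps with that of ResNet152V2, which is consistent with the caution expressed above about drawing statistical conclusions from one split. Because both models were evaluated on the same images, a paired test such as McNemar's test on the per-image predictions would be more appropriate than comparing intervals, but it requires the individual predictions, which are not reported here.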

5. Conclusions and Future Work

This study investigated and compared the performance of CNNs and transformer-based models for emotion recognition in autistic children, a task complicated by atypical and often subtle emotional expressions. The models were evaluated across key performance metrics, including accuracy, F1-score, precision, and recall. Among CNNs, ResNet152V2 emerged as the best performer, achieving an accuracy of 0.7484 and an F1-score of 0.7048. However, all CNN models consistently failed to recognize the fear emotion and showed limited effectiveness in detecting anger. These limitations highlight the difficulty traditional models face when dealing with nuanced affective cues commonly exhibited by autistic individuals. Transformer-based models generally outperformed CNNs on the aggregate metrics. The Swin Transformer achieved the highest overall performance, with an accuracy of 0.8000 and an F1-score of 0.7889. It also delivered improved recognition of challenging emotions such as fear and anger, where CNNs struggled, although ResNet152V2 retained a marginally higher F1-score on the natural class (0.6341 versus 0.6269). Confusion matrix analysis confirmed Swin’s ability to distribute errors more evenly and correctly classify a broader range of emotional states. This suggests that the global attention mechanisms of transformer models provide an advantage in understanding complex facial patterns.
Despite these promising findings, limitations remain, including class imbalance and reliance solely on visual data. Future research should explore multimodal approaches that integrate facial imagery with additional signals such as speech, body posture, gaze tracking, and physiological data, which may enhance the recognition of subtle emotions in autistic individuals. Another promising avenue is the incorporation of emotion recognition models into virtual reality (VR) environments for therapeutic and educational purposes. VR-based social simulations could provide controlled, immersive contexts for practicing emotional understanding, while automated emotion recognition could offer real-time feedback to participants and therapists. Such integration would require addressing challenges related to data synchronization, latency, and user comfort but holds considerable potential for enhancing intervention effectiveness.

Author Contributions

Conceptualization, P.R.; methodology, P.R.; software, P.R.; validation, G.M.; formal analysis, P.R.; investigation, P.R.; resources, P.R.; data curation, P.R.; writing—original draft preparation, P.R.; writing—review and editing, P.R. and G.M.; visualization, P.R.; supervision, G.M.; project administration, G.M.; funding acquisition, P.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code developed in this study is available from the corresponding author upon request. The Facial Emotion Recognition-Autistic Children (FERAC) Dataset, an open-access resource, is available for download at https://www.kaggle.com/datasets/rajasreechaiti/ferac-dataset, accessed on 29 August 2025.

Conflicts of Interest

Author Petra Radočaj was employed by the company Layer d.o.o. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Sauer, A.K.; Stanton, J.E.; Hans, S.; Grabrucker, A.M. Autism Spectrum Disorders: Etiology and Pathology. In Autism Spectrum Disorders; Grabrucker, A.M., Ed.; Exon Publications: Brisbane, Australia, 2021; ISBN 978-0-645-00178-5.
  2. Hirota, T.; King, B.H. Autism Spectrum Disorder: A Review. JAMA 2023, 329, 157–168.
  3. Pandey, R.; Bhushan, B. Facial Expression Databases and Autism Spectrum Disorder: A Scoping Review. Autism Res. Off. J. Int. Soc. Autism Res. 2025, 18, 1314–1329.
  4. Prakash, V.G.; Kohli, M.; Kohli, S.; Prathosh, A.P.; Wadhera, T.; Das, D.; Panigrahi, D.; Kommu, J.V.S. Computer Vision-Based Assessment of Autistic Children: Analyzing Interactions, Emotions, Human Pose, and Life Skills. IEEE Access 2023, 11, 47907–47929.
  5. ElMahalawy, J.; ElSwaify, Y.A.; Elliboudy, D.; Abbas, O.M.; Moustafa, N.; Wael, N. AI-Powered Human-Computer Interaction Assisting Early Identification of Emotional and Facial Symptoms of Autism Spectrum Disorder in Children: “A Deep Learning-Based Enhanced Facial Feature Recognition System”. In Proceedings of the 2024 International Conference on Machine Intelligence and Smart Innovation (ICMISI), Alexandria, Egypt, 12–14 May 2024; pp. 87–93.
  6. Al-Nefaie, A.H.; Aldhyani, T.H.H.; Ahmad, S.; Alzahrani, E.M. Application of Artificial Intelligence in Modern Healthcare for Diagnosis of Autism Spectrum Disorder. Front. Med. 2025, 12, 1569464.
  7. Altered Interactive Dynamics of Gaze Behavior During Face-to-Face Interaction in Autistic Individuals: A Dual Eye-Tracking Study. Molecular Autism. Available online: https://molecularautism.biomedcentral.com/articles/10.1186/s13229-025-00645-5 (accessed on 31 July 2025).
  8. Hedger, N.; Chakrabarti, B. Autistic Differences in the Temporal Dynamics of Social Attention. Autism Int. J. Res. Pract. 2021, 25, 1615–1626.
  9. Ellis, K.; White, S.; Dziwisz, M.; Agarwal, P.; Moss, J. Visual Attention Patterns during a Gaze Following Task in Neurogenetic Syndromes Associated with Unique Profiles of Autistic Traits: Fragile X and Cornelia de Lange Syndromes. Cortex 2024, 174, 110–124.
  10. Yeung, M.K. A Systematic Review and Meta-Analysis of Facial Emotion Recognition in Autism Spectrum Disorder: The Specificity of Deficits and the Role of Task Characteristics. Neurosci. Biobehav. Rev. 2022, 133, 104518.
  11. Black, M.H.; Chen, N.T.M.; Iyer, K.K.; Lipp, O.V.; Bölte, S.; Falkmer, M.; Tan, T.; Girdler, S. Mechanisms of Facial Emotion Recognition in Autism Spectrum Disorders: Insights from Eye Tracking and Electroencephalography. Neurosci. Biobehav. Rev. 2017, 80, 488–515.
  12. Abdullah, N.M.; Al-Allaf, A.F. Facial Expression Recognition (FER) of Autism Children Using Deep Neural Networks. In Proceedings of the 2021 4th International Iraqi Conference on Engineering Technology and Their Applications (IICETA), Najaf, Iraq, 21–22 September 2021; pp. 111–116.
  13. Syed, A.J.; Durrani, D.J.; Shahid, N.; Khan, W.; Muhammad, A. Expression Detection of Autistic Children Using CNN Algorithm. In Proceedings of the 2023 Global Conference on Wireless and Optical Technologies (GCWOT), Malaga, Spain, 24–27 January 2023; pp. 1–5.
  14. Jaffar, S.S.; Abdulbaqi, H.A. Facial Expression Recognition in Static Images for Autism Children Using CNN Approaches. In Proceedings of the 2022 Fifth College of Science International Conference of Recent Trends in Information Technology (CSCTIT), Baghdad, Iraq, 15–16 November 2022; pp. 202–207.
  15. Wang, Y.; Deng, Y.; Zheng, Y.; Chattopadhyay, P.; Wang, L. Vision Transformers for Image Classification: A Comparative Survey. Technologies 2025, 13, 32.
  16. Elharrouss, O.; Himeur, Y.; Mahmood, Y.; Alrabaee, S.; Ouamane, A.; Bensaali, F.; Bechqito, Y.; Chouchane, A. ViTs as Backbones: Leveraging Vision Transformers for Feature Extraction. Inf. Fusion 2025, 118, 102951.
  17. Ding, Y.; Zhang, H.; Qiu, T. Deep Learning Approach to Predict Autism Spectrum Disorder: A Systematic Review and Meta-Analysis. BMC Psychiatry 2024, 24, 739.
  18. Ke, F.; Choi, S.; Kang, Y.H.; Cheon, K.-A.; Lee, S.W. Exploring the Structural and Strategic Bases of Autism Spectrum Disorders with Deep Learning. IEEE Access 2020, 8, 153341–153352.
  19. Farhat, T.; Akram, S.; Rashid, M.; Jaffar, A.; Bhatti, S.M.; Iqbal, M.A. A Deep Learning-Based Ensemble for Autism Spectrum Disorder Diagnosis Using Facial Images. PLoS ONE 2025, 20, e0321697. Available online: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0321697 (accessed on 31 July 2025).
  20. Meyer-Lindenberg, H.; Moessnang, C.; Oakley, B.; Ahmad, J.; Mason, L.; Jones, E.J.H.; Hayward, H.L.; Cooke, J.; Crawley, D.; Holt, R.; et al. Facial Expression Recognition Is Linked to Clinical and Neurofunctional Differences in Autism. Mol. Autism 2022, 13, 43.
  21. Corluka, N.; Laycock, R. The Influence of Dynamism and Expression Intensity on Face Emotion Recognition in Individuals with Autistic Traits. Cogn. Emot. 2024, 38, 635–644.
  22. Dia, M.; Khodabandelou, G.; Sabri, A.Q.M.; Othmani, A. Video-Based Continuous Affect Recognition of Children with Autism Spectrum Disorder Using Deep Learning. Biomed. Signal Process. Control 2024, 89, 105712.
  23. Khor, S.W.H.; Md Sabri, A.Q.; Othmani, A. Autism Classification and Monitoring from Predicted Categorical and Dimensional Emotions of Video Features. Signal Image Video Process. 2024, 18, 191–198.
  24. Milling, M.; Baird, A.; Bartl-Pokorny, K.D.; Liu, S.; Alcorn, A.M.; Shen, J.; Tavassoli, T.; Ainger, E.; Pellicano, E.; Pantic, M.; et al. Evaluating the Impact of Voice Activity Detection on Speech Emotion Recognition for Autistic Children. Front. Comput. Sci. 2022, 4, 837269.
  25. FERAC Dataset. Available online: https://www.kaggle.com/datasets/rajasreechaiti/ferac-dataset (accessed on 31 July 2025).
  26. Islam, M.M. The Impact of Transfer Learning on AI Performance Across Domains. J. Artif. Intell. Gen. Sci. 2024, 1, 1–4.
  27. Vanaja, D.S.; Arockia Raj, J. AI-Enhanced IoT Tool for Emotional and Social Development in Children with Autism. Int. J. High Speed Electron. Syst. 2025, 2540148. Available online: https://www.worldscientific.com/doi/10.1142/S0129156425401482?srsltid=AfmBOoo_5VRFMRuybfFDxDhc_GbQ7OIjmyYw3DjMyYXbhyeUOi9abV1M (accessed on 31 July 2025).
  28. Talaat, F.M. Real-Time Facial Emotion Recognition System among Children with Autism Based on Deep Learning and IoT. Neural Comput. Appl. 2023, 35, 12717–12728.
  29. Haider, F.; Pollak, S.; Albert, P.; Luz, S. Emotion Recognition in Low-Resource Settings: An Evaluation of Automatic Feature Selection Methods. Comput. Speech Lang. 2021, 65, 101119.
  30. Li, Y.; Huang, W.-C.; Song, P.-H. A Face Image Classification Method of Autistic Children Based on the Two-Phase Transfer Learning. Front. Psychol. 2023, 14, 1226470.
  31. Muthukkumar, R. Enhancing the Identification of Autism Spectrum Disorder in Facial Expressions Using DenseResNet-Based Transfer Learning Approach. Biomed. Signal Process. Control 2025, 103, 107433.
  32. Akter, T.; Ali, M.H.; Khan, M.I.; Satu, M.S.; Uddin, M.J.; Alyami, S.A.; Ali, S.; Azad, A.; Moni, M.A. Improved Transfer-Learning-Based Facial Recognition Framework to Detect Autistic Children at an Early Stage. Brain Sci. 2021, 11, 734.
  33. Afrin, M.; Hoque, K.E.; Chaiti, R.D. Emotion Recognition of Autistic Children from Facial Images Using Hybrid Model. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–6.
  34. Sandeep, P.V.K.; Kumar, N.S. Pain Detection through Facial Expressions in Children with Autism Using Deep Learning. Soft Comput. 2024, 28, 4621–4630.
  35. Wang, Y.; Pan, K.; Shao, Y.; Ma, J.; Li, X. Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features. Appl. Sci. 2025, 15, 3083.
  36. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53.
  37. Pávez, R.; Díaz, J.; Arango-López, J.; Ahumada, D.; Méndez, C.; Moreira, F. Emotion Recognition in Children with Autism Spectrum Disorder Using Convolutional Neural Networks. In Proceedings of the Trends and Applications in Information Systems and Technologies; Rocha, Á., Adeli, H., Dzemyda, G., Moreira, F., Ramalho Correia, A.M., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 585–595.
  38. Alamgir, F.M.; Zaman, T.; Hassan, M.M.; Jonayed, M.R.; Alam, M.S. Classification Model for Autism Spectrum Disorder Individuals: Utilizing Facial Grid-Wise Emotion Features and Dual-Branch Visual Transformation. In Proceedings of the 2024 IEEE International Conference on Power, Electrical, Electronics and Industrial Applications (PEEIACON), Rajshahi, Bangladesh, 12–13 September 2024; pp. 864–869.
  39. Indra Devi, K.B.; Durai Raj Vincent, P.M. The Emergence of Artificial Intelligence in Autism Spectrum Disorder Research: A Review of Neuro Imaging and Behavioral Applications. Comput. Sci. Rev. 2025, 56, 100718.
  40. Alharthi, A.G.; Alzahrani, S.M. Do It the Transformer Way: A Comprehensive Review of Brain and Vision Transformers for Autism Spectrum Disorder Diagnosis and Classification. Comput. Biol. Med. 2023, 167, 107667.
  41. Shi, R.; Li, T.; Zhang, L.; Yamaguchi, Y. Visualization Comparison of Vision Transformers and Convolutional Neural Networks. IEEE Trans. Multimed. 2024, 26, 2327–2339.
  42. Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A Survey of the Vision Transformers and Their CNN-Transformer Based Variants. Artif. Intell. Rev. 2023, 56, 2917–2970.
  43. Ma, F.; Sun, B.; Li, S. Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion. IEEE Trans. Affect. Comput. 2023, 14, 1236–1248.
  44. Pereira, R.; Mendes, C.; Ribeiro, J.; Ribeiro, R.; Miragaia, R.; Rodrigues, N.; Costa, N.; Pereira, A. Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors 2024, 24, 3484.
  45. Hwooi, S.K.W.; Othmani, A.; Sabri, A.Q.M. Deep Learning-Based Approach for Continuous Affect Prediction from Facial Expression Images in Valence-Arousal Space. IEEE Access 2022, 10, 96053–96065.
  46. Ahmadiar, A.; Melinda, M.; Muthiah, Z.; Zainal, Z.; Rizky, M.M. Thermal Image Classification of Autistic Children Using Res-Net Architecture. Indones. J. Electron. Electromed. Eng. Med. Inform. 2025, 7, 1–10.
  47. Tagmatova, Z.; Umirzakova, S.; Kutlimuratov, A.; Abdusalomov, A.; Im Cho, Y. A Hyper-Attentive Multimodal Transformer for Real-Time and Robust Facial Expression Recognition. Appl. Sci. 2025, 15, 7100.
Figure 1. Workflow for emotion recognition in children with autism using deep learning. Step 1 involves preprocessing FERAC facial expression data with augmentation, including rotation, zoom, resizing, and flipping across four emotion classes: natural, anger, fear, and joy. Step 2 consists of training and evaluating CNN models (InceptionV3, InceptionResNetV2, DenseNet201, ResNet152V2) and transformer-based models (ViT, ViT-DeiT, CvT, Swin) using Keras and TensorFlow. Step 3 covers emotion classification and performance evaluation using an 80/20 stratified train-validation split with metrics such as accuracy, precision, recall, and F1-score derived from the confusion matrix.
Figure 2. Samples of emotional facial expression images representing different emotions were used in the experiment.
Figure 3. The confusion matrices for the models presented in this study.
Table 1. Dataset distribution.

| Category | Total Images | Training Images | Validation Images |
|---|---|---|---|
| Natural | 184 | 147 | 37 |
| Anger | 74 | 59 | 15 |
| Fear | 49 | 39 | 10 |
| Joy | 463 | 370 | 93 |
| Total | 770 | 615 | 155 |
Table 2. Detailed performance metrics of CNN and transformer models on emotion recognition in autistic children.

| Deep Learning Approach | Model | Accuracy | F1-Score | Precision | Recall |
|---|---|---|---|---|---|
| CNN | InceptionV3 | 0.7290 | 0.6868 | 0.6802 | 0.7290 |
| CNN | InceptionResNetV2 | 0.7161 | 0.6774 | 0.6532 | 0.7161 |
| CNN | DenseNet201 | 0.7355 | 0.7007 | 0.6755 | 0.7355 |
| CNN | ResNet152V2 | 0.7484 | 0.7048 | 0.7328 | 0.7484 |
| Transformer | ViT | 0.7613 | 0.7341 | 0.7272 | 0.7613 |
| Transformer | ViT-DeiT | 0.7484 | 0.7395 | 0.7347 | 0.7484 |
| Transformer | CvT | 0.7613 | 0.7300 | 0.7525 | 0.7613 |
| Transformer | Swin | **0.8000** | **0.7889** | **0.7903** | **0.8000** |

The highest assessment metrics are bolded.
Table 3. The F1-score values for emotion recognition in autistic children per model.

| Deep Learning Approach | Model | Natural | Anger | Fear | Joy |
|---|---|---|---|---|---|
| CNN | InceptionV3 | 0.5676 | 0.4000 | 0.0000 | 0.8544 |
| CNN | InceptionResNetV2 | 0.5352 | 0.4167 | 0.0000 | 0.8488 |
| CNN | DenseNet201 | 0.5882 | 0.4138 | 0.0000 | 0.8670 |
| CNN | ResNet152V2 | **0.6341** | 0.2353 | 0.0000 | 0.8844 |
| Transformer | ViT | 0.5312 | 0.4444 | 0.2667 | 0.9118 |
| Transformer | ViT-DeiT | 0.5075 | 0.6000 | 0.3000 | 0.9016 |
| Transformer | CvT | 0.6076 | 0.1667 | 0.1818 | **0.9286** |
| Transformer | Swin | 0.6269 | **0.6400** | **0.4000** | 0.9192 |

The highest assessment metrics are bolded.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Radočaj, P.; Martinović, G. Emotion Recognition in Autistic Children Through Facial Expressions Using Advanced Deep Learning Architectures. Appl. Sci. 2025, 15, 9555. https://doi.org/10.3390/app15179555
