Article

A Real-Time Multi-Modal Computer Vision Framework for Automated Autism Spectrum Disorder Screening

by Lehel Dénes-Fazakas 1,2, Ioan Catalin Mateas 3, Alexandru George Berciu 3, László Szilágyi 1,2, Levente Kovács 1,2 and Eva-H. Dulf 2,3,*

1 Biomatics and Applied Artificial Intelligence Institute, John von Neumann Faculty of Informatics, Obuda University, 1034 Budapest, Hungary
2 Physiological Controls Research Center, University Research and Innovation Center, Obuda University, 1034 Budapest, Hungary
3 Automation Department, Faculty of Automation and Computer Science, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania
* Author to whom correspondence should be addressed.
Electronics 2026, 15(6), 1287; https://doi.org/10.3390/electronics15061287
Submission received: 31 January 2026 / Revised: 2 March 2026 / Accepted: 16 March 2026 / Published: 19 March 2026
(This article belongs to the Special Issue Computer Vision and Machine Learning for Biometric Systems)

Abstract

Background: Early detection of autism spectrum disorder (ASD) is critical for improving long-term developmental outcomes, yet conventional screening depends on time-consuming, expert-driven behavioral assessments and scales poorly. Automated video-based analysis offers a noninvasive, objective means of extracting behavioral biomarkers from naturalistic recordings. Methods: A modular multimodal framework was developed that integrates motion-based video analysis and facial feature extraction for ASD versus typically developing (TD) classification. The system processes RGB videos, skeleton/stickman representations, and motion trajectory streams. A comprehensive set of kinematic features was extracted, including joint trajectories, velocity and acceleration profiles, posture variability, movement smoothness, and bilateral asymmetry. Repetitive stereotypical behaviors were characterized using frequency-domain analysis (FFT) within the 0.3–7.0 Hz band. Facial expression features derived from normalized face crops, together with landmark-based morphological descriptors, were integrated as complementary modalities. Feature-level fusion was performed after z-score normalization, and classification used a Random Forest model with stratified 5-fold cross-validation. GPU acceleration enabled near real-time inference. Results: The motion-based ComplexVideos pipeline achieved a cross-validated accuracy of 94.2 ± 2.1% with an area under the ROC curve (AUC) of 0.93. Skeleton-based KinectStickman inputs yielded moderate performance (60–80% accuracy), and facial-only models reached approximately 60% accuracy. Multimodal feature fusion improved classification robustness and reduced false negatives, surpassing all single-modality models. Mean inference time remained below one second per video frame under standard operating conditions. Conclusions: These results demonstrate that integrating motion and facial cues enables effective and efficient video-based ASD screening. The proposed framework offers a scalable, extensible, and computationally efficient solution that can support early screening in clinical and remote assessment settings.

1. Introduction

Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by persistent deficits in social interaction and communication, as well as the presence of repetitive and stereotypical behavioral patterns. Recent epidemiological studies indicate that approximately one in 54 children worldwide is affected by ASD [1], highlighting its growing societal and clinical significance. Early identification and timely intervention have been shown to significantly influence long-term developmental outcomes and quality of life for individuals on the autism spectrum. However, conventional diagnostic approaches remain challenging, particularly during early childhood, due to their reliance on subjective behavioral observations and the heterogeneity of symptom manifestation.
Typically, ASD is diagnosed during early childhood through expert-driven behavioral assessments. While core symptoms are present across individuals, the clinical presentation varies substantially, complicating standardized evaluation. Subtle motor anomalies, such as repetitive hand movements, postural instability, or asymmetric motor patterns, may emerge as early as two years of age but are often overlooked by caregivers or primary healthcare providers. This diagnostic latency can delay intervention, reducing the effectiveness of therapeutic strategies.
Advances in artificial intelligence and computer vision technologies have created new opportunities for automated behavioral analysis. Video-based assessment systems enable the extraction and quantitative analysis of movement patterns, posture dynamics, and interaction behaviors. These technologies allow continuous monitoring and objective measurement of behavioral indicators associated with ASD. Compared to traditional manual observation methods, automated video analysis offers superior scalability, reproducibility, and the ability to process large volumes of data efficiently. Furthermore, such systems can be deployed in clinical environments, educational settings, and home-based screening applications, thereby expanding accessibility to early screening tools [2].
Importantly, automated video analysis is not intended to replace clinical diagnosis but rather to complement existing assessment frameworks by providing preliminary screening support. By reducing diagnostic latency and supporting earlier therapeutic intervention, these systems have the potential to improve clinical workflows and patient outcomes.
In this study, we propose an integrated video-based autism screening framework that leverages multi-modal visual representations, including raw video data, skeleton-based stickman representations, and motion trajectory information. The system extracts motion-related features associated with stereotypical behavior, postural variability, and bilateral asymmetry. Spectral analysis techniques are applied to identify repetitive movement patterns within clinically relevant frequency ranges. To ensure computational efficiency, the proposed pipeline is optimized for GPU acceleration using CUDA-enabled architectures, enabling near real-time processing [3].
The proposed framework is evaluated on multiple publicly available datasets, including the ComplexVideos dataset [4] and the KinectStickman dataset [5], enabling cross-dataset validation and robustness assessment. Additionally, a facial micro-expression analysis module is incorporated to enhance classification performance through multi-source feature fusion. The overall architecture is designed to be modular and extensible, facilitating integration with future detection models and multi-modal diagnostic systems.
The experimental results demonstrate that the proposed approach achieves high classification accuracy while maintaining real-time processing capabilities. These findings support the feasibility of deploying cost-effective, scalable, and accessible video-based screening tools for early autism detection, contributing to the advancement of AI-assisted neurodevelopmental assessment technologies [6].

2. Related Work

2.1. Early Screening and Clinical Assessment Limitations

Autism spectrum disorder (ASD) comprises a heterogeneous group of neurodevelopmental conditions characterized by impairments in social communication and the presence of repetitive and restricted behaviors [7,8,9,10]. Although early behavioral signs can be detectable from around 14 months, a substantial proportion of children are not diagnosed until after the age of four [11]. This diagnostic delay is clinically relevant because early intervention is strongly associated with improved developmental outcomes and quality of life for affected children and their families [12].
Current screening and diagnostic practice primarily relies on questionnaires and expert-administered instruments such as the Autism Diagnostic Observation Schedule (ADOS), which require specialized training, considerable time, and repeated assessments across development [7,12]. These constraints limit scalability and can exacerbate disparities in communities with insufficient clinical resources. Consequently, recent research has placed increasing focus on automated and noninvasive screening approaches based on video analysis and machine learning [11,13,14].

2.2. Video-Based Machine Learning Approaches

A prominent research direction involves automated recognition of ASD-relevant social behaviors from video. The Sapiro laboratory at Duke University proposed a machine learning framework for ASD detection by analyzing child–adult interaction videos and automatically quantifying behavioral markers such as gaze to face, gaze to objects, smiling, and vocalization [7]. Using computer vision-based facial and behavioral cues, their work demonstrated that automated behavioral inference is feasible and that combining selected features can yield diagnostic prediction performance comparable to traditional assessments [7].
Beyond single-modality pipelines, multimodal fusion has been explored to improve robustness under real-world conditions. Abid Ali et al. introduced a multimodal framework that combines RGB appearance with motion dynamics (e.g., optical flow) for recognizing stereotypical behaviors in unconstrained clinical videos [11]. Their study emphasized the scarcity of standardized datasets and addressed it by collecting and annotating new data, showing that modality fusion can outperform approaches relying on a single representation [11].

2.3. Pose Estimation and Skeleton-Based Representations

To reduce reliance on appearance and improve privacy, several methods transform videos into skeleton-based representations. Kojovic et al. investigated 2D pose estimation from videos using OpenPose and employed a CNN-LSTM architecture to classify ASD versus typically developing children based on non-verbal social interaction patterns [12]. Their results indicated that pose-driven temporal models can reach strong classification performance and that reliable prediction may be possible even from shorter video segments, which is important for practical screening scenarios [12].

2.4. Eye-Tracking and Sensor-Based Measurements

Eye-tracking has also been widely studied as a non-invasive modality for ASD risk assessment, motivated by differences in social attention patterns. Prior work evaluated portable eye-tracking systems (e.g., SMI RED250 in Figure 1) for analyzing gaze distribution across predefined regions of interest in short social videos and reported promising classification metrics [15,16]. However, eye-tracking solutions often require controlled laboratory conditions, can be sensitive to head motion, and may involve high equipment cost and calibration complexity, limiting deployment in community and home settings [15]. Alternative experimental paradigms using devices such as Tobii eye trackers in Figure 2 have shown high specificity but variable sensitivity, suggesting that gaze features may be particularly useful for ruling in risk when positive, yet may miss a subset of affected cases [17].
In addition to gaze, wearable and embedded sensors have been used to quantify repetitive movements [18,19]. For example, sensorized toys equipped with inertial measurement units (IMUs) combined with deep learning back-ends have been proposed to classify motion patterns during play in naturalistic environments, achieving high preliminary recognition performance across several movement classes [20]. Such systems enable longer-term monitoring outside specialized clinical laboratories, complementing video-only approaches [20].

2.5. Modeling Motor and Audio–Motor Stereotypies

Motor stereotypies are core symptoms in ASD, and automated video analysis has been used to characterize their frequency and typology. Wan et al. reported that vocal stereotypies may be under-investigated relative to motor stereotypies, and demonstrated that standardized video-based annotation can provide detailed phenotypic profiles, including associations with verbal ability and intellectual disability [17]. These findings support the value of multi-domain behavioral modeling (motor and vocal) for richer screening signals.

2.6. Deep Learning for Abnormal Hand and Body Movements

Deep learning models have been applied to detect self-stimulatory behaviors (e.g., hand flapping) from home-recorded videos. Studies based on datasets such as SSBD evaluated feature representations ranging from privacy-preserving hand landmarks (e.g., MediaPipe) to deep visual embeddings (e.g., MobileNetV2 features) combined with temporal classifiers such as LSTMs, demonstrating strong performance while highlighting a trade-off between privacy and accuracy [21]. More broadly, kinematic and imitation task-based classification has been explored using traditional machine learning (e.g., SVM with nested cross-validation), where combining kinematics and gaze features improved diagnostic prediction compared to either modality alone [22].

2.7. Facial Expression and Gesture Analysis

Facial expression analysis has been investigated as an additional screening signal, including approaches that capture webcam frames during interactive applications (e.g., video games) and train CNN-based classifiers on facial images of children with ASD and typically developing controls [23]. Gesture-centric methods have also been proposed: Zunino et al. analyzed short action videos (e.g., grasping tasks) and applied LSTM-based temporal modeling to distinguish ASD from controls, with attention mechanisms providing interpretability by highlighting salient motion regions [24]. These results suggest that combining facial, gesture, and full-body motion cues may improve robustness and clinical utility.

2.8. Summary and Motivation

Overall, the literature indicates that automated ASD screening benefits from (i) scalable video-based analysis, (ii) privacy-aware representations such as skeletons and landmarks, and (iii) multimodal fusion across appearance, motion, gaze, and facial/gesture cues. Remaining challenges include limited standardized datasets, cross-domain generalization, and the balance between deployability (low-cost sensors and standard cameras) and measurement fidelity. These considerations motivate integrated frameworks that support multi-source feature extraction and efficient, near-real-time inference in practical screening settings.

3. Materials and Methods

3.1. System Overview

We propose a modular multi-modal framework for early autism spectrum disorder (ASD) screening based on advanced video and facial analysis. The system is designed to process heterogeneous visual inputs through specialized modality-specific branches, followed by feature normalization and fusion into a unified decision pipeline. This design supports both (i) behavioral motion analysis from video streams and (ii) complementary facial cues from still images, enabling robust ASD versus typically developing (TD) discrimination.
Figure 3 summarizes the main components of the pipeline and their interactions. The architecture follows a stage-wise processing scheme: data ingestion and modality-specific preprocessing, feature extraction, classification, and multi-modal fusion.

3.2. Datasets and Input Modalities

3.2.1. Video Datasets

The primary dataset used for developing the video-based pipeline is ComplexVideos [4], comprising 100 subjects evenly distributed between ASD and TD groups. For each subject, the dataset provides synchronized recordings in three video modalities: standard RGB recordings, skeleton-like “stickman” representations, and motion-tracking videos based on colored trajectory markers. This multi-source structure enables extracting complementary movement descriptors from the same underlying behavior.
To support cross-dataset validation and assess generalization, we further used the KinectStickman dataset [5]. This dataset includes skeleton-based recordings captured with Kinect sensors and delivered as multiple segmented clips per subject. The recordings frequently involve natural activity contexts such as gameplay-based interactions, providing an ecologically valid motion analysis setting.

3.2.2. Image Datasets

Facial analysis was conducted using two complementary image datasets. AutisticChildrenEmotion [25] contains facial expression images organized into six emotional categories collected from children diagnosed with ASD. This dataset supports learning expression-related facial patterns.
In addition, FaceMorphology [26] provides high-resolution facial photographs distributed across ASD and non-ASD subjects, enabling extraction of static morphological descriptors such as geometric proportions, landmark-based distances, and facial symmetry indices.

3.2.3. External Validation Dataset

An independent ASD+TD validation dataset [27] was used to evaluate robustness beyond the primary training sources. This dataset contains previously unseen recordings from both ASD and TD groups and serves as an external benchmark for assessing real-world generalization.

3.3. Feature Extraction and Preprocessing

3.3.1. Motion Feature Extraction from Video

Motion features were extracted from pose and skeleton representations to capture clinically relevant behavioral markers. Extracted descriptors include joint trajectories, inter-joint angles, velocity profiles, acceleration dynamics, posture variability measures, and coordination indicators between body segments.
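To make these descriptors concrete, the sketch below shows one plausible NumPy computation over a pose sequence; the (T, J, 2) array layout, the 30 fps default, and the descriptor names are illustrative assumptions, not the exact implementation used in this study.

```python
import numpy as np

def kinematic_descriptors(joints: np.ndarray, fps: float = 30.0) -> dict:
    """Summary kinematics from a (T, J, 2) array, where joints[t, j]
    is the (x, y) position of joint j at frame t."""
    dt = 1.0 / fps
    vel = np.gradient(joints, dt, axis=0)    # per-joint velocity (T, J, 2)
    acc = np.gradient(vel, dt, axis=0)       # per-joint acceleration (T, J, 2)
    speed = np.linalg.norm(vel, axis=-1)     # scalar speed per joint (T, J)
    return {
        "mean_speed": float(speed.mean()),
        "speed_std": float(speed.std()),     # velocity variability
        "mean_accel": float(np.linalg.norm(acc, axis=-1).mean()),
        "posture_variability": float(joints.std(axis=0).mean()),  # temporal dispersion
    }
```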
Stereotypical Motion Detection via Spectral Analysis
Repetitive stereotypical behaviors were detected using frequency-domain analysis based on the Fast Fourier Transform (FFT). Motion signals were transformed into the spectral domain and analyzed within the 0.3–7.0 Hz frequency band corresponding to rhythmic repetitive movements associated with ASD-related stereotypies. The analysis was structured anatomically to capture hand-flapping dynamics, torso-rocking motion, and head oscillation patterns.
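A minimal sketch of this band-power measurement follows, assuming a 1-D kinematic signal (e.g., wrist speed for hand flapping, torso position for rocking) sampled at the video frame rate; the windowing and normalization actually used in the pipeline are not specified in the text.

```python
import numpy as np

def stereotypy_band_power(signal: np.ndarray, fps: float = 30.0,
                          band: tuple = (0.3, 7.0)) -> float:
    """Fraction of spectral power inside the 0.3-7.0 Hz stereotypy band."""
    x = signal - signal.mean()                    # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2           # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0
```

Applying the same function separately to hand, torso, and head signals mirrors the anatomical structuring described above.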
Bilateral Asymmetry, Variability, and Smoothness
Bilateral asymmetry was computed by comparing left–right kinematic trajectories across paired joints (e.g., shoulders, elbows, wrists, hips, knees, and ankles). Postural variability was quantified using temporal dispersion statistics, while movement smoothness was assessed through higher-order temporal derivatives capturing abrupt motion transitions.
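The following sketch illustrates both measures under stated assumptions: asymmetry compares left-right speed profiles (so the body's lateral extent does not inflate the score), and smoothness is proxied by mean squared jerk; the joint names and pairing scheme are hypothetical.

```python
import numpy as np

def bilateral_asymmetry(tracks: dict, pairs, fps: float = 30.0) -> float:
    """Mean absolute difference between left and right speed profiles.
    `tracks` maps joint names to (T, 2) coordinate arrays; `pairs` lists
    (left, right) name tuples such as ("l_wrist", "r_wrist")."""
    dt = 1.0 / fps
    scores = []
    for left, right in pairs:
        v_l = np.linalg.norm(np.gradient(tracks[left], dt, axis=0), axis=-1)
        v_r = np.linalg.norm(np.gradient(tracks[right], dt, axis=0), axis=-1)
        scores.append(np.abs(v_l - v_r).mean())
    return float(np.mean(scores))

def jerk_index(speed: np.ndarray, fps: float = 30.0) -> float:
    """Mean squared jerk of a 1-D speed profile; larger values indicate
    more abrupt, less smooth motion."""
    dt = 1.0 / fps
    jerk = np.gradient(np.gradient(speed, dt), dt)  # 2nd derivative of speed
    return float(np.mean(jerk ** 2))
```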

3.3.2. Statistical Characterization of Motion Distributions

To characterize the statistical behavior of the extracted motion features prior to classification, multiple distribution-based analyses were performed. These analyses include acceleration-related descriptors, temporal abruptness measures, mean tracking velocity, spatial tracking point distributions, and velocity variability metrics.
Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 illustrate the aggregated distributions computed across ASD and TD subject groups. These visualizations provide insight into inter-group separability, intra-class variability, and feature stability, and support the selection of discriminative motion descriptors for downstream classification. Across the kinematic measures examined, including average acceleration, mean tracking velocity, abruptness, and spatial distribution, the TD group contains one subject who is a consistent statistical outlier. This subject appears as an open circle in the box-and-whisker plots, marking a value above the distribution's upper fence, while the whiskers extend only to the largest non-outlier values for the TD group. Because this individual shows markedly elevated values across several motion parameters, plotting the subject separately confirms a large divergence from the TD cohort's central tendency; the quartile boundaries therefore reflect the group's overall kinematic profile without being skewed by this single case.
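For reference, the open-circle convention used in Figures 4–8 follows the standard Tukey box-plot rule, sketched below with the conventional k = 1.5 whisker multiplier.

```python
import numpy as np

def tukey_upper_fence(values: np.ndarray, k: float = 1.5) -> float:
    """Box-plot upper fence Q3 + k * IQR; points above it are outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    return float(q3 + k * (q3 - q1))

# Hypothetical usage: flag TD subjects whose mean acceleration is an outlier.
# mean_accel = np.array([...])                  # one value per TD subject
# outliers = mean_accel[mean_accel > tukey_upper_fence(mean_accel)]
```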

3.3.3. Facial Feature Extraction

For facial expression processing, face regions were detected and normalized to a standard resolution (64 × 64 pixels), with contrast normalization applied to improve robustness to illumination changes. Expression-related descriptors were derived from normalized facial representations.
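As an illustration, the preprocessing could look like the OpenCV sketch below; histogram equalization stands in for the unspecified contrast-normalization step, and the face detector supplying the bounding box is assumed.

```python
import cv2
import numpy as np

def normalize_face(frame: np.ndarray, box: tuple, size: int = 64) -> np.ndarray:
    """Crop a detected face, convert to grayscale, resize to size x size,
    and equalize contrast for illumination robustness.
    `box` is an (x, y, w, h) rectangle from any face detector."""
    x, y, w, h = box
    face = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size), interpolation=cv2.INTER_AREA)
    return cv2.equalizeHist(gray)
```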
For facial morphology analysis, landmark-based geometric features were computed using inter-landmark distances, angular relationships, and symmetry indices. These static descriptors provide complementary facial information independent of short-term expression dynamics.
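A sketch of such descriptors is shown below; the landmark index pairs and the midline-based symmetry index are illustrative choices for a frontal face, not the study's exact feature set.

```python
import numpy as np

def landmark_geometry(pts: np.ndarray, pairs, mirror) -> np.ndarray:
    """Static morphological features from an (N, 2) landmark array.
    `pairs` lists index pairs whose normalized distances are measured
    (e.g., inter-ocular, nose-to-chin); `mirror` maps each left-side
    landmark index to its right-side counterpart."""
    scale = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))  # face size
    dists = [np.linalg.norm(pts[i] - pts[j]) / scale for i, j in pairs]
    mid_x = pts[:, 0].mean()                                   # facial midline
    # A left point reflected about the midline should land on its mirror.
    asym = [abs((pts[l, 0] - mid_x) + (pts[r, 0] - mid_x)) / scale
            for l, r in mirror]
    return np.array(dists + [float(np.mean(asym))])
```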

3.3.4. Multi-Modal Feature Integration

Prior to integration, all modality-specific feature vectors were standardized using z-score normalization. Feature-level fusion was performed by concatenating normalized descriptors into a unified representation. Weighted fusion strategies were applied during training to balance the relative contribution of motion-based and facial-based features.
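A minimal sketch of this normalize-concatenate-weight scheme follows; the weight values are placeholders for those tuned during training, and in practice the scalers would be fit on training folds only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fuse_features(motion, expression, morphology,
                  weights=(1.0, 0.5, 0.5)) -> np.ndarray:
    """Z-score each (n_subjects, n_features) block, scale it by its
    modality weight, and concatenate into one fused representation.
    Note: fit the scalers on training data only in a real pipeline."""
    blocks = [StandardScaler().fit_transform(b) * w
              for b, w in zip((motion, expression, morphology), weights)]
    return np.concatenate(blocks, axis=1)
```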

3.4. Classification Models

3.4.1. Random Forest Classification

A Random Forest classifier [28,29] was employed as the primary decision model due to its robustness on heterogeneous feature spaces and its suitability for moderate dataset sizes. The ensemble learning strategy captures non-linear feature interactions while reducing overfitting through decision tree aggregation. Hyperparameters were optimized using grid-based cross-validation.
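The paper does not list its search space, so the grid below is hypothetical; it sketches the general grid-based cross-validation pattern with scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {                       # illustrative search space
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
# search.fit(X_fused, y)             # fused feature matrix and ASD/TD labels
```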

3.4.2. Transfer Learning and Fine-Tuning for Pose Estimation

Transfer learning and fine-tuning were applied to the pose estimation backbone to improve joint localization accuracy under occlusion, motion blur, and pediatric recording conditions. Task-specific annotated samples were used to adapt pretrained weights and enhance downstream feature reliability.
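Since the pose backbone is not named, the sketch below shows only the generic freeze-then-adapt pattern in PyTorch; the module prefix identifying the trainable head is a hypothetical placeholder.

```python
import torch

def prepare_finetune(model: torch.nn.Module, head_prefixes=("head",)) -> list:
    """Freeze pretrained backbone weights, leaving only head modules
    trainable, and return the trainable parameters for the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in head_prefixes)
        if param.requires_grad:
            trainable.append(param)
    return trainable

# optimizer = torch.optim.AdamW(prepare_finetune(pose_net), lr=1e-4)
```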

3.5. Evaluation Protocol

Performance evaluation followed stratified 5-fold cross-validation to preserve class balance across folds. Evaluation metrics include accuracy, sensitivity, specificity, F1-score, and receiver operating characteristic (ROC) curves with corresponding area-under-the-curve (AUC) values. Confusion matrices were analyzed to inspect class-wise error patterns, and repeated training runs were performed to assess result stability.
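A compact scikit-learn sketch of this protocol follows; the metric definitions match the list above, while the model and data symbols are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold

def cross_validate(model, X, y, seed=0):
    """Stratified 5-fold CV returning mean accuracy, sensitivity,
    specificity, F1-score, and AUC for binary (0 = TD, 1 = ASD) labels."""
    folds = []
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
        m = clone(model).fit(X[tr], y[tr])
        pred = m.predict(X[te])
        tn, fp, fn, tp = confusion_matrix(y[te], pred).ravel()
        folds.append({
            "accuracy": accuracy_score(y[te], pred),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "f1": f1_score(y[te], pred),
            "auc": roc_auc_score(y[te], m.predict_proba(X[te])[:, 1]),
        })
    return {k: float(np.mean([f[k] for f in folds])) for k in folds[0]}
```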

3.6. Implementation Details

The framework was implemented in Python 3.10 using standard computer vision, deep learning, and machine learning libraries. GPU acceleration via CUDA was employed to support efficient real-time video processing. Output data were exported in structured formats (CSV and JSON) to ensure reproducibility and facilitate downstream statistical analysis.

4. Results

4.1. Performance of Individual Modalities

The performance of the proposed framework was first evaluated independently on each data modality in order to quantify their individual contribution to autism spectrum disorder (ASD) detection.

4.1.1. ComplexVideos Motion-Based Classification

The motion-based model trained on the ComplexVideos dataset achieved strong discriminative performance between ASD and typically developing (TD) subjects. Using stratified 5-fold cross-validation, the classifier obtained an average accuracy in the range of 85–95%, with accuracy varying by no more than approximately ±3% across repeated runs.
Figure 9 presents the confusion matrix obtained on the evaluation set, demonstrating balanced classification performance with low false-positive and false-negative rates. The receiver operating characteristic (ROC) analysis yielded an area under the curve (AUC) exceeding 0.90, indicating high separability between the two behavioral groups.
To further analyze behavioral dynamics, multiple motion descriptors were compared between groups. Figure 10 and Figure 11 illustrate representative examples of extracted velocity, asymmetry, smoothness, and composite motion scores for ASD and TD subjects, respectively. Clear distributional differences can be observed, especially in stereotypical movement intensity and motion variability.

4.1.2. KinectStickman Skeleton-Based Classification

The skeleton-based KinectStickman model demonstrated moderate classification performance compared to the primary multi-modal video pipeline. Across evaluation runs, accuracy values ranged between 60 and 80%, depending on subject activity segments and recording variability.
Although skeleton representations reduce visual noise and privacy concerns, the simplified spatial encoding limits fine-grained behavioral cues, which explains the observed performance gap relative to RGB-based motion processing.

4.1.3. Facial Expression and Morphology Models

The facial expression classification pipeline achieved an average accuracy of approximately 60% on the evaluation dataset. This result indicates that emotion-based facial cues alone provide limited discriminative power for ASD screening when used independently.
Similarly, the facial morphology model achieved moderate classification performance. While static facial geometry contributes useful structural information, it is insufficient as a standalone biomarker and is better utilized as a complementary modality within the fused framework.

4.2. Multi-Modal Fusion Results

To evaluate the benefit of multi-modal integration, predictions from motion-based, facial expression, and morphology models were combined using weighted feature-level fusion.
The fused system consistently outperformed all individual modalities. Fusion improved classification robustness by reducing variance across folds and increasing overall accuracy beyond the best-performing single modality. In addition, false-negative rates were reduced, which is critical for screening-oriented clinical applications.
Figure 12 illustrates the fusion module architecture and information flow between modality-specific classifiers and the final decision layer.

4.3. Cross-Validation Stability Analysis

Model stability was assessed using stratified 5-fold cross-validation across all datasets. The ComplexVideos-based pipeline demonstrated consistent performance with limited accuracy fluctuation (±3%), indicating stable generalization behavior.
Repeated training runs confirmed that the learned feature representations were robust against random initialization and data-partitioning effects. This stability is essential for deployment in real-world screening environments where data variability is unavoidable.

4.4. Feature Importance Analysis

To interpret the contribution of extracted descriptors, feature importance scores were computed from the trained Random Forest classifier.
Figure 13 presents the ranked importance distribution of the most influential motion features. Acceleration-based metrics, motion smoothness indicators, and bilateral asymmetry descriptors emerged as dominant predictors, confirming the relevance of motor behavior abnormalities in ASD detection.
Correlation analysis further revealed strong relationships between velocity variability, abruptness metrics, and stereotypical motion scores, supporting the multi-dimensional nature of ASD-related behavioral signatures.

4.5. Computational Performance

The optimized GPU-accelerated pipeline enabled near-real-time processing. Average inference time remained below one second per video frame under standard operating conditions. This computational efficiency enables practical deployment in screening scenarios, including clinical environments and remote assessment settings.

5. Discussion

The experimental results demonstrate that the proposed multi-modal framework provides reliable performance for early autism spectrum disorder (ASD) screening based on visual behavioral analysis. The achieved classification accuracy and stability indicate that motion-derived behavioral descriptors constitute strong biomarkers for distinguishing ASD from typically developing (TD) subjects.

5.1. Interpretation of Motion-Based Behavioral Patterns

One of the main observations is the dominant contribution of motion-related features, particularly acceleration dynamics, temporal smoothness, bilateral asymmetry, and stereotypical movement indicators. These findings are consistent with behavioral studies reporting altered motor coordination, increased movement variability, and repetitive motion patterns in children with ASD.
The high discriminative power observed for frequency-domain stereotypy detection further supports the relevance of rhythmic repetitive behaviors as objective screening cues. The anatomical segmentation strategy (hands, torso, and head) enabled targeted detection of clinically meaningful motor patterns, improving interpretability and classification robustness.

5.2. Effectiveness of Multi-Modal Fusion

The fusion-based architecture consistently outperformed individual modality pipelines. While motion-based models achieved strong standalone performance, integrating facial expression and morphological information further improved robustness and reduced prediction variance.
This result highlights the complementary nature of dynamic behavioral cues and static facial descriptors. Facial morphology alone exhibited limited predictive power, yet its integration contributed additional structural information that improved overall classification stability. Similarly, facial expression analysis, despite having moderate standalone accuracy, provided contextual emotional cues that enhanced the fused decision process.
These findings confirm the importance of combining heterogeneous visual modalities for complex neurodevelopmental disorder screening tasks, particularly when dealing with limited dataset sizes and behavioral variability.

5.3. Cross-Dataset Generalization and Robustness

The inclusion of the KinectStickman dataset [5] and the independent ASD+TD validation dataset [27] allowed evaluation beyond the primary training source. Although skeleton-based representations achieved lower accuracy compared to RGB motion analysis, they demonstrated consistent generalization trends and validated the adaptability of the proposed pipeline.
The observed cross-validation stability (±3% variation) further indicates that the learned representations are robust against data-partitioning effects. This property is essential for practical deployment, where environmental conditions, camera setups, and subject behavior may vary significantly.

5.4. Clinical Relevance and Practical Applicability

From a clinical perspective, the proposed framework supports non-invasive, camera-based screening that can be deployed in natural environments such as clinics, educational institutions, and home settings. The sub-second per-frame inference time enables near-real-time behavioral assessment without interrupting natural interactions.
Importantly, the system is not intended to replace professional diagnosis, but rather to serve as a complementary screening and decision-support tool. By providing objective behavioral indicators, the framework can assist clinicians in prioritizing cases that require further specialized evaluation.

5.5. Limitations

Despite promising results, several limitations must be acknowledged. First, although multiple datasets were used, the overall sample size remains moderate compared to large-scale clinical cohorts. Expanding the training data with more diverse age groups, cultural backgrounds, and recording environments is necessary to further improve generalization.
Second, facial expression and morphology datasets contain controlled imaging conditions that may not fully reflect unconstrained real-world scenarios. Lighting variability, occlusions, and camera quality differences may impact performance in uncontrolled environments.
Third, while Random Forest classifiers provide interpretability and robustness, deep end-to-end architectures could potentially capture higher-level temporal dependencies if sufficient annotated data become available.

5.6. Future Research Directions

Future work will focus on extending the proposed framework in several directions. First, incorporating additional modalities such as audio-based vocal analysis and eye gaze estimation could further enrich behavioral representations. Second, temporal deep learning architectures (e.g., transformer-based sequence models) may improve long-term behavioral pattern modeling.
Another important direction involves clinical validation with longitudinal datasets, enabling evaluation of early developmental trajectories and screening performance over time. Finally, integrating uncertainty estimation mechanisms could provide confidence-aware predictions, increasing trustworthiness in clinical deployment scenarios.
Overall, the presented framework establishes a scalable foundation for multi-modal visual ASD screening and demonstrates the feasibility of combining motion dynamics and facial analysis for objective early behavioral assessment.

6. Conclusions and Future Directions

This work presented a multi-modal video-based framework for the automated screening of autism spectrum disorder (ASD), aiming to support objective behavioral assessment in settings where expert clinical resources may be limited. Across multiple datasets and processing modalities, the proposed system demonstrated that quantitative motion analysis can capture discriminative behavioral signatures associated with ASD. In particular, the ComplexVideos-based pipeline achieved consistently high performance, exceeding 90% accuracy across validation strategies, with a cross-validation score of 94.2 ± 2.1% and an AUC of 0.93, indicating strong separability between ASD and typically developing patterns. Cross-dataset evaluation further suggested that the extracted features reflect meaningful behavioral differences rather than dataset-specific artifacts: while KinectStickman yielded more modest accuracies (60–80%), it provided supportive evidence of generalization and external validity. The modular architecture enabled integration of heterogeneous inputs, and the Random Forest classifier facilitated interpretability through feature importance rankings, which is critical for clinical trust and validation.
Despite these encouraging results, several limitations remain. The overall accuracy does not reach perfect reliability, implying that automated predictions must be interpreted as screening support rather than used to provide a definitive diagnosis. Performance is sensitive to data quality and recording conditions, and the facial expression component was constrained by a notable class imbalance, limiting its contribution. Furthermore, the lack of evaluation across culturally and ethnically diverse populations restricts claims of global applicability, and the dependence on advanced computational infrastructure may hinder deployment in low-resource environments.
Future work will focus on both short-term technical improvements and longer-term translational research. In the near term, addressing the imbalance of facial-expression data through targeted acquisition, augmentation, and re-weighting strategies is expected to improve robustness. Additional gains are anticipated through refined feature extraction for subtle behavioral cues, systematic optimization of the Random Forest configuration, and the development of adaptive fusion mechanisms that dynamically adjust modality weights based on input quality. Expanding training with substantially larger datasets and increased computational capacity is also expected to enhance generalization. A particularly promising extension is the integration of fine motor assessment via hand gesture analysis, which may complement existing gross motor descriptors and improve sensitivity to stereotypies.
In the longer term, translating the framework into practice will require a clinician-oriented interface that supports transparent interpretation of predictions, including intuitive visualization of extracted features, confidence indicators, reporting functionality, and principled uncertainty quantification with confidence intervals. Age-specific modeling is another important research direction, given that ASD manifestations evolve across developmental stages; specialized classifiers optimized for distinct age groups may improve reliability in real-world screening. Finally, integrating complementary biomarkers (e.g., genetic signals, neuroimaging measures, and physiological data) could enable richer multi-source diagnostic profiles consistent with precision medicine paradigms, while adaptive systems that incorporate clinical feedback and periodic model updates may provide a pathway toward continuously improving screening performance over time.

Author Contributions

Conceptualization, I.C.M., L.D.-F. and A.G.B.; methodology, I.C.M. and L.D.-F.; software, I.C.M.; validation, L.D.-F., A.G.B. and L.S.; formal analysis, L.K.; investigation, I.C.M.; resources, A.G.B.; data curation, I.C.M.; writing—original draft preparation, L.D.-F.; writing—review and editing, A.G.B., L.S., L.K. and E.-H.D.; visualization, I.C.M. and E.-H.D.; supervision, E.-H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available. The video datasets can be found in [4,5,27], and the image datasets are available in [25,26].

Acknowledgments

On behalf of the Development of Artificial Intelligence in the Medical Field project, we are grateful for the possibility to use ELKH Cloud (https://science-cloud.hu/), which helped us achieve the results published in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Farooq, M.S.; Tehseen, R.; Sabir, M.; Atal, Z. Detection of autism spectrum disorder (ASD) in children and adults using machine learning. Sci. Rep. 2023, 13, 9605. [Google Scholar] [CrossRef] [PubMed]
  2. Lord, C.; Cook, E.H.; Leventhal, B.L.; Amaral, D.G. Autism spectrum disorders. Neuron 2000, 28, 355–363. [Google Scholar] [CrossRef] [PubMed]
  3. Barami, T.; Manelis-Baram, L.; Kaiser, H.; Ilan, M.; Slobodkin, A.; Hadashi, O.; Hadad, D.; Waissengreen, D.; Nitzan, T.; Menashe, I.; et al. Automated analysis of stereotypical movements in videos of children with autism spectrum disorder. JAMA Netw. Open 2024, 7, e2432851. [Google Scholar] [CrossRef] [PubMed]
  4. Al-Jubouri, A.; Hadi, I.; Rajihy, Y. Three-Dimensional Dataset Combining Gait and Full Body Movement of Children with Autism Spectrum Disorders Collected by Kinect v2 Camera. 2020. Available online: https://datadryad.org/dataset/doi:10.5061/dryad.s7h44j150 (accessed on 10 December 2025). [CrossRef]
  5. Natraj, S.; Kojovic, N.; Maillart, T.; Schaer, M. Video-Audio Neural Network Ensemble for Comprehensive Screening of Autism Spectrum Disorder in Young Children (OpenPose ADOS Dataset). 2024. Available online: https://zenodo.org/records/12658214 (accessed on 10 December 2025). [CrossRef]
  6. de Belen, R.A.J.; Bednarz, T.; Sowmya, A.; Del Favero, D. Computer vision in autism spectrum disorder research: A systematic review of published studies from 2009 to 2019. Transl. Psychiatry 2020, 10, 333. [Google Scholar] [CrossRef] [PubMed]
  7. Wu, C.; Liaqat, S.; Helvaci, H.; Cheung, S.c.S.; Chuah, C.N.; Ozonoff, S.; Young, G. Machine Learning Based Autism Spectrum Disorder Detection from Videos. In Proceedings of the IEEE International Conference on E-Health Networking, Application & Services (HEALTHCOM), Virtual, 1–2 March 2021. [Google Scholar] [CrossRef]
  8. Lanzarini, E.; Pruccoli, J.; Grimandi, I.; Spadoni, C.; Angotti, M.; Pignataro, V.; Sacrato, L.; Franzoni, E.; Parmeggiani, A. Phonic and Motor Stereotypies in Autism Spectrum Disorder: Video Analysis and Neurological Characterization. Brain Sci. 2021, 11, 431. [Google Scholar] [CrossRef] [PubMed]
  9. Babu, P.R.K.; Di Martino, J.M.; Chang, Z.; Perochon, S.; Aiello, R.; Carpenter, K.L.H.; Compton, S.; Davis, N.; Franz, L.; Espinosa, S.; et al. Complexity analysis of head movements in autistic toddlers. J. Child Psychol. Psychiatry 2022, 64, 156–166. [Google Scholar] [CrossRef] [PubMed]
  10. Rose, K. An Autistic Frequency (Stimming). 2018. Available online: https://theautisticadvocate.com/an-autistic-frequency/ (accessed on 10 January 2026).
  11. Ali, A.; Negin, F.F.; Thümmler, S.; Bremond, F.F. Video-based Behavior Understanding of Children for Objective Diagnosis of Autism. In Proceedings of the 17th International Conference on Computer Vision Theory and Applications (VISAPP), Virtual, 6–8 February 2022. [Google Scholar] [CrossRef]
  12. Kojovic, N.; Natraj, S.; Mohanty, S.P.; Maillart, T.; Schaer, M. Using 2D video-based pose estimation for automated prediction of autism spectrum disorders in young children. Sci. Rep. 2021, 11, 15069. [Google Scholar] [CrossRef] [PubMed]
  13. Rehg, J.M.; Abowd, G.D.; Rozga, A.; Romero, M.; Clements, M.A.; Sclaroff, S.; Essa, I.; Ousley, O.Y.; Li, Y.; Kim, C.; et al. Decoding Children’s Social Behavior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3414–3421. [Google Scholar] [CrossRef]
  14. Young, G.S.; Constantino, J.N.; Dvorak, S.; Belding, A.; Gangi, D.; Hill, A.; Hill, M.; Miller, M.; Parikh, C.; Schwichtenberg, A.J.; et al. A video-based measure to identify autism risk in infancy. J. Child Psychol. Psychiatry 2020, 61, 1031–1039. [Google Scholar] [CrossRef] [PubMed]
  15. Ahmed, Z.A.; Jadhav, M. A Review of Early Detection of Autism Based on Eye-Tracking and Sensing Technology. In Proceedings of the International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 26–28 February 2020; pp. 160–166. [Google Scholar] [CrossRef]
  16. Cedrus. New: Cedrus Introduces StimTracker for SMI Eye Trackers. 2014. Available online: https://community.cedrus.com/t/new-cedrus-introduces-stimtracker-for-smi-eye-trackers/5092 (accessed on 10 December 2025).
  17. Jaradat, A.S.; Wedyan, M.; Alomari, S.; Barhoush, M.M. Using Machine Learning to Diagnose Autism Based on Eye Tracking Technology. Diagnostics 2024, 15, 66. [Google Scholar] [CrossRef] [PubMed]
  18. Muyinda, P.B.; Masagazi, F.M.; Mugagga, A.M.; Mulumba, M.B. Tracking Students’ Eye-Movements when Reading Learning Objects on Mobile Phones: A Discourse Analysis of Luganda Language Teacher-Trainees’ Reflective Observations. J. Learn. Dev. 2016, 3, 51–65. [Google Scholar] [CrossRef]
  19. Pierce, K.; Marinero, S.; Hazin, R.; McKenna, B.; Carter Barnes, C.; Malige, A. Eye Tracking Reveals Abnormal Visual Preference for Geometric Images as an Early Biomarker of an Autism Spectrum Disorder Subtype Associated with Increased Symptom Severity. Biol. Psychiatry 2016, 79, 657–666. [Google Scholar] [CrossRef] [PubMed]
  20. Raja, K.S.S.; Balaji, V.; Kiruthika, U.S.; Raman, C. An IoT Platform for Children Behaviour Analysis and Early Detection of Neurodevelopmental Disorders. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; pp. 73–84. [Google Scholar] [CrossRef]
  21. Lakkapragada, A.; Kline, A.; Mutlu, O.C.; Paskov, K.; Chrisman, B.; Stockham, N.; Washington, P.; Wall, D.P. The Classification of Abnormal Hand Movement to Aid in Autism Detection: Machine Learning Study. JMIR Biomed. Eng. 2022, 7, e33771. [Google Scholar] [CrossRef]
  22. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Applying Machine Learning to Kinematic and Eye Movement Features of a Movement Imitation Task to Predict Autism Diagnosis. Sci. Rep. 2020, 10, 8346. [Google Scholar] [CrossRef] [PubMed]
  23. Derbali, M.; Jarrah, M.; Randhawa, P. Autism Spectrum Disorder Detection Using Video Games Facial Expression Diagnosis. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 110–119. [Google Scholar] [CrossRef]
  24. Zunino, A.; Morerio, P.; Cavallo, A.; Ansuini, C.; Podda, J.; Battaglia, F.; Veneselli, E.; Becchio, C.; Murino, V. Video Gesture Analysis for Autism Spectrum Disorder Detection. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3421–3426. [Google Scholar] [CrossRef]
  25. Talaat, F.M. Autistic Children Emotions Dataset. 2025. Available online: https://www.kaggle.com/datasets/fatmamtalaat/autistic-children-emotions-dr-fatma-m-talaat (accessed on 10 January 2026).
  26. Das, P. Autistic Children Facial Image Dataset. 2025. Available online: https://www.kaggle.com/datasets/prayashdas/autistic-children-facial-image-dataset (accessed on 10 January 2026).
  27. Nada, A. AV-ASD Videos Part 5. 2025. Available online: https://www.kaggle.com/datasets/nadaahmed567/av-asd-videos-part-5/data (accessed on 10 January 2026).
  28. Genuer, R.; Poggi, J.M. Random Forests. In Random Forests with R; Springer: Berlin/Heidelberg, Germany, 2020; pp. 33–55. [Google Scholar] [CrossRef]
  29. Nelli, F. Machine Learning with scikit-learn. In Python Data Analytics; Springer: Berlin/Heidelberg, Germany, 2023; pp. 259–287. [Google Scholar] [CrossRef]
Figure 1. Portable eye-tracking device SMI RED250 (SensoMotoric Instruments GmbH (SMI), Teltow, Germany).
Figure 2. Eye-tracking setup using a Tobii T120 device (Tobii AB, Stockholm, Sweden) for simultaneous presentation of social and non-social stimuli.
Figure 3. System architecture diagram illustrating the main processing components and data flow.
Figure 4. Average acceleration distribution computed across ASD and TD subject groups.
Figure 5. Average abruptness distribution extracted from motion trajectories for ASD and TD subjects.
Figure 6. Mean tracking velocity distribution across analyzed motion sequences.
Figure 7. Spatial distribution of tracking points obtained from motion trajectory analysis.
Figure 8. Standard deviation of tracking velocity illustrating motion variability patterns.
Figure 9. Confusion matrix obtained for the ComplexVideos motion-based classification model.
Figure 10. Motion analysis metrics for a representative ASD subject.
Figure 11. Motion analysis metrics for a representative TD subject.
Figure 12. Multi-modal fusion architecture integrating motion and facial analysis pipelines.
Figure 13. Feature importance distribution highlighting dominant motion descriptors.
