Multimodal LLM vs. Human-Measured Features for AI Predictions of Autism in Home Videos
Abstract
1. Introduction
2. Related Work
2.1. Feature Reduction in Autism Machine Learning Models
2.2. Crowdsourced Behavioral Annotation
2.3. Evolution of Large Language Models for Behavioral Assessment
2.4. Addressing a Key Gap: Direct Comparison of LLMs and Human Annotators
3. Materials and Methods
3.1. Dataset
3.2. Behavioral Feature Extraction
3.3. Machine Learning Classifiers and Direct Diagnosis
3.4. Model Configurations
3.5. Performance Assessment and Human Baselines
3.6. Statistical Analysis
4. Results
4.1. Overall Performance Benchmarking
4.2. Reliability and Agreement Analysis
4.2.1. Within-Group vs. Between-Group Agreement
4.2.2. Feature-Level and Domain-Level Reliability
4.3. Feature Attribution and Interpretability
4.4. Video Characteristics and Error Analysis
4.4.1. Distribution of Agreement and Uncertainty
4.4.2. Diagnosis as the Primary Driver of Agreement
4.4.3. Case Analysis: High vs. Low Agreement Videos
4.4.4. Systematic Error Pattern Analysis
4.5. Ablation Studies and Component Analysis
4.5.1. Effect of Audio Input
4.5.2. Effect of Thinking Mode
4.5.3. Effect of Prompt Format
4.5.4. Effect of Behavioral Context
5. Discussion
5.1. Overall Performance Benchmarking
5.2. Reliability and Agreement Analysis
5.3. Feature Attribution and Interpretability
5.4. Video Characteristics and Error Analysis
5.5. Ablation Studies and Component Analysis
5.6. Limitations and Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Gemini Model Prompt and Behavioral Assessment Questions
Appendix A.1. Standardized Prompt for Gemini Models
You are a behavioral analyst observing short videos of children interacting with a caregiver. Your task is to answer a set of structured multiple-choice questions about the child’s behavior based on the video. Each question is scored on a scale from 0 to 3. You should provide the most appropriate score based on your observation. Even if the behavior is not directly shown, use indirect evidence (e.g., body language, facial expression, context) to infer the most likely behavior.
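As a minimal sketch of how a prompt like this could be paired with a list of questions, and how the returned 0 to 3 scores could be validated, consider the Python below. The question texts, the `Q<n>: <score>` reply format, and the helper names `build_prompt` and `parse_scores` are illustrative assumptions rather than details from the study, and no model API is called.

```python
import re

# Opening sentence of the standardized prompt quoted above (truncated for brevity).
SYSTEM_PROMPT = (
    "You are a behavioral analyst observing short videos of children "
    "interacting with a caregiver."
)

# Hypothetical placeholder questions; the study's actual questions are not reproduced here.
QUESTIONS = [
    "How often does the child make eye contact?",
    "Does the child share excitement with the caregiver?",
]


def build_prompt(questions):
    """Join the instructions and numbered questions into one text prompt."""
    lines = [SYSTEM_PROMPT, "", "Answer each question with a score from 0 to 3."]
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q{i}: {question}")
    # Assumed reply format, chosen only to keep parsing simple.
    lines.append("Reply with one line per question, formatted as 'Q<n>: <score>'.")
    return "\n".join(lines)


def parse_scores(reply, n_questions):
    """Extract one integer score in 0..3 per question, or raise ValueError."""
    scores = {}
    for match in re.finditer(r"Q(\d+)\s*:\s*(\d+)", reply):
        idx, score = int(match.group(1)), int(match.group(2))
        if not 1 <= idx <= n_questions:
            raise ValueError(f"unknown question index {idx}")
        if not 0 <= score <= 3:
            raise ValueError(f"score {score} for Q{idx} is outside 0..3")
        scores[idx] = score
    missing = [i for i in range(1, n_questions + 1) if i not in scores]
    if missing:
        raise ValueError(f"missing answers for questions {missing}")
    return [scores[i] for i in range(1, n_questions + 1)]
```

Rejecting out-of-range or missing scores at parse time keeps malformed replies from silently entering a downstream classifier as feature values.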
Appendix A.2. Direct Diagnosis Prompt Variations
Appendix A.2.1. Binary Format
- (0) No, the child’s behavior appears neurotypical
- (1) Yes, the child shows behavioral patterns consistent with autism
Appendix A.2.2. Three-Choice Format
- (0) No evidence of autism
- (1) Some evidence of autism
- (2) Strong evidence of autism
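The two answer formats can be represented as small option tables, which makes it straightforward to render the choices and to check that a reply names a valid option. The sketch below is illustrative only: the reply convention (an option index in parentheses) and the names `render_options` and `parse_choice` are assumptions. How three-choice answers were mapped to a final diagnosis is not specified here, so no mapping is attempted.

```python
import re

# Option texts copied from the two formats listed above.
BINARY_OPTIONS = {
    0: "No, the child’s behavior appears neurotypical",
    1: "Yes, the child shows behavioral patterns consistent with autism",
}
THREE_CHOICE_OPTIONS = {
    0: "No evidence of autism",
    1: "Some evidence of autism",
    2: "Strong evidence of autism",
}


def render_options(options):
    """Format the options as the bulleted list shown in the appendix."""
    return "\n".join(f"- ({key}) {text}" for key, text in sorted(options.items()))


def parse_choice(reply, options):
    """Return the first parenthesized option index in the reply, if it is valid."""
    match = re.search(r"\((\d+)\)", reply)
    if match is None:
        raise ValueError("no option index like '(1)' found in reply")
    choice = int(match.group(1))
    if choice not in options:
        raise ValueError(f"option {choice} is not valid for this format")
    return choice
```

For example, `parse_choice("(2) Strong evidence of autism", THREE_CHOICE_OPTIONS)` returns `2`, while the same reply checked against `BINARY_OPTIONS` raises `ValueError`.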
References
1. Lord, C.; Risi, S.; Lambrecht, L.; Cook, E.H., Jr.; Leventhal, B.L.; DiLavore, P.C.; Pickles, A.; Rutter, M. The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism. J. Autism Dev. Disord. 2000, 30, 205–223.
2. Lord, C.; Rutter, M.; DiLavore, P.; Risi, S.; Gotham, K.; Bishop, S. Autism Diagnostic Observation Schedule, 2nd ed.; Western Psychological Services: Torrance, CA, USA, 2012.
3. Lord, C.; Rutter, M.; Le Couteur, A. Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J. Autism Dev. Disord. 1994, 24, 659–685.
4. Shaw, K.A. Prevalence and early identification of autism spectrum disorder among children aged 4 and 8 years—Autism and Developmental Disabilities Monitoring Network, 16 Sites, United States, 2022. MMWR Surveill. Summ. 2025, 74, 1–22.
5. National Autism Data Center. How Early Does Diagnosis Happen? Autism by the Numbers; National Autism Data Center: Philadelphia, PA, USA, 2025.
6. Durkin, M.S.; Maenner, M.J.; Baio, J.; Christensen, D.; Daniels, J.; Fitzgerald, R.; Imm, P.; Lee, L.C.; Schieve, L.A.; Van Naarden Braun, K.; et al. Autism spectrum disorder among US children (2002–2010): Socioeconomic, racial, and ethnic disparities. Am. J. Public Health 2017, 107, 1818–1826.
7. Magaña, S.; Lopez, K.; Aguinaga, A.; Morton, H. Access to diagnosis and treatment services among Latino children with autism spectrum disorders. Intellect. Dev. Disabil. 2013, 51, 141–153.
8. Dawson, G. Early behavioral intervention, brain plasticity, and the prevention of autism spectrum disorder. Dev. Psychopathol. 2008, 20, 775–803.
9. Landa, R.J. Efficacy of early interventions for infants and young children with, and at risk for, autism spectrum disorders. Int. Rev. Psychiatry 2018, 30, 25–39.
10. Tariq, Q.; Daniels, J.; Schwartz, J.N.; Washington, P.; Kalantarian, H.; Wall, D.P. Mobile detection of autism through machine learning on home video: A development and prospective validation study. PLoS Med. 2018, 15, e1002705.
11. Wall, D.P.; Dally, R.; Luyster, R.; Jung, J.Y.; DeLuca, T.F. Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS ONE 2012, 7, e43855.
12. Wall, D.P.; Kosmicki, J.; Deluca, T.; Harstad, E.; Fusaro, V.A. Use of machine learning to shorten observation-based screening and diagnosis of autism. Transl. Psychiatry 2012, 2, e100.
13. Levy, S.; Duda, M.; Haber, N.; Wall, D.P. Sparsifying machine learning models identify stable subsets of predictive features for behavioral detection of autism. Mol. Autism 2017, 8, 65.
14. Kosmicki, J.; Sochat, V.; Duda, M.; Wall, D. Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Transl. Psychiatry 2015, 5, e514.
15. Abbas, H.; Garberson, F.; Glover, E.; Wall, D.P. Machine learning approach for early detection of autism by combining questionnaire and home video screening. J. Am. Med. Inform. Assoc. 2018, 25, 1000–1007.
16. Fusaro, V.A.; Daniels, J.; Duda, M.; DeLuca, T.F.; D’Angelo, O.; Tamburello, J.; Maniscalco, J.; Wall, D.P. The potential of accelerating early detection of autism through content analysis of YouTube videos. PLoS ONE 2014, 9, e93533.
17. Washington, P.; Tariq, Q.; Leblanc, E.; Chrisman, B.; Dunlap, K.; Kline, A.; Kalantarian, H.; Penev, Y.; Paskov, K.; Voss, C.; et al. Crowdsourced privacy-preserved feature tagging of short home videos for machine learning ASD detection. Sci. Rep. 2021, 11, 7620.
18. Washington, P.; Leblanc, E.; Dunlap, K.; Penev, Y.; Varma, M.; Jung, J.Y.; Chrisman, B.; Sun, M.W.; Stockham, N.; Paskov, K.M.; et al. Selection of trustworthy crowd workers for telemedical diagnosis of pediatric autism spectrum disorder. In Proceedings of the Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, Virtual, 5–7 January 2021; Volume 26, p. 14.
19. Washington, P.; Leblanc, E.; Dunlap, K.; Penev, Y.; Kline, A.; Paskov, K.; Sun, M.W.; Chrisman, B.; Stockham, N.; Varma, M.; et al. Precision telemedicine through crowdsourced machine learning: Testing variability of crowd workers for video-based autism feature recognition. J. Pers. Med. 2020, 10, 86.
20. Washington, P. A perspective on crowdsourcing and human-in-the-loop workflows in precision health. J. Med. Internet Res. 2024, 26, e51138.
21. Washington, P.; Kline, A.; Mutlu, O.C.; Leblanc, E.; Hou, C.; Stockham, N.; Paskov, K.; Chrisman, B.; Wall, D. Activity recognition with moving cameras and few training examples: Applications for detection of autism-related headbanging. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8–13 May 2021; pp. 1–7.
22. Voss, C.; Washington, P.; Haber, N.; Kline, A.; Daniels, J.; Fazel, A.; De, T.; McCarthy, B.; Feinstein, C.; Winograd, T.; et al. Superpower glass: Delivering unobtrusive real-time social cues in wearable systems. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, Heidelberg, Germany, 12–16 September 2016; pp. 1218–1226.
23. Kalantarian, H.; Washington, P.; Schwartz, J.; Daniels, J.; Haber, N.; Wall, D.P. Guess what? Towards understanding autism from structured video using facial affect. J. Healthc. Inform. Res. 2019, 3, 43–66.
24. Serna-Aguilera, M.; Nguyen, X.B.; Singh, A.; Rockers, L.; Park, S.W.; Neely, L.; Seo, H.S.; Luu, K. Video-based autism detection with deep learning. In Proceedings of the 2024 IEEE Green Technologies Conference (GreenTech), Springdale, AR, USA, 3–5 April 2024; pp. 159–161.
25. Kojovic, N.; Natraj, S.; Mohanty, S.P.; Maillart, T.; Schaer, M. Using 2D video-based pose estimation for automated prediction of autism spectrum disorders in young children. Sci. Rep. 2021, 11, 15069.
26. Liu, W.; Cheng, M.; Pan, Y.; Yuan, L.; Hu, S.; Li, M.; Zeng, S. Assessing the social skills of children with autism spectrum disorder via language-image pre-training models. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; pp. 260–271.
27. Deng, S.; Kosloski, E.E.; Patel, S.; Barnett, Z.A.; Nan, Y.; Kaplan, A.; Aarukapalli, S.; Doan, W.T.; Wang, M.; Singh, H.; et al. Hear me, see me, understand me: Audio-visual autism behavior recognition. IEEE Trans. Multimed. 2024, 27, 2335–2346.
28. Raj, S.; Masood, S. Analysis and detection of autism spectrum disorder using machine learning techniques. Procedia Comput. Sci. 2020, 167, 994–1004.
29. Zunino, A.; Morerio, P.; Cavallo, A.; Ansuini, C.; Podda, J.; Battaglia, F.; Veneselli, E.; Becchio, C.; Murino, V. Video gesture analysis for autism spectrum disorder detection. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3421–3426.
30. Berlin S, J.; Pandian, D.; Rajagopalan, S.S.; Jayagopi, D. Detecting a child’s stimming behaviours for autism spectrum disorder diagnosis using rgbpose-slowfast network. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3356–3360.
31. Sadouk, L.; Gadi, T.; Essoufi, E.H. A novel deep learning approach for recognizing stereotypical motor movements within and across subjects on the autism spectrum disorder. Comput. Intell. Neurosci. 2018, 2018, 7186762.
32. Zhu, F.L.; Wang, S.H.; Liu, W.B.; Zhu, H.L.; Li, M.; Zou, X.B. A multimodal machine learning system in early screening for toddlers with autism spectrum disorders based on the response to name. Front. Psychiatry 2023, 14, 1039293.
33. Eslami, T.; Mirjalili, V.; Fong, A.; Laird, A.R.; Saeed, F. ASD-DiagNet: A hybrid learning approach for detection of autism spectrum disorder using fMRI data. Front. Neuroinform. 2019, 13, 70.
34. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
35. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
36. Kjell, O.N.; Kjell, K.; Schwartz, H.A. Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment. Psychiatry Res. 2024, 333, 115667.
37. Xu, Y.; Fang, Z.; Lin, W.; Jiang, Y.; Jin, W.; Balaji, P.; Wang, J.; Xia, T. Evaluation of large language models on mental health: From knowledge test to illness diagnosis. Front. Psychiatry 2025, 16, 1646974.
38. Stade, E.C.; Stirman, S.W.; Ungar, L.H.; Boland, C.L.; Schwartz, H.A.; Yaden, D.B.; Sedoc, J.; DeRubeis, R.J.; Willer, R.; Eichstaedt, J.C. Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. npj Ment. Health Res. 2024, 3, 12.
39. Stanley, J.; Rabot, E.; Reddy, S.; Belilovsky, E.; Mottron, L.; Bzdok, D. Large language models deconstruct the clinical intuition behind diagnosing autism. Cell 2025, 188, 2235–2248.
40. Jiang, Y.; Shen, Q.; Lai, S.; Qi, S.; Zheng, Q.; Yao, L.; Wang, Y.; Pan, G. Copiloting Diagnosis of Autism in Real Clinical Scenarios via LLMs. arXiv 2024, arXiv:2410.05684.
41. Hu, C.; Li, W.; Ruan, M.; Yu, X.; Paul, L.K.; Wang, S.; Li, X. Exploiting ChatGPT for diagnosing autism-associated language disorders and identifying distinct features. Res. Sq. 2024, 3, 4359726.
42. Rajagopalan, S.S.; Zhang, Y.; Yahia, A.; Tammimies, K. Machine learning prediction of autism spectrum disorder from a minimal set of medical and background information. JAMA Netw. Open 2024, 7, e2429229.
43. Myers, E.; Stone, W.L.; Bernier, R.; Lendvay, T.; Comstock, B.; Cowan, C. The diagnosis conundrum: Comparison of crowdsourced and expert assessments of toddlers with high and low risk of autism spectrum disorder. Autism Res. 2018, 11, 1629–1634.
44. Leblanc, E.; Washington, P.; Varma, M.; Dunlap, K.; Penev, Y.; Kline, A.; Wall, D.P. Feature replacement methods enable reliable home video analysis for machine learning detection of autism. Sci. Rep. 2020, 10, 21245.
45. Dow, D.; Day, T.N.; Kutta, T.J.; Nottke, C.; Wetherby, A.M. Screening for autism spectrum disorder in a naturalistic home setting using the systematic observation of red flags (SORF) at 18–24 months. Autism Res. 2020, 13, 122–133.
46. Wang, C.; Han, L.; Stein, G.; Day, S.; Bien-Gund, C.; Mathews, A.; Ong, J.J.; Zhao, P.Z.; Wei, S.F.; Walker, J.; et al. Crowdsourcing in health and medical research: A systematic review. Infect. Dis. Poverty 2020, 9, 8.
47. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
48. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774.
49. Google AI. Gemini Thinking Mode Documentation. 2024. Available online: https://ai.google.dev/gemini-api/docs/thinking-mode (accessed on 9 September 2025).
50. Nelson, B.W.; Winbush, A.; Siddals, S.; Flathers, M.; Allen, N.B.; Torous, J. Evaluating the performance of general purpose large language models in identifying human facial emotions. npj Digit. Med. 2025, 8, 615.
51. Deng, C.; Lai, S.; Zhou, C.; Bao, M.; Yan, J.; Li, H.; Yao, L.; Wang, Y. ASD-Chat: An innovative dialogue intervention system for children with autism based on llm and vb-mapp. arXiv 2024, arXiv:2409.01867.
52. Chen, X.Y.; Chen, Y.M.; Chen, C.P.; Su, B.H.; Gau, S.S.F.; Lee, C.C. SocialRecNet: A Multimodal LLM-Based Framework for Assessing Social Reciprocity in Autism Spectrum Disorder. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
53. Luo, X.; Rechardt, A.; Sun, G.; Nejad, K.K.; Yáñez, F.; Yilmaz, B.; Lee, K.; Cohen, A.O.; Borghesani, V.; Pashkov, A.; et al. Large language models surpass human experts in predicting neuroscience results. Nat. Hum. Behav. 2025, 9, 305–315.
54. Mehandru, N.; Miao, B.Y.; Almaraz, E.R.; Sushil, M.; Butte, A.J.; Alaa, A. Evaluating large language models as agents in the clinic. npj Digit. Med. 2024, 7, 84.
55. Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265.
56. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
57. Saab, K.; Tu, T.; Weng, W.H.; Tanno, R.; Stutz, D.; Wulczyn, E.; Zhang, F.; Strother, T.; Park, C.; Vedadi, E.; et al. Capabilities of gemini models in medicine. arXiv 2024, arXiv:2404.18416.
58. Yang, L.; Xu, S.; Sellergren, A.; Kohlberger, T.; Zhou, Y.; Ktena, I.; Kiraly, A.; Ahmed, F.; Hormozdiari, F.; Jaroensri, T.; et al. Advancing multimodal medical capabilities of Gemini. arXiv 2024, arXiv:2405.03162.
59. Google Research and Google DeepMind. Advancing Medical AI with Med-Gemini. 2024. Available online: https://research.google/blog/advancing-medical-ai-with-med-gemini/ (accessed on 9 September 2025).
60. Huynh, M.; Kline, A.; Surabhi, S.; Dunlap, K.; Mutlu, O.C.; Honarmand, M.; Azizian, P.; Washington, P.; Wall, D.P. Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder. arXiv 2024, arXiv:2408.13255.
61. Varma, M.; Washington, P.; Chrisman, B.; Kline, A.; Leblanc, E.; Paskov, K.; Stockham, N.; Jung, J.Y.; Sun, M.W.; Wall, D.P. Identification of social engagement indicators associated with autism spectrum disorder using a game-based mobile app: Comparative study of gaze fixation and visual scanning methods. J. Med. Internet Res. 2022, 24, e31830.
62. Lakkapragada, A.; Kline, A.; Mutlu, O.C.; Paskov, K.; Chrisman, B.; Stockham, N.; Washington, P.; Wall, D.P. The classification of abnormal hand movement to aid in autism detection: Machine learning study. JMIR Biomed. Eng. 2022, 7, e33771.
63. Farooq, M.S.; Tehseen, R.; Sabir, M.; Atal, Z. Detection of autism spectrum disorder (ASD) in children and adults using machine learning. Sci. Rep. 2023, 13, 9605.
64. Almadhor, A.; Alasiry, A.; Alsubai, S.; Al Hejaili, A.; Kovac, U.; Abbas, S. Explainable and secure framework for autism prediction using multimodal eye tracking and kinematic data. Complex Intell. Syst. 2025, 11, 173.
65. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
66. Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.d.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training compute-optimal large language models. arXiv 2022, arXiv:2203.15556.
67. Bahri, Y.; Dyer, E.; Kaplan, J.; Lee, J.; Sharma, U. Explaining neural scaling laws. Proc. Natl. Acad. Sci. USA 2024, 121, e2311878121.










| Model | Classifier | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | ROC-AUC (%) | PR-AUC (%) |
|---|---|---|---|---|---|---|---|
| 2.5 Pro * | LR5 | 89.6 ± 2.1 | 89.0 ± 2.1 | 90.4 ± 2.7 | 88.8 ± 2.2 | 95.6 ± 0.6 | 96.0 ± 0.7 |
| 2.5 Pro * | LR10 | 87.2 ± 1.4 | 88.5 ± 2.0 | 85.6 ± 2.7 | 88.8 ± 2.2 | 91.6 ± 2.5 | 93.6 ± 1.9 |
| 2.5 Pro * | Direct | 90.0 ± 2.5 | 89.7 ± 2.6 | 90.4 ± 2.7 | 89.6 ± 2.7 | 90.0 ± 2.5 | 86.2 ± 3.2 |
| 2.5 Flash * | LR5 | 85.6 ± 1.1 | 80.0 ± 1.4 | 96.0 ± 0.0 | 76.0 ± 2.2 | 92.6 ± 0.4 | 93.5 ± 0.3 |
| 2.5 Flash * | LR10 | 79.6 ± 1.1 | 76.7 ± 1.5 | 84.0 ± 0.0 | 75.2 ± 2.2 | 88.2 ± 0.5 | 88.9 ± 0.4 |
| 2.5 Flash * | Direct | 85.6 ± 1.1 | 86.2 ± 1.9 | 84.0 ± 0.0 | 87.2 ± 2.2 | 86.0 ± 1.1 | 80.4 ± 1.6 |
| 2.5 Flash Lite Preview | LR5 | 63.6 ± 1.1 | 60.5 ± 1.0 | 80.0 ± 0.0 | 47.2 ± 2.2 | 66.0 ± 1.6 | 62.2 ± 1.3 |
| 2.5 Flash Lite Preview | LR10 | 53.2 ± 2.2 | 52.2 ± 1.6 | 75.2 ± 2.2 | 30.4 ± 4.2 | 57.7 ± 1.1 | 53.5 ± 0.4 |
| 2.5 Flash Lite Preview | Direct | 53.6 ± 1.1 | 54.8 ± 1.5 | 39.2 ± 2.2 | 68.0 ± 0.0 | 54.0 ± 1.1 | 52.2 ± 0.6 |
| 2.0 Flash * | LR5 | 80.0 ± 0.0 | 75.9 ± 0.0 | 88.0 ± 0.0 | 72.0 ± 0.0 | 88.1 ± 1.1 | 89.7 ± 1.1 |
| 2.0 Flash * | LR10 | 72.0 ± 0.0 | 66.7 ± 0.0 | 88.0 ± 0.0 | 56.0 ± 0.0 | 84.6 ± 0.7 | 85.2 ± 0.6 |
| 2.0 Flash * | Direct | 78.0 ± 0.0 | 73.3 ± 0.0 | 88.0 ± 0.0 | 68.0 ± 0.0 | 78.0 ± 0.0 | 70.5 ± 0.0 |
| 2.0 Flash Lite | LR5 | 69.5 ± 2.5 | 64.6 ± 1.9 | 89.2 ± 2.7 | 49.7 ± 2.2 | 83.0 ± 1.9 | 75.8 ± 1.6 |
| 2.0 Flash Lite | LR10 | 65.5 ± 1.8 | 59.6 ± 1.6 | 95.6 ± 0.0 | 34.7 ± 3.2 | 85.2 ± 0.7 | 80.5 ± 1.6 |
| 2.0 Flash Lite | Direct | 82.7 ± 3.3 | 80.8 ± 4.8 | 86.4 ± 2.7 | 79.1 ± 6.3 | 82.7 ± 3.3 | 76.3 ± 4.2 |
| 1.5 Pro | LR5 | 80.0 ± 0.0 | 94.1 ± 0.0 | 64.0 ± 0.0 | 96.0 ± 0.0 | 91.8 ± 0.3 | 89.0 ± 1.1 |
| 1.5 Pro | LR10 | 76.0 ± 2.5 | 93.3 ± 0.6 | 56.8 ± 5.0 | 96.0 ± 0.0 | 87.6 ± 0.5 | 85.4 ± 1.0 |
| 1.5 Pro | Direct | 72.4 ± 2.1 | 92.5 ± 0.6 | 48.8 ± 4.2 | 96.0 ± 0.0 | 72.4 ± 2.1 | 70.6 ± 2.1 |
| 1.5 Flash | LR5 | 72.0 ± 1.8 | 85.4 ± 5.7 | 52.8 ± 2.2 | 91.2 ± 4.2 | 81.3 ± 0.9 | 82.6 ± 1.0 |
| 1.5 Flash | LR10 | 78.8 ± 2.2 | 78.1 ± 2.3 | 80.8 ± 3.5 | 76.8 ± 2.7 | 83.6 ± 1.2 | 63.6 ± 1.6 |
| 1.5 Flash | Direct | 82.4 ± 1.1 | 76.3 ± 1.4 | 94.4 ± 2.7 | 70.4 ± 2.7 | 82.4 ± 1.1 | 75.0 ± 1.2 |
| Clinicians | LR5 | 88.0 ± 9.0 | 100.0 ± 0.0 | 76.0 ± 16.9 | 100.0 ± 0.0 | 98.1 ± 2.9 | 98.4 ± 2.6 |
| Clinicians | LR10 | 98.0 ± 3.0 | 100.0 ± 0.0 | 96.0 ± 6.8 | 100.0 ± 0.0 | 99.0 ± 1.8 | 99.2 ± 1.6 |
| Crowdworkers [17] | LR5 | 92.0–98.0 ± 3.0–7.0 | 92.0–100.0 ± 0.0–10.4 | 88.0–96.0 ± 6.8–13.0 | 92.0–100.0 ± 0.0–10.4 | 99.0–99.4 | 99.1–99.4 |
| Crowdworkers [17] | LR10 | 90.0–96.0 ± 5.0–8.0 | 85.7–100.0 ± 0.0–12.4 | 92.0–96.0 ± 6.8–10.0 | 84.0–100.0 ± 0.0–13.7 | 98.5–98.7 | 98.9–99.0 |
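
For readers who want to relate the threshold-based columns of this table to raw counts, the sketch below computes them from a confusion matrix. The counts are an assumption: they are back-calculated from the 2.0 Flash "Direct" row under the guess that the evaluation set held 25 ASD and 25 NT videos, which the table itself does not state. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Assumed counts: 25 ASD videos (22 detected) and 25 NT videos (17 cleared).
metrics = classification_metrics(tp=22, fn=3, tn=17, fp=8)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under these assumed counts, accuracy, precision, sensitivity, and specificity round to 78.0, 73.3, 88.0, and 68.0 percent, matching that row. The balanced accuracy of 78.0 percent also equals the row's ROC-AUC, which is what one would expect when a ROC curve is built from a single hard binary prediction rather than a continuous score.
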

| Video | Ground Truth | Agree? | Gemini 2.5 Pro Strategy | Gemini 2.5 Flash Strategy |
|---|---|---|---|---|
| Cases with Different Diagnoses (Disagreement) | | | | |
| V1 | ASD | No | Diagnosed: ASD. Atypical behavior emphasis: “Name called, no response. That’s a red flag”; “arm flapping is a classic stim”; “lining them up…restrictive, repetitive” | Diagnosed: NT. Strength emphasis: “Clear positive social interaction”; “joint attention and shared activity”; “typical developmental trajectories” |
| V20 | NT | No | Diagnosed: NT. Social competency focus: “Initiating with joke…understands social nuances, humor”; “excellent social-emotional reciprocity” | Diagnosed: ASD. Pattern recognition: “Repetitive question eliciting repetitive response”; “could fall under echolalia”; “lacks genuine turn-taking” |
| V69 | ASD | No | Diagnosed: ASD. Exceptional skill flagging: “2.5-year-old knowing capitals is highly abnormal”; “example of hyperlexia…points to ASD” | Diagnosed: NT. Social priority: “Child seems quite engaged”; “maintaining eye contact”; “developmental trajectory appears typical” |
| Cases with Same Diagnoses but Different Strategies (Agreement) | | | | |
| V3 | NT | Yes | Diagnosed: NT. Performance assessment: “Not just singing; he is performing”; “showing social awareness and engagement”; “obvious give and take” | Diagnosed: NT. Systematic checklist: “Run through typical indicators for ASD”; “could be motor mannerisms…also common for neurotypical children” |
| V15 | ASD | Yes | Diagnosed: ASD. Temporal mapping: “0:38–0:42: Caregiver calls name…glances briefly”; “1:15–1:20: Repetitive hand movements”; chronological pattern analysis | Diagnosed: ASD. Categorical analysis: “1. Social Interaction Deficits”; “2. Communication Challenges”; “3. Restricted, Repetitive Patterns” |
| V50 | NT | Yes | Diagnosed: NT. Cognitive contextualization: “Interest in capitals is intense, but…it’s the way she engages with the adult”; evaluates social function of interests | Diagnosed: NT. Conservative assessment: “Video is extremely short…lacks context”; “impossible to make reliable assessment”; defaults to NT when uncertain |

| Group | Consensus Rank | Feature | Permutation Rank | Random Forest Rank |
|---|---|---|---|---|
| LLMs | 1 | Stereotyped Speech-1 | 2 | 2 |
| LLMs | 2 | Expressive Language-2 | 4 | 1 |
| LLMs | 3 | Expressive Language-1 | 3 | 3 |
| LLMs | 4 | Eye Contact | 1 | 5 |
| LLMs | 5 | Shares Excitement | 5 | 4 |
| Crowdworkers | 1 | Shares Excitement | 3 | 1 |
| Crowdworkers | 2 | Eye Contact | 1 | 4 |
| Crowdworkers | 3 | Emotion Expression | 4 | 2 |
| Crowdworkers | 4 | Social Overtures | 5 | 3 |
| Crowdworkers | 5 | Communicative Engagement | 2 | 8 |
| Clinicians | 1 | Expressive Language-2 | 3 | 1 |
| Clinicians | 2 | Expressive Language-1 | 2 | 2 |
| Clinicians | 3 | Stereotyped Speech-1 | 1 | 3 |
| Clinicians | 4 | Speech Patterns-2 | 6 | 4 |
| Clinicians | 5 | Eye Contact | 7 | 5 |
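
The table does not say how the consensus rank is derived from the two importance rankings. One reading that reproduces the ordering shown in all three groups is to sort features by the sum of their permutation and random forest ranks, breaking ties by the random forest rank. The sketch below tests that reading on the LLM and clinician rows. This rule is an inference from the displayed rows only, not the authors' stated method, and features outside each top five are not available to check it against; `consensus_order` is a hypothetical helper name.

```python
def consensus_order(ranks):
    """Order features given {feature: (permutation_rank, random_forest_rank)}."""
    # Assumed rule: lower rank sum first; ties go to the better random forest rank.
    return sorted(ranks, key=lambda feature: (sum(ranks[feature]), ranks[feature][1]))


# Rank pairs copied from the table, deliberately listed out of consensus order.
llm_ranks = {
    "Eye Contact": (1, 5),
    "Shares Excitement": (5, 4),
    "Expressive Language-1": (3, 3),
    "Expressive Language-2": (4, 1),
    "Stereotyped Speech-1": (2, 2),
}
clinician_ranks = {
    "Eye Contact": (7, 5),
    "Stereotyped Speech-1": (1, 3),
    "Speech Patterns-2": (6, 4),
    "Expressive Language-1": (2, 2),
    "Expressive Language-2": (3, 1),
}

for group, ranks in [("LLMs", llm_ranks), ("Clinicians", clinician_ranks)]:
    print(group, consensus_order(ranks))
```

The clinician rows are the informative case: the top three features all have a rank sum of 4, so only the tie-break determines their order.
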

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Azizian, P.; Honarmand, M.; Jaiswal, A.; Kline, A.; Dunlap, K.; Washington, P.; Wall, D.P. Multimodal LLM vs. Human-Measured Features for AI Predictions of Autism in Home Videos. Algorithms 2025, 18, 687. https://doi.org/10.3390/a18110687

