1. Introduction
Overview of ADHD diagnosis and why it matters: Attention Deficit/Hyperactivity Disorder (ADHD) is one of the most prevalent neurodevelopmental disorders of childhood, affecting millions of individuals worldwide. ADHD is characterized by chronic patterns of inattention, hyperactivity, and impulsivity that substantially impair academic, social, and occupational functioning. Accurate and early diagnosis is critical because it enables early intervention, which reduces the disorder’s impact on an individual’s life trajectory. Despite the need for early identification, ADHD diagnosis remains challenging due to the lack of definitive biomarkers and the subjective nature of behavioral assessment. Clinicians therefore rely heavily on symptom rating scales, parent and teacher reports, and behavioral observation, all of which are prone to bias and variability. This has made the need for computerized, data-driven diagnostic tools for ADHD increasingly evident, and the application of machine learning (ML) techniques is a promising step toward improved diagnostic accuracy, reliability, and efficiency. In recent years, machine learning techniques have been applied to healthcare data for predictive modeling, risk estimation, and pattern detection; in ADHD, however, the application of machine learning has been studied far less than in other mental disorders. In this paper, we investigate the potential of machine learning to predict ADHD diagnosis using the WiDS Datathon 2025 dataset. WiDS Datathon 2025 and the dataset: The WiDS Datathon 2025 gives participants the opportunity to apply machine learning methods to real health data. The 2025 contest poses the challenge of predicting ADHD diagnosis from a given behavioral, demographic, and clinical dataset.
The data provided for the challenge contains a wide range of features that may influence the development and diagnosis of ADHD, including parent education, age, gender, socioeconomic status, medical history, and a series of behavioral and psychological tests. The task posed by this dataset is difficult but realistic: to diagnose ADHD in a timely and accurate manner. Data preprocessing and feature engineering are key to the success of machine learning models, especially when dealing with large, noisy, and imbalanced data. The paper begins with an exploration of the dataset, examining prominent features and applying appropriate preprocessing techniques to prepare the data for effective modeling. The following sections outline the steps involved in handling missing values, scaling features, encoding categorical variables, and addressing class imbalance, all of which are common in health-related data. Data preprocessing and feature engineering: As in any machine learning project, data preprocessing is the foundation of an effective prediction model. For the WiDS Datathon dataset, preprocessing involved dealing with missing values, scaling the quantitative features, and encoding the categorical features to make the dataset suitable for different algorithms. Missing values were addressed through imputation techniques chosen so as not to compromise the quality of the dataset or lose meaningful information. Feature engineering techniques such as one-hot encoding for categorical features and standardization for quantitative features enabled the models to learn from the data more effectively. Moreover, feature selection played an important role in improving model performance, since not every feature in the dataset is useful for classifying ADHD.
Through feature importance analysis and dimensionality reduction techniques such as Principal Component Analysis (PCA), only the most important features were retained. This focused the models on the features contributing most to the classification task, improving both accuracy and interpretability. Machine learning models to forecast ADHD: To predict ADHD, several machine learning algorithms were considered. These included decision tree-based ensembles such as Random Forest and XGBoost, which handle both numerical and categorical variables well in healthcare data. Logistic Regression was employed as a baseline model because of its simplicity and interpretability, and Support Vector Machines (SVMs) were also considered to assess whether they could capture complex patterns in the data. The Random Forest model was selected specifically for its interpretability and its capacity to work with high-dimensional data. Hyperparameter tuning was performed using Grid Search and Random Search to determine the best set of parameters for each model, with the aim of achieving the highest possible accuracy while avoiding overfitting and ensuring that the model generalizes to unseen data. Performance measurement and outcomes: Model performance was evaluated using standard metrics: accuracy, precision, recall, F1-score, and the AUC-ROC curve. The achieved accuracy of 77% indicates that machine learning models can support ADHD diagnosis. Though this performance is a satisfactory starting point, predictive accuracy could be improved through further hyperparameter tuning, ensemble methods, and more advanced approaches such as deep learning.
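The evaluation metrics listed above follow directly from the confusion-matrix counts. As an illustration only, the sketch below computes them from hypothetical label vectors (not the paper's actual predictions):

```python
# Hypothetical labels: 1 = ADHD-positive, 0 = ADHD-negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed ADHD cases
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)                  # overall fraction correct
precision = tp / (tp + fp)                          # of flagged cases, how many are real
recall = tp / (tp + fn)                             # of real cases, how many are caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

In a clinical setting, recall is the metric most closely tied to the cost of false negatives (missed ADHD cases) discussed below.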
A closer look at the confusion matrix showed that the model detected both ADHD-positive and ADHD-negative cases reasonably well, with moderate false positive and false negative rates. The ROC curve likewise showed that the model could discriminate between the two classes quite well. Much work remains to make the model more precise and reduce errors, especially for real clinical practice, where the cost of false negatives (missing ADHD) is high. Challenges and future work: Despite the encouraging performance achieved in this paper, a number of challenges must be addressed to enhance the model further. Among them is the class imbalance of the data, which leads to biased estimates and loss of model accuracy; future work will apply techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and cost-sensitive learning to reduce this imbalance. The interpretability of machine learning models is another challenge, especially in healthcare, where clinicians need to see the rationale behind predictions. Approaches such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) will be explored to provide better insight into model decisions, increasing trustworthiness and adoption in practice. Significance of the research: The evidence presented in this paper adds to the body of research examining machine learning applications for diagnosing neurodevelopmental disorders such as ADHD. Using a comprehensive and diverse dataset, this research demonstrates the ability of machine learning algorithms to yield more accurate, scalable, and affordable diagnostic tools. These technologies have the potential to assist clinicians in making better-informed treatment and diagnostic decisions, which can result in improved outcomes for individuals with ADHD.
This study also paves the way for future advancements in the application of AI in medicine. As machine learning advances, more sophisticated algorithms and richer, more accurate datasets will contribute to even better prediction models for the diagnosis of ADHD.
2. Literature Survey
Literature Review of Machine Learning and AI Applications for ADHD Diagnosis: Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder that affects millions of people globally, influencing individuals throughout their lifespan. The condition is defined by symptoms of inattention, hyperactivity, and impulsivity, which produce severe impairments in educational, social, and occupational functioning. Classical ADHD diagnosis depends heavily on clinical interviews, behavioral observation, and questionnaires that are subjective and open to misinterpretation. This subjectivity has generated interest in more objective and quantitative diagnostic methods, and machine learning (ML) and artificial intelligence (AI) approaches have shown potential in this regard [1,2,3,4]. EEG-Based Machine Learning Classification for ADHD: One route to objective measures is electroencephalography (EEG), a non-invasive modality that records the electrical activity of the brain. EEG data carries neurophysiological markers from which insights into ADHD may be drawn. Fink proposes a supervised machine learning pipeline for classifying ADHD from EEG signals, using UMAP for dimensionality reduction together with models such as Random Forest, XGBoost, CatBoost, and Logistic Regression. The study demonstrates the potential of machine learning to separate ADHD from control subjects using EEG data, achieving a test accuracy of 73.71% with the Random Forest model [5,6]. However, the author also notes the challenges associated with EEG data, such as noise, artifacts, and inter-subject variability, highlighting the need for further research in feature engineering and artifact handling [7]. Prediction Models in ADHD: Salazar de Pablo et al. conducted a systematic review and meta-regression to assess the current state of prediction models for ADHD. The review included 100 prediction models, the majority focused on diagnosis rather than prognosis or treatment response. It found that while numerous prediction models have been developed, few have been externally validated or implemented in clinical practice, and many are at high risk of bias [3]. The authors find that adding clinical predictors enhances the performance of ADHD prediction models, and they stress that future research should aim at high-quality, replicable, and externally validated models to advance the field [8]. Multimodal Imaging Classification of ADHD: Colby et al. explored the application of multimodal imaging data to diagnose ADHD, based on structural and functional magnetic resonance imaging (MRI). The research used data from the ADHD-200 machine learning competition, an initiative to create imaging-based classifiers of ADHD. Their strategy combined feature selection and subset extraction to predict ADHD diagnosis with an accuracy of 55% [9]. The study sheds light on the neurobiological basis of ADHD by extracting predictive structural and functional features; the authors also discuss the difficulties of using multisite imaging data and the value of data-sharing efforts such as ADHD-200. Synthesis: The literature shows an increasing interest in exploiting machine learning and AI to enhance the diagnosis and understanding of ADHD. EEG-based machine learning models show promise for objective ADHD classification but require further refinement to address data-quality challenges. Prediction models offer a pathway towards personalized medicine in ADHD, but many existing models lack external validation and clinical implementation. AI systems have the potential to automate the diagnostic process and alleviate the burden on healthcare systems [2,3,5,6,9]. Multimodal imaging studies provide insights into the neurobiological basis of ADHD, contributing to the development of more accurate diagnostic tools. Overall, the surveyed papers highlight the potential of computational methods to advance ADHD research and clinical practice while acknowledging the ongoing challenges and the need for continued investigation.
Table 1 summarizes recent research papers on ADHD detection, including the datasets used, machine learning models applied, and reported performance metrics.
3. Methodology
The steps involved in the proposed flowchart for ADHD diagnosis are shown in Figure 1. They are as follows: First, we loaded the data provided by the WiDS Datathon 2025 from the Kaggle platform. Next, we applied data visualization to understand the relationships between attributes, detect outliers, and identify missing values; these are represented through various graphs (bar plots, scatter plots, pie charts, etc.). The data pre-processing step comprised several sub-steps. The first was data cleaning: filling null values, eliminating duplicate rows, fixing structural errors, and detecting and removing outliers. Null values in numeric attributes were filled with the mean or the median depending on the skewness of each attribute, and null values in categorical attributes were filled with the mode. We used the IQR (interquartile range) method to detect and remove outliers; there were no duplicate rows or structural errors in the given datasets. The purpose of the second sub-step, data integration, was to merge the datasets: as three datasets were provided, we combined them for subsequent processing. The third sub-step was data transformation, covering feature scaling (normalization or standardization), encoding of categorical variables, and discretization. Fourth, dimensionality reduction was performed: after merging the datasets, the total number of attributes was 19,030, so we applied the PCA technique to reduce the dimension of the dataset so that the models could work better.
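The cleaning rules described above (mean or median imputation chosen by skewness, mode imputation for categoricals, and 1.5×IQR outlier removal) can be sketched in pandas. The column contents and helper names below are illustrative, not the actual WiDS schema:

```python
import numpy as np
import pandas as pd

def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Impute numeric NaNs with the mean or the median depending on skew,
    then drop rows flagged as outliers by the 1.5*IQR rule."""
    df = df.copy()
    num_cols = df.select_dtypes(include="number").columns
    for col in num_cols:
        # Roughly symmetric columns get the mean, skewed ones the median.
        fill = df[col].mean() if abs(df[col].skew()) < 0.5 else df[col].median()
        df[col] = df[col].fillna(fill)
    # Keep only rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for every numeric column.
    q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
    iqr = q3 - q1
    mask = ((df[num_cols] >= q1 - 1.5 * iqr) &
            (df[num_cols] <= q3 + 1.5 * iqr)).all(axis=1)
    return df[mask]

def clean_categorical(df: pd.DataFrame) -> pd.DataFrame:
    """Impute categorical NaNs with the column mode."""
    df = df.copy()
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```

After this cleaning pass, the merged frame would be scaled, encoded, and fed to PCA as described above.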
Next came hypothesis testing, in which we formulated various assumptions and tested whether they held; this helped us gauge how strong or weak the relationships between different attributes were. A train–test split would normally divide the data in a ratio such as 3:7, 4:6, or 5:5, with the training portion used to fit the ML model and the test portion held out to evaluate it later. In this project, the WiDS Datathon already provided separate train and test datasets, so we did not apply a train–test split. After all these steps, we trained the ML model by applying several ML algorithms, seeking the model with the highest accuracy. In our project, combining the XGBoost and Random Forest algorithms yielded an accuracy of 79.42%. Lastly, we tested the ML model to confirm that it could provide accurate results.
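The exact way XGBoost and Random Forest were combined is not specified here; one plausible reading is a soft-voting ensemble. The sketch below shows that pattern with scikit-learn, using synthetic data and `GradientBoostingClassifier` as a dependency-free stand-in for XGBoost; both are assumptions of the sketch, not the paper's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the WiDS features after preprocessing and PCA.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Soft voting averages the two models' predicted class probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)  # held-out accuracy
```

With real data, `xgboost.XGBClassifier` could be dropped into the `estimators` list in place of the gradient-boosting stand-in.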
4. Results
In our project, before merging all the datasets, we tried to analyze each dataset by plotting different kinds of graphs. First, we analyzed the relationships between the attributes in the categorical data, which contains information on each participant’s personal life and on when and where they underwent the test. A thorough summary of the socioeconomic and demographic traits of the research participants is provided in Figure 2. The distribution of participants by enrollment year is shown in the top-left histogram with a KDE line; the highest enrollment, of about 410 participants, took place in 2018–2019. The following two bar charts depict the distributions of ethnicity and race, respectively, while the adjacent bar plot displays the percentage distribution of the study sites. The MRI scan locations for each participant are summarized in the pie chart. Box plots show the distributions of Parent 1’s and Parent 2’s educational levels, highlighting the differences in their educational backgrounds. The distributions of both parents’ occupation scores are shown in histograms with KDE curves. Finally, the relationship between Parent 1’s occupation and education scores is displayed in the scatter plot in the lower-right corner, suggesting possible associations between these socioeconomic factors.
A heatmap is a data visualization technique that uses color to represent the magnitude of the values in a matrix. In a correlation heatmap, each cell shows the strength and direction of the relationship between two numeric variables, with colors indicating positive (e.g., red), negative (e.g., blue), or no correlation (e.g., white). In Figure 3, enrollment year and MRI scan location have a fairly strong positive correlation (0.71, mild red), meaning the two tend to increase together. On the other hand, parental education and child ethnicity have a weak negative correlation (−0.17, dark blue), meaning one tends to decrease as the other increases.
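A correlation matrix of the kind underlying such a heatmap can be computed directly with pandas. The column names and synthetic values below are hypothetical stand-ins for the dataset's encoded attributes, with one correlated pair built in by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
enroll_year = rng.integers(2016, 2022, n).astype(float)
# Hypothetical encoded columns; "mri_location" is correlated with
# enrollment year by construction, "parent_edu" is independent.
df = pd.DataFrame({
    "enroll_year": enroll_year,
    "mri_location": enroll_year + rng.normal(0, 1.5, n),
    "parent_edu": rng.normal(15, 3, n),
})

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]
# In practice the heatmap itself would be drawn with, e.g.,
# seaborn.heatmap(corr, cmap="coolwarm", annot=True)
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the constructed pair shows a clearly positive off-diagonal entry.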
In Figure 4, a regression plot visualizes the relationship between Parent 1’s occupation level and MRI scan location, using a linear trend line to show a potential correlation. Each point represents a data entry, with occupation scores on the x-axis and encoded MRI locations on the y-axis. While the plot suggests a trend, MRI location is a categorical variable encoded as numbers, so a regression plot may not be appropriate; a box plot or strip plot would better represent such data.
In Figure 5, a bar chart shows how MRI scan locations are distributed across different study sites. The plotting code first ensures the relevant columns are present and correctly formatted, then uses a grouped bar plot to count how many scans occurred at each location, with different colors representing the study sites.
The quantitative data contains the numeric attributes measuring the different kinds of tests performed on the participants. In Figure 6, the mean of each attribute is calculated and plotted as a bar chart, where each bar represents the average value of one attribute. This visualization provides a quick overview of the central tendency of key numerical features in the dataset, helping to understand their relative magnitudes and identify those that stand out as particularly high or low on average.
In Figure 7, the plot displays histograms along the diagonal to show the distribution of each variable, while the scatter plots in the off-diagonal cells reveal how pairs of variables are related. This helps in identifying patterns, trends, or correlations among the selected features in a clear and compact format.
A variety of behavioral, psychological, and parental factors obtained from questionnaire-based assessments are depicted in Figure 8. The top visualizations use histograms with density curves to show the distribution of color vision scores and general eating habits. Frequency distributions of conduct problem scores from the Strengths and Difficulties Questionnaire (SDQ) are displayed alongside percentage-based bar charts of parental control and involvement scores. Histograms and density curves display the total difficulties and emotional problem scores. Distributions of externalizing behavior scores (such as aggression and hyperactivity), internalizing behavior scores (such as depression and anxiety), and hyperactivity scores are also shown in the figure. Frequency and percentage distributions illustrate peer-related problems and prosocial behaviors, respectively. The impact levels of challenges in day-to-day living are summarized in a pie chart. Additional plots show the age distribution at the time of the MRI scan, along with scores representing parental monitoring, discipline support, and parenting techniques such as the use of rewards and punishments.
One hypothesis is that participants with higher prosocial behavior scores tend to exhibit fewer conduct problems, suggesting a negative relationship between these two behavioral attributes. In Figure 9, a scatter plot with a regression trendline hints at this inverse relationship; while encouraging prosocial skills could be beneficial, further research is necessary to confirm the strength and direction of this influence.
Let us visualize the distribution of externalizing behavior problems across different levels of parental control. The violin plot shown in Figure 10 combines aspects of box plots and density plots, making it easy to observe both the spread and the concentration of the data. The plot allows a comparison of the distribution shapes, medians, and variability of externalizing scores within each category of parental control, potentially highlighting trends or differences that warrant further analysis.
Let us explore how age at the time of the MRI scan is distributed across different levels of perceived impact of difficulties. The hypothesis assumes that the distribution of participant age might vary depending on how impactful their difficulties are perceived to be. In Figure 11, a swarm plot displays the spread and clustering of individual data points within categories, helping to identify subtle differences or overlaps. The visual output allows a nuanced comparison of age distributions across impact levels. However, the observed result suggests that age does not show a strong pattern in relation to perceived impact, indicating that factors beyond age, such as personality traits or life experiences, might play a more influential role in shaping how individuals perceive the severity of their challenges.
After combining all three datasets, outliers were removed. Several machine learning algorithms were applied, and finally, a combined model using XGBoost and Random Forest achieved an accuracy of 79.42%. While these results are promising, there is potential for further improvement. Future work will focus on incorporating additional features, tuning model parameters, and exploring advanced ensemble techniques to enhance prediction accuracy.
5. Conclusions
This study shows how machine learning can improve the identification of ADHD, especially in females, a group that is underdiagnosed because of their tendency to internalize symptoms. The proposed method used XGBoost and Random Forest classifiers to attain a promising accuracy of 79.42% by utilizing the variety of datasets made available by the WiDS Datathon 2025: categorical, quantitative, and functional connectome data. These findings show how data-driven models capture the subtle trends associated with ADHD in females and provide a step towards an earlier, more accurate, and fairer diagnosis. Future research can expand on this by incorporating longitudinal data, enhancing feature selection, and generalizing the model to larger populations.
Author Contributions
Conceptualization, P.P. and K.K.; methodology, S.P.; software, S.G.; validation, S.P., K.K. and S.G.; formal analysis, P.P.; investigation, K.K.; resources, S.P.; data curation, S.G.; writing—original draft preparation, P.P.; writing—review and editing, P.P.; visualization, S.G.; supervision, P.P.; project administration, P.P.; funding acquisition, P.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study because it was conducted using anonymized, publicly available data provided by the WiDS database and did not involve direct interaction with human participants.
Informed Consent Statement
Patient consent was waived because the study was conducted using anonymized, publicly available data provided by the WiDS database and did not involve direct interaction with human participants.
Data Availability Statement
The WiDS Datathon 2025 provided three datasets: categorical data, quantitative data, and functional connectomes, containing information on 1213 participants undergoing assessment for ADHD. The data can be accessed here:
https://www.kaggle.com/competitions/widsdatathon2025/data (accessed on 18 April 2025).
Acknowledgments
We gratefully acknowledge the Department of Computer Science, KLE Technological University, Belagavi, for their constant support, valuable guidance, and wise feedback during this research. We also thank all the team members for their hard work and cooperation, and the WiDS organization and Kaggle for furnishing the dataset for this study.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| ADHD | Attention Deficit/Hyperactivity Disorder |
| ML | Machine Learning |
| EEG | Electroencephalography |
| MRI | Magnetic Resonance Imaging |
| fMRI | Functional Magnetic Resonance Imaging |
| WiDS | Women in Data Science |
| PCA | Principal Component Analysis |
| SMOTE | Synthetic Minority Over-sampling Technique |
| SDQ | Strengths and Difficulties Questionnaire |
| XGBoost | Extreme Gradient Boosting |
| IQR | Interquartile Range |
References
- Ameer, I.; Arif, M.; Sidorov, G.; Gómez-Adorno, H.; Gelbukh, A. Mental Illness Classification on Social Media Texts Using Deep Learning and Transfer Learning. arXiv 2022, arXiv:2207.01012. [Google Scholar] [CrossRef]
- Sharma, A.; Jain, A.; Sharma, S.; Gupta, A.; Jain, P.; Mohanty, S.P. IPAL: A Machine Learning Based Smart Healthcare Framework for Automatic Diagnosis of ADHD. arXiv 2023, arXiv:2302.00332. [Google Scholar]
- Salazar de Pablo, G.; Iniesta, R.; Bellato, A.; Caye, A.; Dobrosavljevic, M.; Parlatini, V.; Garcia Argibay, M.; Li, L.; Cabras, A.; Ali, M.H.; et al. Individualized Prediction Models in ADHD: A Systematic Review and Meta-Regression. Mol. Psychiatry 2024, 29, 3865–3873. [Google Scholar] [CrossRef] [PubMed]
- Alhussen, A.; Alutaibi, A.I.; Sharma, S.K.; Khan, A.R.; Ahmad, F.; Tejani, G.G. Early ADHD Diagnosis with NeuroDCT-ICA and RFO Algorithm Based ADHD-AttentionNet. Sci. Rep. 2025, 15, 6967. [Google Scholar] [CrossRef] [PubMed]
- Balamurugan, A.M.; Venusree, S.; Nandhini, M.V. Machine Learning Approach for ADHD Diagnosis in Children Using EEG. In Proceedings of the International Conference on Intelligent Systems and Computational Networks (ICISCN), Chennai, India, 24–25 January 2025. [Google Scholar]
- Fink, N. A High-Accuracy Supervised Machine Learning Approach for ADHD Classification Using EEG Signals. SSRN Electron. J. 2024. Available online: https://ssrn.com/abstract=5146511 (accessed on 20 April 2025).
- Saurabh, S.; Gupta, P.K. Deep Learning-Based Modified Bidirectional LSTM Network for Classification of ADHD Disorder. Arab. J. Sci. Eng. 2024, 49, 3009–3026. [Google Scholar] [CrossRef]
- Tucker, R.; Williams, C.; Reed, P. Association of Exercise and ADHD Symptoms: Analysis within an Adult General Population Sample. PLoS ONE 2025, 20, e0314508. [Google Scholar] [CrossRef] [PubMed]
- Colby, J.B.; Rudie, J.D.; Brown, J.A.; Douglas, P.K.; Cohen, M.S.; Shehzad, Z. Insights into Multimodal Imaging Classification of ADHD. Front. Syst. Neurosci. 2012, 6, 59. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Tachmazidis, I.; Batsakis, S.; Adamou, M.; Papadakis, E.; Antoniou, G. Diagnosing attention-deficit hyperactivity disorder (ADHD) using artificial intelligence: A clinical study in the UK. Front. Psychiatry 2023, 14, 1164433. [Google Scholar] [CrossRef] [PubMed]
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).