Automatic Identification of Children with ADHD from EEG Brain Waves

Abstract: EEG (electroencephalogram) signals could be used reliably to extract critical information regarding ADHD (attention deficit hyperactivity disorder), a childhood neurodevelopmental disorder. The early detection of ADHD is important to lessen the development of this disorder and reduce its long-term impact. This study aimed to develop a computer algorithm to identify children with ADHD automatically from the characteristic brain waves. An EEG machine learning pipeline is presented here, including signal preprocessing and data preparation steps, with thorough explanations and rationale. A large public dataset of 120 children was selected, containing large variability and minimal measurement bias in data collection and reproducible child-friendly visual attentional tasks. Unlike other studies, EEG linear features were extracted to train a Gaussian SVM-based model from only the first four sub-bands of EEG. This eliminates signals above 30 Hz, thus reducing the computational load for model training while keeping a mean accuracy of ~94%. We also performed rigorous validation (obtaining 93.2% and 94.2% accuracy for holdout and 10-fold cross-validation, respectively) to ensure that the developed model is minimally impacted by the bias and overfitting that commonly appear in the ML pipeline. These performance metrics indicate the ability to automatically identify children with ADHD in a local clinical setting and provide a baseline for further clinical evaluation and timely therapeutic attempts.


Introduction
Attention deficit hyperactivity disorder (ADHD) is a behavior disorder characterized by inattention, impulsivity, and in some cases hyperactivity, typically diagnosed in childhood [1]. It is a common childhood developmental disorder. The symptoms of ADHD start before age 12, and in some children, they are noticeable as early as three years of age [2]. The prevalence of ADHD has been estimated at approximately 12.1% among boys and 3.9% among girls [3]. About 6.4 million American children aged 4-17 have been diagnosed with ADHD [4]. ADHD makes it difficult for children to develop the skills to control their attention, behavior, emotions, and activities. As a result, they often act in ways that are hard for parents to manage [5]. Persistent adult ADHD may cause serious long-term consequences, such as poor academic achievement and job performance, increased risk of antisocial behavior, and drug and alcohol abuse [3]. Hence, early detection of this disorder is of great value [6,7].
EEG is a reliable method that provides information about the background activity of the brain and indexes the substrate of cognition and behavior; Table 1 lists the EEG sub-bands and their associated brain functions [8]. It can therefore be a useful tool for investigating and diagnosing the abnormal behavior of children with ADHD. J. Lubar conducted the first EEG study of abnormalities in ADHD in 1973. He found that theta activity (Table 1) increased and beta power (Table 1) was dramatically reduced in ADHD [9]. Most patients with ADHD share a brain-wave pattern that consists of an abundance of slow (delta or theta) brain waves and a shortage of fast (beta) brain waves. This means that they have a high theta-to-beta ratio [10], which could be employed for automatic recognition from the characteristic brain wave.
Recently, researchers have been working with fMRI (functional magnetic resonance imaging) and MRI to identify ADHD. This is a fast-developing and complex research domain [13]. Using fMRI, Rubia et al. [14,15] reported decreased activation in the ADHD group in mesial and lateral prefrontal areas in the right hemisphere and in the cingulate gyrus. EEG, on the other hand, is quicker, more affordable, and portable, and it gives accessible insights into brain function, which makes it a practical tool for investigating and diagnosing ADHD in children.
Table 1. EEG sub-bands are associated with different brain functions [11,12].

Sub-Band Frequency Range Associated Brain Function
Even with the current progress, EEG-based ADHD detection needs a more precise approach to produce more accurate results. The amount of information in EEG signals is vast, and it is difficult for a human to detect abnormalities manually. This is where machine learning (ML) can be useful. Generally, ML is programming computers to optimize a performance criterion using example data or experience [16], which makes it suitable for the current task.
This study thus aims to automatically identify children with ADHD by employing machine learning techniques. Previously, most studies used nonlinear features of the EEG signal and used KNN or neural networks for classification (discussed in Section 2). In this study, we extracted statistical, time-domain, and frequency-domain EEG features; used PCA to reduce the feature set; trained a Gaussian SVM classifier on the reduced features; and employed two cross-validation methods, holdout and k-fold cross-validation, to validate classifier performance. Since the behavior associated with ADHD can be caused by differences in brain function, we worked with only four sub-bands: delta, theta, alpha, and beta. The overall contributions of this work can be summarized as follows.
1. An EEG ML pipeline is presented for ADHD detection, explaining each stage of the pipeline (including signal preprocessing and data preparation) with thorough explanations and rationale.
2. Unlike other studies, we employed only the first four sub-bands of EEG, eliminating signals above 30 Hz and thus reducing the computational load for ML model training while keeping a mean accuracy of 93.2%.
3. Simple EEG linear features are emphasized in our proposed model development, whereas other works were based only on complex nonlinear features.
4. The model was trained on a large dataset of 120 children (the highest of other models was 49) collected from two different sessions at two different places, eliminating the measurement bias in data collection. Also, the experimental setup was child-friendly, easy to reproduce in local settings, and could be employed for future ADHD detection.
5. We also performed rigorous validation (unlike other works) to ensure that our model is not impacted by bias and overfitting, which commonly appear in the ML pipeline.
The rest of the paper is organized as follows. First, recent and related works are presented in Section 2. Materials and method information is given in Section 3, along with the dataset description. Section 4 describes the preprocessing methods. Section 5 describes the feature extraction and feature selection. Section 6 presents the results. Lastly, Sections 7 and 8 give the discussion and conclusion of this study.

Related Works
ADHD was originally known as childhood hyperkinetic reaction [17]. The American Psychiatric Association (APA) did not officially recognize it as a mental disorder until the 1960s, and in the 1980s, the diagnosis was "attention deficit disorder with or without hyperactivity" [17]. Since then, many studies have been done to identify ADHD using fMRI and EEG. Yin et al. [18] found that neural flexibility is altered in children with ADHD and demonstrated, using fMRI data, the potential clinical utility of neural flexibility to identify children with ADHD, as well as to monitor treatment responses and disease severity. They obtained moderate accuracy: 77% for 10-fold cross-validation and 74.46% for the independent test. Pulini et al. [19] reported that the accuracy of ADHD classification using neuroimaging features ranged from 60% to 80%. According to Pulini et al., circular analysis and small samples can inflate classification accuracies in neuroimaging studies of ADHD. Overall, fMRI shows moderate accuracy and is expensive, whereas EEG offers more portability and freedom in data acquisition.
Kiiski et al. [20] calculated the weighted phase lag index (WPLI) for each frequency band of EEG to describe the functional EEG connectivity as a neuromarker for adult ADHD symptoms. Alchalabi et al. [21] applied a machine learning classifier to an EEG-controlled serious game to detect ADHD patients, where EEG data were monitored during the game. In that study, the participants played a "FOCUS" game in which the player had to move an avatar by focusing and using mental commands, and their attention levels were observed. The classifier achieved 96% accuracy in detecting the correct attention state during gameplay and 98% in classifying the patients' EEG data. Ghassemi et al. [22] used nonlinear EEG features to classify adult participants with and without ADHD. Fifty participants underwent a continuous performance test (CPT), where they had to click the left mouse button with their index finger when any letter except for the target "X" was shown on the screen. Three nonlinear features (wavelet entropy, correlation dimension, and Lyapunov exponent) were extracted, and the KNN algorithm was used as a classifier. This study achieved an accuracy of around 96%. Mohammadi et al. [23] performed EEG classification on data acquired from 30 healthy children (9.85 ± 1.77 years) and 30 children with ADHD (9.62 ± 1.75 years) during a visual attention task. Higuchi, Katz, and Petrosian fractal dimension exponents and approximate entropy were extracted from the signal as nonlinear features. With a multilayer perceptron (MLP) neural network, accuracies of 92.28% and 93.65% were achieved using the mRMR method and the DISR method, respectively. Allahverdy et al. [24] also used visual attention tasks to detect ADHD in 20 healthy children and 29 children with ADHD aged 7-12 years using EEG nonlinear features. The Lyapunov exponent and the Higuchi, Katz, and Sevcik fractal dimensions were extracted from the EEG data and gave an accuracy of 96.7% using frontal lobe electrodes with an MLP neural network.
Most of the work discussed above used EEG nonlinear features and neural networks for classification. For our model development, we selected an SVM classifier for its simplicity and effectiveness in high-dimensional spaces. The number of participants in these reported studies did not exceed 49, while our focus was to find a sample with more variability in the dataset, with a balance between ADHD and healthy subjects. In addition, the experimental setting and the data collection procedure are important for reproducing these studies and establishing a standard for child ADHD detection. The selection of the dataset for this study (described in Section 3) was made considering these factors.

Materials and Methods
The public dataset employed in this study is available in the IEEE data port [25]. All the participants were school-aged and right-handed. The participants were 60 healthy children and 60 children with ADHD diagnosed by an experienced psychiatrist of children and adolescents according to DSM-IV criteria [26]. The ADHD children had taken Ritalin for up to 6 months [25]. Ritalin is used in ADHD treatment. It works by altering the concentration of certain natural substances in the brain [27]. There is no conclusive evidence that Ritalin medication will influence the distinction of the brain waves of ADHD children. The healthy group was selected from two primary schools. Table 2 summarizes the information about the participants.
The EEG signals were recorded by a digital device (SD-C24, Sholeh Danesh Co., Tehran, Iran) in the Psychology and Psychiatry Research Center at Roozbeh Hospital (Tehran, Iran) [28]. The recording was performed based on the 10-20 standard [29] by 19 electrodes (Fz, Cz, Pz, C3, T3, C4, T4, Fp1, Fp2, F3, F4, F7, F8, P3, P4, T5, T6, O1, O2) with A1 and A2 electrodes as references on earlobes. Figure 1 shows the electrode locations of the international 10-20 system for EEG.
The recording protocol was designed based on a visual attention task. In the task, the children were shown 20 images with several age-suitable characters, such as images of different animals, and they were asked to enumerate them. The number of characters in each image was chosen between 5 and 16 randomly. To have a continuous stimulus during the EEG recording, each image was displayed immediately after the child's response. Thus, the child's performance defines the duration of the EEG recording. The correctness of the answers was not considered [28].
All procedures performed to obtain this dataset were approved by the Institutional Review Board (IRB) and the Ethical Committee of Tehran University of Medical Sciences (TUMS) [28]. Since one of the deficits in children with ADHD is visual attention [13,30], the data in this dataset were obtained with a visual attention task in which the children were shown images that were appropriate and friendly for 7- to 12-year-olds. The balanced dataset was collected from two different places and sessions, which limits measurement bias. Considering all these factors and the flexibility in acquiring datasets from children, we chose this dataset.

Preprocessing
EEG signals contain different artifacts and noise that should be removed before the analysis. The sampling frequency of the EEG signal is 128 Hz. For preprocessing, we used a 4th-order Butterworth band-pass filter with cutoff frequencies of 0.5 Hz and 63 Hz. To remove the power-line noise, a 50 Hz notch filter was used. We designed the notch filter as a Butterworth band-stop filter with cutoff frequencies of 49 Hz and 51 Hz.
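The filtering step described above can be sketched as follows. This is a minimal illustration with SciPy, not the authors' implementation: the zero-phase application (`sosfiltfilt`), the second-order-section form, the 4th-order band-stop design, and the helper names `design_filters`, `preprocess`, and `gain_at` are all assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, sosfreqz

FS = 128.0  # sampling frequency in Hz, as stated in the text


def design_filters(fs=FS, order=4):
    """Return (band-pass 0.5-63 Hz, band-stop 49-51 Hz) Butterworth filters."""
    bandpass = butter(order, [0.5, 63.0], btype="bandpass", fs=fs, output="sos")
    # Assumption: the notch is realised as a band-stop of the same order.
    notch = butter(order, [49.0, 51.0], btype="bandstop", fs=fs, output="sos")
    return bandpass, notch


def preprocess(eeg, fs=FS):
    """Filter an array of shape (n_samples, n_channels) along the time axis."""
    bandpass, notch = design_filters(fs)
    # Assumption: forward-backward filtering to avoid phase distortion.
    filtered = sosfiltfilt(bandpass, eeg, axis=0)
    return sosfiltfilt(notch, filtered, axis=0)


def gain_at(sos, freq_hz, fs=FS):
    """Magnitude response of a filter at a single frequency (in Hz)."""
    _, h = sosfreqz(sos, worN=[freq_hz], fs=fs)
    return float(np.abs(h[0]))
```

Checking `gain_at` at 10 Hz and 50 Hz is a quick way to confirm that the band of interest passes while the power-line component is suppressed.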
For each subject, the time-series EEG signal of each channel was divided into 2 s segments with 50% overlap. This means that each window shares its first 1 s with the previous window and contributes 1 s of new data. In the dataset, the minimum task duration was 50 s for one subject in the control group, and the maximum task duration was 285 s for one subject with ADHD. As the task timing differed for each subject, the number of segments varied for every subject [24]. For the classification of the EEG signal, we followed the pipeline shown in Figure 2.
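The windowing described above can be sketched in a few lines of NumPy. The function name `segment` and the array layout (samples by channels) are assumptions for illustration; only the 128 Hz sampling rate, the 2 s window, and the 50% overlap come from the text.

```python
import numpy as np


def segment(eeg, fs=128, window_s=2.0, overlap=0.5):
    """Split (n_samples, n_channels) EEG into overlapping windows.

    Returns an array of shape (n_windows, window_len, n_channels).
    """
    window_len = int(round(window_s * fs))           # 256 samples for 2 s at 128 Hz
    step = int(round(window_len * (1.0 - overlap)))  # 128 samples for 50% overlap
    n_samples, n_channels = eeg.shape
    if n_samples < window_len:
        return np.empty((0, window_len, n_channels))
    starts = range(0, n_samples - window_len + 1, step)
    return np.stack([eeg[s:s + window_len] for s in starts])
```

With these settings, the shortest recording (50 s, 6400 samples) yields 49 windows per channel and the longest (285 s, 36,480 samples) yields 284, which is why the number of instances differs between subjects.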

Feature Extraction and Feature Selection
Feature extraction is a dimensionality reduction process that reduces an initial set of raw data to more useful and manageable information for processing [31]. It has been shown to be an important step in EEG signal classification [32], because it extracts significant information from the raw data and allows classifiers to be trained efficiently. We extracted 11 features: standard deviation, RMS, skewness, kurtosis, Hjorth activity, Hjorth mobility, Hjorth complexity, Shannon's entropy, spectral entropy, power spectral density (PSD), and band power. Table 3 gives a brief description of these computed features.

Feature Name | Definition | Mathematical Description
Standard Deviation | A statistical feature that measures how spread out the data are around the mean. | $x_n$ = n-th data sample, $N$ = total number of samples, $\mu$ = mean [33,34]
RMS | The root-mean-square value of a signal. | $\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{n=1}^{N} x_n^2}$; $x_n$ = n-th data sample, $N$ = total number of samples [35]
Skewness | A measure of the lack of symmetry of the data around the mean. | $x_n$ = n-th data sample, $N$ = total number of samples, $\mu$ = mean [33]
Kurtosis | A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. | ... $- 3$; $x_n$ = n-th data sample, $N$ = total number of samples, $\mu$ = mean [36]
Hjorth Activity | The variance of the amplitude of the signal over time. It represents the signal power. | $H_a = \mathrm{var}(x(t))$; $x(t)$ = amplitude of the time-varying signal [37,38]
Hjorth Mobility | The square root of the activity of the first derivative of the signal divided by the activity of the signal. It represents the mean frequency. | $H_m = \sqrt{\mathrm{var}(x'(t)) / \mathrm{var}(x(t))}$; $x'(t)$ = first derivative of the amplitude of the signal [37,38]
Hjorth Complexity | The ratio between the mobility of the first derivative of the signal and the mobility of the signal. It represents the change in frequency. | $H_c = H_m(x'(t)) / H_m(x(t))$ [37,38]
Shannon's Entropy | A measure of the uncertainty or randomness in a dataset. | $H = \sum_{n=1}^{N} -(P_n \log P_n)$; $P_n$ = probability of occurrence of $x_n$ [39]
Spectral Entropy (SEN) | The normalized Shannon's entropy of the spectrum. | $\mathrm{SEN} = -\frac{\sum_{k=0}^{N-1} P_k \log_2 P_k}{\log N}$; $P_k$ = normalized spectral power at frequency $k$, $N$ = number of frequency bins [28]
Power Spectral Density (PSD) | The power present in the signal as a function of frequency. | [40]
Band Power | The power, derived from the power spectral density, within a specified bandwidth of a channel. | [41]

For each 2 s window, we extracted these 11 features from each of the 19 channels in each of the four sub-bands. After the feature extraction, we obtained a total of 836 features (11 × 19 × 4) and 16,474 instances.
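As an illustration of how several of the features in Table 3 can be computed for one window of one channel, the following NumPy sketch may help. It is not the authors' code: the use of population moments, the finite-difference derivative for the Hjorth parameters, the FFT-based spectrum, the base-2 normalization of the spectral entropy, and all function names are assumptions.

```python
import numpy as np

N_FEATURES, N_CHANNELS, N_BANDS = 11, 19, 4
N_FEATURES_TOTAL = N_FEATURES * N_CHANNELS * N_BANDS  # 836, as reported in the text


def hjorth(x):
    """Hjorth activity, mobility and complexity of a 1-D signal."""
    dx = np.diff(x)    # finite-difference approximation of x'(t)
    ddx = np.diff(dx)  # approximation of x''(t)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity


def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    p = psd / psd.sum()
    p = p[p > 0]  # skip empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum() / np.log2(len(psd)))


def window_features(x):
    """A subset of the Table 3 features for one window of one channel."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    activity, mobility, complexity = hjorth(x)
    return {
        "std": float(np.sqrt(m2)),
        "rms": float(np.sqrt(np.mean(x ** 2))),
        "skewness": float(np.mean(centered ** 3) / m2 ** 1.5),
        "kurtosis": float(np.mean(centered ** 4) / m2 ** 2 - 3.0),  # excess kurtosis
        "hjorth_activity": float(activity),
        "hjorth_mobility": float(mobility),
        "hjorth_complexity": float(complexity),
        "spectral_entropy": spectral_entropy(x),
    }
```

Applying `window_features` to every window, channel, and sub-band and concatenating the results would give one feature vector per window; the remaining features (Shannon's entropy, PSD, band power) would be added in the same way.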
To remove irrelevant features from the classification pipeline, a proper selection of features is necessary. We applied the ANOVA feature-ranking method to visualize which features had the highest importance scores (Figure 3). Analysis of variance (ANOVA) is a statistical method that compares variances across the means (or averages) of different groups [42]. In the current analysis, a one-way analysis of variance grouped by class was performed for each predictor variable, and the features were then ranked using the p-values. For each predictor variable, the algorithm tests the hypothesis that the predictor values grouped by the response classes are drawn from populations with the same mean against the alternative hypothesis that the population means are not all the same [43]. ANOVA thereby indicates how strongly each feature is related to the class label. To use ANOVA for feature selection, each feature is ranked by its F-statistic, and the features with the highest scores can be chosen as the optimal set of components from the available data [44].
Feature reduction is necessary to remove highly correlated features and avoid overfitting. We used PCA, an unsupervised method for dimension reduction, with 80%, 85%, 90%, and 95% explained variance. Explained variance is a statistical measure of how much variation in a dataset can be attributed to each of the principal components generated by the PCA method [45].
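The ranking and reduction steps above can be sketched with scikit-learn. This is an illustration under assumptions rather than the authors' pipeline: the standardization before PCA and the helper names `rank_features_anova` and `reduce_with_pca` are not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler


def rank_features_anova(X, y):
    """Rank features by one-way ANOVA F-statistic (highest first)."""
    f_scores, p_values = f_classif(X, y)
    order = np.argsort(f_scores)[::-1]
    return order, f_scores, p_values


def reduce_with_pca(X, explained_variance=0.90):
    """Keep the fewest principal components reaching the requested variance."""
    # Assumption: features are standardized so that no feature dominates by scale.
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=explained_variance, svd_solver="full")
    return pca.fit_transform(X_std), pca
```

Calling `reduce_with_pca` with 0.80, 0.85, 0.90, and 0.95 would reproduce the four settings compared in the text; `pca.n_components_` then reports how many components each setting keeps.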
For the classification, we used a Gaussian support vector machine (SVM) classifier, a supervised machine learning algorithm. SVM algorithms use a set of mathematical functions that are defined as the kernel [46]. The function of the kernel is to take the input data and transform it into the required form [47], so that a nonlinear decision surface becomes linear in a higher-dimensional space. It returns the inner product between two points in a suitable feature space [48]. The Gaussian kernel is one of the kernel functions that is often used when there is no prior knowledge of a given dataset [49]. It can be expressed as $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ [50]. Here, $k$ is the kernel function, $x$ and $y$ are n-dimensional inputs, and $\sigma$ is the kernel width.
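To make the kernel expression concrete, the sketch below evaluates it directly and shows how the same kernel can be requested from scikit-learn's `SVC`, whose `gamma` parameter corresponds to $1/(2\sigma^2)$ for the RBF kernel. The helper names and the default values of `sigma` and `C` are assumptions; the paper does not report its hyperparameters.

```python
import numpy as np
from sklearn.svm import SVC


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for two n-dimensional inputs."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def make_gaussian_svm(sigma=1.0, C=1.0):
    """SVM with a Gaussian (RBF) kernel of width sigma."""
    # scikit-learn parameterizes the RBF kernel as exp(-gamma * ||x - y||^2).
    return SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=C)
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, so a smaller `sigma` produces a more local, more flexible decision surface.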
Before applying the classifier, the dataset was split for holdout and k-fold cross-validation. Cross-validation is a statistical method used to estimate the true generalization performance of machine learning models [51]. The holdout method is the simplest form of cross-validation and randomly splits the dataset. For this, the dataset was separated into three sets: a training set, a validation set, and a test set. This method works well when the dataset is very large [52,53]. From the present dataset, we took 70% for the training set, 15% for the validation set, and 15% for the test set. We trained the classifier for 80%, 85%, 90%, and 95% variance of the PCA to remove the correlated features at different percentages and compared the results.
K-fold cross-validation splits the dataset into K folds [54]. From the dataset, we took 90% for the training set and 10% for the test set to evaluate the performance. We used 10 folds, which means the training set was divided into 10 parts: nine parts were used for training and one part was reserved for testing. This procedure is repeated ten times, each time reserving a different tenth for testing. Figure 4 shows the k-fold cross-validation process.
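The two validation schemes described above can be sketched with scikit-learn as follows. This is a hedged illustration, not the authors' code: the stratified splitting, the fixed random seeds, the standardization step, the default SVM hyperparameters, and the function names are assumptions, and the split is made per window because that is what the text implies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_model(explained_variance=0.90):
    """Standardize, reduce with PCA, then classify with a Gaussian SVM."""
    pca = PCA(n_components=explained_variance, svd_solver="full")
    return make_pipeline(StandardScaler(), pca, SVC(kernel="rbf"))


def holdout_split(X, y, seed=0):
    """Random 70/15/15 split into training, validation and test sets."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


def kfold_accuracy(X, y, k=10, seed=0):
    """Accuracy of the model on each of k folds."""
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    return cross_val_score(make_model(), X, y, cv=cv, scoring="accuracy")
```

Because the PCA sits inside the pipeline, it is refitted on the training portion of every fold, so the held-out fold does not influence the components used to evaluate it.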
Table 4 shows the accuracy of the classifier for holdout validation. The highest test accuracy, 93.2%, was obtained at 90% variance, and the lowest, 85.5%, at 80% variance. Beyond 90% variance, the test accuracy decreased as the PCA variance increased, because the model begins to overfit. We performed holdout cross-validation (with 90% PCA variance) ten times to examine the classifier accuracy across runs; each time, the data were randomly divided into training, validation, and test sets at a ratio of 70:15:15. Table 5 shows the performance of these repeated holdout runs: in every run, the test accuracy of the SVM classifier was around 93%. The mean and standard deviation over the 10 runs were 93.2% and 0.44, respectively, which suggests that the classifier performance does not depend on a particular random split.
Table 6 shows the accuracy of the classifier for k-fold cross-validation. Here, too, the highest test accuracy, 94.2%, was obtained at 90% variance, and the lowest, 84.4%, at 97% variance. The test accuracy again decreased as the PCA variance increased beyond 90%, because the model begins to overfit. From Tables 4 and 6, we observe that both cross-validation methods reach their highest accuracy at 90% variance. The 10-fold validation method gives an accuracy about 1 percentage point higher than the holdout method.

Discussion
As ADHD is among the most common disorders in children, early diagnosis will help to prevent future complications [23]. In this paper, we present a machine-learning approach for identifying children with ADHD using an SVM applied to a publicly available dataset (19-channel EEG data from 120 participants). After denoising, we divided the EEG signal into five sub-bands and kept only four of them (delta, theta, alpha, and beta) for further processing, because ADHD patients show an abundance of slow (delta or theta) brain waves and a shortage of fast (beta) brain waves. From the four sub-bands, we extracted statistical, time-domain, and frequency-domain features from each subject's data. The STD measures the variability, and the RMS is calculated to determine the power changes in the brain wave. Hjorth parameters indicate the complexity of the brain wave. Activity, mobility, and complexity are the most commonly used Hjorth parameters; mobility and complexity are computed from the first derivative of the signal [56]. Skewness represents the degree of asymmetry in the distribution of the EEG data. Kurtosis measures the distribution of observed data around the mean and describes how often outliers occur [57]. Entropy measures the uncertainty or randomness of the brain wave [56]. PSD describes the power distribution of the EEG series in the frequency domain, and it is used to evaluate abnormalities of the brain [58]. STD, RMS, skewness, and kurtosis are simple statistical features for characterizing the brain wave.
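One way to quantify the four retained sub-bands is to integrate a Welch power spectral density estimate over each band, as sketched below. The band edges are conventional values assumed for illustration (the paper's Table 1 ranges are not reproduced here), and the function names and Welch settings are likewise assumptions rather than the authors' choices.

```python
import numpy as np
from scipy.signal import welch

# Assumed conventional band edges in Hz; the paper's own ranges may differ.
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def band_powers(x, fs=128.0):
    """Approximate power in each sub-band from a Welch PSD estimate."""
    freqs, psd = welch(x, fs=fs, nperseg=min(len(x), 256))
    df = freqs[1] - freqs[0]  # frequency resolution of the estimate
    powers = {}
    for name, (low, high) in BANDS.items():
        mask = (freqs >= low) & (freqs < high)
        powers[name] = float(psd[mask].sum() * df)  # rectangle-rule integral
    return powers


def theta_beta_ratio(x, fs=128.0):
    """Ratio of theta power to beta power for one channel."""
    powers = band_powers(x, fs)
    return powers["theta"] / powers["beta"]
```

A high value of `theta_beta_ratio` corresponds to the pattern of abundant slow waves and reduced fast waves discussed above.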
We also applied PCA of different percentages of variance to reduce dimensionality to prevent overfitting. PCA reduces the number of variables or features of a large dataset while preserving as much information as possible [59]. It makes it convenient and faster for the machine learning algorithm to analyze the dataset.
We used two different cross-validation methods and split the dataset into training, test, and validation sets. For different percentages of PCA variance, we obtained different accuracies with both cross-validation methods in the SVM classifier. The highest test accuracy was 93.24% at 90% variance with the holdout method and 94.2% at 90% variance with the k-fold method. Cross-validation evaluates the performance of machine learning models, and this helps to compare machine learning methods and determine which is best suited to a specific problem [60]. In our study, we obtained similar accuracy in both holdout and k-fold cross-validation, which suggests that our model is robust against bias and overfitting.
SVM has been widely used to classify EEG signals for neurological disorders [61]. It works relatively well when there is a clear margin of separation between the classes [62], and the dataset used in this study has two classes: healthy children labeled as class 1 and children with ADHD labeled as class 2. We used Gaussian kernel SVM for classification, as it has excellent learning performance and can give a reliable estimate of uncertainty. The Gaussian kernel ensures a globally optimal predictor that minimizes the estimation and approximation errors of a classifier [63].
In this study, we used statistical and time- and frequency-domain features of the four EEG sub-bands, whereas most prior work used nonlinear features and worked with all the sub-bands. As mentioned in Section 2, those studies mostly employed KNN and neural networks. Recently, many studies have focused on MRI to identify neurological disorders, but compared to MRI, EEG is more flexible, more affordable, and also suitable for children. We also compared the results of the two cross-validation methods; their agreement suggests that the results are not driven by a particular split of the dataset. In this study, the SVM classifier reached around 93% accuracy in identifying ADHD in children from 11 features extracted from each sub-band of the EEG signal. This accuracy is reasonable for EEG signal classification. The main challenges we faced in this study were understanding the dataset well enough to determine the window segment and selecting good features. As it is a large dataset, we selected a 2 s window segment with 50% overlap for each of the sub-bands to retain as much information as possible. Windowing is used to isolate features into small segments of the overall EEG data to improve feature resolution [64].
In summary, we have presented an EEG machine learning pipeline for ADHD detection, explaining each stage of the pipeline (including signal preprocessing and data preparation) with thorough explanations and rationale. We utilized only the first four sub-bands of EEG and eliminated the higher-frequency band, which reduced the computational load for the model while keeping a mean accuracy of 93.2%. Simple EEG features were extracted from a large dataset of 120 children, which was collected from two different sessions at two different places, eliminating measurement bias in data collection. The experimental setup was also child-friendly, easy to reproduce in local settings, and could be employed for future ADHD detection. We also performed rigorous validation to ensure that our model was not impacted by bias and overfitting, which commonly appear in the machine learning pipeline. Despite this, we need to address a few limitations of our research.

1. To improve the accuracy, we may need to evaluate more features through the use of different machine learning models for comparison of results.
2. We will also try different window sizes (0.5 s or 5 s, for example) in future studies.