Abstract
Attention deficit hyperactivity disorder (ADHD) is a prevalent neurodevelopmental disorder among children and adolescents. Behavioral detection and analysis play a crucial role in ADHD diagnosis and assessment by objectively quantifying hyperactivity and impulsivity symptoms. Existing video-based action recognition algorithms focus on object or interpersonal interactions, they may overlook ADHD-specific behaviors. Current keypoints-based algorithms, although effective in attenuating environmental interference, struggle to accurately model the sudden and irregular movements characteristic of ADHD children. This work proposes a novel keypoints-based system, the Multi-cue Feature Fusion Network (MF-Net), for recognizing actions and behaviors of children with ADHD during the Test of Variables of Attention (TOVA). The system aims to assess ADHD symptoms as described in the DSM-V by extracting features from human body and facial keypoints. For human body keypoints, we introduce the Multi-scale Features and Frame-Attention Adaptive Graph Convolutional Network (MSF-AGCN) to extract irregular and impulsive motion features. For facial keypoints, we transform data into images and employ MobileVitv2 for transfer learning to capture facial and head movement features. Ultimately, a feature fusion module is designed to fuse the features from both branches, yielding the final action category prediction. The system, evaluated on 3801 video samples of ADHD children, achieves 90.6% top-1 accuracy and 97.6% top-2 accuracy across six action categories. Additional validation experiments on public datasets NW-UCLA, NTU-2D, and AFEW-VA verify the network’s performance.
    1. Introduction
Attention Deficit Hyperactivity Disorder (ADHD) is one of the prevalent neurodevelopmental disorders in childhood [,]. It is characterized by a range of symptoms, including inattention, hyperactivity, and impulsivity []. According to a survey [] compiling data from various studies, involving a total sample of 3277590 participants, ADHD in children and adolescents was 8.0% (all figures represent statistically verified results). The inattentive type of ADHD was the commonest type of ADHD (ADHD-I) 3% followed by hyperactive type (ADHD-HI) 2.95% and the combined type (ADHD-C) 2.44%, as shown in Figure 1. These symptoms may place significant stress on the learning and interpersonal interactions of children with ADHD, while also adversely affecting their mental health and increasing the risk of developing depression and anxiety disorders. Furthermore, children with ADHD may encounter challenges in employment as they transition into adulthood. These symptoms can adversely affect their learning and daily life, and may persist into adulthood [,,,,,]. This underscores the necessity of early detection and treatment for children with ADHD. Clinically, the diagnosis of ADHD is often based on subjective evaluations by healthcare professionals. This typically involves using standardized criteria such as the DSM-V or interviewing about the child’s daily behavior [].
 
      
    
    Figure 1.
      The pie chart of ADHD prevalence statistics. The total sample size is 3,277,590, which includes data from various studies conducted between 2007 and 2022. The overall prevalence rate is 8%. The prevalence rates for each subtype are as follows: ADHD-I (3%), ADHD-HI (2.95%), and ADHD-C (2.44%).
  
The inherent variability in human observation and interpretation presents considerable challenges in achieving consistent diagnoses across different environments and evaluators. Objective measurements can reveal subtle behavioral patterns or symptom manifestations that may not be easily observed through traditional assessment methods, thereby enriching our understanding of ADHD. In addition to diagnosis through the DSM-5, several studies have explored the causes and symptoms of ADHD, proposing objective indicators and methods to identify the disorder. These methods include: (1) Brain activity and physiological signal detection, such as functional magnetic resonance imaging (fMRI) [,,], electroencephalography (EEG) []. While these techniques provide valuable insights into brain function, they are often expensive, time-consuming, and require specialized equipment and expertise, limiting their widespread application in clinical settings. (2) Behavioral detection and analysis, such as handwriting analysis [], body movement analysis during sleep [,] and wearable device detection system [,,,,]. Although these methods offer less invasive alternatives, they may not reliably capture the prominent symptoms needed for accurate diagnosis. Furthermore, wearable devices can potentially inconvenience children with ADHD, inadvertently restricting their natural activities and potentially skewing results.
In addition to the aforementioned methods, Test of Variables of Attention (TOVA) is another widely employed clinical tool for the diagnosis of ADHD in children. It is specifically designed to assess children’s attention and impulsivity levels []. Its measurement approach focuses on assessing individual attention deficits during task performance, including sustained and selective attention, aiding clinicians in accurately evaluating patients’ attention levels. It has become commonly used to diagnose the severity of ADHD in children [,,,,]. TOVA has been validated as a reliable method in many studies [,,,,,]. It demonstrates objectivity and is less influenced by confounding variables. Compared to the use of expensive and complex equipment, TOVA provides objective results concerning ADHD attention levels with a straightforward test. During the test, children often exhibit restless behaviors due to attention deficits or hyperactivity and impulsivity. Analyzing the behavioral performance during the TOVA can serve as an assessment of ADHD symptoms as described in the DSM-V. This can provide additional data points to support clinical judgment, potentially enhancing the accuracy and consistency of diagnoses. Therefore, we propose a novel and reliable paradigm for analyzing the behavior of children with ADHD to aid in the analysis of TOVA results, which utilizes cameras to record and recognize children’s actions during the TOVA.
Action recognition in computer vision has traditionally focused on a wide range of tasks involving complex interactions with objects or other individuals, such as drinking water or shaking hands. Established action recognition networks, including Two-stream CNN [], 3D-ResNet [], and various multimodal networks [,], have been designed to process not only human movements but also the contextual information provided by objects and environments. These approaches have demonstrated considerable success in general action recognition scenarios where the interplay between humans and their surroundings is crucial for accurate classification.
However, the specific requirements of our research, centered on the Test of Variables of Attention (TOVA) for ADHD diagnosis, present a unique set of challenges that diverge from conventional action recognition tasks. The behaviors exhibited during TOVA primarily consist of intricate combinations of movements from various body parts, with minimal relevance of background elements. This distinctive characteristic of our task necessitates a shift in focus towards the individual undergoing assessment, emphasizing the importance of precise body part localization and sophisticated motion modeling. In light of these requirements, our approach aligns more closely with methods that prioritize human-centric data. Graph Convolutional Neural Networks (GCNs) based on skeleton data, such as ST-GCN [] and 2s-AGCN [], have emerged as prominent solutions in action recognition tasks that benefit from isolating human movement from environmental factors. These methods leverage extracted human body keypoint information to construct graph-based representations of human poses and movements over time, effectively eliminating external environmental interference, offering a promising foundation for our ADHD-focused action recognition task.
Nevertheless, the application of existing GCN-based methods to our specific context reveals certain limitations. Children with ADHD undergoing TOVA often exhibit sudden, unpredictable movements, such as abruptly shifting their body position while resting on a table. These sporadic behaviors pose certain challenges to traditional GCN approaches, which typically model movements over entire time sequences. Such methods may fail to capture the significance of critical frames or adequately represent local temporal movements that are crucial for identifying ADHD-related behaviors. Additionally, while previous GCN-based networks have not extensively utilized facial keypoints due to their limited contribution to action recognition in other tasks, our research recognizes the unique importance of facial and head movements in ADHD assessment. Children with ADHD often display distinctive facial expressions and head movements during TOVA, making the integration of facial keypoints crucial for comprehensive motion modeling in our context.
To address these ADHD-specific challenges, we propose a Multi-cue Feature Fusion Network (MF-Net). While our framework builds upon established foundations in action recognition, it introduces several key innovations specifically designed for ADHD behavior analysis: For human keypoints data: (1) First, we develop a novel Multi-scale Spatial-Temporal Information Extraction Module integrated with AGCN. Unlike existing multi-scale approaches that focus on general action recognition, our module specifically targets the temporal irregularity of ADHD movements through an adaptive feature extraction mechanism. This innovation enables the simultaneous capture of both sustained attention patterns and sudden impulsive movements - a capability crucial for ADHD assessment but not addressed by current methods. (2) Second, we design an innovative Motion Attribute Encoder Module that substantially advances CBAM’s temporal attention mechanism by introducing a new temporal channel encoding mechanism to identify and emphasize frames containing characteristic ADHD behaviors, enabling more precise capture of patients’ irregular movement patterns. This module innovatively identifies and emphasizes frames containing diagnostically relevant movements, specifically addressing the challenge of detecting ADHD-related behavioral patterns. The integration with facial keypoint analysis through our adapted MobileViTv2 network further enhances our system’s ability to capture comprehensive behavioral characteristics.
Our proposed Multi-cue Feature Fusion Network (MF-Net) offers a novel, objective approach to quantifying hyperactivity and impulsivity symptoms during the Test of Variables of Attention (TOVA), potentially enhancing the accuracy and efficiency of ADHD diagnosis. This multifaceted approach not only advances the technical aspects of action recognition but also demonstrates the potential for AI-assisted clinical tools in neurodevelopmental disorder assessment.
In summary, our main contributions are as follows:
- We present a novel multi-cue framework that effectively integrates both global body movements and local facial-head movements through a novel fusion strategy, achieving superior performance in ADHD behavior recognition during TOVA assessment.
- We propose two innovative modules for ADHD-specific feature extraction: a Multi-scale Spatial-Temporal Information Extraction Module that captures movement patterns at different temporal scales, and a Motion Attribute Encoder that identifies critical behavioral frames. These modules work synergistically to model both sustained attention and sudden impulsive movements characteristic of ADHD.
- We present a MobileViTv2 classification network that constructs images based on facial keypoints.
2. Methods
2.1. Overview of MF-Net
In order to recognize the erratic, brief, and rapid actions displayed by children with ADHD during the TOVA, we propose a Multi-cue Feature Fusion Network (MF-Net) by extracting motion features from both body keypoints and facial keypoints. For the global body keypoints feature, a Multi-scale Features and Frame-Attention Adaptive Graph Convolutional Network (MSF-AGCN) is constructed for extracting features from both joints and bones data. The network adopts AGCN as the backbone network. In this network, two innovative modules are proposed: Multi-scale Spatial-Temporal Information Extraction Module and Motion Attribute Encoder Module. The Multi-scale Spatial-Temporal Information Extraction Module (MSST ) performs attention-based feature extraction at different dimensions of the backbone network and is fused with the final output of the backbone network. The Motion Attribute Encoder Module (MAE) is utilized for temporal-dimension attention extraction in the early stage of the backbone network. For the local facial keypoints feature, facial keypoints data from each sample are converted into pseudo-color images and processed using MobileVitv2 in transfer learning to capture fine-grained facial and head movements. Finally, the extracted features from body keypoints and facial keypoints are fused to obtain action classification results. An overview of the complete method is illustrated in the Figure 2.
 
      
    
    Figure 2.
      The overall framework proposed in this paper. The system encompasses three components: Keypoints data extraction, Human body and facial keypoints feature extractor, and Holistic integration of multi-cue feature representations. The feature extraction process of MF-Net comprises three branches: MSF-AGCN (joints) for modeling joint movements, MSF-AGCN (bones) for capturing skeletal deformations, and Mobilevitv2 (face) dedicated to extracting facial keypoints features.
  
2.2. Human Body Keypoints Feature Extractor
2.2.1. Adaptive Graph Convolutional Network
The actions defined involve the motion states of various body parts exhibited by children with ADHD during the TOVA. The Graph Convolutional Network (GCN) constructed from body keypoints not only localizes individual parts but also encodes continuous motion information. Spatial-Temporal Graph Convolutional Network (ST-GCN) introduces the application of GCN to skeleton-based action recognition. This network simply defines the graph in graph convolution and is inadequate for complex motion modeling. Adaptive Graph Convolutional Network (AGCN) adopted in this feature extractor is an improved graph convolutional network for holistic action modeling based on skeleton data.
The core component of AGCN is the G-TCN module. It consists of graph convolution operations and temporal convolution operations. Graph convolution operations consist of two main parts: the construction of the graph and the method of performing graph convolutions. For the first part, graph convolutions extract features from the skeleton graph , where  represents the set of keypoints, and  represents the set of edges connecting the keypoints. The edge set includes , connections between different keypoints within the same frame, as well as , connections between the same keypoints across different frames. The edges capture temporal evolution, connecting keypoints within frames and across frames. In the , unlike connections limited to adjacent joints, we add two different types of connections as shown in Figure 3b: (1) connections between the nose and shoulders to enhance detection of head movements relative to the body, aiding in recognizing behaviors such as sudden head turns; (2) connections between the left and right hip keypoints and the mouth keypoints, shortening the topological distance between the face and body. This facilitates the capture of body swaying motion features.
 
      
    
    Figure 3.
      Two different graph connections. (a) New Graph. This excludes the connections of the nose, mouth, and body. (b) Our Graph. This includes additional connections, where the red lines represent the connections between the nose and shoulders, and the blue lines represent the connections between the mouth and body.
  
For the second part, AGCN introduces an attention mechanism when computing the graph convolution. The output of the graph convolution can be represented as:
      
        
      
      
      
      
    
          where  represents a spatial-temporal graph constructed from input keypoints data,  is the weight vector obtained through  convolution, and  is adjacency matrix defined between keypoints in the first part,  is adjacency matrix with learnable parameters that are initialized, and  is adjacency matrix learned through the attention mechanism. However, its reliance on global temporal modeling often fails to account for the importance of critical frames and localized temporal features. This structural limitation hinders the network’s ability to effectively capture the nuanced and abrupt motion patterns that are characteristic of ADHD children’s behaviors. To this end, we construct Multi-scale Features and Frame-Attention Adaptive Graph Convolutional Network (MSF-AGCN), as illustrated in Figure 4. The following two modules are proposed for enhancements.
 
      
    
    Figure 4.
      The structure of MSF-AGCN. The network architecture is built upon the Adaptive Graph Convolutional Network (AGCN) as its baseline, incorporating two novel components: the Multi-scale Spatial-Temporal Information Extraction Module and the Motion Attribute Encoder Module.
  
2.2.2. Multi-Scale Spatial-Temporal Information Extraction Module
The module aims to model the brief and rapid actions (e.g., sudden turning, lying down) as well as actions with varying amplitudes (e.g., slight head turning, vigorous body swaying) exhibited by children with ADHD during the TOVA at different scales. It takes into account the influence of local motion information on the overall motion state.
As shown in Figure 5, this module consists of three parts: Temporal Convolutional Network (TCN) operation, Convolutional Block Attention Module (CBAM) [] and MaxPool layer.
 
      
    
    Figure 5.
      The architectures of the Multi-scale Spatial-Temporal Information Extraction Module and the Motion Attribute Encoder Module proposed in the MSF-AGCN.
  
(1) TCN operation, which is used to learn the motion patterns and dynamic information of the keypoints data over time from the low-level features of the input. It first employs a temporal convolution with a kernel size of  along the temporal dimension. The temporal convolution operation changes the number of channels of the output features, thereby extracting more advanced and abstract feature representations in the temporal dimension. Subsequently, the BatchNormalization layer is applied to normalize the output features of the temporal convolution. Finally, the ReLU activation function is used to achieve non-linear mapping. This operation can be expressed by the following formula:
      
        
      
      
      
      
    
          where ,  represents the ,  features, respectively.  represents the convolutional kernel weights with a kernel size of ,  represents the temporal convolution operation.
(2) Convolutional Block Attention Module (CBAM). This module consists of two sub-modules: the channel attention module and the temporal attention module. Two modules are concatenated and executed sequentially. The channel attention module aims to adaptively adjust the importance weights of different feature channels to highlight the more relevant and discriminative feature channels for the current task. The temporal attention module focuses on the attention allocation at different time steps, enhancing the modeling capability of key time segments. Its operation is shown in the following formula:
      
        
      
      
      
      
    
          where  represents the the feature output by the CBAM module.  and  denote the channel attention module and temporal attention module, respectively. The ⊙ represents element-wise multiplication.
(3) MaxPool layer. This layer is applied to compress the size of the feature maps and capture the most prominent feature information within the local region. The maxpool operation is formulated as:
      
        
      
      
      
      
    
          where ,  are the features before and after maxpool, respectively.
Assuming the input feature size is , where C denotes channel, T denotes frames and V denotes joints, after processing by Multi-scale Spatial-Temporal Information Extraction module, the output feature shape becomes , This module is placed after the outputs of the fourth and seventh blocks of the network to extract keypoints features at the first and second stages of the temporal scale, respectively. These features are then fused with the final output features of AGCN, resulting in an enhanced output feature representation. The final output can be represented by the following formula:
          
      
        
      
      
      
      
    
          where , ,  represent the spatial-temporal features extracted after the 4th block, the spatial-temporal features extracted after the 7th block, and the output features from the 10th block in AGCN, respectively.
2.2.3. Motion Attribute Encoder Module
This module aims to extract temporal attention features to address the challenges of modeling irregular motion patterns. The design of this module modifies certain aspects of the Convolutional Block Attention Module (CBAM) attention mechanism, incorporating attention to extract temporal features along the time dimension. As shown in Figure 5, the dimensions C and T of the input feature  are swapped to  using a transpose operation. Then the adjusted features are subjected to adaptive  average pooling and max pooling operations on the last two dimensions, respectively. These pooled features are then merged into a fused, dimension-reduced feature  via a concatenation operation. This process can be formulated as:
      
        
      
      
      
      
    
          where ,  denote using  feature maps for adaptive average pooling and max pooling, with  indicating the operation is performed along the second dimension. After that, a temporal convolution layer with a kernel size of (9, 1) and a stride of 2 is applied to further extract the attention scores, followed by activation using the sigmoid function. This results in an attention score feature . Finally, the attention scores along the T dimension are multiplied with the original input feature to obtain the output features. It generates a tensor representation that encodes the temporal attention information of the input feature. The output is formulated as:
      
        
      
      
      
      
    
          where  denotes the sigmoid activation function, while  denotes the convolutional kernel weights with a kernel size of .
The Motion Attribute Encoder module is placed after the 4th block of the network to extract temporal attention features at an early stage across all time frames, capturing the motion attributes of the entire sequence.
2.3. Facial Keypoints Feature Extractor
2.3.1. Transformation of Facial Keypoints into Images
Facial keypoints can capture head movements and facial expression changes, especially in the movements exhibited by ADHD children during the TOVA. Considering the weaker connectivity between facial keypoints compared to skeleton keypoints, constructing a graph structure is not suitable. Furthermore, the overall facial configuration during motion can reveal subtle movements. Therefore, we transform facial keypoints into images as input to extract features. We extracted 68 facial keypoints for each frame. Each keypoint  for the  frame and  keypoint is represented by its coordinates  and a confidence score . To prepare the data for visualization, the coordinates and scores are normalized to fit within the RGB color range . The coordinates are scaled based on the maximum values observed in the dataset,  and , across all keypoints. The normalized coordinates are computed as follows:
      
        
      
      
      
      
    
Similarly, the confidence scores , which inheretly lie between 0 and 1, are directly scaled to the range :
      
        
      
      
      
      
    
Finally, image I of dimensions  pixels is constructed. The RGB value of each pixel in the image is set based on the normalized values:
      
        
      
      
      
      
    
2.3.2. Feature Extraction on Transformed Images
After the above processing, we obtain the new input data . MobileViTv2 is employed for the extraction of data features. It is a variant of MobileViT, a lightweight model that combines CNN and Transformer to reduce parameter count. MobileViTv2 not only has the ability to capture image information but also possesses strong global sequence modeling capabilities due to the Transformer used, as the input data contains sequential relationships. We retain the main network of MobileViTv2 and modify the output of the final fully connected layer. It can be expressed as:
      
        
      
      
      
      
    
          where  represents the weights of the fully connected layer.
2.4. Holistic Integration of Multi-Cue Feature Representations
For the task of recognizing behaviors in children with ADHD, motion cues from body joints, body part movements, and facial expressions and head motions serve as important complementary sources of information. Our proposed MF-Net model aims to comprehensively leverage these diverse feature modalities. Specifically, the MSF-AGCN module extracts global motion features  and  from human body keypoints representing joint and bone movements, respectively. The MobileViTv2 focuses on local facial keypoints to learn facial motion features . These three feature sources encompass distinct yet complementary types of information. To effectively fuse these multi cue features, we design a feature fusion module that performs weighted fusion of the outputs from the MSF-AGCN and MobileViTv2 branches, generating a holistic feature representation for behavior classification. This module first applies softmax processing to the fully connected layer outputs of the individual branches, yielding probability distribution tensors. Subsequently, it adjusts the weights of these tensors and combines them to obtain the optimal behavior recognition result. This process can be expressed as:
      
        
      
      
      
      
    
        where  denotes the softmax function, and , ,  denote the feature weights for the joints, bones, and face branches, respectively.
3. Experiments
3.1. Datasets
SCU-ADHD-Action-Dataset. The TOVA visual test spans a total of 21.6 min. During the test, children sat in front of a computer screen that flashed every two seconds, with a black square appearing either above or below the white area on the screen. When the black square appeared above, the children were required to click the left mouse button. Cameras were placed above the computer screen and under the desk to record the children’s full-body movements from the start of the test. As the lower body movements were not significant, only the data from the top camera was used in this experiment. For each child, 20 min of valid video data were extracted and segmented into individual samples based on the screen flash interval (every two seconds), with each sample containing 60 frames. The size of each frame image is 1280 (width) × 720 (height) pixels. In this work, we collected video data of 7 ADHD children during the TOVA visual test at the Mental Health Center of West China Hospital, Sichuan University. After data screening and segmentation, a total of 3801 video samples were obtained.
By detecting the keypoints of the children in each frame of every video, the keypoints information is obtained, forming a time-series keypoints data sample. To obtain comprehensive keypoint information, we employ AlphaPose [,,], a state-of-the-art tool renowned for its accuracy and versatility in human pose estimation. AlphaPose is particularly suitable for our research due to its ability to extract both body and facial keypoints simultaneously, a feature critical for our holistic approach to ADHD behavior analysis. The keypoint extraction process is applied to each frame of every video in our dataset, resulting in a time-series of keypoints data. This approach allows us to capture the dynamic nature of children’s movements throughout the TOVA, providing a rich dataset for subsequent analysis. The keypoints data extracted by AlphaPose for a single frame is a three-dimensional vector , where x and y represent the coordinates of the keypoint in the frame, and s indicates the confidence score. The keypoints data for each video sample has a shape of , where C represents the three dimensions of a single keypoint data, T is the number of frames in the video, and V is the number of joints. We incorporate 17 key body points, comprising: 9 head keypoints: These include crucial keypoints such as the eyes, ears, and nose, allowing us to track head movements and orientation, which are often indicative of attention shifts and impulsivity; 8 upper body keypoints: Focusing on the shoulders, arms, and upper torso, these points enable us to capture gross motor movements and postural changes frequently observed in children with ADHD. We also extract 68 facial keypoints, which provide granular facial data that is particularly valuable for detecting subtle cues of inattention or facial states that may not be apparent from body movements alone. To ensure the reliability and consistency of our keypoint data, we implement preprocessing steps focused on missing data handling; specifically, we employ interpolation techniques to estimate the positions of occluded or low-confidence keypoints based on surrounding frames, ensuring continuity in our time-series data.
Based on clinical expertise, we define six action state categories that characterize the behavioral manifestations of children with ADHD during the TOVA visual test, as detailed below:
- (1)
- Turning the head and look around. This action manifests when the child is unable to concentrate. Their attention is diverted by the surrounding environment, leading to sudden and unpredictable behavior. It is indicative of symptoms of inattention.
- (2)
- Shaking body. This behavior arises from symptoms of hyperactivity and impulsivity. It leads to an inability to sustain attention on a single task for an extended period. It is manifested as body movements while seated, characterized by irregularity.
- (3)
- Resting head on hand. This action occurs when the child feels fatigued or has difficulty maintaining attention. It is related to symptoms of inattention.
- (4)
- Displaying facial expressions of inattention. This type of action reflects the manifestation of ADHD symptoms on the child’s facial expressions when unable to focus on completing the task, such as pouting, frowning, etc., and is characterized by unpredictability. It is more pronounced compared to typical children.
- (5)
- Lying on the desk. This action is associated with symptoms of inattention. The child is unable to continue the test for an extended period, ultimately resulting in and lying on the desk.
- (6)
- Exhibiting no significant ADHD symptoms. This state is defined as the child exhibiting normal behavior during the test, apart from the aforementioned five action states. There are no significant body or head movements.
3.2. Implementation Details
We divide the keypoints dataset of children with ADHD into a training set and a validation set in a 7:3 ratio to validate the effectiveness of our model. The deep learning model is implemented using PyTorch and trained on an NVIDIA RTX 4090 GPU. The training process is conducted for 100 epochs with a batch size of 8. We employ SGD with Nesterov momentum (0.9) as the optimizer, using a learning rate of 0.01 and a weight decay of 0.0001. Cross-entropy is used as the loss function.
3.3. Performance Comparison on SCU-ADHD-Action-Dataset
3.3.1. Comparison with State-of-the-Art Models on SCU-ADHD-Action-Dataset
In this section, to verify the superiority of our network, we conduct comprehensive comparisons with state-of-the-art (SOTA) methods on the SCU-ADHD-Action-Dataset. Our comparisons encompass both skeleton-based and video-based approaches, evaluating them in terms of accuracy, computational complexity (FLOPs), and parameter counts, as shown in Table 1. The classification accuracy is calculated as:
      
        
      
      
      
      
    
          where  represents the number of correctly classified samples and  represents the total number of samples in the test set.
 
       
    
    Table 1.
    Comparison of Accuracy on SCU-ADHD-Action-Dataset with State-of-the-Art Models.
  
The FLOPs, which measure the computational complexity, is calculated as:
      
        
      
      
      
      
    
          where L is the total number of layers,  and  are the input and output channels of layer l, and  represents the spatial dimension of the computational unit.
The total number of parameters is computed as:
      
        
      
      
      
      
    
          where the second term  accounts for the bias parameters in each layer.
For skeleton-based methods, we first examine ST-GCN, which pioneered the use of GCN for action recognition. However, it achieves a relatively lower accuracy of 86.3% due to its limitation of learning only on predefined adjacency matrices. The improved AGCN enhances the performance by 2% through adaptive graph learning. MS-G3D [] further advances the performance to 88.7% by leveraging multi-scale convolution to learn richer features. CTR-GCN [] and Shift-GCN++ [], two recent advanced graph convolutional networks, achieve accuracies of 88.7% and 88.9% respectively. CTR-GCN introduces dynamic topology adjustment for enhanced flexibility, while Shift-GCN++ implements a lightweight displacement-based graph convolution mechanism. SkateFormer [], another recent transformer-based skeleton network, achieves 89.0% accuracy through its sophisticated attention mechanism.
For video-based methods, we evaluate SlowFast [] (ResNet50 backbone) and MViTv2-S []. SlowFast, despite its dual-pathway architecture designed for effective temporal modeling, achieves only 68.7% accuracy on our dataset. While MViTv2-S, a multi-scale vision transformer, shows some improvement over SlowFast, both networks exhibit significantly higher computational costs. Their FLOPs are substantially higher than skeleton-based approaches, particularly MViTv2-S, which demonstrates the computational advantage of skeleton-based methods for action recognition.
Our proposed MSF-AGCN achieves better accuracy with only a modest increase in parameters compared to base AGCN. The MobileViTv2, designed for facial and head motion feature extraction, maintains exceptional efficiency with only 0.63G FLOPs. Most notably, our integrated MF-Net achieves the highest accuracy of 90.6% while maintaining reasonable computational efficiency at 6.95G FLOPs, still significantly lower than video-based alternatives.
The experimental results demonstrate that our proposed MF-Net achieves state-of-the-art performance (90.6%) on the SCU-ADHD-Action-Dataset while maintaining reasonable computational costs (6.95G FLOPs). The superior efficiency-performance trade-off of our skeleton-based approach compared to video-based methods validates the effectiveness of our multi-modal fusion strategy for ADHD action recognition.
3.3.2. Comparison of Feature Extraction with t-SNE Visualization
We also perform t-SNE visualization of the original data, feature-extracted data by AGCN, and feature-extracted data by MSF-AGCN. t-SNE is a nonlinear algorithm that reduces high dimensional data to low-dimensional space, where similar data points are closer and dissimilar data points are farther apart. Figure 6 illustrates the clustering effects of different methods. In Figure 6a, the original data, except for class 5 (Lying on the desk) which shows clear clustering, the other five classes are more randomly distributed. In Figure 6b, the data points classified by AGCN show better clustering. In Figure 6c, MSF-AGCN demonstrates significantly better clustering than AGCN, particularly for class 1 (Shaking body, highlighted by the red circle). Overall, MSF-AGCN shows superior feature extraction for the irregular movements of ADHD children compared to AGCN.
 
      
    
    Figure 6.
      t-SNE visualization of the original data, features extracted by AGCN, and features extracted by MSF-AGCN on the validation set. The points in six colors represent the feature data of six categories. Similar feature data points are closer together, while dissimilar feature data points are farther apart. (a) t-SNE visualization of the original data. (b) t-SNE visualization of features extracted by AGCN. (c) t-SNE visualization of features extracted by MSF-AGCN. numbers 1–6 respectively represent the following actions. 1: Turning the head and look around. 2: Shaking body. 3: Resting head on hand. 4: Displaying facial expressions of inattention. 5: Lying on the desk. 6: Exhibiting no significant ADHD symptoms. The clustering effect of MSF-AGCN is significantly better than AGCN in the “Shaking Body” category.
  
3.3.3. Validation of MSF-AGCN Module Effectiveness
In this section, to validate the effectiveness of the proposed Multi-scale Spatial-Temporal Information Extraction Module and Motion Attribute Encoder Module, we select a sudden movement action of an ADHD child to compare the performance of the AGCN and MSF-AGCN networks. Attention heatmaps are a visualization tool used to show the regions or features that a neural network model (particularly attention mechanism models) focuses on when processing input data. We compare the keypoints attention heatmaps at the same layer of AGCN without the Motion Attribute Encoder Module and the layer in MSF-AGCN where the Motion Attribute Encoder Module is placed to verify the effectiveness of this module. Additionally, we compare the keypoints attention heatmaps at the final block of MSF-AGCN, where the Multi-scale Spatial-Temporal Information Extraction Module integrates features at different scales, and the final block of AGCN to validate the effectiveness of this module. As shown in Figure 7, the top part displays frame samples of a child making a sudden move while resting on his head. Below are the attention maps between different networks and layers. From the comparison of the first two attention maps, it is evident that the MSF-AGCN network with the Motion Attribute Encoder Module can focus on areas that AGCN misses, including the hand and head. These sudden movements cause changes in the movement attributes of the ADHD child. In the keypoints heatmaps of the final output features of AGCN and MSF-AGCN, MSF-AGCN also focuses on more details of the head and hand movements in feature extraction. Overall, the two proposed modules help model some of the sudden and irregular movements of ADHD children.
 
      
    
    Figure 7.
      Attention heatmaps of different output layers of AGCN and MSF-AGCN during an ADHD child’s ‘sudden move while resting on his head’ action. The red circle represents the area of enhanced attention compared to the original AGCN. Our designed modules effectively help the network focus on the relevant movement areas.
  
3.3.4. Analysis of the Impacts of Motion Attribute Encoder Module Position and Structural Graph Design
We evaluate the performance impact of placing the Motion Attribute Encoder Module at different layers of MSF-AGCN. As illustrated in Figure 8a, when placed after the feature extraction layers of the second and third stages, the achieved accuracy rates are both 88.3%, lower than the 89.3% accuracy attained by positioning it after the first stage’s feature extraction layer. The Motion Attribute Encoder Module is designed to extract temporal attention features. An earlier integration of this module allows the model to capture salient motion cues from the more complete temporal sequences. Consequently, integrating the module at the first stage yields better performance.
 
      
    
    Figure 8.
      Bar chart for the Motion Attribute Encoder Module Position and different Structural Graph Designs. (a) Accuracy bar chart of the Motion Attribute Encoder Module placed at different stages of the layers. (b) Accuracy bar chart of different networks combined with different graphs.
  
We also investigate the effects of combining different networks with distinct graph representations to validate the efficacy of our proposed graph design. As shown in Figure 3, the “Our Graph” representation incorporates additional connections between the nose, mouth, and body, while the “New Graph” excludes these connections. We compare four distinct combinations: AGCN + Our Graph, MSF-AGCN + New Graph, AGCN + New Graph, and MSF-AGCN + Our Graph. As depicted in Figure 8b, the AGCN network achieves an accuracy of 89.1% when paired with the “New Graph”, outperforming its 88.3% accuracy with “Our Graph”. As a classical graph convolutional neural network, AGCN primarily focuses on modeling skeletal connections and exhibits a relatively low reliance on non-skeletal connections, such as those between limbs and the torso. Therefore, the introduction of additional non-skeletal edges in “Our Graph” may introduce certain noise, consequently impacting AGCN’s performance on this graph representation. In contrast, our proposed MSF-AGCN model, through multi-scale information extraction, demonstrates an ability to leverage the rich structural information embedded in “Our Graph”. MSF-AGCN achieves an accuracy of 89.5% when combined with “Our Graph”, outperforming its 88.3% accuracy with the “New Graph”, yielding the optimal result. This reflects that MSF-AGCN combined with “Our Graph” achieves better performance.
3.4. Ablation Studies
In this section, we demonstrate the effectiveness of our network through ablation experiments on different variants of AGCN and various network cues.
3.4.1. Performance of MSF-AGCN
In this part, we present four variants of AGCN: the original AGCN network without any additional modules; AGCN + MSST, which includes only the Multi-scale Spatial-Temporal Information Extraction Module to verify its effect; AGCN + MAE, which includes only the Motion Attribute Encoder Module to verify its effect; and our MSF-AGCN, which integrates both the Multi-scale Spatial-Temporal Information Extraction Module and Motion Attribute Encoder Module. We conduct action recognition experiments on joints keypoints data using these four networks, and the results are shown in Table 2. The accuracy of the original AGCN on joints is 88.3% lower than the 89.5% accuracy achieved by our proposed MSF-AGCN. The accuracy of AGCN + MSST and AGCN + MAE is 89.2% and 88.7%, respectively, higher than the original AGCN, indicating the effectiveness of the MSST and MAE modules. The combined MSF-AGCN achieves the highest accuracy of 89.5%, demonstrating that combining the two modules yields the best network performance.
 
       
    
    Table 2.
    Comparison of Accuracy on SCU-ADHD-Action-Dataset among Different AGCN Variants.
  
3.4.2. Performance Evaluation of Multi-Cue Feature Fusion in MF-Net
In this section, we first compare the action recognition accuracy on human body keypoints using MSF-AGCN for joints and bones data, MobileViTv2 for facial keypoints, and MF-Net that fuses joints, bones, and face features. As shown in Table 3, MSF-AGCN achieves an accuracy of 89.5% on joints, higher than the 88.7% on bones, because joints contain more motion information, while bones only provide relational information between joints. MobileViTv2 achieves an accuracy of 83.3% on facial keypoints, lower than on human body keypoints, as facial keypoints only provide local features and are less effective for modeling overall motion. When combining the three cues, MF-Net achieves the highest accuracy of 90.6%, indicating that integrating both global and local features from human body and facial keypoints results in the best action modeling performance. Additionally, we compare the top-1 and top-2 count class performances of MF-Net with AGCN (joints) and MSF-AGCN (joints) to highlight the performance improvements. As shown in Figure 9, MF-Net has the highest count class in five categories for both top-1 and top-2. Compared to AGCN, MF-Net improves the count class for ’Shaking body’ by 21.5%. This is due to MSF-AGCN’s better feature extraction and multi-scale feature fusion for irregular and sudden movements. Overall, MF-Net achieves top-1 and top-2 accuracies of 90.6% and 97.6%, respectively.
 
       
    
    Table 3.
    Comparison of Accuracy on SCU-ADHD-Action-Dataset among Different Network Cues.
  
 
      
    
    Figure 9.
      A bar chart illustrating the top-1 and top-2 counts in six categories for AGCN, MSF-AGCN, and MF-Net on the validation set. (a) Bar chart of top-1 and top-2 counts for six categories in AGCN. (b) Bar chart of top-1 and top-2 counts for six categories in MSF-AGCN. (c) Bar chart of top-1 and top-2 counts for six categories in MF-Net. In the chart, numbers 1–6 respectively represent the following actions. 1: Turning the head and look around. 2: Shaking body. 3: Resting head on hand. 4: Displaying facial expressions of inattention. 5: Lying on the desk. 6: Exhibiting no significant ADHD symptoms. MF-Net exhibits substantial improvements in top-1 and top-2 count classes compared to AGCN.
  
3.5. Extensive Evaluations on Public Datasets
3.5.1. Public Datasets
We also select several publicly available datasets, including NW-UCLA, NTU-2D, and AFEW-VA, to evaluate our model’s performance on human body keypoints and facial keypoints.
NW-UCLA. NW-UCLA [] dataset comprises various human actions captured from multiple camera angles. The dataset includes 10 different action categories, with each action performed by different subjects from multiple viewpoints, totaling 1494 video clips. This dataset is similar to our own, as both involve a single person performing an action. We utilize the data provided in the CTR-GCN paper, consisting of a total of 1484 files. We use data from two viewpoints as the training set and data from the remaining viewpoint as the test set to validate the performance of MSF-AGCN.
NTU-2D. The NTU-RGB+D [] dataset is one of the most widely used datasets for skeleton-based action recognition. It includes 60 actions categorized into 40 daily activities, 9 health-related actions, and 11 mutual actions performed by pairs. The dataset comprises a total of 56,880 samples. To evaluate the performance of MSF-AGCN in recognizing actions involving multiple individuals and complex interactions, we utilize the NTU-2D dataset from MMaction. This dataset employs a keypoint extraction method similar to ours. Specifically, it uses Keypoint Fast R-CNN to extract 17 human keypoints from RGB videos. Each keypoint is characterized by three values: , where x and y represent the coordinates of the keypoint, and s indicates the confidence score of each keypoint.
AFEW-VA. AFEW-VA [,] (Acted Facial Expressions in the Wild-Valence and Arousal) is a facial expression video dataset specifically designed for emotion analysis and recognition. This dataset is derived from real movie clips and contains a large number of natural expressions and real-world scenes. In total, the dataset includes 600 video clips, with each frame of every video clip annotated with 68 facial landmarks. The dataset labels emotions along two dimensions: Valence and Arousal. Arousal measures the intensity or arousal level of the emotion, with low values indicating low arousal states such as calmness or fatigue, and high values indicating high arousal states such as excitement or surprise. We calculate the mean Arousal value across all frames of each video and use this mean as the label for the video. The labels are then reclassified into three categories: less than 0, 0–3, and greater than 3. To evaluate the effectiveness of MobileVitv2 in facial feature extraction, we randomly select 400 video clips as the training set and 200 video clips as the test set.
3.5.2. Comparison with Some State-of-the-Art GCNs on the NW-UCLA Dataset
To validate the performance of MSF-AGCN across different datasets, we compare the performance of several state-of-the-art networks on the NW-UCLA dataset, including the ones used for our own human body keypoints dataset, with our proposed MSF-AGCN. All networks are trained using the same hyperparameters as provided in the CTR-GCN. As shown in Table 4, Shift-GCN++ achieves an accuracy of only 89.9%, which may be attributed to the small size of the dataset and the higher complexity of the model. Our MSF-AGCN shows a significant improvement over shift-GCN++, achieving an accuracy increase of 4.5%. Additionally, the performances of ST-GCN, AGCN, and CTR-GCN are similar, with accuracies of 93.1%, 93.3%, and 93.5% respectively. The MS-G3D achieves an accuracy of 94.2%. Our MSF-AGCN achieves the highest accuracy of 94.4% among the compared networks, demonstrating better action recognition capability on the NW-UCLA dataset.
 
       
    
    Table 4.
    Comparison of Accuracy on NW-UCLA Dataset with Some State-of-the-Art GCNs.
  
3.5.3. Comparison with Some State-of-the-Art GCNs on the NTU-2D Dataset
In addition to the smaller NW-UCLA dataset, we also select the larger NTU-RGB+D dataset for comparison. The NTU-RGB+D dataset is widely used for skeleton-based action recognition and contains a variety of complex actions. For this experiment, we utilize 2D keypoints data, with each frame containing 17 keypoints, similar to the keypoints data we extract in our own experiments. This consistency allows for a more direct comparison of model performance. All networks are trained using the hyperparameters provided in the CTR-GCN paper, ensuring a fair and consistent evaluation across models. As shown in Table 5, MSF-AGCN achieve an accuracy of 94.4%, a 1% improvement over AGCN. This indicates that our proposed module enhances the detail modeling of action features in AGCN, providing a more nuanced understanding of the actions. MSF-AGCN’s accuracy is second only to MS-G3D and higher than ST-GCN and AGCN, performing on par with CTR-GCN. This demonstrates that MSF-AGCN maintains strong performance even with complex actions, validating its robustness and effectiveness in action recognition tasks.
 
       
    
    Table 5.
    Comparison of Accuracy on NTU-2D Dataset with Some State-of-the-Art GCNs.
  
3.5.4. Comparison of Different SOTA CNN Networks on Facial Keypoints Dataset and AFEW-VA Datasets
In this section, we compare the performance of several commonly used CNN classification networks, including ResNet50 [], MobileNetV2 [], EfficientNetV2 [], and the network we used, MobileViTv2 [], on facial keypoints and AFEW-VA datasets. All networks are trained with the same hyperparameters to ensure consistency. As shown in Table 6, on our own facial keypoints dataset, ResNet50, MobileNetV2, and EfficientNetV2 achieve accuracies of 81.1%, 81.7%, and 80.9%, respectively, all of which are lower than the 83.3% accuracy achieved by MobileViTv2. Similarly, on the AFEW-VA dataset, we adopt the same strategy as with the facial keypoints dataset, constructing  facial keypoints images for each video. This dataset only classifies the emotional intensity of the face. The accuracies of ResNet50, MobileNetV2, and EfficientNetV2 on this dataset are 58.5%, 62.5%, and 62.5%, respectively. MobileViTv2 achieved an accuracy of 65.5%, outperforming the three CNN networks. MobileViTv2 also has fewer parameters compared to ResNet50 and EfficientNetV2. Therefore, whether on our own dataset or on the AFEW-VA dataset, MobileViTv2 demonstrates better modeling capability for facial features.
 
       
    
    Table 6.
    Comparison of Different SOTA CNN Networks on Facial Keypoints and AFEW-VA Datasets.
  
4. Conclusions
In this paper, we present a solution for recognizing behavioral actions of children with ADHD during the TOVA visual assessment, aiming to assist clinicians in symptom evaluation. To achieve this goal, we propose a Multi-cue Feature Fusion Network (MF-Net) based on keypoint representations. Our method encompasses collecting behavioral videos of ADHD children during the TOVA test and extracting human body and facial keypoints. For the human body keypoints, We introduce a novel Multi-scale Features and Frame-Attention Adaptive Graph Convolutional Network (MSF-AGCN) to extract global motion features from joints and bones. For the facial keypoints, we transform them into image representations and employ the MobileVitv2 model for feature extraction. The features from these three cues are then integrated to yield the final action recognition result. Experimental results on both our dataset and public benchmarks demonstrate promising action recognition accuracy, validating the effectiveness of our approach.
Our work introduces several innovative technical contributions tailored for effective ADHD behavior recognition during TOVA assessments. The proposed MSF-AGCN represents a novel approach to extract global motion features from irregular and impulsive movements exhibited by ADHD children. Specifically, the Multi-scale Spatial-Temporal Information Extraction Module enables comprehensive temporal feature extraction by addressing challenges unique to ADHD-specific behavioral patterns. Moreover, the Motion Attribute Encoder Module introduces an new temporal attention mechanism designed to identify diagnostically relevant motion attributes. Furthermore, our multi-cue fusion strategy presents an innovative approach to combining global body motion features with local facial and head movement characteristics, allowing for a more comprehensive extraction of clinically meaningful behavioral cues.
In conclusion, our work contributes a novel approach to ADHD action recognition during TOVA assessments, providing clinicians with a valuable, quantitative tool for evaluating ADHD symptoms and monitoring treatment efficacy and patient progress over time. While our current results are promising, future work should focus on expanding the dataset to address class imbalance issues, and possibly incorporating additional modalities to further optimize the network’s performance and robustness in real-world clinical settings, including evaluating the impact of adversarial examples on model performance, similar to their use in image datasets as discussed in recent studies [].  
Author Contributions
Conceptualization, W.T.; methodology, W.T., C.S. and L.H.; software, W.T. and C.S.; validation, C.S.; formal analysis, Y.L., Z.T. and G.Y.; investigation, W.T.; resources, J.Z.; data curation, L.H.; writing—original draft preparation, W.T.; writing—review and editing, Y.L., J.Z., Z.T., G.Y. and L.H.; visualization, W.T.; supervision, L.H.; project administration, J.Z. and L.H.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Sichuan Science and Technology Program [2023YFS0290]; the Sichuan Science and Technology Program [2023YFS0327].
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data will be available upon reasonable request from the corresponding author.
Conflicts of Interest
The authors declare they have no conflicts of interest related to the publication of this paper.
References
- Feldman, H.M.; Reiff, M.I. Attention Deficit–Hyperactivity Disorder in Children and Adolescents — NEJM. Pediatr. Clin. N. Am. 2014, 50, 1049–1092. [Google Scholar]
- Wehmeier, P.M.; Schacht, A.; Barkley, R.A. Social and emotional impairment in children and adolescents with ADHD and the impact on quality of life. J. Adolesc. Health 2010, 46, 209–217. [Google Scholar] [CrossRef] [PubMed]
- Ayano, G.; Demelash, S.; Gizachew, Y.; Tsegay, L.; Alati, R. The global prevalence of attention deficit hyperactivity disorder in children and adolescents: An umbrella review of meta-analyses. J. Affect. Disord. 2023, 339, 860–866. [Google Scholar] [CrossRef] [PubMed]
- Xenitidis, K.; Campbell, C.; Routh, R.; Jackson, G. Attention Deficit/Hyperactivity Disorder Across the Lifespan. Annu. Rev. Med. 2013, 202, 155. [Google Scholar]
- Loe, I.M.; Feldman, H.M. Academic and educational outcomes of children with ADHD. Ambul. Pediatr. 2007, 7, 82–90. [Google Scholar] [CrossRef]
- Wolraich, M.L. Attention-Deficit/Hyperactivity Disorder Among Adolescents: A Review of the Diagnosis, Treatment, and Clinical Implications. Pediatrics 2005, 115, 1734. [Google Scholar] [CrossRef]
- Webster-Stratton, C.; Reid, J.; Hammond, M. Social skills and problem-solving training for children with early-onset conduct problems: Who benefits? J. Child Psychol. Psychiatry 2010, 42, 943–952. [Google Scholar] [CrossRef]
- Banaschewski, T.; Brandeis, D.; Heinrich, H.; Albrecht, B.; Brunner, E.; Rothenberger, A. Association of ADHD and conduct disorder—brain electrical evidence for the existence of a distinct subtype. J. Child Psychol. Psychiatry 2003, 44, 356–376. [Google Scholar] [CrossRef]
- Wilens, T.E.; Spencer, T.J. Understanding Attention-Deficit/Hyperactivity Disorder from Childhood to Adulthood. Postgrad. Med. 2010, 122, 97–109. [Google Scholar] [CrossRef]
- Doernberg, E.; Hollander, E. Neurodevelopmental disorders (asd and adhd): Dsm-5, icd-10, and icd-11. CNS Spectrums 2016, 21, 295–299. [Google Scholar] [CrossRef]
- Riaz, A.; Asad, M.; Alonso, E.; Slabaugh, G. Fusion of fMRI and non-imaging data for ADHD classification. Comput. Med. Imaging Graph. Off. J. Comput. Med. Imaging Soc. 2018, 65, 115–128. [Google Scholar] [CrossRef] [PubMed]
- Zuberer, A.; Schwarz, L.; Kreifelts, B.; Wildgruber, D.; Ethofer, T. Neural basis of impaired emotion recognition in adult attention deficit hyperactivity disorder. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 2020, 7, 680–687. [Google Scholar] [CrossRef] [PubMed]
- Deshpande, G.; Wang, P.; Rangaprakash, D.; Wilamowski, B. Fully Connected Cascade Artificial Neural Network Architecture for Attention Deficit Hyperactivity Disorder Classification From Functional Magnetic Resonance Imaging Data. IEEE Trans. Cybern. 2015, 45, 2668–2679. [Google Scholar] [CrossRef] [PubMed]
- Cura, O.K.; Atli, S.; Akan, A. Attention deficit hyperactivity disorder recognition based on intrinsic time-scale decomposition of EEG signals. Biomed. Signal Process. Control. 2023, 81, 104512. [Google Scholar]
- Shin, J.; Maniruzzaman, M.; Uchida, Y.; Hasan, M.A.M.; Megumi, A.; Yasumura, A. Handwriting-based ADHD detection for children having ASD using machine learning approaches. IEEE Access 2023, 11, 84974–84984. [Google Scholar] [CrossRef]
- Okada, S.; Shiozawa, N.; Makikawa, M. Body movement in children with ADHD calculated using video images. In Proceedings of the 2012 IEEE-EMBS International Conference on Biomedical and Health Informatics, Hong Kong, China, 5–7 January 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 60–61. [Google Scholar]
- Nakatani, M.; Okada, S.; Shimizu, S.; Mohri, I.; Ohno, Y.; Taniike, M.; Makikawa, M. Body movement analysis during sleep for children with ADHD using video image processing. In Proceedings of the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 3–7 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 6389–6392. [Google Scholar]
- Zhang, Y.; Kong, M.; Zhao, T.; Hong, W.; Xie, D.; Wang, C.; Yang, R.; Li, R.; Zhu, Q. Auxiliary diagnostic system for ADHD in children based on AI technology. Front. Inf. Technol. Electron. Eng. 2021, 22, 400–414. [Google Scholar] [CrossRef]
- Jiang, X.; Chen, Y.; Huang, W.; Zhang, T.; Zheng, Y. WeDA: Designing and Evaluating A Scale-driven Wearable Diagnostic Assessment System for Children with ADHD. In Proceedings of the CHI ’20: CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020. [Google Scholar]
- Bijlenga, D.; Ulberstad, F.; Thorell, L.B.; Christiansen, H.; Hirsch, O.; Kooij, J.S. Objective assessment of attention-deficit/hyperactivity disorder in older adults compared with controls using the QbTest. Int. J. Geriatr. Psychiatry 2019, 34, 1526–1533. [Google Scholar] [CrossRef]
- Bautista, M.A.; Hernandez-Vela, A.; Escalera, S.; Igual, L.; Pujol, O.; Moya, J.; Violant, V.; Anguera, M.T. A Gesture Recognition System for Detecting Behavioral Patterns of ADHD. IEEE Trans. Cybern 2016, 46, 136–147. [Google Scholar] [CrossRef]
- Jaiswal, S.; Valstar, M.F.; Gillott, A.; Daley, D. Automatic detection of ADHD and ASD from expressive behaviour in RGBD data. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 762–769. [Google Scholar]
- Porumb, M. Using TOVA for the assessment of ADHD: A case study. Cogn. Brain Behav. 2007, 11, 571. [Google Scholar]
- Katz, M.; Adar Levine, A.; Kol-Degani, H.; Kav-Venaki, L. A compound herbal preparation (CHP) in the treatment of children with ADHD: A randomized controlled trial. J. Atten. Disord. 2010, 14, 281. [Google Scholar] [CrossRef]
- Peskin, M.; Sommerfeld, E.; Basford, Y.; Rozen, S.; Zalsman, G.; Weizman, A.; Manor, I. Continuous Performance Test Is Sensitive to a Single Methylphenidate Challenge in Preschool Children with ADHD. SAGE Publ. 2020, 24, 226–234. [Google Scholar] [CrossRef] [PubMed]
- Rotem, A.; Ben-Sheetrit, J.; Newcorn, J.; Danieli, Y.; Peskin, M.; Golubchik, P.; Ben-Hayun, R.; Weizman, A.; Manor, I. The Placebo Response in Adult ADHD as Objectively Assessed by the TOVA Continuous Performance Test. J. Atten. Disord. 2021, 25, 1311–1320. [Google Scholar] [CrossRef] [PubMed]
- Asbaqi, M.; Arjmandnia, A.; Rahmanian, M.; Asbaqi, E. Comparing the Effect of Neurofeedback Training with Neurofeedback Along with Cognitive Rehabilitation on ADHD Children’s Improvement. Neuropsychology 2016, 2, 75–88. [Google Scholar]
- Wada, N.; Yamashita, Y.; Matsuishi, T.; Ohtani, Y.; Kato, H. The test of variables of attention (TOVA) is useful in the diagnosis of Japanese male children with attention deficit hyperactivity disorder. Brain Dev. 2000, 22, 378–382. [Google Scholar] [CrossRef] [PubMed]
- Forbes, G.B. Clinical utility of the Test of Variables of Attention (TOVA) in the diagnosis of attention-deficit/hyperactivity disorder. J. Clin. Psychol. 1998, 54, 461–476. [Google Scholar] [CrossRef]
- Weyandt, L.L.; Mitzlaff, L.; Thomas, L. The Relationship Between Intelligence and Performance on the Test of Variables of Attention (TOVA). J. Learn Disabil. 2002, 35, 114–120. [Google Scholar] [CrossRef]
- Llorente, A.M.; Voigt, R.; Jensen, C.L.; Fraley, J.K.; Rennie, W.C.H.K.M. The Test of Variables of Attention (TOVA): Internal Consistency (Q1 vs. Q2 and Q3 vs. Q4) in Children with Attention Deficit/Hyperactivity Disorder (ADHD). Child Neuropsychol. 2008, 14, 314–322. [Google Scholar] [CrossRef]
- Huang, Y.S.; Chen, N.H.; Li, H.Y.; Wu, Y.Y.; Chao, C.C.; Guilleminault, C. Sleep disorders in Taiwanese children with attention deficit/hyperactivity disorder. J. Sleep Res. 2010, 13, 269–277. [Google Scholar] [CrossRef]
- Kollins, S.H.; Deloss, D.J.; Caadas, E.; Lutz, J.; Faraone, S.V. A novel digital intervention for actively reducing severity of paediatric ADHD (STARS-ADHD): A randomised controlled trial. Lancet Digit. Health 2020, 2, e168–e178. [Google Scholar] [CrossRef]
- Hunt, M.G.; Bienstock, S.W.; Qiang, J.K. Effects of diurnal variation on the Test of Variables of Attention performance in young adults with attention-deficit/hyperactivity disorder. Psychol Assess 2012, 24, 166–172. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. Adv. Neural Inf. Process. Syst. 2014, 1, 568–576. [Google Scholar]
- Hara, K.; Kataoka, H.; Satoh, Y. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Wang, L.; Yuan, X.; Zong, M.; Ma, Y.; Wang, R. Multi-cue based four-stream 3D ResNets for video-based action recognition. Inf. Sci. 2021, 575, 654–665. [Google Scholar] [CrossRef]
- Zong, M.; Wang, R.; Chen, X.; Chen, Z.; Gong, Y. Motion saliency based multi-stream multiplier ResNets for action recognition-ScienceDirect. Image Vis. Comput. 2021, 107, 104108. [Google Scholar] [CrossRef]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef]
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional Multi-person Pose Estimation. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.S.; Lu, C. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 5–20 June 2019; pp. 10863–10872. [Google Scholar]
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 143–152. [Google Scholar]
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Cheng, J.; Lu, H. Extremely Lightweight Skeleton-Based Action Recognition With ShiftGCN++. IEEE Trans. Image Process. 2021, 30, 7333–7348. [Google Scholar] [CrossRef]
- Do, J.; Kim, M. Skateformer: Skeletal-temporal transformer for human action recognition. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 401–420. [Google Scholar]
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
- Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4804–4814. [Google Scholar]
- Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar]
- Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656. [Google Scholar]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
- Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
- Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimed. 2012, 19, 34. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
- Kwon, H. Adversarial image perturbations with distortions weighted by color on deep neural networks. Multimed. Tools Appl. 2023, 82, 13779–13795. [Google Scholar] [CrossRef]
| Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. | 
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
