Continuous Recognition of Teachers’ Hand Signals for Students with Attention Deficits

In the era of inclusive education, students with attention deficits are integrated into the general classroom. To ensure a seamless transition of students' focus towards the teacher's instruction throughout the course and to align with the teaching pace, this paper proposes a continuous recognition algorithm for capturing teachers' dynamic gesture signals. This algorithm aims to offer instructional attention cues for students with attention deficits. Based on the body landmarks of the teacher's skeleton extracted by the vision and machine learning-based MediaPipe BlazePose, the proposed method uses simple rules to detect the teacher's hand signals dynamically and provides three kinds of attention cues (Pointing to left, Pointing to right, and Non-pointing) during the class. Experimental results show that the average accuracy, sensitivity, specificity, precision, and F1 score reached 88.31%, 91.03%, 93.99%, 86.32%, and 88.03%, respectively. By analyzing non-verbal behavior, our method performs competently, can replace verbal reminders from the teacher, and can help students with attention deficits in inclusive education.


Introduction
Under the trend of inclusive education, all kinds of students with special needs enter the general classroom to receive an education. Learning in a general classroom requires concentration and attention in the face of the teacher's teaching. However, many students with attention deficits are not adept at this. Problems with joint attention among students with autism spectrum disorder (ASD) may continue into school [1]. Students with attention deficit hyperactivity disorder (ADHD) often require special attention from teachers [2]. It is also not easy for students with intellectual developmental disabilities (IDDs) [3] to maintain enough attention to the teacher's teaching in class. Even students with typical development need to pay attention to body language and teaching rhythm. Attention deficits in students with ASD caused by difficulties in processing dynamic visual information can be mitigated by slowing down the speed of visual information and utilizing haptic technology [4]. However, real-time visual information in the classroom cannot be slowed down. Instead, this visual information can be converted into vibration signals to provide haptic feedback. This feedback model can also be beneficial for other students with attention deficits. To obtain better learning outcomes, there is a real need to use engineering technology to assist students' learning in inclusive education.
In recent years, the use of technology to support students can be divided into iPad applications (Apps), augmented reality (AR), virtual reality (VR), and artificial intelligence (AI). App-based courses are effective for teaching learning and communication skills to students with attention deficits [5]. Studies have pointed out that AR-based courses increase learning effectiveness and learning motivation [6,7]. Courses involving VR may improve cognitive ability and executive function [8]. Research on AI intervention in education encompasses multiple topics, such as aiding teachers' decision making, enhancing smart teaching systems, and improving educational data collection [9]. However, technology that assists students with attention deficits in adapting their cognitive abilities to the general classroom is rare. Researchers also look forward to developing student-centered technology that enables students to adapt to any teacher's teaching or curriculum.
Studies regarding teachers' hand signals or gestures have been investigated previously. Some of these studies incorporate OpenPose, a tool capable of multi-person keypoint detection and pose estimation from images and videos [10,11], providing effective technical support for analyzing gestures. In [12], a two-stage method is proposed to detect whether the teacher is pointing to an electronic whiteboard, named the instructional pointing gesture. In the first stage, the teacher's 25 joint points extracted by OpenPose are put into a gesture recognition neural network to detect whether the teacher is stretching out his/her arm and pointing to something. Then, the instructional pointing gesture is detected by verifying whether the teacher is specifically pointing at the whiteboard, by analyzing the relative positioning of the teacher's hand and the whiteboard. In [13], a hand gesture recognition model for infrared images is proposed in classroom scenarios. Initially, the infrared teaching images are fed into the convolutional neural network (CNN) proposed in the study. This CNN extracts the positions of 25 joint points, which are subsequently used to determine the probability of the hand pose during teaching. An electronic display recognition algorithm is also realized. The gesture recognition methodology described in [14] operates with two different types of video recordings: exact gesture recordings and real-class recordings. Sixty keypoints in total, comprising 18 body keypoints, 21 left-hand keypoints, and 21 right-hand keypoints, are processed by OpenPose. Subsequently, machine learning implemented through Microsoft Azure Machine Learning Studio is applied to decide whether the teacher is gesticulating or not. In [15], deep learning is utilized to identify five types of online teaching gestures, including indicative, one-hand beat, two-hand beat, frontal habitual, and lateral habitual gestures. The study analyzes these gestures using images marked by human keypoint detection and establishes a recognition model with a ten-layer CNN. The study in [16] illustrates the effectiveness of using smartphone videos, in conjunction with OpenPose and machine learning algorithms, to estimate the direction of gaze and body orientation in elementary school classrooms. In this work, the body keypoints are provided by OpenPose for each person in the classroom. The main objective of [17] is to compare the non-verbal teaching behaviors, including arm extensions and body orientation, exhibited by both pre-service teachers and in-service teachers in asynchronous online videos. By utilizing deep learning technology, the teachers' poses are quantified while they perform instruction towards a video camera, with special attention given to the arm stretch range and body orientation related to the subject topic. In addition, hand signals or gestures have also been explored in various other educational applications [18][19][20][21][22][23].
In recent years, there has been an increasing emphasis on students' digital skills to facilitate learning, as well as on improving teachers' digital competencies to enhance teaching practices. Students require a comprehensive grasp of information and communication technology, which has the potential to enrich, transform, and elevate the quality of their learning experiences. Effective utilization of technology in the classroom relies on both technological accessibility and the routine choices made by teachers who incorporate technology into their teaching practices [24]. The development of advanced technologies in smart classrooms is expected to enhance teaching and learning experiences and the interactions between teachers and students [25]. The teacher's hand signals, serving as a non-verbal form of communication, can significantly aid students in concentrating on the learning content, thereby facilitating an improvement in their learning performance. The ability of students to recognize and comprehend these hand signals is crucial, as it facilitates a seamless transition of their attention, enabling them to follow the teacher's instructions, engage with the content, and actively participate in the learning process, thereby enhancing the overall educational experience.
Most of the related research mainly gives teachers useful feedback about their gesticulation in class and helps teachers to improve their teaching practices [12][13][14][15][16][17][18]. Our goal is to develop a student-centered technology and to highlight the significance of recognizing teachers' hand signals. To decrease the need for verbal messages from the teacher and to help students connect to the rhythm of teaching more quickly by themselves, we explore how to offer a dynamic and instructional reminder of simple attention cues for assisting students with attention deficits in inclusive education classes.

Proposed Method
Our study targets scenes captured by a video camera in an elementary school classroom. To provide a simple and practical attention cue to the students, the teacher's hand signals are classified into three kinds of signals from the students' field of vision during class: Pointing to left, Pointing to right, and Non-pointing. The pointing focus may be a blackboard, an electronic whiteboard, or a projection screen.
Identifying human activities is an important task that relies on video sequences or sensor data for prediction. It is a challenging computer vision task due to the diversity and complexity of the input data, especially in applications like human motion recognition in videos. Typically, techniques such as extracting skeletal points are used to capture motion features and to determine whether actions are triggered or meet specific criteria, as in yoga pose detection [26], Parkinson's disease diagnosis based on gait analysis [27], and pedestrian intention prediction [28], rather than analyzing the entire video comprehensively. This approach helps to reduce the amount of data while enhancing efficiency and accuracy in activity recognition. Some studies [29][30][31][32][33] discuss methodologies for detecting 2D or 3D keypoints in human poses, offering insights into the analysis of motion, gesture, posture, or action. Due to the intricacies and longer processing time associated with 3D detection, we choose a 2D approach.
We exploit the computer vision-based machine learning library MediaPipe to find the teacher's skeleton. MediaPipe provides complete models for many different tasks, including pose [34,35], hands [36], face [37], and so on. In this study, MediaPipe BlazePose [34,35], which detects 33 landmarks for body pose tracking, is used to detect the teacher's skeleton in the video sequence. BlazePose, despite supporting only single-person detection, utilizes heatmaps and regression for keypoint coordinates, making it generally faster than OpenPose. Therefore, we adopt BlazePose, which balances accuracy with computational efficiency to better meet the needs of mobile and real-time applications. Figure 1 depicts the 33 landmarks detected by MediaPipe BlazePose and their numeric labels. Table 1 provides a detailed list of these landmarks, each with its corresponding BlazePose landmark name. These 33 landmarks are from various parts of the human body. They were specifically chosen to provide a comprehensive understanding of the human pose, enabling the estimation of rotation, size, and position of the region of interest [34]. These landmarks represent specific points of interest of the pose measurement.
Figure 2 shows some examples of the landmarks extracted by MediaPipe BlazePose and classified according to the three kinds of the teacher's hand signals. Our method aims to find simple rules to analyze the three recognized signals in the students' field of vision to facilitate fast implementation.
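Only six of the 33 BlazePose landmarks are used later (the shoulders, wrists, and hips). As a minimal sketch, assuming a detected pose is available as a list of 33 (x, y) tuples, the relevant points could be pulled out as follows; the helper names are ours, while the index numbers follow the standard BlazePose landmark numbering:

```python
# Indices of the six BlazePose landmarks needed by the method,
# following the standard 33-point BlazePose numbering.
LANDMARKS = {
    "left_shoulder": 11,
    "right_shoulder": 12,
    "left_wrist": 15,
    "right_wrist": 16,
    "left_hip": 23,
    "right_hip": 24,
}

def pick(pose, names=LANDMARKS):
    """Extract the (x, y) coordinates needed by the recognizer
    from a 33-point pose given as a list of (x, y) tuples."""
    return {name: pose[idx] for name, idx in names.items()}
```

Working with this small dictionary instead of the full pose keeps the per-frame rule checks cheap.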
Only the coordinates of the shoulder, wrist, and hip landmarks are required in our method. The initial status of the signal of a video sequence is Non-pointing. The Pointing to left and Pointing to right signals must satisfy Equation (1), as indicated in Figure 3. Let X and Y represent the horizontal and the vertical coordinates in a video frame, respectively. The coordinates of the wrist and the shoulder are denoted as (X_wrist, Y_wrist) and (X_shoulder, Y_shoulder), respectively. According to the vertical coordinates of the wrist and the shoulder, R is defined in Equation (1):

R = \begin{cases} D/L, & \text{if } Y_{\mathrm{wrist}} < Y_{\mathrm{shoulder}} \\ D_x/L, & \text{if } Y_{\mathrm{wrist}} \geq Y_{\mathrm{shoulder}} \end{cases} \quad (1)

In Equation (1) and Figure 3, L represents the Euclidean distance between the center of the two shoulder landmarks and the center of the two hip landmarks, D indicates the Euclidean distance between the shoulder and wrist landmarks, and D_x denotes the absolute difference between the horizontal coordinates of the shoulder and wrist landmarks. The essential criterion for pointing signals is R > Th_R, where Th_R is an adjustable threshold that is empirically set to 0.5 in our algorithm. If the criterion R > Th_R is not satisfied, the status is Non-pointing; if it is satisfied, the pointing direction is determined further. Pointing to left means that the horizontal coordinate of the wrist landmark is smaller than that of the corresponding shoulder landmark (X_wrist < X_shoulder) for either the right hand or the left hand; Pointing to right is determined by the condition that the horizontal coordinate of the wrist landmark is larger than that of the corresponding shoulder landmark (X_wrist > X_shoulder) for either hand. A flowchart of the criterion for Pointing to left, Pointing to right, or Non-pointing according to the skeleton landmarks is demonstrated in Figure 4. The skeleton detection method aims to continuously follow and detect the X and Y coordinates of different landmarks on a single human body and to obtain good detection features by adjusting the minimum detection
confidence and the minimum tracking confidence of the MediaPipe BlazePose model (detector and tracker). In the current study, the minimum detection confidence is set to 0.1, which means that the detector needs a prediction confidence greater than or equal to 10% to detect landmarks. Because the content of the blackboard, electronic whiteboard, or projection screen is usually complex, setting a lower minimum detection confidence makes it easier to capture the landmarks of the teacher's skeleton. The minimum tracking confidence is set to 0.9, which means that the tracked landmarks are valid if their confidence is greater than or equal to 90%. If the confidence is lower than 90%, the detector is called in the next frame to re-track the landmarks.
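The per-frame pointing rule above can be condensed into a few lines. The following Python sketch is our illustrative reading of Equation (1) and the direction test; the function name, and in particular the branch that selects D versus D_x by the wrist's vertical position, are our assumptions rather than the authors' code:

```python
import math

TH_R = 0.5  # empirical threshold Th_R from the paper

def classify_hand(wrist, shoulder, torso_len):
    """Classify one hand for a single frame from (x, y) image
    coordinates (y grows downward). torso_len is L, the distance
    between the shoulder center and the hip center."""
    d = math.hypot(wrist[0] - shoulder[0], wrist[1] - shoulder[1])  # D
    dx = abs(wrist[0] - shoulder[0])                                # Dx
    # Equation (1): pick D or Dx according to the wrist's vertical
    # position relative to the shoulder (our assumed mapping).
    r = (d if wrist[1] < shoulder[1] else dx) / torso_len
    if r <= TH_R:
        return "Non-pointing"
    # Direction test: compare horizontal coordinates of wrist and shoulder.
    return "Pointing to left" if wrist[0] < shoulder[0] else "Pointing to right"
```

For example, a wrist extended far to the left of the shoulder at shoulder height yields Pointing to left, while a wrist resting near the hip yields Non-pointing.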
For each frame, the landmarks of the left hand are checked first, and then those of the right hand. The duration of any of these three kinds of signals should be maintained for at least Th_t, and if the current status differs from the previous status, the recognized signal is triggered. Th_t is empirically set to 1.5 s in our method to avoid reacting to unintended motion of the arms. If the recognized signals are triggered by both hands, the later status is given precedence over the previous one. In other words, if the recognized signals of both hands happen in the same frame, the result of the right hand has higher priority. However, a recognized pointing signal of Pointing to left or Pointing to right has higher priority than Non-pointing in the same frame.
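The temporal rule and the hand-priority rule above can be sketched as a small state machine. This is our own illustrative implementation under the stated Th_t and priority conventions, not the authors' code; all names are ours:

```python
class SignalDebouncer:
    """Trigger a status change only after the new status has
    persisted for Th_t seconds, filtering unintended arm motion."""

    def __init__(self, th_t=1.5, fps=30):
        self.min_frames = int(th_t * fps)
        self.current = "Non-pointing"  # initial status of a sequence
        self.candidate = None
        self.count = 0

    def update(self, status):
        """Feed one per-frame status; return the triggered status."""
        if status == self.current:
            self.candidate, self.count = None, 0
        elif status == self.candidate:
            self.count += 1
            if self.count >= self.min_frames:
                self.current, self.candidate, self.count = status, None, 0
        else:
            self.candidate, self.count = status, 1
        return self.current

def combine_hands(left, right):
    """Right hand wins when both hands trigger in the same frame,
    and any pointing signal outranks Non-pointing."""
    if right != "Non-pointing":
        return right
    return left
```

At 30 frames/s, Th_t = 1.5 s corresponds to 45 consecutive frames before a new status is accepted.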
Figure 5a,b depict the signal time series of D/L and D_x/L, respectively. In the figures, the horizontal axis represents the frame number, the red broken lines indicate the threshold Th_R (0.5), and the yellow background shows the selection of the D/L or D_x/L signal according to whether the vertical coordinate of the wrist is greater than or equal to that of the shoulder. Finally, the signals with the yellow background are decided to be pointing signals by the essential criterion of Equation (1). Figure 5c shows the D/L and D_x/L signals together.

Experimental Results
We collected five in-class video sequences (video 1-video 5), which were manually labeled by a highly experienced teacher in inclusive education, to evaluate the recognition performance of the proposed method. Each of the five video sequences depicts a teacher delivering a lecture using a blackboard, electronic whiteboard, or projection screen, with students seated in the classroom. The numbers of frames of the five video sequences are 71,678, 73,285, 71,202, 46,404, and 32,992, with a frame rate of 30 frames/s. The resolution is 1920 × 1080 for video 1-video 4 and 854 × 480 for video 5. This study was approved by the Institutional Review Board/Ethics Committee of Mennonite Christian Hospital, Taiwan.
Accuracy denotes the ratio of the number of correctly classified signals to the total number of signals. Equations (2)-(4) represent Sensitivity_i, Specificity_i, and Precision_i, defined by the True Positive_i (TP_i), True Negative_i (TN_i), False Positive_i (FP_i), and False Negative_i (FN_i) values for the class label i ∈ {Pointing to left, Pointing to right, Non-pointing}. Equation (5) denotes the F1 score_i, defined by Precision_i and Recall_i (Sensitivity_i). The macro-average [38] is calculated by taking the average of each metric over the three classes.
\mathrm{Sensitivity}_i\ (\mathrm{Recall}_i) = \frac{TP_i}{TP_i + FN_i} \quad (2)

\mathrm{Specificity}_i = \frac{TN_i}{TN_i + FP_i} \quad (3)

\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \quad (4)

F_1\ \mathrm{score}_i = \frac{2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \quad (5)

Figure 6 shows the confusion matrices for the five video sequences. Standard assessments to measure the performance of the proposed method, including accuracy, sensitivity, specificity, precision, and F1 score, are listed in Table 2. Each macro-averaged metric is also listed. The average accuracy, sensitivity, specificity, precision, and F1 score achieve 88.31%, 91.03%, 93.99%, 86.32%, and 88.03%, respectively. Moreover, the average accuracy ranges from 83.66% to 91.30%, the average sensitivity from 87.38% to 94.52%, the average specificity from 91.59% to 95.64%, the average precision from 80.88% to 89.97%, and the average F1 score from 83.40% to 91.09% for the five video sequences. As shown in Table 2, the accuracy of video 3 is lower than those of the other videos. This is because the teacher in this video often uses a telescopic pointer to point at the focus; therefore, the extension of the teacher's hand is not as far as that of the teachers in the other videos, which makes the accuracy drop a little. The other videos, in which the teacher points only by hand, have satisfactory results.
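The per-class metrics and their macro-averages can be reproduced from a confusion matrix in a few lines. The sketch below follows Equations (2)-(5) and the macro-averaging described above; the function names are ours, not part of the paper:

```python
def per_class_metrics(cm, i):
    """Sensitivity, specificity, precision, and F1 for class i,
    given a square confusion matrix cm[true][predicted]."""
    total = sum(sum(row) for row in cm)
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                  # true class i, predicted elsewhere
    fp = sum(row[i] for row in cm) - tp   # predicted i, true class elsewhere
    tn = total - tp - fn - fp
    sens = tp / (tp + fn)                       # Equation (2)
    spec = tn / (tn + fp)                       # Equation (3)
    prec = tp / (tp + fp)                       # Equation (4)
    f1 = 2 * prec * sens / (prec + sens)        # Equation (5)
    return sens, spec, prec, f1

def macro_average(cm):
    """Macro-average [38]: unweighted mean of each metric over classes."""
    per_class = [per_class_metrics(cm, i) for i in range(len(cm))]
    n = len(per_class)
    return tuple(sum(m[k] for m in per_class) / n for k in range(4))
```

Accuracy is simply the trace of the confusion matrix divided by its total count.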
The performance may be affected by the empirical values of Th_R and Th_t. If Th_R is chosen to be larger, a pointing signal can be detected only when the teacher extends the arm fully. We choose Th_R as 0.5; hence, the students can obtain the attention reminder for Pointing to left or Pointing to right in most cases. In Table 2, the averaged sensitivities of both Pointing to left and Pointing to right are more than 95%. However, the average sensitivity of Non-pointing is about 82.41%. If the minimum lasting time Th_t is shorter, it may easily lead to rapid signal switching due to unintended movements of the teacher's hands. Therefore, we select Th_t as 1.5 s for practical applications. Our continuous recognition approach provides three-class recognition and considers the time lag to avoid unintended and unnecessary rapid changes. The performance of our method is competent, and the approach is feasible for classroom applications.

Discussion
Our proposed method differs from those described in [12][13][14] in the following aspects. While they focus on binary classification tasks using single frames of the video sequence, our algorithm is tailored for continuous recognition with the classes Pointing to left, Pointing to right, and Non-pointing. In addition, our method avoids reacting to unintended arm movements by triggering signal recognition only when a signal transition is maintained for a sufficient time interval. Table 3 shows a comparison of the proposed method with those in [12][13][14]. The method described in [12] is able to recognize situations where the teacher extends his/her arm and gestures towards the whiteboard. However, our approach also allows teachers to use a telescopic pointer to guide attention. In [12], the accuracy of pointing gesture recognition at the first stage is 90%. By combining whiteboard detection, the final precision is 89.2% and the recall is 88.9% for instructional pointing gestures. In [13], the final accuracy exceeds 90% after the training process. In [14], the accuracy depends on the training and testing methods, which combine the different available datasets, and ranges between 54% and 78%. The available training and testing datasets included acted recordings and real-class recordings, in both their semi-automatic and manual classification versions. While [12,14] rely on OpenPose, we utilize MediaPipe BlazePose, which has a faster processing speed, for pose information capture. Our method achieves competent performance with an accuracy of 88.31%, a sensitivity (recall) of 91.03%, and a precision of 86.32%, potentially with faster implementation speed than the machine learning or deep learning models in previous works, which can be attributed to our simple algorithm. The methods in [39,40] use continuous-wave radar for real-time hand gesture recognition and achieve high accuracy rates. However, our choice of a video-based approach is motivated by considerations of accessibility, ease of implementation, and practical utility in inclusive education environments where resources and infrastructure may be limited. Given that most teachers have smartphones, this video-based method is highly convenient and widely accessible.
The fundamental possibility of recognizing audio-visual speech and gestures using sensors on mobile devices is demonstrated in [41]. Studies have shown that wearable devices can effectively improve attention in educational settings through smart watch vibrations or visual stimuli from smart glasses [42][43][44][45][46]. Therefore, future implementations of our recognition algorithm can be embedded in wearable devices such as smart watches or smart glasses. The benefit of sending recognized signals to students' smart watches or smart glasses is to provide real-time feedback and assistance. Compared to smartphones, these devices provide attention cues to students in a less distracting manner. For students with attention deficits, this can help them stay focused, ameliorate hyperactive behaviors [43], and improve cognition [46]. Wearable devices also make these students less conspicuous when using assistive technology in class, thereby enhancing its user-friendliness.

Conclusions
This paper introduces a method aimed at recognizing teachers' hand signals to aid students with attention deficits in the classroom. By analyzing non-verbal behavior, our study continuously recognizes the teachers' hand signals. Using simple rules on the body landmarks of the teacher's skeleton obtained from MediaPipe BlazePose, the proposed method dynamically detects the teacher's hand signals and can provide attention cues during class. The experimental results show competent performance compared with the other research. The proposed mechanism provides a powerful tool for enhancing engagement among students who display attentional challenges within traditional, verbally driven educational settings. This approach augments the inclusiveness of the teaching environment by interpreting instructors' non-verbal cues. This work paves the way for further exploration to refine non-verbal communication tools within inclusive education classrooms. The findings from this study demonstrate robust performance and represent significant progress in the creation of adaptive educational spaces designed to accommodate the needs of learners with attention deficits by offering personalized support through less conspicuous means.
In the future, continuous recognition results can be sent to smart watches or glasses worn by the students as a practical tool for students with attention deficits. Additionally, deep learning techniques can be applied: lightweight CNNs suitable for storage on wearable devices with small memory capacity can be trained for accurate classification. Silent and non-verbal messages are of high value in education. We look forward to helping students shift their attention smoothly to the teacher's teaching rhythm.

Figure 2.
Figure 2. Examples of the three kinds of hand signals. (a) Pointing to left. (b) Pointing to right. (c) Non-pointing.

Figure 3.

Figure 4 .
Figure 4. Flowchart of recognition algorithm according to skeletal landmarks.



Figure 5.
Figure 5. An example of the continuous recognition of hand signals. (a) D/L. (b) D_x/L. (c) D/L and D_x/L.

Table 2 .
Performance of the proposed recognition method.