Eye-Gaze Controlled Wheelchair Based on Deep Learning

In this paper, we design an intelligent wheelchair with eye-movement control for patients with ALS in natural environments. The system consists of an electric wheelchair, a vision system, a two-dimensional robotic arm, and a main control system. The smart wheelchair captures the eye image of the user through a monocular camera and uses deep learning with an attention mechanism to infer the eye-movement direction. In addition, starting from the relationship between the joystick trajectory and the wheelchair speed, we establish a motion acceleration model of the smart wheelchair, which reduces sudden acceleration during rapid motion and improves the smoothness of the wheelchair's movement. The lightweight eye-movement recognition model is deployed on an embedded AI controller. Test results show that the accuracy of eye-movement direction recognition is 98.49%, the wheelchair moves at up to 1 m/s, and its trajectory is smooth, without sudden changes.


Introduction
Amyotrophic lateral sclerosis (ALS) is a progressive and fatal neurodegenerative disease that causes the degeneration of the patient's upper and lower motor neurons, thereby weakening the muscles. Therefore, although many ALS patients are conscious, they cannot perform physical movements or verbal expression. A growing body of research is dedicated to applying artificial intelligence technologies to power wheelchairs to improve the quality of life of people with ALS. Over the past few decades, researchers have studied wheelchair motion control methods including gesture control [1][2][3][4][5], voice control [6][7][8][9][10], eye-tracking control [11][12][13][14][15], and brain-computer interfaces [16][17][18][19][20][21][22]. These control methods can replace the joystick in reading the user's intended direction of motion and realizing the motion control of the wheelchair. However, because patients with ALS lose limb control and verbal communication abilities, gesture control and voice control are not viable options. In brain-computer interfaces, although the collection of brain signals largely eliminates noise interference, the semi-invasive or invasive electrodes used in these interfaces can pose risks to human health [23]. Compared with the above-mentioned control methods, eye-tracking control has unique advantages in terms of safety, portability, and practicality for patients with ALS.
Currently, research on the eye-tracking control of wheelchairs mainly focuses on two aspects: eye-tracking recognition and wheelchair control. Xiaokun Li et al. designed a head-mounted device based on an energy-controlled iterative curve-fitting method of infrared light, which can achieve precise pupil detection and tracking. The experimental results showed that the average tracking accuracy of the method for pupil rotation was at least 1.38% higher than that of conventional methods [24]. Fatma et al. proposed a method that uses a front camera to capture the user's face information and determines the pupil center coordinates through a fuzzy logic controller to output wheelchair control decisions.

Dataset Creation
Datasets are the basis for training deep learning models, and the performance of deep learning models heavily depends on the quality and size of the datasets they are trained on. In order to further improve the practicability and accuracy of the model, 100 Chinese volunteers were recruited for this dataset collection. Through the two major scenes of virtual and reality, we recorded videos of volunteers gazing in different directions while completing tasks, extracted human eye images frame-by-frame using the OpenCv program and the Dlib algorithm, and automatically labeled the obtained data by the task attributes of the time period in which the frame was located.


Multidimensional Eye-Tracking Data Acquisition
In this paper, we built the eye-tracking dataset through two dimensions: virtual and reality. Multidimensional datasets can capture the complex relationships between different features, better describe the characteristics and attributes of the data, provide a more comprehensive and accurate representation of the data, and enhance model performance.
(1) Virtual scene acquisition
In this paper, a virtual scene for eye-tracking direction detection was built under the robot operating system ROS using Gazebo software, as shown in Figure 2. The virtual scene is a maze containing a car whose viewpoint is matched to a virtual camera, so that the perspective of the volunteer is consistent with that of the car. The volunteers keep their heads still while watching the road in front of the car with their eyes and control the movement of the car through the keyboard, simulating the movement of a wheelchair in a real-life scenario. When the car is in the state shown in Figure 2, volunteers look at the feasible road on the right side of the wall and, at the same time, steer the car to the right through the keyboard. The volunteer's facial image is captured and recorded by a camera placed at the center of the computer screen, while the volunteer's keyboard actions are recorded by a script, achieving hand-eye synergy data collection. Through this method, the eye-tracking dataset in the virtual scene was established.
Sensors 2023, 23, x FOR PEER REVIEW
(2) Real scene acquisition
We set up an environment for eye-tracking data acquisition in a real scene, as shown in Figure 3b. The scene consisted of a wall with a nine-grid (as shown in Figure 3a), a laser pointer, and a wheelchair with a fixed camera. The nine-grid was a square area of 210 cm × 210 cm, each grid was 70 cm × 70 cm, and a red sign was pasted in the center of each grid. When the volunteer gazes at a designated red sign, determining the direction of the eye gaze becomes a nine-class classification problem, which reduces the influence of the volunteer's subjective behavior. In the real scene, the experimental assistant pointed to the red mark in the center of each grid with a laser pointer, row by row from left to right, while the volunteer sat in a wheelchair, kept their head still, and stared at the position illuminated by the assistant; each position was maintained for 10 s.
At the same time, the facial changes of the volunteers were recorded in real time by the camera on the wheelchair, and each frame of facial data was automatically labeled according to the time it belonged to.

Data Preprocessing
To ensure the accuracy of convolutional neural network-based algorithms, images need to be preprocessed before they are analyzed. The quality of the image has a direct impact on the accuracy of the algorithm. Different tasks require different image preprocessing methods to remove irrelevant information and enhance relevant information to improve task reliability. In this paper, the convolutional neural network was used to detect the human eye-tracking state, so redundant information other than human eyes needed to be removed in the preprocessing stage.
We used the detector function of the Dlib library [43] to locate 68 feature points on the recognized face, used OpenCv to extract frames from the video, and applied the face-detection algorithm of the Dlib library to each frame. In order to effectively exclude the irrelevant area around the bridge of the nose between the two eyes, the images of the left and right eyes were extracted separately, instead of directly extracting one image containing both eyes. The comparison between the two is shown in Figure 4. The area delineated by the feature points with serial numbers 42-47 corresponds to the left eye, and the feature points with serial numbers 36-41 correspond to the right eye. Taking the extraction of the left-eye image as an example, the minimum and maximum values of the abscissas and ordinates of the six points define the boundary of the left-eye image:

x_min = min(x_i), x_max = max(x_i), y_min = min(y_i), y_max = max(y_i), i = 42, ..., 47

where x_i and y_i are the abscissa and ordinate of each feature point, x_min and y_min are the minimum abscissa and ordinate of the eye area, and x_max and y_max are the maximum abscissa and ordinate of the eye area.
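The boundary computation above can be sketched in a few lines of plain Python. This is an illustrative helper, not the paper's code; the landmark coordinates used in the example are made up, and only the min/max logic follows the formula.

```python
# Minimal sketch: compute the eye bounding box from the six Dlib
# landmark points of one eye (indices 42-47 for the left eye).
# The landmark coordinates below are illustrative placeholders.

def eye_bounding_box(points):
    """points: list of (x, y) tuples for one eye's six landmarks.
    Returns (x_min, y_min, x_max, y_max) of the crop region."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

# Example with made-up left-eye landmark coordinates:
left_eye = [(220, 150), (228, 146), (238, 146), (246, 151), (237, 155), (228, 155)]
box = eye_bounding_box(left_eye)
print(box)  # (220, 146, 246, 155)
```

In practice the six points would come from the Dlib shape predictor's output for one frame, and the returned box would be used to crop the eye region from the image.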


A total of 200 videos of volunteers gazing in different directions in the virtual and real scenes were saved. In order to extract the human eye information in each frame, we processed the videos frame by frame and cropped the eye images according to the coordinates of the eye feature points. The region between the two eyes may contain noise and redundant information that interferes with the prediction of the gaze direction; for example, the nose bridge, eyeglass frames, eye shadow, and the spacing between the eyes may affect the feature extraction of the eyes, thereby reducing the accuracy of gaze direction estimation. By removing the region between the two eyes, attention can be focused on the separate eye regions and the performance of gaze direction recognition can be improved.
Considering the interference of the distance between the eyes on the eye information, we cropped the eye regions from each frame, resized each extracted image to 100 × 50, and merged the two eyes horizontally. According to the task attribute of the time at which each frame was captured, the eye information in each frame was automatically labeled, and finally, the collection of gaze data from 100 volunteers in the virtual and real scenes was completed.
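The crop-and-merge step can be sanity-checked on array shapes alone. This is a sketch under the assumption that each eye crop has already been resized to 100 × 50 (width × height); OpenCV's `cv2.resize(img, (100, 50))` and `cv2.hconcat` would perform the real operations, with `np.hstack` standing in here:

```python
import numpy as np

# Sketch of the crop-resize-merge step, assuming each eye crop has
# already been resized to 100 x 50 (width x height). OpenCV's
# cv2.resize(img, (100, 50)) and cv2.hconcat([left, right]) would do
# the same; np.hstack stands in so the shapes can be checked.

left_eye  = np.zeros((50, 100), dtype=np.uint8)   # H x W after resize
right_eye = np.zeros((50, 100), dtype=np.uint8)

merged = np.hstack([left_eye, right_eye])          # eyes side by side
print(merged.shape)  # (50, 200)
```

The merged 50 × 200 image is what the network would consume as one sample.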

Unification of Datasets
The data collected in the real environment in Section 3.2 had nine different labels, but after visualization by t-SNE [44], these labels could be directly merged into three categories, which is also consistent with reality. For example, when the volunteers gazed at the leftmost column, their eye features were distributed in approximately the same region regardless of which row they gazed at. Therefore, three category labels could be used to uniformly mark the dataset of the real scene, which matches exactly the label classification of the dataset collected in the virtual environment and facilitates further dataset screening. The t-SNE visualization is shown in Figure 5.
For the convenience of description, we introduce some notation: X = {X_1, X_2, ..., X_n}, n = 1, 2, ..., 100, and Y = {y_1, y_2, y_3}. The set X represents the 100 volunteers participating in the data collection; each X_i contains 1350 photos, and each photo has a corresponding label. Y represents the set of labels: y_1 corresponds to the label left, y_2 corresponds to the label forward, and y_3 corresponds to the label right. A total of 135,000 pictures with gaze labels in different directions were selected to provide training data for the algorithm training in the next section. The established dataset is shown in Figure 6.
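The nine-to-three label unification suggested by the t-SNE plot amounts to collapsing each grid cell's label onto its column. A minimal sketch, assuming row-major cell numbering from the top-left (the numbering scheme itself is an assumption, not stated in the paper):

```python
# Sketch of the nine-to-three label unification: each nine-grid
# cell's label collapses to its column (left / forward / right).
# Cells are assumed numbered 0-8, row-major from the top-left.

COLUMN_LABELS = ["left", "forward", "right"]

def unify_label(cell):
    """Map a nine-grid cell index (0-8) to a gaze-direction label."""
    return COLUMN_LABELS[cell % 3]

print([unify_label(c) for c in range(9)])
# ['left', 'forward', 'right', 'left', 'forward', 'right', 'left', 'forward', 'right']
```

Under this mapping the real-scene labels align one-to-one with the three labels used in the virtual scene.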

Eye-Tracking Model Building
In this paper, we used deep learning to estimate the eye-gaze direction in the eye-tracking recognition task for eye-tracking wheelchairs; the task is divided into feature extraction and classification. This was accomplished by training a deep learning network (GazeNet) on the human eye database we built, which incorporates several modules for feature extraction optimization, and tri-classifying the features for output using a cross-entropy loss function.

Eye-Gaze Direction Estimation
In this paper, we fully traded off the lightweightness and accuracy of the algorithm when designing the network. In order to run the gaze-direction determination algorithm efficiently on embedded devices, we added an improved Inception module to reduce the model computation, and the ResBlock module and the CBAM attention module to improve the network performance of the designed GazeNet network structure; the final overall network structure is shown in Figure 7. Each extracted human eye image first entered the improved Inception module after convolution and pooling operations; its output entered the ResBlock module after further convolution and pooling operations, and the resulting output entered the CBAM attention module after further convolution and pooling operations. Then, the output features were again convolved and pooled before being tri-classified by the fully connected layer to achieve the eye-gaze direction estimation.
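The described pipeline (Inception-style parallel branches, then a residual block, then a CBAM-style attention gate, then a fully connected tri-classifier) can be sketched in PyTorch. This is an illustrative skeleton only: all channel counts, kernel sizes, and the 50 × 200 input shape are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MiniGazeNet(nn.Module):
    """Illustrative skeleton of the described pipeline; layer sizes
    are assumptions, not the paper's configuration."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
                                  nn.ReLU(), nn.MaxPool2d(2))
        # Inception-style parallel branches, concatenated on channels
        self.branch1 = nn.Conv2d(8, 8, 1)
        self.branch3 = nn.Conv2d(8, 8, 3, padding=1)
        # Residual block: conv output is added back to its input
        self.res = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 16, 3, padding=1))
        # CBAM-style channel gate: global pooling -> MLP -> sigmoid
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(16, 4, 1), nn.ReLU(),
                                  nn.Conv2d(4, 16, 1), nn.Sigmoid())
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(16, 3))  # left / forward / right

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([self.branch1(x), self.branch3(x)], dim=1)
        x = torch.relu(self.res(x) + x)   # skip connection
        x = x * self.gate(x)              # channel re-weighting
        return self.head(x)

logits = MiniGazeNet()(torch.randn(1, 1, 50, 200))  # merged-eyes input
print(logits.shape)  # torch.Size([1, 3])
```

The three logits would be fed to SoftMax plus cross-entropy during training, matching the classification setup described later in the paper.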


Inception Module
The parallel structure adopted by the Inception module enables the input image to be processed by multiple convolutional kernels of different scales and pooling operations to obtain different levels of feature information [45]. The purpose is to extract features at different scales while keeping the size of the output feature map of the convolutional layer unchanged, to efficiently expand the depth and width of the network, and to prevent overfitting while improving the accuracy of the deep learning network. The specific implementation uses a combination of multiple convolutional kernels of different scales and pooling operations to increase the nonlinear representation of the model without increasing the number of model parameters. In this paper, we used the Inception module to decompose a 5 × 5 convolution kernel (Figure 8a) into two 3 × 3 convolution kernels (Figure 8b), so that only about (3 × 3 + 3 × 3)/(5 × 5) = 72% of the computational overhead is required. This reduced the number of model parameters and computation while maintaining the same perceptual field, reducing the computational burden, as shown in Figure 8.
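The 72% figure follows from a per-channel weight count, which can be checked directly:

```python
# Checking the paper's cost ratio for the Inception decomposition:
# replacing one 5 x 5 convolution with two stacked 3 x 3 convolutions
# keeps the 5 x 5 receptive field but needs fewer weights per channel.

cost_5x5 = 5 * 5               # 25 weights
cost_two_3x3 = 3 * 3 + 3 * 3   # 18 weights
ratio = cost_two_3x3 / cost_5x5
print(ratio)  # 0.72, i.e. ~72% of the original overhead
```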

ResBlock Module
Convolutional neural networks can extract a rich feature hierarchy, but they also carry the hidden danger of vanishing or exploding gradients, and the use of regularization may produce degradation problems. Therefore, we added the ResBlock module in the middle layers of the network, which achieves a skip connection by directly adding the input of the convolutional layer to its output, thus ensuring better gradient transfer during backpropagation and reducing the number of model parameters [46]. The mapping F(x) that needs to be learned can be written in the form of a "residual", as follows:

F(x) = H(x) − x

During the training process, if the gradient becomes very small as the neural network becomes deeper, i.e., the "degeneration phenomenon" appears, the model can directly add the output and input to achieve an identity mapping [47]. The advantage of this is that, even if the network is deep, effective feature extraction is guaranteed at each layer. In this way, the performance of the network can be maintained as the network deepens, avoiding the degradation problem that occurs in traditional neural networks, as shown in Figure 9.
x in Figure 9 represents the input, H(x) represents the output obtained after a series of transformations in the network, and F(x) represents the residual, which is the difference between the network output H(x) and the input x. During the training process, the network automatically learns a set of appropriate weights so that F(x) converges to 0. The introduction of the residual concept allows the network to better learn the mapping relationship between the input and the output, while reducing the risk of gradient vanishing or explosion, thus improving the performance and generalization of the model.
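The skip connection can be illustrated with a toy numpy computation: when the learned residual collapses to zero, the block reduces to the identity and the signal passes through unchanged. The scalar `weight` is a stand-in for the convolutional transform, purely for illustration:

```python
import numpy as np

# Sketch of the ResBlock skip connection: the block outputs
# H(x) = F(x) + x, so if the learned residual F(x) collapses to 0
# the block degenerates to the identity and the signal (and its
# gradient) still passes through unchanged.

def res_block(x, weight):
    fx = weight * x          # stand-in for the conv transform F(x)
    return fx + x            # skip connection: H(x) = F(x) + x

x = np.array([1.0, -2.0, 3.0])
print(res_block(x, 0.0))     # residual ~ 0 -> prints the input unchanged
```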

CBAM Module
In addition, since we wanted the feature processing to focus on the direction of the pupil in the eye, the Convolutional Block Attention Module (CBAM) was inserted after the ResBlock module. The CBAM module can improve the performance of convolutional neural networks for eye-tracking direction estimation by better learning and representing specific image features through the attention mechanism [48]. The overall process of the CBAM module can be divided into two parts, as shown in Figure 10.

Channel attention is mainly used to capture the correlation between different channels [49]. This module first obtained two 1 × 1 × C channel descriptors through global average pooling and maximum pooling operations. Each descriptor was then mapped and activated by a two-layer neural network, with the number of neurons in the first layer being C/r, the activation function of the first layer being ReLU, and the number of neurons in the second layer being C. The two resulting features were summed and passed through the Sigmoid activation function to obtain the weight coefficients, M_C, which were multiplied with the input feature mapping to obtain the output with enhanced channel feature representation. For an H × W × C input feature F, the output feature formula is:

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^C)) + W_1(W_0(F_max^C)))    (3)

In Equation (3), σ denotes the Sigmoid function, MLP denotes the multilayer perceptron, AvgPool denotes the average pooling layer, MaxPool denotes the maximum pooling layer, W_0 and W_1 denote the two weight matrices, F_avg^C and F_max^C denote the results of the input data after the AvgPool and MaxPool operations, and the superscript C is Channel, indicating the average pooling and maximum pooling operations in the channel dimension.
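A minimal numpy sketch of the channel-attention computation in Equation (3), with random placeholder weights (W_0, W_1 shapes follow the C -> C/r -> C description; the reduction ratio r = 4 and C = 8 are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Channel attention per Equation (3): a shared two-layer MLP
# (C -> C/r -> C, ReLU hidden layer) applied to both pooled
# descriptors, summed, then squashed by the sigmoid.

def channel_attention(F, W0, W1):
    """F: feature map of shape (H, W, C); returns M_C of shape (C,)."""
    f_avg = F.mean(axis=(0, 1))          # AvgPool over H, W -> (C,)
    f_max = F.max(axis=(0, 1))           # MaxPool over H, W -> (C,)
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)   # ReLU hidden layer
    return sigmoid(mlp(f_avg) + mlp(f_max))

rng = np.random.default_rng(0)
C, r = 8, 4                              # illustrative sizes
W0 = rng.normal(size=(C // r, C))        # C -> C/r
W1 = rng.normal(size=(C, C // r))        # C/r -> C
F = rng.normal(size=(5, 5, C))
M_C = channel_attention(F, W0, W1)
weighted = F * M_C                       # broadcast over channels
print(M_C.shape, weighted.shape)         # (8,) (5, 5, 8)
```

Each channel of the input is scaled by a weight in (0, 1), emphasizing the channels the gate deems informative.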
Spatial attention is mainly used to capture the correlation between different locations on the feature map [50]. This module first performed average pooling and maximum pooling along the channel dimension to obtain two H × W × 1 maps, which were concatenated and passed through a 7 × 7 convolutional layer with the Sigmoid activation function to obtain the weight coefficients, M_S. This weight was multiplied element-wise with the input feature mapping to obtain the output of the enhanced spatial feature representation [51]. For an H × W × C input feature F, the output feature formula is:

M_S(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)])) = σ(f^(7×7)([F_avg^S; F_max^S]))    (4)

In Equation (4), σ denotes the Sigmoid function, AvgPool denotes the average pooling layer, MaxPool denotes the maximum pooling layer, f^(7×7) denotes a convolution kernel of size 7 × 7, F_avg^S and F_max^S denote the results of the input data after the AvgPool and MaxPool operations, and the superscript S is Spatial, indicating the average pooling and maximum pooling operations in the spatial dimension.
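A numpy sketch of the spatial-attention computation in Equation (4). The 7 × 7 kernel is a random placeholder and the convolution is a naive same-padded loop for clarity, not an efficient implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Spatial attention per Equation (4): channel-wise average and max
# pooling give two H x W maps, which are stacked and passed through
# a single same-padded 7 x 7 convolution, then a sigmoid.

def spatial_attention(F, kernel):
    """F: (H, W, C); kernel: (7, 7, 2); returns M_S of shape (H, W)."""
    pooled = np.stack([F.mean(axis=2), F.max(axis=2)], axis=2)  # (H, W, 2)
    H, W, _ = pooled.shape
    pad = np.pad(pooled, ((3, 3), (3, 3), (0, 0)))   # same padding
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(pad[i:i + 7, j:j + 7, :] * kernel)
    return sigmoid(out)

rng = np.random.default_rng(0)
F = rng.normal(size=(10, 12, 8))
M_S = spatial_attention(F, rng.normal(size=(7, 7, 2)) * 0.1)
print(M_S.shape)  # (10, 12)
```

The resulting H × W map re-weights each spatial location of the feature map, here intended to highlight the pupil region.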

Fully Connected Layer
The classification estimation task was implemented by a fully connected layer that integrates the local features after the convolution operation through a weight matrix. In the eye-tracking direction classification task of this paper, it was necessary to determine whether the eye is looking forward, left, or right; this is, therefore, a triple-classification problem. The output of the feature extraction network in this paper was 2 × 2 × 4, and the fully connected layer had 3 neurons. First, the output of the feature extraction network was flattened into a one-dimensional column vector: x = [x_1, x_2, ..., x_15, x_16]^T; then, for each neuron in the fully connected layer, Z = [Z_1, Z_2, Z_3]^T, a linear operation was performed with each element in x. In the forward-propagation process, the fully connected layer can be viewed as a linear weighted summation: each node in the previous layer was multiplied by a weighting factor, w, and a bias, b, was added to obtain the corresponding output, z, on the fully connected layer. The computational process of the fully connected layer can be expressed using Equation (5):

Z_j = Σ_{i=1}^{16} w_{ji} x_i + b_j,  j = 1, 2, 3  (5)

The output was normalized using the SoftMax function, which maps a vector into a probability distribution, where each element is a non-negative number and the sum of all elements is 1. Thus, for a multiclassification problem with n classes, the SoftMax function can convert an n-dimensional vector into a probability distribution, where each element represents the predicted probability of that class. The neurons Z = [Z_1, Z_2, Z_3]^T on the fully connected layer were normalized to y = [y_1, y_2, y_3]^T by the SoftMax function with the constraint y_1 + y_2 + y_3 = 1. The transformation relation between y_i and Z_j is:

y_i = e^{Z_i} / Σ_{j=1}^{3} e^{Z_j}
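The flatten, linear, and SoftMax steps above can be sketched in a few lines. The weights `W` and bias `b` below are random placeholders, not the trained GazeNet parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def fc_forward(feature_map, W, b):
    """Flatten a 2 x 2 x 4 feature map and apply a 3-neuron fully connected layer."""
    x = feature_map.reshape(-1)    # 16-element vector, as in Equation (5)
    Z = W @ x + b                  # Z_j = sum_i w_ji * x_i + b_j
    return softmax(Z)              # y = [y1, y2, y3] with y1 + y2 + y3 = 1

rng = np.random.default_rng(1)
feature_map = rng.standard_normal((2, 2, 4))
W = rng.standard_normal((3, 16)) * 0.1     # hypothetical weights (3 classes x 16 inputs)
b = np.zeros(3)
y = fc_forward(feature_map, W, b)          # predicted probabilities: forward/left/right
```

The output `y` is non-negative and sums to 1, matching the probability-distribution constraint stated above.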

Cross-Entropy Loss Function
During the training process, we wanted to make the output probability distribution of the model as close as possible to the true probability distribution, so we needed to design a suitable loss function to measure the difference between them. The cross-entropy function can effectively measure the difference between two probability distributions and is, therefore, widely used in classification problems [52]. In particular, if, for a sample x with a true label of y, the predicted output of the SoftMax model is ŷ, then the cross-entropy loss function can be expressed as:

Loss(y, ŷ) = −Σ_i y_i log(ŷ_i)

This loss function can be regarded as the KL divergence (Kullback-Leibler divergence) between the true and predicted labels [53]. The cross-entropy loss function reaches its minimum value of 0 when the output probability distribution of the model is exactly the same as the true probability distribution; as the difference between them increases, the value of the cross-entropy loss function also increases. Since the SoftMax function maps the ŷ_i values to between 0 and 1, and according to the constraint Σ_i y_i = 1, it can be deduced that, when y_i = 1, the loss function is:

Loss_i(y, ŷ) = −log(ŷ_i)

Differentiating Loss_i(y, ŷ) with respect to the logits Z_j gives ∂Loss_i/∂Z_j = ŷ_j − y_j. From this derivation, it can be seen that combining the cross-entropy loss function with SoftMax for the triple-classification task makes it very easy to calculate the gradient in the backpropagation: the gradient of the backward update is obtained by simply taking the ŷ_i − 1 calculated in the forward pass for the true class.
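The ŷ − y gradient identity above is easy to verify numerically. The sketch below compares the analytic SoftMax-plus-cross-entropy gradient against central finite differences on hypothetical logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred))

Z = np.array([1.2, -0.3, 0.4])        # hypothetical logits from the FC layer
y_true = np.array([0.0, 1.0, 0.0])    # one-hot label: true class is the second direction

analytic = softmax(Z) - y_true        # dLoss/dZ = y_hat - y

numeric = np.zeros(3)                 # central finite differences as a check
eps = 1e-6
for j in range(3):
    zp, zm = Z.copy(), Z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cross_entropy(y_true, softmax(zp)) -
                  cross_entropy(y_true, softmax(zm))) / (2 * eps)
```

For the true class, the analytic gradient is exactly ŷ_i − 1, which is why backpropagation through this pairing is so cheap.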

The Design of the Eye-Tracking Wheelchair Control System
In the previous section, the eye-tracking recognition algorithm was investigated. A complete eye-tracking wheelchair control system design solution should also include data acquisition, data processing, motion control, human-machine interaction, and system optimization. The physical diagram and hardware data diagram of the wheelchair designed in this paper are shown in Figure 11.

According to the role of hardware in the system, it can be divided into three sections: the data acquisition section, the data processing section, and the motion control section. The data acquisition section consists of the Gook HD98 HD camera, with a resolution of 1920 × 1080, and a 10-inch touchscreen. The camera is responsible for capturing face images, and the touchscreen displays eye-tracking information in real time. The data processing section consists of the Jetson TX2, which has 256 CUDA cores and up to 8 GB of memory. Its computing power is comparable to that of a desktop-class GTX 750 graphics card, and it can easily handle the computational task of eye-tracking recognition. The motion control section is composed of an Arduino and MG995 servos. The Arduino receives the eye-tracking signal and drives the MG995 servos to control the rocker and change the wheelchair motion.
Common power wheelchair modifications are adjustments to the hardware portion of an existing wheelchair. Changes to the wheelchair motion can be achieved by simply controlling the rocker during the wheelchair motion. Therefore, in this paper, a mechanical structure (as shown in Figure 12) was installed on the wheelchair rocker controller. The structure consists of a base, a servo bracket, a control rod extension, and a control arm. The two servos are the x-axis and y-axis servos, which cooperate with each other, turning in different directions according to the control command to actuate the electric wheelchair rocker and thereby control the direction of wheelchair movement.
In the upper computer, the TX2, we designed the intelligent eye-tracking wheelchair control platform with PyQt5. The platform contains a data acquisition section and a wheelchair control section. The data acquisition section collects face and human eye images in real time. The wheelchair control section contains a wheelchair start button, a wheelchair motion direction display, camera selection, and a face count display. The platform makes it easier to activate the wheelchair and monitor its movement. The interactive interface is shown in Figure 13.

Motion Control Optimization
In the process of wheelchair steering, the motion path of the rocker has an important impact on the acceleration change in the system because of the non-linear relationship between the left and right wheel speeds and the rocker angle, as shown in Figure 14. In order to make the wheelchair motion change smoothly and reduce the risk of vibration, and even rollover, caused by sudden changes in acceleration, this section presents our research on rocker trajectory-tracking control.
In order to establish the rocker control model, we introduced a polar coordinate system in the rocker motion plane, specifying the distance from the rocker to the reset position as r, and the angle between the projection line of the rocker on the plane and the positive left direction as θ. The coordinate diagram is shown in Figure 15. Since the speed of the left and right wheels is symmetrical with respect to the change in the rocker angle, and the speed of the right wheel is almost constant in the interval of 0°-90°, it is only necessary to analyze the velocity change curve of the left wheel at 0°-90°. The acceleration, a, of the left wheel is related to the velocity, V_w, and the angle, θ, of the rocker in polar coordinates, as follows:

a = dV_w/dt = (dV_w/dθ) · (dθ/dt)  (11)

In the range of 0°-90°, dV_w/dθ continuously decreased as θ increased.
In order to make the acceleration a change more smoothly, it was necessary to make dθ/dt continuously increase as θ increased. Accordingly, the θ-t diagram in Figure 16 was drawn. From

V_R^2 = (dr/dt)^2 + r^2 · ω^2(t)  (12)

it was obtained that:

dr/dt = sqrt(V_R^2 − r^2 · ω^2(t))  (13)

where ω(t) = dθ/dt is the rocker angular velocity and V_R is the rocker velocity, which can be considered constant in magnitude. Then, r(t) could be obtained according to Equation (13) with the derived θ(t), and the rocker motion trajectory could be determined accordingly.
To simplify the analysis, V w (θ) as well as θ(t) can be expressed as Equations (14) and (15): where a 1 , b 1 , and T e are known, T e denotes the time required for the rocker to move from the starting position to the end position, and θ(T e ) = π 2 .
Then, substituting Equations (14) and (15) into (11) yields Equation (16), which simplifies to Equation (17), subject to the constraints of Equation (18). Substituting Equation (17) into (18) yields Equation (19), where T_e indicates the time taken for the rocker to move from 0° to 90°. According to the characteristics of the cubic function, it is necessary to satisfy t_3 > T_e in order to make the acceleration change more smoothly in 0-T_e. Matlab can then be used to plot the curve of a against time t as t_3 changes. From Figure 17, it can be seen that the curve of a against t tended to flatten out as t_3 increased from T_e, and the maximum value tended to 1.5V_e T_e. Accordingly, θ(t) and r(t) under the target conditions could be derived, and the rocker motion trajectory could then be derived from them.
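The effect described above can be illustrated numerically. The sketch below uses Equation (11) with an illustrative speed profile and rocker schedule, not the paper's fitted Equations (14) and (15): `Vw` is a hypothetical left-wheel speed whose slope dV_w/dθ decreases over 0-90°, and the smooth cubic θ(t) is a stand-in schedule whose dθ/dt starts at zero, which removes the acceleration jump that a constant-rate rocker sweep produces at t = 0.

```python
import numpy as np

Ve, Te = 1.0, 2.0                      # hypothetical end speed (m/s) and sweep time (s)
Vw = lambda th: Ve * np.sin(th)        # illustrative V_w(theta); dV_w/dtheta decreases
dVw = lambda th: Ve * np.cos(th)       # dV_w/dtheta on 0-90 degrees

t = np.linspace(0.0, Te, 1001)
s = t / Te

# Linear schedule: theta sweeps at a constant rate, so dtheta/dt is constant.
theta_lin = (np.pi / 2) * s
w_lin = np.full_like(t, (np.pi / 2) / Te)

# Smooth cubic schedule: dtheta/dt is zero at t = 0 and grows while dV_w/dtheta is large.
theta_cub = (np.pi / 2) * (3 * s**2 - 2 * s**3)
w_cub = (np.pi / 2) * (6 * s - 6 * s**2) / Te

# Acceleration a = dV_w/dtheta * dtheta/dt, per Equation (11).
a_lin = dVw(theta_lin) * w_lin
a_cub = dVw(theta_cub) * w_cub
```

With the linear sweep, the wheelchair jumps from rest to its peak acceleration instantly; with the cubic schedule, acceleration starts from zero and varies smoothly, which is the behavior targeted by the optimization above.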

Blink Detection
The face-detection algorithm using the Dlib library with OpenCV image processing to obtain periocular feature point data was described in Section 3.2.2, and the eye-tracking estimation model was built in Section 3. Taking the left eye as an example, the outline of the eye was first located in the eye image, and points p_1-p_6 were used to represent the key points of the eye, as shown in Figure 18. In actual use, the system needs to detect whether the human eye is open before the line-of-sight estimation. The eye opening and closing states can be determined by calculating the eye aspect ratio (EAR) [54] in real time and comparing it with a set threshold; there is no need to enter the line-of-sight estimation procedure if the eye state does not meet the EAR threshold.

The respective EAR values of the left and right eyes were calculated, and finally, the average EAR value of the two eyes was obtained. When the eyes were open, the EAR maintained an approximately constant value; when the eyes were closed, the EAR value tended to 0. If the calculated EAR value was less than the set threshold, the frame was determined to be a blink action. The EAR calculation formula is shown below:

EAR = (‖p_2 − p_6‖ + ‖p_3 − p_5‖) / (2‖p_1 − p_4‖)
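The EAR computation and threshold test can be sketched directly from the formula above. The landmark coordinates and the 0.2 threshold below are hypothetical illustrations, not the paper's calibrated values:

```python
import numpy as np

def eye_aspect_ratio(pts):
    """EAR from six eye landmarks p1..p6, where p1 and p4 are the horizontal corners."""
    p1, p2, p3, p4, p5, p6 = (np.asarray(p, dtype=float) for p in pts)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def is_blinking(left_pts, right_pts, threshold=0.2):
    """Average the two eyes' EAR and compare against an assumed threshold."""
    ear = (eye_aspect_ratio(left_pts) + eye_aspect_ratio(right_pts)) / 2.0
    return ear < threshold

# Hypothetical pixel coordinates for an open eye and a nearly closed eye
open_eye = [(0, 5), (3, 8), (6, 8), (10, 5), (6, 2), (3, 2)]
closed_eye = [(0, 5), (3, 5.3), (6, 5.3), (10, 5), (6, 4.7), (3, 4.7)]
```

As the eyelids close, the vertical distances collapse while the corner-to-corner distance stays fixed, so the EAR drops toward 0 and the frame is flagged as a blink.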

Saccades Processing
In addition to the impact that blinking can have on wheelchair control, saccadic gaze behavior caused by unexpected events during use may also affect the control of the wheelchair and the safety of the user. If the user rapidly changes the direction of gaze in a short period of time, the wheelchair may continuously receive movement commands in different directions, causing it to swing from side to side. To reduce this risk, following the majority principle, we used any result that occurred four or more times out of every eight classification results as the motion control command; if no result reached this majority among the eight classification results, the wheelchair motion stopped. Since the network model is capable of generating approximately 16 classification results per second, this treatment had little impact on wheelchair maneuverability.
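The majority rule above amounts to a small voting filter over a sliding window of eight classifications. A minimal sketch (the label strings and the `"stop"` fallback name are illustrative):

```python
from collections import Counter

def majority_command(window, quorum=4):
    """Return the direction occurring `quorum`+ times in an 8-result window,
    or 'stop' when no direction dominates (saccade suppression)."""
    label, count = Counter(window).most_common(1)[0]
    return label if count >= quorum else "stop"

# Steady gaze: a clear majority passes through as the motion command
steady = ["left", "left", "forward", "left", "left", "left", "right", "left"]
# Saccade: rapidly alternating gaze yields no majority, so the wheelchair stops
saccade = ["left", "right", "forward", "right", "left", "forward", "left", "right"]
```

At roughly 16 classifications per second, each eight-sample window spans about half a second, so the filter suppresses saccades while adding little control latency.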

System Flow Chart
Having analyzed the workflow of the eye-tracking wheelchair in the previous sections, we present the system flowchart of the eye-tracking wheelchair in Figure 19. The control program begins in the host computer, which automatically turns on the camera in front of the wheelchair; the camera then starts to extract facial features and collect eye-movement data. The Jetson TX2 calculates the EAR of each frame of the acquired eye-movement data. If the EAR does not exceed the threshold, the eye-movement data are reacquired; if the threshold is exceeded, the frame is fed into the GazeNet network model to calculate the eye-tracking direction, and the classification result is output to the motion controller, the Arduino, which manipulates the rocker to control the wheelchair movement based on the eye-movement signal. If the wheelchair is turned off at this point, the system stops working; otherwise, it returns to the eye-movement data acquisition process and continues the calculation and judgment of the eye-movement EAR, and so on.
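The per-frame decision logic of this flowchart can be sketched with the camera, the GazeNet model, and the Arduino link stubbed out as injected callables. All function names here are hypothetical placeholders for the components described above:

```python
def control_step(frame, ear_fn, classify_fn, send_fn, ear_threshold=0.2):
    """Process one frame: skip closed-eye frames, otherwise classify the gaze
    direction and forward the command to the motion controller."""
    if ear_fn(frame) < ear_threshold:   # eyes closed / blinking: reacquire data
        return None
    direction = classify_fn(frame)      # 'forward', 'left', or 'right'
    send_fn(direction)                  # e.g. a serial write to the Arduino
    return direction

# Stubbed example run (no camera or serial hardware required)
sent = []
cmd = control_step(
    frame="dummy-frame",
    ear_fn=lambda f: 0.35,              # open eye: above the assumed threshold
    classify_fn=lambda f: "forward",
    send_fn=sent.append,
)
```

Structuring the loop this way keeps the EAR gate, the classifier, and the motion output independently testable, mirroring the three hardware sections of the system.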

Experiments and Results Analysis
In the previous sections, we completed the establishment of the model and the construction of the system. In order to verify the practicality and reliability of the eye-tracking wheelchair, in this section, we outline the test experiments conducted on the accuracy of the model recognition results and the accuracy of the wheelchair control. The first experiment was a comparative evaluation of the GazeNet proposed in this paper against three existing models, namely AlexNet, ResNet18, and MobileNet-V2, to demonstrate the superiority of our model in eye-tracking recognition. The second experiment quantified the wheelchair control accuracy by measuring the deviation of the actual wheelchair motion trajectory from the target path, and the third experiment tested the effect of the motion control optimization of Section 4.2 by measuring the Arduino output PWM signal.


Hyperparameter Optimization
In the training and testing of the GazeNet network, each image in X was subjected to the preprocessing operation described in Section 3.2.2 to obtain X̄, which was named the multi-environment attentional gaze (MEAGaze) dataset. The 135,000 binocular images in X̄ were divided into a training set, validation set, and test set in the ratio of 98:1:1 and put into the network for training. The initial parameters were set with random initialization, the learning rate was 0.001, the optimizer used was SGD (stochastic gradient descent) [55], the loss function used was cross-entropy, and accuracy was used as an important evaluation index to measure the model performance. The accuracy rate was calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. As shown in Figure 20, a total of 30 rounds were trained, and after 18 epochs of training, the accuracy curve gradually leveled off and no longer significantly improved, which indicated that the training of the classification network was complete. In this paper, we chose to save the weight parameters of the model at the 29th round, when its accuracy rate was 0.98494.
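The accuracy metric above is straightforward to compute from a confusion tally. A minimal sketch with made-up labels for the three gaze directions (the label strings are illustrative only):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def counts_from_predictions(y_true, y_pred, positive):
    """Tally confusion counts for one class treated as 'positive'."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Toy check: six made-up ground-truth labels vs. predictions
y_true = ["forward", "left", "right", "left", "forward", "right"]
y_pred = ["forward", "left", "right", "right", "forward", "right"]
acc = accuracy(*counts_from_predictions(y_true, y_pred, positive="left"))
```

For a multi-class problem, such per-class counts can be averaged across the three directions to get an overall figure.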

Assessment Measures and Methods
The GazeNet network described above was trained on the MEAGaze dataset using the same training set as the widely used AlexNet [56], ResNet18 [57], and MobileNet-V2 [58] models, and all were tested on the same test set. The test results are shown in Table 1 and Figure 21. The GazeNet network in this paper achieved an accuracy of 98.49% with the smallest number of parameters, only 125,749, which was significantly better than the other three models. The comparison showed that the GazeNet network was about 1 percentage point more accurate than the ResNet18 model, which had the second-highest accuracy rate, while GazeNet had only 22.3% of the number of parameters of the AlexNet model, which had the second-lowest parameter count. The GazeNet network combines the minimum number of parameters with excellent accuracy, which makes it an ideal lightweight model that can meet the needs of both high accuracy and low resource consumption.


Reliability Analysis of Eye-Tracking Wheelchair Control
The comparative experiments in the previous section demonstrated the high recognition rate of the eye-tracking model used in this paper. In order to verify the reliability and control accuracy of the eye-tracking wheelchair, we launched experiments related to the eye-tracking wheelchair in specific scenarios.
The experimental route diagram is shown in Figure 22. The solid part represents the target path of the wheelchair center, and the dashed part is the target path of both wheels. Since the width of the wheelchair that we used was 60 cm, the dashed path spacing was 60 cm. An experiment designed in this paper consisted of two laps of the line: the yellow line in Figure 22a is the target path of the first lap, the blue line in Figure 22c is the target path of the second lap, and Figure 22b is the real target line diagram combining the two, where the green line represents the overlapping part of the two laps of the line. The start and end points of both laps are the red points in Figure 22. The experimental design allowed the wheelchair to complete an experiment with an equal number of left and right turns, both eight times, making the experimental data more meaningful.
To make the experimental data generalizable, we recruited 20 volunteers aged 20-35 years to participate in the experiment. These 20 volunteers were 50% men and 50% women and wore glasses. Each volunteer was trained to operate the wheelchair while keeping the head still to complete the experiment, during which the speed of the wheelchair was set to 1 m/s. We installed a laser distance measurement module on each side of the wheelchair, as shown in Figure 23, which had a range of up to 80 m, a measurement accuracy of 1.0 mm, and a measurement frequency of 20 Hz. It measures and records the distance to the wall in real time, and the deviation of the wheelchair center from the target path is obtained from the distance to the wall. The deviation values against distance traveled from each experiment were totaled and divided by the total number of participants, and a graph of deviation against distance traveled was obtained, as shown in Figure 24. The fluctuating parts of this figure are the deviations of the wheelchair during the turns, and the maximum deviation appeared at 8 m of the second lap, which was 6.76 cm.
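The averaging procedure above can be sketched as follows. The deviation curves here are synthetic stand-ins for the 20 volunteers' laser measurements, used only to show the shape of the computation:

```python
import numpy as np

# Hypothetical per-volunteer deviation curves: rows = volunteers, columns = sampled
# positions along the path. Real curves would come from the 20 Hz laser module.
rng = np.random.default_rng(2)
distance = np.linspace(0.0, 40.0, 401)            # metres along the target path
runs = 0.02 * np.abs(np.sin(distance / 3.0)) + rng.normal(0, 0.002, (20, 401))

mean_deviation = np.abs(runs).mean(axis=0)        # average over the 20 volunteers
worst = distance[np.argmax(mean_deviation)]       # where the largest deviation occurs
```

Averaging the per-run curves pointwise, then locating the maximum, reproduces the kind of summary shown in Figure 24 (peak deviation and the distance at which it occurs).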

It can be seen that the deviation during the wheelchair movement was small and the accuracy of the model and control was high, so the eye-controlled wheelchair that we designed has high practicality.
In Section 4.2, we optimized the rocker control. To verify the effect of this work on the stability of the wheelchair control, we had each experimenter sit on the motorized wheelchair, disconnected the motor power, and connected the Arduino output pins to an oscilloscope. The experimenter simulated the eye-movement states of the previous experiment while keeping the head still. The average curve of the servo duty cycle over time was obtained by superimposing the duty cycle curves of the 20 volunteers and dividing by the total number of participants, as shown in Figure 25.
As can be seen from Figure 25, the servo control signal was smooth and free of noise, which also indicates the high accuracy of the neural network's three-class output and the absence of jumping between output categories.
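The kind of smoothing optimized in Section 4.2 can be illustrated with a simple slew-rate limiter: the servo duty cycle is allowed to change by at most a fixed step per control tick, so an abrupt change in the gaze classifier's output cannot produce a sudden jump in the control signal. This is a hedged sketch of the general technique, not the paper's exact acceleration model; the duty-cycle values and step size are illustrative.

```python
def slew_limit(target, current, max_step):
    """Move `current` toward `target`, changing by at most `max_step` per call."""
    delta = target - current
    if delta > max_step:
        delta = max_step
    elif delta < -max_step:
        delta = -max_step
    return current + delta

# Simulate a sudden forward command: target duty jumps from 5.0 to 7.5.
# The limited signal ramps up in 0.5 steps instead of jumping instantly.
duty = 5.0
trace = []
for _ in range(10):
    duty = slew_limit(7.5, duty, max_step=0.5)
    trace.append(duty)
print(trace[:5])  # → [5.5, 6.0, 6.5, 7.0, 7.5]
```

Run at the controller's tick rate, a limiter like this yields the gradual, noise-free duty-cycle curves of the kind shown in Figure 25.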

Conclusions
We collected an eye-movement dataset with 135,000 annotated images from virtual and real scenes and proposed the GazeNet eye-movement neural network model based on this three-category dataset. The model comparison experiments showed that the GazeNet model proposed in this paper converged faster and achieved higher accuracy than the other three models. For wheelchair control, we used a 2D steering gear to actuate the joystick and optimized the steering gear control signal. The subsequent Arduino output waveform experiment showed that the steering gear control signal was smooth and gentle, confirming the optimization. In addition, the experiment also showed that the model's three-class output was accurate, with no jumping between eye-movement categories. In the wheelchair control reliability analysis, we measured the deviation between the target and actual trajectories during movement using laser ranging and concluded that the accuracy of the motion control and the eye-movement model was high.
However, the current motion control part is relatively complicated, and its room for further optimization is limited. If the Jetson TX2 signal could be output directly to control the wheelchair, motion control would be simpler and more reliable. In addition, the eye-movement dataset was collected in a specific scene and does not fully cover daily-life scenarios such as crossing the road or rainy days. After the experiment, many volunteers reported that continuous eye-movement control increased the burden on users, which reduced control accuracy to a certain extent and increased user risk. Considering that the current wheelchair control scheme lacks adaptability to the environment, we expect to add visual SLAM and path planning [59] to the control module in follow-up work, so that the user can direct their gaze freely while the wheelchair autonomously perceives the surrounding environment and adjusts its path to reach the destination. For safety reasons, we plan to add a positioning system to the wheelchair so that, while the user operates it, family members can learn the user's location remotely through a mobile phone application.
Author Contributions: J.X., writing, review and editing, funding acquisition; Z.H., methodology, supervision; L.L., software, formal analysis; X.L., visualization; K.W., data curation. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.