FER-PCVT: Facial Expression Recognition with Patch-Convolutional Vision Transformer for Stroke Patients

Early rehabilitation with the right intensity contributes to the physical recovery of stroke survivors. In clinical practice, physicians determine whether the training intensity is suitable for rehabilitation based on patients’ narratives, training scores, and evaluation scales, which puts tremendous pressure on medical resources. In this study, a lightweight facial expression recognition algorithm is proposed to diagnose stroke patients’ training motivations automatically. First, the properties of convolution are introduced into the Vision Transformer’s structure, allowing the model to extract both local and global features of facial expressions. Second, the pyramid-shaped feature output mode in Convolutional Neural Networks is also introduced to reduce the model’s parameters and calculation costs significantly. Moreover, a classifier that can better classify facial expressions of stroke patients is designed to improve performance further. We verified the proposed algorithm on the Real-world Affective Faces Database (RAF-DB), the Face Expression Recognition Plus Dataset (FER+), and a private dataset for stroke patients. Experiments show that the backbone network of the proposed algorithm achieves better performance than Pyramid Vision Transformer (PvT) and Convolutional Vision Transformer (CvT) with fewer parameters and Floating-point Operations Per Second (FLOPs). In addition, the algorithm reaches an 89.44% accuracy on the RAF-DB dataset, which is higher than other recent studies. In particular, it obtains an accuracy of 99.81% on the private dataset, with only 4.10M parameters.


Introduction
The incidence, mortality, and disability rates of stroke in China have been higher than those in developed countries such as the United Kingdom, the United States, and Japan over the past 15 years [1]. Most stroke survivors cannot live normally because they suffer from sequelae such as hemiplegia, limb numbness, swallowing disorders, and depression. Brain neurobiology suggests that early training, at the right intensity, aids recovery [2]. However, physicians need to be aware of patients' feelings in real time during early rehabilitation to determine whether the training matches their physical recovery, and then tailor the most rehabilitation-friendly training for each patient. This manual monitoring mode in clinical practice places an enormous burden on medical resources and urgently needs to be improved and optimized.
Deep learning, as one of the powerful medical assistance technologies, has been widely applied in the medical field [3]. These applications include but are not limited to automatic

• The proposed algorithm effectively combines the local perception ability of CNNs with the advantage of ViT in extracting global features, which enables the algorithm to achieve the highest accuracy on the RAF-DB dataset.
• We treat emotion features as the weighted sum of neutral and V-A-like emotion features at different scales and design a dedicated classifier; experiments verify that it extracts more detailed facial emotion information of stroke patients for classification.

Data Sources and Data Preprocessing
Three datasets are used in this study: (1) two public datasets of healthy people, RAF-DB [21] and FER+ [36]; and (2) a private dataset of stroke patients. Table 1 describes the sample properties of the three datasets in detail.
• RAF-DB dataset. The Real-world Affective Faces Database (RAF-DB) contains a single-label subset with 15,339 images divided into seven basic emotional classes: happy, sad, surprised, angry, fearful, disgusted, and neutral. The samples vary widely in subjects' age, ethnicity, head pose, lighting conditions, occlusions (e.g., glasses, facial hair, or self-occlusion), and post-processing operations (e.g., various filters and effects) [21]. This diversity gives trained models better generalization.
• FER+ dataset. The Face Expression Recognition Plus dataset (FER+) contains 35,887 images of size 48 × 48 divided into 10 emotion classes. Only 21,161 images across 8 emotions are used in this experiment: happy, sad, surprised, angry, fearful, disgusted, neutral, and contempt.

• Private dataset. The inclusion criteria were as follows: (1) patients aged 18-85 years old; (2) diagnosed with stroke confirmed by computed tomography (CT) and/or magnetic resonance imaging (MRI); (3) ≥2 weeks post-stroke; (4) the upper limb of the healthy or affected side can use the upper limb rehabilitation robot for training; (5) patients signed the informed consent. The exclusion criteria were: (1) patients with unstable cerebrovascular disease; (2) patients with sensory aphasia or motor aphasia, and those unable to cooperate with assessment and testing; (3) Montreal Cognitive Assessment (MoCA) score ≤ 25; (4) patients with severe organ dysfunction or with malignant tumors; (5) House-Brackmann (H-B) grade ≥ III.
There were 42 participants in the experiment: 37 patients with confirmed stroke (25 men and 12 women, aged 31-87 years) from the Shanghai Third Rehabilitation Hospital, and 5 healthy controls (4 physicians and 1 student). All subjects signed an informed consent form before the experiment.
In this study, four basic emotions (happy, sad, surprised, and angry) were used as biomarkers to assess the patient's concentration, and four special emotions (painful, strained, tired, and neutral) were used as biomarkers to determine whether the current training intensity is suitable for the patient. Emotional videos were collected under two schemes. First, we guided patients to express the four basic emotions through videos and pictures. Second, the four special emotions were collected while patients were training with the upper limb rehabilitation robot; in addition, we asked patients to repeatedly lift the upper extremity and gradually increase the range of motion to capture the desired emotions. Each patient participated in the collection of at least two emotions, which ensured that each subject's sample had both positive and negative labels.
After collecting the emotional videos, data preprocessing is an indispensable step, mainly comprising sampling images, correcting faces, and labeling samples. DB Face [37], a face detection algorithm, was used to automatically predict the anchor boxes of faces and the corresponding confidence scores in the emotional videos. We then removed low-confidence and incomplete face images from the numerous video slices containing facial expressions. The preserved facial images were rotated so that the line connecting the eye feature points detected by DB Face became horizontal, with the midpoint of the line as the center of rotation. The line's rotation angle θ is calculated by Equation (1), the transformation matrix M of all pixels in the original image is defined by Equation (2), and the coordinates of all original pixels are mapped to the corrected coordinates by Equation (3):

θ = arctan((y_r − y_l) / (x_r − x_l)) (1)

M = [ cos θ   sin θ   (1 − cos θ)·x_c − y_c·sin θ
     −sin θ   cos θ   (1 − cos θ)·y_c + x_c·sin θ ] (2)

[x′, y′]ᵀ = M · [x, y, 1]ᵀ (3)

where (x_l, y_l), (x_r, y_r), and (x_c, y_c) are the coordinates of the left eye, the right eye, and the midpoint of the line connecting the eyes in the original image, respectively, and (x′, y′) is the corrected coordinate. We labeled the face-aligned images using the Facial Action Coding System (FACS) [38]. First, the emotional label of each sample was initially determined based on the content of the corresponding emotional video. Then, the images were annotated again according to the FACS definitions of the eight expressions; Table 2 shows these definitions. Beyond the five expressions of happy, sad, angry, surprised, and neutral, the remaining expressions required for this experiment must be clearly defined by FACS. Referring to the PSPI [39], the FACS features of painful expressions include lowered brow (AU4), raised cheeks (AU6), tightened lid (AU7), wrinkled nose (AU9), raised upper lip (AU10), and closed eyes (AU43). By comparing the facial features corresponding to each AU, we defined the FACS features of strained expressions as lowered brow (AU4), raised cheeks (AU6), tightened lips (AU23), pressed lips (AU24), and sucked lips (AU28), and the FACS features of tired expressions as closed eyes (AU43) and downed head (AU54), as shown in Figure 1. After labeling and collation, the private dataset contains 1302 samples in 8 categories, with no sample crossover or duplicates. Some samples of the private dataset are shown in Figure 2.
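The alignment step above can be sketched as follows. This is a minimal illustration of Equations (1)-(3), assuming eye landmarks are already available from the detector; the function names are ours, not DB Face's API:

```python
import math
import numpy as np

def alignment_transform(left_eye, right_eye):
    """Build the 2x3 matrix M (Equations (1)-(2)) that rotates the image
    about the eye-line midpoint so the line between the eyes is horizontal."""
    (xl, yl), (xr, yr) = left_eye, right_eye
    theta = math.atan2(yr - yl, xr - xl)          # Equation (1)
    xc, yc = (xl + xr) / 2.0, (yl + yr) / 2.0     # rotation center (x_c, y_c)
    c, s = math.cos(theta), math.sin(theta)
    # Rotation by -theta about (xc, yc), written as a 2x3 affine matrix
    return np.array([[ c,  s, (1 - c) * xc - s * yc],
                     [-s,  c, (1 - c) * yc + s * xc]])

def transform_point(M, x, y):
    """Equation (3): map an original pixel to its corrected coordinate."""
    return M @ np.array([x, y, 1.0])
```

After the transform, both eyes share the same vertical coordinate and the midpoint between them is unchanged, which is exactly the invariant the preprocessing step relies on.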

Model Building
In order to occupy fewer computing resources to identify eight facial expressions of stroke patients accurately, we propose a lightweight FER model shown in Figure 3, named the Facial Expression Recognition with Patch-Convolutional Vision Transformer (FER-PCVT). The FER-PCVT designed with ViT as the baseline mainly consists of three modules: the Convolutional Patch Embedding (CPE), the Pyramid Transformer (PTF), and the Valence-Arousal-Like Classifier (V-ALC). The first two modules combine to form the backbone network, Patch-Convolutional Vision Transformer (PCVT). The V-ALC is an expression classifier designed based on the Valence-Arousal (V-A) emotion theory [40].

Convolutional Patch Embedding
Compared with ViT's direct processing of raw pixel information with its transformer encoder, accuracy is further improved by first using CNNs to extract feature information from images and then processing it with the transformer encoder [35,41]. Based on this, the Convolutional Patch Embedding module is implemented as a pixel-to-sequence mapping that extracts the feature sequences fed into the Conv-TF Encoder of the Pyramid Transformer module. Specifically, the feature information extracted from the image by the convolutional and pooling layers is reduced to the patch size by the Block Scaling module. The Block Scaling module, consisting of two convolutional layers (size 2 × 2, stride 2, and size 1 × 1, stride 1), adjusts the dimensions of the feature maps entering the Conv-TF Encoder by varying its number of repetitions; that is, the length and width of the sequence are shortened to 1/2^r of the original size after the module is repeated r times. This way of introducing convolutions into ViT achieves the pixel-to-sequence feature mapping while preserving the position information between patches. The detailed structure is shown in Figure 4a.
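The dimension arithmetic of the Block Scaling module can be illustrated with a rough numpy sketch. This is not the trained module (weights here are random, and biases and normalization are omitted); it only shows that each repetition of the (2 × 2 stride-2 conv, 1 × 1 conv) pair halves the spatial size, giving 1/2^r after r repeats:

```python
import numpy as np

def conv2x2_stride2(x, w):
    """Valid 2x2 convolution with stride 2: halves height and width.
    x: (c_in, h, w); w: (c_out, c_in, 2, 2)."""
    c_out = w.shape[0]
    h, wd = x.shape[1] // 2, x.shape[2] // 2
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = x[:, 2 * i:2 * i + 2, 2 * j:2 * j + 2]
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def block_scaling(x, r, c_out):
    """Repeat the (2x2 s2 conv -> 1x1 conv) pair r times; the 1x1 conv
    mixes channels without changing the spatial size."""
    rng = np.random.default_rng(0)
    for _ in range(r):
        w2 = rng.standard_normal((c_out, x.shape[0], 2, 2)) * 0.1
        x = conv2x2_stride2(x, w2)
        w1 = rng.standard_normal((c_out, c_out)) * 0.1   # 1x1 conv = channel mix
        x = np.einsum('oc,chw->ohw', w1, x)
    return x
```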

Pyramid Transformer
ViT requires the input and output sequences of the transformer encoder to have the same dimensions, whereas the length of the sequences output by CNNs decreases as the network deepens. This pyramidal output mode of CNNs significantly reduces the computational cost and has been shown to be beneficial for extracting feature information at different scales [42]. Thus, the PTF introduces this output mode to reduce the storage, parameters, and GFLOPs required for computation. Details of the PTF are shown in Figure 4b. We use convolutional mapping instead of the linear mapping in the transformer encoder of ViT to extract the three feature matrices Q, K, and V, which are then fed into the Multi-Head Self-Attention to be given different weights. In the Feed-Forward module, a bottleneck structure is formed by two convolutional layers with output channels d_i/2 and d_i, respectively, which compresses the channel dimension of the model. The activation function GeLU between the two convolutional layers makes the model fit the data faster; its expression is Equation (4):

GeLU(x) = (x/2) · (1 + erf(x/√2)) (4)

where erf(·) is the Gauss error function. In the Block Combined Pooling module, d_0 convolution kernels (size 3 × 3) expand the channel dimension of the input feature map, followed by downsampling with a max-pooling window (size 3 × 3, stride 2). The module resizes feature maps from d_i × h × w to d_0 × h/2 × w/2, gradually reducing the feature output like a pyramid.
In addition, layer normalization constrains the outputs of the Conv-TF Encoder module and the Feed-Forward module to avoid vanishing gradients. The inputs and outputs of these two modules are connected by residual connections to prevent the loss of the extracted feature information, while batch normalization regularizes the output of the Block Combined Pooling module.
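Equation (4) can be checked numerically with the standard library alone (a quick sanity sketch, not the framework implementation):

```python
import math

def gelu(x: float) -> float:
    """GeLU via the Gauss error function, as in Equation (4)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For large positive x, erf(x/√2) → 1 and GeLU(x) → x; for large negative x the output decays to 0, which gives the smooth gating behavior the Feed-Forward bottleneck relies on.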

Valence-Arousal-Like Classifier
FACS defines the neutral expression as having no AU, meaning that no facial muscle movement can be used as a biomarker. This makes neutral expressions more challenging to identify than other expressions; in particular, the neutral expressions of some stroke patients differ from those of ordinary people even when all facial muscles are completely relaxed. The V-A emotion theory [40] suggests that each emotion is a mixture of arousal and valence in different proportions. Referring to this theory, we design the V-ALC as an expression classifier that treats an emotion as a weighted sum of neutral and V-A-like features. Details of the V-ALC are shown in Figure 5.
We adopt the pixel shuffle method to reshape low-resolution feature maps into high-resolution ones. That is, the length and width of the input feature map are up-sampled by 12 times, and the result is condensed using a convolution kernel of size 12 × 12. These compressed sequences are passed to the Channel Mean and the Batch Sharing to obtain the one-dimensional V-A-like and neutral features, respectively. The neutral feature, multiplied by the adaptive weight w_AD, is added to the V-A-like feature to output a complete feature map of the emotion; w_AD is a parameter learned by the model from many training samples. The Channel Mean averages the values of the different channels within the same batch item, thereby reducing the channel dimension. The Batch Sharing averages the Channel Mean results across the batch, aiming to extract the most representative characteristics of neutral emotion from the batch. Their expressions are Equations (5) and (6):

CM(x_j) = (1/c) Σ_{i=1}^{c} x_{j,i} (5)

BS(x) = (1/b) Σ_{j=1}^{b} CM(x_j) (6)

where x is the input feature tensor, x_j is the feature sequence of a batch item in the input tensor, i indexes the channels, j indexes the batch items, c is the total number of channels, and b is the total number of batch items. After the complete emotion feature map is output, and considering that an emotion may be a composite state, we normalize these sequences using the Sigmoid function rather than the Softmax function, whose results are mutually exclusive. Finally, the prediction confidence of each category is output, and the expression with the highest confidence is the final result of the model's prediction. Table 3 shows the training settings of this experiment, including the selected optimizer, the loss function, and some specific hyperparameters. Table 4 shows the detailed structural parameters of each module combined in this experiment.
1 The patch size is the size of each patch when the image is split into patches. 2 The PTF module is repeated twice in the model's overall structure, i.e., n_tf = 2, so PTF1 and PTF2 refer to the first and second repetitions, respectively. 3 The heads are the setting of the Multi-Head Self-Attention in the PTF module.
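The Channel Mean, Batch Sharing, and adaptive-weight combination of Equations (5) and (6) reduce to a few lines of numpy. This is an illustrative sketch only: the tensor shapes and the value of w_AD are assumptions, not the trained configuration:

```python
import numpy as np

def channel_mean(x):
    """Equation (5): average over channels within each batch item.
    x: (b, c, n) -> (b, n), the per-item V-A-like feature."""
    return x.mean(axis=1)

def batch_sharing(x):
    """Equation (6): average the Channel Mean results over the batch.
    x: (b, c, n) -> (n,), the shared neutral feature."""
    return channel_mean(x).mean(axis=0)

def v_alc_features(x, w_ad):
    """Weighted sum of V-A-like and neutral features (neutral feature
    broadcasts over the batch), as in the V-ALC combination step."""
    return channel_mean(x) + w_ad * batch_sharing(x)
```

Note that the neutral feature is shared across the whole batch, which is why dataset imbalance within a batch can shift the expression baseline, as discussed in the confusion-matrix analysis below.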

Performance Evaluation of PCVT Based on Public Datasets
We evaluate the learning capabilities of CvT [35], PvT [42], ResNet18 [25], ResNet18*, and PCVT on the RAF-DB dataset, focusing on accuracy and resource consumption. Both CvT and PvT are hybrid variant networks formed by introducing convolution into ViT, the same type of architecture as in this study. ResNet18 is one of the most commonly used convolutional neural networks for image classification, and ResNet18* denotes ResNet18 initialized with pretrained weights. CvT, PvT, ResNet18, and PCVT are trained from scratch on the same machine so that the results are not affected by device conditions; ResNet18* is fine-tuned on this emotion dataset starting from its pretrained weights.
As shown in Figure 6, the iterative curves of the five networks trained and validated on the RAF-DB dataset show that the proposed PCVT performs better on the validation data than all models except ResNet18*. This indicates that PCVT generalizes better than PvT, CvT, and ResNet18. Admittedly, as a pretrained model, ResNet18* predictably shows the best classification ability from the beginning of training. Table 5 compares the parameters, GFLOPs, and accuracy of the five networks on the RAF-DB dataset. The accuracy of PCVT is 84.22%, second only to that of ResNet18* (86.28%), while PCVT has the fewest parameters and GFLOPs.

Comparison with State-of-the-Art Methods
The proposed FER-PCVT is compared with state-of-the-art methods on the RAF-DB and FER+ datasets. As shown in Table 6, two FER-PCVT models trained from scratch without pretrained weights on the two public datasets achieve 89.44% and 88.21% accuracy, respectively. The FER-PCVT trained on RAF-DB achieves the highest accuracy, while the FER-PCVT trained on FER+ performs below the other models.

Analysis Based on Confusion Matrix
The detailed per-class performance of FER-PCVT on the RAF-DB and FER+ datasets is analyzed based on the confusion matrix. As shown in Figure 7, FER-PCVT is sensitive to whether the dataset is balanced. There is no significant bias in the predictions on the RAF-DB dataset, whereas the model shows significant bias on the FER+ dataset. As shown in Figure 7b, the model's predictions fail almost completely for the small-sample "disgust" and "contempt" classes; conversely, the model is highly accurate for the "happy" and "neutral" classes. Moreover, the V-ALC determines the expression baseline from the features of neutral expressions within a batch, so dataset imbalance affects the generation of this baseline. In addition, the Precision, Specificity, Sensitivity, F1-Score, and G-mean of FER-PCVT are analyzed based on the confusion matrix, as shown in Table 7. We set Precision and Recall to the same weight to obtain the F1-Score of FER-PCVT for each emotional category. On the RAF-DB dataset, the F1-Score values of FER-PCVT for surprised, fear, disgust, happy, sad, angry, and neutral are 84.2%, 73.3%, 67.9%, 94.5%, 86.7%, 80.4%, and 92.4%, respectively. On the FER+ dataset, however, FER-PCVT only performs well for categories with many samples, such as surprised (86.4%), happy (89.4%), sad (69.1%), and neutral (73.6%). G-mean reflects the contribution of each category to the model's accuracy. Although the model's accuracy reaches 88.21% on the FER+ dataset, the G-mean values of both disgust and contempt are 0%, which means the accuracy depends on surprised (89.8%), fear (72.3%), happy (94%), sad (75.8%), angry (78.8%), and neutral (86.5%). In contrast, the G-mean values of all categories exceed 80% on the RAF-DB dataset, in descending order: neutral (96.8%), happy (95.6%), sad (92.1%), surprised (89.4%), angry (85.2%), fear (81.1%), and disgust (81.1%).
The above evaluation metrics are calculated using the standard formulas in Equations (7)-(11):

Precision = TP / (TP + FP) (7)

Specificity = TN / (TN + FP) (8)

Sensitivity = TP / (TP + FN) (9)

F1-Score = 2 · Precision · Sensitivity / (Precision + Sensitivity) (10)

G-mean = √(Sensitivity · Specificity) (11)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
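Equations (7)-(11) can be computed directly from per-class binary counts; the following small helper (ours, not the authors' evaluation code) shows the arithmetic:

```python
import math

def class_metrics(tp, tn, fp, fn):
    """Per-class Precision, Specificity, Sensitivity, F1-Score, and G-mean
    from one-vs-rest confusion counts (Equations (7)-(11))."""
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)           # also called Recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return precision, specificity, sensitivity, f1, g_mean
```

Because G-mean is the geometric mean of Sensitivity and Specificity, a class the model never predicts correctly (Sensitivity = 0) drives its G-mean to 0, which is exactly what happens for disgust and contempt on FER+.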

Visualization of Clustering Ability
The clustering ability of FER-PCVT on the RAF-DB dataset is visualized by the t-SNE plot based on the inputs and outputs of the last linear layer of V-ALC. As shown in Figure 8, the boundaries between the various categories are clear and intuitive, which means that FER-PCVT can distinguish and cluster the seven emotions well.

Accuracy Comparison and Impact of Pretrained Weights
We compare FER-PCVT with ResNet18 and with the structure combining PCVT and a Multi-layer Perceptron (MLP) on the private dataset, focusing on the accuracy and parameter counts of these models with and without pretrained weights. Figure 9 shows the training and validation accuracy curves of ResNet18, PCVT+MLP, and FER-PCVT on the private dataset. As shown in Table 8, PCVT combined with MLP exhibits the worst accuracy on the private dataset, although it has the fewest parameters. FER-PCVT achieves accuracy similar to ResNet18 on the private dataset both with and without pretrained weights. However, the proposed algorithm has only 4.10M parameters, about one-third of ResNet18's.

Visualization of Clustering Ability
To visualize the model's ability to classify the eight facial expressions of stroke patients, we plot the t-SNE of FER-PCVT on the private dataset. As shown in Figure 10, the model clusters the four basic expressions and the four special expressions of stroke patients well. In particular, among the special categories, the distribution of neutral expressions relative to the other expressions is similar to that described by the V-A emotion theory.
Figure 9. Training and validation accuracy curves of ResNet18, PCVT+MLP, and FER-PCVT on the facial expression dataset of stroke patients. (a) Accuracy curves for models with pretrained weights on the basic dataset; (b) accuracy curves for models without pretrained weights on the basic dataset; (c) accuracy curves for models with pretrained weights on the special dataset; (d) accuracy curves for models without pretrained weights on the special dataset. "Train" means training, "Val" means validation, and bold font indicates the algorithm proposed in this study.

Visual Analysis
We perform a global visual analysis of the models to find the regions they focus on for classification. Grad-CAM [52] is used to visualize ResNet18*; for ViT and FER-PCVT, visualization is achieved by stacking the attention weights of each layer in order. As shown in Figure 11, ResNet18*, ViT, and FER-PCVT attend to different regions when identifying the facial emotions of stroke patients. The part covered in red is the region the model attends to most when classifying and recognizing expressions. ResNet18* focuses on localized facial regions, while ViT extracts information globally. Although the red regions in the ViT visualizations appear mostly on the periphery of the image, ViT also pays attention to the details of the facial features. FER-PCVT, however, focuses more on the muscle changes caused by different expressions while still extracting global information. For example, for the strained expression, a common emotion when muscles are tense during training, FER-PCVT notices more changes in areas such as the eyebrows, eyes, and lips than the other models do. Moreover, the facial features of neutral expressions extracted by FER-PCVT are more specific than those of the other models, and FER-PCVT also shows a better ability to extract emotional features for the four basic expressions.
Figure 11. Visualization of ResNet18*, ViT, and FER-PCVT. * Indicates that the model is a pretrained model.

Discussion
Experienced physicians can determine stroke patients' intervention strategies by observing their emotional changes [13]. Similarly, stroke rehabilitation systems based on deep learning/machine learning can sense patients' emotions and provide training suggestions according to emotional changes. Currently, most researchers use patients' physiological signals as the information source for perceiving emotion [10][11][12][13]; few studies have designed FER algorithms for stroke rehabilitation. To assist physicians in analyzing the degree of physical recovery and adjusting the training intensity of stroke patients, we use eight common emotions of patients during rehabilitation as biometrics and design a lightweight FER algorithm. By detecting patients' positive emotions during rehabilitation, such as happy, surprised, and strained expressions, the algorithm informs physicians of patients' training motivation and interest. When painful emotions are detected, the training intensity has exceeded the patient's muscle tolerance and should be adjusted in time to avoid secondary injuries. In addition, if negative emotions such as sad, tired, and angry expressions are detected frequently, physicians must pay attention to patients' mental health.
The FER algorithm proposed in this study is an automated assessment technology for stroke rehabilitation, which acquires the training status of patients in a non-contact way. ViT is the basic framework for the algorithm design, since ViT's global modeling of images is critical to the emotion classification task, as shown in Figure 11. However, CNN structures are better than ViT at extracting local and detailed information from expression images. Therefore, introducing the characteristics of CNNs into the ViT structure can improve performance and robustness while maintaining high accuracy and memory efficiency. ViT converts the pixel information (2D) in each patch into the feature sequence (1D) required by the encoder through linear projection and Patch Embedding, and the position relationships between patches must then be learned through the Position Embedding module. A sequence extracted by convolution, in contrast, already contains position information, which is the inductive bias of convolution. Thus, the CPE module, containing convolutional and pooling layers, is designed to replace the linear projection, Patch Embedding, and Position Embedding of ViT. Some studies have also introduced convolution into ViT. For example, VTFF [34] extracts information from the original image and its local binary pattern image using two ResNet18 networks, then flattens and linearizes the features to obtain feature-based patches instead of ViT's image-block patches. This network achieves an accuracy of 88.14% on RAF-DB but contains a large number of parameters (51.8M); the algorithm proposed in this study performs better on RAF-DB, with 1.3% higher accuracy than VTFF. CvT [35] divides the transformer into multiple stages, constituting a transformer hierarchy.
A convolutional token embedding module is added at the beginning of each stage, implemented as a convolutional projection that replaces the linear projection before each self-attention in ViT. The algorithm proposed in this paper instead realizes the mapping from pixels to sequences by combining convolution and pooling in place of ViT's linear projection, while preserving the location information between patches. In this sense, we incorporate convolutional features into ViT more concisely. According to the experimental data in Table 5 and Figure 6, the PCVT proposed in this study achieves higher accuracy with fewer parameters than CvT on the RAF-DB dataset.
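As a rough sanity check of how a conv-plus-pooling projection turns a 2D image into a 1D token sequence while keeping raster-order position, one can compute the output shapes directly. The kernel, stride, and padding values below are illustrative assumptions, not the exact CPE configuration.

```python
# Back-of-the-envelope shape check for a conv+pool patch embedding.
# Kernel/stride/padding values here are assumed for illustration.

def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (or pooling) along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

h = w = 224                                           # input image
h = w = conv_out(h, kernel=7, stride=2, padding=3)    # conv layer  -> 112
h = w = conv_out(h, kernel=2, stride=2, padding=0)    # pooling     -> 56
seq_len = h * w   # tokens, read in raster order, so position is preserved
print(seq_len)    # 3136
```

Because each token corresponds to a fixed spatial location of the downsampled feature map, no separate Position Embedding is needed.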
In addition, high accuracy and low parameter counts are necessary for a model to run well on rehabilitation equipment with less computing power than professional computers. Therefore, inspired by PvT [42], we designed the PTF module, which introduces a pyramidal feature output mode to reduce parameters and GFLOPs. PvT was proposed as a backbone model to serve downstream tasks of various forms, such as image classification, object detection, and semantic segmentation. As in this study, both PvT and FER-PCVT reduce the sequence length of the transformer output as the network deepens, significantly decreasing computational overhead. Regarding implementation details, PvT splits the image/feature map into many patches (of size P_i × P_i, where i is the i-th stage) and feeds each patch into a linear projection to obtain feature sequences whose dimensions are P_i times shorter than the input. We instead down-sample the feature map by combining convolution and pooling, halving its size each time. As validated by the experiments in Figure 6 and Table 5, the proposed algorithm achieves higher accuracy and requires about 3.79M fewer parameters than PvT in model training/inference. Furthermore, considering that some stroke patients show atypical facial expressions due to impaired facial muscles, we designed a classifier better suited to the emotion classification of stroke patients to further improve accuracy. The V-ALC classifier is based on the V-A emotion theory, treating an emotion as the weighted sum of V-A-like and neutral features. Adding V-ALC improves the model's accuracy from 84.22% to 89.44%, as shown in Tables 5 and 6. According to Table 8, the structure obtained by combining PCVT with V-ALC classifies the emotions of stroke patients better than PCVT combined with an MLP.
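The payoff of the pyramidal output mode can be estimated with simple arithmetic: self-attention cost grows roughly with N²·d for N tokens of width d, so halving the feature map between stages cuts the token count by 4x and the per-layer attention cost by about 16x. The stage sizes and channel width below are assumed for illustration, not the model's actual configuration.

```python
# Rough cost model: why halving the feature map each stage pays off.

def attn_cost(n_tokens, dim):
    """Approximate cost of one self-attention layer: O(N^2 * d)."""
    return n_tokens * n_tokens * dim

stages = []
h = w = 56          # assumed token grid after the first embedding
dim = 64            # assumed channel width
for stage in range(3):
    n = h * w
    stages.append((n, attn_cost(n, dim)))
    h //= 2
    w //= 2         # pyramid: spatial size halves between stages

# Each stage has 1/16 the attention cost of the previous one.
for n, cost in stages:
    print(n, cost)
```

This is the same effect PvT obtains with strided linear projections; here the halving comes from the conv+pool down-sampling.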
We also visually analyze the models to find the attention regions of ViT, ResNet18*, and FER-PCVT when classifying emotions and verify that FER-PCVT combines the advantages of the other two structures well. As shown in Figure 11, ResNet18*, a typical CNN structure, focuses on the facial regions that best represent emotions, similar to the areas humans notice when recognizing the emotions of stroke patients: the tightened, open lips when angry, the wrinkled eyebrows when sad, the raised cheeks when strained, and the relaxed eyes and mouth when tired. Unlike ResNet18*, ViT extracts global features while also attending to some facial regions inside the image, especially for surprised and painful expressions. FER-PCVT extracts information globally like ViT but perceives more detailed facial regions than ViT, meaning it captures more details about emotions.
However, the proposed algorithm is best trained on a well-balanced dataset, since the designed classifier combines weighted neutral-emotion features with the other emotion features for classification, and unbalanced sample sizes impair the model's ability to extract an unbiased emotion baseline. The RAF-DB dataset is more balanced than the FER+ dataset, so the proposed method achieves its highest accuracy on RAF-DB, as shown in Table 6 and Figure 7, while its performance on FER+ is weaker than that of other FER algorithms such as RAN [29], VTFF [34], SCN [47], and FER-VT [48].
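To make the classifier's dependence on the neutral baseline explicit, here is a toy weighted-sum score in the spirit of V-ALC. The feature values and weights below are invented for illustration; in the real model they are activations and learned parameters.

```python
# Toy V-ALC-style score: an emotion logit as a weighted sum of V-A-like
# features plus a weighted neutral-baseline feature. All values are made up.

def valc_logit(va_feats, neutral_feat, va_weights, neutral_weight):
    """Weighted sum of V-A-like features plus a weighted neutral baseline."""
    return (sum(f * w for f, w in zip(va_feats, va_weights))
            + neutral_feat * neutral_weight)

va_feats = [0.8, -0.3]   # toy valence/arousal-like activations
neutral = 0.5            # toy neutral-baseline activation
logit_happy = valc_logit(va_feats, neutral,
                         va_weights=[1.2, 0.4], neutral_weight=-0.5)
```

If the training set is imbalanced, the neutral activation drifts away from a true baseline, and every emotion score built on top of it is biased, which is why balanced sampling matters for this classifier.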
To summarize, the proposed method has several advantages: (1) It achieves higher recognition accuracy than other existing FER algorithms on the RAF-DB dataset. (2) The network structure successfully combines the local perception of CNNs with the global extraction capability of ViT, which effectively improves the model's ability to extract the feature sequences used to classify patients' emotions. (3) It has fewer parameters and GFLOPs than other algorithms, making it easier to embed in medical rehabilitation equipment whose computing performance is lower than that of professional computers. Although the proposed method has shown lower consumption and better effectiveness on both the public datasets and the private dataset, some problems remain: (1) The algorithm performs better on balanced datasets, so the sample size of each category should be balanced to obtain unbiased predictions. (2) The sample size of the private dataset used in this study is small compared to the public datasets, especially for painful and tired expressions; we hope to collect more clinical data to improve the model's generalization. (3) This study only conducted a qualitative analysis of emotions and did not further subdivide each emotion, for example, dividing painful emotions into severe, moderate, and slight pain. We hope future research can provide more specific, quantitative rehabilitation recommendations for the early training of stroke patients.

Conclusions
This study proposes a lightweight FER algorithm, FER-PCVT, which is well suited to embedding in medical rehabilitation equipment to determine whether the training intensity a stroke patient currently receives is most suitable for their physical recovery. To verify the performance of FER-PCVT, we collected and annotated a private dataset of stroke patients containing 1302 samples across eight classes: painful, strained, tired, neutral, happy, sad, angry, and surprised. The algorithm is compared with other FER algorithms on two public datasets (FER+ and RAF-DB) and the private dataset. The experimental results show that: (1) PCVT, the backbone network of FER-PCVT, achieves an accuracy of 84.22% with 2.46M parameters and 0.12 GFLOPs on the RAF-DB dataset, outperforming CvT, PvT, and ResNet18. (2) FER-PCVT achieves accuracies of 88.21% and 89.44% on the FER+ and RAF-DB datasets, respectively, exceeding other existing expression recognition algorithms on RAF-DB. (3) FER-PCVT achieves an accuracy of 99.81% on the private dataset with only 4.10M parameters. (4) FER-PCVT effectively combines the local perception and pyramidal feature output of CNNs with the global extraction capability of ViT, significantly reducing parameters while maintaining recognition accuracy. The method performs well on both public and private datasets, providing an intuitive and efficient automated assessment technique that helps stroke patients receive more suitable early training.
Author Contributions: Conceptualization, Y.F., H.W., X.Z. and X.L.; methodology, Y.F. and X.Z.; software, Y.F. and X.Z.; validation, Y.F. and X.Z.; formal analysis, Y.F. and X.Z.; investigation, Y.F., H.W. and X.Z.; resources, H.W., X.C., Y.C., C.Y. and J.J.; data curation, Y.F.; writing-original draft preparation, Y.F.; writing-review and editing, Y.F. and H.W.; visualization, Y.F.; supervision, X.L. and J.J.; project administration, X.L.; funding acquisition, H.W., Y.C., X.C., J.J. and X.L. All authors have read and agreed to the published version of the manuscript.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Two publicly available datasets, FER+ and RAF-DB, were analyzed in this study. The FER+ dataset can be found here: Challenges in Representation Learning: Facial Expression Recognition Challenge|Kaggle. The RAF-DB dataset can be found here: Real-world Affective Faces (RAF) Database (whdeng.cn). The private datasets in this study are available from the corresponding authors upon reasonable request.