Sign Language Recognition Using Two-Stream Convolutional Neural Networks with Wi-Fi Signals

Sign language is an important way for deaf people to understand and communicate with others. Many researchers use Wi-Fi signals to recognize hand and finger gestures in a non-invasive manner. However, Wi-Fi signals usually contain signal interference, background noise, and mixed multipath noise. In this study, Wi-Fi Channel State Information (CSI) is preprocessed by singular value decomposition (SVD) to obtain the essential signals. Sign language includes the positional relationship of gestures in space and the changes of actions over time. We propose a novel dual-output two-stream convolutional neural network. It not only combines the spatial-stream network and the motion-stream network, but also effectively alleviates the back-propagation problem of the two-stream convolutional neural network (CNN) and improves its recognition accuracy. After the two stream networks are fused, an attention mechanism is applied to select the important features learned by the two streams. Our method was validated on the public SignFi dataset using five-fold cross-validation. Experimental results show that SVD preprocessing can improve the performance of our dual-output two-stream network. For the home, lab, and lab+home environments, the average recognition accuracy rates are 99.13%, 96.79%, and 97.08%, respectively. Compared with other methods, our method has good performance and better generalization capability.


Introduction
Sign language is an important way for deaf people to understand and communicate with each other. Communication barriers are often encountered between deaf communities and people who do not know sign language. Many researchers try to build sign language recognition systems to break these barriers [1]. Currently, sign language recognition systems are roughly divided into two categories: (i) device-based sign language recognition systems; (ii) device-free sign language recognition systems [2,3].
Wearable sensors are widely used in device-based sign language recognition systems. In 1983, Grimes et al. invented a data glove for dynamic gesture recognition [4]. Shukor et al. used data gloves to obtain data on Malaysian sign language letters, numbers, and words [5]. Kanokod et al. recognized gestures through the time delay neural network (TDNN) algorithm, with the gesture data obtained from data gloves based on pyrolytic graphite sheets (PGS) [6]. In general, the advantages of wearable device-based sign language recognition methods are that the input data is accurate and the recognition rate is high [2][4][5][6]. The disadvantages are also obvious; for example, wearable devices are often expensive and inconvenient to carry.
The device-free sign language recognition systems are usually inexpensive and not limited by wearable devices [2,7]. Several device-free sign language recognition systems use computer vision techniques with cameras. Koller conducted a survey on gesture recognition. The main contributions of this work are as follows:

1. This work shows how to process sign language data based on CSI traces through SVD. It not only makes sign language features more prominent, but also reduces noise and outliers to a certain extent. SVD helps to improve the recognition accuracy of the two-stream network, and offers fast running speed, robustness, and generalization ability.

2. We explored a novel scheme, the dual-output two-stream network. The two-stream network consists of a spatial-stream network and a motion-stream network. The input of the spatial-stream network is a three-dimensional array (similar to an array of RGB images) composed of the amplitude and phase of each gesture. The array differences, which represent the amplitude and phase changes, are fed into the motion-stream network. The convolutional features from the two streams are fused, and then an attention mechanism automatically selects the most descriptive features. The experimental results show that the dual output can effectively alleviate the back-propagation problem of the two-stream CNN and improve the accuracy.

3. The fine-tuning of an ImageNet pre-trained CNN model on CSI datasets has not yet been exploited. We explored CNN architectures with different model layers on CSI data.

Received Signal Strength Indicator and Channel State Information
Wireless indoor behavior detection relies on the multipath propagation of transmitted wireless signals. Reflection and scattering cause multiple superimposed signals to be received in an indoor environment. These signals are physically affected by human behavior in the transmission space and therefore carry environmental characteristic information. As a result, the information extracted from multipath superimposed signals can be used to identify human behavior [18].
The most common data sources for device-free gesture recognition systems based on Wi-Fi signals are the Received Signal Strength Indicator (RSSI) and CSI [19]. RSSI is the most widely used signal indicator for wireless devices [20]. It describes the attenuation experienced during the propagation of a wireless signal. In a wireless sensor link, the RSSI of the wireless sensor unit changes with the movement of a person; in other words, the movement of a person can be detected based on the change in RSSI. Since RSSI information is easy to capture, RSSI was used for hand gesture recognition in the early days [21]. RSSI is a kind of coarse-grained information, mainly reflecting the superimposition of multipath signals at the receiver. It is affected by the multipath effect and environmental noise, so it fluctuates widely and has poor stability [2].
RSSI only reflects the total amplitude of multipath overlap at the media access control (MAC) layer, while CSI is more fine-grained subcarrier information from the physical layer. For a Wi-Fi system combining multiple-input multiple-output (MIMO) technology with orthogonal frequency-division multiplexing (OFDM), CSI is mainly derived from the sub-carriers decoded by the OFDM receiver [22]. It can effectively eliminate or reduce the interference caused by the multipath effect. CSI contains amplitude and phase information for the different sub-carriers, and the sub-carriers do not interfere with one another. Thus, CSI is more sensitive and reliable than RSSI. It has higher detection accuracy and sensitivity, so it can achieve more detailed motion detection [23].
A set of CSI data can be obtained from each received data packet of a wireless network card compatible with the IEEE 802.11n protocol standard. The amplitude and phase of a sub-carrier in the CSI data are shown in Equation (1):

H(f_k) = |H(f_k)| e^(j∠H(f_k)), k = 1, 2, …, K (1)

where H(f_k) is the CSI of the sub-carrier with center frequency f_k, and |H(f_k)| and ∠H(f_k) are the amplitude and phase at center frequency f_k, respectively. They are the most important information in the CSI data, and K is the total number of sub-carriers.
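As a concrete illustration, the amplitude and phase in Equation (1) are simply the modulus and argument of the complex CSI values. The following NumPy sketch uses synthetic data; only the 30-sub-carrier by 3-antenna shape is taken from the text, not the actual SignFi tool chain:

```python
import numpy as np

# Hypothetical complex CSI sample for one packet:
# 30 sub-carriers x 3 antennas, as a complex-valued matrix.
rng = np.random.default_rng(0)
csi = rng.standard_normal((30, 3)) + 1j * rng.standard_normal((30, 3))

amplitude = np.abs(csi)    # |H(f_k)|
phase = np.angle(csi)      # angle of H(f_k), in radians

# Equation (1): the complex CSI is recovered from amplitude and phase.
reconstructed = amplitude * np.exp(1j * phase)
assert np.allclose(reconstructed, csi)
```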

Singular Value Decomposition
In linear algebra, singular value decomposition (SVD) is a factorization of real or complex matrices [24]. SVD is a decomposition method that can be applied to any matrix. An SVD always exists for any matrix A, as shown in Equation (2):

A = U Σ V^T (2)

Assuming that A is an m × n real or complex matrix, the obtained U is an m × m square matrix, and the orthogonal vectors in U are called left singular vectors. Σ is an m × n rectangular diagonal matrix; except for the diagonal elements, all elements of Σ are 0, and the elements on the diagonal are called singular values. V^T is the transpose of V, an n × n square matrix whose orthogonal vectors are called right singular vectors.
Generally speaking, the values on the diagonal of Σ are in descending order [24]. The larger the value, the higher the importance of the corresponding dimension. We can choose the top singular values to approximate the matrix. This not only extracts important features from the data, but also simplifies the data and eliminates noise and redundancy. The number of singular values to keep depends on various factors, such as the dataset, the recognition method, and the temporal and spatial characteristics.
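The truncation described above can be sketched as follows. This is a minimal NumPy example in which a synthetic low-rank matrix plus noise stands in for a CSI amplitude slice; the rank and noise level are illustrative:

```python
import numpy as np

def svd_denoise(matrix, k):
    """Approximate `matrix` by keeping only its top-k singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

rng = np.random.default_rng(1)
# Low-rank "signal" plus small random "noise", standing in for a
# 200 x 30 CSI amplitude matrix (200 packets x 30 sub-carriers).
signal = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 30))
noisy = signal + 0.01 * rng.standard_normal((200, 30))

# Keep the top 20 of 30 singular values, as in the SVD_20 setting.
denoised = svd_denoise(noisy, k=20)
assert denoised.shape == (200, 30)
# The discarded tail carries only a small fraction of the energy.
assert np.linalg.norm(noisy - denoised) < 0.1 * np.linalg.norm(noisy)
```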

SignFi Dataset
The SignFi dataset contains the CSI data, which were extracted by the 802.11n CSI-Tool on the Intel WiFi Link 5300 device with three antennas. The dataset was collected through a transmitter with three external antennas and a receiver with one internal antenna. Figure 1 shows the measurement scenes of the lab and home environments. The 802.11n CSI-Tool provides CSI values of 30 sub-carriers, which were sampled approximately every 5 milliseconds. The duration of each gesture is about 1 s, so there were 200 CSI samples for each gesture. The CSI data was stored as a 3D matrix of complex values representing amplitude and phase information. The size of the 3D matrix is 200 × 30 × 3. The 3D amplitude and phase matrices are similar to digital images with spatial resolution of H × W and C color channels. Thus, the CSI data can be regarded as images. The three color channels correspond to the three antenna signals.
The SignFi dataset consisted of two parts. The first part included 276 gestures, with a total of 8280 instances from the same user. Among them, 5520 instances and 2760 instances were obtained in the laboratory and at home, respectively; each gesture had 20 instances in the laboratory and 10 at home. The second part included 150 gestures with 7500 instances collected from five users in the laboratory: 50 instances of each gesture, 10 from each user. The dataset was further divided into four groups to train and evaluate our method: Home276, Lab276, Lab+Home276, and Lab150, containing 276 or 150 gestures, respectively. Table 1 shows the statistics of the SignFi dataset.

Data Preprocessing
The amplitude and phase can be obtained from the 3D matrix of the raw CSI. Their size is 200 × 30 × 3. The amplitude and phase of an antenna can be obtained from Equations (3) and (4):

|H(f_k)| = sqrt(Re(H(f_k))^2 + Im(H(f_k))^2) (3)

∠H(f_k) = arctan(Im(H(f_k)) / Re(H(f_k))) (4)

Note that we directly take the angular value of the phase, without unwrapping the phase to eliminate the phase shift as in the SignFi method [1]. We combined the amplitude and phase of each gesture and reshaped them into a combination matrix with a size of 200 × 60 × 3 as the input data of the spatial-stream network. The input of the motion-stream network was the difference (with a size of 199 × 60 × 3) of the above combination matrix, which is a concatenation of the amplitude difference and the phase difference. The difference is taken between consecutive samples and describes the changes in amplitude and phase; it indicates the change of the gesture corresponding to the salient area of movement. Then, the two types of modality data, namely the combination matrix and the difference matrix, were preprocessed by SVD to remove redundant and irrelevant noise. Figure 2 shows the combination matrix and difference matrix before and after SVD preprocessing of the sign "GO" in the home and laboratory environments. Each picture in Figure 2 represents a 3D matrix. Figure 2a,b,e,f are the combination matrices with a size of 200 × 60 × 3, and Figure 2c,d,g,h are the difference matrices with a size of 199 × 60 × 3. The Y axis represents the first dimension, and the X axis represents the second dimension. The RGB color is the third dimension, representing the three antenna signals. On the X axis, the first half (0-29) is the amplitude information and the second half (30-59) is the phase information.
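The two input modalities described above can be sketched as follows (synthetic CSI values; only the matrix shapes match the dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical raw CSI for one gesture: 200 packets x 30 sub-carriers
# x 3 antennas, stored as complex values.
csi = rng.standard_normal((200, 30, 3)) + 1j * rng.standard_normal((200, 30, 3))

# Combination matrix: amplitude and phase concatenated along the
# sub-carrier axis -> 200 x 60 x 3, the spatial-stream input.
combination = np.concatenate([np.abs(csi), np.angle(csi)], axis=1)
assert combination.shape == (200, 60, 3)

# Difference matrix: change between consecutive samples along the time
# axis -> 199 x 60 x 3, the motion-stream input.
difference = np.diff(combination, axis=0)
assert difference.shape == (199, 60, 3)
```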
From Figure 2, we observed: (1) The combination matrices are more colorful than the difference matrices. The color channels correspond to the three antenna signals; the richer the color, the greater the diversity of the signal, and the more information it contains. Thus, the combination matrices contain more information than the difference matrices. (2) When the same user performs the same gesture, the difference between the home and laboratory environments results in different CSI data, especially in the combination matrices, as shown in Figure 2a,b,e,f. In other words, the amplitude and phase of CSI are easily affected by the environment. However, the difference matrices are less affected by the environment. (3) We performed SVD preprocessing on the amplitude and phase of the CSI data, respectively. In order to strike a balance between data feature integrity and noise elimination, SVD keeps only the top 20 of the 30 singular values, namely SVD_20. From Figure 2c,d,g,h, we can see that the matrix signal becomes smoother after performing SVD.

Dual-Output Two-Stream Convolutional Neural Network
The Convolutional Neural Network (CNN) is the most successful neural network in the field of deep learning [25]. The network avoids complicated image preprocessing and can take the original images directly as input to achieve end-to-end results. CNNs derive from Hubel and Wiesel's study of the cat visual cortex in 1962 [26]. In 1998, Yann LeCun proposed the LeNet-5 network to solve the visual task of handwritten digit recognition [27]. In the 2012 ImageNet image recognition competition, Hinton's team used AlexNet to greatly improve the accuracy of image recognition and transform the field [28]. This brought CNNs much attention and made them a research hotspot. In order to improve the performance of CNNs, several improved architectures have been proposed, such as ZFNet [29], VGGNet [30], GoogLeNet [31], ResNet [32], DenseNet [33], and ResNeXt [34]. These networks focus on three important aspects: depth, width, and cardinality. At the same time, the CNN network structure has been developed in terms of attention mechanisms, efficiency, and automation. The most famous examples are SENet [35], CBAM [36], SqueezeNet, MobileNet [37], NASNet [38], and EfficientNet [39].
In video behavior recognition, Simonyan et al. proposed a two-stream CNN structure with RGB input and optical-flow input [40]. They used two identical CNN structures for training and merged them through a post-fusion method. This is an effective approach in the field of behavior recognition when the training dataset is limited, and a lot of research has been conducted based on this architecture [41][42][43]. For example, Wang et al. proposed the temporal segment network (TSN), which divides the input video into several segments and sparsely samples two-stream features from these segments [41]. Feichtenhofer et al. extended the two-stream CNN and proposed a spatio-temporal CNN [42].

Figure 3 shows the architecture of our proposed sign language recognition method, which combines SVD, a dual-output two-stream network, and an attention mechanism. The SignFi dataset was collected by CSI measurement. The raw CSI data are a 3D matrix that can be regarded as an image. Thus, computer vision techniques and CNN models can be used to process the CSI data. The amplitude and phase information contain noise and a certain phase offset. In our method, SVD was first used to remove redundant and irrelevant noise in the amplitude and phase. Then, they were concatenated and converted to a 3D matrix, which is similar to an array of RGB images. After the SVD processing, the resulting matrix was fed into the spatial-stream CNN, the top stream of our dual-output two-stream network. Sign language includes not only the positional relationship of gestures in space, but also the changes of actions over time. We introduced the amplitude difference and phase difference information, which represent the changes in amplitude and phase, respectively. The difference matrix was input to the motion-stream CNN, the bottom stream of our dual-output two-stream network.
The proposed dual-output two-stream network is shown in Figure 4. In total, two types of modality data, the combination matrix and the difference matrix, were input into the network. In this study, the ResNet model was used for the two stream CNNs. The convolutional layers in the CNNs extract multiple levels of features. When the two streams are fused by concatenation, the attention module (CBAM) [36] automatically selects the most descriptive features learned by the two stream networks. Then, batch normalization (BN) is used to prevent overfitting. The ensemble prediction is the final output, as shown in the bottom layer of Figure 4. The two cross-entropy losses are combined to optimize the learning process. The dual output and two cross-entropy losses in this structure mainly borrow the idea of the GoogLeNet architecture, whose additional classification branches provide gradient signals for the earlier convolutions. When a network deepens, the gradient cannot be effectively propagated from back to front, and the network parameters cannot be updated; such a branch alleviates this gradient propagation problem.
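The fusion-and-attention step can be sketched as follows. This is a simplified, NumPy-only illustration of channel-wise attention on concatenated stream features; the actual network uses ResNet feature maps and the full CBAM module, and all shapes here are hypothetical:

```python
import numpy as np

def channel_attention(features):
    """Simplified channel attention (CBAM-style): weight each channel
    by a sigmoid-squashed global-average-pooled descriptor."""
    pooled = features.mean(axis=(0, 1))        # one value per channel
    weights = 1.0 / (1.0 + np.exp(-pooled))    # sigmoid gate in (0, 1)
    return features * weights                  # reweight the channels

rng = np.random.default_rng(3)
# Hypothetical 7 x 7 feature maps with 256 channels from each stream.
spatial_feat = rng.standard_normal((7, 7, 256))
motion_feat = rng.standard_normal((7, 7, 256))

# Fuse by concatenation along the channel axis, then let attention
# emphasize the most descriptive channels.
fused = np.concatenate([spatial_feat, motion_feat], axis=-1)
attended = channel_attention(fused)
assert attended.shape == (7, 7, 512)
```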
Most CNNs provide pre-trained models based on the ImageNet dataset. For these pre-trained models, the CSI data is unknown. Transfer learning allowed us to use a small amount of newly labeled data to build a high-quality classification model for the new data. Therefore, we used transfer learning to fine-tune the pre-trained CNN models to speed up the training and improve the accuracy. In our transfer learning, we froze the first five layers of the pre-trained model and trained the remaining layers. In this way, we retained the generic features learned from the ImageNet dataset while also learning domain knowledge from the CSI data. Figure 4 shows the framework of our dual-output two-stream network.
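The layer-freezing idea can be sketched as follows. This is a toy NumPy example, not a real deep learning framework API; the layer names, gradient values, and learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical per-layer weights of a pre-trained model (names illustrative).
params = {f"layer{i}": rng.standard_normal(3) for i in range(1, 11)}
frozen = {f"layer{i}" for i in range(1, 6)}    # freeze the first five layers

before = {name: w.copy() for name, w in params.items()}
lr = 0.01
for name, w in params.items():
    if name in frozen:
        continue                               # frozen layers are not updated
    grad = rng.standard_normal(3)              # stand-in for a real gradient
    params[name] = w - lr * grad               # SGD-style update

# Frozen layers keep their pre-trained weights; the rest are fine-tuned.
assert all(np.array_equal(params[n], before[n]) for n in frozen)
assert all(not np.array_equal(params[n], before[n])
           for n in params if n not in frozen)
```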

Network Training and Test Settings
We conducted experiments on sign language recognition tasks and performed all experiments on a PC with an Intel(R) Xeon(R) E5-2630 v3 @ 2.40 GHz CPU and a GeForce GTX TitanX GPU with 12 GB of memory. We used stochastic gradient descent with momentum (SGDM), with a momentum of 0.9 and an initial learning rate of 0.0001, to train the network and update the weights and biases. The activation function is the rectified linear unit (ReLU). The batch size is set to 16, and the number of training epochs is 500. The width and height of the input matrix data are resized to 224 × 224. The network weights are initialized with the ImageNet pre-trained model. Our experiments followed the training and evaluation scheme of the SignFi paper, using non-repetitive five-fold cross-validation: 80% of the samples for training and 20% for testing, consistent with references [1,3]. In order to preserve the percentage of test samples for each category, stratified K-fold was applied. The dual-output two-stream network contains many batch normalization layers, so it is necessary to shuffle the training data after each training epoch.
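The stratified split described above can be sketched as follows. This is a minimal NumPy implementation with illustrative class counts; in practice a library routine such as scikit-learn's StratifiedKFold would typically be used:

```python
import numpy as np

def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Split sample indices into n_splits folds while preserving the
    per-class proportions (a minimal stratified K-fold sketch)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        for i, sample in enumerate(idx):
            folds[i % n_splits].append(sample)   # deal samples round-robin
    return [np.array(f) for f in folds]

# Hypothetical labels: 10 gesture classes, 20 instances each.
labels = np.repeat(np.arange(10), 20)
folds = stratified_kfold_indices(labels, n_splits=5)

# Each fold holds 20% of the data with every class equally represented.
assert all(len(f) == 40 for f in folds)
assert all(np.bincount(labels[f]).tolist() == [4] * 10 for f in folds)
```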

SignFi Dataset Evaluation
We quantified the performance of the dual-output two-stream network on the benchmark SignFi dataset. The first evaluation shows the effect of the attention mechanism. We used ResNet50 in our model with and without the attention mechanism, respectively. The combination modality with SVD preprocessing on all data groups was used as input. The evaluation results are shown in Table 2. We can observe that the attention mechanism does improve accuracy. The next evaluation tested the Home276 data group. We used the ResNet18 and ResNet50 models in our two-stream CNNs to determine which is more suitable for CSI data in sign language recognition tasks. The results in Table 3 show that, compared with other methods and single-stream CNNs, the dual-output two-stream network with ResNet50 and SVD obtains competitive results. Table 3 also shows that the SVD preprocessing and the difference matrices improve the accuracy. The evaluation of the Lab276 data group is shown in Table 4. SVD still improves the accuracy of our dual-output two-stream network. However, the best recognition accuracy of our method (96.79%) is lower than that of the SignFi method (98.91%) and the HOS-Re method (98.26%). The laboratory environment is likely to be more complicated than the home environment. The SignFi method uses the unwrapping transform to preprocess the CSI phase data, while the HOS-Re method extracts third-order cumulant features that can reduce signal noise. In order to verify the generalization of our proposed network, we also mixed the Home276 and Lab276 groups together. Instances of the mixed group were randomly divided into training data and test data at a ratio of 8:2. The accuracies reported in Table 5 clearly show that our proposed method with ResNet50 (97.08%) is superior to the other methods (94.81%, 96.34%) even without SVD preprocessing, and SVD still improves performance slightly. Table 6 shows the comparison results for the Lab150 data group.
In our proposed method, the accuracy with SVD preprocessing is also higher than without it. However, our best result has an accuracy of 95.88%, which is lower than the HOS-Re method (96.23%) but higher than the SignFi method (86.66%). The accuracy difference between our method and the HOS-Re method is less than 1%. Therefore, our method has similar performance to HOS-Re and is significantly better than the SignFi method in this evaluation. According to Tables 3-6, we can conclude that SVD preprocessing does improve the performance of our dual-output two-stream network. As the CNN model in our dual-output two-stream network, ResNet50 is more suitable for CSI data in sign language recognition tasks than ResNet18. Although the accuracy of our proposed method is lower than that of other methods on the Lab276 and Lab150 groups, the best results are obtained in the mixed environment. This means that our method has better generalization capability than the other methods.

Discussion
Deep learning models have generally achieved great success due to the availability of massive datasets and extended model depth and parameterization. However, in practice, factors such as memory and computation time during training and testing are important considerations when choosing from the large number of available models. In addition, the success of deep learning also depends on the training data and on model generalization, which is very important for deploying models in practical use because it is difficult to collect training data and train individual models for every different environment. In other words, generalization capability is more important for practical use. According to the evaluation results shown in Table 5, our method has better practicability than the other methods.
The diversity of input data is very helpful in CNN-based methods. CNN-based methods extract features through training. Input diversity means that CNN can extract more types of features. This can avoid overfitting during network training. Therefore, the proposed dual-output two-stream network uses two modalities of input data and achieves good performance. Moreover, input data containing redundant and irrelevant noise must be preprocessed. This can be proved in the above experiments. Tables 3-6 show that SVD preprocessing can improve the performance of our dual-output two-stream network.
In this study, the experimental results also show that deep learning is not always superior. It can be seen from Table 6 that the HOS-Re method obtains the best result. This is a traditional machine learning method: it relies on manual feature engineering to calculate a large number of features and uses an SVM as the classifier. It differs from CNN-based methods such as the SignFi method and our method, which automatically extract features through training. From this evaluation, we can see that, as long as good features can be found, traditional machine learning based on feature engineering is still worthy of attention.

Conclusions
Sign language includes the positional relationship of gestures in space and the changes of actions over time. In this study, we proposed a dual-output two-stream network and provided two types of modality input data: the combined matrix of amplitude and phase, and its difference matrix. SVD was used to preprocess the input data. Then, an attention mechanism was used to select the features learned by the network. We evaluated our proposed network on the public SignFi dataset. Experimental results showed that SVD preprocessing improves the performance of our dual-output two-stream network. Compared with other methods, our method has good performance and better generalization capability.