Water Areas Segmentation from Remote Sensing Images Using a Separable Residual SegNet Network

Abstract: Changes in lakes and rivers are of great significance for the study of global climate change. Accurate segmentation of lakes and rivers is critical to the study of their changes. However, traditional water area segmentation methods almost all share the following deficiencies: high computational requirements, poor generalization performance, and low extraction accuracy. In recent years, semantic segmentation algorithms based on deep learning have been emerging. To address the problems of a very large number of parameters, low accuracy, and network degradation during the training process, this paper proposes a separable residual SegNet (SR-SegNet) to perform water area segmentation using remote sensing images. On the one hand, without compromising the ability of feature extraction, the problem of network degradation is alleviated by adding modified residual blocks into the encoder, the number of parameters is limited by introducing depthwise separable convolutions, and the ability of feature extraction is improved by using dilated convolutions to expand the receptive field. On the other hand, SR-SegNet removes the convolution layers with relatively more convolution kernels in the encoding stage, and uses the cascading method to fuse the low-level and high-level features of the image. As a result, the whole network can obtain more spatial information. Experimental results show that the proposed method exhibits significant improvements over several traditional methods, including FCN, DeconvNet, and SegNet.


Introduction
Lakes and rivers are the interactive connecting points of the atmosphere, biosphere, lithosphere, and land hydrosphere [1]. They are extremely sensitive to climate changes, and thus are able to reflect not only regional and global climate changes, but also local temperature changes [2]. Therefore, the study of lake and river changes is of great significance to the study of global climate change. The segmentation of lakes and rivers is an important first step in studying their changes. Traditional methods of water area segmentation mainly include thresholding, clustering, support vector machines, and so on. McFeeters et al. [3] proposed a normalized difference water index (NDWI). This method combines an image's green band and near-infrared band to construct a band for segmentation, but it is highly dependent on the environment. In 2000, Frazier et al. [4] classified the water body of the river beach based on maximum likelihood classification, but their method's generalization performance is poor, because there are obvious differences in different infrared
These algorithms have great prospects in processing remote sensing satellite images, segmenting pathological cells in medical images, and other fields. Although existing semantic segmentation algorithms perform well in remote sensing image extraction, the gradient vanishes as the convolution layers deepen during the training process, and thus the network performance degrades and its image segmentation accuracy is affected. In addition, a DNN normally has many convolution kernels, which increases the number of parameters of the network, and thus makes the training time-consuming and difficult [24]. To solve these problems, a separable residual SegNet (SR-SegNet) is proposed in this paper. A modified residual block is added to the encoder of SegNet to solve the problem of performance degradation.
Furthermore, depthwise separable convolutions [25] are introduced to reduce the number of parameters, shorten the training time, and cut down the calculation cost without compromising the network performance. In the proposed network, high-level features (from the deeper layers of the network) and low-level features (from the shallower layers of the network) are fused by cascading. In addition, dilated convolutions are used to expand the receptive field in convolution layers to improve the ability of feature extraction without increasing the number of parameters. Compared with FCN, SegNet, and DeconvNet, the SR-SegNet proposed in this paper improves the F1-score, mean intersection over union (Miou), and recall, while also reducing the testing time.
The remainder of this paper is organized as follows. In Section 2, the structure of SR-SegNet is discussed in detail. Section 3 presents experimental details of the proposed model using the Lake and River dataset. In Section 4, conclusions are given and future research directions are discussed.

Proposed Method
In this section, the architecture of the proposed SR-SegNet is presented in detail. The classical SegNet uses a large number of convolution kernels and a deep network structure to extract image features; thus, its training is slow and it is prone to the vanishing gradient problem (understood here as the drop in classification accuracy that occurs as the number of network layers increases). The segmentation of lakes and rivers in remote sensing images is limited by the images' spectrum, resolution, shadow, and other factors. In addition, there are some problems in the application of traditional methods, such as poor generalization performance and poor water area segmentation results. In this paper, we propose an SR-SegNet to solve these problems; the method can retrieve rich, multi-scale contexts for a more accurate segmentation, and it significantly decreases the number of training parameters and the image prediction time.

Model Overview
The classical SegNet has a large number of parameters. Therefore, vanishing gradients often occur and its ability of feature extraction deteriorates during the training process. Its number of parameters is large because there are too many convolution kernels. Moreover, 2× upsampling is performed five times during the decoding stage, resulting in long training and slow testing. In addition, because the classical SegNet simply performs upsampling during the decoding stage, it lacks the fusion of high-level and low-level semantic information; as a result, detailed location information could be lost during the segmentation of water area remote sensing images. To solve these problems, we propose a separable residual SegNet. In SR-SegNet, a modified residual block [26] is introduced in the encoding stage; the details are presented in Section 2.2.1. To limit the number of parameters associated with a large number of convolution kernels, we use depthwise separable convolutions for efficiency; more details can be found in Section 2.2.2. Finally, dilated convolutions are applied in our encoder to capture more spatial information of the water area. Figure 1 shows the detailed architecture of the proposed SR-SegNet. The entire network is divided into two parts: the encoder and the decoder. (1) In the encoding stage, a modified residual block is added to each convolution block to alleviate the degradation problem that often occurs in the training process [27]. (2) Considering that the training process is complicated by a large number of parameters, we perform upsampling only four times during the decoding stage, instead of the five 2× upsampling steps of the classical SegNet. We also remove the last five convolution layers in encoding stage. (3) To obtain detailed location information of a water body in a remote sensing image, a cascade method is used to fuse (add) both the deep and shallow features of the image.
(4) Furthermore, to simplify the network, depthwise separable convolutions [25] are introduced into the convolution layers in the encoding stage to reduce the amount of computation and the number of parameters during the training process. Based on the above four techniques, SR-SegNet v1 is constructed. (5) Because some 3 × 3 convolution kernels are replaced with 2 × 2 convolution kernels in the modified residual block, the receptive field is reduced to a certain extent, and hence the boundary extraction of an individual water body is not good. To solve this problem, SR-SegNet v2 with dilated convolutions is also proposed. Experiments show that v2 achieves good results, demonstrating the effectiveness of dilated convolutions. Table 1 shows the detailed information of the proposed SR-SegNet v2. The input is a water body remote sensing image with three channels (red, green, and blue), and the output is a binary segmentation map in which gray pixels denote the water body and black pixels denote the background.
The residual block, proposed by He et al. [26] in 2015, aims to solve the problem that training error increases as the network deepens, and helps alleviate the problems of vanishing and exploding gradients. Inspired by the residual block, and to further address the problem of small object misidentification in the segmentation of water area remote sensing images, a modified residual block is added to the encoder of the proposed SR-SegNet. Figure 2a shows a traditional bottleneck block in ResNet-50. This block protects information integrity by bypassing the input directly to the output. The residual block in ResNet-50 only needs to learn the difference between the input and output; this setup simplifies the learning objective and reduces the difficulty, and thus solves the degradation problem during the training process [28]. The residual block is especially suitable for recognizing small and medium-sized objects in water body remote sensing images. However, using 1 × 1 convolution kernels in the traditional bottleneck block may lose the semantic information of the water body. For better performance, two successive 2 × 2 convolution kernels are adopted in our modified residual block, followed by a 1 × 1 convolution kernel. To enlarge the receptive field and obtain more water body features without increasing the number of parameters, 2 × 2 dilated convolutions with a dilation rate of 2 are introduced into the first two layers of the modified residual block, which yields the same receptive field as 3 × 3 convolution kernels.
In this paper, dilated convolutions are added to the modified residual block during the encoding stage, as shown in Figure 2b. In the modified residual block, 3 × 3 convolution kernels are changed to 2 × 2 convolution kernels, which reduces the training complexity of the network and saves training time. However, with this structure, detailed information could be lost in the extraction of water bodies from remote sensing images. To further expand the receptive field during downsampling and to improve edge feature extraction and small target recognition in water body remote sensing images, this paper proposes SR-SegNet v2 by introducing dilated convolutions. Note that SR-SegNet v1 does not use dilated convolutions in its modified residual block. Compared with SR-SegNet v1, SR-SegNet v2 replaces 3 × 3 convolution kernels in its residual block with 2 × 2 dilated convolutions with a dilation rate of 2. In Figure 2b, the channel number of the first 2 × 2 convolution layer is twice that of the latter, and the final 1 × 1 convolution kernels use 64, 128, 256, 512, and 512 channels, respectively, for the five modified residual blocks in SR-SegNet. According to Equation (1), the receptive field of the standard 3 × 3 convolution kernel is 3, where $r$ is the receptive field size of the current layer, $m$ is the receptive field size of the previous layer, $stride$ is the convolution step size, and $K$ is the convolution kernel size.

$$ r = (m - 1) \times stride + K \tag{1} $$

Figure 3 shows a comparison between a standard convolution and a dilated convolution. As shown in Figure 3c, a standard 2 × 2 convolution is replaced by a dilated convolution with a dilation rate of 2, which is equivalent to inserting a zero between every pair of adjacent kernel elements. Similarly, a 3 × 3 convolution kernel with a dilation rate of 2 is equivalent to a 5 × 5 convolution kernel. Because these filled zeros do not need training, the dilated convolution can substantially expand its receptive field without increasing computational complexity. Equation (2) gives the receptive field of a dilated convolution, where $rate$ is the dilation rate, $K$ is the size of the convolution kernel, and $K_d$ is the effective size of the convolution kernel after dilation.

$$ K_d = K + (K - 1) \times (rate - 1) \tag{2} $$
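The kernel sizes quoted above can be checked with a few lines of Python. This is only a sketch: it assumes that Equation (1) has the common recursive form r = (m − 1) × stride + K and that Equation (2) has the form K_d = K + (K − 1) × (rate − 1); the function names are hypothetical and do not come from the paper's code.

```python
def receptive_field(prev_rf: int, kernel: int, stride: int = 1) -> int:
    """Assumed form of Equation (1): r = (m - 1) * stride + K."""
    return (prev_rf - 1) * stride + kernel


def dilated_kernel_size(kernel: int, rate: int) -> int:
    """Assumed form of Equation (2): K_d = K + (K - 1) * (rate - 1)."""
    return kernel + (kernel - 1) * (rate - 1)


if __name__ == "__main__":
    # A standard 3 x 3 kernel applied to single pixels (m = 1).
    print(receptive_field(prev_rf=1, kernel=3))
    # A 2 x 2 kernel with dilation rate 2 covers the same extent as a 3 x 3 kernel.
    print(dilated_kernel_size(kernel=2, rate=2))
    # A 3 x 3 kernel with dilation rate 2 covers the same extent as a 5 x 5 kernel.
    print(dilated_kernel_size(kernel=3, rate=2))
```

Under these assumptions, the 2 × 2 dilated kernel keeps only four trainable weights while spanning the 3 × 3 window described in the text.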

Depthwise Separable Convolution Construction
In the last two convolution blocks of SegNet, 3 × 3 convolution kernels, each with 512 channels, are used in each layer to increase the depth of the network and thus to extract more features. However, this layout produces a large number of parameters, resulting in a high computational burden and difficult training. In practice, most water body remote sensing images are of medium or even high resolution, and the traditional SegNet segments them slowly. To reduce the number of parameters without compromising the feature extraction, depthwise separable convolutions are introduced into the convolution layers. A depthwise separable convolution can be divided into two parts: a depthwise convolution and a pointwise convolution. Figure 4 shows the construction of depthwise separable convolutions. The depthwise convolution applies a spatial convolution to each channel of the input tensor separately, and the pointwise convolution applies standard 1 × 1 convolutions to fuse the outputs of the channels [29]. Figure 5 compares a standard convolution with the depthwise separable convolution. The standard convolution first performs a 3 × 3 × C convolution, then batch normalization (BN) [30], and finally the nonlinear ReLU function [31]. In contrast, the depthwise separable convolution first applies a 3 × 3 × 1 depthwise convolution, then batch normalization and ReLU, next a 1 × 1 × C pointwise convolution, and finally batch normalization and ReLU again. The standard convolution uses the complete 3 × 3 × C convolution kernel directly, whereas the depthwise separable convolution uses C single-channel 3 × 3 convolution kernels at the same time [32]. For example, if N standard K × K convolutions of C channels each were used, the number of parameters would be $NCK^2$. For the depthwise separable convolution, however, C single-channel K × K depthwise kernels are applied, one to each channel of the input, which first generates $CK^2$ parameters. Next, N pointwise convolutions of size 1 × 1 × C are used to aggregate the outputs, which generates $NC$ parameters. Therefore, the whole depthwise separable convolution has $CK^2 + NC$ parameters, far fewer than the standard convolution. The application of depthwise separable convolutions in water area segmentation not only shortens training time and reduces computation, but also helps avoid overfitting. Using depthwise separable convolutions makes the model easier to train and also reduces the prediction time.
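To make the saving concrete, the two parameter counts can be compared for one 3 × 3 layer with 512 input channels and 512 filters, the configuration mentioned for the last two convolution blocks of SegNet. The sketch below counts convolution weights only; biases and batch normalization parameters are ignored, and the function names are hypothetical.

```python
def standard_conv_params(n_filters: int, channels: int, k: int) -> int:
    """N standard K x K kernels over C channels: N * C * K^2 weights."""
    return n_filters * channels * k * k


def separable_conv_params(n_filters: int, channels: int, k: int) -> int:
    """Depthwise part (C * K^2) plus pointwise part (N * C)."""
    return channels * k * k + n_filters * channels


if __name__ == "__main__":
    n, c, k = 512, 512, 3
    std = standard_conv_params(n, c, k)
    sep = separable_conv_params(n, c, k)
    print(f"standard:  {std} weights")
    print(f"separable: {sep} weights")
    # The ratio equals 1/N + 1/K^2, so it is dominated by 1/K^2 for large N.
    print(f"ratio:     {sep / std:.4f}")
```

For this layer the separable form needs roughly one ninth of the weights, which matches the ratio 1/N + 1/K² implied by the two expressions above.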

Decoder Design
Although the end-to-end model can directly use a whole picture as the input and generate a whole picture as the output [33], image spatial information could be lost during the decoding stage. The U-Net proposed by Ronneberger et al. [34] uses concatenation between the encoder and decoder to fuse high-level and low-level image features and obtain more feature information.
Previous work shows that layer-by-layer upsampling does not improve prediction results; instead, it increases the complexity of the model and generates a large number of parameters. However, information from the encoding layers will be lost if the feature map is upsampled directly to the size of the input image. In view of the above, the proposed SR-SegNet combines an end-to-end cascading mode and adopts 4× unpooling for the first upsampling, instead of the traditional 2× unpooling. Next, 2× unpooling is conducted layer by layer. As a result, unpooling is performed only four times. In the encoder, the first four residual blocks are cascaded with the upsampling layers in the decoding stage. The cascading uses the fusion method to effectively obtain more spatial location information. This method combines high-level and low-level features, and can extract detailed features of water body remote sensing images, especially the edges in them [35].
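The unpooling schedule described above can be sanity-checked by tracking the spatial size of the feature map. The sketch below assumes, as in the classical SegNet, that the encoder halves the resolution five times; this count is an assumption, since the text only states the decoder schedule (one 4× unpooling followed by three 2× unpoolings). The function names are hypothetical.

```python
from typing import Iterable


def encoder_output_size(size: int, n_pools: int = 5) -> int:
    """Spatial size after n_pools successive 2x poolings (assumed count)."""
    for _ in range(n_pools):
        size //= 2
    return size


def decoder_output_size(size: int, factors: Iterable[int] = (4, 2, 2, 2)) -> int:
    """Spatial size after the unpooling schedule: one 4x step, then three 2x steps."""
    for factor in factors:
        size *= factor
    return size


if __name__ == "__main__":
    bottleneck = encoder_output_size(512)      # 512 -> 16 under the assumption
    restored = decoder_output_size(bottleneck)  # 16 -> 64 -> 128 -> 256 -> 512
    print(bottleneck, restored)
```

Under this assumption, four unpooling steps restore the 512 × 512 input resolution that the classical SegNet reaches with five 2× steps.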

Experiment and Result Analysis
To verify the effectiveness of the SR-SegNet proposed in this paper, experiments were carried out on the Lake and River dataset. Furthermore, other semantic segmentation models were used as control groups. All experiments were evaluated based on four major metrics: Accuracy (Ac), Dice, F1-Score (F1), and mean intersection over union (Miou). Experimental results show that the network proposed in this paper exceeded all compared networks on the evaluation metrics.

Data Augmentation
The experimental dataset includes remote sensing satellite images of Namtso Lake on the Qinghai-Tibet Plateau and of a river in Central China taken during 2015-2019, obtained from the China Center For Resources Satellite Data and Application (http://www.cresda.com/CN/). After undifferentiated classification, 32 training images and 7 testing images were prepared. Because the water body accounts for only a small part of an image, Adobe Photoshop CS6 software was used to cut the remote sensing images into small pieces of 512 × 512 pixels each, and Labelme was used for classification and annotation. The lake was classified as Category 1, and the background as Category 2. The cropped images and their corresponding labels are shown in Figure 6a. It is worth noting that there were only pictures of Namtso Lake in the training set, and there was no picture of other lakes or rivers. Furthermore, remote sensing images in the training set and in the test set were not of the same river. To distinguish different rivers, they are marked as River 1, River 2, and River 3. Deep neural networks need a large amount of training data, but such learning samples are difficult to obtain. Therefore, data augmentation is necessary to avoid overfitting when there are only a few training samples [36]. Thus, 5000 pictures were generated by scaling, translation, flipping, and rotation. According to a ratio of 7:3, 3750 pictures were divided into training set and 1250 pictures into validation set. Figure 6b shows the images and their corresponding labels after data augmentation.
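For segmentation data, every geometric transformation must be applied identically to an image and to its label so that the annotation stays aligned with the pixels. The sketch below illustrates this for flipping and 90° rotation on small nested lists; it is not the authors' pipeline, scaling and translation are omitted, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]


def hflip(grid: Grid) -> Grid:
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in grid]


def vflip(grid: Grid) -> Grid:
    """Reverse the row order (up-down flip)."""
    return [row[:] for row in grid[::-1]]


def rot90_cw(grid: Grid) -> Grid:
    """Rotate the grid by 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def augment_pair(image: Grid, label: Grid) -> List[Tuple[Grid, Grid]]:
    """Apply the same transform to the image and its label, one pair per transform."""
    transforms: Tuple[Callable[[Grid], Grid], ...] = (hflip, vflip, rot90_cw)
    return [(t(image), t(label)) for t in transforms]


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]   # toy single-channel "image"
    label = [[0, 1], [0, 0]]   # toy mask: 1 marks a water pixel
    for aug_image, aug_label in augment_pair(image, label):
        print(aug_image, aug_label)
```

Interpolating transforms such as scaling would additionally need nearest-neighbor resampling for the label so that class indices are not blended.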

Evaluation Metrics
To evaluate the quantitative performance of different models, four evaluation metrics were selected: Accuracy, Dice, F1-Score, and Miou.
Here, 'Ac' is the proportion of pixels in a whole picture that are correctly classified; 'Dice' measures the similarity between two pictures; 'Precision' is the proportion of correctly classified positive pixels among all predicted positive pixels; 'Recall' is the proportion of correctly classified positive pixels among all true positive pixels; 'F1' combines precision and recall; and 'Miou' describes the accuracy of segmentation [37]. TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The calculation formulas are shown in Equations (3)-(8).
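As an illustration, the metrics can be computed directly from the four confusion counts. The sketch below uses the standard textbook definitions, which are assumed (not confirmed) to match Equations (3)-(8); for the two-class case, Miou is taken as the mean of the water IoU and the background IoU. The function name and the example counts are hypothetical.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard binary segmentation metrics from pixel-level confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    iou_water = _ratio(tp, tp + fp + fn)
    # For the background class, TN plays the role of TP.
    iou_background = _ratio(tn, tn + fp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "miou": (iou_water + iou_background) / 2,
    }


if __name__ == "__main__":
    # Hypothetical counts for a 1000-pixel image.
    for name, value in segmentation_metrics(tp=80, tn=900, fp=10, fn=10).items():
        print(f"{name}: {value:.4f}")
```

With these definitions, Dice and F1 coincide for a single binary mask, so the two columns carry the same information in the two-class setting.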

Experiment Setting and Training
In the experiment, VGGNet was used as the backbone network, and the official VGGNet weights published by Keras were used as the pre-training weights. DeconvNet, FCN32s, FCN16s, and FCN8s were selected as the comparison networks. In this paper, SR-SegNet v1 and SR-SegNet v2 are proposed. The residual block of SR-SegNet v1 did not use dilated convolutions, and the residual block of SR-SegNet v2 used 2 × 2 dilated convolutions with a dilation rate of 2. During the training phase, the SGD optimizer [38] with an initial learning rate of 0.0001 was used. The momentum was set to 0.9 and the weight decay was set to 0.0005. All models were trained for 300 epochs with a mini-batch size of 2. All experiments were carried out under Windows 10 with an AMD Ryzen 7 2700 CPU (3.2 GHz), 16 GB of memory (RAM), and an NVIDIA GeForce RTX 2070 (8 GB). Python 3.6 was used and the experiments were based on the Keras programming framework. Furthermore, the cross entropy was used as the loss function of the neural network, as shown in Equation (9), where $x_i$ represents the sample; $p(x)$ and $q(x)$, respectively, represent two separate probability distributions of the random variable $x$; and $n$ is the number of samples.
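For reference, the cross entropy between a target distribution p and a predicted distribution q is commonly written as $H(p, q) = -\sum_{i=1}^{n} p(x_i) \log q(x_i)$; Equation (9) is assumed here to take this standard form. A minimal sketch with the natural logarithm follows; the function name and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import math
from typing import Sequence


def cross_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """H(p, q) = -sum_i p(x_i) * log q(x_i), with q clipped to avoid log(0)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return -sum(p_i * math.log(max(q_i, eps)) for p_i, q_i in zip(p, q))


if __name__ == "__main__":
    target = [1.0, 0.0]        # one-hot label: the pixel is water
    confident = [0.9, 0.1]     # prediction close to the label
    uncertain = [0.5, 0.5]     # uninformative prediction
    print(cross_entropy(target, confident))
    print(cross_entropy(target, uncertain))
```

The loss shrinks as the predicted probability of the true class approaches 1, which is what drives the parameter updates during training.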
The proposed water area segmentation system is illustrated in Figure 7. First, the Lake and River dataset was preprocessed to generate more data for neural network training through data augmentation; this technique increases the diversity of the data and effectively reduces overfitting during training [39]. Second, the dataset was divided into a training set and a testing set, and the images from the training set were fed into the model for training. The training procedure used the gradient descent algorithm. The labels were compared with the predicted results, and the parameters were updated continuously by calculating the loss function and using back propagation [40]. Finally, the model's optimal parameters were saved to predict and evaluate lake and river images in the testing set.

Result Analysis
The experiments show that the proposed SR-SegNet v1 and SR-SegNet v2 reduce the number of parameters by 71% and 65%, respectively, compared with the classical SegNet, and the training speed of v1 is improved by more than 10%. In addition, the Miou of v2 is improved by 2.37%. The results are shown in Tables 2 and 3.
Because the classical SegNet uses many convolution kernels, it has a large number of parameters, making the model difficult to train and converge. In this paper, two improved networks are proposed. For the convolution layers, we use depthwise separable convolutions instead of standard convolutions, which greatly decreases the number of parameters, shortens the training time, and makes the model easier to converge. Besides, the information loss caused by this technique is at an acceptable level. In the experiment, the number of parameters of SR-SegNet v1 is decreased by 71% and its training time is reduced by 18.3%. To compensate for the information loss caused by the use of depthwise separable convolutions, we introduce dilated convolutions into the modified residual block and propose SR-SegNet v2. SR-SegNet v2 has slightly more parameters than SR-SegNet v1, and its training time is reduced by 7.7% compared with the classical SegNet. To compare the performance of the models, we tested every model under the same conditions. To further demonstrate the generalization performance of the network, the network trained using the Namtso Lake dataset was used to identify other lakes. In Figure 8, the first row is Namtso Lake, the second row is Chaohu Lake, and the third row is Qinghai Lake. It can be seen in Figure 8 that FCN and DeconvNet both adopt a simple encoding-decoding structure, and thus they can identify the edge of the target only roughly. The image spatial information is ignored in the extraction by FCN and DeconvNet, and thus neither extracts the lake boundary finely enough. SR-SegNet can extract the lake location and boundary information better. In Figure 8f, it can be seen that SegNet has a better segmentation ability for all three lakes, but the details of the lakes are not extracted correctly, and the non-lake parts of Chaohu Lake are misidentified.
Note that SR-SegNet can effectively address the problems of network degradation and small lake recognition. Figure 9 shows the training curves of SR-SegNet v2 and SegNet. It can be seen that SR-SegNet v2 performs better than SegNet, and its training process is smoother, whereas SegNet has many fluctuations. This shows that the performance of the model is further improved when modified residual blocks are added. To demonstrate the superiority of the networks proposed in this paper, SegNet, SR-SegNet v1, and SR-SegNet v2 were each tested and evaluated. Note that the proportion of lakes in a remote sensing image is relatively large, and the proportion of rivers in a remote sensing image is relatively small. Remote sensing images of lakes and rivers, with more and fewer positive pixels, respectively, were both analyzed in the experiment. Table 4 shows the number of positive pixels (water body) and negative pixels (background) of six remote sensing images selected from the test images. The proportion represents the ratio of the number of positive pixels to total pixels. It can be seen that in a 512 × 512 pixel remote sensing image, the number of lake pixels is 6 to 10 times greater than that of rivers.
Table 4. Comparison of positive and negative pixels of test images. Positive Pixels, water body; Negative Pixels, non-water body; Proportion, the ratio of the number of positive pixels to the total pixels.
The segmentation results of the test pictures are shown in Figure 10. It can be seen that SegNet has a good segmentation ability for remote sensing images of lakes with more positive pixels. However, in the first row, the "small lake" near Namtso Lake is not recognized, which indicates that SegNet's ability to detect small targets is poor. In contrast, SR-SegNet v2 solves this problem, with dilated convolutions in its modified residual blocks.
This modification ensures the network's better performance and increases the receptive field of the convolution layer. In the second row, SegNet does not do well in Chaohu Lake's segmentation, and there is much noise around the lake. However, the two improved networks proposed in this paper avoid degradation during the training process and greatly reduce the noise, as a result of the introduction of modified residual blocks. The third row is the segmentation of Qinghai Lake; the extraction of the lake boundary is not fine enough, which is also a problem to be solved in the future. It is worth noting that remote sensing images of Namtso Lake were used in training in this experiment, and remote sensing images of Qinghai Lake and Chaohu Lake were not included in the training set. Nevertheless, SR-SegNet gives good segmentation results for Qinghai Lake and Chaohu Lake, which demonstrates the generalization abilities of the proposed networks. In the experiments, it was found that SegNet's river segmentation is not as good as its lake segmentation. This phenomenon has the following two explanations: first, the proportion of positive pixels in a river image is far less than the proportion of negative pixels; and, second, the extraction of river features is more complex than that of lake features. The classical SegNet performance degrades because there are too many deep layers, and a large number of training parameters is not helpful in dealing with complex features such as those of rivers. SR-SegNet v1 and SR-SegNet v2 are proposed to reduce the upsampling time and the number of training parameters by using depthwise separable convolutions. At the same time, to reduce the depth of the network without affecting feature extraction, the modified residual block is introduced into the encoding stage to alleviate the problem of network degradation and extract more information.
As shown in Figure 10, SR-SegNet is more effective in river segmentation, its results are closer to the real labels, and it can extract complex features which SegNet cannot. SR-SegNet v2 adds dilated convolutions to its convolution layers to further increase its receptive field. In the segmentation of Rivers 1 and 3, SegNet does not work well on small river detection, because it cannot extract enough spatial information. In contrast, with the introduction of modified residual blocks, our proposed method can effectively extract spatial information for small river identification. In Figure 10, SR-SegNet v2 is able to segment River 3. As its receptive field expands, SR-SegNet v2 achieves a better result on river reach segmentation, which supports the effectiveness of dilated convolutions.

The model-testing times are shown in Table 5. After introducing depthwise separable convolutions and the residual structure, the network is simplified. The average testing time of SR-SegNet v1 is 27% shorter than that of the classical SegNet. In SR-SegNet v2, with extra dilated convolutions, the network computation is increased. Therefore, its average testing speed is only about 10% faster than that of the classical SegNet.
The results of the quantitative comparison are summarized in Table 6. In experiments on three lakes, it can be seen that SR-SegNet does not greatly outperform the classical SegNet; it only shows a small improvement. SR-SegNet v2 has a 99.56% Ac and a 94.92% Miou in the lake extraction, 0.06% and 0.35% higher than those of the classical SegNet, respectively. However, for an image with complex rivers, where the number of positive pixels is far smaller than the number of negative pixels, the Ac and F1 of these three networks are relatively high, because there are fewer categories to classify, and the proportion of negative pixels in each image is very high. The Miou can therefore distinguish the three networks more accurately. For Miou, SR-SegNet v2 achieves the highest score with a gain of 2.46 percentage points over the classical SegNet (0.9311 vs. 0.9065), and SR-SegNet v1 achieves an improvement of 2.39 percentage points over the classical SegNet (0.9304 vs. 0.9065). The superiority of the proposed networks for river extraction is thereby verified.
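The reason Miou separates the networks better than Ac on river images can be seen with a toy calculation. The sketch below uses a hypothetical 512 × 512 image in which about 5% of the pixels are water (this proportion is an illustrative assumption, not a value from Table 4) and a degenerate model that predicts background everywhere; Miou is taken as the mean of the water and background IoU.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified pixels."""
    return (tp + tn) / (tp + tn + fp + fn)


def mean_iou(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of the water IoU and the background IoU (two-class case)."""
    iou_water = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    iou_background = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    return (iou_water + iou_background) / 2


if __name__ == "__main__":
    total = 512 * 512
    water = 13107              # hypothetical: roughly 5% of the pixels
    background = total - water
    # A model that labels every pixel as background misses all water pixels.
    tp, fp, fn, tn = 0, 0, water, background
    print(f"Ac:   {accuracy(tp, tn, fp, fn):.4f}")
    print(f"Miou: {mean_iou(tp, tn, fp, fn):.4f}")
```

In this extreme case Ac stays near 95% although no river pixel is found, while Miou falls below 50%, so differences in river extraction quality show up far more clearly in Miou.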

Verification Experiment
To further verify the generalization abilities of the models proposed in this paper, Cityscapes, a public dataset, was selected for a further experiment. Due to the limitation of computer memory, this experiment did not use all categories of the Cityscapes dataset. Only four categories, namely human, car, road, and background, were selected. Then 2975 pictures were used as the training dataset and 2975 pictures as the validation dataset. With the Adam optimizer, the initial learning rate was 0.0001, the weight decay was 0.0005, the batch size was 3, and the number of iterations was 160.
The research topic of this paper is water area segmentation. To verify the generalization performance and effectiveness of the algorithms proposed in this paper, we selected a different dataset for verification. The biggest difference between the Cityscapes dataset and the water area segmentation dataset is that their objects are different, but using different objects for verification can better demonstrate the generalization performance of the networks proposed in this paper. The experimental results are shown in Table 7. We can see that the training speed of v2 is increased by about 8.3%, and its Miou is also increased by 1.01%. Therefore, the generalization performance and effectiveness of the proposed network are verified.

Conclusions
In this paper, lake and river segmentation is improved by using SR-SegNet. Traditional decoding methods use 2× upsampling step by step, but this paper proposes to use 4× upsampling for the first upsampling step, and removes three convolution layers with 512 channels each in the decoding stage, thus removing a large number of parameters and improving the training speed. At the same time, to extract more deep features and ensure the model's segmentation accuracy, improved residual blocks are introduced into the encoding stage to solve the problem of network degradation. Furthermore, to obtain a larger receptive field and more spatial information, dilated convolutions are also added to the convolution layers, and the cascading method is used to fuse the low-level and high-level features of the image.
The quantitative comparison results with SegNet, FCN, and DeconvNet demonstrate that SR-SegNet outperforms the other models. Compared with the standard SegNet, SR-SegNet gains a 2.37% improvement in Miou and saves 10-27% in model-testing time on the Lake and River dataset. However, due to the complexity of the model, the proposed method still needs improvement in the following aspects: (1) make the training process converge more quickly and improve the model's training and prediction speed; (2) collect more data and further improve the generalization performance of the model; and (3) solve the problem of small-target identification.