Remote Sensing Time Series Classification Based on Self-Attention Mechanism and Time Sequence Enhancement

Nowadays, in the field of data mining, time series data analysis is a very important and challenging subject. This is especially true for time series remote sensing classification. The classification of remote sensing images is an important source of information for land resource planning and management, rational development, and protection. Many experts and scholars have proposed various methods to classify time series data, but when these methods are applied to real remote sensing time series data, there are some deficiencies in classification accuracy. Based on previous experience and the processing methods of time series in other fields, we propose a neural network model based on a self-attention mechanism and time sequence enhancement to classify real remote sensing time series data. The model is mainly divided into five parts: (1) memory feature extraction in subsequence blocks; (2) self-attention layer among blocks; (3) time sequence enhancement; (4) spectral sequence relationship extraction; and (5) a simplified ResNet neural network. The model can simultaneously consider the three characteristics of time series local information, global information, and spectral series relationship information to realize the classification of remote sensing time series. Good experimental results have been obtained by using our model.


Introduction
In recent years, the scale and length of time series data have exploded. People now often come into contact with time series data in their daily lives, for example stock prices, weather readings, biological observations, and operating status monitoring data. In today's era of big data and artificial intelligence, people increasingly rely on hidden information mined from time series data and use this information to benefit their lives. For example, in the medical industry, data are processed to understand the patient's health; in the financial industry, past stock price charts are analyzed to estimate future stock price trends; in the power industry, time series data of electricity consumption are analyzed to forecast future electricity consumption. Therefore, the quality of time series data processing directly affects our quality of life. Time series data analysis in the field of remote sensing not only affects personal lives and productivity; it also affects the country's land management, planning guidelines, and policies. Therefore, the processing of remote sensing time series has become particularly important.
At present, many experts and scholars are devoted to the research and analysis of time series data, and have put forward many methods for the analysis of time series data. Among the methods of analyzing time series data, the methods based on distance and deep learning are more popular. For a long period of time, distance-based methods have been frequently used for processing time series data. Additionally, it is common to use a combination of a nearest neighbor forest classifier and a distance function [1]. Time series B.K., et al. [20] proposed applying the self-attention layer to the distance metadata obtained based on the DTW algorithm. By processing the sequence of distance data obtained by the DTW algorithm, the problem of different labels with the same distance is solved. Chen, B., et al. [21] proposed using a combination of the self-attention mechanism and the GRU to process the time series. In this model, the self-attention mechanism is not used in the time dimension, but in the feature dimension. Singh, S.P., et al. [22] used LSTM and the self-attention mechanism to decode human behavioral activities. Pandey, A., et al. [23] and Pandey, A., et al. [24] used the self-attention mechanism combined with CNN and LSTM to enhance speech signals. Hao, H., et al. [25] proposed a sequence model named TCAN. This model uses the combination of TCN and the self-attention mechanism to realize the processing of sequence models. The basic unit in the model is the TCAN block. In the TCAN block, the self-attention operation is used before the TCN operation to strengthen the important part of the input sequence and weaken the unimportant part. Similarly, Lin, L., et al. [26] also used TCN combined with the self-attention mechanism to process medical sequence data to complete the diagnosis of myotonic dystrophy. The difference is that this model applies the self-attention mechanism to the output sequence of causal convolution and expansion convolution. 
As the TCN includes multiple hidden layers, multiple outputs can be derived from the attention layer. All the output sequences from the attention layer are composed into a new two-dimensional sequence. Then, the two-dimensional sequence is passed through the second self-attention layer to get the output of the model. Huang, Q., et al. [27] also used a combination of TCN and the self-attention mechanism to process audio signals. Some researchers pointed out that the self-attention mechanism uses a linear transformation to calculate the key vector K, query vector Q and value vector V of a specific time step without considering the local information around the elements, which may lead to a lack of local data features in the calculation of K, V and Q. Therefore, a convolutional self-attention mechanism was proposed [28]. The convolutional self-attention mechanism uses a one-dimensional convolution with a kernel size greater than 1 to obtain K, V and Q. Meanwhile, Yu, D., et al. [29] combined this convolutional self-attention mechanism with LSTM to predict the hourly power level.
In the field of time series remote sensing data analysis, many researchers also used the self-attention mechanism. Yuan, Q., et al. [30] proposed some challenges of deep learning in the remote sensing field. Rußwurm, M., et al. [31] made a comparison of several existing neural network models for processing remote sensing time series data, and pointed out that the performance of the self-attention mechanism and recurrent neural network were better than the convolutional neural network in processing original time series remote sensing data. Garnot, V.S.F., et al. [32] pointed out that the parallelism of the recurrent network was inferior to the self-attention mechanism, so they introduced the self-attention mechanism into the model to classify the remote sensing time series data, and achieved good results. Li, Z., et al. [33] also used a transformer model based on the self-attention mechanism to classify crops. In other applications of remote sensing, there are many other models that use attention mechanisms. For example, Li, X., et al. [34] used the self-attention mechanism to embed the remote sensing image scene, Jin, Y., et al. [35] proposed the GSCA module based on the attention mechanism to get global spatial contextual information for shadow detection, and Chai, Y., et al. [36] proposed setting attention transformers after each block of the backbone to obtain the semantic information and textural information for cloud detection. However, for scene classification of remote sensing images, more people use convolutional neural networks [37][38][39].
Therefore, we summarized the experience of our predecessors and the methods used to process time series in other fields, proposed a neural network model based on the self-attention mechanism and time sequence enhancement, and built a dataset from real remote sensing images to complete the experiment. Our method comprises five parts. The first part extracts the memory feature of each subsequence block. By slicing the original sequence sample, many subsequence blocks can be obtained, and we can then extract the memory feature vector of each subsequence block. In this process, each element can take into account the local feature information. The second part applies the self-attention mechanism to the sequence of all the subsequence blocks' memory feature vectors. Through this process, each subsequence block takes global information into account and captures long-range temporal dependence, similarly to the recurrent neural network, while the self-attention mechanism involves less time complexity than the recurrent neural network. The third part is time sequence enhancement, which takes into account the importance of different subsequence blocks in the time dimension. The fourth part is the spectral sequence relationship feature extraction, which obtains the unique relationship features between different spectra of the remote sensing time series data. The last part uses the ResNet deep neural network to classify the aforementioned extracted features. It should be noted that the ResNet in our model only uses its residual idea; it is a simplified version with only three residual blocks.
Remote Sens. 2021, 13, 1804
We propose such a model to classify the types of land cover. Its main function is to use the characteristics of the self-attention mechanism to grasp the important and unique parts in the time series of different land covers in order to classify them. For example, two types of land cover, bare land and buildings, have some areas that look very similar in remote sensing images. The woodland and bare rock on the mountain often cross and mix, and it is difficult to distinguish between them. Therefore, we need to capture the uniqueness of similar land covers in time sequences and realize the distinctions between them. Our model was also tested on real remote sensing images.
The innovations of our model are as follows:
1. We proposed a method that processes the subsequence block to obtain the most representative vector. These representative vectors better interpret the local characteristics of the original sequence. The self-attention mechanism can then be applied to the obtained representative vector sequence to consider global dependency in units of blocks;
2. Through the weight matrix obtained by the self-attention mechanism, we obtained the importance degree of each subsequence block, and could enhance specific blocks in the temporal dimension;
3. Our experiments were carried out on real multiband remote sensing data, and the self-attention mechanism was used to consider the internal relationship between each band of remote sensing data, so as to promote the classification of remote sensing time series.

Materials and Methods
In this section, we will introduce our data and our proposed model in detail. The model mainly uses the self-attention mechanism, time sequence enhancement, and spectral sequence relation extraction.

Time Series Remote Sensing Images and Time Series Classification
After geometric and radiation normalization, the remote sensing data essentially become a seamlessly organized and quantitative image tile in a two-dimensional space. Repeated observations of a long-term sequence of a region will inevitably produce a sequence of image tiles. If we organize the image tiles in the same area in the time series, it will provide four-dimensional data with band as the Z axis and time as the T axis [40]. Figure 1 shows the time series remote sensing data of the same area, wherein X and Y represent spatial dimension information, Z represents band, and T represents time.
In fact, for any classification problem, as long as the time sequence is considered, it can be called time series classification. When performing land cover classification for a single pixel in time series remote sensing data, the process takes into account the data of each band of the pixel within a certain time range. We arbitrarily took Landsat8 time series remote sensing data for a period of time, and visualized the time series of some samples. By observing these time series (Figure 2), we can identify a big difference in the trend according to the time series of different samples. Through these differences, we can divide the types of land cover into forest land, water bodies, buildings, and other types.
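To make the pixel-oriented view concrete, the following minimal sketch shows how one sample could be cut out of the four-dimensional data described above. The axis order (T, Z, Y, X), the array sizes, and the function names are assumptions chosen for illustration; they are not taken from the authors' code.

```python
import numpy as np

# Hypothetical cube layout (T, Z, Y, X): time steps, bands, rows, columns.
T, Z, Y, X = 23, 10, 4, 5
rng = np.random.default_rng(0)
cube = rng.random((T, Z, Y, X))  # random stand-in for real image tiles

def pixel_series(cube, row, col):
    """Return the (T, Z) time series of one pixel: one row per date, one column per band."""
    return cube[:, :, row, col]

def all_pixel_series(cube):
    """Flatten the spatial axes so that every pixel becomes one (T, Z) sample."""
    t, z, y, x = cube.shape
    # Move the spatial axes to the front, then merge them into a sample axis.
    return cube.transpose(2, 3, 0, 1).reshape(y * x, t, z)

sample = pixel_series(cube, 2, 3)   # shape (23, 10)
samples = all_pixel_series(cube)    # shape (20, 23, 10)
```

Each row of `samples` is then one candidate input for a time series classifier, with the pixel at (row, col) stored at index `row * X + col`.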
For remote sensing image data, it is not enough to consider only single-phase data. We need to consider the hidden information in the time dimension. In time series remote sensing data, there is a lot of phenological information offered by the Earth's surface. The information on land cover change hidden in the time dimension can help improve the classification of land cover types.
Therefore, time series remote sensing data classification can make full use of different types of phenological change information, and obtain more accurate results. In addition, in the context of current big data, with the continuous accumulation of remote sensing observation image data, the use of long-term series of land cover classification can help determine the law of land cover transformation under the influence of natural change and human activities, and better guide human social practices [41].
The reason we used the pixel-oriented method for classification is that it is better at finding the phenological change information of a certain land cover over time, and that this method is simpler for images with medium resolution, such as Landsat. If used on high-resolution remote sensing images, the pixel-oriented method will indeed be subject to certain restrictions. When considering the surrounding neighborhood's information, the method often involves a complicated process, with too many parameters, and it also has few samples and high dimensionality, which will have a definite impact on the feature extraction process.

Dataset
We used two datasets for our experiments. One of them was the standard dataset, and the other was the Landsat8 remote sensing data we downloaded and processed ourselves.

Benchmark Dataset
The experimental data come from the public dataset provided by the 2017 TiSeLaC time series land cover classification competition [42]. The original data were collected from 2A-level Landsat8 images of 23 scenes on Reunion Island in 2014. The study area covers 2866 × 2633 pixels at a spatial resolution of 30 m, and it contains 10 bands, including the first 7 bands of the original data (Landsat8 Band1 to Band7) and 3 index bands (NDVI, NDWI and BI).
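As an illustration of how a derived band of this kind is computed from the original bands, the sketch below evaluates NDVI with its standard definition, (NIR - Red) / (NIR + Red). On Landsat 8, Band 4 is red and Band 5 is near-infrared. The function name and the small `eps` guard against division by zero are assumptions of this sketch. NDWI and BI are left out because the exact definitions used by the competition are not given in the text.

```python
import numpy as np

def ndvi(red, nir, eps=1e-12):
    """Normalized Difference Vegetation Index from red and near-infrared reflectance."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    # eps (an assumption of this sketch) avoids 0/0 on pixels with no signal.
    return (nir - red) / (nir + red + eps)

# Dense vegetation reflects strongly in NIR, so its NDVI is high;
# equal reflectance in both bands gives an NDVI of zero.
values = ndvi([0.1, 0.3], [0.5, 0.3])
```

Because the function works element-wise, it can be applied to a whole (T, Y, X) stack of red and NIR bands to produce an NDVI time series per pixel.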
A total of 99,687 pixels were randomly sampled to form a dataset, which was divided into a training set of 81,714 pixels and a test set of 17,973 pixels. Figure 3 shows the pixel distribution after sampling. With reference to the CORINE Land Cover data for 2012 and the registration results of the land parcels reported by local farmers in 2014, the land cover of the study area was divided into 9 land cover types. Table 1 shows the detail of the data set.

Self-Selected Dataset
Our dataset is composed of the Landsat8 time series remote sensing data of some parts of Shenzhen in 2017. The area we selected is located in the overlapping area of the two images (path = 121, row = 44 and path = 122, row = 44). Therefore, our original data in this area contain 46 time steps and 11 bands. However, we processed the original remote sensing data to get L1GT-level data. We spliced the processed data, and then eliminated the time steps in which the cloud cover area was large. In the end, the remote sensing data we used included 22 time steps and 10 bands of data. These 10 bands included two quality control bands and eight bands with 30 m resolution.
As shown in Figure 4, the selected area of our dataset is located at the junction of Luohu District, Yantian District, and Longgang District in Shenzhen City. Luohu District was the first urban area to be developed in the Shenzhen Special Economic Zone. The terrain is high in the northeast and low in the southwest, with mostly hilly mountains and alluvial plains. The highest peak in Shenzhen, Wutong Mountain at 943 m above sea level, is located in the eastern part of the district. Yantian District is adjacent to Luohu District in the west and Longgang District in the north. The terrain is high in the north and low in the south, belonging to the coastal landform of low hills. In the north are Wutong Mountain and Meishajian, and the landform is mainly exposed bedrock and mountain forests. The terrain is basically composed of a mountainous landform zone in the north and a coastal landform zone in the south. Longgang District is located in the northeast of Shenzhen City, connecting Luohu District and Yantian District to the south. The natural environment of Longgang District is superior. The terrain is high in the northeast and low in the southwest, and it is in the coastal area of low hills. Longgang District is an important high-tech industry and advanced manufacturing base, with a regional GDP that ranks second in Shenzhen.
A variety of landforms are included in the data selection area, for example mountains, hills, woodland, cultivated land, lakes, sea, and coast. In addition, Wutong Mountain, the highest peak in Shenzhen, is located in our selected area. Therefore, we can mark a variety of ground object types in the remote sensing images, which is consistent with the subject we selected. At the same time, Shenzhen has also formulated a major plan to promote the urban renewal and secondary development of Luohu District, intending to build an ecological leisure and tourism area for citizens. Yantian District will take full advantage of its mountain and sea resources, Shenzhen-Hong Kong cooperation, and port hub. This will build Yantian Port into a comprehensive hub with modern influence in the Guangdong-Hong Kong-Macao Greater Bay Area, China, and the world, and build a high-quality urban area suitable for living, working, and traveling through a series of measures. Longgang District has further optimized its transportation layout and accelerated the construction of its transportation infrastructure, including its rail transit and high-speed highways. A study of these areas will certainly contribute to future planning implementation processes.
In order to derive a more accurate ground truth for each category, we first referred to the 2 m resolution image of the same area. Then, we considered displays with different band combinations to determine the ground truth of each category.
Finally, 12,803 sample pixels were selected and divided (8:2) into a training set and a test set that contained eight land types, namely bare land, woodland, water, arable, building, rock, road, and grass. We have used a different color for each type of land. The dataset is detailed in Table 2. Different combinations of bands indicate obvious differences among the land types. The true color image was synthesized from the red, green, and blue bands (as shown in Figure 5a). The image obtained by this combination is closer to the true color of the ground object, so we could determine different ground object types more intuitively, but the image was dull and the hue was gray. The composite image of swir1, nir, and blue (Figure 5b) shows a variety of vegetation types, which facilitated vegetation classification. The standard false color image (as shown in Figure 5c), synthesized from the nir, red, and green bands, shows ground objects in bright colors, which was conducive to vegetation (red) classification and water body recognition. The nonstandard false color image (as shown in Figure 5d) was synthesized from nir, swir1, and red. This image has a clear water boundary, which was conducive to the identification of the coast, and gives a better display of vegetation, but it is not convenient for distinguishing specific vegetation types.
In Figure 6, the region of interest we selected is displayed, and one can see that we fully considered the distribution characteristics of the ground objects in the image when selecting the samples. Additionally, we selected samples from areas with various features. The degree of separation between each pair of categories is shown in Table 3. In Table 3, the two values in each cell are the Jeffries-Matusita distance and the Transformed Divergence; the closer a value is to 2, the higher the degree of separability.
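For readers unfamiliar with the separability measures above, the following sketch computes the Jeffries-Matusita distance in its usual form, JM = 2(1 - exp(-B)), where B is the Bhattacharyya distance between two classes modeled as multivariate Gaussians. This is the textbook definition, not necessarily the exact routine used to produce Table 3. The function name and the synthetic samples are assumptions for illustration, and Transformed Divergence is not shown.

```python
import numpy as np

def jeffries_matusita(x1, x2):
    """JM distance between two classes, each given as an (n_samples, n_features) array
    and modeled as a multivariate Gaussian. The result lies between 0 and 2."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1 = np.atleast_2d(np.cov(x1, rowvar=False))
    c2 = np.atleast_2d(np.cov(x2, rowvar=False))
    c = (c1 + c2) / 2.0
    diff = m1 - m2
    # Bhattacharyya distance: mean-separation term plus covariance-mismatch term.
    mean_term = diff @ np.linalg.solve(c, diff) / 8.0
    _, logdet_c = np.linalg.slogdet(c)
    _, logdet_c1 = np.linalg.slogdet(c1)
    _, logdet_c2 = np.linalg.slogdet(c2)
    cov_term = 0.5 * (logdet_c - 0.5 * (logdet_c1 + logdet_c2))
    b = mean_term + cov_term
    return 2.0 * (1.0 - np.exp(-b))

rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, size=(500, 3))   # synthetic samples, 3 features
class_b = class_a + 20.0                        # same spread, far-away mean
jm_same = jeffries_matusita(class_a, class_a)
jm_far = jeffries_matusita(class_a, class_b)
```

Identical classes give a distance of essentially zero, while well-separated classes approach the upper bound of 2, which matches the reading of Table 3 given in the text.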

Model Structure
The structure (Figure 7) of the whole model can be divided into two main parts, namely, the feature extraction of remote sensing time series data and the classification of the ResNet neural network. We had to find the representative features of each category, and then classify the time series data according to these significant features. In our proposed neural network model, we mainly considered four kinds of features of remote sensing time series data, including the local intra-block memory feature, the inter-block correlation feature, the time sequence importance feature, and the spectral sequence correlation feature.

In the structure of our proposed model, the length of the input time series is T, and the feature dimension is D. First, the original data are lifted to a higher dimension through a convolution operation to derive a hidden representation of the input time series. Then, the result of the convolution is sliced into many subsequences. The subsequence length is BLOCK-NUM, a hyperparameter that we set. Slicing starts from the first time step of each sample sequence, with a step size of 1 and a slice length of BLOCK-NUM, so that many subsequences with the same shape are obtained. Applying the self-attention mechanism within each subsequence yields a new sequence in which each element takes its local features into account. Similar to the encoder in the encoder-decoder model, which encodes a sequence into a fixed-length semantic vector, our model uses a convolution whose kernel size is BLOCK-NUM to obtain the memory feature vector that represents each subsequence. The memory feature vectors of all subsequences are spliced into a new sequence. In this way, the local intra-block memory feature of the subsequence blocks is extracted.
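The slicing step and the collapse of each block into one memory feature vector can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hidden size `H`, the memory size `M`, the random untrained kernel, and the function names are all hypothetical. The intra-block self-attention step is omitted for brevity, so only the slicing and the BLOCK-NUM-wide convolution are shown.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def slice_blocks(hidden, block_num):
    """Slice a (T, H) sequence into overlapping blocks with step size 1.
    Returns an array of shape (T - block_num + 1, block_num, H)."""
    windows = sliding_window_view(hidden, window_shape=block_num, axis=0)
    # sliding_window_view appends the window axis last: (N, H, block_num).
    return windows.transpose(0, 2, 1)

def memory_features(blocks, kernel, bias):
    """Collapse each block into one vector, as a convolution whose kernel spans
    the whole block would. blocks: (N, block_num, H); kernel: (block_num, H, M)."""
    return np.einsum("nbh,bhm->nm", blocks, kernel) + bias

# Hypothetical sizes: 23 time steps, 16 hidden channels, 32 memory channels.
T, H, M, BLOCK_NUM = 23, 16, 32, 5
rng = np.random.default_rng(0)
hidden = rng.normal(size=(T, H))                     # stand-in for the convolved input
blocks = slice_blocks(hidden, BLOCK_NUM)             # (19, 5, 16)
kernel = rng.normal(size=(BLOCK_NUM, H, M)) * 0.1    # untrained weights
memory = memory_features(blocks, kernel, np.zeros(M))  # (19, 32)
```

With a step size of 1, a sequence of length T yields T - BLOCK-NUM + 1 blocks, so the spliced sequence of memory feature vectors is only slightly shorter than the input.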
After that, we passed the sequence spliced from all memory feature vectors through a self-attention layer again, and derived a new sequence that introduces the inter-block correlation feature. From the weight matrix obtained when applying the self-attention mechanism among all the memory feature vectors, the importance degree of each block in the time sequence was calculated. By multiplying the sequence spliced from all memory feature vectors by the importance degree vector, we could derive the sequence that introduces the time sequence importance feature. Finally, the sequence that introduces the time sequence importance feature and the sequence that introduces the inter-block correlation feature were added together to combine the two kinds of features. The above operations were carried out on the time dimension. In order to consider the correlation information between various spectral sequences, we used the self-attention mechanism on the spectral dimension of the input sequence and derived a new sequence that introduces the spectral sequence correlation feature. The input sequence changed dimensions through convolution so that it could be spliced with the features listed above to derive the final feature sequence. The feature sequence was finally entered into a ResNet network for classification.

Self-Attention Mechanism
Vaswani, A., et al. [17] proposed the self-attention mechanism for the first time and applied it to machine translation. The model proposed in that paper completely abandons recurrent and convolutional neural networks, uses only the self-attention mechanism to deal with sequence problems, and achieves excellent results.

The Principle of the Self-Attention Mechanism
The difference between the self-attention mechanism and the traditional attention mechanism is that the self-attention mechanism considers the interaction among the various elements within the sequence. The self-attention mechanism mainly includes three parts: (A) calculate the query vector, key vector, and value vector for each time step; (B) calculate the weight matrix; (C) calculate the weighted sum [43,44]. Figure 8 shows the structure of the self-attention mechanism.
The expression of the self-attention mechanism is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
where Q, K, and V are the query vector, key vector and value vector, respectively, and d is the dimension of the key vector.
The calculation method of these three vectors is as follows:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
where X is the data of one time step, and $W_Q$, $W_K$ and $W_V$ are parametric matrices.
In the process of calculating the weight matrix W, each time step uses its own Q and the transpose of the K of every time step to compute a dot product score. After deriving the dot product scores for each time step, we input the result into a softmax layer to get the weight of each time step's influence on the current time step. Finally, the value vector of each time step is multiplied by the corresponding weight, and the results are added together to derive a new vector that represents the current time step. As such, each time step receives a new representation.
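The three steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names, the random projection matrices, and the sizes T, D and d below are hypothetical and are not taken from our implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (T, D) sequence; W_Q, W_K, W_V: (D, d) parametric matrices."""
    # Step A: query, key and value vectors for every time step.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = K.shape[-1]
    # Step B: (T, T) weight matrix; row i holds the influence of
    # every time step on time step i.
    W = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    # Step C: weighted sum of the value vectors.
    return W @ V, W

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, D, d = 5, 8, 4
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(D, d)) for _ in range(3))
out, W = self_attention(X, W_Q, W_K, W_V)
```

Each row of `W` sums to 1 because the softmax is taken over the scores of one query against all keys.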

Intra-Block and Inter-Block Self-Attention
The process of intra-block and inter-block self-attention can be divided into two parts. The first part involves slicing the original sequence and then using the self-attention mechanism in the subsequence. The second part involves using self-attention mechanisms on all the blocks. Figure 9 shows these two parts.
In the processing of subsequence blocks, our method is different from the DP-SARNN proposed by Pandey, A., et al. [24]. Instead of combining an RNN and self-attention as in SARNN, we used only a self-attention mechanism and abandoned the recurrent part, which can greatly reduce memory occupation and improve the efficiency of the model. In addition, the methods of acquiring the Q, K and V vectors in the self-attention part are also different from those in the DP-SARNN model. In DP-SARNN, layer normalizations are used for obtaining the Q, K and V for each time step, whereas in the model we present, we used convolution for self-attention [28]. When this method is used to obtain Q, K and V, the weights are shared across time steps, which can reduce the training burden of the model compared with the traditional self-attention mechanism.
The model we propose also differs from the MCNN proposed in Cui, Z., et al. [8] in terms of the slicing method. In MCNN, subsequences of different scales are obtained by down-sampling at different scales. In our model, subsequences are acquired by starting from the first time step of each sample sequence and taking 1 as the move step size and BLOCK-NUM as the slice length.
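The slicing rule described above (start at the first time step, move step size 1, slice length BLOCK-NUM) can be illustrated with the following minimal sketch. It assumes that no padding is applied, so a sequence of length T yields T - BLOCK-NUM + 1 blocks; the function name `slice_blocks` and the toy input are hypothetical.

```python
import numpy as np

def slice_blocks(x, block_num):
    """Slice x of shape (T, C) into overlapping blocks with stride 1.

    Returns an array of shape (T - block_num + 1, block_num, C).
    Assumption: no padding, so only complete blocks are kept.
    """
    T = x.shape[0]
    if block_num > T:
        raise ValueError("block_num must not exceed the sequence length")
    return np.stack([x[s:s + block_num] for s in range(T - block_num + 1)])

# Toy sequence with 10 time steps and 2 channels; block length 6.
x = np.arange(10 * 2).reshape(10, 2)
blocks = slice_blocks(x, block_num=6)
```

With these toy sizes, `blocks` holds five blocks, and consecutive blocks overlap in all but one time step.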
Each result from the self-attention mechanism is added to the original subsequence that is the input of the self-attention mechanism. Then, the result of this addition is convolved into a one-dimensional vector using a convolution layer. This one-dimensional vector takes into account the local features of the subsequence and can be used as the memory feature vector of the subsequence. Finally, all the memory feature vectors are spliced into a new sequence, which passes through the inter-block self-attention mechanism. This operation allows each subsequence to take into account all other subsequences, thus introducing the characteristics of global information for the entire sequence.
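The following sketch illustrates how a block can be reduced to a single memory feature vector and how the memory vectors are then attended to as a sequence. It is a simplification under stated assumptions: the query, key and value are taken to be the input itself (our model obtains them by convolution), the convolution whose kernel spans the whole block is written as a single contraction, and all names and sizes (`attend`, `memory_vector`, B, C, F) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attend(x):
    # Simplifying assumption: Q = K = V = x (no learned projections).
    d = x.shape[-1]
    w = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return w @ x, w

def memory_vector(block, kernel, bias):
    """block: (B, C); kernel: (B, C, F); bias: (F,). Returns (F,)."""
    attended, _ = attend(block)
    residual = attended + block  # add the attention output to its input
    # A convolution without padding whose kernel covers all B time steps
    # produces exactly one output position, i.e. one vector per block.
    return np.einsum("bc,bcf->f", residual, kernel) + bias

rng = np.random.default_rng(0)
B, C, F = 6, 4, 4                        # block length, channels, filters
blocks = rng.normal(size=(5, B, C))      # five blocks from the slicing step
kernel = rng.normal(size=(B, C, F))
bias = np.zeros(F)

# Intra-block step: one memory feature vector per block, spliced in order.
memory = np.stack([memory_vector(b, kernel, bias) for b in blocks])
# Inter-block step: self-attention over the sequence of memory vectors.
global_seq, W_inter = attend(memory)
```

Here `W_inter` plays the role of the inter-block weight matrix, with one row and one column per block.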

Time Sequence Enhancement
Time sequence enhancement is inspired by the TCAN model proposed by Hao, H., et al. [25]. The purpose of time sequence enhancement is to find the important parts of the time dimension, and to enhance these important parts and weaken the unimportant parts. In the TCAN model, in order to prevent the leakage of future information, the self-attention mechanism does not consider the information of the whole time series, but only considers the sequence information before the current time step. It is reasonable to apply such a self-attention mechanism in the domain of time series prediction. However, it does not apply to time series classification, because in the classification process all time steps affect the final classification result. We therefore suggest considering the data for all the time steps in the time series.
The time sequence enhancement process uses the weight matrix obtained from the inter-block self-attention process. Figure 10 shows the process. The input of the inter-block self-attention mechanism is the concatenation, C, of all memory feature vectors. After the inter-block self-attention mechanism, we can derive a new sequence with global correlation characteristics and a weight matrix, W. In this part, we need to use the weight matrix, W. Each of its rows contains the weights with which the other blocks influence the current block; that is, W(i,j) represents the influence weight of the j-th block on the i-th block. If we add up all the entries in the j-th column, we integrate the effect of the j-th block on all the other blocks, which gives us a rough idea of the importance of the j-th block in the entire time series. Therefore, we add the rows of the weight matrix, W, together, which yields the sum of each column. Then, the result is passed through the softmax layer, and we derive a one-dimensional vector. Each element in this vector represents the importance of one block.
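The computation of the block importance vector can be sketched as follows. The sketch assumes that the enhanced sequence is obtained by scaling the memory feature vector of each block by that block's importance; the function name and the toy matrices are hypothetical and only serve to make the column-sum and softmax steps concrete.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def time_sequence_enhancement(C, W):
    """C: (N, F) memory feature vectors; W: (N, N) inter-block weights,
    where W[i, j] is the influence weight of block j on block i."""
    column_sums = W.sum(axis=0)        # total influence of each block j
    importance = softmax(column_sums)  # (N,) importance of each block
    # Assumption: each block's vector is scaled by its own importance.
    return C * importance[:, None], importance

# Toy weight matrix in which the first block influences all blocks most.
W = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1]])
C = np.ones((3, 2))
enhanced, importance = time_sequence_enhancement(C, W)
```

In this toy case the column sums decrease from the first block to the last, so the importance values decrease in the same order.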

Spectral Sequence Relationship Extraction
Wu, Z., et al. [45] introduced the relationship of multi-source data. Multispectral remote sensing images contain data for multiple bands, so in multispectral remote sensing time series data, every band of every pixel has its own time series. Whether this means that there is a specific relationship between the different spectral time series within a given land class is an open question. We use the self-attention mechanism in the spectral dimension of the original multispectral remote sensing time series, and make use of the specific correlation among the different spectral sequences of each class to classify them.
Every sample is made up of two-dimensional sequence data: the first dimension of the sample is the time dimension, and the second dimension is the spectral dimension. Suppose the data have T time steps and D bands. When the self-attention mechanism is used in the time dimension, the smallest element is a one-dimensional vector composed of the D spectral values of one time step, and the purpose is to find the correlation between different time steps. Therefore, in order to find the correlation within the spectral dimension, we need to use the self-attention mechanism on the spectral dimension. In that case, the smallest element is the T-step time series of one band. Accordingly, we can apply the correlation between spectral sequences to the classification of remote sensing time series.
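The difference between the two uses of self-attention comes down to which axis supplies the elements, which the following sketch makes explicit by transposing the sample. It assumes identity projections for the query and key (no learned parameters), and the sizes T and D are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_weights(tokens):
    """tokens: (N, L), one row per element. Returns the (N, N) weights.

    Simplifying assumption: queries and keys are the tokens themselves.
    """
    d = tokens.shape[-1]
    return softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
T, D = 12, 8                             # hypothetical time steps and bands
sample = rng.normal(size=(T, D))

# Time dimension: each element is the D-value vector of one time step.
W_time = attention_weights(sample)       # shape (T, T)
# Spectral dimension: each element is the T-step series of one band.
W_band = attention_weights(sample.T)     # shape (D, D)
```

`W_band[i, j]` can then be read as the influence weight of band j on band i.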

ResNet
After extracting all the features and fusing them, we input the fused features into a ResNet network. To fuse the features, we first sum the local features extracted by the self-attention mechanism within every subsequence, the global features extracted by the self-attention mechanism among the subsequences, and the time sequence enhancement features. We then splice this sum with the spectral sequence relationship features and the original features to derive the final fusion feature. It should be noted that we only used the idea of residuals [46], and did not use a very deep network structure. The structure used in our model is similar to the ResNet described in Fawaz, H.I., et al. [6]. Figure 11 shows the structure of the simplified ResNet.
In this structure, we only use three residual blocks to process and classify the fusion features. There are three convolutions in each residual block, and the number of convolution kernels is the same for the three convolutions within a block. The numbers of convolution kernels in the three residual blocks are 192, 256 and 256, respectively, and each convolution is followed by a batch normalization layer and a ReLU activation layer. Finally, there is a global pooling layer and a softmax layer.

Results
In this section, we will introduce the results of experiments conducted on two datasets. Our experimental process includes a comparison among different models and ablation experiments.

Experimental Setup
We selected two other models for time series classification as baselines: long short-term memory (LSTM) and the temporal convolutional network (TCN). Figures 12 and 13 show the structures of the LSTM and TCN models, respectively.
The LSTM model consisted of a single recurrent LSTM layer with 64 units, followed by a batch normalization layer and a ReLU activation layer. The last layer was a softmax layer.
For the TCN, we used a TCN package integrated with Keras. In our model (OURS), the length of the subsequence block was 6. For the self-attention mechanism, the dimensions of the query vector, key vector and value vector were 64. The numbers of convolution filters in the three residual blocks were 192, 256 and 256, respectively, and each convolution was followed by batch normalization and a ReLU activation function. Table 4 shows the hyperparameters of the models. In [6], the authors used a large number of deep learning network models to classify a large number of time series datasets, and we summarized the hyperparameters of several network models with better classification performance.

Results on Benchmark Data
We will first introduce the results for the standard dataset.

Result Comparison
We used the models and parameter settings from Section 2.1 for our experiments. We used the trained models to classify the test set and derived the confusion matrix of the classification results of each model. Tables 5-7 show the confusion matrices of LSTM, TCN and OURS, respectively.

We used the five evaluation indicators in Table 8 to evaluate the above models. Our model performed best on every evaluation indicator.

Ablation Experiment
In this section, we removed some of the branches from our model for comparative experiments. To verify the branches separately, we deleted either the self-attention part or the spectral sequence relationship feature part from our model. In the first model, we removed the self-attention part used in the time dimension and retained only the spectral sequence relationship features. Conversely, in the second model, we removed the spectral sequence relationship features that operate in the spectral dimension and retained the self-attention part in the time dimension. In the last model, we removed both branches.
Combining Tables 8 and 9, we can see that each single-branch model is better than the LSTM and TCN, but not as good as the model that combines the features of the two branches. The reason for this may be that the feature fusion between the branches compensates for their respective shortcomings and achieves a complementary result. The model with both branches removed also performed better than LSTM and TCN, indicating that our choice of a ResNet-based structure was sound. After introducing our self-attention part and the spectral sequence relationship features, the performance of the model improved further.

Results on Self-Selected Dataset
We used trained models to classify the data from the test set. We used the five indicators of "precision", "accuracy", "recall", "f1-score", and "kappa-score" for evaluation.
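For reference, the five indicators can all be derived from a confusion matrix, as in the sketch below. It assumes that precision, recall and f1-score are macro-averaged over the classes and that every class occurs at least once among both the true and the predicted labels; the averaging scheme, the function name, and the toy two-class matrix are assumptions made for illustration and are not taken from our experiments.

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[i, j]: number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)    # per class, over predicted counts
    recall = tp / cm.sum(axis=1)       # per class, over true counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / total
    # Agreement expected by chance, from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "precision": precision.mean(),  # macro average (assumption)
        "accuracy": accuracy,
        "recall": recall.mean(),
        "f1-score": f1.mean(),
        "kappa-score": kappa,
    }

# Toy two-class confusion matrix, for illustration only.
scores = metrics_from_confusion([[8, 2],
                                 [1, 9]])
```

The kappa-score corrects the accuracy for the agreement that would be expected by chance given the class marginals.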
From Table 10, we can see that the model we proposed was better than the other models in terms of the classification of the test set. Moreover, the accuracy of the classification results for the test set is very high. We think that part of the reason is that the remote sensing image range of the dataset we selected was too small. The sample similarity within each category was relatively high, which likely inflated the final classification accuracy. However, even so, our model performed better than the other models on the same dataset. Table 11 shows the confusion matrix of our model's classification results on the test set.
We used the trained model to predict and classify the original 900 × 750 pixel points, and used the predicted results to derive a distribution map for the whole remote sensing image. Figure 14 shows the prediction classification results of the original 900 × 750 pixel points.
The above classification results show that the LSTM model is very suitable for the classification of roads. The LSTM result in Figure 15 shows the roads in more detail, whereas our model blurs some dense road areas. However, the LSTM model has deficiencies in distinguishing between bare land and cultivated land. For example, in Figure 16, we can see that the LSTM model incorrectly classified some bare land as cultivated land.
The accuracy of the TCN model on the test set was similar to that of our model, but still lacked quality in some places. For example, in Figure 17, we can see that the TCN did not delineate the entire road, but classified the back part of the road as buildings.

Discussion
Whether it is on a standard dataset or a dataset of our choice, we can come to the conclusion that the combination of the self-attention mechanism and the correlation among multiple bands is beneficial to the time series classification of remote sensing data.
In this section, we will use the results obtained on the standard dataset to further explain the features of each part of the branch we proposed. We will first introduce the correlation features of the spectral sequence, and then the inter-block feature matrices and time sequence enhancement features of the self-attention part.

Spectral Sequence Relationship Feature Visualized Analysis
For multi-band remote sensing time series, the self-attention mechanism is used in the band dimension to obtain the relationship among each band sequence, and we have visualized this relationship. Figure 18 shows the visualization of spectral sequence relationship feature.
Overall, we see that the eighth band (NDVI) has the lowest impact on the other bands. In the four types of samples, forests, grassland, other crops and sugarcane crops, there are similarities in the distribution maps of the impact levels between the bands. The reason for this may be that all four types of land cover have green plants. This leads to similar distribution diagrams amongst the various bands.
The band relationship distribution maps of urban areas, other built-up surfaces, rocks, and bare land are also similar. The reason for this may be that the other built-up surfaces category includes some surface coverage similar to urban areas, such as similar buildings. Moreover, there are some impervious objects in these categories. Finally, water and sparse vegetation behave very differently from the other surface coverage categories.

Inter-Block Self-Attention Matrix Visualized Analysis
In the process of global feature extraction, we use the degree of influence among the different subsequence blocks of each sample. In the model training process, this feature can be expressed as a weight matrix. The weight matrices of different land cover types should differ. Therefore, through the output and visualization of the intermediate results of the model, we have obtained visualizations of the weight matrices of different types of features. In the distribution diagram, the darker the color, the smaller the degree of influence. Figure 19 shows the visualization of the inter-block self-attention matrix.
For each type of land cover, we selected the inter-block influence matrix of two samples for visualization. Although the distribution diagrams of different samples in the same category are not the same, we found two similar distribution diagrams for each category through selection.
In addition, sparse vegetation and water are still the easiest to distinguish from other types of land cover. In the distribution graph of sparse vegetation, there is a dark band at (0, 0-18). The overall distribution of water is brighter.

Time Sequence Enhancement Feature Visualized Analysis
In the process of time sequence enhancement feature extraction, we can find the importance of each subsequence block in terms of timing. In the same way, we can output the intermediate results of the model and derive the timing importance curves of the different types. Figure 20 shows the visualization of the time sequence enhancement feature.
In a sense, the timing importance curve exhibits a strong relationship with the inter-block influence matrix. If a certain column in the inter-block influence matrix is a dark band, then the corresponding position on the timing importance curve will have a relatively low value.
In the timing importance curves, we can also see some differences between different samples. For example, the curve of the forests sample is low where the horizontal axis equals 7 or lies within the range of 12.5-15, whereas it peaks in the range of 3-5 or the range of 8-12. On the curve of rocks and bare soil, the value is significantly low in the range of 5-7 or 10-12. The curve of the sparse vegetation sample maintains a relatively high value above 9.
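The link between dark columns of the inter-block influence matrix and low points on the timing importance curve can be illustrated with a simple proxy. The sketch below is not the time sequence enhancement module itself: it assumes, purely for illustration, that the importance of a block is approximated by the average weight its column receives. The matrix values and the helper name `column_means` are hypothetical.

```python
def column_means(weights):
    """Average each column of a square weight matrix, i.e. how much
    attention block j receives from all blocks on average."""
    n = len(weights)
    return [sum(row[j] for row in weights) / n for j in range(n)]


# Hypothetical 4 x 4 inter-block weight matrix (each row sums to 1).
# Column 2 is uniformly small, i.e. a "dark band" in the heat map.
W = [
    [0.40, 0.35, 0.05, 0.20],
    [0.30, 0.40, 0.05, 0.25],
    [0.35, 0.30, 0.10, 0.25],
    [0.30, 0.35, 0.05, 0.30],
]

received = column_means(W)
# Index of the block that receives the least attention on average.
darkest = min(range(len(received)), key=received.__getitem__)
print(received, darkest)
```

Under this assumption, the block whose column forms the dark band is also the block with the lowest proxy importance, which matches the qualitative behavior described for the two visualizations.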

Conclusions
In our proposed model, we extract sample features from each time series sample. We first obtain the memory feature vectors of the subsequences and then apply the self-attention mechanism among these feature vectors, so that subsequence processing takes into account both the local and global features of the time series. We then apply the self-attention mechanism along the spectral dimension of the remote sensing data to determine the relationships among the bands of the time series. The fusion of these features gives our final sequence more comprehensive information. However, to extract the relational features of spectral sequences, our model uses only a simple self-attention mechanism, which does not fully capture the characteristics of spectral sequence relations.
Considering the current rapid development of graph convolution, in future work we will consider applying graph convolution to extract the relationships among the spectral bands. Different bands can be treated as different nodes, so that graph convolution can exploit the characteristics of the relationships between the bands.
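To make the idea of bands as graph nodes concrete, the following minimal sketch applies one step of a commonly used graph convolution, ReLU(D^(-1/2) (A + I) D^(-1/2) H W), to a toy band graph. It is only an illustration of the general technique, not a design for the planned work: the three-node graph, the band features, and the weight matrix are all hypothetical, and how the adjacency between bands would actually be defined is left open.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a symmetric adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(adj, features, weight):
    """One propagation step: ReLU(A_norm @ features @ weight)."""
    out = matmul(matmul(normalized_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Hypothetical graph: 3 bands as nodes; band 0 is linked to band 1,
# and band 1 is linked to band 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],   # hypothetical per-band feature vectors
            [0.0, 1.0],
            [1.0, 1.0]]
weight = [[1.0, 0.0],     # identity weights, so only the mixing is visible
          [0.0, 1.0]]

out = graph_conv(adj, features, weight)
for row in out:
    print([round(v, 3) for v in row])
```

After one step, each band's representation is a degree-normalized mixture of its own features and those of its neighboring bands, which is the sense in which graph convolution uses the relationships between nodes.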