Semi-Supervised Convolutional Long Short-Term Memory Neural Networks for Time Series Land Cover Classiﬁcation

: Time series images with temporal features are beneﬁcial to improve the classiﬁcation accuracy. For abstract temporal and spatial contextual information, deep neural networks have become an effective method. However, there is usually a lack of sufﬁcient samples in network training: one is the loss of images or the discontinuous distribution of time series data because of the inevitable cloud cover, and the other is the lack of known labeled data. In this paper, we proposed a Semi-supervised convolutional Long Short-Term Memory neural network (SemiLSTM) for time series remote sensing images, which was validated on three data sets with different time distributions. It achieves an accurate and automated land cover classiﬁcation via a small number of labeled samples and a large number of unlabeled samples. Besides, it is a robust classiﬁcation algorithm for time series optical images with cloud coverage, which reduces the requirements for cloudless remote sensing images and can be widely used in areas that are often obscured by clouds, such as subtropical areas. In conclusion, this method makes full advantage of spectral-spatial-temporal characteristics under the condition of limited training samples, especially expanding time context information to enhance classiﬁcation accuracy. with LSTM and ConvLSTM models, our model can still achieve a more accurate classiﬁcation with a small number of known samples. sets with different cloud coverage to simulate the real situations of optical remote sensing images for comparison experiments. It shows that the lack of image data has


Introduction
Remote sensing image classification is widely used in various areas of research and application, such as change detection, geography national condition monitoring, and global ecological environment changes [1][2][3][4][5][6]. With the growing development of remote sensing technology, the types and quantities of remotely sensed data have soared. It has become easier to access multi-source and multi-temporal Earth Observation (EO) data. The increasing demand for products encourages an increasing number of scholars to research a novel approach aimed at improving the accuracy of classification and meeting a wide range of information requirements and applications.
The current classification methods are mainly based on the classification tasks of singephase image. It can be roughly divided into unsupervised, supervised, and semi-supervised classification methods. Unsupervised classification algorithms cluster elements by similar attributes without any prior knowledge, like K-means and ISODATA [7]. But they are hard to interpret and are time consuming for high dimension or enormous volume of data [8].
On the contrary, supervised classification algorithms identify other unknown categories of pixels by learning prior human intervention (e.g., Decision Trees, DT [9]; Random Forests, RF [10]; Support Vector Machines, SVM [11]; and Artificial Neural Networks, ANN [12]). For these, selecting a representative and abundant training samples is crucial. However, the training samples are artificially selected through limited experience and knowledge whether it is field exploration or reference data. There is no guarantee that the selected This problem will affect accuracy of target recognition and classification, so a sort of preprocessing will be employed in advance, such as cloud and cloud shadow detection [26] or fitting the multi-temporal curve to the missing data problem [27,28].
In order to address these problems mentioned above, we propose a SemiLSTM for time series land cover classification. The major contributions of this work are as follows: (i) It achieves an excellent classification via a small group of labeled samples in time series data; (ii) it is a robust classification model for time series optical remote sensing data with the influence of noise (especially clouds and shadows), which decreases the requirements for time series images without clouds and can be widely used in areas that are often obscured by clouds, such as subtropical areas; (iii) it makes full use of spectral-spatialtemporal characteristics, especially expanding the time context information to enhance classification accuracy.
In previous work, we have experimented with recurrent networks on time series data with similar time resolution, and achieved promising results [29]. Based on this, we revised a more reasonable training mode and conducted experiments on three data sets with different time distributions.
The remainder of this paper is organized as follows: Section 2 describes the architecture and working principle of our proposed SemiLSTM method, and other classifiers for comparison. Then, the experimental data sets, experimental setup, and results are presented in Sections 3 and 4. Section 5 discusses the length of temporal context, the appropriate number of labeled training samples and the robustness for time series land cover classification with different cloud coverage. The final Section 6 concludes the paper.

Methodology
In this study, we designed a SemiLSTM for classification of long time series images. It uses a small number of labeled samples and a large number of unlabeled samples to extract the spectral, spatial, and temporal information. Figure 1 reveals the basic framework of the proposed SemiLSTM method, which can be decomposed into two parts: the pretrained model and the semi-supervised spatiotemporal modeling. In the following, we first describe how to use a trained CNN model on all images to enhance spatial features. Then, we detail how the convolutional LSTM models spatiotemporal context information and implements the semi-supervised training process.

The Pretrained Model
For remote sensing images with low and medium spatial resolution, the phenomenons of the same object with different spectrums and the same spectrum of

The Pretrained Model
For remote sensing images with low and medium spatial resolution, the phenomenons of the same object with different spectrums and the same spectrum of different objects bring difficulties to image recognition and interpretation. Numerous studies have shown that CNNs have good capabilities of spatial expression, and previous works in remote sensing fields [30][31][32] have demonstrated that pretrained CNNs have good transferability for remote sensing image classification. Therefore, we exploit the Residual Network (ResNet) [33] that has been trained in a large natural image database-ImageNet [34] as a pretrained model to process remote sensing images. In this way, it can generalize specific spatial feature representations from complex satellite images, and does not require a large amount of remote sensing data to train from scratch.
In the part 1, the initial input is a multispectral image X t ∈ R (h×w×c) , (t = 1, 2, 3, . . . , n), from time series data set, and the output is an image with enhanced spatial feature representations X P t ∈ R (h×w×c ) , (c > c), as shown in Figure 2. Since the ResNet was trained from natural images with red, green, and blue channels, the data we input to the ResNet pretrained model are the RGB image from original multispectral image. When input to the trained ResNet model, we choose the feature map after the first convolutional layer (expressed as ResNet 1 ) to enhance the spatial features of details like textures and edges. Because for our low-and middle-resolution remote sensing images, the deeper convolution layer, the more spatial and texture features will be lost. With the processing of the deep convolutional network, the image's resolution is getting smaller and smaller [35,36]. Thus, the original height h and width w are reduced, and the number of channels is increased to 64 after the size of 7 × 7 convolution kernel. Then, the extracted feature map from ResNet 1 is upsampled by bilinear interpolation to restore the same h and was the original image. At this time, the enhanced spatial information has been received. To further retain the multispectral information, the feature map after upsampling operation is concatenated with the original multispectral image. In the end, we get a new time series data (X

The Semi-Supervised Convolutional LSTM
LSTM has been proven to have the ability to preserve long-term and short-term memory, it can effectively solve vanishing and exploding gradient problems of simple recurrent neural network for long sequence training data [19,37,38]. Due to the fully connected network structure, LSTM needs to convert three-dimensional images into vectors for pixel-level training when processing time series imagery. At this time, the spatial relationship between pixels and adjacent pixels is also lost. Owing to the capability of the convolution operations to process spatial information, some scholars have proposed variant LSTM networks with convolution structures [23,24]. Thus, when processing time series data, the spatial characteristics of images are preserved. This is conducive to processing remote sensing images with complex and diverse characteristics. For the

The Semi-Supervised Convolutional LSTM
LSTM has been proven to have the ability to preserve long-term and short-term memory, it can effectively solve vanishing and exploding gradient problems of simple recurrent neural network for long sequence training data [19,37,38]. Due to the fully connected network structure, LSTM needs to convert three-dimensional images into vectors for pixel-level training when processing time series imagery. At this time, the spatial relationship between pixels and adjacent pixels is also lost. Owing to the capability of the convolution operations to process spatial information, some scholars have proposed variant LSTM networks with convolution structures [23,24]. Thus, when processing time series data, the spatial characteristics of images are preserved. This is conducive to processing remote sensing images with complex and diverse characteristics. For the training process of traditional deep learning networks, most of them rely on plentiful prior knowledge and sample selection. In the practical application of remote sensing data, it not only increases the workload of labeling samples, but also increases the difficulty of seeking suitable training samples in the limited remote sensing database. Therefore, we proposed a semi-supervised LSTM network with convolutional structures to deal with the land cover classification of long time series remote sensing images. Next, the training process of this network is described in detail.
After the pretrained model, a new set of time series data is constructed from the feature-enhanced images. In order to improve the computational efficiency and training effect of the temporal model, it is divided into numerous time series subdata sets with the same time series length (t), as shown in Figure 2 Figure 3a, we take the sequence subdata set sub 1 as an example. The self-looping structure based on recurrent neural network can be regarded as a connection with multitudinous neural units. The weights among hidden layers are shared, which makes the network have memory capabilities. There are three gate mechanisms (input gate i t , forget gate f t , and output gate o t ) that jointly control the memory of long-term and short-term knowledge of neural units. As an input X p t is fed, a current memory cell state S t can be obtained by the following equation, which gathers the current input and the previous hidden state H t−1 .
where tanh represents the hyperbolic tangent function. W S,X is the weight matrix from input to memory cell. W S,H denotes the hidden-memory coefficient matrix. b S is a bias coefficient and the symbol " * " denotes the convolution operator with 3 × 3 pixels. where tanh represents the hyperbolic tangent function. , is the weight matrix from input to memory cell. , denotes the hidden-memory coefficient matrix. is a bias coefficient and the symbol " * " denotes the convolution operator with 3 × 3 pixels.  As each datum is input in sequence, the long-term memory will be accumulated and updated from both , which is controlled by the input gate , and the past memory cell state , which controlled by the forget gate . The two gates and respectively control how much relevant information is added from and how much irrelevant prior information is omitted from . They can be expressed by the following formulas: , * , * , • , where represents the sigmoid function. , , , , , , and , denote the weight matrices from input data to input gate and forget gate, and from hidden state to input gate and forget gate, respectively. and are bias coefficients, and "•"denotes a Hadamard product. Correspondingly, the memory cell state is updated by As each datum is input in sequence, the long-term memory S t will be accumulated and updated from both S t , which is controlled by the input gate i t , and the past memory cell state S t−1 , which controlled by the forget gate f t . The two gates i t and f t respectively control how much relevant information is added from S t and how much irrelevant prior information is omitted from S t−1 . They can be expressed by the following formulas: Remote Sens. 2021, 13, 3504 6 of 18 where σ represents the sigmoid function. W i,X , W f ,X , W i,H , and W f ,H denote the weight matrices from input data to input gate and forget gate, and from hidden state to input gate and forget gate, respectively. b i and b f are bias coefficients, and "·"denotes a Hadamard product. Correspondingly, the memory cell state is updated by Subsequently, S t is further controlled by the output gate o t and propagates to the final hidden output H t . o t is designed to determine how much memory content is output at the current moment t. The output gate o t and the short-term memory H t can be expressed as follows: Owing to the convolutional operations between input-to-state and state-to-state transitions, as shown in Figure 3b. Compared with the traditional LSTM, it can better preserve spatial information and reduce the redundancy of spatial data in the process of temporal context information modeling. Due to the many-to-one network form, each sequence input has only one output, i.e., the sequence subdata set sub 1 outputs the latest state H t . After that, H t enters a convolutional layer with a kernel size of 3 × 3 and a convolution stride of 1 pixel for decoding, which converts the high-dimensional features into the categories. Then, it is mapped to [0, 1] through the SoftMax function to obtain the predicted probability valueŷ t . For our time series imagery classification task in Part 2, each image in time series is sequentially passed to the convolutional LSTM encoder, and finally the predictionŷ i+t−1 of sub i is output.
For traditional training process of networks, the loss function is defined as the standard cross-entropy which is calculated by the predicted labelŷ and respective reference label from ground truth data D L . This is essentially a supervised classification that needs large enough labels for training. Actually, the corresponding reference labels are mixtures with labeled D L and unlabeled data D U for the classification task of long time series remote sensing data, rather than known labels for every image. Thus, semi-supervised learning inspired by Π-model [39] is utilized. The structure of semi-supervised learning and the calculation of loss function is shown in Figure 4, and the pseudocode of training process in Algorithm 1. The Loss function consists of supervised and unsupervised loss components. One is the standard cross-entropy between models' predictionsŷ and reference labels D L , evaluated for known labels only. Because of the class imbalance, the tunable focusing parameter γ and the balance factor α are used to balance the imbalanced proportion of positive and negative samples [40]. The other is evaluated for all inputs with labeled and unlabeled data. After the random sequential variation, the same time series input data produce two different predicted vectors from hidden layers,ŷ andŷ . Then the mean square difference between these two values is made which can be seen as an error in classification. Besides, the latter is scaled by the loss weighting function w(·) related to the number of training times to merge the two loss items. The initial value of w(·) is set to 0, i.e., the loss value of the unsupervised part is not calculated. Through new inputs and continuous iterative calculations, the value of w(·) ramps up. It is very important that the unsupervised loss component must be promoted slowly enough, otherwise the network could easily fall into a degenerate solution and fail to achieve a meaningful classification effect. Finally, the Adam optimizer is utilized to minimize the combined loss value to complete the optimization of the model parameters. It should be noted that one-hot encoding [41] is a smart way to demote ground truth values of different categories for multi-classes tasks. calculated. Through new inputs and continuous iterative calculations, the value of ramps up. It is very important that the unsupervised loss component must be promoted slowly enough, otherwise the network could easily fall into a degenerate solution and fail to achieve a meaningful classification effect. Finally, the optimizer is utilized to minimize the combined loss value to complete the optimization of the model parameters. It should be noted that one-hot encoding [41] is a smart way to demote ground truth values of different categories for multi-classes tasks. , the unsupervised weight ramp-up function; Require: , the convolutional LSTM with trainable parameters θ; Require: , the time series input with random sequential variation function; Require: , the balance factor, is a constant vector whose length is the number of categories C; Require: , the tunable focusing parameter, is a constant.
for epoch in [1, num_epochs] do: for i in [1,1] do: Require: sub i , the time series subdata sets with the same time length (t) after pre-trained model . . , X p n ; Require: D i is the label corresponding to the last phase of sub i , where unlabeled samples are represented as D L i ∈ [1, C] (C is the number of classes), and labeled samples are represented as D U i ; Require: w(t), the unsupervised weight ramp-up function; Require: f θ (T), the convolutional LSTM with trainable parameters θ; Require: g(T), the time series input with random sequential variation function; Require: α, the balance factor, is a constant vector whose length is the number of categories C; Require: γ, the tunable focusing parameter, is a constant.

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45°42′′-48°31′N, 129°22′-129°42′E in Heilongjiang Province, China. Here, we only adopted seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/J3) mainly based on urban expansion and natural phenology of various crops from 2015 to 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (where marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, there are a total of 26 available images with cloud-free or low cloud coverage, with an average of 1 to 2 images per month.
Predictions for original sequential input

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45°42′′-48°31′N, 129°22′-129°42′E in Heilongjiang Province, China. Here, we only adopted seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/J3) mainly based on urban expansion and natural phenology of various crops from 2015 to 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (where marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, there are a total of 26 available images with cloud-free or low cloud coverage, with an average of 1 to 2 images per month.

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45°42′′-48°31′N, 129°22′-129°42′E in Heilongjiang Province, China. Here, we only adopted seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/J3) mainly based on urban expansion and natural phenology of various crops from 2015 to 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (where marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, there are a total of 26 available images with cloud-free or low cloud coverage, with an average of 1 to 2 images per month.

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45°42′′-48°31′N, 129°22′-129°42′E in Heilongjiang Province, China. Here, we only adopted seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/J3) mainly based on urban expansion and natural phenology of various crops from 2015 to 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (where marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, there are a total of 26 available images with cloud-free or low cloud coverage, with an average of 1 to 2 images per month.

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45°42′′-48°31′N, 129°22′-129°42′E in Heilongjiang Province, China. Here, we only adopted seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/J3) mainly based on urban expansion and natural phenology of various crops from 2015 to 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (where marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, there are a total of 26 available images with cloud-free or low cloud coverage, with an average of 1 to 2 images per month.
Update network parameters end for end for return θ, w(epoch)

Study Areas and Data Sets
In our experiments, we used three sets of time series optical remote sensing data to verify the proposed method. The details are described as follows.

Jiamusi
This data set was acquired by the Landsat 8/OLI sensor with a coordinate range of 45 seven bands with 30 m spatial resolution (B1-B7). There are three typical areas (J1/J2/ mainly based on urban expansion and natural phenology of various crops from 2015 2016. Each area covers 256 × 256 pixels of about 60 km 2 , as shown in Figure 5 (wh marked by the red star). Due to a 16-day revisit cycle and free resources of satellite, th are a total of 26 available images with cloud-free or low cloud coverage, with an avera of 1 to 2 images per month. Besides, it covers six categories according to the actual situation, involving cultivat land, forest (including unused land), construction (incorporated buildings and road water, cloud, and shadow. Figure 6b illustrates the ground truth information of the l images (J1/J2/J3). The distribution of the categories is imbalanced due to human activiti The proportion of each category is 65.20%, 9.38%, 22.29%, and 3.13% respectively. Clou and shadows are easy to appear in summer, usually from June to August, and th coverage varies. In the past two years, forests and water have basically remain Besides, it covers six categories according to the actual situation, involving cultivated land, forest (including unused land), construction (incorporated buildings and roads), water, cloud, and shadow. Figure 6b illustrates the ground truth information of the last images (J1/J2/J3). The distribution of the categories is imbalanced due to human activities. The proportion of each category is 65.20%, 9.38%, 22.29%, and 3.13% respectively. Clouds and shadows are easy to appear in summer, usually from June to August, and their coverage varies. In the past two years, forests and water have basically remained unchanged, whereas cultivated lands and constructions have completely undergone different changes. With the increase in time, the building area has an obvious growth trend. In time series images with monthly intervals, cultivated land may first be converted to bare land, and then converted to construction land. Due to the natural growth of crops, cultivated land has obvious seasonal change rules, and has different spectral reflectance values in different seasons. It is conceivable that varying degrees and regular changes will increase the difficulty and accuracy of identifying land cover.
In the time series data, the image of each time corresponds to a label, but the known label is randomly distributed. Some known labels correspond to the real class values, whereas other unknown labels are 0. In training process, we further cropped images into many blocks with a size of 32 × 32 pixels as shown in Figure 6a. Among them, the blue boxes represent the training samples, and the orange boxes represent the testing samples. The distribution of the training and testing sets is random (i.e., J1, J2, and J3 have different sample distributions), but the sample ratio is still 1:3.
In the time series data, the image of each time corresponds to a label, but the known label is randomly distributed. Some known labels correspond to the real class values, whereas other unknown labels are 0. In training process, we further cropped images into many blocks with a size of 32 × 32 pixels as shown in Figure 6a. Among them, the blue boxes represent the training samples, and the orange boxes represent the testing samples. The distribution of the training and testing sets is random (i.e., J1, J2, and J3 have different sample distributions), but the sample ratio is still 1:3. .

Kunshan
This data set is acquired by the Landsat 5 and 7 sensors from 31°15′-31°32′N, 120°54′-121°12′E, namely Kunshan city of China. Only six bands are used here: B1-B5 and B7 from two satellites with same 30 m spatial resolution and wavelength range. This area covers 1005 × 937 pixels about 848 km 2 in Figure 5, marked by the blue star. It mainly shows urban expansion from 1991 to 2010 with a total of 20 images, where one image per year.
This time series data with annual intervals should be composed of annual cloudless images as much as possible. Compared with the monthly interval Jiamusi data set, it does not consider the coverage of clouds and shadows. It covers four types of labeled values, containing cultivated land, forest, construction, and water. Figure 6c shows the label

Kunshan
This data set is acquired by the Landsat 5 and 7 sensors from 31 • 15 -31 • 32 N, 120 • 54 -121 • 12 E, namely Kunshan city of China. Only six bands are used here: B1-B5 and B7 from two satellites with same 30 m spatial resolution and wavelength range. This area covers 1005 × 937 pixels about 848 km 2 in Figure 5, marked by the blue star. It mainly shows urban expansion from 1991 to 2010 with a total of 20 images, where one image per year.
This time series data with annual intervals should be composed of annual cloudless images as much as possible. Compared with the monthly interval Jiamusi data set, it does not consider the coverage of clouds and shadows. It covers four types of labeled values, containing cultivated land, forest, construction, and water. Figure 6c shows the label information in 1991 and 2010, and the gray boxes represent only a small part of the large-scale data set. In the past 20 years, there had been a significant urban expansion phenomenon. The area of construction land had increased from 18.61% to 34.37%, while the area of cultivated land had decreased from 66.26% to 43.42%. In 2010, the coverage of forests and water were 7.40% and 14.81%, and both showed the same growth trend, with an increase of 3.93% and 3.15% respectively.
Similarly, in training process, the known labeled data are randomly distributed. In time series data with land changes, there may be only two phases with ground-truth labels. After cropping numerous image blocks (32 pixels × 32 pixels), we randomly selected the training and testing data with the same ratio (3:1).

Munich
This data set can be downloaded on GitHub [20]. It consists of cropped image blocks with a size of 48 × 48 pixels, derived from the 102 km × 42 km Sentinel 2 image in the north of Munich, Germany. In our experiments, we only used part of the available images in 2016 with a time length of 30 due to the limited computational space. There are four 10 m (B2, B3, B4, B8), six 20 m (B5, B6, B7, B8A, B11, B12), and three 60 m (B1, B9, B10) bands. At the time of input, the 20 m and 60 m bands were bilinear interpolated to 10 m ground sampling distance.
The ground truth information contains 17 crop categories, and only partial labeled samples are used. The non-uniform distribution of categories is shown in Figure 7. Similarly, we randomly selected the training and testing image blocks at a ratio of 3:1. Unlike Jiamusi and Kunshan time series data sets, its basic assumption is that the land cover has not changed. In other words, there is only one known label for Munich time series data set.
This data set can be downloaded on GitHub [20]. It consists of cropped im with a size of 48 × 48 pixels, derived from the 102 km × 42 km Sentinel 2 image i of Munich, Germany. In our experiments, we only used part of the available 2016 with a time length of 30 due to the limited computational space. There ar (B2, B3, B4, B8), six 20 m (B5, B6, B7, B8A, B11, B12), and three 60 m (B1, B9, B At the time of input, the 20 m and 60 m bands were bilinear interpolated to 10 sampling distance.
The ground truth information contains 17 crop categories, and only par samples are used. The non-uniform distribution of categories is shown in Similarly, we randomly selected the training and testing image blocks at a r Unlike Jiamusi and Kunshan time series data sets, its basic assumption is that the has not changed. In other words, there is only one known label for Munich time set.

Experiments Setup
In order to verify the feasibility and effectiveness of our SemiLSTM meth series images classification, LSTM-based methods (ConvLSTM and LSTM) and learning classifiers (SVM and RF) were selected as comparison methods. All m various parameter values were repeatedly tested on the Window 10 platform w NVIDIA RTX 2080 Ti GPU (memory 11 GB) and a CORE i7-7800 × CPU. Dee network models based on RNN (SemiLSTM, ConvLSTM, and LSTM) were im in Python with the help of Tensorflow. The neural units of three networks we 256, the learning rate was 0.001, and the optimal lengths (t) of time series input data sets were different, where t = {20, 18, 30} in Jiamusi, Kunshan, and Munich re More details and analyses are provided in Section 5.1. For ConvLSTM and our models, owing to the convolutional kernel size of 3 × 3, we fed eight cropped

Experiments Setup
In order to verify the feasibility and effectiveness of our SemiLSTM method for time series images classification, LSTM-based methods (ConvLSTM and LSTM) and non-deep learning classifiers (SVM and RF) were selected as comparison methods. All models with various parameter values were repeatedly tested on the Window 10 platform with a single NVIDIA RTX 2080 Ti GPU (memory 11 GB) and a CORE i7-7800 × CPU. Deep learning network models based on RNN (SemiLSTM, ConvLSTM, and LSTM) were implemented in Python with the help of Tensorflow. The neural units of three networks were the same 256, the learning rate was 0.001, and the optimal lengths (t) of time series input in different data sets were different, where t = {20, 18, 30} in Jiamusi, Kunshan, and Munich respectively. More details and analyses are provided in Section 5.1. For ConvLSTM and our SemiLSTM models, owing to the convolutional kernel size of 3 × 3, we fed eight cropped 3D image blocks (h, w, c) as a batch into the networks. The difference in LSTM was that each image block was reshaped to a two-dimensional vector (h × w, c) as an input. Therefore, the batch was the number of pixels in an image block (h × w).
For non-deep learning classifiers (SVM and RF), we employed the Scikit-learn framework. Similar to the input of LSTM, image blocks needed to be transformed into a vector. However, these two methods cannot directly deal with temporal features. For this reason, we connected the time dimension with the spectral dimension. That is, the spectral channels of each image were superimposed together in chronological order, which was regarded as the channels of the entire image. It was necessary to assume that the surface information has not changed. To deal with imbalanced samples, class weight = "balanced", so that each class had a weight based on the size of training samples [42]. We also adopted a random search strategy [43] to automatically pick the major optimal parameter values of model. For SVM, the RBF kernel was used, the candidate parameters C ∈ {0.1, 0.3, 1, 3, 10, 30, 100, 300} and gamma ∈ {0.1, 1, 2, 10, 'auto'}. For RF classifier, the candidate parameters n_estimators∈{120, 300, 500}, max_depth∈{5, 15, 25, None}, min_samples_split ∈ {2, 5, 15, 25}, min_samples_leaf ∈ {1, 2, 5, 10}, and max_features ∈ {'log2 , 'sqrt', None}.
In addition, although we randomly selected training and testing samples at the same ratio, for comparison experiments between different classifiers the random sampling of the same time series data remained consistent. In order to more fairly evaluate the effects of various classifiers, all models' inputs were enhanced images after the same pretrained CNN model.

Accuracy Assessment of Classification
To evaluate the performance of various classifiers in various time series data sets, the following indicators were utilized: Overall accuracy (OA), Kappa coefficients (K), and the weighted F1 score (W-F1). All the evaluation indexes were employed in Scikit-learn package of Python. The optimal parameters of various models in detail were selected in previous section. The results are displayed in Table 1. SVM and RF as the non-deep learning methods performed well with small training data set, which have been widely used in classification [44][45][46][47][48]. In our experiments of these various data sets, the results of the two models were similar and not good at the areas with varying and complex surface coverage. Moreover, they were time-consuming to repeat parameter selections by the random search strategy when entering a new data and process the data with high dimensions which combine the time and channel dimensions. SVM takes about two hours to complete an epoch in Jiamusi. If more parameters to be selected, like RF, it will take more time.
LSTM, as a variant RNN with three gates, is widely used because of its ability and advantage in processing temporal context information. Three LSTM-based networks had better performance on three different time series data, compared with non-deep learning classifiers which are based on the assumption of invariant land cover information. The three methods are far less time-consuming than the non-deep learning methods. Among them, LSTM can complete an epoch approximately every four minutes in Jiamusi, while ConvLSTM and SemiLSTM can complete an epoch approximately every one minute. In theory, ConvLSTM avoids the spatial features that may be lost in the data conversion process through convolution operations, so ConvLSTM had a slight advantage over LSTM (as shown in Table 1, the weighted F1 score of ConvLSTM was increased by 0.16, 0.07, and 0.05 in the Jiamusi, Kunshan, and Munich respectively). As for our proposed SemiLSTM method, its classified performance was better than LSTM and ConvLSTM in all data sets. For example, in the classification of Kunshan data set, the classification accuracy OA of SemiLSTM was improved by 12.11% compared with LSTM, and was increased by 5.58% compared with ConvLSTM. This is because the addition of semi-supervised learning made the model less dependent on known labeled samples, thereby reducing the interference of negative samples on land cover classification. Moreover, our method had a prominent classified performance in the case of a small number of known labels and the lack of images due to cloud occlusion. The related experimental analysis is described in Sections 5.2 and 5.3 below. It solves the problems of poor representation of training samples and inevitable cloud occlusion on optical images to some extent.

The Importance of Temporal Context Information
Temporal features play an important role in classification of remote sensing imagery. For instance, different types of crops have different seasonal phenological characteristics in crops classification, and the use of temporal context features can well distinguish various types of crops [18,44]. Temporal context features are beneficial to image classification tasks, especially in areas with land cover changes. But what length of time is the favorable condition for image classification?
To find the optimal time length of time series image classification, we compared classification experiments with time series inputs of different lengths. We set a fixed time length (t) and extracted multiple subdata sets of time length t from the original sequence data in chronological order as the time series input of network. A loop was not executed until the last time node of the original time series data. Here, the Jiamusi and Kunshan were used for comparative experiments, because their time series data sets have equal interval time distribution and certain land cover changes. Therefore, the length of time series input t = {2, 10, 15, 20, 24, 26} in Jiamusi and t = {2, 10, 15, 18, 20} in Kunshan.
The OA metric of different time series lengths has the same change trend as the Kappa, but the change range is relatively small. Therefore, only the change of Kappa is shown in Figure 8. As the increase in length t, the classification effect was significantly improved. Since the total length of time series data was limited, the larger the t, the smaller the number of iterations. Therefore, as t increased, the accuracy of classification generally improved on a stable trend, or even decreased slightly. Just as human beings become obscure to longer-term memories, the larger length of time series, the memory of the deep neural network may be lost during the continuous updating process. It shows that it is necessary to appropriately increase the length of time series input and is conducive to land cover classification, rather than the larger time series length.

Appropriate Number of Labeled Training Data
Supervised deep learning methods such as LSTM and ConvLSTM require a large number of known labels for training. But in practical application, it is hard to obtain enough labels, especially for a long time series data. The method proposed in this paper makes it possible to use a small proportion of labeled training samples and obtain excellent classification results, owing to the idea of semi-supervised learning. It will greatly reduce the requirements and workload of training labels, and make the classification and analysis of time series data easier to implement which can be widely used in various fields.
In order to explore the effect of the number of known labels on training process, five In addition, as shown in Figure 8, although the experimental results in the Jiamusi and Kunshan data sets show similar trends, the optimal parameters are different due to different time resolutions and research areas. Therefore, the parameter selection of the time series length needs to consider many factors such as time resolution, the size of the research area, and the complexity of the land surface. In this experiment, the optimal time series lengths in Jiamusi and Kunshan were 20 and 18, respectively. Moreover, the optimal parameters t were used in the subsequent comparison experiments.

Appropriate Number of Labeled Training Data
Supervised deep learning methods such as LSTM and ConvLSTM require a large number of known labels for training. But in practical application, it is hard to obtain enough labels, especially for a long time series data. The method proposed in this paper makes it possible to use a small proportion of labeled training samples and obtain excellent classification results, owing to the idea of semi-supervised learning. It will greatly reduce the requirements and workload of training labels, and make the classification and analysis of time series data easier to implement which can be widely used in various fields.
In order to explore the effect of the number of known labels on training process, five comparative experiments were set up, in which the labeled samples accounted for 100%, 70%, 50%, 20%, and 10% of the total training samples respectively. The number of known labels under different proportions were rounded down, that is the smallest integer was taken. Due to the limited length of Kunshan data and only one label in Munich, we conducted these comparative experiments in Jiamusi. Therefore, the experiments where the number of known labels account for 10%, 20%, 50%, 70%, and 100% were that only the last 2, 5, 13, 18, and 26 labeled samples were retained for training. The labels of other time nodes were filled with null values to represent unlabeled training samples, so that the total length of time series data was still 26. As described above in Section 5.1, the length of time series input t = 20. Due to the many-to-one networks, the known labels before were not used actually. When the ratios of known labels were 100%, 70%, and 50%, the classified results (OA) were similar.

Cloud-Robust
More and more optical satellites monitor the dynamic spatiotemporal processe the earth's surface in a regular time with a few days intervals. However, satellite ima are inevitably lost, as the surface is usually completely or partially covered by cloud limits the extensive research and application of majority remote sensing approaches, brings great challenges for the methods that are designed with cloud-free imager mind.
Therefore, we simulated time series images with different degrees of cloud cover In short, reducing the model's dependence on training samples is beneficial to image classification tasks. It not only reduces the difficulty of collecting time series labeled data (especially in complex environments and terrain conditions), but also reduces the cost and workload of labeling samples.

Cloud-Robust
More and more optical satellites monitor the dynamic spatiotemporal processes of the earth's surface in a regular time with a few days intervals. However, satellite images are inevitably lost, as the surface is usually completely or partially covered by clouds. It limits the extensive research and application of majority remote sensing approaches, and brings great challenges for the methods that are designed with cloud-free imagery in mind.
Therefore, we simulated time series images with different degrees of cloud coverage. Since the overlap of some original satellite images in Jiamusi, we had supplemented more abundant J1, J2, and J3 data from 2015 to 2016, including images with full or thick cloud coverage. Then it was further filtered and divided into time series subdata sets with the same total length (the length is 24 here). The time distribution of each subdata set in the three regions is displayed in Figure 10, and various color annotations indicate varying degrees of cloud coverage images. It should be noted that the cloud coverage is calculated by ROI (region of interest) from the clipped area. Moreover, only the labels of cloud-free images were reserved for the training process (the green circles in Figure 10), whereas other images with clouds had no known labels to participate in training. Figure 9. The results of comparative experiments where the ratios of known labels are 100%, 70%, 50%, 20%, and 10% respectively. The dotted lines are a linear fit to OA values.

Cloud-Robust
More and more optical satellites monitor the dynamic spatiotemporal processes of the earth's surface in a regular time with a few days intervals. However, satellite images are inevitably lost, as the surface is usually completely or partially covered by clouds. It limits the extensive research and application of majority remote sensing approaches, and brings great challenges for the methods that are designed with cloud-free imagery in mind.
Therefore, we simulated time series images with different degrees of cloud coverage. Since the overlap of some original satellite images in Jiamusi, we had supplemented more abundant J1, J2, and J3 data from 2015 to 2016, including images with full or thick cloud coverage. Then it was further filtered and divided into time series subdata sets with the same total length (the length is 24 here). The time distribution of each subdata set in the three regions is displayed in Figure 10, and various color annotations indicate varying degrees of cloud coverage images. It should be noted that the cloud coverage is calculated by ROI (region of interest) from the clipped area. Moreover, only the labels of cloud-free images were reserved for the training process (the green circles in Figure 10), whereas other images with clouds had no known labels to participate in training. Finally, there are three types of subdata set with different degrees of cloud coverage in each experimental region: (A), The subdata set is basically full of cloud-free coverage images, and only a few are low-cloud coverage images. But such data set is difficult to guarantee the distribution of equal time intervals, and some data may be continuously missing for months (like the subdata set (A) in J3 which has no data for up to five months). We call this phenomenon "the lack of time series data".
The subdata set (B) is composed of images with cloud coverage less than 50%, making the time series images distributed as uniformly as possible. However, there are still cases where no data in individual months, and remote sensing data of adjacent time phases are used instead. Finally, there are three types of subdata set with different degrees of cloud coverage in each experimental region: (A), The subdata set is basically full of cloud-free coverage images, and only a few are low-cloud coverage images. But such data set is difficult to guarantee the distribution of equal time intervals, and some data may be continuously missing for months (like the subdata set (A) in J3 which has no data for up to five months). We call this phenomenon "the lack of time series data".
The subdata set (B) is composed of images with cloud coverage less than 50%, making the time series images distributed as uniformly as possible. However, there are still cases where no data in individual months, and remote sensing data of adjacent time phases are used instead.
In the subdata set (C), it contains images obscured by full or thick clouds (we called it "the lack of image data"), but its time series data is continuous and evenly distributed. In other words, there is an image every month.
On this basis, we trained different deep neural networks to conduct comparative experiments on the above-mentioned time series subdata sets with different cloud coverage. The results (OA) of their predicted classification are shown in Figure 11. Among the three sets of experimental results, our SemiLSTM network had a robust classification regardless of the lack of time series data or image data. It shows that the semi-supervised learning method can not only reduce the dependence on samples, but also reduce the interference in clouds and shadows (negative samples) on the network training. It makes SemiLSTM model have a certain ability to resist the interference in clouds and learn features from other areas with cloudless coverage.

Conclusions
In this study, we developed a novel deep neural network SemiLSTM to classify land cover by learning spectral-spatial-temporal features from time series remote sensing images. Compared with a variety of classification models (including non-deep learning and deep learning models), it has a more prominent classified performance. Among three LSTM-based deep learning networks, we verified that properly increasing the temporal length of time series input is conducive to land cover classification. Owing to the advantages of semi-supervised learning, our SemiLSTM model reduces the dependence on training samples, so that it still has a good classification performance in the case of a small number of known training labels. In addition, we supplemented and reorganized time series subdata sets with different cloud coverage to simulate the real situations of optical remote sensing images for comparison experiments. It shows that the lack of image data has a more serious impact on classification accuracy than the lack of time series data, but our SemiLSTM model is still robust for time series land cover classification with high cloud coverage. The study suggests that this classified method can reduce the requirements for data collection with abundant cloudless images or a lot of known labels, which is conductive to the application and promotion of the method. Moreover, it provides valuable help for optical remote sensing applications and research in areas that are seasonally or often obscured by clouds, such as subtropical areas.  Besides, in the comparative experiments of subdata sets (A), (B), and (C) with different degrees of data missing, classification results of (C) were relatively poor even using the semi-supervised method. It can be seen that the lack of image data has a greater impact on classification than the lack of time series data. Thus, the data base without cloud coverage should be satisfied as much as possible in the application of time series images. In fact, optical satellite images are always difficult to avoid the interference of clouds and shadows (thick or thin coverage), such as subtropical areas that are easily blocked by clouds. At this time, our method still has good classification robustness.

Conclusions
In this study, we developed a novel deep neural network SemiLSTM to classify land cover by learning spectral-spatial-temporal features from time series remote sensing images. Compared with a variety of classification models (including non-deep learning and deep learning models), it has a more prominent classified performance. Among three LSTMbased deep learning networks, we verified that properly increasing the temporal length of time series input is conducive to land cover classification. Owing to the advantages of semi-supervised learning, our SemiLSTM model reduces the dependence on training samples, so that it still has a good classification performance in the case of a small number of known training labels. In addition, we supplemented and reorganized time series subdata sets with different cloud coverage to simulate the real situations of optical remote sensing images for comparison experiments. It shows that the lack of image data has a more serious impact on classification accuracy than the lack of time series data, but our SemiLSTM model is still robust for time series land cover classification with high cloud coverage. The study suggests that this classified method can reduce the requirements for data collection with abundant cloudless images or a lot of known labels, which is conductive to the application and promotion of the method. Moreover, it provides valuable help for optical remote sensing applications and research in areas that are seasonally or often obscured by clouds, such as subtropical areas.