Spatial-Temporal Neural Network for Rice Field Classification from SAR Images

Abstract: Agriculture is an important regional economic industry in Asian regions, and ensuring food security and stabilizing the food supply are priorities. In response to the frequent natural disasters caused by global warming in recent years, the Agriculture and Food Agency (AFA) in Taiwan has conducted agricultural and food surveys to address these issues. To improve the accuracy of these surveys, the AFA uses remote sensing technology to survey the planting area of agricultural crops. Unlike optical images, which are easily disturbed by rainfall and cloud cover, synthetic aperture radar (SAR) images are not affected by climatic factors, which makes them more suitable for forecasting crop production. This research proposes a novel spatial-temporal neural network called a convolutional long short-term memory rice field classifier (ConvLSTM-RFC) for rice field classification from Sentinel-1A SAR images of Yunlin and Chiayi counties in Taiwan. The proposed ConvLSTM-RFC is implemented with multiple convolutional long short-term memory attention blocks (ConvLSTM Att Block) and a bi-tempered logistic loss function (BiTLL). Moreover, a convolutional block attention module (CBAM) was added to the residual structure of the ConvLSTM Att Block to focus on rice detection in different periods on SAR images. The experimental results show that the proposed ConvLSTM-RFC achieved the highest accuracy of 98.08% with a rice false-positive rate as low as 15.08%. The results also indicate that the proposed ConvLSTM-RFC produces the highest area under the curve (AUC) value of 88% compared with other related models.


Introduction
In Asian regions, rice is a staple food for the general public [1][2][3]. It provides employment and livelihoods for many people. In Taiwan especially, rice farming is an important industry for a large number of farmers. Most of the land area of Taiwan is covered by mountains; only one-third is used for agriculture, producing 1.4 million tonnes of grain annually [4]. The number of natural disasters, flash floods, cyclones, and changes in temperature and rainfall has been reported to continue to increase due to the impacts of global warming. These impacts have led to reductions in rice yield [5][6][7][8].
In recent years, deep learning (DL) algorithms have gained popularity by producing better rice field classification results. Several DL algorithms have been proposed to classify rice fields from SAR images based on phenological and spatial-temporal profile analysis. These include unidimensional convolutional neural networks (1D CNN) [40], gated recurrent units (GRU) [41], 3D convolutional neural networks (3D CNN) [42], recurrent neural networks (RNN) [43], long short-term memory (LSTM), and bi-directional LSTM (Bi-LSTM) [44]. Wu et al. [41] implemented a gated recurrent unit (GRU) to detect and classify rice fields in Taiwan from SAR images. The GRU is a modified version of the LSTM in which the forget gate and input gate are combined into a single update gate, with an additional reset gate. This network can extract temporal features from time-series SAR images to perform pixel-wise classification, and it produces satisfactory results in terms of overall accuracy and performance. Another recent work [42] proposed a 3D convolutional neural network (3D CNN) for rice crop yield estimation from Sentinel-2 images in Nepal. This model is constructed with a series of convolutional and pooling layers to classify each pixel of an image by extracting spatial features. Additionally, the authors studied the impact of multi-temporal, climate, and soil data on rice crop classification accuracy, and validated the effectiveness of the model against other regression and deep learning crop yield prediction techniques. Wang et al. [45] proposed a combination of a convolutional neural network and long short-term memory (ConvLSTM) to estimate winter wheat yield in the major producing regions of China. The LSTM is the main module of the ConvLSTM network, which can extract short-term or long-term dependencies from time-series SAR images. The ConvLSTM model first extracts spatial features and then temporal features for crop classification.
Convolutional neural networks (CNN) are among the most widely used models in deep learning; they slide convolution kernels over the input image and perform calculations to extract the features in the image. Choosing the dimensions of the convolutional kernels and the number of convolutional layers are challenging issues when extracting features with a CNN. RNNs are popularly used for sequential data modeling and feature extraction. However, RNNs alone are not well suited to mapping rice fields from SAR images because their parameters are determined by the length of the time series.
This study combines the characteristics of the RNN and CNN models to perform rice field classification from SAR time-series images. The major contributions of this study include:

1.
This study proposes an original rice field classifier based on a spatial-temporal neural network called a convolutional long short-term memory rice field classifier (ConvLSTM-RFC) to classify rice fields in study areas from Sentinel-1A SAR images.

2.
The proposed ConvLSTM-RFC is designed with multiple convolutional long short-term memory attention blocks (ConvLSTM Att Block) to extract spatial-temporal features from the SAR images.

3.
The binary cross entropy loss function has been replaced by the bi-tempered logistic loss function (BiTLL) to make the proposed model more robust to noise in data during the training process [46].

4.
A convolutional block attention module (CBAM) was embedded in the residual structure of the ConvLSTM Att Block to extract refined features from the intermediate feature maps.
The rest of this paper is organized as follows: Section 2 describes the study area and ground truth, and the architectural details of the proposed method. Section 3 presents the experimental results. Section 4 discusses the merit of this study. Finally, the article is concluded with its primary findings in Section 5.

Study Area
Yunlin and Chiayi counties are used as the study area in this research, as shown in Figure 1. The study areas span about 1290.8 km² and 1903 km², respectively. The two counties are located near latitude 23°30.75′ N. These two counties are ranked as the top two rice-growing regions in Taiwan. The climate of these regions belongs to the sub-tropical monsoon, with an annual average temperature of 22.6 °C and rainfall of 1028.9 mm. Although rice is the dominant crop, other cash crops such as maize, peanut, wheat, sweet potato, corn, and soybean are cultivated in these regions, and most of the agricultural land is scattered with non-rice crops among the rice fields. According to rice phenology, rice cultivation in a year can be divided into two seasons: the first season is from February to June and the second season is from July to December. Rice cultivation greatly depends on weather and water availability; the cultivation period takes about 130 days for the first season and about 110 days for the second season. In the second season, farmers might not cultivate some areas due to climate and irrigation factors.

Ground Truth Data
The Agriculture and Food Agency (AFA) and the Taiwan Agriculture Research Institute (TARI) carry out the duties of developing the food industry and addressing the challenges in the agricultural sector in the Taiwan region. The ground truth data provided by these organizations were acquired over multiple periods and at different spatial resolutions through aerial photos and the Landsat-8 and RapidEye satellites. These organizations have been collecting agricultural land maps, aerial photos, and satellite images every year on a regular basis. First, agricultural lands are identified by field investigators and ground surveys. Then, aerial photos and satellite images are applied to distinguish agricultural land and map the ground truth data of crops. In 2017, the rice distribution areas of the two study areas were 31,054.26 ha and 18,380.44 ha, respectively. The experimental results of the proposed ConvLSTM-RFC were compared with the ground truth data provided by AFA and TARI to assess the accuracy of the rice field classification. The ground truth data of the rice field distribution of the two study areas in 2017 are shown in Figure 2; they consist of rice fields and non-rice fields, as shown in Figure 2a.

Data Preprocessing and Smoothing Processes
The dataset used in this paper is imagery acquired by the Sentinel-1A satellite. Sentinel-1A was launched in April 2014 to support operational applications in marine monitoring, land monitoring, and emergency management services. The satellite revisits the same area once every 12 days. Sentinel-1A operates in the C-band, enabling it to acquire high-resolution images regardless of light and weather conditions. It provides vertical-vertical (VV) and vertical-horizontal (VH) polarizations with a spatial resolution of 20 m × 22 m in the range and azimuth directions. The Sentinel-1A data used in this study were acquired with a swath width of 250 km in the interferometric wide swath (IW) acquisition mode. Level-1 ground range detected (GRD) SAR data with a pixel spacing of 10 m × 10 m are utilized. The acquired SAR data are open access and freely available online. This research mainly focuses on classifying rice fields in the first season; therefore, we downloaded the data from February to July 2017. The complete details of the SAR data are listed in Table 1.
The SAR data preprocessing steps, including radiometric correction, geometric correction, and speckle noise removal, are performed using the Sentinel Application Platform software (SNAP). In the proposed model, ConvLSTM neural network layers are stacked to increase the capacity of the model so that it can extract more features, and a residual architecture is used to avoid feature loss as the number of layers increases. The ConvLSTM attention block (ConvLSTM Att Block) is created to extract the features of the SAR time-series images. The bi-tempered logistic loss function is used in the proposed model to protect the deep neural network from falsely labeled noisy data; this loss function also regularizes the training of the model. The methodology of this study is shown in Figure 3.
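The despeckling step itself is carried out in SNAP, but its effect can be illustrated with a minimal Lee-filter sketch in NumPy. This is a generic stand-in, not the authors' actual SNAP workflow; the window size and the global noise-variance estimate are illustrative choices.

```python
import numpy as np

def lee_filter(img, size=3, noise_var=None):
    """Minimal Lee speckle filter: smooth flat areas while preserving
    high-variance (edge) regions. A hedged stand-in for the SNAP
    despeckling step, not the pipeline used in the paper."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    # local mean and variance over a size x size sliding window
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    local_mean = windows.mean(axis=(-2, -1))
    local_var = windows.var(axis=(-2, -1))
    if noise_var is None:
        noise_var = np.mean(local_var)      # crude global noise estimate
    weight = local_var / (local_var + noise_var + 1e-12)
    # low local variance -> output close to local mean (strong smoothing)
    return local_mean + weight * (img - local_mean)

speckled = np.ones((32, 32)) + 0.2 * np.random.default_rng(1).standard_normal((32, 32))
print(lee_filter(speckled).std() < speckled.std())  # True
```

A constant image passes through unchanged, while multiplicative-noise-like fluctuations are damped toward the local mean.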

Architecture and Strategy
In this study, 14 time-series images of the first rice season are stacked along the time axis, the VV and VH polarization images are extracted as the two feature channels of the model input, and the 7800 × 2800 SAR image is divided into small images of size 78 × 28. This research adopts the spatial-temporal neural network model ConvLSTM to perform rice field classification from SAR time-series images. This section introduces the network architecture and optimization strategies used in this study. The overall network architecture of the proposed convolutional LSTM network for rice field classification from SAR images (ConvLSTM-RFC) is shown in Figure 4.
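The patch extraction described above can be sketched as follows. The function name and the toy scene size are illustrative; only the 14 time steps, the two polarization channels, and the 78 × 28 patch size come from the text.

```python
import numpy as np

def tile_time_series(stack, patch_h, patch_w):
    """Split a (T, H, W, C) SAR time-series stack into non-overlapping
    (T, patch_h, patch_w, C) patches, keeping the full time axis for
    each patch. A sketch of the patch extraction described in the text."""
    t, h, w, c = stack.shape
    nh, nw = h // patch_h, w // patch_w
    stack = stack[:, :nh * patch_h, :nw * patch_w, :]   # drop any remainder
    patches = (stack
               .reshape(t, nh, patch_h, nw, patch_w, c)
               .transpose(1, 3, 0, 2, 4, 5)             # (nh, nw, T, ph, pw, C)
               .reshape(nh * nw, t, patch_h, patch_w, c))
    return patches

# 14 acquisitions, VV + VH channels, toy 780 x 280 scene
scene = np.zeros((14, 780, 280, 2), dtype=np.float32)
print(tile_time_series(scene, 78, 28).shape)  # (100, 14, 78, 28, 2)
```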
The ConvLSTM-RFC is a combination of Conv2D, Conv3D, ConvLSTM, and ConvLSTM Att Block layers. A series of SAR images is first input to the model and then passed through the ConvLSTM-RFC model to finally generate rice field maps. Next, the ConvLSTM Att Block design, the strategy used to modify the optimizer, and the loss function are introduced.

ConvLSTM Attention Block
In this study, referring to the architecture of ResNet [47], the network is deepened and the ConvLSTM attention block (ConvLSTM Att Block) is designed as shown in Figure 4. In 2018, Woo et al. [48] proposed a new attention-based CNN module, named the convolutional block attention module (CBAM), to strengthen the attention mechanism and feature-map exploitation of a network. CBAM is simple in design and exploits the spatial location of objects, which is useful in object detection. As shown in Figure 4, CBAM first applies channel attention and then spatial attention sequentially to extract refined feature maps. This serial learning process generates a 3D attention map while reducing the parameters and computational cost. The simple design of CBAM allows it to be integrated easily with any CNN architecture.
The ConvLSTM Att Block consists of two ConvLSTM neural network layers, two batch normalization layers, and one CBAM block, as shown in Figure 4. The depth, width, and cardinality of a CNN model play an important role in its performance. While deepening the network, the residual structure is used to avoid divergence of the gradient in the forward pass. In Figure 4, x_l is the input feature data and x_{l+1} is the enhanced feature generated by this structure, which can be described by the following formula:

x_{l+1} = H(x_l) + CBAM(F(x_l, {W_l})),

where CBAM(F(x_l, {W_l})) is the convolutional attention module function and H(x_l) is a potential mapping function.

The function F(x_l, {W_l}) represents the mapping corresponding to each ConvLSTM Att Block:

F(x_l, {W_l}) = BN(o_l ∘ tanh(c_l)),

where o_l and c_l are the output gate and cell state passing through the ConvLSTM network layer, and BN(·) denotes the batch normalization operation. In this study, to enable feature fusion, H(x_l) is designed as follows:

H(x_l) = W_s * x_l,

where * represents a convolution used for dimension matching. For any depth L and each ConvLSTM Att Block l, the combined ConvLSTM Att Block neural network can be written as:

x_L = H(x_l) + Σ_{i=l}^{L−1} CBAM(F(x_i, {W_i})).
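The CBAM operation applied inside each block (channel attention followed by spatial attention, as in Woo et al.) can be sketched in NumPy as follows. The weights here are random placeholders standing in for learned parameters, and the channel count, reduction ratio, and kernel size are illustrative, not the authors' layer sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, w1, w2, w_spatial, k=7):
    """NumPy sketch of CBAM: channel attention, then spatial attention.
    x is a (C, H, W) feature map; w1/w2 are the shared-MLP weights and
    w_spatial is the k x k spatial-attention kernel (all illustrative)."""
    c, h, w = x.shape
    # --- channel attention: shared MLP over avg- and max-pooled vectors
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)      # Dense-ReLU-Dense
    mc = sigmoid(mlp(avg) + mlp(mx))                # (C,) channel weights
    x = x * mc[:, None, None]
    # --- spatial attention: k x k conv over [avg; max] along channels
    feats = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    pad = k // 2
    padded = np.pad(feats, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    ms = sigmoid(np.einsum("chwij,cij->hw", win, w_spatial))  # (H, W) map
    return x * ms[None]

rng = np.random.default_rng(0)
c, r = 8, 2                                   # channels, reduction ratio
x = rng.standard_normal((c, 16, 16))
out = cbam(x, rng.standard_normal((c // r, c)),
           rng.standard_normal((c, c // r)),
           rng.standard_normal((2, 7, 7)))
print(out.shape)  # (8, 16, 16)
```

Because both attention maps pass through a sigmoid, every output value is a damped version of the corresponding input, which is the "refinement" behavior the block relies on.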

Incorrectly Labeled Data
The ground truth data provided by the Agriculture and Food Agency (AFA) and the Taiwan Agriculture Research Institute (TARI) are shown in Figure 2. After comparing the rice field ground truth from AFA and TARI, it was found that the data presented by the two organizations do not fully match. Hence, this study selected the ground truth data from AFA, which contains more rice fields in the same area, for the training and testing labels.
In addition, in a binary classification problem, the traditional logistic loss function is particularly sensitive to abnormal values. Incorrectly labeled data are often far from the decision boundary, which pulls the model's decision boundary toward them and may sacrifice other correctly labeled samples. To avoid the adverse effects of noisy data on model training, this research replaces the traditional logistic loss function with the bi-tempered logistic loss function, which uses its temperature and tail-weight parameters to constrain the influence of outliers.

Bi-Tempered Logistic Loss
Amid et al. [46] introduced the bi-tempered logistic loss function to address the issue of noise present in the dataset; such noise can disproportionately affect the quality of a segmentation output. The authors propose two modifications to overcome this issue. First, the softmax output is replaced with a heavy-tailed softmax function, given by the following equation:

ŷ_i = exp_{t2}(â_i − λ_{t2}(â)), where λ_{t2}(â) ∈ ℝ, (5)

such that Σ_j exp_{t2}(â_j − λ_{t2}(â)) = 1. Second, the logistic loss is replaced with a tempered version, given by the following equation:

L(y, ŷ) = Σ_i [ y_i (log_{t1}(y_i) − log_{t1}(ŷ_i)) − (y_i^{2−t1} − ŷ_i^{2−t1}) / (2 − t1) ], (6)

where log_t(x) = (x^{1−t} − 1)/(1 − t) and exp_t(x) = [1 + (1 − t)x]_+^{1/(1−t)} are the tempered logarithm and exponential. The two parameters, temperature t1 and tail-heaviness t2, determine how heavy-tailed the functions become. When both t1 and t2 are 1, the bi-tempered logistic loss reduces to the ordinary logistic loss. The temperature t1 is a parameter between 0 and 1; the smaller its value, the more tightly the logistic loss is bounded. The tail weight t2 is a parameter greater than or equal to 1; the larger its value, the heavier the tail and the slower the decay compared to the exponential function.
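A minimal NumPy sketch of the t1-tempered part of this loss is shown below, applied directly to probability vectors (the t2-tempered softmax that would produce the predictions is omitted). The function names and test values are illustrative, not code from the paper.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm log_t; reduces to the natural log as t -> 1."""
    return np.log(x) if t == 1.0 else (x**(1.0 - t) - 1.0) / (1.0 - t)

def bi_tempered_loss(y, y_hat, t1):
    """Sketch of the bi-tempered logistic loss of Amid et al. for
    class-probability vectors y (labels) and y_hat (predictions).
    For t1 < 1 the loss is bounded, which is what makes it robust
    to mislabeled samples far from the decision boundary."""
    return float(np.sum(y * (log_t(y, t1) - log_t(y_hat, t1))
                        - (y**(2.0 - t1) - y_hat**(2.0 - t1)) / (2.0 - t1)))

y, y_hat = np.array([0.9, 0.1]), np.array([0.7, 0.3])
# at t1 = 1 the loss reduces to the ordinary logistic (cross-entropy) loss
print(np.isclose(bi_tempered_loss(y, y_hat, 1.0),
                 float(np.sum(y * np.log(y / y_hat)))))  # True
```

With t1 = 0.5, even a confident wrong prediction on a hard label yields a bounded loss (about 2 here), instead of the unbounded penalty of ordinary logistic loss.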

Training and Testing Process
This experiment used Sentinel-1A SAR time-series images for the training and testing of all the models. Initially, the data were preprocessed and smoothed; then a total of 10,000 images with a height × width of 78 × 58 pixels were generated. The data were split into training and testing sets: 8000 images were allocated for training (80%) and 2000 images for testing (20%). The data were randomly shuffled to avoid uneven distributions between training and testing, with the random number seed set to 42.
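The shuffled 80/20 split with a fixed seed of 42 can be sketched as follows; this is a generic index-splitting sketch, not the authors' actual code.

```python
import numpy as np

def train_test_split(n_samples, train_frac=0.8, seed=42):
    """Shuffle sample indices with a fixed seed and split them into
    training and testing sets, as described in the text."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # random shuffle of 0..n-1
    cut = int(n_samples * train_frac)
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split(10_000)
print(len(train_idx), len(test_idx))  # 8000 2000
```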
The training and testing process of the study is shown in Figure 5. The training and testing datasets were randomly divided. When the training process was completed, the models were tested using the testing data. The goal of the proposed model is to generate a rice distribution map in the selected area by classifying whether each pixel belongs to rice or non-rice.

Model Evaluation
The most common performance evaluation metrics in computer vision and image processing were used to evaluate the performance of the proposed model: the confusion matrix, precision, recall, F1-score, accuracy, and the receiver operating characteristic (ROC) curve. Precision, recall, F1-score, and accuracy are formally given by the following equations:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × Precision × Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative observations, respectively, in a classification with a probability threshold of 0.5.
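These standard definitions translate directly into code; the counts below are invented values for illustration only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score, and accuracy computed from
    confusion-matrix counts at a fixed probability threshold."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# illustrative counts: precision 0.90, recall ~0.82, F1 ~0.86, accuracy 0.85
print(classification_metrics(tp=90, tn=80, fp=10, fn=20))
```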

Execution Environment
All the experiments were performed on a PC with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz, 64 GB of RAM, and two NVIDIA RTX 2080 Ti GPUs with 11 GB of memory each. The software environment was Python 3.7 with CUDA 10.1 and cuDNN 7.6 on a 64-bit Ubuntu 20.04 operating system.

Results
In this study, experiments were carried out to assess the rice field classification efficiency of the ConvLSTM-RFC model. Its efficiency is compared with three different neural network models: GRU, representing the temporal model; 3D CNN, representing the spatial model; and ConvLSTM, representing the spatial-temporal model. All the models used in this study were trained and tested using the time-series data obtained from the Sentinel-1A satellite. After the data preprocessing and smoothing processes, a total of 10,000 images with a height × width of 78 × 58 pixels were generated, of which 8000 images were used for training (80%) and 2000 images for testing (20%). Table 2 lists the training parameter settings of all the deep learning models. The experimental results are compared using the model evaluation indicators and the hyperparameters in the models. Yunlin and Chiayi counties are used as the study area. The rice field classification results of all models were compared with the ground truth data from the Agriculture and Food Agency (AFA), as shown in Figure 6.

Influence of Spatial-Temporal Model
The identification results of all models are listed in Table 3. It can be observed that the proportion of pixels that are actually not rice but were incorrectly identified as rice (false positives) is 74.24% for GRU, 51.80% for 3D CNN, and 51.16% for ConvLSTM.
From Table 4, it can be seen that the overall model constructed with the ConvLSTM spatial-temporal neural network has the highest F1 score of 96.48% and an accuracy of 95.70%. These results show that although the overall accuracy is satisfactory, the ConvLSTM is less effective in recognizing non-rice. In the next section, the model is optimized and adjusted to address this problem.

Influence of Different Optimized Strategy
In this paper, three methods were used to improve the efficiency of ConvLSTM-RFC. The first method modifies the loss function from binary cross-entropy to the bi-tempered logistic loss, which is less sensitive to noisy labels. The second method deepens the network architecture, using ConvLSTM in a residual architecture. The last method combines the above two methods and applies an attention mechanism to the features in the residual architecture to emphasize the more important spatial and temporal features. Table 2 lists the hyperparameter settings of these optimization methods.
As shown in Tables 3 and 4, after modifying the loss function to the bi-tempered logistic loss, every evaluation index rose substantially. In the original architecture, without the modified loss function, nearly half of the pixels classified as rice were actually non-rice (false positives). In the second method, the ConvLSTM network was deepened with a residual structure, which still produced a slight increase in the evaluation indicators of the model. Finally, the CBAM attention mechanism strengthens the features in each time sequence and spatial location. The final optimized model reaches an accuracy of 98.08% and an F1 score of 94.77%. Moreover, the proportion of non-rice incorrectly identified as rice (false positives) is as low as 15.08%.
Finally, the ROC curve is used to evaluate the performance of all the models used in the experiment and to present the area under the curve (AUC) by applying threshold values across the interval [0, 1]. For each threshold, two values are calculated: the true positive rate and the false positive rate. Figure 7 shows the ROC curves, which plot the true positive rate versus the false positive rate with the threshold as a parameter, for the GRU, Conv3D, ConvLSTM, ConvLSTM-BiTLL, and ConvLSTM-RFC models.
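The threshold-sweeping procedure described above can be sketched as follows; the threshold grid and the toy labels/scores are illustrative, and the AUC is integrated with the trapezoidal rule.

```python
import numpy as np

def roc_auc(labels, scores):
    """Sweep a probability threshold over [0, 1], compute the
    (false positive rate, true positive rate) pair at each threshold,
    and integrate the resulting ROC curve with the trapezoidal rule.
    A generic sketch of the evaluation described in the text."""
    pos = labels.sum()
    neg = len(labels) - pos
    tpr, fpr = [], []
    for t in np.linspace(1.0, 0.0, 101):    # thresholds from 1 down to 0
        pred = scores >= t
        tpr.append((pred & (labels == 1)).sum() / pos)
        fpr.append((pred & (labels == 0)).sum() / neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

labels = np.array([0, 0, 1, 1])
print(roc_auc(labels, np.array([0.1, 0.2, 0.8, 0.9])))  # 1.0 for a perfect ranking
```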

Discussion
Few DL models have been developed and applied for classification over large-scale rice fields. Traditional ML methods such as DT, RF, and SVM extract features of rice from SAR images either manually or through data mining techniques before the rice classification is performed. The RF algorithm with an oversampling technique has been used to classify rice phenology from Landsat-8 satellite images [49]. Furthermore, weighted nearest neighbors (WNN) and quadratic support vector machines (QSVM) were used to detect rice false smut in a complex planting environment [50]. The performance of these classifiers is better than that of the actual field investigations.
In recent years, a series of state-of-the-art DL models have been developed and applied for crop mapping. These DL models have achieved higher rice field classification results than the traditional ML models. The CNNs have demonstrated better crop classification performance than the traditional classification methods by learning spatial features from time-series satellite images. In addition, RNNs have shown their potential to perform rice classification by learning temporal features automatically from time-series satellite images.
In this research, the main goal of the proposed ConvLSTM-RFC model is to achieve high classification efficiency. Hence, the characteristics of RNN and CNN models are combined to construct the proposed ConvLSTM-RFC model, which first extracts spatial features and then temporal features for rice field classification. To achieve this goal, different optimization techniques have been implemented in the proposed model, including modifying the loss function from binary cross-entropy to a bi-tempered logistic loss function (BiTLL), deepening the architecture with several convolutional long short-term memory attention blocks (ConvLSTM Att Block), and integrating a convolutional block attention module (CBAM). Figure 6 illustrates the classification results of GRU, 3D CNN, ConvLSTM, and the proposed ConvLSTM-RFC in the selected study areas. The classification results of the ConvLSTM-RFC model are closer to the ground truth data than those of the GRU, 3D CNN, and ConvLSTM models.
Three reasons led to the best performance of the proposed model over the other models. First, the ConvLSTM-RFC model contains ConvLSTM Att Block to obtain spatial information and temporal features of rice from SAR images. Second, the BiTLL loss function led to the ConvLSTM-RFC model being more robust to noise. Third, ConvLSTM Att Block employs a CBAM block that is capable of recognizing rice pixels using spatial information of rice, which could produce complete rice fields in the classification result. The most noticeable improvement is shown in Table 3, where the ConvLSTM-RFC model has reduced the false positive rate of rice to 15.08%. The ConvLSTM-RFC model has produced the highest accuracy of 98.08%, as shown in Table 4. Meanwhile, the ConvLSTM-BiTLL model has achieved the second-highest accuracy, slightly higher than the 3D CNN model and much higher than that of the GRU model. This indicates that the combined use of spatial and temporal features of rice can improve the accuracy of rice detection. Moreover, Figure 7 shows that the ConvLSTM-RFC model outperformed GRU, 3D CNN, ConvLSTM, and ConvLSTM-BiTLL models in terms of AUC value. The ConvLSTM-RFC model has produced the highest AUC value of 88%. It implies that there has been a significant increase in the AUC value of the ConvLSTM-RFC model after the optimization strategies have been applied. The results show that the proposed model ConvLSTM-RFC has the best performance in classification accuracy, which is more suitable for large-scale rice field mapping.

Conclusions
This research proposed a spatial-temporal neural network called a convolutional long short-term memory rice field classifier (ConvLSTM-RFC) for rice field classification from Sentinel-1A SAR images. Unlike traditional deep learning methods that use only a temporal or a spatial neural network for crop classification from SAR images, this research combines both spatial and temporal neural networks in the main network of the proposed ConvLSTM-RFC. Additionally, ConvLSTM-RFC is constructed with several convolutional long short-term memory attention blocks (ConvLSTM Att Block) and a bi-tempered logistic loss function (BiTLL). In the ConvLSTM Att Block design, a convolutional block attention module (CBAM) was integrated to enhance the representation of rice fields in different periods on SAR images. The binary cross-entropy loss function was replaced by the BiTLL function to make the proposed model more robust to incorrectly labeled data. The experimental results demonstrated that the ConvLSTM-RFC model reached the highest accuracy of 98.08% with a rice false-positive rate as low as 15.08%, and it produced the highest AUC value of 88%. Compared with purely temporal or spatial deep learning models, ConvLSTM-RFC greatly reduced the proportion of false positives and achieved higher accuracy. Vegetation indices could also have an impact on the classification of rice fields; future work will therefore study rice field classification using a combination of vegetation indices and spatial-temporal features.