Multi-Temporal Unmanned Aerial Vehicle Remote Sensing for Vegetable Mapping Using an Attention-Based Recurrent Convolutional Neural Network

Vegetable mapping from remote sensing imagery is important for precision agricultural activities such as automated pesticide spraying. Multi-temporal unmanned aerial vehicle (UAV) data offer both very high spatial resolution and useful phenological information, which shows great potential for accurate vegetable classification, especially over complex and fragmented agricultural landscapes. In this study, an attention-based recurrent convolutional neural network (ARCNN) is proposed for accurate vegetable mapping from multi-temporal UAV red-green-blue (RGB) imagery. The proposed model first uses a multi-scale deformable CNN to learn and extract rich spatial features from the UAV data. The extracted features are then fed into an attention-based recurrent neural network (RNN), which establishes the sequential dependencies among the multi-temporal features. Finally, the aggregated spatial-temporal features are used to predict the vegetable category. Experimental results show that the proposed ARCNN yields a high performance, with an overall accuracy of 92.80%. Compared with mono-temporal classification, incorporating multi-temporal UAV imagery boosts the accuracy by 24.49% on average, which supports the hypothesis that the low spectral resolution of RGB imagery can be compensated for by including multi-temporal observations. In addition, the attention-based RNN outperforms other feature fusion methods such as feature stacking, and the deformable convolution operation yields higher classification accuracy than a standard convolution unit. These results demonstrate that the ARCNN provides an effective way to extract and aggregate discriminative spatial-temporal features for vegetable mapping from multi-temporal UAV RGB imagery.


Introduction
Accurate vegetable mapping is of great significance for modern precision agriculture. The spatial distribution map of different kinds of vegetables is the basis for automated agricultural activities such as unmanned aerial vehicle (UAV)-based fertilizer and pesticide spraying.

Study Area

Both the study area and the multi-temporal UAV imagery used in this research are illustrated in Figure 1.
The study area is a vegetable field located in Xijingmeng Village, Shenzhou City, Hebei Province, China. It contains various kinds of vegetables, such as Chinese cabbage, carrot and leaf mustard. The study area lies in the North China Plain, which has a continental monsoon climate: summer is humid and hot, while winter is dry and cold. The annual mean temperature is about 13.4 °C and the annual precipitation is about 486 mm. Vegetables are usually planted in late August and harvested in early November.
A field survey was conducted along with the UAV flights. Vegetable and crop types, locations measured by the global positioning system (GPS) and photographs were recorded for every land parcel. According to the results of the field survey, there were a total of fourteen land cover categories, including eight vegetable types (i.e., carrot, Chinese cabbage, leaf mustard, turnip, spinach, kohlrabi, potherb and scallion), four crop types (i.e., millet, sweet potato, corn and soybean), weed and bare soil (Table 1). Training and testing datasets were obtained from the UAV imagery by visual inspection based on the sampling sites' GPS coordinates and the corresponding land cover categories. The numbers of training and testing samples are shown in Table 1, together with ground photographs taken during the field work that depict the detailed appearance of the various vegetables and crops.
Table 1. Land cover categories and the numbers of training/testing samples.

No.  Category          Training/testing samples
1    Carrot            —
2    Chinese cabbage   400/400
3    Leaf mustard      200/200
4    Turnip            —
5    Spinach           —
6    Kohlrabi          50/50
7    Potherb           100/100
8    Millet            200/200
9    Weed              100/100
10   Bare soil         200/200
11   Sweet potato      200/200
12   Corn              50/50
13   Soybean           200/200
14   Scallion          100/100
Meanwhile, Figure 2 illustrates the spatial distribution of the training and testing samples. All the samples are randomly distributed and no overlap exists between the training and testing regions. Because we adopted patch-based per-pixel classification, all the training and testing samples are pixels drawn from regions of interest (ROIs). In this study, the numbers of training and testing samples are both 2250, which accounts for a small fraction (0.03%) of the total study region (7,105,350 pixels).
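Since classification is patch-based and per-pixel, each labeled sample corresponds to a small image patch centered on its pixel. The sampling step can be sketched as follows; this is a minimal NumPy illustration, where the patch size k = 11 follows the paper but the reflection padding at image borders is an assumed implementation detail:

```python
import numpy as np

def extract_patch(image, row, col, k=11):
    """Extract a k x k patch centered on a labeled pixel.

    `image` is an H x W x C array (one UAV acquisition); border pixels are
    handled by reflection padding so every labeled pixel yields a full patch.
    """
    half = k // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    return padded[row:row + k, col:col + k, :]

# Toy example: a 100 x 100 RGB image and two labeled sample pixels.
rng = np.random.default_rng(0)
image = rng.random((100, 100, 3))
samples = [(50, 50), (0, 0)]  # (row, col) of labeled pixels, incl. a border case
patches = np.stack([extract_patch(image, r, c) for r, c in samples])
print(patches.shape)  # (2, 11, 11, 3)
```

Each extracted patch keeps the labeled pixel at its center, so the network always sees the sample in the same spatial position.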


Dataset Used
We utilized a small-sized UAV, a DJI Inspire 2 [51], for the image data acquisition. The camera onboard is an off-the-shelf, light-weight digital camera with only three RGB bands. The resulting low spectral resolution would make it difficult to separate the various vegetable categories if only single-date UAV data were considered. To tackle this issue, we introduce multi-temporal UAV observations, which capture phenological information over the growing season and thereby increase the inter-class separability.
We conducted three flights in the autumn of 2019 (Table 2). During each flight, the flying height was set to 80 m, achieving a very high spatial resolution of 2.5 cm/pixel. The width and height of the study area are 3535 and 2010 pixels (88.4 m and 50.3 m), respectively. The extent of the study area is at the limit of the UAV data coverage: although it may seem small, it is constrained by the operation range of the mini-UAV used. In a future study, we plan to use a high-altitude long-endurance (HALE) UAV to acquire images of a larger region.

The raw images acquired during each flight were first orthorectified and then mosaicked into a single image with Pix4D [52]. Several key parameters in Pix4D were set as follows: "Aerial Grid or Corridor" was chosen for matching image pairs, "Automatic" was selected for the targeted number of key points, the matching window size was 7 × 7 and 1 GSD was used for the resolution. The remaining parameters were kept at their default values. Afterwards, image registration was performed among the multi-temporal UAV data with ENVI (the Environment for Visualizing Images) [53].

Overview of the Proposed ARCNN

Figure 3 illustrates the architecture of the proposed attention-based recurrent convolutional neural network (ARCNN) for vegetable mapping from multi-temporal UAV data. It mainly contains two parts: (1) a spatial feature extraction module based on a multi-scale deformable convolutional network (MDCN), and (2) a spatial-temporal feature fusion module based on a bi-directional RNN and an attention mechanism. The former learns representative spatial features, while the latter aggregates spatial and temporal features for the final vegetable classification.
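The overall data flow of the two modules can be made concrete with a toy NumPy sketch. The linear feature extractor and the single-vector attention scorer below are stand-ins for the trained MDCN and Bi-LSTM-Attention modules, chosen only to show the tensor shapes involved (three dates, 11 × 11 × 3 patches, 14 classes, and a hypothetical 64-dimensional feature space):

```python
import numpy as np

rng = np.random.default_rng(42)
T, k, c, d, n_classes = 3, 11, 3, 64, 14  # dates, patch size, bands, feature dim, classes

def spatial_features(patch, W):
    """Stand-in for the MDCN: flatten the patch and apply a linear map + ReLU."""
    return np.maximum(W @ patch.ravel(), 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One labeled pixel observed on three dates -> three 11 x 11 x 3 patches.
patches = rng.random((T, k, k, c))
W_feat = rng.standard_normal((d, k * k * c)) * 0.1

H = np.stack([spatial_features(p, W_feat) for p in patches])  # (T, d) sequence

# Stand-in for Bi-LSTM + attention: score each date, then fuse by weighted sum.
w_att = rng.standard_normal(d)
alpha = softmax(H @ w_att)          # (T,) attention weights over the dates
fused = alpha @ H                   # (d,) aggregated spatial-temporal feature

W_cls = rng.standard_normal((n_classes, d)) * 0.1
probs = softmax(W_cls @ fused)      # class probabilities for the 14 categories
print(alpha.shape, fused.shape, probs.shape)  # (3,) (64,) (14,)
```

The design point the sketch captures is that each date is encoded independently into a feature vector, and only afterwards are the per-date vectors fused into a single representation for classification.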


Spatial Feature Extraction Based on MDCN
Accurate vegetable classification requires discriminative features. In this section, a multi-scale deformable convolutional network (MDCN) is proposed to learn and extract rich spatial features from the UAV imagery, accounting for the scale and shape variations of land parcels. Specifically, MDCN is an improved version of our previous model [44], and its network structure is depicted in Figure 4. As in our previous work, the input of MDCN is an image patch centered on the labeled pixel. The dimension of the patch is k × k × c [44], where k stands for the patch size and c refers to the number of channels. MDCN includes four regular convolutional layers and four deformable convolutional blocks; Table 3 shows the detailed configuration of the MDCN.
The deformable block contains multiple streams of deformable convolution [54], which learn hierarchical and multi-scale features. The role of deformable convolution is to model shape variations under complex agricultural landscapes. The standard convolution only samples the given feature map at fixed locations [54,55], so it cannot handle geometric transformations. In contrast, deformable convolution introduces additional offsets on top of the standard sampling grid [54,55], which accounts for various transformations of scale, aspect ratio and rotation, making it an ideal tool for extracting robust features under complex landscapes. During training, both the kernel and the offsets of a deformable convolution unit are learned without additional supervision. The output y at the location p_0 is calculated according to Equation (1):

y(p_0) = Σ_{p_i ∈ R} w(p_i) · x(p_0 + p_i + Δp_i),    (1)

where R denotes the regular sampling grid, w stands for the learned weights, p_i is the i-th location in the grid, x represents the input feature map and Δp_i refers to the offset to be learned [54]. As for the determination of the patch size k, we referred to our previous research [44]; the highest classification performance was reached when k equaled 11.
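Equation (1) can be illustrated with a minimal single-channel NumPy sketch. For simplicity the offsets here are integers, whereas the actual operator [54] uses bilinear interpolation to handle fractional offsets:

```python
import numpy as np

def deformable_conv_at(x, w, p0, offsets):
    """Evaluate Equation (1) at one location p0 for a single-channel input.

    x: 2-D input feature map; w: 3 x 3 kernel; offsets: (9, 2) learned
    displacements Δp_i (integer here for simplicity -- the real operator
    interpolates bilinearly so fractional offsets are differentiable).
    """
    grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # regular grid R
    y = 0.0
    for (dr, dc), (odr, odc), wi in zip(grid, offsets, w.ravel()):
        r, c = p0[0] + dr + odr, p0[1] + dc + odc  # p0 + p_i + Δp_i
        if 0 <= r < x.shape[0] and 0 <= c < x.shape[1]:  # zero outside the map
            y += wi * x[r, c]
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
zero_off = np.zeros((9, 2), dtype=int)
# With all offsets zero, Equation (1) reduces to a standard 3 x 3 convolution.
print(deformable_conv_at(x, w, (2, 2), zero_off))  # ≈ 12.0, the 3 x 3 block mean
```

Setting all Δp_i to zero recovers the standard convolution, which is exactly the sense in which deformable convolution generalizes the fixed sampling grid.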

Spatial-Temporal Feature Fusion
After extracting spatial features from each mono-temporal UAV image, it is essential to establish the relationships among these sequential features to yield a complete feature representation and boost the vegetable classification performance. In this section, we exploit an attention-based bi-directional LSTM (Bi-LSTM-Attention) for the fusion of spatial and temporal features (Figure 5). The network structure of Bi-LSTM-Attention is illustrated as follows.

Specifically, LSTM is a variant of the RNN that contains one input layer, one or several hidden layers and one output layer [45]. LSTM is more specialized in capturing long-range dependencies between sequential signals than other RNN models: it utilizes a vector (i.e., the memory cell) to store long-term memory and adopts a series of gates to control the information flow [45] (Figure 6). The hidden layer is updated as follows:

i_t = σ(W_xi x_t + W_hi h_(t−1) + b_i)
f_t = σ(W_xf x_t + W_hf h_(t−1) + b_f)
o_t = σ(W_xo x_t + W_ho h_(t−1) + b_o)
c̃_t = tanh(W_xc x_t + W_hc h_(t−1) + b_c)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where i refers to the input gate, f stands for the forget gate, o refers to the output gate, c is the memory cell and σ stands for the logistic sigmoid function [45].
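A single LSTM update of this form can be sketched in NumPy as follows; the stacked-parameter layout and the random toy inputs are illustrative choices, not the paper's trained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update (input, forget, output gates and candidate memory).

    W, U, b stack the parameters of the four transforms (i, f, o, g),
    each producing a hidden vector of size d.
    """
    d = h_prev.size
    z = W @ x_t + U @ h_prev + b          # (4d,) pre-activations
    i = sigmoid(z[0 * d:1 * d])           # input gate
    f = sigmoid(z[1 * d:2 * d])           # forget gate
    o = sigmoid(z[2 * d:3 * d])           # output gate
    g = np.tanh(z[3 * d:4 * d])           # candidate memory
    c = f * c_prev + i * g                # memory cell update
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d = 8, 4
W = rng.standard_normal((4 * d, d_in)) * 0.1
U = rng.standard_normal((4 * d, d)) * 0.1
b = np.zeros(4 * d)

h, c = np.zeros(d), np.zeros(d)
for x_t in rng.standard_normal((3, d_in)):  # e.g., a three-date feature sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Iterating the step over the three acquisition dates is what lets the memory cell carry phenological information from earlier dates into the final hidden state.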
LSTM has long been utilized in natural language processing (NLP) [56,57]. Recently, it has been introduced into the remote sensing field for change detection and land cover mapping. In this section, we exploit a bi-directional LSTM [57] to learn the relationships among the multi-temporal spatial features extracted from the UAV images. As shown in Figure 5, two LSTMs are stacked together: the hidden state of the first LSTM is fed into the second one, and the second LSTM processes the sequence in reverse order, so that the dependencies of the sequential signals are captured in a bi-directional way. In addition, to further improve the performance, we append an attention layer to the output of the second LSTM. The attention mechanism is widely studied in the fields of CV and NLP [58-60]; it automatically adjusts the weights of the input feature vectors according to their importance to the current task. We therefore incorporate an attention layer to re-weight the sequential features and boost the classification performance.
Let H be a matrix containing a series of vectors [h1, h2, …, hT] that are produced by the bi-directional LSTM, where T denotes the length of the input sequence. The output of the attention layer is formed by a weighted sum of these vectors, described as follows:

R_att = H α^T (9)

where α is the attention vector and R_att denotes the fused, attention-weighted spatial-temporal features. In this way, the features output by the Bi-LSTM-Attention module are re-weighted, or re-calibrated, adaptively, which enhances the informative feature vectors and suppresses the noisy and useless ones.
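The attention-weighted fusion of Equation (9) can be sketched as follows. Note that the paper does not spell out how the attention vector α itself is computed; the tanh scoring with a learned vector w used below is a common choice in the attention literature and is an assumption of this sketch:

```python
import numpy as np

def attention_pool(H, w):
    """Weighted sum of Bi-LSTM outputs: R_att = H @ alpha (Equation (9)).
    H: (d, T) matrix of hidden vectors [h1, ..., hT].
    w: (d,) learnable scoring vector (hypothetical scoring scheme)."""
    scores = w @ np.tanh(H)             # one scalar score per time step, shape (T,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                # softmax -> attention vector alpha
    return H @ alpha, alpha             # fused feature R_att and the weights

rng = np.random.default_rng(1)
H = rng.standard_normal((8, 3))         # toy 8-d features over T = 3 dates
R_att, alpha = attention_pool(H, rng.standard_normal(8))
print(R_att.shape, float(alpha.sum()))  # (8,) 1.0
```

The softmax guarantees that the weights are non-negative and sum to one, so R_att is a convex combination of the per-date feature vectors.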
Finally, the re-weighted features are sent to a fully-connected layer and then to a softmax classifier to predict the final vegetable category.


Details of Network Training
When training started, all the weights of the neural network were initialized through He normalization [61], and all biases were set to zero. We adopted the cross-entropy loss (Equation (10)) [62] as the loss function to train the proposed ARCNN:

CE = −Σ_i y_i log(y_p,i) (10)

where CE is short for cross-entropy loss, y_p is the predicted class-probability vector and y is the one-hot representation of the ground-truth label. Adam [63] was utilized as the optimization method with a learning rate of 1 × 10^−4. In the training procedure, the model with the lowest validation loss was saved.
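The loss of Equation (10) reduces to the negative log of the probability assigned to the true class. A minimal sketch (toy values only, not the actual training code):

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Cross-entropy loss, CE = -sum_i y_i * log(y_p,i).
    eps guards against log(0) for numerical safety."""
    return -np.sum(y_true_onehot * np.log(y_pred_probs + eps))

y = np.array([0, 1, 0])                 # one-hot ground-truth label
y_p = np.array([0.2, 0.7, 0.1])         # predicted softmax probabilities
print(round(cross_entropy(y, y_p), 4))  # 0.3567, i.e. -ln(0.7)
```

Because y is one-hot, only the log-probability of the correct class contributes; driving that probability toward 1 drives the loss toward 0.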
We conducted data augmentation to reduce the impact of the limited labeled data in this study. Specifically, all training image patches were flipped and rotated by a random angle chosen from 90°, 180° and 270°. Afterwards, 90% of the training dataset was used for parameter optimization, while the remaining 10% served as a validation set for performance evaluation during training. After the training process, a separate testing dataset was adopted to obtain the final classification accuracy.
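The flip-and-rotate augmentation described above can be sketched as follows (a minimal NumPy illustration; the flip probability of 0.5 is an assumption, as the text does not state it):

```python
import numpy as np

def augment(patch, rng):
    """Randomly flip a patch, then rotate it by 90, 180 or 270 degrees.
    patch: (H, W, C) image patch; rng: numpy random Generator."""
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=1)    # horizontal flip
    k = rng.integers(1, 4)                # 1, 2 or 3 quarter turns
    return np.rot90(patch, k=k, axes=(0, 1))

rng = np.random.default_rng(42)
p = np.arange(2 * 2 * 3).reshape(2, 2, 3)  # tiny toy patch
out = augment(p, rng)
print(out.shape)  # (2, 2, 3)
```

Flips and quarter-turn rotations only rearrange pixels, so the patch shape and pixel values are preserved, which keeps the label valid.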
Furthermore, we used TensorFlow [64] to construct the proposed model. The training process was performed on a computer running the Ubuntu 16.04 operating system, with an Intel Core i7-7800 CPU @ 3.5 GHz and an NVIDIA GTX TitanX GPU.

Accuracy Assessment
In this study, we utilized both qualitative and quantitative methods to verify the effectiveness of the proposed ARCNN for vegetable mapping. For the former, we used visual inspection to check for classification errors; for the latter, a confusion matrix (CM) was derived from the testing dataset, from which a series of metrics were calculated, including overall accuracy (OA), producer's accuracy (PA), user's accuracy (UA) and the Kappa coefficient.
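These metrics all derive directly from the confusion matrix, as the following sketch shows (a toy 2-class matrix; the convention of rows = reference, columns = prediction is assumed):

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, per-class PA/UA and Kappa from a confusion matrix cm,
    with rows = reference (ground truth) and columns = prediction."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                            # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)                # producer's accuracy (1 - omission)
    ua = np.diag(cm) / cm.sum(axis=0)                # user's accuracy (1 - commission)
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2    # expected chance agreement
    kappa = (oa - pe) / (1 - pe)                     # Cohen's Kappa
    return oa, pa, ua, kappa

cm = [[40, 5], [10, 45]]                             # toy 2-class example
oa, pa, ua, kappa = accuracy_metrics(cm)
print(round(oa, 3), round(kappa, 3))                 # 0.85 0.7
```

Kappa discounts the agreement expected by chance (pe), which is why it is lower than OA whenever the class distribution allows lucky guesses.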
The number of sample points chosen for each class was determined by its area ratio. For instance, the class of Chinese cabbage had the largest area ratio; therefore, its number of training/testing sample points was set to 400, the largest among all categories. Conversely, land cover types such as spinach and kohlrabi, which only accounted for a small area of the study region, were assigned a small number of sample points (only 50).
To further justify the effectiveness of the proposed method, we adopted both ablation analysis and comparison experiments with classic machine learning methods. For the ablation study, we examined the roles of the attention-based RNN and of the deformable convolution in vegetable mapping using the following setups.
(1) Feature-stacking: concatenating or stacking the spatial features derived from each single-date image for classification; (2) Bi-LSTM: using a bi-directional LSTM for classification; (3) Bi-LSTM-Attention: using the attention-based bi-directional LSTM for classification; and (4) standard convolution: replacing the deformable convolutions with common, non-deformable convolution operations. Setup (4) justifies the impact of deformable convolution when compared with standard convolution operations.
Meanwhile, classic machine learning methods such as MLC, RF and SVM were also included in the comparison experiments. Specifically, MLC has long been studied in remote sensing image classification, where predicted labels are generated based on the maximum likelihood with respect to the training samples. The basic assumption of MLC is that the training samples follow a normal distribution, which is hard to satisfy in reality, limiting its classification performance. RF is an ensemble of decision trees whose predictions are determined by the average output of the individual trees [65]. RF imposes no restrictions on the training data distribution and has outperformed MLC in many remote sensing studies. SVM is based on the Vapnik-Chervonenkis (VC) dimension theory and aims at structural risk minimization, resulting in good performance, especially in situations with limited data [66]. Parameters involved in SVM usually include the kernel type and the penalty coefficient.

To further visually justify the classification results, Figure 8 shows the ground truth (GT) map, which was manually vectorized from the UAV data. When compared with the GT map, the classification map of Figure 7 shows a salt-and-pepper effect. On the one hand, the classification model in this research is a per-pixel method, which does not consider the boundary information of each land parcel, resulting in a more scattered classification result. On the other hand, the GT map is an idealized description of the spatial extent of every land cover category, neglecting variations within each land parcel. For instance, several weed regions are missing from the GT map due to their small areas, and the bare soil regions in some land parcels have also been neglected.
However, in practice, based on the classification map of Figure 7, we could easily generate a land cover map that is as accurate and concise as Figure 8, which justifies the value of the proposed method in vegetable mapping. To make the boundaries of each land parcel more accurate, in future studies we will investigate semantic segmentation models such as fully convolutional neural networks [19] to improve the visual effect. Meanwhile, to quantitatively assess the classification performance, the confusion matrix, Kappa coefficient, OA, PA and UA were derived from the testing dataset. Table 4 indicates that the proposed classification model achieves a high performance, with an OA of 92.80% and a Kappa coefficient of 0.9206.
Table 4 indicates that the omission and commission errors mainly exist among leaf mustard, Chinese cabbage, potherb and turnip. For instance, several leaf mustard pixels were misclassified as Chinese cabbage and vice versa. This is understandable, since both the color and the shape of these leafy green vegetables (Chinese cabbage, leaf mustard, potherb, etc.) are very similar, especially at the early growth stage. Meanwhile, the RGB imagery has the drawback of low spectral resolution, making it hard to differentiate these vegetable categories using only color and shape information.

Results of Vegetable Mapping Based on Multi-Temporal Data
In addition, only a few mistakes occurred among the other categories, which verifies the effectiveness of the proposed vegetable mapping method. Figure 9 shows the vegetable maps generated from both mono- and multi-temporal classification. As mentioned above, one hypothesis of this study is that the inclusion of multi-temporal UAV data could provide additional phenological information, which would enhance the inter-class separability and thus compensate for the low spectral resolution of off-the-shelf digital cameras. Therefore, in this section, a contrast experiment was conducted to compare the performance of multi-temporal and mono-temporal classification. It should be noted that when single-date UAV data are used for classification, the spatial-temporal feature fusion module (i.e., Bi-LSTM-Attention) is bypassed during the training and testing procedures.
Figure 9 indicates that the incorporation of multi-temporal UAV images significantly improves the classification performance compared with mono-temporal data, showing fewer obvious errors upon visual inspection. This is in accordance with the quantitative assessment (Table 5), which indicates that the overall classification accuracy improved by 19.76%–28.13%, with an average increase of 24.49%, after the inclusion of multi-temporal data. Meanwhile, Figure 9 also shows that it is difficult to obtain a high-precision vegetable map using only single-date UAV RGB images; a large number of classification errors occur among different vegetable categories, especially between Chinese cabbage, leaf mustard and turnip. Specifically, during the early growth stage (T1), large numbers of Chinese cabbage and leaf mustard pixels are misclassified as turnip (Figure 9a). This is mainly because these leafy green vegetables share very similar appearances (e.g., color, shape and texture patterns), leading to low inter-class separability and hence a poor classification accuracy (64.67%). In the middle growth stage (T2), the classification accuracy of Chinese cabbage is greatly improved owing to its shape change during growth; however, it remains difficult to separate leaf mustard from Chinese cabbage (Figure 9b). When it comes to the ripe stage (T3), leaf mustard can finally be differentiated from Chinese cabbage (Figure 9c), mainly because Chinese cabbage shows a round head in the ripe stage (Table 1), which is greatly different from leaf mustard. Table 5 shows the class-level accuracy for each vegetable category and the other land cover types, indicating a significant accuracy gap between mono- and multi-temporal classification when using UAV RGB imagery.
This is understandable because, when using single-date UAV data alone, the similarity of color and texture patterns between various vegetables yields low inter-class separability. This is even more so at the early growth stage (T1), when vegetable seedlings share very similar appearances, resulting in the lowest classification accuracy (OA of 64.67%). With the inclusion of multi-temporal UAV images, however, the additional phenological information increases the separability among various vegetables, which boosts the final classification performance.

Results of Ablation Analysis
To justify the effectiveness of the proposed ARCNN model, a series of ablation experiments are conducted and the results are shown as follows.

Results of Different Fusion Methods
In this section, we consider the following methods for the fusion of spatial-temporal features: (1) feature-stacking; (2) Bi-LSTM and (3) Bi-LSTM-Attention. The description of these methods is in Section 2.7. The experimental results are shown in Table 6.  Table 6 indicates that the Bi-LSTM-Attention module used in this study outperforms both feature-stacking and Bi-LSTM, which increases the OA by 3.24% and 1.87%, respectively. The role of Bi-LSTM-Attention will be discussed in Section 4.1.

Results of Standard Convolution
In this section, we replaced all the deformable convolution operations by standard convolution units in the proposed network to justify the role of deformable convolution in vegetable mapping. Table 7 shows the comparison results.  Table 7 implies that the inclusion of deformable convolution could improve the vegetable mapping accuracy. The detailed discussion will be presented in Section 4.2.

Results of Comparison with Other Methods
To further justify the effectiveness of the proposed classification model, we compared it with several classic machine learning classifiers and with other deep learning models. As for the former, we conducted comparison experiments using MLC, RF and SVM based on the same training and testing datasets, using grid search for the parameterization of both RF and SVM. It turns out that an RF with 300 decision trees and a max depth of 15, and an SVM with a radial basis kernel, a gamma [66] of 0.001 and a penalty coefficient (C) [66] of 100, performed best, respectively. Table 8 shows the comparison between the proposed model and these classical machine learning methods, indicating that the deep learning based model has an advantage over the classical methods; a detailed discussion will follow in Section 4.4. As for the latter, we compared the proposed ARCNN with several previous deep learning models, namely the stacked LSTMs of Ndikumana et al. [46], the recurrent convolutional network of Mou et al. [47] and the 3D-CNN of Ji et al. [43]. Because the input and output dimensions of these models differ from ours, we made the necessary changes when reproducing them. Table 9 indicates that the proposed ARCNN performs better than these previous deep learning models: the OA is boosted by 2.53% to 8.36% while the Kappa rises by 2.78% to 9.23%. A detailed discussion will be presented in Section 4.4.
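The exhaustive grid-search procedure used to tune RF and SVM can be sketched generically as follows. The scoring function below is a hypothetical stand-in for cross-validated accuracy, and the parameter names are illustrative only:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one.
    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate: callable mapping a parameter dict -> score (higher is better)."""
    names = list(param_grid)
    best_score, best_params = float('-inf'), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# toy objective that peaks at the settings reported in the text (hypothetical)
grid = {'n_trees': [100, 300, 500], 'max_depth': [10, 15, 20]}
best, score = grid_search(
    grid, lambda p: -abs(p['n_trees'] - 300) - abs(p['max_depth'] - 15))
print(best)  # {'n_trees': 300, 'max_depth': 15}
```

Exhaustive search is tractable here because the grids are small; its cost grows multiplicatively with the number of candidate values per parameter.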

Impact of Attention-Based RNN on Vegetable Mapping
In this section, we discuss the impact of the attention mechanism in the RNN on vegetable mapping. According to the ablation results of Section 3.3.1, comparisons were made between different methods for spatial-temporal feature fusion: feature-stacking, Bi-LSTM and Bi-LSTM-Attention. Results show that feature-stacking yields the lowest accuracy, with an OA of 89.56% and a Kappa of 0.8849. The reason is that feature-stacking simply concatenates all the multi-temporal features without considering the relationships and temporal dependencies across the sequential UAV data. Since Bi-LSTM can capture the dependencies of the sequential features in a bi-directional way, it shows better performance than simple feature-stacking, with an OA improvement of 1.37%. In this study, we added an attention layer on top of the Bi-LSTM to further improve its performance. The attention-based Bi-LSTM enhances the important features while suppressing the less informative ones, outperforming feature-stacking and Bi-LSTM with OA increases of 3.24% and 1.87%, respectively, which verifies its effectiveness in spatial-temporal feature fusion.

Impact of Deformable Convolution on Vegetable Mapping
Another hypothesis of this study is that scale and shape variations can be accounted for by the deformable convolution. Table 7 indicates that the inclusion of deformable convolution boosts the classification performance: the OA improved from 91.96% to 92.80%, a rise of about 1%, justifying the role of deformable convolution. The reason for the lower accuracy of standard convolution is its fixed kernel shape, which lacks the capability to model the geometric transformations of complex landscapes. Deformable convolution, in contrast, has a flexible receptive field that can adapt to the shape and scale variability of remotely sensed imagery [44]. Therefore, deformable convolution performs better, especially under the complex and fragmented agricultural landscape in this study.
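To convey the intuition behind deformable sampling, the following 1-D sketch shifts each sampling location by a fractional offset and reads the value back via linear interpolation. This is a toy illustration only; the actual operator of [44] works on 2-D feature maps with offsets predicted by an extra convolutional layer:

```python
import numpy as np

def deformable_sample_1d(signal, centers, offsets):
    """Sample a 1-D signal at fractional positions centers + offsets,
    using linear interpolation between neighboring samples."""
    pos = np.clip(np.asarray(centers, float) + offsets, 0, len(signal) - 1)
    lo = np.floor(pos).astype(int)                 # left neighbor index
    hi = np.minimum(lo + 1, len(signal) - 1)       # right neighbor index
    frac = pos - lo                                # fractional part
    return (1 - frac) * signal[lo] + frac * signal[hi]

sig = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
out = deformable_sample_1d(sig, [1, 2, 3], np.array([0.5, -0.25, 0.0]))
print(out)  # interpolated values 1.5, 1.75, 3.0
```

Because the offsets are continuous and the interpolation is differentiable, the network can learn where to sample, which is what lets the receptive field deform to object shape.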

Impact of Multi-Temporal UAV Data on Vegetable Mapping
In addition, we further discuss the role of multi-temporal UAV data in vegetable mapping. One of the main objectives of this study is to explore whether the incorporation of multi-temporal UAV RGB images could improve the vegetable classification accuracy. The initial motivation lies in the fact that RGB images acquired by UAV have a low spectral resolution, which makes the fine-grained classification of various vegetable categories difficult. Therefore, in this study we selected images from three important periods, i.e., the sowing period, the growing period and the harvesting period, to capture the phenological characteristics of different vegetables. Although three dates may seem limited, all of them fall into distinct periods of the vegetable and crop growth stages, which still provides additional and useful time-series features for classification.
Meanwhile, in previous studies, images from only three dates have been used for remote sensing image classification and have outperformed single-date datasets. For instance, Palchowdhuri et al. used three images from multi-temporal Sentinel-2 and WorldView-3 imagery for crop classification in Coalville in the United Kingdom and achieved an accuracy of 91% [67]. Similar findings were reported by Yang et al., where three images from summer, autumn and winter were integrated for coastal land cover mapping [68]. In our previous study [15], we also utilized only three images during the whole crop growing season for cropland classification in the Yellow River Delta of China, which yielded an average accuracy of 89%, justifying the use of three dates for classification. In future research, images spanning a longer temporal range could be included to further improve the classification performance.

Comparison with Other Methods
In this section, we provide a detailed comparison between the proposed ARCNN, classical machine learning methods and several previous deep learning methods. Specifically, Table 8 indicates that our proposed method outperforms machine learning methods such as MLC, RF and SVM, with OA increases of 27.29%, 21.58% and 8.64%, respectively. These results are in accordance with [46] and our previous studies [39,44]. The reason could be that classical machine learning methods lack the ability to capture high-level representative features when compared with deep learning models, leading to a performance gap in vegetable mapping.
In addition, there is a need to compare the proposed ARCNN with other methods for multi-temporal UAV image classification. Recent studies such as van Iersel et al. [14] and Michez et al. [12] both utilized object-based image analysis (OBIA) and random forest for plant classification from multi-date UAV data, with manually designed features such as band ratios and vegetation indices used for classification. Compared with their studies, we replace the manually designed features with high-level, discriminative features that are automatically learned by deep neural networks (i.e., CNN and RNN), which enhances the representativeness of the features. To the best of our knowledge, this study is the first to introduce deep learning methods into multi-temporal UAV image classification; therefore, the proposed method might provide a useful reference for future studies.
Meanwhile, it is also necessary to compare the proposed ARCNN with other deep learning models for remote sensing image classification. Early studies mainly utilized LSTM for multi-temporal classification. One representative study is Ndikumana et al., where five LSTMs were stacked for classification using multi-temporal Sentinel-1 SAR data [46]. The input in Ndikumana's study is the temporal curve of a single pixel, which neglects the rich contextual relationships hidden in the spatial features, resulting in a relatively lower accuracy (84.44%). Different from Ndikumana et al. [46], we add a CNN in front of the LSTM to enrich the extraction of representative spatial features.
Mou et al. also cascaded a CNN and an RNN for change detection from two optical remote sensing images [47]. Compared with Mou et al. [47], our model makes two significant improvements. Firstly, from the perspective of the CNN, we incorporate multi-scale deformable convolutions, which can aggregate multi-level contextual features. Secondly, we use the attention mechanism with a bi-directional LSTM to further enhance the modeling of sequential signals in multi-temporal remote sensing data. These modifications improve the classification accuracy from the 90.18% of Mou et al. to 92.80%.
Besides, Ji et al. adopted a 3D-CNN to extract spatial-temporal features for crop type mapping from multi-temporal satellite imagery [43]. Compared with Ji et al. [43], our method also achieves a more accurate result. The reason is that a 3D-CNN cannot explicitly establish the relationships between sequential signals, which limits its spatial-temporal feature fusion and integration, whereas our Bi-LSTM-Attention module is more direct in mining the relationships across multi-temporal data.

Conclusions
This study proposed an attention-based recurrent convolutional neural network (ARCNN) for accurate vegetable mapping based on multi-temporal unmanned aerial vehicle (UAV) red-green-blue (RGB) data. The proposed ARCNN first leverages a multi-scale deformable CNN to learn and extract rich spatial features from each mono-temporal UAV image, which aims to account for the shape and scale variations under complex and fragmented agricultural landscapes. Afterwards, an attention-based bi-directional long short-term memory (LSTM) network is introduced to model the relationship between the sequential features, from which spatial and temporal features are fused and aggregated. Finally, the fused features are fed to a fully connected layer and a softmax classifier to determine the vegetable category.
Experimental results showed that the proposed ARCNN yields a high classification performance with an overall accuracy (OA) of 92.80% and a Kappa coefficient of 0.9206. When compared with mono-temporal classification, the incorporation of multi-temporal UAV data boosted the OA significantly, by an average increase of 24.49%, which verifies the hypothesis that multi-temporal UAV observations could enhance the inter-class separability and thus reduce the drawback of the low spectral resolution of off-the-shelf digital cameras. The Bi-LSTM-Attention module outperforms other fusion methods such as feature-stacking and bi-directional LSTM, with OA increases of 3.24% and 1.87%, respectively, justifying its effectiveness in modeling the dependency across the sequential features. Meanwhile, the introduction of deformable convolution also improved the OA by about 1% when compared with standard convolution. In addition, the proposed ARCNN shows higher performance than classical machine learning classifiers such as the maximum likelihood classifier, random forest and support vector machine, as well as several previous deep learning methods for remote sensing classification.
This study demonstrates that the proposed ARCNN can yield accurate vegetable mapping results from multi-temporal UAV RGB data. The drawback of the low spectral resolution of RGB images can be compensated by introducing additional phenological information and robust deep learning models. Although images from only three dates were included, a good classification result could still be achieved provided that all three dates fall into distinct growing periods of the vegetables. Finally, the proposed model could be viewed as a general framework for multi-temporal remote sensing image classification. As for future work, more study cases should be considered to justify the effectiveness of the proposed method, and semantic segmentation models should be incorporated to obtain a more accurate vegetable map.