Method for Mapping Rice Fields in Complex Landscape Areas Based on Pre-Trained Convolutional Neural Network from HJ-1 A/B Data

Abstract: Accurate and timely information about rice planting areas is essential for crop yield estimation, global climate change research and agricultural resource management. In this study, we present a novel pixel-level classification approach that uses a convolutional neural network (CNN) model to extract features from the enhanced vegetation index (EVI) time series curve for classification. The goal is to explore the practicability of deep learning techniques for rice recognition from mid-resolution remote sensing images in complex landscape regions, where rice is easily confused with its surroundings. A transfer learning strategy is used to fine-tune a pre-trained CNN model and obtain the temporal features of the EVI curve. A support vector machine (SVM), a traditional machine learning approach, is also implemented in the experiment for comparison. Finally, we evaluate the accuracy of the two models. Results show that our model performs better than SVM, with overall accuracies of 93.60% and 91.05%, respectively. Therefore, this technique, which applies a pre-trained CNN model to time series data, is appropriate for estimating rice planting areas in southern China. Future studies may find further opportunities for crop classification that combines remote sensing and deep learning techniques.


Introduction
Information on the spatial distribution and area of rice is of great significance for yield prediction, greenhouse gas emission assessment and food security. Rice provides a stable source of energy for more than half of the world's population, especially in Asia [1,2]. However, surveys indicate that agricultural areas declined over the period 1994-2014 and that rice fields occupy only approximately 11% of global arable land [3,4]. As global warming intensifies, the role of rice cannot be neglected because rice cultivation generates considerable amounts of methane, accounting for approximately 5-19% of global emissions to the atmosphere [5,6]. Food security, which is closely related to human survival, has recently become a worldwide focus [7]. A comprehensive understanding of rice fields could aid government decision making in resource management [8]. Hence, precise information about rice planting areas is highly critical.
Unlike traditional field investigation, remote sensing is an important approach to Earth observation and has become widely used in crop monitoring because of its large coverage, real-time capabilities and low costs [9]. The existing literature proposes several approaches for estimating rice planting areas based on remote sensing. These approaches are grouped into three types. First, as a response to survival circumstances, phenological features could well reflect regular [...] HJ-1 A/B satellite data, which offer a 30 m spatial resolution and a four-day revisit cycle (two days when the two satellites are combined), bring a new direction for time series development. As mentioned previously, CNN is promising for estimating rice fields and effective in extracting deep features. However, the number of available samples from remote sensing images is limited, making it difficult to support the training step. Transfer learning shares knowledge from one or more source tasks and applies such knowledge to a similar target domain [41]. Hence, we are interested in using the CNN model to extract features of temporal VI by introducing transfer learning instead of training from scratch. For comparison, the SVM model is also used in the experiment. SVM is a nonparametric classifier with a shallow structure that maps a linearly inseparable problem from a low-dimensional space into a high-dimensional feature space via a nonlinear mapping, so that the problem becomes linearly separable [42].
Considering the necessity and challenges of mapping paddy rice fields in the complex landscapes of southern China, we aim to explore the feasibility of using a deep CNN model for accurate rice field mapping on the basis of VI time series images. This study ultimately provides a pixel-based model that relies on deep temporal features for rice area extraction. The SVM model is used for comparison.

Study Area
Zhuzhou City is situated in the east of Hunan Province, China, extending from 26°03′05″ to 28°01′07″ N and 112°57′30″ to 114°07′15″ E (World Geodetic System-1984 (WGS-84) Coordinate System), and is adjacent to the Xiangjiang River (Figure 1). The city lies in the subtropical monsoon zone, with sufficient sunlight and a warm and humid climate. Paddy rice cultivation is abundant in this city owing to an annual average temperature of 16-18 °C and annual precipitation of 1500 mm. The city is also a well-known commodity grain base in China. Additionally, Zhuzhou City is dominated by typical hilly landforms, hence the patchiness and fragmentation of its cropland. Furthermore, the complex planting structure in this region heightens the difficulty of accurately estimating rice fields from remote sensing images. Thus, Zhuzhou is appropriate for our study.

Satellite Data
Eight HJ-1 A/B images acquired from 13 May to 3 November 2017 were collected to generate EVI datasets for machine learning and classification. Table 1 summarizes the acquisition dates of these images: two in May (13 May and 29 May), two in July (22 July and 26 July), two in August (20 August and 25 August), one in September (18 September) and one in November (3 November). All images were downloaded free of charge from the China Center for Resources Satellite Data and Application (http://www.cresda.com/CN/). The major parameters of the two satellites are presented in Table 2. Two charge-coupled device (CCD) sensors were carried on board, with a 30 m spatial resolution, a 360 km swath width and a four-day temporal resolution. These sensors were also identical in their multispectral bands, which comprised four channels: band 1 (blue), 0.43-0.52 µm; band 2 (green), 0.52-0.60 µm; band 3 (red), 0.63-0.69 µm; and band 4 (near infrared), 0.76-0.90 µm.
All acquired images were pre-processed in the ENVI 5.2 image processing software (Exelis VIS, White Plains, NY, USA) prior to use. The processing covered five steps. First, absolute radiometric calibration was performed for each band of each image according to the calibration coefficients and formula provided by the China Center for Resources Satellite Data and Application, and the calibrated bands were then stacked for later use. Second, we obtained the spectral response curve data from the same center to produce four spectral response functions and converted the image format in preparation for atmospheric correction. Third, we conducted the atmospheric correction with the FLAASH module (Spectral Sciences Inc. & U.S. Air Force Research Laboratory). Fourth, automatic registration was performed among the CCD images to eliminate the errors between different sensors, with a root mean square error (RMSE) of less than one pixel. Finally, the images were uniformly re-projected to the Universal Transverse Mercator (UTM)-WGS84 system.

Ancillary Datasets
Ground-truth data, together with observation points from Google Earth, were used in this study to support classification and verification. In total, 3577 points were available for result verification. In the field surveys, 419 points were visited, and every land use patch covered more than 900 square meters. The position of each site, measured by a Global Positioning System (GPS) device, and the corresponding land type were recorded. The other points were selected randomly from Google Earth. All these points were stored in a raster dataset in TIFF format with a 30 m spatial resolution. This dataset serves as the reference data for accuracy assessment.
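As an illustration of how such reference points can be compared with a classified map, the following minimal sketch builds a confusion matrix and the overall accuracy. The function names and the eight toy labels are hypothetical and are not taken from our reference dataset; the four class names follow the land types used later in this study.

```python
def confusion_matrix(predicted, reference, classes):
    """Rows index the reference class, columns the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, r in zip(predicted, reference):
        matrix[index[r]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of reference points whose predicted class matches."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


CLASSES = ["water", "forest", "rice", "others"]

# Toy labels for illustration only (not real verification points).
reference = ["rice", "rice", "forest", "water", "others", "rice", "forest", "others"]
predicted = ["rice", "others", "forest", "water", "others", "rice", "forest", "rice"]

matrix = confusion_matrix(predicted, reference, CLASSES)
accuracy = overall_accuracy(matrix)
```

In practice the two label lists would be read from the classified raster and the reference raster at the verification point locations.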

Methodology
A novel approach based on deep learning is proposed in this study to accurately estimate paddy rice areas in southern China. The workflow of our approach is shown in Figure 2. On the basis of EVI time series data, we applied the SVM model and the CNN model separately to perform classification. The CNN model was built from a pre-trained model, and the features of the EVI time series curves (shape, amplitude and abstract features) were extracted to assist classification. A comparative accuracy assessment of the two results was then conducted. The concrete procedures are as follows. (1) We calculated the VI of the multitemporal satellite data and then reconstructed the time series curve as input. (2) A deep learning model for classification was developed within the Convolutional Architecture for Fast Feature Embedding framework [43], with a transfer learning strategy introduced.

Construction of EVI Time Series
Extensive research has suggested that VIs objectively reflect plant growth states, with evident seasonal characteristics, periodicity and differences among land types [19,44]. Thus, selecting a proper index for our study was critical. NDVI, one of the most popular VIs, has been successfully used for crop monitoring [45,46]. However, saturation tends to occur in regions with dense vegetation coverage, thereby limiting its further application. As an improved VI, EVI has attracted increasing attention because it accounts for the interference of environmental factors and soil background. This index compensates for the saturation problem of NDVI and shows a high sensitivity to vegetation changes; hence, we chose it instead of NDVI to detect different land types [33]. The EVI computational formula is given as follows:

$$\mathrm{EVI} = 2.5 \times \frac{\rho_{NIR} - \rho_{RED}}{\rho_{NIR} + 6\,\rho_{RED} - 7.5\,\rho_{BLUE} + 1}$$

where $\rho_{NIR}$, $\rho_{RED}$ and $\rho_{BLUE}$ correspond to the reflectances of the near-infrared, red and blue bands, respectively.
Despite the rigorous pre-processing operations conducted initially, noise from clouds, aerosols and other external factors (such as shadow and snow) still appears in VIs [47,48]. Thus, a filtering algorithm is necessary to minimize noise in the next step. At present, three filtering algorithms are commonly used: double logistic model functions, asymmetric Gaussian model functions and the Savitzky-Golay filtering method. In the current study, we chose the last to smooth the EVI time series. The Savitzky-Golay method applies local polynomial fitting to the time series to generate the filtered value of each point; its major feature is that it preserves the shape and width of the curve while removing noise [49]. Furthermore, interpolation and polynomial fitting procedures were introduced to obtain daily EVI values. We illustrate the EVI time series of rice in Figure 3.
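The steps described above (EVI computation, gap filling to daily values and Savitzky-Golay smoothing) can be sketched as follows. This is a minimal illustration rather than the processing chain actually used: the window length (5 points), the polynomial order (2), the use of linear interpolation and the order of interpolation and smoothing are all assumptions, and the reflectance triples are hypothetical values for a single pixel. Only the acquisition dates are taken from the Satellite Data section.

```python
from datetime import date


def evi(nir, red, blue):
    """Enhanced vegetation index from NIR, red and blue reflectances."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def daily_linear(days, values):
    """Linearly interpolate samples at integer days to one value per day."""
    daily = []
    for i in range(len(days) - 1):
        d0, d1 = days[i], days[i + 1]
        v0, v1 = values[i], values[i + 1]
        for d in range(d0, d1):
            daily.append(v0 + (v1 - v0) * (d - d0) / (d1 - d0))
    daily.append(values[-1])
    return daily


# Coefficients of the 5-point quadratic Savitzky-Golay smoothing filter.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(values):
    """Apply the 5-point filter; the two samples at each end are kept as-is."""
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sum(c * values[i + k - 2] for k, c in enumerate(SG5))
    return out


# Acquisition dates (month, day) in 2017, converted to day of year.
ACQUISITIONS = [(5, 13), (5, 29), (7, 22), (7, 26), (8, 20), (8, 25), (9, 18), (11, 3)]
days = [date(2017, m, d).timetuple().tm_yday for m, d in ACQUISITIONS]

# Hypothetical (NIR, red, blue) reflectances of one pixel, illustration only.
reflectances = [
    (0.20, 0.10, 0.06), (0.28, 0.09, 0.05), (0.45, 0.06, 0.04), (0.46, 0.06, 0.04),
    (0.42, 0.07, 0.04), (0.40, 0.07, 0.04), (0.33, 0.09, 0.05), (0.22, 0.11, 0.06),
]

series = [evi(*r) for r in reflectances]
daily = savgol5(daily_linear(days, series))
```

The Savitzky-Golay filter assumes evenly spaced samples, which is why this sketch interpolates to a daily grid before smoothing; a library routine such as `scipy.signal.savgol_filter` could replace `savgol5` where SciPy is available.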

Extraction of Different Phenological Patterns
In a dense stack of satellite data, every land type has its own phenological sequence pattern corresponding to a specific VI curve; this pattern can be introduced into crop classification as a distinguishing characteristic [28,50]. Therefore, we divided the study area into four major land types: water, forest, rice and others (such as bare land, buildings and abandoned cropland). The unique characteristics of these land types are expected to assist image classification. It is therefore essential to learn more about the characteristics of the EVI time series curves, which reflect the phenological sequence patterns of the objects. The typical EVI time series curves of the four land types are illustrated in Figure 4.

Architecture of CNN
With the advantages of local connections, shared weights, pooling and multiple layers, CNN, a well-known feed-forward network consisting of neurons with learnable weights and biases, was chosen as the main framework of our proposed model [37,51]. We developed a model based on LeNet-5 (Yann LeCun), which is regarded as the basic prototype of existing CNNs. The architecture of the model (Figure 5) includes input, convolutional, pooling, fully connected and output layers.
Information first enters the network through the input layer. Next is the convolutional layer, which is the core module of the entire network and plays a significant role in feature extraction through convolution. It contains multiple convolutional planes, each consisting of numerous neurons. Each neuron is locally connected to the previous layer by convolutional kernels, which determine the feature extraction procedure. Then, a rectified linear unit (ReLU), a non-linear activation function, is applied to introduce nonlinearity [52]. We convolved various convolutional kernels with the input to obtain different features. Although initially undetermined, these kernels were adjusted through repeated training.
The pooling layer, which alternates with the convolutional layer, performs down sampling to reduce the feature dimension; it is therefore also called the down sampling layer. This layer mainly reduces the number of connections among neurons, which not only accelerates the computation but also helps to enhance robustness. The typical pooling algorithms are max pooling and average pooling [53]. Generally, the pooling layer is regarded as a secondary feature extraction layer and is as important as the convolutional layer.
The fully connected layer implements classification on the basis of the locally separable features acquired from the previous layers. Each neuron is fully linked to all neurons in the preceding layer. The last layer is commonly considered the output layer, where softmax regression is applied for the classification task.
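The layer operations described above can be illustrated by a toy forward pass on a one-dimensional signal. This sketch is not the LeNet-5-based network used in this study: the signal, the kernel, the weights and the single convolution/pooling stage are all hypothetical and are chosen only so that each operation can be followed by hand.

```python
import math


def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D convolution (cross-correlation, as in most CNN libraries)."""
    n = len(signal) - len(kernel) + 1
    return [sum(k * signal[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(n)]


def relu(xs):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in xs]


def max_pool(xs, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


def dense(xs, weights, biases):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]


def softmax(zs):
    """Numerically stable softmax."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical EVI-like signal and a kernel that responds to rising slopes.
signal = [0.1, 0.2, 0.5, 0.7, 0.6, 0.3, 0.2]
kernel = [-1.0, 0.0, 1.0]

features = max_pool(relu(conv1d(signal, kernel)), size=2)

# Hypothetical weights mapping the 2 pooled features to 4 land types.
weights = [[0.5, -0.2], [-0.3, 0.8], [1.0, 0.4], [0.1, 0.1]]
biases = [0.0, 0.0, 0.0, 0.0]
probabilities = softmax(dense(features, weights, biases))
```

Here the convolution yields five slope responses, ReLU removes the negative ones, and pooling keeps the strongest response in each pair before the fully connected layer and softmax produce four class probabilities.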

Strategy of Transfer Learning
To address the problem of insufficient samples from remote sensing data, transfer learning, which transfers previously learned knowledge to a new task, was introduced in our work. According to the content that is transferred, transfer learning methods are divided into four categories: instance-based, feature-based, parameter-based and relational-based transfer learning [41]. We used parameter-based transfer learning, in which the model parameters are shared between the source domain and the target domain. In other words, a model trained on a large amount of data from the source domain was applied to the target domain, where it could be trained with less data. The network could therefore assimilate generic features, which promotes its application to small datasets [54].
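A minimal sketch of parameter-based transfer is given below: the parameters of a source model are stored in a file, reloaded, and reused as the starting point of the target model. The layer names, layer sizes and the JSON file format are hypothetical. The sketch also assumes that the output layer is re-initialised for the four land types while all other layers are copied; this detail is an assumption made for illustration.

```python
import copy
import json
import os
import random
import tempfile


def init_layer(n_out, n_in, rng):
    """Randomly initialised fully connected layer."""
    return {
        "weights": [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                    for _ in range(n_out)],
        "biases": [0.0] * n_out,
    }


def save_params(params, path):
    """Store all layer parameters in a single JSON file."""
    with open(path, "w") as fh:
        json.dump(params, fh)


def load_params(path):
    """Read layer parameters back from the JSON file."""
    with open(path) as fh:
        return json.load(fh)


def transfer(source_params, n_target_classes, rng):
    """Copy every layer, then replace the output layer with a fresh one."""
    target = copy.deepcopy(source_params)
    n_in = len(source_params["output"]["weights"][0])
    target["output"] = init_layer(n_target_classes, n_in, rng)
    return target


rng = random.Random(0)

# Hypothetical, heavily shrunk stand-in for a source model with 10 classes.
model_1 = {"fc1": init_layer(8, 16, rng), "output": init_layer(10, 8, rng)}

path = os.path.join(tempfile.mkdtemp(), "model_1.json")
save_params(model_1, path)

# Target model: reused hidden layer, new 4-class output layer (assumption).
model_2 = transfer(load_params(path), n_target_classes=4, rng=rng)
```

Fine tuning would then continue training `model_2` on the target samples, starting from the copied parameters rather than from random values.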
The implementation of transfer learning in the experiment is illustrated in Figure 6. First, we set up a pre-trained model (Model 1) by using the Modified National Institute of Standards and Technology (MNIST) database (http://yann.lecun.com/exdb/mnist/). The dataset consists of handwritten numerals from high school students and staff of the Census Bureau. Specifically, it comprises 50,000 training samples and 10,000 testing samples, which are all in grayscale and normalized to 28 × 28 pixels. More significantly, different handwritten numerals are mainly distinguished by the shape of their lines, which is similar to the EVI time series curves in our study. Subsequently, the parameters of Model 1 were stored in an individual file, which was transferred into Model 2. Finally, we conducted the fine-tuning process by using the EVI time series to determine the final CNN model. There were 1893 samples for deep learning, which were collected randomly by visual interpretation and were divided into training and testing samples at a ratio of 2:1. These samples were normalized uniformly before model training.
It is essential to bring the pre-trained CNN model as close as possible to an optimal state by adjusting several training parameters. Here, we mainly discuss the effects of two parameters, batch size and learning rate, on model performance. Batch size is the number of samples used in each training iteration. Commonly, within a reasonable range, a large batch size yields a more precise descent direction and a slighter oscillation. By contrast, a small batch size may introduce randomness and poor convergence. Therefore, a proper batch size should be set in relation to the sample scale. We adopted 100, 64 and 32 as batch sizes in the current work. Learning rate, another important factor, is a parameter that adjusts the step size of gradient descent in a network. It determines how quickly the parameters approach their optimal values. An excessively high learning rate updates the parameters rapidly, causing the network to easily converge to a local optimum. Conversely, an extremely low value reduces the efficiency and leads to slow convergence. Thus, setting a proper learning rate is necessary. We tested the values 0.1, 0.01 and 0.001 to explore the impacts of different learning rates on the results.
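The roles of the two parameters can be illustrated with a small sketch. The first function counts how many mini-batches are needed to pass once through the training samples implied by the 2:1 split of the 1893 samples; the second runs plain gradient descent on the toy function f(w) = w², which is a stand-in chosen for illustration and not the loss of the CNN. The function names and the number of steps are hypothetical.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to visit every training sample once."""
    return math.ceil(n_samples / batch_size)


def gradient_descent(learning_rate, steps, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


# 2:1 split of the 1893 samples gives 1262 training samples.
n_train = 1893 * 2 // 3
for batch_size in (100, 64, 32):
    print(batch_size, iterations_per_epoch(n_train, batch_size))

# Distance from the optimum (w = 0) after 100 steps for each learning rate.
for learning_rate in (0.1, 0.01, 0.001):
    print(learning_rate, abs(gradient_descent(learning_rate, steps=100)))
```

On this toy function the smallest learning rate remains farthest from the optimum after the same number of steps, which mirrors the slow convergence described above, whereas a rate above 1.0 makes the iterate grow instead of shrink. The toy function has a single minimum and therefore cannot reproduce convergence to a local optimum.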

Characteristics of EVI Time Series
Figure 4 shows that the shape, trend and other aspects of the temporal curves differ clearly among land types. First, rice fields are distinctive because they are partially covered with a mixture of soil, water and rice seedlings during the transplanting and early growth periods [13]. Accordingly, a low value is observed in the EVI curve during these periods. In the tillering and jointing stages, the roots and leaves develop rapidly and the EVI value increases sharply to a peak above 0.6. According to the crop phenological calendar, rice then shifts from vegetative growth (roots, stems and leaves) to reproductive growth (flowering and grain formation), when the produced organic matter is gradually transported to and stored in the grains. During this phenological stage, the final grain number of rice is determined and the EVI curve tends to decline. The curve decreases continuously until harvest because rice leaves undergo senescence and droop in the maturation stage. Forest has a long growth cycle and maintains a consistently high EVI value throughout the rice development stages owing to the flourishing trees; its temporal curve tends to be locally steady. The curves for water and others are characterized by low EVI values and are smoother than the temporal curve of rice. Water remains extremely stable with a particularly low EVI value, which is consistent with its strong absorption. As a result of the diverse land types in our study area, the class of others comprises several types of temporal EVI curves, such as those for abandoned land and buildings, which exhibit similar features.

Details of Fine Tuning Procedure
The changes in model performance under batch sizes of 100, 64 and 32 are displayed in Figure 7. From an accuracy perspective (a,b), the curves under the three batch sizes increased rapidly and exceeded 0.85 at almost the same time. The curves then remained stable, fluctuating around 0.90. When the iterations were completed, the accuracy curves of the three batch sizes were nearly equal. Moreover, the three loss curves (c,d) showed an overall consistent convergence. As the number of iterations increased, the oscillation of the curves diminished and they tended toward stability. Among them, the slightest oscillation was observed when the batch size was set to 100. Therefore, a batch size of 100 was applied in our experiment.
As for the learning rate, the differences in model performance among the three values of 0.1, 0.01 and 0.001 are shown in Figure 8 (a,b show the accuracy curves and c,d show the loss curves). Briefly, the results indicated that, within the tested range, a higher learning rate was associated with better model performance. First, every accuracy curve quickly reached a value above 0.8. Afterwards, the 0.001 curve rose more slowly than the others. After 1000 iterations, the 0.1 and 0.01 curves had already entered a smooth period, whereas the 0.001 curve had yet to reach its highest accuracy. When the iterations stopped, the accuracy of the 0.1 curve was the highest. Additionally, the three loss curves dropped quickly to a low value. After more than 3000 iterations, the other curves followed a trend similar to that of the 0.001 curve. Considering the fast convergence rate and higher accuracy, we ultimately set the learning rate to 0.1.

Classification Results
The classification results of SVM and the proposed method are displayed in Figure 9. Panels (a,c) were produced by the SVM model and show the distribution of the four land types and the spatial distribution of rice alone, respectively; panels (b,d) show the corresponding results of the deep learning technique. On the whole, the results obtained by SVM and CNN roughly coincide. Our study area is mostly covered by forest on hills. The water class consists of the Xiangjiang River and its major branch, the Lushui River, together with small ponds scattered across the entire zone. It must be noted, however, that the two classification models differ distinctly in their rice results. The two maps at the bottom indicate that rice is mainly planted in the valleys because of the special terrain conditions of the hills. The class "others" is mainly made up of buildings, bare land and abandoned cropland. Buildings are concentrated in Lukou District and part of Zhuzhou City, which lie in the eastern and northern areas, respectively. Owing to human intervention and other factors, a certain portion of the cropland near rice fields appears to be abandoned.

Accuracy Assessment
To evaluate the classifiers' ability to identify different objects, we make a comparative analysis from two perspectives: visual effect and confusion matrix. To examine the differences between the two classification results in detail, we selected eight patches from Google Earth (Figure 10) where rice fields are easily confused with their surroundings. Patches (a,c) were covered with lush trees, but the SVM model failed to identify them correctly. Our investigation found that some croplands had been abandoned and were choked by large quantities of weeds, such as site (b); SVM classified such abandoned land as rice rather than others. Conversely, the proposed CNN model achieved the desired result. In patches (d-h), a large amount of rice was grown, which is consistent with the outcomes obtained from the CNN model. In sum, the CNN model performs better than SVM in confusing regions.
Figure 10. Classification results of eight confusing patches (a-h), observed on Google Earth, based on SVM and CNN; (1,3,5,7,9,11,13,15) are the corresponding SVM results of the eight patches and (2,4,6,8,10,12,14,16) are the corresponding CNN results.

For an intuitive and accurate assessment of the results, we list further evaluation details of our model in Table 3. A total of 3577 reference points were used to evaluate the classification accuracy: 886 points for forest and 688 points for rice, with the remaining 860 and 1143 points belonging to the water and others classes. A good classification result was produced, with an overall accuracy of 93.60%. Each class achieved a high user's accuracy of over 90%, and the user's accuracy for rice fields reached 91.36%. The producer's accuracy was lowest for water (86.16%) and reached 95.35% for rice. We then compared the accuracies of the SVM model and our method (Table 4). For rice field recognition, the two classifiers differ clearly in performance. The model developed in our research achieved a user's accuracy, producer's accuracy and overall accuracy of 91.36%, 95.35% and 93.60%, respectively, all higher than the values obtained for the SVM model (90.55%, 80.81% and 91.05%, respectively). Therefore, the developed model enabled us to map rice fields in complex landscape areas according to the temporal features of VIs obtained from the CNN.
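The three measures reported above all derive from the confusion matrix. The following is a minimal sketch of that computation, assuming the common convention that rows hold the reference (ground truth) class and columns hold the mapped class. The function name `accuracy_metrics` and the counts in the example are hypothetical; they are not the values of Table 3 or Table 4.

```python
def accuracy_metrics(matrix, labels):
    """Compute overall, producer's and user's accuracy from a confusion matrix.

    matrix[i][j] is the number of reference points of class i (row)
    that the classifier assigned to class j (column).
    """
    n = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    overall = correct / total

    producers, users = {}, {}
    for k, name in enumerate(labels):
        row_total = sum(matrix[k])                       # reference points of class k
        col_total = sum(matrix[i][k] for i in range(n))  # points mapped as class k
        # Producer's accuracy: share of reference points of a class mapped correctly.
        producers[name] = matrix[k][k] / row_total if row_total else float("nan")
        # User's accuracy: share of points mapped as a class that truly belong to it.
        users[name] = matrix[k][k] / col_total if col_total else float("nan")
    return overall, producers, users


# Hypothetical counts for illustration only (NOT the study's data).
labels = ["rice", "forest", "water", "others"]
matrix = [
    [90, 4, 1, 5],
    [3, 95, 0, 2],
    [2, 1, 85, 12],
    [5, 2, 3, 90],
]

overall, producers, users = accuracy_metrics(matrix, labels)
print(round(overall, 4), round(producers["water"], 4), round(users["water"], 4))
```

A low producer's accuracy for a class means many of its reference points were missed (omission), whereas a low user's accuracy means many points mapped as that class belong elsewhere (commission). This is why the two values can differ noticeably for the same class.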

Discussion
Our paper explores the possibility of mapping rice planting areas via deep learning technology in southern China, where landscapes are complex. The classification results show that our model outperforms the traditional machine learning method (SVM) and thus demonstrate the importance of phenological signatures for rice identification in remote sensing images. Specifically, rice passes through distinct states over its growth process as the seasonal climate varies. In the early growth period, paddy rice is planted in flooded fields, which are a mixture of water, rice seedlings and soil. The rice then grows continuously as it passes from the vegetative stage to the reproductive stage. Over time, the rice leaves begin to turn yellow and wither before harvest. Together, these stages produce a unique phenological curve, which is the crucial point of this research. Hence, how to extract and use these rice identification features effectively urgently needs to be studied. We adopted the emerging concept of deep learning. As a powerful tool for feature extraction, CNN is particularly popular in image classification because it can effectively capture shallow features in its earlier layers and abstract features in its deeper layers. Accordingly, the outcomes are more convincing. In the experiment, different phenological curves were fed into the CNN model to obtain comprehensive and representative features that help the classification process. Our method copes well with the temporal and spatial heterogeneity of rice caused by diverse cropping systems, which makes the results more accurate.
Nevertheless, some unavoidable errors and constraints remain. First, the VI time series exerts great influence on the classification model. On the one hand, enough data should be gathered to construct a well-fitted time series curve, but uncontrollable cloud and rain cover leads to gaps in data collection in southern China, particularly in the growing seasons. We selected as much data as possible to reduce the fitting offset of the curves. On the other hand, because of atmospheric effects, some noise remains in the VI time series despite the filtering applied during preprocessing. Therefore, a proper filter algorithm should be studied further. Second, model training is also significant for error analysis. It involves various aspects, such as the quantity of samples, the number of network layers and the number of iterations. We used the simplest structure with seven layers in this study, which made it difficult to extract deeper features and thus affected model performance. A suitable CNN and its best parameters should therefore be determined through repeated experiments. Moreover, we assumed that only four types of EVI time series curves exist in our study area, and only these were fed into the CNN model. In fact, the landscape in hilly areas is complex and diverse, so some misclassification may occur. Consequently, we should consider as many other types with different VI time series features as possible. Lastly, the current research concentrated solely on the signatures of VI time series curves. Other signatures, such as spatial and texture signatures, are also widely used for remote sensing classification in the literature. To improve the CNN-based classification method, establishing a rice identification model based on multi-feature learning should be the key focus of our further work.
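To make the noise issue in the VI time series concrete, the sketch below applies a sliding median filter to a short EVI sequence. This is only one simple candidate, shown as an assumption for illustration; it is not the filter used in this study, and the function name `median_filter`, the window size and the sample values are hypothetical.

```python
from statistics import median


def median_filter(series, window=3):
    """Smooth a time series with a centered sliding median.

    The window is truncated at both ends of the series, so the output
    has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(series)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append(median(series[lo:hi]))
    return smoothed


# Hypothetical EVI values with two abrupt drops, as cloud contamination might cause.
evi = [0.20, 0.25, 0.05, 0.40, 0.55, 0.60, 0.30, 0.58, 0.45, 0.30]
smoothed = median_filter(evi, window=3)
print(smoothed)
```

A median window suppresses isolated spikes without averaging them into neighboring dates. A wider window would also flatten genuine short phenological changes, which is part of why the choice of filter merits further study.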

Conclusions
This research put forward a novel VI- and CNN-based method for recognizing rice planting areas in multispectral satellite images of southern China. We focused on capturing the unique signatures of rice phenological curves to distinguish rice from other land covers. VIs were therefore used to reflect the growth changes of different vegetation. Because it compensates for oversaturation, EVI was chosen as the optimal index to monitor vegetation growth. We started by generating various EVI time series curves from multi-temporal CCD data acquired by the HJ-1 A/B satellites. Then, a pre-trained CNN model containing seven layers was developed to extract the deep temporal features of the EVI curves. For comparative analysis, we also produced a rice area map with the SVM algorithm. Finally, we tested the classification results by using ground truth points and visual identification information from Google Earth, and carried out a quantitative comparison of the two methods for rice fields. In conclusion, our proposed model performed better than SVM in terms of the user's accuracy, the producer's accuracy (with an especially wide gap) and the overall accuracy.
Generally, we provided a flexible approach, which uses phenological features obtained by deep learning technology to estimate rice planting areas, and tried our best to elaborate its theoretical basis. Furthermore, several concrete statistical indexes were applied to confirm the feasibility of the proposed approach in actual experiments. This innovative approach effectively facilitated rice identification in mid-high resolution remote sensing images of complex landscape areas, which not only broadens the ways of extracting rice information but is also meaningful for predicting grain yield, mitigating climate change and managing resources. Overall, this idea shows great promise and may contribute to further research.
Author Contributions: T.J. conceived the idea, carried out the experiment and wrote the original manuscript. X.L. offered valuable advice for the research and provided significant comments to the manuscript. L.W. supervised the process of field survey and offered several suggestions to the experiment.

Conflicts of Interest: The authors declare no conflict of interest.