1. Introduction
Tea is an evergreen woody plant whose leaves produce a beverage known as one of the world's three major drinks together with cocoa and coffee [1]. As an economically important crop, tea plays a significant role in promoting the prosperity of many developing countries [2]. Accounting for nearly 38% of the global tea yield, China has the largest tea industry in the world, planting and producing tea in more than 20 provinces [3]. According to statistics, Zhejiang Province, one of the major tea-producing provinces in China, produced approximately 177,200 tons of tea in 2020 [4], a considerable proportion of the national yield. With the growing demand for tea, the area of tea plantations is also increasing. On the one hand, tea can promote the development of the local economy; on the other hand, it may cause negative effects such as damage to the ecological environment. There is thus an urgent need for proper supervision of tea plantations that maintains a balance between the economic benefits of tea production and the adverse effects of plantation expansion. However, efficiently and accurately acquiring the spatial distribution of tea plantations has always been a difficult problem in the fine and dynamic management of tea plantations. Therefore, an effective method for extracting tea plantations is of great significance for monitoring plantation area and providing early warning of tea tree disasters, thus improving tea production and quality. In addition, it can enhance the standardized management of tea plantations to improve garden greening configurations, conserve water, improve soil quality, raise biodiversity, and promote the sustainable development of the ecological environment.
Remote sensing is a technology that enables the detection of objects at a distance without contact. Given the advantages of prompt information acquisition and a wide observation range, remote sensing technology is widely used in resource censuses, land use planning, environmental monitoring, and so on [5,6,7,8]. With the rapid development of science and technology in recent years, the temporal, spatial, and spectral resolution of remote sensing satellite images has continuously improved. A large number of studies have used these images for the extraction of crop planting areas [9,10,11,12], among which the most extensively used remote sensing images are derived from the Gaofen satellites of China, the Sentinel satellites within the European Copernicus program, and the Landsat satellites within the Landsat Project of the United States. With rich spectral, spatial, and temporal information, many crops can be classified, effectively avoiding interference from the phenomena of "different objects with the same spectrum" and "the same object with different spectra" [13]. Consequently, remote sensing technology is conducive to monitoring the planting areas of many kinds of crops quickly, accurately, and efficiently to provide a reference for planting area statistics, spatial distribution mapping, etc.
The traditional methods for extracting tea plantations include field measurements and statistics, which are inefficient and untimely and do not meet the needs of modern agricultural development. To alleviate these problems, several scholars have conducted studies on the extraction of tea plantations using multispectral remote sensing images, such as Sentinel [14,15,16], MODIS [17], and Landsat [17,18,19] images. The methods they used are mainly traditional machine learning algorithms, such as decision tree (DT) [14,17,20], support vector machine (SVM) [15,16,21,22], maximum likelihood (ML) [23,24], and random forest (RF) [16,19,20,23]. These algorithms have a single structure, simple rules, and fine classification performance in specific cases, but they require the manual construction of features derived from specific prior knowledge to train the model. Among these studies, the commonly used features fall into three types: spectral, texture, and terrain features. Fine results can be achieved with one or more of these types of features, but more types of features generally lead to better results [15,17,23]. Nevertheless, this type of method requires considerable expertise and a large workload for feature engineering. In addition, the features that can be used are very limited when computational efficiency is also considered, and the selected features can hardly represent the characteristics of tea plantations completely or summarize the differences between tea plantations and other categories. Accordingly, the generalizability is relatively poor [25], which means that the resulting model usually has difficulty achieving the expected classification outcomes in other study areas.
In recent years, deep learning has gradually emerged. Owing to its ability to automatically and effectively acquire deep features of the data, it has been widely used in many fields. Compared with traditional machine learning algorithms, deep learning networks can achieve stronger robustness and higher extraction efficiency without manual features [26]. CNNs are deep learning algorithms with local connection and weight-sharing characteristics. Due to their advantages in spatial feature processing, CNNs have played an important role in various remote sensing tasks, including scene recognition [27], land use classification [28], super resolution [29], target detection [30], and data reconstruction [31]. Likewise, some scholars have carried out research on the extraction of tea plantations with CNN methods [26,32,33,34,35]. Tea trees are usually cultivated by ridge planting and arranged in strips, so tea plantations have specific spatial characteristics in remote sensing images. In addition, owing to the phenological cycle, the growth of tea trees changes periodically on an annual scale, so the characteristics of tea plantations in different periods of a year follow specific change rules. However, most current studies consider only either the spatial or the temporal characteristics of tea plantations in the images, or merely integrate the classification results of the two through image postprocessing [26,33,34,35]. There is still a lack of an end-to-end method for extracting tea plantations that comprehensively considers spatiotemporal features.
Recurrent neural networks (RNNs) are algorithms that can process data with time-series characteristics and contribute prominently to fields such as machine translation, speech recognition, and video processing. Moreover, some studies have demonstrated their great potential in crop classification from time-series remote sensing data [10,11,36,37]. However, the input data of RNN-type models usually need to be processed into one-dimensional form, resulting in ineffective use of spatial information. In view of the respective superiority of CNNs and RNNs in extracting spatial and temporal information, the purpose of our study is to construct an R-CNN method to extract the distribution of tea plantations in Xinchang County from multitemporal Sentinel-2 images and to evaluate its performance. We then obtained a detailed spatial distribution of tea plantations and analyzed how their distribution changes with elevation and slope.
The remainder of the paper is structured as follows: Section 2 introduces the study area and data, Section 3 describes the methods proposed for classifying tea plantations, Section 4 analyzes the experimental results, Section 5 discusses the research, and Section 6 provides the main conclusions of the article.
3. Materials and Methods
3.1. Data Preprocessing
Influenced by phenology and human management, the characteristics of tea plantations vary across periods. In Zhejiang Province, from approximately the end of February to March, tea trees are in the budding stage; from approximately March to April, tea trees are artificially pruned after the harvest; from approximately June to September, the pruned tea trees grow again and gradually reach the peak growth stage; from approximately October to November, with the gradual drop in temperature, tea trees enter a period of slow growth; and from approximately December to early February of the next year, tea trees enter the dormancy period as the temperature drops further [14,43]. Thus, we obtained low-cloud-cover Sentinel-2 L1C images (top-of-atmosphere reflectance products after orthorectification and geometric fine correction) from the Copernicus Data Center of ESA on five dates: 23 February, 13 May, 22 July, 9 November, and 24 December 2020. The Sen2Cor plug-in in ESA SNAP software was used for atmospheric correction of the images to generate L2A product images. Then, the image data were processed through band resampling, format conversion, layer stacking, and image mosaicking. The multitemporal Sentinel-2 remote sensing image of the study area was obtained by clipping the processed image with the vector file of the study area.
Based on the preprocessed Sentinel-2 images, 3 atmospheric bands (B1, B9, and B10) were removed, and the reflectances of the 10 remaining spectral bands (B2, B3, B4, B5, B6, B7, B8, B8A, B11, and B12) were used as the initial input features. In addition, we calculated six frequently used vegetation indices for the input feature combination experiments: the normalized difference vegetation index (NDVI) [44], modified normalized difference vegetation index (MNDVI) [21], enhanced vegetation index (EVI) [45], normalized difference vegetation index red-edge 1 (NDVIre1), normalized difference vegetation index red-edge 2 (NDVIre2), and normalized difference vegetation index red-edge 3 (NDVIre3) [46]. The calculation formulas for the vegetation indices are shown in Table 2.
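The vegetation indices above follow their standard definitions; as a concrete illustration, the sketch below computes NDVI, EVI, and a red-edge NDVI from Sentinel-2 reflectance bands with NumPy. The EVI coefficients used here (G = 2.5, C1 = 6, C2 = 7.5, L = 1) are the commonly published values and may differ from those in Table 2.

```python
import numpy as np

def ndvi(nir, red):
    # NDVI = (NIR - Red) / (NIR + Red), using B8 (NIR) and B4 (Red)
    return (nir - red) / (nir + red)

def evi(nir, red, blue):
    # EVI with the commonly used coefficients G = 2.5, C1 = 6, C2 = 7.5, L = 1
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

def ndvi_re(nir, red_edge):
    # Red-edge NDVI variants (NDVIre1/2/3) substitute B5/B6/B7 for the red band
    return (nir - red_edge) / (nir + red_edge)

b4 = np.array([0.05, 0.08])   # red reflectance for two example pixels
b8 = np.array([0.45, 0.40])   # NIR reflectance for the same pixels
print(np.round(ndvi(b8, b4), 3))
```

In practice these functions are applied band-wise to whole image arrays, so each index adds one extra channel to the input feature stack.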
Meanwhile, we measured the effect of different input features on the performance of the R-CNN model by designing five combination schemes as the initial input features: (1) common bands (B2, B3, B4, and B8); (2) common bands and red edge vegetation bands (B2, B3, B4, B5, B6, B7, and B8); (3) common bands and SWIR bands (B2, B3, B4, B8, B11, and B12); (4) all spectral bands; and (5) all spectral bands and vegetation indices.
By referring to the VHR Google Earth images of the corresponding period, we selected 278,528 pixels from the Sentinel-2 images of the study area as sample sets, comprising 75,147 tea plantation pixels and 203,381 pixels of other ground objects. Following the principle that the data in different sets are independent of each other and that the class distribution in all sets is similar [36], they were randomly divided into a training dataset, a validation dataset, and a test dataset at a ratio of 3:1:1. The training dataset was used to train the classification model, the validation dataset to select the best parameters, and the test dataset to evaluate the accuracy of the classification model. Each dataset contained several groups of data, each consisting of an image and a corresponding label.
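A minimal sketch of the random 3:1:1 split described above; for brevity it shuffles indices only and omits the per-class stratification that keeps the class distributions similar across the sets (which would additionally require the labels):

```python
import numpy as np

def split_3_1_1(n_samples, seed=0):
    # Shuffle sample indices, then cut at 60% and 80% for a 3:1:1 split
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * 3 / 5)
    n_val = int(n_samples * 1 / 5)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_3_1_1(278528)
print(len(train), len(val), len(test))
```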
3.2. R-CNN Method for Tea Plantation Extraction
CNN is a classical deep learning algorithm. The convolutional layers are the core of a CNN: they apply convolutional kernels to the input features and calculate the output features through an activation function [47]. The calculation formula is as follows:

y = f(w ⊗ x + b)  (1)

In Equation (1), w is the weight vector, x is the input feature vector, b is the offset vector, and f is the activation function. Due to their powerful ability to extract spatial features, CNNs have made great achievements in image semantic segmentation. Presently, numerous semantic segmentation models are widely used, including FCN [48], SegNet [49], PSPNet [50], UNet [51], and DeepLabv3 [52]. Among them, the UNet model has a simple structure, multiscale feature extraction capability, and excellent extraction results with only a small number of samples. UNet is mainly composed of two parts: the contracting path is mainly used to obtain context information and extract features, and the expansion path is mainly used for precise positioning, i.e., mapping the condensed features to the corresponding positions and then outputting the predicted results. In addition, each level in the contracting path uses a skip connection to concatenate its features with those at the same level in the expansion path, which fuses shallow-level and deep-level features and effectively combines local and global information [53].
53]. However, the original UNet model is limited to the segmentation of a single time-stage image and cannot obtain the temporal information in multitemporal images.
Unlike CNNs, RNNs are structures designed to process time-series information; widely used variants include the long short-term memory (LSTM) [54] network and the gated recurrent unit (GRU) [55] network. Compared with LSTM, GRU has fewer parameters and is applicable to small datasets. In general, the GRU network has inputs and outputs similar to those of an ordinary RNN: the inputs comprise the input value x_t at time t and the hidden state h_(t-1) at time t-1, and the outputs consist of the output value y_t and the hidden state h_t at time t. The update and reset gates are the two unique characteristics of GRU. The update gate decides whether to replace the hidden state at the previous time with a new hidden state. First, at time t, the update gate z_t is calculated:

z_t = σ(W_z x_t + U_z h_(t-1) + b_z)  (2)

The reset gate decides whether to forget the hidden state at the previous time. Then, the reset gate r_t at time t is calculated:

r_t = σ(W_r x_t + U_r h_(t-1) + b_r)  (3)

Additionally, based on the reset gate, the new candidate hidden state h̃_t is calculated as follows:

h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_(t-1)) + b_h)  (4)

Eventually, the update gate updates the hidden state and obtains the final output h_t:

h_t = (1 - z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t  (5)

In Equations (2)–(5), σ is the logistic sigmoid activation function; W_z, W_r, W_h, U_z, U_r, and U_h are the weight matrices; b_z, b_r, and b_h are the offset vectors; tanh is the hyperbolic tangent activation function; and ⊙ is the Hadamard product.
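A single GRU step following Equations (2)–(5) can be written directly in NumPy; the input and hidden dimensions below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate, reset gate, candidate state, new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate, Eq. (2)
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate, Eq. (3)
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state, Eq. (4)
    return (1.0 - z) * h_prev + z * h_cand                # new hidden state, Eq. (5)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # run a short 5-step sequence
    h = gru_step(x, h, params)
print(h.shape)
```

Because the new state is a convex combination of the previous state and a tanh candidate, every component of h stays in (-1, 1).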
We constructed an R-CNN model based on the UNet structure and GRU modules, as shown in Figure 2. The model is mainly composed of three parts: CNN encoders, GRU modules, and a CNN decoder. First, each batch of data is input into the model through six 3 × 3 convolutional layers (with a stride of 1, "same" padding, and a ReLU activation function) and two 2 × 2 max-pooling layers. To effectively enhance robustness and prevent the model from overfitting, a dropout layer is added after the first convolutional layer of each level. Second, the extracted spatial feature vectors are fed to the GRU modules in time-series order, and the hidden state sequence is calculated by the GRU network. Note that we use a bidirectional GRU network here, which integrates information from both forward and backward states. In addition, four 3 × 3 convolutional layers and two 2 × 2 transposed convolutional layers map the features back to the same spatial resolution as the input data. Meanwhile, the skip connections of the original UNet model are retained in our model. Finally, a 1 × 1 convolutional layer and a softmax activation function output the predicted results.
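The architecture described above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' exact network: the channel widths, dropout rate, number of levels shown, and the way the skip connections and GRU outputs are fused over time are all assumptions.

```python
import torch
import torch.nn as nn

class RCNN(nn.Module):
    """Illustrative UNet-style encoder + bidirectional GRU + decoder sketch."""
    def __init__(self, in_ch=10, n_classes=2, feat=16):
        super().__init__()
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Dropout(0.1),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(
            nn.Conv2d(feat, feat * 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat * 2, feat * 2, 3, padding=1), nn.ReLU())
        self.gru = nn.GRU(feat * 2, feat, batch_first=True, bidirectional=True)
        self.up = nn.ConvTranspose2d(feat * 2, feat, 2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(feat * 2, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(feat, n_classes, 1)   # 1x1 conv to class logits

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        skips, feats = [], []
        for i in range(t):                      # shared CNN encoder per date
            s1 = self.enc1(x[:, i])             # (B, F, H, W)
            feats.append(self.enc2(self.pool(s1)))   # (B, 2F, H/2, W/2)
            skips.append(s1)
        f = torch.stack(feats, dim=1)           # (B, T, 2F, H/2, W/2)
        hh, ww = f.shape[-2:]
        seq = f.permute(0, 3, 4, 1, 2).reshape(b * hh * ww, t, -1)
        out, _ = self.gru(seq)                  # bidirectional GRU over time
        fused = out.mean(dim=1).reshape(b, hh, ww, -1).permute(0, 3, 1, 2)
        up = self.up(fused)                     # back to (B, F, H, W)
        skip = torch.stack(skips, dim=1).mean(dim=1)   # fuse skips over time
        return self.head(self.dec(torch.cat([up, skip], dim=1)))

model = RCNN()
logits = model(torch.randn(2, 5, 10, 32, 32))   # 2 patches, 5 dates, 10 bands
print(logits.shape)                             # torch.Size([2, 2, 32, 32])
```

Softmax is omitted from the forward pass because PyTorch's cross-entropy loss expects raw logits; it would be applied only when exporting the final class probabilities.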
3.3. Other Methods for Comparison
We compared the tea plantation extraction results from our proposed method with those from the following methods:
- (1) RF [56] classification method: The RF classification algorithm is a traditional machine learning method developed from the decision tree (DT) algorithm, with the benefits of high training speed and a low risk of overfitting. It randomly draws data from the initial samples to form sample subsets, generates multiple decision trees trained on those subsets, and finally integrates the votes of the individual trees to determine the final prediction of the classification model.
- (2) SVM [57] classification method: The SVM classification algorithm is a traditional machine learning method with the advantages of a simple structure and insensitivity to outliers. It maps the sample data into a high-dimensional space and solves for an optimal hyperplane that partitions the data so that the data closest to the hyperplane on each side are as far from it as possible; on this basis, the sample data can be classified and predicted.
- (3) CNN classification method: The CNN classification model in the comparison experiments is built on the original UNet structure, i.e., the GRU modules are removed from the R-CNN model and only the CNN encoder and CNN decoder are used.
- (4) RNN classification method: The bidirectional GRU module is used to construct the RNN classification model in the comparison experiments.
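For reference, the RF and SVM baselines can be run with scikit-learn in a few lines. The feature matrix below is synthetic, standing in for the per-pixel band and index values used in the experiments; the hyperparameters are library defaults, not the paper's tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Toy per-pixel feature matrix: rows are pixels, columns are band/index values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)  # synthetic labels

# Train on the first 150 pixels, evaluate on the remaining 50.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:150], y[:150])
svm = SVC(kernel="rbf").fit(X[:150], y[:150])
print(rf.score(X[150:], y[150:]), svm.score(X[150:], y[150:]))
```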
3.4. Experimental Settings
In this experiment, we used the PyTorch framework to build the deep learning model on a Windows 10 system with an AMD Ryzen 7 4800H CPU and an NVIDIA GeForce RTX 2060 GPU with 6 GB of memory. In the training stage, the image data were first normalized by standard deviation and split into patches, and the image patch sets were fed into the model for feature extraction and feature mapping to obtain preliminary prediction results. Next, a loss function was used to calculate the error between the predicted result and the ground truth, and an adaptive moment estimation (Adam) optimizer was used to perform backpropagation iterations to dynamically adjust the parameters and learning rate of the model. Moreover, early stopping was set to avoid overfitting on the training dataset: if the loss value on the validation dataset does not improve for 10 consecutive training epochs, model training is terminated early. Ultimately, the trained model with the smallest loss value on the validation dataset was selected as the optimal model. The specific hyperparameter settings are shown in Table 3.
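The early-stopping rule described above (terminate when the validation loss fails to improve for 10 consecutive epochs) is framework-independent and can be sketched as a small counter class:

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best = val_loss          # new best model: reset the counter
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopping(patience=10)
losses = [0.9, 0.8, 0.7] + [0.7] * 10     # no improvement after epoch 3
stops = [stopper.step(l) for l in losses]
print(stops.index(True) + 1)               # training stops at epoch 13
```

In the training loop, `step` is called once per epoch with the validation loss, and the model weights are checkpointed whenever the counter resets.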
3.5. Evaluation Indicators
The tea plantation extraction task in this study is a binary semantic segmentation task, so we adopted the F1 score and intersection over union (IoU), two evaluation metrics commonly used in semantic segmentation, to evaluate the predicted results on the test dataset. The F1 score comprehensively reflects the precision and recall of the classification results, and the IoU describes the overlap rate between the classification results and the ground truth. The final score of each evaluation indicator is derived from tenfold cross-validation. The formulas for the evaluation indicators are

F1 = (2 × P × R) / (P + R)  (6)

IoU = A_I / A_U  (7)

where P is the precision, R is the recall, A_I is the intersection of the predicted and true areas for tea plantations, and A_U is the union of the predicted and true areas for tea plantations.
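For a binary mask, the intersection of the predicted and true tea areas equals the true-positive pixel count and their union equals TP + FP + FN, so both the F1 score and IoU can be computed from pixel counts; a minimal sketch with made-up counts:

```python
def f1_and_iou(tp, fp, fn):
    # Precision and recall from pixel counts, then F1 and IoU for the tea class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)   # intersection over union of predicted/true areas
    return f1, iou

f1, iou = f1_and_iou(tp=80, fp=10, fn=10)
print(round(f1, 3), round(iou, 3))   # 0.889 0.8
```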
In addition, to evaluate the predicted spatial distribution of tea plantations in the study area from each extraction method, we combined the field survey results and visual interpretation of randomly generated points on Google Earth images to select a total of 2166 verification points in the study area, including 920 points for tea plantations and 1246 points for other ground objects. Then, confusion matrices were created for the classification results of each method. Although the confusion matrices can visually represent the number of samples that are correctly or incorrectly predicted in each class, they cannot directly provide a detailed evaluation. Consequently, the following evaluation metrics were calculated based on the confusion matrix: the overall accuracy (OA), commission error (CE), omission error (OE), and kappa coefficient. Overall accuracy refers to the proportion of the total number of verification points correctly classified; commission error represents the proportion of verification points predicted to be in a class that are actually not in that class; omission error refers to the proportion of verification points actually in a class that are predicted not to be in that class; and the kappa coefficient represents the proportion of improvement in the prediction of the classification method compared to completely random classification:

OA = (TP + TN) / (TP + FN + TN + FP)  (8)

CE = FP / (TP + FP)  (9)

OE = FN / (TP + FN)  (10)

p_e = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / (TP + FN + TN + FP)²  (11)

Kappa = (OA − p_e) / (1 − p_e)  (12)

In Equations (8)–(12), TP is the number of correctly classified tea plantation points, FN is the number of incorrectly classified tea plantation points, TN is the number of correctly classified other ground object points, FP is the number of incorrectly classified other ground object points, and p_e is the expected agreement of a completely random classification with the same class proportions.
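The four metrics can be computed from the 2 × 2 confusion matrix in a few lines; the counts below are hypothetical, chosen only to sum to the 2166 verification points (920 tea, 1246 other):

```python
def confusion_metrics(tp, fn, tn, fp):
    """OA, commission/omission error (tea class), and kappa from a 2x2 confusion matrix."""
    n = tp + fn + tn + fp
    oa = (tp + tn) / n
    ce = fp / (tp + fp)                     # predicted tea that is actually other
    oe = fn / (tp + fn)                     # actual tea predicted as other
    # Expected agreement of a random classifier with the same marginals
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, ce, oe, kappa

oa, ce, oe, kappa = confusion_metrics(tp=870, fn=50, tn=1200, fp=46)
print(round(oa, 3), round(ce, 3), round(oe, 3), round(kappa, 3))
```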
Furthermore, the distribution index [58] is applied to describe the relationship between the spatial distribution of tea plantations and topographical factors such as elevation and slope. Its calculation formula is as follows:

P_ie = (S_ie / S_i) / (S_e / S)  (13)

In Equation (13), P_ie is the distribution index, S is the total area of the whole region, S_e is the area of a specific grade e of a topographical factor in the whole region, S_ie is the area of class i at grade e of the topographical factor in the whole region, and S_i is the area of class i in the whole region.
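A minimal sketch of the distribution index with hypothetical areas; a value greater than 1 indicates that the class is over-represented at the given terrain grade:

```python
def distribution_index(s_ie, s_i, s_e, s_total):
    """Distribution index of Equation (13): a class's share within a terrain grade
    relative to that grade's share of the whole region."""
    return (s_ie / s_i) / (s_e / s_total)

# Hypothetical numbers: tea occupies 12 km^2 (of a 40 km^2 class total) within an
# elevation grade covering 200 km^2 of a 1000 km^2 region.
p = distribution_index(s_ie=12, s_i=40, s_e=200, s_total=1000)
print(p)   # 1.5, i.e., tea is over-represented at this elevation grade
```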
5. Discussion
In this work, we used an end-to-end R-CNN method that combines CNN modules and RNN modules to extract tea plantations from multitemporal Sentinel-2 images. Most recent studies have focused on traditional machine learning methods or on deep learning methods based on mono-temporal high-resolution images to extract tea plantations. The former relies heavily on the construction of manual features, which usually requires considerable manpower, yet it is sometimes still difficult to achieve the desired results. Although the latter achieves a degree of automation, it fails to effectively use the multispectral information and time-dimensional phenological information of tea plantations in remote sensing images, resulting in numerous misclassified pixels between tea plantations and other ground objects. In contrast, the method in our research has the following advantages: (1) It automatically extracts features from the original data without manually building additional input features, which reduces the required manpower. (2) In the feature extraction stage, it synthetically uses multispectral and spatiotemporal information to extract more comprehensive and robust features. (3) It is an end-to-end classification method with low overall process complexity and thus high practicality. Our experimental results show that deep learning algorithms can markedly reduce misclassification [63,64] and that CNNs, which effectively use spatial information, and RNNs, which effectively use temporal information, have complementary characteristics in tea plantation extraction. The R-CNN method obtains higher evaluation scores in classifying tea plantations than the CNN and RNN methods, as well as the traditional machine learning methods. Previous studies have successfully applied similar methods to land use classification [65,66], but few have applied them to tea plantation extraction, especially in an end-to-end way.
The extraction of tea plantations is still in the exploratory stage, and this study has some limitations. First, the experiments were conducted on a small dataset, and the model was constructed with few layers and a simple structure to prevent overfitting. Although deep learning models commonly generalize better than many traditional machine learning methods, model performance is still influenced by the temporal and spatial coverage of the training datasets and by the complexity of the models themselves. Therefore, when conducting province-wide or nationwide tea plantation extraction in the future, increasing the model complexity and collecting data from multiple locations to increase the number and diversity of samples can be considered to improve the generalization ability of the models. In addition, the spatial resolution of the images used in this study is 10 m, which causes small tea plantations and tea plantation boundaries to form mixed pixels with other ground objects in the image, leading to misclassification in the predicted results. The use of multitemporal multispectral images with higher spatial resolution will be considered in the future to improve the accuracy of tea plantation extraction and thereby provide technical support for the development of the tea industry.