1. Introduction
The dike-pond system (DPS) integrates agriculture and aquaculture. It is characterized by a natural or man-made pond and dikes on which crops, vegetables, or fruit trees are cultivated [
1]. The DPS is a traditional agricultural system in low-lying, water-rich areas of South Asia [
2,
3]. In China, DPSs are concentrated in the Pearl River Delta, Yangtze River Delta, and bank regions of great lakes [
4]. The Huzhou Mulberry-dike and Fish-pond System in China was designated a Globally Important Agricultural Heritage Systems (GIAHS) site by the Food and Agriculture Organization of the United Nations (FAO) in 2017. The DPS plays a key role in preserving biodiversity [
5], enhancing the nutrient cycle [
6], and increasing crop production [
2]. Accurately identifying DPSs and mapping their spatial distribution is important for understanding the environmental impacts of these integrated agricultural systems.
Remote sensing provides a unique alternative for mapping the spatial distribution of DPSs at large scales. However, in maps derived from optical or radar remote sensing images, DPSs are usually classified as aquaculture ponds or wetlands [
7,
8,
9,
10]. Very few studies have been devoted to identifying DPS from satellite images, because the small sizes and the complex compositions of crops and water bodies make them difficult to map with medium- or coarse-resolution remote sensing images and conventional classifiers. Li et al. [
11] analyzed the trends of DPS between 1978 and 2016 in Shunde district of South China, using the time series of Landsat images and declassified intelligence satellite photographs from before the 1980s. Liu and Li [
12] mapped DPS dynamics during 1949–2020 in the Guangdong-Hong Kong-Macao Greater Bay Area, using topographic maps from 1949, Landsat images, and high-resolution satellite images. Both studies focused on analyzing the spatial-temporal dynamics of DPS. The classification of DPS has mainly relied on medium-resolution satellite data and the object-oriented classification method, ignoring the relation between the dike and the pond.
Mapping DPS with remote sensing techniques requires identifying a pond with a regular or irregular shape, together with the crops grown on its dikes, as a single target. This special landscape is difficult to identify automatically using conventional classifiers, such as random forest (RF) or support vector machine (SVM) algorithms, which have two limitations in detecting objects. Firstly, these classifiers need several moving windows of varying sizes to locate the target in the image, resulting in redundant windows and low efficiency. Secondly, these classification methods cannot effectively extract deep-level features and identify complex objects from remote sensing images. DPS is an integrated agricultural system in which the pond and vegetation have contrasting spectral characteristics but are spatially connected. Conventional classification methods applied to pixels or objects struggle to extract features of the spatial relationship between water and vegetation and to identify DPS as a target.
Recent advances in deep learning algorithms have provided great opportunities for automatically identifying targets on high-resolution remote sensing images [
13]. Deep learning is a hierarchical feature learning method that uses multi-layer neural networks. Convolutional neural networks (CNNs), which are trained end to end, are among the most successful network architectures in deep learning. CNNs have demonstrated competitive abilities in classifying agricultural landscapes from remote sensing images at the pixel or object level [
14,
15,
16,
17]. However, few relevant studies have evaluated CNN-based object detection methods in agricultural applications, because of the complex properties of agricultural targets and the lack of large annotated datasets, comparable to ImageNet, that meet the requirements of deep learning methods. Li et al. (2020) [
18] and Chen et al. (2021) [
19] detected agricultural greenhouses from high-resolution satellite data using You Only Look Once-v3 (YOLO-v3) and a CNN, respectively.
Despite the limited number of studies, deep learning methods have great potential for directly recognizing complex agricultural landscapes as targets. In addition to extracting complicated features in DPS, the algorithm used to identify DPS also needs to deal with irregular shapes and different orientations. The objective of this study was to develop a new architecture based on the state-of-the-art cascade region-based convolutional neural network (R-CNN) to detect DPSs from high-resolution satellite images. The novelty of the proposed method is that it adapts to the irregular shapes of DPSs and can provide a more accurate bounding box in DPS detection. Based on the derived DPS map, we analyzed the spatial distribution of DPSs and quantified the area of oilseed rape growing on dikes, which is usually overlooked in remote sensing maps of cropland and in statistical data. This study was conducted in Qianjiang City, Hubei Province, China, where DPSs are widely distributed. The DPS in Qianjiang is characterized by a combination of winter oilseed rape growing on dikes and an aquaculture pond.
2. Materials and Methods
2.1. Study Area
The study area was Qianjiang City, a sub-prefecture-level city in South-Central Hubei province, China that covers an area of 200,400 ha. Qianjiang is located on the Jianghan Plain, and has abundant water resources, including rivers, lakes, and ponds (
Figure 1). In total, 6 lakes are scattered throughout the city, with a total area of 1800 ha. Qianjiang has a humid subtropical climate, with a mean annual (1988–2017) temperature of 16.6 °C and annual precipitation of 1162 mm [
20].
Aquaculture plays a very important role in the economy of Qianjiang. The total aquaculture area was 9195 ha in 2019. Oilseed rape (
Brassica napus L.) is the main winter crop in Qianjiang. It is widely grown on the dikes of aquaculture ponds, and in spring, when oilseed rape blossoms, DPSs are easier to identify (
Figure 1).
2.2. Data
In total, 5 high-resolution satellite images from Gaofen-1 (GF-1) and Gaofen-2 (GF-2) that covered the entire study area with cloud cover less than 10% were downloaded from China Centre For Resources Satellite Data and Application (
http://www.cresda.com/CN/, accessed on 12 August 2021). GF-1 and GF-2 were launched by the China National Space Administration on 26 April 2013 and 19 August 2014, respectively. GF-1 carries 2 panchromatic (PAN) and multispectral (MS) cameras, with a spatial resolution of 2 and 8 m for the PAN and MS bands, respectively. GF-2 also employs 2 PAN and MS cameras, capable of collecting images with a spatial resolution of 0.81 and 3.24 m at nadir in the PAN and MS bands, respectively. Approximately 80% of the study area was covered by 1 scene from GF-1 obtained on 8 March 2020, and the rest of the area was covered by 4 scenes from GF-2 obtained on 27 March 2018 due to the limited data availability.
The selected GF-1 and GF-2 images were orthorectified and projected onto the Albers equal-area conic projection. The MS images were registered to the PAN images using polynomial warping with automatically generated tie points. The red, green, and blue (RGB) bands of the MS data were fused with the corresponding PAN images using the nearest-neighbor diffusion-based pan-sharpening algorithm [
21]. All the RGB composites were resampled to a spatial resolution of 2 m using the cubic convolution resampling method.
To train and validate the deep learning model, the regions of interest (ROIs) containing DPSs were cropped into tiles of 416 × 416 pixels with 50% overlap. We labeled 1006 sample tiles, containing a total of 5903 targets. Eighty percent of the samples were used to train the deep learning models, and the remaining samples were used to validate the models.
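The tiling and the 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the handling of the last tile at the image edge, and the use of a seeded random shuffle for the split are all assumptions.

```python
import random


def tile_origins(length, tile=416, overlap=0.5):
    """Start offsets of tiles along one image axis.

    Assumption: the last tile is shifted back so it ends flush with the
    image edge; the paper does not say how edges were handled.
    """
    if length < tile:
        raise ValueError("image axis is shorter than one tile")
    stride = int(tile * (1 - overlap))  # 208 pixels for 416-pixel tiles at 50% overlap
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def tile_windows(width, height, tile=416, overlap=0.5):
    """Pixel windows (x_min, y_min, x_max, y_max) covering a width x height image."""
    return [
        (x, y, x + tile, y + tile)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and validation lists.

    Assumption: a random split; the paper only states the 80/20 proportion.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With a stride of half the tile size, every DPS that is cut by one tile boundary falls entirely inside a neighboring tile, provided it is smaller than half a tile.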
2.3. Methods
Object detection methods can be categorized into two types: two-stage methods and one-stage methods. For two-stage methods, object detection is treated as a multi-task learning problem that combines classification and bounding box regression. Two-stage methods typically require a heavy computational load, whereas one-stage methods require only a single pass through the neural network and predict all the bounding boxes in one run. One-stage methods have recently become popular, mainly because of their computational efficiency. In this study, we improved the two-stage algorithm Cascade R-CNN to provide a more accurate detection of the DPS. The Cascade R-CNN based on a feature pyramid network (FPN) and ResNet-101 backbone and the popular one-stage algorithm YOLOv4 were applied for comparison. After detecting DPSs in the study area, the bounding boxes were converted to vector data, and the number of bounding boxes was taken as the number of DPSs.
2.3.1. Modified Cascade R-CNN
Cai and Vasconcelos proposed Cascade R-CNN, a multi-stage extension of the R-CNN [
22]. Cascade R-CNN incorporates a sequence of high-quality object detectors to improve detection accuracy by addressing the overfitting problem during training and the quality mismatch at inference. Cascade R-CNN based on the ResNet-101 and FPN backbone outperformed several two-stage (e.g., Faster R-CNN) and one-stage detectors (e.g., YOLOv2) on the MS-COCO2017 dataset [
22].
In this study, we used a Cascade R-CNN based on a ResNeXt-101 and FPN backbone. In a remotely sensed image, DPSs are variable in shape and position. To improve the ability to learn deformable features, we modified ResNeXt-101 by replacing the regular convolutional layers with deformable ConvNet v2 (DCNv2) layers. DCNv2 is developed from DCNv1, which allows the grid sampling locations to shift across the feature map by learning spatial offsets. However, the features learned by DCNv1 can be influenced by irrelevant image content. DCNv2 is adaptive to an object’s structure and is more powerful in focusing on pertinent image regions than DCNv1 [
23]. ResNeXt-101+DCNv2 extracts the features of four different scales. The FPN recursively fuses features from higher levels to the current level.
The fused features are divided into four stages: one Region Proposal Network (RPN) and three detectors. The sampling of the first detection stage followed the procedures by Ren et al. [
24]. In the following stages, resampling was implemented by simply using the regressed bounding boxes from the previous stage [
22]. These 3 detectors were trained with intersection over union (IoU) thresholds of 0.5, 0.6, and 0.7, respectively, to find a good set of close false positives for training the next stage. At each stage, the Cascade R-CNN included a classifier and a regressor optimized for the IoU threshold of that stage. The architecture of the modified Cascade R-CNN (mCascade R-CNN) is illustrated in
Figure 2.
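To make the role of the three IoU thresholds concrete, the sketch below computes the IoU of two axis-aligned boxes and checks whether a proposal would count as a positive training sample at each threshold. The function names and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for illustration. In the actual cascade, each stage receives the boxes regressed by the previous stage rather than the same proposal, so this only shows how a stricter threshold changes the labeling.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds of the three detection stages, as stated in the text.
CASCADE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def positive_at_thresholds(proposal, ground_truth, thresholds=CASCADE_IOU_THRESHOLDS):
    """For each threshold, report whether the proposal would be labeled positive."""
    overlap = iou(proposal, ground_truth)
    return [overlap >= t for t in thresholds]
```

A proposal that overlaps a labeled DPS with an IoU of about 0.67 is a positive sample at thresholds 0.5 and 0.6 but a negative one at 0.7, which is why later stages are trained on progressively better-localized boxes.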
2.3.2. YOLOv4
In addition, the one-stage algorithm YOLOv4 was applied to detect DPSs for comparison, because it is one of the most popular target detection methods with high speed and accuracy. YOLOv4, an evolution of the YOLOv3, is a real-time object detection algorithm that recognizes different objects in a single frame. YOLOv4 generally includes three parts, namely the backbone, neck, and head networks. The backbone network is mainly used to extract image features, and the neck network can enhance the image features. The head network conducts classifications and regressions based on the features derived from the backbone and neck networks.
The image features were extracted using the CSPDarknet53 module in YOLOv4. CSPDarknet53 uses DenseNet and Cross Stage Partial connection (CSP) to enhance the learning ability of CNN and reduce model calculations and memory costs while maintaining accuracy. The RGB sample tiles with a size of 416 × 416 × 3 were used as the input, and 3 outputs were generated after passing through CSPDarknet53. The sizes of the 3 feature outputs were 76 × 76 × 256, 38 × 38 × 512, and 19 × 19 × 1024. The neck network in YOLOv4 used Spatial Pyramid Pooling (SPP) and PANet to generate feature pyramids. SPP used 3 sliding kernels, namely 5 × 5, 9 × 9, and 13 × 13, to convolve the candidate images, and then applied multi-scale max pooling to obtain the same dimensions of the feature map. PANet extracted and integrated features at various scales. The feature maps of different scales output by PANet were spliced, and after the convolution operation, three heads of the different scales were obtained. Classifications and regressions were applied to the three heads to predict the bounding box and the confidence level. The architecture of YOLOv4 used in this study is presented in
Figure 3.
2.3.3. Evaluation of Model Performance
To evaluate the performance of mCascade R-CNN, Cascade R-CNN, and YOLOv4, we calculated the mean average precision based on the validation dataset. The mean average precision value is the area under the precision–recall curve of all classes. In this study, we only identified the DPS, and hence, the mean was not necessary. Average precision (AP) was calculated as follows:
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
where P represents the precision rate and R represents the recall rate.
The precision rate is the proportion of predicted positives that are actually positive, and the recall rate is the proportion of observed positive samples that are correctly predicted as positive. Precision and recall are expressed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of true positive samples, FP is the number of false positive samples, and FN is the number of false negative samples.
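The definitions above translate into a short computation. The sketch below is an illustration under stated assumptions, not the evaluation code used in the study: it assumes the detections are already sorted by descending confidence and already matched to ground truth (each flagged as a true or false positive), and it approximates the area under the precision–recall curve with the all-point interpolated envelope, which is one common convention. The function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(is_true_positive, n_ground_truth):
    """Area under the precision-recall curve for a single class.

    is_true_positive: one boolean per detection, ordered by descending confidence.
    n_ground_truth:   number of labeled targets in the validation set.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        p, r = precision_recall(tp, fp, n_ground_truth - tp)
        precisions.append(p)
        recalls.append(r)

    # Assumed convention: replace each precision by the maximum precision at any
    # higher recall, so the curve is non-increasing before it is integrated.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap
```

Whether a detection counts as a true positive depends on the IoU threshold used for matching, so the AP value also reflects bounding box quality.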
2.4. Classification of Oilseed Rape
SVM was applied to classify pixels into oilseed rape and other land cover types. The SVM model was trained using a Gaussian radial basis function kernel. Blue, green, red, and near-infrared bands were used as the inputs. Oilseed rape pixels were easily identified during the flowering stage. To train the model, 23,300 samples were used, and to validate the model, 9988 samples were used.
2.5. Kernel Density Estimation
Kernel density estimation (KDE) is a non-parametric method for estimating a probability density. It generates a smooth density surface and provides a clear visualization of the spatial distribution of sample points (Brunsdon, 1995). The built-in kernel density tool in ArcGIS 10.5 was applied to calculate the density of the DPSs at a cell size of 100 m and a bandwidth of 1 km.
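As a rough illustration of what such a density surface contains, the sketch below evaluates a quartic kernel on a regular grid with a 100 m cell size and a 1 km bandwidth. The quartic kernel is a common choice in GIS density tools, but treating it as equivalent to the ArcGIS implementation is an assumption, as are the function names and the use of projected coordinates in metres for the DPS centroids.

```python
import math


def quartic_kernel_density(points, x, y, bandwidth=1000.0):
    """Density (points per square metre) at location (x, y).

    Each point within the bandwidth contributes (3 / pi) * (1 - (d / h)^2)^2 / h^2,
    where d is its distance to (x, y) and h is the bandwidth.
    """
    total = 0.0
    for px, py in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 < bandwidth ** 2:
            total += (3.0 / math.pi) * (1.0 - d2 / bandwidth ** 2) ** 2
    return total / bandwidth ** 2


def density_grid(points, x_min, y_min, n_cols, n_rows, cell=100.0, bandwidth=1000.0):
    """Evaluate the density at the centre of every grid cell (row-major list of rows)."""
    grid = []
    for row in range(n_rows):
        cy = y_min + (row + 0.5) * cell
        grid.append([
            quartic_kernel_density(points, x_min + (col + 0.5) * cell, cy, bandwidth)
            for col in range(n_cols)
        ])
    return grid
```

A larger bandwidth produces a smoother surface that emphasizes regional clusters, while a smaller one preserves local detail.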
4. Discussion
DPS is a typical eco-agricultural landscape that is distributed in plains or deltas covered by dense waterways. There is a lack of studies devoted to mapping DPS at a large scale using satellite data due to the complex combination of ponds and crops growing on dikes. The spatial distribution, ecological function, and environmental impact of DPSs cannot be quantitatively evaluated without an accurate map of DPSs. This study developed the mCascade R-CNN to identify the DPS as a target, achieving an AP value of 80.90%. In previous studies, the overall accuracy of the DPS classification reached 90% and even higher [
11,
12]. However, the overall accuracy cannot be compared with AP. The accuracy evaluates the performance of the classifier across all classes when all classes are of equal importance. However, this study aimed to identify the DPS, and thus the AP was used instead of the overall accuracy. The AP value not only assesses the accuracy of the target detections, but also takes into account the accuracy of the bounding box. Moreover, the accuracy assessment of the DPS classification studies was based on a small sample size, which is not comparable to the over 1000 samples used for the validation in this study.
The improvements of the mCascade R-CNN over the baseline were not only in the accuracy of the target detection, but also in the accuracy of the bounding box. A more accurate bounding box facilitates better estimation of the crop area within the DPS. However, we found that ponds with very narrow dikes are difficult to identify because the features may be weakened in such cases. The performance could be improved in several ways, such as by increasing the sample sizes, replacing the horizontal bounding box with an oriented bounding box, or testing other advanced deep learning methods. For example, non-maximum suppression is an integral part of the object detection algorithm, but it can lead to missed detections when the bounding boxes significantly overlap with each other. In the detection of DPS, we noticed that one or two were occasionally missed in a row of DPSs. Soft non-maximum suppression (Soft-NMS) decays the detection scores of all other objects as a continuous function of their overlap with the detection box [
25], and may improve the accuracy in rows of DPS.
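The difference from standard non-maximum suppression can be sketched as follows. This is a minimal, unoptimized illustration of the Gaussian variant of Soft-NMS, not part of the method evaluated in this study; the function names, the `sigma` value, and the score cut-off are assumptions chosen for the example.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in the order they are selected.

    Instead of discarding boxes that overlap the current best box, their scores
    are multiplied by exp(-IoU^2 / sigma); boxes are dropped only when the
    decayed score falls below score_threshold.
    """
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best_index = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_index)
        kept.append((best_box, best_score))
        decayed = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        remaining = decayed
    return kept
```

For two adjacent boxes with an IoU of about 0.8, standard non-maximum suppression with a 0.5 threshold would discard the lower-scoring box, whereas this sketch keeps it with a reduced score, so a later confidence cut-off decides whether it survives.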
Crops, vegetables, or fruit trees growing on dikes are usually overlooked because of their small areas and fragmented distribution. In the study area, oilseed rape growing on the dikes accounted for 3.42% of the total oilseed rape area. However, the actual share of oilseed rape grown on the dikes is higher than 3.42% due to the uncertainty in the DPS identification and the accuracy of the bounding box. As of 2019, the average cultivated land area per farmer was approximately 0.35 ha [
26]. With the increasing demand for agricultural production in China, integrated agriculture systems, such as the crop-fishery-(livestock) system, provide an effective way to balance the limited cultivated land and higher profits from fisheries. As machine learning and computer vision techniques have developed rapidly in recent years, identifying and quantifying crops growing in integrated agriculture systems has become more feasible and accurate, which complements our knowledge of the agricultural and economic conditions of smallholders.
Integrated agriculture systems have developed rapidly in the recent decade with advancements in agricultural technology, the loss of labor to cities, and rural revitalization policies, which will drive spatiotemporal change in the DPSs. This study focused on developing a deep learning method to identify DPS, and thus the analyses were conducted on single-date satellite images. Future studies will analyze the spatiotemporal change of the DPSs at a larger scale based on multi-temporal satellite images. In the study area, DPS is mainly composed of a pond and oilseed rape plants, but it has diverse compositions in other regions, such as the Pearl River Delta. The proposed model could be applied to other regions, but it needs large quantities of training samples to learn the different compositions of DPSs. Furthermore, farm ponds are more vulnerable to pollution than larger water bodies [
27]. The map of the DPS provides a basic dataset to evaluate the impact of the runoff from dikes on the ponds.