Extraction of Agricultural Fields via DASFNet with Dual Attention Mechanism and Multi-scale Feature Fusion in South Xinjiang, China

Lu, Rui; Wang, Nan; Zhang, Yanbin; Lin, Yeneng; Wu, Wenqiang; Shi, Zhou

doi:10.3390/rs14092253


by Rui Lu ¹, Nan Wang ¹, Yanbin Zhang ², Yeneng Lin ³, Wenqiang Wu ¹ and Zhou Shi ¹,*

¹ Institute of Agricultural Remote Sensing and Information Technology Application, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou 310058, China

² Consolidation and Rehabilitation Center, Hangzhou 310007, China

³ Institute of Cyber-Systems and Control, Yuquan Campus, Zhejiang University, Hangzhou 310027, China

* Author to whom correspondence should be addressed.

Remote Sens. 2022, 14(9), 2253; https://doi.org/10.3390/rs14092253

Submission received: 21 February 2022 / Revised: 3 May 2022 / Accepted: 4 May 2022 / Published: 7 May 2022


Abstract:

Agricultural fields provide humanity with food and other essential materials. Quick and accurate identification of agricultural fields from remote sensing images is a crucial task in digital and precision agriculture. Deep learning methods offer fast and accurate image segmentation and are well suited to extracting agricultural fields from remote sensing images. This paper proposes a deep neural network with a dual attention mechanism and multi-scale feature fusion (Dual Attention and Scale Fusion Network, DASFNet) to extract cropland from a 2017 GaoFen-2 (GF-2) image of Alar, south Xinjiang, China. First, we constructed an agricultural field segmentation dataset from the GF-2 image. Next, we selected seven evaluation indices to assess the extraction accuracy, including the location shift, which reveals the spatial relationship between predictions and reference and supports a better evaluation. Finally, we proposed DASFNet, which incorporates three improved deep learning modules built on the dual attention mechanism and multi-scale feature fusion methods. A comparison of these modules with their alternatives indicated their effects and advantages. Compared with other segmentation convolutional neural networks, DASFNet achieved the best testing accuracy in extracting fields, with an F1-score of 0.9017, an intersection over union (IoU) of 0.8932, a Kappa coefficient of 0.8869, and a location shift of 1.1752 pixels. DASFNet can extract agricultural fields automatically and accurately, which reduces the manual recording of agricultural field information and is conducive to further farmland surveys, protection, and management.

Keywords:

agricultural field extraction; attention mechanism; deep learning; GaoFen-2 (GF-2); multi-scale feature fusion


1. Introduction

Agricultural land is an important resource, providing humanity with indispensable supplies of food, fibers, fuels, and other raw materials [1]. Extracting agricultural fields is key to progress toward sustainable agricultural management and the sustainable development goals. Accurate spatial information on agricultural fields is instrumental in estimating cropland area, monitoring food security, and forecasting crop yields [2,3]. Because climatic conditions and farming activities are evolving under global change, field information is strongly needed to formulate effective agricultural policies [4]. With the development of remote sensing technology in agricultural applications, information on the spatial structure, texture characteristics, and internal composition of croplands can be obtained on a large scale from high-resolution remote sensing images [5]. Compared with other objects, agricultural fields show stronger self-similar fractal distribution characteristics. In addition, management measures, including crop type, sowing, and irrigation, are similar, which makes the spectral, textural, and morphological characteristics seen in remote sensing images highly consistent and provides a basis for extracting and classifying agricultural fields from such images [6].

Various studies have focused on scene-based image segmentation. Current agricultural field segmentation and delineation approaches fall into three main categories: threshold-based methods [7,8,9], texture analysis methods [10,11,12,13], and trainable classification models [14,15,16,17]. Deep neural networks have opened new opportunities for agricultural scene extraction: they extract target features automatically, which saves pre-processing costs, and their architectures adapt well to complex problems [18]. Convolutional neural networks (CNNs) are increasingly used in image analysis. They apply convolution operations in which filters slide along the input features and produce translation-equivariant responses. CNNs were originally applied to image classification, and fully convolutional networks (FCN) [19] achieved end-to-end semantic segmentation for the first time by replacing the fully connected layer with a deconvolution layer to recover features to the original image size. Rather than recovering the image size directly, the encoder–decoder network UNet [20] used skip connections to combine low-level details with high-level semantic information. SegNet [21] was then developed; it records the max-pooling indices in the encoder and uses them for upsampling in the decoder. Multi-scale feature fusion for parsing diverse scenes has become important in segmentation. In PSPNet, the pyramid pooling module (PPM) was proposed to explore context information [22], and atrous convolution and atrous spatial pyramid pooling (ASPP) were adopted in DeepLabV3 [23]. In recent years, deep learning studies have focused on the attention mechanism, which boosts meaningful features while suppressing weak ones. Hu et al. [24] proposed a squeeze and excitation block (SEblock) that adaptively recalibrates the channel features based on the interdependencies between channels. SCAttNet [25] integrated the convolutional block attention module (CBAM) [26], with spatial and channel attention, for remote sensing image segmentation. Nevertheless, the cascading mechanism in CBAM might cause aliasing of the spatial and channel information. Therefore, the concurrent spatial and channel “squeeze & excitation” attention module (scSE) [27] was proposed for image segmentation; it recalibrates the feature maps separately along the channel and spatial dimensions, which speeds up the transfer of information.

To obtain higher accuracy, deep neural networks have been applied to remote sensing image segmentation, including road extraction [28,29,30], building extraction [31,32], cloud detection [33,34,35], and land cover classification [36,37,38]. Owing to their high temporal resolution, remote sensing images are useful for updating cropland information, supporting scientific decision making, and ensuring the quality and safety of agricultural fields. Wang et al. [39] showed that U-Net could segment cropland in satellite imagery with weakly supervised labels and outperformed random forest and logistic regression. Taravat et al. [40] proposed ResU-Net to delineate agricultural fields and obtained accurate field boundaries close to human photo interpretation. Zhang et al. [41] developed the MPSPNet model to map cropland over a large area; the model combined low-level and high-level features and captured long-range spatial dependencies. However, deep learning on remote sensing images faces several problems, such as varying object scales, complex and diverse backgrounds, and an imbalance between positive and negative samples [42,43]. Moreover, current multi-scale fusion and attention modules have shortcomings. PPM and ASPP might overlook the contextual spatial correlation and the integrated information of the global feature map. Existing attention modules could lead to imbalance in the spatial and channel weights and in the background and textural information. Hence, novel deep learning modules are needed to address the shortcomings of dual attention mechanisms and multi-scale feature fusion methods.

Xinjiang is a Chinese province with some of the most arable land suitable for agriculture. The cultivable area of southern Xinjiang is 7.33 million hectares, with abundant light and heat resources suitable for growing high-quality products, primarily cotton. However, previous studies have found that cotton yield varied among different environments and years [44,45,46], indicating an unstable yield in the cropping system due to its low resistance and resilience to various climatic conditions and management [47]. The agricultural fields have varied greatly across south Xinjiang over the years because of unstable precipitation and evaporation, as well as the soil salinity problems. Extraction technology for agricultural fields is fundamental for crop information acquisition and agriculture management. Therefore, a fast and accurate method is urgently needed for agricultural field identification.

However, because datasets for agricultural scenarios are lacking, few studies have applied deep learning techniques to agricultural field extraction from satellite images. Therefore, this study aimed to construct a deep neural network with improved modules to extract agricultural fields in south Xinjiang from the GF-2 image. The proposed dual attention mechanism and multi-scale feature fusion modules were designed to remedy the shortcomings of existing multi-scale fusion and attention modules and to improve the accuracy of agricultural field extraction. Our main contributions are:

(1): The establishment of an agricultural field dataset from GF-2 images;
(2): The introduction of a novel deep learning model DASFNet with an improved dual attention mechanism and modified multi-scale feature fusion module;
(3): The high performance of DASFNet in agricultural field extraction compared with other contrastive modules and models in comparative experiments.

2. Materials and Methods

2.1. Study Area

The study area (40°24′–40°42′ N, 81°09′–81°36′ E) is a typical cropland region in Alar, south Xinjiang, China (Figure 1), located south of the Tianshan Mountains and north of the Taklimakan Desert, with a typical continental warm temperate arid climate. The area receives a sparse mean annual precipitation of 60 mm, while evapotranspiration is much higher, resulting in severe topsoil salinization [48]. Cotton, the main crop, is cultivated annually in the study area. It is usually sown at the end of March and harvested at the end of October [45]. Pears, red dates, peppers, and tomatoes are also important components of regional agriculture. The dry climate in south Xinjiang results in little precipitation and cloud cover, which is conducive to observing ground features in remote sensing images.

2.2. Dataset

A GF-2 image of the study area in south Xinjiang, China, acquired on 10 August 2017 and covering 23.5 × 23.5 km², was downloaded from the China Centre for Resources Satellite Data and Application (http://www.cresda.com/CN/, accessed on 8 January 2022). The GF-2 image contained a panchromatic (PAN) channel and four multispectral (MS) bands. Our study used the four MS bands (red, green, blue, and near-infrared (NIR)) at a spatial resolution of 4 m for the extraction.

Most public remote sensing segmentation datasets do not include an agricultural field category, such as the ISPRS Vaihingen and Potsdam dataset [25] and the WHU dataset [49]. Due to the lack of a related agriculture remote sensing dataset in south Xinjiang, we constructed the agricultural field dataset from the GF-2 image. In order to establish the dataset, we used a visual inspection method to annotate the image for ground reference information using ArcGIS. After the annotation, the polygon layer was exported and saved as a labeling image with two classes, the fields class and the non-fields class, the same size as the original satellite image.

As CNN uses contextual information for prediction, the accuracy of predicting classification depends on the various object locations in the input images, i.e., the objects near the edge of the input images can miss the entire context and possibly be misclassified [50]. To mitigate this effect, while cropping the original image, we used a moving window of size 256 pixels with stride 128 in each direction, varying the positions of agricultural fields in the input images. The deep learning method requires a large volume and a variety of training data to achieve better accuracy and optimize model performance [37,51]. Data augmentation approaches have been investigated to increase sample diversity. For data augmentation, we flipped the cropped images (horizontal and vertical flips) and randomly modified the brightness (Figure 2). Afterward, the images were divided into three parts to construct the dataset: 8000 training images (about 70%), 1600 validating images (about 15%), and 1620 testing images (about 15%).
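The overlapping crop scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `window_origins` and `tile_boxes` are hypothetical, and dropping partial windows at the image edge is an assumption, since the text does not say how edges were handled.

```python
def window_origins(length, window=256, stride=128):
    """Start offsets of all full windows along one image axis.

    Assumption: windows that would run past the edge are dropped.
    """
    if length < window:
        return []
    return list(range(0, length - window + 1, stride))


def tile_boxes(height, width, window=256, stride=128):
    """Return (row, col, row_end, col_end) for every 256 x 256 crop."""
    return [
        (r, c, r + window, c + window)
        for r in window_origins(height, window, stride)
        for c in window_origins(width, window, stride)
    ]


# With stride = window / 2, every interior pixel falls into several
# crops, so a field near the edge of one crop is central in another.
boxes = tile_boxes(512, 640)
print(len(boxes), boxes[:2])
```

Because the stride is half the window size, neighboring crops overlap by 50%, which is what varies the position of each field across the input images.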

2.3. Model Architecture

Our study proposed a deep learning model, named the Dual Attention and Scale Fusion Network (DASFNet), for segmenting agricultural fields from the GF-2 satellite image. The structure of DASFNet was based on UNet, one of the most competitive classification networks [52]. DASFNet consisted of an encoder, a decoder, and a skip connection structure (Figure 3). While the encoder compressed the information content of a high-dimensional image into object features, the decoder gradually upscaled the encoded features and precisely defined the classes of interest. Vegetation shows obvious spectral reflectance characteristics in the NIR band; thus, the NIR band was chosen as an additional input for classification [53]. In the upscaling part, we chose transposed convolution rather than upsampling or unpooling for less information loss and better feature mapping ability [54]. Considering the number of parameters and GPU memory, the depth of the model was set to four. DASFNet combined three improved modules to enhance model performance, namely ResABlock, the ameliorated scSE module, and APPM. The following subsections introduce the structure and function of each module proposed in our model.

2.3.1. Residual Atrous Block

The structure of the residual atrous block (ResABlock) is shown in Figure 4. The ResABlock consisted of a residual block [55] and parallel atrous convolutions. The shortcut connection of the residual block helped alleviate the problem of vanishing and exploding gradients. The atrous branches, with multiple dilation rates (also called atrous rates), expanded the receptive field without increasing the number of model parameters, capturing more context and spatial information in the remote sensing images [56]. There were three parallel branches with increasingly larger dilation rates in ResABlock. Therefore, the input was processed simultaneously at multiple fields of view with multi-scale feature fusion, weakening the impact of different field sizes. Each small block in this module consisted of a 2D convolution layer followed by a batch normalization (BN) layer and a ReLU layer. The kernel size of the convolution layer was three in every branch and one in the shortcut connection. BN normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation; it is an efficient technique for combating the internal covariate shift problem and thus improves the speed, performance, and stability of neural networks [57]. ReLU is a nonlinear activation function that promotes the sparsity and expressive ability of the network.
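To illustrate how atrous convolution enlarges the field of view without adding weights, the sketch below computes the span of a dilated kernel and applies a one-dimensional atrous filter. It is a simplified 1D illustration under stated assumptions, not the ResABlock itself: the function names are hypothetical and the dilation rates 1, 2, and 4 are used only as examples.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation):
    """'Valid' 1D atrous convolution (cross-correlation, no padding)."""
    span = effective_kernel_size(len(weights), dilation)
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


# A 3-tap kernel always has three weights, but its span grows with the rate.
for rate in (1, 2, 4):
    print(rate, effective_kernel_size(3, rate))

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(dilated_conv1d(signal, [1, 0, -1], dilation=2))
```

The same three weights cover 3, 5, and 9 pixels at rates 1, 2, and 4, which is why parallel branches with different rates see the input at several scales at no extra parameter cost.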

2.3.2. Ameliorated Spatial and Channel Squeeze and Excitation Module

The ameliorated spatial and channel squeeze and excitation (ameliorated scSE) module is shown in Figure 5. Unlike the initial module, it computes global average pooling and global maximum pooling in parallel. The ameliorated scSE module applied and extended the dual attention mechanism in neural networks. It focused on the relationships among feature channels as well as feature maps. It also helped learn the importance of each feature automatically and then strengthened the weights of the important parts to adjust the network characteristics and improve the image segmentation results [58]. Average pooling can reduce the error caused by the limited neighborhood size, retaining more background information, while max pooling can reduce the deviation of the convolution estimate, retaining more textural information [59]. The improved scSE combined the two pooling methods to retain a balance of background and textural information. Squeeze and excitation were the two key operations in this module. The attention mechanism used in the scSE module could be divided into channel attention and spatial attention, along the two independent dimensions of channel and space:

(a) Channel attention focused on the meaningful feature maps and enlarged the weights of effective feature maps. Feature maps represented the characteristics of agricultural fields, and highlighting the effective features helped delineate the field boundaries accurately. To calculate the channel attention, the spatial dimension of the input feature map was compressed by pooling. This squeeze operation comprised global average pooling and global maximum pooling, the former aggregating spatial information and the latter reflecting distinctive object characteristics. The excitation operation realized the attention mechanism: the two pooled vectors passed through a multilayer perceptron (MLP) with one hidden layer, sharing the same MLP weights. The output feature vectors were merged by element-wise summation and then activated with a sigmoid function. Finally, the channel attention weights were broadcast and multiplied by the original input features. The process can be defined as:

M_{C}(F) = \sigma\left(w_{2}\left(w_{1}\left(F_{avg}^{c}\right)\right) + w_{2}\left(w_{1}\left(F_{max}^{c}\right)\right)\right)

(1)

where F represents the input features, avg and max represent the average pooling and maximum pooling, respectively, w₁ and w₂ are the parameters of the MLP layer, and 𝜎 is the sigmoid function.
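Equation (1) can be sketched in plain Python on a tiny feature tensor. This is a minimal sketch under stated assumptions rather than the authors' implementation: the ReLU in the hidden layer is an assumption (Equation (1) does not show an activation between w₁ and w₂), the weights in the usage lines are arbitrary, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def shared_mlp(vector, w1, w2):
    """One hidden layer; ReLU is assumed. Both pooled vectors reuse w1, w2."""
    hidden = [max(0.0, h) for h in matvec(w1, vector)]
    return matvec(w2, hidden)


def channel_attention(features, w1, w2):
    """features: list of C channels, each an H x W nested list."""
    flat = [[v for row in channel for v in row] for channel in features]
    avg_pooled = [sum(ch) / len(ch) for ch in flat]   # squeeze: global average
    max_pooled = [max(ch) for ch in flat]             # squeeze: global maximum
    merged = [a + m for a, m in zip(shared_mlp(avg_pooled, w1, w2),
                                    shared_mlp(max_pooled, w1, w2))]
    return [sigmoid(z) for z in merged]               # one weight per channel


def apply_channel_attention(features, weights):
    """Broadcast each channel weight over its H x W map."""
    return [[[v * w for v in row] for row in channel]
            for channel, w in zip(features, weights)]


features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[1.0, -1.0]]        # hidden size 1, illustrative values only
w2 = [[0.5], [-0.5]]
weights = channel_attention(features, w1, w2)
print(weights)
```

Sharing `w1` and `w2` between the two pooled vectors keeps the parameter count the same as with a single pooling path.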

(b) Spatial attention focused on the informative regions and was complementary to channel attention. It helped determine the spatial relationship and location of agricultural fields and increased the distinction between fields and non-fields. To obtain spatial attention, the squeeze operation applied average pooling and maximum pooling along the channel axis. The two pooled maps were then concatenated and passed through a convolutional layer to generate a spatial feature map encoding the positions to emphasize or suppress. After the sigmoid function was applied to this map, it was multiplied by the original input features. The process can be defined as:

M_{S}(F) = \sigma\left(f\left(F_{avg}^{s}; F_{max}^{s}\right)\right)

(2)

where F represents the input features, avg and max represent the average pooling and maximum pooling, f is the convolution layer, and 𝜎 is the sigmoid function.
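Equation (2) can be sketched in the same way. The kernel size of the convolution f is not stated in the text, so the sketch accepts any odd, square kernel and the usage lines use a 1 × 1 kernel purely for illustration; zero padding and all function names are likewise assumptions.

```python
import math


def channel_pool(features):
    """Squeeze along the channel axis: per-pixel mean and max over C channels."""
    height, width = len(features[0]), len(features[0][0])
    avg_map = [[sum(ch[r][c] for ch in features) / len(features)
                for c in range(width)] for r in range(height)]
    max_map = [[max(ch[r][c] for ch in features)
                for c in range(width)] for r in range(height)]
    return avg_map, max_map


def conv2d_two_to_one(maps, kernels, bias=0.0):
    """Zero-padded 'same' convolution from two input maps to one output map."""
    height, width = len(maps[0]), len(maps[0][0])
    half = len(kernels[0]) // 2
    out = [[bias] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for plane, kernel in zip(maps, kernels):
                for dr in range(-half, half + 1):
                    for dc in range(-half, half + 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            out[r][c] += kernel[dr + half][dc + half] * plane[rr][cc]
    return out


def spatial_attention(features, kernels, bias=0.0):
    """Return an H x W map of weights in (0, 1), one per pixel."""
    logits = conv2d_two_to_one(channel_pool(features), kernels, bias)
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]


features = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
kernels = [[[1.0]], [[-1.0]]]   # one 1 x 1 kernel per pooled map, illustrative
print(spatial_attention(features, kernels))
```

The resulting map has the spatial size of the input and is multiplied element-wise with every channel, so it reweights locations rather than channels.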

The output channel and spatial attention features characterized the correlation between each feature channel and feature map. The output of this module was the summation of two attention features. Unlike calculating the two attention features successively in CBAM, the scSE module simultaneously obtained the channel and spatial attentions, reducing the degree of disorder in the attention mechanism block.

2.3.3. Atrous Pyramid Pooling Module

A Pyramid Pooling Module (PPM) was inserted between the encoder and decoder of the network to improve prediction accuracy through multi-scale feature fusion. Segmentation neural networks use convolutional layers to extract the target features. The high-level layers of the model represented semantic information strongly, but their feature maps were small and lacked spatial geometric detail. Conversely, the low-level layers had a relatively small receptive field and high resolution and represented geometric details strongly, but identified semantic information weakly [60]. In deep learning, integrating all these features is highly effective for detection and segmentation.

The existing multi-scale feature fusion modules are mainly PPM [22] and ASPP [23]; we proposed a novel module, APPM, that combines the two (Figure 6). Single-scale analysis may reduce cropland extraction accuracy because field sizes differ greatly, and multi-scale feature fusion can solve the problems caused by different target sizes [34]. APPM contained two parts. The first part comprised a convolutional layer with a 1 × 1 filter and three parallel atrous convolutional layers with 3 × 3 filters and different dilation rates. Atrous convolution with different dilation rates could effectively capture multi-scale information. However, when the dilation rate approached or exceeded the size of the feature map, a 3 × 3 filter degenerated into a simple 1 × 1 filter, because only its center weight remained effective. In other words, the share of filter weights applied to the valid feature region rather than to padding decreased gradually as the dilation rate grew, which made image-level features necessary. To solve this problem, the second part of APPM adopted parallel global adaptive average pooling at various sizes. Global adaptive pooling summed up the spatial context information of the whole input map and was more robust to spatial translations [61]. Different pooling sizes represented high-level and low-level features, which were fused across scales. Ultimately, the outputs of the two parts were concatenated using a residual structure, which preserved the shallow information in the original input features, and then fused through a basic convolutional block comprising a 1 × 1 convolution, BN, and an activation function, realizing multi-scale feature fusion and enhancing segmentation accuracy.
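The loss of effective filter weights at large dilation rates can be checked numerically. The sketch below counts, for a square feature map with zero padding, what share of the 3 × 3 filter taps land on real pixels rather than on padding. The function name and the 16 × 16 map size are assumptions chosen for illustration.

```python
def valid_tap_fraction(feature_size, dilation, kernel_size=3):
    """Average share of kernel taps that land on real pixels.

    Assumes a square feature map and 'same' zero padding.
    """
    half = kernel_size // 2
    offsets = [k * dilation for k in range(-half, half + 1)]
    valid = 0
    for r in range(feature_size):
        for c in range(feature_size):
            for dr in offsets:
                for dc in offsets:
                    if 0 <= r + dr < feature_size and 0 <= c + dc < feature_size:
                        valid += 1
    total = feature_size * feature_size * kernel_size * kernel_size
    return valid / total


# On a 16 x 16 map the valid share falls as the rate grows; once the rate
# reaches the map size, only the center tap (1 of 9) still sees real data.
for rate in (1, 2, 4, 8, 16):
    print(rate, round(valid_tap_fraction(16, rate), 4))
```

When the fraction reaches 1/9 the 3 × 3 atrous filter behaves like a 1 × 1 filter, which is the motivation given above for adding the pooling branch.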

2.4. Training

The loss function and the optimizer are two essential components of deep neural network training: the loss compares the model output with the annotation, and the optimizer updates the parameters accordingly. Most deep learning methods use the cross-entropy loss for segmentation. Nevertheless, the cross-entropy loss may perform poorly when the positive and negative samples are unbalanced, because it averages the per-pixel class predictions and treats positive and negative pixels equally. The dice loss has been reported to outperform cross-entropy in semantic segmentation tasks [62,63], because it focuses on the overlap between predictions and labels and performs better for unbalanced samples, although some boundary information may be ignored. Our proposed model combines the two loss functions to reduce the influence of unbalanced samples while extracting boundaries efficiently. The loss function is defined by the following equations:

L_{CE} = -\frac{1}{N} \left[\sum_{i=1}^{N} y_{i} \cdot \log\left(S(\hat{y}_{i})\right)\right] = -\frac{1}{N} \left[\sum_{i=1}^{N} y_{i} \cdot \log\left(\frac{e^{\hat{y}_{i}}}{\sum_{j=1}^{N} e^{\hat{y}_{j}}}\right)\right]

(3)

L_{Dice} = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{2 y_{i} \cdot \hat{y}_{i}}{y_{i} + \hat{y}_{i}}

(4)

Loss = L_{CE} + L_{Dice}

(5)

where y_i is the real class of the relative pixel, ŷ_i is the predicted class of the pixel, N is the total number of pixels in the output, L_CE represents cross-entropy loss, and L_Dice represents the dice loss.
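A combined cross-entropy and dice loss of this kind can be sketched for the two-class case as follows. Two points are assumptions rather than the authors' formulation: the softmax is taken over the two class scores of each pixel, and the dice term uses the common set-level soft dice (sums over all pixels, with a small smoothing constant) instead of the per-pixel average printed in Equation (4), which is undefined wherever both label and prediction are zero. All function names are hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)                      # subtract max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_pixel, labels):
    """Mean negative log-probability of the true class over all pixels."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(logits_per_pixel, labels)]
    return sum(losses) / len(losses)


def soft_dice_loss(logits_per_pixel, labels, smooth=1e-6):
    """1 - dice overlap between field probabilities and binary labels."""
    probs = [softmax(logits)[1] for logits in logits_per_pixel]
    intersection = sum(p * y for p, y in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(logits_per_pixel, labels):
    """Equation (5): unweighted sum of the two terms."""
    return (cross_entropy(logits_per_pixel, labels)
            + soft_dice_loss(logits_per_pixel, labels))


# Each pixel carries [non-field score, field score]; labels are 0 or 1.
logits = [[0.0, 0.0], [0.0, 0.0]]
labels = [1, 0]
print(cross_entropy(logits, labels), soft_dice_loss(logits, labels))
```

The cross-entropy term scores every pixel individually, while the dice term depends only on the overlap with the field class, so a small field class still contributes a large share of the total loss.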

The optimizer employed in the proposed model is Adam (adaptive moment estimation), which adaptively adjusts the step size of each parameter and makes the loss converge quickly; the initial learning rate was 3 × 10⁻⁴. A scheduling method that decays the learning rate when the loss plateaus was adopted to avoid the reduced generalization that can accompany the Adam optimizer. When the training loss has not decreased for more than 10 epochs, the learning rate is multiplied by 0.1. In brief, the Adam update rule is:

\theta_{t} = \theta_{t-1} - \alpha \cdot m_{t} / \left(\sqrt{v_{t}} + \varepsilon\right)

(6)

where t is the timestep, m and v are the first and second moment vectors, which are updated adaptively during training, θ is the parameter being optimized, α is the learning rate (or step size), and ε is 1 × 10⁻⁸.
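For a single scalar parameter, the update and the plateau schedule can be sketched as below. This is a minimal sketch, not the training code used in the study: it follows the standard Adam formulation with bias-corrected moments, which Equation (6) abbreviates, and the decay rates β₁ = 0.9 and β₂ = 0.999 are common defaults assumed here because the text does not state them. The class names are hypothetical.

```python
import math


class Adam:
    """Scalar Adam with bias correction (beta values are assumed defaults)."""

    def __init__(self, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0   # first moment estimate
        self.v = 0.0   # second moment estimate
        self.t = 0     # timestep

    def step(self, theta, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return theta - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


class PlateauDecay:
    """Multiply the learning rate by `factor` after the loss has failed
    to improve for more than `patience` consecutive epochs."""

    def __init__(self, optimizer, factor=0.1, patience=10):
        self.optimizer, self.factor, self.patience = optimizer, factor, patience
        self.best = float("inf")
        self.stale = 0

    def update(self, loss):
        if loss < self.best:
            self.best, self.stale = loss, 0
            return
        self.stale += 1
        if self.stale > self.patience:
            self.optimizer.lr *= self.factor
            self.stale = 0


optimizer = Adam()
scheduler = PlateauDecay(optimizer)
theta = optimizer.step(theta=1.0, grad=2.0)
print(theta, optimizer.lr)
```

In a training loop, `scheduler.update(epoch_loss)` would be called once per epoch after the parameter updates.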

Deep learning demands high computing performance; thus, our study was performed on a cloud server. Alibaba Cloud Elastic Compute Server provides elastic and scalable computing services; users select an instance according to their computing needs, and each instance corresponds to a specific hardware and software configuration of the cloud virtual machine, such as CPU, operating system, memory size, disk type and size, and network configuration. We purchased a gn6v instance from the GPU computing cloud servers, which provides an 8-vCPU Intel Xeon Platinum 8163 processor with a main frequency of 2.5 GHz, 32 GB RAM, a 16 GB NVIDIA V100 GPU, and 5 Mbps bandwidth. The environment and framework used in our study were Python v3.7 and PyTorch v1.7.1 with CUDA v11.0.2.

2.5. Accuracy Assessment

Seven evaluation criteria assessed the performance of models. As the dataset was annotated as either field or non-field, the accuracy assessment was based on the binary confusion matrix in this binary classification task. Some mainstream evaluation indices of the machine learning methods are also used in our study, including overall accuracy (OA), precision (P), recall (R), F1-score (F), and intersection over union (IoU). F1-score represents the harmonic average of the precision and recall rates to consider both comprehensively. IoU represents the overlap rate of the candidate bound and ground truth bound first used in the target detection algorithm. These indices are defined as:

OA = \frac{TP + TN}{TP + FP + TN + FN}

(7)

P = \frac{TP}{TP + FP}

(8)

R = \frac{TP}{TP + FN}

(9)

F = \frac{2 P \cdot R}{P + R} = \frac{2 TP}{2 TP + FP + FN}

(10)

IoU = \frac{TP}{TP + FP + FN}

(11)

where true positive (TP) and true negative (TN) are the numbers of pixels correctly predicted as the positive (field) class and the negative (non-field) class, respectively, and false positive (FP) and false negative (FN) are the numbers of non-field pixels misclassified as fields and field pixels misclassified as non-fields, respectively. Additionally, the Kappa coefficient (Kappa), used in the accuracy evaluation of remote sensing interpretation, estimates the consistency between the predictions and the ground reference. It is calculated as follows:

p_{e} = \frac{(TP + FP) \cdot (TP + FN) + (TN + FP) \cdot (TN + FN)}{(TP + FP + TN + FN)^{2}}

(12)

Kappa = \frac{OA - p_{e}}{1 - p_{e}}

(13)

where OA is the number of correctly classified samples divided by the total number of samples, and p_e is the sum, over the two classes, of the product of the predicted and reference pixel counts for that class, divided by the square of the total number of pixels.
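Equations (7) to (13) translate directly into code. The sketch below evaluates them from the four confusion-matrix counts; the function name `binary_metrics` and the example counts are illustrative only, and the code assumes none of the denominators is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Pixel-based indices of Equations (7)-(13) from confusion counts.

    Assumes non-degenerate counts (no zero denominators).
    """
    total = tp + fp + tn + fn
    oa = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    # Chance agreement: predicted total times reference total, per class.
    p_e = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return {"OA": oa, "P": precision, "R": recall,
            "F1": f1, "IoU": iou, "Kappa": kappa}


# Made-up counts for 100 pixels, for illustration only.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(name, round(value, 4))
```

For these counts the chance agreement is 0.5, so an overall accuracy of 0.85 corresponds to a Kappa of 0.7, which shows how Kappa discounts agreement expected by chance.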

The evaluation methods above focus on the image accuracy, but spatial correlation is extremely important in remote sensing. The object-based metric location shift (L) is computed to indicate the discrepancy and spatial relationship between the centroid location of predictions and the reference in pixels:

L = \sqrt{{(x_{p} - x_{r})}^{2} + {(y_{p} - y_{r})}^{2}}

(14)

where (x_p, y_p) and (x_r, y_r) are the centroid coordinates of the predictions and the reference. The deep learning approach can generally obtain high accuracy in the image segmentation tasks. Consequently, the introduction of an evaluation index for spatial characteristics, such as location shift, can facilitate the assessment of deep neural networks.
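Equation (14) can be sketched for a pair of binary masks as follows. The sketch treats each mask as a single object and takes the centroid as the mean position of its non-zero pixels; how objects were matched between prediction and reference is not described in the text, so this is an assumption, and the function names are hypothetical.

```python
import math


def centroid(mask):
    """Mean (x, y) position of the non-zero pixels in a binary mask."""
    points = [(c, r) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not points:
        raise ValueError("mask contains no foreground pixels")
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def location_shift(pred_mask, ref_mask):
    """Euclidean distance in pixels between the two centroids."""
    (xp, yp), (xr, yr) = centroid(pred_mask), centroid(ref_mask)
    return math.hypot(xp - xr, yp - yr)


reference = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 0]]
prediction = [[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
print(location_shift(prediction, reference))
```

Here the predicted block is displaced one pixel to the right of the reference, so the centroids differ by one pixel along x even though half of the pixels still overlap.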

3. Results

3.1. Model Parameters Selection

We trained four versions of DASFNet with different atrous rates in ResABlock and APPM: ResABlock [1,2,4] + APPM [1,2,4], ResABlock [1,3,8] + APPM [1,2,4], ResABlock [1,2,4] + APPM [2,4,8], and ResABlock [1,2,4] + APPM [3,6,12], where the numbers in brackets are the atrous rate combinations applied to the atrous convolution layers of each module. According to the similarity principle of spatial relationships in remote sensing, smaller atrous rate combinations were the better choice. Because ResABlock performs feature extraction, we compared two small atrous rate combinations for it. For APPM, three increasing atrous rate combinations were selected to fuse features at various scales. Figure 7 plots the training loss and the validation accuracy of the four models by epoch. Beyond approximately 110 epochs, model performance hardly improved, as the training curves changed only slightly and flattened; the maximum number of training epochs in our study was 120. Among the dilation rate settings (Figure 7), the curve of ResABlock [1,2,4] + APPM [1,2,4] (the blue curve) converged the fastest, and the curve of ResABlock [1,2,4] + APPM [3,6,12] (the red curve) showed the best performance, while the curves of the other two settings fluctuated and performed poorly. Table 1 compares the four models quantitatively through the accuracy assessment. In ResABlock, the smaller atrous rate combination gave better correctness but a slightly higher location displacement. In APPM, the combinations [1,2,4] and [3,6,12] stood out in some evaluation indices, and the location shift decreased markedly as the atrous rates increased. As a result, we chose ResABlock [1,2,4] + APPM [3,6,12] as the parameters of the DASFNet model to obtain higher accuracy and a lower location shift.

3.2. Comparison of Modules

We performed an ablation experiment, comparing our proposed model with variants in which a pivotal module was replaced. As mentioned in Section 2.3, three critical modules in DASFNet improved the accuracy of the network. For comparison, ResABlock was replaced by the basic Resblock without parallel branches and atrous convolution; the scSE module was removed or replaced by other attention modules, such as SEblock and CBAM; and APPM was replaced by PPM and ASPP.

Figure 8 compares the agricultural field segmentation results of networks with different module structures. Red pixels mark non-field pixels misclassified as fields, and green pixels mark field pixels misclassified as non-fields. Because they shared the same network backbone, all model variants generally extracted the fields well and differed mainly along ridges and edges. In most cases, ResABlock performed better in field extraction, whereas the structurally simpler Resblock sometimes produced errors. The model without an attention mechanism showed misclassified spots in the non-field region and even large displacement along the field ridges. The SEblock reduced the misclassification and location displacement, but some mistakes remained. The segmentation errors decreased significantly when CBAM or the scSE module was integrated. Regarding the multi-scale feature fusion modules, ASPP produced some missed and false points. PPM performed better than ASPP but missed larger areas on the edges, and APPM performed the best. Notably, some missed pixels in the agricultural field category formed an incomplete straight line resembling a ridge (the first row in Figure 8). On visual inspection, however, some of these mistakes looked plausible and may correspond to features missed in the reference annotation.

Figure 9 shows the binary confusion matrices for the module comparison in the ablation experiment, and Table 2 lists the accuracy assessment results of the model variants. In all three module comparisons, the model with our proposed modules achieved the best scores on all evaluation indices. Replacing Resblock with ResABlock did not improve the prediction accuracy remarkably, but it reduced the location shift by 0.3 pixels. Introducing progressively improved attention modules into the deep neural network clearly improved model performance. The location shift declined distinctly, indicating that the attention mechanism played a critical role in agricultural field extraction based on deep neural networks. Although the SEblock model was more accurate than the model without an attention mechanism, its location displacement also increased, possibly because channel-only attention ignored positional correlation. The ameliorated scSE module achieved the best performance among the three attention modules: relative to the model without an attention mechanism, IoU and the Kappa coefficient increased by 1% and 1.5%, respectively, and the location shift decreased by 0.9 pixels. Among the multi-scale feature fusion modules, ASPP performed poorly in this task, consistent with the prediction images. APPM showed the highest accuracy, with a 0.6% higher F1-score, 1.1% higher IoU, 1.6% higher Kappa coefficient, and 1.3 pixels lower location shift than ASPP.
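As a rough illustration of how indices of this kind follow from a binary confusion matrix, the sketch below computes precision, recall, F1-score, IoU of the field class, and the Kappa coefficient from four pixel counts. The function name `binary_metrics` and the counts are hypothetical and are not taken from Table 2 or Figure 9. The formulas are the standard textbook definitions, which may differ from the exact definitions used in this study (for example, if an index is averaged over both classes).

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard indices from a 2x2 confusion matrix (field = positive class)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)  # intersection over union of the field class
    observed = (tp + tn) / total  # observed agreement, i.e. overall accuracy
    # Agreement expected by chance, from the predicted and reference marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
        "overall_accuracy": observed,
        "kappa": kappa,
    }


# Hypothetical pixel counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Location shift is an object-based index and cannot be derived from these four counts, which is why it is reported separately.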

3.3. Comparison of Models

Figures 10 and 11 and Table 3 present the prediction outcomes and performance of DASFNet and six other deep learning models. All models had a depth of four, and ResNet [55] was used as the backbone network for consistency. SegNet [21] and ResUnet both have a symmetric encoder–decoder architecture without additional modules. SegNet used index mapping to connect the encoder and decoder, while ResUnet fused the encoder and decoder features through skip connections. SegNet demonstrated the worst performance among all models, indicating that the index mapping approach was not suitable for extracting the fields. Feature pixels not covered by the index mapping had no definite value during upsampling, which led to large location displacement. ResUnet exceeded expectations in this task thanks to its effective skip connection structure, achieving a 1.2% higher F1-score, 2.3% higher IoU, 3.5% higher Kappa coefficient, and 3.1 pixels lower location shift than SegNet. However, ResUnet had difficulty extracting fields in areas with high soil surface salinity, because its simple structure made it less robust.

DeeplabV3 [23] uses atrous convolution and the ASPP module. However, its predictions contained many errors along the edges, because the single atrous convolution used in feature extraction neglected the correlation between adjacent pixels. Moreover, the simple global pooling in its multi-scale feature fusion module ignored part of the spatial context information of the feature maps. PSPNet [22] used PPM to combine global features and context information and obtained good prediction results, with a 2.9% higher IoU and a 4.6% higher Kappa coefficient than DeeplabV3. SCAttNet [25], whose spatial and channel attention module is similar to CBAM, could adaptively refine the features and thereby effectively reduce segmentation errors. It obtained the highest recall among all models (0.9026), indicating that it recovered the largest share of the reference field pixels.

ResUnet-a was designed to extract field boundaries by treating the problem as multiple semantic segmentation and conditioned inference tasks [64]. For this comparison, ResUnet-a was built with the same modules but with fewer layers and a binary classification output, so that it ran under the same conditions as the other models. The assessment results showed that ResUnet-a outperformed the conventional deep learning models mentioned above. Compared with SCAttNet, ResUnet-a achieved a location shift 0.7 pixels lower, suggesting that ResBlock-a, with atrous convolution layers in parallel branches, helped minimize location drift.

Across all comparisons, DASFNet achieved the best scores on the testing dataset for every index except recall. Compared with ResUnet-a, our proposed model obtained a 0.2% higher F1-score, a 0.6% higher IoU, a 1.0% higher Kappa coefficient, and a 0.3 pixels lower location shift. On the indices based on the binary confusion matrix, our model had only a slight lead over the other models, which we attribute to the shared training epochs, model depth, and backbone. On location shift, our model was clearly superior, with a shift 4.3 pixels lower than SegNet, 1.0 pixels lower than SCAttNet, and 0.3 pixels lower than ResUnet-a. Deep learning methods are highly effective for image semantic segmentation, but remote sensing applications require attention not only to model accuracy but also to spatial correlation and contextual information. Therefore, including location shift in the assessment system was necessary. In the predicted images, SegNet and DeeplabV3 produced some false classifications at roads and field edges, and ResUnet missed a few agricultural fields. The conventional deep neural networks did not extract fine, dense field ridges precisely, whereas DASFNet reduced the error rate to a certain extent and distinguished well between fields and ridges in these areas.

3.4. In-Situ Observation

The reference data in our CNN-based study were acquired by visual interpretation. Compared with common in-situ observation, visual interpretation is spatially explicit and has less position error [65]. However, in-situ data are essential where visual interpretation may confuse similar vegetation types. Thus, we obtained in-situ ground reference data for the experimental fields in Alar, South Xinjiang, in November 2019 [66]. Figure 12 shows the in-situ observation of agricultural field boundaries alongside the model output. The result indicated that our proposed model delineated the extent of agricultural fields well while distinguishing them from the grass and trees near buildings.

3.5. Results of Agricultural Field Extraction

Figure 13 shows the agricultural fields extracted from the satellite image by our model, which successfully identified more than 99% of the reference fields. Our proposed model obtained high accuracy, with all evaluation indices greater than 0.88, and reduced the position offset to 1.2 pixels. Figure 13a is the false-color composite of the remote sensing image of the study area, Figure 13b is the grayscale image of the extraction results, and Figure 13c shows the agricultural field boundaries converted from the extraction result map.

4. Discussion

4.1. Deep Learning for Agricultural Field Extraction

This study proposed a deep learning model called DASFNet to extract agricultural fields from a GF-2 image in south Xinjiang, China. The model is structurally innovative, employing parallel-branch atrous convolution, a dual attention mechanism, and multi-scale feature fusion modules. We first selected the dilation rate combinations for the parallel convolution branches and, based on the evaluation metrics, adopted rates of (1, 2, 4) for ResABlock and (3, 6, 12) for APPM in subsequent model training. ResABlock uses a combination of smaller dilation rates and APPM a combination of larger ones, which reflects the different roles of the two modules. ResABlock performs feature extraction and uncovers local information, such as the detailed texture of the image. The fields in the study area are small, so atrous convolution with small rates can effectively extract the cropland features. According to the first law of geography, everything is related to everything else, but near things are more related than distant things [67]. Thus, parallel branches with smaller dilation rates better capture spatial correlation and retain more information in the extracted features, improving the prediction accuracy. By comparison, APPM performs multi-scale analysis and feature fusion and gathers information related to the overall profile. Its larger receptive fields help integrate contextual information, while the residual connection preserves the relevance of neighboring objects.
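To make the difference between the two dilation rate combinations concrete, the sketch below computes the spatial extent covered by a single atrous convolution kernel at each rate, using the standard relation k_eff = k + (k - 1)(d - 1). The 3 × 3 kernel size is an assumption for illustration, and the function name `effective_kernel_size` is hypothetical; only the rate combinations come from the text above.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length in pixels spanned by one dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


KERNEL_SIZE = 3  # assumed kernel size, not confirmed by the text
RATES = {"ResABlock": (1, 2, 4), "APPM": (3, 6, 12)}

spans = {
    module: [effective_kernel_size(KERNEL_SIZE, d) for d in rates]
    for module, rates in RATES.items()
}
for module, sizes in spans.items():
    print(module, sizes)
# ResABlock [3, 5, 9]
# APPM [7, 13, 25]
```

Under this assumption, each ResABlock branch looks at a neighborhood of at most 9 × 9 pixels, which suits local texture, whereas the APPM branches span up to 25 × 25 pixels and therefore gather wider context.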

We demonstrated the effectiveness of our improved modules by replacing the modules in DASFNet and comparing the resulting performance. We also compared DASFNet with other models, and our proposed model obtained the best results on the testing dataset for agricultural field extraction. DASFNet extracted the fields with high accuracy, including some irregular edges and dense field ridges. Compared with other deep models of the same depth and with ResNet as the backbone, DASFNet obtained relatively high evaluation index values. Training samples are often unbalanced, and judging model performance solely by overall accuracy (OA) is not comprehensive, because most samples may fall into a single category. Therefore, it is necessary to introduce other evaluation indices, including ones that capture spatial and location relationships, to evaluate the extracted geographical features. Metrics such as IoU, the Kappa coefficient, and location shift better demonstrate model performance because they reduce the impact of unbalanced samples and separate the compared models more clearly.
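The point about unbalanced samples can be seen in a small worked example. The sketch below uses a hypothetical tile of 1000 pixels, of which only 50 are fields, and a degenerate predictor that labels every pixel as non-field. The counts and function names are invented for illustration and do not come from the study's dataset.

```python
def overall_accuracy(tp, fp, fn, tn):
    """Share of all pixels that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def field_iou(tp, fp, fn):
    """Intersection over union of the field class (0.0 if the union is empty)."""
    union = tp + fp + fn
    return tp / union if union else 0.0


# Hypothetical tile: 50 field pixels, 950 non-field pixels.
# The predictor outputs "non-field" everywhere, so no field pixel is found.
tp, fp, fn, tn = 0, 0, 50, 950

oa = overall_accuracy(tp, fp, fn, tn)
iou = field_iou(tp, fp, fn)
print(f"OA = {oa:.2f}, field IoU = {iou:.2f}")
```

Here OA is 0.95 although not a single field was extracted, while the field IoU is 0. This is the sense in which OA alone is not a comprehensive measure on unbalanced data.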

4.2. Dual Attention Mechanism

DASFNet uses a dual attention mechanism module that combines channel attention and spatial attention. Convolution uses only local feature information to calculate each target pixel, which introduces bias because global feature information is largely ignored. With the attention mechanism, the model can draw on integrated global features during training with a good bias-variance tradeoff, which is more reasonable and can also enhance the interpretability of deep learning models.

In the module comparison, successively stronger attention mechanisms gradually improved model performance in agricultural field extraction, indicating that the attention mechanism plays a crucial part in field extraction by deep learning methods. Without the attention mechanism, the agricultural fields were poorly extracted, with cluttered misclassified and omitted spots in the predicted images and obvious shifts in the position of field ridges. SEblock employs channel attention and reduced the classification mistakes, but its results still contained some errors and offsets. Channel attention can strengthen the weights of field features and decrease the weights of non-target objects, such as roads, buildings, and rivers. Nevertheless, channel attention focuses solely on the weights of the extracted features, so spatial attention must be added to account for spatial correlation. With spatial correlation information, a deep learning model can better capture the boundary, shape, and location of fields and thus improve extraction accuracy. CBAM consists of two separate submodules, the channel attention module and the spatial attention module, which perform channel and spatial attention calculations, respectively. However, calculating channel attention first and spatial attention afterwards could cause aliasing between the spatial and channel information [68,69].

The scSE module assigns feature map weights separately along the channel and spatial dimensions, and calculating the two attention weights simultaneously achieved the best results in our task. Existing attention modules may confuse texture and background information. Therefore, the ameliorated scSE in our model uses parallel maximum and average pooling in both attention directions. Global maximum pooling mainly abstracts the texture information of the input features but misses the global background characteristics, whereas global average pooling takes the background information into account. Additionally, global average pooling uses spatial information but ignores differences along the channels. Compared with average pooling, global max pooling reveals the global maximum response and indicates the critical information in the channels to a certain extent [70]. The improved scSE module enhanced the weight of effective information, suppressed weak information, and balanced texture and background information, obtaining a 1% higher IoU, a 1.5% higher Kappa coefficient, and a 0.9 pixels lower location shift than the model without the attention mechanism. Figure 14 shows the feature maps obtained from the model without an attention mechanism and from the model with the scSE module. The feature maps obtained without the attention mechanism were messy, with more speckles, more missing features, and incomplete field contours. In contrast, with the attention mechanism, the field features were relatively complete and the attention weights were uniform, indicating better localization of fields and ridges.
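The idea of concurrent channel and spatial gating with parallel average and max pooling can be sketched in plain Python as follows. This is a deliberately simplified illustration, not the module used in DASFNet: the learned layers that a real scSE block applies to the pooled descriptors are omitted (the two pooled values are simply summed before the sigmoid), and adding the two recalibrated maps is an assumption about how the branches are combined. All function names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def channel_gate(features):
    """One weight per channel from parallel global average and max pooling."""
    gates = []
    for channel in features:
        values = [v for row in channel for v in row]
        avg_pool = sum(values) / len(values)
        max_pool = max(values)
        gates.append(sigmoid(avg_pool + max_pool))  # learned layers omitted
    return gates


def spatial_gate(features):
    """One weight per pixel from parallel average and max pooling over channels."""
    height, width = len(features[0]), len(features[0][0])
    gate = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [channel[i][j] for channel in features]
            row.append(sigmoid(sum(values) / len(values) + max(values)))
        gate.append(row)
    return gate


def dual_attention(features):
    """Recalibrate a C x H x W feature map with both gates computed in parallel."""
    c_gate = channel_gate(features)
    s_gate = spatial_gate(features)
    # Assumed combination: sum of the channel- and spatially-recalibrated maps.
    return [
        [[v * (c_gate[c] + s_gate[i][j]) for j, v in enumerate(row)]
         for i, row in enumerate(channel)]
        for c, channel in enumerate(features)
    ]


# Tiny hypothetical feature map: 2 channels of 2 x 2 pixels.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
recalibrated = dual_attention(features)
```

Because both gates are computed from the same input rather than one after the other, neither attention direction sees features already rescaled by the other, which is the contrast with the sequential CBAM design drawn in the text.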

4.3. Multi-scale Feature Fusion

The novel multi-scale feature fusion pyramid module APPM used in our proposed model improved the prediction accuracy and reduced the location offset compared with the models using ASPP and PPM. Proper fusion of multi-scale features is key to semantic edge detection in CNNs [71,72]. Using a multi-scale feature fusion pyramid module in the agricultural field extraction task helped obtain smooth field boundaries and accurate locations, enhancing model accuracy. ASPP expands the receptive field and incorporates contextual information without losing information or increasing parameters. However, the atrous convolution in ASPP combines only local features and does not fully use global scene category clues, resulting in a large position bias in the assessment. According to the module comparison, PPM may be more suitable for cropland segmentation. PPM adopts pooling layers of different scales, effectively extracting features from agricultural fields at different scales. However, its convolution with a small receptive field and its reduction of feature channels may lose contextual features at different scales.
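The pooling at different scales that underlies PPM can be illustrated with a minimal sketch of adaptive average pooling on a single-channel feature map. The bin sizes (1 and 2), the 4 × 4 input, and the function name `adaptive_avg_pool` are assumptions chosen for readability. The convolution, upsampling, and concatenation steps of a full pyramid module are left out, so this is not the APPM or PPM implementation itself.

```python
import math


def adaptive_avg_pool(grid, bins):
    """Average a 2D grid into bins x bins cells (floor/ceil bin boundaries)."""
    height, width = len(grid), len(grid[0])
    pooled = []
    for bi in range(bins):
        r0 = (bi * height) // bins
        r1 = math.ceil((bi + 1) * height / bins)
        row = []
        for bj in range(bins):
            c0 = (bj * width) // bins
            c1 = math.ceil((bj + 1) * width / bins)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map holding the values 0..15 in row-major order.
feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# One pooled summary per pyramid level: the coarsest level is a global average.
pyramid = {bins: adaptive_avg_pool(feature_map, bins) for bins in (1, 2)}
print(pyramid[1])  # [[7.5]]
print(pyramid[2])  # [[2.5, 4.5], [10.5, 12.5]]
```

The single-bin level captures the global scene summary that atrous convolution alone does not provide, while finer levels keep coarse positional information.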

APPM integrated the advantages of both pyramid modules and compensated for their weaknesses. It fused multi-scale shallow and deep information to improve the integrity and segmentation accuracy of field extraction. It also addressed the problems of ignored global information and feature loss, achieving an F1-score 0.6% higher, an IoU 1.1% higher, a Kappa coefficient 1.6% higher, and a location shift 1.3 pixels lower than ASPP. APPM enlarged the receptive fields to capture contextual information and united local and global image-level features. The corresponding convolution and pooling layers balanced local and global features at different scales. To avoid vanishing gradients and preserve the shallow information in the original feature maps, a residual skip connection was also added to APPM. Based on the evaluation indices and prediction outputs, the proposed model with APPM significantly improved the prediction accuracy and the integrity of the extracted fields, demonstrating the efficacy of the multi-scale feature fusion performed by this module.

4.4. Perspective

In the result maps of the agricultural fields extracted by DASFNet, the majority of the fields are correctly segmented, with a few incorrect or missed. A single complete field may at times contain dense ridges in the image. Nevertheless, some errors and omissions look reasonable under visual interpretation, such as an omitted ridge within a field area, and stem from uncertainty in the labeling process. Admittedly, the reference data could also miss information or contain incorrect or imprecise information. Thus, we refer to the labeled data as the ground reference rather than the ground truth. Field survey measurements may face accessibility problems due to vegetation and terrain. Image interpretation scales better and has no accessibility issues but may miss specific clues that in-situ visits would reveal [73]. Therefore, visual interpretation and in-situ mapping should be combined to produce the reference data, with efficient human image interpretation as the focus and field surveys as a supplement for error correction. Additionally, four bands (R, G, B, NIR) were used as the input layers of DASFNet, a common choice of input in related studies [64,74]. However, it would be meaningful to determine the importance of each band and the minimum set of input layers needed for agricultural field classification. In a further study, we will investigate the importance of the input data.

Supervised deep learning models usually require a large amount of training data, but labeling remote sensing images for training is extremely time-consuming and costly. High-quality benchmark datasets for agricultural scenes are lacking, which limits practical agricultural applications [75]. Moreover, without a benchmark dataset of agricultural fields, it is hard to compare related research results. Establishing a benchmark dataset of agricultural scenes is therefore important and would help in comparing state-of-the-art deep learning approaches. To apply the agricultural field extraction model, we constructed a dataset for the study region. In light of the huge labeling cost, we chose a single remote sensing image for our study, which may limit the sufficiency of the data. We focused on deep learning algorithms, and our proposed methods achieved reliable results in the comparisons. In further studies, we will take multi-source data and multi-temporal information into account, considering spatial and temporal scale effects. Apart from the available data volume, topographic factors affect the accuracy of agricultural field extraction. The terrain in remote sensing images has been found to limit how evenly deep learning models perform, for example in cropland mapping between plain and mountain areas [41]. Therefore, our future work will aim to reduce the impact of terrain factors on cropland identification.

Our work is a binary classification task that uses deep neural networks for field extraction from a remote sensing satellite image. The binary classification problem for agriculture is sometimes ambiguous. Agricultural fields may be confused with grasslands or other vegetation, which places a higher demand on the diversity and precision of the training set. Subsequent research will focus on improving agricultural field segmentation. With the rapid development of remote sensing technology, the growing volume of obtainable remote sensing data offers great analytical potential and value. In addition to optical remote sensing features, meteorological and geological clues can be added to cropland extraction to improve the accuracy and credibility of remote sensing image analysis. Field extraction is the basis for farmland area calculation, crop yield prediction, and the formulation and implementation of agricultural management policy. In follow-up studies, we will focus on multi-crop extraction, classification mapping, yield prediction, or change detection. Furthermore, fusing mechanistic models, prior knowledge, and deep learning can enhance the interpretability of deep learning models.

5. Conclusions

Our study proposed a deep learning model (DASFNet) that combines a dual attention mechanism and multi-scale feature fusion to extract agricultural fields from a GF-2 image in South Xinjiang, China. After determining the model parameters and choosing appropriate atrous rates, we compared the selected modules. The results indicated that our proposed modules, the ameliorated scSE module and the novel APPM, enhanced the prediction accuracy more than the alternatives. Compared with different deep learning segmentation models, DASFNet performed the best in the agricultural field extraction task, with an F1-score of 0.9017, an IoU of 0.8932, a Kappa coefficient of 0.8869, and a location shift of 1.1752 pixels. The object-based location shift metric helped discriminate more clearly among deep learning models that all achieved high accuracy. In summary, DASFNet can automatically extract agricultural fields from remote sensing images with high efficiency and accuracy. The dual attention mechanism module corrected the shape and boundary of the fields, and the multi-scale feature fusion module helped achieve accurate results across various field sizes. In south Xinjiang, the area of agricultural fields has changed over the years due to volatile climatic conditions, soil salinity, and agricultural management. Fast and accurate extraction via DASFNet shows great application potential in crop classification, yield prediction, and farmland resource protection.

Author Contributions

Conceptualization, Z.S. and R.L.; methodology, R.L.; software, R.L.; validation, R.L.; formal analysis, R.L. and N.W.; writing—original draft preparation, R.L.; writing—review and editing, N.W., Y.L., Z.S.; visualization, R.L. and W.W.; supervision, Z.S. and Y.Z.; project administration, Z.S., Y.Z. and N.W.; funding acquisition, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2018YFE0107000) and the Ten-thousand Talents Plan of Zhejiang Province (2019R52004).

Conflicts of Interest

The authors declare no conflict of interest.

References

Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
Matton, N.; Canto, G.S.; Waldner, F.; Valero, S.; Morin, D.; Inglada, J.; Arias, M.; Bontemps, S.; Koetz, B.; Defourny, P. An automated method for annual cropland mapping along the season for various globally-distributed agrosystems using high spatial and temporal resolution time series. Remote Sens. 2015, 7, 13208–13232.
Whitcraft, A.K.; Becker-Reshef, I.; Justice, C.O. A framework for defining spatially explicit earth observation requirements for a global agricultural monitoring initiative (GEOGLAM). Remote Sens. 2015, 7, 1461–1481.
Tirado, M.C.; Clarke, R.; Jaykus, L.A.; McQuatters-Gollop, A.; Frank, J.M. Climate change and food safety: A review. Food Res. Int. 2010, 43, 1745–1765.
Jung, J.; Maeda, M.; Chang, A.; Bhandari, M.; Ashapure, A.; Landivar-Bowles, J. The potential of remote sensing and artificial intelligence as tools to improve the resilience of agriculture production systems. Curr. Opin. Biotechnol. 2021, 70, 15–22.
Lobell, D.B.; Thau, D.; Seifert, C.; Engle, E.; Little, B. A scalable satellite-based crop yield mapper. Remote Sens. Environ. 2015, 164, 324–333.
Bai, X.D.; Cao, Z.G.; Wang, Y.; Yu, Z.H.; Zhang, X.F.; Li, C.N. Crop segmentation from images by morphology modeling in the CIE L* a* b* color space. Comput. Electron. Agric. 2013, 99, 21–34.
Hassanein, M.; Lari, Z.; El-Sheimy, N. A new vegetation segmentation approach for cropped fields based on threshold detection from hue histograms. Sensors 2018, 18, 1253.
Riehle, D.; Reiser, D.; Griepentrog, H.W. Robust index-based semantic plant/background segmentation for RGB-images. Comput. Electron. Agric. 2020, 169, 105201.
Zheng, H.; Zhou, M.; Zhu, Y.; Cheng, T. Exploiting the textural information of UAV multispectral imagery to monitor nitrogen status in rice. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7251–7253.
Zhang, P.; Xu, L. Unsupervised segmentation of greenhouse plant images based on statistical method. Sci. Rep. 2018, 8, 4465.
Crommelinck, S.; Bennett, R.; Gerke, M.; Yang, M.Y.; Vosselman, G. Contour detection for UAV-based cadastral mapping. Remote Sens. 2017, 9, 171.
Cheng, Z.; Qi, L.; Cheng, Y. Cherry Tree Crown Extraction from Natural Orchard Images with Complex Backgrounds. Agriculture 2021, 11, 431.
Khatami, R.; Mountrakis, G.; Stehman, S.V. A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research. Remote Sens. Environ. 2016, 177, 89–100.
Talukdar, S.; Singha, P.; Mahato, S.; Pal, S.; Liou, Y.A.; Rahman, A. Land-use land-cover classification by machine learning classifiers for satellite observations—A review. Remote Sens. 2020, 12, 1135.
De Castro, A.I.; Torres-Sánchez, J.; Peña, J.M.; Jiménez-Brenes, F.M.; Csillik, O.; López-Granados, F. An automatic random forest-OBIA algorithm for early weed mapping between and within crop rows using UAV imagery. Remote Sens. 2018, 10, 285.
Feng, S.; Zhao, J.; Liu, T.; Zhang, H.; Zhang, Z.; Guo, X. Crop type identification and mapping using machine learning algorithms and sentinel-2 time series data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3295–3306.
Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065.
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 905–909.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Roy, A.G.; Navab, N.; Wachinger, C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Cham, Switzerland, 2018; pp. 421–429.
Chen, Z.; Wang, C.; Li, J.; Xie, N.; Han, Y.; Du, J. Reconstruction bias U-Net for road extraction from optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2284–2294.
Li, X.; Wang, Y.; Zhang, L.; Liu, S.; Mei, J.; Li, Y. Topology-enhanced urban road extraction via a geographic feature-enhanced network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8819–8830.
Lin, Y.; Xu, D.; Wang, N.; Shi, Z.; Chen, Q. Road extraction from very-high-resolution remote sensing images via a nested SE-Deeplab model. Remote Sens. 2020, 12, 2985.
Tan, Y.; Xiong, S.; Yan, P. Multi-branch convolutional neural network for built-up area extraction from remote sensing image. Neurocomputing 2020, 396, 358–374.
Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-driven multitask parallel attention network for building extraction in high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4287–4306.
Jeppesen, J.H.; Jacobsen, R.H.; Inceoglu, F.; Toftegaard, T.S. A cloud detection algorithm for satellite imagery based on deep learning. Remote Sens. Environ. 2019, 229, 247–259.
Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212.
Shao, Z.; Pan, Y.; Diao, C.; Cai, J. Cloud detection in remote sensing images based on multiscale features-convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4062–4076.
Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782.
Scott, G.J.; England, M.R.; Starms, W.A.; Marcum, R.A.; Davis, C.H. Training deep convolutional neural networks for land–cover classification of high-resolution imagery. IEEE Geosci. Remote Sens. Lett. 2017, 14, 549–553.
Mahdianpari, M.; Salehi, B.; Rezaee, M.; Mohammadimanesh, F.; Zhang, Y. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Remote Sens. 2018, 10, 1119.
Wang, S.; Chen, W.; Xie, S.M.; Azzari, G.; Lobell, D.B. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sens. 2020, 12, 207.
Taravat, A.; Wagner, M.P.; Bonifacio, R.; Petit, D. Advanced fully convolutional networks for agricultural field boundary detection. Remote Sens. 2021, 13, 722.
Zhang, D.; Pan, Y.; Zhang, J.; Hu, T.; Zhao, J.; Li, N.; Chen, Q. A generalized approach based on convolutional neural networks for large area cropland mapping at very high resolution. Remote Sens. Environ. 2020, 247, 111912. [Google Scholar] [CrossRef]
Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-resolution context extraction network for semantic segmentation of remote sensing images. Remote Sens. 2021, 13, 71. [Google Scholar] [CrossRef]
Adhikari, U.; Nejadhashemi, A.P.; Woznicki, S.A. Climate change and eastern Africa: A review of impact on major crops. Food Energy Secur. 2015, 4, 110–132. [Google Scholar] [CrossRef]
Li, N.; Lin, H.; Wang, T.; Li, Y.; Liu, Y.; Chen, X.; Hu, X. Impact of climate change on cotton growth and yields in Xinjiang, China. Field Crops Res. 2020, 247, 107590.
Olesen, J.E.; Trnka, M.; Kersebaum, K.C.; Skjelvåg, A.O.; Seguin, B.; Peltonen-Sainio, P.; Rossi, F.; Kozyra, J.; Micale, F. Impacts and adaptation of European crop production systems to climate change. Eur. J. Agron. 2011, 34, 96–112.
Li, X.; Lei, Y.; Han, Y.; Wang, Z.; Wang, G.; Feng, L.; Du, W.; Fan, Z.; Yang, B.; Xiong, S.; et al. The relative impacts of changes in plant density and weather on cotton yield variability. Field Crops Res. 2021, 270, 108202.
Peng, J.; Biswas, A.; Jiang, Q.; Zhao, R.; Hu, J.; Hu, B.; Shi, Z. Estimating soil salinity from remote sensing and terrain data in southern Xinjiang Province, China. Geoderma 2019, 337, 1309–1319.
Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586.
Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
Yu, X.; Wu, X.; Luo, C.; Ren, P. Deep learning in remote sensing scene classification: A data augmentation enhanced convolutional neural network framework. GISci. Remote Sens. 2017, 54, 741–758.
Zhang, W.; Tang, P.; Zhao, L. Fast and accurate land-cover classification on medium-resolution remote-sensing images using segmentation models. Int. J. Remote Sens. 2021, 42, 3277–3301.
Mutanga, O.; Adam, E.; Cho, M.A. High density biomass estimation for wetland vegetation using WorldView-2 imagery and random forest regression algorithm. Int. J. Appl. Earth Obs. Geoinf. 2012, 18, 399–406.
Mohammadimanesh, F.; Salehi, B.; Mahdianpari, M.; Gill, E.; Molinier, M. A new fully convolutional neural network for semantic segmentation of polarimetric SAR imagery in complex land cover ecosystem. ISPRS J. Photogramm. Remote Sens. 2019, 151, 223–236.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Shao, Z.; Zhou, Z.; Huang, X.; Zhang, Y. MRENet: Simultaneous extraction of road surface and road centerline in complex urban scenes from very high-resolution images. Remote Sens. 2021, 13, 239.
Mei, S.; Ji, J.; Hou, J.; Li, X.; Du, Q. Learning sensor-specific spatial-spectral features of hyperspectral images via convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4520–4533.
Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual attentive fully convolutional siamese networks for change detection in high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206.
Han, B.; Yin, J.; Luo, X.; Jia, X. Multibranch Spatial-Channel Attention for Semantic Labeling of Very High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 2167–2171.
Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435.
Zhang, J.; Lin, S.; Ding, L.; Bruzzone, L. Multi-scale context aggregation for semantic segmentation of remote sensing images. Remote Sens. 2020, 12, 701.
Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images. Remote Sens. 2020, 12, 1050.
Lan, M.; Zhang, Y.; Zhang, L.; Du, B. Global context based automatic road segmentation via dilated convolutional neural network. Inf. Sci. 2020, 535, 156–171.
Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ. 2020, 245, 111741.
Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49.
Wang, N.; Peng, J.; Xue, J.; Zhang, X.; Huang, J.; Biswas, A.; He, Y.; Shi, Z. A framework for determining the total salt content of soil profiles using time-series Sentinel-2 images and a random forest-temporal convolution network. Geoderma 2022, 409, 115656.
Kang, J.; Fernandez-Beltran, R.; Duan, P.; Liu, S.; Plaza, A.J. Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2598–2610.
Hu, Z.; Yang, H.; Lou, T. Dual attention-guided feature pyramid network for instance segmentation of group pigs. Comput. Electron. Agric. 2021, 186, 106140.
Ouyang, S.; Li, Y. Combining deep semantic segmentation network and graph convolutional neural network for semantic segmentation of remote sensing imagery. Remote Sens. 2021, 13, 119.
Wang, W.; Zhang, J.; Wang, F. Attention bilinear pooling for fine-grained classification. Symmetry 2019, 11, 1033.
Ma, W.; Gong, C.; Xu, S.; Zhang, X. Multi-scale spatial context-based semantic edge detection. Inf. Fusion 2020, 64, 238–251.
Li, E.; Xia, J.; Du, P.; Lin, C.; Samat, A. Integrating multilayer features of convolutional neural networks for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5653–5665.
Persello, C.; Tolpekin, V.A.; Bergado, J.R.; de By, R.A. Delineation of agricultural fields in smallholder farms from satellite images using fully convolutional networks and combinatorial grouping. Remote Sens. Environ. 2019, 231, 111253.
Turkoglu, M.O.; D’Aronco, S.; Perich, G.; Liebisch, F.; Streit, C.; Schindler, K.; Wegner, J.D. Crop mapping from image time series: Deep learning with multi-scale label hierarchies. Remote Sens. Environ. 2021, 264, 112603.
Ma, L.; Liu, Y.; Zhang, X.; Ye, Y.; Yin, G.; Johnson, B.A. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. 2019, 152, 166–177.

Figure 1. False color composite of the GF-2 image (Blue, Green, Near Infrared) over the study area in Alar, south Xinjiang, China.

Figure 2. Examples of data augmentation for the agricultural field dataset (256 × 256 pixels).

Figure 3. The structure of DASFNet for agricultural field extraction. The blue part represents the residual atrous block (ResABlock); the green part represents the ameliorated spatial and channel squeeze and excitation module (scSE); the orange part represents the atrous pyramid pooling module (APPM).

Figure 4. The structure of the residual atrous block.

Figure 5. The structure of the ameliorated spatial and channel squeeze and excitation module.

Figure 6. The structure of the atrous pyramid pooling module.

Figure 7. Progression of (a) loss value and (b) training accuracy during training for models with different parameter combinations.

Figure 8. Agricultural field extraction results for the comparison of different modules. Color map: white—agricultural fields; black—non-field category; red—non-field category wrongly classified as agricultural fields; green—field category wrongly classified as non-field. The yellow box indicates the enlarged area to be compared.

Figure 9. Binary confusion matrices of the module comparison. (a) Structure with the residual block replaced; (b–d) structures with the attention mechanism replaced; (e,f) structures with the multi-scale feature fusion module replaced; (g) our model.

Figure 10. Agricultural field extraction results for the comparison of different models. Color map: white—agricultural fields; black—non-field category; red—non-field category wrongly classified as agricultural fields; green—field category wrongly classified as non-field. The yellow box indicates the enlarged area to be compared.

Figure 11. Binary confusion matrices of the model comparison. (a) SegNet; (b) ResUnet; (c) DeeplabV3; (d) PSPNet; (e) SCAttNet; (f) ResUnet-a; (g) our model.

Figure 12. Validation against in-situ observation. (a) In-situ observation; (b) model output.

Figure 13. Agricultural field extraction results in the study area. (a) False color composite of the GF-2 image; (b) grayscale map of the individual fields extracted by DASFNet; (c) boundaries of the extracted fields.

Figure 14. Examples of feature maps in models (a) without an attention mechanism and (b) with the scSE attention module.

Table 1. Accuracy assessment of the four models with different atrous rates.

Block	Atrous rates	OA	P	R	F	IoU	Kappa	L
ResABlock	[1,2,4]	0.9928	0.9013	0.9012	0.9010	0.8929	0.8879	1.9222
ResABlock	[1,3,8]	0.9894	0.8992	0.8980	0.8983	0.8874	0.8785	1.8985
APPM	[1,2,4]	0.9928	0.9013	0.9012	0.9010	0.8929	0.8879	1.9222
APPM	[2,4,8]	0.9916	0.9008	0.9011	0.9009	0.8921	0.8849	1.6639
APPM	[3,6,12]	0.9922	0.9023	0.9011	0.9017	0.8932	0.8869	1.1752

The numbers in brackets denote the atrous rate combination used in the module named in the Block column. OA: overall accuracy; P: precision; R: recall; F: F1-score; IoU: intersection over union; Kappa: Kappa coefficient; L: location shift (pixels). The bold values represent the highest accuracy.
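To make the atrous rate combinations in Table 1 more concrete, the following minimal sketch computes the spatial span covered by a single dilated convolution kernel for each rate, using the standard relation k_eff = k + (k − 1)(r − 1). The 3 × 3 kernel size is an assumption made for illustration and is not stated in this section; the function names are hypothetical.

```python
# Sketch only: the 3x3 kernel size is an assumption, not taken from the paper.

def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span (in pixels) covered by one dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def spans(rates, kernel_size=3):
    """Effective kernel span for each atrous rate in a combination."""
    return [effective_kernel_size(kernel_size, r) for r in rates]


# Atrous rate combinations listed in Table 1.
RATE_COMBINATIONS = [(1, 2, 4), (1, 3, 8), (2, 4, 8), (3, 6, 12)]

if __name__ == "__main__":
    for rates in RATE_COMBINATIONS:
        print(rates, "->", spans(rates))
```

Under this assumption, larger rates widen the context each branch sees without adding parameters, which is the usual motivation for comparing such combinations.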

Table 2. Accuracy assessment on the comparison of different modules in ablation experiment.

Module	Variant	OA	P	R	F	IoU	Kappa	L
ResABlock	Resblock	0.9914	0.8995	0.9005	0.8997	0.8902	0.8833	1.4602
ResABlock	ResABlock	0.9922	0.9023	0.9011	0.9017	0.8932	0.8869	1.1752
scSE	None	0.9869	0.8974	0.8968	0.8969	0.8844	0.8712	2.0426
scSE	SEblock	0.9882	0.8974	0.8992	0.8982	0.8868	0.8757	2.1688
scSE	CBAM	0.9915	0.9007	0.9006	0.9005	0.8913	0.8835	1.5780
scSE	scSE	0.9922	0.9023	0.9011	0.9017	0.8932	0.8869	1.1752
APPM	PPM	0.9891	0.8985	0.8989	0.8986	0.8875	0.8784	1.8046
APPM	ASPP	0.9867	0.8975	0.8945	0.8958	0.8824	0.8707	2.4273
APPM	APPM	0.9922	0.9023	0.9011	0.9017	0.8932	0.8869	1.1752

The Variant column gives the module substituted for the DASFNet module named in the Module column. The bold values represent the highest accuracy.

Table 3. Accuracy assessment on the comparison of different models.

Model	OA	P	R	F	IoU	Kappa	L
SegNet	0.9780	0.8899	0.8841	0.8860	0.8652	0.8445	5.4505
ResUnet	0.9903	0.9003	0.8976	0.8984	0.8881	0.8799	2.3418
DeeplabV3	0.9742	0.8834	0.8866	0.8846	0.8613	0.8344	2.2932
PSPNet	0.9899	0.8998	0.9002	0.8999	0.8900	0.8806	2.2462
SCAttNet	0.9907	0.8979	0.9026	0.9000	0.8905	0.8822	2.1575
ResUnet-a	0.9904	0.9014	0.8982	0.8996	0.8870	0.8769	1.4989
Our Model	0.9922	0.9023	0.9011	0.9017	0.8932	0.8869	1.1752

The bold values represent the highest accuracy.
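As a reading aid for the columns in the accuracy tables, the sketch below derives overall accuracy, precision, recall, F1-score, IoU and the Kappa coefficient from the four cells of a binary confusion matrix, using the standard textbook definitions. It is an illustration under stated assumptions, not the authors' evaluation code: the counts in the example are made up, the function name is hypothetical, and whether the paper averages per-class scores is not stated in this section. The location shift (L) is omitted because its definition does not appear here.

```python
# Sketch only: standard binary-classification definitions, with "field" as
# the positive class. Assumes every denominator is non-zero.

def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute OA, P, R, F1, IoU and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    oa = (tp + tn) / total                      # overall accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)                   # intersection over union
    # Expected agreement by chance, from the row and column marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return {"OA": oa, "P": precision, "R": recall,
            "F": f1, "IoU": iou, "Kappa": kappa}


if __name__ == "__main__":
    # Hypothetical pixel counts, chosen only to make the arithmetic easy to follow.
    for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.4f}")
```

Because Kappa discounts chance agreement and IoU ignores true negatives, both can sit well below the overall accuracy when the two classes are imbalanced.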

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, R.; Wang, N.; Zhang, Y.; Lin, Y.; Wu, W.; Shi, Z. Extraction of Agricultural Fields via DASFNet with Dual Attention Mechanism and Multi-scale Feature Fusion in South Xinjiang, China. Remote Sens. 2022, 14, 2253. https://doi.org/10.3390/rs14092253

