High ‐ Resolution Boundary Refined Convolutional Neural Network for Automatic Agricultural Greenhouses Extraction from GaoFen ‐ 2 Satellite Imageries

.


Introduction
Agricultural greenhouses (AGs) are a common cultivation facility of modern installation agriculture and are also an important managing object for facility agriculture. With the advantage of reducing the impacts of the external environment, such as adverse climate conditions, AGs could extend the growing season and increase crop yield significantly [1][2][3][4], and are considered an evolutionary transition from extensive to intensive farming [5].
Since the first generation of AGs was used in 1950s [6], the total coverage of AGs has reached 3.019 million hectares globally [7], mostly distributed in North Africa, the Middle East, Europe (Mediterranean areas) and China [1,8,9]. According to statistical reports released by officials in 2017, the cover area of AGs in China is up to 0.981 million hectares and the proportion has been ranked first worldwide for a long time [10]. However, since most materials used are difficult to be completely degraded by the natural environment [11,12], AGs raise wide concerns and criticism from many for posing environmental issues caused by rapid expansion, such as soil pollution, plastic waste, biological degradation and farmland destruction [13,14]. As a result, it is necessary to map and estimate AG coverage accurately and effectively, as it is significant to the sustainable development of regional agriculture and environmental conservation [15].
Remote sensing images (RSIs) from satellites and aerial platforms are widely used in various fields, such as land-use mapping [16], disaster monitoring [17] or urban planning [18,19]. AG mapping using high resolution imagery has attracted increased attention. However, features of RSIs obtained by sensors vary greatly under different atmospheric conditions or solar illuminance. Moreover, construction materials vary in transmittance and reflectivity, and together with pixel confusion caused by semitransparent film and plant crops present a highly dispersed and heterogeneous appearance and bring challenges to the identification of AGs [20].
Traditional manual acquisition methods, such as visual interpretation implemented according to image characteristics by interpretation experts, are time-consuming and laborious [21]. The accuracy is decided by expertise and experience, which cannot meet the demand of automated large-scale extraction. Recently, many approaches for AG extraction by using RSIs have been proposed, which can be divided into pixel-based and objectoriented classification methods in terms of the types of basic units. The pixel-based (PB) method [22][23][24] is performed on a single pixel and considers only spectral characteristics of different wavebands, which can cause "salt and pepper noise" [25] and deterioration in accuracy. Unlike the PB approach, the object-oriented method [26,27] focuses on units composed of several homogeneous and adjacent pixels [28]. This method can not only make use of the spectral information but also takes into account the texture, shape and structure of greenhouse patches, which effectively avoids noise and minimizes confusion of similar types [29]. Nevertheless, because of the requirement to manually design and filter features, that is, to analyze their relationship and select an optimal combination of features, there are still shortcomings of insufficient intelligence and unreasonable segmentation, such as under-segmentation and over-segmentation.
In recent years, deep learning technology [30,31], particularly deep convolutional neural networks (DCNN), has been applied to various fields such as semantic segmentation and object detection. DCNN can automatically and effectively extract hierarchical features from initial input, which surpasses traditional methods in many visual tasks [32]. Since the fully convolutional network (FCN) [33] was first proposed, semantic segmentation [34][35][36][37][38], a pixel-level classification task which aims to assign each pixel to a componential category [39], has been developed rapidly. On the basis of FCN, some state-of-theart (SOTA) semantic segmentation frameworks have been proposed, such as UNet [40], SegNet [41], PSPNet [42] and DeepLab series [43][44][45][46], remarkably improving the performance of segmentation. In the latest research, HRNetV1 [47] and HRNetV2 [48] were proposed successively to address multiscale feature extraction through high-resolution DCNN and reached a new achievement in semantic segmentation. Drawing inspiration from human vision, the attention mechanism [49] was gradually applied to DCNN, especially in natural language processing and computer vision [50], which has become a popular strategy for increasing the accuracy of semantic segmentation networks. Some particular modules have been designed, such as the squeeze and excitation network [51], which focus on the independence between channels. A nonlocal neural network [52] was put forward to capture the long-distance spatial dependence between pixels, thus enhancing the expression by gathering specific global information for each pixel. Some of the latest research includes GCNet [53], CBAM [54] and DANet [55], which combine attention blocks of both spatial and channel dimensions. Enlightened by the great success of DCNN in natural images [56][57][58], researchers have extended it to remote sensing image processing in diverse applications, such as land use land cover (LULC) [59], which shows that AGs extraction from RSIs can also highly benefit from semantic segmentation. For example, Yang et al. [60] compared the performance of SegNet and FCN in identifying plastic agricultural mulches on UAV aerial images. Baghirli et al. [61] employed two modified variations based on UNet architecture incorporating dilated convolutions and skip connections to identify AGs on images obtained by SPOT-7. Sun et al. [62] utilized two-temporal Sentinel-2 imageries acquired in different seasons and a one-dimensional DCNN for mapping.
In previous research, encoder-decoder frameworks [40,41] have mainly been adopted in DCNN-based AG identification algorithms. Although the resolution of an image is gradually rebuilt by fusing shallow features at the decoder stage, these methods still lose details of the spatial information through down-sampling operations, such as precise localized edges and geometric shapes. Specifically, since low-level coarse features could introduce noise, incorrect identification often occurs; for example, some buildings and factories with similar characteristics are misclassified as AGs. In addition, although some attempts have been made to solve multiscale problems [63][64][65], the local receptive fields (RFs) still limit the feature expression to a certain extent, leading to insufficient context aggregation information, thereby small AGs may be ignored, and some large-scale measures are discontinuous and holey. Moreover, multispan greenhouses (the overall AGs with large area integrated by original independent single-room greenhouses through reasonable design and excellent materials) are distributed in dense blocks, while most independent greenhouses are scattered and fragmented. How to maintain original morphological features, such as right angles and straight lines, is still a challenge to face. Some studies [66,67] have made efforts to obtain boundary information or utilize extra processing to improve the results of segmentation, yet the tedious operations, such as separate two-stream architecture or two subnets, cost more time.
To address the problems mentioned above, in this study a novel framework called high-resolution boundary refined network (HBRNet) is proposed on the basis of HRNetV2 aiming to extract AGs from RSIs more accurately, especially for boundary regions. First, we proposed the Pyramid Cross Channel Attention (PCCA) module and combined it with the basic residual module of ResNet [68] to generate two new residual units, which serve as basic elements of new backbone characterized by more powerful feature extraction capability. Then, the Spatial Enhancement (SE) module is introduced to aggregate multiscale information, eliminating problems arising from AGs of various sizes, namely scale inconsistency. Finally, the Boundary Refined (BR) module is designed to pay close attention to the edge pixels, correspondingly, the joint loss function derived from segmentation loss (SegLoss) and boundary constraint loss (BCLoss) is designed to supervise the learning procedure.
The main three contributions in this study are summarized as follows:  We develop a parallel resolution-preserved convolutional network named HBRNet to accurately and effectively extract multiscale AGs. Unlike HRNetV2 and other approaches, two new residual blocks embedded with the PCCA module are employed to form a new backbone, realizing the interaction of cross-channel information.  Based on the parallel multibranch architecture, we develop the SE module to capture more dense multiscale contextual information. This module is a combination of the Dense Cascaded Atrous Spatial Pyramid Pooling (DCASPP), a global encoding module and a skip connection, improving the segmentation accuracy of categories at different scales.  We propose the BR module and the joint loss composed of SegLoss and BCLoss. Under the restriction, the edge morphology maintenance is significantly improved. Specifically, the Spatial Gradient Variation (SGV) unit and the joint loss both combine the segmentation task and the boundary task so that they can interact with each other, facilitating the convergence of the framework.

Study Area and Dataset
The study area is briefly introduced in Sections 2.1 and 2.2 describes the data sources and the process of data preprocessing in detail.

Study Area
The study area is Shouguang City (36°41′~37°19′N, 118°32′~119°10′E), which is located in the northwest of Weifang, Shandong Province, China ( Figure 1). It is an agricultural region adjoining cities of Hanting, Changle, Qingzhou and GuangRao, covering 2072 square kilometers in total. This area is located in a temperate zone with a continental monsoon climate characterized by four distinct seasons, where the hot rainy summer is particularly favorable for crop growth. Therefore, AGs have expanded rapidly in the past few decades and gradually have become the largest vegetable production base in China, so that the city is known as "China's Vegetable Township" [69]. According to our field survey, the most widely used materials for AGs are transparent and translucent plastics and March to May is the high production period of crops by AGs [15].

Dataset and Preprocessing
In this study, GaoFen-2 (GF-2) images were collected to create the experimental dataset. Since the platform is equipped with two high-resolution scanners, the GF-2 images contain four multispectral bands with a spatial resolution of 3.2 m and a panchromatic band with 0.8 m. Table 1 introduces the payload parameters of the GF-2 satellite [70]. To take full advantage of all spectral information, geometric orthorectification was conducted using Rational Polynomial Coefficient polynomial [71], then the pansharpening method [72] was adopted to fuse multispectral and panchromatic bands, obtaining a 0.8 m resolution image with four wavebands including blue (B), green (G), red (R) and near infrared (NIR); the bit depth is unified to 8 bits. All the above preprocessing processes were implemented based on PCI Geo Imaging Accelerator software. The ground-truth binary maps, denoting AGs and background, were manually labeled via visual interpretation with Photoshop CC 2017. In our experiment, 20 images for regions of interest were selected for the dataset, which were acquired at different times and evenly distributed over the study region to guarantee the diversity and adequacy of samples. The size of each image ranged from 1604 × 1465 pixels to 6477 × 7557 pixels. Due to their large size, we randomly clipped them to patches of 512 × 512 pixels. To fit in with the GPU's capacity, all patches were randomly divided into training and test sets at a ratio of 8:2 and all four wavelengths were employed to evaluate the performance of our method. Note that a small part of the training samples was used for validation. Finally, 6700 tiles of images were generated totally, with 4020, 670 and 2010 tiles in the training, validation and test sets, respectively.

Methodology
In this section, we introduce the proposed method in detail. An overview of the network structure is first briefly presented in Section 3.1. Then Section 3.2 exhaustively describes the basic HRNetV2 and the improvements on it including the PCCA module, two new residual blocks and the new backbone. The SE module is illustrated in Section 3.3; and, finally, the BR module and the loss functions used in this article are elaborated on in Sections 3.4 and 3.5, respectively.

Overview of Network Architecture
As shown in Figure 2, the overall pipeline of HBRNet is mainly composed of three components including backbone, segmentation head and boundary refined module.
First, the new backbone, a parallel multipath network combining the idea of multiscale grouped convolution and attention mechanism together, is designed and served as a feature extractor, which inherits the advantages of HRNetV2, acquiring multiscale highlayer semantic information while retaining spatial detail information. Second, the segmentation head employs up-sampling and summation operations along the structure from the bottom up to fuse branches after SE module, aiming to rebuild high-resolution feature maps and output segmentation probability maps. Then, the primary boundary mask obtained after SGV operation is fused with the features extracted from the aggregated semantics to generate the finer boundary. Finally, the joint loss integrated by SegLoss and BCLoss is used to supervise the training phase. The resolution and dimension of the feature maps channels are marked as H, W and C in Figure 2. H and W refer to the height and width of input.

Backbone
In this section, the basic HRNetV2 is firstly introduced and then the PCCA module is described in detail. Finally, we elaborate on two new residual blocks and the novel backbone.

The Basic HRNetV2
As pictured in Figure 3, HRNetV2 consists of several parallel branches (here four branches) in the architecture corresponding to the gradually descending spatial resolutions. The upper branch can retain a high resolution and preserve the details of spatial information, while the lower branches, gradually added one by one through stride convolution, can obtain richer semantic information benefiting from larger receptive fields. The single resolution feature is acquired by a set of residual blocks in each branch at every stage. Subsequently, the fusion of different branches is realized through a fully connected way so that each resolution can repeatedly get information from others. It is worth nothing that, unlike the basic residual block as the basic element of stage one, all other stages employ the bottleneck block as unit. Finally, the feature maps of all parallel low-resolution subnets are up-sampled to the spatial size of the top branch, then concatenation is employed along the channel dimension to generate the segmentation result. After the input, two 3 × 3 layers with a stride of 2 (see Stem in Figure 3) down-sample the feature map to 1/4 of the original resolution and increase the width (channel number) to C to reduce the memory cost. Therefore, the resolution in different branches decreases to 1/4, 1/8, 1/16 and 1/32 of the original image, and the corresponding width increases to C, 2C, 4C and 8C, respectively.

Pyramid Cross Channel Attention Module
The latest research has suggested that the performance for segmentation can be appropriately improved by the attention mechanism [73,74]. Therefore, we drew inspiration from [51,75,76] and developed the PCCA module so as to understand the relevance between multiscale features and to measure the importance of each channel.
As displayed in Figure 4, there are four steps for the execution of the PCCA. First, we can gain multiscale feature maps on all channels through the proposed Pyramid Squeeze Mechanism module (PSM). Secondly, the Channel Attention Mechanism module (CAM) is adopted to extract channel-wise attention vectors of feature maps with different scales. Then, the softmax function is operated on channel-wise attention vectors to generate the whole recalibrated weight, which can be interpreted as a cross channel attention descriptor for all multiscale channels. In the fourth step, the recalibrated weight and corresponding feature map are subjected to an element-wise product and the optimized feature maps with richer scale information will be output.

 Pyramid Squeeze Mechanism Module
The Pyramid Squeeze Mechanism module (PSM), a fundamental operator in the proposed PCCA, is implemented for multiscale feature extraction. As presented in Figure 5, the spatial information of the input feature map is obtained from multiple branches and there are the same number of input channels in each branch. The input features are first squeezed through multiscale convolution kernels and multisize group convolutions in a pyramid structure, and then branches with different scales are obtained. For each branch , the output channel dimension is . Note that must be divisible by . Specifically, the multiscale feature maps are generated by: where and indicate the -th kernel size and group size, respectively, and ∈ ℝ denotes the feature maps with different scales. Furthermore, considering the computational cost, we follow the default settings from the author and employ a novel criterion for selecting the group size and the kernel size without increasing the number of parameters [76]. The relationship between the two can be formulated as: In our study, is set to 4, so is calculated as 3, 5, 7 and 9 by the formula above, the corresponding are 2, 4, 8 and 16. By employing different kernels to group convolution, the input can be squeezed and multiscale spatial information is learned independently.  Cross Channel Attention The channel attention mechanism allows the network to increasingly stimulate related channels and suppress fewer effective channels. Therefore, we adopted the Channel Attention Mechanism module (CAM) of SENet [51], which is comprised of squeeze and excitation operations. The diagram is illustrated in Figure 6. Firstly, based on the features after PSM module, a channel descriptor ∈ ℝ is calculated by the global average pooling operation to integrate channel-wise global information. The -th element of can be calculated by the equation: Then, the excitation operation is implemented for adaptively recalibrating the channel-wise relationship. Formally, the vector of attention weight is defined as: , where is the ReLU [77] function, ∈ ℝ and ∈ ℝ refer to two fully-connected layers and represents a sigmoid function. By learning a nonlinear interaction between channels, the attention weight vector ∈ ℝ with different scales is obtained. Subsequently, assuming not destroying the original channel attention vector, in order to realize the integration of the cross-dimensions vector and the interaction of attention information, a soft attention mechanism is employed to adaptively select information at different spatial scales, which is under the guidance of the descriptor , as shown in Figure 4. It is given by: where the re-calibrated weight of all multiscale channels is produced by the softmax function, thus the interaction between local and global channels attention is enabled. Then, the re-calibrated weight of multiscale channel attention is multiplied by the corresponding feature map as: in which ⊙ denotes channel-wise multiplication and ∈ refers to the re-calibrated feature map. Finally, we can generate the refined feature maps by: where represents the concatenation operation along the channel dimension and ∈ ℝ represents the whole final output. Generally, it can be seen from the above reasoning process that the proposed PCCA module can not only integrate spatial information in different scales, but also focuses much on cross channel attention, which can achieve better information interaction between local and global channels.

New Residual Blocks and New Backbone
As shown in Figure 7, two brand new blocks named PCCA Basic block and PCCA Bottleneck are proposed and the former adds a PCCA module to the basic block while the latter replaces the 3 × 3 convolution with the PCCA module in the bottleneck of ResNet. Owing to the ability of the PCCA module to integrate multiscale spatial information and cross-channel attention, new residual blocks can extract multiscale spatial information at a finer-grained level and form a long-range channel dependency. Correspondingly, a novel backbone is formed by applying the new proposed residual blocks above to the original HRNetV2. Specifically, different from HRNetV2, the minimum number of residual units (four) are kept in each branch of same stage and the last is replaced with one of the two new units, aiming to make the network lightweight and efficient. Then, we modified the width of the top branch (branch 1) to 64 for high-resolution feature enhancement so as to better restore spatial detail information later. It is worth noting that, referring to the location of the basic unit in HRNetV2, PCCA Basic block is only embedded at stage 1 while PCCA Bottleneck is adopted at the other three stages. Similar to HRNeTv2, there are four parallel branches in the new backbone, and each subbranch is generated through stride convolution from the upper-branch. The resolution of the feature maps is down-sampled to 1/4 of the original input after Stem. As can be followed from Figure 8

Spatial Enchancement Module
As we all know, in complicated scenes, frequent scale changes of AGs result in limited extraction accuracy and problems such as omission of small greenhouses and incomplete large-scale greenhouses often occur. To alleviate this phenomenon, a theoretically ideal context module needed to be designed. Therefore, we developed a unit called Spatial Enhancement (SE) module combining the ideas of dense connection [78] and ASPP [46], which aims to gain denser multiscale features of RSIs. As shown in Figure 9, the SE module is composed of a Dense Cascaded Atrous Spatial Pyramid Pooling (DCASPP), a global encoding module and a skip connection. The global encoding module is implemented through global average pooling operation to generate a representation of the whole feature map, which can effectively prevent overfitting [79]. The skip connection, an element-wise summation followed by a simple 1 × 1 convolution operation, is adopted to take full advantage of high-level semantic features and promote model convergence. The DCASPP uses hybrid multiple cascade 3 × 3 atrous depth-wise separable convolutions [80] with different dilation rates and densely connects them, which can be expressed by the formula: in which stands for the dilation rate of -th layer, . represents the concatenation operation in channel dimension and . is the depth-wise separable convolution operation. Note that refers to the input. It has been proved that depth-wise separable convolution can effectively reduce the number of trainable parameters without deteriorating performance [80]. In this paper, the dilation rates are set 1, 2, 3, 5. The RFs calculated by the common ASPP are 3, 5, 7, 11, respectively, while the maximum RF of each layer in the DCASPP can reach 3, 7, 13, 23. DCASPP has the ability to acquire larger RFs and capture denser multiscale contextual information compared to the common ASPP, implying it is more robust to scale changes of AGs.

Boundary Refined Module
Enlightened by the block designed in [81], we propose the Spatial Gradient Variation (SGV) unit by modifying it, where semantic boundaries of AGs can be easily generated from a segmentation mask by spatial gradient deriving. Herein, adaptive pooling is used to derive spatial gradient : where , represents the position of pixel in the segmentation mask probability map; | | denotes the absolute value function and is the max-pooling operation with kernel size and stride . The derived boundary width is decided by the parameter . In our study, is set to 3 and to 1. The reason why we use the max-pooling operation instead of the Global Average Pooling (GAP) is that it can produce more precise semantic boundary results. Some examples are shown in Figure 10. As illustrated in Figure 2, in the segmentation head, the feature maps at various resolutions are used to generate the final segmentation probability map combining up-sampling and summation operations. Then, the boundary mask derived from segmentation probability map through SGV is fused in a concatenation way with those single-channel features generated by 1 × 1 convolution, followed by which, the concatenated feature is fed into a convolution layer to finally obtain the boundary probability map. We only choose high-resolution single-channel features from branch 1 and branch 2 (1/4, 1/8 of the input) to better preserve details, which are conducive to boundary extraction.
The SGV module establishes a prototype for boundary extraction, while the derived boundary serves as the constraint for segmentation learning. Compared with independent segmentation and boundary detection, the two tasks can interact with and benefit from each other, so the training becomes easier.

Loss Functions
As is shown in Figure 2, both segmentation and boundary probability maps are generated from our proposed network. Accordingly, we put forward two loss functions for different tasks.
SegLoss: The standard cross-entropy loss, referred to as , is most commonly adopted in semantic segmentation, which treats each pixel equally. We apply it to segmentation task and compute as follows: where and , respectively, indicate the ground truth value and predicted probability of pixel and the total number of pixels in a mini-batch is represented by . Only when a pixel belongs to AGs, 1, otherwise, is 0.
BCLoss: Considering the ambiguity and sparsity of pixels around the boundary, boundary extraction suffers from a higher missing rate. Obviously, it is unreasonable to treat each pixel equally. In order to mitigate this effect, the following class-balanced crossentropy loss function named is employed, drawing inspiration from [82][83][84]: where denotes the proportion of non-edge pixels in the total number of pixels in boundary label and 1 means the pixel is assigned to edge of AGs. Joint loss: Therefore, we couple SegLoss with BCLoss and finally propose the joint loss function as the following formula: , (12) in which stands for a weight parameter for measuring the contribution of BCLoss. In this paper, we tested four different values including 0.5, 1, 2, 5 and the results prove setting it to 2 is the best choice. In this way, the boundary mask can perform auxiliary supervision for the segmentation task of AGs.

Experiments
In this section, we present the experiments of our study, i.e., experimental details in Section 4.1, evaluation metric in Section 4.2, comparison to SOTA studies in Section 4.3 and mapping of study area in Section 4.4.

Experimental Details
Configurations: The proposed network was implemented using Pytorch1.1 framework in the Ubuntu 18.04.5LTS environment. All work was performed on a single NVIDIA GTX TITAN X GPU with 12G memory.
Training Settings: For optimization, the Adam optimizer [85] with an initial learning rate 0.001 ( ) was chosen for training, where weight decay was set to default as recommended. We adopted an exponential weight-decay strategy with a decay rate of 0.9 to adjust the learning rate for each epoch formulated as ( 0.9 ). In our study, all models for comparison were trained from scratch for about 50 epochs up to convergence. In view of the limitation of GPU memory, the batch size was set to 4 and same parameter settings were guaranteed to equally evaluate the performance of different approaches.
Dataset settings: Given the risk of overfitting, we apply some common data augmentation methods on each image of 512 × 512 pixels, such as vertical-horizontal flipping and random rotation within a range of [90,180,270]. In addition, the pixel values were rescaled between 0 and 1 before being fed into networks.
Inference settings: In the inference phase, the flipping inference strategy was used for evaluation, so we can get final results by averaging all predictions.

Evaluation Metric
To objectively evaluate the performance of HBRNet and other methods, four common pixel-level metrics consisting of precision, recall, score and Intersection over Union (IoU), were used in our experiments. The precision and recall are formulated as follows: , , where , and represent the number of correctly detected AGs pixels (true positives), missed AGs pixels (false positives) and falsely classified background pixels (false negatives), respectively. Precision means the proportion of in all positive predictions, while recall represents the proportion of on all positive samples. Based on them, the score and IoU are expressed as below: where the former is the weighted average of the precision and recall and the latter calculates the ratio that AGs pixels are correctly recognized.

Comparison to SOTA Studies
In this section, we illustrate the quantitative comparisons in Section 4.3.1 and then Section 4.3.2 provides the visualization results.

Quantitative Comparisons
To demonstrate the effectiveness of the proposed network, we conducted contrastive experiments on an AG dataset to compare HBRNet with several recent SOTA methods for semantic segmentation in the computer vision community, including FCN8s, UNet, Se-gNet, PSPNet, DeepLabV3+ and HRNetV2. Table 2 exhibits the quantitative comparisons of various algorithms with the best results highlighted in bold. According to the results, our proposed HBRNet shows a great improvement and outperforms other SOTA methods in all metrics on the AG dataset. We can clearly observe that the IoU score has been improved by 1.93% over the latest model HRNetV2, which showed the second-best performance. Especially, the proposed approach obtains approximately 4.41%, 10.77%, 7.7% and 13.6% improvement in precision, re-call, score and IoU, compared with FCN8s, and achieves 97.22%, 97.53%, 97.38% and 94.89%, respectively. The near-perfect performance demonstrates that our HBRNet has enough robustness to be applied for AG extraction of RSIs with complex scenarios. Furthermore, we compared our method with some recent research including ResUNet [86], LANet [87], MAP-Net [88] and GSCA-UNet [89]. These algorithms all focus on the recognition of various kinds of ground objects in RSIs, which has a certain similarity with AGs extraction and, for fairness, they were reproduced in the same framework.
ResUNet combined advantages of residual learning and UNet. MAP-Net generated multiscale features by a modified multipath neural network and obtained global dependency through a pyramid spatial pooling module to exactly extract building footprints, achieving an IoU accuracy of 90.86% on the WHU datasets. LaNet introduced two light modules including Patch Attention Module (PAM) and Attention Embedding Module (AEM) on the basis of FCN, which aims at tackling the gap between low and high features and improves the precision of segmentation. GASC-UNet developed a global spatial context attention (GSCA) module and embedded it into the U-shaped structure of the encoder and decoder network to adaptively aggregate the global context information of each pixel in the spatial dimension. A better balance was achieved between automation and accuracy.
The segmentation results are reported in Table 3 and the best entries are marked in bold. As we have seen, compared with these approaches, our HBRNet achieves the best performance. Especially, Score metric and IoU metric are boosted by 1.43% and 2.67%, respectively, slightly over the second-best MAP-Net.

Visualization Results
To visually compare different methods, we display some sample results in Figure 11. There are seven examples including AGs with a variety of scales and appearances listed in seven rows. Obviously, compared with other SOTA approaches, HBRNet obtained the most consistent results with ground truth and has strong robustness for greenhouse identification in different complex scenarios. On the contrary, the generalization capacity of other methods is insufficient.
On the whole, FCN shows the poorest performance. As presented in Figure 11c, there exist numerous errors in AG extraction results, which is mainly due to the loss of spatial information through down-sampling operation. Owing to noise introduced from coarse low-level feature maps, SegNet produces blurred and jagged edges, implying the boundary localization is not sufficiently accurate. Meanwhile, "salt and pepper noise" always appears in the results. A striking illustration is listed in the last row of Figure 11d, where parts of farmland and roads are confused as AGs. Compared with the two models above, PSPNet, DeepLabV3+ and UNet gained better segmentation performance and all obtained a reasonable result in a simple scenario. However, with more diverse AGs and more complex scenes, parts were misclassified or missed. For example, it can be clearly observed that small greenhouses suffer a high miss rate according to the third and fourth row in Figure 11 and large greenhouses are always incompletely extracted and holey, as displayed in the fifth and sixth row, suggesting that these methods are sensitive to the variation of scales and difficult to accurately recognize as AGs. Benefiting from multibranch parallel architecture, HRNetV2 can easily reserve spatial details, which means the integrity of AGs of various scales can be maintained to some extent. Nevertheless, the effect of edge extraction is not ideal. We realize that AGs with irregular structure may contain multiple corners, and HRNetV2 cannot correctly distinguish other adjacent pixels with similar features, possibly contributing to over-extraction (see the first and second row of Figure 11h).
Despite the gradual progress, these classic semantic segmentation algorithms have their own limitations and shortcomings. On the contrary, HBRNet combines three improved modules including the PCCA, SE and BR module, which respectively strengthen the interpretation capability of the network, overcome multiscale identification problems, reduce the probability of omission and misrecognition and provide simple but effective solutions for edge problems. As shown in Figure11i, our method not only obtains the best extraction results but restores the geometric shape and spatial distribution of AGs to the greatest extent, which verifies its superiority and accuracy. To demonstrate the details clearly, Figure 12 presents the results of typical areas on the images in a closer perspective, circled by the yellow rectangle. We chose representative models including FCN, SegNet, PSPNet, HRNetV2 and our HBRNet for comparison.
From close observation, we can find: 1. Our network can detect all AGs more comprehensively than other approaches. As presented in the first row of Figure 12, AGs with inconspicuous characteristics were successfully identified and plastic film mulch with similar spectral response could be separated, implying that the proposed model has been equipped with strong feature extraction capability, which is superior to other methods. 2. We can see from the segmentation results that it is difficult for other SOTA methods to distinguish the easily confused adjacent non-greenhouse objects. Especially, as displayed in the middle columns in Figure 12, "adhesion" phenomena usually appear, that is, the background between AGs distributed more densely can be easily misidentified as greenhouse because of a short distance, thus AGs existing as independent individuals cannot be recognized correctly. However, boundary supervision implemented through the joint of the BR module and BCloss in our network can alleviate the indistinguishability and present the truth, as can be seen from the last column in Figure 12 .

Mapping of Study Area
Compared with natural images, RSIs have huge data magnitudes. The size of an orthorectified GF-2 image is more than 30,000 × 30,000 pixels, which greatly exceeds the 512*512 size of patches used in the training process. As mentioned in the previous section, the sufficiency of contextual information affects segmentation performance, which suggests that pixels close to the edge cannot be classified as confidently as ones close to the center, because the lack of information outside the patch limits the contextual information of the edge region. There may appear stitching traces if a regular grid is adopted for cropping, predicting and mosaicking.
To further improve the segmentation performance and obtain smooth masks, we adopt the strategy of edge-ignoring prediction, that is, employing the sliding window to crop the image with overlapping and ignoring part of the edge area during the mosaicking. As shown in Figure 13, the actual prediction result of cropped patch has a size of A × A, yet the final size used for stitching is a × a. In our experiment, the size of the sliding window is 512 × 512 and the number of edge pixels ignored in the four directions of the patch is 56. In other words, the 400*400 square part centrally is used for stitching. Finally, we mosaic multiple segmentation masks to obtain the distribution mapping of the entire study area. As presented in Figure 14, we can observe that the south is a concentrated district of AGs in Shouguang while it is relatively sparse in the north, implying that our method can identify AGs in different scenarios, which is applicable for large-scale automated mapping.

Discussion
In this section, the comparative experiment is firstly used to verify the generalization of the BR module and then we analyze the roles of each proposed module. Finally, the complexity analysis of related methods and the limitations of our method are discussed.

Effectiveness of the Boundary Refined Module
To verify the generalization of the BR module, we applied it to UNet and the corresponding BCLoss was also introduced for training. According to records in Table 4, whether embedded in UNet or in our framework, this module can improve the performance with little increase in computational cost. The comparison results strongly demonstrate its effectiveness and certain applicability. Note that the flipping inference strategy is not adopted in this contrast experiment. As mentioned in Section 3.5, the proposed joint loss function may affect the segmentation performance; to further investigate the contribution of the boundary auxiliary task, we used different joint loss functions with varying weight in Equation (12) to train HBRNet on the AG dataset. The network was trained eventually to get five sets of results in total, four for AGs extraction and boundary detection simultaneously and the remaining one for only AGs extraction task. All training processes took 50 epochs to converge and the performances with different parameter configurations are displayed in Table 5. The weight decides how much the network focuses on the edge and the degree of emphasis increases as grows. Table 5 clearly shows that no matter what the value is set to, our approach can obtain better results on all metrics with the auxiliary constraint of boundary learning. Figure 15 also proves that the network with boundary learning is beneficial to maintain the morphological characteristics of AGs, with more complete edge, clearer straight and right angles and fewer misclassified pixels. However, we find that a too big or too small value can affect the stability of performance, so for a tradeoff, it is recommended to set this parameter to 2 to maximize the overall performance.

Ablation Experiments
For the purpose of analyzing the roles of different modules of HBRNet, ablation studies were conducted on the AG dataset and we adopted precision, recall, Score and IoU for evaluation. The ablation experimental results are shown in Table 6 with the best entries highlighted in bold.
First, based on the baseline HRNetV2, the new backbone described in Section 3.2 was applied to optimize its structure. The modified framework, with a fusion operation at the end that was the same as in HRNetV2, outperforms HRNeV2 by 0.67%., which is contributed to by the powerful feature extraction capability of the PCCA module. Then, after embedding the SE module in the segmentation head to capture multiscale information, the IoU is improved by 0.18%, indicating that SE is robust in addressing crucial scale issues for AG objects. Finally, as expected, compared only with the segmentation task, the combination of boundary extraction and the BCLoss function boosted the IoU performance by 0.83%; and, furthermore, with the assistance of flipping inference strategy, our proposed method achieves a 94.89% IoU, which is a 1.93% arise compared to the baseline.

Complexity Analysis
Our proposed approach adopts a parallel multipath structure; some branches are needed to process high-resolution feature maps to maintain accurate localization information, which may result in a mass of parameters. To verify the tradeoff between the complexity and performance of HBRNet, three indexes including parameter, FLOPs and IoU of related algorithms are compared on the AG dataset.
As presented in Table 7, HRNetV2 contains the fewest parameters but relatively lower extraction accuracy. Due to a large number of convolution operations, PSPNet and DeeplabV3+ with ResNet50 as encoder have higher complexity, whereas their performance is poor. The reasonable explanation is that its decoder structure does not rebuild high resolution feature maps step by step. It is also clearly observed that ResUNet has a small number of parameters but still numerous FLOPs, which poses more requirements for hardware. After the redesign of the backbone and introduction of SE module and BR module, the complexity of HBRNet slightly increased based on HRNetV2; however, compared with other methods, better performance was achieved with far fewer parameters.
To visually compare the complexity and performance, we display the experimental results on a scatter plot as shown in Figure 16, in which diameter of blue circle denotes the size of the model file and the number of parameters and IoU metric stand for the complexity and accuracy of each other method. It can be concluded that compared with other related methods, our proposed HBRNet maintains the highest accuracy and relatively lower complexity.

Conclusions
Effective and accurate AG extraction is crucial to monitor the distribution and estimate coverage of AGs. However, due to diverse geometric appearances and different spectral features, traditional semantic segmentation has limited accuracy and inaccurate boundary localization. Therefore, we set up an AG dataset and a novel framework named HBRNet was proposed to conduct experiments on this dataset. In this method, the PCCA module was introduced into a parallel multibranch architecture to form a new backbone. While maintaining high-resolution detailed information, it also recalibrated the contribution of each channel through cross channel attention, enhancing its capability of feature extraction. Furthermore, the SE module further rebuilds and optimizes features in the space domain and thus the network can attach more importance to small-size greenhouses and restrain the holes that appear on the larger scale. In particular, the BR module was combined with the BCLoss to prompt the model to pay more attention to the pixel classification around the boundary. The comprehensive experiments demonstrated that our method outperforms other typical semantic segmentation approaches in performance with relatively less complexity. In addition, we conducted ablation experiments to evaluate the consequences of each proposed module. Although a small number of parameters cost less memory, our HBRNet is relatively inefficient because of the existence of group convolution. In future research, we will further improve our algorithm in computational efficiency towards a lighter convolution structure. Our current experiments mainly focus on AG extraction and we hope it will be employed to various target recognition tasks such as building and aquaculture areas to verify its applicability.
Author Contributions: X.Z. designed the experiments and wrote the manuscript. B.C. provided the original data, supervised the study and reviewed the draft paper. C.L. and J.C. revised the manuscript and gave some appropriate suggestions. All authors have read and agreed to the published version of the manuscript. Acknowledgments: This work was partially funded by the National Natural Science Foundation of China (61731022, 61860206004) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant Numbers XDA19010401). The authors also thank the anonymous reviewers and the editors for their insightful comments and helpful suggestions to improve our manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations:
The following abbreviations are used in this manuscript: