Improved Winter Wheat Spatial Distribution Extraction Using A Convolutional Neural Network and Partly Connected Conditional Random Field

Abstract: Improving the accuracy of edge pixel classification is crucial for extracting the winter wheat spatial distribution from remote sensing imagery using convolutional neural networks (CNNs). In this study, we proposed an approach using a partly connected conditional random field model (PCCRF) to refine the classification results of RefineNet, named RefineNet-PCCRF. First, we used an improved RefineNet model to initially segment remote sensing images, obtaining the category probability vector for each pixel and the initial pixel-by-pixel classification result. Second, using manual labels as references, we performed a statistical analysis on the results to select pixels that required optimization. Third, based on prior knowledge, we redefined the pairwise potential energy, used a linear model to connect different levels of potential energies, and used only pixel pairs associated with the selected pixels to build the PCCRF. The trained PCCRF was then used to refine the initial pixel-by-pixel classification result. We used 37 Gaofen-2 images obtained from 2018 to 2019 of a representative Chinese winter wheat region (Tai'an City, China) to create the dataset, and employed SegNet and RefineNet as the standard CNNs and a fully connected conditional random field as the refinement method to conduct comparison experiments. The RefineNet-PCCRF's accuracy (94.51%), precision (92.39%), recall (90.98%), and F1-Score (91.68%) were clearly superior to those of the methods used for comparison. The results also show that the RefineNet-PCCRF improved the accuracy of large-scale winter wheat extraction results using remote sensing imagery.

and Z.X.; software: S.W., H.Y., and Z.Z.; validation: Z.M., T.Z., and Y.W.; formal analysis: C.Z., Z.X., and S.W.; investigation: J.Z., Z.M., and T.Z.; resources, S.G.; data curation, S.G. and Y.W.; writing (original draft preparation), C.Z., Z.X., and S.W.; writing (review and editing): C.Z.; visualization: and supervision:


Introduction
The spatial distribution of a crop comprises the shape, location, and area of each planted plot. The accurate measurement of crop spatial distributions is of great significance for scientific research, food security, estimates of grain production, and agricultural management and policy [1][2][3]. The fineness of plot edges is a key indicator of the quality of crop spatial distribution data; consequently, research on obtaining large-scale, high-quality crop spatial distributions has attracted widespread attention [4,5].
Ground surveys can be used to obtain accurate crop spatial distributions. However, this method is highly labor-intensive and time-consuming, thereby making it difficult to obtain large-scale data [6]. The data obtained via ground surveys are mainly used to verify the data obtained using other technologies [7].
As remote sensing technologies can rapidly obtain up-to-date, large-scale, finely detailed ground images, remote sensing imagery has become the main source of data used to generate accurate crop spatial distributions [8][9][10]. Image segmentation technology can produce pixel-by-pixel classification results; thus, it is widely used in extracting crop spatial distributions [11,12]. Furthermore, both the specific pixel feature extraction method and classifier have a decisive impact on the accuracy of the classification results [13,14].
As pixel features form the basis for high-quality image segmentation, previous studies have developed various feature extraction methods to obtain effective pixel features [15,16]. Early remote sensing image segmentation relied on spectral features, of which the normalized difference vegetation index (NDVI) was a frequently used feature when extracting vegetation [17]. Spectral feature extraction is based on statistical and analytical techniques: a series of mathematical operations is performed on the channel values of each pixel, and the result is used as the value of the pixel feature [18].
In low-spatial-resolution images, such as from Moderate Resolution Imaging Spectroradiometer (MODIS) and Enhanced Thematic Mapper/Thematic Mapper (ETM/TM), pixels inside winter wheat and other crop fields have good consistency and low change rates, which can better distinguish crop fields from other land-use types [19,20]. However, at the edge of the crop planting area, the feature value extracted from the mixed pixels has a weak discrimination ability, resulting in more pixels being misclassified [21,22]. In addition, differences in crop growth within the planting area adversely affect the spectral feature extraction, thereby resulting in mis-segmented pixels that form the so-called "salt and pepper" phenomenon [23,24].
As spectral features only express the characteristic information of the pixels themselves, the results are usually not ideal when they are applied to higher-spatial-resolution images [23]. Higher-spatial-resolution remote sensing images contain more detailed information, and the spatial correlation between pixels is significantly enhanced; however, spectral features cannot express this correlation information and are therefore ineffective in such cases [25,26]. To better express the spatial correlation information between pixels, previous studies have proposed a series of texture feature extraction methods, such as the wavelet transform [27,28], Gabor filter [29,30], and gray level co-occurrence matrix (GLCM) [31]. Combining spectral and textural features enables the extraction of higher-quality crop spatial distributions from low- and medium-resolution imagery [32].
In addition to the spectral and texture features, previous studies have developed a series of methods, including neural networks [33,34], support vector machines [35,36], random forests [37][38][39], and decision trees [40,41], to obtain features with improved distinguishing abilities for high-spatial-resolution remote sensing images. These methods generally use the channel values of pixels as the input, as well as complex mathematical operations to obtain improved distinguishing features. As these methods do not consider or barely consider the spatial correlation between pixels, the distinguishing ability of the extracted features is not ideal for several types of new higher-spatial-resolution remote sensing images.
Remote Sens. 2020, 12, 821
With the success of convolutional neural networks (CNNs) in camera image processing, researchers began to use these networks for feature extraction from remote sensing images and have achieved good results [42][43][44][45]. The convolution operation can accurately express the spatial relationship between pixels and extract deep information from the pixels (when the convolution kernel is set appropriately), combining the advantages of previous feature extraction methods [14][46][47][48]. Classic CNNs, such as the Fully Convolutional Network (FCN) [49], SegNet [50], DeepLab [51], and RefineNet [52], form the basis for the rapidly developing field of remote sensing image segmentation. Although the use of CNNs can significantly improve the accuracy of remote sensing image segmentation, errors remain common near object edges owing to the inherent characteristics of the convolution operation [49][50][51][53]. Thus, convolution must be combined with other post-processing techniques to improve the accuracy of the results [51,54,55].
RefineNet and most other classic CNNs typically use two-dimensional (2-D) convolution methods to extract feature values. Two-dimensional convolution methods are suitable for processing images with few channels, such as optical remote sensing images or camera images [56]. To preserve the spectral and spatial features when processing hyperspectral remote sensing images, previous studies have used three-dimensional (3-D) convolution methods to extract spectral-spatial features [56,57]. As the 3-D convolution method can fully use the abundant spectral and spatial information of hyperspectral imagery, it has achieved remarkable success in the classification of hyperspectral images.
A conditional random field (CRF) is a commonly used post-processing technique for camera image segmentation [55,58]. As CRFs can capture both local and long-range dependencies within an image, they significantly improve CNN segmentation results [59]. However, the modeling process of existing CRFs, such as the fully connected CRF, is complicated and requires a large number of calculations [60]. To make these calculations tractable, previous studies have used approximate calculations [60,61], reduced the number of samples involved in modeling [62,63], and introduced conditional independence [64][65][66]. However, these measures reduce the performance of the CRFs [67]. To combine a CNN with a CRF and achieve end-to-end training, several studies [67][68][69] have converted the CRF into an iterative calculation, while others [64] have converted the CRF into a convolution operation.
Existing CRF models use only the channel values and position of each pixel, which emphasizes the smoothness of the image data [70]. As the spatial resolution of a remote sensing image is significantly lower than that of a camera image, the color change at object boundaries is not as apparent as in a camera image. Therefore, when a CRF is applied to remote sensing image segmentation, new features should be used in the modeling process. Moreover, in existing CRF modeling, the CNN output is used only as the unary potential function, and any other information provided by the CNN is not used. In addition, connecting the unary and pairwise potential functions with equal weights is unreasonable and needs to be improved.
As winter wheat is an important food crop, previous studies have proposed numerous methods to extract the spatial distribution information of winter wheat from remote sensing images. When low- and medium-resolution images are used as data sources, NDVI and other vegetation indices are typically used as the main features [71]. When higher-resolution remote sensing images are used as data sources, regression methods [72], support vector machines [73,74], random forests [75], linear discriminant analysis [76], and CNNs [77,78] are the more commonly used methods. A common problem that all of these methods must overcome is the significant number of mis-segmented pixels at the edges of winter wheat planting areas. Although the edge accuracy of the winter wheat planting area can be improved with the use of a CRF [78], improving the computational efficiency of CRFs remains an important and urgent issue.
In this study, we proposed a partly connected conditional random field (PCCRF) model to post-process the RefineNet extraction results, referred to as RefineNet-PCCRF, to eventually achieve the goal of obtaining the high-quality winter wheat spatial distribution. The main contributions of this paper are as follows:
• The statistical analysis technology is used to analyze the segmentation results of RefineNet, and prior knowledge is applied to PCCRF modeling.
• Based on prior knowledge, we modified the fully connected conditional random field (FCCRF) to build the PCCRF. We refined the definition of pairwise potential energy, employing a linear model to connect the unary potential energy and pairwise potential energy. Compared to the equal weight connection model used in the FCCRF, the new fusion model used in the PCCRF can better reflect the different roles of information generated from a larger receptive field and information generated from a smaller receptive field.
• We only used pixel pairs associated with the selected pixels in the PCCRF, which can effectively reduce the amount of data required for computing models and improve the computational efficiency of the PCCRF.
• Benefiting from the ability to describe the spatial correlation between pixel categories of a CRF, RefineNet-PCCRF can not only improve the classification accuracy of edge pixels in the winter wheat planting area, but it also has high computing efficiency.

Study Area
Tai'an City covers an area of 7761 km² within the Shandong Province of China (116°20′ to 117°59′ E, 35°38′ to 36°28′ N), including 3665 km² of farmland. This region is an important crop production area (Figure 1). The area is a temperate, continental, semi-humid, monsoon climate zone with four distinct seasons and sufficient light and heat to allow for crop growth. The average annual temperature is 12.9 °C, the average annual sunshine is 2627.1 h, and the average annual rainfall is 697 mm. The main crops include winter wheat (grown from October through June of the following year) and corn (grown from April to November).

Remote Sensing and Pre-Processing
We collected 37 Gaofen-2 (GF-2) remote sensing images from November 2018 to April 2019 covering the entire study area. Each GF-2 image consisted of a multispectral and panchromatic image. The former was composed of four spectral bands (blue, green, red, and near-infrared), where the spatial resolution of each multispectral image was 4 m, whereas that of the panchromatic image was 1 m.
Environment for Visualizing Images (ENVI) Version 5.5 (developed by Harris Geospatial Solutions, Broomfield, Colorado, United States of America) is remote sensing image processing software that integrates numerous mainstream image processing tools and therefore improves the efficiency of image processing and utilization. In particular, ENVI supports the Interactive Data Language (IDL), with which image processing programs can be developed according to our requirements, further improving work efficiency. We used ENVI, the license for which was purchased by the Shandong Provincial Climate Center, to pre-process the imagery in three steps: atmospheric correction using the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) module, orthorectification using the Rational Polynomial Coefficient (RPC) module, and data fusion using the Nearest Neighbor Diffusion (NNDiffuse) Pan Sharpening module. We developed a batch program in IDL to improve the degree of automation during pre-processing.
After pre-processing, each image contained four channels (red, blue, green, and near-infrared) with a spatial resolution of 1 m.
The main land-use types captured in the images were winter wheat, mountain land, water, urban residential area, agricultural building, woodland, farm land, roads, and rural residential area, among others. As winter wheat was the main crop in the pre-processed images, we used it as the extraction target in this study to test the effectiveness of the proposed method.

Create Image-Label Pair Dataset
Larger image blocks are advantageous for model training. Considering the hardware used in our research, we cut each pre-processed image into equal-sized image blocks (1000 × 1000 pixels). A total of 920 cloudless image blocks were selected for manual labeling with numbers assigned to the following categories: (1) winter wheat, (2) mountain land, (3) water, (4) urban residential area, (5) agricultural building, (6) woodland, (7) farm land, (8) roads, (9) rural residential area, and (10) others. While selecting the image blocks, we used the following principle: each image block should contain at least three land-use types, and the area proportion of each land-use type in the selected blocks should be similar to that in the pre-processed images.
We created a label file for each image block, comprising a single-channel image file in which the number of rows and columns was identical to the corresponding image. We used visual interpretation to assign a category number to each pixel and saved it in the corresponding location in the label file. After labeling, the image block and its corresponding label file formed an image-label pair (Figure 2).
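The block cutting and image-label pairing described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names `cut_into_blocks` and `make_pair` are hypothetical, arrays are assumed to be NumPy arrays in (rows, columns, channels) order, and discarding incomplete edge strips is an assumption, since the text does not say how partial blocks were handled.

```python
import numpy as np

def cut_into_blocks(image, block=1000):
    """Cut an (H, W, C) array into non-overlapping block x block tiles.

    Edge strips that do not fill a whole tile are discarded (an assumption;
    the source does not describe the treatment of partial tiles).
    """
    rows = image.shape[0] // block
    cols = image.shape[1] // block
    return [
        image[r * block:(r + 1) * block, c * block:(c + 1) * block]
        for r in range(rows)
        for c in range(cols)
    ]

def make_pair(image_block, label_block, n_categories=10):
    """Validate a single-channel label array against its image block."""
    if label_block.shape != image_block.shape[:2]:
        raise ValueError("label must have the same rows and columns as the image")
    if label_block.min() < 1 or label_block.max() > n_categories:
        raise ValueError("category numbers must lie in 1..n_categories")
    return image_block, label_block
```

With `block=1000`, a pre-processed four-channel image yields tiles of shape (1000, 1000, 4), each paired with a (1000, 1000) label array whose values are the category numbers 1 to 10.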

Methodology
We first modified the original RefineNet model as an initial segmentation model (Section 3.1), and then performed statistical analysis on the initial segmentation results to obtain the prior knowledge (Section 3.2). Based on the obtained knowledge, we constructed the PCCRF model (Section 3.3) and trained the model (Section 3.4). The trained model was then used to refine the initial segmentation results of the CNNs to generate the final results. We designed a set of comparative experiments to evaluate the performance of the proposed method (Section 3.5). Figure 3 summarizes the entire flowchart of the proposed approach.

Improved RefineNet Model
We selected RefineNet as our initial segmentation model. Unlike the FCN, SegNet, DeepLab, and other models, this model uses a multi-path structure that fuses low-level detailed semantic features with high-level rough semantic features, thereby effectively improving the distinguishability of the pixel features. We modified the classic RefineNet model to initially segment remote sensing images; Figure 4 shows the structure of the improved RefineNet model.

Improvements to the RefineNet model were as follows. First, we replaced the equal weight fusion model used in the classic model with a linear fusion model to fuse detailed low-level semantic features and high-level rough semantic features. The fusion method is as follows:
$$s = a f + b g \tag{1}$$
where s denotes the fused features, f represents the detailed low-level semantic feature values generated by the convolution block, g denotes the up-sampling feature of the high-level rough semantic features, and a and b are the coefficients of the fusion model. The specific values of a and b must be determined via model training. Second, we modified the classifier of RefineNet, i.e., Softmax, to simultaneously output the prediction category label and category probability vector, P, for each pixel.
The probability that a pixel is assigned the ith category label, $p_i$, was calculated as follows:
$$p_i = \frac{e^{r_i}}{\sum_{j=1}^{m} e^{r_j}} \tag{2}$$
where m is the number of categories, and $r_i$ and $r_j$ represent the outputs of the RefineNet encoder, i.e., the products of the pixel's feature vector and the ith and jth feature functions, respectively. Based on the definition of $p_i$, P can be defined as follows:
$$P = (p_1, p_2, \ldots, p_m) \tag{3}$$

We used the stochastic gradient descent algorithm [79] to train the improved RefineNet model, and used the trained model to segment image blocks to obtain initial segmentation results, including the prediction label image and category probability vectors for each pixel.
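The two modifications described above, linear fusion of low-level and high-level features and a Softmax classifier that returns both a label and a probability vector, can be illustrated with a small NumPy sketch. The function names are hypothetical, and the coefficients a and b are passed in as plain numbers here, whereas in the paper they are learned during training.

```python
import numpy as np

def linear_fusion(f, g, a, b):
    """Fuse low-level features f with up-sampled high-level features g: s = a*f + b*g."""
    return a * f + b * g

def softmax_probabilities(r):
    """Convert per-pixel scores r of shape (H, W, m) into probability vectors P."""
    shifted = r - r.max(axis=-1, keepdims=True)  # subtract the maximum for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def classify(r):
    """Return the predicted category label image and the probability vectors."""
    P = softmax_probabilities(r)
    labels = P.argmax(axis=-1) + 1  # categories are numbered from 1, as in the label files
    return labels, P
```

Returning P alongside the labels is what makes the later confidence analysis possible, since that analysis needs the largest and second-largest entries of P for each pixel.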

Statistical Analysis of the Initial Segmentation Results
In a previous study [21], we proposed the confidence level, CL, as an indicator of the credibility of the category label that the CNN predicts for a pixel. CL is computed from $p_{max}$ and $p_{max}'$, where $p_{max}$ represents the maximum value of P and $p_{max}'$ represents the maximum value of P with $p_{max}$ excluded. We used Cgate to represent the confidence level threshold. The predicted category label of a pixel was considered credible if CL > Cgate, and not credible otherwise. After Cgate was determined, the pixel set I = {1, 2, . . . , m} was divided into two subsets, as follows:
$$\mathrm{PC} = \{\, i \in I \mid \mathrm{CL}_i > \mathrm{Cgate} \,\} \tag{5}$$
$$\mathrm{PIC} = \{\, i \in I \mid \mathrm{CL}_i \le \mathrm{Cgate} \,\} \tag{6}$$
where $\mathrm{CL}_i$ is the confidence level of pixel i. As the classification results of the pixels in the PC were credible, we only needed to post-process the classification results of the pixels in the PIC.
The value of Cgate had a significant impact on the overall accuracy. When Cgate was high, the number of pixels that required post-processing was large, such that there was a significant improvement in the overall classification accuracy. When the value of Cgate was low, the number of pixels that required post-processing was small, but improvements to the overall classification accuracy were not always apparent.
The following steps were used in our study to determine the value of Cgate. First, we used a TIFF file to store the CL of each pixel together with its predicted category label, category probability vector, and manually labeled category.
Second, the pixels were divided into two sets based on the manually labeled category and predicted category using the following rules:
$$\mathrm{PR} = \{\, i \mid \text{manual category label of pixel } i = \text{predicted category label of pixel } i \,\} \tag{7}$$
$$\mathrm{PW} = \{\, i \mid \text{manual category label of pixel } i \ne \text{predicted category label of pixel } i \,\} \tag{8}$$
Third, a histogram was produced for PR and PW using the CL as the x-axis and the number of pixels corresponding to a certain CL value as the y-axis. Figure 5 provides an example of a histogram, which was used to determine the value of Cgate. In general, the principle is that when CL is greater than Cgate, the number of misclassified pixels should be as small as possible.
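The three steps above can be sketched as follows. This is a hedged illustration only: the CL values are taken as a precomputed array (the sketch does not assume any particular CL formula), the function names and the bin count are hypothetical, and the final helper simply counts the wrongly classified pixels that a candidate threshold would leave unprocessed, which is one way to apply the stated selection principle.

```python
import numpy as np

def split_right_wrong(manual, predicted):
    """Boolean masks for PR (labels agree) and PW (labels differ)."""
    pr = manual == predicted
    return pr, ~pr

def cl_histograms(cl, manual, predicted, bins=50):
    """Histograms of CL for PR and PW on shared bin edges (bin count is an assumption)."""
    pr, pw = split_right_wrong(manual, predicted)
    edges = np.histogram_bin_edges(cl, bins=bins)
    hist_pr, _ = np.histogram(cl[pr], bins=edges)
    hist_pw, _ = np.histogram(cl[pw], bins=edges)
    return hist_pr, hist_pw, edges

def split_by_confidence(cl, c_gate):
    """Boolean masks for PC (CL > Cgate, credible) and PIC (needs post-processing)."""
    pc = cl > c_gate
    return pc, ~pc

def misclassified_above(cl, manual, predicted, c_gate):
    """Number of wrongly classified pixels that the threshold would treat as credible."""
    _, pw = split_right_wrong(manual, predicted)
    return int(np.count_nonzero(pw & (cl > c_gate)))
```

Scanning candidate thresholds with `misclassified_above` and comparing the result against the size of the PIC set makes the trade-off in the text concrete: a higher Cgate leaves fewer errors untouched but sends more pixels to post-processing.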

Description of the Modeling Scheme
According to the obtained prior knowledge, in the classification results generated by the CNN, the results for the pixels located inside the object are credible, but the credibility of the pixels located at the edge of the object is low. Furthermore, only low-credibility classification results require post-processing.
Based on previous studies [51][52][53][58,59], approximately 80% of the pixel-by-pixel classification results generated by CNN models are credible. Therefore, only approximately 20% of the pixel classification results require post-processing. This strategy can significantly reduce the number of calculations, thereby improving the efficiency and performance of the model. This is what we mean by the term "partly connected." Based on the abovementioned analysis, we consider the following case: on a given image, when the category labels of certain pixels have already been determined by the CNN, how should the category labels of the remaining pixels be determined (Figure 6)?
We can observe that the main difference between the PCCRF and FCCRF is that the former can take full advantage of the fact that certain pixels have already been assigned category labels.
In the PCCRF, we used the category probability vectors generated by the CNN to build the unary potential energy and, as in the FCCRF, used the relationship between pixel pairs to build the pairwise potential energy. Considering that there are numerous mixed pixels in a remote sensing image, we must select appropriate features to form a feature vector for each pixel (Section 3.3.2), and then use these vectors to define the pairwise potential energy (Section 3.3.3). Based on this, we can provide the definition of a PCCRF (Section 3.3.4).
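To give a feel for the saving, the following sketch counts pixel pairs under one possible reading of "partly connected": a pair is kept only if at least one of its two pixels was selected for post-processing. That reading, and the function name `pair_counts`, are assumptions made for illustration; the block size and the roughly 20% share of selected pixels are taken from the text.

```python
from math import comb

def pair_counts(n_pixels, n_selected):
    """Return (all unordered pixel pairs, pairs involving at least one selected pixel)."""
    total = comb(n_pixels, 2)
    among_unselected = comb(n_pixels - n_selected, 2)  # pairs that can be skipped entirely
    return total, total - among_unselected

# One 1000 x 1000 image block with about 20% of its pixels selected.
total, partial = pair_counts(1000 * 1000, 200_000)
ratio = partial / total
```

Under this assumption the fraction of pairs retained is approximately 1 − 0.8² = 0.36, so the pairwise terms shrink substantially even before any further approximation is applied.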

Features Selection
Based on prior knowledge, the inner and edge pixels of the winter wheat planting areas are extremely similar in terms of color and texture. Considering that the near-infrared band (NIR) can better distinguish between crops and non-crops, we selected the red, blue, green, and NIR bands, along with the NDVI, contrast (CON), uniformity (UNI), inverse difference (INV), and entropy (ENT), to construct the feature vectors for the pixels. The NDVI was calculated following the methods reported in Ma et al. [17]: (9) Figure 6. A description of the modeling progress for a partly connected conditional random field.
We can observe that the main difference between the PCCRF and FCCRF is that the former can take full advantage of the fact that certain pixels have already been assigned certain category labels.
In the PCCRF, as in the FCCRF, we used the category probability vectors generated by the CNN to build the unary potential energy and the relationship between pixel pairs to build the pairwise potential energy. Considering that there are numerous mixed pixels in remote sensing images, we must select appropriate features to form a feature vector for each pixel (Section 3.3.2) and then use these vectors to define the pairwise potential energy (Section 3.3.3). Based on this, we can provide the definition of the PCCRF (Section 3.3.4).

Features Selection
Based on prior knowledge, the inner and edge pixels of the winter wheat planting areas are extremely similar in terms of color and texture. Considering that the near-infrared (NIR) band can better distinguish between crops and non-crops, we selected the red, blue, green, and NIR bands, along with the NDVI, contrast (CON), uniformity (UNI), inverse difference (INV), and entropy (ENT), to construct the feature vectors for the pixels. The NDVI was calculated following the methods reported in Ma et al. [17]:
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} \tag{9}$$
Here, CON, UNI, INV, and ENT were extracted from the GLCM using the methods proposed by Yang and Yang [27], where $q$ is the gray level and $g(i,j)$ is an element of the GLCM.
The feature vector f of each pixel comprises nine elements, structured as follows: f = (red, green, blue, NIR, NDVI, UNI, CON, ENT, INV).
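The construction of the feature vector can be sketched in code. This is a minimal sketch rather than the authors' implementation: it assumes the usual NDVI ratio and the common Haralick-style GLCM statistics for CON, UNI, INV, and ENT (the exact formulas of Yang and Yang [27] may differ, for example in how INV weights the difference between i and j). The function names, the single horizontal pixel offset, and the symmetric normalization of the GLCM are illustrative assumptions.

```python
import numpy as np

def ndvi(nir, red, eps=1e-12):
    """NDVI = (NIR - Red) / (NIR + Red); eps avoids division by zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

def glcm(patch, q, dx=1, dy=0):
    """Normalized, symmetric GLCM of an integer patch with gray levels 0..q-1.

    The offset (dy, dx) is an assumption and must be non-negative here;
    the patch must be large enough to contain at least one pixel pair.
    """
    patch = np.asarray(patch, dtype=int)
    h, w = patch.shape
    a = patch[0:h - dy, 0:w - dx]   # reference pixels
    b = patch[dy:h, dx:w]           # neighbours at the chosen offset
    g = np.zeros((q, q), dtype=float)
    np.add.at(g, (a.ravel(), b.ravel()), 1.0)  # count co-occurrences
    g = g + g.T                     # count each pair in both directions
    return g / g.sum()

def glcm_features(g):
    """Return (UNI, CON, ENT, INV), the order used in the feature vector f."""
    q = g.shape[0]
    i, j = np.indices((q, q))
    con = np.sum((i - j) ** 2 * g)
    uni = np.sum(g ** 2)
    inv = np.sum(g / (1.0 + (i - j) ** 2))  # assumed form of the inverse difference
    nz = g[g > 0]
    ent = -np.sum(nz * np.log(nz))          # 0 * log(0) is treated as 0
    return uni, con, ent, inv

def feature_vector(red, green, blue, nir, patch, q):
    """f = (red, green, blue, NIR, NDVI, UNI, CON, ENT, INV) for one pixel."""
    uni, con, ent, inv = glcm_features(glcm(patch, q))
    return np.array([red, green, blue, nir, ndvi(nir, red), uni, con, ent, inv])
```

Here `patch` stands for a small quantized window around the pixel; the window size and the number of gray levels are not specified in this section and would need to be chosen separately.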

Definition of the Pairwise Potential Energy
Based on the Gaussian kernel function, we define the potential energy of a pixel pair, $\tau(x_i, x_j)$, where $i$ and $j$ each represent a single pixel of image $I$; $x_i$ and $x_j$ are the category labels predicted by the CNN for pixels $i$ and $j$, respectively, and are elements of the category label set $L = \{l_1, l_2, \ldots, l_n\}$; $f_i$ and $f_j$ are the feature vectors of the pixels, as discussed in Section 3.3.2; $|I_i - I_j|$ is the Manhattan distance between $i$ and $j$; $|f_i - f_j|$ is the Euclidean distance between the feature vectors of $i$ and $j$; and $\mu(x_i, x_j)$ is the label comparison function, which is set to 0 when $x_i$ and $x_j$ are identical and to 1 otherwise. Here, $\omega^{(1)}$, $\omega^{(2)}$, $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$ are determined through training the PCCRF.
Based on the definition of $\tau(x_i, x_j)$, we can define the sum of the pairwise potential energy of $x_i$, $\tau(x_i)$, as:
$$\tau(x_i) = \sum_{j \in I,\; j \neq i} \tau(x_i, x_j)$$
The total pairwise potential energy associated with $i$ is defined as follows:
$$\tau(i) = \sum_{x_i \in L} \tau(x_i)$$
Because the unary potential energy is an element of the category probability vector, its value range is [0, 1]; therefore, we used $\tau(i)$ to normalize $\tau(x_i)$:
$$n\tau(x_i) = \frac{\tau(x_i)}{\tau(i)}$$
We used $n\tau(x_i)$ to build the PCCRF.
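The pairwise potential and its normalization can be illustrated with a short sketch. The kernel form below is an assumption, not the authors' stated equation: it follows the two-term Gaussian kernel commonly used in fully connected CRFs, with the Manhattan distance between pixel positions and the Euclidean distance between feature vectors substituted as described above. The function names, argument layout, and the `partners` list (the pixels j paired with i) are hypothetical.

```python
import numpy as np

def pair_kernel(pos_i, pos_j, f_i, f_j, w1, w2, theta_a, theta_b, theta_g):
    """Assumed two-term Gaussian kernel between pixels i and j."""
    # Manhattan distance between pixel positions
    d_pos = np.abs(np.asarray(pos_i, float) - np.asarray(pos_j, float)).sum()
    # Euclidean distance between feature vectors
    d_feat = np.linalg.norm(np.asarray(f_i, float) - np.asarray(f_j, float))
    appearance = w1 * np.exp(-d_pos**2 / (2 * theta_a**2)
                             - d_feat**2 / (2 * theta_b**2))
    smoothness = w2 * np.exp(-d_pos**2 / (2 * theta_g**2))
    return appearance + smoothness

def normalized_pairwise(i, partners, labels, n_labels, pos, feats, params):
    """Return n_tau for pixel i: one value per candidate label of i."""
    tau = np.zeros(n_labels)
    for l in range(n_labels):
        for j in partners:
            mu = 0.0 if labels[j] == l else 1.0  # label comparison function
            tau[l] += mu * pair_kernel(pos[i], pos[j], feats[i], feats[j], **params)
    total = tau.sum()                            # tau(i): sum over all labels
    return tau / total if total > 0 else tau
```

Under these assumptions, a candidate label that agrees with nearby, similar pixels receives a low normalized pairwise energy, and the values over all candidate labels sum to 1, which matches the [0, 1] range of the unary term.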

Definition of PCCRF
As discussed in Section 3.2, I is the set of pixels, and PC and PIC are subsets of I. As the classification results of the pixels in PC were credible, we only needed to optimize the classification results of the pixels in PIC. Accordingly, we built the PCCRF using only those pixel pairs in which at least one pixel belonged to PIC.
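The pair selection described above can be sketched as follows. The rule that assigns pixels to PC and PIC is defined in Section 3.2; the threshold test used here (a pixel joins PIC when its largest category probability falls below Cgate) is an assumption made for illustration, and the function names are hypothetical. Enumerating all pairs explicitly is only practical for tiny examples.

```python
import numpy as np
from itertools import combinations

def split_pixels(prob, cgate):
    """Split flattened pixel indices into PC (credible) and PIC (to be refined).

    prob: H x W x n array of category probabilities from the CNN.
    """
    confidence = prob.reshape(-1, prob.shape[-1]).max(axis=1)
    pic = np.flatnonzero(confidence < cgate)    # assumed selection rule
    pc = np.flatnonzero(confidence >= cgate)
    return pc, pic

def pccrf_pairs(n_pixels, pic):
    """Unordered pixel pairs in which at least one pixel belongs to PIC."""
    pic_set = set(int(k) for k in pic)
    return [(a, b) for a, b in combinations(range(n_pixels), 2)
            if a in pic_set or b in pic_set]
```

For a 2 x 2 image with a single low-confidence pixel, this keeps three of the six possible pairs; pairs whose pixels both lie in PC are never evaluated.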
Let $i$ be a pixel in PIC and $j$ be a pixel in $I$. Then, $x = \{x_1, x_2, \ldots, x_m\}$ represents a label assignment of PIC, and $\theta$ represents the model parameter set consisting of $\omega^{(1)}$, $\omega^{(2)}$, $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$. We define the Gibbs energy of $x$ as follows:
$$E(x) = \sum_{i \in \mathrm{PIC}} \left[ \partial\, \phi(x_i) + (1 - \partial)\, n\tau(x_i) \right]$$
where $\phi(x_i)$ represents the unary potential energy of $x_i$ and is an element of the category probability vector of pixel $i$ generated by the CNN, $\partial$ is the weight value for the unary potential energy, and $(1 - \partial)$ is the weight value for the pairwise potential energy. Here, $\partial$ is determined while training the PCCRF. Based on the above analysis, we define the PCCRF as follows:
$$P(x \mid I) = \frac{\exp\left(-E(x)\right)}{\sum_{y \in X} \exp\left(-E(y)\right)}$$
where $X$ represents the set of all possible label assignments of the PIC and $y$ represents one such label assignment. By minimizing the above CRF energy, $E(x)$, we can assign an optimal set of labels to the PIC.
In the PCCRF, $\phi(x_i)$ provides the information from a large receptive field to predict the category label for a pixel, while $n\tau(x_i)$ provides additional information from a small receptive field to optimize the category label.
The PCCRF takes full advantage of prior information. When the predicted category of the pixel using the CNN is credible, the category label can be determined using only the information from a large receptive field. Otherwise, it uses additional information to optimize the category label.

PCCRF Training
We defined the objective function of the PCCRF based on the cross-entropy of the samples as follows:
$$H(p, q) = -\sum_{i=1}^{t} q_i \log p_i$$
where $p$ is the predicted category probability distribution (CPD) output by the PCCRF, $q$ is the actual CPD, $t$ is the number of category labels, and $i$ is the index of an element in the CPD. Based on this, the loss function of the PCCRF model was defined as follows:
$$\mathrm{loss} = \frac{1}{Total} \sum_{k=1}^{Total} H\left(p^{(k)}, q^{(k)}\right)$$
where $Total$ is the number of samples used in the training stage and $k$ indexes the samples. We then used stochastic gradient descent to train the model via the following steps:
1. Pretrained the RefineNet;
2. Constructed the PCCRF training dataset using the training prediction results generated by the trained RefineNet;
3. Performed statistical analysis on the training dataset and determined the value of Cgate;
4. Initialized the parameters of the PCCRF model; and
5. Calculated the parameters of the PCCRF using the method proposed in Zheng et al. [55].
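The cross-entropy objective used in this training procedure can be written compactly in code. This is a minimal sketch, assuming that the loss is the mean of the per-sample cross-entropy values; the function names and the small constant `eps`, which guards against taking the logarithm of zero, are illustrative and not part of the original method.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between the predicted CPD p and the actual CPD q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(q * np.log(p + eps))

def pccrf_loss(predicted, actual):
    """Mean cross-entropy over all training samples (assumed averaging)."""
    return float(np.mean([cross_entropy(p, q) for p, q in zip(predicted, actual)]))
```

With a one-hot actual CPD, the per-sample value reduces to the negative logarithm of the probability assigned to the true category, so confident correct predictions contribute almost nothing to the loss.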

Experimental Setup
We conducted comparison experiments based on the RefineNet (which combines low-level and high-level features) and SegNet (which only uses high-level semantic features), using three configurations for each: the original model, classic CRF post-processing, and PCCRF post-processing (Table 1). We applied data augmentation techniques to the training dataset, including horizontal flipping, vertical flipping, and color adjustment; the color adjustment factors included brightness, hue, saturation, and contrast. Each image in the training dataset was processed 10 times. The images created using the data augmentation techniques were used only for training the CNNs.
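The augmentation step can be illustrated with a short sketch. It is not the authors' pipeline: it shows only flips, brightness, and contrast (hue and saturation adjustment are omitted), the adjustment ranges and flip probabilities are assumed values, and the function names are hypothetical. Because the task is pixel-wise classification, the sketch applies each flip to the label map as well as to the image, while color changes affect the image only.

```python
import numpy as np

def augment(image, label, rng):
    """Return one augmented (image, label) pair.

    image: H x W x C float array scaled to [0, 1]; label: H x W integer array.
    """
    if rng.random() < 0.5:                  # horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                  # vertical flip
        image, label = image[::-1], label[::-1]
    image = image * rng.uniform(0.8, 1.2)   # brightness (assumed range)
    mean = image.mean(axis=(0, 1), keepdims=True)
    image = (image - mean) * rng.uniform(0.8, 1.2) + mean  # contrast (assumed range)
    return np.clip(image, 0.0, 1.0), label.copy()

def augment_dataset(images, labels, copies=10, seed=0):
    """Process every training image `copies` times, as described in the text."""
    rng = np.random.default_rng(seed)
    return [augment(img, lab, rng)
            for img, lab in zip(images, labels)
            for _ in range(copies)]
```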
We used cross-validation techniques in the comparative experiments. Each CNN model was trained over five rounds. In each round, 200 images were selected as test images and the remaining images were used as training images, which guaranteed that each image was used at least once as a test image. Table 2 lists the hyper-parameter setup used to train the proposed RefineNet-PCCRF. In the comparison experiments, the same hyper-parameters were applied to the comparison models.
Although there were certain misclassified pixels in the inner regions of the winter wheat planting area in the SegNet results, the overall classification accuracy of each comparison method in the inner regions was satisfactory. The differences among the results of the six comparison models were observable at the edges. In the SegNet results, the edges of the winter wheat fields were rough, whereas the RefineNet results were superior to those of the SegNet, thereby demonstrating the importance of using fused features over high-level features alone. Both the CRF and PCCRF post-processing methods produced results superior to those of the corresponding original models, thus demonstrating the importance of post-processing procedures. The SegNet-PCCRF was superior to the SegNet-CRF, and the RefineNet-PCCRF was superior to the RefineNet-CRF; this demonstrated that the PCCRF was more suitable as a post-processing method.
Comparing the SegNet-PCCRF and RefineNet-CRF, the performance of the RefineNet-CRF was superior, thereby confirming that the initial segmentation method was also an extremely significant factor in determining the final result.

Results and Evaluation
We used four popular criteria, namely accuracy, precision, recall, and F1-score [80], to evaluate the performance of the proposed model. They were calculated using the confusion matrix.
Accuracy is the ratio of the number of correctly classified samples to the total number of samples, calculated as:
$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n} c_{ii}}{\sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}}$$
where $c_{ii}$ denotes the number of correctly classified samples of class $i$, $c_{ij}$ is the number of samples of class $i$ misidentified as class $j$, and $n$ is the number of classes. Precision denotes the average proportion of pixels correctly classified into one class from the total retrieved pixels, calculated as:
$$\mathrm{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{c_{ii}}{\sum_{j=1}^{n} c_{ji}}$$
Recall represents the average proportion of pixels that are correctly classified in relation to the actual total pixels of a given class, calculated as:
$$\mathrm{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{c_{ii}}{\sum_{j=1}^{n} c_{ij}}$$
F1-score represents the harmonic mean of precision and recall, calculated as:
$$\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The RefineNet-PCCRF scored highest among all models on all metrics (Table 3). The confusion matrices for all categories (Figure 8) and for winter wheat versus all other categories (Figure 9) of each model demonstrate that the RefineNet-PCCRF achieved the best segmentation results.
In the confusion matrices of the six models, there was nearly no confusion between the winter wheat and urban areas, which could be attributed to the difference in the characteristics of the two land-use types. However, the confusion between winter wheat and farmland was serious. This was because most winter wheat regions that were misclassified as farmland had poor growing conditions; their characteristics were similar to those of farmland in winter, which led to a greater probability of misclassification. There was also a certain degree of confusion between the winter wheat and woodland areas because certain trees were still green in winter, similar to the characteristics of the winter wheat regions. However, in this case, owing to the use of both texture and high-level semantic information, the degree of confusion was significantly lower than that with farmland. This also explained the advantage of post-processing from another aspect: it introduced new information, which could effectively improve the accuracy of the classification results.
Table 4 lists the average time required for each method to complete the testing of a single image. The proposed RefineNet-PCCRF method required approximately 3% more time but improved the accuracy by 5%-8%. The time consumed by the CRF was higher than that consumed by the proposed PCCRF method because the CRF had to calculate the distances between all pixel pairs of a single image, while the PCCRF calculated the distances for only a small number of pixel pairs. The number of pixel pairs calculated in the SegNet-PCCRF was only approximately 30% of that in the SegNet-CRF, and the number calculated in the RefineNet-PCCRF was only approximately 20% of that in the RefineNet-CRF.
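The four evaluation criteria used in this section can be computed directly from a confusion matrix. The sketch below is illustrative: it assumes that entry c[i][j] counts samples of actual class i assigned to class j (so rows are actual classes and columns are predicted classes) and that precision and recall are averaged over classes, as the definitions state. The function name is hypothetical.

```python
import numpy as np

def evaluate(confusion):
    """Return (accuracy, precision, recall, f1) from an n x n confusion matrix.

    Assumes every class occurs at least once in both the reference labels and
    the predictions; otherwise a row or column sum would be zero.
    """
    c = np.asarray(confusion, dtype=float)
    correct = np.diag(c)
    accuracy = correct.sum() / c.sum()
    precision = np.mean(correct / c.sum(axis=0))  # column sums: pixels retrieved per class
    recall = np.mean(correct / c.sum(axis=1))     # row sums: actual pixels per class
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, `evaluate([[8, 2], [1, 9]])` gives an accuracy of 0.85 and a recall of 0.85, with a precision equal to the mean of 8/9 and 9/11.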

PCCRF Necessity
CNN models typically use multiple convolutional layers to obtain high-level semantic features and then assign these features to each pixel in the receptive field through a deconvolution operation. When this operation is performed at the edges of an object, the receptive field may contain two or more types of pixels, which causes the feature values of edge pixels to differ from those of inner pixels and results in a higher classification error at object edges (Figure 7).
The structural characteristics of the convolutional neural network mean that some pixels at the edges will inevitably be misclassified. This problem can only be mitigated by post-processing or by improving the structure of the convolutional neural network.
At present, numerous post-processing methods have been proposed, but most of them fail to make full use of the results provided by convolutional neural networks. The PCCRF proposed in this study combines the advantages of the CRF with the prior knowledge provided by the CNN, making it a more effective post-processing method.

Comparison between PCCRF and FCCRF
PCCRF has three clear advantages over FCCRF. First, it has a clearer model structure. In PCCRF, a category probability vector is used to express the calculation result, and each component represents the probability that the pixel to be processed belongs to a certain category. The class probability vector of a pixel is calculated at two levels: (1) a pixel-level class probability vector, which represents the class probability distribution calculated from the characteristics of the pixel itself, and (2) a class-level class probability vector, which represents the class probability distribution calculated from the classes of the pixels around the pixel to be classified. The scale factor fuses these two types of information, and because both are probability distributions, the two quantities being fused have the same meaning. In contrast, in FCCRF, each component of the first-level vector is a class feature value calculated from the characteristics of the pixel itself, whereas each component of the second-level vector is a class feature value calculated from the category information of the pixel to be processed and the surrounding pixels. These two feature values, which have different properties, are added together to produce the class feature value of the pixel, so the meaning of the resulting feature values is not sufficiently clear.
Second, FCCRF does not introduce any prior knowledge, and all pixel pairs need to be calculated, which leads to excessive computation; hence, its model parameters must be solved approximately. In contrast, PCCRF introduces prior knowledge and only processes pixels with low classification reliability, which effectively reduces the number of calculations and allows the model to be solved directly through methods such as the stochastic gradient descent algorithm.
Third, PCCRF uses color, texture, and low-level semantics to form feature vectors, which is more in line with the characteristics of remote sensing data. FCCRF obtains good results using only color features because the camera image resolution is usually very high and the detailed information is very rich. The color of the pixels often differs greatly where two objects are adjacent. However, in remote sensing imagery, a large number of mixed pixels means that the differences in the pixel color of two objects are often much smaller, and hence, the additional information used by PCCRF improves its classification performance.

Cgate Effect
Given the overall importance of the Cgate parameter in the RefineNet-PCCRF, we held the other parameters steady and calculated the relationships between Cgate and accuracy (Figure 10) and between Cgate and consumed time (Figure 11). Higher Cgate values improved the accuracy because pixels were filtered with a higher level of confidence, and post-processing reclassified the initially misclassified pixels, thus improving the accuracy of the overall result. Therefore, when selecting the Cgate value, we must consider the classification ability of the initial segmentation model. In addition, selecting a model with a stronger classification ability for preliminary segmentation can significantly improve the performance of the results obtained from the PCCRF model. Higher Cgate values also increased the consumed time; this indicated that a further reduction in the number of pixels involved in modeling, i.e., using more prior knowledge, is the key to further improving the calculation efficiency of both the PCCRF and classic CRF models.

Comparison between PP-CNN and RefineNet-PCCRF
To obtain high-quality spatial distribution information of winter wheat, we previously used an improved Euclidean distance to establish PP-CNN as a post-processing method [81]. The improved Euclidean distance between the feature vector of a pixel being classified and that of a determined winter wheat pixel is used to decide whether the pixel being classified is a winter wheat pixel. Unlike the PP-CNN, the proposed PCCRF was established on the basis of the CRF. Owing to the ability of the CRF to use global distribution characteristics, the PCCRF can more accurately determine the category labels at the edge of the winter wheat planting area.
In general, the PP-CNN can be used in cases where the feature differences are stable between the mixed pixels on the edge of the winter wheat planting area and the inner pixels of the same area. When the difference is unbalanced, the distance threshold bias is large, which increases the probability of pixel classification errors during post-processing. The PCCRF fully considers the spatial correlation between pixel categories, hence yielding a strong global balance ability. Therefore, this method can better handle situations where the edge pixels are significantly different from the inner pixels, thereby effectively reducing the impact of large differences in crop growth.

Conclusions
CNNs can significantly improve the overall accuracy of remote sensing image segmentation results. However, in these results, certain pixels are misclassified where different land-use types adjoin. This study exploited the ability of the CRF model to describe the spatial correlation between pixel categories, introduced a variety of prior knowledge, and proposed a PCCRF model. The proposed PCCRF model can be used to post-process the results of the CNN to better solve the problem of rough edges in the results extracted using only the CNN.
The main contributions of this study are as follows: (1) Pre-processing (such as statistical analysis of the CNN segmentation results) allows prior knowledge to be used in post-processing and modeling, such that only those pixels with a lower confidence are processed, thus significantly reducing calculation time. As the RefineNet has high segmentation accuracy, this post-processing only requires the use of 20% of all the pixels. (2) According to the characteristics of the winter wheat planting area in remote sensing images, the PCCRF uses original channel values, texture features, and low-level semantic features to compose the feature vector and construct the pairwise potential energy. This feature vector better matches the characteristics of remote sensing imagery. At the same time, after normalizing the pairwise potential energy, its data range is identical to that of the unary potential energy. This aspect is more reasonable than that of the FCCRF. (3) The PCCRF uses a linear model to fuse the unary energy and pairwise energy such that the parameters of the linear model are determined while training the PCCRF. This strategy is more reasonable than the fixed weight value strategy adopted by the FCCRF. Owing to the ability of the CRF to describe the global spatial correlation between pixel categories, the RefineNet-PCCRF can efficiently improve the classification accuracy of edge pixels in a winter wheat planting area.
As the prior knowledge required by the PCCRF can only be obtained via statistical analysis of the CNN segmentation results, the PCCRF and CNN must be used separately to generate improved extraction results, which is the major limitation of our method. In future studies, we intend to use hyperparameters and other means to express prior knowledge, convert the PCCRF into convolution operations, and construct a complete end-to-end training model.