A New CNN-Bayesian Model for Extracting Improved Winter Wheat Spatial Distribution from GF-2 imagery

When the spatial distribution of winter wheat is extracted from high-resolution remote sensing imagery using convolutional neural networks (CNN), field edge results are usually rough, resulting in lowered overall accuracy. This study proposed a new per-pixel classification model using CNN and Bayesian models (CNN-Bayesian model) for improved extraction accuracy. In this model, a feature extractor generates a feature vector for each pixel, an encoder transforms the feature vector of each pixel into a category-code vector, and a two-level classifier uses the difference between elements of category-probability vectors as the confidence value to perform per-pixel classifications. The first level is used to determine the category of a pixel with high confidence, and the second level is an improved Bayesian model used to determine the category of low-confidence pixels. The CNN-Bayesian model was trained and tested on Gaofen 2 satellite images. Compared to existing models, our approach produced an improvement in overall accuracy: the overall accuracy of SegNet, DeepLab, VGG-Ex, and CNN-Bayesian was 0.791, 0.852, 0.892, and 0.946, respectively. Thus, this approach can produce superior results when winter wheat spatial distribution is extracted from satellite imagery.


Introduction
Wheat is the most important food crop in the world, comprising 38.76% of the total area cultivated for food crops and 29.38% of total food crop production in 2016 [1]. In China, these numbers are 21.38% and 21.00%, respectively [2]. Accurate estimations of crop spatial distribution and total cultivated area are of great significance for agricultural disciplines such as yield estimation, food policy development, and planting management, which are of great importance for ensuring national food security [3,4].
Traditionally, obtaining crop area required large-scale field surveys. Although this approach produces high-accuracy results, it is time-consuming, labor-intensive, and often lacking in spatial information [5]. The use of remotely sensed data is an effective alternative that has been widely used over the past few decades at regional or global scales [6][7][8]. As extraction of crop spatial distribution mainly relies on pixel-based image classification, correctly determining pixel features for accurate classification is the basis for this approach [9][10][11][12].
The spatial resolution and precision of crop extraction can be significantly improved by using high-resolution imagery [8,24,25]. However, as the spectral characteristics of such imagery are not as stable as those of low- and middle-resolution imagery, traditional feature extraction methods struggle to extract effective pixel features [26,27]. Neural networks [28,29] and support vector machines [30,31] have been applied to this problem, but both are shallow-learning algorithms [32] that have difficulty effectively expressing complex features, producing unsatisfactory results.
Convolutional neural networks (CNN) were developed from neural networks. The standard CNN follows an "image-label" approach, and its output is a probability distribution over different classes. Typical examples include AlexNet [33], GoogLeNet [34], Visual Geometry Group Network (VGG) [35], and ResNet [36]. Due to their strong feature extraction ability, these networks have achieved remarkable results in camera image classification [37,38]. The fully convolutional network (FCN), a "per-pixel-label" model based on standard CNNs, was proposed in 2015 [39]. This network uses a multi-layer convolutional structure to extract pixel features, applies appropriate deconvolutional layers to up-sample the feature map of the last convolution layer to restore it to the same size as the input image, and classifies the up-sampled feature map pixel by pixel. Subsequently, a series of convolution-based per-pixel-label models have been developed, including SegNet [40], UNet [41], DeepLab [42], and ReSeg [43]. Of these, SegNet and UNet have the clearest and easiest-to-understand convolution structures. DeepLab uses a method called "Atrous Convolution", which has a strong advantage in processing detailed images. ReSeg exploits local generic features extracted by CNNs and the capacity of recurrent neural networks to retrieve distant dependencies. Each model has its own strengths and is adept at dealing with certain image types. As conditional random fields (CRF) can learn the dependencies between the categories of pixels, they can be used to further refine segmentation results [44].
These convolution-based per-pixel-label models have been applied in remote sensing image segmentation with remarkable results. For example, researchers have used CNN to carry out remote sensing image segmentation and used conditional random fields to further refine the output class map [45][46][47][48]. To suit the characteristics of specific remote sensing imagery, other researchers have established new convolution-based per-pixel-label models, such as multi-scale fully convolutional networks [49], patch-based CNNs [50], and two-branch CNNs [51]. Effective work has also been carried out in extracting information from remote sensing imagery using convolution-based per-pixel-label models, e.g., extracting crop information for rice [52,53], wheat [54], leaf [55], and rape [56], as well as target detection for weeds [57][58][59] and diseases [60][61][62], and extracting road information using an improved FCN [63]. Some new feature extraction techniques are being applied to crop information extraction, including 3D-CNN [64], deep recurrent neural networks [65], and CNN-LSTM [66], and recurrent neural networks (RNN) have also been used to correct satellite image classification maps [67]. Some new techniques have been proposed to improve segmentation accuracy, including structured autoencoders [68] and locality adaptive discriminant analysis [69]. Moreover, research on how to automatically determine a feature dimension that adapts to different data distributions will help to obtain good performance in machine learning and computer vision [70].
How to determine the optimal values of the parameters is an important problem in the use of convolutional neural networks. Stochastic gradient descent with momentum [45] is a common and effective training method. Data augmentation [33,35,41] and dropout [33] are used to prevent overfitting, so as to ensure that the model can obtain the optimal parameters. Practice has shown that reasonable use of a batch normalization (BN) layer also helps model training obtain the optimal parameters [42,43].
At present, the CNN structure used in the per-pixel classification of remote sensing imagery generally includes two parts: a feature extractor and a classifier. The former has been the focus of many researchers, with good results. Existing feature extractors commonly regard the convolution value obtained by applying a convolution kernel to a pixel block as a feature of the central pixel of that block. However, with regard to classifying pixels with the acquired features, most studies have only used classifiers with relatively ordinary functions. These classifiers use a set of linear regression functions to encode the features of pixels and obtain category-code vectors. The SoftMax function is then used to convert the category-code vector into a category-probability vector, and the category corresponding to the maximum probability value is taken as the pixel category.
Previous experimental results [44][45][46][47][48][49][50][51][52][53][54][55][56] have shown that misclassified pixels are primarily located at the intersections of two land use types, such as field edges or corners. This is because, when the features of pixels in these areas are acquired, the pixel blocks used usually contain more pixels of other categories, so the resulting features often differ from the features of inner pixels of the planting area, which frequently causes classification errors. By analyzing the probability vectors of these misclassified pixels, it can be found that the difference between the maximum probability value and the second-maximum probability value is generally small. These errors are due to the inherent structure of the convolutional layer, and reducing them requires improving the classifier with which the convolutional layer is combined.
The Bayesian model can synthesize information from different sources and improve the reliability of inferred conclusions [71,72]. Therefore, when judging the category of a pixel whose difference between the maximum probability value and the second-maximum probability value is small, the spatial structure information of the pixels can be further introduced to improve the reliability of the judgment by using the Bayesian model. In this study, we developed a new CNN consisting of a feature extractor, an encoder, and a Bayesian classifier, which we refer to as a Bayesian Convolutional Neural Network (CNN-Bayesian model). We then used this model to extract winter wheat spatial distribution information from Gaofen 2 (GF-2) remote sensing imagery and compared the results with those achieved by other methods.

Study Area
Shandong Province is a major wheat-producing area in China. The total planted area was 38,303 km² in 2016 and 38,429 km² in 2017 (Figure 1) [73]. Zhangqiu County is located in north-central Shandong Province (36°25′-37°09′ N, 117°10′-117°35′ E). From south to north, the county's terrain progresses through mountainous, hilly, plain, and lowland regions, accounting for 30.8%, 25.9%, 30.7%, and 12.6% of the total area, respectively. The main food crops are wheat and corn [74]. Moreover, the county's geographical and agricultural conditions are representative of broader regions within China, making it an appropriate study area.

Remote Sensing Imagery
We used 32 GF-2 images to cover the entire Zhangqiu County; 17 were captured on 14 February 2017, and 15 on 21 January and 1 March 2018 (Figure 2a). Each GF-2 image is divided into a multispectral and a panchromatic image. The former is composed of four spectral bands (blue, green, red, and near-infrared), and the spatial resolution of each multispectral image is 4 m, whereas that of the panchromatic image is 1 m.
The preprocessing of the GF-2 images involved four stages: geometric correction, radiometric calibration, atmospheric correction, and image fusion. Using Python and the Geospatial Data Abstraction Library, we designed a geometric correction program and completed the correction using the control points obtained from the ground investigation. Radiometric calibration converted the images' digital values to absolute at-sensor radiance values using Environment for Visualizing Images (ENVI) software (developed by Harris Geospatial Solutions, Broomfield, Colorado, United States of America). The calibration parameters were obtained by calibration experiments in Chinese fields as published in CRESDA [9]. Atmospheric correction converted the radiance to reflectance using the Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) model in ENVI with the Interactive Data Language. The related FLAASH parameters were obtained according to the acquisition time and imaging conditions. Subsequently, the ENVI pan-sharpening method was used to fuse the multispectral and panchromatic images. After preprocessing, each fused image had four bands (blue, green, red, and near-infrared) with a spatial resolution of 1 m and a size of 7300 × 6900 pixels.

Ground Investigation Data
The main land cover in Zhangqiu County during winter includes winter wheat, agricultural buildings, woodland, developed land, roads, water bodies, farmland, and bare fields. In fused GF-2 images, bare fields, agricultural buildings, developed land, water bodies, farmland, and roads are all visually distinct from each other and from vegetated areas during winter. Accurately distinguishing whether a vegetated area is winter wheat or woodland by visual interpretation requires sample information for both. We therefore conducted ground investigations in 2017 and 2018, obtaining 367 sample points (251 winter wheat, 116 woodland); time, location, and land use were recorded for all points (Figure 2b).

Image-Label Datasets
We selected 305 non-overlapping region images from the GF-2 images described in Section 2.2 to establish the image-label dataset for training and testing; each image contained 1024 × 1024 pixels. The dataset covered all land use types of the study area, including winter wheat, agricultural buildings, woodland, developed land, roads, water bodies, farmland, and bare fields. We created a label file for each image, which records the category number of each pixel in the image. In combination with the ground investigation data described in Section 2.2.2, we used visual interpretation and ENVI software to establish the label files. Figure 3 illustrates a training image and the corresponding label file.
In the label files, winter wheat, agricultural buildings, woodland, developed land, roads, water bodies, bare fields, and others were marked 1-8, respectively. In the test stage, codes 2-8 are replaced by 9, indicating that the corresponding pixel is a non-winter wheat pixel.
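The test-stage relabeling can be illustrated with a minimal NumPy sketch. The function name `to_binary_labels` and the toy label array are hypothetical and only serve to show the mapping of codes 2-8 to code 9.

```python
import numpy as np

WINTER_WHEAT = 1      # label code for winter wheat, as stated in the text
NON_WINTER_WHEAT = 9  # merged code used in the test stage

def to_binary_labels(label_map: np.ndarray) -> np.ndarray:
    """Replace codes 2-8 with 9 and leave code 1 untouched (hypothetical helper)."""
    out = label_map.copy()
    out[(label_map >= 2) & (label_map <= 8)] = NON_WINTER_WHEAT
    return out

# Toy 2 x 3 label file, for illustration only
labels = np.array([[1, 2, 3],
                   [8, 1, 5]])
binary = to_binary_labels(labels)
print(binary)
```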

Model Architecture
The proposed CNN-Bayesian model consists of a feature extractor used to generate feature vectors for each pixel, an encoder used to transform the feature vector of each pixel into a category-code vector, and a classifier used to determine the category of a pixel (Figure 4).

Feature Extractor
The feature extractor's network structure is based on a VGG16 network [30] in that it consists of 13 layers (corresponding to the first 13 layers of a VGG16); each layer includes a convolution, batch normalization, activation, and pooling layer. Like a VGG16, the CNN-Bayesian model uses a rectified linear unit as an activation function. We added 10 convolution kernels (sized 1 × 1 × 3) in the first layer to extract the color features of pixels.
The input of the feature extractor is the fused GF-2 remote sensing images, each with four bands. The output is a 3D matrix with a size of m × n × l, where m and n are the number of rows and columns, respectively, and l is the length of the feature vector of each pixel. Each feature vector corresponding to one pixel consists of three parts. The first is derived from the result of the convolution kernel, which represents the color feature. The second is derived from the result of the first layer, which represents the low-level texture features. The third is derived from the output of the last layer, which represents the semantic feature.
Compared with the camera image, the pixels of the remote sensing image are continuous. Therefore, we used the extension method to cut out the training and test images and then extend some pixels on the four edges of each image, to ensure that the size of the last layer's feature image was the same as the original image.
We improved the original pooling method of VGG16 using the following equation: where s,t denotes the position of the pixel being calculated, a denotes the pooled result, and b denotes the feature map used in the pooling operation. We used a step size of 1 in the pooling operation. After a feature map whose size is m × n has been pooled, the size of the resulting matrix is (m − 2) × (n − 2). Therefore, after each layer of feature extraction, the image size is reduced by four rows and four columns compared with the original image. Accordingly, when we cut the training and test images, we extended 24 pixels outward on the four edges of each image.
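The size reduction described above can be checked with a small sketch of a 3 × 3 pooling window moved with a step size of 1. The window size follows from the (m − 2) × (n − 2) output size; taking the maximum as the aggregation function is an assumption made purely for illustration, and `pool3x3_stride1` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool3x3_stride1(b: np.ndarray) -> np.ndarray:
    """Pool a 2D feature map b with a 3 x 3 window and a step size of 1.

    The maximum is used as the aggregation function (an assumption).
    """
    windows = sliding_window_view(b, (3, 3))  # shape (m - 2, n - 2, 3, 3)
    return windows.max(axis=(2, 3))

# Toy 5 x 6 feature map, for illustration only
b = np.arange(30, dtype=float).reshape(5, 6)
a = pool3x3_stride1(b)
print(b.shape, "->", a.shape)
```

With a step size of 1 the pooled map loses only one row and one column on each side, which is why padding the input images in advance is enough to keep the final feature map aligned with the original pixels.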

Encoder
The encoder is used to transform the feature vector of a pixel from the feature extractor into a category-code vector, as shown below:
$$r = w x + b \tag{2}$$
where each row of the matrix w indicates a fitting function for a specified class, m denotes the number of classes, n denotes the length of the feature vector of one pixel, vector x denotes the feature vector, vector b denotes the respective biases, and vector r denotes the encoded value. The matrix w (of size m × n) and vector b are trained in the training stage.
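As a minimal sketch of the encoder, the linear mapping from a feature vector to a category-code vector can be written with NumPy. The numerical values of `w`, `b`, and `x` below are made up for illustration (m = 2 classes, n = 3 features) and are not trained parameters.

```python
import numpy as np

def encode(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map a feature vector x (length n) to a category-code vector r (length m)."""
    return w @ x + b

# Hypothetical parameters: one row of w per class
w = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
b = np.array([0.1, -0.1])
x = np.array([2.0, 4.0, 6.0])

r = encode(x, w, b)
print(r)
```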

Classifier
The classifier is divided into two levels, A and B. The A-level classifier transforms the category-code vector r (corresponding to one pixel) into the category-probability vector p as follows:
$$p_i = \frac{e^{r_i}}{\sum_{k=1}^{m} e^{r_k}}, \quad i = 1, \dots, m \tag{3}$$
where m denotes the number of classes. Next, the confidence level (CL) of each p is calculated as follows:
$$CL = p_i - p_j \tag{4}$$
where $p_i$ denotes the max value in p, and $p_j$ denotes the max value in p except $p_i$. The category of a given pixel is determined by:
$$A_{out} = \begin{cases} i, & \text{in the training stage, where } p_i \text{ is the max value in } p \\ 1, & \text{in the classification stage and } CL \ge \delta \\ 11, & \text{in the classification stage and } CL < \delta \end{cases} \tag{5}$$
where $A_{out}$ denotes the category number of one pixel, code 1 indicates winter wheat, code 11 indicates uncertainty, and δ indicates the low threshold value of CL. The δ value is selected manually after the training has been completed and the training results of all samples have been statistically analyzed. The B-level classifier is used to determine the category of a pixel whose CL < δ, denoted by vPixel, by acquiring the maximum posterior probability of vPixel classified as winter wheat:
$$v_{ww} = p(c = ww \mid \varnothing) = \frac{p(\varnothing \mid c = ww)\, p(c = ww)}{p(\varnothing)} \tag{6}$$
where $v_{ww}$ denotes the maximum posterior probability that the category of vPixel is winter wheat when the CL value is ∅, c is the category, ww denotes winter wheat, ∅ denotes the CL value corresponding to the p of vPixel, p(∅|c=ww) represents the probability that the CL value is equal to ∅ in winter wheat pixels, p(c=ww) indicates the probability of winter wheat, and p(∅) indicates the probability that the CL value is equal to ∅ in all pixels. Next, the maximum posterior probability of vPixel classified as non-winter wheat is acquired by:
$$v_{nw} = p(c = nw \mid \varnothing) = \frac{p(\varnothing \mid c = nw)\, p(c = nw)}{p(\varnothing)} \tag{7}$$
where $v_{nw}$ denotes the maximum posterior probability that the category of vPixel is not winter wheat when the CL value is ∅; c, ∅, and p(∅) have the same meaning as in Equation (6); nw denotes non-winter wheat; p(∅|c=nw) indicates the probability that the CL value is equal to ∅ in non-winter wheat pixels; and p(c=nw) indicates the probability of non-winter wheat.
In Equations (6) and (7), p(∅|c=ww), p(c=ww), p(∅|c=nw), p(c=nw), and p(∅) are acquired by statistical methods. When obtaining p(∅|c=ww) and p(∅|c=nw), all samples are included in the statistics, reflecting the global characteristics of the confidence of certain classes. When obtaining p(c=ww), p(c=nw), and p(∅), only samples in the maximum pixel block used to extract the features of vPixel are included, reflecting the local characteristics of pixel spatial associations.
The classifier determines the final pixel category as follows:
$$out = \begin{cases} 1, & (p_1 \text{ is the max value in } p \text{ and } CL \ge \delta) \text{ or } (CL < \delta \text{ and } v_{ww} > v_{nw}) \text{ or } (CL < \delta,\ v_{ww} = v_{nw} \text{ and } p_1 \text{ is the max value in } p) \\ 9, & (p_1 \text{ is not the max value in } p \text{ and } CL \ge \delta) \text{ or } (CL < \delta \text{ and } v_{ww} < v_{nw}) \text{ or } (CL < \delta,\ v_{ww} = v_{nw} \text{ and } p_1 \text{ is not the max value in } p) \end{cases} \tag{8}$$
where out represents the final category number of one pixel, code 1 indicates winter wheat, and code 9 indicates non-winter wheat.
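The two-level decision can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: CL is taken as the difference between the two largest probabilities, index 0 of the probability vector stands for winter wheat, and the likelihoods and local prior passed to `classify` are hypothetical numbers that would in practice come from the training statistics. Because p(∅) appears in both posteriors, only the numerators are compared.

```python
import numpy as np

def softmax(r: np.ndarray) -> np.ndarray:
    """Convert a category-code vector into a category-probability vector."""
    e = np.exp(r - np.max(r))  # subtract the max for numerical stability
    return e / e.sum()

def confidence(p: np.ndarray) -> float:
    """CL: largest probability minus second-largest probability (assumed form)."""
    second, first = np.sort(p)[-2:]
    return float(first - second)

def classify(r, delta, lik_ww, lik_nw, prior_ww):
    """Return 1 (winter wheat) or 9 (non-winter wheat) for one pixel.

    lik_ww, lik_nw: p(CL | ww) and p(CL | nw), global statistics (hypothetical inputs).
    prior_ww: local probability of winter wheat in the pixel block (hypothetical input).
    """
    p = softmax(np.asarray(r, dtype=float))
    wheat_is_max = int(np.argmax(p)) == 0
    if confidence(p) >= delta:          # A-level: high confidence
        return 1 if wheat_is_max else 9
    score_ww = lik_ww * prior_ww        # B-level: posterior numerators
    score_nw = lik_nw * (1.0 - prior_ww)
    if score_ww > score_nw:
        return 1
    if score_ww < score_nw:
        return 9
    return 1 if wheat_is_max else 9     # tie: fall back to the A-level ranking

high = classify([5.0, 0.0, 0.0], 0.23, 0.5, 0.5, 0.5)
low_wheat = classify([1.0, 1.1, 0.0], 0.23, 0.6, 0.2, 0.7)
low_other = classify([1.0, 1.1, 0.0], 0.23, 0.6, 0.2, 0.1)
print(high, low_wheat, low_other)
```

In the second call the SoftMax ranking alone would favor a non-wheat class, but the low confidence hands the decision to the B-level, where the local prior tips it toward winter wheat.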

Training Model
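The basic unit of the loss function calculation is the cross entropy, expressed for one sample as:
$$H(p, q) = -\sum_{i} q_i \log p_i \tag{9}$$
where p is the predicted category probability distribution, q is the actual category probability distribution, and i is the index of an element in the category probability distribution. On this basis, the loss function of the CNN-Bayesian model is defined as:
$$loss = \frac{1}{ts} \sum_{k=1}^{ts} H\left(p^{(k)}, q^{(k)}\right) \tag{10}$$
where ts denotes the number of pixels used in the training stage, and $p^{(k)}$ and $q^{(k)}$ are the predicted and actual distributions of the k-th pixel.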
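A small numerical sketch of the cross entropy and its aggregation over training pixels is given below. Averaging over the ts pixels and the small constant `eps` that guards against log(0) are assumptions of this sketch; the example distributions are made up.

```python
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Cross entropy between predicted distribution p and actual distribution q."""
    return float(-np.sum(q * np.log(p + eps)))

def mean_loss(P: np.ndarray, Q: np.ndarray) -> float:
    """Average cross entropy over all training pixels (rows of P and Q)."""
    return float(np.mean([cross_entropy(p, q) for p, q in zip(P, Q)]))

# Two pixels, two classes; q is one-hot (illustrative values only)
P = np.array([[0.5, 0.5],
              [0.9, 0.1]])
Q = np.array([[1.0, 0.0],
              [1.0, 0.0]])
loss = mean_loss(P, Q)
print(loss)
```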
We trained the CNN-Bayesian model in an end-to-end manner; the B-level classifier does not participate in the training stage. The parameters required for the B-level classifier to perform calculations are obtained statistically after training is completed. The training stage consists of the following steps:
1. Image-label pairs are input into the CNN-Bayesian model as a training sample dataset, and parameters are initialized.
2. Forward propagation is performed on the sample images.
3. The loss is calculated and back-propagated to the CNN-Bayesian model.
4. The network parameters are updated using stochastic gradient descent with momentum [45].
Steps 2-4 are iterated until the loss is less than the predetermined threshold value.
Table 1 shows the hyperparameter setup we used to train our model. In the comparison experiments, these hyperparameters were also applied to the comparison models.
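The parameter update in step 4 can be sketched as follows. The learning rate and momentum coefficient below are hypothetical placeholders rather than the values in Table 1, and a simple quadratic stands in for the network loss so that the loop of steps 2-4 is runnable on its own.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: v <- mu * v - lr * grad, then w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

def toy_grad(w):
    """Gradient of the stand-in loss f(w) = ||w||^2 (not the network loss)."""
    return 2.0 * w

w = np.array([1.0, -2.0])   # hypothetical initial parameters
v = np.zeros_like(w)        # velocity starts at zero
for _ in range(500):        # iterate forward pass, loss, and update
    w, v = momentum_step(w, v, toy_grad(w))
print(np.linalg.norm(w))
```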

Work Flow
First, a set of fixed-size pixel blocks is cut from the pre-processed remote sensing image set to form the image set for training and testing. The training images are labeled pixel by pixel using visual interpretation. These data are then used to train the CNN-Bayesian model (loss value of 10⁻⁹ in this study). The predicted category, actual category, and CL of each sample are output after each round of training. Subsequently, the training information of the last round is used to acquire the confidence threshold δ (0.23 in this study) and the probability distributions p(∅|c=ww) and p(∅|c=nw). Finally, the trained model is used to extract winter wheat spatial distribution information from remote sensing images.

Experimental Setups
The proposed CNN-Bayesian model was implemented using Python 3.6 and the TensorFlow framework on a Linux Ubuntu 16.04 operating system. The comparison experiments were performed on a graphics workstation with an NVIDIA GeForce Titan X graphics device with 12 GB of graphics memory.
The network architecture parameters of the feature extractor of the CNN-Bayesian model and the data dimensions of each layer are given in Table 2. SegNet [35] and DeepLab [37] are classic semantic segmentation models for images that have achieved good results in the processing of camera images. Moreover, the working principles of these two models are similar to that of our model, and we therefore chose them as comparison models to better reflect the advantages of our model in feature extraction and classification. We also removed the second-level classifier of the CNN-Bayesian model to form another comparison model, named VGG-Ex, to better assess the role of the Bayesian classifier.
We used data augmentation techniques on the dataset to prevent overfitting: each image was randomly adjusted in brightness, saturation, hue, and contrast, and each resulting image was then rotated three times (90°, 180°, and 270°). The final dataset contains 6100 images. We also employed a random split technique for training and testing to prevent overfitting. During each training and test round, 4880 images randomly selected from the image-label dataset were used as training data, and the remaining 1220 images were used as test data. The SegNet, DeepLab, VGG-Ex, and CNN-Bayesian models were trained with the same image dataset. This was done five times. Table 3 shows the total number of samples of each category used in each training and test round.
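The rotation step and the random 4880/1220 split can be sketched with NumPy. The helper names, the toy image size, and the seed are hypothetical; the color adjustments are omitted, and the label file of each image would need the same rotations applied.

```python
import numpy as np

def rotations(image: np.ndarray) -> list:
    """Return the image rotated by 90, 180, and 270 degrees (rows x cols x bands)."""
    return [np.rot90(image, k=k, axes=(0, 1)) for k in (1, 2, 3)]

def random_split(n_total: int, n_train: int, seed: int):
    """Randomly split image indices into a training part and a test part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_total)
    return order[:n_train], order[n_train:]

# Toy 2 x 3 image with four bands, for illustration only
image = np.arange(24).reshape(2, 3, 4)
rotated = rotations(image)

# One of the five training/test rounds (seed chosen arbitrarily)
train_idx, test_idx = random_split(6100, 4880, seed=0)
print(len(train_idx), len(test_idx))
```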

Results and Evaluation
Table 4 shows the confusion matrices for the segmentation results of the four models. Each row of the confusion matrix represents the proportion taken by the actual category, and each column represents the proportion taken by the predicted category. Our approach achieved better classification results. The proportion of "winter wheat" wrongly categorized as "non-winter wheat" was, on average, 0.033, and the proportion of "non-winter wheat" wrongly classified as "winter wheat" was, on average, 0.021. In this paper, we used four popular criteria, namely accuracy, precision, recall, and the Kappa coefficient, to evaluate the performance of the proposed model [45]. Table 5 shows the values of the evaluation criteria of the four models. To further compare the classification accuracy at the edges of planting areas, we subdivided the categories into "inner" and "edge" labels. If only winter wheat pixels are used in the convolution process to extract the features of a pixel, the pixel is classified as inner; otherwise it is classified as edge. Table 6 shows the confusion matrices for the segmentation results of the four models with these subdivided categories. As can be seen from Table 4, the accuracy of the inner category was similar across the four models, but the CNN-Bayesian model was more accurate with regard to the edge category. The accuracy of the CNN-Bayesian model in edge recognition is three times higher than that of SegNet and two times higher than that of DeepLab. Comparing the accuracy on winter wheat edge pixels of CNN-Bayesian with that of VGG-Ex shows that the ability of CNN-Bayesian to recognize winter wheat edges is improved by nearly 30% due to the use of the Bayesian classifier.
Figure 5 shows ten images and corresponding results randomly selected from the tested images, each containing 1024 × 1024 pixels. The CNN-Bayesian model misclassified only a small number of pixels at the corners of the winter wheat planting area. In the DeepLab and VGG-Ex results, the misclassified pixels were mainly distributed at the junction of winter wheat and non-winter wheat areas, including edge and corner locations, but the VGG-Ex results contained fewer misclassified pixels than the DeepLab results. The SegNet results had the most errors, which were scattered throughout the image; most misclassified pixels were located on the edges and corners, with some also occurring inside the planting area.
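For reference, the four criteria can be computed from the counts of a two-class confusion matrix as sketched below. The counts are invented for illustration and are not taken from Tables 4-6; `binary_metrics` is a hypothetical helper that treats winter wheat as the positive class.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall, and Kappa for winter wheat vs. non-winter wheat.

    tp: wheat predicted as wheat; fn: wheat predicted as non-wheat;
    fp: non-wheat predicted as wheat; tn: non-wheat predicted as non-wheat.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Agreement expected by chance, from the row and column totals
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total**2
    kappa = (accuracy - expected) / (1.0 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}

# Made-up counts, for illustration only
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```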

Discussions
This paper proposed a novel per-pixel classification approach to extract winter wheat spatial distribution from GF-2 imagery. This approach can extract winter wheat with fine field edges by using two strategies: a CNN structure to extract features and a two-level classifier to determine the pixel's category accurately. The contributions of these two strategies are discussed as follows.

The Effectiveness of Feature Extractor
To distinguish winter wheat from other categories, a CNN, a popular deep learning algorithm, was applied to explore the features. The trained feature extractor of the proposed CNN-Bayesian model has strong feature extraction ability: it brings the feature vectors of pixels of the same category close together even when their spectral information differs, and pushes the feature vectors of pixels of different categories far apart even when their spectral information is similar.
Since the CNN-Bayesian model and the VGG-Ex model use the same feature extractor, we selected the most different set of semantic features from the last layer of the CNN-Bayesian, SegNet, and DeepLab models for comparative analysis; Figure 6 shows the statistical results. The degree of confusion in the CNN-Bayesian model results is smaller than that in the other two models because its network structure and data organization mode are better, and because the improved pooling algorithm used in the feature extractor has a larger receptive field and a greater advantage in feature aggregation than the classical pooling algorithm. The CNN-Bayesian model feature extractor can keep the size of the feature image of the last layer unchanged without using deconvolution. Furthermore, it can eliminate location errors of the feature value that may be caused by the deconvolution operation and ensure one-to-one correspondence between the feature value and the pixel, thus reducing the degree of confusion between the features of winter wheat edge and non-winter wheat areas. Compared with the comparison models, the CNN-Bayesian model better suits the data features of high-resolution remote sensing images.
As can be seen from the statistical result of SegNet, although the feature values of winter wheat inner pixels and winter wheat edge pixels are scattered, the feature values of winter wheat inner pixels basically do not overlap with the feature values of other categories. However, the overlap between the feature values of winter wheat edge pixels and other categories is large, which is why the accuracy for winter wheat inner pixels is higher than that for winter wheat edge pixels.
The feature values of some winter wheat edge pixels were confused with those of non-winter wheat pixels in all three cases, but those of winter wheat inner pixels were never confused with those of non-winter wheat pixels. This shows that pixel position has a great impact on the feature extraction results, for two main reasons. First, field edge pixel information is different from inner pixel information, because edge areas often contain both winter wheat and bare fields or other land use types, and the proportion of winter wheat varies greatly, whereas inner areas contain only winter wheat. Second, pixel blocks centered on pixels at the edge of winter wheat fields usually contain more non-winter wheat than wheat pixels (Figure 7). Thus, when extracting the feature values of these edge pixels, approximately 50% of the pixels involved in the convolution operation are pixels of other categories, whereas the ratio for corners is 75% or higher.
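The 50% and 75% figures can be reproduced on an idealized field with straight borders. In this sketch the field occupies one quadrant of a toy mask and the pixel block has a hypothetical radius of 12 pixels (a 25 × 25 block); both choices are assumptions made only to illustrate the geometry.

```python
import numpy as np

def other_fraction(mask: np.ndarray, row: int, col: int, radius: int) -> float:
    """Fraction of non-wheat pixels in the square block centered on (row, col)."""
    block = mask[row - radius:row + radius + 1, col - radius:col + radius + 1]
    return float(1.0 - block.mean())

size, radius = 101, 12                 # toy mask size and hypothetical block radius
field = np.zeros((size, size), dtype=bool)
field[50:, 50:] = True                 # winter wheat fills the lower-right quadrant

edge = other_fraction(field, 75, 50, radius)    # pixel on a straight field edge
corner = other_fraction(field, 50, 50, radius)  # pixel on the field corner
inner = other_fraction(field, 80, 80, radius)   # pixel well inside the field
print(edge, corner, inner)
```

For this geometry the block around an edge pixel is roughly half non-wheat and the block around a corner pixel is roughly three-quarters non-wheat, while the block around an inner pixel contains only winter wheat.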

The Effectiveness of Classifier
Both the CNN-Bayesian model and the comparison models use the category-probability vector as the basis for determining the category of each pixel. The main advantage of the CNN-Bayesian model is that it exploits the information carried by the difference between elements of the category-probability vector and uses a hierarchical strategy to determine the category of each pixel. The category of a pixel with high confidence is determined directly, whereas the category of a pixel with low confidence is determined by combining the probability vector with prior knowledge. VGG-Ex, SegNet, and DeepLab use only the maximum probability value to determine the category of a pixel. Therefore, the strategy adopted by the CNN-Bayesian model helps to improve the accuracy of the results, which are shown and compared in Figure 5 and Tables 3 and 4.
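The hierarchical strategy can be illustrated with a minimal sketch. Several points here are assumptions rather than the paper's implementation: the confidence is taken as the gap between the two largest probabilities, the prior vector stands in for the planting structure information, and the second level is a plain Bayes re-weighting rather than the improved Bayesian model. Only the 0.23 threshold comes from the text.

```python
CONFIDENCE_THRESHOLD = 0.23  # threshold reported in the text


def confidence(probs):
    """Assumed definition: gap between the largest and second-largest probability."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def classify_pixel(probs, prior):
    """Two-level decision for one pixel.

    probs: category-probability vector produced by the network.
    prior: hypothetical prior probabilities per category (illustrative only).
    """
    # Level 1: confident pixels keep the most probable category.
    if confidence(probs) >= CONFIDENCE_THRESHOLD:
        return argmax(probs)
    # Level 2: low-confidence pixels are re-weighted by the prior
    # (generic Bayes rule, used here as a stand-in).
    weighted = [p * q for p, q in zip(probs, prior)]
    total = sum(weighted)
    posterior = [w / total for w in weighted]
    return argmax(posterior)


prior = [0.2, 0.7, 0.1]                                   # made-up prior
high = classify_pixel([0.85, 0.10, 0.05], prior)          # confident pixel
low = classify_pixel([0.48, 0.42, 0.10], prior)           # ambiguous pixel
print(high, low)
```

With these made-up numbers, the confident pixel keeps category 0 regardless of the prior, while the ambiguous pixel (confidence 0.06) is moved to category 1 by the prior.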

We compared the number of pixels at each confidence level for the CNN-Bayesian, VGG-Ex, SegNet, and DeepLab models (Figure 8). At lower confidence levels, the pixel ratio of the SegNet and DeepLab models is higher than that of the CNN-Bayesian and VGG-Ex models. This suggests that the feature composition of the CNN-Bayesian model is more reasonable, because it uses color and texture features in addition to the high-level semantic features used by all three models.
As the confidence increases, the classification errors of the four models decrease, and the rate of decrease grows (Figure 9). This is because the confidence value directly reflects the degree to which the pixel characteristics match the overall category characteristics and, thus, the likelihood that the classification result is correct. Therefore, it is reasonable to use the confidence value as an indicator of how reliably a given pixel is assigned to a category.
Overall, these results show that the CNN-Bayesian model is more capable than the comparison models, reflecting the advantage of its two-level classifier structure. Because the second-level classifier makes full use of the confidence and planting structure information, the number of misclassified pixels is effectively reduced.
As can be seen from Figures 8 and 9, for the CNN-Bayesian model, the number of pixels with confidence lower than 0.23 is small, but the proportion of misclassified pixels among them is very large. This is why we chose 0.23 as the confidence threshold described in Section 3.3.
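The kind of analysis behind Figures 8 and 9 can be sketched as a per-bin tally of pixel counts and error rates. The following is a minimal illustration; the function name, the bin edges, and the ten toy pixels are invented for the example and are not the paper's data.

```python
def error_rate_by_confidence(confidences, correct, edges):
    """Return (pixel_count, error_rate) for each bin [edges[i], edges[i+1]).

    confidences: confidence value of each pixel.
    correct: True where the predicted category matches the label.
    error_rate is None for an empty bin.
    """
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        flags = [ok for c, ok in zip(confidences, correct) if lo <= c < hi]
        n = len(flags)
        rate = (n - sum(flags)) / n if n else None
        stats.append((n, rate))
    return stats


# Toy data, invented for illustration only.
confidences = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.75, 0.90, 0.95, 0.99]
correct = [False, False, True, True, False, True, True, True, True, True]
edges = [0.0, 0.23, 0.5, 1.01]  # first bin ends at the 0.23 threshold

stats = error_rate_by_confidence(confidences, correct, edges)
for (lo, hi), (n, rate) in zip(zip(edges[:-1], edges[1:]), stats):
    print(f"[{lo:.2f}, {hi:.2f}): pixels={n}, error_rate={rate}")
```

A threshold is a reasonable candidate when the bins below it hold few pixels but a high error rate, which is the pattern described for 0.23.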

Comparison to Other Similar Works
At present, several methods focus on improving the classification accuracy of edge regions [43][44][45][67]. These methods describe the association between inputs at the semantic level, so that the relationship between the predicted labels of adjacent pixels can be described; their prediction results therefore depend not only on the features of the predicted pixel but also on the results of previous predictions. In contrast, our method describes the statistical features of the inputs. The prediction result is determined by the features of the pixel itself and by regional statistical features, which is more in line with the characteristics of remote sensing data.

Conclusions
Satellite remote sensing has become a mainstream approach for extracting winter wheat spatial distribution, but field edge results are usually rough, which lowers overall accuracy. In this paper, we proposed a new approach for extracting spatial distribution information for winter wheat, which significantly improves the accuracy of edge extraction results. The main contributions of this paper are as follows: (1) Our feature extractor is designed to suit the characteristics of remote sensing image data, avoiding the extra calculations and errors caused by using deconvolution in the feature extraction process. The feature extractor can fully explore the deep and spatial semantic features of the remote sensing image. (2) Our classifier effectively uses the confidence value of the category-probability vector and combines it with the planting structure characteristics of winter wheat to reclassify pixels with a low confidence value, thus effectively reducing classification errors for edge pixels. By optimizing the method of extracting and using remote sensing image features, and by rationally combining color, texture, semantic, and statistical features, we obtained high-precision spatial distribution data of winter wheat. The spatial distribution data of winter wheat in Shandong Province in 2017 and 2018 obtained by the proposed approach have been used by the Meteorological Bureau of Shandong Province.
The number of categories that can be extracted by the proposed CNN-Bayesian model is determined by the number of sample categories in the training dataset. When the model is used to extract other land use types or is applied to another area, only a new training dataset is needed to retrain the model. The successfully trained model can then be used to extract high-precision spatial distribution data of land use from high-resolution remote sensing images.
The main disadvantage of our approach is that it requires more per-pixel label files. Future research should test the use of semi-supervised classification to reduce the dependence on per-pixel label files.

Figure 1. Regional distribution of wheat planting in Shandong Province, China, and the location of Zhangqiu County (red outline).

Figure 2. Data sources: (a) Gaofen 2 remote sensing imagery of Zhangqiu County; and (b) sample point locations within the county.

Figure 3. Example of image classification: (a) original Gaofen 2 image and (b) classified by land use type.

Figure 7. Examples of the effect of pixel position on the extracted features; pixel boxes (red) centered on corner or edge areas contain 50% or more non-winter wheat pixels.

Figure 8. Distribution of confidence values for the four models.

Figure 9. Distribution of misclassified pixels for all four models.

Table 2
f denotes the size of the convolution/pooling kernel, s the stride (step length), and d the number of convolution kernels in the layer. Because the batch normalization and rectified linear unit layers do not change the dimensions of the data, they are not listed in the table.

Table 3. Total number of samples of each category used in each training and test round.

Table 4. Confusion matrix of the winter wheat classification.

Table 5. Comparison of the four models' performance.