Convolutional Neural Network-Based Remote Sensing Images Segmentation Method for Extracting Winter Wheat Spatial Distribution

When extracting the spatial distribution of winter wheat from Gaofen-2 (GF-2) remote sensing images using a convolutional neural network (CNN), accurate identification of edge pixels is the key to improving the accuracy of the result. In this paper, an approach for extracting accurate winter wheat spatial distribution based on CNN is proposed. A hybrid structure convolutional neural network (HSCNN) was first constructed, which consists of two independent sub-networks of different depths. The deeper sub-network was used to extract the pixels in the interior of the winter wheat field, whereas the shallower sub-network was used to extract the pixels at the edge of the field. The model was trained by classification-based learning and used in image segmentation for obtaining the distribution of winter wheat. Experiments were performed on 39 GF-2 images of Shandong province captured during 2017–2018, with SegNet and DeepLab as comparison models. As shown by the results, the average accuracy of SegNet, DeepLab, and HSCNN was 0.765, 0.853, and 0.912, respectively. HSCNN was as accurate as DeepLab and superior to SegNet in identifying interior pixels, and its identification of the edge pixels was significantly better than that of the two comparison models, which showed the superiority of HSCNN in the identification of winter wheat spatial distribution.


Introduction
Winter wheat is the most important food crop in China, comprising 21.38% of the gross cropped area of the domestic food crops in 2017 according to the data released by the National Bureau of Statistics, with its output accounting for 21.00% of the total food crop production [1]. For national food security, the Chinese government has assigned a minimum area of arable land in each region that needs to be safeguarded (the "red line") [2]. Timely and accurate acquisition of the size and spatial distribution of winter wheat fields assists the relevant government departments in guiding the farming activities, estimating the yield, and adjusting the agricultural structure for ensuring food security [3].
Remote sensing is capable of imaging and large-area monitoring, making it a good data source for rapid and accurate extraction of winter wheat planting information. Researchers have successfully extracted winter wheat spatial distribution information from MODIS (moderate-resolution imaging spectroradiometer) and ETM+/TM (enhanced thematic mapper plus/thematic mapper) data, achieving accuracies of 85.5% and 89.1%, respectively [4,5]. This demonstrates the advantage of remote sensing in this application. However, owing to limitations in the spatial resolution of the data source, the spatial resolution of the extraction results is also rather coarse and cannot satisfy the requirements of the application [6][7][8][9][10]. With the development of high-resolution remote sensing satellites, a crop planting area can be monitored more accurately using the corresponding images as the data source [11,12]. Winter wheat cultivation information has been extracted from the remote sensing images captured by Gaofen-1 (GF-1) of the Gaofen series of Chinese satellites, yielding satisfactory results, with a maximum accuracy of about 89% [13][14][15][16][17][18]. Most researchers still use traditional methods, such as decision trees and texture features. These methods can only take advantage of low-level features, which makes it easy to misidentify pixels at the edge of a winter wheat planting area.
Image segmentation has been successfully used in the processing of camera images and applied by researchers to high-resolution remote sensing images, achieving significantly more accurate classification by pixel-by-pixel segmentation [19][20][21]. Feature extraction is the key step in remote sensing image segmentation. In high-resolution remote sensing images, the spectral differences between objects of the same type increase, while those between objects of different types diminish: objects of the same type are more likely to exhibit different spectral properties, whereas objects of different types tend to be spectrally similar, which makes feature extraction increasingly difficult [22,23]. Traditional methods including k-nearest neighbors and maximum entropy can only identify low-level image features such as color, shape, and texture. They are not capable of visually providing a semantic description. This hinders the extraction of higher-level features and limits the use of these methods in the segmentation of high-resolution remote sensing images [24,25].
With the development of machine learning, algorithms such as neural networks (NNs) [26] and support vector machines (SVMs) [27,28] are being used in the segmentation of high-resolution images [29][30][31]. In some studies, when compared with traditional statistical methods and object-oriented methods, machine learning algorithms yielded better image segmentation results [32,33]. Both SVMs and NNs are shallow-learning algorithms [34][35][36], which do not express complex functions well owing to the limitations in their network structure. Therefore, these models cannot adapt to the continuously increasing complexity caused by the increasing sample size and diversity [37,38].
Progress in deep learning has facilitated solving these problems by using deep neural networks (DNNs) [39][40][41][42]. As an important branch of deep learning, a convolutional NN (CNN) is widely used with visual data because of its excellent feature learning ability [43][44][45]. A CNN is a deep learning network, composed of several layers, capable of nonlinear mapping. Its strength in learning is exemplified by the good image segmentation results achieved [46][47][48][49][50][51][52]. Further, the capacity of many large CNNs can be scaled according to the size of the training data and complexity and processing ability of the model, and their performance in image segmentation has improved significantly [53][54][55][56][57][58][59][60].
A fully convolutional network (FCN) is a deep learning network for image segmentation, which was proposed in 2015. Taking advantage of the feature organization and extraction abilities of convolution, an FCN realizes pixel-by-pixel segmentation of camera images by constructing a multi-layer convolutional structure and setting appropriate deconvolutional layers [61][62][63]. Accordingly, a series of convolution-based segmentation models has been developed, including SegNet [64], UNet [65], DeepLab [66], multi-scale FCN [67], and ReSeg [68]. Of these models, SegNet and UNet are clearly structured, their convolution structure is easy to understand, and their processing speed is fast. DeepLab uses a method called "atrous convolution", which has a strong advantage in processing detailed images. Multi-scale FCN is designed to address the huge scale gap between different classes of targets, e.g., sea/land and ships. ReSeg exploits the local generic features extracted by CNNs and the capacity of recurrent neural networks (RNNs) to retrieve distant dependencies. Each model has its own strengths and is adept at dealing with certain image types.
In work on extracting the spatial distribution of crops with high-resolution GF-1 images as the data source, deep learning has been applied in addition to methods such as decision trees, texture features, and maximum entropy. However, most of these studies use an existing deep learning model directly as a tool, and seldom consider the large difference between the characteristics of the edge pixels and the inner pixels of a crop planting area.
On board the Gaofen-2 satellite is a panchromatic camera with a spatial resolution of 1 m, and a multi-spectral camera with a spatial resolution of 4 m, which provides ideal data for extracting winter wheat plantation information. Before applying a CNN to GF-2 remote sensing images for this purpose, a trial extraction was performed with classical network architectures (such as SegNet) and the misidentified pixels were categorized; approximately 90% of them were found at the edge of the crop field. Further analysis indicates the structure of the convolutional layer as the source of this problem. The outcome produced by operating the convolution kernel on the pixel block is treated as the eigenvalue of the central pixel of the pixel block. As such, for the pixels at the edge, 50% of the pixels involved in each convolution are from negative samples, whereas, for the pixels at the corner, this number is 75% or higher. This results in a significant difference between the eigenvalues of the pixels at these locations and those at the center of the image, and an increase in the probability of the recognition results being placed in a wrong category. To avoid these problems, a new method is herein proposed for the extraction of the winter wheat field information from the GF-2 remote sensing images. The main procedures are as follows.

1. First, a CNN consisting of two independent sub-networks of different depths is established. The deep and shallow sub-networks are trained to be sensitive only to the pixels at the interior and edge of a winter wheat planting field, respectively, and only these pixels are extracted. This model is named the Hybrid Structure Convolutional Neural Network (HSCNN).

2. A classification algorithm is adopted in the model training. For initial training of the sub-network used for the edge pixel extraction, edge pixels are considered as positive samples, with the pixels at other locations being treated as negative samples. The inner pixels are then designated as positive samples, with the pixels at other locations as negative samples, for training the sub-network used for the inner pixel extraction. After the successful completion of the training, the neural network is able to extract the winter wheat field from the GF-2 images accurately.

3. Finally, a GF-2 image is segmented by the trained model. Because SegNet and DeepLab are classic semantic segmentation models and their working principles are very similar to those of our method, we chose them as the comparison models for evaluating the accuracy of the segmentation results.

Study Region
The whole study region is Shandong province, China. Shandong is located along the eastern coast of China (in the lower reaches of the Yellow River), within 34°22′ N–38°24′ N and 114°47.5′ E–122°42′ E. It measures 721.03 km from east to west, and 437.28 km from north to south. The land area of the province is 155,800 km², of which 14.59% is mountainous, 5.56% is water (such as lakes), 15.98% is forest, and 53.82% is cultivated land. The annual total planting area of crops in the province is approximately 162 million mu. The main food crops of this region are wheat and maize. In 2016, the wheat planting area was 57.45405 million mu, and in 2017 it was 57.6435 million mu [69].
In this paper, we used the ground data and remote sensing data of Feicheng county, Ningyang county and Zhangqiu county, Shandong province. The three counties are similar in topography, all relatively flat, which limits the influence of topographic fluctuations on the experimental results.

Ground-Based Data
To produce samples for training our model, we conducted a field survey in Feicheng county, Ningyang county and Zhangqiu county in 2017 and 2018, and obtained the land use data of 369 sample points, among which 257 were winter wheat sample points and 112 were bare land. The survey results include the time, location and type of land use.

Remote Sensing Data
We selected 39 GF-2 remote sensing images, each with a size of 7300 × 6900 pixels. Of these images, 15 were captured on 17 February 2017, 11 were captured on 21 March 2018 and 13 were captured on 12 April 2018. We selected images from different periods to increase the anti-interference ability of the HSCNN. These remote sensing data cover Feicheng county, Ningyang county and Zhangqiu county, and match the times of the ground investigation. In addition, the selected remote sensing data have few clouds and good clarity.
The Environment for Visualizing Images (ENVI) software was used for preprocessing, including fusion of the panchromatic and multispectral bands to obtain 1-m spatial resolution multispectral data, and contrast stretching to generate an enhanced color composite image.

Network Architecture of Our Method
The HSCNN model is divided into five functional groups of components: input (a), inner-CNN (b), edge-CNN (f), vote function (j), and output (k), as shown in Figure 1. Both the edge-CNN and inner-CNN have convolution layers, an encoder layer, and a classifier layer. In the training stage, the inputs are original images and artificial classification labels. In the classification stage, the inputs are original GF-2 images, the output is a single-band file, and the content of each pixel in the output is the category number of the corresponding original image pixel. The HSCNN indicates the winter wheat area using category number 100, and other land uses using category number 200. These two numbers were adopted to fit the coding value table we are developing to obtain detailed land use information.

Inner-Layers and Edge-Layers
The operational characteristics of the pixel block-based convolution for image segmentation are described in Section 1, in addition to the effect of the pixel block location on the convolution results. Based on this analysis, two convolution sub-structures of different depths are set up for the feature extraction of the winter wheat field. The deep convolution sub-network is used to extract the features of the pixels in the interior of the winter wheat plantation, shown as inner-layers (c) in Figure 1.
The shallower sub-network is used to extract the features of the pixels at the edge of the winter wheat plantation, shown as edge-layers (g) in Figure 1. The benefits of this design are discussed in Section 4 based on the experimental results.
In our approach, an inner pixel is a pixel whose pixel block contains only winter wheat pixels when the convolution is carried out with that pixel at the center. An edge pixel is a pixel whose pixel block contains both winter wheat pixels and other pixels when its feature is computed. All kernels of the HSCNN take the form w × h × c, where w is the width, h is the height, and c is the number of channels of a kernel. Two types of kernels are used in the first convolutional layers of inner-layers (c) and edge-layers (g). For one type, w and h are set to 1, and for the other type they are set to 3. In both cases, c is set to 4 because the data in the four multi-spectral bands of GF-2 are used. Kernels of the form 1 × 1 × 4 are used to extract the features of the individual pixels. The generated feature map is used directly as the input of the encoder, and does not participate in the subsequent convolution. Convolution kernels of the form 3 × 3 × 4 are used to extract the spatial relation between the pixels and generate the spatial semantics by multi-level convolution.
After the first convolution layer operates on the original image, we obtain a feature map that has only one channel. Because the input of each subsequent convolution layer is the feature map calculated by the previous convolution layer, the w and h values of the kernels used in all other convolutional layers are set to 3, and c is 1 from the second layer onward. To extract more features from the edge pixels of the crop field, the number of kernels used in each convolutional layer of edge-layers (g) is twice that used in the corresponding layer of inner-layers (c).
In the HSCNN, each convolution layer has only one activation layer attached, and there is no pooling layer. Accordingly, the convolution result of each pixel block can be used directly as the feature of its central pixel, without the need to determine the position of the pixel that the feature corresponds to through deconvolution. As such, the HSCNN does not utilize a deconvolutional layer. This reduces the amount of computation and avoids the positioning error of deconvolution, thereby improving the accuracy of the segmentation.
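The effect of omitting pooling can be illustrated with a minimal pure-Python sketch: a stride-1 convolution followed by one activation returns exactly one feature value per input pixel, so no deconvolution is needed to locate it. The function name `conv_same`, the zero padding at the image border, the ReLU activation, and the toy kernel weights are assumptions made for illustration; the paper does not specify them.

```python
def conv_same(image, kernel):
    """Stride-1 convolution with zero padding (assumed) and ReLU (assumed).

    image:  c bands, each an H x W list of rows
    kernel: c bands, each a k x k list of rows, k odd
    Returns an H x W feature map: one value per input pixel.
    """
    c, height, width = len(image), len(image[0]), len(image[0][0])
    k = len(kernel[0])
    r = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for b in range(c):
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy, xx = y + dy, x + dx
                        # Pixels outside the image contribute zero.
                        if 0 <= yy < height and 0 <= xx < width:
                            acc += image[b][yy][xx] * kernel[b][dy + r][dx + r]
            out[y][x] = max(acc, 0.0)  # single activation, no pooling
    return out


# Toy 4-band, 3 x 3 image in which every band is constant 1.0.
image = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]
k1 = [[[0.25]] for _ in range(4)]                            # 1 x 1 x 4 kernel
k3 = [[[1.0 / 36] * 3 for _ in range(3)] for _ in range(4)]  # 3 x 3 x 4 kernel

spectral = conv_same(image, k1)  # per-pixel spectral feature
spatial = conv_same(image, k3)   # feature using the 3 x 3 neighborhood
```

Both outputs keep the 3 × 3 size of the input. With these toy weights the 3 × 3 × 4 response is smaller at the image corner than at the center, because part of the block falls outside the image; this is the same kind of location effect that motivates treating edge pixels separately.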

Inner-Encoder and Edge-Encoder
The inner-encoder and edge-encoder are used to encode the eigenvector extracted by the convolution layers on the pixel, ensuring that the classifier can establish the relationship between the eigenvector and pixel type. In the HSCNN model, the inner- and edge-encoders are both 2 × n matrices, where n is the length of the eigenvector.
Let x denote the eigenvector of the pixel, W the encoder matrix, and r the encoded result vector. The encoding calculation is given in Equation (1).
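$$r = Wx + b, \quad \text{i.e.,} \quad \begin{pmatrix} r_1 \\ r_2 \end{pmatrix} = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ w_{21} & \cdots & w_{2n} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \tag{1}$$
where each row of matrix W represents a fitting function for a particular type of pixel, $b_1$ and $b_2$ are the respective biases, and the corresponding component of r is the encoded value of eigenvector x on that pixel type. The inner- and edge-encoders are trained separately.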

Inner-Classifier and Edge-Classifier
For each pixel, the inner-classifier converts its vector of the encoded values given by the inner-encoder into a probability distribution over a set of classes, and classifies the pixel as an inner pixel or a non-inner pixel of the winter wheat plantation based on the location of the component with the highest probability. Similarly, the edge-classifier distinguishes between the edge and non-edge pixels of the winter wheat field using the vectors of the encoded values generated on the pixels by the edge-encoder.
Following the classic softmax classifier [60][61][62][63][64][65], Equation (2) is used here to convert vector r of the encoded values to vector p of the class probabilities for each pixel.
$$p_i = \frac{e^{r_i}}{e^{r_1} + e^{r_2}}, \quad i = 1, 2 \tag{2}$$
After the transformation, the index of the larger of $p_1$ and $p_2$ is taken as the predicted category of the pixel. For the inner-classifier, index numbers of 1 and 0 are assigned to the inner pixels of the winter wheat field and other pixels, respectively. Likewise, for the edge-classifier, index numbers of 1 and 0 are assigned to the edge pixels of the winter wheat field and other pixels, respectively.
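The encoder and classifier steps described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the encoder is a linear map with one bias per row and that row 0 of the encoder corresponds to "other" and row 1 to the target class (inner or edge); the function names and the numeric weights are hypothetical and not taken from the paper.

```python
import math


def encode(W, b, x):
    """Encode eigenvector x with a 2 x n matrix W and biases b (one per row)."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def softmax(r):
    """Convert encoded values to class probabilities (numerically stable form)."""
    top = max(r)
    exps = [math.exp(v - top) for v in r]
    total = sum(exps)
    return [e / total for e in exps]


def classify(W, b, x):
    """Return (index of the most probable class, probability vector)."""
    p = softmax(encode(W, b, x))
    return p.index(max(p)), p


# Hypothetical encoder for an eigenvector of length n = 3.
W = [[0.5, -0.2, 0.1],   # row 0: "other" (assumed ordering)
     [-0.3, 0.8, 0.4]]   # row 1: target class (assumed ordering)
b = [0.0, 0.1]
x = [1.0, 2.0, 0.5]

label, p = classify(W, b, x)
```

Here the encoded values are 0.15 and 1.6, so the second component has the higher probability and the pixel receives index 1.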

Vote-Function
The vote-function determines the category number of a pixel from the outputs of the inner-classifier and edge-classifier and writes it to the output file. As described at the beginning of Section 2.2, the HSCNN indicates the winter wheat area using category number 100 and other land uses using category number 200. The category number of a pixel is calculated by Equation (3), where o represents the final category number of a pixel and $p_{inner}$ and $p_{edge}$ are the outputs of the inner-classifier and edge-classifier, respectively.
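One plausible reading of the vote-function is sketched below. It assumes that a pixel is written as winter wheat (100) when either classifier returns index 1, and as other land use (200) otherwise; this rule is inferred from the description of the two classifiers and is not a transcription of Equation (3). The names `vote` and `vote_map` are hypothetical.

```python
WHEAT, OTHER = 100, 200  # category numbers used by the HSCNN output


def vote(p_inner, p_edge):
    """Combine the two classifier outputs (each 0 or 1) into a category number.

    Assumption: a pixel counts as winter wheat if it is recognized either
    as an inner pixel or as an edge pixel of a winter wheat field.
    """
    return WHEAT if (p_inner == 1 or p_edge == 1) else OTHER


def vote_map(inner_map, edge_map):
    """Apply the vote pixel by pixel to two equally sized label maps."""
    return [[vote(i, e) for i, e in zip(row_i, row_e)]
            for row_i, row_e in zip(inner_map, edge_map)]


result = vote_map([[1, 0], [0, 0]],
                  [[0, 1], [0, 0]])
```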

HSCNN Training
We manually labeled all images at the pixel level as ground truth (GT) label data. In other words, for each image, there exists a 7300 × 6900 label map, having a pixel-class (row-col indexed) correspondence with it. We used 36 images for training, and the remaining 3 images for testing. The GF-2 images and their corresponding artificial classification labels are input to the HSCNN as training samples.
The training process includes error calculation, error back propagation and weight update. This process is iterated until the error becomes smaller than the predetermined threshold.
We calculated the errors between the predicted classification labels and the manual classification labels, and obtained their derivatives with respect to the network parameters by the chain rule. The chain rule is the rule in calculus for differentiating a composite function: the derivative of a composite function is the product of the derivatives of its component functions at the corresponding points, linked as a chain. The errors are then back-propagated through the network. The back propagation algorithm is a training method in deep learning that spreads the error of the output layer backward to adjust the weights between the nodes in the deep network, with the goal of making the sample labels output by the network consistent with the actual labels. We use the gradient descent method to update the HSCNN parameters. Gradient descent is one of the most commonly used optimization methods. The idea is to use the negative gradient direction at the current position as the search direction, because that is the direction of fastest descent at the current position.

Sample Labeling
We used the ENVI software for labeling and designed a preprocessor to build the labels. The process of artificial labeling is as follows:

1. The region-of-interest (RoI) tool in the ENVI software is used to select the winter wheat regions and other regions in the image. Then, the map locations of the pixels in each region are output to different files based on the category.

2. A band is added to the image file by the preprocessor as a mask band. The spatial resolution, size, and other parameters of the mask band are the same as those of the original image. Then, the category number of each pixel is written to the mask band according to the map location of the pixel previously output. We manually label all the images at the pixel level. Thus, for each image, there exists a 7300 × 6900 label map, with a row-column-indexed pixel-class correspondence.

3. The pixels marked as winter wheat are further categorized as edge pixels and inner pixels. Based on the parameters given above, the inner-layers comprise eight convolutional layers, each with a 3 × 3 (length × width) convolution kernel. Therefore, the feature extraction from pixel s involves a 9 × 9 pixel block centered at s in the calculation. As defined in Section 2.2.1, the winter wheat pixels are divided sequentially into edge pixels and inner pixels. For training class by class, we use temporary code 160 to denote edge pixels and 170 to denote inner pixels in the mask band.

Figure 2 shows an example of an image-label pair.
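Step 3 of the labeling procedure can be sketched as follows. This is a minimal pure-Python illustration, assuming that a winter wheat pixel is an inner pixel when every pixel of the block centered on it is winter wheat, and that block positions falling outside the image count as non-wheat; the function name `split_edge_inner` and the border handling are assumptions, and the block size is a parameter so that the 9 × 9 block stated above can be used for real data.

```python
WHEAT, OTHER = 100, 200   # category numbers in the mask band
EDGE, INNER = 160, 170    # temporary codes used during training


def split_edge_inner(mask, block=9):
    """Relabel winter wheat pixels of a mask as EDGE (160) or INNER (170)."""
    r = block // 2
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if mask[y][x] != WHEAT:
                continue
            # The bounds check comes first, so out-of-image positions
            # are treated as non-wheat and never indexed.
            all_wheat = all(
                0 <= yy < h and 0 <= xx < w and mask[yy][xx] == WHEAT
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            )
            out[y][x] = INNER if all_wheat else EDGE
    return out


# Toy 5 x 5 mask: four columns of wheat next to one column of other land use.
mask = [[WHEAT, WHEAT, WHEAT, WHEAT, OTHER] for _ in range(5)]
labels = split_edge_inner(mask, block=3)  # 3 x 3 block keeps the toy example small
```

In the toy result, wheat pixels on the image border or next to the non-wheat column become 160, the remaining wheat pixels become 170, and non-wheat pixels keep code 200.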

Loss Function
In our method, new loss functions are defined for the inner-CNN and edge-CNN, which still use the cross entropy as the basic element for the calculation, as expressed in Equation (4).
$$H(p, q) = -\sum_{i} q_i \log p_i \tag{4}$$
where p and q are the predicted and actual probability distributions, respectively, and i is the index of a component in the probability distribution. On this basis, the loss functions of the inner-CNN and edge-CNN are defined over m samples. When computing the loss of the inner-CNN, m is obtained by subtracting the number of edge pixels of the winter wheat field from the total number of samples, and when computing the loss of the edge-CNN, m is obtained by subtracting the number of inner pixels of the winter wheat field from the total number of samples.
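A possible form of these loss functions is sketched below. It assumes that the loss is the mean cross entropy over the m retained samples, with the edge pixels (temporary code 160) skipped for the inner-CNN and the inner pixels (code 170) skipped for the edge-CNN; the averaging and the function names are assumptions made for illustration, since only the definition of m is given in the text.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy between predicted distribution p and actual distribution q."""
    return -sum(q_i * math.log(max(p_i, eps)) for p_i, q_i in zip(p, q))


def masked_mean_loss(preds, targets, codes, excluded_code):
    """Mean cross entropy over the m samples whose code is not excluded_code."""
    kept = [(p, q) for p, q, c in zip(preds, targets, codes) if c != excluded_code]
    m = len(kept)
    if m == 0:
        return 0.0
    return sum(cross_entropy(p, q) for p, q in kept) / m


# Three pixels: inner wheat (170), other land use (200), edge wheat (160).
preds = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]    # predicted [p_other, p_target]
targets = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]  # one-hot labels for the inner-CNN
codes = [170, 200, 160]

# For the inner-CNN the edge pixel is skipped, so m = 3 - 1 = 2.
loss_inner = masked_mean_loss(preds, targets, codes, excluded_code=160)
```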

Model Training
Images from two different periods were selected as the training data. We selected images from different periods to increase the anti-interference ability of the HSCNN and to mitigate complications such as the change in the seasons, thus enhancing applicability. The training stage proceeded through the following steps:

1. Image-label pairs are input into the HSCNN as training samples. Network parameters are initialized.

2. Forward propagation is performed on the sample images.

3. The loss_inner is calculated and back propagated to the inner-CNN, whereas the loss_edge is calculated and back propagated to the edge-CNN.
The training yields two sub-networks, an inner-CNN and an edge-CNN. The former can accurately extract the inner pixels of the winter wheat plantation from the GF-2 remote sensing images, whereas the latter allows the best possible distinction between the edge pixels of the winter wheat planting region and other pixels.
In our training, the SGD method with momentum was used for parameter updates, which is illustrated in the following expression:
$$W^{(n+1)} = W^{(n)} - \Delta W^{(n+1)}$$
where $W^{(n)}$ denotes the old parameters, $W^{(n+1)}$ denotes the new parameters, and $\Delta W^{(n+1)}$ is the increment in the current iteration, which is a combination of the old parameters, gradient, and historical increment, i.e.,
$$\Delta W^{(n+1)} = \vartheta \left( d_w W^{(n)} + \frac{\partial J(W)}{\partial W^{(n)}} \right) + m \, \Delta W^{(n)}$$
where J(W) is the loss function, ϑ is the learning rate for step length control, $d_w$ denotes the weight decay, and m denotes the momentum.
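The parameter update can be sketched as follows, using one common formulation of SGD with momentum and weight decay, in which the increment combines the decayed old parameters, the gradient, and the previous increment, and is then subtracted from the parameters. The sign convention, the function name, and the default hyperparameter values are assumptions; the values used for training the HSCNN are not stated here.

```python
def sgd_momentum_step(w, grad, delta_prev, lr=0.01, weight_decay=0.0005, momentum=0.9):
    """One SGD-with-momentum update on a flat list of parameters.

    delta = lr * (weight_decay * w + grad) + momentum * delta_prev
    w_new = w - delta
    """
    delta = [lr * (weight_decay * w_i + g_i) + momentum * d_i
             for w_i, g_i, d_i in zip(w, grad, delta_prev)]
    w_new = [w_i - d_i for w_i, d_i in zip(w, delta)]
    return w_new, delta


# Toy problem: minimize J(w) = 0.5 * w^2, whose gradient is w itself.
w, delta = [1.0], [0.0]
for _ in range(200):
    grad = [w[0]]
    w, delta = sgd_momentum_step(w, grad, delta, lr=0.1, weight_decay=0.0, momentum=0.9)
```

On this toy quadratic the parameter oscillates with shrinking amplitude and approaches the minimum at zero, which shows how the historical increment carries over between iterations.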

Segmentation Using the Trained Network
After successful training, the HSCNN can be used to segment an input image pixel by pixel. According to our design, the output is written to a new band, which avoids modifying the original file.

Experiments and Results
The data used in the experiment are presented in Section 2.1. In this section, the models used for comparison are described in Section 3.1, and the experimental results and assessment of accuracy are given in Section 3.2.

Comparison Model
Feature selection is the basis of remote sensing image segmentation. At present, there are mainly two kinds of methods, based on artificial feature selection and on machine learning. Haralick et al. (1973) put forward the gray level co-occurrence matrix, a classical method of artificial feature selection that is mainly used to select image texture features. Since texture is formed by repeated alternating changes of the gray distribution in image space, there is a certain gray-scale relationship between two pixels separated by a certain distance; Haralick et al. described this correlation through a matrix [70]. Artificial feature selection can generally select only limited, shallow features, whereas feature selection based on machine learning can fully explore the deep features and spatial semantic features of the image. SegNet and DeepLab are classic semantic segmentation models of images, which have achieved very good results in the processing of images. Moreover, the working principles of these two models are very similar to our work, so we chose them as the comparison models, which can better reflect the advantages of our model in feature extraction. A comparative experiment was conducted using the methods established in the published literature.

SegNet
For the SegNet model, we directly employed the structure proposed by Badrinarayanan et al. [64], which consists of an encoder, a decoder, and a classifier. The encoder uses the first 13 convolutional layers of the VGG16 network, each having its corresponding decoder layer, totaling 13 decoder layers. The last decoder generates a multi-channel feature map as the input to the classifier, which outputs a probability vector of length K, where K is the number of classes. The final predicted category corresponds to the class having maximum probability at each pixel. In terms of training, SegNet can be trained end-to-end using SGD.

DeepLab
For DeepLab, we directly employed the DeepLab v3 model proposed by Liang-Chieh Chen et al. [66]. DeepLab was also developed based on the VGG network. To ensure that the output size would not be too small without excessive padding, DeepLab changes the stride of the pool4 and pool5 layers of the VGG network from the original 2 to 1, with a padding of 1. To compensate for the effect of the stride change on the receptive field, DeepLab uses a convolution method called "atrous convolution" to ensure that the receptive field after pooling remains unchanged and the output is more refined. Finally, DeepLab incorporates a fully connected conditional random field (CRF) model to refine the segmentation boundary.

Results and Result Comparison
In the comparative experiment, we applied our trained model to three GF-2 images for segmentation.These images were only used for testing and not involved in training.Figure 3 illustrates the results obtained from the comparison methods and proposed method.In Figure 3, the first column illustrates the results of Experiment 1, the second column illustrates the results of Experiment 2 and the third column illustrates the results of Experiment 3.
Tables 1-3 are the confusion matrices c for the segmentation results of the SegNet model, the DeepLab model, and the HSCNN model, respectively. Each row of the confusion matrix represents the proportion taken by the actual category, and each column represents the proportion taken by the predicted category. As can be seen from the tables, our method achieves better classification results. For our method, the proportion of "winter wheat" wrongly categorized as "background" is on average 0.069, and the proportion of "background" wrongly classified as "winter wheat" is on average 0.019, resulting in an overall accuracy of 91.2%. Accuracy, precision, recall, and the Kappa coefficient were used to evaluate the models. These indices are calculated from the confusion matrix c.
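Accuracy is the ratio of the number of correctly classified samples to the total number of samples, and is given in this case by the following equation:

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n} C_{ii}}{\sum_{i=1}^{n}\sum_{j=1}^{n} C_{ij}} \tag{8}$$

Here, $n$ is the number of classes, $C_{ii}$ denotes the number of correctly classified samples of class $i$, and $C_{ij}$ denotes the number of samples of class $i$ misidentified as class $j$.

Precision denotes the average proportion of pixels correctly classified to one class out of the total pixels retrieved for that class. Precision is calculated as:

$$\mathrm{Precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{C_{ii}}{\sum_{j=1}^{n} C_{ji}} \tag{9}$$

Recall represents the average proportion of pixels that are correctly classified in relation to the actual total pixels of a given class. Recall is computed as:

$$\mathrm{Recall} = \frac{1}{n}\sum_{i=1}^{n}\frac{C_{ii}}{\sum_{j=1}^{n} C_{ij}} \tag{10}$$

The Kappa coefficient measures the consistency of the predicted classes with the manually assigned labels. The Kappa coefficient is computed as:

$$\mathrm{Kappa} = \frac{N\sum_{i=1}^{n} C_{ii} - \sum_{i=1}^{n} C_{i+}C_{+i}}{N^{2} - \sum_{i=1}^{n} C_{i+}C_{+i}} \tag{11}$$

where $N = \sum_{i}\sum_{j} C_{ij}$ is the total number of samples, $C_{i+} = \sum_{j} C_{ij}$ is the sum of row $i$, and $C_{+i} = \sum_{j} C_{ji}$ is the sum of column $i$. Equations (8)-(11) use the definitions given in Reference [18] and are modified according to our actual situation. In practical applications, the minimum acceptable precision is 89%.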
The indicator values are listed in Table 4.
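As an illustration of how these indices follow from a confusion matrix, the sketch below computes them in plain Python. It assumes the convention stated above (rows are actual classes, columns are predicted classes), class-averaged precision and recall, and the standard Cohen's Kappa; the counts in `c` are hypothetical and are not the values of Tables 1-4.

```python
def evaluate(c):
    """Return (accuracy, precision, recall, kappa) for a square confusion matrix c.

    c[i][j] is the number (or proportion) of class-i samples predicted as class j.
    Assumes every class occurs at least once as an actual and as a predicted label.
    """
    n = len(c)
    total = sum(sum(row) for row in c)
    row_sums = [sum(row) for row in c]                              # actual totals
    col_sums = [sum(c[i][j] for i in range(n)) for j in range(n)]   # predicted totals

    accuracy = sum(c[i][i] for i in range(n)) / total
    precision = sum(c[i][i] / col_sums[i] for i in range(n)) / n
    recall = sum(c[i][i] / row_sums[i] for i in range(n)) / n

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, precision, recall, kappa


# Hypothetical counts: rows = actual (winter wheat, background), columns = predicted.
c = [[90, 10],
     [5, 95]]
accuracy, precision, recall, kappa = evaluate(c)
print(accuracy, precision, recall, kappa)
```

Because every index is a ratio, the same function can be applied to a matrix of proportions as well as to a matrix of counts.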

Analysis and Discussion
From the experimental results in Section 3, it is clear that our method significantly improves the accuracy of winter wheat extraction. In this section, the advantages of our model are discussed in terms of the differences between remote sensing images and camera images. This is followed by more specific comparisons with SegNet and DeepLab. Finally, the role of our model in the classification of land uses by remote sensing is discussed briefly.

Advantages of the HSCNN Model
CNNs have achieved significant success in camera image segmentation, which has motivated researchers to apply them to remote sensing images. The HSCNN model proposed here is developed based on previous work followed by a further in-depth analysis of the fundamental difference between camera images and remote sensing images. Thus, it possesses clear advantages compared with the traditional practice of applying a camera image segmentation model directly to remote sensing images.
Camera images and remote sensing images differ essentially in information representation. Owing to their advantages in shooting distance and the pixel quality of the camera, camera images are superior in terms of the rich details they contain, such that one object is formed by multiple pixels. Thus, the color of a pixel reflects the information at a certain point on an object but not the spatial relation between the pixels, which is found and expressed only by convolution. The nature of convolution is to represent the spatial correlation between pixels by constructing a complex fitting function that operates on the pixel values of a pixel block. Deep convolution is extremely successful in camera image processing precisely because it makes good use of these essential characteristics of camera images.
In remote sensing images, particularly for crop fields, a pixel generally contains multiple objects. For example, a pixel in a GF-2 image generally covers 1 m² of ground, which contains 600-700 winter wheat plants. A pixel therefore embodies the color information of the plants and the spatial information between them. However, at the edge of a winter wheat field, the region covered by a pixel is often a mixture of winter wheat and bare land, or of winter wheat and other geographical objects, with varying percentages of winter wheat in the space. In this view, the information contained in a pixel at the edge region is significantly different from that at the interior. These two types of pixels can even be regarded as two different types of objects.
Based on the above analyses, the HSCNN network architecture is designed with full consideration of the properties of the remote sensing image and the extraction target, and it makes good use of the characteristics of the winter wheat field captured in GF-2 remote sensing images. The strengths of this model are exhibited in the following three aspects:

1. Considering the significant difference between the pixels of the interior and the edge region of the winter wheat plantation (during extraction), these two regions are treated as two subclasses. Accordingly, the features of the inner pixels are more focused, which facilitates model training. Two sub-networks with different depths are then designed with respect to the characteristics of the two subclasses. The deeper sub-network extracts the pixels at the interior, whereas the shallower sub-network extracts those at the edge. This scheme reduces the effect of non-winter wheat pixels on the features and improves the stability of the model for edge pixel extraction.

2. Two types of kernels are used in the first convolutional layer. The 1 × 1 × 4 kernels are used to extract the features of individual pixels, and the 3 × 3 × 4 kernels are used to extract the spatial relation between pixels. This design takes advantage of the ability of convolution to extract higher-level spatial semantics and to obtain the rich pixel information contained in remote sensing images.

3. Our model does not use pooling; instead, the convolution result is taken as the eigenvalue of the central pixel of the pixel block. When a convolutional network is applied to image classification, the basic target (sample) for the classification is the entire image. Thus, pooling can produce the main features of the feature map and reduce the amount of subsequent computation. Although the information on the accurate position of the features is lost during this process, their relative positions are nevertheless retained, which ensures the normal operation of the subsequent computational steps. However, in image segmentation, the basic target (sample) for the classification is an individual pixel, whose exact location must be mapped by the eigenvalue. Therefore, the major advantage of our model is its ability to preserve the spatial location of the eigenvalue, which makes it possible to remove the deconvolution adopted by the traditional FCN. Accordingly, the amount of computation is reduced. Further, the loss in precision due to positioning error is reduced, as the accurate position of the eigenvalue is kept.
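The second and third aspects can be sketched as follows. This is a simplified illustration, not the HSCNN implementation: it assumes zero padding at the image border (the border handling is not specified here), a single kernel of each type with hypothetical averaging weights, and a toy 3 × 3 image with four bands.

```python
def conv_same(image, kernel):
    """Apply one k x k x bands kernel; the output keeps the image's height and width.

    The result for each pixel block is assigned to the block's central pixel,
    so no pooling or upsampling is needed to recover pixel positions.
    Pixels outside the image are treated as zero (an assumption of this sketch).
    """
    h, w, bands = len(image), len(image[0]), len(image[0][0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - half, c + dc - half
                    if 0 <= rr < h and 0 <= cc < w:
                        for b in range(bands):
                            acc += kernel[dr][dc][b] * image[rr][cc][b]
            out[r][c] = acc
    return out


# Toy 3 x 3 image with 4 spectral bands, all values 1.0 (assumed data).
image = [[[1.0] * 4 for _ in range(3)] for _ in range(3)]

kernel_1x1 = [[[0.25] * 4]]                                      # per-pixel spectral feature
kernel_3x3 = [[[1.0 / 36] * 4 for _ in range(3)] for _ in range(3)]  # neighborhood feature

point = conv_same(image, kernel_1x1)
spatial = conv_same(image, kernel_3x3)

# Each pixel keeps its own feature vector at its own location.
features = [[[point[r][c], spatial[r][c]] for c in range(3)] for r in range(3)]
print(features[1][1], features[0][0])
```

Because both outputs have the same height and width as the input, every feature value is tied to exactly one pixel, which is the property that removes the need for deconvolution.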

Comparison with SegNet and Analysis
SegNet is founded on the FCN model. Its main strength lies in the search for and extraction of the rich details of an image by deep convolution, and its behavior changes markedly when the target objects contain relatively few pixels. If the target objects contain only a few pixels or even one pixel, deep convolution does not generate more details and may introduce more noise owing to the expanded field of view, affecting the determination of the pixel type.
In the GF-2 remote sensing images, the edge and interior of the winter wheat plantation are very different in composition, which makes it more difficult for SegNet to locate the common features because its structure contains a single convolutional network. In comparison, the HSCNN is equipped with two sub-networks of different depths and is adaptable to the characteristics of both the edge and the interior. It also uses two different sizes of kernel, which are capable of uncovering the spatial relation between the pixels and the information embedded in the pixels.
As shown in Figure 3, the segmentation results of HSCNN and SegNet are nearly identical for the interior of the winter wheat field. SegNet, however, produces prominent errors at the edge of the field, while HSCNN does not.
Both HSCNN and SegNet use classifiers to generate the probability distribution of the classes, and consider the class with the maximum probability (max) in the distribution as the type to which the pixel belongs. Clearly, a larger difference between the max and the background implies a higher separability of the pixels and more reliable results. The probability differences given by the HSCNN and SegNet models for the inner wheat and edge wheat classes are presented in Figures 4 and 5, respectively. It is clear in Figure 4 that HSCNN and SegNet lead to significant probability differences for many pixels in the interior, which demonstrates the high separability of this region and the strength of CNN. In the probability distribution in Figure 5, fewer pixels are noted as having large probability differences in both the HSCNN and SegNet; nevertheless, the number remains quite high for the HSCNN, whereas SegNet exhibits reduced performance.
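The probability difference can be computed as in the sketch below. It assumes one reading of the definition above, namely the maximum class probability minus the background probability at each pixel, and a hypothetical class order (background, inner wheat, edge wheat); the probability vectors are made up for illustration.

```python
def probability_difference(probs, background=0):
    """Largest class probability minus the background probability for one pixel."""
    return max(probs) - probs[background]


def histogram(values, bins=5):
    """Count values in [0, 1] falling into `bins` equal-width intervals."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


# Assumed per-pixel probability vectors: (background, inner wheat, edge wheat).
pixels = [
    [0.05, 0.90, 0.05],  # confidently wheat: large difference
    [0.10, 0.30, 0.60],
    [0.40, 0.45, 0.15],  # barely separated from background
    [0.70, 0.20, 0.10],  # background is the maximum: difference 0
]
diffs = [probability_difference(p) for p in pixels]
print(diffs)
print(histogram(diffs))
```

A distribution concentrated in the upper bins corresponds to highly separable pixels, which is the comparison drawn between the interior and edge regions.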

Comparison with DeepLab and Analysis
Compared with the FCN and SegNet, DeepLab has significant improvements in two aspects: (1) the deconvolution; and (2) the refinement of the boundary of the segmentation result by fully connected CRFs. These two improvements are beneficial for the segmentation of individual objects covering numerous pixels. Based on the published literature, DeepLab displays a higher segmentation accuracy at the boundary than the FCN and SegNet, because it better utilizes the detailed information contained in the image and the large-scale spatial correlation between the pixels. However, in its application to winter wheat identification, the strength of DeepLab is not fully realized, because the details within a pixel block of the winter wheat plantation do not change significantly. Therefore, less information is available to the model, and the spatial correlation within the farmlands and woods is not strong over large regions.
As mentioned in Section 4.2, the HSCNN fully utilizes the characteristics of the pixels and the spatial relation between them. Therefore, it is well adapted to the data characteristics of the winter wheat plantation. Further, it effectively avoids the deficiencies of DeepLab and maintains the accuracy of segmentation.
As in Section 4.2, the probability differences of the HSCNN and DeepLab models for the inner wheat and edge wheat classes are displayed in Figures 6 and 7, respectively. It is clear in Figure 6 that both models produce large probability differences for many pixels in the interior. In the probability distribution of Figure 7, a considerable number of pixels still display large probability differences after the HSCNN processing, whereas DeepLab shows a much poorer performance (even lower than SegNet), which again supports the notion that atrous convolution is not suitable for farmlands.

Benefits of Using the Proposed Approach to Classify Land Use
Accurate land use classification is of tremendous importance in scientific research and agriculture, and the use of remote sensing data for this purpose is increasingly common. When a CNN is used and the strength of convolution in feature extraction is fully exploited, designing a CNN architecture adapted to the features of remote sensing images is the key to land use classification.
We have taken the feature difference between the edge pixels and the inner pixels in the winter wheat planting area into full consideration, which significantly improves the extraction accuracy of the edge pixels. Compared with earlier research, the model presented in this paper has the following advantages.
Firstly, two types of kernels were used in the convolution of the model, which allowed the full utilization of the strength of convolution in the extraction of spatial semantics and made appropriate use of the rich information contained in the pixels of the remote sensing images, thus achieving a more accurate segmentation.
Secondly, pooling layers were not used in the model. Although the speed of feature aggregation was consequently reduced, the information on the exact location to which an eigenvalue corresponds was retained, thereby effectively mitigating the loss in accuracy due to the positioning error of the deconvolution and improving the overall effect of the segmentation.
The model presented in this paper provides a solution for the edge extraction problem in the segmentation of the winter wheat plantation using GF-2 images. It has an important role to play in agricultural surveys and enhances their efficiency. This model has been utilized by the Department of Agriculture and the Meteorological Bureau of Shandong Province, China.

Conclusions
This paper presents a novel approach for the extraction of the winter wheat distribution from GF-2 remote sensing images. Compared with the two typical deep learning-based approaches, the extraction accuracy is clearly improved. Our approach combines the segmentation and classification stages, taking the accuracy as the only constraint, and achieves high-quality classification in an end-to-end way. The ground-truth (GT) classes of ground objects are taken as the supervised information that guides both the feature extraction and the region generation. Taking into account the significant differences between interior pixels and edge pixels in the planting area, different convolution structures were used to extract the features of the edge and interior pixels, focusing on the common features within each of the two subclasses for more effective model training, and a high-resolution class prediction was obtained.
Our model is still limited in many aspects, and further improvements could be made in the following two areas: (1) The current encoder uses a relatively simple regression algorithm for encoding; thus, a regression that can express the complex relationship between the eigenvalues needs to be explored. (2) A new pooling method, which allows for expedited feature aggregation without the loss of the spatial information of the eigenvalues, should be established. We will continue our work in the future to improve the current model and obtain better classification performance.

Figure 2. Example of an image-label pair.

Figure 4. Distribution of the probability differences for the inner wheat pixels.

Figure 5. Distribution of the probability differences for the edge wheat pixels.

Figure 6. Distribution of the probability differences for the inner wheat pixels.

Figure 7. Distribution of the probability differences for the edge wheat pixels.

Table 1. Confusion matrix of the SegNet approach for Figure 4.

Table 2. Confusion matrix of the DeepLab approach for Figure 4.

Table 3. Confusion matrix of our HSCNN approach for Figure 4.