An End-to-End and Localized Post-Processing Method for Correcting High-Resolution Remote Sensing Classification Result Images

Abstract: Since the result images obtained by deep semantic segmentation neural networks are usually not perfect, especially at object borders, the conditional random field (CRF) method is frequently utilized in the result post-processing stage to obtain the corrected classification result image. The CRF method has achieved many successes in the field of computer vision, but when it is applied to remote sensing images, overcorrection phenomena may occur. This paper proposes an end-to-end and localized post-processing method (ELP) to correct the result images of high-resolution remote sensing image classification methods. ELP has two advantages. (1) End-to-end evaluation: ELP can identify which locations of the result image are highly suspected of having errors without requiring samples. This characteristic allows ELP to be adapted to an end-to-end classification process. (2) Localization: Based on the suspect areas, ELP limits the CRF analysis and update area to a small range and controls the iteration termination condition. This characteristic avoids the overcorrections caused by the global processing of the CRF. In the experiments, ELP is used to correct the classification results obtained by various deep semantic segmentation neural networks. Compared with traditional methods, the proposed method more effectively corrects the classification result and improves classification accuracy.


Introduction
With the advent of high-resolution satellites and drone technologies, an increasing number of high-resolution remote sensing images have become available, making automated processing technology increasingly important for utilizing these images effectively [1]. High-resolution remote sensing image classification methods, which automatically provide category labels for objects in these images, are playing an increasingly important role in land resource management, urban planning, precision agriculture, and environmental protection [2]. However, high-resolution remote sensing images usually contain detailed information and exhibit high intra-class heterogeneity and inter-class homogeneity characteristics, which are challenging for traditional shallow-model classification algorithms [3]. To improve classification ability, deep learning technology, which can extract higher-level features in complex data, has been widely studied in the high-resolution remote sensing classification field in recent years [4].
Deep semantic segmentation neural networks (DSSNNs) are constructed based on convolutional neural networks (CNNs); the input of these models is an image patch, and the output is the corresponding category label map. As shown in Figure 1, the end-to-end classification strategy is usually adopted for DSSNNs' training and classification. During the training stage, a set of remote sensing images ImageSet = {I1, I2, …, In} is adopted and manually interpreted into a ground truth set GroundTruthSet = {Igt1, Igt2, …, Igtn}; then, the GroundTruthSet is separated into patches to construct the training dataset. The classification model Mendtoend is obtained based on this training dataset. During the classification stage, the classification model is utilized to classify a completely new remote sensing image Inew (not an image from ImageSet). This strategy achieves a higher degree of automation; the classification process has no relationship with the training data or the training algorithm, and newly obtained or other images in the same area can be classified automatically with Mendtoend, forming an input-to-output/end-to-end structure. Thus, this strategy is more valuable in practical applications when massive amounts of remote sensing data need to be processed quickly.
However, the classification results of the end-to-end classification strategy are usually not "perfect", and they are affected by two factors. On the one hand, because the training data are constructed by manual interpretation, it is difficult to provide training ground truth images that are precise at the pixel level (especially at the boundaries of ground objects). Moreover, the incorrectly interpreted areas of these images may even be amplified through the repetitive training process [16]. On the other hand, during data transfer among the neural network layers, along with obtaining high-level spatial features, some spatial context information may be lost [35]. Therefore, the classification results obtained by the end-to-end classification strategy may contain many flaws, especially at ground object boundaries.

To correct these flaws, in the computer vision research field, the conditional random field (CRF) method is usually adopted in the post-processing stage to correct the result image. The conditional random field can be defined as follows:

P(X | F) = (1/Z(F)) exp(−∑c∈Cg φc(Xc, F)) (1)

where F is a set of random variables {F1, F2, …, FN}; Fi is a pixel vector; X is a set of random variables {x1, x2, …, xN}, where xi is the category label of pixel i; Z(F) is a normalizing factor; and c is a clique in a set of cliques Cg, where g induces a potential φc [23,24]. By calculating Equation (1), the CRF adjusts the category label of each pixel and achieves the goal of correcting the result image. The CRF is highly effective at processing images that contain only a small number of objects. However, the numbers, sizes, and locations of objects in remote sensing images vary widely, and the traditional CRF tends to perform a global optimization of the entire image. This process leads to some ground objects being excessively enlarged or reduced.
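The behavior of Equation (1) can be illustrated with a deliberately tiny, self-contained sketch. All values below are hypothetical: a 1-D "image" of three pixels, one band value per pixel as Fi, two candidate labels, and a clique set Cg containing only the two neighboring-pixel pairs (a real CRF post-processor uses dense pairwise potentials over the full image).

```python
import itertools
import math

# Hypothetical toy problem: three pixels, two labels, pairwise cliques.
PIXELS = [0.1, 0.15, 0.9]        # band values F_1..F_3 (made up)
LABELS = (0, 1)
CLIQUES = [(0, 1), (1, 2)]       # pairwise cliques in C_g

def unary(i, label):
    # Data term: a pixel prefers the label closest to its band value.
    return abs(PIXELS[i] - label)

def phi(c, x):
    # Pairwise potential phi_c: penalize neighbors with different labels.
    i, j = c
    return 0.0 if x[i] == x[j] else 0.5

def energy(x):
    return (sum(unary(i, x[i]) for i in range(len(PIXELS)))
            + sum(phi(c, x) for c in CLIQUES))

# Z(F): sum of exp(-energy) over every possible labeling X.
Z = sum(math.exp(-energy(x)) for x in itertools.product(LABELS, repeat=3))

def prob(x):
    """P(X|F) in the spirit of Equation (1)."""
    return math.exp(-energy(x)) / Z

best = max(itertools.product(LABELS, repeat=3), key=prob)  # MAP labeling
```

With these hypothetical potentials, the smoothness term pulls the middle pixel toward its neighbors while the data term keeps the bright third pixel in its own class, which is the trade-off the CRF negotiates at every object boundary.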
Furthermore, if the different parts of ground objects that are shadowed or not shadowed are processed in the same manner, the CRF result will contain more errors [31]. In our previous work, we proposed a method called the restricted conditional random field (RCRF) to handle this situation [31]. Unfortunately, the RCRF requires samples to control its iteration termination and to produce an integrated output image. When integrated into the classification process, the need for samples causes the whole classification process to lose its end-to-end characteristic; thus, the RCRF cannot be integrated into an end-to-end process. In summary, to address the above problems, the traditional CRF method needs to be improved with the following characteristics: (1) End-to-end result image evaluation: Without requiring samples, the method should be able to automatically identify which areas of a classification result image may contain errors. By identifying areas that are strongly suspected of being misclassified, we can limit the scope of CRF processing and analysis.
(2) Localized post-processing: The method should be able to transform the entire image post-processing operation into local corrections and separate the various objects or different parts of objects (such as roads in shadow or not in shadow) into sub-images to alleviate the negative impacts of differences in the number, size, location, and brightness of objects.
To achieve this goal, a new mechanism must be introduced to improve the traditional CRF algorithm for remote sensing classification results post-processing.

End-to-End Result Image Evaluation and Localized Post-Processing
The majority of evaluation methods for classification results require samples with category labels that allow the algorithm to determine whether the classification result is good; however, to achieve an end-to-end classification results evaluation, samples cannot be required during the evaluation process. In the absence of testing samples, although it is impossible to accurately indicate which pixels are incorrectly classified, we can still find some areas that are highly suspected of having classification errors by applying some conditions. Therefore, we need to establish a relation between the remote sensing image and the classification result image and find the areas where the colors (bands) of the remote sensing image are consistent, but the classification results are inconsistent; these are the areas that may belong to the same object but are incorrectly classified into different categories. Such areas are strong candidates for containing incorrectly classified pixels. Furthermore, we try to correct these errors within a relatively small area.
To achieve the above goals, for a remote sensing image Iimage and its corresponding classification image Icls, the methods proposed in this paper are illustrated in Figure 2. As shown in Figure 2, we use four steps to perform localized correction:

(1) Remote sensing image segmentation. We segment the remote sensing image based on color (band value) consistency. In this paper, we adopt the simple linear iterative clustering (SLIC) algorithm as the segmentation method. The algorithm initially contains k clusters. Each cluster is denoted by Ci = {li, ai, bi, xi, yi}, where li, ai, and bi are the color values of Ci in CIELAB color space, and xi, yi are the center coordinates of Ci in the image. For two clusters Ci and Cj, the SLIC algorithm compares color and space distances simultaneously, as follows:

distancecolor = sqrt((li − lj)² + (ai − aj)² + (bi − bj)²) (2)

distancespace = sqrt((xi − xj)² + (yi − yj)²) (3)

Based on these two distances, the distance between the two clusters is:

D = sqrt((distancecolor/Ncolor)² + (distancespace/Nspace)²) (4)

where Ncolor is the maximum color distance and Nspace is the maximum position distance. The SLIC algorithm uses the iterative mechanism of the k-means algorithm to gradually adjust the cluster position and the cluster to which each pixel belongs, eventually obtaining Nsegment segments [36]. The advantage of the SLIC algorithm is that it can quickly and easily cluster adjacent similar regions into a segment; this characteristic is particularly useful for finding adjacent areas that have a consistent color (band value). For Iimage, the SLIC algorithm is used to obtain the segmentation result ISLIC. In each segment in ISLIC, the pixels are assigned the same segment label.
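The combined color-space distance used by SLIC can be sketched directly from the cluster definition Ci = {li, ai, bi, xi, yi} (a minimal illustration; in practice a library implementation such as skimage.segmentation.slic would perform the full clustering):

```python
import math

def slic_distance(ci, cj, n_color, n_space):
    """Combined SLIC distance between clusters Ci = (l, a, b, x, y) and Cj.

    n_color and n_space are the maximum color and position distances used
    to put the two terms on a comparable scale."""
    l_i, a_i, b_i, x_i, y_i = ci
    l_j, a_j, b_j, x_j, y_j = cj
    # Color distance in CIELAB space.
    d_color = math.sqrt((l_i - l_j)**2 + (a_i - a_j)**2 + (b_i - b_j)**2)
    # Spatial distance between cluster centers.
    d_space = math.sqrt((x_i - x_j)**2 + (y_i - y_j)**2)
    # Normalized combination of the two distances.
    return math.sqrt((d_color / n_color)**2 + (d_space / n_space)**2)
```

For example, two clusters with identical CIELAB color but centers 5 pixels apart, with Nspace = 5, are exactly one normalized unit apart.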
(2) Create a list of segments with suspicious degree evaluations. For all the segments in ISLIC, a suspicion evaluation list HList = {h1, h2, …, hn} is constructed, where hi is a set hi = {hidi, hpixelsi, hreci, hspci}; hidi is a segment label; hpixelsi holds the locations of all the pixels in the segment; hreci is the location and size of the enclosing frame rectangle of hpixelsi; and hspci is a suspicious evaluation value, which is either 0 or 1: a "1" means that the pixels in the segment are suspected of being misclassified, and a "0" means that the pixels in the segment are likely correctly classified. The algorithm to construct the suspicious degree evaluation list is as follows (SuspiciousConstruction algorithm):

Algorithm SuspiciousConstruction
Input: ISLIC
Output: HList
Begin
    HList = an empty list;
    foreach (segment label i in ISLIC) {
        hidi = i;
        hpixelsi = locations of all the pixels in corresponding segment i;
        hreci = the location and size of hpixelsi's enclosing frame rectangle;
        hspci = 0;
        add hi = {hidi, hpixelsi, hreci, hspci} to HList;
    }
    return HList;
End

In the SuspiciousConstruction algorithm, by default, each segment's hspci = 0 (the hypothesis is that no misclassified pixels exist in the segment).
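The SuspiciousConstruction step can be sketched in Python as follows. This is a minimal reconstruction: the dictionary field names mirror the paper's hi = {hidi, hpixelsi, hreci, hspci} notation, and encoding the enclosing rectangle as (top, left, bottom, right) is an assumption of this sketch.

```python
import numpy as np

def suspicious_construction(i_slic):
    """Build HList from a 2-D array of SLIC segment labels.

    Each entry stores the segment id (hid), its pixel locations (hpixels),
    its enclosing rectangle (hrec, here as (top, left, bottom, right)),
    and a suspicion flag (hspc) initialized to 0, i.e., the segment is
    assumed correctly classified until evaluated."""
    hlist = []
    for seg_id in np.unique(i_slic):
        rows, cols = np.nonzero(i_slic == seg_id)
        hlist.append({
            "hid": int(seg_id),
            "hpixels": list(zip(rows.tolist(), cols.tolist())),
            "hrec": (int(rows.min()), int(cols.min()),
                     int(rows.max()), int(cols.max())),
            "hspc": 0,
        })
    return hlist
```

On a 2 x 3 label array with two segments, this yields one HList entry per segment with its pixels, bounding rectangle, and a zero suspicion flag.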
(3) Analyze the suspicious degree. As shown in Figure 2, for a segment, the hspci value can be calculated based on the segment's pixels in Icls; hi's corresponding pixels can be grouped as SP = {sp1, sp2, …, spm}, where spi is the number of pixels belonging to category i, and the inconsistency degree of SP can be described using the following formula:

inconsistency(SP) = 1 − max(sp1, …, spm) / (sp1 + sp2 + … + spm) (5)

Based on this formula, the value of hspci can be expressed as follows:

hspci = 1 if inconsistency(SP) > α; hspci = 0 otherwise (6)

where α is a threshold value (the default is 0.05). When a segment's hspci = 0, the segment's corresponding pixels in Icls essentially all belong to the same category, which indicates that the pixels' features are consistent in both Iimage (band value) and Icls (category label); in this case, the segment does not need correction by the CRF. In contrast, when a segment's hspci = 1, the pixels of the segment in Icls belong to different categories, but the pixels' colors (band values) are comparatively consistent in Iimage; this case may be further subdivided into two situations: A. Some type of classification error exists in the segment (such as the classification result deformation problem appearing on an object boundary).
B. The classification result is correct, but the segment crosses a boundary between objects in I SLIC (for example, the number of segments assigned to the SLIC algorithm is too small, and some areas are undersegmented).
In either case, we need to be suspicious of the corresponding segment and attempt to correct mistakes using the CRF. Based on Formulas (5) and (6), the SuspiciousEvaluation algorithm analyzes Icls using HList: for each segment, it counts the category labels of the segment's pixels in Icls, computes the inconsistency degree, and sets hspci accordingly. By applying the SuspiciousEvaluation algorithm, we can identify which segments are suspicious and require further post-processing.
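The SuspiciousEvaluation step can be sketched in Python as below. The field names follow the paper's notation; the inconsistency measure implemented here, the share of pixels outside the segment's majority category, is one natural reading of Formula (5), so treat the exact form as an assumption of this sketch.

```python
import numpy as np

def inconsistency(sp):
    """Share of a segment's pixels that fall outside its majority category."""
    return 1.0 - max(sp) / sum(sp)

def suspicious_evaluation(hlist, i_cls, alpha=0.05):
    """Group each segment's pixels in I_cls by category (SP = {sp_1..sp_m})
    and flag the segment (hspc = 1) when the labels disagree by more than
    the threshold alpha."""
    for h in hlist:
        labels = [int(i_cls[r, c]) for r, c in h["hpixels"]]
        counts = np.bincount(labels)
        sp = counts[counts > 0]          # pixel counts per category present
        h["hspc"] = 1 if inconsistency(sp) > alpha else 0
    return hlist
```

A segment whose four pixels split 3-to-1 across two categories has an inconsistency of 0.25, above the default alpha = 0.05, so it is flagged for correction.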
(4) Localized correction. As shown in Figure 2, for a segment hi that is suspected of containing misclassified pixels, the post-processing strategy can be described as follows. First, based on hreci, create a cut rectangle cf (cf = rectangle hreci enlarged by β pixels), where β is the number of pixels by which to enlarge and the default value is 10. Second, use cf to cut sub-images from Iimage and Icls to obtain Subimage and Subcls. Third, input Subimage and Subcls to the CRF algorithm, and obtain a corrected classification result Corrected-Subcls. Finally, based on the pixel locations in hpixelsi, obtain the pixels from Corrected-Subcls and write them to Icls, which constitutes a localized area correction on Icls. For the entire Icls, the localized correction algorithm is as follows (LocalizedCorrection algorithm): By applying the LocalizedCorrection algorithm, Icls is corrected segment by segment through the CRF algorithm.
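The crop-correct-paste logic of LocalizedCorrection can be sketched as below. The CRF itself is abstracted as an injected callable (crf_correct) so the sketch stays self-contained; the (top, left, bottom, right) rectangle encoding and the clipping behavior at image borders are assumptions of this sketch.

```python
import numpy as np

def localized_correction(hlist, i_image, i_cls, crf_correct, beta=10):
    """Correct I_cls segment by segment.

    crf_correct(sub_image, sub_cls) stands in for the CRF step: any
    callable returning a corrected sub-classification of the same shape.
    Only segments flagged as suspicious (hspc == 1) are processed, and
    only the segment's own pixels are written back."""
    h_img, w_img = i_cls.shape
    for h in hlist:
        if h["hspc"] != 1:
            continue
        top, left, bottom, right = h["hrec"]
        # Enlarge the enclosing rectangle cf by beta pixels, clipped to image.
        t, l = max(0, top - beta), max(0, left - beta)
        b, r = min(h_img, bottom + 1 + beta), min(w_img, right + 1 + beta)
        sub_img = i_image[t:b, l:r]
        sub_cls = i_cls[t:b, l:r].copy()
        corrected = crf_correct(sub_img, sub_cls)
        # Write back only the pixels that belong to this segment.
        for rr, cc in h["hpixels"]:
            i_cls[rr, cc] = corrected[rr - t, cc - l]
    return i_cls
```

As a usage example, a stub "CRF" that simply assigns the majority label of the sub-image flattens a lone misclassified pixel inside a uniform region.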

Overall Process of the End-to-End and Localized Post-Processing Method
Based on the four steps and algorithms described in the preceding subsection, we can evaluate the classification result image without requiring testing samples and correct it within local areas. By integrating these algorithms, we propose an end-to-end and localized post-processing method (ELP) whose input is a remote sensing image Iimage and a classification result image Icls, and whose output is the corrected classification result image. Through this iterative and progressive correction process, the quality of Icls is improved. The process of ELP is shown in Figure 3.

As Figure 3 shows, the ELP method is a step-by-step iterative correction process that requires a total of γ iterations to correct the Icls content. Before beginning the iterations, the ELP method obtains the segmentation result image:

ISLIC = SLIC(Iimage) (7)

Then, it evaluates the segments and constructs the suspicion evaluation list for the segments:

HList = SuspiciousConstruction(ISLIC) (8)

In each iteration, ELP updates HList to obtain HList^η and outputs a new classification result image Icls^η, where η is the iteration index (in the range [1, γ]). The η-th iteration's output is:

(HList^η, Icls^η) = LocalizedCorrection(SuspiciousEvaluation(HList^(η−1), Icls^(η−1)), Iimage, Icls^(η−1)) (9)

When η = 1, HList^0 = HList and Icls^0 = Icls; when η ≥ 2, the current iteration result depends on the result of the previous iteration. Based on the above two formulas, the ELP algorithm updates HList and Icls in each iteration; HList indicates suspicious areas, and these areas are corrected and stored in Icls. As the iterations progress, the final corrected result is the output of the last iteration, Icls^γ. Through the above process, ELP achieves both of the desired goals: end-to-end result image evaluation and localized post-processing.
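The overall iterative process can be condensed into a short driver loop. In this sketch the four steps are injected as callables so the control flow stands on its own; the function names and signatures are illustrative, not the paper's exact interfaces.

```python
import numpy as np

def elp(i_image, i_cls, segment, construct, evaluate, correct, gamma=10):
    """Overall ELP loop (sketch).

    segment   : I_image -> I_SLIC          (run once, before iterating)
    construct : I_SLIC  -> HList           (initial suspicion list)
    evaluate  : (HList, I_cls) -> HList    (update hspc flags)
    correct   : (HList, I_image, I_cls) -> I_cls  (localized CRF corrections)
    gamma     : total number of iterations."""
    i_slic = segment(i_image)          # I_SLIC, computed once
    hlist = construct(i_slic)          # initial HList, all hspc = 0
    for _ in range(gamma):             # eta = 1 .. gamma
        hlist = evaluate(hlist, i_cls)
        i_cls = correct(hlist, i_image, i_cls)
    return i_cls                       # corrected result of the last iteration
```

With identity stubs for the four steps, the loop leaves the classification image unchanged, which makes the control flow easy to verify in isolation.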

Experiments
We implemented all the code in Python 3.6; the CRF algorithm was implemented based on the PyDenseCRF package. To analyze the correction effect on the deep semantic segmentation models' classification result images, this study introduces images from Vaihingen and Potsdam in the "semantic labeling contest of the ISPRS WG III/4 dataset" as the two test datasets.

Method Implementation and Study Images
We introduced five commonly used DSSNN models as testing targets: FCN8s, FCN16s, FCN32s, SegNET, and U-Net. All these deep models are implemented using the Keras package, and all the models take a 224 × 224 image patch as input and output a corresponding semantic segmentation result. The five image files from Vaihingen were selected as the study images and are listed in Table 1. All five images contain five categories: impervious surfaces (I), buildings (B), low vegetation (LV), trees (T), and cars (C). These images have three spectral bands (near-infrared (NIR), red (R), and green (G)). The study images and their corresponding ground truth images are shown in Figure 4.
We selected study images 1 and 3 as training data and used all the images as test data. Study images 1 and 3 and their ground truth images were cut into 224 × 224 image patches with 10-pixel intervals; all the patches were stacked into a training set, and all the deep semantic segmentation models were trained on this training set.
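The sliding-window patch extraction described above can be sketched as follows (an illustrative helper, not the authors' code; the same scheme is applied to each image and its ground truth so that patch pairs stay aligned):

```python
import numpy as np

def extract_patches(image, patch=224, stride=10):
    """Cut an H x W (x C) array into patch x patch tiles with the given stride.

    Windows that would extend past the image border are skipped, matching a
    simple sliding-window scheme; the paper uses patch=224 and a 10-pixel
    interval between windows."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return patches
```

For example, a 20 × 20 image tiled with patch = 8 and stride = 4 yields a 4 × 4 grid of overlapping patches.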
This study used two methods to compare correction ability: (1) CRF: We compared our method with the traditional CRF method. For each classification result image, the CRF was executed 10 times, and each time, the corresponding correct-distance parameters were set to 10, 20, . . . , 100. Since all the study images have corresponding ground truth images, the CRF algorithm selects the result with the highest accuracy among the 10 executions.
(2) ELP: Using the proposed method, the threshold value parameter α was set to 0.05, and the CRF's correct-distance parameter in the LocalizedCorrection algorithm was set to 10. The number of ELP iterations was set to 10. Since ELP emphasizes "end-to-end" ability, no ground truth is needed to analyze the true classification accuracy during the iteration process; therefore, ELP directly outputs the result of the last iteration as the corrected image.

Classification Results of Semantic Segmentation Models
We used all five deep semantic segmentation models as end-to-end classifiers to process five study images. The classification results are illustrated in Figure 5.
As shown in Figure 5, because study images 1 and 3 are used as training data (the deep neural network is sufficiently large to "remember" these images), the classification results of the five models for study images 1 and 3 are close to "perfect": almost all the ground objects and boundaries are correctly identified. However, because just two training images cannot exhaustively represent all the boundary and object characteristics, these models cannot perfectly process study images 2, 4, and 5; there are obvious defects and boundary deformations, and many objects are misclassified in large areas. Based on the ground truth images, the classification accuracies of these result images are listed in Table 2. Table 2 shows that because study images 1 and 3 are training images, all five models' classification accuracies on these two images are above 95%, which is a satisfactory result. However, on study images 2, 4, and 5, due to the many errors on ground objects and boundaries, all five models' classification accuracies degrade to approximately 80%. Therefore, it is necessary to introduce a correction mechanism to correct the boundary errors in these images.


Comparison of the Correction Characteristics of ELP and CRF
To compare the correction characteristics of the ELP and CRF, this section uses U-Net's classification result for test image 5 and applies ELP and CRF to correct a subarea of the result image. The detailed results of the two algorithms with regard to iterations (ELP) and execution (CRF) are as shown in Figure 6.

As shown in Figure 6a, for the sub-image of test image 5, the classification result obtained by U-Net is far from perfect, and the boundaries of objects are blurred or chaotic. Especially at locations A, B, and C (marked by the red circles), the buildings are confused with impervious surfaces, and the buildings contain large holes or misclassified parts. On this sub-image, the classification accuracy of U-Net is only 79.02%.

Figure 6b shows the results of the 10 ELP iterations. As the method iterates, the object boundaries are gradually refined, and the errors at locations A and B are gradually corrected. By the 5th iteration, the hole at location A is completely filled, and by the 7th iteration, the errors at location B are also corrected. For location C, because our algorithm follows an end-to-end process, no samples exist in this process to determine which part of the corresponding area is incorrect; therefore, location C is not significantly modified during the iterations. Nevertheless, the initial classification error is not enlarged. As the iterations continue, the resulting images change little from the 7th to 10th iterations, and the algorithm's result becomes stable.

Figure 6c shows the results of the CRF. From executions 1 to 3, it can be seen that the CRF can also perform boundary correction. After the 4th execution, the errors at locations A and B are corrected. It can also be seen that at location C, part of the correctly classified building roof was modified into impervious surfaces, further exaggerating the errors.
The reason for this outcome is that at location C, for the corresponding roof color, the correctly classified part is smaller than the misclassified part. In the global correction context, the CRF algorithm more easily replaces relatively small parts. At the same time, as the iteration progresses, errors gradually appear due to the CRF's correction process (as marked in orange); some categories that were originally not dominant in an area (such as trees and cars) experience large decreases, and the classification accuracy continues to decrease with further iterations.
Based on the ground truth image, we evaluate the classification accuracy of the two methods after each iteration/execution, as shown in Table 3.

Table 3. Comparison of the two algorithms by iteration/execution (classification accuracy at each iteration/execution, %).

As seen from Table 3, compared to the original classification result, whose accuracy is 79.02%, ELP's classification accuracy increases to 80.93% after the first iteration (the lowest among its ten iterations), and it reaches 85.81% by the 8th iteration, a classification accuracy improvement of 6.79%. The CRF's classification accuracy is 79.41% after the first execution, and it reaches its highest accuracy of 83.76% in the 4th execution; subsequently, the classification accuracy gradually declines during the remaining executions, falling to 73.86% by the 10th execution. Overall, the CRF reduced the accuracy by 5.16% compared with the original classification image. A graphical comparison of the classification accuracy of the two methods is shown in Figure 7.
In Figure 7, the black dashed line indicates the original classification accuracy of 79.02%. The CRF's accuracy improvement was slightly better than that of ELP in the second, third, and fourth executions; however, its classification accuracy decreases rapidly in the later executions, and by the ninth execution, it is lower than that of the original classification result image. In contrast, the classification accuracy of ELP increases steadily, and after approaching its highest accuracy, the accuracy remains relatively stable in subsequent iterations. From the above results, ELP not only achieves a better correction result but also avoids the obvious classification accuracy reductions caused by performing too many iterations. In end-to-end application scenarios, where no samples participate in the result evaluation, we cannot know when the highest correction accuracy has been reached; thus, the ideal termination condition is also unknown.
For the CRF, specifying a too-small distance parameter will cause under-correction, while a too-large parameter will cause over-correction. The relatively stable behavior and greater accuracy of ELP clearly allow it to achieve better processing results than those of the CRF.

Correction Results Comparison
The correction results of the CRF method are shown in Figure 8. As shown in Figure 8, for images 1 and 3, because the classification accuracy is relatively high, less room exists for correction, and the resulting images change only slightly. For images 2, 4, and 5, although the large errors and holes are corrected, numerous incorrect borders are present, wrongly classified pixels appear at shadowed parts of the ground objects, and many small objects (e.g., cars or trees) are erased by larger objects (e.g., impervious surfaces or low vegetation). For ELP, the correction results are shown in Figure 9.
As Figure 9 shows, ELP also corrects the large errors and holes but does not produce overcorrection errors in the shadowed parts of the ground objects, and small objects are not erased. Therefore, in general, the correction results of ELP are better than those of CRF. Correction accuracy comparisons of the two algorithms are shown in Table 4.

As shown in Table 4, for study images 1 and 3, because the original classification accuracy is high, the correction results of CRF and ELP are similar to the original classification result, and the improvements are limited. On study images 2, 4, and 5, ELP's average improvements are 6.78%, 7.09%, and 5.83%, respectively, while the corresponding CRF improvements are only 2.84%, 2.88%, and 2.74%. Thus, ELP's correction ability is significantly better than that of CRF.

Test Images and Methods
This study introduces four images from the Potsdam dataset, which are listed in Table 5. We selected two images as training data and the other two images as test data. Three bands (red (R), green (G), blue (B)) of the images were selected. These images contain six categories: impervious surfaces (I), buildings (B), low vegetation (LV), trees (T), cars (C), and clutter/background (C/B). The study images and their corresponding ground truth images are shown in Figure 10.

Image Name         Filename                 Size
Training image 1   top_potsdam_2_10_RGBIR   6000 × 6000
Training image 2   top_potsdam_3_10_RGBIR   6000 × 6000
Testing image 1    top_potsdam_2_12_RGBIR   6000 × 6000
Testing image 2    top_potsdam_3_12_RGBIR   6000 × 6000

To further evaluate ELP's ability, this paper compares four methods: (1) U-Net + CRF: Use U-Net to classify an image and use CRF to perform post-processing.
(2) U-Net + MRF: Use U-Net to classify an image and use the Markov random field (MRF) to perform post-processing.
(3) DeepLab: Adopt DeepLab v1 model; in the DeepLab v1, the model has a built-in CRF as the last processing component, and this model can obtain a more accurate boundary than a model without the CRF component.
(4) U-Net + ELP: Use U-Net to classify an image and use ELP to perform post-processing.

Process Results of Four Methods
For the two testing images, the final process results of the four methods are illustrated in Figure 11. As can be seen in Figure 11, because U-Net + CRF uses a global CRF processing strategy, there are many overcorrection areas, and some objects in the result image contain chaotic, wrongly classified pixels. For U-Net + MRF, the majority of noise pixels are removed, but the correction effects are not obvious. DeepLab's CRF is performed on image patches rather than the whole image, so its overcorrection phenomenon is less severe than that of U-Net + CRF. U-Net + ELP obtains the best classification among all of the methods. The classification accuracies of the four methods are presented in Table 6.
Table 6. Classification accuracies of the four methods.

As shown in Table 6, U-Net + MRF achieves the lowest classification accuracy, U-Net + CRF and DeepLab are higher than U-Net + MRF, and U-Net + ELP achieves the best classification accuracy.
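The overall classification accuracy compared throughout these tables is, presumably, the fraction of pixels whose predicted label matches the ground truth. A minimal NumPy sketch of that metric, under this assumption:

```python
import numpy as np

def overall_accuracy(pred, truth):
    """Fraction of pixels whose predicted class label matches the ground
    truth. Both inputs are 2-D integer label maps of the same shape."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    assert pred.shape == truth.shape, "label maps must have the same shape"
    return float((pred == truth).mean())

# Toy example: 3 of 4 pixels agree -> accuracy 0.75
print(overall_accuracy([[1, 2], [3, 4]], [[1, 2], [3, 0]]))  # 0.75
```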


Analysis of Computational Complexity
To analyze the computational complexity of the methods, we use the four methods to process testing image 1 and run each process five times. The experiments are performed on a computer (i9 9900K / 64 GB / RTX 2080 Ti 11 GB), and the average processing times are listed in Table 7. As shown in Table 7, because the U-Net model can make full use of the graphics processing unit (GPU), and the CRF and MRF process the whole image quickly, U-Net + CRF and U-Net + MRF obtain results in a short time. DeepLab performs the CRF after classifying each image patch, so it does not need a separate post-processing stage; however, the patch-based CRF process requires additional data access time and duplicates pixels at the patch borders, so its processing time is similar to that of U-Net + CRF and U-Net + MRF. Since U-Net + ELP adopts the same deep model, the processing time of its first three steps is similar to that of U-Net + CRF and U-Net + MRF, but its post-processing stage needs much more time than the other methods.
For the ELP algorithm, the HList is updated at each iteration, and the suspicious areas marked by the HList change constantly. Each suspicious area must be processed by the CRF method, so the processing complexity of ELP varies with the complexity of the image content. Although each suspicious area is small, ELP's greater number of iterations, localized processing, and result-image update mechanism introduce an additional computational burden, so ELP needs more processing time than traditional methods.

Analysis of Different Threshold Parameter Values of ELP
The threshold value α of ELP determines the choice of suspicious areas. To test the influence of this parameter, we chose U-Net + ELP to process testing image 1 and varied α from 0 to 0.09 with an interval of 0.01. The classification accuracy is shown in Table 8. As can be seen from Table 8, when α is less than 0.06, the classification accuracy does not change significantly; when α is 0.07 or larger, the classification accuracy decreases. The main reason for this phenomenon is that when α is small, ELP is more sensitive in discovering suspicious areas; however, too many suspicious areas merely increase the computational burden without contributing obvious changes in accuracy. In contrast, when α is larger, ELP has a diminished capability to discover suspicious areas, and many areas that need correction are omitted, which causes a decrease in accuracy. At the same time, Table 8 shows that over a large range (0.00 to 0.06), the accuracy of ELP does not change greatly, which reveals that ELP is stable with respect to the threshold value α.
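ELP's exact suspicion criterion is defined earlier in the paper; purely to illustrate how a threshold like α can gate the selection of suspicious segments, the sketch below flags a segment when the fraction of its pixels disagreeing with the segment's majority label exceeds α. This specific criterion is an assumption for illustration, not ELP's published rule.

```python
import numpy as np

def suspicious_segments(labels, segments, alpha=0.05):
    """Illustrative suspicion test (an assumption, not ELP's actual rule):
    flag a segment when the fraction of its pixels that disagree with the
    segment's majority label exceeds alpha."""
    labels = np.asarray(labels)
    segments = np.asarray(segments)
    flagged = []
    for seg_id in np.unique(segments):
        seg_labels = labels[segments == seg_id]
        counts = np.bincount(seg_labels)
        disagree = 1.0 - counts.max() / seg_labels.size
        if disagree > alpha:
            flagged.append(int(seg_id))
    return flagged

labels = np.array([[0, 0, 1], [0, 0, 1], [2, 1, 1]])
segments = np.array([[0, 0, 1], [0, 0, 1], [0, 1, 1]])
# Segment 0 pixels: [0,0,0,0,2] -> 20% disagree -> suspicious at alpha=0.05
# Segment 1 pixels: [1,1,1,1]   ->  0% disagree -> clean
print(suspicious_segments(labels, segments, alpha=0.05))  # [0]
```

A smaller α flags more segments (more sensitivity, more computation); a larger α lets genuinely mislabeled segments through, matching the behavior reported in Table 8.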

Analysis of Different Segmentation Number of ELP
The ELP method adopts the SLIC algorithm as its segmentation method, and an important parameter of SLIC is N_segment, which decides the number of segments the algorithm produces. When N_segment is assigned an overly small value, under-segmentation may appear; conversely, when N_segment is assigned an overly large value, over-segmentation may appear. To test the influence of this parameter on ELP, we varied N_segment from 1000 to 10,000 with an interval of 1000. The classification accuracy of testing image 1 by U-Net + ELP is shown in Table 9. It can be seen from Table 9 that when N_segment = 1000 to 3000, because the image is large (6000 × 6000) and the segment number is relatively small, the image is under-segmented, and each segment may contain pixels with different colors or brightness. This situation makes it difficult for ELP to focus on suspicious areas, and the classification accuracy is low. When N_segment = 9000 to 10,000, the image is obviously over-segmented, and the segments are too small. This leads ELP to have a small update size in its LocalizedCorrection algorithm and results in poor performance. For N_segment = 4000 to 8000, the classification accuracy of ELP does not vary greatly, which indicates that ELP does not have restrictive requirements for the segmentation parameter; as long as the segmentation method correctly separates regions with similar colors/brightness and the segment size is not too small, ELP can achieve satisfactory results.
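As a quick sanity check on these ranges, the average SLIC segment area is roughly the pixel count divided by N_segment; for the 6000 × 6000 study images:

```python
# Approximate average segment area (in pixels) for a 6000 x 6000 image
# at several of the N_segment values tested above.
width = height = 6000
for n_segment in (1000, 4000, 8000, 10000):
    avg_area = width * height / n_segment
    side = avg_area ** 0.5  # side length of an equivalent square segment
    print(f"N_segment={n_segment:6d}: ~{avg_area:9.0f} px (~{side:.0f} px square)")
```

At N_segment = 1000 an average segment spans roughly a 190 × 190-pixel patch (prone to mixing distinct objects), while at 10,000 it shrinks to about 60 × 60 pixels, consistent with the under- and over-segmentation behavior reported in Table 9.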

Conclusions
Deep semantic segmentation neural networks (DSSNNs) are powerful end-to-end remote sensing image classification tools that have achieved successes in many applications. However, it is difficult to construct a training set that thoroughly exhausts all the pixel segmentation possibilities for a specific ground area; in addition, spatial information loss occurs during the network training and inference stages. Consequently, the classification results of DSSNNs are usually not perfect, which introduces a need to correct the results.
Our experiments demonstrate that when faced with complicated remote sensing images, the CRF algorithm often has difficulty achieving a substantially improved correction effect; without a restricting mechanism based on additional samples, the CRF may overcorrect, leading to a decrease in classification accuracy. Our approach improves on the traditional CRF's global processing by offering two advantages: (1) End-to-end: ELP identifies which locations of the result image are highly suspected of containing errors without requiring samples; this characteristic allows ELP to be used in an end-to-end classification process.
(2) Localization: Based on the suspect areas, ELP limits the CRF analysis and update area within a small range and controls the iteration termination condition; these characteristics avoid the overcorrections caused by the global processing of the CRF.
The experimental results also show that ELP achieves a better correction result, is more stable, and does not require training samples to restrict the iterations. These advantages make ELP better suited to correcting the classification results of remote sensing images and provide it with a higher degree of automation.
The typical limitation of ELP is that, in comparison with the traditional CRF, the additional iterations, the localization process, and the result-image update mechanism introduce an additional computational burden. Consequently, ELP is much slower than the traditional CRF method. Fortunately, the localization process also ensures that different areas do not affect each other, which makes ELP easy to parallelize. In further research, we will adjust the processing structure of ELP to facilitate a GPU implementation that enables ELP to execute faster. For semantic segmentation neural networks, differences in training set size can cause various degrees of error in the resulting image, which has an apparent influence on the post-processing task. In future research, to construct a more adaptive post-processing method, we will study the relationship between training/testing dataset size and the post-processing methods used, and consider the problems faced by post-processing methods in more complex application scenarios.
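Because the localized regions do not affect each other, the per-region correction step lends itself to the parallelization mentioned above. A minimal sketch using Python's standard concurrent.futures, where `correct_region` is a hypothetical placeholder for the localized CRF step:

```python
from concurrent.futures import ThreadPoolExecutor

def correct_region(region):
    """Hypothetical stand-in for ELP's localized CRF correction of one
    suspicious region; here it just returns the region id with a flag."""
    return (region, True)

def correct_all(regions, max_workers=4):
    """Dispatch independent suspicious regions to parallel workers.
    Because localized updates do not overlap, each region can be handled
    by its own worker without synchronization; map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(correct_region, regions))

print(correct_all([0, 1, 2]))  # [(0, True), (1, True), (2, True)]
```

A real implementation would carry the pixel data of each region and merge the corrected labels back into the result image, which remains safe precisely because the update areas are disjoint.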