A Stacked Fully Convolutional Networks with Feature Alignment Framework for Multi-Label Landcover Segmentation

Applying deep-learning methods, especially fully convolutional networks (FCNs), has become a popular option for land-cover classification or segmentation in remote sensing. Compared with traditional solutions, these approaches have shown promising generalization capabilities and precision levels on various datasets of different scales, resolutions, and imaging conditions. To achieve superior performance, much research has focused on constructing more complex or deeper networks. However, using an ensemble of different fully convolutional models to achieve better generalization and to prevent overfitting has long been ignored. In this research, we design four stacked fully convolutional networks (SFCNs) and a feature alignment framework for multi-label land-cover segmentation. The proposed feature alignment framework introduces an alignment loss of features extracted from basic models to balance their similarity and variety. Experiments on a very high resolution (VHR) image dataset with six categories of land-covers indicate that the proposed SFCNs can gain better performance when compared to existing deep learning methods. In the 2nd variant of SFCN, the optimal feature alignment gains increments of 4.2% (0.772 vs. 0.741), 6.8% (0.629 vs. 0.589), and 5.5% (0.727 vs. 0.689) for its F1-score, Jaccard index, and kappa coefficient, respectively.


Introduction
The distributions and changes of natural and artificial surfaces, such as grasslands, forests, buildings, and roads, are fundamental information referenced by many applications such as urban planning [1], navigation [2], land-use management [3], and forest monitoring [4]. Traditionally, this information was obtained by labor-intensive and time-consuming field surveys [5]. Precise and cost-efficient updating of land cover is a long-standing demand in remote sensing. Over the last few years, with the emergence of innovative technologies, the cost as well as the difficulty of capturing very high resolution (VHR) aerial imagery has significantly declined [6,7]. Thus, robust and precise methods for the automatic classification and segmentation of land cover have become the core of the whole solution.
According to the conditions of image datasets such as scale, color space, and resolution, various automatic segmentation methods have been proposed. These methods can be divided into two categories depending on whether ground truth is required: (I) unsupervised methods and (II) supervised methods. Unsupervised methods can be further categorized into three groups according to their operational mechanisms: (1) threshold-based, (2) edge-based, and (3) region-based methods. Threshold-based methods separate different parts using thresholds determined by the values or histogram of the pixels [8]. Edge-based methods detect abrupt changes using mathematically designed filters, such as Sobel [9] and Canny [10], to generate boundaries between different parts. In region-based methods, image segmentation is done by clustering or region growing [11][12][13]. Because of their manually adjustable parameters and the lack of need for ground truth, unsupervised methods are easier to implement and are widely adopted for small-scale datasets. However, for larger datasets, as the variety and complexity increase, unsupervised segmentation methods usually lack generalization capability [14]. In direct contrast, supervised methods utilize the ground truth to learn segmentation patterns and then apply them to new data. For supervised methods, the segmentation problem is converted into a pixel-to-pixel image classification where pixels of different parts are classified into their corresponding categories [15]. Because the segmentation is made by classifying each pixel, these methods generally produce more precise segmentations.
For supervised segmentation methods, there are two fundamental procedures: feature extraction and classification. In the early stages, these steps were done separately. The spatial or textural features were first extracted from the image through hand-crafted descriptors, such as Haar-like features, local binary patterns, and histograms of oriented gradients [16][17][18][19]. Then, various classifiers, such as support vector machines, decision trees, and neural networks, were applied to the extracted features [20][21][22][23]. Because the two procedures are separate, optimizing the performance of these methods requires many cycles of trial and error. Instead, convolutional neural network (CNN) methods incorporate automatic feature extraction and classification in a unified framework [24]. As both steps can learn directly from the ground truth, CNNs show superior generalization capabilities and precision in many classification and segmentation tasks [25].
Before fully convolutional networks (FCNs) [26], CNN-based methods adopted patch-based approaches, which classified the center pixel by using a small patch of the whole image [27]. Because of highly overlapped patches, these methods required massive memory space as well as high computational capability. To solve this problem, the FCN method utilizes fully convolutional network architectures that can directly perform pixel-to-pixel translation of the input images to ground truth. In this manner, the FCN method significantly improves training efficiency and model performance [28]. The classic FCNs (FCN-32s, FCN-16s, and FCN-8s) adopt multiple-scale bilinear upsampling operations to generate segmentation output with the same height and width as the input. These operations lead to information loss that affects the precision of prediction. Recently, more advanced and accurate FCN-based methods have been developed [29]. These methods improve model performance through different strategies. The U-Net and FPN methods adopt multiple skip-connections between corresponding lower and upper layers to share information between layers [30,31]. The DeconvNet replaces bilinear upsampling with the deconvolution (convolution transpose) operation [32]. The MC-FCN method applies multi-constraints for various scale outputs [33]. Finally, the BR-Net method uses additional boundary information to regulate the model [34]. These methods further develop the potential of fully convolutional networks. However, with more complex architectures and stronger representation capabilities, overfitting becomes inevitable [35].
Overfitting is a long-standing problem in deep learning, and it is more critical for smaller datasets. To compensate for the problem, several approaches have been proposed, including early stopping, data augmentation, regularization, and ensemble learning. The early stopping approach stops training the model before convergence to prevent overfitting [36,37]. For the data augmentation approach, the original images are rotated, resized, randomly cropped, or re-colorized to generate more training samples and increase the variety of data [38]. As for regularization, an extra penalty (e.g., L1/L2) [39,40] or dropout [41] is implemented to reduce and regulate the representation capability of the model. By contrast, ensemble learning combines several models to generate a final prediction [42]. Owing to its capability of utilizing a variety of different models, biased predictions from one model can be compensated for by other models, and better results can be produced. Currently, ensemble learning is mainly applied to patch-based CNNs for pixel-level classification [43]. Ensemble learning has not received any attention in FCN-based architectures. In addition, research on ensemble learning is mainly focused on various numbers or combinations of basic models. Studies on how different basic models should be combined are not sufficient.
To explore the capability of ensemble learning using fully convolutional networks, we design four stacked fully convolutional networks (SFCNs) using FCN-8s, U-Net, and FPN. Furthermore, we propose a feature alignment framework for efficient ensemble learning, which enhances the relations between basic models. Compared with traditional ensemble learning approaches, the proposed method implements a basic segmentation loss between the prediction and the corresponding ground truth as well as an extra alignment loss between features that are extracted separately from different basic models. The value of the alignment loss is determined by the consistency of features extracted by different models. If these features are similar, the alignment loss is zero. During iterations, the optimizer is required to update parameters to reduce the value of the weighted sum of segmentation loss and alignment loss. Thus, the optimized network is capable of generating predictions using features extracted from basic models that contain a balance of similarity and variety.
The effectiveness of the proposed feature alignment framework is demonstrated on a VHR image dataset with 2D multi-label semantic information (refer to Section 2.1). In comparative experiments, the performances achieved by the proposed method (SFCNv3, +FL) are 0.785 (±0.004) for F1-score, 0.646 (±0.005) for Jaccard index [44], and 0.742 (±0.005) for kappa coefficient [45]. Furthermore, sensitivity analysis indicates that the proposed feature alignment can control the balance between similarity and variety of features extracted from different basic models. By optimizing the feature alignment level, ensemble fully convolutional networks gain better model performance.
The main contributions of this study can be summarized as follows: (1) we design a stacked fully convolutional networks architecture using multiple FCNs for efficient multi-label land-cover segmentation, and (2) we further propose a feature alignment framework to balance the similarity and variety of features extracted from basic models to gain extra performance.
The rest of the paper is organized as follows: First, the materials and methods used for this research are described in Section 2. Then, the quantitative and qualitative comparison results of different methods are presented in Section 3. Finally, the discussions and conclusions from this study are presented in Sections 4 and 5, respectively.

Dataset
To estimate the effectiveness of the designed SFCNs and the proposed feature alignment framework, we conduct our experiments on the ISPRS Vaihingen (Germany) 2D semantic labeling dataset. The dataset is an open benchmark, which is available online (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html). Within the dataset, there are 33 tiles, including 16 tiles for training and 17 tiles for testing. Only the tiles used for training are provided with images of annotated ground truth. The size of each tile ranges from 1388 × 2555 to 2006 × 3007 pixels. The ground sampling distance (GSD) of the orthophoto is about 9 cm.
As shown in Figure 1, each tile of the dataset contains an orthophoto and its corresponding annotated ground truth. The orthophoto is an 8-bit image with three bands, which correspond to the near-infrared, red, and green bands delivered by the camera. The image of annotated ground truth utilizes six different colors to represent land-covers of impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background (see color map in Table 1).
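As an illustration of how the color-coded ground truth can be turned into class indices, the following minimal sketch maps RGB triplets to the six land-cover classes. The color values are an assumption based on the commonly published ISPRS color coding and should be checked against Table 1; the function names and the class ordering are hypothetical.

```python
# Assumed ISPRS color coding (verify against Table 1); the class order is arbitrary.
COLOR_TO_CLASS = {
    (255, 255, 255): 0,  # impervious surfaces
    (0, 0, 255): 1,      # buildings
    (0, 255, 255): 2,    # low vegetation
    (0, 255, 0): 3,      # trees
    (255, 255, 0): 4,    # cars
    (255, 0, 0): 5,      # clutter/background
}


def decode_ground_truth(rgb_rows):
    """Convert rows of (R, G, B) pixels into rows of class indices."""
    return [[COLOR_TO_CLASS[tuple(pixel)] for pixel in row] for row in rgb_rows]


def one_hot(label_rows, num_classes=6):
    """Expand class indices into per-pixel one-hot vectors of length num_classes."""
    return [
        [[1.0 if k == label else 0.0 for k in range(num_classes)] for label in row]
        for row in label_rows
    ]


labels = decode_ground_truth([[(255, 255, 255), (0, 0, 255)],
                              [(0, 255, 0), (255, 0, 0)]])
```

The one-hot form gives a ground truth with one channel per land-cover class, which is the shape a per-class binary cross entropy expects.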

Method
Figure 2 presents the workflow of this research. All 16 tiles of orthophotos, as well as their corresponding ground truth, are divided into two sets for training and testing. These sets contain 12 and 4 tiles, respectively. A sliding window with a stride of 224 pixels is applied to each tile of the training set to generate image patches with a size of 224 × 224 pixels. After data preprocessing, the image patches are shuffled and split into two groups for training (70%) and cross-validation (30%). The numbers of samples for training and cross-validation are 744 and 312, respectively. Through several cycles of training and cross-validation, the hyper-parameters are determined and optimized. Then, the predictions generated by the optimized model are further evaluated on the tiles in the test set. For performance evaluation, we choose three commonly used evaluation metrics, namely, Jaccard index, F1-score, and kappa coefficient. These metrics are computed without post-processing operations [46,47] for better estimation of the experimental methods.
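The patch generation and the 70/30 split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes that windows which do not fit entirely inside a tile are dropped, and the function names, the random seed, and the example tile size are chosen only for demonstration.

```python
import random


def window_origins(height, width, patch=224, stride=224):
    """Top-left (row, col) of every full patch that fits inside the tile."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def split_patches(patches, train_fraction=0.7, seed=0):
    """Shuffle the patches and split them into training and cross-validation groups."""
    shuffled = list(patches)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Example on a single tile of 2006 x 3007 pixels (assumed height x width).
origins = window_origins(2006, 3007)
train, val = split_patches(origins)
```

With a stride equal to the patch size, the patches do not overlap, so each pixel of a tile contributes to at most one sample.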

Stacked Fully Convolutional Networks
Since the invention of FCN in 2015, FCN and FCN-based methods have become a gold standard for many image segmentation tasks [48,49]. Compared to conventional patch-based CNN methods, FCN-based models significantly improve computational efficiency and performance. Advanced FCN-based models further enhance feature representation capabilities and improve model performance through various approaches. These approaches include various combinations of skip-connections (U-Net & FPN), replacing bilinear upsampling with unpooling (SegNet) or convolution transpose (DeconvNet), multi-constraints (MC-FCN), and additional boundary information (BR-Net). However, the increased representation capability and complexity of the models usually lead to overfitting of the training data, especially for small or biased datasets.
To avoid overfitting, approaches including early stopping, data augmentation, regularization, and ensemble learning are widely adopted. Of these, owing to its ability to utilize the representation capability of different models, the ensemble learning approach shows better performance and generalization capability. However, ensemble learning is currently used for patch-based CNN architectures, but not for FCN-based architectures. Additionally, research on ensemble learning mainly focuses on adding more basic models or trying different combinations of them. To the best of our knowledge, research on methods to discern better combinations of various FCNs in ensemble learning does not exist.
Thus, we design stacked fully convolutional networks (SFCNs) and propose a feature alignment method, which enhances the relations between basic models. For ensemble learning, if the predictions from two models are completely different (in extreme cases, one of them is all zeros and the other is all ones), the ensemble result is just an average of both biased predictions and cannot yield better performance. Therefore, to obtain better results, the predictions of different models should contain a certain level of variety as well as similarity. Compared with traditional ensemble learning approaches, the proposed method introduces an extra alignment loss to control similarity as well as consistency between features that are extracted separately from different basic models. In contrast to the common segmentation loss, which is computed as the difference between a ground truth and its corresponding prediction, the proposed alignment loss is computed among features extracted from the stacked basic models. To make sure the alignment loss can be applied to any number of basic models (e.g., 2 models of FCN and U-Net, or 3 models of FCN, U-Net, and FPN), the alignment loss is computed as the mean square error (MSE) between the maximum and minimum values of the extracted features (see details in Equation (1)). The value of the alignment loss reflects the consistency of the extracted features and becomes zero when all the extracted features are identical. During iterations, the optimizer is required to update parameters to reduce the value of the weighted sum of segmentation loss and alignment loss. Thus, the optimized network is capable of generating a normalized prediction from variant basic models. Through the feature alignment framework, the SFCNs can achieve a balance of similarity and variety using different basic models and improve performance.
Figure 3 presents the design of the proposed stacked fully convolutional networks (SFCN). The SFCN consists of two parts: (1) a framework for feature extraction using various fully convolutional networks and (2) a framework for feature alignment and output generation. In the feature extraction framework, different numbers or combinations of FCN-based models are implemented to separately extract features from the same input image. For each FCN-based model, there are several universal operations and model-specific layers. The universal operations are convolution, nonlinear activation, and subsampling. For backend models, various model-specific layers, such as skip-connection (U-Net & FPN) and unpooling (SegNet), are included.
For universal operations, element-wise multiplication within the kernel is computed through the convolutional operation. The size of the kernel determines the receptive field and the computational efficiency of the convolution operation. Later, the output of the convolution is handled by the rectified linear unit (ReLU) [50], which returns the original value if the value is larger than zero and sets values less than zero to zero. To accelerate network training, most models adopt batch normalization (BN) [51] layers before (e.g., SegNet) or after non-linear activations (e.g., FPN). To reduce the width and height of features, max-pooling [52] is chosen for subsampling in this study.
As for model-specific layers, sequential bilinear upsampling [53] is commonly used to upsample the width and height of the features. By contrast, the SegNet backend uses unpooling, which applies the corresponding pooling indices of max-pooling to achieve upsampling. In the FPN and U-Net backends, skip-connection, which concatenates two layers with consistent height and width along the channel axis, is applied between downward and upward layers.
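The three universal operations can be sketched on a single-channel feature map in plain Python. This is only an illustration under simplifying assumptions: one channel, no padding, a stride of one for the convolution, and a 2 × 2 pooling window with stride 2 (the source does not state the pooling size). As in most deep-learning frameworks, the kernel is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive values and set negative values to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """2 x 2 max-pooling with stride 2; height and width must be even."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        pooled_row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
features = conv2d_valid(image, [[0, 1], [1, 0]])
```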
In the framework for feature alignment and output generation, alignment loss that restricts the consistency of extracted features from various models and multi-class segmentation loss are computed sequentially.

• Alignment loss (Loss_align). Through the n-th FCN-based model, extracted features (denoted as X_n) with a size of W × H × D are generated. W and H are consistent with the width and height of the input. The value of D is the same as the number of land-cover classes. The maximum and minimum values at each position across the 1st to the n-th features, denoted as x^{max}_{i,j,k} and x^{min}_{i,j,k}, are computed. The final alignment loss (Loss_align) is calculated as the mean square error between the corresponding maximum and minimum values over all positions:

$$Loss_{align} = \frac{1}{W \times H \times D} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{D} \left( x^{max}_{i,j,k} - x^{min}_{i,j,k} \right)^{2} \quad (1)$$

• Segmentation loss (Loss_seg). From all extracted features (X_1, X_2, ..., X_n), the final output/prediction (Y) of the network is computed by taking the average value of all features. Then, the binary cross entropy [54], which measures the difference between the ground truth (G) and its corresponding prediction, is used as the segmentation loss (Loss_seg). The calculation can be formulated as

$$Y = \frac{1}{n} \sum_{m=1}^{n} X_{m}$$

$$Loss_{seg} = -\frac{1}{W \times H \times D} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{k=1}^{D} \left[ g_{i,j,k} \log y_{i,j,k} + \left( 1 - g_{i,j,k} \right) \log \left( 1 - y_{i,j,k} \right) \right]$$

where y_{i,j,k} and g_{i,j,k} represent the (i,j,k) elements of the model output (Y) and the ground truth (G). The value of y_{i,j,k} is the predicted probability of the pixel category.
Therefore, the total loss of the network can be formulated as

$$Loss_{final} = Loss_{seg} + \lambda \cdot Loss_{align}$$

where λ is the weight of the alignment loss (Loss_align). By controlling the value of λ, we are able to adjust the balance between Loss_align and Loss_seg.
During iterations, the Adam optimizer [55] minimizes Loss_final to drive the proposed network to generate pixel-to-pixel predictions for multi-label land-cover segmentation.
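The two losses and their weighted sum can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: each feature map is flattened into a list of W × H × D probabilities, both losses are averaged over all elements, and the function names and the clipping constant `eps` are hypothetical.

```python
import math


def alignment_loss(features):
    """Mean square error between the position-wise maximum and minimum over all models."""
    gaps = [max(values) - min(values) for values in zip(*features)]
    return sum(gap * gap for gap in gaps) / len(gaps)


def ensemble_prediction(features):
    """Average the features of all basic models position by position."""
    count = len(features)
    return [sum(values) / count for values in zip(*features)]


def segmentation_loss(prediction, ground_truth, eps=1e-7):
    """Binary cross entropy averaged over all elements."""
    total = 0.0
    for y, g in zip(prediction, ground_truth):
        y = min(max(y, eps), 1.0 - eps)  # clip to avoid log(0)
        total += g * math.log(y) + (1.0 - g) * math.log(1.0 - y)
    return -total / len(prediction)


def total_loss(features, ground_truth, lam):
    """Segmentation loss plus lam times the alignment loss."""
    prediction = ensemble_prediction(features)
    return segmentation_loss(prediction, ground_truth) + lam * alignment_loss(features)


# Two basic models, three flattened positions (illustrative values only).
x1 = [0.9, 0.2, 0.6]
x2 = [0.7, 0.2, 0.8]
g = [1.0, 0.0, 1.0]
loss = total_loss([x1, x2], g, lam=0.5)
```

Because the alignment term depends only on the maximum and minimum at each position, the same function works for two, three, or more basic models, and it returns zero when all models output identical features.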

Network Specification
• FCN-8s. The classic FCN-8s architecture was proposed by Long et al. in 2015 [26]. This method innovatively adopts a fully convolutional architecture to perform pixel-to-pixel image classification or segmentation. The FCN architecture is the first fully convolutional network used for image segmentation.
• U-Net. The U-Net architecture was proposed by Ronneberger et al. [30] for medical image segmentation. This method introduces multiple skip connections between upper and lower layers. Owing to its robustness and elegant structure, U-Net and its variants are widely adopted for many semantic segmentation tasks.
• FPN. The FPN architecture was published by Lin et al. in 2017 [31]. Like U-Net, this method adopts multiple skip connections. In addition, the FPN model generates multi-scale predictions for the final output. By utilizing abundant information from the feature pyramid, the FPN method achieves state-of-the-art performance.
The number and the size of convolutional kernels have a significant impact on model performance. To minimize their effect, the basic models used in this research are implemented with consistent numbers and sizes of kernels at corresponding layers (see details in Figure 4).

Model Setup
To analyze the importance or significance of the proposed alignment loss, four versions of stacked fully convolutional networks (SFCNs) are set up. There are three variants utilizing different combinations of two basic models, and a variant utilizing all three basic models. The variants using two basic models, SFCN_f&p, SFCN_f&u, and SFCN_u&p, consist of FCN-8s & FPN, FCN-8s & U-Net, and U-Net & FPN, respectively (as shown in Table 2). All combinations are separately trained with different values of λ (λ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}). In all experiments, the models are trained, cross-validated, and tested on exactly the same dataset. To prevent random bias, each set of experiments is repeated five times. After removing the best and worst performances of each method, their average performance on the testing dataset is evaluated.
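Table 2. Network settings of the stacked fully convolutional networks built from FCN-8s, U-Net, and FPN.

Version | No. of basic models | FCN-8s | U-Net | FPN
SFCN_f&p | 2 | yes | no | yes
SFCN_f&u | 2 | yes | yes | no
SFCN_u&p | 2 | no | yes | yes
SFCN_f&u&p | 3 | yes | yes | yes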

Results
Three well-known FCN-based methods, namely, FCN-8s, U-Net, and FPN, are chosen as basic models in this study. Four SFCN models composed from the three basic models (refer to Table 2) are trained separately with various weights (λ) of the alignment loss (Loss_align). All experiments are performed on the same dataset and processing platform.
Three commonly used balanced metrics, namely F1-score, Jaccard index, and kappa coefficient, are selected for quantitative evaluation. Figures 5 and 6 show the comparison results of the experimental methods.
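For reference, the three metrics can be computed from a confusion matrix as sketched below. The per-class F1-score and Jaccard index follow their standard definitions; how the per-class values are aggregated in the reported results is not stated here, so any averaging on top of this sketch is an assumption. The function names and the toy labels are hypothetical.

```python
def confusion_matrix(truth, prediction, num_classes):
    """Count pixels by (true class, predicted class)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, prediction):
        matrix[t][p] += 1
    return matrix


def per_class_f1_jaccard(matrix):
    """Return a list of (F1-score, Jaccard index) pairs, one per class."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp
        fn = sum(matrix[k]) - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
        scores.append((f1, jaccard))
    return scores


def kappa(matrix):
    """Cohen's kappa: agreement beyond what is expected by chance."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[k][k] for k in range(len(matrix))) / total
    expected = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix) for k in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Toy example with three classes and six pixels.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
scores = per_class_f1_jaccard(cm)
```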

Sensitivity Analysis of Feature Alignment
To investigate the significance of feature alignment, four stacked fully convolutional networks (i.e., SFCNs) using sequential values of lambda (λ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}) are implemented and validated on the testing dataset. To prevent random bias, each set of experiments is repeated five times. After removing the best and worst performances of each method, the mean value and standard deviation (SD) of the evaluation metrics are calculated. Figure 5 and Table 3 present the trends and values of F1-score, Jaccard index, and kappa coefficient over various λ of Loss_align.
Figure 5a shows the trend of performances over λ values of Loss_align on SFCN_f&p. As the value of λ increases, the values of the three metrics improve. The best performance is achieved with the maximum value of λ = 1.0. This result indicates that the introduction of feature alignment leads to better performance of the ensemble model. Figure 5b,c show the trend of performances on SFCN_f&u and SFCN_u&p, respectively. When λ ≤ 0.8, a higher λ generally yields higher metric values. By contrast, when λ exceeds 0.8, a higher λ leads to weaker performance. In contrast to Figure 5a-c, there is no significant change in the values of the metrics under various λ in Figure 5d, which implies that feature alignment has no significant effect on SFCN_f&u&p.
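The aggregation over repeated runs can be sketched as follows. The scores in the example are hypothetical placeholders, not results from the experiments, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or the population form is used.

```python
import statistics


def trimmed_summary(scores):
    """Drop the single best and worst run, then return (mean, SD) of the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three runs to drop the best and the worst")
    kept = sorted(scores)[1:-1]
    return statistics.mean(kept), statistics.stdev(kept)


# Five hypothetical runs of one metric; 0.10 and 0.90 are discarded.
mean_value, sd_value = trimmed_summary([0.70, 0.74, 0.72, 0.90, 0.10])
```

With five repetitions, three runs remain after trimming, so a single unusually good or bad run cannot shift the reported mean.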

Performances Comparison
Three basic models (FCN-8s, U-Net, and FPN) and four combinations of ensemble models (i.e., SFCNs) with/without optimal feature alignment (FA) are implemented and validated on the testing dataset. To prevent random bias, each set of experiments is repeated five times. After removing the best and worst performances of each method, the mean value and standard deviation (SD) of the evaluation metrics are calculated.
Figure 6a shows the relative performances of these models. Among the three basic methods, FPN shows the highest values for all evaluation metrics. For each combination of ensemble learning, methods with optimal feature alignment (+FA) are generally better than the corresponding methods without optimal feature alignment (−FA).
Figure 6b displays the corresponding mean and standard deviation (SD) values of the evaluation metrics from different methods. Among the four ensemble models without optimal feature alignment (SFCNs, −FA), SFCN_f&u&p (−FA) shows higher mean values than SFCN_u&p (−FA), SFCN_f&u (−FA), and SFCN_f&p (−FA) for all metrics. This observation indicates that an ensemble with more models can lead to better performance. For ensemble models using the same number of basic models (SFCN_f&p, SFCN_f&u, and SFCN_u&p), a combination of U-Net and FPN (SFCN_u&p) is better than a combination of FCN-8s and U-Net (SFCN_f&u) or FCN-8s and FPN (SFCN_f&p). Surprisingly, the best basic model (FPN) is better than the best ensemble model without feature alignment (SFCN_f&u&p, −FA). This result suggests that a simple ensemble of different basic models does not assure higher performance. As for the four ensemble models with optimal feature alignment (SFCNs, +FA), SFCN_u&p (+FA) shows the highest mean values for F1-score (0.785), Jaccard index (0.646), and kappa coefficient (0.742). Ensemble methods with feature alignment show higher values for all three evaluation metrics than their counterparts without feature alignment. Among all methods, SFCN_u&p (+FA) achieves the highest performance.
The values of the standard deviation (SD) of the three metrics from different models range from 0.001 to 0.008. Compared to the mean values, even the maximum SD (0.008) is small. According to an independent t-test, except for SFCN_f&u&p, methods with optimal feature alignment show significantly different values for all three evaluation metrics compared with their counterparts without feature alignment (see details in Table 4).
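The test statistic of an independent two-sample t-test can be sketched as below. This assumes the classic Student form with a pooled variance; whether that variant or Welch's variant underlies Table 4 is not stated, and the sample values are hypothetical placeholders rather than measured scores.

```python
import math
import statistics


def independent_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t_stat = (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t_stat, n_a + n_b - 2


# Hypothetical scores of three retained runs with and without feature alignment.
t_stat, dof = independent_t([0.70, 0.72, 0.74], [0.60, 0.62, 0.64])
```

With four degrees of freedom, the two-sided critical value at the 0.05 level is about 2.776, so a t statistic of larger magnitude would be reported as a significant difference.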

Qualitative Comparison
Figure 7 shows the prediction results on testing areas Tile-1, Tile-2, Tile-3, and Tile-4 of the three basic models (FCN-8s, U-Net, and FPN) and the optimized SFCNs. Generally, these models can correctly segment the major parts of different land-covers from the original aerial images. The FCN-8s model tends to misclassify low vegetation as trees (e.g., the red rectangle in column 2, Tile 1), and the border area of buildings is usually broken (e.g., the red rectangle in column 2, Tile 3). The result generated by U-Net is unable to discriminate between roads and buildings (e.g., the red rectangles in column 3, Tile 1 or column 3, Tile 2). The FPN model is generally better than FCN-8s and U-Net. However, trees and roads are misclassified as buildings (e.g., the red rectangle in column 4, Tile 3). Among the SFCN models, results generated from SFCN_f&p and SFCN_f&u tend to miss the buildings in the corner area (e.g., the red rectangles in column 5, Tile 4 and column 6, Tile 4). The SFCN_f&u&p model outperforms SFCN_f&p and SFCN_f&u. However, there are misclassified holes within large buildings (e.g., the red rectangle in column 8, Tile 3). When compared to other methods, even with some misclassification (e.g., the red rectangle in column 7, Tile 3), SFCN_u&p shows better performance in major areas.

Computational Efficiency
All experiments are implemented and tested on a Sakura Internet Server (https://www.sakura.ad.jp/) equipped with one NVIDIA Tesla V100 GPU (https://www.nvidia.com/en-us/data-center/tesla-v100/) and installed with 64-bit Ubuntu 16.04 LTS. To eliminate the effect of some hyperparameters, the batch size and the number of iterations are fixed to 24 and 1000, respectively, for all models. The Adam stochastic optimizer, running at its default setting (lr = 2 −4, betas = [0.9, 0.999]), is used for training the different models.
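For readers unfamiliar with the optimizer, a single Adam update for one scalar parameter can be sketched as follows. The betas match the values quoted above, while the learning rate is left as a caller-supplied argument and `eps` is the small constant commonly used in the published algorithm. The function name and the toy objective are hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); lr chosen for illustration.
x, m, v = 0.0, 0.0, 0.0
first_x, _, _ = adam_step(x, 2 * (x - 3.0), m, v, t=1, lr=0.1)
for step in range(1, 201):
    x, m, v = adam_step(x, 2 * (x - 3.0), m, v, step, lr=0.1)
```

On the first iteration the bias-corrected ratio has magnitude close to one, so the parameter moves by roughly the learning rate regardless of the gradient scale.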
Table 5 shows the computing speeds in frames per second (FPS) of these methods. In the training period, the three basic models are processed at 41.4 FPS (FCN-8s), 59.4 FPS (U-Net), and 54.6 FPS (FPN), respectively. When compared to the basic models, the ensemble methods are much slower. As the number of basic models increases (e.g., 3 in SFCN_f&u&p vs. 2 in SFCN_f&p, SFCN_f&u, and SFCN_u&p), the training speed decreases. Even with the same number of basic models, the training speeds differ because of the difference in model combination. Generally, a combination of fast basic models forms a faster ensemble model (e.g., 41.8 FPS for SFCN_u&p vs. 31.2 FPS for SFCN_f&p). In the testing period, these methods run at 1.3-2.1 times their training speed. Interestingly, SFCN_f&u&p has the most significant difference in speed (57.6 vs. 27.2 FPS, 2.1×) between the training and testing stages.

Discussion
Deep-learning methods, especially FCN-based models, are widely adopted for automatic building extraction from large-scale aerial images [57,58]. Compared to conventional methods, the FCN-based models significantly improve segmentation performance when tested on various benchmark datasets [59,60]. Recently, more advanced FCN-based models have enhanced feature representation capabilities to achieve better model performance (e.g., FPN, MC-FCN, and BR-Net). However, the increased representation capability and complexity of the models usually lead to overfitting. Ensemble learning, which utilizes several different networks to generate a weighted prediction, is a promising option to avoid overfitting.
In this paper, we designed four SFCNs and proposed a novel feature alignment framework to enhance the performance of the ensemble framework. In contrast to existing ensemble approaches, which mainly focus on adding more basic models or trying different combinations of them, the proposed framework introduces an alignment loss to control the similarity and consistency of features extracted from different basic models. Through feature alignment, the proposed ensemble method can achieve a balance between variety and similarity, so better predictions can be achieved from weaker basic models. Qualitative and quantitative results on the testing tiles demonstrate the effectiveness of our proposed stacked fully convolutional networks as well as the feature alignment framework. Additionally, because of its flexibility, this framework can easily be extended to ensemble learning architectures using varied numbers of basic models.

Accuracies, Uncertainties, and Limitations
From the sensitivity analysis, different ensemble models show a similar trend that as the weight of alignment loss increases, the performance of the model will increase first and, after a certain level, decline (see details in Figure 5).The interpolation of this trend is: (1) When there is no feature alignment(λ=0), features extracted from different basic models are so diverse that they might have different predictions for certain locations.An ensemble of these features doesn't bring better results.
(2) When feature alignment is added, at early stages, a higher value of λ forces features extracted from different basic models to be closer to each other so that they can compromise on specific locations and generate better overall predictions.However, if λ rises beyond the optimal value, the extracted features might be too similar to each other, and there will not be enough variety.Thus, the performance of the ensemble method will regress to that of a single basic model.This observation indicates that the feature alignment framework can help achieve a balance in similarity and variety of features in ensemble learning.
Among the methods, the proposed SFCNs with feature alignment (SFCN u&p , +FA) shows the highest values for all evaluation metrics.The values of f1-score, jaccard index, and kappa coefficient are 0.785, 0.646, and 0.742, respectively.SFCN models using two basic models (SFCN f &p , SFCN f &u , and SFCN u&p ), with or without feature alignment (i.e., +/− FA), show significantly different performances.Ensemble models with proper weights for alignment loss are generally better than their counterparts without alignment loss.Especially for SFCN f &u , optimal feature alignment gains increments of 4.2% (0.772 vs. 0.741) for f1-score, 6.8% (0.629 vs. 0.589) for jaccard index, and 5.5% (0.727 vs. 0.689) for kappa coefficient.These results indicate that introducing feature alignment leads to better performance of the ensemble model.However, for ensemble models using three basic models (SFCN f &u&p ), the values of jaccard index and kappa coefficient only increase about 1.6% (0.640 vs. 0.630) and 1.2%(0.736vs. 0.727), respectively.The improvement caused by feature alignment for SFCN f &u&p is not significant.Additionally, when compared to the best basic model (FPN), the optimized ensemble model doesn't show big improvements (see details in Figure 6 b).
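The quoted percentages are relative increments over the scores without feature alignment. As a quick arithmetic check, the sketch below recomputes them from the values given in the text; the function and variable names are hypothetical.

```python
def relative_gain(with_fa, without_fa):
    """Relative increment in percent of the +FA score over the -FA score."""
    return 100.0 * (with_fa - without_fa) / without_fa


# (score with FA, score without FA) as quoted in the text.
two_model_scores = {
    "f1": (0.772, 0.741),
    "jaccard": (0.629, 0.589),
    "kappa": (0.727, 0.689),
}
three_model_scores = {
    "jaccard": (0.640, 0.630),
    "kappa": (0.736, 0.727),
}

two_model_gains = {k: round(relative_gain(a, b), 1) for k, (a, b) in two_model_scores.items()}
three_model_gains = {k: round(relative_gain(a, b), 1) for k, (a, b) in three_model_scores.items()}
```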
Through analysis of computational speed, we observed a significant decrease in computational efficiency at the training stage when applying ensemble learning. Of the four SFCN models, the model with three basic models (SFCN f&u&p) is much slower than the models with two basic models (SFCN f&p, SFCN f&u, and SFCN u&p). Because of this decrease in computational efficiency, even though feature alignment can be easily extended to the ensemble model with all basic models, the proposed ensemble model might not be suitable for the analysis of very large areas (e.g., automatic mapping of an entire country).

Conclusions
In this paper, we propose a novel feature alignment framework for efficient ensemble learning of fully convolutional networks. The proposed framework can be seamlessly integrated with ensemble learning models with a varying number of basic models to regulate the balance between similarity and variety of the features extracted from different branches. The performance of these models is verified on a VHR image dataset with multi-label semantic information. The ensemble models with the proposed feature alignment show significantly better performance than existing methods. In SFCN f&u, optimal feature alignment yields relative increases of 4.2% (0.772 vs. 0.741), 6.8% (0.629 vs. 0.589), and 5.5% (0.727 vs. 0.689) for f1-score, jaccard index, and kappa coefficient, respectively. Sensitivity analysis demonstrated that feature alignment plays an important role in controlling the balance between similarity and variety of the ensemble model. In future studies, we will further optimize our feature alignment framework to achieve better performance in more complex ensemble learning architectures.

Figure 1. Example of the Vaihingen 2D semantic labeling dataset. (a) True orthophoto, (b) annotated ground truth, and (c) legend. The ground truth contains six types of land cover.

Figure 2. Experimental workflow of this research. The existing methods, as well as the proposed model, are trained and evaluated on 224 × 224 image patches extracted from the original dataset.

Figure 3. Proposed stacked fully convolutional networks (SFCN). The SFCN contains a framework for feature extraction using N fully convolutional networks (models A, B, ..., N), and a framework for feature alignment and final output generation.

Table 1. Reference of color map of the Vaihingen dataset.

Table 3. Table of model performances of four SFCNs under lambda values in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]. The x and y axes represent the models and their corresponding values, respectively. (b) Table of mean value and standard deviation (SD) of performance comparisons among these methods. For each evaluation metric, the highest mean values and lowest SD are highlighted in bold.

Table 4. Comparison of performances of the basic models FCN-8s, U-Net, and FPN as well as four SFCNs with/without feature alignment. (a) Bar chart for comparison of relative performances. (b) Table of mean value and standard deviation (SD) of the performance comparison of these methods. For each evaluation metric, the highest mean values and lowest SD are highlighted in bold. Result of independent t-test of four SFCNs with and without feature alignment. The p-value is the probability that an SFCN has the same performance with and without feature alignment.

Table 5. Comparison of the computational efficiencies of FCN-8s, U-Net, FPN, and four ensemble fully convolutional networks. For each column, the highest mean values and lowest SD are highlighted in bold.