In this section, we conduct extensive experiments on the SARMV3D-BIS dataset to evaluate the performance of our proposed method, comparing it with several SOTA methods and analyzing the results. The experimental setup is detailed in Section 4.1. The comparative experiments on the SARMV3D-BIS dataset and the corresponding analysis are provided in Section 4.2. The ablation experiments and their analysis are presented in Section 4.3. A further analysis of our method is provided in Section 4.4.
4.1. Experimental Settings
4.1.1. Dataset Description
To demonstrate the effectiveness of our proposed method for SAR image semantic segmentation in building areas, we use the SARMV3D-BIS benchmark dataset [65], produced by a team from the Chinese Academy of Sciences.
The original SAR images in the dataset were acquired over the Omaha city area of the United States in the GF-3 spotlight mode. The SAR images were semantically annotated at a fine level following the systematic labeling process proposed by the data production team, which back-projects simulated 3D models onto the images. The annotations cover the facade, roof, and shadow of each building.
Figure 8 shows part of the dataset; each column is a data pair. The first row shows the ground truth, where red represents the building facade, white the building roof, blue the building shadow, and black the background. The second row shows the corresponding SAR images. All images in the dataset have the same size. The dataset is divided into a training set of 1280 image pairs, a validation set of 369 image pairs, and a test set of 369 image pairs.
4.1.2. Comparison Methods and Evaluation Metrics
To demonstrate the superiority of our method for segmentation tasks on SAR images in building areas, we compared the proposed LRFFNet with the following SOTA semantic segmentation methods: UNet [35], EncNet [36], ApcNet [37], EmaNet [38], DeepLabV3 [41], PspNet [43], DaNet [44], FPN [63], MP-ResNet [50], HR-SARNet [49], and MS-FCN [51].
To compare fairly with the SOTA methods on the SARMV3D-BIS dataset, we use widely adopted evaluation metrics, including the intersection over union (IoU), mean intersection over union (mIoU), accuracy (Acc), mean accuracy (mAcc), and all accuracy (aAcc).

The IoU is calculated as follows:

$$\mathrm{IoU}_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$

The mIoU is calculated as follows:

$$\mathrm{mIoU} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{IoU}_i$$

where $i$ denotes the semantic category, $n$ is the number of classes, and $TP_i$, $FP_i$, and $FN_i$ are the numbers of true-positive, false-positive, and false-negative pixels of class $i$. In particular, the aAcc is calculated by dividing the number of all correctly classified pixels in the prediction map by the total number of pixels.
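As a concrete illustration (this is not the authors' evaluation code), the following minimal sketch computes all five metrics from an n × n confusion matrix; the example matrix is hypothetical.

```python
import numpy as np

def evaluate(conf: np.ndarray) -> dict:
    """Compute IoU, mIoU, Acc, mAcc, and aAcc from an n x n confusion
    matrix, where conf[i, j] counts pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)      # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp            # false positives per class
    fn = conf.sum(axis=1) - tp            # false negatives per class
    iou = tp / (tp + fp + fn)             # per-class IoU
    acc = tp / conf.sum(axis=1)           # per-class accuracy
    return {
        "IoU": iou,
        "mIoU": iou.mean(),
        "Acc": acc,
        "mAcc": acc.mean(),
        "aAcc": tp.sum() / conf.sum(),    # all correct pixels / all pixels
    }

# Hypothetical 4-class example (background, facade, roof, shadow).
conf = np.array([[90,  2,  5,  3],
                 [ 3, 60,  4,  1],
                 [ 6,  2, 80,  2],
                 [ 4,  1,  2, 55]])
print(evaluate(conf))
```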
4.1.3. Implementation Details
Our network was tested on the following platform: an AMD Ryzen 5 5600X CPU @ 3.7 GHz and an NVIDIA RTX 3090 Ti GPU with CUDA 11.6. In the experimental setting, we apply random flip and random rotation data augmentation to the input image, each with a probability of 0.5. We use the AdamW optimizer, with the learning rate set to 0.0008, the coefficients used for computing running averages of the gradient and its square (betas) set to (0.9, 0.999), and the weight decay coefficient set to 0.05. Warmup and learning rate decay strategies are used together: the warmup interval is set to 300 iterations, the warmup ratio is set to 0.001, and the decay follows the poly strategy. The batch size was set to 8, and the network was trained for 20,000 steps.
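In PyTorch, this training configuration can be reproduced roughly as follows. This is a hedged sketch rather than the authors' code, and the poly power of 1.0 is an assumption since the paper does not state it.

```python
import torch

model = torch.nn.Conv2d(3, 4, 1)   # stand-in for LRFFNet

max_steps, warmup_steps, warmup_ratio, power = 20_000, 300, 0.001, 1.0

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=8e-4,                # base learning rate
    betas=(0.9, 0.999),     # running-average coefficients
    weight_decay=0.05,      # weight decay coefficient
)

def lr_factor(step: int) -> float:
    """Linear warmup from warmup_ratio * base_lr, then poly decay to zero."""
    if step < warmup_steps:
        return warmup_ratio + (1.0 - warmup_ratio) * step / warmup_steps
    t = (step - warmup_steps) / (max_steps - warmup_steps)
    return (1.0 - t) ** power

# scheduler.step() is called once per training iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```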
4.2. Comparative Experiments and Analysis
Following the experimental setup described in Section 4.1, we carried out several groups of comparative experiments.
The SARMV3D-BIS dataset poses a challenging task. SAR images contain much useful information, such as the structure, texture, and occlusion relationships between the environment and the target. However, SAR images are acquired in the microwave band and have a lower signal-to-noise ratio than optical images. Optical images can be readily interpreted by the human eye, whereas interpreting SAR images requires professional knowledge, making SAR image interpretation, and hence SAR image segmentation, more challenging. We computed statistics on the area of each category in the dataset: the proportions of the four categories (background, facade, roof, and shadow) are 76.14%, 5.27%, 14.01%, and 4.57%, respectively. The background thus accounts for a large part of the dataset, and the building area accounts for less than half. Moreover, within the building area, the roof category accounts for a considerable part, while the facade and shadow each account for less than 6% and are small objects. This unbalanced category distribution is an important reason why semantic segmentation on this dataset is challenging.
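Such class proportions can be computed directly from the ground-truth maps; the sketch below assumes a hypothetical layout in which labels are stored as single-channel index images with values 0-3.

```python
import numpy as np
from pathlib import Path
from PIL import Image

CLASSES = ["background", "facade", "roof", "shadow"]

def class_proportions(label_dir: str) -> np.ndarray:
    """Accumulate per-class pixel counts over all ground-truth index maps
    in a folder and return their proportions of the total pixel count."""
    counts = np.zeros(len(CLASSES), dtype=np.int64)
    for path in Path(label_dir).glob("*.png"):
        label = np.asarray(Image.open(path)).astype(np.int64)
        counts += np.bincount(label.ravel(), minlength=len(CLASSES))
    return counts / counts.sum()

# e.g. class_proportions("SARMV3D-BIS/train/labels")  # hypothetical path
# -> approximately [0.7614, 0.0527, 0.1401, 0.0457] over the full dataset
```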
We conducted experiments on the SARMV3D-BIS dataset using our method and several other SOTA methods, most of which build on the encoder-decoder structure. For a more rigorous comparison, our experiments are divided into three groups. In the first group, each comparison method follows its original paper exactly, and most of them use a ResNet encoder; the results are shown in Table 1. In the second group, the comparison methods use ConvNeXt as the encoder, the same encoder as our proposed method, in order to eliminate the effect of using different encoders; the results are shown in Table 2. In the third group, we compared our method with advanced networks specially designed for SAR image segmentation. In the three tables, the best result is marked in red and the second-best result in blue.
As shown in Table 1, our proposed method outperforms the other methods on all evaluation metrics. Specifically, the mIoU score of our LRFFNet is higher than that of the second-best method, DaNet; the mAcc score is higher than that of the second-best method, DeepLabV3; and the aAcc score is higher than that of the second-best method, DaNet.
As shown in Table 2, compared with the SOTA methods, our proposed method improves on most evaluation metrics. Specifically, the mIoU score of our LRFFNet is higher than that of the second-best method, DeepLabV3. In particular, the IoU scores of LRFFNet on background and roof increased by 0.38% and 2.82% over second place. Our method also achieves SOTA performance on the hard-to-segment semantic classes "facade" and "shadow": on "facade" objects, the IoU score of our LRFFNet is 4.94% higher than that of the second-best method, DeepLabV3, and on "shadow" objects it is 2.68% higher than that of the second-best method, DeepLabV3. The mAcc score of our LRFFNet is also higher than that of the second-best method, DeepLabV3; in particular, the Acc score on "facade" objects exceeds that of the second-best method, DeepLabV3, and the Acc score on "shadow" objects exceeds that of the second-best method, PspNet.
In addition, another series of comparative experiments was carried out between our method and methods designed specifically for SAR image segmentation: MP-ResNet, HR-SARNet, and MS-FCN. The experimental results are shown in Table 3; our proposed method achieves improvements on all evaluation metrics. To intuitively demonstrate the superiority of our proposed LRFFNet in the semantic segmentation of SAR images in building areas, we also performed visual comparison experiments, with the results shown in Figure 9 and Figure 10. Every two rows in each figure form one set of comparative experiments, comprising the original image to be classified, the ground truth, and the classification result predicted by each method; the last panel is the prediction of our LRFFNet. Our proposed method achieves good segmentation accuracy both on the larger categories, such as background and roof, and on the smaller categories, such as facade and shadow. Compared with the other methods, the semantic segmentation results obtained by our method are visually closer to the ground-truth images.
4.3. Ablation Experiments
In this subsection, we evaluate the effectiveness of two key modules of our proposed method, the cascade feature pyramid (CFP) module and the large receptive field channel attention (LFCA) module, as well as the effectiveness of the auxiliary branch.
4.3.1. Effect of Cascade Feature Pyramid Module
We first set up a basic experiment in which we use ConvNeXt as the feature extractor and FPN as the decoder. After that, we trained the network with a single CFP and the network with cascaded CFPs. These three sets of comparative experiments prove the effectiveness of our proposed feature fusion module and show that the CFP can be used as a basic unit: the network structure obtained by cascading CFPs has better segmentation ability.
The comparison results of the three experiments are listed in Table 4. Comparing the baseline experiment with the single-CFP experiment, the mIoU, mAcc, and aAcc all increased after adding our proposed CFP. Comparing the single-CFP experiment with the cascaded-CFP experiment, all three indices increased further. This performance benefits from our redesigned feature fusion path and redesigned feature fusion method. It proves that the CFP layer can be regarded as a network unit, which can be flexibly combined and used in series to expand the network capacity and improve the effectiveness of the network.
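The interface property that makes this cascading possible can be sketched as follows: a CFP unit consumes a list of multi-scale features and emits a refined list with identical shapes, so units chain directly. The fusion logic shown here is a placeholder; the actual CFP design is given in Section 3.

```python
import torch
import torch.nn as nn

class CFP(nn.Module):
    """Shape-preserving stand-in for the cascade feature pyramid unit.
    The real fusion path is defined in Section 3; here only the
    list-in/list-out interface that enables cascading is illustrated."""
    def __init__(self, channels: int, levels: int = 4):
        super().__init__()
        self.refine = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(levels)
        )

    def forward(self, feats):
        # Refine each pyramid level; output shapes match input shapes.
        return [conv(f) + f for conv, f in zip(self.refine, feats)]

# Because input and output interfaces match, CFP units cascade in series.
cascade = nn.Sequential(CFP(256), CFP(256))
feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
outs = cascade(feats)   # four refined feature maps, same shapes as inputs
```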
4.3.2. Effect of the Large Receptive Field Channel Attention Module
We first set up a basic experiment in which we use ConvNeXt as the feature extractor and FPN as the decoder; we then trained the same model with the LFCA added to evaluate the effectiveness of our proposed LFCA.
Table 5 shows the results of the two experiments. After adding the LFCA module, the mIoU increased by 1.36%, the IoU of the facade increased by 2.19%, and the IoU of the shadow increased by 2.19%. It can be seen that, owing to the proposed attention mechanism, the classification of small objects is improved. The segmentation of the other categories also improves: the IoU of the background increased by 0.36%, and the IoU of the roof increased by 1.85%. In terms of accuracy, the mAcc increased by 1.33%, the Acc of the facade increased by 2.54%, the Acc of the shadow increased by 0.7%, and the Acc of the roof increased by 2.07%. The LFCA module can distinguish the importance of channels, assigning greater weights to channels that contain helpful information and smaller weights to channels with less content. LFCA thus helps the network pay more attention to information-rich channels, highlight the areas of significant interest, and improve the training effect.
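The reweighting idea can be illustrated with a minimal channel-attention sketch in the spirit of LFCA; note that the real LFCA additionally enlarges the receptive field (see Section 3), which this simplified version omits.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Minimal channel-attention sketch: squeeze global context per channel,
    score each channel through a small bottleneck, and reweight the feature
    map so that informative channels receive larger weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # per-channel global context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                             # emphasize informative channels

# Usage: y = ChannelAttention(256)(torch.randn(1, 256, 32, 32))
```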
To understand the role of the LFCA layer more intuitively, we visualized the features before and after the LFCA layer; we refer to these as the "Before" and "After" features, respectively. The visualization result shown in Figure 11 contains two examples. To give the SAR image and the label a more intuitive correspondence, we fused the original SAR image and the GT image with equal weights to generate the "Image" panel. "Before" denotes the visualization before the LFCA structure processes the features, and "After" denotes the visualization after the LFCA structure processes them. Warmer colors in the figure represent larger feature values, i.e., areas to which the network pays more attention. As can be seen, the feature map after LFCA processing highlights the area where the target is located, and the target and background areas are more clearly distinguished.
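A typical way to produce such heat maps (the authors do not specify their exact procedure, so this is an assumed recipe) is to average the feature tensor over channels, normalize it, and render it with a warm-to-cold colormap:

```python
import matplotlib.pyplot as plt

def visualize_feature(feat, ax, title):
    """Render a (C, H, W) feature tensor as a heat map: average over the
    channel axis, min-max normalize, and draw with a colormap in which
    larger activations appear warmer."""
    fmap = feat.detach().cpu().numpy().mean(axis=0)
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-8)
    ax.imshow(fmap, cmap="jet")
    ax.set_title(title)
    ax.axis("off")

# With forward hooks capturing f_before / f_after around the LFCA layer:
# fig, axes = plt.subplots(1, 2)
# visualize_feature(f_before[0], axes[0], "Before")
# visualize_feature(f_after[0], axes[1], "After")
```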
4.3.3. Effect of the Auxiliary Branch
The dataset we use has the following characteristic: the three categories of roof, facade, and shadow appear simultaneously and are closely connected, since every building contains all three parts. Therefore, the images can be divided into two region types: building areas and non-building areas. In some segmentation results, ground objects in non-building areas are misidentified as buildings. To reduce such errors, we design an auxiliary branch.
The auxiliary branch adds an extra segmentation head to the original network. In our setting, we place this branch after the features extracted by stage 2, and its structure is a classic FCN. By adding the auxiliary branch, we aim to strengthen the supervision of the features and improve the quality of the prediction results. The following experiments confirm that the auxiliary branch achieves this effect.
We first processed the original GT images, unifying the three categories of roof, shadow, and facade into a single category, so that the generated mask images contain only two categories: building and background. The processed data are shown in Figure 7, where each row is a set of data comprising the original image to be segmented, the ground truth, and the mask image. The images predicted by the auxiliary branch are compared with the mask images to compute the loss, and back-propagation then updates the network parameters.
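Generating these two-class masks is a simple relabeling step; the sketch below assumes a hypothetical index encoding in which 0 is background and 1-3 are facade, roof, and shadow.

```python
import numpy as np

def building_mask(label: np.ndarray) -> np.ndarray:
    """Collapse facade, roof, and shadow into one 'building' class to form
    the two-class mask supervised by the auxiliary branch.
    Assumes 0 = background and 1-3 = facade / roof / shadow."""
    return (label > 0).astype(np.uint8)   # 1 = building, 0 = background
```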
As mentioned in Section 3.5, our network contains two loss functions: the AUX loss and the SEG loss. To select an appropriate loss ratio, we carried out several experiments and selected three representative groups of results for display. The experimental results are shown in Table 6, where the ratio coefficient comes from Equation (10) and represents the ratio between the two losses. The table shows that when the ratio of AUX loss to SEG loss is set to 2:1, the network's indices worsen compared with the evaluation metrics in Table 5. When the ratio is set to 1:1 or 0.2:1, the network improves slightly on each index, and the 0.2:1 setting scores highest. Therefore, we set the ratio of the loss functions to 0.2:1 in LRFFNet.
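Assuming both branches are trained with cross-entropy (the exact loss form is given by Equation (10) in Section 3.5, so this is an assumption), the combined objective with the chosen 0.2:1 ratio can be sketched as:

```python
import torch.nn.functional as F

AUX_RATIO = 0.2   # chosen AUX : SEG loss ratio of 0.2 : 1

def total_loss(seg_logits, aux_logits, gt, mask):
    """Weighted sum of the main segmentation loss (4 classes, against the
    full ground truth) and the auxiliary loss (2 classes, against the
    building/background mask). Both logits are assumed to be upsampled
    to the label resolution already."""
    seg = F.cross_entropy(seg_logits, gt)     # SEG loss
    aux = F.cross_entropy(aux_logits, mask)   # AUX loss
    return seg + AUX_RATIO * aux
```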
The visual results are shown in Figure 12, where each column is a set of data. "Image" denotes the original input image, "GT" the ground truth, "Initial" the prediction before adding the auxiliary branch, and "Impro" the prediction after adding it. We mark the parts of each image that need special attention with yellow dotted boxes. Comparing the "Initial" and "Impro" images, the isolated color blocks outside the mask area are reduced. Owing to the use of the mask images, the network can focus more on segmentation within the building areas and misclassifies fewer pixels outside the building areas as buildings, so the segmentation results are closer to the ground truth.