1. Introduction
Synthetic aperture radar (SAR) is a crucial remote sensing technique for high-resolution Earth observation in all weather conditions, day and night. As an active imaging radar, SAR demonstrates significant capabilities in both civil and military fields, playing a vital role in remote sensing applications [1,2,3]. SAR image interpretation makes it possible to investigate large scenes [4,5]. Unlike optical images, SAR images are hard to use effectively through visual interpretation alone, and polarimetric SAR (PolSAR) images are even more difficult to interpret [6,7]. With the increasing availability of high-resolution SAR image products, including PolSAR, from airborne and spaceborne systems, land-cover segmentation remains a hot topic in SAR applications. The massive amount of SAR data, however, brings both broad prospects and formidable challenges.
SAR image classification remains the foundation of wide-area terrain investigation and is frequently used in urban segmentation, road extraction, crop classification, change detection, and other applications [8,9,10,11]. Traditional classification methods are mainly based on statistical distributions and physical scattering mechanisms, using features such as wavelet and neighborhood features [12,13,14]. Semantic image segmentation is an end-to-end classification method that classifies the target region pixel by pixel [15]. To meet the demand for efficient processing, SAR image segmentation based on deep learning has gained popularity in recent years [16,17,18,19,20,21,22]. The convolutional neural network (CNN) is one of the main representatives; it is trained on sliding windows and predicts the class of the center pixel of each window [16,17]. Many researchers have improved on this sliding window method, for example with SLS-CNN [18], CV-CNN [19], Multi-Scale-CNN [20], and a modified AlexNet [21]. Additionally, a hybrid CNN–MLP classifier has been proposed for ship classification [22]. However, the sliding window method is inefficient: neighboring windows overlap, so pixels are sampled many times over, resulting in a huge computational and storage burden [23,24]. Moreover, this region-to-pixel method predicts only the center pixel, even though the sampling window often contains pixels of different categories, especially at boundaries. This inconsistency leads to excessive smoothing of boundaries and uncertainty in the segmentation results [25,26].
To overcome the disadvantages of the region-to-pixel method, pixel-to-pixel segmentation methods have been studied to effectively retain the whole region of the target [27,28,29]. The fully convolutional network proposed by Long et al. [27] segments images of any size by replacing the fully connected layers of a CNN with convolutional layers, and it has already been developed and applied in SAR image segmentation [28,29]. Semantic segmentation has attracted significant attention in deep learning as a way to achieve pixel-wise segmentation [30,31,32,33,34,35,36,37,38,39]. ENet [30], proposed by Paszke et al., has been improved and applied in remote sensing image semantic segmentation [31]. ENet reduces the number of parameters by using an asymmetric encoder–decoder structure and uses dilated convolution to enlarge the receptive field of the network. ERFNet [32], proposed by Romera et al., adopts the non-bottleneck-1D block to enhance the learning ability of the network and speed up segmentation. EFNet, proposed by Yin et al. [33], is based on ERFNet, and the two networks have been compared for winter-wheat spatial distribution extraction from Gaofen-2 images. DeepLabV3 [34], proposed by Chen et al., applies atrous convolution modules in cascade or in parallel to capture multi-scale context with multiple atrous rates. In addition, an atrous spatial pyramid pooling module probes convolutional features at multiple scales, which further boosts performance. DeepLabV3 has been applied to in situ sea-ice detection [35]. OSDES_Net, which uses group convolutions, has been proposed for oil spill detection in SAR images [36]. The bilateral segmentation network (BiSeNet) [37], proposed by Yu et al., has been developed for sea–land segmentation [38]; it decouples spatial information and receptive field into a spatial path (SP) and a context path (CP). A feature fusion module (FFM) and an attention refinement module (ARM) are also employed to further improve accuracy at an acceptable cost. The bilateral structure strikes a balance between efficiency and accuracy, which is a significant advantage among semantic segmentation networks.
With the progressive development and application of semantic segmentation networks in SAR image segmentation, the limitations of the commonly employed cross-entropy loss have become apparent. The cross-entropy loss primarily focuses on pixel-level prediction accuracy while overlooking the spatial relationships and semantic continuity between pixels [39]. It is also susceptible to the class imbalance frequently encountered in SAR image land-cover segmentation tasks. The aforementioned networks all employ the cross-entropy loss function [40], which may not be fully suitable for segmentation tasks. Meanwhile, the most commonly used metric for segmentation tasks is the intersection over union (IOU) score, also known as the Jaccard index [41,42]. Incorporating IOU into the optimization target can therefore enhance the segmentation performance of the network. To achieve this, an improved loss function that combines the Lovász-softmax loss and the cross-entropy (CE) loss is proposed in this paper, along with a corresponding training method. Furthermore, a semantic segmentation network named LoSARNet (Lovász-softmax loss optimization SAR net) is proposed for PolSAR image segmentation. LoSARNet adopts a dual-path structure to extract features efficiently, ensuring both high speed and precise semantic segmentation. The improved loss function leverages the advantages of both the Lovász-softmax loss and the CE loss, resulting in higher IOU scores and greater pixel accuracy. A two-stage training approach is designed to achieve better results. The experiments are conducted on multiple datasets, including the AIR-PolSAR-Seg [43], Flevoland, and Oberpfaffenhofen datasets, covering segmentation cases of complex large scenes, multiple categories, and high-resolution images.
The remainder of the paper is organized as follows. In Section 2, the proposed LoSARNet is introduced, including the improved loss function and the network structure. Section 3 provides an overview of the experimental datasets and details the data preparation process for the experiments. The experimental results and analysis are presented in Section 4. The discussion and conclusions are provided in Section 5 and Section 6, respectively.
3. Datasets and Pre-Processing
To verify the effectiveness of the proposed LoSARNet, the experimental datasets are introduced in this section, including the AIR-PolSAR-Seg data, the classical Flevoland data and the Oberpfaffenhofen data. Then, the pre-processing of the experimental data is given. Finally, the evaluation metrics for segmentation performance analysis are listed.
3.1. Datasets
- (1)
AIR-PolSAR-Seg data [43]: The AIR-PolSAR-Seg data supplies PolSAR amplitude images captured by the Gaofen-3 satellite in quad-polarized strip I (QPSI) mode on 29 April 2019. The PolSAR amplitude images include four polarization modes, viz., vertical–vertical (VV), horizontal–vertical (HV), horizontal–horizontal (HH), and vertical–horizontal (VH). The spatial resolution of the images is 8 m, and they are annotated at the pixel level with six typical terrain categories: housing areas, industrial areas, natural areas, land-use areas, water areas, and other areas. Figure 4 shows the PolSAR amplitude image in the HH polarization state and the corresponding ground-truth map with color codes.
The PolSAR image was annotated by researchers at AIRCAS. A radiation calibration operation is conducted to suppress speckle noise, and the ground truth was labeled manually by Wang et al. [43]. The PolSAR image of AIR-PolSAR-Seg is cropped into 500 patches with a size of 512 × 512. Since each PolSAR image patch contains four polarization modes, the total number of PolSAR image patches is 2000 (500 × 4). The PolSAR amplitude images of HH, HV, VH, and VV are all real-valued.
- (2)
Flevoland data: The Flevoland data was acquired by NASA/JPL AIRSAR in 1991 and is widely used in research on PolSAR image classification [15,19,23,24,42]. The size of the PolSAR image is 1020 × 1024, with a spatial resolution of 10 m. Figure 5 displays a pseudo-RGB image of the Flevoland area, obtained by Pauli decomposition [3,5,6,7]. Once the coherence matrix is obtained, spatial averaging is applied to suppress the speckle noise. The ground-truth class labels and color codes are also given in Figure 5, where the black regions are regarded as background. The Flevoland dataset includes 14 categories, namely, Potato, Fruit, Oats, Beet, Barley, Onions, Wheat, Beans, Peas, Maize, Flax, Rapeseed, Grass, and Lucerne.
- (3)
Oberpfaffenhofen data: The Oberpfaffenhofen dataset was acquired by E-SAR and is widely used in research on PolSAR image classification [19,43]. The data was acquired in L-band and has been multi-looked. The size of the PolSAR image is 1300 × 1200 pixels, and the spatial resolution is 3 m. The Pauli image and the ground truth are shown in Figure 6, where the ground-truth class labels cover four typical terrain categories: Built-up Area, Wood Land, Open Area, and Other.
3.2. Pre-Processing and Training
The PolSAR image data consists of multiple polarization channels, so a normalization operation is needed to avoid an imbalanced initialization of weights. The input of LoSARNet is first normalized with a batch normalization layer, and data augmentation is applied by random cropping, horizontal flipping, and vertical flipping.
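The augmentation step can be illustrated with a minimal sketch. It assumes a channel-first patch stored as nested Python lists, a flip probability of 0.5, and a caller-chosen crop size; none of these details are stated in the paper, and the function name `augment` is hypothetical. The point it shows is that the same crop window and the same flips must be applied to every polarization channel and to the label map, otherwise the pixels and their labels fall out of alignment.

```python
import random


def augment(image, label, crop, rng):
    """Randomly crop and flip a multi-channel patch together with its label map.

    image: list of channels, each a list of rows (C x H x W)
    label: list of rows (H x W)
    crop:  side length of the square crop (assumed value, not from the paper)
    rng:   a random.Random instance, so that results are reproducible
    """
    h, w = len(label), len(label[0])
    top = rng.randint(0, h - crop)
    left = rng.randint(0, w - crop)
    # Use the same crop window for every channel and for the label map.
    image = [[row[left:left + crop] for row in ch[top:top + crop]] for ch in image]
    label = [row[left:left + crop] for row in label[top:top + crop]]
    if rng.random() < 0.5:  # horizontal flip: reverse each row (assumed p = 0.5)
        image = [[row[::-1] for row in ch] for ch in image]
        label = [row[::-1] for row in label]
    if rng.random() < 0.5:  # vertical flip: reverse the row order (assumed p = 0.5)
        image = [ch[::-1] for ch in image]
        label = label[::-1]
    return image, label
```

In practice the same logic would normally be written with array operations; nested lists are used here only to keep the sketch free of dependencies.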
For optimization, the AdamW optimizer [48] is used for 200 epochs. The learning rate starts from 0.0001, and l2-norm regularization is employed, with a weight decay of 0.01, to avoid overfitting. To achieve a better training result, a dynamic learning rate adjustment strategy driven by the validation loss is adopted. Specifically, if the validation loss has not decreased within 10 epochs and still does not improve in the 11th epoch, the learning rate is reduced to 0.7 times its current value. Additionally, the minimum learning rate is set to $10^{-7}$ so that an overly small learning rate does not slow optimization.
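The learning rate rule described above can be sketched as a small plateau scheduler. The class name `PlateauScheduler` and its interface are illustrative assumptions, not the authors' code; only the initial rate, the factor of 0.7, the patience of 10 epochs, and the minimum rate are taken from the text. Deep learning frameworks provide comparable schedulers (for example, PyTorch's `ReduceLROnPlateau`), which would normally be used instead.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr=1e-4, factor=0.7, patience=10, min_lr=1e-7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0  # consecutive epochs without improvement

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # No improvement for `patience` epochs and again in the next one:
        # multiply the rate by `factor`, but never go below `min_lr`.
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr
```

With the default values, the rate stays at 0.0001 through ten stagnant epochs and drops to 0.00007 on the eleventh.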
After training for 200 epochs, the weights with the best performance on the validation set are kept as the starting point for the next stage. During the fine-tuning stage, the pre-trained network is optimized with the Lovász-softmax loss function and the cross-entropy loss function, as given in Equation (6). Notably, during training, the two outputs of the context path behind the ARMs are used to compute auxiliary losses [37].
Table 1 summarizes the key training parameters and algorithms in the different training stages.
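To make the fine-tuning objective more concrete, the following is a minimal sketch of a Lovász-softmax loss combined with cross-entropy on a flat list of pixels. The exact form of Equation (6) is not reproduced here, so the weighted sum and the mixing coefficient `weight` are assumptions, as is the choice to average only over classes present in the batch. The function names are illustrative.

```python
import math


def lovasz_grad(gt_sorted):
    """Gradient of the Lovász extension of the Jaccard loss.

    gt_sorted: 0/1 foreground labels sorted by decreasing prediction error.
    """
    total = sum(gt_sorted)
    grad, prev, cum_fg, cum_bg = [], 0.0, 0, 0
    for g in gt_sorted:
        cum_fg += g
        cum_bg += 1 - g
        intersection = total - cum_fg
        union = total + cum_bg
        jaccard = 1.0 - intersection / union
        grad.append(jaccard - prev)  # increment of the Jaccard loss
        prev = jaccard
    return grad


def lovasz_softmax(probs, labels):
    """probs: per-pixel class probabilities; labels: per-pixel class indices."""
    losses = []
    for c in range(len(probs[0])):
        fg = [1 if y == c else 0 for y in labels]
        if sum(fg) == 0:
            continue  # assumption: skip classes absent from this batch
        errors = [abs(f - p[c]) for f, p in zip(fg, probs)]
        order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
        grad = lovasz_grad([fg[i] for i in order])
        losses.append(sum(errors[i] * g for i, g in zip(order, grad)))
    return sum(losses) / len(losses)


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class (probabilities must be > 0)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def combined_loss(probs, labels, weight=0.5):
    # `weight` is a hypothetical mixing coefficient, not taken from Equation (6).
    return (weight * lovasz_softmax(probs, labels)
            + (1.0 - weight) * cross_entropy(probs, labels))
```

The Lovász term is driven by the per-class Jaccard index, so a small class contributes as much as a large one, while the cross-entropy term keeps the per-pixel probabilities well calibrated.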
During the training process, an early-stopping policy is used. The patience is set to 20, which means that training stops and the network with the best generalization performance is kept if the validation loss has not dropped within 20 epochs. During the testing process, the test image is first normalized and then input into the trained network. After a softmax operation on the output, each pixel is assigned the category with the highest score.
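The early-stopping policy can be sketched as follows. The function operates on a recorded sequence of validation losses rather than on a live training loop, which is a simplification for illustration; the name `early_stopping_epochs` is hypothetical, and only the patience of 20 comes from the text.

```python
def early_stopping_epochs(val_losses, patience=20):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch that lies `patience` epochs after the
    last improvement; the weights from `best_epoch` would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The loss kept improving often enough: training ran to the last epoch.
    return best_epoch, len(val_losses) - 1
```

For example, if the loss last improves at epoch 2 (counting from 0), training stops at epoch 22 and the epoch-2 weights are retained.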
3.3. Evaluation Metrics
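The experiments are evaluated using the accuracy for each scene in the dataset. Four metrics are used to evaluate the segmentation performance of the proposed network: the mean pixel accuracy (MPA), overall accuracy (OA), mean IOU (MIOU), and Kappa coefficient.
$$\mathrm{MPA} = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}}$$
$$\mathrm{OA} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}}$$
$$\mathrm{MIOU} = \frac{1}{k}\sum_{i=1}^{k}\frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}$$
$$\mathrm{Kappa} = \frac{\mathrm{OA} - p_e}{1 - p_e}, \qquad p_e = \frac{\sum_{i=1}^{k}\left(\sum_{j=1}^{k} p_{ij}\right)\left(\sum_{j=1}^{k} p_{ji}\right)}{\left(\sum_{i=1}^{k}\sum_{j=1}^{k} p_{ij}\right)^{2}}$$
where $p_{ij}$ indicates the number of pixels which belong to class $i$ but are predicted to be class $j$, and $k$ indicates the total number of categories. MPA and OA measure the pixel accuracy of classification. MIOU and the Kappa coefficient are used to measure the overall segmentation effect.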
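As a worked illustration, the four metrics can be computed from a confusion matrix. The sketch below uses the standard definitions of MPA, OA, MIOU, and Cohen's Kappa and assumes that every class occurs at least once in the ground truth; the function name `segmentation_metrics` is illustrative.

```python
def segmentation_metrics(cm):
    """Compute MPA, OA, MIOU, and Kappa from a confusion matrix.

    cm[i][j] is the number of pixels of true class i predicted as class j.
    Assumes every class has at least one ground-truth pixel.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                             # pixels per true class
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # pixels per predicted class
    diag = [cm[i][i] for i in range(k)]                             # correctly classified pixels
    oa = sum(diag) / total
    mpa = sum(d / r for d, r in zip(diag, row_sums)) / k
    miou = sum(d / (r + c - d) for d, r, c in zip(diag, row_sums, col_sums)) / k
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return {"MPA": mpa, "OA": oa, "MIOU": miou, "Kappa": kappa}
```

For the two-class matrix `[[50, 10], [5, 35]]`, OA is 0.85 while MIOU is about 0.735, which shows how the IOU-based score penalizes errors more heavily than pixel accuracy does.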
4. Experiments
In this section, experiments are performed on the aforementioned datasets. To verify the segmentation performance of the proposed LoSARNet, several typical semantic segmentation networks, including ENetV2, ERFNet, DeepLabV3, and BiSeNet, are used for comparison. All experiments are performed on a PC with an Intel Xeon Platinum 8269CY CPU (Intel, Santa Clara, CA, USA), 128 GB RAM, and an NVIDIA GeForce RTX 3080 GPU (NVIDIA, Santa Clara, CA, USA).
4.1. AIR-PolSAR-Seg
As introduced in
Section 3.1, the AIR-PolSAR-Seg dataset contains 2000 PolSAR amplitude image patches in total. In the experiment, the training samples are 1400 (350 × 4), and the validation samples are 200 (50 × 4); the remaining 400 (100 × 4) PolSAR image patches are set as test samples.
Since the AIR-PolSAR-Seg dataset supplies amplitude patch data under four polarizations (HH, HV, VH, and VV), four-channel PolSAR images are used as the input of LoSARNet. As shown in Figure 4, six effective categories are considered. As only the cropped 512 × 512 amplitude patches are available from AIR-PolSAR-Seg, the segmentation results and the subsequent segmentation metrics are computed on these patches. The training data is converted into a 4-D tensor with dimensions of 350 × 512 × 512 × 4 as input to train LoSARNet; that is, the 350 × 4 single-polarization training images are stacked into 350 four-channel patches. Each patch has a spatial dimension of 512 × 512 pixels and four channels corresponding to the amplitude values under the four polarization modes.
To validate the performance of the loss function and the improvement of LoSARNet over BiSeNet, the results of the two networks with the same training algorithm on the AIR-PolSAR-Seg dataset are shown in Figure 7. A 512 × 512 PolSAR patch is used for visual illustration. Figure 7a is the pseudo-RGB image of the example patch A. As the AIR-PolSAR-Seg dataset supplies amplitude data under different polarizations, the pseudo-RGB image is composed as R = |HH|, G = |HV|, and B = |VV|. Figure 7b shows the ground truth of patch A, and Figure 7c displays the segmentation results of BiSeNet, which adopts the cross-entropy loss function. For classes that cover only small regions, such as the red and cyan areas, the feature map obtained from the spatial path is only 1/8 of the size of the original image, and the network trained with cross-entropy has difficulty identifying these objects correctly, resulting in poor classification results. Figure 7d shows the segmentation results of the proposed LoSARNet. By using the improved loss function expressed in Equation (6) to jointly optimize training, more precise segmentation results are obtained. This supports the claim that the improved loss function fills gaps in the segmentation and performs better on unevenly distributed datasets. It has a particular advantage in recovering small objects, allowing for more comprehensive segmentation results.
To further verify the performance of LoSARNet, it is compared with BiSeNet networks trained with different loss functions and with several other networks. All the experiments are conducted using the same training data, data augmentation approach, and dynamic learning rate strategy; Xception is used as the backbone network of DeepLabV3. The dynamic learning rate strategy described in Section 3.2 is applied, and a network with optimal generalization performance is obtained within 200 epochs. For proper accuracy assessment, the training and test sets used in this paper do not overlap.
The quantitative metrics obtained on AIR-PolSAR-Seg by BiSeNet with different loss functions are shown in Table 2. As mentioned in Section 3.2, the two outputs of the context path behind the ARMs are used to compute auxiliary losses. In BiSeNet, all three outputs are optimized with the cross-entropy loss. 1-Lovász denotes that the Lovász-softmax loss is used only for the main output of the network, while 3-Lovász applies the Lovász-softmax loss to all three outputs. The proposed loss function introduced in Equation (6) is employed for LoSARNet. In Table 2, the MIOU of LoSARNet (51.50%) is clearly higher than that of BiSeNet (42.66%). The other two variants also have a higher MIOU than BiSeNet, which indicates that the Lovász-softmax loss function can improve MIOU. LoSARNet also achieves the highest OA and Kappa, which reveals its advantage. Notably, BiSeNet has a higher OA but a lower MPA than the two Lovász-only variants. This is because a network trained with cross-entropy focuses more on categories with a larger proportion and is less sensitive to categories with a smaller proportion. The Lovász-softmax loss function helps the network pay attention to the regional features of the image, which also benefits small classes. However, using only the Lovász-softmax loss function focuses too much on small categories, resulting in suboptimal overall results. LoSARNet balances pixel accuracy and overall segmentation performance, and the improvements show the effectiveness of the proposed loss function strategy.
Segmentation results on AIR-PolSAR-Seg obtained by different networks are shown in Figure 8, taking one patch as an example. Visually, ENetV2, ERFNet, and DeepLabV3 misclassify pixels and are unable to correctly identify the bare ground and river areas. In Figure 8c, the ENetV2 network fails to learn image features correctly, resulting in blurry areas in the segmentation result. The best segmentation results are achieved by the proposed LoSARNet. BiSeNet misclassifies the bare ground area in the upper left corner of the image, predicting most of the bare ground pixels (red) as river (cyan). In contrast, LoSARNet correctly classifies the bare ground area and obtains a result closer to the ground truth.
The quantitative metrics obtained on AIR-PolSAR-Seg by different networks are shown in Table 3. It can be seen from Table 3 that LoSARNet is superior to the other methods in MPA, OA, MIOU, and Kappa, especially in MIOU. LoSARNet obtains an MIOU of 51.50%, which is higher than that of ERFNet (48.15%), ENetV2 (43.36%), and DeepLabV3 (40.42%). ENetV2, ERFNet, and LoSARNet all perform strongly on the Kappa coefficient, which indicates better overall segmentation performance. For pixel accuracy, LoSARNet achieves the best results on both MPA and OA. At the same time, LoSARNet has the highest recognition speed, surpassing the two real-time semantic segmentation networks ENetV2 and ERFNet.
4.2. Flevoland
As introduced in Section 3.1, the complete PolSAR image has a size of 1020 × 1024. Unlike the AIR-PolSAR-Seg data, the Flevoland data provides a complete polarimetric coherence matrix. A window of size 256 × 256 slides over the PolSAR image with a stride of 32, yielding 625 PolSAR patches in total, each with a size of 256 × 256 in six channels. The six channels correspond to the elements of the polarimetric coherence matrix [3], namely T11, T12, T13, T22, T23, and T33. A total of 10% of the data is used as the training set, 20% as the validation set, and the rest as the test set; each patch has a size of 256 × 256 × 6. Following [9], the six-dimensional complex-valued data is converted into a six-dimensional real-valued vector built from the elements of the coherence matrix.
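$$
\begin{aligned}
A &= 10\log_{10}(\mathrm{SPAN}), \\
B &= T_{22}/\mathrm{SPAN}, \\
C &= T_{33}/\mathrm{SPAN}, \\
D &= |T_{12}|/\sqrt{T_{11}T_{22}}, \\
E &= |T_{13}|/\sqrt{T_{11}T_{33}}, \\
F &= |T_{23}|/\sqrt{T_{22}T_{33}},
\end{aligned}
$$
where SPAN is the total intensity and equals $T_{11} + T_{22} + T_{33}$; A represents the total power of all channels (dB); B and C are the normalized channel values of T22 and T33; and D, E, and F are relative correlation coefficients.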
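The conversion to real-valued channels can be sketched for a single pixel. The formulas in the code are an assumption: they follow the six-element real-valued representation widely used for coherence matrices (total power in dB, two normalized diagonal elements, and three normalized off-diagonal magnitudes), which matches the descriptions of A to F given above. The function name `coherency_to_features` is illustrative.

```python
import math


def coherency_to_features(t11, t22, t33, t12, t13, t23):
    """Map coherence-matrix elements of one pixel to six real-valued features.

    t11, t22, t33: real, positive diagonal elements
    t12, t13, t23: complex off-diagonal elements
    """
    span = t11 + t22 + t33                 # total intensity
    a = 10.0 * math.log10(span)            # total power in dB
    b = t22 / span                         # normalized T22
    c = t33 / span                         # normalized T33
    d = abs(t12) / math.sqrt(t11 * t22)    # relative correlation coefficients
    e = abs(t13) / math.sqrt(t11 * t33)
    f = abs(t23) / math.sqrt(t22 * t33)
    return a, b, c, d, e, f
```

Apart from A, all features lie in [0, 1] for a valid coherence matrix, which gives the six input channels comparable scales.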
The comparison and analysis in Section 4.1 show that LoSARNet is superior to the other networks on complex large scenes. The Flevoland data includes 14 categories, posing new challenges for segmentation. The experimental results are shown in Figure 9. In marked area 1, the results of ERFNet and LoSARNet are consistent with the ground-truth map. DeepLabV3 misclassifies the class Oats, and ENetV2 clearly cannot distinguish this class from the background, although the weight of the background is set very low in the experiment. In marked area 2, the correct category is Fruit, and only BiSeNet and LoSARNet classify most pixels correctly. The quantitative comparison is given in Table 4. The semantic segmentation metrics obtained by LoSARNet are better than those of the other networks. The MIOU of LoSARNet is 86.06%, which is 7.05 percentage points higher than the 79.01% of BiSeNet. ENetV2 and ERFNet also perform well in MIOU, with 81.34% and 84.78%, respectively. For MPA, LoSARNet improves on BiSeNet by about 7 percentage points. As for OA and Kappa, all networks achieve good results. These results show that the proposed LoSARNet can effectively extract the features of PolSAR images and achieve better segmentation accuracy.
The experiments on the Flevoland data show that LoSARNet extracts features and classifies PolSAR images effectively. LoSARNet achieves better segmentation results than the other networks, which is supported by the per-category IOU scores for Flevoland shown in Table 5. LoSARNet is superior to BiSeNet throughout. The results of ENetV2 and ERFNet are also very good in many categories; however, their recognition of certain categories is almost completely incorrect, which is difficult to accept in most segmentation tasks. LoSARNet reaches nearly 100% in most categories; although this is slightly lower than ENetV2 and ERFNet in some cases, accuracy is maintained in most categories.
4.3. Oberpfaffenhofen
As introduced in Section 3.1, the complete PolSAR image has a size of 1300 × 1200. A window of size 256 × 256 slides over the PolSAR image with a stride of 32, yielding 1054 PolSAR patches in total, each with a size of 256 × 256 in six channels. In this experiment, 10% of the samples are used as the training set, 10% as the validation set, and the rest as the test set. The channels of the PolSAR image used in the experiment are the same as those of the Flevoland data; that is, the data is converted into real-valued vectors in the same way.
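The patch counts can be checked with a short sketch of the sliding-window extraction. The paper does not state how the image border is handled; the sketch assumes that the last window in each direction is shifted back so that it ends exactly at the border. Under this assumption the counts agree with the reported 625 patches for Flevoland and 1054 patches for Oberpfaffenhofen. The function names are illustrative.

```python
import math


def window_starts(length, window=256, stride=32):
    """Start offsets of sliding windows along one image axis.

    Assumption: the last window is shifted back to end at the image border.
    """
    n = math.ceil((length - window) / stride) + 1
    return [min(i * stride, length - window) for i in range(n)]


def count_patches(height, width, window=256, stride=32):
    """Number of window positions over a height x width image."""
    rows = window_starts(height, window, stride)
    cols = window_starts(width, window, stride)
    return len(rows) * len(cols)
```

Because the stride (32) is much smaller than the window (256), neighboring patches overlap heavily, so the same pixel appears in many patches.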
The results for the whole dataset are shown in Figure 10. In marked areas 1 and 2, LoSARNet preserves the edge information of the ground-truth map better than the other networks. BiSeNet also achieves good results, which may be an advantage brought by the bilateral structure. In marked area 3, the misclassified area of LoSARNet is clearly smaller. In marked area 4, the results of LoSARNet are closest to the ground-truth map. Since all the results are visually good, further attention is paid to the quantitative results. The overall metrics are shown in Table 6, and the IOU scores of the different categories are given in Table 7. In Table 6, LoSARNet performs best in all evaluation metrics. Specifically, the MIOU of LoSARNet is 94.57%, higher than the 88.46%, 86.21%, 71.29%, and 91.19% of the other networks. Compared with BiSeNet, which already reaches 91.19%, the increase of 3.38 percentage points is remarkable. For MPA, OA, and Kappa, the results of LoSARNet are all above 97%, which means that the segmentation task is almost perfectly completed. The results of the other methods are also very good, but there is still a significant gap compared to LoSARNet. In the per-category IOU scores, LoSARNet achieves the best results for all categories. This is especially true for the category Other, where the other methods make significant errors but LoSARNet distinguishes the pixels more accurately. This category corresponds to the white area in Figure 10, and the visual results agree with the scores. The large improvement of LoSARNet in IOU thus comes mainly from a more accurate classification of the category Other.
5. Discussion
In this paper, LoSARNet is proposed for PolSAR data segmentation. By leveraging a bilateral structure, LoSARNet extracts features efficiently, thereby ensuring both high speed and precise semantic segmentation. To further enhance its segmentation capability, we employ a new loss function that combines the Lovász-softmax loss and the cross-entropy (CE) loss. This joint loss function enables the network to optimize both pixel accuracy and IOU scores simultaneously. As a result, LoSARNet outperforms the compared networks.
To evaluate the segmentation performance of LoSARNet, we conduct experiments on three widely used PolSAR datasets. These datasets cover various segmentation cases, including complex large scenes, multiple categories, and high-resolution images, each corresponding to different segmentation requirements. Several typical semantic segmentation networks, including ENetV2, ERFNet, DeepLabV3, and BiSeNet, are employed for comparison. On the AIR-PolSAR-Seg dataset, LoSARNet achieves significantly higher MIOU scores than these networks, with improvements of 8.14, 3.35, 11.08, and 8.84 percentage points, respectively. Similarly, the MPA of LoSARNet is higher by 8.16, 3.50, 9.89, and 9.33 percentage points, respectively. These experimental results provide strong evidence for the superior performance of LoSARNet. For the Flevoland dataset, the six-channel complex-valued data is converted into six-channel real-valued data so that the network can make use of the phase information as far as possible. Both LoSARNet and ERFNet achieve good results on this dataset, and LoSARNet achieves the best results on MPA, MIOU, and Kappa. This confirms the effectiveness of LoSARNet in multi-class segmentation tasks. On the Oberpfaffenhofen dataset, LoSARNet demonstrates its capability in the detailed classification of high-resolution PolSAR images. LoSARNet exceeds 90% on all evaluation metrics, highlighting its effectiveness. Compared with BiSeNet, LoSARNet improves MPA and MIOU by 3.01 and 3.38 percentage points, respectively, which are substantial gains over the baseline results.
Overall, LoSARNet exhibits superior performance across three PolSAR datasets. These improvements are attributed to the efficient feature extraction enabled by the bilateral structure and the simultaneous optimization of pixel accuracy and IOU through the new loss function.
6. Conclusions
Motivated by the successful application of deep learning methods in PolSAR segmentation, this paper presents a novel SAR segmentation network called LoSARNet, which is optimized with the Lovász-softmax loss. The network outperforms the compared networks on several PolSAR datasets, including AIR-PolSAR-Seg, Flevoland, and Oberpfaffenhofen. Considering the mismatch between the cross-entropy loss and the segmentation task, this paper proposes an improved loss function and a corresponding two-stage training method. The improved loss function combines the Lovász-softmax loss and the CE loss to improve both segmentation performance and pixel accuracy. During training, a dynamic learning rate strategy and an early-stopping strategy are employed to accelerate the convergence of the loss while keeping the optimization from becoming trapped in a local optimum. Compared with BiSeNet, which also has a dual-path structure, the proposed LoSARNet achieves higher evaluation metrics on all datasets. LoSARNet also shows leading overall performance compared with several other segmentation networks, including ENetV2, ERFNet, and DeepLabV3. The improved loss function can effectively fill gaps in the segmentation and yields more comprehensive results, especially for images with unevenly distributed classes. Overall, this work validates a semantic segmentation network with an improved loss function on three classical PolSAR datasets. To make use of the phase information contained in PolSAR data, a complex-valued LoSARNet will be studied in future research.