1. Introduction
Soil salinization, which significantly accelerates land degradation and declining soil productivity, has become one of the most critical threats to agriculture and food security [1,2]. Driven by factors including dry climate, low precipitation, and high evaporation, excessive salt accumulation degrades soil structure and fertility, inhibits plant growth and microbial activity, and reduces crop yields [3]. Salinization has affected over 8.31 × 10⁸ ha of soil worldwide, and the salt-affected area continues to increase [4,5]. China is recognized as a country significantly affected by salinization, with a total saline soil area of 3.6 × 10⁷ ha [6]. Within China, Xinjiang is a major region impacted by saline soils, comprising nearly 20% of the nation's total saline land area and hindering the sustainable development of local agriculture [7,8,9]. Therefore, timely and effective monitoring of the spatiotemporal distribution of soil salinization is vital for ensuring sustainable agricultural development and safeguarding food security [2].
Traditional studies have relied primarily on in-situ sampling and laboratory measurement, which are both labor- and time-intensive [10]. These methods are often limited by sparse sampling points, small spatial coverage, and high uncertainty, resulting in limited representativeness of the results [11,12]. With its wide coverage, short acquisition cycle, and fast processing speed, remote sensing (RS) has emerged as a key approach for monitoring soil salinization [13]. Optical remote sensing has been extensively utilized for regional- and global-scale monitoring, mapping, and control of soil salinization, and remains the most widely adopted multi-temporal dynamic soil monitoring technology [14]. However, optical remote sensing is sensitive to atmospheric conditions and lacks surface penetration capability, which constrains its effectiveness in monitoring soil salinity under unfavorable weather or surface cover conditions [15,16].
Synthetic Aperture Radar (SAR) provides continuous Earth observation, unaffected by weather conditions or cloud cover, enabling day-and-night monitoring [17]. Moreover, SAR backscatter is highly sensitive to soil electrical conductivity (EC), which is strongly influenced by salinity levels [18,19,20]. Consequently, SAR holds considerable potential for soil salinity monitoring [18]. Polarimetric Synthetic Aperture Radar (PolSAR) is an advanced imaging radar system characterized by multi-channel and multi-parameter capabilities [21]. It acquires information from electromagnetic waves that are transmitted toward target objects and reflected back to the sensor; the received signals carry rich target-specific scattering characteristics [22]. PolSAR captures more complete polarimetric scattering characteristics of targets than single-polarization radar, making it valuable for retrieving details such as the dielectric properties, geometric shape, and orientation of ground objects [23] and enhancing the radar's capability to obtain target information [23,24,25]. In addition, polarimetric decomposition is used to extract target features from PolSAR data that are suitable for classification, thereby enabling more accurate target classification and recognition [26]. PolSAR has found extensive application in land cover monitoring [27,28,29], target detection [30,31,32], and terrain classification [33,34,35,36].
In the past decade, deep learning-based classifiers have been increasingly applied to PolSAR image classification [33,37,38,39,40]. As a branch of machine learning, deep learning efficiently handles complex data and is highly capable of feature extraction. Owing to these strengths, it has demonstrated superior performance in PolSAR data interpretation, often outperforming traditional classification methods [39]. However, recent studies have highlighted several persistent challenges, including the limited availability of labeled samples, the complicated scattering mechanisms inherent in PolSAR data, and the difficulty of effectively integrating global and local features across multiple data sources [41,42].
With the continuous development of computer vision, attention mechanisms have been refined and applied to image semantic segmentation [43,44]. An attention mechanism dynamically highlights key features while filtering out less relevant information, thereby improving model performance [45]. Attention is primarily categorized into spatial attention [46], channel attention [47], self-attention [48], and cross-attention [49], among others. Spatial and channel attention are often combined into the convolutional block attention module (CBAM) for multi-source feature fusion [47]. These mechanisms comprehensively scan the input image, allocate attention resources to different regions, select key areas to gather more detailed information, and suppress attention to non-essential regions [41]. In the field of RS, attention mechanisms have been widely applied to semantic segmentation, particularly for the multi-scale, multi-source feature fusion of optical and radar images, addressing challenges related to scale variation and feature integration [50]. Gao et al. proposed a dual-encoder network with a detail attention module (DAM) and a composite loss to enhance SAR-optical image fusion and classification [51]. Yu et al. proposed a dual-attention fusion network (DAOSNet), which enhances SAR-optical image classification by balancing cross-modal semantics and spatial detail through attention-based fusion and a semantic balance loss [52]. Both self-attention and cross-attention have been employed in RS: self-attention enhances feature interactions within an image and strengthens global dependencies, while cross-attention improves feature fusion by facilitating information exchange between different features [53]. Coupling the two has been proposed to address difficulties in scale variation and feature fusion. Li et al. introduced the multimodal cross-attention network (MCANet), which fully leverages both self-attention and cross-attention to independently extract and fuse features, demonstrating exceptional performance in fusing these two modalities [54]. However, current research focuses mainly on feature fusion between SAR and optical images within the domain of computer vision and pattern recognition [55,56]. Owing to dataset limitations and factors such as the spatial heterogeneity of geographical environments, studies embedding attention mechanisms into deep learning algorithms for RS applications in environmental monitoring remain relatively scarce. Furthermore, existing studies concentrate on fusing raw optical and radar images, with limited attention given to the deep fusion of optical spectral information with the backscattering physical information from radar imagery.
To address the aforementioned challenges, this study focused on the Keriya Oasis in Xinjiang, China, using Gaofen-3 full-polarization radar images and Sentinel-2 multi-spectral images at 10 m resolution as input data. Spectral indices and full-polarization radar parameters were extracted and subjected to feature selection to identify key features. On this basis, a Dual-Modal Deep Learning Network for Soil Salinization mapping (DMSSNet) was proposed. The encoder of DMSSNet employs a dual-branch structure to refine features separately for the optical and radar data; the fusion module integrates self-attention and CBAM mechanisms; and the decoder, inspired by the UNet architecture, uses an upsampling path with skip connections. This framework was designed to improve the accuracy of soil salinization mapping through effective multi-modal data fusion, as sketched below.
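To make the architecture concrete, the following is a minimal sketch of the dual-branch encoder, attention-based fusion, and UNet-style decoder described above, assuming a PyTorch implementation; the channel widths, depth, and the simplified recalibration gate standing in for CBAM are illustrative assumptions, not the published DMSSNet configuration.

```python
# Minimal sketch of a dual-branch encoder with attention-based fusion and a
# UNet-style decoder. Channel widths, depth, and the simplified recalibration
# gate are illustrative assumptions, not the authors' published implementation.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 conv + BN + ReLU layers, as in a standard UNet stage.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class DualModalNet(nn.Module):
    def __init__(self, opt_ch=4, sar_ch=4, n_classes=6):
        super().__init__()
        # Separate branches refine optical indices and PolSAR parameters independently.
        self.enc_opt = conv_block(opt_ch, 32)
        self.enc_sar = conv_block(sar_ch, 32)
        self.pool = nn.MaxPool2d(2)
        self.enc2 = conv_block(64, 128)
        # Fusion: global self-attention (applied at reduced resolution to keep the
        # token count manageable), followed by a CBAM-like recalibration gate.
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
        self.gate = nn.Sequential(nn.Conv2d(128, 128, 1), nn.Sigmoid())
        # UNet-style decoder with a skip connection from the full-resolution stage.
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec = conv_block(128, 64)
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x_opt, x_sar):
        skip = torch.cat([self.enc_opt(x_opt), self.enc_sar(x_sar)], dim=1)  # B,64,H,W
        f = self.enc2(self.pool(skip))                                       # B,128,H/2,W/2
        b, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)             # B, (H/2)*(W/2), C tokens
        sa, _ = self.attn(seq, seq, seq)
        f = sa.transpose(1, 2).reshape(b, c, h, w)
        f = f * self.gate(f)                           # channel-spatial recalibration
        out = self.dec(torch.cat([self.up(f), skip], dim=1))
        return self.head(out)
```

For a pair of 128 × 128 patches, `DualModalNet()(torch.randn(1, 4, 128, 128), torch.randn(1, 4, 128, 128))` returns per-pixel logits over six assumed classes (here taken to be VG, BL, WB, and the three salinization levels discussed below).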
4. Results
4.1. Polarimetric Features Extracted from the GF-3 Data
Firstly, the coherency matrix and backscatter coefficients were extracted. Subsequently, a series of polarimetric decomposition methods were applied to the GF-3 imagery. In total, 38 types of polarimetric features were obtained. These features are summarized in Table 5, and the RGB composite image derived from polarimetric decomposition is illustrated in Figure 11.
4.2. Optimal Feature Selection
Before feature selection, a normalization procedure was applied to ensure consistency across different feature scales. Specifically, the Sentinel-2 and Gaofen-3 features were normalized using a linear min-max scaling approach, which linearly transforms the raw feature values into the range [0, 1] [65].
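For reference, the min-max transform maps each raw feature value x to the unit interval using per-feature extrema (assumed here to be computed over the training samples):

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x' \in [0, 1]
```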
To enhance the robustness and generalizability of the feature importance evaluation, we conducted 300 randomized experiments, each with a different random seed for data partitioning and model initialization. In each iteration, feature importance scores were calculated using the LightGBM framework, and the average importance was computed to reduce the influence of stochastic variability and to ensure the stability of the selected features. As shown in Figure 12A,B, the 38 SAR polarimetric features and 17 optical spectral indices were evaluated independently. Features were ranked by mean importance score, and the two resulting sets were trained and evaluated separately.
The bar chart illustrates the mean importance across the 300 runs, while the black lines indicate the corresponding macro F1-score achieved during sequential forward selection (SFS). An inflection point is observed after the top four or five features, beyond which performance gains become negligible. Consequently, we selected the top four features from each dataset; the model trained on these features achieved a macro F1-score exceeding 0.80, providing a suitable balance between predictive precision and model complexity. These eight variables (SI1, CRSI, NDSI, SI2, Touzi_alpha, VanZyl_odd, Cloude_T11, and Yamaguchi_vol) were therefore selected for the classification model.
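The importance-averaging step can be sketched as follows, assuming LightGBM's scikit-learn API and hypothetical arrays `X` (samples × features) and `y` (class labels); this illustrates the procedure rather than reproducing the authors' script.

```python
# Sketch of seed-averaged LightGBM feature importance over 300 randomized runs.
# X, y, and feature_names are hypothetical inputs; hyperparameters are defaults.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split

def mean_importance(X, y, feature_names, n_runs=300):
    scores = np.zeros(X.shape[1])
    for seed in range(n_runs):
        # Re-partition the data and re-initialize the model with a new seed each run.
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
        model = LGBMClassifier(random_state=seed).fit(X_tr, y_tr)
        scores += model.feature_importances_
    scores /= n_runs  # average out stochastic variability
    order = np.argsort(scores)[::-1]
    # SFS would then add features in this ranked order, tracking the macro F1-score.
    return [(feature_names[i], scores[i]) for i in order]
```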
4.3. Comparisons of DMSSNet Among Different Methods
This section compares DMSSNet with three conventional semantic segmentation models: SegNet, ResUNet, and DeepLabv3+. As shown in Table 6, DMSSNet outperformed all competing models in terms of OA, Kappa, and class-wise metrics. Specifically, DMSSNet achieved an OA of 92.94%, surpassing ResUNet, SegNet, and DeepLabv3+ by 5.26, 2.90, and 11.31 percentage points, respectively. Similarly, the Kappa of DMSSNet reached 0.9077, which was 0.07, 0.0384, and 0.1507 higher than those of ResUNet, SegNet, and DeepLabv3+, respectively. DMSSNet also demonstrated superior precision and recall in distinguishing difficult categories such as MS and HS soils, with precision values of 0.9375 and 0.9333 and recall values of 0.9233 and 0.8940. The enhanced performance may be attributed to its dual-branch architecture and hierarchical attention fusion strategy, which effectively integrate complementary information from the optical and PolSAR modalities. These results highlight DMSSNet's effectiveness in capturing subtle salinity differences.
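For clarity, both headline metrics derive from the confusion matrix: OA is the trace divided by the total sample count, and Kappa corrects OA for chance agreement. A minimal sketch:

```python
# OA and Cohen's kappa from a square confusion matrix (rows: reference, cols: predicted).
import numpy as np

def oa_and_kappa(cm):
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                                 # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    return oa, (oa - pe) / (1 - pe)                       # OA, kappa
```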
4.4. Comparisons of Classification Results Under Different Input Data
To comprehensively evaluate the performance of all models across the various data sources (Sentinel-2, Gaofen-3, and Gaofen-3 + Sentinel-2), we conducted classification experiments using SegNet, ResUNet, DeepLabv3+, and the proposed DMSSNet. For each model, we computed the Kappa, macro F1-score, mIoU, and OA. The results are summarized in Table 7.
When the optical indices were combined with the radar polarimetric parameters into a fused dataset, the classification performance of all models improved markedly. As shown in Table 7, models that utilized both Sentinel-2 and Gaofen-3 data outperformed those using single-source inputs. Compared with using only Sentinel-2 or Gaofen-3 data, the OA of ResUNet, SegNet, DeepLabv3+, and DMSSNet increased to 90.04%, 87.68%, 81.63%, and 92.94%, respectively. This demonstrates that multi-source fusion effectively leverages the complementary advantages of the two modalities.
To visually compare the classification performance of each model for salinization mapping, the classification results and the corresponding confusion matrices are shown in Figure 13 and Figure 14.
As illustrated in Figure 13, the classification results varied noticeably depending on the input data. When only Sentinel-2 data were used (Figure 13A), SegNet and DeepLabv3+ generated relatively coarse segmentation, particularly at the boundaries between vegetated and salinized areas. SegNet yielded fragmented classification outputs, while DeepLabv3+ exhibited discontinuous mapping in regions with moderate and heavy salinization. ResUNet demonstrated improved boundary delineation and inter-class separation, yet remained prone to noise in smaller patches. In contrast, DMSSNet delivered more refined and continuous segmentation with clearer class boundaries and reduced noise, particularly in heterogeneous landscapes.
Figure 14(A-1–A-4) further shows that both ResUNet and DMSSNet performed well in overall salinization classification, with DeepLabv3+ excelling in the detection of heavily salinized (HS) soils. Notably, DMSSNet outperformed the other models in accurately identifying vegetation (VG), bare land (BL), and water bodies (WB). When only Gaofen-3 data were used (Figure 13B), classification accuracy declined across all models. SegNet's performance in detecting BL and WB decreased substantially; ResUNet was most effective in identifying HS soils, while DMSSNet achieved superior results for moderately and slightly salinized (MS and SS) soils. SegNet and DeepLabv3+ struggled to distinguish the different salinity levels, whereas ResUNet improved the discrimination of the BL and WB categories. The highest performance was observed with the multi-modal input (Figure 13C), where the fusion of spectral and radar data improved all models. DMSSNet showed the most noticeable improvement, achieving nearly 90% accuracy across all salinization classes, reducing salt-and-pepper noise, and yielding more precise delineation of small and irregular regions. These results confirm the effectiveness of DMSSNet in integrating multi-source information and enhancing soil salinization classification.
4.5. Model Complexity and Inference Time Comparison
To assess the practical deployment potential of the studied models, we compared model complexity, computational efficiency, and segmentation performance. Table 8 reports the number of trainable parameters, FLOPs, model size in megabytes (MB), average inference time per image patch, and the corresponding mIoU. These metrics collectively evaluate model efficiency and deployment cost from multiple perspectives. All models were evaluated using the same input size of 128 × 128 pixels.
Among the tested models, SegNet has the largest memory footprint (117.75 MB, 29.40 M parameters), achieving an mIoU of 72.10% but at a relatively high computational cost, with an inference time of 31.36 ms. ResUNet offers the fastest inference (19.90 ms) and a compact size of 57.33 MB, making it suitable for real-time and low-power applications. Although DeepLabv3+ has the fewest parameters (3.91 M), its complex ASPP module results in the highest FLOPs and an inference time of 37.33 ms. The proposed DMSSNet combines a dual-branch encoder with attention-based fusion and achieves superior accuracy, with an mIoU of 79.12%, at moderate complexity: 42.25 GFLOPs, a lightweight model size of 46.39 MB, and an inference time of 31.16 ms, comparable to SegNet.
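The quantities in Table 8 can be reproduced for any of the compared models along the following lines; the warm-up and averaging protocol is an assumption, and FLOPs counting would require an external profiler (e.g., thop or fvcore) rather than plain PyTorch.

```python
# Sketch of parameter counting, float32 model size, and averaged inference timing.
import time
import torch

def profile_model(model, input_shapes, n_runs=50, device="cpu"):
    model = model.to(device).eval()
    params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    size_mb = params * 4 / 1024 ** 2          # float32 weights only
    xs = [torch.randn(1, *s, device=device) for s in input_shapes]
    with torch.no_grad():
        for _ in range(5):                    # warm-up before timing
            model(*xs)
        # (On GPU, torch.cuda.synchronize() should bracket the timed loop.)
        t0 = time.perf_counter()
        for _ in range(n_runs):
            model(*xs)
        ms = (time.perf_counter() - t0) / n_runs * 1e3
    return params, size_mb, ms                # count, MB, ms per 128 x 128 patch
```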
4.6. Ablation Experiment
To assess the contribution of DMSSNet’s dual-branch encoder and the fusion strategy integrating self-attention and CBAM, two ablation experiments were performed. First, the dual-branch architecture was compared to a single-branch variant that processes all input modalities jointly, thereby evaluating the effect of separating modality-specific features on representation quality and segmentation performance. Second, three fusion strategies—self-attention only, CBAM only, and the combination of both—were systematically tested on both architectural variants under the same training conditions. This approach enables a detailed analysis of the individual and synergistic contributions of the attention mechanisms.
All ablation models were trained under identical hyperparameter settings, loss functions, and datasets to ensure a fair comparison. Segmentation performance was evaluated using quantitative metrics, including overall accuracy (OA) and class-specific accuracies for slightly salinized (SS), moderately salinized (MS), and heavily salinized (HS) soils. The results, presented in Table 9, highlight the relative influence of architectural design and attention modules on DMSSNet's performance.
The results demonstrate that the dual-branch encoder consistently achieves higher accuracies for SS, MS, and HS soils, as well as higher OA, across all fusion strategies. Specifically, compared with the single-branch design, the dual-branch encoder improves SS accuracy by approximately 2.30–3.57 percentage points, MS accuracy by 0.62–4.51 points, HS accuracy by 3.36–5.92 points, and OA by 1.46–2.25 points. These findings highlight the advantage of extracting modality-specific features separately prior to fusion: the dual-branch structure better preserves distinct spectral and backscattering information, thereby enhancing segmentation performance across all salinization categories.
Additionally, an analysis of Table 9 indicates that self-attention and CBAM play complementary roles. Combining self-attention's global context modeling with CBAM's spatial-channel recalibration enables the network to emphasize subtle indicators of salinity within both the optical and PolSAR feature spaces. In the single-branch architecture, the joint use of both attention mechanisms attains the highest segmentation accuracy (SA) of 90.69%, representing improvements of 0.69 and 0.66 percentage points over self-attention and CBAM alone, respectively. While CBAM alone increases MS soil accuracy by 1.79 points compared with self-attention, its performance on HS soils is comparatively lower. In the dual-branch configuration, integrating both attention mechanisms yields a maximum mean segmentation accuracy (MSA) of 93.49%, representing gains of 1.99 and 2.41 points over self-attention and CBAM individually. The accuracies for all salinized soil types consistently remain around 93%, exceeding the OA, which further highlights the effectiveness of dual-branch encoding with hybrid attention fusion in recognizing and distinguishing varying degrees of soil salinization.
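As a reference for the CBAM variant ablated above, a minimal CBAM block (channel attention followed by spatial attention) can be sketched as follows; the reduction ratio and 7 × 7 spatial kernel are the common defaults of the original CBAM design, not necessarily the settings used in DMSSNet.

```python
# Minimal CBAM: channel gating from pooled descriptors, then spatial gating.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP (as 1x1 convs) for the channel-attention branch.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention: squeeze spatially via avg- and max-pooling, then gate.
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention: squeeze channels, then gate each spatial location.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```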
6. Conclusions
This study proposes a dual-branch, multi-modal semantic segmentation network, the Dual-Modal Deep Learning Network for Soil Salinization mapping (DMSSNet), tailored for soil salinization mapping in the Keriya Oasis. DMSSNet integrates Sentinel-2 multispectral and GF-3 full-polarimetric SAR data via parallel encoder branches, each dedicated to extracting complementary features from the optical and radar modalities. The incorporation of both self-attention and CBAM within a hierarchical fusion framework enables the model to capture intra- and cross-modal dependencies effectively, thereby enhancing spatial feature representation.
The polarimetric decomposition features and spectral indices are jointly exploited to characterize diverse land surface conditions. Experimental results in the Keriya Oasis indicate that DMSSNet achieves superior performance, attaining the highest overall accuracy (OA), mean intersection over union (mIoU), and macro F1-score compared with conventional models such as ResUNet, SegNet, and DeepLabv3+. The integration of attention-guided fusion with multi-source optical and radar remote sensing data improves classification performance, particularly in complex, heterogeneous landscapes exhibiting varying degrees of soil salinization. This research demonstrates the effectiveness of multi-modal deep learning frameworks for land degradation monitoring and provides a promising technical reference for future applications in ecological management, environmental monitoring, and sustainable agriculture.