1. Introduction
The proliferation of advanced Earth observation (EO) constellations has revolutionized terrestrial monitoring through the continuous acquisition of unprecedented volumes of geospatial data. Central to this transformation, the fusion of SAR and optical sensors provides a wealth of complementary information. Optical imaging delivers detailed spectral information essential for distinguishing different materials and vegetation types [
1]. Conversely, SAR systems employ active microwave illumination to derive all-weather, diurnally invariant observations, capturing surface dielectric properties and geometric structures via polarization-dependent backscattering coefficients—parameters intrinsically linked to soil moisture variability and urban morphology [
2]. This sensor fusion paradigm has catalyzed innovations in remote sensing scene classification, a critical research frontier in geospatial artificial intelligence. The task involves assigning semantic land use/land cover (LULC) labels to image scenes based on their extracted features. Such classification frameworks play a vital role in environmental monitoring, urban development, and resource management.
The past three decades have witnessed transformative advances in methods for this task, driven by its critical importance in numerous applications, including natural hazard assessment through landslide pattern analysis [
3,
4], geospatial object detection [
5,
6,
7,
8], LULC determination [
9,
10,
11], geographic image retrieval [
12,
13], environmental monitoring, vegetation mapping [
14], and urban planning. Integrating radar and multi-spectral data has emerged as a significant trend in remote sensing. Numerous studies have demonstrated that combining these data sources can lead to higher mapping accuracies [
15,
16,
17]. Regarding the techniques for fusing multisource remote sensing data, the most commonly employed methods are pixel-, feature-, and decision-level fusion [
18]. Pixel-level fusion combines data from two sources to produce an enhanced image, a prime example being the creation of pan-sharpened images through the integration of low-spatial-resolution multi-spectral images and high-spatial-resolution panchromatic images. However, this method may not fully leverage the individual data sources as they are not separately analyzed in the context of image classification. Feature-level fusion combines multiple feature sets before classification, while decision-level fusion fuses outcomes from various classifiers, which may utilize different features or classification techniques. Notably, feature-level fusion dominates operational implementations due to its advantages of straightforwardness and proven effectiveness [
19,
20]. In [
21], SAR and panchromatic (PAN) features are injected into the multi-spectral (MS) images through the “à trous” wavelet decomposition and a modified Brovey transform (MBT). The MBT operates by locally adjusting each MS band according to the ratio of the new intensity component to the original one. The new intensity component is generated by merging SAR and PAN features through a feature selection process. In [
22], a fusion technique has been developed that integrates the wavelet transform with the IHS transform to combine high-resolution SAR images with moderate-spatial-resolution MS images. In this process, a new wavelet-based approximation image is created from the intensity component of the MS images and a weighted combination of SAR features, preventing the over-injection of SAR intensity information.
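As an illustration of the ratio-based (Brovey-style) adjustment described above, the following NumPy sketch scales each MS band by the ratio of a new intensity component to the original one. The array names and the simple intensity proxy are hypothetical placeholders rather than the exact formulation used in [21].

```python
import numpy as np

def brovey_ratio_fusion(ms, intensity_old, intensity_new, eps=1e-6):
    """Scale each multi-spectral band by the ratio of a new intensity
    component to the original intensity (generalized Brovey-style fusion).

    ms            : array (bands, H, W), co-registered MS bands
    intensity_old : array (H, W), original intensity of the MS image
    intensity_new : array (H, W), intensity injected from SAR/PAN features
    """
    ratio = intensity_new / (intensity_old + eps)   # per-pixel adjustment factor
    return ms * ratio[np.newaxis, :, :]             # broadcast the ratio over all bands

# Illustrative call with random arrays standing in for real imagery
ms = np.random.rand(4, 256, 256)
i_old = ms.mean(axis=0)                                 # simple intensity proxy
i_new = 0.5 * i_old + 0.5 * np.random.rand(256, 256)    # e.g., blended SAR/PAN intensity
fused = brovey_ratio_fusion(ms, i_old, i_new)
```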
Recently, the synergistic deployment of Sentinel-1 and Sentinel-2 satellite image time series (SITS) has proven effective for LULC mapping, demonstrating the advantages of combining optical and radar SITS in this domain. Notable examples include the use of optical S2 SITS to generate land cover maps at a national scale [
23]. Recent advancements in remote sensing imaging technology provide abundant data for both research and practical applications. The Sentinel-1 and Sentinel-2 satellites offer Synthetic Aperture Radar (SAR) and optical imagery at a spatial resolution of 10 m, aiding numerous Earth observation projects. Nevertheless, effectively leveraging the complementary data from these sensors remains a critical challenge in remote sensing [
24]. Combining the polarimetric features of SAR data with optical data enhances the ability to differentiate complex urban structures, emphasizing the importance of choosing suitable SAR properties to improve fusion results [
2]. The divergent physical measurement principles of SAR polarimetry (sensitive to dielectric and geometric properties) and optical reflectance (material-specific spectral signatures) raise the challenge of cross-modal feature incongruence. After acquiring a set of aligned SAR–optical images, various traditional pansharpening techniques have been adapted for SAR–optical pixel-level integration. A generative adversarial network (GAN)-based architecture has been introduced, featuring a U-shaped generator and a convolutional discriminator; this network employs multiple loss functions to effectively remove speckle noise while retaining considerable structural detail [
25]. Moreover, atmospheric conditions such as cloud cover often impact optical imagery and can significantly degrade its spectral and spatial quality. Fortunately, SAR data are largely unaffected by weather, making them an ideal complement for enhancing optical images. Several methods have been developed to create cloud-free optical images using auxiliary SAR data from the same location [
26,
27]. Specifically, a straightforward residual model is used to learn directly from the data pairs to produce cloud-free images, proving effective even under heavy cloud coverage [
28].
In computer vision, the remarkable achievements of deep learning have been driven primarily by image classification, the task of assigning one or more labels to a given image. To this end, numerous researchers have utilized the ImageNet database, which comprises millions of annotated images [
29]. In remote sensing, this task is commonly known as scene classification, where one or more labels are assigned to a remote sensing image or scene. Significant progress has also been made in this domain in recent years, accompanied by an increasing number of specialized datasets [
30]. Recent advancements have shifted towards automating fusion through deep learning methodologies like convolutional neural networks (CNNs). These models automatically extract and combine features from SAR and optical data, resulting in substantially better accuracy for LULC classification compared to conventional approaches [
31]. Image fusion can integrate corresponding information from multiple image sources, leverage the benefits of multi-sensor data, and broaden the range of image applications [
32]. Optical images provide clear visuals and abundant spectral details. Synthetic Aperture Radar (SAR) captures the scattering characteristics of objects in almost any weather conditions, both day and night, and SAR images contain detailed texture features and roughness information [
33]. Therefore, the fusion of SAR and optical images can take advantage of images rich in spatial, spectral, and scattering information, facilitating precise identification of targets and mapping the distribution of objects on the ground [
34,
35]. Recently, the application of deep learning in the fusion of multisource remote sensing data has gained attention. A common approach in this domain involves designing a dual-branch network structure. Initially, each branch independently extracts features from various data sources. These features are then combined using techniques such as feature stacking or concatenation. Subsequently, the integrated features are processed by a classifier layer to produce the ultimate classification outcomes [
36,
37,
38]. The robustness of data fusion in scene classification is enhanced by addressing distribution shifts and detecting out-of-distribution data. Synthetic Aperture Radar (SAR) and optical imagery are combined to improve performance in land cover classification tasks, even when distribution shifts occur due to factors such as cloud cover or sensor malfunctions. Various data fusion techniques, including deep learning-based methods, investigate how to manage scenarios where only certain data sources are impacted by distribution shifts [
39]. A dual-input model performs image-level fusion using SAR and optical data through principal component analysis (PCA) and incorporates feature-level fusion techniques for effective multimodal data integration. Deep learning approaches successfully merge the complementary features of SAR and optical images [
40]. The GAN method can convert optical data to SAR data and vice versa for detecting temporal changes in LULC. This demonstrates the potential of combining optical and SAR data for enhanced scene classification by integrating Sentinel-1 and Sentinel-2 datasets [
41].
Many current models utilize shallow classifiers that are often insufficient for learning complex hierarchical features. This limitation hinders their capacity to identify the intricate patterns and details essential for precise classification. The basic feature extraction techniques employed by simpler architectures may overlook vital information in high-dimensional data. Certain models may also find it challenging to manage the high dimensionality of multi-spectral optical and SAR data, resulting in inefficiencies and potential information loss during dimensionality reduction. Lacking effective mechanisms to handle high-dimensional inputs, these models might discard crucial features or become overwhelmed by irrelevant ones. To overcome these issues, we propose a model that addresses the limitations of existing approaches by combining SAR and optical data with an appropriate structure and fusion strategy. It utilizes a novel deep CNN architecture for advanced feature extraction, incorporating a shared feature extraction network that efficiently captures and leverages spatial and spectral features. The features can be fused at either the Early or the Late stage. The model uses Sentinel-1 and Sentinel-2 imagery from the SEN12MS [
42] dataset to evaluate the effectiveness of the models. These innovations enhance accuracy, robustness, and generalization in land type classification using SAR and optical imagery. The rest of the paper is organized as follows:
Section 2 describes the dataset used in this study,
Section 3 outlines the classification methods employed,
Section 4 presents the results,
Section 5 discusses the experimental findings, and
Section 6 concludes the study.
2. Dataset
Sentinel-1 and Sentinel-2 are integral components of the Copernicus Programme, spearheaded by the European Space Agency (ESA). This initiative aims to ensure a steady, detailed, and readily accessible flow of Earth observation data. Sentinel-1 consists of satellites that use C-band Synthetic Aperture Radar (SAR) to image the Earth’s surface under all weather conditions, day and night. In contrast, Sentinel-2 operates as a two-satellite constellation that acquires multi-spectral imagery across 13 spectral bands covering the visible, near-infrared, and short-wave infrared spectrum. Collectively, Sentinel-1 and Sentinel-2 provide a robust toolset for tracking environmental changes, supporting agricultural operations, and advancing climate change studies through their high-frequency and high-resolution data. These capabilities are particularly effective for monitoring changes in soil, water bodies, inland waterways, vegetation, and coastal zones.
This paper uses the SEN12MS dataset [
42] dataset to fuse SAR and optical imagery for land use/land cover classification. The SEN12MS dataset is a distinctive compilation of georeferenced multi-spectral imagery that merges data from the Sentinel-1 and Sentinel-2 satellite missions, tailored to facilitate the development and evaluation of deep learning models and data fusion methods in remote sensing. Featuring aligned Synthetic Aperture Radar (SAR) and optical images, SEN12MS provides extensive coverage across different seasons and diverse global landscapes, with imagery collected from multiple continents and encompassing a wide range of land types and geographic regions. The dataset comprises 180,662 patches spread globally and covering all seasons. Each patch is provided at a pixel resolution of 10 m and measures 256 × 256 pixels. The dataset includes SAR images from Sentinel-1 with two polarimetric channels (VV and VH) and optical images from Sentinel-2 composed of 13 multi-spectral channels (B1, B2, B3, B4, B5, B6, B7, B8, B8a, B9, B10, B11, and B12): ten bands focused on land surface observations (bands 2–4 and 8 at 10 m resolution; bands 5–7, 8A, 11, and 12 at 20 m resolution) and three bands aimed at atmospheric observations (bands 1, 9, and 10 at 60 m resolution). Sentinel-1 images are unaffected by clouds, while the Sentinel-2 images are made cloud-free through mosaicking procedures based on Google Earth Engine (GEE). We randomly selected 12,000 patches from the SEN12MS dataset, of which 3140 are single-label images and 8860 are multi-label images. In single-label classification, each image is assigned exactly one label; in multi-label classification, a scene is assigned a label for every class covering more than 10% of the image. There are six land use/land cover classes: Forest, Cropland, Water, Barren, Urban/Built-up, and Savanna. The selected images are split randomly into training (80%) and validation (20%) sets.
Figure 1 shows some of the samples of Sentinel-1 and Sentinel-2 from the SEN12MS dataset.
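As a reference for the data preparation, the following is a minimal sketch of how one SEN12MS-style patch triplet might be read, normalized, and converted into a 15-channel tensor with a multi-label target. The file paths, the land cover class indices, and the min-max normalization are assumptions made for illustration and are not necessarily the exact pipeline used in this study.

```python
import numpy as np
import rasterio

CLASSES = ["Forest", "Cropland", "Water", "Barren", "Urban/Built-up", "Savanna"]

def load_patch(s1_path, s2_path, lc_path):
    """Read one Sentinel-1/Sentinel-2/land-cover triplet (paths are placeholders)."""
    with rasterio.open(s1_path) as src:
        s1 = src.read().astype(np.float32)   # (2, 256, 256): VV and VH
    with rasterio.open(s2_path) as src:
        s2 = src.read().astype(np.float32)   # (13, 256, 256): B1..B12
    with rasterio.open(lc_path) as src:
        lc = src.read(1)                     # (256, 256): class indices 0..5 (assumed mapping)

    def minmax(x):
        # Per-channel min-max normalization to [0, 1], as in the preprocessing step
        mn = x.min(axis=(1, 2), keepdims=True)
        mx = x.max(axis=(1, 2), keepdims=True)
        return (x - mn) / (mx - mn + 1e-6)

    x = np.concatenate([minmax(s1), minmax(s2)], axis=0)          # (15, 256, 256)

    # Multi-label target: mark every class covering more than 10% of the patch
    frac = np.array([(lc == k).mean() for k in range(len(CLASSES))])
    y = (frac > 0.10).astype(np.float32)

    return np.transpose(x, (1, 2, 0)), y                          # HWC tensor, label vector
```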
3. Methods
Convolutional neural networks (CNNs) have recently been utilized in numerous computer vision tasks, significantly enhancing performance in areas such as image scene classification and object detection. Despite these advancements, classification accuracy largely depends on features that can precisely represent the image scenes. Therefore, effectively leveraging the feature learning capabilities of CNNs is vital for scene classification. The proposed methods are designed to exploit the combined advantages of the two satellite data types. The core of these methods is merging the Sentinel-1 and Sentinel-2 data through a fusion strategy that preserves the spatial and spectral information and captures nonlinear associations between the datasets. Subsequently, feature extraction is performed by a CNN to derive the most significant features from the optical and radar data, enhancing land type classification.
3.1. Early Fusion
The proposed model employs an Early Fusion strategy to integrate Synthetic Aperture Radar (SAR) and optical imagery and classify images into distinct LULC categories. It processes SAR images with 2 channels and optical images with 13 multi-spectral channels, all with an input size of 256 × 256 pixels. The preprocessing stage normalizes the values of the SAR and optical channels to the range between 0 and 1 to avoid biased feature extraction. In the Early Fusion block, the normalized channels are concatenated into a single input tensor of shape (256, 256, 15), which is fed into a convolutional network for shared feature extraction. The network comprises four convolutional blocks with filter counts increasing from 32 to 256, enabling it to capture hierarchical feature representations. Each block uses 3 × 3 kernels to capture local spatial features while maintaining computational efficiency, ReLU activation to mitigate vanishing gradients and accelerate convergence, batch normalization, and 2 × 2 max-pooling to reduce spatial dimensions while retaining key features and limiting overfitting. The extracted features are then processed by the classification head, which consists of dense layers with 128 and 64 units, respectively, each employing ReLU activation followed by a dropout rate of 0.5 for regularization. The final dense layer comprises 6 units corresponding to the six target classes.
The model is flexible for both single-label and multi-label classification. Single-label classification uses a categorical cross-entropy loss function with a softmax activation function in the output layer. For multi-label classification, a sigmoid activation function is used for each output neuron with a binary cross-entropy loss function. The output layer can thus switch between softmax and sigmoid activations depending on the task. This architecture effectively combines SAR and optical imagery to create a robust and accurate classifier, ensuring high accuracy and generalization through efficient feature extraction and regularization. An overview of the proposed Early Fusion model is depicted in
Figure 2. The details of layers and filters used in the Early Fusion model are mentioned in
Table 1.
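A minimal Keras sketch of the Early Fusion architecture described above is shown below. The layer sequence follows the text, while the Flatten step before the dense head and the compile settings are assumptions; Table 1 lists the exact configuration.

```python
from tensorflow.keras import layers, models

def build_early_fusion(num_classes=6, multi_label=False):
    """Sketch of the Early Fusion classifier: 15-channel input, four conv blocks,
    then a dense head with dropout (spatial collapse via Flatten is assumed)."""
    inputs = layers.Input(shape=(256, 256, 15))            # 2 SAR + 13 optical channels
    x = inputs
    for filters in (32, 64, 128, 256):                     # four convolutional blocks
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    activation = "sigmoid" if multi_label else "softmax"   # task-dependent output
    outputs = layers.Dense(num_classes, activation=activation)(x)
    model = models.Model(inputs, outputs)
    loss = "binary_crossentropy" if multi_label else "categorical_crossentropy"
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```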
3.2. Late Fusion
The Late Fusion model is designed with separate feature extraction pathways for optical and Synthetic Aperture Radar (SAR) imagery, which are later fused to optimize the classification of images into six distinct classes. The SAR pathway processes 2 channels, while the optical pathway handles 13 multi-spectral channels. Both pathways start with a Conv2D layer comprising 32 filters, ReLU activation, a 3 × 3 kernel, and ‘same’ padding, followed by a MaxPooling2D layer with a 2 × 2 pool size. During the Late Fusion stage, the outputs of the two pathways are concatenated into a combined feature map, which is subsequently processed by another Conv2D layer featuring 64 filters, a 3 × 3 kernel, ReLU activation, and ‘same’ padding. This is followed by batch normalization to improve training stability and performance.
A GlobalAveragePooling2D layer is then applied to reduce the spatial dimensions and summarize the feature maps. The classification head processes these fused and pooled features through dense layers. The first dense layer consists of 128 units with ReLU activation, followed by a dropout layer with a 0.5 rate to mitigate overfitting. A second dense layer with 64 units and ReLU activation is likewise followed by a dropout layer with a rate of 0.5. The output is then passed through a dense layer with 6 units, using either softmax activation with categorical cross-entropy loss for single-label classification or sigmoid activation with binary cross-entropy loss for multi-label classification. This setup provides the probability distribution across the six target classes. The architecture ensures that features from both SAR and optical imagery are effectively captured and integrated, resulting in a robust and accurate classifier capable of handling complex multi-channel input data. An overview of the proposed Late Fusion model is depicted in
Figure 3. The details of layers and filters used in the Late Fusion model are in
Table 2.
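The Late Fusion architecture can be sketched in the same style. One convolutional block per branch is shown, as described in the text; any additional layers listed in Table 2 are omitted, so this is an illustrative approximation rather than the exact implementation.

```python
from tensorflow.keras import layers, models

def build_late_fusion(num_classes=6, multi_label=False):
    """Sketch of the Late Fusion classifier: separate SAR and optical branches,
    concatenation of their feature maps, then a shared head."""
    sar_in = layers.Input(shape=(256, 256, 2), name="sar")
    opt_in = layers.Input(shape=(256, 256, 13), name="optical")

    def branch(x):
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        return layers.MaxPooling2D(2)(x)

    fused = layers.Concatenate()([branch(sar_in), branch(opt_in)])   # late fusion of feature maps
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(fused)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    activation = "sigmoid" if multi_label else "softmax"
    outputs = layers.Dense(num_classes, activation=activation)(x)
    model = models.Model([sar_in, opt_in], outputs)
    loss = "binary_crossentropy" if multi_label else "categorical_crossentropy"
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model
```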
3.3. Training Settings
In training our deep learning model, we utilize the Adam optimizer [
43] due to its efficiency and adaptive learning rate behavior, setting the learning rate to ensure stable convergence. To balance model performance and computational efficiency, we use a batch size of 32. Training runs for 50 to 100 epochs, depending on when the model converges, allowing sufficient training time while avoiding unnecessary computation; the final number of epochs is set to 100. To ensure proper generalization, 20% of the training data are reserved for validation. We employ early stopping to prevent overfitting: it monitors the validation loss and stops training if there is no improvement within 10 epochs. Additionally, if the validation loss does not improve, a learning rate scheduler decreases the learning rate by a factor of 0.1 after a patience period of 5 epochs. This adaptive approach helps refine the learning process, potentially enhancing performance and expediting convergence.
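These settings map directly onto standard Keras callbacks. The sketch below assumes a compiled `model` and training arrays `x_train`/`y_train` (hypothetical names) and is illustrative only; the learning rate value itself is set when compiling the model.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop training when the validation loss has not improved for 10 epochs
    EarlyStopping(monitor="val_loss", patience=10),
    # Reduce the learning rate by a factor of 0.1 after 5 epochs without improvement
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
]

history = model.fit(
    x_train, y_train,
    validation_split=0.2,   # 20% of the training data reserved for validation
    batch_size=32,
    epochs=100,
    callbacks=callbacks,
)
```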
3.4. Evaluation Metrics
The classification models are evaluated using accuracy for single-label classification and the F1 score for multi-label classification. We report class-wise metrics for the six classes, the average of the class-wise values, and the overall average across all samples. This approach ensures a fair evaluation, considering the imbalance in the class distribution. Equation (1) defines accuracy and Equation (4) defines the F1 score, with precision and recall given by Equations (2) and (3). Here, $TP$ refers to true positives (correctly classified into the class), $TN$ to true negatives (correctly classified into other classes), $FP$ to false positives (incorrectly classified into the class), and $FN$ to false negatives (incorrectly classified into other classes).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$
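In practice, these metrics can be computed with scikit-learn. The sketch below uses placeholder arrays `y_true` and `y_pred` (class indices for single-label, multi-hot vectors for multi-label); the mapping of macro and micro averaging to the class-wise average and the overall average reported here is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

def report_metrics(y_true, y_pred, multi_label=False):
    """Return the evaluation metrics used in this section."""
    if multi_label:
        per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 score per class
        macro_f1 = f1_score(y_true, y_pred, average="macro")    # average of class-wise F1 scores
        micro_f1 = f1_score(y_true, y_pred, average="micro")    # overall F1 across all samples
        return per_class_f1, macro_f1, micro_f1
    return accuracy_score(y_true, y_pred)                       # single-label accuracy
```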
4. Results
The fusion model, incorporating Sentinel-1 and Sentinel-2 data, was evaluated using Early and Late Fusion strategies. Overall, the Early Fusion strategy outperforms the Late Fusion strategy. Early Fusion achieves an overall accuracy of 88.12%, compared to 83.96% for Late Fusion. Similarly, the overall F1 score for Early Fusion is 86.53%, while Late Fusion records 81.87%. These findings indicate that Early Fusion provides more precise and dependable scene-based classifications across various land types. The F1 score and accuracy achieved for each class are depicted in
Figure 4 and
Figure 5. Examples of predicted images are shown in
Table 3.
According to each class’s accuracy and F1 score, Water emerges as the top-performing class in both the Early Fusion and Late Fusion strategies, demonstrating exceptional accuracy and F1 scores of approximately 94% or higher. Early Fusion maintains a particularly robust F1 score of 97.10%, highlighting its effectiveness in accurately identifying water bodies within multi-class scene classification tasks. In contrast, barren areas show subpar performance across both fusion strategies, achieving accuracies ranging from 66.79% to 70.18% and F1 scores below 70%. Late Fusion notably struggles more with barren land classification, reflected in its lower F1 score of 59.51%. The Forest class exhibits varied performance between the two strategies. Early Fusion achieves a higher accuracy at 78.36% but a lower F1 score of 69.83%, suggesting difficulties in precise forested area classification.
In contrast, Late Fusion achieves a higher F1 score of 73.59% but a lower accuracy of 71.15%, indicating a different balance between precision and recall compared with Early Fusion. Built-up areas perform very well with both fusion strategies, achieving accuracy levels over 89% and F1 scores above 87%, with Early Fusion showing slightly better performance in both metrics for this class. Cropland and Savanna achieve average results: Cropland attains an F1 score of 78.98% with Early Fusion while Late Fusion has a lower F1 score of 70.22%, and both strategies maintain an accuracy above 73%. The Savanna class performs reasonably well, achieving over 75% accuracy with both strategies, but Late Fusion faces more challenges, attaining an F1 score of 69.99%, which indicates difficulty in balancing precision and recall for this class. The analysis shows that Early Fusion generally outperforms Late Fusion in both accuracy and F1 score across most land cover classes, as well as in overall performance. However, specific classes such as Forest and Water exhibit higher F1 scores with the Late Fusion strategy, suggesting that the optimal fusion strategy may depend on the specific application or class. Nonetheless, Early Fusion remains the more effective method for integrating Sentinel-1 and Sentinel-2 data in this context.
Table 3 presents typical samples of land use/land cover classification results. For single-label classification, all land use types are correctly classified by both the Early and Late Fusion strategies. For multi-label classification, Cropland and Savanna are often confused due to their similar features, while the Late Fusion strategy misclassifies Forest as Water because of its dark appearance.
Next, the classification performance of the proposed method is compared with existing deep learning models including ResNet [
44], VGGNet [
45], and GoogleNet [
46]. We used the Sentinel-1 (Sen1) and Sentinel-2 (Sen2) datasets individually as the input of these single-source methods. We also used multimodal Sen1+Sen2 imagery on these three models by modifying the input layer to accept multi-channel inputs. For ResNet, we utilized pre-trained weights from ImageNet and fine-tuned the model on the SAR and optical dataset. For GoogleNet, we adapted the input layer to handle multi-channel data and modified specific inception modules to emphasize feature extraction of SAR and optical data.
Similarly, for VGGNet, we modified the input layer for the SAR and optical channels and fine-tuned the model with a lower learning rate to enhance its performance on the SAR and optical dataset. We compared their performance to the multisource fusion method proposed in this study for land type classification. We utilized pre-trained models, adjusted them to the dataset, and fine-tuned the hyperparameters to suit our needs. For both single- and multi-label imagery, ResNet demonstrated higher accuracy than GoogleNet and VGGNet. However, the existing models did not perform as well overall as the proposed fusion method. ResNet achieved approximately 64% accuracy, whereas GoogleNet and VGGNet scored 58% and 53% on the Sen1 dataset, respectively. On the Sen2 dataset, ResNet attained around 66% accuracy, with GoogleNet and VGGNet achieving 62% and 59%, respectively. With the multisource input, the ResNet model recorded about 72% accuracy, while GoogleNet and VGGNet reached around 63% and 60%, respectively. The details of the results obtained by existing models are displayed in
Figure 6 and the overall results of the existing and proposed models are included in
Table 4.
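One common way to feed 15-channel SAR+optical input to an ImageNet-pretrained backbone is to learn a 1 × 1 channel projection in front of it; the sketch below illustrates this for ResNet50, although the exact input adaptation used in the comparison may differ.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_multichannel_resnet(num_classes=6, in_channels=15):
    """Illustrative adaptation of a pretrained ResNet50 to multi-channel input:
    a learnable 1x1 convolution projects the input down to 3 channels."""
    inputs = layers.Input(shape=(256, 256, in_channels))
    x = layers.Conv2D(3, 1, padding="same")(inputs)        # channel projection to RGB-like input
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=(256, 256, 3), pooling="avg")
    x = backbone(x)                                        # 2048-dimensional pooled features
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```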
5. Discussion
The classification results demonstrate that the proposed multisource fusion method significantly surpasses existing deep learning models, including ResNet, GoogleNet, and VGGNet, in land-type classification tasks involving single-label and multi-label imagery. Models trained and fine-tuned on single-source datasets (Sen1 and Sen2) delivered moderate performance, with ResNet consistently outperforming GoogleNet and VGGNet, achieving higher accuracy across most land cover classes. However, the single-source approaches struggled to fully exploit the complementary features available in multi-sensor data, resulting in suboptimal outcomes. ResNet achieved accuracy levels of around 64% on Sen1 and 66% on Sen2, with corresponding F1 score improvements. In comparison, GoogleNet and VGGNet yielded lower accuracies, ranging from 53% to 62% across the datasets. Despite ResNet’s relatively better performance, the limitations of single-source methods became evident, as relying solely on individual sensor data restricted the ability to generalize across diverse land cover types.
Conversely, the proposed Early Fusion and Late Fusion strategies produced notable improvements by integrating multi-sensor data at different stages of the learning process. Early Fusion, in particular, achieved an overall accuracy of 88.12% and an F1 score of 86.53%, significantly surpassing ResNet’s best performance on combined datasets (72.15% accuracy and 68.54% F1 score). This highlights the critical importance of feature-level integration for effectively leveraging complementary spatial and spectral information. The class-level analysis further emphasizes the superiority of the fusion approach. The fusion strategies dramatically enhanced classification accuracy for challenging categories such as Barren and Forest, where single-source models performed poorly. Early Fusion, for instance, improved Barren classification accuracy from 38.67% (ResNet) to 70.18%, showcasing its effectiveness in resolving ambiguities in less distinct classes. Additionally, for high-performing classes such as Water and Urban/Built-up, the fusion method produced even more refined results, achieving near-perfect accuracy and F1 scores. These outcomes highlight the transformative potential of multisource fusion techniques. Current deep learning models also face challenges in multisource data fusion: VGGNet has a limited capacity to capture multiscale features and is computationally intensive; the inception modules of GoogleNet struggle to effectively capture SAR-specific features; and ResNet, although a powerful model, tends to overfit when SAR and optical data are used together. By capturing complementary features from multi-sensor data, the fusion strategies not only enhance classification accuracy but also improve robustness across diverse and challenging land cover scenarios.
Overall, the proposed fusion approaches perform well on both single-label and multi-label data. The Water, Built-up, and Cropland categories consistently achieve high accuracy and F1 scores across both fusion strategies, demonstrating strong classification performance. In contrast, the Barren, Savanna, and Forest categories exhibit more variability, with some metrics indicating fluctuations in classification accuracy and in the balance between precision and recall. These findings emphasize the strengths of the Early Fusion strategy compared with the Late Fusion strategy in classifying various land use/land cover types using Sentinel-1 and Sentinel-2 data.
Early Fusion outperforms Late Fusion because SAR and optical images contain complementary information. The proposed Early Fusion model effectively utilizes data from both sensor types to enhance performance. In Early Fusion, SAR and optical images are combined at the input level, allowing the network to extract joint features from both modalities from the beginning. A deep CNN learns to suppress SAR noise by leveraging optical data as contextual information. Since the fusion happens early, the network can identify correlations between optical textures and SAR patterns, which helps reduce noise interference. Neural networks excel at distinguishing features when trained on both noisy (SAR) and clean (optical) inputs simultaneously. This allows the model to prioritize relevant optical features while naturally mitigating SAR noise. In contrast, Late Fusion first extracts features from SAR and optical data separately before combining them. This approach presents challenges. Since SAR noise remains present during feature extraction, the SAR branch learns less discriminative and noisier representations. Meanwhile, the optical branch extracts clean features, but early interaction does not occur between the two modalities. As a result, when fusion happens later, the model integrates high-quality optical features with noisy SAR features, ultimately leading to reduced classification accuracy. Early Fusion enables the model to capture cross-modality relationships from the beginning. By integrating SAR data at an early stage, specific features in the SAR data better complement spectral features in the optical data, allowing the model to learn not only modality-specific features but also those that arise from their interaction, resulting in more robust feature representations that improve classification accuracy. In Late Fusion, SAR and optical data are processed independently before being combined at a later stage. This method maintains the unique properties of each modality while restricting the network’s capacity to effectively utilize cross-modal interactions. Since fusion occurs only after feature extraction, the model may miss important shared information, leading to less efficient multimodal learning.