1. Introduction
Crop-type mapping is highly relevant for numerous large-scale agricultural applications, such as sustainable crop management, food security, or agricultural policy [1,2]. In this context, satellite images have become increasingly important as they frequently provide useful information about cropland at a large scale. Earth observation (EO) missions such as the Sentinel (ESA) and the Landsat (NASA/USGS) programs freely provide petabytes of imagery, cover large areas with high spatial and temporal resolution, and thus offer new opportunities.
In recent years, the combination of multitemporal multispectral and radar satellite data has been increasingly used for crop-type mapping, as it has led to improved classification accuracies compared to mono-temporal or single-sensor data applications [3,4,5,6,7]. These studies have shown that synthetic aperture radar (SAR) data provides different but complementary information on land use and land cover (LULC) when compared to multispectral data. While multispectral data from optical sensors offers information about the spectral reflectance of crops, SAR data complements this by revealing the structural and physical attributes of plants, including details about surface roughness, moisture content, and overall geometry. The combination of multispectral and SAR data can therefore lead to increased classification accuracy [8,9], especially when the discrimination of crop types and LULC classes is difficult due to spectral ambiguities. Moreover, the use of multitemporal SAR data is particularly interesting when the availability of multispectral data is limited due to cloud cover; even a few additional Sentinel-1 images can increase the accuracy of the resulting LULC map [10]. A detailed review of applications of multispectral and SAR data for LULC monitoring is given by Joshi et al. [11].
Besides the increased availability of data from different complementary sensors, such as multispectral and SAR, research in LULC mapping has been driven, among other factors, by the selection of representative spatial and temporal image information to increase classification accuracy. In the context of crop-type mapping, spatial image information has often been derived by applying texture filters [12,13,14] or by using object-oriented segmentation methods [15,16,17]. Although image segmentation can significantly increase the accuracy when compared to the results achieved by a pixel-based classification, the definition of adequate segmentation parameters can be challenging [16,18]. In order to represent important temporal patterns of crop dynamics, (spectral-)temporal metrics or phenological information have often been derived from satellite imagery [3,5,19]. However, traditional approaches have limitations in automation and rely on human experience and domain knowledge when extracting relevant image features from such complex input data [20]. Currently, deep learning methods are gaining increasing attention due to their automatic and efficient context feature learning from multi-dimensional input data and their ability to process huge datasets, which demand adequate algorithms for data analysis. In particular, convolutional neural networks (CNNs) have recently been applied for crop-type mapping using the combination of multispectral and SAR imagery because they have proven to provide excellent classification accuracy compared to classical approaches [21,22,23,24,25].
Nevertheless, the combination of multisensor data remains challenging in the field of remote sensing [23]. It is particularly important to fuse multisensor data adequately so that relations between data modalities can be learned efficiently [26]. The main fusion strategies can be divided into three categories [24]: (1) input-level fusion: information from multiple sources or modalities is combined by simple data stacking into a single data layer; (2) feature-level fusion: features are extracted from each modality separately and then combined; (3) decision-level fusion: the decisions or outputs of multiple classifiers are combined to make a final decision. Initial studies have successfully applied deep learning methods as appropriate models for fusing multispectral and SAR data for crop-type mapping [7,21,23,24,25,27]. However, input-level fusion, typically a simple initial stacking of the individual image sources, has so far been the preferred fusion technique [21,25,27]. The drawback of this approach is the high dimensionality of the input data after layer stacking [28] as well as the fact that interactions between data modalities are not fully considered; dependencies that may be present within the different datasets are therefore ignored [23]. Decision-level fusion can improve the classification results due to the combination of multiple classifiers [4]. The integration of multiple classification outcomes, however, requires specialized knowledge.
When it comes to combining remote sensing data from multiple sources, feature-level fusion methods are considered to be a preferable and more suitable option in comparison to input-level and decision-level fusion methods due to the effective reduction of data redundancy and the ability of multi-level feature extraction. Currently, only a few studies have demonstrated the advantage of CNNs as a reliable and automated feature fusion technique for crop type classification. In [23], a LULC classification was performed using a deep learning architecture based on recurrent and convolutional operations to extract multi-dimensional features from dense Sentinel-1 and Sentinel-2/Landsat time series data. For the classification, these features were combined using fully connected layers to learn the dependencies between the multi-modal features. Ref. [24] analyzed and compared three fusion strategies (input-, feature-, and decision-level) to identify the strategy that optimizes optical-radar classification performance using a temporal attention encoder. The results showed that the input- and feature-level fusion achieved the best overall F-score, surpassing the accuracy achieved by decision-level fusion by 2%. Ref. [29] proposed a novel method to learn multispectral and SAR features for crop classification through cross-modal contrastive learning based on an updated dual-branch network with partial weight-sharing. The experimental results showed that the model consistently outperforms traditional supervised learning approaches.
In the broader context of LULC classification, initial studies have integrated CNNs as effective fusion techniques. For instance, ref. [30] introduced a two-branch supervised semantic segmentation framework enhanced by a novel symmetric attention module to fuse SAR and optical images. Similarly, ref. [31] proposed a comparable methodology, employing a spatial-aware circular module featuring a cross-modality receptive field. This mechanism was designed to prioritize critical spatial features present in both SAR and optical imagery. Both approaches notably leveraged attention mechanisms using CNN layers to optimize the integration of both modalities, resulting in significant enhancements in LULC classification. However, these approaches are constrained by their reliance on monotemporal input data and do not address spatio-temporal feature extraction across different modalities. Hence, the utilization of CNN techniques for classifying multispectral and SAR data in crop-type mapping remains relatively unexplored. In particular, most deep learning studies have relied on simplistic methods such as image stacking or single-stage feature fusion. Other approaches, in turn, tend to be excessively complex, involving multiple intricate processing steps, which can result in increased computational costs; such methods may include advanced feature fusion techniques, multi-stage processing pipelines, or heavy reliance on additional data sources and annotations. Moreover, many early studies in the context of multispectral and SAR fusion were restricted to study sites of less than 3000 km², or even to the plot/farm level, or considered only one cropping season [11]. However, the availability of Sentinel-1 and Sentinel-2 fosters the combination of SAR and multispectral data for large-scale applications and multi-season analysis [4,5,6].
The presented study was intended to make an initial contribution to large-scale and multi-annual crop type classification using multitemporal multispectral satellite data from Sentinel-2/Landsat-8, as well as SAR images from Sentinel-1. Our research focused on developing a spatial and temporal multi-stage image feature fusion technique for multispectral and SAR images to extract relevant temporal vegetation patterns between the different image sources using state-of-the-art CNN methods. For this, we created an adapted dual-stream 3D U-Net architecture for multispectral and SAR images, respectively, to extract relevant spatio-temporal vegetation features based on vegetation indices and SAR backscatter coefficients. To improve channel interdependencies among the data sources, we extended the U-Net with a feature fusion block, applied at multiple levels in the network and based on the 3D squeeze-and-excitation (SE) technique by [32]. This technique allows features from different input sources to be combined by emphasizing important features while suppressing less important ones. Additionally, the multi-level fusion method reduces the problem of extracting noisy image information due to clouds or speckle noise. In the following, the proposed model is referred to as 3D U-Net FF (3D U-Net Feature Fusion).
As the creation of cloud-free image composites from optical remote sensing data is still a challenging task, researchers frequently rely on spectral-temporal metrics over a certain time period or perform time series interpolation [6,24]. Nevertheless, the reduction of pixel-wise spectral reflectance into broad statistical metrics as well as the calculation of synthetic reflectance values may result in the loss of relevant spatial, spectral, and temporal information for crop-type mapping. Moreover, as with other feature extraction methods (e.g., image segmentation, texture), the definition of statistical metrics is user-dependent. For this reason, we used the “original” multitemporal data and implemented an adapted temporal sampling strategy to reduce cloud coverage in the data.
The overarching goal of our study was to develop a robust approach for seasonal crop-type mapping using CNNs and multisensor data. We expected the results of the proposed approach to outperform other classifiers in terms of classification accuracy. Moreover, we assumed that the model could learn meaningful spatio-temporal features and could thus be transferred in time at a large scale. In order to evaluate the potential of our concept and crop maps, we (i) evaluated whether the proposed approach increases the mapping accuracy when compared to commonly used 2D and 3D U-Net architectures, and (ii) analyzed the transferability of the proposed model to another cropping season without using additional training data.
The specific objective of our study was the generation of large-scale crop type maps of Lower Saxony, Germany. We generated four seasonal crop type maps, comprising the cropping seasons 2017/18, 2018/19, 2019/20, and 2020/21. Each crop year began in October and ended in September of the following year.
To assess the effectiveness of our approach, we compared our proposed network with classical 2D and 3D U-Net models. Afterwards, the trained model was applied to generate a crop type map for the cropping season 2021/22 in Lower Saxony. Overall, the findings of this study were expected to contribute to an improved and (semi-)automated framework for large-scale crop-type mapping.
3. Methods
3.1. Patch-Based Multitemporal Image Composites
In contrast to traditional pixel-based classification algorithms, 2D and 3D CNN methods require patch-based input data, which enable the extraction of spatial information. Therefore, we spatially sliced all multispectral and SAR satellite images into non-overlapping patches of equal size using a sliding window method. The patch size specification is critical for the classification performance since it determines the extent to which spatial features within a patch can be extracted by the neural network. We used a window size of 128 × 128 pixels (approximately 163.84 hectares), which preserved crop-specific structures without becoming too computationally intensive for model training.
To derive valuable phenological crop features from our satellite data, we generated multitemporal image composites for all four considered seasons. However, rather than generating region-wide seasonal image composites, we produced multispectral and SAR composites for each considered patch individually.
When considering multispectral satellite data for image composites, clouds are commonly masked to minimize data noise [6,7,27]. However, because of the high temporal and spatial variability of clouds, it remains a major challenge to derive high-temporal-resolution optical image composites. Due to this fact, we decided not to mask any clouds within our multitemporal patches; instead, we utilized a 14-day window across each growing season, aiming to capture relevant phenological changes in crops while managing the challenges posed by cloud cover as well as computational demands. Within each 14-day window, for each patch, we selected the multispectral image with the least cloud cover based on the Sentinel-2 s2cloudless and Landsat-8 fmask cloud detection algorithms. If the Sentinel-2 and Landsat-8 image patches had the same minimum cloud coverage, we preferred the Sentinel-2 data due to their higher spatial resolution. The aim of this method was to reduce the impact of clouds within a multitemporal patch composite while maintaining spatial interrelations between the pixels and keeping temporal information on vegetation phenological dynamics throughout the cropping year. Since remaining clouds were not masked and acquisition dates could differ between patch composites, we aimed to force the applied classifiers to extract more general crop features to overcome the constraints of temporal cloud cover and the variability in temporal acquisition patterns.
To create seasonal patch-based image composites from the Sentinel-1 data, we also applied a 14-day window for each patch, thus achieving the same temporal resolution for the composites as the multispectral data. As the SAR data is not much affected by clouds, we calculated the mean VH and VV backscatter values of all available images within a 14-day window. Finally, each patch-based composite contained 24 multispectral and SAR images.
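To make the described sampling strategy concrete, the following Python sketch assembles a single patch composite. It is a minimal illustration, not the authors' processing chain: the helper name, the day-of-season bookkeeping, and the assumption of 24 consecutive 14-day windows per season (with at least one optical and one SAR acquisition per window) are ours.

```python
import numpy as np

def composite_patch(optical_stack, cloud_fraction, sar_stack, optical_days, sar_days,
                    n_windows=24, window_days=14):
    """Build the multitemporal composite for one 128x128 patch (illustrative sketch).

    optical_stack : (T_opt, H, W, B) Sentinel-2/Landsat-8 reflectance patches
    cloud_fraction: (T_opt,) per-patch cloud cover from s2cloudless / fmask
    sar_stack     : (T_sar, H, W, 2) Sentinel-1 VV/VH backscatter patches
    optical_days, sar_days : acquisition day-of-season for each image
    """
    opt_composite, sar_composite = [], []
    for w in range(n_windows):
        start, end = w * window_days, (w + 1) * window_days
        # optical: keep the least cloudy acquisition inside the 14-day window
        idx = [i for i, d in enumerate(optical_days) if start <= d < end]
        best = min(idx, key=lambda i: cloud_fraction[i])
        opt_composite.append(optical_stack[best])
        # SAR: average all VV/VH acquisitions inside the same window
        jdx = [j for j, d in enumerate(sar_days) if start <= d < end]
        sar_composite.append(sar_stack[jdx].mean(axis=0))
    # shapes: (24, H, W, B) optical and (24, H, W, 2) SAR per patch
    return np.stack(opt_composite), np.stack(sar_composite)
```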
To balance the number of bands in the multispectral (6 bands) and SAR (2 bands) imagery and to reduce spectral noise in the multispectral imagery [38], we applied two vegetation indices to the multispectral patch composites. Firstly, we computed the normalized difference vegetation index (NDVI), a widely used remote sensing index that can derive unique crop characteristics based on vegetation status and canopy structure [39]. Secondly, we derived the tasseled cap Greenness, which provides relevant information on vegetation phenological stages, using the tasseled cap coefficients ascertained by [40]. Both indices have been successfully used in the context of LULC classification [3,6,41,42].
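Both indices are simple per-pixel band combinations; the sketch below shows how they could be computed. The helper names are ours, and the sensor-specific Greenness coefficients are not reproduced here, so they would have to be taken from the publication cited as [40].

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def tasseled_cap_greenness(bands, coefficients):
    """Greenness as a linear combination of the spectral bands.

    bands        : (..., B) reflectance array (e.g., the 6 bands used here)
    coefficients : (B,) sensor-specific tasseled cap Greenness weights
                   (take these from the source cited as [40]; placeholder here)
    """
    return np.tensordot(bands, np.asarray(coefficients), axes=([-1], [0]))
```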
3.2. 2D and 3D U-Net Model
Our classification approach was based on the commonly used U-Net architecture by [43]. The U-Net model is a special type of deep learning architecture from the domain of semantic image segmentation that is particularly suitable for classification using remote sensing data, as it performs better with small sample sizes and unbalanced classes than other CNN models. Due to its symmetric encoder-decoder structure coupled with skip connections, contextual information at multiple scales can be preserved, allowing the model to capture both local and global spatial dependencies within the imagery. This is crucial for accurately distinguishing between different crop types, as the spatial arrangement and contextual cues play a significant role in classification [44].
The typical 2D U-Net architecture is symmetric and consists of two connected parts: a contracting path and an expansive path. The contracting path is a typical convolutional network, which uses a sequence of 3 × 3 convolutional and 2 × 2 max-pooling layers to extract and aggregate local spatial image features. After each step, the number of filters is doubled. The expansive path uses a series of 2 × 2 upsampling and convolutional layers to associate the derived image features with their corresponding location in the segmentation map. A distinctive characteristic of U-Net models is the skip connections, which transfer feature maps created in the downsampling process to the corresponding output of the upsampling operation in order to preserve important spatial information that would otherwise be lost by pooling operations [45].
The 3D U-Net uses 3D kernels to extract features from volumetric data (voxels); the convolutions are therefore applied in three dimensions to extract volumetric information from the input data. In our case, we applied the 3D U-Net model to capture spatio-temporal features from the generated multitemporal image patches.
3.3. Proposed Multi-Stage 3D Feature Fusion U-Net Model (3D U-Net FF)
The proposed network model was based on a common 3D convolutional U-Net architecture with modifications to fuse spatio-temporal image features from multispectral and SAR data at different aggregation levels (see Figure 2). In distinction to the original U-Net architecture, we used a dual-stream 3D U-Net in which the multispectral and SAR image data are split to learn the spatio-temporal features of both data sources separately. With this, we targeted enhanced feature representation and spatial context preservation, as each data stream can focus on sensor-specific image features, capturing both fine-grained details and high-level spatio-temporal semantic information.
The multispectral branch contained a stack of multitemporal patches of NDVI and Greenness values of Sentinel-2 and Landsat imagery, whereas the SAR stream included a stack of temporal VV and VH backscatter patches, derived from Sentinel-1. Therefore, both streams had an input size of 128 × 128 × 24 × 2 and were passed through the contracting and expansive path of the 3D network in parallel to learn features separately.
In detail, the contracting path of both streams involved three downsampling steps (S1, S2, and S3). Each step consisted of a convolutional block with two consecutive 3D convolutions (3 × 3 × 3), each with batch normalization and a rectified linear unit (ReLU). Each convolutional block was followed by a 3D max-pooling operation (2 × 2 × 2) to gradually downsample the patches in the spatio-temporal domain. In the downsampling process, the number of feature maps was doubled with each subsequent step. To connect the contracting and the expansive paths, another convolutional block (S4) with a doubled filter number was used. Analogous to the contracting path, the expansive path was built up of three steps (S5, S6, and S7). Each step included a 3D transpose convolution with a stride of 2 × 2 × 2 to increase the sizes of the feature maps, followed by the concatenation of the corresponding skip connection. The combined features were then processed by a convolutional block with two consecutive 3D convolutions. The feature maps of both streams were finally added and passed to the output layer. As the output of the network needed to be a 2D classification image, the resulting 3D feature maps were reshaped into 2D arrays and converted to the number of classes using a 2D convolution. To obtain the final class probability maps, we used the softmax function.
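A minimal PyTorch sketch of one such encoder step is given below. The class name, the chosen (batch, channels, time, height, width) tensor layout, and the use of PyTorch itself are our assumptions for illustration; only the block structure (two 3 × 3 × 3 convolutions with batch normalization and ReLU, followed by 2 × 2 × 2 max pooling, applied in parallel to both streams with 64 initial filters) follows the description above.

```python
import torch
import torch.nn as nn

class ConvBlock3D(nn.Module):
    """Two 3x3x3 convolutions, each with batch normalization and ReLU (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# one contracting step (e.g., S1): convolutions followed by 2x2x2 max pooling,
# applied in parallel to the multispectral and the SAR stream (64 initial filters)
pool = nn.MaxPool3d(kernel_size=2)
opt_step1, sar_step1 = ConvBlock3D(2, 64), ConvBlock3D(2, 64)

# assumed layout: (batch, channels=2, time=24, height=128, width=128)
x_opt = torch.randn(1, 2, 24, 128, 128)   # NDVI + Greenness composite
x_sar = torch.randn(1, 2, 24, 128, 128)   # VV + VH composite
with torch.no_grad():
    f_opt, f_sar = opt_step1(x_opt), sar_step1(x_sar)        # skip-connection features
    f_opt_down, f_sar_down = pool(f_opt), pool(f_sar)        # input to the next step
```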
To address the task of feature fusion, we modified the multi-level skip connections within the U-Net model. Each corresponding skip connection from both branches was passed to a novel fusion module to combine the spatio-temporal features from both data streams. With this, patterns and dependencies of the different sources should be learned at different aggregation levels, resulting in an efficient and flexible learning procedure.
Our fusion module (see Figure 3) was inspired by the squeeze-and-excitation (SE) network introduced by [32], which improves channel interdependencies of single or multiple input sources at almost no computational cost. In our study, we first applied a 3D convolution with a 1 × 1 × 1 kernel to each input branch and concatenated the outputs along the channel dimension. Then we passed the result into a 3D squeeze-and-excitation (SE) network, which selectively focuses on significant feature maps using fully connected layers. Although this model structure has been successful in tasks like image segmentation or object detection, its application to data fusion has been limited.
The 3D SE network performs two main operations: squeezing and excitation. Squeezing reduces the 3D feature maps to a single dimension (1 × 1 × 1 × C) via a 3D global average pooling layer, enabling channel-wise statistics to be computed without altering the channel count. The excitation operation involves a fully connected layer followed by a ReLU function to introduce nonlinearity; the number of channels is reduced by a reduction ratio, which was set to 4 in our case. A second fully connected layer, followed by a sigmoid activation, provides each channel with a smooth gating function. This enables the network to adaptively enhance the channel interdependencies of both input sources at minimal computational cost. Finally, we weighted each feature map of the merged feature block based on the result of the squeeze-and-excitation network using simple matrix multiplication. The feature maps of the individual fusion blocks were passed to the corresponding level of the expansive path of both network streams. Therefore, each stream could benefit from the derived information for optimized feature learning.
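The following PyTorch sketch illustrates how such a fusion block could be implemented. It is a minimal reading of the description above (1 × 1 × 1 convolutions per stream, channel concatenation, global average pooling, two fully connected layers with reduction ratio 4, sigmoid gating, channel-wise reweighting); the class and variable names are ours and details of the original implementation may differ.

```python
import torch
import torch.nn as nn

class FusionBlockSE3D(nn.Module):
    """Sketch of the described fusion module: per-stream 1x1x1 convolutions,
    concatenation, and 3D squeeze-and-excitation recalibration (ratio r)."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.proj_opt = nn.Conv3d(channels, channels, kernel_size=1)
        self.proj_sar = nn.Conv3d(channels, channels, kernel_size=1)
        fused = 2 * channels
        self.squeeze = nn.AdaptiveAvgPool3d(1)          # -> (N, fused, 1, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(fused, fused // r),                # reduction ratio r = 4
            nn.ReLU(inplace=True),
            nn.Linear(fused // r, fused),
            nn.Sigmoid(),                                # smooth per-channel gate
        )

    def forward(self, x_opt, x_sar):
        fused = torch.cat([self.proj_opt(x_opt), self.proj_sar(x_sar)], dim=1)
        n, c = fused.shape[:2]
        w = self.excite(self.squeeze(fused).view(n, c)).view(n, c, 1, 1, 1)
        return fused * w                                 # channel-wise reweighting
```

A block like this would be instantiated once per skip-connection level, taking the multispectral and SAR feature maps of that level as inputs.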
As part of the study, we compared our proposed multi-stage fusion model with classical 2D and 3D U-Net architectures for the cropping seasons 2017/18 to 2020/21. To achieve comparability, we implemented the structure of the two conventional U-Net models according to our presented architecture. In contrast to the proposed network, the input data (NDVI, Greenness, VV, and VH) for the traditional models were stacked (input-level fusion) and fed into a single U-Net. Also, traditional skip connections were used by simply concatenating the feature maps from the contracting path to the expansive path. As the input size and the number of filters did not change, a valid comparison of all models could be ensured. For the 2D U-Net model, the 3D convolutions were replaced with 2D convolutions.
3.4. Experimental Setup
The proposed 3D U-Net FF model as well as a conventional 2D and 3D U-Net architecture were trained and validated on reference data from Lower Saxony, including four cropping seasons (2017/18 to 2020/21). We considered multiple cropping seasons to address differences in data availability and plant growth due to seasonal climate variations.
For this reason, the created image patches and the corresponding reference patches of Lower Saxony from 2017/18 to 2020/21 were randomly divided into 66% training data and 33% test data. A total of 10% of the training data was kept for internal validation during the training process. In total, 296,000 composite patches (128 × 128 × 24 pixels each), covering all four cropping seasons, were created and used for model training. The training dataset exhibited a significant class imbalance, with some crop types covering much larger areas than others. Dominant classes in the data were grassland (2.1 million hectares) and maize (1.6 million hectares), whereas legumes (36,750 hectares) and aggregated classes like other winter cereals (15,660 hectares) and other spring cereals (36,300 hectares) had a relatively lower occurrence.
To facilitate a fair comparison, we employed the same hyperparameters for all classification networks. Specifically, we instantiated the networks with 64 filters, with the number of filters gradually doubled in the contracting path and then halved in the expansive path. The models were trained for 25 epochs, after which there was no discernible improvement in validation error for any of the models. As model optimizer, the stochastic gradient descent method Adam was used. The training process was performed with a batch size of 16 and an initial learning rate of 0.0001.
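For reference, a training loop with these hyperparameters could look as follows. This is a sketch only: `model` and `train_loader` stand for the 3D U-Net FF and a loader yielding (composite, label) batches of size 16, and the cross-entropy loss is our assumption, since the loss function is not specified above.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=25, lr=1e-4):
    """Training loop with the reported settings (Adam, lr 0.0001, 25 epochs)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # assumed standard multi-class loss
    for epoch in range(epochs):
        for patches, labels in train_loader:  # batches of 16 patch composites
            optimizer.zero_grad()
            loss = loss_fn(model(patches), labels)
            loss.backward()
            optimizer.step()
```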
3.5. Accuracy Assessment
The performance of the models was evaluated based on the independent test dataset (97,700 image patches), considering all four cropping seasons. Moreover, the classification accuracy as well as the mapped areas of each crop class in the individual years were analyzed separately from the perspective of model stability over time and the representative crop type areas. Finally, we investigated the temporal transferability of our model to verify its generalizability under different seasonal weather conditions and management practices. Therefore, we applied our proposed 3D U-Net FF to the full dataset for the federal state of Lower Saxony for the cropping season 2021/22.
The accuracy assessment was based on calculated area-weighted accuracies (at a 95% confidence interval), including overall accuracy (OA), user’s accuracy (UA), and producer’s accuracy (PA), as presented in [46]. As the dataset showed strong class imbalances, we also report the class-wise F1-scores, which are the harmonic mean of precision and recall.
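As a small illustration of the class-wise F1 metric, the helper below derives it from a confusion matrix; the function name and matrix orientation are ours, and the area-weighted OA/UA/PA estimators of [46] are not reproduced here.

```python
import numpy as np

def class_f1(confusion):
    """Class-wise F1 from a confusion matrix (rows = reference, cols = predicted)."""
    tp = np.diag(confusion).astype(float)
    ua = tp / np.maximum(confusion.sum(axis=0), 1)   # user's accuracy ~ precision
    pa = tp / np.maximum(confusion.sum(axis=1), 1)   # producer's accuracy ~ recall
    return 2 * ua * pa / np.maximum(ua + pa, 1e-12)  # harmonic mean per class
```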
4. Results
4.1. Overall Accuracy Assessment
In order to evaluate the quality and reliability of the generated crop type maps of the individual models, it is essential to conduct an accuracy assessment. This assessment serves as a crucial step in the validation and verification of the classification results, allowing for an objective measurement of the agreement between the predicted crop types and the ground truth data of the models tested.
The classification accuracies of all cropping seasons are summarized in Figure 4. For all models, high overall accuracies of more than 92% could be achieved. With 94.5%, the proposed 3D U-Net FF model performed best in terms of OA, while the conventional architectures achieved lower classification results (92.6% for the 2D U-Net and 94.2% for the 3D U-Net). Differences in accuracy were especially apparent between the 2D U-Net and the two 3D U-Net models, as the 2D model provided significantly lower class-wise accuracies for all classes. The proposed 3D U-Net FF revealed the highest UA and PA values for almost all classes. This was especially significant for classes with lower classification accuracy, such as other winter cereals, other spring cereals, or legume. Finally, the proposed model revealed the highest F1-scores for all classes, with an improved accuracy balance between the crop type classes.
All models exhibited a satisfactory ability to distinguish between agricultural and non-agricultural (background class) pixels, as evidenced by F1-scores above 94%. Very high accuracies of more than 90% could also be achieved for the winter cereal classes winter wheat, winter rye, and winter barley. The results of the 3D U-Net models were higher compared to the 2D U-Net classifier, whereas the highest F1-scores for all classes were given by the proposed 3D U-Net FF. More challenging was the correct classification of winter triticale and other winter cereals, due to misclassification between all winter cereal classes. The 2D U-Net showed the lowest F1-scores of 82.2% and 63.1%, respectively. Both 3D models showed significant improvements for these two classes, whereas the 3D U-Net FF outperformed the common 3D U-Net by about 1% and 4%, respectively.
The classification trends of the spring cereals were comparable among the tested models, with varying F1-scores. The highest F1-scores were achieved for spring barley (2D U-Net: 87.2%, 3D U-Net: 90.8%, 3D U-Net FF: 91.0%). Although lower accuracies were reached for spring oat (2D U-Net: 69.9%, 3D U-Net: 80.5%, 3D U-Net FF: 81.7%), both 3D models performed adequately in terms of accuracy. The largest variations as well as low accuracies were given for other spring cereals (2D U-Net: 48.7%, 3D U-Net: 63.9%, 3D U-Net FF: 65.7%). The latter two classes were mainly misclassified as spring barley by all classifiers. The proposed 3D U-Net FF also provided the highest accuracy for the spring cereals, with major improvements for spring oat and other spring cereals.
The classification of winter rapeseed, maize, potatoes, beet, and grassland demonstrated an excellent accuracy level. These classes consistently achieved high F1-scores near or above 90% across all models. The legume class exhibited larger estimation errors, primarily due to confusion with maize and other crops. The performance varied significantly among the models, with the 2D U-Net achieving the lowest F1-score of 68.2%, while the 3D U-Net and 3D U-Net FF reached 77.8% and 79.4%, respectively.
Overall, the proposed network showed the most reliable and consistent classification results, outperforming the standard models in F1-score across all classes. When comparing the results to those achieved by the traditional networks, differences in accuracy were particularly large when the classification seemed challenging. This was especially evident for aggregated (mixed) classes such as other winter cereals, other spring cereals, or legume, where spectral ambiguities are very likely to occur.
4.2. Accuracy Assessment for the Individual Cropping Seasons
The classification accuracies across the four cropping seasons exhibited distinct similarities for all classifiers, as indicated in Table 2. Results for the cropping season 2018/19 demonstrated the highest classification accuracy, while the classification in 2017/18 exhibited the lowest OA. The 2D U-Net model showed lower classification accuracy for each individual season. Specifically, the results for season 2017/18 showed a higher misclassification rate, resulting in an overall accuracy of 91.5%. The 3D U-Net models, on the other hand, demonstrated improved performance across all years, with OA values ranging from 93.2% to 94.6% for the common 3D model and from 93.6% to 95.0% for the proposed 3D U-Net FF architecture. In terms of overall accuracy, the proposed 3D U-Net FF outperformed the common architectures in every season.
Figure 5 shows the correlation between the reported and estimated areas of the individual classes based on the independent test dataset. Despite the differences in class-wise accuracy as well as between the cropping seasons, excellent and consistent area accuracies were observed for all models, with only minor variability. It was found that the mapping accuracy of the models improved with increasing crop area while showing less variability between the cropping years. Classes with smaller areas were slightly underestimated by all classifiers, whereas the common 2D U-Net model showed the highest error rates for most classes. Especially the smallest class, other winter cereals, showed larger variations. When comparing the models, the 3D U-Net models showed an improvement in accuracy for almost all classes. Particularly for the smallest class, an increased area accuracy could be shown, which was consistent with the higher PA value in Figure 4. The proposed network showcased the most reliable correlations between the reference and estimated areas, with improvements for smaller classes such as other winter cereals, other spring cereals, or legume across the seasons when compared to the traditional models.
4.3. Seasonal Crop-Type Mapping
The crop-type mapping for Lower Saxony across all models showed that the majority of the fields were identified correctly. Most fields were classified as a single crop type, with almost no isolated pixels or salt-and-pepper effects. Field boundaries were clearly delineated, and spatial structures were often correctly mapped. Class confusions exhibited consistency and were comparable across all cropping seasons, despite the yearly variations in climate conditions. Grassland areas dominated the classification maps of each classifier throughout the seasons, which corresponds well to the reported crop areas. In some cases, adjacent fields were merged by the classifiers, so that field boundaries between parcels were not present. Misclassifications appeared systematically at the field edges, as the correct distinction between field boundary and background was challenging.
Figure 6 shows the mapping results of subset A (see Figure 1) in the district of Emsland in the western part of Lower Saxony across the four cropping seasons. The region was characterized by a high diversity of crop types and field sizes. A high level of agreement with the reference could be shown for all three models across all cropping seasons. Despite the variations in local and temporal environmental conditions, the results showed no significant spatial classification errors. Thus, each classifier used was able to effectively represent the crop type distribution across the considered seasons. Notably, confusion between crop types often occurred within the same broader crop group, such as winter or spring cereals. In addition, the quality of the results was found to be unrelated to the field size.
When comparing the maps of the tested models, the common 2D and 3D networks exhibited more misclassifications throughout the cropping seasons, affecting both whole fields and specific local field areas. This was almost exclusively restricted to classes with lower classification accuracy, such as winter triticale, spring oat, or legume, as well as aggregated classes. As already indicated by the overall accuracies of the individual cropping years, particularly dominant mapping errors were evident in the map for the 2017/18 cropping year. The proposed 3D U-Net FF demonstrated a substantial improvement in crop-type mapping, reducing major misclassifications within fields and thereby confirming the more precise classification results observed in the accuracy assessment across all cropping seasons.
As clouds were not masked in our multispectral image composites, the patches exhibited varying degrees of cloud cover. Figure 7 displays the mean cloud cover of the generated patch-based composites for each season, showing significant spatial and temporal variation throughout the study period. For all seasons, two linear patterns of dense cloud cover were noticeable, indicating regions with limited availability of optical image data. The cropping season 2020/21 was particularly affected by dense mean cloud cover, which extended over most parts of Lower Saxony. In comparison, the image data for the 2018/19 season were characterized by significantly less cloud cover, particularly in the central area of the federal state.
With respect to the visualized cloud cover, the mapping results showed only a minor correlation between cloud cover and classification accuracy. For example, all three models achieved better classification results for the season 2020/21 than for 2017/18, although the mean cloud cover was significantly higher in 2020/21. This suggests that other factors had a stronger impact on crop classification than cloud cover alone. Nonetheless, classification restrictions due to cloud cover could be recognized when clouds persisted over longer periods of the season or during the main phases of crop growth.
This is underlined by the results of a detailed, cloud-cover-specific accuracy assessment (Figure 8) for the 2017/18 season. In this assessment, we stepwise excluded pixels from the validation set based on cloud cover thresholds; for example, the ‘<=30%’ category includes only pixels with up to 30% cloud cover. The results clearly show a decrease in overall performance as cloud cover increases. In this context, differences in classification accuracy were evident between the models, with the proposed network demonstrating more accurate and stable results with increasing cloud cover compared to the common 2D and 3D models. When considering the classification maps, the traditional models often failed to fully map parcels if cloud cover was too dominant in the multitemporal patches. This was also evident for classes that could otherwise be correctly classified with high accuracy. The results of the proposed 3D U-Net FF, however, were much more consistent compared to the traditional models. Field shapes were more often identified correctly, and crop types were mostly successfully assigned.
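The stepwise exclusion can be expressed compactly, as in the hypothetical helper below; the function name and the listed thresholds are ours for illustration and may differ from the exact categories used in the assessment.

```python
import numpy as np

def accuracy_by_cloud_threshold(y_true, y_pred, cloud_cover, thresholds=(0.3, 0.5, 0.7, 1.0)):
    """Overall accuracy restricted to pixels whose patch cloud cover stays at or
    below each threshold (sketch of the stepwise assessment described above)."""
    results = {}
    for t in thresholds:
        mask = cloud_cover <= t                      # e.g., '<=30%' keeps pixels with <= 0.3
        if mask.any():
            results[f"<={int(t * 100)}%"] = float((y_true[mask] == y_pred[mask]).mean())
    return results
```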
4.4. Accuracy Assessment for Temporal Transfer of the 3D U-Net FF
The proposed 3D U-Net FF (trained with data from the four cropping seasons 2017/18 to 2020/21) was used to classify data from the 2021/22 season for Lower Saxony. The results displayed in Figure 9 indicate a good performance in terms of OA (91.8%) as well as a successful discrimination between agricultural and non-agricultural pixels, as evidenced by F1-scores surpassing 94%.
The winter cereals wheat, barley, and rye were successfully classified with only minimal deviations. The detection of winter triticale was more limited compared to the training seasons, resulting in a loss in F1-score of nearly 18%. The relatively low accuracy observed for this class could be attributed to its limited differentiation from the other winter cereal classes, which was already observed for the previous seasons. A notable agreement with the reference data in Lower Saxony was found for the spring classes barley (F1-score = 83%) and oats (F1-score = 74.2%). Very satisfactory results were achieved for winter rapeseed, maize, potato, beet, and grassland. With F1-scores above 93% for the first four classes and 88.4% for grassland, the identification of these classes was comparable to the previous seasons. In comparison to prior seasons, the evaluation of legume still exhibited acceptable outcomes (F1-score = 70.1%). The lowest F1-scores were given for other winter cereals and other spring cereals. With 48.8% and 36.5%, respectively, the class estimation was clearly limited and significantly lower than in the previous cropping seasons. These findings indicate a notable decrease in accuracy, primarily driven by exceedingly low PA values.
Figure 10 presents the estimated class areas for the season 2021/22, showing still sufficient mapping accuracy for almost all considered classes. A very strong correlation was observed for classes with larger areas, such as grassland, maize, winter rapeseed, and the winter cereals (wheat, barley, and rye). Slight disparities were observed for classes like legumes and winter triticale; these discrepancies remained, however, within a range of high agreement. Higher mapping errors were only observed for other winter cereals and other spring cereals, resulting in underestimations of approximately 10% and 20%, respectively. These underestimations align with the low PA values depicted in Figure 9.
4.5. Mapping of Crop Types for the 2021/22 Season in Lower Saxony
The transfer of the proposed model to the image data of the unknown season 2021/22 showed persistent map accuracy. The small subsets A and B (see Figure 1) of the classification map of Lower Saxony are illustrated in Figure 11. The upper figure represents subset A (a region in the district of Emsland), which was already considered for the model comparison in Figure 6. The lower figure shows subset B, a marsh region in the northern part of Lower Saxony near the estuary of the Elbe River into the North Sea. The considered regions exhibit contrasting crop profiles: the Emsland area is notably characterized by the predominant cultivation of maize and potato crops, whereas the marsh area displays a notable prevalence of cereal and grassland cultivation.
For both regions, the majority of parcels were accurately predicted, and field structures were well represented, comparable to the previous seasons. Most pixels within a parcel were correctly assigned, allowing for a reliable assessment of individual field crops. No significant spatial mapping errors were observed that might have been caused by cloud cover or different local environmental conditions. Rather, regional crop type patterns were consistently and accurately mapped. However, in comparison to the training seasons, a slight increase in mapping errors was noticeable. Especially for the classes winter triticale, other winter cereals, and other spring cereals, a higher misclassification rate could be shown, as indicated by the lower accuracy values reported above. As limitations in crop type classification were predominantly restricted to specific local field areas, a correct crop type assignment was possible for almost all parcels.
5. Discussion
5.1. Accuracy Assessment across All Cropping Seasons
The presented results demonstrated satisfactory classification accuracy over all seasons, including major crop types such as maize, winter rapeseed, and winter cereals like wheat, barley, or rye. Hence, it can be concluded that our proposed approach for generating multitemporal image composites enables CNNs to effectively handle complex and extensive datasets of multisensor Earth observation (EO) data, even in the presence of temporal cloud cover and varying climate conditions. Further, the detailed accuracy assessment showed that the 3D models outperformed the 2D U-Net in terms of accuracy, resulting in an OA of over 94%.
When considering the class-wise accuracies, the successful classification of winter cereals by all classifiers should be highlighted. Due to spectral ambiguities and similar plant structures, these crops are usually very difficult to differentiate in both optical and SAR data [47]. An exceptional case was winter triticale. The separation of this crop type was more difficult due to its very similar appearance and phenological development to other winter cereals, as it is a hybrid form of wheat and rye [19]. While this was particularly evident for the standard 2D U-Net (F1-score = 82.2%), the proposed 3D U-Net FF performed very well (F1-score = 91.0%).
The 2D U-Net model exhibited significantly lower classification results for the classes spring oat and legume. Both classes are difficult to separate due to spectral similarities to other crop classes, which was also observed in previous studies [5]. The integration of additional temporal features by the 3D models had a substantial positive impact on the classification accuracy of these classes, resulting in very high accuracies, which also exceeded results from other comparable studies [4,5,6]. This demonstrated that the application of 3D convolutions allowed the networks to effectively learn and capture representative crop patterns related to vegetation dynamics in the temporal domain. Therefore, 3D convolutions proved to be advantageous in the context of multitemporal crop-type mapping.
A detailed accuracy assessment underlined the potential of the proposed 3D U-Net FF. Among all classification algorithms tested, the proposed 3D U-Net FF model performed best in terms of OA as well as F1-scores, leading to an increase in accuracy for all classes. Moreover, the results of our model were more balanced between the classes, which is crucial for achieving reliable and accurate classification results. For this reason, it can be concluded that the proposed squeeze-and-excitation fusion module helped the network to focus on important features, thereby increasing and balancing the classification performance.
A particular strength of the 3D U-Net FF was its ability to accurately classify pixels belonging to aggregated (mixed) classes such as other winter cereals or other spring cereals. These classes encompass a high level of variance, as they include multiple crop types, each with distinct spectral signals and growing patterns. This excellent performance could stem from multi-stage patterns learned between the SAR and multispectral data, which contributed to the identification of these classes.
5.2. Comparison of the Model Accuracies for the Individual Cropping Seasons
The individual cropping seasons are characterized by altered meteorological conditions and agro-ecological management practices, which may lead to differences in plant growth and corresponding changes in spectral signature patterns. This temporal variability adds complexity to the classification process, making it more difficult to differentiate between crop types, especially for crops with similar spectral characteristics and when the availability and quality of remote sensing data are limited.
The results demonstrated robust and consistent classification over multiple cropping seasons for all U-Net models, with differences in OA between the seasons of around 1.5%. However, our 3D U-Net FF outperformed the common 2D and 3D classifiers significantly in each season, underlining the extraction of representative spatio-temporal plant features that are robust to intra-seasonal anomalies.
Our results showed that the class-related accuracies of the measured and estimated areas were consistent. However, the accuracies of classes with smaller acreages exhibited greater variability from year to year. This is likely due to the limited sample size and spatial extent of these smaller classes, making them more susceptible to fluctuations in environmental and weather conditions, which in turn leads to increased variability in their spectral behavior. This trend has been observed in other studies (e.g., [6]).
Within this context, our proposed feature fusion model showcased excellent prediction accuracy for smaller classes such as other winter cereals, other spring cereals, or legume. The first two classes also encompass multiple crop types, resulting in complex spectral signature patterns. By integrating complementary image features of the multispectral and SAR data using the proposed fusion module, our model effectively captured the subtle spectral differences and intricate patterns within these classes, leading to highly accurate classifications. This highlights the strength of our approach in handling the complexities and achieving precise results for classes with diverse crop compositions or limited sampling size.
For all models, the largest decrease in OA was observed for the season 2017/18, which was related to extreme weather conditions in the spring and summer months. High temperatures and precipitation deficits with low cloud cover led to persistent high temperature anomalies [48]. Especially for the spring crops, the very dry conditions caused drastic drought damage [49] and therefore led to a temporal change in the spectral signal.
5.3. Comparison of the Mapping Accuracies of the Models
The findings of the accuracy assessment were underlined by the visual interpretation of the classification maps, which revealed promising results for all three classifiers. Moreover, each model was able to successfully map the regional crop rotation across the seasons, which provides valuable insights into the spatial distribution of crop types as well as a better understanding of temporal dynamics in agriculture. This information is particularly relevant in the context of climate change and increasing weather extremes to understand the spatial and temporal distribution of crop types for effective land management and planning. The detailed analysis of the classification maps revealed the advantage of the 3D networks over the 2D model in accurately representing individual parcels, enabling a more detailed and comprehensive view of the dynamics and patterns within agricultural fields.
The excellent classification maps across the growing years are especially noticeable in the context of changing cloud conditions, which differed significantly across the cropping seasons. It was shown that the overall influence of clouds on the classification quality was very limited for all models, supporting our approach of using cloud-affected image composites. Instead, the influence of weather extremes, as shown for the season 2017/18, was more relevant for reduced classification accuracy.
Differences between the models could be identified in the case of persistent cloud cover (see Figure 8). Among all models, the proposed 3D U-Net FF model showed the most stable performance in terms of classification accuracy, especially when classifying pixels that were characterized by strong cloud cover. At this point, it can be assumed that our proposed 3D squeeze-and-excitation fusion module succeeded in suppressing cloud-related feature maps to improve interdependencies between SAR and optical images. For this reason, it can be concluded that complementary multi-stage features are useful to compensate for cloud cover in optical data and therefore overcome temporal restrictions in satellite imagery. However, this hypothesis should be verified in further work.
In the classification maps, it was observed that the precise identification of parcel boundaries was challenging, which can be due to various factors such as mixed pixels [27] or changes over time. However, when assigning the crop type to a specific field, the focus is primarily on classifying the spectral properties within the field itself. Therefore, this issue has relatively little impact on the correct assignment of the crop type to a specific field. Of greater importance for effective agricultural management and decision-making is the ability to distinguish between neighboring fields and individual parcels. As adjacent fields are often merged by the classifiers, the correct differentiation of two fields could be limited when neighboring parcels contain the same crop type.
5.4. Evaluation of Temporal Model Transfer
In the context of operational crop mapping strategies, the temporal transfer of a classifier is particularly interesting. While for “transfer learning” specific and rather complex workflows can be used, we assume that the 3D U-Net was able to learn adequate discriminative features and could be transferred to a certain degree to another cropping season. Therefore, the 3D U-Net FF model, which was trained with data from Lower Saxony from four cropping seasons, was used to classify data from another cropping season.
The transfer of our proposed model to the growing season 2021/22 in Lower Saxony resulted in only minor losses in classification accuracy. The ability of our model to adapt well to an unknown growing season indicates that the network has captured important and generalizable patterns of vegetation dynamics across different temporal compositions, allowing it to perform effectively even in seasons with different environmental conditions and crop variations. Classification losses occurred for the mixed classes (other spring cereals and other winter cereals). A certain classification loss for these classes was expected due to variations in growth patterns and changes in spectral characteristics in a season different from the training data. The limited availability of characteristic patterns due to the relatively small number of training samples can further constrain the accurate assignment of these classes.
However, as the results were still very promising, our approach enabled the identification and assessment of spatial and temporal crop type patterns. An illustrative example (subset region B) with a significant crop rotation change is depicted in Figure 12. This example highlights the transition from predominantly spring cereals in the season 2017/18 to a dominant presence of winter cereals in the following seasons. This circumstance was attributed to the difficult sowing conditions in the marshland regions, primarily resulting from heavy rainfall events throughout the autumn of 2017. This was partly compensated for in the spring period by the cultivation of spring cereals like barley or oats [50].
6. Conclusions
In this paper, we proposed a multi-stage feature fusion method based on a 3D U-Net architecture for annual crop-type mapping using multitemporal Sentinel-1 and Sentinel-2/Landsat-8 imagery. With this approach, the operational monitoring of detailed crop type information was demonstrated, enabling comprehensive assessments of agricultural landscapes.
In detail, our study highlighted the following findings: (1) We introduced a novel approach for creating multitemporal image composites by leveraging patch-based image combinations within a 14-day window based on minimum cloud coverage. In this way, we developed a flexible image composition method that can be applied across different cropping seasons and does not require additional user-dependent metric calculations. (2) The proposed feature fusion model achieved the best overall accuracy, with improvements for all classes. This was evident across all considered cropping seasons as well as within the individual years, where our model consistently exhibited more stable and balanced classification results, indicating its ability to mitigate the challenges posed by differences in climate conditions and farming patterns. In addition, our model proved to be more robust to prolonged cloud cover. Based on these results, we concluded that the integration of our 3D fusion model, incorporating a 3D squeeze-and-excitation network, improved the extraction of spatio-temporal information and dependencies between SAR and multispectral images, leading to increased classification performance at almost no additional computational cost. (3) The results obtained showed that the proposed network was highly effective in ensuring stability and reliability when transferred to a new cropping season, which is particularly significant in light of the rapidly changing environmental conditions and evolving agricultural practices that are becoming increasingly prevalent. Limitations were primarily given for spectrally similar classes like spring crops or aggregated classes, whose spectral signals can vary due to differences in climate and soil conditions between larger regions.
In the future, it will be necessary to further optimize the presented model since the distinction between spectrally similar classes and classes with limited training samples remains challenging. It may be possible to improve the classification by increasing the temporal resolution of the patch-based composites. In particular, this would be useful for the spring and summer months, as these contain relevant phenological crop information. In this context, our approach, which uses an adapted temporal sampling strategy for cloud reduction, should be compared to classical methods for creating image composites (which compute predominantly spectral-temporal metrics over time based on cloud-free images). To address class imbalances, various techniques can be employed, such as oversampling or undersampling of the minority classes or applying class weighting during the model training process.
Since spatial model transfer was not evaluated in this study, the application should be tested in other regions with different climate and farming patterns to assess the model’s stability in both the spatial and temporal domains. In this context, the integration of further crop species and dealing with the spectral properties of non-crop-related classes (background) seem relevant for a flexible model application.
Altogether, this study showed that state-of-the-art deep learning methods offer great potential to learn representative multi-modal image features from multispectral and SAR satellite imagery for reliable crop-type mapping.