Article

Estimating Soil Moisture Using Multimodal Remote Sensing and Transfer Optimization Techniques

1
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2
School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3
School of Electronics and Information, Aerospace Information Technology University, Jinan 250200, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 84; https://doi.org/10.3390/rs18010084
Submission received: 24 October 2025 / Revised: 5 December 2025 / Accepted: 23 December 2025 / Published: 26 December 2025

Highlights

What are the main findings?
  • A multimodal fusion framework integrating SAR, optical, topographic, and meteorological data achieved high-precision soil moisture estimation ($R^2$ = 0.8956), significantly outperforming single-modality methods.
  • An intermediate fine-tuning strategy applied to a large dataset (10,571 images and 3772 samples) substantially enhanced the model’s generalization and transferability across diverse agro-ecological zones.
What are the implications of the main findings?
  • The framework enables the generation of reliable, field-scale soil moisture maps, providing a practical tool for precision irrigation and site-specific water management to enhance drought resilience.
  • This approach offers a scalable and robust solution for operational soil moisture monitoring in heterogeneous landscapes, supporting sustainable water resource management from farm to regional scales.

Abstract

Surface soil moisture (SSM) is essential for crop growth, irrigation management, and drought monitoring. However, conventional field-based measurements offer limited spatial and temporal coverage, making it difficult to capture environmental variability at scale. This study introduces a multimodal soil moisture estimation framework that combines synthetic aperture radar (SAR), optical imagery, vegetation indices, digital elevation models (DEM), meteorological data, and spatio-temporal metadata. To strengthen model performance and adaptability, an intermediate fine-tuning strategy is applied to two datasets comprising 10,571 images and 3772 samples. This approach improves generalization and transferability across regions. The framework is evaluated across diverse agro-ecological zones, including farmlands, alpine grasslands, and environmentally fragile areas, and benchmarked against single-modality methods. With an RMSE of 4.5834% and an $R^2$ of 0.8956, the results show consistently high accuracy and stability, enabling the production of reliable field-scale soil moisture maps. By addressing the spatial and temporal challenges of soil monitoring, this framework provides essential information for precision irrigation. It supports site-specific water management, promotes efficient water use, and enhances drought resilience at both farm and regional scales.

1. Introduction

Global water resources are under increasing pressure, driving the need for more precise and efficient agricultural water management practices [1,2]. In arid and semi-arid regions, accurate estimation of SSM is especially important for irrigation planning, drought monitoring, and crop yield forecasting [3,4,5].
Soil moisture measurement methods fall into two main groups: traditional in-situ methods and methods based on remote sensing image inversion. Traditional in-situ methods, such as manual sampling and ground-based sensor networks, provide reliable measurements at specific sites but are constrained by high costs and limited spatial coverage [6,7]. Global monitoring efforts, including the International Soil Moisture Network (ISMN), reveal substantial observational gaps, particularly in regions like Africa and parts of South America [8]. These limitations highlight the need for scalable, cost-effective solutions for monitoring soil moisture over large areas. Remote sensing has emerged as a vital tool for filling these data gaps, offering synoptic and timely information at regional to global scales [9,10]. Optical sensors in the visible and near-infrared (VIS/NIR) range capture surface reflectance and perform well in bare or sparsely vegetated regions [11,12]. Thermal infrared (TIR) sensors infer SSM indirectly through thermal inertia, making them suitable for areas with strong diurnal temperature variations [13]. Microwave sensors are particularly effective due to their sensitivity to soil dielectric properties and their ability to penetrate vegetation and clouds, offering reliable observations in various weather conditions [14,15,16].
However, a single sensor is susceptible to environmental interference, often resulting in spatio-temporally inconsistent inversion outcomes, which makes it difficult to directly deploy in operational applications [17]. To overcome the limitations of single-sensor systems, recent advances have focused on integrating data from multiple sources [18,19,20]. Combining optical, radar, and UAV-based imagery allows models to leverage complementary information, improving accuracy and generalizability [21,22,23,24]. This multimodal fusion approach enhances the model’s ability to represent complex environmental conditions and leads to more precise soil moisture estimates [25,26,27,28], and delivers more reliable data products for agriculture [29,30]. However, despite these advances, current multimodal methods still face notable challenges. Multimodal methods often struggle with temporal consistency and regional transferability, limiting their use in long-term water planning. Integrating data from heterogeneous sources is also complicated by differences in spatial resolution and revisit frequency, making harmonization essential for accurate modeling.
In addition, soil moisture dynamics are further complicated by nonlinear interactions among soil properties, vegetation, and meteorological conditions. This complexity makes it difficult to design models that perform well across diverse environments. Machine learning and deep learning approaches have shown promise in addressing these challenges. They can model nonlinear relationships, process large datasets, and generate high-resolution soil moisture estimates [31,32,33]. However, their effectiveness is constrained by the scarcity of high-quality, spatially dense training data. To mitigate this limitation, intermediate task fine-tuning has emerged as a promising strategy. This involves pretraining a model on large, related datasets to learn transferable features, followed by fine-tuning on smaller, task-specific datasets. This two-step approach improves model generalization and performance in data-scarce environments [34,35,36], and has proven effective in cross-modal applications where labeled data are limited [37,38].
Nevertheless, the transferability of soil moisture models across agro-ecological zones remains uncertain. Most fusion approaches have been developed and validated within single, relatively uniform regions, with limited testing in ecologically fragile or agriculturally important areas such as alpine ecosystems and intensively farmed zones [31,39]. Variability in climate, vegetation, soil texture, and hydrology introduces additional uncertainty, underscoring the need to evaluate how these factors affect model performance.
To bridge the gap between advanced remote sensing and actionable water management decisions, this study proposes a multimodal framework. The primary goal is to develop a robust tool that can directly support agricultural water management activities. The specific objectives are to: (1) Enhance soil moisture estimation accuracy in regions with limited ground-based observations, aiding real-time drought monitoring and irrigation decision-making. (2) Quantify the added value of integrating multimodal data in generating actionable insights beyond those offered by single-source models. (3) Test the framework across diverse agro-ecological zones to assess its robustness and potential for broad application, from intensively cultivated farmland to environmentally sensitive regions. (4) Achieve high-resolution soil moisture retrieval by leveraging the synergistic integration of multi-source satellite data, thereby overcoming the spatial resolution limitations of existing coarse-resolution products and enabling field-scale monitoring capabilities essential for precision agriculture and localized water resource management. The results of this study will provide valuable scientific insights and technical support for region-specific agricultural water management, drought monitoring, and ecological conservation, particularly enhancing soil moisture monitoring in complex environments.

2. Materials and Methods

2.1. Datasets

This study employs a multimodal dataset (see Table 1) that integrates diverse sources of environmental information to capture the complex dynamics of soil moisture. To effectively manage the structural and functional differences among data types, inputs are organized into two complementary categories based on their spatial characteristics and their role in soil moisture estimation: pixel-level modalities and image-level modalities. This organizational framework not only facilitates systematic data preprocessing and alignment but also guides the design of the fusion architecture, ensuring that each data source contributes optimally to model performance.
Pixel-level modalities provide spatially explicit raster data where each pixel represents distinct surface properties that vary across the landscape. These data require precise geometric co-registration and are spatially aligned within a 1.28 km × 1.28 km window (128 × 128 pixels at 10 m resolution), with each pixel corresponding across sources within the fixed geographic extent. This window size was selected based on compatibility with Sentinel-1’s native 10 m resolution and computational efficiency for processing large datasets. This category includes synthetic aperture radar (SAR) imagery, multispectral optical data, and digital elevation models (DEM). SAR data, obtained from Sentinel-1 and Gaofen satellites, play a central role in directly sensing soil moisture through their sensitivity to soil dielectric properties and their ability to penetrate vegetation and clouds under various weather conditions. Multispectral imagery from Landsat-8 [40] and Sentinel-2 [41] complements SAR by capturing surface reflectance across visible, near-infrared, and shortwave infrared bands, enabling detailed characterization of vegetation cover, land use, and surface conditions that indirectly influence soil moisture distribution. Topographic features derived from DEM, including elevation and slope, are essential for modeling the terrain-driven redistribution of water through surface runoff and subsurface flow pathways.
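The window geometry described above follows from simple arithmetic: 128 pixels at 10 m give a 1280 m side, so the window extends 640 m from its center in each direction. The sketch below illustrates this under the assumption that coordinates are in a projected, metre-based reference system; the function name and the example center point are hypothetical and not taken from the study.

```python
from typing import NamedTuple


class Window(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def window_bounds(center_x: float, center_y: float,
                  size_px: int = 128, res_m: float = 10.0) -> Window:
    """Bounds of a square window centred on (center_x, center_y).

    Assumes a projected coordinate system with units of metres.
    """
    half = size_px * res_m / 2.0  # 128 px * 10 m / 2 = 640 m
    return Window(center_x - half, center_y - half,
                  center_x + half, center_y + half)


# Hypothetical centre point, for illustration only.
w = window_bounds(500_000.0, 4_400_000.0)
print(w.xmax - w.xmin, w.ymax - w.ymin)  # 1280.0 1280.0
```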
Image-level modalities, in contrast, represent contextual environmental conditions or metadata that characterize the entire scene rather than individual pixels. These include spatially aggregated climate statistics, vegetation indicators, and scalar metadata such as geolocation and observation timing. Since these features are either spatially homogeneous across the study window or serve as scene-level attributes, they do not require pixel-level alignment and are incorporated through global pooling operations in the model architecture. Vegetation indices, specifically Leaf Area Index (LAI) and Fraction of Photosynthetically Active Radiation (FPAR) derived from MODIS products [42], provide indirect indicators of soil moisture status through their linkage to vegetation health and canopy structure. Meteorological data [43,44] from the ERA5 reanalysis product capture atmospheric forcing on soil moisture dynamics, with daily and monthly records of temperature and precipitation enabling the model to account for both immediate and lagged meteorological influences. Additionally, geospatial coordinates (latitude and longitude) and temporal information (month of observation) are encoded in a cyclic manner to help the model recognize seasonal trends and geographic variability in soil moisture patterns.
The distinction between pixel-level and image-level modalities reflects their different contributions to soil moisture estimation. Pixel-level data capture fine-scale spatial heterogeneity in surface and subsurface properties, which is critical for generating high-resolution soil moisture maps suitable for field-scale applications. Image-level data, meanwhile, provide the broader environmental context necessary for interpreting local observations within regional climatic and phenological frameworks. By integrating these complementary information sources through an attention-based fusion mechanism, the framework leverages the strengths of each modality while mitigating their individual limitations. A comprehensive summary of each data type, including spatial resolution, temporal frequency, and band information, is provided in Table 1. The location of the aforementioned open-source data has been indicated in Appendix A.

2.1.1. SSM from a Large-Scale Dataset

The first category of soil moisture labels used in this study is derived from two key sources: WoSIS [45] and SMAP [46]. These datasets have fundamentally different characteristics: WoSIS provides static, high-resolution (250 m) soil property measurements derived from global field observations at six depth intervals, while SMAP delivers dynamic, daily global soil moisture estimates at coarse resolution (9 km) from the SPL3SMP-E Level 3 product. WoSIS captures fine-scale spatial heterogeneity in soil physical properties but lacks temporal dynamics, whereas SMAP captures temporal variability in surface soil moisture but at insufficient spatial resolution for field-scale applications.
To combine the strengths of both datasets, we apply the NNDiffuse image fusion algorithm [47], treating WoSIS as the high-resolution (panchromatic) input and SMAP as the low-resolution (multispectral) input. The fused output is then upsampled to match the spatial resolution of Sentinel-1 imagery, producing label maps with consistent resolution across modalities. The fusion is carried out in ENVI (v5.6). For convenience, the fused dataset is hereafter referred to as the SMAP-WoSIS Fusion (SWF).
The fusion process proceeds as follows: (1) for each location and date in our dataset, we extract the SMAP observation temporally aligned with the Sentinel-1 acquisition date; (2) the static WoSIS data provide the spatial structure that guides the downscaling of SMAP’s temporal signal; (3) the NNDiffuse algorithm is run in ENVI to generate the fused product; and (4) the output is upsampled to 10 m resolution to align with Sentinel-1’s spatial grid. This fusion combines SMAP’s temporal fidelity with WoSIS’s spatial precision and removes the physically unrealistic moisture assignments over water bodies that are present in the original coarse SMAP data.
We acknowledge that this fusion introduces label uncertainty, as the resulting SWF estimates reflect regional-scale patterns rather than point-level ground truth. To address this limitation, we implement a two-stage training strategy: intermediate pretraining on the large-scale SWF dataset to learn general soil moisture patterns, followed by fine-tuning on high-quality in-situ measurements to achieve site-specific accuracy.
Input features include all modalities listed in Table 1, except Gaofen data, which is excluded due to limited global availability. The final dataset consists of 10,571 co-registered samples. The large-scale dataset spans four years from 2017 to 2020, with samples distributed across all 12 months to capture seasonal variability in soil moisture dynamics. The temporal distribution shows relatively balanced monthly coverage, with slightly higher representation during the growing season (July–August). This multi-year, multi-season sampling strategy ensures that the model learns soil moisture patterns across diverse climatic conditions, phenological stages, and meteorological forcing scenarios. The temporal alignment uses Sentinel-1 SAR acquisition dates as the reference timestamp. For all other modalities, we search for the nearest available observation within a temporal window before and after the SAR acquisition date. If no valid observation is found within the search window (due to cloud cover for optical data or data gaps), the corresponding modality is marked as missing and handled by the model’s missing data mechanism. The geographic distribution of these samples is illustrated in Figure 1.
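The nearest-in-time matching described above can be illustrated with a short sketch. It assumes a symmetric search window whose width (`max_gap_days`) is a hypothetical parameter, since the window length is not stated here; the function name and the example dates are likewise illustrative only.

```python
from datetime import date
from typing import Optional, Sequence


def nearest_observation(reference: date,
                        candidates: Sequence[date],
                        max_gap_days: int) -> Optional[date]:
    """Return the candidate date closest to `reference`.

    Returns None when there are no candidates or when the closest one lies
    more than `max_gap_days` away, i.e. the modality is treated as missing.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda d: abs((d - reference).days))
    if abs((best - reference).days) > max_gap_days:
        return None
    return best


# Hypothetical acquisition dates, for illustration only.
sar_date = date(2019, 7, 15)
optical_dates = [date(2019, 7, 3), date(2019, 7, 18), date(2019, 8, 1)]

print(nearest_observation(sar_date, optical_dates, max_gap_days=5))
print(nearest_observation(sar_date, [date(2019, 6, 1)], max_gap_days=5))
```

In this sketch the first call selects the 18 July scene, three days after the SAR acquisition, while the second returns `None`, which corresponds to marking the modality as missing.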

2.1.2. SSM from a Small-Scale Dataset

The second category of soil moisture labels used in this study is derived from in-situ measurements collected across several representative regions in China: Wangkui County, the Tibetan Plateau, and the Heihe River Basin. The geographic distribution of these sampling sites is shown in Figure 2. This dataset spans a variety of climate zones and land surface types and provides ground-truth observations.
In the Wangkui region, data were collected in July 2021 by manual sampling. Soil samples were weighed before and after oven-drying to calculate the gravimetric water content (GWC). For the Tibetan Plateau, soil moisture data were obtained from several long-term observation networks [48], including the Naqu and Maqu networks and additional sites such as Ali and Pali. These networks cover a wide range of meteorological and hydrological environments across the plateau, with observation periods extending from July 2010 to August 2016. The Heihe Basin dataset was compiled from two separate field projects [49]. The first includes gravimetric measurements from the upper reaches of the basin, collected between July 2013 and July 2014. The second contains time-series records from the Babao River sub-basin, covering July 2013 to December 2017.
All in-situ soil moisture labels represent the 0–5 cm surface layer. The long-term observation networks on the Tibetan Plateau and in the Heihe River Basin record soil moisture at multiple depths (5 cm, 10 cm, 20 cm, 40 cm, and 50 cm); we extracted the 5 cm measurements as ground-truth labels because this depth is commonly used in agricultural and hydrological applications and represents the near-surface layer relevant to vegetation water uptake and evapotranspiration. For the Wangkui dataset, soil samples were collected at four depths (2 cm, 5 cm, 10 cm, and 20 cm), and we likewise selected the 5 cm measurements for consistency.
Soil moisture values used in this study follow two standard definitions: gravimetric water content (GWC) and volumetric water content (VWC). Given the soil bulk density $\rho_{\mathrm{soil}}$, GWC can be converted to VWC as:
$$\mathrm{VWC} = \mathrm{GWC} \times \rho_{\mathrm{soil}} \times 100\%$$
For consistency and comparability across sites and measurement techniques, all soil moisture observations were converted to VWC. After strict quality control and data cleaning, the final in-situ dataset contains 3772 valid records. As with the large-scale dataset, these VWC values are used as supervised labels, and the input features include the multimodal remote sensing and meteorological variables listed in Table 1. One notable distinction in the small-scale dataset is that SAR data for the Wangkui region were acquired from the Gaofen satellite.
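As a worked illustration of the conversion, the sketch below computes GWC from wet and oven-dry sample weights and then applies the GWC-to-VWC formula. It assumes that bulk density is expressed in g/cm³, so that the density of water (about 1 g/cm³) is implicit in the formula; the function names and the sample values are hypothetical.

```python
def gwc_from_weights(wet_g: float, dry_g: float) -> float:
    """Gravimetric water content: mass of water per mass of oven-dry soil."""
    if dry_g <= 0:
        raise ValueError("dry weight must be positive")
    return (wet_g - dry_g) / dry_g


def vwc_percent(gwc: float, bulk_density_g_cm3: float) -> float:
    """VWC in percent, following VWC = GWC x rho_soil x 100%.

    Assumes bulk density in g/cm^3 (water density of 1 g/cm^3 is implicit).
    """
    return gwc * bulk_density_g_cm3 * 100.0


# Hypothetical sample: 120 g wet, 100 g oven-dry, bulk density 1.3 g/cm^3.
gwc = gwc_from_weights(wet_g=120.0, dry_g=100.0)
print(f"{vwc_percent(gwc, 1.3):.1f}")  # prints 26.0
```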

2.2. Data Processing

To ensure consistency in spatial resolution, temporal alignment and data format across all sources, a standardized preprocessing workflow was implemented before model development. Each data type was handled according to its specific characteristics, as outlined below:
  • Sentinel-1 served as the reference dataset, setting the standard for observation dates, spatial resolution and coordinate reference system. All other data sources were aligned to match this baseline. Sentinel-1 imagery is acquired in ascending and descending orbits, with each viewing direction capturing two polarization channels, resulting in four bands per observation.
  • Gaofen imagery, which lacks built-in preprocessing, was radiometrically calibrated using the method proposed by [50]. This step was essential for extracting reliable radar backscatter intensities.
  • Sentinel-2 offers two product levels: Level-1C and Level-2A. Although Level-2A data provide improved atmospheric correction, they were not always available for the entire study period. Consequently, these two sources are used in a complementary fashion to enhance spectral and spatial representation in remote sensing analyses.
  • Landsat-8 likewise provides both Level-1 and Level-2 data. To ensure adequate coverage while maintaining data quality, we prioritized Level-2 data and fell back to Level-1 data where necessary.
  • DEM data were obtained from the ASTER GDEM dataset. In addition to elevation, terrain slope was derived to incorporate topographic features relevant to soil moisture distribution.
  • ERA5 climate reanalysis data were aligned with Sentinel-1 acquisition dates. Daily and monthly averages of temperature and precipitation were extracted to capture short- and mid-term atmospheric influences on soil moisture.
  • Time: seasonality was represented by encoding the acquisition month of each Sentinel-1 image using a cyclic transformation. This allowed the model to account for seasonal patterns affecting SSM.
  • Geolocation information, specifically the latitude and longitude of the image center, was also encoded cyclically. This enabled the model to learn spatial patterns and regional dependencies.
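The cyclic encodings mentioned in the last two items can be sketched as sine/cosine pairs. This is a minimal illustration under the assumption that each quantity is mapped to an angle and represented by its sine and cosine; the exact encoding used in the study is not specified here, and the function names are hypothetical.

```python
import math
from typing import Tuple


def cyclic_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def encode_month(month: int) -> Tuple[float, float]:
    """Encode a calendar month (1-12) so that December sits next to January."""
    return cyclic_encode(month - 1, 12)


def encode_location(lat_deg: float, lon_deg: float) -> Tuple[float, float, float, float]:
    """Encode latitude and longitude as sines and cosines of their angles."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return math.sin(lat), math.cos(lat), math.sin(lon), math.cos(lon)


# December-January and January-February are equally far apart after encoding,
# which a raw 1-12 integer would not capture.
d_dec_jan = math.dist(encode_month(12), encode_month(1))
d_jan_feb = math.dist(encode_month(1), encode_month(2))
print(abs(d_dec_jan - d_jan_feb) < 1e-9)  # prints True
```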
All pixel-level modalities are resampled to a target resolution of 10 m using bilinear interpolation [51], and reprojected to match the spatial reference system of Sentinel-1. This ensured consistent spatial alignment across all data sources. Temporal alignment was also prioritized: for each observation, data from all modalities were synchronized as closely as possible to Sentinel-1 acquisition dates. In particular, for ERA5 and optical datasets, observations closest in time to Sentinel-1 captures were selected to enhance inter-modal consistency.
During preprocessing, invalid or missing values—caused by factors such as cloud cover, sensor noise, or acquisition failure—were systematically identified. These error indicators vary by modality (e.g., 0 in Sentinel-2, –inf in Sentinel-1). A binary validity mask was generated for each image, where a value of 1 denotes valid data and 0 represents missing or corrupted pixels. These masks were incorporated as additional model inputs during training, ensuring that only valid pixels contributed to the loss calculation and gradient updates.
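A minimal NumPy sketch of this masking idea follows. It assumes that a pixel counts as invalid when any band is non-finite or equals a modality-specific marker value, and that the loss is a mean absolute error averaged over valid pixels only; the function names and the toy arrays are hypothetical and do not reproduce the study's implementation.

```python
from typing import Optional

import numpy as np


def validity_mask(image: np.ndarray,
                  invalid_value: Optional[float] = None) -> np.ndarray:
    """Return a (H, W) mask: 1 for valid pixels, 0 for missing or corrupted.

    image has shape (bands, H, W). Non-finite values (NaN, +/-inf) are always
    invalid; `invalid_value` optionally adds a marker such as 0.
    """
    valid = np.isfinite(image)
    if invalid_value is not None:
        valid &= (image != invalid_value)
    return valid.all(axis=0).astype(np.uint8)


def masked_mae(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error over valid pixels only."""
    valid = mask.astype(bool)
    if not valid.any():
        return 0.0  # nothing to learn from in this sample
    return float(np.abs(pred[valid] - target[valid]).mean())


# Toy 2x2 example: -inf marks a gap in the SAR band, 0 in the optical band.
sar = np.array([[[-12.0, -np.inf], [-9.5, -11.0]]])
optical = np.array([[[0.12, 0.0], [0.0, 0.30]]])
mask = validity_mask(sar) & validity_mask(optical, invalid_value=0.0)

pred = np.array([[20.0, 99.0], [99.0, 30.0]])
target = np.array([[22.0, 0.0], [0.0, 27.0]])
print(mask.tolist())                   # [[1, 0], [0, 1]]
print(masked_mae(pred, target, mask))  # 2.5
```

Combining the per-modality masks with a logical AND is one possible design choice; keeping a separate mask per modality would let a model use whichever inputs are valid at each pixel.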
To manage large-scale data acquisition and preprocessing efficiently, this study developed an automated system combining Python (v3.8) and the GEE API. For the large-scale dataset, the workflow begins with Sentinel-1 metadata, which serves as the reference for retrieving corresponding remote sensing observations and lower-resolution soil moisture labels. Anchoring the pipeline to Sentinel-1 ensures spatial consistency between inputs and labels. For the smaller in-situ dataset, the workflow starts from ground measurement metadata, including timestamp and coordinates. These are indexed in JSON files to guide the retrieval of multimodal satellite inputs that are spatially and temporally aligned with each soil sampling event.

2.3. Methods

We adopt a two-stage intermediate task fine-tuning strategy to train the soil moisture estimation model. Additionally, a multimodal fusion mechanism is integrated to effectively combine complementary information from diverse input sources, improving the model’s ability to capture complex relationships across modalities.
This research adopts ConvNeXt V2 as the backbone architecture for soil moisture estimation [52]. Given the sparse and noisy nature of remote sensing data, such as invalid regions due to cloud cover or sensor gaps, the model uses submanifold sparse convolution, a core component of ConvNeXt V2. Unlike standard dense convolutions, sparse convolutions operate only on valid pixels (i.e., where the validity mask equals 1), eliminating unnecessary computation in empty regions.
To adapt ConvNeXt V2 for regression rather than classification, architectural modifications are applied as follows (illustrated in Figure 3):
  • The original classification head is replaced with a single-channel linear output layer to predict soil moisture for each pixel.
  • The loss function is changed from cross-entropy to mean absolute error (MAE), which better suits the continuous nature of soil moisture values.
  • U-Net-style skip connections [53] are added between encoder and decoder stages to preserve spatial details. Feature maps are concatenated channel-wise to retain fine-grained information during upsampling.
Before entering the backbone network, input data from multiple modalities are fused to capture complementary information. For image-level modalities, features are concatenated into a unified 1D vector. For pixel-level modalities, we apply an attention-based fusion strategy using the Squeeze-and-Excitation (SE) module [54] to adaptively weight channels by importance. The fusion process proceeds as follows:
  • Global average pooling is applied to each feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ to obtain a channel-wise descriptor:
    $$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$$
  • The descriptor $z$ is passed through a two-layer MLP ($FC_1$ and $FC_2$, with weights $W_1$ and $W_2$) with ReLU and sigmoid activations to produce the attention weights $\omega$:
    $$\omega = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot z\right)\right)$$
  • Each channel is reweighted by its corresponding attention score:
    $$\tilde{X}_c = X_c \cdot \omega_c$$
  • A 1 × 1 convolution is applied to reduce the reweighted feature map to a single-channel representation per modality, allowing unified downstream processing. The complete multimodal fusion workflow is illustrated in Figure 4.
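The four fusion steps above can be sketched in NumPy as follows. This is an illustrative forward pass only, assuming bias-free fully connected layers (as in the attention-weight equation), a hypothetical channel reduction in the hidden layer, and random weights in place of learned ones; it is not the study's implementation.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def se_reweight(x: np.ndarray, w1: np.ndarray, w2: np.ndarray):
    """Squeeze-and-Excitation channel reweighting.

    x:  feature map of shape (B, C, H, W)
    w1: first FC weight, shape (C_hidden, C)
    w2: second FC weight, shape (C, C_hidden)
    Returns the reweighted map and the attention weights of shape (B, C).
    """
    z = x.mean(axis=(2, 3))                 # squeeze: global average pooling
    hidden = np.maximum(z @ w1.T, 0.0)      # FC1 + ReLU
    omega = sigmoid(hidden @ w2.T)          # FC2 + sigmoid, values in (0, 1)
    return x * omega[:, :, None, None], omega


def reduce_to_single_channel(x: np.ndarray, conv_w: np.ndarray) -> np.ndarray:
    """1x1 convolution to one output channel: a weighted sum over channels.

    conv_w has shape (C,); the result has shape (B, 1, H, W).
    """
    return np.einsum("bchw,c->bhw", x, conv_w)[:, None, :, :]


# Illustrative shapes: batch of 2, 4 input bands, 8x8 pixels, hidden width 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8, 8))
w1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4, 2))
conv_w = rng.normal(size=4)

x_tilde, omega = se_reweight(x, w1, w2)
fused = reduce_to_single_channel(x_tilde, conv_w)
print(x_tilde.shape, omega.shape, fused.shape)  # (2, 4, 8, 8) (2, 4) (2, 1, 8, 8)
```

In a trained model, `w1`, `w2`, and `conv_w` would be learned jointly with the backbone, so that informative bands receive attention weights closer to 1.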
Once single-channel representations are extracted for all input modalities, they are jointly processed within the ConvNeXt V2 backbone to enable integrated feature learning across modalities. This study employs a progressive transfer learning paradigm:
  • Pretraining stage: The model is initialized with ConvNeXt V2 weights pretrained on the MMEarth dataset [55], which is specifically designed for multimodal Earth observation tasks and has demonstrated strong performance in various settings.
  • Intermediate training stage: The model is further trained on a large-scale and automatically curated soil moisture dataset using SWF labels. This global dataset enables the model to learn generalizable relationships between multimodal remote sensing inputs and soil moisture dynamics.
  • Fine-tuning stage: Finally, the model is fine-tuned on a high-quality, small-sample in-situ dataset based on field measurements. The higher label fidelity in this stage enhances the model’s accuracy in specific geographic regions and supports better adaptation to real-world conditions.
While the SWF and in-situ datasets differ in label origin, they are consistent in terms of input modalities, spatial resolution, and preprocessing procedures.

3. Results

This study investigates the effectiveness of multimodal input for soil moisture estimation through a series of controlled experiments. A total of eight model configurations were designed to evaluate the influence of data fusion strategies and architectural choices on performance. The configurations include unimodal SAR input (a), pixel-level multimodal fusion (b), and a combined pixel- and image-level fusion strategy (c). Additionally, a group (d) was included in the experiments, which did not use the intermediate fine-tuning strategy for training. Due to convergence issues with the dataset, this group only includes results from training with SAR data.
Each group has two sub-variants: one with skip connections for better spatial detail recovery ('1'), and one without to assess their impact ('2'). This design allows for evaluating not only the contribution of modalities but also the architectural effect of skip connections. During preliminary trials, we observed that under small-sample training conditions, only SAR-based unimodal models (Group a) demonstrated consistent convergence. In contrast, models with multiple input modalities (Groups b and c) often encountered feature conflicts, which impeded effective training and model stability. These conflicts likely stem from inconsistencies across modalities and limited training data, which constrain the model’s ability to learn coherent joint representations. As a result, only SAR-based architectures were employed during the intermediate fine-tuning stage (Group d), where training stability and convergence were prioritized. This decision ensures a more reliable learning process while leveraging the strong generalization capabilities of SAR data.
To comprehensively evaluate model performance in predicting soil moisture, we employ four widely used statistical metrics: root mean square error (RMSE), mean absolute error (MAE), coefficient of determination ($R^2$), and unbiased root mean square error (ubRMSE).
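For reference, the four metrics can be computed as in the sketch below. It assumes the common definitions: $R^2$ as one minus the ratio of residual to total sum of squares, and ubRMSE as the RMSE after removing the mean bias, i.e. $\sqrt{\mathrm{RMSE}^2 - \mathrm{bias}^2}$. The exact formulas used in the study are not given here, and the function name and sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """RMSE, MAE, R^2 and ubRMSE for paired observations and predictions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - t for p, t in zip(y_pred, y_true)]
    bias = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")

    # Removing the squared bias leaves only the random error component.
    ubrmse = math.sqrt(max(mse - bias * bias, 0.0))
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": r2, "ubrmse": ubrmse}


# Hypothetical values: every prediction is 2 percentage points too wet.
m = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 22.0, 32.0, 42.0])
print(m["rmse"], m["mae"], m["ubrmse"])  # 2.0 2.0 0.0
```

The example shows why ubRMSE is reported alongside RMSE: a constant offset inflates RMSE and MAE but leaves ubRMSE at zero.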

3.1. Intermediate Task Fine-Tuning

This stage incorporates channel-wise fusion to integrate complementary features from multiple remote sensing modalities. As illustrated in Figure 5, the fusion process enhances spatial consistency across inputs while preserving distinctive information from each source.
As part of the intermediate task fine-tuning strategy, a fusion process was first applied to soil moisture images. The results, shown in Figure 6, demonstrate that the fused image retains the spectral characteristics of the original SMAP data while significantly improving spatial resolution and visual clarity. Prior to fusion, the low-resolution SMAP images cover broad areas, including non-soil surfaces such as lakes and rivers. This leads to the assignment of soil moisture values over water bodies, which is physically inaccurate. In contrast, the WoSIS dataset, with its finer spatial granularity, better delineates soil-covered regions and excludes water surfaces, offering a more realistic depiction of soil distribution. The NNDiffuse algorithm facilitates this enhancement by applying a nonlinear diffusion mechanism. This process improves the fused image’s spatial precision without compromising spectral integrity. However, a known limitation of the algorithm is that it does not impute values in areas with missing data. As a result, any blank regions present in the WoSIS image remain unprocessed in the final fused output, and are also reflected in the downsampled SMAP image.
During the initial training phase on the large-scale dataset, the model’s performance across different configurations is summarized in Table 2. The best-performing model achieved an $R^2$ of 0.3072 and an RMSE of 11.749%, representing more than a 150% improvement in $R^2$ over the SAR-only baseline (Group a). The addition of skip connections consistently improved predictive performance across all architectures. The effect was most pronounced in Group b (pixel-level multimodal fusion), where $R^2$ increased by 16.9% with skip connections enabled. However, performance gains began to plateau as more modalities were added. From Group a to Group b, $R^2$ improved by 0.1256, while the transition from Group b to Group c yielded a smaller increase of just 0.0610. This trend highlights the need to balance computational overhead against diminishing returns when incorporating additional data sources in practical applications. Despite the improvements, the optimal configuration still exhibited a ubRMSE of 11.295%, indicating the presence of residual error.

3.2. Soil Moisture Estimation Results

To further improve model accuracy and generalizability, this stage incorporates in-situ soil moisture measurements. Figure 7 displays the prediction results, while the corresponding quantitative metrics are summarized in Table 3. The experimental findings confirm that the inclusion of skip connections consistently enhances model performance, particularly in terms of RMSE and R², by preserving spatial features and improving gradient flow. As more data modalities are integrated, the model continues to improve across all evaluation metrics. From Group a1 (SAR-only input) to Group b1 (SAR combined with other pixel-level modalities), R² increases by approximately 0.20 and RMSE decreases by 2.8%. From Group b1 to Group c1, R² increases by a further 0.05 and RMSE decreases by a further 1.0%.
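The role of a skip connection in preserving spatial detail can be illustrated with a toy one-dimensional example. The sketch below is not the network used in this study: it assumes a U-Net-style concatenation skip [53], uses fixed average pooling and nearest-neighbour upsampling in place of learned layers, and all function names are hypothetical.

```python
def downsample(signal):
    """Encoder step: average adjacent pairs, halving the length."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Decoder step: repeat each value twice (nearest-neighbour)."""
    return [value for value in signal for _ in range(2)]


def decode_without_skip(signal):
    """Encoder-decoder path only: fine detail is lost in the bottleneck."""
    return upsample(downsample(signal))


def decode_with_skip(signal):
    """Pair each decoder value with the encoder input at the same position.

    This mimics channel concatenation: later layers see both the coarse
    context and the original fine-scale value.
    """
    coarse = upsample(downsample(signal))
    return list(zip(coarse, signal))


fine_input = [1.0, 5.0, 2.0, 2.0]
print(decode_without_skip(fine_input))  # [3.0, 3.0, 2.0, 2.0]
print(decode_with_skip(fine_input))     # [(3.0, 1.0), (3.0, 5.0), (2.0, 2.0), (2.0, 2.0)]
```

Without the skip path, the contrast between the first two positions (1.0 versus 5.0) is averaged away; with it, that contrast remains available to subsequent layers.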
A comparison between the intermediate fine-tuning stage (see Table 2) and the final fine-tuning stage (see Table 3) reveals substantial performance improvements. For the same model architecture, the R² value in Group a2 increased by up to 456.7%, highlighting the critical importance of high-quality, accurate labels in soil moisture retrieval tasks. The benefits of multimodal fusion remain consistent across training stages. Specifically, the improvement from Group a1 to Group c1 during final fine-tuning yields ΔR² = 0.2400, demonstrating the effectiveness of integrating diverse data sources in enhancing model generalization and predictive accuracy. In addition, detailed error analysis confirms that fine-tuning with in-situ data significantly reduces ubRMSE by 36.4% to 60.8%, effectively mitigating the systematic bias introduced during large-scale pretraining. Both RMSE and MAE also show consistent reductions across configurations, further affirming the utility of in-situ supervision in refining model outputs. While the Group d experiments, conducted under data-limited conditions, exhibited an overall drop in performance, the presence of skip connections continued to provide noticeable benefits. This reinforces the robustness of the intermediate task fine-tuning strategy, which helps preserve generalization capability even when training data are scarce.

3.3. Accuracy of Soil Moisture Estimates over Different Soil Types

This research categorizes in-situ soils into three types based on water sources and spatial distribution patterns: Wangkui, Heihe, and Tibet. Examining the regions individually, the error distribution in Wangkui is notably concentrated, with the boxplot closely aligned with the zero-error line. The error distribution in Tibet falls between those of Wangkui and Heihe, with a slightly larger box range. The Heihe region exhibits much larger errors, with a wider box and an extended upper whisker, indicating a dispersed error distribution with numerous high-error outliers.
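The box range, whiskers, and outliers referred to above can be derived from a set of absolute errors as sketched below. The 1.5 × IQR whisker rule (Tukey's convention) is an assumption, since the plotting convention is not stated in the article, and the sample errors are made-up values for illustration only, not results from this study.

```python
import statistics


def boxplot_summary(errors, whisker=1.5):
    """Quartiles, fences and outliers for a list of error values.

    Assumes Tukey-style fences at Q1 - whisker*IQR and Q3 + whisker*IQR.
    """
    q1, median, q3 = statistics.quantiles(errors, n=4)
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [e for e in errors if e < lower_fence or e > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical absolute errors (m3/m3); the last value plays the role of
# a high-error outlier such as those described for the Heihe region.
sample_errors = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.30]
summary = boxplot_summary(sample_errors)
print(summary["outliers"])
```

A wider box corresponds to a larger `iqr`, and a longer upper whisker or more entries in `outliers` corresponds to the dispersed, heavy-tailed error distribution described for Heihe.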
Analysis of the three panels (Figure 8a–c) reveals a clear trend: as additional modalities are incorporated, errors gradually decrease, particularly in Heihe, where error values decline significantly. In Tibet, the inclusion of modalities narrows the error range. Numerically, in the Heihe region, the error (represented by MAE) decreases from 0.1004 to 0.0592 and further to 0.0509 as modalities are added. Similarly, in Tibet and Wangkui, the MAE values are 0.0701 → 0.0515 → 0.0424 and 0.0416 → 0.0403 → 0.0351, respectively. Moreover, the addition of modalities not only reduces errors but also reduces the number of outliers. In Heihe, although errors remain substantial, the number of outliers decreases significantly, and the error distribution becomes more concentrated. This suggests that modality integration improves estimation in regions with complex water sources, helping to reduce retrieval errors despite inherent spatial variability.
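To make the regional contrast explicit, the relative MAE reductions implied by the values quoted above can be computed directly. The short Python sketch below uses only the MAE figures given in the text; the stage labels (SAR-only, plus pixel-level, plus image-level) are an assumed reading of "as modalities are added", and the helper name `relative_reduction` is illustrative.

```python
# MAE per region as quoted in the text, in the order modalities are added
# (assumed here to be SAR-only -> + pixel-level -> + image-level).
mae_by_region = {
    "Heihe": (0.1004, 0.0592, 0.0509),
    "Tibet": (0.0701, 0.0515, 0.0424),
    "Wangkui": (0.0416, 0.0403, 0.0351),
}


def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


for region, (sar_only, plus_pixel, plus_image) in mae_by_region.items():
    first_step = relative_reduction(sar_only, plus_pixel)
    overall = relative_reduction(sar_only, plus_image)
    print(f"{region}: first step {first_step:.1f}%, overall {overall:.1f}%")
```

The overall reductions come out at about 49.3% for Heihe, 39.5% for Tibet, and 15.6% for Wangkui, which quantifies how much more the heterogeneous regions gain from additional modalities.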

4. Discussion

4.1. Contribution of Intermediate Task Fine-Tuning Strategy

Before fine-tuning, the retrieval performance of SWF-derived soil moisture was evaluated, and the results indicated relatively low accuracy. After incorporating the in-situ dataset for fine-tuning, the multimodal model achieved a substantial improvement in accuracy (Figure 7). Compared with the findings of [56], the performance of our method was lower before fine-tuning but surpassed theirs after fine-tuning with in-situ measurements. In their study, the reported R² values ranged from 0.49 to 0.7056, with RMSE values between 3.1% and 5.0%. In contrast, our multimodal approach achieved an R² of 0.8956 and an RMSE of 4.58%. These results highlight the critical role of ground-based measurements in model training, while also underscoring the limitations of relying solely on downscaled remote sensing data for soil moisture retrieval.
Prior to incorporating in-situ data, the network exhibited suboptimal training performance. One contributing factor is the quality of the training labels [57]. SWF-derived soil moisture estimates reflect regional-scale averages rather than localized point-level ground truth. Their relatively coarse spatial resolution likely introduced uncertainty during model training, reducing the network’s ability to capture fine-grained, site-specific soil moisture variability. This limitation constrains the overall predictive accuracy and demonstrates why in situ calibration is indispensable for improving retrieval performance.

4.2. Role of Multimodal Approach

The Group a experiment, which relies solely on SAR data, highlights the unique ability of SAR to penetrate through vegetation and soil layers, providing valuable insights into subsurface moisture. However, despite this penetrating capability, SAR has limited effectiveness in capturing detailed surface information, particularly in regions with diverse landforms or land use types, where surface characteristics significantly influence soil moisture dynamics [58]. Moreover, SAR backscatter is sensitive to multiple confounding factors, such as soil type and topography, which further constrains its reliability when used in isolation. The Group b experiment demonstrates that integrating optical and topographic data with SAR substantially enhances model performance. Optical imagery contributes rich surface-level spectral information, capturing vegetation cover, water bodies, and land use, while topographic data provides geometric attributes such as elevation, slope, and aspect, which are critical for understanding the movement and retention of water across the landscape [59]. The inclusion of these modalities leads to a marked improvement in estimation accuracy. The Group c experiment further advances performance by introducing image-level modalities in addition to the pixel-level features. These contextual inputs enrich the model with environmental background information that aids the interpretation of soil moisture. Meteorological variables allow the model to adjust predictions based on weather dynamics [60], while temporal indices help capture seasonal variability in soil moisture patterns [61]. Vegetation attributes also play a vital role in describing how soil retains or releases water [62], offering a more complete understanding of moisture dynamics.
Overall, these findings underscore the importance of environmental context in modeling the spatio-temporal variability of soil moisture, particularly in heterogeneous or seasonally dynamic regions. The complementary nature of multimodal inputs strengthens model robustness and adaptability across diverse landscapes and climatic conditions. Nevertheless, incorporating an excessive number of modalities can increase computational costs and introduce potential interference among modalities, sometimes leading to fluctuations in training loss.

4.3. Underlying Causes of Soil Type Differences

Different soil types exert a significant influence on the accuracy of soil moisture inversion due to their distinct formation mechanisms and hydrological characteristics [63]. However, relatively few studies have systematically investigated the role of soil type in soil moisture estimation [56]. From the overall results (Figure 8), the three soil types all achieved satisfactory retrieval accuracy, but with notable differences. The Wangkui soils, primarily dependent on natural precipitation and supplemental irrigation, exhibit relatively uniform spatial distribution. This stability in soil moisture dynamics allows the model to capture regional patterns effectively. In contrast, the Heihe soils are jointly influenced by river irrigation and groundwater [64], leading to pronounced spatial heterogeneity and high sensitivity to both natural and anthropogenic disturbances. This highlights the limitations of existing models in capturing complex hydrological processes and spatial variability. The Tibet soils, replenished by precipitation, snowmelt, and thawing permafrost, represent an environment with strong heterogeneity [65]. Notably, when pixel-level and image-level modalities were jointly introduced, retrieval accuracy in Tibet improved significantly, suggesting that multimodal feature integration can mitigate uncertainties caused by complex hydrological conditions.
A closer comparison of different input strategies (Figure 8a–c) further illustrates these distinctions. In Wangkui, where hydrological processes are simple and spatial distributions uniform, the MAE remained consistently low across all three input settings. The box plots displayed few outliers and narrow distributions, indicating that SAR-only input was sufficient to achieve reliable accuracy. Additional pixel- or image-level inputs yielded only limited gains. In contrast, in Heihe and Tibet, regions with complex hydrological processes and strong heterogeneity, the marginal benefits of multimodal inputs were far greater. For example, in Heihe, SAR-only input yielded an MAE of 0.1004, whereas combined inputs reduced the error to 0.0509. The upper bounds and number of outliers in the error distribution also decreased substantially, indicating that multimodal features provided effective constraints for capturing local heterogeneity caused by river irrigation and groundwater recharge. A similar trend was observed in Tibet: multimodal integration reduced the MAE by roughly 40% (from 0.0701 to 0.0424), with both the overall box height and number of outliers significantly reduced, demonstrating improved model stability and robustness under conditions of high heterogeneity and multiple water sources.
From an application perspective, these different outcomes have important implications. In hydrologically stable and spatially homogeneous regions (e.g., Wangkui), SAR-only input can deliver satisfactory inversion accuracy with lower computational and data costs. However, in highly heterogeneous regions (e.g., Tibet and Heihe), single-source input is insufficient to capture complex moisture dynamics, and multimodal, multi-scale data fusion becomes essential for ensuring reliability. These findings suggest that large-scale applications should adopt a “region-specific” strategy: lightweight SAR-only models in stable areas to maximize efficiency, and fusion-based models in complex regions to preserve accuracy. Such an adaptive approach offers a balanced solution between cost and performance.
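The “region-specific” strategy could be expressed as a simple dispatch rule, as in the hypothetical Python sketch below. The water sources listed for each region are taken from Section 4.3, but the `RegionProfile` structure, the `choose_model` function, and the decision rule itself (spatially uniform and at most two water sources) are assumptions made for illustration; the article does not define an operational criterion.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegionProfile:
    name: str
    water_sources: Tuple[str, ...]
    spatially_uniform: bool


def choose_model(profile: RegionProfile, max_sources_for_sar_only: int = 2) -> str:
    """Pick a model family for a region (illustrative rule, not from the study)."""
    is_simple = (
        profile.spatially_uniform
        and len(profile.water_sources) <= max_sources_for_sar_only
    )
    return "sar_only" if is_simple else "multimodal_fusion"


regions = [
    RegionProfile("Wangkui", ("precipitation", "supplemental irrigation"), True),
    RegionProfile("Heihe", ("river irrigation", "groundwater"), False),
    RegionProfile("Tibet", ("precipitation", "snowmelt", "thawing permafrost"), False),
]

for region in regions:
    print(region.name, "->", choose_model(region))
```

In practice the rule would need to be calibrated against validation error, for example by switching to fusion only where the SAR-only MAE exceeds an acceptable threshold.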

4.4. Main Perspectives of the Multimodal Approach

Nevertheless, several sources of error remain. Some errors arise from the inherent uncertainties of the data. For instance, SWF-derived soil moisture estimates have relatively coarse spatial resolution, reflecting regional averages rather than point-level ground truth, which introduces label noise [66]. In situ datasets also vary in acquisition methods and conditions, leading to inconsistencies in depth measurements [67]. Moreover, multimodal data are not always complete, and missing features were sometimes replaced with masks during training. The strong performance across all three regions (R² = 0.8956, RMSE = 4.58%) suggests that the model successfully captures soil moisture dynamics despite depth variations. The consistency of results across sites with different target depths (2 cm vs. 5 cm) indicates that the multimodal approach provides robust estimates of soil moisture. However, in regions with highly complex hydrological processes and multiple recharge sources (e.g., Tibet and Heihe), soil moisture dynamics are subject to large temporal variability and strong external influences such as irrigation and groundwater replenishment. These factors make it difficult for the model to achieve complete convergence, thereby contributing to higher residual errors.
While ideally all measurements would be at a uniform depth, the framework does not rely solely on SAR backscatter. The integration of vegetation indices, meteorological variables, and topographic features provides information about soil water dynamics that extends beyond the immediate SAR sensing depth. These auxiliary variables help bridge the gap between surface backscatter and deeper soil moisture.
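The handling of incomplete multimodal inputs noted earlier in this subsection can be sketched as follows. This is one common way to mask missing features, shown under stated assumptions: the article does not describe its masking scheme, so the zero fill value, the companion validity mask, and the function name `mask_missing` are all hypothetical.

```python
def mask_missing(features, fill_value=0.0):
    """Replace missing entries (None) and return a validity mask.

    Returns (filled, mask) where mask is 1.0 for observed values and 0.0
    for entries that were missing and filled with `fill_value`.
    """
    filled = [fill_value if value is None else float(value) for value in features]
    mask = [0.0 if value is None else 1.0 for value in features]
    return filled, mask


# Example: the second feature is unavailable for this sample.
filled, mask = mask_missing([0.3, None, 1.2])
print(filled)  # [0.3, 0.0, 1.2]
print(mask)    # [1.0, 0.0, 1.0]
```

Supplying the mask alongside the filled values lets a model distinguish a genuine zero from a gap, which is one reason this pattern is often preferred over silent imputation.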
Future research can be pursued in several directions. First, incorporating more advanced deep learning architectures and adaptive modality selection mechanisms may help reduce redundancy and enhance model robustness [31]. Second, coupling multi-temporal and multi-scale datasets could provide a more comprehensive representation of soil moisture dynamics [68]. Third, integrating physical constraints into the modeling process would ensure that predictions remain consistent with the underlying principles of hydrological processes [69].

5. Conclusions

This study proposed an integrated framework for soil moisture estimation that fuses SAR, optical, topographic, and meteorological data with an intermediate fine-tuning strategy. The results demonstrated that data fusion can substantially improve estimation accuracy, thereby enabling more reliable field-scale soil moisture maps that support informed irrigation planning and water allocation. The fine-tuning strategy proved effective in translating coarse satellite information into high-resolution, management-ready outputs, showing strong scalability for operational applications. Moreover, the framework exhibited robustness across heterogeneous landscapes, from stable agricultural regions to hydrologically complex basins, thus supporting adaptive management under diverse conditions. The observed sensitivity to soil type further highlights the need for region-specific strategies, where SAR-only models may suffice in stable areas while multi-source integration is essential in complex terrains. Overall, this research provides both a methodological advance and a practical pathway toward data-driven, resilient, and efficient agricultural water management, contributing to sustainable use of water resources.

Author Contributions

Conceptualization, J.L.; Methodology, J.L.; Software, J.L.; Validation, J.L.; Formal analysis, J.L.; Investigation, J.L.; Resources, L.L.; Data curation, J.L.; Writing—original draft, J.L.; Writing—review & editing, L.L. and X.W.; Supervision, L.L. and W.Y.; Project administration, L.L.; Funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Aerospace Information Research Institute, Chinese Academy of Sciences, grant number Y9G0100BF0. The APC was funded by Aerospace Information Research Institute, Chinese Academy of Sciences.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We would like to express our sincere thanks to the Chinese Academy of Agricultural Sciences for kindly providing the soil survey data that was essential for this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The data sources and open access information for the datasets used in this study are as follows:
All pixel-level modalities underwent spatial coverage validation requiring images to fully contain the target polygon with a 200-m buffer zone, with systematic exclusion of images containing no-data values at orbit boundaries. Cloud filtering was applied to optical imagery, excluding Sentinel-2 scenes with cloud coverage exceeding 10% at the tile level, while preserving quality assessment bands (SCL, QA, QC) for downstream pixel-level filtering. A dual-track resampling strategy was employed: continuous variables (spectral bands, temperature, vegetation indices) were resampled to 10-m using bilinear interpolation, while categorical data (quality flags, land cover) used nearest neighbor to preserve discrete values. For time-series products, images were selected based on temporal proximity to the soil moisture reference date. Downloaded tiles were center-cropped to 128 × 128 pixels to ensure consistent spatial dimensions, with a retry mechanism implemented to handle network instability during download.
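The selection and cropping rules described above can be summarized in a short Python sketch. It follows the stated thresholds (cloud coverage not exceeding 10%, nearest acquisition to the reference date, 128 × 128 center crop, bilinear versus nearest-neighbour resampling), but the `Scene` structure and all function names are hypothetical, tiles are represented as plain nested lists, and the actual pipeline used in the study is not published in this section.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Scene:
    scene_id: str
    acquired: date
    cloud_percent: float  # tile-level cloud coverage


def select_scene(scenes, reference_date, max_cloud_percent=10.0):
    """Drop scenes above the cloud threshold, then pick the closest in time."""
    candidates = [s for s in scenes if s.cloud_percent <= max_cloud_percent]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs((s.acquired - reference_date).days))


def resampling_method(is_categorical):
    """Dual-track rule: nearest neighbour for discrete data, bilinear otherwise."""
    return "nearest" if is_categorical else "bilinear"


def center_crop(tile, size=128):
    """Return the central size x size window of a 2-D tile (list of rows)."""
    rows, cols = len(tile), len(tile[0])
    if rows < size or cols < size:
        raise ValueError("tile is smaller than the requested crop")
    top = (rows - size) // 2
    left = (cols - size) // 2
    return [row[left:left + size] for row in tile[top:top + size]]


scenes = [
    Scene("a", date(2020, 6, 1), 5.0),
    Scene("b", date(2020, 6, 9), 50.0),   # closest in time but too cloudy
    Scene("c", date(2020, 6, 20), 10.0),
]
chosen = select_scene(scenes, reference_date=date(2020, 6, 10))
print(chosen.scene_id)  # a

tile = [[(r, c) for c in range(130)] for r in range(130)]
crop = center_crop(tile)
print(len(crop), len(crop[0]), crop[0][0])  # 128 128 (1, 1)
```

Scene "b" is rejected despite being only one day from the reference date, which shows that the cloud filter is applied before temporal proximity is considered in this sketch.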

References

  1. Shiklomanov, I.A. World Water Resources: A New Appraisal and Assessment for the 21st Century; United Nations Educational, Scientific and Cultural Organization (UNESCO): Paris, France, 1998.
  2. Fu, Z.; Ciais, P.; Feldman, A.F.; Gentine, P.; Makowski, D.; Prentice, I.C.; Stoy, P.C.; Bastos, A.; Wigneron, J.P. Critical soil moisture thresholds of plant water stress in terrestrial ecosystems. Sci. Adv. 2022, 8, eabq7827.
  3. Martínez-Fernández, J.; González-Zamora, A.; Sánchez, N.; Gumuzzio, A.; Herrero-Jiménez, C. Satellite soil moisture for agricultural drought monitoring: Assessment of the SMOS derived Soil Water Deficit Index. Remote Sens. Environ. 2016, 177, 277–286.
  4. Yao, Y.; Liu, Y.; Zhou, S.; Song, J.; Fu, B. Soil moisture determines the recovery time of ecosystems from drought. Glob. Change Biol. 2023, 29, 3562–3574.
  5. Kandhol, N.; Pandey, S.; Singh, V.P.; Herrera-Estrella, L.; Tran, L.S.P.; Tripathi, D.K. Link between plant phosphate and drought stress responses. Research 2024, 7, 0405.
  6. Robinson, D.A.; Campbell, C.S.; Hopmans, J.W.; Hornbuckle, B.K.; Jones, S.B.; Knight, R.; Ogden, F.; Selker, J.; Wendroth, O. Soil moisture measurement for ecological and hydrological watershed-scale observatories: A review. Vadose Zone J. 2008, 7, 358–389.
  7. Robinson, D.A.; Jones, S.B.; Wraith, J.M.; Or, D.; Friedman, S.P. A review of advances in dielectric and electrical conductivity measurement in soils using time domain reflectometry. Vadose Zone J. 2003, 2, 444–475.
  8. Dorigo, W.; Himmelbauer, I.; Aberer, D.; Schremmer, L.; Petrakovic, I.; Zappa, L.; Preimesberger, W.; Xaver, A.; Annor, F.; Ardö, J.; et al. The International Soil Moisture Network: Serving Earth system science for over a decade. Hydrol. Earth Syst. Sci. Discuss. 2021, 2021, 1–83.
  9. Blatchford, M.L.; Mannaerts, C.M.; Zeng, Y.; Nouri, H.; Karimi, P. Status of accuracy in remotely sensed and in-situ agricultural water productivity estimates: A review. Remote Sens. Environ. 2019, 234, 111413.
  10. Li, Z.L.; Leng, P.; Zhou, C.; Chen, K.S.; Zhou, F.C.; Shang, G.F. Soil moisture retrieval from remote sensing measurements: Current knowledge and directions for the future. Earth-Sci. Rev. 2021, 218, 103673.
  11. Amani, M.; Parsian, S.; MirMazloumi, S.M.; Aieneh, O. Two new soil moisture indices based on the NIR-red triangle space of Landsat-8 data. Int. J. Appl. Earth Obs. Geoinf. 2016, 50, 176–186.
  12. Babaeian, E.; Sadeghi, M.; Franz, T.E.; Jones, S.; Tuller, M. Mapping soil moisture with the OPtical TRApezoid Model (OPTRAM) based on long-term MODIS observations. Remote Sens. Environ. 2018, 211, 425–440.
  13. Kang, J.; Jin, R.; Li, X.; Ma, C.; Qin, J.; Zhang, Y. High spatio-temporal resolution mapping of soil moisture by integrating wireless sensor network observations and MODIS apparent thermal inertia in the Babao River Basin, China. Remote Sens. Environ. 2017, 191, 232–245.
  14. Karthikeyan, L.; Pan, M.; Wanders, N.; Kumar, D.N.; Wood, E.F. Four decades of microwave satellite soil moisture observations: Part 1. A review of retrieval algorithms. Adv. Water Resour. 2017, 109, 106–120.
  15. Al-Yaari, A.; Wigneron, J.P.; Kerr, Y.; Rodriguez-Fernandez, N.; O’Neill, P.; Jackson, T.; De Lannoy, G.J.M.; Al Bitar, A.; Mialon, A.; Richaume, P.; et al. Evaluating soil moisture retrievals from ESA’s SMOS and NASA’s SMAP brightness temperature datasets. Remote Sens. Environ. 2017, 193, 257–273.
  16. Lal, P.; Singh, G.; Das, N.N.; Lohman, R.B. Validation of the NISAR Multi-Scale Soil Moisture Retrieval Algorithm across Various Spatial Resolutions and Landcovers Using the ALOS-2 SAR Data. J. Remote Sens. 2025, 5, 0729.
  17. Singh, A.; Gaurav, K.; Sonkar, G.K.; Lee, C.C. Strategies to measure soil moisture using traditional methods, automated sensors, remote sensing, and machine learning techniques: Review, bibliometric analysis, applications, research findings, and future directions. IEEE Access 2023, 11, 13605–13635.
  18. Cheng, M.; Jiao, X.; Liu, Y.; Shao, M.; Yu, X.; Bai, Y.; Wang, Z.; Wang, S.; Tuohuti, N.; Liu, S.; et al. Estimation of soil moisture content under high maize canopy coverage from UAV multimodal data and machine learning. Agric. Water Manag. 2022, 264, 107530.
  19. Li, H.; Song, Y.; Wang, Z.; Li, M.; Yang, W. Development of an online prediction system for soil organic matter and soil moisture content based on multi-modal fusion. Comput. Electron. Agric. 2024, 227, 109514.
  20. Cai, J.; Zhao, W.; Ding, T.; Yin, G. Generation of High-Resolution Surface Soil Moisture over Mountain Areas by Spatially Downscaling Remote Sensing Products Based on Land Surface Temperature–Vegetation Index Feature Space. J. Remote Sens. 2025, 5, 0437.
  21. Zhang, Y.; Yang, X.; Tian, F. Study on Soil Moisture Status of Soybean and Corn across the Whole Growth Period Based on UAV Multimodal Remote Sensing. Remote Sens. 2024, 16, 3166.
  22. Karmakar, P.; Teng, S.W.; Murshed, M.; Pang, S.; Li, Y.; Lin, H. Crop monitoring by multimodal remote sensing: A review. Remote Sens. Appl. Soc. Environ. 2024, 33, 101093.
  23. Nan, G.; Zhao, Y.; Fu, L.; Ye, Q. Object detection by channel and spatial exchange for multimodal remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8581–8593.
  24. Fan, X.; Zhou, W.; Qian, X.; Yan, W. Progressive adjacent-layer coordination symmetric cascade network for semantic segmentation of multimodal remote sensing images. Expert Syst. Appl. 2024, 238, 121999.
  25. De Sa, V.R.; Ballard, D.H. Category learning through multimodality sensing. Neural Comput. 1998, 10, 1097–1117.
  26. Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations & trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42.
  27. Deng, C.; Zhu, D.; Li, K.; Gou, C.; Li, F.; Wang, Z.; Zhong, S.; Yu, W.; Nie, X.; Song, Z.; et al. Emerging properties in unified multimodal pretraining. arXiv 2025, arXiv:2505.14683.
  28. Zhao, T.; Shi, J.; Entekhabi, D.; Jackson, T.J.; Hu, L.; Peng, Z.; Yao, P.; Li, S.; Kang, C.S. Retrievals of soil moisture and vegetation optical depth using a multi-channel collaborative algorithm. Remote Sens. Environ. 2021, 257, 112321.
  29. Orth, R. Global soil moisture data derived through machine learning trained with in-situ measurements. Sci. Data 2021, 8, 1–14.
  30. Cheng, M.; Li, B.; Jiao, X.; Huang, X.; Fan, H.; Lin, R.; Liu, K. Using multimodal remote sensing data to estimate regional-scale soil moisture content: A case study of Beijing, China. Agric. Water Manag. 2022, 260, 107298.
  31. Teshome, F.T.; Bayabil, H.K.; Schaffer, B.; Ampatzidis, Y.; Hoogenboom, G. Improving soil moisture prediction with deep learning and machine learning models. Comput. Electron. Agric. 2024, 226, 109414.
  32. Singh, A.; Gaurav, K. Deep learning and data fusion to estimate surface soil moisture from multi-sensor satellite images. Sci. Rep. 2023, 13, 2251.
  33. Nijaguna, G.; Manjunath, D.; Abouhawwash, M.; Askar, S.S.; Basha, D.K.; Sengupta, J. Deep learning-based improved WCM technique for soil moisture retrieval with satellite images. Remote Sens. 2023, 15, 2005.
  34. Vu, T.; Wang, T.; Munkhdalai, T.; Sordoni, A.; Trischler, A.; Mattarella-Micke, A.; Maji, S.; Iyyer, M. Exploring and predicting transferability across NLP tasks. arXiv 2020, arXiv:2005.00770.
  35. Szép, M.; Rueckert, D.; von Eisenhart-Rothe, R.; Hinterwimmer, F. A Practical Guide to Fine-tuning Language Models with Limited Data. arXiv 2024, arXiv:2411.09539.
  36. Weller, O.; Seppi, K.; Gardner, M. When to use multi-task learning vs intermediate fine-tuning for pre-trained encoder transfer learning. arXiv 2022, arXiv:2205.08124.
  37. Nayak, S.; Ranathunga, S.; Thillainathan, S.; Hung, R.; Rinaldi, A.; Wang, Y.; Mackey, J.; Ho, A.; Lee, E.S.A. Leveraging auxiliary domain parallel data in intermediate task fine-tuning for low-resource translation. arXiv 2023, arXiv:2306.01382.
  38. Cai, L.; Li, S.; Ma, W.; Kang, J.; Xie, B.; Sun, Z.; Zhu, C. Enhancing cross-modal fine-tuning with gradually intermediate modality generation. arXiv 2024, arXiv:2406.09003.
  39. Xing, S.; Zhang, G.; Zhang, N.; Zhang, Y.; Zhang, Y. Effects of straw returning methods on seasonal variation in soil moisture and water storage in Mollisols with different degradation degrees. Agric. Water Manag. 2025, 319, 109796.
  40. EROS. USGS EROS Archive-Landsat Archives-Landsat 8-9 OLI/TIRS Collection 2 Level-2 Science Products; US Geological Survey: Reston, VA, USA, 2020.
  41. Pahlevan, N.; Chittimalli, S.K.; Balasubramanian, S.V.; Vellucci, V. Sentinel-2/Landsat-8 product consistency and implications for monitoring aquatic systems. Remote Sens. Environ. 2019, 220, 19–29.
  42. Myneni, R.; Knyazikhin, Y.; Park, T. MODIS/Terra+Aqua Leaf Area Index/FPAR 4-Day L4 Global 500 m SIN Grid V061; NASA EOSDIS Land Processes Distributed Active Archive Center: Sioux Falls, SD, USA, 2021.
  43. Furtak, K.; Wolińska, A. The impact of extreme weather events as a consequence of climate change on the soil moisture and on the quality of the soil environment and agriculture – A review. Catena 2023, 231, 107378.
  44. Muñoz Sabater, J. ERA5-Land monthly averaged data from 1981 to present. Copernic. Clim. Chang. Serv. (C3S) Clim. Data Store (CDS) 2019, 10.
  45. Turek, M.E.; Poggio, L.; Batjes, N.H.; Armindo, R.A.; van Lier, Q.d.J.; de Sousa, L.; Heuvelink, G.B. Global mapping of volumetric water retention at 100, 330 and 15,000 cm suction using the WoSIS database. Int. Soil Water Conserv. Res. 2023, 11, 225–239.
  46. Chan, S.K.; Bindlish, R.; O’Neill, P.; Jackson, T.; Njoku, E.; Dunbar, S.; Chaubell, J.; Piepmeier, J.; Yueh, S.; Entekhabi, D.; et al. Development and assessment of the SMAP enhanced passive soil moisture product. Remote Sens. Environ. 2018, 204, 931–941.
  47. Zhao, J.; Huang, L.; Yang, H.; Zhang, D.; Wu, Z.; Guo, J. Fusion and assessment of high-resolution WorldView-3 satellite imagery using NNDiffuse and Brovey algotirhms. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2606–2609.
  48. Bob, S.; Yang, K. Time-Lapse Observation Dataset of Soil Temperature and Humidity on the Tibetan Plateau (2008–2016); National Tibetan Plateau Data Center: Lanzhou, China, 2019.
  49. Rui, J.; Jian, K. Hourly Soil Moisture Dataset Observed by Eco-Hydrological Sensor Network in the Upper Reaches of Heihe River (2013–2017); National Tibetan Plateau Data Center: Lanzhou, China, 2021.
  50. Liu, J.; Liu, L.; Zhou, X. Calibration of SAR Polarimetric Images by Covariance Matching Estimation Technique with Initial Search. Remote Sens. 2024, 16, 2400.
  51. Kirkland, E.J.; Kirkland, E.J. Bilinear interpolation. In Advanced Computing in Electron Microscopy; Springer: Boston, MA, USA, 2010; pp. 261–263.
  52. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142.
  53. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241.
  54. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
  55. Nedungadi, V.; Kariryaa, A.; Oehmcke, S.; Belongie, S.; Igel, C.; Lang, N. MMEarth: Exploring multi-modal pretext tasks for geospatial representation learning. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 164–182.
  56. Abowarda, A.S.; Bai, L.; Zhang, C.; Long, D.; Li, X.; Huang, Q.; Sun, Z. Generating surface soil moisture at 30 m spatial resolution using both data fusion and machine learning toward better water resources management at the field scale. Remote Sens. Environ. 2021, 255, 112301.
  57. Parada, L.M.; Liang, X. Impacts of spatial resolutions and data quality on soil moisture data assimilation. J. Geophys. Res. Atmos. 2008, 113.
  58. Zhu, L.; Dai, J.; Jin, J.; Yuan, S.; Xiong, Z.; Walker, J.P. Are the current expectations for SAR remote sensing of soil moisture using machine learning over-optimistic? IEEE Trans. Geosci. Remote Sens. 2025, 63, 4501815.
  59. Hashemi, M.G.; Alemohammad, H.; Jalilvand, E.; Tan, P.N.; Judge, J.; Cosh, M.; Das, N.N. Estimating crop biophysical parameters from satellite-based SAR and optical observations using self-supervised learning with geospatial foundation models. Remote Sens. Environ. 2025, 327, 114825.
  60. Tian, Q.; Lu, J.; Chen, X. An innovative method for measuring the hysteresis effects of soil moisture on meteorological variables at various time scales and climate conditions. Geo-Spat. Inf. Sci. 2025, 28, 671–684.
  61. Wei, Z.; Miao, L.; Peng, J.; Zhao, T.; Meng, L.; Lu, H.; Peng, Z.; Cosh, M.H.; Fang, B.; Lakshmi, V.; et al. Bridging spatio-temporal discontinuities in global soil moisture mapping by coupling physics in deep learning. Remote Sens. Environ. 2024, 313, 114371.
  62. Yue, J.; Li, T.; Liu, Y.; Tian, J.; Tian, Q.; Li, S.; Feng, H.; Guo, W.; Yang, H.; Yang, G.; et al. A novel vegetation-water resistant soil moisture index for remotely assessing soil surface moisture content under the low-moderate wheat cover. Comput. Electron. Agric. 2024, 224, 109223.
  63. Fragkos, A.; Loukatos, D.; Kargas, G.; Arvanitis, K.G. Response of the TEROS 12 soil moisture sensor under different soils and variable electrical conductivity. Sensors 2024, 24, 2206.
  64. Tian, J.; Zhang, B.; He, C.; Han, Z.; Bogena, H.R.; Huisman, J.A. Dynamic response patterns of profile soil moisture wetting events under different land covers in the Mountainous area of the Heihe River Watershed, Northwest China. Agric. For. Meteorol. 2019, 271, 225–239.
  65. Xing, Z.; Fan, L.; Zhao, L.; De Lannoy, G.; Frappart, F.; Peng, J.; Li, X.; Zeng, J.; Al-Yaari, A.; Yang, K.; et al. A first assessment of satellite and reanalysis estimates of surface and root-zone soil moisture over the permafrost region of Qinghai-Tibet Plateau. Remote Sens. Environ. 2021, 265, 112666.
  66. Dong, R.; Fang, W.; Fu, H.; Gan, L.; Wang, J.; Gong, P. High-resolution land cover mapping through learning with noise correction. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13.
  67. Kennedy, J.J. A review of uncertainty in in situ measurements and data sets of sea surface temperature. Rev. Geophys. 2014, 52, 1–32.
  68. Wang, S.; Li, R.; Wu, Y.; Zhao, S. Effects of multi-temporal scale drought on vegetation dynamics in Inner Mongolia from 1982 to 2015, China. Ecol. Indic. 2022, 136, 108666.
  69. Bhasme, P.; Vagadiya, J.; Bhatia, U. Enhancing predictive skills in physically-consistent way: Physics informed machine learning for hydrological processes. J. Hydrol. 2022, 615, 128618.
Figure 1. Spatial Distribution of the Large Dataset.
Figure 2. Spatial Distribution of in-situ SSM data.
Figure 3. Workflow of multimodal soil moisture inversion model with intermediate fine-tuning.
Figure 4. Workflow of multi-channel image fusion into a single channel.
Figure 5. Fused Images of Pixel-Level Modalities: (a) Sentinel-1, (b) Sentinel-2, (c) ASTER, (d) Landsat-8.
Figure 6. Image fusion results: (a) SMAP, (b) WoSIS, and (c) SMAP–WoSIS Fusion.
Figure 7. SSM estimated by different input variable groups: (a) Group a1. (b) Group b1. (c) Group c1. (d) Group a2. (e) Group b2. (f) Group c2. (g) Group d1. (h) Group d2.
Figure 8. The boxplot for estimation performance comparison of different soil types: (a) Input with SAR only. (b) Pixel-level input. (c) Combined pixel- and image-level input.
Table 1. Multimodal Input Data.
| Level | Type | Product | Resolution | Frequency | Band Information |
|---|---|---|---|---|---|
| PIXEL | Multispectral | Landsat-8 | 30 m | 16 Days | B1-B11 |
| | Multispectral | Sentinel-2 | 10–60 m | 5 Days | B1-B12 |
| | SAR | Sentinel-1 | 10 m | 6 Days | VV, VH for ascending/descending orbit (C-band) |
| | SAR | Gaofen | 1 m | N/A | VV, VH, HV, HH (L-band) |
| | DEM | ASTER | 30 m | N/A | elevation, slope |
| IMAGE | Vegetation | MODIS | 500 m | 4 Days | FPAR, LAI |
| | Meteorological Reanalysis | ERA5 | 11,132 m | 1 Month/Day | temperature, precipitation |
| | Geoposition | N/A | N/A | N/A | cyclic encoding of latitude and longitude |
| | Time | N/A | N/A | N/A | cyclic encoding of month |
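Table 1 lists cyclic encodings of month and of latitude and longitude among the inputs. The exact formula is not given in this part of the article, so the sketch below assumes the common sine/cosine pair on the unit circle; the function names and the 360-degree period used for latitude are illustrative assumptions, not the authors' implementation.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * (value / period)
    return math.sin(angle), math.cos(angle)

def encode_month(month):
    # Assumption: months are numbered 1..12; shift to 0..11 so that
    # December and January end up as neighbours on the circle.
    return cyclic_encode(month - 1, 12)

def encode_longitude(lon_deg):
    # Longitude wraps around at +/-180 degrees, so the period is 360.
    return cyclic_encode(lon_deg, 360.0)

def encode_latitude(lat_deg):
    # Assumption: latitude does not wrap, but the same 360-degree
    # mapping still gives a unique, bounded pair for [-90, 90].
    return cyclic_encode(lat_deg, 360.0)

# Hypothetical sample: a July acquisition at 36.6 N, 117.0 E.
features = (*encode_month(7), *encode_latitude(36.6), *encode_longitude(117.0))
```

The point of the pair is that values at the two ends of a cycle (December and January, or longitudes of -180 and 180 degrees) receive nearly identical features, which a raw integer or degree value would not provide.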
Table 2. Test Results Using SWF Label.
| Group | R² | RMSE (%) | MAE (%) | ubRMSE (%) |
|---|---|---|---|---|
| a1 | 0.1206 | 13.333 | 10.975 | 11.153 |
| a2 | 0.1082 | 13.427 | 10.862 | 12.328 |
| b1 | 0.2462 | 12.344 | 10.078 | 11.145 |
| b2 | 0.2106 | 12.633 | 10.285 | 11.462 |
| c1 | 0.3072 | 11.749 | 9.442 | 11.295 |
| c2 | 0.2941 | 11.946 | 9.6159 | 11.327 |
Table 3. Test Results Using In-situ Data.
| Group | R² | RMSE (%) | MAE (%) | ubRMSE (%) |
|---|---|---|---|---|
| a1 | 0.6556 | 8.3248 | 6.6762 | 7.0931 |
| a2 | 0.6024 | 8.9441 | 7.1550 | 6.8219 |
| b1 | 0.8454 | 5.5781 | 4.5622 | 5.5780 |
| b2 | 0.8329 | 5.7994 | 4.7828 | 5.7616 |
| c1 | 0.8956 | 4.5834 | 3.8412 | 4.4282 |
| c2 | 0.8701 | 5.1133 | 4.3475 | 5.0135 |
| d1 | 0.4529 | 10.4920 | 8.4348 | 9.9923 |
| d2 | 0.3989 | 10.9971 | 8.8655 | 10.7433 |
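Tables 2 and 3 report R², RMSE, MAE, and ubRMSE. As a reading aid, the sketch below computes these four metrics under their usual definitions. It assumes R² is the coefficient of determination (not the squared Pearson correlation) and that ubRMSE is the RMSE after removing the mean bias; the article's own formulas are not shown in this part of the text, and the sample values are made up for illustration.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, bias and ubRMSE for paired samples."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    errors = [p - o for o, p in zip(observed, predicted)]
    bias = sum(errors) / n                      # mean error
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # Unbiased RMSE: RMSE^2 = bias^2 + ubRMSE^2 (clamped against rounding).
    ubrmse = math.sqrt(max(mse - bias * bias, 0.0))

    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum(e * e for e in errors)
    # Coefficient of determination; undefined when observations are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"r2": r2, "rmse": rmse, "mae": mae, "bias": bias, "ubrmse": ubrmse}

# Made-up soil moisture values (%), with a constant +2 offset in the predictions.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 22.0, 32.0, 42.0]
metrics = regression_metrics(obs, pred)
```

Under these definitions ubRMSE can never exceed RMSE, and the gap between the two reflects the systematic bias of the estimates.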
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, J.; Liu, L.; Yu, W.; Wang, X. Estimating Soil Moisture Using Multimodal Remote Sensing and Transfer Optimization Techniques. Remote Sens. 2026, 18, 84. https://doi.org/10.3390/rs18010084

Chicago/Turabian Style

Liu, Jingke, Lin Liu, Weidong Yu, and Xingbin Wang. 2026. "Estimating Soil Moisture Using Multimodal Remote Sensing and Transfer Optimization Techniques" Remote Sensing 18, no. 1: 84. https://doi.org/10.3390/rs18010084
