1. Introduction
Soil plays a vital role in supporting numerous essential ecological functions, encompassing sustaining crop production, nutrient cycling, carbon (C) storage, climate regulation, water purification, transfer of mass and energy between spheres, and acting as a habitat for diverse flora and fauna [
1,
2]. Healthy soils are the foundation of over 90% of the food we eat [
3] and host 25% of all biodiversity on the planet [
4]. Given the various physical (e.g., sealing, compaction, and erosion) and chemical (e.g., contamination, salinization, loss of organic matter) pressures exerted on soils, it is of paramount importance to preserve the soil ecosystem [
5]. In the European Union (EU) alone, 70% of the soils are estimated to be in an unhealthy condition, with carbon loss, erosion, land take, and contamination being the major threats [
6]. In July 2023, the EU proposed a new soil monitoring law [
7] to protect and restore soils and ensure that they are used sustainably; there is also a focus on defining what soil health is and on the methodologies used to monitor it [
8,
9,
10,
11,
12]. Additionally, soil’s capacity to sequester large amounts of C is taken into account, and techniques for monitoring soil carbon change are also widely examined [
13].
Scientists, policymakers, and stakeholders alike acknowledge their critical responsibility in ensuring that soils maintain this capacity to sustain human life on the planet, and are developing tools to monitor the soil ecosystem [
14]. These tools rely on large-scale Earth observation (EO) monitoring that leverages both the big geospatial data generated by space-borne missions and in situ monitoring [
15,
16,
17], to provide accurate, timely, and objective information to support evidence-based conservation recommendations, which is crucial for creating a more sustainable future [
18,
19]. To process these large datasets, various technological solutions have been proposed [
20], with the datacube approach providing access to large spatio-temporal data in an analysis-ready form and offering the advantages of on-premises hosting and processing [
21,
22]. It has been successfully employed in areas ranging from helping farmers manage their fields and forecasting food prices for financial markets to supporting the EU Common Agricultural Policy [
23], and in monitoring agricultural productivity [
24].
When croplands are considered, bare-soil surface reflectance composites are usually constructed using a combination of thresholding and other heuristic methodologies from multispectral imagery [
25,
26,
27]. The majority of studies usually take advantage of the spectral features of exposed soils over cropland areas, then apply various artificial intelligence (AI) algorithms to estimate the soil properties as derived from a compilation of soil samples. It is now even possible to provide high-spatial-resolution mapping using the European Copernicus Sentinels that provide images every 5 days at a 10 m resolution. To our knowledge, only a few studies have been conducted that have generated soil-related indicators at both a large scale and high spatial resolution [
28,
29]. However, these studies provide only a coarse representation of the various soil properties, which makes them unsuitable for monitoring good agricultural and environmental conditions (GAECs) [30] in the framework of cross-compliance, given the small parcel sizes in Europe. For permanently vegetated areas, on the other hand, the approaches mostly rely on multiple environmental covariates (e.g., vegetation indices, climate data, and geology) from which the soil properties are estimated [
31,
32,
33], sometimes also incorporating radar data [
34].
The recognized importance of sustainable soil carbon sequestration practices to contribute to climate change mitigation [
35] has led to various applications of satellite EO data, demonstrating the benefits of EO-driven soil spatial products (reference). Two of the most widely monitored soil properties via EO are the soil organic carbon (SOC) and the clay content [
36,
37]. SOC is essential for a variety of soil processes and ecosystem services, plays a crucial part in the global carbon cycle, and is a topic of scientific interest in the context of climate change [
38,
39]. Estimating the geographical distribution of SOC content is essential for directing land management to improve soil health by reducing carbon emissions and sequestering C [
40]. SOC content may be significantly altered over the years by cropping practices. Soil clay content refers to the proportion of fine particles in the soil; clay is one of the three particle-size fractions that determine the soil’s texture (the other two being sand and silt). It is of crucial importance for soil health, influencing a soil’s fertility, water-holding capacity, nutrient availability, and overall structure, ultimately affecting plant growth and ecosystem functioning [
41]. The clay content of soil typically does not change rapidly or frequently in natural conditions, being primarily determined by factors such as the parent material, weathering processes, and the geological history of the area. However, human activities, such as land management practices, can alter the clay content over longer periods of time; for example, excessive tilling or erosion can lead to a loss of topsoil, which may reduce the clay content.
This paper aims to provide a novel pipeline for topsoil monitoring based on open-access EO data, artificial intelligence (AI), and advanced digital infrastructures, motivated by the demand for improvements in soil monitoring with regard to its reliability (uncertainties) and its spatial and temporal resolution. In contrast to previous approaches, which mainly focus on small- and medium-scale modeling and generate coarse-resolution products (>30 m), the main novelty of this work is the generation of national-scale soil thematic maps with a high spatial resolution (10 m) based on an AI modeling approach. Specifically, this work will strive to:
Use multi-temporal Sentinel-2 imagery data for the identification of exposed soils over cropland areas, covering all of Lithuania at a 10 m spatial resolution.
Implement and evaluate AI algorithms for pixel-level predictions of the topsoil SOC and clay content in croplands at the national scale by using the Land Use and Coverage Area Frame Survey (LUCAS) dataset as ground truth soil data. A novel 3 × 3 padding scheme is used to augment the training dataset and provide more training data to the AI models. Moreover, two new CNN-based models predicting the two topsoil properties are introduced, whose network architecture is automatically optimized and that manage to outperform contemporary ML approaches.
Leverage the European Integrated Administration and Control System (IACS) database to transform the pixel-level predictions to more robust parcel-level estimations by proposing a novel thresholding methodology.
The proposed pipeline utilizes the Soil Data Cube (SDC), which is a self-hosted custom tool for the handling and processing of the large volume of satellite imagery data. At the same time, this work presents the main results and discusses the benefits, limitations, and perspectives of the proposed approach, while the overall methodology can be transferred and implemented in different bioclimatic zones and at various scales. Moreover, the parcel-level information could strongly support soil monitoring by national authorities and as part of different components of European policies, such as eco-schemes, in a more cost-efficient way.
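The 3 × 3 padding scheme listed among the objectives above can be illustrated with a minimal sketch. It assumes that the spectra of the pixels surrounding a LUCAS point inherit the laboratory value measured at that point; the function name `pad_3x3`, the nested-list image layout, and the toy values are hypothetical and are not taken from the SDC implementation.

```python
from typing import List, Sequence, Tuple

Spectrum = Sequence[float]      # one reflectance value per band
Image = List[List[Spectrum]]    # image[row][col] -> spectrum


def pad_3x3(image: Image, row: int, col: int, label: float) -> List[Tuple[Spectrum, float]]:
    """Return (spectrum, label) pairs for the 3 x 3 window centred on (row, col).

    Neighbours that fall outside the image are skipped, so points on the
    image border yield fewer than nine samples.
    """
    n_rows, n_cols = len(image), len(image[0])
    samples = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols:
                samples.append((image[r][c], label))
    return samples


# Toy 4 x 4 image with two bands per pixel (values are made up).
toy = [[(r / 10, c / 10) for c in range(4)] for r in range(4)]
augmented = pad_3x3(toy, row=1, col=1, label=2.3)  # interior point
corner = pad_3x3(toy, row=0, col=0, label=1.1)     # border point
print(len(augmented), len(corner))
```

In practice, one would probably also discard neighbours that are not flagged as bare soil; that filter is omitted here for brevity.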
4. Discussion and Future Perspectives
4.1. Soil Data Cube
Platforms such as Google Earth Engine (GEE) [
68], ODC [
46], Copernicus Data and Information Access Services (DIAS) (
https://www.copernicus.eu/en/access-data/dias) (accessed on 12 August 2023), and the new Copernicus Data Space Ecosystem Services (
https://dataspace.copernicus.eu/) (accessed on 20 August 2023) have greatly boosted the use of big data, enabling large-scale applications. However, closed-source cloud-based solutions are expensive to operate and offer no guarantee of continuity, which creates uncertainty for users with long-term requirements. On the other hand, the ODC framework gives scientists direct access to the data and to the processing capabilities of the infrastructure, providing a strong motivation for its use [
20].
Considering the above, the SDC custom tool is self-hosted and does not depend on cloud computing, making it extendable and customizable. An additional advantage is that the SDC is an end-to-end framework that can provide the long-term and baseline data required to determine trends, quantify past and present changes, and inform future decisions. Admittedly, self-hosting entails a large cost for the procurement of processing servers and storage space, but for medium-sized and large organizations the necessary infrastructure is usually already in place. For the above reasons, we chose to develop the national-scale products using the SDC. This spatio-temporal information can be used as an evidence base for the design, implementation, and evaluation of soil-related EU policies, programs, and regulations, as well as for developing market consulting services.
4.2. Ground Truth Data
Monitoring soil properties at large scales is challenging, mainly because field sampling and laboratory analysis are both time-consuming and costly [69]. Remote sensing spectroscopy can map soil properties efficiently at large scales, but fully harnessing its potential requires accurate and reliable ground truth data, mainly for calibration and validation. In this work, the absence of ground truth data at the national scale highlights the significance of the LUCAS dataset as an exemplary resource of in situ data supporting the generation of large-scale soil thematic maps through AI techniques. It should also be highlighted that the available points of the LUCAS training dataset were measured under laboratory conditions (dried and sieved); thus, their spectral information is not influenced by moisture or roughness.
Moreover, and bearing in mind that soil is a three-dimensional body, it should be mentioned that, in general, remote sensing techniques including spectroscopy can map only topsoil properties and cannot replace soil mapping using soil survey techniques. In addition, in situ spectroscopy methods using portable and low-cost sensors could further support the remote sensing with ground truth data [
70,
71].
4.3. Exposed Soils
The quality of the bare-soil reflectance composite map depends mainly on the number of images with low cloud coverage and on the correct definition of the threshold values for the vegetation indices that are used to mask the vegetated areas [
72].
Considering our case study area, located in northern Europe, there are long periods with high cloud coverage or snow cover. This, combined with the cloud cover filter, reduced the number of available satellite images, thereby reducing the detection of bare-soil pixels. It is also worth emphasizing that the satellite’s revisit time plays a crucial role in the output; in general, shorter revisit times provide more images and information [
54,
73]. The Sentinel-2 revisit time appeared to be sufficient for Lithuania for the period from March to October. However, it was not possible to detect all of the exposed soil areas across the whole country during the very cloudy and snowy season (November–February), although this period is considered particularly critical and should be monitored, especially for GAEC and eco-scheme needs.
Additionally, the use of zonal statistics filters to transform the information from the pixel to the parcel level causes a significant reduction in the number of final parcels for prediction. Any future adjustment of the thresholds (see Table 6), whether stricter or looser, will cause a corresponding change in the total number of parcels for which predictions are made.
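To make the effect of such thresholds concrete, the following minimal sketch aggregates pixel-level predictions to a single parcel-level value. The two thresholds (`min_bare_fraction`, `min_bare_pixels`) and the use of the mean are illustrative assumptions; they are not the values or statistics of Table 6.

```python
from statistics import mean
from typing import Optional, Sequence


def parcel_estimate(
    predictions: Sequence[float],
    bare_flags: Sequence[bool],
    min_bare_fraction: float = 0.5,  # hypothetical threshold
    min_bare_pixels: int = 5,        # hypothetical threshold
) -> Optional[float]:
    """Aggregate the pixel predictions of one parcel, or return None.

    A parcel receives an estimate only if enough of its pixels were
    detected as bare soil; otherwise it is dropped from the final product.
    """
    if not predictions:
        return None
    bare_values = [p for p, bare in zip(predictions, bare_flags) if bare]
    bare_fraction = len(bare_values) / len(predictions)
    if len(bare_values) < min_bare_pixels or bare_fraction < min_bare_fraction:
        return None
    return mean(bare_values)


# Toy parcel with ten pixels, six of which are flagged as bare soil.
pixel_soc = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 9.0, 9.0, 9.0, 9.0]
mostly_bare = [True] * 6 + [False] * 4
rarely_bare = [True] * 2 + [False] * 8

kept = parcel_estimate(pixel_soc, mostly_bare)     # parcel passes both thresholds
dropped = parcel_estimate(pixel_soc, rarely_bare)  # too few bare pixels -> None
print(kept, dropped)
```

Raising either threshold removes more parcels from the final product, while lowering it admits parcels whose estimates rest on fewer bare pixels.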
Moreover, the definition of thresholds for the used spectral vegetation indices (NDVI, NBR2) requires both an expert interpretation of the image and field observations [
74]. Heiden et al. (2022) [
26] proposed a new approach for bare-soil composite creation based on a new thresholding for the two most common vegetation indices (NDVI, NBR2), while also developing a new index called PV+IR2 that combines information from the visible to near-infrared (VNIR) and short-wave infrared (SWIR) wavelength regions.
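The index-based masking discussed here can be sketched as follows. The example computes NDVI and NBR2 for each observation of a single pixel, keeps the observations that fall below both thresholds, and takes the per-band median as the bare-soil composite. The threshold values, band names, and reflectances are placeholders chosen for illustration; they are not the thresholds used in this work.

```python
from statistics import median
from typing import Dict, List, Optional

Observation = Dict[str, float]  # band name -> surface reflectance (0..1)

NDVI_MAX = 0.25   # placeholder threshold, not taken from the paper
NBR2_MAX = 0.075  # placeholder threshold, not taken from the paper


def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def is_bare_soil(obs: Observation) -> bool:
    """Flag an observation as bare soil if both indices are below their thresholds."""
    ndvi = normalized_difference(obs["nir"], obs["red"])      # vegetation greenness
    nbr2 = normalized_difference(obs["swir1"], obs["swir2"])  # residue and moisture
    return ndvi < NDVI_MAX and nbr2 < NBR2_MAX


def bare_soil_composite(series: List[Observation]) -> Optional[Observation]:
    """Per-band median over the bare-soil observations of one pixel."""
    bare = [obs for obs in series if is_bare_soil(obs)]
    if not bare:
        return None
    return {band: median(obs[band] for obs in bare) for band in bare[0]}


# Toy time series for one pixel (made-up reflectances).
series = [
    {"red": 0.04, "nir": 0.40, "swir1": 0.20, "swir2": 0.10},  # dense vegetation
    {"red": 0.15, "nir": 0.20, "swir1": 0.30, "swir2": 0.28},  # bare soil
    {"red": 0.17, "nir": 0.22, "swir1": 0.32, "swir2": 0.30},  # bare soil
    {"red": 0.15, "nir": 0.20, "swir1": 0.30, "swir2": 0.22},  # high NBR2, rejected
]
composite = bare_soil_composite(series)
print(composite)
```

Because every retained observation must pass both tests, small changes to either threshold can noticeably change how many pixels contribute to the composite, which is why their definition benefits from expert interpretation and field observations.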
It should also be mentioned that the effects of soil conditions, such as soil moisture or roughness, could affect the generation of the bare-soil composite maps and, therefore, the accuracy of the prediction models [
75,
76], making it more difficult to map soil indicators (e.g., SOC) at large scales [
77]. By using the synergy of radar and optical data, e.g., Sentinel-1 data and Sentinel-2 data [
78], or by using hyperspectral data [
79,
80], this influence could be eliminated [
81]. The incorporation of climatic data may also help to exclude time periods that correspond to precipitation events. As this work is a large-scale application, the aforementioned effects were not taken into consideration, both because doing so would increase the computational and storage requirements and because large-scale hyperspectral data are of limited availability; hence, this could not be supported at this stage with the available capacity.
4.4. Results and Comparison of the Artificial Intelligence Models
The accuracy results of
Table 4 indicate that the CNN model outperforms the other learning algorithms (i.e., RF, SVR, PLS, and Cubist) in the context of predicting SOC and clay content in croplands. The results clearly demonstrate the superior performance of the CNN model across both prediction tasks, with an 8% relative decrease in the RMSE compared to the second-best model as far as SOC is concerned, and a 24% decrease for clay. This may be attributed to several factors. One key advantage of CNN models is their ability to perform more effective feature extraction from the input data [
82]. This capability may have allowed the models to capture intricate patterns and relationships within the Sentinel-2 multispectral data that were critical for accurate predictions. In contrast, traditional models often rely on manual feature engineering or predefined shallow feature spaces, which can limit their capacity to represent the complexity of the data adequately. Moreover, SOC and clay content are influenced by a multitude of factors, many of which interact in a non-linear manner. CNNs are inherently suited to capture these intricate and non-linear dependencies [
83], whereas models like RF, SVR, Cubist, and PLS often struggle to represent and exploit such non-linearity effectively. The CNN models may also have benefited from advanced regularization and optimization techniques, which contributed to their robustness and ability to generalize well on unseen data [
84]. These techniques helped to mitigate the risk of overfitting, which can be a challenge in complex prediction tasks. On the other hand, traditional models often require careful and meticulous tuning of hyperparameters to achieve a comparable level of robustness. Some final notes regarding the suitability of the CNN approach, although not clearly demonstrated herein, pertain to their scalability and their capacity to use transfer learning to be applied to new regions. The scalability is important when generating national-scale soil thematic maps, as the amount of data involved may be substantial. For example, SVR has an algorithmic complexity of
at least $O(N^2 D)$ when using the RBF kernel, where $N$ is the number of training samples and $D$ the number of features, which means it does not scale well as the number of samples increases. In addition, it is relatively easy to re-purpose the CNN model; a pre-trained deep learning model can be fine-tuned on specific tasks, which is particularly useful when starting with a limited amount of labeled data and using, e.g., a previously trained continental-scale model.
Additionally, we note that the provided CNN network architecture was optimized using the HyperBand algorithm, which identified, among other things, the optimal number of convolutional layers and parameters, the number of dense layers and their neurons, and the activation functions used. These led to relatively complex models with about 57k parameters for the SOC model and 84k for the clay model. In general, the relationship between model complexity and predictive accuracy is not linear, and a higher number of parameters does not inherently guarantee better predictions. The search space used and the methodology employed to perform hyperparameter tuning (and thus to identify the optimal architecture of the CNN models) were selected to balance the number of parameters with the complexity of the problem at hand.
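As a rough illustration of how such a search allocates its budget, the sketch below implements successive halving, the routine that HyperBand runs repeatedly with different trade-offs between the number of candidate configurations and the budget given to each. It is a simplified stand-in under stated assumptions, not the tuning code used in this work: the search grid, the toy loss function, and the parameter names are hypothetical.

```python
import math
from typing import Callable, Dict, List

Config = Dict[str, float]


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_budget: int = 1,
    eta: int = 3,
) -> Config:
    """Keep the best 1/eta of the configurations each round and multiply
    the training budget of the survivors by eta, until one remains."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_loss(cfg: Config, budget: int) -> float:
    """Stand-in for a validation loss after `budget` epochs (lower is better)."""
    architecture_penalty = abs(cfg["n_conv_layers"] - 2)
    lr_penalty = abs(math.log10(cfg["learning_rate"]) + 3)
    return architecture_penalty + lr_penalty + 1.0 / budget


# Hypothetical search grid: 3 layer counts x 3 learning rates = 9 candidates.
grid = [
    {"n_conv_layers": n, "learning_rate": lr}
    for n in (1, 2, 3)
    for lr in (1e-2, 1e-3, 1e-4)
]
best = successive_halving(grid, toy_loss)
print(best)
```

With nine candidates and `eta=3`, the sketch evaluates all nine at the smallest budget, the best three at three times that budget, and returns the single survivor, so most of the compute goes to the promising configurations.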
4.5. General Overview and Comparison with Other Works and Products
A thorough study from Safanelli et al. (2020) [
85] covered the whole European territory with a 30 m spatial resolution using the GEE platform and Landsat image data processed with the Geospatial Soil Sensing System (GEOS3). They achieved $R^2$
values of 0.44 for clay, but only 0.06 for SOC, by using more than 7142 data points. Sorenson et al. (2021) [
86] trained an RF model with legacy data (454 points for SOC and 435 points for clay) and, using the GEE platform and Landsat-5 imagery with a 30 m spatial resolution, obtained $R^2$ values of 0.55 and 0.44 for the SOC and clay content predictions, respectively, for Saskatchewan soils. More recently, Wang et al. (2023) [
53] used Sentinel-2 data downloaded from GEE, and 160 available soil samples for SOC predictions with a 10 m spatial resolution. Their results showed that the XGBoost algorithm achieved the best results (
$R^2$ = 0.77) in farmlands in a karst trough valley area of southeast China. They also compared their results with three other open SOC products (SoilGrids with 1 km and 250 m resolutions, and the Harmonized World Soil Database) and found substantial differences. In addition, they proposed that global models such as SoilGrids are likely to be unsuitable for areas with a high spatial heterogeneity, and instead, a local model would be more appropriate. Also, Salazar et al. (2023) [
78] used Sentinel-1/2 data to create temporal bare-soil mosaics over a 6-year period for an agricultural region in central France (4838 km²) with a 25 m spatial resolution. They estimated SOC concentrations based on a quantile regression forest (QRF) model and tested different environmental covariate datasets, with the best prediction accuracy being $R^2$ = 0.33.
Considering the AI models’ performance in this work, the CNN outperforms the other learning algorithms by a wide margin for both properties, i.e., it increases the $R^2$ on average by 23% compared to the second-best model (SVR for SOC and PLS for clay). In general, we can state that the prediction accuracy for both indicators is very close to that reported in the recent literature. However, it should be emphasized that the innovation and findings of this work focus mainly on the large-scale application of the methodology, covering 65,300 km², and the generation of high-spatial-resolution maps (10 m), as well as the utilization of open databases only (e.g., Sentinel, IACS, LUCAS). As mentioned above, in situ training data were lacking; in contrast with other studies, which mostly used in situ data to train the models, this work used only LUCAS as a training dataset, yet the prediction results are very close to those previously reported in the literature. Additionally, the IACS dataset is rarely used in the literature, whereas this study not only uses it at the national level but also presents an analytical methodology for calculating the final value per parcel, strongly supporting the EU’s policy requirements. In general, from our point of view, the low SOC and clay accuracies in the literature, and in this study as well, are to be expected for large-scale predictions using multispectral optical data, because the soil pressures from various farm management activities have a strong effect on SOC, while erosion can affect the spatial variability of clay. Here, uncertainty maps can make a valuable contribution by giving the end user extra information on the confidence levels of the final predictions. In this context, a useful finding was provided by Dvorakova et al. (2023) [
87], who found that the uncertainty of the SOC predictions decreased when the number of scenes per pixel increased, i.e., when a minimum of more than six observations per pixel was kept.
With respect to the uncertainty results, it is noted that the average PIR was about 0.48 for SOC and 0.61 for clay, with a tendency to be higher for the 2022 products (corresponding to the annual bare-soil products of 2020–2022) than for the 2020 and 2021 products (corresponding to 2018–2020 and 2019–2021, respectively). These values suggest a relatively robust model, particularly for SOC, where the PIR is lower than 0.50. Dvorakova et al. (2023) [
87] report a higher PIR of 0.88 for SOC prediction using Sentinel-2 data in Belgium and the Netherlands, while Qu et al. (2024) [
88] found a PIR of about 0.50 for predicting sand content in eastern China using digital soil mapping techniques. It should be noted, however, that the scientific community has not yet settled on the best uncertainty measure in digital soil mapping. In ML predictions, either bootstrapping or empirical uncertainty quantification through data partitioning and/or cross-validation is often used [
89]. Even in studies that use bootstrapping, the 90% interval is not used throughout, with some suggesting that the actual width of these intervals should be lower. Moreover, the formulae used are not the same. For example, Zhou et al. (2022) [
90] used a different definition for PIR, using the mean of the bootstraps and the standard error corresponding to the 90% confidence interval. Other studies employ the prediction interval coverage probability (PICP), which assesses whether the probability assigned to the prediction intervals matches the actual frequency of observed test data falling within those intervals [
91]; however, a caveat is that it does not possess the ability to identify one-sided bias in quantile predictions [
92].
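The two kinds of measures discussed above can be contrasted in a short sketch. Since definitions of PIR differ between studies, the version below is only one plausible choice and is an assumption of this example: the width of the 90% interval of the bootstrap predictions divided by their median. PICP follows its usual definition as the fraction of observed values that fall inside their prediction intervals. All numbers are made up.

```python
from statistics import median, quantiles
from typing import Sequence, Tuple


def interval_90(bootstrap_preds: Sequence[float]) -> Tuple[float, float]:
    """5th and 95th percentiles of the bootstrap predictions for one pixel."""
    cuts = quantiles(bootstrap_preds, n=20, method="inclusive")  # 19 cut points
    return cuts[0], cuts[-1]


def pir(bootstrap_preds: Sequence[float]) -> float:
    """Assumed PIR definition: 90% interval width relative to the median."""
    lower, upper = interval_90(bootstrap_preds)
    return (upper - lower) / median(bootstrap_preds)


def picp(observed: Sequence[float], lowers: Sequence[float], uppers: Sequence[float]) -> float:
    """Fraction of observations that fall inside their prediction interval."""
    inside = sum(lo <= y <= up for y, lo, up in zip(observed, lowers, uppers))
    return inside / len(observed)


# Ten bootstrap predictions of SOC for a single pixel (made-up values).
boot = [1.8, 1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.3, 2.4, 2.6]
pixel_pir = pir(boot)

# Four test samples with their 90% prediction intervals (made-up values).
coverage = picp(
    observed=[2.0, 3.0, 1.0, 4.0],
    lowers=[1.5, 2.0, 1.2, 3.0],
    uppers=[2.5, 2.8, 2.0, 5.0],
)
print(round(pixel_pir, 3), coverage)
```

For a well-calibrated 90% interval, the PICP on independent test data should be close to 0.90; a PIR, by contrast, describes only how wide the interval is relative to the prediction and says nothing about coverage.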
Moreover, it is well known that soil varies in space and is strongly related to variations in the landscape; thus, topographic data (e.g., a DEM) could improve the model prediction accuracy. However, our study area lies at very low altitudes and is almost flat (especially in the agricultural area). Therefore, the influence of the topography on the final model predictions is considered to be very small or even negligible. Nevertheless, if the proposed methodology is applied to other regions with stronger topographic features, the inclusion of topographic data will likely be necessary to improve the accuracy of the models, even more so if the modeling is to be implemented in permanently vegetated areas, where environmental covariates have a strong influence on the results. In general, we focused on a modeling approach using pure spectral information to provide a more general model and did not use digital soil mapping techniques that include several environmental covariates as model inputs.
4.6. Future Perspectives
Regarding the SDC, its capability to add more satellite sensors to the pipeline (e.g., radar and hyperspectral sensors) as well as to predict more soil quality indicators will be the main focus in the future. Another important step is to calculate a series of soil erosion, soil texture, soil organic carbon, pH, and nutrient indicators for 9.1 million parcels (
https://ec.europa.eu/eurostat/statistics-explained/SEPDF/cache/73319.pdf) (accessed on 28 July 2023), corresponding to the entire (157 million ha) agricultural area of the EU countries, with the possible involvement and use of the Joint-Research-Centre-hosted (JRC) big data analytics platform JEODPP [
93]. Considering also that about two-thirds (63.8%) of the EU’s agricultural area consists of parcels less than 5 ha in size, it is a necessity to improve the spatial resolution of the maps by using either multispectral or hyperspectral data.
In addition, other methodologies for bare-soil reflectance composite generation [
26] may be examined, in addition to testing different thresholds for the vegetation spectral indices (e.g., NDVI, NBR2). A related issue is the general cloud and cloud-shadow masking problem, for which different approaches could be explored, such as combining the Sentinel-2 cloudless algorithm with the Sen2Cor cloud mask (SCL 4/5/6). Already-developed algorithms can be used, but their results are still not faultless [
77]. Moreover, a benchmark classification dataset with reference (ground truth) data on the occurrences of bare soil in space and time does not currently exist; the research community should focus on generating such an open dataset to enable quantifiable evaluations of the different bare-soil detection methodologies.
Another critical concern, recognized worldwide, is the improvement of AI model performance. Generally, and especially for SOC and soil texture estimations, hyperspectral data have shown better prediction accuracy than multispectral data [
94,
95]. Thus, data from upcoming hyperspectral missions such as ESA’s CHIME or Tanager by Planet will be considered.
Finally, the implementation of the proposed overall methodology in different bioclimatic zones and their correlation with farm management practices will be at the forefront of future research.
5. Conclusions
This work aimed to provide a comprehensive approach for the production of high-resolution annual soil thematic maps of croplands at the national scale, and also presented a representative pathway for going from pixel- to parcel-level decisions based on a thresholding methodology. The results of this study showed that integrating open-access data (Sentinel-2, LUCAS, IACS) and AI algorithms into a self-hosted tool (the SDC) can generate annual, spatially explicit soil indicators (SOC, clay content) with satisfactory accuracy, using a bare-soil modeling approach with a 10 m spatial resolution.
Specifically, the detection of bare-soil areas throughout the country was achieved by analyzing a three-year time series of Sentinel-2 data from March to October and combining three different filters (
Section 2.3.1) to exclude non-bare-soil pixels. This approach generated a more complete bare-soil map compared to the annual products, detecting over 85% of the total number of parcels as bare.
Furthermore, considering the lack of sufficient in situ training data, the 3 × 3 padding around the LUCAS central point played a crucial role in the models’ performance. In this regard, this work indicated that deep learning models and, specifically, the CNN models proposed herein, which were automatically optimized using the HyperBand algorithm, outperformed the classical machine learning models, achieving fair results ($R^2$ of 0.51 for SOC and 0.57 for clay) using the Sentinel-2 multispectral information to generate large-scale soil thematic maps that are also supported by their prediction uncertainties.
The newly proposed method to go from pixel- to parcel-level values is both representative and strict considering the thresholds applied for the examined soil indicators (see
Figure 3). Providing predictions at both the pixel and parcel levels has two key advantages: pixel-level predictions help in recognizing variations within parcels and, in cases where the uncertainty is high or only a portion of the pixels contained within a parcel are identified as bare, parcel-level estimations provide more robust estimates.
Nevertheless, the pixel-level products from multispectral data do not currently provide the accuracy needed to reliably guide farm management practices within a parcel. Future products developed from hyperspectral data and/or using other covariates may be able to attain the necessary accuracy levels.
Finally, building upon pan-European environmental awareness, the findings of this work could strongly support the actors involved in soil monitoring and protection (i.e., national paying agencies, agricultural policymakers, farmers, etc.). The overall approach could provide effective soil monitoring, which can contribute to climate change mitigation by reducing CO₂ emissions, support the new CAP strategic plans, the Green Deal, and eco-schemes, and provide innovative and cost-effective soil monitoring methodologies for developing monitoring, reporting, and verification (MRV) protocols for carbon estimation on mitigation measures.