Using Sentinel-2 for Simplifying Soil Sampling and Mapping: Two Case Studies in Umbria, Italy

: Soil-sample collection and strategy are costly and time-consuming endeavors, mainly when the goal is in-ﬁeld variation mapping that usually requires dense sampling. This study developed and tested a streamlined soil mapping methodology, applicable at the ﬁeld scale, based on an unsupervised classiﬁcation of Sentinel-2 (S2) data supporting the deﬁnition of reduced soil-sampling schemes. The study occurred in two agricultural ﬁelds of 20 hectares each near Deruta, Umbria, Italy. S2 images were acquired for the two bare ﬁelds. After a band selection based on bibliography, PCA (Principal Component Analysis) and cluster analysis were used to identify points of two reduced-sample schemes. The data obtained by these samplings were used in linear regressions with principal components of the selected S2 bands to produce maps for clay and organic matter (OM). Resultant maps were assessed by analyzing residuals with a conventional soil sampling of 30 soil samples for each ﬁeld to quantify their accuracy level. Although of limited extent and with a speciﬁc focus, the low average errors (Clay ± 2.71%, OM ± 0.16%) we obtained using only three soil samples suggest a wider potential for this methodology. The proposed approach, integrating S2 data and traditional soil-sampling methods could considerably reduce soil-sampling time and costs in ordinary and precision agriculture applications. A.L.; data curation, F.S.S. and A.L.; writing—original preparation, and M.V.; writing—review and editing, M.V., A.A. and A.L.; visualization, supervision, M.V. A.A.; administration,


Introduction
Soil has a crucial influence on agricultural crop growth. Understanding the spatial variation of soil attributes can guarantee significant benefits in all crop management strategies and support the definition of variable fertilizer application for most nutrients within the modern Precision Agriculture (PA) processes [1][2][3][4]. In agricultural fields, soil properties often show relevant spatial variations [5] due to several factors, e.g., parent material, geomorphology, previous use, and management. This fact should imply the need to evaluate the soil variability before sampling through a careful survey of the surface morphology and by digging mini-pits and auger holes to assess the type and the thickness of the soil horizons. Following this approach, the in-field variation mapping is typically expensive and time consuming, requiring a very dense sampling (generally about one sample per hectare) [2,6,7].
Recently, the application of modern technologies, such as sensors for detecting electrical conductivity [8,9], determined an essential reduction in sampling time but only a partial decrease in execution costs. Thus, soil-mapping techniques are still tricky to access for most farms, often characterized by narrow profit margins per hectare.
Remote sensing (RS) has been widely used, mainly at a regional scale, to support soil mapping due to the low costs, rapid data acquisition, and high spatial and temporal resolution data availability. Thanks to the recent launch of many new earth observation satellites, a wide range of RS data with variable temporal, spatial, spectral, and radiometric resolution is available for free or at low costs. In particular, the launch of Sentinel-2 by the European Space Agency determined a significant step forward in earth observation thanks to the distribution of free multispectral, high spatial and temporal resolution data very useful for multiple applications [10][11][12][13].
RS and soil data analysis techniques are strongly dependent on the dimensionality (i.e., number of spectral variables) of the soil spectral data. Bare-soil images captured by satellites and aerial platforms can be analyzed by traditional approaches, including band ratios, discriminant analysis, and classification. Hyperspectral data were used to derive very accurate soil spectra [14], potentially useful to perform a supervised classification of satellite or UAV-derived hyperspectral data. However, the high complexity of the soil reflectance phenomenon prevents an adequate prediction of soil properties using these methods. A step forward has been the application of regression analysis, including multiple regression analysis (MLR), principal component regression (PCR), and partial least-squares (PLS) regression, to relate soil properties to reflectance data or spectral indices. Although it is based on an empirical approach and has shown many limitations, this remains the most applied method even for hyperspectral soil reflectance data. With this approach, the complex dynamics of light incidence on soil properties can be ignored [15]. The visible and near-infrared (NIR) regions of the electromagnetic spectrum are the most used in such analyses due to their close relationships with some soil properties. However, the mid-infrared and the thermal-infrared regions have been used as well. Various research studies have focused on studying such correlations and developing linear regression models comparing satellite bands with soil textural properties and OM content (Table 1). Such research has demonstrated RS's great potential in supporting soil mapping using medium and medium-high resolution images combined with regression analysis.  Some studies explored the possibility of using spatial predictors, including RS data, for soil-sampling optimization. In this regard, the geostatistical variogram technique is a common technique when spatial-dense-information ancillary data (e.g., aerial photographs, elevation, RS spectral information) are available and related to the variable of interest. The variograms of these data can indicate the likely scale of variation in the soil of the variable of interest; this information can be used to determine how many cores of soil should be taken and where to densify (in high variability areas) or reduce (in low variability areas) sampling [2]. Even though this technique is beneficial for optimizing the sampling scheme, in any case, it requires a high number of soil samples. Other studies used RS information to guide soil sampling. Yuzugullu et al. [5] obtained accuracies greater than 90% in the spatial prediction of soil zones and topsoil properties, such as pH, soil organic matter (SOM), and clay content, in agricultural fields using random forest algorithms. Wang et al. [28] showed how RS spectral information, previous crop yields, and digital elevation models could be successfully combined with a super-grid method for soil-sampling reduction in OM mapping. Castaldi et al. [29] applied multivariate statistical modeling on multispectral (S2) and hyperspectral (EnMAP) data for identifying sampling locations for Soil Organic Carbon (SOC) mapping. In this study, EnMAP data provided lower SOC variability and lower prediction accuracy compared to S2.
However, RS applications to optimize and reduce sample location at the field scale and produce soil maps are still not common, particularly using S2 data. Moreover, the available methodologies are generally based on the collection of many samples for soil characterization. In this context, our research aimed at defining and testing a low-cost and easy-to-implement soil mapping methodology, applicable at field scale, to support the definition of reduced soil-sampling schemes. For this purpose, we used S2 data and considered two essential soil attributes that strongly affect plant nutrient availability (clay and OM content). The idea behind this study is that S2 data combined with a strongly reduced soil-sampling scheme and linear regressions could produce reliable soil maps useful for ordinary or PA practices.

Materials and Methods
The general methodological workflow of the research is shown in Figure 1. After a preliminary S2 band selection based on the available bibliography, a Principal Components Analysis (PCA) was performed to reduce multicollinearity between all selected S2 bands. A subsequent cluster analysis was aimed to identify the sampling points for two reduced sampling schemes. The data from samples collected according to the reduced scheme were used in linear regressions with PCs of selected clay and OM bands to produce maps for the soil parameters under investigation. The reliability of these maps was assessed by comparing them with the analytical data acquired with a conventional soil sampling (CSS). An additional linear regression between the first Principal Components (PCs) and the clay and OM data obtained from the CSS was developed to understand and assess the relationships between the selected S2 bands and the soil properties of the fields under investigation.

Study Area
The experiments were carried out in two different plain fields, each about 20 ha wide, owned by the FIA, "Fondazione per l'Istruzione Agraria," located in Tiber valley, near Deruta, Umbria, Italy ( Figure 2). The study area is characterized by a Mediterranean climate, with a dry season from May to September and a cold and rainy season from October-November to March-April. According to the USDA soil taxonomy [30], the soils of the area, derived from fluvial and lacustrine sediments, are classified as Fluventic Haplustept. The study areas have no topographic gradient and are characterized by a soil textural gradient mainly related to their proximity to the Tiber stream at the east of both fields. Field 1 shows a more regular gradient than Field 2, characterized by a more evident local variability.

Identification of the Sampling Locations
All level 2A Sentinel-2 images with no cloud cover, timed close to the soil-sampling dates, were collected for the two study areas. The images, georeferenced according to WGS84-UTM32, were downloaded from The Copernicus Open-Access Hub. They were visually analyzed in false colors and compared, calculating the average NDVI to identify the most representative dates on which the two fields had the lowest vegetation percentage. Finally, the images of 28 September 2018 (Field 1) and 20 November 2019 (Field 2) were selected for the subsequent analysis.
Considering the scientific evidence reported in Table 1, we selected the S2 bands according to the evidence pointed out by Nanni and Demattê [17] for Landsat TM since they appeared the most adaptable to our case studies. In this vein, S2 bands included in the analysis were B02, B04, B08, B11, and B12 for clay and B03 and B12 for OM. A bilinear resampling to 10 meters was performed in SAGA GIS (version 7.5.0) [31] on the two S2 20-m resolution bands (B11, B12) to uniform the bands' spatial resolution.
A K-means cluster analysis was carried out in SAGA GIS to identify natural groupings in the data and spatially identify the reduced soil-sample schemes' optimal configuration for each field. To this aim, in the same software, a Principal Component Analysis (PCA) based on the variance-covariance matrix method was performed for each field to reduce the multicollinearity between input bands. In this step, all S2 selected bands (B02, B03, B04, B08, B11, and B12) were included for each field. The optimal number of clusters was found following the elbow method by analyzing the average cluster variance obtained by increasing their number [32][33][34]. The cluster's center value +/-2% was used to identify the most representative area of the cluster spatially. The final sampling location was defined manually, trying to locate two samples per cluster spaced apart within the said area or only one sample in the central part of the same area. Thus, two reduced soil-sampling schemes were defined for each field using two or one samples per cluster.

Soil Sampling
For comparison purposes, besides the said reduced sampling schemes, a conventional soil sampling (CSS) composed of 30 samples for each field was performed according to a regular georeferenced grid ( Figure 2). Soil sampling was performed on 14 November 2018 for Field 1 and 17 December 2019 for Field 2. Each sample was composed of 3 bulks collected by an auger (0-40 cm) at about 5 meters from the reference sampling location. The samples were dried for three days in the oven at 30 • C and used for the physicalchemical analyses. After that, the samples were ground and passed through a 2-mm sieve after drying. Specifically, the soil samples were analyzed for their texture by the pipette method [35] and organic matter (OM) content by Walkley-Black method [36]. Using the USDA methodology, a textural characterization was performed on all soil samples from both fields [37]. Boxplots for clay and OM were plotted to compare data distributions of the three sampling schemes.

Linear Regression Analysis
Before proceeding to linear regression analysis, a PCA for each field and soil parameter was carried out in SAGA GIS, including the selected S2 bands. Only the first PC for all the four combinations was included in the subsequent regression steps, considering the very high explained cumulative variance of such components.
The output of the two reduced soil-sampling schemes was modeled through a Linear Regression Analysis (LRA), including the first PC for OM and clay contents, to assess their relationships and calculate linear equations applicable for generating maps of these two parameters. An additional LRA was performed between the first PCs and the CSS outputs to compare and assess the relationships between the S2 bands and clay and OM and possibly confirm the evidence reported in the bibliography. In this case, PCs were sampled by calculating the average value within a 20-m radius from the soil-sampling points using the "raster statistics for polygons" tool of SAGA GIS. A supplementary analysis was developed in R software to verify the base assumptions of LRA (heteroscedasticity, normal distribution of residuals, and Moran's spatial autocorrelation of residuals).
All inferential statistical analysis, LRA, R 2 , and residual analysis were performed using the statistics software R [38] (Version 4.0.3) with the "agricolae" [39] (Version 1.3.3) and "DHARMa" (Version 0.3.3.0) [40] packages. The inferential statistical analysis in this study was conducted separately for the two fields, considering them as independent experiments.
To assess the reliability of clay and OM maps derived from the LRA of reduced sampling schemes, they were compared with the CSS visually and quantitatively through a residual analysis. In the former, maps produced using only a few samples and predictors from S2 and a map generated with conventional sampling and mapping methods (splines) are visually compared. For the latter, clay and OM maps derived from the reduced sampling schemes were sampled by calculating the average value within a 20-m radius from the soilsample points of the conventional scheme. In this step, the "raster statistics for polygons" tool of SAGA GIS was used. The residuals calculated for clay and OM in this step were also spatialized in SAGA GIS using the spline method to understand their variability across the fields. In this step, the "Multilevel b-spline interpolation" tool was used.

Results
According to the USDA soil textural classification [37], the samples collected in Field 1 fall mainly in the silty-clay-loam texture class, whereas those collected in Field 2 are distributed in three different textural classes: loam, clay-loam, and silty-clay-loam (Figure 3). Regarding the clay content, for Field 1, we found a minimum value of 24.6% and a maximum of 46%, with a range of 21.4%. For Field 2, we found a minimum of 20.3% and a maximum of 35%, with a range of 14.7%. The OM content showed a minimum value of 1.41% and a maximum of 2.32% with a range of 0.91% for Field 1 and a minimum of 1.10% and a maximum of 2.19% with a range of 1.09% for Field 2. The boxplots of clay and Organic matter (OM) data from the three sampling methods show how the reduced sampling schemes capture the actual soil variability in the two fields ( Figure 4).  Three clusters were identified using the elbow method for both fields as the optimal number of clusters. The central areas of these clusters were used to locate the soil samples of the reduced sampling schemes, which, considering the number of clusters, was composed of six (RS6) and three samples (RS3) for each field ( Figure 5). PCA results of the selected S2 bands (B02, B03, B04, B08, B11, and B12) for both fields and soil parameters are reported in Table 2 (clay content) and Table 3 (OM content). The first PCs, in all cases, account for the vast majority of the information contained in the S2 bands (some representative S2 bands are shown in Figure S1).  Outputs of LRAs, including the first PCs and reduced schemes samples (RS6 and RS3), are shown in Table 4. R-squared values for clay indicate for Field 1, and both RS6 and RS3 indicate a good fitting of the linear models (R 2 for RS6: 0.88**, R 2 for RS3: 0.99**). In the same field, the models of OM show a medium-high correlation of the predictor using RS6 (0.74**), whereas for RS3, even though it shows a higher correlation (0.96), it results are less significant. For Field 2, the linear models were less effective for clay and OM both for RS6 and RS3. Only for clay and RS6 does the model shows a medium correlation (R 2 = 0.67*). LRA outputs for the CSS are shown in Table 5. R-squared values indicate for Field 1 a high correlation of the predictors with clay content (0.85) and a medium-high correlation with OM content (0.65). Medium-low correlations were found in Field 2 both for clay (0.31) and OM (0.32) contents. The tests on the residuals highlighted only a partial skewness for OM in Field 2 and a slight spatial autocorrelation for clay in Field 1. Table 5. Output parameters from linear regressions between selected PCs and soil clay and OM of CSS (Conventional Sampling Scheme). * Significance at the p < 0.01 level of probability. ** p < 0.05 means the existence of spatial autocorrelation between the residuals. Clay and OM maps from LRA on RS6 and RS3 were visually compared with the map obtained from CSS using spline interpolation ( Figure 6). Correlation and statistics of residuals between RS6 and RS3 maps and the correspondent, measured CSS values support a quantitative assessment of results ( Table 6). The distribution of residuals is shown in the Supplementary Material ( Figure S2 and Figure S3). Maps of these residuals obtained using spline interpolation show the spatial variability of the residuals (Figure 7).

Discussion
Our study investigated a methodology for quantitative characterization of clay and OM at the field scale based on remotely sensed data by Sentinel 2 and reduced soil sampling. Our results confirmed the evidence widely reported in the literature on a regional scale about the ability of optical remote sensing to support the soil texture and OM gradients estimation. We obtained comparable results of previous studies, highlighting a correlation between the visible and infrared bands with the soil OM content [16,17,19,[23][24][25]. At the same time, results obtained for clay confirmed the evidence reported in previous relevant studies [16,17,27].
The proposed methodology is low cost and mainly based on unsupervised classification of RS data. Compared to other cited approaches to optimize soil sampling [2,5,28,29], it is more straightforward and easy to implement even within modern, web-based decisionsupport systems supporting precision agriculture [13]. The unsupervised classification of S2 data, based on K-means cluster analysis, represents a key methodological step. In this regard, we applied the widely used Elbow method to identify the optimal cluster number. Other approaches, possibly more effective in detecting representative spatial clusters, could be applied and tested in this step (e.g., average silhouette [41], gap statistic [42]). Moreover, other more advanced clustering methods, able to consider spatial relationships among pixels, even object based and including textural information with higher resolution images [43,44], could be tested for this purpose.
The correlations of linear models were considerably more significant in Field 1 than in Field 2 ( Table 4). The high R 2 values highlighted an encouraging performance of RS techniques in identifying the general gradient of the field. Indeed, R 2 values were higher in the regression with the conventional sampling scheme since there was no detection of the local variability of the gradient in the reduced sampling. On the other hand, the small number of samples decreased the significance level (p-value). The method's validity lay in the basic assumption that the predictors used were significantly related to the respective soil parameters, as demonstrated in the conventional sampling-scheme regression (Table 5) and reported in the literature.
The soil sampling highlighted a regular increasing clay and OM content trend in the former, moving from west to east. Differently, for the latter, the sampling showed a more irregular gradient both for the OM and for clay contents, with a spatial trend characterized by local variations of these parameters but with a visible base gradient given by increasing quantities of clay moving from east to west and increasing quantities of OM from southeast to northwest. The more significant local variability occurring for Field 2 than for Field 1 was attributed to its topographic position. Indeed, Field 2 is located very close to the Tiber river, and it is conceivable that the fluvial dynamics over time have influenced the characteristics of the soil, particularly its texture. This fact may explain the low correlation with remote sensing data, which were able to identify only the primary spatial trend (Figure 6) of the soil gradient, losing the information linked to local variations. However, concerning Field 1, despite the high correlation between clay content and the relative PC, the latter did not explain the whole variability across the field, as highlighted by the slight spatial autocorrelation between the residuals (Moran's I Test < 0.05) ( Table 5). In the case of Field 2, on the other hand, although the absence of spatial autocorrelation between the residuals (Moran's I Test > 0.05) (Table 5), the low correlation between the soil parameters and the respective PCs suggested the presence of a good portion of variability not explained by the predictors.
As reported by the textural analysis (Figure 3), the samples are placed in a narrow range of the USDA textural triangle for both fields. Most of the samples from Field 1 are placed within the silty-clay-loam textural class, whereas the samples from Field 2 are mainly distributed in three classes (loam, clay loam, and silty-clay-loam). The analysis of textural classes and the soil parameters' variation range showed that the gradient, although existing, was not so marked in Field 2, partly explaining the difficulty of remote sensing to identify it. This evidence is confirmed by boxplots of data from the different sampling methods (Figure 4). The reduced sampling schemes captured most of the variability of clay in both fields and of OM in Field 1 (RS3 performed better than RS6 in this case); differently, in Field 2, the reduced samplings did not express the (lower) variability of OM correctly.
For Field 1, the clay and OM maps, deriving from the reduced sampling, provided visually similar results compared with those obtained with the spline interpolation of the conventional sampling ( Figure 6). The textural and OM gradients, increasing from east to west, are well represented in these maps. Figure 7 shows spatially the differences between the maps derived from the reduced samplings and remote sensing information and the maps derived from conventional sampling. In this regard, maps of Field 1 showed only slight differences, with most of the area characterized by differences of +/−2 % for clay (+/−5% maximum) and +/−0.2 % for OM (+/− 0.5% maximum). For Field 2, the differences were more marked, especially for the clay content, with peaks of more than −5% difference in the northern part of the field. The maps also showed the difficulty of remote sensing identifying the sharp, local variations characterizing the second field, with the maps taking on a "patchy" appearance.
As shown in Table 6, even using only three samples provided helpful information with results comparable to the six-point sampling. Compared to the conventional sampling, the normalized average error obtained from the RS3 maps is between 1.70% and 2.72% for clay and between 0.16% and 0.17% of OM. Moreover, for Field 1, almost 90% of the area of the RS3 maps show an estimation error between -3.50% and +3.36% for clay and between −0.32% and +0.08% for OM. For Field 2, the RS3 maps show a higher error, but they appear still usable for agronomic purposes, with 50% of the area characterized by an error between −2.09% and 2.86% for clay −0.22% and −0.11% for the OM.

Conclusions
Soil is one of the most significant factors for crop growth and yield, and its knowledge provides benefits for farm management. Soil-map making generally requires a very dense field sampling or more advanced and relatively expensive techniques. This research generally confirmed the state-of-the-art evidence of soil parameters detection by optical remote sensing. Moreover, the results demonstrated how remote sensing, even if approximately, can identify the general gradient trend of soil properties, such as clay and OM contents, in agricultural fields. Even though other additional experiments are necessary, results suggest that, using the proposed procedure based on Sentinel 2 data, it is possible to define a reduced sampling scheme useful for speeding-up on-field soil sampling operations. S2 data may help characterize a given soil gradient even using a few soil samples and differentiate the crops' operations within a precision agriculture process.