Digital Mapping of Soil Organic Matter in Northern Iraq: Machine Learning Approach

Khalaf, Halmat S.; Mustafa, Yaseen T.; Fayyadh, Mohammed A.

doi:10.3390/app131910666


by Halmat S. Khalaf ¹, Yaseen T. Mustafa ¹,* and Mohammed A. Fayyadh ²

¹ Department of Environmental Science, College of Science, University of Zakho, Zakho 42002, Kurdistan Region, Iraq

² Department of Soil and Water Science, College of Agricultural Engineering Sciences, University of Duhok, Duhok 42001, Kurdistan Region, Iraq

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(19), 10666; https://doi.org/10.3390/app131910666

Submission received: 25 August 2023 / Revised: 21 September 2023 / Accepted: 22 September 2023 / Published: 25 September 2023

(This article belongs to the Special Issue Remote and Proximal Sensing Applied to Agriculture and Forest Sciences)


Abstract:

Soil organic matter (SOM) is an essential component of soil fertility that plays a vital role in the preservation of healthy ecosystems. This study aimed to produce an SOM-level map of the Batifa region in northern Iraq. Random forest (RF) and extreme gradient boosting (XGBoost) models were used to predict the SOM spatial distribution. A total of 96 soil samples were collected from the surface layer (0–30 cm) of both cropland and bare soil areas in Batifa. In addition, remote sensing data were obtained from Landsat 8, including bands 1–7, 10, and 11. Supplementary variables, namely the normalized difference vegetation index (NDVI), soil-adjusted vegetation index (SAVI), brightness index (BI), and a digital elevation model (DEM), were employed as predictors of SOM levels across the region. To evaluate the accuracy of the RF and XGBoost models in predicting SOM levels, statistical metrics, including mean absolute error (MAE), root mean square error (RMSE), and determination coefficient (R²), were used, with 80% of the data used for calibration and 20% for validation. The findings of this study revealed that the XGBoost model exhibited higher accuracy (MAE = 0.41, RMSE = 0.62, and R² = 0.92) in predicting SOM than the RF model (MAE = 0.65, RMSE = 0.96, R² = 0.79). Band 10, DEM, SAVI, and NDVI were identified as the most important predictors for both models. The methodology employed in this study, which utilizes machine learning models, has the potential to map SOM in similar settings. Furthermore, the results offer significant insights for the stakeholders involved in soil management, thereby facilitating the enhancement of agricultural techniques.

Keywords:

Landsat 8; machine learning models; random forest; soil organic matter; XGBoost

1. Introduction

The presence of soil organic matter (SOM) has a considerable impact on various aspects of soil management, including soil mapping, interpretation of soil features, and the application of fertilizers and agricultural chemicals. While soil carbon evaluation, simulations, and modelling of the landscape [1,2] directly measure or estimate SOM content, other techniques like precision farming leverage these estimates for optimized soil management. Nevertheless, many of these approaches may not be feasible for large-scale applications due to factors like cost. As a solution, remote sensing data emerge as a promising alternative for assessing SOM levels.

Remote sensing has been an essential data source for soil mapping and research for several decades [3]. Wilcox et al. [4] demonstrated that a linear association can be established between remote sensing data and soil organic carbon content. Chen et al. [5] also applied remote sensing to evaluate the relationship between organic matter and reflectance spectra. According to their results, the ratio of organic matter content to image sensitivity showed a high level of correlation (R² = 0.98) with the assessed data. Numerous studies have explored techniques to determine SOM. For instance, Fox and Sabbagh [6] analyzed the red and near-infrared (NIR) bands’ reflectance and intensity to gauge soil depth, formulating an equation to link pixel distances on soil surfaces with SOM. Tziachris et al. [7] predicted topsoil organic matter content using boosted regression tree models on Landsat Thematic Mapper images. Elevation, by influencing vegetation types, indirectly affects the soil’s organic matter absorption [8]. To refine SOM estimates, research has integrated multispectral remote sensing data with other datasets, including topography and climate [9,10].

Traditional methods, like laboratory examination, aid digital soil mapping (DSM), but integration with spectral data can enhance soil mapping precision [11]. However, characterizing vast areas with spectral data remains a challenge due to the complexities of natural soil surfaces [8].

There is a growing emphasis on machine learning (ML) algorithms, like support vector machines and random forest (RF), to analyze spectral data for SOM prediction [12,13,14]. Song et al. [13] highlighted ML’s capability to handle intricate data relationships, delivering enhanced accuracy over linear models. These algorithms, combined with other factors such as terrain, offer improved SOM predictions. Recent advancements in ML present opportunities to establish non-linear relationships between soil parameters and remote sensing data [15]. Tree-based models, especially RF and extreme gradient-boosted decision trees (XGBoost), stand out for their predictive accuracy [15]. While RF reduces overfitting risks, XGBoost offers efficient solutions to regression challenges [16,17].

In the Iraqi Kurdistan region, despite extensive SOM research, there is a noticeable absence of detailed digital SOM maps. While Gravi and Ibrahim [18] employed the ASD Field Spec 3 spectroradiometer for SOM levels in Duhok province, others still rely on time-tested laboratory methods [19,20].

The purpose of this study was to use advanced ML algorithms to predict SOM content in the Batifa region. To accomplish this, the study uses innovative techniques, such as RF and XGBoost models, in combination with remotely sensed data, to create a SOM model. Furthermore, this study aimed to produce a DSM of SOM for the Batifa region in the Iraqi Kurdistan region.

2. Study Area and Datasets

2.1. Study Area

The Batifa area is located in the northern Duhok Governorate in the Iraqi Kurdistan region, between latitudes 37°4′58.5″ and 37°22′8.460″ N and longitudes 42°48′45.212″ and 43°11′51.440″ E. The region covers an area of approximately 528 km², with altitudes ranging from 512 m to 2301 m above sea level (Figure 1). The average temperature in Batifa is approximately 27 °C, whereas the annual average rainfall is approximately 65.7 mm [21]. The landscape of the region includes hilly and mountainous terrain, grasslands, and valleys. The area experiences wet, chilly winters and dry summers. The region is characterized by a wide variety of land uses owing to both environmental factors and human interference [22]. The soils in the study area are primarily gravelly and shallow, and lithosols cover much of the area, particularly along the ridges and summits. The soils were formed from parent materials derived from limestone, marl, mudstone, and sandstone, with limestone being the main bedrock from which the parent material is drawn in this area [23]. In general, these soils are calcareous because they contain high amounts of carbonate [24]. Rendzina soils are typically found on ridge tops, but the soil is not disturbed due to favorable topography [25].

2.2. Dataset and Pre-Processing

This study utilized two Landsat 8 images downloaded from the USGS Earth Explorer website, captured on 31 August and 2 October 2021. The former was used to create a land use and land cover (LULC) map, and the latter was used for SOM modelling because it was closest in time to the soil sample collection. The images were converted to radiance and then to reflectance using the FLAASH module in ENVI 5, and were georeferenced to WGS 84 UTM Zone 38 North. To generate the SOM model, 13 variables were used (Table 1); these were selected because previous studies have confirmed their relevance to SOM prediction [15,26]. A 30 m resolution DEM from the Shuttle Radar Topography Mission (SRTM), projected to WGS 84 UTM, was incorporated. The study also utilized multiple Landsat 8 bands and indices such as the normalized difference vegetation index (NDVI) [16], soil-adjusted vegetation index (SAVI) [27], and brightness index (BI) [15], which are known to correlate with SOM [15]. The spatial resolution of these bands and indices is 30 m.

From 27 to 29 September 2021, 96 soil samples were gathered in Batifa, distributed over 24 plots. Before fieldwork, stratified random sampling locations were generated in ArcGIS. After collection, the samples were air-dried for 24 h and then cleaned to remove extraneous materials, including stones, weeds, and roots. The samples were ground and passed through a 2 mm sieve, as is standard for the determination of soil properties. For organic matter estimation, the soil was then passed through a 0.5 mm sieve according to the procedure of Walkley and Black [28,29]. In many countries, the Tyurin method is used instead of the Walkley–Black method [30]. However, in this study, we employed the Walkley–Black method, which is more commonly utilized in professional settings. The northern part of the region was excluded from sampling due to the ongoing military conflict between the Kurdistan Workers’ Party and Turkey.

Table 1. The 13 variables selected for SOM mapping.

Explanatory Variable	Formula or Bands	Reference or Source
Digital Elevation Model (DEM)		Shuttle Radar Topography Mission (SRTM) from (https://earthexplorer.usgs.gov/, accessed on 20 August 2021)
Landsat 8 (OLI/TIRS)	b1: Coastal aerosol, b2: Blue, b3: Green, b4: Red, b5: NIR, b6: SWIR 1, b7: SWIR 2, b10: TIRS 1, b11: TIRS 2	(https://earthexplorer.usgs.gov/, accessed on 22 October 2021)
Normalized Difference Vegetation Index	$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}$	[31]
Soil-Adjusted Vegetation Index	$\mathrm{SAVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red} + L}\,(1 + L), \quad L = 0.5$	[27]
Brightness Index	$\mathrm{BI} = \frac{1}{2} \times \sqrt{\mathrm{Red}^{2} + \mathrm{Green}^{2}}$	[15]
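The index formulas in Table 1 can be illustrated with a minimal sketch. The study itself was carried out in R and ENVI; the Python below is only an illustration, and the function names and the reflectance values for the single example pixel are hypothetical rather than taken from the study data.

```python
import math


def ndvi(nir, red):
    """Normalized difference vegetation index, as in Table 1."""
    return (nir - red) / (nir + red)


def savi(nir, red, L=0.5):
    """Soil-adjusted vegetation index with soil adjustment factor L (Table 1 uses L = 0.5)."""
    return (nir - red) / (nir + red + L) * (1 + L)


def brightness_index(red, green):
    """Brightness index in the form written in Table 1."""
    return 0.5 * math.sqrt(red ** 2 + green ** 2)


# Hypothetical reflectance values for one pixel (not from the study).
pixel = {"green": 0.10, "red": 0.12, "nir": 0.30}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "SAVI": savi(pixel["nir"], pixel["red"]),
    "BI": brightness_index(pixel["red"], pixel["green"]),
}
print(indices)
```

In practice the same arithmetic would be applied band-wise to whole raster layers rather than to a single pixel.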

3. Methodology

The proposed method comprises three main stages, as illustrated in Figure 2. These stages include data collection, feature extraction, and model training.

3.1. LULC and Accuracy Assessment

The main goal of LULC mapping was to delineate regions suitable for soil sampling, with a specific emphasis on bare soil and cropland. To do so, a series of land surveys was conducted over two weeks, from 28 August to 10 September 2021. A handheld GPS device, set to the same projection system as the study region, was used to precisely delineate LULC features. The field survey data comprised seven LULC types: dense forest, sparse forest, cropland, built-up, soil, water bodies, and rock. The Landsat image acquired on 31 August 2021 was classified using a support vector machine (SVM) classifier. This approach uses a decision boundary to discriminate between visually similar signatures by identifying a hyperplane that effectively separates two distinct categories. The classifier was trained on 70% of the collected field data, resulting in a classified map.

The accuracy of the LULC classification was evaluated by generating a confusion (error) matrix using validation data [32]. The validation dataset comprised 30% of the collected field data. The matrix facilitated a comparison between the predicted classification outcomes and the actual outcomes derived from the validation dataset. The accuracy of the classified image was determined using several metrics: overall accuracy, producer accuracy, user accuracy, and the Kappa coefficient. The overall accuracy of the LULC classification is expressed as a percentage, which is determined by dividing the total number of correctly categorized samples by the total number of examined samples. However, the Kappa coefficient is often favored over overall accuracy because it also accounts for agreement that could occur by chance and thus offers improved discrimination between classes [33].
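The accuracy metrics named above can be sketched from a confusion matrix as follows. This is an illustrative Python sketch, not the authors' workflow; the three-class matrix is hypothetical, and rows are assumed to hold the reference (ground) classes and columns the classified (map) classes.

```python
def accuracy_metrics(matrix):
    """Return overall accuracy, producer and user accuracies, and Cohen's kappa.

    Assumes rows = reference classes and columns = classified classes.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(k)]
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    overall = sum(diag) / n
    producer = [diag[i] / row_tot[i] for i in range(k)]  # correct / reference total
    user = [diag[j] / col_tot[j] for j in range(k)]      # correct / classified total

    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (overall - expected) / (1 - expected)
    return overall, producer, user, kappa


# Hypothetical 3-class confusion matrix (not the study's Table 3).
demo = [
    [18, 2, 0],
    [1, 17, 2],
    [0, 1, 19],
]
overall, producer, user, kappa = accuracy_metrics(demo)
print(overall, producer, user, kappa)
```

Because kappa subtracts the chance agreement implied by the marginals, it is lower than the overall accuracy for the same matrix.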

3.2. RF

We used the RF ML algorithm, introduced by Breiman [17], to estimate SOM. It has been widely applied in precision farming, soil parameter measurement, and organic matter mapping [34,35]. The RF algorithm can predict continuous response variables using classification and regression tree analysis [36]. Decision trees were used to model the data; each tree acted as a separate regression function and was trained using a distinct bootstrap sample of the training data. The final result was obtained by averaging the outputs of multiple trees [34]. The reliability of the decision trees was assessed using out-of-bag (OOB) samples, which were not included in the bootstrap sample, along with other model metrics, such as misclassification error and variable relevance [35].
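The bootstrap and out-of-bag idea described above can be sketched in a few lines. This is a simplified Python illustration rather than the randomForest implementation used in the study; the helper names, the seed, and the per-tree predictions are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample with replacement; return it with its out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob


def forest_predict(tree_predictions):
    """RF regression output: the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)                  # arbitrary seed for repeatability
in_bag, oob = bootstrap_split(77, rng)   # 77 matches the calibration set size
print(len(in_bag), len(oob))

# Hypothetical SOM predictions (%) from three trees for one location.
print(forest_predict([1.6, 2.1, 1.9]))
```

Each tree is trained on its own in-bag sample, and the sites left out of that sample serve as an internal test set for that tree.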

We optimized the RF parameters for the response variable (SOM) and the explanatory variables (Landsat bands 1–7, bands 10 and 11, DEM, NDVI, SAVI, and BI) using the OOB error estimate. The optimization process focused on two key parameters: ntree, the total number of trees generated by the algorithm, and mtry, the number of explanatory variables randomly sampled as candidates at each split. We used the percentage increase in the mean squared error (%IncMSE) to determine the importance of each explanatory variable in model predictions. This metric measures the decline in model accuracy when the values of a variable are randomly permuted [36]. By examining whether permuting a variable significantly affects the MSE, we can infer the significance of that variable [37].
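The permutation idea behind this importance measure can be sketched as follows. This is a simplified Python illustration: it permutes one column of a small dataset and reports the percentage change in MSE, whereas the randomForest package computes its measure per tree on OOB data. The toy model, data, and function names are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, column, rng):
    """Percentage increase in MSE after shuffling one predictor column."""
    baseline = mse(y, [predict(row) for row in X])
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    permuted = mse(y, [predict(row) for row in X_perm])
    return 100.0 * (permuted - baseline) / baseline


def toy_model(row):
    """Hypothetical fitted model that uses only the first predictor."""
    return 2.0 * row[0]


# Hypothetical data: column 0 drives the response, column 1 is irrelevant.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

rng = random.Random(0)
imp_col0 = permutation_importance(toy_model, X, y, 0, rng)
imp_col1 = permutation_importance(toy_model, X, y, 1, rng)
print(imp_col0, imp_col1)
```

A predictor the model never uses, such as column 1 here, leaves the error unchanged when shuffled, so its importance is zero.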

3.3. XGBoost

Chen [38] developed the XGBoost algorithm to enhance the performance of gradient boosting machines, particularly regression trees [39]. This approach involves boosting, which creates a “strong” learner from a group of “weak” learners through additional training. The XGBoost method increases computation speed and reduces the risk of overfitting. In addition, it simplifies the objective function by combining the loss and regularization terms, which accelerates the computation. Furthermore, XGBoost automatically parallelizes computations during the training phase [39]. In this study, we used R (version 4.2) to develop the XGBoost model. The optimal model parameters were determined by tuning key hyperparameters, including the number of decision trees, the maximum number of nodes or leaves in a single decision tree, the maximum depth of the tree, and the learning rate, as outlined in [15]. The R packages used in this research were xgboost [38], randomForest [40], caret [41], and raster [42].
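The boosting principle, in which weak learners are added one at a time to correct the remaining error, can be sketched as follows. This is plain gradient boosting with one-split regression stumps under squared-error loss, written in Python for illustration. It is not XGBoost, which adds regularization, second-order gradient information, and parallelism. The data and function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit the best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # dropping the maximum keeps both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Build an additive model by repeatedly fitting stumps to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Hypothetical one-predictor data (not from the study).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 1.1, 2.0, 2.2, 2.1, 3.0, 3.1]

model = boost(x, y)
mean_y = sum(y) / len(y)
mse_before = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_after = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_before, mse_after)
```

The number of rounds and the learning rate in this sketch correspond to two of the hyperparameters tuned in the study, the number of trees and the learning rate.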

3.4. Validation of ML Models

The soil dataset comprised 96 sites, which were divided into two groups: a calibration dataset containing 77 sites (80%) and a validation dataset containing 19 sites (20%). To evaluate the efficacy of each model on the validation dataset, we used three metrics: mean absolute error (MAE) (Equation (1)), root mean square error (RMSE) (Equation (2)), and determination coefficient (R²) (Equation (3)). MAE measures the average magnitude of the prediction errors, whereas RMSE squares the errors before averaging and therefore weights large errors more heavily. As a result, the MAE is less affected by outliers than the RMSE. The R² value, which typically falls within the range 0–1, signifies the extent of agreement between the observed and predicted values along the fitted regression line.

$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |o_{i} - p_{i}|$ (1)

$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (o_{i} - p_{i})^{2}}{n}}$ (2)

$R^{2} = 1 - \frac{\sum_{i=1}^{n} (p_{i} - o_{i})^{2}}{\sum_{i=1}^{n} (o_{i} - \bar{o})^{2}}$ (3)

where $p_{i}$ and $o_{i}$ are the predicted and true values of the validation phase of the sample points, respectively; and $\bar{o}$ is the true mean value.
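Equations (1)–(3) and the 80/20 split can be sketched as follows. This is an illustrative Python sketch, not the authors' R code; the function names, the seed, and the four observed and predicted values are hypothetical.

```python
import math
import random


def mae(obs, pred):
    """Mean absolute error, Equation (1)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error, Equation (2)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Determination coefficient, Equation (3)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((p - o) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def split_indices(n, train_fraction, rng):
    """Shuffle site indices and split them into calibration and validation sets."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * train_fraction)
    return idx[:n_train], idx[n_train:]


train_idx, valid_idx = split_indices(96, 0.8, random.Random(1))  # arbitrary seed
print(len(train_idx), len(valid_idx))

# Hypothetical observed and predicted SOM values (%).
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(mae(obs, pred), rmse(obs, pred), r2(obs, pred))
```

With 96 sites and a 0.8 fraction, the split yields 77 calibration and 19 validation sites, matching the counts stated in the text.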

4. Results and Discussions

4.1. Descriptive Statistics

Descriptive statistics for both SOM and remote sensing data are presented in Table 2. The SOM content ranged from 0.1% to 6.18%, with a mean value of 1.81%. The standard deviation (1.22%) was lower than the mean but large relative to it, indicating a wide spread of SOM values across the sampled sites. The variability of SOM was classified using the Wilding [43] approach and was found to be high, with a coefficient of variation (CV) of 67.45%. This high variability may be attributed to the varying topographies and climatic conditions of the area. NDVI had the next highest CV (31.57%), likely due to the increase in vegetative cover and plant residue that remained after the harvesting process in August and September. BI values ranged from 0.06 to 0.19, with a CV of 25.11%. Wang et al. [44] confirmed that BI was strongly correlated with farmland SOM in autumn. The DEM had the lowest CV (11.35%), mainly because of the significant amount of human interference in the topography of Batifa, particularly agricultural engineering activities.
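As a quick check on the reported variability, the coefficient of variation can be recomputed from the rounded mean and standard deviation given above. The symbols $s$ (standard deviation) and $\bar{x}$ (mean) are introduced here for illustration only.

```latex
\mathrm{CV} = \frac{s}{\bar{x}} \times 100\% = \frac{1.22}{1.81} \times 100\% \approx 67.4\%
```

This agrees with the reported 67.45% up to rounding of the inputs. Under the thresholds commonly attributed to Wilding (low below 15%, moderate from 15% to 35%, high above 35%), SOM would fall in the high class.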

4.2. LULC and Assembled Bare Soil and Cropland Maps

Figure 3a shows the LULC classification map for the Batifa region. Figure 3b displays a map of the bare soil and cropland types derived from the LULC map. The availability of such a detailed map is essential for efficient land management in the region. Table 3 provides the image classification accuracy metrics and the confusion matrix. The rows show the real LULC patterns, whereas the columns represent the LULC patterns used to classify the images. The resulting user accuracy was significant for all classes, indicating a high likelihood that a pixel identified as a specific class on the map belongs to that class. Similarly, the resultant producer accuracy was impressive, indicating the likelihood of a particular feature on the ground being correctly classified. The overall accuracy of the classification was 93%, and the Kappa coefficient was 0.91, which was considered successful when compared to previous research. According to previous studies [32,45], for LULC classification, it is typically deemed acceptable to have an overall accuracy rate of more than 85 percent and a Kappa coefficient greater than 0.80.

Sparse forest covers the largest area (approximately 239.42 km²), followed by bare soil (202.4 km²). The south has more bare soil, whereas the north has more sparse forest. Dense forest covers 37.64 km², mainly in the north, and rock covers 17.21 km², mostly in the central and southern regions. Cropland covers 14.29 km² and is dominant in the south. Water bodies and built-up areas cover the smallest areas, 6.91 km² and 10.08 km², respectively (Figure 4). These results can be deemed satisfactory and align with the findings of Mustafa and Ismail [22].

4.3. Model Performance

The evaluation results, as summarized in Table 4, emphasize the statistical measures (MAE, RMSE, and R²) employed for the prediction model’s precision. Meanwhile, Figure 5 illustrates the spatial pattern of the digital SOM generated by the XGBoost and RF models. This demonstrates noteworthy spatial heterogeneity in the SOM, with elevated predicted values predominantly concentrated in the central and southern regions.

According to Table 4, the XGBoost model performed better than the RF model. The XGBoost model had a higher accuracy, with an R² of 0.92, RMSE of 0.62, and MAE of 0.41, compared with an R² of 0.77, RMSE of 0.96, and MAE of 0.65 for the RF model. The enhanced precision of the XGBoost model can be attributed to its implementation of stochastic gradient boosting, thereby improving the methodology, mitigating overfitting concerns, and potentially enhancing predictive accuracy [46]. In summary, the XGBoost algorithm exhibited better performance than the RF algorithm, as evidenced by the higher R² value and lower RMSE and MAE values.

Furthermore, as shown in Figure 6, XGBoost outperformed RF in its ability to predict SOM content. The predictions produced by XGBoost are more precise than those of RF, as they align more closely with the black 1:1 line. A closer examination of Figure 5a,b, for XGBoost and RF, respectively, shows a clear disparity between the two models in how closely the predicted values match the test values. This finding provides additional support for the conclusion that XGBoost outperforms RF in terms of accuracy in this setting.

The XGBoost model exhibited notable efficacy in predicting SOM, corroborating the findings of Xie et al. [15], who similarly observed superior performance of the XGBoost model compared to other machine learning approaches for SOM prediction. Moreover, several studies within the domain of DSM have employed the XGBoost model to accurately predict soil properties and nutrient levels across diverse regions. For instance, Hengl et al. [47] successfully employed this model in sub-Saharan Africa, Ramcharan et al. [48] on soil properties in the United States, Chen et al. [49] on soil pH in China, Hengl et al. [50] at a global scale for soil properties, and Shangguan et al. [51] estimated the depth to bedrock worldwide. However, Tziachris et al. [7] reported a lower level of precision in SOM prediction using the XGBoost model compared with the random forest (RF) model. It is important to note that the accuracy of the XGBoost model may vary significantly across studies, as evidenced by the aforementioned studies. These discrepancies may be due to differences in research regions, geography, sample density, and the number and quality of environmental variables employed, among other factors.

In the expansive Batifa region of northern Iraq, this research sought to map SOM variability, emphasizing the advantages of a machine-learning model coupled with satellite remote sensing. Our model provides a unique landscape perspective, offering insights into regional SOM distribution that may elude individual soil analyses. This broad view can guide policymakers, regional planners, and agricultural associations in informed decision-making. By including the northern zones, even those without direct soil samples, our research presents a comprehensive SOM overview crucial for future land-use strategies. Crucially, the integration of remote sensing allows for ongoing, extensive monitoring, capturing both temporal and spatial SOM changes—an advantage unattainable with traditional soil analyses alone.

Within the discussion on model confidence relative to spatial resolution, it is crucial to highlight the inherent advantages of employing satellite remote sensing. The capability of satellite imagery to correlate spatial attributes of image pixels with the terrestrial granularity of the ground cells offers a substantial edge in large-scale SOM mapping, such as in our study of the Batifa region. Our results with the XGBoost model yielded a high accuracy, suggesting that the chosen resolution of satellite imagery is apt for generating robust estimates in the region. Nevertheless, elucidating the relationship between the confidence level, particularly the 95% confidence interval, and varying spatial resolutions warrants a more exhaustive study. Such an analysis would necessitate the resampling of datasets at diverse resolutions and subsequently evaluating the performance of machine learning models. While this aspect was outside the purview of our current research, it poses a significant avenue for future investigations. For upcoming studies, understanding the intricate interplay between spatial resolution and prediction confidence can illuminate the selection of optimal satellite datasets, especially pertinent in regions characterized by diverse land cover types and soil heterogeneity.

4.4. Variable Importance Analysis

The relative importance of auxiliary data variables in predicting SOM with the XGBoost and RF models is presented in Figure 7. Notably, the ranking of the variables differed between XGBoost and RF. In our study, the prominence of the B10 band from Landsat 8, which represents the thermal infrared-1 spectrum, was particularly evident when using the XGBoost algorithm. Pahlavan-Rad et al. [52] underscored this phenomenon by identifying the thermal band’s substantial correlation with SOM, making it a pivotal variable for accurate predictions. The thermal infrared band can capture variations in surface temperature, which in turn can be influenced by soil moisture and organic content, thus providing valuable information about SOM levels. Additionally, Mirzaee et al. [53] noted that Landsat bands can capture spatial variations in vegetation. Based on the characteristics of each variable and their significance in the model, the order of relative relevance of the variables for XGBoost was B10 > SAVI > B1 > DEM > NDVI > B5.

Our analysis determined that the most important factor for the RF model was DEM, followed by SAVI, B11, NDVI, B10, and B6. According to previous research [54,55], the topographical features represented by the DEM play a significant role in influencing the spatial distribution of SOM. Variations in elevation and slope can affect factors such as soil moisture, temperature, and erosion rates, all of which can in turn influence SOM distribution. For instance, lower elevations or depressions might accumulate more organic material due to water runoff and reduced erosion. Conversely, higher elevations or slopes might have reduced SOM due to increased erosion and lower water retention. Such topographical influences have been well documented in the soil science literature. Our findings on the role of DEM align with previous research conducted by Mallick et al. [8], who also identified DEM as a pivotal predictor for SOM modelling.

This study determined that SAVI, DEM, B10, and NDVI were the most prominent predictors of SOM content in both the RF and XGBoost models. Among these four variables, DEM may not capture soil-specific properties resulting from soil management practices. However, its inclusion as a predictor was based on previous studies [7,56], which found that elevation data can be indirectly associated with SOM through drainage, moisture retention, and other terrain-driven factors. These variables are highly sensitive to diverse soil characteristics, which makes them pivotal predictors for accurate SOM estimation.

The efficacy of both SAVI and NDVI as tools for the accurate prediction of SOM levels within various contexts is well-established. Several studies, including that conducted by Nabiollahi et al. [57], have demonstrated the utility of SAVI in this regard. Concurrently, the importance of NDVI, a measure of vegetation vigor, is emphasized by findings such as those presented by Taghizadeh-Mehrjardi et al. [58]. NDVI often correlates positively with SOM, as healthier vegetation generally grows in soils with a higher organic content. In contrast, SAVI, which accounts for the soil’s red reflectance particularly in low vegetation areas, might have an inverse relationship with SOM. Xie et al. [15] further illustrated that SAVI can refine predictions of agricultural soil organic carbon content. In the context of our study area, it is essential to understand the relationship between these indices. The interconnectedness of these indices with biotic factors, such as vegetation type and agroecosystems, and abiotic factors, such as climate and relief, makes them significant in SOM prediction. However, it is noteworthy that the variables B2, B3, B4, and BI, derived from satellite images captured during specific periods in the Batifa region, displayed a minimal impact on SOM levels in both models. Specifically, these variables were sourced from images taken in August, a period following summer pruning and weeding activities. This timeframe saw an increase in bare soil areas, which led to heightened ground scattering and reduced canopy scattering [59].

Overall, these findings demonstrated the effectiveness of vegetation indicators in accurately predicting SOM levels in various environments and contexts, including agriculture and forestry.

5. Conclusions

A study was conducted in the Batifa region to spatially map soil organic matter using ML models based on remotely sensed data. These findings indicate that this approach yields significant insights into the soil management and agricultural planning within the region. Two ML models, XGBoost and RF, were evaluated, and it was found that XGBoost performed better at predicting changes in soil organic matter. This study produced the following key findings:

XGBoost exhibited better performance (R² = 0.92) than RF (R² = 0.77).
The RMSE value of the XGBoost model for SOM (RMSE = 0.62) was lower than that of the RF model (RMSE = 0.96), indicating that XGBoost could estimate SOM levels more accurately.
In XGBoost, Band 10 was found to be the most important variable for predicting SOM, whereas in RF, DEM was the most significant variable.
The application of our ML model to the entire study area revealed that the SOM’s spatial distribution indicated higher predicted values predominantly in the central and southern regions, despite the field survey not encompassing the northern region.

The XGBoost method is advantageous because it can be easily implemented and modelled using the R software (v. 4.2). This makes it highly useful for handling large amounts of data, as it can produce more accurate estimates while reducing errors and preventing misleading results.

Author Contributions

Conceptualization, H.S.K., Y.T.M., and M.A.F.; methodology, H.S.K. and Y.T.M.; software, Y.T.M.; validation, H.S.K., Y.T.M., and M.A.F.; formal analysis, H.S.K.; investigation, H.S.K. and Y.T.M.; resources, H.S.K.; data curation, H.S.K.; writing—original draft preparation, H.S.K. and Y.T.M.; writing—reviewing, and editing, Y.T.M. and M.A.F.; visualization, H.S.K., Y.T.M., and M.A.F.; supervision, Y.T.M. and M.A.F.; project administration, Y.T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available owing to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Map of the study area: (a) Iraq; (b) Duhok Governorate; (c) Batifa region with the locations of the sample plots; and (d) enlarged view of one sample plot, indicated by arrows, showing the soil sample locations.

Figure 2. Flowchart showing study methodology.

Figure 3. (a) Map of LULC and (b) map of bare soil and cropland.

Figure 4. Area of Batifa land types.

Figure 5. SOM map created by (a) XGBoost model and (b) RF model.

Figure 6. Model performance in the prediction of SOM based on the testing dataset: (a,c) scatterplots of measured versus predicted SOM for the XGBoost and RF models, respectively; (b,d) measured and predicted SOM curves for the XGBoost and RF models, respectively.

Figure 7. Variable importance of two models: (a) XGBoost model and (b) RF model.

Table 2. Descriptive statistics of SOM and auxiliary data.

Variables	Min	Max	Median	Mean	Standard Deviation	CV%
SOM%	0.10	6.18	1.69	1.81	1.22	67.45
DEM	633.00	1028.00	772.00	752.91	85.46	11.35
NDVI	0.12	0.46	0.17	0.19	0.06	31.58
SAVI	0.09	0.30	0.12	0.13	0.03	27.34
BI	0.06	0.19	0.12	0.12	0.031	24.41
B1	0.05	0.12	0.09	0.08	0.02	19.55
B2	0.05	0.14	0.09	0.09	0.02	23.71
B3	0.08	0.22	0.14	0.14	0.03	25.17
B4	0.10	0.31	0.20	0.20	0.05	25.12
B5	0.18	0.41	0.29	0.30	0.05	19.60
B6	0.19	0.41	0.34	0.31	0.05	17.87
B7	0.13	0.33	0.24	0.24	0.04	18.44
B10	34.21	43.01	39.64	39.52	1.99	5.04
B11	33.19	41.84	38.71	38.63	1.91	4.94
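
The CV column in Table 2 can be checked with a short calculation. The sketch below assumes the conventional definition of the coefficient of variation (standard deviation divided by the mean, expressed as a percentage); the function name and the dictionary are illustrative and are not taken from the study. Because the inputs are the rounded values printed in the table, the recomputed figures can differ slightly from the printed ones (for example, about 67.40 rather than 67.45 for SOM).

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Return the coefficient of variation in percent (std / mean * 100)."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return std / mean * 100.0


# Rounded (mean, standard deviation) pairs as printed in Table 2.
table2 = {
    "SOM%": (1.81, 1.22),
    "DEM": (752.91, 85.46),
    "NDVI": (0.19, 0.06),
}

# Recompute CV% for each variable, rounded to two decimals like the table.
cv_by_variable = {
    name: round(coefficient_of_variation(mean, std), 2)
    for name, (mean, std) in table2.items()
}

for name, cv in cv_by_variable.items():
    print(f"{name}: CV = {cv:.2f}%")
```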

Table 3. Confusion matrix (accuracies) of the classified LULC of the image.

LULC	Dense Forest	Sparse Forest	Rock	Soil	Water	Cropland	Built-Up	User’s Accuracy (%)
Dense Forest	65	1	0	0	0	0	0	98.48
Sparse Forest	1	31	0	4	0	0	0	86.11
Rock	0	0	32	1	0	0	0	96.96
Soil	0	1	4	60	0	2	0	89.55
Water	1	0	0	0	14	0	0	93.33
Cropland	0	0	0	1	0	19	0	95
Built-up	0	0	1	0	1	0	18	90
Producer’s Accuracy	97.01	93.93	86.48	90.90	93.33	95	100
Overall Accuracy								93%
Kappa Coefficient								0.91
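
The summary figures in Table 3 follow from the cell counts. The sketch below is an illustration, not the authors' code: it assumes that rows are the classified (map) classes and columns are the reference classes, which is the orientation under which the printed user's accuracy equals the diagonal count divided by the row total. The names `MATRIX` and `accuracy_summary` are hypothetical. Overall accuracy and Cohen's kappa recomputed this way round to the printed 93% and 0.91.

```python
# Cell counts from Table 3.
# Assumption: rows = classified classes, columns = reference classes.
CLASSES = ["Dense Forest", "Sparse Forest", "Rock", "Soil",
           "Water", "Cropland", "Built-up"]
MATRIX = [
    [65, 1, 0, 0, 0, 0, 0],
    [1, 31, 0, 4, 0, 0, 0],
    [0, 0, 32, 1, 0, 0, 0],
    [0, 1, 4, 60, 0, 2, 0],
    [1, 0, 0, 0, 14, 0, 0],
    [0, 0, 0, 1, 0, 19, 0],
    [0, 0, 1, 0, 1, 0, 18],
]


def accuracy_summary(matrix):
    """Return overall accuracy, Cohen's kappa, user's and producer's accuracies."""
    n = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(len(matrix))]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]

    overall = sum(diag) / n
    # Chance agreement expected from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (overall - expected) / (1 - expected)

    users = [d / r for d, r in zip(diag, row_totals)]      # diagonal / row total
    producers = [d / c for d, c in zip(diag, col_totals)]  # diagonal / column total
    return overall, kappa, users, producers


overall, kappa, users, producers = accuracy_summary(MATRIX)
print(f"Overall accuracy: {overall * 100:.2f}%  Kappa: {kappa:.2f}")
for name, ua, pa in zip(CLASSES, users, producers):
    print(f"{name}: user's = {ua * 100:.2f}%, producer's = {pa * 100:.2f}%")
```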

Table 4. Comparison of the accuracy of two ML models.

ML Algorithms	MAE	RMSE	R²
XGBoost	0.41	0.62	0.92
RF	0.65	0.96	0.79
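
For reference, the three metrics in Table 4 can be computed as in the minimal sketch below. It assumes the common definitions: MAE as the mean absolute difference, RMSE as the square root of the mean squared difference, and R² as one minus the ratio of the residual to the total sum of squares. The study may instead have used the squared correlation coefficient for R², so that choice is an assumption. The `measured` and `predicted` values are hypothetical and are not data from the study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical SOM values (%) for a held-out test set, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(f"MAE = {mae(measured, predicted):.2f}")
print(f"RMSE = {rmse(measured, predicted):.2f}")
print(f"R2 = {r2(measured, predicted):.2f}")
```

Lower MAE and RMSE and a higher R² indicate closer agreement between measured and predicted SOM, which is the basis on which the two models are compared in Table 4.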

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
