1. Introduction
Forests are vital ecosystems, covering approximately 31% of the Earth’s surface, and they play a crucial role in maintaining ecological balance [1]. However, they are becoming increasingly threatened by deforestation and degradation, leading to a loss of biodiversity, climate change, and disrupted livelihoods. Deforestation involves the complete removal of trees for non-forest use, such as agriculture or urban development, whereas forest degradation refers to the decline in forest quality due to factors such as logging, pollution, or climate change, even though the forest remains standing [1].
The drivers of deforestation and forest degradation are multifaceted and operate at large scales. They fall into two broad categories: human activity and natural environmental factors. Human activities include agriculture, logging, urban expansion, road construction, and the use of firewood for livelihoods, which are occasionally worsened by conflict and poor management. Natural environmental factors encompass climate change (particularly changes in temperature and rainfall), topography, and diseases, together with their impact on domestic animals.
Understanding the causes and patterns of forest degradation requires both environmental and socio-political perspectives. Traditional remote sensing techniques have been widely applied to detect land cover change and forest dynamics across different regions [
2,
3]. However, these approaches often lack the predictive capacity and interpretability needed for complex, data-scarce areas. The advancement of machine learning (ML) has enhanced our ability to process and analyze large-scale spatial data with improved accuracy [
4,
5,
6]. ML methods also offer flexibility in handling nonlinear relationships and mixed variable types.
Recently, studies have shown that integrating ML with explainable models such as SHAP and GAM improves insights into the spatial and temporal behavior of forest change [
7,
8]. These models help to quantify the relative influence of ecological and anthropogenic factors, such as rainfall, temperature, fire, and human expansion. In regions like northern Iraq, where conflict and terrain complexity intersect, such approaches are particularly valuable [
9,
10]. Therefore, expanding the theoretical framework of forest degradation using data-driven, interpretable techniques is essential for informed policy and conservation planning.
ML techniques are increasingly being applied in forestry and related fields. For instance, researchers have used Support Vector Machines (SVMs), Classification and Regression Trees (CRTs), and Random Forest (RF) to predict land degradation in Iran and found RF to be the most effective predictive model [4]. Others have employed ML to identify the drivers of forest degradation and deforestation in the Congo Basin of Cameroon, categorizing the data into broad, fine-scale, and disaggregated time series [5]. Satellite images of Abbottabad, Pakistan, have been classified using an Artificial Neural Network (ANN) and RF to analyze forest cover and support conservation efforts [6]. ML algorithms have also been integrated with the SHapley Additive exPlanations (SHAP) method to assess urban flood vulnerability, emphasizing the importance of identifying optimal feature sets [7]. Furthermore, climate factors, vegetation indices, and Landsat imagery have been used to monitor forest degradation in India through ML, with the Google Earth Engine (GEE) platform used to capture and organize the data [8].
Forest management information systems are essential tools for planning, implementing, and monitoring activities. Regions such as the Duhok Governorate have experienced a decline in overall vegetation cover, especially over the past ten years [9]. The economic limitations and conflicts in Duhok have increased the complexity of forest conservation and management. Increasing pressure on the environment highlights the need for greater forest maintenance and sustainable management, particularly in the Iraqi Kurdistan Region (IKR), where these pressures are superimposed on domestic challenges. The border between Iraq and Turkey, the study area for this research, faces high levels of forest degradation and deforestation due to political, economic, and environmental issues, including unregulated logging, land use changes, and conflicts.
The Iraq–Turkey border area is characterized by factors leading to forest degradation and deforestation, including increased conflicts and wars resulting in burned areas, tree cutting, and road construction, as well as climatic and topographic influences. The study area is exposed to climatic challenges and human conflicts, making field data collection risky. Traditional forest management methods are inadequate, and data scarcity hinders effective monitoring. Although helpful, remote sensing (RS) techniques have not been fully integrated with modern ML techniques for real-time data assessments. The combination of ML and RS is still in its early stages, especially in areas such as the Duhok Governorate. In light of these and other diverse challenges and drivers, it is important to focus on a specific region that reflects these complexities. Accordingly, this study explores the sensitive geopolitical area of the Iraq–Turkey border within the Duhok Governorate. Although socio-economic variables such as population density or economic activities could strongly influence deforestation, such data were excluded due to limited availability and reliability within this geopolitically sensitive region.
This study aims to monitor and analyze forest degradation and deforestation along the Iraq–Turkey border between 2015 and 2024, a region characterized by ecological sensitivity and political instability. The research integrates explainable ML techniques with high-resolution Sentinel-2 RS data to model and interpret the drivers of forest change. Specifically, this study seeks to quantify the extent of forest loss, compare the predictive performance of different ML models, and utilize SHAP and GAM to improve interpretability. It also aims to evaluate the relative influence of climatic, topographic, and anthropogenic factors over time. The research is guided by two key questions: (1) What are the prevailing drivers of deforestation and forest degradation in the study area? (2) How have these drivers evolved temporally in relation to ecological and human-induced pressures? By addressing these objectives, this study provides a data-driven, reproducible framework for understanding landscape dynamics in geopolitically complex and data-limited regions.
2. Study Area
The study area is located in the north of the Duhok Governorate, IKR, on its border with Turkey. Geographically, it lies between latitudes 37°0′0′′ N and 37°20′0′′ N and longitudes 43°0′0′′ E and 44°0′0′′ E, covering approximately 1645 km². The study area features varying elevations, ranging from 583 m to 2597 m above sea level, with the steepest slope reaching 85% (Figure 1). It covers the northern parts of two districts of the Duhok Governorate, Zakho and Amedi, both of which border Turkey. The study area has the highest vegetation cover in the Duhok Governorate [11]. Additionally, the Amedi district has the highest greenspace coverage in the IKR, accounting for 59.8% [12]. The climate is cold and semi-arid, with a Mediterranean rainfall pattern. The region experiences a long winter, an extended spring, a semi-moderate summer, and a short autumn. Its complex topography and diverse ecosystems create unique terrains and rich vegetation, including mountain forests, riverine forests, and steppe pastures [13]. These natural features make the area highly valuable for agriculture and tourism, so the conservation of its forests is of utmost importance; studying it may also facilitate the work of other researchers in the region.
However, in recent decades, deforestation and land degradation have threatened dense forests. Human activities, including logging for fuel, infrastructure development, and political conflicts, have led to severe forest loss. This region is particularly affected by forest fires and illegal logging. Additionally, ongoing conflicts between Turkey and the Kurdistan Workers’ Party (PKK) have turned parts of the area into restricted zones, further accelerating deforestation and environmental damage [
10]. To effectively quantify and analyze the described environmental pressures and spatiotemporal dynamics, a methodological framework was adopted. This methodology, which combines RS data, climatic and topographic information, and advanced ML approaches, is detailed in the following section.
3. Materials and Methods
This study used an ML framework to investigate spatiotemporal forest conditions by combining data-driven methods with computational analysis. The entire process, as shown in Figure 2, was organized into three major parts.
The initial phase was data acquisition, whereby relevant features were obtained from various sources and converted into thematic maps with a 10 m spatial resolution. In the second part, a strict process of feature selection with multicollinearity and correlation analysis was used to ensure that only the most relevant variables were utilized in the ML models. The final part involved model development, testing, spatial mapping, and the interpretation of results.
A combination of software packages and platforms was used throughout the study. ArcGIS (v. 10.8) was used for GIS mapping, and ENVI (v. 5.6) was used for image processing and analysis. The R Project for Statistical Computing (v. 4.3) was used for statistical modeling and model building, and GEE was used for satellite data processing and feature extraction.
3.1. Data Sources and Preprocessing
Given the significant role of climatic and topographic variables identified in prior research [
14,
15], their incorporation into our analysis is pivotal for a comprehensive understanding of forest dynamics and deforestation patterns.
The datasets used in this study included the following primary types: satellite images, topographic data, and climate data. These datasets were systematically processed to guarantee spatial resolution uniformity and facilitate accurate environmental assessments. Sentinel-2 images were obtained as the main dataset for land cover change studies, whereas Digital Elevation Model (DEM) data and climate conditions (temperature and precipitation) were useful environmental data. A summary of the datasets, descriptions, derived factors, and sources is presented in
Table 1.
Sentinel-2 data, acquired by the Google Earth Engine (GEE) platform, were chosen for three cloud-free dates in August 2015, 2019, and 2024. The images were pre-processed with regular radiometric, atmospheric, and geometric corrections. Additional preprocessing procedures, such as layer stacking and masking of the study area, were performed to improve analytical precision. Due to the multi-year intervals of available Sentinel-2 imagery, shorter-term deforestation events might have been missed. However, the intervals chosen were representative of the most significant changes documented by secondary reports.
The climate data included the annual average temperature (T) from ERA5 and the yearly precipitation (R) from the CHIRPS dataset. These data were originally available at coarse spatial resolutions of 9600 m and 4800 m, respectively (
Figure 3m–r). The Climate Engine platform from which these data were obtained has been thoroughly tested and used in other research investigations, such as urban heat island research [
16] and environmental monitoring [
17].
A digital elevation model (DEM) of 12.5 m resolution with radiometrically corrected data was downloaded from the Earth Data website of the National Aeronautics and Space Administration (NASA) (
Figure 3s). The data were used to perform a spatial analysis of the study area and to measure its effect on forest dynamics. The DEM scenes were initially mosaicked, followed by extraction of the study area through a masking process. To ensure compatibility with Sentinel-2 images, all the datasets were resampled to a 10 m resolution using the nearest neighbor resampling method in ArcGIS 10.8. This step was crucial for maintaining consistency in the spatial analysis.
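The nearest neighbor resampling step can be illustrated with a small sketch. This is not the ArcGIS implementation used in the study; the function name `resample_nearest` and the toy elevation grid are hypothetical, and the sketch assumes square cells whose extent starts at the same corner in both grids.

```python
def resample_nearest(grid, src_res, dst_res):
    """Resample a 2-D grid (list of rows) from src_res to dst_res metres per
    cell by looking up the source cell that contains each output cell centre."""
    src_rows, src_cols = len(grid), len(grid[0])
    # Number of output cells needed to cover the same extent.
    dst_rows = int(round(src_rows * src_res / dst_res))
    dst_cols = int(round(src_cols * src_res / dst_res))
    out = []
    for r in range(dst_rows):
        # Centre of the output cell in metres, converted to a source row index.
        src_r = min(int((r + 0.5) * dst_res / src_res), src_rows - 1)
        row = []
        for c in range(dst_cols):
            src_c = min(int((c + 0.5) * dst_res / src_res), src_cols - 1)
            row.append(grid[src_r][src_c])
        out.append(row)
    return out


# Hypothetical 4 x 4 elevation grid at 12.5 m, resampled to 10 m (5 x 5 cells).
dem = [[100, 200, 300, 400],
       [110, 210, 310, 410],
       [120, 220, 320, 420],
       [130, 230, 330, 430]]
resampled = resample_nearest(dem, src_res=12.5, dst_res=10.0)
```

Because each output cell copies an existing source value, nearest neighbor resampling introduces no new elevation or climate values; it only changes the grid spacing.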
3.1.1. Land Cover Indices
Numerous studies have employed various indices extracted from satellite images to examine land use and land cover in specific areas. In this study, four indices were used to identify the factors affecting the forest vegetation cover, namely the fractional vegetation cover (FVC), normalized burn ratio (NBR), Bare Soil Index (BSI), and Road Index (RI). All indices were derived from Sentinel-2 images and were calculated using the GEE platform. Each of these indices is calculated using a unique formula and is specified in terms of certain value ranges based on its recommended use (
Table 2,
Figure 3a–l). The values of all indices fall within the range of −1 to +1.
FVC is a valuable indicator for tracking environmental changes, assessing vegetation conditions, and estimating deforestation and land degradation [
18]. It has been identified as an elementary phenotypic trait in agriculture, forestry, and ecological science, providing a measurable indicator of vegetation distribution and coverage pattern [
19]. In this study, FVC was used as a critical parameter in the estimation of the forest vegetation area and as a dependent variable in the analysis. It was obtained based on NDVI in combination with its minimum and maximum values to accurately estimate vegetation.
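As a minimal sketch of how FVC can be derived from NDVI and its minimum and maximum values, the example below uses one common linear scaling. The exact formula used in the study is given in Table 2; the scaling shown here, the clipping to the 0–1 range, and the reflectance and threshold values are assumptions for illustration only.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def fvc(ndvi_value, ndvi_min, ndvi_max):
    """Fractional vegetation cover by linear scaling of NDVI between a
    bare-soil value (ndvi_min) and a full-vegetation value (ndvi_max).
    Clipping to [0, 1] is an assumption of this sketch."""
    scaled = (ndvi_value - ndvi_min) / (ndvi_max - ndvi_min)
    return max(0.0, min(1.0, scaled))


# Hypothetical reflectance values and NDVI thresholds.
pixel_ndvi = ndvi(nir=0.5, red=0.1)
pixel_fvc = fvc(pixel_ndvi, ndvi_min=0.1, ndvi_max=0.9)
```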
NBR is a widely used metric in the study of wildfires to detect areas burned over large fire-affected regions [
20]. It is a standard metric for assessing fire damage; a high NBR value generally indicates healthy vegetation, while a low value indicates barren land and severely burned areas [
21]. The NBR was calculated based on normalizing Shortwave Infrared 2 (SWIR 2) and red bands (
Table 2). The forests in the study area experience annual burning, primarily driven by human-related conflicts. This study applied the NBR index to monitor fire-damaged areas and evaluate their role in forest degradation and the acceleration of deforestation.
The BSI serves as a quantitative index for assessing bare soil extent, forest cover, and soil erosion rates [
22]. It is derived using the square root of the ratio between the difference and sum of the SWIR2 and Green (G) spectral bands, multiplied by 100 [
23] (
Table 2). The BSI was used in this study to map soil exposure and examine temporal land-cover changes during the last decade. It is particularly relevant for this region due to widespread land clearing and increasing bare soil exposure caused by illegal logging, conflict-driven fires, and overgrazing.
RI is a key parameter used to detect road networks. Road feature extraction from satellite images is a challenging task, but it is extremely important for a wide range of applications, such as urban planning, cartography, and traffic management [
24]. RI was computed based on the saturation ratio method from particular Sentinel-2 bands (11, 8, and 2) [
25] (
Table 2). In this study, RI was used to assess the impact of road construction on forested areas in the study area. It is especially important here, as road expansion is a primary driver of forest fragmentation and accessibility-related degradation in the Iraq–Turkey border zone.
Table 2.
Land cover indices derived from Sentinel-2 bands.
Index | Formula | Reference |
---|---|---|
Normalized Difference Vegetation Index (NDVI) | | [26] |
Fractional Vegetation Cover (FVC) | | [27] |
Normalized Burn Ratio (NBR) | | [28] |
Bare Soil Index (BSI) | | [23] |
Road Index (RI) | | [24] |
3.1.2. Topographic Variables
Topography is a key factor influencing the spatial distribution of forest cover across the study region, which is characterized by rugged mountain terrain and has the highest elevation in the IKR. Topographic variables have previously been shown to be closely linked with forest degradation, and their controlling influence on the distribution of vegetation and ecosystem processes has been explored [14,29]. In this study, four topographic factors, namely altitude (AL), slope (S), aspect (AS), and hillshade (H), were investigated to determine their roles in forest dynamics (Figure 3s–v). The variables were extracted using ArcGIS 10.8, and all topographic data were resampled to a spatial resolution of 10 m for consistency with the other datasets.
Altitude (AL) serves as a primary determinant of forest distribution by influencing temperature, precipitation, and ecological zones. Higher altitudes are normally associated with colder climates and increased moisture availability, supporting vegetation growth [30]. In contrast, lower altitudes are more susceptible to human activities such as deforestation and agricultural land use.
Slope (S) is another critical factor for assessing forest stability and regeneration. Forests on steep slopes are difficult to preserve because they are prone to erosion, landslides, and poor soil retention [14,31]. Steep terrain can also hinder effective fire management, thereby exposing forests to loss.
Aspect (AS), or slope orientation, also affects vegetation distribution by controlling sunlight exposure and microclimatic conditions. North-facing slopes tend to retain more moisture and support denser vegetation because of reduced solar exposure, whereas south-facing slopes receive more direct sunlight, leading to higher evaporation rates and drier conditions [32]. The resulting variations in vegetation density and soil properties can influence the resilience of forests to environmental stressors and degradation.
Additionally, the degree of shading on the ground, known as hillshade (H), is a key factor for forest health. Areas with lower shade exposure receive increased solar radiation, resulting in elevated soil temperatures and drier conditions, which can cause physiological stress on vegetation and accelerate forest decline [15]. In contrast, shaded slopes can retain more moisture and therefore support greater forest cover, whereas exposed slopes are subject to higher evaporation and greater drought stress [33]. By examining these topographic variables, this study sought to determine their combined effect on forest cover and environmental resilience in the study area.
3.2. Machine Learning Models
In this study, ML techniques were employed to identify the drivers of forest cover dynamics and to quantify their influence. Seven ML models, namely XGBoost, Linear Regression (LR), Random Forest (RF), Support Vector Regression (SVR), an Artificial Neural Network (ANN), a Generalized Additive Model (GAM), and K-Nearest Neighbors (KNN), were assessed to model the relationships between the dependent and independent variables. All the algorithms were run in R. The dependent variable was FVC, while the independent variables included spectral indicators (NBR, BSI, and RI), climatic predictors (R and T), and topographic predictors (AL, S, H, and AS).
XGBoost is an ensemble learning algorithm based on the gradient-boosting framework [
7]. It enhances classical gradient boosting by using second-order gradient statistics to optimize loss functions, achieving faster convergence and higher accuracy than models such as Gradient Boosting Decision Trees (GBDT) [
34,
35]. XGBoost builds sequential decision trees, each of which corrects errors made in previous iterations and uses regularization methods to avoid overfitting [
36,
37].
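The role of second-order gradient statistics can be made concrete with the regularized objective commonly given for XGBoost at boosting round t. The notation below is not taken from this study and is introduced only for illustration: g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction for sample i, f_t is the tree added at round t, T is its number of leaves, w is the vector of leaf weights, and γ and λ are regularization parameters.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

The quadratic term in h_i is what distinguishes this approximation from first-order gradient boosting, and the penalty Ω(f_t) is the regularization that discourages overly complex trees.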
Linear Regression (LR) defines linear relationships between one dependent variable and a set of multiple independent variables [
38]. LR requires minimal multicollinearity among predictors and works best when the relationships are approximated to be linear. However, LR performs poorly when modeling nonlinear interactions or working with high-dimensional data.
Random Forest (RF), a bagging ensemble algorithm, builds a collection of decision trees using bootstrap aggregating and random feature selection [39,40]. Each tree is trained on a bootstrap sample containing roughly two-thirds of the data, and the remaining out-of-bag (OOB) samples are used to validate model performance [41]. RF handles nonlinear data well and is relatively resistant to overfitting, and it is therefore applicable to both regression and classification problems.
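The bootstrap and out-of-bag idea can be sketched in a few lines. This is an illustration only, not the R implementation used in the study; the function name `bootstrap_split` and the sample size are hypothetical.

```python
import random


def bootstrap_split(n_samples, seed=0):
    """Draw a bootstrap sample (with replacement) of size n_samples and
    return the drawn indices together with the out-of-bag indices."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    # Samples never drawn are "out-of-bag" and can be used for validation.
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(1000)
oob_fraction = len(out_of_bag) / 1000
```

For large samples the out-of-bag fraction approaches 1/e (about 0.37), which is why each tree sees roughly two-thirds of the distinct observations.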
Support Vector Regression (SVR) is the generalization of Support Vector Machines (SVMs) to regression problems, projecting nonlinear data onto higher-dimensional spaces using kernel functions [
42,
43]. SVR regulates model complexity and errors in prediction through tunable parameters (e.g., ε-insensitive loss function and penalty factor C), offering flexibility in handling various datasets [
44,
45].
Artificial Neural Networks (ANNs) are computing systems inspired by biology composed of groups of interconnected nodes (neurons) in layers [
46]. ANNs learn complex nonlinear patterns by repeatedly adjusting weights and are therefore used for regression, classification, and pattern recognition tasks [
36].
Generalized Additive Models (GAMs) are a broadening of generalized linear models (GLMs) that allow smooth, nonlinear functions to capture predictor–response relationships [
14]. GAMs were applied in this case to illustrate and clarify the correlation effects of the independent variables on FVC.
K-Nearest Neighbors (KNN) is an instance-based, nonparametric algorithm that makes predictions based on similarity measures (e.g., Euclidean distance) between input attributes and training instances [47,48].
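A minimal sketch of KNN regression with Euclidean distance is shown below. The function name `knn_predict`, the choice of k, and the toy predictor and FVC values are hypothetical and do not reproduce the configuration used in the study.

```python
from math import sqrt


def knn_predict(train_x, train_y, query, k=3):
    """Predict a continuous target as the mean target of the k training
    instances closest to the query point (Euclidean distance)."""
    distances = []
    for features, target in zip(train_x, train_y):
        d = sqrt(sum((a - b) ** 2 for a, b in zip(features, query)))
        distances.append((d, target))
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical two-predictor training set with FVC-like targets.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [0.1, 0.2, 0.3, 0.9]
prediction = knn_predict(train_x, train_y, query=[0.1, 0.1], k=3)
```

In practice the predictors would be scaled to comparable ranges first, since distance-based methods are sensitive to the units of each variable.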
3.3. SHAP-Based Machine Learning Interpretability
ML models are often perceived as “black boxes” because their complexity makes it difficult to interpret how individual features influence predictions. This lack of interpretability makes it difficult to determine whether a feature affects the model output positively or negatively. SHapley Additive exPlanations (SHAP) is widely regarded as one of the most effective methods for enhancing the interpretability of ML and provides an explicit methodology for estimating feature contributions. It treats each feature as a player in a cooperative game in which the model’s output represents the final score of the game.
SHAP values quantify the contribution of each feature to a prediction relative to the other features and thus help rank feature importance in the model [49]. These values were calculated using built-in functions to provide a consistent and objective evaluation across datasets. Equation (1) outlines the SHAP value calculation [50]:
$$f(X') = J_0 + \sum_{i=1}^{K} J_i X'_i \quad (1)$$
where f represents the interpretation model, X′ denotes the basic features, K is the maximum feature subset size, and J_i represents the feature attribution of feature i, with J_0 denoting the baseline output.
The strength of SHAP lies in its theoretically grounded, consistent measurement of feature importance, which makes it suitable for a wide range of ML tasks [49]. This approach has been extensively employed in areas such as pedestrian safety evaluations [51] and predictive analytics [52]. SHAP achieves this by comparing model predictions with and without a specific feature, allowing it to assess the strength of that feature’s influence. SHAP analysis was performed in this study using the “SHAP” package in R, providing deeper insights into the role of each variable in the ML models.
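The idea of comparing predictions with and without a feature can be made concrete by computing exact Shapley values for a very small model. The sketch below enumerates all feature subsets, which is only feasible for a handful of features; practical SHAP implementations rely on approximations or model-specific algorithms. The function `toy_model`, its coefficients, and the baseline are hypothetical and unrelated to the fitted models in this study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance. Features outside a subset are
    replaced by their baseline value (an assumption of this sketch)."""
    n = len(instance)
    features = list(range(n))

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in features]
        return predict(x)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to subset s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical response with one interaction term."""
    rainfall, temperature, road = x
    return 0.5 + 0.3 * rainfall - 0.2 * temperature - 0.1 * road * temperature


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction effect is shared equally between the two interacting features.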
3.4. Sampling Strategy and Dataset Preparation
Random sampling was conducted to ensure proper assessment of forest cover within the study area. Across the entire study area of 1645 km², 15,650 sample points were randomly allocated, representing approximately 10.5% of the study area (Figure 4). The sampling design maintained a minimum interval of 1 km between consecutive points while maximizing spatial coverage across a range of topographic features. The sampling was carried out in ArcGIS 10.8, and the sample points were then separated into training and validation sets for ML model development.
To create a comprehensive dataset, thematic maps of all the factors of interest (FVC, NBR, BSI, RI, R, T, AL, S, H, and AS) were stacked and merged using ENVI software (v. 5.6). The values corresponding to 15,650 sampling points were then extracted using ArcGIS, ensuring that each point contained information from all layers. These extracted values formed the basis for model calibration and validation. Field validation was not conducted due to ongoing security risks in the study region. Instead, robust random sampling and validation sets were employed to ensure the reliability and representativeness of the results.
The random sampling method was applied to balance the datasets and improve the ML models’ performance. The dataset was divided into two subsets: 80% for training, enabling the models to learn subtle patterns for estimating forest vegetation cover, and 20% as a hold-out set to check model accuracy. This approach enhanced the models’ ability to generalize to new data and refined their predictive performance.
Figure 4 demonstrates the geographic location of sample points, where training samples (black triangles) were used for model construction and validation samples (yellow circles) were withheld from the training process to allow for objective assessment.
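The 80/20 partition of the 15,650 sample points can be sketched as a seeded random shuffle of point indices. The function name `train_validation_split` and the seed are hypothetical; the study performed the split with its own tooling.

```python
import random


def train_validation_split(n_points, train_fraction=0.8, seed=42):
    """Randomly partition point indices into training and validation sets."""
    indices = list(range(n_points))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(indices)
    n_train = int(round(n_points * train_fraction))
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = train_validation_split(15650)
```

With 15,650 points, this yields 12,520 training points and 3,130 validation points, and no point appears in both sets.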
3.5. Feature Selection
To improve model accuracy, a rigorous feature selection process was applied to eliminate redundant and irrelevant attributes. The Variance Inflation Factor (VIF) was employed as the primary statistical metric to identify multicollinearity between independent variables [
53]. By measuring how much the variance of a regression coefficient is inflated because of the intercorrelation between predictors, VIF identifies troublesome dependencies.
VIF analysis was performed in R (v. 4.3) using the caret and car packages, which were also used to compute the Pearson correlations. The relationships between variables were visualized using heatmaps to identify and remove less important features. Highly correlated independent variables can produce spurious parameter estimates with unstable standard errors. By removing highly correlated predictors, the VIF analysis enhanced model interpretability and the stability of the predictions.
The VIF for the independent variable $X_i$ was calculated as follows:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \quad (2)$$
where $R_i^2$ represents the coefficient of determination when $X_i$ is regressed against the other predictors. The VIF interpretation rules were as follows: VIF = 1, no multicollinearity; VIF = 1–5, moderate but acceptable relationship; VIF > 5, high correlation that must be carefully monitored; and VIF > 10, extreme multicollinearity that requires dropping the variable.
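A minimal sketch of the VIF calculation is given below. It relies on the identity that the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' Pearson correlation matrix, which is equivalent to 1/(1 − R²) from regressing that predictor on the others. The helper names and the toy data are hypothetical; the study used R packages for this step.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[i][i] for i in range(len(columns))]


# Two hypothetical predictors with a Pearson correlation of 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif_values = vif([x1, x2])
```

For these two predictors the VIF is 1/(1 − 0.8²), about 2.78, which falls in the "moderate but acceptable" range described above.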
3.6. Model Performance Evaluation
Five statistical metrics were used to assess the accuracy and reliability of the ML models: the mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R²) (Equations (3)–(7)). All measures were computed for all three years of the study (2015, 2019, and 2024) to ensure a complete analysis.
The SHAP package and the GAM in RStudio (v. 2023.12.1+402) were used to test the interaction between FVC and its influencing factors. The MAE, MSE, RMSE, and MAPE measure prediction accuracy as the gap between the estimated and observed values. These metrics typically range from 0 to 1, with lower values indicating a better fit [54]. In turn, R² and adjusted R² measure how well the independent variables explain the dependent variable, with values closer to 1 indicating a stronger predictive relationship and higher model accuracy [55]. All the metrics, statistical tests, and model fits were computed in R, ensuring a robust and data-driven evaluation process. Having established these evaluation metrics, we applied them to assess multiple ML algorithms for their predictive performance, as detailed in the Results section below.
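The five evaluation metrics follow their standard definitions, which can be sketched as below. The function name `regression_metrics` and the toy values are hypothetical; reporting MAPE as a percentage and skipping observations equal to zero are assumptions of this sketch rather than choices documented in the study.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return MAE, MSE, RMSE, MAPE (percent), and R2 for paired values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    # MAPE is undefined where the observed value is zero; skip those points.
    pct = [abs(e / o) for e, o in zip(errors, observed) if o != 0]
    mape = 100.0 * sum(pct) / len(pct)
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical observed and predicted FVC values.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]
metrics = regression_metrics(observed, predicted)
```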
4. Results
4.1. Predictor Variable Evaluation and Model Selection
A comprehensive dataset comprising 15,650 randomly distributed sample points was successfully constructed using the carefully designed sampling strategy detailed earlier (
Section 3.4). This robust sampling approach secured representative spatial coverage across the diverse climatic and topographic conditions of the study area, enhancing the reliability and generalizability of subsequent correlation analyses and model performance evaluations.
Correlation and multicollinearity between predictor variables were assessed using Pearson’s correlation coefficients and Variance Inflation Factor (VIF) analysis for each study year (2015, 2019, and 2024). The correlation matrix for 2015 (Figure 5a) showed a moderate negative correlation between the Bare Soil Index (BSI) and normalized burn ratio (NBR) (−0.56), indicating that increased bare soil is associated with reduced forest fire activity. Temperature (T) was negatively correlated with altitude (AL) (−0.35) and showed almost no correlation with slope (S) (−0.03). Figure 5b illustrates the VIF values for all variables, indicating minimal multicollinearity (VIF < 2), well below the threshold of 5, for the subsequent regression analysis.
Correlation analysis for 2019 (Figure 6a) revealed a moderate positive relationship between the normalized burn ratio (NBR) and BSI (0.45), suggesting an association between burned areas and heightened exposure of the soil. Temperature (T) was negatively correlated with altitude (AL) (−0.35) and showed a negligible positive correlation with aspect (AS) (0.04). The VIF analysis for 2019 (Figure 6b) again showed low multicollinearity among all variables (VIF < 2).
For 2024, Figure 7a indicates stronger correlations between anthropogenic variables; for instance, a moderate negative correlation is observed between the Road Index (RI) and the BSI (−0.48), along with a moderate positive relationship between RI and NBR (0.45). The VIF analysis (Figure 7b) again demonstrated consistently low multicollinearity, supporting the suitability of these variables for subsequent modeling. Collectively, these analyses support the robustness of the predictor variables for use in regression modeling throughout the study period.
Having confirmed the suitability and minimal multicollinearity of the predictor variables, we next evaluated multiple ML algorithms to identify the most effective predictive model.
Seven ML models were tested to predict FVC, with detailed performance outcomes presented in Table 3 and Figure 8 for all the study years (2015, 2019, and 2024). Of all the tested models, XGBoost consistently showed the highest predictive accuracy across all study years, with the highest R² values (≥0.90) and lowest prediction errors, including RMSE values of 0.035 (2015), 0.034 (2019), and 0.022 (2024).
SVR closely followed as the second-best performing model, presenting marginally higher errors and slightly lower R² values than XGBoost. Linear Regression (LR), on the other hand, showed consistently poor predictive performance, with significantly lower R² values (e.g., 0.698 in 2019) and larger errors across all study years. The ANN recorded the largest prediction errors, indicating its unsuitability for this dataset.
Owing to its strong and improved performance, XGBoost was identified as an ideal model for further analysis and interpretation. The reliable performance of XGBoost highlights the ability of this algorithm to predict sophisticated forest-cover patterns driven by heterogeneous environmental and anthropogenic factors. Therefore, subsequent analyses, including SHAP and GAM, were based on XGBoost predictions to accurately interpret variable impacts. This further validates the real-world applicability of the model for forest conservation and planning endeavors in this geopolitically sensitive region.
4.2. Spatiotemporal Dynamics and Drivers of Forest Change
A comparative analysis of the influencing factors across the three years revealed a significant transition from climate-dominated vegetation dynamics to anthropogenically driven deforestation and degradation. In 2015, climatic variables played a dominant role in influencing forest cover. The SHAP values indicated that rainfall (+0.32) contributed positively, whereas temperature contributed negatively (−0.24) to forest cover (
Table 4). This suggests that ecological conditions primarily governed forest distribution and density in 2015, thereby supporting natural growth cycles and seasonal regeneration.
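To make the meaning of a signed SHAP contribution concrete, the sketch below computes exact Shapley values for a small toy model by enumerating all feature coalitions. This is the definition that SHAP approximates; it is not the tree-based algorithm of the SHAP library, and the toy model, its coefficients, and the instance and baseline values are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exact Shapley attribution of model(instance) - model(baseline).

    Features outside a coalition are held at their baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi

def toy_model(x):
    """Hypothetical FVC response: rises with rainfall, falls with temperature."""
    return (0.3 + 0.5 * x["rainfall"] - 0.4 * x["temperature"]
            - 0.2 * x["rainfall"] * x["temperature"])

instance = {"rainfall": 0.8, "temperature": 0.6}   # scaled inputs (assumption)
baseline = {"rainfall": 0.5, "temperature": 0.5}   # reference point (assumption)

print(shapley_values(toy_model, instance, baseline))
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is why a positive value for rainfall and a negative value for temperature can be read as pushing predicted forest cover up and down, respectively.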
By 2019, however, climatic drivers had lost their dominance as anthropogenic drivers emerged. SHAP values showed that the Normalized Burn Ratio (NBR) (−0.34) and Bare Soil Index (BSI) (−0.18) became increasingly dominant (
Table 4). This points to fire events and exposed soil, which are frequently associated with illegal deforestation and overgrazing. Although rainfall retained a modest positive effect (+0.22), its relative importance had decreased markedly compared with 2015.
This trend persisted in 2024, when the Road Index (RI) was the largest contributor, with a negative effect on forest cover (SHAP value of −0.35), followed by NBR (−0.25) and BSI (−0.18) (
Table 4). These shifts suggest that infrastructure expansion, particularly roads, had become a major driver of forest fragmentation and loss by 2024. Roads provide entry points into previously inaccessible forests for human activities, including illegal logging, settlement expansion, and the construction of secondary roads.
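The shift in dominant drivers can be summarized by ranking the SHAP values quoted above by absolute magnitude for each year. The sketch below uses only the values stated in the text for the drivers it names. The grouping of rainfall and temperature as climatic, with the remaining indices treated as anthropogenic signals, follows the interpretation given in this section and is otherwise an assumption.

```python
# SHAP values as quoted in the text for each year (only the drivers it names).
shap_by_year = {
    2015: {"rainfall": 0.32, "temperature": -0.24},
    2019: {"NBR": -0.34, "BSI": -0.18, "rainfall": 0.22},
    2024: {"RI": -0.35, "NBR": -0.25, "BSI": -0.18},
}

# Grouping assumed from the text: everything else is treated as anthropogenic.
CLIMATIC = {"rainfall", "temperature"}

def rank_drivers(values):
    """Driver names ordered from largest to smallest absolute SHAP value."""
    return sorted(values, key=lambda name: abs(values[name]), reverse=True)

def dominant_group(values):
    """Group ('climatic' or 'anthropogenic') of the top-ranked driver."""
    top = rank_drivers(values)[0]
    return "climatic" if top in CLIMATIC else "anthropogenic"

for year, values in shap_by_year.items():
    print(year, rank_drivers(values), dominant_group(values))
```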
Spatial analysis confirmed this trend, with hotspots of degradation shifting considerably across the study period (Figures 10, 12 and 14). In 2015, the affected regions were largely isolated and attributed to natural stressors such as diminished rainfall or small fire outbreaks. In 2019, hotspots had spread more widely, particularly in the mid-elevation zones and areas surrounding agricultural land. In 2024, degradation patterns were concentrated along new roads and in the vicinity of urbanized areas, confirming the growing role of anthropogenic activities in driving forest disturbance. The observed hotspots of degradation notably coincide with regions known from local reports to experience intensified human activities such as illegal logging, infrastructure development, and military operations.
To examine the nature, direction, and magnitude of the relationships between forest cover and its drivers, SHAP and GAM were applied to the outputs of the XGBoost model; only the five most influential factors are shown in the GAM plots. These interpretability tools were applied to all three temporal snapshots to provide nuanced insights into variable effects and interactions.
In 2015, GAM analysis indicated that rainfall had a distinct positive linear relationship with FVC. Temperature showed a threshold response in which vegetation health improved as temperature increased up to 20 °C, after which it declined (
Figure 9). These findings were corroborated by the SHAP, with rainfall and temperature being the most significant contributing variables (
Figure 10). In contrast, BSI and NBR showed moderately negative effects, suggesting that anthropogenic stress had already begun to emerge.
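A threshold response of the kind described for temperature can be located with a very simple smoother. The sketch below averages FVC within temperature bins and reports the bin with the highest mean. This bin-averaged curve is only a crude stand-in for the smooth term of a GAM, not the GAM itself, and the synthetic data (a response peaking at 20 °C) are constructed purely for demonstration.

```python
def binned_response(x, y, bin_width):
    """Mean of y within consecutive bins of x.

    Returns a list of (bin_left, bin_right, mean_y) tuples sorted by bin.
    """
    sums, counts = {}, {}
    for xi, yi in zip(x, y):
        b = int(xi // bin_width)
        sums[b] = sums.get(b, 0.0) + yi
        counts[b] = counts.get(b, 0) + 1
    return [(b * bin_width, (b + 1) * bin_width, sums[b] / counts[b])
            for b in sorted(sums)]

def peak_bin(curve):
    """Bin with the highest mean response."""
    return max(curve, key=lambda item: item[2])

# Synthetic data (assumption): FVC rises until 20 degrees C and falls afterwards.
temps = list(range(10, 31))
fvc = [0.7 - 0.02 * abs(t - 20) for t in temps]

curve = binned_response(temps, fvc, 2)
left, right, mean_fvc = peak_bin(curve)
print(f"Highest mean FVC ({mean_fvc:.2f}) in the {left}-{right} degree bin")
```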
By 2019, the GAM plots showed a stronger nonlinear influence from BSI and NBR (
Figure 11). FVC declined markedly as bare soil exposure and burn severity increased, and SHAP analysis confirmed these variables as the dominant negative contributors (
Figure 12). Notably, the influence of rainfall had weakened, mirroring the transition from natural to human pressure on the forests. Aspect and hillshade were likewise not significant contributors during this year.
The interpretability plots for 2024 revealed a decisive shift. In the GAM, FVC showed its steepest, approximately linear decline with increasing road presence (
Figure 13), and RI was ranked as the highest negative contributor (−0.36) by SHAP (
Figure 14). The effects of BSI and NBR remained strongly negative and were amplified by ongoing infrastructure expansion. This marks an important phase in which human activity overrode natural controls on forest dynamics. The minimal influence of topographic variables, such as altitude and hillshade, indicates that anthropogenic pressures became spatially indiscriminate, affecting a wide range of terrains.
A comparative visualization in
Figure 10 and
Figure 14 summarizes the critical transition from climate-driven, ecosystem-regulated dynamics in 2015 to increasingly complex, fragmented, and human-altered dynamics in 2024.
4.3. Spatial Predictions, Accuracy Assessment, and Forest Cover Trends
The observed and predicted FVC maps for each of the three years are shown in
Figure 15, along with the respective error maps. The observed FVC maps derived from Sentinel-2 data illustrate the spatial heterogeneity in vegetation density, with greener shades representing denser cover. The FVC maps predicted by the XGBoost model showed very similar patterns, confirming the predictive accuracy of the model.
The error maps, calculated as the difference between the predicted and observed FVC, show patterns of spatial overestimation and underestimation. Positive error values indicate over-prediction, whereas negative values indicate underestimation. The magnitude of the errors decreased markedly over time, particularly in 2024, indicating improved accuracy and spatial generalization. This further underscores the reliability of the ML-based approach for mapping dynamic forest cover when data are scarce. The high accuracy of the spatial predictions provided confidence for further quantitative analysis, allowing us to reliably evaluate trends in forest area and quantify the extent of deforestation.
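The construction of an error map can be sketched as a per-pixel subtraction followed by a few summary statistics. The sign convention below (predicted minus observed) is an assumption chosen so that positive values correspond to over-prediction, as stated in the text. The 2 × 2 grids are placeholders, not values from Figure 15.

```python
def error_map(observed, predicted):
    """Per-pixel error as predicted minus observed (positive = over-prediction)."""
    return [[p - o for o, p in zip(obs_row, pred_row)]
            for obs_row, pred_row in zip(observed, predicted)]

def summarize(errors):
    """Mean bias, mean absolute error, and counts of over- and under-predicted pixels."""
    flat = [e for row in errors for e in row]
    n = len(flat)
    return {
        "bias": sum(flat) / n,
        "mae": sum(abs(e) for e in flat) / n,
        "over": sum(1 for e in flat if e > 0),
        "under": sum(1 for e in flat if e < 0),
    }

# Placeholder FVC grids (assumption, for illustration only).
observed = [[0.50, 0.20], [0.80, 0.40]]
predicted = [[0.55, 0.20], [0.70, 0.45]]

print(summarize(error_map(observed, predicted)))
```

Comparing the mean absolute error between years gives a single number for the decline in error magnitude, while the counts of over- and under-predicted pixels indicate whether the model is systematically biased in one direction.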
Forest area, defined using a calibrated FVC threshold, decreased steadily throughout the period analyzed. In 2015, the forest area was approximately 630 km2. This dropped slightly to 625 km2 in 2019 and then fell sharply to 577 km2 by 2024 (
Figure 16). While the initial decline was subtle, the sharp reduction post-2019 was consistent with the rise in anthropogenic drivers, such as road construction and burn incidents.
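Converting a thresholded FVC map into a forest area amounts to counting the pixels at or above the threshold and multiplying by the pixel area. In the sketch below, the threshold of 0.5 is a hypothetical value (the calibrated threshold is not given in this section), and the default pixel size of 10 m assumes the 10 m Sentinel-2 bands were used.

```python
def forest_area_km2(fvc_grid, threshold, pixel_size_m=10.0):
    """Area of pixels whose FVC is at or above the threshold, in square kilometres."""
    pixel_area_km2 = (pixel_size_m ** 2) / 1_000_000.0
    n_forest = sum(1 for row in fvc_grid for value in row if value >= threshold)
    return n_forest * pixel_area_km2

# Placeholder FVC grid (assumption); three of the four pixels meet the threshold.
grid = [[0.6, 0.4],
        [0.5, 0.9]]

print(forest_area_km2(grid, threshold=0.5))                       # 10 m pixels
print(forest_area_km2(grid, threshold=0.5, pixel_size_m=1000.0))  # 1 km pixels
```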
This trend confirms the spatial and interpretive results derived from the model predictions and driver analysis. A forest cover loss of more than 50 km2 within the last decade corresponds to a decline of roughly 8% in forest cover within the study area (53 km2 of the initial 630 km2). This rate of decline is particularly alarming given the ecological sensitivity and geopolitical significance of the study area. It highlights the urgency of implementing evidence-based conservation policies and spatial planning measures to mitigate forest degradation in this area.
Together, our findings combine rigorous model evaluation and interpretable ML insights with spatial analysis and quantitative forest change assessment, providing a comprehensive understanding of deforestation dynamics in the Iraq–Turkey border region between 2015 and 2024. These results demonstrate the value of combining remotely sensed data with explainable ML for monitoring ecosystems within complex geopolitical landscapes. The empirical findings and identified driver shifts call for an interpretative discussion that places the results within broader ecological and socio-political frameworks, as addressed in the following section.
5. Discussion
This study aimed to investigate the complex dynamics of deforestation and forest degradation along the Iraq–Turkey border within the Duhok Governorate over ten years (2015–2024). This was achieved by applying interpretable ML models in combination with high-resolution RS datasets, incorporating both climatic and topographic variables. These results provide compelling evidence that deforestation in the study area has transitioned from being predominantly climate-driven in 2015—when rainfall was the most influential variable—to being increasingly governed by anthropogenic factors such as road expansion (RI), burn severity (NBR), and bare soil exposure (BSI) by 2024. This temporal shift was robustly captured using SHAP and GAM and is consistent with independent reports of accelerating infrastructure development and conflict-induced degradation in northern Iraq [
56,
57].
Predictive modeling showed that the XGBoost model consistently outperformed the other ML models, with R2 values exceeding 0.90 and low RMSE across all prediction years. This aligns with recent studies showing that XGBoost can capture nonlinear ecological phenomena and outperforms simple regression approaches in environmental modeling [
7,
35,
36]. The use of SHAP values and GAM plots further facilitates intuitive visualization of the direction and magnitude of each variable’s influence, thereby bridging the gap between predictive performance and ecological interpretability [
49,
50].
Alongside the clear gain in the predictive performance of XGBoost, the interpretive insights developed through SHAP and GAM are also noteworthy, clarifying the nuanced drivers of deforestation and forest degradation.
In 2015, climate was the main driver of vegetation dynamics. The strongest relationships with FVC were observed for rainfall (positive) and temperature (negative). This aligns with previous studies of forest health in the Kurdistan Region, in which the main drivers of forest vitality were linked to precipitation amount and average temperature [
30,
58]. This period was one of little human disturbance, when forest cover was generally stable.
However, by 2019, the ecological balance had begun to tip. Fire- and soil-exposure-related anthropogenic disturbances were among the strongest drivers of forest degradation. These results are consistent with those published by [
9,
59], who recorded increased fire incidences and land use pressures throughout northern Iraq. These disturbances increasingly occur through the intensification of illegal logging, land clearing, and localized fire outbreaks, particularly in mid-elevation zones close to encroaching settlements.
The situation had intensified by 2024, when road infrastructure expansion emerged as the leading driver of forest cover loss. The Road Index (RI) was the top negative contributor to FVC, as confirmed by the SHAP values, which further emphasizes the strong link between accessibility and forest exploitation. This trend was corroborated by field observations and reports of widespread road construction by military and civilian actors from 2020 onwards [
56,
57]. The establishment of over 40 military bases and connecting roads in the Amedi and Zakho districts also enabled deep penetration into wooded areas, which hastened forest fragmentation and area clearing.
Conflict is a substantial, if often hidden, driver of deforestation. The study of Eklund and Dinc [
10] and a report in IraqClimateChange [
56] have highlighted the ecological toll exerted on militarized landscapes, especially in the borderlands of contention. Our findings align with Sur et al. [
8], who also observed that machine learning combined with remote sensing effectively identifies vegetation degradation trends. However, unlike their study in India, where climatic factors remained dominant, our results indicate a stronger influence of political and socio-economic disturbances. This discrepancy may be attributed to the militarized and geopolitically sensitive nature of our study area, which accelerates forest loss beyond climatic thresholds. Frequent air strikes, armed conflict, and tourism in some areas cause immediate tree loss, fires, and long-term environmental disturbances. Our findings support these statements, showing a marked increase in the influence of NBR and BSI in 2019 and 2024, consistent with reported fire events and burned vegetation.
Livelihood pressure also contributed to the deforestation trends. Well-established studies, such as those by Hassan and Taha [
60] and Karim [
61], show that local populations depend on forests for fuel, timber, and the expansion of agricultural land. This dependence is consistent with the persistently unfavorable BSI values throughout the study period, which indicate incremental degradation of vegetation through strip clearing of trees and land clearing practices.
Notably, topographic variables (altitude, aspect, and hillshade) had little influence in the later years. This indicates that human-driven degradation has eroded the protection once afforded by topography, enabling exploitation in previously untouched higher mountain ranges. Steep slopes and high elevations are traditionally considered natural buffers to disturbances [
14,
30], yet our findings show that roads and military activities have increasingly penetrated these zones, diminishing their protective function. This shift marks a significant departure from conventional forest degradation models, in which topographic resilience (e.g., slope, elevation) is typically associated with lower deforestation rates. In our study, conflict dynamics and road presence instead appear to overwhelm these natural defenses. This spatial indiscriminateness suggests that military activity and infrastructure expansion are not merely co-occurring with degradation; they are actively enabling forest loss in previously inaccessible highland areas.
Understanding these drivers and how they change is critical, but so is predicting how they manifest spatially, which is essential for implementing effective conservation measures. Our spatial modeling framework addresses this need, yielding accurate predictions.
Despite these trends, the modeling approach has demonstrated considerable promise. The close agreement between the observed and predicted FVC maps, with low residual errors, underscores the usefulness of XGBoost in mapping forest cover dynamics. Most importantly, SHAP and GAM were combined to ensure not only accurate results but also ecological interpretability, an aspect often missing from standard black-box ML applications [
49,
52].
The prediction errors also decreased over time. This further indicates that the modeling framework is well suited to long-term forest monitoring in data-scarce or geopolitically complex regions.
This study affirms that deforestation along the Iraq–Turkey border stems from a combination of climatic variability, conflict, infrastructure development, and subsistence activities. The reduction in forest area, from 630 km2 in 2015 to 577 km2 in 2024, represents a loss of roughly 8% in less than a decade and underscores the need for immediate land use planning and conflict-sensitive conservation. This methodological framework provides a replicable basis for similar assessments in other geopolitically contested and data-scarce regions.
Future research should address several methodological limitations observed in this study. The absence of direct socio-economic variables—such as population density, fuel consumption, and land tenure systems—was due to limited availability and data reliability in the study area. Incorporating such information, where accessible, or using spatial proxies like settlement proximity or nighttime lights could strengthen future analyses. Another limitation is the reliance on multi-year Sentinel-2 imagery, which may overlook short-term or seasonal forest disturbances. Access to higher-temporal-resolution datasets would enable more detailed trend detection. Furthermore, the lack of direct ground-truthing, constrained by regional insecurity, remains a limitation. While a robust sampling design was applied, future studies could benefit from collaborative field validation involving local communities or remote-sensing-based verification protocols. Overall, future efforts should aim to integrate socio-economic and conflict-related data into predictive models and work closely with local stakeholders to produce actionable, context-sensitive conservation strategies. The rapid pace of forest loss emphasizes the need for targeted policy and management responses, supported by interpretable model outputs and a comprehensive understanding of local drivers.
6. Conclusions
This study presents a comprehensive, interpretable analysis of deforestation and forest degradation along the Iraq–Turkey border within the Duhok Governorate over a decade (2015–2024). Using high-resolution Sentinel-2 imagery, climatic and topographic data, and advanced machine learning, it demonstrates the effectiveness of explainable artificial intelligence in complex, data-limited regions.
Among the tested algorithms, XGBoost achieved the highest performance (R2 > 0.90), with minimal prediction errors. SHAP and GAM improved model interpretability and revealed a clear temporal shift in forest drivers, from climate-related factors (rainfall, temperature) in 2015 to anthropogenic pressures (fires, road construction, land clearing) by 2024. This shift was strongly linked to conflict-related development and infrastructure encroachment. Spatial modeling showed alignment between degradation hotspots and road expansion, with forest area declining by roughly 8% (from 630 km2 to 577 km2). These patterns confirm the reported trends of militarization, illegal logging, and fire-induced degradation.
This study highlights the compounded effect of ecological vulnerability and political instability on forest loss. It also shows that forest degradation in geopolitically sensitive regions requires more than ecological monitoring. It requires integrating socio-political insights into predictive analysis.
Regulating road construction, strengthening legal enforcement against deforestation, and involving local communities in forest monitoring and restoration are critical. Public policies should combine environmental conservation with conflict-aware land management and improved access to higher-temporal-resolution and socio-economic data.
The methodological framework developed here offers a reproducible, robust approach for forest monitoring. It integrates predictive analysis with ecological interpretation and holds significant potential to inform evidence-based policies in fragile landscapes.