Development of a Forest Fire Diagnostic Model Based on Machine Learning Techniques

Abstract: Forest fires have devastating effects on extensive forest areas, compromising vital ecological services such as air purification, water conservation, and recreational opportunities, thus posing a significant socioeconomic threat. Furthermore, the risk of forest fires is steadily increasing due to climate change. The most effective way to mitigate forest fire risk is proactive prevention: identifying high-risk areas from land surface conditions before fires occur. This study aimed to develop a machine learning-based forest fire diagnostic model designed for the Republic of Korea, considering both satellite-derived land surface data and anthropogenic factors. For the remote sensing data, the Vegetation Temperature Condition Index (VTCI) was used to reflect land surface dryness. In addition, fire activity maps for buildings, roads, and cropland were used to account for the influence of human activities. The forest fire diagnostic model yielded an accuracy of 0.89, demonstrating its effectiveness in predicting forest fire risk. To validate the model, 92 short-term forest fire risk forecast maps were generated from March to May 2023, and real-time data on forest fire occurrences were collected for verification. The results showed that 73% of forest fires were correctly classified within high-risk zones, confirming the operational accuracy of the model. Through the forest fire diagnostic model, we have presented how meteorological, topographical, and environmental data, together with the satellite-based dryness index and anthropogenic factors, relate to forest fire occurrence. Additionally, we have demonstrated the potential uses of surface condition data.


Introduction
Forest fires represent a significant socioeconomic threat with the capacity to devastate expansive forest areas. These blazes not only consume the forests themselves but also eliminate the myriad ecological services they provide, such as air purification, water conservation, and recreational opportunities. Notably, the impairment of forest carbon sequestration and the large amounts of carbon emitted pose substantial obstacles to achieving carbon neutrality goals [1–3]. Recent data underscore an alarming trend: the incidence and severity of forest fires are escalating, a pattern that is expected to intensify as the climate crisis progresses [4,5]. According to official statistics from the Korea Forest Service, both the frequency of forest fires and the damaged area increased in the 2020s compared with the 1980s through the 2000s. Approximately 8369 ha of forest was burned, and the number of incidents reached a peak of 580 in the 2020s [6].
While climate change is undoubtedly exacerbating weather conditions conducive to forest fires, the fact that most of these fires in the Republic of Korea are caused by human activity cannot be overlooked, highlighting the necessity of robust prevention measures [7]. Central to the strategy of forest fire mitigation is the accurate prediction of fire risks. Most forest fire prediction models depend largely on meteorological data and do not sufficiently consider the actual conditions of the land surface, a gap that remote sensing technologies could help to close.
Studies on the use of satellite imagery for fire prediction have primarily focused on monitoring fuel dryness. Myoung et al. developed an empirical model function for live fuel moisture based on the correlation of satellite vegetation indices with in-situ fuel moisture [8]. Others, such as Verbesselt et al., compared satellite indices with meteorological drought indices, revealing that satellite indices can reflect arid conditions better [9]. Although these studies have validated the relevance of satellite data, they predominantly focused on long-term risk estimation and were not applicable to daily risk assessment.
To address this limitation, Kang et al. obtained more granular spatiotemporal patterns by integrating spatial road and population data, marking progress in fire risk prediction [7]. Similarly, the machine-learning applications of Kim et al., which incorporated socioeconomic trends, revealed a direct relationship between population density and forest fire risk [10]. Nevertheless, these studies did not fully exploit surface condition data from satellite imagery, and their resolutions were inadequate for identifying high-risk zones on a fine administrative scale. Moreover, forest fire risk forecasts rely on models that do not sufficiently consider the influence of anthropogenic activities, which are crucial factors affecting Korean forest fires.
Previous studies on forest fire prediction have faced limitations in fully reflecting surface conditions and have relied solely on historical data when utilizing satellite imagery. Therefore, this study developed a forest fire diagnostic model that reflects the current surface conditions. To address the data gaps commonly found in traditional satellite image applications, a model predicting satellite indices was developed, and its outputs were utilized as input data. Additionally, to simulate anthropogenic fire occurrences, fire activity maps that delineate the range of human activities were constructed and used.

Study Area
The focus of this study was the Republic of Korea, located between 124° and 132° East longitude and 33° and 43° North latitude, on the eastern side of the Eurasian continent and to the northwest of the North Pacific Ocean. The country's terrain is primarily defined by a mountainous spine that stretches from north to south, creating a gradient of elevation that is lower in the west and higher in the east. The average annual precipitation is 1306.3 mm, 54% of which occurs during the summer [11,12]. This uneven rainfall distribution contributes to extended dry periods in spring and winter, which in turn increases the prevalence of forest fires [13,14]. In the Republic of Korea, a significant proportion of forest fires are attributed to anthropogenic activities, particularly in regions close to residential areas bordering forests [15,16]. These areas experience more than 60% of forest fire incidents, whereas regions at higher elevations, which are more densely forested, tend to have fewer occurrences [7,17,18]. A comparison of the forest fire inventory with the DEM shows that most forest fire points were concentrated in low-elevation areas and only sparsely distributed in high-elevation areas (Figure 1).

Data
To develop the forest fire diagnostic model and VTCI prediction model, forest fire label data, remote sensing data, and fire activity maps were utilized. In addition, meteorological, topographical, and environmental data were collected. The data are summarized in Table 1. Forest fire inventory data were obtained from records maintained by the Korea Forest Service. These data, accessible via an API (Application Programming Interface) provided by the Forest Service, included the location addresses of past forest fires. The data were converted into a geospatial dataset containing latitude and longitude coordinates. However, because the original data were based on address information, there were concerns regarding the accuracy of the exact fire locations. To enhance the reliability of this dataset, a validation process was undertaken that used historical media reports and visual inspections to refine the data. From 2003 to 2020, we collected data from 8115 locations where forest fires occurred.
To train a model for classifying forest fire occurrence areas, data on forest fire non-occurrence areas are required. Previous studies have primarily utilized ratios of occurrence to non-occurrence data ranging from 1:1 to 1:2 [19,20]. In this study, to distinguish forest fire risk areas based on as much data as possible, we constructed the data with a 1:2 ratio. In the Republic of Korea, over 50% of wildfires occur in the spring and fall, while the risk decreases significantly in the summer. To clearly distinguish the actual differences in wildfire risk and train the model effectively, we applied a different extraction ratio for non-occurrence points in each month. This approach ensures that fewer non-occurrence samples are extracted during high-risk months and more samples are extracted during low-risk periods.
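The monthly extraction scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name and the monthly fire counts are hypothetical placeholders, and "inverting" the monthly fire ratio is assumed to mean taking its reciprocal. The total of 16,000 matches the number of non-occurrence points used in this study.

```python
def allocate_non_occurrence(monthly_fires, total_samples):
    """Split total_samples across months in inverse proportion to fire frequency.

    Months with zero recorded fires are not handled here (division by zero).
    """
    total_fires = sum(monthly_fires.values())
    # Share of all fires that fall in each month.
    ratio = {m: n / total_fires for m, n in monthly_fires.items()}
    # Assumption: "inverting" means the reciprocal of the monthly share.
    inverse = {m: 1.0 / r for m, r in ratio.items()}
    inv_total = sum(inverse.values())
    # Normalise the reciprocals so that they sum to 1.
    weight = {m: v / inv_total for m, v in inverse.items()}
    # Rounding means the counts may differ from total_samples by a few points.
    return {m: round(w * total_samples) for m, w in weight.items()}


# Hypothetical monthly fire counts (placeholders, not the study's inventory).
example_fires = {"Mar": 400, "Apr": 400, "Jul": 100, "Aug": 100}
example_alloc = allocate_non_occurrence(example_fires, 16000)
```

With these placeholder counts, the low-fire months (July, August) receive four times as many non-occurrence points as the high-fire months (March, April), which is the intended effect of the scheme.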
The total number of non-occurrence data points was 16,000, accounting for approximately 66% of the label data. In Table 2, the columns are labeled (a) to (e). Column (a) represents the monthly counts of forest fires in the inventory data, and column (b) shows the monthly ratio of forest fires. Column (c) contains the values obtained by inverting the ratios in column (b), and column (d) represents the ratios of the values in column (c). The final monthly counts of non-occurrence data, column (e), are calculated by multiplying the values in column (d) by 16,000, the total number of non-occurrence cases.

We used the Vegetation Temperature Condition Index (VTCI) to monitor the dryness of forest areas as a component of land surface data. The VTCI is defined as a ratio of Land Surface Temperature (LST) differences among pixels that share a specific Normalized Difference Vegetation Index (NDVI) value within a broad study region [21–23]. The VTCI was formulated as follows:

$$\mathrm{VTCI} = \frac{LST_{NDVI_i.\max} - LST_{NDVI_i}}{LST_{NDVI_i.\max} - LST_{NDVI_i.\min}} \tag{1}$$

$$LST_{NDVI_i.\max} = a + b\,NDVI_i \tag{2}$$

$$LST_{NDVI_i.\min} = a' + b'\,NDVI_i \tag{3}$$

where $LST_{NDVI_i.\max}$ and $LST_{NDVI_i.\min}$ represent the maximum and minimum LSTs of the pixels that have the same $NDVI_i$, respectively, and $LST_{NDVI_i}$ is the LST of a single pixel with a specific $NDVI_i$ value. The coefficients $a$, $b$, $a'$, and $b'$ are determined from the study area, where soil moisture ranges from the wilting point to the field capacity [24].
The NDVI was calculated using bands 1 (red) and 2 (near-infrared) from the MOD09GA surface reflectance dataset provided by NASA's Terra Moderate Resolution Imaging Spectroradiometer (MODIS), whereas LST was obtained from MOD11A1 data from the same source. All satellite image data were provided daily, and the VTCI values were calculated after excluding locations obscured by cloud cover. Missing values were filled using a deep learning-based VTCI prediction model.
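As a worked illustration of the index, the sketch below evaluates the VTCI for a single pixel as (LST_max - LST) / (LST_max - LST_min), where LST_max and LST_min are linear functions of NDVI (the warm and cold edges). The edge coefficients and pixel values are hypothetical and chosen only so that the arithmetic is easy to follow; in practice they would be fitted from the NDVI-LST scatter of the study area.

```python
def vtci(lst, ndvi, a, b, a_prime, b_prime):
    """VTCI of one pixel from its LST, its NDVI, and the fitted edge coefficients."""
    lst_max = a + b * ndvi              # warm edge: driest pixels at this NDVI
    lst_min = a_prime + b_prime * ndvi  # cold edge: wettest pixels at this NDVI
    if lst_max == lst_min:
        raise ValueError("warm and cold edges coincide at this NDVI")
    return (lst_max - lst) / (lst_max - lst_min)


# Hypothetical edge coefficients (kelvin), chosen only for illustration.
A, B, A_PRIME, B_PRIME = 320.0, -20.0, 290.0, 0.0
dry_pixel = vtci(310.0, 0.5, A, B, A_PRIME, B_PRIME)  # LST on the warm edge
wet_pixel = vtci(290.0, 0.5, A, B, A_PRIME, B_PRIME)  # LST on the cold edge
mid_pixel = vtci(300.0, 0.5, A, B, A_PRIME, B_PRIME)  # halfway between edges
```

A pixel on the warm edge receives a VTCI of 0 (driest) and a pixel on the cold edge receives 1 (wettest), so lower values indicate drier surface conditions.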

Forest Fires Activity Map
According to Korea Forest Service statistics, 33% of forest fires in the Republic of Korea are caused by hikers' negligence, with other significant causes including agricultural residue burning (13%), garbage incineration (13%), and cigarette-related incidents (6%). These data highlight the prevalence of human influences on forest fire ignition [17]. The forest fire activity map is a dataset constructed to estimate the range of human activities based on anthropogenic data. In this study, data on buildings, roads, and agricultural land from digital topographic maps were used to create the forest fire activity data for each element. Three types of forest fire activity maps were developed by applying the kernel density method to spatial data on agricultural land, buildings, and road infrastructure sourced from the National Geographic Information Institute of the Republic of Korea (Figure 2).
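The kernel density step can be illustrated with a small sketch. The text does not specify the kernel shape or bandwidth, so the Gaussian kernel, the 200 m bandwidth, the grid, and the building coordinates below are all assumptions made for illustration.

```python
import math


def kernel_density(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density of point features evaluated on a regular grid."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    surface = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = 0.0
            for px, py in points:
                d2 = (gx - px) ** 2 + (gy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        surface.append(row)
    return surface  # surface[row][col] follows grid_y[row], grid_x[col]


# Hypothetical building centroids in projected coordinates (metres).
buildings = [(100.0, 100.0), (120.0, 110.0), (900.0, 900.0)]
axis = [0.0, 250.0, 500.0, 750.0, 1000.0]
density = kernel_density(buildings, axis, axis, bandwidth=200.0)
```

Cells near the cluster of two buildings receive the highest density, the cell near the single building a lower one, and the empty centre of the grid the lowest, which is how such a surface approximates the spatial reach of human activity.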

Meteorological Data
We collected and utilized the daily average humidity, daily total precipitation, daily average wind speed, daily average temperature, daily maximum temperature, and daily minimum temperature provided by the Korea Meteorological Administration (KMA). All meteorological data were sourced from KMA observation stations and converted into nationwide raster spatial data with a 1 km resolution using the Inverse Distance Weighting (IDW) interpolation method.
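IDW estimates the value at an unsampled location as a weighted average of station observations, with weights that decay with distance. The sketch below is a minimal version for a single target cell; the power parameter of 2 and the two stations are assumptions, as the text does not report the settings used.

```python
import math


def idw(stations, x, y, power=2.0):
    """Inverse Distance Weighting estimate at (x, y) from (sx, sy, value) stations."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        dist = math.hypot(x - sx, y - sy)
        if dist == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: (x in km, y in km, daily mean relative humidity in %).
stations = [(0.0, 0.0, 40.0), (10.0, 0.0, 60.0)]
midpoint = idw(stations, 5.0, 0.0)    # equidistant from both stations
near_first = idw(stations, 1.0, 0.0)  # dominated by the first station
```

Applying the same function to the centre of every 1 km cell would produce a nationwide raster of the kind described above.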
Meteorological factors such as Effective Humidity (EH), Fine Fuel Moisture Code (FFMC), and Duff Moisture Code (DMC) were incorporated. EH is a specialized index used to gauge the potential risk of forest fires based on the atmospheric moisture content. It is calculated by aggregating the relative humidity levels over a set period, typically incorporating data from both the current day and several preceding days, to provide a comprehensive picture of the atmospheric moisture availability that affects fire conditions [25,26].
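The aggregation behind EH can be sketched with a geometrically weighted sum of daily relative humidity. The text does not give the formula or its parameters, so the form used here, EH = (1 - r)(H0 + r H1 + r^2 H2 + ...), the decay coefficient r = 0.7, and the five-day window are assumptions based on a commonly cited definition, and the humidity values are hypothetical.

```python
def effective_humidity(daily_rh, r=0.7):
    """Weighted sum of relative humidity; daily_rh[0] is today, daily_rh[k] is k days ago."""
    weighted = sum((r ** k) * h for k, h in enumerate(daily_rh))
    return (1.0 - r) * weighted


# Hypothetical relative humidity (%) for today and the four preceding days.
drying_spell = [30.0, 35.0, 40.0, 55.0, 70.0]
eh_value = effective_humidity(drying_spell)
```

Because recent days carry the largest weights, the index falls quickly during a drying spell while still retaining some memory of earlier, more humid days.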
The FFMC and DMC are two critical components of the Canadian Forest Fire Weather Index (FWI) System, which is part of the Canadian Forest Fire Danger Rating System (CFFDRS). They are designed to assess the moisture content of forest fuels, which is a key factor in determining the risk and behavior of forest fires [27,28]. The FFMC specifically targets the moisture content of fine fuels such as leaves, grasses, and small twigs that are less than 1/4 inch in diameter. These materials are highly responsive to weather conditions and can dry rapidly, thereby becoming highly flammable. The FFMC value ranges from 0 to 100, where lower values indicate a higher moisture content and a lower risk of fire ignition, whereas higher values indicate drier conditions and a higher susceptibility to ignition and fire spread. This index is particularly sensitive to relative humidity and temperature changes, and can fluctuate significantly over the course of a single day [29–31].
The DMC, in contrast, focuses on the moisture content of moderately decomposed organic matter located below the litter of freshly fallen leaves or needles, which is commonly referred to as the duff layer. This layer is deeper and denser than the materials targeted by the FFMC and therefore responds more slowly to weather changes. The DMC serves as a good indicator of the drying trends in these medium fuels, which are crucial for the sustained burning of fires. The DMC also ranges from 0 to a high value, typically approximately 300, with higher numbers indicating drier conditions and a greater potential for a fire to consume deeper fuel layers and persist once ignited [32,33].

Topographical and Environmental Data
To evaluate the influence of topographic characteristics on forest fire occurrence, we employed data from a Digital Elevation Model (DEM) along with slope and aspect information, all of which were provided by the National Geographic Information Institute. Topographical factors influence human accessibility, solar radiation, and wind speed, which in turn correlate with soil moisture and humidity levels in forests, closely linking them to the occurrence of forest fires [10,15,34,35].
Additionally, we considered the ecological impact of various forest tree species and patterns of initial fire outbreaks across different types of land cover. This analysis was facilitated by a land-cover map from the Korean Ministry of Environment, which categorized the landscape into nine distinct types: urban areas, agricultural lands, three forest types (coniferous, deciduous, and mixed), grasslands, wetlands, barren lands, and water bodies. This detailed categorization helps to explain how land cover variability contributes to fire dynamics.
For further analysis, longitude and latitude data were used to identify and track the patterns of recurrent fire occurrences. This geographical information aided in identifying high-risk areas and understanding the spatial distribution of fire events over time.

Development of the VTCI Prediction Model
In response to the need for updated and accurate VTCI data, a deep-learning-based model was developed to predict VTCI values derived from satellite imagery. Because few existing studies have directly predicted satellite imagery, the methodology and input variables were selected with reference to studies predicting soil moisture [36–38]. Figure 3 shows the workflow of the VTCI prediction model, which incorporates remote sensing, meteorological, and environmental factors. For the remote sensing factors, NDVI and LST data were utilized, with each satellite index being constructed from the average, maximum, and minimum values of data for the same month from 2003 to 2020. For the meteorological factors, average temperature, maximum temperature, minimum temperature, humidity, precipitation, and wind speed were used, along with the FFMC index, which measures the dryness of the surface layer. Additionally, land cover maps and latitude-longitude coordinates were utilized. The model incorporated a DNN (Deep Neural Network) algorithm. DNN, a highly prevalent regression model in machine learning, has proven effective in soil moisture retrieval [39,40]. As illustrated in Figure 4, the DNN model was constructed with 10 hidden layers, each containing between 100 and 500 nodes. The model utilized the LeakyReLU activation function and the Adam optimization algorithm to minimize the Mean Squared Error (MSE) loss function. The performance of the final model was evaluated using two metrics: Mean Absolute Error (MAE) and R². Twelve models, one per calendar month, were developed to account for monthly weather variations. Each model was trained with a batch size of 1000 for a default of 10 epochs.
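The building blocks named above (LeakyReLU hidden layers, a linear regression output, the MSE loss, and the MAE and R² metrics) can be sketched in plain Python. This is not the authors' implementation: training with Adam is omitted, the weights are random, the hidden layers are shrunk to 8 nodes instead of 100 to 500 so that the example runs instantly, and the four input features and the LeakyReLU slope of 0.01 are assumptions.

```python
import random


def leaky_relu(x, slope=0.01):
    """LeakyReLU: identity for positive inputs, small slope for negative ones."""
    return x if x > 0.0 else slope * x


def init_layers(sizes, seed=0):
    """Random weights and zero biases for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, features):
    """Forward pass: LeakyReLU on hidden layers, linear output for regression."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        activations = z if is_output else [leaky_relu(v) for v in z]
    return activations[0]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# 10 hidden layers as in the text; widths and inputs are illustrative placeholders.
network = init_layers([4] + [8] * 10 + [1])
prediction = forward(network, [0.2, 0.5, 0.1, 0.9])
```

In practice a deep learning framework would supply these pieces together with the Adam optimizer and mini-batching; the sketch only makes the architecture and the evaluation metrics concrete.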

Development of a Forest Fire Diagnostic Model
To develop a forest fire diagnostic model, we utilized the Python PyCaret module, which is an open-source, low-code machine-learning framework that facilitates the rapid deployment of models after data preparation. PyCaret was designed to simplify the development, assessment, comparison, and deployment of machine learning models, making the process as efficient and effective as possible [41,42]. In PyCaret, the performance of the model was evaluated against seven criteria: Accuracy, Area Under the Receiver Operating Characteristic Curve (AUC), Recall, Precision, F1 score, Kappa Value, and Matthews Correlation Coefficient (MCC). PyCaret version 3.0.0 was used in this study.
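Six of the seven criteria can be computed directly from a binary confusion matrix, as the sketch below shows (AUC is omitted because it requires predicted probabilities rather than class labels). The function name and the confusion matrix counts are hypothetical and are not the results of this study.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from a binary confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2.0 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1.0 - expected)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical confusion matrix (not the study's results).
scores = classification_metrics(tp=85, fp=18, fn=15, tn=182)
```

Kappa and MCC are useful alongside accuracy here because the label data are imbalanced (roughly one occurrence for every two non-occurrences), and both penalise a model that simply favours the majority class.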
Using PyCaret, we compared various machine learning algorithms to select the best-performing algorithm for the forest fire diagnostic model. CatBoost was selected; it uses ordered boosting and handles categorical variables natively, without requiring preprocessing such as one-hot or label encoding [43,44]. Although this study does not utilize multiple categorical variables, CatBoost also offers robust handling of missing values, feature importance estimation, and efficient GPU support for faster training [45].
Figure 5 illustrates the overall workflow of the study, which comprises three primary components. The first is the previously detailed VTCI prediction model, which serves as a foundational element. The second is the forest fire diagnostic model, which utilizes the predicted VTCI as input data. The third is the wildfire risk forecasting component, which relies on 3-day short-term weather forecast data sourced from the KMA.

SHapley Additive exPlanations (SHAP)
The SHAP values were utilized to analyze the impact of each variable in the model. SHAP is a versatile approach used to explain the output of machine learning models. It provides insights into the importance and contribution of each feature in making predictions, offering a broader understanding of the model behavior. SHAP values are based on the Shapley values from cooperative game theory and provide a fair way to allocate credit among features. By examining the SHAP values, users can discern how changes in individual features affect model predictions, aiding model interpretation and debugging. Moreover, SHAP values can be aggregated to explain global model behavior or used to analyze specific predictions on a per-instance basis, thereby enhancing model transparency and trustworthiness [46,47]. The SHAP value of variable $i$ is given by:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{i\}}\left(x_{S \cup \{i\}}\right) - f_S\left(x_S\right) \right]$$

where $\phi_i$ is the SHAP value of the $i$-th variable and $F$ represents the set of all variables. $S$ ranges over all subsets of $F$ that exclude the $i$-th variable. $f_{S \cup \{i\}}(x_{S \cup \{i\}})$ is the contribution of the subset with the $i$-th variable included, and $f_S(x_S)$ is the contribution of the subset without it [48].
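To make the definition concrete, the sketch below computes exact Shapley values for a toy model by enumerating every subset of variables that excludes variable i and weighting the marginal contribution by |S|!(|F| - |S| - 1)!/|F|!. The toy model and its numbers are hypothetical; the variable names are borrowed from this study only as labels. Exhaustive enumeration is exponential in the number of variables, so practical SHAP implementations rely on much more efficient algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every subset S of F without feature i."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset S.
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi


def toy_model(present):
    """Hypothetical model output when only the features in `present` are known."""
    score = 0.0
    if "EH" in present:
        score += 0.3
    if "FFMC" in present:
        score += 0.2
    if "EH" in present and "FFMC" in present:
        score += 0.1  # interaction term, shared equally between the two features
    return score


phi = shapley_values(toy_model, ["EH", "FFMC", "VTCI"])
```

The values sum to the difference between the model output with all variables and with none (0.6 here), which is the additivity property that lets SHAP values be read as a decomposition of a prediction.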

VTCI Prediction Model Performance
The final VTCI prediction model demonstrated an average Mean Absolute Error (MAE) of 0.0038, Mean Squared Error (MSE) of 0.1103, and an R² of 0.86. The best model performance occurred in February, whereas the lowest was in August (Table 3). Performance gradually declined from February to August and increased from September onwards. In the Republic of Korea, rainfall increases from June to August, leading to greater cloud cover and making satellite image acquisition more challenging. Months with heavy rainfall typically have lower data availability than relatively drier periods, so less data can be used and model performance decreases accordingly. Despite some errors, the performance of the VTCI prediction model was high. Considering that forest fires in the Republic of Korea are concentrated between January and early June, the use of the VTCI derived from this model is justified. Furthermore, the model allows forest dryness to be assessed in areas not observed in satellite imagery owing to cloud cover and can provide future results using weather forecast data.

Forest Fire Diagnostic Model Performance
The results of comparing various machine learning algorithms showed that boosting algorithms demonstrated relatively high performance, with the CatBoost model exhibiting the highest performance in terms of Accuracy, AUC, F1 Score, Kappa Value, and MCC. In addition, it demonstrated commendable performance in terms of Recall and Precision.
After selecting the CatBoost algorithm, the model was tuned for 50 iterations based on the kappa criterion. Based on validation data, the final forest fire diagnostic model yielded the following performance metrics: Accuracy of 0.8898, AUC of 0.9541, Recall of 0.8515, Precision of 0.8279, F1 Score of 0.8395, Kappa of 0.7557, and MCC of 0.7558 (Table 4).

In the SHAP dependence plots, the positions of the dots represent the SHAP values corresponding to each variable, indicating their association with forest fire occurrence. Notably, the risk was higher for EH values below 50, and FFMC showed that the risk increased significantly when the value exceeded 60. In contrast, the DMC indicated risk above 60, and even values between 0 and 10 were considered risky, suggesting that forest fires in Korea are heavily influenced by surface dryness rather than by prolonged drought, which affects underground dryness in forests.
The VTCI exhibited a distinction in forest fire risk around a threshold of 0.5, indicating a correlation with forest fire occurrence. Despite VTCI's lower influence compared to meteorological data, the predicted VTCI data demonstrated their utility.
In the case of forest fire activity maps, the risk of forest fires increased as the density of buildings and roads increased. However, for buildings, the trend indicated a decrease in forest fire risk at densities above approximately 0.6, whereas the impact of roads on risk levels remained consistent. Unlike buildings and roads, farmland showed a tendency to reduce risk levels as density increased, although a higher risk of forest fire occurrences was observed in the range of approximately 0.0 to 0.3.
Among the topographical factors, a DEM between 0 and 350 m showed a higher forest fire risk, particularly in low-lying areas accessible to humans, reflecting the tendency for forest fires to occur in lowland areas rather than in higher-elevation forests. Aspect angles between 100° and 300°, closer to south-facing, exhibited higher forest fire risks, likely due to higher solar radiation and the resulting evaporation rates in those areas. Slope angles between approximately 5° and 30° were associated with forest fire occurrence, indicating their influence on distinguishing between forested and non-forested land.
Analysis using SHAP values revealed that longitude significantly influenced the risk of forest fire occurrence in the Republic of Korea. Areas along the western coast and adjacent inland regions in the south, as well as coastal areas in the east, exhibited higher forest fire risks, whereas the central regions showed relatively lower risks. This pattern reflects the topographical characteristics of the Republic of Korea, where forest fires predominantly occur in urban areas adjacent to forests rather than in centrally forested areas. This phenomenon is also evident from the higher SHAP values observed in regions with higher latitudes, indicating the influence of the Gyeonggi and Gangwon provinces, where forest fires are more frequent [49]. Regarding land cover, coniferous forests (LC 4), grassland (LC 6), mixed forests (LC 5), and deciduous forests (LC 3) had a higher impact on forest fire risk, in descending order, which is consistent with historical forest fire occurrence statistics based on tree species [15,50].

Forest Fire Forecast for Republic of Korea
To conduct short-term forest fire forecasting, an automated process was developed by integrating the short-term weather forecasting process with the forest fire diagnostic model established in this study. This automated forecasting process involves collecting and processing meteorological data to obtain input information, such as the average relative humidity, daily precipitation, average/maximum/minimum temperature, and average wind speed announced at 5 PM the day before the forecast. This information was then used as the input for the model, and the process culminated in the production and mapping of the final forecast results.
Utilizing short-term weather forecast data provided by the Meteorological Administration at 5 PM on the previous day, the process calculates the forecast results one to three days ahead. A total of 92 short-term forest fire risk forecast maps were generated for the period of 1 March-31 May 2023. Forest fire inventory data were collected from the Korea Forest Service Real-Time Forest Fire Information Platform. The data were compiled by referencing the map locations provided by the platform for verification purposes. During the forecasted period, 362 forest fires occurred.
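The forecast schedule described above can be sketched with standard date arithmetic. The snippet below is only an illustration under stated assumptions: the helper names are hypothetical, and the reading that a bulletin issued at 5 PM covers the following one to three calendar days is our interpretation of the text, not a documented interface of the forecasting system.

```python
from datetime import date, datetime, time, timedelta

FIRST_DAY = date(2023, 3, 1)
LAST_DAY = date(2023, 5, 31)
ISSUE_HOUR = 17  # weather forecast announced at 5 PM on the previous day


def forecast_days(first, last):
    """All target days in the forecast period, both ends included."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def issue_time(target_day):
    """Time of the weather bulletin used for a given target day."""
    return datetime.combine(target_day - timedelta(days=1), time(ISSUE_HOUR))


def lead_days(issued_at, max_lead=3):
    """Days covered one to `max_lead` days ahead of the issuance date."""
    return [issued_at.date() + timedelta(days=k) for k in range(1, max_lead + 1)]


days = forecast_days(FIRST_DAY, LAST_DAY)
print(len(days), "daily forecast maps")
print("Bulletin for 2 April 2023 issued at", issue_time(date(2023, 4, 2)))
```

Counting both end dates, the period of 1 March to 31 May 2023 contains 92 days, which matches the number of forecast maps reported.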
Based on short-term forecast data for forest fire diagnoses, additional validation was conducted on the forest fire diagnostic model using the 362 forest fire occurrences recorded from March to May 2023. The classification of these forest fire occurrences into the categories in Table 5 showed 165 cases as "Very High", 99 cases as "High", 21 cases as "Moderate", 40 cases as "Low", and 37 cases as "Very Low". The "High" and "Very High" cases together made up 264 of the total, accounting for 73% of all incidents, which confirms that the forest fire diagnostic model accurately diagnoses actual forest fire risks.
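The verification arithmetic can be checked directly from the category counts reported for Table 5. The short sketch below uses only those counts; the variable names are illustrative.

```python
# Forest fire occurrences per forecast risk category (March-May 2023, Table 5).
counts = {
    "Very High": 165,
    "High": 99,
    "Moderate": 21,
    "Low": 40,
    "Very Low": 37,
}

total = sum(counts.values())
high_risk = counts["Very High"] + counts["High"]
high_risk_share = high_risk / total

print(f"Total fires: {total}")
print(f"Fires in High or Very High zones: {high_risk}")
print(f"Share in high-risk zones: {high_risk_share:.1%}")
```

The counts sum to 362, and 264 of them fall in the two highest categories, which gives a share of about 72.9%, reported as 73%.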
On 2 April 2023, 34 forest fires occurred, the highest number of forest fires on a single day in 2023. A comparison of the forest fire risk map with the actual forest fire occurrence locations shows that forest fire risks were effectively differentiated between urban (Seoul City) and rural areas (Yangpyeong County). In both cases, forests located near residential areas had a higher fire risk than those located in the central parts of the forests. However, despite the close proximity of the two areas, there is a significant difference in the distribution of forest fire risk levels, which is likely influenced by the distribution of the forest fire activity map. In the forest fire activity map of buildings and roads, areas marked in red, representing high density, coincide with major cities in Korea and show higher forest fire risk levels centered around these urban areas. In particular, Seoul shows a significant difference in forest fire risk levels in forests near residential areas compared to Yangpyeong (Figures 11 and 12).
On the other hand, Gangwon Province has the highest proportion (82%) of forests in the Republic of Korea, and most areas shown in green in the forest fire diagnostic forecast results correspond to forested areas (Figure 13). Historical forest fire records also indicate that, apart from the central forested areas of Gangwon Province, forest fires have primarily occurred near the eastern coastline and western residential areas (Figure 1). A similar trend has been observed in the current distinctions of areas at risk of forest fires.

Conclusions
In this study, we developed a forest fire diagnostic model using land surface condition data (the VTCI satellite image index) and forest fire activity maps that can diagnose forest fire risks at a resolution of 100 m. One of the key components of the forest fire diagnostic model, the VTCI prediction model, demonstrated its usability with an average performance of R-square 0.83, MSE 0.1103, and MAE 0.0038, thereby validating the model's applicability. The forest fire diagnostic model yielded an accuracy of 0.89. The performance of the forest fire prediction was validated using short-term weather forecasts nationwide from March 2023 to May 2023, showing that 73% of forest fires were included in high-risk areas.
The forecast results of the forest fire diagnostic model indicated that areas adjacent to human activity had a higher risk of forest fire occurrence, demonstrating the potential for simulating anthropogenic fire events. However, since the forest fire activity map is constructed based on density, urban areas appear to have a higher risk of forest fires than small towns or rural areas, which is an aspect that needs to be improved in further research. The results derived from SHAP values indicate that the impact of VTCI is relatively lower than that of meteorological factors. Nevertheless, the dryness indicated by VTCI has shown a correlation with actual forest fire occurrences, highlighting its utility in applying satellite image predictions to diagnose forest fire occurrences. Additionally, by utilizing the VTCI prediction model, it has become possible to address issues related to missing data points caused by clouds or snow, which were limitations previously identified when using satellite images for forest fire risk prediction.
Despite the limitations mentioned earlier, the model developed in this study can diagnose forest fire risks that reflect not only natural environmental factors but also anthropogenic influences by incorporating the condition of the land surface. Furthermore, by utilizing the forecast results, it is believed that the model can contribute to more effective forest fire prevention and response activities.

Forests 2024 , 20 Figure 1 .
Figure 1. Study area with DEM and forest fire inventory.

Figure 3. The Workflow of the VTCI Prediction Model.
The model incorporated a DNN (Deep Neural Network) algorithm. DNN, a highly prevalent regression model in machine learning, has proven effective in soil moisture retrieval [39,40]. As illustrated in Figure 4, the DNN model was constructed with 10 hidden layers, where each layer contained between 100 and 500 nodes. The model utilized the LeakyReLU

Figure 4. Structure of the DNN Algorithm for the VTCI Prediction Model.
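The DNN structure described for the VTCI prediction model (10 hidden layers, 100 to 500 nodes per layer, LeakyReLU activation) can be sketched as a plain forward pass. This is a minimal, untrained illustration and not the authors' implementation: the specific layer widths, the number of input features, the LeakyReLU slope, and the random initialization are all assumptions introduced here.

```python
import random

# Hypothetical widths for the 10 hidden layers, each between 100 and 500 nodes.
HIDDEN_WIDTHS = [100, 200, 300, 400, 500, 500, 400, 300, 200, 100]
N_INPUTS = 8  # hypothetical number of input features


def leaky_relu(value, alpha=0.01):
    """LeakyReLU: identity for positive inputs, small slope for negative ones."""
    return value if value > 0 else alpha * value


def init_layer(n_in, n_out, rng):
    """Random weights (scaled by fan-in) and zero biases for one dense layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_inputs, hidden_widths, rng):
    """Stack the hidden layers and a single linear output node for regression."""
    sizes = [n_inputs] + list(hidden_widths) + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(layers, inputs):
    """Apply LeakyReLU after every hidden layer; the output layer is linear."""
    *hidden, output = layers
    x = list(inputs)
    for weights, biases in hidden:
        x = [leaky_relu(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    weights, biases = output
    return sum(w * v for w, v in zip(weights[0], x)) + biases[0]


rng = random.Random(0)
layers = build_network(N_INPUTS, HIDDEN_WIDTHS, rng)
prediction = forward(layers, [0.5] * N_INPUTS)
print("Untrained output for a dummy input:", prediction)
```

Because the weights are random, the printed value carries no meaning; the sketch only shows how an input vector passes through ten LeakyReLU hidden layers to a single regression output such as a predicted VTCI value.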


Figure 5. Workflow of the Forest Fire Diagnostic Model and Forecast Process.

Figure 6. SHAP summary plots of the forest fire diagnostic model based on the CatBoost classifier.

Figures 7 - show the SHAP dependence plot for these factors. The SHAP dependence plot identifies the relationship between a single factor (x-axis) and the corresponding SHAP values generated (y-axis) to evaluate the effect of each feature factor.

Figure 7. SHAP dependence plots for meteorological and satellite image data.

Figure 8. SHAP dependence plots for forest fire activity map.

Figure 11. Forest fire forecast result for 2 April 2023 in Seoul City.

Figure 12. Forest fire forecast result for 2 April 2023 in Yangpyeong County.

Table 2. Method for calculating the monthly non-occurrence label sample sizes.

Table 3. Performance of the VTCI prediction model.

Table 4. Comparison of Machine Learning Algorithms in PyCaret.

Table 5. Forest Fire Forecast Result Verification.