Assessment of Machine Learning Algorithms for Modeling the Spatial Distribution of Bark Beetle Infestation

Koreň, Milan; Jakuš, Rastislav; Zápotocký, Martin; Barka, Ivan; Holuša, Jaroslav; Ďuračiová, Renata; Blaženec, Miroslav

doi:10.3390/f12040395


by Milan Koreň ¹,*, Rastislav Jakuš ²,³, Martin Zápotocký ¹, Ivan Barka ⁴, Jaroslav Holuša ³, Renata Ďuračiová ⁵ and Miroslav Blaženec ²

¹ Faculty of Forestry, Technical University in Zvolen, T. G. Masaryka 24, 960 01 Zvolen, Slovakia

² Institute of Forest Ecology, Slovak Academy of Sciences, Ľ. Štúra 2, 960 53 Zvolen, Slovakia

³ Faculty of Forestry and Wood Sciences, Czech University of Life Sciences Prague, Kamýcká 129, 165 00 Prague 6, Czech Republic

⁴ National Forest Centre—Forest Research Institute, T. G. Masaryka 22, 960 01 Zvolen, Slovakia

⁵ Faculty of Civil Engineering, Slovak University of Technology in Bratislava, Radlinského 11, 810 05 Bratislava, Slovakia

* Author to whom correspondence should be addressed.

Forests 2021, 12(4), 395; https://doi.org/10.3390/f12040395

Submission received: 11 February 2021 / Revised: 19 March 2021 / Accepted: 22 March 2021 / Published: 27 March 2021

(This article belongs to the Special Issue Management of Forest Pests and Diseases)


Abstract

Machine learning algorithms (MLAs) are used to solve complex non-linear and high-dimensional problems. The objective of this study was to identify the MLA that generates an accurate spatial distribution model of bark beetle (Ips typographus L.) infestation spots. We first evaluated the performance of 2 linear (logistic regression, linear discriminant analysis), 4 non-linear (quadratic discriminant analysis, k-nearest neighbors classifier, Gaussian naive Bayes, support vector classification), and 4 decision tree-based MLAs (decision tree classifier, random forest classifier, extra trees classifier, gradient boosting classifier) for the study area (the Horní Planá region, Czech Republic) for the period 2003–2012. Each MLA was trained and tested on all subsets of the 8 explanatory variables (distance to forest damage spots from the previous year, distance to spruce forest edge, potential global solar radiation, normalized difference vegetation index, spruce forest age, percentage of spruce, volume of spruce wood per hectare, stocking). The mean phi coefficient of the model generated by the extra trees classifier (ETC) MLA with five explanatory variables for the period was significantly greater than that of most forest damage models generated by the other MLAs. The mean true positive rate of the best ETC-based model was 80.4%, and the mean true negative rate was 80.0%. The spatio-temporal simulations of bark beetle-infested forests based on MLAs and GIS tools will facilitate the development and testing of novel forest management strategies for preventing forest damage in general and bark beetle outbreaks in particular.

Keywords: machine learning; classifier; spatial distribution model; geographical information system; forest damage; bark beetle

1. Introduction

Wind damage to forests and subsequent spruce bark beetle (Ips typographus L.) outbreaks have increased in Central Europe in recent decades [1]. Bark beetle outbreaks are closely related to wind-caused forest damage, and the two factors interact to create wind–bark beetle disturbance systems [2]. There are also connections between bark beetle outbreaks and other forest disturbances related to climate change [3,4]. Rising mean annual temperatures, increases in drought duration and intensity, and shifts in growing seasons increase the risk of forest infestation by bark beetles [5]. To respond to this problem, forest managers require adequate forest management strategies based on knowledge of the forest damage spatial distribution and factors influencing the risk of wind damage and bark beetle dispersal [6]. The understanding of bark beetle dispersal processes and spatial patterns is important for the effective management of infested forests [7]. Forestry decision support systems profit from the integration of high-resolution remote sensing data, forest mapping and field inventories, advances in silviculture and forest ecology, and the use of modern statistical and machine learning approaches [8,9,10].

Spaceborne and aerial images are used to map the spatial and temporal distributions of forest damage caused by wind or bark beetles [11]. Damaged and infested forests are identified, and bark beetle spots are localized by classification of medium- or high-resolution images. Spatial and temporal changes in forest damage over large territories can be detected in time series of satellite images. Intra-annual satellite image series improve the detection of forest disturbance [12]. The derived multi-temporal digital maps of human and natural forest disturbances can be used to study the dynamics of forest damage.

The spatial dispersal of bark beetle infestations is driven by various environmental factors, including solar radiation, temperature, wind speed and direction, precipitation, soil moisture, distance to forest damage areas, spruce age, diameter at breast height, and stocking [13]. Bark beetle outbreaks and the spatial distribution of forest damage are influenced by the spatial structure of bark beetle populations, by forest and landscape characteristics, and by factors at a wider, regional scale [14]. Because the ecological and spatial relationships are complex, the development of a reliable spatial forest damage model (FDM) is a difficult and computationally intensive task.

Machine learning algorithms (MLAs) have been applied to solve non-linear and high-dimensional problems in forestry and ecology. A support vector machine and a random forest algorithm have recently been used to estimate the volume and basal area of eucalyptus stands from satellite images [15]. Proportions of planned end products were forecasted by Dirichlet regression and neural networks [16]. The stand volume of a rapidly growing forest plantation was estimated by random forest, support vector machine, and neural network regressions from aerial laser scanning data [17].

As noted earlier, forest damage by wind, insects, fire, and other factors is a complex spatial phenomenon that is difficult to model and predict. Advanced spatial modeling techniques are needed to develop accurate and reliable FDMs. Many MLAs are designed to handle large volumes of multi-dimensional data, including geographical data. MLAs learn from existing data and can adapt to hidden spatial patterns and unknown relationships among environmental variables. Rodrigues and de la Riva [18] modeled the risk of human-caused wildfire by random forest, boosted regression tree, and support vector machine algorithms. Mayfield et al. [19] calculated the risk of deforestation with generalized linear mixed models, Bayesian networks, neural networks, and Gaussian processes. Evolutionary and non-evolutionary MLAs were tested for predicting forest burned areas [20]. The potential of modeling forest insect dynamics by cellular automata was demonstrated in [21]. Neural network-based regression was used by Hlásny and Turčáni [22] to analyze the influence of site and stand characteristics on forest damage caused by the spruce bark beetle.

Here, we summarize the results of numerical experiments with MLAs for modeling the spatial distribution of forest damage. Our main objective was to select the MLA with the highest predictive accuracy for modeling the spatial distribution of forest damage in an open-source geographical information system (GIS).

2. Materials and Methods

2.1. Study Area

MLAs were tested for the Horní Planá region in Central Europe (Figure 1). The forests in this region are managed by the Military Forests and Farms of the Czech Republic, State Enterprise [23]. The Military Forests and Farms of the Czech Republic, State Enterprise, is a special-purpose organization that manages 19,960 ha of land in military training areas. Forests represent 16,569 ha of this area, and water reservoirs cover 203 ha; the remaining area is represented by grassland that is used for intensive military training.

The region is characterized by hills that differ in elevation by 100–150 m. Most of the study area is located at 600 to 800 m a.s.l., and the highest point (Lysá Mt) is at 1228 m a.s.l. Part of the area belongs to the Bohemian Forest (the Šumava Mts.), and part belongs to the Šumavské podhůři Mts. The annual mean temperature ranges from 5 to 7 °C, and annual precipitation ranges from 700 to 800 mm [24].

The dominant tree species in forests is Norway spruce (Picea abies L. Karst) (69%). Less common are Scotch pine (Pinus sylvestris L.) (12%), Silver fir (Abies alba Mill.) (6%), and European beech (Fagus sylvatica L.) (5%). Among the forest stands, 23.5% are younger than 40 years, and 32.0% are older than 100 years. At higher altitudes, forests are parts of complexes that include meadows and pastures.

Spruce forests in the Horní Planá region are disturbed mainly by snow, bark beetles, and especially wind. In recent years, incidental felling caused by forest damage has represented about 50% of total felling. Bark beetle outbreaks in the region follow the trends exhibited over the whole of the Czech Republic, with peaks in the mid-1980s and mid-1990s. The last long-term outbreak of I. typographus began in 2003 as a result of a severe drought that occurred throughout Central Europe. This outbreak was partly extended by the winter storm “Kyrill” (January 2007), which destroyed more wood than any other factor over the last 30 years. At the beginning of 2014, the volume of wood infested by bark beetles decreased to under 0.5 m³·ha⁻¹ [25]. The cyclic nature of forest damage in the study region is depicted in Figure 2. The bark beetle is a prevailing cause of forest damage in the study region.

2.2. Input Data

In our study, spatial distribution models of spruce forest damage were developed for the period 2003–2012. The period was limited to the years for which raster layers representing the explanatory variables were available (Table 1). The spatial resolution of the raster layers was 30 m, which corresponds to the spatial resolution of LANDSAT images.

A time series of LANDSAT images was used to identify damaged forest locations in the study area. Raster maps of forest health status were used to delineate the damaged forest locations. These forest health maps are based on LANDSAT images and have been prepared by standardized methods since 1984 by the Forest Management Institute, Brandýs nad Labem, Czech Republic [26]. Forest damage locations were not classified by forest damage factors. Only classes of strong and very strong damage of spruce forest stands [27] were considered for this study. A similar approach was used in [28,29].

Normalized difference vegetation index (NDVI) values were derived from LANDSAT scenes (Table 2). Cloud-free scenes from the growing season were preferred. Processing included manual removal of clouds and shadows and the mosaicking of two scenes based on linear regression using corresponding pixels that represented forest.
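For reference, NDVI is conventionally computed per pixel from red and near-infrared reflectance as (NIR − Red) / (NIR + Red). The following minimal Python sketch illustrates the formula only; the function name and the handling of pixels where both bands are zero are assumptions made for illustration and do not describe the processing chain used in the study.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        # NDVI is undefined when both bands are zero (e.g., masked pixels).
        return None
    return (nir - red) / total


# Dense green vegetation reflects strongly in NIR and weakly in red,
# so NDVI approaches 1; bare soil or damaged canopy gives lower values.
print(ndvi(0.45, 0.05))
print(ndvi(0.20, 0.15))
```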

Information on spruce age (AGE), percentage of spruce in forest stands (PCT), volume of spruce per hectare (VOL), and stocking (STO) was imported from forest management plans. There are four management units within the study area, each with its own management plan. The forest management plans are prepared for 10-year periods. The age of a forest stand was specified in 5-year increments, and the percentage of spruce was based on basal area. Volume was derived from the stand mean diameter at breast height and the stand mean tree height for each species. Stocking was calculated as the relative density using yield tables. The raster layer representing spruce forest stands included only spruce stands with age >49 years and stocking >49%.

Distance to forest damage areas was calculated from the layer of actual forest damage. The actual forest damage layer was subtracted from the spruce forest stand layer to derive a layer of actual spruce forest. The raster layer of the actual spruce forest was used to calculate the distance to the spruce forest edge layer. Potential global solar radiation (PSR) was computed from a digital elevation model (DEM) by the GRASS GIS module r.sun [30] with a 1-h step.

EU-DEM 25 [31] was used as an input to compute PSR. Its original spatial resolution of 25 m is close to the resolution of LANDSAT scenes. The DEM was projected to the national spatial reference system (EPSG 5514, epsg.io/5514/) and resampled at a resolution of 30 m.

Samples representing damaged and undamaged forest were generated for each year of the period. All grid cells representing damaged forest were used as samples. An equal number of cells representing undamaged forest was randomly generated. Among the samples, 75% were used for model training, and 25% were used as controls. All FDMs were trained and validated using the same sample sets.
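The sampling scheme described above (all damaged cells, an equal number of randomly drawn undamaged cells, and a 75%/25% split) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's scripts: the function name, the fixed random seed, and the representation of grid cells as (row, column) tuples are all hypothetical.

```python
import random


def balanced_samples(damaged_cells, undamaged_cells, train_fraction=0.75, seed=0):
    """Build a balanced sample set and split it into training and control parts."""
    rng = random.Random(seed)
    # Use every damaged cell and draw an equal number of undamaged cells.
    # rng.sample raises ValueError if there are fewer undamaged than damaged cells.
    negatives = rng.sample(undamaged_cells, len(damaged_cells))
    samples = [(cell, 1) for cell in damaged_cells] + [(cell, 0) for cell in negatives]
    rng.shuffle(samples)
    n_train = round(train_fraction * len(samples))
    return samples[:n_train], samples[n_train:]


# Toy grid cells (row, column); the values are made up for the example.
damaged = [(r, 0) for r in range(20)]
undamaged = [(r, 1) for r in range(100)]
train, control = balanced_samples(damaged, undamaged)
print(len(train), len(control))
```

Fixing the seed mirrors the statement that all FDMs were trained and validated on the same sample sets, although how the study ensured this is not specified here.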

2.3. Computer Simulations and Data Processing

The FDM consists of a forest damage probability function and a classification function. The FDM $F$ can be expressed in the following form:

$$F(x, y, t) = C(P(\vec{u}(x, y, t))) \tag{1}$$

where $C$ is a classification function; $P$ is a forest damage probability function; $\vec{u}$ is an environment vector function; $x$, $y$ are point (grid cell) coordinates; and $t$ is time.

The probability function $P$ calculates the risk of forest damage at a given location $(x, y)$ and time $t$. Environmental factors, e.g., distance to existing forest damage areas, drought stress, forest stand openness, and solar radiation, are described by independent variables of the forest damage probability function $P$. Each component $u_{i}(x, y, t)$ of the environment vector function $\vec{u}$ corresponds to an independent variable that varies over space and time. In GIS, the independent variable is represented by a time series of raster layers.
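The composition in Equation (1) can be illustrated with a small Python sketch. Everything concrete in it is hypothetical: the toy raster layers, the made-up probability function, and the fixed 0.5 cutoff are stand-ins chosen for illustration. In the study, both the probability and the classification come from the trained MLA.

```python
def make_fdm(env, probability, classify):
    """Compose F(x, y, t) = C(P(u(x, y, t))) from its three parts."""
    def fdm(x, y, t):
        return classify(probability(env(x, y, t)))
    return fdm


# Hypothetical toy raster layers, indexed as LAYERS[year][name][row][col].
LAYERS = {
    2003: {
        "distance_m": [[30.0, 500.0], [60.0, 900.0]],
        "age_years": [[110, 80], [95, 60]],
    }
}


def env(x, y, t):
    """Environment vector u(x, y, t): one value per raster layer."""
    year = LAYERS[t]
    return (year["distance_m"][y][x], year["age_years"][y][x])


def toy_probability(u):
    """Made-up risk score in [0, 1]; NOT a fitted model."""
    distance_m, age_years = u
    return (1.0 / (1.0 + distance_m / 100.0)) * min(age_years / 100.0, 1.0)


def threshold_classifier(p, cutoff=0.5):
    """Assumed classification rule: damaged (1) if p reaches the cutoff."""
    return 1 if p >= cutoff else 0


fdm = make_fdm(env, toy_probability, threshold_classifier)
print(fdm(0, 0, 2003), fdm(1, 0, 2003))
```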

The open-source software GRASS GIS [32] was used for computer simulations. All suitable MLAs from the GRASS GIS add-on r.learn.ml were tested for modeling the spatial distribution of forest damage (Table 3). r.learn.ml is a front-end to the scikit-learn toolkit [33] for the Python programming language. A set of scripts in the Python programming language was developed to automate the processing of forest damage layers, the processing of training and control samples, model training, and the analysis of FDM performance.

In the study, the spatial distribution of forest damage was modeled by linear MLAs (LR and LDA), non-linear MLAs (QDA, KNC, GNB, and SVC), and classification trees (DTC, RFC, ETC, and GBC).

Each MLA was trained and tested on all non-empty subsets of the explanatory variables (Table 1). All combinations of explanatory variables were used as inputs of FDMs because the most suitable combinations of explanatory variables for different MLAs were unknown. The total number of FDMs tested was the product of the number of MLAs (10) and the number of non-empty combinations of the eight explanatory variables (2⁸ − 1 = 255). Each of the 2550 models was calculated for each year of the period 2003–2012. The probability of forest damage was calculated by each MLA. The internal classifier of each MLA was applied to identify locations (grid cells) of forest damage.
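The size of the experiment follows directly from enumerating the non-empty subsets of eight variables. The sketch below reproduces the counts with the Python standard library. The MLA abbreviations and AGE, PCT, VOL, STO, PSR, and NDVI are taken from the text, while the labels DIST_DAMAGE and DIST_EDGE are hypothetical placeholders for the two distance variables.

```python
from itertools import combinations

MLAS = ["LR", "LDA", "QDA", "KNC", "GNB", "SVC", "DTC", "RFC", "ETC", "GBC"]

# DIST_DAMAGE and DIST_EDGE are placeholder labels, not names from the paper.
VARIABLES = ["DIST_DAMAGE", "DIST_EDGE", "PSR", "NDVI", "AGE", "PCT", "VOL", "STO"]


def non_empty_subsets(items):
    """Yield every non-empty combination of items, smallest subsets first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


SUBSETS = list(non_empty_subsets(VARIABLES))
MODELS = [(mla, subset) for mla in MLAS for subset in SUBSETS]

print(len(SUBSETS), len(MODELS))
```

With ten years per model, this grid amounts to 25,500 yearly model fits, which is why the choice of computationally efficient MLAs matters.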

In a computer environment, a spatial forest damage model is defined by MLA $m$ and a non-empty subset $S$ of the explanatory variables. The forest damage spatial distribution model $\mathrm{fdm}(m, S, y)$ was calculated for every year $y$ of the period. The confusion matrix, true positive rate $\mathrm{TPR}(m, S, y)$, true negative rate $\mathrm{TNR}(m, S, y)$, and phi coefficient $\phi(m, S, y)$ were calculated for each $\mathrm{fdm}(m, S, y)$.

The true positive rate (sensitivity) is the proportion of actual forest damage locations (grid cells) that the FDM correctly classifies as damaged. Similarly, the true negative rate (specificity) is the proportion of actual undamaged forest locations that the FDM correctly classifies as undamaged. A reliable spatial FDM is characterized by high sensitivity and high specificity.
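For a binary confusion matrix with counts TP, FP, TN, and FN, the standard definitions are TPR = TP / (TP + FN), TNR = TN / (TN + FP), and φ = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The following Python sketch computes them from label lists; the function names and the convention of returning 0 for an undefined value are assumptions made for illustration.

```python
from math import sqrt


def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = damaged, 0 = undamaged)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates_and_phi(tp, fp, tn, fn):
    """Return (TPR, TNR, phi); undefined ratios are reported as 0.0."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, tnr, phi


# Made-up control sample: four damaged and four undamaged cells.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
print(rates_and_phi(*confusion_counts(actual, predicted)))
```

The phi coefficient ranges from −1 to 1 and, unlike either rate alone, is high only when both damaged and undamaged cells are classified well.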

The overall performance of the spatial FDM was measured by the arithmetic mean of the phi coefficient for the period under study $Y$:

$$\bar{\phi}(m, S) = \frac{1}{l} \sum_{y \in Y} \phi(m, S, y), \tag{2}$$

where $l$ is the length of the period $Y$, $m$ is the MLA used, $S$ is a non-empty subset of the explanatory variables, and $y$ is a year of the period. The mean true positive rate $\overline{\mathrm{TPR}}(m, S)$ and the mean true negative rate $\overline{\mathrm{TNR}}(m, S)$ of the spatial FDM were calculated similarly.

Data were statistically processed with the R package [34]. All statistical hypotheses were tested at a 0.05 significance level. Modules from the packages lmPerm and RVAideMemoire were used for permutation tests of statistical hypotheses. Permutation one-way repeated measures ANOVA was used to compare the mean phi coefficients of the FDMs. Significances of mean phi coefficient differences were tested by a permutation pairwise t-test with Benjamini, Hochberg, and Yekutieli corrections [35].
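To illustrate the idea behind a paired permutation test, the sketch below compares the yearly phi coefficients of two models by exhaustively flipping the signs of the paired differences. It is a generic textbook construction written in Python for illustration; it is not the lmPerm or RVAideMemoire procedure used in the study, it omits the multiple-comparison correction, and the function name and toy values are hypothetical.

```python
from itertools import product


def paired_permutation_p_value(a, b):
    """Two-sided exact sign-flip permutation test on paired values."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis, each paired difference is equally likely
    # to have either sign, so all 2**n sign patterns are enumerated.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up yearly phi values for two models over a 10-year period.
model_a = [0.6] * 10
model_b = [0.5] * 10
print(paired_permutation_p_value(model_a, model_b))
```

With ten years of data there are only 1024 sign patterns, so the exact enumeration is cheap and the smallest attainable two-sided p-value is 2/1024.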

The generated FDMs were sorted in descending order by $\bar{\phi}$ (Equation (2)). A unique ranking number $r$ was assigned to each FDM. The ranking number 1 corresponded to the FDM with the highest mean phi coefficient for the period $Y$. Results were evaluated in terms of FDM performance and simplicity.
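The averaging in Equation (2) and the subsequent ranking can be sketched in a few lines of Python. The phi values below are invented solely to make the example runnable, and the tie-breaking rule (ties keep their input order) is an assumption, since the text does not state how ties were resolved.

```python
def mean_phi(phi_by_year):
    """Arithmetic mean of the yearly phi coefficients (Equation (2))."""
    values = list(phi_by_year.values())
    return sum(values) / len(values)


def rank_models(phi_table):
    """Map each (MLA, variable subset) key to its rank; 1 = highest mean phi."""
    means = {key: mean_phi(by_year) for key, by_year in phi_table.items()}
    # sorted() is stable, so models with equal means keep their input order.
    ordered = sorted(means, key=lambda key: means[key], reverse=True)
    return {key: rank for rank, key in enumerate(ordered, start=1)}


# Invented values for three hypothetical models over three years.
phi_table = {
    ("LR", ("AGE",)): {2003: 0.30, 2004: 0.40, 2005: 0.50},
    ("ETC", ("AGE", "PSR")): {2003: 0.60, 2004: 0.62, 2005: 0.58},
    ("SVC", ("PSR",)): {2003: 0.50, 2004: 0.45, 2005: 0.55},
}
ranks = rank_models(phi_table)
print(ranks)
```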

3. Results

As stated earlier, the current study evaluated the use of MLAs for modeling the spatial distribution of forest damage. A simple spatial dispersion model was used (Equation (1)). The locations of damaged forest areas were modeled with the machine learning classifiers.

The top FDM (i.e., the FDM with the highest mean phi coefficient for the period 2003–2012) was selected for each studied MLA (Table 4). As suspected, the optimal combination of input explanatory variables differed among the top FDMs.

The arithmetic mean and median of phi coefficients were higher for the top ETC-based model with five explanatory variables than for the other top FDMs for the period 2003–2012 (Table 4, Figure 3 and Figure 4). The influence of MLA on the mean phi coefficient ($\bar{\phi}$) of the top FDMs was significant (one-way repeated measures ANOVA, p-value < $2.2 \times 10^{-16}$). Results of the permutation pairwise t-test of the top FDMs’ $\phi$ are presented in Table 5.

Overall performance as indicated by the mean phi coefficient was slightly lower for the best RFC-based model than for the best ETC-based model. However, the performances of these two best models were not significantly different. The performance of each of the other top FDMs was significantly different from that of the best ETC- and RFC-based models (Table 5). Moreover, as evident in Figure 5, most of the ETC-based FDMs outperformed the FDMs generated by the other MLAs.

The performance of the FDMs generated by SVC was highly variable (Figure 5). SVC generated many low-performance FDMs as well as a few models that performed better than the models generated by the other tested MLAs except for those generated by ETC and RFC (Table 4). The performance of the best SVC-based model was significantly different from the performance of the other best FDMs (Table 5).

The mean TPRs were highest for the SVC-, QDA-, and GNB-based models (Figure 6). These models were highly sensitive because they significantly overestimated the area of damaged forest. On the other hand, SVC also generated several FDMs with low specificities (Figure 7).

We then compared the performance of the FDMs when the number of explanatory variables was held constant at each value from one to eight. Regardless of the number of explanatory variables, performance was always best for the models generated by the ETC, RFC, and SVC MLAs (Table 6). Performance of FDMs generated by most of the MLAs increased with the number of variables, except in the case of SVC- and KNC-based models (Appendix A).

The mean phi coefficient ($\bar{\phi}$) of ETC-based models generally increased as the number of explanatory variables increased (Figure A1). The computer simulations showed that the ETC-based model that included distance to actual forest damage, potential solar radiation, spruce forest age, percentage of spruce in forest stands, and volume per hectare had the highest mean accuracy and phi coefficient. Nearly the same performance, however, was achieved by ETC-based models that included different explanatory variables (Table A1).

4. Discussion

Given climate change and the increased emphasis on ecological and cultural services rather than on the economic services provided by forest ecosystems, new forest management strategies are needed. Forest damage by abiotic and biotic factors not only causes timber loss, but also has negative effects in terms of forest diversity, soil erosion, landslides, recreation, and landscape aesthetics. The spatial pattern of forest damage is driven by complex spatio-temporal environmental processes. Rammer and Seidl [36] have shown the effectiveness of MLAs for the spatial prediction of I. typographus infestations in unmanaged forests.

We found that FDMs based on traditional linear and non-linear methods generally performed less well than FDMs based on classification trees. LR and LDA are commonly used linear classification MLAs. Hernandez et al. [37] developed a spatial logistic regression model of the probability of bark beetle (Dendroctonus frontalis Zimmermann) attack in coniferous forests; the overall accuracy of the model was 68.7%. LR, which is a special case of the generalized linear model, models the log-odds as a linear function of the explanatory variables; its coefficients are usually estimated by maximizing the likelihood rather than by minimizing the sum of squared residuals. LDA minimizes the probability of misclassification by maximizing the separation between the classes. LR and LDA create linear class boundaries in the space of the explanatory variables. LDA is less sensitive than LR to correlations between explanatory variables.

QDA uses class-specific covariance matrices and separates classes by quadratic surfaces in the predictors’ space. It is sensitive to collinearity between explanatory variables within a class. Because of a higher complexity of the discriminant function, QDA may perform better than linear MLAs. KNC calculates class probability as the proportion of the class in the set of k-closest neighbors from the training data. KNC is susceptible to measurement scales and local over-fitting. GNB estimates prior and conditional probabilities from the training set and then uses Bayes’s rule to calculate the probability of outcome class. An important assumption of GNB is the independence of explanatory variables, which is rarely satisfied in practice.
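The k-nearest neighbors rule described above can be written out directly. The sketch below is a plain-Python illustration of the class-proportion idea, not the scikit-learn implementation used in the study; the function name and the toy data are hypothetical.

```python
from math import dist


def knn_class_probability(train_points, train_labels, query, k=5):
    """Estimate P(class = 1) as the share of class-1 labels among the k nearest points."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest = order[:k]
    return sum(train_labels[i] for i in nearest) / len(nearest)


# Made-up one-dimensional training data (e.g., a scaled distance variable).
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels = [1, 1, 0, 0, 0]
print(knn_class_probability(points, labels, (0.5,), k=3))
```

Because the rule depends on raw Euclidean distances, a variable measured in meters can dominate one measured in percent, which is the sensitivity to measurement scales noted above.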

Only the performance of the top SVC-based FDM was significantly different from the best models generated by the other tested linear and non-linear algorithms. SVC is considered one of the most flexible and effective MLAs and is widely used. It belongs to the family of kernel methods, which allow the separation of classes by non-linear boundaries.

In our study, the FDMs generated by classification trees performed better than those generated by linear and non-linear MLAs. Classification trees describe patterns in data by complex hierarchies of simple rules. The performance of a single classification tree is usually weak. The performances of DTC-based FDMs were inferior to those of the FDMs generated by MLAs based on an ensemble of classification trees.

An ensemble of classification trees usually performs better than a single classification tree, because an ensemble can detect more complex patterns in the data. The RFC MLA is an ensemble of classification trees that are trained on bootstrap samples. The randomness of the tree construction process reduces correlations between trees. Mi et al. [38] investigated Stochastic Gradient Boosting, RFC, CART (Classification and Regression Tree), and MaxEnt (Maximum Entropy) for modeling the distribution of three crane species. The RFC-generated species distribution models were found to be more reliable and accurate than the models generated by the other algorithms in their study.

The process of GBC ensemble construction is iterative. The newly constructed classification tree of the GBC ensemble is forced to learn unexplored data. However, the boosting process was not effective in the case of FDMs. In our study, GBC-based FDMs performed less well than the RFC- and ETC-based models.

The best results in our study were achieved by FDMs generated by ETC. In the ETC ensemble, classification trees are trained on all samples. A randomly selected explanatory variable and a random value are used to split the nodes of the classification trees [39]. ETC-based models were responsive to the spatial variance of environmental conditions and accurately modeled the spatial distribution of forest damage. In addition, ETC-based models are computationally efficient.
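The random node split described above can be illustrated with a deliberately simplified Python sketch. It draws one random variable and one random cut-point per node, following the description in the text; practical implementations typically draw several such candidate splits and keep the best one, and none of the names below come from scikit-learn.

```python
import random


def random_split(rows, rng):
    """Split rows on a randomly chosen variable at a random cut-point."""
    n_features = len(rows[0])
    feature = rng.randrange(n_features)
    values = [row[feature] for row in rows]
    # The cut-point is drawn uniformly between the observed minimum and maximum.
    threshold = rng.uniform(min(values), max(values))
    left = [row for row in rows if row[feature] < threshold]
    right = [row for row in rows if row[feature] >= threshold]
    return feature, threshold, left, right


# Made-up samples with two explanatory variables each.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
feature, threshold, left, right = random_split(rows, random.Random(1))
print(feature, threshold, len(left), len(right))
```

Because no search over cut-points is needed, each split is cheap, which is consistent with the computational efficiency noted above.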

Default settings were used for testing the MLAs in the current study. These default settings were selected by developers based on their knowledge of algorithms. Although these settings provide a good starting point for experiments with MLAs, optimization of settings may improve the performance of the tested algorithms.

Input variables for FDMs vary over space and time. Field measurements and access to historical records or remote sensing data are needed to prepare corresponding inputs for spatial distribution models. PSR, NDVI, and distance to existing forest damage areas can be calculated for past periods. Archives of satellite images can be used to calculate NDVI and to identify bark beetle infestations for past years. Thanks to the archiving of satellite images, it should be possible to identify past and current infestations and to build spatial distribution models that predict future forest damage.

We tested all combinations of the explanatory variables as input to the studied MLAs. However, only eight explanatory variables were available for our study area (Table 1). As indicated in Table 4 and Table A1, some explanatory variables occurred more often than other explanatory variables in FDMs. The most common explanatory variables in the top FDMs were distance to damaged forest, potential solar radiation, forest age, and spruce volume per hectare (Table A1). These variables may carry substantial information for modeling the spatial distribution of forest damage [40,41,42,43].

5. Conclusions

In this study, we evaluated the performance of 10 MLAs and combinations of eight input variables for the spatial modeling of spruce forest damage. Our computer simulations confirmed the suitability of the ETC MLA for modeling the spatial distribution of spruce forest damage. We also found that the number of input explanatory variables could be reduced without a significant loss of spatial modeling accuracy. A smaller number of input explanatory variables simplifies data preparation and processing and therefore reduces financial costs.

Various MLAs are currently ready for use in spatial decision support systems. We evaluated the MLAs available in the open-source GRASS GIS. The identification of suitable MLAs and key environmental variables is an essential step in the development of GIS tools for forest damage modeling and prognosis. Our findings will facilitate the development of the open-source spatial decision system TANABBO for modeling the spatial distribution of forest damage related to I. typographus.

The risk of forest damage is affected by many interrelated environmental factors, forest stand parameters, and forest management practices. MLAs are effective for modeling non-linear, complex phenomena like the spatial distribution of forest damage. This study provides researchers and forest managers with an accurate method of modeling spatial forest damage. The integration of the FDM with a spatial decision support system will facilitate novel tools for managing forests.

Author Contributions

Conceptualization, M.K., R.J., and M.B.; methodology, M.K., R.J., J.H., R.Ď., and M.B.; software, M.K.; investigation, M.K., R.J., J.H., and M.B.; data curation, I.B.; writing—original draft preparation, M.K., R.J., and M.B.; writing—review and editing, M.K., R.J., M.Z., I.B., J.H., R.Ď., and M.B.; visualization, R.Ď. and M.Z.; funding acquisition, M.K., R.J., J.H., and R.Ď. All authors have read and agreed to the published version of the manuscript.

Funding

This was a cooperative study and benefited from the project Comprehensive research of mitigation and adaptation measures to diminish the negative impacts of climate changes on forest ecosystems in Slovakia (FORRES), ITMS 313011T678, Operational Programme Integrated Infrastructure (OPII) funded by the ERDF. The work was also supported by grant no. QK1920433 of the Ministry of Agriculture of the Czech Republic, by the Slovak Research and Development Agency grant no. APVV-15-0761, and grants no. VEGA 2/0176/17 and VEGA 1/0300/19 of the Scientific Grant Agency of the Ministry of Education, Science, Research, and Sport of the Slovak Republic and the Slovak Academy of Sciences.

Acknowledgments

The authors are grateful to the Military Forests and Farms of the Czech Republic, state enterprise, for its cooperation. The authors thank Bruce Jaffee (USA) for linguistic and editorial improvements.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

Figure A1. Box plot of mean phi coefficients ($\bar{\phi}$) of the FDMs generated by ETC as affected by the number of explanatory variables. Box plot shows median plus upper and lower quartiles for $\bar{\phi}$