Integrating Hierarchical Statistical Models and Machine-Learning Algorithms for Ground-Truthing Drone Images of the Vegetation: Taxonomy, Abundance and Population Ecological Models

Damgaard, Christian

doi:10.3390/rs13061161

Open Access, Feature Paper, Technical Note

Department of Bioscience, Aarhus University, Vejlsøvej 25, 8600 Silkeborg, Denmark

Remote Sens. 2021, 13(6), 1161; https://doi.org/10.3390/rs13061161

Submission received: 25 January 2021 / Revised: 10 March 2021 / Accepted: 17 March 2021 / Published: 18 March 2021

(This article belongs to the Special Issue Feature Paper Special Issue on Ecological Remote Sensing)

Abstract

In order to fit population ecological models, e.g., plant competition models, to new drone-aided image data, we need to develop statistical models that take into account the new type of measurement uncertainty that arises when machine-learning algorithms are applied, and that quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image-predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method for statistically modeling known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Keywords: measurement uncertainty; machine-learning algorithms; plant competition models; hierarchical statistical model

1. Introduction

Using drones that record multi-spectral photography and LIDAR, it has now become possible to obtain spatio-temporal ecological data at a fine-scaled resolution. These new data collection possibilities provide a quantum leap compared to earlier methodologies for monitoring ecological processes, e.g., competitive plant growth [1,2]. However, in order to use the drone-aided image data types for modeling plant ecological processes, there is a need to develop statistical models that are especially tailored towards these new image data types [3].

Plant competition is a population ecological process where plant growth is reduced by the presence of neighboring plants. When investigating interspecific interactions in light-open vegetation, the population growth of a species is modeled as a function of the local abundance of other species [4,5,6]. Previously, plant competitive interactions have been modeled using non-destructive measurements of plant abundance, e.g., using pin-point data, where the vertical density (number of times a plant species is touched by a thin pin) is recorded several times during the growing season in permanent plots. Vertical density is correlated to plant biomass [7,8], and plant growth and interspecific interactions may, consequently, be estimated from repeated pin-point measurements of vertical density [6,9,10,11]. However, it is now possible to radically upscale the non-destructive measurements of plant abundance by repeated drone-aided recordings of multi-spectral and LIDAR image data of the vegetation. The new image data encompass vast possibilities, but also a new challenge. Compared to pin-point data, which are collected by persons trained in plant taxonomy, the new image data come without plant taxonomic information or abundance measures.

Currently, image data from drones are being collected in several plant ecological laboratories, and valuable experience on how to recognize plant species is being collected. It is a natural choice to use machine-learning algorithms for fitting the information from the new image data to observed ground truth of species taxonomy or abundance and, currently, research is focused on how best to use such machine-learning algorithms for predictive purposes in plant ecology [1,12,13].

In the coming years, applied research in predicting the effect of anthropogenic environmental changes on plant community and ecosystem dynamics will surely become more important [14]. Consequently, it is expected that the construction of plant ecological predictions generated by applying machine-learning algorithms on drone-aided image data will be of increasing importance, and it is imperative that such predictions include estimates of prediction uncertainties that are rooted in sampling theory [15,16,17].

The aim of this study is to outline the principles for using machine-learning algorithms for fitting empirical population ecological models, e.g., competition models, with a known degree of uncertainty. This objective will be met by specifying statistical models that will allow us to quantify the possible bias and uncertainties of species identification and abundance predictions obtained by machine-learning algorithms, so that image data may be used to fit population ecological models with a known degree of uncertainty.

Here, it is proposed to use the confusion matrix of the chosen machine-learning algorithm for quantifying the uncertainty when identifying species taxonomy and integrate a Bayesian hierarchical modeling approach with machine-learning algorithms for quantifying the uncertainty when estimating species abundance. In this study, the proposed general statistical model will be outlined and tentatively specified with suggested relevant statistical distributions. The developed statistical models are needed for fitting population ecological models of plant communities and making quantitative ecological predictions of plant community and ecosystem dynamics, including quantitative assessments of the process or structural uncertainty.

The outline of the manuscript is to present typical non-destructive ground truth data of plant abundance, followed by a brief account of the use of image data and machine-learning algorithms for predicting plant abundance, a detailed proposal of how to model the uncertainty of plant abundance data and how this uncertainty may be integrated into plant ecological models. Finally, the method will be discussed.

2. Methods and Models

2.1. Pin-Point Data—Vertical Density

In a number of ground-truthing plots at a natural or semi-natural habitat site with light-open vegetation, plant species taxonomic identity and abundance are determined by the non-destructive pin-point method [7,8]. A pin-point frame with n grid points is placed in the vegetation and the position of the frame is recorded using high-accuracy GPS. At each grid point, a thin pin is inserted into the vegetation and the sequence in which different plant species touch the pin is recorded. Such sequence pin-point data allow the determination of several derived plant abundance measures, e.g., cover, top cover and vertical density at the spatial resolution of a single pin or the plot. Furthermore, it is possible to aggregate the species data to higher taxonomic levels or species groups at the pin level.

Depending on the vegetation and the studied ecological question, various measures of plant abundance may be relevant, but here, we will focus on the vertical density at the spatial level of the plot. Importantly, it is assumed that the pin-point measure of vertical density is an unbiased sample of the true, but unknown, vertical density.
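As an illustration of how plot-level abundance measures may be derived from sequence pin-point data, the following minimal Python sketch computes vertical density (total number of pin touches per species) and cover (fraction of pins touched at least once by a species). The data layout, the function names and the example species lists are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter

def vertical_density(pin_hits):
    """Total number of pin touches per species in one plot.

    pin_hits: one entry per grid point; each entry is the sequence of
    species names touched by the pin (hypothetical data layout).
    """
    counts = Counter()
    for hits in pin_hits:
        counts.update(hits)
    return dict(counts)

def cover(pin_hits):
    """Fraction of pins touched at least once by each species."""
    n = len(pin_hits)
    touched = Counter()
    for hits in pin_hits:
        touched.update(set(hits))  # count each species once per pin
    return {species: k / n for species, k in touched.items()}

# Hypothetical plot with n = 4 grid points
plot = [
    ["Calluna", "Deschampsia", "Calluna"],
    ["Deschampsia"],
    [],
    ["Calluna", "Calluna"],
]
vd = vertical_density(plot)
cv = cover(plot)
```

In this example, the vertical density counts repeated touches of the same species at one pin, whereas cover only records presence at the pin.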

2.2. Machine-Learning Algorithms of Image Data

A drone was used to record multi-spectral images and LIDAR data of the site with the ground-truthing plots at a resolution that is sufficient to compare the image data with the pin-point data. Using standard image software [18], a 3D model of the site was constructed and the information of the different bands was summarized at the approximate position of each pin in the pin-point frame. Using supervised machine-learning algorithms, the taxonomic identity and vertical density of each species were predicted from the image data at the spatial resolution of the plot [1,12].

The species taxonomic identity is predicted from the information in the multi-spectral image bands as well as information on texture etc. [19,20]. In species-rich plant communities, it is to be expected that not all species can be distinguished with sufficient accuracy, and species that cannot be reliably distinguished may be aggregated into a common species group. Since the overall objective of the proposed statistical method is to fit plant population ecological models, it is more important that all plants are accounted for than that each species is identified precisely. Furthermore, when constructing plant population ecological models of species-rich plant communities, it is typically necessary to aggregate plant species into plant species groups or functional groups. In the following, the term species may either mean a single plant species or an aggregated group of plant species.

The vertical density of each species is predicted using the 3D modeling of the vegetation and LIDAR data. It is assumed that the vertical density predicted from the image data may be a biased sample of the true, but unknown, vertical density, and that the direction and magnitude of the bias is species-specific.

2.3. Statistical Models

By aggregating species with similar image information, using different auxiliary information, e.g., time series image data, and different supervised machine-learning algorithms, it will be possible to maximize the probability of correct species identification. However, there will always be a non-zero probability of false identification. The probabilities of assigning an entity of vertical density of one species to each of the possible species are collected in a confusion matrix, which is a stochastic matrix, or transition matrix, where each row sums to one. If all species are correctly identified, then the confusion matrix is the identity matrix. The parameters in the confusion matrix are fitted using the data from the ground-truth plots and are, consequently, susceptible to sampling errors; thus, here, it is assumed that each row in the confusion matrix is distributed according to a Dirichlet distribution ($M_1$):

$$M_1: \quad p_i \sim \mathrm{Dir}(\alpha_i) \qquad (1)$$

where $p_i$ is a row vector of $p_{ik}$, which represents the probabilities of classifying species i as species k, and $\alpha_i$ is a row vector of $\alpha_{ik}$, which represents the number of times species i is categorized as species k by the supervised machine-learning algorithm [21].
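The sampling uncertainty of the confusion matrix may be illustrated by drawing each row from a Dirichlet distribution parameterized by the classification counts. The following Python sketch uses NumPy; the count matrix is invented for illustration, and the pseudo-count `prior` is an assumption of this sketch (not part of the model above) that keeps all Dirichlet parameters strictly positive when a count is zero.

```python
import numpy as np

def draw_confusion_matrix(counts, rng, prior=1.0):
    """Draw one confusion matrix, row by row, from Dirichlet distributions.

    counts[i, k]: number of ground-truth entities of species i that the
    classifier labelled as species k.
    prior: hypothetical pseudo-count (an assumption of this sketch) that
    keeps every Dirichlet parameter strictly positive.
    """
    alpha = np.asarray(counts, dtype=float) + prior
    return np.vstack([rng.dirichlet(row) for row in alpha])

rng = np.random.default_rng(1)
# Invented classification counts for three species (rows: true species)
counts = np.array([[45, 4, 1],
                   [6, 38, 6],
                   [0, 9, 41]])
p = draw_confusion_matrix(counts, rng)
row_sums = p.sum(axis=1)  # each row of a stochastic matrix sums to one
```

Repeated calls give different matrices, and the spread between draws reflects the number of ground-truth entities behind each row.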

The hierarchical model for determining the uncertainty of the vertical density measured by the drone images is outlined in Figure 1. The true, but unknown, vertical density of species $i$ at plot $j$ is denoted $x_{ij}$. The vertical density of species $i$ at plot $j$ observed by the pin-point method is denoted $y_{ij}$, and assumed to be distributed according to a generalized Poisson distribution ($M_2$) with mean parameter $x_{ij}$ and a species-specific scale parameter $\rho_i$ [9,22]. The predicted vertical densities from the image data at the level of the plot are denoted $m_{ij}$ and assumed to be distributed according to a reparametrized gamma distribution ($M_3$) with mean $x_{ij} + \tau_i x_{ij}$, where $\tau_i$ is a species-specific bias parameter and $\nu_i$ is a species-specific scale parameter:

$$M_2: \quad y_{ij} \sim \mathrm{GP}(x_{ij}, \rho_i) \qquad (2)$$

$$M_3: \quad m_{ij} \sim \mathrm{Gamma}(x_{ij} + \tau_i x_{ij}, \nu_i) \qquad (3)$$
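The role of the bias parameter in M3 may be illustrated by simulation. The exact reparametrization of the gamma distribution is not spelled out here, so the Python sketch below assumes one common choice: a gamma distribution with mean $\mu = x + \tau x$ and scale $\nu$, i.e., shape $\mu / \nu$. The function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_image_density(x, tau, nu, rng, size=None):
    """Simulate image-predicted vertical density around a true density x.

    Assumed reparametrization (not given in the text): a gamma distribution
    with mean mu = x + tau * x and scale nu, so that shape = mu / nu.
    """
    mu = x + tau * x
    return rng.gamma(shape=mu / nu, scale=nu, size=size)

rng = np.random.default_rng(42)
x_true, tau, nu = 10.0, 0.2, 0.5   # hypothetical values
m = simulate_image_density(x_true, tau, nu, rng, size=100_000)
sample_mean = m.mean()             # close to x_true * (1 + tau) = 12
```

Under this assumption, a positive $\tau$ corresponds to image predictions that systematically overestimate the true vertical density, and a negative $\tau$ to underestimation.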

The idea is now to fit the measurement equations $M_1$ and $M_3$ to the information in the ground-truthing plots and keep these fitted measurement equations fixed when fitting the plant population ecological models to the image data of the whole site.

2.4. Population Ecological Modeling Using Image Data

The ultimate aim of the statistical models is to be able to fit plant population ecological models, e.g., competition models, to time-series image data with a known degree of uncertainty. Following a discrete Lotka-Volterra competition model and earlier population ecological modeling studies, where interspecific interactions are modeled using pin-point abundance data [6,9,10,11], the following general species interaction modeling framework may be followed:

$$x_{i,t+1} = f_i(x_{i,t}) \sum_j \exp(-c_{ij} x_{j,t}) \qquad (4)$$

where $f_i$ is a species-specific growth function in the absence of interspecific interactions and $c_{ij}$ measures the competitive effect of species j on the growth of species i.
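To make the process equation concrete, the following Python sketch evaluates one time step of Equation (4) exactly as written, with the sum taken over all species j. The linear growth function, the growth rates `r`, the competition coefficients and the starting densities are hypothetical choices made only for illustration; the text leaves the form of the growth function open.

```python
import numpy as np

def step(x, c, growth):
    """One time step of Equation (4) as written in the text.

    x: vector of vertical densities x_{j,t}
    c: matrix of competition coefficients c_{ij}
    growth: callable returning f_i(x_{i,t}) for all species at once
    """
    # Row i holds exp(-c_ij * x_j) for all j; summing over j gives the
    # interaction term of species i.
    interaction = np.exp(-c * x[np.newaxis, :]).sum(axis=1)
    return growth(x) * interaction

# Hypothetical two-species example with linear growth f_i(x) = r_i * x
r = np.array([1.2, 1.1])
c = np.array([[0.0, 0.1],
              [0.2, 0.0]])
x_t = np.array([5.0, 3.0])
x_next = step(x_t, c, lambda x: r * x)
```

Iterating `step` over time gives a predicted trajectory of vertical densities for a given set of parameters.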

The population ecological model (Equation (4)) may now be applied on a selected “vegetation plot” l that has the same size as the ground-truthing plots, but where only image data are available. The model (Equation (4)) is the process equation in a hierarchical model, where the measurement equation of the true, but unknown, vertical density of species i in a selected “plot” l at time t, $x_{il,t}$, is specified by the predicted vertical density $m_{il,t}$ and the fitted $M_3$ (Figure 2).

The output of running the selected machine-learning algorithms on the image data of plot l is a vector, $m_l$, where each element in the vector contains the predicted species identity and the corresponding predicted vertical density of that species in the plot. However, the species identity is determined with some uncertainty from the image data, and this uncertainty needs to be included in the uncertainty of the population ecological modeling. This uncertainty is proposed to be included when fitting the model using a numerical MCMC procedure by drawing $m_l^d$ during the model fitting procedure according to $m_l$ and the fitted $M_1$. More specifically, for each entity of vertical density in $m_l$, a new species identity is randomly drawn using the fitted $M_1$, and the resulting vertical densities are collected by their drawn species identity into the matrix $m_l^d$. The frequency of drawing a new $m_l^d$ may be set to every 100th MCMC iteration, but the sensitivity of the overall convergence properties of the MCMC to this frequency setting must be checked by visual inspection of the sampling chains.
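The redrawing step may be sketched as follows in Python. The sketch rests on assumptions that are not specified in the text: the predicted vertical densities are treated as integer counts of entities, and each entity predicted as species i receives a new identity drawn with the probabilities in row i of a confusion matrix (for example, one draw from the fitted M1). The function name and all numbers are hypothetical.

```python
import numpy as np

def redraw_plot(m_l, p, rng):
    """Redistribute predicted vertical density using a confusion matrix.

    m_l[i]: predicted number of vertical-density entities of species i
            in plot l (integer counts are an assumption of this sketch).
    p[i, k]: probability of classifying species i as species k.
    Returns the densities collected by their drawn species identity.
    """
    m_d = np.zeros(len(m_l), dtype=int)
    for i, n_i in enumerate(m_l):
        # Draw a new identity for each of the n_i entities of species i
        m_d += rng.multinomial(n_i, p[i])
    return m_d

rng = np.random.default_rng(7)
# Invented confusion matrix and predicted densities for three species
p = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
m_l = np.array([12, 5, 8])
m_d = redraw_plot(m_l, p, rng)
```

The total vertical density in the plot is preserved by the redraw; only its allocation among species changes. In an MCMC setting, a call of this kind would be made at the chosen redraw frequency.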

3. Discussion

Generally, when making ecological predictions, it is important that the measurement and sampling uncertainty is taken into account, e.g., by the explicit modeling of the error due to measurement and sampling in a hierarchical model; otherwise, the predictions may be biased due to regression dilution [11]. More specifically, such prediction biases have been demonstrated when omitting measurement errors in plant competition models [11].
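The effect of regression dilution may be illustrated with a simple simulation. The Python sketch below uses an ordinary linear regression rather than a plant competition model, and all values are invented: when the covariate is observed with measurement error, the naively estimated slope is attenuated by the factor var(x) / (var(x) + var(error)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)              # true covariate, variance 1
y = 2.0 * x + rng.normal(0.0, 0.5, n)    # response with true slope 2
x_obs = x + rng.normal(0.0, 1.0, n)      # covariate with measurement error

# np.polyfit returns coefficients with the highest degree first
slope_true = np.polyfit(x, y, 1)[0]       # close to 2
slope_naive = np.polyfit(x_obs, y, 1)[0]  # close to 2 * 1 / (1 + 1) = 1
```

Ignoring the measurement error thus halves the estimated effect in this example, which is the type of bias that explicit modeling of the measurement process in a hierarchical model is meant to avoid.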

In the outlined hierarchical modeling framework, it is demonstrated how measurement and sampling uncertainty may be modeled when fitting population ecological models to drone image data. The chosen statistical distributions ($M_1$, $M_2$ and $M_3$) are natural choices for modeling the statistical uncertainty of the different stochastic processes and, except for $M_3$, they have been applied in a number of empirical studies [6,9,10,11]. However, the outlined modeling concept is general, and alternative specifications of the suggested statistical distributions may be relevant in other cases. For example, the bias correction in $M_3$ is suggested to be proportional to the vertical density, but if more detailed information on the bias is available, then this information should, of course, be used to specify $M_3$ [15].

The reason for choosing vertical density obtained by the pin-point method as the measure of plant abundance in the ground-truthing plots is three-fold: (i) vertical density is a non-destructive measure of plant abundance that has been shown to be correlated with plant biomass [7], (ii) the vertical density measure has previously been shown to be useful for fitting plant population ecological models [6,9,10,11] and (iii) it is possible to aggregate the abundance of single species into the abundance of species groups or plant functional types [23]. However, other measures of plant abundance with similar characteristics may be used instead, in which case the statistical distribution used in $M_2$ should be modified accordingly.

Generally, it is important that abundance measures allow for the aggregation of abundances across species groups, e.g., counts of individuals, biomass or vertical density. In species-rich communities, it will not be practically possible, or even desirable, to construct dynamic population ecological models where all species are accounted for individually. Instead, it is important to construct taxonomic or ecologically meaningful species groups that allow the results of population ecological models to be generalized across sites [24]. This necessity to group species may be compared to the plant trait-based approach of summarizing the ecological functions of local plant communities by the mean and variances of selected plant traits [23,25].

It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models. In order to meet this requirement, a number of actions can be applied: (i) use time-series image data to identify species-specific changes in the image data, (ii) aggregate species with similar characteristics in the image data into a species group and (iii) only select plots with species groups that are clearly distinct in the image data for population ecological modeling. Regarding the latter suggestion, note that in the population ecological modeling of plant competitive growth, it is not necessary to include all plots or a random selection of plots in the fitting process. Instead, it is a valid approach to select plots and model competitive interactions where species of particular interest are locally coexisting [9].

In this study, we have focused on how to model the uncertainty when fitting population ecological models to drone image data, for example, when studying the effect of environmental gradients (e.g., nitrogen deposition, precipitation, grazing intensity and herbicide drift) on plant population growth. The general outcome of such an analysis will be the joint posterior probability distribution of the model parameters, which may be used to test hypotheses and make quantitative predictions on the effect of the studied environmental gradient [15]. Such results may be important for predicting the effect of anthropogenic environmental changes on plant community and ecosystem dynamics, and for recommending mitigation strategies for possible negative effects of such changes.

Funding

The work is supported by an AnaEE grant.

Conflicts of Interest

The author declares that he has no conflict of interest.

References

1. Tay, J.Y.L.; Erfmeier, A.; Kalwij, J.M. Reaching new heights: Can drones replace current methods to study plant population dynamics? Plant Ecol. 2018, 219, 1139–1150.
2. Jiménez López, J.; Mulero-Pázmány, M. Drones for Conservation in Protected Areas: Present and Future. Drones 2019, 3, 10.
3. Joseph, M.B. Neural hierarchical models of ecological populations. Ecol. Lett. 2020, 23, 734–747.
4. Rees, M.; Grubb, P.J.; Kelly, D. Quantifying the Impact of Competition and Spatial Heterogeneity on the Structure and Dynamics of a Four-Species Guild of Winter Annuals. Am. Nat. 1996, 147, 1–32.
5. Adler, P.B.; HilleRisLambers, J.; Levine, J.M. Weak effect of climate variability on coexistence in a sagebrush steppe community. Ecology 2009, 90, 3303–3312.
6. Damgaard, C.; Riis-Nielsen, T.; Schmidt, I.K. Estimating plant competition coefficients and predicting community dynamics from non-destructive pin-point data: A case study with Calluna vulgaris and Deschampsia flexuosa. Veg. Acta Genet. 2008, 201, 687–697.
7. Jonasson, S. The point intercept method for non-destructive estimation of biomass. Phytocoenologia 1983, 11, 385–388.
8. Jonasson, S. Evaluation of the Point Intercept Method for the Estimation of Plant Biomass. Oikos 1988, 52, 101.
9. Damgaard, C.; Strandberg, B.; Mathiassen, S.K.; Kudsk, P. The effect of glyphosate on the growth and competitive effect of perennial grass species in semi-natural grasslands. J. Environ. Sci. Health Part B 2014, 49, 897–908.
10. Ransijn, J.; Damgaard, C.; Schmidt, I.K. Do competitive interactions in dry heathlands explain plant abundance patterns and species coexistence? Veg. Acta Geobot. 2014, 216, 199–211.
11. Damgaard, C.; Weiner, J. The need for alternative plant species interaction models. J. Plant Ecol. 2020.
12. Sun, Y.; Liu, Y.; Wang, G.; Zhang, H. Deep Learning for Plant Identification in Natural Environment. Comput. Intell. Neurosci. 2017, 2017, 1–6.
13. Karami, M.; Westergaard-Nielsen, A.; Normand, S.; Treier, U.A.; Elberling, B.; Hansen, B.U. A phenology-based approach to the classification of Arctic tundra ecosystems in Greenland. ISPRS J. Photogramm. Remote Sens. 2018, 146, 518–529.
14. National Research Council. Models in Environmental Regulatory Decision Making; The National Academies Press: Washington, DC, USA, 2007.
15. Jaynes, E.T.; Bretthorst, G.L. Probability theory: The logic of science. Math. Intell. 2005, 27, 83.
16. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/hierarchical Models; Cambridge University Press: New York, NY, USA, 2006; ISBN 1-139-46093-5.
17. Cressie, N.; Calder, C.A.; Clark, J.S.; Hoef, J.M.V.; Wikle, C.K. Accounting for uncertainty in ecological analysis: The strengths and limitations of hierarchical statistical modeling. Ecol. Appl. 2009, 19, 553–570.
18. AgiSoft. AgiSoft PhotoScan Professional, Version 1.2.6; 2016. Available online: https://www.agisoft.com/forum/index.php?topic=9513.0 (accessed on 17 March 2021).
19. Wäldchen, J.; Mäder, P. Plant Species Identification Using Computer Vision Techniques: A Systematic Literature Review. Arch. Comput. Methods Eng. 2018, 25, 507–543.
20. Wäldchen, J.; Rzanny, M.; Seeland, M.; Mäder, P. Automated plant species identification—Trends and future directions. PLoS Comput. Biol. 2018, 14, e1005993.
21. Frigyik, B.A.; Kapila, A.; Gupta, M.R. Introduction to the Dirichlet Distribution and Related Processes; UWEE Technical Report; University of Washington: Washington, DC, USA, 2010.
22. Damgaard, C. Measurement Uncertainty in Ecological and Environmental Models. Trends Ecol. Evol. 2020, 35, 871–873.
23. Yang, Y.; Zhu, Q.; Peng, C.; Wang, H.; Chen, H. From plant functional types to plant functional traits: A new paradigm in modelling global vegetation dynamics. Prog. Phys. Geogr. Earth Environ. 2015, 39, 514–535.
24. Díaz, S.; Cabido, M. Plant Functional Types and Ecosystem Function in Relation to Global Change. J. Veg. Sci. 1997, 8, 463–474.
25. Garnier, E.; Navas, M.L.; Grigulis, K. Plant Functional Diversity. Organism Traits, Community Structure and Ecosystem Properties; Oxford University Press: Oxford, UK, 2016.

Figure 1. Outline of the hierarchical model for determining the uncertainty of the vertical density measured by the drone images. The true, but unknown, vertical density of species $i$ in ground-truthing plot $j$ is modeled by the latent variable $x_{ij}$. The posterior distribution of the latent variable is calculated using both (i) the vertical density predicted from the information from the drone images using machine-learning algorithms ($m_{ij}$), which is modeled using $M_3$, and (ii) the vertical density measured by the pin-point method ($y_{ij}$), which is modeled using $M_2$.

Figure 2. Hierarchical population ecological model fitted to image data from a selected “vegetation plot” l that has the same size as the ground-truthing plots, but where only image data are available. The true, but unknown, vertical density of species $i$ at time $t$ is modeled by the latent variable $x_{il,t}$ and the solid arrows are the process equation (Equation (4)). The dashed arrows are the fitted measurement equations ($M_3$) that link the vertical density predicted from the information from the drone images ($m_{ij}$) to the latent variables.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
