1. Introduction
Extensive grasslands, especially at nature conservation sites, are important habitats for multiple endangered species [
1]. Thereby, they have a key role in supporting biodiversity, [
2]. Besides biodiversity, there are many other valuable ecosystem services, which are provided by extensive grasslands, such as soil carbon storage, forage production for ruminants, and reduction of soil erosion [
3]. Extensive grasslands are valuable culturally grown landscapes with increasing significance for species to adapt to the effects of climate change and human activities [
4]. Therefore, it should be one of our main goals to preserve such refugium, because changing climate will increase the challenges for species, which are adapted to specific habitat structures and climate conditions, while at the same time, habitats matching species requirements will become rare.
One threat to grassland ecosystems is the spread of invasive species. Until today, there is no saturation in the accumulation of new appearing alien species [
5]. If an invasive species has superior competition advantages, it can rapidly become a dominant species in a habitat, which can be vulnerable to such a degree, that species composition changes drastically and the profile and performance will change in recipient ecosystems, shifting the balance between services and disservices [
6].
One invasive species on the list of the 150 most widespread alien plants is Lupinus polyphyllus [
7]. Originated in pacific north America, it has spread over northern and central Europe [
8,
9]. L. polyphyllus is a perennial legume that has often been introduced to new areas for the purpose of erosion reduction, e.g., in connection with road constructions or to increase the nitrogen pool at agricultural sites [
10,
11].
For effective management, the distribution strategies of
L. polyphyllus have been studied and multiple expansion paths for seeds of
L. polyphyllus have been identified [
12]. From natural distribution and from multiple anthropogenic vectors, management strategies can be formulated to reduce seed transmission to new sites. As it is very difficult to renature already invaded areas [
13] and because of the danger of reinvasion, management should be as effective as possible. Therefore, long-term management, supported by highly accurate monitoring is of great need to detect the actual spread and identify a threat in vulnerable areas with high ecosystem value. Management strategies have to be validated on their potential to evaluate the effectiveness of management measures.
Remote sensing methods are an attractive tool for monitoring grassland ecosystems. Optical sensors can collect electromagnetic radiation reflected from the area of interest and its unique spectral reflectance pattern can be interpreted by machine learning algorithms to inform on physiological plant properties as well as distinguish between different species. For example, Ref. [
14] used WorldView-3 satellite products to extract tree crowns in semi-arid parklands, while [
15] classified dominant tree species in urban areas to estimate carbon stock and [
16] focused on weed detection.
However, challenges rise by the similarities of invading and native species characteristics.
For mapping invasive species like
L. polyphyllus, timing is important [
17]. Plants must be distinguishable through phenologically prominent features like the blossoming stage or at the end of the season after grassland was cut, and regrowth of the invasive species is advanced in height and less senescent compared to its surrounding species.
Traditional cover estimation methods are using human expert knowledge to estimate species cover in the field or use digitizing methods and image data from aerial flights. As digitizing is highly time consuming, interpolation methods are used as well. One way to eliminate the uncertainty of interpolation methods is the use of the computational classification of invasive species by extracting different features from sensor data to train classification models and predict species cover. Training samples thereby represent areas covered by the invasive species (one-class classification (e.g., [
18,
19])) or additional classes that belong to other species or surface types (e.g., multiclass classification [
20])). Further, samples with different percentages of invasive species cover can be collected to increase the degree of detail. The use of sensor-based species detection can reduce working load and increase the precision in cover estimation, especially in large areas.
Mapping efforts of
L. polyphyllus in the Rhön UNESCO Biosphere Reserve were carried out on a large area in the region ‘Leitgraben’ (407 ha) by visual inspection of experts at ground level and aerial imagery [
21] as well as on small areas by unmanned aerial vehicles (UAV) equipped with multiple sensors and computer-based image analysis [
22]. While the acquisition of species cover from human observations at the ground is highly time consuming, UAV-based approaches have their limit in spatial cover due to limited battery capacity. Additionally, drones are a potential disturbance to wildlife fauna [
23] through flight noise and the confusion of certain drone types (especially fixed wing drones) with predator birds [
24]. Even though, the impact of UAVs on wildlife species is uncertain and often not confirmed [
25,
26], wildlife species living in habitats mostly unaffected by human activities have to be considered as increasingly sensitive to disturbances through UAVs. Aerial flight missions, on the other hand, tend to be difficult to plan for a specific day time and season and often exceed the costs of UAV-based missions.
To cover large areas and still use the advantages of accurate spatial image analysis, compared to interpolation methods, satellite data may be used instead. By this, the disadvantages for wildlife disturbance are eliminated, however, limitations by satellites may arise from weather conditions (especially clouds) and a reduced spatial resolution compared to UAV-based data acquisition. Compared to large species (trees and bushes), smaller herbaceous species in grassland environments are much more challenging to detect. A comparison by [
17] of UAV and satellite images stated that sufficiency is dependent on demands deriving from monitoring aims itself (flexibility, spatial resolution, spectral resolution) but as well by sensor’s and platform’s characteristics (financial costs of imagery, weather constraints, legal constraints). Thereby, limits of satellites have been formulated at spatial resolution as well as for acquiring highly dense time series.
Nevertheless, monitoring large areas in Rhön UNESCO Biosphere Reserve with the necessity to reduce the impact on wildlife species leads to the subject of identifying the potential and possible ways of detecting L. polyphyllus from satellite data.
The overall aims of this study were:
Identifying the most promising classification algorithm detecting L. polyphyllus abundance from WorldView-3 satellite data
Comparing classification performance of different numbers of cover classes with variable degrees of L. polyphyllus cover
Evaluating classification performance with different feature selection steps as model input.
2. Materials and Methods
2.1. Study Area
The Rhön UNESCO Biosphere Reserve is a 243.323 ha wide mountain region in central Germany in 600 to 950 m a.s.l. (
Figure 1). Its core zone is characterized by extensively managed grasslands, mainly used as meadows and pastures. This special landscape provides a habitat for multiple endangered plant and animal species. The management strategy of large parts of the grasslands was optimized for the breeding behavior of
Lyrurus tetrix. Therefore, these meadows were not mown in the early summer, which, among other things, supported the spread of
L. polyphyllus in the region.
L. polyphyllus was introduced to this area in the 1930s, as a soil cover for spruce plantations and to stabilize verges [
27]. Due to the abandonment of the non-profitable meadows,
L. Polyphyllus also spread in these grasslands, and from the mid-1990s, monitoring and management efforts have been carried out to control the species. Between 1996 and 2016, in parts of the Rhön UNESCO Biosphere Reserve (‘Leitgraben’), the spread of
L. polyphyllus has doubled in terms of ground cover, both in open grassland areas as well as in areas which are difficult to manage, such as the margins of roads and near cairns build from stones removed from the grassland by farmers [
21].
Its dominant stands have changed the habitat in such a way that ground breeding birds (e.g.,
Lyrurus tetrix and
Crex crex) have lost breeding refugium and others their food supply. At the same time, the dominance of
L. polyphyllus reduced the floral biodiversity, because species with smaller habitus are disadvantaged against the tall growth of
L. polyphyllus single stands and especially large patches. Additionally,
L. polyphyllus can transport nutrients from deeper soil levels upwards to the nutrient poor upper levels. The plant’s ability to fix atmospheric N can lead to higher nitrogen pools, which can lead to modified edaphic conditions, resulting in a loss of biodiversity [
28].
2.2. Overview
Therefore, an eight-band multispectral WorldView-3 satellite image was acquired, and multiple machine learning (ML) methods have been trained and tested on their ability to classify
L. polyphyllus at different degrees of ground cover. ML approaches have shown good capabilities to classify invasive species [
29,
30]. Additional feature selection procedures proved successful to decrease model complexity and increase model performance [
16,
31].
2.3. Satellite Data Acquisition
Satellite data was acquired from WorldView-3, a multi-payload, high-resolution satellite. It provides a spatial resolution of 31 cm for the panchromatic band, and 1.24 m for multispectral bands (
Table 1). The image was taken on 6 August 2020 at 13:44 and covered a 100 km
2 area along the core zone of the Rhön UNESCO Biosphere Reserve, where the ground sampling took place (~50°28′07.6′′ N and 10°02′03.8′′ E). Late summer was chosen for monitoring because at this time almost every meadow was mown, which commonly happens only once in the year, mostly in July. In August,
L. polyphyllus has already regrown, while the surrounding grassland vegetation was still at almost cutting height (cf.
Figure 2).
2.4. Reference Ground Sampling
From 1st to 4th of September 2020, ground truth data were collected in the study area. Therefore, 3 × 3 m
2 flexible frames were used (
Figure 2), which have been divided into 16 equal squares (4 × 4 matrix with 0.75 m length). These frames were randomly distributed within the grasslands with a different ground cover of
L. polyphyllus. For each sample, it was assessed how many of the 16 equal squares were at least partly covered with
L. polyphyllus plants. A total of 219 3 × 3 m
2 ground truth plots were set up for RS data analysis visually by experts in the field with different degrees of lupine contribution (Lupine cover classes from 0–16 which equals 0–100% with 6.25% cover steps). Each sample plot was documented with an RGB handheld camera image, taken from one side of the plot at a ca. 2 m distance. All four corners of each plot were measured with an RTK GNSS with a horizontal accuracy of 2 cm [
32].
2.5. Pre-Processing of Satellite Data
Pre-processing steps were done in QGIS (v. 3.14.16). Calculations of TOA
radiance (Equation (1)) and TOA
reflectance (Equation (2)) (TOA: top of atmosphere) were made with the inbuild field calculator. Information on satellite and flight-specific parameters was given by the satellite data provider.
𝐿λ = Top of atmosphere radiance (spectral radiance at sensor) [W m−2 μm−1 sr−1)]
𝑔𝑎𝑖𝑛λ = Band-specific scaling factor [(W m−2 μm−1 sr−1)/DN]
𝐷𝑁λ = calibrated pixel value [DN]
𝑏𝑖𝑎𝑠λ = (upscaling factor/effective bandwidth) + OFFSET [W m−2 μm−1 sr−1]
𝜌𝜆 = Top of atmosphere reflectance (spectral reflectance at sensor) [unitless]
𝑑 = Distance Earth-Sun [au]
𝐸𝑒𝑥𝑜 = Exo-atmospheric radiance [W m−2 μm−1]
𝜃 = Solar zenith angle [°]
After these radiometric corrections, georeferencing was accomplished with coordinates collected with RTK GNSS (Leica Geosystems GmbH, Germany) at distinctive points (such as road marks and crossroads) in the study area.
2.6. Feature Creation
The panchromatic band and all eight spectral bands were used as features themself, as well as the Normalised Difference Spectral Indices (NDSI) calculated for each combination of multispectral bands (Equation (3)), resulting in additional 28 indices. Further, Haralick texture features [
33] were calculated for panchromatic and multispectral bands with
HaralickTextureExtraction plugin of Orfeo Toolbox library (OTB, open-source [
34]) accessed from QGIS. Simple texture set selection was chosen as shown in
Table 2, and image minimum and image maximum were adjusted for each band separately, depending on band intensity values. Expert knowledge was used to manually select all other preferences (computation step, radius, offset, histogram number of bins) to fit the purpose of generating distinguishable texture features (
Table A1,
Appendix A).
R = Spectral reflectance
i = Wavelength [nm]
j = Wavelength [nm]
From reference ground samples, plot corner coordinates were used to cut out each reference plot for each feature raster. As multiple pixels were located in a raster cut out, different metrics (average, standard deviation, minimum, maximum, 25% percentile, 50% percentile, 75% percentile) were used to calculate a value for each feature. In total, 763 features were created.
2.7. Feature Selection
To gain a higher model simplicity the total set of 763 features was reduced using the function VSURF (Variable Selection Using Random Forest) from R-package VSURF [
37]. This method was chosen because it proved suitable both for binary outcomes (which was one of our classification scenarios—cf.
Figure 3) and for datasets with many predictors [
38]. In the first (thresholding) step of VSURF, irrelevant features were eliminated. A random forest with 50 runs was used to rank all features according to their importance measure (gain). A threshold was computed depending on the standard deviation of feature importance since irrelevant features have smaller standard deviations compared to features with high importance. The second (‘interpretation’) step reduced the feature set in nested random forest models (25 runs), starting with only the most important feature and finally selecting the feature set from the model with an out-of-bag error smaller than the minimal out-of-bag error augmented by its standard deviation. Selection from this step still contains features with redundancy for interpretation purposes. The third (‘prediction’) step was a forward selection procedure based on the previously selected feature set. Here, an additional feature was only added if the out-of-bag error significantly decreased compared to the average variation obtained by adding a noise feature [
37].
2.8. Model Development
For model development, three different feature sets were used. The set with all 763 features, the one obtained by the VSURF ‘interpretation step’ and the one from VSURF ‘prediction step’. The sets were assigned to three different classification scenarios: A 17-class scenario, which included all documented data classes as described before (chap. 2.4. Reference ground sampling), a 5-class scenario, which aggregated the intermediate classes to nearly equal size in terms of the number of samples (cf.
Figure A1), and a binary classification scenario which exclusively contained sample plots with 100% coverage of
L. polyphyllus and those sample plots which were entirely free of
L. polyphyllus. Therefore, the binary classification scenario only contained 82 of 219 samples.
Overall, 9 different modeling approaches were conducted (3 classification scenarios with three different feature sets each as input). Multiple machine learning algorithms were applied through the AutoML algorithm [
39] from R-package
H2O ([
40] v. 3.32.0.4).
H2O is a machine learning and predictive analysis platform that is written in Java and has an application programming interface that can be used by web interface, Phyton as well as R binding. AutoML can be used as an automatic machine learning workflow, including training and tuning of different supervised machine learning algorithms (DRF: Distributed Random Forest, XRT: Extremely Randomized Trees, GLM: Generalised Linear Model, GBM: Gradient Boosting Machine, deep learning, and stacked ensembles (cf.
Table A2)). AutoML trains specific algorithms in the following order: A fixed grid of GLMs, a default DRF, five pre-specified GBMs, a near-default Deep Neural Net, an XRT, a random grid of GBMs, and a random grid of Deep Neural Nets. The number of models that are trained is eventually limited to a pre-defined training time. Additionally, and independent of time limitation, two stacked ensemble models are built, one (SE
family) using only the best performing model of each algorithm family (DL, GBM, GLM, XRT, DRF) and one (SE
all) combining all trained models.
Preferences of the AutoML function were kept similar for all classification scenarios and feature sets. Time limitation for the training process was limited to 600 s for each AutoML-run. A leader board, ranking all algorithms of a run depending on the mean per class error for multinominal classification and AUC (area under the receiver operating characteristic curve) for binary classification of a 5-fold cross-validation was created.
is the total number of observations of the corresponding data frame.
is the per observations user-defined weight (defaults is 1).
is the total number of classes (C = 2 for binary classification).
is the predicted value (uncalibrated probability) assigned to a given observation.
is the actual target value.
On a test dataset of 20% randomly picked samples, that represents a proportional split for each class, and which were not included in the training process, the validation measures mean per class error (Equations (4) and (5)) and Log loss (Equations (6) and (7)) were calculated and leaders inside each classification scenario were compared (all feature vs. VSURF ‘interpretation step’ vs. VSURF ‘prediction step’). Additionally, the best-performing algorithm was compared among the three different classification scenarios.
2.9. Model Interpretation
To get a better understanding of the underlying information of the prediction model, the contribution of each feature to the model was investigated. Feature importance was calculated for the best-performing model of the most promising AutoML run. Each algorithm family (DL, GBM, GLM, XRT, DRF) uses specific variable importance calculation (e.g., for deep learning it uses the Gedeon method [
41]), which is executed with function h2o.varimp from the H2O package. The different calculation procedures are listed in the algorithm table (
Table A2).
2.10. Final Validation of Best Model Approach
The best performing AutoML run (among all feature sets and classification scenarios) was inspected for its best models of each algorithm family (DL, GBM, GLM, XRT, DRF) and for the stacking ensemble model which was built by these five models. Each model was implemented in a loop of 100 training and testing steps, each built with a different combination of training and test samples. This was done to investigate the median prediction performance among all 100 runs to decrease the prediction outcome bias of a single test set that could be highly over-optimistic or over-pessimistic. Additionally, the range in prediction performance was used as an indicator for model stability. Median AUC (area under the ROC curve (receiver operating characteristic)) and median Log loss (Equations (6) and (7)) of each 100 model runs were compared to identify the best overall classification model for
L. polyphyllus. Further, those median model’s ROC curves were compared to investigate their performance when the threshold for class probability is varied along the area of conflict between a high
true positive rate (Equation (8)) and a low
false positive rate (Equation (9)).
2.11. Binarization Threshold
Classifications with different binarization thresholds were performed additionally with the best performing AutoML algorithm. Therefore, samples were divided into binary classes by a threshold of L. polyphyllus coverage. As the sampling plots have been divided into 16 subplots, the same number of threshold values could be realized (in steps of 6.25% L. polyphyllus coverage). Classification performance was compared in terms of mean AUC after a loop of 100 training and testing steps, as described in the previous validation procedure. The threshold of the best performing model was identified to assess the minimum L. polyphyllus coverage necessary for ML-based identification of L. polyphyllus in practice.
2.12. Prediction Map
At last, the whole reference dataset was used as training input in the model with tuning parameters set according to the best performing model from AutoML run (
Table A3) to create a prediction map of
L. polyphyllus abundance for the target zone ‘Leitgraben’. To this end, the ‘h2o.predict’ function was used for calculating class probabilities (for classification), which can be labeled by a threshold that fits best the purpose of the prediction. Because missing a Lupine spot is much worse than a false alarm, the threshold that defines where to split the probabilistic classification value for assigning a predicted sample to a class should be oriented at a low
false negative rate (Equation (10)). One way is to add a factor to the
F1-score (Equation (12)), which is the harmonic mean of
precision (Equation (11)) and
recall (Equation (8)). The weighted
F2-score (Equation (13)) penalizes more for
false negative than
false positive by adding a positive factor of 2 to the importance of
recall, while the
F0.5-score (Equation (14)) gives more weight to
precision than to
recall.
For this purpose, we identified the threshold value that gives a maximum performance of F0.5-score and F2-score to compare their outcome of L. polyphyllus prediction maps.
5. Conclusions
As binary classification was the only potentially usable approach compared to 5-class and 17-class scenarios, L. polyphyllus patches are only detectable by WorldView-3 data, when they cover at least 3 × 3 m2 ground surface. We have to consider WorldView-3 data as very limited to L. polyphyllus detection in the late summer season.
If UAV-based missions are impossible due to nature conservation restriction, a future aim could be to train a WorldView-3 based classification model using times series [
48,
49] or collecting data in early summer at the peak of
L. polyphyllus blossoming phenological stage.
We could show limitations for satellite-based species-detection in highly heterogenous grassland ecosystems and conclude that further sensor data fusion is necessary to compensate for similarities between target species and background as well as limits of the satellite’s spatial resolution.
Applicability is a crucial aspect in grassland monitoring to serve on strategic, tactic, and operational levels. To live up to the high expectations of RS-based monitoring tools, continuous development of RS methods under field conditions is inevitable.