Habitat Modeling of Alien Plant Species at Varying Levels of Occupancy

Distribution models of invasive plants are very useful tools for conservation management. There are challenges in modeling expanding populations, especially in a dynamic environment, and when data are limited. In this paper, predictive habitat models were assessed for three invasive plant species, at differing levels of occurrence, using two different habitat modeling techniques: logistic regression and maximum entropy. The influence of disturbance, spatial and temporal heterogeneity, and other landscape characteristics is assessed by creating regional level models based on occurrence records from the USDA Forest Service's Forest Inventory and Analysis database. Logistic regression and maximum entropy models were assessed independently. Ensemble models were developed to combine the predictions of the two analysis approaches to obtain a more robust prediction estimate. All species had strong models with Area Under the receiver operator Curve (AUC) of >0.75. The species with the highest occurrence, Ligustrum spp., had the greatest agreement between the models (93%). Lolium arundinaceum had the most disagreement between models at 33% and the lowest AUC values. Overall, the strength of integrative modeling in assessing and understanding habitat modeling was demonstrated.


Introduction
Invasive species are now a major threat to ecosystems, with the rapid anthropogenic acceleration of species introductions over the last century [1] and the subsequent impact of the species on economies and ecosystems [2]. Invasive species are now recognized as a major component of global environmental change [3][4][5]. Tools that can accurately assess the impacts of invasive species are becoming essential for identifying areas where management and monitoring efforts should be focused. Species distribution models (SDMs) are one such tool. They are widely used in ecology [6,7] and have broad applications in assessing the relationships between species occurrence, the environment and the impact of ecological change [8]. For invasive species, SDMs are useful for predicting species distributions and ecological niches, and also for assessing potential spread and the suitability of areas that have not yet been invaded. SDMs can be used to assess the impacts of external environmental conditions such as climate change on species distribution [9] and the potential impacts of the species on the landscape [10].
The strength of an SDM is determined, in part, by the correlation of species distribution to input parameters [11] and the number of observation points. Input parameters are often derived from landscape-level digital information and provide a representation of the environmental heterogeneity of the landscape. Typical parameters used in SDMs are those that represent climate, habitat diversity, landscape characteristics, habitat patch size and shape, connectivity, regional and local diversity of biota, vegetation structure, and the intensity, frequency and magnitude of disturbance [12][13][14], all of which vary across spatial and temporal scales [12,15]. Collectively, these factors result in interlaced patterns of species distribution at multiple spatial and temporal scales [16]. Geospatial datasets, including remotely sensed data, offer significant opportunities for providing information on these characteristics on a larger scale.
There are numerous methods for developing SDMs, many of which have been applied to invasive plants, including logistic regression [17,18], fuzzy envelope models [19], genetic algorithms [20], maximum entropy [18,21], and general additive models [22]. These models differ in the underlying assumptions and algorithms, and in their requirement for presence-only species data or for both presence and true absence data. These approaches can be used individually or collectively in an ensemble approach. Ensemble SDMs combine the strengths of several models while limiting the weakness of any one model [23,24] and offer a broad perspective to model results.
In this paper, we illustrate the application of two modeling techniques, logistic regression and maximum entropy, and the ensemble model approach. We discuss the impact of the size of the dataset on the resulting model by comparing the results from three species with different levels of prevalence. We focus on three of the invasive plant species of concern in the Cumberland Plateau and Mountain Region in the United States: privet (Ligustrum spp.), tall fescue (Lolium arundinaceum) and silktree (Albizia julibrissin).

Study Area
The Cumberland Plateau and Mountain Region (CPMR) extends from northern Alabama, through Tennessee and Kentucky, and into Virginia [25][26][27][28] (Figure 1). The region covers 59,000 km² and has one of the most diverse woody plant communities in eastern North America [29]. Forest resources and management are a major part of the CPMR economy, particularly in rural communities. Approximately 70% of the land in this area is forested, with over 75% of this composed of hardwoods [29,30]. Elevations range from 200 to 1200 m [31], with annual rainfall varying from 940 to 1900 mm, and mean minimum winter temperatures of −7 °C to 1.5 °C [32]. Like many of the forests in eastern North America, the native deciduous hardwood forests of the CPMR are characterized by a long history of land-use change driven by agricultural conversion and timber extraction. More recently, urban sprawl and large-scale conversion of land to intensively managed pine plantations have become major contributors to land cover change [33]. McGrath and others [34] found that 14% of native forest cover was lost between 1981 and 2000, predominantly as a result of native forest conversion to pine plantations. Of the 33 invasive species monitored by the United States Forest Service (USFS) [35], 25 are found in the CPMR: four trees, seven shrubs, seven vines, five grasses and two forbs.

Species of Interest
Study species were selected to represent a range of life forms (grass, shrubs, and trees) and occurrence levels (moderate and low percentage of Forest Inventory and Analysis database (FIA) plots occupied) across the CPMR. Privet (the shrub) had moderate occurrence (16% occupied plots), and tall fescue (the grass) and silktree (the tree) had low occurrence (5% and 2%, respectively).

Privet
There are at least eight species of invasive privets (Ligustrum spp.) that have been introduced from Asia and Europe into the southern United States as ornamentals [36][37][38]. The USFS collects information on two species of privet, Chinese privet (L. sinense) and European privet (L. vulgare) [35]. Because it can be difficult to distinguish between privet species, we modeled the Ligustrum genus as a whole. Privets are the second most abundant invasive plants in the southern region and the most prevalent in the understory of bottomland hardwood forests [39,40]. Chinese privet is the most common species, being present in 20 states ranging from Texas to Massachusetts [36]. All species are still being produced, sold and planted as ornamentals. Privets severely alter natural habitat and critical wetland processes, forming dense stands to the exclusion of most native plants and replacement regeneration. The abundance of specialist birds and the diversity of native plants and bees are dramatically reduced by privet thickets [41,42]. The dense thickets impact forest communities by shading and out-competing many of the native species. Privet can survive in a variety of habitats, including wet or dry areas, but dominates best in mesic forests [39]. Privets produce abundant seeds that are viable for about a year [43] and are predominantly spread by birds [44]. Privet can also increase in density through stem and root sprouts. The fruit produced, however, provides a substantial food source for birds and other wildlife [45].

Tall Fescue
Tall fescue is a grass native to Europe and was first introduced into the United States in the early to mid-1800s. It has been widely planted for turf, forage and erosion control [46]. Tall fescue occurs throughout the continental United States [36] and has been reported as invasive in natural areas [47]. It is still promoted by a variety of agricultural agencies; however, the USFS Southern Region has prohibited the use of endophytically enhanced tall fescue on USFS lands [39]. Tall fescue is a cool season grass that invades native grasslands, savannahs, woodlands and other high-light natural habitats [46]. It spreads mainly through rhizomes and can form extensive colonies that compete with and displace native vegetation. Viable seeds can be dispersed by grazing animals and birds, and remain in the seed bank for extended periods of time [39]. Some varieties of tall fescue have a mutualistic fungal endophyte (Neotyphodium coenophialum) that gives them a competitive advantage over some plants, including legumes [48]. As a result, communities dominated by tall fescue are often low in plant species richness [49]. In addition, alkaloids produced by endophyte-infected tall fescue may be toxic to small mammals and of low palatability to ungulates [50]. Tall fescue, which has replaced many acres of native grass, does not supply the type of food and cover that many birds need in order to thrive [51]. The grass supports only a limited number of insects [52], which, in turn, are an important food for both quail and turkey. Grasslands dominated by endophyte-infected tall fescue are expected to support less total herbivore biomass and less predator biomass [51,52]. Tall fescue tolerates nutrient-poor and compacted soils, and grows well in disturbed areas such as highway and railroad rights-of-way. Annual nitrogen inputs are needed to maintain optimal grazing conditions [46]. Tall fescue is adapted to cool, humid climates with moist soils of pH 5.5 to 7.0 [46]. It will produce top growth when soil temperatures are as low as 5 °C, and it continues growing into late autumn in the southern United States [46].

Silktree
Silktree is a legume native to south and eastern Asia. It is a small to medium-sized tree that can grow up to 11 m tall. It was introduced to the United States in 1745 and widely planted as an ornamental. Silktree is now found throughout the southern United States along roadsides, beside parking lots, bordering power lines and encroaching into forests. Silktree reproduces both vegetatively and by seed [39]. The seeds are encased in impermeable seed coats that allow them to remain dormant for many years [53]. Because silktree is sun tolerant, it can grow in a variety of soils and can produce large seed crops and re-sprout when damaged. It is a strong competitor of native trees and shrubs in open areas and forest edges. Dense stands of silktree severely reduce the sunlight and nutrients available for other plants [39]. Silktree can tolerate partial shade but is rarely found in forests with full canopy cover or at higher elevations (above 900 m) where cold hardiness is a limiting factor. However, silktree can become a serious problem along riparian areas where it becomes established along scoured shores and where its seeds are easily transported in water [39]. Although it has been identified as being invasive in forests in the southern United States [39], silktree is still being encouraged as a tree crop species [54]. Ares and others [54] state that in the southern United States, silktree has been considered in agroforestry practices as a forage species for goats and cattle [55,56], and for soil fertility improvement in permaculture systems [57][58][59]. However, planting of silktree should be evaluated on a site-specific basis because it can become invasive, especially in riparian areas [60]. This mixed message may increase the planting of silktree in the next decade and thus its invasion potential.

Invasive Plant Occurrence
The USFS Forest Inventory and Analysis (FIA) program analyzes and reports information on the status, trends and conditions of forests within the United States. It is a periodic survey of all forested land in the United States and has been conducted since 1928 [61]. Recent inventories have typically been conducted every 5-7 years in the southeastern states, with approximately 20% of the points assessed every year [35]. In the CPMR there are 2814 FIA sites [35]. An extension of the FIA database focuses on invasive plants, and this database was made available for our study. Data were available for the last completed inventory cycle (2000-2005) and consisted of species absence/presence records.

Landscape Variables
Landscape variables were categorized into six groups: Landsat, anthropogenic, environmental, climate, land use and water. Using ArcGIS [62] and ERDAS [63], all variables were extracted from available digital information including Landsat imagery, classified land use data, roads, rivers, human population census data and climatic information. All variables were converted to 30 m × 30 m cells across the CPMR [18]. The total number of variables was 41 (Table 1). This initial set was reduced using exploratory data analysis to remove variables that were highly correlated (Pearson's correlation coefficient, r). For any two variables that were highly correlated (r > 0.8), only one was selected for input into further models. All input variables needed to be able to be displayed on a map. Two variables based on the Normalized Difference Vegetation Index (NDVI), NDVI75 and NDVI90-75, could not be mapped due to inconsistencies across Landsat scenes, an artifact of instrumentation, and thus were not suitable for use in further analysis. This left a set of 28 variables (see Table 1). Descriptive statistics (mean, standard deviation (SD), minimum (Min) and maximum (Max)) for the 28 variables were calculated for both the land area covered by the FIA plots and the forested land area in the CPMR. The forested land area in the CPMR was the area depicted by the 2001 National Land Cover Database. This comparison was made to determine whether FIA data could be extrapolated to the entire forested CPMR (Table 1). The FIA points had a mean that was within one SD of the mean for the forested area of the CPMR for all variables (all but two variables had means within 0.2 SDs). In both cases, the maximum and minimum were very similar, suggesting that although there was some variation in the means, the FIA points still represented the full range of the CPMR. Overall, the FIA data are considered to be an adequate representation of the CPMR for this study.
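The correlation screening described above can be illustrated with a minimal sketch. It assumes that the absolute value of r is compared with the 0.8 cut-off and that the first variable encountered in a correlated pair is the one retained; the text does not state either rule. The variable names and values are hypothetical and are not taken from the FIA data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def drop_correlated(variables, threshold=0.8):
    """Keep a variable only if |r| with every already-kept variable is <= threshold.

    `variables` maps a variable name to its values, one value per plot.
    Retaining the first variable of a correlated pair is an assumption.
    """
    kept = []
    for name, values in variables.items():
        if all(abs(pearson_r(values, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values for four plots (illustrative only).
example = {
    "elevation": [200.0, 450.0, 700.0, 1200.0],
    "min_temp": [1.5, -1.0, -3.5, -7.0],   # strongly correlated with elevation
    "road_density": [3.0, 0.5, 2.5, 1.0],  # weakly correlated with elevation
}
selected = drop_correlated(example)
print(selected)
```

In this toy example, minimum temperature is dropped because it is almost perfectly (negatively) correlated with elevation, while road density is retained.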

Models
Two modeling techniques were used: binary logistic regression (using a binomial distribution and logit link) [71] and maximum entropy (MaxEnt) [72]. The important difference between the two techniques is that logistic regression uses information on both occurrence and absence to estimate a predictive linear model, whereas MaxEnt uses information from occurrences only [18]. The distribution of each species was modeled, following the methods of Lemke and others [18], using each group of variables (Landsat, anthropogenic, environmental, land use, water and climate) separately (Table 1). These "sub-models" were built using each of the two techniques. Using only variables selected in the final sub-model for each variable group, a final composite model was determined. Logistic regression models were run in SAS [73] and MaxEnt models were run in a specialized package of MaxEnt [72]. Logistic regression models were derived using a stepwise regression method with Akaike's Information Criterion (AIC) [74] as the selection criterion. MaxEnt models were derived using a manual backward selection method, and variables that had little or no impact on the model were removed. A measure of variable contribution was calculated to identify the key variables determining the occurrence of each species.
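For reference, the binary logistic regression with a logit link and the AIC used as the stepwise selection criterion take the standard textbook forms below. The symbols are notation introduced here for illustration and do not come from the original analysis: p is the probability of occurrence at a site, x_i are the k landscape variables in the model, β_i are the fitted coefficients, m is the number of estimated parameters and L̂ is the maximized likelihood.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i ,
\qquad
\mathrm{AIC} = 2m - 2\ln\hat{L}
```

Under this criterion, a candidate variable is retained in a stepwise procedure only if it lowers the AIC, so that any gain in fit must outweigh the penalty for the extra parameter.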
The omission rate and Area Under the receiver operator Curve (AUC) were used to assess the reliability and validity of the models. The omission rate is the false negative rate, that is, the proportion of sites where the species was present but the model predicted absence. To calculate the omission rate, the predicted model values are converted to a binary value (predicted occurrence = 1; predicted absence = 0). The threshold value for this binary conversion was set, for each species, as the value that maximized the sum of the sensitivity and specificity [75]. The AUC provides a single measure of model performance independent of any particular choice of threshold [76].
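The three quantities defined above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in SAS or MaxEnt: predictions at or above the threshold are treated as occurrence, candidate thresholds are the observed prediction values, and the AUC is computed as the proportion of presence-absence pairs that are ranked correctly. The observations and predictions are hypothetical.

```python
def confusion(y_true, scores, threshold):
    """Return (tp, fn, tn, fp), treating score >= threshold as predicted occurrence."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp

def best_threshold(y_true, scores):
    """Observed score that maximizes sensitivity + specificity."""
    best_t, best_sum = None, -1.0
    for t in sorted(set(scores)):
        tp, fn, tn, fp = confusion(y_true, scores, t)
        total = tp / (tp + fn) + tn / (tn + fp)
        if total > best_sum:
            best_t, best_sum = t, total
    return best_t

def omission_rate(y_true, scores, threshold):
    """Proportion of true presences that the model predicts as absent."""
    tp, fn, _, _ = confusion(y_true, scores, threshold)
    return fn / (tp + fn)

def auc(y_true, scores):
    """Probability that a random presence outscores a random absence (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical observations (1 = present, 0 = absent) and model predictions.
observed = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [0.9, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1]
threshold = best_threshold(observed, predicted)
print(threshold, omission_rate(observed, predicted, threshold), auc(observed, predicted))
```

For these hypothetical values the selected threshold is 0.7, at which one of the three presences is missed (omission rate of one third), and the AUC is 13/15.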
Rasters were imported from MaxEnt into ArcGIS, and the raster calculator was used to create the logistic regression model. Initial maps with continuous rasters were reclassified into binary rasters based on the cut-off values determined by maximizing the sum of the sensitivity and specificity.
We integrated information from both the logistic regression and MaxEnt models using an ensemble approach. While logistic and MaxEnt models may be compared individually to select the best overall model for particular datasets, methods that combine the two models have the potential to reduce the uncertainty associated with any one particular algorithm [23,24]. A number of approaches have been proposed for combining the outputs of individual models for ensemble predictions [23]. Here, we adopt a consensus approach, adding the binary output rasters together to identify areas of agreement and disagreement in the models. Areas of agreement were where both models predicted occurrence or both predicted absence, and areas of disagreement were where the predictions of the two composite models (logistic regression and MaxEnt) differed.
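The consensus step can be sketched on two small binary rasters. The grids below are hypothetical and stand in for the reclassified ArcGIS rasters; the function names are illustrative only. Summing the two rasters cell by cell gives 0 where both models predict absence, 2 where both predict occurrence, and 1 where they disagree.

```python
def consensus(logistic, maxent):
    """Add two binary rasters cell by cell: 0 = both absent, 1 = disagree, 2 = both present."""
    return [[a + b for a, b in zip(row_l, row_m)]
            for row_l, row_m in zip(logistic, maxent)]

def agreement_fraction(summed):
    """Share of cells where the two models make the same prediction (value 0 or 2)."""
    cells = [value for row in summed for value in row]
    return sum(1 for value in cells if value != 1) / len(cells)

# Hypothetical 2 x 3 binary rasters (1 = predicted occurrence, 0 = predicted absence).
logistic_binary = [[1, 0, 0],
                   [1, 1, 0]]
maxent_binary = [[1, 0, 1],
                 [1, 0, 0]]

summed = consensus(logistic_binary, maxent_binary)
agreement = agreement_fraction(summed)
print(summed, agreement)
```

Here the models agree on four of the six cells, so the agreement is about 67%.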

Data Selection
Models were built for each species using 70% of the data, with the remaining 30% used to test the models (Table 2). For the logistic regression models, the balance between occurrence and absence data points was fixed at 20:80 [77] for the three species, to reduce any effect of having a large binary class imbalance. This was done by under-sampling the absence data points [77].
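A minimal sketch of this data selection is given below. It assumes that presences and absences are split 70:30 separately and that only the training absences are under-sampled to reach the 20:80 balance; the text does not specify the order of these steps. The plot identifiers, counts and random seed are hypothetical.

```python
import random

def split_and_balance(presences, absences, train_fraction=0.7,
                      presence_share=0.2, seed=0):
    """Split records into training and test sets, then under-sample training absences.

    Assumption: the split is done within each class, and only the training
    absences are reduced so that presences make up `presence_share` of the
    training records.
    """
    rng = random.Random(seed)

    def split(records):
        shuffled = list(records)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    train_p, test_p = split(presences)
    train_a, test_a = split(absences)

    # Absences needed for a 20:80 presence-to-absence balance in training.
    n_absences = round(len(train_p) * (1 - presence_share) / presence_share)
    train_a = rng.sample(train_a, min(n_absences, len(train_a)))
    return (train_p, train_a), (test_p, test_a)

# Hypothetical plot identifiers: 50 occupied plots and 950 unoccupied plots.
occupied = list(range(50))
unoccupied = list(range(1000, 1950))
(train_p, train_a), (test_p, test_a) = split_and_balance(occupied, unoccupied)
print(len(train_p), len(train_a), len(test_p), len(test_a))
```

With these hypothetical counts, the training set holds 35 presences and 140 absences (20:80), and the test set keeps the remaining 15 presences and 285 absences.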

Results and Discussion
Of the 42 models run, 41 had better than random predictions (Table 3). All three species had low omission rates and high AUCs. The final composite models were combined to create ensemble models (Figure 2). The species with the strongest agreement was the more prevalent species, privet (93% agreement), while the two low-prevalence species, with the smaller number of occurrence data points, had lower agreement between their composite models (67% agreement for tall fescue and 87% for silktree) [78]. However, despite low prevalence and small datasets, composite models for all three species were acceptable.

Table 3. Threshold (defined as maximum sensitivity plus specificity) and accuracy assessment for the three species (bold denotes strong models with AUC >0.80 and omission rate <0.20) using logistic regression (L) and MaxEnt (M). The variables were grouped into four groups: Landsat, Anthropogenic (Anthro), Environmental (Enviro) and Climate. The composite model is the final, best model.

Of the 28 original variables used in developing the models, 15 were ultimately incorporated into at least one of the final composite models, but only seven were used in more than one model (Table 4). Overall, the composite models were dominated by environmental variables (32% of all composite model contributions) and climatic variables (42% of all composite model contributions) with minimum temperature as the single most important variable (40% of all composite model contributions; Table 4). This confirms the validity of matching the ranges of native species with the range of potential invasion, and the approach of integrating elevation, latitude and longitude, as is used to estimate potential invasive distribution [79]. It also suggests that climate change will influence the distribution, and this variation should be integrated into models. Variables in the Landsat and water groups contributed very little to the models, contributing only one variable each to the composite models, and both were at low rates (disturbance index in 2001 at 1%, and water within 500 m at 3%, for all composite model contributions; Table 4). Information on human population, roads and land use (proportion of forest and FIA plots have privet. Overall, privet was predicted to occur across 22% of the forests by both models. Both composite models were strong, with logistic regression producing a slightly better model. Environmental variables dominated both models, at 73% (MaxEnt) and 79% (logistic regression). Minimum temperature was the single most dominant variable, with higher minimum temperatures having a higher probability of invasion. Both models showed a negative correlation with elevation and a positive correlation with road density, suggesting that privet will be found at lower elevation in areas of higher road density (increased human occupation). The logistic model also suggested privet had a higher chance of occurrence closer to roads and with more farming in the near vicinity. MaxEnt highlighted the trend that the less the forest cover, the more likely the area was to have privet. The logistic model used historical land use as one of the independent variables, associating privet with areas with less forest, more residential land use and more water in 1990. Overall, this suggests that areas of higher human use and disturbance will have more privet. The MaxEnt model also identified slope and rainfall as important, with low slope being more likely to have privet.

Species
For tall fescue, the MaxEnt composite model had the highest AUC (Table 3); however, the MaxEnt model that used only climatic variables had a slightly better omission rate. The logistic regression models had slightly lower validation statistics. Both the MaxEnt and logistic regression composite models were dominated by climatic variables. The MaxEnt composite model showed that tall fescue occurrence was influenced greatly by temperature, elevation, rainfall, farming and aspect. Lower temperature; intermediate levels of farming, rainfall and elevation; and a more southerly aspect were related to a higher occurrence of tall fescue. The logistic regression composite model only used three variables (minimum temperature, aspect and amount of residential land use within 100 m), with low temperature, more southerly slopes and less residential land use having a higher occurrence of tall fescue.
Silktree was the only species to integrate a high proportion of anthropogenic variables into the composite models (Table 4). The MaxEnt composite model predicted 21% of the area to have probable occurrence of silktree, and showed its occurrence to be influenced by elevation, population density, road density and water bodies. Lower elevation, higher population and road density, and nearby water bodies were related to a higher occurrence of silktree. The composite logistic model also utilized a number of anthropogenic variables. The logistic model was dominated by elevation, but road density also had a major role in the model. The logistic composite model was the only composite model to use a Landsat variable. The logistic model also suggested that low elevation and high road density are important contributors to silktree occurrence, with higher disturbance in the landscape also being important.

Conclusions
Remote sensing has been identified as an emerging tool for biodiversity science and conservation [80]. However, in this work, the introduction of remotely sensed medium resolution (30 m) data had little value in the overall model development. Only one of the composite models, the logistic regression model for silktree, used any Landsat variables. The silktree model used the Landsat disturbance index for 2001, but this only had a 5% contribution to the model. Given the time put into developing the Landsat variables, we would suggest that for future work, this information adds little value to the predictive ability of models and is probably unnecessary at a landscape scale. The large size of the study area (59,000 km²) made it impractical to use remotely sensed data at a finer resolution due to the computer processing power required for analysis. Exploring different abstraction resolutions, as suggested by Sester [81], would be a worthwhile study, possibly on a smaller scale, to identify an optimal resolution.
The use of the two different modeling approaches, logistic regression and MaxEnt, strengthens the validity of the results. The inclusion in the models of similar variables with the same direction of relationships gives confidence to any inference about the importance of these variables. In examining all the composite models, there was only one variable that had a different relationship between the two types of modeling: water in the tall fescue composite models. In this model, water had a positive relationship with MaxEnt (12% contribution) but a small weak relationship in the logistic regression (1% contribution to the model).
The ensemble approach and mapping the agreement and disagreement of composite models within each species showed privet to have a very strong agreement (93%), silktree a moderate agreement (87%) and tall fescue a limited agreement (67%). This is a reflection of the model strength, the number of occurrence points and the applicability of the independent variables in predicting the species of interest. Tall fescue had the lowest agreement of the three species, even though it was not the species with the smallest number of occurrence points. There may be a number of reasons for this; for example, only forested landscapes were modeled rather than grasslands. Other reasons could be the suitability of the independent variables or the scale of the independent variables. Independent variables were used at a 30 m × 30 m resolution and habitat characteristics that function at a smaller scale may be driving the distribution of tall fescue.
Models such as those developed by this research can be used as tools for landscape management, forest stand assessment or long-term forest monitoring programs. We recommend the use of an ensemble modeling approach to combine different models. One of the greatest benefits of large-scale GIS models is that they can outline the main characteristics of species distribution areas and be used to predict environmental favorability in regions where their distribution is less documented [82]. They can also be integrated into forest management decision support systems [83] and assist in developing long-term management plans.

Figure 1. Study area location map: Cumberland Plateau and Mountain Region in the southeastern United States.

Table 1. Description of landscape variables categorized into six groups, the resolution of the original data (Res), the citation for other studies that have used the variable, and the original data source. Descriptive statistics are shown for the 28 variables that were used in modeling. (TIGER = Topologically Integrated Geographic Encoding and Referencing, USGS = United States Geological Survey, LULC = Land Use Land Cover, NED = National Elevation Dataset, PRISM = Parameter-elevation Regressions on Independent Slopes Model).

Table 2. Total number of points, for the occurrence and absence of three species, separated into training and test datasets.