Characterizing the Spatial Distribution of Eragrostis curvula (Weeping Lovegrass) in New Jersey (United States of America) Using Logistic Regression

The increasing spread of invasive plants has become a critical driver of global environmental change. Once established, invasive species are often impossible to eradicate. Therefore, predicting their spread has become a key element in fighting invasive species. In this study, we examined the efficiency of a logistic regression model as a tool to identify the spatial occurrence of an invasive plant species. We used the presence or absence of Eragrostis curvula (Weeping Lovegrass) as the dependent variable. The independent variables included temperature, precipitation, soil types, and the road network. We randomly selected 68 georeferenced points to test the goodness of fit of the logistic regression model in predicting the presence of E. curvula. We validated the model by selecting an additional 68 random points. Results showed that the model correctly predicted the presence of E. curvula in 82.35% of cases. The overall predictive accuracy of the model for the presence or absence of E. curvula was 80.88%. Additional tests, including the chi-square test, the Hosmer–Lemeshow (HL) test, and the area under the curve (AUC) values, all indicated that the model was a good fit. Our results showed that E. curvula was associated with the identified variables. This study suggests that the logistic regression model can be a useful tool in the identification of invasive species in New Jersey.


Introduction
Global human migration has contributed to the movement and establishment of non-native plants into new environments. Throughout history, in addition to agriculture-related plant species dissemination, plant species have been intentionally or unintentionally introduced to new environments. Moreover, the increased mobility of humans over the past century has led to escalating rates of invasive species spread [1–6]. In the United States, new plants grown for decorative purposes or erosion control were found to be the primary dispersal source of non-native species, and accidental introductions through seed contaminants were the secondary source [7].
Urban and suburban areas offer habitats with high levels of disturbance that provide new germination and colonization sites for non-native species [8,9]. Globally, non-native plant species are currently a substantial part of the vegetation of urban ecosystems, and they have had significant deleterious economic and ecological effects on these habitats, as well as in adjacent, more natural ecosystems [9]. In natural areas, non-native plants may reduce biodiversity, alter biogeochemical processes [10], and alter the natural disturbance regimes [11].
New Jersey is the most densely populated state in the United States, yet is characterized by vast stretches of forested and natural habitats. For centuries, the state's more disturbed environments have been occupied by non-native species, and many have spread from these urban centers to more [...] New Jersey as a case study. We integrated readily available GIS data into a logistic regression model in the hope that this approach would provide a simple technique that is transferable across landscapes and species. We evaluated whether logistic regression was an appropriate statistical tool to investigate the relationship between a binary dependent variable, such as the occurrence of E. curvula, which takes only two values (0 and 1, or absence vs. presence), and the four independent variables [51,52]. The overall intention is for this methodology to accelerate the prediction and control process for any particular invasive species by identifying potential habitats based on the species' ecological preferences and the current environment.

Study Area and Rationale
This study was conducted over the entire state of New Jersey in the United States of America (Figure 1). New Jersey was selected because it offered two significant advantages: (1) all essential spatially georeferenced data for the study were readily available over the internet; and (2) human-related dispersal of invasive species is relatively high due to the state's long history of European settlement and industrialization. New Jersey is about 240 km long, with the two furthest points from north to south being 270 km apart. On average, the state's east-west width is about 100 km, and the overall area is 22,610 km² (8729 mi²). The northwestern part of the state, which is part of the Appalachian Valley and Ridge Physiographic Province and consists mainly of elevated highlands and valleys, has a continental type of climate with much cooler temperatures. The south, central, and northeastern parts, mostly affected by the Atlantic Ocean, have a humid climate with generally warmer winters and cooler summers. The coldest month is January, with average high temperatures of 3.8 °C (mid-30s °F), and the warmest month is August, with high temperatures of 28.9 °C (80 °F) [53,54].
The human population in New Jersey has a distinctive spatial distribution pattern likely to be related to the patterns of species invasion. Although New Jersey is among the states with the highest average population density, approximately 467.2 people per km² [55], its population is unevenly distributed. Higher population densities are found around the New York and Philadelphia metropolitan areas. Rural counties such as Salem in the southwest and Warren in the west had population densities of 76.5 and 117.5 persons per km², respectively [55]. As a result of its high population density, New Jersey also has the most traffic among the northeastern states, as measured by the vehicle lane miles traveled annually [42], as well as the highest density of road networks.

Data and Analysis Tools
A logistic regression model was used to analyze the distribution of E. curvula as a function of the selected environmental factors. This statistical tool helps to investigate the probability of occurrence of a dichotomous dependent variable by fitting the log odds and independent variables to a linear model [51,52,56], as shown in Equation (1):

$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + \cdots + b_n x_n \tag{1}$$

Previous environmental studies have also utilized logistic regression for prediction, modeling, or monitoring purposes [35,57–61].
The probability of occurrence could be predicted by the logistic function using the following Equation (2):

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_n x_n)}} \tag{2}$$

where x_1 ... x_n are the independent environmental variables, y represents the presence (y = 1) or absence (y = 0) of E. curvula, p is the probability that y = 1, b_0 is the intercept of the model, and b_1 ... b_n are the regression model parameters. The left-hand side of Equation (1) is the log of the odds that the dependent variable is 1. We expected that sites where E. curvula is present would yield higher probabilities and sites where it is absent would yield lower probabilities.
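As a minimal sketch of how the log odds and the logistic function relate, the following Python snippet evaluates both for a single site. The coefficient values and the site values are hypothetical placeholders chosen for illustration; they are not the fitted parameters of this study, and the function names are our own.

```python
import math

def log_odds(intercept, coefficients, predictors):
    """Linear predictor b0 + b1*x1 + ... + bn*xn, i.e., the log odds."""
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))

def probability_of_presence(intercept, coefficients, predictors):
    """Logistic function: converts the log odds into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds(intercept, coefficients, predictors)))

# Hypothetical coefficients for temperature, precipitation, soil class, and
# road buffer membership. Placeholders only, NOT the values fitted in the study.
B0 = -12.0
B = [0.9, 0.001, -0.5, 1.2]

# Hypothetical site: deg C, mm, soil class (1-3), inside 500 m buffer (0/1).
site = [12.5, 1150.0, 1, 1]

p = probability_of_presence(B0, B, site)
print(f"log odds = {log_odds(B0, B, site):.3f}, probability = {p:.3f}")
```

Taking ln(p / (1 - p)) of the returned probability recovers the linear predictor, which is the sense in which the two equations are inverses of each other.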
To test the significance of the binary logistic regression model, we examined the results using the maximum-likelihood method based on the following Equation (3):

$$\chi^2 = 2\,(LL_1 - LL_0) \tag{3}$$

where df = k − k_0 (k and k_0 being the numbers of parameters in the full and intercept-only models), LL_1 refers to the log-likelihood of the full model, and LL_0 refers to the log-likelihood of the model with only the intercept b_0 and no other coefficients. The goodness of fit of the model was determined using the Hosmer-Lemeshow (HL) test. The HL statistic was calculated based on the following Equation (4):

$$HL = \sum_{i=1}^{g} \sum_{j=1}^{2} \frac{(obs_{ij} - exp_{ij})^2}{exp_{ij}} \tag{4}$$

where g = the number of groups, obs = observed values, exp = expected values, and j indexes the two outcomes (success and failure). The test statistic was evaluated using the chi-square distribution with g − 2 degrees of freedom. A significant HL test result indicates that the model is not a good fit, whereas a non-significant HL test indicates a good fit. We also produced a classification table to confirm and evaluate the predictive accuracy of the model. In this table, the number of successes (y = 1) predicted by the logistic regression model was compared to the number actually observed. Similarly, the number of failures (y = 0) predicted by the logistic regression model was compared to the number observed. The layout of the classification table is presented in Table 1.

Table 1. Classification table [52].
Number of Cases | Suc-Obs | Fail-Obs | Total Predicted
Site data points used in this study were randomly selected throughout the entire state using ArcGIS. Sixty-eight sites were used as training sites to test the actual model, and 68 additional sites were selected for the validation of the model. In each set, half of the sites were selected from places where E. curvula was present and the remaining half from places where it was absent. We corroborated the location of each site on the ground using a global positioning system (GPS) device. ArcGIS was also used to extract, for each point, the values of the four environmental and human-related datasets used in this study. These point values were exported to Microsoft Excel (Microsoft Excel 2010, Microsoft Corporation, Redmond, WA, USA). Real Statistics Resource Pack software (Release 6.2, Real Statistics, Oliva Gessi, Italy), an add-in tool within Microsoft Excel, was used to perform the binary logistic regression and data analysis.
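The analysis itself was run in the Real Statistics add-in. Purely as an illustration of the likelihood-ratio chi-square and the Hosmer-Lemeshow statistic described above, the two quantities could be computed along the following lines in Python. The toy observations, predicted probabilities, and groups are invented for the example, and the function names are our own.

```python
import math

def log_likelihood(observed, predicted):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(observed, predicted)
    )

def likelihood_ratio_chi2(observed, predicted_full):
    """2 * (LL1 - LL0), where LL0 comes from the intercept-only model."""
    # The intercept-only maximum-likelihood fit predicts the observed base rate.
    base_rate = sum(observed) / len(observed)
    ll0 = log_likelihood(observed, [base_rate] * len(observed))
    ll1 = log_likelihood(observed, predicted_full)
    return 2.0 * (ll1 - ll0)

def hosmer_lemeshow(groups):
    """HL statistic; each group is (n_sites, observed_successes, expected_successes)."""
    total = 0.0
    for n, obs_s, exp_s in groups:
        obs_f, exp_f = n - obs_s, n - exp_s  # failures are the complement
        total += (obs_s - exp_s) ** 2 / exp_s + (obs_f - exp_f) ** 2 / exp_f
    return total

# Toy data, invented for illustration only.
y = [1, 1, 0, 0]
p_full = [0.8, 0.7, 0.3, 0.2]
toy_groups = [(10, 4, 5.0), (10, 7, 7.0)]

print(likelihood_ratio_chi2(y, p_full), hosmer_lemeshow(toy_groups))
```

A group whose observed and expected counts agree exactly contributes zero to the HL statistic, which is why small HL values (and large p-values) are read as evidence of good fit.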
Four factors were identified as the main variables controlling the spatial distribution of E. curvula in New Jersey. Three of them were environmental factors: temperature, precipitation, and soil types [48]. The fourth factor, the road network, was anthropogenic.
Georeferenced E. curvula data were downloaded from the New Jersey Invasive Species Strike Team [16] website. The attribute table of this shapefile-formatted dataset included several fields. In addition to the names of all identified invasive species, it contained essential information such as the name of the county, the ecosystem, and the type of property where each species was found.
Additional downloaded georeferenced datasets included soil, road network, temperature, and precipitation. Two types of datasets related to the New Jersey road network were downloaded from the New Jersey Office of GIS Open Data [62] and US Census Bureau [63,64] websites. These 2016 TIGER/Line files included U.S. Highways, State Highways, County Highways, and town and city streets. After testing other potential multi-ring buffer widths, a constant width of 500 m was found to be satisfactory for studying species distribution away from the road network. The multi-ring buffers were created around selected roads using ArcGIS. Soil data were downloaded for each county from the USDA National Geospatial Center of Excellence (NGCE)|NRCS [65,66] website and were joined together in ArcGIS to create a soil layer for the entire state of New Jersey. The attribute table of the soil map has information about the type of soil for each map unit [66]. We used ArcGIS to reclassify these soil units into three numeric classes based on their texture, mineral composition, and organic matter using the Soil Survey soil taxonomy [66]. Thus, very good soils were assigned a value of 3, moderate soil types a value of 2, and relatively poor soil types a value of 1. Data related to temperature and precipitation were downloaded from the PRISM Climate Group [67]. In both cases, the data were rasters with an 800 m ground resolution representing 30-year normal annual averages from 1981 to 2010. Precipitation was expressed in millimeters and temperature in degrees Celsius. The elevation data derived from the digital elevation model (DEM) high-resolution grid were obtained from the USGS National Elevation Dataset (https://viewer.nationalmap.gov/basic/) [68].
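The buffers in this study were built in ArcGIS. As a rough stand-in for readers without that software, the sketch below tests whether a site lies within 500 m of a road polyline using plain planar geometry. It assumes coordinates are already projected in meters; the road and site coordinates are made up for the example, and the function names are our own.

```python
import math

def distance_to_segment(px, py, ax, ay, bx, by):
    """Shortest planar distance from point P to segment AB (projected meters)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:  # degenerate segment: A and B coincide
        return math.hypot(px - ax, py - ay)
    # Project P onto the line through A and B, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def within_road_buffer(point, road, width_m=500.0):
    """True if the point lies within width_m of any segment of the polyline."""
    px, py = point
    return any(
        distance_to_segment(px, py, ax, ay, bx, by) <= width_m
        for (ax, ay), (bx, by) in zip(road, road[1:])
    )

# Hypothetical road polyline and sites, in meters.
road = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0)]
near_site = (500.0, 300.0)
far_site = (300.0, 800.0)

print(within_road_buffer(near_site, road), within_road_buffer(far_site, road))
```

The resulting True/False flag could serve as the 0/1 road variable for a site, although the study's exact encoding of the buffer variable is not specified in the text.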

Results and Discussion
After running the binary logistic regression, the results for the training sites used to test the model and for the validation sites were presented using tables and graphs. We performed several statistical tests following the recommendation of the American Statistical Association (ASA), which advised that conclusions based solely on p-values may not be sufficient and that some statisticians prefer to supplement p-values with other approaches [69]. For instance, following the layout of the classification table (Table 1), our results are displayed in Tables 2 and 3. The training-set model correctly predicted the presence of E. curvula in 82.35% of cases, while the validation model did so in 79.41% of cases. The overall accuracy, using the interpretation scheme from Table 1, is 80.88% for the training model and 77.94% for the validation model. Because these results are relatively close, we concluded that the logistic regression model was a good fit for this study.
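The reported percentages can be checked with simple arithmetic. The sketch below assumes 34 presence and 34 absence sites per set, as described in the methods, and uses cell counts that we back-calculated from the reported percentages; these counts are an inference, not values copied from Tables 2 and 3.

```python
def classification_rates(true_pos, false_neg, true_neg, false_pos):
    """Return (presence accuracy, overall accuracy) as fractions."""
    presence_rate = true_pos / (true_pos + false_neg)
    total = true_pos + false_neg + true_neg + false_pos
    overall_rate = (true_pos + true_neg) / total
    return presence_rate, overall_rate

# Counts inferred from the reported percentages (assumption, see lead-in).
training = classification_rates(true_pos=28, false_neg=6, true_neg=27, false_pos=7)
validation = classification_rates(true_pos=27, false_neg=7, true_neg=26, false_pos=8)

for name, (presence, overall) in [("training", training), ("validation", validation)]:
    print(f"{name}: presence {100 * presence:.2f}%, overall {100 * overall:.2f}%")
```

Under these assumed counts, 28 of 34 presence sites and 55 of 68 sites overall are classified correctly in the training set, which reproduces the 82.35% and 80.88% figures.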
We also analyzed the overall performance of the model by comparing the chi-square statistics and p-values of the training-set model with those of the validation-sites model (Table 4). In both cases, the results show once again that the logistic regression model is a good fit for the analysis of the distribution of E. curvula in the state of New Jersey.
We further analyzed results from the Hosmer-Lemeshow (HL) test to determine the goodness of fit of the logistic regression model for the two sets of sites (Table 5). The HL statistic for the training-set model was 46.371 with a p-value of 0.968 (> 0.05), and for the validation sites it was 59.098 with a p-value of 0.714 (> 0.05). In both cases, the HL test was non-significant, indicating that the logistic model was a good fit.
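As a rough consistency check on these HL p-values, the sketch below approximates the upper-tail chi-square probability with the Wilson-Hilferty normal approximation, using only the Python standard library. The 66 degrees of freedom are our assumption (g − 2 with each of the 68 sites treated as its own group); the text does not report the degrees of freedom, and the approximation is not the exact distribution used by the original software.

```python
import math

def chi2_upper_tail_approx(x, df):
    """Approximate P(X > x) for a chi-square variable (Wilson-Hilferty)."""
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    # Upper tail of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

ASSUMED_DF = 66  # assumption: 68 groups minus 2; not stated in the text

p_training = chi2_upper_tail_approx(46.371, ASSUMED_DF)
p_validation = chi2_upper_tail_approx(59.098, ASSUMED_DF)
print(f"training p ~ {p_training:.3f}, validation p ~ {p_validation:.3f}")
```

Both approximate p-values are well above 0.05 under this assumption, in line with the non-significant HL results reported above.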
The accuracy of the model was further evaluated using the area under the curve (AUC) values from the relative operating characteristic (ROC) curve (Figure 2). Bazzichetto et al. [35], Hosmer and Lemeshow [51], Pearce and Ferrier [60], and Zaiontz [52] have suggested that the closer an AUC value is to 1, the better the model fit and the better the ability of the model to discriminate between success and failure. The AUC value for the training points was 0.907, and for the validation sites, it was 0.862. Hence, we concluded that our model was a good fit and had an excellent discrimination capability.
Results from Table 6 show the significance of each selected variable to the fit of the model. All the variables had p-values less than 0.05.
A map of the probability of occurrence of E. curvula (Figure 3) was created using the logistic function from Equation (2). Higher probability values designated areas where E. curvula was very likely to be present (lighter gray tone on the map), while lower values indicated the areas where the species was least likely to be found (darker gray tone on the map).
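To make the AUC values reported above more concrete, the sketch below computes an AUC as the fraction of presence/absence site pairs in which the presence site receives the higher predicted probability, with ties counted as half. This pairwise formulation is equivalent to the area under the ROC curve. The scores are invented for illustration and are not the study's predictions.

```python
def auc_from_scores(presence_scores, absence_scores):
    """AUC as the share of (presence, absence) pairs ranked correctly."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(presence_scores) * len(absence_scores))

# Hypothetical predicted probabilities for presence and absence sites.
presence = [0.9, 0.8, 0.6, 0.4]
absence = [0.7, 0.3, 0.2, 0.1]

print(auc_from_scores(presence, absence))
```

An AUC of 1 means every presence site outranks every absence site, whereas 0.5 corresponds to ranking no better than chance.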
E. curvula has not spread throughout the entire state of New Jersey. Soil and climate parameters, as demonstrated by the predictive model, have helped to identify its range of growth and adaptation. Results from the spatial analysis of the distribution of E. curvula revealed that the species is mostly concentrated in the southern part of the state. The eight counties where the species has been spotted are all within the Coastal Plain Physiographic Province of New Jersey (Atlantic County, Burlington County, Camden County, Cape May County, Cumberland County, Gloucester County, Middlesex County, and Salem County) [70]. This physiographic province occupies three-fifths of New Jersey and is part of a much larger system that stretches over 3540 km from Cape Cod, Massachusetts, to the Mexican border [70,71]. Most of the soils in this province are sandy, dry, and acidic, with low fertility that makes them poorly suited to agricultural use aside from blueberries and cranberries. Therefore, large areas within the Coastal Plain Province remain undeveloped. In addition to the soil quality being consistent with E. curvula locations noted in the literature, the climatic conditions in southern New Jersey may be relatively similar to those that previous studies [38–40,47,48] documented as ideal for the species to thrive.
Whereas most invasive plant species thrive in the highly disturbed ecosystems found in urban and suburban areas, the results from this study showed that a high human population density might not be associated with a high concentration of E. curvula. The two counties where the species was found in its highest abundance (Atlantic and Burlington) had a relatively low population density. On the other hand, the counties with the highest population density, including Union, Essex, and Hudson in the New York Metropolitan Area, do not have any documented occurrences of E. curvula. Densely populated areas have more impervious surfaces, which inhibit the growth of the species, than areas with more open space and exposed soil. Many reported cases were found in ecologically intact forests that are protected by the State, the Federal Government, or the New Jersey Conservation Foundation (e.g., Wharton State Forest, Edwin B. Forsythe Wildlife Refuge, Higbee Beach Wildlife Management Area, and Franklin Parker Preserve).
This study found that the road network, in combination with other environmental factors, plays a role in species dispersal in New Jersey. In addition to being restricted by climate, all of the reported cases were found only within the 500 m buffer zones of the roads (Figure 4). Cars or trucks appear to be the mechanism of dispersal. However, the road speed limit has some impact on the species dispersal, because local roads (town roads) with a 25 mph speed limit do not have any recorded E. curvula. More than half of the reported cases were found along US and State highways (US Highway 322, US Highway 40, State Route 168, State Routes 72 and 73). The remaining occurrences were located along county or rural roads. The species was not recorded within the 500 m buffer zone of the Garden State Parkway, even though it was first planted along that road [41]. Vehicle tires may have been playing a dominant role in the seed dispersal as opposed to other suggested dispersal mechanisms [72].
In southern New Jersey, the species grows primarily on sandy soil but has been documented on a wide range of soils. Using the United States Department of Agriculture classification scheme [73], the species was found on Lakehurst series soil and Atsion sand series. These soil types consist of deep, moderately well or somewhat poorly drained soils and are typically found in lowland or upland areas. It also grows on the Downer and Manahawkin sand series, a loamy type of sand that can be found on 0 to 5 percent slopes. Previous studies have indicated that the species has adapted to survive in coarse-textured soils; the results from the logistic model have shown that it has not adapted to surviving and growing in fine-textured soils in our study area.
The increasing spread of invasive plants is a critical driver of global environmental change [34]. Conservation managers recognize that predicting infestations of invasive plants will help in planning for long-term environmental management [36]. Over the past decades, remote sensing has improved our understanding of the drivers, processes, and effects of plant invasions by identifying invaded ecosystems, predicting the distributions of invasive species, and clarifying landscape invasibility and the associated ecological impacts [34]. In this study, we have offered an additional avenue of prediction. While this study focused on one invasive grass species, we believe that the method of incorporating readily available GIS data into a logistic regression model can serve as a reliable tool to identify and predict the spatial occurrence of other invasive plants.
E. curvula can spread throughout non-planted sites, but the variables highlighted throughout this study control its distribution. Predictive models, such as logistic regression, can be built to predict its distribution. Even though some studies have suggested the efficiency of prioritizing the removal of low-density subpopulations of invasive species [74], we believe that it is cost-effective to concentrate on the southern end of New Jersey, where the species is likely to be found as ascertained by the logistic regression model [75]. Out of the four original counties [41], today the species is not found in Ocean County and Monmouth County. It has spread mostly west into new counties, including Camden, Cape May, Cumberland, Gloucester, Salem, and Middlesex. However, more than 75% of newly discovered plants are still located in Atlantic and Ocean Counties. Even though E. curvula has invaded new soil types, the species has been predicted to grow mostly on its preferred soil types. The road network has been the principal route of its dispersal. For its eradication, attention must be paid to areas along the roads and in forested or protected environments.
As GIS and remote sensing technologies continue to develop, we will see an increased effort to use these tools as predictors of invasion and to support management decision making. Here we present one such model that land managers can adopt. The next step is to bridge the divide between model theory and management practices in the development of efficient conservation solutions [35].