Wildlife‒vehicle collisions (WVCs) are a prominent road safety issue: highway expansion projects in natural areas make it harder for vehicles and wildlife to share highways safely, which poses a serious potential threat to both humans and wildlife [1]. WVCs damage wildlife populations and cause serious human injuries and major property loss, especially in Western countries [3]. In recent years, WVCs have accounted for approximately 5% of all motor vehicle collisions, and the proportion continues to rise [7].
WVC data are generally of two types: reported WVC data and carcass removal data [9]. To mitigate wildlife‒vehicle collision risk and develop effective countermeasures, transportation safety researchers frequently apply statistical regression models to quantify the effect of explanatory factors on WVCs [10]. Some recent studies [11] provided a comprehensive overview of vehicle collision analysis methodologies, including Poisson regression [13], Negative Binomial (NB) regression [15], the Poisson–lognormal regression model [18], the Gamma regression model [19], the semi-nonparametric Poisson regression model [21], etc. Using carcass removal data, Gkritza et al. [22] applied the Poisson regression model and the NB regression model to estimate the effect of identified factors on the frequency and severity of WVCs. Using reported WVC data, a stepwise logistic regression model was applied to identify the significant factors at a landscape scale and to locate the points of high collision risk [23]. Seiler [26] developed a multiple logistic regression model that uses the WVC data reported at observed sites to predict collisions at non-accident control sites. Tappe [27] proposed a multivariate approach to estimate the influence of county-level factors on WVC density. Lao et al. [28] applied a diagonal inflated bivariate Poisson regression to model reported WVCs and carcass data jointly and found a correlation between the two datasets. Neumann et al. [29] related WVC risk to the probability of wildlife road crossings by fitting a generalized linear mixed model to reported WVC data. Murphy and Xia [30] used a hierarchical Bayesian model and found that the degree of vegetation coverage has a positive effect on WVCs.
So far, many existing studies have analyzed either the reported WVC data or the carcass removal data and have neglected the possible underreporting issue [31]. Underreporting means that not all WVCs are reported and recorded. It can be observed in the discrepancy between the reported WVC data and the carcass removal data [32]. In the United States, reported WVC data are typically collected by the transportation agency, while carcass removal data are collected by the natural resource management agency [33]. Since the two datasets are commonly collected by different agencies using distinct equipment and methods, an inconsistency typically exists between them. The quality of both reported WVC data and carcass removal data is affected by underreporting. Rowden et al. [34] and Yannis et al. [35] point out that the level of underreporting is difficult to estimate because the combined effect of the examined temporal and spatial factors is difficult to quantify. Underreporting in reported WVC data may be due to the reporting decisions of travelers, the threshold that warrants a report, communication techniques, whether incidents are recorded by transportation agencies, etc. [9]. Carcass removal data may be underreported due to decomposition, difficulty in detecting the carcass, delayed removal, etc. [37]. As discussed by Huijser et al. [38], nearly two-thirds of WVCs go unreported in the United States. Alsop and Langley [39] applied a multivariate stepwise logistic method to identify significant factors for underreporting, including age and injury severity. Correspondingly, Yamamoto et al. [40] suggested that ignoring underreporting can bias the estimated effects of significant factors, even when sequential binary probit models are used in place of ordered-response probit models for better performance. Patil et al. [41] replaced a multinomial logit model with a nested logit model to accommodate underreporting and determine crash severity levels more accurately. However, Snow et al. [42] suggested that WVC studies are not sensitive to underreporting until the underreporting level becomes severe. Determining the underreporting level is important and requires available and reliable WVC data [43].
Based on the literature review, few approaches have been developed to analyze the underreported WVC data. Thus, the primary objective of this paper is to propose a copula-based approach to accommodate the underreporting issue and accurately quantify the impact of explanatory factors on WVCs when the additional underreporting information is available. To accomplish this objective, the WVC dataset collected in Washington State from 2002 to 2006 is considered and an underreporting indicator variable is generated to denote whether the wildlife‒vehicle collisions are underreported or not. To demonstrate the advantages of the proposed copula-based approach, the hotspot identification results using the proposed method and the conventional NB model are also compared.
2. Data Description
The reported wildlife‒vehicle collision and carcass removal data were collected on 10 highways (US2, SR8, US12, SR20, I90, US97, US101, US395, SR525, and SR970) in Washington State over a five-year period from 2002 to 2006. The dataset has been used in previous studies [28], and the variables are defined in AASHTO [44]. Summary statistics for the explanatory variables in the Washington data are provided in Table 1. Some variables (e.g., access control type, terrain type, and animal habitats) are binary. As shown in Table 1, the reported wildlife‒vehicle collision data range from 0 to 22, with a mean wildlife‒vehicle collision frequency of 0.24 and a standard deviation of 0.81. The carcass removal data have a mean value of 0.94 and a standard deviation of 3.88.
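The summary statistics quoted above already suggest why a count model that allows overdispersion, such as the NB model, is a natural baseline. The following is only a rough illustration: it uses the marginal means and standard deviations from Table 1, ignores all covariates, and assumes the common NB variance form Var(Y) = mean + alpha * mean^2. The helper name `dispersion_summary` is hypothetical and not part of the original analysis.

```python
def dispersion_summary(mean, sd):
    """Return (variance-to-mean ratio, method-of-moments dispersion alpha).

    Assumption: NB variance of the form Var(Y) = mean + alpha * mean**2.
    A ratio well above 1 indicates overdispersion relative to a Poisson model.
    """
    variance = sd ** 2
    ratio = variance / mean
    alpha = (variance - mean) / mean ** 2
    return ratio, alpha


# Marginal statistics quoted in the text (Table 1).
reported_ratio, reported_alpha = dispersion_summary(mean=0.24, sd=0.81)
carcass_ratio, carcass_alpha = dispersion_summary(mean=0.94, sd=3.88)

print(f"reported WVCs: ratio={reported_ratio:.2f}, alpha={reported_alpha:.2f}")
print(f"carcass data:  ratio={carcass_ratio:.2f}, alpha={carcass_alpha:.2f}")
```

Both variance-to-mean ratios exceed 1, which is inconsistent with the equal mean and variance implied by a Poisson model. The dispersion conditional on covariates would generally differ from these marginal values.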
Although both the reported WVC data and the carcass removal data are underreported, for different reasons, a larger number of wildlife‒vehicle collisions is usually recorded in carcass removal data [31], which is also the case in the data from Washington State. In other words, carcass removal data are less likely to suffer from underreporting. However, the spatial coverage of carcass removal datasets depends on the carcass removal strategies and funding availability [45]. Because of financial restrictions, not every regional transportation agency collects carcass removal data [42]. In this study, to investigate the factors that contribute to underreporting in the WVC data, a new variable (the underreporting indicator) is generated to denote whether the number of reported wildlife‒vehicle collisions per road segment is underreported. Specifically, if the number of carcasses is larger than the number of reported wildlife‒vehicle collisions, the number of reported wildlife‒vehicle collisions for that road segment is assumed to be underreported. Otherwise, the number of wildlife‒vehicle collisions reported for the road segment is assumed to be the actual number. Since carcass removal data are not often collected, the following analysis mainly considers the records of reported wildlife‒vehicle collisions from Washington State.
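The rule for the underreporting indicator can be written as a one-line comparison per road segment. The sketch below is a minimal illustration of that rule; the function name and the example segment counts are hypothetical and are not taken from the Washington dataset.

```python
def underreporting_indicator(reported, carcasses):
    """Return 1 if the reported WVC count of a segment is assumed
    underreported (more carcasses than reported collisions), else 0."""
    return 1 if carcasses > reported else 0


# Hypothetical road segments: (reported WVCs, carcasses removed).
segments = [(0, 0), (1, 3), (2, 2), (4, 1)]
indicators = [underreporting_indicator(r, c) for r, c in segments]
print(indicators)  # [0, 1, 0, 0]
```

Under this rule, a segment with equal counts, or with more reported collisions than carcasses, is treated as fully reported.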
5. Discussion and Conclusions
This research applied a copula regression model to examine the impact of underreporting on wildlife‒vehicle collision data analysis. The proposed Gaussian copula model was compared with the conventional NB model for analyzing the effects of explanatory variables using the WVC data collected from Washington State. To evaluate the hotspot identification (HSID) results from the Gaussian copula-based empirical Bayes (EB) method against those from the NB-based EB method, a new variable reflecting the actual WVC risk of each site was proposed and three HSID performance measures were adopted. The major findings can be summarized as follows: (1) For some explanatory variables, the Gaussian copula model provided different modeling results from the independent model (logistic regression model and NB model). A further examination suggested that some parameter estimates from the independent model were inappropriate (for example, AADT was not identified as a significant explanatory variable for the probability of underreporting). Neglecting the underreporting of WVC data may therefore result in biased parameter estimates. (2) For the Washington WVC dataset considered, the Gaussian copula-based EB method identified hotspots more accurately than the NB-based EB method. Because the Gaussian copula-based model accounts for the underreporting of WVC data, the proposed approach can generally provide a more accurate assessment of safety performance. Thus, HSID accuracy can possibly be improved by properly considering the underreporting of WVC data. Although the proposed Gaussian copula-based EB method is not yet ready, transportation safety analysts may use this approach to calculate the EB estimates for underreported WVC data.
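For readers less familiar with the baseline, the conventional NB-based EB estimate can be sketched as follows. This assumes the standard textbook form, in which the EB estimate is a weighted average of the model-predicted mean mu and the observed count, with weight w = 1 / (1 + alpha * mu) and NB dispersion parameter alpha. It is not the copula-based variant, whose equations are not reproduced in this section, and the function names and site values are hypothetical.

```python
def eb_estimate(mu, alpha, observed):
    """Standard NB-based empirical Bayes estimate for one site.

    mu:       model-predicted mean count for the site
    alpha:    NB dispersion parameter (Var = mu + alpha * mu**2)
    observed: observed (reported) count at the site
    """
    weight = 1.0 / (1.0 + alpha * mu)
    return weight * mu + (1.0 - weight) * observed


def rank_hotspots(sites, alpha, top_k):
    """Return the ids of the top_k sites ranked by EB estimate (descending).

    sites: list of (site_id, predicted_mean, observed_count) tuples
    """
    scored = [(eb_estimate(mu, alpha, y), site_id) for site_id, mu, y in sites]
    scored.sort(reverse=True)
    return [site_id for _, site_id in scored[:top_k]]


# Hypothetical sites, for illustration only.
example_sites = [("A", 0.2, 3), ("B", 0.5, 0), ("C", 0.3, 1)]
hotspots = rank_hotspots(example_sites, alpha=1.0, top_k=2)
print(hotspots)  # ['A', 'C']
```

The weight shrinks the observed count toward the model prediction. If the observed counts are underreported, the EB estimates inherit that bias, which is the motivation for accounting for underreporting before ranking sites.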
Since both reported WVC data and carcass removal data are underreported to some extent, it is hard to know the true number of WVCs. Thus, to further validate the findings from this study, WVC data collected from other regions with different characteristics should be examined using the proposed Gaussian copula model. In addition, as discussed by Wu et al. [58], when simulated data are used, the true safety state of each road segment is known and the true hotspots can be identified. Thus, in the future, a simulation study can be designed to examine the performance of the Gaussian copula model in identifying hotspots. This study adopts the total crash count to identify hotspots. Crash severity (e.g., fatal, incapacitating injury, non-incapacitating injury) and collision type (e.g., rear-end) can also be considered for HSID.
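A simulation study of the kind suggested above could start from a setup like the following sketch. It is not a design taken from this paper or from Wu et al. [58], and it makes three assumptions: true counts follow a gamma-Poisson (NB) mixture, each true collision is reported independently with a fixed probability, and the site means, dispersion, and reporting probability are arbitrary illustrative values. All names are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson variate with Knuth's multiplication method
    (adequate for the small means used in this sketch)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_sites(n_sites, alpha, report_prob, seed=0):
    """Return a list of (true_mean, true_count, reported_count) per site."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        base_mean = rng.uniform(0.1, 2.0)  # assumed range of site means
        # Gamma multiplier with mean 1 and variance alpha gives NB counts.
        true_mean = base_mean * rng.gammavariate(1.0 / alpha, alpha)
        true_count = sample_poisson(rng, true_mean)
        # Each true collision is reported independently with report_prob.
        reported = sum(rng.random() < report_prob for _ in range(true_count))
        sites.append((true_mean, true_count, reported))
    return sites


def top_k(values, k):
    """Indices of the k largest values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])


sites = simulate_sites(n_sites=200, alpha=1.0, report_prob=1 / 3, seed=42)
true_hot = top_k([s[0] for s in sites], 20)  # known true hotspots
flagged = top_k([s[2] for s in sites], 20)   # hotspots by reported count
print("share of true hotspots recovered:", len(true_hot & flagged) / 20)
```

Because the true mean of every simulated site is known, any HSID method can be scored against `true_hot`. Ranking by raw reported counts is used here only as a placeholder for the method under evaluation.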