A Copula-Based Approach for Accommodating the Underreporting E ff ect in Wildlife-Vehicle Crash Analysis

Wildlife-vehicle collision (WVC) data usually contain two types: the reported WVC data and carcass removal data. Previous studies often found a discrepancy between the number of reported WVC and carcass removal data, and the quality of both datasets is affected by underreporting. Underreporting means the number of WVCs is not fully recorded in the database; neglecting the underreporting in WVC data may result in biased parameter estimation results. In this study, a copula regression model linking wildlife-vehicle collisions and the underreporting outcome was proposed to consider the underreporting in WVC data. The WVC data collected from 10 highways in Washington State were analyzed using the copula regression model and the Negative Binomial (NB) model. The main findings from this study are as follows: (1) the Gaussian copula model can provide different modeling results when compared with the conventional modeling approach; (2) the hotspot identification results indicate that the Gaussian copula-based Empirical Bayes (EB) method can more accurately identify hotspots than the NB-based EB method. Thus, the proposed copula model may be a better alternative to the conventional NB model for modeling underreported WVC data.


Introduction
Wildlife-vehicle collisions (WVC) are a prominent road safety issue as highway expansion projects in natural areas endanger the safe sharing of highways between vehicles and wildlife, which is a great potential threat to humans and wildlife [1,2].WVCs cause damage to wildlife populations, as well as serious human injuries and major property loss, especially in Western countries [3][4][5][6].Recently, the number of WVCs has been approximately 5% of all motor vehicle collisions, and the proportion is continuously rising [7,8].
WVC data generally contain two types: reported WVC data and carcass removal data [9].In order to mitigate the wildlife-vehicle collision risk and develop effective countermeasures, statistical regression models are frequently applied by transportation safety researchers to quantify the effect of explanatory factors on WVCs [10].Some recent studies [11,12] provided a comprehensive overview of vehicle collision analysis methodologies, including Poisson regression [13,14], Negative Binomial (NB) regression [15][16][17], Poisson-lognormal regression model [18], Gamma regression model [19,20], semi-nonparametric Poisson regression model [21], etc.Using carcass removal data, Gkritza et al. [22] applied the Poisson regression model and the NB regression model to estimate the effect of identified factors on the frequency and severity of WVCs.Using reported WVC data, a stepwise logistic regression model was applied to identify the significant factors at a landscape scale and recognize the points of high collision risk [23][24][25].Seiler [26] developed a multiple logistic regression model to predict collisions in non-accident control sites through the WVC data reported in observed sites.Tappe [27] proposed a multivariate approach to estimate influential factors from the county level on WVC density.Lao et al. [28] applied a diagonal inflated bivariate Poisson regression to model reported WVCs and carcass data jointly and found a correlation between the two datasets.Neumann et al. [29] related the WVC risks to the probability of road-crossings of wildlife through generalized linear mixed model construction of reported WVC data.Murphy and Xia [30] found the positive effect of coverage degree of vegetation on WVCs by using a hierarchical Bayesian model.So far, many existing studies have focused on analyzing either the reported WVC data or carcass removal data and neglected the possible underreporting issue [31].Underreporting refers to the number of WVCs not being fully reported and recorded.This phenomenon can be observed in the discrepancy between the reported WVCs and carcass removal data [32].In the United States, reported WVC data are typically collected by the transportation agency, while the carcass removal data are collected by the natural resource management agency [33].Since the two datasets are commonly collected by different agencies using distinct equipment and methods, an inconsistency always exists between them.The quality of reported WVC and carcass removal data is affected by underreporting.Rowden et al. [34] and Yannis et al. [35] point out that it is difficult to estimate the level of underreporting since the combined effect of examined temporal and spatial factors is difficult to quantify.The underreporting of reported WVC data may be due to the reporting decision of travelers, the threshold of warrant report, communication techniques, and whether incidents are recorded by transportation agencies, etc. [9,36].Carcass removal data may be underreported due to decomposition, difficulty in detecting the carcass, tardy removal, etc. [37].As discussed by Huijser et al. [38], nearly two-thirds of WVC go unreported in the United States.Alsop and Langley [39] applied a multivariate stepwise logistic method to identify significant factors, including age, injury severity, etc., for underreporting.Correspondingly, Yamamoto et al. [40] suggested that not considering underreporting can lead to bias when estimating the significant factors, even when using sequential binary probit models for better performance than the ordered-response probit models.Patil et al. [41] replaced a multinomial logit model with a nested logit model to accommodate underreporting for more accurate crash severity level determination.However, Snow et al. [42] suggested that WVC studies are not sensitive to underreporting until the underreporting level becomes severe.Determining the underreporting level is important, and requires WVC data availability and reliability [43].
Based on the literature review, few approaches have been developed to analyze the underreported WVC data.Thus, the primary objective of this paper is to propose a copula-based approach to accommodate the underreporting issue and accurately quantify the impact of explanatory factors on WVCs when the additional underreporting information is available.To accomplish this objective, the WVC dataset collected in Washington State from 2002 to 2006 is considered and an underreporting indicator variable is generated to denote whether the wildlife-vehicle collisions are underreported or not.To demonstrate the advantages of the proposed copula-based approach, the hotspot identification results using the proposed method and the conventional NB model are also compared.

Data Description
The reported wildlife-vehicle collision and carcass removal data were collected on 10 highways (US2, SR8, US12, SR20, I90, US97, US101, US395, SR525, and SR970) in Washington State over a five-year period from 2002 to 2006.The dataset has been used in some previous studies [28] and the definition of variables is explained in the AASHTO [44].The summary statistics of characteristics of explanatory variables in Washington data are provided in Table 1.Some variables (e.g., access control type, terrain type, animal habitats, etc.) are binary variables.As shown in Table 1, the reported wildlife-vehicle collision data range from 0 to 22, and the mean wildlife-vehicle collision frequency is 0.24, with a standard deviation of 0.81.The carcass removal data have a mean value of 0.94 and a standard deviation of 3.88.For the reported WVC data and carcass removal data, although both data sources are underreported for different reasons, a larger number of wildlife-vehicle collisions are usually recorded in carcass removal data [31], which is also found in the data source from Washington State.In other words, carcass removal data are less likely to suffer from underreporting.However, the spatial coverage of carcass removal datasets depends on the carcass removal strategies and funding availability [45].Due to the restriction on finances, not every regional transportation agency collects carcass removal data [42].In this study, in order to investigate the impact of influential factors contributing to the WVC data underreporting issue, a new variable (underreporting indicator) is generated to denote whether the number of reported wildlife-vehicle collisions per road segment is underreported or not.Specifically, if the number of carcasses is larger than the number of reported wildlife-vehicle collisions, it is assumed that the number of reported wildlife-vehicle collisions for that road segment is underreported.Otherwise, the number of wildlife-vehicle collisions reported for the road segment is assumed to be the actual number.Since the carcass removal data are not often collected, the following analysis mainly considers records of reported wildlife-vehicle collisions from data sources in Washington State.

The Wildlife-Vehicle Collision Model
The NB [10] distribution is the model most frequently used by transportation safety researchers to analyze wildlife-vehicle collision data.Let Y be the number of wildlife-vehicle collisions during some time period, which is assumed to follow a Poisson distribution as defined below: where y i is the number of wildlife-vehicle collisions during some time period at site i; λ is the mean parameter of the Poisson distribution; and i is the site index.
The NB distribution arises if we let λ take a gamma distribution.For the complete derivation of the NB, more information can be found in Lord and Mannering (2010) [11].The probability density function of the NB is defined as follows: where is the expected number of wildlife-vehicle collisions during some time period at site i; x i is a vector of covariates, and β is a vector of the regression coefficients; and σ is the dispersion parameter.

The Underreporting Outcome Model
The underreporting indicator variable Z can be considered as a dichotomous variable (underreporting: 1; otherwise: 0).Thus, a logistic regression model is adopted here to analyze the impact of explanatory variables on the underreporting outcome, which is defined as follows: where z i is the probability that the number of wildlife-vehicle collisions during some time period at site i is underreported; x i is a vector of covariates; and γ is a vector of the regression coefficients.

Linking the Wildlife-Vehicle Collision Model and the Underreporting Outcome Model
The occurrence of wildlife-vehicle collisions and the probability of underreporting are affected by the explanatory variables.To capture the link between the occurrence of wildlife-vehicle collisions and the underreporting probability, a bivariate copula approach is proposed to describe variables (Y, Z).The joint cumulative density function (cdf) can be defined as follows: where C : [0, 1] 2 → [0, 1] is the copula function; F 1 (y) is the cdf of NB distribution for the number of reported wildlife-vehicle collisions model; and F 2 (z) is the cdf of binomial distribution for the underreporting outcome.
One advantage of the copula model is that the marginal distributions for variables (Y, Z) can be different [46].Some recent studies have comprehensively summarized different families of bivariate copula models [47][48][49], which are provided in Table 2. Some studies also applied different families of copula in transportation data analysis [50][51][52][53].The commonly used families of bivariate copulas are summarized in Table 2.Note that some bivariate copulas only allow a moderate correlation between two variables. Ali-Mikhail-Haq Note: a u and v represent the marginal cdfs of (Y, Z), respectively; b Φ 2 represents the standard cdf of bivariate normal distribution and Φ −1 denotes the inverse cdf of the standard univariate normal distribution; c D 1 (θ) is the first order Debye function; and, d D J (θ) = Let F(y, z) be the joint cdf of occurrence of wildlife-vehicle collisions and the underreporting probability, the copula regression is formulated as follows: where F 1 (y|x 1 , β, σ) and F 2 (z|x 2 , γ) are marginal cdfs of Y and Z; x j is a vector of explanatory variables, j = 1, 2; β and γ are vectors of the regression coefficients; and, σ is the dispersion parameter of variable Y; θ is the dependence parameter of the copula.Note that if θ = 0, then the formulation in Equation ( 5) corresponds to the multiplication of F 1 (y|x 1 , β, σ) and F 2 (z|x 2 , γ).Under this circumstance, F 1 (y|x 1 , β, σ) and F 2 (z|x 2 , γ) represent independent NB regression model and logistic regression model for describing the number of reported wildlife-vehicle collisions and underreporting probability, respectively.
The parameters of copula model are estimated using a penalized maximum likelihood method [54].Specifically, the log-likelihood function is given as: where f (y, z) is the joint density function of the copula model.Note that variable Z is a dichotomous variable that takes on the value 1 or 0. The parameters of copula model can be estimated by maximizing the log-likelihood function: argmax (t 1 , t 2 |x 1 , x 2 , Θ). Detailed explanation about the parameter estimation method can be found in [54].

Variables Affecting the Number of Reported Wildlife-Vehicle Collisions and the Underreporting Outcome
In this section, the mean functional form for modeling the number of reported wildlife-vehicle collisions is described below: where µ i is the expected number of wildlife-vehicle collisions at site i; L i represents the length of roadway segment in miles for site i; F i is the average daily traffic over five years traveling on site i; x 2i , . . ., x mi are the explanatory variables included in the functional form at site i; β = (β 0 , β 1 , β 2 , β 3 , . . ., β m ) are the estimated coefficients; and m is the number of explanatory variables.Similarly, the underreporting indicator is modeled using all explanatory variables: where z i is the expected probability of site i is underreported; x 1i , . . ., x mi are the explanatory variables; and γ = (γ 0 , γ 1 , . . ., γ m ) are the estimated coefficients.
To consider the possible link between the number of reported wildlife-vehicle collisions and the underreporting outcome, different families of copula models are used.Interestingly, the goodness-of-fit statistics for all copula models differ slightly.Among the different bivariate copula models, the Gaussian copula model has a simple function and its dimension can be easily expanded to trivariate form.Thus, in the following analysis of the parameter estimation results, we mainly focus on comparing the effects of different explanatory variables on underreporting outcome and reported wildlife-vehicle collisions using the independence copula (the logistic regression model and NB regression model are estimated independently) and Gaussian copula.
The modeling results for underreporting outcome and reported wildlife-vehicle collisions using the Gaussian copula model and the independent copula model are shown in Table 3. From Table 3, the results for some estimates of identified explanatory variables contributing to the underreporting indicator in Gaussian copula are consistent with those of independent copula.Similarly, the modeling results for number of reported WVCs are also partially consistent between the Gaussian copula and the independent copula.Note that the positive or negative values of estimates indicate the increasing or decreasing effect on underreporting probability and number of reported WVCs.Highway segments with restrictive access control impose a decreasing effect on the underreported indicator and reported WVCs.Restrictive access control facilities can restrict wildlife's access to the road segments, which decreases the number of WVCs.The estimated values of total number of lanes for both directions are all negative for underreported indicator and reported WVCs in Gaussian copula and independent copula.Posted speed limit and left shoulder width are shown as positive variables on the underreported indicator and reported WVCs.The terrain type being rolling or mountainous is found to increase the underreporting probability, while it decreases the number of reported WVCs.There is also an estimation difference between the Gaussian copula and the independent copula for the underreported indicator and the number of reported WVCs.AADT is found to be a significant contributor that decreases the underreporting probability in a Gaussian copula, while, interestingly, AADT is not identified as a significant explanatory variable for the underreported indicator in an independent copula using the logistic regression model.AADT are identified with the positive effect on number of reported WVCs since more observers may call the police when WVCs occur on a highway segment with heavier traffic.Under such conditions, the number of reported WVCs is close to the actual number of WVCs so that the underreporting probability is low.Therefore, the result of a Gaussian copula seems more reasonable for the underreported indicator to include AADT as a negative factor.Wider lanes result in a smaller number of reported WVCs for Gaussian copula, but are insignificant when using the independent copula.

Comparison of the Hotspot Identification Results Using the Gaussian Copula-Based EB Method and NB-Based EB Method
The Empirical Bayes (EB) method introduced by Hauer et al. [55] is adopted as a state-of-the-art hotspot identification (HSID) method and is recommended in the Highway Safety Manual for roadway safety management [56].In this section, the safety performance function estimated from the Gaussian copula model and NB model is used to calculate the EB estimates.To compare the hotspot identification accuracy using the modeling results from the copula model and the NB model, it is important to know the true wildlife-vehicle collision risk at each site.As discussed in the data description section, the reported wildlife-vehicle collision data and carcass removal data are both underreported to some extent, and a larger value is generally observed in carcass removal data [31].In this section, a new variable is defined as follows: where for site i, η i denotes the larger value between the number of reported wildlife-vehicle collision and the number of carcasses.When conducting the HSID analysis, variable η i is adopted to possibly reflect the real wildlife-vehicle collision risk of each site.Hereinafter, the number of reported wildlife-vehicle collisions and the underreporting indicator is denoted as the training data.Variable η i along with all explanatory variables are considered as the validation data.Similar to the three tests proposed by Cheng and Washington (2008), three performance measures are used to evaluate the hotspot identification results from the Gaussian copula-based EB method and the NB-based EB method.

Measure I
This measure is based on the idea that collision-prone sites will usually exhibit high collision frequencies, which means that a desirable HSID method should identify hotspot sites as those that can be expected to have poor safety performance.The measure considers the sites identified as hotspots by the copula-based EB method and NB-based EB method using the building data, and compares the methods based on the sum of η i at collision-prone site i.The optimal HSID method as determined by this measure is the one with the largest number of wildlife-vehicle collision (η i ) occurring on the sites identified as high risk by that method.Out of n sites, a threshold is defined for each method such that c × n are designated as high risk by each method.Then, the following test statistic (Equation ( 10)) is computed for the two EB methods and the preferred method is denoted as the one which yields the highest value of T I( j) [57].
where T I( j) is the total number of wildlife-vehicle collisions from validation data; n is the total number of sites under analysis; η k is the number of collisions for site k defined in Equation ( 9); c is the threshold for high-risk sites, defined as the fraction of all sites n that are designated high-risk; and j is the Gaussian copula-based EB method or the NB-based EB method.

Measure II
Measure II evaluates the performance of two EB methods by the extent to which the same sites are identified as hotspots using the building data and validation data.The better performing method is evaluated by the larger number of identified hotspots that are consistent between the building data and validation data, which is defined in Equation ( 11): T II( j) = k n−cn , k n−cn+1 , . . .k n j, buliding data ∩ k n−cn , k n−cn+1 , . . .k n j, testing data , (11) where T II( j) is the total number of the same sites identified using the building data and validation data; j is the Gaussian copula-based EB method or the NB-based EB method; and k is the site index.

Measure III
Measure III takes into account the safety performance ranking assigned by two EB methods, and estimates performance based on the ranking consistency between the building data and validation data.The performance of each method is calculated as the sum of differences between the rank assigned to all c × n high-risk sites for building data and the rank assigned to the same c × n sites for validation data.The test statistic for measure III is computed in Equation ( 12): where T III( j) is the total test statistic for method j; (k j, buliding data ) is the rank of site k obtained from the method j using building data; (k j, validation data ) is the rank of site k obtained from the method j using validation data; j is the Gaussian copula-based EB method or the NB-based EB method; and, k is the site index.The Gaussian copula-based EB method and the NB-based EB method are evaluated using the three proposed measures.Specifically, for the building data, both the Gaussian copula-based EB method and the NB-based EB method are used to identify poor safety performance sites.For the validation data, the NB regression is applied to estimate the safety performance function and the NB-based EB method is considered to label the true high-risk sites.As all test procedures involve comparison across two wildlife-vehicle collision datasets, we consider three different scenarios in terms of the number of high-risk sites selected for consideration under each HSID method.These scenarios correspond to considering 1%, 5%, and 10% of all sites as high-risk (i.e., c = [0.01,0.05, 0.10]).For example, in this study, when c = 0.05, a total of approximately 224 sites (i.e., about 5% of the 4474 sites) will be considered as high-risk.
Table 4 shows the results of the two HSID methods.For measure I, the underlying principle is that high-risk sites identified by two EB methods should also show high collision counts T I( j) .And thus the larger the value of T I( j) , the better performing the corresponding EB method is.From Table 4, for all cases (c = 0.01, 0.05 and 0.1 of sites are considered as high-risk), the copula-based EB method provides better accuracy than the NB-based EB method.Measure II is adopted to assess the consistent identification of the same high-risk sites using the building data and validation data.Similarly, higher value of test statistic T II( j) indicates the better performance of that HSID method.It can be observed that across all three scenarios, the copula-based EB method is preferred.For Measure III, the ranking of sites identified as high-risk using the building data are compared to the rankings of the same sites using the validation data.Thus, smaller value of the test statistic T III( j) suggests better performance of the HSID method.As shown in Table 4, the copula-based EB method yields smaller test statistic values than the NB-based EB method.In sum, the copula-based EB method consistently demonstrates better HSID performance than the NB-based EB method.The possible explanation for this is that since the copula model considers the underreporting issue associated with each site and the corresponding safety performance function estimated from the Gaussian copula model can better reflect the actual collision risk of the site.

Discussion and Conclusions
This research applied the copula regression model to examine the impact of underreporting on wildlife-vehicle collision data analysis.The proposed Gaussian copula model was compared with the conventional NB model for analyzing the effects of explanatory variables using the WVC data collected from Washington State.To evaluate the HSID results from the Gaussian copula-based EB method against the NB-based EB method, a new variable to reflect the actual WVCs risk of each site was proposed and three HSID performance measures were adopted.The major findings can be summarized as follows: (1) For some explanatory variables, the Gaussian copula model provided different modeling results compared with the independent model (logistic regression model and NB model).A further examination suggested that the estimates of parameters for some variables from the independent model were inappropriate (for example, AADT is not identified as the significant explanatory variables for affecting the probability of underreporting).Neglecting the underreporting of the WVC data may result in biased parameter estimation results.(2) For the considered Washington WVC dataset, the Gaussian copula-based EB method can more accurately identify the hotspots than the NB-based EB method.Since the Gaussian copula-based model can consider the underreporting of WVC data, the proposed approach can generally provide more accurate safety performance.Thus, the HSID accuracy can possibly be improved by properly considering the underreporting of WVC data.Although the proposed Gaussian copula-based EB method is not ready yet, transportation safety analysts may use this approach to calculate the EB estimates for underreported WVC data.
Since both reported WVC data and carcass removal data are underreported to some extent, it is hard to know the true number of WVCs.Thus, to further validate the findings from this study, the WVC data collected from other regions with different characteristics should be examined using the proposed Gaussian copula model.In addition, as discussed by Wu et al. [58], when using the simulated data, the true safety state of each road segment can be known and the true hotspots can be identified.Thus, in the future, a simulation study can be designed to examine the performance of the Gaussian copula model in identifying the hotspot.This study adopts the total crash count to identify hotspots.Crash severity (i.e., fatal, incapacitating injury, non-incapacitating injury, etc.) and collision type (i.e., rear-end, etc.) can be also considered for HSID.

Table 1 .
Description of variables for the wildlife-vehicle collision data.

Table 2 .
The characteristics of different families of bivariate copulas.

Table 3 .
Modeling results for underreporting outcome and reported wildlife-vehicle collisions using the Gaussian copula model and the independent copula model.

Table 4 .
Test statistics of three measures using the Gaussian copula-based EB method and NB-based EB method.