Spatial Econometric Analysis of Road Traffic Crashes

In keeping with the basic principles of sustainable development, decisions about transport safety projects must be made following expert preparation, using reliable, professional methods. A prerequisite for the cost–benefit analysis of investments is to monitor the efficiency of accident forecasting models constantly and to update them continuously. This paper presents an accident forecasting model for urban areas that handles both the properties of the public road infrastructure and spatial dependency relations. As the aim was to model the urban environment, we focused on road public transportation modes (bus and trolleybus) and vulnerable road users (bicyclists) using shared infrastructure elements. Road accident data from 2016 to 2018 for the whole road network of Budapest, Hungary, are analyzed, focusing on road links (i.e., road segments between junctions), by applying spatial econometric models. The result is a model that decision-makers can also use to estimate the expected accident frequency and thus to establish the optimal sequence of road safety interventions.


Introduction
Major investments in the field of transportation have several unique characteristics that differentiate them from investments in other fields of economy. Consequently, professional preparation of decisions is crucial, as investment costs are typically very high. Such projects create infrastructure with several decades of service life, and thus problems arising due to poor preparation can only be amended with a considerable amount of resources. Road safety is one of the pillars in the Sustainable Development Goals, and it does not only explicitly address the issues related to good health and safety of the people, but it is also associated with the targets towards the development of sustainable cities [1]. Because of the scarce resources available, the implementation of the most effective road safety improvements is of paramount importance, so an analysis of each project's effectiveness cannot be avoided [2]. Efficiency analysis ought to apply a complex approach, including a socio-economic study [3,4]. Together with the financial analysis, parameters affecting a wide range of economic actors need to be involved and evaluated from several points of view. Such a determining feature is the incidence and socio-economic impact of traffic accidents.
The scarcity of widely available methodologies for defining and managing these effects in developed countries shows that the analysis of these effects is complicated, though mandatory. However, it is of utmost importance to detect areas where a transportation safety improvement will have a significant impact.
Since accident counts are discrete, random, and non-negative, Poisson and negative binomial models have generally been used (e.g., [5]). The problem with these models is that they assume accidents are independent of the location in which they occur, which is not the case in reality [6]. The safety performance functions (SPF) used in international practice are usually modeled with the negative binomial distribution because of the typically occurring overdispersion problem; the resulting equations can be solved, but there is no evidence that accident frequency indeed follows a negative binomial distribution. Thus, we aimed to take the spatial effect into account when setting up the estimator model.
While road traffic crashes tend to occur at specific times, they are also affected by a comprehensive interaction of spatial factors [7]. The heterogeneity among traffic crashes on roads with the same geometric conditions arises because observations are unlikely to be independent across space [8].
Accidents are not randomly distributed in space but exhibit spatial autocorrelation [7]. Spatial autocorrelation is the degree to which the value of a variable at a given location influences the value of the same variable at a contiguous location [9].
In the literature, 36 articles have been identified that also take spatiality into account in some way when analyzing traffic accidents. Table 1 summarizes how the literature can be classified, what explanatory variables that are relevant to us are used, and which of the most commonly used spatial econometric models (SAR-Spatial Autoregression, SEM-Spatial Error Model) appear in it.
Table 1. (Table body not recoverable; surviving row labels: Ha & Thill 2011; Al-Hasani et al. 2019; Saeed et al. 2020; Zhang et al. 2020; Lee et al. 2018; Azimian and Pyrialakou 2020; Korter 2016; Li et al. 2020.)
Notes: ×: the given model contains that parameter; -: the given model does not contain that parameter; (a) elements of line road network; (b) spatial units; AADT: annual average daily traffic; LN: the given parameter is logarithmic; NS: the given parameter is not significant.
Spatial traffic accident models in the literature can be divided into three major groups according to how spatial interactions are handled: (i) spatial units are delimited, but the interactions between them are not modeled; (ii) models that handle spatial interaction are compared with models that do not; (iii) only models that handle spatial interactions are set up. Within these groups, one of two approaches was used in each case: (a) the analyses are based on the elements of the line road network, distinguishing whether a given accident occurs at an intersection or between two intersections; (b) the analyses are based on spatial units. On this basis, the articles published in the literature can be classified as shown in Table 2.

Table 2. Classification of Articles.
Group | (a) Elements of Line Road Network | (b) Spatial Units
(i) no spatial interactions considered | [10,11] | [12–16]
(ii) compare models with and without spatial parameters | [18–20,24,28] | [6,8,17,21–23,25–27,29–33]
(iii) models considering spatial interactions | [34,37,38,40,43] | [35,36,39,41,42]

In the literature, 36 spatial models were identified. As Tables 1 and 2 illustrate, 31 of these either use spatial units as the observed units (b), do not take spatial interactions into account (i), or analyze spatial models only (iii). This article follows the five studies in cell (ii)(a) of Table 2, as we compared spatial and non-spatial models (ii) based on the elements of the linear road network (a), including the sections between intersections. Two of these articles deal with intersections [24,28]. The remaining three discuss accidents on motorways. Wang et al. [18] examine the accident frequency of the M25 motorway around London, for which they build Poisson-lognormal, Poisson-Gamma, and Poisson-lognormal CAR (conditional autoregressive) [44] models. Their explanatory variables include infrastructure characteristics as well as traffic characteristics; among these, the Congestion Index is highlighted, but traffic volume and speed also appear. Castro et al. [19] deal with motorways around Austin, TX, USA, with explanatory variables covering crash characteristics, highway design attributes, characteristics of the drivers and vehicles involved in the crash, and environmental factors. Xie et al. [20] examine accidents caused by Hurricane Sandy according to the type of accident, where it occurred, and at what time of day. The articles detailed above, which most closely resemble ours, consider highway sections; thus, our analysis of urban sections fills a gap.
Many of the explanatory variables considered here also appear in the articles examined, such as traffic volume, speed, number of lanes, or length of sections (e.g., [18]), the existence of bus lanes (e.g., [11]), or the ratio of truck traffic (e.g., [34]). Another difference is that, in addition to the traditionally used SAR and SEM approaches, we also use Spatial Autocorrelation (SAC) [45] models. Although several articles have taken other approaches in addition to the SAR and SEM models, such as the Spatial Durbin Model (e.g., [6,30]) or the Besag-York-Mollié model [46] (e.g., [35,36]), SAC models have not yet been used in the literature known to us. In summary, we focused on an area that is missing from the literature, namely the joint use of four elements: (i) we used spatial econometric models; (ii) we used real network sections instead of other spatial units; (iii) we analyzed urban roads instead of rural areas; (iv) we applied the spatial autocorrelation (SAC) modeling technique.
The main objective of this article is to identify a proper model for estimating accident frequencies on sections of the road infrastructure with different properties in Budapest, Hungary (for Hungary, there are numerous traffic safety models, such as [47][48][49]). Our assumption is that a spatial econometric point of view can help us create such a model.
The paper is organized as follows. The dataset and its preprocessing are described in Section 2. Section 3 explains the methodology employed to diagnose spatial dependence in the data. The test and model results are discussed in Section 4, and the paper is concluded in Section 5.

Data
Data Collection
Detailed data on accidents involving personal injury or fatality were obtained from the Center for Budapest Transport (BKK). Accident data from 2016 to 2018 were used for the analysis. The road network data come from the EFM (the official traffic model for Budapest and its agglomeration), which models the current state of the Budapest network. This database contains extensive data on the road segments, including the average daily traffic of passenger cars, buses, bicycles, and heavy goods vehicles (HGVs); properties of the road segments (number of lanes and the existence of a bus lane); and free-flow speed and current speed.
This line network was segmented into junctions and links. Because there is no physical boundary between junctions and links, an accident was assumed to belong to a junction if it occurred less than 50 m from that junction. The links are thus the road segments that remain between the 50 m radius areas around the junctions. Only accidents on the road links are considered in this study.
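The 50 m rule described above can be sketched as follows. This is a minimal illustration, not the authors' GIS workflow: it assumes planar coordinates in metres (a projected coordinate system), and the function name, the toy junctions, and the toy accident locations are hypothetical.

```python
import math

JUNCTION_RADIUS_M = 50.0  # threshold stated in the text


def assign_accident(accident_xy, junctions_xy, radius_m=JUNCTION_RADIUS_M):
    """Return 'junction' if the accident is closer than radius_m to any junction, else 'link'.

    Coordinates are assumed to be planar and expressed in metres.
    """
    ax, ay = accident_xy
    nearest = min(math.hypot(ax - jx, ay - jy) for jx, jy in junctions_xy)
    return "junction" if nearest < radius_m else "link"


# Hypothetical toy data: two junctions 300 m apart and four accidents.
junctions = [(0.0, 0.0), (300.0, 0.0)]
accidents = [(10.0, 5.0), (150.0, 2.0), (270.0, 0.0), (249.0, 0.0)]

labels = [assign_accident(a, junctions) for a in accidents]
# Only link accidents would enter a link-based analysis.
link_accidents = [a for a, lab in zip(accidents, labels) if lab == "link"]
```

With real data, the straight-line distance used here would probably be replaced by a distance measured along the network or by a GIS buffer operation.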
In general, accidents are not completely randomly distributed; their occurrence is strongly associated with the traffic conditions and geometric features of the road. This postulate leads researchers to conduct segment-based traffic safety analysis [50]. The total number of road segments in Budapest considered in this study is 4701. The total length of the road network in the dataset is 1310 km, and the average link length is 0.28 km. In total, 1586 accidents that occurred on the road links of Budapest in the three years from 2016 to 2018 were considered in the analysis. Figure 1 is a schematic map produced with QGIS, version 3.10.12 (QGIS Association), showing the distribution of accidents on the road links of Budapest for the three years surveyed. Accidents were assigned to the corresponding road link (with counts ranging from 0 to 28 per link), and every available EFM parameter that could be related to the accidents was recorded. The most important properties, which serve as candidate explanatory variables in the modeling process, are listed in Table 3. To obtain better modeling results, two variables (number of lanes and free-flow speed) were merged, because together they clearly specify the main type of the road. This yields a new set of explanatory variables denoted by CAT_XY, where X denotes the free-flow speed category and Y the number of lanes. For this purpose, the roads were categorized by free-flow speed into four clusters in line with Hungarian law:

1. Residential streets (free-flow speed less than or equal to 20 kph);
...
4. High-speed roads (free-flow speed above 90 kph).
Two dummy variables require further discussion before the results can be analyzed: the first indicates whether heavy goods vehicles above 12 tons are allowed on the section, and the second indicates the existence of a bus lane. Their definitions are given in Equations (1) and (2).
$$ \text{HGV dummy} = \begin{cases} 1, & \text{if HGVs above 12 tons are allowed on the given section} \\ 0, & \text{if HGVs above 12 tons are not allowed on the given section} \end{cases} \qquad (1) $$
$$ \text{bus lane dummy} = \begin{cases} 1, & \text{if there is a bus lane in at least one direction} \\ 0, & \text{if there is no bus lane along the section} \end{cases} \qquad (2) $$
Accidents that occurred on a given link over the examined period are aggregated. This aggregation assumes that the contributing factors for accidents on the same link are similar, which may not always hold [51].
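The CAT_XY coding and the two dummy variables can be illustrated with a short sketch. Only the 20 kph and 90 kph boundaries are stated above; the intermediate boundary at 50 kph, the use of the indices 1 to 4 for X, and all function and field names are assumptions made for this illustration.

```python
def speed_category(free_flow_kph):
    """Map a free-flow speed (kph) to a category index 1..4."""
    if free_flow_kph <= 20:
        return 1  # residential streets (boundary stated in the text)
    if free_flow_kph <= 50:
        return 2  # ASSUMED intermediate band
    if free_flow_kph <= 90:
        return 3  # ASSUMED intermediate band
    return 4      # high-speed roads (boundary stated in the text)


def cat_label(free_flow_kph, lanes):
    """Build a CAT_XY label: X = speed category, Y = number of lanes."""
    return f"CAT_{speed_category(free_flow_kph)}{lanes}"


def hgv_dummy(hgv_over_12t_allowed):
    """Equation (1): 1 if HGVs above 12 tons are allowed, else 0."""
    return 1 if hgv_over_12t_allowed else 0


def bus_lane_dummy(bus_lane_dir_a, bus_lane_dir_b):
    """Equation (2): 1 if there is a bus lane in at least one direction, else 0."""
    return 1 if (bus_lane_dir_a or bus_lane_dir_b) else 0


# Hypothetical links: speed (kph), lanes, HGV permission, bus lane per direction.
links = [
    {"speed": 20, "lanes": 2, "hgv": False, "bus": (False, False)},
    {"speed": 50, "lanes": 4, "hgv": True, "bus": (True, False)},
    {"speed": 110, "lanes": 7, "hgv": True, "bus": (False, False)},
]
coded = [
    (cat_label(l["speed"], l["lanes"]), hgv_dummy(l["hgv"]), bus_lane_dummy(*l["bus"]))
    for l in links
]
```

In a regression, each CAT_XY label would then be expanded into its own indicator column, with one category left out as the reference.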
All insignificant variables were eliminated based on likelihood ratio analysis and the standard errors of the parameter estimates. The correlation matrix was analyzed, and if two variables were strongly correlated with each other, one of them was eliminated from the model. This helps avoid confounding between the variables. A variable found to be significant was also eliminated if it contributed little to the decrease in deviance.
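The correlation-based screening step can be sketched as below. The text does not state the correlation threshold or which variable of a correlated pair was dropped, so the 0.8 cutoff, the keep-the-first-variable rule, and the toy columns are all assumptions of this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.8):
    """Greedily keep variables whose |r| with every kept variable is below threshold.

    `columns` maps variable names to value lists; earlier names take priority.
    The threshold is an ASSUMED value, not one reported in the text.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy columns for five links.
columns = {
    "aadt_cars": [1, 2, 3, 4, 5],
    "aadt_total": [2, 4, 6, 8, 11],       # nearly collinear with aadt_cars
    "length_km": [0.3, 0.1, 0.4, 0.1, 0.5],
}
kept = drop_correlated(columns)
```

A fuller procedure would combine this screen with the significance and deviance criteria mentioned above rather than relying on variable order.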
The number of accidents per link and the corresponding number of road links are summarized in Table 4.

Methodology
The applied methodology is mostly based on the research of Luc Anselin [52] and Attila Varga [53]. The main idea is that traditional linear regression models estimated by ordinary least squares cannot account for the fact that spatially referenced data are not independent of their location [52]. Consequently, models whose regression errors exhibit spatial autocorrelation cannot be used for further investigation; a spatial econometric model ought to be used instead.
The spatial econometric analysis applied here follows the methodology described by [54]. The first step is to set up the Ordinary Least Squares (OLS) models with proper explanatory variables (3).
$$ y = X\beta + \varepsilon \qquad (3) $$
where
• y: dependent variable;
• X: matrix of explanatory variables;
• β: vector of regression parameters;
• ε: vector of error terms.
The second step is the definition of the weight matrices. In accordance with the methodology proposed in the literature [54], queen type binary (B) matrices have been applied in our research, which are transformed into row standardized weight matrices (W). In a binary matrix, the entry for a pair of spatial units is 1 if the two units are neighbors and 0 if they are not. A detailed description of this method can be found in [52][53][54][55][56]. Two neighborhood definitions are introduced: under the first, two points of interest are neighbors if they are closer to each other than a given distance; under the second, the neighbors of a given point are its k closest points (k nearest neighbors), where k can be chosen arbitrarily.
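The two neighborhood definitions and the row standardization of a binary matrix B into W can be sketched as follows. This is an illustrative pure-Python version under simple assumptions (planar coordinates, Euclidean distance, ties broken by index); it is not the spdep implementation used in the study, and the point set is hypothetical.

```python
import math


def knn_binary_weights(points, k):
    """Binary matrix B with B[i][j] = 1 if j is among the k nearest neighbors of i."""
    n = len(points)
    B = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        # Sort the other points by distance; ties fall back to the lower index.
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            B[i][j] = 1
    return B


def distance_band_weights(points, d):
    """Binary matrix B with B[i][j] = 1 if points i and j are closer than d."""
    n = len(points)
    return [
        [1 if i != j and math.hypot(points[i][0] - points[j][0],
                                    points[i][1] - points[j][1]) < d else 0
         for j in range(n)]
        for i in range(n)
    ]


def row_standardize(B):
    """Divide each row by its sum; rows without neighbors stay all zero."""
    W = []
    for row in B:
        s = sum(row)
        W.append([b / s if s else 0.0 for b in row])
    return W


# Hypothetical link centroids on a line (units arbitrary).
points = [(0, 0), (1, 0), (3, 0), (7, 0)]
B1 = knn_binary_weights(points, k=1)
W2 = row_standardize(knn_binary_weights(points, k=2))
W_band = row_standardize(distance_band_weights(points, d=2.5))
```

Note that k-nearest-neighbor matrices are generally asymmetric (the last point's nearest neighbor is the third point, but not the reverse), while a distance band can leave isolated units with no neighbors at all, as happens to the last point here.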
The third methodological step is to prove the existence of spatial autocorrelation in the dataset. Whether spatial autocorrelation is detected depends on the chosen weight matrix. The existence of spatial autocorrelation and the suitability of the weight matrix can be tested by the Moran's I test [57]. If the test indicates a significant effect, a spatial econometric model can be applied [53]. If the existence of spatial autocorrelation is proven, three types of models can be set up: the SAR, the SEM, and the SAC. The Lagrange-Multiplier test is available to decide which model should be used [54,58].
If it can be assumed that the spatially lagged dependent variable also affects the dependent variable, the SAR model should be chosen. In this case, the following regression formula can be applied [53]:
$$ y = \rho W y + X\beta + \varepsilon $$
where
• ρ: spatial autoregressive parameter;
• W: row standardized weight matrix.
If spatial dependence can be eliminated from the model and the spatial effects can be transferred to the error term, the SEM model should be used. In this case, the formulas presented below can be used [53]:
$$ y = X\beta + \zeta, \qquad \zeta = \lambda W \zeta + \varepsilon $$
where
• ζ: vector of spatially dependent errors;
• λ: autoregressive error parameter.
The third option is to use the two approaches (the modeling of spatial lag and of spatial error) together. There are several models for this [45], out of which the SAC was used, because it can handle two different weight matrices (W_1 and W_2) at the same time. Its formula is given in Equation (8).
$$ y = \rho W_1 y + X\beta + \zeta, \qquad \zeta = \lambda W_2 \zeta + \varepsilon \qquad (8) $$
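To make the Moran's I statistic concrete, the sketch below computes the global statistic I = (N / S0) · Σ_i Σ_j w_ij z_i z_j / Σ_i z_i², where z_i are deviations from the mean and S0 is the sum of all weights. It is a minimal illustration on raw values with a hypothetical chain-shaped neighborhood; the study itself tests regression residuals with the spdep package, and the significance test (z-score) is omitted here.

```python
def chain_weights(n):
    """Row-standardized weights for n units in a line; neighbors are adjacent units.

    The chain layout is a hypothetical toy structure, not the Budapest network.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            W[i][j] = 1.0 / len(nbrs)
    return W


def morans_i(values, W):
    """Global Moran's I of `values` under the weight matrix W."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in W)
    num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den


W = chain_weights(6)
i_clustered = morans_i([1, 1, 1, 5, 5, 5], W)    # similar values sit next to each other
i_alternating = morans_i([1, 5, 1, 5, 1, 5], W)  # neighbors always differ
expected_i = -1.0 / (6 - 1)                      # null expectation, -1/(N - 1)
```

A value above the null expectation, as in the clustered case, indicates positive spatial autocorrelation; the alternating pattern yields a strongly negative value.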
After building the OLS, SAR, SEM, and SAC models, the results were compared using Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [59]. In addition to AIC and BIC, we used the likelihood ratio (LR) test as well. With this approach, two nested (hierarchical) models, where the more general one has one or more additional estimated parameters, can be compared on the basis of the ratio of their likelihood values. The test formula is given in Equation (9) [60]:
$$ LR = 2\,(\ln L_1 - \ln L_0) \qquad (9) $$
where L_1 is the maximized likelihood of the more general model and L_0 is that of the restricted model.
Spatial econometric analyses were performed in the R 3.4.0 environment [61]. The maptools [62], sp [63,64], spdep [63,65] and spatialreg [63,66,67] libraries were used during the analysis.
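The comparison criteria can be written out in a few lines, using the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, and an LR statistic referred to a chi-square distribution with one degree of freedom (one extra spatial parameter). The log-likelihoods and parameter counts below are hypothetical placeholders, not values from this study; only n = 4701 (the number of links) comes from the text.

```python
import math


def aic(log_lik, k):
    """Akaike's Information Criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


def lr_test_1df(log_lik_restricted, log_lik_general):
    """LR statistic and p-value for one additional parameter (chi-square, 1 df)."""
    lr = 2 * (log_lik_general - log_lik_restricted)
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr / 2)) if lr > 0 else 1.0
    return lr, p_value


n_links = 4701                # number of road links (from the text)
ll_ols, k_ols = -5000.0, 10   # HYPOTHETICAL log-likelihood and parameter count
ll_sem, k_sem = -4990.0, 11   # HYPOTHETICAL; one extra parameter (lambda)

aic_ols, aic_sem = aic(ll_ols, k_ols), aic(ll_sem, k_sem)
bic_ols, bic_sem = bic(ll_ols, k_ols, n_links), bic(ll_sem, k_sem, n_links)
lr, p_value = lr_test_1df(ll_ols, ll_sem)
```

Lower AIC and BIC values indicate the preferred model, and a small LR p-value indicates that the additional spatial parameter improves the fit significantly. Comparing OLS with the SAC model would involve two extra parameters and hence two degrees of freedom, which this one-parameter helper does not cover.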

Results and Discussions
The following two hypotheses were tested to determine whether spatial dependence is present.
H0: Accident frequencies by spatial unit are not spatially correlated.
H1: Accident frequencies by spatial unit are spatially correlated.

Diagnosis of Spatial Dependence
The Moran's I test was run for different bandwidths and k nearest neighbor values. As Table 5 shows, all the p-values are statistically significant at the five percent level, and the z-scores are positive and well above 1.96 (the critical z-value at 95% confidence). Therefore, the null hypothesis is rejected, and the dataset is shown to be spatially clustered. In addition, since I exceeds its expected value under the null hypothesis, E(I) = −1/(N − 1), there is positive spatial autocorrelation. Figure 2 presents the Global Moran's I test summary and confirms that, given the z-score of 4.888, there is less than a 1% likelihood that this clustered pattern could be the result of random chance. Thus, a clear positive autocorrelation can be detected in the examined sample of personal injury road traffic accidents that occurred on the line network in the urban area of Budapest in 2016-2018. Consequently, the result of the traditional OLS procedure cannot be relied upon, because the Gauss–Markov conditions [68] are not fulfilled: autocorrelation can be detected in the error terms. It is therefore expedient to move to the spatial econometric approach, which is suitable for dealing with spatial dependencies.
Since all Moran's I values for the regression residuals are significant and positive, the weight matrix with the highest I (8.71 × 10^−2), based on k nearest neighbors with k = 1, is finally selected, as it could give a better model. The matrix is row standardized. Figure 3 shows the Moran's I scatterplots for Euclidean distance and k nearest neighbors, respectively.
Based on the Lagrange Multiplier test, the following outputs were recorded (Table 6). LMerr and LMlag show which spatial econometric model ought to be used, RLMerr and RLMlag are the robust versions of the same tests, and SARMA indicates whether the SAC model could be used.
The significance level is indicated in the tables by the following notations: "." stands for p < 0.1, * stands for p < 0.05, ** stands for p < 0.01, and *** stands for p < 0.001.
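The statement that a z-score of 4.888 leaves less than a 1% likelihood of random clustering can be checked with a short calculation. The sketch assumes a two-sided test against the standard normal distribution and uses the figures quoted above (z = 4.888, I = 8.71 × 10^−2, N = 4701 links); the null expectation is taken as the standard −1/(N − 1).

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


z_score = 4.888    # global Moran's I z-score quoted in the text
moran_i = 8.71e-2  # highest Moran's I quoted in the text (k = 1)
n_links = 4701     # number of road links

p_value = two_sided_p(z_score)
expected_i = -1.0 / (n_links - 1)  # null expectation of Moran's I

is_significant_1pct = p_value < 0.01
is_positive_autocorrelation = moran_i > expected_i
```

Because the expected value under the null hypothesis is very close to zero for a sample of this size, even a modest positive I lies clearly above it.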

Model Outputs
Since both the LMerr and the LMlag statistics are significantly different from zero, their robust versions must be checked. The robust counterparts of the tests imply that the error model is the preferable one.
The F statistic (44.55) of the OLS model is highly significant, with a p-value < 2.26 × 10^−16, which indicates that the model is acceptable.
Multiple models with different weight matrices were run for each technique (OLS, SAR, SEM, and SAC). The parameter estimates and goodness-of-fit results of the best-performing models are summarized in Table 7. The significance level is indicated in the tables by the following notations: "." stands for p < 0.1, * stands for p < 0.05, ** stands for p < 0.01, and *** stands for p < 0.001.
The robust counterparts of the Lagrange Multiplier test indicate that the error model is preferable to the lag model. This is confirmed by the AIC, BIC, and LR values (Table 8). The AIC and BIC values of the SEM are lower than those of the SAR and SAC, and the log-likelihood value (L_c) of the SEM is higher than those of the SAR and SAC. The SEM was therefore selected as the best of the three spatial econometric models. The LR values show that the SEM (and, in general, all of the spatial econometric models) is significantly better than the standard linear regression model. In line with the AIC and BIC values, a significant improvement from the SAC model cannot be shown in this case. The significance level is indicated in the tables by the following notation: *** stands for p < 0.001.

Discussion
For the SEM, which is considered the best model, the coefficients of all explanatory variables (except for the ratio of HGVs) are positive and significantly different from zero. This is intuitive for section length and average daily traffic: the longer the section, or the higher the traffic on it, the higher the accident frequency. The same applies to bicycle traffic volume: the higher the cycling traffic, the higher the frequency of accidents. This contrasts with the models observed in Denmark or the Netherlands, because in Budapest there is little physically separated infrastructure for cyclists [69], and the volume of bicycle traffic has not reached the critical level at which the trend reverses and the risk of an accident decreases as bicycle traffic increases.
For the category variables describing the characteristics of the infrastructure, the minimum number of accidents occurred on the high-speed seven-lane roads (this category was chosen as the reference among the category variables). Generally, the frequency of accidents is lower on roads designed for higher speeds, which can be attributed to the higher level of safety of these sections (guardrails and divided roads).
Regarding the number of lanes: between 20 and 50 kph, there is no clear correlation between the number of lanes and the accident frequency; on sections designed for higher speeds (above 50 kph), the accident frequency decreases for more than four lanes.
For the parameters describing the quality of road public transport, an increase in bus traffic increases the risk of accidents. This is because buses perform maneuvers that other road users do not expect (approaching and leaving a bus stop, handling overhead lines in the case of trolleybuses, and taking insulated sections into account).
The same holds for bus lanes, which, with their special rules, contribute to an increase in the risk of accidents, typically through conflict situations resulting from pre-turn lane changes.
For HGV traffic, the parameters behave differently from the previous ones. On the one hand, whether HGVs (over 12 tons) can use a given road section has no significant effect on the frequency of accidents. On the other hand, the ratio of HGVs negatively affects the accident frequency, i.e., the higher the proportion of HGV traffic in total traffic, the lower the frequency of accidents. This is because lorries are less dynamic in traffic, perform fewer unforeseen maneuvers, and are likely to use better-protected main roads.
Turning to the spatial parameters, those of both the SAR and SEM models are significant; however, this cannot be said of the SAC model. Both the SAR and SEM models have a positive spatial parameter, which means that if the accident frequency of a section is high, the frequency of the closest section will also be higher. This suggests that improvements extending over longer road sections are beneficial for avoiding traffic accidents. In the SAC model, neither spatial parameter differs significantly from zero, which may be because the effects of the two approaches (spatial error and spatial lag) cancel each other out. Since the spatial parameters are not significant, the effects described above cannot be confirmed for the SAC model.
The best model, with a significantly positive spatial parameter (λ), is the SEM. This indicates that the closest infrastructure elements have a positive effect on each other, which can be removed from the systematic part of the model and transferred to the error term. These effects can be derived from the urban structure and the geographical properties affecting the city. Such parameters are hard to take into consideration in a classic linear regression; thus, the spatial econometric approach is a more effective technique for modeling road accident frequency.
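The meaning of a positive λ in the spatial error specification, where the error ζ satisfies ζ = λWζ + ε, can be illustrated numerically. The sketch below solves this relation by fixed-point iteration on a hypothetical chain of four links with row-standardized weights; λ = 0.5 and the single unit shock are illustrative values, not estimates from this study.

```python
def chain_weights(n):
    """Row-standardized weights for n units in a line (hypothetical toy network)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            W[i][j] = 1.0 / len(nbrs)
    return W


def spatial_lag(W, v):
    """Compute the spatial lag W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sem_errors(W, eps, lam, iterations=200):
    """Solve zeta = lam * W zeta + eps by fixed-point iteration.

    The iteration converges for row-standardized W when |lam| < 1.
    """
    zeta = list(eps)
    for _ in range(iterations):
        zeta = [lam * l + e for l, e in zip(spatial_lag(W, zeta), eps)]
    return zeta


W = chain_weights(4)
eps = [1.0, 0.0, 0.0, 0.0]  # a single unobserved shock on the first link
lam = 0.5                   # ILLUSTRATIVE value of the error parameter
zeta = sem_errors(W, eps, lam)

# Residual of the defining equation; it should be numerically zero.
residual = max(
    abs(z - lam * l - e) for z, l, e in zip(zeta, spatial_lag(W, zeta), eps)
)
```

With a positive λ, the unobserved shock on the first link spills over to its neighbors with decreasing strength, which is the kind of shared, unmeasured local influence that the error term of the SEM absorbs.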

Conclusions
In this study, road traffic accidents occurring on the road links of Budapest over a three-year period were surveyed. The accidents on the same link were aggregated, and the centroids of the road links were used to formulate three different types of spatial models. The global Moran's I test results, with significant p-values and positive z-scores, confirm the presence of spatial dependence in road traffic accidents. OLS, SAR, SEM, and SAC models were formulated in this study. The results showed that incorporating the spatial effect in the model gives a better result than the traditional multiple linear regression model. The Lagrange-Multiplier test, in conjunction with the AIC, BIC, and log-likelihood values, showed that the spatial error model performs best among the three spatial econometric models in this study. The main findings of this study are the confirmation of spatial dependence in the data, the better performance of the spatial models compared with traditional OLS models, and the identification of significant predictors of road traffic accidents.
The spatial econometric approach allows the development of a universal and simple model that will significantly assist decision-makers in planning road safety investments. The role of spatial econometrics is to capture in the model parameters that are difficult or impossible to describe explicitly, which allows for a more accurate prediction of accident frequencies. However, it is essential to update these models every 3-5 years, as there may be changes in the urban structure that fundamentally change traffic safety conditions. The main limitation of this study is that the analysis only considers the city of Budapest. Furthermore, it considers accident events on the road links only; those at intersections are not considered. The third limitation is that accident severity was not considered; instead, all accidents that occurred on the road links over the period are aggregated.
These limitations define the future research directions. Three main directions can be identified: (i) choosing an environment other than Budapest, such as an analysis of the whole Hungarian network; (ii) taking into consideration the accidents at junctions, either in a separate model or in a joint one; (iii) taking into consideration the accident types and accident outcomes. Developing well-performing spatial econometric models of this kind and discovering the most significant parameters would help authorities identify specific interventions for improving road safety and ultimately ensure sustainable transport and urban development.