Discovering Spatio-Temporal Clusters of Road Collisions Using the Method of Fast Bayesian Model-Based Cluster Detection

Abstract: Public availability of geo-coded or geo-referenced road collisions (crashes) makes it possible to perform geovisualisation and spatio-temporal analysis of road collisions across a city. This study aims to detect spatio-temporal clusters of road collisions across Greater London between 2010 and 2014. We implemented a fast Bayesian model-based cluster detection method, first with no covariates and then after adjusting for potential covariates. As empirical evidence on the association between street connectivity measures and the occurrence of road collisions has been found, we selected street connectivity measures as the potential covariates in our cluster detection. The most significant cluster and the second most significant cluster, both spanning the five consecutive years, are located around the central areas. Moreover, after adjusting for the covariates, the most significant cluster moves from the central areas of London to its peripheral areas, while the second most significant cluster remains unchanged. Additionally, one potential covariate used in this study, length-based road density, exhibits a positive association with the number of road collisions, whereas count-based intersection density displays a negative association. Although the covariates (i.e., road density and intersection density) exhibit a potential impact on the clusters of road collisions, they are unlikely to account for the majority of clusters. Furthermore, the method of fast Bayesian model-based cluster detection is applied to discover spatio-temporal clusters of serious injury collisions. Most of the areas at risk of serious injury collisions overlap those at risk of road collisions. Although not identified as areas at risk of road collisions, some districts, e.g., City of London, are regarded as areas at risk of serious injury collisions.


Introduction
The distribution of road collisions is spatially heterogeneous, as road collisions are more likely to cluster in certain places than in others [1][2][3]. A spatio-temporal analysis of road collisions across a city can help: (1) investigate the associations between road collisions and environmental characteristics (e.g., road infrastructure, land use and demographics); and (2) identify areas with a high risk of road traffic safety issues. The former can offer empirical evidence on the necessity of traffic safety interventions (e.g., improving road infrastructure, reducing traffic speed, etc.). The latter can be achieved by detecting clustering of road collisions [2,3]. Accordingly, road infrastructure improvement and speed limit measures should be prioritised in those high-risk areas inside a city to better reduce road collisions. In the past decade, as geo-coded or geo-referenced road collisions (crashes) have become publicly available, geovisualisation and spatio-temporal analysis of road collisions have become increasingly feasible. On the one hand, point-level collision data enable us to identify spatial clustering of collisions without considering the spatial distribution of "population". For point-level collision data, popular clustering methods have been developed, including kernel density estimation [1][2][3], network kernel density estimation [4][5][6], Ripley K-function [3,7], and K-means [8]. On the other hand, area-level collision volume data enable us to conduct cluster detection after taking account of "population". Typically, Kulldorff's spatial cluster detection methods have been widely used to detect spatial clusters of collision injuries [9,10]. Cluster identification results indicate areas with a high density of collisions (events or points), while cluster detection results indicate areas at high risk of collisions.
Compared to cluster identification of road collisions, cluster detection of road collisions after considering the distribution of "population (traffic flows)" is scarce. Moreover, the existing studies on collision cluster detection have three limitations: (1) they focus mainly on spatial cluster detection but have not been extended to spatio-temporal cluster detection; (2) they mainly choose residential population or working population to represent the "population variable" in the cluster detection setting while traffic flow volume can, indeed, better represent "population variable"; and (3) as Kulldorff's spatial cluster detection methods are computationally demanding, they are not suitable for a large data set.
An efficient spatio-temporal cluster detection method has been recently developed [11]. Since the cluster detection method applies a model-based approach, it can largely improve efficiency by avoiding simulations and can detect clusters regardless of whether fixed effects or mixed effects are included in the model [11]. Moreover, the method enables us to take account of potential covariates in the cluster detection. According to the existing studies, road collisions are mainly attributable to human factors (e.g., drowsiness, fatigue, alcohol usage, drug abuse, driving inexperience, non-seatbelt use and traffic violations) [12][13][14], vehicle factors (e.g., older vehicles, emergency vehicles, overloaded vehicles) [15][16][17] and built environment factors (e.g., lower levels of street connectivity, lower levels of land use mix, lack of traffic calming) [18][19][20]. Compared with human factors and vehicle factors, built environment factors are closely related to urban design and urban planning. In particular, the influence of the built environment (e.g., street connectivity and land use) on collision occurrence has been reported [18], and street connectivity is reportedly associated with the occurrence of road collisions [19,20]. Count-based density measures (e.g., intersection density) and length-based density measures (e.g., road density) are likely to exhibit different types of associations with the occurrence of road collisions. For instance, more crashes are associated with higher road density [19], while fewer crashes are associated with higher intersection density [20]. Therefore, we treat street connectivity measures, including both count-based and length-based measures, as the potential covariates in this study.
Although it is of more interest to model road collisions according to human, vehicle and built environment factors, regression models cannot be firmly established due to the absence of required data on those factors. Instead, this study is dedicated to identifying areas at risk of road traffic safety issues by discovering spatio-temporal clusters of road collisions across a city. In this study, a Bayesian model-based detection method was applied to cluster detection instead of conventional cluster detection methods (e.g., Kulldorff's spatial cluster detection methods), as the Bayesian model-based approach (1) is likely to produce a larger number of statistically significant clusters, so that more local areas are identified as a result; and (2) allows researchers to incorporate covariates into cluster detection and thereby to identify potential risk factors that are worth new investigations [11]. Specifically, this study aims to detect spatio-temporal clusters of road collisions when replacing residential population with traffic flow volume as the "population variable". Empirically, we used the district-level data across London from 2010 to 2014 to detect spatio-temporal clusters by district and year. Methodologically, we applied the recently developed fast Bayesian model-based detection method [11] to the road collision data because of its advantages: a model-based approach that accounts for covariates, and the use of a fast approximation method (integrated nested Laplace approximation) instead of a conventional one (Markov chain Monte Carlo methods). As empirical evidence on the association between street connectivity measures and the occurrence of road collisions has been found, we selected street connectivity measures as the potential covariates in the cluster detection. Furthermore, we detected spatio-temporal clusters of serious injury collisions when setting the number of road collisions as the "population variable".
Detection results for road collisions and serious injury collisions were compared. This study makes new contributions to this field by: (1) extending spatial cluster detection to spatio-temporal cluster detection; (2) replacing residential population with traffic flow volume as the "population variable"; (3) applying a new and faster cluster detection method which can further incorporate covariates into the cluster detection; and (4) examining the potential impact of the covariates (street connectivity measures) by comparing cluster detection results with no covariates and after adjusting for covariates.

Literature Review
Spatial analyses of road collisions across a city are mainly divided into two groups: point-based analyses and area-level analyses. In the group of point-based analyses, researchers have focused on the spatial distributions of road collisions along roads or around intersections. Kernel density estimation (KDE) methods were initially used to explore spatial clustering of road collisions [1][2][3]. After considering the structure of road network, network kernel density estimation (NKDE) methods were adopted to improve the clustering analysis of road collisions [4][5][6]. Application of K-means allows researchers to define the number of clusters (groups) [8] while the application of KDE focuses on the concentration of high-density road collisions. Compared with KDE, NKDE and K-means methods, Ripley K-function methods were developed to determine the global distribution pattern of road collisions, including random distribution, clustering distribution and even distribution [3,7]. KDE and NKDE methods can be further utilised to investigate local areas with high clustering of road collisions if Ripley K-function methods determine that the global distribution pattern of road collisions is clustering distribution [3,7]. Those point-based analyses uncovered several findings on spatial distribution of road collisions: e.g., more road collisions have occurred at road intersections than on road segments [3]; meanwhile more road collisions have occurred along motorways than along other types of road [7].
In the group of area-based analyses, researchers have focused more on the spatial distribution of road collisions in relation to socioeconomic and built environment characteristics. First, researchers identified clusters of road collisions largely existed in those areas with lower socioeconomic status (e.g., densely populated or poorer areas) [9,10]. Second, relevant studies explained the spatial variations of road collisions by using a variety of regression models, including Poisson models (e.g., spatial lag, spatial multinomial-generalised Poisson, and Poisson log-normal regression models) [21][22][23][24], Bayesian models (e.g., Bayesian spatial joint, Bayesian spatial random parameters Tobit, and Bayesian-Poisson log-normal models) [25][26][27][28], and spatially varying coefficients models (e.g., geographically weighted regression and Bayesian spatially varying coefficients models) [29,30]. Bayesian models are reported to outperform Poisson models in modelling road collisions [29,30]. Third, impacts of socioeconomic and built environment characteristics on road collisions were investigated at the area level [23,27,29], the intersection level [21,24,25], and the road segment (street) level [22,25,26]. More specifically, population [29], traffic volume [29] and speed limit measures [29,30] are reported to contribute to road collisions at the area level. Roadway configuration, the type of approach roadway function, the type of traffic control, the total daily volume of entering traffic and the split of volumes between approaches are all associated with collision frequency at intersections [21], while increased traffic volume and poorer pavement conditions are associated with more collisions at road segments [26]. 
More collisions are reported to occur at intersections with signal controls, with more intersecting legs, and with higher speed limits, while more collisions are reported to occur on road segments with more lanes, more accesses, higher speed limits and worse pavement conditions [25].

Materials and Methods
In this section, the data on road collisions, traffic flow volume and the road network are introduced. The spatio-temporal cluster detection method used is then presented, followed by the street connectivity measures considered as potential covariates.

Data
In this study, we focus on five years of road collisions across the region of Greater London. Greater London consists of 33 districts, including City of London and 32 boroughs (see Figure 1). The district-level road collision data were downloaded from the London Datastore (https://data.london.gov.uk/dataset/road-collisionsseverity). According to the level of severity, the road collisions are classified into "fatal injury", "serious injury", and "slight injury". Table 1 shows the number of road collisions in London by severity and year. The number of road collisions increases markedly after 2012, the year in which the Summer Olympics took place in London. From 2012 to 2013, the number of road collisions at each severity level increases by more than 50%. The district-level motor vehicle flow volume data were downloaded from the website of GOV.UK (https://www.gov.uk/government/statistical-data-sets/road-traffic-statistics-tra). The flow volume is represented by the number of vehicles passing in 24 h at an average point on the road network in each local authority. It is calculated by dividing the estimate of annual vehicle miles in each local authority by the length of road in that authority and the number of days in the year. The road network data were downloaded from the Ordnance Survey (https://www.ordnancesurvey.co.uk/business-government/products/open-map-roads). Figure 2 shows box plots of the district-level collision rate and serious injury collision rate across London from 2010 to 2014. As Figure 2 shows, inter-annual variability in collision rate is not high across London, though the collision rate is relatively low in 2013.
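The flow volume definition given above can be written as a one-line calculation. The following is a minimal sketch in Python; the function name, argument names and the example figures are illustrative assumptions and are not taken from the GOV.UK data set.

```python
def average_daily_flow(annual_vehicle_miles, road_length_miles, days_in_year=365):
    """Vehicles passing in 24 h at an average point on the road network.

    Follows the definition in the text: annual vehicle miles divided by
    the length of road and by the number of days in the year.
    """
    if road_length_miles <= 0 or days_in_year <= 0:
        raise ValueError("road length and number of days must be positive")
    return annual_vehicle_miles / (road_length_miles * days_in_year)


# Hypothetical district: 365 million vehicle miles per year on 100 miles of road.
flow = average_daily_flow(365_000_000, 100, 365)
```

In leap years the divisor would be 366, which is why the number of days is kept as a parameter.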

Fast Bayesian Model-Based Cluster Detection
Extending the model-based approaches of [31] for the detection of spatial disease clusters to space and time [32], Gómez-Rubio et al. [11] propose a new approach that uses dummy variables in a regression model to group regions into clusters. The importance of the clusters is assessed based on a likelihood calculation that measures the extent to which the clusters capture the variability in the outcome [11]. To reduce the large computational burden of fitting Bayesian hierarchical models by means of Markov chain Monte Carlo (MCMC) methods, Gómez-Rubio et al. [11] use a fast approximation method (integrated nested Laplace approximation) proposed by Rue et al. [33] to fit the model, to provide a reasonable estimate of the coefficient of the cluster variables, and to compute the deviance information criterion (DIC) for model selection. Theoretically, the problem of cluster detection is regarded as a problem of variable selection, where covariates include a number of dummy variables that represent all possible clusters [11]. Hence, when fitting an individual model to test for different clusters, this approach, based on integrated nested Laplace approximation (INLA), will be faster than fitting the same models with MCMC [11].
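The variable-selection view of cluster detection can be illustrated with a small sketch. It assumes a Poisson model with offset log(E) and a single cluster dummy, and it replaces the INLA fit and DIC comparison used in [11] with a closed-form maximum likelihood calculation, so it illustrates the idea rather than reproducing the authors' procedure. Under that assumption, the coefficient of a candidate cluster is log(O_in/E_in), and the gain in log-likelihood over the model without the cluster is O_in log(O_in/E_in) - (O_in - E_in), where O_in and E_in are the observed and expected counts summed over the cluster. All function and variable names are hypothetical.

```python
import math


def cluster_coefficient(obs_in, exp_in):
    """Maximum likelihood estimate of the dummy coefficient for one cluster."""
    if obs_in <= 0 or exp_in <= 0:
        raise ValueError("observed and expected counts must be positive")
    return math.log(obs_in / exp_in)


def loglik_gain(obs_in, exp_in):
    """Log-likelihood improvement over the model without the cluster dummy."""
    gamma = cluster_coefficient(obs_in, exp_in)
    return obs_in * gamma - (obs_in - exp_in)


def rank_candidates(observed, expected, candidates):
    """Order candidate clusters (name -> member indices) by log-likelihood gain.

    Only candidates with more observed than expected cases are kept,
    since the interest is in areas of excess risk.
    """
    scored = []
    for name, members in candidates.items():
        obs_in = sum(observed[i] for i in members)
        exp_in = sum(expected[i] for i in members)
        if obs_in > exp_in:
            scored.append((loglik_gain(obs_in, exp_in), name))
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical counts for four area-year cells and three candidate clusters.
observed = [30, 12, 8, 10]
expected = [15.0, 10.0, 10.0, 10.0]
candidates = {"A": [0], "B": [0, 1], "C": [2, 3]}
ranking = rank_candidates(observed, expected, candidates)
```

Candidate "C" is dropped because it has fewer observed than expected cases; the remaining candidates are ordered from the largest gain downwards.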
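For the sake of brevity, we present the model as follows [32]:
$$O_{i,t} \sim \mathrm{Po}(\mu_{i,t}), \qquad \log(\mu_{i,t}) = \log(E_{i,t}) + \gamma_j c^{(j)}_{i,t} \tag{1}$$
where $O_{i,t}$ is the observed number of cases in area $i$ at time $t$, $\mu_{i,t}$ is the mean of area $i$ at time $t$, and $E_{i,t}$ is the expected number of cases in area $i$ at time $t$. $c^{(j)}_{i,t}$ is a cluster dummy variable for spatio-temporal cluster $j$, and $\gamma_j$ is the coefficient of the cluster dummy variable.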
Note how now data are indexed according to space and time. Dummy cluster variables are defined as in the spatial case, by considering areas in the cluster according to their distance to the cluster centre, for data within a particular time period. When defining a temporal cluster, areas are aggregated using all possible temporal windows up to a predefined temporal range.
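The construction of the dummy cluster variables can be sketched as follows. This is an illustration under stated assumptions: areas are represented by hypothetical centroid coordinates, a spatial cluster is taken to be the centre plus its nearest neighbours up to a fixed number of areas, and temporal windows are runs of consecutive periods up to a maximum length. The actual size criteria used by the method in [11] are not reproduced here.

```python
import math


def nearest_areas(centroids, centre, size):
    """Indices of the `size` areas closest to area `centre` (centre included)."""
    cx, cy = centroids[centre]
    order = sorted(
        range(len(centroids)),
        key=lambda i: (math.hypot(centroids[i][0] - cx, centroids[i][1] - cy), i),
    )
    return order[:size]


def temporal_windows(periods, max_span):
    """All runs of consecutive periods with length 1..max_span."""
    return [
        periods[start:start + span]
        for span in range(1, max_span + 1)
        for start in range(len(periods) - span + 1)
    ]


def cluster_dummy(centroids, periods, centre, size, window):
    """Map (area, period) -> 1 if the cell lies inside the cluster, else 0."""
    members = set(nearest_areas(centroids, centre, size))
    active = set(window)
    return {
        (i, t): int(i in members and t in active)
        for i in range(len(centroids))
        for t in periods
    }


# Hypothetical layout: four areas observed over three years.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (0.5, 0.5)]
years = [2010, 2011, 2012]
windows = temporal_windows(years, max_span=2)
dummy = cluster_dummy(centroids, years, centre=0, size=2, window=[2010, 2011])
```

Enumerating every centre, every size and every temporal window yields the full set of candidate dummy variables, each of which is then tested in the model.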
Moreover, $E_{i,t}$ is computed as follows [11]: "Raw expected cases $E_{i,t}$ are computed using the population in each area. Covariate standardised expected number of cases $E_{i,t}$ is computed fitting a Poisson regression (generalised linear model) with offset $\log(E_{i,t})$ on the covariates. Then, the fitted values from this model are used to compute the expected number of cases using Equation (1)." Table 2 lists the covariates considered in this study. The response is the number of road collisions (unit: count). The covariates are street connectivity indicators, including road density (i.e., length of roads/area) and intersection density (i.e., number of road intersections/area). Table 2 also shows descriptive statistics for the covariates. In this study, the cluster detection can be implemented in R. Specifically, the model-based cluster detection method is provided by an R package named "DClusterm" [32].
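As an illustration of the raw expected cases, the sketch below assumes internal standardisation, i.e., the overall rate (total cases divided by total population) is applied to the population of each cell. The quoted text only states that the population in each area is used, so this particular formula is an assumption, as are the function names and the example figures.

```python
def raw_expected(cases, population):
    """Expected cases per cell if every cell shared the overall rate."""
    total_population = sum(population)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    overall_rate = sum(cases) / total_population
    return [p * overall_rate for p in population]


def standardised_ratio(cases, expected):
    """Observed-to-expected ratio per cell; values above 1 suggest excess risk."""
    return [o / e for o, e in zip(cases, expected)]


# Hypothetical cells: collision counts and traffic flow used as "population".
cases = [30, 10, 20]
population = [1000, 1000, 2000]
expected = raw_expected(cases, population)
ratios = standardised_ratio(cases, expected)
```

With this choice the expected counts sum to the observed total, so the ratios describe how collisions are redistributed across cells rather than the overall level.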

Results and Discussion
This section presents the cluster detection results with no covariates and after adjusting for covariates, and discusses the potential impacts of the covariates. Furthermore, the results of cluster detection for serious injury collisions are presented and compared with those detected for road collisions.

Cluster Detection: Spatio-Temporal Clusters of Road Collisions
We applied the fast Bayesian model-based cluster detection method to the 165 observations (33 districts × 5 years) with no covariates and after adjusting for covariates respectively. In the cluster detection, the "case variable" is the number of road collisions by district and year; the "population variable" is the number of motor vehicle flows by district and year.

Cluster Detection with no Covariates
First of all, we implemented the model-based cluster detection method with no covariates. The covariate-standardised expected number of cases $E_{i,t}$ was computed by fitting a Poisson regression (generalised linear model) with offset $\log(E_{i,t})$ and no covariates (see Equation (1)). The generalised linear model (GLM) estimated is shown in Table 3 (see GLM 1). As a result, five statistically significant clusters were detected with a p-value below 0.05. These clusters are listed in Table 4 and mapped in Figure 4a. In Table 4, the clusters are ranked according to the p-value in ascending order. All these clusters cover the 5 years from 2010 to 2014 (see Table 4). Specifically, the most significant cluster (Cluster 1 in Figure 4a) and the second most significant cluster (Cluster 2 in Figure 4a) are located around the central areas (inner boroughs in Figure 1), while the other three clusters (Clusters 3, 4 and 5 in Figure 4a) are located around the peripheral areas (outer boroughs in Figure 1).
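For a Poisson regression with an offset and no covariates, the fit has a closed form, which the following sketch makes explicit. Setting the score of the intercept to zero gives exp(alpha) = (sum of observed cases)/(sum of raw expected cases), and the fitted values are the raw expected cases rescaled by that factor. The function name and the example numbers are hypothetical and are not the values reported in Table 3.

```python
import math


def fit_intercept_only(observed, expected):
    """Closed-form Poisson fit of log(mu) = log(E) + alpha."""
    alpha = math.log(sum(observed) / sum(expected))
    fitted = [e * math.exp(alpha) for e in expected]
    return alpha, fitted


# Raw expected counts that already sum to the observed total: alpha is zero.
alpha_a, fitted_a = fit_intercept_only([30, 10, 20], [15.0, 15.0, 30.0])

# Raw expected counts that sum to half the observed total: alpha is log(2).
alpha_b, fitted_b = fit_intercept_only([30, 10, 20], [10.0, 10.0, 10.0])
```

The first case shows that when the raw expected counts already sum to the observed total, the intercept-only adjustment leaves them unchanged.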

Cluster Detection after Adjusting for Covariates
Subsequently, we implemented the model-based cluster detection method after adjusting for covariates. $E_{i,t}$ was computed by fitting a Poisson regression (generalised linear model) with offset $\log(E_{i,t})$ on two covariates: RD (road density) and ID (intersection density). The GLM estimated is shown in Table 3 (see GLM 2). As expected, RD is statistically significantly and positively associated with the observed number of road collisions (response), while ID is statistically significantly and negatively associated with the observed number of road collisions (response). As a result, 6 statistically significant clusters were detected with a p-value below 0.05. These clusters are listed in Table 5 and mapped in Figure 4b.
In Table 5, the clusters are ranked according to the p-value in ascending order. Clusters 5 and 6 cover 2 and 3 years respectively, while the other 4 clusters cover 5 years (see Table 5). Specifically, Cluster 2 (the second most significant cluster) and Cluster 3 are located around the central areas (inner boroughs), while the other 4 clusters are located around the peripheral areas (see Figure 4b). In particular, Cluster 1 (the most significant cluster) is located around the southern peripheral areas. It is noted that 2 districts belong to Cluster 5 from 2010 to 2011 and constitute Cluster 6 from 2012 to 2014.
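The covariate adjustment, a Poisson regression with offset log(E) on road density and intersection density whose fitted values serve as covariate-standardised expected counts, can be sketched with a plain Newton-Raphson fit. This is a minimal standard-library illustration, not the routine used in the study (which relied on R), and the data below are synthetic values generated from assumed coefficients so that the fit can be checked.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_poisson_offset(observed, expected, covariates, iterations=50, tol=1e-10):
    """Newton-Raphson fit of log(mu) = log(E) + b0 + b1*x1 + ... ."""
    rows = [[1.0] + list(x) for x in covariates]
    k = len(rows[0])
    beta = [math.log(sum(observed) / sum(expected))] + [0.0] * (k - 1)

    def means(coef):
        return [e * math.exp(sum(b * v for b, v in zip(coef, row)))
                for e, row in zip(expected, rows)]

    for _ in range(iterations):
        mu = means(beta)
        # Score vector and (negative) Hessian of the Poisson log-likelihood.
        grad = [sum(row[j] * (o - u) for row, o, u in zip(rows, observed, mu))
                for j in range(k)]
        hess = [[sum(row[j] * row[l] * u for row, u in zip(rows, mu))
                 for l in range(k)] for j in range(k)]
        step = solve_linear(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, means(beta)


# Synthetic check: counts generated exactly from assumed coefficients.
true_beta = [0.1, 0.3, -0.5]            # intercept, road density, intersection density
raw_e = [10.0, 20.0, 15.0, 30.0, 25.0, 12.0]
x = [(1.0, 0.3), (2.0, 0.1), (1.5, 0.6), (0.5, 0.4), (2.5, 0.2), (1.2, 0.5)]
obs = [e * math.exp(true_beta[0] + true_beta[1] * rd + true_beta[2] * idn)
       for e, (rd, idn) in zip(raw_e, x)]
beta_hat, adjusted_e = fit_poisson_offset(obs, raw_e, x)
```

The fitted values `adjusted_e` play the role of the covariate-standardised expected counts, so any cluster detected afterwards reflects excess risk that the two covariates do not explain.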

Comparison of Cluster Detection with and without Covariates
We compared the clusters detected in the two models (with and without covariates). The geographic boundaries of clusters tend to move eastward from the detection results with the covariates to those without the covariates (see Figure 4). Cluster 2 is an exception as its geographic boundaries remain the same. This indicates that Cluster 2 is unlikely to be explained by the covariates while the other clusters are partly explained by the covariates. Particularly, the most significant cluster (Cluster 1 in Figure 4a) changes into the third most significant cluster (Cluster 3 in Figure 4b) after adjusting for the potential covariates. Additionally, the most significant cluster (Cluster 1) moves from the central areas (inner boroughs) to southern peripheral areas (outer boroughs) (see Cluster 1 in Figure 4a and Cluster 1 in Figure 4b). Generally, the covariates are likely to have potential impact on the clusters of road collisions.
We further examined the high-risk areas (i.e., areas covered by clusters) which disappeared or newly appeared in relation to the two covariates. Figure 5 maps the covariates (i.e., RD and ID) across London. After comparing Figure 4a,b, we can identify 4 disappearing areas and 2 newly appearing areas after adjusting for the covariates. Moreover, as Table 3 shows, RD is positively associated with the number of road collisions while ID is negatively associated with the number of road collisions. Accordingly, among the four disappearing high-risk areas, two co-locate with a high level of RD while the other two co-locate with a low level of ID (see Figures 4 and 5 together). Figure 6 shows the two areas mainly attributable to a high level of RD, the two areas mainly attributable to a low level of ID, and the two areas newly appearing after adjusting for the covariates. Apart from the 4 disappearing high-risk areas, other high-risk areas are unlikely to be attributable to the two covariates (i.e., RD and ID). In other words, the majority of high-risk areas are not attributable to street connectivity, and further investigations are needed to explain the remaining high-risk areas.

Cluster Detection: Spatio-Temporal Clusters of Serious Injury Collisions
Likewise, we applied the fast Bayesian model-based cluster detection method to the 165 observations (33 districts × 5 years) with no covariates. In the cluster detection, the "case variable" is the number of serious injury road collisions by district and year, whilst the "population variable" is the number of all-type road collisions by district and year. As a result, five statistically significant clusters were detected with a p-value below 0.05. These clusters are listed in Table 6 and mapped in Figure 7. Specifically, the most significant cluster (Cluster 1 in Figure 7

Discussion
Generally, the covariates are likely to have potential impacts on the clusters of road collisions. The most significant cluster moves from central areas (inner boroughs) to southern peripheral areas (outer boroughs) after adjusting for the covariates. Moreover, of the potential covariates used in this study, length-based road density exhibits a positive association with the number of road collisions while count-based intersection density exhibits a negative association. This is consistent with some previous studies [19,20]. Furthermore, we compared the cluster detection results for road collisions and serious injury collisions (see Figures 4 and 7). Most of the areas at risk of serious injury collisions overlap those at risk of road collisions. Although not identified as areas at risk of road collisions, some districts, e.g., City of London, are regarded as areas at risk of serious injury collisions.

Conclusions
In this study, we aimed to detect spatio-temporal clusters of road collisions across Greater London from 2010 to 2014. We implemented a fast Bayesian model-based cluster detection method with no covariates and after adjusting for covariates respectively. As a result, the most significant and second most significant clusters were located around the central areas covering 5 years. Moreover, after adjusting for the covariates, the most significant cluster moves from the central areas to the peripheral areas, while the second most significant cluster remains unchanged. Although the covariates (i.e., RD and ID) exhibit potential impact on the clusters of road collisions, they are unlikely to contribute to the majority of high-risk areas. Furthermore, we detected spatio-temporal clusters of serious injury collisions. As expected, most of the areas at risk of serious injury collisions overlay those at risk of road collisions. Although not being identified as areas at risk of road collisions, some districts, e.g., City of London, are regarded as areas at risk of serious injury collisions.
However, there are some limitations in this study. Firstly, we cannot undertake cluster detection at a finer level of temporal granularity (e.g., month) or spatial granularity (e.g., smaller area, street or intersection) due to the absence of spatio-temporally fine-grained traffic flow volume data. Due to the potential presence of the modifiable areal unit problem (MAUP), the cluster detection results might differ between fine-grained data and coarse-grained data. Secondly, although traffic flows should include flows by different transport modes, we had to use motor vehicle flows rather than all-mode traffic flows in this study due to the absence of pedestrian and cycle flow volume data. Thirdly, apart from traffic flow volume, other dynamic factors (e.g., weather conditions) have not been considered in this study. The impacts of street connectivity on road collisions might be better examined after adjusting for weather conditions.
We will attempt to address those limitations in the future. Firstly, we will perform a similar study in another city for which fine-grained traffic flow data are available. This would help to understand the potential influence of the MAUP on the cluster detection. Secondly, we will attempt to repeat this study once all-mode traffic flow data are publicly available. The cluster detection results might differ depending on whether motor vehicle flow volume or all-mode traffic flow volume is used as the population variable. Thirdly, to take account of more built environment factors as potential covariates, we will select transport facilities including traffic calming, walkways and sidewalks once the data are publicly available. Fourthly, we will include more dynamic factors (e.g., weather conditions) in the future. Finally, since previous studies argued that the reduced travel speed caused by increasing traffic volume may decrease the likelihood of crash occurrence [26], we will consider both traffic volume and traffic speed, whether observed or estimated [34].