1. Introduction
Traffic collisions account for over 1.35 million fatalities and approximately 50 million injuries worldwide each year, and are predicted to become the fifth leading cause of death by 2030 [1]. On average, road traffic crashes cost nations approximately 3% of their gross domestic product (GDP), irrespective of their growth and rate of motorization [2,3]. Road traffic injuries are the third leading cause of death in Saudi Arabia, which poses a major socio-economic and public health concern for road safety agencies. The country has witnessed rapid economic growth and motorization, particularly after the oil boom in the early 1970s [4,5]. The fatality index (deaths per 100,000 population) due to traffic crashes in Saudi Arabia is estimated to be around 27.4, which is considerably higher than in developed countries such as the United States, Australia, Sweden, the Netherlands, and the United Kingdom, as well as in the neighboring Gulf states [1]. The average crash-to-injury ratios of 8:4 and 8:6 reported for different regions of Saudi Arabia are also considerably higher than the global ratio of 8:1 [6,7]. Several studies have investigated crash causation factors in Saudi Arabia and neighboring Gulf countries in recent years. Al Kaaf and Abdel-Aty [8] investigated the risk factors for crash occurrence on urban four-lane divided roadway segments in Riyadh, Saudi Arabia. Factors such as annual average daily traffic, speed limit, segment length, and driveway density were found to increase the likelihood of fatal and injury crashes. Islam et al. [9] used panel regression models and pooled ordinary least squares methods to analyze ten years (2003–2013) of annual crash data from 13 provinces in Saudi Arabia. Their results showed that adverse weather conditions, sandstorms, and the number of vehicles involved were statistically significant predictors of traffic crashes in the country. Poor roadway conditions, road geometry, and tire bursts caused by intense pavement temperatures during summer have also been identified as among the most common causes of traffic crashes in Saudi Arabia [10,11,12]. Driver distraction, speeding, and non-compliance with traffic rules are among the other predominant causes [13,14]. Studies conducted by Al-Kheder and Al-Rashidi [15] and Mohamed et al. [16] for Abu Dhabi in the neighboring United Arab Emirates (UAE) also reported that the factors responsible for increased traffic crashes include road type, road surface conditions, excessive speed, and non-compliance with traffic rules. A generalized linear model (GLM)-based safety appraisal study conducted for Salalah City in Oman likewise indicated that road geometry and traffic variables (volumes and 85th percentile speed) were the most significant variables affecting crash frequency [17]. Recently, some safety measures and policies have been initiated; however, the situation seems to have improved only marginally. Rigorous research is therefore required to explore the key risk factors for crash occurrence and severity.
Over the years, researchers have pursued various approaches to better understand the factors affecting crash occurrence, suggest appropriate countermeasures, and provide directions for policies to improve road safety [18,19,20,21,22,23,24,25,26]. However, road traffic crashes are complex events involving a large number of factors with multi-faceted interactions, which makes them very challenging to comprehend fully. Traffic crashes are the outcome of several contributing factors such as driver attributes, vehicle factors, traffic exposure, roadway geometric features, spatial attributes of the surrounding built environment, and lighting and weather conditions [25,27,28,29]. Among driver attributes, distracted driving and speeding are reported to be the leading factors behind increased motor vehicle crashes [30,31,32,33,34,35,36]. Similarly, the likelihood of crashes on rural multi-lane highways increases in the presence of a steep roadway gradient, a sharp horizontal curve, or an acute curve deflection angle [37,38]. Conversely, lower crash occurrence on the same facilities is associated with a decrease in curve length and horizontal curve radius and an increase in the degree of curvature and number of lanes [20,39]. Likewise, poor road surface conditions, nighttime travel, adverse weather, and precipitation are reported to have a strong bearing on high crash frequencies [40,41,42,43]. A better understanding of all of these factors is essential to enhance the safety performance of road traffic systems. Methodological advances in highway safety research also continue to be pursued.
The GLM-based Poisson regression model has been widely used to model equi-dispersed crash frequency data. Poisson models outperform standard regression approaches in handling the random, sporadic, non-negative, and discrete nature of crash counts. However, crash data are frequently characterized by a sample variance that is large relative to the sample mean (over-dispersion), which limits the application of Poisson regression [44]. Therefore, a negative binomial (Poisson–gamma) regression model is preferred for over-dispersed datasets, while the GLM-based binomial regression model is often used to fit under-dispersed data. In practice, it is challenging to determine which dispersion regime a dataset belongs to. A few standard testing procedures exist for this purpose, but they require considerable expertise, computational effort, and time. The Conway–Maxwell–Poisson (COM–Poisson) model proposed by Guikema and Coffelt [45] is a better alternative, as it is a flexible model that can handle under-, equi-, and over-dispersed data [46]. Lord et al. [47,48] have successfully used COM–Poisson regression to model traffic crash data.
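To make the dispersion argument concrete, the following minimal sketch evaluates the COM–Poisson probability mass function, which is proportional to λ^y / (y!)^ν, and reports the variance-to-mean ratio for three values of the dispersion parameter ν. The function names, the truncation point of the normalising sum, and the parameter values are illustrative assumptions and are not taken from the paper.

```python
import math


def com_poisson_pmf(lam, nu, max_terms=200):
    """Return [P(Y=0), ..., P(Y=max_terms-1)] for a COM-Poisson(lam, nu) variable.

    The unnormalised weight of y is lam**y / (y!)**nu. The infinite
    normalising sum is truncated at max_terms (an assumption that is only
    adequate when the tail beyond that point is negligible).
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_terms)]
    shift = max(log_w)  # subtract the maximum to avoid overflow in exp()
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def mean_and_variance(pmf):
    """Mean and variance of a discrete distribution given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var


# nu < 1 gives over-dispersion, nu = 1 recovers the Poisson model,
# and nu > 1 gives under-dispersion.
for nu in (0.5, 1.0, 2.0):
    m, v = mean_and_variance(com_poisson_pmf(lam=3.0, nu=nu))
    print(f"nu={nu}: mean={m:.3f}, variance={v:.3f}, variance/mean={v / m:.3f}")
```

A single extra parameter therefore covers all three dispersion regimes, which is why no preliminary dispersion test is needed before choosing the model.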
In the literature, studies have mostly focused on statistical modeling of traffic crash data. However, factors contributing to crash occurrence and frequency exhibit spatial heterogeneity and vary from one location to another. In an effort to achieve better surveillance of highway safety, road safety organizations worldwide collect large amounts of data. Traditionally, the majority of crash prediction models have relied on information aggregated over relatively large time-scales, usually a year. However, researchers have argued that crash likelihood is significantly influenced by short-term fluctuations in contributing factors such as traffic, weather, and complex terrain. Crash frequency models developed from aggregated data yield predictions averaged over an extended period of time, which may lead to the loss of potentially useful information about some important explanatory variables. They are also subject to error due to unobserved heterogeneity. Therefore, it is crucial to investigate disaggregate models, also known as real-time crash risk evaluation models, for estimating crash potential over smaller time-scales such as an hour, a day, or a week. Crash prediction models with more refined time-scales are useful because they support timely and better decisions to improve highway safety. To fill this research gap, this study proposes the application of the statistical process control (SPC) method for real-time monitoring of crash data in the Kingdom of Saudi Arabia.
A control chart is a well-known SPC tool that is often used to detect abrupt changes in data [49]. Several GLM-based control charts exist, which are designed on the residuals estimated through GLM modeling. For example, GLM-based control charts based on the Poisson model were proposed by Skinner et al. [50] and Asgari et al. [51]. Skinner et al. [50] and Jearkpaporn et al. [52] discussed GLM-based control charts based on the gamma model, while charts based on the binomial model were proposed by Shang et al. [53] and Amiri et al. [54]. GLM-based control charts under the negative binomial model were discussed by Alencar et al. [55] and Urbieta et al. [56]. Kinat et al. [57,58] proposed GLM-based control charts assuming an inverse Gaussian distributed response variable, while Mahmood [59] proposed GLM-based control charts under zero-inflated models.
Recently, Park et al. [60] and Park et al. [61] proposed Shewhart-type GLM-based control charts assuming a COM–Poisson distributed response variable. Park et al. [60] considered deviance residuals as the plotting statistics, while in the other study [61] the authors used randomized quantile residuals. In general, Shewhart-type charts are based only on current information and are used to detect large deviations from the process mean. Practitioners are usually interested in detecting small changes as early as possible, for which the exponentially weighted moving average (EWMA) and cumulative sum (CUSUM) structures are specifically designed. Both EWMA and CUSUM charts use past as well as current information, which makes them more efficient at detecting small and/or moderate shifts in the process mean. This study designs EWMA- and CUSUM-type GLM-based control charts using the deviance and randomized quantile residuals of the COM–Poisson regression model. Further, a performance evaluation and comparative analysis are conducted using simulated data, and the proposed methods are implemented to monitor the number of crashes reported in Saudi Arabia.
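The way EWMA and CUSUM charts accumulate past information can be sketched with the textbook recursions below. This is only an illustration under stated assumptions: the residuals are treated as roughly standard normal when the process is in control, and the smoothing parameter `lam`, the reference value `k`, and the toy residual sequence are hypothetical choices rather than the design constants used in this study.

```python
def ewma_statistics(residuals, lam=0.2, start=0.0):
    """Textbook EWMA recursion: z_t = lam * r_t + (1 - lam) * z_{t-1}."""
    z, out = start, []
    for r in residuals:
        z = lam * r + (1.0 - lam) * z
        out.append(z)
    return out


def cusum_statistics(residuals, k=0.5):
    """Textbook two-sided tabular CUSUM with target 0 and reference value k.

    Returns a list of (upper, lower) pairs, where
    upper_t = max(0, upper_{t-1} + r_t - k) accumulates upward drift and
    lower_t = max(0, lower_{t-1} - r_t - k) accumulates downward drift.
    """
    upper, lower, out = 0.0, 0.0, []
    for r in residuals:
        upper = max(0.0, upper + r - k)
        lower = max(0.0, lower - r - k)
        out.append((upper, lower))
    return out


# Toy residuals (made up): a small sustained upward shift starts at the 4th point.
residuals = [0.1, -0.3, 0.2, 1.4, 1.1, 1.6, 1.3]
print([round(z, 3) for z in ewma_statistics(residuals)])
print([(round(u, 2), round(l, 2)) for u, l in cusum_statistics(residuals)])
```

Because each statistic carries forward earlier residuals, a run of moderately large values builds up steadily, whereas a Shewhart chart would judge each of those points in isolation.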
The remainder of this paper is structured as follows. Section 2 presents a description of the data used in the current study. Section 3 highlights the proposed research methodology for crash frequency modeling and statistical process monitoring. Section 4 provides a comparative study of the proposed control charts using an extensive simulation study. Section 5 presents the implementation of the proposed methods on the traffic collision data. Finally, Section 6 summarizes key findings, study limitations, and outlook for future research.
2. Data Description
Motor vehicle crash data used in this study were procured from the Ministry of Transport, Riyadh, Saudi Arabia. A total of 47,984 crashes were reported during the three-year study period (January 2017 to December 2019). The study focuses explicitly on crashes involving only motor vehicles along intercity rural highways that fall under the Ministry of Transport’s jurisdiction. A significant proportion of these highways run through plain and desert terrain, with warm to high temperatures during most of the year. Road inventory data were collected from the ministry for the available sections; for the others, a geographic information system (GIS) tool was used to extract roadway geometric features. Each crash record comprises several explanatory variables (shown in Table 1), including road type, road surface conditions at the site, post-crash damage type, weather conditions, and the presence or absence of road markings and cat eyes.
In this study, frequency-based modeling was carried out for crashes aggregated on a daily basis. Figure 1 shows a time series plot of the aggregated daily crash counts. To assess the best-fitting distribution of the aggregated crashes, we fitted three well-known count models, i.e., Poisson, negative binomial, and COM–Poisson. The detailed diagnostic analysis and model estimation results (shown in Table 2) indicate that the COM–Poisson distribution is the best-fitting model, as it produced the minimum values of the decision criteria (i.e., log-likelihood, Akaike information criterion (AIC), and Bayesian information criterion (BIC)) compared with the other models.
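As a reminder of how such criteria are computed from a fitted model, the sketch below evaluates the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L for a Poisson fit. The daily counts are made up for illustration and are not the study data. The maximised log-likelihoods of the negative binomial and COM–Poisson fits would be passed to the same two functions, with k set to their respective numbers of parameters.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of i.i.d. Poisson(lam) counts."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)


def aic(loglik, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical daily crash counts, for illustration only.
counts = [38, 45, 51, 40, 62, 47, 35, 58, 44, 49]
lam_hat = sum(counts) / len(counts)  # Poisson maximum likelihood estimate
ll = poisson_loglik(counts, lam_hat)
print(f"logL={ll:.2f}, AIC={aic(ll, 1):.2f}, BIC={bic(ll, 1, len(counts)):.2f}")
```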
As discussed earlier, each crash comprises a number of explanatory variables that were used in modeling the crash data set. To tackle the challenge of summarizing the categorical explanatory variables against the aggregated daily crashes, we used the weighted average of the responses. The indexed value of a variable on day $i$ is defined as
\[
IV_i = \frac{\sum_{j=1}^{N_c} C_{ij} N_{ij}}{\sum_{j=1}^{N_c} N_{ij}},
\]
where $i$ indexes days, $j$ indexes categories, $C_{ij}$ is the code value of category $j$ on day $i$, $N_{ij}$ is the number of responses of category $j$ on day $i$, and $N_c$ is the total number of categories. For example, suppose that on day $i$ two crashes happened on a divided highway, one crash on an expressway, and five crashes on a single highway. Then, the indexed value for road type on day $i$ is obtained as
\[
IV_i = \frac{2\,C_{i,\mathrm{divided}} + 1\,C_{i,\mathrm{expressway}} + 5\,C_{i,\mathrm{single}}}{2 + 1 + 5},
\]
with the code values taken from Table 1.
Hence, all of the values of the explanatory variables were converted into indexed values. Further, the Pearson correlation among all variables was estimated, and the plot is presented in Figure 2. All the explanatory variables have a negative relation with the number of crashes. Road surface condition is highly correlated, weather conditions are weakly correlated, and all other variables are mildly correlated with the number of crashes.
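The indexed value described above is a count-weighted average of category codes, which can be sketched as follows. The numeric codes assigned to the road types are hypothetical placeholders, because the coding scheme of Table 1 is not reproduced here. The function and variable names are likewise illustrative.

```python
def indexed_value(code_values, responses):
    """Count-weighted average of category codes for one day.

    code_values: mapping category -> numeric code
    responses:   mapping category -> number of crashes in that category
    """
    total = sum(responses.values())
    if total == 0:
        raise ValueError("no crashes recorded on this day")
    return sum(code_values[cat] * n for cat, n in responses.items()) / total


# Hypothetical codes (NOT the ones from Table 1).
road_type_codes = {"single": 1, "divided": 2, "expressway": 3}

# The example from the text: two crashes on a divided highway,
# one on an expressway, and five on a single highway.
day_i = {"divided": 2, "expressway": 1, "single": 5}

print(indexed_value(road_type_codes, day_i))  # (2*2 + 3*1 + 1*5) / 8 = 1.5
```

With a different coding scheme the numerical value changes, but the computation stays the same: each code is weighted by the number of crashes recorded in that category on the day.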
In brief, the COM–Poisson distribution is the best-fitting distribution for the crashes, and several explanatory variables are significantly correlated with the crash counts. Therefore, COM–Poisson regression is used to measure the relationship between the crashes and the corresponding significant explanatory variables. The COM–Poisson regression and the relevant control charting design structure are described in the next section.
5. Implementation of Proposed Methods for Real-Time Crash Monitoring
In this section, we apply COM–Poisson regression to identify the significant factors contributing to crash occurrence. Further, based on the estimated model, we implement the proposed charts for real-time monitoring.
A close view of Figure 1 shows that daily crash counts exhibit a steady downward trend after December 2018 and do not exceed that level again. This downward trend in the observed crash record may be attributed to the implementation and enforcement of a new automated citation system introduced at the start of 2018 under the SAHER program [79]. SAHER is an automated system for controlling traffic using a digital network of cameras connected to a central information center. Hence, crash data up to this point act as in-control (IC) data and are used for establishing the control chart structure, and the remaining data are considered for the monitoring phase (in an out-of-control (OOC) state).
As discussed in Section 2, the indexed values of road type, road surface conditions, post-crash damage type, weather conditions, road markings, and cat eyes are considered as possible explanatory variables for the crashes. Therefore, to identify the most significant explanatory variables, we fitted COM–Poisson regression models to the IC data for the different combinations of these six variables. Out of the 63 candidate models, the following model is considered the best based on the minimum values of the decision criteria used in Section 2 (log-likelihood, AIC, and BIC):
where IVRT and IVRSC represent the indexed values of road type and road surface conditions, respectively. It is evident from the results of the IC model that the intercept term, road type, and road surface conditions are statistically significant explanatory/predictor variables for the crashes with
p-values < 0.01 and standard errors of
and 0.0120, respectively. Moreover, the estimated dispersion parameter is
which is also statistically significant with a
p-value of less than 0.01 and a standard error of
.
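The figure of 63 candidate models matches the number of non-empty subsets of six candidate variables (2^6 − 1 = 63), which suggests an all-subsets search, although the text does not state this explicitly. Under that assumption, the enumeration can be sketched as below. The variable labels and the `criterion` callable (for example, a function returning the AIC of a model fitted on a given subset) are hypothetical, and the model fitting itself is not shown.

```python
from itertools import combinations

# Labels for the six candidate indexed variables (illustrative names).
candidates = [
    "road_type",
    "road_surface",
    "damage_type",
    "weather",
    "road_markings",
    "cat_eyes",
]


def all_nonempty_subsets(items):
    """Every non-empty subset of items, as tuples, ordered by subset size."""
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


def best_subset(subsets, criterion):
    """Return the subset with the smallest criterion value.

    `criterion` is a caller-supplied function mapping a subset of variable
    names to a number such as AIC or BIC.
    """
    return min(subsets, key=criterion)


subsets = all_nonempty_subsets(candidates)
print(len(subsets))  # 63
```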
For the development of the control chart setup, we considered the above-mentioned IC model. Further, the normal probability plots in Figure 4 show that IVRT is normally distributed with a mean of 1.86 and a variance of 0.017 (cf. Figure 4a). Similarly, IVRSC follows a normal distribution with a mean of 1.81 and a variance of 0.08 (cf. Figure 4b). Based on these estimates and following Section 4.1, we set up the simulation settings of the COM–Poisson model; the control charting constants were obtained using the algorithm given in Section 4.1 and are reported in Table 7. It is noted that we have assumed
,
, and
.
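Control charting constants of this kind are typically calibrated by simulating run lengths until a target in-control average run length (ARL) is reached. The sketch below illustrates the idea for the simplest case only: a two-sided Shewhart rule applied to standard normal residuals. It is not the algorithm of Section 4.1, and the limit, shift size, number of runs, and seed are assumptions chosen for illustration.

```python
import random
import statistics


def run_length(limit, shift, rng, max_steps=100_000):
    """Number of observations until |residual| first exceeds the limit.

    Residuals are drawn from N(shift, 1); shift = 0 is the in-control case.
    Runs that never signal are censored at max_steps.
    """
    for t in range(1, max_steps + 1):
        if abs(rng.gauss(shift, 1.0)) > limit:
            return t
    return max_steps


def arl_sdrl(limit, shift=0.0, n_runs=1000, seed=1):
    """Monte Carlo estimates of the average run length and its standard deviation."""
    rng = random.Random(seed)
    lengths = [run_length(limit, shift, rng) for _ in range(n_runs)]
    return statistics.mean(lengths), statistics.stdev(lengths)


# In control, a 3-sigma limit has a theoretical ARL of about 370;
# a one-sigma upward shift shortens the run length considerably.
print("in-control ARL, SDRL:", arl_sdrl(3.0, shift=0.0))
print("shifted ARL, SDRL:   ", arl_sdrl(3.0, shift=1.0))
```

In a calibration loop, the limit would be adjusted until the simulated in-control ARL matches the desired value, and the same simulation would then be repeated under shifted parameters to compare charts.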
The randomized quantile residual-based charts (i.e., QR-COM–P Shewhart, QR-COM–P EWMA, and QR-COM–P CUSUM) and deviance residual-based charts (i.e., DR-COM–P Shewhart, DR-COM–P EWMA, and DR-COM–P CUSUM) are implemented on the crash data set. The QR-COM–P Shewhart, QR-COM–P EWMA, and QR-COM–P CUSUM charts are plotted in Figure 5, whereas the DR-COM–P Shewhart, DR-COM–P EWMA, and DR-COM–P CUSUM charts are given in Figure 6. In every chart, the red line represents plotting statistics based on the residuals from the IC model, while the blue line shows plotting statistics based on OOC residuals. Green dotted lines show the control limits of every chart.
It is observed from Figure 5 and Figure 6 that the QR-COM–P Shewhart chart signaled four false alarms and 26 OOC points, and the same numbers of signals were detected by the DR-COM–P Shewhart chart. The QR-COM–P EWMA and DR-COM–P EWMA charts show 13 and 14 false alarms and 76 and 80 OOC signals, respectively. Further, the QR-COM–P CUSUM and DR-COM–P CUSUM charts report 22 and 23 false alarms and 124 and 126 OOC points, respectively.
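The tallies above separate signals raised during the IC period (false alarms) from signals raised during the monitoring period (OOC points). A minimal sketch of that bookkeeping is given below. The plotting statistics, control limits, and split point are made-up values, not those of Figures 5 and 6.

```python
def count_signals(stats, lcl, ucl):
    """Number of plotting statistics falling outside the control limits."""
    return sum(1 for s in stats if s < lcl or s > ucl)


def phase_summary(stats, n_ic, lcl, ucl):
    """Return (false alarms in the first n_ic points, OOC signals afterwards)."""
    false_alarms = count_signals(stats[:n_ic], lcl, ucl)
    ooc_signals = count_signals(stats[n_ic:], lcl, ucl)
    return false_alarms, ooc_signals


# Toy plotting statistics: the first five points form the IC period.
stats = [0.2, -0.5, 3.4, 0.9, -0.1, -3.2, -3.6, -2.1, -4.0]
false_alarms, ooc_signals = phase_summary(stats, n_ic=5, lcl=-3.0, ucl=3.0)
print(false_alarms, ooc_signals)  # 1 3
```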
6. Summary, Conclusions, and Recommendations
During the past few decades, many studies have proposed a wide range of statistical modeling approaches to explore the relationships between causal factors and crash occurrence. However, very few have focused on developing suitable procedures for real-time highway safety surveillance using available crash records. Control charts are useful SPC tools with numerous applications in active event monitoring. Traditionally, the Poisson distribution is used to model count data for a control chart; however, it may be inappropriate for under-dispersed or over-dispersed data. This study proposes EWMA and CUSUM control chart schemes for highway safety monitoring using three years of crash data (2017–2019) for rural highways in Saudi Arabia. During the first stage of the study, three well-known count data distribution models (Poisson, negative binomial, and COM–Poisson) were investigated to identify the most appropriate distribution for the data. The findings showed that the COM–Poisson model is the best-fitting statistical model.
During the second stage of the study, EWMA- and CUSUM-type charts based on the residuals (i.e., deviance or randomized quantile residuals) of the COM–Poisson regression model were developed from a crash monitoring perspective. An extensive simulation study was designed to evaluate the performance of the proposed control chart schemes and compare them with the existing Shewhart-type chart. The results revealed that the charts based on deviance residuals (i.e., DR-COM–P Shewhart, DR-COM–P EWMA, and DR-COM–P CUSUM) were relatively more flexible and efficient in detecting increasing shifts in the mean. Further, the EWMA-type charts (i.e., QR-COM–P EWMA and DR-COM–P EWMA) outperformed the Shewhart- and CUSUM-type charts in terms of the considered evaluation metrics (both ARL and SDRL). The simulation results also indicated an inverse relationship between the reference value (smoothing parameter) and the performance of the GLM-based CUSUM (EWMA) control charts. Finally, during the third stage of the study, the proposed monitoring methods were successfully implemented on real-time crash data. The findings of this study could provide useful guidance to policy- and decision-makers for initiating concrete steps to improve road users’ safety. The proposed real-time monitoring for highway safety surveillance can guide the effective and proactive implementation of different hazard control measures to mitigate the occurrence of future traffic crashes. Quick hazard control measures suggested in this regard include improving road surface conditions, installing variable message signs for guidance and warnings, raised pavements, delineators on horizontal curves, weather warning systems, ramp metering, and introducing speed and traffic calming measures at crash hotspots, among others.
This study has a few limitations that might be addressed in future studies. The present study was designed assuming known parameters, and it will be interesting to see the effect of parameter estimation on the charts in forthcoming studies. The performance of the proposed control chart schemes was verified using a limited (three-year) real-life crash dataset. Further investigations using other detailed datasets are needed for a more precise assessment of the statistical properties of the suggested monitoring procedures. Future studies could also consider applying the proposed methods to real-life monitoring of crashes by individual severity group (fatal, injury, and property damage) or by specific crash type. A monitoring scheme based on crash severity classes would help in prioritizing prevention strategies, with the emphasis placed on more severe crashes. Future studies could also investigate the impact of auto-correlated response variables. Furthermore, one may adopt advanced charting structures such as moving average, progressive moving average, mixed EWMA–CUSUM, and HEWMA-type charts.