1. Introduction
With the rapid growth of urbanization, urban road safety has emerged as one of the most pressing social issues around the world [
1,
2]. According to the Organization for Economic Co-operation and Development (OECD) statistics, the road safety situation has exhibited a negative trend worldwide over the last two decades [
3]. Road traffic accidents impose a heavy burden in terms of both human casualties and national economies [
4,
5]. In 2013, the World Health Organization (WHO) reported that between 3.6 and 18.8 deaths per 100,000 individuals in China resulted from vehicle crashes [
6]. Likewise, in the Republic of Korea, approximately 200 thousand accidents and 7.3 casualties per 100,000 people were recorded in 2018. These figures have continued to rise in recent years [
7]. Therefore, the ability to understand and forecast potential accidents (e.g., where, when, or how they occur) is very useful not only to public safety stakeholders (e.g., police), but also to transportation administrators and individual travelers. More specifically, accurate traffic accident prediction can dynamically inform commuters, traffic management systems, and city planners about alternatives, optimizations, and interventions. By helping road users avoid high-risk areas based on our predictions, we aim to reduce the number of accidents, moving from mere prediction to active prevention.
The severity of traffic accidents is influenced by various factors, including human factors and road environmental factors [
8,
9,
10]. Extensive research has been conducted to identify the significant factors that contribute to the severity of traffic accident injuries [
11,
12,
13,
14,
15,
16]. For example, the authors in [
12] proposed a framework for analyzing and predicting the injury severity of traffic accidents, considering factors such as road types, weather conditions, and lighting. They utilized a stacked sparse autoencoder to incorporate comprehensive factors in the analysis of traffic accidents. Similarly, the authors in [
13] examined factors such as fatigue, gender, and internal/external distractions (e.g., rushing to a destination, listening to music) and assessed their impact on perceived and observed aggressive driving behaviors through surveys and simulations. Such analyses provide insights for appropriate responses, including the enactment of laws, infrastructure repairs, and the deployment of additional speed cameras. In this research, we also comprehensively address the various factors that contribute to traffic accidents and the severity of risks.
Although determining the factors that influence risk is essential, proactive actions to prevent traffic accidents must also be taken ahead of time. In recent years, deep learning methods have gained popularity as powerful techniques for extracting information from big data and have demonstrated their efficiency in several applications, typically in prediction tasks [
17,
18,
19,
20,
21]. Therefore, much research has been conducted on forecasting road traffic accidents and predicting injury severity in urban areas using numerous types of data [
22,
23,
24,
25,
26,
27,
28,
29,
30]. For example, the authors in [
22] proposed a traffic accident casualty prediction model using neural networks and data mining techniques. Specifically, they used historical data such as the floating population, number of registered cars, and number of accidents to predict the casualties of traffic accidents. Similarly, in [
23], the research presents a spatio-temporal deep learning model to predict citywide short-term crash risk using multiple data such as land use, weather, and crash risks. The authors in [
24] also proposed a traffic accident count prediction model using a Bayesian hierarchical approach. The proposed model can rank the candidate sites, called hotspots, according to their potential risks for some future time period, and further provide simple diagnostics to validate the predictive capability of the proposed model. In [
26], the authors introduced an end-to-end deep learning model that integrates satellite imagery, GPS trajectories, road maps, and accident histories to predict traffic accidents. The authors in [
27] constructed a Long Short-Term Memory (LSTM) network-based model to predict the probability of traffic accidents based on spatio-temporal patterns of traffic accident frequency. The authors in [
28] utilized logistic regression analysis on 400 sets of accident data from 10 major roads in Beijing to identify significant factors influencing traffic accidents and to develop an accident hotspot prediction model. The authors in [
30] predicted traffic risks as well as traffic speed and flow, drawing on the broad applicability of deep learning algorithms to mobility data such as traffic data from infrastructure, trajectory data from vehicles, and data from automatic fare collection devices widely deployed by urban transit systems.
In urban traffic accident prediction studies, one of the most crucial challenges is the sparsity of accident data, making it difficult to develop accurate prediction models [
31,
32]. Even though the global trend of urban traffic accidents tends to increase, traffic accidents are rare and infrequent events. Consequently, datasets for model training usually comprise a great portion of zero values, representing the non-occurrence of accidents. These data imbalances will cause a huge bias in model training and consequently affect the overall prediction performance [
33,
34]. Specifically, while such datasets might cause models to obtain high prediction accuracy, they also hide significant deficiencies in the actual prediction ability of the model. This imbalance not only misrepresents the efficiency of the model but also undermines its potential utility in real-world applications. Therefore, it is necessary for the prediction model to cope with such an imbalanced data problem to develop a high-performing model.
To address these challenges, this study introduces a novel system leveraging deep learning-based methods to predict urban traffic accidents and identify their severity, particularly under imbalanced data conditions. The proposed system not only forecasts the occurrence of urban traffic accidents but also estimates their severity as risk levels. In this study, we divided the whole target area into $N \times N$ cells to serve as the fundamental units, which helps reduce the excessive zero values in the input and enhances the predictability of traffic accident occurrence. To further reduce the effect of zero values, we aggregated cells into grids and used a Convolutional Neural Network (CNN) to filter out the non-accident grids. The CNN module, renowned for its spatial feature extraction capabilities, discerns intricate spatial correlations and dependencies within the urban grid. This spatial understanding is pivotal, given the heterogeneous distribution of traffic accidents. After the non-accident grids are filtered out, the Deep Neural Network (DNN) module, which shows high performance in data classification, is used to evaluate the risk severity of each cell. By combining these architectures, we aim to harness both the spatial understanding of CNNs and the deep feature interpretation capabilities of DNNs to develop a high-performing prediction system. To improve the model performance, we further utilize large-scale datasets from a variety of sources in the urban area, including mobility data (e.g., digital tachograph (DTG)-based risky driving behaviors, traffic flow and speed, etc.) and road environment data (e.g., information related to safety facilities, road infrastructure, geometry, etc.).
The proposed deep learning-based system has two main objectives. First, it aims to predict the occurrence of traffic accidents accurately, especially forecasting potential accident “hotspots” in urban environments. To achieve this, we introduce a grid-clustered feature map using the concepts of grids and cells to deal with the data imbalance problem in training datasets. Through the feature map, we mitigate the bias problem, especially the high frequency of zero values in training datasets. This approach can effectively capture the different characteristics of urban areas and improve the model performance. In addition, we leverage various types of data, including traffic accident, urban mobility, and road safety facility data, to enhance our model’s performance. While our primary emphasis remains on prediction capabilities, the potential applications within traffic recommendation systems are too significant to overlook. Second, the proposed system estimates the severity of risks using the Accident Risk Index (ARI), which is based on accident attributes such as the number of fatalities, serious injuries, and minor injuries. The ARI is used to categorize the risk levels associated with a given set of data, ranging from level 0 (no accidents) to levels 1–4, representing varying degrees of accident severity. In addition to classifying risk levels based on actual traffic safety data, the proposed ARI also gives road users a more intuitive and straightforward view of traffic safety conditions.
This paper is organized as follows. 
Section 2 presents the data used in this paper. 
Section 2.1 introduces the concept of cells to represent the urban area. 
Section 2.2 presents the urban road traffic accident and accident risk index. 
Section 2.3 and 
Section 2.4 describe the urban mobility data and the road safety facility information, respectively. In 
Section 3, we introduce the methodology in this study. In 
Section 3.1, the overall system architecture is introduced. 
Section 3.2 presents the method to predict traffic accidents using an urban grid clustered feature map. In 
Section 3.3, how to estimate the risk level in each cell is presented. 
Section 4 presents the results of the paper. In 
Section 4.1, the experimental design is introduced. 
Section 4.2 describes the experiment results, and a related discussion is presented in 
Section 4.3. Finally, in 
Section 5, conclusions and future work are presented.
  2. Data Description
This section explains how to deal with various data and preprocess them for model training. To predict urban traffic accidents, we handle a variety of datasets, including traffic accidents, urban mobility, and road safety facility data. To be more specific, the Korean National Police Agency [
35] released the statistics of Korean traffic accidents with severity information (e.g., death, serious injury, or slight injury). In addition, commercial vehicles (such as buses and taxis) registered with transportation corporations are required to be equipped with an Onboard Unit (OBU). This equipment enables the acquisition of the drivers’ Digital Tachograph (DTG) data, including trip, location, and speed, in real time. The top 11 dangerous driving behaviors, such as sudden start, sudden turn, and overspeed, have been identified by the Korea Transportation Safety Authority using the DTG data [
36]. Meanwhile, local governments are in charge of maintaining data on road geometry, such as the number of speed cameras, road signs, and school zone management. Some of them also run probe vehicles with OBUs deployed for the Cooperative Intelligent Transportation System (C-ITS). The proposed system is designed to predict traffic accidents and their severity by leveraging the data dispersed across these departments. 
Figure 1 depicts the overall data preprocessing process.
  2.1. Cell Representation of Urban Area for Accident Analysis
In addressing the task of traffic accident prediction, the selection of an appropriate scale is vital to both the precision and computational efficiency of the training model. In this study, we adopt a cell-based approach, where the entire geographical scope is divided into a matrix of $N \times N$ cells, to serve as the units of analysis. This choice of scale offers several strategic advantages.
First, a much finer-scale approach, such as a link-level analysis (where a link refers to a single road segment), is capable of providing detailed insights but comes with a significantly higher computational cost. Moreover, at the city scale, the link-based strategy may not be the most appropriate method of analysis, given that traffic accidents tend to concentrate in specific areas rather than being evenly distributed across the entire road network. Consequently, utilizing link-level data could generate a vast number of non-accident data points, resulting in an imbalanced dataset that might affect the predictive ability and reliability of the model. In contrast, the cell-based approach facilitates a more focused and efficient analysis by aggregating traffic data at the cell level. This not only improves the computational efficiency but also promotes a more balanced dataset. The choice of cell size, however, involves a trade-off. If the cell is too large, we reduce the number of cells with little or no accident data and increase the computational efficiency, but we may lose spatial patterns and anomalies, since critical localized events or conditions might be averaged out. Smaller cells, on the other hand, can capture very localized patterns and provide high-resolution predictions, but they increase data sparsity, with many cells potentially having zero or near-zero traffic accident events. This sparsity can also hamper predictability, possibly leading the model to overfit noise or specific anomalies.
Therefore, in light of the above considerations, this research leverages a cell-based approach with appropriate size to address the complexities associated with traffic accident prediction at the city scale. The cell representation of the study area is shown in 
Figure 2. In this experiment, we discretized the study area into 100 by 100 cells (
$N$ = 100). The width and height of each cell are approximately 230 m. The features of the data used in the study are shown in 
Table 1. We assign these values to the respective cells. If there are several values in one cell, they are added together to represent the characteristics of the cell. For example, the total number of traffic accidents in a cell is determined by combining all the accidents that occurred in that specific cell.
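The cell assignment and per-cell summation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes records are already in a projected coordinate system measured in metres, and the names `cell_index`, `aggregate`, `x_min`, `y_min`, and the sample points are hypothetical.

```python
from collections import defaultdict

N = 100            # cells per side, as stated in the text
CELL_SIZE = 230.0  # approximate cell width/height in metres, as stated in the text


def cell_index(x, y, x_min, y_min, cell_size=CELL_SIZE, n=N):
    """Map a projected coordinate (metres) to a (row, col) cell index."""
    col = int((x - x_min) // cell_size)
    row = int((y - y_min) // cell_size)
    # Clamp so that points on the outer boundary stay inside the matrix.
    col = min(max(col, 0), n - 1)
    row = min(max(row, 0), n - 1)
    return row, col


def aggregate(records, x_min, y_min):
    """Sum the values of all records that fall into the same cell."""
    totals = defaultdict(float)
    for x, y, value in records:
        totals[cell_index(x, y, x_min, y_min)] += value
    return dict(totals)


# Hypothetical accident records: (x, y, count). Two fall into the same cell.
accidents = [(10.0, 20.0, 1), (100.0, 50.0, 1), (300.0, 20.0, 1)]
cell_totals = aggregate(accidents, x_min=0.0, y_min=0.0)
```

The same routine could be reused for any feature that is summed per cell, such as facility counts or dangerous driving events.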
  2.2. Urban Road Traffic Accident and Accident Risk Index (ARI)
This subsection describes the measure used to quantify the severity of urban traffic accidents. In this study, we refer to it as the Accident Risk Index (ARI). The Korean National Police Agency keeps track of every road traffic accident and analyzes it with a range of criteria such as driver age, gender, and driving condition. Based on these statistics, we analyzed road traffic accidents in Daejeon City over the whole of 2019 (from 1 January to 31 December). In 2019, there were 8337 accidents in Daejeon City.
The number of accidents can be counted in each cell created using the Geographical Information System (GIS) in the previous step, and the associated spatial distribution of the traffic accidents is shown in 
Figure 3a. The figure shows that most traffic accidents happen in urban areas. Even though we can clearly identify the number of accidents on the map, it is difficult to assess the accident risk in each cell. In other words, the total number of accidents in a cell might not be an appropriate measurement, since it reflects neither the severity of each accident nor the traffic volume in that cell. Therefore, we define a new measurement, the ARI, to measure the accident risk of the cells more precisely. The related expression is shown as follows:
$$\mathrm{ARI}_i = \frac{w_d D_i + w_s S_i + w_l L_i}{V_i}$$
where $D_i$, $S_i$, and $L_i$ represent the number of deaths, serious injuries, and slight injuries in cell $i$, respectively. We used a weighted sum approach to reflect the severity of each cell differently. According to the standards from the Korea Transportation Safety Authority [36], the appropriate values are $w_d$ = 1, $w_s$ = 0.7, and $w_l$ = 0.3. Additionally, the volume of traffic at each cell ($V_i$) is utilized to normalize the summation appropriately. The GIS-based distribution of ARI values is shown in 
Figure 3b. Compared to the previous one, this figure better depicts the accident risk in the target area. To be more specific, in the previous figure, traffic accidents are depicted as dispersed dots, which can make it challenging to identify consistent patterns or hotspots, especially when some accidents are rare occurrences. This scattered representation might not be as informative for decision-makers or urban planners aiming to prioritize areas for interventions. On the other hand, the ARI map aggregates this information, reducing data sparsity and providing a clearer depiction of accident-prone regions or “hotspots”. Therefore, the ARI map can offer more focused and actionable insight into areas that consistently show higher accident rates.
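As a worked illustration of the ARI, the sketch below computes a severity-weighted sum with the stated weights (1, 0.7, and 0.3) and normalizes it by traffic volume. Treating the normalization as a plain division, and returning 0 for cells with no recorded traffic, are assumptions made here for illustration; the function name `accident_risk_index` and the sample numbers are hypothetical.

```python
# Severity weights as stated in the text (death, serious injury, slight injury).
W_DEATH, W_SERIOUS, W_SLIGHT = 1.0, 0.7, 0.3


def accident_risk_index(deaths, serious, slight, traffic_volume):
    """Severity-weighted casualty sum for one cell, divided by its traffic volume.

    Assumption: the normalization is a simple division, and a cell with no
    recorded traffic is assigned an index of 0.
    """
    if traffic_volume == 0:
        return 0.0
    weighted = W_DEATH * deaths + W_SERIOUS * serious + W_SLIGHT * slight
    return weighted / traffic_volume


# Hypothetical cell: 1 death, 2 serious injuries, 3 slight injuries, 1000 vehicles.
ari = accident_risk_index(deaths=1, serious=2, slight=3, traffic_volume=1000)
# Weighted sum = 1*1 + 0.7*2 + 0.3*3 = 3.3, so the index is about 0.0033.
```

Under this reading, two cells with the same casualties receive different indices when their traffic volumes differ, which is the property the text asks of the measure.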
  2.3. Urban Mobility Data
This subsection describes the urban mobility data used in our experiment. We utilized basic traffic information, Digital Tachograph (DTG) data, and the top 11 dangerous driving behaviors exhibited by both commercial and general probe vehicles in this study. Specifically, the basic traffic information consists of the average speed and traffic volume. All commercial and probe vehicles are equipped with onboard units (OBUs) for the purpose of gathering DTG data, which includes details about the trip duration, location, and speed. The top 11 dangerous driving behaviors can be determined from the DTG data based on specific rules. We hypothesize that urban mobility data, especially information on dangerous driving behaviors, is a contributing factor in urban traffic accidents. Consequently, we incorporated these variables as features in our model to forecast accidents and estimate risk severity. The DTG data includes GPS details logged every second, allowing us to graphically represent vehicle pathways, as depicted in 
Figure 4.
The top 11 dangerous driving behaviors include overspeed, overspeed time, sudden acceleration, sudden deceleration, sudden start, sudden stop, sudden left turn, sudden right turn, sudden u-turn, sudden overtaking, and sudden lane change. 
Figure 5 represents the process of extracting the top 11 dangerous driving behaviors from DTG data. These behaviors are visualized in 
Figure 6, illustrating the aggregation of all dangerous driving behaviors occurring in each cell. Similar to the previous figures, dangerous driving behaviors are predominantly concentrated in the urban area.
  2.4. Road Safety Facility Information
In this subsection, we describe the utilization of road safety facility information in our study, a necessity arising from the inadequacy of warning signs in the target area [
8]. The dataset includes a variety of safety facilities and road information, such as the location of traffic signals, controllers, various categories of warning signs, and CCTV installations. Furthermore, the dataset integrates land use information such as residential, commercial, industrial, and green zones. Similar to the previous processing procedure, we categorize each type of facility and undertake a quantitative analysis within individual cells.
  4. Result
  4.1. Experimental Design
This subsection describes the experimental design used to evaluate the proposed system, which uses a variety of data sources, including traffic accident data, urban mobility data, and road safety facility information, to predict urban traffic accidents and estimate risk levels. The main objective of this experiment is to find optimal models for predicting traffic accidents and estimating risk levels in urban areas. We used the dataset covering every day of 2019 to predict traffic accidents and estimate daily risk levels. The basic unit of time t is one day. Specifically, our dataset covers 365 days, with each day divided into 10,000 spatial cells, resulting in a total of 3,650,000 data points. We split the data into a training set and a test set with a ratio of 0.8 to 0.2. To ensure that the model is tested on unseen days, we split the dataset by day, so that 80% of the days (292 days) were used for training and the remaining 20% (73 days) for testing.
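A day-level split of this kind can be sketched as below. The text does not state whether days were selected randomly or chronologically; the random shuffle with a fixed seed is an assumption for illustration, and `split_by_day` is a hypothetical helper name.

```python
import random


def split_by_day(n_days=365, train_ratio=0.8, seed=0):
    """Assign whole days to either the training or the test set.

    Assumption: days are shuffled randomly; a chronological split would
    simply skip the shuffle.
    """
    days = list(range(n_days))
    random.Random(seed).shuffle(days)
    n_train = round(n_days * train_ratio)
    return sorted(days[:n_train]), sorted(days[n_train:])


train_days, test_days = split_by_day()

CELLS_PER_DAY = 100 * 100                   # 10,000 spatial cells per day
total_points = 365 * CELLS_PER_DAY          # 3,650,000 data points in total
train_points = len(train_days) * CELLS_PER_DAY
test_points = len(test_days) * CELLS_PER_DAY
```

Because every cell of a given day lands on the same side of the split, no test day contributes samples to training.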
First, we evaluated the performance of the accident detector for urban traffic accident prediction. We compared the performance of the proposed model with other baseline models, including Support Vector Machine (SVM), Linear Regression (LR), Naïve Bayes Classification (NBC), and Multi-Layer Perceptron (MLP). The performance of the accident risk classifier was then further assessed using the results obtained from the previous step. In this experiment, we classified the risk level on a scale of 0 to 4 using SVM and MLP as the baseline models.
Here are the detailed descriptions of baseline models.
- Support Vector Machine (SVM) [41]: SVM is a supervised learning algorithm that aims to find the optimal hyperplane that best separates the data into classes. The method shows effectiveness in high-dimensional spaces; 
- Linear Regression (LR) [42]: LR is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. It predicts the output based on the linear relationship with the input features; 
- Naïve Bayes Classification (NBC) [43]: NBC is a probabilistic classifier based on applying Bayes’ theorem with the “naïve” assumption of conditional independence between every pair of features; 
- Multi-Layer Perceptron (MLP) [44]: MLP is a class of feedforward artificial neural networks that consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. It is known for its ability to capture non-linear relationships of input and output data. 
To frame risk estimation as a classification problem, we converted the numerical ARI values into categorical data reflecting risk levels from 0 to 4. These ARI values are grouped into quantiles, with risk level 0 corresponding to the lowest quantile (numerical value 0.0) and risk level 4 to the highest quantile.
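One way to read this binning is sketched below: cells with an ARI of 0.0 form level 0, and the remaining positive values are divided into four equally populated groups for levels 1 to 4. The exact quantile boundaries are not given in the text, so this quartile-of-positives scheme is an assumption, and `risk_levels` is a hypothetical helper.

```python
from bisect import bisect_right


def risk_levels(ari_values, n_levels=4):
    """Map ARI values to risk levels 0..n_levels.

    Assumption: level 0 is reserved for ARI == 0.0, and positive values are
    split at the quantile boundaries of the positive values only.
    """
    positives = sorted(v for v in ari_values if v > 0)
    if not positives:
        return [0] * len(ari_values)
    # Cut points at the 1/4, 2/4, and 3/4 positions of the sorted positives.
    cuts = [
        positives[min(len(positives) - 1, (k * len(positives)) // n_levels)]
        for k in range(1, n_levels)
    ]
    return [1 + bisect_right(cuts, v) if v > 0 else 0 for v in ari_values]


# Hypothetical ARI values for ten cells.
sample_ari = [0.0, 0.1, 0.2, 0.3, 0.4, 0.0, 0.5, 0.6, 0.7, 0.8]
levels = risk_levels(sample_ari)
```

With the eight positive values above, each of levels 1 to 4 receives two cells, while the two zero-valued cells stay at level 0.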
In this study, we used accuracy, precision, recall, and F1 score as evaluation metrics. The F1 score, calculated as the harmonic mean of precision and recall, is a widely used measure in binary classification. The related expression is shown below.
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
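The following sketch computes the four metrics from binary labels and shows why accuracy alone can mislead on imbalanced data. The 5%/95% class ratio and the "always predict no accident" model are made-up values for illustration; `binary_metrics` is a hypothetical helper, not part of the evaluated system.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = accident)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced sample: 5 accident units, 95 non-accident units.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a model that never predicts an accident
metrics = binary_metrics(y_true, always_negative)
```

Here the trivial model reaches 0.95 accuracy while its precision, recall, and F1 score are all 0, which is the kind of hidden deficiency that motivates reporting all four metrics.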
  4.2. Results of Traffic Accident Prediction and Risk Level Estimation
In this subsection, we present the results of two experiments conducted to evaluate the performance of our proposed system. The first experiment aims to validate the performance of accident prediction against other baseline models. In this task, the target unit to predict is the grid, which is one of the main contributions of this study. The training set consists of 29,200 grids (365 days × 0.8 × (10 × 10) grids), and the test set has 7300 grids (365 days × 0.2 × (10 × 10) grids) in the traffic accident prediction task. The results of this experiment are presented in 
Table 3. The empirical results reveal that the proposed model significantly outperforms the comparative models. One notable finding is that even though the other models show high accuracy, their performance is low on the other metrics. Compared to the baseline models, the proposed model shows much better and more stable performance. One key insight from these findings is that CNN models are particularly adept at processing grid-structured input. Moreover, the proposed model can capture and consider the features of neighboring areas, which contributes to its enhanced predictive accuracy.
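To illustrate how grid-level targets reduce the share of zero-valued units, the sketch below aggregates a 100 × 100 cell matrix into 10 × 10 grids. It assumes that each grid is a block of 10 × 10 cells (inferred from the stated cell and grid counts) and that a grid is labeled as an accident grid when any of its cells contains an accident; both points, and the sample data, are assumptions for illustration.

```python
def grid_labels(cell_counts, cells_per_grid=10):
    """Label each grid 1 if any of its cells has at least one accident."""
    n = len(cell_counts)
    g = n // cells_per_grid
    labels = [[0] * g for _ in range(g)]
    for r in range(n):
        for c in range(n):
            if cell_counts[r][c] > 0:
                labels[r // cells_per_grid][c // cells_per_grid] = 1
    return labels


def zero_fraction(matrix):
    """Share of entries equal to zero in a 2D list."""
    flat = [v for row in matrix for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)


# Hypothetical day with accidents in only two cells.
cells = [[0] * 100 for _ in range(100)]
cells[3][7] = 1
cells[95][42] = 2

labels = grid_labels(cells)
zero_cells = zero_fraction(cells)    # share of empty cells
zero_grids = zero_fraction(labels)   # share of empty grids
```

In this toy case the share of zero-valued units drops from 0.9998 at the cell level to 0.98 at the grid level, so positive examples become relatively more frequent for the grid-level detector.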
In the next step, we estimate the risk levels exclusively for the cells in the grids identified as containing accidents in the previous step. We also divide the obtained data with the same ratio of 0.8 to 0.2 for model training and testing. 
Table 4 shows the result of risk level estimation in each cell with two comparative models. We first measure the models’ performance using accuracy. The results indicate that SVM and MLP are unsuitable for the target risk-level estimation task. On the other hand, the proposed accident risk classifier shows over 80% accuracy in our task. We further evaluate our proposed model with other metrics to examine its effectiveness, and the related results are shown in 
Table 5. The proposed model is also stable and shows high performance on the other metrics.
From the aforementioned results, it is evident that the proposed system is highly efficient in predicting traffic accidents and estimating risk levels. Its proficiency in analyzing grid-structured inputs and incorporating neighboring features enhances its predictive accuracy. In addition, a hierarchical accident risk classifier is also beneficial in multi-risk classification tasks. The proposed system can be efficiently used in the domain of traffic safety and management.
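The two-stage flow evaluated above (grid-level filtering followed by cell-level risk classification) can be sketched as follows. The two callables stand in for the trained CNN accident detector and the DNN risk classifier, whose internals are not reproduced here; assigning level 0 to every cell in a filtered-out grid and using 10 × 10 cells per grid are assumptions for illustration, and all names are hypothetical.

```python
def two_stage_predict(cell_features, grid_has_accident, classify_cell,
                      cells_per_grid=10):
    """Classify only the cells whose enclosing grid passes the first stage.

    cell_features: dict mapping (row, col) -> feature object for that cell.
    grid_has_accident: callable taking a (grid_row, grid_col) tuple, returning bool.
    classify_cell: callable taking a feature object, returning a risk level 0..4.
    """
    levels = {}
    for (row, col), features in cell_features.items():
        grid = (row // cells_per_grid, col // cells_per_grid)
        if grid_has_accident(grid):
            levels[(row, col)] = classify_cell(features)
        else:
            # Assumption: cells in filtered-out grids default to level 0.
            levels[(row, col)] = 0
    return levels


# Stand-in models for illustration only.
flagged_grids = {(0, 0)}
detector = lambda grid: grid in flagged_grids
classifier = lambda features: min(4, int(features))

cell_features = {(1, 1): 3, (15, 2): 4}
predicted = two_stage_predict(cell_features, detector, classifier)
```

Only the cell inside the flagged grid reaches the second stage, which mirrors how the filtering step keeps non-accident grids out of the risk classifier's input.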
  4.3. Discussion
The proposed system for predicting traffic accidents, estimating risk levels, and identifying risk sources is designed to provide valuable insights and tools for enhancing road safety. The system utilizes a variety of data sources, including traffic accident data, urban mobility data, and road safety facility information, to analyze and predict accident occurrences and assess risk levels. Although there are a variety of studies on predicting and analyzing traffic accidents and their severity, to the best of our knowledge, this work is a pioneering attempt to utilize a large volume and wide variety of data sources and to implement an actual system using deep learning-based models. Furthermore, the usage of cells as fundamental units instead of individual links can enhance the predictability of the occurrence of traffic accidents by reducing the number of zero values. To further reduce the occurrence of zero values, we aggregated cells into grids and used a CNN to filter out the non-accident grids.
Through a series of experiments, the system’s effectiveness is evaluated and compared with baseline models. The first experiment focuses on predicting traffic accidents using grid feature maps and compares different models, including SVM, NBC, and MLP. The basic unit in the experiment is the grid. The results show that the proposed model utilizing grid feature maps outperforms the other models, indicating the effectiveness of incorporating grid-represented input data in the proposed model.
The second experiment focuses on estimating risk levels for each cell within the grids that contain accidents. In this experiment, since our goal is to compare the hierarchical approach with direct classification, we use two classification models as baselines: SVM and MLP. These models were chosen for their widespread adoption and recognized performance in multi-class classification scenarios. The results show that our proposed model achieves high accuracy and performs well on the other metrics. These results indicate that hierarchical DNNs can simplify multi-task problems and improve overall performance.
Several challenges remain to be addressed. The first is related to expanding and applying the proposed model to other cities. The biggest challenge in doing so is the size of the cell and grid. While the initial focus has been on Daejeon City, other areas may require different cell and grid sizes, potentially larger or smaller than those used for the target area. In addition, the proposed methodology is specifically focused on traffic accident prediction. Other domains with different data characteristics might require adjustments to the methodology or might not achieve the same efficiency. Moreover, it is necessary to collect more data over longer time periods to further capture the time-dependent characteristics of accident data, since the current data size is limited. Furthermore, the assumptions we made, and the models used, are based on the nature and distribution of traffic accidents. For domains where the underlying patterns, distributions, or influencing factors differ significantly, these assumptions and models might not hold. Moreover, the computational requirements of our approach, especially the handling and processing of spatial data, might not be suitable for real-time or resource-constrained applications. In addition, we faced constraints in gathering comprehensive data on geography and weather conditions. Therefore, the proposed Accident Risk Index (ARI) calculation in this experiment did not directly incorporate these factors. Finally, we will also consider employing feature selection techniques to further refine our model, optimizing the inclusion of relevant predictors and potentially enhancing overall predictive performance.
Overall, the proposed system offers a comprehensive approach to analyzing and addressing road safety issues. By integrating various data sources and advanced deep learning techniques, it shows potential for use in tasks ranging from accident prediction to risk assessment. The system’s outcomes advance the field of road safety by informing decision-making processes, prioritizing interventions, and implementing effective measures to improve road safety in urban areas. In addition, the proposed system can help commuters, traffic management systems, and city planners make safer and more optimized decisions.
  5. Conclusions
This study introduces a comprehensive system to predict traffic accidents and estimate risk levels in the urban area. The system takes into account various data sources, including traffic accident data, urban mobility data, and road safety facility information. It also exploits deep learning methods, which are efficient at extracting valuable information from big data. The core of the proposed system is the use of a grid-represented map as input for a CNN-based accident detector and of hierarchical DNNs to estimate multiple levels of risk. Specifically, the grid representation of the input can effectively reduce the number of zero values in the input data and is efficient in CNN-based model training. In addition, the hierarchical DNNs can reduce the complexity of the multi-class classification task and improve the overall performance. We also propose the Accident Risk Index (ARI) to clearly measure the severity of risk at each cell.
In our experiment, we evaluate the performance of each component of the proposed system. The system outperforms other models in predicting traffic accidents and demonstrates high accuracy and effectiveness in risk estimation, especially with the multiple binary classification approach. Furthermore, we validate the feasibility and applicability of the proposed system by applying it to actual data from Daejeon City, Republic of Korea. The proposed system can provide valuable insights into the risk distribution across the urban area and facilitate targeted interventions.
Overall, the proposed system offers a novel and comprehensive approach to enhancing road safety in urban areas. It not only serves as a predictive tool, but can also be adapted into a recommendation system that assists urban planners and authorities in implementing preventative measures efficiently. By integrating diverse data sources and utilizing advanced modeling techniques, the proposed system can facilitate the identification of high-risk zones and suggest targeted interventions based on analyzed patterns and trends. This makes it a valuable asset for decision-makers and stakeholders to prioritize and implement strategies that focus on preventing accidents and reducing their severity when they occur. Furthermore, the system can recommend improvements in infrastructure and changes in traffic regulations, guided by insights drawn from real-time and historical data. These insights, in turn, support the creation of safer urban environments, guiding not only immediate responses but also the planning and development of long-term road safety strategies. The findings of this research contribute to the field of road safety, paving the way for further advances in accident prediction, risk assessment, and the formulation of more informed, data-driven road safety strategies. In addition, the proposed system can potentially be connected with real-world applications such as navigation and traffic management systems to actively recommend safer routes to road users, thus serving a preventive role.