A Systematic Literature Review of Learning-Based Traffic Accident Prediction Models Based on Heterogeneous Sources

Abstract: Statistics affirm that almost half of all deaths in traffic accidents involve vulnerable road users, such as pedestrians, cyclists, and motorcyclists. Despite the efforts invested in technological infrastructure and traffic policies, the number of victims remains high and beyond expectations. Recent research establishes that determining the causes of traffic accidents is not an easy task because their occurrence depends on one or many factors. Traffic accidents can be caused by, for instance, mechanical problems, adverse weather conditions, mental and physical fatigue, negligence, and potholes in the road, among others. At present, the use of learning-based prediction models as mechanisms to reduce the number of traffic accidents is a reality. The success of these prediction models depends mainly on how data from different sources can be integrated and correlated. This study reports the models, algorithms, data sources, attributes, data collection services, driving simulators, evaluation metrics, percentages of data for training/validation/testing, and other characteristics found in the literature. We found that the performance of a prediction model depends mainly on the quality of its data and a proper data split configuration. The use of real data predominates over data generated by simulators. This work made it possible to determine that future research must point to developing traffic accident prediction models that use deep learning. It must also focus on exploring and using data sources, such as driver data and light conditions, and on solving issues related to this type of solution, such as high dimensionality and imbalance in the data.


Introduction
The World Health Organization (WHO), through the Global Status Report on Road Safety (GSRRS) 2018, affirms that the number of deaths from road traffic-related issues reached 1.35 million people in 2016 [1]. Meanwhile, the Pan American Health Organization (PAHO) [2] affirms that traffic accidents were the second cause of death among young adults (15-29 years old) in 2016. Most concerning, however, is that 47% of all people who died in traffic accidents were vulnerable road users, such as motorcyclists, cyclists, and pedestrians.
The implementation of technological infrastructure and the adoption of strict traffic policies have significantly reduced the accident rate. However, the number of victims is still high and beyond expectations. This situation partly happens because it is complex to determine the real causes of traffic accidents. In most cases, their occurrence depends on one or many of the following factors: mechanical problems, adverse weather conditions, mental and physical fatigue, negligence, potholes in the road, among others.
At present, the use of prediction models as mechanisms to mitigate mortality in traffic accidents is a reality. The results of these models are helping policymakers, transportation safety designers, and researchers to identify factors and make recommendations that lead to significant achievements in terms of the accident rate [3,4]. Some studies are being funded by institutions or companies related to transportation, as in [4][5][6][7][8][9]. The more a prediction model can correlate information from heterogeneous sources, the better it might infer accidents. However, this solution also brings along some issues to be resolved, such as the high dimensionality of the data, information imbalance, and the poor handling of large-scale datasets. Thus, the strategy to improve the prediction models must focus on exploring other data sources to correlate and on finding ways to resolve the issues related to this solution.
Since the models are generally fed with real data, the authors have resorted to government platforms and Internet services to collect data. The information from Internet services can be integrated into the prediction model to establish real-time information channels and improve their accuracy. However, this approach is not always feasible because the values and metrics of the different sources are not entirely comparable. In fact, there is much diversity in experimental design, acquisition protocol, equipment used, and data volume. For these reasons, it is important to highlight the current state of the development of learning-based traffic accident predictions and determine the main research challenges on this topic. This paper presents a systematic literature review on learning-based traffic accident prediction models based on heterogeneous data sources. To elaborate on this review, we used the general guidelines proposed by Kitchenham's methodology [10,11]. The research questions and search strategy focused on identifying the most relevant features that influence the accuracy and performance of accident prediction models. With this analysis in place, our purpose is to respond to these concerns: How do human factors influence the occurrence of traffic accidents? How does the number of features used in a model affect its performance? How can information from different data sources be correlated? What are the solutions for the challenges that real-time prediction models face? What type of algorithms are best suited for traffic accident prediction models? Moreover, can the best model be determined using only the evaluation metrics? For this purpose, we study the different platforms, services, and simulators used to collect data related to traffic and driver behavior. 
Regarding the survey of traffic accident prediction models, our work includes a comparative study of models, selection algorithms, evaluation metrics, and the percentage of data used for training/validation/testing. Furthermore, the performance obtained by each model is registered, scored, and analyzed. Following this survey, we aim to find open challenges and research niches in the early prediction of traffic accidents to reduce the death of drivers and passengers.
This article is organized as follows: Section 2 presents the methodology used to elaborate this literature review, followed by Section 3, which introduces the answers to all research questions. Section 4 discusses the most relevant thoughts about learning-based accident prediction. Finally, the conclusions of this literature review are presented in Section 5.

Materials and Methods
The current study was performed using the guide for systematic reviews proposed by Kitchenham and others [10,11]. For this study, we have considered the following phases and activities: Planning the Review (Research Questions), Conducting the Review (Search Strategy, Study Selection, Study Quality Assessment, and Data Extraction), and Reporting the Review (Results).

Planning the Review Research Questions
In this stage, we present seven research questions developed based on the goals of our research. The bibliographic databases and journal platforms used in this review were: Scopus, ACM Digital Library, IEEExplore, Springer Link, and Google Scholar. According to [12], Scopus and Web of Science provide a better quality of indexing and bibliographic records, at least in the computer science field. IEEExplore was picked out because it focuses exclusively on computer science, engineering, and electronics and is considered one of the largest collections of technical literature worldwide. ACM Digital Library covers the area of computing and information technology. Finally, Springer Link was picked out because it contains many peer-reviewed journals and provides full-text access.
Based on the research questions presented, we extracted the following keywords: real-time, traffic accident prediction, learning, heterogeneous, data source, learning technique, algorithm, and evaluation metric. We added "predicting" and "forecast" to the keyword list as synonyms for prediction. We also developed a list of search strings combining the extracted keywords with the operators "AND" and "OR." We established three search strings (SS01, SS02, and SS03). SS01 is the longest and most specific because it includes all the keywords and synonyms. SS02 does not include the keyword "real-time" from SS01, and SS03, which is the least specific, does not include the keyword "heterogeneous" from SS02. This strategy implies that the results returned by each database or platform contain duplicate items. Table 1 presents the search strings developed for this study and the search results (searches run on 1 April 2021):

SS01: real-time AND "traffic accident*" AND (predicti* OR forecast*) AND learning AND heterogeneous AND "data source*"
SS02: "traffic accident*" AND (predicti* OR forecast*) AND learning AND heterogeneous AND "data source*"
SS03: "traffic accident*" AND (predicti* OR forecast*) AND learning AND "data source*"

Study Quality Assessment
In this stage, we defined the assessment questions used in the quality instrument. Additionally, we established two or three possible answers for each question and their scores. Thus, the answer "no" is rated with 0, and "yes" is rated with 0.5 or 1.0 depending on the condition. We present the assessment questions and a short justification for each of them as follows.
The best way to evaluate a model is through the analysis of its evaluation metrics. Since some metrics are more robust and useful than others, having many of them helps to improve the model and its performance.

AQ01. Does the study present evaluation metrics?
If the number of metrics = 1, the value is 0.5; If the number of metrics > 1, the value is 1.0.
Determining the real causes of traffic accidents is complex because they depend on many factors. Thus, the success of such a prediction model lies in correlating different data sources. AQ02. Does the prediction model correlate information from different data sources?
If the number of data sources = 1, the value is 0.5; If the number of data sources > 1, the value is 1.0. Proposing a prediction model by choosing one algorithm and calculating a metric is somewhat imprecise. This process requires an analysis of the model with several baseline algorithms to identify the best one based on indicators and metric values. AQ03. Does the prediction model use different automatic learning algorithms?
If 0 < the number of algorithms ≤ 2, the value is 0.5; If the number of algorithms > 2, the value is 1.0.
In general, the prediction models have to deal with high dimensionality and imbalance in information, poor handling of large-scale datasets, or insufficient capacities to process and analyze information. Our study also needs to identify the challenges faced by traffic accident prediction models. AQ04. Does the study present challenges that the prediction models must face?
If the study presents any challenge, the value is 1.0.
The correct handling of missing and out-of-range data will prevent the occurrence of a bias that invalidates the study. The following studies include missing data treatment in their proposals [13][14][15][16][17]. AQ05. Does the study include missing data treatment?
If the study includes any missing data treatment, the value is 1.0.
We established, as a selection criterion, that only if the sum of all five questions is greater than or equal to the value defined as the boundary for the first quartile, then the primary study is accepted; otherwise, it is rejected. This value corresponds to 2.5. The research community has widely accepted this selection criterion [11,18]. Table A1 presents the quality instrument and its results, and Figure 1 presents the phase of Conducting the Review. As observed, 1923 articles were found after performing the search strategy activity. Then, 778 duplicate articles were removed, giving a total of 1145 articles. Once the inclusion and exclusion criteria were applied, 1123 articles were excluded, giving a total of 22 articles. After performing the snowballing technique, 20 articles were added, giving a total of 42 articles. Finally, eight articles were rejected because they did not fulfill the quality criterion. Thus, the number of selected primary studies reached 34 papers. Table 2 presents the primary studies that were selected.
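As a minimal illustration of the scoring scheme above, the AQ01-AQ05 rules and the acceptance threshold (2.5, the first-quartile boundary) can be sketched as follows; the example study values passed in at the end are hypothetical, and only the scoring rules come from the text.

```python
def score_study(n_metrics, n_sources, n_algorithms, has_challenges, has_missing_treatment):
    """Return the total quality score for one primary study (AQ01-AQ05)."""
    score = 0.0
    # AQ01: evaluation metrics presented
    if n_metrics == 1:
        score += 0.5
    elif n_metrics > 1:
        score += 1.0
    # AQ02: data sources correlated
    if n_sources == 1:
        score += 0.5
    elif n_sources > 1:
        score += 1.0
    # AQ03: learning algorithms compared
    if 0 < n_algorithms <= 2:
        score += 0.5
    elif n_algorithms > 2:
        score += 1.0
    # AQ04: challenges discussed
    if has_challenges:
        score += 1.0
    # AQ05: missing-data treatment included
    if has_missing_treatment:
        score += 1.0
    return score

THRESHOLD = 2.5  # first-quartile boundary used as the acceptance criterion

def is_accepted(score):
    return score >= THRESHOLD

# Hypothetical study: 3 metrics, 2 data sources, 1 algorithm, challenges, no missing-data treatment
print(score_study(3, 2, 1, True, False))  # 3.5 -> accepted
```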

Data Extraction
We designed four data collection forms to record the information of the selected primary studies. The data collection forms proposed for this section are shown in Tables 3, 4, A2 and A3. Their design was based on addressing the research questions. Thus, Table A2 was designed to answer RQ01, Table A3 to answer RQ02, Table 3 to answer RQ04, and Table 4 to answer RQ05, RQ06, and RQ07. Table A2 includes the primary study ID, the data sources (vehicle data, driver's data, weather and light conditions, traffic accidents, traffic flow, traffic events, road infrastructure, taxi trips, points of interest, and others), two categories to refer to the data type, and a list of variables or features of each data source. Table A3 includes the primary study ID and the datasets, services, or simulators. Table 3 includes the primary study ID, the algorithm or algorithms used in the model, and the groups to which they belong [47,48]. Finally, Table 4 includes the primary study ID, some evaluation metrics, the percentages of data used for training, validation, and testing, and the algorithms used by the models to compare their performance. The generated data are presented in the "Results" section and analyzed and interpreted in the "Discussion".

Study Overview
Considering the year and the type of publication (Table 2), of the 34 selected studies, 19 are journal articles and 15 are conference papers. The years in which the most papers were published were 2015, 2018, and 2019. The answers to our research questions are presented as follows.

RQ01
The prediction models use the following data sources: vehicle data, driver's data, weather conditions, light conditions, traffic accidents, traffic flow, traffic events, road infrastructure, taxi trips, points of interest, and population. The most common data sources are weather conditions, traffic accidents, traffic flow, and road infrastructure. Meanwhile, driver's data, light conditions, and taxi trips are the least common. Based on Table A2, the attributes contained in each data source are presented as follows.

RQ02
The prediction models are fed with data collected from open and government platforms, others from Internet services, and even others with simulators' data. According to Table A3, the platforms, Internet services, and simulators used by the models to collect data are presented as follows.

RQ03
Considering that "no model is perfect", the prediction models present at least some of the following shortcomings.
• Non-inclusion of spatial heterogeneity within the zones of study;
• Information imbalance (the amount of useless data is greater than that of useful data) because most data are non-accident related;
• Insufficient capacities to process and analyze an enormous amount of data;
• Poor handling of large-scale datasets. It is not practical to work with huge amounts of raw data; therefore, it is necessary to select relevant features to be extracted. If this selection is not made adequately, the generated models will not work correctly;
• Not having enough related information to train and test the models (e.g., it is essential to have information about traffic accidents and normal traffic conditions from the same road segment).

RQ04
The most common algorithms among the prediction models, in order of occurrence, are Neural Networks (Long Short-Term Memory NN, Convolutional NN, Deep NN, and Feed Forward NN), Support Vector Machine, and Bayesian Networks. According to Figure 2, 30% of the prediction models use some variant of Neural Networks, 15% of them use Support Vector Machine, and 12% use Bayesian Networks. Regarding the ranking and selection of variables/features, the most common algorithm is Random Forest. The categories to which those algorithms belong are Neural Networks, Classification, and Ensemble. Finally, the most common algorithms used by models to compare their performance are Logistic Regression, Support Vector Machine, Decision Tree, and some variants of Neural Networks. Their categories are Classification and Neural Networks.

RQ05
Of all the evaluation metrics reported by the primary studies, the most commonly used are:
• For classification: PAR, TPR, and F1 Score;
• For regression: RMSE and MAE.
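For reference, the classification and regression metrics named above can be sketched with their standard, dependency-free definitions (binary labels, 1 = crash, 0 = non-crash); this is an illustrative sketch, not code from any reviewed study.

```python
from math import sqrt

def tpr(y_true, y_pred):
    """True Positive Rate (sensitivity): TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def f1(y_true, y_pred):
    """F1 Score: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def rmse(y_true, y_pred):
    """Root Mean-Square Error for regression outputs."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error for regression outputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```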

RQ06
The prediction models obtained the results presented as follows. Figure 3 shows the dispersion of the values of the evaluation metrics. These value ranges could be seen as a reference for new models that use these evaluation metrics. By contrast, the values of the rest of the metrics are so dispersed that it is not possible to identify groups of values to serve as references.

RQ07
Most models use data only for training and testing; however, a few models also use data for validation. There are even models in which those percentages are variable and defined dynamically. The most common split configuration among the proposals is 80% for training and 20% for testing.

Discussion
Below, we mention some of the thoughts presented in the articles to analyze and consider for future research. For instance, traffic accidents are not fortuitous events but events caused by conditions that occur in space and time and under certain circumstances [30]. According to [32], unfavorable traffic characteristics, adverse weather conditions, and driver distraction may lead to a crash. Additionally, the most significant factors in crash severity are vehicle failures, not wearing the seat belt, and unfavorable weather conditions [38]. Meanwhile, others assert that driving drunk and at high speed are serious factors in traffic accidents [45], and wet pavement is one condition that increases the accident rate significantly [8]. Finally, the situation that causes the highest probability of suffering a traffic accident is aggressive driving behavior after unusual congestion to recover the time lost [6]. For their part, the authors of [25] determined the following: high speed is one of the most recurrent causes among fatal vehicle crashes; traffic during the morning peak and on the first days of the week increases the risk of property-damage-only crashes; slopes and proximity to curves are the main road geometry factors that lead to fatal crashes; high speed and proximity to curves are the main causes of fatal-injury crashes; faulty windshield wipers in rainy weather conditions and not wearing the seat belt among young people are the most important causes of injury crashes; and, finally, driving at night without caution during rainy weather conditions increases the risk of property-damage-only crashes.
Regarding performance, models based on Deep Neural Networks reduce their accuracy, precision, and F1 score as the size of the learning data increases [37]. Additionally, the performance of a Support Vector Machine model depends on the learning process, so future efforts must focus on tuning the scale of parameter values and on kernel function selection [42]. Finally, the authors of [26] assert that the performance of the prediction models decreases as the spatio-temporal resolution of the prediction task increases. Regarding features, incorporating more features into the model does not always improve its performance [44]. Meanwhile, ref. [45] asserts that a smaller number of features affects the performance of a neural network. Finally, according to [37], removing features from models based on Decision Tree or Random Forest has an enormous impact, but only a slight one on models based on Deep Neural Networks.
Some authors propose recommendations; for instance, splitting data into pieces to send them to compute nodes can make the computational time much lower, which would benefit the handling of social media data [24]. For their part, ref. [19] suggests that the threshold used to separate different states (crash/non-crash) must balance the values of the True Positive Rate and False Positive Rate, and also that the optimal threshold may be found by comparing the performance of different thresholds. Finally, the authors of [5] propose that the outcomes of a real-time traffic accident prediction model be shown through a variable message sign or transmitted between vehicles using a connected vehicular system.
Despite the advantages that simulators offer at present, this mechanism of data generation has not been adopted as expected. In fact, there is a clear trend in prediction models toward using real data instead of simulated data. From the results, we could remark that only 1 out of 10 models uses data generated by simulators. Because traffic accidents are events caused by a group of conditions that are not always the same and take place in space and time and under certain circumstances, it can be suspected that the authors prefer, for data generation, scenarios less controlled than those provided by simulators. Moreover, it was noted that there are both static and variable data. Static data, such as most driver data, road infrastructure, points of interest, or satellite images, could be used to build a base model. In contrast, data that vary over time, such as traffic accidents, weather conditions, or traffic flow, could be used to adjust the model.
The human factor is the leading cause of traffic accidents [49,50], and the most common human factor (contributing or principal) is inattention while driving because of attention overload, distraction, or monotonous driving [51]. According to [46], young people are more susceptible than adults to suffering a traffic accident; male drivers are involved in more traffic incidents than female drivers, and female drivers are more susceptible than male drivers to suffering severe injuries. It is clear that the human factor influences and plays an essential role in the occurrence and severity of traffic accidents. This affirmation is confirmed in the Global Status Report on Road Safety, which establishes that factors associated with road user behavior, such as speeding and drink-driving, are two of the key risk factors to be considered and reinforced within the legislation of countries to prevent deaths and injuries due to traffic accidents. Some countries, especially high-income ones, have reduced the number of deaths and injuries by adopting policies for all the key risk factors [1]. Although much has improved in the prevention of traffic accidents, it is clear that the focus must now move to the prediction of traffic accidents. In the context of our research, we noticed that very few models use driver's data, although the human factor is one of the leading causes of traffic accidents. We believe that this may be due to the non-availability of this type of information.
Considering that the prediction models are generally fed with real data, the authors have resorted primarily to governmental institutions related to transportation or related areas and, secondly, to Internet services. The information collected from government platforms is mainly related to traffic accidents, traffic flow, and road infrastructure. Internet services provide information mainly related to weather and light conditions and traffic events. Most Internet services (MapQuest Traffic, Microsoft Bing Map Traffic, or Twitter, among others) provide APIs that can be integrated into the model to establish real-time or deferred information channels.
One of the most challenging issues for traffic accident prediction models is to provide a real-time solution. According to some authors [4,9,33,37], the development of a real-time decision-making tool to avoid traffic accidents is completely viable as soon as shortcomings such as the non-integration of spatial heterogeneity, the incorrect handling of large-scale datasets, the improper handling of unique data properties, the information imbalance, and the lack of related information are resolved. The correct handling of large-scale datasets requires feature extraction and imbalance correction. First of all, it is not practical to work with a huge amount of raw data; therefore, to handle large datasets adequately, it is necessary to extract essential features such as weather, type of environment (for instance, rural highway vs. urban street), road conditions, speed limit, type of traffic, driver data, and type of vehicle [26]. Additionally, accident-related data are less frequent than non-accident-related information. Therefore, the datasets are imbalanced, and a prediction model has to be built to correct this situation [32].
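As an illustrative sketch (not taken from any of the reviewed studies), the imbalance correction mentioned above could be performed by randomly undersampling the majority (non-accident) class until it matches the minority (accident) class; the records and the `accident` label key are synthetic placeholders.

```python
import random

def undersample(records, label_key="accident", seed=42):
    """Balance a binary dataset by undersampling the majority class."""
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    majority, minority = (
        (negatives, positives) if len(negatives) > len(positives) else (positives, negatives)
    )
    rng = random.Random(seed)  # seeded for reproducibility
    kept = rng.sample(majority, len(minority))  # keep as many majority records as minority ones
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# 2 accident records vs. 8 non-accident records -> 2 of each after balancing
data = [{"accident": 1}] * 2 + [{"accident": 0}] * 8
balanced = undersample(data)
print(len(balanced))  # 4
```

Undersampling discards data; class weighting or synthetic oversampling are common alternatives when the minority class is very small.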
It is also essential to consider which characteristics are time-sensitive, which are time-insensitive, and which are related to spatial heterogeneity. Time-insensitive data are fully connected, and spatial heterogeneity is a trainable component. Starting with a common base that considers this type of feature differentiation, it would be possible to obtain a somewhat generalized solution trained for different scenarios [52]. These data-handling strategies could make it possible to obtain real-time prediction, which is the next big challenge for this research area.
The high-dimensionality problem may be solved using data processing techniques to derive relevant features through methods, such as clustering, chi-square, Minimum-Redundancy-Maximum-Relevance (mRMR), and predictor importance, among others. Some authors, such as [28], have worked on this strategy for dimensionality reduction using clustering, but other pre-processing techniques could also be tested.
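As one possible instance of these techniques, a chi-square filter can score each categorical feature against the crash/non-crash label and keep only the top-ranked features; the sketch below is a generic illustration, not the pre-processing of any specific study.

```python
from collections import Counter

def chi_square_score(feature, labels):
    """Chi-square statistic between a categorical feature and a class label."""
    n = len(labels)
    obs = Counter(zip(feature, labels))      # observed counts per (value, label) cell
    f_counts = Counter(feature)
    l_counts = Counter(labels)
    score = 0.0
    for f in f_counts:
        for l in l_counts:
            expected = f_counts[f] * l_counts[l] / n  # count expected under independence
            observed = obs.get((f, l), 0)
            score += (observed - expected) ** 2 / expected
    return score

def select_top_k(features, labels, k):
    """Keep the k features most dependent on the label (dimensionality reduction)."""
    ranked = sorted(features.items(),
                    key=lambda kv: chi_square_score(kv[1], labels),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

A feature perfectly aligned with the label scores highest, while a feature independent of the label scores zero, so ranking by this statistic filters out uninformative dimensions.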
Regarding algorithms, we were able to identify two stages for which machine learning algorithms were assigned. The pre-processing stage includes the tasks of ranking and selecting features, while the classification stage includes the selection of the model. For pre-processing, the most common algorithm is Random Forest; and, for classification, the most common algorithms are some variants of Neural Networks (Long Short-Term Memory NN, Convolutional NN, Deep NN, and Feed Forward NN). This algorithm selection is consistent with the fact that deep learning models applied in the area of Traffic Accident Prediction are becoming more popular. Most authors use shallow learning algorithms as baseline algorithms to compare the performance of their models based on neural networks. This tendency marks a path for research in learning-based accident prediction.
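The pre-processing stage described above, ranking features with Random Forest, can be sketched as follows; this is a hedged illustration in which the feature names ("speeding", "day_of_week") and the synthetic data are assumptions, and scikit-learn stands in for whatever implementation each study actually used.

```python
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
# Synthetic samples: "speeding" is made 90% predictive of a crash, "day_of_week" is noise.
X, y = [], []
for _ in range(200):
    speeding = random.randint(0, 1)
    day_of_week = random.randint(0, 6)
    crash = speeding if random.random() < 0.9 else 1 - speeding
    X.append([speeding, day_of_week])
    y.append(crash)

# Fit the forest and rank features by impurity-based importance.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = sorted(zip(["speeding", "day_of_week"], model.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
print(ranking[0][0])  # the predictive feature ranks first
```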
The metrics most commonly used for classification problems are accuracy, sensitivity, and F1 Score; meanwhile, for regression problems, they are Mean Absolute Error and Root Mean-Square Error. However, there is such diversity in experimental design, data volume, and structure among the various studies that it is difficult to compare results simply by using evaluation metrics, not to mention that some proposals present non-normalized values for their evaluation metrics. The datasets are typically imbalanced, and performance must be understood in a contextualized way. Therefore, comparing models to find the one with the best performance is not necessarily meaningful because the results are not completely comparable among the studies.
Although there is no precise rule to split data for training, validation, and testing, a tacit agreement establishes an approximate data split configuration. From the analysis, we could establish that a higher percentage (more than 50%) of the data is used for training and a lower percentage (less than 50%) for testing. The most common data split configuration among the proposals is 80% for training and 20% for testing. Some models even set aside a low percentage of data for validation. It was noted that there is no evidence or justification for splitting data in one way or another, or on whether a given data split configuration could improve the performance of the models. Because of this drawback, we suggest using data splitting methods (e.g., SPlit [53]) instead of splitting randomly to obtain the optimal configuration.
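The common 80/20 configuration, with an optional validation share, can be sketched with a seeded random split as below; note that SPlit [53] itself is a distinct, non-random method, so this sketch only illustrates the percentage configuration, and the function name and shares are assumptions.

```python
import random

def split_data(records, train=0.8, validation=0.0, seed=7):
    """Shuffle and split records into training, validation, and testing sets."""
    assert train + validation < 1.0, "the testing share must be non-empty"
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],                      # training set
            shuffled[n_train:n_train + n_val],       # validation set (may be empty)
            shuffled[n_train + n_val:])              # testing set

train_set, val_set, test_set = split_data(list(range(100)), train=0.8)
print(len(train_set), len(val_set), len(test_set))  # 80 0 20
```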

Conclusions
The elaboration of this work has made it possible to present a review of the research done so far on learning-based traffic accident prediction. Some of the most important points to be considered are as follows.
The development of real-time prediction models is viable as soon as issues such as the efficient use of large-scale datasets, the integration of spatial heterogeneity, and the high dimensionality of the data are resolved. In this context, some solutions for these issues are as follows: the efficient handling of large-scale datasets may be achieved through feature extraction and imbalance correction; meanwhile, the high dimensionality of the data may be addressed using data processing techniques.
There is a trend toward using real data generated in less controlled scenarios (as in real life) instead of data generated by simulators. Thus, the authors have opted to correlate real data, usually collected from open and government platforms, with information from Internet services. Additionally, through APIs, real-time or deferred information channels may be integrated into the model. The performance of a prediction model depends largely on the quality of the data and the set of algorithms, among other aspects, but it also depends on the data split configuration. Despite not having a specific and exact mechanism, it is fundamental to count on a strategy to establish the correct percentages of data for training, validation, and testing. Using splitting methods instead of splitting randomly to obtain the optimal configuration may be an option.
Future research must point to developing prediction models using deep learning (a combination of supervised and unsupervised learning techniques) and must focus on using data sources little used in traffic accident prediction (driver's data and pedestrian mobility).

Funding: This research was funded by Escuela Politécnica Nacional, grant number PIS 20-02 (Emergent System based on acquisition, processing, and response agents for management of vehicle accident rate using artificial intelligence techniques).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
