A Bibliometric Analysis and Benchmark of Machine Learning and AutoML in Crash Severity Prediction: The Case Study of Three Colombian Cities
Abstract
1. Introduction
- A bibliometric study of ML in CSP. This systematic approach makes it possible to identify relevant authors in the field, trending topics, keyword evolution, and the most common ML methods used in CSP. Although survey papers already exist in the specialized literature [4,5], we expand their scope by identifying and processing 2318 bibliographic references, which allow us to characterize the state of the art of ML and AutoML methods in CSP from a theoretical perspective.
- An extensive experimental framework of state-of-the-art ML approaches for CSP. This practical benchmark experimentally characterizes the competitiveness and the significance of diverse ML methods in multiple crash accident settings modeled as supervised learning problems (binary and multiclass classification). This benchmark also provides transportation practitioners and researchers with a framework that supports the use of ML and AutoML in a field where ML expertise is an asset that is not always affordable or available.
2. Background and Related Work
2.1. Automated Machine Learning
- AutoGluon: It starts by training its base classifiers in the usual manner. Then, a stacker model is trained using the aggregated predictions of the base models as its features. Here, the first layer has multiple base models, whose outputs are concatenated and then fed into the next layer, which consists of multiple stacker models. These stackers then act as base models for an additional layer. This process is repeated multiple times to obtain a multi-layer stacking approach that compensates for the shortcomings of the individual base predictions and takes advantage of the interactions between them. Finally, the last stacking layer applies ensemble selection to aggregate the stacker models’ predictions in a weighted manner.
- Auto-sklearn: In an offline phase, Bayesian optimization is used to determine an optimized ML pipeline with high performance for each dataset in a repository of 121 datasets. These pipelines are generated from a search space of 15 classifiers, 14 feature preprocessing methods, and four data preprocessing methods. Then, a set of 38 meta-features is extracted to characterize each dataset; these meta-features include simple, information-theoretic, and statistical information, such as the number of data points, the number of classes, and data skewness, among others. Instead of storing the 121 datasets themselves, their meta-features and ML pipelines are saved in a meta-knowledge base, where each instance contains the meta-features describing a dataset and the optimized pipeline that works well on it. In the online phase, that is, when a new dataset is given, Auto-sklearn computes its meta-features, ranks the datasets stored in the meta-knowledge base according to their distance to the new dataset, and selects the stored ML pipelines of the k nearest datasets. This selection of the k most promising pipelines is then used to seed the Bayesian optimization component as a warm start, which boosts the performance of the optimization. The Bayesian optimization process (under a time budget constraint) also generates and tests new pipeline structures from the same search space. Finally, the best pipelines identified during the Bayesian search process are used to construct an ensemble.
- TPOT: To combine these operators into an ML pipeline, TPOT treats them as genetic programming (GP) primitives and constructs GP trees from them. To automatically generate and optimize these tree-based pipelines, TPOT uses a genetic algorithm that follows a standard GP process. First, the GP algorithm generates 100 random tree-based pipelines and evaluates their balanced cross-validation accuracy on the dataset. For every generation of the GP algorithm, TPOT selects the top 20 pipelines in the population, trying to maximize classification accuracy and minimize the number of operators in the pipeline at the same time. Each of the top 20 selected pipelines produces five copies (i.e., offspring) in the next generation’s population, and 5% of these offspring are crossed with another offspring using one-point crossover. Then, 90% of the remaining unaffected offspring are randomly changed using a point, insert, or shrink mutation. In every generation, the algorithm updates a Pareto front of the non-dominated solutions discovered at any point in the GP run. The algorithm repeats this evaluate-select-crossover-mutate process for 100 generations to improve classification accuracy. If the user does not define an execution time for TPOT, the method selects the pipeline with the highest accuracy from the Pareto front as the best representative pipeline of the whole optimization process.
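The generational loop described for TPOT can be illustrated with a minimal, standard-library sketch. It is not TPOT's implementation: pipelines are reduced to flat tuples of hypothetical operator names, `toy_score` is a placeholder standing in for balanced cross-validation accuracy, and selection is a plain sort on (score, length) rather than the multi-objective selection TPOT actually performs. Only the population size, the number of selected pipelines, the number of copies, and the crossover and mutation rates are taken from the description above.

```python
import random

# Hypothetical operator names; real TPOT primitives are scikit-learn components.
OPERATORS = ["scaler", "pca", "select_k_best",
             "decision_tree", "random_forest", "logistic_regression"]


def toy_score(pipeline):
    """Placeholder for cross-validated accuracy (assumption, not a real evaluation)."""
    return len(set(pipeline)) / len(OPERATORS)


def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one."""
    sa, sb = toy_score(a), toy_score(b)
    return sa >= sb and len(a) <= len(b) and (sa > sb or len(a) < len(b))


def update_pareto_front(front, population):
    """Keep every pipeline seen so far that no other pipeline dominates."""
    candidates = sorted(set(front) | set(population))
    return [p for p in candidates if not any(dominates(q, p) for q in candidates)]


def one_point_crossover(a, b, rng):
    """Join the head of a with the tail of b; the child keeps at least one operator."""
    cut_a = rng.randrange(1, len(a) + 1)
    cut_b = rng.randrange(0, len(b) + 1)
    return a[:cut_a] + b[cut_b:]


def mutate(pipeline, rng):
    """Apply one randomly chosen point, insert, or shrink mutation."""
    ops = list(pipeline)
    kind = rng.choice(["point", "insert", "shrink"])
    if kind == "point":
        ops[rng.randrange(len(ops))] = rng.choice(OPERATORS)
    elif kind == "insert":
        ops.insert(rng.randrange(len(ops) + 1), rng.choice(OPERATORS))
    elif len(ops) > 1:  # shrink, but never below one operator
        del ops[rng.randrange(len(ops))]
    return tuple(ops)


def evolve(generations=100, pop_size=100, n_selected=20, copies=5, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice(OPERATORS) for _ in range(rng.randint(1, 4)))
                  for _ in range(pop_size)]
    front = update_pareto_front([], population)
    for _ in range(generations):
        # Simplified selection: higher score first, then fewer operators.
        ranked = sorted(population, key=lambda p: (-toy_score(p), len(p)))
        offspring = [p for p in ranked[:n_selected] for _ in range(copies)]
        rng.shuffle(offspring)
        n_cross = int(0.05 * len(offspring))          # 5% undergo crossover
        crossed = [one_point_crossover(p, rng.choice(offspring), rng)
                   for p in offspring[:n_cross]]
        rest = offspring[n_cross:]
        n_mut = int(0.90 * len(rest))                 # 90% of the rest are mutated
        mutated = [mutate(p, rng) for p in rest[:n_mut]]
        population = crossed + mutated + rest[n_mut:]
        front = update_pareto_front(front, population)
    best = max(front, key=toy_score)                  # highest-scoring front member
    return population, front, best


population, front, best = evolve(generations=10, seed=1)
print(len(population), best)
```

With 20 selected pipelines and five copies each, the population stays at 100 individuals in every generation, which matches the figures quoted above.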
2.2. Related Work
3. Bibliometric Analysis of Machine Learning Approaches in Crash Severity Prediction
3.1. Conducting Research
3.2. Selection
3.3. Data Analysis
- Time trend of publications: Since 2014 there has been a growing trend of publications related to ML, AutoML, crash severity and road accidents, which is reflected in the number of annual publications. This shows the relevance and timeliness of these fields for conceptual, methodological, and practical applications (Figure 3). Moreover, in the transportation field, road accidents show a growing trend, and since 2014 the specific topic of the severity of collisions has been emerging rapidly. This is supported by the fact that in 2020 there was significant scientific production in this regard, which was mainly focused on case analyses of cities using various modeling tools, simulation, and machine learning approaches. When analyzing the three topics selected for this paper, we can highlight that the average number of citations per document is 18.51 for CSP, 9.66 for road accidents, and 10.31 for AutoML. Together with the increasing trend in the number of publications, this is indicative of the interest in crash severity issues. There has been significant scientific production concerning AutoML over the last 5 years; however, its application is concentrated in health, and there are few works on transportation. Furthermore, most of the applications in this field are oriented towards traffic prediction [10,11,38], transport modes [39], and autonomous vehicles [40]. Thus, no applications focused on CSP were found in the review.
- Relevant sources, documents and authors in the field: The most influential journals in the study topics are Accident Analysis and Prevention, IEEE Transactions on Intelligent Transportation Systems, and Transportation Research Record. All of these journals are focused on transportation and safety issues. On the other hand, IEEE Access stands out with a focus on applications, for example, computational applications in engineering, as shown in Table 2. The most influential authors (Table 3) include Abdel-Aty with his studies on accident and traffic modeling and injury severity [41,42], and Lee J and Mannering with their articles on impact and accident assessment [43,44]. In the topic of severity analysis by prediction with ML, the works of Li [45,46] and Zhang [47] are the most relevant papers. The top 10 most globally cited articles presented in Table 3 include the following: Sivaraman et al. [48] explain a new active learning approach to developing vehicle recognition and tracking systems. Martinez et al. [49] present a study on driving style characterization and recognition by reviewing various machine learning-oriented algorithms. Desjardins et al. [50] propose a new concept for the design of autonomous vehicle controllers based on advanced ML techniques. Zhu et al. [51] present a framework for big data analytics in Intelligent Transport Systems. Meiring et al. [52] analyze the applicability of ML and artificial intelligence algorithms to predict the driver’s behavior and driving style. Young et al. [53] explore the types of data used to develop, calibrate and validate CSP models. Finally, Ji et al. [54] explore the predictive potential of variables related to the collision mechanism using ensemble learning models.
- Evolution of keywords and topics: Analyzing keyword trends, Figure 4 shows that learning systems, especially ML, strongly influence applications directed to accident prevention policies. In recent years, terms related to the integrity of people and safety related to the severity of accidents have gained relevance. ML techniques are essential for modeling accident-related phenomena, mainly because of structured data available in open access repositories. Figure 4 also shows that new methods are emerging to address CSP, especially deep learning. This is due to the availability in recent years of open and unstructured data, mainly spatial, images or GPS data, which facilitate the use of these emerging methods.
- Machine Learning Methods: About 13 studies from the last four years were selected for analysis. In Table 4 we can observe that the most frequent methods for CSP are random forest (RF), used in 7 of the 13 studies, decision tree (DT) in 6 papers, support vector machine (SVM) in 5 articles, and Naive Bayesian (NB) in 4 studies. These methods are commonly used under the supervised learning paradigm, where CSP is modeled as a classification problem. On the other hand, when analyzing the ML approaches that show the best performance in the studies consulted, we identified the following methods: RF [56,57,58], SVM [59,60,61], Light-GBM [62], Gradient Boosting [63], AdaBoost [58], Multi-layer perceptron [64], Nearest Neighbor Classification [25], and the SimpleCart model [65]. These methods were commonly used for case studies located in the USA (Connecticut, Michigan, California, Florida, Nebraska), the United Kingdom, China (Kunshan), Korea (Seoul), India, and Ghana. Thus, these geographic areas contain the majority of the scientific production on these topics, as shown in Figure 5.
4. The Proposed Benchmark
4.1. Case Studies and Raw Data
4.2. Datasets
4.3. Experimental Set-Up
- AutoML and ML methods: The AutoML competitors are AutoGluon (Ag), Auto-Sklearn (As) and TPOT (Tp) with their default hyperparameter values. Ag and As were run with three execution times (15, 60, and 150 min), while Tp did not have an allocated time. Such an execution time corresponds to the time the methods take to find the best ML algorithm and its hyperparameter configuration for a given dataset. The assumption is that longer time budgets lead to better results. Therefore, such a progressive increase from 15 to 150 min should exemplify the expected behavior [70]. Additionally, every execution time assigned to a particular AutoML method is considered to be an individual AutoML competitor. As baseline methods, we used CatBoost (CatB), Decision Tree (DT), Extra Trees (ExtraT), Gradient Boosting (GB), Gaussian Naive Bayes (GnB), Light Gradient Boosting Machine (LGBM), Random Forest (RF), and a tuned Random Forest (tuned_RF). Moreover, it is relevant to note that we have not performed any optimization or tuning of the hyperparameters of the AutoML methods or the baseline methods. This is justified because we aim to compare the performance of AutoML versus the baselines using the same human effort for both, in order to carry out a fairer comparison.
- Performance metrics and Statistical tests: For the results of this paper, we followed the same experimental set-up proposed in the studies by Gijsbers et al. [71] and Angarita et al. [14]. They introduced earlier methodological guidelines to compare AutoML methods, and to compare these techniques with other ML methods in CSP, respectively. Specifically, the area under the receiver operating characteristic curve (ROC_AUC) is used for the binary classification problems considered in this benchmark. For the multiclass problem, we used the Log Loss function. In addition, the final score achieved by every method is the average. We made use of non-parametric statistical tests to assess the differences in method performance. Two statistical tests are used following the guidelines proposed in [72]. First, Friedman’s test for multiple comparisons is applied to check whether there are differences among the methods. Then, Holm’s test is used to check whether the differences in the Friedman ranking are statistically significant or not.
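The two-step statistical procedure above can be sketched in plain Python. The sketch computes average Friedman ranks and the Friedman chi-square statistic (without a tie correction and without a p-value), and then applies Holm's step-down rule to a list of p-values. The score matrix and the p-values are made-up placeholders, not results from this benchmark, and the function names are illustrative only.

```python
def average_ranks(scores, higher_is_better=True):
    """scores[d][m] is the score of method m on dataset d.

    Returns the mean rank of each method (rank 1 = best); tied methods
    share the average of the ranks they span.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: row[m],
                       reverse=higher_is_better)
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # average rank for the tied group
            for k in range(i, j + 1):
                totals[order[k]] += shared
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(ranks, n_datasets):
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    k = len(ranks)
    return (12 * n_datasets / (k * (k + 1))
            * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4))


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns reject decisions in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject


# Placeholder ROC_AUC-like scores: 4 datasets (rows) x 3 methods (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.91, 0.84, 0.84],
    [0.87, 0.89, 0.78],
]
ranks = average_ranks(scores)                 # [1.25, 1.875, 2.875]
stat = friedman_statistic(ranks, len(scores))  # 5.375
decisions = holm([0.01, 0.04, 0.03])           # [True, False, False]
print(ranks, stat, decisions)
```

For a metric where lower is better, such as Log Loss, `higher_is_better=False` ranks the smallest value first. In practice the p-values passed to `holm` would come from the post hoc pairwise comparisons rather than being typed in by hand.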
5. Results
6. Conclusions
6.1. Summary
6.2. Challenges and Research Opportunities
6.2.1. The Method Selection Problem: From AutoML towards Automated Deep Learning
6.2.2. Data Fusion and Real-Time Data to Enhance the Power of ML and AutoML in CSP
6.2.3. Explainability for a Better Crash Severity Understanding
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- World Health Organization. Road Traffic Injuries. Available online: www.who.int/news-room/fact-sheets/detail/road-traffic-injuries (accessed on 21 January 2021).
- United Nations. Road Safety Considerations in Support of the 2030 Agenda for Sustainable Development. Available online: https://unctad.org/system/files/official-document/dtltlb2017d4_en.pdf (accessed on 21 January 2021).
- Perallos, A.; Hernandez-Jayo, U.; Onieva, E.; García-Zuazola, I.J. Intelligent Transport Systems: Technologies and Applications, 1st ed.; Wiley Publishing: Hoboken, NJ, USA, 2015.
- Silva, P.B.; Andrade, M.; Ferreira, S. Machine learning applied to road safety modeling: A systematic literature review. J. Traffic Transp. Eng. (Engl. Ed.) 2020, 7, 775–790.
- Gutierrez-Osorio, C.; Pedraza, C. Modern data sources and techniques for analysis and forecast of road accidents: A review. J. Traffic Transp. Eng. (Engl. Ed.) 2020, 7, 432–446.
- Tang, J.; Zheng, L.; Han, C.; Yin, W.; Zhang, Y.; Zou, Y.; Huang, H. Statistical and machine-learning methods for clearance time prediction of road incidents: A methodology review. Anal. Methods Accid. Res. 2020, 27, 100123.
- Gajendran, C.; Vk, S.; Sg, S.; Swati, P. Different Methods of Accident Forecast Based on Real Data. J. Civ. Environ. Eng. 2015, 5, 1–5.
- Wolpert, D.H.; Macready, W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82.
- Hutter, F.; Kotthoff, L.; Vanschoren, J. (Eds.) Automated Machine Learning: Methods, Systems, Challenges; Springer: Berlin/Heidelberg, Germany, 2018.
- Angarita-Zapata, J.S.; Masegosa, A.D.; Triguero, I. General-Purpose Automated Machine Learning for Transportation: A Case Study of Auto-sklearn for Traffic Forecasting. In Information Processing and Management of Uncertainty in Knowledge-Based Systems; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 728–744.
- Angarita-Zapata, J.S.; Masegosa, A.D.; Triguero, I. Evaluating Automated Machine Learning on Supervised Regression Traffic Forecasting Problems. In Computational Intelligence in Emerging Technologies for Engineering Applications; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 187–204.
- Angarita-Zapata, J.S.; Triguero, I.; Masegosa, A.D. A Preliminary Study on Automatic Algorithm Selection for Short-Term Traffic Forecasting. In Intelligent Distributed Computing XII; Del Ser, J., Osaba, E., Bilbao, M.N., Sanchez-Medina, J.J., Vecchio, M., Yang, X.S., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 204–214.
- Vlahogianni, E.I. Optimization of traffic forecasting: Intelligent surrogate modeling. Transp. Res. Part C Emerg. Technol. 2015, 55, 14–23.
- Angarita-Zapata, J.S.; Maestre-Gongora, G.; Calderín, J.F. A Case Study of AutoML for Supervised Crash Severity Prediction. In Proceedings of the 19th World Congress of the International Fuzzy Systems Association (IFSA), the 12th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT), and the 11th International Summer School on Aggregation Operators (AGOP), Bratislava, Slovakia, 19–24 September 2021; Atlantis Press: Paris, France, 2021; pp. 187–194.
- Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv 2020, arXiv:2003.06505.
- Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; Hutter, F. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 2962–2970.
- Olson, R.S.; Bartley, N.; Urbanowicz, R.J.; Moore, J.H. Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, Denver, CO, USA, 20–24 July 2016; pp. 485–492.
- Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006.
- Song, H.; Triguero, I.; Özcan, E. A review on the self and dual interactions between machine learning and optimisation. Prog. Artif. Intell. 2019, 8, 1–23.
- Garcia, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: Cham, Switzerland, 2015.
- Triguero, I.; García-Gil, D.; Maillo, J.; Luengo, J.; García, S.; Herrera, F. Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1289.
- Zöller, M.A.; Huber, M.F. Survey on Automated Machine Learning. arXiv 2019, arXiv:1904.12054.
- Yao, Q.; Wang, M.; Chen, Y.; Dai, W.; Li, Y.; Tu, W.; Qiang, Y.; Yang, Y. Taking Human out of Learning Applications: A Survey on Automated Machine Learning. CoRR 2019.
- Kerschke, P.; Hoos, H.; Neumann, F.; Trautmann, H. Automated Algorithm Selection: Survey and Perspectives. CoRR 2018.
- Iranitalab, A.; Khattak, A. Comparison of four statistical and machine learning methods for crash severity prediction. Accid. Anal. Prev. 2017, 108, 27–36.
- Tang, J.; Liang, J.; Han, C.; Li, Z.; Huang, H. Crash injury severity analysis using a two-layer Stacking framework. Accid. Anal. Prev. 2019, 122, 226–238.
- Li, P.; Abdel-Aty, M.; Yuan, J. Real-time crash risk prediction on arterials based on LSTM-CNN. Accid. Anal. Prev. 2020, 135, 105371.
- Topuz, K.; Delen, D. A probabilistic Bayesian inference model to investigate injury severity in automobile crashes. Decis. Support Syst. 2021, 150, 113557.
- Gao, L.; Lu, P.; Ren, Y. A deep learning approach for imbalanced crash data in predicting highway-rail grade crossings accidents. Reliab. Eng. Syst. Saf. 2021, 216, 108019.
- Yang, Z.; Zhang, W.; Feng, J. Predicting multiple types of traffic accident severity with explanations: A multi-task deep learning framework. Saf. Sci. 2022, 146, 105522.
- Yu, B.; Bao, S.; Chen, Y.; LeBlanc, D.J. Effects of an integrated collision warning system on risk compensation behavior: An examination under naturalistic driving conditions. Accid. Anal. Prev. 2021, 163, 106450.
- Mannering, F.; Bhat, C.R.; Shankar, V.; Abdel-Aty, M. Big data, traditional data and the tradeoffs between prediction and causality in highway-safety analysis. Anal. Methods Accid. Res. 2020, 25, 100113.
- Wahyuni, H.; Vanany, I.; Ciptomulyono, U. Food safety and halal food in the supply chain: Review and bibliometric analysis. J. Ind. Eng. Manag. 2019, 12, 373.
- Aria, M.; Cuccurullo, C. bibliometrix: An R-tool for comprehensive science mapping analysis. J. Inf. 2017, 11, 959–975.
- Bhatt, Y.; Ghuman, K.; Dhir, A. Sustainable manufacturing. Bibliometrics and content analysis. J. Clean. Prod. 2020, 260, 120988.
- Klavans, R.; Boyack, K.W. Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge? J. Assoc. Inf. Sci. Technol. 2016, 68, 984–998.
- Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to conduct a bibliometric analysis: An overview and guidelines. J. Bus. Res. 2021, 133, 285–296.
- You, J. A Genetic Algorithm-based AutoML Approach for Large-scale Traffic Speed Prediction. In Proceedings of the 2020 IEEE 5th International Conference on Intelligent Transportation Engineering (ICITE), Beijing, China, 11–13 September 2020; IEEE: New York, NY, USA, 2020.
- de S. Soares, E.F.; Revoredo, K.; Baiao, F.; de M.S. Quintella, C.A.; Campos, C.A.V. A Combined Solution for Real-Time Travel Mode Detection and Trip Purpose Prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1–10.
- Shi, X.; Wong, Y.D.; Chai, C.; Li, M.Z.F. An Automated Machine Learning (AutoML) Method of Risk Prediction for Decision-Making of Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1–10.
- Abdel-Aty, M. Analysis of driver injury severity levels at multiple locations using ordered probit models. J. Saf. Res. 2003, 34, 597–603.
- Abdel-Aty, M.A.; Radwan, A. Modeling traffic accident occurrence and involvement. Accid. Anal. Prev. 2000, 32, 633–642.
- Lee, J.; Mannering, F. Impact of roadside features on the frequency and severity of run-off-roadway accidents: An empirical analysis. Accid. Anal. Prev. 2002, 34, 149–161.
- Nam, D.; Lee, J. Accident Frequency Model Using Zero Probability Process. Transp. Res. Rec. J. Transp. Res. Board 2006, 1973, 142–148.
- Feng, S.; Li, Z.; Ci, Y.; Zhang, G. Risk factors affecting fatal bus accident severity: Their impact on different types of bus drivers. Accid. Anal. Prev. 2016, 86, 29–39.
- Li, Z.; Chen, C.; Ci, Y.; Zhang, G.; Wu, Q.; Liu, C.; Qian, Z.S. Examining driver injury severity in intersection-related crashes using cluster analysis and hierarchical Bayesian models. Accid. Anal. Prev. 2018, 120, 139–151.
- Zhang, J.; Li, Z.; Pu, Z.; Xu, C. Comparing Prediction Performance for Crash Injury Severity Among Various Machine Learning and Statistical Methods. IEEE Access 2018, 6, 60079–60087.
- Sivaraman, S.; Trivedi, M.M. A General Active-Learning Framework for On-Road Vehicle Recognition and Tracking. IEEE Trans. Intell. Transp. Syst. 2010, 11, 267–276.
- Martinez, C.M.; Heucke, M.; Wang, F.Y.; Gao, B.; Cao, D. Driving Style Recognition for Intelligent Vehicle Control and Advanced Driver Assistance: A Survey. IEEE Trans. Intell. Transp. Syst. 2018, 19, 666–676.
- Desjardins, C.; Chaib-draa, B. Cooperative Adaptive Cruise Control: A Reinforcement Learning Approach. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1248–1260.
- Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big Data Analytics in Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2019, 20, 383–398.
- Meiring, G.; Myburgh, H. A Review of Intelligent Driving Style Analysis Systems and Related Artificial Intelligence Algorithms. Sensors 2015, 15, 30653–30682.
- Young, W.; Sobhani, A.; Lenné, M.G.; Sarvi, M. Simulation of safety: A review of the state of the art in road safety simulation modelling. Accid. Anal. Prev. 2014, 66, 89–103.
- Ji, A.; Levinson, D. Injury Severity Prediction From Two-Vehicle Crash Mechanisms With Machine Learning and Ensemble Models. IEEE Open J. Intell. Transp. Syst. 2020, 1, 217–226.
- Koesdwiady, A.; Soua, R.; Karray, F. Improving Traffic Flow Prediction with Weather Information in Connected Cars: A Deep Learning Approach. IEEE Trans. Veh. Technol. 2016, 65, 9508–9517.
- Zhang, Z.; He, Q.; Gao, J.; Ni, M. A deep learning approach for detecting traffic accidents from social media data. Transp. Res. Part C Emerg. Technol. 2018, 86, 580–596.
- Mondal, A.R.; Bhuiyan, M.A.E.; Yang, F. Advancement of weather-related crash prediction model using nonparametric machine learning algorithms. SN Appl. Sci. 2020, 2, 1–11.
- Labib, M.F.; Rifat, A.S.; Hossain, M.M.; Das, A.K.; Nawrine, F. Road Accident Analysis and Prediction of Accident Severity by Using Machine Learning in Bangladesh. In Proceedings of the 2019 7th International Conference on Smart Computing & Communications (ICSCC), Sarawak, Malaysia, 28–30 June 2019; IEEE: New York, NY, USA, 2019.
- Assi, K.; Rahman, S.M.; Mansoor, U.; Ratrout, N. Predicting Crash Injury Severity with Machine Learning Algorithm Synergized with Clustering Technique: A Promising Protocol. Int. J. Environ. Res. Public Health 2020, 17, 5497.
- Ahmadi, A.; Jahangiri, A.; Berardi, V.; Machiani, S.G. Crash severity analysis of rear-end crashes in California using statistical and machine learning classification methods. J. Transp. Saf. Secur. 2018, 12, 522–546.
- Lee, S.L. Assessing the Severity Level of Road Traffic Accidents Based on Machine Learning Techniques. Adv. Sci. Lett. 2016, 22, 3115–3119.
- Mamlook, R.E.A.; Abdulhameed, T.Z.; Hasan, R.; Al-Shaikhli, H.I.; Mohammed, I.; Tabatabai, S. Utilizing Machine Learning Models to Predict the Car Crash Injury Severity among Elderly Drivers. In Proceedings of the 2020 IEEE International Conference on Electro Information Technology (EIT), Chicago, IL, USA, 31 July–12 August 2020; IEEE: New York, NY, USA, 2020.
- Wang, C.; Liu, L.; Xu, C.; Lv, W. Predicting Future Driving Risk of Crash-Involved Drivers Based on a Systematic Machine Learning Framework. Int. J. Environ. Res. Public Health 2019, 16, 334.
- Geyik, B.; Kara, M. Severity Prediction with Machine Learning Methods. In Proceedings of the 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 26–27 June 2020; IEEE: New York, NY, USA, 2020.
- Wahab, L.; Jiang, H. Severity prediction of motorcycle crashes with machine learning methods. Int. J. Crashworth. 2019, 25, 485–492.
- Lee, J.; Yoon, T.; Kwon, S.; Lee, J. Model Evaluation for Forecasting Traffic Accident Severity in Rainy Seasons Using Machine Learning Algorithms: Seoul City Study. Appl. Sci. 2019, 10, 129.
- Semana, R. Las Motos Representan el 59% del Parque Automotor de Colombia. Available online: www.semana.com (accessed on 29 October 2021).
- Ministerio de Transporte de Colombia. Registro Nacional de Tránsito. 2021. Available online: www.runt.com.co (accessed on 29 October 2021).
- Revista Portafolio. Siniestros Viales le Cuestan al país 23,9 Billones de Pesos al año. 2020. Available online: www.portafolio.co/revista (accessed on 29 October 2021).
- Guyon, I.; Chaabane, I.; Escalante, H.J.; Escalera, S.; Jajetic, D.; Lloyd, J.R.; Macià, N.; Ray, B.; Romaszko, L.; Sebag, M.; et al. A brief Review of the ChaLearn AutoML Challenge: Any-time Any-dataset Learning without Human Intervention. In Proceedings of the Workshop on Automatic Machine Learning, New York, NY, USA, 24 June 2016; pp. 21–30.
- Gijsbers, P.; LeDell, E.; Poirier, S.; Thomas, J.; Bischl, B.; Vanschoren, J. An Open Source AutoML Benchmark. In Proceedings of the AutoML Workshop at International Conference on Machine Learning 2019, Long Beach, CA, USA, 9–15 June 2019.
- Garcia, S.; Fernandez, A.; Luengo, J.; Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 2010, 180, 2044–2064.
- Zimmer, L.; Lindauer, M.; Hutter, F. Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL. arXiv 2020, arXiv:2006.13799.
- Chefrour, A. Incremental supervised learning: Algorithms and applications in pattern recognition. Evol. Intell. 2019, 12, 1–16.
- Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; Fu, Y. Large Scale Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Castro, F.M.; Marin-Jimenez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-End Incremental Learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Dries, A.; Rückert, U. Adaptive concept drift detection. Stat. Anal. Data Min. ASA Data Sci. J. 2009, 2, 311–327.
- Castelvecchi, D. Can we open the black box of AI? Nature 2016, 538, 1–4.
- Waring, J.; Lindvall, C.; Umeton, R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif. Intell. Med. 2020, 104, 101822.
- Gunning, D.; Aha, D. DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Mag. 2019, 40, 44–58.
- Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
Main Information | Road Accident | Crash Severity | AutoML |
---|---|---|---|
Timespan | 2010:2021 | 2010:2021 | 2010:2021 |
Sources (Journals, Books, etc.) | 310 | 53 | 333 |
Documents | 452 | 67 | 462 |
Average years from publication | 3.21 | 2.25 | 2.55 |
Average citations per documents | 9.668 | 18.51 | 10.31 |
Average citations per year per doc | 2.134 | 6.368 | 2.654 |
References | 10,278 | 1870 | 16,000 |
Document Types | Road Accident | Crash Severity | AutoML |
Article | 194 | 37 | 232 |
Book chapter | 3 | 28 | 4 |
Conference paper | 245 | 1 | 219 |
Review | 10 | 1 | 5 |
Keywords | Road Accident | Crash Severity | AutoML |
Keywords Plus (ID) | 2486 | 435 | 3650 |
Author’s Keywords (DE) | 1243 | 201 | 1129 |
AUTHORS | Road Accident | Crash Severity | AutoML |
Authors | 1512 | 185 | 2150 |
Author Appearances | 1679 | 199 | 2457 |
Authors of single-authored documents | 12 | 24 | 24 |
Authors of multi-authored documents | 1500 | 161 | 2126 |
Authors Collaboration | Road Accident | Crash Severity | AutoML |
Documents per Author | 0.299 | 0.362 | 0.215 |
Authors per Document | 3.35 | 2.76 | 4.65 |
Co-Authors per Documents | 3.71 | 2.97 | 5.32 |
Collaboration Index | 3.42 | 3.74 | 4.89 |
Most Cited Sources | Articles |
---|---|
Accident Analysis And Prevention | 1463 |
IEEE Transactions On Intelligent Transportation Systems | 239 |
Transportation Research Record | 108 |
Safety Science | 87 |
Machine Learning | 62 |
IEEE Access | 57 |
Journal Of Safety Research | 50 |
Analytic Methods In Accident Research | 45 |
Traffic Injury Prevention | 28 |
Papers | Year | Citations |
---|---|---|
[42] | 2000 | 483 |
[41] | 2003 | 323 |
[43] | 2002 | 395 |
[48] | 2010 | 248 |
[49] | 2018 | 152 |
[50] | 2011 | 137 |
[55] | 2016 | 135 |
[51] | 2019 | 132 |
[52] | 2015 | 104 |
[56] | 2018 | 101 |
[25] | 2017 | 92 |
[53] | 2014 | 83 |
[54] | 2020 | 67 |
Reference | Year | Methods |
---|---|---|
[65] | 2020 | Multi-layer perceptron (MLP), rule induction (PART) and classification and regression trees (SimpleCart) |
[57] | 2020 | Random forest (RF) and Bayesian additive regression trees (BART) |
[59] | 2020 | Feed-forward neural networks (FNN), support vector machine (SVM), fuzzy C-means clustering based feed-forward neural network (FNN-FCM), and fuzzy c-means based support vector machine (SVM-FCM). |
[62] | 2020 | Naïve Bayes (NB), Decision Tree (DT), Logistic Regression (LR), LightGBM, and Random Forest (RF) |
[60] | 2020 | Multinomial logit, mixed multinomial logit, and support vector machine (SVM) |
[66] | 2020 | Random forest (RF), artificial neural network, and decision tree (DT) |
[64] | 2020 | Multi-layer Perceptron (MLP), Decision Tree (DT), Random Forest (RF) classifier and Naive Bayes (NB). |
[63] | 2019 | Random forest (RF), AdaBoost with decision tree, gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost) |
[58] | 2019 | Decision Tree (DT), K-Nearest Neighbors (KNN), Naïve Bayes (NB) and AdaBoost |
[47] | 2018 | K-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM) |
[25] | 2017 | Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF) |
[61] | 2016 | Decision trees (DT), artificial neural networks, Bayesian networks, support vector machines (SVM), and regression models |
City | Records | Year | Data Source |
---|---|---|---|
Bogotá | 66,329 | 2015–2019 | datosabiertos.bogota.gov.co (accessed on 29 October 2021) |
Medellín | 150,646 | 2014–2018 | medata.gov.co (accessed on 29 October 2021) |
Bucaramanga | 32,857 | 2012–2020 | observatorio.bucaramanga.gov.co/index.php/datos-abiertos/ (accessed on 29 October 2021) |
Dataset | Instances | People Injured (A) | Only Material Damages (B) | Imbalance Ratio (A/B) |
---|---|---|---|---|
Med2014 | 41,776 | 23,198 | 18,578 | 1.25 |
Med2015 | 42,427 | 23,550 | 18,877 | 1.25 |
Med2016 | 46,838 | 26,594 | 20,244 | 1.31 |
Med2017 | 42,443 | 22,917 | 19,526 | 1.17 |
Med2018 | 46,655 | 24,247 | 22,408 | 1.08 |
Dataset | Instances | People Injured (A) | Casualties (B) | Only Material Damages (C) | Imbalance Ratios |
---|---|---|---|---|---|
Bog2015 | 31,341 | 10,738 | 529 | 20,074 | C/A = 1.87 C/B = 37.95 |
Bog2016 | 34,988 | 10,578 | 567 | 23,843 | C/A = 2.25 C/B = 42.05 |
Bog2017 | 35,171 | 10,381 | 538 | 24,252 | C/A = 2.34 C/B = 45.08 |
Bog2018 | 36,953 | 12,609 | 500 | 23,844 | C/A = 1.89 C/B = 47.69 |
Bog2019 | 34,990 | 12,371 | 492 | 22,127 | C/A = 1.79 C/B = 44.97 |
Buc2012 | 4343 | 1587 | 64 | 2692 | C/A = 1.70 C/B = 42.06 |
Buc2013 | 4055 | 1519 | 67 | 2469 | C/A = 1.63 C/B = 36.85 |
Buc2014 | 3723 | 1617 | 37 | 2069 | C/A = 1.28 C/B = 55.92 |
Buc2015 | 3765 | 1705 | 47 | 2013 | C/A = 1.18 C/B = 42.83 |
Buc2016 | 3733 | 1705 | 64 | 1964 | C/A = 1.15 C/B = 30.69 |
Buc2017 | 3807 | 1903 | 39 | 1865 | A/C = 1.02 A/B = 48.79 |
Buc2018 | 3910 | 2100 | 40 | 1770 | A/C = 1.19 A/B = 52.5 |
Buc2019 | 3724 | 1993 | 42 | 1689 | A/C = 1.18 A/B = 47.45 |
Buc2020 | 1797 | 1000 | 38 | 759 | A/C = 1.32 A/B = 26.31 |
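The imbalance ratios in the two dataset tables are the majority-class count divided by the count of each remaining class. A minimal sketch of that arithmetic follows; the helper name `imbalance_ratios` is hypothetical, and the class counts are copied from the Med2014 and Bog2015 rows.

```python
def imbalance_ratios(class_counts):
    """Divide the majority-class count by the count of every other class."""
    majority = max(class_counts, key=class_counts.get)
    return {
        f"{majority}/{label}": class_counts[majority] / count
        for label, count in class_counts.items()
        if label != majority
    }


# Binary example (Med2014): A = people injured, B = only material damages.
med2014 = {"A": 23198, "B": 18578}
# Multiclass example (Bog2015): A = people injured, B = casualties,
# C = only material damages.
bog2015 = {"A": 10738, "B": 529, "C": 20074}

med_ratios = {k: round(v, 2) for k, v in imbalance_ratios(med2014).items()}
bog_ratios = {k: round(v, 2) for k, v in imbalance_ratios(bog2015).items()}
print(med_ratios)  # {'A/B': 1.25}
print(bog_ratios)  # {'C/A': 1.87, 'C/B': 37.95}
```

Taking the majority class as the numerator keeps every ratio at or above 1, so larger values always mean a rarer class; the casualties class is by far the rarest in the multiclass datasets.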
Binary: Method | Binary: Av. Ranking | Binary: p-Value | Multiclass: Method | Multiclass: Av. Ranking | Multiclass: p-Value |
---|---|---|---|---|---|
CatB | 4 | - | Ag60m | 3.6786 | - |
LGBM | 4.2 | 1 | Ag150m | 3.8929 | 1 |
Ag150m | 4.3 | 1 | Ag15m | 4.0714 | 1 |
Ag15m | 4.7 | 1 | As150m | 5 | 1 |
Ag60m | 4.8 | 1 | CatB | 5.2143 | 1 |
GB | 5.2 | 1 | GB | 5.2143 | 1 |
Tp | 5.2 | 1 | Tp | 5.4286 | 1 |
As150m | 7.8 | 1 | As60m | 6.1429 | 1 |
Tuned_RF | 8.4 | 0.958359 | As15m | 8.7143 | 0.023123 |
As60m | 9.2 | 0.593928 | LGBM | 9.4286 | 0.006696 |
RF | 10.6 | 0.196244 | Tuned_RF | 9.4286 | 0.006696 |
As15m | 11 | 0.146612 | RF | 11.7857 | 0.000018 |
ExtraT | 11.6 | 0.086515 | ExtraT | 13.4286 | 0 |
GNB | 14 | 0.00529 | DT | 14.1786 | 0 |
DT | 15 | 0.001409 | GNB | 14.3929 | 0 |
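An average ranking of this kind is typically obtained by ranking the methods on each dataset (1 = best, ties share the mean rank) and averaging those ranks across datasets. The sketch below illustrates that step, together with a Holm step-down adjustment of p-values for comparisons against the top-ranked method. The post-hoc procedure is not stated in this section, so the use of Holm is an assumption, and the scores, method names, and p-values are hypothetical toy inputs rather than the paper's results.

```python
def average_ranks(scores):
    """scores: {dataset: {method: score}}, higher is better.

    Returns {method: mean rank}; rank 1 is best and ties share the mean rank.
    """
    methods = sorted(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for per_method in scores.values():
        ordered = sorted(methods, key=lambda m: -per_method[m])
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over all methods tied with the one at position i.
            while (j + 1 < len(ordered)
                   and per_method[ordered[j + 1]] == per_method[ordered[i]]):
                j += 1
            shared = (i + 1 + j + 1) / 2  # mean of positions i+1 .. j+1
            for m in ordered[i:j + 1]:
                totals[m] += shared
            i = j + 1
    return {m: totals[m] / len(scores) for m in methods}


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = min(1.0, (k - step) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical scores for three methods on two datasets.
toy_scores = {
    "d1": {"X": 0.80, "Y": 0.75, "Z": 0.75},
    "d2": {"X": 0.70, "Y": 0.72, "Z": 0.60},
}
ranks = average_ranks(toy_scores)
print(ranks)

# Hypothetical unadjusted p-values for three pairwise comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
print(adjusted)
```

Capping each adjusted value at 1 is consistent with the many entries equal to 1 in the table: methods whose rank is close to the best one cannot be statistically distinguished from it.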
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Angarita-Zapata, J.S.; Maestre-Gongora, G.; Calderín, J.F. A Bibliometric Analysis and Benchmark of Machine Learning and AutoML in Crash Severity Prediction: The Case Study of Three Colombian Cities. Sensors 2021, 21, 8401. https://doi.org/10.3390/s21248401