Article

Analysis of the Severity of Road Accidents Using Combined Data Mining Techniques

by César Corrales 1,*, Juan Carlos Rubio-Romero 2 and María del Carmen Pardo-Ferreira 2
1 Department of Engineering, Pontifical Catholic University of Peru, Lima 15088, Peru
2 Department of Economics and Business Administration, University of Málaga, 29071 Malaga, Spain
* Author to whom correspondence should be addressed.
Sustainability 2026, 18(12), 6118; https://doi.org/10.3390/su18126118 (registering DOI)
Submission received: 13 May 2026 / Revised: 6 June 2026 / Accepted: 11 June 2026 / Published: 14 June 2026

Abstract

Road traffic accidents represent a critical road safety issue, the severity of which depends on the complex interplay of multiple factors. This issue directly impacts Target 3.6 of Sustainable Development Goal (SDG) 3, which aims to halve global deaths and injuries by 2030, and SDG 11, which focuses on safe and sustainable transport systems. The study of these factors and their interrelationships is important in the scientific literature. The objective of this study is to analyze the factors that determine the severity of road traffic accidents, identifying the most important ones and their correlations. A dataset containing variables such as infrastructure, location, time, and vehicle type, among others, was used to predict severity, applying Association Rules to identify latent correlations and the Classification and Regression Tree for hierarchical risk classification. The results reveal that the type of collision is the primary predictor of severity; the highest severity is associated with heavy traffic and head-on or side-impact collisions, involving critical scenarios, in the early morning hours and in rural areas, linked to trucks. The combined use of both tools provides a scientific basis for designing interventions on highly vulnerable road segments, contributing to the fulfillment of the 2030 Agenda for safe mobility.

1. Introduction

The 2023 Global Status Report on Road Safety indicates that the annual number of road traffic deaths has decreased to 1.19 million, down from 1.25 million in 2010. However, traffic injuries remain the leading cause of death among people aged 5 to 29, and 9 out of 10 deaths occur in low- and middle-income countries [1]; thus, accidents primarily affect the most productive members of families—young men aged 15 to 29—plunging households into extreme poverty and weakening the social resilience of communities [2,3,4]. This reality directly impacts Target 3.6 of Sustainable Development Goal 3 (SDG 3), which establishes the commitment to halve traffic-related deaths and injuries worldwide by 2030 [5]. Likewise, SDG 11, aimed at achieving sustainable cities and communities, includes among its targets ensuring access to safe, affordable, and sustainable transport systems, emphasizing that road safety is a structural component of urban and territorial development, not an isolated sectoral issue [6]. Road traffic crashes are projected to be the seventh leading cause of death globally by 2030 [7]. Furthermore, the direct economic cost is very high. The National Highway Traffic Safety Administration (NHTSA) estimated that traffic accidents cost the U.S. $871 billion annually [8], and the International Road Assessment Programme (iRAP) estimates that road deaths and injuries worldwide cost $3.6 trillion annually, equivalent to more than 3% of global GDP [9]. This includes the following costs: medical and hospital expenses, treatment and rehabilitation of victims, material losses (vehicles, cargo, and physical objects such as poles, road signs, walls, etc.), removal of damaged vehicles, rescue of victims, cleaning and repair of damage to roads and traffic signs, lost workdays, pensions and early retirements, police and judicial costs, funerals, etc. [10]. 
Some broader estimates suggest that when all external costs are considered, this loss could reach between 10% and 12% of global GDP [11], making road safety a fundamental requirement for sustainable development: there can be no sustainable transportation if it is not inherently safe [12].
Although road safety has improved in many countries, fatalities have increased significantly in many low- and middle-income countries. In Bangladesh, for example, the number of people killed in traffic accidents rose from 1483 in 1993 to 4046 in 2000, an increase of roughly 173% [13]. This aligns with data presented by the World Health Organization, which highlights that low-income countries have a traffic accident mortality rate of 24.1 per 100,000 inhabitants, compared to 9.2 per 100,000 inhabitants in high-income countries [14]. In Peru, this rate is above 10 per 100,000, with 3316 deaths recorded in 2023 [15], and the National Road Network accounts for 60% of all traffic accident fatalities [16]. In the United States, by comparison, non-urban areas (that is, rural roads and highways) account for 41% of traffic fatalities [17].
This disparity reflects a profound structural inequality: 93% of deaths occur in low- and middle-income countries, which account for just 60% of the world’s vehicles [18], demonstrating that road safety is also a key tool for reducing global inequalities. Furthermore, in the field of public health, it is estimated that for every road traffic death, approximately 20 people suffer serious injuries, and of these, at least one will be left with a permanent disability [2,4,5]. The treatment of these injuries diverts critical resources and delays elective surgeries for other conditions, undermining the overall efficiency of the system [10]. The saturation of emergency services is evident given that in some countries, up to 70% of patients presenting to emergency departments have been involved in a traffic accident [5,19]. The treatment of a patient with road traffic trauma is extremely complex and can be up to 15 times more expensive than other interventions due to the need for multiple surgeries and prolonged rehabilitation [20]. In this context, the 2030 Agenda reinforces this urgency by linking road safety to goals for public health (SDG 3), sustainable infrastructure (SDG 9), and safe urban mobility (SDG 11), establishing a global mandate that transcends the technical sphere and demands integrated public policy responses.
It is, therefore, very important to identify the factors that affect transportation safety, particularly on rural roads and highways, in order to have a positive impact on reducing fatalities and on the economy in the long term. Both globally and locally, there is a clear need to reduce traffic-related deaths and injuries [21]. In this regard, little is known about the determinants of the severity of road accidents, particularly involving buses, in low- and middle-income countries [22]. Several factors are attributed to fatalities, including driver behavior, vehicle characteristics, road infrastructure, and system characteristics [23]. Driver drowsiness contributes to a substantial proportion of all traffic accidents [24]. On the other hand, the probability of traffic accidents is expected to increase on sections of road with a high volume of heavy vehicles [25]. Other risk factors for road accidents include time of day, lighting conditions, weather, driver characteristics and behavior, infrastructure characteristics, road geometry, and type of collision, among others [26,27].
In all cases, there are many causes involved in accidents, so it is important to establish the relationship between these causes and the significance of each of them in order to seek to reduce road accidents. While numerous studies have investigated the factors influencing accident risks, most of them have focused on common traffic accidents. Consequently, the factors contributing to road accidents involving buses—which typically result in a high number of casualties—and, above all, the interdependence among them, have not been studied with the depth required. As a result, measures taken to reduce the risk of these accidents may not be as effective. Furthermore, due to differences in the characteristics and mechanisms of accidents with serious casualties and ordinary accidents, traditional countermeasures for ordinary accidents may not effectively reduce the risks of accidents with serious casualties. In this regard, the use of data mining through statistical and computational methods to search for behavioral patterns in the data can provide us with hidden insights into these processes. The two main objectives of data mining tend to be prediction—using a dataset to predict unknown or future values—and description—finding patterns that describe the data [28].
The main tasks of data mining include selecting the observed data (time series, cross-sectional, or panel), the sample or population period, the frequency (annual, quarterly, etc.), and its measurement; selecting predictors (searching for a specific correlation structure); and specifying the model, which covers changing the model’s assumptions, diagnostic tests, and data exploration [29]. Data mining extracts useful and interesting information from very large datasets and has become increasingly common in many fields, such as banking, insurance, medicine, retail, biology, and agriculture [30]. It may draw on artificial intelligence, expert systems, machine learning, pattern recognition, statistics, intelligent databases, knowledge acquisition, data visualization, and other fields. Its task is to create models from data, and its development will largely determine the direction of data development [31]. Taking all this into account, road accident data can be analyzed using association rule mining and classification tree techniques, which aim to extract knowledge from observational data, bearing in mind that road accidents, which in Peru are generally overseen by the Superintendency of Land Transport of Passengers, Cargo, and Goods (SUTRAN), are caused by complex interactions among various contributing factors.
The choice of these analytical tools is fundamental. On the one hand, association rules allow for the discovery of co-occurrence relationships among multiple factors simultaneously—such as vehicle type, time, and location—without the need to designate dependent and independent variables a priori [28], revealing hidden patterns that conventional regression models, limited by assumptions of linearity and distribution, cannot detect [29]. On the other hand, the CART algorithm constructs a non-parametric predictive hierarchy that identifies the factors with the greatest discriminatory power for accident severity, allowing specific risk profiles to be segmented in a transparent and interpretable manner [32]. However, both tools have limitations when applied in isolation: association rules generate sets of rules of equal hierarchy without distinguishing which factor is most decisive, while CART does not capture the simultaneous interactions between multiple conditions that characterize the most severe scenarios. Their combined use can overcome these limitations: association rules reveal which combinations of conditions are dangerous, and CART establishes the relative importance of each factor in determining severity [33,34]. This dual approach constitutes the methodological core of the present study. The use of these techniques could reveal valuable relationships that have not been identified in existing studies. By identifying the contributing factors and their interdependencies, important information can be obtained to understand the reasons behind the occurrence of accidents and to develop effective policies and countermeasures to improve safety.
The main objective of this study is to demonstrate the importance of using combined data mining tools to determine the main factors contributing to accidents with serious injuries and their interdependencies, laying the groundwork for providing information and developing safety improvement policies and strategies to reduce these accidents. Thus, a safe and affordable transportation system is fundamental to sustainable development, as it can strengthen economic growth and increase accessibility, thereby promoting equality for the population [10]. The findings of this study, therefore, contribute not only to the advancement of technical knowledge in road safety but also to the fulfillment of the commitments made by countries under the 2030 Agenda, particularly with regard to the targets of SDG 3 and SDG 11.

2. Literature Review

Negative binomial models have been used to explore the impacts of various accident factors on their frequency [35,36,37]. The use of Poisson regression models is also widespread in road safety management and analysis, particularly for determining the importance of factors in traffic accidents [38,39]. In recent years, other types of models have emerged: artificial neural networks, Bayesian networks, decision trees, and genetic programming [40].
In recent years, the use of data mining tools has become particularly important. Data mining involves extracting implicit, previously unknown, and potentially useful information from data. The idea is to create computer programs that automatically scan databases for regularities or patterns [41]. Data mining allows for filtering out noise in the data, predicting patterns, forecasting outcomes, and generating information to support informed and agile decision-making. It is closely related to statistics, artificial intelligence, and machine learning. In predictive modeling, the goal is to estimate the value of a target attribute based on training data, using tools such as association rules and Classification and Regression Trees [32].
Association rules, also known as basket analysis, are one of the most popular approaches for discovering patterns in databases [42]. Algorithms for the discovery of association rules attempt to identify products that tend to be sold together [43]. The association rule method in data mining has been successfully used to discover patterns or hidden rules in a variety of fields, including shopping basket analysis, product recommendation, and medical record analysis [44], and allows for the identification of potential cause-and-effect relationships among the many factors that play a role, for example, in workplace accidents in the construction industry [45].
The goal of association rule mining is to identify any real associations in the data without specifically designating any variable as a dependent or independent variable. An association rule in the context of accident data indicates that the presence of a certain characteristic in an accident implies the presence of another characteristic in that accident. These rules can be searched for in the database using the Apriori algorithm [43]. Improvements in the quantity and quality of data have sparked interest in new ways of analyzing and interpreting data. In particular, various authors have applied association rules to search for hidden patterns in accident databases in very general or very specific terms [46,47,48].
The Apriori algorithm has been used to extract strong association rules among the values of accident attributes in China, such as vehicle type, weather, and time of day, among others [49]. It has been combined with the WEKA platform to identify hidden factors in road accident records, with cross-validation of the method across different regions and cities [50]. It has also been combined with complex graph structures to reflect interactions between variables such as improper operations, overloading, mountainous terrain, and running off the road [51]. It is also found in the specialized literature, combined with Complex Network Analysis to examine the causal mechanisms of serious accidents by identifying human factors—speeding, fatigue, insufficient distance—vehicle factors (trucks), and environmental factors (national highways, nighttime hours) as the main causes [52]. Association rules can also be used with more complex tools such as Random Forest + SHAP (RF-SHAP), which combine a powerful machine learning model with a robust interpretability framework. The combination of RF-SHAP—to determine the individual importance of factors—with the Apriori algorithm (to explore interactions among multiple factors simultaneously) allows us to overcome the limitations of each method separately: RF-SHAP identifies which factors matter individually, while Apriori reveals how they interact with one another to cause serious accidents [33].
Classification trees, meanwhile, are a specific case of partitioning-based testing strategies [53]. Classification is a unique example of predictive modeling in which a dataset is already segmented into pre-specified groups, and patterns in the data are identified to distinguish those groups. The patterns explored can then be used to categorize another dataset where the appropriate group description for the target attribute is unknown. Regression analysis is also an example of predictive modeling with a numerical target attribute, and the goal is to predict that value for new data [32]. The method involves identifying factors relevant to the test, and based on each of them, an exclusive classification is made, which, in turn, can be further reclassified into subcategories, represented graphically in the form of a tree. Test cases will be generated by combining the different elements from the various classifications performed. One of the advantages of this method is that it allows all the information to be managed in a structured manner in small groups or parts, making it easier to understand and document [53].
Classification and Regression Trees (CART) and Random Forest (RF) have been applied to heavy vehicle accident data in Malaysia [54], and decision trees have been combined with statistical analysis in SPSS to analyze accidents in tunnels on mountainous highways in China [55].
Association rules have been applied after segmenting data into homogeneous clusters, integrating results with GISs to identify black spots, determining that serious accidents occur between trucks and motorcycles in low-density areas, with the main causes being excessive speed and improper lane changes [56]. Decision trees (DTs) have also been used with Python to identify severity factors related to driver behavior and socioeconomic characteristics [57].
Moving toward more complex uses of these tools, some studies integrate various methods to overcome the individual limitations of each. One such study applies three classification algorithms (Decision Tree, LightGBM, and XGBoost) with hyperparameter tuning to UK accident data from 2020. It finds that most accidents occur in daylight on dry surfaces and that roads with a 30 mph speed limit (urban roads) see a higher concentration of accidents, although highways are more lethal [58]. Another study compares Logistic Regression (LR), CART, and RF for predicting severity on roads in Taiwan [59]. Ordered Probit, association rules, and CART have also been used together to identify severity factors for vulnerable users at road–rail crossings, finding that speed is the most important factor [34].
Unlike the traditional literature, which has relied primarily on Poisson and negative binomial regression models to study traffic accidents, this study aims to enhance the analysis by using data mining tools, specifically association rules and CART. These tools were chosen because the factors causing traffic accidents are complex and often interrelated, requiring tools capable of extracting relevant information for sound decision-making. Association rules facilitate the discovery of cause-and-effect relationships among variables or factors that interact simultaneously, such as driver behavior, the environment, and the vehicle. The integration of the CART model allows this information to be handled in a structured manner by partitioning the dataset into specific groups. This dual approach overcomes the individual limitations of each method, making this methodological combination a significant advance over studies that use isolated tools or basic descriptive statistics. Despite the rise of highly complex supervised learning algorithms, such as RF, LightGBM, and XGBoost, there remains a critical need for models that not only predict severity but also reveal the architecture of interactions between factors, thereby offering more robust and accurate insights for road safety management and the reduction in road accidents.

3. Materials and Methods

Data mining techniques widely used in the study of complex phenomena with multiple interrelated variables were employed in this study. Association rules and CART were chosen over more complex models with higher predictive accuracy, such as Random Forest or XGBoost, because they better fulfill the study’s central objective: to identify the interrelationships among different factors, rather than merely their involvement and importance in the occurrence and severity of accidents, without requiring additional interpretation tools. First, association rules were applied to identify frequent patterns and significant relationships among categorical variables within large datasets; second, Classification and Regression Trees (CART) were used to facilitate both severity prediction and the identification of factors with the greatest discriminatory power. These techniques and the methodology followed in this process are described below.

3.1. Association Rules

Rules take the form A → B, where A is the antecedent and B is the consequent. Each rule is characterized by its support, confidence, and lift. Support is the proportion of observations in the entire dataset that contain both the antecedent and the consequent. Confidence is the proportion of observations containing the antecedent that also contain the consequent. Lift quantifies the statistical dependence between the antecedent and the consequent. The three indices can be calculated as follows:
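$$\mathrm{Support}(A \rightarrow B) = \frac{\#(A \cap B)}{N}$$
$$\mathrm{Confidence} = \frac{\mathrm{Support}(A \rightarrow B)}{\mathrm{Support}(A)}$$
$$\mathrm{Lift} = \frac{\mathrm{Support}(A \rightarrow B)}{\mathrm{Support}(A) \times \mathrm{Support}(B)}$$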
where N is the number of observations and #(A ∩ B) is the number of observations in which conditions A (antecedent) and B (consequent) are met. The lift of the rule indicates the ratio of the actual co-occurrences of the antecedent and consequent to the expected co-occurrences under the assumption that the antecedent and consequent are independent. A value less than 1 indicates a negative dependence between the antecedent and the consequent. A value equal to 1 indicates independence, and a value greater than 1 indicates a positive dependence. The higher the lift, the stronger the association rule [48]. It is desirable for rules to have a high level of support, high confidence, and a lift value considerably greater than one. To identify strong associations, threshold values for support (S), confidence (C), and lift (L) were set as follows: S ≥ 4%, C ≥ 20%, and L ≥ 2. For rules with lift values greater than 10, the support threshold was set at 1% [44,46].
For example, in the rule “reckless driving → alcohol (support = 1%, confidence = 50%, lift = 5)”, support indicates that 1% of all observations in the dataset include both reckless driving errors and alcohol; confidence indicates that, among the observations that include reckless driving errors (the antecedent), 50% also include alcohol; and lift indicates that reckless driving errors are positively associated with alcohol, with the two occurring together five times as often as would be expected if they were independent [48].
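The three indices can be checked numerically with a short sketch. The Python code below is illustrative only (the study itself used R), and the 100 records are synthetic: they were constructed here so that the rule "reckless driving → alcohol" has the support, confidence, and lift quoted in the example. The function and item names are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Record = FrozenSet[str]


def support(records: List[Record], items: Iterable[str]) -> float:
    """Share of all records that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for r in records if wanted <= r) / len(records)


def confidence(records: List[Record], antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Support(A -> B) / Support(A); assumes the antecedent occurs at least once."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(records, both) / support(records, antecedent)


def lift(records: List[Record], antecedent: Iterable[str],
         consequent: Iterable[str]) -> float:
    """Confidence divided by Support(B), i.e. Support(A -> B) / (Support(A) * Support(B))."""
    return confidence(records, antecedent, consequent) / support(records, consequent)


# Synthetic data (an assumption, not the study's records):
# 1 record with both factors, 1 with reckless driving only,
# 9 with alcohol only, and 89 with neither.
toy_records: List[Record] = (
    [frozenset({"reckless_driving", "alcohol"})] * 1
    + [frozenset({"reckless_driving"})] * 1
    + [frozenset({"alcohol"})] * 9
    + [frozenset({"other"})] * 89
)

A = {"reckless_driving"}
B = {"alcohol"}
print("support:", support(toy_records, A | B))
print("confidence:", confidence(toy_records, A, B))
print("lift:", lift(toy_records, A, B))
```

Because confidence divides by the support of the antecedent, swapping A and B changes the confidence (here it would drop to 10%) while support and lift stay the same.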

3.2. Classification and Regression Tree

A decision tree model consists of root, internal, and leaf nodes. The metrics used in a decision tree model for branch selection are information gain, entropy, and Gini impurity. Figure 1 shows a schematic diagram of a decision tree model. It uses a top-down recursive method at each node in the sample set to select the branch attribute according to the given criteria, from a root node to a leaf node. A leaf node represents the value of an objective function, determined by the input variables along the path from the root node to the leaf node [60].
The CART algorithm is a decision-tree-specific algorithm widely used today. For classification problems, Gini impurity is used as the criterion for branch selection, whereas for regression problems, the mean squared error (MSE) is used. Gini impurity represents the probability that a randomly selected sample will be misclassified in the sample set. Gini impurity can be calculated as follows:
$$\mathrm{Gini}(A) = 1 - \sum_{k=1}^{n} P_k^{2}$$
where A represents node A; n is the number of classes of the target variable; and Pk is the probability that a randomly selected sample from the dataset of node A belongs to class k, with k = 1, 2, …, n. For regression problems, the objective of CART is to minimize the MSE. The MSE is the mean of the squared differences between the predicted and actual values at a leaf node, and the calculation formula is as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}$$
where n is the number of elements in a dataset, i = 1, 2, 3, …, n; yi is the actual value; and ŷi is the predicted value [60,61].
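As a minimal sketch of the two criteria, the Python functions below compute Gini impurity for a node, the size-weighted impurity of a candidate binary split, and the MSE. The function names and the toy labels are hypothetical and serve only to illustrate the formulas; they are not taken from the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_impurity(labels: Sequence[Hashable]) -> float:
    """Gini(A) = 1 - sum_k P_k^2 over the classes present in the node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Impurity of a binary split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """MSE = (1/n) * sum_i (y_i - yhat_i)^2."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


# Toy node with two classes in equal proportion (hypothetical labels).
node = ["IPA1", "IPA1", "IPA3", "IPA3"]
print(gini_impurity(node))                  # impurity before splitting
print(split_impurity(node[:2], node[2:]))   # a split that separates the classes
print(split_impurity(node[::2], node[1::2]))  # a split that separates nothing
```

A split is useful to the extent that it lowers the weighted impurity relative to the parent node; the second split above leaves it unchanged, so a tree-growing procedure based on this criterion would prefer the first.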
The application of the non-parametric CART model does not require prior probabilistic knowledge of the phenomenon under study or the fulfillment of strict assumptions, either regarding the type of relationship or the distribution of the dependent variable. These aspects represent the main advantages over parametric techniques. Each node in the tree indicates the predicted value, the number of experimental units contained in the node, and its descriptive percentage. CART offers both theoretical and practical advantages over parametric models. In fact, from a theoretical perspective, the advantage of the CART method is that it does not require prior specification of the model’s functional form or the assumption of an additive relationship between the dependent and independent variables. Another advantage is that CART analysis can effectively handle collinearity issues [62].

3.3. Methodology

A procedure for analyzing accident rates is proposed that consists of identifying critical factors in traffic accidents and their interrelationships by combining the use of association rules and CART, which can be fully generalized to any road safety study. This procedure consists of the phases shown in Figure 2.

3.3.1. Data Collection

In this stage, accident records are identified and obtained from the competent national authority responsible for land transport supervision, which typically centralizes data recorded by the traffic police in the field using standardized accident report forms. The study period and geographic scope are defined based on data availability and the accident rate of the road under analysis, prioritizing corridors with the highest incident frequency according to official statistics or the prior literature. The data should be as complete as possible, including as many variables as necessary, and additional information can be obtained from alternative sources or processed manually.

3.3.2. Data Characterization

The collected data is transformed into a structured set of variables. Each recorded attribute—type of accident, environmental conditions, human factors, etc.—is defined as a variable, specifying its nature (nominal, ordinal, numerical). The different categories for each variable are defined, and in the case of continuous numerical variables, they are discretized into intervals for analysis. A special coding system is used to identify the variables and their categories. The result is a data matrix characterized by clearly defined variables and categories.

3.3.3. Application of Association Rules

In this stage, association rules are extracted from the dataset using the R programming language (v4.4.2; R Foundation for Statistical Computing, Vienna, Austria) and the specialized arules library (Open-source library), which implements the Apriori algorithm. Minimum thresholds for support (S ≥ 4%) and confidence (C ≥ 20%) are set, adopted in accordance with the reference values used in previous road safety studies with similar sample characteristics [44,46]. These thresholds determine the set of generated rules. Support ensures that a rule is backed by a minimum proportion of cases in the dataset, and confidence ensures that the antecedent–consequent relationship holds with a minimum reliability. To verify the stability of the results, variations in these thresholds are explored: more restrictive values reduce the number of rules without altering the dominant patterns, while more permissive values incorporate redundant rules of little interpretive value. From the set of generated rules, those with the highest confidence are selected and ranked for analysis, since this indicator directly reflects the reliability of the association. Lift is used as a complementary metric to quantify the strength of each association. Since IPA1 is the most frequent consequent category in the dataset, the resulting lift values are moderate by mathematical construction—which is to be expected and does not invalidate the rules—with rules having a lift greater than 2 standing out due to their greater associative strength. Based on the results, patterns of co-occurrence among the conditions present in accidents can be identified.
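The extraction step can be illustrated with a simplified, standard-library Python sketch of level-wise frequent-itemset search followed by rule filtering and ranking by confidence. It is not the arules implementation used in the study: it only generates rules with a single-item consequent, the eight records are synthetic, and the item labels (RURAL, TRUCK, and so on) are hypothetical. Only the thresholds S ≥ 4% and C ≥ 20% come from the text.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    antecedent: Tuple[str, ...]
    consequent: str
    support: float
    confidence: float
    lift: float


def frequent_itemsets(records: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: size-(k+1) candidates are built only from frequent size-k sets."""
    n = len(records)
    current = [frozenset([i]) for i in sorted({i for r in records for i in r})]
    frequent: Dict[FrozenSet[str], float] = {}
    while current:
        level = {}
        for cand in current:
            share = sum(1 for r in records if cand <= r) / n
            if share >= min_support:
                level[cand] = share
        frequent.update(level)
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        candidates = set()
        for a, b in combinations(keys, 2):
            union = a | b
            # Keep a candidate only if every subset one item smaller is frequent.
            if len(union) == size and all(
                frozenset(s) in level for s in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return frequent


def mine_rules(records: List[FrozenSet[str]], min_support: float = 0.04,
               min_confidence: float = 0.20) -> List[Rule]:
    """Rules with a single-item consequent, ranked by confidence (highest first)."""
    frequent = frequent_itemsets(records, min_support)
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            conf = supp / frequent[antecedent]
            if conf >= min_confidence:
                rule_lift = conf / frequent[frozenset([item])]
                rules.append(Rule(tuple(sorted(antecedent)), item, supp, conf, rule_lift))
    return sorted(rules, key=lambda r: r.confidence, reverse=True)


# Synthetic accident records (an assumption, not the study's data).
toy_records = [frozenset(r) for r in [
    {"RURAL", "TRUCK", "IPA3"}, {"RURAL", "TRUCK", "IPA3"}, {"RURAL", "TRUCK", "IPA1"},
    {"URBAN", "CAR", "IPA1"}, {"URBAN", "CAR", "IPA1"}, {"URBAN", "BUS", "IPA1"},
    {"RURAL", "CAR", "IPA1"}, {"URBAN", "CAR", "IPA1"},
]]
rules = mine_rules(toy_records)
for rule in rules:
    if rule.consequent.startswith("IPA") and rule.lift > 2:
        print(rule)
```

Ranking by confidence and then inspecting lift mirrors the selection order described above; with real data, rules whose consequent is a frequent category will tend to show high confidence but only moderate lift.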

3.3.4. Application of CART

In this stage, the R programming language (v4.4.2; R Foundation for Statistical Computing, Vienna, Austria) and the rpart package (Open-source package) are used to implement the CART algorithm, using accident severity as the target variable. The dataset is partitioned into training and test subsets using stratified sampling to preserve the class distribution of the severity variable. The model is fitted on the training set and evaluated on the test set using standard classification metrics (accuracy, Cohen’s Kappa, and F1-score by class). Cross-validation is applied to assess the stability of the results. The trained tree segments the data space through recursive binary partitioning, optimizing node homogeneity via Gini impurity minimization, and thereby identifies the variables with the greatest discriminatory power for accident severity. Variable importance scores are reported to complement the tree structure. Since the primary objective is interpretability, the final tree may be constructed on the full dataset to maximize the use of available information once validation is complete.
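The validation steps named above (a stratified split, then accuracy, Cohen's Kappa, and F1-score by class) can be sketched in standard-library Python as follows. This is an illustration under stated assumptions rather than the study's R workflow: the test fraction, the random seed, and the example labels are hypothetical, since the text does not report them.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train, test) index lists, sampling each class separately."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    # Convention assumed here: report 1.0 when chance agreement is already total.
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


def f1_by_class(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 = 2TP / (2TP + FP + FN) for each class seen in either sequence."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical severity labels and predictions, for illustration only.
severity = ["IPA1"] * 10 + ["IPA2"] * 6 + ["IPA3"] * 4
train_idx, test_idx = stratified_split(severity, test_fraction=0.5, seed=1)
y_true = ["IPA1", "IPA1", "IPA3", "IPA3"]
y_pred = ["IPA1", "IPA3", "IPA3", "IPA3"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred), f1_by_class(y_true, y_pred))
```

Reporting Kappa and per-class F1 alongside accuracy matters when one severity class dominates, because a tree that always predicts the majority class can reach high accuracy while having a Kappa of zero.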

3.3.5. Discussion of Results

The findings from the techniques applied in steps 3 and 4 are discussed individually and then compared and integrated. Association rules provide patterns of co-occurrence among factors, while the CART tree establishes hierarchies of predictive importance. The agreement and complementarity between the results of both approaches are analyzed. The identified patterns are interpreted in light of the theoretical framework of the domain and compared with the existing literature.

3.3.6. Conclusions

In this stage, practical implications—preventive measures, intervention policies—are derived, and the study’s limitations are noted, suggesting lines of future research.

4. Results

4.1. Data Collection

The traffic accident data were obtained from the Superintendency of Land Passenger Transport, Freight, and Goods (SUTRAN), with the exception of proximity to populated areas, which was determined manually. SUTRAN is a public institution whose functions include the regulation and supervision of all types of land transport activities. It processes its own data together with the data recorded by the Peruvian National Police (PNP), which records accidents in the field using the Standard Traffic Accident Report Form and centralizes them in REATPOL (PNP Traffic Accident Registry), a system linked to the MTC and SUTRAN databases.
The data for each accident includes the time, date, kilometer marker, and vehicles involved, among other variables shown in Table 1. Accidents recorded between 2014 and 2020 on the North Pan-American Highway—the road with the highest accident rate in Peru in recent years [63]—were considered. This highway is part of Peru’s National Road Network, with a total length of approximately 1800 km from Lima to the border with Ecuador. The section analyzed in this study covers the first 200 km from Lima (km 0 to km 200), one of the sections with the highest accident frequency [64], which belongs to the coastal section of the road, characterized by predominantly flat topography with some undulating sections. It is a two-lane highway in each direction in urban and peri-urban sections, narrowing to one lane in each direction in rural sections, with roadways separated by a median in some areas. In addition to the data obtained, based on the kilometer where the accident occurred, proximity to a town was identified.

4.2. Data Characterization

This dataset comprises a total of 491 traffic accident records and was structured so that each record represents a traffic accident, characterized by different variables such as the kilometer where it occurred, the date, the day, the time, the number of vehicles involved, and the type of vehicle, among others. Based on this characterization, three main categories of variables or factors were defined: road factors (location, time), vehicle factors (type and number), and other factors (type of collision and severity). These variables are shown in Table 1.
To facilitate the analysis of rules, continuous variables were transformed into discrete categories (intervals). For example, the distance variable was divided into 20 km ranges (DIS1–DIS10) to ensure a sufficient number of records in each category, and the time-of-day variable was divided into five blocks (TIM1–TIM5) defined on the basis of traffic flow patterns and peak accident times identified in the literature. Abbreviated labels were then assigned to simplify computational processing (e.g., VEC1 for bus, TYP6 for collision with an object). To characterize the severity of accidents, two alternatives were considered. The first is the Fatality and Weighted Serious Injury Index (FWSI), a measure of the consequences of major accidents that combines fatalities and serious injuries, where one serious injury is considered statistically equivalent to 0.1 fatalities [65]. The second is the IPA index, an adaptation of the Traffic Accident Participation Index (IPA) proposed by Peru’s Ministry of Transport and Communications (MTC) [66]. In this case, IPA = 4NM + NL, where NM is the number of fatalities and NL is the number of injuries. The weight of four assigned to fatalities comes from the official MTC methodology and reflects the statistical equivalence established for the national context. The IPA index was chosen because it places greater emphasis on the severity of accidents and, above all, because it ensures comparability with official Peruvian statistics. Three IPA categories were created, from least to most severe: IPA1 (IPA = 0–1, low severity), IPA2 (IPA = 2–4, medium severity), and IPA3 (IPA ≥ 5, high severity), with cut-offs chosen to ensure a sufficient sample size in each class. The results and further details of the characterization can be found in Table 2.
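The severity indices and the discretization described above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the function names are hypothetical, and the handling of boundary kilometers (for example, whether km 20 belongs to DIS1 or DIS2) is an assumption, since the text does not specify it.

```python
def fwsi(fatalities, serious_injuries):
    """FWSI: one serious injury counts as 0.1 fatalities, as stated in the text."""
    return fatalities + 0.1 * serious_injuries

def ipa_index(fatalities, injuries):
    """IPA = 4*NM + NL, with NM = fatalities and NL = injuries."""
    return 4 * fatalities + injuries

def ipa_category(ipa):
    """Three severity classes using the cut-offs given in the text."""
    if ipa <= 1:
        return "IPA1"   # low severity (IPA = 0-1)
    if ipa <= 4:
        return "IPA2"   # medium severity (IPA = 2-4)
    return "IPA3"       # high severity (IPA >= 5)

def distance_bin(km, width=20, n_bins=10):
    """Map a kilometer marker in [0, 200] to DIS1..DIS10 (20 km ranges).

    Assumption: each range includes its lower bound; km 200 goes to DIS10.
    """
    if not 0 <= km <= width * n_bins:
        raise ValueError("kilometer marker outside the analyzed section")
    index = min(int(km // width), n_bins - 1)
    return f"DIS{index + 1}"
```

For instance, an accident with one fatality and two injuries has IPA = 6 and falls in IPA3, and an accident at km 85 is assigned to DIS5, which matches the 80–100 km range used in the text.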
Given the importance of certain variables, a summary of their distribution across the different categories is provided.
Regarding the temporal distribution, the night–early morning time block (TIM5: 00:00–07:00) is the most frequent, with 171 accidents (34.8%), followed by the evening–night block (TIM4: 18:00–24:00) with 141 cases (28.7%), indicating that more than 63% of accidents occur in low-light conditions. The distribution by day of the week is relatively uniform, with Tuesday (DAY3, 79 cases) and Sunday (DAY1, 73 cases) being the most frequent days. On a monthly basis, April (MON4) has the highest incidence with 127 cases (25.9%), followed by July (MON7) with 68 cases (13.8%).
Regarding the type of vehicle involved, trucks (VEC2) predominate with 163 cases (33.2%), followed by buses (VEC1) with 145 cases (29.5%) and the “other vehicles” category (VEC6) with 114 cases (23.2%). Mixed bus–truck vehicles (VEC4) account for 10.0% of cases (49 records), while passenger cars (VEC3) and other minor categories account for the remaining 4.1%.
Regarding the type of collision, the most frequent category is “other types of collision” (TYP7) with 236 cases (48.1%), followed by “running off the road” (TYP1) with 95 cases (19.3%) and “collision with objects or pedestrians” (TYP6) with 64 cases (13.0%). Rear-end collisions (TYP3) account for 56 cases (11.4%), while rollovers (TYP5), head-on collisions (TYP2), and side collisions (TYP4) together account for the remaining 8.1%.
In terms of severity, measured using the IPA index, the distribution is relatively balanced across the three categories resulting from the discretization process: IPA1 (low severity) accounts for 149 cases (30.3%), IPA2 (moderate severity) has the highest number with 199 cases (40.5%), and IPA3 (high severity) comprises 143 cases (29.1%).
Finally, regarding proximity to urban centers, 61.9% of the accidents (n = 304) occur in areas far from towns (Proximity_F), compared to 38.1% (n = 187) in nearby areas (Proximity_N). The predominant traffic level is low (TL1) in 89.4% of cases (n = 439), with medium (TL2) and high (TL3) levels being in the minority, with 45 (9.2%) and 7 (1.4%) cases, respectively.

4.3. Application of Association Rules

To perform an association rule analysis, as previously indicated, the Apriori algorithm was used under the R language interface, utilizing the arules package. This technique is applied to discover frequent relationships between accident factors and their outcomes.
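As a rough illustration of how Apriori finds frequent combinations of factors, the following Python sketch implements the level-wise search on a handful of hypothetical records. It is not the arules implementation used in the study; the function name, the toy transactions, and the support threshold are assumptions made only for this example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    # Level 1: frequent single items
    items = {item for t in sets for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}
    frequent = {c: support(c) for c in current}

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical toy transactions (NOT the study's data)
toy_transactions = [
    {"VEC2", "TYP7", "TIM4"},
    {"VEC2", "TYP7", "TIM5"},
    {"VEC1", "TYP6", "TIM5"},
    {"VEC2", "TYP7", "TIM4"},
]
frequent_itemsets = apriori(toy_transactions, min_support=0.5)
```

The pruning step is what keeps the search tractable: a combination is only counted if all of its smaller sub-combinations were already frequent.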
Combinations of “antecedents” (causes/conditions) leading to a “consequent” (severity outcome or type of collision) were sought. The rules were filtered using the support, confidence, and lift metrics. The rules with the highest confidence were selected for the analysis, since a high confidence means that the consequent holds in most of the records that satisfy the antecedent, which reduces the number of false positives. Table 3 shows the rules with the highest confidence.
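The three filtering metrics can be computed directly from record counts. The sketch below uses the standard definitions (support = share of records containing antecedent and consequent; confidence = that share divided by the share containing the antecedent; lift = confidence divided by the share containing the consequent). The function name and the toy records are hypothetical and do not come from the study.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))          # antecedent present
    n_c = sum(1 for t in transactions if c <= set(t))          # consequent present
    n_ac = sum(1 for t in transactions if (a | c) <= set(t))   # both present
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Hypothetical toy records (NOT the study's data)
toy_records = [
    {"TYP6", "VEC1", "IPA1"},
    {"TYP6", "VEC2", "IPA1"},
    {"TYP7", "VEC2", "IPA1"},
    {"TYP7", "VEC2", "IPA3"},
    {"TYP2", "VEC1", "IPA3"},
    {"TYP1", "VEC3", "IPA2"},
]
support, confidence, lift = rule_metrics(toy_records, {"TYP6"}, {"IPA1"})
```

In this toy example the rule holds in every record that contains the antecedent (confidence 1.0), and the consequent is twice as frequent among those records as in the data overall (lift 2.0).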
The rules with the highest confidence (1.00, or 100%) link collisions with objects or pedestrians to an IPA severity between 1 and 5 in every matching record. Rule 87 indicates that, whenever this type of accident occurs in the dataset, the outcome falls in category IPA1 (IPA between 1 and 5).
Also, based on Rules 491 and 890, when this type of collision occurs, it almost always (confidence > 0.90) involves a single vehicle. Based on Rules 820 and 921, there is a strong association between the involvement of a truck in a two-vehicle accident and a collision classified as “other.” This frequently occurs in the 80–100 km range (DIS5) (Rule 763), suggesting a stretch of road where trucks come into conflict with other vehicles. Rule 820 adds the TimeBin_TIM4 factor (18:00 to 24:00), which indicates a specific pattern of two-vehicle collisions during that time block. From Rule 452, it follows that when a single vehicle is involved and it is a truck, an IPA severity of 1 to 5 results with a confidence of 0.91.
Analyzing the lift, Rule 890 has the highest value (lift: 3.02): if a bus is involved in a collision with objects or pedestrians, an accident with an IPA severity between 1 and 5 is about three times as likely as in the dataset as a whole. Rule 1022 (lift: 2.91) associates run-off-road incidents involving trucks in locations far from cities with severity and single-vehicle involvement. Trucks appear in multiple high-confidence rules during the early morning hours (TIM5: 00:00–07:00), for example, in Rules 855 and 701. This pattern is consistent with the combined effect of fatigue and heavy-vehicle driving during these hours, which almost always leads to accidents in the same severity category.
Finally, truck accidents far from the city, particularly in the months of April, May, and June, are classified as severe with a confidence of 93.7%. In general terms, the rules with the highest confidence and lift involve buses and trucks.

4.4. Application of Classification and Regression Trees (CARTs)

For the Classification and Regression Tree (CART), the algorithm was run in R to identify hierarchically which variables best discriminate severity (IPA). The five IPA categories were reduced to three to facilitate the analysis. The model selected the root node (the most important variable) and then continued splitting to create specific risk profiles until reaching the terminal nodes. CART was applied to the same variables considered in the association rules and shown in Table 2.
To evaluate the model’s performance, the data were randomly divided into a training set (70%, n = 343) and a test set (30%, n = 148), maintaining the proportional distribution of IPA categories through stratified sampling. The model achieved an accuracy of 41.2% on the test set, with a Cohen’s Kappa coefficient of 0.11, which indicates slight agreement beyond chance and should be read in light of the multiclass nature of the problem (three categories) and the size of the dataset. Ten-fold cross-validation on the entire dataset gave an average accuracy of 48.5% (±6.0%), which indicates that performance is reasonably stable across folds (Table 4). These results are consistent with those of similar studies employing CART in multiclass contexts, where model interpretability takes precedence over maximizing predictive accuracy [59,67]. Since the training/test split was used solely for validation purposes, the tree presented in Figure 3 was constructed using the entire dataset (n = 491), in order to make the most of the available information and identify the hierarchy of factors and their interactions more robustly.
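A stratified 70/30 split of this kind can be sketched as follows. The class counts are those reported in Section 4.2 (149, 199, and 143 records); everything else, including the function name, the random seed, and the rule of rounding each class's training share down, is an assumption, since the text does not describe the exact procedure or R function used.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)  # assumption: round down per class
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Class counts as reported in the text; the record order is arbitrary
labels = ["IPA1"] * 149 + ["IPA2"] * 199 + ["IPA3"] * 143
train_idx, test_idx = stratified_split(labels)
```

Under this rounding assumption the split contains 343 training and 148 test records, which coincides with the sizes reported above; a different rounding rule could shift a few records between the two sets.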
The use of this CART model is justified because it prioritizes understanding the road traffic phenomenon over predictive accuracy alone: it offers a step-by-step visual logic that shows how collision types and traffic levels interact to determine the severity of an accident. It also serves as a strategic tool for road safety, since it achieves its best performance (F1-score = 0.51) in the most critical class (IPA3: high-severity accidents), which allows infrastructure managers to prioritize operational and control measures on the highest-risk road segments. Finally, it establishes a clear baseline that reflects the high randomness and intrinsic noise of road data without overfitting.
The application of CART resulted in the tree shown in Figure 3. The tree identifies the type of collision as the most important variable for separating the data. If the accident involves a collision with an object or a pedestrian, the model assigns it to IPA2 (low-to-moderate severity, range 5–10) as the most probable class. This branch represents 14% of the total sample. The other types of collision account for the remaining 86% of cases and require additional variables to determine severity.
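The evaluation metrics mentioned above (accuracy, Cohen's Kappa, and F1-score by class) can all be derived from a confusion matrix. The following sketch uses the standard formulas; the function name and the confusion matrix are hypothetical, since the study's own matrix is not reproduced in the text.

```python
def classification_metrics(confusion, classes):
    """Accuracy, Cohen's kappa and per-class F1 from a square confusion matrix.

    confusion[i][j] = number of cases of true class i predicted as class j.
    """
    k = len(classes)
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    accuracy = correct / n
    row_tot = [sum(confusion[i]) for i in range(k)]                          # true counts
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]    # predicted counts
    # Agreement expected by chance, from the marginal distributions
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    f1 = {}
    for i, name in enumerate(classes):
        denom = row_tot[i] + col_tot[i]  # equals 2*TP + FN + FP
        f1[name] = 2 * confusion[i][i] / denom if denom else 0.0
    return accuracy, kappa, f1

# Hypothetical confusion matrix (NOT the study's results)
toy_confusion = [
    [5, 3, 2],
    [2, 6, 2],
    [1, 2, 7],
]
accuracy, kappa, f1_by_class = classification_metrics(
    toy_confusion, ["IPA1", "IPA2", "IPA3"]
)
```

Kappa corrects accuracy for the agreement that the marginal class frequencies alone would produce, which is why it can be low even when accuracy looks moderate.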
The model also identifies a specific scenario in which severity is most likely to be IPA1 (the lowest level on the scale). The path would be: if the accident is not a collision with an object or pedestrian → traffic level = 1 (low) → collision type is run-off-road, rollover, or other. If the above conditions are met and the accident occurs on the sections corresponding to kilometers 0–10, 180–200, 20–40, 40–60, and 80–100, the model predicts IPA1. This node represents 32% of the data, making it the largest and most distinct group for IPA1 severity.
The tree identifies two clear paths leading to the highest-risk category visible in the graph (IPA3). The first indicates that if the crash type is not TYP6 and the traffic level is 2 or 3, the prediction falls directly into IPA3 (9% of cases), suggesting that as congestion increases, severity tends to increase; the second indicates that if the traffic level is low (Level 1) but the crash type is TYP2 (head-on collision), TYP3 (rear-end collision), or TYP4 (side collision), the model predicts IPA3 (15% of cases).
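The decision paths described in this section can be written as a short set of nested rules. This Python sketch only transcribes the paths reported in the text; it is not the fitted rpart object. The function name is hypothetical, the road sections are copied as listed above, and any branch the text does not describe returns None.

```python
# Road sections for which the text reports an IPA1 prediction (labels as listed in the text)
LOW_SEVERITY_SECTIONS = {"0-10", "180-200", "20-40", "40-60", "80-100"}

def predict_severity(collision, traffic_level, section):
    """Follow the decision paths described in the text; None if the leaf is not described."""
    if collision == "TYP6":                      # collision with object or pedestrian
        return "IPA2"
    if traffic_level in (2, 3):                  # medium or high traffic
        return "IPA3"
    if collision in {"TYP2", "TYP3", "TYP4"}:    # head-on, rear-end, side collision
        return "IPA3"
    # Remaining types at low traffic: run-off-road (TYP1), rollover (TYP5), other (TYP7)
    if section in LOW_SEVERITY_SECTIONS:
        return "IPA1"
    return None  # branch not described in the text
```

For example, a rear-end collision (TYP3) at low traffic follows the second IPA3 path, while a run-off-road accident (TYP1) at low traffic in the 80–100 km section is assigned to IPA1.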
Upon performing the combined analysis, it was found that while the association rules showed that TYP6 is strongly linked to IPA1 with high confidence, the CART tree refined this view by showing that, depending on the traffic context and location, this or other types of collisions can escalate to IPA2 or IPA3 severity levels. Furthermore, it was identified that heavy vehicles (VEC1-BUS and VEC2-Truck) and frontal/side collisions in dense traffic are the scenarios with the highest severity.

5. Discussion

The joint analysis using association rules and CART decision trees reveals critical patterns linking the nature of the accident, the type of vehicle, and the operating environment to the severity of the event. Furthermore, it takes advantage of the CART method, which does not require prior specification of the model’s functional form or the assumption of an additive relationship between dependent and independent variables, as proposed by Pagliara [62]. These findings are particularly relevant from a public health perspective: traffic accidents are the leading cause of death among young people aged 5 to 29 and cause nearly 50 million non-fatal injuries per year, many of which result in permanent disabilities [4], which directly impacts Target 3.6 of SDG 3 and reinforces the need for evidence-based interventions such as the one proposed in this study [5].
The results show a direct relationship between the type of crash (Crash_type) and the severity of injuries, in line with the approach of Beshah and Hill [59,68]. They also show that crash types involving buses (VEC1) and trucks (VEC2) are associated with consistent severity levels in the highest-confidence rules. This is consistent with the findings of Samerei [69], who notes that the presence of heavy vehicles and multi-vehicle collisions are additional factors contributing to the increase in fatalities in accidents involving buses. Likewise, Kashani [70] indicated that heavy-vehicle transport resulted in a high probability of accidents. The most severe scenarios identified in this study (heavy vehicles at night and in areas far from urban centers) are precisely those that cause the most severe injuries and have the greatest impact on healthcare systems. In accidents involving heavy vehicles, such as those repeatedly detected in the high-confidence rules, it is estimated that for every fatality, approximately 20 people suffer serious injuries [5], and in some countries, victims of traffic-related trauma account for up to 70% of beds in emergency and orthopedic units [5]. The cost of treating a patient for traffic-related trauma can be up to 15 times higher than that of other healthcare interventions, diverting critical resources and affecting the resilience of the healthcare system [71,72]. These data underscore that reducing the identified risk patterns, particularly those associated with TIM5, VEC2, and Proximity_F, is not merely a road safety measure but a direct action to protect public health and ensure the sustainability of the healthcare system, in line with the objectives of SDG 3.
Association rules indicate that collision type TYP6 (impact with object or pedestrian) involving a bus (VEC1) invariably results in severity IPA1 (confidence 1.00), a finding in line with Samerei [69], who argues that direct collisions between buses and pedestrians on roads greatly increase the probability of fatalities. This suggests that the mass of heavy vehicles reduces variability in the impact outcome, so that their collisions with pedestrians or fixed objects fall consistently into the same severity category. The CART decision tree shows that severity depends not only on the impact but also on traffic flow conditions. Under traffic levels 2 and 3, the tree predicts the highest severity (IPA3) for every collision type other than TYP6, a node that covers 9% of cases, regardless of whether the collision is a side-impact or a rear-end collision. Conversely, on specific road sections such as km 80–100 (DIS5), trucks tend to be involved in TYP7 crashes (other types of collision), which may be related to driver fatigue because these areas are far from the city (Proximity_F). It is also worth noting that road sections influence the frequency and severity of accidents, which can be linked to the road’s geometry, a finding also documented in the specialized literature [73,74]. The danger posed by the nighttime and early morning time blocks should also be highlighted. The combination of TIM5 (00:00–07:00) with vehicle type VEC2 (truck) is associated with IPA1 severity with a confidence exceeding 92%. These findings suggest that reduced visibility and fatigue in areas far from cities (Proximity_F) increase the severity of run-off-road accidents (TYP1), which supports the view that the early morning hours are the most dangerous because of reduced driver alertness [24,67].
Another interesting finding is that, according to the CART model, accident severity tends to rise as congestion increases. This is consistent with the findings of Kashani [70]. The incidence of accidents and the severity of injuries depend on traffic volume on this highway. It is also significant that, while the CART model identifies the type of collision as the primary data separator, the association rules add the month of the year (MON4, MON5) as an aggravating factor in remote areas. This indicates that prevention campaigns should be seasonal and geographically targeted at the identified kilometer ranges (e.g., DIS5 and DIS10).
The academic literature supports the combined use of data mining tools: while the CART algorithm defines the hierarchical structure of accident severity, association rules capture complex and specific scenarios (such as the interaction between the TIM5 block, the kilometer, and the vehicle type) that traditional linear models often omit, which is the advantage of using the two techniques in combination. The results of this study, particularly the identification of nighttime hours, remote rural areas, and heavy vehicles as factors associated with greater severity, directly contribute to the achievement of SDG 3 by specifying the scenarios that should be prioritized in public health and road safety policies. Identifying these patterns allows for the targeting of cost-effective preventive interventions that alleviate the burden on health systems and reduce the human, family, and economic impact of serious traffic accidents [11]. Anticipating these risk peaks with timely interventions is not only a road safety measure but a concrete action in favor of social protection, health resilience, and sustainable well-being.

6. Conclusions

The combined use of CART and association rules showed that applying these data mining tools together yields more comprehensive results than conventional linear models for this type of phenomenon, as it captures both the hierarchy of factors that cause accidents and the interrelationships among them. Having more accurate models to identify risk factors is, therefore, a concrete step toward policies that reduce this burden on people (primarily young people) and on healthcare systems, directly impacting Target 3.6 of SDG 3. Future studies on road safety should adopt this combined approach as the standard for analysis.
The methodology proposed and applied in this study can be replicated for any other type of road, whether urban, rural, or mountainous. However, this study was conducted on a specific section of a national highway on the Peruvian coast (North Pan-American Highway), characterized by a predominantly flat alignment with some curves, a dual carriageway in certain sections, and a high volume of heavy vehicles. Its results, therefore, cannot be extrapolated directly to other roads, for which it would be necessary to incorporate other important variables, such as the direction of traffic in the case of a mountain route, the type of pavement, and the width of the roadway, among others. Future studies could replicate this methodology on sections of different types while taking these additional variables into account.
The identified patterns reveal that a uniform road safety policy is insufficient: remote road segments (Proximity_F), nighttime and early morning hours (TIM5), and the operation of heavy vehicles create specific risk scenarios that require targeted control measures. Notable among these are fatigue checks in remote areas, nighttime speed limits for trucks, and infrastructure improvements at identified critical kilometers (DIS5, DIS10). These results have a direct impact on public health, as accidents occurring at night and in rural areas tend to result in the most severe injuries, which place a greater burden on healthcare systems and entail higher long-term rehabilitation costs for families and the government.
The emergence of MON4 and MON5 as aggravating factors in remote areas, combined with the influence of the road segment on the frequency and severity of accidents, indicates that road risk is not uniform and depends on both the time of year and geographic location. Control measures must, therefore, be seasonal and geographically targeted, evaluating road geometry and signage on the segments with the highest accident rates and reinforcing police presence at certain times of the year. Anticipating these risk peaks with timely interventions is not only a road safety measure but a concrete action for social protection and sustainable well-being.
The fact that heavy vehicles (buses and trucks) are key determinants of severity, and that their collisions with pedestrians or objects are linked to harm with a confidence of 1.00 in the association rules, suggests that stricter time- and location-based restrictions for heavy transport should be considered on the identified critical road segments, along with strengthening technical and human monitoring systems for these vehicles, in order to protect the lives, health, and well-being of communities.
In terms of public policy recommendations related to vehicle and time-of-day factors, the results point to three priority interventions. First, establish specific fatigue and speed controls for heavy vehicles on the DIS5 section (km 80–100), identified as the area with the highest concentration of truck accidents. Second, strengthen traffic enforcement and signage during the night-to-early-morning hours (00:00–07:00), which account for 34.8% of recorded accidents. Finally, consider differentiated time or speed restrictions for heavy transport on the identified critical road segments.
Additionally, the identified seasonal and geographic factors suggest two further interventions: implementing seasonal prevention campaigns focused on the months of April and May (MON4–MON5), when accident rates are highest in remote areas, and reviewing infrastructure on stretches of road far from urban centers (Proximity_F), where 61.9% of accidents occur and severity tends to be higher. Both measures should be coordinated with an assessment of road geometry and signage at the points with the highest accident rates.
More broadly, the results of this study reaffirm that road safety is an inseparable component of sustainable development. Reducing road traffic fatalities and the severity of accidents not only saves lives but also alleviates pressure on health systems, protects the economic stability of families, and contributes to the fulfillment of commitments made under the 2030 Agenda. It is recommended that the findings presented here serve as a basis for the design of comprehensive public policies that integrate road safety with the objectives of public health, social equity, and sustainable mobility.

Author Contributions

Conceptualization, C.C. and J.C.R.-R.; methodology, C.C. and M.d.C.P.-F.; software, C.C.; validation, M.d.C.P.-F. and J.C.R.-R.; formal analysis, C.C.; investigation, C.C.; resources, C.C.; writing – original draft preparation, C.C.; writing – review and editing, C.C., M.d.C.P.-F. and J.C.R.-R.; visualization, C.C.; supervision, J.C.R.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to legal reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. World Health Organization. Global Status Report on Road Safety 2023; World Health Organization: Geneva, Switzerland, 2023; Available online: https://repository.gheli.harvard.edu/repository/12838/ (accessed on 6 April 2026).
  2. World Health Organization Regional Office for Europe. Fact Sheet on Sustainable Development Goals (SDGs): Health Targets Road Safety; World Health Organization Regional Office for Europe: Copenhagen, Denmark, 2017; Available online: https://iris.who.int/server/api/core/bitstreams/d6e87d9f-cb15-4c56-97e9-f836037117d6/content (accessed on 18 May 2026).
  3. Mindell, J.S.; Watkins, S.J. Transport, Health and Inequality. An Overview of Current Evidence. J. Transp. Health 2024, 38, 101886.
  4. Monclús, J. Road Safety and the SDGs: A Guide for Private Sector Organizations; Fundación MAPFRE: Madrid, Spain, 2019.
  5. World Health Organization, Regional Office for South-East Asia. Accelerating Actions for Implementation of Decade of Action for Road Safety; World Health Organization, Regional Office for South-East Asia: New Delhi, India, 2017.
  6. Özaydın, Ö.; Kabak, Ö.; Topcu, Y.I.; Ülengin, F.; Önsel Ekici, Ş. Analysis of Direct and Indirect Relations among Sustainable Development Goals and Transportation Targets. Transp. Policy 2025, 171, 270–281.
  7. Ahmed, S.K.; Mohammed, M.G.; Abdulqadir, S.O.; El-Kader, R.G.A.; El-Shall, N.A.; Chandran, D.; Rehman, M.E.U.; Dhama, K. Road Traffic Accidental Injuries and Deaths: A Neglected Global Health Issue. Health Sci. Rep. 2023, 6, e1240.
  8. Blincoe, L.; Miller, T.R.; Wang, J.-S.; Swedler, D.; Coughlin, T.; Lawrence, B.; Guo, F.; Klauer, S.; Dingus, T. The Economic and Societal Impact of Motor Vehicle Crashes, 2019 (Revised); NHTSA Technical Report; National Highway Traffic Safety Administration: Washington, DC, USA, 2023. Available online: https://rosap.ntl.bts.gov (accessed on 10 December 2025).
  9. International Road Assessment Programme (iRAP). Safety Insights Explorer; International Road Assessment Programme (iRAP): Bracknell, UK, 2026; Available online: https://irap.org/safety-insights-explorer/ (accessed on 6 April 2026).
  10. Bezerra, B.S. Road Safety and Sustainable Development; Filho, W.L., Wall, T., Azul, A.M., Brandli, L., Özuyar, P.G., Eds.; Springer: Cham, Switzerland, 2019.
  11. Putatunda, A.; Al Haddad, C.; Antoniou, C. A Comprehensive Review of the Socio-Economic Appraisal Methodologies of the Road Safety Measures. Accid. Anal. Prev. 2025, 217, 108021.
  12. Litman, T. A New Traffic Safety Paradigm; Victoria Transport Policy Institute: Victoria, BC, Canada, 2026; Volume 29.
  13. Barua, U.; Tay, R. Severity of Urban Transit Bus Crashes in Bangladesh. J. Adv. Transp. 2010, 44, 36–41.
  14. OMS. La Seguridad Vial 2013. Informe Sobre la Situación Mundial de la Seguridad Vial 2013; World Health Organization (WHO): Geneva, Switzerland, 2013; Volume 12.
  15. Observatorio Nacional de Seguridad Vial (ONSV). Estadísticas de Siniestros de Tránsito 2023; Observatorio Nacional de Seguridad Vial (ONSV): Lima, Perú, 2025; Available online: http://www.onsv.gob.pe/ (accessed on 6 April 2026).
  16. Ministerio de Transportes y Comunicaciones. MTC Impulsa Proyecto de Recolección de Información Sobre Accidentes en Vías Concesionadas Para Reforzar la Seguridad; Ministerio de Transportes y Comunicaciones: Lima, Perú, 2025; Available online: https://www.gob.pe/institucion/mtc/noticias/1092860-mtc-impulsa-proyecto-de-recoleccion-de-informacion-sobre-accidentes-en-vias-concesionadas-para-reforzar-la-seguridad (accessed on 6 April 2026).
  17. International Transport Forum (ITF). Road Safety Annual Report 2024; International Transport Forum (ITF): Paris, France, 2024; Available online: https://www.itf-oecd.org/road-safety-annual-report-2024 (accessed on 6 April 2026).
  18. El-Achkar, J.; El-Gharib, M.; Ahmad, N.; Al-Hajj, S. The Burden of Road Traffic Injuries: A Global Perspective. Panor. Emerg. Med. 2026, 4, 2026.
  19. Tini, N.H.; Zaly Shah, M.; Sultan, Z. Impact of Road Transportation Network on Socio-Economic Well-Being: An Overview of Global Perspective. Int. J. Sci. Res. Sci. Eng. Technol. 2018, 4, 282–296.
  20. Sargazi, A.; Sargazi, A.; Kumar Nadakkavukaran Jim, P.; Ali Danesh, H.; Sargolzaee Aval, F.; Kiani, Z.; Hosein Lashkarinia, A.; Sepehri, Z. Economic Burden of Road Traffic Accidents; Report from a Single Center from South Eastern Iran; Trauma Research Center, Shiraz University of Medical Sciences: Shiraz, Iran, 2016; Volume 4, Available online: www.beat-journal.com (accessed on 6 April 2026).
  21. Ahmed, M.; Patnaik, J.L.; Whitestone, N.; Hossain, M.A.; Alauddin, M.; Husain, L.; Hossain, M.P.; Islam, M.S.; Hossain, M.I.; Imdad, K.; et al. Visual Impairment and Risk of Self-Reported Road Traffic Crashes Among Bus Drivers in Bangladesh. Asia. Pac. J. Ophthalmol. 2022, 11, 72–78.
  22. Nguyen, T.C.; Nguyen, M.H.; Armoogum, J.; Ha, T.T. Bus Crash Severity in Hanoi, Vietnam. Safety 2021, 7, 65.
  23. Verma, A.; Sasidharan, S.; Bhalla, K.; Allirani, H. Fatality Risk Analysis of Vulnerable Road Users from an Indian City. Case Stud. Transp. Policy 2022, 10, 269–277.
  24. Miller, K.A.; Filtness, A.J.; Anund, A.; Maynard, S.E.; Pilkington-Cheney, F. Contributory Factors to Sleepiness amongst London Bus Drivers. Transp. Res. Part F Traffic Psychol. Behav. 2020, 73, 415–424.
  25. Law, T.H.; Daud, M.S.; Hamid, H.; Haron, N.A. Development of Safety Performance Index for Intercity Buses: An Exploratory Factor Analysis Approach. Transp. Policy 2017, 58, 46–52.
  26. Zegeer, C.V.; Huang, H.F.; Stutts, J.C.; Rodgman, E.; Hummer, J.E. Commercial Bus Accident Characteristics and Roadway Treatments. Transp. Res. Rec. 1994, 1467, 14–22.
  27. Kaplan, S.; Prato, C.G. Risk Factors Associated with Bus Accident Severity in the United States: A Generalized Ordered Logit Model. J. Saf. Res. 2012, 43, 171–180.
  28. Polo, J.R.; Riveros, C.C.; Diaz, W.A.; Cansaya, A.C.; Anticona, M.R. Caracterización Del Nivel de Estrés de Alumnos de Ingeniería Mediante Herramientas de Data Mining. In Proceedings of the LACCEI International Multi-Conference for Engineering, Education and Technology 2021, Virtual, 19–23 July 2021.
  29. Spanos, A. Revisiting Data Mining: ‘Hunting’ with or without a License. J. Econ. Methodol. 2000, 7, 231–264.
  30. Xianfang, T.; Yachao, J.; Ru, Z. The Infiltration of Mathematical Modeling Thoughts in College Mathematics Teaching. J. Phys. Conf. Ser. 2019, 1168, 052018.
  31. Wu, Y. The Modes of Data Development in the Internet Age. Data Sci. J. 2007, 6, 962–967.
  32. Dandge, S.S.; Chakraborty, S. A Data Mining Approach for Analysis of a Wire Electrical Discharge Machining Process. Manag. Prod. Eng. Rev. 2021, 12, 116–128.
  33. Wang, J.; Ma, S.; Jiao, P.; Ji, L.; Sun, X.; Lu, H. Analyzing the Risk Factors of Traffic Accident Severity Using a Combination of Random Forest and Association Rules. Appl. Sci. 2023, 13, 8559.
  34. Ghomi, H.; Bagheri, M.; Fu, L.; Miranda-Moreno, L.F. Analyzing Injury Severity Factors at Highway Railway Grade Crossing Accidents Involving Vulnerable Road Users: A Comparative Study. Traffic Inj. Prev. 2016, 17, 833–841.
  35. Chand, S.; Li, Z.; Alsultan, A.; Dixit, V.V. Comparing and Contrasting the Impacts of Macro-Level Factors on Crash Duration and Frequency. Int. J. Environ. Res. Public Health 2022, 19, 5726.
  36. Li, F.; Jiang, K. Application of Random-Parameter Negative Binomial Model to Examine the Relationship between the Severity of Traffic Accident. In Proceedings of the 2020 IEEE 5th International Conference on Intelligent Transportation Engineering, ICITE 2020, Beijing, China, 11–13 September 2020; pp. 351–354.
  37. Mahmud, A.; Gayah, V.V. Estimation of Crash Type Frequencies on Individual Collector Roadway Segments. Accid. Anal. Prev. 2021, 161, 106345.
  38. Ghadban, N.R.; Abdella, G.M.; Alhajyaseen, W.; Al-Khalifa, K.N. Analyzing the Impact of Human Characteristics on the Comprehensibility of Road Traffic Signs. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Bandung, Indonesia, 6–8 March 2018; pp. 2210–2219.
  39. Kraidi, R.; Evdorides, H. Pedestrian Safety Models for Urban Environments with High Roadside Activities. Saf. Sci. 2020, 130, 104847.
  40. Mujalli, R.O.; de Ona, J. Injury Severity Models for Motor Vehicle Accidents: A Review. Proc. Inst. Civ. Eng. Transp. 2013, 166, 255–270.
  41. Witten, I.; Frank, E.; Hall, M. Data Mining, 3rd ed.; Morgan Kaufmann Publishers: Burlington, MA, USA, 2011.
  42. Agrawal, R.; Imieliński, T.; Swami, A. Mining Association Rules between Sets of Items in Large Databases. J. SIGMOD Rec. 1993, 22, 207–216. [Google Scholar] [CrossRef]
  43. Pande, A.; Abdel-Aty, M. Discovering Indirect Associations in Crash Data through Probe Attributes. Transp. Res. Rec. 2008, 2083, 170–179. [Google Scholar] [CrossRef]
  44. Montella, A. Identifying Crash Contributory Factors at Urban Roundabouts and Using Association Rules to Explore Their Relationships to Different Crash Types. Accid. Anal. Prev. 2011, 43, 1451–1463. [Google Scholar] [CrossRef] [PubMed]
  45. Cheng, C.W.; Lin, C.C.; Leu, S.-S. Use of Association Rules to Explore Cause-Effect Relationships in Occupational Accidents in the Taiwan Construction Industry. Saf. Sci. 2010, 48, 436–444. [Google Scholar] [CrossRef]
  46. Montella, A.; Aria, M.; D’Ambrosio, A.; Mauriello, F. Analysis of Powered Two-Wheeler Crashes in Italy by Classification Trees and Rules Discovery. Accid. Anal. Prev. 2012, 49, 58–72. [Google Scholar] [CrossRef]
  47. Daher, J.R.; Chilkaka, S.; Younes, A.; Shaban, K. Association Rule Mining on Five Years of Motor Vehicle Crashes. MATEC Web Conf. 2016, 81, 02017. [Google Scholar] [CrossRef]
  48. Wang, K.; Qin, X. Exploring Driver Error at Intersections: Key Contributors and Solutions. Transp. Res. Rec. 2015, 2514, 1–9. [Google Scholar] [CrossRef]
  49. Liu, S.; Kang, L.; Sun, H.; Wu, J.; Amihere, S. Exploring the Factors of Major Road Traffic Accidents: A Case Study of China. Front. Eng. Manag. 2025, 12, 414–424. [Google Scholar] [CrossRef]
  50. Tariq, M.; Mehmood, N.Q.; Mahfooz, S.Z. Discovering Associated Factors behind Road Accidents Using Association Rule Mining: A Case Study from Gujarat, Pakistan. World J. Adv. Res. Rev. 2022, 15, 001–011. [Google Scholar] [CrossRef]
  51. Gu, C.; Xu, J.; Gao, C.; Mu, M.; E, G.; Ma, Y. Multivariate Analysis of Roadway Multi-Fatality Crashes Using Association Rules Mining and Rules Graph Structures: A Case Study in China. PLoS ONE 2022, 17, e0276817. [Google Scholar] [CrossRef] [PubMed]
  52. Huang, S.; Jin, C.; Chen, T.; Wang, Z.W.; Wang, J. Analysis of Major Road Traffic Accident Causes Using a Combined Method of Association Rule and Complex Network. J. Adv. Transp. 2025, 2025, 8714444. [Google Scholar] [CrossRef]
  53. Grochtmann, M.; Grimm, K. Classification Trees for Partition Testing. Softw. Test. Verif. Reliab. 1993, 3, 63–82. [Google Scholar] [CrossRef]
  54. Azhar, A.; Ariff, N.M.; Bakar, M.A.A.; Roslan, A. Classification of Driver Injury Severity for Accidents Involving Heavy Vehicles with Decision Tree and Random Forest. Sustainability 2022, 14, 4101. [Google Scholar] [CrossRef]
  55. Le, K.G.; Tran, Q.H.; Do, V.M. Urban Traffic Accident Features Investigation to Improve Urban Transportation Infrastructure Sustainability by Integrating GIS and Data Mining Techniques. Sustainability 2024, 16, 107. [Google Scholar] [CrossRef]
  56. Abdullah, P.; Sipos, T. Drivers’ Behavior and Traffic Accident Analysis Using Decision Tree Method. Sustainability 2022, 14, 11339. [Google Scholar] [CrossRef]
  57. Megnidio-Tchoukouegno, M.; Adedeji, J.A. Machine Learning for Road Traffic Accident Improvement and Environmental Resource Management in the Transportation Sector. Sustainability 2023, 15, 2014. [Google Scholar] [CrossRef]
  58. Wang, H.; Liang, G. Association Rules Between Urban Road Traffic Accidents and Violations Considering Temporal and Spatial Constraints: A Case Study of Beijing. Sustainability 2025, 17, 1680. [Google Scholar] [CrossRef]
  59. Chen, M.M.; Chen, M.C. Modeling Road Accident Severity with Comparisons of Logistic Regression, Decision Tree and Random Forest. Information 2020, 11, 270. [Google Scholar] [CrossRef]
  60. Yang, X.; Ji, Y.; Gu, J.; Niu, M. An Electricity Consumption Disaggregation Method for HVAC Terminal Units in Sub-Metered Buildings Based on CART Algorithm. Buildings 2023, 13, 967. [Google Scholar] [CrossRef]
  61. Kim, H.; Kim, W.; Kim, J.; Lee, S.J.; Yoon, D.; Jo, J. A Study on Re-engagement and Stabilization Time on Take-over Transition in a Highly Automated Driving System. Electronics 2021, 10, 344. [Google Scholar] [CrossRef]
  62. Pagliara, F.; Mauriello, F.; Ping, Y. Analyzing the Impact of High-Speed Rail on Tourism with Parametric and Non-Parametric Methods: The Case Study of China. Sustainability 2021, 13, 3416. [Google Scholar] [CrossRef]
  63. SUTRAN. Reporte Estadístico de Siniestros Viales 2022; SUTRAN: Lima, Peru, 2023; Available online: https://www.gob.pe/institucion/sutran/informes-publicaciones/4171345-reporte-estadistico-de-siniestros-viales-2022 (accessed on 22 April 2026).
  64. Corrales, C.A.; Atoche, W.J. Modelo básico de simulación de accidentes de unidades de transporte público pesado en una carretera peruana. In Proceedings of the XIX Congreso Internacional de Prevención de Riesgos Laborales (ORP 2019), Sevilla, España, 5–7 June 2019; Fundación Editorial ORP: Barcelona, España, 2019; pp. 1088–1099. [Google Scholar]
  65. Office of Rail and Road. Common Safety Indicators: Assessment of Achievement of Safety Targets for 2021; Office of Rail and Road: London, UK, 2023. Available online: https://dataportal.orr.gov.uk/media/2184/common-safety-indicators-2021.pdf (accessed on 3 June 2026).
  66. Ministerio de Transportes y Comunicaciones del Perú. Directiva N° 002-2005-MTC/15: Mecanismos de Información Masiva sobre Niveles de Accidentalidad y Formalidad de las Empresas de Transporte Interprovincial de Personas; Dirección General de Circulación Terrestre: Lima, Perú, 2005; Available online: http://www2.sutran.gob.pe/portal/images/ranking_ipa/dirrectiva-002_accidentalidad-ipa.pdf (accessed on 3 June 2026).
  67. Pakgohar, A.; Tabrizi, R.S.; Khalili, M.; Esmaeili, A. The Role of Human Factor in Incidence and Severity of Road Crashes Based on the CART and LR Regression: A Data Mining Approach. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2011; Volume 3, pp. 764–769. [Google Scholar] [CrossRef]
  68. Beshah, T.; Hill, S. Mining Road Traffic Accident Data to Improve Safety: Role of Road-Related Factors on Accident Severity in Ethiopia. In Proceedings of the AAAI Spring Symposium: Artificial Intelligence for Development, Stanford, CA, USA, 22–24 March 2010. [Google Scholar]
  69. Samerei, S.A.; Aghabayk, K.; Mohammadi, A.; Shiwakoti, N. Data Mining Approach to Model Bus Crash Severity in Australia. J. Saf. Res. 2021, 76, 73–82. [Google Scholar] [CrossRef]
  70. Kashani, A.T.; Zandi, K.; Okabe, A. Investigation of Factors Associated with Heavy Vehicle Crashes in Iran (Tehran–Qazvin Freeway). Sustainability 2023, 15, 10497. [Google Scholar] [CrossRef]
  71. Bonnet, E.; Nikiema, A.; Sana, I.; Guiard-Schmid, J.-B.; Petitfour, L. Assessing the Burden of Road Traffic Injuries: A One-Year Prospective Study in Ouagadougou, Burkina Faso. F1000Research 2025, 14, 1112. [Google Scholar] [CrossRef]
  72. Meskarpour Amiri, M.; Bahadori, M.; Mehrabi-Tavana, A. The Dilemma of Road Traffic Accidents in Iran. Int. J. Med. Rev. 2017, 4, 91–92. [Google Scholar] [CrossRef]
  73. Costa, J.O.D.; Freitas, E.F.; Jacques, M.A.P.; Pereira, P.A.A. Collision Prediction Models with Longitudinal Data: An Analysis of Contributing Factors in Collision Frequency in Road Segments in Portugal. In Proceedings of the 17th International Conference Road Safety on Five Continents (RS5C 2016), Rio de Janeiro, Brazil, 17–19 May 2016; pp. 1–12. [Google Scholar]
  74. Besharati, M.M.; Tavakoli Kashani, A. Factors Contributing to Intercity Commercial Bus Drivers’ Crash Involvement Risk. Arch. Environ. Occup. Health 2018, 73, 243–250. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of a decision tree model.
Figure 2. Phases of the proposed methodological procedure.
Figure 3. Application of regression and classification trees (CARTs).
Table 1. Variables to be considered in the study.
| Category | Factor | Abbreviation |
|---|---|---|
| Road factors | City proximity | City_prox |
| Road factors | Place in which crash occurred | Kilometer |
| Road factors | Traffic level | Traffic_L |
| Time factors | Month of year | Month |
| Time factors | Day of week | Day |
| Time factors | Time of day | Time |
| Vehicle factors | Vehicle type involved | Vec_type |
| Vehicle factors | Number of vehicles involved | Vec_num |
| Other factors | Type of crash | Crash_type |
| Other factors | Crash severity | Severity |
Table 2. Characterization of the variables to be considered in the study.
| Factor group | Factor | Abbreviation | Category | Code | Definitions |
|---|---|---|---|---|---|
| Road factors | City proximity | City_prox | N/F | | Near/Far |
| Road factors | Place in which crash occurred | Kilometer | DIS1/DIS2/DIS3/DIS4/DIS5/DIS6/DIS7/DIS8/DIS9/DIS10 | 1/2/3/4/5/6/7/8/9/10 | KM 0–10/10–20/20–40/40–60/60–80/80–100/100–120/120–140/140–160/160–180/180–200 |
| Road factors | Traffic level | Traffic_L | TL1/TL2/TL3 | 1/2/3 | Low/Medium/High |
| Time factors | Month of year | Month | MON1/MON2/MON3/MON4/MON5/MON6/MON7/MON8/MON9 | 1/2/3/4/5/6/7/8/9 | January/February/March/April, May, June/July/August/September, October/November/December |
| Time factors | Day of week | Day | DAY1/DAY2/DAY3/DAY4/DAY5/DAY6/DAY7 | 1/2/3/4/5/6/7 | Sunday/Monday/Tuesday/Wednesday/Thursday/Friday/Saturday |
| Time factors | Time of day | Time | TIM1/TIM2/TIM3/TIM4/TIM5 | 1/2/3/4/5 | 07:00–11:00/11:00–14:00/14:00–18:00/18:00–24:00/00:00–07:00 |
| Vehicle factors | Vehicle type involved | Vec_type | VEC1/VEC2/VEC3/VEC4/VEC5 | 1/2/3/4/5 | Bus/Truck/Private car/Bus-truck/Others |
| Vehicle factors | Number of vehicles involved | Vec_num | NUM1/NUM2/NUM3/NUM4/NUM5 | 1/2/3/4/5 | 1/2/3/4/Multiple |
| Other factors | Type of crash | Crash_type | TYP1/TYP2/TYP3/TYP4/TYP5/TYP6/TYP7 | 1/2/3/4/5/6/7 | Run off the road/Head-on collision/Rear-end collision/Sideswipe collision/Turnover/Hit object or pedestrian/Others |
| Other factors | Crash severity | IPA | IPA1/IPA2/IPA3 | 1/2/3 | IPA 0–1/IPA 2–4/IPA ≥ 5 |
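As a minimal sketch of how the coding scheme in Table 2 could be applied, the following Python snippet maps an hour of day to its TIM code and turns a coded record into item labels of the form that appears in Table 3 (e.g., TimeBin_TIM4). The function names, the record layout, and the treatment of the intervals as half-open (so that 11:00 falls in TIM2) are assumptions made for illustration; the article does not state how boundary times were assigned.

```python
# Illustrative only: helper names and boundary handling are assumptions,
# not the authors' preprocessing code.

def time_bin(hour: int) -> str:
    """Map an hour of day (0-23) to the Table 2 time-of-day code."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if hour < 7:
        return "TIM5"   # 00:00-07:00
    if hour < 11:
        return "TIM1"   # 07:00-11:00
    if hour < 14:
        return "TIM2"   # 11:00-14:00
    if hour < 18:
        return "TIM3"   # 14:00-18:00
    return "TIM4"       # 18:00-24:00


def to_items(record: dict) -> set:
    """Build 'Prefix_CODE' item labels, one per coded variable."""
    return {f"{prefix}_{code}" for prefix, code in record.items()}


# Hypothetical record: a truck (VEC2) crash at 22:00, far from a city.
example = {
    "TimeBin": time_bin(22),
    "VehType": "VEC2",
    "Proximity": "F",
}
items = to_items(example)
print(sorted(items))
```

Each record becomes a set of such items, which is the input shape that association rule mining expects.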
Table 3. Results of applying the association rules.
Top Association Rules for Crash Type/Severity

| Rule | Antecedents | Consequents | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 87 | (CrashType_TYP6) | (SeverityBin_IPA1) | 0.130346 | 1.000000 | 1.316354 |
| 218 | (Proximity_F, CrashType_TYP6) | (SeverityBin_IPA1) | 0.063136 | 1.000000 | 1.316354 |
| 287 | (Proximity_N, CrashType_TYP6) | (SeverityBin_IPA1) | 0.067210 | 1.000000 | 1.316354 |
| 439 | (VehType_VEC1, CrashType_TYP6) | (SeverityBin_IPA1) | 0.089613 | 1.000000 | 1.316354 |
| 488 | (NumVeh_NUM1, CrashType_TYP6) | (SeverityBin_IPA1) | 0.118126 | 1.000000 | 1.316354 |
| 684 | (NumVeh_NUM1, Proximity_F, CrashType_TYP6) | (SeverityBin_IPA1) | 0.057026 | 1.000000 | 1.316354 |
| 742 | (Proximity_N, NumVeh_NUM1, CrashType_TYP6) | (SeverityBin_IPA1) | 0.061100 | 1.000000 | 1.316354 |
| 886 | (VehType_VEC1, NumVeh_NUM1, CrashType_TYP6) | (SeverityBin_IPA1) | 0.085540 | 1.000000 | 1.316354 |
| 724 | (Proximity_N, NumVeh_NUM1, VehType_VEC2) | (SeverityBin_IPA1) | 0.059063 | 0.966667 | 1.272475 |
| 349 | (Weekday_DAY4, NumVeh_NUM1) | (SeverityBin_IPA1) | 0.057026 | 0.965517 | 1.270962 |
| 820 | (VehType_VEC2, TimeBin_TIM4, NumVeh_NUM2) | (CrashType_TYP7) | 0.050916 | 0.961538 | 2.000489 |
| 890 | (VehType_VEC1, CrashType_TYP6) | (SeverityBin_IPA1, NumVeh_NUM1) | 0.085540 | 0.954545 | 3.023754 |
| 371 | (TimeBin_TIM3, CrashType_TYP7) | (SeverityBin_IPA1) | 0.067210 | 0.942857 | 1.241134 |
| 855 | (NumVeh_NUM1, TimeBin_TIM5, VehType_VEC2) | (SeverityBin_IPA1) | 0.063136 | 0.939394 | 1.236575 |
| 510 | (Proximity_F, VehType_VEC2, MonthBin_MON4) | (SeverityBin_IPA1) | 0.061100 | 0.937500 | 1.234082 |
| 319 | (VehType_VEC2, MonthBin_MON4) | (SeverityBin_IPA1) | 0.083503 | 0.931818 | 1.226602 |
| 279 | (Proximity_N, NumVeh_NUM1) | (SeverityBin_IPA1) | 0.138493 | 0.931507 | 1.226193 |
| 549 | (TimeBin_TIM4, Proximity_F, VehType_VEC2) | (SeverityBin_IPA1) | 0.052953 | 0.928571 | 1.222329 |
| 580 | (TimeBin_TIM5, Proximity_F, VehType_VEC2) | (SeverityBin_IPA1) | 0.052953 | 0.928571 | 1.222329 |
| 701 | (Proximity_N, NumVeh_NUM1, TimeBin_TIM5) | (SeverityBin_IPA1) | 0.050916 | 0.925926 | 1.218846 |
| 776 | (NumVeh_NUM1, VehType_VEC2, MonthBin_MON4) | (SeverityBin_IPA1) | 0.050916 | 0.925926 | 1.218846 |
| 912 | (CrashType_TYP1, VehType_VEC2) | (SeverityBin_IPA1, NumVeh_NUM1) | 0.097760 | 0.923077 | 2.924069 |
| 461 | (CrashType_TYP1, VehType_VEC2) | (SeverityBin_IPA1) | 0.097760 | 0.923077 | 1.215096 |
| 908 | (CrashType_TYP1, NumVeh_NUM1, VehType_VEC2) | (SeverityBin_IPA1) | 0.097760 | 0.923077 | 1.215096 |
| 1022 | (CrashType_TYP1, Proximity_F, VehType_VEC2) | (SeverityBin_IPA1, NumVeh_NUM1) | 0.069246 | 0.918919 | 2.910898 |
| 631 | (CrashType_TYP1, Proximity_F, VehType_VEC2) | (SeverityBin_IPA1) | 0.069246 | 0.918919 | 1.209622 |
| 1017 | (CrashType_TYP1, NumVeh_NUM1, Proximity_F, VehType_VEC2) | (SeverityBin_IPA1) | 0.069246 | 0.918919 | 1.209622 |
| 402 | (TimeBin_TIM5, VehType_VEC2) | (SeverityBin_IPA1) | 0.091650 | 0.918367 | 1.208896 |
| 1047 | (VehType_VEC2, SeverityBin_IPA1, Proximity_F, NumVeh_NUM2) | (CrashType_TYP7) | 0.067210 | 0.916667 | 1.907133 |
| 1044 | (VehType_VEC2, Proximity_F, CrashType_TYP7, NumVeh_NUM2) | (SeverityBin_IPA1) | 0.067210 | 0.916667 | 1.206658 |
| 141 | (MonthBin_MON5, Proximity_F) | (SeverityBin_IPA1) | 0.065173 | 0.914286 | 1.203524 |
| 452 | (NumVeh_NUM1, VehType_VEC2) | (SeverityBin_IPA1) | 0.171079 | 0.913043 | 1.201888 |
| 746 | (Proximity_N, CrashType_TYP6) | (SeverityBin_IPA1, NumVeh_NUM1) | 0.061100 | 0.909091 | 2.879765 |
| 491 | (CrashType_TYP6) | (SeverityBin_IPA1, NumVeh_NUM1) | 0.118126 | 0.906250 | 2.870766 |
| 949 | (SeverityBin_IPA1, Proximity_F, NumVeh_NUM2, MonthBin_MON4) | (CrashType_TYP7) | 0.059063 | 0.906250 | 1.885461 |
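The three measures reported in Table 3 are linked by simple count ratios: support is the share of records that contain both antecedent and consequent, confidence divides that joint count by the antecedent count, and lift divides confidence by the consequent's own share of records. The sketch below recomputes the rows indexed 87 and 890 under stated assumptions: a dataset of N = 491 records (the sum of the training and test sets in Table 4) and record counts that are back-calculated from the reported values rather than given in the article. The function name is hypothetical.

```python
# Illustrative only: the counts are back-calculated from Table 3 and
# are assumptions, not figures reported by the authors.

def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for a rule A -> C from counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


N = 491  # assumed dataset size (343 training + 148 test records)

# Row 87: (CrashType_TYP6) -> (SeverityBin_IPA1)
rule_87 = rule_metrics(N, n_antecedent=64, n_consequent=373, n_both=64)

# Row 890: (VehType_VEC1, CrashType_TYP6) -> (SeverityBin_IPA1, NumVeh_NUM1)
rule_890 = rule_metrics(N, n_antecedent=44, n_consequent=155, n_both=42)

for name, (s, c, l) in (("87", rule_87), ("890", rule_890)):
    print(f"rule {name}: support={s:.6f} confidence={c:.6f} lift={l:.6f}")
```

Because lift divides by the consequent's base rate, a rule with confidence 1.0 for a very common consequent still yields a modest lift (1.32 for row 87), whereas a rarer compound consequent can produce a lift near 3 at lower confidence (row 890).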
Table 4. CART model performance metrics.
| Metric | Value | Notes |
|---|---|---|
| Data partition | | |
| Training set (70%) | n = 343 | IPA1: 104/IPA2: 139/IPA3: 100 |
| Test set (30%) | n = 148 | IPA1: 45/IPA2: 60/IPA3: 43 |
| Model performance (test set) | | |
| Accuracy | 41.2% | Proportion of correctly classified cases |
| Cohen’s Kappa | 0.11 | Agreement beyond chance |
| Cross-validation (10-fold) | | |
| CV Accuracy (mean) | 48.5% | Mean over 10 folds |
| CV Accuracy (SD) | ±6.0% | |
| F1-Score by class (test set) | | |
| F1, IPA1 (low severity) | 0.32 | |
| F1, IPA2 (moderate severity) | 0.41 | |
| F1, IPA3 (high severity) | 0.51 | |
| Variable importance (top) | | |
| 1st: Crash type (CRASH_TYPE) | 36.5% | |
| 2nd: Traffic level (TRAFFIC_L) | 35.3% | |
| 3rd: Number of vehicles (VEC_NUM) | 11.0% | |

Note: Stratified 70/30 split. CART criterion: Gini impurity.
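To make the metrics in Table 4 concrete, the sketch below computes accuracy, Cohen's kappa, and per-class F1 from a confusion matrix in plain Python. The article does not report the confusion matrix, so the 3 × 3 matrix used here is a hypothetical toy example and its outputs are not the study's results. Only the test-set class counts (45/60/43) are taken from Table 4, and they are used to derive a majority-class baseline for comparison. All function names are assumptions.

```python
# Illustrative only: `toy` is a hypothetical confusion matrix, not the
# authors' results. Rows are true classes, columns are predictions.

def accuracy(cm):
    """Share of records on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = accuracy(cm)
    row_sums = [sum(cm[i]) for i in range(k)]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    return (p_o - p_e) / (1 - p_e)


def f1_per_class(cm):
    """F1 = 2*TP / (2*TP + FP + FN) for each class."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[i][c] for i in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


toy = [
    [5, 3, 2],
    [2, 6, 2],
    [1, 2, 7],
]
print(round(accuracy(toy), 3), round(cohen_kappa(toy), 3))
print([round(f, 3) for f in f1_per_class(toy)])

# Majority-class baseline from the test-set counts in Table 4.
test_counts = {"IPA1": 45, "IPA2": 60, "IPA3": 43}
baseline = max(test_counts.values()) / sum(test_counts.values())
print(round(baseline, 3))
```

With these class counts, always predicting IPA2 would score 60/148 ≈ 40.5%, which gives a reference point for reading the reported test accuracy of 41.2%.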
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
