Article

An Empirical Analysis of Crash Injury Severity Among Young Drivers in England: Accounting for Data Imbalance

1
Department of Motor Vehicles, Institute of Land and Maritime Transport, Faculty of Transport and Mechanical Systems, Technical University of Berlin (TU Berlin), 13355 Berlin, Germany
2
Department of Transportation and Hydraulic Engineering, School of Rural and Surveying Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
3
Innovative Transportation Research Institute, Department of Transportation Studies, Texas Southern University, Houston, TX 77004, USA
4
Future Mobility Center, Huddersfield Business School, University of Huddersfield, Huddersfield HD1 3DH, UK
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4793; https://doi.org/10.3390/app15094793
Submission received: 24 February 2025 / Revised: 16 April 2025 / Accepted: 17 April 2025 / Published: 25 April 2025
(This article belongs to the Special Issue Sustainable Urban Mobility)

Abstract

Featured Application

The findings offer valuable guidance for creating targeted interventions to reduce crash severity among young drivers by addressing significant variables influencing crash severity. By leveraging the RUMC-CART methodology to handle imbalanced data, the study contributes to the broader theme of sustainable urban mobility and enhances overall traffic safety, particularly for young drivers.

Abstract

Crash data analysis is key to improving road safety, but imbalanced data challenges accurate predictions for severe crashes, often leading to biased outcomes. This study investigates crash severity among young drivers (aged 17–24) in England, using crash data collected between April 2019 and February 2022. To address the imbalance issue, the performance of a standard classification and regression tree (CART) model is compared with a modified approach—random undersampling of the majority class CART (RUMC-CART). Although RUMC-CART yields slightly lower accuracy, it demonstrates superior performance in identifying severe crashes. Key contributing factors—categorized as type of vehicle and vulnerabilities, number of vehicles and casualties, area type (urban vs. rural), vehicle maneuvers and dynamic factors, and minor influences and timeline—are shown to significantly impact injury severity outcomes among young drivers. The findings of the study provide valuable insights for developing targeted interventions to enhance road safety.

1. Introduction

Road traffic crashes constitute a key public health issue globally, resulting in significant property damage and economic losses [1]. Each year, around 1.35 million people are killed and 20–50 million are injured in road traffic crashes around the world, representing the ninth leading cause of death overall, in addition to being the leading cause of death among children and young adults aged 5–29 [2]. The World Health Organization (WHO) has urged public authorities to take immediate actions towards implementing safety strategies that embrace the key principles of the safe systems approach, emphasizing the need to address risky driving behaviors, such as speeding, non-compliance with traffic regulations, illegal overtaking, and lane violations [3]. Risky and aggressive driving behaviors have been linked to various characteristics, such as driving experience, gender, income, and education. Driving experience plays a key role in shaping driving habits [4], with new drivers often facing a higher risk of road crashes due to inexperience.
This study aims to shed light on the key factors influencing road traffic crashes among young drivers aged 17 to 24. Previous research indicates that young drivers are particularly prone to risky behaviors, making them a vulnerable group in terms of road safety [5,6]. Understanding the factors contributing to the severity of crashes involving young drivers is crucial for developing targeted safety interventions. In this context, this study aims to investigate the severity of crashes involving young drivers in England (from April 2019 to February 2022, a period covering before, during, and slightly after the outbreak of the COVID-19 pandemic), with the key research questions being as follows:
  • How can the severity of crashes involving young drivers (aged 17–24) in England be accurately predicted?
  • What are the influential factors affecting crash severity, also considering specific traffic dynamics and behavioral trends observed during the study period?
To address these questions and gain granular insights into the determinants of crash outcomes involving young drivers, we developed a decision tree prediction model and attempted to enhance its performance by implementing the random undersampling of the majority class (RUMC) method on our decision tree CART model. This integrated approach accounts for data imbalance, which is typically observed in crash severity data and can lead to biased statistical estimates if left unaccounted for. For the quantitative analysis of the study, crash data were drawn from the STATS19 database of the United Kingdom. The outcomes of this study can inform targeted safety measures and strategies for increasing the road safety of young drivers, who constitute a particularly vulnerable population group. In addition, the findings of the analysis can shed light on the predictive capabilities of the RUMC-CART model, especially in the context of crash injury severities, where it has not been extensively employed.

2. Background

This section provides an overview of the factors affecting crash injury severities by focusing on the characteristics of novice drivers and delving into the determinants of crash risk among them. Moreover, it investigates the approaches typically employed in exploring crash severity and highlights the importance of resampling methods to address data imbalance.

2.1. Driving Experience

Novice drivers, who are usually inexperienced, tend to struggle with identifying hazards on the road and reacting quickly compared to experienced drivers, thus being more likely to face challenging driving situations [7,8,9]. Their focus is often limited to controlling the vehicle, which results in them not scanning their surroundings as thoroughly and underestimating risks [7,10,11]. On the other hand, experienced drivers have a grasp of the overall driving environment and are less likely to break traffic rules unless they feel completely in control [11,12]. Studies on driving distractions have found that external distractions can significantly affect how well both new and experienced drivers anticipate hazards while on the road [13,14]. New drivers are greatly impacted by using phones while driving and tend to struggle more with making maneuvers compared to experienced drivers [7,15]. Additionally, novice drivers are more prone to showing aggressive behavior, and do so in different driving situations than experienced drivers [16,17].

2.2. Young Drivers’ Crash Risk

Young drivers face challenging situations that can lead to crashes, including behavioral and cognitive aspects [8,18]. Weather conditions, like rain, snow, fog, and smoke, play a role in increasing the risk of crashes for drivers, as they affect visibility and road conditions. It is important for young drivers to be extra cautious in such situations [18,19]. Teenage drivers are less likely to become involved in crashes at night; however, they are more prone to being injured or killed compared to adults, especially between midnight and 6 am [20,21]. Speeding is a common behavior among young drivers, especially on roads where the speed limit exceeds 55 mph, underscoring the need to enforce speed limits and to increase awareness about their importance [22,23]. Additionally, rural roads pose a greater risk for teen drivers than urban roads because of the presence of two-lane highways [24,25]. Age and driving experience independently impact crash risk, with young drivers experiencing the highest crash rates within the first month of licensure, which subsequently decline with accumulated driving experience [26,27]. Notably, 18-year-old drivers exhibit a greater vulnerability to severe injuries compared to 19-year-old drivers [19]. Alcohol-impaired driving is a distinct risk factor, with drivers below the age of 21 facing significantly higher injury crash risks at any blood alcohol concentration level [28,29]. Young drivers between the ages of 16 and 17 operating heavy vehicles, particularly pick-up trucks, demonstrate a higher likelihood of incurring severe injuries in a crash [30]. Seatbelt usage among teen drivers remains suboptimal, especially when driving alone or with passengers, highlighting the need for continuous education and the enforcement of seatbelt use [31].

2.3. Research Gap and Objectives

In this study, we consider a diverse array of potentially influential factors, encompassing road and environmental conditions as well as vehicle and human factors, to investigate the injury severities of young drivers. The primary thrust of this paper is twofold: firstly, it seeks to address the safety performance of young drivers; secondly, it strives to reduce the uncertainty surrounding predictions based on imbalanced safety data. Many studies applying machine learning prediction models to the same UK crash dataset encountered prediction challenges that were visible in their confusion matrices [32,33]. A previous study investigated the implementation of the RUMC undersampling method on distinct machine learning models, yielding promising results that suggest higher prediction accuracy compared to conventional counterparts [32]. In this context, this study follows a similar approach by applying the RUMC undersampling method to decision tree models for young drivers. Through this approach, we strive to provide an analytical tool that can potentially offer more granular insights into the dynamics of crash injury severities in a straightforward manner.
From an empirical point of view, our research aims to fill a significant knowledge gap concerning the crash severity patterns of young drivers—an underexplored, high-risk demographic in road safety research. Apart from the typical road-, vehicle-, driver-, and crash-related characteristics, the impact of external events, such as the COVID-19 pandemic, is also taken into account in the analysis. The insights garnered from this study possess the potential to shape targeted road safety interventions and policies attuned to the distinctive challenges arising from young drivers and their behavioral patterns.

3. Methodology

3.1. Overview

Missing information and under-reporting of crashes constitute key challenges in crash datasets across the globe, thus posing major limitations in traditional statistical analyses for accurately predicting crash injury severities [34,35]. To overcome these hurdles, the decision tree approach can manage missing data, providing a foundation for modeling complex relationships within the dataset [36]. In this study, the decision tree CART model is suitable for capturing non-linear relationships between variables, making it adept at uncovering intricate patterns in crash severity data. Its adaptability in handling both numerical and categorical features aligns well with the variables present in the dataset that can serve as potential determinants of crash severity for young drivers [37]. To better handle the imbalance (in terms of injury severity) that is typically observed in crash data, the RUMC method is coupled with the CART model. The interpretability of decision trees plays a key role in understanding the nuanced factors contributing to crash severity within the demographic of young drivers. Figure 1 illustrates how the RUMC undersampling method operates by randomly selecting instances from the overrepresented class and leaving them out of the training data to ensure an even distribution of classes. This approach is implemented to address the issue of data imbalance, which occurs when the distribution of classes in a dataset is skewed, leading to models that are biased toward the majority class. Undersampling is a technique used to balance the dataset by reducing the size of the majority class, ensuring that both classes have equal representation in the training data [38]. Figure 2 provides a flowchart outlining the steps followed in this study, detailing how both CART and RUMC-CART models were developed and implemented for the data analysis.

3.2. Model Selection

The decision tree model is commonly used for classification purposes. It sorts data by learning decision rules from the input features. When fine-tuning the model, setting parameters such as the maximum tree depth and the minimum number of samples required for a split is important. In a CART model, attributes are assigned to splits based on their importance, and the resulting structured layout identifies the factors used for classification or regression. Essentially, decision trees make predictions by splitting the input features into subsets and refining these splits to reduce uncertainty in the class variable. This process uses the Gini index as a measure of uncertainty [37].
The Gini index quantifies the homogeneity of the data in a node by computing the proportion of samples that belong to each class. It is defined in Equation (1):
$$\text{Gini index} = 1 - \sum_{i=1}^{n} p_i^{2}, \tag{1}$$
where $p_i$ denotes the probability of an element being classified in a distinct class.
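As a rough illustration of Equation (1), the following pure-Python sketch computes the Gini index of a node and the size-weighted Gini index of a candidate split. The function names (`gini_index`, `split_gini`) and the toy labels are assumptions made for illustration only; the study itself relied on the “Sklearn” implementation rather than on code like this.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 - sum_i p_i^2 over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left_labels, right_labels):
    """Size-weighted average Gini index of the two children of a split."""
    n = len(left_labels) + len(right_labels)
    left_weight = len(left_labels) / n
    right_weight = len(right_labels) / n
    return left_weight * gini_index(left_labels) + right_weight * gini_index(right_labels)


# Toy labels (hypothetical): 0 = slight injury, 1 = serious injury.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1]

print("parent Gini:", gini_index(parent))
print("split Gini:", split_gini(left, right))
```

A split is attractive when its weighted Gini index is lower than that of the parent node, which is the sense in which splits "reduce uncertainty" in the class variable.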
Nonparametric techniques, such as CART, have become popular in predicting the severity of traffic crashes. In contrast to regression models that heavily rely on the relationships between dependent and independent variables, CART works without predefined assumptions. The results of CART models are presented in a tree format, making it easier to understand and interpret the outcomes. When developing machine learning models for classification, such as decision trees, it is essential to divide the data into training, validation, and testing sets. While there is no rule for the percentage allocation between these sets, common practices include using 60% to 80% of the data for training and 10% to 20% for validation, and having a testing set similar in size to the validation set [37].
The RUMC technique tackles imbalanced datasets by reducing the number of samples from the majority class. In analyzing crash severity, the application of RUMC with CART involves adjusting the training dataset. Traditional CART models face challenges with imbalanced data in which one class dominates (such as slight injuries), resulting in biased predictions. By undersampling the majority class, RUMC-CART shifts the focus toward minority instances (such as more severe crashes), improving the model’s predictive capabilities for underrepresented cases. When determining the model configurations, a thorough grid analysis is conducted to assess levels of complexity, leaf arrangements, and division strategies [32].
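The idea of random undersampling of the majority class can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the study used the “Imblearn” library, whereas the function name `random_undersample`, the seed, and the toy data below are hypothetical.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples from larger classes until every class has
    as many samples as the smallest class. Returns new (X, y) lists."""
    rng = random.Random(seed)
    class_counts = Counter(y)
    n_minority = min(class_counts.values())

    keep = []
    for label in class_counts:
        indices = [i for i, lab in enumerate(y) if lab == label]
        if len(indices) > n_minority:
            # Majority class: keep only a random subset of minority size.
            indices = rng.sample(indices, n_minority)
        keep.extend(indices)

    keep.sort()  # preserve the original ordering of the retained rows
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy training set (hypothetical): 8 slight (0) and 2 serious (1) crashes.
X_train = list(range(10))
y_train = [0] * 8 + [1] * 2

X_balanced, y_balanced = random_undersample(X_train, y_train, seed=42)
print(Counter(y_balanced))
```

In this sketch the resampling is applied to the training data only, in line with the statement above that RUMC adjusts the training dataset.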

3.3. Model Evaluation Metrics

The effectiveness of the models was assessed by computing goodness-of-fit metrics, such as accuracy, precision, and ROC curves. Table 1 shows how well negative and positive samples were correctly identified, with TN and TP values indicating true negatives and true positives, respectively, and FP and FN indicating false positives and false negatives, respectively. By calculating accuracy, recall, precision, F1 score, specificity, and false positive rate (for their full specifications, see Equations (2)–(7)), a comprehensive evaluation of the model’s ability to distinguish between different classes was obtained [37].
$$\text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN}, \tag{2}$$
$$\text{Recall (True Positive Rate, TPR)} = \frac{TP}{TP + FN}, \tag{3}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{4}$$
$$\text{F1 Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}, \tag{5}$$
$$\text{Specificity} = \frac{TN}{TN + FP}, \tag{6}$$
$$\text{False Positive Rate (FPR)} = \frac{FP}{TN + FP}. \tag{7}$$
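The metrics in Equations (2)–(7) can be computed directly from the four confusion matrix counts. The sketch below is illustrative only: the helper names (`confusion_counts`, `classification_metrics`, `_ratio`) are assumptions, and returning 0.0 for an empty denominator is a convenience choice of this sketch. The example counts are the CART testing-set values reported in Section 4 (TN = 6219, FP = 82, FN = 1554, TP = 90).

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = slight, 1 = serious)."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tn, fp, fn, tp):
    """Evaluate Equations (2)-(7) from the confusion matrix counts."""
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tn + tp, tn + fp + tp + fn),
        "recall": recall,
        "precision": precision,
        "f1": _ratio(2 * recall * precision, recall + precision),
        "specificity": _ratio(tn, tn + fp),
        "fpr": _ratio(fp, tn + fp),
    }


# CART testing-set counts as reported in Section 4.
cart_test = classification_metrics(tn=6219, fp=82, fn=1554, tp=90)
for name, value in cart_test.items():
    print(f"{name}: {value:.2f}")
```

With these counts, a high accuracy coexists with a very low recall, which is the pattern that motivates the use of undersampling in this study.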
A receiver operating characteristic (ROC) curve visually represents the performance of a binary classifier by illustrating its ability to distinguish between two classes across different classification thresholds. It plots the recall (true positive rate) against the false positive rate (FPR). A better-performing classifier tends to have its ROC curve closer to the upper left corner, while a curve approaching the diagonal signifies performance close to random guessing. The default cutoff for classifiers is typically set at 0.5 to maintain a balance between true positive and false positive rates. Another performance metric, the area under the curve (AUC), typically ranges from 0.5 (no better than random guessing) to 1 (perfect separation of the two classes).
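To make the construction of the ROC curve concrete, the sketch below sweeps the classification threshold over the predicted scores, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule to obtain the AUC. The function names and the four toy scores are assumptions for illustration; the study computed these quantities with the “Sklearn” library.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold
    from the highest score to the lowest. Assumes both classes are present."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical scores): 1 = serious injury, 0 = slight injury.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

curve = roc_points(y_true, scores)
print("ROC points:", curve)
print("AUC:", auc_trapezoid(curve))
```

Each point on the curve corresponds to one choice of cutoff, so the default cutoff of 0.5 is just one operating point among those traced by the sweep.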
In this study, we employed Python 3.12.0 as the programming language and leveraged the “Sklearn” library to process and build the decision tree model. The evaluation metrics were calculated using the “Sklearn” library, and the “Imblearn” library was utilized for implementing the random undersampling method (RUMC).

3.4. Data

The crash data used in this research are drawn from the UK crash database, STATS19, which offers information on injury crashes reported by the police in Great Britain. Specifically, the crash data were collected by the local police forces in the UK following standardized guidelines, as outlined in the STATS20 manual. Crashes resulting only in property damage are not included in the STATS19 database [35]. This study delves into crash data from England spanning a period from 7 April 2019 to 1 February 2022 (a period covering before, during, and slightly after the outbreak of the COVID-19 pandemic), comprising 52,966 observations specifically pertaining to young drivers. In the UK, the pre-COVID era refers to the time before March 2020, the during-COVID era spans from March 2020 to February 2021, and the post-COVID era refers to February 2021 onwards, marked by the gradual easing of restrictions [39]. The dataset covers crashes in both rural and urban settings, offering a holistic perspective of the diverse landscapes within the UK’s road network.
The research systematically explores the influence of various factors, such as crash-related, vehicle-related, driver-related, and environmental factors. The crash-related variables included various attributes, such as crash severity, number of casualties, and junction detail, all of which provide context about the circumstances of the incidents. Within the vehicle-related category, variables, such as vehicle type, towing and articulation, and vehicle maneuvers, were incorporated in the analysis to capture the role of vehicle characteristics and movements in influencing crash severity outcomes. The driver-related variables, such as the age of the driver—with a specific focus on young drivers aged 17–24—and gender of the driver, aim to capture the impact of human factors on crash severities. Lastly, environmental factors, such as road surface conditions, weather conditions, light conditions, and road type, were included to account for the external conditions surrounding the incidents. A comprehensive overview of all these variables and their specific characteristics is provided in Table A1 in Appendix A.
For the purposes of this study, crashes are classified into two categories based on their observed injury severity, namely “slight” and “serious” injury crashes. The “slight” category encompasses injuries of a minor character, like sprains, bruises, or cuts that are not deemed severe, along with slight shocks requiring roadside attention. It also includes very minor injuries not necessitating medical treatment. Due to the low prevalence of fatal injuries in the dataset, fatal and serious injury crashes were combined into a single category, i.e., the “serious” injury category. In addition to fatal accidents, “serious” crashes involve injuries for which a person is hospitalized as an “in-patient” or a specific injury, whether or not hospitalization occurs. These injuries comprise fractures, concussions, internal injuries, crushing, burns (excluding friction burns), severe cuts, severe general shock requiring medical treatment, and injuries causing death 30 or more days after the crash.
Of the 388,157 crashes recorded across all age groups in our study period, 52,966 (13.6%) involved drivers in the 17 to 24 age range, who are categorized as young drivers in our analysis. Table 2 presents the distribution of crash severity levels, highlighting the imbalanced nature of the dataset. Slight injuries dominate with a share of 78.70%, while serious injuries (i.e., fatal and serious) account for 21.30% of the sample.
The analysis reveals that 19,057 (36.0%) of the crashes occurred in rural areas, emphasizing the challenges and risks associated with less densely populated regions, while 33,909 (64.0%) crashes occurred in urban areas. Slight injuries were predominant in both settings, constituting 75.0% (14,291 out of 19,057) in rural areas and 80.8% (27,391 out of 33,909) in urban areas. For serious injuries, rural areas accounted for 25.0% (4766 out of 19,057), while urban areas accounted for 19.2% (6518 out of 33,909) of such crashes.
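The degree of imbalance can be checked directly from the counts quoted above. The short sketch below recomputes the severity shares by area type and the overall ratio of slight to serious crashes; the variable and function names are illustrative assumptions, and only the four counts are taken from the text.

```python
# Crash counts by area type and severity, as quoted in the text.
counts = {
    "rural": {"slight": 14291, "serious": 4766},
    "urban": {"slight": 27391, "serious": 6518},
}


def shares(severity_counts):
    """Percentage share of each severity level within one group."""
    total = sum(severity_counts.values())
    return {level: 100.0 * n / total for level, n in severity_counts.items()}


# Severity shares within each area type.
for area, severity_counts in counts.items():
    formatted = {level: round(pct, 1) for level, pct in shares(severity_counts).items()}
    print(area, formatted)

# Overall totals and the majority-to-minority ratio used to gauge imbalance.
overall = {
    level: sum(area_counts[level] for area_counts in counts.values())
    for level in ("slight", "serious")
}
imbalance_ratio = overall["slight"] / overall["serious"]
print("overall:", overall, "ratio slight:serious =", round(imbalance_ratio, 2))
```

A ratio well above 1 indicates that a classifier can reach a high accuracy simply by favoring the slight injury class, which is the bias the undersampling step is meant to counter.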
Figure 3 visually represents the spatial distribution of the two crash types (i.e., slight and serious injury crashes) in England. The green spots indicate locations of slight injury crashes, while the red spots mark areas with serious crashes. As expected, a clustering of spots is illustrated in major cities of England, such as London, Birmingham, Manchester, and so on. Figure 4 further highlights the discrepancies between rural and urban areas. Yellow spots denote urban areas, encompassing 64% of the crashes, while blue spots represent crashes in rural areas, representing 36% of the total crash observations.

4. Results

Utilizing 70% of the crash data for training, 15% for validation, and the remaining 15% for testing, two decision tree models employing the CART and RUMC-CART techniques were developed. These models were developed to predict crash severity levels, distinguishing between “slight injury” and “serious injury”, with the latter including both serious and fatal crashes, as mentioned earlier. The Gini impurity function guided the growth of a total of 60 nodes in the CART model and 56 nodes in the RUMC-CART model. Evaluation metrics for the training, validation, and testing data, including classification results and the area under the ROC curve, are detailed in Table 3 and Figure 5.
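A 70/15/15 partition of this kind can be sketched as follows. This is a minimal standard-library illustration, assuming a simple shuffled split; the source does not state whether the split was stratified by severity, and the function name, seed, and toy records are hypothetical.

```python
import random


def train_val_test_split(records, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the records and split them into training, validation, and
    testing subsets. The testing subset receives whatever remains."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)

    n_train = round(train_frac * len(indices))
    n_val = round(val_frac * len(indices))

    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]

    def pick(idx):
        return [records[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Toy usage (hypothetical): 100 placeholder crash records.
records = list(range(100))
train, val, test = train_val_test_split(records, seed=1)
print(len(train), len(val), len(test))
```

The validation subset is typically used for tuning choices such as tree depth, while the testing subset is held back for the final evaluation.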
As shown in Table 3, CART exhibits robust accuracy across the training, validation, and testing datasets, with values of 0.79, 0.78, and 0.79, respectively. However, the model faces challenges in capturing positive instances, as evidenced by low recall values ranging from 0.04 to 0.06. The specificity remains consistently high at 0.99, indicating a strong ability to correctly identify negative instances. In terms of precision, CART demonstrates moderate values, decreasing from 0.58 in training to 0.52 in testing. The F1 score, which indicates the balance between precision and recall, ranges from 0.07 to 0.11. Despite these challenges, CART maintains an AUC value of 0.66 across all datasets, indicating a moderate ability to distinguish between the two classes.
Focusing on the RUMC-CART model, its accuracy is comparatively lower, with values of 0.63 for the training set, 0.61 for the validation set, and 0.62 for the testing set. However, the model performs considerably better in identifying positive (serious injury) cases, as indicated by recall values ranging from 0.58 to 0.59. This highlights the model’s strength in recognizing instances of the minority class, especially in scenarios where capturing positive cases accurately is crucial. The specificity values for RUMC-CART range from 0.65 to 0.68, suggesting greater difficulty in identifying negative instances than the standard CART model. The precision values are stable between 0.63 and 0.64, showcasing the consistency of the RUMC-CART model in positive predictions. The F1 score for RUMC-CART varies from 0.60 to 0.61, reflecting its balance between precision and recall. The ROC curves in Figure 5 provide a visualization of the performance of both the CART and RUMC-CART models across the training, validation, and testing sets, demonstrating consistent performance across the different datasets.
Table 4 presents the confusion matrices for the decision trees concerning the training, validation, and testing datasets. In these matrices, 0 represents “slight injury” crashes and 1 indicates “serious injury” crashes. Notably, the CART model exhibits higher precision in predicting the “slight injury” class compared to the “serious injury” class. In the training dataset, 28,205 instances (99%) of the “slight injury” class and 470 instances (6%) of the “serious injury” class were accurately predicted. Similarly, the validation and testing datasets yielded precise predictions for the “slight injury” class, with 6151 instances (99%) and 6219 instances (99%) correctly classified, and for the “serious injury” class, with 369 instances (18%) and 90 instances (5%) correctly predicted, respectively. The CART model also presented challenges, misclassifying 339 instances (1%) of “slight injury” crashes in the training dataset as “serious injury” crashes. Furthermore, a substantial number of serious injury crashes, amounting to 7462 instances (94%), were inaccurately classified as “slight injury” crashes. The false positive (FP) and false negative (FN) values manifested as 86 instances (1%) and 82 instances (1%) of “slight injury” crashes being misclassified as “serious injury” crashes and 1639 instances (82%) and 1554 instances (95%) of “serious injury” crashes being erroneously predicted as “slight injury” crashes in the validation and testing datasets, respectively.
As far as the RUMC-CART model is concerned, Table 4 reveals a higher precision in predicting the “slight injury” class compared to the “serious injury” class. In the training dataset, the model accurately predicted 5372 instances (68%) of the “slight injury” class and 4566 instances (58%) of the “serious injury” class. This pattern extends to the validation and testing datasets, where 1099 instances (65%) and 1140 instances (67%) of the “slight injury” class were correctly predicted, alongside 997 instances (59%) and 970 instances (58%) of the “serious injury” class, respectively.
On the other hand, for the training dataset of the RUMC-CART model, 2523 instances (32%) of “slight injury” crashes were incorrectly predicted as “serious injury” crashes, and 3336 instances (42%) of “serious injury” crashes were misclassified as “slight injury” crashes. Similar misclassifications occurred in the validation and testing datasets, where instances of “slight injury” crashes were predicted as “serious injury” crashes, and vice versa. These misclassifications were quantified as false positive (FP) and false negative (FN) values. For instance, in the validation dataset, 590 instances (35%) of “slight injury” crashes were classified as “serious injury” crashes (FP), while 560 instances (33%) of “serious injury” crashes were predicted as “slight injury” crashes (FN). This pattern persisted in the testing dataset, with 700 instances (41%) classified as FP and 715 instances (42%) as FN for “serious injury” crashes.
The feature importance analysis of the CART model (see Figure 6) reveals significant contributors to crash severity prediction. Key factors include the “vehicle_type” at 20.33%, highlighting the impactful role of motorcycle involvement. Variables, such as “number_of_vehicles” and “number_of_casualties”, closely follow, with weights of 20.04% and 19.06%, respectively, emphasizing their substantial roles. “Vehicle_leaving_carriageway” and “speed_limit” play important roles at 13.15% and 6.94%, respectively, underscoring the importance of road and traffic conditions. “Urban_or_rural_area” emerges as an observable factor at 5.88%, indicating distinctions in crash dynamics between urban and rural settings. “Vehicle_manoeuvre” and “skidding_and_overturning” contribute 3.55% and 2.53%, respectively, shedding light on driver actions and vehicular dynamics. Other factors, like “road_surface_conditions”, “light_conditions”, and “hit_object_in_carriageway”, exhibit moderate importance, while the temporal aspect, specifically the “timeline” feature, denotes a minor influence, particularly in the pre-COVID period, at 0.95%.
The feature importance analysis for the RUMC-CART model (see Figure 7) elucidates critical factors influencing crash severity prediction. “Number_of_casualties” stands out as the most impactful feature at 22.02%, emphasizing its significance in determining crash severity. Following closely, “Number_of_vehicles” and “vehicle_type: Passenger_car” contribute substantially at 19.22% and 18.20%, respectively, underscoring their essential roles in predicting crash outcomes. The assigned “Speed_limit:20 mph” and “vehicle_manoeuvre: Turning” also emerge as factors with observable impacts at 5.38% and 4.94%, respectively, emphasizing the role of road- and driver-related dynamics. Notably, the specific “vehicle_type: Motorcycle” and “vehicle_manoeuvre: Slowing_or_stopping” contribute to a lesser extent, with 3.27% and 2.89%, respectively, shedding light on distinct vehicle types and driver actions. “Junction_detail” and “sex_of_driver” exhibit a lower importance at 2.48% and 2.42%, respectively. Additionally, road and environmental variables, such as “light_conditions: Darkness_no_street_lighting” and “weather_conditions: fine”, influence model predictions, with weights of 4.30% and 1.32%, respectively.
Figure 8 displays the root node (node 0) along with the immediate child nodes (node 1 on the left and node 32 on the right) to provide an overview of the decision tree. The “number_of_vehicles” variable at the root node (node 0) is utilized to bifurcate the tree into two child nodes (node 1 and node 32). Implementing the RUMC method on our CART model indicates an equal distribution of data for both slight and serious injury classes in node 0. In the subsequent analysis, we scrutinize the decision tree in four distinct sections, starting from the child nodes of node 1 on the left and right sides, followed by the child nodes of node 32 on the left and right sides.
The left segment of the RUMC-CART diagram (see Figure 9) explores crash severity in light of vehicle type, speed limit, and surrounding conditions. When two or more vehicles are involved, the “vehicle_type” variable is primarily associated with slight injuries in 53% of cases. However, if the vehicle type is not a passenger car, the speed limit becomes influential for injury severity, with 58% of these crashes resulting in serious injuries, especially when the speed limit is not 20 mph. Within this branch, the type of vehicle continues to have an observable impact; if the vehicle is something other than a motorcycle, crashes in rural areas show a higher likelihood of resulting in serious injuries (58%), while urban crashes are more associated with slight injuries (61%). When motorcycles are involved and the speed limit is 30 mph, serious injuries are more likely to occur in 74% of cases. Additional factors, like skidding and overturning, also contribute to the likelihood of injury severity, with pre- or post-COVID periods showing a higher rate of serious injuries (76%).
On the right side of the diagram (see Figure 10), the focus shifts to crashes involving passenger cars, where the number of casualties, lighting conditions, and vehicle maneuvers significantly influence crash outcomes. When there are three or more casualties, 62% of these crashes are linked with serious injuries, especially when the vehicle maneuver does not involve slowing or stopping. Further branching reveals that crashes near junctions have a higher likelihood of resulting in serious injuries (73%), while crashes occurring away from junctions also show a significant proportion of severe injuries. Environmental factors, such as lighting conditions, are also found to serve as injury severity determinants; for instance, darkness without street lighting results in serious injuries in 57% of cases. Additionally, other variables, like weather conditions, skidding, and overturning, further exacerbate crash severity, particularly when the weather is poor or when the driver is male.

5. Discussion

The findings of this study provide valuable insights into the determinants of crash injury severity among young drivers, analyzed through the CART and RUMC-CART models. These insights help explain the unique vulnerabilities and behaviors of this demographic in terms of road safety.

5.1. CART and RUMC-CART Comparison for Young Drivers

The CART model is highly accurate in predicting “slight injury” outcomes, correctly classifying 99% of slight-injury cases in the training, validation, and testing datasets (see Table 4). However, its performance on more severe outcomes is poor and unstable, with only 5% to 18% of serious-injury cases correctly identified. In contrast, the RUMC-CART model, despite a lower overall accuracy (0.63 in training, 0.61 in validation, and 0.62 in testing; see Table 3) and a lower share of correctly classified slight injuries (68%, 65%, and 67%, respectively), is considerably more reliable in predicting serious injuries, with a recall of 58–59% and correspondingly lower false negative rates (41–42%, compared with 82–95% for CART). This is especially relevant for young drivers, as their propensity for risk-taking behaviors often leads to severe injuries when they are involved in crashes [40]. The RUMC-CART model’s ability to better handle imbalanced data highlights its suitability for addressing these high-risk scenarios.
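To make the link between the confusion matrices and the reported metrics explicit, the following minimal Python sketch recomputes the standard measures from the testing-dataset counts in Table 4. The function name `binary_metrics` and the dictionary layout are illustrative assumptions rather than the study's own code; the formulas are the usual definitions based on TP, TN, FP, and FN, with serious injury treated as the positive class (1).

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Testing-dataset counts taken from Table 4 (class 1 = serious injury).
cart = binary_metrics(tn=6219, fp=82, fn=1554, tp=90)
rumc_cart = binary_metrics(tn=1140, fp=560, fn=715, tp=970)

for name, metrics in (("CART", cart), ("RUMC-CART", rumc_cart)):
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Rounded to two decimals, these values match the testing row of Table 3: CART reaches an accuracy of 0.79 but a recall of only 0.05, whereas RUMC-CART trades accuracy (0.62) for a recall of 0.58.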

5.2. Feature Importance Analysis for Young Drivers

The examination of features in the CART and RUMC-CART models (see Figure 6 and Figure 7) provides more nuanced insights into the determinants of crash injury severity. The type of vehicle stands out as one of the key determinants, emphasizing the higher injury risk faced by motorcyclists, as also evidenced in earlier studies [41]. This elevated risk is consistent with the risk propensity of young drivers discussed above, particularly when they operate heavy or less protective vehicles. Young motorcyclists face heightened injury risks due to their inherent vulnerability, which can be attributed to limited protective enclosures compared to other vehicle occupants [30]. Additionally, the aggressive driving behavior observed among novice drivers [16] exacerbates this risk, especially when operating motorcycles, which demand greater control and attention. The lack of driving experience further compounds this issue, as novice drivers struggle with hazard recognition and risk estimation [9]. The number of vehicles and the number of casualties involved in the crash both exert substantial influence, underlining the multiplicative effect of these factors on crash severity. This aligns with prior findings suggesting a greater likelihood of severe injuries in multi-vehicle crashes, particularly for drivers [42]. Young drivers’ inexperience in managing complex traffic scenarios [11] and their tendency to focus narrowly on vehicle control rather than scanning their surroundings contribute to increased crash severity in such situations. Moreover, young drivers face the most pronounced crash risks in the early stages of licensure, highlighting their struggle with challenging driving environments involving multiple vehicles [26,27].
The area type (urban or rural) appears to be a less influential factor but still indicates distinctions in crash dynamics across different spatial settings. This aligns with evidence suggesting that higher speeds in rural areas may be linked to risk-taking behavior among young rural drivers [43]. Rural roads pose greater risks for young drivers due to the prevalence of two-lane highways and higher speeds [24]. Turning maneuvers and contributing vehicle actions at the time of the crash (e.g., skidding and overturning) have a moderate influence on injury severity outcomes, with mixed results documented in previous studies [35,44]. Novice drivers are more likely to struggle with maneuvers due to their limited ability to anticipate hazards and react appropriately [45]. The aggressive behavior of novice drivers [16] also increases the likelihood of risky outcomes, such as skidding or overturning, during challenging driving scenarios. The minor influence of the “timeline” feature, specifically for the pre-COVID period, is nonetheless intriguing, as it suggests a potential shift in crash dynamics for young drivers before and after the pandemic. However, further research with a larger pre-COVID data set is required for more reliable and detailed insights into the specific impact of the pandemic on the driving behavior of young drivers.

5.3. Policy Implications

The findings of the RUMC-CART model analysis can help shape policy interventions targeted at improving the safety performance of young drivers.
One key suggestion arising from this study is to implement driver licensing programs that consider the important factors that emerged from the model results, such as the number of vehicles and casualties. These programs would limit young drivers’ access to high-risk vehicles, particularly during their initial licensing stages. Through these programs, young individuals can become more aware of the risks associated with distraction and peer influence.
Prioritizing improvements to road design and enforcement aligns with the model’s findings regarding several factors, such as the speed limit or vehicle maneuvers during the crash. Advocating for and enforcing lower speed limits, especially in areas with significant pedestrian or cyclist activity [46], are essential to mitigate the severity of crashes involving young drivers. Additionally, investing in infrastructure enhancements and vehicle-aided navigation (e.g., through advanced driver assistance systems, ADAS) to improve visibility at critical road elements (e.g., intersections) and simplify complex maneuvers corresponds with the insights provided by the RUMC-CART analysis.
Enhancing driver education and training is of the utmost importance, considering the model’s findings relating to such factors as vehicle maneuvers and lighting conditions. Mandatory advanced driving courses focusing on essential skills, like emergency braking, hazard perception, and navigating complex maneuvers, as highlighted by the RUMC-CART results, would better prepare young drivers for challenging driving scenarios.
The findings of the study highlight the impact of road conditions, vehicle characteristics, and speed limits on the crash injury severity of young drivers, thus paving the way for targeted mitigation policies. Investments in rural road infrastructure, enhanced urban traffic management, in-vehicle assistance systems, and tailored educational programs for young drivers can potentially address, to some extent, the sources of driving risk for young individuals. However, these policy implications apply to young drivers in England and may not generalize to similar demographics in other populations or spatial settings. In addition, the under-reporting of crashes involving young drivers and inaccuracies in the classification of injury outcomes may introduce bias into analyses based on crash databases. The joint consideration of crash and hospital data in future studies could potentially enhance the predictive and explanatory power of injury severity analyses focusing on vulnerable road user groups, such as young individuals.

6. Conclusions

Despite numerous research efforts investigating road safety, young drivers have received comparatively little attention. This study seeks to address this gap by modeling crash severities and the associated influential factors specifically for young drivers aged 17–24 in England, UK. To address imbalances in injury severity data (especially between slight and serious injury levels), the study applied the RUMC undersampling method to the CART decision tree model.
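The undersampling step can be illustrated with a minimal Python sketch that uses only the standard library. It assumes that RUMC keeps every minority-class observation and draws an equally sized random sample, without replacement, from the majority class; the helper name `undersample_majority` and the fixed seed are hypothetical choices for illustration, and the study's actual implementation may differ. The class sizes are those reported in Table 2.

```python
import random
from collections import Counter


def undersample_majority(labels, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    All minority-class observations are kept; larger classes are sampled
    without replacement down to the minority-class size.
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    kept = []
    for indices in by_class.values():
        if len(indices) == n_min:
            kept.extend(indices)  # minority class: keep everything
        else:
            kept.extend(rng.sample(indices, n_min))  # majority class: subsample
    return sorted(kept)


# Class sizes from Table 2: 0 = slight injury, 1 = serious injury.
labels = [0] * 41682 + [1] * 11284
balanced = undersample_majority(labels)
print(Counter(labels[i] for i in balanced))
```

Under these assumptions, each class retains 11,284 observations. This is consistent with Table 4, where the RUMC-CART training, validation, and testing counts sum to 11,284 per observed class.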
The significant variables identified in this study include the type of vehicle and vulnerabilities, number of vehicles and casualties, area type (urban vs. rural), vehicle maneuvers and dynamic factors, and minor influences and timeline. Motorcyclists face heightened injury risks due to their lack of protective enclosures and the aggressive tendencies of novice drivers, exacerbating their inherent vulnerability. Multi-vehicle crashes further compound injury severity, reflecting young drivers’ challenges in managing complex traffic scenarios, especially during the early licensure period. Rural areas present elevated risks, with higher speeds and two-lane highways amplifying crash severity among risk-prone young rural drivers. Vehicle maneuvers, such as skidding and overturning, moderately impact injury outcomes, rooted in novice drivers’ difficulty in anticipating hazards and their propensity for risky behaviors. Although minor factors, such as road surface conditions and temporal shifts, play a less significant role, they offer intriguing insights into the evolving crash dynamics among young drivers.
The model results showed that the CART model has a superior performance in predicting slight injury cases with high accuracy but exhibits an inferior performance with serious injury cases, showing significant fluctuations and more false negatives. The RUMC-CART model, while less accurate overall, offers improved precision for serious injury cases and lower false negative rates, making it more effective in handling imbalanced data, such as those used in this study. Additionally, the RUMC-CART model incorporates a broader set of features, enhancing not only its predictive capabilities for crash severity, but also its explanatory power.
A limitation of this study lies in the generalized nature of the findings from the CART and RUMC-CART models. While these models offer valuable insights into crash severity modeling and prediction, they may have limitations in identifying specific, actionable patterns or risk factors to directly guide interventions [47]—especially in the context of our study, which focuses on young drivers and cases involving slight injuries. Future research should seek to refine these models by integrating variables that are tailored to specific demographic groups and crash scenarios, thereby improving their practical applicability and supporting the development of targeted safety interventions.

Author Contributions

A.T.: Conceptualization, Methodology, Visualization, Investigation, Formal analysis, Data curation, Writing—original draft; K.S.: Data curation, Methodology, Formal analysis; G.F.: Conceptualization, Methodology, Investigation, Validation, Writing—review and editing; A.S.: Conceptualization, Writing—review and editing; N.D.: Validation, Visualization, Writing—review and editing; S.M.: Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The crash data used in this research are taken from the UK crash database, STATS19, which offers information on injury crashes reported by the police in Great Britain [48].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CART: Classification and regression trees
RUMC: Random undersampling of the majority class
WHO: World Health Organization
UK: United Kingdom
ROC: Receiver operating characteristic
TP: True positive
TN: True negative
TPR: True positive rate
FP: False positive
FPR: False positive rate
AUC: Area under the curve
FN: False negative

Appendix A

Table A1. Overview of variables and their characteristics used in the modeling process.
| Attribute | Characteristics | Description |
|---|---|---|
| Crash severity | Combined fatal and serious injury | Broken neck or back, fractures, etc.; death due to the consequences of the accident |
| | Slight injury | All other injuries apart from fatal or serious |
| Number of casualties | 1–2 | 1–2 people injured |
| | >2 | More than 2 people injured |
| Day of week | Working day | Monday–Friday |
| | Weekend | Saturday and Sunday |
| Junction detail | Junction exists | Near a junction |
| | No junction | Not at or within 20 m of a junction |
| Pedestrian crossing physical facilities | Physical facilities exist | Pedestrian crossing exists |
| | No physical facilities | Pedestrian crossing does not exist |
| Road surface conditions | Dry | Dry conditions |
| | Not dry | Wet/damp, snow, frost/ice, flood |
| Urban or rural area | Urban | Urban areas |
| | Rural | Rural areas |
| Towing and articulation | Towing/articulation exists | Articulated vehicle, double or multiple trailer, caravan, single trailer, other tow |
| | No towing/articulation | Not existing |
| Skidding and overturning | Skidding/overturning exists | Skidded, overturned, jack-knifed |
| | No skidding/overturning | No skidding, jack-knifing or overturning |
| Hit object in carriageway | Object hit | Roadworks, bridges, etc. |
| | No object hit | No collision with objects in carriageway |
| Hit object off carriageway | Object hit | Road sign, tree, etc. |
| | No object hit | No collision with object off carriageway |
| Vehicle leaving carriageway | Left carriageway | Nearside, straight, offside |
| | Did not leave carriageway | No leaving |
| Sex of driver | Male | Male driver |
| | Female | Female driver |
| Age of vehicle | 0–9 | Vehicle is 0–9 years old |
| | 10–19 | Vehicle is 10–19 years old |
| | 20–96 | Vehicle is 20–96 years old |
| Number of vehicles | 1 | 1 vehicle involved |
| | 2 | 2 vehicles involved |
| | 3+ | 3+ vehicles involved |
| Speed limit | 20 | Limit: 20 mph (32 km/h) |
| | 30 | Limit: 30 mph (48 km/h) |
| | 40 | Limit: 40 mph (64 km/h) |
| | 50+ | Limit: 50+ mph (80+ km/h) |
| Weather conditions | Fine | Good weather |
| | Raining | It is raining |
| | Snowing | It is snowing |
| | Fog or mist | Fog or mist (if hazard) |
| | Other | Other than above |
| Road type | Roundabout | Roundabout (all sizes) |
| | One-way street | One-way street |
| | Dual carriageway | Dual carriageway |
| | Single carriageway | Single carriageway |
| | Slip road | Slip road |
| Vehicle maneuver | Reversing | Reversing |
| | Parked | Parked |
| | Waiting | Waiting to go ahead/turn left/right |
| | Slowing or stopping | Slowing or stopping |
| | Moving off | Moving off |
| | Turning | Left/right |
| | Changing lane | To right/left |
| | Overtaking | Moving/stationary vehicle on offside or on nearside |
| | Going ahead | Left hand bend, right hand bend, other |
| Vehicle type | Motorcycle | All cc-sizes |
| | Passenger car | Passenger car |
| | Bus | Bus or coach |
| | Agricultural vehicle | Includes diggers, etc. |
| | Tram/light rail | Tram/light rail |
| | Van/goods | Van/goods of all weights |
| Lighting conditions | Daylight | Daylight |
| | Darkness (streetlights present and lit) | Darkness (streetlights present and lit) |
| | Darkness (streetlights present but unlit) | Darkness (streetlights present but unlit) |
| | Darkness (no street lighting) | Darkness (no street lighting) |
| Year | 2019 | 7 April 2019–31 December 2019 |
| | 2020 | 1 January 2020–31 December 2020 |
| | 2021 | 1 January 2021–31 December 2021 |
| | 2022 | 1 January 2022–1 February 2022 |
| Hour | Night | 22:00–05:59 |
| | Morning | 06:00–09:59 |
| | Noon | 10:00–17:59 |
| | Evening | 18:00–21:59 |
| Timeline | Pre-COVID-19 | 7 April 2019–15 March 2020 |
| | During COVID-19 | 16 March 2020–22 February 2021 |
| | After COVID-19 | 23 February 2021–1 February 2022 |
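As an illustration of how the two time-based variables in Table A1 could be derived from raw crash records, the following Python sketch maps a crash date to its “Timeline” category and a crash time to its “Hour” category. The function names `timeline_period` and `hour_period` are hypothetical and the sketch is not the study's preprocessing code; only the category boundaries are taken from Table A1.

```python
from datetime import date, time


def timeline_period(d):
    """Map a crash date to the Timeline categories of Table A1."""
    if not (date(2019, 4, 7) <= d <= date(2022, 2, 1)):
        raise ValueError("date outside the study period")
    if d <= date(2020, 3, 15):
        return "Pre-COVID-19"
    if d <= date(2021, 2, 22):
        return "During COVID-19"
    return "After COVID-19"


def hour_period(t):
    """Map a crash time to the Hour categories of Table A1."""
    if time(6, 0) <= t < time(10, 0):
        return "Morning"
    if time(10, 0) <= t < time(18, 0):
        return "Noon"
    if time(18, 0) <= t < time(22, 0):
        return "Evening"
    return "Night"  # 22:00-05:59 wraps around midnight


print(timeline_period(date(2020, 3, 16)), hour_period(time(23, 30)))
```

Handling “Night” as the fall-through case avoids a separate comparison for the interval that spans midnight.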

References

  1. Daoud, R.; Vechione, M.; Gurbuz, O.; Sundaravadivel, P.; Tian, C. Comparison of machine learning models to predict nighttime crash severity: A case study in Tyler, Texas, USA. Vehicles 2025, 7, 20.
  2. Ahmed, S.K.; Mohammed, M.G.; Abdulqadir, S.O.; El-Kader, R.G.A.; El-Shall, N.A.; Chandran, D.; Rehman, M.E.U.; Dhama, K. Road traffic accidental injuries and deaths: A neglected global health issue. Health Sci. Rep. 2023, 6, e1240.
  3. Abdel-Aty, M.; Ugan, J.; Islam, Z. Exploring the influence of drivers’ visual surroundings on speeding behavior. Accid. Anal. Prev. 2024, 198, 107479.
  4. Feyzollahi, M.; Pineau, P.-O.; Rafizadeh, N. Drivers of Driving: A Review. Sustainability 2024, 16, 2479.
  5. Zhu, Y.; Qian, Y.; Xu, J.; Hu, W. Young novice drivers’ road crash injuries and contributing factors: A crash data investigation. Traffic Inj. Prev. 2024, 25, 1031–1038.
  6. Zhu, Y.; Jiang, M.; Yamamoto, T. Does a cautious driving style reduce the crash risk of older drivers? An analysis using a novel driving style recognition method. Transp. Res. Part F Traffic Psychol. Behav. 2024, 104, 72–87.
  7. Krasniuk, S.; Toxopeus, R.; Knott, M.; McKeown, M.; Crizzle, A.M. The effectiveness of driving simulator training on driving skills and safety in young novice drivers: A systematic review of interventions. J. Saf. Res. 2024, 91, 20–37.
  8. Faridiaghdam, A.; Mirzahossein, H.; Rassafi, A.A.; Khanpour, A. Exploring the cognitive and behavioral factors impacting novice young drivers: Structural equation modeling of situational awareness, driving skills, reported crash history, and violations, using a driving simulator. Transp. Res. Part F Traffic Psychol. Behav. 2025, 111, 130–144.
  9. Borowsky, A.; Shinar, D.; Oron-Gilad, T. Age, skill, and hazard perception in driving. Accid. Anal. Prev. 2010, 42, 1240–1249.
  10. Coyne, R.; Hanlon, M.; Smeaton, A.F.; Corcoran, P.; Walsh, J.C. Understanding drivers’ perspectives on the use of driver monitoring systems during automated driving: Findings from a qualitative focus group study. Transp. Res. Part F Traffic Psychol. Behav. 2024, 105, 321–335.
  11. Scialfa, C.T.; Deschênes, M.C.; Ference, J.; Boone, J.; Horswill, M.S.; Wetton, M. A hazard perception test for novice drivers. Accid. Anal. Prev. 2011, 43, 204–208.
  12. Xu, Y.; Li, Y.; Jiang, L. The effects of situational factors and impulsiveness on drivers’ intentions to violate traffic rules: Difference of driving experience. Accid. Anal. Prev. 2014, 62, 54–62.
  13. Bakhtiari, S.; Zhang, T.; Zafian, T.; Samuel, S.; Knodler, M.; Fitzpatrick, C.; Fisher, D.L. Effect of visual and auditory alerts on older drivers’ glances toward latent hazards while turning left at intersections. Transp. Res. Rec. 2019, 2673, 117–126.
  14. Shariatmadari, K.; Samuel, S.; Cao, S.; Singh, A. Comparison of Two-Level and Three-Level Graded Collision Warning Systems under Distracted Driving Conditions. IEEE Access 2025, 13, 43818–43829.
  15. Wright, C.J.; Dietze, P.M.; Crockett, B.; Lim, M.S. Participatory development of MIDY (Mobile Intervention for Drinking in Young people). BMC Public Health 2016, 16, 184.
  16. Fountas, G.; Pantangi, S.S.; Hulme, K.F.; Anastasopoulos, P.C. The effects of driver fatigue, gender, and distracted driving on perceived and observed aggressive driving behavior: A correlated grouped random parameters bivariate probit approach. Anal. Methods Accid. Res. 2019, 22, 100091.
  17. Adavikottu, A.; Velaga, N.R. Modeling the impact of driving aggression on lane change performance measures: Steering compensatory behavior, lane change execution duration and crash probability. Transp. Res. Part F Traffic Psychol. Behav. 2024, 103, 526–553.
  18. Sadeghi, P.; Goli, A. Investigating the impact of pavement condition and weather characteristics on road accidents. Int. J. Crashworthiness 2024, 29, 973–989.
  19. Duddu, V.R.; Kukkapalli, V.M.; Pulugurtha, S.S. Crash risk factors associated with injury severity of teen drivers. IATSS Res. 2019, 43, 37–43.
  20. Rahman, M.A.; Hossain, M.M.; Mitran, E.; Sun, X. Understanding the contributing factors to young driver crashes: A comparison of crash profiles of three age groups. Transp. Eng. 2021, 5, 100076.
  21. Williams, A.F. Teenage drivers: Patterns of risk. J. Saf. Res. 2003, 34, 5–15.
  22. Scott-Parker, B.; Oviedo-Trespalacios, O. Young driver risky behaviour and predictors of crash risk in Australia, New Zealand and Colombia: Same but different? Accid. Anal. Prev. 2017, 99, 30–38.
  23. Islam, M.; Hosseini, P.; Kakhani, A.; Jalayer, M.; Patel, D. Unveiling the risks of speeding behavior by investigating the dynamics of driver injury severity through advanced analytics. Sci. Rep. 2024, 14, 22431.
  24. Das, A.; Ahmed, M.M.; Ghasemzadeh, A. Using trajectory-level SHRP2 naturalistic driving data for investigating driver lane-keeping ability in fog: An association rules mining approach. Accid. Anal. Prev. 2019, 129, 250–262.
  25. Hossain, A.; Sun, X.; Islam, S.; Rahman, A.; Das, S. Single-vehicle roadway departure crashes at rural two-lane highway curved segments: A diagnosis using pattern recognition. Int. J. Transp. Sci. Technol. 2024, 15, 298–318.
  26. Lewis-Evans, B. Crash involvement during the different phases of the New Zealand Graduated Driver Licensing System (GDLS). J. Saf. Res. 2010, 41, 359–365.
  27. Xue, G.; Liu, L. Real-world crash configurations and traffic violations among newly licensed young drivers with different route familiarity levels. Traffic Inj. Prev. 2024, 25, 673–679.
  28. French, D.; Gerona, R.R. Alcohol and Drug Testing in Motor Vehicle Crashes. Clin. Lab. Med. 2025.
  29. Fell, J.C.; Waehrer, G.; Voas, R.B.; Auld-Owens, A.; Carr, K.; Pell, K. Effects of enforcement intensity on alcohol impaired driving crashes. Accid. Anal. Prev. 2014, 73, 181–186.
  30. Paleti, R.; Eluru, N.; Bhat, C.R. Examining the influence of aggressive driving behavior on driver injury severity in traffic crashes. Accid. Anal. Prev. 2010, 42, 1839–1854.
  31. Ortmann, N.; Haddad, Y.K.; Beck, L. Special report from the CDC: Provider knowledge and practices around driving safety and fall prevention screening and recommendations for their older adult patients, DocStyles 2019. J. Saf. Res. 2023, 86, 401–408.
  32. Fiorentini, N.; Losa, M. Handling imbalanced data in road crash severity prediction by machine learning algorithms. Infrastructures 2020, 5, 61.
  33. Obasi, I.C.; Benson, C. Evaluating the effectiveness of machine learning techniques in forecasting the severity of traffic accidents. Heliyon 2023, 9, e18812.
  34. Aldred, R. Inequalities in self-report road injury risk in Britain: A new analysis of National Travel Survey data, focusing on pedestrian injuries. J. Transp. Health 2018, 9, 96–104.
  35. Fountas, G.; Fonzone, A.; Gharavi, N.; Rye, T. The joint effect of weather and lighting conditions on injury severities of single-vehicle accidents. Anal. Methods Accid. Res. 2020, 27, 100124.
  36. Mehdizadeh, A.; Cai, M.; Hu, Q.; Alamdar Yazdi, M.A.; Mohabbati-Kalejahi, N.; Vinel, A.; Rigdon, S.E.; Davis, K.C.; Megahed, F.M. A review of data analytic applications in road traffic safety. Part 1: Descriptive and predictive modeling. Sensors 2020, 20, 1107.
  37. Taheri, A.; Azarasa, N.; Iranmanesh, M.; Seyedabrishami, S.; O’Hern, S.; Lord, D. The influences of strict and post-strict lockdowns due to the Covid-19 pandemic on crash severity on rural roads: A case study of Khorasan Razavi, Iran. Transp. Res. Part F Traffic Psychol. Behav. 2023, 97, 231–245.
  38. Werner de Vargas, V.; Schneider Aranda, J.A.; dos Santos Costa, R.; da Silva Pereira, P.R.; Victória Barbosa, J.L. Imbalanced data preprocessing techniques for machine learning: A systematic mapping study. Knowl. Inf. Syst. 2023, 65, 31–57.
  39. Semple, T.; Fountas, G.; Fonzone, A. Who is more likely (not) to make home-based work trips during the COVID-19 pandemic? The case of Scotland. Transp. Res. Rec. 2023, 2677, 904–916.
  40. Hatfield, J.; Fernandes, R. The role of risk-propensity in the risky driving of younger drivers. Accid. Anal. Prev. 2009, 41, 25–35.
  41. Olszewski, P.; Szagała, P.; Rabczenko, D.; Zielińska, A. Investigating safety of vulnerable road users in selected EU countries. J. Saf. Res. 2019, 68, 49–57.
  42. Kockelman, K.M.; Kweon, Y.-J. Driver injury severity: An application of ordered probit models. Accid. Anal. Prev. 2002, 34, 313–321.
  43. Knight, P.J.; Iverson, D.; Harris, M.F. Early driving experience and influence on risk perception in young rural people. Accid. Anal. Prev. 2012, 45, 775–781.
  44. Gray, R.C.; Quddus, M.A.; Evans, A. Injury severity analysis of accidents involving young male drivers in Great Britain. J. Saf. Res. 2008, 39, 483–495.
  45. Ehsani, J.P.; Seymour, K.E.; Chirles, T.; Kinnear, N. Developing and testing a hazard prediction task for novice drivers: A novel application of naturalistic driving videos. J. Saf. Res. 2020, 73, 303–309.
  46. Olowosegun, A.; Fountas, G.; Davis, A. Effective trigger speeds for vehicle activated signs on 20 mph roads in rural areas. Safety 2024, 10, 25.
  47. Elvik, R. Risk factors as causes of accidents: Criterion of causality, logical structure of relationship to accidents and completeness of explanations. Accid. Anal. Prev. 2024, 197, 107469.
  48. Department for Transport. Road Accidents and Safety Statistics. data.gov.uk 2025. Available online: https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-accidents-safety-data (accessed on 1 April 2025).
Figure 1. Random undersampling of the majority class (RUMC).
Figure 2. Methodology flowchart for the CART and RUMC-CART models.
Figure 3. Distribution of the severity classes.
Figure 4. Distribution of crashes in urban and rural areas.
Figure 5. ROC curves of the training, validation, and testing datasets.
Figure 6. Feature importance of explanatory variables (CART).
Figure 7. Feature importance of explanatory variables (RUMC-CART).
Figure 8. Root node (0) and the primary branches: node 1 and node 32.
Figure 9. Left branch of the RUMC-CART decision tree, illustrating node 1 and its subsequent child nodes.
Figure 10. Right branch of the RUMC-CART decision tree, illustrating node 32 and its subsequent child nodes.
Table 1. Confusion matrix of a binary classification model.

| Observed class | Predicted class 0 | Predicted class 1 |
|---|---|---|
| 0 | True Negative (TN) | False Positive (FP) |
| 1 | False Negative (FN) | True Positive (TP) |

Table 2. Share of each crash severity category.

| Severity level | Count | % |
|---|---|---|
| Slight injury | 41,682 | 78.69 |
| Serious injury | 11,284 | 21.41 |
| Total | 52,966 | |

Table 3. Decision trees’ evaluation metrics.

| Dataset | Model | Accuracy | Recall | Specificity | Precision | F1-Score | AUC |
|---|---|---|---|---|---|---|---|
| Training | CART | 0.79 | 0.06 | 0.99 | 0.58 | 0.11 | 0.66 |
| Training | RUMC-CART | 0.63 | 0.58 | 0.68 | 0.64 | 0.61 | 0.66 |
| Validation | CART | 0.78 | 0.04 | 0.99 | 0.45 | 0.07 | 0.64 |
| Validation | RUMC-CART | 0.61 | 0.59 | 0.65 | 0.63 | 0.61 | 0.64 |
| Testing | CART | 0.79 | 0.05 | 0.99 | 0.52 | 0.10 | 0.66 |
| Testing | RUMC-CART | 0.62 | 0.58 | 0.67 | 0.63 | 0.60 | 0.65 |

Table 4. Confusion matrices of training, validation, and testing datasets for the CART and RUMC-CART models.

| Model | Observed class | Training: predicted 0 | Training: predicted 1 | Validation: predicted 0 | Validation: predicted 1 | Testing: predicted 0 | Testing: predicted 1 |
|---|---|---|---|---|---|---|---|
| CART | 0 | 28,205 (99%) * | 339 (1%) | 6151 (99%) | 86 (1%) | 6219 (99%) | 82 (1%) |
| CART | 1 | 7462 (94%) | 470 (6%) | 1639 (82%) | 369 (18%) | 1554 (95%) | 90 (5%) |
| RUMC-CART | 0 | 5372 (68%) | 2523 (32%) | 1099 (65%) | 590 (35%) | 1140 (67%) | 560 (33%) |
| RUMC-CART | 1 | 3336 (42%) | 4566 (58%) | 700 (41%) | 997 (59%) | 715 (42%) | 970 (58%) |

* The values displayed relate to the percentages of each class from the training, validation, and testing datasets.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Taheri, A.; Switala, K.; Fountas, G.; Sheykhfard, A.; Dadashzadeh, N.; Müller, S. An Empirical Analysis of Crash Injury Severity Among Young Drivers in England: Accounting for Data Imbalance. Appl. Sci. 2025, 15, 4793. https://doi.org/10.3390/app15094793
