Article

MobileNetV3–Transformer-Based Prediction of Highway Accident Severity

1 School of Civil and Transportation Engineering, Hebei University of Technology, Tianjin 300401, China
2 Hebei Transportation Investment Group Corporation, Shijiazhuang 050051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12694; https://doi.org/10.3390/app152312694
Submission received: 25 September 2025 / Revised: 16 October 2025 / Accepted: 20 October 2025 / Published: 30 November 2025

Abstract

Traffic accidents on highways are often characterized by high destructiveness and severe casualties. Predicting accident severity and understanding its causes are crucial for enhancing highway safety. To address the limited prediction accuracy and poor interpretability of traditional machine learning and deep learning methods, this study proposes an accident severity prediction model based on a hybrid architecture of MobileNetV3 and a Transformer. The model first encodes numerical accident-related variables into two-dimensional images using the Gramian Angular Field (GAF) method. Local spatial features are then extracted via the depthwise separable convolution modules of MobileNetV3, and long-range temporal dependencies are captured through the Transformer encoder, which outputs the final prediction. The proposed model is compared with Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), MobileNetV3, a Transformer, and LSTM–Transformer architectures in terms of prediction performance. Results show that the MobileNetV3–Transformer model achieves the highest accuracy, 0.9549. Finally, the DeepSHAP interpretability algorithm is introduced to reveal the systemic influence and contribution of significant factors to accident severity. The results indicate that vehicle age, special road conditions, speed limits, and lighting conditions are closely related to the severity of highway accidents. This study provides a reliable theoretical basis for early warning of highway accidents and for refining control measures to further enhance highway safety.

1. Introduction

With the rapid increase in the global number of motor vehicles, traffic accidents have also been rising accordingly [1], making driving safety a widespread concern. According to statistics, approximately 1.19 million people die in traffic accidents worldwide each year, while an additional 20 to 50 million sustain non-fatal injuries, with a significant proportion left with permanent disabilities [2]. Because highway accidents tend to be more severe, unevenly distributed in space, and served by more complex emergency response systems, greater attention and more specialized measures are required for their prevention. Traffic accidents are not random events; rather, they are influenced by a variety of factors that contribute to different levels of severity. Establishing an accident severity prediction model can help identify the key factors influencing accident outcomes, thereby enabling the implementation of effective preventive measures to reduce the likelihood of accidents and ensure the safety of road users.
In early research on traffic accident severity prediction, scholars commonly employed statistical regression methods to construct models. Among these, the Logit regression model and its various extensions have been widely used [3]. Other approaches include the Poisson regression model and the Negative Binomial (NB) regression model [4]. Discrete choice models, grounded in random utility theory, exhibit strong explanatory power [5]. However, their ability to handle multi-dimensional data is limited. Machine learning methods subsequently emerged and, with advantages such as powerful nonlinear fitting capabilities, gradually replaced discrete choice models. Methods such as Random Forest [6], Extreme Gradient Boosting and Support Vector Machines [7], and Fault Tree Analysis [8] have been widely applied. These models are effective in identifying the impact of various influencing factors on accident severity, but they struggle to further capture the spatiotemporal correlations of accident occurrences and thereby improve prediction performance. With the rapid development of artificial intelligence in recent years, artificial neural networks [9] and deep learning models [10], in addition to traditional machine learning models, have also achieved favorable results in accident severity prediction. However, some issues remain unresolved. First, deep learning is a black-box method that cannot explain the relationship between influencing factors and accident severity [11]. Second, the use of large numbers of hyperparameters and complex deep architectures, combined with weight updates via gradient-based backpropagation, leads to slow training when handling high-dimensional data and increases the risk of becoming trapped in local optima.
Lastly, deep learning models tend to suffer from reduced predictive performance on imbalanced datasets, which are common in traffic accident data, thereby lowering overall prediction accuracy. Li et al. [12] proposed a prediction framework named ReMAHA-CatBoost, which applies the working principle of oversampling algorithms and shows how, at the data level, oversampling can mitigate the accuracy loss caused by imbalanced data.
The aim of this study is to enhance road safety and mitigate the adverse impacts of traffic accidents by developing a more accurate, interpretable, and actionable prediction model for accident severity. To achieve this goal, we propose an innovative MobileNetV3–Transformer model that integrates the strengths of transfer learning with a spatiotemporal local attention mechanism. To address the issue of data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is first applied to the training data to balance the class distribution. A local attention mechanism is then incorporated into the model architecture, enabling the model to focus more effectively on critical features of minority classes during training. Additionally, a weighted cross-entropy loss function is used to further enhance the model’s sensitivity to minority class samples and reduce bias toward the majority class. Spearman correlation analysis is conducted to evaluate the relationship between feature variables and prediction outcomes, and variables with significant associations to accident severity are selected as predictors. To tackle the black-box nature of deep learning models and to identify key influencing factors, DeepSHAP is employed for model interpretation, allowing the contribution of each predictor to be quantitatively assessed.
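The core idea behind SMOTE, as referenced above, is to synthesize new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. The following is a minimal NumPy sketch of that interpolation step, not the authors' implementation (production code would typically use `imblearn.over_sampling.SMOTE`); the function name and parameters are illustrative:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: each synthetic point is a random interpolation
    between a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbors, skipping self
        j = rng.choice(nn)
        lam = rng.random()                   # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.default_rng(1).random((6, 2))   # toy minority-class samples
synth = smote_sketch(X_min, 10, k=3)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies inside the minority class's bounding box, which is why SMOTE balances the class distribution without generating out-of-range feature values.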
This study introduces new elements to the prediction of accident severity. The innovations include the application of MobileNetV3 for transfer learning and the integration of a spatiotemporal local attention mechanism to enhance the model’s predictive performance, thereby improving overall accuracy. By comparing with several benchmark models, the superiority of our approach is effectively demonstrated. Numerical accident-related variables are encoded into two-dimensional images using the Gramian Angular Field (GAF) method, allowing the extraction of local features from the generated images.

2. Related Works

In recent years, highway accident severity prediction has become an important aspect in the field of traffic safety. Therefore, the selection of influencing factors, as well as the application and interpretation of deep learning models, have become key focuses in the current field of accident severity prediction.

2.1. Study on Factors Affecting Accident Severity

Accident causation is multifactorial. A synthesis of domestic and international research shows that the important factors affecting traffic accident severity can be broadly grouped into three categories: driver factors, road and environmental factors, and vehicle factors [13].
Santamariña-Rubio et al. [14] found an interaction between gender and age on accident severity: males have a higher risk of injury among children and young drivers, while females have a higher risk in older age groups. In addition, Choudhary et al. [15] showed that distracted driving affects accident severity, with texting and manually adjusting the music player having a greater effect on accident risk. Behzadi et al. [16] used traffic accident data from 2015 to 2021 in Kerman province in southeastern Iran and found that the most important factors include distraction, speeding, brake failure, and average temperature.
A study by Çelik et al. [17] showed that factors such as the educational background of the driver, the type of road, the presence of pedestrian crossings, the time period of the accident, and the meteorological conditions all influence the severity of traffic accidents to varying degrees. Alkheder et al. [18] argued that the severity of traffic accidents is affected by a number of factors, and it was found that eight potential factors, namely, the characteristics of the injured person, the nature of the accident, the type of road, number of lanes, speed limit, wearing of seat belts, specific location of the injured person, and behavior of the traffic participants, have different weights in the assessment of accident severity. Among them, the type of road has the greatest impact on accident severity. Faisal et al. [19] analyzed crash data from 2015 to 2018 in Washington State using a hybrid Logit model to investigate the factors affecting injury severity outcomes in large truck crashes, and proposed that thirteen new parameters consistently showed a stable impact on injury severity, with six of these parameters all related to road and environmental factors.
In the study of vehicle factors, Ratanavaraha [20] used crash data from Thai highways to develop a multivariate Logistic regression model. The results clearly indicated that travel speed is a key determinant of highway crash severity, outweighing other considerations. Haq [21] classified collisions involving trucks into four categories: single-truck accidents, truck–car collisions, truck–pickup or truck–SUV collisions, and truck–truck collisions. The study noted significant differences in the key factors that determine the severity of occupant injuries across these collision types. Liu et al. [22] integrated traffic accident events, participants, environmental factors, and other multi-dimensional elements by constructing a traffic knowledge graph.

2.2. Study on Accident Severity Prediction Model

In early research on accident severity prediction, scholars commonly used statistical regression methods to construct models, but these models have limitations such as low accuracy and difficulty in handling nonlinear relationships. In recent years, owing to its ability to uncover latent correlations and provide more accurate predictions than traditional statistical methods, machine learning has been increasingly applied to accident severity prediction.
Çelik et al. [17] used a multinomial Logit model to analyze traffic accident data from two provinces of Turkey and found that road type, driver literacy level, and driver age have a significant effect on accident severity. Hou et al. [23] used a binary Logit model to estimate propensity scores, matched them using nearest-neighbor matching with calipers, and then used a random-effects Negative Binomial regression model (RENB) to further analyze the mechanism behind the safety benefits of climbing lanes. Abbas [24] predicted the number of casualties and severity of traffic accidents on Egyptian rural roads using a Negative Binomial distribution model based on 13 traffic safety indicators. The study revealed that the high frequency of accidents and severity of casualties on Egyptian rural roads are primarily associated with factors such as driver behavior and road conditions. Jiang et al. [25] used a zero-inflated ordered probit model to analyze the effect of curbs on injury severity in single-vehicle accidents. The study found that while the presence of curbs increases the probability of accidents entering an injury-prone state, it effectively reduces the risk of severe and fatal accidents.
Harb et al. [26] assessed the degree of influence of various types of factors on rear-end, front-end, and side collision accidents using the Decision Tree method, and further analyzed the importance of key influencing variables by ranking them with a Random Forest algorithm. Tang et al. [27] compared Random Forest (RF) with the K-Nearest Neighbors (KNN) model and applied them to predicting traffic accident durations. Olutayo et al. [28] used an Artificial Neural Network with a Decision Tree method on data from the highways with the highest accident rates in Nigeria, and the results showed that tire blowouts, loss of vehicle control, and speeding are the major factors contributing to accident severity. Alkheder et al. [18] used three machine learning models, namely a Support Vector Machine, a Decision Tree, and a Bayesian Network, to explore accident severity risk factors, and found that the Bayesian Network model performed best in terms of prediction accuracy. To achieve both high prediction accuracy and model interpretability, Yan et al. [29] proposed a hybrid model that integrates Random Forest (RF) and Bayesian Optimization (BO). Experimental results show that the BO-RF model outperforms traditional algorithms in terms of accuracy.
Deep learning, as a cutting-edge machine learning method, is gradually being introduced into the field of accident severity prediction. Shahdah et al. [30] explored the impact of unsafe human behaviors on accident severity based on a Convolutional Neural Network (CNN) model and compared it with a traditional Logistic regression model. The results showed that the CNN outperforms traditional methods in analytical effectiveness. Sameen et al. [31] developed a deep learning model based on a Recurrent Neural Network (RNN) to predict the injury severity of traffic accidents occurring on the North–South Expressway (NSE) in Malaysia. Alhaek et al. [32] used a Convolutional Neural Network to extract spatial features while using a Bidirectional Long Short-Term Memory network (BiLSTM) to capture temporal dependencies between features, and constructed an accident severity prediction model based on CNN-BiLSTM using traffic accident data from London and Liverpool as a research sample. Manzoor et al. [33] applied a CNN-BiLSTM model to the 20 most relevant feature dimensions and proposed RFCNN, an ensemble learning method that identifies the key factors affecting road traffic accident severity, achieving very high prediction accuracy.
Although recent studies have made notable progress in accident severity prediction using deep learning, each approach exhibits inherent limitations that hinder its applicability in real-world highway safety scenarios. Convolutional Neural Networks (CNNs) [30] excel at capturing local spatial patterns but struggle to model the long-range temporal dependencies inherent in accident evolution. Recurrent architectures such as LSTM or RNN [31] can learn sequential dynamics but suffer from slow training, vanishing gradients, and limited parallelization. Hybrid models like CNN-BiLSTM [32,33] attempt to combine spatial and temporal modeling; however, they often involve complex architectures with large parameter counts, leading to high computational costs and a tendency to overfit on imbalanced traffic datasets. Moreover, pure Transformer-based models, while powerful in capturing global interactions through self-attention, require substantial data and computational resources to generalize well—conditions rarely met in highway accident datasets, which are typically small-scale and highly skewed (e.g., fatalities account for only ~2.6% of cases in this study). Notably, recent approaches that combine LSTM and Transformer architectures for road-accident severity identification [34] offer fresh ideas; however, they likewise suffer from high computational and storage overheads and a strong reliance on large quantities of high-quality training data. Crucially, most of these deep models operate as “black boxes,” offering little insight into which factors drive severe outcomes—a major barrier to their adoption in safety-critical domains like transportation policy. Therefore, there remains a clear need for a lightweight, accurate, and interpretable framework that balances local feature extraction, global dependency modeling, and real-world deployability.
This gap motivates our proposed MobileNetV3–Transformer architecture, which integrates efficient convolutional encoding with attention-based temporal reasoning while enabling post hoc interpretability via DeepSHAP.

2.3. Interpretability Study of Accident Severity Prediction Models

As data-driven modeling approaches, deep learning algorithms show good predictive ability. However, their black-box nature makes the internal reasoning process of a model difficult to understand and explain intuitively. Many scholars have therefore studied model interpretability. Cicek et al. [35] introduced a variety of interpretable machine learning models for predicting traffic accident severity. Ma et al. [36] proposed a deep learning method based on impact factors, classifying accidental injuries into two levels: serious and non-serious. They analyzed the impact of each variable on accident severity by using the CatBoost model in combination with Shapley values to assess the strength and dependence of each variable's influence on injury severity, while eliminating factors with low correlation. After filtering out highly correlated key factors and constructing temporal features, they further applied a deep learning model based on a stacked sparse autoencoder (SSAE) to predict injury severity within each cluster of results. The introduction of multiple explanatory mechanisms significantly improves model transparency, explanatory ability, domain knowledge integration, and scientific consistency, effectively alleviating the inherent unexplainability of traditional prediction models [37]. At present, SHAP has become a mainstream explanatory tool by virtue of its theoretical rigor and powerful global and local explanatory capabilities.

3. Methodology

3.1. MobileNetV3

MobileNet is a lightweight deep neural network architecture with a simple and efficient design. It significantly reduces model parameters and computational overhead by introducing depthwise separable convolution, which splits the conventional convolution operation into a depthwise convolution and a pointwise convolution. The network is widely recognized for its compactness, high computational efficiency, and fast inference. MobileNetV3 [38] builds on the empirical foundations of its predecessors (V1 and V2) by incorporating Neural Architecture Search (NAS) with an improved convolution module, further optimizing the model in terms of inference speed and computational performance.
  • Depthwise Separable Convolution

    Depthwise convolution applies one kernel per input channel, and pointwise (1 × 1) convolution then mixes the channels:

    $Y_c(i, j) = \sum_{m,n} X_c(i + m,\, j + n)\, K_c(m, n)$

    $Y_k(i, j) = \sum_{c} X_c(i, j)\, W_{k,c}$

  • Squeeze-and-Excitation

    The squeeze step aggregates each channel by global average pooling:

    $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$
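The two equations above can be sketched directly in NumPy. This is an illustrative toy implementation (valid padding, stride 1, no bias), not MobileNetV3's actual kernels; its purpose is to show why the factorization is cheap: a full convolution here would need 16 × 3 × 3 × 3 = 432 weights, while the depthwise plus pointwise pair needs only 3 × 3 × 3 + 16 × 3 = 75.

```python
import numpy as np

def depthwise_separable_conv(x, depth_kernels, point_weights):
    """Depthwise separable convolution (valid padding, stride 1).

    x: (C, H, W) input; depth_kernels: (C, k, k), one kernel per channel;
    point_weights: (K, C), the 1x1 pointwise channel-mixing matrix.
    """
    C, H, W = x.shape
    k = depth_kernels.shape[1]
    Hh, Ww = H - k + 1, W - k + 1
    # Depthwise step: each channel is convolved with its own kernel only
    depth_out = np.zeros((C, Hh, Ww))
    for c in range(C):
        for i in range(Hh):
            for j in range(Ww):
                depth_out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * depth_kernels[c])
    # Pointwise step: a 1x1 convolution mixes channels at every position
    return np.einsum("kc,chw->khw", point_weights, depth_out)

x = np.random.rand(3, 8, 8)
y = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(16, 3))
print(y.shape)  # (16, 6, 6)
```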

3.2. Transformer

Transformer is a deep learning architecture designed for modeling sequential data [39], and its basic architecture consists of an Encoder and a Decoder, both of which are composed of a multi-layer self-attention mechanism and a feed-forward neural network. Unlike traditional Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN) that rely on sequential processing, Transformer achieves global dependency modeling between arbitrary positions in a sequence through the self-attention mechanism, thus improving parallel computing power and modeling efficiency.
1. Input Embedding and Positional Encoding
The input sequence is X = [x1, x2, …, xn], and each token is mapped to a vector, plus positional encoding:
$E = X W_e + PE$
where $W_e$ denotes the word embedding matrix and $PE$ represents the positional encoding (commonly sine and cosine functions).
2. Scaled Dot-Product Attention
The inputs are linearly transformed to generate Queries ($Q$), Keys ($K$), and Values ($V$):
$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}$
The attention output is then:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$
where $d_k$ denotes the dimension of the key vectors used for scaling, and the softmax ensures weight normalization.
3. Multi-Head Attention
The attention mechanism is divided into $h$ distinct heads, each of which learns independently:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$
Each head is computed as:
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$
4. Feed-Forward Network (FFN)
Applied independently at each position:
$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2$
5. Layer Normalization
Each sub-layer (multi-head attention and FFN) is equipped with a residual connection followed by normalization:
$\mathrm{Output} = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
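The scaled dot-product attention equation above can be verified with a few lines of NumPy. This is a minimal single-head sketch for illustration (no masking, no learned projections); note that each row of the attention weight matrix sums to 1 because of the softmax:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) scaled similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

Q = np.random.rand(4, 8)
K = np.random.rand(6, 8)
V = np.random.rand(6, 16)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, np.allclose(w.sum(axis=-1), 1.0))  # (4, 16) True
```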

3.3. LSTM-Transformer

This paper also applies a hybrid accident severity prediction model that integrates LSTM temporal modeling with the global attention mechanism of the Transformer. Through the collaborative operation of the LSTM's gating units and the self-attention mechanism, the model effectively captures local temporal patterns while analyzing long-range dependencies. First, the raw time-series data is segmented using a sliding-window approach to construct multi-dimensional temporal slices, preserving the dynamic continuity of accident evolution. Each slice is then processed by an LSTM network in both forward and backward directions to extract features. The forget and input gates regulate information flow, filter out noise, and generate hidden state vectors that encode local temporal dependencies. Subsequently, the sequence of hidden states is fed into the Transformer encoder layers. Leveraging the dynamic allocation of multi-head self-attention weights, the model uncovers global correlations across time steps. Positional encoding is introduced to incorporate temporal order as prior knowledge. Finally, a gated feature fusion module adaptively aggregates the local representations from the LSTM and the global representations from the Transformer. The fused features are passed through a multi-layer perceptron (MLP) to output the predicted severity level.

3.4. MobileNetV3–Transformer (Large)

The proposed MobileNetV3–Transformer model predicts highway accident severity through the following four sequential steps:
Step 1: Input Encoding via the Gramian Angular Field (GAF)
Numerical and categorical accident variables (26 in total) are first normalized and then encoded into a 600 × 600 image using the Gramian Angular Field (GAF) method. This transformation preserves temporal continuity and enables the use of vision-based deep learning architectures.
Step 2: Local Feature Extraction with MobileNetV3
The GAF image is fed into a MobileNetV3-Large backbone (trained from scratch). Depthwise separable convolutions and squeeze-and-excitation (SE) blocks are employed to efficiently extract local spatial features while keeping the model lightweight. The output is a 960-dimensional feature vector after global average pooling.
Step 3: Global Dependency Modeling via a Transformer Encoder
The 960-D feature vector is treated as a single token and passed through a 2-layer Transformer encoder with 8 attention heads. No positional encoding is applied, as the input represents a static feature snapshot rather than a time series. The self-attention mechanism refines the feature representation by modeling implicit interactions among accident factors.
Step 4: Classification with Weighted Loss and SMOTE
The refined feature is fed into a fully connected classification head (960 → 256 → 3) with ReLU activation and 0.1 dropout. To address class imbalance (fatalities: 2.6%), we apply SMOTE on the training set and use a weighted cross-entropy loss, where class weights are inversely proportional to their frequencies.
The overall architecture is illustrated in Figure 1.
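The inverse-frequency weighting in Step 4 can be sketched as follows. The class counts come from the dataset used in this study (80,458 slight, 19,031 serious, 2,660 fatal injuries); the helper function is an illustrative NumPy stand-in for the weighted cross-entropy loss, not the authors' training code (in practice this role is played by, e.g., `torch.nn.CrossEntropyLoss(weight=...)`):

```python
import numpy as np

# Class counts from the dataset: slight, serious, fatal
counts = np.array([80458, 19031, 2660], dtype=float)
# Inverse-frequency weights: rare classes (fatal) get the largest weight
class_weights = counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, w):
    """Mean weighted cross-entropy: -(1/N) * sum_i w[y_i] * log p_i[y_i].

    probs: (N, C) predicted class probabilities; labels: (N,) integer class ids.
    """
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]   # probability of true class
    return float(np.mean(w[labels] * -np.log(picked + eps)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7]])
loss = weighted_cross_entropy(probs, np.array([0, 2]), class_weights)
```

With these counts the fatal class receives roughly 30 times the weight of the slight class, so a misclassified fatality contributes far more to the loss than a misclassified slight injury, countering the model's bias toward the majority class.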

4. Results and Discussion

4.1. Dataset

To validate the predictive performance of the proposed model, this study utilizes a subset of the UK Road Safety Dataset (2016–2020), focusing on accident data occurring on motorways or major A-roads in the UK. Analysis of the dataset reveals that accidents on motorways or major A-roads tend to exhibit higher severity compared to those on lower-class roads. Therefore, this study focuses on these types of accidents for further investigation. The original dataset primarily includes information related to road, weather, and environmental conditions at the time of the accident, as well as details concerning the vehicles involved, such as vehicle type, point of impact, and driver characteristics. This dataset is publicly accessible through the UK open data portal.
Data preprocessing can effectively enhance computational performance. Based on the defined research samples, data filtering was performed to retain only highway-related records, excluding non-highway samples according to road classification. Data cleaning and processing were then conducted. The original dataset contains 597,973 accident records. After preprocessing, the resulting dataset involves 102,149 injured individuals, including 80,458 slight injuries (78.8%), 19,031 serious injuries (18.6%), and 2660 fatalities (2.6%).
All selected accidents occurred on UK motorways or major A-roads, which are high-speed, controlled-access highways designed exclusively for motorized traffic. These roads prohibit pedestrian and cyclist access, have limited entry/exit points, and are functionally distinct from both urban streets and rural minor roads.
Regarding crash configuration, the dataset includes both single-vehicle and multi-vehicle collisions. Specifically, 29.1% of the records correspond to single-vehicle crashes, while 70.9% involve two or more vehicles (based on the “Number of vehicles” field, which is included as a predictor in Table 1).
Feature selection is performed using Pearson correlation coefficients and mutual information. As a result, 26 key variables are selected as the basis for the study. These variables encompass multiple dimensions, including accident-related information (e.g., month, time of day, number of vehicles involved), human factors (e.g., gender, age), vehicle-related attributes (e.g., vehicle type, vehicle condition, collision type, travel distance), road characteristics (e.g., road type, road condition, speed limit), and environmental factors (e.g., weather, lighting conditions). The goal of this selection is to comprehensively identify the influencing factors associated with accidents of varying severity levels. The final set of independent variables is presented in Table 1.
The selection of these 26 variables is not only data-driven but also grounded in established traffic safety literature. For instance, vehicle age, speed limit, and light conditions have been consistently identified as critical predictors of accident severity in prior studies [18,19,20]. Similarly, special road conditions (e.g., fog, oil spill) and number of vehicles are known to influence crash dynamics and injury outcomes [16,22]. By combining statistical relevance (via Pearson correlation and mutual information) with domain knowledge, we ensure that the selected features are both predictive and interpretable from a road safety perspective.
To enable the use of vision-based deep learning models, the 26 numerical accident variables are encoded into a 600 × 600 image using the Gramian Angular Field (GAF) method. Color intensity represents numerical magnitude, and different colors represent different value ranges. Each accident record is thus represented as a single image, which serves as input to the MobileNetV3 backbone. GAF is chosen because it preserves the temporal continuity and monotonicity of the original numerical sequence, which is critical for maintaining the semantic meaning of accident-related variables. Unlike other time-series-to-image methods (e.g., the Recurrence Plot or the Markov Transition Field), GAF maps the original data into a bijective, differentiable, and invertible representation, ensuring that no critical information is lost during transformation; this significantly reduces the risk of misinterpretation by the downstream deep learning model. Furthermore, GAF generates symmetric, smooth images with clear gradient structures (see Figure 2), which are well suited to convolutional architectures such as MobileNetV3 that rely on local spatial patterns. Empirical studies in time-series classification have shown that GAF-based representations consistently outperform raw sequences and other encodings in model accuracy and robustness. In our case, the high prediction performance (Accuracy = 0.9549) and near-perfect AUC scores (0.99–1.00) further validate that the GAF transformation enhances rather than degrades the reliability of severity prediction.
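The Gramian Angular Summation Field construction can be sketched in a few lines: rescale the series to [−1, 1], map each value to an angle φ = arccos(x), and form the symmetric matrix G[i, j] = cos(φ_i + φ_j). This is a minimal NumPy illustration of the general GAF idea (libraries such as pyts provide a full `GramianAngularField` transformer), not the paper's exact 600 × 600 pipeline:

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular Summation Field of a 1-D numerical series.

    Rescale to [-1, 1], map values to angles phi = arccos(x), then build
    G[i, j] = cos(phi_i + phi_j): a symmetric 2-D image of the series.
    """
    x = np.asarray(x, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # min-max rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])

g = gramian_angular_field([3.0, 7.0, 1.0, 5.0])
print(g.shape, np.allclose(g, g.T))  # (4, 4) True
```

Because cos(φ_i + φ_j) = cos(φ_j + φ_i), the resulting image is symmetric, and the original values remain recoverable from the main diagonal, which is the invertibility property the text relies on.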

4.2. Selection of Evaluation Metrics

This study employs multiple performance evaluation metrics, including Accuracy, Precision, Recall, and F1 Score, to comprehensively assess the predictive capability and practical effectiveness of the proposed hybrid model. Accuracy measures the proportion of correctly predicted samples among all samples, without distinguishing between classes. Precision emphasizes the proportion of true positive samples among those predicted as positive by the model. Recall reflects the model's ability to identify actual positive samples, i.e., the proportion of true positives among all real positive cases. The F1 Score is the harmonic mean of Precision and Recall, which evaluates the balance between the two and reflects the overall performance of the model in a more balanced manner.
The corresponding calculation formulae are as follows:
$\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Prec} = \dfrac{TP}{TP + FP}$

$\mathrm{Rec} = \dfrac{TP}{TP + FN}$

$F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$TP$ (True Positive) refers to the number of samples that are actually positive and correctly predicted as positive. $FP$ (False Positive) denotes the number of samples that are actually negative but incorrectly predicted as positive. $FN$ (False Negative) denotes the number of samples that are actually positive but incorrectly predicted as negative. $TN$ (True Negative) indicates the number of samples that are actually negative and correctly predicted as negative. If the model demonstrates strong performance across all metrics, including Accuracy, Precision, Recall, and F1 Score, it indicates robust discriminative ability and stability in the task of traffic accident risk prediction.
In addition to these metrics, the Receiver Operating Characteristic (ROC) curve, a widely used evaluation tool derived from the confusion matrix, is employed. The horizontal axis represents the False Positive Rate (FPR), the proportion of negative samples incorrectly classified as positive among all actual negative cases. The vertical axis denotes the True Positive Rate (TPR), the proportion of correctly identified positive samples among all actual positive cases. The ROC curve provides a comprehensive view of the model's discriminative ability across different classification thresholds. The specific formulae for calculating FPR and TPR are as follows:
$$FPR = \frac{FP}{FP + TN}$$
$$TPR = \frac{TP}{TP + FN}$$
The Area Under the Curve (AUC) refers to the area under the ROC curve. When comparing the performance of multiple classification models, ROC curves are plotted for each model, and their corresponding AUC values are used as a basis for evaluation. A higher AUC value indicates stronger discriminative capability in distinguishing between positive and negative samples, as well as higher predictive accuracy. From a statistical perspective, the AUC can be interpreted as the probability that a randomly chosen positive sample will receive a higher score than a randomly chosen negative sample among all possible positive-negative sample pairs.
$$AUC = \frac{\sum_{i \in \mathrm{positiveClass}} rank_i - \frac{M(M+1)}{2}}{M \times N}$$
In the formula, M denotes the number of positive samples, N represents the number of negative samples, and $rank_i$ is the rank assigned to sample i after sorting all samples by predicted score in ascending order.
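The rank-based formulation can be verified on a tiny example. The `auc_rank` helper below is illustrative (not the authors' code), and tied scores are ignored for brevity:

```python
def auc_rank(scores, labels):
    """Rank-based AUC: (sum of positive-sample ranks - M(M+1)/2) / (M * N).

    Ranks start at 1 for the lowest predicted score; ties are not handled
    in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {i: r for r, i in enumerate(order, start=1)}
    pos = [i for i, y in enumerate(labels) if y == 1]
    m, n = len(pos), len(labels) - len(pos)
    return (sum(rank[i] for i in pos) - m * (m + 1) / 2) / (m * n)

# Two positives ranked 2nd and 4th out of four samples -> AUC = 0.75
auc = auc_rank([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

This matches the probabilistic interpretation: 3 of the 4 possible positive–negative pairs are ranked correctly.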

4.3. Experimental Results and Analysis

In this study, the dataset is divided into a training set and a testing set. Specifically, 80% of the data samples are randomly selected for model training to enable the model to effectively capture the underlying patterns and features, and the remaining 20% are used as the test set to evaluate the model's generalization ability and fitting performance on unseen data. After hyperparameter tuning, the final experiments are conducted using the hyperparameter configuration listed in Table 2; the detailed model architecture and hyperparameters are provided in Appendix A.
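The 80/20 split described above can be reproduced with a simple seeded shuffle. This is a sketch; the helper name and seed are illustrative assumptions, not details reported in the paper:

```python
import random

def train_test_split(records, test_ratio=0.2, seed=42):
    """Randomly partition records into train (80%) and test (20%) sets."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)          # seeded shuffle for reproducibility
    cut = int(len(records) * (1 - test_ratio))
    train = [records[i] for i in idx[:cut]]
    test = [records[i] for i in idx[cut:]]
    return train, test

train, test = train_test_split(list(range(100)))
```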
To validate the superiority of the proposed model, several baseline models are selected for comparative analysis, including a CNN model, an LSTM model, a MobileNetV3 model, a Transformer model, and a combined LSTM-Transformer model. Table 3 presents a comparative performance evaluation of these deep learning models along with the proposed hybrid model in predicting accident severity.
The results demonstrate that the proposed MobileNetV3–Transformer hybrid model outperforms the other models across all evaluation metrics, achieving an Accuracy of 0.9549, Precision of 0.9674, Recall of 0.8290, and an F1 Score of 0.8862. It is followed by the LSTM–Transformer model, which achieves an Accuracy of 0.9194, Precision of 0.9657, Recall of 0.8516, and F1 Score of 0.8329. The CNN model yields the lowest performance, with an Accuracy of 0.8267. The MobileNetV3–Transformer model effectively mitigates feature representation bias in class-imbalanced scenarios and significantly enhances prediction accuracy. By leveraging the lightweight local perception capability of MobileNetV3 and the global attention mechanism of the Transformer, a spatiotemporal feature fusion framework is constructed to support collaborative decision-making. Compared with the other deep learning models, the MobileNetV3–Transformer model exhibits superior performance in predicting the severity of traffic accidents.
While traffic accidents involve inherent randomness, the severity outcome is not purely stochastic; it is influenced by systematic factors such as vehicle condition, speed, lighting, and road environment. The high performance of all models (Accuracy: 0.8267–0.9549) reflects this underlying predictability. However, the key challenge in highway safety lies not in distinguishing slight from serious injuries, where data is abundant, but in the early and reliable identification of rare fatal cases (only 2.6% of the dataset).
As shown in Table 3, the proposed MobileNetV3–Transformer model achieves the highest F1 Score (0.8862) and Precision (0.9674) among all models, together with a high Recall (0.8290), indicating strong sensitivity to the minority fatality class without sacrificing precision. The LSTM–Transformer, while achieving the highest Recall (0.8516), suffers from lower overall accuracy (0.9194 vs. 0.9549), suggesting a trade-off between minority detection and generalization. The CNN model, with an accuracy of 0.8267, exhibits the lowest precision (0.6345), leading to excessive false alarms.
Therefore, the advantage of our model lies not merely in overall accuracy, but in its balanced and robust performance across all severity levels, particularly for the most critical and underrepresented class—fatalities. This capability is essential for real-world deployment, where both missed detections and false alarms carry significant safety and operational costs.
Figure 3 illustrates the training curves of each model. The loss values of all models decrease rapidly within the first 10 training epochs and then gradually stabilize, indicating a fast convergence rate across all models. In terms of accuracy, the MobileNetV3–Transformer hybrid model demonstrates the best performance, consistently maintaining an accuracy above 95%. The LSTM–Transformer, LSTM, Transformer, and MobileNetV3 models follow in descending order, while the CNN model exhibits the lowest accuracy with relatively large fluctuations. Overall, the trends in both the accuracy and loss curves show that combining the MobileNetV3 and Transformer architectures significantly improves the model's stability and discriminative capability in the classification task.
To statistically validate the performance advantage of the proposed model, we conducted 5 independent runs for each architecture and performed paired t-tests on F1-scores. As shown in Table 4, the MobileNetV3–Transformer achieves the highest mean F1-score (0.942 ± 0.001) and significantly outperforms all baseline models (p < 0.001), confirming that the improvement is not due to random variation but stems from the hybrid architecture’s superior modeling capability.
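The paired t-test used here compares matched per-run scores. A minimal sketch of the test statistic is given below; the per-run F1 values are hypothetical stand-ins, not the actual run results behind Table 4:

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on two matched samples of scores."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)                 # compare to t(n-1) for a p-value

# Hypothetical per-run F1 scores for two models over 5 independent runs
t = paired_t_statistic([0.941, 0.942, 0.943, 0.942, 0.942],
                       [0.903, 0.904, 0.904, 0.902, 0.903])
```

A large t value with 4 degrees of freedom corresponds to a p-value far below 0.001, consistent with the significance reported in Table 4.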
To further evaluate the computational efficiency of the proposed architecture, we compare the number of parameters, FLOPs (floating-point operations), and average inference time per sample across all baseline models. As shown in Table 5, the MobileNetV3–Transformer model achieves the best trade-off between accuracy and efficiency. Despite incorporating a Transformer encoder, its total parameter count (0.11 M) is significantly lower than that of the LSTM–Transformer (0.58 M) and the pure Transformer (0.15 M), thanks to the depthwise separable convolutions in MobileNetV3. Moreover, its inference time (0.1814 ms/sample on an AMD Ryzen 7 7840H CPU, manufactured by Advanced Micro Devices, Sunnyvale, CA, USA) is comparable to that of MobileNetV3 alone and much faster than the LSTM-based models, demonstrating its suitability for real-time highway accident severity prediction.
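Average per-sample inference time can be measured with a simple wall-clock loop. This generic helper is a sketch of the measurement procedure, not the authors' benchmarking code; with a trained model, `predict` would run a forward pass, and here a trivial stand-in is used:

```python
import time

def avg_inference_ms(predict, samples, warmup=5, repeats=50):
    """Average wall-clock inference time per sample, in milliseconds."""
    for s in samples[:warmup]:
        predict(s)  # warm-up passes to exclude one-off setup costs
    start = time.perf_counter()
    for _ in range(repeats):
        for s in samples:
            predict(s)
    elapsed = time.perf_counter() - start
    return elapsed * 1e3 / (repeats * len(samples))

# Stand-in "model": doubling each input
ms = avg_inference_ms(lambda s: s * 2, list(range(100)))
```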
The AUC, which stands for the Area Under the Receiver Operating Characteristic (ROC) Curve, is a crucial metric for evaluating model performance. After training and validation, AUC values are obtained for each class. Class 0 represents slight injuries, Class 1 represents serious injuries, and Class 2 represents fatalities. The resulting AUC scores are nearly perfect, with Class 0 = 1.00, Class 1 = 0.99, and Class 2 = 0.99, indicating that the proposed MobileNetV3–Transformer model maintains strong robustness even in multi-class and imbalanced data scenarios. In particular, the model demonstrates excellent early identification capability for the high-risk but low-frequency fatality class (which accounts for only 2.6% of the data). As shown in Figure 4, the ROC curves for all three classes closely approach the top-left corner, confirming that the model effectively captures the discriminative features associated with slight, serious, and fatal accidents.
Based on the research findings, the proposed MobileNetV3–Transformer model demonstrates several advantages in predicting the severity of highway traffic accidents. MobileNetV3 effectively extracts feature values from the image-encoded dataset, which significantly reduces training time and mitigates the risk of overfitting. By further applying a multi-head self-attention mechanism, the model captures long-range temporal dependencies within the feature sequences, thereby enhancing its global awareness of key risk factors associated with accidents. The training results indicate that the model improves prediction efficiency while maintaining high accuracy, resulting in good overall performance.

4.4. Model Interpretability and Key Factor Identification

To enhance the credibility of the model’s predictive results and verify its interpretability, this study introduces the DeepSHAP method for interpretability analysis, thereby obtaining a ranked list of feature importance.
The SHAP method (SHapley Additive exPlanations) [40] is a technique for interpreting the local behavior of machine learning models. It establishes a unified interpretability framework by quantifying the contribution of each input feature to the model’s output, enabling attribution analysis of feature importance. SHAP possesses key properties such as local accuracy, missingness, and consistency, which ensure the rationality and reliability of its explanations. DeepSHAP is an improved variant based on SHAP theory, integrated with the ideas of DeepLIFT. By providing a linear approximation of deep neural network structures, DeepSHAP significantly improves computational efficiency and better aligns with the interpretability requirements of deep models.
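The additive attribution that DeepSHAP approximates can be made concrete with an exact, brute-force Shapley computation on a toy model. This sketch is for intuition only: it enumerates all feature coalitions, which is feasible only for a handful of features, whereas DeepSHAP scales to deep networks via its DeepLIFT-style linear approximation.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at input x relative to a baseline.

    Features absent from a coalition take their baseline value. The weight
    |S|! (n-|S|-1)! / n! is the standard Shapley coalition weight.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy linear "model": attributions recover the coefficients exactly
f = lambda v: 2 * v[0] + 3 * v[1]
phi = shapley_values(f, [1.0, 1.0], [0.0, 0.0])
```

The local-accuracy property mentioned above holds by construction: the attributions sum to the difference between the model output at x and at the baseline.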
By identifying key factors that significantly influence accident severity prediction, the findings can provide valuable support for the development of traffic safety policies, ultimately helping to reduce the risk of traffic accidents. The DeepSHAP algorithm assigns a feature score to each input, where a positive score indicates a favorable contribution to the model’s prediction, and the magnitude of the score reflects the strength of that contribution. The dataset is categorized into three risk levels based on accident severity, each associated with different contributing factors. Subsequently, key features are identified for each level to accurately determine the most influential decision-making variables.
This study utilizes a SHAP summary plot to visualize the ranked importance of features. In the plot, features with higher importance are displayed at the top of the y-axis, while those with lower importance appear at the bottom. The x-axis represents each feature’s contribution to individual sample predictions, where each dot corresponds to the SHAP value of a specific instance in the dataset. SHAP values are proportional to the impact of the features on the prediction, with positive values increasing the model’s prediction and negative values decreasing it. The color of each point on the x-axis represents the magnitude of the feature value. Red indicates higher feature values, and blue indicates lower feature values. If the red points (representing high feature values) are concentrated on the right side, it suggests that higher feature values contribute positively to the prediction. Conversely, if red points are concentrated on the left, it implies that lower feature values have a stronger positive influence on the prediction.
The DeepSHAP analysis provides actionable insights into how key factors influence predictions across different severity levels. As shown in Figure 5, Figure 6 and Figure 7, vehicle age, special road conditions, speed limit, and light conditions consistently emerge as the top contributors to model predictions for all three injury categories.
For fatal accidents (Figure 7), higher values of vehicle age and speed limit are strongly associated with increased prediction scores (red points concentrated on the right), indicating that older vehicles and higher-speed environments significantly elevate fatality risk. Similarly, adverse light conditions (e.g., darkness) and the presence of special road conditions (e.g., fog, oil spill) also push predictions toward fatal outcomes.
In contrast, for slight injuries (Figure 5), these same factors exhibit the opposite pattern: lower vehicle age, lower speed limits, and daylight conditions (blue points on the left) are linked to reduced severity predictions.
Notably, while the set of influential variables remains largely consistent across severity levels, their direction of influence varies systematically: features that increase fatality risk consistently decrease the likelihood of slight injuries, demonstrating the model's coherent and interpretable risk logic. These findings align with established traffic safety principles and validate the model's ability to capture meaningful, real-world relationships.

4.5. Error Analysis and Stability Assessment

To further understand the model’s prediction behavior, we analyzed the confusion matrix on the test set (Figure 8). The results reveal that while the model achieves high overall accuracy, its error patterns are not uniform across severity levels. Specifically:
For slight injuries (Class 0), the model correctly predicts 79% of cases, but misclassifies 13% as serious and 8% as fatal. This suggests a tendency to slightly overestimate severity for minor accidents, possibly due to overlapping feature profiles between slight and more severe cases.
For serious injuries (Class 1), 82% are correctly identified, while 18% are misclassified as fatal. This indicates that the model may associate certain risk factors (e.g., high speed, darkness) with fatality even when the outcome is serious but non-fatal.
Crucially, for fatalities (Class 2), the model achieves perfect recall (100%), meaning no fatal accident was missed. This is particularly important for highway safety applications, where false negatives carry the highest cost.
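The per-class recall figures quoted above follow directly from the confusion matrix. The counts below are illustrative (scaled so each row sums to 100, matching the reported percentages; they are not the raw test-set counts):

```python
import numpy as np

# rows = true class, columns = predicted class (slight, serious, fatal)
cm = np.array([[79, 13, 8],    # slight: 79% correct, 13% -> serious, 8% -> fatal
               [0, 82, 18],    # serious: 82% correct, 18% -> fatal
               [0, 0, 100]])   # fatal: no fatal accident missed

# Recall per class = diagonal / row sum
recall_per_class = cm.diagonal() / cm.sum(axis=1)
```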
The model’s performance is stable across training runs, as evidenced by the consistent convergence of loss and accuracy curves (Figure 3) and the high AUC scores (>0.99) for all classes. The perfect recall for fatalities further demonstrates the model’s robustness in identifying high-risk cases, which is essential for real-world deployment.

5. Conclusions and Limitations

This study proposes a hybrid model based on MobileNetV3 and Transformer architectures for predicting the severity of highway accidents. In addition, the DeepSHAP interpretability method is employed to further refine the model and identify the key contributing factors to accident severity on highways. By integrating the strengths of both MobileNetV3 and a Transformer, the model effectively captures the spatiotemporal characteristics of the features. Furthermore, the incorporation of a localized spatiotemporal attention mechanism enables the model to extract complex temporal and spatial dependencies, enhancing its robustness and improving overall predictive performance.
Using a real-world traffic accident dataset from the United Kingdom, the effectiveness and superiority of the proposed prediction model are validated through algorithmic comparison experiments, ROC curve analysis, and standard evaluation metrics. The model achieves an Accuracy of 0.9549, a Precision of 0.9674, a Recall of 0.8290, and an F1 Score of 0.8862, demonstrating its strong capability in accurately predicting traffic accident risk.
DeepSHAP is used to interpret the model and generate feature contribution scores for the three categories of accident severity. The results reveal that vehicle usage duration, special road conditions, speed limits, and lighting conditions are the key factors influencing accident severity on highways. Identifying these critical contributing factors can support the development of targeted policy interventions and traffic control measures aimed at reducing the likelihood of traffic accidents, thereby enhancing the safety and security of road users in terms of both life and property.
Beyond predictive performance, the interpretability analysis offers actionable insights for highway safety management. The identification of vehicle age as a top risk factor suggests that mandatory periodic inspections or early retirement policies for older vehicles could significantly reduce fatality risk. Similarly, the strong influence of lighting conditions underscores the need for enhanced nighttime illumination on highways, especially in rural or tunnel sections. The high impact of special road conditions (e.g., fog and oil spills) calls for real-time hazard detection systems and dynamic warning mechanisms (e.g., variable message signs) to alert drivers in advance.
Furthermore, the consistent ranking of speed limits across all severity levels validates the effectiveness of speed management strategies—such as intelligent speed adaptation (ISA) or automated enforcement—as core components of highway safety programs. These findings not only explain why accidents become severe but also provide concrete, evidence-based recommendations for policymakers, road operators, and vehicle manufacturers.
The utilization of the MobileNetV3 architecture confers significant computational efficiency and lightweight advantages to the proposed model. This design choice results in a reduced number of parameters and lower computational overhead compared to more complex architectures, making the model more suitable for deployment in resource-constrained environments or real-time applications. Furthermore, experimental evaluations confirmed the model’s efficiency, demonstrating a competitive average inference time per sample (as detailed in Table 5), which is crucial for practical highway safety monitoring and early warning systems.
Despite these promising results, this study has several limitations that warrant consideration. First, the model is trained and evaluated exclusively on the UK Road Safety Dataset, which reflects the specific infrastructure, traffic regulations, and vehicle fleet characteristics of the United Kingdom. Its generalizability to other countries or regions—particularly those with different road design standards, driving behaviors, or data collection protocols—has not been validated. Second, the current input features are derived from static post-crash records and do not incorporate real-time dynamic variables such as traffic flow, weather evolution, or driver behavior (e.g., distraction and fatigue). Future work will focus on cross-regional validation, integration of streaming sensor data, and deployment in real-time highway safety monitoring systems to enhance practical applicability.

Author Contributions

Conceptualization, L.C. and J.W.; methodology, G.W.; validation, J.W., L.C. and G.W.; formal analysis, X.Y.; investigation, L.Q.; data curation, L.C.; writing—original draft preparation, J.W.; writing—review and editing, L.Q.; visualization, J.W.; supervision, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 51908187.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study used the UK Road Safety dataset, which is publicly available as UK Accidents (2016–2020) (https://www.kaggle.com/, accessed on 15 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Model Architecture and Hyperparameters

To facilitate reproducibility, we provide a comprehensive specification of the proposed MobileNetV3–Transformer model and data preprocessing pipeline.

Appendix A.1. Data Preprocessing and GAF Encoding

Input features: 26 numerical/categorical variables (listed in Table 1 of the main text). Categorical variables were one-hot encoded.
GAF transformation:
Method: Gramian Angular Summation Field (GASF)
Image size: 600 × 600 pixels
Input sequence length: 26 (one value per feature)
Each accident record is converted into a single image.

Appendix A.2. MobileNetV3 Configuration

Variant: MobileNetV3-Large (as referenced in [38])
Input shape: 600 × 600 × 1
Pretrained weights: None (trained from scratch)
Activation function: Hard-swish (default in MobileNetV3; note: Table 2 in main text refers to final MLP activation as ReLU)
Squeeze-and-Excitation (SE): Enabled (as per standard MobileNetV3-Large)
Output feature map: After global average pooling, the output dimension is 960 (standard for MobileNetV3-Large at input size ≥600 × 600).

Appendix A.3. Transformer Encoder Configuration

Number of encoder layers: 2
Number of attention heads: 8
Model dimension (d_model): 960 (matches MobileNetV3 output)
Feed-forward hidden dimension: 3840
Dropout rate: 0.1
Positional encoding: Not applied (since GAF image encodes spatial relationships; sequence order is fixed by feature index)
Input to Transformer: Flattened feature vector from MobileNetV3 (treated as a single “token”)

Appendix A.4. Classification Head

Structure:
Linear(960 → 256) → ReLU → Dropout(0.1) → Linear(256 → 3)
Output: Logits for 3 classes (slight, serious, fatal)
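A NumPy sketch of this head at inference time is given below (dropout is active only during training and is therefore omitted; the random weights are placeholders for the trained parameters):

```python
import numpy as np

def classification_head(features, w1, b1, w2, b2):
    """Linear(960 -> 256) -> ReLU -> Linear(256 -> 3), inference-time form."""
    h = np.maximum(features @ w1 + b1, 0.0)  # ReLU
    return h @ w2 + b2                       # logits for (slight, serious, fatal)

rng = np.random.default_rng(0)
logits = classification_head(rng.normal(size=960),
                             rng.normal(size=(960, 256)), np.zeros(256),
                             rng.normal(size=(256, 3)), np.zeros(3))
```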

Appendix A.5. Training Settings

Table A1. Training settings.

| Parameter | Value |
|---|---|
| Batch size | 128 |
| Optimizer | AdamW |
| Learning rate | 0.001 |
| Weight decay | 0.0001 (default for AdamW) |
| Epochs | 100 |
| Loss function | CrossEntropyLoss |
| SMOTE | False |
| Hardware | AMD Ryzen 7 7840H |
| Framework | PyTorch 2.5.1 |

References

  1. Micheale, K.G. Road traffic accident: Human security perspective. Int. J. Peace Dev. Stud. 2017, 8, 19–25. [Google Scholar] [CrossRef]
  2. World Health Organization. Global Status Report on Road Safety 2023; WHO: Geneva, Switzerland, 2023. [Google Scholar]
  3. Yan, X.; Harb, R.; Radwan, E. Analyses of factors of crash avoidance maneuvers using the General Estimates System. Traffic Inj. Prev. 2008, 9, 173–180. [Google Scholar] [CrossRef]
  4. Miaou, S.P. The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accid. Anal. Prev. 1994, 26, 471–480. [Google Scholar] [CrossRef]
  5. Lancsar, E.; Fiebig, D.G.; Hole, A.R. Discrete choice experiments: A guide to model specification, estimation and software. Pharmacoeconomics 2017, 35, 697–716. [Google Scholar] [CrossRef]
  6. Chen, Z. Analysis and Prediction of Road Traffic Accident Severity Considering Weather and Road Conditions. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2022. [Google Scholar]
  7. Wang, L. Analysis of Accident Severity at Intersections Based on XGBoost and SVM. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2022. [Google Scholar]
  8. Liu, J.; Leng, J.; Shang, P.; Luo, L. Analysis of highway accidents and severity factors under icy and snowy road conditions. J. Harbin Inst. Technol. 2022, 54, 1–10. [Google Scholar]
  9. Delen, D.; Sharda, R.; Bessonov, M. Identifying significant predictors of injury severity in traffic accidents using a series of artificial neural networks. Accid. Anal. Prev. 2006, 38, 434–444. [Google Scholar] [CrossRef] [PubMed]
  10. Rahim, M.A.; Hassan, H.M. A deep learning based traffic crash severity prediction framework. Accid. Anal. Prev. 2021, 154, 106090. [Google Scholar] [CrossRef]
  11. Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. In Computer Vision—ECCV.; Springer: Cham, Switzerland, 2014; pp. 818–833. [Google Scholar] [CrossRef]
  12. Li, G.; Wu, Y.; Bai, Y.; Zhang, W. ReMAHA-CatBoost: Addressing imbalanced data in traffic accident prediction tasks. Appl. Sci. 2023, 13, 13123. [Google Scholar] [CrossRef]
  13. Mohanty, M.; Gupta, A. Factors affecting road crash modeling. J. Transp. Lit. 2015, 9, 15–19. [Google Scholar] [CrossRef]
  14. Santamariña-Rubio, E.; Pérez, K.; Olabarria, M.; Novoa, A.M. Gender differences in road traffic injury rate using time travelled as a measure of exposure. Accid. Anal. Prev. 2014, 65, 1–7. [Google Scholar] [CrossRef]
  15. Choudhary, P.; Pawar, N.M.; Velaga, N.R.; Pawar, D.S. Overall performance impairment and crash risk due to distracted driving: A comprehensive analysis using structural equation modelling. Transp. Res. Part F Traffic Psychol. Behav. 2020, 74, 120–138. [Google Scholar] [CrossRef]
  16. Behzadi Goodari, M.; Sharifi, H.; Dehesh, P.; Mosleh-Shirazi, M.A.; Dehesh, T. Factors affecting the number of road traffic accidents in Kerman province, southeastern Iran (2015–2021). Sci. Rep. 2023, 13, 6662. [Google Scholar] [CrossRef]
  17. Çelik, A.K.; Oktay, E. A multinomial logit analysis of risk factors influencing road traffic injury severities in the Erzurum and Kars provinces of Turkey. Accid. Anal. Prev. 2014, 72, 66–77. [Google Scholar] [CrossRef] [PubMed]
  18. AlKheder, S.; AlRukaibi, F.; Aiash, A. Risk analysis of traffic accidents’ severities: An application of three data mining models. ISA Trans. 2020, 106, 213–220. [Google Scholar] [CrossRef] [PubMed]
  19. Habib, M.F.; Motuba, D.; Huang, Y. Beyond the surface: Exploring the temporally stable factors influencing injury severities in large-truck crashes using mixed logit models. Accid. Anal. Prev. 2024, 205, 107650. [Google Scholar] [CrossRef] [PubMed]
  20. Ratanavaraha, V.; Suangka, S. Impacts of accident severity factors and loss values of crashes on expressways in Thailand. IATSS Res. 2014, 37, 130–136. [Google Scholar] [CrossRef]
  21. Haq, M.T.; Zlatkovic, M.; Ksaibati, K. Investigating occupant injury severity of truck-involved crashes based on vehicle types on a mountainous freeway: A hierarchical Bayesian random intercept approach. Accid. Anal. Prev. 2020, 144, 105654. [Google Scholar] [CrossRef]
  22. Liu, X.; Wu, H.; Yu, D.; Chen, Y.; Wu, H. A construction and representation learning method for a traffic accident knowledge graph based on the enhanced TransD model. Appl. Sci. 2025, 15, 6031. [Google Scholar] [CrossRef]
  23. Hou, Q.; Meng, X.; Huo, X.; Cheng, Y.; Leng, J. Effects of freeway climbing lane on crash frequency: Application of propensity scores and potential outcomes. Phys. A 2019, 517, 246–256. [Google Scholar] [CrossRef]
  24. Abbas, K.A. Traffic safety assessment and development of predictive models for accidents on rural roads in Egypt. Accid. Anal. Prev. 2004, 36, 149–163. [Google Scholar] [CrossRef]
  25. Jiang, X.; Huang, B.; Zaretzki, R.L.; Richards, S.; Yan, X.; Zhang, H. Investigating the influence of curbs on single-vehicle crash injury severity utilizing zero-inflated ordered probit models. Accid. Anal. Prev. 2013, 57, 55–66. [Google Scholar] [CrossRef] [PubMed]
  26. Harb, R.; Yan, X.; Radwan, E.; Su, X. Exploring precrash maneuvers using classification trees and random forests. Accid. Anal. Prev. 2009, 41, 98–107. [Google Scholar] [CrossRef]
  27. Tang, J.; Zheng, L.; Han, C.; Yin, W.; Zhang, Y.; Zou, Y.; Huang, H. Statistical and machine-learning methods for clearance time prediction of road incidents: A methodology review. Anal. Methods Accid. Res. 2020, 27, 100123. [Google Scholar] [CrossRef]
  28. Olutayo, V.A.; Eludire, A.A. Traffic accident analysis using decision trees and neural networks. Int. J. Inf. Technol. Comput. Sci. 2014, 6, 22–28. [Google Scholar] [CrossRef]
  29. Yan, M.; Shen, Y. Traffic accident severity prediction based on random forest. Sustainability 2022, 14, 31729. [Google Scholar] [CrossRef]
  30. Shahdah, U.; Saccomanno, F.; Persaud, B. Integrated traffic conflict model for estimating crash modification factors. Accid. Anal. Prev. 2014, 71, 228–235. [Google Scholar] [CrossRef]
  31. Sameen, M.I.; Pradhan, B. Severity prediction of traffic accidents with recurrent neural networks. Appl. Sci. 2017, 7, 476. [Google Scholar] [CrossRef]
  32. Alhaek, F.; Liang, W.; Rajeh, T.M.; Javed, M.H.; Li, T. Learning spatial patterns and temporal dependencies for traffic accident severity prediction: A deep learning approach. Knowl.-Based Syst. 2024, 286, 111406. [Google Scholar] [CrossRef]
  33. Manzoor, M.; Umer, M.; Sadiq, S.; Ishaq, A.; Ullah, S.; Madni, H.A.; Bisogni, C. RFCNN: Traffic accident severity prediction based on decision level fusion of machine and deep learning model. IEEE Access 2021, 9, 128359–128371. [Google Scholar] [CrossRef]
  34. Kour, V.; Kumar, S.; Reddy, T.V.; Poojary, K.; Misra, R.; Singh, T.N. Exploring spatiotemporal relational learning with TimeSformer for identifying the severity of the road accidents. IEEE Trans. Comput. Soc. Syst. 2025, 12, 345–356. [Google Scholar] [CrossRef]
  35. Cicek, E.; Akin, M.; Uysal, F.; Topcu Aytas, R. Comparison of traffic accident injury severity prediction models with explainable machine learning. Transp. Lett. 2023, 15, 1043–1054. [Google Scholar] [CrossRef]
  36. Ma, Z.; Mei, G.; Cuomo, S. An analytic framework using deep learning for prediction of traffic accident injury severity based on contributing factors. Accid. Anal. Prev. 2021, 160, 106322. [Google Scholar] [CrossRef]
  37. Roscher, R.; Bohn, B.; Duarte, M.F.; Garcke, J. Explainable machine learning for scientific insights and discoveries. IEEE Access 2020, 8, 42200–42216. [Google Scholar] [CrossRef]
  38. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  40. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
Figure 1. Structure of MobileNetV3–Transformer model.
Figure 2. Accident data image transformation samples.
Figure 3. Training accuracy curves of six deep learning models over 100 epochs. The proposed MobileNetV3–Transformer model achieves the highest and most stable accuracy (>95%), outperforming the CNN, LSTM, MobileNetV3, Transformer, and LSTM-Transformer baselines.
Figure 4. ROC curves and AUC values of the MobileNetV3–Transformer model for three accident severity classes: slight injury (Class 0, AUC = 1.00), serious injury (Class 1, AUC = 0.99), and fatality (Class 2, AUC = 0.99). The near-perfect AUC scores indicate excellent discriminative capability, particularly for the rare fatality class (2.6% of samples).
Figure 5. Impact of features on the minor accident.
Figure 6. Impact of features on the serious injury.
Figure 7. Impact of features on fatal accidents.
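The DeepSHAP attributions visualized in Figures 5–7 rest on the Shapley-value framework: a feature's contribution is its marginal effect averaged over all feature coalitions. As a self-contained illustration of that principle (not the DeepSHAP algorithm itself, which approximates these values for deep networks), the sketch below computes exact Shapley values for a toy three-feature severity score with purely illustrative coefficients:

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values via the subset-weight formulation:
    each feature's weighted marginal contribution over all coalitions."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = len(subset)
                weight = factorial(s) * factorial(n_features - s - 1) / factorial(n_features)
                phi[i] += weight * (value_fn(set(subset) | {i}) - value_fn(set(subset)))
    return phi

# Toy "model": severity score from three present/absent risk factors
# (0: old vehicle, 1: darkness, 2: high speed limit) -- invented numbers.
def toy_score(features):
    score = 0.3 * (0 in features) + 0.5 * (1 in features) + 0.1 * (2 in features)
    if 0 in features and 1 in features:
        score += 0.2  # interaction: old vehicle in darkness
    return score

phi = shapley_values(toy_score, 3)
# Efficiency property: the attributions sum to the full-coalition score.
```

The interaction term splits evenly between the two interacting features (0.1 each), which is exactly the behavior that lets the SHAP plots assign credit for joint effects such as darkness combined with vehicle age.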
Figure 8. Confusion matrix of the MobileNetV3–Transformer model on the test set.
Table 1. List of input variables.

| Factor Classification | Factor Name | Factor Classification | Factor Name |
|---|---|---|---|
| Accident information | Number of vehicles | Vehicle factors | Van |
| | Number of casualties | | Towing and articulation |
| | Year | | Age of vehicle |
| | Season | | Vehicle direction from |
| | Day of week | | Vehicle direction to |
| | Holiday | | Skidding and overturning |
| | Hour | | Offside impact |
| Driver factor | Age of driver | | Nearside impact |
| | Sex of driver | | Front impact |
| Environmental factor | Light conditions | | Rear impact |
| | Weather conditions | Road factors | Road type |
| | Previous accident | | Speed limit |
| | Special conditions | | Road surface conditions |
Table 2. Hyperparameter settings.

| Parameter Name | Parameter Value |
|---|---|
| batch size | 128 |
| optimizer | AdamW |
| epochs | 100 |
| LSTM hidden size | 256 |
| activation function | ReLU |
| learning rate | 0.0001 |
| num encoder layers | 2 |
| dropout | 0.1 |
Table 3. Prediction results of the deep learning models.

| Models | Acc | Prec | Rec | F1 |
|---|---|---|---|---|
| CNN | 0.8267 | 0.6345 | 0.8390 | 0.6928 |
| LSTM | 0.8988 | 0.9503 | 0.6995 | 0.7825 |
| MobileNetV3 | 0.9007 | 0.9629 | 0.7841 | 0.7860 |
| Transformer | 0.9132 | 0.9623 | 0.7530 | 0.8252 |
| LSTM–Transformer | 0.9194 | 0.9657 | 0.8516 | 0.8329 |
| MobileNetV3–Transformer | 0.9549 | 0.9674 | 0.8290 | 0.8862 |
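The Acc/Prec/Rec/F1 columns in Table 3 are standard multi-class metrics derived from a confusion matrix such as the one in Figure 8. Given the strong class imbalance (fatalities are only 2.6% of samples), macro averaging over the three severity classes is the usual choice; the sketch below assumes macro averaging and uses a hypothetical confusion matrix, not the paper's actual one.

```python
import numpy as np

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision/recall/F1 from a confusion
    matrix whose rows are true classes and columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    prec = tp / np.maximum(cm.sum(axis=0), 1)  # per-class precision
    rec = tp / np.maximum(cm.sum(axis=1), 1)   # per-class recall
    f1 = np.where(prec + rec > 0, 2 * prec * rec / np.maximum(prec + rec, 1e-12), 0.0)
    acc = tp.sum() / cm.sum()
    return acc, prec.mean(), rec.mean(), f1.mean()

# Hypothetical 3-class matrix (rows/cols: slight, serious, fatal).
cm = np.array([[90, 5, 1],
               [4, 30, 2],
               [0, 1, 7]])
acc, p, r, f = macro_metrics(cm)
```

Macro averaging weights each class equally, which is why a model can post a high accuracy while its F1 lags, as the baseline rows of Table 3 show.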
Table 4. Statistical comparison of F1-scores across models (mean ± standard deviation over 5 runs).

| Models | F1-Score (Mean ± SD) | p-Value vs. MobileNetV3–Transformer |
|---|---|---|
| CNN | 0.846 ± 0.002 | <0.001 |
| LSTM | 0.841 ± 0.002 | <0.001 |
| MobileNetV3 | 0.844 ± 0.003 | <0.001 |
| Transformer | 0.903 ± 0.001 | <0.001 |
| LSTM–Transformer | 0.904 ± 0.001 | <0.001 |
| MobileNetV3–Transformer | 0.942 ± 0.001 | — |
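With 5 runs per model, p-values like those in Table 4 can be obtained from a t-test on the per-run F1-scores. The section shown here does not state whether the test was paired (same seeds/splits across models), so the pairing below is an assumption, and the per-run scores are illustrative numbers matched to the table's means, not the authors' actual runs.

```python
from math import sqrt

def paired_t_statistic(a, b):
    """t statistic of a paired t-test on two equal-length score lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)  # degrees of freedom: n - 1

# Hypothetical per-run F1-scores over 5 seeds.
f1_proposed = [0.941, 0.943, 0.942, 0.941, 0.943]  # MobileNetV3-Transformer
f1_baseline = [0.904, 0.904, 0.904, 0.905, 0.903]  # LSTM-Transformer
t = paired_t_statistic(f1_proposed, f1_baseline)
# With df = 4, any t above the two-tailed 0.001 critical value (~8.61)
# yields p < 0.001, consistent with Table 4.
```

The tiny run-to-run SDs (~0.001–0.003) are what make even a ~0.04 F1 gap highly significant at n = 5.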
Table 5. Computational efficiency comparison of different models.

| Models | Parameters | FLOPs (G) | Inference Time (ms/Sample) |
|---|---|---|---|
| CNN | 0.39 M | 0.002 | 0.0468 |
| LSTM | 0.27 M | 0.0003 | 0.1132 |
| MobileNetV3 | 0.01 M | 0.00001 | 0.1172 |
| Transformer | 0.15 M | 0.0001 | 0.1814 |
| LSTM–Transformer | 0.58 M | 0.0005 | 0.3444 |
| MobileNetV3–Transformer | 0.11 M | 0.0001 | 0.1443 |
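Per-sample inference times such as those in Table 5 are typically measured by averaging many timed forward passes after a warm-up phase (warm-up absorbs caching and JIT effects that would otherwise inflate the first calls). A sketch of such a measurement harness follows; the "model" here is a cheap stand-in function, since the authors' network and hardware are not reproduced here.

```python
import time

def mean_inference_ms(model_fn, sample, n_warmup=10, n_runs=100):
    """Average wall-clock latency of model_fn(sample) in milliseconds,
    discarding warm-up runs that would otherwise skew the timing."""
    for _ in range(n_warmup):
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        model_fn(sample)
    return (time.perf_counter() - start) * 1000 / n_runs

# Stand-in for a forward pass: a trivial pure-Python computation.
dummy_model = lambda x: sum(v * v for v in x)
latency = mean_inference_ms(dummy_model, list(range(100)))
```

Note that FLOPs and latency need not rank models identically, as Table 5 shows: memory access patterns and sequential operations (e.g., LSTM recurrence) dominate wall-clock time at these small model sizes.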

Share and Cite

MDPI and ACS Style

Chen, L.; Wei, J.; Wang, G.; Yang, X.; Qin, L. MobileNetV3–Transformer-Based Prediction of Highway Accident Severity. Appl. Sci. 2025, 15, 12694. https://doi.org/10.3390/app152312694
