Article

An Empirical Analysis of Running-Behavior Influencing Factors for Crashes with Different Economic Losses

1 College of Metropolitan Transportation, Beijing University of Technology, Beijing 100124, China
2 Beijing Key Laboratory of Traffic Engineering, Beijing University of Technology, Beijing 100124, China
3 School of Civil Engineering and Transportation, Guangzhou University, Guangzhou 510006, China
4 China Huanong Property & Casualty Insurance Company Limited, Beijing 100020, China
* Author to whom correspondence should be addressed.
Urban Sci. 2026, 10(1), 45; https://doi.org/10.3390/urbansci10010045
Submission received: 2 December 2025 / Revised: 28 December 2025 / Accepted: 6 January 2026 / Published: 12 January 2026
(This article belongs to the Special Issue Urban Traffic Control and Innovative Planning)

Abstract

Miniature commercial trucks constitute a critical component of urban freight systems but face elevated crash risk due to distinctive driving patterns, frequent operation, and variable loads. This study quantifies how long-term and short-term driving behaviors jointly shape crash economic loss levels and identifies factors most strongly associated with severe claims. A driver-level dataset linking multi-source running behavior indicators, vehicle attributes, and insurance claims is constructed, and an enhanced Wasserstein generative adversarial network with Euclidean distance is employed to synthesize minority crash samples and alleviate class imbalance. Crash economic loss levels are modeled using a random-effects generalized ordinal logit specification, and model performance is compared with a generalized ordered logit benchmark. Marginal effects analysis is used to evaluate the influence of pre-collision driving states (straight, turning, reversing, rolling, following closely) and key behavioral indicators. Results indicate significant effects of inter-provincial duration and count ratios, morning and empty-trip frequencies, no-claim discount coefficients, and vehicle age on crash economic loss, with prolonged speeding duration and fatigued mileage associated with major losses, whereas frequent speeding and fatigue episodes are primarily linked to minor claims. These findings clarify causal patterns for miniature commercial truck crashes with different economic losses and provide an empirical basis for targeted safety interventions and refined insurance pricing.

1. Introduction

With the increase in global consumption levels, the efficient functioning of international supply chains depends significantly on extensive freight transportation between cities and regions. In China, operational freight vehicles transport 31.5 billion tons of cargo annually, achieving a cargo turnover of 57,955.72 billion ton-kilometers [1]. Miniature commercial trucks, defined as vehicles with a length of 3500 mm or less and a gross weight not exceeding 1800 kg, are essential to global freight and urban delivery services, yet this category of vehicles is particularly vulnerable to an increased risk of crashes [2]. According to the Traffic Management Research Institute of the Ministry of Public Security, by the end of 2020, operational vehicles constituted only 5.55% of the total motor vehicle stock in China. However, operational vehicles were involved in 9.98% of traffic crashes and accounted for 15.67% of traffic crashes with fatalities. The fatality rate from road traffic crashes per 10,000 operational vehicles was recorded at 3.95, which is approximately 2.9 times higher than the average for other types of vehicles. Notably, cargo vehicles account for a large majority of crashes involving operational vehicles [3]. Therefore, it is essential to analyze the features of miniature commercial truck crashes and identify their key contributing factors in order to mitigate crashes proactively.
Besides the severity of the crashes [4], another commonly used approach to describe the consequences of crashes is the estimated economic loss [5,6]. This study aims to extract the estimated damage amounts from insurance claim data and then to analyze the relationship between various influencing factors and the crashes with different levels of estimated damage (i.e., the economic loss of crashes). In property-damage-dominant freight crashes, economic-loss levels provide a decision-relevant measure of crash consequences for operational safety management and insurance risk assessment, complementing severity classifications based on injuries. Previous research indicates that crashes with different economic losses involving miniature commercial trucks are affected by a variety of intricate factors, including driver behavior, vehicle condition, and road characteristics [7]. Identifying the causes of crash economic loss involves addressing two primary issues. The first pertains to data acquisition and class imbalance. In the analysis of traffic crashes, the sample size for severe injuries is considerably smaller than the sample sizes for no-injury and minor-injury crashes. Typical data sampling methods such as undersampling, oversampling, and ensemble learning methods have been employed in previous studies [8,9]. The second issue lies in the selection of an appropriate prediction method for crash economic loss. Two main methods are usually employed to analyze the economic loss levels of crashes: (1) discrete choice models, including logit and probit models; and (2) machine learning models, such as Random Forest, AdaBoost, and Neural Network. Discrete choice models enable quantitative analysis of the impact of individual variables and of the relationship between independent and dependent variables; however, they may exhibit lower predictive accuracy. In contrast, machine learning methods provide advantages such as fewer assumptions and higher accuracy, but the non-parametric nature of machine learning requires large datasets and might result in slower fitting speeds [10].
The main issues studied in this research are as follows:
(1) How can the imbalance among crash economic loss levels be handled effectively so that the sample requirements of model construction are met?
(2) In what ways do various influencing factors impact the economic loss levels of crashes under different pre-crash driving conditions, such as straight driving, turning, reversing, rolling, or close following?
To address the two issues stated above, it is necessary but challenging to develop a comprehensive model that balances precision and interpretability while tackling data imbalance effectively. By considering both predictive accuracy and explainability, this study develops an ordinal modeling and interpretation framework for crash economic-loss levels using insurance-claim-estimated losses and running-behavior statistics. Within this framework, the objective of this study is to identify measurable running-behavior factors associated with higher economic-loss levels in miniature commercial truck crashes and to quantify their associations with ordered economic-loss categories.
The study is original in two respects. First, it focuses on miniature commercial trucks and operationalizes crash consequences using ordered claim-estimated economic-loss categories, linked to annual-scale running-behavior indicators that capture longer-horizon exposure patterns in high-frequency urban freight operations. Second, it adopts a generalized ordinal modeling specification that allows threshold-specific effects across loss levels and incorporates pre-crash driving-state heterogeneity through a random-effects structure, enabling interpretable effect quantification under distinct maneuver contexts. From the perspective of urban science, the contribution lies in providing empirical evidence for data-informed governance of urban freight mobility, by identifying behavior-related risk patterns associated with greater economic losses and by offering a quantitative basis for targeted interventions and insurance-oriented risk control in urban commercial-vehicle operations.

2. Literature Review

Insurance claims data play a crucial role in road safety analysis, primarily serving as the basis for assessing economic losses and informing insurance rate determination [11]. Although relatively few studies have leveraged such data to examine the relationship between driving behaviors and crashes with different economic losses, the causation analysis of traffic crashes has been extensively conducted by researchers, with focus on both data and methodology. Section 2.1 presents the main data sampling methods used in previous studies, while Section 2.2 summarizes the existing methods for modeling the relationship between influencing factors and traffic crashes.

2.1. Sampling Method

In the analysis of traffic crashes, uneven sample distribution presents a significant challenge, particularly due to the limited number of severe crash samples. Such an imbalance not only diminishes the accuracy of predictions for minority classes but also introduces instability in the model’s performance on new data, thereby impacting both fairness and effectiveness in modeling the relationship between influencing factors and crashes. Consequently, balancing the dataset is essential to enhance the performance of the model in predictive accuracy and causation explanations. In the existing crash-safety literature, most imbalance-handling studies focus on injury-based severity outcomes, often formulated as binary labels or coarse multi-class groupings; comparatively fewer studies examine imbalance under monetary-loss outcomes, especially ordered economic-loss categories derived from claim-estimated damages. This thematic emphasis shapes evaluation practice, since resampling methods are commonly validated using injury-severity prediction rather than ordered economic-loss modeling.
Oversampling methods are extensively employed to address the challenges posed by imbalanced datasets. SMOTE aims to balance the dataset by generating synthetic samples for the minority class within the feature space [12]. However, SMOTE fails to account for the heterogeneity of boundary samples, which can result in sample overlap and weaken the model's ability to differentiate clearly between classes. Recent advances in SMOTE-based oversampling have introduced several refined approaches. One line of work proposes four variants (Distance ExtSMOTE, Dirichlet ExtSMOTE, FCRP SMOTE, and BGMM SMOTE) that generate higher-quality synthetic samples by computing a weighted centroid of neighboring instances, thereby reducing the adverse impact of outliers [13]. Another study introduces Borderline-SMOTE, which specifically synthesizes samples near the decision boundary of the minority class; however, the precise definition of these boundary regions remains somewhat ambiguous [14]. Further, a hybrid framework integrates histogram-based density estimation with SMOTE and employs a Fuzzy ARTMAP (FAM) neural network to effectively address classification challenges in imbalanced datasets [15]. Additionally, a two-stage resampling strategy combines SMOTE with a two-layer k-nearest neighbor classifier to enhance discriminative performance in imbalanced learning scenarios [16]. Taken together, this stream reflects a development trend from generic interpolation toward boundary-aware or distribution-aware synthesis, aiming to improve minority representativeness while limiting overlap and noise introduced by synthetic samples.
Different from oversampling, undersampling strategies aim to balance data distribution by reducing the number of samples from the majority class. Some studies have integrated both undersampling and oversampling techniques to capitalize on the advantages of each method. Kernel density estimation was employed to filter out majority-class samples, specifically removing those located within high-density regions of the minority class, thereby enhancing classification performance [17]. Recent undersampling and hybrid strategies increasingly emphasize selective removal guided by neighborhood structure or density, rather than random deletion, in order to reduce information loss from the majority class.
In addition, ensemble learning methods have also made significant advancements in addressing the challenges posed by imbalanced data. The SMOTE Bagging approach incorporates SMOTE oversampling to enhance the classification performance of minority classes effectively [18]. A cluster-based adaptive undersampling ensemble approach has been demonstrated to be effective and robust in highly imbalanced classification tasks [19]. For multi-label classification problems, a resampling algorithm, Multi-Label Tomek Link (MLTL), was proposed to mitigate class imbalance while preserving inter-label dependencies [20]. Furthermore, a multi-granularity relabeling undersampling algorithm (MGRU) was developed to identify and eliminate overlapping majority-class instances by leveraging local structural information in dataset subspaces; this approach improves the detection of Tomek links through the integration of global relabeling index values [21]. This line of work illustrates an analytical development direction integrating resampling with aggregation mechanisms (e.g., bagging or cluster-based ensembles) to improve minority-class detection and reduce performance variance under severe imbalance.
Generative Adversarial Networks (GANs) have emerged as a promising technology for addressing the challenges associated with imbalanced data [22]. GANs generate realistic synthetic samples by modeling the underlying distribution of the minority class, thereby enhancing model learning from limited minority data [23,24]. However, GANs frequently encounter mode collapse when dealing with high-dimensional data, which might result in reduced sample diversity and difficulties in accurately simulating the original distribution. Wasserstein GANs (WGANs), which are grounded in the Wasserstein distance, have been shown to improve both the quality and the diversity of generated samples significantly [25]. The WGAN with Gradient Penalty (WGAN-GP) further mitigates mode collapse, thereby producing more realistic minority-class samples and improving model performance on imbalanced datasets [26]. Conditional mixed Wasserstein GANs (cMWGANs) have been employed to approximate true feature distributions, enabling the generation of diverse, label-preserving synthetic samples for the minority class across multiple datasets [27]. In related work, WGANs have demonstrated remarkable performance and training stability when applied to image denoising tasks [28]. Further advancing this line of research, Wasserstein GAN with Confidence Loss (WGAN-CL) was introduced to augment small-sample plant datasets, achieving a 2.2% increase in recall and a 2% improvement in F1 score while requiring fewer computational resources and maintaining both effectiveness and robustness [29]. Compared with rule-based synthesis, generative approaches aim to learn the minority distribution directly, which can be advantageous when feature interactions are complex and minority classes are extremely sparse.
Overall, significant advancements have been achieved in addressing data imbalance in previous studies. SMOTE and its extensions enhanced synthetic sampling by concentrating on boundary samples. Adaptive sampling techniques also demonstrated potential in mitigating noise and improving model robustness. Furthermore, the integration of sampling methods with ensemble learning approaches enhanced classification performance by capitalizing on the strengths of both bagging and boosting. Emerging technologies such as Generative Adversarial Networks (GANs) have made considerable progress in generating synthetic minority class samples. Techniques like Balanced-GAN and WGAN-GP improved sample diversity and quality, showcasing high adaptability in managing imbalanced high-dimensional crash data. A common methodological practice in the reviewed literature treats resampling as a standalone preprocessing step, followed by model training on the rebalanced dataset; sampling design and model estimation are therefore typically decoupled. This workflow is usually assessed using classification-oriented metrics, while ordered-outcome structure receives limited explicit treatment in sampling design, which can affect subsequent interpretation when the target variable is ordinal.

2.2. Crash Modeling

In recent years, models used to predict and analyze factors associated with traffic crashes have predominantly followed two methodological streams: discrete choice models and machine learning models. In the crash-safety literature, most empirical studies focus on injury-severity outcomes derived from police reports, whereas comparatively fewer studies model monetary-loss outcomes, especially ordered economic-loss levels constructed from claim-estimated damages. This thematic pattern influences both model specification and evaluation, since injury-based labels are often treated as the default response variable in comparative model studies.
I. Discrete choice model
Discrete choice models, particularly ordered and multinomial logit formulations, have been extensively applied to model truck crash severity. One study examined 227 crash cases from three roadways in China, incorporating 15 explanatory variables related to driver characteristics, vehicle attributes, roadway geometry, and environmental conditions. The ordered logit model achieved a prediction accuracy of 78.3 percent for property-damage-only crashes, while the multinomial logit model attained 94.81 percent accuracy for fatal crashes. Statistical analysis identified driver age, driving experience, road alignment, pavement surface condition, time of occurrence, visibility, and lighting as significant predictors of crash severity [30]. A subsequent study analyzed collision data from mountainous regions to classify injury outcomes across multiple severity levels, explicitly testing and addressing violations of the proportional odds assumption that underlies the ordered logit model. Despite such violations, the ordered logit specification exhibited consistently high predictive accuracy across crash types, demonstrating its empirical robustness for ordinal severity modeling. The findings also indicated that drivers who had consumed alcohol but were not impaired exhibited a reduced probability of sustaining severe injuries, although their likelihood of minor injuries increased [31]. More recent work has adopted random-parameter or mixed logit models within the discrete choice framework to account for unobserved heterogeneity in crash severity analysis [32]. These approaches integrate diverse factors including driver behavior, vehicle specifications, environmental conditions, and collision dynamics into a flexible modeling structure that improves both interpretability and predictive performance.
These approaches improve model fit by accommodating heterogeneity and correlations among variables, which is particularly relevant when modeling severe and fatal crashes. Across these studies, significant associations frequently involve driver-related characteristics, vehicle- and time-related conditions, and collision attributes, with notable patterns linked to distraction-related incidents. Additional factors reported as associated with crash severity include speeding, tailgating, absent or malfunctioning airbags, frontal and rear-end collisions, lane-width variation, low-light conditions, downhill curves, inclement weather (rain or snow), and morning hours. Variables such as weekday occurrence, road type, speed limit, lighting condition, and weather pattern also appear repeatedly as determinants in severity models. Beyond classical ordered or multinomial logit specifications, the analytical development in this stream increasingly emphasizes relaxing restrictive assumptions (e.g., proportional odds) and representing unobserved heterogeneity using random/mixed parameters, random effects, or related hierarchical structures, improving interpretability while accounting for correlation and clustering in crash data.
II. Machine learning model
In recent years, a growing body of research has employed machine learning models to forecast traffic crashes. These approaches are well suited to handling complex, high-dimensional datasets without requiring stringent parametric assumptions [33]. For instance, a comparative evaluation of more than 25 machine learning algorithms identified random forest as the top-performing method for crash severity prediction [34]. Subsequent work applied both random forest and AdaBoost, achieving accuracy rates of 91.72% and 91.27%, respectively, with driver error identified as the most influential factor in determining severity outcomes [35].
Deep learning models have also been leveraged to capture spatiotemporal dependencies in crash data for severity prediction. One study introduced a deep learning architecture that explicitly models temporal dynamics, reporting an accuracy of 86%, recall of 91%, and F1 score of 88% [36]. Another approach combined artificial neural networks with the whale optimization algorithm, yielding an average accuracy of 72.41% across multiple severity levels; key predictors included driver age, alcohol consumption, pedestrian location, time of day, lighting conditions, and road surface status [37]. Further research integrated support vector machines (SVMs) with random-parameter logit models to predict both non-fatal and fatal crashes, achieving 81.87% accuracy and identifying driver fatigue and leftward lane deviation as the dominant contributing factors in large-truck-involved crashes [38]. Additionally, a deep residual network enhanced with an attention mechanism was applied to crash severity prediction on mountainous roads, attaining a test accuracy of approximately 85%.
Overall, crash-outcome modeling has advanced along two complementary directions: discrete choice models remain central for interpretable inference and marginal-effects analysis, while machine learning models provide strong predictive performance under complex feature interactions. Discrete choice models, including ordered and multinomial logit models, continue to play a fundamental role in analyzing the impact of factors such as driver characteristics, vehicle attributes, roadway conditions, and environmental influences on crash severity. Such models quantify the marginal effects of individual variables clearly, which makes them valuable for policy development. Machine learning methods, including Support Vector Machines (SVMs), Random Forests, and Neural Networks, exhibit robust predictive performance, particularly on complex datasets. Unlike traditional models, which rely on predefined assumptions, machine learning approaches capture intricate patterns within the data. Nonetheless, their non-parametric nature limits interpretability, so quantifying the specific influence of each factor on crash severity becomes challenging. A common methodological practice in the reviewed machine-learning literature prioritizes accuracy-oriented evaluation, whereas discrete-choice studies more often emphasize effect estimation and interpretability; this difference in evaluation focus shapes the model choice in applied crash research.

3. Data Preparation

3.1. Data Source and Study Scope

The dataset was provided by a nationwide private property-and-casualty insurer operating in mainland China. The insurer name is withheld due to confidentiality obligations under the data-sharing agreement. Publicly available regulatory statistics published by the former China Banking and Insurance Regulatory Commission (CBIRC, 2023) indicate a motor insurance ranking within the top 25 nationally. The extracted records cover compulsory motor third-party liability insurance (CTPL) and major commercial motor coverages, including third-party liability and vehicle damage coverage. All claim records correspond to loss events occurring between 1 January 2022 and 31 December 2022 and originate from 12 provincial-level administrative regions in China, including Guangdong, Zhejiang, Jiangsu, Shandong, Henan, and Hunan. Claims handling follows the nationally applicable framework for road traffic accident processing and motor insurance settlement in China, supporting consistent procedures for accident determination, liability attribution, and loss assessment across provinces.
Raw data were extracted from the insurer’s core motor-claims database and include notice-of-loss records, inspection and assessment reports, repair estimates, police accident determination fields, and settlement/payment information. Telematics or usage-based insurance (UBI) driving-behavior variables are available only for insured vehicles equipped with approved devices and associated with data-collection authorization. The analytical sample restricts vehicles to enclosed vans or box-type light trucks with gross vehicle mass ≤ 3.5 tons and retains claims entering the settlement process with compensable losses. Records with duplicated entries, rule-flagged suspicious cases not processed through normal settlement, or more than 30% missingness in key modeling variables are excluded. Loss severity is categorized as minor (<CNY 1000), moderate (CNY 1000–10,000), and severe (>CNY 10,000). Claims-based data may under-represent minor incidents not reported to the insurer, and this potential reporting or selection bias is addressed in the limitations together with class-imbalance mitigation in model development.

3.2. Data Fields and Variable Definitions

The data used in this research include driving characteristics, vehicle characteristics, and insurance claim estimate amounts for 156 miniature commercial trucks in 2022. A total of 50 variables are used as independent variables (see Table 1). Driving characteristics comprise 46 variables across the load size, road type, driving time period, traffic violation, and unfamiliar road dimensions. Vehicle characteristics comprise 4 variables: vehicle age, No Claims Discount (NCD) coefficient, vehicle tonnage, and pre-collision driving state. Specifically, the NCD coefficient reflects the driver's historical crash rate and insurance claim records. A lower NCD value indicates fewer past crashes. The unfamiliar road coefficient is the ratio of the number of times the driver traveled on highways (excluding repeated trips) to the total number of highway trips in the annual statistics.
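As an illustration of the unfamiliar road coefficient, the following minimal sketch reads "excluding repeated trips" as counting each distinct highway route once. That reading, the function name `unfamiliar_road_coefficient`, and the representation of trips as a list of route identifiers are assumptions made for the example, not details given in the data description.

```python
def unfamiliar_road_coefficient(highway_trips):
    """Ratio of distinct highway routes to total highway trips in a year.

    highway_trips: list of route identifiers, one entry per trip
    (hypothetical input format, assumed for illustration).
    """
    if not highway_trips:
        # No highway trips recorded: return 0.0 by convention (assumption).
        return 0.0
    distinct_routes = len(set(highway_trips))
    return distinct_routes / len(highway_trips)


# Four trips over three distinct routes.
example = unfamiliar_road_coefficient(["G4", "G4", "G15", "S20"])
```

Under this reading, a driver who repeats the same route all year has a coefficient close to zero, while a driver who never repeats a route has a coefficient of one.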
The economic loss levels of crashes (i.e., the dependent variable) are derived from insurance claim estimates and comprise three principal indicators: vehicle damage estimation, property damage assessment, and bodily injury evaluation. The economic loss of crashes is divided into three categories. Cases with damage estimates below 1000 RMB for vehicle, property, and bodily injury are considered minor insurance claims. Cases where the damage estimate for any of the vehicle, property, or bodily injury exceeds 10,000 RMB, or where multiple types of damage exceed this threshold, are classified as major insurance claims. All other cases are categorized as general insurance claims. The classification criteria, provided by the insurance company, ensure consistency in claims processing and align with the distribution of claims costs observed in historical data. Among the analyzed sample of 156 truck drivers, 77 reported minor insurance claims (49.36%), 47 indicated general insurance claims (26.92%), and 32 experienced major insurance claims (23.08%). Notably, the highest recorded estimated damage was 91,564 RMB, and the lowest was 100 RMB. In addition, there were 5 types of pre-collision driving states for each crash: straight driving, turning, reversing, rolling, and close following.
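The three-way classification rule can be sketched as a small function. The function name `loss_level`, the integer coding (0 = minor, 1 = general, 2 = major), and the treatment of estimates exactly equal to 1000 or 10,000 RMB as general claims are assumptions for illustration; the text does not specify how boundary values are handled.

```python
def loss_level(vehicle, prop, injury, low=1000.0, high=10000.0):
    """Map three damage estimates (RMB) to an ordered loss level.

    Returns 0 (minor), 1 (general), or 2 (major).
    Boundary handling (== low or == high) is an assumption.
    """
    estimates = (vehicle, prop, injury)
    if any(e > high for e in estimates):
        # Any single component above the upper threshold makes the claim major.
        return 2
    if all(e < low for e in estimates):
        # All components below the lower threshold make the claim minor.
        return 0
    # Everything else is a general claim.
    return 1


levels = [loss_level(100, 0, 0), loss_level(5000, 200, 0), loss_level(500, 0, 91564)]
```

The check for a major claim comes first so that a single large component is never masked by small values in the other two.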

4. Methodology

4.1. Data Sampling

To ensure consistent scaling and eliminate potential biases from varying factor ranges, all continuous variables undergo z-score standardization before data sampling. This transformation subtracts the mean of each variable and divides by its standard deviation, so that standardized variables have a mean of zero and a standard deviation of one, which facilitates model convergence and improves numerical stability. Categorical variables, including pre-collision driving states (i.e., straight, turning, reversing, rolling, and following closely), are not standardized and remain unchanged. An improved Generative Adversarial Network (Wasserstein GAN, WGAN) is employed to tackle the challenge of imbalanced sample sizes [39]. The WGAN comprises two primary components: a generator and a discriminator. The generator (G) continuously learns from the original data and extracts distributional characteristics in order to produce synthetic data that closely resemble real data, aiming to deceive the discriminator (D). Conversely, the role of the discriminator is to distinguish between authentic data and the fake data generated by G without being deceived. Both models are trained concurrently through an alternating optimization approach in a zero-sum game framework. A schematic illustration of the enhanced Generative Adversarial Network framework is presented in Figure 1.
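The standardization step can be sketched as follows. This is a minimal illustration assuming records are stored as dictionaries and that the population standard deviation is used; the field names (`vehicle_age`, `state`) and the helper name `zscore_columns` are hypothetical.

```python
import statistics


def zscore_columns(rows, continuous_keys):
    """Standardize the named continuous fields; leave other fields untouched."""
    stats = {}
    for key in continuous_keys:
        values = [row[key] for row in rows]
        # Population standard deviation is an assumption of this sketch.
        stats[key] = (statistics.fmean(values), statistics.pstdev(values))
    standardized = []
    for row in rows:
        new_row = dict(row)  # copy so categorical fields carry over unchanged
        for key, (mean, std) in stats.items():
            # A constant column has std == 0; map it to 0.0 to avoid division by zero.
            new_row[key] = (row[key] - mean) / std if std > 0 else 0.0
        standardized.append(new_row)
    return standardized


rows = [
    {"vehicle_age": 2.0, "state": "turning"},
    {"vehicle_age": 4.0, "state": "straight"},
    {"vehicle_age": 6.0, "state": "reversing"},
]
scaled = zscore_columns(rows, ["vehicle_age"])
```

Only the keys listed in `continuous_keys` are transformed, which mirrors the rule that pre-collision driving states are kept as they are.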
The objective function of WGAN differs from traditional GANs by using the Wasserstein distance (Earth Mover’s Distance, EM distance) to optimize the generation process, thereby improving the stability of model training and reducing the occurrence of mode collapse. The objective function is expressed as follows:
$$
\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim P_{data}(x)}\left[D(x)\right] - \mathbb{E}_{z \sim P_{z}(z)}\left[D(G(z))\right]
$$
where $\mathbb{E}$ represents the mathematical expectation, $x$ represents the real data from the original sample, $P_{data}(x)$ represents the distribution of the real data, $G(z)$ represents the fake data generated by the generator, $P_{z}(z)$ represents the distribution of the input noise from which the generated data are produced, and $z$ represents the input random noise signal.
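To make the objective concrete, the sketch below estimates it from mini-batch scores. It assumes the discriminator outputs are already available as plain lists of real numbers; the function names are hypothetical, and the Lipschitz constraint that a WGAN discriminator must satisfy (for example through weight clipping or a gradient penalty) is deliberately left out.

```python
def critic_objective(real_scores, fake_scores):
    """Empirical estimate of E[D(x)] - E[D(G(z))], maximized by the discriminator."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return mean_real - mean_fake


def critic_loss(real_scores, fake_scores):
    """Loss to minimize for the discriminator (negative of the objective)."""
    return -critic_objective(real_scores, fake_scores)


def generator_loss(fake_scores):
    """Loss to minimize for the generator: it wants D(G(z)) to be large."""
    return -sum(fake_scores) / len(fake_scores)


# D scores for two real samples and two generated samples.
gap = critic_objective([1.0, 3.0], [0.0, 1.0])
```

The generator only influences the second expectation, which is why its loss depends on the scores of generated samples alone.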
To address the challenge of categorical fields in standardized data for WGAN, the categorical label of each newly generated sample is assigned based on the Euclidean distance computed over all features except the categorical fields. The specific steps of the data sampling method are as follows:
  • A new sample is generated, and the Euclidean distance between the new sample and existing samples within the continuous feature space is computed;
  • The assessment of sample similarity is carried out;
  • The original sample exhibiting the closest proximity is then selected;
  • The generated sample is assigned the categorical label “Pre-collision Driving State” based on the selected original sample.
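The four steps above amount to a nearest-neighbor lookup in the continuous feature space. The following minimal sketch assumes records are dictionaries of standardized continuous features plus one categorical field; the helper name `assign_state` and the key name `pre_collision_state` are hypothetical.

```python
import math


def assign_state(generated, originals, continuous_keys, label_key="pre_collision_state"):
    """Copy the categorical label of the closest original sample onto a generated one."""

    def distance(a, b):
        # Euclidean distance over continuous features only.
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in continuous_keys))

    # Select the original sample with the smallest distance (first one wins on ties).
    nearest = min(originals, key=lambda row: distance(generated, row))
    labeled = dict(generated)
    labeled[label_key] = nearest[label_key]
    return labeled


originals = [
    {"x": 0.0, "y": 0.0, "pre_collision_state": "straight"},
    {"x": 3.0, "y": 4.0, "pre_collision_state": "reversing"},
]
new_sample = assign_state({"x": 2.5, "y": 3.5}, originals, ["x", "y"])
```

Because the features are standardized beforehand, no single variable dominates the distance merely because of its unit.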
This procedure ensures that the generated samples preserve consistency in continuous features while also incorporating relevant categorical information from the associated original samples. After each sample is generated, the overall and category-specific prediction accuracies on the finalized sample are evaluated using the LightGBM model. LightGBM is an efficient gradient boosting framework developed by a team at Microsoft [40]. It builds on decision tree learning to improve both efficiency and accuracy, which makes it well suited to classification tasks. LightGBM trains rapidly through its histogram-based decision tree algorithm and optimizes tree construction via leaf-wise splitting, which significantly improves model performance.
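The accuracy check can be illustrated independently of the classifier. The sketch below assumes that "category-specific accuracy" means the share of samples in each true loss level that are predicted correctly; that interpretation and the function name `accuracy_report` are assumptions, and the predictions would in practice come from the fitted LightGBM model rather than from the hand-written list used here.

```python
def accuracy_report(y_true, y_pred):
    """Return overall accuracy and a dict of per-class accuracy keyed by true label."""
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for label in sorted(set(y_true)):
        # Indices of samples whose true level equals this label.
        idx = [i for i, t in enumerate(y_true) if t == label]
        per_class[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return overall, per_class


# 0 = minor, 1 = general, 2 = major (illustrative labels and predictions).
overall, per_class = accuracy_report([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Reporting the per-class values alongside the overall figure makes it visible when a high overall accuracy hides poor performance on the minority loss level.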

4.2. Model Formulation

The Generalized Ordered Logit Model and the Random Effects Generalized Ordered Logit Model are utilized to predict the economic loss levels of traffic crashes. The estimation of model parameters is executed using the Maximum Likelihood Estimation method, which ensures both accuracy and stability in parameter estimates.

4.2.1. The Generalized Ordered Logit Model

The generalized ordered logit model captures the ordered nature of crash economic loss levels adeptly and accommodates the natural ordering among categories effectively, thereby mitigating biases associated with fixed threshold methods. The probability calculation formula for the generalized ordered logit model is presented as follows:
$$P(Y = j \mid X) = \begin{cases} \dfrac{1}{1 + \exp\left(\alpha_j - \beta^T X\right)}, & \text{for } j = 0 \\[2ex] \dfrac{1}{1 + \exp\left(\alpha_j - \beta^T X\right)} - \dfrac{1}{1 + \exp\left(\alpha_{j-1} - \beta^T X\right)}, & \text{for } j = 1, 2 \end{cases}$$
where $P(Y = j \mid X)$ is the probability the crash economic loss level $Y$ takes the value $j$, with $j$ representing the economic loss level categories (minor claim (0), general claim (1), or major claim (2)); $X$ is the vector of influencing factors, including driving characteristics and vehicle features; $\alpha_j$ is the threshold (cut-off point) for category $j$, used to distinguish between different categories; and $\beta$ is the coefficient vector for the influencing factors $X$, representing the effect of factors on crash economic loss levels.
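To make the mapping from thresholds and coefficients to category probabilities concrete, the following minimal sketch computes the three probabilities for one observation. It assumes the common cumulative-logit parameterization $P(Y \le j \mid X) = 1 / (1 + \exp(-(\alpha_j - \beta^T X)))$ with increasing thresholds, which may differ in sign convention from the expression printed above. The coefficient and threshold values are hypothetical, not estimates from this study.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def ordered_logit_probs(x, beta, thresholds):
    """Category probabilities for an ordered logit with len(thresholds) + 1 levels.

    Assumes P(Y <= j | X) = sigmoid(alpha_j - beta'x) and increasing thresholds.
    """
    eta = sum(b * v for b, v in zip(beta, x))
    cumulative = [sigmoid(a - eta) for a in thresholds] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs


# Hypothetical coefficients and thresholds, for illustration only.
beta = [1.2, -0.7]
thresholds = [-0.5, 1.0]  # alpha_0 < alpha_1

probs = ordered_logit_probs([0.3, 0.6], beta, thresholds)
print([round(p, 3) for p in probs])  # P(minor), P(general), P(major)
```

Under this convention, a larger linear index $\beta^T X$ shifts probability mass from the minor category toward the major category.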

4.2.2. Random Effects Generalized Ordered Logit Model

The random effects generalized ordered logit model enhances the generalized ordered logit model by introducing random effects to address individual heterogeneity. The probability calculation formula for the categories in the random effects generalized ordered logit model is as follows:
$$P(Y = j \mid X, u_i) = \begin{cases} \dfrac{1}{1 + \exp\left(\alpha_j + u_i - \beta^T X\right)}, & \text{for } j = 0 \\[2ex] \dfrac{1}{1 + \exp\left(\alpha_j + u_i - \beta^T X\right)} - \dfrac{1}{1 + \exp\left(\alpha_{j-1} + u_i - \beta^T X\right)}, & \text{for } j = 1, 2 \end{cases}$$
where $P(Y = j \mid X, u_i)$ is the probability that the crash economic loss level $Y$ takes the value $j$, with $j$ representing the economic loss categories (minor claim (0), general claim (1), or major claim (2)); $X$ is the vector of influencing factors, including driving characteristics and vehicle features; $\alpha_j$ is the threshold (cut-off point) for category $j$, used to distinguish between different categories; $u_i$ is the random effect for individual $i$, reflecting unobserved individual characteristics, including the five Pre-Collision Driving States, namely straight driving, turning, reversing, rolling, and close following, and is assumed to follow a normal distribution $u_i \sim N(0, \sigma^2)$; $\beta$ is the coefficient vector for the influencing factors $X$, representing the effect of factors on crash economic loss levels.

5. Results

5.1. Result of Data Sampling

In the WGAN generator model, the input dimension is set to 100, corresponding to a random noise vector, while the output dimension comprises 60 continuous features. The generator is constructed using fully connected layers containing 128 and 256 neurons with ReLU activation functions. The output layer utilizes a Sigmoid activation function to yield an output of 60 neurons. The input dimension of the discriminator matches the generator’s output, and the discriminator features fully connected layers with 256 and 128 neurons. Its output layer employs a linear activation function so that the output can be used to compute the Wasserstein distance.
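The layer sizes described above can be checked with a small forward pass. The sketch below is a plain-Python illustration with randomly initialised, untrained weights. It is not the training code used in this study, and the ReLU activation in the discriminator's hidden layers is an assumption, since the text specifies only its layer widths and linear output.

```python
import math
import random

rng = random.Random(0)


def dense(n_in, n_out):
    """Randomly initialised (untrained) weights and zero biases."""
    limit = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu(t):
    return max(0.0, t)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def linear(t):
    return t


def run(layers, x):
    """Apply each (weights, biases), activation pair in sequence."""
    for (weights, biases), activation in layers:
        x = [activation(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


# Generator: 100 -> 128 -> 256 -> 60 (ReLU, ReLU, Sigmoid).
gen_layers = [(dense(100, 128), relu), (dense(128, 256), relu), (dense(256, 60), sigmoid)]
# Discriminator: 60 -> 256 -> 128 -> 1 (hidden ReLU assumed, linear output).
critic_layers = [(dense(60, 256), relu), (dense(256, 128), relu), (dense(128, 1), linear)]

noise = [rng.gauss(0.0, 1.0) for _ in range(100)]
fake = run(gen_layers, noise)
score = run(critic_layers, fake)[0]
print(len(fake), min(fake) > 0.0, max(fake) < 1.0)
```

The Sigmoid output keeps each generated feature in (0, 1), which suits features scaled to the unit interval, while the unbounded linear output lets the discriminator act as a critic rather than a probability classifier.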
To assess the consistency between synthetic and real samples, the Classifier Consistency Factor (CCF) is introduced. CCF quantifies the agreement between the synthetic data classification and the actual class labels. It is computed as follows:
$$CCF = 1 - \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|$$
where $N$ is the total number of samples, $\hat{y}_i$ represents the predicted class label for sample $i$, and $y_i$ denotes the corresponding real class label.
Additionally, the Relative Loss ($L_{rel}$) is utilized to measure the discrepancy between generated and real data distributions. It is defined as:
$$L_{rel} = \frac{1}{M} \sum_{j=1}^{M} \frac{\left| X_j^{real} - X_j^{gen} \right|}{\left| X_j^{real} \right|}$$
where $M$ represents the number of continuous features, $X_j^{real}$ denotes the real feature values, and $X_j^{gen}$ corresponds to the generated feature values.
The model evaluation yields a Classifier Consistency Factor (CCF) of 0.95, demonstrating robust consistency with the class labels. The Relative Loss (Lrel) is measured at 0.035, indicating a close alignment with actual samples. Meanwhile, the Wasserstein GAN loss (LWGAN) stands at 3.76, which falls within an acceptable range, particularly during the initial stages of training.
After each sample generation, the LightGBM model was employed to evaluate both overall and class-specific prediction accuracies.
The overall accuracy (OA) is calculated as:
$$OA = \frac{1}{N} \sum_{i=1}^{N} I\left(\hat{y}_i = y_i\right)$$
where $N$ denotes the total number of test samples, $\hat{y}_i$ represents the predicted class label for sample $i$, $y_i$ is the corresponding actual class label, and $I(\cdot)$ is the indicator function which equals 1 if $\hat{y}_i = y_i$ and 0 otherwise.
The class-specific prediction accuracy for economic loss level $k$ (denoted as $CA_k$) is calculated as:
$$CA_k = \frac{1}{N_k} \sum_{i \in D_k} I\left(\hat{y}_i = y_i\right)$$
where $N_k$ represents the number of test samples in class $k$, and $D_k$ denotes the set of samples belonging to class $k$.
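The two accuracy measures can be computed directly from predicted and actual labels. The following minimal sketch uses a small hypothetical label vector (0 = minor, 1 = general, 2 = major) rather than data from this study.

```python
def overall_accuracy(y_true, y_pred):
    """OA: share of samples whose predicted label equals the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_accuracy(y_true, y_pred, k):
    """CA_k: share of correctly predicted samples among those truly in class k."""
    in_class = [(t, p) for t, p in zip(y_true, y_pred) if t == k]
    return sum(t == p for t, p in in_class) / len(in_class)


# Hypothetical labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 0]

print(overall_accuracy(y_true, y_pred))
print([class_accuracy(y_true, y_pred, k) for k in (0, 1, 2)])
```

Reporting $CA_k$ alongside OA matters for imbalanced data, because a high OA can hide poor accuracy on a rare class.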
The sampling process commenced with 77 samples allocated to each economic loss level. By the 37th iteration, a total of 339 samples were generated through WGAN, evenly distributed across different crash economic loss levels: 113 each for minor, general, and major insurance claims. The model achieved an overall prediction accuracy of 92.87%. For comparison, prior studies on crash severity prediction have reported overall accuracies ranging from 75% to 85% under similar experimental settings [41,42]. Specifically, the prediction accuracy for minor insurance claims was 91.23%, while it reached 94.67% for general insurance claims and 93.44% for major insurance claims. The prediction accuracies across all economic loss levels were relatively close, indicating the model maintained consistent performance across categories. While a minor increase in loss was noted during the early stages of training, the overall performance remained within the expected parameters. Both the generator’s ability to produce samples and the discriminator’s capability to classify accurately improved markedly through adversarial optimization, as illustrated in Figure 2, where the red marker indicates the optimal number of synthetic samples.

5.2. Result of Model Prediction

5.2.1. Selection of Independent and Dependent Variables

A multicollinearity analysis is conducted on the balanced sample. The process of feature selection was as follows:
  • VIF Calculation: The Variance Inflation Factor (VIF) is computed for each feature as $VIF_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ represents the coefficient of determination obtained by regressing the feature $j$ on all other features.
  • Iterative Feature Elimination: Features with the highest VIF were sequentially removed. After each elimination, the importance of the remaining features was recalculated using the LightGBM model. This iterative process continued until all features with VIF values greater than 10 were excluded.
  • Evaluation of Feature Importance: After removing collinear features, the cumulative importance of the retained variables was assessed using the LightGBM model to ensure the total contribution of the selected variables exceeded 80%.
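The VIF-based elimination loop described in these steps can be sketched as follows. This is a simplified illustration, not the authors' code. It uses the identity that $VIF_j$ equals the $j$-th diagonal element of the inverse correlation matrix, it omits the LightGBM importance recalculation, and the feature names and toy data are hypothetical.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long feature columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in unit] for ci in unit]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def drop_high_vif(data, threshold=10.0):
    """Drop the feature with the largest VIF until all VIFs are <= threshold."""
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda j: scores[j])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Toy data: feature_c is almost exactly feature_a + feature_b (collinear).
rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(200)]
b = [rng.gauss(0.0, 1.0) for _ in range(200)]
c = [x + y + rng.gauss(0.0, 0.05) for x, y in zip(a, b)]
data = {"feature_a": a, "feature_b": b, "feature_c": c}

selected = drop_high_vif(data)
print(selected)
```

Removing one feature at a time and recomputing the VIFs matters because dropping a single collinear feature can bring the VIFs of the remaining features below the threshold.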
During the elimination of high-VIF features, an increase in cumulative contribution is observed (see Figure 3). Table 2 lists the variables used to develop the prediction model for crashes with different economic loss levels, all of which exhibit a Variance Inflation Factor (VIF) of less than 10.

5.2.2. Model Performance

In model evaluation, the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) serve as essential metrics for assessing both model fit and complexity, while classification accuracy gauges the predictive performance of classification models across various categories. By imposing a penalty on the number of parameters, the AIC facilitates the selection of a model which balances goodness of fit against model complexity. The calculation formula of AIC is as follows:
$$AIC = -2 \times \log L + 2 \times k$$
where $\log L$ is the log-likelihood function value of the model, indicating the goodness of fit of the model; $k$ is the number of parameters in the model.
Compared to the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) is more rigorous in large-sample contexts, because its penalty on the number of model parameters grows with the sample size. As a result, BIC is particularly well-suited for comparing models in large-sample settings, where heavier penalties on model complexity mitigate the risk of overfitting. The formulation of BIC is as follows:
$$BIC = -2 \times \log L + k \times \log n$$
where $\log L$ is the log-likelihood function value; $k$ is the number of parameters in the model; and $n$ is the number of samples.
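Both criteria are simple functions of the log-likelihood, the number of parameters, and the sample size. The following minimal sketch uses hypothetical values for the log-likelihood and parameter count; they are not taken from Table 3.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 * logL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 * logL + k * log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical log-likelihood and parameter count, for illustration only.
log_l, k, n = -250.0, 12, 339

print(f"AIC = {aic(log_l, k):.2f}")
print(f"BIC = {bic(log_l, k, n):.2f}")
```

The BIC penalty per parameter, $\log n$, exceeds the AIC penalty of 2 whenever $n > e^2 \approx 7.4$, so for any realistic sample size BIC penalizes additional parameters more heavily than AIC.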
Classification accuracy reflects a model’s predictive performance, making it a critical metric for evaluating its effectiveness. The metric provides an intuitive evaluation of the model’s predictive power and can be computed both for individual categories and for the whole dataset, thus enabling a comprehensive model assessment. For category $j$ ($j$ = 0, 1, or 2), the formula for calculating classification accuracy is expressed as follows:
$$CA_j = \frac{\sum_{i=1}^{n_j} L\left(\hat{Y}_i = Y_i\right)}{n_j} \times 100\%$$
where $\hat{Y}_i$ denotes the predicted category for the driver corresponding to position $i$, with crash economic loss level as minor claim (0), general claim (1), or major claim (2); $Y_i$ denotes the actual category for the driver corresponding to position $i$, coded in the same way; $n_j$ is the total number of drivers whose true category is $j$; and $L\left(\hat{Y}_i = Y_i\right)$ is the indicator function, which equals 1 when the predicted value $\hat{Y}_i$ matches the true value $Y_i$, and 0 otherwise.
The formula for calculating the overall classification accuracy across the dataset is delineated as follows:
$$CA_t = \frac{\sum_{i=1}^{n} L\left(\hat{Y}_i = Y_i\right)}{n} \times 100\%$$
where $n$ is the total number of at-risk drivers; and $L\left(\hat{Y}_i = Y_i\right)$ is the indicator function, which equals 1 when the prediction is correct, and is 0 otherwise.
Table 3 reports the AIC, BIC, and classification accuracy of the crash economic loss level prediction models based on the Generalized Ordered Logit Model and the Random Effects Generalized Ordered Logit Model. The Random Effects Generalized Ordered Logit Model demonstrates superior performance in predicting the economic loss level of truck crashes when compared to the traditional Generalized Ordered Logit Model. Although the AIC and BIC values for the random effects model are higher, the increase can be attributed to the enhanced capability to capture inter-group heterogeneity within the data. In terms of classification accuracy, the random effects model significantly improves predictions of crash economic loss levels for Categories 1 and 2, achieving accuracies of 88.50% and 92.92%, respectively. The overall classification accuracy of the random effects model reaches 90.86%, which is 9.38 percentage points higher than that of the traditional model.

5.2.3. Crash Causality Analysis

Part I of this section identifies the key factors influencing economic loss levels of crashes. Part II presents the theoretical foundation for the marginal effects of factors and analyzes the specific impact of each individual crash factor on economic loss levels.
I. Identification of Factors with Significant Impact
The factors deemed statistically significant (at a confidence level of 95% or higher) selected for the marginal effects analysis are presented in Table 4.
In the analysis of potential factors influencing crashes’ economic loss levels, several explanatory variables had statistically significant effects on the dependent variable, based on the significance of their coefficients (p-value < 0.05). Notably, Inter-provincial Duration Ratio, Speeding Duration Ratio, and Fatigue Mileage Ratio were significantly positively correlated with crash economic loss levels. Conversely, Inter-provincial Count Ratio, Speeding Count Ratio, Morning Count Ratio, Fatigue Count Ratio, Empty Count Ratio, NCD coefficient, and CarAge demonstrated significant negative correlations with economic loss levels of crashes.
II. Marginal Effects Analysis of Influencing Factors
After variable reduction, the generalized ordinal logit model is estimated using 10 predictors, which are also used for marginal-effects computation and interpretation (Table 4): Inter-provincial Duration Ratio, Speeding Duration Ratio, Inter-provincial Count Ratio, Speeding Count Ratio, Morning Count Ratio, Fatigue Count Ratio, Fatigue Mileage Ratio, Empty Count Ratio, NCD coefficient, and CarAge. As a robustness check, five-fold cross-validation is conducted; out-of-sample classification accuracy remains above 0.85 for each loss category (0.91 for minor, 0.88 for general, 0.86 for major), and the overall accuracy is 90.3%, supporting stable performance under repeated sample partitioning.
Marginal effects analysis is a crucial tool in the context of the random effects logit model for examining the influence of independent variables on the probability of a given outcome. In this model, marginal effects quantify the change in the probability of a particular outcome (economic loss category) resulting from a small change in one of the independent variables, while holding all other variables constant. These effects are essential for understanding the magnitude and direction of the relationship between the predictors and the outcome, considering the presence of unobserved heterogeneity captured by random effects.
For a random effects logit model, the marginal effect of an independent variable $x_k$ on the probability $P_j$ of outcome category $j \in \{0, 1, 2\}$ is computed by differentiating the predicted probability with respect to $x_k$, and then averaging over the distribution of the random effect $u_i$. This process accounts for individual-level variation in the relationship between the independent variables and the dependent variable, reflecting different pre-collision driving states (straight driving, turning, reversing, rolling, and close following).
The marginal effect of each predictor variable $x_k$ is expressed as:
$$\frac{\partial P_j}{\partial x_k} = E\left[\left.\frac{\partial P_j}{\partial x_k} \,\right|\, X\right] = \begin{cases} \dfrac{\partial}{\partial x_k}\left[\dfrac{1}{1 + \exp\left(\alpha_0 + u_i - \beta^T X\right)}\right], & \text{for } j = 0 \\[2ex] \dfrac{\partial}{\partial x_k}\left[\dfrac{\exp\left(\alpha_1 + u_i - \beta^T X\right)}{1 + \exp\left(\alpha_1 + u_i - \beta^T X\right)}\right], & \text{for } j = 1 \\[2ex] \dfrac{\partial}{\partial x_k}\left[\dfrac{\exp\left(\alpha_2 + u_i - \beta^T X\right)}{1 + \exp\left(\alpha_2 + u_i - \beta^T X\right)}\right], & \text{for } j = 2 \end{cases}$$
where $P_j$ denotes the probability that the economic loss of a crash belongs to category $j \in \{0, 1, 2\}$, corresponding to minor claim (0), general claim (1), or major claim (2); $X$ is the vector of crash influencing factors, including driving characteristics and vehicle features; $\beta$ is the vector of coefficients for the crash influencing factors $X$, representing the impact of factors on crashes’ economic loss levels; $\alpha_j$ is the threshold for category $j$, used to distinguish between different categories; $u_i$ is the random effect, reflecting the five driving conditions prior to the crash, namely straight driving, turning, reversing, rolling, and close following, and is assumed to follow a normal distribution $u_i \sim N(0, \sigma^2)$, capturing individual heterogeneity; and $f(u_i)$ is the probability density function of the random effect, with respect to which the expectation is taken.
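The averaging over the random effect can be illustrated numerically. The sketch below is a minimal example under stated assumptions. It uses the common cumulative-logit convention $P(Y \le j \mid X, u) = 1 / (1 + \exp(-(\alpha_j + u - \beta^T X)))$, which may differ in sign convention from the expression above. It replaces the analytical derivative with a central finite difference and approximates the normal density on a fixed grid. The coefficients, thresholds, and covariate values are hypothetical, and only the random-intercept variance of 0.87 is taken from the estimate reported in this section.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def probs_given_u(x, beta, thresholds, u):
    """Category probabilities conditional on the random intercept u.

    Assumes P(Y <= j | X, u) = sigmoid(alpha_j + u - beta'x).
    """
    eta = sum(b * v for b, v in zip(beta, x))
    cum = [sigmoid(a + u - eta) for a in thresholds] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


def normal_grid(sigma, points=201, width=6.0):
    """Grid nodes and normalised weights approximating N(0, sigma^2)."""
    step = 2.0 * width * sigma / (points - 1)
    nodes = [-width * sigma + i * step for i in range(points)]
    raw = [math.exp(-0.5 * (u / sigma) ** 2) for u in nodes]
    total = sum(raw)
    return nodes, [w / total for w in raw]


def average_marginal_effects(x, beta, thresholds, sigma, k, h=1e-5):
    """dP_j/dx_k averaged over u, using central finite differences."""
    nodes, weights = normal_grid(sigma)
    x_up = list(x)
    x_up[k] += h
    x_down = list(x)
    x_down[k] -= h
    effects = [0.0] * (len(thresholds) + 1)
    for u, w in zip(nodes, weights):
        p_up = probs_given_u(x_up, beta, thresholds, u)
        p_down = probs_given_u(x_down, beta, thresholds, u)
        for j in range(len(effects)):
            effects[j] += w * (p_up[j] - p_down[j]) / (2.0 * h)
    return effects


# Hypothetical coefficients, thresholds, and covariates.
beta = [1.2, -0.7]
thresholds = [-0.5, 1.0]
x = [0.3, 0.6]
sigma = math.sqrt(0.87)  # random-intercept variance reported in the text

effects = average_marginal_effects(x, beta, thresholds, sigma, k=0)
print([round(e, 4) for e in effects])
```

Because the three category probabilities sum to one for every value of $u_i$, the averaged marginal effects of any predictor sum to zero across categories, so an increase in one category's probability is offset by decreases in the others.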
In the context of the random effects logit model, the pre-collision driving state is an observable categorical variable, which is treated as a random effect to account for individual heterogeneity and unobserved variations across different driving conditions. Specifically, $u_i$, the random effect associated with each individual, captures the variability in crash outcomes due to unobserved factors specific to each pre-collision driving state. Although these driving states (e.g., straight driving, turning, reversing, rolling, and close following) are explicitly observed, modeling them as random effects allows for a more nuanced representation of their influence on crash outcomes. This approach incorporates a state-specific random intercept for each driving state, enabling the model to reflect the unique impact of each state while controlling for potential random fluctuations. Consequently, the marginal effect is derived by averaging the derivative with respect to $x_k$ over the distribution of $u_i$, ensuring the individual differences inherent in the various driving states are properly accounted for.
Interpretation of marginal effects is conditional on the modeling assumptions of the random-effects generalized ordinal logit specification, and several limitations follow from these assumptions. The model is formulated in a latent-variable framework with a logistic link and a linear-additive predictor on the logit scale. The generalized ordinal structure permits threshold-specific effects for selected predictors and relaxes the proportional-odds constraint; inference nevertheless remains dependent on the selected link function and the adopted functional form. Nonlinearities or interaction structures not included in the linear index are not represented in the current specification. Simulation-based residual diagnostics using DHARMa report a standardized residual dispersion ratio of 1.03 (95% interval [0.96, 1.10]) and a Kolmogorov–Smirnov test with p = 0.42, indicating no apparent systematic lack-of-fit under these checks.
Unobserved heterogeneity across pre-crash driving states is represented by normally distributed random intercepts with mean zero, with an estimated variance of 0.87 (SE 0.21). Posterior predictive checking based on simulated outcome frequencies yields a Cramér–von Mises statistic of 0.082, indicating agreement between simulated and observed outcome distributions under the adopted random-intercept structure. The random-effects component is restricted to a mean-zero normal random-intercept specification. Estimated coefficients and marginal effects quantify conditional associations given the covariates included in the final specification and should not be interpreted as establishing causal direction, causal magnitude, or underlying mechanisms.
The Sankey diagram in Figure 4 maps pre-collision driving states to crash economic-loss levels. It specifies five states: straight driving, turning, reversing, rolling, and close following, and highlights the distribution of economic-loss levels within each state. A substantial majority of crash records are linked to straight driving and reversing maneuvers, accounting for 53.1% and 38.3%, respectively. This indicates that these two driving states are more prevalent and might be associated with a higher frequency of crashes than the other states. Conversely, close following (3.5%), turning (3.5%), and rolling (1.5%) contribute relatively little to the overall crash dataset, suggesting a lower proportion of crashes in these three scenarios. As shown in Figure 4, crashes occurring during reversing maneuvers exhibit the highest economic loss levels. In contrast, incidents during straight driving are predominantly categorized as minor claims or general claims, with only a negligible percentage being classified as major claims. Close-following crashes are primarily minor claims; turning-related incidents show an increased incidence of minor claims together with a slightly elevated proportion of major claims compared to other states. Rolling crashes tend to be classified as either minor claims or general claims. The marginal effects of various risk factors on the economic loss levels of crashes, categorized by pre-collision driving state, are detailed in Table 5 and Figure 5.
Figure 5 displays a bar chart depicting the marginal effects of various influencing factors on economic loss levels of crashes.
For minor insurance claims, increases in Morning Count Ratio, CarAge, NCD coefficient, Empty Count Ratio, Inter-provincial Count Ratio, Speeding Count Ratio, and Fatigue Count Ratio lead to average increases in the probability of a minor claim of approximately 4.9%, 6.9%, 8.0%, 8.5%, 11.3%, 29.0%, and 53.3%, respectively. In contrast, increases in Inter-provincial Duration Ratio, Speeding Duration Ratio, and Fatigue Mileage Ratio result in average decreases in this probability of approximately 13.5%, 31.1%, and 77.8%, respectively.
For general insurance claims, increases in Morning Count Ratio, CarAge, NCD coefficient, Empty Count Ratio, Inter-provincial Count Ratio, Speeding Count Ratio, and Fatigue Count Ratio lead to average decreases in the probability of a general claim of approximately 3.5%, 4.9%, 5.6%, 6.1%, 8.0%, 20.6%, and 68.6%, respectively. In contrast, increases in Inter-provincial Duration Ratio, Speeding Duration Ratio, and Fatigue Mileage Ratio result in average increases in this probability of approximately 9.6%, 22.1%, and 83.5%, respectively.
For major insurance claims, increases in Morning Count Ratio, CarAge, NCD coefficient, Empty Count Ratio, Inter-provincial Count Ratio, Speeding Count Ratio, and Fatigue Count Ratio lead to average decreases in the probability of a major claim of approximately 1.4%, 2.0%, 2.3%, 2.5%, 3.3%, 8.4%, and 44.6%, respectively. In contrast, increases in Inter-provincial Duration Ratio, Speeding Duration Ratio, and Fatigue Mileage Ratio result in average increases in this probability of approximately 3.9%, 9.0%, and 34.1%, respectively.
In a multinomial severity framework, the marginal effects describe how probability is reallocated among the crash severity outcomes (Minor, General, Major). Because the outcome probabilities sum to one, the marginal effects of any predictor must sum to zero across categories, so an increase in the probability of one outcome is offset by decreases in the others. The observed sign reversals for minor crashes, particularly for variables related to speeding, fatigue, and inter-provincial travel, represent this redistribution across severity levels and are consistent with the adding-up constraint in multinomial logit models. These sign reversals indicate frequency-based indicators, including speeding counts and fatigue counts, are more strongly associated with minor crashes, which generally involve lower severity [43]. In contrast, duration-based indicators, including speeding duration and fatigue mileage, capture sustained exposure that increases crash severity, resulting in higher probabilities of general and major crashes [44]. This is consistent with the theory that brief, frequent risk events increase the probability of minor crashes, while prolonged exposure to risk factors raises the likelihood of higher-severity crashes [45,46].
Figure 6 presents a line graph which compares the influence of different pre-collision driving states on crashes with different economic loss levels, using straight-line driving as a baseline.
As shown in Figure 6, compared to straight driving, an increase in Fatigue Count Ratio and Speeding Count Ratio raises the probability of minor claims, with this effect being more pronounced in the turning, reversing, rolling, and close following states, in that order. In contrast, an increase in Fatigue Mileage Ratio and Speeding Duration Ratio reduces the probability of minor claims, with the reduction being more noticeable in the reversing and rolling states than in the turning and close following states. For crashes with general economic losses, an increase in Fatigue Mileage Ratio raises the crash probability in the turning, reversing, rolling, and close following states, while an increase in Fatigue Count Ratio lowers the crash probability more in the reversing and rolling states than in the turning and close following states. Notably, a higher Morning Count Ratio when reversing significantly reduces the probability of general crashes. For major claims, an increase in Fatigue Count Ratio and Speeding Count Ratio decreases the crash probability, with the reduction being most pronounced in the turning state, while an increase in Fatigue Mileage Ratio and Speeding Duration Ratio increases the crash probability, with this increase being larger in the rolling and close following states than in the turning state.
The stacked area chart in Figure 7 ranks the influencing factors solely by the absolute magnitude of their impact on crash economic loss levels. The y-axis in Figure 7 represents the cumulative contribution of each influencing factor to crash economic loss levels, measured by the absolute magnitude of its effect. The influence of the Inter-provincial Duration Ratio, Inter-provincial Count Ratio, Morning Count Ratio, Empty Count Ratio, NCD coefficient, and CarAge on crashes’ economic loss levels is relatively small, whereas the Speeding Duration Ratio, Fatigue Mileage Ratio, Speeding Count Ratio, and Fatigue Count Ratio have a relatively large impact. The Inter-provincial Duration Ratio is chosen as the threshold for distinguishing the degree of influence on the probability of crashes with different economic losses, as it represents a key turning point in the stacked area chart. This inflection point marks a clear transition in the impact of various factors, with the green band visually highlighting the distinction in how different factors contribute to the probability of crashes with different economic losses. Figure 7 reveals patterns that warrant further exploration for the Speeding Duration Ratio, Fatigue Mileage Ratio, Speeding Count Ratio, and Fatigue Count Ratio.
In summary, the impact of speeding and fatigue on traffic crashes is complex: both factors increase the likelihood of crashes, but their effects vary across different levels of crash economic loss. The increase in Speeding Duration Ratio and Fatigue Mileage Ratio has a smaller effect on crashes with minor insurance claims, but a larger impact on crashes with general and major insurance claims. In addition, an increase in Speeding Count Ratio and Fatigue Count Ratio significantly raises the probability of crashes with minor insurance claims, while having a smaller effect on crashes with general and major insurance claims. Figure 8 shows the relationship between the main influencing factors (i.e., Speeding Duration Ratio, Fatigue Mileage Ratio, Speeding Count Ratio, and Fatigue Count Ratio) and crashes with different economic losses (i.e., minor, general, and major insurance claims).
The Speeding Duration Ratio serves as an indicator of driving behavior, illustrating a tendency to maintain elevated speed for extended periods. Sustained excessive speeding significantly heightens a driver’s neglect of potential hazards and amplifies the risk of losing control over the vehicle [47]. In contrast, the Speeding Count Ratio pertains to instances of speeding occurring within short time intervals, which reflects a driver’s inclination towards rapid driving at various moments [48]. While frequent instances of brief speeding would increase the likelihood of distractions and operational errors, the relatively short duration associated with each instance typically permits drivers to undertake corrective actions aimed at mitigating severe outcomes. Consequently, such behavior is more prone to result in crashes with minor claims, such as scrapes or slight collisions [49].
The Fatigue Mileage Ratio represents the cumulative fatigue experienced during prolonged driving. As fatigued mileage increases, a driver’s alertness and reaction time gradually diminish, resulting in a decreased capacity to respond to sudden situations, which in turn heightens the risk of crashes with major insurance claims. Such an impact is especially pronounced in complex driving scenarios (e.g., turning or reversing), where fatigue exerts a significant impact on the driver’s performance [50,51]. Conversely, the Fatigue Count Ratio emphasizes the frequency of fatigue events encountered by the driver over short intervals. Frequent episodes of fatigued driving typically indicate the driver has not taken timely breaks; however, each instance of fatigue tends to be brief, allowing the driver to maintain a certain level of alertness and reaction capability. Consequently, crashes are more likely to result in minor rather than major insurance claims [52,53].
Short-term high-risk behaviors refer to frequent, brief episodes of hazardous actions, such as instances of exceeding speed limits or momentary fatigue, occurring within short time intervals, which are characterized by high frequency and short duration, meaning the exposure to risk is limited in time [54,55]. Long-term high-risk behaviors involve sustained exposure to hazardous conditions, such as prolonged periods of speeding or extended driving while fatigued, which result in prolonged exposure to risk, where the effects accumulate over time [56,57].
Overall, the Speeding Duration Ratio and Fatigue Mileage Ratio primarily emphasize long-term driving states [58,59], which represent the duration a driver engages in hazardous behaviors or conditions. Over time, prolonged speeding and extended fatigued driving progressively erode driver control and vigilance. The resulting degradation increases crash economic-loss levels. In contrast, the frequencies of speeding events and fatigue events pertain more to short-term high-risk behaviors. While recurrent instances of such risky actions might elevate the likelihood of crashes, each event typically transpires within a shorter timeframe, allowing for quicker driver reactions. As a result, crash economic loss levels are often relatively lower, generally leading to minor claims [60].

6. Discussion

This study employs real-world vehicle insurance data in conjunction with the Random Effects Generalized Ordered Logit Model to explore how influencing factors, including both driving and vehicle characteristics, impact crashes with different economic losses. In contrast to studies which depend on simulated data [61] or publicly accessible traffic crash statistics [62], data derived from commercial vehicles and insurance claims are utilized in the current research. In particular, the economic loss represented by the insurance claim is used to grade crashes, providing a loss-based complement to injury-severity endpoints and addressing limitations in studies relying exclusively on the most severe injury as the crash-severity metric [8]. The dataset presented in this study is not only more representative of realistic conditions but also delivers enhanced detail, thereby supporting loss-oriented inference for urban freight operations.
In this study, the implementation of Wasserstein Generative Adversarial Networks (WGAN) has significantly improved both the quality and stability of the generated samples. Furthermore, the Euclidean distance-based classification label assignment method effectively mitigates inconsistencies in classification features produced by GANs [22]. Conventional resampling techniques, although frequently employed, might induce overfitting in the model or erode critical information [14]. Additionally, feature selection is made more rigorous by using the VIF to manage multicollinearity, while cumulative feature importance derived from LightGBM guides the retention of informative variables. Despite persisting challenges in addressing non-linear relationships among variables [63], the integration of VIF and cumulative feature importance demonstrates notable advancements in both variable interpretability and model robustness relative to other feature selection methods such as Principal Component Analysis [64] and LASSO regression [65]. Collectively, the imbalance-treatment procedure and the screening strategy establish a reproducible workflow in which data augmentation, multicollinearity control, and information-preserving selection are specified prior to estimating an interpretable ordinal regression model.
The empirical results refine earlier evidence on risky driving by separating exposure-driven and frequency-driven behavioral signals across ordered loss thresholds. Higher speeding duration and greater fatigued mileage exhibit the strongest associations with the general or major claim categories, consistent with mechanisms involving reduced control margins and delayed reactions under complex driving conditions [66]. Conversely, short-term high-risk behaviors, such as frequent speeding and fatigue episodes, are more likely to result in crashes with minor insurance claims. Because drivers' response abilities tend to remain relatively intact over short durations, transitory and frequent speeding or fatigue often leads to minor collisions or scrapes [67]. Additionally, this study finds that, for reversing maneuvers, the probability of crashes with general insurance claims decreases significantly as morning driving frequency increases [68]. This state-specific association supports contextualized risk assessment and is consistent with evidence emphasizing heterogeneity across operating conditions in crash modeling. In relation to earlier literature, the random-effects ordinal framework provides an interpretable mechanism for representing heterogeneity across pre-collision driving states, supporting state-aware evaluation of economic-loss risk in miniature commercial truck operations.

7. Conclusions

This study presents a framework for analyzing ordered economic loss levels of crashes. The framework draws on macro-level annual driving characteristics, vehicle attributes, and insurance claim amounts, and it quantifies how various risk factors affect crash economic loss levels. To address data imbalance and enhance prediction accuracy, an improved Wasserstein Generative Adversarial Network (WGAN) incorporating Euclidean distance-based data augmentation was developed. Following data balancing, both the Generalized Ordered Logit model and the Random Effects Generalized Ordered Logit model were applied. The latter model not only accounts for the influence of different pre-collision driving states (such as straight driving, turning, and reversing) through state-level heterogeneity, but also demonstrates superior predictive performance under the evaluation protocol reported in the Results Section.
Model-based evidence indicates differentiated association patterns across loss categories. For crashes with minor insurance claims, increases in Morning Count Ratio, CarAge, NCD coefficient, Empty Count Ratio, and Inter-provincial Count Ratio are linked to a higher probability of occurrence. Conversely, a higher Inter-provincial Duration Ratio is associated with a lower probability of minor claims. Additionally, frequent fatigue and speeding episodes significantly increase the risk of crashes with minor insurance claims in vehicle states such as "turning", "reversing", "rolling" and "close following". By contrast, fatigued mileage and prolonged speeding duration do not appear to be strongly related to crashes with minor insurance claims. For crashes with general and major insurance claims, higher values of Morning Count Ratio, CarAge, NCD coefficient, Empty Count Ratio, and Inter-provincial Count Ratio are associated with a lower likelihood of such crashes, whereas an elevated Inter-provincial Duration Ratio substantially heightens the risk. Notably, during reversing maneuvers, greater morning driving frequency correlates with a lower probability of crashes with general insurance claims. Furthermore, increased fatigued mileage and extended periods of speeding amplify the risk of crashes with both general and major insurance claims across all vehicle operating conditions. In contrast, when fatigue or speeding episodes are frequent, the likelihood of crashes with general and major claims is reduced, whereas the probability of crashes with minor claims is elevated. These results distinguish exposure-driven indicators (e.g., speeding duration and fatigued mileage) from frequency-driven indicators (e.g., fatigue or speeding occurrences), and the two indicator types align with different loss categories.
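The sign pattern summarized above follows from how an ordered-response model distributes probability across loss levels. The sketch below is illustrative only. It assumes a logistic link, threshold-specific coefficients, and an additive state-level random intercept `u`; the parameterization, function names, and numbers are hypothetical and are not the study's estimates. It shows that a covariate with positive coefficients shifts probability away from the lowest category and toward the highest, and that marginal effects across categories sum to zero because the category probabilities sum to one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(x, betas, cutpoints, u=0.0):
    """Probabilities of ordered levels 0..J, assuming
    P(Y > j) = sigmoid(x . beta_j - cut_j + u) for each threshold j.
    With threshold-specific betas, the exceedance probabilities must be
    decreasing in j for every category probability to be non-negative."""
    exceed = []
    for beta, cut in zip(betas, cutpoints):
        eta = sum(b * xi for b, xi in zip(beta, x)) - cut + u
        exceed.append(sigmoid(eta))
    bounds = [1.0] + exceed + [0.0]
    return [bounds[j] - bounds[j + 1] for j in range(len(bounds) - 1)]

def marginal_effects(x, betas, cutpoints, k, u=0.0, h=1e-5):
    """Central-difference marginal effect of covariate k on each category."""
    up, down = list(x), list(x)
    up[k] += h
    down[k] -= h
    p_up = category_probs(up, betas, cutpoints, u)
    p_down = category_probs(down, betas, cutpoints, u)
    return [(a - b) / (2 * h) for a, b in zip(p_up, p_down)]

# Hypothetical one-covariate example with three loss levels (0, 1, 2).
x = [0.3]
betas = [[2.0], [1.5]]      # one coefficient vector per threshold
cutpoints = [0.5, 2.0]
probs = category_probs(x, betas, cutpoints)
effects = marginal_effects(x, betas, cutpoints, k=0)
print(probs, effects)
```

Passing a non-zero `u` shifts every threshold by the same amount, which is one simple way to represent a driving-state effect on the baseline loss level.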
In summary, the current study proposes a theoretical framework for exploring the causes of miniature commercial truck crashes with different economic losses based on driving characteristics, vehicle characteristics, and insurance claims. On the one hand, crash risk modeling is advanced by incorporating random effects into an ordinal regression framework, which addresses unobserved heterogeneity across driving conditions. Compared with earlier crash-outcome research relying primarily on injury severity or coarse severity groupings, the use of claim-estimated losses as an ordered endpoint provides a complementary outcome definition aligned with loss-oriented decision needs in urban freight operations. The findings substantiate the role of both short-term and long-term high-risk behaviors in shaping crash economic outcomes and highlight the need for dynamic risk assessment models. On the other hand, the results provide actionable insights for traffic safety management, insurance claim optimization, and policy formulation. The identified risk factors, including inter-provincial travel patterns, morning driving frequency, and vehicle aging, offer empirical support for refining driver monitoring systems and insurance premium structures.
However, several limitations should be addressed in future studies. The quality of the samples generated by the improved Wasserstein Generative Adversarial Network (WGAN) depends heavily on the distribution characteristics of the original data, so complex patterns might be captured inadequately. In addition, real-time driving behaviors and road environment data were not incorporated into the model. Furthermore, the long-term applicability of the model might be affected by evolving policies, technological advancements, and changes in driving behaviors. Future research would benefit from incorporating multimodal data, such as sensor inputs and traffic flow information, and from using advanced deep learning techniques such as Graph Neural Networks (GNNs) or temporal modeling. Validating the model's effectiveness in real-world applications and adjusting it to remain adaptable to dynamic environmental changes are also essential. In addition, linkage with police crash databases or roadway exposure data could strengthen external validity and reduce the potential reporting or selection bias inherent in claims-based outcomes.

Author Contributions

Conceptualization, Writing—original draft, Project administration: P.S.; Conceptualization, Writing—original draft, Supervision: Y.W.; Methodology, Visualization: H.Z. and J.R.; Data curation: N.Z. and J.M.; Resources: X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (52102403).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are not permitted to be made publicly available under the terms of the data-use agreement.

Acknowledgments

The data are supported by China Huanong Property & Casualty Insurance Company Limited (CHIC).

Conflicts of Interest

Author Xiaoheng Sun was employed by the China Huanong Property & Casualty Insurance Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Niu, Y.; Li, Z.; Fan, Y. Correlation Analysis of Influencing Factors of Truck Traffic Accidents on Expressways. Saf. Environ. Eng. 2020, 27, 180–188. [Google Scholar] [CrossRef]
  2. Yang, C.; Chen, M.; Yuan, Q. The Application of XGBoost and SHAP to Examining the Factors in Freight Truck-Related Crashes: An Exploratory Analysis. Accid. Anal. Prev. 2021, 158, 106153. [Google Scholar] [CrossRef]
  3. Hu, L.; Yang, H.; He, Y.; Zhao, X.; Yin, Y.; Tian, H.; Ling, H. Driving Risk Identification of Commercial Trucks Based on Complex Network Theory. J. Transp. Eng. Inf. 2022, 20, 128–134. [Google Scholar] [CrossRef]
  4. Mannering, F.; Bhat, C.R.; Shankar, V.; Abdel-Aty, M. Big Data, Traditional Data and the Tradeoffs between Prediction and Causality in Highway-Safety Analysis. Anal. Methods Accid. Res. 2020, 25, 100113. [Google Scholar] [CrossRef]
  5. Guillen, M.; Bermúdez, L.; Pitarque, A. Joint Generalized Quantile and Conditional Tail Expectation Regression for Insurance Risk Analysis. Insur. Math. Econ. 2021, 99, 1–8. [Google Scholar] [CrossRef]
  6. Denuit, M.; Guillen, M.; Trufin, J. Multivariate Credibility Modelling for Usage-Based Motor Insurance Pricing with Behavioural Data. Ann. Actuar. Sci. 2019, 13, 378–399. [Google Scholar] [CrossRef]
  7. Yang, M.; Bao, Q.; Shen, Y.; Qu, Q.; Zhang, R.; Han, T.; Zhang, H. Determinants Influencing Alcohol-Related Two-Vehicle Crash Severity: A Multivariate Bayesian Hierarchical Random Parameters Correlated Outcomes Logit Model. Anal. Methods Accid. Res. 2024, 44, 100361. [Google Scholar] [CrossRef]
  8. Habib, M.F.; Motuba, D.; Huang, Y. Beyond the Surface: Exploring the Temporally Stable Factors Influencing Injury Severities in Large-Truck Crashes Using Mixed Logit Models. Accid. Anal. Prev. 2024, 205, 107650. [Google Scholar] [CrossRef] [PubMed]
  9. Alnawmasi, N.; Mannering, F. A Temporal Assessment of Distracted Driving Injury Severities Using Alternate Unobserved-Heterogeneity Modeling Approaches. Anal. Methods Accid. Res. 2022, 34, 100216. [Google Scholar] [CrossRef]
  10. Çeven, S.; Albayrak, A. Traffic Accident Severity Prediction with Ensemble Learning Methods. Comput. Electr. Eng. 2024, 114, 109101. [Google Scholar] [CrossRef]
  11. Soomro, A.A.; Mokhtar, A.A.; Muhammad, M.B.; Saad, M.H.M.; Lashari, N.; Hussain, M.; Palli, A.S. Data Augmentation Using SMOTE Technique: Application for Prediction of Burst Pressure of Hydrocarbons Pipeline Using Supervised Machine Learning Models. Results Eng. 2024, 24, 103233. [Google Scholar] [CrossRef]
  12. Matharaarachchi, S.; Domaratzki, M.; Muthukumarana, S. Enhancing SMOTE for Imbalanced Data with Abnormal Minority Instances. Mach. Learn. Appl. 2024, 18, 100597. [Google Scholar] [CrossRef]
  13. Rao, C.; Wei, X.; Xiao, X.; Shi, Y.; Goh, M. Oversampling Method via Adaptive Double Weights and Gaussian Kernel Function for the Transformation of Unbalanced Data in Risk Assessment of Cardiovascular Disease. Inf. Sci. 2024, 665, 120410. [Google Scholar] [CrossRef]
  14. Liaw, L.C.M.; Tan, S.C.; Goh, P.Y.; Lim, C.P. A Histogram SMOTE-Based Sampling Algorithm with Incremental Learning for Imbalanced Data Classification. Inf. Sci. 2025, 686, 121193. [Google Scholar] [CrossRef]
  15. Sun, P.; Wang, Z.; Jia, L.; Xu, Z. SMOTE-kTLNN: A Hybrid Re-Sampling Method Based on SMOTE and a Two-Layer Nearest Neighbor Classifier. Expert Syst. Appl. 2024, 238, 121848. [Google Scholar] [CrossRef]
  16. Sun, Z.; Ying, W.; Zhang, W.; Gong, S. Undersampling Method Based on Minority Class Density for Imbalanced Data. Expert Syst. Appl. 2024, 249, 123328. [Google Scholar] [CrossRef]
  17. Zhang, J.; Wang, T.; Ng, W.W.Y.; Pedrycz, W. Ensembling Perturbation-Based Oversamplers for Imbalanced Datasets. Neurocomputing 2022, 479, 1–11. [Google Scholar] [CrossRef]
  18. Yuan, X.; Sun, C.; Chen, S. A Clustering-Based Adaptive Undersampling Ensemble Method for Highly Unbalanced Data Classification. Appl. Soft Comput. 2024, 159, 111659. [Google Scholar] [CrossRef]
  19. Pereira, R.M.; Costa, Y.M.G.; Silla, C.N., Jr. MLTL: A Multi-Label Approach for the Tomek Link Undersampling Algorithm. Neurocomputing 2020, 383, 95–105. [Google Scholar] [CrossRef]
  20. Dai, Q.; Liu, J.; Liu, Y. Multi-Granularity Relabeled under-Sampling Algorithm for Imbalanced Data. Appl. Soft Comput. 2022, 124, 109083. [Google Scholar] [CrossRef]
  21. Doubinsky, P.; Audebert, N.; Crucianu, M.; Le Borgne, H. Multi-Attribute Balanced Sampling for Disentangled GAN Controls. Pattern Recognit. Lett. 2022, 162, 56–62. [Google Scholar] [CrossRef]
  22. Park, N.; Park, J.; Lee, C. Conditional Generative Adversarial Network-Based Roadway Crash Risk Prediction Considering Heterogeneity with Dynamic Data. J. Saf. Res. 2025, 92, 217–229. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Fang, Y.; Cao, Y.; Wu, J. RBGAN: Realistic-Generation and Balanced-Utility GAN for Face de-Identification. Image Vis. Comput. 2024, 141, 104868. [Google Scholar] [CrossRef]
  24. Man, C.K.; Quddus, M.; Theofilatos, A.; Yu, R.; Imprialou, M. Wasserstein Generative Adversarial Network to Address the Imbalanced Data Problem in Real-Time Crash Risk Prediction. IEEE Trans. Intell. Transp. Syst. 2022, 23, 23002–23013. [Google Scholar] [CrossRef]
  25. Chen, J.; Yan, Z.; Lin, C.; Yao, B.; Ge, H. Aero-Engine High Speed Bearing Fault Diagnosis for Data Imbalance: A Sample Enhanced Diagnostic Method Based on Pre-Training WGAN-GP. Measurement 2023, 213, 112709. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Sun, B.; Xiao, Y.; Xiao, R.; Wei, Y. Feature Augmentation for Imbalanced Classification with Conditional Mixture WGANs. Signal Process. Image Commun. 2019, 75, 89–99. [Google Scholar] [CrossRef]
  27. Tirel, L.; Ali, A.M.; Hashim, H.A. Novel Hybrid Integrated Pix2Pix and WGAN Model with Gradient Penalty for Binary Images Denoising. Syst. Soft Comput. 2024, 6, 200122. [Google Scholar] [CrossRef]
  28. Mi, J.; Ma, C.; Zheng, L.; Zhang, M.; Li, M.; Wang, M. WGAN-CL: A Wasserstein GAN with Confidence Loss for Small-Sample Augmentation. Expert Syst. Appl. 2023, 233, 120943. [Google Scholar] [CrossRef]
  29. Wang, L.; Lyu, P.; Lin, Y. Traffic Accidents on Freeways: Influencing Factors Analysis and Injury Severity Evaluation. China Saf. Sci. J. 2016, 26, 86–90. [Google Scholar] [CrossRef]
  30. Yang, W.; Xie, B.; Fang, R.; Qin, Y. Comparative Analysis and Prediction of Motor Vehicle Crash Severity on Mountainous Two-lane Highways. J. Transp. Syst. Eng. Inf. Technol. 2021, 21, 190–195. [Google Scholar] [CrossRef]
  31. Rejali, S.; Aghabayk, K.; Seyfi, M.; Oviedo-Trespalacios, O. Assessing Distracted Driving Crash Severities at New York City Urban Roads: A Temporal Analysis Using Random Parameters Logit Model. IATSS Res. 2024, 48, 147–157. [Google Scholar] [CrossRef]
  32. Li, Z.; Wang, C.; Liao, H.; Li, G.; Xu, C. Efficient and Robust Estimation of Single-Vehicle Crash Severity: A Mixed Logit Model with Heterogeneity in Means and Variances. Accid. Anal. Prev. 2024, 196, 107446. [Google Scholar] [CrossRef]
  33. Santos, K.; Dias, J.P.; Amado, C. A Literature Review of Machine Learning Algorithms for Crash Injury Severity Prediction. J. Saf. Res. 2022, 80, 254–269. [Google Scholar] [CrossRef] [PubMed]
  34. Alhaek, F.; Liang, W.; Rajeh, T.M.; Javed, M.H.; Li, T. Learning Spatial Patterns and Temporal Dependencies for Traffic Accident Severity Prediction: A Deep Learning Approach. Knowl.-Based Syst. 2024, 286, 111406. [Google Scholar] [CrossRef]
  35. Mokhtarimousavi, S.; Anderson, J.C.; Azizinamini, A.; Hadi, M. Factors Affecting Injury Severity in Vehicle-Pedestrian Crashes: A Day-of-Week Analysis Using Random Parameter Ordered Response Models and Artificial Neural Networks. Int. J. Transp. Sci. Technol. 2020, 9, 100–115. [Google Scholar] [CrossRef]
  36. Hosseinzadeh, A.; Moeinaddini, A.; Ghasemzadeh, A. Investigating Factors Affecting Severity of Large Truck-Involved Crashes: Comparison of the SVM and Random Parameter Logit Model. J. Saf. Res. 2021, 77, 151–160. [Google Scholar] [CrossRef] [PubMed]
  37. Lyu, P.; Bai, Q.; Chen, L. A Model for Predicting the Severity of Accidents on Mountainous Expressways Based on Deep Inverted Residuals and Attention Mechanisms. China J. Highw. Transp. 2021, 34, 205–213. [Google Scholar] [CrossRef]
  38. Zhou, B.; Zhou, Q.; Li, Z. Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods. IEEE Access 2025, 13, 2929–2944. [Google Scholar] [CrossRef]
  39. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 52. [Google Scholar]
  40. Fountas, G.; Fonzone, A.; Gharavi, N.; Rye, T. The Joint Effect of Weather and Lighting Conditions on Injury Severities of Single-Vehicle Accidents. Anal. Methods Accid. Res. 2020, 27, 100124. [Google Scholar] [CrossRef]
  41. Haq, M.T.; Zlatkovic, M.; Ksaibati, K. Investigating Occupant Injury Severity of Truck-Involved Crashes Based on Vehicle Types on a Mountainous Freeway: A Hierarchical Bayesian Random Intercept Approach. Accid. Anal. Prev. 2020, 144, 105654. [Google Scholar] [CrossRef]
  42. Boylan, J.; Meyer, D.; Chen, W.S. A Systematic Review of the Use of In-Vehicle Telematics in Monitoring Driving Behaviours. Accid. Anal. Prev. 2024, 199, 107519. [Google Scholar] [CrossRef]
  43. Gao, J.; Yang, D.; Xu, C.; Ozbay, K.; Sharma, S. Assessing the Impact of Fixed Speed Cameras on Speeding Behavior and Crashes: A Longitudinal Study in New York City. Transp. Res. Interdiscip. Perspect. 2025, 30, 101373. [Google Scholar] [CrossRef]
  44. He, C.; Xu, P.; Pei, X.; Wang, Q.; Yue, Y.; Han, C. Fatigue at the Wheel: A Non-Visual Approach to Truck Driver Fatigue Detection by Multi-Feature Fusion. Accid. Anal. Prev. 2024, 199, 107511. [Google Scholar] [CrossRef] [PubMed]
  45. Zhang, C.; Ma, Y.; Chen, S.; Zhang, J.; Xing, G. Exploring the Occupational Fatigue Risk of Short-Haul Truck Drivers: Effects of Sleep Pattern, Driving Task, and Time-on-Task on Driving Behavior and Eye-Motion Metrics. Transp. Res. Part F Traffic Psychol. Behav. 2024, 100, 37–56. [Google Scholar] [CrossRef]
  46. Payyanadan, R.; Domeyer, J.; Angell, L.; Sayer, T. Naturalistic Driving Analysis of Situational, Behavioral, and Psychosocial Determinants of Speeding. Accid. Anal. Prev. 2024, 207, 107751. [Google Scholar] [CrossRef]
  47. Yannis, G.; Michelaraki, E. Effectiveness of 30 Km/h Speed Limit—A Literature Review. J. Saf. Res. 2024, 92, 490–503. [Google Scholar] [CrossRef]
  48. Su, Z.; Woodman, R.; Smyth, J.; Elliott, M. The Relationship between Aggressive Driving and Driver Performance: A Systematic Review with Meta-Analysis. Accid. Anal. Prev. 2023, 183, 106972. [Google Scholar] [CrossRef] [PubMed]
  49. Crizzle, A.M.; Malkin, J.; Zello, G.A.; Toxopeus, R.; Bigelow, P.; Shubair, M. Impact of Electronic Logging Devices on Fatigue and Work Environment in Canadian Long-Haul Truck Drivers. J. Transp. Health 2022, 24, 101295. [Google Scholar] [CrossRef]
  50. Zhang, G.; Yau, K.K.W.; Zhang, X.; Li, Y. Traffic Accidents Involving Fatigue Driving and Their Extent of Casualties. Accid. Anal. Prev. 2016, 87, 34–42. [Google Scholar] [CrossRef]
  51. Li, M.K.; Yu, J.J.; Ma, L.; Zhang, W. Modeling and Mitigating Fatigue-Related Accident Risk of Taxi Drivers. Accid. Anal. Prev. 2019, 123, 79–87. [Google Scholar] [CrossRef]
  52. Duarte Soliani, R.; Vinicius Brito Lopes, A.; Santiago, F.; da Silva, L.B.; Emekwuru, N.; Carolina Lorena, A. Risk of Crashes among Self-Employed Truck Drivers: Prevalence Evaluation Using Fatigue Data and Machine Learning Prediction Models. J. Saf. Res. 2025, 92, 68–80. [Google Scholar] [CrossRef]
  53. McCall, C.A. Moving towards a More Naturalistic Approach to Evaluating Drowsy Driving Risk. Sleep 2024, 47, zsae139. [Google Scholar] [CrossRef]
  54. Bharadwaj, N.; Edara, P.; Sun, C. Sleep Disorders and Risk of Traffic Crashes: A Naturalistic Driving Study Analysis. Saf. Sci. 2021, 140, 105295. [Google Scholar] [CrossRef]
  55. Saleem, S. Risk Assessment of Road Traffic Accidents Related to Sleepiness during Driving: A Systematic Review. East. Mediterr. Health J. 2022, 28, 695–700. [Google Scholar] [CrossRef] [PubMed]
  56. Vergara, E.; Aviles-Ordonez, J.; Xie, Y.; Shirazi, M. Understanding Speeding Behavior on Interstate Horizontal Curves and Ramps Using Networkwide Probe Data. J. Saf. Res. 2024, 90, 371–380. [Google Scholar] [CrossRef]
  57. Soliani, R.D.; Argoud, A.R.T.T.; Santiago, F.; Lopes, A.V.B.; Emekwuru, N. Catastrophic Causes of Truck Drivers’ Crashes on Brazilian Highways: Mixed Method Analyses and Crash Prediction Using Machine Learning. Multimodal Transp. 2024, 3, 100173. [Google Scholar] [CrossRef]
  58. Davidović, J.; Pešić, D.; Antić, B. Professional Drivers’ Fatigue as a Problem of the Modern Era. Transp. Res. Part F Traffic Psychol. Behav. 2018, 55, 199–209. [Google Scholar] [CrossRef]
  59. Rashmi, B.S.; Marisamynathan, S. An Investigation of Relationships between Aberrant Driving Behavior and Crash Risk among Long-Haul Truck Drivers Traveling across India: A Structural Equation Modeling Approach. J. Transp. Health 2024, 38, 101871. [Google Scholar] [CrossRef]
  60. van Erpecum, C.-P.L.; Bornioli, A.; Cleland, C.; Jones, S.; Davis, A.; den Braver, N.R.; Pilkington, P. 20mph Speed Limits and Zones for Better Public Health: Meta-Narrative Evidence Synthesis. J. Transp. Health 2024, 39, 101917. [Google Scholar] [CrossRef]
  61. Amiri, A.M.; Sadri, A.; Nadimi, N.; Shams, M. A Comparison between Artificial Neural Network and Hybrid Intelligent Genetic Algorithm in Predicting the Severity of Fixed Object Crashes among Elderly Drivers. Accid. Anal. Prev. 2020, 138, 105468. [Google Scholar] [CrossRef]
  62. Almasi, S.A.; Behnood, H.R. Exposure Based Geographic Analysis Mode for Estimating the Expected Pedestrian Crash Frequency in Urban Traffic Zones; Case Study of Tehran. Accid. Anal. Prev. 2022, 168, 106576. [Google Scholar] [CrossRef] [PubMed]
  63. Zhou, T.; Zhang, J. Analysis of Commercial Truck Drivers’ Potentially Dangerous Driving Behaviors Based on 11-Month Digital Tachograph Data and Multilevel Modeling Approach. Accid. Anal. Prev. 2019, 132, 105256. [Google Scholar] [CrossRef]
  64. Smits, E.; Brakenridge, C.; Gane, E.; Warren, J.; Heron-Delaney, M.; Kenardy, J.; Johnston, V. Identifying Risk of Poor Physical and Mental Health Recovery Following a Road Traffic Crash: An Industry-Specific Screening Tool. Accid. Anal. Prev. 2019, 132, 105280. [Google Scholar] [CrossRef]
  65. Alam, M.S.; Tabassum, N.J. Spatial Pattern Identification and Crash Severity Analysis of Road Traffic Crash Hot Spots in Ohio. Heliyon 2023, 9, e16303. [Google Scholar] [CrossRef] [PubMed]
  66. Wu, Y.; Zhang, H.; Song, P.; Sun, X.; Meng, J.; Ma, J.; Gao, L. Developing an XGBoost Based Model to Predict the Probability of Truck Crashes Driven by Macro Operation and Insurance Data. Traffic Inj. Prev. 2025, 1–10. [Google Scholar] [CrossRef] [PubMed]
  67. Wu, Y.; Yang, A.; Liu, F.; Cui, Q. An Integrated Method for Risk Assessment and Diagnosis of Bus Drivers Driven by Unbalanced Data. J. Transp. Saf. Secur. 2025, 17, 1503–1533. [Google Scholar] [CrossRef]
  68. Zhang, N.; Wu, Y.; Yang, A.; Liu, T. Development of a Bus Real-Time Crash Risk Prediction Framework by Using a Self-Attention-Based Bidirectional Long and Short-Term Memory Network with Anomaly Detection Learning and Mixed Sequence Features. Accid. Anal. Prev. 2026, 225, 108306. [Google Scholar] [CrossRef]
Figure 1. Framework of the Improved Generative Adversarial Network.
Figure 2. Relationship between sample size and model accuracy.
Figure 3. Relationship between the remaining number of features and cumulative contribution rate.
Figure 4. Sankey Diagram of Driving States and Crashes’ Economic Loss Levels.
Figure 5. Bar Chart of Marginal Effects of Influencing Factors on Crashes with Different Economic Loss Levels.
Figure 6. Marginal Effects of Different Driving States on Crashes with Different Economic Loss Levels (Straight Driving as the Baseline Reference).
Figure 7. Stacked Area Chart of Marginal Effects (Based on Absolute Values) of Influencing Factors of Crashes with Different Economic Loss Levels.
Figure 8. Relationship between main influencing factors and crashes with different economic losses.
Table 1. Summary of Variables.

| Dimensions | Specific Influencing Factors | Explanations |
|---|---|---|
| | Empty Mileage Ratio | The proportion of mileage driven without any cargo load |
| | Loaded Mileage Ratio | The proportion of mileage driven with partial cargo load |
| | Full Load Mileage Ratio | The proportion of mileage driven with cargo load reaching full capacity threshold |
| | Empty Duration Ratio | The proportion of total driving time spent in an empty-load state |
| | Loaded Duration Ratio | The proportion of total driving time spent in a partially loaded state |
| | Full Load Duration Ratio | The proportion of total driving time spent in a full-load state |
| | Empty Count Ratio | The number of trips conducted without cargo, divided by total number of trips |
| | Loaded Count Ratio | The number of trips with partial cargo load, divided by total number of trips |
| | Full Load Count Ratio | The number of trips with full cargo load, divided by total number of trips |
| | Long-distance Mileage Ratio | The proportion of mileage from trips longer than 500 km |
| | Medium-distance Mileage Ratio | The proportion of mileage from trips between 200 km and 500 km |
| | Short-distance Mileage Ratio | The proportion of mileage from trips shorter than 200 km |
| | Long-distance Duration Ratio | The proportion of driving time spent on trips over 500 km |
| | Medium-distance Duration Ratio | The proportion of driving time spent on trips between 200 km and 500 km |
| | Short-distance Duration Ratio | The proportion of driving time spent on trips shorter than 200 km |
| | Long-distance Count Ratio | The number of long-distance trips divided by total number of trips |
| | Medium-distance Count Ratio | The number of medium-distance trips divided by total number of trips |
| | Short-distance Count Ratio | The number of short-distance trips divided by total number of trips |
| | Inter-provincial Mileage Ratio | The proportion of driving mileage on roads connecting different provinces |
| | Inter-city Mileage Ratio | The proportion of driving mileage on roads connecting different cities |
| | Inter-county Mileage Ratio | The proportion of driving mileage on roads connecting different counties |
| | Intra-county Mileage Ratio | The proportion of driving mileage on roads within a single county |
| | Inter-provincial Duration Ratio | The total driving time spent on inter-provincial roads |
| | Inter-city Duration Ratio | The total driving time spent on inter-city roads |
| | Inter-county Duration Ratio | The total driving time spent on inter-county roads |
| | Intra-county Duration Ratio | The total driving time spent within a county |
| | Inter-provincial Count Ratio | The number of inter-provincial trips divided by total number of trips |
| | Inter-city Count Ratio | The number of inter-city trips divided by total number of trips |
| | Inter-county Count Ratio | The number of inter-county trips divided by total number of trips |
| | Intra-county Count Ratio | The number of intra-county trips divided by total number of trips |
| | Morning Mileage Ratio | The proportion of driving mileage between 05:00 and 08:00 |
| | Dusk Mileage Ratio | The proportion of driving mileage between 17:00 and 19:00 |
| | Early Night Mileage Ratio | The proportion of driving mileage between 19:00 and 00:00 |
| | Late Night Mileage Ratio | The proportion of driving mileage between 00:00 and 05:00 |
| | Daytime Mileage Ratio | The proportion of driving mileage between 08:00 and 17:00 |
| | Morning Duration Ratio | The total driving time between 05:00 and 08:00 |
| | Dusk Duration Ratio | The total driving time between 17:00 and 19:00 |
| | Early Night Duration Ratio | The total driving time between 19:00 and 00:00 |
| | Late Night Duration Ratio | The total driving time between 00:00 and 05:00 |
| | Daytime Duration Ratio | The total driving time between 08:00 and 17:00 |
| | Morning Count Ratio | The number of trips initiated between 05:00 and 08:00, divided by total trips |
| | Dusk Count Ratio | The number of trips initiated between 17:00 and 19:00, divided by total trips |
| | Early Night Count Ratio | The number of trips initiated between 19:00 and 00:00, divided by total trips |
| | Late Night Count Ratio | The number of trips initiated between 00:00 and 05:00, divided by total trips |
| | Daytime Count Ratio | The number of trips initiated between 08:00 and 17:00, divided by total trips |
| | Fatigue Mileage Ratio | The proportion of mileage driven after 4 consecutive hours without rest |
| | Speeding Mileage Ratio | The proportion of mileage where speed exceeds the legal limit |
| | Overload Mileage Ratio | The proportion of mileage with cargo weight exceeding regulatory limit |
| | Fatigue Duration Ratio | The total driving time occurring after 4 h of continuous driving |
| | Speeding Duration Ratio | The total driving time during which speed exceeded legal limits |
| | Overload Duration Ratio | The total driving time with overload condition |
| | Fatigue Count Ratio | The number of trips with fatigue events divided by total number of trips |
| | Speeding Count Ratio | The number of trips with speeding events divided by total number of trips |
| | Overload Count Ratio | The number of overload events divided by the total number of trips |
| | Unfamiliar Road Coefficient | Annual proportion of unique highway trips to total highway trips |
Table 2. Summary of Variables with VIF Less Than 10.

| Variable | VIF | Variable | VIF | Variable | VIF | Variable | VIF |
|---|---|---|---|---|---|---|---|
| Inter-provincial Duration Ratio | 9.37 | Late Night Duration Ratio | 5.98 | Fatigue Count Ratio | 3.53 | Loaded Duration Ratio | 1.95 |
| Speeding Duration Ratio | 8.81 | Intra-county Duration Ratio | 5.7 | Early Night Duration Ratio | 3.05 | Inter-city Count Ratio | 1.87 |
| Inter-provincial Count Ratio | 8.2 | Intra-county Count Ratio | 5.6 | Fatigue Mileage Ratio | 2.62 | NCD coefficient | 1.77 |
| Speeding Count Ratio | 7.48 | Morning Count Ratio | 4.65 | Tonnage | 2.43 | CarAge | 1.67 |
| Daytime Count Ratio | 6.82 | Morning Mileage Ratio | 4.35 | Empty Count Ratio | 2.42 | Full Load Count Ratio | 1.49 |
| Dusk Mileage Ratio | 6.77 | Fatigue Duration Ratio | 3.74 | Pre-Collision Driving States | 2.29 | Unfamiliar Road Coefficient | 1.37 |
| Dusk Count Ratio | 6.43 | | | | | | |
Table 3. Performance of these two models for predicting crashes with various economic loss levels.

| Model Comparison Items | Generalized Ordered Logit Model | Random Effects Generalized Ordered Logit Model |
|---|---|---|
| AIC | 168.48 | 278.18 |
| BIC | 359.78 | 377.66 |
| Classification accuracy for minor claims | 69.03% | 88.50% |
| Classification accuracy for general claims | 92.04% | 92.92% |
| Classification accuracy for major claims | 89.23% | 91.15% |
| Overall classification accuracy | 81.48% | 90.86% |
Table 4. Model Regression Results and Significance Analysis.

| Variable | Coefficient | Std. Err. | z | p > \|z\| |
|---|---|---|---|---|
| Inter-provincial Duration Ratio | 2.382 ** | 0.867 | 2.75 | 0.006 |
| Speeding Duration Ratio | 2.546 *** | 0.643 | 3.96 | 0.000 |
| Inter-provincial Count Ratio | −2.006 * | 0.797 | −2.52 | 0.012 |
| Speeding Count Ratio | −2.473 ** | 0.721 | −3.43 | 0.001 |
| Morning Count Ratio | −1.258 * | 0.568 | −2.22 | 0.027 |
| Fatigue Count Ratio | −16.053 * | 6.604 | −2.43 | 0.015 |
| Fatigue Mileage Ratio | 14.543 ** | 5.431 | 2.68 | 0.007 |
*** Statistically significant at α = 0.001. ** Statistically significant at α = 0.01. * Statistically significant at α = 0.05.
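The z statistics and p-values in Table 4 follow the usual Wald test, z = coefficient / standard error, with a two-sided p-value from the standard normal distribution. The sketch below recomputes them from the published coefficients and standard errors. Because those inputs are already rounded to three decimals, a recomputed z can differ from the reported one in the last digit. The function names are illustrative.

```python
import math

# (variable, coefficient, standard error, reported z, reported p) from Table 4
TABLE_4 = [
    ("Inter-provincial Duration Ratio", 2.382, 0.867, 2.75, 0.006),
    ("Speeding Duration Ratio", 2.546, 0.643, 3.96, 0.000),
    ("Inter-provincial Count Ratio", -2.006, 0.797, -2.52, 0.012),
    ("Speeding Count Ratio", -2.473, 0.721, -3.43, 0.001),
    ("Morning Count Ratio", -1.258, 0.568, -2.22, 0.027),
    ("Fatigue Count Ratio", -16.053, 6.604, -2.43, 0.015),
    ("Fatigue Mileage Ratio", 14.543, 5.431, 2.68, 0.007),
]


def wald_z(coefficient: float, std_err: float) -> float:
    """Wald z statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / std_err


def two_sided_p(z: float) -> float:
    """Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))


def stars(p: float) -> str:
    """Significance markers matching the table footnote."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


for name, coef, se, z_reported, p_reported in TABLE_4:
    z = wald_z(coef, se)
    p = two_sided_p(z)
    print(f"{name:<32} z = {z:6.2f} (reported {z_reported:6.2f})  "
          f"p = {p:.3f} (reported {p_reported:.3f}) {stars(p)}")
```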
Table 5. Marginal Effects of Different Risk Factors on the Economic Loss Levels of Crashes in Different Pre-Collision Driving States.
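| Variable | Economic Loss Level | Straight Driving | p-Value | Turning | p-Value | Reversing | p-Value | Rolling | p-Value | Close Following | p-Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Inter-provincial Duration Ratio | 0 | −0.135 * | 0.019 | −0.135 * | 0.02 | −0.135 * | 0.02 | −0.135 * | 0.021 | −0.135 * | 0.022 |
| | 1 | 0.096 * | 0.025 | 0.096 * | 0.027 | 0.096 * | 0.033 | 0.096 * | 0.043 | 0.096 | 0.077 |
| | 2 | 0.039 | 0.053 | 0.039 | 0.054 | 0.039 | 0.068 | 0.039 | 0.094 | 0.039 | 0.178 |
| Speeding Duration Ratio | 0 | −0.311 *** | 0.000 | −0.311 *** | 0.000 | −0.312 *** | 0.000 | −0.311 *** | 0.000 | −0.312 *** | 0.000 |
| | 1 | 0.221 *** | 0.000 | 0.221 *** | 0.000 | 0.221 *** | 0.000 | 0.221 *** | 0.000 | 0.222 ** | 0.004 |
| | 2 | 0.091 *** | 0.001 | 0.090 *** | 0.001 | 0.090 ** | 0.007 | 0.090 * | 0.026 | 0.089 | 0.119 |
| Inter-provincial Count Ratio | 0 | 0.113 * | 0.044 | 0.113 * | 0.044 | 0.113 * | 0.044 | 0.113 * | 0.045 | 0.113 * | 0.047 |
| | 1 | −0.080 | 0.051 | −0.080 | 0.053 | −0.080 | 0.059 | −0.080 | 0.07 | −0.080 | 0.103 |
| | 2 | −0.033 | 0.08 | −0.033 | 0.083 | −0.033 | 0.098 | −0.033 | 0.125 | −0.032 | 0.207 |
| Speeding Count Ratio | 0 | 0.290 *** | 0.000 | 0.290 *** | 0.000 | 0.290 *** | 0.000 | 0.290 *** | 0.000 | 0.291 *** | 0.000 |
| | 1 | −0.206 *** | 0.000 | −0.206 *** | 0.000 | −0.206 *** | 0.000 | −0.207 *** | 0.000 | −0.207 ** | 0.005 |
| | 2 | −0.085 *** | 0.001 | −0.084 ** | 0.003 | −0.084 ** | 0.01 | −0.084 * | 0.032 | −0.083 | 0.128 |
| Morning Count Ratio | 0 | 0.049 * | 0.024 | 0.049 * | 0.024 | 0.049 * | 0.024 | 0.049 * | 0.025 | 0.049 * | 0.026 |
| | 1 | −0.035 * | 0.03 | −0.035 * | 0.031 | 0.049 * | 0.036 | −0.035 * | 0.044 | −0.035 | 0.075 |
| | 2 | −0.014 | 0.058 | −0.014 | 0.062 | −0.014 | 0.079 | −0.014 | 0.108 | −0.014 | 0.195 |
| Fatigue Count Ratio | 0 | 0.533 ** | 0.002 | 0.533 ** | 0.002 | 0.534 ** | 0.002 | 0.534 ** | 0.002 | 0.535 ** | 0.003 |
| | 1 | −0.686 ** | 0.005 | −0.688 ** | 0.007 | −0.689 * | 0.012 | −0.691 * | 0.021 | −0.695 | 0.055 |
| | 2 | −0.446 ** | 0.01 | −0.445 ** | 0.009 | −0.444 * | 0.015 | −0.443 * | 0.033 | −0.441 | 0.114 |
| Fatigue Mileage Ratio | 0 | −0.777 *** | 0.001 | −0.778 *** | 0.001 | −0.778 *** | 0.001 | −0.778 *** | 0.001 | −0.779 ** | 0.002 |
| | 1 | 0.834 ** | 0.004 | 0.836 ** | 0.005 | 0.837 ** | 0.01 | 0.838 * | 0.018 | 0.841 * | 0.05 |
| | 2 | 0.343 ** | 0.009 | 0.342 ** | 0.008 | 0.341 * | 0.014 | 0.340 * | 0.032 | 0.339 | 0.113 |
| Empty Count Ratio | 0 | 0.085 * | 0.025 | 0.085 * | 0.026 | 0.085 * | 0.027 | 0.085 * | 0.028 | 0.085 * | 0.03 |
| | 1 | −0.060 * | 0.031 | −0.061 * | 0.035 | −0.061 * | 0.043 | −0.061 | 0.055 | −0.061 | 0.092 |
| | 2 | −0.025 | 0.06 | −0.025 | 0.059 | −0.025 | 0.071 | −0.025 | 0.095 | −0.0245 | 0.175 |
| NCD coefficient | 0 | 0.080 *** | 0.001 | 0.080 *** | 0.001 | 0.080 *** | 0.001 | 0.080 *** | 0.001 | 0.081 *** | 0.001 |
| | 1 | −0.057 *** | 0.001 | −0.057 *** | 0.001 | −0.057 ** | 0.002 | −0.057 ** | 0.005 | −0.057 * | 0.024 |
| | 2 | −0.023 * | 0.019 | −0.023 * | 0.022 | −0.023 * | 0.037 | −0.023 | 0.065 | −0.023 | 0.159 |
| CarAge | 0 | 0.069 ** | 0.004 | 0.069 ** | 0.004 | 0.069 ** | 0.004 | 0.069 ** | 0.005 | 0.069 ** | 0.006 |
| | 1 | −0.049 ** | 0.006 | −0.049 ** | 0.008 | −0.049 * | 0.012 | −0.049 * | 0.019 | −0.049 * | 0.05 |
| | 2 | −0.020 * | 0.029 | −0.020 * | 0.03 | −0.020 * | 0.042 | −0.020 | 0.066 | −0.019 | 0.153 |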
*** Statistically significant at α = 0.001. ** Statistically significant at α = 0.01. * Statistically significant at α = 0.05.
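To show how marginal effects on three ordered loss levels arise, the following is a minimal sketch for the standard (proportional-odds) ordered logit. This is a simpler special case than the random-effects generalized ordered logit estimated in the study. The cutpoints and the linear predictor value used in the example are hypothetical and are not estimates from the paper. Only the coefficient 2.546 (Speeding Duration Ratio, Table 4) and the three Table 5 rows are taken from the source.

```python
import math
from typing import Tuple


def logistic_cdf(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def logistic_pdf(t: float) -> float:
    c = logistic_cdf(t)
    return c * (1.0 - c)


def level_probabilities(eta: float, cut1: float, cut2: float) -> Tuple[float, float, float]:
    """P(level 0), P(level 1), P(level 2) for linear predictor eta."""
    p0 = logistic_cdf(cut1 - eta)
    p2 = 1.0 - logistic_cdf(cut2 - eta)
    return p0, 1.0 - p0 - p2, p2


def marginal_effects(beta: float, eta: float, cut1: float, cut2: float) -> Tuple[float, float, float]:
    """Derivative of each level probability with respect to the covariate."""
    d0 = -beta * logistic_pdf(cut1 - eta)
    d1 = beta * (logistic_pdf(cut1 - eta) - logistic_pdf(cut2 - eta))
    d2 = beta * logistic_pdf(cut2 - eta)
    return d0, d1, d2


# Hypothetical evaluation point: cutpoints -0.5 and 1.5, linear predictor 0.2.
effects = marginal_effects(beta=2.546, eta=0.2, cut1=-0.5, cut2=1.5)
print("marginal effects:", [round(e, 3) for e in effects])
print("sum:", round(sum(effects), 12))

# Straight-driving marginal effects (levels 0, 1, 2) copied from Table 5.
TABLE_5_STRAIGHT = {
    "Inter-provincial Duration Ratio": (-0.135, 0.096, 0.039),
    "Speeding Duration Ratio": (-0.311, 0.221, 0.091),
    "Speeding Count Ratio": (0.290, -0.206, -0.085),
}
for name, row in TABLE_5_STRAIGHT.items():
    print(f"{name:<32} sum of effects = {sum(row):+.3f}")
```

Because the three probabilities always sum to one, the marginal effects in this sketch sum to zero, and a positive coefficient lowers the probability of level 0 while raising that of level 2. The three Table 5 rows listed in the code show the same pattern to within rounding.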
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Song, P.; Wu, Y.; Zhang, H.; Rong, J.; Zhang, N.; Ma, J.; Sun, X. An Empirical Analysis of Running-Behavior Influencing Factors for Crashes with Different Economic Losses. Urban Sci. 2026, 10, 45. https://doi.org/10.3390/urbansci10010045

