ReMAHA–CatBoost: Addressing Imbalanced Data in Traffic Accident Prediction Tasks



Introduction
Road traffic accidents pose significant economic and medical burdens globally, often leading to catastrophic family tragedies and accounting for approximately 1.3 million fatalities annually. However, as highlighted by the World Health Organization (WHO), injuries caused by traffic accidents can be prevented [1,2]. With the rapid advancement of machine learning, the field of traffic accident prediction has demonstrated tremendous potential: it not only enhances the accuracy of accident prediction but also allows for real-time response to traffic conditions, reducing accident risks. Existing research has primarily focused on whether traffic accidents will occur, often neglecting the further classification of potential accidents. Ideally, all potential traffic accidents should receive significant attention, with resources mobilized to prevent their occurrence. However, due to limited resources and the need for multi-agency coordination, it is often challenging to address all traffic accidents promptly. Therefore, it is essential to give additional attention to the possibility of severe traffic accidents and to take the necessary measures to prevent their occurrence or prepare for their management in advance.
Simultaneously, traffic accidents typically exhibit characteristics of imbalance: the number of accident samples is far smaller than the number of normal traffic samples [3], the number of accidents in high-risk areas is significantly higher than in low-risk areas [4], and the sample proportions of different accident types are imbalanced. In particular, the number of severe events is much lower than that of common events. Machine learning models tend to learn this prior information about the class proportions in the training set, which biases them toward the majority class during prediction (possibly yielding better accuracy for common accidents but poorer accuracy for severe ones). Unfortunately, previous work has often overlooked the probability of occurrence of these minority-class accidents, even though severe traffic accidents in particular deserve more attention.
Traffic accident datasets contain a considerable number of Boolean features, such as points of interest (POI), streets, wind direction, and weather, so feature sparsity remains an issue in traffic accident prediction. To address the aforementioned issues, this paper introduces the ReMAHA-CatBoost traffic accident prediction model, which aims to classify potential traffic accidents and to resolve the prediction inaccuracies induced by imbalanced sample quantities and high-dimensional, sparse features. The model incorporates feature selection and clustering into the oversampling process to mitigate the tendency of traditional genetic algorithms to generate overly divergent data. It also addresses the difficulty of computing the Mahalanobis distance during oversampling when features are sparse. Additionally, the weight matrix obtained with the relief-F algorithm not only supports weighted Mahalanobis distance computation, increasing the influence of feature importance on sampling, but also serves for feature selection on the original data to avoid the curse of dimensionality and overfitting. The contributions of this paper can be summarized as follows:

• We introduce ReMAHA-CatBoost, a traffic accident prediction model based on CatBoost [5]. ReMAHA-CatBoost leverages the relief-F algorithm to obtain feature importance and MAHAKIL to avoid generating data with a centralized distribution, effectively alleviating the imbalance issue in traffic accident data;
• To address the fact that the Mahalanobis distance computation in MAHAKIL does not consider feature importance, we employ the relief-F algorithm to obtain a diagonal matrix representing feature importance and perform weighted Mahalanobis distance calculations;
• To tackle the problems of generating overly divergent data and handling the large overall sample size in the classical MAHAKIL algorithm, we use mini-batch K-means to cluster the samples before data generation;
• We leverage the advantage of CatBoost in reducing gradient bias, enhancing the model's generalization capabilities;
• We evaluate ReMAHA-CatBoost against three other sampling algorithms combined with CatBoost, as well as ReMAHA combined with four other prediction models, on the US-Accidents dataset. The results demonstrate that ReMAHA-CatBoost outperforms the other models on imbalanced traffic accident data, highlighting its generalization ability and effectiveness in the traffic accident domain.
The structure of this paper is outlined as follows: Section 2 reviews pertinent research on traffic accident prediction and approaches to data imbalance. Section 3 describes the dataset used and the architecture of ReMAHA-CatBoost. Section 4 presents a detailed analysis of the experimental results. Section 5 discusses the research findings. Finally, Section 6 summarizes the work and the real-world contributions of this study.

Related Work
In the field of traffic accident research, accurate accident prediction contributes to the identification of traffic hazards, the optimization of traffic system resource allocation, and the timely provision of medical assistance. There have been many excellent research achievements in the domain of traffic accident prediction [6][7][8], and they have played a crucial role in revealing the mechanisms behind traffic accident occurrences [9]. For example, Nur et al. [10] used the least-squares method to analyze the relationship between environmental factors and accidents, identifying correlations between rainfall, temperature, wind speed, and accidents. Li et al. [11] conducted association rule mining and classification studies, revealing that physiological factors such as alcohol consumption are more likely to lead to fatalities in accidents, while natural environmental factors such as weather have a relatively smaller impact on the fatality rate. Wang et al. [12] employed the Apriori algorithm based on association rules to explore influential factors in traffic accidents and investigate strong association rules among various causal factors.
Through our investigation, we found that existing research often focuses on the causes and underlying factors of accidents, using this information to predict whether traffic accidents will happen. Furthermore, most studies conduct single-dimensional analyses, allowing only shallow data analysis and making it difficult to express the spatiotemporal correlations in traffic accidents [13]. Moreover, these studies do not further classify traffic accidents. By offering only the set of conditions that may trigger an accident, such coarse-grained information contributes little to the traffic system.
Simultaneously, due to the phenomenon of imbalanced samples in historical traffic accident data, especially where the number of severe traffic accidents is significantly lower than that of common accidents, traditional prediction models demonstrate poor performance. There are also many outstanding works addressing the common problem of sample imbalance, which have improved prediction accuracy for minority-class samples. Broadly, their approaches fall into two directions: data-level and algorithm-level [14,15]. Data-level approaches include oversampling, undersampling, and hybrid sampling, while algorithm-level approaches involve ensemble learning and cost-sensitive algorithms. Specifically, they are as follows: (1) Undersampling Methods: Undersampling methods begin with the majority-class data and balance the dataset by removing samples from the larger class. For example, Dai et al. [16] started by eliminating duplicate samples and expanded the detection range of the Tomek-link undersampling algorithm by introducing a global re-labeling index. Wei et al. [17], focusing on data complexity, proposed an undersampling algorithm based on weighted complexity, WCP-UnderSampler, which achieved promising results on defect prediction datasets. However, undersampling can lead to information loss in the majority class and affect classifier generalization.
(2) Oversampling Methods: Oversampling methods introduce new minority-class data to create a superset of minority-class samples, reducing the imbalance between data categories [18]. For instance, SMOTE is a classical oversampling algorithm for addressing imbalanced data [19]. Gao et al. [20] proposed an improved SMOTE oversampling algorithm based on ant clustering, addressing both inter-cluster and intra-cluster data imbalance. Bennin et al. [21] introduced genetic chromosome theory into the sampling domain, presenting the MAHAKIL algorithm, which ensures that the generated data inherit the features of parent instances; it outperforms other oversampling methods on multiple datasets. However, oversampling methods can lead to data overlap and distribution-marginalization issues, potentially trapping the algorithm in local optima.
(3) Mixed Sampling: Mixed sampling combines both undersampling and oversampling techniques to balance the data. For example, Wang et al. [22] synthesized minority-class and majority-class samples separately using a generative adversarial network (GAN) and SMOTE. They compared this dual oversampling strategy to a single oversampling approach targeting the minority class only, and found that the dual strategy outperformed the single one. While mixed sampling can alleviate the information loss of undersampling and the overfitting of oversampling, research suggests that the order in which undersampling and oversampling are executed can influence predictive accuracy [23].
(4) Ensemble Learning: Different from traditional individual learners, ensemble learning combines multiple weak learners to create a strong learner [24]. The two most typical forms of ensemble learning, based on the composition of base learners, are bagging and boosting. Navaneeth et al. [25] implemented a hybrid ensemble model combining a CNN with CatBoost, providing a new approach to non-invasive COVID-19 detection. Yan et al. [26] combined undersampling with ensemble learning and proposed a spatial undersampling model for local pattern learning. Because ensemble methods aim to enhance overall accuracy, they struggle to work effectively on imbalanced classification problems alone [27] and often need to be used in conjunction with other algorithms.
(5) Cost-Sensitive Learning: A cost-sensitive imprecise classification decision tree has been introduced that accounts for error costs by weighting instances and incorporates these costs during tree construction. Serafín et al. [28] proposed a nonparametric predictive inference model to improve cost-sensitive decision trees. In addition to modifying tree generation, fusing active learning with cost-sensitive algorithms can effectively enhance classification performance on imbalanced data [29]. Although combining cost-sensitive learning with classification models effectively improves predictive accuracy, determining misclassification costs still requires substantial effort.
These various methods each have their own characteristics and have made significant progress. However, considering the randomness and diversity of traffic accident data [30], as well as the abundance of sparse features in the traffic accident domain, applying them directly to sample imbalance in traffic accident prediction can lead to suboptimal performance. Given these considerations, this paper addresses the sample-imbalance phenomenon in severely imbalanced traffic accident datasets from both the data level and the algorithm level, merging the two to enhance the robustness of predictions.

Methodology
This paper aims to classify the results of traffic accident prediction at a finer granularity and to address the common issue of sample imbalance in traffic accident data, which can reduce the accuracy of model predictions. To tackle this challenge, we propose a traffic accident prediction model that integrates relief-F, MAHAKIL, mini-batch K-means, and CatBoost. The primary objective is to mitigate the adverse impact of sample imbalance on model performance and to identify the severity of traffic accidents more accurately.
To evaluate the effectiveness of the ReMAHA-CatBoost model in the field of traffic accident prediction, we chose the US-Accidents dataset [31,32] as the subject of our study. We first conducted a series of data preprocessing steps, including handling missing values, duplicates, and outliers, one-hot encoding, and feature engineering. These steps provided critical data for predicting the severity of traffic accidents. Subsequently, we developed a hybrid sampling prediction model using the ReMAHA-CatBoost architecture, designed to alleviate the data imbalance in traffic accidents and to perform the severity classification task.
At the data level, the strong randomness in traffic accident data makes it challenging to generate data using traditional oversampling methods. Therefore, we first perform clustering on the minority-class samples, enhancing the similarity between samples, and then use MAHAKIL to generate new minority-class samples between clusters. Furthermore, the considerable number of Boolean features among the traffic accident characteristics leads to feature sparsity. With large datasets and extremely sparse features, computing the Mahalanobis distance becomes difficult because the covariance matrix may become non-invertible, impeding the use of the traditional Mahalanobis distance computation.
We introduce the relief-F algorithm to process the features after clustering; it can remove features whose values are nearly uniform within a cluster and restore them after the data are generated. This processing avoids the influence of sparse features on the computation of the Mahalanobis distance. In addition, we use the obtained diagonal matrix of feature-importance weights for Mahalanobis distance weighting. The purpose of this step is to correct the traditional Mahalanobis distance's tendency to exaggerate the role of low-variance variables and to increase the influence of important features on the distance.
Although the extensive volume of traffic accident data is highly advantageous for subsequent predictions, its internal distribution is imbalanced, which makes traditional algorithms prone to overfitting and biases predictions toward the categories with more samples. We therefore used CatBoost for prediction at the algorithm level, in order to mitigate the overfitting problem in the traffic accident classification task. CatBoost's ordered boosting and its handling of gradient bias enhance the model's predictive capability on the imbalanced traffic accident dataset. Subsequent subsections explain the mentioned methods in detail. The overall workflow of the proposed method is outlined in Figure 1 and elaborated in this section.

Dataset
Given the extensive volume, imbalance, and feature sparsity of traffic accident data, we selected the highly representative US-Accidents dataset as the focus of our study. It contains traffic accident data for the United States from 2016 to 2022, sourced from both Bing and MapQuest. The dataset comprises over 7 million records and a wide range of feature dimensions. Based on their attributes, the features fall into three types: numerical, Boolean, and text, as shown in Table 1. The US-Accidents dataset is imbalanced: accident severity levels range from 1 to 4 (Table 2), indicating increasing degrees of severity, with 1 representing the least impact on traffic (i.e., causing short-term delays) and 4 representing a more significant impact (i.e., causing long-term delays). In terms of data quantity, there is a substantial difference between the numbers of events with severity levels 1 and 4 and those with severity levels 2 and 3.
The overall dataset exhibits a pronounced imbalance. Based on the imbalance ratio (IR) equation

IR = Nmajor / Nminor (1)

it can be concluded that the imbalance ratio of this dataset is as high as 91.40, where Nmajor represents the sample size of the most numerous category in the dataset and Nminor represents the sample size of the least numerous category. Data preprocessing is a crucial part of the entire machine learning process. To prevent "noise" in the dataset from affecting the experiment, we handled missing values, duplicates, and outliers, and additionally engineered some new features. Following these data processing steps, we partitioned the dataset into two subsets, a training set and a test set, for model validation purposes.
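Given the Nmajor/Nminor definitions above, the imbalance ratio is a one-line computation. The sketch below uses hypothetical class counts (not the actual US-Accidents figures) purely to illustrate the calculation.

```python
def imbalance_ratio(class_counts):
    """Imbalance ratio: size of the largest class over the smallest."""
    return max(class_counts) / min(class_counts)

# Hypothetical per-severity sample counts, chosen only for illustration
counts = [1200, 109680, 54000, 3100]
print(round(imbalance_ratio(counts), 2))  # 91.4
```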

Missing Value Handling
Table 3 describes the distribution of missing values in the dataset. In the experiment, missing values in natural environmental features were imputed using the median for the same weather station and month. For spatial environmental features, missing values were imputed using data from geographically close locations. Features with a large number of missing values lost their original utility and were removed. For the two reasons above, and to minimize duplicate data in the experiment's dataset while avoiding data leakage due to duplicate values, we excluded the 2016 data, which had fewer records overall and a substantial disparity in data volume between months, and retained the data from 2017 to 2022. We then screened the remaining dataset of over six million entries for duplicate values, filtering out entries with a time interval of less than 10 min and a geographical distance of less than 250 m, as assessed by the geographical distance formula in Equation (2). Here, φa and φb represent the longitudes of points a and b, respectively, while λa and λb denote the latitudes of points a and b, respectively.
To illustrate the process of eliminating duplicate data based on time and distance in more detail, consider the following example (Table 4). Assuming the four records above undergo duplicate-value judgment, the data pairs (X1, X3), (X1, X4), and (X2, X3), whose time intervals are less than 10 min, are identified first. The longitude and latitude values of these pairs are then substituted into Equation (2) for the distance judgment, which yields dist(X1, X3) = 129,461.41 m, dist(X1, X4) = 240.38 m, and dist(X2, X3) = 179,192.03 m. Therefore, the duplicate lies between X1 and X4; since X1 occurs earlier than X4, X1 is taken as the true origin of the accident.
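As an illustrative sketch of this duplicate check (not necessarily the paper's exact Equation (2)), the snippet below combines the 10-minute and 250-metre thresholds with the haversine great-circle distance; the example records are made up.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def is_duplicate(rec_a, rec_b, max_minutes=10, max_metres=250):
    """Two records are duplicates if BOTH the time gap and distance are small."""
    dt = abs((rec_a["time"] - rec_b["time"]).total_seconds()) / 60
    d = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    return dt < max_minutes and d < max_metres

# Made-up records: ~5 minutes and ~160 metres apart
a = {"time": datetime(2020, 5, 1, 8, 0), "lat": 40.7128, "lon": -74.0060}
b = {"time": datetime(2020, 5, 1, 8, 5), "lat": 40.7140, "lon": -74.0070}
print(is_duplicate(a, b))  # True
```

When a pair is flagged, the earlier record is kept as the accident's origin and the later one is dropped.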
In this section, we screened the 2017 to 2022 data and deleted duplicate records with time intervals of less than 10 min and geographical distances of less than 250 m, ensuring the purity of the data and avoiding data leakage.

Handling Outliers
In addition, the dataset contains some anomalies caused by malfunctioning environmental monitoring instruments and other factors. Left untreated, they would skew the model's predictions. To provide a visual representation of the original data, this experiment conducted a feature analysis. Table 5 presents descriptive statistics for some of the features in the original data, including the first quartile (Q1), third quartile (Q3), lower bound (lower), upper bound (upper), maximum (max), and minimum (min) for numerical features such as temperature (F), humidity (%), pressure (in), and wind speed (mph). The interquartile range (IQR) and the corresponding bounds are calculated as:

IQR = Q3 - Q1, lower = Q1 - 1.5 × IQR, upper = Q3 + 1.5 × IQR

The maximum and minimum values represent the extreme occurrences of a particular feature in the original dataset. If the maximum exceeds the upper bound or the minimum falls below the lower bound, the feature contains outliers (e.g., Wind_Speed (mph)); if the maximum is below the upper bound and the minimum is above the lower bound, the feature contains no outliers (e.g., humidity (%)). Notably, the maximum and minimum values of some environmental features far exceed the most extreme values found in a normal natural environment. Since the upper and lower bounds of the IQR effectively represent the distribution of the majority of the data [33,34], data falling significantly outside the interquartile range were removed in this experiment to keep the data closer to real environmental conditions.
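A minimal sketch of this IQR-based outlier removal, using made-up wind-speed readings with one obviously faulty sensor value:

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative wind-speed readings (mph); 822.8 is a faulty-instrument value
wind_speed = np.array([0.0, 3.0, 5.0, 7.0, 8.0, 10.0, 12.0, 822.8])
lo, hi = iqr_bounds(wind_speed)
cleaned = wind_speed[(wind_speed >= lo) & (wind_speed <= hi)]
print(cleaned)
```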

Feature Engineering
To fully incorporate text features into the model-building process, the original text features were one-hot encoded. Table 6 shows some examples of the original text features, and Table 7 shows the processed textual features.
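A minimal sketch of the one-hot encoding step, assuming pandas is available; the column names and values are illustrative stand-ins for the dataset's text features:

```python
import pandas as pd

# Illustrative categorical columns similar in spirit to the dataset's text features
df = pd.DataFrame({
    "Wind_Direction": ["N", "SW", "N"],
    "Weather_Condition": ["Clear", "Rain", "Clear"],
})

# Each distinct category becomes its own binary indicator column
encoded = pd.get_dummies(df, columns=["Wind_Direction", "Weather_Condition"])
print(sorted(encoded.columns))
```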

The ReMAHA Algorithm
This study introduces a novel model for predicting the severity of imbalanced traffic accidents (ReMAHA-CatBoost). It aims to address the poor prediction performance caused by data imbalance in the traffic accident domain. At the data level, we propose a new oversampling algorithm called ReMAHA, which effectively prevents overfitting in the generated data.
The ReMAHA model consists of three main components: feature weight calculation, inter-cluster weighted Mahalanobis distance ranking, and pairwise generation of new data with a genetic algorithm. First, we use the relief-F algorithm to extract significant features, alleviating the issue of excessively sparse features; the algorithm calculates the importance weights of features, which are represented as a diagonal matrix. Next, we use a clustering algorithm to cluster the minority-class samples, ensuring that the data do not become too scattered when new data are generated pairwise by the genetic algorithm. Once the data are grouped into clusters, we apply weighted Mahalanobis distance ranking to the data between the different clusters. Finally, we employ the genetic algorithm to generate new data pairwise, addressing the data overlap and excessive concentration that may arise from oversampling. The flow of the ReMAHA model is shown in Figure 2.

Feature Weight Calculation
In this study, we first employed the relief-F algorithm for feature selection and obtained a weight matrix used to assess feature importance. The relief-F algorithm was chosen for the following reason: the sparsity of features and the large volume of traffic accident data both cause the covariance matrix to become non-invertible, and filtering out these overly sparse, uninformative features with relief-F alleviates the problem.
Relief-F is an algorithm designed for multi-class problems that calculates feature weights based on the correlation between features and class labels. Its primary objective is to determine which features are more influential for subsequent classification tasks [35]. Given a feature A, its weight at the t-th iteration is denoted as Wt(A). At iteration t + 1, the relief-F algorithm updates the feature weight as follows:

W_{t+1}(A) = W_t(A) - (1 / (T·k)) Σ_{j=1}^{k} diff(A, X_i, NH_j) + (1 / (T·k)) Σ_{c ≠ Class(X_i)} [ p(c) / (1 - p(Class(X_i))) ] Σ_{j=1}^{k} diff(A, X_i, NM_j(c))

where T represents the number of iterations, k is the number of nearest neighbors chosen, p(c) represents the prior probability of class c, and Class(Xi) indicates the class to which Xi belongs.
For a given dataset, the relief-F algorithm operates as follows: first, a sample Xi is randomly selected; then the nearest same-class sample point NHi and the nearest different-class sample point NMi are found in the dataset. The function diff(A, Xi, Xj) calculates the distance between sample points Xi and Xj on feature A, where max(A) and min(A) denote the maximum and minimum values of feature A.
When feature A is numerical:

diff(A, Xi, Xj) = |Xi[A] - Xj[A]| / (max(A) - min(A))

When feature A is discrete:

diff(A, Xi, Xj) = 0 if Xi[A] = Xj[A], and 1 otherwise.

After T iterations, each feature obtains a feature weight wi, forming a feature weight matrix W = [w1, w2, ..., wm]. The larger wi is, the more useful the feature is for classification.
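The two diff cases can be sketched directly; the feature values here are illustrative:

```python
def diff(a_values, x_i, x_j, numerical=True):
    """Per-feature distance used by relief-F.

    For a numerical feature, the absolute difference is normalized by the
    feature's range; for a discrete feature, the distance is 0 or 1.
    """
    if numerical:
        rng = max(a_values) - min(a_values)
        return abs(x_i - x_j) / rng if rng else 0.0
    return 0.0 if x_i == x_j else 1.0

temps = [30.0, 50.0, 70.0, 90.0]                      # numerical feature values
print(diff(temps, 50.0, 70.0))                         # 20 / 60 = 0.333...
print(diff(None, "Rain", "Clear", numerical=False))    # unequal categories -> 1.0
```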

Inter-cluster Weighted Mahalanobis Distance Sorting
Directly generating new data using a genetic algorithm can lead to excessive divergence in the generated data, potentially causing a loss of original features [36]. Given the large volume of traffic accident data, this study first employs mini-batch K-means to cluster the minority-class sample data. This process groups data with similar features into clusters, which are then used for inter-cluster sorting of the sample data.
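A minimal sketch of this clustering step, using scikit-learn's MiniBatchKMeans on synthetic stand-ins for minority-class samples; the cluster count and batch size are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for minority-class accident samples: two loose groups
minority = np.vstack([
    rng.normal(0.0, 1.0, (200, 5)),
    rng.normal(8.0, 1.0, (200, 5)),
])

# Mini-batch K-means scales to large sample counts by fitting on small batches
km = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=10, random_state=0)
labels = km.fit_predict(minority)
print(sorted(set(labels)))
```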
Additionally, the traditional Mahalanobis distance, although addressing the dimensional-scaling issue not resolved by Euclidean distance, neglects the varying influence of features on labels. Applying feature weighting leads to more precise distance calculations between sample points. Assume that the sample whose distance is being calculated originates from a cluster with center C = (c1, c2, ..., cn), and that the point for the Mahalanobis distance calculation is X = (x1, x2, ..., xn). The feature-importance diagonal matrix B is generated using the relief-F algorithm. The weighted Mahalanobis distance is then represented as follows:

D(X, C) = sqrt( (X - C)^T B Σ^{-1} B (X - C) )

where Σ represents the covariance matrix, whose entries are calculated as:

cov(xi, xj) = E[(xi - E[xi])(xj - E[xj])]
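A sketch of a weighted Mahalanobis distance in which a relief-F-style diagonal matrix B scales the difference vector before the usual quadratic form; all numbers are illustrative:

```python
import numpy as np

def weighted_mahalanobis(x, c, cov, weights):
    """Mahalanobis distance from x to cluster centre c, with a diagonal
    feature-importance matrix applied to the difference vector."""
    B = np.diag(weights)
    d = B @ (x - c)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

cov = np.array([[2.0, 0.3],
                [0.3, 1.0]])          # illustrative covariance matrix
w = np.array([0.8, 0.2])              # illustrative relief-F importance weights
x, c = np.array([3.0, 1.0]), np.array([1.0, 0.5])
print(round(weighted_mahalanobis(x, c, cov, w), 4))
```

Because B is diagonal, a large weight amplifies that feature's contribution to the distance, while a near-zero weight effectively mutes an unimportant feature.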

Genetic Algorithm for Generating New Data
The genetic theory of chromosome inheritance posits that during the formation of chromosomes, each parent contributes an equal half of the total genes to their offspring [37]. The new offspring inherits 50% of its genes from each parent, making it similar to its parents while also possessing unique traits. In this study, sample data are treated as chromosomes, and new data are constructed from these samples. This approach ensures that the newly generated data retain common characteristics while also exhibiting the uniqueness of the two original samples. Using genetic algorithms guarantees the diversity of generated data and effectively addresses the issues of data-distribution marginalization and data overlap encountered by traditional oversampling methods.
The formula for generating data with the genetic algorithm, given in Equation (11), incorporates an influence factor β to better preserve the uniqueness of the samples; in this study, we set β to 0.5:

x = β · a + (1 - β) · b

where a and b represent the parent samples used for generating data and x represents the newly generated sample. Since β equals 0.5 in this experiment, both parent samples contribute equally to the newly generated sample.
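The pairwise generation rule can be sketched in a few lines; the parent vectors are illustrative:

```python
import numpy as np

def crossover(parent_a, parent_b, beta=0.5):
    """Generate a child sample as a convex combination of two parent samples."""
    return beta * parent_a + (1.0 - beta) * parent_b

a = np.array([2.0, 4.0, 6.0])
b = np.array([4.0, 2.0, 10.0])
child = crossover(a, b)   # with beta = 0.5, the child lies midway between parents
print(child)
```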

ReMAHA
The ReMAHA algorithm, formed by combining the methods above, can be summarized in three main steps:
Step 1: Apply the relief-F feature selection algorithm to the entire dataset to assess the influence of features on accident severity and obtain a diagonal weight matrix;
Step 2: Cluster the minority-class samples in the dataset so that data with similar features fall into the same clusters; within each cluster, calculate the distance from each point to the cluster center and rank the points by the Mahalanobis distance weighted with the diagonal weight matrix;
Step 3: Generate new data pairwise using the genetic algorithm with β equal to 0.5, based on the sorted distances between data points, and then re-sort the newly generated data. To prevent excessive dispersion, iterate at most 5 times per cluster.
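Steps 2 and 3 for a single cluster can be sketched as follows. This is a simplified illustration of the ranking-and-breeding idea, not the paper's full Algorithm 1; the small regularization term added to the covariance is an implementation convenience for near-singular matrices.

```python
import numpy as np

def remaha_cluster_oversample(samples, weights, n_new):
    """Rank a cluster's samples by weighted Mahalanobis distance to the
    cluster centre, then breed adjacent pairs into new samples (beta = 0.5)."""
    centre = samples.mean(axis=0)
    # Regularize so the covariance stays invertible for sparse features
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(samples.shape[1])
    inv = np.linalg.inv(cov)
    B = np.diag(weights)

    def dist(x):
        d = B @ (x - centre)
        return float(np.sqrt(d @ inv @ d))

    ranked = samples[np.argsort([dist(x) for x in samples])]
    children = [0.5 * ranked[i] + 0.5 * ranked[i + 1]
                for i in range(min(n_new, len(ranked) - 1))]
    return np.array(children)

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, (20, 3))          # illustrative cluster samples
new = remaha_cluster_oversample(cluster, np.array([0.5, 0.3, 0.2]), n_new=5)
print(new.shape)  # (5, 3)
```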
The schematic of ReMAHA-generated data is shown in Figure 3 and the algorithm is shown in Algorithm 1.

CatBoost
At the algorithm level, we employ CatBoost, a variant of gradient-boosting algorithms [38,39]. Like other gradient-boosting algorithms, CatBoost iteratively learns from errors by combining multiple weak learners.
In gradient-boosting models, a commonly used node-splitting method is greedy target-based statistics, which uses the average label value as the node-splitting criterion. However, when extreme values exist in the dataset or when the data distributions of the training and test sets differ, using the mean value to split nodes can lead to conditional shift, reducing prediction accuracy. CatBoost mitigates conditional shift by introducing a prior distribution term, represented by p, and a weight coefficient, represented by a. The formula is as follows:

x̂_i = ( Σ_j 1[x_j = x_i] · y_j + a · p ) / ( Σ_j 1[x_j = x_i] + a )

Furthermore, since gradient-boosting models use the same dataset for training in each iteration, a prediction shift may arise between the gradient estimates and the true distribution [40]. CatBoost addresses this issue with an ordered (ranking) boosting approach, training a separate model Mi for each sample i so that Mi is never trained on sample i itself. CatBoost's handling of conditional and prediction shifts makes it well suited to situations involving imbalanced data.
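The smoothed target statistic can be sketched as follows; the category and label values are illustrative, and CatBoost's ordered variant additionally restricts the sums to the samples preceding the current one:

```python
def greedy_target_statistic(categories, targets, x, prior, a=1.0):
    """Smoothed target statistic for category x: label counts plus a prior
    term a*p in the numerator and a in the denominator."""
    hits = [t for c, t in zip(categories, targets) if c == x]
    return (sum(hits) + a * prior) / (len(hits) + a)

cats = ["rain", "clear", "rain", "rain", "clear"]   # illustrative categories
y = [1, 0, 1, 0, 0]                                  # illustrative labels
prior = sum(y) / len(y)                              # global mean as prior p
print(round(greedy_target_statistic(cats, y, "rain", prior), 3))  # (2 + 0.4) / 4 = 0.6
```

The prior pulls rare categories toward the global mean, which is exactly what prevents extreme values from dominating the split criterion.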

Experiments
When evaluating the performance of the proposed models, we addressed the following research questions: (1) Does varying the number of clusters affect the generation of new data?
(2) How does the quality of the newly generated data using our proposed method compare with other sampling techniques?
(3) How does this method perform on traffic accident datasets compared to other baseline methods?

Experimental Setup
The experimental platform utilized in this study includes 64 GB of RAM, an 11th Gen Intel(R) Core(TM) i7-11700K CPU clocked at 3.60 GHz, and an NVIDIA GeForce RTX 3070 Ti GPU, running on CentOS 7.9.

Evaluation Metrics
The confusion matrix is the most commonly used method for evaluating the performance of classification problems; its definition is shown in Table 8. Various other classification evaluation metrics are derived from the confusion matrix.
Precision represents the proportion of correctly predicted samples of a class among all samples predicted as that class:

Precision = TP / (TP + FP)

Recall represents the proportion of correctly predicted samples of a class among all samples of that class:

Recall = TP / (TP + FN)

Typically, precision and recall are trade-offs, and in many cases both must be considered simultaneously. The Fβ score provides a weighted harmonic mean of precision and recall [41], calculated as follows:

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

The F1-Score is the most commonly used Fβ score, with β equal to 1; it balances precision and recall equally.
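These three metrics can be computed directly from confusion-matrix counts; the counts below are illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # F-beta with beta = 1
    return precision, recall, f1

# Illustrative counts for one class
p, r, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.6 0.67
```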

The Impact of Clustering on Generated Data
The proposed model involves selecting the number of clusters. To demonstrate the effectiveness of clustering in traffic accident prediction, we conducted experiments investigating the impact of varying cluster numbers on the results. Since oversampling algorithms primarily affect the minority class in imbalanced datasets, this experiment focused on the effect of cluster numbers on labels 1 and 4 in the dataset.
Clustering is one of the key steps of the model proposed in this paper, and Figure 4 depicts the relationship between the number of clusters and the F1-Score. In this experiment, we varied the number of clusters from 1 to 9. Given the relatively large amount of data, we chose mini-batch K-means as the clustering method. When the number of clusters is set to 1 (i.e., the entire category forms a single cluster, so no clustering is applied), the performance is relatively poor compared to cases where clustering is applied. Within a smaller range, the F1-Score improves as the number of clusters increases. However, when the number of clusters becomes excessive, the features within each cluster may become highly homogeneous, potentially resulting in unstable quality in the generated data.
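A minimal sketch of this sweep, assuming scikit-learn; the random feature matrix stands in for the minority-class samples (e.g., label 1), and the batch size is a placeholder, not the paper's setting.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for the minority-class feature matrix (e.g., label-1 samples).
X_minority = rng.normal(size=(5000, 8))

inertia_by_k = {}
for k in range(1, 10):  # the experiment sweeps cluster counts 1..9
    km = MiniBatchKMeans(n_clusters=k, batch_size=1024, n_init=3, random_state=0)
    km.fit(X_minority)
    inertia_by_k[k] = km.inertia_
# Downstream, new samples would be generated within each cluster and the
# resulting F1-Score compared across k, as in Figure 4.
```

Mini-batch K-means trades a small loss in cluster quality for a large speedup on big datasets, which is the motivation stated above.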

Quality Comparison of Data Generated by Sampling Algorithms
In the previous section, we assessed the impact of internal parameters on model performance. In this section, we compare ReMAHA with different sampling algorithms. At the data level, we compared three oversampling algorithms, namely, SMOTE [19], ADASYN [42], and random oversampling (ROS), with the proposed ReMAHA; the parameters used in the sampling models during the experiments are listed in Table 9. The predictive model in all cases was CatBoost, and the experimental results are shown in Table 10. Figure 5 shows that adding oversampling improves predictive performance, with a significant increase in the number of minority class samples predicted. This is mainly reflected in the improvement in recall, where higher recall indicates better identification of minority class samples. However, as recall increases, some precision values decrease, because as more samples are predicted as the minority class, the precision of those predictions may fall. As shown in Table 10, ReMAHA achieves precisions of 71.41 and 76.31, recalls of 75.60 and 53.23, and F1-Scores of 73.44 and 62.71 for labels 1 and 4, respectively, with oversampling performance superior to the other models. Figure 6a illustrates the ranking of feature importance in the original data. Figure 7 shows the effect of the oversampling algorithms ReMAHA, SMOTE, ADASYN, and ROS on the correlation structure of the data. Compared to the original data, ReMAHA and ROS produce the closest effects. However, ROS directly duplicates existing data, leading to oversensitive identification of certain classes during prediction while yielding poor identification of others.

Predictive Model Comparison Experiment
To evaluate the effectiveness of ReMAHA-CatBoost, we conducted experiments involving eight different models (Table 11). These models were divided into two groups for comparison: prediction model comparison experiments, and prediction + sampling (hybrid) model comparison experiments.
At the algorithm level, to ensure the accuracy of the experiments, we chose boosting algorithms, including AdaBoost [43], GBDT [44], XGBoost [45], LightGBM [46], and CatBoost. To ensure fairness, default parameters were used for all models. First, we compared the prediction models without any sampling. Table 12 shows the results of the five algorithms on the unsampled US-Accidents dataset. AdaBoost performs significantly worse on the minority class label 1 than the other models. While it achieves a high recall for the minority class label 4, the low F1-Score indicates that AdaBoost identifies more instances of label 4 but with poor precision. This is because adaptive models focus more on misclassified samples in each iteration, and with a highly imbalanced dataset the adaptive model tends to favor the majority class during prediction. CatBoost, XGBoost, and LightGBM, as emerging variants of gradient-boosting algorithms, perform similarly. CatBoost achieves an F1-Score of 72.53 for class label 1 on this dataset, slightly higher than the other two, and when considering both minority class labels, CatBoost performs better overall. CatBoost's superior performance in identifying minority classes in this experiment is mainly attributed to its built-in symmetric trees used as split nodes, making it more suitable for imbalanced datasets.

Hybrid Models
Table 13 displays the performance of the hybrid models. We combined ReMAHA with five predictive algorithms to assess the impact of the sampling model on the predictive models. As shown in Figure 9, all models using ReMAHA oversampling outperform their unmixed counterparts. With the introduction of ReMAHA oversampling, the models show notable improvements in F1-Score, ranging from 0.3% to 9.32%, especially for severity levels 1 and 4. While adding oversampling increases the number of minority class samples predicted, it can decrease precision when samples are misclassified. Nevertheless, ReMAHA improves F1-Score, precision, and recall for the minority classes 1 and 4. From the perspective of algorithms and sampling, the combination of ReMAHA and CatBoost shows the most stable predictive performance. In particular, its F1-Score and recall for minority classes 1 and 4 are the highest, indicating that this hybrid model accurately identifies minority class samples. To highlight the contrast before and after sampling, we computed, for each prediction algorithm (CatBoost, AdaBoost, GBDT, XGBoost, and LightGBM), the differences between the model with ReMAHA sampling and the model without it, as depicted in Figure 10. After processing with the ReMAHA oversampling algorithm, the weights of the minority class samples increase, which can reduce precision while enhancing recall. Overall, after incorporating ReMAHA oversampling, the majority of performance metrics for the prediction models improve significantly, especially for label 4.

Discussion
ReMAHA-CatBoost's comparisons at different levels confirm its excellent performance in both oversampling and prediction. These experimental results indicate that ReMAHA-CatBoost is effective in recognizing traffic accidents of different severity levels. At the same time, the data generation process did not consume excessive additional computing resources: generating new data with the ReMAHA oversampling algorithm used 2594.3 MB of memory, less than the SMOTE, ROS, and ADASYN algorithms.
It is worth noting that adding an oversampling algorithm can increase the number of minority class samples predicted. However, as more data are generated, the risks also grow: if the generated data are too concentrated, they yield no significant performance improvement, while overly dispersed generated data blur the boundaries between different categories. ReMAHA, as proposed in this paper, balances the generated data well through its clustering and data generation process.
Due to CatBoost's improvements in handling prediction shift and conditional shift, it reduces overfitting. Combined with its symmetric trees, it performs exceptionally well on datasets such as traffic accident data, which are characterized by sparse features and extreme class imbalance. Recent research has already demonstrated the effectiveness of hybrid approaches combining sampling and prediction models [47]. Long [48] proposed a hybrid sampling model combined with a GNN model, validating the effectiveness of hybrid samplers on public datasets. Wang et al. [49] introduced the MatFind model, which uses the K-nearest neighbors algorithm to balance the dataset and employs SVM for prediction, leading to improved performance in miRNA identification. These successful cases suggest that hybrid models combining sampling and prediction are feasible when dealing with imbalanced, large datasets.
However, based on existing work, it is clear that the problems caused by data imbalance in machine learning tasks are far from fully resolved. Data imbalance is a common phenomenon in real-world datasets, and the current achievements are not yet sufficient for the models to be fully trusted. In particular, the US-Accidents traffic accident dataset used in this study has a class imbalance ratio of 91.40, and such a massive difference prevents many oversampling algorithms and prediction models from achieving ideal results. Therefore, future work will continue to focus on improving the credibility of models and enhancing their generalization capabilities, primarily along three lines: exploring correlations among data within the same category across multiple dimensions to generate more representative data; leveraging additional traffic accident data to enhance the model's generalization ability; and further refining the feature engineering to improve the model's computational efficiency.

Conclusions
In this study, we proposed the ReMAHA-CatBoost model tailored to the imbalanced domain of traffic accident prediction. Our key design comprises three parts: applying the Relief-F algorithm to perform feature selection and obtain feature weight matrices; integrating those feature weight matrices into the distance calculation between sample points to enhance the reliability of the oversampled data; and, finally, training the CatBoost model on the oversampled dataset.
These designs ensure the model's capability to identify the impact of different features on accident severity while enhancing the recognition of minority class samples. Experimental results demonstrated the effectiveness of ReMAHA-CatBoost in predicting accident severity, especially its superior performance in identifying minority class samples compared to other oversampling algorithms (SMOTE, ADASYN, and ROS). ReMAHA-CatBoost enables accurate classification of potential traffic accidents, assisting traffic management authorities in efficiently allocating limited resources to potential severe accidents. Moreover, the proposed model helps the relevant authorities take the preventive measures necessary to mitigate the economic and healthcare burdens caused by traffic accidents.
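The second design step, folding a feature-weight matrix into the distance calculation, can be sketched as a diagonally weighted (Mahalanobis-style) distance. The weights and points below are hypothetical, not the Relief-F weights learned in the paper.

```python
import numpy as np

def weighted_distance(a: np.ndarray, b: np.ndarray, w: np.ndarray) -> float:
    """Distance of the form sqrt((a-b)^T W (a-b)) with W = diag(w),
    so features with larger Relief-F weights dominate neighbor selection."""
    d = a - b
    return float(np.sqrt(d @ (np.diag(w) @ d)))

# Hypothetical Relief-F weights: feature 0 matters most, feature 2 least.
w = np.array([0.6, 0.3, 0.1])
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 5.0])
dist = weighted_distance(a, b, w)  # sqrt(0.6*1 + 0.3*0 + 0.1*4) = 1.0
```

Using such a weighted distance when selecting the neighbors used for interpolation keeps synthetic minority samples close along the features that actually discriminate severity, which is the stated purpose of integrating the weight matrices.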

Figure 4. The impact of the number of clusters on the F1-Score.

Figure 5. Predictive performance of different oversampling models.

Figure 8.
Figure 6b shows the confusion matrix before sampling, while Figure 8 presents the confusion matrices for the various oversampling algorithms. As manifested in these confusion matrices, SMOTE, ROS, and ADASYN all exhibit varying degrees of overfitting, primarily by misclassifying labels 2 and 3 as label 1 or label 4. This is primarily attributed to insufficient quality in the generated samples; for instance, the generation process might not account for correlations, or might produce excessive overlap among the generated data, making correct categorization difficult.

Figure 9. Performance on different models: (a) performance on the prediction model; and (b) performance on the hybrid model.

Figure 10. Comparison of effects before and after sampling.

Table 3. Distribution of missing values (%).

From the perspective of the features in the US-Accidents traffic accident dataset, there are two sources of duplicate values:
1. Repeated reporting of the same traffic accident by different reporting sources. For the same accident, different sources may each file a report, and differences in source also produce variations in the geographic and time information. Based on the data source, the Source field is categorized into Source 1 (reported by Bing), Source 2 (reported by MapQuest), and Source 3 (reported by both).
2. Multiple reports caused by the same traffic accident source. The determination of traffic accident types in US-Accidents is based on whether traffic delays occur at a specific geographic location, and there is a strong spatial correlation between adjacent roads. Closely located traffic congestion can therefore trigger a chain reaction, leading to multiple congestion reports on nearby roads caused by the same accident source.

Table 4. Example of Duplicate Data Processing.

Table 6. Example of Text Features.

Table 7. Example of Processed Textual Features.

Table 11. Predictive Model Parameter Settings.