1. Introduction
With the rise of a new wave of technological revolution represented by mobile internet, big data, and cloud computing, autonomous driving technology is advancing rapidly and has become the latest development direction in intelligent transportation systems and intelligent vehicle engineering. According to estimates by the U.S. Department of Transportation, full market penetration of autonomous driving technology is difficult to achieve in the short term. For at least the next 20 years, mixed traffic flows composed of both autonomous vehicles and human-driven vehicles will persist. Lane-changing (LC) behavior is a common maneuver in traffic and significantly impacts traffic flow efficiency, safety, and energy consumption [1,2]. Due to the numerous factors influencing LCs, such behavior is often complex. For autonomous vehicles to safely coexist with human-driven vehicles, the greatest challenge lies in understanding and predicting human driving behavior. Furthermore, modeling approaches that align with the logical relationships inherent in the real world are more conducive to addressing driving behavior modeling problems [3].
To deeply understand and predict LC intentions, researchers have developed various prediction models, which can be broadly categorized into rule-based LC decision models and learning-based LC decision models [4]. Rule-based models involve manually presetting a series of clear, fixed “IF-THEN” rules to simulate a driver’s decision-making process regarding whether to change lanes in specific traffic scenarios. For example, the MOBIL (Minimizing Overall Braking Induced by Lane Changes) model [5] uses acceleration as a variable to establish safety and incentive criteria, incorporating a politeness factor to describe driver LC decisions in a rule-based format. In recent years, many scholars have combined cellular automata, game theory, Markov processes, utility theory, and observed phenomena or patterns to construct new rules describing LC decisions and build corresponding models [6,7,8,9,10,11,12]. Such models offer strong interpretability, but the quality of the rules depends entirely on the expert knowledge of the modelers. Subsequently, researchers developed learning-based LC decision models, which utilize traditional machine learning or deep learning to automatically learn patterns of LC decisions from large amounts of driving data, thereby improving predictive capability. Traditional machine learning-based models encompass single-learner models (such as support vector machines, decision trees, and Bayesian classifiers [13,14,15,16]) and ensemble learning models (such as random forest (RF) and XGBoost [17,18,19,20,21]). These models are characterized by a simple structure, fast training speed, and strong interpretability. However, they often suffer from limited expressive capability when dealing with complex nonlinear problems and are prone to overfitting or underfitting. Deep learning-based models encompass architectures such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), convolutional neural networks (CNNs), and Transformers [22,23,24,25,26,27]. These deep learning models can automatically extract high-level spatio-temporal features from large-scale trajectory data and excel at capturing complex nonlinear relationships and long-term dependencies, often achieving the highest prediction accuracy when sufficient data are available; nevertheless, they require substantial amounts of labeled data and computational resources, suffer from long training times, and are often criticized as “black box” models due to their poor interpretability.
Compared with rule-based models, traditional machine learning-based models are more powerful and flexible, as they can automatically learn complex, nonlinear decision boundaries from data, and possess stronger generalization capability. Compared with deep learning-based models, traditional machine learning-based models are more transparent and efficient, require less data, and offer stronger interpretability of the decision-making process. This is particularly crucial for the safety-critical domain of autonomous driving and is more conducive to understanding real-world LC behavior. Therefore, this paper selects traditional machine learning-based LC models as the subsequent modeling approach.
A systematic comparison of representative LC decision prediction studies is presented in Table 1, summarizing their methods, datasets, key features, evaluation metrics, and main findings.
Despite the advantages of traditional machine learning-based models, a common challenge remains: most studies directly employ predefined feature engineering methods to extract input features, focusing primarily on immediate neighboring vehicles. However, discretionary LC decisions may involve consideration of traffic conditions beyond immediate neighbors, such as the average speed and spacing of multiple preceding vehicles. This paper addresses this gap by introducing multi-vehicle information factors. The main contributions of this paper are as follows:
Identification of multi-vehicle information factors. Beyond conventional physical factors (such as speed, spacing, and speed difference), this paper identifies average speed and average spacing of multiple preceding vehicles as key factors influencing discretionary LC decisions. Statistical analysis confirms that drivers consider these multi-vehicle information factors even when immediate neighboring conditions appear favorable.
Safety considerations dominate feature importance in LC decisions. In the dataset analyzed, feature importance reveals that drivers consider both safety and benefit when making LC decisions, with safety taking precedence. Safety-related features, particularly the spacing with the following vehicle in the target lane and the safety margin, rank significantly higher than benefit-related features such as the average speed of the target lane. This finding is consistent with a two-stage decision logic where safety conditions are evaluated prior to benefit considerations.
Systematic evaluation of imbalanced data processing and model selection. Based on the US101 dataset used in this paper, a comprehensive comparison of five imbalanced data processing methods identifies SMOTE+Tomek as the best-performing approach. Among six baseline models, KNN achieves the best performance (F1 = 0.79, AUC = 0.97). Ablation experiments further quantify the contribution of each multi-vehicle information feature.
The remainder of this paper is organized as follows.
Section 2 describes the trajectory data and preprocessing steps.
Section 3 presents the imbalance analysis and comparison of five imbalanced processing methods.
Section 4 analyzes the factors influencing LC decisions, including feature importance ranking.
Section 5 details the construction and evaluation of the LC decision prediction model.
Section 6 concludes the paper and discusses the implications, limitations, and future work.
2. LC Trajectory Data
2.1. Data Preprocessing
This study utilizes the vehicle trajectory dataset from the US101 segment, publicly released by the NGSIM (Next Generation Simulation) project. The US101 segment is located in Los Angeles, California, adjacent to Lankershim Boulevard (as shown in Figure 1). Data collection was conducted on 15 June 2005, from 7:50 a.m. to 8:35 a.m., during which three 15 min segments of vehicle trajectory data were collected. The dataset includes inherent vehicle attributes such as vehicle type, length, and width, as well as motion attributes including position, speed, and acceleration recorded at 0.1 s intervals.
As the data were manually extracted frame by frame using video processing software, significant errors exist in the raw data. Therefore, data preprocessing is necessary. The preprocessing includes data smoothing and the selection of single discretionary LC trajectories for analysis.
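To obtain more accurate vehicle trajectory data, the symmetric Exponential Moving Average (sEMA) method was employed to smooth the trajectory data, as shown in Equation (1):
$$\tilde{x}_{\alpha}(t_i)=\frac{1}{Z}\sum_{k=i-D}^{i+D} x_{\alpha}(t_k)\,e^{-|i-k|/\Delta},\qquad Z=\sum_{k=i-D}^{i+D} e^{-|i-k|/\Delta} \tag{1}$$
where $\tilde{x}_{\alpha}(t_i)$ represents the smoothed position of LC vehicle $\alpha$ at time $t_i$; $x_{\alpha}(t_i)$ represents the original measured position of LC vehicle $\alpha$ at time $t_i$; $D=\min\{3\Delta,\ i-1,\ N_{\alpha}-i\}$ denotes the smoothing window for boundary data; $\Delta$ denotes the smoothing window for intermediate data, $\Delta = T/dt$, with $dt$ the sampling interval; when $x_{\alpha}$ represents position data, $T$ = 0.5 s; when $x_{\alpha}$ represents speed data, $T$ = 1 s; and $N_{\alpha}$ represents the total number of frames in which vehicle $\alpha$ appears within the detected road segment [31].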
The sEMA smoothing method reduces high-frequency noise while preserving the overall trajectory trend. Taking the trajectory data of Vehicle 31 in the US101 dataset during the time period from 7:50 to 8:05 as an example, the lateral position and velocity before and after smoothing are shown in Figure 2. The raw lateral position data exhibit noticeable frame-to-frame fluctuations due to manual video tracking errors. After applying sEMA, the smoothed trajectory eliminates these unrealistic oscillations while maintaining the key characteristics of the LC, including the start point, end point, and overall duration. This smoothing is essential for reliable LC detection and accurate calculation of vehicle kinematics.
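The smoothing step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the commonly used sEMA form in which each sample is replaced by an exponentially weighted average over a symmetric window of up to 3T/dt frames on each side, with the window shrinking near the ends of the trajectory. The function name `sema_smooth` and its parameters are hypothetical.

```python
import math

def sema_smooth(values, T, dt=0.1):
    """Symmetric exponential moving average of one vehicle's time series.

    values: raw samples (e.g., lateral positions) at a fixed interval dt (s).
    T: smoothing time scale in seconds (0.5 s for positions and 1 s for
       speeds in the text); the kernel width in frames is T / dt.
    """
    n = len(values)
    delta = T / dt                   # kernel width in frames
    max_half = round(3 * delta)      # assumed interior half-window: 3 * delta
    smoothed = []
    for i in range(n):
        # Shrink the window near the boundaries so it stays inside the series.
        half = min(max_half, i, n - 1 - i)
        lo, hi = i - half, i + half
        weights = [math.exp(-abs(k - i) / delta) for k in range(lo, hi + 1)]
        total = sum(w * x for w, x in zip(weights, values[lo:hi + 1]))
        smoothed.append(total / sum(weights))
    return smoothed
```

Because the window collapses to a single frame at the first and last samples, those two values are returned unchanged, which is one reason trajectories with very little data around the LC are less reliable.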
Figure 1 shows the data collection segment of US101, where Lanes 1–5 are the mainline lanes and Lane 6 is an auxiliary lane. First, based on the traffic characteristics of the target segment, LC maneuvers occurring in mainline Lanes 1–5 are generally regarded as discretionary LC trajectories [32]. Second, trajectories involving multiple LCs (e.g., consecutive LCs) were excluded, leaving only single LC trajectories that take place within mainline Lanes 1–5. Finally, LC data that are too short may contain incomplete information. To ensure data completeness, at least 10 s of data before and after the LC is required for each LC vehicle; if the available data duration is less than 10 s, the trajectory is removed.
Since this study introduces multi-vehicle information features (average speed and average spacing of five preceding vehicles), samples were further filtered to ensure data quality. Specifically, a sample was excluded if (1) fewer than five vehicles were present ahead of the lane-changing vehicle (LCV) in the current lane or target lane, or (2) any of the five preceding vehicles was a non-standard vehicle (e.g., heavy trucks, motorcycles).
After applying the above criteria, trajectory data from 6101 vehicles totaling 4.028 million time steps were smoothed and filtered, resulting in 237 qualified LC trajectories with 16,500 time steps. The detailed statistics are presented in Table 2.
2.2. Identification of LC Start and End Points
The identification of LC decision points is closely related to the definition of the LC process. In determining the start and end points of an LC, both the vehicle’s lateral speed and lateral position constraints were considered.
First, the lateral speed constraint was considered. To eliminate the influence of minor lateral displacements and errors, consistent with the approach in Reference [28], the vehicle’s lateral speed was calculated over 0.5 s intervals.
Taking a rightward LC as an example, starting from the time point T when the lane number changes, the first time point along the forward direction of the time axis where the lateral velocity exceeds a threshold is identified as the LC start point. Similarly, the first time point along the backward direction of the time axis where the lateral velocity exceeds the threshold is identified as the LC end point. The time interval between the start point and the end point is taken as the LC duration. In Reference [28], the authors used a threshold of 0.2 m/s. Given that the data in Reference [28] were not smoothed, whereas the data in this paper have been smoothed, the threshold is set to 0.1 m/s. The LC durations obtained with this threshold are comparable to those reported in other studies (see the analysis of LC duration in Section 2.3), supporting the reasonableness of this threshold selection.
Meanwhile, to address the issue of erroneous identification of LC start and end points caused by vehicle fluctuations near lane boundaries during the LC process, a lateral position constraint is introduced in the identification process, i.e., a vehicle will only start and end an LC when it is within the normal car-following lateral range. Lateral velocity changes occurring outside this normal car-following lateral range are considered temporary interruptions during the LC.
The method for determining the normal car-following lateral range of vehicles proposed in this paper draws on the concept of the outer fence in a box-and-whisker plot. In a box-and-whisker plot, to identify extreme outliers, the outer fence of the data is calculated, and data points lying outside the outer fence are considered extreme outliers. Similarly, when a vehicle is following another vehicle in each lane, its lateral position generally falls within a certain range. If it exceeds this range, it is regarded as an extreme outlier point. The specific steps are as follows: First, the lower quartile (Q1) and upper quartile (Q3) of the lateral positions of all vehicles during car-following on each lane are calculated, and the interquartile range (IQR) is obtained as IQR = Q3 − Q1. Then, the two fences, Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, are computed, which serve as the lateral position constraints for the vehicle’s LC start and end points.
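A minimal sketch of the fence computation is given below. The quartile interpolation rule is not stated in the text, so the "inclusive" method of Python's `statistics.quantiles` is used here as an assumption; the function names are hypothetical.

```python
import statistics

def lateral_fences(lateral_positions, k=1.5):
    """Lower and upper fences of car-following lateral positions in one lane."""
    q1, _, q3 = statistics.quantiles(lateral_positions, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def within_car_following_range(y, fences):
    """True if lateral position y lies inside the normal car-following range."""
    low, high = fences
    return low <= y <= high
```

A candidate start or end frame found from the lateral speed criterion would then be accepted only if `within_car_following_range` holds for the lane the vehicle occupies at that frame.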
Figure 3 shows an example of vehicle lateral trajectories and the selected LC start and end points. The gray area in the figure represents the determined normal car-following range for the lane, the red circles indicate LC start points, and the black triangles indicate LC end points. Panels (a) and (b) illustrate rightward and leftward LCs, respectively. The trajectories in the figure exhibit driving near lane boundaries during the LC process. With the lateral position constraint, the proposed method avoids misidentifying the LC start and end points in such cases.
2.3. Analysis of LC Duration
Based on the above principles, the LC durations of the 237 LC trajectories were statistically analyzed, and the results are shown in Table 3. The vehicle LC duration fluctuates between 3 and 14 s, which is consistent with the LC duration ranges reported in other studies [28,33]. The mean duration of rightward LCs is slightly longer than that of leftward LCs.
Table 4 presents the results of the Kolmogorov–Smirnov test examining whether LC duration follows a log-normal distribution. In the table, μ and σ represent the calibrated mean and standard deviation of the logarithm of LC duration, respectively. The null hypothesis of the Kolmogorov–Smirnov test is that LC duration follows a log-normal distribution with a log-mean of μ and a log-standard deviation of σ. KS_STAT is the value of the test statistic, and KS_CV is the critical value for accepting or rejecting the hypothesis. As shown in the table, KS_STAT < KS_CV; therefore, the null hypothesis cannot be rejected (p = 0.723), indicating that the observed LC durations are consistent with a log-normal distribution.
Figure 4 presents the probability histogram of LC duration along with the fitted distribution curve. It can be seen that the two are in close agreement.
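For readers who wish to reproduce this kind of check, the sketch below computes the Kolmogorov–Smirnov statistic for a log-normal fit using only the Python standard library. It is an illustration under stated assumptions, not the procedure used to produce Table 4: the parameters are estimated from the sample itself, and the critical value uses the large-sample 5% approximation 1.36/√n. When parameters are estimated from the same data, this classical critical value is conservative. The function name `ks_lognormal` is hypothetical.

```python
import math
from statistics import NormalDist

def ks_lognormal(durations, coeff=1.36):
    """KS statistic of a log-normal fit, with an approximate 5% critical value.

    Returns (mu, sigma, ks_stat, ks_cv), where mu and sigma are the mean and
    standard deviation of log(duration).
    """
    logs = sorted(math.log(d) for d in durations)
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / (n - 1))
    fitted = NormalDist(mu, sigma)
    ks_stat = 0.0
    for i, x in enumerate(logs):
        f = fitted.cdf(x)
        # Largest gap between the empirical and fitted CDFs on either side of x.
        ks_stat = max(ks_stat, (i + 1) / n - f, f - i / n)
    return mu, sigma, ks_stat, coeff / math.sqrt(n)
```

The null hypothesis is retained when the returned `ks_stat` is smaller than `ks_cv`, mirroring the KS_STAT < KS_CV comparison in the table.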
This study focuses on maneuver recognition (identifying whether a vehicle is currently changing lanes) rather than early intention prediction. The primary objective is to identify factors influencing LC decisions (e.g., safety vs. benefit, multi-vehicle information), which requires temporally accurate correspondence between features and the LC event. Labeling the [LC start, LC end] interval ensures this alignment. Furthermore, as demonstrated by Ali et al. [30], LC maneuvers can be aborted during execution when drivers perceive unsafe conditions. Therefore, for a successfully completed LC, the traffic conditions throughout the entire period must have remained acceptable to the driver, justifying the labeling of the full interval for maneuver recognition purposes.
4. Analysis of Factors Influencing LC Decisions
4.1. Selection of LC Features
This paper utilizes discretionary LC data from the US101 dataset. As shown in Figure 7, a vehicle engages in interactions with various surrounding vehicles while driving, and these interactions can be characterized by a range of factors. Based on the LC influencing factors commonly considered in the previous literature, 13 attributes, as shown in Table 7, were extracted for each LC trajectory. In this paper, subscript C denotes the current lane and subscript T denotes the target lane. These 13 factors are grouped into three categories: physical factors, multi-vehicle information factors, and safety factors.
The physical factors include the speed of the LCV itself, the speed difference and spacing between the LCV and the preceding vehicle in the current lane, the speed difference and spacing between the LCV and the preceding vehicle in the target lane, and the speed difference and spacing between the LCV and the following vehicle in the target lane.
The multi-vehicle information factors include the absolute average speed of five preceding vehicles in the current lane and in the target lane, and the absolute average spacing of five preceding vehicles in the current lane and in the target lane. These absolute averages are used directly as input features.
The choice of five preceding vehicles is not arbitrary but is grounded in traffic flow stability analysis. According to the multi-anticipative IDM [38], linear stability analysis reveals a pattern of diminishing returns: as the number of look-ahead vehicles increases from 1 to 3, the stable region of traffic flow expands significantly; however, the marginal benefit gradually saturates, and once the number reaches five, further improvements in string stability become negligible. Moreover, as noted by Treiber et al. [39], for typical driver reaction times of 1.0–1.5 s, considering up to five preceding vehicles is sufficient to compensate for the instability caused by human reaction delays. Given these considerations, selecting five vehicles offers a reasonable balance between capturing sufficient multi-vehicle information and maintaining model parsimony. Therefore, the absolute average speed and spacing of the five preceding vehicles in both the current and target lanes are adopted as multi-vehicle information factors in this study.
It is worth noting that our conclusions regarding multi-vehicle information are primarily applicable to the five-vehicle setting adopted in this study. We did not systematically vary the number of preceding vehicles. Therefore, whether the same findings would hold for a different number remains an open question, and future work could perform sensitivity analyses to explore the optimal trade-off between information richness and model parsimony.
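The multi-vehicle features themselves are simple averages. The sketch below shows one way they could be computed for a single lane at a single time step; the record layout (dictionaries with `speed` and `spacing` keys, ordered nearest first) is a hypothetical choice for illustration, and the spacing of each vehicle is assumed to be its recorded gap to its own leader.

```python
def multi_vehicle_averages(preceding, n=5):
    """Average speed and spacing of the n nearest preceding vehicles in a lane.

    preceding: list of dicts with 'speed' (m/s) and 'spacing' (m), ordered
               from the nearest vehicle ahead to the farthest (assumed layout).
    Returns None when fewer than n vehicles are available, matching the
    sample-exclusion rule for lanes with fewer than five vehicles ahead.
    """
    if len(preceding) < n:
        return None
    nearest = preceding[:n]
    avg_speed = sum(v["speed"] for v in nearest) / n
    avg_spacing = sum(v["spacing"] for v in nearest) / n
    return avg_speed, avg_spacing
```

Calling the function once for the current lane and once for the target lane yields the four multi-vehicle features; making `n` a parameter also makes a later sensitivity analysis over the number of look-ahead vehicles straightforward.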
Due to the continuously dynamic nature of the vehicle and its surrounding vehicles during driving, and considering the need for a continuous indicator, the safety margin (SM) is selected as the safety factor. The safety margin is a threshold that protects the driver from hazards [40]. Generally, SM ≥ 0 is considered safe, while SM < 0 is considered unsafe. This paper adopts the definition of SM as shown in Equation (12), which is expressed in terms of the speeds of the subject vehicle and its preceding vehicle, the distance between the subject vehicle and its preceding vehicle, and the gravitational acceleration. This paper considers the SM of the LCV as well as the SM of the following vehicle in the target lane.
4.2. Statistical Analysis of Results
Currently, the factors considered in LC decision-making are largely focused on the physical factors of neighboring vehicles, such as the spacing and speed between the LCV and the preceding/following vehicles in adjacent lanes. The core concept underlying this approach is that discretionary LC decisions are based on the instantaneous states of neighboring vehicles; that is, the LCV is assumed to obtain the benefits of the LC at the next time step, immediately after executing it. This section explores the influence of physical factors on LC decisions using empirical data and further analyzes the importance of multi-vehicle information factors in LC decision-making.
Table 8 presents the explanatory power of physical factors regarding LC decisions, measured as the proportion of cases where specific conditions are satisfied at the moment of LC execution. The results indicate that at the moment of LC execution, the speed of the preceding vehicle in the target lane exceeds that in the current lane in 63.02% of cases, suggesting that the speed of the immediate preceding vehicle is indeed an important influencing factor. Furthermore, the average speed of preceding vehicles in the target lane exceeds that in the current lane in 59.25% of cases, indicating that drivers consider not only the immediate preceding vehicle (i.e., instantaneous benefit, where traffic conditions would improve immediately after changing lanes) but also the long-term average speed of the entire lane (i.e., potential benefit, representing future traffic conditions in that lane) when making decisions.
Additionally, the average spacing of preceding vehicles in the target lane exceeds that in the current lane in 57.71% of cases, whereas the spacing of the immediate preceding vehicle in the target lane exceeds that in the current lane in only 42.17% of cases. This suggests that seeking more space is another primary motivation for LCs, although its driving force is slightly weaker than that of seeking higher speed. Moreover, unlike speed—where drivers place greater emphasis on immediate benefits—for spacing, drivers appear to value the long-term potential trend of the entire lane more highly.
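The percentages discussed above are shares of LC events in which a target-lane quantity exceeds its current-lane counterpart at the moment of LC execution. A minimal sketch of that calculation follows; the field names in the usage example are hypothetical and the records are made up purely to show the call.

```python
def share_satisfied(samples, target_key, current_key):
    """Percentage of LC events where the target-lane value exceeds the current-lane value."""
    hits = sum(1 for s in samples if s[target_key] > s[current_key])
    return 100.0 * hits / len(samples)

# Illustrative records only (not data from the paper).
events = [
    {"v_lead_T": 15.0, "v_lead_C": 12.0},
    {"v_lead_T": 11.0, "v_lead_C": 13.0},
    {"v_lead_T": 14.5, "v_lead_C": 14.0},
    {"v_lead_T": 16.0, "v_lead_C": 10.0},
]
print(share_satisfied(events, "v_lead_T", "v_lead_C"))
```

The same function applies unchanged to the immediate-leader spacing and to the five-vehicle averages by passing the corresponding pair of keys.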
Figure 8 illustrates a scenario where a vehicle changes lanes despite the preceding vehicle in the current lane having both greater speed and spacing compared to the preceding vehicle in the target lane. The first row of subplots presents the vehicle’s physical parameters, including velocity, average velocity, spacing, and average spacing. In these plots, the blue line represents the LCV, the red line represents the preceding vehicle in the current lane, and the green line represents the preceding vehicle in the adjacent target lane. Solid lines indicate the immediate preceding vehicle, while dashed lines represent the average conditions of multiple vehicles ahead. The black vertical dashed line marks the moment the LCV crosses the lane line, and the light blue vertical line indicates the LC execution time.
The second row of subplots depicts the actual traffic situation at the LC execution moment. In these plots, circular points represent the LCV, square points represent surrounding vehicles, with point sizes proportional to actual vehicle dimensions and different colors indicating different speeds.
As shown in the figure, although the spacing and velocity of the immediate preceding vehicle in the target lane are not superior to those in the current lane, the average spacing and average speed of multiple vehicles ahead in the target lane are indeed better. This example illustrates that when deciding whether to change lanes, drivers may consider not only the speed and spacing of the immediate preceding vehicle but also multi-vehicle information. Even when the immediate neighboring conditions in the current lane appear favorable, drivers may still initiate an LC if they anticipate deteriorating conditions in the current lane or better conditions in the target lane in the future. This observation supports treating the average spacing and average velocity of vehicles ahead in the target lane as variables worth considering.
4.3. Importance Ranking of LC Influencing Factors
As concluded in Section 3.3, among the five imbalanced data processing methods evaluated with XGBoost, SMOTE+Tomek achieved the best overall performance, with the highest F1-Score (0.68) and the best balance between Recall (0.79) and Precision (0.59). Therefore, the dataset processed by SMOTE+Tomek is adopted for subsequent feature importance analysis.
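As a quick consistency check on the reported values, the F1-Score is the harmonic mean of Precision (P) and Recall (R); substituting the rounded figures quoted above reproduces the stated score:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.59 \times 0.79}{0.59 + 0.79}
    = \frac{0.9322}{1.38}
    \approx 0.68
```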
Figure 9 presents the feature importance ranking and cumulative contribution rate curve obtained from the XGBoost model trained on the balanced dataset. The bar chart shows the importance score of each feature (left y-axis), with features arranged by their contribution to LC decision prediction. The line chart represents the cumulative contribution rate of the top n features (right y-axis), and the dashed lines indicate the 80% and 90% cumulative contribution rate reference levels.
As shown in Figure 9, the top four features achieve a cumulative contribution rate of 46.5%. The top nine features reach 79.6% (approaching 80%), and the top 11 features reach 90.1% (exceeding 90%). Notably, the importance scores of the bottom four features are very close (0.055, 0.051, 0.050, 0.049), exhibiting a “long-tail” distribution: although each individual feature contributes little, the cumulative contribution of multiple low-importance features may not be negligible. Whether these low-importance features can be removed without significantly affecting model performance requires further validation through the experiments in Section 5.
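The cumulative contribution rate is obtained by sorting features by importance and accumulating their normalized scores. The sketch below shows the calculation on a toy dictionary; the feature names and scores are placeholders chosen for illustration and are not the values behind Figure 9.

```python
def cumulative_contribution(importances):
    """Features ranked by importance, each paired with its cumulative share."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    result, running = [], 0.0
    for name, score in ranked:
        running += score / total
        result.append((name, running))
    return result

def features_needed(importances, target):
    """Smallest number of top-ranked features whose cumulative share reaches target."""
    for n, (_, cum) in enumerate(cumulative_contribution(importances), start=1):
        if cum >= target - 1e-12:   # tolerance for floating-point accumulation
            return n
    return len(importances)

# Placeholder scores for illustration only.
toy = {"f1": 0.4, "f2": 0.1, "f3": 0.3, "f4": 0.2}
print(cumulative_contribution(toy))
print(features_needed(toy, 0.8))
```

Applied to the importance scores of a trained model, `features_needed(scores, 0.8)` and `features_needed(scores, 0.9)` give the feature counts at the two reference levels.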
Based on the feature importance ranking and cumulative contribution rate analysis presented above, the following main conclusions can be drawn:
The spacing with the following vehicle in the target lane is the most critical factor in this balanced dataset. This feature (spacing between the LCV and the following vehicle in the target lane) achieves the highest importance score (0.187), far exceeding all other features (approximately 2.4 times the average importance). This result suggests that, in the driving behavior reflected by this dataset, drivers prioritize the availability of sufficient space in the target lane’s rear gap when making LC decisions.
Following vehicles in the target lane matter more than preceding vehicles in this balanced dataset. When aggregating the target-lane features related to the following vehicle and those related to the preceding vehicle, the total importance of following-vehicle features (0.242) is substantially higher than that of preceding-vehicle features (0.163). This suggests that, in the LC decisions reflected by this dataset, drivers pay more attention to “risks from the rear” than to “gains from the front” in the target lane.
Average speed of the target lane contributes most significantly among multi-vehicle information features in this dataset. Among the four multi-vehicle information features, the average speed of preceding vehicles in the target lane (0.091) shows notably higher importance than the other three (0.060, 0.050, and 0.049). This result indicates that, under the data conditions of this paper, drivers are more concerned with the average traffic efficiency of the target lane (measured by average speed), while average spacing information contributes relatively little.
A dual-layer pattern emerges from the feature importance ranking: safety-related features dominate the top positions, followed by benefit-related features, as shown in Table 9:
Safety Threshold: In this dataset, the spacing with the following vehicle in the target lane and the LCV’s own safety margin with the preceding vehicle in the target lane rank at the top of the importance list, indicating that safety conditions are a prerequisite for LC decisions.
Benefit Incentive: After safety conditions are satisfied, the spacing and speed difference with the preceding vehicle in the target lane, together with the average speed of the target lane, reflect the spatial and speed benefits obtainable after the LC, further influencing the decision outcome.
To further validate the feature importance ranking and provide interpretability at the individual prediction level, SHAP (SHapley Additive exPlanations) analysis was conducted on the XGBoost model. Figure 10 presents the SHAP summary plot, where features are ranked by their mean absolute SHAP values. The SHAP results confirm the dominant role of safety-related features: the spacing with the following vehicle in the target lane exhibits the highest mean absolute SHAP value, followed by the spacing with the preceding vehicle in the target lane. This ranking is broadly consistent with the gain-based feature importance shown in Figure 9, providing complementary evidence that safety-related features are the strongest predictors of LC decisions in this dataset. Notably, the average speed of the current lane and the average speed of the target lane show moderate contributions, confirming that multi-vehicle information factors have predictive value, albeit secondary to safety considerations. However, we note that SHAP values reflect the model’s learned patterns and do not necessarily imply causal driver behavior.
The dominance of safety-related features is consistent with risk-aware driving behavior modeling. Shao et al. [41] proposed a safety potential field model for autonomous vehicle longitudinal control, demonstrating that risk-aware control strategies, in which vehicles maintain larger following distances under higher perceived risk, improve both safety and traffic efficiency. From this perspective, the high importance of the spacing with the following vehicle in the target lane can be interpreted as drivers responding to a risk potential field that intensifies with proximity to surrounding vehicles. This alignment between our SHAP analysis (where this spacing exhibits the highest mean absolute SHAP value) and the risk potential field theory offers a theoretical interpretation of why safety-related factors dominate LC decisions: drivers appear to be primarily minimizing collision risk, particularly rear-end collisions in the target lane.
5. LC Decision Modeling Considering Multi-Vehicle Information
5.1. Experimental Setup for Model Comparison
5.1.1. Dataset and Evaluation Metrics
Following the data preprocessing procedures described in Section 2 and the imbalance processing evaluation conducted in Section 3, SMOTE+Tomek, identified as the optimal imbalanced data processing method, is adopted for all experiments in this section. The dataset is first divided into training and testing sets using an 80/20 stratified split, ensuring consistent class distribution across both subsets. Consistent with the workflow described in Section 3.2.7, SMOTE+Tomek is then applied only to the training set, so that the LC and non-LC classes are balanced in the training set while the testing set retains its original class distribution.
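The ordering of the two steps (split first, resample the training portion only) is what prevents synthetic or duplicated minority samples from leaking into the test set. The standard-library sketch below illustrates that ordering. Note the stand-in: it uses plain random oversampling instead of SMOTE+Tomek so that the example has no third-party dependencies; in the software stack listed in this section, the resampling step would presumably be performed with imbalanced-learn. All function names are hypothetical.

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Index-based stratified split: each class contributes test_frac to the test set."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def oversample_minority(train_idx, labels, seed=42):
    """Stand-in for SMOTE+Tomek: duplicate minority-class training rows at random."""
    rng = random.Random(seed)
    by_label = {}
    for i in train_idx:
        by_label.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_label.values())
    balanced = []
    for idx in by_label.values():
        balanced += idx + [rng.choice(idx) for _ in range(target - len(idx))]
    return balanced

# 1 = LC, 0 = non-LC; an illustrative 20/80 imbalance.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)
balanced_train_idx = oversample_minority(train_idx, labels)
```

Only `balanced_train_idx` is used for fitting; evaluation uses `test_idx` with its original class ratio, so the reported metrics reflect the imbalance a deployed model would face.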
The same evaluation metrics described in Section 3.3.1 (Precision, Recall, F1-Score, Accuracy, and AUC) are used to assess model performance.
All experiments were implemented in Python 3.9 using the following libraries: scikit-learn (version 1.2.2), XGBoost (1.7.5), imbalanced-learn (0.10.1), and SHAP (0.41.0). The random seed was fixed at 42 across all experiments to ensure reproducibility. The hardware environment consisted of an Intel Core i7-12700K CPU with 32 GB RAM.
5.1.2. Baseline Models
As discussed in the introduction, traditional machine learning-based models can be broadly categorized into single-learner models and ensemble learning models. To provide a comprehensive evaluation of different modeling paradigms for LC decision prediction, six representative models are selected from both categories.
From the ensemble learning category, four models are included:
XGBoost and GradientBoosting (both from the software libraries listed in Section 5.1.1) represent gradient boosting methods, which build models sequentially by correcting previous errors. They are widely adopted in traffic behavior modeling due to their high predictive accuracy and ability to handle complex feature interactions.
Random Forest (RF) represents bagging-based ensemble methods, which build multiple decision trees in parallel and aggregate their predictions. It offers robustness against overfitting and provides feature importance analysis.
AdaBoost represents adaptive boosting methods that iteratively focus on hard-to-classify samples, providing a contrast to gradient-based boosting approaches.
From the single-learner category, two models are included:
Logistic Regression (LR) serves as a linear model baseline with strong interpretability, helping to assess whether nonlinear relationships in the data provide substantial performance gains over a linear decision boundary.
K-Nearest Neighbors (KNN) represents instance-based learning, which makes no assumptions about data distribution and offers insight into the local structure of the feature space.
5.1.3. Hyperparameter Optimization
For each of the six baseline models, hyperparameters are optimized using grid search with three-fold cross-validation on the training set. F1-Score is used as the optimization metric to balance Precision and Recall.
Table 10 summarizes the search spaces for each model. The search spaces are designed to cover typical ranges reported in the literature while balancing computational efficiency. After grid search, the optimal hyperparameters for each model are selected based on the highest cross-validated F1-Score, as shown in
Table 11. These optimal configurations are then used to train the final models on the full training set.
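The tuning procedure above can be sketched with scikit-learn's `GridSearchCV`. The example below is a minimal illustration for one model (LR) on synthetic data; the grid over `C` is a hypothetical placeholder and does not reproduce the search spaces of Table 10.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the (already balanced) training set.
X_train, y_train = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=42
)

# Hypothetical search space; the real ranges are those listed in Table 10.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",   # F1-Score as the optimization metric
    cv=3,           # three-fold cross-validation on the training set
    refit=True,     # retrain the best configuration on the full training set
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", round(search.best_score_, 3))
```

With `refit=True`, `search.best_estimator_` is already the final model trained on the full training set with the selected hyperparameters.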
5.2. Comparison of Feature Set Configurations
Based on the cumulative contribution rate analysis in
Section 4.3, the importance scores of the bottom four features are very close (ranging from 0.049 to 0.055), exhibiting a “long-tail” distribution. While the top nine features achieve a cumulative contribution rate of 79.6% and the top 11 features reach 90.1%, it remains unclear whether removing these low-importance features would significantly affect model performance. To investigate this issue, three feature set configurations were compared: the Full Feature Set (13 features), the 11-Feature Set, and the 9-Feature Set.
All three configurations are evaluated using XGBoost under the same experimental conditions.
Table 12 and
Figure 11 present the performance comparison, and
Figure 12 shows the corresponding ROC curves.
As shown in
Table 12,
Figure 11, and
Figure 12, the Full Feature Set consistently outperforms the reduced feature sets across all evaluation metrics. The F1-Score decreases from 0.68 (Full Feature Set) to 0.63 (11-Feature Set) and further to 0.60 (9-Feature Set), with similar declines observed in Recall, Precision, Accuracy, and AUC. These results indicate that removing the low-importance features leads to noticeable performance degradation. Therefore, to maximize prediction performance, the Full Feature Set (13 features) is adopted for all subsequent experiments in this study.
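The cumulative contribution rate used to define the reduced feature sets can be computed as in the following minimal sketch. The feature names and importance values are hypothetical placeholders in arbitrary units, not the scores from Section 4.3.

```python
from itertools import accumulate


def cumulative_contribution(importances):
    """Return (feature, cumulative share) pairs, most important feature first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    running = accumulate(score for _, score in ranked)
    return [(name, cum / total) for (name, _), cum in zip(ranked, running)]


def top_k_for_threshold(importances, threshold):
    """Smallest number of top-ranked features whose cumulative share reaches threshold."""
    for k, (_, share) in enumerate(cumulative_contribution(importances), start=1):
        if share >= threshold - 1e-12:  # small tolerance for floating-point error
            return k
    return len(importances)


# Hypothetical importance scores (arbitrary units) for four placeholder features.
scores = {"feat_a": 40, "feat_b": 30, "feat_c": 20, "feat_d": 10}

for name, share in cumulative_contribution(scores):
    print(f"{name}: {share:.1%}")
print("features needed for 90%:", top_k_for_threshold(scores, 0.90))
```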
5.3. Comparison of Different Baseline Models
Using the Full Feature Set (13 features) identified in
Section 5.2, six baseline models are compared to identify the most suitable model for LC decision prediction. All models are evaluated under the same experimental conditions.
Table 13 presents the performance comparison.
As shown in
Table 13, KNN achieves the highest F1-Score (0.79) and AUC (0.97), significantly outperforming all other models. GradientBoosting and XGBoost show comparable performance (F1 = 0.68), while RF (0.58), AdaBoost (0.48), and LR (0.44) show relatively modest performance.
Since KNN is sensitive to feature scales, all variables were standardized using StandardScaler prior to model training. The scaler was fitted on the training set only and then applied to both the training and testing sets to avoid data leakage. Additionally, to address KNN’s sensitivity to class distribution and local sample density, SMOTE+Tomek was applied exclusively to the training set to balance the classes, and the weights = 'distance' setting was used to give greater influence to closer neighbors.
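A minimal sketch of this leakage-free scaling step, assuming scikit-learn's `StandardScaler` and `KNeighborsClassifier`, is given below. The data are synthetic, the resampling step is omitted for brevity, and `n_neighbors=5` is a placeholder rather than the tuned value from Table 11.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the LC dataset (13 features, imbalanced classes).
X, y = make_classification(
    n_samples=1000, n_features=13, n_informative=6,
    weights=[0.87, 0.13], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then reuse its statistics for the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Distance weighting gives closer neighbors more influence on the vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # n_neighbors is a placeholder
knn.fit(X_train_s, y_train)
y_pred = knn.predict(X_test_s)

print("test accuracy:", round(knn.score(X_test_s, y_test), 3))
```

Because the test set is transformed with training-set means and variances, its standardized features are generally not exactly zero-mean, which is expected.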
To evaluate KNN’s performance on the original (unbalanced) test set, the confusion matrix and class-wise metrics are reported in
Table 14 and
Table 15 below.
As shown in
Table 14, the KNN classifier correctly identified 1969 out of 2088 lane-keeping instances (94.3%) and 278 out of 308 LC instances (90.3%), resulting in an overall accuracy of 93.8%. The misclassification pattern reveals that 119 lane-keeping samples were falsely flagged as LC (false positives), while only 30 LC samples were missed (false negatives).
Table 15 presents the class-wise performance metrics. For the minority LC class, KNN achieves a recall of 0.90, indicating that 90% of actual LCs are successfully detected, a critical property for safety-oriented autonomous driving systems. The precision of 0.70 implies that 70% of predicted LCs are correct, so 30% of LC predictions are false alarms; relative to the lane-keeping class, this corresponds to 119 of 2088 samples (about 5.7%) being falsely flagged. The macro-average F1-score of 0.88, which weights both classes equally, indicates that the model maintains balanced performance despite the substantial class imbalance (LC accounts for only 13% of test samples). These results demonstrate that KNN, combined with appropriate feature normalization and imbalance processing (SMOTE+Tomek applied exclusively to the training set), generalizes effectively to the original imbalanced test data.
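The class-wise metrics can be re-derived from the confusion-matrix counts reported above. The following sketch uses only those four counts; the helper name `precision_recall_f1` is introduced here for illustration.

```python
# Confusion-matrix counts reported for KNN on the original (unbalanced) test set.
TN, FP = 1969, 119  # lane-keeping samples: correctly kept / falsely flagged as LC
FN, TP = 30, 278    # LC samples: missed / correctly detected


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# LC as the positive class.
lc_precision, lc_recall, lc_f1 = precision_recall_f1(TP, FP, FN)
# Lane-keeping as the positive class: the roles of the error counts swap.
lk_precision, lk_recall, lk_f1 = precision_recall_f1(TN, FN, FP)

accuracy = (TP + TN) / (TP + TN + FP + FN)
macro_f1 = (lc_f1 + lk_f1) / 2
share_of_wrong_lc_predictions = FP / (TP + FP)   # 1 - precision
share_of_lane_keeping_flagged = FP / (FP + TN)   # false positive rate

print(f"LC precision={lc_precision:.2f} recall={lc_recall:.2f} F1={lc_f1:.2f}")
print(f"accuracy={accuracy:.3f} macro-F1={macro_f1:.2f}")
print(f"wrong LC predictions={share_of_wrong_lc_predictions:.2f}, "
      f"lane-keeping flagged={share_of_lane_keeping_flagged:.3f}")
```

The last two quantities answer different questions: the first is the share of LC predictions that are wrong, while the second is the share of lane-keeping samples that trigger an LC prediction.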
Several observations can be made from these results. First, KNN’s strong performance suggests that instance-based learning is particularly well-suited for LC decision prediction in this dataset. Unlike tree-based models that partition the feature space globally, KNN makes local decisions based on neighboring samples. This is advantageous for LC prediction, where the decision boundary between lane-keeping and LC may be irregular and context-dependent. KNN makes no prior assumptions about data distribution and can effectively capture local patterns in the feature space, which may explain its superior performance. Second, GradientBoosting and XGBoost show nearly identical performance, which is expected because both are boosting-based ensemble methods that build models sequentially by correcting previous errors. Third, the low performance of LR (F1 = 0.44) suggests that a linear decision boundary is insufficient for capturing the complexity of LC decisions. The superior performance of KNN and the tree-based models indicates that nonlinear relationships play an important role in this task. Based on these results, KNN is selected as the primary model for subsequent analysis due to its superior predictive performance.
5.4. Contribution Analysis of Multi-Vehicle Expectation Factors
To further validate the contribution of multi-vehicle information factors, several feature configurations are compared using KNN as the base classifier:
Model A: physical factors + Safety Factors
Model B: physical factors + Safety Factors + average velocity of the target lane
Model C: physical factors + Safety Factors +
Model D: physical factors + Safety Factors +
Model E: physical factors + Safety Factors +
Model F: physical factors + Safety Factors + all multi-vehicle Information Factors
Model A serves as the baseline, including only physical and safety factors (nine features). Models B, C, D, and E add individual multi-vehicle information features to the baseline. Model F includes all four expectation features (13 features).
Table 16 presents the performance comparison.
As shown in
Table 16, different multi-vehicle expectation features contribute differently, but all improve prediction performance. First, each single-feature addition (Models B–E) achieves a higher F1-score than the baseline Model A (0.70), with improvements ranging from 0.02 to 0.06. This indicates that all four expectation features, which cover speed and spacing information from both the current and target lanes, provide valuable predictive information for LC decisions. Second, the average velocity of the target lane is the most valuable individual expectation feature. Model B, which adds this feature, achieves the highest F1-score (0.76) among all single-feature addition models, which suggests that the traffic efficiency of the target lane (measured by average speed) is the primary consideration for drivers when making LC decisions.
5.5. Generalization Assessment: Trajectory-Level Split
The preceding experiments used random sample-level splitting, where individual time steps from the same vehicle trajectory could appear in both the training and testing sets. While this approach maximizes sample size for model comparison, it may lead to optimistic performance estimates due to potential data leakage. To assess how well the models generalize to unseen vehicles and to address this methodological concern, an additional experiment using trajectory-level splitting was conducted.
In this evaluation protocol, all time steps belonging to the same vehicle trajectory (identified by the combination of vehicle ID and time period) were assigned exclusively to either the training set (80% of trajectories) or the testing set (20% of trajectories). This ensures that the model is evaluated on entire maneuvers from vehicles it has never seen during training, providing a more realistic assessment of generalization performance.
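A trajectory-level split of this kind can be sketched in a few lines. The example below is a minimal illustration, assuming each sample is a dictionary with hypothetical keys `vehicle_id` and `period` that together identify a trajectory; the function name `trajectory_split` and the toy data are likewise introduced here for illustration.

```python
import random


def trajectory_split(samples, test_fraction=0.2, seed=42):
    """Split samples so that each trajectory lands entirely in train or in test."""
    # A trajectory is identified by the (vehicle ID, time period) pair.
    keys = sorted({(s["vehicle_id"], s["period"]) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [s for s in samples if (s["vehicle_id"], s["period"]) not in test_keys]
    test = [s for s in samples if (s["vehicle_id"], s["period"]) in test_keys]
    return train, test


# Toy data: 10 trajectories with 5 time steps each.
samples = [
    {"vehicle_id": v, "period": 1, "t": t}
    for v in range(10)
    for t in range(5)
]

train, test = trajectory_split(samples)
train_ids = {(s["vehicle_id"], s["period"]) for s in train}
test_ids = {(s["vehicle_id"], s["period"]) for s in test}

print("train trajectories:", len(train_ids), "test trajectories:", len(test_ids))
print("shared trajectories:", len(train_ids & test_ids))
```

Note that the 80/20 ratio applies to trajectories, so the ratio of individual time steps can deviate from it when trajectories differ in length.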
Table 17 presents the performance of all six baseline models under this trajectory-level split, with SMOTE+Tomek applied for imbalance processing and hyperparameters optimized via grid search.
Table 18 summarizes the performance differences between random splitting and trajectory-level splitting for direct comparison.
As shown in
Table 17, GradientBoosting achieves the best overall performance under trajectory-level splitting (F1 = 0.47, AUC = 0.83), followed by RandomForest (F1 = 0.45) and XGBoost (F1 = 0.45). Compared with the random splitting results reported in
Section 5.3, several observations can be made.
First, the performance gap reveals substantial driver heterogeneity. Under random splitting, KNN achieved an F1-score of 0.79, while under trajectory-level splitting, the best-performing model (GradientBoosting) achieved an F1-score of only 0.468—a relative decrease of approximately 41%. This finding aligns with recent safety modeling studies on driver takeover behavior. Shao et al. [
42] demonstrated that aggressive drivers respond faster to takeover requests but exhibit reduced post-takeover stability, while cautious drivers show the opposite pattern. This parallel suggests that LC decisions, like takeover responses, are highly driver-specific and may require personalized modeling approaches for accurate prediction across diverse driver populations.
Second, the relative ranking of models changes under trajectory-level splitting. GradientBoosting and RandomForest outperform XGBoost and KNN, suggesting that tree-based ensemble methods may generalize better to unseen drivers than instance-based methods like KNN, which are more sensitive to vehicle-specific feature distributions.
Together, these findings indicate that LC behavior exhibits significant driver heterogeneity and that tree-based ensemble methods are more suitable than KNN for generalization to unseen drivers.
From a practical deployment perspective, the trajectory-level split results offer two implications. First, random sample-level splitting risks over-optimistic performance estimates. In real-world deployment, models encounter entirely new vehicles unseen during training. Trajectory-level splitting, which assigns all time steps of a given vehicle trajectory exclusively to either the training set or the testing set, simulates this scenario more faithfully and provides a more realistic estimate of generalization performance. Second, the substantial performance drop under trajectory-level splitting suggests that driver heterogeneity may be a relevant factor. This finding motivates future investigation into whether personalized or driver-adaptive modeling approaches could further enhance prediction accuracy for individual drivers. Taken together, these results support trajectory-level data partitioning in model evaluation and motivate the development of adaptive LC prediction systems.