1. Introduction
Traffic crashes remain a major concern for all road users, often resulting in serious injuries, fatalities, and significant economic costs. Beyond the immediate impact, these incidents can trigger secondary crashes, traffic congestion, and widespread disruptions. Accurate prediction of traffic collisions is essential for enhancing road safety and improving traffic management. As vehicular traffic continues to grow, the need for reliable crash prediction becomes increasingly critical to reducing injuries and fatalities. Timely collision prediction systems can provide drivers and safety mechanisms with crucial lead time to react, potentially preventing crashes and saving lives. Advanced driver-assistance systems, such as automated emergency braking and driver alertness monitoring [1, 2], can greatly benefit from predictive models that forecast imminent collisions. Consequently, the development of accurate traffic crash prediction methods has attracted growing attention from the research community [3, 4, 5, 6, 7, 8].
Collision avoidance systems have become a focal point in traffic safety research. These systems leverage a range of sensors to detect potential hazards around a vehicle and either alert the driver or automatically initiate preventive maneuvers. For example, Al-Smadi et al. [1] proposed an intelligent collision avoidance and safety system that uses ultrasonic sensors to detect imminent crashes by measuring inter-vehicle distances and activating safety protocols when a collision trajectory is identified. Similarly, Tan et al. [2] investigated the effectiveness of Automatic Emergency Braking (AEB) systems, reporting a significant reduction in accident severity, fatalities, and injuries following their deployment. Recent innovations have further advanced the capabilities of these systems. Adaptive collision avoidance switching systems [9] have been introduced to enhance system flexibility under varying traffic scenarios. Additionally, the integration of laser radar with Controller Area Network (CAN) systems has enabled more comprehensive in-vehicle communication and real-time collision warnings [10].
Timely collision prediction also offers substantial benefits for traffic management. By anticipating potential collisions, traffic control systems can reroute traffic in real time, mitigating congestion and improving the overall efficiency of road networks. This not only shortens travel times for road users but also reduces the economic losses associated with traffic delays caused by crashes. Furthermore, integrating collision prediction into traffic management enables the development of more effective control strategies that promote smoother traffic flow and help lower vehicle emissions, contributing to broader environmental goals [11]. A growing body of research has focused on predicting traffic collisions to enhance both safety and operational efficiency [12, 13, 14, 15, 16]. In this context, machine learning has emerged as a powerful tool, evolving from traditional statistical methods to advanced models capable of capturing non-linear patterns and complex spatiotemporal dynamics in traffic environments [4, 7, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23].
Deep learning techniques have further advanced traffic crash prediction by enabling the analysis of complex, high-dimensional, heterogeneous data. Prior research has consistently demonstrated the ability of deep learning models to capture intricate patterns in multi-dimensional traffic data and to model complex spatial and temporal relationships within traffic networks [3, 4, 5, 6, 7, 8, 24, 25, 26, 27, 28, 29]. For example, Chavan et al. [3] introduced COLLIDE-PRED, a real-time collision prediction system that leverages surveillance video and computer vision techniques. Wang et al. [5] highlighted the robustness and high accuracy of deep learning models across diverse traffic scenarios and environmental conditions. Other noteworthy approaches include the integration of Long Short-Term Memory (LSTM) networks with Gradient Boosting algorithms [25], the application of Convolutional Neural Networks (CNNs) [27], and various hybrid approaches [30, 31, 32, 33], all of which have demonstrated enhanced performance and adaptability in traffic safety analysis.
To more comprehensively address the challenges of traffic crash prediction, recent models have increasingly adopted hybrid and mobile approaches, combining complementary techniques to boost performance and adaptability. For example, the RFCNN model [30] integrates Random Forest and CNN to improve the prediction of crash severity. The Hetero-ConvLSTM [31] captures both spatial dependencies and temporal dynamics in traffic data, effectively modeling evolving traffic patterns. In parallel, the emergence of mobile deep-learning architectures has opened new avenues for real-time, edge-based crash prediction. Notably, models based on the MobileNet architecture [32] have demonstrated the feasibility of performing on-device crash severity prediction, offering a lightweight yet powerful solution suitable for in-vehicle deployment. These advancements highlight the growing potential of hybrid and mobile deep-learning-based systems to support proactive, scalable, and context-aware traffic safety interventions.
Traditional traffic crash prediction methods often rely on historical accident records and static variables, such as road conditions, aggregated traffic volumes, and weather data, which may fail to capture the dynamic, real-time nature of traffic environments. In contrast, the emergence of Graph Attention Networks (GATs) [34] offers a powerful new paradigm for enhancing predictive accuracy by explicitly modeling interactions among nodes. The GAT architecture leverages an attention mechanism to selectively weigh the importance of neighboring nodes and their features, allowing the model to focus on the most relevant and influential interactions. GATs have been applied for speed and traffic forecasting [35, 36]. A key strength of GATs lies in their ability to integrate and process rich, heterogeneous data [37]. Moreover, GATs can effectively incorporate contextual information from various sources, including road geometry, traffic signals, and environmental conditions [38, 39]. This enables the model to understand the richer context, leading to more accurate predictions.
Recent studies have demonstrated the effectiveness of GATs across various domains, particularly in their ability to process multi-dimensional data and uncover complex relationships within structured environments. Their application to traffic systems has shown great promise in modeling dynamic environments such as traffic on road networks [40, 41, 42, 43, 44]. Leveraging GATs for traffic collision prediction not only enhances predictive performance but also offers a flexible framework for incorporating emerging technologies and diverse data sources into intelligent traffic safety and management systems.
Evolving technologies, such as Advanced Driver-Assistance Systems (ADAS) and Connected and Autonomous Vehicles (CAV), continually produce detailed, real-time data on vehicle speed, position, trajectory, and driver behavior. GATs are well suited to synthesize these rich, high-resolution data sources into a dynamic graph representation of the traffic scene, enabling the model to capture transient, subtle, and often critical interactions that traditional methods fail to capture.
In this study, we propose a novel GAT-based framework for predicting potential traffic collisions, representing the first work to leverage graph structures for modeling real-time interactions among road users. By modeling vehicles and other road users as nodes and their interactions as edges on a dynamic graph, our approach learns to recognize spatiotemporal patterns and interdependencies of road users that frequently precede collisions. This interaction-centric perspective enables more context-aware and timely assessments of collision risk and supports the development of more responsive safety interventions and traffic management strategies.
2. Dataset Description
The data used in this study come from an open-source dataset, DeepAccident [45]. The DeepAccident dataset was created with the CARLA simulator [46] and reflects diverse real-world traffic collision scenarios. It spans seven town maps in CARLA, each including several traffic intersections where accidents were simulated and collision data were collected. The seven town maps cover a wide range of scenes, such as urban streets, highways, and rural areas. Signalized intersections are of two types, four-way and three-way, while unsignalized intersections additionally include merging ramp intersections on the expressway. An example of an unsignalized four-way intersection from the CARLA ‘Town03’ map is shown in Figure 1.
There is a total of 691 simulated scenarios in the DeepAccident dataset. However, only 196 scenarios involve actual collisions. Therefore, we used only these collision scenarios for model training and evaluation.
The collision scenarios were captured by both video frames and ground truth labels, where the frequency of videos is 10 frames per second. In this study, we only used the ground truth annotations of vehicles for both training and testing our collision prediction model. The ground truth annotations for collision videos include Frame ID (starting from index number 0), Object ID (individually assigned for each traffic object), Object Category (including Car, Van, Truck, Cyclist, and Pedestrian), Object Location (coordinates x and y in meters), and Object Velocity (absolute speed in meters per second). It should be noted that all collisions in the dataset are between two road users.
The collision events were categorized into 8 types: Vehicle–Pedestrian Collision, Rear End, Switch Lane, T-Bone, Opposite Frontal Impact, Opposite Merging, Angle Merging, and Right-Vehicle-Turn-Left. Specifically, Vehicle–Pedestrian Collision refers to an impact between a motor vehicle and a pedestrian. Rear End collisions occur when the front of a following vehicle strikes the rear of a leading vehicle, often due to sudden braking by the leading vehicle. Switch Lane collisions occur when a vehicle changes lanes without yielding to another vehicle already traveling in the target lane.
A T-Bone collision occurs at an intersection when one vehicle strikes the side of another vehicle while both vehicles are traveling straight through the intersection. Opposite Frontal Impact and Opposite Merging collisions involve vehicles approaching from opposite directions. When one vehicle travels straight while the opposing vehicle travels straight or makes a left turn, the resulting head-on crash is classified as an Opposite Frontal Impact. When both vehicles turn into the same roadway from opposite directions, the collision is classified as an Opposite Merging collision.
Angle Merging and Right-Vehicle-Turn-Left collisions involve vehicles approaching from perpendicular directions. An Angle Merging collision occurs when both vehicles merge into the same roadway. A Right-Vehicle-Turn-Left collision occurs when one vehicle travels straight or turns left, while another vehicle approaching from its right side makes a left turn, leading to a collision.
Figure 2 illustrates the representative trajectories of colliding vehicles for all collision types except Vehicle–Pedestrian Collision and Rear End. Each pair of blue and orange arrows represents a collision scenario between two vehicles.
These collision categories provide insightful information on colliding vehicle behaviors. Therefore, training and testing datasets were created for each collision type, resulting in 161 collision scenes for training and 35 collision scenes for testing. The distribution of collision scenes among collision types is shown in Figure 3.
Angle Merging and Right-Vehicle-Turn-Left have the greatest numbers of collision scenes, while Vehicle–Pedestrian has only 8 collision scenes in total. Splitting the dataset based on collision categories ensures that the testing set shares a similar collision type distribution with the training set.
3. Dynamic Graph Attention Networks
For the traffic collision prediction task, we construct a graph representation that captures the dynamic states and interactions among road users across sequential frames in a traffic video. It should be noted that the proposed method is not restricted to video data and can be applied to CAV data. In this study, we used the public dataset extracted from simulated videos for demonstrating our approach. Specifically, for a given reference frame at timestamp T, we consider a sequence of preceding frames within a defined temporal window to model the evolving scene context. Within this window, a dynamic graph is constructed where nodes represent individual road users and edges encode their spatiotemporal relationships. This graph is processed by a GAT, which learns informative embeddings for each node by attending to relevant neighboring nodes and capturing their dynamics. The learned embeddings encode both motion patterns and inter-user interactions that could signal potential collision risk. Finally, the embeddings are passed to a binary classifier to predict the likelihood of a collision. The overall process is illustrated in Figure 4.
GATs represent a special neural network architecture designed to handle graph-structured data by leveraging an attention mechanism to model relationships and interactions between nodes. The core structure of our GAT includes multiple layers, each consisting of nodes that represent road user entities (e.g., vehicles and pedestrians) and edges that represent relationships (e.g., interactions between road users). In each layer, an attention mechanism computes the importance of neighboring nodes, allowing the network to focus on the most relevant neighbors based on both node and edge attributes. As shown in Figure 5, this mechanism involves computing attention coefficients for each pair of connected nodes, normalizing them, and using these coefficients to weigh the influence of neighboring nodes. The attention coefficients are computed by Equation (1).
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \quad (1)$$
where
$\mathcal{N}(i)$: the set of neighbors for Node $i$
$\alpha_{ij}$: attention weights between Node $i$ and Node $j$
$e_{ij}$: attention scores between Node $i$ and Node $j$
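The attention score $e_{ij}$ is computed by Equation (2).
$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}\mathbf{h}_i \,\Vert\, \mathbf{W}\mathbf{h}_j \,\Vert\, \mathbf{W}_e \mathbf{f}_{ij}\right]\right) \quad (2)$$
where
$\mathbf{f}_{ij}$: edge attributes between Node $i$ and Node $j$
$\mathbf{a}$: attention weight matrix
$\mathbf{h}_i$: embedding for Node $i$
$\mathbf{W}$, $\mathbf{W}_e$: learnable parameters for nodes and edges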
The attention scores are obtained from both node embeddings and edge attributes (Equation (2)) and are aggregated in a way that prioritizes critical interactions. This structure enables the GAT to dynamically adjust to complex and changing environments, making it particularly effective for applications like traffic collision predictive modeling, where the relationships between nodes (road users) are crucial and constantly evolving. The computation of embeddings for an example node, Node 1 (N1), is illustrated by Figure 5.
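The attention step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the score form `LeakyReLU(a . [h_i || h_j || f_ij])` is a common edge-aware GAT formulation rather than a confirmed detail of this study, the embeddings are assumed to be already projected by the learnable weight matrices, and all function names (`raw_score`, `attention_coefficients`, `aggregate`) are hypothetical.

```python
import math


def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU non-linearity applied to a scalar."""
    return x if x > 0 else negative_slope * x


def raw_score(a, h_i, h_j, f_ij):
    """Unnormalized attention score from two node embeddings and one edge
    attribute vector (assumed form: LeakyReLU(a . [h_i || h_j || f_ij]))."""
    concat = list(h_i) + list(h_j) + list(f_ij)
    return leaky_relu(sum(w * x for w, x in zip(a, concat)))


def attention_coefficients(scores):
    """Softmax over {neighbor_id: raw score} for one target node."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - peak) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def aggregate(neighbor_embeddings, alphas):
    """Attention-weighted sum of the neighbor embeddings."""
    dim = len(next(iter(neighbor_embeddings.values())))
    return [
        sum(alphas[j] * h[k] for j, h in neighbor_embeddings.items())
        for k in range(dim)
    ]
```

For a node with two neighbors, calling `attention_coefficients` on their raw scores yields weights that sum to one, and `aggregate` then mixes the neighbor embeddings accordingly. A multi-head variant would repeat this with separate parameters per head and concatenate the results.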
Figure 5.
Computing the embeddings for Node 1 with the multi-head self-attention mechanism. (a) Road user graph representation; (b) Concatenating multi-head self-attentions.
3.1. Graph Generation
A graph is composed of a set of nodes (also called vertices) $V$ and a set of edges $E$. The nodes represent entities, while the edges represent relationships between these entities. Formally, a graph is denoted as $G = (V, E)$. Each node can represent various entities depending on the application, such as individuals in a social network or intersections/segments in a road network. Each edge represents a relationship between a pair of nodes, which can be directed or undirected, and may have associated weights to indicate the strength or capacity of the connection. In the context of traffic crash prediction, the graph can be structured with nodes representing road users, and edges representing interactions between vehicles (e.g., proximity, communication). The adjacency matrix typically represents the connection status between nodes, with non-zero entries indicating the presence of edges. Each node or edge can have associated features.
For traffic collision prediction, each node represents a road user, and each edge encodes the interaction between a pair of road users. Node features describe object-level characteristics, including spatial location $(x, y)$, object size $(w, h)$, velocity components $(v_x, v_y)$, and object category (i.e., Car, Van, Truck, Motorcycle, Cyclist, or Pedestrian).
The adjacency matrix, which determines the existence of edges between nodes, is constructed based on two spatiotemporal proximity criteria that must be satisfied simultaneously. Specifically, an edge is established between two nodes if (1) the minimum Euclidean distance $D$ between the corresponding road users is smaller than a predefined spatial threshold $D_{th}$, and (2) the time $t$ required for the two road users to reach this minimum distance is less than a predefined temporal threshold $t_{th}$. When both criteria are met, the interaction is considered collision-relevant and an edge is created in the graph. Let $\Delta\mathbf{p}$ and $\Delta\mathbf{v}$ denote the relative position and relative velocity vectors between two road users, respectively. The angle θ between these vectors characterizes their interaction state, indicating whether the two road users are approaching or departing, as illustrated in Figure 6.
Specifically, when θ is acute, the projection of the relative velocity onto the relative position vector is negative, implying that the two road users are moving closer to each other and will reach their minimum separation at a future time. In contrast, when θ is a right angle or obtuse, the projection is non-negative, indicating that the minimum distance has already been reached and the road users are moving apart.
It is worth noting that when $\lVert\Delta\mathbf{v}\rVert = 0$, the two road users are relatively stationary; consequently, their mutual distance remains constant over time. Based on the geometric relationships illustrated in Figure 6, the minimum distance $D$ is computed by Equation (3) or Equation (4), depending on whether the relative speed satisfies $\lVert\Delta\mathbf{v}\rVert = 0$. The corresponding time $t$ to reach this minimum distance is calculated by Equation (5).
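The geometry behind the minimum distance and the time to reach it can be worked out with a standard closest-point-of-approach argument. The derivation below is a sketch under assumptions that are not taken from the source: both road users are assumed to keep constant velocity, and the relative vectors are defined as $\Delta\mathbf{p} = \mathbf{p}_j - \mathbf{p}_i$ and $\Delta\mathbf{v} = \mathbf{v}_j - \mathbf{v}_i$. Under this sign convention the two users approach each other when $\Delta\mathbf{p}\cdot\Delta\mathbf{v} < 0$; the angle θ in Figure 6 may be measured against the opposite relative vector, so the sign wording can differ.

```latex
\begin{align*}
d(\tau)^2 &= \lVert \Delta\mathbf{p} + \tau\,\Delta\mathbf{v} \rVert^2
  = \lVert\Delta\mathbf{p}\rVert^2
  + 2\tau\,(\Delta\mathbf{p}\cdot\Delta\mathbf{v})
  + \tau^2 \lVert\Delta\mathbf{v}\rVert^2, \\
\frac{\mathrm{d}}{\mathrm{d}\tau}\, d(\tau)^2 = 0
  \;&\Rightarrow\;
  \tau^{*} = -\frac{\Delta\mathbf{p}\cdot\Delta\mathbf{v}}{\lVert\Delta\mathbf{v}\rVert^2}
  \qquad (\lVert\Delta\mathbf{v}\rVert > 0), \\
d(\tau^{*}) &= \sqrt{\lVert\Delta\mathbf{p}\rVert^2
  - \frac{(\Delta\mathbf{p}\cdot\Delta\mathbf{v})^2}{\lVert\Delta\mathbf{v}\rVert^2}}.
\end{align*}
```

When $\lVert\Delta\mathbf{v}\rVert = 0$ the separation stays at $\lVert\Delta\mathbf{p}\rVert$, and when $\tau^{*} \le 0$ the users are already moving apart. As a numerical check, $\Delta\mathbf{p} = (10, 0)$ m and $\Delta\mathbf{v} = (-5, 1)$ m/s give $\Delta\mathbf{p}\cdot\Delta\mathbf{v} = -50$ and $\lVert\Delta\mathbf{v}\rVert^2 = 26$, so $\tau^{*} \approx 1.92$ s and $d(\tau^{*}) = \sqrt{100 - 2500/26} \approx 1.96$ m.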
For graph construction, an edge is created between nodes $i$ and $j$ if both criteria, $D < D_{th}$ and $t < t_{th}$, are satisfied, where $D_{th}$ and $t_{th}$ are predefined spatial and temporal thresholds, respectively. These thresholds regulate the sparsity of the interaction graph. Larger values of $D_{th}$ and $t_{th}$ produce a denser graph by introducing more edges; however, many of these edges may correspond to spatiotemporally distant interactions and thus introduce inference noise, as such road users are unlikely to collide. Conversely, smaller thresholds yield a sparser graph by retaining only the most relevant interactions. In the limiting case, sufficiently small values of $D_{th}$ and $t_{th}$ preserve only edges associated with imminent or highly probable collision scenarios.
Beyond encoding the existence of edges through the adjacency matrix, D and t are further incorporated as continuous edge attributes. These edge features explicitly quantify the spatiotemporal proximity between interacting road users, enabling the GAT to differentiate varying levels of collision risk rather than treating all edges as equally informative. By embedding both geometric and temporal cues into edge representations, the model can more effectively learn risk-sensitive interaction patterns and capture the heterogeneous nature of potential collision scenarios.
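The edge construction and edge attributes described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: it assumes constant-velocity motion, 2D position and velocity vectors per road user, and a convention in which pairs that are relatively stationary or already moving apart receive their current distance with a time of zero. The function names, the `users` dictionary layout, and the default thresholds (20 m and 1.5 s, taken from the best-performing setting reported in the experiments) are illustrative choices.

```python
import math


def closest_approach(p_i, v_i, p_j, v_j):
    """Return (D, t): minimum future separation in meters and the time in
    seconds needed to reach it, assuming both users keep constant velocity."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dvx, dvy = v_j[0] - v_i[0], v_j[1] - v_i[1]
    speed_sq = dvx * dvx + dvy * dvy
    if speed_sq == 0.0:
        # Relatively stationary: the separation never changes.
        return math.hypot(dx, dy), 0.0
    t = -(dx * dvx + dy * dvy) / speed_sq
    if t <= 0.0:
        # Already moving apart: assumed convention, closest point is now.
        return math.hypot(dx, dy), 0.0
    return math.hypot(dx + t * dvx, dy + t * dvy), t


def build_edges(users, d_th=20.0, t_th=1.5):
    """Create an undirected edge (i, j) when D < d_th and t < t_th.

    users: {object_id: {"pos": (x, y), "vel": (vx, vy)}}
    Returns {(i, j): (D, t)} so that D and t double as edge attributes.
    """
    edges = {}
    ids = sorted(users)
    for a, i in enumerate(ids):
        for j in ids[a + 1:]:
            D, t = closest_approach(
                users[i]["pos"], users[i]["vel"],
                users[j]["pos"], users[j]["vel"],
            )
            if D < d_th and t < t_th:
                edges[(i, j)] = (D, t)
    return edges
```

Raising `d_th` or `t_th` in this sketch adds edges between more distant pairs, which mirrors the density and noise trade-off discussed above.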
3.2. Consideration of Sequential Information
In this study, two representation strategies are evaluated. The first strategy uses a single frame, in which graph features are extracted solely from the video frame at timestamp T, denoted as $F_T$. As illustrated in Figure 7, node features, the adjacency matrix, and edge features are all derived from this single frame, resulting in a purely spatial interaction graph at time T.
The second strategy explicitly incorporates temporal dynamics by leveraging a short sequence of four consecutive frames, including the previous three frames $F_{T-3}$, $F_{T-2}$, $F_{T-1}$, and the current frame $F_T$. Graph nodes are defined based on the road users detected in the current frame, ensuring a consistent node set for prediction at timestamp T. For each node, the corresponding locations and velocities at timestamps $T-3$, $T-2$, $T-1$, and $T$ are concatenated to form an input sequence to a node-wise Long Short-Term Memory (LSTM) network, which is applied to each node separately. Each time step in the sequence represents the road user’s state at a specific timestamp and includes location and velocity as features.
The LSTM produces an output sequence with the same dimensionality as the input (dimension = 6), effectively encoding the spatiotemporal evolution of each road user. The resulting LSTM embeddings are then concatenated with the object category to construct the final node features of the graph, as shown in Figure 8. The adjacency matrix and edge features are still constructed from the current frame $F_T$. Through this integration of LSTM-based temporal encoding, sequential motion information is embedded into the graph representation, enabling more expressive modeling of dynamic interactions among road users.
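The assembly of per-node input sequences for the multi-frame strategy can be sketched as below. This covers only the data preparation step that precedes the LSTM, not the LSTM itself. The function name and frame layout are hypothetical, the per-step state is treated as an opaque tuple, and the handling of road users that are missing from an earlier frame (reusing their earliest known state) is an assumption, since the text does not specify it.

```python
def build_node_sequences(frames, window=4):
    """Collect a fixed-length state sequence for every road user that is
    present in the current (last) frame.

    frames: list of {object_id: state_tuple}, ordered oldest to newest.
    Returns {object_id: [state at T-3, ..., state at T]} for window=4.
    """
    recent = frames[-window:]
    current = recent[-1]
    sequences = {}
    for obj_id, state_now in current.items():
        seq = []
        last_known = state_now
        # Walk backwards in time; if the user is missing from a frame,
        # reuse the nearest later state (assumed padding rule).
        for frame in reversed(recent):
            last_known = frame.get(obj_id, last_known)
            seq.append(last_known)
        # Pad when fewer than `window` frames are available.
        while len(seq) < window:
            seq.append(last_known)
        seq.reverse()
        sequences[obj_id] = seq
    return sequences
```

Each resulting sequence has one entry per timestamp and could be fed to a node-wise recurrent encoder; road users that appear only in earlier frames are dropped, which keeps the node set tied to the current frame as described above.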
3.3. Graph Attention Modeling
Following graph construction, a Graph Attention Network (GAT) is employed to learn latent, risk-aware representations of road users and their interactions. The GAT leverages attention mechanisms to assign adaptive importance weights to neighboring nodes, enabling the model to selectively emphasize interactions that are more relevant to collision risk. As shown in Figure 4, the node-level embeddings produced by the GAT are aggregated into a graph-level representation, which is subsequently fed into a binary classification head to predict whether a collision would occur.
To enable effective discrimination between collision-prone and non-collision scenarios, a contrastive labeling strategy is adopted in which safe and dangerous graphs are constructed from the same video sequence but at different temporal offsets relative to the collision event. This design minimizes scene-specific confounding factors (e.g., road geometry, lighting conditions, etc.) and encourages the model to focus on interaction dynamics that evolve as a collision becomes imminent.
The collision prediction task is formulated as a binary classification problem, where the GAT and the classification head are trained end-to-end to estimate the probability of a future collision given a graph representation at time T. A key methodological consideration is the selection of the lead time to collision, which governs the assignment of graph-level labels. The lead time must balance two competing objectives: (1) enabling sufficiently early prediction to support preventive interventions, and (2) preserving discriminative spatiotemporal cues indicative of collision risk.
Consistent with prior findings that average driver reaction time can be as high as 1.5 s [47, 48], a lead time of 1.5 s is adopted in this study. Accordingly, during model training, a video frame occurring 1.5 s prior to the collision, and its corresponding interaction graph, is labeled as dangerous. A temporally earlier frame, captured 4 s before the dangerous frame, is labeled as safe. While shorter lead times typically yield higher classification accuracy due to the increased availability of collision-relevant features, they offer limited practical value for early warning. The adopted lead time therefore reflects a principled trade-off between predictive performance and operational relevance.
Figure 9 illustrates the labeling procedure for safe and dangerous frames within a video sequence.
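The labeling rule can be expressed as simple frame arithmetic. The sketch below assumes the 10 frames-per-second rate stated in the dataset description and zero-based frame IDs; the function name and the choice to return `None` when a video is too short to contain a safe frame are illustrative assumptions rather than details from the study.

```python
def label_frames(collision_frame, fps=10, lead_time_s=1.5, safe_gap_s=4.0):
    """Return the frame indices labeled dangerous and safe for one video.

    The dangerous frame lies `lead_time_s` before the collision, and the
    safe frame lies `safe_gap_s` before the dangerous frame.
    """
    dangerous = collision_frame - round(lead_time_s * fps)
    safe = dangerous - round(safe_gap_s * fps)
    if safe < 0:
        # Assumed handling: the video is too short for a safe sample.
        return None
    return {"dangerous": dangerous, "safe": safe}
```

For a collision at frame 80, the dangerous frame is frame 65 (15 frames earlier) and the safe frame is frame 25 (a further 40 frames earlier). Changing `lead_time_s` to 0.5 or 1.0 reproduces the shorter lead times examined in the sensitivity analysis.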
4. Experiments
The models were implemented using PyTorch 1.10.2 and PyTorch Geometric 1.7.2, along with supporting libraries such as scikit-learn 1.0.2 and pandas 1.3.5. All models were trained and tested on an NVIDIA RTX A6000 GPU (NVIDIA, Santa Clara, CA, USA). The experimental design systematically compares the two graph construction strategies, single-frame and multi-frame (four-frame) temporal encoding, under the same modeling framework consisting of a GAT followed by a binary classification head. For clarity, the node and edge features are summarized in Table 1, while the model architecture and training settings are presented in Table 2.
In addition, we investigate the sensitivity of the proposed approach to key hyperparameters, including the preset lead time to collision and the spatiotemporal thresholds used for graph construction, namely the spatial distance threshold $D_{th}$ and the temporal threshold $t_{th}$. Table 3 presents a summary of the model prediction accuracy under different experimental settings.
Sequential-frame graph construction outperforms the single-frame approach across nearly all parameter settings and lead times. The accuracy gains are substantial, ranging from 4 to 10 percentage points, demonstrating that incorporating short-term temporal dynamics via sequential frames provides more discriminative information than relying on instantaneous snapshots alone. This indicates that collision precursors are inherently temporal and benefit from motion-history encoding.
With respect to the lead time to collision, prediction accuracy consistently decreases across both graph construction strategies as the lead time increases from 0.5 s to 1.5 s. This trend is expected, as shorter lead times (0.5 s) capture stronger and more explicit collision cues, such as rapidly converging trajectories. In contrast, longer lead times (1.5 s) require the model to infer risk from weaker, more uncertain interaction signals. Despite this challenge, the sequential-frame GAT model maintains relatively high accuracy even at a lead time of 1.5 s (80.1%), highlighting its robustness for early collision warning.
Regarding spatial and temporal thresholds, configurations with a spatial threshold of 20 m and a temporal threshold of 1.5 s yield the highest prediction accuracies (86.1%, 82.9%, and 80.1%). These settings strike a balance between interaction coverage (i.e., capturing all potentially relevant interactions) and noise suppression (i.e., excluding edges between spatiotemporally distant road users unlikely to collide).
For comparison, overly large temporal thresholds tend to degrade performance. For instance, increasing the temporal threshold to 3 s consistently lowers accuracy for both models. This suggests that interactions too far into the future dilute collision-relevant signals and introduce inference noise, confirming the importance of temporal selectivity in graph construction. Conversely, overly small spatial and temporal thresholds risk excluding critical contextual information. The most restrictive setting (a spatial threshold of 10 m and a temporal threshold of 0.5 s) results in unstable or degraded performance, particularly for longer lead times. Although such tight thresholds retain only imminent interactions, they may exclude early yet informative precursors that are essential for advance prediction.
Overall, the sequential approach demonstrates greater robustness to parameter variation, especially under longer lead times and moderate-to-large thresholds. These results demonstrate that collision risk is best captured by temporally informed, selectively connected interaction graphs.
5. Conclusions
This study proposes a novel graph-based framework for traffic collision prediction that leverages GATs to perform binary classification of potential collision events. By explicitly modeling road users as nodes and their interactions as edges, the proposed approach effectively integrates spatial and temporal user-level information into a unified graph representation. An analytical process for transforming dynamic traffic scenes into interaction graphs is developed, enabling GATs to extract latent, risk-aware features for collision prediction. By assigning adaptive attention weights to interacting road users, the GAT model selectively emphasizes high-risk interactions while suppressing less relevant ones. This adaptive focus is essential for capturing transient and rapidly evolving interactions that are often overlooked by traditional rule-based or fixed-structure models. Experimental results demonstrate that the proposed method achieves a prediction accuracy of up to 86.1%, highlighting its effectiveness in capturing collision-relevant interaction patterns. This capability enables proactive collision warnings and provides critical support for early safety interventions, with the potential to significantly reduce traffic crashes.
While the proposed approach demonstrates promising performance, several limitations point to important directions for future research. First, the current dataset is relatively small and highly imbalanced, particularly for rare yet safety-critical collision types such as vehicle–pedestrian collisions and lane-change-related crashes. This limitation restricts the model to binary collision prediction and may affect its generalizability across diverse traffic scenarios. Future efforts should focus on expanding the dataset and incorporating real-world data and a wider range of collision types, which would enable multi-class collision prediction and enhance model robustness and generalization to diverse, real-world traffic conditions. The ability to predict specific collision types would further support targeted safety interventions, inform roadway and intersection design, and assist traffic management agencies in deploying context-aware control strategies in high-risk locations.
Second, the growing availability of ADAS data and vehicle-to-everything (V2X) communication technologies provides continuous streams of high-resolution, road user-level information. These data sources can be naturally integrated into the proposed graph-based framework to construct real-time, dynamic representations of traffic scenes and further improve collision prediction accuracy.
Furthermore, the selection of spatiotemporal thresholds for graph construction is currently based on predefined values, which may not generalize optimally across varying traffic densities, road geometries, and behavioral contexts. Future research should explore adaptive or data-driven threshold selection strategies to dynamically balance interaction coverage and noise suppression, thereby improving model scalability and performance under diverse traffic conditions.
In summary, this study establishes a strong foundation for graph-based traffic collision prediction by demonstrating the effectiveness of GATs in modeling complex, dynamic road user interactions. The proposed framework shows significant promise for advancing intelligent transportation systems and proactive safety applications. Continued methodological refinements, larger and more diverse datasets, and deeper integration with emerging sensing and communication technologies can play a pivotal role in enhancing traffic safety, improving mobility, and reducing crash-related injuries and fatalities for all road users.