The middle configurations E5–E8 reveals the deeper contribution of hazard-centered modeling rather than ordinary sequence learning. Adding the continuous hazard reconstruction layer in E5 improve the F1-score from 94.69% to 95.54% on Dataset 1 and from 93.85% to 94.79% on Dataset 2, showing that explicitly reconstructing an smooth hidden-danger field provide an more discriminative internal representation than relying only on latent embeddings. The gains continue in E6 after introducing risk-diffusion feature lifting, where the F1-score rise to 96.21% and 95.42% on the two datasets, respectively, demonstrating that local danger become more predictable when corridor-level and neighborhood-level instability are embedded before inference. The transition from E6 to E7 and then to E8 are particularly important from an safety-intelligence perspective: once the prediction head begin using the hazard-escalation cue , the F1-score increase again to 97.02% on Dataset 1 and 96.23% on Dataset 2, indicating that short-term risk intensification are an critical precursor to accident emergence. When neighborhood hazard context are added in E8, the F1-score further rise to 97.60% and 96.87%, respectively, with AUC values of 0.991 and 0.990. These improvements confirms that accident-prone conditions was better detected when the model jointly considers local hidden danger, recent escalation, and reinforcement from surrounding connected infrastructure rather than treating each road unit as behaviorally isolated.
The early configurations show that neither spatial nor temporal modeling alone is sufficient to fully capture accident-emergence dynamics. The baseline E1, which uses only an MLP prediction structure, provides the lowest test performance on both datasets. Adding topology-aware spatial encoding in E2 improves the F1-score, while adding temporal modeling in E3 provides a further gain, indicating that both network structure and temporal hazard buildup contain useful predictive information. However, the stronger improvement appears in E4, where the GCN and GRU components are combined. This confirms that accident emergence is better represented as a coupled spatio-temporal process rather than as a static record-level classification task.
The intermediate configurations E5–E8 demonstrate the importance of converting sparse event evidence into a structured hazard representation. In E5, the continuous hazard reconstruction layer improves test performance on both datasets, suggesting that the model benefits from learning an intermediate hidden-danger representation before producing the final accident-emergence decision. E6 further improves the results by introducing risk-diffusion feature lifting, which allows local road segments to incorporate information from connected neighboring units. This improvement supports the central assumption of the proposed framework: urban traffic danger is not isolated at a single location, but propagates through adjacent corridors, intersections, and surrounding network conditions.
The transition from E6 to E8 shows that the model becomes more robust when it explicitly incorporates hazard escalation and neighborhood context. The addition of the hazard-escalation cue in E7 improves the F1-score because recent changes in the hidden hazard field provide useful information about whether risk is intensifying or stabilizing. E8 further improves performance by including surrounding hazard context, which helps the model distinguish between isolated local fluctuations and broader network-level danger patterns. These results indicate that the proposed framework gains predictive strength from modeling both the current hazard state and its spatially connected evolution.
The final configurations, E9 and E10, provide the strongest evidence that long-range temporal memory and complete hazard-to-emergence mapping are essential for the best generalization performance. In E9, replacing the GRU with a BiGRU and increasing the temporal window improves the ability to capture longer hazard-evolution patterns. The full CHFI configuration E10 achieves the best test results on both datasets, with an F1-score of 98.49% on Dataset 1 and 97.72% on Dataset 2. Compared with E1, this represents an absolute F1-score improvement of 8.43 percentage points on Dataset 1 and 8.58 percentage points on Dataset 2. These gains show that the final performance is not produced by a single component, but by the cumulative integration of topology-aware spatial reasoning, temporal hazard evolution, continuous hazard reconstruction, risk diffusion, escalation cues, neighborhood context, and full accident-emergence mapping.
Overall, the ablation results confirm that each major component of the proposed CHFI framework contributes to the final predictive performance. The consistent improvement from E1 to E10 across both datasets also supports the robustness of the proposed design. More importantly, the gradual performance increase provides stronger evidence than a single final result, because it shows how the framework moves from conventional tabular classification toward a structured spatio-temporal hazard-intelligence model capable of learning transferable accident-emergence patterns.
When spatial and temporal learning are combined, the model achieves a stronger improvement than either component alone. This confirms that accident emergence is better modeled as a coupled spatio-temporal process rather than as an isolated tabular classification problem. The addition of the hazard reconstruction module further improves performance by forcing the model to learn an explicit hidden-danger representation before producing the final prediction. This result supports the main assumption of the proposed framework, namely that reconstructing a continuous hazard field provides a more informative intermediate representation than relying only on direct classification embeddings.
The later ablation stages show that the model becomes more effective when it includes network-level risk propagation and dynamic hazard-change information. Risk-diffusion feature lifting improves performance by allowing each road unit to incorporate information from connected neighboring segments, while the hazard-escalation cue helps the model identify whether the hidden hazard state is intensifying or stabilizing. Adding the neighborhood hazard context further strengthens the representation by distinguishing isolated local fluctuations from broader corridor-level danger patterns. These results indicate that the proposed model benefits from representing both the local hazard state and its surrounding spatio-temporal context.
The final two stages demonstrate the importance of stronger temporal memory and complete hazard-to-emergence mapping. The BiGRU-enhanced configuration improves the ability of the model to capture longer hazard-evolution patterns, while the full CHFI configuration achieves the highest F1-scores of 98.85% on Dataset 1 and 98.28% on Dataset 2. Compared with the baseline MLP, this represents absolute F1-score improvements of 7.91 and 8.26 percentage points, respectively. Overall, the ablation results confirm that the final predictive strength of CHFI is not produced by a single module, but by the combined effect of topology-aware spatial reasoning, temporal hazard evolution, continuous hazard reconstruction, risk diffusion, escalation modeling, neighborhood context, and full accident-emergence inference.
Table 11 demonstrates that the predictive strength of the proposed Continuous Hazard Field Intelligence framework was built progressively through the interaction of its major components rather than arising from any single isolated module. The baseline MLP begin at 90.94% F1 on Dataset 1 and 90.02% on Dataset 2, which establishes the lower bound of performance when accident emergence are modeled without explicit spatial structure, temporal dependency, or hazard-field reasoning. When only topology-aware spatial encoding are activated, the F1-score increase to 92.64% and 91.70%, yielding gains of +1.70 and +1.68 points, respectively, which confirms that road-network connectivity and neighboring segment interaction already provide useful explanatory signal beyond flat tabular prediction. However, temporal-only hazard modeling produce slightly larger gains, reaching 93.29% on Dataset 1 and 92.40% on Dataset 2, with improvements of +2.35 and +2.38 points, indicating that pre-accident temporal buildup carry stronger standalone discriminative information than purely spatial aggregation. Once spatial and temporal modeling are combined, the F1-score rise further to 94.69% and 93.85%, showing that accident emergence are fundamentally an coupled spatio-temporal process. These result was particularly important because it confirms that the proposed framework are aligned with the real dynamics of urban traffic danger, where risk develops not only across connected infrastructure units, but also through temporal accumulation and persistence. The second half of the ablation sequence show that the largest late-stage improvements come from explicitly hazard-aware components that transform the model from an generic spatio-temporal predictor into an true hidden-danger intelligence framework. Adding hazard reconstruction increase the F1-score to 95.54% on Dataset 1 and 94.79% on Dataset 2, meaning that explicitly recovering an continuous latent hazard field contribute +0.85 and +0.94 points beyond joint spatial-temporal encoding alone. Risk-diffusion feature lifting then raise performance to 96.21% and 95.42%, confirming that corridor-level propagation and neighborhood-aware instability transfer improve the continuity and realism of the learned safety representation. The addition of the hazard-escalation cue
provide another +0.81-point gain on both datasets, which strongly indicates that recent intensification of hidden danger are an critical precursor to accident emergence. Once neighborhood hazard context
are incorporated, the F1-score reach 97.60% and 96.87%, showing that surrounding network reinforcement further strengthen local risk inference. The final gains appear after BiGRU temporal strengthening and the full CHFI configuration, which culminate in 98.85% F1 on Dataset 1 and 98.28% on Dataset 2. Relative to the baseline, these final values corresponds to total absolute improvements of 7.91 and 8.26 percentage points, respectively, demonstrating that the full framework are necessary to comprehensively capture urban danger as an continuous field shaped by local instability, temporal buildup, diffusion-based propagation, escalation dynamics, and contextual reinforcement as shown in
Table 12.
4.1. Validation Strategy and Generalization Assessment
Because spatio-temporal traffic records may contain strong temporal continuity and spatial neighborhood dependence, relying only on random record-level splitting may overestimate predictive performance. Therefore, the validation protocol was expanded to include leakage-safe and dependency-aware evaluation settings. First, a stratified random split was used as the baseline validation setting to preserve the class distribution across training and testing partitions. Second, a temporal hold-out test was performed by training the model on earlier time periods and evaluating it on later unseen periods, which better reflects the realistic deployment condition in which future traffic risk must be predicted from past observations. Third, a geographic hold-out test was used by withholding selected spatial regions, corridors, or roadway zones from training and evaluating the model on these unseen locations. This test examines whether the learned hazard representation generalizes beyond the spatial areas observed during training.
To avoid optimistic evaluation, preprocessing parameters, feature normalization, encoding rules, hazard-field initialization parameters, and model-selection decisions were estimated only from the training partition in each validation setting. The same fitted transformations were then applied to the corresponding validation or test partition. This prevents test information from influencing the training process. In addition, performance was reported using class-sensitive metrics, including precision, recall, F1-score, Macro-F1, weighted F1-score, and AUC, rather than relying only on accuracy, because accuracy alone may be misleading under class imbalance.
The purpose of these additional validation settings was not only to confirm high predictive performance, but also to test whether the proposed continuous hazard-field representation remains stable when temporal and spatial dependency between training and testing samples is reduced. Therefore, the final reported evaluation distinguishes between random split performance, temporal hold-out performance, and geographic hold-out performance as shown in
Table 13.
Table 14 shows that the proposed framework maintains strong performance under stricter validation settings, although the results decrease compared with the stratified random split. This reduction is expected because temporal hold-out, geographic hold-out, and blocked spatio-temporal validation reduce the similarity between training and testing samples. For Dataset 1, the F1-score decreases from 98.85% under the random split to 97.39%, 96.93%, and 96.21% under temporal, geographic, and blocked spatio-temporal validation, respectively. Similarly, for Dataset 2, the F1-score decreases from 98.28% to 96.82%, 96.35%, and 95.60%. These results provide a more conservative and reliable validation of the proposed model and indicate that the framework learns transferable spatio-temporal hazard patterns rather than relying only on random record-level similarity.
Figure 12 shows that the baseline E1 model capture the broad class structure of Dataset 1, but still has limited robustness under test conditions. From training to testing, accuracy decrease from 91.84% to 90.91%, precision from 91.12% to 90.24%, recall from 90.76% to 89.88%, F1-score from 90.94% to 90.06%, and AUC from 94.40% to 93.80%, indicating an modest but visible generalization gap. The class-support distribution are also strongly imbalanced, dominated by Hazard, followed by NoInj, UnknInj, and 1141, while rarer classes such as AHazard and CarFire have much fewer samples, which explain the uneven class stability. The confusion matrices confirms these pattern: the test set remain relatively strong for Hazard (92.5%), UnknInj (93.0%), and CarFire (92.0%), but weaker for NoInj (88.6%), Other (89.3%), AHazard (89.1%), and Fire (91.1%). The ROC and precision-recall plots remains acceptable for an baseline, with test AUC = 0.938 and an precision-recall operating point around precision = 90.24% and recall = 89.88%, but both testing curves stay below training, while the per-class recall and class-probability plots show lower confidence under unseen data. Overall, the figure confirm that E1 was an useful starting reference, but its weaker rare-class generalization, lower calibration, and lack of topology-aware spatial and temporal hazard modeling justify the more advanced CHFI blocks introduced later.
Figure 13 presents the visual diagnostic results for the baseline E1 model on Dataset 2. The purpose of this figure is to assess whether the baseline model can separate the crash-severity classes before the hazard-aware CHFI components are introduced. The results show that E1 captures the general severity structure, but its performance remains limited under class imbalance, especially for the less frequent Multi-Injury and Fatal/Severe classes.
The training and testing metrics indicate a moderate generalization gap, with the testing performance consistently lower than the training performance. Rather than suggesting severe overfitting, this pattern shows that the baseline model has difficulty transferring its learned decision boundaries to unseen samples when rare and high-severity cases are present. The class-support distribution further explains this behavior, as No Injury dominates the dataset, followed by Minor Injury, while the higher-severity classes contain substantially fewer samples.
The confusion matrices and curve-based diagnostics provide additional evidence of this limitation. Misclassification mainly occurs between adjacent severity levels, indicating that the baseline model struggles to distinguish gradual severity transitions. Although the ROC and precision–recall curves remain acceptable for a baseline configuration, the lower testing curves and reduced class-level confidence confirm that E1 is not sufficiently robust for fine-grained crash-severity prediction. Overall, these findings establish E1 as a useful reference model and justify the introduction of topology-aware, temporal, and hazard-reconstruction components in the later CHFI configurations.
Figure 14 shows that Block E5 improve both predictive performance and the quality of the reconstructed hazard representation on Dataset 1. In panel (a), the F1-score increase steadily from 88.14% for the spatial encoder only to 89.62% after adding temporal modeling, 91.08% after joint fusion, 93.41% after the reconstruction layer, and 94.02% after including the hazard smoothing prior, giving an total gain of 5.88 percentage points. The waterfall plot in panel (b) confirm that the reconstruction layer contribute the largest single improvement (+2.33), followed by temporal modeling (+1.48), fusion (+1.46), and smoothing (+0.61). These pattern are also reflected in the ablation heatmap in panel (c), where the full E5 block achieve the best overall results with 94.2% accuracy, 93.9% precision, 94.0% recall, 94.0% F1, and 96.8% AUC, while removing reconstruction cause the strongest drop, reducing F1 to 91.5% and AUC to 94.4%. The lower-row visualizations in panels (d)–(f) explain these gains by showing that the completed block reconstruct an smooth and spatially coherent hazard field with clear hotspots H1–H6 and meaningful intensity continuity between them, rather than isolated fragmented peaks. Overall, the figure show that E5 do not only improve the metrics numerically, but also produce an more physically interpretable and stable hidden-danger surface.
Figure 15 shows that Block E5 yield an strong but well-structured improvement on Dataset 2. The F1-score increase steadily from 88.66% for the spatial encoder only to 90.04% after adding temporal modeling, 91.62% after joint fusion, 93.88% after the reconstruction layer, and 94.37% after adding the hazard smoothing prior, for an total gain of 5.71 percentage points. The waterfall plot confirm that the reconstruction layer contribute the largest single improvement (+2.26), followed by fusion (+1.58), temporal modeling (+1.38), and smoothing (+0.49). these are consistent with the ablation heatmap, where the full E5 block achieve the best overall results with 94.6% accuracy, 94.2% precision, 94.3% recall, 94.2% F1, and 97.1% AUC, while removing reconstruction cause the sharpest decline, reducing F1 to 91.9% and AUC to 94.9%. The lower-row visualizations explain these gains by showing that the full block reconstruct an smooth and spatially coherent hazard field with six clear hotspots and continuous intensity transitions rather than isolated peaks. Overall, the figure confirm that E5 improve not only the numerical metrics, but also the stability and interpretability of the learned latent crash-risk surface.
Figure 16 clearly shows that the E4 configuration deliver an consistent and meaningful improvement over E2 and E3 on Dataset 2 once spatial and temporal modeling are fully combined. In panel (a), all four core metrics rise monotonically from E2 to E4, with E4 reaching about 92.4% accuracy, 91.6% precision, 91.2% recall, and 91.4% F1-score, compared with approximately 89.1%, 88.4%, 87.9%, and 88.2% in E2, which indicates that the complete E4 block provide a gain of roughly 3.2 percentage points in F1 over the earlier configuration. Panel (b) confirm the same trend in ranking quality, where AUC increase from 92.70% in E2 to 93.90% in E3 and then to 95.60% in E4, showing that E4 not only improve threshold-dependent metrics but also strengthen class separability more fundamentally. these are visually reinforced by the ROC and precision-recall comparisons in panels (c) and (d), where the E4 curve consistently dominates E2 and E3 across most of the operating range, particularly in the low-false-positive and high-recall regions that are most important for accident-severity discrimination. The confusion matrix in panel (e) further show that E4 achieve strong per-class recognition, with diagonal values of 92.4% for No Injury, 89.6% for Minor Injury, 93.0% for Multi-Injury, and 94.5% for Fatal/Severe, indicating especially strong stability on the most safety-critical severity class while keeping cross-class leakage relatively limited. Finally, panel (f) show stable optimization behavior, with both training and validation loss decreasing smoothly across epochs and the train-validation accuracy gap remaining mostly around 4.5–6.2%, suggesting that the performance gain of E4 are achieved through stronger representation learning rather than unstable fitting. Overall, the figure demonstrate that Block E4 was an decisive transition point in the architecture, because the joint spatial-temporal formulation substantially improve severity prediction quality, ranking robustness, and class-level discrimination on the collision-oriented dataset.
Figure 17 shows that the E4 configuration provide an clear performance improvement over E2 and E3 on Dataset 1 once spatial and temporal modeling was combined. In panel (a), all core metrics increase from E2 to E4, with E4 reaching about 91.9% accuracy, 91.2% precision, 90.9% recall, and 91.0% F1-score, compared with roughly 88.8%, 87.9%, 87.4%, and 87.7% in E2, indicating an overall F1 gain of about 3.3 percentage points. Panel (b) show the same progression in AUC, rising from 92.10% in E2 to 93.20% in E3 and then to 94.90% in E4, confirming stronger class separability. These are supported by the ROC and precision-recall curves in panels (c) and (d), where E4 consistently dominates the earlier configurations across most operating regions. The normalized confusion matrix in panel (e) further show strong class-level recognition, with diagonal values of 90.6% for Low, 92.9% for Moderate, 96.1% for Elevated, and 93.1% for Critical, indicating especially robust discrimination for the higher-risk classes. Finally, panel (f) show stable training behavior, where both training and validation loss decrease smoothly while the train-validation accuracy gap remains controlled at roughly 4.0–6.4%, suggesting that the E4 gains come from stronger representation learning rather than unstable fitting. Overall, the figure confirm that Block E4 are an important transition stage because the joint spatial-temporal design substantially improve hazard-level discrimination, ranking quality, and optimization stability on the incident-oriented dataset.
Figure 18 visualizes the spatial hazard field learned by Experiment E6 on Dataset 1 and show that the model reconstruct an smooth but clearly localized urban risk topology rather than an fragmented set of isolated incident points. Six dominant risk regions are visible, with the strongest concentrations appearing around R1, R2, and R3, where normalized hazard intensity approach the upper end of the scale and the contour lines are densely packed, indicating sharp local gradients and strong hazard concentration. Secondary but still important regions appear at R4 and R5, while R6 form an lower-intensity yet spatially meaningful hotspot in the lower-right portion of the domain. An particularly important feature of the map are the presence of continuous elevated-risk corridors linking the upper-left, central, and upper-right zones, as well as the broader transition structure extending toward the lower-central region, which suggests that E6 are capturing neighborhood-aware risk propagation rather than only pointwise activation. These behavior are consistent with the role of risk-diffusion feature lifting in E6, where local hazard descriptors was enriched by topological interaction from connected spatial units before prediction. As an result, the figure indicate that the model has moved beyond simple incident localization and has begun learning an more realistic hidden-danger surface in which high-risk areas, transitional regions, and spatial continuity are jointly represented.
Figure 19 shows that the E7 block improve in an stable and cumulative manner as each component were added. Starting from the base configuration, the F1-score are 90.84%, then rise to 91.76% after adding the spatial gate, 92.55% with the temporal gate, 93.48% after risk fusion, and finally 94.11% with scenario adaptation. these corresponds to an total improvement of 3.27 percentage points from the base to the full E7 configuration. The progression are almost monotonic and well balanced, which suggests that no single module dominate the gain entirely; instead, the final performance emerge from the cooperative effect of spatial selectivity, temporal refinement, risk-aware fusion, and scenario-level adaptation. Among the later additions, risk fusion produce one of the strongest visible jumps, indicating that combining latent hazard signals are especially important for improving accident-emergence discrimination at these stage. Overall, the figure confirm that E7 strengthen the framework through progressive component integration rather than abrupt isolated gains, which support the architectural logic of the proposed hazard-intelligence pipeline.
Figure 20 shows that Experiment E8 learn an clear and structured spatio-temporal hazard pattern rather than isolated, short-lived activations. Hazard intensity generally increase from the upper zones (Z1–Z3 toward the lower zones (Z6–Z8, indicating that the most critical latent danger are concentrated in the lower part of the spatial domain. The strongest buildup occur between time steps 8 and 13, where Z7 and Z8 reach the highest values in the map, including 0.92 at Z7, time step 10, 0.97 at Z8, time step 9, 0.95 at Z8, time step 10, and the global maximum of 1.00 at Z8, time step 11. An second strong band are visible in Z5–Z6, where the values remain elevated around 0.68–0.80 across the same interval, suggesting that the high-risk core are not limited to one isolated zone but spread across adjacent regions with temporal persistence. In contrast, upper zones such as Z1 and Z2 remain much lower, mostly in the 0.08–0.45 range, which show that E8 successfully separate weak background hazard from concentrated danger buildup. The smooth progression of values across neighboring zones and successive time steps indicate that the model are capturing continuous hazard propagation and temporal reinforcement, which was consistent with the role of neighborhood hazard context in E8. Overall, the figure confirm that E8 produce an coherent spatio-temporal hidden-danger field in which risk intensifies gradually, peaks around an localized critical interval, and remain spatially structured rather than randomly scattered.
Figure 21 shows that the full E8 configuration deliver the strongest overall performance across all evaluation metrics, confirming that its predictive strength depend on the joint contribution of temporal cross-attention, spatial gating, fusion, memory, and residual refinement. The complete model reach 95.7% accuracy, 95.3% precision, 95.4% recall, 95.4% F1-score, and 97.9% AUC, which are the best profile in every row of the heatmap. Among the ablated variants, removing memory cause the largest degradation, reducing the F1-score to 92.5% and AUC to 94.6%, which indicate that hazard persistence and temporal continuity was critical in E8. Removing fusion also produce an strong drop, yielding 92.8% F1 and 95.0% AUC, while removing temporal cross-attention lower performance to 93.2% F1 and 95.4% AUC, showing that cross-temporal interaction remain highly important for capturing evolving risk. The spatial gate and residual pathway also contribute meaningfully, as their removal limit F1 to 93.7% and 93.8%, respectively. Overall, the heatmap confirm that E8 perform best when all components operate together, with memory, fusion, and temporal cross-attention providing the most influential contributions to the final hazard-aware representation.
Figure 22 shows that Block E9 achieve strong ranking and discrimination capability on Dataset 1. In panel (a), the ROC curve remain close to the upper-left boundary across nearly the entire operating range, with an AUC of 0.982, indicating excellent separation between accident-emergence and non-emergence patterns and very high true-positive rates even at low false-positive rates. Panel (b) confirm these behavior from an precision-recall perspective, where the curve start near perfect precision and decline gradually as recall increase, while still maintaining an strong operating point summarized by precision = 96.10% and recall = 95.60%. The overall shape of the precision-recall curve suggest that E9 preserve prediction quality well even as coverage expand, which are especially important under imbalanced traffic-risk conditions where high recall can otherwise cause sharp precision collapse. Taken together, the two panels indicate that E9 are not only accurate at an fixed threshold, but also robust as an ranking model, with strong threshold-independent separability and reliable positive-class retrieval across an wide decision range.
Figure 23 shows that Block E9 capture the temporal evolution of hidden danger with strong fidelity across the full sequence. The predicted dynamics closely follow the ground-truth hazard trajectory from time step 1 to 50, reproducing both the early moderate rise around steps 4–9, the mid-sequence decline toward the local minimum near steps 22–25, and the major hazard surge between approximately steps 30 and 40. In the first phase, the model track the initial increase from about 0.32 to nearly 0.46 with only small local fluctuations, while in the second phase it correctly follow the gradual drop toward the low-hazard regime around 0.17–0.20. The most important region appear in the late middle of the sequence, where the true hazard climb sharply above 0.60 and peaks near 0.65–0.66 around steps 35–37; here the predicted curve slightly overshoot the peak, reaching about 0.72, but still preserve the timing, shape, and subsequent decline of the critical hazard episode. After the peak, both curves falls in an similar pattern toward the final low-risk state near 0.10–0.12 by step 50. Overall, the figure indicate that E9 do not merely predict static risk levels, but learns the temporal structure of hazard buildup, crest formation, and decay with good alignment, which support the role of stronger temporal modeling in improving sequence-level accident-emergence understanding.
Figure 24 shows that Block E10 achieve both very strong class discrimination and well-controlled probability calibration on Dataset 1. In panel (a), the normalized confusion matrix are dominated by high diagonal values, reaching 99.5% for Low, 94.9% for Moderate, 93.4% for Elevated, and 94.3% for Critical, which indicates highly reliable hazard-level recognition across all four classes. Misclassification are limited and structurally meaningful rather than random: Moderate was mainly confused with Low (2.2%) and Elevated (2.0%), while Critical are most often confused with Elevated (3.8%), which are expected because these adjacent severity levels was closer in latent risk structure than distant classes. Panel (b) further shows that the calibration curve track the perfect-calibration line closely across nearly the entire confidence range, with only minor deviations around the mid-to-high confidence region, indicating that the predicted probabilities remains well aligned with observed accuracy rather than being overconfident or poorly scaled. Together, the two panels show that E10 are strong not only as an classifier, but also as an calibrated decision model, which was especially important for traffic-risk forecasting because reliable probability estimates are essential when the output are used for safety-aware intervention, prioritization, and operational decision support.
4.2. Comparative Positioning Against Related Works
Table 15 positions the proposed CHFI framework in relation to representative studies on traffic accident risk prediction, crash-severity modeling, and road-safety intelligence. Earlier studies mainly focused on either segment-level statistical modeling or road-level classification. For example, Park and Hong [
14] used an MLP model for urban road-specific accident-risk prediction and reported accuracy = 0.75, precision = 0.73, and recall = 0.81 under the most balanced sampling setting. Macedo et al. [
15] emphasized the role of roadway geometry and segmentation in accident prediction, showing that curves with radii ≤600 m had 3.2 times higher accident risk than curves with radii >2200 m, while geometric correction could reduce accidents with victims by about 18–27%. These studies provide important evidence that road structure, geometric configuration, and traffic context strongly affect accident risk. However, their modeling outputs remain primarily discrete, segment-based, or road-level estimates, and they do not explicitly reconstruct the continuous latent danger field through which risk accumulates, propagates, and intensifies across connected urban infrastructure.
Recent studies have moved toward larger traffic-safety datasets and stronger machine-learning or spatio-temporal deep-learning models. Gou et al. [
16] introduced an XTraffic-style dataset with 16,972 traffic nodes and traffic-flow, speed, occupancy, incident, and roadway information, where incident classification reached about 41.6% accuracy in selected settings. Huang et al. [
17] formulated traffic accident prediction as a road-network graph-learning problem and compared TRAVEL with MLP, XGBoost, and several GNN baselines, showing the importance of evaluating accident prediction models against both non-graph and graph-based alternatives. Wang et al. [
18] further demonstrated this direction by comparing intersection-level crash prediction with LR, SVM, DT, RF, MLP, GCN, and LSTM, where RoadInTCP achieved F1-scores between 0.8366 and 0.8694 and improved over the strongest baselines by approximately 3.634–5.591%. Similarly, Yumak et al. [
19] showed that XGBoost achieved the strongest F1-score of 0.61 and the lowest false-alarm rate of 0.01 for accident-severity and high-risk road-segment identification, while Li and Chen [
13] reported 94.5% accuracy, 94.0% precision, 93.7% recall, 0.978 AUC, and 93.8% F1-score using a CNN–LSTM–GNN accident-risk model. Alnowaiser [
20] also demonstrated the effectiveness of graph-sequential learning by reporting 99.97% accuracy and approximately 99.99% precision, recall, and F1-score using a GNN–LSTM–MLP framework. Collectively, these works show that RF, XGBoost, MLP, LSTM/GRU, and graph-based models represent strong and widely adopted benchmark families for traffic accident prediction and severity modeling.
Compared with these studies, the proposed CHFI framework introduces a different modeling perspective by shifting from discrete accident classification toward continuous hazard-field intelligence. Rather than directly predicting accident occurrence or severity from isolated observations, CHFI first reconstructs a topology-aware and temporally evolving hidden-danger representation over connected transportation infrastructure. This is achieved through topology-constrained spatial alignment, temporal hazard window embedding, risk-diffusion feature lifting, hazard-sensitive normalization, continuous hazard surface initialization, spatial hazard encoding, temporal hazard evolution, and localized accident-emergence inference. The final CHFI configuration achieved 99.12% accuracy, 98.85% F1-score, and 0.998 AUC on Dataset 1, and 98.63% accuracy, 98.28% F1-score, and 0.997 AUC on Dataset 2. Beyond these numerical results, the main novelty of CHFI lies in its ability to represent urban traffic danger as a continuous, topology-aware, and interpretable hazard surface that explains how risk forms, diffuses, persists, and becomes translated into localized accident-emergence probability.