1. Introduction
With the growing diversity of road users and transport systems in Germany, it is becoming increasingly important to better understand the causes and influencing factors of crashes [
1]. Although traffic is becoming increasingly complex, the number of fatal road crashes in Germany has been falling for several years. According to the German Federal Statistical Office [
2], this is due to the introduction of new regulations and improved vehicle technology. Nevertheless, further analysis of the factors influencing crashes is necessary to improve road safety. This applies not only to fatal crashes, but also to minor crashes. Identifying the influencing factors can provide very useful information for infrastructure planners.
Since each crash depends strongly on individual external factors, these influences must be worked out for each crash individually. Because these influences interact and affect one another, new methodologies are needed to disentangle them. Only then is a large-scale, empirical analysis possible.
There are a number of factors in the event of a crash that contribute to its complexity. Among these are weather conditions, road conditions, traffic congestion, and the individuals involved. Due to the complexity of crash data, patterns and correlations are not immediately apparent. Even small changes, such as the time of day or the types of people and vehicles involved, can determine whether a crash at the same location results in minor or serious injuries. For some years now, machine learning (ML) approaches have been used to classify the severity of crashes ([
3,
4,
5,
6,
7]). The aim is to automatically learn and map the correlations between the details of a real crash and its severity. This approach offers the significant advantage of requiring minimal prior assumptions to effectively classify a wide variety of data. It thus ensures a high degree of flexibility, especially with more complex ML models, with respect to a wide variety of crashes and their outcomes. The approach is widely used in crash research to extract influences from large amounts of data, and it makes it possible to use a model for real-time monitoring of roads. The large number of crashes already recorded can be used to identify potential danger spots. To do this, traffic crash data must be combined with real-time road data to include the current situation on the road in the crash risk prediction. The work of Zhang et al. demonstrates the feasibility of such a prediction tool based on pre-crash traffic dynamics, such as the mean pre-crash traffic speed or the mean speed reduction in certain road sections within one minute [
8].
However, this approach has some limitations: decision rules derived from simple ML models such as decision trees or random forests can identify correlations between individual factors, but they are unable to identify local factors associated with specific crashes. Even the widely used feature importance metrics are not suitable for analyzing individual crashes, as they cannot capture local effects and do not indicate whether those effects are negative or positive.
While complex ML models, such as neural networks, have demonstrated efficacy in predicting the severity of crashes based on available data, it is considerably more challenging to derive and interpret the influencing factors. Researchers increasingly use explainable AI (XAI) methods to reveal how these ‘black box’ models make decisions.
Besides understanding how an ML model makes decisions, it is equally important to consider the geospatial characteristics of the data. There is a large research gap in this area, especially when analyzing crash data. At present, the number of studies that apply geospatial explainable artificial intelligence is limited [
9]. In geospatial XAI, integrating geospatial information into the analysis and communication of ML results is essential.
The objective of this study is to demonstrate the efficacy of geospatial XAI in determining the factors that influence crash severity. An additional challenge is the interpretation and visualization of these complex results. Therefore, another goal of this study is to develop a concept that uses geospatial areas at different scales to enable the study of global crash factors, influencing factors in a geospatial region and local influencing factors.
Roussel & Böhm [
10] introduced a related approach termed Geo-Glocal Explainable Artificial Intelligence (Geo-Glocal XAI), which combines the strengths of global and local explanations to better capture spatial dependencies in data. This concept aligns closely with the goals of this research. However, to maintain terminological consistency and emphasize the spatial dimension of the data and explanations, the present study uses the broader term geospatial XAI to encompass global and local explanatory levels within a unified spatial framework. From this, we define the following research questions:
How can geospatial XAI, using global and local explanations, help reveal the factors influencing an ML model’s classification of crash severity?
How can visualization of global and local influencing factors with maps and plots simplify and improve interpretability?
Beyond its methodological contribution, the proposed geospatial XAI approach has the potential to be applied in practice in urban and transport planning. By identifying the factors that influence crash severity at a semi-global and local level, planners and policymakers can locate areas of risk, evaluate the effectiveness of existing infrastructure and prioritize safety measures. For instance, insights from explainable ML models can inform the redesign of dangerous intersections, the optimization of bicycle infrastructure, and the targeted implementation of speed management measures. Furthermore, integrating these results into interactive dashboards allows professionals such as urban planners, transport engineers, safety authorities and emergency services to understand the results of the model without needing in-depth knowledge of machine learning. This promotes data-driven decision-making, increases the transparency of AI-supported safety assessments and ultimately contributes to the development of safer, more sustainable urban transport systems.
A universally applicable concept is developed in this study and examined using data from the German city of Mainz. Due to its size and diversity, the city of Mainz offers a representative basis for identifying these influences. The proposed geospatial XAI method, which combines spatial segmentation, explainability methods, and visualization, can be transferred to other cities and transportation systems, provided that similar spatial and temporal data are available.
The structure of the study is as follows: In
Section 2, we present a comprehensive overview of the current approaches in research to determine crash severity using ML models. We present the missing explanatory approaches in research and introduce geospatial XAI. In
Section 3, we present the newly developed method to derive and map complex influencing factors together with the ML quality. In
Section 4, we apply the developed concept to our selected use case of crash analysis in Germany. We present and discuss the results in
Section 5. In
Section 6, we conclude the results of the study and point out future directions.
2. State of the Art
In this chapter, we present current research using ML approaches to predict crash severity. Specifically, we highlight studies in this research area that apply XAI approaches. We also provide general information on the research field of geospatial XAI.
Crash severity classification using ML has become increasingly important in recent years. Numerous studies have looked at data-based prediction of crash severity using different data sources, models, and analysis methods. In the past, the focus has been on identifying the best ML model for such classification tasks.
Using complex ML models like neural networks (NNs) improves performance compared to linear models [
11]. The authors confirm that the relationships between features and outputs in crash severity prediction are non-linear and therefore best represented by an NN. Recent research also shows that using an NN is promising for predicting crash severity [
12]. However, these studies also point to the risk that increasing complexity can reduce the comprehensibility of the model, without presenting solutions to make the model transparent. A review further highlights this research gap by presenting an extensive list of data-based analysis approaches for crash severity [
6]. This study also focuses on the performance of each ML model: F1 score, recall, precision, or accuracy are often used to compare the efficiency of a model. The study highlights the difficulties in predicting crash severity with ML approaches, as well as possible solutions to achieve better results. A common challenge in this use case is that the crash data are very unbalanced: there are many more crashes of minor severity than fatal crashes, which poses a challenge for ML models. The study presents possible sampling methods to overcome this challenge and increase the accuracy of the models. It also discusses feature selection and preprocessing to improve the quality of the training data. While these methods can improve performance, they can also make the prediction harder to interpret. For example, encoding features can potentially complicate interpretation. The diversity of features is also essential for the quality of ML predictions. Previous studies have demonstrated a correlation between road characteristics (e.g., road class, number of lanes, and speed limit) and crash severity, suggesting that a broad range of crash-related features can improve predictive performance [
3].
Furthermore, selecting the model according to the available data is important, with neural networks offering the greatest variety of applications due to their high adaptability [
3]. Combining neural networks and decision trees improves the performance and accuracy of crash data classification [
12]. However, it is also important that the decisions made by ML models are transparent and fully disclosed so that people can understand the influences on the model, especially for sensitive predictions such as crash severity. The study identifies two benefits of transparency [
12]: First, identifying model biases due to missing and biased data. Second, increasing confidence in the prediction by communicating the rationale for the decision in a human-readable way.
A practical dataset also demonstrates the strength of a neural network [
13]. They used 53,732 crashes from Florida in 2006, each described by 17 features and assigned to one of 4 severity classes. They achieved an accuracy of 54.84% on the test dataset, which is comparable to other studies. In addition, they performed sensitivity tests on the trained ML model to make its decision-making more comprehensible. Approaches such as these are a first step in analyzing the factors influencing predicted crash severity. However, this approach can only identify influences globally; the correlations and effects of individual characteristics are not covered.
In order to gain a deeper understanding of the decision-making process of an ML model, some studies have focused on utilizing decision trees, random forests, or other combinations as an alternative to an NN. The strength of these models is that they can cover non-linear problems and are explainable with little effort [
14]. The quickest way to do this is to derive so-called decision rules. These can provide meaningful insight into the decision-making of the ML model, as they are based entirely on the trained model. The model expresses its decisions directly in the form of rules. However, to obtain a comprehensive statement about the influences, it may be necessary to train many trees. To obtain an unbiased and comprehensive picture of the most influential attributes, the individual decision rules of each tree need to be combined and normalized [
4]. However, the work proves that the explanatory power of the decision rules is not sufficient to determine the factors influencing the decision of crash severity [
4]. They called these influencing factors ‘risk combinations’. Working with decision rules also presents several challenges. Normalized influences are only available globally, and the derived rules only apply to part of the trained model. It is not possible to output the influences for individual instances. Moreover, it is possible to specify only a feature’s importance, without indicating whether its value had a positive or negative effect on the decision. Furthermore, a large number of decision trees are required to cover the influences of all factors.
In geospatial use cases, XAI techniques are more often used to improve ML models than to provide interpretable domain information for domain experts [
9]. Recent research illustrates this finding, too [
15]. They predict the severity of crashes on Brazilian highways. To reduce the complexity of the data, they use the XAI technique local interpretable model-agnostic explanations (LIME) to identify the features that have the least influence on the prediction. They then remove them from the entire data set for a new model training. This improved the prediction quality of the model. However, this study again shows that the potential of the influencing factors is greatly underestimated and remains unused for deeper knowledge transfer or further data analysis. Especially for geospatial use cases, XAI techniques can add significant value and provide new insights for decision-making. Boosting algorithms, including extreme gradient boosting (XGBoost), represent the most common models applied alongside XAI [
9]. It should be emphasized that the lower the model accuracy, the lower the explanatory power of a model explanation; XAI can only reveal the results of an ML model but cannot evaluate or modify them [
16].
By far the most widely used XAI technique is the SHAP (SHapley Additive exPlanations) method [
16]. SHAP is model-agnostic and can be applied to any ML model to reveal the decision-making process of a classification. A key advantage of SHAP is its ability to provide both global and local explanations. By aggregating local SHAP values over the entire dataset, global explanations reveal which features most influence the model predictions. Different types of plots can present these global results to visualize the average contributions of all features. Local explanations, on the other hand, show how individual features influence specific predictions and are often visualized in waterfall plots, which show the contributions of individual features to a specific prediction step by step.
In addition to traditional SHAP value visualization, recent research explores the geovisualization of these values [
16]. The paper highlights SHAP as a useful tool for identifying influencing factors in ML models, while noting that the geospatial representation of these factors often remains underexplored. The author presented two approaches to visualizing these factors on geographic maps: a point-based visualization and an area-based visualization of the SHAP values. In the point-based visualization, the influencing factors are represented by color and the feature values by size at the exact spatial coordinate on a map. The method reveals spatial patterns that standard plots cannot visualize [
16]. The area-wide visualization displays the SHAP values or the features with the highest SHAP values using a Voronoi diagram. After intersecting with the road network, this method visualizes actual geospatial areas of influence across the entire region.
When analyzing crash data, previous research has used ML models to predict crash severity as accurately as possible. Only a few studies have used a trained ML model to evaluate the factors influencing the prediction. So far, researchers have mainly used decision rules or feature combinations to identify the so-called risk factors of serious crashes. The interpretation of influencing factors is not widely used in crash analysis but has great potential. The calculated influencing factors not only make the decision processes of ML models transparent and thus provide important information for subsequent investigations; they also help improve ML models and enhance their robustness. Furthermore, there is still a lack of modern visualization approaches for displaying crash data. Crash data can be geospatially localized, providing the opportunity to perform geospatial analysis, which can help to identify patterns in the data.
The first outlines of a global–local explainability approach, a scalable method for geospatial use cases, appear in [
16]. Building on this, we apply a geospatial XAI technique that uses global and local factors to identify influences and extract related geospatial information. The next chapter presents a concept that scales the SHAP values geospatially in order to explain ML predictions.
3. Concept
This chapter introduces a concept that calculates the influencing factors of an ML classification not only locally for individual instances or globally for all predictions, but also at intermediate spatial scales and with the predicted classes taken into account. The concept of semi-global influence factors aims to represent extensive influencing variables in a generalized way without distorting them. No one has yet investigated how to combine these two requirements; the present concept addresses this research gap.
To achieve this, we employ two types of data segmentation: First, we spatially segment the data points and then assign them to the cells of the confusion matrix. The first segmentation classifies geographical areas, while the second segmentation enables a detailed analysis of the ML model’s decision-making process based on the target and predicted classes. We can then aggregate the influencing factors of this twofold-segmented data to assess the effect of individual features on specific cells of the confusion matrix within spatial areas (see
Figure 1).
The first step is spatial segmentation. With geographic data in particular, it is essential to recognize and analyze spatial patterns and relationships. The challenge lies in structuring the data in a meaningful way to make semi-global statements, especially when data points distribute irregularly in space. Therefore, the data set is first divided into clearly definable geographical areas. These areas can be administrative boundaries, such as federal states, counties, or city districts. Alternatively, sub-areas based on social, infrastructural, or climatic criteria, such as forest, urban, and rural areas, are also possible. The choice of segmentation depends on the available data and the respective use case.
The larger the selected areas, the more generalized the subsequent statement about the influencing factors will be. Conversely, smaller segments enable a more precise derivation of spatial patterns. The presented concept uses scalable analysis to examine both small-scale structures with low point density and large-scale areas with aggregated influencing factors. Thus, it is possible to conduct an area-related analysis of the influencing variables at different spatial scales. This segmentation also allows us to localize the predictions generated by the ML model semi-globally.
The ML model calculates a probability for each target class, so the sum of all the probabilities is always 100%. XAI methods, such as SHAP, allow us to analyze these probabilities in detail to determine the influence of individual features. The resulting local influencing factors depend on the instance and target class under consideration.
A common mistake when deriving global influencing factors is aggregating all local influences on a target class. Problems arise if the calculation proceeds without considering whether the model correctly predicted the target class. This procedure can lead to distortions because it includes incorrect predictions. To avoid this, we perform a second classification of the data based on the initial classification after spatial segmentation. We then assign the predictions to cells of a confusion matrix that compares the actual classes with the model’s predicted classes.
By dividing the data this way, we enable dynamic analysis of predictions based on spatial units. An interactive dashboard presents the results and facilitates user-friendly, straightforward exploration of complex influencing factors.
The confusion matrix also facilitates targeted aggregation. We calculate influencing factors per target class exclusively from instances that the model correctly classified. This avoids distortions due to misclassification. Additionally, the analysis of incorrect predictions allows identification of the responsible features. Classifying the data as true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) enables a structured and cumulative representation of the influencing factors. This approach provides a robust basis for analyzing model decisions, especially wrong ones.
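As an illustration of this twofold segmentation, the following minimal Python sketch aggregates mean absolute SHAP values per spatial unit and per confusion-matrix cell; the column names ('district', 'actual', 'predicted') and the example district are illustrative assumptions, not the exact implementation used in this study.

```python
import pandas as pd

def aggregate_shap(df, shap_values, district_col="district",
                   actual_col="actual", predicted_col="predicted"):
    """Mean absolute SHAP value per feature, grouped by spatial unit and by
    confusion-matrix cell (actual class x predicted class).

    df          : one row per instance with district, actual and predicted class
    shap_values : array of shape (n_instances, n_features) for one target class
    """
    shap_df = pd.DataFrame(shap_values, index=df.index)
    # First segmentation: spatial unit; second segmentation: confusion-matrix cell.
    keys = [df[district_col], df[actual_col], df[predicted_col]]
    return shap_df.abs().groupby(keys).mean()

# Example query: mean |SHAP| per feature for crashes in one district that were
# actually 'minor' but predicted as 'fatal':
# aggregate_shap(crashes, shap_vals).loc[("Altstadt", "minor", "fatal")]
```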
Combining spatial and classification-based segmentation provides a more detailed view of the classification results. This approach is particularly effective when using XAI methods, such as SHAP. It goes beyond classic SHAP diagrams, such as beeswarm or summary plots, by offering alternative, visually prepared display options.
The visualization uses simplified symbols to depict the influence of individual features without altering the underlying numerical values. For each target class, it is clear which features have increased or decreased the prediction probability. Displaying the intensity of influence symbolically helps a diverse user group understand the semi-global influencing factors.
Additionally, the interactive dashboard lets users dynamically select the target class, revealing different influences. This allows for a targeted analysis of the features that led to the inclusion or exclusion of certain classes. Even when the model misclassifies, users can quickly identify the features that led to the incorrect prediction. The dashboard offers a transparent and comprehensible representation of the factors influencing the model prediction thanks to the dynamic visualization of local SHAP values using waterfall and decision plots. The combination of interactive analysis options, user-centered design, and a comparative presentation of all target classes strengthens understanding of the model’s internal decision-making processes. This creates a sound basis for further interdisciplinary analyses.
This concept is suitable for any use case involving the classification of highly spatially distributed data with a trained ML model, followed by analysis using XAI methods.
4. Implementation
This chapter describes how to implement the concept presented in
Section 3. This process systematically derives factors influencing crash severity from crash data using XAI methods. The study uses the city of Mainz as an example area.
4.1. Data
First, we present the data set used. The crash data used in this study comes from the ‘Unfallatlas’ or Crash Atlas of the state and federal statistical offices of Germany (
https://unfallatlas.statistikportal.de/ (accessed on: 10 May 2025)). The publicly accessible dataset includes all crashes involving personal injury on German roads, as recorded by the police. Although the data has an annual publication cycle, some federal states delay its release. For example, data from Mecklenburg-Western Pomerania is only included starting in 2021. Each recorded crash includes geographical coordinates, enabling spatial representation and analysis.
The study period spans from 2016 to 2022. During this period, authorities recorded 1,396,570 crashes in Germany and published them in the Crash Atlas. Of these, 80% (1,118,604 cases) were crashes involving minor injuries, 19% (263,664 cases) were crashes involving serious injuries, and 1% (14,316 cases) were crashes involving fatalities. Previous research has shown that crash datasets exhibit a high degree of imbalance [
7,
10].
The statistical offices of the federal states and the federal government compile and summarize crash data from the individual states. Thirteen features are recorded for each crash.
Table 1 lists the features that provide spatial, temporal, atmospheric, and situational information. Compared to the datasets used in similar studies, this feature set is less extensive. The descriptions of these features come from the dataset description ‘Unfallatlas’ [
17] and the explanation of basic terms in traffic crash statistics [
18].
As part of the data processing, the statistical offices georeferenced the crashes by matching each crash’s exact location to the nearest road axis. However, the dataset only contains spatial coordinates, not information on the characteristics of the road. To enrich the crash data with additional road information, a scalable approach is necessary.
Free data from OpenStreetMap (OSM) (
https://www.openstreetmap.org (accessed on: 10 May 2025)) serves as the basis for attaching the respective road’s additional properties to a crash. In this study, the data enrichment consists only of the road class as an additional attribute, because information on road width, number of lanes, and maximum speed is missing. Additionally, the selection is limited to OSM roads on which traffic crashes can occur.
4.2. Data Engineering
An examination of the feature data types in
Table 1 reveals that they are mainly categorical. The federal and state statistical offices specify a value from a defined list for each feature. The categorical type of the values often prevents direct comparison or sorting. Transformations and encoding convert these values into a model-readable structure, enabling a model-agnostic evaluation of crash severity. Additionally, sampling methods compensate for unequal crash frequencies to ensure a balanced data set for evaluating crash severity.
Time data, such as hours and months, are cyclical; therefore, a purely numerical representation distorts neighboring values, such as 23:00 and 01:00.
Sine and cosine transformations capture this cyclical nature [
19]. This work transforms the crash times following established practice [
10]. The two new continuous values are stored as features.
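As a minimal sketch, such a cyclical encoding (assuming, for example, hours in the range 0–23) could look as follows:

```python
import numpy as np
import pandas as pd

def encode_cyclical(values: pd.Series, period: int) -> pd.DataFrame:
    """Map a cyclical value (e.g., hour of day, month) onto the unit circle so
    that neighboring values such as 23:00 and 01:00 remain close together."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({f"{values.name}_sin": np.sin(angle),
                         f"{values.name}_cos": np.cos(angle)})

# Example: encode_cyclical(pd.Series([23, 0, 1], name="hour"), period=24)
```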
The categorical, nominally scaled features—crash type, light conditions, road condition, and road class—have no natural order and are not directly comparable. One-hot encoding ensures the ML model considers the individual feature classes independently. While one-hot encoding allows for model-agnostic processing of categorical data, it also increases the number of features, resulting in the ‘curse of dimensionality’ [
20]. In this case, the number of features increased from 13 to 52 due to the transformations, requiring higher model complexity and more training data.
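A corresponding one-hot encoding step could be sketched as follows; the column names are illustrative and need not match the dataset exactly:

```python
import pandas as pd

# Nominal features named above; the exact column labels are assumptions.
CATEGORICAL = ["crash_type", "light_conditions", "road_condition", "road_class"]

def one_hot_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Expand each nominal feature into independent binary columns so the ML
    model does not impose an artificial order on the feature classes."""
    present = [c for c in CATEGORICAL if c in df.columns]
    return pd.get_dummies(df, columns=present, dtype=int)
```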
We used a self-developed sampling approach to train an ML model that can classify crash severity based on features. We subsequently used the test dataset to evaluate the model’s quality and visualize the semi-global influencing factors. To train a meaningful ML model, a large amount of uniformly distributed, heterogeneous crash data is required. We compile the training data independently of specific geographical regions to avoid distortions and maximize the model’s performance with training data as variable as possible. However, the distributions of individual crash severities in the enriched dataset are still imbalanced. Previous work additionally highlights this issue [
15]. They compared trained ML models with balanced and unbalanced data sets. They found that correlations between input features and target variables were only present in balanced datasets. Finding the right class in an imbalanced dataset was difficult, and classification quality based on error measures was over 10% worse (NN, decision trees, SVM, and Naive Bayes were examined). The use of undersampling or oversampling is therefore essential. One disadvantage of undersampling is that the ML model cannot learn from the important information contained in randomly deleted instances. Researchers frequently apply oversampling for this reason. In particular, the Synthetic Minority Over-sampling Technique (SMOTE) is widely used. Rather than simply copying instances of the underrepresented class until the class distribution is equal, SMOTE creates synthetic instances. Copying the data would lead to overfitting because the same instances would always form the basis of the data. SMOTE creates new instances by generating feature values based on neighboring instances, resulting in unique values. Overfitting can also occur when many synthetic instances are generated from little original data. SMOTE adjusts inherently imbalanced crash data, as demonstrated in previous studies [
5,
14]. The new sampling approach also uses this proven sampling method.
When SMOTE was applied to the German crash data, the subsequent overrepresentation of previously underrepresented crash severity classes was evident. To generate the synthetic instances, we followed standard practice and used k = 5 nearest neighbors for the fatal crash class, increasing its size to 160,000 cases. Since undersampling and SMOTE alone did not produce satisfactory results, we combined the strengths of the two methods to create our own approach: first, SMOTE increased the number of fatal crashes, and then the overrepresented minor and serious crash classes were downsampled to match the number of fatal crashes. This combined sampling approach prevents overfitting, which could result from including too much synthetically generated data, and simultaneously avoids loss of information that would result from pure undersampling. Sensitivity analysis was conducted by varying k from 3 to 7, which showed minimal impact on the model performance, confirming the robustness of the approach.
Figure 2 illustrates the post-resampling class distributions, verifying the validity of the synthetic samples. For later evaluation, crashes from Mainz were removed from the training dataset and used as test data; these test data remained unbalanced to reflect real-world conditions.
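A minimal sketch of this combined sampling step, using the imbalanced-learn library and illustrative class labels ('minor', 'serious', 'fatal'), might look like this; the exact implementation may differ:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def resample_crashes(X, y, n_fatal=160_000):
    """Combined sampling: first SMOTE the fatal class up to n_fatal instances
    (k = 5 nearest neighbors), then randomly downsample the minor and serious
    classes to the same size."""
    smote = SMOTE(sampling_strategy={"fatal": n_fatal}, k_neighbors=5,
                  random_state=42)
    X_over, y_over = smote.fit_resample(X, y)
    under = RandomUnderSampler(sampling_strategy={"minor": n_fatal,
                                                  "serious": n_fatal},
                               random_state=42)
    return under.fit_resample(X_over, y_over)

# X_train, y_train: encoded features and severity labels with Mainz held out.
# X_bal, y_bal = resample_crashes(X_train, y_train)
```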
4.3. Classification
The prepared crash data were used to train models predicting crash severity, with XGBoost and a multilayer perceptron neural network (MLP-NN) representing tree-based and neural approaches, respectively. This allows a comparison of fundamentally different architectures and confirms XGBoost’s suitability for complex, real-world crash data.
We performed the initial training of the XGBoost and MLP-NN classifiers using summarized crash data from Germany. In this case, we did not sample the data (see
Figure 3). Then, we repeated the training with the adjusted crash severity classes. For the third model training, we used the balanced crash data and considered the road class for the two ML approaches. However, this data augmentation only improved the XGBoost model.
Table 2 presents the multi-class classification results. The comparison of XGBoost with MLP-NN serves as an ablation study to demonstrate the relative robustness and suitability of tree-based ensemble models for crash severity prediction.
After adjusting the class frequencies in the training data, most error measures improved because the model could correctly classify more serious and fatal crashes (see
Figure 4). The recall values for these classes are 75% and 38%, respectively. However, among the crashes predicted as fatal in the test dataset, minor crashes now outnumber actual fatal crashes. This is evident in the low precision value of 3% for the fatal crash class and the decrease in the recall value to 70% for the minor crash class. These results may indicate overfitting of the models and require further investigation. The classification accuracy also decreased by 20 percentage points, falling to 65%. The model now more often predicts the previously overrepresented minor crashes as other crash severities. However, the 26% increase in the overall recall value shows that the model assigns crash severity more accurately, particularly for underrepresented classes. The XGBoost model correctly classified three out of four actual fatal crashes. However, it incorrectly predicted 26 serious and 89 minor crashes as fatal, resulting in an overall precision value of 38.7%.
The two models examined, XGBoost and MLP-NN, have similar error values, which qualitatively confirms the classification results of these two independent, structurally different ML models. Like the XGBoost classifier, the MLP-NN predicts serious and fatal crashes only after the class frequencies are adjusted. However, this leads to possible overfitting of the model due to the incorrectly classified minor crashes. Nevertheless, the MLP-NN shows slightly better results in all error measures (1–3%) because this model correctly predicted more minor crashes.
Adding another feature, the road class, makes the data set more complex. This requires a model that can correctly map the increased complexity. Only the XGBoost algorithm increased the quality of the classification results (see
Figure 5). The MLP-NN performed worse in all error measures. It misclassified crashes with minor injuries as crashes with serious or fatal injuries. This is evident in the class’s recall value of 34%. Additionally, the model identified fewer actual crashes with fatal severity. After enriching the data with the road class, the model incorrectly predicted more crashes as serious crashes.
The XGBoost classifier will be utilized as the primary model for the subsequent stages of processing. The selection of XGBoost is based on both empirical evidence from previous research and its suitability for the characteristics of the crash dataset used in this study. XGBoost is a machine learning algorithm that has gained significant traction in the research community due to its three key strengths: robustness, computational efficiency, and the effective handling of heterogeneous, high-dimensional data [
5]. In contrast to conventional decision trees or random forests, XGBoost employs an ensemble of gradient-boosted trees, where each tree sequentially corrects the residual errors of its predecessors. This iterative learning process, when combined with built-in regularization, renders XGBoost particularly resistant to overfitting—a frequent issue in crash datasets characterised by strong feature correlations and class imbalance [
10].
Several comparative studies [
5,
10,
15,
16] have demonstrated that XGBoost consistently outperforms other classifiers, including support vector machines, logistic regression, and even deep neural networks, in the prediction of traffic crash severity. Its capacity to effectively integrate both categorical and continuous variables matches the structure of the German crash dataset, which encompasses a range of feature types (e.g., road type, lighting conditions, vehicle involvement). Furthermore, in contrast to many neural networks, XGBoost provides inherently interpretable structures through feature importance and Shapley-based post hoc explanations, which is advantageous for integrating XAI into geospatial analyses.
We then used cross-validation for hyperparameter tuning to identify the optimal parameters for the model. We configured the model for a multiclass classification problem using a softmax objective function. It consists of 500 decision trees with a maximum depth of four levels. To reduce overfitting, we used only 80% of the training data and 60% of the features to construct each tree. We employed L1 and L2 regularization to limit model complexity. We deliberately chose a low learning rate of 0.015 to promote stable and gradual convergence. Additional fine-tuning improved all error measures by 2%.
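The reported configuration corresponds roughly to the following XGBoost setup; the regularization strengths are placeholders, since their exact values are not stated in the text:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="multi:softmax",  # softmax objective for the three severity classes
    n_estimators=500,           # 500 boosted decision trees
    max_depth=4,                # maximum tree depth of four levels
    subsample=0.8,              # 80% of the training data per tree
    colsample_bytree=0.6,       # 60% of the features per tree
    reg_alpha=1.0,              # L1 regularization (illustrative value)
    reg_lambda=1.0,             # L2 regularization (illustrative value)
    learning_rate=0.015,        # low rate for stable, gradual convergence
)
# model.fit(X_bal, y_bal)  # balanced training data from the sampling step
```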
4.4. XAI
To transparently represent the semi-global influencing factors of the XGBoost model, we apply XAI. We use SHAP for model diagnostics because of its ability to provide both local and global interpretations of model decisions simultaneously [
21]. Unlike alternative approaches, such as LIME, which primarily provide local explanations, SHAP allows for a more consistent assessment of influencing factors across many instances. For the XGBoost model, we use a specially adapted TreeExplainer that calculates SHAP values efficiently within the model. This enables high-performance analysis of large data sets (
https://github.com/shap/shap (accessed on: 10 May 2025)).
Our goal is to use SHAP values to explain all crashes in the Mainz test data. These consist of 878 instances, for which the algorithm calculates one SHAP value per feature and per target class—i.e., 52 features across three classes—resulting in 136,968 individual SHAP values in total.
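A minimal sketch of this computation with the SHAP library (variable names are illustrative) is shown below:

```python
import shap

explainer = shap.TreeExplainer(model)          # tree-optimized explainer for XGBoost
shap_values = explainer.shap_values(X_mainz)   # Mainz test set with 878 instances

# For a multi-class model, SHAP returns one (n_instances, n_features) array per
# class (as a list or a 3-D array, depending on the SHAP version):
# 878 instances x 52 features x 3 classes = 136,968 SHAP values in total.
```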
Using a waterfall plot, for example, we can visualize the influencing factors for each target class locally for each instance, as shown in
Figure 6,
Figure 7 and
Figure 8.
Figure 5.
The confusion matrices show the classification results of XGBoost trained on the German data after adjusting the class distribution in the training set. We also enriched the data with road class information.
Figure 6.
The factors influencing an ML prediction are visualized using waterfall plots of the three crash severities (e.g., ‘Persons killed’). In the SHAP plots, negative contributions are shown in blue, while positive contributions are shown in red.
Figure 7.
The factors influencing ML prediction are visualized using waterfall plots of the three crash severities (e.g., ‘Persons seriously injured’). In the SHAP plots, negative contributions are shown in blue, while positive contributions are shown in red.
Figure 8.
The factors influencing an ML prediction are visualized using waterfall plots of the three crash severities (e.g., ‘Persons slightly injured’). In the SHAP plots, negative contributions are shown in blue, while positive contributions are shown in red.
In addition to SHAP values, the algorithm calculates a base value, E[f(x)], for each class. This expresses the basic probability of an individual class in the case of an unbalanced dataset. In other words, this probability describes the occurrence of a class regardless of the individual features of an instance. If all target classes occur equally, the base value is the same for each class.
To analyze the influencing factors using SHAP values on unbalanced test data, we input the trained ML model and the class ratios in the test dataset to the TreeExplainer. This enables us to apply crash severity predictions and their influencing factors to new unbalanced data sets. The calculated base probabilities of the classes in the test dataset reflect the frequencies of different crash severities, with minor injuries being the most common. These base values serve as a reference point for calculating the SHAP values and account for the class imbalance in the dataset.
The library outputs the SHAP values in logit space, which we then convert into probabilities using the softmax function for interpretation:
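In its standard form, the softmax referenced as Equation (1) reads

\[
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}, \qquad i = 1, \dots, K,
\tag{1}
\]

where z_i denotes the logit of class i (the class base value plus the sum of its SHAP values) and K = 3 is the number of crash severity classes.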
The softmax function (1) is a common method for converting logits to probabilities in classification tasks [
22].
The SHAP values of the categorical features required post-processing because pre-processing with one-hot encoding split them into several binary variables (e.g., crash type and lighting condition). One original feature consists of several feature classes (see
Figure 9). This makes explanations more difficult because there are now several influencing factors for each original feature.
To improve comprehensibility, we grouped the feature classes according to the one-hot encoding and summed their SHAP values for each original feature. This step is mathematically permissible due to the additivity property of SHAP values. We assigned the groups based on the coding structure and order of the feature matrix. As a result, each original categorical feature of an instance again received a single influence value per target class (see
Figure 10).
This aggregation leaves the total sum of the SHAP values unchanged and still corresponds to the predicted probability. However, it is important that the resulting SHAP value of a feature reflects the interaction of all associated categories—both their presence and the absence of others.
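A minimal sketch of this aggregation (assuming the encoded columns carry the original feature name as a prefix, e.g., 'crash_type_2') could look like this:

```python
import pandas as pd

def merge_onehot_shap(shap_df: pd.DataFrame, prefixes) -> pd.DataFrame:
    """Sum the SHAP values of one-hot encoded columns back into a single value
    per original categorical feature; valid due to the additivity of SHAP values.

    shap_df  : SHAP values for one target class, columns named like 'crash_type_2'
    prefixes : original categorical feature names used as one-hot prefixes
    """
    merged = shap_df.copy()
    for prefix in prefixes:
        cols = [c for c in shap_df.columns if c.startswith(prefix + "_")]
        if cols:
            merged[prefix] = shap_df[cols].sum(axis=1)
            merged = merged.drop(columns=cols)
    return merged
```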
Another way to visualize the influencing factors using the SHAP library is with a beeswarm plot. Unlike the previous illustrations, this provides a global view of the influences.
Figure 11,
Figure 12 and
Figure 13 show beeswarm plots for each of the three crash severity classes. Notably, ‘Crash with Other Vehicle’ is the most important feature for predicting fatal crashes. When this feature has a value of one, i.e., when the vehicle type is involved, the probability of this class increases. The same is true for pedestrian involvement. The road class ‘residential road’ and the crash type feature value of 2 (‘Collision with preceding/waiting vehicle’) lower the probability of this class.
The probability of a crash being predicted as serious is influenced by the feature values of crash type 6 (‘crash in longitudinal traffic’) and crashes involving vehicles. Unlike fatal crashes, the absence of these features increases the probability of this class. In other words, if no car was involved in a crash, the model increases the predicted probability of a serious crash. The vehicles involved primarily influence the predicted probability of crashes with minor outcomes. The involvement of a bicycle, pedestrian, motorcycle, or other vehicle negatively affects the prediction probability, as does the road class ‘highway’. Crash type 2 (‘collision with preceding/waiting vehicle’) reduces the probability of a fatal crash. For predicting crashes with minor injuries, this feature has the greatest positive influence on the ML model.
Therefore, the beeswarm plot is a good way to map the influencing factors in an ML model. In addition to showing the absolute influence, it indicates whether a feature has increased or decreased the probability. It also takes into account the respective value of the feature, which illustrates the advantage of a beeswarm plot over a classic bar plot. While the latter only shows the importance of individual features for the model, the beeswarm plot shows the influence of each value and class-dependent influences.
However, the beeswarm plot in
Figure 13 shows that we did not merge the categorical features. Summing up the SHAP values of the individual feature classes generalizes the analysis, so we have not performed this step yet. Some feature classes increase the prediction probability, while others reduce it. Additionally, the feature classes have nominal scales, meaning they have no natural order. A higher crash type class does not necessarily have a higher or lower influence on the prediction. Summarizing the SHAP values of features with multiple classes means that their influences can no longer be interpreted using a Beeswarm plot (see
Figure 14).
We used a boxplot to visualize how feature classes influence a specific crash class (see
Figure 15). For example, we use the boxplot to show the individual crash type categories and their average influence on the probability of a crash being classified as fatal. This representation allows one to compare the influencing factors of the classes of an individual feature. However, it is important to note that all instances are considered—including those the model does not predict as fatal crashes. We based our approach to handling and visualizing categorical features on the work of [
23].
Since a beeswarm plot becomes meaningless after SHAP values are merged, a bar plot is used instead (see
Figure 16). This plot clearly shows the average absolute SHAP values per feature and crash category. This representation is less detailed but easier to understand.
In the next chapter, we present an approach to visualizing semi-global influencing factors, including the use of bar plots. We subdivide the dataset to create bar plots for specific predictions. This enables us to address user inquiries such as ‘on average, which feature has the greatest influence on the ML model when identifying minor crashes as fatal crashes?’ It also enables us to evaluate classification results and misclassifications based on many instances.
5. Results and Discussion
We developed this concept to address the research questions posed in
Section 1. The goal is to present a solution for calculating semi-global influencing factors and, subsequently, communicate them in an accessible manner. There is a research gap in communicating spatial influencing factors. We designed the prototypical dashboard implementation to present a possible solution to this issue. In the use case, we dynamically summarize crash data, ML results, and influencing factors to enable in-depth analysis. This allows specialist users, such as road planners, crash researchers, ML experts, and road users, to explore the large number of individual ML predictions or evaluate the quality of the ML model. Geo-referencing crashes makes it possible to show spatial patterns. We developed this prototype for German analysts, so the upcoming screenshots show only the German version.
We sorted the Mainz crash data by spatial location, as shown in the concept. Using the districts of Mainz as an example, we demonstrate this approach. Other administrative units or socio-spatial boundaries can also serve as subdivisions. However, we have not examined these spatial units in depth, so we cannot identify small-scale influences, such as dangerous traffic junctions. We can now visualize the subdivided point data on a map using areal representations (see
Figure 17). Additionally, other visualization concepts can highlight differences between districts. In the study, we visualized the number of crashes per district and the model’s prediction quality using color-coding and error measures for each district. Depending on the influencing factors of interest, other color schemes would also be conceivable to achieve even better spatial comparisons with the aid of maps. When comparing city districts, it is clear that areas in the city center have a much higher crash rate, primarily due to higher traffic volume rather than an increased risk of crashes. Crash frequency should therefore be included as an additional target variable so that an ML model can capture frequency in addition to severity. Consequently, we can only compare the districts to a limited extent based on their crash numbers. Additionally, the influencing factors vary by district, and using a single ML model does not account for district-specific characteristics.
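As an illustration, the per-district crash counts and quality measures used for the map coloring could be derived roughly as follows (column names are illustrative assumptions):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

def district_quality(df, district_col="district",
                     actual_col="actual", pred_col="predicted"):
    """Number of crashes and simple error measures per district, as used for the
    color-coded district map (a sketch, not the dashboard's exact logic)."""
    rows = {}
    for district, grp in df.groupby(district_col):
        rows[district] = {
            "n_crashes": len(grp),
            "accuracy": accuracy_score(grp[actual_col], grp[pred_col]),
            "macro_recall": recall_score(grp[actual_col], grp[pred_col],
                                         average="macro", zero_division=0),
        }
    return pd.DataFrame(rows).T
```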
In addition to district-based segmentation, alternative spatial analysis methods should be considered to capture transport-specific dynamics more effectively. Roussel [
16] suggests several geovisualisation techniques for geospatial XAI that transcend administrative boundaries. One promising approach is mapping Shapley values onto the street network, which visualizes feature influences directly along transportation infrastructure. This method uses Voronoi-based interpolation (Shapley Mapped Voronoi Values) to associate explanatory values with specific street segments, thereby preserving spatial continuity and avoiding interpretative biases caused by arbitrary administrative borders. Compared to district-level aggregation, network-based representations can reveal linear spatial dependencies. For instance, they can reveal recurring crash patterns along main roads or at intersections with high-risk features, which may be masked when using areal units. However, such methods also increase computational and visual complexity, particularly when many influencing features are involved. While the present study focuses on administrative units due to their communicability and prototype feasibility, future work should evaluate network-based and multi-scale segmentation approaches to enhance the explanatory depth of geospatial XAI in traffic crash analysis, as suggested by Roussel [
16].
Clicking on a district reveals a dynamic, interactive confusion matrix for the ML predictions of crash severity in that district. This matrix compares the predicted and actual crash severity in the selected district. This makes it easy for users to understand how error measures are calculated. This will also allow users to identify potential classification errors in the model.
Figure 18 shows a district overview from the web dashboard.
Clicking on a cell in the confusion matrix generates a bar plot showing the average absolute SHAP values for this crash severity. One disadvantage of the bar plot from the SHAP library is that it generalizes the data too much. When a bar plot was generated for a crash severity class, the algorithm calculated the average SHAP value of a feature from all instances, even when the classifier predicted the investigated class as the least likely for an instance. Furthermore, it was not possible to consider correctly or incorrectly classified instances separately. Therefore, the presentation of the results in the bar plot was distorted. It was not possible to conduct a differentiated analysis of which features had the greatest influence on incorrect classifications. However, by dynamically calculating the average absolute influencing factors depending on the actual and predicted crash severity in the web dashboard, users can now access the influencing factors specifically (see
Figure 19). Thus, we eliminated the limitations of the global bar plot to gain a clearer understanding of the ML prediction and its influencing factors. However, the average absolute SHAP values represented by the bar plot do not indicate whether a feature increased or decreased the probability of predicting crash severity. Therefore, the bar plot expresses the importance of a feature in the ML prediction rather than providing information on how a feature value influences the prediction of individual crash severity. Additionally, very large or very small SHAP values can distort the average data in individual cases.
We also integrated a beeswarm plot to enable users to take a closer look at the composition of the average absolute SHAP values shown in the bar plot and to gain a deeper understanding of the influencing factors and their interrelationships. This plot shows the individual SHAP values of the instances. We create an axis for each feature and mark the SHAP values as points on each axis. Interacting with the points displays the respective SHAP value and the decisive feature value. This allows users to understand the direction of importance and composition of the SHAP values. As with a typical Beeswarm plot, we color-code the points according to their feature values. However, the markings do not show a continuous color gradient because we did not sort the nominally scaled features. Nevertheless, this highlighting allows users to recognize possible correlations between feature and SHAP values.
Figure 20 shows an example. This diagram illustrates the factors that influenced the misclassification of minor crashes as serious crashes in a neighborhood. Looking at the feature and SHAP values together makes it possible to identify the patterns that caused the ML model to misclassify these crashes. As can be seen here, the involvement of a motorcycle in a crash increases the probability of serious crash severity. The lack of motorcycle involvement minimally reduced the predicted probability of crash severity. However, the bar plot does not show this correlation. Moreover, it was not possible to determine whether indicating motorcycle involvement lowered or increased the probability (see
Figure 19). Therefore, the visualization of individual SHAP values using the beeswarm plot indicates that motorcycle involvement increases the predicted probability of a serious crash in this district. Users can derive similar correlations and influences from the confusion matrix diagrams.
The developed concept enables targeted analysis of crash patterns at the district level, reduces the amount of displayed data, and combines interactive visualizations with differentiated model evaluation. Separating correctly and incorrectly classified crashes and using common chart types supports a transparent and accessible interpretation of the influencing factors.
In addition to displaying crash severity and its influencing factors at the district level, the dashboard allows users to examine individual crashes freely. This encourages users to explore individual crashes. When examining local influencing factors, the dashboard primarily serves as a visualization and navigation tool to display the results of individual calculations rather than offering analysis options.
The developed process sequence can provide case-independent statements on how the prediction probability of an ML model is formed. Each crash contains unique information, yet all crashes share the same structure. To this end, we color-code crashes on an overview map of a city district, building on previous investigations of the map-based communication of influencing factors [
16]. In addition to the crash class, additional information, such as the most influential feature in the decision-making process, a comparison of actual and predicted crash severity, or prediction uncertainty, could increase the map’s informational content. Interacting with a marker displays the details of an individual crash. Crash severity is the most important information and the main inspection feature. The information window displays the probabilities of the individual crash classes, as determined by the classifier. We present the prediction probability alongside a comparison with the actual crash severity class and indicate the uncertainty of the prediction. We display this extensive information in table form for the clicked marker to avoid overloading the map with complex visualizations. Nevertheless, we enable users to view the map and crash-related information simultaneously. The information window lists the classification features and their respective values. This gives the user an easy overview of the crash. Details from the database enable reconstruction of the crash. Meaningful descriptions replace coded values in the database, allowing users unfamiliar with statistical office crash statistics to understand the information describing the crash.
Another difficulty was communicating the influences of this crash information in an accessible way. We developed a concept to address the following user question: how did the lighting conditions at the time of the crash affect the likelihood of serious injuries?
Influencing factors describe how individual features influence the probability of a class in the classification. Depending on whether they have increased or decreased the predicted probability of a crash severity, these influencing factors can be positive or negative. The magnitude of an influencing factor determines how strongly a feature influences the prediction. To help users understand these factors easily, we indicate them with symbols in the list of crash information next to each feature. Rather than indicating the influence as a percentage, arrows were used to visualize the direction of influence (downward arrows “↓” express a negative influence on the prediction, and upward arrows “↑” express a positive influence). This relative indication of influences simplifies traceability in an ML classification. In addition, features with the largest SHAP values (i.e., the features with the greatest influence) are marked with a double arrow. This allows users to identify the most influential features without examining the exact SHAP values (see
Figure 21).
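A minimal sketch of how such arrow symbols could be derived from the local SHAP values of one crash and one target class (a simplified illustration, not the dashboard's exact logic):

```python
def influence_symbols(shap_row: dict) -> dict:
    """Map each feature's local SHAP value to an arrow: the direction encodes the
    sign, and a double arrow marks the feature with the largest absolute value."""
    strongest = max(shap_row, key=lambda f: abs(shap_row[f]))
    symbols = {}
    for feature, value in shap_row.items():
        arrow = "↑" if value >= 0 else "↓"
        symbols[feature] = arrow * 2 if feature == strongest else arrow
    return symbols

# Example with illustrative values:
# influence_symbols({"Crash type": 0.42, "Light conditions": -0.05, "Road class": 0.11})
# -> {"Crash type": "↑↑", "Light conditions": "↓", "Road class": "↑"}
```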
The dynamic dashboard enables users to select and analyze class-specific local SHAP values. Selecting a different crash class automatically updates the symbols and diagrams. Thus, the dashboard allows users to directly read and compare the composition of the model’s influencing factors.
Figure 21 shows a screenshot of the dashboard.
Users gradually receive increasingly detailed and comprehensive information on local ML decisions, culminating in the full display of waterfall and decision plots (see
Figure 22 and
Figure 23). Here, experts can view all values simultaneously. However, potential users often lack prior knowledge of ML classifications and therefore cannot interpret the specified class probabilities and SHAP values. In extreme cases, this could lead the user to interpret the specified influencing factors as the importance of a feature for crash severity in the real world. The dashboard does not clearly communicate that the calculated influencing factors relate only to the ML prediction and not to the actual causes of the crash.
The work primarily focused on creating a concept for calculating and visualizing the semi-global influencing factors of spatial data sets. We confirmed this concept through practical application. This concept made it easy to identify misclassifications and their respective semi-global influencing factors. The dashboard enabled analysis of how the ML model makes its decisions. This allows for many subsequent investigations, such as those of crash data provided by federal and state statistical offices. As we discussed in
Section 2, crashes must be described by relevant characteristics. During publication, information such as the exact date and details about the individuals involved (e.g., age, gender, and driving experience) or possible drug or alcohol influences was removed. The state and federal statistical offices even generalized or summarized other details, such as weather conditions, the number of people involved, speeds, and road data. The absence of these demonstrably relevant influencing factors can lead to a simpler, and therefore less meaningful, model. We were not able to consider important crash characteristics when training the ML model. This has caused incorrect correlations in the crash data or created information gaps in the crash description. These issues become apparent when analyzing the model using SHAP. For example, the feature ‘Crash with Other Vehicle’ (e.g., truck, bus, or streetcar) had a disproportionate influence on misclassified crashes in which the severity was incorrectly predicted as fatal instead of minor, since motor vehicles were predominantly involved in fatal crashes (see
Figure 11). We trained the model using German crash data, so it can only predict crash severity based on the 13 crash features. This resulted in information gaps in the data. Consequently, crashes involving serious injuries were described in the same way as those involving minor injuries. Due to the small number of features, the model could not assign crashes to the correct severity level. The high misclassification rate of the trained ML model demonstrates this. In this context, the use of freely available OSM data also requires critical consideration. Because OpenStreetMap depends on voluntary user contributions, the data may lack completeness or timeliness. This primarily refers to the completeness of a road’s properties, rather than the road network. It is necessary to investigate whether additional information could further improve quality. Besides additional road characteristics, the most important additions are weather data and details about the individuals involved in the crashes. Developing efficient approaches to data enrichment and identifying new sources of information is essential. Additionally, we must check whether an ML model can handle greater data complexity. We found that adding road class as a feature did not improve the MLP-NN’s predictions. Further investigation of more complex ML models may be necessary to assess their suitability for predicting crash severity.
While SHAP values provide a robust measure of feature importance for the ML model’s predictions, they primarily capture the isolated contribution of individual features. Interactions between features—such as the combined effect of rain and nighttime—are not directly visualized in the current analysis, which could mask nonlinear or conditional dependencies affecting crash severity. Decision rules derived from tree-based models inherently encode such conditional relationships but interpreting them for all possible feature combinations is complex and was not the focus of this study. Future work could incorporate SHAP interaction values or partial dependence plots to explore these dependencies systematically, while taking care to maintain interpretability and avoid overcomplicating the visualization for end-users.
The use of unbalanced crash data quickly made a sampling approach necessary. Unlike other studies, we employed a combination of undersampling and oversampling. The small number of categorical features, combined with the unbalanced class distribution, makes it difficult to distinguish the individual crash classes and leads to poorer outcomes compared to earlier studies [
4,
7]. This could be related to the conservative “macro averaging” used for the calculation.
Thanks to the dashboard, it is possible to conduct deeper investigations into the factors that influence ML decisions. The spatial and thematic breakdowns of the data enabled a thorough investigation of the crash data and the classifier.