Mining Topological Dependencies of Recurrent Congestion in Road Networks

The discovery of spatio-temporal dependencies within urban road networks that cause Recurrent Congestion (RC) patterns is crucial for numerous real-world applications, including urban planning and scheduling of public transportation services. While most existing studies investigate temporal patterns of RC phenomena, the influence of the road network topology on RC is often overlooked. This article proposes the ST-Discovery algorithm, a novel unsupervised spatio-temporal data mining algorithm that facilitates the effective data-driven discovery of RC dependencies induced by the road network topology using real-world traffic data. We factor out regularly reoccurring traffic phenomena, such as rush hours, mainly induced by the daytime, by modelling and systematically exploiting temporal traffic load outliers. We present an algorithm that first constructs connected subgraphs of the road network based on the traffic speed outliers. Second, the algorithm identifies pairs of subgraphs that indicate spatio-temporal correlations in their traffic load behaviour to identify topological dependencies within the road network. Finally, we rank the identified subgraph pairs based on the dependency score determined by our algorithm. Our experimental results demonstrate that ST-Discovery can effectively reveal topological dependencies in urban road networks.


Introduction
Urban road networks possess complex interdependencies that can become apparent during congestion events [13]. Established traffic research distinguishes between Recurrent Congestion (RC), e.g., rush hours, and Non-Recurrent Congestion events, e.g., accidents [9]. This article aims to identify topological dependencies within the road network that may cause RC phenomena, henceforth called structural dependencies. Such dependencies are often not well understood and can become apparent only under real traffic load and can cause co-occurring RC patterns in the road network. Therefore, understanding topological dependencies in urban road networks is crucial for many real-world applications, including city planning, traffic management, and public transportation services.
We illustrate structural dependencies in urban road networks at the example of the area of Gehrden -a town in the district of Hanover, Germany -in Figure 1. This figure illustrates two subgraphs of the road network (marked blue and purple). Both subgraphs represent the feeder roads to the main highway (B65) connecting Gehrden and the city of Hanover that constitutes the main commuting route for the Gehrden inhabitants. During a period with a high traffic load (e.g. during a rush hour), these subgraphs are typically simultaneously congested due to the network topology. Thus, we consider such subgraphs to be structurally dependent.
The existing literature on RC mainly identifies temporal patterns [1,2]. However, we observe a lack of methods that investigate the influence of road network topology on RC. Detection of such dependencies within complex road networks is not trivial, particularly due to the variety of the influence factors (e.g. planned special events, accidents, construction sites and extreme weather conditions) and their dynamic impact on the traffic flow concerning the spatial and temporal dimensions. To the best of our knowledge, the task of the data-driven discovery of structural dependencies in urban road networks is new and has not been addressed in the literature.
This article presents ST-Discovery-a novel unsupervised data-driven spatio-temporal data mining algorithm to reveal structural dependencies in urban road networks. ST-Discovery relies on the intuition that structural dependencies can manifest as correlations of congestion patterns. In this article, we represent congestion patterns as subgraphs of the road network. We aim to exclude RC patterns that are mainly induced by temporal factors such as rush hour patterns that mainly ©OpenStreetMap contributors, ODbL Figure 1: Example of structural dependencies in a urban road network observed near Gehrden, Germany.
depend on the daytime. To this end, we only consider temporal traffic load outlier that factor out daily patterns to construct the subgraphs. We identify the subgraphs using spatial clustering of the connected network segments that indicate a high traffic load. We consider temporal correlations of subgraphs located in spatial proximity as indicators of structural dependencies in the underlying road network.
To assess the effectiveness of the proposed ST-Discovery algorithm, we conducted a case study. This case study utilises two real-world traffic datasets in the regions of Hanover and Brunswick, Germany. The study results illustrate that our method can accurately detect meaningful structural dependencies in urban road networks.
In summary, our contributions are as follows: (1) We introduce the new task of the data-driven discovery of structural dependencies in road networks; (2) We propose the novel ST-Discovery algorithm to detect structural dependencies using traffic flow data; and (3) We conduct a case study to assess the effectiveness of the proposed method.
This article extends our prior work [27] towards this direction. Compared to [27], in this article, we provide a detailed explanation of the individual algorithmic steps of ST-Discovery and the runtime complexity analysis. Furthermore, we present an extensive evaluation, including the manual assessment of the identified topological dependencies and a detailed investigation of the algorithm's parameter impact. A demonstration that includes visualisation of the dependencies identified by ST-Discovery in an interactive traffic analytics dashboard is available at [28].

Related Work
This section discusses related work in the areas of congestion analysis and spatio-temporal data mining, along with related data sources.
Recurrent congestion denotes the remaining congestion events. Rush hour patterns constitute the most prominent examples of RC events [17]. A large number of studies focus on the temporal analysis of RC. One line of research addresses the prediction of RC where machine learning models such as neural networks [11,12,35] or Support Vector Machines [29] currently constitute the stateof-the-art. The closely related task of short-term traffic forecasting is well studied in the existing literature [30]. Models for both RC prediction and short-term traffic forecasting typically utilise periodically reoccurring traffic patterns, e.g. rush hours and day of week patterns, to facilitate predictions. Another line of research investigates the evolution of RC patterns. Current approaches typically analyse the propagation of RC within a spatial grid [1,2,32] or a road network graph [33,7,23].
While these studies mainly address the temporal aspects of RC, we observe a lack of research that investigate the influence of road network topology on RC. This article proposes an algorithm that filters out periodic traffic patterns and identifies mutual dependencies of subgraphs within the road network.

Spatio-Temporal Data Mining
Spatio-temporal data mining algorithms address the challenge of extracting information, e.g., frequent patterns or anomalies, from large sets of spatio-temporal data. Atluri et al. provide an overview of approaches and problems in a recent survey [5].
Previous data mining approaches for road network data often aim at identifying individual important roads or junctions within the road network. [13] introduced a data-driven approach to identify the importance of individual roads within the road network by measuring the correlation of traffic load between a particular road and the whole road network. Similar, [34] discovers individual important intersections. The authors represent trips within the road network as a tripartite graph. They compute the importance of intersections with an iterative ranking algorithm. In contrast, we consider the problem of identifying pairwise dependencies in the road network.
Several studies consider outlier detection in road traffic data. In [6] anomalous traffic flow is detected by grouping road intersections via their traffic flow patterns and self-organising maps. [19] focuses on detecting outliers in the traffic load by sudden changes. ST-Discovery builds upon existing outlier detection methods and exploits outlier co-occurrences and mutual information to determine spatio-temporal dependencies.

Data Sources
Road traffic data is often collected from stationary sensors, GPS devices, or simulations. Stationary sensors, such as induction loops permanently installed within roads, are traditional traffic data sources. Stationary sensors usually measure high quality and consistent traffic data but lack coverage of the road network, especially in urban environments. Existing research has widely adopted the use of data from stationary sensors [21,8,3,11]. The increasing digitisation of urban traffic has led to a boost in real-world traffic data availability. In particular, floating car data (FCD) is usually collected from GPS devices. FCD enables detailed and realistic insights into the specific regions' actual traffic load. FCD has proven to be suitable data source for the analysis of both RC ( [2,7]) and NRC ( [4]). Compared to data collected from stationary sensors, FCD typically covers a larger fraction of the road network but is less consistent. Simulation-based approaches (e.g. [25,24]) utilise the features originating from the network topology and capacity and can reveal critical parts of road networks and the possible impact of incidents. However, these methods are restricted by the approximations of the underlying models that can provide only rough estimates of the real traffic flow. In this article, we rely on FCD, representing real-world traffic flow data and can provide insights into the topological dependencies that become apparent only under real traffic conditions.

Problem Statement and Formalisation
In this article, we address the problem of identification of the structurally dependent subgraphs in a road network. We consider subgraphs to be structurally dependent if: 1. The subgraphs are located in spatial proximity.
2. The subgraphs are typically simultaneously affected by Recurrent Congestion.
3. The road network topology causes the correlation of the congestion on these subgraphs.
In the following, we formalise the key concepts adopted in the article. In this formalisation, we adopt and, where necessary, extend some of the concepts defined in our previous work [27].
Transportation graph. We represent the road network as a directed multi-graph T G := (V, U ), referred to as a transportation graph. U is a set of edges (i.e. road segments); V is a set of nodes (i.e. junctions). We refer to an edge of the transportation graph as a unit u ∈ U .
Unit load. We denote the traffic flow observed on a unit at a particular time point as unit load. Formally, ul(u, t) is the traffic load on the unit u at the time point t ∈ T , where T denotes the set of time points.
We measure the unit load ul(u, t) ∈ [0, 1] as the relative speed reduction at unit u at time point t with respect to the speed limit lim(u) of the corresponding edge of the transportation graph: where speed(u, t) represents the traffic speed on unit u at time t. Affected unit. We denote a unit that exhibits an abnormally high traffic load at a certain time point t as an affected unit at t. Formally, affected (u,t): U × T → {True, False} indicates whether unit u is affected at time point t.
Subgraph. We refer to a subgraph of the transportation graph as subgraph sg := (V , U ) with V ⊂ V , U ⊂ U . We consider a subgraph sg containing the units sg.U to be affected at time point t if at least one of its units u ∈ sg.U is affected at t:

Approach
In this article, we aim to determine structurally dependent subgraphs in a road network.
Our ST-Discovery approach consists of the following main steps illustrated in Figure 2: (i) We identify affected units of the transportation graph using traffic data. (ii) We develop algorithms to identify affected subgraphs of the transportation graph. (iii) We develop an algorithm to identify structural spatio-temporal dependencies of the identified subgraphs.

Identification of Affected Units
This step aims to identify affected units, i.e. the units that exhibit an exceptionally high load at any single time point. Factors influencing road traffic include reoccurring temporal factors (such as, e.g. rush hours) and spatial factors such as road network topology. As we aim to identify Recurrent Congestion that result from the road network topology, it is desirable to exclude reoccurring temporal factors from our analysis. To factor out exclusively temporal congestion patterns, we identify affected units as temporal outliers by employing the interquartile range (IQR) rule [14]. As the weekday and the day time strongly influence the traffic load, we identify units that exhibit an exceptionally high load compared to the average traffic load on these units at the same weekday and day time. More formally, we consider u to be affected at time t, if the following condition holds: where Qn(u, t) denotes the n th quartile of the unit load on unit u regarding the weekday and day time and IQR = Q3(u, t) − Q1(u, t) denotes the interquartile range. We precompute the upper bound of the unit load Q3(u, t) + 1.5 · IQR for each unit, weekday and day time. We consider each unit u that exhibits a higher unit load ul(u, t) than the corresponding upper bound to be affected at time point t.

Identification of Affected Subgraphs
This step aims to identify topologically continuous, disjunctive subgraphs of the transportation graph, homogeneous concerning their unit load at a given point in time. We approach this goal by conducting spatial clustering of the transportation graph's affected units. In this step, the clustering is performed independently at each point in time. To ensure the spatial continuity of the resulting subgraphs, the clustering of affected units follows the basic region growing principle [10].
We conduct the clustering as follows. First, we put one seed point on each affected unit. During the next steps, the regions are expanded by merging neighboured regions until there is no further change. The neighbourhood of two regions is determined by evaluating the distance between their closest edges in the transportation graph. This distance is measured as the edge count du within the shortest path between the regions.
As the data can potentially be incomplete or contain measurement errors, we allow for certain tolerance while determining the neighbourhood. To this extent, we introduce the threshold du,max to bridge gaps of a predefined size between two regions. Thus, two regions are considered as neighboured and are merged by the algorithm if the condition du ≤ du,max holds. As a result, the units affected at time point t are clustered into a set of n clusters Ct = {c0, . . . , cn}.
Note that this clustering approach utilises the transportation graph's underlying graph structure. Thus, there is a unique mapping between each cluster and the corresponding subgraph of T G. In the following, while referring to the clustering results, we use the terms cluster and affected subgraph interchangeably.

Spatial Merging of Affected Subgraphs
The affected subgraphs identified in Section 4.2 are spatially disjoint with respect to the specific time points. Intuitively, when considering traffic load on the transportation graph over a longer time, we can observe spatial variations of the affected subgraphs, e.g. due to the congestion propagation along the graph. Furthermore, affected subgraphs (and their variations) can reoccur at different points in time. To capture these patterns, we conduct a merging of spatially overlapping affected subgraphs that occur at different time points.
Similarity computation is performed for all candidates (i.e. subgraph pairs) by computing the similarity function similarity : SG × SG → [0, 1] as follows (line 14): Based on the definition of the affected subgraph, we consider the subgraph pairs in which one subgraph entirely contains the other subgraph as a special case that has the maximum similarity of 1. Otherwise, the similarity is computed as Jaccard similarity that measures to which extent the subgraph units overlap. Finally, the merging step aggregates the subgraph pairs with a similarity score above the threshold thsim (line [19][20][21][22][23]. The pairs with the highest similarity are merged first (line 16). Here the function ordered() orders the subgraph pairs in the descending similarity order. As merging affects the similarity computation, in each iteration of the algorithm, any subgraph can be merged only once (line [17][18]. The run time complexity of Algorithm 1 arises from combination of the main while loop in line 7 (O(|C|)) and the iteration over the ordered set of subgraph pairs in line 16 (O(|C| 2 · log(|C| 2 )) = O(|C| 2 · log(|C|))) resulting in a total complexity of O(|C| 3 · log(|C|)).
To facilitate an efficient candidate generation, we maintain a hashmap (commonUnits) that provides a mapping from a single unit u to all subgraphs that contain this unit u (line 2-5). The computation of all subgraph combinations would require quadratic time (O(|C| 2 )) in each iteration. In contrast, the hashmap is computed once (O(|C| · |U |)) and is then updated iteratively during the algorithm according to the performed merging using the function update() (line 23). The loops in line 10 and line 13 can be executed in parallel for further optimisation.

Identification of Spatio-Temporal Dependencies
In this step, we aim to identify dependent subgraphs of the transportation graph, i.e. the subgraphs that are typically simultaneously affected and are located in spatial proximity. To this extent, we consider the subgraphs identified and merged in Section 4.3. These subgraphs represent topologically connected subgraphs of the transportation graph, including their spatial variations, that have been affected at some point(s) in time. In this step, we bring the temporal dimension into consideration and aim to identify the pairs of these subgraphs that are typically simultaneously affected.  1-7), where the columns correspond to the subgraphs and the rows to the time points. If a subgraph is affected at time point t, then the corresponding cell is set to 1, otherwise to 0. From the occurrence matrix, candidate subgraph pairs are generated (line [8][9][10][11][12][13][14][15]. Each subgraph pair that is affected simultaneously in at least one time point is considered as a candidate pair. For each candidate pair, we compute the subgraph dependency score. The intuition behind this score is to capture both the temporal co-occurrence and the spatial proximity of the subgraphs. Therefore, the score is computed as a combination of the mutual information and an inverse spatial distance metric:

Algorithm 1 Merge Subgraphs
Here, dist(sg1, sg2) denotes the shortest geographic distance between two subgraphs. The threshold distmin specifies the minimum geographic distance for a subgraph pair to be considered dependent. distmin allows excluding trivial dependencies, such as adjacent subgraphs. The mutual information mi(sg1, sg2) aims to assess the temporal co-occurrence of two subgraphs, computed as: where Ti = {t ∈ T | affected(sgi, t)} denotes the set of time points in which the subgraph sgi is affected. The spatial proximity is measured as the inverse distance, where dist(sg1, sg2) denotes the shortest geographic distance between two subgraphs. Finally, the subgraph pairs are ordered in the descending order of their dependency scores (line 20, ordered()).
The run time complexity of Algorithm 2 results from the identification of candidates (O(|SG| 2 ·T )) in line 9 and line 12 as well as the sorting of subgraph pairs by score (O(|SG| 2 · log(|SG|))) in line 20. Therefore, the overall complexity is bounded by O(|SG| 2 · (T + log(SG))). Finally, the for loop in line 17 can be executed in parallel.

Input:
SG: Set of subgraphs T : Set of time points Output: P dependent Set of pairs of subgraphs, ordered by dependency score

OpenStreetMap
OpenStreetMap (OSM) 2 is a provider of publicly available map data. We make use of the OSM road network to form the transportation graph T G. OSM partitions roads in smaller road segments that correspond to the transportation graph units T G. In particular, we extract the road segments located within the cities of Hanover and Brunswick, Germany. Considering the OSM-taxonomy for road types, we restrict the transportation graph to the major roads, as reliable traffic information for smaller roads is rarely available. In particular, we extract all roads that belong to one of the following classes: {primary, primary link, secondary, secondary link, tertiary, tertiary link, motorway, motorway link, trunk, trunk link}. The extracted transportation graphs contain approx. 23,000 units (Hanover) and 7,600 units (Brunswick) in total. For each unit u ∈ T G, available information regarding the speed limit lim(u) as well as the road type is extracted from OSM.

Traffic Dataset
The experiments conducted in this article employ a proprietary traffic dataset that contains aggregated floating car data. In particular, the dataset provides traffic speed records for each unit u of the transportation graph T G. The dataset was collected by a company offering routing software. The dataset contains data contributions obtained from various sources, including the data collected from the users of the routing software and traffic data acquired from third-party data providers. Although detailed statistics of these contributions, such as the number or the types of the monitored vehicles, are not available to the authors, due to the variety of sources involved we do not expect any particular biases towards certain vehicle types or expense classes. The traffic data records contain the average traffic speed on the individual transportation graph units at discrete time points, i.e. speed(u, t), recorded every 15 minutes. The average speed records are computed by the data provider through calculating the average traffic speed from the raw floating car data, averaged over all vehicles for which the data is available for the given unit and time interval. The data for the major road categories mentioned in Section 5.1 is captured regularly within the dataset. Table 1 presents statistics about the number of available traffic data records for Hanover and Brunswick. We believe that the available data is sufficient to capture typical congestion patterns.

Experiments and Discussion
The evaluation aims to assess the effectiveness of the proposed ST-Discovery approach and its applicability to the real-world datasets presented in Section 5. First, we present an assessment of the dependencies identified by ST-Discovery. Second, we evaluate and discuss the results of each main step of ST-Discovery, i.e., the identification of affected units and subgraphs, and the merging subgraphs.

Structural Dependencies
The task of the data-driven identification of structural dependencies between subgraphs of a road network is new, such that neither a baseline nor any gold standard exists. Therefore, to assess the quality of the identified structural dependencies, we conduct a manual evaluation. In this evaluation, we use ST-Discovery to generate a ranked list of top-k subgraph pairs with high dependency scores, while using different values of thsim. To exclude trivial dependencies, i.e. the subgraphs that are adjacent to each other, we set the threshold distmin = 500 meters. We set du,max = 1 for Hanover and du,max = 2 for Brunswick.
We manually assess the correctness of each of the top-k subgraph pairs with the highest dependency scores. We judge each pair to be correct if we can observe and explain a structural dependency, or incorrect otherwise. The article authors performed the evaluation, while we discussed the individual judgments to obtain consensus. Finally, we calculate the precision@k as the proportion of the results judged as correct within the top-k subgraph pairs. Figure 3 presents the precision@k for thsim ∈ {0, 0.1, 0.2, 0.3} for Hanover (3a) and Brunswick (3b). For both datasets, the highest precision@k is achieved at k=10 for all values of thsim except thsim = 0. As the k value increases (i.e., we consider more subgraph pairs with lower scores), the precision decreases in the majority of configurations. This behavior can be expected as the pairs with lower scores possess lower mutual information or are located at a higher geographic distance and are therefore less related. We conclude that the proposed score is well suited to quantify the dependency of affected subgraphs.
The best precision at k=10 is achieved by thsim = 0.3 (Hanover) and thsim = 0.1 (Brunswick). This indicates that the optimal value of thsim is dependent on the target road network. In both cases, the worst performance is achieved at thsim = 0. The graph partitioning at thsim = 0 is relatively coarse such that units that exhibit different dependencies can be aggregated into the same subgraph. Therefore, the achieved performance is lower than for the higher threshold values.
Note that the adopted evaluation method assesses the subgraph pairs' dependency but not the subgraphs' granularity. Partitioning with higher threshold values (i.e. thsim = 0.3) leads to fine granular subgraphs. In this case, the partitioning can lead to a split of subgraphs that exhibit the same structural dependencies into different subgraphs, which may potentially lead to the inclusion of redundant subgraph pairs in the top-k results. Therefore, whereas the combination of lower values of k and higher values of thsim = 0.3 leads to the highest precision@k values, it can also lead to some redundancy in the top-k results (i.e. subgraph pairs representing the same dependency at different levels of granularity). After manual inspection of the result granularity in our dataset, we observe that values of thsim ∈ {0.1, 0.2} lead to good results, with a precision of 80% (Hanover) and 100% (Brunswick) at k=10, while avoiding redundant subgraphs in the top results. In general, we recommend that the thsim threshold should be adjusted according to the specific dataset and the use case under consideration.
To facilitate a better understanding of the determined dependencies, we discuss exemplary cases identified by ST-Discovery. Figure 4 provides examples of the identified dependent subgraph pairs in Hanover, where the corresponding subgraphs in a pair are marked in blue and purple, correspondingly. Figure 4a shows two junctions of a major rural street. The affected subgraph in the west includes the junction and its feeder roads, whereas only the junction is affected in the east. This combination can be caused by the drivers who avoid the larger congestion in the west by accessing the street in the east, leading to increased traffic on both junctions. Figure 4b depicts two affected subgraphs that cross the central railway within the city of Hanover. The railway divides two city districts and needs to be crossed when travelling between these districts. Therefore the subgraphs represent alternative routes for trips from the north to the south and form a bottleneck for such trips. Figure 4c illustrates an affected highway (blue) and the last possible exit before that section (purple). If the highway is affected, the nearby exit and the consecutive roads, get affected as well. This is likely caused by drivers trying to exit the highway before entering the congested part, leading to an increased load in that region. Figure 4d depicts a road leading to a town (purple) and another road leaving the town (blue). This indicates a large amount of traffic unnecessarily passing through the town to reach from the purple to the blue subgraph because of lack of alternative routes. In this case, building a new alternative road could prevent the town from being exposed to the high traffic load.     Figure 5a shows two near-parallel road sections that connect similar city districts. The roads constitute alternative routes for trips to the east or west. If one of the road segments faces congestion, the other road is likely to face increased load caused by drivers that want to avoid congestion. Figure 5b presents two dependent subgraphs near the city centre of Brunswick. The subgraphs are prominent options for leaving or entering the city centre. If the city centre is congested, the congestion is likely to propagate to the subgraphs as well. Furthermore, if one subgraph is congested, the other subgraph represents an alternative route for a similar trip.

Analysis of Affected Units and Subgraphs
In this section, we analyse the distribution of the affected units and subgraphs identified by ST-Discovery in our datasets and the influence of the corresponding parameter.

Distribution of Affected Units
The analysis results of the distribution of affected units identified by the algorithm presented in Section 4.1, is shown in Figure 6. Figure 6a presents the number of affected units per time point between November and December 2017 in Hanover. We observe that the number of affected units varies continuously, from 3 to 7724 in our dataset, with a median value of 409. Furthermore, we can observe several peaks (i.e. time points that exhibit an exceptionally high number of affected units). We observe the highest peaks between 2017-12-10 and 2017-12-15. Through a manual investigation (i.e. search for news articles related to this region and dates on Google), we found that heavy snowfall caused high and unusual delays on the whole road network during this period. This is reflected in a high number of affected units throughout the entire network. Figure 6b  Smaller peaks occur at several times, for instance at 2017-12-03 (Hanover) and 2019-01-26 (Brunswick), where the size of the peaks highly differs. Given these observations, we believe that temporary peaks in the number of affected units can indicate occurrences of extraordinary incidents. Note that these observations are solely based on the temporal co-occurrences and do not provide any insights into the incidents' spatial characteristics.

Distribution of Affected Subgraphs
In this section, we analyse the distribution of the affected subgraphs identified using the algorithms presented in Section 4.2 and the influence of the relevant parameters.
During the identification of affected subgraphs using clustering of affected units presented in Section 4.2, the threshold du,max was introduced, which describes the distance tolerance in assigning affected units to a subgraph. Therefore variations of the value of this threshold leads to a different number and size of unit clusters, as illustrated in Figure 7a for Hanover and in Figure 7b for Brunswick.
With an increasing tolerance (i.e. the value of du,max) two trends can be observed in both road networks. While the average size of the resulting subgraphs increases, their number decreases. This is caused by the fact that more and more subgraphs are merged if the tolerance for determining a segment neighbourhood is increased.
Comparing the road networks with each other, we observe that the saturation of the subgraph size is reached at smaller values of du,max for Brunswick (at 51) than for Hanover (at 71) since the road network of Brunswick (7678 units) is smaller than the road network of Hanover (23125 units).
Due to such opposite trends, the choice of a suitable value for du,max strongly depends on the application scenario, the road network and the scale of the analysis. For instance, if a large region, e.g. a whole city or one of its districts, has to be analysed, tiny subgraphs consisting of just a few road segments are not that important. For this case, a higher threshold value should be chosen. In contrast, for a detailed inspection of smaller parts of the road network, e.g. specific roads or junctions, finer subgraphs are more critical. In this use case, to prevent merging of smaller subgraphs, a lower value of du,max should be selected.

Temporal Persistence of Affected Subgraphs
Besides the size of the identified affected subgraphs, we analyse their temporal persistence at the example of Hanover. This means that the affected subgraphs identified at time point i (including the typical variations of these subgraphs, e.g. due to the propagation of the traffic load along the transportation graph) have to be recognised in the subsequent time steps. For this purpose, we apply an algorithm based on the Hungarian method [15]. This algorithm aims to find an optimal assignment of clusters (i.e. affected subgraphs) in two consecutive time steps by minimising the assignment costs. In order to compute the assignment costs, the intersection of the affected units involved in each subgraph (cluster) in two subsequent time steps i and j is calculated as: Further, no tolerance would allow a subgraph to skip the subsequent time steps. This means the subgraph existence time will not be prolonged if this subgraph does not appear in a time step. Thus, if there is no assignment for a current subgraph in the following time step, this subgraph will 'die'. If there is a cluster of affected units located on the same road segments in the next but one time step, this cluster will be treated as a new affected subgraph, i.e. a new identifier will be assigned to this subgraph and a new existence time will be initiated.
As we assume that the existence time of a cluster depends on its size, we evaluated the subgraph existence time in dependency of their size. As discussed in Section 6.2.2, the size of the subgraphs depends on du,max. Figure 8 shows the average existence time of the clusters in relation to the average size and the chosen threshold du,max. The overall trend indicates that the affected subgraphs' existence time increases with their size. Thus, the choice of du,max does not only influence the subgraph size but also its existence time.

Evaluation of Subgraph Merging
In this section, we analyse the influence of the threshold thsim on the results of the subgraph merging conducted using Algorithm 1. Figure 9 opposes the number of subgraphs and the average size of subgraphs computed by Algorithm 1 with respect to the similarity threshold thsim that specifies to which fraction two subgraphs need to overlap to be merged. Higher threshold values are more restrictive.
In general, as thsim increases, we observe a growing number of subgraphs, whereas the average size of these subgraphs decreases. This indicates that higher thsim values result in a finer granular division of the road network into smaller subgraphs. The highest change of the average subgraph size can be observed for thsim ∈ [0, 0.4]. For thsim > 0.4 we only observe small changes of the average subgraph size. We conclude that values within [0, 0.4] are particularly suited to calibrate the algorithm in the considered setup, whereas higher threshold values result in a larger number of subgraphs (up to 50k independent subgraphs in Hanover) and only weakly affect the subgraph size.
We illustrate the influence of thsim at the example of a major junction in Hanover within our dataset. Figure 10 presents six different partitionings of the road network into subgraphs with respect to the values of thsim ∈ [0, 0.5], where the colour of the units represent their assignment to different subgraphs. In general, the subgraphs become finer granular with an increasing value of thsim. For thsim = 0 (the least restrictive value), we observe that all units in the considered area are assigned to a single subgraph. Note, that Algorithm 1 with thsim = 0 will not automatically merge all subgraphs, but only those who share at least one common unit. For thsim = 0.1 the junction is partitioned into three major subgraphs corresponding to the north (green), south (yellow) and west (purple) part of the network. Further increase of thsim leads to finer granular partitions of the junction. For instance, for thsim = 0.3 individual subgraphs for roads in the north-west (pink) and the northeast (purple) are present. Finally, for thsim ∈ {0.4, 0.5} we observe a fine granular partitioning of the junction in a large number of subgraphs, where individual subgraphs may contain only a few units. This corresponds to the rise of the number of subgraphs in Figure 9 for high values of thsim.
Overall, thsim can be used to adjust the granularity of ST-Discovery. Whereas lower threshold values result in large subgraphs covering larger fractions of the road network (thsim = 0), subgraphs

Conclusion and Outlook
In this article, we addressed the problem of data-driven discovery of topological dependencies of Recurrent Congestion within urban road networks. We presented the ST-Discovery approach -a novel method to detect such dependencies based on traffic outlier analysis. ST-Discovery detects the units (i.e. road segments) within the road network that indicate an exceptionally high traffic load, proposes algorithms to identify affected subgraphs within the road network using these units and identifies spatio-temporal dependencies among these subgraphs. Furthermore, ST-Discovery provides parameters to adjust the granularity of the identified subgraphs to specific use cases. Our evaluation results on the real-world datasets demonstrate that ST-Discovery can detect meaningful spatio-temporal dependencies among the subgraphs in urban road networks. The identified RC patterns include, for example, dependencies in the feeder roads of highways, alternative routes in case of traffic disruptions, or typical routes to POIs such as, e.g., event venues. In future work, we intend to address the aspects of explainability of ST-Discovery results for end-users, including, e.g., city planners and traffic managers.