Graph Representation Learning on Street Networks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper seems interesting and has the potential to be published, but several weaknesses should be resolved to increase its visibility.
1) The formulas should be better explained, for example, the one on page 175 is presented as a conditional probability, but this is not explained anywhere. Formula 2 is the Laplacian matrix, but it is not mentioned anywhere.
2) On page 177, it is stated that two embeddings are created, one for individual nodes that encodes their spatial distribution and one for the entire network that encodes the topological structure. I consider that there is an obvious overlap of information since the first embedding also encodes the network structure. The differences and the reasons for their use should be explained.
3) In general, the notation used is difficult to follow or poorly explained. For example, for the size of a matrix, capital letters are used as the matrices themselves. A table with the notation used should be displayed.
4) In section 3.3, a matrix Z of size NxF appears but it is not specified what F is.
5) Why is the ReLU activation function used and not another one?
6) In general, there is an obvious lack of explanation. For example, the paragraph corresponding to lines 253, 254, and 255 is not understood.
7) Formula 3 uses the sigmoid function, why?
8) Formula 4 is not explained, which, in my opinion, makes no sense.
9) There is a lack of in-depth discussion; it seems that only the obvious things are explained. Furthermore, I do not understand why there are no conclusions.
In summary, I see potential for publication, but many parts of the paper would need to be better explained, and a deeper discussion of the results obtained.
Author Response
Thank you very much for taking the time to review this manuscript and for providing insightful feedback and constructive suggestions. Based on the comments, we have made several significant revisions to our manuscript to address the concerns and enhance the clarity, depth, and rigour of our study. Please find the detailed responses below and the corresponding revisions/corrections highlighted in the resubmitted files.
Comment 1: The formulas should be better explained, for example, the one on page 175 is presented as a conditional probability, but this is not explained anywhere. Formula 2 is the Laplacian matrix, but it is not mentioned anywhere.
Response 1: We thank the reviewer for highlighting the need for greater clarity in the explanation of the formulas. In response, we have substantially revised the relevant sections to improve their interpretability. Specifically, we extended the explanation for Equation 1 to better highlight that the task of learning a probability distribution over all graphs is decomposed into learning a conditional probability over the adjacency matrices and the marginal probability over the node attribute matrices (i.e., coordinate pairs). We have also added Equation 2 to explicitly define how we learn the marginal distribution over the attribute matrices (C). The formula referred to as Formula 2 (now Equation 4) is not the Laplacian matrix but the symmetrically normalized adjacency matrix, as mentioned in the text. It relates to the symmetrically normalized Laplacian through Ã = D^(-1/2) A D^(-1/2) = I - L_sym, and it essentially performs a diffusion-like operation. We hope these revisions address the reviewer’s concerns and improve the clarity of the manuscript.
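As a quick numerical check of the identity quoted in Response 1, the following sketch builds the symmetrically normalized adjacency matrix for a small hypothetical graph (a 4-node path, not taken from the manuscript) and confirms that it equals I - L_sym. Self-loops, which GCN implementations often add before normalizing, are omitted here to match the formula as stated; the function and variable names are illustrative assumptions.

```python
import numpy as np

def sym_normalized_adjacency(A):
    """Return D^(-1/2) A D^(-1/2); assumes the graph has no isolated nodes."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale row i by deg_i^(-1/2) and column j by deg_j^(-1/2).
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Hypothetical 4-node path graph 0-1-2-3 (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_norm = sym_normalized_adjacency(A)

# Symmetrically normalized Laplacian: L_sym = D^(-1/2) (D - A) D^(-1/2).
deg = A.sum(axis=1)
D = np.diag(deg)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_sym = D_inv_sqrt @ (D - A) @ D_inv_sqrt

# The identity from the response: normalized adjacency = I - L_sym.
identity_holds = np.allclose(A_norm, np.eye(4) - L_sym)
```

Multiplying node features by this matrix replaces each node's features with a degree-weighted combination of its neighbours' features, which is the diffusion-like behaviour the response refers to.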
Comment 2: On page 177, it is stated that two embeddings are created, one for individual nodes that encodes their spatial distribution and one for the entire network that encodes the topological structure. I consider that there is an obvious overlap of information since the first embedding also encodes the network structure. The differences and the reasons for their use should be explained.
Response 2: We thank the reviewer for this insightful observation. We agree that at first glance, redundancy may appear between the node and graph embeddings. To clarify, we have revised the manuscript accordingly, specifically expanding the discussion following the definition of Equation 1. The key distinction lies in the flow of information: the node model never sees A, whereas the graph model sees both A and the node embeddings, resulting in a strictly hierarchical information flow. We have clarified this point in the main text after defining equation 1. In essence, the node model captures spatial layout by learning an autoregressive distribution over coordinate sequences, independently of network connectivity. The graph model then conditions on these learned spatial embeddings to infer the adjacency structure. This separation enables us to disentangle spatial and topological information, and allows each component to specialise in distinct structural priors. We hope this clarification addresses the reviewer’s concern and strengthens the conceptual distinction between the two embeddings.
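The hierarchical information flow described in Response 2 can be written compactly as the factorisation below. This is a sketch in assumed notation (A for the adjacency matrix, C for the matrix of coordinate pairs, c_i for the i-th element of the coordinate sequence, and T for its length); the manuscript's Equations 1 and 2 may use different symbols.

```latex
\begin{align}
p(G) &= p(A, C) = p(A \mid C)\, p(C), \\
p(C) &= \prod_{i=1}^{T} p(c_i \mid c_1, \dots, c_{i-1}).
\end{align}
```

The second line involves only C, which is why the node model never sees A, whereas the factor p(A | C) is learned by the graph model conditioned on the spatial information.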
Comment 3: In general, the notation used is difficult to follow or poorly explained. For example, for the size of a matrix, capital letters are used as the matrices themselves. A table with the notation used should be displayed.
Response 3: We appreciate the reviewer’s feedback regarding the clarity of the notation. In response, we have revised the equations and accompanying explanations to improve readability. Some notation has been modified, such as using lowercase letters to represent the size of matrices. In addition to this, we have also added a notation summary table (Table 1) as suggested.
Comment 4: In section 3.3, a matrix Z of size NxF appears but it is not specified what F is.
Response 4: We have clarified that f is equal to 128.
Comment 5: Why is the ReLU activation function used and not another one?
Response 5: It is mainly used for practical reasons. ReLU outputs zero for all negative inputs, introducing sparsity into the learned representations. This helps the model focus on a smaller subset of active features per node, which is especially useful in graph data where local structures can vary widely in complexity. Additionally, its derivative remains constant for positive values, which helps avoid the vanishing gradient problem during backpropagation, and is very efficient for training. We have added some additional explanations in the main text.
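The two properties cited in Response 5 (sparsity from zeroing negative inputs, and a constant derivative for positive inputs) can be illustrated with a minimal sketch. The input values are hypothetical pre-activations, not data from the study, and the derivative at exactly zero is set to 0 by convention.

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through, set everything else to zero."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return (x > 0).astype(float)

# Hypothetical pre-activation values for one node's features.
x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])

activations = relu(x)                    # negative inputs become exactly zero
gradients = relu_grad(x)                 # gradient stays at 1 wherever the unit is active
sparsity = np.mean(activations == 0.0)   # fraction of inactive features
```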
Comment 6: In general, there is an obvious lack of explanation. For example, the paragraph corresponding to lines 253, 254, and 255 is not understood.
Response 6: We thank the reviewer for pointing out the lack of clarity in this section. In response, we have revised the relevant portions of the text to provide more thorough and accessible explanations. Importantly, we added a more explicit explanation of both the encoder and decoder blocks of the VGAE model. This includes the explicit factorisation of the posterior (new Eq. 3), followed immediately by an explanation of why the mean-field factorisation is used (tractable optimisation on large graphs). We have also clarified the structure of the encoder, stating explicitly that the two-layer GCN outputs the parameters of the Gaussian latent variables: μ = GCN_μ(X, A) and log σ = GCN_σ(X, A), with the first layer shared between both heads. Regarding lines 253–255, which previously provided a brief and potentially confusing sketch of the reparameterization trick, we have opted to remove this section. Instead, we now refer readers directly to the canonical explanation in Kingma & Welling (2014), which we believe is more appropriate and informative for those unfamiliar with the technique.
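The encoder structure described in Response 6 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the layer sizes, the random weights, the toy graph, and the omission of self-loops in the normalization are all hypothetical choices, and the final sampling line shows the standard reparameterised draw Z = μ + σ ⊙ ε only for orientation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sym_normalize(A):
    """D^(-1/2) A D^(-1/2); assumes no isolated nodes and no added self-loops."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def vgae_encoder(X, A_norm, W0, W_mu, W_sigma):
    """Two-layer GCN encoder with a shared first layer and two output heads."""
    H = np.maximum(A_norm @ X @ W0, 0.0)   # shared first layer with ReLU
    mu = A_norm @ H @ W_mu                 # head 1: means of the latent Gaussians
    log_sigma = A_norm @ H @ W_sigma       # head 2: log standard deviations
    return mu, log_sigma

# Hypothetical sizes: 4 nodes, 2 input features, 8 hidden units, 3 latent dimensions.
n, d_in, d_hidden, f = 4, 2, 8, 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(n, d_in))
W0 = 0.1 * rng.normal(size=(d_in, d_hidden))
W_mu = 0.1 * rng.normal(size=(d_hidden, f))
W_sigma = 0.1 * rng.normal(size=(d_hidden, f))

mu, log_sigma = vgae_encoder(X, sym_normalize(A), W0, W_mu, W_sigma)

# Reparameterised sample from the mean-field Gaussian posterior.
eps = rng.normal(size=mu.shape)
Z = mu + np.exp(log_sigma) * eps
```

Because each row of Z is sampled independently given its own μ and σ, the posterior factorises over nodes, which is the mean-field assumption mentioned in the response.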
Comment 7: Formula 3 uses the sigmoid function, why?
Response 7: The matrix product Z * Z^T yields a symmetric matrix of unbounded real values. Applying the sigmoid function element-wise to this matrix maps all values into the interval (0, 1), yielding a probabilistic adjacency matrix A' whose entries represent the likelihood of an edge existing between node pairs. This transformation enables the model to interpret pairwise similarities as Bernoulli probabilities, suitable for reconstructing the binary edge structure of the graph.
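The decoder described in Response 7 can be sketched in a few lines. The latent matrix Z below is a hypothetical example chosen so that nodes 0 and 1 are similar and node 2 is dissimilar to node 0; the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def inner_product_decoder(Z):
    """Edge probabilities from pairwise inner products of latent node vectors."""
    return sigmoid(Z @ Z.T)

# Hypothetical latent vectors for three nodes (one row per node).
Z = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [-1.0, 0.5]])

A_prob = inner_product_decoder(Z)
# A_prob[i, j] is read as the Bernoulli probability of an edge between nodes i and j.
```

Here the inner product of nodes 0 and 1 is positive, so their edge probability exceeds 0.5, while the inner product of nodes 0 and 2 is negative, giving a probability below 0.5.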
Comment 8: Formula 4 is not explained, which, in my opinion, makes no sense.
Response 8: We thank the reviewer for highlighting the lack of explanation surrounding Formula 4. In response, we have introduced a dedicated section titled “Training Objectives” in the revised manuscript, where we clearly define the loss functions used for both the autoregressive node-sequence model and the variational graph autoencoder (VGAE). In this new section, we explicitly detail the negative log-likelihood (cross-entropy) loss used (eq. 7) to train the autoregressive node model as well as the evidence lower bound (ELBO) used to optimise the variational graph auto-encoder (eq. 8). We further explain the composition of the ELBO into a binary reconstruction term and a KL divergence regularisation term. We believe this addition significantly improves the transparency and completeness of the training procedure.
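To make the two-term structure of the ELBO concrete, the sketch below computes a negative ELBO as a binary cross-entropy reconstruction term plus a KL divergence term. It assumes a standard normal prior on the latent variables and equal weighting of the two terms; these are common choices for VGAEs but may differ from the manuscript's Eq. 8. All names and inputs are hypothetical.

```python
import numpy as np

def negative_elbo(A, A_prob, mu, log_sigma, eps=1e-10):
    """Negative ELBO = binary reconstruction loss + KL(q(Z | X, A) || N(0, I))."""
    # Bernoulli negative log-likelihood of the observed adjacency entries.
    recon = -np.sum(A * np.log(A_prob + eps)
                    + (1.0 - A) * np.log(1.0 - A_prob + eps))
    # Closed-form KL divergence between diagonal Gaussians and a standard normal.
    kl = -0.5 * np.sum(1.0 + 2.0 * log_sigma - mu**2 - np.exp(2.0 * log_sigma))
    return recon + kl, recon, kl

# Hypothetical 2-node example: one edge, uninformative decoder, posterior equal to the prior.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
A_prob = np.full((2, 2), 0.5)
mu = np.zeros((2, 3))
log_sigma = np.zeros((2, 3))

loss, recon, kl = negative_elbo(A, A_prob, mu, log_sigma)
```

In this example the KL term is zero because the posterior matches the prior, and each of the four adjacency entries contributes ln 2 to the reconstruction term.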
Comment 9: There is a lack of in-depth discussion; it seems that only the obvious things are explained. Furthermore, I do not understand why there are no conclusions.
Response 9: We appreciate the reviewer’s observation regarding the limited depth of the discussion and the absence of a formal conclusion. In response, we have added a dedicated Conclusion section that expands on the implications of our work and offers a more reflective synthesis of the findings.
Reviewer 2 Report
Comments and Suggestions for Authors
A new learning framework for street network representations is proposed, evaluated empirically for quality of results, compared to real networks and real network types. The novelty of the approach is well-argued, with the research gap clearly indicated.
Overall, there is sufficient content of merit here. The research is well implemented, and some interesting results gained, in particular the identification of street network types and their clustering to specific global areas. What is lacking somewhat is a deeper examination of why you would do this. You have created a novel technique of generating synthetic networks which closely share network morphological characteristics with real cities, and those cities are somewhat autocorrelated as regards network type... where does this lead? Is there a real-world operational use? (e.g. would city planners benefit from this, and if so, how would they use this information?) Thinking along these lines would significantly augment the introduction and end discussion (even form the basis for a conclusion, which is missing in this paper).
It appears that there is a geographic imbalance in the sources of data used - approx 3 times as much black OSM data is concentrated mainly on China as opposed to the red data source, with distribution across the rest of the world (is this also from OSM?) Care to comment - is it your data, or is it an artifact of the map (would a separate density map be more effective)?
Figure 3 is missing - this is an important schematic for the paper, so must be included. It includes essential information for successful comprehension of research.
Be consistent with your descriptions. In the abstract, networks or graphs are described as having 'nodes' and 'links'. In the Introduction, it is 'vertices' and 'edges'.
Author Response
Thank you very much for taking the time to review this manuscript and for providing insightful feedback and constructive suggestions. Based on the comments, we have made several significant revisions to our manuscript to address the concerns and enhance the clarity, depth, and rigour of our study. Please find the detailed responses below, along with the corresponding revisions and corrections highlighted in the resubmitted files.
1. Is there a real-world operational use? (e.g. would city planners benefit from this, and if so, how would they use this information?)
Thank you for pointing this out. As with many generative processes, there is currently no direct application in public policy. This is particularly true for urban planning, where models, regrettably, play only a small part in the decision-making process. However, we have added some ideas about why this type of work is important in urban planning:
Introduction: Finally, besides the mentioned importance of street networks in everyday life, they are a robust proxy for population density, jobs and housing accessibility; therefore, having a better understanding of how street networks evolve could aid urban planners in better comprehending urbanization’s complex processes.
Conclusion: This framework also complements traditional urban studies by enabling structured experimentation. For example, it enables interpolating between city forms, exploring how topological properties emerge, or generating counterfactuals that isolate specific network traits, complementing existing morphological analysis.
2. It appears that there is a geographic imbalance in the sources of data used.
We thank the reviewer for this helpful observation regarding the apparent geographic imbalance in the data visualization. We have clarified in the figure caption that both the red and black dots are sourced from OpenStreetMap (OSM), with the red dots representing a filtered subset of the black dots. Specifically, the red subset includes only cities for which corresponding population data is available and the population value is greater than 1,000. The observed overrepresentation of Chinese cities in the unfiltered dataset (black) arises from the fact that many Chinese cities in OSM lack associated population metadata and are therefore excluded from the red subset. To provide further context and ensure transparency, we have now added explicit mention of the skew toward U.S. cities (which make up approximately 13.5% of the filtered set), and we indicate the countries that collectively account for over 50% of the included cities in the caption of Figure 1.
3. Figure 3 is missing. Thank you for pointing this out. We thoroughly checked the generated PDF to ensure that no figures are missing.
4. Be consistent with your descriptions. Thank you for pointing this out. We revised the manuscript to use only "nodes" and "edges" when referring to the network elements.
Reviewer 3 Report
Comments and Suggestions for Authors
The model training in the paper utilizes a large dataset of street networks, including data from 39,000 towns and cities, which is undoubtedly an extensive dataset. However, the authors do not provide detailed information on the data processing and selection criteria. For example, how do they handle the structural differences between street networks across different cities? Have the data been standardized or normalized in any way? These factors are crucial for the model's generalization ability and comparability, and the authors should further clarify these steps to enhance the credibility of the experimental results.
While the model effectively learns low-dimensional embeddings and generates synthetic street networks through a Variational Graph Autoencoder (VGAE), the paper mentions certain structural issues with the generated networks, such as small triangular intersections and degree-2 nodes forming continuous streets. These issues may be related to the dimensionality of the low-dimensional embedding space and training time. However, the authors have not thoroughly discussed how these factors impact the model's generation ability. Specifically, have they considered incorporating additional regularization terms during training to reduce the occurrence of these anomalies? Further experimental results would help to understand the underlying causes of these issues and provide more targeted suggestions for improvement.
Although the paper discusses classification of street networks through embedding learning, the choice of classification method and experimental design seems somewhat simplistic. While the use of KMeans clustering is effective in some cases, it is worth questioning whether this method sufficiently captures the diversity and complexity of network structures. Could the authors consider combining other, more complex classification methods, such as graph neural networks in deep learning, or explore other graph-based clustering algorithms to further validate the robustness and effectiveness of their results?
Author Response
Thank you very much for taking the time to review this manuscript and for providing insightful feedback and constructive suggestions. Based on the comments, we have made several significant revisions to our manuscript to address the concerns and enhance the clarity, depth, and rigour of our study. Please find the detailed responses below and the corresponding revisions/corrections highlighted in the resubmitted files.
Comment 1: The model training in the paper utilizes a large dataset of street networks, including data from 39,000 towns and cities, which is undoubtedly an extensive dataset. However, the authors do not provide detailed information on the data processing and selection criteria.
Response 1: Thank you for the comment. We believe this is already addressed in the original manuscript in Section 3.1 ("Data processing"), where we describe the selection criteria, and in Section 3.2 ("Node model"), where we describe the normalisation procedure that enables the model to generalise. In terms of selection criteria, we use all features with a place tag equal to 'town' or 'city' in OpenStreetMap. Subsequently, we select only those places with a population greater than 1,000. We download the street networks for these places within a 1 km x 1 km box centred at the centroid of each feature. To facilitate learning the spatial distribution of node coordinates, we centre each street network at (0, 0) and normalise both x and y coordinates such that the diagonal of their bounding box is equal to 1. Once the nodes are centred and normalised, we apply a uniform 8-bit quantisation that allows us to model the nodes' coordinate values as categorical distributions. Additional information on the impacts of the normalisation on the model has been added to the Node model section to help clarify these points.
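The centring, normalisation, and quantisation steps described in Response 1 can be sketched as below. This is an illustration under assumptions rather than the authors' code: the response does not say whether networks are centred on the bounding-box centre or the mean of the nodes (the bounding-box centre is assumed here), nor exactly how the 256 quantisation bins are laid out (uniform bins over [-0.5, 0.5] are assumed). The coordinates are hypothetical.

```python
import numpy as np

def normalise_and_quantise(coords, bits=8):
    """Centre coordinates, scale the bounding-box diagonal to 1, and quantise uniformly.

    Assumes at least two distinct points so that the diagonal is non-zero.
    """
    coords = np.asarray(coords, dtype=float)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    centre = (lo + hi) / 2.0            # assumption: bounding-box centre maps to (0, 0)
    diagonal = np.linalg.norm(hi - lo)  # length of the bounding-box diagonal
    normed = (coords - centre) / diagonal   # every value now lies in [-0.5, 0.5]

    levels = 2 ** bits                  # 256 categories for 8-bit quantisation
    tokens = np.floor((normed + 0.5) * levels)
    tokens = np.clip(tokens, 0, levels - 1).astype(int)
    return normed, tokens

# Hypothetical node coordinates in metres.
coords = [[0.0, 0.0], [300.0, 400.0], [150.0, 200.0]]
normed, tokens = normalise_and_quantise(coords)
```

Each coordinate becomes an integer in 0 to 255, so the node model can treat x and y values as categorical variables and be trained with a cross-entropy loss.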
Comment 2: While the model effectively learns low-dimensional embeddings and generates synthetic street networks through a Variational Graph Autoencoder (VGAE), the paper mentions certain structural issues with the generated networks, such as small triangular intersections and degree-2 nodes forming continuous streets.
Response 2: We thank the reviewer for this insightful comment and for raising important considerations regarding the structural anomalies observed in some of the generated street networks, specifically small triangular intersections and extended sequences of degree-2 nodes. We agree that these artefacts may stem from factors such as the dimensionality of the latent space or the lack of architectural or regularisation constraints that enforce topological plausibility. While we did not incorporate additional regularisation terms or conduct targeted ablation studies in the current work due to time and resource constraints, we acknowledge the importance of exploring these avenues more rigorously in future work. Future research could examine how inductive biases or structural priors, such as planarity constraints, regularisation on node degree distributions, or penalties on certain subgraph motifs, might reduce the frequency of such anomalies. Additionally, systematic experiments varying embedding dimensionality and training regimes would be valuable in diagnosing the specific contributions of these factors. We have added a brief discussion of these points in the revised manuscript's conclusions and propose this as a promising direction for future work.
Comment 3: Although the paper discusses classification of street networks through embedding learning, the choice of classification method and experimental design seems somewhat simplistic. While the use of K-Means clustering is effective in some cases, it is worth questioning whether this method sufficiently captures the diversity and complexity of network structures. Could the authors consider combining other, more complex classification methods, such as graph neural networks in deep learning, or explore other graph-based clustering algorithms to further validate the robustness and effectiveness of their results?
Response 3: Thank you for bringing this to our attention. We agree with the reviewer that K-Means is likely not the best method for clustering spatial objects based on their topological characteristics. However, in this case, we are not actually clustering the networks based on their raw topological characteristics, but rather on the distances between their latent representations. Thanks to this comment, we realised that what is now Section 4.3.1 was poorly explained and relied on the reader having fully reviewed the Kempinska and Murcio (2019) reference. We have extensively updated the first paragraph of Section 4.3.1 as follows:
Following [12], we performed a distance and clustering analysis over the latent representations of each road network, under the premise that the latent space encompasses the main characteristics of the topological structure of the street network. Let X(M)=m and X(N)=n, with m =(m_1,m_2,...,m_n) and n =(n_1,n_2,...,n_n), be two vectors in the latent space generated by street network M and street network N respectively.
First, we measured the Euclidean distance d(m,n). If d(m,n) → 0, we conclude that the networks that generated m and n share the same topological characteristics; therefore, in a classification procedure, they should belong to the same cluster.
We cluster the obtained d(m,n) values for all studied networks using a K-Means approach, with an optimal K=7, as determined by the Elbow method. So the clustering is not performed over the actual topological features (where the K-Means method would likely perform poorly), but over the set formed by all d(m,n), which is a simple real number vector of size n.
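The distance step in the quoted paragraph can be illustrated with a short sketch that computes all pairwise Euclidean distances between latent vectors. The three 2-dimensional latent vectors are hypothetical; the subsequent K-Means step with K=7 is not reproduced here because the response does not fully specify how the distance values are arranged before clustering.

```python
import numpy as np

def pairwise_euclidean(latents):
    """Return the matrix of Euclidean distances d(m, n) between all latent vectors."""
    diff = latents[:, None, :] - latents[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical latent vectors for three street networks (one row per network).
latents = np.array([[0.0, 0.0],
                    [0.1, 0.0],
                    [3.0, 4.0]])

D = pairwise_euclidean(latents)
# D[0, 1] is small, so networks 0 and 1 would be expected to share a cluster;
# D[0, 2] is large, so network 2 would be expected to fall elsewhere.
```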
Reviewer 4 Report
Comments and Suggestions for Authors
This study proposes a two-stage generative framework based on graph representation learning for analyzing the topological and geometric properties of street networks, overcoming the limitations of traditional raster-based representations and convolutional neural network methods that lose topological details. Using OpenStreetMap data, the research extracts central urban street networks and combines Transformer models with VGAE and graph convolutional layers to generate synthetic street networks and conduct urban form cluster analysis. Experiments evaluated the model's performance on topological and geometric metrics, revealing urban form patterns across different regions and countries. Specific revisions and suggestions are as follows:
- While the manuscript mentions using OpenStreetMap data to extract street networks for cities and towns, it does not specify the criteria defining city and town boundaries.
- The manuscript compares synthetic and real street networks through topological/geometric metrics but lacks intuitive visual comparisons. It is recommended to add comparative figures in the results section to illustrate structural similarities and differences (e.g., node distribution, edge connection patterns).
- In Section 4.1, the quantitative analysis describing the model’s synthetic network generation is relatively general; a more detailed quantitative analysis of the results is recommended.
- Section 4.3 does not provide representative examples and detailed explanations for each cluster. It is suggested to include sample diagrams for each cluster and explain their characteristics.
- The conclusion that "Most countries have six or seven types of clusters" aligns with existing research findings. It is recommended to add comparisons with existing research results. See Mapping urban forms worldwide: An analysis on 8910 street networks and 25 indicators for reference.
- Figure 9 lacks detailed analysis on the possible reasons for differences in clustering results of street networks across countries.
- The discussion section lacks elaboration on application scenarios and future research directions. It is recommended to further clarify the applied value of the research.
Author Response
All of the additions and modifications are highlighted in red in this resubmission.
Comment 1: While the manuscript mentions using OpenStreetMap data to extract street networks for cities and towns, it does not specify the criteria defining city and town boundaries.
Response 1: We appreciate the reviewer's comment. The criteria used to define city and town boundaries are detailed in Section 3.1 ("Data Processing"). Specifically, we extract urban centres from OpenStreetMap by selecting nodes tagged as place=town or place=city. These tags denote significant urban settlements, and we further subset them based on population metadata (population > 1,000) for model training, with a total dataset of 39,364 cities. We have now clarified this more explicitly in the manuscript to avoid any ambiguity.
Comment 2: The manuscript compares synthetic and real street networks through topological/geometric metrics but lacks intuitive visual comparisons. It is recommended to add comparative figures in the results section to illustrate structural similarities and differences (e.g., node distribution, edge connection patterns).
Response 2: We appreciate the reviewer's suggestion. In response, we have added Figure 6 to provide an intuitive visual comparison between real and synthetic street networks. This figure illustrates the structural similarities and differences across key topological properties, as shown in Figure 5. We believe this complements the quantitative analysis and helps clarify the performance of the generative model.
Comment 3: In Section 4.1, the quantitative analysis describing the model's synthetic network generation is relatively general; a more detailed quantitative analysis of the results is recommended.
Response 3: We appreciate the reviewer's valuable suggestion. In response, we have expanded Section 4.1 to provide a more detailed quantitative analysis of the model's generative performance. Specifically, we now highlight the variance and distributional differences across topological features and discuss their implications in more depth. Furthermore, we have added Figure 6 to provide a qualitative visual comparison between real and synthetic samples, offering an intuitive understanding of the generative model's performance.
Comment 4: Section 4.3 does not provide representative examples and detailed explanations for each cluster. It is suggested to include sample diagrams for each cluster and explain their characteristics.
Response 4: We appreciate the reviewer's valuable suggestion. As mentioned in our manuscript, on page 14, lines 426-427, analysing and classifying worldwide street networks is not the aim of this work. Instead, the goal is to construct latent representations of street networks that can serve as the basis for different research streams, such as generating synthetic networks or classifying street networks. Studying the actual characteristics of the obtained clusters is beyond the scope of this work. Nevertheless, we provided some representative examples of the most and least common clusters in Figure 9 and included some empirical observations already present in our last version on page 14 from line 425.
Comment 5: The conclusion that "Most countries have six or seven types of clusters" aligns with existing research findings. It is recommended to add comparisons with existing research results. See Mapping urban forms worldwide: An analysis on 8910 street networks and 25 indicators for reference.
Response 5: We appreciate the reviewer's reference and recommendations. As stated in our Response 4, the present study is not about classifying cities, but rather about the latent space of street networks, so comparing our clusters with those suggested in that study, or in similar ones, is outside the scope. We added the following paragraph to clarify this further:
"These previous city classification studies are not directly comparable with the classification presented here, because in those studies the analysis was conducted over the entire city (not over the smaller street segments), or was based on PCA over different city and street indicators, such as total resident population and average node degree."
Additionally, we added references 33-36 (including the suggested reference) to emphasise the relevance of the street's topological features in classifying worldwide cities.
Comment 6: Figure 9 lacks detailed analysis on the possible reasons for differences in clustering results of street networks across countries.
Response 6: We appreciate the reviewer's suggestion. However, as explained in our responses to Comments 4 and 5, this study is not concerned with the analysis of the clusters, so we have avoided adding further explanations beyond the brief observations on page 14. For example, the coincidences between Asian countries, and the fact that the same common cluster can be observed in Europe and the USA, present an interesting research avenue that warrants further investigation elsewhere (highlighted in red in the resubmitted text).
Comment 7: The discussion section lacks elaboration on application scenarios and future research directions. It is recommended to further clarify the applied value of the research.
Response 7: Thank you for your recommendation. We believe that this elaboration is already present in the manuscript. The applied value is mentioned in the last paragraph of the Discussion section (lines 496-505): "An immediate implication of the study is that by learning a useful and compact representation from street networks, we can immediately use this information for other downstream geographical tasks, such as prediction or classification. [...] This framework [...] enables interpolating between city forms, exploring how topological properties emerge, or generating counterfactuals that isolate specific network traits, complementing existing morphological analysis". In terms of future research directions, lines 490 to 494 of the current manuscript state: "Future work should explore hierarchical models that can progressively grow graphs, possibly via graph diffusion or nested variational approaches [...] Incorporating equivariant attention mechanisms or learning a canonical ordering could mitigate these effects."
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
1. Although the paper clearly identifies the limitations of conventional image-based convolutional models in capturing the topological features of street networks and proposes learning representations directly from graph structures, the design of comparative experiments remains weak. The current experiments mainly demonstrate the model’s ability to fit the distributions of geometric statistics (such as average street length, circuity, and block compactness), but they lack systematic comparisons with other mainstream models (e.g., ConvAE, Node2Vec, StreetGAN) on identical tasks and evaluation metrics. As a result, the advantages of the proposed method in terms of generation quality and topological expressiveness are not convincingly established. It is recommended that the authors introduce multiple baseline models for horizontal comparison and incorporate structural similarity metrics, such as Graph Edit Distance or Subgraph Isomorphism Measures, to further validate the model's effectiveness.
2. In the two-stage framework proposed in the paper, the processes of node generation and edge generation are carried out independently, which may theoretically lead to a disconnect between spatial consistency and structural plausibility. Although the authors enhance input consistency through centering, normalization, and node ordering, these steps do not fundamentally resolve the potential misalignment between node positions and edge connections. The methodology lacks the incorporation of prior geometric rules or topological constraints—for example, functional relationships between spatial proximity and edge connectivity. This shortcoming likely contributes to the appearance of structural artifacts in some synthetic samples, such as long chains of degree-2 nodes or non-orthogonal triangular intersections. It is therefore suggested that the authors consider incorporating stronger geometric or spatial regularization mechanisms during the edge generation phase to enhance the practical feasibility of the generated graphs.
Author Response
We thank the reviewer for the careful reading and constructive feedback. We have highlighted all modified and new text in our resubmitted manuscript.
Our study introduces a graph-native two-stage generative framework for street networks. Because our contribution lies in methodological novelty rather than in outperforming benchmarks on a single downstream task, we have focused the evaluation on whether the learnt latent space reproduces aggregate geometric and topological statistics and supports an exploratory clustering analysis across 39,000 real street segments worldwide. We now clarify this positioning explicitly in the introduction and discussion sections.
Comment 1: Although the paper clearly identifies the limitations of conventional image-based convolutional models in capturing the topological features of street networks and proposes learning representations directly from graph structures, the design of comparative experiments remains weak. The current experiments mainly demonstrate the model’s ability to fit the distributions of geometric statistics (such as average street length, circuity, and block compactness), but they lack systematic comparisons with other mainstream models (e.g., ConvAE, Node2Vec, StreetGAN) on identical tasks and evaluation metrics. As a result, the advantages of the proposed method in terms of generation quality and topological expressiveness are not convincingly established. It is recommended that the authors introduce multiple baseline models for horizontal comparison and incorporate structural similarity metrics—such as Graph Edit Distance or Subgraph Isomorphism Measures—to further validate the model's effectiveness.
Response 1: Our revision now explains why conventional raster-based baselines and node-embedding methods are not suitable comparators for our graph-native generator. Image-centric models, such as ConvAE and StreetGAN, rely on a fixed-resolution rasterisation step that inevitably discards precise adjacency information, and they subsequently require an additional graph-recovery post-processing step. Reproducing that workflow for 39,000 networks would pull the study away from its methodological focus. Node2Vec, by contrast, learns task-specific node embeddings and does not include a mechanism for whole-graph synthesis, making it orthogonal to, rather than competitive with, our approach. These clarifications have been added to the Introduction and Discussion sections so that the manuscript’s scope is clear.
We also justify omitting metrics such as Graph Edit Distance and sub-graph-isomorphism counts: even their fastest approximations scale poorly beyond a few hundred nodes, whereas a typical sample in our corpus contains more than 400 nodes and over 1,100 edges, rendering exhaustive pairwise comparisons impractical. Instead, we retain permutation-invariant statistics, including degree distribution, circuity, block compactness, and face-shape descriptors. These are both interpretable to urban-morphology scholars and computationally tractable for large datasets.
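To illustrate what "permutation-invariant" means here, the following is a minimal sketch and not the code used in the manuscript. It computes a degree histogram and an average circuity on a toy four-node graph and checks that both are unchanged when the nodes are relabelled. The function names, the toy coordinates and street lengths, and the circuity definition (total street length divided by total straight-line distance between edge endpoints) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def average_circuity(coords, edges, lengths):
    """Total street length over total straight-line endpoint distance (assumed definition)."""
    straight = sum(math.dist(coords[u], coords[v]) for u, v in edges)
    return sum(lengths) / straight


# Hypothetical toy network: four junctions, five streets.
coords = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (3.0, 4.0), 3: (0.0, 4.0)}
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
lengths = [3.0, 4.4, 3.0, 4.0, 5.5]  # travelled length of each street, same order as edges

# Relabel the nodes with a fixed random permutation.
labels = list(coords)
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
relabel = dict(zip(labels, shuffled))
coords_p = {relabel[n]: xy for n, xy in coords.items()}
edges_p = [(relabel[u], relabel[v]) for u, v in edges]

# Both statistics depend only on the graph, not on how its nodes are numbered.
same_degrees = degree_histogram(edges) == degree_histogram(edges_p)
same_circuity = math.isclose(
    average_circuity(coords, edges, lengths),
    average_circuity(coords_p, edges_p, lengths),
)
print(same_degrees, same_circuity)
```

Each statistic is computed in a single pass over the edge list, which is why such measures remain tractable for graphs with hundreds of nodes, whereas Graph Edit Distance requires a search over node correspondences.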
Comment 2: In the two-stage framework proposed in the paper, the processes of node generation and edge generation are carried out independently, which may theoretically lead to a disconnect between spatial consistency and structural plausibility. Although the authors enhance input consistency through centring, normalisation, and node ordering, these steps do not fundamentally resolve the potential misalignment between node positions and edge connections. The methodology lacks the incorporation of prior geometric rules or topological constraints, for example, functional relationships between spatial proximity and edge connectivity. This shortcoming likely contributes to the appearance of structural artifacts in some synthetic samples, such as long chains of degree-2 nodes or non-orthogonal triangular intersections. It is therefore suggested that the authors consider incorporating stronger geometric or spatial regularisation mechanisms during the edge generation phase to enhance the practical feasibility of the generated graphs.
Response 2: We have clarified that decoupling node placement from edge generation lets us sample plausible junction layouts and assess them quickly before incurring the heavier computational cost of synthesising full connectivity. Imposing a single, distance-based prior at the edge stage, as the reviewer suggests, is conceptually appealing; however, it risks over-regularising the many historically evolved street patterns whose irregular geometries are nonetheless valid, and could thereby degrade performance. Instead, our variational prior already yields embeddings that are globally coherent while still permitting the network to learn context-specific connectivity patterns directly from data. We do acknowledge that residual artefacts, particularly chains of degree-two nodes and occasional triangular faces, remain a limitation. Addressing them will likely require hierarchical or geometry-aware edge decoders, which we identify as a promising direction for future work. Thus, while the suggestion to incorporate explicit geometric or topological priors is well received, implementing and evaluating such constraints falls outside the scope of the present study and is reserved for subsequent investigations.
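For concreteness, one possible form of the distance-based prior under discussion is sketched below. This is a hypothetical illustration and not part of the manuscript's edge decoder: the function `edge_probabilities`, the penalty weight `lam`, and the toy logits and coordinates are all assumptions. The sketch subtracts a penalty proportional to Euclidean distance from each raw edge score before applying a sigmoid, so that `lam = 0` leaves the scores unpenalised and larger values increasingly suppress long edges.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def edge_probabilities(logits, coords, lam):
    """Penalise each raw edge score by lam times the distance between its endpoints.

    logits: dict mapping a node pair (i, j) to a raw decoder score (hypothetical).
    coords: sequence of (x, y) node positions.
    lam:    non-negative weight of the distance penalty (hypothetical).
    """
    probs = {}
    for (i, j), score in logits.items():
        distance = math.dist(coords[i], coords[j])
        probs[(i, j)] = sigmoid(score - lam * distance)
    return probs


# Toy example: nodes 0 and 1 are close together, node 2 is far from both.
coords = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
logits = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5}

plain = edge_probabilities(logits, coords, lam=0.0)      # no distance prior
penalised = edge_probabilities(logits, coords, lam=2.0)  # long edges suppressed

for pair in logits:
    print(pair, round(plain[pair], 3), round(penalised[pair], 3))
```

With equal raw scores, the penalised probabilities are ordered purely by distance. This shows both the appeal of such a prior and the concern raised above: a single global `lam` would also suppress long but legitimate connections in irregular, historically evolved layouts.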
We believe these additions clarify the scope and justify the experimental design without expanding the paper beyond its methodological focus. We hope the reviewer agrees that, with the strengthened discussion, the manuscript meets the journal’s standards while maintaining a clear and self-contained contribution.
Reviewer 4 Report
Comments and Suggestions for Authors
Thanks for responding to all my questions! I have no further comments.
Author Response
Comment 1: Thanks for responding to all my questions! I have no further comments.
Response 1: Thank you very much for your support.