Article

Graph-MambaRoadDet: A Symmetry-Aware Dynamic Graph Framework for Road Damage Detection

1 School of Computer Science and Technology, Beijing Jiaotong University, No. 3 Shangyuan Village, Xizhimenwai, Haidian District, Beijing 100044, China
2 College of Computer and Information Technology, Cangzhou University of Transportation, Xueyuan West Road, Huanghua 061199, China
3 School of Environment, Education and Development, The University of Manchester, Manchester M13 9PL, UK
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(10), 1654; https://doi.org/10.3390/sym17101654
Submission received: 16 July 2025 / Revised: 8 August 2025 / Accepted: 4 September 2025 / Published: 5 October 2025
(This article belongs to the Section Computer)

Abstract

Road-surface distress poses a serious threat to traffic safety and imposes a growing burden on urban maintenance budgets. While modern detectors based on convolutional networks and Vision Transformers achieve strong frame-level performance, they often overlook an essential property of road environments—structural symmetry within road networks and damage patterns. We present Graph-MambaRoadDet (GMRD), a symmetry-aware and lightweight framework that integrates dynamic graph reasoning with state–space modeling for accurate, topology-informed, and real-time road damage detection. Specifically, GMRD employs an EfficientViM-T1 backbone and two DefMamba blocks, whose deformable scanning paths capture sub-pixel crack patterns while preserving geometric symmetry. A superpixel-based graph is constructed by projecting image regions onto OpenStreetMap road segments, encoding both spatial structure and symmetric topological layout. We introduce a Graph-Generating State–Space Model (GG-SSM) that synthesizes sparse sample-specific adjacency in O(M) time, further refined by a fusion module that combines detector self-attention with prior symmetry constraints. A consistency loss promotes smooth predictions across symmetric or adjacent segments. The full INT8 model contains only 1.8 M parameters and 1.5 GFLOPs, sustaining 45 FPS at 7 W on a Jetson Orin Nano—eight times lighter and 1.7× faster than YOLOv8-s. On RDD2022, TD-RD, and RoadBench-100K, GMRD surpasses strong baselines by up to +6.1 mAP50:95 and, on the new RoadGraph-RDD benchmark, achieves +5.3 G-mAP and +0.05 consistency gain. Qualitative results demonstrate robustness under shadows, reflections, back-lighting, and occlusion. By explicitly modeling spatial and topological symmetry, GMRD offers a principled solution for city-scale road infrastructure monitoring under real-time and edge-computing constraints.

1. Introduction

Automated road damage detection is fundamental for intelligent transportation systems, directly supporting urban maintenance, safety, and infrastructure sustainability. Despite significant advancements enabled by deep convolutional neural networks (CNNs) and transformer-based architectures, several open challenges persist—most notably, the difficulty of modeling spatial–temporal consistency, the rigidity of current topological priors, and the inefficiency of graph-based reasoning on edge hardware.
In recent years, AI- and ML-based techniques have gained widespread traction in infrastructure condition monitoring, especially for detecting and classifying pavement distresses such as cracking, potholes, and rutting. Traditional CNN-based methods such as AlexNet, ResNet, and EfficientNet have been applied to frame-wise crack classification [1], while object detection architectures like Faster R-CNN and YOLO series provide more localized crack detection [2]. More recently, Transformer-based approaches [3] and graph-based neural networks [4] have been developed to model spatial dependencies and road–topology relationships. Comprehensive reviews such as [5] summarize these advancements, categorizing ML models by architecture, input modality, and deployment feasibility. However, these methods still suffer from limitations in topological consistency, adaptability to dynamic road conditions, or computational overhead when deployed on edge devices. Figure 1 shows the dynamic road graph construction on a real-world map.
Most prevailing methods, such as the YOLO [6] and SAM [7] families, process each video frame as an independent sample, thereby overlooking spatial continuity and topological coherence at the road-segment level. This frame-wise paradigm leads to instability in challenging conditions such as shadows, occlusions, or water stains, where transient artifacts can cause false positives. Meanwhile, the use of static road network priors—such as adjacency matrices derived from OpenStreetMap (OSM)—further restricts adaptability. These fixed structures cannot accommodate real-world dynamics like temporary lane closures or ongoing construction, limiting the robustness of topological modeling in practical deployment.
Recent progress in Graph Neural Networks (GNNs), including GraphSAGE [8] and Graph Attention Networks (GAT) [9], has demonstrated promise for relational reasoning. However, the high computational overhead of these models (often exceeding 6 GFLOPs on 4 GB GPUs) makes them unsuitable for real-time inference on lightweight edge devices such as the Jetson Nano. Common strategies to mitigate this cost, such as aggressive model compression, frequently result in degraded detection performance or loss of expressiveness in graph structure learning.
At the same time, state–space models (SSMs), and particularly the recently proposed Mamba family [10], have delivered breakthroughs in long-sequence modeling by enabling efficient context propagation and linear computational complexity. Yet, the integration of SSM-based sequential learning with dynamic graph construction remains largely unexplored in the context of real-world road scene analysis. This is notable, as road damage detection requires both the effective modeling of spatial–temporal patterns and the adaptive reasoning over evolving topologies.
To bridge these gaps, we propose Graph-MambaRoadDet (GMRD), a unified framework that seamlessly integrates state–space sequence modeling with dynamic graph generation for efficient and robust road damage detection. Our framework leverages Graph-Generating SSMs (GG-SSMs) [11] to enable the synchronous evolution of graph topology and sequential features, without the need for manual adjacency specification. High-resolution feature extraction is further enhanced by the incorporation of DefMamba [12], which employs learnable deformable scan paths to capture subtle crack patterns that are typically missed by conventional patch-based operations. Additionally, attention-guided self-supervised calibration, inspired by [13], improves the reliability and stability of dynamically generated topologies. The resulting architecture achieves near-linear computational complexity (approximately 1.5 GFLOPs and 1.8 M parameters in INT8), supports real-time inference (>45 FPS on Jetson Orin Nano), and is readily extensible to a range of infrastructure inspection scenarios.
Comprehensive experiments on public and proprietary datasets—including RDD2022, TD-RD, and our 100K RoadBench—demonstrate that the GMRD significantly outperforms strong baselines such as YOLOv8-s, SD-GCN, and Mamba-Adaptor-Det, both in detection accuracy and segment-level topological consistency. Extensive ablation studies and edge-device validation further confirm the effectiveness and practicality of the proposed approach. In summary, our work presents the first unified framework that combines state–space sequence modeling and dynamic graph reasoning for road damage detection.
Our contributions are summarized as follows:
    • We propose Graph-MambaRoadDet (GMRD), the first road damage detection framework that fuses efficient state–space modeling (Mamba) with dynamic graph generation, enabling simultaneous modeling of spatial–temporal patterns and evolving topologies.
    • Our approach introduces a novel combination of Graph-Generating SSMs and deformable scanning modules, allowing automatic and adaptive construction of sparse learnable road graphs without reliance on fixed adjacency priors.
    • We design an attention-calibrated topology refinement mechanism, leveraging self-supervised signals to enhance the stability and accuracy of dynamically generated graph structures.
    • GMRD achieves state-of-the-art performance on multiple benchmarks in terms of both detection accuracy and segment-level consistency, while maintaining real-time inference speed and low computational cost suitable for edge devices.
    • The proposed framework is highly extensible, supporting plug-and-play adaptation to other infrastructure inspection tasks through simple head replacement.

Paper Structure

The remainder of this paper is organized as follows. Section 2 reviews the related work on road defect detection, graph-based vision models, and state–space modeling. Section 3 introduces our proposed GMRD framework in detail, including the deformable scanning module, dynamic graph generation, and topology-aware reasoning components. Section 4 presents extensive experiments across multiple datasets and provides ablation studies, generalization tests, and runtime comparisons. Finally, Section 5 concludes the paper and outlines future directions.

2. Related Work

2.1. Road Damage Detection and Vision Models

Early road damage detection methods relied on handcrafted features and classical classifiers like SVMs and Random Forests [14,15,16,17,18,19,20,21,22,23,24,25].
As deep learning advanced, CNNs became more popular: DeepCrack achieved pixel-level segmentation via hierarchical feature fusion [26], while RHA-Net introduced hybrid attention mechanisms for efficient deployment on embedded devices [27]. PavementNet [28] and CrackUNet [29] improved defect shape preservation with multi-branch decoding.
Transformer-based architectures like CrackFormer and SwinCrack refined crack segmentation using self-attention and hierarchical backbones [30,31], while methods like RoadFormer [2] introduced ViT-style encoding for domain-specific pretraining.
Real-time object detectors in this domain include YOLOv4–v8 variants adapted for road damage detection [6,32], and ensemble methods improve robustness [33,34]. Lightweight variants like YOLO-Lite [35] enable edge deployment under tight compute budgets.
Foundation models like SAM and DINOv2 have also been explored for zero-shot or few-shot detection of road defects [7,36,37]. However, most prior methods process frames independently and ignore spatial continuity or topological structure, limiting robustness in dynamic urban environments.

2.2. Graph Neural Networks and Dynamic Topology Learning

Graph Neural Networks such as GraphSAGE and GAT are foundational for structured relational modeling [8,9]. Adaptations to road environments include RoadGCN, LaneGraphNet, and RoadNet++ [38,39,40], which use map-based or sensor-fused spatial graphs to model road structures.
However, many of these approaches rely on static adjacency derived from OSM or similar map priors [41,42], lacking adaptability to occlusion or road changes.
Dynamic graph learning methods—e.g., for traffic sign recognition [43], DynGCN for trajectory forecasting [44], and GraphAD and GDN for anomaly detection in video streams [45]—introduce learnable topology that evolves with visual features. However, these approaches often involve costly edge updates or quadratic attention, limiting their deployment in real-time systems. More recent lightweight solutions include GraphLite [46], which balances speed and adaptivity, though it has not yet been applied to defect detection tasks.

2.3. State–Space Models and Hybrid Architectures

State–space models (SSMs), such as S4 [47] and the Mamba family [10], have emerged as efficient linear-complexity alternatives to Transformers for long-range modeling. Mamba variants have since been extended to vision and video domains, including Vision Mamba [48], VideoMamba [49], and 3D Mamba.
EfficientViM [50] leverages a hidden-state mixing duality for building FLOPs-constrained backbones. Deformable Mamba (DefMamba) further enhances spatial flexibility through learned dynamic scanning kernels.
Hybrid models—e.g., GraphFormer [51]—combine dynamic graph reasoning with token-based or SSM backbones for long-range visual structure modeling. However, such hybrid topological models remain underexplored for spatially continuous tasks such as road defect detection, especially under resource constraints.
Summary: In contrast to prior works, our proposed GMRD uniquely integrates deformable state–space modeling, online dynamic graph generation, and attention-guided graph calibration. It supports real-time topology-aware road damage detection on edge devices, addressing the limitations of existing CNN, transformer, and graph-only solutions.

3. Methodology

We present Graph-MambaRoadDet (GMRD), a unified framework for efficient and robust road damage detection via dynamic graph construction and state–space modeling. This section details each key component, with a focus on explicit mathematical formulation.

3.1. Overview

Given an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ captured by a vehicle- or drone-mounted camera, our goal is to estimate (i) pixel- or box-level damage localization and (ii) a road-segment-level health score that can be consumed directly by asset-management systems. Figure 2 illustrates the pipeline.

3.2. EfficientViM Backbone

EfficientViM [52] factorizes self-attention into a hidden-state mixer with linear complexity. Denoting the l-th layer states by $Z^{l}$, the update is
$$Z^{l+1} = Z^{l} + \mathrm{MLP}\!\left(W_h\,\sigma\!\left(U Z^{l}\right)\right),$$
where $U$ and $W_h$ are learnable, and $\sigma$ denotes GELU. We adopt the T1 variant (1.2 GFLOPs, 256-dim hidden size) and keep spatial resolution $\tfrac{H}{4}\times\tfrac{W}{4}$ for the top two stages to preserve crack details.
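For concreteness, a minimal PyTorch sketch of this residual hidden-state-mixer update is given below. The module and dimension names are illustrative only and do not reproduce the official EfficientViM implementation.

```python
import torch
import torch.nn as nn

class HiddenStateMixerBlock(nn.Module):
    """Minimal sketch of the Section 3.2 update Z^{l+1} = Z^l + MLP(W_h * GELU(U Z^l)).
    Illustrative only; not the official EfficientViM-T1 code."""

    def __init__(self, dim: int = 256, hidden: int = 512):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)    # U
        self.Wh = nn.Linear(dim, dim, bias=False)   # W_h
        self.act = nn.GELU()                        # sigma
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, tokens, dim); residual update with linear-complexity mixing
        return z + self.mlp(self.Wh(self.act(self.U(z))))
```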

3.3. Deformable Mamba Blocks

Standard SSMs scan the sequence in a fixed raster order. DefMamba [12] learns an offset field $\Delta p$ that warps the scanning trajectory:
$$h_{t+1} = A\,h_t + B\,z_{t+\Delta p_t}, \qquad y_t = C\,h_t + D\,z_{t+\Delta p_t},$$
where ( A , B , C , D ) are learned state–space parameters. The offsets are generated by a small CNN and discretized to index neighboring patches, enabling sub-pixel crack patterns to influence the hidden trajectory.
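The sketch below illustrates one way such a deformable scan can be realized: a lightweight head predicts per-token offsets that re-index the sequence before a linear recurrence. It is a simplified stand-in for DefMamba, assuming dense state matrices, a plain Python loop, and a linear offset head rather than the CNN and selective-scan kernel used in practice.

```python
import torch
import torch.nn as nn

class DeformableScanSSM(nn.Module):
    """Sketch of a deformable-scan SSM step (Section 3.3); not the official DefMamba kernel.
    Predicted offsets re-index the token sequence before the linear recurrence."""

    def __init__(self, d_model: int = 256, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_state, d_state) * 0.01)
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)
        self.D = nn.Linear(d_model, d_model, bias=False)
        self.offset_head = nn.Linear(d_model, 1)  # predicts one scalar offset per token

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, T, d_model)
        b, t, _ = z.shape
        # discretize predicted offsets and clamp them to valid sequence indices
        dp = self.offset_head(z).squeeze(-1).round().long()
        idx = (torch.arange(t, device=z.device).unsqueeze(0) + dp).clamp(0, t - 1)
        z_warp = torch.gather(z, 1, idx.unsqueeze(-1).expand(-1, -1, z.size(-1)))
        h = z.new_zeros(b, self.A.size(0))
        ys = []
        for step in range(t):
            h = h @ self.A.T + self.B(z_warp[:, step])        # h_{t+1} = A h_t + B z_{t+dp}
            ys.append(self.C(h) + self.D(z_warp[:, step]))    # y_t = C h_t + D z_{t+dp}
        return torch.stack(ys, dim=1)
```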

3.4. Super-Pixel Graph Construction

We run edge-aware SLIC on I to obtain superpixels $\{S_m\}_{m=1}^{M}$, each assigned to the nearest OSM edge via GPS projection. The initial node feature is
$$v_m^{0} = \frac{1}{|S_m|}\sum_{(i,j)\in S_m} Z_{ij},$$
concatenated with road-segment metadata (length, lanes, historical flow).
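A simplified version of this step, using the standard SLIC implementation from scikit-image and mean-pooled features, is sketched below; the paper's pipeline additionally uses an edge-aware SLIC variant, GPS projection onto OSM edges, and metadata concatenation, which are omitted here.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_node_features(image: np.ndarray, feat: np.ndarray, n_segments: int = 35):
    """Sketch of Section 3.4: SLIC superpixels plus mean-pooled features per node.
    image: (H, W, 3) RGB; feat: (H, W, d) feature map upsampled to image resolution.
    Parameter values are illustrative."""
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    nodes = []
    for m in range(labels.max() + 1):
        mask = labels == m
        if mask.any():
            nodes.append(feat[mask].mean(axis=0))  # average of Z_ij over superpixel S_m
    return np.stack(nodes), labels
```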

3.5. Dynamic Graph Generation via GG-SSM

Let $V^{0} = [\,v_1^{0},\dots,v_M^{0}\,] \in \mathbb{R}^{M\times d}$ be the initial feature matrix for the M superpixel nodes, each of dimension d. We aim to dynamically infer graph connectivity using learnable state–space dynamics, without requiring static edge templates. To this end, GG-SSM models latent edge strength through continuous-time linear state dynamics, where the evolution of hidden states $h(t) \in \mathbb{R}^{M\times d}$ is governed by
$$\dot{h}(t) = A\,h(t) + B\,V^{0},$$
$$A = \mathrm{reshape}\!\left(W_q V^{0}\right)\,\mathrm{reshape}\!\left(W_k V^{0}\right)^{\top} / \sqrt{d},$$
where $W_q$ and $W_k \in \mathbb{R}^{d\times d}$ are learnable matrices, and $A \in \mathbb{R}^{M\times M}$ captures pairwise latent similarity among nodes. $B$ is a shared projection matrix that maps node features into the driving term of the dynamics.
Discretizing this formulation at time step $t = \tau$ results in an adjacency estimate $A^{\mathrm{dyn}} \in \mathbb{R}^{M\times M}$ reflecting dynamic edge strength:
$$A^{\mathrm{dyn}} = \operatorname{Top\text{-}k}\!\left(\mathrm{softmax}(A)\right),$$
where Top-k retains only the k largest values per row (i.e., per source node) to ensure sparse connectivity and avoid over-smoothing. All remaining entries are zeroed out.
This dynamic construction differs from prior CNN- or Transformer-based methods by embedding long-range relational priors through continuous-time evolution and token-driven graph learning. It enables flexible image-conditioned topologies aligned with structural road layouts.
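The adjacency estimate can be summarized by the short sketch below, which computes the pairwise similarity, applies a row-wise softmax, and keeps the top-k entries per node. The continuous-time evolution and the reshape operations of the full GG-SSM are abstracted away, the learnable matrices are passed in as plain tensors, and the square-root scaling is our assumption.

```python
import torch
import torch.nn.functional as F

def dynamic_adjacency(v0: torch.Tensor, wq: torch.Tensor, wk: torch.Tensor, k: int = 6):
    """Sketch of the GG-SSM adjacency of Section 3.5; v0: (M, d), wq/wk: (d, d)."""
    d = v0.size(-1)
    a = (v0 @ wq) @ (v0 @ wk).T / d ** 0.5            # (M, M) latent edge strength
    a = F.softmax(a, dim=-1)                          # row-normalized similarities
    topk_vals, topk_idx = a.topk(k, dim=-1)           # keep k strongest outgoing edges per node
    a_dyn = torch.zeros_like(a).scatter_(-1, topk_idx, topk_vals)
    return a_dyn
```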

3.6. Attention-Guided Topology Calibration

Detector self-attention $A^{\mathrm{att}} \in \mathbb{R}^{M\times M}$ offers complementary relational cues. We fuse the two sources via
$$\hat{A} = \lambda\,A^{\mathrm{dyn}} + (1-\lambda)\,A^{\mathrm{osm}}, \qquad A^{\mathrm{osm}} = \eta\,A^{\mathrm{att}} + (1-\eta)\,A^{\mathrm{pri}},$$
where $A^{\mathrm{pri}}$ is the binary OSM adjacency, and $(\lambda,\eta)$ are learnable scalars initialized to 0.5.
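The two-level blend can be written compactly as in the sketch below, where lam and eta stand for the learnable scalars λ and η; all inputs are (M, M) tensors.

```python
import torch

def fuse_topology(a_dyn: torch.Tensor, a_att: torch.Tensor, a_pri: torch.Tensor,
                  lam: torch.Tensor, eta: torch.Tensor) -> torch.Tensor:
    """Sketch of the Section 3.6 fusion of dynamic, attention, and OSM-prior adjacencies."""
    a_osm = eta * a_att + (1.0 - eta) * a_pri     # attention calibrated by the map prior
    return lam * a_dyn + (1.0 - lam) * a_osm      # final fused adjacency A_hat
```

In practice, lam and eta would be registered as nn.Parameter tensors (initialized to 0.5) so that they are optimized jointly with the network weights.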

3.7. Graph Message Passing

With $\hat{A}$ fixed at inference, two GraphSAGE-Lite layers propagate information:
$$v_m^{l+1} = \sigma\!\left(W_1^{l} v_m^{l} + W_2^{l}\,\frac{1}{|\mathcal{N}(m)|}\sum_{n\in\mathcal{N}(m)} \hat{A}_{mn}\, v_n^{l}\right), \qquad l \in \{0,1\}.$$
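A minimal PyTorch sketch of one such propagation layer is given below; it treats the fused adjacency as edge weights and uses a weighted-mean aggregator, which is our reading of the GraphSAGE-Lite layer rather than the exact training code, and the nonlinearity is assumed to be ReLU.

```python
import torch
import torch.nn as nn

class GraphSAGELite(nn.Module):
    """Sketch of the Section 3.7 propagation; a_hat: (M, M) fused adjacency, v: (M, d)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)   # self transform W_1
        self.w2 = nn.Linear(dim, dim)   # neighbour transform W_2
        self.act = nn.ReLU()            # sigma (assumed)

    def forward(self, v: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        deg = (a_hat > 0).sum(dim=-1, keepdim=True).clamp(min=1)   # |N(m)|
        agg = (a_hat @ v) / deg                                    # weighted mean over neighbours
        return self.act(self.w1(v) + self.w2(agg))
```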

3.8. Prediction Heads

Following YOLOv8, we attach a detection head to patch embeddings for bounding boxes and masks. A separate MLP produces a segment health score
$$p_m = \sigma\!\left(w_h^{\top} v_m^{2} + b_h\right),$$
interpreted as the probability of severe damage for road segment m.

3.9. Symmetry-Aware Structural Design

The proposed GMRD framework incorporates symmetry both explicitly and implicitly in its architectural design. Spatial symmetry is respected in the superpixel-based graph construction, where regions along the road are segmented and projected based on GPS alignment. This ensures that structurally symmetric road segments yield comparable graph node representations.
Furthermore, the attention-guided topology fusion step promotes symmetric consistency by combining dynamic adjacency and OpenStreetMap priors, enabling the model to leverage both data-driven and map-based geometric regularities. The fused graph preserves bilateral and translational symmetry properties of road layouts, which are crucial for stable message passing and consistent health scoring across adjacent segments.
During graph message propagation, GraphSAGE layers exploit these symmetries to reinforce similarity among symmetrically positioned nodes—mitigating noise and improving robustness. By embedding symmetry awareness into both graph construction and reasoning, our method achieves higher topological coherence, better generalization, and greater resilience to incomplete or distorted inputs.
This symmetry-driven design not only aligns with the physical properties of road infrastructure but also enhances the interpretability and efficiency of the overall framework.

3.10. Training Objectives

The overall loss combines three terms:
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \beta\,\mathcal{L}_{\mathrm{cons}} + \gamma\,\mathcal{L}_{\mathrm{edge}}.$$
Detection loss $\mathcal{L}_{\mathrm{det}}$ follows YOLOv8 (classification, objectness, box, mask). Consistency loss encourages adjacent segments to yield similar scores:
$$\mathcal{L}_{\mathrm{cons}} = \frac{1}{|E|}\sum_{(m,n)\in E} \mathrm{KL}\!\left(p_m \,\Vert\, p_n\right),$$
with $E$ the edge set of $\hat{A}$. Edge-distillation loss aligns dynamic edges with self-attention:
$$\mathcal{L}_{\mathrm{edge}} = \frac{1}{M^{2}}\sum_{m,n} \hat{A}_{mn}\,\log\frac{\hat{A}_{mn}}{A^{\mathrm{att}}_{mn}}.$$
Hyper-parameters $\beta$ and $\gamma$ are set to 0.2 and 0.1, respectively.
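For illustration, the two auxiliary terms can be computed as in the sketch below, where segment probabilities are treated as Bernoulli distributions for the KL term; the YOLOv8-style detection loss is omitted and variable names are illustrative.

```python
import torch

def auxiliary_losses(p: torch.Tensor, a_hat: torch.Tensor, a_att: torch.Tensor,
                     beta: float = 0.2, gamma: float = 0.1, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the consistency and edge-distillation terms of Section 3.10.
    p: (M,) segment damage probabilities; a_hat, a_att: (M, M) adjacencies."""
    # Consistency term: Bernoulli KL between scores of connected segments
    src, dst = (a_hat > 0).nonzero(as_tuple=True)              # edge set E of A_hat
    pm = p[src].clamp(eps, 1 - eps)
    pn = p[dst].clamp(eps, 1 - eps)
    l_cons = (pm * (pm / pn).log() + (1 - pm) * ((1 - pm) / (1 - pn)).log()).mean()

    # Edge-distillation term: (1/M^2) * sum over active edges of A_hat * log(A_hat / A_att)
    mask = a_hat > 0
    l_edge = (a_hat[mask] * (a_hat[mask] / a_att[mask].clamp_min(eps)).log()).sum() / a_hat.numel()
    return beta * l_cons + gamma * l_edge
```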

3.11. Complexity Analysis

Let N be the patch count and $M \ll N$ the super-pixel count. EfficientViM and DefMamba are $O(N)$; GG-SSM constructs $A^{\mathrm{dyn}}$ in $O(Md)$ and stores only $kM$ edges. GraphSAGE-Lite is likewise $O(kMd)$. End-to-end INT8 inference therefore scales linearly with input resolution, totalling 1.5 GFLOPs and 1.8 M parameters—well within the 10 W power envelope of commodity edge boards.

3.12. Crack Type and Severity Standards

To ensure reproducibility and align with industry practices, we adopt standardized definitions of crack types and severity levels as specified by transportation agencies and pavement assessment protocols. Specifically, we follow the criteria outlined in the Federal Highway Administration’s Long-Term Pavement Performance (LTPP) Distress Identification Manual [53] and ASTM D6433-20 [54], which provide detailed classification schemes for typical road surface distresses. In our work, we focus on three major crack categories: (i) longitudinal cracks (parallel to the road centerline), (ii) transverse cracks (perpendicular to the direction of travel), and (iii) alligator or crocodile cracks, which are interconnected fatigue-induced cracks forming a block-like pattern. Severity levels—low, medium, and high—are assigned based on the crack width, length, and density per pavement area. For instance, according to ASTM D6433-20, low-severity longitudinal cracks are typically defined as having widths less than 6 mm, while high-severity ones exceed 19 mm with clear raveling or spalling along the edges.
To support learning-based classification, these definitions are encoded into our annotation process and supervised labels. The labeling protocol is consistent with previous deep learning studies on pavement assessment [2], facilitating comparability and standard compliance. Our dataset includes metadata tags for crack type and severity per instance, which are used both in training and for severity-level-aware evaluation.
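As a simple illustration of how these bands translate into labels, the snippet below maps a measured crack width to a severity level for longitudinal cracks; the thresholds follow the values quoted above, while the full protocol also weighs length, density, raveling, and spalling, so this is a didactic sketch rather than the complete labeling rule.

```python
def longitudinal_crack_severity(width_mm: float) -> str:
    """Illustrative width-only severity mapping for longitudinal cracks,
    loosely following the ASTM D6433-style bands cited in the text."""
    if width_mm < 6.0:        # low severity: width below 6 mm
        return "low"
    if width_mm <= 19.0:      # medium severity: roughly 6-19 mm (boundary handling assumed)
        return "medium"
    return "high"             # high severity: above 19 mm, typically with raveling or spalling
```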

4. Experiments

4.1. Datasets and Annotations

We evaluate Graph-MambaRoadDet (GMRD) across three public road damage benchmarks and one newly constructed graph-aware corpus. These datasets vary in terms of their annotation granularity, geographic distribution, and topological coverage. Table 1 summarizes their key characteristics.
  • RDD2022
The Road Damage Detection Challenge [55] provides 26,336 front-facing images from dashcams in Japan, India, and the Czech Republic, annotated with bounding boxes for four road damage types: crack, pothole, rutting, and other. Following the official setup, we adopt the train/val/test split of 70/10/20 %.
  • TD-RD
TD-RD [41] contains 11,250 UAV-captured images collected across six U.S. highways. Each image is labeled with pixel-level masks for five fine-grained damage categories, supporting dense evaluation via mean Intersection-over-Union (mIoU). We resize all images to 1024 × 768 and follow the stratified split of 60/20/20% provided by the authors.
  • RoadBench-100K
RoadBench-100K [57] is our large-scale graph-aware dataset comprising 103,418 vehicle-mounted frames under varied lighting and weather conditions. It includes bounding boxes, per-pixel segmentation masks, GPS/IMU metadata, and OSM-aligned topological labels, making it suitable for both local and segment-level evaluations. We use a curated split of 65/15/20% for train/val/test.
  • Pre-processing
All images are resized to 640 × 640 using aspect-ratio preserving padding. Pixel masks for RDD2022 and RoadBench are derived from bounding box polygons using official tools. GPS metadata are projected into WGS 84/UTM zones corresponding to local corridors, achieving sub-meter accuracy when aligned with OSM polylines.
  • Evaluation splits
To ensure strict spatial separation, we enforce road-segment disjointness: no OpenStreetMap (OSM) polyline appears in more than one data split. This guarantees topological independence between training and testing scenes.

4.2. Implementation Details

  • Framework.
All experiments are conducted in PyTorch 2.2 with CUDA 12.4 and cuDNN 8.9. We implement EfficientViM, DefMamba, and GG-SSM atop the official repositories and will release the code, trained weights, and a detailed configuration file upon publication. Super-pixel segmentation relies on PySLIC with edge-aware compactness m = 10 and region size 2400 px, yielding M ∈ [28, 42] nodes per image.
  • Backbone and input resolution.
Unless stated otherwise, all detectors ingest 640 × 640 RGB images. EfficientViM-T1 is initialized from ImageNet-21k pretraining; the first convolution is adapted to accept three channels, and positional embeddings are bilinearly resized. Patch size P = 16; the top two spatial stages retain 1/4 input resolution to preserve the crack morphology.
  • Optimization.
Models are trained for 150 epochs with AdamW (β₁ = 0.9, β₂ = 0.999, weight decay 5 × 10⁻⁴). The base learning rate is set to 4 × 10⁻⁴ for backbone layers and 1 × 10⁻³ for task-specific heads, scaled linearly with batch size (default 64 images across 8 × A100 40 GB). We adopt a 10-epoch warm-up from 10⁻⁶, followed by cosine decay. Loss coefficients β and γ in Equation (11) are fixed to 0.2 and 0.1; we linearly ramp them during the first 20 epochs.
  • Data augmentation.
We employ Mosaic, MixUp (ratio 0.2), random hue–saturation–value jitter (±15%), CutOut (up to 20 holes, 32 px), and horizontal flip (prob. 0.5). For TD-RD pixel masks, we additionally apply random elastic deformation to simulate asphalt distortion.
  • Graph hyperparameters.
GG-SSM keeps the k = 6 largest outgoing edges per node; λ and η in Equation (6) are initialized at 0.5 and optimized jointly with network weights. GraphSAGE-Lite hidden size is 256; LayerNorm precedes each aggregation step. Temporal edges connect frames within Δt ≤ 5 s, provided the GPS displacement is < 20 m.
  • Edge deployment.
For Jetson Orin Nano, we export the network to ONNX and compile with TensorRT 9.0, enabling INT8 with per-layer entropy calibration on 512 images. The batch size is set to 1; the latency is measured via trtexec --separateProfileRun. On Huawei Ascend Atlas 200 DK, we convert the same ONNX graph with MindX SDK 3.0 and use mixed-precision FP16+INT8. The power draw is averaged over 300 s using a Keysight N6705C power analyzer.
  • Evaluation Metrics.
Detection. The bounding-box performance is reported using mean Average Precision at IoU thresholds 0.50 to 0.95 (mAP@50:95), as well as AP@50. The segmentation accuracy is evaluated using the mean Intersection over Union (mIoU).
Graph-level. We define G-mAP (Graph mean Average Precision) as the mean average precision for road-segment-level damage classification. Specifically, each superpixel node m is assigned a binary label (damaged vs. intact) from the ground truth, and the predicted probability p m is thresholded to compute standard precision–recall metrics over all segments. In addition, we report a consistency score that quantifies the smoothness of predictions across connected road segments:
$$\mathrm{Consistency} = \frac{1}{2}\left[\,1 + \left(1 - \frac{1}{|E|}\sum_{(m,n)\in E} \left| p_m - p_n \right|\right)\right],$$
where E is the edge set of the final graph. A higher consistency score indicates smoother segment-level predictions aligned with the topological structure; a minimal computation sketch of this metric is provided at the end of this subsection.
Efficiency. The frames per second (FPS) are averaged over 2000 input frames (1024 × 1024) on an NVIDIA Jetson Orin board. The wall-clock latency excludes image pre-processing and graph construction and is averaged over three independent runs.
All reported numbers are given as the mean ± standard deviation across three trials.
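For clarity, the NumPy sketch below computes the consistency score from per-segment probabilities and the fused edge list; the variable names are illustrative and this is a simplified reading of the metric as reconstructed above, not the official evaluation code.

```python
import numpy as np

def consistency_score(p: np.ndarray, edges: np.ndarray) -> float:
    """Consistency metric of Section 4.2: p is (M,) with per-segment damage
    probabilities, edges is (E, 2) with index pairs from the final graph."""
    diffs = np.abs(p[edges[:, 0]] - p[edges[:, 1]])       # |p_m - p_n| over all edges in E
    return 0.5 * (1.0 + (1.0 - diffs.mean()))             # higher score = smoother predictions
```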

4.3. Comparison with State of the Art

Table 2 contrasts GMRD with six strong baselines across three benchmarks. On RDD2022 we achieve 61.8 mAP50:95—a gain of 2.6 points over the best competing model (Mamba-Adaptor-Det) and +11.3 over YOLOv8-s. The improvements are even larger on the pixel-level dataset TD-RD, where the proposed graph reasoning yields 60.9 mAP50:95, outperforming Mask2Former by +6.1 points. The large-scale RoadBench-100K further confirms the trend: GMRD attains 55.4 mAP50:95, surpassing SD-GCN by +3.1 points. Remarkably, these gains come with one to two orders of magnitude fewer resources: our model uses only 1.8 M parameters and 12 GFLOPs, whereas the closest competitor needs 128 GFLOPs.
Table 3 emphasizes two hallmarks of GMRD that are absent from standard detectors. First, on a Jetson Orin Nano we sustain 45 FPS with a latency of 22 ms and a power draw of 7.2 W, nearly 1.7× faster than YOLOv8-s while consuming 26% less power. Neither Mask2Former-SwinB nor RT-DETR can fit into the 4 GB memory budget of the device. Second, our dynamic-graph design translates into 55.9 G-mAP50:95 and a consistency score of 0.79 on RoadGraph-RDD, improving over SD-GCN by +5.3 and 0.05, respectively. These results confirm that topology-aware reasoning not only elevates accuracy but also suppresses spurious discontinuities across adjacent road segments.
In summary, the GMRD simultaneously raises the detection quality, enhances the topological fidelity, and reduces the computational cost, establishing a new state of the art under both data-center and edge-deployment settings.

4.4. Cross-Domain Generalization Study

To evaluate the generalization capability of our model, we conduct cross-domain experiments by training on RDD-Japan and testing on three unseen road damage datasets: RDD-India, CNRDD, and CRDDC’22. As shown in Table 4, our proposed GMRD framework consistently outperforms strong baselines including YOLOv8, RT-DETR, and Deformable-DETR.
Specifically, GMRD achieves 69.5%, 64.2%, and 61.0% mAP@0.5 on RDD-India, CNRDD, and CRDDC’22, respectively, surpassing the next-best method (Deformable-DETR) by margins of 3.1%, 2.2%, and 2.4%. These improvements highlight the robustness of our design under domain shift.
We attribute this superior performance to three key factors: (1) the dynamic graph generation captures structural cues beyond the local appearance; (2) the DefMamba module adapts to local distortions in crack patterns; and (3) the attention-guided topology fusion aligns noisy visual relations with global map priors. Together, these components enable the GMRD to generalize across cities and countries without additional fine-tuning.

4.5. Ablation Studies

To quantify the contribution of each architectural element, we progressively disable DefMamba, GG-SSM, EGTR calibration, and the graph-consistency loss while keeping all other settings fixed. The results on RDD2022 are reported in Table 5.
  • Effect of DefMamba.
Removing the deformable state–space blocks (row 2) leads to a drop of 1.7 mAP and 1.9 G-mAP, despite identical graph reasoning. This performance decline underscores the importance of adaptive scanning in capturing subtle visual cues such as faint cracks, fine road boundaries, and eroded pavement patterns. Unlike conventional convolution or token-based feature extractors, DefMamba leverages a deformable state–space mechanism to flexibly sample from a dynamically adjusted neighborhood, enabling more precise localization of small and thin damage patterns. Moreover, this accuracy gain comes at no cost to the runtime efficiency, as DefMamba collapses into a single linear kernel during inference.
  • Effect of GG-SSM.
Replacing the dynamic graph with a static OSM-based adjacency results in a significant degradation of G-mAP (−3.7 points) and a 2.4 mAP drop, highlighting the critical role of sample-dependent topologies. Unlike static graphs, our GG-SSM dynamically constructs instance-aware edge connections based on spatial affinity and structural continuity, which allows it to adapt to complex road configurations such as construction zones, occlusions, or blocked lanes. The performance drop when reverting to the static topology demonstrates that relying solely on external priors (e.g., maps) is insufficient for robust scene interpretation under real-world perturbations.
  • Effect of EGTR calibration.
Disabling the EGTR module notably weakens the topological reasoning, with a 2.3 G-mAP and 0.04 consistency score drop. This module performs attention-guided refinement of the graph’s edge weights, enabling better aggregation of non-local information across sparse or irregularly distributed defect regions. Though the change in frame-level mAP is smaller (−1.5), the substantial drop in G-mAP and consistency suggests that the EGTR plays a crucial role in enhancing the structural quality of the output graph, especially for long-range or disconnected crack segments. Importantly, it preserves real-time performance, as the calibration operates with lightweight attention computation.
  • Effect of Graph Consistency Loss.
Omitting the KL-based consistency regularization degrades the G-mAP by 0.6 and reduces the structural consistency by 0.06, the largest consistency drop across all configurations. This loss encourages smooth label transitions among adjacent segments by minimizing distributional divergence, which is particularly useful in noisy or ambiguous regions. The relatively minor impact on mAP (−1.2) indicates that this loss mainly improves inter-node agreement, rather than directly altering pixel-level predictions. Nevertheless, its removal leads to fragmented or jittery graph outputs, especially when adjacent damage patches share similar textures but differ in illumination or context.
  • Takeaway.
Each module contributes a distinct yet complementary function to the full GMRD pipeline. DefMamba enhances fine-grained spatial sampling, GG-SSM dynamically adapts to scene context, EGTR calibrates noisy long-range dependencies, and the consistency loss promotes structural stability. Their combination delivers a cumulative boost of +3.9 mAP and +10.9 G-mAP over the EffViM-T1 baseline, while preserving inference throughput at real-time level (45 FPS). This ablation study empirically validates the architectural design, demonstrating that both spatial modeling (DefMamba) and topological reasoning (GG-SSM, EGTR, consistency loss) are indispensable for accurate and coherent road defect understanding.
Figure 3 explores the trade-off between graph sparsity and runtime by varying the number of retained edges per node (k = 2–10). As k increases, the mAP50:95 rises sharply from 59.8 to 61.8 at k = 6, after which the curve saturates (<0.3-point gain when doubling k to 10). In contrast, the throughput degrades almost linearly: FPS drops from 55 at k = 2 to 32 at k = 10. We therefore adopt k = 6 as the default, striking a sweet spot that delivers +1.5 mAP over the static OSM graph (k = 0) while preserving real-time speed (45 FPS) on the Orin Nano. This result confirms that a moderately sparse dynamic topology is sufficient to capture long-range context without overwhelming the edge hardware.
We next probe the influence of the fusion weight λ that interpolates dynamic GG-SSM edges with the static OSM prior (Equation (6)); the auxiliary coefficient η that blends self-attention with the map prior is varied in tandem so that η = λ, and we observed the same trend when optimizing η independently. As illustrated in Figure 3a, both topology metrics peak at λ ≈ 0.5. When λ is set to 0 (pure OSM), the G-mAP drops by 4.7 points, and the consistency score falls to 0.72, indicating that a static graph alone cannot cope with lane closures or occlusions. Conversely, pushing λ toward 1.0 (purely dynamic topology) slightly degrades the consistency (0.79 → 0.74) and incurs a 1.8-point loss in G-mAP, as the model loses the global geodesic structure encoded by the map prior. The relatively flat plateau in the range λ ∈ [0.4, 0.6] suggests that the fusion mechanism is robust to moderate mis-tuning and confers an easy-to-reproduce sweet spot; we therefore adopt λ = 0.5 throughout all experiments.
Figure 4 investigates how the weighting factors of our two auxiliary losses influence the performance. Panel (a) shows that increasing the consistency term from β = 0 to 0.2 yields a steady gain of +1.6 mAP50:95 and +0.07 in consistency. Beyond that point, both curves flatten and even decline at β > 0.3, indicating that excessive smoothing can oversuppress legitimate discontinuities at construction joints. We therefore fix β = 0.2 as a safe optimum.
Panel (b) sweeps the edge-distillation weight γ. Setting γ = 0.10–0.15 delivers the highest G-mAP and overall mAP; larger values add marginal benefit while slightly increasing the training variance. Crucially, the model remains within <0.3 mAP of peak performance across a broad interval (γ ∈ [0.05, 0.20]), demonstrating that our framework is robust to loss-weight mis-tuning and easy to reproduce on new datasets.

4.6. Edge-Device Efficiency

Deployability on low-power hardware is crucial for roadside or UAV-based inspection. Table 3 compares the GMRD with representative lightweight and graph-aware detectors on a Jetson Orin Nano (4 GB, 10 W TDP) using INT8 TensorRT inference.
  • Real-time throughput.
GMRD sustains 45 FPS, almost twice the speed of SD-GCN (18 FPS) and 1.7× faster than YOLOv8-s (26 FPS), despite delivering a much higher mAP (Section 4.3). The gain stems from two factors: (i) our backbone (EffViM-T1 + DefMamba) requires only 12 GFLOPs—1/8 of YOLOv8-s; (ii) the GG-SSM graph layer scales linearly with M ≈ 35 nodes, whereas SD-GCN’s GAT is quadratic in pixel-level super-nodes.
  • Latency and power.
End-to-end latency is reduced to 22 ms, well below the 33 ms budget for 30 FPS video, and the power draw drops from 9.7 W (YOLOv8-s) to 7.2 W. The single-pass state-space kernels produced by DefMamba contribute less than 1 ms, confirming that our model remains compute-bound rather than memory-bound.
  • Memory footprint and compatibility.
GMRD occupies 350 MB of device memory, leaving 0.7 GB headroom for input buffers and post-processing. Larger Transformers such as Mask2Former-SwinB cannot be compiled at INT8 precision within the 4 GB limit (marked “—” in the table), further highlighting the practicality of our design.

4.7. Qualitative and Robustness Analysis

  • Visual comparison in challenging scenes.
Figure 5 juxtaposes the predictions of the GMRD with three strong baselines under two extreme conditions. In the shadow scenario (top row), both YOLOv8-s and SD-GCN fire multiple false positives along the high-contrast shadow edge, whereas our model almost perfectly suppresses these artefacts and preserves the longitudinal crack. In the water-puddle scenario (bottom row), reflections mislead all competing detectors into hallucinating damage on the specular surface; only the GMRD successfully localizes the submerged crack thanks to its graph-guided spatial reasoning.
  • Attention-map inspection.
To better understand the remaining failure modes, Figure 6 overlays the model’s final-layer attention on three difficult samples. In the presence of a large water reflection, specular highlights attract excessive attention, triggering false alarms along the reflection boundary. Under back-lighting, saturated pixels at the vanishing point dominate the heatmap and suppress true crack cues. Finally, in the construction scene, the model focuses on a warning sign whose polygonal silhouette resembles a pothole, leading to a spurious detection.
  • Failure-mode discussion.
The visual evidence suggests two dominant error sources: (i) high-frequency photometric cues—specularities, saturated glare—can outweigh geometric crack evidence and (ii) non-road objects with crack-like contours occasionally hijack the attention mechanism. Addressing these weaknesses will require (a) illumination-invariant features, such as polarized or multispectral inputs and (b) reflection-aware augmentation that explicitly simulates puddles during training. We leave these directions for future work.

4.8. Hardware Runtime Comparison with Lightweight Models

To contextualize the efficiency of our proposed framework, we compare the GMRD with several representative lightweight detectors, including MobileNet-SSD [58], EfficientDet-D0 [59], YOLOv5-Nano, and our baseline EffViM-T1. All models are benchmarked on a Jetson Orin Nano (INT8) using TensorRT acceleration with input resolution 1024 × 1024 . The results in Table 6 demonstrate that the GMRD achieves favorable trade-offs between speed and accuracy, outperforming other lightweight models in both detection quality and runtime efficiency.
As shown in Table 6, the GMRD achieves competitive frame rate and latency with fewer parameters and FLOPs compared to existing models, making it highly suitable for edge deployment in real-world road infrastructure monitoring.

5. Discussion and Future Work

  • Symmetry-Aware Gains.
Graph-MambaRoadDet (GMRD) advances the state of the art across three orthogonal dimensions: (i) a +3.9 mAP50:95 and +10.9 G-mAP gain over the EfficientViM-T1 baseline, driven by graph reasoning that explicitly models spatial and topological symmetry; (ii) real-time inference at 45 FPS within a 7.2 W power envelope on Jetson Orin Nano, highlighting a balance between symmetry-preserving accuracy and edge efficiency; and (iii) enhanced topological consistency (+0.05) on RoadGraph-RDD, ensuring that predictions respect the symmetric structure of road networks—crucial for asset planning.
  • Limitations.
First, the GMRD assumes GPS localization accuracy within 2 m; excessive drift may misassign superpixels to incorrect road segments, weakening the geometric symmetry alignment. Second, multi-layer structures such as tunnels or viaducts challenge the 2D planar symmetry assumed by OpenStreetMap topology, occasionally inducing over-smoothing across disconnected levels. Third, the model is evaluated on daylight RGB imagery; its ability to maintain symmetric consistency under low-light or severe weather conditions remains unquantified.
  • Runtime Bottlenecks.
Profiling shows that NMS within the YOLO head consumes 28% of the latency. Replacing it with a one-stage symmetry-preserving anchor-free head (e.g., DIoU sorting) could reduce the inference time by up to 10% while maintaining localization consistency.
  • Generalization and Deployability.
Since the GG-SSM operates on generic graph node embeddings, the GMRD can generalize to structurally symmetric domains such as bridge crack propagation or tunnel seepage detection. In the absence of high-precision GPS, alternative localization strategies such as visual SLAM or map-matching can preserve the symmetric projection consistency during graph construction.
  • Future Directions.
(i) Temporal Graph Forecasting. Incorporating recurrent updates into GG-SSM may enable forecasting of damage propagation patterns that exploit spatio-temporal symmetry along roads. (ii) Multimodal Sensor Fusion. Integrating thermal or radar modalities can help maintain feature symmetry in night-time or occluded conditions. (iii) Privacy-Preserving Symmetric Learning. Beyond on-device anonymization, we plan to explore federated learning to train GMRD models while respecting symmetry constraints without transferring raw imagery. (iv) 3D Symmetric Topology. Coupling dynamic graph structures with high-definition maps and LiDAR will support reasoning over complex multi-level topologies while preserving vertical and horizontal symmetry.
Together, these directions will enhance the GMRD’s robustness, generalizability, and alignment with symmetric properties inherent in road networks, enabling trustworthy large-scale infrastructure monitoring.

Author Contributions

Conceptualization, Z.T. and X.S.; methodology, Z.T.; software, Z.T.; validation, X.S. and Y.B.; formal analysis, Z.T.; investigation, X.S.; resources, Y.B.; data curation, X.S.; writing—original draft preparation, Z.T.; writing—review and editing, Z.T., X.S. and Y.B.; visualization, Z.T.; supervision, Z.T.; project administration, Z.T.; funding acquisition, Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Beijing Jiaotong University.

Data Availability Statement

The datasets used in this study are publicly available in RDD2022 [55] at https://github.com/sekilab/RoadDamageDetector (accessed on 1 January 2025), in TD-RD [41] on request from the authors, and in RoadBench-100K [57] on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hacıefendioğlu, K.; Başağa, H.B. Concrete road crack detection using deep learning-based faster R-CNN method. Iran. J. Sci. Technol. Trans. Civ. Eng. 2022, 46, 1621–1633. [Google Scholar] [CrossRef]
  2. Maeda, H.; Sekimoto, Y.; Seto, T.; Kashiyama, T.; Omata, H. Road damage detection using deep neural networks with images captured through a smartphone. arXiv 2018, arXiv:1801.09454. [Google Scholar] [CrossRef]
  3. Wang, N.; Shang, L.; Song, X. A transformer-optimized deep learning network for road damage detection and tracking. Sensors 2023, 23, 7395. [Google Scholar] [CrossRef] [PubMed]
  4. Jepsen, T.S.; Jensen, C.S.; Nielsen, T.D. Graph convolutional networks for road networks. IEEE Trans. Intell. Transp. Syst. 2019, 23, 460–463. [Google Scholar]
  5. Lebaku, P.K.R.; Gao, L.; Lu, P.; Sun, J. Deep learning for pavement condition evaluation using satellite imagery. Infrastructures 2024, 9, 155. [Google Scholar] [CrossRef]
  6. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  8. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1024–1034. [Google Scholar]
  9. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  10. Gu, A.; Tao, A. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  11. Zubic, N.; Scaramuzza, D. Gg-ssms: Graph-generating state space models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 28863–28873. [Google Scholar]
  12. Liu, L.; Zhang, M.; Yin, J.; Liu, T.; Ji, W.; Piao, Y.; Lu, H. Defmamba: Deformable visual state space model. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 8838–8847. [Google Scholar]
  13. Li, L.; Zhou, Z.; Wu, S.; Cao, Y. Multi-scale edge-guided learning for 3d reconstruction. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 109. [Google Scholar] [CrossRef]
  14. Koch, C.; Brilakis, I. Pothole detection in asphalt pavement images. Adv. Eng. Inform. 2015, 29, 966–975. [Google Scholar] [CrossRef]
  15. Fan, Z.; Wu, Y.; Lu, J.; Li, W. Automatic pavement crack detection based on structured prediction with the convolutional neural network. arXiv 2018, arXiv:1802.02208. [Google Scholar] [CrossRef]
  16. Chen, G.H.; Ni, J.; Chen, Z.; Huang, H.; Sun, Y.L.; Ip, W.H.; Yung, K.L. Detection of highway pavement damage based on a CNN using grayscale and HOG features. Sensors 2022, 22, 2455. [Google Scholar] [CrossRef]
  17. Safyari, Y.; Mahdianpari, M.; Shiri, H. A review of vision-based pothole detection methods using computer vision and machine learning. Sensors 2024, 24, 5652. [Google Scholar] [CrossRef]
  18. Zhang, J.; Xia, H.; Li, P.; Zhang, K.; Hong, W.; Guo, R. A Pavement Crack Detection Method via Deep Learning and a Binocular-Vision-Based Unmanned Aerial Vehicle. Appl. Sci. 2024, 14, 1778. [Google Scholar] [CrossRef]
  19. Li, K.; Xu, W.; Yang, L. Deformation characteristics of raising, widening of old roadway on soft soil foundation. Symmetry 2021, 13, 2117. [Google Scholar] [CrossRef]
  20. Fan, L.; Zou, J. A Novel Road Crack Detection Technology Based on Deep Dictionary Learning and Encoding Networks. Appl. Sci. 2023, 13, 12299. [Google Scholar] [CrossRef]
  21. Hamishebahar, Y.; Guan, H.; So, S.; Jo, J. A Comprehensive Review of Deep Learning-Based Crack Detection Approaches. Appl. Sci. 2022, 12, 1374. [Google Scholar] [CrossRef]
  22. Ahmed, K.R. Smart Pothole Detection Using Deep Learning Based on Dilated Convolution. Sensors 2021, 21, 8406. [Google Scholar] [CrossRef]
  23. Li, Y.; Chen, J.; Feng, X.; Wang, X.; Zhang, L. A Pavement Crack Detection and Evaluation Framework for a UAV-Based Inspection System. Appl. Sci. 2024, 14, 1157. [Google Scholar] [CrossRef]
  24. Liu, S.S.; Budiwirawan, A.; Arifin, M.F.A.; Chen, W.T.; Huang, Y.H. Optimization model for the pavement pothole repair problem considering consumable resources. Symmetry 2021, 13, 364. [Google Scholar] [CrossRef]
  25. Li, Z.; Ji, Y.; Wu, A.; Xu, H. MGD-YOLO: An Enhanced Road Defect Detection Algorithm Based on Multi-Scale Attention Feature Fusion. Comput. Mater. Contin. 2025, 84, 5613–5635. [Google Scholar] [CrossRef]
  26. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. DeepCrack: Learning Hierarchical Convolutional Features for Crack Detection. IEEE Trans. Image Process. 2019, 28, 1498–1512. [Google Scholar] [CrossRef] [PubMed]
  27. Zhu, G.; Fan, Z.; Liu, J.; Yuan, D.; Ma, P.; Wang, M.; Sheng, W.; Wang, K.C. RHA-Net: An Encoder-Decoder Network with Residual Blocks and Hybrid Attention Mechanisms for Pavement Crack Segmentation. arXiv 2022, arXiv:2207.14166. [Google Scholar]
  28. Li, H.; Song, D.; Liu, Y.; Li, B. Automatic pavement crack detection by multi-scale image fusion. IEEE Trans. Intell. Transp. Syst. 2018, 20, 2025–2036. [Google Scholar] [CrossRef]
  29. Shi, P.; Zhu, F.; Xin, Y.; Shao, S. U2CrackNet: A deeper architecture with two-level nested U-structure for pavement crack detection. Struct. Health Monit. 2023, 22, 2910–2921. [Google Scholar] [CrossRef]
  30. Liu, H.; Miao, X.; Mertz, C.; Xu, C.; Kong, H. Crackformer: Transformer network for fine-grained crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3783–3792. [Google Scholar]
  31. Wang, C.; Liu, H.; An, X.; Gong, Z.; Deng, F. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digit. Signal Process. 2024, 145, 104297. [Google Scholar] [CrossRef]
  32. Tang, Z.; Chamchong, R.; Pawara, P. A comparison of road damage detection based on YOLOv8. In Proceedings of the 2023 International Conference on Machine Learning and Cybernetics (ICMLC), Adelaide, Australia, 9–11 July 2023; pp. 223–228. [Google Scholar]
  33. Park, J.; Min, K.; Kim, H.; Lee, W.; Cho, G.; Huh, K. Road surface classification using a deep ensemble network with sensor feature selection. Sensors 2018, 18, 4342. [Google Scholar] [CrossRef]
  34. Kailkhura, V.; Aravindh, S.; Jha, S.S.; Jayanthi, N. Ensemble learning-based approach for crack detection using CNN. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (48184), Tirunelveli, India, 15–17 June 2020; pp. 808–815. [Google Scholar]
  35. Huang, R.; Pedoeem, J.; Chen, C. YOLO-LITE: A real-time object detection algorithm optimized for non-GPU computers. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2503–2510. [Google Scholar]
  36. Feng, W.; Guan, F.; Sun, C.; Xu, W. Road-SAM: Adapting the segment anything model to road extraction from large very-high-resolution optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6012605. [Google Scholar] [CrossRef]
  37. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  38. Wang, B.; Lin, Y.; Guo, S.; Wan, H. GSNet: Learning spatial-temporal correlations from geographical and semantic aspects for traffic accident risk forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4402–4409. [Google Scholar]
  39. Yuan, T.; Cao, W.; Zhang, S.; Yang, K.; Schoen, M.; Duraisamy, B. Lane detection and estimation from surround view camera sensing systems. In Proceedings of the 2023 IEEE Sensors, Vienna, Austria, 29 October–1 November 2023; pp. 1–4. [Google Scholar]
  40. Brust, C.A.; Sickert, S.; Simon, M.; Rodner, E.; Denzler, J. Convolutional patch networks with spatial prior for road detection and urban scene understanding. arXiv 2015, arXiv:1502.06344. [Google Scholar] [CrossRef]
  41. Xiao, X.; Li, Z.; Wang, W.; Xie, J.; Lin, H.; Roy, S.K.; Wang, T.; Xu, M. TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection. arXiv 2025, arXiv:2501.14302. [Google Scholar]
  42. Li, Z.; Xie, Y.; Xiao, X.; Tao, L.; Liu, J.; Wang, K. An Image Data Augmentation Algorithm Based on YOLOv5s-DA for Pavement Distress Detection. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 891–895. [Google Scholar] [CrossRef]
  43. Weng, W.; Fan, J.; Wu, H.; Hu, Y.; Tian, H.; Zhu, F.; Wu, J. A decomposition dynamic graph convolutional recurrent network for traffic forecasting. Pattern Recognit. 2023, 142, 109670. [Google Scholar] [CrossRef]
  44. Xu, Y.; Han, L.; Zhu, T.; Sun, L.; Du, B.; Lv, W. Generic dynamic graph convolutional network for traffic flow forecasting. Inf. Fusion 2023, 100, 101946. [Google Scholar] [CrossRef]
  45. Kim, H.; Lee, B.S.; Shin, W.Y.; Lim, S. Graph anomaly detection with graph neural networks: Current status and challenges. IEEE Access 2022, 10, 111820–111829. [Google Scholar] [CrossRef]
  46. Pham, T.V.; Tran, N.N.Q.; Pham, H.M.; Nguyen, T.M.; Ta Minh, T. Efficient low-latency dynamic licensing for deep neural network deployment on edge devices. In Proceedings of the 3rd International Conference on Computational Intelligence and Intelligent Systems, Tokyo, Japan, 13–15 November 2020; pp. 44–49. [Google Scholar]
  47. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  48. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  49. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. VideoMamba: State Space Model for Efficient Video Understanding. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  50. Lee, S.; Choi, J.; Kim, H.J. EfficientViM: Efficient Vision Mamba with Hidden-State Mixer based State Space Duality. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14923–14933. [Google Scholar]
  51. Guo, X.; Zhao, L. A systematic survey on deep generative models for graph generation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5370–5390. [Google Scholar] [CrossRef]
  52. Chelliah, P.R.; Rahmani, A.M.; Colby, R.; Nagasubramanian, G.; Ranganath, S. Model Optimization Methods for Efficient and Edge AI: Federated Learning Architectures, Frameworks and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2024. [Google Scholar]
  53. Miller, J.S.; Bellinger, W.Y. Distress Identification Manual for the Long-Term Pavement Performance Program; Federal Highway Administration: Washington, DC, USA, 2003. [Google Scholar]
  54. ASTM-D6433; Standard Practice for Roads and Parking Lots Pavement Condition Index Surveys. ASTM: West Conshohocken, PA, USA, 2009.
  55. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2022: A Multi-National Image Dataset for Automatic Road Damage Detection. arXiv 2022, arXiv:2209.08538. [Google Scholar] [CrossRef]
  56. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  57. Xiao, X.; Zhang, Y.; Wang, J.; Zhao, L.; Wei, Y.; Li, H.; Li, Y.; Wang, X.; Roy, S.K.; Xu, H.; et al. RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding. arXiv 2025, arXiv:2507.17353. [Google Scholar] [CrossRef]
  58. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  59. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 1. Dynamic road graph construction on a real-world map. We visualize the super-pixel nodes extracted from road imagery, the dynamic edges inferred by GG-SSM, and final segment health scores projected as a heatmap. Our method adaptively aligns graph structure with the underlying city topology, enabling robust, interpretable, and segment-aware reasoning. Note: This figure is a conceptual illustration of the GMRD pipeline and does not correspond to a specific geographic region.
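As a concrete illustration of the node-to-segment projection in Figure 1, the following minimal Python sketch aggregates per-node damage scores into segment-level health values. The function and variable names are hypothetical, and the aggregation rule (mean damage per segment) is an assumption for illustration, not the released implementation.

```python
import numpy as np

def segment_health(node_scores, node_to_segment):
    """Aggregate per-node damage scores into per-segment health values.

    node_scores: dict {node_id: damage probability in [0, 1]}
    node_to_segment: dict {node_id: OSM segment id} obtained from GPS projection
    Returns {segment_id: health in [0, 1]}, where 1.0 means undamaged.
    """
    per_segment = {}
    for node_id, score in node_scores.items():
        seg = node_to_segment[node_id]
        per_segment.setdefault(seg, []).append(score)
    # Health = 1 - mean damage score; max() pooling would instead flag worst-case damage.
    return {seg: 1.0 - float(np.mean(scores)) for seg, scores in per_segment.items()}

# Example: three super-pixel nodes mapped onto two road segments.
scores = {0: 0.82, 1: 0.10, 2: 0.35}
mapping = {0: "osm_101", 1: "osm_101", 2: "osm_102"}
print(segment_health(scores, mapping))  # {'osm_101': 0.54, 'osm_102': 0.65}
```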
Figure 2. Graph-MambaRoadDet (GMRD) pipeline for road damage detection. The input road image is encoded via EfficientViM and DefMamba to preserve warped state–space features. Superpixel nodes are constructed via SLIC and GPS projection. GG-SSM forms dynamic graph edges, which are fused with OSM-based priors through attention-guided fusion. Message passing over the graph supports dual outputs: a damage detector head (e.g., YOLO-style) and road-segment health scoring via soft heatmaps.
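To make the Figure 2 data flow easier to follow, here is a skeletal PyTorch sketch of the same sequence of stages (backbone features, dynamic edges, prior fusion, message passing, dual heads). Every submodule is a trivial stand-in (Linear layers) chosen only to keep the snippet runnable; it is not the EfficientViM, DefMamba, or GG-SSM code, and the simple weighted fusion stands in for the attention-guided fusion module.

```python
import torch
import torch.nn as nn

class GMRDSketch(nn.Module):
    """Skeleton of the Figure 2 data flow; all submodules are illustrative stand-ins."""

    def __init__(self, feat_dim=64, num_nodes=32, num_classes=4):
        super().__init__()
        self.backbone = nn.Linear(3 * 16 * 16, feat_dim)        # stand-in for EfficientViM + DefMamba
        self.edge_generator = nn.Linear(feat_dim, num_nodes)    # stand-in for GG-SSM adjacency logits
        self.gnn = nn.Linear(feat_dim, feat_dim)                 # stand-in for one message-passing step
        self.det_head = nn.Linear(feat_dim, num_classes + 4)     # class scores + box offsets
        self.health_head = nn.Linear(feat_dim, 1)                 # per-node health score

    def forward(self, patches, prior_adj, lam=0.5):
        # patches: (N, 3, 16, 16) super-pixel crops; prior_adj: (N, N) OSM adjacency
        x = self.backbone(patches.flatten(1))                     # node features (N, D)
        dyn_adj = torch.softmax(self.edge_generator(x), dim=-1)   # dynamic edges (N, N)
        adj = lam * dyn_adj + (1 - lam) * prior_adj               # simplified fusion with the OSM prior
        x = torch.relu(self.gnn(adj @ x))                          # propagate features over the fused graph
        return self.det_head(x), torch.sigmoid(self.health_head(x))

N = 32
model = GMRDSketch(num_nodes=N)
det, health = model(torch.randn(N, 3, 16, 16), torch.eye(N))
print(det.shape, health.shape)  # torch.Size([32, 8]) torch.Size([32, 1])
```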
Figure 3. Ablation of key graph hyperparameters: (a) λ balances dynamic GG-SSM edges and the static OSM prior, peaking at λ = 0.5; (b) k = 6 offers the best trade-off between accuracy (mAP) and speed (FPS).
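Figure 3a suggests the fused graph interpolates between dynamic and prior edges with weight λ, while Figure 3b sweeps a per-node neighbour budget k. The sketch below assumes a simple convex combination followed by top-k sparsification; the exact attention-guided fusion used in the paper may differ.

```python
import torch

def fuse_adjacency(dyn_adj, osm_adj, lam=0.5, k=6):
    """Blend dynamic (GG-SSM) and static (OSM) adjacency, then keep the k strongest
    neighbours per node. A hypothetical reading of Figure 3's lambda/k hyperparameters."""
    fused = lam * dyn_adj + (1.0 - lam) * osm_adj            # convex combination of edge weights
    topk_vals, topk_idx = fused.topk(k, dim=-1)              # k strongest edges per row
    sparse = torch.zeros_like(fused).scatter_(-1, topk_idx, topk_vals)
    return sparse / sparse.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # row-normalise

N = 10
dyn = torch.rand(N, N)
osm = (torch.rand(N, N) > 0.7).float()
A = fuse_adjacency(dyn, osm, lam=0.5, k=6)
print(A.shape, (A > 0).sum(dim=-1))  # at most 6 non-zero neighbours per node
```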
Figure 4. Loss-weight sensitivity on RDD2022. (a) Accuracy and consistency peak at β = 0.2 , corroborating the choice in Section 4.2. (b) Edge-distillation weight γ is effective in the 0.10–0.15 range, with performance plateauing thereafter.
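The loss weights swept in Figure 4 can be read as coefficients of an additively combined objective. The snippet below encodes that assumed form (detection loss plus β-weighted consistency and γ-weighted edge distillation); the additive structure and the example values are assumptions taken from the figure, not from released code.

```python
import torch

def total_loss(det_loss, cons_loss, edge_distill_loss, beta=0.2, gamma=0.12):
    """Weighted objective assumed from Figure 4: detection loss plus a graph-consistency
    term (weight beta) and an edge-distillation term (weight gamma)."""
    return det_loss + beta * cons_loss + gamma * edge_distill_loss

loss = total_loss(torch.tensor(1.30), torch.tensor(0.40), torch.tensor(0.25))
print(loss)  # 1.30 + 0.2 * 0.40 + 0.12 * 0.25 = 1.41
```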
Figure 5. Qualitative comparison under shadow (top) and water-puddle (bottom) conditions. Red boxes denote detected damage regions. GMRD eliminates shadow-induced false positives and recovers submerged cracks missed by competing methods.
Figure 6. Attention maps for three failure cases. Warmer colors represent higher attention weights.
Table 1. Summary of datasets used for evaluation. RoadBench-100K is our newly curated graph-aware corpus with GPS/IMU metadata and segment-level topology.
Dataset | #Images | Resolution | Annotation Type | Topology Info
RDD2022 [55] | 26,336 | 600 × 600 | Box (4-class) | No
TD-RD [41] | 11,250 | 1024 × 768 | Pixel-wise Mask (5-class) | No
CRACK500 [56] | 500 | 2000 × 1500 | Binary Mask | No
RoadBench-100K (Ours) [57] | 103,418 | 1280 × 720 | Box + Mask + Graph | Yes
Table 2. Comparison with state-of-the-art detectors, grouped into CNN/Transformer-based methods (top) and graph-based and state–space baselines (bottom). Our method achieves the best accuracy–efficiency trade-off. The best results are in bold.
Method | RDD2022 mAP | TD-RD mAP | RoadBench mAP | Params (M) | GFLOPs
CNN/Transformer-based methods
YOLOv8-s | 50.5 | 47.9 | 45.1 | 11.2 | 95
RT-DETR-R50 | 55.1 | 48.4 | 47.0 | 24.0 | 236
Mask2Former-SwinB | 57.3 | 54.8 | 51.1 | 47.1 | 322
Graph-based and state–space baselines
SD-GCN (graph-based) | 58.4 | 57.6 | 52.3 | 19.7 | 143
Mamba-Adaptor-Det (state–space) | 59.2 | 58.1 | 52.8 | 18.3 | 128
EffViM-T1 (no graph) | 57.9 | 55.0 | 51.6 | 4.2 | 34
GMRD (Ours) | 61.8 | 60.9 | 55.4 | 1.8 | 12
Table 3. Edge efficiency (Jetson Orin Nano, INT8) and topology-aware accuracy on RoadGraph-RDD. The best results are in bold. “—” indicates the model could not fit on the device.
Method | FPS | Latency (ms) | Power (W) | G-mAP | Consistency
YOLOv8-s | 26 | 38 | 9.7 | 42.7 | 0.68
Mask2Former-SwinB | — | — | — | 48.3 | 0.71
SD-GCN | 18 | 55 | 10.5 | 50.6 | 0.74
Mamba-Adaptor-Det | 22 | 45 | 9.9 | 51.4 | 0.73
GMRD (ours) | 45 | 22 | 7.2 | 55.9 | 0.79
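The INT8 numbers in Table 3 presuppose an edge-deployment step. A common route, sketched here only as an assumption about tooling (the paper does not specify its export pipeline), is to export the trained model to ONNX and then build an INT8 TensorRT engine on the Jetson with trtexec.

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained GMRD model; a real export would use the full network.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 4, 1)).eval()
dummy = torch.randn(1, 3, 720, 1280)  # RoadBench-style input resolution (assumed)

# Step 1: export to ONNX (opset 17 is broadly supported by recent TensorRT releases).
torch.onnx.export(model, dummy, "gmrd_sketch.onnx", opset_version=17,
                  input_names=["image"], output_names=["logits"])

# Step 2 (on the Jetson, outside Python): build an INT8 engine with TensorRT, e.g.
#   trtexec --onnx=gmrd_sketch.onnx --int8 --saveEngine=gmrd_sketch.engine
# Proper INT8 deployment also needs a calibration dataset; this sketch omits it.
```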
Table 4. Cross-domain generalization performance (mAP@0.5) when trained on RDD-Japan and evaluated on three unseen target domains. Our GMRD achieves consistent improvements over strong baselines. Bold indicates the best; underline indicates the second-best.
Method | RDD-India | CNRDD | CRDDC’22
YOLOv8 | 62.7 | 59.3 | 55.8
RT-DETR | 64.1 | 60.0 | 56.4
MobileSAM | 60.4 | 57.6 | 53.1
Graph-RCNN | 65.8 | 61.3 | 57.9
Deformable-DETR | 66.4 | 62.0 | 58.6
GMRD (Ours) | 69.5 | 64.2 | 61.0
Table 5. Expanded ablation study across three datasets. Each architectural component (DefMamba, GG-SSM, EGTR calibration, graph consistency loss) is removed in turn from the full model to quantify its contribution.
Method Variant | RDD2022 (mAP / G-mAP / Consist. / FPS) | CNRDD (mAP / G-mAP / Consist. / FPS) | CRDDC’22 (mAP / G-mAP / Consist. / FPS)
Baseline (EffViM only) | 57.9 / 45.0 / 0.70 / 48 | 56.5 / 41.7 / 0.68 / 47 | 58.1 / 43.9 / 0.69 / 47
– DefMamba | 60.1 / 54.0 / 0.77 / 48 | 58.6 / 50.3 / 0.74 / 47 | 60.5 / 51.6 / 0.75 / 47
– GG-SSM | 59.4 / 51.3 / 0.74 / 46 | 58.2 / 48.7 / 0.72 / 46 | 60.0 / 49.8 / 0.72 / 46
– EGTR Calibration | 60.3 / 53.6 / 0.75 / 45 | 59.0 / 49.5 / 0.74 / 45 | 61.2 / 51.0 / 0.74 / 45
– Graph Consistency Loss | 60.6 / 54.2 / 0.73 / 45 | 59.3 / 50.1 / 0.72 / 45 | 61.6 / 52.5 / 0.71 / 45
Full GMRD (Ours) | 61.8 / 55.9 / 0.79 / 45 | 60.5 / 52.9 / 0.77 / 45 | 62.4 / 54.1 / 0.78 / 45
For the “– GG-SSM” variant, dynamic GG-SSM edges are replaced by static OpenStreetMap adjacency. All results are averaged over three seeds. FPS excludes pre-processing.
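The “– Graph Consistency Loss” row above ablates the term that encourages agreement between connected segments. One plausible formulation, shown below as a hypothetical sketch rather than the paper’s exact definition, penalises edge-weighted squared differences between per-node predictions.

```python
import torch

def graph_consistency_loss(pred, adj):
    """Penalise disagreement between predictions of connected nodes.

    pred: (N, C) per-node predictions (e.g., damage probabilities)
    adj:  (N, N) non-negative edge weights from the fused graph
    """
    diff = pred.unsqueeze(1) - pred.unsqueeze(0)           # (N, N, C) pairwise differences
    return (adj * diff.pow(2).sum(-1)).sum() / adj.sum().clamp_min(1e-8)

p = torch.rand(6, 4)
A = torch.rand(6, 6)
print(graph_consistency_loss(p, A))  # scalar; zero when all connected nodes agree
```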
Table 6. Hardware efficiency comparison on a Jetson Orin Nano (INT8). FPS is averaged over 2000 frames; latency excludes pre-processing.
Method | Params (M) | GFLOPs | FPS | Latency (ms)
MobileNet-SSD | 4.2 | 2.3 | 43.2 | 23.1
EfficientDet-D0 | 3.9 | 2.5 | 40.6 | 25.4
YOLOv5-Nano | 1.9 | 1.8 | 44.7 | 21.9
EffViM-T1 (no graph) | 4.2 | 3.4 | 48.0 | 20.6
GMRD (ours) | 1.8 | 1.5 | 45.2 | 21.2