Article

FAFMNet: Feature Attention Fusion Multimodal Network of Road Potholes for Mobile Robot

by Jianji Fu 1,†, Hongyi Li 1,†, Qi Liu 2, Gaofeng Zheng 1, Jianhuan Zhang 1, Jin Jiang 3,* and Chentao Zhang 1,*
1 Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University, Xiamen 361102, China
2 School of Aerospace Engineering, Xiamen University, Xiamen 361102, China
3 Xiamen King Long United Automotive Industry Co., Ltd., Xiamen 361023, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Eng 2026, 7(6), 289; https://doi.org/10.3390/eng7060289
Submission received: 17 April 2026 / Revised: 3 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

Abstract

Road potholes pose a considerable threat to mobile robots, which are generally less stable than conventional vehicles and may become trapped or overturned when traversing damaged road surfaces. Accurate semantic segmentation of road potholes is therefore essential for safe and reliable robot navigation. To address this requirement, multimodal fusion methods using RGB (Red, Green, Blue) and disparity images have been developed for pothole detection. Nevertheless, these methods still struggle to detect small potholes and to delineate their boundaries precisely. To overcome these limitations, we propose FAFMNet, a novel multimodal fusion network for road-pothole semantic segmentation. Specifically, we design a feature fusion module that integrates global context and local details to fully exploit the complementary information provided by RGB and disparity images. This design improves multimodal feature interaction and enhances boundary segmentation accuracy. Furthermore, we develop three feature attention fusion modules by incorporating multiple complementary attention mechanisms into the fusion module. These modules improve small-pothole detection by focusing on informative features, emphasizing target regions, and reducing information loss. We evaluate the proposed network on a small-pothole subset of Pothole-600 under identical hardware settings and backbone configurations for all experimental models. On this subset, FAFMNet achieves 90.22% mPre, 92.32% mRec, 98.73% mAcc, 91.26% mF1, and 83.93% mIoU, outperforming the state-of-the-art method by 1.87 percentage points in mF1 and 3.12 percentage points in mIoU. A paired statistical test over three independent runs further confirms that the improvement over the baseline is statistically significant (p < 0.05).

1. Introduction

Obstacles located below the road surface are commonly referred to as negative obstacles and typically appear as potholes or crevices [1]. Representative examples of road potholes are shown in Figure 1. Pothole formation is generally associated with natural weathering and insufficient road maintenance [2]. These surface defects can disrupt traffic flow and may even cause traffic accidents. Therefore, the inspection and repair of road potholes are essential components of road maintenance. Although manual inspection remains widely used for pothole identification, it is labor-intensive and time-consuming [3]. In addition, small potholes are often missed because manual inspection is inevitably affected by human subjectivity. For mobile robots with narrow wheels, road potholes can easily cause immobilization or instability. Accordingly, onboard pothole detection is of critical importance for safe mobile robot navigation [4]. Related mobile robot perception tasks, such as drivable-area detection, have also benefited from attention-based network designs [5].
Pothole detection for mobile robots is commonly based on camera-based [6] or radar-based [7] sensing. Radar is effective for detecting forward obstacles; however, owing to its signal propagation mechanism, it is less suitable for identifying obstacles located below the road surface [8]. Cameras are therefore widely used for road-pothole detection, with RGB and disparity images serving as the primary data sources. Nevertheless, both modalities have inherent limitations. The visual appearance of potholes is often similar to that of the surrounding road surface [9], making accurate segmentation difficult when RGB images are used alone, particularly when boundary cues are weak. A disparity image is a form of depth representation that encodes positional differences between corresponding pixels and provides relative distance information for objects in a scene [10]. Unlike other depth images, disparity images are typically generated from stereo image pairs, where pixel disparities are used to estimate depth. By contrast, depth images obtained from structured-light sensors, time-of-flight sensors, or monocular depth-estimation methods rely on different imaging principles [11]. However, disparity images are sensitive to complex environmental factors, such as standing water inside potholes, and do not contain semantic information [12]. Multimodal fusion methods have been proposed to alleviate these limitations to some extent [13].
Combining RGB and disparity images is an effective strategy for leveraging the complementary strengths of the two modalities and improving road-pothole semantic segmentation [14]. However, existing multimodal fusion algorithms still show limited performance in detecting small potholes, especially in precise boundary segmentation. Recent multimodal segmentation studies indicate that effective cross-modal interaction is essential for exploiting complementary RGB and depth/disparity cues, whereas weak or fixed fusion may restrict feature complementarity, particularly for ambiguous small targets [12,13,15]. Previous studies have also shown that boundary errors are mainly reflected by reduced F1-score and IoU when targets are small, shallow, or visually similar to the background. Therefore, the comparative metrics reported in recent studies provide direct evidence of the remaining limitations in fine edge segmentation [13,15,16]. In addition, classical non-deep-learning pothole detection methods usually rely on geometric assumptions or handcrafted candidate extraction, making robust pixel-level segmentation difficult under complex road appearances and noisy depth conditions [8,17].
To address these issues, we propose a novel multimodal fusion neural network for road-pothole semantic segmentation. Prior multimodal segmentation studies have shown that simple concatenation- or addition-based fusion provides limited adaptive cross-modal interaction compared with attention- or complementarity-aware fusion strategies [12,15,16]. The encoder–decoder paradigm with skip connections has also been widely adopted in segmentation because it combines high-level semantics with low-level spatial details, which is particularly important for preserving thin boundaries and small objects [18,19,20]. To overcome this limitation, we design a feature fusion module that separately extracts global and local features from the input data. These features are then added at the pixel level and multiplied with the input features to strengthen informative responses. This design improves the segmentation accuracy of small potholes and their boundaries. Motivated by spatial, channel, and self-attention mechanisms reported in recent segmentation networks, we further incorporate attention modules to help the network emphasize task-relevant information while suppressing interference. Specifically, three feature attention fusion modules are designed by integrating multiple complementary attention mechanisms into the feature fusion module. These modules are strategically inserted into different network layers to improve the overall representation capability of the model, thereby enhancing road-pothole semantic segmentation accuracy.
The main contributions of this paper are summarized as follows:
  • A novel multimodal fusion network is proposed for road-pothole semantic segmentation by jointly using RGB and disparity images.
  • A feature fusion module is designed to integrate global and local features effectively, thereby enhancing fusion performance and boundary segmentation accuracy. This module reduces information loss during RGB–disparity fusion and better exploits useful cues, particularly in regions where boundary information is easily missed.
  • Three feature attention fusion modules are strategically designed and placed within the network to improve their effectiveness. By incorporating spatial attention, channel attention, and hybrid attention mechanisms, these modules enable the network to exploit complementary spatial and channel information more effectively. As a result, they enhance the network’s focus on small potholes and improve semantic segmentation accuracy.

2. Related Work

2.1. Detection of Road Potholes and Crevices Using RGB Images

Shan et al. [21] proposed a new pavement crack segmentation framework called DCUFormer, which introduces a Dual Cross Attention Module (DCA) and an Upsampling Attention Module (UA) to address the challenges of detecting fine and irregular cracks in complex environments. The DCA enhances feature integration by injecting high-level semantic information into low-level features and refining detailed boundaries, while the UA enables precise upsampling through local cross-attention mechanisms. Fan et al. [22] proposed a single-modal semantic segmentation method for road pothole detection by introducing a Multi-Scale Feature Fusion Module (MSFFM) and a Channel Attention Module (CAM). The MSFFM leverages spatial attention to effectively fuse high-level semantic features with low-level detailed features, reducing the semantic gap across feature layers. Meanwhile, the CAM enhances the consistency of feature maps by re-weighting channel responses based on their relevance. He et al. [23] proposed a dual-stream detection and segmentation framework for vision-based pothole perception on unstructured roads. Their method decouples object-level localization and pixel-level boundary extraction by combining an enhanced YOLOv10+ detection stream with a GAL-DeepLabv3+ segmentation stream that incorporates DenseASPP and graph attention, thereby improving multi-scale feature representation, contextual reasoning, and boundary refinement. Han et al. [24] presented a technique that leverages YOLOv3 for accurately detecting negative obstacles. They incorporated AutoMSRCR to enhance texture features of the negative obstacles in the input image and adjusted the anchor frames of YOLOv3 to improve precision. They also performed k-means++ clustering and linear scale shrinking processing on the anchor boxes of the dataset. This improved the model’s recognition of targets. Zhang et al. [25] designed a multilevel attention mechanism and incorporated it into the target detection algorithm. 
This mechanism was placed between the backbone and the feature fusion modules and improved the accuracy of the network in detecting road damage. To address the problem that existing pothole detection methods have too many parameters and cannot meet the requirements of accuracy and real-time performance, Zhang et al. [26] introduced a lightweight monocular target detection network, AAL-Net. A lightweight feature extraction module and a normalization-based attention module were used to maintain detection accuracy and real-time performance. In the backbone, the authors replaced conventional convolutional layers with GhostConv and designed the LF module to replace the original feature extraction module, which improved the performance of the network.
Although the above methods can detect and segment negative obstacles such as potholes and cracks, their edge segmentation accuracy remains unsatisfactory. In addition, detection performance is susceptible to interference in complex environments because road potholes and the surrounding ground have similar features.

2.2. Detection of Road Potholes and Crevices Using Disparity Images

Dodge et al. [8] proposed a method for detecting negative obstacles based on convex optimization, a class of mathematical optimization problems in which convex objective and cost functions allow efficient, globally optimal numerical solutions. The geometric structure corresponds to the vertical baseline between optical centers, which is used to obtain favorable point feature information of the surrounding terrain for road-pothole detection. Ali et al. [27] proposed a deep learning framework, 3DpredicNet, for pothole segmentation and 3D volume prediction from monocular RGB images. Their method extracts both local and global features using multi-scale convolutions, enhances important spatial and channel information through attention mechanisms, and captures long-range dependencies with criss-cross attention, enabling joint estimation of pothole location and depth from a single image. Lin et al. [28] proposed a pavement distress detection and quantification method combining an RGB-D camera with the YOLACT++ instance segmentation algorithm. The method extracts distress pixel information from RGB images via instance segmentation to obtain mask images, converts these 2D masks to 3D point clouds using the camera's intrinsic matrix and depth data, and then uses the RANSAC algorithm to quantify 3D features. Sun et al. [17] proposed a disparity-image method based on 3D point clouds for road-pothole detection. They extracted 2D candidates from 3D candidates for road features and used a variant of RANSAC to reduce the error caused by point cloud matching. Fan et al. [3] developed a novel pothole detection model using DeepLabv3+ [29], which is capable of handling three modalities of data. They created a graph attention layer built upon graph neural networks and integrated it into DeepLabv3+, which improved the representation of image features.
Disparity images offer a substantial advantage over RGB images for road-pothole detection. However, their robustness is relatively limited because they remain vulnerable to illumination changes and water accumulation in potholes.

2.3. Detection of Road Potholes and Crevices Using Multimodal Data

To fully utilize the advantages of RGB images and disparity images while overcoming their limitations, some multimodal algorithms have been proposed. More recently, Hu and Assaad [30] developed a robotic teleoperation pipeline that integrates multimodal RGB-D sensing fusion, point-cloud processing, and an efficient multi-scale attention-enhanced deep learning model for real-time pavement pothole segmentation, quantification, and localization. Feng et al. [15] proposed a novel RGB-D semantic segmentation network named PotCrackSeg with a Dual Semantic-feature Complementary Fusion (DSCF) module, which distinguishes potholes and cracks as separate classes and alleviates the impact of depth noise by fusing complementary semantic features from RGB and depth modalities. The network extracts features from RGB and depth streams via SegFormer encoders, maps them to semantic feature sets, extracts complementary features through the CompSemFE module, and fuses them to generate segmentation results. Fan et al. [14] designed a fusion network for pothole detection that uses RGB images and disparity images. They added attention mechanisms to the network, thereby improving the segmentation accuracy of potholes. Compared to existing single-modal networks, there has been an improvement in performance. Building upon this, Feng et al. [12] proposed an enhanced fusion network where they replaced the final layer of the network with a transformer module. This led to a further improvement in the semantic segmentation performance of road potholes in the model. They also proposed a dual-encoder dual-decoder RGB-D parallax multimodal network named InconSeg [31] for positive and negative road obstacle segmentation. The network solves the problem of inconsistent information between multimodal data by using a residual guidance fusion module to extract complementary features. 
Rather than fusing RGB and depth features directly, this method extracts from the depth features the information that is complementary to the RGB features, thereby mitigating the performance degradation caused by inconsistency between the two modalities.
Current multimodal fusion networks still exhibit limitations in detecting small road potholes, particularly in accurately segmenting small pothole regions and preserving boundary details [3]. This situation can easily lead to erroneous judgments, which may cause the mobile robot to get trapped and even roll over during its movement. This limitation is consistent with previous findings that fixed fusion operators may provide insufficient cross-modal interaction, whereas attention-guided or complementary fusion can better exploit modality-specific cues for difficult targets [12,15,16]. Moreover, the fusion of data often fails to fully exploit the distinctive properties of different attention mechanisms. Therefore, in this article, we propose a novel network model for semantic segmentation of small road potholes. It is capable of effectively fusing multimodal data and leveraging the power of attention mechanisms to improve detection accuracy.
Compared with representative multimodal segmentation methods, existing approaches mainly rely on simple feature concatenation/addition, semantic complementary fusion, or generic attention-based fusion. Although these strategies improve multimodal interaction, they still show limited ability in segmenting small potholes and preserving fine boundary details. In contrast, FAFMNet introduces a nested global–local fusion mechanism to couple coarse contextual cues with local structural details, and further deploys three stage-specific attention fusion modules to enhance shallow spatial sensitivity, mid-level channel selectivity, and deep cross-dimensional feature integration, respectively. Therefore, the key distinction of FAFMNet lies not merely in using attention or multimodal fusion, but in combining nested global–local fusion with progressively specialized attention fusion to address weak small-pothole segmentation and boundary loss more effectively.

3. Methodology

3.1. Overall Network

We propose a novel multimodal feature fusion network, termed the feature attention fusion multimodal network (FAFMNet), for road-pothole detection. As illustrated in Figure 2, the overall architecture consists of two encoders and one decoder. ResNet-34 [32] is adopted as the encoder backbone. The four-stage encoder depth follows the standard ResNet-34 hierarchy, providing a practical balance between receptive-field expansion and preservation of spatial details for small potholes. The symmetric decoder is therefore used to progressively restore the original resolution without introducing unnecessary additional stages. The initial encoder module consists of convolutional layers, batch normalization layers, and ReLU activation layers. It increases the number of input channels from 3 to 64 while reducing the spatial resolution from 512 × 512 to 256 × 256 . The subsequent three layers are constructed according to the ResNet architecture, with a max-pooling layer preceding the first ResNet layer. The final encoder layer, namely the fourth layer, incorporates a transformer module [33]. The transformer is placed only at this deepest encoder stage because the feature-map resolution has already been sufficiently reduced, making global token interaction computationally feasible while preserving high-level semantic information. In contrast, introducing the transformer into shallow or intermediate stages would substantially increase the computational burden and may weaken the preservation of local pothole-edge cues, which are more effectively modeled by convolutional operations. Table 1 summarizes the channel numbers and spatial dimensions of the input and output tensors at each encoder and decoder layer. To maintain the readability of the overall architecture diagram, these dimensions are summarized in Table 1 rather than repeatedly annotated in Figure 2. 
Accordingly, the dual encoder preserves modality-specific early representations, whereas the decoder restores spatial resolution after progressively refined multimodal fusion. Each major block therefore has a distinct functional role rather than being appended redundantly.
Most existing fusion algorithms perform feature fusion using simple operations, such as addition or concatenation. Although these operations provide a common form of early feature aggregation, they may be insufficient for fusion models involving heterogeneous modalities, particularly when the fused data exhibit substantial distributional and semantic differences. To enhance the fusion capability of the model and improve the integration of RGB and disparity information, we design a dedicated feature fusion module with both global and local feature extraction branches. Because RGB and disparity images have inherently different characteristics, directly extracting complementary features through simple addition or concatenation is challenging. In the proposed design, each input modality is processed independently by two parallel branches. The local branch employs a series of 1 × 1 convolutional layers with batch normalization and ReLU activation to capture fine-grained channel-wise features while preserving spatial resolution. In parallel, the global branch uses global average pooling followed by convolutional transformations to encode spatially aggregated contextual information.
The outputs of the two branches are fused through element-wise addition and are subsequently modulated by the original input features through element-wise multiplication. This selective interaction preserves both global context and local details during fusion. In addition, short skip connections are introduced to retain low-level structural cues and reduce information loss in deep layers. These connections facilitate direct information flow across network stages, maintain continuity between shallow and deep features, and improve the robustness of the semantic representation. The resulting fused features provide the basis for subsequent attention-enhanced processing, ultimately contributing to more accurate segmentation, particularly along pothole boundaries.
Attention mechanisms have become an important component of semantic segmentation models. By emphasizing target-related features and suppressing irrelevant interference, attention mechanisms can substantially improve model performance. Previous studies have shown that different combinations of attention mechanisms can have different effects on detection performance across neural network architectures [14]. Therefore, the careful selection and integration of attention mechanisms are essential for optimizing computer vision models. In this study, spatial attention, channel attention, and hybrid attention mechanisms are integrated into the feature fusion module to construct three feature attention fusion modules [34]. These modules are incorporated at different encoder stages to fuse RGB and disparity features. Specifically, they are distributed from shallow to deep encoder layers to progressively refine spatial structures, strengthen channel-wise responses, and enhance contextual feature interactions. This stage-wise placement substantially improves the performance of the proposed model.

3.2. Feature Fusion Module

To address the inherent heterogeneity between RGB images and disparity images, and to improve the accuracy of small pothole segmentation and edge localization, we design a dual-path feature fusion module, as illustrated in Figure 3. This module serves as the cornerstone of our multimodal integration pipeline and is strategically constructed to extract complementary representations from both data modalities while preserving the low-level spatial structure.
The module takes the RGB feature map $R$ and the disparity feature map $D$ as two separate inputs. Before the features are fed into the feature fusion module, we first introduce a lightweight modality attention mechanism to estimate the relative importance of the two modalities. Specifically, $w_r$ and $w_d$ denote the adaptive weights assigned to the RGB and disparity branches, respectively. Each modality is then processed through a local feature extractor $L(\cdot)$ and a global branch $G(\cdot)$. To avoid notation ambiguity, we denote global average pooling by $P(\cdot)$, whereas $G(\cdot)$ denotes the output of the global branch after applying $P(\cdot)$ followed by a $1 \times 1$ convolution. The local branch consists of two $1 \times 1$ convolution layers with batch normalization and ReLU activation, aiming to capture fine-grained channel-level variations. In contrast, the global branch encodes spatially aggregated semantic cues through the transformation $G(X) = \mathrm{Conv}_{1 \times 1}(P(X))$.
Given the distributional differences between RGB and disparity data, directly combining them may result in modal conflict and feature dilution. Therefore, before entering the feature fusion module, the two input modalities are first reweighted by a modality attention mechanism that computes dynamic fusion weights from pooled modality descriptors:
$$[w_r, w_d] = \mathrm{Softmax}\big(\mathrm{MLP}([P(R), P(D)])\big), \qquad w_r + w_d = 1$$
where $P(R), P(D) \in \mathbb{R}^{C \times 1 \times 1}$ are the global average pooling descriptors of the two modalities.
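As a minimal NumPy sketch of the modality weighting above, the snippet below pools each modality, passes the concatenated descriptors through a small MLP, and applies a softmax. The one-hidden-layer MLP, its width, the ReLU, and the names `modality_weights`, `W1`, and `W2` are illustrative assumptions; the text specifies only pooling, an MLP, and a softmax over the two modalities.

```python
import numpy as np

def global_avg_pool(x):
    """P(.): average a (C, H, W) feature map over H and W, giving shape (C,)."""
    return x.mean(axis=(1, 2))

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def modality_weights(R, D, W1, W2):
    """[w_r, w_d] = Softmax(MLP([P(R), P(D)])).

    W1 (hidden, 2C) and W2 (2, hidden) are hypothetical MLP weights;
    the depth and width of the MLP are assumptions of this sketch.
    """
    desc = np.concatenate([global_avg_pool(R), global_avg_pool(D)])  # (2C,)
    hidden = np.maximum(W1 @ desc, 0.0)   # assumed ReLU hidden layer
    return softmax(W2 @ hidden)           # two non-negative weights summing to 1

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
R = rng.standard_normal((C, H, W))        # stand-in RGB features
D = rng.standard_normal((C, H, W))        # stand-in disparity features
W1 = rng.standard_normal((8, 2 * C))
W2 = rng.standard_normal((2, 8))
w_r, w_d = modality_weights(R, D, W1, W2)
```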
These weights allow the network to selectively emphasize the more informative modality under different scenes (e.g., RGB in high-texture areas and disparity where geometric depth cues dominate). After this preweighting step, the initial fusion output $F$ is computed by combining the local and global features modulated by the learned weights:
$$F = w_r \cdot R \otimes \sigma\big(G(R) \oplus L(R)\big) + w_d \cdot D \otimes \sigma\big(G(D) \oplus L(D)\big)$$
In this formulation, $\oplus$ and $\otimes$ denote element-wise addition and multiplication, respectively, and $\sigma(\cdot)$ denotes the sigmoid activation function. This design enables pixel-wise cross-modal recalibration, which is crucial for preserving boundary integrity and shape consistency in semantic segmentation.
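The initial fusion step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: batch normalization is omitted from the local branch, all weights are random placeholders, and the helper names (`conv1x1`, `gated`, `initial_fusion`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def local_branch(x, w1, w2):
    """L(.): two 1x1 convolutions with ReLU (batch norm omitted in this sketch)."""
    return conv1x1(np.maximum(conv1x1(x, w1), 0.0), w2)

def global_branch(x, wg):
    """G(.) = Conv1x1(P(x)): global average pooling, then a 1x1 convolution."""
    pooled = x.mean(axis=(1, 2), keepdims=True)    # (C, 1, 1)
    return conv1x1(pooled, wg)                      # broadcasts over H and W

def gated(x, params):
    """x * sigmoid(G(x) + L(x)): the input modulated by its own gate."""
    w1, w2, wg = params
    return x * sigmoid(global_branch(x, wg) + local_branch(x, w1, w2))

def initial_fusion(R, D, w_r, w_d, params_r, params_d):
    """F = w_r * gated(R) + w_d * gated(D)."""
    return w_r * gated(R, params_r) + w_d * gated(D, params_d)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
R = rng.standard_normal((C, H, W))
D = rng.standard_normal((C, H, W))
params_r = tuple(0.1 * rng.standard_normal((C, C)) for _ in range(3))
params_d = tuple(0.1 * rng.standard_normal((C, C)) for _ in range(3))
F_init = initial_fusion(R, D, 0.6, 0.4, params_r, params_d)
```

Because the sigmoid gate lies between 0 and 1, each modality can only be attenuated by its own gate, never amplified; the modality weights then decide how the two attenuated maps are mixed.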
To further improve the robustness of multimodal representation learning, we introduce a nested refinement strategy that reuses the fused representation to recalibrate the original RGB and disparity features. The basic fusion module is effective for first-pass aggregation of the two modalities, but it cannot explicitly resolve residual inconsistency that remains after one-shot fusion. The nested fusion stage addresses this problem by feeding the intermediate fused feature back into the gating process, thereby suppressing conflicting responses and reinforcing complementary structures that are still weak after the first fusion, especially around small potholes and ambiguous boundaries.
Specifically, let the initial fused feature obtained from the first-stage fusion be denoted by $F$. In our implementation, the refined gate derived from $F$ modulates RGB and disparity features in a complementary manner: high responses of $\sigma(G(F) \oplus L(F))$ emphasize RGB cues, whereas low responses retain disparity cues. This design is motivated by the fact that RGB is usually more discriminative in textured regions, while disparity is often more reliable in geometry-sensitive regions such as weak pothole depressions. Therefore, the term $1 - \sigma(G(F) \oplus L(F))$ is used to realize residual complementary redistribution under a shared confidence map, and the refined fusion is formulated as follows:
$$F' = R \otimes \sigma\big(G(F) \oplus L(F)\big) + D \otimes \Big(1 - \sigma\big(G(F) \oplus L(F)\big)\Big)$$
This nested refinement strategy enables the network to perform a second-stage adaptive fusion guided by the intermediate fused representation, thereby improving cross-modal consistency and enhancing the discrimination of fine pothole structures and weak boundaries. Moreover, the complementary gating is applied only in the refinement stage rather than at the initial fusion stage, so the network has already aggregated information from both modalities before this redistribution is imposed. In this way, the nested fusion module suppresses redundant responses while preserving cross-modal complementarity, which is particularly helpful for clarifying small pothole structures and ambiguous boundaries. Therefore, the nested fusion module is not a simple repetition of the basic fusion module; it acts as a refinement mechanism that corrects residual ambiguity left by the initial fusion and provides a more stable multimodal representation for the subsequent attention modules. As shown in Figure 4, this structure extends the basic feature fusion process with progressive refinement while retaining short skip connections to preserve low-level information and alleviate degradation during deep fusion.
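The complementary gating used in the refinement stage can be illustrated with a short NumPy sketch. Here `gate_logits` is a random stand-in for the pre-sigmoid response computed from the initial fused feature (the global plus local branch output), and the function name `complementary_refine` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complementary_refine(R, D, gate_logits):
    """Refined feature = R * gate + D * (1 - gate), with gate = sigmoid(gate_logits).

    A high gate value favours the RGB feature at that position and a low
    value favours the disparity feature, so the two share one confidence map.
    """
    gate = sigmoid(gate_logits)
    return R * gate + D * (1.0 - gate)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
R = rng.standard_normal((C, H, W))             # stand-in RGB features
D = rng.standard_normal((C, H, W))             # stand-in disparity features
gate_logits = rng.standard_normal((C, H, W))   # placeholder for the gate input
F_refined = complementary_refine(R, D, gate_logits)
```

Since the two coefficients sum to one at every position, the refined value is a convex combination of the RGB and disparity responses and always lies between them.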
Accordingly, the complete feature fusion process can be abstracted in a recursive form as
$$F^{(1)} = \mathcal{F}_{\mathrm{fusion}}(R, D), \qquad F^{(2)} = \mathcal{F}_{\mathrm{fusion}}\big(F^{(1)}, R, D\big)$$
where $F^{(1)}$ denotes the initial fused representation, $F^{(2)}$ denotes the refined fused representation, and $\mathcal{F}_{\mathrm{fusion}}(\cdot)$ represents the nested fusion operation. For clarity, the basic fusion module produces $F^{(1)}$ from $R$ and $D$, the nested feature fusion (NFF) module refines it into $F^{(2)}$ (equivalently, the refined feature $F'$), and the subsequent attention fusion modules (SDAF, CIF, and MCF) take this refined fused feature as input for stage-specific recalibration.

3.3. Feature Attention Fusion Module

Building on the nested fusion backbone introduced in Section 3.2 (Figure 4), we develop three novel feature attention fusion modules by incorporating spatial attention, channel attention, and hybrid attention mechanisms into the feature fusion process. We use three specialized attention modules instead of one unified attention block because the three modules target different dependencies and are effective at different representation scales: SDAF emphasizes directional spatial continuity for boundary localization, CIF performs efficient channel recalibration to suppress modality-specific noise, and MCF captures higher-order spatial–channel correlations after sufficiently abstract semantics have been formed. A single unified attention module would either be overly complex if it tried to model all these dependencies at every stage, or too coarse to preserve the stage-specific advantages of each mechanism. By integrating these complementary modules, the network achieves sharper boundary delineation and improved detection of small potholes, without increasing the depth of the backbone.

3.3.1. Spatial–Directional Attention Fusion

To precisely locate pothole boundaries and preserve spatial continuity, we aggregate the nested fusion feature $F \in \mathbb{R}^{C \times H \times W}$ along the width and height directions, respectively. In this context, GAP denotes global average pooling, while $\mathrm{GAP}_w$ and $\mathrm{GAP}_h$ represent pooling along the width and height dimensions, respectively. Since pooling along the width dimension preserves the height-wise spatial layout, the resulting descriptor is denoted by $P_h$; likewise, pooling along the height dimension preserves the width-wise spatial layout, yielding $P_w$:
$$P_h = \mathrm{GAP}_w(F) \in \mathbb{R}^{C \times H \times 1}, \qquad P_w = \mathrm{GAP}_h(F) \in \mathbb{R}^{C \times 1 \times W}$$
These one-dimensional descriptors encode directional context and are then refined through lightweight transformations:
A h = σ Conv 1 × 1 ( ReLU ( BN ( P h ) ) ) ,
A w = σ Conv 1 × 1 ( ReLU ( BN ( P w ) ) ) .
Here, the sigmoid function is used to generate bounded attention weights in $[0, 1]$, enabling each branch to act as a soft gate. Unlike ReLU or tanh, sigmoid directly supports multiplicative feature reweighting without producing unbounded amplification or negative attention responses.
The resulting attention masks $A_h$ and $A_w$ modulate $F$ at the pixel level. As illustrated in Figure 5, this operation produces the spatial–directional attentive output:
$$F_{sd} = F \odot A_h \odot A_w,$$
where $\odot$ denotes element-wise multiplication, with $A_h$ broadcast along the width dimension and $A_w$ broadcast along the height dimension.
This design sharpens pothole boundaries by amplifying responses along informative directions, thereby distinguishing pothole regions from the surrounding road surface more effectively. SDAF is therefore deployed in shallow layers, where feature maps still retain dense edge geometry and fine spatial continuity; if moved to deeper stages, its directional advantage becomes less pronounced because repeated downsampling has already removed much of the boundary detail that SDAF is designed to exploit.
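The directional pooling and gating described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the two 1×1 convolutions are replaced by hypothetical C×C weight matrices (`weight_h`, `weight_w`) without bias, batch normalization is treated as the identity, and the feature map is a small nested list rather than a tensor.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sdaf(feature, weight_h, weight_w):
    """Gate a C x H x W feature (nested lists) with two directional masks.

    weight_h and weight_w are hypothetical C x C matrices standing in for
    the two 1x1 convolutions; BN is treated as the identity (assumption).
    """
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])

    # P_h: average over the width axis, one value per (channel, row).
    p_h = [[sum(feature[c][y]) / W for y in range(H)] for c in range(C)]
    # P_w: average over the height axis, one value per (channel, column).
    p_w = [[sum(feature[c][y][x] for y in range(H)) / H for x in range(W)]
           for c in range(C)]

    def gate(pooled, weight):
        length = len(pooled[0])
        relu = [[max(0.0, v) for v in row] for row in pooled]
        # A 1x1 convolution mixes channels independently at every position.
        return [[sigmoid(sum(weight[o][c] * relu[c][i] for c in range(C)))
                 for i in range(length)] for o in range(C)]

    a_h = gate(p_h, weight_h)  # C x H
    a_w = gate(p_w, weight_w)  # C x W
    # Broadcast: a_h varies along rows, a_w varies along columns.
    fused = [[[feature[c][y][x] * a_h[c][y] * a_w[c][x] for x in range(W)]
              for y in range(H)] for c in range(C)]
    return fused, a_h, a_w


# Toy 2 x 2 x 3 feature map with identity channel mixing.
feature = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
           [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, a_h, a_w = sdaf(feature, identity, identity)
```

Because both masks lie strictly between 0 and 1, every output value is a damped copy of the input; the row mask and the column mask together decide how strongly each pixel is kept.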

3.3.2. Channel-Interdependence Fusion

Accurate pothole segmentation requires selective emphasis on informative feature channels. We therefore apply global average pooling to $F$ to obtain a compact channel descriptor $\bar{F} = \mathrm{GAP}(F) \in \mathbb{R}^{C \times 1 \times 1}$. A lightweight one-dimensional convolution is then applied to this pooled descriptor to capture local cross-channel dependencies:
$$w_c = \sigma\big(\mathrm{Conv1D}_k(\bar{F})\big) \in \mathbb{R}^{C \times 1 \times 1}.$$
The vector $w_c$ represents the channel attention weights generated from the pooled descriptor $\bar{F}$, where each element quantifies the importance of the corresponding feature channel. The same sigmoid gating is adopted here to keep channel weights normalized and positive, which is consistent with the intended role of channel-wise feature selection.
The reweighted feature map is then obtained as
$$F_{ci} = F \odot w_c,$$
where $\odot$ denotes element-wise multiplication, with $w_c$ broadcast over the spatial dimensions.
We adaptively reweight channels according to their relevance for pothole detection. This operation suppresses background noise and enhances small, low-contrast features that are critical for accurate segmentation. CIF is placed in middle layers because these stages already encode more stable semantic channel responses than shallow layers while still preserving moderate spatial resolution. If used too early, channel competition is dominated by local textures; if used too late, part of the discriminative mid-level information has already been compressed. As illustrated in Figure 6, the module consists of a global average pooling branch, a channel-wise one-dimensional convolution, a sigmoid activation, and feature reweighting.
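As a rough illustration of the pooling, one-dimensional convolution, sigmoid, and reweighting steps listed above, the following pure-Python sketch computes channel weights for a toy feature map. The kernel values, the odd kernel size, and the zero padding at the channel boundaries are assumptions made for the example; the paper does not specify them here.

```python
import math


def cif(feature, kernel):
    """Channel reweighting: GAP -> 1-D convolution over channels -> sigmoid.

    `kernel` is a hypothetical list of odd length; zero padding keeps the
    number of channel weights equal to the number of channels.
    """
    C = len(feature)
    k = len(kernel)
    pad = k // 2
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature]
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(C):
        # Each weight depends only on a local neighbourhood of channels.
        response = sum(kernel[t] * padded[c + t] for t in range(k))
        weights.append(1.0 / (1.0 + math.exp(-response)))
    reweighted = [[[v * weights[c] for v in row] for row in feature[c]]
                  for c in range(C)]
    return reweighted, weights


# Four channels of size 1 x 2 whose pooled values are 1, 2, 3, 4.
feature = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]], [[4.0, 4.0]]]
_, w_center = cif(feature, [0.0, 1.0, 0.0])          # no cross-channel mixing
reweighted, w_local = cif(feature, [1.0, 1.0, 1.0])  # neighbours contribute
```

With the centre-only kernel each weight depends on its own channel alone, whereas the three-tap kernel lets neighbouring channels raise or lower a weight, which is the local cross-channel interaction the text refers to.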

3.3.3. Multidimensional Correlation Fusion

To exploit complementary spatial and channel dependencies, the refined fused feature $F \in \mathbb{R}^{C \times H \times W}$ is processed by parallel spatial and channel attention branches, as illustrated in Figure 7. Following the dot-product affinity used in self-attention and non-local neural networks [35], we measure correlation by the inner product between two embedded feature vectors, followed by softmax normalization to obtain a probabilistic attention distribution. Let $N = H \times W$. In the spatial branch, two convolutional embeddings are first generated from $F$ and reshaped to matrices $F_{s2}, F_{s3} \in \mathbb{R}^{C \times N}$. Here, $F_{s2}^{i} \in \mathbb{R}^{C}$ and $F_{s3}^{j} \in \mathbb{R}^{C}$ denote the $i$-th and $j$-th column vectors of the two embeddings, corresponding to two spatial locations. The spatial attention map $A_s \in \mathbb{R}^{N \times N}$ is then computed as:
$$(A_s)_{ij} = \frac{\exp\big((F_{s2}^{i})^{\top} F_{s3}^{j}\big)}{\sum_{k=1}^{N} \exp\big((F_{s2}^{k})^{\top} F_{s3}^{j}\big)}.$$
In the channel branch, we reshape $F$ into $\tilde{F} \in \mathbb{R}^{C \times N}$, where $\tilde{F}_i \in \mathbb{R}^{N}$ and $\tilde{F}_j \in \mathbb{R}^{N}$ denote the $i$-th and $j$-th row vectors, respectively, corresponding to the responses of two channels over all spatial locations. The channel attention map $A_c \in \mathbb{R}^{C \times C}$ is then defined by:
$$(A_c)_{ij} = \frac{\exp\big(\tilde{F}_i^{\top} \tilde{F}_j\big)}{\sum_{k=1}^{C} \exp\big(\tilde{F}_k^{\top} \tilde{F}_j\big)}.$$
Let the outputs of the spatial and channel branches be denoted by F ^ s and F ^ c , respectively. The final multidimensional correlation feature is then obtained by
F mc = F ^ s F ^ c .
This design enables the module to jointly model long-range spatial interactions and inter-channel dependencies, thereby improving the representation of pothole regions in complex scenes. MCF is allocated to deep layers because the reduced feature-map size makes self-attention computationally tractable, and the features at this stage already contain sufficiently strong semantics for long-range correlation modeling to be informative. Using MCF in shallow layers would substantially increase cost while offering weaker benefit because the features are still dominated by local appearance variations.
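The two attention maps in this subsection share one pattern: inner products between feature vectors, followed by a softmax over the first index. The sketch below shows that pattern in pure Python on a toy matrix. It covers only the attention maps, not the branch outputs or their fusion, and it assumes identity embeddings in the spatial branch (the two convolutional embeddings are replaced by the reshaped feature itself).

```python
import math


def affinity_softmax(first, second):
    """Return A with A[i][j] = exp(<first_i, second_j>) / sum_k exp(<first_k, second_j>)."""
    m = len(first)
    dots = [[sum(a * b for a, b in zip(first[i], second[j])) for j in range(m)]
            for i in range(m)]
    attention = [[0.0] * m for _ in range(m)]
    for j in range(m):
        column = [dots[i][j] for i in range(m)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for i in range(m):
            attention[i][j] = exps[i] / total
    return attention


# Toy feature reshaped to C x N with C = 2 channels and N = 3 locations.
F = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Spatial branch: one C-dimensional vector per location (columns of F).
pixels = [list(col) for col in zip(*F)]
A_s = affinity_softmax(pixels, pixels)  # N x N

# Channel branch: one N-dimensional vector per channel (rows of F).
A_c = affinity_softmax(F, F)            # C x C
```

The spatial map has N × N entries while the channel map has only C × C, which is why the spatial branch dominates the cost on large feature maps.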

3.3.4. Module Naming and Placement

The three attention fusion blocks are denoted as Spatial–Directional Attention Fusion (SDAF), Channel-Interdependence Fusion (CIF), and Multidimensional Correlation Fusion (MCF), respectively. Each module is designed to exploit specific characteristics of the feature maps, including spatial continuity, channel dependency, and high-level cross-dimensional correlations, thereby improving both small-object detection and boundary precision. As illustrated in Figure 2, SDAF is placed immediately after the initial convolutional module and the first encoder layer. At these shallow stages, the feature maps retain relatively high spatial resolution and contain rich boundary information, which enables SDAF to maximize its spatial sensitivity. CIF is inserted after the second encoder layer, where the number of channels increases and local channel interactions become more important for discriminative feature recalibration. Finally, MCF is applied in the deeper encoder layers, where its self-attention capability can effectively refine global contextual relationships and local correlations among high-level features. The ablation studies presented in Section 4 further validate the effectiveness of this stage-wise placement strategy.
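The stage-wise placement can be summarized as a small configuration, together with a back-of-the-envelope count of spatial self-attention entries per stage. The stage names and the downsampling factors (2, 4, 8, 16, 32, as in a standard ResNet-34 applied to a 512 × 512 input) are assumptions for illustration; the encoder described in the paper may differ in these details.

```python
INPUT_SIZE = 512  # input resolution after resizing

# (stage name, assumed downsampling factor, attention fusion module)
STAGES = [
    ("initial", 2, "SDAF"),
    ("layer1", 4, "SDAF"),
    ("layer2", 8, "CIF"),
    ("layer3", 16, "MCF"),
    ("layer4", 32, "MCF"),
]


def attention_entries(stride, input_size=INPUT_SIZE):
    """Number of entries in an N x N spatial attention map, N = H * W."""
    side = input_size // stride
    n = side * side
    return n * n


for name, stride, module in STAGES:
    side = INPUT_SIZE // stride
    print(f"{name:8s} {module:5s} {side:3d}x{side:<3d} "
          f"N^2 = {attention_entries(stride):,}")
```

Under these assumptions the spatial attention map would have about one million entries at the third stage but more than four billion at the initial stage, which is consistent with restricting MCF to the deep layers.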

4. Experiments

4.1. Dataset and Training Details

We use a small-pothole subset of Pothole-600 [14] for our experiments. To focus on small potholes, we remove images whose annotated pothole area exceeds 20% of the image area. This threshold is chosen to exclude samples dominated by large pothole regions, so that the subset better reflects the target task of small-pothole segmentation while still preserving sufficient training data. In total, 17 images are removed from the original 600-image dataset, including 9 from the 240 training images, 0 from the validation set, and 8 from the test set, resulting in 231 training pairs, 180 validation pairs, and 172 test pairs. All compared methods are trained and evaluated on the same subset to ensure fairness. The original split identities are strictly preserved: no image is moved across training, validation, and test sets, and no augmented sample is generated from validation or test images. Therefore, the subset construction only changes the target evaluation scope toward small potholes and does not introduce sample-level data leakage. For the training set only, we perform data augmentation through random rotation within $[-5^\circ, 5^\circ]$, random cropping, symmetry transformation about the image center, and random scaling with a scale factor sampled from $[0.8, 1.2]$. These operations generate four additional augmented versions for each original training pair. Therefore, the augmented training set contains a total of 1155 pairs of RGB images and disparity images.
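The subset bookkeeping above can be checked with a few lines of Python. The sketch also includes a hypothetical helper, `is_small_pothole`, that applies the 20% area criterion to a binary mask given as nested lists; the function name and the mask representation are illustrative assumptions, not part of the released dataset tools.

```python
retained = {"train": 231, "val": 180, "test": 172}
removed = {"train": 9, "val": 0, "test": 8}
AUGMENTED_COPIES = 4  # additional augmented versions per training pair

# Original split sizes follow from the retained and removed counts.
original = {split: retained[split] + removed[split] for split in retained}
augmented_train = retained["train"] * (1 + AUGMENTED_COPIES)


def is_small_pothole(mask, max_ratio=0.20):
    """Keep an image when pothole pixels cover at most max_ratio of it."""
    total = sum(len(row) for row in mask)
    pothole = sum(1 for row in mask for value in row if value)
    return pothole / total <= max_ratio


print(original, sum(original.values()), augmented_train)
```

The counts are internally consistent: the three original splits sum to 600 images, 17 are removed, and 231 training pairs with four augmented copies each give 1155 pairs.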
In our experiments, the proposed FAFMNet and all comparison models are trained on the same workstation equipped with an NVIDIA RTX 3080 GPU. The implementation is based on PyTorch 1.7.0 and CUDA 11.8. All models adopt an ImageNet-pretrained ResNet-34 backbone and are trained for 100 epochs with a batch size of 8. Stochastic gradient descent (SGD) is used for optimization, with an initial learning rate of 0.08, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$. The learning rate is updated using an exponential decay schedule with a decay factor of 0.95 per epoch. Before being fed into the network, the original $400 \times 400$ RGB and disparity images are resized to $512 \times 512$; bilinear interpolation is applied to the images, whereas nearest-neighbor interpolation is used for the segmentation masks. Cross-entropy loss is adopted as the training objective. Each model is trained in three independent runs using random seeds 2024, 2025, and 2026, and the mean performance is reported. To ensure a fair comparison, RGB-T/RGB-D baselines are adapted by replacing the thermal/depth branch with the disparity branch, while general segmentation models follow the RGB or disparity input settings reported in Table 2. All compared methods are retrained on the same modified dataset using identical data augmentation, optimizer, learning-rate schedule, number of training epochs, and backbone. None of the baseline results are directly cited from the literature; all compared algorithms are re-implemented or adapted from their official implementations and retrained on the same small-pothole subset. Rather than adopting the original training settings reported by different papers, we use a common training protocol and a unified hyperparameter search space for all methods to ensure a fair and controlled comparison.
Specifically, all methods share the same ImageNet-pretrained ResNet-34 backbone, optimizer (SGD with momentum 0.9), batch size (8), number of epochs (100), input resolution, data augmentation strategy, and loss function. For each compared method, the training hyperparameters, including the initial learning rate, learning-rate schedule, and weight decay, are independently selected using the validation set from the same predefined search space. The final configuration yielding the highest validation mIoU is adopted for testing. This design maintains a common backbone capacity and training budget while allowing each model to operate under its most suitable optimization setting. Consequently, the measured differences primarily reflect the effectiveness of each method’s fusion, attention, and decoder design rather than arbitrary choices of training hyperparameters. Because all models are trained on a common convolutional (ResNet-34) backbone, a single SGD-based configuration is appropriate across methods. The only model-specific settings are the architecture-defining hyperparameters of each method’s modules (e.g., the number of attention heads, embedding dimensions, decoder channel widths, dilation rates, and fusion-block structure), which are kept exactly as specified in the corresponding original papers, since altering them would change the identity of the published model rather than optimize its training. Table 2 reports, for every compared model, this shared training configuration together with the source of its architecture-defining modules. Each model is trained in three independent runs (random seeds 2024, 2025, and 2026), the reported metrics are the run averages, and the statistical significance of the comparison is assessed by a paired test reported in the comparative study (Section 4.3). Because no per-model configuration search is performed, the comparison does not rely on, and therefore cannot overfit, a separate validation split for any method.
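For reference, the exponential learning-rate schedule stated in the training details (initial rate 0.08, decay factor 0.95 per epoch, 100 epochs) can be written out as a short sketch. It assumes the decay is applied once after every epoch, so that epoch 0 uses the initial rate; in PyTorch this behavior corresponds to `torch.optim.lr_scheduler.ExponentialLR` with `gamma=0.95`.

```python
def learning_rate(epoch, base_lr=0.08, gamma=0.95):
    """Learning rate used during the given (0-indexed) epoch."""
    return base_lr * gamma ** epoch


schedule = [learning_rate(epoch) for epoch in range(100)]
print(f"first epoch: {schedule[0]:.4f}, last epoch: {schedule[-1]:.6f}")
```

Under this assumption the rate falls by more than two orders of magnitude over the 100 epochs, ending below 0.001.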
This protocol ensures that each algorithm is optimized under the same validation criterion before test-set comparison. To quantify road-pothole semantic segmentation performance, we evaluate F1-score (F1), precision (Pre), recall (Rec), intersection over union (IoU), and accuracy (Acc). Their mean values across the test set are denoted as mF1, mPre, mRec, mIoU, and mAcc, respectively, and are calculated as follows:
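$$\mathrm{mF1} = \frac{2 \times \mathrm{mRec} \times \mathrm{mPre}}{\mathrm{mRec} + \mathrm{mPre}}$$
$$\mathrm{mPre} = \frac{1}{n} \sum \left( \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \right)$$
$$\mathrm{mRec} = \frac{1}{n} \sum \left( \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \right)$$
$$\mathrm{mIoU} = \frac{1}{n} \sum \left( \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives} + \text{False Negatives}} \right)$$
$$\mathrm{mAcc} = \frac{1}{n} \sum \left( \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}} \right)$$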
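A minimal sketch of these metrics, assuming per-image pixel counts are available as `(tp, fp, fn, tn)` tuples and that every denominator is non-zero. The two toy tuples are invented for illustration and are not taken from the dataset. The last lines check that the mPre and mRec values reported in the abstract are consistent with the reported mF1.

```python
def image_metrics(tp, fp, fn, tn):
    """Precision, recall, IoU and accuracy for one image."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, iou, accuracy


def f1_from(m_pre, m_rec):
    """mF1 as the harmonic mean of mPre and mRec."""
    return 2 * m_rec * m_pre / (m_rec + m_pre)


def mean_metrics(counts):
    n = len(counts)
    per_image = [image_metrics(*c) for c in counts]
    m_pre, m_rec, m_iou, m_acc = (sum(m[i] for m in per_image) / n
                                  for i in range(4))
    return {"mPre": m_pre, "mRec": m_rec, "mIoU": m_iou, "mAcc": m_acc,
            "mF1": f1_from(m_pre, m_rec)}


# Hypothetical pixel counts for two images: (tp, fp, fn, tn).
toy = mean_metrics([(8, 2, 0, 90), (6, 0, 2, 92)])

# Consistency check with the values reported in the abstract.
reported_f1 = 100 * f1_from(0.9022, 0.9232)
print(f"{reported_f1:.2f}")
```

Note that mF1 is computed from the averaged precision and recall, not by averaging per-image F1 scores, which matches the first formula above.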
For the experimental design, two ablation studies are first conducted to systematically analyze the contributions of the key components. The first ablation study quantifies the influence of the three proposed feature attention fusion modules on overall model performance, thereby evaluating their individual and combined effectiveness. The second ablation study investigates how the placement of these modules within the network architecture affects detection performance, with the aim of identifying the optimal integration strategy. Finally, a comprehensive comparative experiment is conducted, in which the proposed FAFMNet is benchmarked against ten state-of-the-art single-modal and multimodal network models on the same small-pothole subset to evaluate its detection accuracy and computational efficiency.

4.2. Ablation Study

Ablation experiments are essential for identifying the key components that contribute to model performance. By selectively adding or removing specific modules, the contribution of each component to the overall system can be evaluated, providing useful guidance for designing more efficient architectures. Furthermore, conducting ablation experiments under identical experimental conditions ensures unbiased comparisons among different model variants and improves the reproducibility and reliability of the results. Therefore, all ablation experiments in this study are performed under the same experimental environment.

4.2.1. Ablation of Feature Attention Fusion Modules

In this ablation study, MAFNet [12] is used as the baseline, and the three proposed attention fusion modules, namely SDAF, CIF, and MCF, are incrementally introduced into the network to quantify their individual contributions. Specifically,
  • SDAF: inserted immediately after the initial convolutional module and the first encoder layer to exploit its spatial–directional sensitivity at high resolution.
  • CIF: placed after the second encoder layer, where the number of feature-map channels is sufficiently large to benefit from efficient channel reweighting.
  • MCF: applied to deeper network stages, namely the third and fourth encoder layers, to exploit its self-attention capability for global and local correlation refinement.
In each experiment, only one type of attention fusion module is added at its designated positions, while all other components are kept identical to those of the baseline. This design enables the effects of spatial–directional enhancement, channel-interdependence modeling, and multidimensional correlation refinement on overall segmentation performance to be isolated and compared.

4.2.2. Ablation of the Placement Locations of Three Designed Feature Attention Fusion Modules

In this ablation study, we analyze the effectiveness of our designed SDAF, CIF, and MCF modules and explicitly verify the rationale behind their placement. Because MCF contains self-attention, applying it to large feature maps would markedly increase computational cost; therefore, it is evaluated only in the second, third, and fourth encoder layers, and the fourth layer consistently retains MCF as the deepest global-correlation refinement stage. SDAF and CIF are evaluated from the initial layer to the third encoder layer because both are lightweight, yet they target different representation properties: SDAF is expected to benefit low-level layers with rich spatial detail, whereas CIF is expected to benefit intermediate layers where channel semantics are more stable. We therefore enumerate the feasible placement combinations of SDAF and CIF under one, two, and three MCF settings. In this experiment, all three modules are tested without omission so that the effect of placement can be distinguished from the effect of module removal.

4.2.3. Evaluation of Results

As shown in Table 3, the sequential incorporation of the CIF, SDAF, and MCF modules improves mAcc by 0.25%, 0.30%, and 0.35%, respectively. Corresponding gains of 1.59%, 1.72%, and 1.84% are observed for mF1, while mIoU increases by 2.61%, 2.82%, and 3.02%, respectively. These improvements can be attributed to the proposed two-stage fusion strategy. The feature fusion module first integrates global context and local details into a robust multimodal representation. Based on this representation, the three attention fusion modules further enhance discriminative feature learning: CIF adaptively reweights channel responses to suppress background noise and emphasize pothole-specific activations; SDAF introduces spatial–directional attention to strengthen boundary cues; and MCF exploits dual self-attention to capture long-range dependencies and cross-dimensional correlations, which is particularly beneficial for small-pothole segmentation. Together, these modules produce sharper boundaries and more accurate small-object segmentation without increasing the network depth. Although each module introduces a modest increase in the number of parameters and FLOPs, the resulting accuracy improvements justify this additional computational cost.
As shown in Table 4, the ablation results support the proposed stage-wise design rather than an arbitrary assembly of modules. Although MCF includes self-attention, using more MCF blocks does not necessarily improve performance: Group B, which uses three MCF modules, is 0.3% lower in mF1 and 0.5% lower in mIoU than Group I, which uses only one MCF module. This indicates that global correlation modeling is most beneficial when applied at selected deep stages rather than repeated indiscriminately. By comparing groups with the same number of MCF layers, we further observe that SDAF performs better in lower encoder stages than in higher ones, which is consistent with its role in preserving directional boundary cues on large feature maps. CIF achieves its most stable contribution in intermediate stages, where channel responses become more semantically meaningful and thus more suitable for adaptive reweighting. However, excessive repetition of lightweight modules is also unnecessary: for example, Group F uses SDAF one more time than Group G, yet its mF1 and mIoU decrease by 0.22% and 0.36%, respectively. These observations collectively justify the final design choice of placing SDAF in shallow layers, CIF in middle layers, and MCF in deep layers.
As the network becomes deeper, the spatial resolution of the feature maps gradually decreases, which reduces the contribution of SDAF to performance improvement. In contrast, the number of feature-map channels continues to increase, making it important to fully exploit the richer channel information in deeper representations. The CIF module enhances channel attention through a non-dimensionality-reduction design and a local cross-channel interaction strategy. By capturing information from each channel, CIF strengthens the model’s attention to target regions and is particularly effective for feature maps with a large number of channels. Overall, the results verify that the best performance is achieved when SDAF is placed after the initial module and the first network layer, CIF is placed after the second network layer, and MCF is placed in the last two network layers. Under this configuration, mF1 and mIoU reach 91.26% and 83.93%, respectively.
To quantitatively verify the improvement in boundary segmentation, we further compare FAFMNet with the baseline MAFNet using boundary-specific metrics. Boundary F1-score (BF1) and Trimap IoU are computed within a 3-pixel band around the ground-truth boundary, while Hausdorff distance measures the maximum contour deviation in pixels. As shown in Table 5, FAFMNet achieves higher BF1 and Trimap IoU and a lower Hausdorff distance, demonstrating more accurate boundary localization.
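Two of the boundary measures can be sketched on contours given as lists of pixel coordinates. The Hausdorff distance below is the standard symmetric definition. The boundary F1 uses one common formulation (a boundary point counts as matched if it lies within the tolerance of the other contour), which is an assumption, since the exact matching rule is not spelled out here. The contours are toy examples.

```python
import math


def directed_hausdorff(source, target):
    """Largest distance from a source point to its nearest target point."""
    return max(min(math.dist(p, q) for q in target) for p in source)


def hausdorff(a, b):
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def boundary_f1(pred, gt, tolerance=3.0):
    """F1 over boundary points matched within `tolerance` pixels."""
    def matched_fraction(source, target):
        hits = sum(1 for p in source
                   if min(math.dist(p, q) for q in target) <= tolerance)
        return hits / len(source)

    precision = matched_fraction(pred, gt)
    recall = matched_fraction(gt, pred)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy contours: the prediction has one outlier 4 pixels from the ground truth.
gt = [(0, 0), (0, 1), (1, 1), (1, 0)]
pred = [(0, 0), (0, 1), (1, 1), (5, 0)]
hd = hausdorff(pred, gt)
bf1 = boundary_f1(pred, gt)
```

A single outlier fully determines the Hausdorff distance, whereas it lowers the boundary F1 only in proportion to its share of the contour, so the two measures capture worst-case and average boundary quality, respectively. For real masks, `scipy.spatial.distance.directed_hausdorff` offers a faster implementation of the directed distance.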

4.3. Comparative Study

The proposed FAFMNet is compared with ADFormer [19], EDAFormer [20], FDNet [36], SegNeXt [18], EAEFNet [16], CAINet [37], MFNet [38], SGFNet [39], DCANet [40], and PotCrackSeg [15]. The compared networks are categorized into two groups: single-modal and multimodal models. For the single-modal models, RGB and disparity images are used as independent inputs. For the multimodal models, RGB and disparity images are used jointly as paired inputs. To ensure fair comparison, all network models adopt ResNet-34 as the backbone.

4.3.1. Qualitative Demonstrations

Representative qualitative results are presented in Figure 8. Compared with the other algorithms, FAFMNet shows a clear improvement in boundary accuracy, as illustrated by the fourth and fifth columns, where the predicted segmentation regions closely match the ground truth. For the relatively small potholes shown in the seventh and eighth columns, accurate segmentation and boundary preservation are particularly challenging. Nevertheless, FAFMNet maintains high boundary fidelity and produces more complete masks. In addition, the proposed network performs well in segmenting elongated potholes, outperforming other multimodal fusion networks, as shown in the second and sixth columns. The test images cover diverse road conditions, including dark surfaces, light surfaces, and water-covered potholes. As observed from Figure 8, FAFMNet achieves accurate and consistent segmentation results under these different conditions, further demonstrating its robustness. To further clarify the boundary quality of small potholes, Figure 9 presents zoomed local examples comparing FAFMNet with PotCrackSeg, the strongest competing method in the quantitative comparison. The enlarged regions indicate that PotCrackSeg may produce slight boundary shrinkage, discontinuous edges, or local over-segmentation around small potholes. By contrast, FAFMNet generates masks that more closely follow the ground-truth contours. This visual evidence further supports the effectiveness of the proposed feature attention fusion modules in preserving fine boundary details for small-pothole segmentation.

4.3.2. Evaluation of Results

The quantitative comparison results are reported in Table 6. For the single-modal networks, disparity-based input consistently yields better performance than RGB-based input, indicating that geometric cues are more informative than appearance cues for this small-pothole segmentation task. Compared with the other multimodal fusion networks, FAFMNet improves mF1 by 1.87–3.94% and mIoU by 3.12–6.43%, achieving the best overall performance among the evaluated methods. To verify that the improvement is not caused by random initialization, a paired two-sided t-test is further conducted on the mF1 and mIoU values obtained from three independent runs. FAFMNet shows statistically significant improvements over the strongest baseline ( p < 0.05 for both mF1 and mIoU). The qualitative results under dark-road, light-road, and water-covered pothole conditions are also examined, and the model consistently preserves more complete pothole boundaries, indicating stable robustness across representative environmental variations. These results indicate that the proposed three feature attention fusion modules, together with the attention mechanisms incorporated into the encoder initialization module, substantially improve the segmentation accuracy for small road potholes. Table 7 further shows that FAFMNet maintains highly competitive computational efficiency. Specifically, it requires 65.36 M parameters and 113.18 G FLOPs, which is comparable to or slightly lower than several baseline methods, such as EAEFNet and SGFNet. This confirms that the proposed feature attention fusion modules do not introduce excessive computational cost. Overall, FAFMNet achieves a favorable balance between segmentation accuracy and model complexity, demonstrating its practicality for real-world applications with limited computational resources.
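The paired two-sided t-test over three runs can be sketched without external libraries. The per-seed scores below are hypothetical placeholders, not the values measured in this study. With three paired runs there are two degrees of freedom, so the statistic must exceed the two-sided 5% critical value of about 4.303 in magnitude.

```python
import math


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Hypothetical mIoU per seed (2024, 2025, 2026); NOT results from the paper.
proposed = [0.83, 0.85, 0.84]
baseline = [0.80, 0.81, 0.805]

T_CRITICAL_DF2 = 4.303  # two-sided, alpha = 0.05, 2 degrees of freedom
t_stat, dof = paired_t_statistic(proposed, baseline)
significant = abs(t_stat) > T_CRITICAL_DF2
print(f"t = {t_stat:.3f}, dof = {dof}, significant = {significant}")
```

Because the critical value is so large for two degrees of freedom, significance with three runs requires the per-seed differences to be both positive and nearly constant. `scipy.stats.ttest_rel` computes the same statistic together with an exact p-value.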

4.3.3. Failure Cases and Limitations

Although FAFMNet achieves the best overall performance, several limitations remain. First, water-filled potholes may produce unreliable disparity responses because reflection weakens depth consistency, leading to incomplete masks. Second, very small potholes with weak disparity contrast can still be partially missed, especially when their boundaries are shallow. Third, stains, cracks, or road markings with pothole-like appearance may occasionally cause false positives. These limitations indicate that future work should further improve robustness to noisy disparity, weak boundary cues, and confusing road-surface patterns. In addition, to keep the comparison controlled and reproducible, all methods are evaluated under one shared ResNet-34 backbone and a single fixed training protocol on the official Pothole-600 split; individually re-optimizing the optimizer and hyperparameters of each baseline could shift its absolute scores and is left to future work.

5. Conclusions

This paper presents FAFMNet, a dual-encoder multimodal fusion network that integrates RGB and disparity inputs for accurate semantic segmentation of road potholes in mobile robot applications. The core of FAFMNet is a two-stage feature fusion module that jointly captures global context and local details, while short skip connections are introduced to preserve low-level structural information. On this basis, three attention fusion blocks are strategically embedded in the encoder to refine pothole boundaries, recalibrate informative channels, and capture long-range feature dependencies. Extensive ablation studies verified the individual and combined effectiveness of these modules, and comparative experiments on the small-pothole subset of Pothole-600 demonstrated that FAFMNet outperforms ten state-of-the-art multimodal and single-modal segmentation methods in F1-score, mIoU, and edge accuracy.
In addition to its segmentation accuracy, FAFMNet shows strong generalization potential. Owing to its dual-encoder design, the network can incorporate other sensor modalities, such as depth or thermal images, by adapting the corresponding input branch without modifying the core fusion and attention modules. This modular design makes FAFMNet suitable for a broad range of robotic perception tasks under challenging illumination and environmental conditions.
Future work will focus on two main directions. First, more robust noise-resistant fusion strategies, such as confidence-aware attention and uncertainty modeling, could be explored to improve performance under adverse sensing conditions. Second, enhanced inter-layer connections, including cross-stage links and deep supervision, may further reduce feature loss and improve boundary localization. In addition, lightweight attention design and real-time optimization will be important for deployment on resource-constrained robotic platforms. These improvements are expected to further enhance segmentation accuracy and robustness, contributing to safer and more reliable autonomous navigation in complex road environments.

Author Contributions

Conceptualization, J.F., G.Z. and J.Z.; Methodology, J.F. and H.L.; Software, H.L.; Validation, H.L.; Formal analysis, Q.L.; Investigation, H.L.; Resources, J.J. and C.Z.; Data curation, H.L.; Writing—original draft, H.L.; Writing—review & editing, H.L.; Visualization, H.L.; Supervision, G.Z., J.Z. and C.Z.; Project administration, J.Z. and C.Z.; Funding acquisition, J.J. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Science and Technology Special Project of Fujian Province (Grant No. 2024HZ022013) and the Natural Science Foundation of Fujian Province of China (Grant No. 2023J01047).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Pothole-600 dataset used in this study is publicly available at https://sites.google.com/view/pothole-600/dataset (accessed on 16 April 2026). The small-pothole subset can be reproduced using the selection criteria described in the Dataset and Training Details subsection.

Acknowledgments

The authors thank the supporting institutions and collaborators for their assistance in this study.

Conflicts of Interest

Author Jin Jiang was employed by the company Xiamen King Long United Automotive Industry Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Karunasekera, H.; Wang, H.; Zhang, H. Energy Minimization Approach for Negative Obstacle Region Detection. IEEE Trans. Veh. Technol. 2019, 68, 11668–11678.
  2. Qiu, J.; Jiang, C. A bilateral semantic guidance network for detection of off-road freespace with impairments based on joint semantic segmentation and edge detection. Comput. Electr. Eng. 2025, 123, 110045.
  3. Wang, Z.; Ma, Z.; Wang, Z.; Gao, S.; Peng, J. A novel road damage detection model with efficient attention and Dynamic Snake Convolution. Eng. Appl. Artif. Intell. 2026, 163, 112618.
  4. Fan, R.; Wang, H.; Wang, Y.; Liu, M.; Pitas, I. Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms. IEEE Trans. Image Process. 2021, 30, 8144–8154.
  5. Ye, M.; Li, X.; Dai, J.; Li, H.; Xu, Z.; Zhang, C. SCSANet: Split Convolution Selective Attention Network of Drivable Area Detection for Mobile Robots. Eng 2026, 7, 176.
  6. Subramanian, R.; Büker, U. Study of Contactless Computer Vision-Based Road Condition Estimation Methods Within the Framework of an Operational Design Domain Monitoring System. Eng 2024, 5, 2778–2804.
  7. Deng, K.; Xing, L.; Wu, H.; Ma, H.; Ling, Y.; Gao, J. Advances in Object Detection for Autonomous Driving Using mmWave Radar and Camera: A Comprehensive Survey. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 328.
  8. Dodge, D.; Yilmaz, M. Convex Vision-Based Negative Obstacle Detection Framework for Autonomous Vehicles. IEEE Trans. Intell. Veh. 2023, 8, 778–789.
  9. Pandey, A.K.; Iqbal, R.; Maniak, T.; Karyotis, C.; Akuma, S.; Palade, V. Convolution neural networks for pothole detection of critical road infrastructure. Comput. Electr. Eng. 2022, 99, 107725.
  10. Fu, T.; Dong, H.; Yang, B.; Deng, B. DE-DFNet: Edge Enhanced Diversity Feature Fusion Guided by Differences in Remote Sensing Imagery Tiny Object Detection. Image Vis. Comput. 2025, 161, 105627.
  11. Zhou, Y.; Zhang, C.; Deng, L.; Fu, J.; Li, H.; Xu, Z.; Zhang, J. Resolution-sensitive self-supervised monocular absolute depth estimation. Appl. Intell. 2024, 54, 4781–4793.
  12. Feng, Z.; Guo, Y.; Liang, Q.; Bhutta, M.; Wang, H.; Liu, M.; Sun, Y. MAFNet: Segmentation of Road Potholes with Multimodal Attention Fusion Network for Autonomous Vehicles. IEEE Trans. Instrum. Meas. 2022, 71, 1–12.
  13. Wang, Z.; Wang, W.; Li, N.; Zhang, S.; Chen, Q.; Jiang, Z. Multimodal Parallel Attention Network for Medical Image Segmentation. Image Vis. Comput. 2024, 147, 105069.
  14. Fan, R.; Wang, H.; Bocus, M.; Liu, M. We Learn Better Road Pothole Detection: From Attention Aggregation to Adversarial Domain Adaptation. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020, Proceedings, Part IV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 285–300.
  15. Feng, Z.; Guo, Y.; Sun, Y. Segmentation of Road Negative Obstacles Based on Dual Semantic-Feature Complementary Fusion for Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 4687–4697.
  16. Liang, M.; Hu, J.; Bao, C.; Feng, H.; Deng, F.; Lam, T. Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks. IEEE Robot. Autom. Lett. 2023, 8, 4060–4067.
  17. Sun, T.; Pan, W.; Wang, Y.; Liu, Y. Region of Interest Constrained Negative Obstacle Detection and Tracking with a Stereo Camera. IEEE Sens. J. 2022, 22, 3616–3625.
  18. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. arXiv 2022, arXiv:2209.08575.
  19. He, L.; Todorovic, S. Attention Decomposition for Cross-Domain Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 414–431.
  20. Yu, H.; Cho, Y.; Kang, B.; Moon, S.; Kong, K.; Kang, S.J. Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 92–110.
  21. Shan, J.; Huang, Y.; Jiang, W. DCUFormer: Enhancing Pavement Crack Segmentation in Complex Environments with Dual-Cross/Upsampling Attention. Expert Syst. Appl. 2025, 264, 125891.
  22. Fan, J.; Bocus, M.; Hosking, B.; Wu, R.; Liu, Y.; Vityazev, S.; Fan, R. Multi-scale Feature Fusion: Learning Better Semantic Segmentation for Road Pothole Detection. In Proceedings of the 2021 IEEE International Conference on Autonomous Systems (ICAS); IEEE: New York, NY, USA, 2021; pp. 1–5.
  23. He, C.; Yang, H.; Zhang, Z.; Wang, H.; Cai, Y.; Chen, L.; Zhong, C.; Zhang, Y. Dual-stream Detection and Segmentation Framework for Vision Based Unmanned Ground Vehicle Pothole Perception on Unstructured Roads. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 203.
  24. Han, J.; Zhang, Z.; Gao, X.; Li, K.; Kang, X. Research on Negative Obstacle Detection Method Based on Image Enhancement and Improved Anchor Box YOLO. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA); IEEE: New York, NY, USA, 2022; pp. 1216–1221.
  25. Zhang, Y.; Zuo, Z.; Xu, X.; Wu, J.; Zhu, J.; Zhang, H.; Wang, J.; Tian, Y. Road Damage Detection Using UAV Images Based on Multi-level Attention Mechanism. Autom. Constr. 2022, 144, 104613.
  26. Zhang, C.; Li, G.; Zhang, Z.; Shao, R.; Li, M.; Han, D.; Zhou, M. AAL-Net: A Lightweight Detection Method for Road Surface Defects Based on Attention and Data Augmentation. Appl. Sci. 2023, 13, 1435. [Google Scholar] [CrossRef]
  27. Ali, R.; Bin-Saeed, Q.; Buyukozturk, O.; Lee, S.; Cha, Y. Monocular Computer Vision-Based Simultaneous Pothole Segmentation and 3D Volume Prediction Using 3DPredictNet. SSRN Electron. J. 2024. [Google Scholar] [CrossRef]
  28. Lin, W.; Li, X.; Han, H.; Yu, Q.; Cho, Y.H. A Novel Approach for Pavement Distress Detection and Quantification Using RGB-D Camera and Deep Learning Algorithm. Constr. Build. Mater. 2023, 407, 133593. [Google Scholar] [CrossRef]
  29. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  30. Hu, X.; Assaad, R.H. Real-time robotic teleoperation for pavement pothole segmentation, quantification, and localization using multimodal sensing and efficient multi-scale attention-enhanced edge deep learning. Autom. Constr. 2026, 183, 106806. [Google Scholar] [CrossRef]
  31. Feng, Z.; Guo, Y.; Navarro-Alarcon, D.; Lyu, Y.; Sun, Y. InconSeg: Residual-Guided Fusion with Inconsistent Multi-Modal Data for Negative and Positive Road Obstacles Segmentation. IEEE Robot. Autom. Lett. 2023, 8, 4871–4878. [Google Scholar] [CrossRef]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  33. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  34. Li, N.; Zhang, X.; Li, B.; Yuan, B.; Yang, G. IE-collaborative attention for spatial feature refinement and boundary aware in real-time semantic segmentation. Neurocomputing 2025, 653, 131096. [Google Scholar]
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  36. Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–21. [Google Scholar] [CrossRef]
  37. Lv, Y.; Liu, Z.; Li, G. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26, 6348–6360. [Google Scholar] [CrossRef]
  38. Ma, X.; Zhang, X.; Pun, M.O.; Huang, B. A unified framework with multimodal fine-tuning for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405015. [Google Scholar]
  39. Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748. [Google Scholar] [CrossRef]
  40. Bai, L.; Yang, J.; Tian, C.; Sun, Y.; Mao, M.; Xu, Y.; Xu, W. DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation. Pattern Recognit. 2025, 162, 111379. [Google Scholar] [CrossRef]
Figure 1. Images of typical road potholes.
Figure 2. Overall architecture of FAFMNet. The feature fusion module integrates feature maps from RGB and disparity images. Three attention mechanisms are further incorporated to construct feature attention fusion modules, which are placed at appropriate network layers to exploit discriminative multimodal information.
Figure 3. Structure of the feature fusion module. The global and local features of the input feature maps are combined through broadcasting addition, and the result is multiplied element-wise with the original input. The fused features are then combined with the features extracted from both the RGB and disparity images, and the result ultimately serves as the model’s prediction.
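The two operations named in the Figure 3 caption, broadcasting addition followed by element-wise multiplication, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_features`, the pooling used for the global and local branches, the sigmoid gate, and the summed RGB and disparity input are all hypothetical choices made here so that the example runs.

```python
import numpy as np

def fuse_features(x: np.ndarray) -> np.ndarray:
    """Gate a (C, H, W) feature map with combined global and local context."""
    # Global branch (assumed): one descriptor per channel, shape (C, 1, 1).
    global_ctx = x.mean(axis=(1, 2), keepdims=True)
    # Local branch (assumed): one descriptor per position, shape (1, H, W).
    local_ctx = x.mean(axis=0, keepdims=True)
    # Broadcasting addition expands both branches to (C, H, W).
    combined = global_ctx + local_ctx
    # Sigmoid gate (assumed) maps the combined context to weights in [0, 1].
    weights = 1.0 / (1.0 + np.exp(-combined))
    # Element-wise multiplication with the original input.
    return weights * x

# Hypothetical RGB and disparity feature maps of identical shape.
rgb_feat = np.random.default_rng(0).normal(size=(64, 32, 32))
dis_feat = np.random.default_rng(1).normal(size=(64, 32, 32))
fused = fuse_features(rgb_feat + dis_feat)
print(fused.shape)  # (64, 32, 32)
```

The point of the sketch is the shape handling: a (C, 1, 1) tensor and a (1, H, W) tensor add to a full (C, H, W) weight map without any explicit tiling.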
Figure 4. Structure of the nested feature fusion module.
Figure 5. Structure of the spatial–directional attention fusion (SDAF) module.
Figure 6. Channel-interdependence fusion (CIF) module.
Figure 7. Structure of the multidimensional correlation fusion (MCF) module.
Figure 8. Comparison of FAFMNet with six other state-of-the-art multimodal fusion neural networks. Each column shows the results obtained on the same pair of RGB and disparity images. The proposed FAFMNet achieves the best performance.
Figure 9. Zoomed examples of small-pothole boundary segmentation. FAFMNet is compared with PotCrackSeg, the strongest competing method in Table 6, to highlight the local boundary differences around small pothole regions.
Table 1. Input and output parameters of each encoder and decoder layer. The initial layer consists of a convolutional layer, ReLU activation, and batch normalization. The fourth encoder layer corresponds to the transformer module shown in Figure 2.
| | Encoder Initial | Encoder 1st | Encoder 2nd | Encoder 3rd | Encoder 4th | Decoder 1st | Decoder 2nd | Decoder 3rd | Decoder 4th | Decoder 5th |
|---|---|---|---|---|---|---|---|---|---|---|
| Input Size | 512 × 512 | 256 × 256 | 128 × 128 | 64 × 64 | 32 × 32 | 16 × 16 | 32 × 32 | 64 × 64 | 128 × 128 | 256 × 256 |
| Output Size | 256 × 256 | 128 × 128 | 64 × 64 | 32 × 32 | 16 × 16 | 32 × 32 | 64 × 64 | 128 × 128 | 256 × 256 | 512 × 512 |
| Input Channel | 3 | 64 | 64 | 128 | 256 | 512 | 256 | 128 | 64 | 32 |
| Output Channel | 64 | 64 | 128 | 256 | 512 | 256 | 128 | 64 | 32 | 2 |
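The sizes in Table 1 follow a simple pattern: each encoder stage halves the spatial resolution and each decoder stage doubles it. The short script below reproduces that bookkeeping as a sanity check. The channel pairs are copied from Table 1; the function name `stage_shapes` and the stage labels are hypothetical and used only for illustration.

```python
# Channel pairs (input, output) per stage, copied from Table 1.
ENCODER_CHANNELS = [(3, 64), (64, 64), (64, 128), (128, 256), (256, 512)]
DECODER_CHANNELS = [(512, 256), (256, 128), (128, 64), (64, 32), (32, 2)]

def stage_shapes(input_size: int = 512):
    """Return (stage, in_size, out_size, in_ch, out_ch) for every stage."""
    rows = []
    size = input_size
    for i, (c_in, c_out) in enumerate(ENCODER_CHANNELS):
        # Every encoder stage halves the spatial resolution.
        rows.append((f"encoder-{i}", size, size // 2, c_in, c_out))
        size //= 2
    for i, (c_in, c_out) in enumerate(DECODER_CHANNELS, start=1):
        # Every decoder stage doubles the spatial resolution.
        rows.append((f"decoder-{i}", size, size * 2, c_in, c_out))
        size *= 2
    return rows

for row in stage_shapes():
    print(row)
```

Here `encoder-0` stands for the initial layer. The final decoder row ends at 512 × 512 with 2 output channels, which matches the last column of Table 1.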
Table 2. Per-model training and architecture configuration for the compared methods. All models are re-implemented on a common ImageNet-pretrained ResNet-34 backbone and trained under one identical, fixed protocol (a controlled comparison); only the architecture-defining modules differ and are retained from each source publication. Shared settings: optimizer SGD with momentum 0.9; input size 512 × 512; cross-entropy loss. Single-modal methods use one ResNet-34 encoder, whereas multimodal methods use two (RGB and disparity) ResNet-34 encoders with four-stage feature fusion.
| Model | Input | Backbone | LR | Decay | Weight Decay | Batch | Epochs | Architecture-Defining Modules (Source) |
|---|---|---|---|---|---|---|---|---|
| ADFormer | RGB/DIS | ResNet-34 | 0.02 | 0.98 | 1 × 10⁻⁴ | 8 | 100 | Attention-decomposition module; 8 attention heads [19] |
| EDAFormer | RGB/DIS | ResNet-34 | 0.02 | 0.98 | 1 × 10⁻⁴ | 8 | 100 | Embedding-free attention; 8 heads; spatial-reduction ratios 8/4/2/1 [20] |
| FDNet | RGB/DIS | ResNet-34 | 0.04 | 0.95 | 5 × 10⁻⁴ | 8 | 100 | Frequency-decoupling modules (high-/low-frequency branches) [36] |
| SegNeXt | RGB/DIS | ResNet-34 | 0.08 | 0.98 | 1 × 10⁻⁴ | 8 | 100 | Multi-scale convolutional attention (kernels 7/11/21) + Hamburger decoder [18] |
| EAEFNet | RGB+DIS | ResNet-34 | 0.08 | 0.95 | 5 × 10⁻⁴ | 8 | 100 | Explicit attention-enhanced fusion (EAEF) module [16] |
| CAINet | RGB+DIS | ResNet-34 | 0.04 | 0.95 | 5 × 10⁻⁴ | 8 | 100 | Context-aware interaction module with global context [37] |
| MFNet | RGB+DIS | ResNet-34 | 0.08 | 0.98 | 1 × 10⁻⁴ | 8 | 100 | Multimodal fine-tuning fusion adapters [38] |
| SGFNet | RGB+DIS | ResNet-34 | 0.04 | 0.95 | 5 × 10⁻⁴ | 8 | 100 | Semantic-guided fusion module [39] |
| DCANet | RGB+DIS | ResNet-34 | 0.04 | 0.95 | 1 × 10⁻⁴ | 8 | 100 | Differential convolution attention module [40] |
| PotCrackSeg | RGB+DIS | ResNet-34 | 0.08 | 0.98 | 1 × 10⁻⁴ | 8 | 100 | Dual semantic-feature complementary fusion (DSCF) + CompSemFE [15] |
| FAFMNet (Ours) | RGB+DIS | ResNet-34 | 0.08 | 0.95 | 5 × 10⁻⁴ | 8 | 100 | Feature fusion module (global + local branches) with SDAF/CIF/MCF attention |
Table 3. Ablation studies: comparison between the baseline network and the network with each added module in terms of performance metrics, parameter count, and FLOPs. Improvements over the baseline are given in parentheses.
| Variants | mPre (%) | mRec (%) | mAcc (%) | mF1 (%) | mIoU (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| Baseline | 91.25 | 86.23 | 98.32 | 88.67 | 79.64 | 64.79 | 112.22 |
| Baseline + CIF | 91.32 | 89.63 | 98.57 (0.25) | 90.26 (1.59) | 82.25 (2.61) | 64.71 | 112.14 |
| Baseline + SDAF | 90.97 | 89.82 | 98.62 (0.30) | 90.39 (1.72) | 82.46 (2.82) | 64.82 | 112.24 |
| Baseline + MCF | 90.86 | 90.16 | 98.67 (0.35) | 90.51 (1.84) | 82.66 (3.02) | 65.04 | 112.61 |
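For readers unfamiliar with the metrics in Table 3, the sketch below computes precision, recall, accuracy, F1, and IoU from pixel-level confusion counts for a single class, using the standard binary definitions. It is an illustration only: the function name `segmentation_metrics` and the pixel counts are hypothetical, and the averaging implied by the "m" prefix (mPre, mRec, and so on) is not modelled here.

```python
def segmentation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard per-class metrics from pixel-level confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    # IoU divides the overlap by the union of prediction and ground truth.
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "iou": iou}

# Hypothetical pixel counts, not taken from the paper.
m = segmentation_metrics(tp=900, fp=100, fn=50, tn=8950)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

For a single class, IoU and F1 are tied by IoU = F1 / (2 − F1), so a gain in F1 always appears as a larger gain in IoU when both are well below 100%.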
Table 4. Ablation studies: effect of the placement of the three designed feature attention fusion modules on the semantic segmentation performance of the model. Columns Initial to 4th list the feature attention fusion module used at each stage. Configuration L gives the best results.
| No. | Initial | 1st | 2nd | 3rd | 4th | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| A | SDAF | CIF | MCF | MCF | MCF | 90.64 | 81.09 |
| B | CIF | SDAF | MCF | MCF | MCF | 89.66 | 81.26 |
| C | CIF | CIF | SDAF | MCF | MCF | 90.04 | 81.89 |
| D | CIF | SDAF | SDAF | MCF | MCF | 90.19 | 82.14 |
| E | SDAF | CIF | CIF | MCF | MCF | 90.44 | 82.56 |
| F | SDAF | SDAF | SDAF | CIF | MCF | 90.23 | 82.20 |
| G | SDAF | SDAF | CIF | CIF | MCF | 90.45 | 82.56 |
| H | SDAF | CIF | CIF | CIF | MCF | 90.43 | 82.54 |
| I | CIF | SDAF | SDAF | SDAF | MCF | 89.96 | 81.76 |
| J | CIF | CIF | SDAF | SDAF | MCF | 90.24 | 82.22 |
| K | CIF | CIF | CIF | SDAF | MCF | 89.68 | 81.29 |
| L | SDAF | SDAF | CIF | MCF | MCF | 91.26 | 83.93 |
Table 5. Boundary accuracy comparison between MAFNet and FAFMNet. Higher BF1 and Trimap IoU indicate better performance, while lower Hausdorff distance indicates smaller boundary deviation.
| Approach | BF1 (%) | Trimap IoU (%) | Hausdorff Distance (px) |
|---|---|---|---|
| MAFNet | 83.47 | 72.58 | 9.64 |
| FAFMNet (Ours) | 87.92 | 77.36 | 6.81 |
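The Hausdorff distance in Table 5 measures the worst-case deviation between a predicted boundary and the ground-truth boundary. The sketch below implements the plain symmetric Hausdorff distance between two point sets with the Python standard library. The source does not say which variant was used (for example, a percentile-based one), so this is an assumption; the function name and the boundary points are hypothetical.

```python
import math

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty 2D point sets."""
    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))

# Hypothetical boundary pixels (row, col); the third point differs.
pred_boundary = [(0, 0), (0, 1), (0, 2)]
true_boundary = [(0, 0), (0, 1), (3, 2)]
print(hausdorff_distance(pred_boundary, true_boundary))  # 3.0
```

A single stray boundary pixel determines the value, which is why the metric is sensitive to the local boundary errors that matter for small potholes. This brute-force version is quadratic in the number of points and is meant for illustration rather than for full-resolution masks.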
Table 6. Performance comparison between the proposed FAFMNet and state-of-the-art semantic segmentation neural networks. “RGB” denotes a model that uses only RGB images as input, and “DIS” denotes a model that uses only disparity images as input. All metric values are reported as mean ± standard deviation over three independent runs (seeds 2024–2026); the paired significance test is performed on mF1 and mIoU.
| Approach | mPre (%) | mRec (%) | mAcc (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| ADFormer (RGB) | 68.51 ± 0.53 | 72.24 ± 0.42 | 94.35 ± 0.24 | 70.79 ± 0.34 | 57.79 ± 0.41 |
| ADFormer (DIS) | 82.32 ± 0.58 | 84.21 ± 0.62 | 96.77 ± 0.18 | 83.25 ± 0.47 | 73.31 ± 0.53 |
| EDAFormer (RGB) | 66.74 ± 0.73 | 74.89 ± 0.51 | 93.53 ± 0.28 | 70.58 ± 0.42 | 54.54 ± 0.45 |
| EDAFormer (DIS) | 85.98 ± 0.54 | 84.33 ± 0.51 | 97.21 ± 0.15 | 85.14 ± 0.37 | 74.13 ± 0.42 |
| FDNet (RGB) | 73.21 ± 0.61 | 75.78 ± 0.57 | 95.27 ± 0.22 | 74.47 ± 0.50 | 59.33 ± 0.43 |
| FDNet (DIS) | 86.82 ± 0.57 | 82.33 ± 0.53 | 96.45 ± 0.17 | 84.52 ± 0.43 | 73.18 ± 0.40 |
| SegNeXt (RGB) | 76.25 ± 0.67 | 69.61 ± 0.73 | 95.23 ± 0.25 | 72.78 ± 0.62 | 61.08 ± 0.67 |
| SegNeXt (DIS) | 87.54 ± 0.52 | 81.63 ± 0.55 | 97.85 ± 0.16 | 85.54 ± 0.41 | 74.73 ± 0.53 |
| EAEFNet | 91.61 ± 0.32 | 83.42 ± 0.38 | 98.43 ± 0.13 | 87.32 ± 0.30 | 77.50 ± 0.34 |
| CAINet | 88.26 ± 0.42 | 87.24 ± 0.47 | 98.51 ± 0.12 | 87.75 ± 0.36 | 78.17 ± 0.42 |
| MFNet | 90.04 ± 0.43 | 85.28 ± 0.40 | 98.45 ± 0.13 | 87.60 ± 0.35 | 77.93 ± 0.44 |
| SGFNet | 85.68 ± 0.32 | 89.52 ± 0.47 | 98.44 ± 0.14 | 87.56 ± 0.38 | 77.87 ± 0.41 |
| DCANet | 90.93 ± 0.23 | 84.59 ± 0.33 | 98.60 ± 0.11 | 88.32 ± 0.20 | 79.09 ± 0.26 |
| PotCrackSeg | 91.45 ± 0.33 | 87.42 ± 0.34 | 98.65 ± 0.11 | 89.39 ± 0.28 | 80.81 ± 0.35 |
| FAFMNet (Ours) | 90.22 ± 0.26 | 92.32 ± 0.25 | 98.73 ± 0.09 | 91.26 ± 0.21 | 83.93 ± 0.27 |
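The Table 6 caption mentions a paired significance test over three runs. One common choice for such a comparison is the paired t-test, sketched below with the Python standard library. Whether the authors used this exact test is not stated in this section, so treat it as an assumption; the per-seed values are hypothetical and are not taken from the paper.

```python
import math
import statistics

def paired_t_statistic(ours, baseline):
    """t statistic of the per-run differences (paired t-test)."""
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Hypothetical per-seed mIoU values (%), NOT taken from the paper.
ours = [83.7, 83.9, 84.2]
baseline = [80.5, 80.8, 81.1]

t = paired_t_statistic(ours, baseline)
# Two-sided 5% critical value of Student's t with n - 1 = 2 degrees of freedom.
T_CRIT = 4.303
print(t > T_CRIT)  # True
```

With only three runs the test has two degrees of freedom, so the critical value is large; a result is significant at p < 0.05 only when the per-seed differences are both sizeable and consistent.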
Table 7. Comparison of FAFMNet with other multimodal models in terms of parameter count and floating-point operations (FLOPs). All models are standardized with ResNet-34 as the backbone.
| Approach | Backbone | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| EAEFNet | ResNet-34 | 67.34 | 121.37 |
| CAINet | ResNet-34 | 66.69 | 118.32 |
| MFNet | ResNet-34 | 65.21 | 112.65 |
| SGFNet | ResNet-34 | 67.05 | 119.85 |
| DCANet | ResNet-34 | 64.85 | 111.76 |
| PotCrackSeg | ResNet-34 | 65.86 | 116.74 |
| FAFMNet (Ours) | ResNet-34 | 65.36 | 113.18 |
