Article

AMFFNet: Adaptive Multi-Scale Feature Fusion Network for Urban Image Semantic Segmentation

Shuting Huang and Haiyan Huang
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2344; https://doi.org/10.3390/electronics14122344
Submission received: 4 May 2025 / Revised: 5 June 2025 / Accepted: 6 June 2025 / Published: 8 June 2025
(This article belongs to the Topic Intelligent Image Processing Technology)

Abstract
Urban image semantic segmentation faces challenges including the coexistence of multi-scale objects, blurred semantic relationships between complex structures, and dynamic occlusion interference. Existing methods often struggle to balance global contextual understanding of large scenes and fine-grained details of small objects due to insufficient granularity in multi-scale feature extraction and rigid fusion strategies. To address these issues, this paper proposes an Adaptive Multi-scale Feature Fusion Network (AMFFNet). The network primarily consists of four modules: a Multi-scale Feature Extraction Module (MFEM), an Adaptive Fusion Module (AFM), an Efficient Channel Attention (ECA) module, and an auxiliary supervision head. Firstly, the MFEM utilizes multiple depthwise strip convolutions to capture features at various scales, effectively leveraging contextual information. Then, the AFM employs a dynamic weight assignment strategy to harmonize multi-level features, enhancing the network’s ability to model complex urban scene structures. Additionally, the ECA attention mechanism introduces cross-channel interactions and nonlinear transformations to mitigate the issue of small-object segmentation omissions. Finally, the auxiliary supervision head enables shallow features to directly affect the final segmentation results. Experimental evaluations on the CamVid and Cityscapes datasets demonstrate that the proposed network achieves mean Intersection over Union (mIoU) scores of 77.8% and 81.9%, respectively, outperforming most of the compared methods. The results confirm that AMFFNet has a stronger ability to understand complex urban scenes.

1. Introduction

In the digital era, urbanization is advancing rapidly. As a critical carrier for urban environment perception, urban images contain multi-level semantic information. Their fine-grained analysis holds significant importance for urban planning, traffic management, and environmental monitoring [1,2,3]. Different from traditional semantic segmentation, urban view semantic segmentation uses pixel-by-pixel classification to establish comprehensive semantic correspondence, with the dual goal of maintaining positional fidelity and classification accuracy. However, due to the complex structural composition of urban scenes, as well as variations in shooting angles and distances that introduce substantial discrepancies between captured images, achieving high-precision urban image semantic segmentation remains a highly challenging task [4].
Traditional methods in urban image semantic segmentation have primarily relied on handcrafted feature design. While these approaches can capture certain local information, they suffer from severe limitations in generalization when faced with multi-scale objects, dynamic occlusions, and illumination changes, and they fail to model long-range contextual relationships [5,6]. With scientific advancements, deep learning—owing to its automated feature learning, multi-scale modeling, and efficient optimization capabilities—has become an inevitable choice to overcome the technical bottlenecks of urban image semantic segmentation. Deep learning-based methods for urban image semantic segmentation can generally be categorized into four types: encoder–decoder architectures [7], dilated convolution and multi-scale feature fusion methods [8], attention mechanisms and Transformer models [9], and real-time segmentation approaches [10]. Among them, the encoder–decoder architecture has emerged as one of the mainstream strategies, for it adapts to the complex and dynamic demands of real-world scenes through pretraining transfer, multi-scale feature fusion, and flexible architectural designs.
The encoder–decoder architecture consists of two main components: the encoder progressively extracts high-level semantic features by stacking convolutional and down-sampling layers, in order to compress the spatial dimensions of the input image while capturing feature representations; the decoder, in turn, gradually restores spatial resolution through deconvolution or up-sampling operations, ultimately producing pixel-level classification results. Long et al. replaced the fully connected layers in traditional convolutional neural networks with convolutional layers and thus successfully achieved dense pixel-level predictions from image inputs. However, the segmentation results lacked fine-grained precision [11]. Ronneberger et al. introduced lateral skip connections to concatenate multi-scale features from the encoder with the corresponding decoder layers, significantly enhancing boundary accuracy in image segmentation, while in complex scenes, capturing fine details of small objects remained challenging [12]. Badrinarayanan et al. optimized the up-sampling process by preserving pooling indices, yet the segmentation accuracy around object boundaries was still limited [13]. Xie et al. incorporated a lightweight multi-scale feature fusion decoder, achieving high efficiency when handling features at different scales, but struggled with small and precise details in complex image scenarios [14]. Kirillov et al. reframed image segmentation as a rendering problem in image processing, refining the coarse mask edges produced by earlier networks. Nevertheless, inconsistencies between sampling strategies during training and inference led to insufficient generalization to key points in practical applications [15]. These observations indicate that the segmentation accuracy of such networks in complex urban images still requires further improvement.
Building upon the above analysis, we identify three key limitations in current urban image segmentation approaches:
  • Insufficient multi-scale feature representation: Existing methods struggle to comprehensively capture both macro-structures and micro-details in complex urban scenes due to rigid feature extraction mechanisms.
  • Static feature fusion strategies: Fixed fusion rules fail to adapt to the dynamic relationships between hierarchical features across scales.
  • Weak small-object discrimination: Critical urban elements (e.g., traffic signs and street lamps) are frequently misclassified due to inadequate attention to fine-grained features.
In order to enhance the accuracy of semantic segmentation for complex urban scenes, this paper proposes an Adaptive Multi-scale Feature Fusion Network, following the encoder–decoder architecture. The main contributions are summarized as follows:
  • An Adaptive Multi-scale Feature Fusion Network, termed AMFFNet, is proposed. AMFFNet combines the representational capabilities of the encoder–decoder architecture and attention mechanisms, exhibiting outstanding performance in multi-scale feature extraction and adaptive feature fusion. Our experimental results demonstrate that AMFFNet achieves superior segmentation performance on the CamVid dataset and Cityscapes dataset compared to other segmentation networks.
  • A Multi-scale Feature Extraction Module (MFEM) is designed, utilizing depthwise strip convolutions and Global Average Pooling (GAP) to enable the network to extract richer multi-scale features and establish a comprehensive cognitive representation of the overall structure and context of the image.
  • An Adaptive Feature Fusion Module (AFM) is introduced, which dynamically adjusts the feature fusion strategy to optimize the combination of features across different levels and scales; this design improves the model’s capability to comprehend and segment complex scenes.
  • Efficient Channel Attention (ECA) is incorporated, which enhances the learning of useful information in the input image through cross-channel interactions and nonlinear transformations, leading to improved segmentation accuracy for small objects.

2. Related Work

In recent years, to address the challenges of object scale diversity and complex contextual dependencies in urban scene semantic segmentation, multi-scale feature extraction and cross-hierarchical fusion mechanisms have emerged as core research directions for improving segmentation accuracy. In terms of feature extraction, Yuan et al. developed the Object Contextual Representation (OCR) framework, which associates pixel features with object region features generated from coarse segmentation masks. By leveraging region-level semantic guidance, this method optimizes pixel classification confidence [16]. Although Arulananth et al. utilize downsampling to reduce spatial dimensions while augmenting feature depth for enhanced context acquisition, their approach demonstrates limited robustness in accurately segmenting urban landscapes under adverse weather conditions [17]. Building on this, Jin et al. constructed the Short-term Dense Connection Module (SDCM), incorporating Strip Pooling to capture long-range dependencies while embedding a Channel Attention mechanism to filter discriminative features. This design effectively resolves feature dilution issues in multi-scale receptive field fusion [18]. Nan et al. synergistically integrated Squeeze-and-Excitation (SE) with ASPP, developing a spatial weight map to enhance edge responses while utilizing multi-scale atrous convolutions for global context aggregation. This integrated approach demonstrates notable improvements in edge segmentation accuracy and small object recognition performance [19].
In terms of feature fusion, researchers have proposed various optimization strategies for encoder–decoder architectures. Tong et al. proposed CSAFNet with channel-spatial attention fusion to enhance urban segmentation accuracy, though real-time performance remains unvalidated [20]. Liu et al. innovatively integrated the Frequency-Spatial Domain Attention Fusion Module (FSAFM) with the Attention-Guided Multi-scale Fusion Upsampling Module (AGMUM), significantly enhancing target boundary precision. However, unassessed real-time performance constrains deployment efficiency [21]. To address this limitation, Wu et al. developed an Attention-guided Feature Fusion Module that adaptively adjusts receptive fields through depthwise separable convolutions of varying kernel sizes, combined with a channel attention mechanism to generate adaptive fusion weights for deep and shallow features [22]. Shen et al. contributed two novel components: the Asymmetric Cross-layer Self-Attention (ACSA) module for enhanced feature alignment and the Multi-branch Cascade Decoder (MCD) module for effective multi-modal fusion [23]. Further advancing this direction, Meng et al. designed an enhanced framework incorporating dual-attention mechanisms to optimize multi-scale feature interactions [24]. Despite these advancements, significant challenges persist in global context modeling for complex urban scenes and detail preservation of small-scale targets, particularly under dynamic illumination and occlusion conditions.

3. Network Model

3.1. Overall Network Architecture

The structure of the proposed AMFFNet is illustrated in Figure 1. It consists of four primary components: the MFEM module, the AFM, the ECA module, and the auxiliary supervision head. The backbone network employs Dilated ResNet-50 to extract visual features from the input image. Then, the MFEM leverages depthwise strip convolution to perform multi-scale resampling of urban scenes, effectively capturing contextual information at different scales. Subsequently, the AFM fuses high-level and low-level features while adaptively adjusting channel-wise weights to optimize the integration of features across layers. Meanwhile, the ECA attention mechanism enhances the network’s ability to focus on informative channels by introducing inter-channel interactions and nonlinear transformations. Finally, to enhance the exploitation of spatial features from shallow network layers, we introduce an auxiliary supervision head during training, which is subsequently discarded at test time.
The input image undergoes preliminary feature extraction through the backbone network. Subsequently, the MFEM performs advanced feature refinement, yielding the feature map FMFEM. Concurrently, two shallow feature maps from the backbone are enhanced by the ECA, producing Fstage2 and Fstage3. These three feature maps are simultaneously processed by the AFM for cross-hierarchical integration, followed by upsampling to restore the original spatial resolution.
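As a rough illustration, the PyTorch-style sketch below wires placeholder components together in the order described above. The class, attribute, and channel-size choices (and the bilinear upsampling) are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMFFNetFlowSketch(nn.Module):
    """Wiring sketch of the described data flow; every component is a simple placeholder."""
    def __init__(self, num_classes=19, c2=512, c3=1024, c4=2048, c_fused=256):
        super().__init__()
        self.mfem = nn.Conv2d(c4, c_fused, 1)                 # stands in for the MFEM
        self.eca2 = nn.Identity()                             # stands in for ECA on stage-2 features
        self.eca3 = nn.Identity()                             # stands in for ECA on stage-3 features
        self.afm = nn.Conv2d(c2 + c3 + c_fused, c_fused, 1)   # stands in for the AFM
        self.classifier = nn.Conv2d(c_fused, num_classes, 1)

    def forward(self, f_stage2, f_stage3, f_stage4, out_size):
        # All stage features come from the dilated backbone at 1/8 of the input resolution.
        f_mfem = self.mfem(f_stage4)                          # multi-scale context from deep features
        f2, f3 = self.eca2(f_stage2), self.eca3(f_stage3)     # channel-recalibrated shallow features
        fused = self.afm(torch.cat([f2, f3, f_mfem], dim=1))  # cross-hierarchical integration
        out = self.classifier(fused)
        return F.interpolate(out, size=out_size, mode="bilinear", align_corners=False)
```

During training, the auxiliary heads described in Section 3.5 would additionally attach to f_stage2 and f_stage3; they are omitted here, as they are discarded at test time in any case.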

3.2. Structure of the Multi-Scale Feature Extraction Module

Urban images typically contain diverse scenes and a wide variety of objects, such as buildings, roads, pedestrians, and vehicles, which exhibit significant differences in scale and size. Using a single scale for segmentation cannot capture the details and features of all targets well. Therefore, multi-scale segmentation is essential for achieving a comprehensive understanding and accurate delineation of various objects within the image. To address the challenges brought by complex scenes, differences in perspectives, and variations in lighting conditions, the MFEM aims to conduct effective multi-scale feature extraction to enhance the network’s adaptability to diverse urban environments.
The MFEM consists of five branches: three depthwise strip convolution branches, one GAP branch, and one 1 × 1 convolution branch. The specific structure is displayed in Figure 2.
Compared to standard convolution, depthwise strip convolution retains the advantage of large receptive fields from sizable convolutional kernels, while extracting richer features with reduced parameter counts and computational costs. By decomposing 2D convolutions into a series of two 1D convolutions, this method improves computational efficiency by accelerating both forward and backward propagation while maintaining rich feature extraction capabilities [25]. This asymmetric kernel decomposition strategy is more effective than symmetric small-kernel combinations. It better captures and utilizes spatial features, promotes feature diversity, and enhances the model’s capacity to understand and represent complex spatial structures. It proves especially advantageous in handling diverse and dynamic visual features. GAP compresses the spatial dimensions (height and width) of feature maps into channel-wise descriptors by aggregating activation responses across the entire receptive field, effectively preserving critical channel-wise semantic information while eliminating spatial redundancy. The 1 × 1 convolution within each branch integrates features at various spatial locations and adjusts the number of channels to improve the network’s representational capacity. Finally, the outputs of the four branches are fused with the input feature map X via a residual connection, resulting in a feature representation that effectively combines both local and global contextual information.
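A minimal PyTorch sketch of the MFEM structure is given below. The strip-convolution kernel sizes (7, 11, 21) follow the SegNeXt-style design of [25] and are assumptions, as is the exact placement of the 1 × 1 convolutions; the sketch sums three depthwise strip branches, a GAP branch, and a 1 × 1 branch, and adds the result to the input X as a residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripConv(nn.Module):
    """Depthwise strip convolution: a k x k receptive field factorized into 1 x k and k x 1 depthwise convs."""
    def __init__(self, channels, k):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.conv_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)

    def forward(self, x):
        return self.conv_v(self.conv_h(x))

class MFEMSketch(nn.Module):
    """Five parallel branches (three strip convolutions, GAP, 1 x 1) fused with a residual connection."""
    def __init__(self, channels, kernel_sizes=(7, 11, 21)):    # kernel sizes assumed, following [25]
        super().__init__()
        self.strips = nn.ModuleList([StripConv(channels, k) for k in kernel_sizes])
        self.point = nn.Conv2d(channels, channels, 1)          # 1 x 1 convolution branch
        self.gap_proj = nn.Conv2d(channels, channels, 1)       # projection after global average pooling
        self.fuse = nn.Conv2d(channels, channels, 1)           # integrates the branch outputs

    def forward(self, x):
        h, w = x.shape[2:]
        gap = self.gap_proj(F.adaptive_avg_pool2d(x, 1))       # channel-wise global descriptor
        gap = F.interpolate(gap, size=(h, w), mode="nearest")  # broadcast back over the feature map
        branches = sum(s(x) for s in self.strips) + self.point(x) + gap
        return x + self.fuse(branches)                         # residual connection with the input X
```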

3.3. Adaptive Feature Fusion Module

In urban image semantic segmentation, the fusion of shallow and deep features plays a crucial role, as they capture different aspects of the image. Shallow layers tend to preserve rich spatial and edge details, while deeper layers extract higher-level semantic information. However, the semantic contribution of pixels can vary across different scales, and naively adding these features may cause the loss of fine-grained specifics in the final segmentation output. In order to integrate the detailed information of shallow features into urban scenes and enhance the expressive power of features, this paper proposes an AFM with the structure shown in Figure 3.
The AFM fuses shallow features Fstage2, Fstage3, and FMFEM through concatenation. Due to the use of dilated convolution instead of downsampling in Dilated ResNet-50, all three feature maps share the same spatial resolution (i.e., 1/8 of the input image), and no resizing is needed prior to fusion. Afterwards, the concatenated feature maps are passed through a 1 × 1 convolutional layer to adjust the channel dimensions, followed by Batch Normalization (BN) and a ReLU activation to generate the intermediate feature maps. To capture both global and local context, Global Max Pooling (GMP) and GAP are applied to the feature map, producing two context-aware descriptors, Fmax and Favg, which are used to update the channel-wise attention weights. GAP facilitates contextual dependency modeling by aggregating spatial activation patterns into channel-wise descriptors, thereby strengthening cross-region semantic correlations and enhancing the network’s global scene understanding capability, while GMP enhances sensitivity to prominent local features, such as edges and fine structures. The recalibrated features are obtained by first performing element-wise multiplication of the fused feature map Fconcat with both Fmax and Favg, followed by element-wise addition with Fconcat itself. Compared to simply concatenating or adding the features, this fusion strategy enables more refined modulation of the shallow and deep features, resulting in richer representations and improved discriminative capacity. Consequently, the AFM contributes to enhancing the overall segmentation performance.
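The sketch below captures this fusion logic. The text does not fully specify how the two descriptors are turned into attention weights, so the Sigmoid gating used here, along with the module and argument names, is an assumption.

```python
import torch
import torch.nn as nn

class AFMSketch(nn.Module):
    """Concatenate F_stage2, F_stage3 and F_MFEM (all at 1/8 resolution), then recalibrate the fused
    map with GAP/GMP channel descriptors and add it back to itself."""
    def __init__(self, c2, c3, c_mfem, out_channels):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(c2 + c3 + c_mfem, out_channels, 1, bias=False),  # 1 x 1 conv adjusts channels
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling -> F_avg
        self.gmp = nn.AdaptiveMaxPool2d(1)   # global max pooling -> F_max

    def forward(self, f_stage2, f_stage3, f_mfem):
        f_concat = self.reduce(torch.cat([f_stage2, f_stage3, f_mfem], dim=1))
        w_avg = torch.sigmoid(self.gap(f_concat))   # channel weights from the average descriptor (assumed gating)
        w_max = torch.sigmoid(self.gmp(f_concat))   # channel weights from the max descriptor (assumed gating)
        # Multiply the fused map by both descriptors, then add the fused map itself.
        return f_concat * w_avg + f_concat * w_max + f_concat
```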

3.4. ECA Attention Mechanism

In urban image semantic segmentation, the handling of multi-dimensional features is particularly critical: the model must identify and focus on task-relevant features such as pedestrians, vehicles, and traffic signs. The channel attention mechanism adaptively reassigns channel-wise feature significance through learned weights and filters out irrelevant or distracting information, thereby enhancing the extraction of key visual features. The ECA mechanism is a channel-level attention module designed to strengthen the learning of dependencies among feature channels [26]; its schematic diagram is illustrated in Figure 4.
The ECA mechanism initiates by employing GAP to compress the spatial dimensions of each channel, generating channel-wise global statistics that preserve critical semantic information while eliminating spatial redundancies. Subsequently, a one-dimensional convolution is applied to learn the weights of individual channels. The kernel size is then adaptively determined according to a predefined formula, which enables the model to capture channel dependencies across different ranges. This approach enhances both the flexibility and effectiveness of the attention mechanism. The corresponding calculation formula is shown in Equation (1):
k = \Psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \quad (1)
Here, C denotes the number of channels, and |t|_{\mathrm{odd}} denotes the odd number nearest to t. In this paper, γ = 1 and b = 1. Through the mapping Ψ, feature maps with fewer channels interact over shorter ranges via the nonlinear mapping. The cross-channel relationships captured by the 1D convolution are normalized into probability distributions via the Softmax function, generating channel-wise attention weights that are multiplied element-wise with the input feature maps to produce adaptively enhanced representations. Finally, the weighted features are used to reconstruct the feature representation, which enhances the model’s representational capacity with respect to the input data.
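A compact sketch of the ECA block with the adaptive kernel size of Equation (1) is shown below (γ = b = 1, as in this paper). The gate uses a Sigmoid as in the original ECA-Net [26]; the Softmax normalization described above could be substituted in its place.

```python
import math
import torch
import torch.nn as nn

class ECASketch(nn.Module):
    """Efficient Channel Attention: GAP -> 1D convolution across channels -> gate -> rescale."""
    def __init__(self, channels, gamma=1, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1                      # |.|_odd: force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        n, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                              # GAP: (N, C) channel descriptors
        y = self.conv(y.unsqueeze(1)).squeeze(1)            # local cross-channel interaction
        w = torch.sigmoid(y).view(n, c, 1, 1)               # channel attention weights
        return x * w                                        # adaptively enhanced representation
```

With γ = b = 1, a 256-channel feature map gives k = 9, and a 512-channel map gives k = 11.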

3.5. Auxiliary Supervision Head

For semantic segmentation tasks, increasing network depth may introduce additional optimization challenges. To further improve semantic segmentation performance, the proposed model incorporates an auxiliary head technique. This strategy adds auxiliary loss functions at different network hierarchies to optimize the training process, allowing low-level features to directly influence the loss function and thereby facilitating better fusion of multi-scale information. In this work, features obtained from the multi-scale feature extraction module serve as the primary feature representation, while classifiers applied after Stage 2 and Stage 3 function as auxiliary heads for deep supervision. Each auxiliary head produces predictions corresponding to an individual loss function, enabling more comprehensive utilization of outputs from all network layers during training and mitigating the limitations of relying solely on the final network output. The auxiliary classifiers allow the model to optimize intermediate features critical to final performance from early training stages, thereby enhancing its capability to parse complex image content. The complete loss function is defined in Equation (2):
L = \alpha \, \mathrm{loss}_1 + \beta \, \mathrm{loss}_2 + (1 - \alpha - \beta) \, \mathrm{loss} \quad (2)
The loss functions from Stage 2 and Stage 3 are denoted as loss_1 and loss_2, respectively, while the loss from the backbone network is loss. Here, α and β are the weighting coefficients for the auxiliary losses. Empirical results demonstrate that the total loss reaches its minimum when α = 0.1 and β = 0.2.
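A direct transcription of Equation (2) with the reported weights might look as follows; the function and argument names are illustrative only.

```python
def total_loss(loss_main, loss_aux2, loss_aux3, alpha=0.1, beta=0.2):
    """Equation (2): weighted sum of the Stage 2/Stage 3 auxiliary losses and the main loss."""
    return alpha * loss_aux2 + beta * loss_aux3 + (1.0 - alpha - beta) * loss_main
```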

4. Experimental Results and Analysis

4.1. Dataset Description

4.1.1. CamVid Dataset

The CamVid dataset comprises street-view images captured from a driving car perspective, containing road scene images with 32 semantic categories. To adapt to specific application requirements, we simplify these categories to 11 common classes during the image preprocessing stage: Road, Traffic Sign, Car, Sky, Sidewalk, Pole, Fence, Pedestrian, Building, Bicycle, and Tree. The dataset contains 701 images, each with a resolution of 960 × 720 pixels, categorized into a training set (367 images), validation set (101 images), and test set (233 images).

4.1.2. Cityscapes Dataset

Cityscapes is a large-scale dataset specifically designed for urban scene understanding, visual perception analysis, and the development of computer vision algorithms in autonomous driving systems. The dataset is divided into two parts based on annotation quality: the finely annotated set and the coarsely annotated set. The finely annotated subset contains 3475 training images and 1525 testing images, each with high-quality pixel-level annotations. These annotations cover a wide range of object categories commonly found in urban environments, including roads, buildings, pedestrians, and vehicles. The coarsely annotated subset provides 20,000 images with less detailed labels, making it suitable for preliminary testing or algorithm pre-evaluation. Cityscapes offers pixel-wise annotations for 30 typical urban object classes. In this study, a subset of 19 classes was selected for training and evaluation to align with standard practice in semantic segmentation research on this dataset.

4.2. Experimental Settings

All implementations were built upon the PyTorch v1.8.1 deep learning framework, running on Ubuntu 18.04.6 with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory (NVIDIA Corporation, Santa Clara, CA, USA). The network was optimized using the Adam optimizer with hyperparameters β1 = 0.9 and β2 = 0.99. The initial learning rate was set to 5 × 10⁻⁴ and a momentum coefficient of 0.9 was applied, while the training protocol incorporates the Simulated Annealing Algorithm (SAA) for learning rate adaptation. The semantic segmentation of urban images is treated as a multi-label classification task, whereby the loss for each class is calculated separately and combined to obtain the overall loss function. The loss function employed throughout training was Cross Entropy Loss, whose calculation formula is as follows:
L = -\frac{1}{\mathrm{batch\_size}} \sum_{j=1}^{\mathrm{batch\_size}} \sum_{i=1}^{n} \left[ y_{ji} \log \hat{y}_{ji} + (1 - y_{ji}) \log (1 - \hat{y}_{ji}) \right] \quad (3)
where y is the ground-truth label, ŷ is the predicted result, and batch_size is the size of the training batch.
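For reference, Equation (3) can be transcribed directly in PyTorch as below, with y and ŷ given as (batch_size, n) tensors of 0/1 targets and predicted probabilities; in practice, built-in criteria such as nn.BCEWithLogitsLoss or nn.CrossEntropyLoss serve the same purpose for dense prediction.

```python
import torch

def batch_bce_loss(y_hat, y, eps=1e-7):
    """Transcription of Equation (3): y_hat holds predicted probabilities in (0, 1),
    y holds the corresponding 0/1 targets; both are (batch_size, n) tensors."""
    y_hat = y_hat.clamp(eps, 1.0 - eps)                                 # numerical safety for log()
    per_sample = -(y * torch.log(y_hat) + (1.0 - y) * torch.log(1.0 - y_hat)).sum(dim=1)
    return per_sample.mean()                                            # average over the batch
```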
To increase data diversity and improve the model’s accuracy and robustness, several data augmentation strategies were applied to the input images during training. These included random scaling within a factor of 0.5 to 2.0, horizontal flipping, and random rotations within the range of −10° to 10°. Such augmentations effectively expanded the training dataset and helped the model adapt better to diverse urban scenes.
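A joint image/label augmentation sketch with the stated parameter ranges is shown below, assuming PIL inputs and torchvision’s functional API; the function name and the 0.5 flip probability are assumptions.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, mask):
    """Random scale in [0.5, 2.0], horizontal flip, and rotation in [-10, 10] degrees,
    applied identically to the image and its label mask (PIL inputs assumed)."""
    scale = random.uniform(0.5, 2.0)
    w, h = image.size
    new_size = (int(h * scale), int(w * scale))
    image = TF.resize(image, new_size)
    mask = TF.resize(mask, new_size, interpolation=InterpolationMode.NEAREST)  # keep labels discrete

    if random.random() < 0.5:                                # horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)

    angle = random.uniform(-10.0, 10.0)                      # random rotation
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)
    return image, mask
```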

4.3. Evaluation Metrics

In this article, two standard semantic segmentation metrics were used to validate the performance of the model: mean Intersection over Union (mIoU) and Mean Pixel Accuracy (mPA). The mIoU can be used to measure the degree of overlap between predicted segmentation and true segmentation to evaluate model performance, providing an overall assessment of segmentation quality, as illustrated by Equation (4), while mPA measures the pixel classification accuracy of the model for all categories, reflecting the overall classification ability of the model when ignoring class imbalance. The calculation formula is illustrated through Equation (5):
\mathrm{mIoU} = \frac{1}{K+1} \sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij} + \sum_{j=0}^{K} p_{ji} - p_{ii}} \quad (4)
\mathrm{mPA} = \frac{1}{K+1} \sum_{i=0}^{K} \frac{p_{ii}}{\sum_{j=0}^{K} p_{ij}} \quad (5)
In these formulas, K denotes the number of classes and p_{ij} represents the total number of pixels whose ground-truth label is class i while they are predicted as class j. Here, p_{ii} corresponds to the true positives (TP), while false negatives (FN) and false positives (FP) correspond to p_{ij} and p_{ji}, respectively.
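Equations (4) and (5) can be computed from a class confusion matrix as in the sketch below; accumulating the matrix over the test set is omitted.

```python
import numpy as np

def miou_mpa(conf):
    """conf is a (K+1) x (K+1) confusion matrix with conf[i, j] = number of pixels whose
    ground-truth class is i and predicted class is j."""
    tp = np.diag(conf).astype(np.float64)          # p_ii
    gt = conf.sum(axis=1).astype(np.float64)       # sum_j p_ij: pixels labelled as class i
    pred = conf.sum(axis=0).astype(np.float64)     # sum_j p_ji: pixels predicted as class i
    iou = tp / (gt + pred - tp + 1e-10)            # per-class IoU, as in Equation (4)
    pa = tp / (gt + 1e-10)                         # per-class pixel accuracy, as in Equation (5)
    return iou.mean(), pa.mean()
```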

4.4. Results and Analysis

4.4.1. Comparative Analysis

To assess the performance of the proposed model in the urban semantic segmentation task, this section compares it with PSPNet [27], DeepLabV3+ [28], DFANet [29], DANNet [30], and SERNet-Former [31] on the CamVid dataset and Cityscapes dataset. The comparison results are systematically summarized in Table 1 and Table 2.
As evidenced by the results presented in Table 1, the proposed model achieves a notable mIoU of 77.8% on the CamVid dataset. This performance represents a clear improvement over established baselines such as PSPNet, DeepLabV3+, DFANet, and DANNet. Nevertheless, a significant performance gap persists when compared to SERNet-Former, indicating considerable potential for further enhancement in its segmentation accuracy. These findings collectively substantiate the effectiveness and robustness of the proposed approach for urban scene semantic segmentation tasks.
As evidenced in Table 2, the proposed model achieves a competitive mIoU of 81.9% on the Cityscapes dataset, outperforming DeepLabV3+ and PSPNet by 2.7% and 4.9%, respectively. This performance substantiates AMFFNet’s technical superiority in complex urban scene parsing, primarily attributed to its architectural innovations: the MFEM module’s depthwise strip convolution enriches multi-scale feature extraction, the AFM adaptively fuses shallow and deep features, and the ECA module enhances critical feature channels—collectively elevating segmentation accuracy. Nevertheless, a considerable performance gap persists when benchmarked against SERNet-Former on this dataset.

4.4.2. Ablation Study

To quantitatively isolate the contribution of each module and avoid misattributing performance gains, we conduct ablation studies in which the MFEM, AFM, and ECA are integrated incrementally. The experiments adopt a Dilated ResNet-50 backbone architecture, with comprehensive evaluations conducted on the Cityscapes dataset. Quantitative comparisons of the contributions to segmentation accuracy, measured through mIoU and mPA, are systematically tabulated in Table 3.
An analysis of the results in Table 3 reveals that using only the backbone network for feature extraction yields a mIoU of 77.9%. When the MFEM is incorporated, the mIoU increases to 79.8%, indicating that the MFEM has superior capabilities in hierarchical feature aggregation and contextual relationship modeling, which collectively advance semantic segmentation performance. Introducing the AFM alone raises the mIoU to 80.0%, demonstrating its efficacy in optimizing cross-scale feature aggregation and improving segmentation robustness within urban environments. The addition of the ECA attention mechanism alone results in an mIoU of 78.3%, which confirms the module’s capacity to strengthen the representation of key semantic regions through cross-channel interaction and nonlinear activation. When all three modules—the MFEM, the AFM, and the ECA module—are combined, the mIoU reaches 81.9%, highlighting the effectiveness of their collaborative optimization strategy. Through multi-scale feature aggregation, dynamic semantic fusion, and key channel enhancement, the integrated approach significantly boosts the model’s robustness and detail perception in urban scene segmentation tasks.
Since shallow features contain a significant amount of redundant information, using ECA for spatial feature extraction before applying the AFM for feature fusion enables the model to discard some of the less important redundant information, thereby improving its ability to capture spatial details. By weighted fusion of features from different layers, the model can more effectively emphasize important features while suppressing irrelevant information, which can improve semantic segmentation performance. Moreover, the introduction of the AFM also strengthens the model’s sensitivity to details, particularly in the presence of complex backgrounds and multi-scale objects.

4.4.3. Visual Comparison

To validate the effectiveness of the proposed model in urban image segmentation, Figure 5 presents a visual comparison of the predicted images on the Cityscapes dataset, using PSPNet, DeepLabv3+, DFANet, DANNet, and the proposed network.
Analysis of the visual comparison results shows that DFANet exhibits the poorest segmentation performance, while the proposed model achieves the most accurate results. Specifically, in the first column, DFANet misclassifies the power poles in the first image and produces blurred road boundaries in the third image, indicating that it handles small objects and edges poorly and suffers from more misclassification than the other methods. In contrast, the proposed model not only recognizes and distinguishes objects accurately but also classifies small objects such as power poles correctly; its segmentation of slender objects is noticeably better, and object edges are smoother.

4.4.4. Generalization Validation

To comprehensively verify the generalization ability of the AMFFNet, urban images captured with a mobile phone were used for testing. These urban images have a resolution of 4092 px × 3072 px, providing rich detail information that is beneficial for assessing the model’s performance in real-world applications. During image collection, we deliberately selected different shooting angles to simulate the variable environmental conditions of the real world. The corresponding experimental results are displayed in Figure 6.
As illustrated in the figure, whether the urban images are captured head-on, from the side, or at an oblique angle, the model is able to effectively identify and segment key elements, such as vehicles, pedestrians, and road signs. These results demonstrate that the model not only performs excellently on standard datasets but also exhibits strong generalization ability in complex real-world scenarios. Testing in a variety of real-world environments further confirms the model’s practicality and reliability.

5. Conclusions

This article proposes AMFFNet, which significantly enhances model performance through the integration of the MFEM, the AFM, the ECA attention mechanism, and an auxiliary supervision head. The MFEM extracts multi-scale features using multiple branches with different receptive fields, while the AFM adopts a novel dynamic adjustment strategy to adaptively reweight features in real time, effectively coping with the complexity of varying urban scenes. This strategy enables the network to automatically optimize the feature fusion process when dealing with images containing different levels of detail and scale, thereby improving semantic segmentation accuracy. Additionally, the incorporation of the ECA attention mechanism enhances the network’s focus on important regions within an image, further boosting segmentation performance. The auxiliary head injects intermediate supervision signals into the shallow layers to strengthen gradient propagation and multi-scale feature learning, thereby improving the model’s capability to capture intricate details. Our experimental results indicate that AMFFNet reaches high segmentation accuracy on the CamVid and Cityscapes datasets, outperforming most of the compared methods in both accuracy and robustness. Furthermore, when tested on real-world images outside the datasets, the network still produces semantically correct segmentation, indicating its strong ability to handle urban image semantic segmentation tasks. However, based on the experimental results, the proposed network still lags behind state-of-the-art (SOTA) models. This performance gap may stem from its relatively less advanced backbone architecture and the absence of skip connections in intermediate layers. Subsequent research will prioritize enhancing segmentation accuracy while reducing computational complexity, ultimately achieving optimal model performance.

Author Contributions

Conceptualization, S.H.; methodology, H.H.; software, S.H.; validation, S.H. and H.H.; formal analysis, S.H. and H.H.; investigation, H.H.; resources, S.H.; data curation, S.H.; writing—original draft preparation, S.H.; writing—review and editing, S.H.; visualization, S.H. and H.H.; supervision, H.H.; project administration, S.H. and H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bardis, G. A Declarative Modeling Framework for Intuitive Multiple Criteria Decision Analysis in a Visual Semantic Urban Planning Environment. Electronics 2024, 13, 4845. [Google Scholar] [CrossRef]
  2. Zhao, Y.; Wang, L.; Yun, X.; Chai, C.; Liu, Z.; Fan, W.; Luo, X.; Liu, Y.; Qu, X. Enhanced Scene Understanding and Situation Awareness for Autonomous Vehicles Based on Semantic Segmentation. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6537–6549. [Google Scholar] [CrossRef]
  3. Zhang, X.; Chen, Y.; Han, W.; Chen, X.; Wang, S. Fine Mapping of Hubei Open Pit Mines via a Multi-Branch Global–Local-Feature-Based ConvFormer and a High-Resolution Benchmark. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104111. [Google Scholar] [CrossRef]
  4. Tan, G.; Jin, Y. A Semantic Segmentation Method for Road Sensing Images Based on an Improved PIDNet Model. Electronics 2025, 14, 871. [Google Scholar] [CrossRef]
  5. Zhang, M.; Liu, C.; Li, Z.; Yin, B. From Convolutional Networks to Vision Transformers: Evolution of Deep Learning in Agricultural Pest and Disease Identification. Agronomy 2025, 15, 1079. [Google Scholar] [CrossRef]
  6. Xiao, X.; Zhang, J.; Shao, Y.; Liu, J.; Shi, K.; He, C.; Kong, D. Deep Learning-Based Medical Ultrasound Image and Video Segmentation Methods: Overview, Frontiers, and Challenges. Sensors 2025, 25, 2361. [Google Scholar] [CrossRef]
  7. Duan, Y.; Yang, R.; Zhao, M.; Qi, M.; Peng, S.-L. DAF-UNet: Deformable U-Net with Atrous-Convolution Feature Pyramid for Retinal Vessel Segmentation. Mathematics 2025, 13, 1454. [Google Scholar] [CrossRef]
  8. Berka, A.; Es-Saady, Y.; Hajji, M.E.; Canals, R.; Hafiane, A. Enhancing DeepLabV3+ for Aerial Image Semantic Segmentation Using Weighted Upsampling. In Proceedings of the 2024 IEEE 12th International Symposium on Signal, Image, Video and Communications (ISIVC), Marrakech, Morocco, 21–23 May 2024; pp. 1–6. [Google Scholar]
  9. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  10. Chen, C.; Zhao, L.; Guo, W.; Yuan, X.; Tan, S.; Hu, J.; Yang, Z.; Wang, S.; Ge, W. FARVNet: A Fast and Accurate Range-View-Based Method for Semantic Segmentation of Point Clouds. Sensors 2025, 25, 2697. [Google Scholar] [CrossRef]
  11. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 640–651. [Google Scholar]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  13. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  14. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  15. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. PointRend: Image Segmentation As Rendering. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9796–9805. [Google Scholar]
  16. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 173–190. ISBN 978-3-030-58538-9. [Google Scholar]
  17. Arulananth, T.S.; Kuppusamy, P.G.; Ayyasamy, R.K.; Alhashmi, S.M.; Mahalakshmi, M.; Vasanth, K.; Chinnasamy, P. Semantic Segmentation of Urban Environments: Leveraging U-Net Deep Learning Model for Cityscape Image Analysis. PLoS ONE 2024, 19, e0300767. [Google Scholar] [CrossRef]
  18. Jin, Z.; Dou, F.; Feng, Z.; Zhang, C. BSNet: A Bilateral Real-Time Semantic Segmentation Network Based on Multi-Scale Receptive Fields. J. Vis. Commun. Image Represent. 2024, 102, 104188. [Google Scholar] [CrossRef]
  19. Nan, G.; Li, H.; Du, H.; Liu, Z.; Wang, M.; Xu, S. A Semantic Segmentation Method Based on AS-Unet++ for Power Remote Sensing of Images. Sensors 2024, 24, 269. [Google Scholar] [CrossRef]
  20. Tong, X.; Wei, J.; Guo, R.; Yang, C. CSAFNet: Channel Spatial Attention Fusion Network for RGB-T Semantic Segmentation. In Proceedings of the 2022 International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM), Xiamen, China, 5–7 August 2022; pp. 339–345. [Google Scholar]
  21. Liu, J.; Chen, H.; Li, Z.; Gu, H. Multi-Scale Frequency-Spatial Domain Attention Fusion Network for Building Extraction in Remote Sensing Images. Electronics 2024, 13, 4642. [Google Scholar] [CrossRef]
  22. Wu, L.; Qiu, S.; Chen, Z. Real-Time Semantic Segmentation Network Based on Parallel Atrous Convolution for Short-Term Dense Concatenate and Attention Feature Fusion. J. Real Time Image Proc. 2024, 21, 74. [Google Scholar] [CrossRef]
  23. Shen, Z.; Wang, J.; Weng, Y.; Pan, Z.; Li, Y.; Wang, J. ECFNet: Efficient Cross-Layer Fusion Network for Real Time RGB-Thermal Urban Scene Parsing. Digit. Signal Process. 2024, 151, 104579. [Google Scholar] [CrossRef]
  24. Meng, W.; Shan, L.; Ma, S.; Liu, D.; Hu, B. DLNet: A Dual-Level Network with Self- and Cross-Attention for High-Resolution Remote Sensing Segmentation. Remote Sens. 2025, 17, 1119. [Google Scholar] [CrossRef]
  25. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  26. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  27. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017. [Google Scholar]
  28. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  29. Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9514–9523. [Google Scholar]
  30. Wu, X.; Wu, Z.; Guo, H.; Ju, L.; Wang, S. DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Nashville, TN, USA, 2021; pp. 15764–15773. [Google Scholar]
  31. Erisen, S. SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks. arXiv 2024, arXiv:2401.15741. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the AMFFNet network.
Figure 2. Structural illustration of the MFEM.
Figure 3. Architecture of the AFM.
Figure 4. ECA attention mechanism.
Figure 5. Visual comparison of different models.
Figure 6. Visualization of urban image semantic segmentation in random scenes. (a) The original image; (b) the overlay of the original and predicted images; (c) the predicted image.
Table 1. Comparison of different semantic segmentation networks on the CamVid dataset.

Model | Backbone Network | mIoU (%)
PSPNet | ResNet-101 | 73.4
DeepLabV3+ | ResNet-50 | 71.7
DFANet | Xception | 75.7
DANNet | ResNet-101 | 76.9
SERNet-Former | EfficientResNet | 84.6
Ours | Dilated ResNet-50 | 77.8
Table 2. Comparison of different semantic segmentation networks on the Cityscapes dataset.

Model | Backbone Network | mIoU (%)
PSPNet | ResNet-101 | 77.0
DeepLabV3+ | ResNet-50 | 79.2
DFANet | Xception | 70.3
DANNet | ResNet-101 | 79.9
SERNet-Former | EfficientResNet | 87.4
Ours | Dilated ResNet-50 | 81.9
Table 3. Impact of different modules on mIoU and mPA in the Cityscapes dataset.

Baseline | MFEM | AFM | ECA | mIoU (%) | mPA (%)
77.9 | 87.3
79.8 | 88.5
79.0 | 88.4
78.3 | 87.9
81.2 | 89.7
80.8 | 88.5
80.0 | 88.4
81.9 | 90.7
✓: Enabled; ✗: Disabled.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
