3.2. MANet
Figure 3 shows the overall architecture of MANet. It adopts a typical U-Net structure consisting of a MixTransformer-B2 backbone [33], the DFIM, and a Hierarchical Decoder. Here, {r_1, r_2, r_3, r_4} and {t_1, t_2, t_3, t_4} are the multi-scale features extracted by the backbone network from the RGB images and thermal images, respectively. These features are fed into the DFIM to obtain the fused features {m_1, m_2, m_3, m_4}, which are finally processed by the hierarchical decoder to produce the prediction result p_1.
DFIM: In road semantic segmentation tasks, imbalanced utilization of multi-modal information causes information redundancy and inter-modal interference [34]. Therefore, we design the Dynamic Feature Integration Module (DFIM), which adaptively assigns weights via dual attention to highlight key feature dimensions and spatial positions, captures multi-scale contextual information using dilated convolutions, and adjusts feature responses through dynamic enhancement. In this way, the module selectively combines complementary information from the RGB and thermal modalities while suppressing redundant features, and integrates fine-grained details with global context across multiple scales.
Specifically, the DFIM first processes r_i and t_i through independent convolutional layers to unify their dimensions, and then applies cascaded dual attention to fuse them into f_i. The dual attention consists of feature-dimension attention, which uses global pooling and convolutional networks to highlight important dimensions, and position attention, which generates spatial weight maps to emphasize key positions. These attention weights act sequentially on the concatenated multi-modal features, enabling comprehensive selection across both feature dimensions and spatial positions. The specific expressions are as follows:
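One plausible compact form of this fusion, consistent with the operators defined below (the exact composition may differ), is:

\[
f_i = \mathrm{DualAttention}\big(\mathrm{Cat}\big(\mathrm{Conv}_3(r_i),\ \mathrm{Conv}_3(t_i)\big)\big), \qquad i = 1, 2, 3, 4
\]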
where DualAttention(·) represents the dual attention operation, Cat(·) represents the concatenation operation, and Conv3(·) represents the 3 × 3 convolution operation.
The fused feature f_i then undergoes multi-scale processing through four parallel dilated convolution branches with dilation rates of 1, 2, 4, and 8 [35]. Each branch extracts 1/4 of the output channels, and the features from the different dilation rates are fused through concatenation and convolution to obtain the feature s_i with rich receptive fields. This design enables the module to capture local details and global contextual information simultaneously, forming rich multi-scale feature representations. The specific expression is as follows:
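A plausible formulation of this multi-scale fusion, consistent with the operators above, is:

\[
s_i = \mathrm{Conv}_3\Big(\mathrm{Cat}\big(\mathrm{AtrousConv}_{d=1}(f_i),\ \mathrm{AtrousConv}_{d=2}(f_i),\ \mathrm{AtrousConv}_{d=4}(f_i),\ \mathrm{AtrousConv}_{d=8}(f_i)\big)\Big)
\]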
where AtrousConv(·) represents the dilated convolution operation.
Finally, s_i enters the Dynamic Enhancement Branch, which extracts the global information g_i of the features through global adaptive average pooling and dynamically generates the attention weight w and bias b from s_i. A parameterized convolution [36] is then used to process s_i, and the result is multiplied with g_i to obtain the fused feature m_i. This enables the module to dynamically adjust the feature response intensity according to different input scenarios, thereby improving the discriminative ability of the features. The specific expression is as follows:
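A plausible form of this branch, writing GAP(·) for global adaptive average pooling and ⊗ for element-wise multiplication (notation assumed), is:

\[
g_i = \mathrm{GAP}(s_i), \qquad (w, b) = \mathrm{DynamicEnhancement}(s_i), \qquad m_i = \mathrm{ParaConv}(s_i;\ w, b) \otimes g_i
\]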
where DynamicEnhancement(·) represents the dynamic enhancement branch, which consists of global average pooling, convolution, ReLU, and sigmoid activation; w and b denote learnable parameters; and ParaConv(·) represents the parameterized convolution.
Hierarchical Decoder: To fully utilize the fused features {m_1, m_2, m_3, m_4} output by the DFIM, this study designs a hierarchical decoder that reconstructs high-resolution output through progressive upsampling and feature fusion. First, each fused feature m_i undergoes convolution, batch normalization, ReLU activation, and upsampling to obtain the preliminary reconstructed feature g_i. Multi-scale features are then progressively integrated through a two-stage fusion mechanism: the first stage concatenates and convolutionally fuses features from adjacent scales to obtain the intermediate fused features c_j, and the second stage further fuses these intermediate features to generate the final segmentation prediction map p_1. This hierarchical decoding approach not only effectively recovers spatial resolution but also preserves semantic information from different scales, ensuring the accuracy and completeness of the segmentation results. The specific formulas are as follows:
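One plausible instantiation of the two fusion stages, assuming that adjacent scales are fused pairwise (the exact grouping is an assumption), is:

\[
g_i = \mathrm{CBRU}(m_i), \qquad c_j = \mathrm{Conv}_3\big(\mathrm{Cat}(g_{2j-1},\ g_{2j})\big),\ j = 1, 2, \qquad p_1 = \mathrm{Conv}_3\big(\mathrm{Cat}(c_1,\ c_2)\big)
\]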
where CBRU(·) consists of convolution, batch normalization, ReLU, and upsampling.
3.3. EGNet
Figure 4 shows the overall architecture of EGNet. The network first inputs the RGB and thermal data into a pre-trained DFormer-Base encoder [37] to capture modality-specific and hierarchical semantic features, denoted {R_1, R_2, R_3, R_4} and {T_1, T_2, T_3, T_4}, respectively. Among these, R_1 and T_1 are fed into the first UFM for cross-modal interaction and enhancement, yielding the fused representation f_1. Meanwhile, {R_2, R_3, R_4} and {T_2, T_3, T_4} are element-wise added to the corresponding outputs of the previous UFM layer to produce {R′_2, R′_3, R′_4} and {T′_2, T′_3, T′_4}, which are then progressively input into subsequent UFMs to generate {f_2, f_3, f_4}. Finally, this study inputs {f_1, f_2, f_3, f_4} into the Hierarchical Decoder to generate the corresponding segmentation masks {s_1, s_2, s_3, s_4}. It is worth noting that the decoder adopts the same Hierarchical Decoder as MANet, so it is not described again.
UFM: In RGB-T semantic segmentation, convolutional networks often struggle to capture fine-grained structural details (e.g., edges and textures) in complex urban scenes. To address this limitation, inspired by hybrid approaches in medical image boundary detection and remote sensing [19,33], we propose the Unified Feature Module (UFM), which integrates Sobel and Gabor filters to introduce deterministic edge and texture priors, effectively enhancing the representation of structural details in RGB-T feature fusion.
To semantically align, interact, and refine the RGB features R_i and thermal features T_i extracted by the backbone network, the UFM first processes the input features R_i and T_i through a downsampling convolution module, which contains a 3 × 3 convolution layer, batch normalization, and a LeakyReLU activation function:
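A plausible form of this step, writing R_i^d and T_i^d for the downsampled features (notation assumed), is:

\[
R_i^{d} = \mathrm{LeakyReLU}\big(\mathrm{BN}(\mathrm{Conv}_3(R_i))\big), \qquad T_i^{d} = \mathrm{LeakyReLU}\big(\mathrm{BN}(\mathrm{Conv}_3(T_i))\big)
\]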
Subsequently, cross-modal feature representation is generated through element-wise multiplication to compute the interaction between RGB and thermal features:
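Denoting element-wise multiplication by ⊗ and using the downsampled features from the previous step, the interaction feature M_RD referenced below can plausibly be written as:

\[
M_{RD} = R_i^{d} \otimes T_i^{d}
\]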
Next, a pixel-level weighting strategy (PW) is used to process the cross-modal interaction feature M_RD, generating a pixel-level weighting map that focuses on key regions of the cross-modal interaction through convolution and sigmoid activation. Building on this pixel-level enhancement, a feature-level weighting module (FW) further generates feature-level weights through global average pooling and fully connected layers, thereby achieving cross-modal feature alignment and producing the enhanced RGB feature through the following operations:
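One plausible instantiation, writing the enhanced RGB feature as R̂_i and including a residual connection (both are assumptions), is:

\[
\hat{R}_i = R_i^{d} \otimes \mathrm{PW}(M_{RD}) \otimes \mathrm{FW}(M_{RD}) + R_i^{d}, \qquad \mathrm{PW}(\cdot) = \sigma\big(\mathrm{Conv}(\cdot)\big), \quad \mathrm{FW}(\cdot) = \sigma\big(\mathrm{FC}(\mathrm{GAP}(\cdot))\big)
\]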
To further enhance the model's capability in detecting object boundaries and details, this study integrates classical image priors (edges and textures) into the UFM: the Sobel operator provides mathematically grounded gradient computation, aligning with the inherent advantage of thermal imaging in boundary representation, while the frequency and directional selectivity of Gabor filters precisely captures the rich textural characteristics of RGB images. Specifically, fixed 3 × 3 Sobel filters are applied to the enhanced RGB feature in both the x and y directions to capture the edge intensity gradients G_i, and the edge-enhanced features E_i are then obtained through 1 × 1 convolution, sigmoid gating, and a residual operation:
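A plausible form of this edge enhancement, reusing the assumed symbol R̂_i from above, is:

\[
G_i = \sqrt{\big(\mathrm{Sobel}_x * \hat{R}_i\big)^{2} + \big(\mathrm{Sobel}_y * \hat{R}_i\big)^{2}}, \qquad
E_i = \hat{R}_i + \hat{R}_i \otimes \sigma\big(\mathrm{Conv}_1(G_i)\big)
\]

where * denotes convolution with the fixed Sobel kernels.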
where σ represents the sigmoid operation and Conv1(·) represents the 1 × 1 convolution operation.
Next, Gabor filtering is applied to the edge-enhanced features E_i along the channel dimension to capture complex texture patterns through multi-directional texture analysis. The texture features from all directions are then concatenated and processed through a 1 × 1 convolution to obtain the texture-enhanced features Texture_i:
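One plausible formulation, using the standard real Gabor kernel with N orientations (N and the scale symbol σ_g are our assumptions), is:

\[
g_{\theta}(x, y) = \exp\!\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma_g^{2}}\right)\cos\!\left(2\pi \frac{x'}{\lambda}\right), \qquad
\mathrm{Texture}_i = \mathrm{ConvCat}\big(E_i * g_{\theta_1},\ \ldots,\ E_i * g_{\theta_N}\big)
\]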
where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ; λ represents the wavelength, θ represents the direction, the scale parameter controls the extent of the Gaussian envelope, and γ represents the spatial aspect ratio. ConvCat(·) represents the joint operation of concatenation and convolution, while Cat(·) represents the concatenation operation alone.
Finally, the obtained edge features E_i and texture features Texture_i are concatenated with the cross-modal aggregated feature, and the result is processed through multiple parallel dilated convolutions (with dilation rates of 1, 2, and 3, respectively) to expand the receptive field:
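A plausible form of this final step, taking the enhanced RGB feature R̂_i as the cross-modal aggregated feature and writing F_i for the concatenated input and f_i for the UFM output (all assumptions), is:

\[
F_i = \mathrm{Cat}\big(E_i,\ \mathrm{Texture}_i,\ \hat{R}_i\big), \qquad
f_i = \mathrm{Conv}_3\Big(\mathrm{Cat}\big(\mathrm{AtrousConv}_{d=1}(F_i),\ \mathrm{AtrousConv}_{d=2}(F_i),\ \mathrm{AtrousConv}_{d=3}(F_i)\big)\Big)
\]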
This hybrid strategy enables comprehensive exploitation of boundary-discriminative information from thermal infrared modality and texture-discriminative information from RGB modality, thereby achieving more precise cross-modal semantic alignment at the feature level. Although the incorporation of fixed filters introduces additional computational overhead during training, it significantly improves segmentation accuracy and generalization capability, striking an effective balance between performance gains and computational costs.
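To make the edge and texture priors concrete, the following PyTorch sketch applies fixed depthwise Sobel and Gabor kernels to a feature map and fuses their responses with 1 × 1 convolutions. The module name, the orientation count, the Gabor hyperparameters, and the sigmoid-gated residual are illustrative assumptions rather than the exact UFM implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeTexturePrior(nn.Module):
    """Illustrative sketch of fixed Sobel/Gabor priors; names and
    hyperparameters are ours, not taken from the paper."""

    def __init__(self, channels, num_orientations=4, ksize=7,
                 sigma=2.0, lambd=4.0, gamma=0.5):
        super().__init__()
        self.ksize = ksize
        # Fixed 3x3 Sobel kernels (x and y), applied depthwise to every channel.
        sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sobel", torch.stack([sx, sx.t()]).unsqueeze(1))  # (2,1,3,3)
        # Fixed Gabor kernels at evenly spaced orientations, also applied depthwise.
        thetas = [i * math.pi / num_orientations for i in range(num_orientations)]
        bank = torch.stack([self._gabor(ksize, sigma, t, lambd, gamma) for t in thetas])
        self.register_buffer("gabor", bank.unsqueeze(1))                       # (O,1,k,k)
        self.edge_gate = nn.Conv2d(channels, channels, kernel_size=1)
        self.tex_proj = nn.Conv2d(channels * num_orientations, channels, kernel_size=1)

    @staticmethod
    def _gabor(ksize, sigma, theta, lambd, gamma):
        half = ksize // 2
        y, x = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                              torch.arange(-half, half + 1, dtype=torch.float32),
                              indexing="ij")
        xp = x * math.cos(theta) + y * math.sin(theta)
        yp = -x * math.sin(theta) + y * math.cos(theta)
        return torch.exp(-(xp ** 2 + (gamma * yp) ** 2) / (2 * sigma ** 2)) \
            * torch.cos(2 * math.pi * xp / lambd)

    def forward(self, feat):
        b, c, h, w = feat.shape
        flat = feat.reshape(b * c, 1, h, w)
        # Depthwise Sobel gradients -> gradient magnitude.
        grads = F.conv2d(flat, self.sobel, padding=1).reshape(b, c, 2, h, w)
        g = torch.sqrt(grads.pow(2).sum(dim=2) + 1e-6)
        # Edge enhancement via sigmoid-gated residual, as described in the text.
        edge = feat + feat * torch.sigmoid(self.edge_gate(g))
        # Depthwise Gabor responses at all orientations, fused by a 1x1 conv.
        tex = F.conv2d(edge.reshape(b * c, 1, h, w), self.gabor, padding=self.ksize // 2)
        tex = self.tex_proj(tex.reshape(b, c * self.gabor.shape[0], h, w))
        return torch.cat([edge, tex], dim=1)  # (B, 2C, H, W)
```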
3.4. ACML
To address the challenges of modal disparities and the need for effective integration of global semantics and local boundary features in RGB-T urban scene semantic segmentation, this study proposes the ACML framework that redefines multi-modal optimization. Unlike traditional approaches reliant on fixed alignment modules or extensive hyperparameter tuning, ACML leverages dynamic, difference-based mutual learning to enable bidirectional knowledge transfer without parameter dependencies.
As shown in Figure 5, in the ACML mutual learning framework of this study, MANet and EGNet each receive the RGB and thermal inputs, extract the features {r_1, r_2, r_3, r_4}, {t_1, t_2, t_3, t_4} and {r′_1, r′_2, r′_3, r′_4}, {t′_1, t′_2, t′_3, t′_4} through their respective encoders, and output the respective predictions p_1 and p′_1 through their decoders.
First, this study proposes an adaptive alignment strategy based on feature differences. By quantifying the differences in feature distributions between MANet and EGNet at the encoder stage, a dynamic modal complementarity optimization process is constructed. From an information-theoretic perspective, this strategy uses norm differences in feature distributions to dynamically enhance intra-class pixel semantic cohesion. It suppresses the representation drift caused by modal heterogeneity, resolving the inconsistent intra-class pixel semantics of traditional fusion methods. Specifically, this study first calculates the element-wise differences between MANet's RGB features r_i and thermal infrared features t_i and EGNet's corresponding features r′_i and t′_i, compresses the difference features to reduce computational complexity while preserving intra-class semantic patterns and inter-class boundary information, and enhances the spatial correlation of the difference feature space. Subsequently, to dynamically balance the modal contributions, this study calculates the L2 norms n_{r,i} and n_{t,i} of the compressed differences:
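A plausible form of these norms, writing d_{r,i} and d_{t,i} for the compressed difference features (notation assumed) and averaging over all tensor elements, is:

\[
n_{r,i} = \sqrt{\frac{1}{BCHW}\sum_{b=1}^{B}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} d_{r,i}(b, c, h, w)^{2}}, \qquad
n_{t,i} = \sqrt{\frac{1}{BCHW}\sum_{b=1}^{B}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} d_{t,i}(b, c, h, w)^{2}}
\]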
where B, C, H, and W represent the batch size, number of channels, height, and width, respectively, and b, c, h, and w are the corresponding indices.
Then, preliminary weights are generated through sigmoid activation, with the norms quantifying the difference intensity, thereby ensuring that pixel-level semantic consistency takes precedence over modal interference. The weights are subsequently normalized, and the difference intensity is measured through a weighted mean squared error to obtain the feature difference loss Loss_feat:
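One plausible instantiation of the normalized weights and the resulting loss (the exact weighting scheme is an assumption consistent with the description above) is:

\[
w_{r,i} = \frac{\sigma(n_{r,i})}{\sigma(n_{r,i}) + \sigma(n_{t,i}) + \varepsilon}, \qquad
w_{t,i} = \frac{\sigma(n_{t,i})}{\sigma(n_{r,i}) + \sigma(n_{t,i}) + \varepsilon},
\]
\[
\mathrm{Loss}_{feat} = \sum_{i=1}^{4}\Big(w_{r,i}\,\mathrm{MSE}\big(r_i, r'_i\big) + w_{t,i}\,\mathrm{MSE}\big(t_i, t'_i\big)\Big)
\]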
where ε is used to prevent division-by-zero errors.
Secondly, this study designs an adaptive consistency operation based on the entropy of the prediction distribution. This mechanism uses entropy to dynamically adjust the alignment weight between the prediction maps p_1 and p′_1 of MANet and EGNet, thereby optimizing consistency. High entropy values indicate uncertainty at object boundaries or semantic transitions and suppress strict alignment to prevent error propagation; conversely, low entropy values reflect confident and robust predictions and promote reliable alignment. Specifically, the prediction map p_1 of MANet and the corresponding prediction map p′_1 of EGNet are converted into soft prediction distributions through the softmax function with temperature T:
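Writing the resulting class-wise soft distributions as q and q′ (symbols assumed), this step can be expressed as:

\[
q = \mathrm{Softmax}\!\left(\frac{p_1}{T}\right), \quad q' = \mathrm{Softmax}\!\left(\frac{p'_1}{T}\right), \quad \text{i.e.,}\quad
q_{k} = \frac{\exp(p_{1,k}/T)}{\sum_{j=1}^{K}\exp(p_{1,j}/T)},\ k = 1, \ldots, K
\]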
where Softmax(·) denotes the softmax activation function, T = 2 is set following established practice in mutual learning frameworks [29,30], and K represents the number of classes.
Then, the average entropy H(p) of the predicted distributions of the two networks is calculated to reflect the uncertainty of the predictions; a higher entropy requires a greater weight to guide the learning process. Subsequently, the entropy-based adaptive weight w_pred is used to balance the reliability of the predictions and enhance intra-class semantic consistency:
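One plausible form of the average entropy, with N denoting the number of pixels per prediction map (our notation), is:

\[
H(p) = -\frac{1}{2N}\sum_{n=1}^{N}\sum_{k=1}^{K}\big(q_{n,k}\log q_{n,k} + q'_{n,k}\log q'_{n,k}\big)
\]

with w_pred then obtained as a monotone function of H(p); the specific mapping is a design choice of the framework.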
After that, the Kullback–Leibler (KL) divergence between the prediction distributions of the two networks is calculated in both directions, so that the inter-class boundary probabilities converge and the boundary clarity improves. Finally, the divergences are weighted by w_pred to obtain the adaptive consistency loss based on the prediction distribution entropy, Loss_pred:
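A plausible form of this loss, with the temperature compensation factor T² included as is customary in distillation-style objectives (an assumption), is:

\[
\mathrm{Loss}_{pred} = w_{pred} \cdot T^{2} \cdot \tfrac{1}{2}\Big(\mathrm{KL}\big(q \,\|\, q'\big) + \mathrm{KL}\big(q' \,\|\, q\big)\Big)
\]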
3.5. Theoretical Analysis
When directly combining DFIM and UFM, optimization conflicts arise due to their distinct processing mechanisms [
16,
17]. DFIM employs dynamic feature selection through learnable attention mechanisms, adapting to input-dependent feature distributions [
38], while UFM utilizes fixed image priors derived from Sobel and Gabor filters with cross-modal weighting strategies [
39]. This architectural difference creates fundamental conflicts in their gradient optimization pathways, similar to those observed in multi-task learning scenarios [
40].
The conflict manifests mathematically in their gradient optimization. Let Loss_D and Loss_U represent DFIM's and UFM's losses, respectively. Direct combination yields L_combined = Loss_D + Loss_U, where ∇Loss_D optimizes for adaptive feature weighting while ∇Loss_U optimizes for fixed prior integration. These contradictory gradient directions create what is known as "gradient interference" [41,42], leading to unstable training dynamics and performance degradation.
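As a rough diagnostic of this effect, one can compare the gradient directions that the two losses induce on any parameters they share under direct combination. The helper below is an illustrative sketch (the function name and interface are ours, not from the paper); a negative cosine similarity signals conflicting updates.

```python
import torch
import torch.nn.functional as F


def gradient_conflict(loss_d, loss_u, shared_params):
    """Cosine similarity between the gradients of two losses w.r.t. shared parameters.

    Values near -1 indicate the "gradient interference" discussed above: the two
    objectives pull the shared parameters in opposing directions.
    """
    shared_params = list(shared_params)
    g_d = torch.autograd.grad(loss_d, shared_params, retain_graph=True, allow_unused=True)
    g_u = torch.autograd.grad(loss_u, shared_params, retain_graph=True, allow_unused=True)
    # Keep only parameters that receive a gradient from both losses.
    pairs = [(a, b) for a, b in zip(g_d, g_u) if a is not None and b is not None]
    flat_d = torch.cat([a.flatten() for a, _ in pairs])
    flat_u = torch.cat([b.flatten() for _, b in pairs])
    return F.cosine_similarity(flat_d, flat_u, dim=0)
```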
Our mutual learning framework resolves this by separating the conflicting modules into specialized networks while enabling knowledge exchange through ACML. This approach follows the principle of "divide-and-conquer" optimization [43], eliminating direct parameter conflicts while maintaining collaborative learning through feature-level knowledge distillation [44]. The framework ensures stable convergence by preserving the distinct optimization characteristics of each module while facilitating cross-network collaboration.