3.2. MANet
Figure 3 shows the overall architecture of MANet. It adopts a typical U-Net structure consisting of a MixTransformer-B2 backbone [33], the DFIM, and a Hierarchical Decoder. Here, {r_1, r_2, r_3, r_4} and {t_1, t_2, t_3, t_4} are the multi-scale features extracted by the backbone network from the RGB images and thermal images, respectively. These features are fed into the DFIM to obtain the fused features {m_1, m_2, m_3, m_4}, which are finally processed by the hierarchical decoder to produce the prediction result p_1.
DFIM: In road semantic segmentation tasks, imbalanced utilization of multi-modal information causes information redundancy and inter-modal interference [34]. Therefore, we design the Dynamic Feature Integration Module (DFIM), which adaptively assigns weights via dual attention to highlight key feature dimensions and spatial positions, captures multi-scale contextual information using dilated convolutions, and adjusts feature responses through dynamic enhancement. In this way, the module selectively combines complementary information from the RGB and thermal modalities while suppressing redundant features, and integrates fine-grained details with global context across multiple scales.
Specifically, the DFIM first processes r_i and t_i through independent convolutional layers to unify their dimensions, and then applies cascaded dual attention to fuse them into f_i. The dual attention consists of feature-dimension attention, which uses global pooling and convolutional networks to highlight important dimensions, and position attention, which generates spatial weight maps to emphasize key positions. These attention weights act sequentially on the concatenated multi-modal features, enabling comprehensive selection across both feature dimensions and spatial positions. The specific expressions are as follows:
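One plausible compact form of this fusion, consistent with the operators defined below (the exact composition may differ), is:

\[
f_i = \mathrm{DualAttention}\big(\mathrm{Cat}\big(\mathrm{Conv}_3(r_i),\ \mathrm{Conv}_3(t_i)\big)\big), \qquad i = 1, 2, 3, 4
\]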
where DualAttention(·) represents the dual attention operation, Cat(·) represents the concatenation operation, and Conv3(·) represents the 3 × 3 convolution operation.
The fused feature f_i then undergoes multi-scale processing through four parallel dilated convolution branches with dilation rates of 1, 2, 4, and 8 [35]. Each branch extracts 1/4 of the output channels, and the features from the different dilation rates are fused through concatenation and convolution to obtain the feature s_i with rich receptive fields. This design enables the module to capture local details and global contextual information simultaneously, forming rich multi-scale feature representations. The specific expression is as follows:
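A plausible formulation of this multi-scale fusion, consistent with the operators above, is:

\[
s_i = \mathrm{Conv}_3\Big(\mathrm{Cat}\big(\mathrm{AtrousConv}_{d=1}(f_i),\ \mathrm{AtrousConv}_{d=2}(f_i),\ \mathrm{AtrousConv}_{d=4}(f_i),\ \mathrm{AtrousConv}_{d=8}(f_i)\big)\Big)
\]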
where AtrousConv(·) represents the dilated convolution operation.
Finally, s_i enters the Dynamic Enhancement Branch, which extracts the global information g_i of the features through global adaptive average pooling and dynamically generates the attention weight w and bias b from s_i. A parameterized convolution [36] is then used to process s_i, and the result is multiplied with g_i to obtain the fused feature m_i. This enables the module to dynamically adjust the feature response intensity according to different input scenarios, thereby improving the discriminative ability of the features. The specific expression is as follows:
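A plausible form of this branch, writing GAP(·) for global adaptive average pooling and ⊗ for element-wise multiplication (notation assumed), is:

\[
g_i = \mathrm{GAP}(s_i), \qquad (w, b) = \mathrm{DynamicEnhancement}(s_i), \qquad m_i = \mathrm{ParaConv}(s_i;\ w, b) \otimes g_i
\]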
where DynamicEnhancement(·) represents the dynamic enhancement branch, which consists of global average pooling, convolution, ReLU, and sigmoid activation; w and b denote learnable parameters; and ParaConv(·) represents the parameterized convolution.
Hierarchical Decoder: To fully utilize the fused features {m_1, m_2, m_3, m_4} output by the DFIM, this study designs a hierarchical decoder that reconstructs high-resolution output through progressive upsampling and feature fusion. First, each fused feature m_i undergoes convolution, batch normalization, ReLU activation, and upsampling to obtain the preliminary reconstructed feature g_i. Multi-scale features are then progressively integrated through a two-stage fusion mechanism: the first stage concatenates and convolutionally fuses features from adjacent scales to obtain the intermediate fused features c_j, and the second stage further fuses these intermediate features to generate the final segmentation prediction map p_1. This hierarchical decoding approach not only effectively recovers spatial resolution but also preserves semantic information from different scales, ensuring the accuracy and completeness of the segmentation results. The specific formulas are as follows:
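One plausible instantiation of the two fusion stages, assuming that adjacent scales are fused pairwise (the exact grouping is an assumption), is:

\[
g_i = \mathrm{CBRU}(m_i), \qquad c_j = \mathrm{Conv}_3\big(\mathrm{Cat}(g_{2j-1},\ g_{2j})\big),\ j = 1, 2, \qquad p_1 = \mathrm{Conv}_3\big(\mathrm{Cat}(c_1,\ c_2)\big)
\]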
where CBRU(·) consists of convolution, batch normalization, ReLU, and upsampling.
3.3. EGNet
Figure 4 shows the overall architecture of EGNet. The network first inputs the RGB and thermal data into a pre-trained DFormer-Base encoder [37] to capture modality-specific and hierarchical semantic features, denoted {R_1, R_2, R_3, R_4} and {T_1, T_2, T_3, T_4}, respectively. Among these, R_1 and T_1 are fed into the first UFM for cross-modal interaction and enhancement, yielding the fused representation f_1. Meanwhile, {R_2, R_3, R_4} and {T_2, T_3, T_4} are element-wise added to the corresponding outputs of the previous UFM layer to produce {R′_2, R′_3, R′_4} and {T′_2, T′_3, T′_4}, which are then progressively input into subsequent UFMs to generate {f_2, f_3, f_4}. Finally, this study inputs {f_1, f_2, f_3, f_4} into the Hierarchical Decoder to generate the corresponding segmentation masks {s_1, s_2, s_3, s_4}. It is worth noting that the decoder adopts the same Hierarchical Decoder as MANet, so it is not described again.
UFM: In RGB-T semantic segmentation, convolutional networks often struggle to capture fine-grained structural details (e.g., edges and textures) in complex urban scenes. To address this limitation, inspired by hybrid approaches in medical image boundary detection and remote sensing [19,33], we propose the Unified Feature Module (UFM), which integrates Sobel and Gabor filters to introduce deterministic edge and texture priors, effectively enhancing the representation of structural details in RGB-T feature fusion.
To semantically align, interact, and refine the RGB features R_i and thermal features T_i extracted by the backbone network, the UFM first processes the input features R_i and T_i through a downsampling convolution module, which contains a 3 × 3 convolution layer, batch normalization, and a LeakyReLU activation function:
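A plausible form of this step, writing R_i^d and T_i^d for the downsampled features (notation assumed), is:

\[
R_i^{d} = \mathrm{LeakyReLU}\big(\mathrm{BN}(\mathrm{Conv}_3(R_i))\big), \qquad T_i^{d} = \mathrm{LeakyReLU}\big(\mathrm{BN}(\mathrm{Conv}_3(T_i))\big)
\]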
Subsequently, cross-modal feature representation is generated through element-wise multiplication to compute the interaction between RGB and thermal features:
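Denoting element-wise multiplication by ⊗ and using the downsampled features from the previous step, the interaction feature M_RD referenced below can plausibly be written as:

\[
M_{RD} = R_i^{d} \otimes T_i^{d}
\]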
Next, a pixel-level weighting strategy (PW) is used to process the cross-modal interaction feature M_RD, generating a pixel-level weighting map that focuses on key regions of the cross-modal interaction through convolution and sigmoid activation. Building on this pixel-level enhancement, a feature-level weighting module (FW) further generates feature-level weights through global average pooling and fully connected layers, thereby achieving cross-modal feature alignment and producing the enhanced RGB feature through the following operations:
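One plausible instantiation, writing the enhanced RGB feature as R̂_i and including a residual connection (both are assumptions), is:

\[
\hat{R}_i = R_i^{d} \otimes \mathrm{PW}(M_{RD}) \otimes \mathrm{FW}(M_{RD}) + R_i^{d}, \qquad \mathrm{PW}(\cdot) = \sigma\big(\mathrm{Conv}(\cdot)\big), \quad \mathrm{FW}(\cdot) = \sigma\big(\mathrm{FC}(\mathrm{GAP}(\cdot))\big)
\]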
To further enhance the model's capability in detecting object boundaries and details, this study integrates classical image priors (edges and textures) into the UFM: the Sobel operator provides mathematically grounded gradient computation, aligning with the inherent advantage of thermal imaging in boundary representation, while the frequency and directional selectivity of Gabor filters precisely captures the rich textural characteristics of RGB images. Specifically, fixed 3 × 3 Sobel filters are applied to the enhanced RGB feature in both the x and y directions to capture the edge intensity gradients G_i, and the edge-enhanced features E_i are then obtained through 1 × 1 convolution, sigmoid gating, and a residual operation:
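A plausible form of this edge enhancement, reusing the assumed symbol R̂_i from above, is:

\[
G_i = \sqrt{\big(\mathrm{Sobel}_x * \hat{R}_i\big)^{2} + \big(\mathrm{Sobel}_y * \hat{R}_i\big)^{2}}, \qquad
E_i = \hat{R}_i + \hat{R}_i \otimes \sigma\big(\mathrm{Conv}_1(G_i)\big)
\]

where * denotes convolution with the fixed Sobel kernels.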
where σ represents the sigmoid operation and Conv1(·) represents the 1 × 1 convolution operation.
Next, Gabor filtering is applied to the edge-enhanced features E_i along the channel dimension to capture complex texture patterns through multi-directional texture analysis. The texture features from all directions are then concatenated and processed through a 1 × 1 convolution to obtain the texture-enhanced features Texture_i:
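One plausible formulation, using the standard real Gabor kernel with N orientations (N and the scale symbol σ_g are our assumptions), is:

\[
g_{\theta}(x, y) = \exp\!\left(-\frac{x'^{2} + \gamma^{2} y'^{2}}{2\sigma_g^{2}}\right)\cos\!\left(2\pi \frac{x'}{\lambda}\right), \qquad
\mathrm{Texture}_i = \mathrm{ConvCat}\big(E_i * g_{\theta_1},\ \ldots,\ E_i * g_{\theta_N}\big)
\]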
where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ; λ represents the wavelength, θ represents the direction, the scale parameter controls the extent of the Gaussian envelope, and γ represents the spatial aspect ratio. ConvCat(·) represents the joint operation of concatenation and convolution, while Cat(·) represents the concatenation operation alone.
Finally, the obtained edge features E_i and texture features Texture_i are concatenated with the cross-modal aggregated feature, and the result is processed through multiple parallel dilated convolutions (with dilation rates of 1, 2, and 3, respectively) to expand the receptive field:
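A plausible form of this final step, taking the enhanced RGB feature R̂_i as the cross-modal aggregated feature and writing F_i for the concatenated input and f_i for the UFM output (all assumptions), is:

\[
F_i = \mathrm{Cat}\big(E_i,\ \mathrm{Texture}_i,\ \hat{R}_i\big), \qquad
f_i = \mathrm{Conv}_3\Big(\mathrm{Cat}\big(\mathrm{AtrousConv}_{d=1}(F_i),\ \mathrm{AtrousConv}_{d=2}(F_i),\ \mathrm{AtrousConv}_{d=3}(F_i)\big)\Big)
\]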
This hybrid strategy enables comprehensive exploitation of boundary-discriminative information from thermal infrared modality and texture-discriminative information from RGB modality, thereby achieving more precise cross-modal semantic alignment at the feature level. Although the incorporation of fixed filters introduces additional computational overhead during training, it significantly improves segmentation accuracy and generalization capability, striking an effective balance between performance gains and computational costs.
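To make the edge and texture priors concrete, the following PyTorch sketch applies fixed depthwise Sobel and Gabor kernels to a feature map and fuses their responses with 1 × 1 convolutions. The module name, the orientation count, the Gabor hyperparameters, and the sigmoid-gated residual are illustrative assumptions rather than the exact UFM implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeTexturePrior(nn.Module):
    """Illustrative sketch of fixed Sobel/Gabor priors; names and
    hyperparameters are ours, not taken from the paper."""

    def __init__(self, channels, num_orientations=4, ksize=7,
                 sigma=2.0, lambd=4.0, gamma=0.5):
        super().__init__()
        self.ksize = ksize
        # Fixed 3x3 Sobel kernels (x and y), applied depthwise to every channel.
        sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sobel", torch.stack([sx, sx.t()]).unsqueeze(1))  # (2,1,3,3)
        # Fixed Gabor kernels at evenly spaced orientations, also applied depthwise.
        thetas = [i * math.pi / num_orientations for i in range(num_orientations)]
        bank = torch.stack([self._gabor(ksize, sigma, t, lambd, gamma) for t in thetas])
        self.register_buffer("gabor", bank.unsqueeze(1))                       # (O,1,k,k)
        self.edge_gate = nn.Conv2d(channels, channels, kernel_size=1)
        self.tex_proj = nn.Conv2d(channels * num_orientations, channels, kernel_size=1)

    @staticmethod
    def _gabor(ksize, sigma, theta, lambd, gamma):
        half = ksize // 2
        y, x = torch.meshgrid(torch.arange(-half, half + 1, dtype=torch.float32),
                              torch.arange(-half, half + 1, dtype=torch.float32),
                              indexing="ij")
        xp = x * math.cos(theta) + y * math.sin(theta)
        yp = -x * math.sin(theta) + y * math.cos(theta)
        return torch.exp(-(xp ** 2 + (gamma * yp) ** 2) / (2 * sigma ** 2)) \
            * torch.cos(2 * math.pi * xp / lambd)

    def forward(self, feat):
        b, c, h, w = feat.shape
        flat = feat.reshape(b * c, 1, h, w)
        # Depthwise Sobel gradients -> gradient magnitude.
        grads = F.conv2d(flat, self.sobel, padding=1).reshape(b, c, 2, h, w)
        g = torch.sqrt(grads.pow(2).sum(dim=2) + 1e-6)
        # Edge enhancement via sigmoid-gated residual, as described in the text.
        edge = feat + feat * torch.sigmoid(self.edge_gate(g))
        # Depthwise Gabor responses at all orientations, fused by a 1x1 conv.
        tex = F.conv2d(edge.reshape(b * c, 1, h, w), self.gabor, padding=self.ksize // 2)
        tex = self.tex_proj(tex.reshape(b, c * self.gabor.shape[0], h, w))
        return torch.cat([edge, tex], dim=1)  # (B, 2C, H, W)
```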
3.4. ACML
To address the challenges of modal disparities and the need for effective integration of global semantics and local boundary features in RGB-T urban scene semantic segmentation, this study proposes the ACML framework that redefines multi-modal optimization. Unlike traditional approaches reliant on fixed alignment modules or extensive hyperparameter tuning, ACML leverages dynamic, difference-based mutual learning to enable bidirectional knowledge transfer without parameter dependencies.
As shown in Figure 5, in the ACML mutual learning framework of this study, MANet and EGNet each receive the RGB and thermal inputs, extract the features {r_1, r_2, r_3, r_4}, {t_1, t_2, t_3, t_4} and {r′_1, r′_2, r′_3, r′_4}, {t′_1, t′_2, t′_3, t′_4} through their respective encoders, and output the respective predictions p_1 and p′_1 through their decoders.
First, this study proposes an adaptive alignment strategy based on feature differences. By quantifying the differences in feature distributions between MANet and EGNet at the encoder stage, a dynamic modal complementarity optimization process is constructed. From an information-theoretic perspective, this strategy uses norm differences in feature distributions to dynamically enhance intra-class pixel semantic cohesion. It suppresses the representation drift caused by modal heterogeneity, resolving the inconsistent intra-class pixel semantics of traditional fusion methods. Specifically, this study first calculates the element-wise differences between MANet's RGB features r_i and thermal infrared features t_i and EGNet's corresponding features r′_i and t′_i, compresses the difference features to reduce computational complexity while preserving intra-class semantic patterns and inter-class boundary information, and enhances the spatial correlation of the difference feature space. Subsequently, to dynamically balance the modal contributions, this study calculates the L2 norms n_{r,i} and n_{t,i} of the compressed differences:
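A plausible form of these norms, writing d_{r,i} and d_{t,i} for the compressed difference features (notation assumed) and averaging over all tensor elements, is:

\[
n_{r,i} = \sqrt{\frac{1}{BCHW}\sum_{b=1}^{B}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} d_{r,i}(b, c, h, w)^{2}}, \qquad
n_{t,i} = \sqrt{\frac{1}{BCHW}\sum_{b=1}^{B}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W} d_{t,i}(b, c, h, w)^{2}}
\]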
where B, C, H, and W represent the batch size, number of channels, height, and width, respectively, and b, c, h, and w are the corresponding indices.
Then, preliminary weights are generated through sigmoid activation, with the norms quantifying the difference intensity, thereby ensuring that pixel-level semantic consistency takes precedence over modal interference. The weights are subsequently normalized, and the difference intensity is measured through a weighted mean squared error to obtain the feature difference loss Loss_feat:
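One plausible instantiation of the normalized weights and the resulting loss (the exact weighting scheme is an assumption consistent with the description above) is:

\[
w_{r,i} = \frac{\sigma(n_{r,i})}{\sigma(n_{r,i}) + \sigma(n_{t,i}) + \varepsilon}, \qquad
w_{t,i} = \frac{\sigma(n_{t,i})}{\sigma(n_{r,i}) + \sigma(n_{t,i}) + \varepsilon},
\]
\[
\mathrm{Loss}_{feat} = \sum_{i=1}^{4}\Big(w_{r,i}\,\mathrm{MSE}\big(r_i, r'_i\big) + w_{t,i}\,\mathrm{MSE}\big(t_i, t'_i\big)\Big)
\]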
where ε is used to prevent division-by-zero errors.
Secondly, this study designs an adaptive consistency operation based on the entropy of the prediction distribution. This mechanism uses entropy to dynamically adjust the alignment weight between the prediction maps p_1 and p′_1 of MANet and EGNet, thereby optimizing consistency. High entropy values indicate uncertainty at object boundaries or semantic transitions and suppress strict alignment to prevent error propagation; conversely, low entropy values reflect confident and robust predictions and promote reliable alignment. Specifically, the prediction map p_1 of MANet and the corresponding prediction map p′_1 of EGNet are converted into soft prediction distributions through the softmax function with temperature T:
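Writing the resulting class-wise soft distributions as q and q′ (symbols assumed), this step can be expressed as:

\[
q = \mathrm{Softmax}\!\left(\frac{p_1}{T}\right), \quad q' = \mathrm{Softmax}\!\left(\frac{p'_1}{T}\right), \quad \text{i.e.,}\quad
q_{k} = \frac{\exp(p_{1,k}/T)}{\sum_{j=1}^{K}\exp(p_{1,j}/T)},\ k = 1, \ldots, K
\]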
where Softmax(·) denotes the softmax activation function, T = 2 is set following established practice in mutual learning frameworks [29,30], and K represents the number of classes.
Then, the average entropy H(p) of the predicted distributions of the two networks is calculated to reflect the uncertainty of the predictions; a higher entropy requires a greater weight to guide the learning process. Subsequently, the entropy-based adaptive weight w_pred is used to balance the reliability of the predictions and enhance intra-class semantic consistency:
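One plausible form of the average entropy, with N denoting the number of pixels per prediction map (our notation), is:

\[
H(p) = -\frac{1}{2N}\sum_{n=1}^{N}\sum_{k=1}^{K}\big(q_{n,k}\log q_{n,k} + q'_{n,k}\log q'_{n,k}\big)
\]

with w_pred then obtained as a monotone function of H(p); the specific mapping is a design choice of the framework.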
After that, the Kullback–Leibler (KL) divergence between the prediction distributions of the two networks is calculated in both directions, so that the inter-class boundary probabilities converge and the boundary clarity improves. Finally, the divergences are weighted by w_pred to obtain the adaptive consistency loss based on the prediction distribution entropy, Loss_pred:
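A plausible form of this loss, with the temperature compensation factor T² included as is customary in distillation-style objectives (an assumption), is:

\[
\mathrm{Loss}_{pred} = w_{pred} \cdot T^{2} \cdot \tfrac{1}{2}\Big(\mathrm{KL}\big(q \,\|\, q'\big) + \mathrm{KL}\big(q' \,\|\, q\big)\Big)
\]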
3.5. Theoretical Analysis
When directly combining DFIM and UFM, optimization conflicts arise due to their distinct processing mechanisms [
16,
17]. DFIM employs dynamic feature selection through learnable attention mechanisms, adapting to input-dependent feature distributions [
38], while UFM utilizes fixed image priors derived from Sobel and Gabor filters with cross-modal weighting strategies [
39]. This architectural difference creates fundamental conflicts in their gradient optimization pathways, similar to those observed in multi-task learning scenarios [
40].
The conflict manifests mathematically in their gradient optimization. Let Loss_D and Loss_U represent DFIM's and UFM's losses, respectively. Direct combination yields L_combined = Loss_D + Loss_U, where ∇Loss_D optimizes for adaptive feature weighting while ∇Loss_U optimizes for fixed prior integration. These contradictory gradient directions create what is known as "gradient interference" [41,42], leading to unstable training dynamics and performance degradation.
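As a rough diagnostic of this effect, one can compare the gradient directions that the two losses induce on any parameters they share under direct combination. The helper below is an illustrative sketch (the function name and interface are ours, not from the paper); a negative cosine similarity signals conflicting updates.

```python
import torch
import torch.nn.functional as F


def gradient_conflict(loss_d, loss_u, shared_params):
    """Cosine similarity between the gradients of two losses w.r.t. shared parameters.

    Values near -1 indicate the "gradient interference" discussed above: the two
    objectives pull the shared parameters in opposing directions.
    """
    shared_params = list(shared_params)
    g_d = torch.autograd.grad(loss_d, shared_params, retain_graph=True, allow_unused=True)
    g_u = torch.autograd.grad(loss_u, shared_params, retain_graph=True, allow_unused=True)
    # Keep only parameters that receive a gradient from both losses.
    pairs = [(a, b) for a, b in zip(g_d, g_u) if a is not None and b is not None]
    flat_d = torch.cat([a.flatten() for a, _ in pairs])
    flat_u = torch.cat([b.flatten() for _, b in pairs])
    return F.cosine_similarity(flat_d, flat_u, dim=0)
```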
Our mutual learning framework resolves this by separating the conflicting modules into specialized networks while enabling knowledge exchange through ACML. This approach follows the principle of "divide-and-conquer" optimization [43], eliminating direct parameter conflicts while maintaining collaborative learning through feature-level knowledge distillation [44]. The framework ensures stable convergence by preserving the distinct optimization characteristics of each module while facilitating cross-network collaboration.