Article

Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4298; https://doi.org/10.3390/app15084298
Submission received: 6 March 2025 / Revised: 27 March 2025 / Accepted: 2 April 2025 / Published: 13 April 2025

Abstract

Object detection benefits greatly from multimodal image fusion, which integrates complementary data from different modalities like RGB and thermal images. However, existing methods struggle with effective inter-modal fusion, particularly in capturing spatial and contextual information across diverse regions and scales. To address these limitations, we propose the dynamic channel adjustment and multi-scale activated attention mechanism network (MNCM). Our approach incorporates dynamic channel adjustment for precise feature fusion across modalities and a multi-scale attention mechanism to capture both local and global contexts. This design improves robustness while balancing computational efficiency. The model’s scalability is enhanced through its ability to adaptively process multi-scale information without being constrained by fixed-scale designs. To validate our method, we used two multimodal datasets from traffic and industrial scenarios, which consisted of paired thermal infrared and visible light images. The results first demonstrate strong performance in multimodal fusion and then show state-of-the-art results in object detection, proving its effectiveness for real-world applications.

1. Introduction

Object detection is a fundamental task in computer vision, which involves identifying and precisely localizing objects within images or videos. By enabling machines to interpret visual data, it plays a crucial role in applications such as autonomous driving [1], intelligent surveillance [2], industrial automation [3], and medical imaging [4].
In recent years, integrating various neural network architectures has led to significant advancements in object detection technologies. Early Convolutional Neural Network (CNN)-based models excelled in feature extraction but were limited by their localized processing, which hindered the capture of deep interactions between different modalities [5,6,7]. Generative models enhanced the quality of fused outputs but faced challenges in ensuring consistency across diverse receptive field regions [8,9]. The introduction of attention mechanisms marked a pivotal shift, which enabled models to focus on relevant input regions and capture long-range dependencies, thereby improving feature representation and detection accuracy [10,11]. Building upon these developments, multimodal data fusion has emerged as a key research focus [12,13,14,15,16,17], leveraging complementary information from diverse modalities to enhance robustness and performance in complex environments. However, existing multimodal detectors still face critical limitations, including insensitivity to spatial location information, feature loss, and insufficient fusion during the multimodal data integration process. These issues hinder the ability to fully leverage the complementary strengths of different modalities, ultimately impacting the overall effectiveness of multimodal object detection systems.
Therefore, we propose a novel multimodal image fusion network, the Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention (MNCM), which applies channel adjustment and multi-scale attention within the backbone branch of each modality. This design achieves both inter-modality complementarity and intra-modality enhancement during the fusion process. The dynamic channel adjustment (DCA) module dynamically adjusts channels to optimize fusion and enhance modality complementarity, allowing the network to adaptively focus on the most relevant features from each modality. Meanwhile, the multi-scale activated attention mechanism (MAAM) strengthens attention to key features across multiple receptive field scales, ensuring that both local and global contexts are effectively captured. In contrast to existing transformer-based multimodal fusion models, which predominantly rely on global token-level attention for feature integration, our approach emphasizes fine-grained channel-level interaction and scale-aware attention modulation. This design reduces computational overhead while maintaining strong performance, particularly in scenarios with diverse object sizes and modality variations. Together, these components enable flexible and efficient inter-modality feature interactions, significantly improving the quality of fused images and overall detection performance. By leveraging these advanced mechanisms, our network achieves superior adaptability to varying object sizes and complex environments, making it a robust solution for real-world detection tasks.
We conducted extensive experiments on traffic and industrial datasets to thoroughly evaluate the performance of the module in multimodal fusion. Initially, we assessed the image fusion performance, which demonstrated the strong capability of the two modules in integrating multimodal information. Subsequently, the results from object detection experiments showed that our approach significantly improved detection performance in complex scenarios. Notably, it exhibited exceptional adaptability and robustness, particularly in handling objects of varying sizes and challenging environments. These findings underscore the model's strong generalization ability, which offers a promising solution for multimodal image fusion in real-world applications. Furthermore, extensive computational efficiency analysis demonstrates that our model achieves this superior performance while maintaining competitive inference speed and parameter efficiency, making it suitable for real-time applications.
Our main contributions are summarized as follows:
  • To achieve effective feature interaction and adaptive channel selection, we design a dynamic channel adjustment strategy using channel exchange principles for enhancing the fusion of complementary information from different modalities.
  • To capture key features across different scales, we introduce a multi-scale activated attention mechanism, enhancing the model’s focus on critical features from each modality for improved detection accuracy.
  • Extensive experimental results on two datasets show that our approach outperforms state-of-the-art methods in both image fusion and object detection tasks, thus demonstrating its superior performance in complex scenarios.

2. Related Work

Early object detection relied on traditional machine learning techniques [18,19,20,21], which utilized hand-crafted features and conventional classifiers. These methods achieved moderate success in simple environments but struggled with complex, multi-scale scenarios. The introduction of Convolutional Neural Networks (CNNs) marked a turning point by enabling end-to-end feature learning, significantly enhancing both accuracy and efficiency. Notable early CNN-based detectors include R-CNN [22] and its successors Fast R-CNN and Faster R-CNN [6], which introduced region proposals and hierarchical feature representations. However, their computational complexity limited their suitability for real-time applications. This challenge led to the development of single-stage detectors such as YOLO [23] and SSD [24], which achieved a better balance between speed and accuracy by directly predicting bounding boxes and class probabilities.
As detection tasks became more complex—particularly in multi-object and multi-scale contexts—there was a growing need to model long-range dependencies and contextual relationships. Attention mechanisms and transformer-based architectures were introduced to address this limitation, fundamentally reshaping object detection. Carion et al. [25] proposed DETR, which formulates detection as a set prediction problem using global attention and a bipartite loss. Han et al. [10] presented TNT, which enhances local feature extraction by dividing image patches and computing intra-patch attention. Liu et al. [26] developed the Swin Transformer, a hierarchical vision transformer with shifted windows, achieving state-of-the-art performance in both classification and detection tasks.
More recently, multimodal learning has emerged as a powerful strategy for improving object detection by integrating complementary information from diverse sources, such as RGB images, LiDAR, and radar. For example, Wang et al. [3] introduced a fusion method using a shared backbone for RGB and LiDAR data, yielding significant gains in autonomous driving scenarios. Liu et al. [13] proposed a cross-modal attention framework that dynamically selects relevant modalities, improving the detection of small and occluded objects. In addition, Zhang et al. [27] demonstrated the use of transformer-based architectures for multi-sensor fusion, enhancing feature alignment and object localization.

3. Methodology

3.1. Architecture Overview

The architecture of our proposed model is shown in Figure 1. It includes a dual-stream feature extraction backbone, a two-unit feature enhancement and fusion module, and an FPN module with a detection head. Infrared (IR) and visible (RGB) images are first processed independently through parallel ResNet-50 [28] networks, where each stream generates five hierarchical feature maps corresponding to different stages of the network. Five DCA-MAAM modules, one per stage, are applied to fuse the corresponding feature maps. The fusion process begins with a dynamic channel adjustment (DCA) strategy, which adaptively recalibrates cross-modal feature channels by learning modality-specific importance weights. Subsequently, the multi-scale activated attention mechanism (MAAM) activates both local and global contextual information through parallel convolutional branches with varying receptive fields, enhancing the model's ability to capture objects at different scales. The fused features are processed by a Feature Pyramid Network (FPN) [29] to build a multi-scale feature hierarchy, which is then fed into YOLOX detection heads for final predictions. The detection heads output bounding box coordinates, objectness scores, and class probabilities, enabling robust multimodal detection. The framework is optimized using a combined loss function that integrates classification, localization, and spatial alignment objectives.
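To make the data flow concrete, the following is a minimal PyTorch-style sketch of the dual-stream pipeline described above. The class and argument names (e.g., DualStreamDetector, fusion_blocks) are illustrative placeholders rather than the authors' released code, and the assumption that the channel-exchanged features continue through the backbone while the fused map feeds the neck is one plausible reading of Figure 1.

```python
import torch.nn as nn
import torchvision


class DualStreamDetector(nn.Module):
    """Minimal sketch: two ResNet-50 streams, one DCA-MAAM block per stage,
    an FPN-style neck, and a YOLOX-style head (fusion/neck/head passed in)."""

    def __init__(self, fusion_blocks, neck, head):
        super().__init__()
        ir_net = torchvision.models.resnet50(weights=None)   # IR assumed replicated to 3 channels
        rgb_net = torchvision.models.resnet50(weights=None)

        def stages(net):
            stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
            return nn.ModuleList([stem, net.layer1, net.layer2, net.layer3, net.layer4])

        self.ir_stages, self.rgb_stages = stages(ir_net), stages(rgb_net)
        self.fusion_blocks = fusion_blocks   # five DCA-MAAM modules, one per stage
        self.neck, self.head = neck, head    # FPN and YOLOX detection head

    def forward(self, ir, rgb):
        fused_maps = []
        for ir_stage, rgb_stage, fuse in zip(self.ir_stages, self.rgb_stages, self.fusion_blocks):
            ir, rgb = ir_stage(ir), rgb_stage(rgb)
            # DCA exchanges channels between the streams; MAAM produces the fused map.
            ir, rgb, fused = fuse(ir, rgb)
            fused_maps.append(fused)
        return self.head(self.neck(fused_maps))
```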

3.2. Dynamic Channel Adjustment

Building upon the success of channel interaction strategies in multimodal learning [30], we propose a dynamic channel adjustment (DCA) mechanism to enhance cross-modal feature fusion. The DCA adaptively recalibrates and exchanges channel-wise features between modalities through learnable attention weights, effectively capturing complementary information while suppressing redundant features.
Given the feature maps $x_1 \in \mathbb{R}^{H \times W \times C}$ and $x_2 \in \mathbb{R}^{H \times W \times C}$ from the two modalities at stage $s$, we first apply Batch Normalization (BN) to each channel to stabilize the feature distribution:

$$\hat{x}_{m,s,c} = \frac{x_{m,s,c} - \mu_{m,s,c}}{\sqrt{\sigma_{m,s,c}^{2} + \epsilon}}, \quad m \in \{1, 2\}$$

where $\mu_{m,s,c}$ and $\sigma_{m,s,c}^{2}$ denote the mean and variance of the $c$-th channel for modality $m$ at stage $s$, and $\epsilon$ is a small constant for numerical stability. The normalized features are then scaled and shifted using learnable parameters:

$$\tilde{x}_{m,s,c} = \gamma_{m,s,c}\,\hat{x}_{m,s,c} + \beta_{m,s,c}$$

where $\gamma_{m,s,c}$ and $\beta_{m,s,c}$ are the affine transformation parameters for channel-wise adjustment.

To enable dynamic channel interaction, we generate modality-specific attention weights through a shared $1 \times 1$ convolutional layer followed by a sigmoid activation:

$$g_m = \sigma(W_g * \tilde{x}_m), \quad m \in \{1, 2\}$$

where $W_g \in \mathbb{R}^{1 \times 1 \times C \times C}$ denotes the convolutional kernel, and $\sigma(\cdot)$ is the sigmoid function that maps the weights to the range $[0, 1]$.

The channel exchange operation is then performed as a weighted combination of the normalized features:

$$x'_{1,c} = g_{1,c} \odot \tilde{x}_{1,c} + (1 - g_{1,c}) \odot \tilde{x}_{2,c}$$

$$x'_{2,c} = g_{2,c} \odot \tilde{x}_{2,c} + (1 - g_{2,c}) \odot \tilde{x}_{1,c}$$

where $\odot$ denotes element-wise multiplication. This formulation ensures that each channel in the output features $x'_1$ and $x'_2$ incorporates complementary information from both modalities while preserving the original dimensions $H \times W \times C$.
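A minimal PyTorch sketch of the DCA computation is given below, assuming BatchNorm2d for the normalization and affine steps and a shared 1 × 1 gating convolution across modalities, as in the equations; whether the gates are additionally pooled to be purely channel-wise is left open and treated here as spatially varying.

```python
import torch
import torch.nn as nn


class DynamicChannelAdjustment(nn.Module):
    """Sketch of DCA: per-modality BatchNorm (normalization plus learnable
    gamma/beta), a shared 1x1 convolution with sigmoid to produce gates g_m,
    and a gated channel exchange between the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)  # shared W_g

    def forward(self, x1, x2):
        x1n, x2n = self.bn1(x1), self.bn2(x2)      # normalized, scaled, shifted features
        g1 = torch.sigmoid(self.gate(x1n))         # modality-specific gates in [0, 1]
        g2 = torch.sigmoid(self.gate(x2n))
        # Gated channel exchange: each output mixes its own features with the
        # complementary modality's features, preserving the H x W x C shape.
        x1_out = g1 * x1n + (1.0 - g1) * x2n
        x2_out = g2 * x2n + (1.0 - g2) * x1n
        return x1_out, x2_out
```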

3.3. Multi-Scale Activated Attention Mechanism

The multi-scale activated attention mechanism improves the model’s ability to capture both local and global contexts by applying dilated convolutions at multiple scales within the attention framework. This enables the effective extraction of multi-scale features from input images through adjusted receptive fields.
To implement this mechanism, we perform convolutions on the channel-exchanged feature maps $x'_1$ and $x'_2$, generating queries ($q$), keys ($k$), and values ($v$) for each modality. A standard convolution generates $q$, while dilated convolutions with rates $r = 1, 2, 3$ produce $k$ and $v$. This retains the channel dimensions and results in three sets of $q$, $k$, and $v$ per modality:

$$q_{i,r} = \mathrm{Conv}(x'_i), \quad k_{i,r} = \mathrm{DilateConv}_r(x'_i), \quad v_{i,r} = \mathrm{DilateConv}_r(x'_i)$$

where $q_{i,r}$, $k_{i,r}$, and $v_{i,r}$ represent the query, key, and value at scale $r$ for each modality $i$ (with $r = 1, 2, 3$). Here, $\mathrm{Conv}(\cdot)$ denotes a standard convolution operation, while $\mathrm{DilateConv}_r(\cdot)$ signifies a dilated convolution with dilation rate $r$.

For each scale $r$, the attention score $\alpha_{i,r}$ is computed via a dot product between the queries and keys, followed by a scaling operation:

$$\alpha_{i,r} = \mathrm{Softmax}\left(\frac{q_{i,r}\, k_{i,r}^{\top}}{\sqrt{d}}\right)$$

$$\mathrm{head}_{i,r} = \alpha_{i,r}\, v_{i,r}$$

where $d$ is a scaling factor, typically defined as $C/h$, denoting the channel dimension per attention head. The attention score $\alpha_{i,r}$ is then applied to the corresponding value $v_{i,r}$, yielding the output of each attention head.

The attention heads from the different scales $r = 1, 2, 3$ are concatenated along the channel dimension to form a multi-scale attention feature representation for each modality:

$$x_{i,\mathrm{concat}} = \mathrm{Concat}(\mathrm{head}_{i,1}, \mathrm{head}_{i,2}, \mathrm{head}_{i,3})$$

Next, the concatenated feature maps from the two modalities are combined in two ways: element-wise summation produces $x_{\mathrm{sum}}$, and element-wise multiplication produces $x_{\mathrm{mul}}$. A spatial attention mechanism is then applied to $x_{\mathrm{sum}}$ to enhance spatial dependencies:

$$x_{\mathrm{spatial}} = \sigma\Big(\mathrm{Conv}\big(\mathrm{Concat}\big[F_{\mathrm{avg}}(x_{\mathrm{sum}}),\, F_{\mathrm{max}}(x_{\mathrm{sum}})\big]\big)\Big) \odot x_{\mathrm{sum}}$$

where $\sigma$ denotes the activation function, $\mathrm{Conv}(\cdot)$ represents a convolutional layer, and $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ refer to the average and maximum pooling operations, respectively.

The spatially enhanced representation $x_{\mathrm{spatial}}$ then undergoes channel self-attention to capture inter-channel dependencies, resulting in the final fused representation $x_{\mathrm{fused}}$:

$$x_{\mathrm{fused}} = \sigma\Big(W_{\mathrm{sa}}\big[F_{\mathrm{avg}}(x_{\mathrm{spatial}} \odot x_{\mathrm{mul}}) + F_{\mathrm{max}}(x_{\mathrm{spatial}} \odot x_{\mathrm{mul}})\big]\Big)$$

where $W_{\mathrm{sa}}$ are learnable parameters and $\odot$ denotes element-wise multiplication.

The final fused representation $x_{\mathrm{fused}}$ integrates multi-scale, spatial, and channel-level information. This enriched representation then undergoes five additional processing stages before entering the Feature Pyramid Network (FPN), ensuring a seamless transition and coherence throughout the architecture.
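The sketch below mirrors these MAAM steps in PyTorch under several stated assumptions: 3 × 3 convolutions for q/k/v, a single attention head per scale, a projection back to C channels after concatenating the three scales, a 7 × 7 convolution for spatial attention, and re-applying the channel weights to the gated features in the final step (which the equation for $x_{\mathrm{fused}}$ leaves implicit).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleActivatedAttention(nn.Module):
    """Sketch of MAAM: q/k/v from standard and dilated 3x3 convolutions (rates 1, 2, 3),
    scaled dot-product attention per scale, concatenation of the per-scale heads, then
    spatial and channel attention over the combined modalities."""

    def __init__(self, channels: int, rates=(1, 2, 3)):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.kv = nn.ModuleList([
            nn.ModuleDict({
                "k": nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                "v": nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
            }) for r in rates])
        self.proj = nn.Conv2d(channels * len(rates), channels, 1)  # merge concatenated heads
        self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)          # spatial attention over [avg; max]
        self.channel_fc = nn.Conv2d(channels, channels, 1)         # W_sa for channel self-attention

    def attend(self, x):
        """Multi-scale attention for one modality (single head per scale for simplicity)."""
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                                   # (B, C, HW)
        heads = []
        for branch in self.kv:
            k = branch["k"](x).flatten(2)                          # (B, C, HW)
            v = branch["v"](x).flatten(2)
            attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
            heads.append((attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w))
        return self.proj(torch.cat(heads, dim=1))                  # multi-scale features, C channels

    def forward(self, x1, x2):
        a1, a2 = self.attend(x1), self.attend(x2)
        x_sum, x_mul = a1 + a2, a1 * a2
        # Spatial attention on the summed features (channel-wise average and max maps).
        s = torch.cat([x_sum.mean(1, keepdim=True), x_sum.amax(1, keepdim=True)], dim=1)
        x_spatial = torch.sigmoid(self.spatial_conv(s)) * x_sum
        # Channel self-attention on the spatially enhanced, multiplicatively gated features;
        # applying the resulting weights back to the features is an assumption.
        z = x_spatial * x_mul
        w = torch.sigmoid(self.channel_fc(F.adaptive_avg_pool2d(z, 1) + F.adaptive_max_pool2d(z, 1)))
        return w * z
```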

3.4. FPN, Detector and Loss Function

Let the final output feature map from the multi-scale attention mechanism be $x$, where $x$ is a multi-scale feature map that encapsulates information from the input image at various spatial resolutions. The dimensions of $x$ are $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels.

This feature map $x$ is passed into the YOLOX Feature Pyramid Network (FPN) for further processing. The YOLOX-FPN is designed to effectively merge multi-scale features from the input and perform feature extraction across different spatial resolutions. It improves detection performance by consolidating information from various levels of the feature hierarchy. Specifically, the FPN receives input feature maps $x_i$ from multiple stages and produces a consolidated feature map $f_{\mathrm{fpn}}$, which can be mathematically represented as follows:

$$\mathrm{FPN}(x_1, x_2, x_3, x_4, x_5) \rightarrow f_{\mathrm{fpn}}$$

where $f_{\mathrm{fpn}}$ is the output of the FPN, containing enhanced multi-scale features suitable for object detection. These features are subsequently passed to the YOLO detection head for final prediction.

The YOLO detection head takes the multi-scale features from $f_{\mathrm{fpn}}$ and performs object detection by predicting the bounding box coordinates, object confidence scores, and class probabilities. The detection process can be represented as follows:

$$y = \mathrm{YOLODetectionHead}(f_{\mathrm{fpn}})$$

where $y$ is the final prediction output.

The output for each grid cell in the YOLO model can be expressed as follows:

$$y_i = (b_i,\, p_c,\, p_{\mathrm{class}})$$

where $y_i$ includes the bounding box coordinates $b_i = (x, y, w, h)$, in which $x$ and $y$ represent the center of the bounding box and $w$ and $h$ represent its width and height, respectively. Additionally, $p_c \in [0, 1]$ denotes the object confidence score, and $p_{\mathrm{class}} \in \mathbb{R}^{K}$ represents the class probabilities for each detected object.
For multi-object detection, the total loss $L$ is computed as the sum of three main components: the bounding box loss $L_{\mathrm{box}}$, the object confidence loss $L_{\mathrm{obj}}$, and the classification loss $L_{\mathrm{cls}}$. The total loss is represented as follows:

$$L = \lambda_{\mathrm{box}} L_{\mathrm{box}} + \lambda_{\mathrm{obj}} L_{\mathrm{obj}} + \lambda_{\mathrm{cls}} L_{\mathrm{cls}}$$

where $L_{\mathrm{box}}$ measures the error between the predicted and true bounding box coordinates, $L_{\mathrm{obj}}$ measures the object confidence score loss, and $L_{\mathrm{cls}}$ measures the classification error. Each loss component is weighted by a factor $\lambda$ to balance its contribution to the total loss.

The bounding box loss $L_{\mathrm{box}}$ measures the discrepancy between the predicted bounding box $b_i$ and the ground truth $b_{\mathrm{true}}$. This work adopts the Generalized Intersection over Union (GIoU) loss, which is formulated as follows:

$$L_{\mathrm{box}} = 1 - \mathrm{GIoU}(b_i, b_{\mathrm{true}})$$

where $\mathrm{GIoU}(b_i, b_{\mathrm{true}})$ extends the traditional IoU by considering the smallest enclosing box, improving optimization when the predicted and ground-truth boxes do not overlap.
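For reference, a minimal PyTorch implementation of this GIoU-based box loss is sketched below; it assumes corner-format boxes (x1, y1, x2, y2), so the detector's raw (x, y, w, h) outputs would need to be converted first.

```python
import torch


def giou_loss(pred, target, eps=1e-7):
    """GIoU-based box loss, L_box = 1 - GIoU; boxes are in corner format (x1, y1, x2, y2)."""
    # Intersection area.
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    # Union area.
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    iou = inter / union
    # Smallest enclosing box.
    lt_c = torch.min(pred[..., :2], target[..., :2])
    rb_c = torch.max(pred[..., 2:], target[..., 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[..., 0] * wh_c[..., 1] + eps
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou  # one loss value per box
```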
The object confidence loss $L_{\mathrm{obj}}$ is computed using focal binary cross-entropy, which helps address the issue of class imbalance by focusing more on hard-to-classify positive samples. The loss function is defined as follows:

$$L_{\mathrm{obj}} = -\kappa\,(1 - p_c)^{\delta} \log(p_c) \quad \text{for positive samples}$$

where $p_c$ represents the predicted object confidence score, indicating the model's predicted probability that an object exists in the given grid cell. $\kappa$ is a scaling factor that adjusts the loss magnitude, and $\delta$ is the focusing parameter, which controls the strength of the penalty on well-classified examples, making the loss more sensitive to misclassified positive samples. By down-weighting easily classified examples, the approach prioritizes difficult ones, improving model performance, especially in class-imbalanced scenarios.

The classification loss $L_{\mathrm{cls}}$ quantifies the discrepancy between the predicted class probability distribution $p_{\mathrm{pred}}(c)$ and the ground-truth distribution $p_{\mathrm{true}}(c)$, implemented through categorical cross-entropy:

$$L_{\mathrm{cls}} = -\sum_{c=1}^{K} p_{\mathrm{true}}(c) \log p_{\mathrm{pred}}(c)$$

where $p_{\mathrm{true}}(c) \in \{0, 1\}$ represents the one-hot encoded ground-truth label for class $c$, and $p_{\mathrm{pred}}(c) \in (0, 1)$ denotes the predicted probability after softmax normalization over the $K$ object categories. This formulation explicitly measures the information-theoretic distance between the predicted class posterior and the true data distribution.
By combining these loss components, the model learns to optimize the object detection task while maintaining a balance between bounding box accuracy, object confidence, and classification accuracy.
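Putting the three terms together, the following sketch shows one way to assemble the total loss; the λ weights, the positive-sample masking, and the use of class indices for the targets are illustrative assumptions (the paper only fixes κ = 0.25 and δ = 2, see Section 4.2), and giou_loss refers to the sketch above.

```python
import torch
import torch.nn.functional as F


def total_detection_loss(pred_boxes, true_boxes, pred_obj_logits, pred_cls_logits,
                         true_cls, pos_mask, lambdas=(1.0, 1.0, 1.0),
                         kappa=0.25, delta=2.0):
    """Sketch of L = lambda_box * L_box + lambda_obj * L_obj + lambda_cls * L_cls,
    evaluated on positive (object-matched) predictions selected by pos_mask;
    true_cls holds integer class indices."""
    lam_box, lam_obj, lam_cls = lambdas
    # Bounding-box term: GIoU loss (see the giou_loss sketch above).
    l_box = giou_loss(pred_boxes[pos_mask], true_boxes[pos_mask]).mean()
    # Objectness term: focal binary cross-entropy on positive samples.
    p_c = torch.sigmoid(pred_obj_logits[pos_mask]).clamp(1e-7, 1.0 - 1e-7)
    l_obj = (-kappa * (1.0 - p_c) ** delta * torch.log(p_c)).mean()
    # Classification term: categorical cross-entropy over the K categories.
    l_cls = F.cross_entropy(pred_cls_logits[pos_mask], true_cls[pos_mask])
    return lam_box * l_box + lam_obj * l_obj + lam_cls * l_cls
```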

4. Experiment

4.1. Experiment Setup

To comprehensively validate the effectiveness of our proposed method, we conducted extensive experiments on both the public MSRS dataset and a private industrial dataset provided by SAIC, evaluating performance across urban road and industrial scenarios.
The MSRS dataset is a well-established benchmark for urban scene analysis and contains 1444 pairs of aligned thermal infrared and visible light images, which are annotated for vehicles and pedestrians. Each image typically includes multiple objects, providing a complex traffic scenario that helps evaluate a model’s ability to distinguish between different objects in multi-object detection tasks.
The private SAIC dataset consists of 1689 multimodal images from real-world production workshops, with each image representing a specific type of industrial equipment, such as sensors, lathes, forklifts, and filters. These images capture equipment of varying sizes and complexities, from smaller sensors to larger forklifts, while also reflecting common challenges in industrial environments, such as occlusions and cluttered backgrounds. This makes the SAIC dataset particularly valuable for evaluating the model’s robustness in industrial equipment detection and classification tasks. A detailed breakdown of the SAIC dataset’s composition and category distribution is shown in Table 1.
To demonstrate the effectiveness of our multimodal fusion approach, we first conducted image fusion experiments to evaluate the quality of feature integration across modalities. Subsequently, we performed object detection tasks to validate the practical benefits of our fusion strategy. For the fusion quality assessment, we employed five well-established metrics: Visual Information Fidelity (VIF), Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), Gradient-based Fusion Metric (Qabf), and Entropy (EN). These metrics collectively evaluate different aspects of fusion performance, including information preservation, structural consistency, and noise suppression.
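For reference, two of these metrics (PSNR and EN) reduce to a few lines of NumPy, as sketched below; which reference image PSNR is computed against (e.g., averaging the score over both source images) follows common practice in the fusion literature and is an assumption here, while VIF, SSIM, and Qabf typically come from dedicated implementations.

```python
import numpy as np


def psnr(fused: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio (dB) between the fused image and a reference image."""
    mse = np.mean((fused.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


def entropy(image: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (EN) of an 8-bit image; higher values indicate richer texture and detail."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
```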
For the object detection evaluation, we adopted three standard metrics: mean Average Precision (mAP), AP@50, and AP@75. These metrics provide a comprehensive assessment of detection accuracy at different Intersection-over-Union (IoU) thresholds, effectively validating the robustness of our method across various scenarios. The combination of fusion quality metrics and detection performance indicators offers strong empirical evidence for the reliability and effectiveness of our proposed approach.

4.2. Implementation

In our experiments, we utilized ResNet50 as the backbone for feature extraction, integrating it with the YOLOX-FPN and YOLO detectors to perform object detection tasks. The training process was conducted on an NVIDIA Tesla A30 GPU over 300 epochs, with input images uniformly resized to 640 × 640 pixels. We employed Stochastic Gradient Descent (SGD) as the optimization method, setting the weight decay to 5 × 10⁻⁴.
In the object confidence loss function, we set two important hyperparameters, κ and δ , to values of 0.25 and 2, respectively. The learning rate was varied between 0.0001 and 0.01, and we implemented cosine annealing for learning rate decay. These hyperparameter values were selected based on empirical validation and prior works to achieve optimal performance while maintaining training stability.
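A minimal optimizer and scheduler setup consistent with these settings is sketched below; the momentum value, the data loader, and the compute_loss call are placeholders added for completeness rather than the authors' exact training script.

```python
import torch

# SGD with weight decay 5e-4 and cosine annealing from 1e-2 down to 1e-4 over 300 epochs,
# matching the reported settings; `model`, `train_loader`, and `compute_loss` are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-4)

for epoch in range(300):
    for ir, rgb, targets in train_loader:             # paired 640x640 IR/RGB batches
        optimizer.zero_grad()
        loss = compute_loss(model(ir, rgb), targets)  # total loss from Section 3.4
        loss.backward()
        optimizer.step()
    scheduler.step()
```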

4.3. Comparison Experiment and Ablation

We selected several methods to conduct experiments on the MSRS dataset and the SAIC dataset. The experimental results validating the effectiveness of image fusion are presented in Table 2 and Table 3. The results of the object detection experiments are presented in Table 4 and Table 5.
  • Faster R-CNN [31]: A domain adaptive object detection method based on Faster R-CNN that mitigates image-level and instance-level shifts using adversarial training and consistency regularization.
  • GNN-based [32]: A joint multi-object tracking (MOT) approach based on Graph Neural Networks (GNNs) that simultaneously optimizes object detection and data association by modeling spatial and temporal relationships.
  • CDDFuse [33]: A multimodal image fusion method that employs a dual-branch architecture combining Transformer and CNN components.
  • ProbEn [34]: A multimodal object detection method based on probabilistic ensembling to effectively integrate information from multiple sensor modalities.
  • DETR [25]: An end-to-end object detection framework that uses a transformer encoder–decoder architecture with bipartite matching loss to directly predict object sets.
  • SwinF [35]: A feature fusion network based on Swin Transformer, designed to enhance object detection performance while reducing computational complexity through hierarchical windowing operations.
  • TransFusion [36]: A unified multimodal model that combines next-token prediction and diffusion processes in a single transformer to jointly model continuous data.
  • MMA-UNet [37]: A multimodal asymmetric UNet designed for balanced feature fusion by employing specialized encoders and cross-scale fusion strategies.
In the evaluation of image fusion, we employed five metrics to conduct a quantitative analysis of the results, which are presented in Table 2 and Table 3. The results show that our method outperforms the comparison methods in nearly all metrics. Specifically, higher VIF and PSNR values indicate greater fidelity between the source images and the fused image, meaning that the fused image better resembles the original source images. Higher Qabf and SSIM values suggest that the fused image retains more information from the source images with minimal distortion. The EN value reflects the preservation of image details and texture information, with higher EN values indicating that more details and texture are retained.
Our method surpasses the comparison methods in almost all metrics, particularly excelling in key metrics like VIF and PSNR. Compared with traditional neural network-based algorithms (such as CNN and Fast R-CNN), our method shows significant improvement. When compared to transformer-based methods (such as SwinF and CDDFuse), our approach also performs better in most metrics, likely due to the dynamic channel adjustment strategy we proposed, which enhances deep interaction between modalities. Additionally, our method outperforms newer techniques such as ProbEn and the asymmetric network MMA-UNet in most metrics, demonstrating its advantage in handling complex image fusion tasks. Overall, our method, with its innovative strategies and modules, achieves significant performance improvements across various quality metrics.
To validate the effectiveness of our method, the fused images should be applied to downstream tasks to assess their contribution. In our work, we applied infrared and visible image fusion (IVF) to object detection, conducting experiments in traffic and industrial environments. The detection targets included both large and small objects, and the results further demonstrated the strong generalization capability of the MNCM model, which achieved state-of-the-art performance across both datasets.
To further evaluate the visual performance of our proposed method, we conduct a qualitative comparison with several state-of-the-art fusion-based pedestrian detection methods, including CDDFuse, SwinF, and MMA-UNet. As shown in Figure 2, we visualize the detection results on a representative challenging scene containing multiple objects with varying levels of occlusion and lighting conditions.
Table 2. The comparison of performance between our architecture and benchmark models on the MSRS dataset highlights the effectiveness of each approach in the fusion task. All experiments were run under identical settings.

| Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.236 | 0.289 | 10.346 | 0.412 | 4.366 |
| GNN-based [32] | 0.250 | 0.239 | 10.899 | 0.475 | 5.342 |
| CDDFuse [33] | 0.386 | 0.363 | 12.803 | 0.645 | 6.301 |
| ProbEn [34] | 0.296 | 0.332 | 11.996 | 0.338 | 3.865 |
| MMA-UNet [37] | 0.446 | 0.478 | 13.634 | 0.702 | 6.398 |
| Ours | 0.473 | 0.513 | 13.834 | 0.731 | 6.621 |
Table 3. The comparison of performance between our architecture and benchmark models on the SAIC dataset highlights the effectiveness of each approach in the fusion task. All experiments were run under identical settings.

| Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.436 | 0.589 | 14.546 | 0.675 | 6.376 |
| GNN-based [32] | 0.421 | 0.633 | 15.112 | 0.702 | 7.381 |
| CDDFuse [33] | 0.530 | 0.630 | 16.103 | 0.736 | 7.653 |
| ProbEn [34] | 0.492 | 0.573 | 15.102 | 0.593 | 6.784 |
| SwinF [35] | 0.502 | 0.641 | 16.330 | 0.732 | 7.801 |
| MMA-UNet [37] | 0.512 | 0.673 | 16.381 | 0.742 | 7.931 |
| Ours | 0.506 | 0.692 | 17.031 | 0.781 | 7.832 |
Table 4. Performance comparison on the MSRS dataset between our architecture, benchmark models, and variant architectures, showing effectiveness in object detection under the AP50, AP75, and mAP metrics.

| Method | AP50 (Person) | AP50 (Car) | AP75 (Person) | AP75 (Car) | mAP (Person) | mAP (Car) |
|---|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.831 | 0.853 | 0.804 | 0.736 | 0.769 | 0.683 |
| GNN-based [32] | 0.830 | 0.901 | 0.763 | 0.862 | 0.736 | 0.801 |
| CDDFuse [33] | 0.932 | 0.916 | 0.902 | 0.910 | 0.897 | 0.864 |
| ProbEn [34] | 0.906 | 0.913 | 0.842 | 0.869 | 0.811 | 0.844 |
| DETR [25] | 0.892 | 0.927 | 0.889 | 0.900 | 0.883 | 0.812 |
| SwinF [35] | 0.896 | 0.945 | 0.873 | 0.902 | 0.886 | 0.872 |
| TransFusion [36] | 0.915 | 0.928 | 0.903 | 0.901 | 0.891 | 0.884 |
| MMA-UNet [37] | 0.926 | 0.903 | 0.913 | 0.910 | 0.892 | 0.873 |
| Ours | 0.941 | 0.939 | 0.933 | 0.920 | 0.881 | 0.894 |
| w/o DCA | 0.915 | 0.883 | 0.904 | 0.873 | 0.849 | 0.831 |
| w/o MAAM | 0.908 | 0.890 | 0.891 | 0.846 | 0.833 | 0.825 |
Table 5. Performance comparison (AP50 scores) on the SAIC dataset across different object categories. Our method achieves state-of-the-art results in three out of four categories.

| Method | Sensor | Lathe | Forklift | Filter |
|---|---|---|---|---|
| Faster R-CNN [31] | 0.734 | 0.883 | 0.801 | 0.884 |
| GNN-based [32] | 0.702 | 0.912 | 0.814 | 0.861 |
| CDDFuse [33] | 0.816 | 0.943 | 0.902 | 0.933 |
| ProbEn [34] | 0.819 | 0.952 | 0.897 | 0.947 |
| DETR [25] | 0.821 | 0.973 | 0.936 | 0.914 |
| SwinF [35] | 0.820 | 0.978 | 0.933 | 0.961 |
| TransFusion [36] | 0.812 | 0.956 | 0.912 | 0.938 |
| MMA-UNet [37] | 0.807 | 0.992 | 0.943 | 0.970 |
| Ours | 0.843 | 0.986 | 0.970 | 0.976 |
| w/o DCA | 0.801 | 0.953 | 0.919 | 0.902 |
| w/o MAAM | 0.783 | 0.937 | 0.904 | 0.934 |
Figure 2. Qualitative comparison of detection results produced by CDDFuse, SwinF, MMA-UNet, and our proposed method.

4.4. Ablation

To comprehensively evaluate the effectiveness of our proposed MNCM model, we conducted an ablation analysis focusing on its two key components: the dynamic channel adjustment (DCA) module and the multi-scale activated attention mechanism (MAAM) module. These modules are designed to enhance cross-modal feature representation and fusion, thereby improving overall model performance in classification and detection tasks.
To assess their individual contributions, we systematically removed each module and analyzed the resulting performance variations. As shown in Table 4 and Table 5, eliminating either component led to a noticeable decline in accuracy, underscoring their critical roles. Quantitative results demonstrate the effectiveness of both the DCA and MAAM modules. On the MSRS dataset, the DCA module improves mAP from 0.849 to 0.881 for Person detection (+3.2%) and from 0.831 to 0.894 for Car detection (+6.3%). Similarly, the MAAM module yields consistent gains, with Person mAP increasing from 0.833 to 0.881 (+4.8%) and Car from 0.825 to 0.894 (+6.9%). On the SAIC dataset, both modules generalize well, with DCA and MAAM achieving average AP50 improvements of 3.7% and 4.6%, respectively, across all object categories. These results highlight the complementary strengths of DCA and MAAM in enhancing spatial feature refinement and modality-aware representation.
Specifically, the DCA module facilitates adaptive feature recalibration by dynamically adjusting channel-wise importance across modalities, ensuring that essential information is preserved during fusion. Meanwhile, the MAAM module expands the receptive field through dilated attention, allowing the model to capture modality-specific spatial dependencies and refine feature integration. Together, these modules work synergistically—DCA enhances deep modality interaction at the feature level, while MAAM strengthens spatial feature extraction through attention mechanisms—resulting in significant improvements in robustness and detection accuracy. These findings highlight the necessity of both components in the MNCM framework, demonstrating their complementary roles in optimizing the model’s performance.
In addition to evaluating the core modules (DCA and MAAM), we conducted systematic ablation studies on the loss function components to understand their individual contributions. To maintain model integrity during these experiments, we employed a re-weighting strategy where the targeted loss term’s weight ( λ ) was set to zero while keeping all network architectures intact. The experimental results are shown in Table 6.
The experimental results show that the impact of the three types of loss functions on detection performance varies significantly. The absence of the bounding box loss L box leads to the most noticeable performance decline (−7.0% for pedestrians and −6.4% for vehicles), confirming the crucial role of GIoU loss in object localization. The removal of the object confidence loss L obj has a greater impact on pedestrian detection (−4.3% vs. −2.8% for vehicles), indicating that pedestrian detection faces more severe class imbalance issues. The classification loss L cls affects both target categories similarly (−3.9% for pedestrians and −3.6% for vehicles), highlighting its general applicability in category differentiation.

4.5. Computational Complexity Analysis

To assess the computational efficiency of MNCM, we compare it with several representative transformer-based or attention-based multimodal object detection methods. The evaluation is conducted on the MSRS dataset using a fixed input resolution of 640 × 640 pixels, which reflects a typical real-world deployment setting. All models are tested on the same hardware (Tesla A30 GPU) under consistent inference conditions. We report the number of parameters (in millions), floating-point operations (FLOPs in GFLOPs), average inference time per image (milliseconds), and frames per second (FPS). The results are shown in Table 7.
As shown in Table 7, MNCM achieves the lowest number of parameters (31.4 M) among all compared methods, reflecting its lightweight architectural design. While its FLOPs (70.3 G) are slightly higher than those of SwinF (69.4 G), MNCM maintains a competitive inference time of 33.6 ms and achieves 27.9 FPS, which is on par with or slightly better than transformer-based baselines such as SwinF (26.5 FPS) and CDDFuse (25.9 FPS).
These results indicate that MNCM provides a favorable balance between model complexity and runtime efficiency. The improvements can be attributed to the integration of the dynamic channel adjustment (DCA) and multi-scale activated attention mechanism (MAAM), which jointly enhance feature representation across modalities while maintaining efficient computation suitable for real-time multimodal object detection.
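The runtime figures in Table 7 can be reproduced in spirit with a simple profiling routine such as the sketch below, which measures parameter count, per-image latency, and FPS for a dual-input detector; FLOPs counting would additionally require a tool such as thop or fvcore, and the warm-up and run counts here are assumptions.

```python
import time
import torch


def profile_model(model, input_shape=(1, 3, 640, 640), runs=100, device="cuda"):
    """Rough profiling: parameter count (M), mean inference latency (ms), and FPS
    for a dual-input (IR + RGB) detector at 640x640 resolution."""
    model = model.to(device).eval()
    ir = torch.randn(input_shape, device=device)
    rgb = torch.randn(input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(ir, rgb)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(ir, rgb)
        torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1000.0
    return params_m, latency_ms, 1000.0 / latency_ms
```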

5. Conclusions

In this paper, we present MNCM, a novel multimodal image fusion network designed to enhance target detection performance through the integration of dynamic channel adjustment (DCA) and the multi-scale activated attention mechanism (MAAM). DCA facilitates deep interactions between modalities, ensuring the preservation of critical information, while MAAM captures spatial features from multiple perspectives, enabling more effective fusion of complementary data. Extensive experimental results validate the capabilities of our model, demonstrating its robust performance in both image fusion and object detection tasks. Evaluations on two distinct datasets further confirm MNCM's significant impact on object detection, particularly in complex environments with objects of varying sizes. These findings underscore MNCM's potential as a powerful tool for advancing multimodal target detection in real-world applications. Some limitations and potential improvements are discussed in Appendix A.

Author Contributions

Y.Y. and M.C.; methodology, Y.Y. and M.C.; modeling, Y.Y.; software, Y.Y.; validation, Y.Y.; investigation, Y.Y.; writing, Y.Y.; tuition, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Limitations and Future Work

While MNCM demonstrates competitive accuracy and computational efficiency in multimodal object detection, it still has several limitations that suggest areas for improvement.
One key limitation is that MNCM does not incorporate temporal or sequential modeling, which restricts its use to static image-based detection. This means the model is unable to capture temporal continuity across video frames or perform multi-frame fusion, both of which are essential for dynamic scenarios like video surveillance and autonomous driving. Another challenge is that MNCM assumes complete and well-aligned modality inputs during inference. However, in real-world situations, modalities may be missing, degraded, or misaligned due to factors like sensor failures, occlusions, or adverse weather conditions. Currently, the model lacks explicit mechanisms to handle these issues, which could affect its performance in challenging environments. Additionally, MNCM is designed specifically for object detection and has not been evaluated in large-scale pretraining or general-purpose multimodal learning contexts. Its modular structure may also create challenges when attempting to integrate it into broader multimodal frameworks or adapt it to large-scale foundation models.
Looking ahead, we aim to improve MNCM’s ability to handle incomplete or noisy modalities, introduce lightweight temporal modeling modules for video-based detection, and explore how the model can be integrated with scalable pretraining strategies or multimodal foundation model backbones to enhance its adaptability and transferability.

References

  1. Zheng, Y.; Blasch, E.; Liu, Z. Multispectral Image Fusion and Colorization; SPIE Press: Bellingham, WA, USA, 2018; Volume 481. [Google Scholar]
  2. Ouardirhi, Z.; Mahmoudi, S.A.; Zbakh, M. Enhancing object detection in smart video surveillance: A survey of occlusion-handling approaches. Electronics 2024, 13, 541. [Google Scholar] [CrossRef]
  3. Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal industrial anomaly detection via hybrid fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041. [Google Scholar]
  4. Baumgartner, M.; Jäger, P.F.; Isensee, F.; Maier-Hein, K.H. nnDetection: A self-configuring method for medical object detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; pp. 530–539. [Google Scholar]
  5. Chauhan, R.; Ghanshala, K.K.; Joshi, R. Convolutional neural network (CNN) for image detection and recognition. In Proceedings of the 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [PubMed]
  7. Chandana, R.; Ramachandra, A. Real time object detection system with YOLO and CNN models: A review. arXiv 2022, arXiv:2208.00773. [Google Scholar]
  8. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Lió, P.; Toschi, N. Multimodal and multicontrast image fusion via deep generative models. Inf. Fusion 2022, 88, 146–160. [Google Scholar] [CrossRef]
  9. Liu, L.; Muelly, M.; Deng, J.; Pfister, T.; Li, L.J. Generative modeling for small-data object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6073–6081. [Google Scholar]
  10. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  11. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar]
  12. Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 72–80. [Google Scholar]
  13. Liu, Y. Cross-Modal Attention for Robust Object Detection. Comput. Vis. Image Underst. 2024, 189, 103305. [Google Scholar]
  14. Li, X.; Liu, J.; Tang, Z.; Han, B.; Wu, Z. MEDMCN: A novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J. Supercomput. 2024, 80, 12863–12890. [Google Scholar]
  15. Zhan, Y.; Zeng, Z.; Liu, H.; Tan, X.; Tian, Y. MambaSOD: Dual Mamba-driven cross-modal fusion network for RGB-D salient object detection. Neurocomputing 2025, 631, 129718. [Google Scholar]
  16. Liu, S.; Liu, Z. Multi-channel CNN-based object detection for enhanced situation awareness. arXiv 2017, arXiv:1712.00075. [Google Scholar]
  17. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Zhang, J. Bayesian fusion for infrared and visible images. Signal Process. 2020, 177, 107734. [Google Scholar]
  18. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  19. Pedersoli, M.; Gonzàlez, J.; Hu, X.; Roca, X. Toward real-time pedestrian detection based on a deformable template model. IEEE Trans. Intell. Transp. Syst. 2013, 15, 355–364. [Google Scholar] [CrossRef]
  20. Yan, J.; Lei, Z.; Wen, L.; Li, S.Z. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2497–2504. [Google Scholar]
  21. Zhou, H.; Yu, G. Research on pedestrian detection technology based on the SVM classifier trained by HOG and LTP features. Future Gener. Comput. Syst. 2021, 125, 604–615. [Google Scholar] [CrossRef]
  22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  25. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  27. Zhang, Z. Transformer-Based Fusion for Multi-Sensor Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1391–1403. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Sriharipriya, K.C. Enhanced pothole detection system using YOLOX algorithm. Auton. Intell. Syst. 2022, 2, 22. [Google Scholar]
  30. Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep multimodal fusion by channel exchanging. Adv. Neural Inf. Process. Syst. 2020, 33, 4835–4845. [Google Scholar]
  31. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  32. Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13708–13715. [Google Scholar]
  33. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  34. Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling; Springer: Cham, Switzerland, 2022. [Google Scholar]
  35. Li, T.; Wang, H.; Li, G.; Liu, S.; Tang, L. SwinF: Swin Transformer with feature fusion in target detection. J. Phys. Conf. Ser. 2022, 2284, 012027. [Google Scholar]
  36. Zhou, C.; Yu, L.; Babu, A.; Tirumala, K.; Yasunaga, M.; Shamis, L.; Kahn, J.; Ma, X.; Zettlemoyer, L.; Levy, O. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. arXiv 2024, arXiv:2408.11039. [Google Scholar]
  37. Huang, J.; Li, X.; Tan, T.; Li, X.; Ye, T. MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion. arXiv 2024, arXiv:2404.17747. [Google Scholar]
Figure 1. The overall architecture of the proposed MNCM framework consists of two parallel ResNet-based image feature extraction networks, processing input features from each modality. The extracted features are then fed into the DCA-MAAM module. In the DCA-MAAM module, the dynamic channel adjustment and multi-scale activated attention mechanisms are applied. The blue box with three small squares represents sliding convolutions with different dilation rates of 1, 2, and 3, capturing features at multiple receptive field scales.
Table 1. SAIC dataset statistics.

| Dataset | Sensors | Lathes | Forklifts | Filters | Total |
|---|---|---|---|---|---|
| SAIC | 590 | 367 | 465 | 267 | 1689 |
Table 6. Ablation study on loss function components (MSRS validation set).

| Configuration | Person mAP | Car mAP |
|---|---|---|
| Full model | 0.881 | 0.894 |
| w/o $L_{\mathrm{box}}$ | 0.819 (−7.0%) | 0.837 (−6.4%) |
| w/o $L_{\mathrm{obj}}$ | 0.843 (−4.3%) | 0.869 (−2.8%) |
| w/o $L_{\mathrm{cls}}$ | 0.847 (−3.9%) | 0.858 (−3.6%) |
Table 7. Computational complexity and runtime efficiency comparison between MNCM and baseline methods on the MSRS dataset.

| Method | Params (M) | FLOPs (G) | Inference Time (ms) | FPS |
|---|---|---|---|---|
| CDDFuse [33] | 48.3 | 74.6 | 41.7 | 25.9 |
| SwinF [35] | 56.6 | 69.4 | 39.2 | 26.5 |
| MMA-UNet [37] | 38.1 | 72.8 | 35.4 | 28.2 |
| Ours | 31.4 | 70.3 | 33.6 | 27.9 |
