Article

Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4298; https://doi.org/10.3390/app15084298
Submission received: 6 March 2025 / Revised: 27 March 2025 / Accepted: 2 April 2025 / Published: 13 April 2025

Abstract

Object detection benefits greatly from multimodal image fusion, which integrates complementary data from different modalities like RGB and thermal images. However, existing methods struggle with effective inter-modal fusion, particularly in capturing spatial and contextual information across diverse regions and scales. To address these limitations, we propose the dynamic channel adjustment and multi-scale activated attention mechanism network (MNCM). Our approach incorporates dynamic channel adjustment for precise feature fusion across modalities and a multi-scale attention mechanism to capture both local and global contexts. This design improves robustness while balancing computational efficiency. The model’s scalability is enhanced through its ability to adaptively process multi-scale information without being constrained by fixed-scale designs. To validate our method, we used two multimodal datasets from traffic and industrial scenarios, which consisted of paired thermal infrared and visible light images. The results first demonstrate strong performance in multimodal fusion and then show state-of-the-art results in object detection, proving its effectiveness for real-world applications.

1. Introduction

Object detection is a fundamental task in computer vision, which involves identifying and precisely localizing objects within images or videos. By enabling machines to interpret visual data, it plays a crucial role in applications such as autonomous driving [1], intelligent surveillance [2], industrial automation [3], and medical imaging [4].
In recent years, integrating various neural network architectures has led to significant advancements in object detection technologies. Early Convolutional Neural Network (CNN)-based models excelled in feature extraction but were limited by their localized processing, which hindered the capture of deep interactions between different modalities [5,6,7]. Generative models enhanced the quality of fused outputs but faced challenges in ensuring consistency across diverse receptive field regions [8,9]. The introduction of attention mechanisms marked a pivotal shift, which enabled models to focus on relevant input regions and capture long-range dependencies, thereby improving feature representation and detection accuracy [10,11]. Building upon these developments, multimodal data fusion has emerged as a key research focus [12,13,14,15,16,17], leveraging complementary information from diverse modalities to enhance robustness and performance in complex environments. However, existing multimodal detectors still face critical limitations, including insensitivity to spatial location information, feature loss, and insufficient fusion during the multimodal data integration process. These issues hinder the ability to fully leverage the complementary strengths of different modalities, ultimately impacting the overall effectiveness of multimodal object detection systems.
Therefore, we propose a novel multimodal image fusion network, the Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention (MNCM), which applies channel adjustment and multi-scale attention within the backbone branch of each modality. This design achieves both inter-modality complementarity and intra-modality enhancement during the fusion process. The dynamic channel adjustment (DCA) module dynamically adjusts channels to optimize fusion and enhance modality complementarity, allowing the network to adaptively focus on the most relevant features from each modality. Meanwhile, the multi-scale activated attention mechanism (MAAM) strengthens attention to key features across multiple receptive field scales, ensuring that both local and global contexts are effectively captured. In contrast to existing transformer-based multimodal fusion models, which predominantly rely on global token-level attention for feature integration, our approach emphasizes fine-grained channel-level interaction and scale-aware attention modulation. This design reduces computational overhead while maintaining strong performance, particularly in scenarios with diverse object sizes and modality variations. Together, these components enable flexible and efficient inter-modality feature interactions, significantly improving the quality of fused images and overall detection performance. By leveraging these advanced mechanisms, our network achieves superior adaptability to varying object sizes and complex environments, making it a robust solution for real-world detection tasks.
We conducted extensive experiments on traffic and industrial datasets to thoroughly evaluate the performance of the module in multimodal fusion. Initially, we assessed the image fusion performance, which demonstrated the strong capability of the two modules in integrating multimodal information. Subsequently, the results from object detection experiments showed that our approach significantly improved detection performance in complex scenarios. Notably, it exhibited exceptional adaptability and robustness, particularly in handling objects of varying sizes and challenging environments. These findings underscore the model's strong generalization ability, which offers a promising solution for multimodal image fusion in real-world applications. Furthermore, extensive computational efficiency analysis demonstrates that our model achieves this superior performance while maintaining competitive inference speed and parameter efficiency, making it suitable for real-time applications.
Our main contributions are summarized as follows:
  • To achieve effective feature interaction and adaptive channel selection, we design a dynamic channel adjustment strategy using channel exchange principles for enhancing the fusion of complementary information from different modalities.
  • To capture key features across different scales, we introduce a multi-scale activated attention mechanism, enhancing the model’s focus on critical features from each modality for improved detection accuracy.
  • Extensive experimental results on two datasets show that our approach outperforms state-of-the-art methods in both image fusion and object detection tasks, thus demonstrating its superior performance in complex scenarios.

2. Related Work

Early object detection relied on traditional machine learning techniques [18,19,20,21], which utilized hand-crafted features and conventional classifiers. These methods achieved moderate success in simple environments but struggled with complex, multi-scale scenarios. The introduction of Convolutional Neural Networks (CNNs) marked a turning point by enabling end-to-end feature learning, significantly enhancing both accuracy and efficiency. Notable early CNN-based detectors include R-CNN [22] and its successors Fast R-CNN and Faster R-CNN [6], which introduced region proposals and hierarchical feature representations. However, their computational complexity limited their suitability for real-time applications. This challenge led to the development of single-stage detectors such as YOLO [23] and SSD [24], which achieved a better balance between speed and accuracy by directly predicting bounding boxes and class probabilities.
As detection tasks became more complex—particularly in multi-object and multi-scale contexts—there was a growing need to model long-range dependencies and contextual relationships. Attention mechanisms and transformer-based architectures were introduced to address this limitation, fundamentally reshaping object detection. Carion et al. [25] proposed DETR, which formulates detection as a set prediction problem using global attention and a bipartite loss. Han et al. [10] presented TNT, which enhances local feature extraction by dividing image patches and computing intra-patch attention. Liu et al. [26] developed the Swin Transformer, a hierarchical vision transformer with shifted windows, achieving state-of-the-art performance in both classification and detection tasks.
More recently, multimodal learning has emerged as a powerful strategy for improving object detection by integrating complementary information from diverse sources, such as RGB images, LiDAR, and radar. For example, Wang et al. [3] introduced a fusion method using a shared backbone for RGB and LiDAR data, yielding significant gains in autonomous driving scenarios. Liu et al. [13] proposed a cross-modal attention framework that dynamically selects relevant modalities, improving the detection of small and occluded objects. In addition, Zhang et al. [27] demonstrated the use of transformer-based architectures for multi-sensor fusion, enhancing feature alignment and object localization.

3. Methodology

3.1. Architecture Overview

The architecture of our proposed model is shown in Figure 1. It includes a dual-stream feature extraction backbone, a two-unit feature enhancement and fusion module, and an FPN module with a detection head. Infrared (IR) and visible (RGB) images are first processed independently through parallel ResNet-50 [28] networks, where each stream generates five hierarchical feature maps corresponding to different stages of the network. Five DCA-MAAM modules, one per stage, are applied to fuse the corresponding feature maps. The fusion process begins with a dynamic channel adjustment (DCA) strategy, which adaptively recalibrates cross-modal feature channels by learning modality-specific importance weights. Subsequently, the multi-scale activated attention mechanism (MAAM) activates both local and global contextual information through parallel convolutional branches with varying receptive fields, enhancing the model's ability to capture objects at different scales. The fused features are processed by a Feature Pyramid Network (FPN) [29] to build a multi-scale feature hierarchy, which is then fed into YOLOX detection heads for final predictions. The detection heads output bounding box coordinates, objectness scores, and class probabilities, enabling robust multimodal detection. The framework is optimized using a combined loss function that integrates classification, localization, and spatial alignment objectives.
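To make the data flow concrete, the following is a minimal PyTorch-style sketch of the dual-stream pipeline described above. The class and argument names (e.g., DualStreamDetector, fusion_blocks) are illustrative placeholders rather than the authors' released code, and the assumption that the channel-exchanged features continue through the backbone while the fused map feeds the neck is one plausible reading of Figure 1.

```python
import torch.nn as nn
import torchvision


class DualStreamDetector(nn.Module):
    """Minimal sketch: two ResNet-50 streams, one DCA-MAAM block per stage,
    an FPN-style neck, and a YOLOX-style head (fusion/neck/head passed in)."""

    def __init__(self, fusion_blocks, neck, head):
        super().__init__()
        ir_net = torchvision.models.resnet50(weights=None)   # IR assumed replicated to 3 channels
        rgb_net = torchvision.models.resnet50(weights=None)

        def stages(net):
            stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
            return nn.ModuleList([stem, net.layer1, net.layer2, net.layer3, net.layer4])

        self.ir_stages, self.rgb_stages = stages(ir_net), stages(rgb_net)
        self.fusion_blocks = fusion_blocks   # five DCA-MAAM modules, one per stage
        self.neck, self.head = neck, head    # FPN and YOLOX detection head

    def forward(self, ir, rgb):
        fused_maps = []
        for ir_stage, rgb_stage, fuse in zip(self.ir_stages, self.rgb_stages, self.fusion_blocks):
            ir, rgb = ir_stage(ir), rgb_stage(rgb)
            # DCA exchanges channels between the streams; MAAM produces the fused map.
            ir, rgb, fused = fuse(ir, rgb)
            fused_maps.append(fused)
        return self.head(self.neck(fused_maps))
```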

3.2. Dynamic Channel Adjustment

Building upon the success of channel interaction strategies in multimodal learning [30], we propose a dynamic channel adjustment (DCA) mechanism to enhance cross-modal feature fusion. The DCA adaptively recalibrates and exchanges channel-wise features between modalities through learnable attention weights, effectively capturing complementary information while suppressing redundant features.
Given the feature maps $x_1 \in \mathbb{R}^{H \times W \times C}$ and $x_2 \in \mathbb{R}^{H \times W \times C}$ from the two modalities at stage $s$, we first apply Batch Normalization (BN) to each channel to stabilize the feature distribution:

$$\hat{x}_{m,s,c} = \frac{x_{m,s,c} - \mu_{m,s,c}}{\sqrt{\sigma_{m,s,c}^{2} + \epsilon}}, \quad m \in \{1, 2\}$$

where $\mu_{m,s,c}$ and $\sigma_{m,s,c}^{2}$ denote the mean and variance of the $c$-th channel for modality $m$ at stage $s$, and $\epsilon$ is a small constant for numerical stability. The normalized features are then scaled and shifted using learnable parameters:

$$\tilde{x}_{m,s,c} = \gamma_{m,s,c}\,\hat{x}_{m,s,c} + \beta_{m,s,c}$$

where $\gamma_{m,s,c}$ and $\beta_{m,s,c}$ are the affine transformation parameters for channel-wise adjustment.

To enable dynamic channel interaction, we generate modality-specific attention weights through a shared $1 \times 1$ convolutional layer followed by a sigmoid activation:

$$g_m = \sigma(W_g * \tilde{x}_m), \quad m \in \{1, 2\}$$

where $W_g \in \mathbb{R}^{1 \times 1 \times C \times C}$ denotes the convolutional kernel, and $\sigma(\cdot)$ is the sigmoid function that maps the weights to the range $[0, 1]$.

The channel exchange operation is then performed as a weighted combination of the normalized features:

$$x'_{1,c} = g_{1,c} \odot \tilde{x}_{1,c} + (1 - g_{1,c}) \odot \tilde{x}_{2,c}$$

$$x'_{2,c} = g_{2,c} \odot \tilde{x}_{2,c} + (1 - g_{2,c}) \odot \tilde{x}_{1,c}$$

where $\odot$ denotes element-wise multiplication. This formulation ensures that each channel in the output features $x'_1$ and $x'_2$ incorporates complementary information from both modalities while preserving the original dimensions $H \times W \times C$.
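A minimal PyTorch sketch of the DCA computation is given below, assuming BatchNorm2d for the normalization and affine steps and a shared 1 × 1 gating convolution across modalities, as in the equations; whether the gates are additionally pooled to be purely channel-wise is left open and treated here as spatially varying.

```python
import torch
import torch.nn as nn


class DynamicChannelAdjustment(nn.Module):
    """Sketch of DCA: per-modality BatchNorm (normalization plus learnable
    gamma/beta), a shared 1x1 convolution with sigmoid to produce gates g_m,
    and a gated channel exchange between the two modalities."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)  # shared W_g

    def forward(self, x1, x2):
        x1n, x2n = self.bn1(x1), self.bn2(x2)      # normalized, scaled, shifted features
        g1 = torch.sigmoid(self.gate(x1n))         # modality-specific gates in [0, 1]
        g2 = torch.sigmoid(self.gate(x2n))
        # Gated channel exchange: each output mixes its own features with the
        # complementary modality's features, preserving the H x W x C shape.
        x1_out = g1 * x1n + (1.0 - g1) * x2n
        x2_out = g2 * x2n + (1.0 - g2) * x1n
        return x1_out, x2_out
```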

3.3. Multi-Scale Activated Attention Mechanism

The multi-scale activated attention mechanism improves the model’s ability to capture both local and global contexts by applying dilated convolutions at multiple scales within the attention framework. This enables the effective extraction of multi-scale features from input images through adjusted receptive fields.
To implement this mechanism, we perform convolutions on the channel-exchanged feature maps $x'_1$ and $x'_2$, generating queries ($q$), keys ($k$), and values ($v$) for each modality. A standard convolution generates $q$, while dilated convolutions with rates $r = 1, 2, 3$ produce $k$ and $v$. This retains the channel dimensions and results in three sets of $q$, $k$, and $v$ per modality:

$$q_{i,r} = \mathrm{Conv}(x'_i), \quad k_{i,r} = \mathrm{DilateConv}_r(x'_i), \quad v_{i,r} = \mathrm{DilateConv}_r(x'_i)$$

where $q_{i,r}$, $k_{i,r}$, and $v_{i,r}$ represent the query, key, and value at scale $r$ for each modality $i$ (with $r = 1, 2, 3$). Here, $\mathrm{Conv}(\cdot)$ denotes a standard convolution operation, while $\mathrm{DilateConv}_r(\cdot)$ signifies a dilated convolution with dilation rate $r$.

For each scale $r$, the attention score $\alpha_{i,r}$ is computed via a dot product between the queries and keys, followed by a scaling operation:

$$\alpha_{i,r} = \mathrm{Softmax}\left(\frac{q_{i,r}\, k_{i,r}^{\top}}{\sqrt{d}}\right)$$

$$\mathrm{head}_{i,r} = \alpha_{i,r}\, v_{i,r}$$

where $d$ is a scaling factor, typically defined as $C/h$, denoting the channel dimension per attention head. The attention score $\alpha_{i,r}$ is then applied to the corresponding value $v_{i,r}$, yielding the output of each attention head.

The attention heads from the different scales $r = 1, 2, 3$ are concatenated along the channel dimension to form a multi-scale attention feature representation for each modality:

$$x_{i,\mathrm{concat}} = \mathrm{Concat}(\mathrm{head}_{i,1}, \mathrm{head}_{i,2}, \mathrm{head}_{i,3})$$

Next, the concatenated feature maps from the two modalities are combined in two ways: element-wise summation produces $x_{\mathrm{sum}}$, and element-wise multiplication produces $x_{\mathrm{mul}}$. A spatial attention mechanism is then applied to $x_{\mathrm{sum}}$ to enhance spatial dependencies:

$$x_{\mathrm{spatial}} = \sigma\Big(\mathrm{Conv}\big(\mathrm{Concat}\big[F_{\mathrm{avg}}(x_{\mathrm{sum}}),\, F_{\mathrm{max}}(x_{\mathrm{sum}})\big]\big)\Big) \odot x_{\mathrm{sum}}$$

where $\sigma$ denotes the activation function, $\mathrm{Conv}(\cdot)$ represents a convolutional layer, and $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ refer to the average and maximum pooling operations, respectively.

The spatially enhanced representation $x_{\mathrm{spatial}}$ then undergoes channel self-attention to capture inter-channel dependencies, resulting in the final fused representation $x_{\mathrm{fused}}$:

$$x_{\mathrm{fused}} = \sigma\Big(W_{\mathrm{sa}}\big[F_{\mathrm{avg}}(x_{\mathrm{spatial}} \odot x_{\mathrm{mul}}) + F_{\mathrm{max}}(x_{\mathrm{spatial}} \odot x_{\mathrm{mul}})\big]\Big)$$

where $W_{\mathrm{sa}}$ are learnable parameters and $\odot$ denotes element-wise multiplication.

The final fused representation $x_{\mathrm{fused}}$ integrates multi-scale, spatial, and channel-level information. This enriched representation then undergoes five additional processing stages before entering the Feature Pyramid Network (FPN), ensuring a seamless transition and coherence throughout the architecture.
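The sketch below mirrors these MAAM steps in PyTorch under several stated assumptions: 3 × 3 convolutions for q/k/v, a single attention head per scale, a projection back to C channels after concatenating the three scales, a 7 × 7 convolution for spatial attention, and re-applying the channel weights to the gated features in the final step (which the equation for $x_{\mathrm{fused}}$ leaves implicit).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleActivatedAttention(nn.Module):
    """Sketch of MAAM: q/k/v from standard and dilated 3x3 convolutions (rates 1, 2, 3),
    scaled dot-product attention per scale, concatenation of the per-scale heads, then
    spatial and channel attention over the combined modalities."""

    def __init__(self, channels: int, rates=(1, 2, 3)):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.kv = nn.ModuleList([
            nn.ModuleDict({
                "k": nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                "v": nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
            }) for r in rates])
        self.proj = nn.Conv2d(channels * len(rates), channels, 1)  # merge concatenated heads
        self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)          # spatial attention over [avg; max]
        self.channel_fc = nn.Conv2d(channels, channels, 1)         # W_sa for channel self-attention

    def attend(self, x):
        """Multi-scale attention for one modality (single head per scale for simplicity)."""
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                                   # (B, C, HW)
        heads = []
        for branch in self.kv:
            k = branch["k"](x).flatten(2)                          # (B, C, HW)
            v = branch["v"](x).flatten(2)
            attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
            heads.append((attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w))
        return self.proj(torch.cat(heads, dim=1))                  # multi-scale features, C channels

    def forward(self, x1, x2):
        a1, a2 = self.attend(x1), self.attend(x2)
        x_sum, x_mul = a1 + a2, a1 * a2
        # Spatial attention on the summed features (channel-wise average and max maps).
        s = torch.cat([x_sum.mean(1, keepdim=True), x_sum.amax(1, keepdim=True)], dim=1)
        x_spatial = torch.sigmoid(self.spatial_conv(s)) * x_sum
        # Channel self-attention on the spatially enhanced, multiplicatively gated features;
        # applying the resulting weights back to the features is an assumption.
        z = x_spatial * x_mul
        w = torch.sigmoid(self.channel_fc(F.adaptive_avg_pool2d(z, 1) + F.adaptive_max_pool2d(z, 1)))
        return w * z
```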

3.4. FPN, Detector and Loss Function

Let the final output feature map from the multi-scale attention mechanism be $x$, where $x$ is a multi-scale feature map that encapsulates information from the input image at various spatial resolutions. The dimensions of $x$ are $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels.

This feature map $x$ is passed into the YOLOX Feature Pyramid Network (FPN) for further processing. The YOLOX-FPN is designed to effectively merge multi-scale features from the input and perform feature extraction across different spatial resolutions. It improves detection performance by consolidating information from various levels of the feature hierarchy. Specifically, the FPN receives input feature maps $x_i$ from multiple stages and produces a consolidated feature map $f_{\mathrm{fpn}}$, which can be mathematically represented as follows:

$$\mathrm{FPN}(x_1, x_2, x_3, x_4, x_5) \rightarrow f_{\mathrm{fpn}}$$

where $f_{\mathrm{fpn}}$ is the output of the FPN, containing enhanced multi-scale features suitable for object detection. These features are subsequently passed to the YOLO detection head for final prediction.

The YOLO detection head takes the multi-scale features from $f_{\mathrm{fpn}}$ and performs object detection by predicting the bounding box coordinates, object confidence scores, and class probabilities. The detection process can be represented as follows:

$$y = \mathrm{YOLODetectionHead}(f_{\mathrm{fpn}})$$

where $y$ is the final prediction output.

The output for each grid cell in the YOLO model can be expressed as follows:

$$y_i = (b_i,\, p_c,\, p_{\mathrm{class}})$$

where $y_i$ includes the bounding box coordinates $b_i = (x, y, w, h)$, in which $x$ and $y$ represent the center of the bounding box and $w$ and $h$ represent its width and height, respectively. Additionally, $p_c \in [0, 1]$ denotes the object confidence score, and $p_{\mathrm{class}} \in \mathbb{R}^{K}$ represents the class probabilities for each detected object.
For multi-object detection, the total loss $L$ is computed as the sum of three main components: the bounding box loss $L_{\mathrm{box}}$, the object confidence loss $L_{\mathrm{obj}}$, and the classification loss $L_{\mathrm{cls}}$. The total loss is represented as follows:

$$L = \lambda_{\mathrm{box}} L_{\mathrm{box}} + \lambda_{\mathrm{obj}} L_{\mathrm{obj}} + \lambda_{\mathrm{cls}} L_{\mathrm{cls}}$$

where $L_{\mathrm{box}}$ measures the error between the predicted and true bounding box coordinates, $L_{\mathrm{obj}}$ measures the object confidence score loss, and $L_{\mathrm{cls}}$ measures the classification error. Each loss component is weighted by a factor $\lambda$ to balance its contribution to the total loss.

The bounding box loss $L_{\mathrm{box}}$ measures the discrepancy between the predicted bounding box $b_i$ and the ground truth $b_{\mathrm{true}}$. This work adopts the Generalized Intersection over Union (GIoU) loss, which is formulated as follows:

$$L_{\mathrm{box}} = 1 - \mathrm{GIoU}(b_i, b_{\mathrm{true}})$$

where $\mathrm{GIoU}(b_i, b_{\mathrm{true}})$ extends the traditional IoU by considering the smallest enclosing box, improving optimization when the predicted and ground-truth boxes do not overlap.
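For reference, a minimal PyTorch implementation of this GIoU-based box loss is sketched below; it assumes corner-format boxes (x1, y1, x2, y2), so the detector's raw (x, y, w, h) outputs would need to be converted first.

```python
import torch


def giou_loss(pred, target, eps=1e-7):
    """GIoU-based box loss, L_box = 1 - GIoU; boxes are in corner format (x1, y1, x2, y2)."""
    # Intersection area.
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    # Union area.
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter + eps
    iou = inter / union
    # Smallest enclosing box.
    lt_c = torch.min(pred[..., :2], target[..., :2])
    rb_c = torch.max(pred[..., 2:], target[..., 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[..., 0] * wh_c[..., 1] + eps
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou  # one loss value per box
```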
The object confidence loss $L_{\mathrm{obj}}$ is computed using focal binary cross-entropy, which helps address the issue of class imbalance by focusing more on hard-to-classify positive samples. The loss function is defined as follows:

$$L_{\mathrm{obj}} = -\kappa\,(1 - p_c)^{\delta} \log(p_c) \quad \text{for positive samples}$$

where $p_c$ represents the predicted object confidence score, indicating the model's predicted probability that an object exists in the given grid cell. $\kappa$ is a scaling factor that adjusts the loss magnitude, and $\delta$ is the focusing parameter, which controls the strength of the penalty on well-classified examples, making the loss more sensitive to misclassified positive samples. By down-weighting easily classified examples, the approach prioritizes difficult ones, improving model performance, especially in class-imbalanced scenarios.

The classification loss $L_{\mathrm{cls}}$ quantifies the discrepancy between the predicted class probability distribution $p_{\mathrm{pred}}(c)$ and the ground-truth distribution $p_{\mathrm{true}}(c)$, implemented through categorical cross-entropy:

$$L_{\mathrm{cls}} = -\sum_{c=1}^{K} p_{\mathrm{true}}(c) \log p_{\mathrm{pred}}(c)$$

where $p_{\mathrm{true}}(c) \in \{0, 1\}$ represents the one-hot encoded ground-truth label for class $c$, and $p_{\mathrm{pred}}(c) \in (0, 1)$ denotes the predicted probability after softmax normalization over the $K$ object categories. This formulation explicitly measures the information-theoretic distance between the predicted class posterior and the true data distribution.
By combining these loss components, the model learns to optimize the object detection task while maintaining a balance between bounding box accuracy, object confidence, and classification accuracy.
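Putting the three terms together, the following sketch shows one way to assemble the total loss; the λ weights, the positive-sample masking, and the use of class indices for the targets are illustrative assumptions (the paper only fixes κ = 0.25 and δ = 2, see Section 4.2), and giou_loss refers to the sketch above.

```python
import torch
import torch.nn.functional as F


def total_detection_loss(pred_boxes, true_boxes, pred_obj_logits, pred_cls_logits,
                         true_cls, pos_mask, lambdas=(1.0, 1.0, 1.0),
                         kappa=0.25, delta=2.0):
    """Sketch of L = lambda_box * L_box + lambda_obj * L_obj + lambda_cls * L_cls,
    evaluated on positive (object-matched) predictions selected by pos_mask;
    true_cls holds integer class indices."""
    lam_box, lam_obj, lam_cls = lambdas
    # Bounding-box term: GIoU loss (see the giou_loss sketch above).
    l_box = giou_loss(pred_boxes[pos_mask], true_boxes[pos_mask]).mean()
    # Objectness term: focal binary cross-entropy on positive samples.
    p_c = torch.sigmoid(pred_obj_logits[pos_mask]).clamp(1e-7, 1.0 - 1e-7)
    l_obj = (-kappa * (1.0 - p_c) ** delta * torch.log(p_c)).mean()
    # Classification term: categorical cross-entropy over the K categories.
    l_cls = F.cross_entropy(pred_cls_logits[pos_mask], true_cls[pos_mask])
    return lam_box * l_box + lam_obj * l_obj + lam_cls * l_cls
```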

4. Experiment

4.1. Experiment Setup

To comprehensively validate the effectiveness of our proposed method, we conducted extensive experiments on both the public MSRS dataset and a private industrial dataset provided by SAIC, evaluating performance across urban road and industrial scenarios.
The MSRS dataset is a well-established benchmark for urban scene analysis and contains 1444 pairs of aligned thermal infrared and visible light images, which are annotated for vehicles and pedestrians. Each image typically includes multiple objects, providing a complex traffic scenario that helps evaluate a model’s ability to distinguish between different objects in multi-object detection tasks.
The private SAIC dataset consists of 1689 multimodal images from real-world production workshops, with each image representing a specific type of industrial equipment, such as sensors, lathes, forklifts, and filters. These images capture equipment of varying sizes and complexities, from smaller sensors to larger forklifts, while also reflecting common challenges in industrial environments, such as occlusions and cluttered backgrounds. This makes the SAIC dataset particularly valuable for evaluating the model’s robustness in industrial equipment detection and classification tasks. A detailed breakdown of the SAIC dataset’s composition and category distribution is shown in Table 1.
To demonstrate the effectiveness of our multimodal fusion approach, we first conducted image fusion experiments to evaluate the quality of feature integration across modalities. Subsequently, we performed object detection tasks to validate the practical benefits of our fusion strategy. For the fusion quality assessment, we employed five well-established metrics: Visual Information Fidelity (VIF), Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), Gradient-based Fusion Metric (Qabf), and Entropy (EN). These metrics collectively evaluate different aspects of fusion performance, including information preservation, structural consistency, and noise suppression.
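For reference, two of these metrics (PSNR and EN) reduce to a few lines of NumPy, as sketched below; which reference image PSNR is computed against (e.g., averaging the score over both source images) follows common practice in the fusion literature and is an assumption here, while VIF, SSIM, and Qabf typically come from dedicated implementations.

```python
import numpy as np


def psnr(fused: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio (dB) between the fused image and a reference image."""
    mse = np.mean((fused.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


def entropy(image: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (EN) of an 8-bit image; higher values indicate richer texture and detail."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
```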
For the object detection evaluation, we adopted three standard metrics: mean Average Precision (mAP), AP@50, and AP@75. These metrics provide a comprehensive assessment of detection accuracy at different Intersection-over-Union (IoU) thresholds, effectively validating the robustness of our method across various scenarios. The combination of fusion quality metrics and detection performance indicators offers strong empirical evidence for the reliability and effectiveness of our proposed approach.

4.2. Implementation

In our experiments, we utilized ResNet50 as the backbone for feature extraction, integrating it with the YOLOX-FPN and YOLO detectors to perform object detection tasks. The training process was conducted on an NVIDIA Tesla A30 GPU over 300 epochs, with input images uniformly resized to 640 × 640 pixels. We employed Stochastic Gradient Descent (SGD) as the optimization method, setting the weight decay to 5 × 10⁻⁴.
In the object confidence loss function, we set two important hyperparameters, κ and δ , to values of 0.25 and 2, respectively. The learning rate was varied between 0.0001 and 0.01, and we implemented cosine annealing for learning rate decay. These hyperparameter values were selected based on empirical validation and prior works to achieve optimal performance while maintaining training stability.
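A minimal optimizer and scheduler setup consistent with these settings is sketched below; the momentum value, the data loader, and the compute_loss call are placeholders added for completeness rather than the authors' exact training script.

```python
import torch

# SGD with weight decay 5e-4 and cosine annealing from 1e-2 down to 1e-4 over 300 epochs,
# matching the reported settings; `model`, `train_loader`, and `compute_loss` are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-4)

for epoch in range(300):
    for ir, rgb, targets in train_loader:             # paired 640x640 IR/RGB batches
        optimizer.zero_grad()
        loss = compute_loss(model(ir, rgb), targets)  # total loss from Section 3.4
        loss.backward()
        optimizer.step()
    scheduler.step()
```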

4.3. Comparison Experiment and Ablation

We selected several methods to conduct experiments on the MSRS dataset and the SAIC dataset. The experimental results validating the effectiveness of image fusion are presented in Table 2 and Table 3. The results of the object detection experiments are presented in Table 4 and Table 5.
  • Faster R-CNN [31]: A domain adaptive object detection method based on Faster R-CNN that mitigates image-level and instance-level shifts using adversarial training and consistency regularization.
  • GNN-based [32]: A joint multi-object tracking (MOT) approach based on Graph Neural Networks (GNNs) that simultaneously optimizes object detection and data association by modeling spatial and temporal relationships.
  • CDDFuse [33]: A multimodal image fusion method that employs a dual-branch architecture combining Transformer and CNN components.
  • ProbEn [34]: A multimodal object detection method based on probabilistic ensembling to effectively integrate information from multiple sensor modalities.
  • DETR [25]: An end-to-end object detection framework that uses a transformer encoder–decoder architecture with bipartite matching loss to directly predict object sets.
  • SwinF [35]: A feature fusion network based on Swin Transformer, designed to enhance object detection performance while reducing computational complexity through hierarchical windowing operations.
  • TransFusion [36]: A unified multimodal model that combines next-token prediction and diffusion processes in a single transformer to jointly model continuous data.
  • MMA-UNet [37]: A multimodal asymmetric UNet designed for balanced feature fusion by employing specialized encoders and cross-scale fusion strategies.
In the evaluation of image fusion, we employed five metrics to conduct a quantitative analysis of the results, which are presented in Table 2 and Table 3. The results show that our method outperforms the comparison methods in nearly all metrics. Specifically, higher VIF and PSNR values indicate greater fidelity between the source images and the fused image, meaning that the fused image better resembles the original source images. Higher Qabf and SSIM values suggest that the fused image retains more information from the source images with minimal distortion. The EN value reflects the preservation of image details and texture information, with higher EN values indicating that more details and texture are retained.
Our method surpasses the comparison methods in almost all metrics, particularly excelling in key metrics like VIF and PSNR. Compared with traditional neural network-based algorithms (such as CNN and Fast R-CNN), our method shows significant improvement. When compared to transformer-based methods (such as SwinF and CDDFuse), our approach also performs better in most metrics, likely due to the dynamic channel adjustment strategy we proposed, which enhances deep interaction between modalities. Additionally, our method outperforms newer techniques such as ProbEn and the asymmetric network MMA-UNet in most metrics, demonstrating its advantage in handling complex image fusion tasks. Overall, our method, with its innovative strategies and modules, achieves significant performance improvements across various quality metrics.
To validate the effectiveness of our method, the fused images should be applied to downstream tasks to assess their contribution. In our work, we applied infrared and visible image fusion (IVF) to object detection, conducting experiments in traffic and industrial environments. The detection targets included both large and small objects, and the results further demonstrated the strong generalization capability of the MNCM model, which achieved state-of-the-art performance across both datasets.
To further evaluate the visual performance of our proposed method, we conduct a qualitative comparison with several state-of-the-art fusion-based pedestrian detection methods, including CDDFuse, SwinF, and MMA-UNet. As shown in Figure 2, we visualize the detection results on a representative challenging scene containing multiple objects with varying levels of occlusion and lighting conditions.
Table 2. The comparison of performance between our architecture and benchmark models on the MSRS dataset highlights the effectiveness of each approach in the fusion task. All experiments were run under identical settings.

| Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.236 | 0.289 | 10.346 | 0.412 | 4.366 |
| GNN-based [32] | 0.250 | 0.239 | 10.899 | 0.475 | 5.342 |
| CDDFuse [33] | 0.386 | 0.363 | 12.803 | 0.645 | 6.301 |
| ProbEn [34] | 0.296 | 0.332 | 11.996 | 0.338 | 3.865 |
| MMA-UNet [37] | 0.446 | 0.478 | 13.634 | 0.702 | 6.398 |
| Ours | 0.473 | 0.513 | 13.834 | 0.731 | 6.621 |
Table 3. The comparison of performance between our architecture and benchmark models on the SAIC dataset highlights the effectiveness of each approach in the fusion task. All experiments were run under identical settings.

| Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.436 | 0.589 | 14.546 | 0.675 | 6.376 |
| GNN-based [32] | 0.421 | 0.633 | 15.112 | 0.702 | 7.381 |
| CDDFuse [33] | 0.530 | 0.630 | 16.103 | 0.736 | 7.653 |
| ProbEn [34] | 0.492 | 0.573 | 15.102 | 0.593 | 6.784 |
| SwinF [35] | 0.502 | 0.641 | 16.330 | 0.732 | 7.801 |
| MMA-UNet [37] | 0.512 | 0.673 | 16.381 | 0.742 | 7.931 |
| Ours | 0.506 | 0.692 | 17.031 | 0.781 | 7.832 |
Table 4. Performance comparison on the MSRS dataset between our architecture, benchmark models, and variant architectures, showing effectiveness in object detection under the AP50, AP75, and mAP metrics.

| Method | AP50 (Person) | AP50 (Car) | AP75 (Person) | AP75 (Car) | mAP (Person) | mAP (Car) |
|---|---|---|---|---|---|---|
| Faster R-CNN [31] | 0.831 | 0.853 | 0.804 | 0.736 | 0.769 | 0.683 |
| GNN-based [32] | 0.830 | 0.901 | 0.763 | 0.862 | 0.736 | 0.801 |
| CDDFuse [33] | 0.932 | 0.916 | 0.902 | 0.910 | 0.897 | 0.864 |
| ProbEn [34] | 0.906 | 0.913 | 0.842 | 0.869 | 0.811 | 0.844 |
| DETR [25] | 0.892 | 0.927 | 0.889 | 0.900 | 0.883 | 0.812 |
| SwinF [35] | 0.896 | 0.945 | 0.873 | 0.902 | 0.886 | 0.872 |
| TransFusion [36] | 0.915 | 0.928 | 0.903 | 0.901 | 0.891 | 0.884 |
| MMA-UNet [37] | 0.926 | 0.903 | 0.913 | 0.910 | 0.892 | 0.873 |
| Ours | 0.941 | 0.939 | 0.933 | 0.920 | 0.881 | 0.894 |
| w/o DCA | 0.915 | 0.883 | 0.904 | 0.873 | 0.849 | 0.831 |
| w/o MAAM | 0.908 | 0.890 | 0.891 | 0.846 | 0.833 | 0.825 |
Table 5. Performance comparison (AP50 scores) on the SAIC dataset across different object categories. Our method achieves state-of-the-art results in three out of four categories.

| Method | Sensor | Lathe | Forklift | Filter |
|---|---|---|---|---|
| Faster R-CNN [31] | 0.734 | 0.883 | 0.801 | 0.884 |
| GNN-based [32] | 0.702 | 0.912 | 0.814 | 0.861 |
| CDDFuse [33] | 0.816 | 0.943 | 0.902 | 0.933 |
| ProbEn [34] | 0.819 | 0.952 | 0.897 | 0.947 |
| DETR [25] | 0.821 | 0.973 | 0.936 | 0.914 |
| SwinF [35] | 0.820 | 0.978 | 0.933 | 0.961 |
| TransFusion [36] | 0.812 | 0.956 | 0.912 | 0.938 |
| MMA-UNet [37] | 0.807 | 0.992 | 0.943 | 0.970 |
| Ours | 0.843 | 0.986 | 0.970 | 0.976 |
| w/o DCA | 0.801 | 0.953 | 0.919 | 0.902 |
| w/o MAAM | 0.783 | 0.937 | 0.904 | 0.934 |
Figure 2. Qualitative comparison of detection results produced by CDDFuse, SwinF, MMA-UNet, and our proposed method.

4.4. Ablation

To comprehensively evaluate the effectiveness of our proposed MNCM model, we conducted an ablation analysis focusing on its two key components: the dynamic channel adjustment (DCA) module and the multi-scale activated attention mechanism (MAAM) module. These modules are designed to enhance cross-modal feature representation and fusion, thereby improving overall model performance in classification and detection tasks.
To assess their individual contributions, we systematically removed each module and analyzed the resulting performance variations. As shown in Table 4 and Table 5, eliminating either component led to a noticeable decline in accuracy, underscoring their critical roles. Quantitative results demonstrate the effectiveness of both the DCA and MAAM modules. On the MSRS dataset, the DCA module improves mAP from 0.849 to 0.881 for Person detection (+3.2%) and from 0.831 to 0.894 for Car detection (+6.3%). Similarly, the MAAM module yields consistent gains, with Person mAP increasing from 0.833 to 0.881 (+4.8%) and Car from 0.825 to 0.894 (+6.9%). On the SAIC dataset, both modules generalize well, with DCA and MAAM achieving average AP50 improvements of 3.7% and 4.6%, respectively, across all object categories. These results highlight the complementary strengths of DCA and MAAM in enhancing spatial feature refinement and modality-aware representation.
Specifically, the DCA module facilitates adaptive feature recalibration by dynamically adjusting channel-wise importance across modalities, ensuring that essential information is preserved during fusion. Meanwhile, the MAAM module expands the receptive field through dilated attention, allowing the model to capture modality-specific spatial dependencies and refine feature integration. Together, these modules work synergistically—DCA enhances deep modality interaction at the feature level, while MAAM strengthens spatial feature extraction through attention mechanisms—resulting in significant improvements in robustness and detection accuracy. These findings highlight the necessity of both components in the MNCM framework, demonstrating their complementary roles in optimizing the model’s performance.
In addition to evaluating the core modules (DCA and MAAM), we conducted systematic ablation studies on the loss function components to understand their individual contributions. To maintain model integrity during these experiments, we employed a re-weighting strategy where the targeted loss term’s weight ( λ ) was set to zero while keeping all network architectures intact. The experimental results are shown in Table 6.
The experimental results show that the impact of the three types of loss functions on detection performance varies significantly. The absence of the bounding box loss L box leads to the most noticeable performance decline (−7.0% for pedestrians and −6.4% for vehicles), confirming the crucial role of GIoU loss in object localization. The removal of the object confidence loss L obj has a greater impact on pedestrian detection (−4.3% vs. −2.8% for vehicles), indicating that pedestrian detection faces more severe class imbalance issues. The classification loss L cls affects both target categories similarly (−3.9% for pedestrians and −3.6% for vehicles), highlighting its general applicability in category differentiation.

4.5. Computational Complexity Analysis

To assess the computational efficiency of MNCM, we compare it with several representative transformer-based or attention-based multimodal object detection methods. The evaluation is conducted on the MSRS dataset using a fixed input resolution of 640 × 640 pixels, which reflects a typical real-world deployment setting. All models are tested on the same hardware (Tesla A30 GPU) under consistent inference conditions. We report the number of parameters (in millions), floating-point operations (FLOPs in GFLOPs), average inference time per image (milliseconds), and frames per second (FPS). The results are shown in Table 7.
As shown in Table 7, MNCM achieves the lowest number of parameters (31.4 M) among all compared methods, reflecting its lightweight architectural design. While its FLOPs (70.3 G) are slightly higher than those of SwinF (69.4 G), MNCM maintains a competitive inference time of 33.6 ms and achieves 27.9 FPS, which is on par with or slightly better than transformer-based baselines such as SwinF (26.5 FPS) and CDDFuse (25.9 FPS).
These results indicate that MNCM provides a favorable balance between model complexity and runtime efficiency. The improvements can be attributed to the integration of the dynamic channel adjustment (DCA) and multi-scale activated attention mechanism (MAAM), which jointly enhance feature representation across modalities while maintaining efficient computation suitable for real-time multimodal object detection.
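The runtime figures in Table 7 can be reproduced in spirit with a simple profiling routine such as the sketch below, which measures parameter count, per-image latency, and FPS for a dual-input detector; FLOPs counting would additionally require a tool such as thop or fvcore, and the warm-up and run counts here are assumptions.

```python
import time
import torch


def profile_model(model, input_shape=(1, 3, 640, 640), runs=100, device="cuda"):
    """Rough profiling: parameter count (M), mean inference latency (ms), and FPS
    for a dual-input (IR + RGB) detector at 640x640 resolution."""
    model = model.to(device).eval()
    ir = torch.randn(input_shape, device=device)
    rgb = torch.randn(input_shape, device=device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(ir, rgb)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(ir, rgb)
        torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / runs * 1000.0
    return params_m, latency_ms, 1000.0 / latency_ms
```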

5. Conclusions

In this paper, we present MNCM, a novel multimodal image fusion network designed to enhance target detection performance through the integration of dynamic channel adjustment (DCA) and the multi-scale activated attention mechanism (MAAM). DCA facilitates deep interactions between modalities, ensuring the preservation of critical information, while MAAM captures spatial features from multiple perspectives, enabling more effective fusion of complementary data. Extensive experimental results validate the capabilities of our model, demonstrating its robust performance in both image fusion and object detection tasks. Evaluations on two distinct datasets further confirm MNCM's significant impact on object detection, particularly in complex environments with objects of varying sizes. These findings underscore MNCM's potential as a powerful tool for advancing multimodal target detection in real-world applications. Some limitations and potential improvements are discussed in Appendix A.

Author Contributions

Y.Y. and M.C.; methodology, Y.Y. and M.C.; modeling, Y.Y.; software, Y.Y.; validation, Y.Y.; investigation, Y.Y.; writing, Y.Y.; tuition, M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Limitations and Future Work

While MNCM demonstrates competitive accuracy and computational efficiency in multimodal object detection, it still has several limitations that suggest areas for improvement.
One key limitation is that MNCM does not incorporate temporal or sequential modeling, which restricts its use to static image-based detection. This means the model is unable to capture temporal continuity across video frames or perform multi-frame fusion, both of which are essential for dynamic scenarios like video surveillance and autonomous driving. Another challenge is that MNCM assumes complete and well-aligned modality inputs during inference. However, in real-world situations, modalities may be missing, degraded, or misaligned due to factors like sensor failures, occlusions, or adverse weather conditions. Currently, the model lacks explicit mechanisms to handle these issues, which could affect its performance in challenging environments. Additionally, MNCM is designed specifically for object detection and has not been evaluated in large-scale pretraining or general-purpose multimodal learning contexts. Its modular structure may also create challenges when attempting to integrate it into broader multimodal frameworks or adapt it to large-scale foundation models.
Looking ahead, we aim to improve MNCM’s ability to handle incomplete or noisy modalities, introduce lightweight temporal modeling modules for video-based detection, and explore how the model can be integrated with scalable pretraining strategies or multimodal foundation model backbones to enhance its adaptability and transferability.

References

  1. Zheng, Y.; Blasch, E.; Liu, Z. Multispectral Image Fusion and Colorization; SPIE Press: Bellingham, WA, USA, 2018; Volume 481. [Google Scholar]
  2. Ouardirhi, Z.; Mahmoudi, S.A.; Zbakh, M. Enhancing object detection in smart video surveillance: A survey of occlusion-handling approaches. Electronics 2024, 13, 541. [Google Scholar] [CrossRef]
  3. Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal industrial anomaly detection via hybrid fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041. [Google Scholar]
  4. Baumgartner, M.; Jäger, P.F.; Isensee, F.; Maier-Hein, K.H. nnDetection: A self-configuring method for medical object detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; pp. 530–539. [Google Scholar]
  5. Chauhan, R.; Ghanshala, K.K.; Joshi, R. Convolutional neural network (CNN) for image detection and recognition. In Proceedings of the 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282. [Google Scholar]
  6. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [PubMed]
  7. Chandana, R.; Ramachandra, A. Real time object detection system with YOLO and CNN models: A review. arXiv 2022, arXiv:2208.00773. [Google Scholar]
  8. Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Lió, P.; Toschi, N. Multimodal and multicontrast image fusion via deep generative models. Inf. Fusion 2022, 88, 146–160. [Google Scholar] [CrossRef]
  9. Liu, L.; Muelly, M.; Deng, J.; Pfister, T.; Li, L.J. Generative modeling for small-data object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6073–6081. [Google Scholar]
  10. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  11. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar]
  12. Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 72–80. [Google Scholar]
  13. Liu, Y. Cross-Modal Attention for Robust Object Detection. Comput. Vis. Image Underst. 2024, 189, 103305. [Google Scholar]
  14. Li, X.; Liu, J.; Tang, Z.; Han, B.; Wu, Z. MEDMCN: A novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J. Supercomput. 2024, 80, 12863–12890. [Google Scholar]
  15. Zhan, Y.; Zeng, Z.; Liu, H.; Tan, X.; Tian, Y. MambaSOD: Dual Mamba-driven cross-modal fusion network for RGB-D salient object detection. Neurocomputing 2025, 631, 129718. [Google Scholar]
  16. Liu, S.; Liu, Z. Multi-channel CNN-based object detection for enhanced situation awareness. arXiv 2017, arXiv:1712.00075. [Google Scholar]
  17. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Zhang, J. Bayesian fusion for infrared and visible images. Signal Process. 2020, 177, 107734. [Google Scholar]
  18. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [PubMed]
  19. Pedersoli, M.; Gonzàlez, J.; Hu, X.; Roca, X. Toward real-time pedestrian detection based on a deformable template model. IEEE Trans. Intell. Transp. Syst. 2013, 15, 355–364. [Google Scholar] [CrossRef]
  20. Yan, J.; Lei, Z.; Wen, L.; Li, S.Z. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2497–2504. [Google Scholar]
  21. Zhou, H.; Yu, G. Research on pedestrian detection technology based on the SVM classifier trained by HOG and LTP features. Future Gener. Comput. Syst. 2021, 125, 604–615. [Google Scholar] [CrossRef]
  22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  25. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  27. Zhang, Z. Transformer-Based Fusion for Multi-Sensor Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1391–1403. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Sriharipriya, K.C. Enhanced pothole detection system using YOLOX algorithm. Auton. Intell. Syst. 2022, 2, 22. [Google Scholar]
  30. Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep multimodal fusion by channel exchanging. Adv. Neural Inf. Process. Syst. 2020, 33, 4835–4845. [Google Scholar]
  31. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  32. Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13708–13715. [Google Scholar]
  33. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  34. Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling; Springer: Cham, Switzerland, 2022. [Google Scholar]
  35. Li, T.; Wang, H.; Li, G.; Liu, S.; Tang, L. SwinF: Swin Transformer with feature fusion in target detection. J. Phys. Conf. Ser. 2022, 2284, 012027. [Google Scholar]
  36. Zhou, C.; Yu, L.; Babu, A.; Tirumala, K.; Yasunaga, M.; Shamis, L.; Kahn, J.; Ma, X.; Zettlemoyer, L.; Levy, O. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. arXiv 2024, arXiv:2408.11039. [Google Scholar]
  37. Huang, J.; Li, X.; Tan, T.; Li, X.; Ye, T. MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion. arXiv 2024, arXiv:2404.17747. [Google Scholar]
Figure 1. The overall architecture of the proposed MNCM framework consists of two parallel ResNet-based image feature extraction networks, processing input features from each modality. The extracted features are then fed into the DCA-MAAM module. In the DCA-MAAM module, the dynamic channel adjustment and multi-scale activated attention mechanisms are applied. The blue box with three small squares represents sliding convolutions with different dilation rates of 1, 2, and 3, capturing features at multiple receptive field scales.
Table 1. SAIC dataset statistics.

| Dataset | Sensors | Lathes | Forklifts | Filters | Total |
|---|---|---|---|---|---|
| SAIC | 590 | 367 | 465 | 267 | 1689 |
Table 6. Ablation study on loss function components (MSRS validation set).

| Configuration | Person mAP | Car mAP |
|---|---|---|
| Full model | 0.881 | 0.894 |
| w/o $L_{\mathrm{box}}$ | 0.819 (−7.0%) | 0.837 (−6.4%) |
| w/o $L_{\mathrm{obj}}$ | 0.843 (−4.3%) | 0.869 (−2.8%) |
| w/o $L_{\mathrm{cls}}$ | 0.847 (−3.9%) | 0.858 (−3.6%) |
Table 7. Computational complexity and runtime efficiency comparison between MNCM and baseline methods on the MSRS dataset.

| Method | Params (M) | FLOPs (G) | Inference Time (ms) | FPS |
|---|---|---|---|---|
| CDDFuse [33] | 48.3 | 74.6 | 41.7 | 25.9 |
| SwinF [35] | 56.6 | 69.4 | 39.2 | 26.5 |
| MMA-UNet [37] | 38.1 | 72.8 | 35.4 | 28.2 |
| Ours | 31.4 | 70.3 | 33.6 | 27.9 |
