Cross-Modal Adaptive Fusion and Multi-Scale Aggregation Network for RGB-T Crowd Density Estimation and Counting
Featured Application
Abstract
1. Introduction
- Limited robustness to illumination shifts: In well-lit scenes, the RGB modality provides rich detail and should be assigned greater confidence, whereas under insufficient illumination or extreme weather the thermal infrared modality, with its stronger resistance to interference, should play a larger role. However, existing methods such as DEFNet [14] and MISF-Net [21] fuse features from the two modalities in a fixed manner without adaptive weighting, causing performance to fluctuate between bright and dark environments and indicating that they do not fully adapt to real-world illumination changes.
- Insufficient handling of scale discrepancies: Due to differences in imaging principles, the same target may exhibit scale discrepancies between visible and thermal images. Existing methods, such as MIANet [17] with its hierarchical interaction strategy or CCANet [18] with misalignment-aware fusion, primarily focus on feature interaction at the same scale, thus insufficiently modeling cross-modal scale variations.
- We design a cross-modal adaptive fusion module (CMAFM), which implements a dynamic feature selection and trust-assignment mechanism to proactively evaluate and fuse reliability-aware features from the two modalities according to the scenario. This module addresses the modality-reliability fluctuation ignored by most prior RGB-T methods, which treat both modalities equally regardless of imaging conditions.
- We design a multi-scale aggregation module (MSAM) that unifies features from different receptive fields to an intermediate scale and performs weighted fusion, effectively handling cross-modal scale discrepancies that are not adequately addressed in existing multi-scale RGB-T fusion networks.
- The proposed method is thoroughly validated on benchmark datasets, where it outperforms existing state-of-the-art methods in crowd density estimation and counting. Measured by the GAME and RMSE metrics (a minimal sketch of both follows this list), our method improves on the best prior results by 1.57%, 4.05%, 5.05%, and 4.63% at GAME(0)–GAME(3) and by 0.78% in RMSE on the DroneRGBT dataset, and by 2.48%, 1.89%, 2.80%, 0.67%, and 1.59% on the RGBT-CC dataset, respectively. Notably, it exhibits superior robustness under varying illumination conditions, validating the effectiveness of the proposed adaptive fusion and multi-scale aggregation strategy.
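For reference, below is a minimal NumPy sketch of the two evaluation metrics used throughout this paper. GAME(L) partitions each image into a 2^L × 2^L grid and sums the absolute count error per cell, so GAME(0) reduces to the ordinary absolute counting error; RMSE is computed over per-image total counts. The function names and signatures are ours, for illustration only.

```python
import numpy as np

def game(est_density: np.ndarray, gt_density: np.ndarray, level: int) -> float:
    """GAME(level) for one image pair: split both density maps into a
    2**level x 2**level grid and sum the absolute count error per cell.
    Dataset-level GAME averages this value over all test images."""
    cells = 2 ** level
    h, w = gt_density.shape
    err = 0.0
    for i in range(cells):
        for j in range(cells):
            rs, re = i * h // cells, (i + 1) * h // cells
            cs, ce = j * w // cells, (j + 1) * w // cells
            err += abs(float(est_density[rs:re, cs:ce].sum())
                       - float(gt_density[rs:re, cs:ce].sum()))
    return err

def rmse(est_counts, gt_counts) -> float:
    """Root mean square error over per-image total counts."""
    est = np.asarray(est_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.sqrt(np.mean((est - gt) ** 2)))
```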
2. Related Work
2.1. RGB-Based Crowd Counting
2.2. RGB-T Crowd Counting
3. Methodology
3.1. Proposed Cross-Modal Adaptive Fusion Module
| Algorithm 1. Cross-Modal Adaptive Fusion. |
| Input: RGB and thermal feature maps $F_R, F_T \in \mathbb{R}^{c \times h \times w}$, where c is the number of channels, h is the height, and w is the width. |
| Output: fused feature map $F_F \in \mathbb{R}^{c \times h \times w}$. |
| Step 1: Initial feature aggregation of $F_R$ and $F_T$. |
| Step 2: Global context descriptor extraction: compute global average-pooled vectors via Equation (1) and global max-pooled vectors via Equation (2). |
| Step 3: Channel-wise attention weight generation: pass the pooled descriptors through two independent multi-layer perceptrons (MLPs) with ReLU activation, then apply Softmax along the modality dimension to obtain normalized attention weights $W_R$ and $W_T$ via Equation (3). |
| Step 4: Feature recalibration: obtain $\hat{F}_R$ and $\hat{F}_T$ by weighting each modality with its attention weights via Equations (4) and (5). |
| Step 5: Feature concatenation and adaptive fusion: concatenate $\hat{F}_R$ and $\hat{F}_T$ and apply a CBR (convolution–batch normalization–ReLU) block via Equation (6). |
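To make the data flow of Algorithm 1 concrete, here is a minimal PyTorch sketch. The shared MLPs across modalities, the reduction ratio of 4, and the 3 × 3 CBR kernel are our assumptions rather than details taken from the paper; comments mark where Equations (1)–(6) occur.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAdaptiveFusion(nn.Module):
    """Sketch of Algorithm 1: global descriptors per modality -> MLPs ->
    Softmax across the modality axis -> channel-wise recalibration -> CBR."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # Two independent MLPs, one per pooling type (Step 3).
        self.mlp_avg = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, channels))
        self.mlp_max = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, channels))
        # CBR block fusing the concatenated recalibrated features (Eq. (6)).
        self.cbr = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_rgb.shape
        logits = []
        for f in (f_rgb, f_t):
            avg = F.adaptive_avg_pool2d(f, 1).flatten(1)  # Eq. (1): GAP descriptor
            mx = F.adaptive_max_pool2d(f, 1).flatten(1)   # Eq. (2): GMP descriptor
            logits.append(self.mlp_avg(avg) + self.mlp_max(mx))
        # Eq. (3): Softmax along the modality dimension yields trust weights.
        w = torch.softmax(torch.stack(logits, dim=0), dim=0)  # (2, B, C)
        w_rgb, w_t = w[0].view(b, c, 1, 1), w[1].view(b, c, 1, 1)
        # Eqs. (4) and (5): channel-wise recalibration of each modality.
        f_rgb_hat, f_t_hat = w_rgb * f_rgb, w_t * f_t
        # Eq. (6): concatenate and fuse with the CBR block.
        return self.cbr(torch.cat([f_rgb_hat, f_t_hat], dim=1))
```

Because the weights are normalized across modalities per channel, a channel that is unreliable in one modality (e.g., RGB at night) automatically cedes influence to the other.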
3.2. Proposed Multi-Scale Aggregation Module
| Algorithm 2. Multi-Scale Aggregation. |
| Input: multi-scale feature maps $F_L$, $F_M$, $F_H$ extracted at low, middle, and high scales. |
| Output: aggregated feature map $F_O \in \mathbb{R}^{c \times h_M \times w_M}$, where $(h_M, w_M)$ is the spatial size of $F_M$. |
| Step 1: Spatial alignment: resize $F_L$ and $F_H$ to the spatial dimensions of $F_M$. |
| Step 2: Apply two sequential ODConv2d layers per branch for feature refinement via Equation (7). |
| Step 3: Concatenate the enhanced low- and high-scale features with the mid-scale features. |
| Step 4: Channel projection: apply a 1 × 1 convolution to reduce channel dimensions, then refine with the convolutional block attention module (CBAM). |
| Step 5: Fuse the attentively calibrated multi-scale features with the original mid-scale features via element-wise addition. |
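A minimal PyTorch sketch of Algorithm 2 follows. Plain 3 × 3 convolutions stand in for the ODConv2d layers of Step 2 (omni-dimensional dynamic convolution is omitted for brevity), CBAM [40] is reduced to its essential channel-then-spatial attention, and bilinear interpolation for Step 1 and all layer widths are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCBAM(nn.Module):
    """Reduced CBAM [40]: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)       # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))              # spatial attention

class MultiScaleAggregation(nn.Module):
    """Sketch of Algorithm 2; plain convolutions replace ODConv2d."""

    def __init__(self, channels: int):
        super().__init__()
        def refine():  # two sequential conv layers per branch (Step 2)
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.refine_low, self.refine_high = refine(), refine()
        self.project = nn.Conv2d(3 * channels, channels, kernel_size=1)  # Step 4
        self.cbam = SimpleCBAM(channels)

    def forward(self, f_low, f_mid, f_high):
        size = f_mid.shape[-2:]
        # Step 1: align low- and high-scale features to the mid-scale size.
        f_low = F.interpolate(f_low, size=size, mode="bilinear", align_corners=False)
        f_high = F.interpolate(f_high, size=size, mode="bilinear", align_corners=False)
        # Step 2: per-branch refinement (ODConv2d in the paper).
        f_low, f_high = self.refine_low(f_low), self.refine_high(f_high)
        # Steps 3-4: concatenate, 1 x 1 channel projection, CBAM refinement.
        fused = self.cbam(self.project(torch.cat([f_low, f_mid, f_high], dim=1)))
        # Step 5: residual fusion with the original mid-scale features.
        return fused + f_mid
```

Anchoring the output at the intermediate scale keeps upsampling and downsampling distances short for both extreme branches, which is the motivation for unifying features at the mid-scale rather than at either extreme.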
3.3. Network Loss
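As a hedged illustration consistent with the ablation in Section 5.7.3, where MSE supervision outperforms the Bayesian loss, density-map supervision with a pixel-wise MSE can be sketched as follows; the tensor shapes, the mean reduction, and the FIDTM-style targets are our assumptions.

```python
import torch
import torch.nn as nn

# Pixel-wise MSE between predicted and ground-truth density maps.
criterion = nn.MSELoss(reduction="mean")

pred = torch.rand(4, 1, 128, 160, requires_grad=True)  # network output (B, 1, H, W)
target = torch.rand(4, 1, 128, 160)                    # ground truth (e.g., FIDTM targets)
loss = criterion(pred, target)
loss.backward()  # gradients flow back toward the network parameters
```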
4. Experiment Details
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
5. Experimental Results and Discussion
5.1. Discussion of Experimental Results on the DroneRGBT Benchmark
5.2. Discussion of Experimental Results on the RGBT-CC Benchmark
5.3. Cross-Dataset Validation and Analysis
5.4. Experimental Comparison Under Different Illumination Conditions
5.5. Qualitative Comparison
5.5.1. Comparison of Density Maps and Count Visualization
5.5.2. Comparison of Absolute Error Distribution in Counting
5.5.3. Analysis of Failure Cases
5.6. Computational Complexity Analysis
5.7. Ablation Studies
5.7.1. Ablation Experiments on the Proposed Modules
5.7.2. Ablation Experiments on the Proposed MSAM
5.7.3. Ablation Experiments on Different Supervisory Losses
5.7.4. Ablation Experiments on Spatial Registration Errors
6. Conclusions
- Cross-dataset and Cross-device Generalization: Study the adaptability and generalization of models across datasets collected with different devices and under varying scene distributions. Explore domain adaptation and meta-learning strategies to reduce dependence on specific data annotations and enhance the model’s robustness in diverse real-world environments.
- Suppression of Complex Heat Sources and Background Interference: Aiming at interference caused by non-human heat sources (such as lights, vehicles, animals, etc.) and complex backgrounds in thermal infrared images, investigate more discriminative feature representations and fusion mechanisms to improve the model’s ability to distinguish between targets and interference.
- Adaptive Fusion Strategies for Extremely High-Density Crowds: In highly crowded scenes, study mechanisms for dynamically adjusting the granularity and receptive fields of multimodal fusion, optimizing feature extraction and aggregation methods to better capture severely occluded and extremely small-scale target features, thereby enhancing counting accuracy.
- Fine-Grained Multimodal Registration and Alignment: Address the spatial misalignment issues between RGB and thermal infrared images caused by differences in perspective, resolution, and imaging principles. Investigate adaptive pixel-level or feature-level registration techniques to mitigate the impact of misalignment on fusion effectiveness and improve the geometric consistency of feature fusion.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| RGB-T | Visible Light and Thermal Imaging |
| CMAFM | Cross-Modal Adaptive Fusion Module |
| MSAM | Multi-Scale Aggregation Module |
| MLP | Multilayer Perceptron |
| CNNs | Convolutional Neural Networks |
| FIDTM | Focal Inverse Distance Transform Map |
| MSE | Mean Squared Error |
| GAME | Grid Average Mean Absolute Error |
| RMSE | Root Mean Square Error |
References
- Khan, A.; Ali Shah, J.; Kadir, K.; Albattah, W.; Khan, F. Crowd Monitoring and Localization Using Deep Convolutional Neural Network: A Review. Appl. Sci. 2020, 10, 4781. [Google Scholar] [CrossRef]
- Boukerche, A.; Coutinho, R.W. Crowd management: The overlooked component of smart transportation systems. IEEE Commun. Mag. 2019, 57, 48–53. [Google Scholar] [CrossRef]
- Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Babu, R.V. Locate, size, and count: Accurately resolving people in dense crowds via detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2739–2751. [Google Scholar] [CrossRef] [PubMed]
- Khan, M.A.; Menouar, H.; Hamila, R. Revisiting crowd counting: State-of-the-art, trends, and future perspectives. Image Vis. Comput. 2023, 129, 104597. [Google Scholar] [CrossRef]
- Alhawsawi, A.N.; Khan, S.D.; Ur Rehman, F. Crowd Counting in Diverse Environments Using a Deep Routing Mechanism Informed by Crowd Density Levels. Information 2024, 15, 275. [Google Scholar] [CrossRef]
- Luo, H.; Sang, J.; Wu, W.; Xiang, H.; Xiang, Z.; Zhang, Q.; Wu, Z. A High-Density Crowd Counting Method Based on Convolutional Feature Fusion. Appl. Sci. 2018, 8, 2367. [Google Scholar] [CrossRef]
- Ilyas, N.; Lee, B.; Kim, K. HADF-Crowd: A Hierarchical Attention-Based Dense Feature Extraction Network for Single-Image Crowd Counting. Sensors 2021, 21, 3483. [Google Scholar] [CrossRef]
- Sindagi, V.A.; Patel, V.M. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognit. Lett. 2018, 107, 3–16. [Google Scholar] [CrossRef]
- Liu, Z.; Yaermaimaiti, Y. A study of pedestrian detection algorithms for use in complex environments. Eng. Res. Express 2025, 7, 035283. [Google Scholar] [CrossRef]
- George, A.; Vinothina, V.; Beulah, G.V. Neural Network-Based Crowd Counting Systems: State of the Art, Challenges, and Perspectives. J. Adv. Inf. Technol. 2023, 14, 1450–1460. [Google Scholar] [CrossRef]
- Peng, T.; Li, Q.; Zhu, P. RGB-T Crowd Counting from Drone: A Benchmark and MMCCN Network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 December 2020–4 January 2021. [Google Scholar]
- Liu, L.; Chen, J.; Wu, H.; Li, G.; Li, C.; Lin, L. Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4821–4831. [Google Scholar]
- Gu, S.; Lian, Z. A unified RGB-T crowd counting learning framework. Image Vis. Comput. 2023, 131, 104631. [Google Scholar] [CrossRef]
- Zhou, W.; Pan, Y.; Lei, J.; Yu, L. DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting. IEEE Trans. Intell. Transp. Syst. 2022, 23, 24540–24549. [Google Scholar] [CrossRef]
- Zhou, W.; Yang, X.; Lei, J.; Yan, W.; Yu, L. MC3Net: Multimodality Cross-Guided Compensation Coordination Network for RGB-T Crowd Counting. IEEE Trans. Intell. Transp. Syst. 2024, 25, 4156–4165. [Google Scholar] [CrossRef]
- Tang, H.; Wang, Y.; Lin, Z.; Chau, L.P.; Zhuang, H. A three-stream fusion and self-differential attention network for multi-modal crowd counting. Pattern Recognit. Lett. 2024, 183, 35–41. [Google Scholar] [CrossRef]
- Wang, S.; Lyu, Y.; Li, Y.; Xu, Y.; Wu, W. MIANet: Bridging the Gap in Crowd Density Estimation with Thermal and RGB Interaction. IEEE Trans. Intell. Transp. Syst. 2025, 26, 254–267. [Google Scholar] [CrossRef]
- Liu, Y.; Cao, G.; Shi, B.; Hu, Y. CCANet: A Collaborative Cross-Modal Attention Network for RGB-D Crowd Counting. IEEE Trans. Multimed. 2024, 26, 154–165. [Google Scholar] [CrossRef]
- Kong, W.; Liu, J.; Hong, Y.; Li, H.; Shen, J. Cross-modal collaborative feature representation via Transformer-based multi-modal mixers for RGB-T crowd. Expert Syst. Appl. 2024, 255, 124483. [Google Scholar] [CrossRef]
- Kong, W.; Yu, Z.; Li, H.; Zhang, J. Cross-modal misalignment-robust feature fusion for crowd counting. Eng. Appl. Artif. Intell. 2024, 136, 108898. [Google Scholar] [CrossRef]
- Mu, B.; Shao, F.; Xie, Z.; Chen, H.; Zhu, Z.; Jiang, Q. MISF-Net: Modality-Invariant and -Specific Fusion Network for RGB-T Crowd Counting. IEEE Trans. Multimed. 2025, 27, 2593–2607. [Google Scholar] [CrossRef]
- Du, Z.; Shi, M.; Deng, J.; Zafeiriou, S. Redesigning multi-scale neural network for crowd counting. IEEE Trans. Image Process. 2023, 32, 3664–3678. [Google Scholar] [CrossRef]
- Gong, S.; Yao, Z.; Zuo, W.; Yang, J.; Yuen, P.; Zhang, S. Spatially adaptive pyramid feature fusion for scale-aware crowd counting. Pattern Recognit. 2025, 168, 111832. [Google Scholar] [CrossRef]
- Wang, T.; Zhang, T.; Zhang, K.; Wang, H.; Li, M.; Lu, J. Context attention fusion network for crowd counting. Knowl. Based Syst. 2023, 271, 110541. [Google Scholar] [CrossRef]
- Zhu, A.; Duan, G.; Zhu, X.; Zhao, L.; Huang, Y.; Hua, G.; Snoussi, H. CDADNet: Context-guided dense attentional dilated network for crowd counting. Signal Process.-Image Commun. 2021, 98, 116379. [Google Scholar] [CrossRef]
- Li, B.; Zhang, Y.; Xu, H.; Yin, B. CCST: Crowd counting with swin transformer. Vis. Comput. 2023, 39, 2671–2682. [Google Scholar] [CrossRef]
- Huo, Z.; Yuan, C.; Zhang, K.; Qiao, Y.; Luo, F. VMamba-Crowd: Bridging multi-scale features from Visual Mamba for weakly-supervised crowd counting. Pattern Recognit. Lett. 2025, 197, 297–303. [Google Scholar] [CrossRef]
- Wan, J.; Wang, Q.; Chan, A.B. Kernel-based density map generation for dense object counting. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1357–1370. [Google Scholar] [CrossRef]
- Chen, Y.; Yang, J.; Chen, B.; Du, S. Counting varying density crowds through density guided adaptive selection CNN and transformer estimation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1055–1068. [Google Scholar] [CrossRef]
- Liang, D.; Xu, W.; Zhu, Y.; Zhou, Y. Focal inverse distance transform maps for crowd localization. IEEE Trans. Multimed. 2022, 25, 6040–6052. [Google Scholar] [CrossRef]
- Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3365–3374. [Google Scholar]
- Wan, J.; Wu, Q.; Chan, A.B. Modeling Noisy Annotations for Point-Wise Supervision. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15065–15080. [Google Scholar] [CrossRef]
- Li, H.; Zhang, J.; Kong, W.; Shen, J.; Shao, Y. CSA-Net: Cross-modal scale-aware attention-aggregated network for RGB-T crowd counting. Expert Syst. Appl. 2023, 213, 119038. [Google Scholar] [CrossRef]
- Guo, Q.; Yuan, P.; Huang, X.; Ye, Y. Consistency-constrained RGB-T crowd counting via mutual information maximization. Complex Intell. Syst. 2024, 10, 5049–5070. [Google Scholar] [CrossRef]
- Pan, Y.; Zhou, W.; Fang, M.; Qiang, F. Graph enhancement and transformer aggregation network for RGB-thermal crowd counting. IEEE Geosci. Remote Sens. Lett. 2024, 21, 3000705. [Google Scholar] [CrossRef]
- Chen, Y.; Zhao, H.; Huang, L.; Yang, Y.; Kang, W.; Zhang, J. TransMambaCC: Integrating Transformer and Pyramid Mamba Network for RGB-T Crowd Counting. Appl. Intell. 2025, 55, 989. [Google Scholar] [CrossRef]
- Kong, W.; Yu, Z.; Li, H.; Tong, L.; Zhao, F.; Li, Y. CrowdAlign: Shared-weight dual-level alignment fusion for RGB-T crowd counting. Image Vis. Comput. 2024, 148, 105152. [Google Scholar] [CrossRef]
- Kuang, X.; Tao, B. ODGNet: Robotic Grasp Detection Network Based on Omni-Dimensional Dynamic Convolution. Appl. Sci. 2024, 14, 4653. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Zhou, P.; Xie, X.Y.; Lin, Z.C.; Yan, S.C. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493. [Google Scholar] [CrossRef]
- Zhang, C.; Shao, Y.; Sun, H.; Xing, L.; Zhao, Q.; Zhang, L. The WuC-Adam algorithm based on joint improvement of Warmup and cosine annealing algorithms. Math. Biosci. Eng. 2024, 21, 1270–1285. [Google Scholar] [CrossRef]
- Pan, Y.; Zhou, W.; Qian, X.; Mao, S.; Yang, R.; Yu, L. CGINet: Cross-modality grade interaction network for RGB-T crowd counting. Eng. Appl. Artif. Intell. 2023, 126, 106885. [Google Scholar] [CrossRef]
- Xie, Z.; Shao, F.; Mu, B.; Chen, H.; Jiang, Q.; Lu, C.; Ho, Y.S. BGDFNet: Bidirectional Gated and Dynamic Fusion Network for RGB-T Crowd Counting in Smart City System. IEEE Trans. Instrum. Meas. 2024, 73, 5024616. [Google Scholar] [CrossRef]
- Zhang, Y.; Choi, S.; Hong, S. Memory-efficient cross-modal attention for RGB-X segmentation and crowd counting. Pattern Recognit. 2025, 162, 111376. [Google Scholar] [CrossRef]
- Liu, K.; Zou, X.; Zhu, P.; Sang, J. Modal-Adaptive Spatial-Aware-Fusion and Propagation Network for Multimodal Vision Crowd Counting. IEEE Trans. Consum. Electron. 2025, 71, 3605–3616. [Google Scholar] [CrossRef]
- Wan, J.; Kumar, N.S.; Chan, A.B. Fine-Grained Crowd Counting. IEEE Trans. Image Process. 2021, 30, 2114–2126. [Google Scholar] [CrossRef] [PubMed]
- Zhang, M.; Chen, M.; Li, Q.; Chen, Y.; Lin, R.; Li, X.; He, S.; Liu, W. Category-Contrastive Fine-Grained Crowd Counting and Beyond. IEEE Trans. Multimed. 2024, 27, 477–488. [Google Scholar] [CrossRef]
- Hayat, M.; Aramvith, S. Superpixel-Guided Graph-Attention Boundary GAN for Adaptive Feature Refinement in Scribble-Supervised Medical Image Segmentation. IEEE Access 2025, 25, 35212–35222. [Google Scholar] [CrossRef]
- Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2022, 24, 1968–1979. [Google Scholar] [CrossRef]
Comparison with state-of-the-art methods on the DroneRGBT dataset (lower is better; "/" indicates results not reported).

| Method | Venue | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|---|
| IADM [12] | CVPR 21 | 9.70 | 12.04 | 15.31 | 20.31 | 15.01 |
| DEFNet [14] | TITS 22 | 7.89 | 9.60 | 11.96 | 15.34 | 12.88 |
| CGINet [43] | EAAI 23 | 8.37 | 9.97 | 12.34 | 15.51 | 13.45 |
| MC3Net [15] | TITS 23 | 7.33 | / | / | / | 11.17 |
| CrowdAlign [38] | IVC 24 | 7.03 | / | / | / | 10.96 |
| CFAF-Net [20] | EAAI 24 | 6.72 | / | / | / | 11.06 |
| C4-MIM [35] | CAIS 24 | 6.72 | / | / | / | 10.58 |
| MCN [19] | ESWA 24 | 6.36 | / | / | / | 10.22 |
| GETANet [36] | GRSL 24 | 8.44 | 10.01 | 12.75 | 15.83 | 13.99 |
| MIANet [17] | TITS 25 | 6.74 | 8.64 | 11.49 | 16.31 | 10.58 |
| Ours | - | 6.26 ± 0.14 | 8.29 ± 0.23 | 10.91 ± 0.32 | 14.63 ± 0.41 | 10.14 ± 0.20 |
Comparison with state-of-the-art methods on the RGBT-CC dataset (lower is better).

| Method | Venue | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|---|
| IADM [12] | CVPR 21 | 15.61 | 19.95 | 24.69 | 32.89 | 28.18 |
| DEFNet [14] | TITS 22 | 11.90 | 16.08 | 20.19 | 27.27 | 21.09 |
| CCANet [18] | TMM 23 | 13.93 | 18.13 | 22.08 | 28.26 | 24.71 |
| CSA-Net [34] | ESWA 23 | 12.45 | 16.46 | 21.48 | 30.62 | 21.64 |
| CGINet [43] | EAAI 23 | 12.07 | 15.98 | 20.06 | 27.73 | 20.54 |
| MC3Net [15] | TITS 23 | 11.47 | 15.06 | 19.40 | 27.95 | 20.59 |
| GETANet [36] | GRSL 24 | 12.14 | 15.98 | 19.40 | 28.61 | 22.17 |
| MCN [19] | ESWA 24 | 11.56 | 15.92 | 20.16 | 28.06 | 19.02 |
| C4-MIM [35] | CAIS 24 | 11.27 | 15.02 | 19.31 | 25.33 | 20.31 |
| CFAF-Net [20] | EAAI 24 | 11.07 | 14.96 | 19.65 | 29.05 | 18.83 |
| CrowdAlign [38] | IVC 24 | 11.07 | 14.83 | 19.44 | 28.65 | 19.78 |
| BGDFNet [44] | TIM 24 | 11.00 | 15.04 | 19.86 | 29.72 | 19.05 |
| CSCA [45] | PR 25 | 13.50 | 18.63 | 23.59 | 31.59 | 24.83 |
| MSPNet [46] | TCE 25 | 12.20 | 16.50 | 20.51 | 27.84 | 21.49 |
| MIANet [17] | TITS 25 | 11.97 | 15.65 | 19.93 | 27.54 | 22.17 |
| MISF-Net [21] | TMM 25 | 10.90 | 14.87 | 19.65 | 29.18 | 19.42 |
| Ours | - | 10.63 ± 0.17 | 14.55 ± 0.26 | 18.77 ± 0.35 | 25.16 ± 0.42 | 18.53 ± 0.22 |
Cross-dataset validation (train → test).

| Method | Test Type | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|---|
| IADM [12] | DroneRGBT → RGBT-CC | 73.19 | 73.48 | 74.01 | 74.35 | 103.31 |
| DEFNet [14] | DroneRGBT → RGBT-CC | 74.08 | 74.08 | 74.09 | 74.09 | 104.38 |
| MIANet [17] | DroneRGBT → RGBT-CC | 61.18 | 63.48 | 66.51 | 70.14 | 93.82 |
| Ours | DroneRGBT → RGBT-CC | 73.03 | 73.27 | 73.67 | 73.98 | 103.53 |
| IADM [12] | RGBT-CC → DroneRGBT | 27.12 | 28.40 | 29.94 | 30.99 | 38.01 |
| DEFNet [14] | RGBT-CC → DroneRGBT | 47.36 | 55.12 | 64.25 | 69.90 | 83.02 |
| MIANet [17] | RGBT-CC → DroneRGBT | 27.30 | 31.95 | 35.47 | 37.69 | 51.33 |
| Ours | RGBT-CC → DroneRGBT | 34.74 | 40.55 | 43.28 | 44.53 | 49.71 |
Comparison under different illumination conditions.

| Illumination | Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|---|
| Brightness | IADM [12] | 20.36 | 23.57 | 28.49 | 36.29 | 32.57 |
| Brightness | DEFNet [14] | 11.03 | 15.39 | 20.07 | 27.99 | 22.96 |
| Brightness | MCN [19] | 12.36 | 16.68 | 21.45 | 29.75 | 20.56 |
| Brightness | MIANet [17] | 13.69 | 18.01 | 22.96 | 31.41 | 26.76 |
| Brightness | MISF-Net [21] | 11.89 | 15.50 | 20.62 | 31.17 | 21.17 |
| Brightness | Ours | 11.61 ± 0.08 | 15.16 ± 0.15 | 19.97 ± 0.21 | 27.07 ± 0.48 | 20.31 ± 0.14 |
| Darkness | IADM [12] | 15.44 | 19.23 | 23.79 | 30.28 | 29.11 |
| Darkness | DEFNet [14] | 12.39 | 16.16 | 20.33 | 27.16 | 23.16 |
| Darkness | MCN [19] | 10.74 | 15.36 | 19.10 | 26.20 | 17.39 |
| Darkness | MIANet [17] | 13.88 | 18.28 | 23.46 | 32.17 | 25.15 |
| Darkness | MISF-Net [21] | 10.13 | 14.42 | 18.81 | 27.21 | 17.94 |
| Darkness | Ours | 9.51 ± 0.11 | 13.07 ± 0.27 | 17.16 ± 0.41 | 23.31 ± 0.55 | 16.45 ± 0.32 |
| Method | Parameters (M) | FLOPs (G) | Throughput (fps) |
|---|---|---|---|
| IADM [12] | 25.67 | 143.95 | 59.06 |
| DEFNet [14] | 45.33 | 355.42 | 26.03 |
| MIANet [17] | 199.54 | 289.30 | 85.11 |
| Ours | 84.49 | 506.13 | 35.73 |
| Module | Parameters (M) | FLOPs (G) |
|---|---|---|
| Backbone | 14.76 | 301.77 |
| CMAFM | 0.23 | 0.04 |
| MSAM | 68.90 | 192.07 |
| Regression Layer | 0.60 | 12.25 |
| All | 84.49 | 506.13 |
| Exp. | CMAFM | MSAM | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|---|---|
| 1 | × | × | 15.30 | 19.51 | 23.53 | 29.26 | 25.87 |
| 2 | ✓ | × | 12.29 | 16.26 | 21.75 | 28.13 | 20.16 |
| 3 | × | ✓ | 13.58 | 15.56 | 20.63 | 26.49 | 21.84 |
| 4 | ✓ | ✓ | 10.54 | 14.58 | 18.76 | 25.09 | 18.50 |
| Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|
| Baseline | 10.54 | 14.58 | 18.76 | 25.09 | 18.50 |
| w/o ODConv | 13.15 | 15.36 | 20.33 | 25.97 | 20.75 |
| w/o CBAM | 12.30 | 14.97 | 19.51 | 25.81 | 20.12 |
| Loss | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
|---|---|---|---|---|---|
| MSE | 10.54 | 14.58 | 18.76 | 25.09 | 18.50 |
| Bayesian | 12.64 | 15.63 | 20.62 | 28.38 | 21.52 |