Multimodal Fire Salient Object Detection for Unregistered Data in Real-World Scenarios
Abstract
1. Introduction
- (1) We introduce an unaligned RGB-IR fire dataset for indoor environments, encompassing diverse common scenes and fire scenarios under variable illumination conditions. The dataset specifically includes challenging cases such as small-target fires and occlusion interference, providing essential data support for multimodal fire fusion research under spatial misalignment conditions.
- (2) We propose a fire saliency detection framework for spatially unaligned RGB-IR data that integrates a multimodal feature enhancement module and an improved inter-modal alignment fusion module. These components resolve spatial discrepancies between the RGB and thermal infrared modalities through feature-level correspondence learning and fusion, substantially improving detection accuracy that would otherwise be degraded by spatial misalignment (a minimal sketch of such a pipeline follows this list).
- (3) To validate the method's efficacy, we conduct extensive experiments using the proposed misaligned multimodal fire dataset across indoor and cross-environmental scenarios. The results demonstrate superior performance over state-of-the-art RGB-T multimodal saliency detection models, confirming our method's capability for accurate fire saliency detection with unregistered RGB and TIR inputs.
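As referenced in contribution (2), the framework pairs a channel-level cross-modal enhancement stage with a deformable feature alignment stage before fusing the two modalities. The PyTorch sketch below is only a minimal illustration of how such a pipeline can be wired together; the module names, channel sizes, offset prediction scheme, and the use of torchvision's deform_conv2d are our assumptions, not the authors' implementation.

```python
# Minimal sketch (our assumption, not the authors' code) of the two-stage idea:
# (1) channel-level cross-modal enhancement, then (2) deformable alignment of
# the thermal features toward the RGB features before fusion.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class ChannelCrossEnhance(nn.Module):
    """Reweight each modality's channels using the other modality's global statistics."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()

        def mlp():
            return nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())

        self.mlp_rgb, self.mlp_tir = mlp(), mlp()

    def forward(self, f_rgb, f_tir):
        g_rgb = f_rgb.mean(dim=(2, 3))   # global average pooling -> (N, C)
        g_tir = f_tir.mean(dim=(2, 3))
        # cross gating: thermal statistics reweight RGB channels and vice versa
        f_rgb = f_rgb * self.mlp_tir(g_tir)[:, :, None, None]
        f_tir = f_tir * self.mlp_rgb(g_rgb)[:, :, None, None]
        return f_rgb, f_tir


class DeformableAlign(nn.Module):
    """Predict per-pixel offsets from both modalities and warp TIR features toward RGB."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.padding = kernel_size // 2
        # the offset map needs 2 * kH * kW channels for deform_conv2d (one offset group)
        self.offset_pred = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=self.padding)
        self.weight = nn.Parameter(
            0.01 * torch.randn(channels, channels, kernel_size, kernel_size))

    def forward(self, f_rgb, f_tir):
        offsets = self.offset_pred(torch.cat([f_rgb, f_tir], dim=1))
        return deform_conv2d(f_tir, offsets, self.weight, padding=self.padding)


class FusionHead(nn.Module):
    """Fuse enhanced RGB features with aligned TIR features into a saliency map."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1))

    def forward(self, f_rgb, aligned_tir):
        return torch.sigmoid(self.fuse(torch.cat([f_rgb, aligned_tir], dim=1)))


if __name__ == "__main__":
    f_rgb, f_tir = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
    ccm, dam, head = ChannelCrossEnhance(64), DeformableAlign(64), FusionHead(64)
    f_rgb, f_tir = ccm(f_rgb, f_tir)
    saliency = head(f_rgb, dam(f_rgb, f_tir))
    print(saliency.shape)   # torch.Size([2, 1, 56, 56])
```

In this sketch the offsets that warp the thermal features are predicted jointly from both modalities, which is one common way to realize feature-level correspondence learning when the inputs are not spatially registered.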
2. Related Works
2.1. Single-Modal-Based Fire Identification
2.2. Multi-Modal-Based Fire Identification
2.3. Unaligned Multimodal Data Fusion Study
3. Materials and Methods
3.1. Proposed Methods
3.1.1. Model
3.1.2. Channel Cross-Enhancement Module (CCM)
3.1.3. Deformable Alignment Module (DAM)
3.1.4. Loss Function and Saliency Output
3.2. Datasets
3.3. Evaluation Metrics
3.4. Implementation Details
4. Results and Discussion
4.1. Quantitative Evaluation of Model Performance
4.2. Qualitative Comparison of Methods
4.3. Model Parameters and Computational Complexity
4.4. Ablation Study
5. Conclusions
- (1) We constructed an unregistered RGB-IR dataset for indoor fire scenarios, covering diverse environments and fire conditions. This dataset provides essential support for research on multimodal fusion under spatial misalignment.
- (2) The proposed framework integrates a Channel Cross-Enhancement Module (CCM) and a Deformable Alignment Module (DAM), which jointly enhance cross-modal interaction and progressively correct geometric deviations.
- (3) Experimental validation demonstrates that our method achieves F1 = 0.864 and IoU = 0.774 on the Indoor-Fire dataset, surpassing SACNet by +11.9% IoU and +11.3% F1. On the Wildfire dataset, our method outperforms competing approaches with F1 = 0.845 and IoU = 0.754 while maintaining lower parameter complexity (91.9 M, 28.1% of SACNet). On the more complex Fire in Historic Buildings dataset, it likewise performs strongly, achieving F1 = 0.836 and IoU = 0.709.
- (4) The proposed model achieves a favorable balance between accuracy and efficiency, making it suitable for real-world deployment in intelligent firefighting systems.
- (5) Limitations remain: (i) the scale and diversity of available datasets still restrict model generalizability; (ii) inference speed requires further optimization for time-sensitive applications. Future work will expand multimodal fire datasets and explore more efficient fusion strategies.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sharma, A.; Kumar, R.; Kansal, I.; Popli, R.; Khullar, V.; Verma, J.; Kumar, S. Fire detection in urban areas using multimodal data and federated learning. Fire 2024, 7, 104.
- Celik, T.; Demirel, H. Fire detection in video sequences using a generic color model. Fire Saf. J. 2009, 44, 147–158.
- Shen, P.; Sun, N.; Hu, K.; Ye, X.; Wang, P.; Xia, Q.; Wei, C. FireViT: An adaptive lightweight backbone network for fire detection. Forests 2023, 14, 2158.
- Rui, X.; Li, Z.; Zhang, X.; Li, Z.; Song, W. An RGB-thermal based adaptive modality learning network for day–night wildfire identification. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103554.
- Chen, X.; Hopkins, B.; Wang, H.; O’Neill, L.; Afghah, F.; Razi, A.; Fulé, P.; Coen, J.; Rowell, E.; Watts, A. Wildland fire detection and monitoring using a drone-collected RGB/IR image dataset. IEEE Access 2022, 10, 121301–121317.
- Yuan, F.; Li, K.; Wang, C.; Fang, Z. A lightweight network for smoke semantic segmentation. Pattern Recognit. 2023, 137, 109289.
- Muksimova, S.; Mardieva, S.; Cho, Y.-I. Deep encoder–decoder network-based wildfire segmentation using drone images in real-time. Remote Sens. 2022, 14, 6302.
- Doshi, J.; Garcia, D.; Massey, C.; Llueca, P.; Borensztein, N.; Baird, M.; Cook, M.; Raj, D. FireNet: Real-time segmentation of fire perimeter from aerial video. arXiv 2019, arXiv:1910.06407.
- Li, M.; Zhang, Y.; Mu, L.; Xin, J.; Yu, Z.; Jiao, S.; Liu, H.; Xie, G.; Yingmin, Y. A real-time fire segmentation method based on a deep learning approach. IFAC-PapersOnLine 2022, 55, 145–150.
- Ahmad, N.; Akbar, M.; Alkhammash, E.H.; Jamjoom, M.M. CN2VF-Net: A hybrid convolutional neural network and vision transformer framework for multi-scale fire detection in complex environments. Fire 2025, 8, 211.
- Sun, W.; Liu, Y.; Wang, F.; Hua, L.; Fu, J.; Hu, S. A study on flame detection method combining visible light and thermal infrared multimodal images. Fire Technol. 2024, 61, 2167–2188.
- Liu, Y.; Zheng, C.; Liu, X.; Tian, Y.; Zhang, J.; Cui, W. Forest fire monitoring method based on UAV visual and infrared image fusion. Remote Sens. 2023, 15, 3173.
- Tlig, M.; Bouchouicha, M.; Sayadi, M.; Moreau, E. Fire segmentation with an optimized weighted image fusion method. Electronics 2024, 13, 3175.
- Kim, D.; Ruy, W. CNN-based fire detection method on autonomous ships using composite channels composed of RGB and IR data. Int. J. Nav. Archit. Ocean. Eng. 2022, 14, 100489.
- Guo, Y.; Zhang, Y.; Wang, R. Inverse asymptotic fusion framework for fusion of infrared and visible images of fires. In Proceedings of the 2024 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Beijing, China, 22 November 2024; pp. 139–143.
- Niu, K.; Wang, C.; Xu, J.; Liang, J.; Zhou, X.; Wen, K.; Lu, M.; Yang, C. Early forest fire detection with UAV image fusion: A novel deep learning method using visible and infrared sensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6617–6629.
- Guo, S.; Hu, B.; Huang, R. Real-time flame segmentation based on RGB-thermal fusion. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27 December 2021; pp. 1435–1440.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Chen, P.; Chen, Y.; Yang, D.; Wu, F.; Li, Q.; Xia, Q.; Tan, Y. I2UV-HandNet: Image-to-UV prediction network for accurate and high-fidelity 3D hand mesh modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 12929–12938.
- Song, K.; Xue, X.; Wen, H.; Ji, Y.; Yan, Y.; Meng, Q. Misaligned visible-thermal object detection: A drone-based benchmark and baseline. IEEE Trans. Intell. Veh. 2024, 9, 7449–7460.
- Wang, K.; Lin, D.; Li, C.; Tu, Z.; Luo, B. Alignment-free RGBT salient object detection: Semantics-guided asymmetric correlation network and a unified benchmark. IEEE Trans. Multimed. 2024, 26, 10692–10707.
- Zhang, T.; He, X.; Jiao, Q.; Zhang, Q.; Han, J. AMNet: Learning to align multi-modality for RGB-T tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7386–7400.
- Li, C.; Gao, S.; Deng, C.; Xie, D.; Liu, W. Cross-modal learning with adversarial samples. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates: Red Hook, NY, USA, 2019; Volume 32.
- Park, L.H.; Kim, J.; Oh, M.G.; Park, J.; Kwon, T. Adversarial feature alignment: Balancing robustness and accuracy in deep learning via adversarial training. In Proceedings of the 2024 Workshop on Artificial Intelligence and Security (AISec), Copenhagen, Denmark, 6 November 2024; pp. 101–112.
- Chaoxia, C.; Shang, W.; Zhang, F.; Cong, S. Weakly aligned multimodal flame detection for fire-fighting robots. IEEE Trans. Ind. Inform. 2022, 19, 2866–2875.
- Wen, X.; Zhao, J.; He, Y.; Yin, H. Three-decoder cross-modal interaction network for unregistered RGB-T salient object detection. IEEE Trans. Instrum. Meas. 2025, 74, 1–13.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
- Peng, H.; Hu, Y.; Yu, B.; Zhang, Z. TCAINet: An RGB-T salient object detection model with cross-modal fusion and adaptive decoding. Sci. Rep. 2025, 15, 14266.
- Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
- Hu, K.; Xu, K.; Xia, Q.; Li, M.; Song, Z.; Song, L.; Sun, N. An overview: Attention mechanisms in multi-agent reinforcement learning. Neurocomputing 2024, 598, 128105.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates: Red Hook, NY, USA, 2017; Volume 30.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
- Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
- Hu, K.; Zhang, Q.; Feng, X.; Liu, Z.; Shao, P.; Xia, M.; Ye, X. An interpolation and prediction algorithm for XCO2 based on multi-source time series data. Remote Sens. 2024, 16, 1907.
- Hu, K.; Li, M.; Song, Z.; Xu, K.; Xia, Q.; Sun, N.; Zhou, P.; Xia, M. A review of research on reinforcement learning algorithms for multi-agent. Neurocomputing 2024, 599, 128068.
Quantitative comparison on the Indoor-Fire, Wildfire, and Fire in Historic Buildings datasets (F1 ↑, IoU ↑, MAE ↓).

| Methods | Indoor-Fire (F1 / IoU / MAE) | Wildfire (F1 / IoU / MAE) | Fire in Historic Buildings (F1 / IoU / MAE) |
|---|---|---|---|
| LSNet | 0.663 / 0.505 / 0.045 | 0.761 / 0.656 / 0.042 | 0.728 / 0.618 / 0.047 |
| CSRNet | 0.727 / 0.655 / 0.045 | 0.725 / 0.607 / 0.033 | 0.766 / 0.709 / 0.045 |
| MCFNet | 0.621 / 0.514 / 0.036 | 0.627 / 0.501 / 0.041 | 0.677 / 0.529 / 0.049 |
| TNet | 0.731 / 0.607 / 0.039 | 0.782 / 0.676 / 0.040 | 0.735 / 0.622 / 0.034 |
| MGAI | 0.733 / 0.574 / 0.036 | 0.762 / 0.647 / 0.030 | 0.718 / 0.613 / 0.031 |
| SACNet | 0.882 / 0.752 / 0.023 | 0.828 / 0.714 / 0.030 | 0.823 / 0.714 / 0.037 |
| FODNet (ours) | 0.864 / 0.774 / 0.024 | 0.845 / 0.754 / 0.028 | 0.836 / 0.709 / 0.035 |
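For reference, the F1, IoU, and MAE values reported above are standard saliency/segmentation metrics. The snippet below is a generic way to compute them from a predicted saliency map and a binary ground-truth mask; it is an illustrative sketch rather than the authors' evaluation script, and the 0.5 binarization threshold is an assumption.

```python
# Generic F1 / IoU / MAE for a predicted saliency map against a binary mask.
# Illustrative only; the 0.5 binarization threshold is an assumption.
import numpy as np


def saliency_metrics(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5, eps: float = 1e-8):
    """pred: saliency map with values in [0, 1]; gt: binary ground-truth mask (0/1)."""
    mae = np.abs(pred - gt).mean()               # MAE is computed on the raw map

    pred_bin = (pred >= thr).astype(np.float64)  # binarize for F1 and IoU
    tp = (pred_bin * gt).sum()
    fp = (pred_bin * (1.0 - gt)).sum()
    fn = ((1.0 - pred_bin) * gt).sum()

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2.0 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"F1": float(f1), "IoU": float(iou), "MAE": float(mae)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((256, 256))
    gt = (rng.random((256, 256)) > 0.9).astype(np.float64)
    print(saliency_metrics(pred, gt))
```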
Model complexity comparison.

| Methods | FLOPs (G) | Params (M) |
|---|---|---|
| LSNet | 1.21 | 5.39 |
| CSRNet | 4.20 | 8.63 |
| MCFNet | 96.54 | 78.64 |
| TNet | 39.71 | 87.05 |
| MGAI | 156.26 | 89.01 |
| SACNet | 40.17 | 327.7 |
| FODNet (ours) | 36.15 | 91.91 |
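Complexity figures such as those above are typically obtained with an off-the-shelf profiler. The sketch below shows one common way to measure them; the thop package, the two-stream dummy inputs, and the 384 × 384 resolution are assumptions rather than the authors' measurement setup.

```python
# One common way to obtain FLOPs / parameter counts like those above.
# The `thop` package (pip install thop), the two-stream dummy inputs, and the
# 384x384 resolution are assumptions, not the authors' measurement setup.
import torch
from thop import profile


def count_cost(model: torch.nn.Module, size: int = 384):
    rgb = torch.randn(1, 3, size, size)   # visible-light input
    tir = torch.randn(1, 3, size, size)   # thermal-infrared input
    model.eval()
    with torch.no_grad():
        macs, _ = profile(model, inputs=(rgb, tir), verbose=False)
    params = sum(p.numel() for p in model.parameters())
    # thop reports multiply-accumulate operations, commonly quoted as FLOPs
    return macs / 1e9, params / 1e6       # (GFLOPs, parameters in millions)
```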
Ablation study on the three datasets (F1 ↑, IoU ↑, MAE ↓).

| Variant | Indoor-Fire (F1 / IoU / MAE) | Wildfire (F1 / IoU / MAE) | Fire in Historic Buildings (F1 / IoU / MAE) |
|---|---|---|---|
| M1/Backbone | 0.843 / 0.772 / 0.075 | 0.828 / 0.718 / 0.059 | 0.803 / 0.698 / 0.067 |
| M2/CCM | 0.842 / 0.731 / 0.031 | 0.831 / 0.722 / 0.035 | 0.814 / 0.703 / 0.036 |
| M3/DAM | 0.610 / 0.461 / 0.045 | 0.733 / 0.590 / 0.047 | 0.603 / 0.473 / 0.051 |
| M4/UpBlock | 0.847 / 0.752 / 0.027 | 0.827 / 0.729 / 0.034 | 0.811 / 0.707 / 0.034 |
| FODNet (ours) | 0.864 / 0.774 / 0.024 | 0.845 / 0.754 / 0.028 | 0.836 / 0.709 / 0.035 |