VISR-CNN: A Dual-Stream Framework for Meteorological Visibility Estimation via Multi-Scale Transmission Attention and Spectral Gating
Abstract
1. Introduction
- Physics-Informed Frequency Selection: Unlike prior works that use the entire spectral domain, our Spectral Gating (SG) module employs learnable masks to adaptively isolate the low-to-mid frequency components that carry structural silhouettes while suppressing high-frequency atmospheric noise.
- Multi-Scale Transmission Attention (MSTA): We introduce a transmission-aware attention mechanism that targets multi-scale contrast degradation, a primary indicator of visibility loss that standard spatial-only attention mechanisms often fail to capture.
- Monotonic Ranking Supervision: While existing studies treat visibility estimation as a standard regression task, we integrate a Monotonic Ranking Loss that enforces the ordinal physical constraints of light attenuation, encouraging the model’s predictions to remain physically consistent even in data-sparse low-visibility conditions.
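The ranking supervision described in the last contribution can be illustrated with a pairwise hinge formulation. The following is a minimal sketch rather than the authors' implementation: the pairwise hinge form, the function names, and the weights `alpha`, `lam`, and `margin` are all assumptions introduced for illustration.

```python
from itertools import combinations


def hybrid_regression_loss(preds, targets, alpha=0.5):
    """Weighted sum of MSE and MAE; alpha is a hypothetical mixing weight."""
    n = len(preds)
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n
    return alpha * mse + (1.0 - alpha) * mae


def pairwise_ranking_loss(preds, targets, margin=0.0):
    """Mean hinge penalty over pairs whose predicted order contradicts the labels."""
    penalties = []
    for i, j in combinations(range(len(preds)), 2):
        if targets[i] == targets[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if targets[i] > targets[j] else -1.0
        # Zero when the prediction gap has the same sign as the label gap
        # (and exceeds the margin); positive otherwise.
        penalties.append(max(0.0, margin - sign * (preds[i] - preds[j])))
    return sum(penalties) / len(penalties) if penalties else 0.0


def total_loss(preds, targets, alpha=0.5, lam=0.1):
    """Regression term plus a ranking term weighted by the assumed factor lam."""
    ranking = pairwise_ranking_loss(preds, targets)
    return hybrid_regression_loss(preds, targets, alpha) + lam * ranking
```

With visibility labels of 2, 15, and 40 km, predictions of 3, 14, and 41 km incur no ranking penalty because their order matches the labels, whereas predictions of 16, 14, and 41 km are penalized for swapping the first two samples.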
2. Methodology
2.1. Data Characteristics, Preprocessing and Augmentation
2.2. Network Architecture
2.2.1. Spatial Transmission Branch
2.2.2. Global Frequency Branch
2.2.3. Feature Fusion and Output
2.3. Physics-Informed Loss Function
2.3.1. Hybrid Regression Loss
2.3.2. Monotonic Ranking Loss
2.4. Training Strategy
2.4.1. Three-Phase Training Strategy
2.4.2. Optimization and Regularization
2.4.3. Training Dataset and Equipment
2.5. Architectural Originality and System Summary
- Spatial Backbone (Established): ResNeXt-50 provides a robust baseline for high-level semantic feature extraction. Although the architecture is well established for general image feature extraction, here it serves as a ‘local texture sensor’ within a dual-stream framework specialized for visibility estimation.
- Frequency Branch (Enhanced): While FFT-based learning has been explored in general image processing, our implementation introduces a Global Frequency Branch specifically tuned to capture the low-pass filtering effects caused by atmospheric particles.
- Spectral Gating (Original): This key innovation is motivated by our practical observation that full-spectrum analysis tends to introduce atmospheric noise. The learnable frequency masks are an original contribution of this work.
- Multi-Scale Transmission Attention (Original): Unlike standard spatial attention, this mechanism is derived from the physics of contrast degradation and is an original contribution of this work.
- Monotonic Ranking Loss (Original): This represents a shift from plain regression to physics-constrained supervision, encouraging the model’s predictions to respect the monotonic ordering implied by the physical law of light attenuation.
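The attenuation physics that the attention mechanism and ranking loss refer to is commonly summarized by Koschmieder's law, in which apparent contrast decays exponentially with distance. The sketch below is illustrative only: it assumes a homogeneous atmosphere and the 5% contrast threshold used in the WMO definition of meteorological optical range, and the function names are hypothetical.

```python
import math

CONTRAST_THRESHOLD = 0.05  # assumed: 5% threshold of the WMO MOR definition


def extinction_from_visibility(visibility_km, threshold=CONTRAST_THRESHOLD):
    """Extinction coefficient (1/km) implied by a visibility value."""
    return -math.log(threshold) / visibility_km


def transmission(distance_km, visibility_km):
    """Fraction of contrast surviving over distance_km (Beer-Lambert decay)."""
    beta = extinction_from_visibility(visibility_km)
    return math.exp(-beta * distance_km)


def apparent_contrast(intrinsic_contrast, distance_km, visibility_km):
    """Contrast of a target as seen by the camera through the atmosphere."""
    return intrinsic_contrast * transmission(distance_km, visibility_km)
```

For a landmark at a fixed distance, the surviving contrast increases monotonically with visibility, which is the ordering that a ranking-based loss can exploit. By construction, a target located exactly at the visibility distance retains 5% of its intrinsic contrast.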
3. Experimental Results
3.1. Quantitative Comparison
3.1.1. Comparison with Specialized Models
3.1.2. Comparison with Large Vision-Language Models (LVLMs)
3.1.3. Visual Analysis of Model Predictions
3.1.4. Visualization for the Multi-Scale Transmission Attention (MSTA)
3.1.5. Visualization for the Spectral Gating (SG)
- Priority on Low Frequencies (Top-Left Corner): In both datasets, the highest weights (bright yellow/green) are concentrated near the DC component and the surrounding low-frequency region. This indicates that the model prioritizes global structure and large-scale intensity variations, which are the primary indicators of haze and fog density in meteorological visibility estimation.
- Adaptive Filtering of High Frequencies: The darker regions (purple/dark blue) mark higher-frequency bands that the gating mechanism suppresses. These frequencies typically correspond to sharp edges or transient sensor noise that could destabilize visibility regression.
- Consistency Across Datasets: While the specific distribution of weights varies slightly between the two datasets (reflecting site-specific environmental features), both masks consistently favor the low-to-mid frequency range. This suggests that the Spectral Gating mechanism learns a robust, physically consistent filter for visibility-related spectral features across the deployment sites studied.
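The gating behavior described above can be sketched as an element-wise mask applied to the 2D spectrum. This is a minimal NumPy illustration, not the authors' module: the radial low-pass initialization stands in for a mask that would be learned during training, and `cutoff`, `sharpness`, and all function names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def radial_frequency(height, width):
    """Distance of every FFT bin from the DC component (cycles per pixel)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2)


def init_mask_logits(height, width, cutoff=0.15, sharpness=40.0):
    """Hypothetical low-pass starting point; in training these logits would be learned."""
    return sharpness * (cutoff - radial_frequency(height, width))


def spectral_gate(image, mask_logits):
    """Weight each frequency bin by a value in (0, 1) and return to the pixel domain."""
    spectrum = np.fft.fft2(image)  # DC component sits at index [0, 0]
    gated = spectrum * sigmoid(mask_logits)
    # The mask is symmetric under frequency negation, so the result is real
    # up to floating-point error.
    return np.real(np.fft.ifft2(gated))


if __name__ == "__main__":
    yy, xx = np.indices((32, 32))
    image = 0.5 + 0.2 * (-1.0) ** (yy + xx)  # flat level plus pixel-scale checkerboard
    out = spectral_gate(image, init_mask_logits(32, 32))
    print(out.std() < image.std())  # the checkerboard is strongly attenuated
```

In the unshifted FFT layout used here, the low frequencies sit at the array corners, with the DC term in the top-left position.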
3.2. Ablation Study and Efficiency Analysis
3.2.1. Component Efficiency and Dual-Stream Integration
- Spatial Baseline (ResNeXt-50 Single Backbone): Achieves an overall RMSE of 2.59 km but degrades to 2.96 km in the 0–10 km range. This is consistent with spatial landmarks becoming unreliable when obscured by dense atmospheric particles.
- Spectral Baseline (FFT Single Backbone): Exhibits the highest overall error (RMSE: 3.64 km). Although it captures global blurring patterns, the lack of localized spatial awareness results in poor precision across all ranges.
- Fusion Baseline (VISR-CNN without progressive training): Achieves an overall RMSE of 2.41 km, lower than either single-backbone model, indicating that spatial and spectral features provide complementary information for stabilizing predictions.
- MSTA and SG (VISR-CNN without MSTA and SG): Removing both modules raises the overall RMSE from 2.31 km to 2.39 km (about 3.5% relative to the full model) and the 0–10 km RMSE from 1.95 km to 2.62 km. This indicates that SG and MSTA act as adaptive mechanisms for the frequency and spatial domains, respectively, helping to isolate visibility-dependent features from complex atmospheric interference.
- Physics-Informed Ranking Loss (VISR-CNN without ranking loss): Excluding this loss causes a sharp performance drop in the safety-critical 0–10 km range (RMSE increases from 1.95 km to 2.28 km). This supports the loss term’s role in enforcing the monotonic physical relationship between atmospheric extinction and image contrast.
- Full VISR-CNN: Yields the lowest overall RMSE of 2.31 km. By training each branch in isolation before joint fine-tuning, the progressive strategy prevents the high-gradient spatial branch from overwhelming the frequency features, supporting a robust multi-modal representation.
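The relative changes quoted in this ablation can be recomputed directly from the reported RMSE values. The short script below is a worked-arithmetic sketch; the helper names are hypothetical, and the choice of the full model as the denominator is an assumption about how the percentages are defined.

```python
# RMSE values (km) copied from the ablation table.
OVERALL_RMSE = {
    "ResNeXt-50 Single Backbone": 2.59,
    "FFT Single Backbone": 3.64,
    "VISR-CNN (without progressive)": 2.41,
    "VISR-CNN (without ranking loss)": 2.38,
    "VISR-CNN (without MSTA and SG)": 2.39,
    "VISR-CNN": 2.31,
}
LOW_RANGE_RMSE = {
    "VISR-CNN (without ranking loss)": 2.28,
    "VISR-CNN (without MSTA and SG)": 2.62,
    "VISR-CNN": 1.95,
}


def relative_increase(variant_rmse, full_rmse):
    """Percentage by which a variant's RMSE exceeds the full model's RMSE."""
    return 100.0 * (variant_rmse - full_rmse) / full_rmse


def ablation_report(table, reference="VISR-CNN"):
    """Relative RMSE increase of every variant over the reference, in percent."""
    full = table[reference]
    return {
        name: round(relative_increase(value, full), 1)
        for name, value in table.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, pct in ablation_report(OVERALL_RMSE).items():
        print(f"{name}: +{pct}% overall RMSE")
    for name, pct in ablation_report(LOW_RANGE_RMSE).items():
        print(f"{name}: +{pct}% RMSE in the 0-10 km range")
```

Measured this way, the effect of each removed component is several times larger in the 0–10 km range than in the overall figure.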
3.2.2. Impact of Progressive Training Strategy
3.2.3. Visual Analysis of Single Backbone Model Predictions
3.2.4. Analysis of Computational Complexity
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2341–2353.
2. Nayar, S.K.; Narasimhan, S.G. Vision in bad weather. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; pp. 820–827.
3. Narasimhan, S.G.; Nayar, S.K. Vision and the atmosphere. Int. J. Comput. Vis. 2002, 48, 233–254.
4. World Meteorological Organization. Guide to Meteorological Instruments and Methods of Observation, 2018 ed.; WMO-No. 8; WMO: Geneva, Switzerland, 2018.
5. Robert, G.H.; Michael, P.M. An Automated Visibility Detection Algorithm Utilizing Camera Imagery. In Proceedings of the 23rd Conference on IIPS, San Antonio, TX, USA, 15 January 2007.
6. Babari, R.; Hautiere, N.; Dumont, E.; Bredif, R.; Paparoditis, N. A Model-Driven Approach to Estimate Atmospheric Visibility with Ordinary Cameras. Atmos. Environ. 2011, 45, 5316–5324.
7. Lo, W.L.; Zhu, M.; Fu, H. Meteorological Visibility Estimation Using Multi-Support Vector Regression Method. J. Adv. Inf. Technol. 2020, 11, 40–47.
8. Yan, Q.; Sun, T.; Zhang, J.; Xun, L. Visibility estimation based on weakly supervised learning under discrete label distribution. Sensors 2023, 23, 9390.
9. Cai, B.; Xu, X.; Jia, K.; Qing, C.; Tao, D. DehazeNet: An End-to-End System for Single Image Haze Removal. IEEE Trans. Image Process. 2016, 25, 5187–5198.
10. Palvanov, A.; Cho, Y.I. VisNet: Deep Convolutional Neural Networks for Forecasting Atmospheric Visibility. Sensors 2019, 19, 1343.
11. Pan, H.; Xue, J.; Huang, M.; Lei, X. Air Visibility Prediction Based on Multiple Models. In Proceedings of the IEEE CYBER, Tianjin, China, 19–23 July 2018; pp. 1421–1426.
12. Jin, Z.; Qiu, K.; Zhang, M. Investigation of Visibility Estimation Based on BP Neural Network. J. Atmos. Environ. Opt. 2021, 16, 415–423.
13. Narksri, P.; Darweesh, H.; Takeuchi, E.; Ninomiya, Y.; Takeda, K. Visibility Estimation in Complex, Real-World Driving Environments Using High Definition Maps. In Proceedings of the IEEE ITSC, Indianapolis, IN, USA, 19–22 September 2021; pp. 2847–2854.
14. You, J.; Jia, S.; Pei, X.; Yao, D. DMRVisNet: Deep Multihead Regression Network for Pixel-Wise Visibility Estimation Under Foggy Weather. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22354–22366.
15. Lo, W.L.; Wong, K.W.; Hsung, R.T.C.; Chung, H.S.H.; Fu, H. Meteorological Visibility Estimation Using Landmark Object Extraction and the ANN Method. Sensors 2025, 25, 951.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
17. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
18. Moorthy, A.K.; Bovik, A.C. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE Trans. Image Process. 2011, 20, 3339–3352.
19. Xu, K.; Stevens, M.; Barsky, B.A. Learning in the Frequency Domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1740–1749.
20. Mittal, A.; Soundararajan, R.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
21. Lo, W.L.; Wong, K.W.; Hsung, R.T.C.; Chung, H.S.H.; Fu, H.; Zhu, T.Y.; Tsang, H.S.H.; Pong, K.H. Meteorological Visibility Estimation Through Multi-Modal Feature Fusion with Convolutional and Frequency Domain Representations. In Proceedings of the International Conference on Computer and Communications (ICCC), Chengdu, China, 12–15 December 2025.
22. Xie, L.; Chiu, A.; Newsam, S. Estimating Atmospheric Visibility Using General-Purpose Cameras. In Advances in Visual Computing, Proceedings of the 4th International Symposium, ISVC 2008, Las Vegas, NV, USA, 1–3 December 2008; Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Gantz, L.I., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5359, pp. 356–367.
23. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
24. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
25. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
26. Frigo, M.; Johnson, S.G. The design and implementation of FFTW3. Proc. IEEE 2005, 93, 216–231.
27. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 807–814.
28. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
29. Lo, W.L.; Wong, K.W.; Hsung, R.T.C.; Chung, H.S.H.; Fu, H.; Tsang, H.S.H.; Zhu, T.Y. A Range-Aware Attention Framework for Meteorological Visibility Estimation. Sensors 2026, 26, 1893.
30. Rahman, A. A Systematic Review of Vision Language Models: Comprehensive Analysis of Architectures, Applications, Datasets and Challenges Towards Robust Multimodal Intelligence. Array 2026, 30, 100739.
31. Zhang, C.; Wan, F.; Wei, P.; Xu, K.; Guo, L.; Jiao, J.; Ye, Q. Vision-Language Models for Vision Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1234–1256.
32. Bai, S.; Cai, Y.; Chen, R.; Chen, K.; Chen, X.-H.; Cheng, Z.; Deng, L.; Ding, W.; Fang, R.; Gao, C.; et al. Qwen3-VL Technical Report. arXiv 2025, arXiv:2511.21631.
33. Gemma Team. Gemma 3 Technical Report. arXiv 2025, arXiv:2503.19786.
34. Meta AI. The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes. arXiv 2026, arXiv:2601.11659.













| Methodology | Focus/Core Mechanism | Critical Research Gaps & Limitations |
|---|---|---|
| Traditional Physical Models [1,2,3,5] | Dark Channel Prior (DCP); Koschmieder’s Law; Edge detection. | Relies on idealized atmospheric homogeneity; fails in complex urban scenes with heterogeneous haze or uneven lighting. |
| Statistical & ML Methods [6,7] | Contrast mapping; Multi-SVR. | Effective for high visibility (~5000 m) but exhibits high error margins in safety-critical low-visibility ranges (<400 m). |
| Spatial-Only Deep Learning [8,9,10,11,12,13,14,15] | CNNs (DehazeNet, VisNet); 3D point clouds; Discrete labeling. | Operates primarily in the spatial domain; local convolutions often struggle to generalize across diverse global atmospheric degradation patterns. |
| Advanced Backbones [16,17] | ResNet; ResNeXt (Residual & Cardinality-based learning). | Optimized for object semantic extraction; lacks explicit mechanisms to model haze as a global frequency low-pass filter. |
| Early Dual-Domain Fusion [18,19,20,21] | Hybrid CNN and DCT/FFT features; frequency-domain quality assessment. | Lacks adaptive frequency filtering (Spectral Gating) and physics-informed constraints to prevent branch interference. |

| Category | Hyperparameter | VISR-CNN |
|---|---|---|
| Optimization | Optimizer | AdamW |
| Optimization | Learning Rate | |
| Optimization | Weight Decay | |
| Training | Total Epochs | 180 (Progressive) |
| Training | Batch Size | 32 |
| Training | Training Strategy | 3-Phase |
| Architecture | Backbone Architecture | FFT + ResNeXt-50 |
| Architecture | Attention Type | Multi-Scale Transmission |
| Loss Function | Primary Objective | Physics-Informed (MSE + MAE + Ranking) |
| Loss Function | Target Output | Regression |
| Data | Resolution | 224 × 224 |
| Data | Augmentation | Rotation, Color Jitter, Resize, Crop |
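
The three-phase, 180-epoch schedule listed in the table can be pictured as a lookup from epoch to trainable parameter groups. The sketch below is purely illustrative: the equal 60-epoch split, the phase order, and the group names `spatial`, `frequency`, and `head` are assumptions that are not taken from the table.

```python
TOTAL_EPOCHS = 180  # from the hyperparameter table

# Hypothetical phase lengths and trainable groups, for illustration only.
PHASES = (
    ("spatial_branch", 60, frozenset({"spatial", "head"})),
    ("frequency_branch", 60, frozenset({"frequency", "head"})),
    ("joint_finetune", 60, frozenset({"spatial", "frequency", "head"})),
)


def phase_for_epoch(epoch):
    """Return (phase name, trainable groups) for a zero-based epoch index."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS}), got {epoch}")
    start = 0
    for name, length, trainable in PHASES:
        if epoch < start + length:
            return name, trainable
        start += length
    raise RuntimeError("phase lengths do not cover TOTAL_EPOCHS")


def is_trainable(group, epoch):
    """True if the named parameter group receives gradient updates at this epoch."""
    return group in phase_for_epoch(epoch)[1]
```

In a training loop, `is_trainable` would decide which parameter groups are frozen at the start of each epoch, so that each branch is trained in isolation before joint fine-tuning.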

| Dataset | Split | 0–10 km | 10–30 km | 30–50 km |
|---|---|---|---|---|
| HKCHC-VD | No. of Training Sample Images | 239 | 3192 | 5490 |
| HKCHC-VD | No. of Test Sample Images | 59 | 797 | 1371 |
| HKCHC-VD | Total | 298 | 3989 | 6861 |
| CP1 | No. of Training Sample Images | 386 | 2059 | 400 |
| CP1 | No. of Test Sample Images | 97 | 515 | 100 |
| CP1 | Total | 483 | 2574 | 500 |
| SWH | No. of Training Sample Images | 347 | 2742 | 2710 |
| SWH | No. of Test Sample Images | 87 | 686 | 678 |
| SWH | Total | 434 | 3428 | 3388 |
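
The scarcity of low-visibility samples can be quantified directly from the counts in this table. The following sketch computes the share of each visibility range per dataset; the dictionary layout and function name are hypothetical, while the counts are copied from the table.

```python
# (0-10 km, 10-30 km, 30-50 km) image counts copied from the dataset table.
DATASETS = {
    "HKCHC-VD": {"train": (239, 3192, 5490), "test": (59, 797, 1371)},
    "CP1": {"train": (386, 2059, 400), "test": (97, 515, 100)},
    "SWH": {"train": (347, 2742, 2710), "test": (87, 686, 678)},
}
RANGES = ("0-10 km", "10-30 km", "30-50 km")


def range_shares(name):
    """Percentage of a dataset's images (train + test) in each visibility range."""
    splits = DATASETS[name]
    totals = [tr + te for tr, te in zip(splits["train"], splits["test"])]
    grand_total = sum(totals)
    return {rng: 100.0 * count / grand_total for rng, count in zip(RANGES, totals)}


if __name__ == "__main__":
    for dataset in DATASETS:
        shares = range_shares(dataset)
        print(dataset, {rng: round(pct, 2) for rng, pct in shares.items()})
```

On these counts the 0–10 km range makes up roughly 2.7% of HKCHC-VD, 13.6% of CP1, and 6.0% of SWH.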

| Dataset | Model | Low (0–10 km) MAE | Low (0–10 km) RMSE | Mid (10–30 km) MAE | Mid (10–30 km) RMSE | High (30–50 km) MAE | High (30–50 km) RMSE | Overall MAE | Overall RMSE |
|---|---|---|---|---|---|---|---|---|---|
| HKCHC-VD | VisNet [10] | 2.35 | 4.36 | 1.48 | 2.12 | 1.91 | 2.79 | 1.77 | 2.63 |
| HKCHC-VD | ResNeXt-50 + ViT [29] | 1.73 | 2.69 | 1.26 | 1.78 | 1.84 | 2.67 | 1.62 | 2.39 |
| HKCHC-VD | Qwen3-VL (8B) [32] | 4.13 | 5.62 | 7.36 | 8.99 | 20.12 | 21.66 | 15.13 | 17.85 |
| HKCHC-VD | Gemma3 (12B) [33] | 5.60 | 6.37 | 9.88 | 11.26 | 28.17 | 29.08 | 21.03 | 23.81 |
| HKCHC-VD | Llama4 (16 × 17B) [34] | 4.55 | 5.75 | 12.82 | 15.33 | 23.29 | 25.74 | 19.05 | 22.20 |
| HKCHC-VD | VISR-CNN | 1.45 | 1.95 | 1.17 | 1.68 | 1.76 | 2.62 | 1.54 | 2.31 |
| CP1 | VisNet [10] | 4.2 | 5.31 | 3.93 | 4.95 | 9.74 | 11.12 | 4.78 | 6.24 |
| CP1 | ResNeXt-50 + ViT [29] | 1.47 | 2.26 | 2.87 | 4.03 | 5.64 | 7.00 | 3.07 | 4.399 |
| CP1 | Qwen3-VL (8B) [32] | 5.00 | 6.08 | 6.86 | 8.54 | 15.59 | 17.70 | 7.83 | 10.09 |
| CP1 | Gemma3 (12B) [33] | 5.33 | 5.99 | 5.60 | 7.00 | 20.38 | 21.09 | 7.64 | 10.14 |
| CP1 | Llama4 (16 × 17B) [34] | 3.16 | 4.16 | 7.23 | 8.59 | 19.06 | 20.32 | 8.34 | 10.66 |
| CP1 | VISR-CNN | 1.23 | 1.90 | 2.38 | 3.48 | 4.63 | 6.09 | 2.54 | 3.80 |
| SWH | VisNet [10] | 4.14 | 5.20 | 4.52 | 5.67 | 5.53 | 6.90 | 4.97 | 6.25 |
| SWH | ResNeXt-50 + ViT [29] | 2.05 | 2.48 | 3.19 | 4.51 | 4.33 | 5.58 | 3.65 | 4.95 |
| SWH | Qwen3-VL (8B) [32] | 3.55 | 4.40 | 7.36 | 8.97 | 23.76 | 25.51 | 14.65 | 18.40 |
| SWH | Gemma3 (12B) [33] | 5.90 | 6.68 | 7.84 | 9.28 | 28.07 | 28.95 | 17.03 | 20.73 |
| SWH | Llama4 (16 × 17B) [34] | 3.04 | 3.61 | 10.47 | 11.49 | 29.77 | 30.57 | 18.87 | 22.22 |
| SWH | VISR-CNN | 1.00 | 1.49 | 2.27 | 3.58 | 3.81 | 5.24 | 2.91 | 4.36 |
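
Range-stratified MAE and RMSE figures of the kind tabulated above can be computed by binning test samples on their ground-truth visibility. The sketch below shows one way to do this; the bin-edge convention (lower bound inclusive, upper bound exclusive, with the 50 km ceiling placed in the top bin) and the function names are assumptions.

```python
import math

RANGES_KM = {"low": (0.0, 10.0), "mid": (10.0, 30.0), "high": (30.0, 50.0)}


def mae_rmse(errors):
    """Mean absolute error and root-mean-square error of a list of errors."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse


def range_metrics(preds_km, targets_km):
    """(MAE, RMSE) per visibility range plus overall, binned by ground truth."""
    results = {}
    for name, (lo, hi) in RANGES_KM.items():
        # Assumed convention: [lo, hi), with the 50 km ceiling kept in "high".
        errors = [
            p - t
            for p, t in zip(preds_km, targets_km)
            if lo <= t < hi or (name == "high" and t == hi)
        ]
        if errors:  # skip ranges with no samples
            results[name] = mae_rmse(errors)
    results["overall"] = mae_rmse([p - t for p, t in zip(preds_km, targets_km)])
    return results
```

Because the overall figure pools all samples, it is dominated by whichever range holds the most test images, which is why the per-range columns are reported separately.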

| Methods | Low (0–10 km) MAE | Low (0–10 km) RMSE | Mid (10–30 km) MAE | Mid (10–30 km) RMSE | High (30–50 km) MAE | High (30–50 km) RMSE | Overall MAE | Overall RMSE |
|---|---|---|---|---|---|---|---|---|
| ResNeXt-50 Single Backbone | 2.24 | 2.96 | 1.45 | 2.06 | 1.83 | 2.83 | 1.71 | 2.59 |
| FFT Single Backbone | 3.15 | 5.49 | 2.13 | 3.15 | 2.92 | 3.99 | 2.50 | 3.64 |
| VISR-CNN (without progressive) | 1.51 | 2.29 | 1.22 | 1.84 | 1.81 | 2.69 | 1.59 | 2.41 |
| VISR-CNN (without ranking loss) | 1.51 | 2.28 | 1.18 | 1.69 | 1.82 | 2.71 | 1.58 | 2.38 |
| VISR-CNN (without MSTA and SG) | 1.68 | 2.62 | 1.21 | 1.68 | 1.82 | 2.71 | 1.60 | 2.39 |
| VISR-CNN | 1.45 | 1.95 | 1.17 | 1.68 | 1.76 | 2.62 | 1.54 | 2.31 |

| Framework | Params (M) | FLOPs (G) | Latency (ms) |
|---|---|---|---|
| ResNeXt-50 Single Backbone | 24.16 | 4.29 | 1.94 |
| FFT Single Backbone | 34.68 | 0.08 | 0.18 |
| ResNeXt-50 + ViT (dual-threshold) | 113.93 | 15.58 | 5.53 |
| VISR-CNN | 90.76 | 5.7 | 2.76 |
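
The cost figures in this table are easier to compare as ratios against a baseline. The snippet below is a small arithmetic sketch using the tabulated values; the helper name and the choice of baseline are illustrative assumptions.

```python
# (parameters in millions, FLOPs in G, latency in ms) copied from the table.
COMPLEXITY = {
    "ResNeXt-50 Single Backbone": (24.16, 4.29, 1.94),
    "FFT Single Backbone": (34.68, 0.08, 0.18),
    "ResNeXt-50 + ViT (dual-threshold)": (113.93, 15.58, 5.53),
    "VISR-CNN": (90.76, 5.7, 2.76),
}


def cost_ratio(model, baseline):
    """Model cost divided by baseline cost for (params, FLOPs, latency)."""
    return tuple(m / b for m, b in zip(COMPLEXITY[model], COMPLEXITY[baseline]))


if __name__ == "__main__":
    params, flops, latency = cost_ratio("VISR-CNN", "ResNeXt-50 + ViT (dual-threshold)")
    print(f"params x{params:.2f}, FLOPs x{flops:.2f}, latency x{latency:.2f}")
```

Against the ResNeXt-50 + ViT model, VISR-CNN uses about 80% of the parameters, 37% of the FLOPs, and 50% of the latency.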

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lo, W.L.; Wong, K.W.; Hsung, R.T.C.; Chung, H.S.H.; Fu, H.; Tsang, H.S.H.; Zhu, T.Y. VISR-CNN: A Dual-Stream Framework for Meteorological Visibility Estimation via Multi-Scale Transmission Attention and Spectral Gating. Algorithms 2026, 19, 434. https://doi.org/10.3390/a19060434

