WeatherMono: A CNN-Transformer Architecture for Self-Supervised Monocular Depth Estimation in Rainy and Foggy Conditions
Abstract
1. Introduction
2. Related Work
2.1. Monocular Depth Estimation Based on Convolutional Neural Networks
2.2. Monocular Depth Estimation Based on Transformer
2.3. Monocular Depth Estimation Based on Hybrid Architectures
2.4. Depth Estimation in Adverse Environments
3. Proposed Method
3.1. Overall Network Architecture
3.2. Encoder-Decoder Depth Estimation Network
3.3. Pose Estimation Network (PoseNet)
3.4. Monocular Depth Estimation Loss Function
4. Experiments and Analysis
4.1. Implementation Details
4.2. Datasets and Evaluation Metrics
4.3. Comparison with State-of-the-Art Methods
4.3.1. Quantitative Analysis of Monocular Depth Estimation in Rainy and Foggy Conditions
4.3.2. Qualitative Analysis of Monocular Depth Estimation in Rainy and Foggy Conditions
4.4. Ablation Study on Model Architectures
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References

| Method | Architecture | Key Features | Limitations |
|---|---|---|---|
| Monodepth2 | CNN | Multi-scale prediction, minimum reprojection loss | Poor global context capture |
| DepthFormer | Transformer | Multi-scale self-attention, global-local fusion | High computational complexity |
| MonoViT | CNN-Transformer | Hybrid architecture, position embeddings | Lacks specialized weather handling |
| Lite-Mono | CNN-Transformer | Lightweight design, efficient attention | Reduced capacity for complex weather |
| WeatherDepth | CNN | Curriculum contrastive learning | Limited local feature extraction |
| RoboDepth | CNN-Transformer | Robustness under corruptions | Not specialized for rain/fog |
| Md4all | CNN | Adversarial training for robustness | Limited global context |
| WeatherMono (Ours) | CNN-Transformer | MDC + GLFI modules, multi-scale deformable conv | (proposed method) |

| Methods | AbsRel (↓) | RMSE (↓) | RMSE log (↓) | SqRel (↓) | δ < 1.25 (↑) | δ < 1.25² (↑) | δ < 1.25³ (↑) |
|---|---|---|---|---|---|---|---|
| Monodepth2 | 0.168 | 8.171 | 0.229 | 2.123 | 0.768 | 0.922 | 0.972 |
| DepthHints | 0.164 | 8.065 | 0.227 | 2.117 | 0.774 | 0.926 | 0.976 |
| DPT | 0.161 | 7.918 | 0.218 | 1.902 | 0.776 | 0.931 | 0.977 |
| FSRE-Depth | 0.156 | 7.937 | 0.226 | 2.097 | 0.772 | 0.924 | 0.974 |
| HR-Depth | 0.163 | 7.929 | 0.224 | 1.917 | 0.771 | 0.926 | 0.975 |
| ADDS-Depth | 0.167 | 7.931 | 0.235 | 1.926 | 0.759 | 0.918 | 0.959 |
| CADepth-Net | 0.159 | 7.927 | 0.216 | 1.944 | 0.784 | 0.933 | 0.978 |
| DiFFNet | 0.146 | 7.884 | 0.200 | 1.668 | 0.805 | 0.945 | 0.983 |
| Manydepth | 0.161 | 7.929 | 0.221 | 1.912 | 0.761 | 0.921 | 0.967 |
| MonoFormer | 0.141 | 7.783 | 0.191 | 1.594 | 0.807 | 0.935 | 0.982 |
| DepthFormer | 0.138 | 7.594 | 0.181 | 1.468 | 0.811 | 0.933 | 0.981 |
| DynaDepth | 0.162 | 7.875 | 0.219 | 1.905 | 0.769 | 0.914 | 0.965 |
| Lite-Mono | 0.131 | 6.624 | 0.177 | 1.437 | 0.816 | 0.936 | 0.973 |
| MonoViT | 0.120 | 5.111 | 0.201 | 0.899 | 0.857 | 0.953 | 0.980 |
| PlaneDepth | 0.150 | 6.513 | 0.217 | 1.360 | 0.757 | 0.891 | 0.945 |
| RoboDepth | 0.107 | 4.604 | 0.183 | 0.791 | 0.883 | 0.963 | 0.983 |
| Md4all | 0.121 | 4.606 | 0.181 | 0.784 | 0.882 | 0.961 | 0.982 |
| WeatherDepth | 0.104 | 4.413 | 0.178 | 0.737 | 0.892 | 0.965 | 0.984 |
| WeatherMono | 0.097 | 4.281 | 0.162 | 0.702 | 0.897 | 0.976 | 0.989 |
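
The column names in the table above (AbsRel, RMSE, RMSE log, SqRel, and three accuracy columns) match the metric set commonly used for monocular depth evaluation. The following is a minimal sketch of how such metrics are typically computed, not the authors' evaluation code. It assumes the three accuracy columns are the conventional threshold accuracies δ < 1.25, δ < 1.25², and δ < 1.25³; the function name `depth_metrics` and the toy inputs are hypothetical, and details such as depth capping or median scaling are omitted.

```python
import math


def depth_metrics(pred, gt):
    """Compute common depth error and accuracy metrics.

    pred, gt: equal-length sequences of depths in the same unit.
    Pairs containing a non-positive value are skipped as invalid.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs to evaluate")
    n = len(pairs)

    # Error metrics (lower is better).
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )

    # Threshold accuracy (higher is better): share of pairs whose
    # ratio max(p/g, g/p) falls below 1.25**k, for k = 1, 2, 3.
    ratios = [max(p / g, g / p) for p, g in pairs]
    delta = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]

    return {
        "abs_rel": abs_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "sq_rel": sq_rel,
        "delta_1": delta[0],
        "delta_2": delta[1],
        "delta_3": delta[2],
    }


# Toy values for illustration only; they are not taken from any dataset.
example = depth_metrics(pred=[1.0, 2.0, 4.0], gt=[1.0, 2.4, 2.0])
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In practice the inputs would be the flattened valid pixels of a predicted depth map and its ground truth, and the per-image results would be averaged over the test split.
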

| Methods | AbsRel (↓) | RMSE (↓) | RMSE log (↓) | SqRel (↓) | δ < 1.25 (↑) | δ < 1.25² (↑) | δ < 1.25³ (↑) |
|---|---|---|---|---|---|---|---|
| (a) DrivingStereo Dataset: Rain | | | | | | | |
| ADDS-Depth | 0.227 | 11.907 | 0.271 | 3.165 | 0.641 | 0.894 | 0.951 |
| MonoViT | 0.175 | 9.616 | 0.232 | 2.138 | 0.730 | 0.931 | 0.979 |
| PlaneDepth | 0.220 | 11.671 | 0.278 | 3.302 | 0.654 | 0.883 | 0.965 |
| RoboDepth | 0.167 | 9.157 | 0.221 | 2.019 | 0.755 | 0.938 | 0.982 |
| Md4all | 0.176 | 9.216 | 0.225 | 2.134 | 0.752 | 0.941 | 0.981 |
| WeatherDepth | 0.158 | 8.837 | 0.211 | 1.833 | 0.764 | 0.945 | 0.985 |
| WeatherMono | 0.149 | 8.634 | 0.202 | 1.824 | 0.775 | 0.951 | 0.988 |
| (b) DrivingStereo Dataset: Fog | | | | | | | |
| ADDS-Depth | 0.157 | 10.902 | 0.184 | 2.211 | 0.776 | 0.924 | 0.969 |
| MonoViT | 0.109 | 7.758 | 0.167 | 1.206 | 0.870 | 0.967 | 0.990 |
| PlaneDepth | 0.151 | 7.758 | 0.167 | 1.836 | 0.803 | 0.945 | 0.983 |
| RoboDepth | 0.105 | 7.276 | 0.158 | 1.135 | 0.882 | 0.974 | 0.992 |
| Md4all | 0.131 | 7.415 | 0.165 | 1.241 | 0.842 | 0.981 | 0.991 |
| WeatherDepth | 0.105 | 7.346 | 0.158 | 1.117 | 0.879 | 0.972 | 0.992 |
| WeatherMono | 0.101 | 7.265 | 0.144 | 1.112 | 0.881 | 0.988 | 0.992 |
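
To make the margins in the DrivingStereo rain and fog rows above easier to read, the error metrics can be expressed as a relative reduction over a baseline. The sketch below copies the WeatherDepth and WeatherMono error values from the table and computes that reduction; the helper name `relative_reduction` and the dictionary layout are illustrative choices, not part of the paper.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline.

    Positive values mean `ours` has lower error than `baseline`.
    """
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - ours) / baseline


# Error values copied from the rain and fog rows of the table above.
ERRORS = {
    "rain": {
        "WeatherDepth": {"AbsRel": 0.158, "RMSE": 8.837, "RMSE log": 0.211, "SqRel": 1.833},
        "WeatherMono": {"AbsRel": 0.149, "RMSE": 8.634, "RMSE log": 0.202, "SqRel": 1.824},
    },
    "fog": {
        "WeatherDepth": {"AbsRel": 0.105, "RMSE": 7.346, "RMSE log": 0.158, "SqRel": 1.117},
        "WeatherMono": {"AbsRel": 0.101, "RMSE": 7.265, "RMSE log": 0.144, "SqRel": 1.112},
    },
}


def summarize(errors, baseline="WeatherDepth", ours="WeatherMono"):
    """Return {condition: {metric: relative reduction}} for every listed metric."""
    return {
        condition: {
            metric: relative_reduction(rows[baseline][metric], rows[ours][metric])
            for metric in rows[baseline]
        }
        for condition, rows in errors.items()
    }


for condition, reductions in summarize(ERRORS).items():
    for metric, value in reductions.items():
        print(f"{condition:4s} {metric:8s} {value:.1%}")
```

Relative reductions put metrics with different scales (for example AbsRel and RMSE) on a comparable footing, although they say nothing about statistical significance.
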

| Number | Architecture | AbsRel (↓) | RMSE (↓) | RMSE log (↓) | SqRel (↓) | δ < 1.25 (↑) | δ < 1.25² (↑) | δ < 1.25³ (↑) |
|---|---|---|---|---|---|---|---|---|
| (a) DrivingStereo Dataset: Rain | | | | | | | | |
| (a) | w/o GLFI | 0.153 | 8.847 | 0.211 | 1.916 | 0.770 | 0.949 | 0.987 |
| (b) | w/o MDC | 0.155 | 8.742 | 0.212 | 1.906 | 0.769 | 0.949 | 0.987 |
| (c) | w/o pooled connection | 0.151 | 8.812 | 0.205 | 1.911 | 0.772 | 0.948 | 0.986 |
| (d) | w/o cross-stage connections | 0.150 | 8.753 | 0.204 | 1.914 | 0.773 | 0.949 | 0.987 |
| (e) | WeatherMono (ours) | 0.149 | 8.634 | 0.202 | 1.824 | 0.775 | 0.951 | 0.988 |
| (b) DrivingStereo Dataset: Fog | | | | | | | | |
| (a) | w/o GLFI | 0.105 | 7.428 | 0.156 | 1.243 | 0.876 | 0.986 | 0.991 |
| (b) | w/o MDC | 0.108 | 7.314 | 0.158 | 1.201 | 0.875 | 0.986 | 0.990 |
| (c) | w/o pooled connection | 0.103 | 7.416 | 0.148 | 1.184 | 0.878 | 0.985 | 0.990 |
| (d) | w/o cross-stage connections | 0.102 | 7.354 | 0.146 | 1.211 | 0.879 | 0.986 | 0.991 |
| (e) | WeatherMono (ours) | 0.101 | 7.265 | 0.144 | 1.112 | 0.881 | 0.988 | 0.992 |

| Number | Components | AbsRel (↓) | RMSE (↓) | RMSE log (↓) | SqRel (↓) | δ < 1.25 (↑) | δ < 1.25² (↑) | δ < 1.25³ (↑) |
|---|---|---|---|---|---|---|---|---|
| (a) | √ | 0.167 | 8.946 | 0.231 | 2.016 | 0.753 | 0.932 | 0.977 |
| (b) | √ | 0.153 | 8.802 | 0.226 | 1.971 | 0.761 | 0.940 | 0.981 |
| (c) | √ √ | 0.149 | 8.634 | 0.202 | 1.824 | 0.775 | 0.951 | 0.988 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Qiu, Y. WeatherMono: A CNN-Transformer Architecture for Self-Supervised Monocular Depth Estimation in Rainy and Foggy Conditions. Sensors 2026, 26, 1705. https://doi.org/10.3390/s26051705
