A Residual U-Net Architecture for Built-Up Area Segmentation from Sentinel-2 Images
Abstract
1. Introduction
- A simplified residual U-Net segmentation architecture is proposed for built-up area segmentation from Sentinel-2 multispectral imagery. It combines conventional residual blocks, multi-level skip connections, and test-time augmentation (TTA) at inference, and jointly processes the RGB (B2, B3, B4), NIR (B8), and SWIR (B11) bands.
- The proposed model was evaluated on a multi-temporal Sentinel-2 multispectral dataset constructed in this study for Kocaeli Province. Different multispectral band configurations and segmentation architectures were compared experimentally, and the influence of the TTA-based inference strategy on the final segmentation performance was examined.
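As a rough illustration of what jointly processing the five bands involves, the sketch below stacks B2, B3, B4, B8, and B11 into one channel-first array. It assumes that B11 is supplied on its native 20 m grid and is brought to the 10 m grid by nearest-neighbour repetition; the resampling actually used in the study is not stated here, and `stack_five_bands` is a hypothetical helper name.

```python
import numpy as np

def stack_five_bands(b2, b3, b4, b8, b11_20m):
    """Build a (5, H, W) array in the order B2, B3, B4, B8, B11."""
    # Assumption: B11 arrives on a 20 m grid, so each pixel is repeated
    # 2 x 2 (nearest neighbour) to match the 10 m bands.
    b11 = np.repeat(np.repeat(b11_20m, 2, axis=0), 2, axis=1)
    bands = [b2, b3, b4, b8, b11]
    if any(b.shape != b2.shape for b in bands):
        raise ValueError("all bands must share one grid after resampling")
    return np.stack(bands, axis=0).astype(np.float32)

# Hypothetical 256 x 256 tile; random values stand in for reflectance.
rng = np.random.default_rng(0)
tile_10m = [rng.random((256, 256)) for _ in range(4)]
tile_b11 = rng.random((128, 128))
x = stack_five_bands(*tile_10m, tile_b11)
print(x.shape)  # (5, 256, 256)
```

A single-stream encoder then only needs its first convolution to accept five input channels instead of three.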
2. Related Works
| Ref. | Data Source | Spatial Resolution | Task | Model Type | Output | Key Findings |
|---|---|---|---|---|---|---|
| [16] | Sentinel-2 | 10 m | Built-up area mapping | CNN-based pixel classification | Global built-up probability map | A lightweight CNN using 5 × 5 pixel patches was employed to generate a global built-up map for 2018, validated across 277 regions using building footprint data. |
| [17] | Sentinel-2 | 10 m | Human settlement mapping | Sen2HSE (FCN-based deep learning model) | Settlement map (HSE) | Human settlement areas were automatically mapped from Sentinel-2 imagery. |
| [18] | Sentinel-2 | 10 m | Building footprint extraction | Super-resolution semantic segmentation | Building footprint map | A national-scale building footprint map at 2.5 m resolution was generated from 10 m Sentinel-2 imagery, detecting over 86 million buildings. |
| [19] | Sentinel-1, Sentinel-2 | 10 m | Sub-pixel building estimation | ML regression-based unmixing | Building area fraction | The building area fraction within each pixel was estimated to reduce the mixed pixel effect. |
| [20] | Sentinel-2 | 10 m | Super-resolution | CNN-based autoencoder | High-resolution imagery | The SEN-2_CAENET model was used to enhance the spatial detail of Sentinel-2 images. |
| [21] | Sentinel-1, Sentinel-2 | 10 m | Built-up area mapping | Cross-fusion neural network | Built-up map | Automatic built-up mapping was achieved by fusing Sentinel-1 SAR and Sentinel-2 optical data. |
| [22] | Sentinel-2 | 10 m | Land-use classification | IRUNet (U-Net + InceptionResNetV2) | Land-use map | High classification accuracy (98.21%) was achieved using multi-scale feature fusion and test-time augmentation. |
3. Materials and Methods
3.1. Overview
3.2. Multispectral Feature Reconstruction
3.3. Decoder Architecture
4. Experiment
4.1. Dataset
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Ablation Experiments
- RGB Baseline is based on the U-Net architecture [28] and utilizes only the visible spectral bands (B2, B3, B4).
- FiveBand Single is also based on the U-Net architecture [28] and incorporates the NIR (B8) and SWIR (B11) bands in addition to the visible bands. In this model, all spectral bands are concatenated along the channel dimension and processed through a single encoder stream. This structure enables the direct utilization of multispectral information.
- DeepLabV3+ [29] is a widely used encoder–decoder semantic segmentation architecture that captures multi-scale contextual information using atrous (dilated) convolutions.
- SegFormer [30] is a transformer-based semantic segmentation model that employs hierarchical transformer encoders and a lightweight MLP decoder to effectively capture both local and global contextual information.
- FiveBand Residual (No TTA) employs the same residual U-Net architecture as the proposed model but does not apply test-time augmentation during inference.
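The residual variants above differ from the plain U-Net models in their building block. The following is a minimal NumPy sketch of a conventional residual block (two 3 × 3 convolutions plus an identity or 1 × 1 projection shortcut). It is a simplified illustration under stated assumptions: normalization layers are omitted, and the function names and weight shapes are hypothetical rather than taken from the paper.

```python
import numpy as np

def conv3x3(x, w):
    """Zero-padded 3x3 cross-correlation; x is (C_in, H, W), w is (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd), dtype=np.result_type(x, w))
    for dy in range(3):
        for dx in range(3):
            # Accumulate the contribution of one kernel offset at a time.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out

def residual_block(x, w1, w2, w_skip=None):
    """ReLU(conv(ReLU(conv(x))) + shortcut(x)); normalization is omitted."""
    y = np.maximum(conv3x3(x, w1), 0.0)
    y = conv3x3(y, w2)
    # A 1x1 projection is used on the shortcut when channel counts differ.
    skip = x if w_skip is None else np.einsum("oc,chw->ohw", w_skip, x)
    return np.maximum(y + skip, 0.0)

# Hypothetical first block: 5 input bands mapped to 32 feature channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16, 16))
w1 = rng.standard_normal((32, 5, 3, 3)) * 0.1
w2 = rng.standard_normal((32, 32, 3, 3)) * 0.1
w_skip = rng.standard_normal((32, 5)) * 0.1
features = residual_block(x, w1, w2, w_skip)
print(features.shape)  # (32, 16, 16)
```

Because the shortcut carries the input forward unchanged, the convolutions only have to learn a correction to it, which is the usual motivation for residual blocks in deeper encoders.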
5. Discussion
- The results show that adding multispectral bands substantially improves built-up area segmentation: with the same U-Net architecture, IoU rises from 0.4211 for the RGB Baseline to 0.8230 for FiveBand Single. In particular, the integration of the NIR (B8) and SWIR (B11) bands contributed to a more accurate representation of small-scale built-up regions and complex boundary structures. The proposed FiveBandTTA model achieved the highest IoU, Dice, and Precision scores among all compared models. Furthermore, the combination of multispectral information, the encoder–decoder architecture, and the TTA-based inference strategy improved segmentation stability and produced more consistent predictions.
- The results demonstrate that multispectral information is critical for built-up area segmentation in medium-resolution imagery such as Sentinel-2. In particular, the poor performance of the RGB-only model shows that the visible bands alone do not provide sufficient discrimination. In contrast, the proposed residual U-Net-based structure learned multispectral features more effectively, producing more successful segmentation results in complex urban areas. Furthermore, the TTA-based inference strategy increased prediction stability and generated more consistent segmentation maps.
- The proposed model’s high IoU and Dice performance demonstrates its capability to successfully extract built-up areas from medium-resolution Sentinel-2 imagery. In particular, the encoder–decoder architecture together with multi-level skip connections contributed to the preservation of fine spatial details and boundary information. Experimental results indicate that multispectral bands enhance class separability, especially in complex urban regions. However, the relatively low FPS performance of the model suggests that future studies may focus on developing computationally more efficient architectures while preserving segmentation accuracy.
- The study results indicate that built-up area segmentation in medium-resolution Sentinel-2 imagery depends not only on the model architecture but also on the selected spectral band combination. In particular, the incorporation of the B8 (NIR) and B11 (SWIR) bands contributed to reducing spectral confusion between built-up surfaces and bare soil regions. The stable progression of the training curves demonstrates that the proposed model achieved a balanced learning process and exhibited strong generalization capability. Furthermore, the proposed approach produced more consistent segmentation maps and more accurately delineated building boundaries in complex urban environments.
- The findings demonstrate that residual encoder–decoder architectures can achieve effective performance in multispectral feature learning. In particular, the proposed model generated more consistent segmentation outputs in dense built-up clusters and irregular boundary regions. Although transformer-based models successfully captured global contextual information, the proposed model demonstrated superior performance in preserving fine spatial details and boundary structures. These results indicate that residual-based architectures provide a strong alternative for built-up area segmentation in multispectral Sentinel-2 imagery.
- The results indicate that high-capacity models alone are insufficient for achieving successful built-up area segmentation in medium-resolution Sentinel-2 imagery. In particular, the proposed model produced more stable segmentation outputs in low-resolution boundary regions. Furthermore, the combined use of multispectral bands contributed to reducing spectral interference between built-up areas and spectrally similar surfaces. These findings demonstrate that the proposed approach provides more reliable and consistent segmentation performance in complex urban environments.
- The fact that the proposed model produced more successful results, particularly in small-scale built-up regions and complex boundary structures, demonstrates the effectiveness of the combination of multispectral information, the encoder–decoder model, and multi-level skip connections. The close correspondence between the training and validation curves indicates that the model achieved stable convergence behavior without exhibiting significant overfitting. Furthermore, it was observed that the TTA-based inference strategy improved segmentation stability by combining predictions obtained from different spatial transformations.
- The study results also demonstrate that the use of multispectral data significantly improves segmentation performance, particularly in heterogeneous urban regions. While models utilizing only RGB bands produced more irregular and incomplete built-up boundaries, the incorporation of NIR and SWIR bands resulted in more coherent and complete segmentation maps. The more consistent performance of the proposed model, especially in dense built-up clusters, indicates that the combination of multispectral information and the encoder–decoder model enables more effective utilization of multi-level feature representations. In addition, the stable improvement observed in the model’s validation performance demonstrates the strong generalization capability of the proposed approach.
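The TTA idea discussed above, combining predictions obtained from different spatial transformations, can be sketched as follows. This is a minimal NumPy illustration, not the study's implementation: the set of four flip views is an assumption, and `toy_model` is a hypothetical stand-in for the trained network that maps a (C, H, W) input to an (H, W) score map.

```python
import numpy as np

# Flip axes for a (C, H, W) input: identity, horizontal, vertical, both.
# The choice of these four views is an assumption, not taken from the paper.
TTA_VIEWS = [(), (2,), (1,), (1, 2)]

def tta_predict(model, x):
    """Average model outputs over flipped views of x; returns an (H, W) map."""
    preds = []
    for axes in TTA_VIEWS:
        view = np.flip(x, axis=axes) if axes else x
        prob = model(view)
        # Undo the flip on the (H, W) output so all maps are aligned again.
        back = tuple(a - 1 for a in axes)
        preds.append(np.flip(prob, axis=back) if back else prob)
    return np.mean(preds, axis=0)

def toy_model(x):
    """Hypothetical stand-in: channel mean plus a left-to-right bias."""
    ramp = np.linspace(0.0, 1.0, x.shape[2])
    return x.mean(axis=0) + ramp[None, :]

rng = np.random.default_rng(0)
x = rng.random((5, 8, 8))
averaged = tta_predict(toy_model, x)
print(averaged.shape)  # (8, 8)
```

In this toy case the position-dependent bias of `toy_model` averages out to a constant offset of 0.5, which illustrates how averaging over mirrored views can even out orientation-dependent errors.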
6. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Seto, K.C.; Güneralp, B.; Hutyra, L.R. Global forecasts of urban expansion to 2030. Proc. Natl. Acad. Sci. USA 2012, 109, 16083–16088.
2. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36.
3. Hakim, Y.F.; Tsai, F. Building Footprint Extraction for Large-Scale Basemaps Using Very-High-Resolution Satellite Imagery. Buildings 2026, 16, 675.
4. Fletcher, K. (Ed.) ESA’s Optical High-Resolution Mission for GMES Operational Services; ESA Communications: Paris, France, 2012.
5. Foody, G.M. Status of land cover classification accuracy assessment. Remote Sens. Environ. 2002, 80, 185–201.
6. Sun, Y.; Bi, F.; Gao, Y.; Chen, L.; Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 2022, 14, 906.
7. Wu, Y.; Wang, F.; Zhao, P.; Zhou, M.; Geng, S.; Zhang, D. UNet with multibranch prior information encoding for building segmentation in remote sensing images. Adv. Space Res. 2025, 76, 4296–4313.
8. Yang, Z.; Zhou, D.; Yang, Y.; Zhang, J.; Chen, Z. Road Extraction from Satellite Imagery by Road Context and Full-Stage Feature. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
9. Yang, Z.; Zhou, D.; Yang, Y.; Zhang, J.; Chen, Z. TransRoadNet: A Novel Road Extraction Method for Remote Sensing Images via Combining High-Level Semantic Feature and Context. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5506705.
10. Yang, Z.; Yao, H.; Li, Q.; Ni, W.; Wu, J.; Wang, Q. Semantic–Spatial Feature Refinement Network for Road Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2026, 64, 1–10.
11. Schug, F.; Frantz, D.; Okujeni, A.; van Der Linden, S.; Hostert, P. Mapping urban-rural gradients of settlements and vegetation at national scale using Sentinel-2 spectral-temporal metrics and regression-based unmixing with synthetic training data. Remote Sens. Environ. 2020, 246, 111810.
12. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
14. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065.
15. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
16. Corbane, C.; Syrris, V.; Sabo, F.; Politis, P.; Melchiorri, M.; Pesaresi, M.; Soille, P.; Kemper, T. Convolutional neural networks for global human settlements mapping from Sentinel-2 satellite imagery. Neural Comput. Appl. 2021, 33, 6697–6720.
17. Qiu, C.; Schmitt, M.; Geiß, C.; Chen, T.H.K.; Zhu, X.X. A framework for large-scale mapping of human settlement extent from Sentinel-2 images via fully convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2020, 163, 152–170.
18. Feng, L.; Xu, P.; Tang, H.; Liu, Z.; Hou, P. National-scale mapping of building footprints using feature super-resolution semantic segmentation of Sentinel-2 images. GISci. Remote Sens. 2023, 60, 2196154.
19. Schug, F.; Frantz, D.; Okujeni, A.; Hostert, P. Sub-pixel building area mapping based on synthetic training data and regression-based unmixing using Sentinel-1 and -2 data. Remote Sens. Lett. 2022, 13, 822–832.
20. Arık, A.E.; Paşaoğlu, R.; Emrahaoğlu, N. Sentinel-2 uydu görüntüleri için evrişimli otokodlayıcı sinir ağı ile süper çözünürlük yaklaşımı [A super-resolution approach for Sentinel-2 satellite images using a convolutional autoencoder neural network]. Türk Uzak. Algılama CBS Derg. 2023, 4, 231–241. (In Turkish)
21. Li, Y.; Matgen, P.; Chini, M. Extraction of built-up areas using Sentinel-1 and Sentinel-2 data with automated training data sampling and label noise robust cross-fusion neural networks. Int. J. Appl. Earth Obs. Geoinf. 2025, 139, 104524.
22. Jagannathan, J.; Vadivel, M.T.; Divya, C. Land use classification using multi-year Sentinel-2 images with deep learning ensemble network. Sci. Rep. 2025, 15, 29047.
23. Rikimaru, A.; Roy, P.S.; Miyatake, S. Tropical forest cover density mapping. Trop. Ecol. 2002, 43, 39–47.
24. Diek, S.; Fornallaz, F.; Schaepman, M.E.; De Jong, R. Barest pixel composite for agricultural areas using Landsat time series. Remote Sens. 2017, 9, 1245.
25. Chinilin, A.V.; Lozbenev, N.I.; Shilov, P.M.; Fil, P.P.; Levchenko, E.A.; Kozlov, D.N. Synergetic use of bare soil composite imagery and multitemporal vegetation remote sensing for soil mapping (a case study from Samara region’s upland). Land 2024, 13, 2229.
26. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
27. Rahman, M.A.; Wang, Y. Optimizing intersection-over-union in deep neural networks for image segmentation. In Proceedings of the International Symposium on Visual Computing; Springer: Cham, Switzerland, 2016; pp. 234–244.
28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
29. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 801–818.
30. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.





| Year | Training | Validation | Test | Total |
|---|---|---|---|---|
| 2015 | 904 | 126 | 119 | 1149 |
| 2020 | 926 | 114 | 114 | 1154 |
| 2025 | 942 | 117 | 120 | 1179 |
| Total | 2772 | 357 | 353 | 3482 |
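The split sizes in the table above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the tabulated counts and derives the overall split proportions; the dictionary layout and key names are illustrative choices.

```python
# Image counts per acquisition year, copied from the dataset table.
splits = {
    2015: {"train": 904, "val": 126, "test": 119},
    2020: {"train": 926, "val": 114, "test": 114},
    2025: {"train": 942, "val": 117, "test": 120},
}

# Column totals, grand total, and the share of each subset.
totals = {k: sum(s[k] for s in splits.values()) for k in ("train", "val", "test")}
n_images = sum(totals.values())
fractions = {k: v / n_images for k, v in totals.items()}

print(totals, n_images)
print({k: round(f, 3) for k, f in fractions.items()})
```

The totals reproduce the last row of the table, and the proportions come out at roughly 80% training, 10% validation, and 10% test.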
| Parameter | Value |
|---|---|
| Input image size | 256 × 256 pixels |
| Batch size | 32 |
| Optimizer | AdamW |
| Weight decay | 1 × 10⁻⁴ |
| Initial learning rate | 1 × 10⁻⁴ |
| Scheduler factor | 0.5 |
| Scheduler patience | 5 epochs |
| Learning rate scheduler | ReduceLROnPlateau |
| Random seeds | 42, 123, 3407 |
| Number of epochs | 40 |
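To make the scheduler settings in the table concrete, the sketch below re-implements the basic plateau rule in plain Python: the learning rate is multiplied by the factor once the monitored metric has failed to improve for more than `patience` consecutive epochs. It is a simplified stand-in, not PyTorch's `ReduceLROnPlateau` itself (options such as threshold, cooldown, and minimum learning rate are not modelled), and monitoring a validation score that should increase is an assumption.

```python
class PlateauSchedule:
    """Simplified reduce-on-plateau rule for a metric that should increase."""

    def __init__(self, lr=1e-4, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

# Hypothetical validation scores: one improvement, then a plateau.
schedule = PlateauSchedule(lr=1e-4, factor=0.5, patience=5)
history = [schedule.step(score) for score in [0.50] + [0.40] * 6]
print(history[-2], history[-1])
```

With the tabulated values, the rate stays at 1 × 10⁻⁴ through five stagnant epochs and is halved on the sixth.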
| Model | Input Bands | Architecture | TTA | IoU (Mean ± Std) | Dice (Mean ± Std) | Precision (Mean ± Std) |
|---|---|---|---|---|---|---|
| RGB Baseline | B2, B3, B4 | U-Net | No | 0.4211 ± 0.0020 | 0.5764 ± 0.0019 | 0.6099 ± 0.0018 |
| FiveBand Single | B2, B3, B4, B8, B11 | U-Net | No | 0.8230 ± 0.0016 | 0.9012 ± 0.0011 | 0.9166 ± 0.0010 |
| DeepLabV3+ | B2, B3, B4, B8, B11 | CNN encoder–decoder | No | 0.8019 ± 0.0018 | 0.8877 ± 0.0014 | 0.8709 ± 0.0012 |
| SegFormer | B2, B3, B4, B8, B11 | Transformer-based | No | 0.8137 ± 0.0016 | 0.8950 ± 0.0015 | 0.9214 ± 0.0011 |
| FiveBand Residual (No TTA) | B2, B3, B4, B8, B11 | Residual U-Net | No | 0.8172 ± 0.0015 | 0.8975 ± 0.0012 | 0.8882 ± 0.0008 |
| FiveBandTTA (proposed) | B2, B3, B4, B8, B11 | Residual U-Net with TTA | Yes | 0.8447 ± 0.0013 | 0.9124 ± 0.0010 | 0.9249 ± 0.0009 |
| Evaluation Dataset | IoU | Dice | Precision | Recall |
|---|---|---|---|---|
| Proposed dataset | 0.8447 | 0.9141 | 0.9249 | 0.9064 |
| Human-Annotated Evaluation dataset | 0.8412 | 0.9123 | 0.9381 | 0.8902 |
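The four metrics reported in these tables follow the standard pixel-count definitions for binary masks. A minimal NumPy sketch is given below; whether the study averages per image or accumulates counts over the whole test set is not stated here, so the function simply scores one pair of masks, and the small `eps` term is an illustrative guard against division by zero.

```python
import numpy as np

def segmentation_metrics(pred, target, eps=1e-7):
    """IoU, Dice, precision, and recall for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)    # built-up pixels correctly detected
    fp = np.sum(pred & ~target)   # background predicted as built-up
    fn = np.sum(~pred & target)   # built-up pixels that were missed
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }

# Tiny hand-made example: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
scores = segmentation_metrics(pred, target)
print({k: round(float(v), 4) for k, v in scores.items()})
```

For a single mask pair, Dice and IoU are linked by Dice = 2·IoU / (1 + IoU); here an IoU of 0.5 corresponds to a Dice of about 0.667.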
| Model | Params (M) | FLOPs (G) | FPS without TTA | FPS with TTA |
|---|---|---|---|---|
| RGB Baseline | 8.12 | 15.04 | 21.39 | - |
| FiveBand Single | 8.12 | 15.08 | 19.17 | - |
| DeepLabV3+ | 3.72 | 32.42 | 27.25 | - |
| SegFormer | 1.29 | 9.66 | 20.14 | - |
| Proposed model | 12.69 | 23.50 | 220.41 | 54.62 |
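The FPS figures for the proposed model can be converted into per-image latency to show the cost of TTA. The short calculation below uses only the two tabulated values; interpreting the resulting ratio in terms of forward passes is an assumption, since the number of TTA views is not given in this section.

```python
# FPS values for the proposed model, copied from the efficiency table.
fps_plain = 220.41
fps_tta = 54.62

latency_plain_ms = 1000.0 / fps_plain  # milliseconds per image without TTA
latency_tta_ms = 1000.0 / fps_tta      # milliseconds per image with TTA
slowdown = fps_plain / fps_tta         # how many times slower TTA inference is

print(f"{latency_plain_ms:.2f} ms -> {latency_tta_ms:.2f} ms, x{slowdown:.2f}")
```

The slowdown is close to a factor of four, which would be consistent with about four forward passes per image during TTA.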
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ülker, M. A Residual U-Net Architecture for Built-Up Area Segmentation from Sentinel-2 Images. Appl. Sci. 2026, 16, 6407. https://doi.org/10.3390/app16136407

