1. Introduction
In April 2017, the Central Committee of the Communist Party of China and the State Council officially announced the decision to establish the Xiong’an New Area in Hebei Province. Located in the hinterland of Beijing, Tianjin, and Baoding, the Xiong’an New Area serves as a concentrated hub for relocating the non-capital functions of Beijing, aiming to achieve rational reallocation of resources and promote coordinated regional development. Dynamic monitoring of urban development in the Xiong’an New Area is therefore of great significance for supporting urban planning and development management [1, 2, 3].
High-rise buildings (HRBs) are urban landforms with distinct morphological and functional traits that are essential for accommodating growing populations and driving economic growth. The spatiotemporal evolution of HRBs directly reflects the intensity of regional construction and the progress of urban expansion [4, 5]. In this study, HRBs are defined as clusters of buildings that have an average height exceeding 25 m and form a continuous spatial pattern. This threshold is based on observed evolutionary patterns of building morphology throughout urbanization. In recent decades, accelerated urbanization in China has resulted in new buildings generally exceeding 10 stories, with most surpassing 25 m, creating a clear contrast with traditional low-rise structures [6]. As essential elements of vertical urbanization, HRBs contribute to efficient land use, higher population density, and concentrated urban functions [7, 8, 9, 10]. Therefore, the precise identification and dynamic monitoring of HRBs hold significant theoretical and practical value, particularly in the context of constructing the Xiong’an New Area, a national-level new area. Changes in HRBs there bear directly on the implementation of the Beijing–Tianjin–Hebei coordinated development strategy, offering insights for evaluating regional capacity, optimizing spatial organization, and supporting orderly urban growth.
The continuous advancement of remote sensing technology has significantly improved the temporal and spatial resolution of images, providing robust support for long-term applications [11, 12, 13]. The European Space Agency (ESA) Sentinel-2 imagery provides a technical pathway for monitoring HRBs thanks to its 10 m spatial resolution, global coverage, short revisit interval, and open access policy [14]. Deep learning methods, with their powerful feature learning capabilities, provide effective technical means for the automatic characterization of complex spatial-spectral features in remote sensing images [15, 16]. A fully convolutional network (FCN)-based HRBs extraction method using Sentinel-2 imagery achieves an F1-score above 95% and enables change detection via dual-phase difference analysis [17, 18]; however, processing each temporal phase independently suffers from low computational efficiency, poor utilization of temporal contextual information, and vulnerability to noise and seasonal variations, resulting in a high false change detection rate. Existing studies have attempted to introduce temporal modeling techniques to overcome these issues. Most current temporal feature modeling approaches employ temporal attention mechanisms or recurrent neural networks to capture temporal dependencies. For example, UTAE captures long-term dependencies via temporal attention [19, 20], STANet [21] jointly models spatial and temporal features, and ConvLSTM [22] relies on convolutional recurrent units to capture local temporal dynamics. However, these methods primarily focus on aggregating similar features and lack explicit extraction of temporally distinct features, which limits their sensitivity to change.
To address the above limitations, we propose the Temporal Change-Aware U-Net (TCA-Unet), an encoder–decoder architecture designed to detect long-range temporal changes. The change-aware encoder integrates self-attention with temporal differencing to support dynamic comparisons across multiple temporal frames while preserving spatial context. The change-aware attention module fuses change-sensitive components with temporal attention. Skip connections are temporally aggregated through attention-guided fusion, ensuring effective transfer of spatial details and long-range temporal features to the decoder. The decoder employs adaptive fusion weights to balance original temporal features with change-enhanced representations. This early and deep integration of temporal and spatial cues improves sensitivity to subtle, multi-scale changes while maintaining robustness to noise and false alarms. Specifically, a temporal change-aware attention module is introduced to explicitly model changes between adjacent temporal frames, enabling comparative feature extraction and effective capture of change-related information. A multi-level weight generation mechanism is further designed to produce change feature maps and dynamically balance temporal and change features through learnable fusion parameters, thereby enhancing the model’s capability to perceive and represent temporal variations. The effectiveness and robustness of the proposed method are demonstrated through an application-oriented experiment conducted in the Xiong’an New Area, where multi-temporal Sentinel-2 imagery is used for HRBs extraction and spatiotemporal change characterization.
2. Study Area and Data
2.1. Study Area
The study area is centered on the Xiong’an New Area and covers surrounding counties and districts, with a total area of 26,000 square kilometers. The specific boundaries are shown in Figure 1. The Xiong’an New Area is located in the central part of Hebei Province, at the intersection of Beijing, Tianjin, and Hebei, and includes Rongcheng County, Anxin County, Xiong County, and surrounding areas.
2.2. Data
To assess changes in HRBs from 2017 to 2024, cloud-free Sentinel-2 L2A products acquired between March and April of each year were selected. These products have undergone atmospheric correction and other basic processing to minimize the impact of atmospheric conditions on the imagery. The acquisition times of the imagery are shown in Table 1. Additionally, the acquired imagery was subjected to super-resolution reconstruction using the Gram–Schmidt adaptive (GSA) pansharpening method [23], enhancing the resolution of the 9-band imagery to 10 m. This approach preserves the rich spectral information of the multispectral data while improving the imagery’s ability to capture detailed features.
To ensure training effectiveness and data quality, we selected multiple target regions. Each sample comprises a spatio-temporal data cube spanning eight consecutive time points (T = 8), with each time point containing nine spectral bands. We manually interpreted open-source high-resolution imagery (e.g., Google Maps satellite imagery) and street-view photographs to generate binary ground-truth masks for each time series.
Figure 2 illustrates the temporal alignment between multi-temporal remote sensing images and their corresponding building labels for the study area from 2017 to 2024. For each year, the upper row presents the optical remote sensing imagery, while the lower row shows the manually interpreted binary building masks. The sequence demonstrates the gradual emergence and expansion of building areas over time, highlighting the spatio-temporal evolution of high-rise buildings in the region.
Subsequently, a sliding-window cropping strategy was applied to extract 128 × 128-pixel image patches from each data cube. In total, this produced 208 training samples, each a tensor of size [8, 9, 128, 128] (time steps, bands, height, width). Batches of B samples, of size [B, 8, 9, 128, 128], served as the input for training the proposed model.
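The cropping step can be illustrated with a minimal sketch. The stride, the handling of the image border, and the function names `window_origins` and `patch_boxes` are assumptions made for illustration; the text specifies only the 128 × 128 patch size.

```python
def window_origins(size, patch=128, stride=128):
    """Top-left offsets of sliding windows along one axis (stride is an assumption)."""
    if size < patch:
        return []
    origins = list(range(0, size - patch + 1, stride))
    # Add a final window flush with the border so no pixels are left uncovered.
    if origins[-1] != size - patch:
        origins.append(size - patch)
    return origins


def patch_boxes(height, width, patch=128, stride=128):
    """(row_start, col_start, row_end, col_end) boxes covering a height x width scene."""
    return [
        (r, c, r + patch, c + patch)
        for r in window_origins(height, patch, stride)
        for c in window_origins(width, patch, stride)
    ]


# The same boxes would be applied to every time step and band of a data cube,
# so that each patch keeps the [8, 9, 128, 128] layout described above.
boxes = patch_boxes(300, 256)
```

Reusing one set of boxes across all eight dates keeps each patch spatially aligned through time, which the temporal differencing described later relies on.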
3. Methods
To enable dynamic monitoring of HRBs from multi-temporal Sentinel-2 imagery, the model must jointly address two key challenges: learning consistent temporal dependencies across a fixed-length time series, and explicitly capturing change cues between adjacent time steps. To this end, we propose a temporal change-aware semantic segmentation framework that integrates temporal modeling with change enhancement within a unified encoder–decoder architecture. In the following, we first present the overall TCA-Unet architecture, and then describe the major components and their roles in temporal feature learning and HRBs extraction.
3.1. TCA-Unet
The TCA-Unet model is designed based on the classic U-Net encoder–decoder architecture, retaining the spatial feature extraction capabilities of U-Net while introducing the Temporal Change-Aware (TCA) module as the core innovative component at the deepest layer of the encoder. The model architecture consists of four main components: the spatial encoder, the TCA module, the temporal aggregator, and the spatial decoder. The spatial encoder uses multi-layer convolutional blocks for hierarchical feature extraction, gradually reducing spatial resolution and increasing semantic depth to obtain rich feature representations. At the deepest layer of the encoder, the features are fed into the TCA module for temporal modeling and change detection. Subsequently, based on the attention weights generated by the TCA module, the temporal aggregator performs temporal aggregation on the skip connections across all layers of the encoder, ensuring that important temporal information is effectively transmitted to the decoder. Finally, the spatial decoder restores spatial resolution through upsampling and skip connections to generate segmentation results. The TCA-Unet model architecture is shown in Figure 3.
3.1.1. Temporal Change-Aware Module
The TCA module is the core innovative component of the TCA-UNet architecture, as illustrated in Figure 4. Building upon the foundational temporal modeling capabilities of the Temporal Attention Encoder (TAE), we introduce a collaborative modeling mechanism that integrates “temporal information × change patterns.” This module employs a dual-branch parallel design in which the TAE and the Change-Aware Encoder (CAE) form a primary–auxiliary collaborative relationship. The TAE acts as the backbone path and is responsible for capturing global temporal dependencies. In contrast, the CAE serves as the auxiliary path. It focuses on identifying change regions and enhancing their feature representations by explicitly modeling temporal differential features to generate change weights. The two paths achieve complementary information fusion through an integration mechanism in which the change weights generated by the CAE act directly upon the output features of the TAE. This approach preserves temporal integrity while emphasizing change sensitivity, producing a representation that incorporates both complete temporal information and highlighted change features. This primary–auxiliary collaborative design enables the model to handle the two critical tasks of temporal modeling and change perception within a unified framework.
The temporal attention branch of the TCA module reuses the core architecture of TAE. Through feature projection, spatial aggregation, multi-head self-attention computation, and feature reconstruction, it learns global dependencies and temporal weight distributions across time steps. This branch focuses on capturing long-range temporal dependencies, providing the model with rich sequential contextual information. The Change Perception Branch explicitly models temporal variation patterns: it quantifies inter-frame changes by computing absolute differences between adjacent time steps to extract temporal difference features. The Change Feature Extractor employs a convolutional neural network to extract deep feature representations related to changes. The Change Gating Mechanism generates a spatial change weight map, highlighting regions with significant variations. The Pairwise Weight Generator produces weight information describing temporal relationships based on frame-pair features, thereby enhancing the model’s ability to understand complex change patterns.
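The differencing and gating steps of the change perception branch can be sketched as follows. This is a minimal illustration rather than the authors' implementation: single-channel feature maps stand in for multi-channel tensors, and the fixed `scale` and `bias` of the sigmoid gate are hypothetical stand-ins for parameters that the convolutional layers would learn.

```python
import math


def temporal_differences(frames):
    """frames: list of T 2-D feature maps (lists of rows).
    Returns T-1 maps of absolute differences between adjacent time steps."""
    return [
        [[abs(b - a) for a, b in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def change_gate(diff_map, scale=4.0, bias=-2.0):
    """Stand-in for the learned change gating: a sigmoid maps each
    difference to a weight in (0, 1). scale and bias are assumptions."""
    return [
        [1.0 / (1.0 + math.exp(-(scale * d + bias))) for d in row]
        for row in diff_map
    ]


# Three time steps of a 1 x 2 feature map; the second pixel changes once.
frames = [[[0.0, 0.0]], [[0.0, 1.0]], [[0.0, 1.0]]]
diffs = temporal_differences(frames)
gate = change_gate(diffs[0])
```

A sequence of T frames yields T − 1 difference maps, so the change weights describe transitions between dates rather than the dates themselves.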
3.1.2. Temporal Attention Encoder
The Temporal Attention Encoder (TAE) module serves as the core temporal modeling component in the TCA-Unet architecture, specifically designed for learning temporal features in multi-temporal remote sensing images. This module is based on the understanding that features at different time steps have varying degrees of importance in multi-temporal remote sensing image analysis, while traditional convolutional neural networks cannot explicitly model temporal information. To address this issue, TAE is introduced to learn temporal weight distributions, thereby enabling adaptive selection and enhancement of critical temporal information. The structure of the TAE module is shown in Figure 5.
The TAE module adopts an end-to-end design strategy, and its overall architecture can be divided into four key stages: feature projection, spatial aggregation, temporal attention calculation, and feature reconstruction. First, the feature projection layer applies a one-dimensional convolution, batch normalization, and a ReLU activation function to map the input multi-channel spatial features into the temporal attention computation space. This step enables an effective transformation from spatial to temporal representations. Next, the spatial feature aggregation module employs adaptive global average pooling to compress two-dimensional feature maps into one-dimensional feature vectors, reducing computational complexity while preserving key semantic information.
The temporal attention mechanism constitutes the core innovative component of this module, implemented through a standard multi-head self-attention mechanism [24]. The multi-head self-attention module follows the Transformer design, calculating global dependencies between time steps through a query-key-value mechanism. It uses scaled dot-product attention: dividing the dot products by the square root of the key dimension keeps the softmax away from saturation, which mitigates vanishing gradients and improves training stability. The multi-head mechanism captures different types of temporal relationship patterns by computing multiple attention subspaces in parallel, enhancing the model’s representational capabilities. The temporal attention mode uses a recurrent update strategy that gradually accumulates temporal information by maintaining a hidden state, making it particularly suitable for applications with a clear temporal order.
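Scaled dot-product self-attention over a sequence of time steps can be sketched as below. This is a simplified single-head illustration: queries, keys, and values are taken to be the input vectors themselves, whereas a real module would apply learned projections and several heads. The function names are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(seq):
    """seq: list of T feature vectors (one per time step), each of length d.
    Returns the attended vectors and the T x T attention weights."""
    d = len(seq[0])
    scale = math.sqrt(d)  # scaling factor of scaled dot-product attention
    outputs, weights = [], []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in seq]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[t] * seq[t][j] for t in range(len(seq))) for j in range(d)]
        )
    return outputs, weights


# Three time steps, each summarized by a 2-D vector after spatial pooling.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = temporal_self_attention(seq)
```

Each row of `attn` sums to one and indicates how strongly one time step draws on every other time step.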
After the temporal attention computation, the module reconstructs the features back to their original spatial dimensions. It expands the one-dimensional attention-enhanced features along the spatial axes and applies spatial broadcasting to align them with the spatial structure of the input. This reconstruction process ensures that the attention information is effectively propagated to every spatial location. To maintain training stability, the module employs a residual connection design, adding the attention-enhanced features to the original input to form the final output. The layer normalization mechanism independently normalizes each time step feature, effectively mitigating internal covariate shift issues and improving model training efficiency and generalization capabilities.
3.1.3. Change-Aware Encoder
Building upon the foundational temporal modeling capabilities provided by the TAE module, a specialized CAE module is introduced to explicitly model temporal change patterns, thereby completing the final form of the TCA encoder. The core idea of this mechanism is to combine temporal attention with explicit change detection through multi-level change feature extraction and adaptive weight generation. This enhances the model’s ability to represent temporal changes, enabling the collaborative modeling of both temporal information and change patterns. The detailed structure of the CAE module is shown in Figure 6.
The core of the CAE module lies in computing temporal difference features. This mechanism identifies temporal changes by calculating difference features between adjacent time steps. These difference features undergo feature extraction via a convolutional neural network to generate change weights. Simultaneously, the mechanism constructs paired features by concatenating features from adjacent time steps and generates paired weights through a dedicated weight generator. These two types of weights are fused via multiplication to form comprehensive change-perception weights. Rather than producing an independent feature set that mirrors the TAE output and is then concatenated with it, the CAE generates change weights from temporal differences and uses them to modulate the TAE output, with which they are adaptively fused.
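The multiplicative fusion of the two weight maps, and their use to modulate the TAE output, can be sketched as follows. The weight maps are given directly here because the convolutional generators are not specified in the text, and the `(1 + weight)` modulation rule is an assumption chosen so that unchanged regions pass through unaltered.

```python
def fuse_weights(change_w, pair_w):
    """Element-wise product of two weight maps with values in [0, 1]."""
    return [
        [c * p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(change_w, pair_w)
    ]


def modulate(tae_features, weights):
    """Scale TAE features by (1 + weight): a weight of 0 leaves a feature
    unchanged and a weight of 1 doubles it (assumed modulation rule)."""
    return [
        [f * (1.0 + w) for f, w in zip(row_f, row_w)]
        for row_f, row_w in zip(tae_features, weights)
    ]


change_w = [[1.0, 0.5], [0.0, 1.0]]   # from temporal differences
pair_w = [[1.0, 0.5], [1.0, 0.0]]     # from concatenated frame pairs
fused = fuse_weights(change_w, pair_w)
enhanced = modulate([[2.0, 2.0], [2.0, 2.0]], fused)
```

Because the weights are multiplied, a location is emphasized only when both the difference-based and the pair-based weights agree that it has changed.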
The outputs of the two branches are integrated through an adaptive fusion mechanism, which uses a learnable fusion weight α to balance the contributions of the temporal attention features and the change detection features. The fused representation combines three terms: the output of the temporal attention branch, the output of the change detection branch, and a residual connection term. This design enables the model to automatically adjust the importance of the two branches based on the characteristics of the input data.
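One plausible way to write this fusion is given below as a hedged sketch. The symbols $F_{\mathrm{TAE}}$, $F_{\mathrm{CAE}}$, $F_{\mathrm{res}}$, and the convex weighting by $\alpha$ are assumptions introduced for illustration; the text names only the weight α and the three terms.

```latex
F_{\mathrm{out}} = \alpha \, F_{\mathrm{TAE}} + (1 - \alpha) \, F_{\mathrm{CAE}} + F_{\mathrm{res}},
\qquad \alpha = \sigma(a) \in (0, 1)
```

Here $a$ is a hypothetical unconstrained learnable scalar and $\sigma$ is the sigmoid function, which keeps the fusion weight between 0 and 1 during training.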
Within the overall architecture, the TCA module is deeply integrated with the U-Net encoder–decoder structure. The encoder stage extracts spatial features through downsampling convolutional blocks, followed by temporal modeling and change enhancement via the change-aware temporal encoder. The decoder stage employs a temporal aggregator to process skip connections, ensuring effective propagation of change-aware temporal information across all decoder layers. This design enables the model to capture and utilize change information at multiple scales, thereby enhancing the accuracy and robustness of change detection.
The TCA module differs from traditional implicit change learning strategies in that it computes inter-frame differences directly. It generates change weights through change gating and spatially modulates features based on these weights, thereby capturing and using change information explicitly.
3.1.4. Sequential Segmentation Output and Change Detection
The model ultimately outputs a five-dimensional tensor of size [B, T, C, H, W], where C here denotes the number of categories, enabling independent semantic segmentation predictions for each time step. The output layer uses a convolution to map decoder features to a probability distribution over categories while preserving spatial context. This design enables the model to perform temporal semantic segmentation and change detection simultaneously, providing pixel-level results for multi-temporal remote sensing image analysis. By comparing segmentation results across different time steps, users can identify the change patterns and evolution trends of objects.
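Comparing per-year segmentation results can be illustrated by deriving, for each pixel, the first year in which it is labeled as HRB. This is a minimal sketch under stated assumptions: the masks are already binarized, the function name `first_detection_year` is hypothetical, and no temporal consistency filtering is applied.

```python
def first_detection_year(masks, years):
    """masks: list of T binary 2-D masks (lists of rows), ordered in time.
    years: list of T year labels.
    Returns a 2-D map holding the first year each pixel is labeled 1, or None."""
    rows, cols = len(masks[0]), len(masks[0][0])
    result = [[None] * cols for _ in range(rows)]
    for year, mask in zip(years, masks):
        for r in range(rows):
            for c in range(cols):
                if mask[r][c] == 1 and result[r][c] is None:
                    result[r][c] = year
    return result


masks = [
    [[0, 0], [1, 0]],   # 2021
    [[0, 1], [1, 0]],   # 2022
    [[0, 1], [1, 1]],   # 2023
]
change_map = first_detection_year(masks, [2021, 2022, 2023])
```

A production workflow would likely also require a pixel to remain labeled in later years before accepting the detection, since a single-year false alarm would otherwise be recorded as a construction date.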
3.1.5. Loss Function
The TCA-Unet model uses a composite loss function designed to optimize both classification accuracy and change detection capability. The main loss term is a weighted cross-entropy loss, which handles category imbalance and keeps model performance stable in change detection tasks:

$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c \, y_{i,c} \, \log p_{i,c}$$

where $w_c$ is the category weight, $y_{i,c}$ is the true label, $p_{i,c}$ is the predicted probability, $C$ is the number of categories, and $N$ is the total number of pixels.
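A weighted cross-entropy of this kind can be sketched in a few lines. The example assumes one-hot labels given as class indices and averages over the number of pixels; the function name and the small clamp `eps` are illustrative choices, not details taken from the text.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """probs: per-pixel lists of class probabilities.
    labels: per-pixel class indices.
    class_weights: one weight per class.
    Returns the weighted cross-entropy averaged over the number of pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Only the true class contributes, because the label is one-hot.
        total -= class_weights[y] * math.log(max(p[y], eps))
    return total / len(labels)


probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]
loss = weighted_cross_entropy(probs, labels, class_weights=[1.0, 2.0])
```

Giving the rarer HRB class a larger weight makes each missed HRB pixel cost more than a missed background pixel, which counteracts class imbalance.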
To enhance change-detection capability, the model introduces an attention regularization loss composed of two components: attention sparsity loss and spatial smoothness loss. The attention sparsity loss is introduced to encourage the attention weights to focus on actual change regions while suppressing spurious activations, which is consistent with common practices in attention modeling that promote focused and sparse responses [25]. The spatial smoothness loss enforces continuity in the spatial distribution of attention weights by penalizing abrupt spatial variations, which is inspired by classical total variation–based regularization methods widely used to promote spatial coherence in image analysis [26].
The attention sparsity loss is computed from the attention weights, with a small constant added for numerical stability.
The spatial smoothness loss promotes spatial continuity of the attention weights in the horizontal and vertical directions.
In addition, the model adopts a warm-up training strategy, using only the classification loss in the early stages of training. Once the model has acquired basic change detection capability, the attention regularization loss is gradually enabled to stabilize training and aid convergence.
The total loss adds the sparsity and smoothness losses to the classification loss, each scaled by its own weight coefficient; these coefficients are dynamically adjusted after warm-up training.
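The composite loss with warm-up can be sketched as follows. The specific forms are assumptions, since the text gives only their ingredients: an entropy-style sparsity term using a stability constant, a total-variation-style smoothness term over horizontal and vertical neighbors, and a linear ramp after warm-up. All default values (`warmup_epochs`, `ramp_epochs`, `lam_sparse`, `lam_smooth`) are hypothetical.

```python
import math


def sparsity_loss(attn, eps=1e-8):
    """Entropy-style penalty over a 2-D attention map with values in [0, 1]
    (assumed form; eps is the numerical stability constant)."""
    vals = [a for row in attn for a in row]
    return -sum(a * math.log(a + eps) for a in vals) / len(vals)


def smoothness_loss(attn):
    """Total-variation-style penalty: mean absolute difference between
    horizontal neighbors plus that between vertical neighbors."""
    rows, cols = len(attn), len(attn[0])
    horiz = [abs(attn[r][c + 1] - attn[r][c])
             for r in range(rows) for c in range(cols - 1)]
    vert = [abs(attn[r + 1][c] - attn[r][c])
            for r in range(rows - 1) for c in range(cols)]
    h = sum(horiz) / len(horiz) if horiz else 0.0
    v = sum(vert) / len(vert) if vert else 0.0
    return h + v


def regularization_scale(epoch, warmup_epochs=10, ramp_epochs=10):
    """0 during warm-up, then a linear ramp up to 1 (assumed schedule)."""
    if epoch < warmup_epochs:
        return 0.0
    return min(1.0, (epoch - warmup_epochs + 1) / ramp_epochs)


def total_loss(cls_loss, attn, epoch, lam_sparse=0.01, lam_smooth=0.01):
    """Classification loss plus the scaled attention regularization terms."""
    s = regularization_scale(epoch)
    reg = lam_sparse * sparsity_loss(attn) + lam_smooth * smoothness_loss(attn)
    return cls_loss + s * reg
```

During warm-up the total loss equals the classification loss, so the attention maps are regularized only once the segmentation output has become meaningful.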
3.2. Accuracy Assessment
To comprehensively evaluate the performance of the TCA-Unet model in the task of detecting semantic changes in time-series remote sensing images, this study employs three classic evaluation metrics: overall accuracy (OA), F1 score, and mean intersection over union (mIoU). The OA metric reflects the model’s overall classification accuracy across all categories, providing an intuitive global assessment of model performance. The F1 score, calculated as the harmonic mean of precision and recall, offers a more balanced and reliable performance evaluation metric for imbalanced datasets. mIoU effectively evaluates the model’s accuracy in pixel-level semantic segmentation tasks by calculating the overlap between the predicted region and the true region for each category. It is particularly suitable for applications such as change detection that require precise boundary localization. These metrics quantify the model’s classification performance and segmentation accuracy from different dimensions, providing a scientific and reliable quantitative standard for model performance evaluation.
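The three metrics can be computed from the confusion counts of a binary HRB/background map, as in the sketch below. It assumes that the F1 score is reported for the HRB class and that mIoU averages the IoU of the HRB and background classes; the text does not state which averaging convention was used.

```python
def binary_metrics(pred, truth):
    """pred, truth: flat lists of 0/1 labels (1 = HRB).
    Returns overall accuracy, F1 of the HRB class, and mean IoU over both classes."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)

    oa = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou_hrb = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    iou_bg = tn / (tn + fp + fn) if tn + fp + fn else 0.0
    return {"OA": oa, "F1": f1, "mIoU": (iou_hrb + iou_bg) / 2}


pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
scores = binary_metrics(pred, truth)
```

In this toy case OA is higher than both F1 and mIoU because the correctly classified background pixels dominate the accuracy, which is why the three metrics are reported together.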
3.3. Experimental Setup
We compared the proposed TCA-Unet with three representative temporal modeling baselines: UTAE [19], STANet [21], and ConvLSTM [22]. To ensure a fair comparison, all methods were trained and evaluated on the same dataset split with identical input settings and optimization protocol. We used the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, weight decay of 1 × 10⁻⁴, and a batch size of 4. Training was conducted for 150 epochs, and the best checkpoint was selected according to the performance on the validation set. We used the same loss function for all methods, i.e., a weighted combination of Focal Loss and Dice Loss, to avoid confounding effects introduced by different objective functions. In all experiments, consistent supervision settings were adopted to ensure comparability and reproducibility. All models were implemented in PyTorch 2.5.1 (CUDA 12.1) and trained on a workstation with an NVIDIA GeForce RTX 3060 Ti (16 GB VRAM).
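A weighted combination of Focal Loss and Dice Loss for a binary task can be sketched as below. The focusing parameter `gamma`, the Dice smoothing term, and the equal mixing weights are assumptions, since the text does not report them; the class-balancing factor of the focal loss is omitted for brevity.

```python
import math


def focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    """Binary focal loss. probs: predicted HRB probabilities; labels: 0/1."""
    total = 0.0
    for p, y in zip(probs, labels):
        pt = p if y == 1 else 1.0 - p      # probability assigned to the true class
        pt = min(max(pt, eps), 1.0)        # clamp to avoid log(0)
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(labels)


def dice_loss(probs, labels, smooth=1.0):
    """Soft Dice loss: one minus the smoothed Dice coefficient."""
    inter = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(labels) + smooth)


def combined_loss(probs, labels, w_focal=0.5, w_dice=0.5):
    """Weighted sum of the two losses (mixing weights are assumptions)."""
    return w_focal * focal_loss(probs, labels) + w_dice * dice_loss(probs, labels)


good = combined_loss([0.9, 0.1, 0.8], [1, 0, 1])
bad = combined_loss([0.2, 0.8, 0.3], [1, 0, 1])
```

The focal term down-weights pixels that are already classified confidently, while the Dice term scores region overlap directly, so the two complement each other on imbalanced masks.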
4. Results
The model was trained on the constructed spatio-temporal dataset. The input images were standardized to 128 × 128 pixels with a fixed sequence length. Data augmentation strategies were applied during training. For each temporal sample, random horizontal and vertical flips were applied simultaneously with a probability of 0.5. These transformations were applied identically to both the image and its corresponding label. The model was evaluated against baseline methods using metrics including IoU, overall accuracy, and F1-score.
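The key requirement of the augmentation, that a flip is applied identically to every time step and to the matching labels, can be sketched as below. The example assumes that the horizontal and vertical flips are drawn independently, each with probability `p`; the text could also be read as a single joint draw. One 2-D grid per time step stands in for the full multi-band image.

```python
import random


def hflip(grid):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror a 2-D grid top to bottom."""
    return grid[::-1]


def augment(frames, masks, rng, p=0.5):
    """frames, masks: lists of T 2-D grids. The same flips are applied to all
    time steps and to the labels so that the sample stays aligned."""
    if rng.random() < p:
        frames = [hflip(f) for f in frames]
        masks = [hflip(m) for m in masks]
    if rng.random() < p:
        frames = [vflip(f) for f in frames]
        masks = [vflip(m) for m in masks]
    return frames, masks


rng = random.Random(0)
frames = [[[1, 2], [3, 4]]]
masks = [[[0, 1], [0, 0]]]
aug_frames, aug_masks = augment(frames, masks, rng, p=0.5)
```

Drawing the flip decision once per sample, rather than once per time step, is what preserves the pixel-wise correspondence between dates that temporal differencing depends on.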
4.1. Performance Evaluation
The detailed performance metrics listed in Table 2 show that the TCA-UNet model outperforms the other models on the dataset, achieving an overall accuracy (OA) of 90.98%, an F1-score of 82.63%, and a mean intersection over union (mIoU) of 72.22%. The baseline results confirm that conventional temporal attention mechanisms are effective, but these approaches remain limited in precise change detection.
TCA-UNet demonstrates consistent improvements over other benchmark models. Specifically, it achieves increases of 5.67–6.39% in Overall Accuracy (OA), 5.69–6.83% in F1-score, and 7.88–8.77% in mean Intersection over Union (mIoU). Quantitative results confirm that integrating enhanced temporal modeling with explicit change perception effectively improves performance, offering a robust technical solution for semantic change detection in temporal remote sensing imagery.
The visualization results of the final time step, generated by different models for HRBs in selected regions, are illustrated in Figure 7. As shown in examples (a), (b), and (c), clear performance differences can be observed across models. In region (a), STANet and ConvLSTM produce scattered noise points, whereas TCA-UNet yields outputs that more closely align with the labeled contours. In region (b), other models generate notable false alarms in HRB detection, while TCA-UNet accurately reconstructs the boundaries of the labeled HRBs. In region (c), the comparison models exhibit blurred boundaries and local omissions, whereas TCA-UNet achieves better spatial alignment between the extracted features and the reference labels. Overall, these visual results demonstrate the capability of TCA-UNet to suppress false alarms and improve the reliability of HRB region extraction. Nevertheless, the extraction results are still affected by building shadows and complex urban structures, which limits precise boundary delineation. Consequently, the proposed method is better suited to extracting building regions than to fine-scale boundary segmentation.
These results support the effectiveness of the dual-path architecture design: the temporal attention encoder captures long-range temporal dependencies, while the CAE module identifies the features of changing regions. The interaction between the two enhances the model’s ability to represent complex spatio-temporal change patterns. TCA-Unet thus provides an efficient and reliable technical solution for temporal semantic change detection in remote sensing images, with a clear improvement in detection accuracy.
4.2. Ablation Study
To isolate and quantify the effect of the change-aware enhancement in TCA-Unet, we perform an ablation study by comparing UNet-TAE (without CAE) against the full TCA-Unet (with CAE) under identical training settings. This controlled comparison enables a direct assessment of how the CAE module contributes to segmentation accuracy on the HRBs extraction task.
UNet-TAE Model: Serving as the baseline for the ablation experiments, UNet-TAE retains the complete encoder–decoder framework of TCA-UNet while removing the CAE branch and preserving only the enhanced TAE branch. This allows us to validate the effectiveness of CAE modulation through a direct framework comparison.
The ablation experiments were conducted using the same training strategy and evaluation metrics to ensure result comparability, thereby providing reliable evidence for understanding the contribution and mechanisms of each model component.
Table 3 presents the quantitative results of the ablation experiments.
As shown in Table 3, TCA-UNet outperforms UNet-TAE in Overall Accuracy (OA), F1-score, and mean Intersection over Union (mIoU), with improvements from 88.56%, 80.58%, and 69.61% to 90.98%, 82.63%, and 72.22%, respectively, i.e., gains of 2.42, 2.05, and 2.61 percentage points. These improvements highlight the effectiveness of the CAE module, whose integration with the temporal attention encoder enhances the model’s capacity to perceive and model change information in multi-temporal remote sensing imagery, thereby improving performance in HRB extraction.
4.3. Dynamics of HRBs in Xiong’an New Area
Using the optimally trained TCA-Unet model, HRBs were extracted across the entire study area from 2017 to 2024. The corresponding results are shown in Figure 8.
The spatial distribution of HRBs in 2024 exhibits a markedly uneven pattern across the study area, as illustrated in Figure 8a. The central-northern and eastern regions form the main clusters, characterized by dense and continuous patches. By contrast, the southern and western peripheral areas exhibit relatively sparse and scattered distributions. This spatial disparity is closely related to the geographical context of the study area. The northeastern part borders Beijing, where spillover effects from urban development have intensified land use transformation, such as urban expansion, resulting in more concentrated HRBs changes.
The temporal evolution of HRBs in the Rongdong District indicates that large-scale construction activities commenced in 2021, as shown in Figure 8b. Color coding for the period 2021–2024 reveals a clear and progressive expansion of HRBs coverage. Spatially, the pattern evolved from scattered patches in the early stage to a more contiguous layout in later years, reflecting the rapid development of Rongdong District between 2021 and 2024.
5. Discussion
The experimental results demonstrate that the proposed TCA-Unet framework is effective in capturing multi-year changes in HRBs and provides a reliable basis for regional-scale dynamic monitoring. Compared with conventional temporal modeling approaches that primarily aggregate temporal features, the proposed framework explicitly integrates change-aware mechanisms into temporal feature learning. This design enables the model to emphasize temporally distinct patterns associated with HRBs construction while maintaining spatial coherence across long time spans. As a result, the extracted HRBs distributions exhibit consistent spatial patterns over time, particularly in regions undergoing continuous urban development, which facilitates the interpretation of long-term urban growth dynamics.
The effectiveness of the proposed approach mainly stems from the explicit temporal change-aware modeling strategy. By introducing a temporal change-aware attention module, the framework directly captures differences between adjacent temporal frames, enabling comparative feature extraction that emphasizes change-related information rather than solely aggregating similar temporal features. Furthermore, the designed multi-level weight generation mechanism produces change-sensitive feature maps and adaptively balances temporal attention features and change-enhanced representations through learnable fusion parameters. This asymmetric and adaptive fusion strategy allows the model to more effectively perceive temporal variations associated with HRBs construction processes.
Despite these advantages, several limitations remain. Due to the relatively long construction cycle of HRBs, this study adopts annual-scale time series data, which limits the temporal precision of detecting exact construction timepoints. Moreover, although the proposed model reduces false positives compared with baseline methods, attention responses may still be influenced by visually similar structures and shadow effects, leading to blurred boundaries in certain cases. These observations indicate that the current framework does not explicitly enforce boundary-level constraints and remains sensitive to complex urban visual patterns.
These limitations also point to promising directions for future research. Incorporating multi-source observations, such as higher-temporal-resolution optical imagery or complementary data from other sensors, may enable finer-grained temporal modeling of construction processes. In addition, integrating boundary-aware constraints or shape priors into the learning framework could further enhance boundary delineation accuracy and robustness. Exploring lightweight model designs to reduce computational overhead while preserving change-awareness is another potential direction for extending the applicability of the proposed approach.
6. Conclusions
This study addresses the needs of urban dynamic monitoring by proposing a semantic change detection model, TCA-Unet, based on a temporal change-aware attention module. By incorporating a CAE module, the model effectively captures temporal features in time-series Sentinel-2 satellite imagery, enabling the extraction and detection of HRBs. We selected the Xiong’an New Area and its surrounding regions as the study area, used imagery from 2017 to 2024, and conducted a systematic validation of the model’s performance. The model achieves an overall accuracy (OA) of 90.98%, an F1-score of 82.63%, and a mean intersection-over-union (mIoU) of 72.22% on the test set. Our results indicate that by explicitly modeling temporal difference features, the TCA-Unet model can accurately capture change information in remote sensing image sequences. The multi-level weight generation mechanism, combined with an adaptive fusion strategy that dynamically balances temporal feature representations and change-aware features, improves the model’s robustness to noise and its practicality for processing time-series data.
The TCA-Unet-based change detection method achieves accurate semantic segmentation of HRBs. By modeling multi-temporal differences in building spatial distributions, it further identifies the construction timing of newly built structures and characterizes their spatial evolution. Experimental results demonstrate that the proposed method exhibits high stability and strong discriminative capability in capturing temporal changes in buildings, confirming its effectiveness for time-series change detection tasks.
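One simple way to turn annual segmentation results into a construction-timing map is to record, for each pixel, the first year in which it is labeled as HRB. The sketch below illustrates this post-processing idea only; the function name `first_detection_year`, the nested-list mask format, and the rule itself are assumptions and are not taken from the paper.

```python
def first_detection_year(masks_by_year):
    """Return a grid holding the first year each pixel is labeled as HRB.

    masks_by_year : dict mapping year -> 2D list of 0/1 values (hypothetical
                    format; all masks are assumed to share the same shape).
    Pixels never labeled as HRB keep the value None.
    """
    years = sorted(masks_by_year)
    first_mask = masks_by_year[years[0]]
    rows, cols = len(first_mask), len(first_mask[0])
    result = [[None] * cols for _ in range(rows)]
    for year in years:
        mask = masks_by_year[year]
        for r in range(rows):
            for c in range(cols):
                # Only the earliest positive detection is recorded.
                if mask[r][c] and result[r][c] is None:
                    result[r][c] = year
    return result
```

Pixels already labeled in the earliest year of the series represent existing rather than newly built structures, so a real workflow would likely treat that year separately.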
Author Contributions
Conceptualization, L.L. and J.L.; methodology, L.L., G.C. and J.L.; software, J.L. and L.L.; validation, J.L. and G.C.; formal analysis, J.L.; investigation, L.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, L.L.; visualization, G.C.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China (grant number 41971327).
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Lin, H.Z. The Shrinking of Beijing and the Rising of Xiong’an: Optimize Population Migration in terms of Transport Service. Discret. Dyn. Nat. Soc. 2020, 2020, 8282070.
- Huo, J.E.; Shi, Z.Q.; Zhu, W.B.; Xue, H.; Chen, X. A Multi-Scenario Simulation and Optimization of Land Use with a Markov-FLUS Coupling Model: A Case Study in Xiong’an New Area, China. Sustainability 2022, 14, 2425.
- Kuang, W.H.; Yang, T.R.; Yan, F.Q. Examining urban land-cover characteristics and ecological regulation during the construction of Xiong’an New District, Hebei Province, China. J. Geogr. Sci. 2018, 28, 109–123.
- Vafai, H.; Parivar, P.; Kashani, S.S.; Imani, A.F.; Vakili, F.; Ahmadi, G. Environmental impact analysis of high-rise buildings for resilient urban development. Sci. Iran. 2020, 27, 1843–1857.
- Guan, C.J.T.P. Spatial distribution of high-rise buildings and its relationship to public transit development in Shanghai. Transp. Policy 2019, 81, 371–380.
- Yao, S.; Li, L.W.; Cheng, G.; Zhang, B. Analyzing Long-Term High-Rise Building Areas Changes Using Deep Learning and Multisource Satellite Images. Remote Sens. 2023, 15, 2427.
- Wellmann, T.; Haase, D.; Knapp, S.; Salbach, C.; Selsam, P.; Lausch, A. Urban land use intensity assessment: The potential of spatio-temporal spectral traits with remote sensing. Ecol. Indic. 2018, 85, 190–203.
- Mo, Y.G.; Bao, Y.; Wang, Z.T.; Wei, W.F.; Chen, X.T. Spatial coupling relationship between architectural landscape characteristics and urban heat island in different urban functional zones. Build. Environ. 2024, 257, 111545.
- Guo, L.Y.; Du, S.H.; Sun, W.B.; Fan, D.Q.; Wu, Y.H. Multi-scale impact of urban building function and 2D/3D morphology on urban heat island effect: A case study in Shanghai, China. Energy Build. 2025, 338, 115719.
- Ge, M.Y.; Fang, S.H.; Gong, Y.; Tao, P.J.; Yang, G.; Gong, W.B. Understanding the Correlation between Landscape Pattern and Vertical Urban Volume by Time-Series Remote Sensing Data: A Case Study of Melbourne. ISPRS Int. J. Geo-Inf. 2021, 10, 14.
- Woodcock, C.E.; Loveland, T.R.; Herold, M.; Bauer, M.E. Transitioning from change detection to monitoring with remote sensing: A paradigm shift. Remote Sens. Environ. 2020, 238, 111558.
- Esch, T.; Heldens, W.; Hirner, A.; Keil, M.; Marconcini, M.; Roth, A.; Zeidler, J.; Dech, S.; Strano, E. Breaking new ground in mapping human settlements from space—The Global Urban Footprint. ISPRS J. Photogramm. Remote Sens. 2017, 134, 30–42.
- Miller, R.B.; Small, C. Cities from space: Potential applications of remote sensing in urban environmental research and policy. Environ. Sci. Policy 2003, 6, 129–137.
- Li, J.; Roy, D.P. A global analysis of Sentinel-2A, Sentinel-2B and Landsat-8 data revisit intervals and implications for terrestrial monitoring. Remote Sens. 2017, 9, 902.
- Li, B.N.; Gao, J.C.; Chen, S.P.; Lim, S.; Jiang, H. POI Detection of High-Rise Buildings Using Remote Sensing Images: A Semantic Segmentation Method Based on Multitask Attention Res-U-Net. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4706916.
- Wang, Y.X.; Chen, S.L.; Zhang, R.X.; Xu, F.; Liang, S.; Wang, Y.J.; Yang, W. BuildMon: Building Extraction and Change Monitoring in Time Series Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10813–10826.
- Li, L.W.; Zhu, J.M.; Gao, L.R.; Cheng, G.; Zhang, B. Detecting and Analyzing the Increase of High-Rising Buildings to Monitor the Dynamic of the Xiong’an New Area. Sustainability 2020, 12, 4355.
- Li, L.; Zhu, J.; Cheng, G.; Zhang, B. Detecting High-Rise Buildings from Sentinel-2 Data Based on Deep Learning Method. Remote Sens. 2021, 13, 4073.
- Garnot, V.S.F.; Landrieu, L. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021.
- Sainte Fare Garnot, V.; Landrieu, L. Lightweight Temporal Self-Attention for Classifying Satellite Image Time Series. arXiv 2020, arXiv:2007.00586.
- Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662.
- Xiong, T.S.; He, J.X.; Wang, H.; Tang, X.W.; Shi, Z.; Zeng, Q.Y. Contextual Sa-Attention Convolutional LSTM for Precipitation Nowcasting: A Spatiotemporal Sequence Forecasting View. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 12479–12491.
- Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS plus Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
- Jetley, S.; Lord, N.A.; Lee, N.; Torr, P.H.S. Learn to pay attention. arXiv 2018, arXiv:1804.02391.
- Rudin, L.I.; Osher, S.; Fatemi, E. Nonlinear total variation based noise removal algorithms. Phys. D Nonlinear Phenom. 1992, 60, 259–268.
Figure 1.
Study area and cropped Sentinel-2 data. The (left) figure shows the geographical location of the study area. The (right) figure shows the true color image of the study area in the background; the red line indicates the administrative boundaries of Xiong’an New Area and surrounding counties; light blue indicates the core area of Xiong’an New Area.
Figure 2.
The temporal alignment between multi-temporal remote sensing images and their corresponding building labels for the study area from 2017 to 2024.
Figure 3.
TCA-Unet Model Architecture Diagram.
Figure 4.
Schematic diagram of TCA module. Red lines denote the generation of temporal attention weights for cross-temporal aggregation of skip connections. The “×” symbol denotes element-wise weighting, and “×4” indicates that this operation is applied at four hierarchical levels via skip connections (dashed lines).
Figure 5.
Schematic diagram of TAE. Red lines denote the generation of temporal attention weights for cross-temporal aggregation of skip connections.
Figure 6.
Schematic diagram of CAE module. Red lines denote the generation of temporal attention weights for cross-temporal aggregation of skip connections. The symbol “*” denotes channel-wise concatenation of two adjacent feature representations, resulting in a doubled channel dimension (2 × C).
Figure 7.
Original image and segmentation prediction results. The image displays the extraction results for high-rise building areas at the final time step. Subfigures (a–c) correspond to three representative spatial regions selected for qualitative comparison.
Figure 8.
Spatial distribution and temporal evolution of HRBs in the study area. (a) Extraction results of HRBs in the study area from 2017 to 2024, where red areas indicate the detected HRBs. (b) Temporal evolution of HRBs in the Rongdong District, with different colors representing newly constructed HRBs from 2021 to 2024.
Table 1.
Image acquisition dates used in the experiment. The format “MM-DD” indicates the month and day of image acquisition; for example, “04-18” represents April 18. All images were obtained from the Sentinel-2 satellite over the study area, which spans multiple UTM grid tiles, for the period from 2017 to 2024.
| Year / Tile | T50SMJ | T50SLJ | T50SLH | T50SMH |
|---|---|---|---|---|
| 2017 | 04-18 | 04-18 | 04-18 | 03-06 |
| 2018 | 04-08 | 04-08 | 04-08 | 04-20 |
| 2019 | 04-03 | 04-03 | 03-29 | 03-31 |
| 2020 | 04-12 | 03-23 | 04-12 | 04-04 |
| 2021 | 04-17 | 04-27 | 04-17 | 04-14 |
| 2022 | 03-03 | 03-28 | 03-08 | 03-10 |
| 2023 | 03-28 | 03-28 | 03-18 | 03-15 |
| 2024 | 04-21 | 03-22 | 03-22 | 04-18 |
Table 2.
Performance comparison of different models on the test set.
| Model | Overall Accuracy (%) | F1-Score (%) | mIoU (%) |
|---|---|---|---|
| STANET | 85.31 | 75.91 | 63.55 |
| ConvLSTM | 84.59 | 76.94 | 64.34 |
| UTAE | 85.25 | 75.80 | 63.45 |
| TCA-Unet | 90.98 | 82.63 | 72.22 |
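For readers who want to reproduce metrics of this kind, the following minimal sketch computes overall accuracy, F1-score, and mIoU from the four counts of a binary confusion matrix. It assumes a two-class setting (HRB versus background) with mIoU averaged over both classes; the exact definitions used for the reported numbers are given in the evaluation section of the paper and may differ. The function name `binary_metrics` and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute OA, F1, and two-class mIoU from confusion-matrix counts.

    tp, fp, fn, tn are pixel counts with HRB as the positive class.
    Ratios with a zero denominator are reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    total = tp + fp + fn + tn
    oa = safe_div(tp + tn, total)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # IoU per class, then the unweighted mean over the two classes.
    iou_hrb = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    miou = (iou_hrb + iou_background) / 2
    return {"oa": oa, "f1": f1, "miou": miou}


# Example with made-up counts (not data from the paper).
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Multiplying each value by 100 gives percentages comparable in form to those listed in the table.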
Table 3.
Ablation experiments of the CAE module. Unet-TAE refers to the model that only includes a temporal attention encoder, while TCA-Unet refers to the model that integrates a temporal attention encoder and a CAE module.
| Model | Overall Accuracy (%) | F1-Score (%) | mIoU (%) |
|---|---|---|---|
| Unet-TAE | 88.56 | 80.58 | 69.61 |
| TCA-Unet | 90.98 | 82.63 | 72.22 |