Article

Deep Learning-Based Collapsed Building Mapping from Post-Earthquake Aerial Imagery

1 Department of Architecture and Building Engineering, Institute of Science Tokyo, Yokohama 226-8501, Japan
2 Division of Architectural, Civil and Environmental Engineering, School of Science and Engineering, Tokyo Denki University, Saitama 350-0394, Japan
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3116; https://doi.org/10.3390/rs17173116
Submission received: 2 July 2025 / Revised: 20 August 2025 / Accepted: 2 September 2025 / Published: 7 September 2025

Abstract

Rapid building damage assessments are vital for an effective earthquake response. In Japan, traditional Earthquake Damage Certification (EDC) surveys, which precede the issuance of Disaster Victim Certificates (DVCs), are often inefficient. With advancements in remote sensing technologies and deep learning algorithms, their combined application has been explored for large-scale automated damage assessment. However, the scarcity of remote sensing data on damaged buildings poses significant challenges to this task. In this study, we propose an Uncertainty-Guided Fusion Module (UGFM) integrated into a standard decoder architecture, with a Pyramid Vision Transformer v2 (PVTv2) employed as the encoder. This module leverages uncertainty outputs at each stage to guide the feature fusion process, enhancing the model’s sensitivity to collapsed buildings and increasing its effectiveness under diverse conditions. A training and in-domain testing dataset was constructed using post-earthquake aerial imagery of the severely affected areas of the Noto Peninsula in Ishikawa Prefecture. The model achieved a recall of approximately 79% and a precision of 68% for collapsed building extraction on this dataset. We further evaluated the model on an out-of-domain dataset comprising aerial images of Mashiki Town in Kumamoto Prefecture, where it achieved a recall of approximately 66% and a precision of 77%. In a quantitative analysis against field survey data from Mashiki, the model attained a precision exceeding 87% in identifying buildings with major damage, demonstrating that the proposed method offers a reliable solution for the initial assessment of major damage and has the potential to accelerate DVC issuance in real-world disaster response scenarios.

1. Introduction

Large-scale natural disasters, such as earthquakes, can rapidly destroy numerous buildings, leading to significant casualties and economic losses. Japan, an earthquake-prone nation, has suffered severely from these events. According to the Japan Meteorological Agency [1], since the 2011 Tohoku earthquake and tsunami, two major earthquakes registering 7.0 or higher on the Moment Magnitude Scale (MMS), each resulting in more than 100 fatalities, have occurred, causing immense economic and social losses.
Following earthquakes, rapidly assessing casualty numbers and the distribution and extent of damaged buildings is crucial for developing effective emergency response plans and allocating relief resources. In Japan, post-earthquake building damage assessments are conducted through Earthquake Damage Certification (EDC) field surveys. Based on these survey results, victims receive Disaster Victim Certificates (DVCs) [2], which are essential for obtaining assistance or compensation. Although EDC field surveys are highly accurate, they are limited by human resources and transportation constraints, making large-scale assessments difficult and significantly delaying the issuance of DVCs. For instance, only 20% of DVCs were issued within 30 days of the 2024 Noto Peninsula earthquake in Ishikawa Prefecture [3]. Fortunately, with the recent advancement of remote sensing (RS) technology, it is now possible to obtain high-resolution optical imagery at sub-meter detail over large areas within a short timeframe. This capability makes large-scale earthquake monitoring feasible. Consequently, RS data have become an essential tool for extensive urban damage assessment [4,5,6,7,8,9,10,11,12,13,14,15,16]. Compared to traditional field surveys, remote sensing is low-risk and provides a rapid overview of building collapses across broad regions.
Visual interpretation of RS optical imagery has become an efficient and effective tool for building damage assessment, demonstrating reasonable accuracy when compared with on-site survey results [4,5]. Recognizing this, Japan’s Cabinet Office has proposed using aerial imagery to identify collapsed buildings for preliminary damage evaluations, aiming to reduce the workload of EDC surveys and expedite DVC issuance. However, this approach requires considerable human resources, is time-consuming, and can yield inconsistent outcomes due to subjective judgment by different inspectors. To overcome these limitations, efforts have been made to automate visual interpretation, leading to the development of methods for the automatic extraction of damaged buildings from aerial images.
In the automation of damaged building extraction, early methods were inspired by automated building extraction techniques, where hand-crafted features—based on spectral, spatial, and textural properties—were used to identify intact buildings [17,18]. Extending this idea, researchers designed hand-crafted features specifically for collapsed buildings. For instance, Vu et al. [6] proposed a filter focusing on edge variance, edge direction, and statistical textures to automatically detect collapsed buildings in RS optical images. However, because collapsed buildings manifest different characteristics under various disaster scenarios and lighting conditions, empirically designed hand-crafted features often fail to capture this complexity, limiting the generalization capability of such methods.
In recent years, deep learning has been widely applied to extract damaged buildings from aerial imagery, as it can automatically learn discriminative and representative features of damaged structures. Deep learning-based methods for detecting building damage in RS data can be broadly divided into bitemporal and single-temporal approaches. Bitemporal methods [8,9,10,11] use both pre- and post-event aerial imagery, allowing models to compare building characteristics before and after a disaster. However, it is often difficult to obtain pre-event images taken shortly before a disaster, and ensuring accurate spatial alignment between pre- and post-event images adds further complexity.
Given the urgency of post-disaster assessments, single-temporal methods relying solely on post-event imagery—where the model directly learns the characteristics of damaged buildings—have increasingly been adopted. These methods can be broadly divided into object-based and area-based approaches. Object-based methods [7,12,13] predict both the distribution and the exact number of damaged buildings but often require high-quality imagery (e.g., very-high-resolution aerial images, favorable lighting) or additional spatial data (e.g., building footprints), limiting their overall applicability. In contrast, area-based methods are more data-flexible as they only require standard aerial imagery at the prediction stage, followed by automatic segmentation to extract damaged areas. Considering the urgency of rapid post-disaster damage assessment, these methods are more commonly used and are better suited for quick disaster responses. Naito et al. [14] divided aerial images into numerous small patches, assigning each patch a corresponding damage level label. These labeled patches were then used to train a CNN-based model, which subsequently predicted patch-level building damage distributions across the entire aerial image. However, this approach is limited by the fixed size of the image patches. The scale of buildings in aerial images can vary significantly, causing some patches to cover only a small portion of a building. As a result, the model is restricted to localized views and lacks a broader global context, ultimately hindering its classification accuracy.
The adoption of encoder–decoder architectures has facilitated pixel-level damage mapping from remote sensing imagery. For instance, Xie et al. [15] proposed a CNN-based encoder–decoder network that accounts for the heterogeneous characteristics of damaged buildings, while Liu et al. [16] introduced a lightweight attention-enhanced instance segmentation framework that integrates multiscale feature extraction and feature fusion to achieve accurate and real-time damage identification from post-earthquake aerial imagery captured by drones. By delineating damaged areas at the pixel level, these methods provide an overall distribution of building damage, enabling rapid assessments in disaster response. However, these automated methods mainly rely on the vertical viewpoint of aerial imagery for visual interpretation. To ensure that such determinations are sufficiently reliable to serve as formal evidence for damage assessment, such as for Disaster Victim Certificate (DVC) issuance, they must be validated through the integration of on-site field survey data.
Additionally, unlike standard building extraction—where high-quality RS images can be captured under optimal lighting conditions at any time—RS data for damaged buildings can only be acquired after disasters occur, making the lighting conditions uncertain. In many previous studies, the earthquakes examined occurred in low-latitude regions or during seasons with favorable illumination, allowing researchers to use optical RS data under relatively good lighting. In contrast, damaged-building RS data acquired in poor lighting conditions remain largely absent from the literature. Such data are crucial for future research, as they would enhance the robustness and applicability of damage assessment methods under diverse environmental conditions.
Furthermore, the visual characteristics of buildings in RS data vary significantly due to differences in scale, shape, texture, and disaster-specific appearance (see Figure 1a–c), as well as background interference such as shadows and vegetation. These variations become even more pronounced in post-event images from different earthquakes due to changes in areas and lighting conditions (see Figure 1a,d). Compounding this challenge is the scarcity and morphological diversity of collapsed samples, which introduces a high degree of uncertainty [19] and makes it difficult for models to learn generalized features of collapsed structures. As a result, misclassifications are more likely, particularly when models are applied to out-of-domain areas, thereby limiting overall performance. In addition, the lack of large-scale RS datasets capturing damaged buildings has led most prior studies to evaluate models on a single earthquake event, leaving their transferability to other scenarios largely unverified. While [13,15,16] examined transferability by training and testing models on separate datasets, their experiments did not involve cross-applying pre-trained models to entirely new and unforeseen events. This limitation is critical, as post-earthquake contexts often provide insufficient time to collect adequate data for retraining or fine-tuning. Therefore, for practical deployment, transferability evaluations should prioritize assessing how effectively pre-trained models perform when directly applied to new earthquake scenarios without additional training.
In summary, although deep learning–based approaches combined with RS imagery have shown considerable potential for enabling a rapid post-disaster response, their practical deployment in real disaster scenarios requires addressing the following challenges:
  • Evaluating the performance of such methods under challenging scenarios, such as poor lighting and complex scene conditions;
  • Mitigating the uncertainty introduced by the limited and morphologically diverse samples of collapsed buildings, thereby enhancing the model’s generalization performance and transferability to new disaster events;
  • Validating the reliability of inferred results so that they can be confidently used as the basis for official damage assessments.
To address these issues, we constructed a damaged building dataset under low-light conditions using post-earthquake aerial imagery of the Noto area and employed it to evaluate model performance in such challenging scenarios. To mitigate the uncertainty caused by the scarcity of collapsed buildings, we developed the Uncertainty-Guided Fusion Module (UGFM) and integrated it into a Transformer-based network, Pyramid Vision Transformer v2 (PVTv2) [20]. To validate the model’s generalization ability, we applied the trained model to post-earthquake aerial imagery of Mashiki Town in Kumamoto Prefecture. Furthermore, a quantitative analysis was conducted by comparing the model’s predictions with field survey data to assess its practical feasibility in accelerating DVC issuance under real-world disaster response conditions.

2. Datasets and Methods

2.1. Outline of the 2024 Noto Peninsula Earthquake

The 2024 Noto Peninsula earthquake occurred on 1 January 2024, at 16:10 JST (07:10 UTC), with its epicenter approximately 6 km north-northeast of Suzu City in Ishikawa Prefecture, Japan, and an MMS of 7.6 [21]. The epicenter and the distribution of seismic intensity are shown in Figure 2 [22]. The shaking and subsequent hazards, such as tsunamis, landslides, and fires, caused extensive destruction across the Noto Peninsula, particularly affecting Suzu, Wajima, Noto, and Anamizu. According to a survey by the Cabinet Office of Japan (Disaster Management), the earthquake resulted in 339 fatalities, 1678 injuries, and the collapse of 5910 buildings in Ishikawa Prefecture [23].

2.2. Outline of the 2016 Kumamoto Earthquakes

The 2016 Kumamoto earthquakes were a series of seismic events that included a foreshock with an MMS of 6.2, occurring at 21:26 JST (12:26 UTC) on 14 April 2016, and a mainshock with an MMS of 7.0, occurring at 01:25 JST on 16 April 2016 (16:25 UTC, 15 April), in the Kumamoto region of Kumamoto Prefecture, Kyushu, Japan. According to a survey conducted by the Cabinet Office of Japan (Disaster Management), these two earthquakes resulted in 211 fatalities, 1142 injuries, and the collapse of 8666 buildings within Kumamoto Prefecture [24].

2.3. Selected Aerial Images

A few days after the earthquakes, the Geospatial Information Authority of Japan (GSI) captured post-disaster ortho aerial imagery of the affected areas. Each image has a spatial resolution of 0.2 m and dimensions of 10,000 × 7500 pixels, with three RGB channels [25]. To ensure a sufficient number of collapsed building samples, this study selected aerial images from the most severely affected regions in both earthquakes. These include the center of Wajima City (captured on 11 January 2024, hereafter referred to as Wajima), Machinomachi in Wajima City (11 January 2024), the Ukai fishing port in Suzu City (2 January 2024, hereafter Ukai), the center of Suzu City (2 January 2024, hereafter Suzu), and Mashiki Town in Kumamoto Prefecture (19 April 2016). Figure 3 illustrates these areas. Due to the different timing of these two earthquakes, the aerial images from the Noto area were captured during winter and are affected by low-light conditions, whereas those from Mashiki benefit from favorable lighting conditions in spring (see Figure 1a,d).

2.4. Dataset Construction

The selected aerial photographs were manually labeled in QGIS [26] into three categories: collapsed area, non-collapsed area, and background. Because the vertical perspective of ortho aerial images primarily captures the roof and obscures the structural elements below (e.g., walls, columns), buildings were annotated as collapsed if they exhibited any of the following characteristics:
  • Visible roof damage exposing internal structural fragments (see Figure 4a);
  • Significant roof structural failure (see Figure 4b,d);
  • Loss of roof texture continuity, such as fragmented roofing (see Figure 4c);
  • Complete loss of defining structural characteristics, with the building turning into debris or ruins.
To enable the model to more precisely learn the discriminative features of collapsed structures during training, thereby improving its ability to accurately segment collapsed buildings during inference, we annotated only the portions of buildings that exhibited collapse characteristics. Buildings where the entire structure exhibited these collapse criteria were considered totally collapsed, and the entire building was labeled as collapsed (see Figure 4a,b). Conversely, buildings where only parts of the structure met the criteria while other sections remained intact were considered partially collapsed, and only the damaged parts were labeled as collapsed (see Figure 4c,d).
Considering that wooden houses constitute the majority of buildings in both the Noto (over 75%) and the Kumamoto (over 65%) regions [27] (see Figure 5), we assumed that the buildings in the labeled collapsed areas correspond to the D5 damage level (see Figure 6) defined in the wooden building damage assessment standard established by Okada and Takai [28].
Due to hardware limitations, such as restricted GPU memory, the original aerial images were too large to be directly processed by the deep learning model. Therefore, each image was divided into smaller patches of 512 × 512 pixels with a stride of 256 pixels. To mitigate dataset imbalance caused by an excessive number of background-only samples, patches containing only background were discarded. For data augmentation, two widely used strategies in the remote sensing domain were employed: Random Horizontal Flipping and Random Gaussian Blur.
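For illustration, the following is a minimal NumPy sketch of this tiling step (512 × 512 patches with a stride of 256, background-only patches discarded). The array layout and the background label value of 0 are assumptions for the sketch, not the authors' exact preprocessing code.

```python
import numpy as np

def tile_image(image: np.ndarray, label: np.ndarray,
               patch: int = 512, stride: int = 256, bg_value: int = 0):
    """Cut an aerial image and its label mask into overlapping patches.

    Patches whose label contains only the background class are discarded,
    mirroring the preprocessing described in Section 2.4.
    """
    patches = []
    h, w = label.shape
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            lbl = label[top:top + patch, left:left + patch]
            if np.all(lbl == bg_value):      # background-only patch -> skip
                continue
            img = image[top:top + patch, left:left + patch]
            patches.append((img, lbl))
    return patches
```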
The division of the training and test sets was designed with consideration of the distinct damage characteristics observed at each site. For instance, Ukai and Suzu were affected by both intense ground shaking and tsunami-related destruction; Wajima experienced widespread fire damage, while Machinomachi and Mashiki were primarily impacted by seismic shaking alone [25]. These variations resulted in diverse building damage patterns (see Figure 7). To enhance the model’s capacity to learn from a wide range of damage types and to improve its generalization performance, we designated Wajima, Machinomachi, and Ukai as the training set, while Suzu and Mashiki were used as the test set. A summary of the dataset after preprocessing is provided in Table 1. Regarding the quantity of annotated buildings, it was difficult to count the exact number of collapsed structures, as intense ground shaking and secondary disasters (e.g., fire or tsunami) often led to complete structural destruction, making building boundaries unrecognizable from the aerial view. Therefore, instead of counting buildings, we quantified the collapsed and non-collapsed areas by pixel counts in the 0.2 m resolution imagery. It is worth noting that the area occupied by non-collapsed buildings is substantially larger than that of collapsed buildings, reflecting a significant class imbalance in the dataset.

2.5. Proposed Network

2.5.1. Network Architecture

The overall network architecture follows a typical encoder–decoder structure (see Figure 8), which has been widely adopted in semantic segmentation tasks due to its effectiveness in capturing hierarchical features and reconstructing high-resolution predictions from compressed representations.
  • Encoder: In the encoder, we employ a pretrained transformer-based backbone, Pyramid Vision Transformer-V2-B2 [20] (PVT-V2-B2), to extract hierarchical features from the input image across four stages (P_i, i = 1, 2, 3, 4) with channel dimensions of 64, 128, 320, and 512, respectively, ranging from low-level to high-level;
  • Decoder: In the decoder, the highest-level feature P_4 is first processed by two consecutive 3 × 3 convolutions to generate a coarse segmentation result M_4. Subsequently, this lower-resolution result is transformed into an uncertainty map by the Uncertainty-Guided Fusion Module (UGFM). The uncertainty map guides the feature fusion between low-level and high-level features, producing fused features (F_i, i = 1, 2, 3) with channel dimensions of 64, 128, and 320, respectively. These fused features are then fed into two consecutive 3 × 3 convolutions, which adjust the channel dimensions and enhance the nonlinear representational capacity of the decoder to generate the segmentation result for the next stage.
This UGFM feature fusion process, followed by two consecutive 3 × 3 convolutions, is repeated at each stage until a final refined output with reduced uncertainty is produced. The entire process of the decoder can be summarized as
M_4 = \mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{3 \times 3}(P_4))
F_3 = \mathrm{UGFM}(P_4, P_3, M_4)
M_i = \mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{3 \times 3}(F_i)), \quad i = 1, 2, 3
F_{i-1} = \mathrm{UGFM}(F_i, P_{i-1}, M_i), \quad i = 2, 3
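The PyTorch sketch below is a schematic reading of this decoder recursion under stated assumptions: each segmentation head is two 3 × 3 convolutions with a three-class output, and ugfm_cls stands for a module with the interface sketched in Section 2.5.3 below. It is illustrative, not the authors' released implementation.

```python
import torch.nn as nn

class SegHead(nn.Module):
    """Two consecutive 3x3 convolutions producing a stage segmentation map M_i."""
    def __init__(self, in_ch: int, num_classes: int = 3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_classes, 3, padding=1),
        )
    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """Decoder recursion of Section 2.5.1 with channel sizes 64/128/320/512."""
    def __init__(self, ugfm_cls, channels=(64, 128, 320, 512), num_classes=3):
        super().__init__()
        self.head4 = SegHead(channels[3], num_classes)
        self.ugfm3 = ugfm_cls(channels[3], channels[2])
        self.head3 = SegHead(channels[2], num_classes)
        self.ugfm2 = ugfm_cls(channels[2], channels[1])
        self.head2 = SegHead(channels[1], num_classes)
        self.ugfm1 = ugfm_cls(channels[1], channels[0])
        self.head1 = SegHead(channels[0], num_classes)

    def forward(self, p1, p2, p3, p4):
        m4 = self.head4(p4)          # coarse prediction at the deepest stage
        f3 = self.ugfm3(p4, p3, m4)  # uncertainty-guided fusion with the next encoder stage
        m3 = self.head3(f3)
        f2 = self.ugfm2(f3, p2, m3)
        m2 = self.head2(f2)
        f1 = self.ugfm1(f2, p1, m2)
        m1 = self.head1(f1)
        return m1, m2, m3, m4        # all four maps feed the multi-branch loss (Section 2.5.4)
```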

2.5.2. Outline of the Pyramid Vision Transformer v2

The PVT is a transformer-based backbone network designed to enhance the capabilities of the standard Vision Transformer (ViT) [29]. ViT employs a columnar structure and, constrained by computational cost, can only process coarse-grained image patches, yielding single-scale features and outputs. PVT addresses these limitations by introducing the Spatial Reduction Attention (SRA) module. The SRA module significantly reduces computational complexity, enabling the processing of finer-grained image patches. Moreover, unlike ViT, where query patches interact with other patches at the same scale, the SRA module allows query patches to engage with patches at larger scales, facilitating multi-scale information interaction. Additionally, the feature pyramid structure of PVT generates feature maps with varying resolutions and channel dimensions at different stages, thereby enhancing multi-scale feature learning.
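As a rough illustration of the spatial-reduction idea only (not the official PVTv2 implementation, which additionally uses overlapping patch embedding, a linear SRA variant, and convolutional feed-forward layers), a simplified SRA layer might look as follows:

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Simplified SRA sketch: keys/values come from a spatially reduced token grid,
    so full-resolution queries attend to far fewer tokens than in standard ViT."""
    def __init__(self, dim: int, num_heads: int = 2, sr_ratio: int = 4):
        super().__init__()
        # strided convolution shrinks the K/V token grid by sr_ratio in each dimension
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w
        b, n, c = x.shape
        x_2d = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(x_2d).flatten(2).transpose(1, 2)   # (B, N / sr_ratio^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)   # full-resolution queries
        return out

# example: 64-dim tokens on a 32x32 grid
sra = SpatialReductionAttention(dim=64, num_heads=2, sr_ratio=4)
y = sra(torch.randn(1, 32 * 32, 64), h=32, w=32)        # -> (1, 1024, 64)
```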
Figure 9 shows the information interaction method in CNNs, ViT, and PVT architectures. In the context of remote sensing image segmentation tasks, the effective fusion of multi-level features, specifically, low-level fine-grained spatial details and high-level contextual information with large receptive fields, is essential for generating stable and accurate segmentation results. The transformer-based architecture of PVT allows for the interaction of local and global information at each stage, while its pyramid-like multi-scale output characteristics make it particularly well-suited for handling the complexities of remote sensing images. Consequently, PVTv2 has been selected as the backbone network for this study.

2.5.3. Proposed Uncertainty Guide Fusion Module

Buildings in post-earthquake aerial images exhibit the following characteristics:
  • Significant variations in appearance (e.g., scale, shape, texture, damage pattern);
  • Ambiguous boundaries with surrounding objects (e.g., shadows or vegetation);
  • Inconsistent distribution, appearing either sparsely or densely across the region.
For different earthquakes, variations in lighting conditions further intensify these characteristics, creating even greater diversity in building appearance. This complexity introduces a high degree of uncertainty in model predictions, making misclassifications, such as missing or incorrectly identifying buildings, more likely, especially when the model is applied to out-of-domain areas. To address this issue, and inspired by the Uncertainty-Aware Fusion Module [30], we introduce the Uncertainty-Guided Fusion Module (UGFM), which can be applied to multi-class tasks and provides a more consistent and detailed representation of uncertainty through entropy. During the upsampling and feature-fusion process, this module leverages the coarse segmentation output to guide feature fusion, thereby enhancing the model’s capacity to recognize rare or difficult samples and reducing the uncertainty in the upsampled fused features.
As illustrated in Figure 10, the fusion of P_4 and P_3 is used as an example to demonstrate the entire fusion process. First, a probability map is generated from the lower-stage coarse segmentation result M_4 via the Softmax function. Since information entropy considers the full probability distribution across all classes and captures the overall disorder within the system, it provides a continuous, nonlinear quantification of uncertainty. For example, in the three-class setting of this study (N = 3, representing the three classes: background, non-collapsed, and collapsed), entropy reaches its maximum when probabilities are evenly distributed (x = 1/3), indicating maximal uncertainty (see Figure 11). Around this region, entropy varies slowly, meaning that similarly uncertain predictions are assigned comparable values. Conversely, near the extremes (x close to 0 or 1), entropy decays rapidly, reflecting sharp reductions in uncertainty as predictions become more confident.
\mathrm{Uncertainty} = -\sum_{i=1}^{N} x_i \log_e x_i
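For instance, with the natural logarithm and N = 3, a maximally uncertain prediction and a confident one yield

H\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) = -3 \cdot \tfrac{1}{3}\ln\tfrac{1}{3} = \ln 3 \approx 1.10, \qquad H(0.9, 0.05, 0.05) = -(0.9\ln 0.9 + 2 \times 0.05\ln 0.05) \approx 0.39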
To prompt the model to focus more on these ambiguous regions, both P_4 and P_3 are multiplied by this uncertainty map, with 1 added beforehand to avoid excessively diminishing the original feature values. Additionally, nearest-neighbor interpolation is used to upsample the uncertainty map, aligning it with the spatial resolution of P_3.
P_4' = (1 + UM) \times P_4
P_3' = \mathrm{Up}(1 + UM) \times P_3
After this uncertainty-guided enhancement (P_3', P_4'), bilinear interpolation is applied to upsample P_4' so that it matches the spatial resolution of P_3'. The two enhanced features are then concatenated and passed sequentially through 1 × 1, 3 × 3, and 3 × 3 convolutions, restoring the original channel dimensionality and producing the fused feature map F_3.
\mathrm{Seq\text{-}Conv}(x) = \mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(x)))
F_3 = \mathrm{Seq\text{-}Conv}(\mathrm{Concat}(P_3', \mathrm{Up}(P_4')))
In the subsequent upsampling and feature fusion process, F_3 is used to generate M_3. The UGFM then uses M_3 to guide the fusion of F_3 and P_2, generating the fused feature F_2. This procedure is repeated until M_1 is obtained, which serves as the final refined segmentation map. The only trainable parameters introduced by the UGFM are those of the convolutions applied after feature concatenation. The core component, the uncertainty map, is dynamically generated for each input image rather than being fixed during training. This design prevents the model from overfitting to the training data and enables it to flexibly adapt to unseen datasets with different feature distributions, thereby enhancing its generalization performance.
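A minimal PyTorch sketch of this fusion step, consistent with the equations above, is given below. The channel arguments, the small constant inside the logarithm, and the activation placement are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UGFM(nn.Module):
    """Sketch of the Uncertainty-Guided Fusion Module (Section 2.5.3).

    The entropy of the coarse prediction weights both inputs before fusion;
    the only trainable parameters are the 1x1 + 3x3 + 3x3 fusion convolutions.
    """
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, low_ch, 1),
            nn.Conv2d(low_ch, low_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(low_ch, low_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, high, low, coarse_logits):
        # entropy of the per-pixel class distribution -> uncertainty map (B, 1, h, w)
        prob = torch.softmax(coarse_logits, dim=1)
        um = -(prob * torch.log(prob + 1e-8)).sum(dim=1, keepdim=True)

        high = (1.0 + um) * high                            # emphasise uncertain regions
        um_up = F.interpolate(um, size=low.shape[2:], mode="nearest")
        low = (1.0 + um_up) * low

        high_up = F.interpolate(high, size=low.shape[2:],
                                mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([low, high_up], dim=1))  # fused feature F
```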

2.5.4. Loss Function

In Figure 8, the model produces four output segmentation maps: M_1, M_2, M_3, and M_4. Each segmentation map M_i is compared to the ground truth (GT) using a cross-entropy (CE) loss, denoted as L_i. The overall loss L_all is then expressed as the sum of these individual losses, facilitating faster convergence through a multi-branch loss formulation.
L_1 = \mathrm{CE}(M_1, GT)
L_i = \mathrm{CE}(\mathrm{Up}(M_i), GT), \quad i = 2, 3, 4
L_{all} = \sum_{i=1}^{4} L_i
Moreover, for M_2, M_3, and M_4, bilinear interpolation is applied to upsample each output map to match the spatial resolution of the GT before computing the respective losses.
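A compact PyTorch sketch of this multi-branch loss, assuming the four stage outputs are ordered from M_1 to M_4, is given below for illustration.

```python
import torch.nn.functional as F

def multi_branch_loss(outputs, gt):
    """Sum of cross-entropy losses over the four stage outputs (Section 2.5.4).

    outputs: [m1, m2, m3, m4] logits of shape (B, C, h_i, w_i); m2-m4 are
    bilinearly upsampled to the ground-truth resolution before the loss.
    gt: (B, H, W) long tensor of class indices.
    """
    total = 0.0
    for i, m in enumerate(outputs):
        if i > 0:  # m2, m3, m4 are lower resolution than the ground truth
            m = F.interpolate(m, size=gt.shape[-2:],
                              mode="bilinear", align_corners=False)
        total = total + F.cross_entropy(m, gt)
    return total
```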

2.6. Experimental Configuration and Training Methods

All training and testing procedures were conducted on Tsubame 4.0, a supercomputer at the Institute of Science Tokyo, equipped with four 96 GB GPUs and 256 GB of RAM [31]. The network model was implemented using the PyTorch (v2.5.1) deep learning framework.
Regarding the training strategy, we employed the AdamW optimizer and adopted a poly learning rate schedule, setting the initial learning rate to 1 × 10^{-4}. Additionally, Random Horizontal Flipping and Random Gaussian Blur were applied as data augmentation techniques. From the training dataset’s 2604 image patches, 250 patches were randomly selected as the validation set. The batch size was set to 32, and the network was trained for 200 epochs. During training, the best model was updated whenever the mF1 (the average F1 score of the non-collapsed and collapsed classes) on the validation set reached a new high. An early stopping strategy was adopted: if the mF1 did not improve for 25 consecutive epochs after the last update, the best-performing model was retained. For fair comparisons in subsequent experiments, all runs were conducted with a fixed random seed of 2333. A summary of the training settings is provided in Table 2.
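For reference, an illustrative optimizer and schedule setup consistent with Table 2 is sketched below; the poly power of 0.9 and the placeholder model are assumptions made for the sketch, not values reported by the authors.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the PVTv2 + UGFM network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

max_epochs, power = 200, 0.9            # poly schedule: lr = lr0 * (1 - epoch/max)^power
poly = lambda epoch: (1.0 - epoch / max_epochs) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for epoch in range(max_epochs):
    # ... one training epoch, validation, mF1-based checkpointing, early stopping ...
    scheduler.step()
```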
In terms of evaluation metrics, to comprehensively assess the model’s performance, and given that this task prioritizes the exhaustive extraction of collapsed buildings, recall was adopted as the primary metric, while precision served as a supplementary indicator.
These evaluation parameters are defined as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
True Positive (TP): Pixels correctly predicted as belonging to the positive class (e.g., collapsed building areas).
False Positive (FP): Pixels incorrectly predicted as belonging to the positive class when they belong to the negative class (e.g., pixels predicted as collapsed building areas but are not).
False Negative (FN): Pixels incorrectly predicted as belonging to the negative class when they actually belong to the positive class (e.g., pixels within collapsed building areas that were not predicted as such).
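A small NumPy sketch of these pixel-level metrics follows; the class index used for the collapsed category is an assumption for illustration.

```python
import numpy as np

def pixel_precision_recall(pred: np.ndarray, gt: np.ndarray, cls: int = 2):
    """Pixel-level precision and recall for one class (e.g., cls = 2 for 'collapsed')."""
    tp = np.sum((pred == cls) & (gt == cls))   # correctly predicted positive pixels
    fp = np.sum((pred == cls) & (gt != cls))   # predicted positive, actually negative
    fn = np.sum((pred != cls) & (gt == cls))   # missed positive pixels
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return precision, recall
```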

2.7. Methods Comparison

To verify the effectiveness of our proposed network, we compare it against several common semantic segmentation methods—UNet [32], HRNetv2 [33], and DeepLabv3+ [34]—as well as approaches known to perform well in building extraction from RS images: A2FPN [35] and ABCNet [36]. During training, all these models followed the same training strategy as the proposed method.

3. Results

3.1. In-Domain Data Evaluation

After training, these models were tested on the Suzu area dataset to evaluate their performance on in-domain data.

Quantitative and Visual Comparison

Table 3 presents the overall quantitative evaluation results for the different methods. Compared with other methods, the proposed method achieves the best overall performance, demonstrating the highest recall for both non-collapsed and collapsed categories while maintaining relatively acceptable precision. Specifically, it outperforms the second-best method by 3.9% in recall for the non-collapsed category and 4.5% in recall for the collapsed category.
Figure 12 presents segmentation results for all methods, with key regions of interest highlighted by orange rectangles. It can be observed that the proposed method is more sensitive to collapsed areas, making it more likely to extract regions with less conspicuous damage, such as roof collapse or deformation without obvious structural failure (first and third scenarios) or areas where damage features are obscured by shadows (second scenario), where the other methods fail.

3.2. Out-of-Domain Data Evaluation

In the previous experiments, we used post-earthquake RS images from specific regions of the Noto Peninsula to train a model, which was then applied to generate distribution maps of non-collapsed and collapsed areas across the Noto Peninsula. However, for a practical emergency response, it is crucial to obtain damage information as quickly as possible. Collecting data and training a model after an earthquake is both time-consuming and impractical. Instead, pre-trained models must be directly employed in new areas, making model transferability a critical factor. Hence, in this section, we evaluate all the pre-trained models on the Mashiki dataset.

Quantitative and Visual Comparison

Table 4 presents the overall quantitative evaluation results for the different methods on the Mashiki dataset. Because the appearance of buildings in the out-of-domain data (e.g., lighting, shadows) differs from the feature distributions the models learned from the in-domain data, all pre-trained models experienced an inevitable performance drop when applied to this new, unseen dataset. Among them, ABCNet failed in both non-collapsed and collapsed area extraction. On the other hand, the proposed method achieved the highest recall for both non-collapsed and collapsed buildings while maintaining a top-ranked precision.
Figure 13 presents segmentation results for all methods, with key regions of interest highlighted by orange rectangles. In these three scenarios, the proposed method again achieves the most complete extraction of both the highlighted areas and their surrounding regions.

3.3. Ablation Study

To confirm the suitability of PVTv2 (Base) and evaluate the impact of UGFM, we conducted extensive experiments on two test datasets: Suzu and Mashiki. The baseline model employs PVTv2 as the encoder but excludes UGFM in the decoder, using conventional upsampling followed by concatenation for feature fusion. Based on this baseline, we assessed the effectiveness of incorporating UGFM into the decoder.
As shown in Table 5, in the Suzu test area, the baseline model already demonstrates strong performance, indicating the suitability of the transformer-based PVTv2 for this task. Incorporating UGFM further enhances building extraction, particularly for collapsed structures, by improving recall with a slight trade-off in precision. Importantly, these improvements become even more significant in the unseen Mashiki test area, with recall increasing by 5.8% for non-collapsed buildings and 7.8% for collapsed buildings, highlighting UGFM’s critical role in enhancing the model’s transferability.

3.4. Quantitative Analysis with Field Survey Data

In the previous sections, we validated that the proposed method can stably extract collapsed areas even in out-of-domain areas. However, this validation was based solely on visual confirmation from the perspective of aerial images. Considering that the ultimate goal of automated visual interpretation is to accelerate the DVC issuance process, the reliability of these predicted collapsed areas needs further evaluation. Specifically, we need to verify whether the predicted collapsed buildings correspond to the collapsed buildings identified in on-site field surveys. This section conducts a quantitative analysis of the predicted results using on-site field survey data from Mashiki Town.

3.4.1. On-Site Field Survey from Mashiki

After the Kumamoto earthquakes, an on-site EDC survey was conducted for Mashiki Town, covering 6037 buildings categorized as either non-major damage or major damage. Cho et al. [37] further subdivided major damage into Levels 1–5 by analyzing field photographs and assessing the severity of damage characteristics, while non-major damage was classified as Level 0. Table 6 shows the number of buildings at each level and the definition of Levels 1–5 with corresponding field images. Figure 14 illustrates the field survey area.

3.4.2. Evaluation of the Reliability of Predicted Collapsed Areas

To evaluate the reliability of the predicted collapsed areas, we applied the trained model to the field survey area and compared its predictions with the on-site EDC survey data for buildings that were successfully extracted. In total, 5782 building coordinates, recorded as point locations in the field survey, fell within the extracted regions (including both non-collapsed and collapsed areas), resulting in an overall extraction rate of 95.78%. The primary causes of missed detections were building point localization errors and incomplete extraction by the model.
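A simplified sketch of this matching step, which looks up the predicted class under each surveyed building coordinate, is shown below. The class indices, raster origin convention, and handling of the 0.2 m pixel size are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def sample_prediction_at_points(pred: np.ndarray, points_xy, origin_xy,
                                pixel_size: float = 0.2):
    """Look up the predicted class under each surveyed building coordinate.

    pred      : (H, W) array of predicted classes (assumed 0 = background,
                1 = non-collapsed, 2 = collapsed).
    points_xy : iterable of (x, y) map coordinates of surveyed buildings.
    origin_xy : (x, y) map coordinate of the raster's upper-left corner.
    """
    labels = []
    x0, y0 = origin_xy
    for x, y in points_xy:
        col = int((x - x0) / pixel_size)
        row = int((y0 - y) / pixel_size)       # rows increase downwards
        if 0 <= row < pred.shape[0] and 0 <= col < pred.shape[1]:
            labels.append(pred[row, col])
        else:
            labels.append(None)                # point outside the mapped area
    return labels
```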
To examine the correspondence between predicted collapsed buildings and the damage levels defined in the field survey data, we initially assumed that predicted collapsed buildings should align with Level 5, since Cho’s definition [37] classifies collapsed buildings as Level 5 within major damage. Under this assumption, buildings classified as Level 5 were designated as collapsed, while Levels 0–4 were classified as non-collapsed (Case 1).
From the visual comparison (see Figure 15), the predicted results and the field survey distribution appear highly consistent. However, analysis of the misclassified samples (Table 7) reveals two key issues:
  • Some buildings categorized as Levels 1–4 in the field survey exhibited sufficiently visible external damage in aerial images, leading the model to classify them as collapsed. Conversely, some Level 5 buildings appeared visually intact from the aerial perspective, causing the model to classify them as non-collapsed;
  • Some buildings exhibited partial damage. While the model successfully extracted the collapsed portions, these buildings were missing in the final output because their reference coordinates fell within the non-collapsed area of the building.
As a result of these issues, the quantitative evaluation in Case 1 reports a precision of only 62.5% (see Table 8).
Considering that buildings at Levels 1–4 can also show sufficiently visible external damage in aerial images and likewise belong to the major damage category, which is eligible for DVC issuance, we reclassified Levels 1–5 as collapsed and Level 0 as non-collapsed, in alignment with the ultimate objective of accelerating the DVC issuance process (Case 2).
From Figure 16, it can be observed that the predicted collapsed results are largely encompassed within field survey Levels 1–5. Under this classification, the precision increased to 87.2% (see Table 9). This adjustment suggests that the predicted collapsed areas effectively capture the buildings with major damage identified in the on-site survey results. Thus, under the Case 2 classification, the reliability of automated visual interpretation is validated, demonstrating that it can serve as a credible tool for the initial assessment of buildings with major damage.

4. Discussion

The proposed PVTv2-based model with the UGFM achieved competitive performance in both in-domain and out-of-domain evaluations. On the Suzu test set, the model obtained a recall of 78.7% for collapsed buildings, outperforming all compared methods while maintaining competitive precision (Table 3). On the Mashiki test set, the model achieved a recall of 66.0% for collapsed buildings, representing the highest recall values among all methods while maintaining top-tier precision (Table 4). Incorporating UGFM into the baseline PVTv2 encoder increased recall by 3.4% in Suzu and 7.8% in Mashiki, with only a marginal reduction in precision (Table 5).
To evaluate the reliability of the predicted results, we compared the model’s output with field survey data, achieving a precision of 87%. Compared with previous studies that also utilized field survey data from this area, Miura et al. [7] developed a CNN-based model to classify collapsed buildings and achieved a precision of 85%. Zhan et al. [12] improved the Mask R-CNN framework to automatically detect buildings and classify collapsed ones; during the validation phase using field survey data, they adopted a grid-based approach (with 57 m² per grid cell) and reported an accuracy of 91% in predicting collapsed ratios. Although our model’s precision is comparable, it is noteworthy that our assessment was conducted on an out-of-domain dataset, whereas the training data in the aforementioned studies originated from the same event as their test data. This demonstrates the strong generalization ability and reliability of our model’s predictions, further supporting its potential as a practical tool for directly issuing DVCs based on extracted collapsed regions.
In the process of designing a model for mapping collapsed structures from post-earthquake aerial imagery, we first considered selecting an appropriate baseline model. It is important to consider the specific characteristics of RS imagery—large-scale scenes and high information density. These attributes make multi-scale feature processing, particularly the integration of low-level fine-grained spatial details and high-level contextual information with large receptive fields, crucial for accurate segmentation. In the case of post-disaster building collapse mapping, an additional challenge arises from the scarcity and heterogeneity of collapsed samples. Their irregular shapes and limited representation in the training data introduce uncertainty, making it difficult for models to learn generalized features of collapsed structures without overfitting to a specific earthquake event. As shown in Table 3 and Table 4, the manner in which high- and low-level features are fused during multi-scale processing greatly influences performance. For instance, ABCNet only fuses high-level and low-level features at the final layer, and the high-level features used are extracted solely from the topmost layer of the contextual path. This lack of mid-level feature fusion may contribute to its lowest precision and recall on the Suzu dataset and complete failure (recall near 0) in identifying collapsed buildings when transferred to the Mashiki dataset. DeepLabv3+ also performs fusion only at the final output stage, similar to ABCNet. However, its parameter count is more than three times greater, which partially compensates for its limited fusion strategy, resulting in moderate performance for the collapsed category on the Suzu dataset (precision 60.7%, recall 67.5%). Nonetheless, the large parameter size appears to cause overfitting to the in-domain data, leading to a substantial drop in recall (26.1%) when transferred to Mashiki. In contrast, A2FPN and UNet adopt a progressive feature fusion strategy in the decoder during upsampling, making full use of features across all scales. This enables stronger performance overall. Notably, for collapsed building extraction on Suzu, UNet achieves both higher precision and recall than A2FPN, largely due to having 10 times more parameters. However, this large parameter size also limits its transferability: recall dropped to 43.5% when transferred to Mashiki, compared with A2FPN’s 48.9%. HRNetv2 further excels by performing multi-scale fusion of low- and high-level features at every layer during the forward pass, yielding the highest precision (69.6%) for collapsed buildings in Suzu and recall on par with UNet, despite having half the parameters. When transferred to Mashiki, HRNetv2's precision was 3.8% lower than UNet's, but recall was 10.2% higher. Similarly, the transformer-based architecture of PVT allows interaction between local and global information at each encoder stage, and its pyramid-like multi-scale outputs enable progressive fusion in the decoder, akin to A2FPN and UNet. This combined multi-scale fusion design allows the PVTv2-based baseline to achieve the highest precision and second-highest recall across both datasets, while maintaining a relatively small parameter size, thus avoiding the overfitting issues seen in larger models.
After determining the baseline model, we addressed the issue of uncertainty, which is primarily caused by the scarcity of training data, particularly post-earthquake aerial imagery. This uncertainty often leads to missed extractions or false positives, and its impact becomes especially pronounced when the model is applied to unseen datasets, resulting in significant performance degradation (Compare Table 3 with Table 4). Since uncertainty is closely associated with sample quantity—that is, the fewer the samples, the greater the model’s uncertainty in recognizing them—we introduced the UGFM. Within UGFM, uncertainty is quantified using entropy and leveraged during multi-scale feature fusion. Regions with higher uncertainty are given greater weight, guiding the network to focus more on these challenging areas. Importantly, this mechanism is applied dynamically at inference without introducing additional parameters, avoiding the transferability limitations associated with larger models. As a result, the model becomes more sensitive to underrepresented samples, namely, collapsed buildings, thereby improving its ability to extract these structures. On the in-domain Suzu dataset, the improvement over the baseline was modest (recall +3.4%, precision −2.0% for the collapsed category). In contrast, on the out-of-domain Mashiki dataset, recall improved by 7.8% with only a 1.2% decrease in precision (Table 5). This recall-focused improvement is intentional for post-disaster applications, where minimizing missed detections of collapsed structures is a higher priority than maximizing precision.
To enable the predicted collapsed areas to serve as direct evidence for issuing DVCs, it is essential to establish a clear correspondence between the buildings identified as “collapsed” by the model and the major damage levels recorded in the field survey data. Initially, we assumed that the predicted collapsed buildings corresponded to Level 5 major damage in the field survey classification. However, under this assumption, the resulting precision was only 62.5%. This discrepancy may stem from the fact that Level 5 assessments in field surveys are based on a comprehensive evaluation of structural elements such as the walls, beams, and foundations. In contrast, the model’s predictions rely solely on roof conditions, due to the vertical viewpoint limitations inherent in aerial imagery. Considering that buildings classified as Levels 1–5 are all designated as “major damage” and are eligible for direct DVC issuance, we reclassified Levels 1–5 as corresponding to predicted collapsed buildings. Under this updated assumption (Case 2), the precision increased significantly to 87.2%, demonstrating that the predicted collapsed regions can serve as a reliable basis for DVC issuance.
On the other hand, although the introduction of UGFM enhances the model’s sensitivity to collapsed structures, it still fails to accurately extract small collapsed buildings or those obscured by complex surroundings (see Figure 17), similar to other benchmark models. This limitation may be attributed to the insufficient representation of fine-scale features in complex environments. Enhancing the model’s sensitivity to small objects could potentially mitigate this issue to some extent.
Finally, there remain several aspects that require further investigation to validate the generalizability of the proposed method and potentially enhance its performance:
  • Application to Buildings with Different Structural Materials: Although this study validated the model’s transferability by applying it to post-earthquake imagery from a different event (i.e., the 2016 Kumamoto earthquake), which differs from the training dataset (Noto Peninsula earthquake) in terms of lighting conditions, location, and season, both datasets predominantly feature wooden buildings. Therefore, the model’s detection targets in this study were largely limited to wooden structures. Whether this method can be effectively transferred to buildings constructed with other materials—which may exhibit different collapse patterns—remains to be verified;
  • Stage-Specific Uncertainty Weighting in UGFM: In this study, the amplification range of uncertainty within UGFM was kept identical across all decoding stages, implicitly assuming that uncertainty at different depths is equally important. However, in practice, uncertainty at deeper stages may play a more critical role, or conversely, earlier stages may require stronger guidance. Future work could investigate progressively increasing or decreasing stage-specific uncertainty amplification factors, which may further refine the feature fusion process and improve overall performance;
  • In real-world post-earthquake scenarios, there is often no time to manually annotate new datasets, making it infeasible to create labeled data for each event. As a result, models must rely on training from previously available data and be directly applied to unseen scenarios. However, due to the scarcity of post-earthquake RS data, such models are prone to overfitting to the training domain, which limits their generalizability. Although this study introduced the UGFM to enhance model transferability, the approach still fundamentally relies on previously labeled data. Semi-supervised learning techniques could be incorporated, for example, by having the model automatically generate pseudo-labels during inference, which may further improve its adaptability to new disasters without requiring manual annotations.

5. Conclusions

In this study, our objective was to develop a method capable of extracting collapsed buildings from post-earthquake aerial imagery to expedite initial damage assessments and thereby accelerate the issuance of DVCs. After analyzing the visual characteristics of collapsed structures in aerial images and considering the sample distribution imbalance, we adopted PVTv2 as the backbone network and introduced the UGFM to construct the proposed model. The integration of UGFM enhanced the model’s ability to extract collapsed buildings, thereby improving its generalization performance. As a result, the proposed model demonstrated robust performance in both in-domain (Suzu dataset) and out-of-domain (Mashiki dataset) scenarios, achieving recall rates of 78.7% and 66%, respectively, with considerable precision in both cases.
To assess the reliability of the model’s predictions, we compared the results with the damage assessment field survey data from Mashiki. Although differences in observation perspectives (the vertical viewpoint of aerial imagery versus the ground-level inspection of field surveys) prevent a direct mapping between the predicted collapsed buildings and the detailed sub-levels of major damage, a clear correlation exists between the predicted collapsed buildings and the major damage category. When we directly interpreted the predicted collapsed buildings as major damage structures, the model achieved a precision of 87.2%, validating the reliability of its predictions. This result highlights the potential of the proposed method to serve as a rapid, automated tool for initial post-earthquake damage assessments, ultimately facilitating faster issuance of DVCs.

Author Contributions

Conceptualization, H.L. and M.M.; methodology, H.L., H.O. and M.M.; formal analysis, H.L.; investigation, H.L., H.O. and M.M.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.O. and M.M.; visualization, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI), grant number 23K26348.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Aerial photographs used in this study were taken by the Geospatial Information Authority of Japan (GSI). The building damage assessment survey data used in this study were provided by the town of Mashiki, Kumamoto Prefecture, as part of the Tokyo Metropolitan Resilience Project of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of the Japanese Government, the National Research Institute for Earth Science and Disaster Resilience (NIED), and Niigata University.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
EDC: Earthquake Damage Certification
DVC: Disaster Victim Certificate
UGFM: Uncertainty-Guided Fusion Module
PVT: Pyramid Vision Transformer
MMS: Moment Magnitude Scale
CNN: Convolutional Neural Network
ViT: Vision Transformer
SRA: Spatial Reduction Attention
CE: Cross-Entropy
UM: Uncertainty Map
RS: Remote Sensing
GSI: Geospatial Information Authority of Japan

References

  1. Japan Meteorological Agency. Major Damaging Earthquakes near Japan (Since 1996). Available online: https://www.data.jma.go.jp/eqev/data/higai/higai1996-new.html (accessed on 20 October 2024). (In Japanese)
  2. Cabinet Office of Japan Disaster Management in Japan. Guideline for the Application of the Standards for Recognition of Damage Caused by Disasters to Houses. 2024. Available online: https://www.bousai.go.jp/taisaku/unyou.html (accessed on 24 October 2024). (In Japanese)
  3. The Japan Broadcasting Corporation. Issuance of Disaster Victim Certificates in Ishikawa. Available online: https://www3.nhk.or.jp/news/html/20240202/k10014345371000.html (accessed on 24 October 2024). (In Japanese)
  4. Matsuoka, M.; Yamazaki, F. Interferometric Characterization of Areas Damaged by the 1995 Kobe Earthquake Using Satellite SAR Images. In Proceedings of the 12th World Conference on Earthquake Engineering, Auckland, New Zealand, 30 January–4 February 2000; Volume 2. [Google Scholar]
  5. Naito, S.; Hao, K.X.; Senna, S.; Saeki, T.; Nakamura, H.; Fujiwara, H.; Azuma, T. Investigation of Damages in Immediate Vicinity of Co-Seismic Faults during the 2016 Kumamoto Earthquake. J. Disaster Res. 2017, 12, 899–915. [Google Scholar] [CrossRef]
  6. Vu, T.T.; Matsuoka, M.; Yamazaki, F. Detection and Animation of Damage Using Very High-Resolution Satellite Data Following the 2003 Bam, Iran Earthquake. Earthq. Spectra 2005, 21 (Suppl. 1), 319–327. [Google Scholar] [CrossRef]
  7. Miura, H.; Aridome, T.; Matsuoka, M. Deep Learning-Based Identification of Collapsed, Non-Collapsed and Blue Tarp-Covered Buildings from Post-Disaster Aerial Images. Remote Sens. 2020, 12, 1924. [Google Scholar] [CrossRef]
  8. Wiguna, S.; Adriano, B.; Mas, E.; Koshimura, S. Evaluation of Deep Learning Models for Building Damage Mapping in Emergency Response Settings. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5651–5667. [Google Scholar] [CrossRef]
  9. Adriano, B.; Yokoya, N.; Xia, J.; Miura, H.; Liu, W.; Matsuoka, M.; Koshimura, S. Learning from Multimodal and Multitemporal Earth Observation Data for Building Damage Mapping. ISPRS J. Photogramm. Remote Sens. 2021, 175, 132–143. [Google Scholar] [CrossRef]
  10. Kuo, W.-N.; Lin, S.-Y. Multimodal Models for Assessing Earthquake-Induced Building Damage Using Metadata and Satellite Imagery. J. Build. Eng. 2025, 111, 113467. [Google Scholar] [CrossRef]
  11. Han, D.; Yang, G.; Xie, R.; Lu, W.; Huang, M.; Liu, S. A Multilevel Damage Assessment Framework for Mixed-Hazard Buildings with Global Spatial Feature Guidance Module and Change Feature Attention in VHR Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  12. Zhan, Y.; Liu, W.; Maruyama, Y. Damaged Building Extraction Using Modified Mask R-CNN Model Using Post-Event Aerial Images of the 2016 Kumamoto Earthquake. Remote Sens. 2022, 14, 1002. [Google Scholar] [CrossRef]
  13. Yu, K.; Wang, S.; Wang, Y.; Gu, Z.; Wang, Y. DBA-RTMDet: A High-Precision and Real-Time Instance Segmentation Method for Identification of Damaged Buildings in Postearthquake UAV Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19577–19593. [Google Scholar] [CrossRef]
  14. Naito, S.; Tomozawa, H.; Mori, Y.; Monma, N.; Nakamura, H.; Fujiwara, H. Development of the Deep Learning-Based Damage Detection Model for Buildings Utilizing Aerial Photographs of Multiple Earthquakes. J. Jpn. Assoc. Earthq. Eng. 2021, 21, 3_72–3_118. [Google Scholar] [CrossRef]
  15. Xie, Y.; Feng, D.; Chen, H.; Liu, Z.; Mao, W.; Zhu, J.; Hu, Y.; Baik, S.W. Damaged Building Detection from Post-Earthquake Remote Sensing Imagery Considering Heterogeneity Characteristics. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  16. Liu, J.; Luo, Y.; Chen, S.; Wu, J.; Wang, Y. BDHE-Net: A Novel Building Damage Heterogeneity Enhancement Network for Accurate and Efficient Post-Earthquake Assessment Using Aerial and Remote Sensing Data. Appl. Sci. 2024, 14, 3964. [Google Scholar] [CrossRef]
  17. Cote, M.; Saeedi, P. Automatic Rooftop Extraction in Nadir Aerial Imagery of Suburban Regions Using Corners and Variational Level Set Evolution. IEEE Trans. Geosci. Remote Sens. 2012, 51, 313–328. [Google Scholar] [CrossRef]
  18. Awrangjeb, M.; Zhang, C.; Fraser, C.S. Improved Building Detection Using Texture Information. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2013, 38, 143–148. [Google Scholar] [CrossRef]
  19. Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5580–5590. [Google Scholar] [CrossRef]
  20. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11–17. [Google Scholar] [CrossRef]
  21. The Report of the 2024 Noto Earthquake. Available online: https://www.jma.go.jp/jma/press/2401/01c/202401012130.html (accessed on 24 October 2024). (In Japanese)
  22. Quiet+. Seismic Intensity Distribution Map. Available online: https://app.quietplus.kke.co.jp/quakes?mode=quick (accessed on 24 October 2024). (In Japanese).
  23. Cabinet Office of Japan Disaster Management in Japan. Damage from the 2024 Noto Peninsula Earthquake. Available online: https://www.bousai.go.jp/updates/r60101notojishin/r60101notojishin/index.html (accessed on 24 October 2024). (In Japanese)
  24. Cabinet Office of Japan Disaster Management in Japan. The Report of the 2016 Kumamoto Earthquakes. Available online: https://www.bousai.go.jp/updates/h280414jishin/index.html (accessed on 24 October 2024). (In Japanese)
  25. Geospatial Information Authority of Japan. Available online: https://www.gsi.go.jp/top.html (accessed on 24 October 2024).
  26. QGIS Development Team. Available online: https://qgis.org (accessed on 24 October 2024).
  27. Statistics Bureau of Japan. Type of House, Construction Method and Structure. Available online: https://www.stat.go.jp/data/jyutaku/2008/nihon/2_1.html (accessed on 24 October 2024).
  28. Okada, S.; Takai, N. Classifications of Structural Types and Damage Patterns of Buildings for Earthquake Field Investigation. J. Struct. Constr. Eng. 1999, 64, 65–72. [Google Scholar] [CrossRef]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  30. Li, J.; He, W.; Cao, W.; Zhang, L.; Zhang, H. UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  31. Tsubame. Supercomputer of Science Tokyo. Available online: https://www.titech.ac.jp/english/news/2024/069452 (accessed on 24 October 2024).
  32. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  33. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
  34. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  35. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  36. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  37. Cho, S.; Xiu, H.; Matsuoka, M. Backscattering Characteristics of SAR Images in Damaged Buildings Due to the 2016 Kumamoto Earthquake. Remote Sens. 2023, 15, 2181. [Google Scholar] [CrossRef]
Figure 1. Representation of Damaged Buildings Under Different Conditions. (a–c): Noto earthquake; (d): Kumamoto earthquake. (a,d): Earthquake damage under different lighting conditions. (b–d): Damage caused by secondary disasters (fire and tsunami).
Figure 2. The Epicenter and Seismic Intensity Distribution of the Noto Peninsula Earthquake [17]. The blue area in the inset map delineates the spatial coverage of the main map.
Figure 3. The Location of Selected Areas. Blue Frames: Training Areas, Red Frame: Test Area. The blue area in the inset map delineates the spatial coverage of the main map.
Figure 4. Annotated Buildings in the Wajima Area. (Black: Background, Green: Non-Collapsed, Red: Collapsed).
Figure 5. The Ratio of Non-wooden Buildings per Region in Japan [27].
Figure 6. The Wooden Building Damage Pattern in Okada and Takai [28]. (Wooden buildings damaged at the D5 level are shown in the red box).
Figure 7. The Representation of Collapsed Buildings under Different Disasters.
Figure 8. Structure of the Proposed Network.
Figure 9. Information Interaction in Different Structures. Red frame: local information; Green frame: global information.
Figure 10. Structure of UGFM.
Figure 11. Uncertainty Quantification by Entropy. x_3 = 1 - x_1 - x_2.
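For readers who want to reproduce the entropy map sketched in Figure 11, the following minimal example (an illustration only, not the authors' implementation; the function name and tensor shapes are assumptions) computes a per-pixel entropy from the three-class softmax probabilities x_1, x_2, x_3. The entropy is high where the prediction is uncertain, and such a map is the kind of signal that can guide feature fusion in the UGFM.

```python
# Minimal sketch: per-pixel entropy over three-class softmax probabilities
# (background, non-collapsed, collapsed), where x_3 = 1 - x_1 - x_2.
import torch
import torch.nn.functional as F

def pixelwise_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: (B, 3, H, W) -> entropy map (B, H, W); high values mark uncertain pixels."""
    probs = F.softmax(logits, dim=1)                      # x_1, x_2, x_3 per pixel, summing to 1
    return -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # Shannon entropy over the class axis

logits = torch.randn(1, 3, 64, 64)        # dummy decoder-stage output
uncertainty = pixelwise_entropy(logits)   # (1, 64, 64) map usable to weight feature fusion
```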
Figure 12. Visual Comparison on the Test Area (Suzu). (a1–a3) Image. (b1–b3) GT. (c1–c3) UNet. (d1–d3) HRNetv2. (e1–e3) DeepLabv3+. (f1–f3) A2FPN. (g1–g3) ABCNet. (h1–h3) Proposed. (Black: Background, Green: Non-Collapsed, Red: Collapsed).
Figure 13. Visual Comparison on the Test Area (Mashiki). GT denotes Ground Truth. (a4–a6) Image. (b4–b6) GT. (c4–c6) UNet. (d4–d6) HRNetv2. (e4–e6) DeepLabv3+. (f4–f6) A2FPN. (g4–g6) ABCNet. (h4–h6) Proposed. (Black: Background, Green: Non-Collapsed, Red: Collapsed).
Figure 14. Field Survey Area in Mashiki Town. The blue area in the inset map delineates the spatial coverage of the main map.
Figure 15. Visual Comparison between Field Survey (top) and Prediction (bottom) (Case 1). The blue area in the inset map delineates the spatial coverage of the main map.
Figure 16. Visual Comparison between Field Survey (top) and Prediction (bottom) (Case 2).
Figure 17. Missing Extraction Example. (a) Image. (b) GT. (c) Proposed.
Table 1. Data Summary.

| Area | Patch Number (No Augmentation) | Non-Collapsed Area ¹ (m²) | Collapsed Area ¹ (m²) |
|---|---|---|---|
| Wajima, Machinomachi, Ukai | 2604 | 1,024,038 | 244,109 |
| Suzu | 228 | 143,663 | 15,226 |
| Mashiki | 56 | 43,254 | 33,502 |

Augmentation strategies: 1. Random Horizontal Flip; 2. Random Gaussian Blur (random probability: 0.5).
¹ The area is calculated by 0.2 m/pixel × 0.2 m/pixel = 0.04 m²/pixel.
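The footnote's pixel-to-area conversion can be written out explicitly. The snippet below is a minimal sketch under the stated 0.2 m/pixel resolution; the function name and the example pixel count are illustrative only.

```python
# Minimal sketch of the Table 1 footnote: at 0.2 m/pixel, one pixel covers
# 0.2 m x 0.2 m = 0.04 m^2, so annotated areas follow directly from pixel counts.
GSD_M = 0.2                        # ground sampling distance (m/pixel)
PIXEL_AREA_M2 = GSD_M * GSD_M      # 0.04 m^2 per pixel

def pixels_to_area_m2(pixel_count: int) -> float:
    """Convert an annotated pixel count to area in square metres."""
    return pixel_count * PIXEL_AREA_M2

# Illustrative check: roughly 380,650 collapsed pixels correspond to the
# 15,226 m^2 reported for Suzu in Table 1.
print(pixels_to_area_m2(380_650))  # ~15226 m^2
```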
Table 2. Training Setting.

| Setting | Value |
|---|---|
| Number of Images in the Training Dataset | 2604 (Training: 2354, Validation: 250) |
| Data Augmentation | Random Horizontal Flipping, Gaussian Blur |
| Number of Images in the Test Dataset | 228 (In-Domain), 56 (Out-of-Domain) |
| Framework | PyTorch |
| GPU | NVIDIA H100 (NVIDIA Corp., Santa Clara, CA, USA) |
| Batch Size | 32 |
| Initial Learning Rate | 1 × 10^-4 |
| Training Strategy | AdamW Optimizer and Poly Learning Rate Schedule |
| Random Seed | 2333 |
| Epochs | 200 |
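As a companion to Table 2, the sketch below shows one way (a minimal, hypothetical setup, not the authors' released code) to wire up the listed AdamW optimizer, 1 × 10^-4 initial learning rate, and poly learning-rate schedule in PyTorch; the poly power of 0.9, the placeholder model, and per-epoch scheduling are assumptions.

```python
# Minimal sketch of the Table 2 training setup: AdamW, initial LR 1e-4,
# poly learning-rate decay over 200 epochs (power 0.9 assumed).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 3, 3)   # placeholder for the actual segmentation network
epochs = 200
optimizer = AdamW(model.parameters(), lr=1e-4)

# Poly schedule: lr = initial_lr * (1 - epoch / max_epochs) ** power
poly_power = 0.9                    # assumed; not reported in the paper
scheduler = LambdaLR(optimizer, lr_lambda=lambda e: (1 - e / epochs) ** poly_power)

for epoch in range(epochs):
    # ... one training pass over the 2354 training patches would go here ...
    scheduler.step()                # decay the learning rate once per epoch
```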
Table 3. Performance Comparison with Other Methods (Suzu).

| Networks | Non-Collapsed Precision | Non-Collapsed Recall | Collapsed Precision | Collapsed Recall | Parameters (M) |
|---|---|---|---|---|---|
| UNet | 0.898 | 0.874 | 0.633 | 0.742 | 147.81 |
| HRNetv2 | 0.912 | 0.883 | 0.696 | 0.740 | 65.76 |
| DeepLabv3+ | 0.875 | 0.841 | 0.607 | 0.675 | 42.40 |
| A2FPN | 0.834 | 0.863 | 0.619 | 0.612 | 12.16 |
| ABCNet | 0.764 | 0.762 | 0.615 | 0.240 | 13.85 |
| Proposed | 0.864 | 0.922 | 0.676 | 0.787 | 29.15 |

The best score is bolded, and the second score is underlined.
Table 4. Performance Comparison with Other Methods (Mashiki).

| Networks | Non-Collapsed Precision | Non-Collapsed Recall | Collapsed Precision | Collapsed Recall |
|---|---|---|---|---|
| UNet | 0.638 | 0.772 | 0.647 | 0.435 |
| HRNetv2 | 0.757 | 0.668 | 0.609 | 0.537 |
| DeepLabv3+ | 0.568 | 0.751 | 0.612 | 0.261 |
| A2FPN | 0.606 | 0.745 | 0.593 | 0.489 |
| ABCNet | 0.308 | 0.110 | 0.926 | 0.000 |
| Proposed | 0.714 | 0.861 | 0.774 | 0.660 |

The best score is bolded, and the second score is underlined.
Table 5. Ablation Results.

| Test Data | Networks | Non-Collapsed Precision | Non-Collapsed Recall | Collapsed Precision | Collapsed Recall |
|---|---|---|---|---|---|
| Suzu | Base | 0.906 | 0.887 | 0.696 | 0.753 |
| Suzu | Base + UGFM | 0.864 | 0.922 | 0.676 | 0.787 |
| Mashiki | Base | 0.742 | 0.803 | 0.786 | 0.582 |
| Mashiki | Base + UGFM | 0.714 | 0.861 | 0.774 | 0.660 |

The best score is bolded.
Table 6. Definition and Amount of Each Damage Level for Buildings [30]. (The corresponding example images are not reproduced here.)

| Damage Class | Level | Characteristics of the Damage | Number |
|---|---|---|---|
| Major Damage | Level 5 | Collapsed | 399 |
| Major Damage | Level 4 | Large interlayer deformation (not collapsed) | 132 |
| Major Damage | Level 3 | Large distortion or large inclination | 192 |
| Major Damage | Level 2 | Damage to roof and walls (including the foundations) | 392 |
| Major Damage | Level 1 | Damage to walls (including the foundations) | 354 |
| Under Major Damage | Level 0 | Under 50% of economic damage to the building | 4568 |
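To make the grouping used later in Tables 8 and 9 explicit, the sketch below (illustrative only; the function name is hypothetical) maps a field-survey damage level from Table 6 to the binary reference class for each evaluation case: Case 1 treats only Level 5 as Collapsed, whereas Case 2 treats Levels 1–5 as Collapsed.

```python
# Minimal sketch of the two evaluation groupings used in Tables 8 and 9.
def reference_label(level: int, case: int) -> str:
    """Map a field-survey damage level (0-5) to the binary reference class."""
    if case == 1:
        return "Collapsed" if level == 5 else "Non-Collapsed"  # Case 1: Level 5 vs. Level 0-4
    return "Collapsed" if level >= 1 else "Non-Collapsed"       # Case 2: Level 1-5 vs. Level 0

print(reference_label(5, case=1))   # Collapsed
print(reference_label(3, case=1))   # Non-Collapsed
print(reference_label(3, case=2))   # Collapsed
```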
Table 7. Missed Extraction Samples. (Example image patches, not reproduced here, are arranged by field-survey damage level (Level 5, Level 4, Level 1–3) against the model prediction (Collapsed, Non-Collapsed) and cases with building point location problems.)
Table 8. Quantitative Analysis Result (Case 1).

| Model Prediction | Field Survey: Level 0–4 | Field Survey: Level 5 | Total | Precision |
|---|---|---|---|---|
| Non-Collapsed | 5248 | 126 | 5374 | 0.976 |
| Collapsed | 153 | 255 | 408 | 0.625 |
| Total | 5401 | 381 | | |
| Recall | 0.972 | 0.669 | | |
Table 9. Quantitative Analysis Result (Case 2).

| Model Prediction | Field Survey: Level 0 | Field Survey: Level 1–5 | Total | Precision |
|---|---|---|---|---|
| Non-Collapsed | 4321 | 1053 | 5374 | 0.804 |
| Collapsed | 52 | 356 | 408 | 0.872 |
| Total | 4374 | 1409 | | |
| Recall | 0.988 | 0.253 | | |
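The precision and recall values in Tables 8 and 9 follow directly from the confusion-matrix counts; the short sketch below (a minimal check with a hypothetical helper function) reproduces the Case 1 figures for the collapsed class.

```python
# Minimal sketch verifying the collapsed-class precision/recall in Table 8 (Case 1).
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Case 1: collapsed predictions vs. field-survey Level 5 buildings.
tp, fp, fn = 255, 153, 126
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.3f}, recall={recall:.3f}")   # precision=0.625, recall=0.669
```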
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.