1. Introduction
The earthquake sequence that struck southeastern Turkey on 6 February 2023, with magnitudes of 7.8 and 7.6, represents one of the worst seismic catastrophes in recent history, resulting in widespread structural destruction, significant loss of life, and economic losses of over $100 billion across 11 Turkish provinces. In the immediate aftermath of such events, damage states in buildings must be assessed quickly and accurately to coordinate emergency response effectively. Structural experts and disaster assessment teams typically perform this assessment by manual inspection: experts physically access damage sites under hazardous conditions, record damage manifestations photographically, and categorize structural conditions according to various damage scales. The devastation caused by the Kahramanmaraş earthquakes, which affected over 230,000 buildings with varying degrees of structural damage, has revealed critical limitations of manual inspection, especially in inspection speed, safety risks for experts, inter-expert reliability, and the processing of extensive visual evidence collected from damage sites.
Manual inspection, although considered the gold standard for structural damage assessment, is limited in several ways in emergency response situations. Experts must physically visit damage sites and assess damage conditions quickly in unstable environments, which is physically and mentally demanding. These limitations call for automated solutions that can process the extensive visual evidence collected from damage sites.
Recent advances in computer vision and deep learning have shown significant promise for improving on traditional human-driven building damage assessment. For instance, convolutional neural network architectures can learn hierarchical features from raw image data, and transfer learning has significantly reduced the need for large-scale data in specialized tasks. However, applying these techniques to post-earthquake building damage assessment raises several specific challenges, including class imbalance that reflects the naturally occurring distribution of damage severity, subtle visual differences between adjacent damage classes, and the need for interpretability in safety-critical decision-making.
To overcome these challenges, this research presents a comprehensive framework for automatic building damage classification that combines ensemble deep learning with explainable artificial intelligence techniques. The framework is developed and validated on a large-scale dataset collected after the Kahramanmaraş earthquakes of 6 February 2023. It combines six convolutional neural network architectures (VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB0, and MobileNetV2) through a weighted voting scheme that accounts for class imbalance when weighting model performance. The dataset consists of 13,270 high-resolution images containing 24,289 manually annotated regions across 15 damage classes, which, to our knowledge, is the largest ground-based building damage dataset assembled from recent seismic events.
This paper makes the following unique contributions at the intersection of structural engineering and computer vision:
Dataset Contribution: We present the largest documented dataset of ground-based images of building damage resulting from the 2023 Kahramanmaraş earthquakes, which will be valuable for the development of future algorithms.
Architectural Innovation: We show that architectural diversity through the use of ensemble learning significantly outperforms single-model approaches in fine-grained damage classification, with a particular emphasis on minority-class detection, which is crucial in safety assessment.
Integrated Interpretability Framework: We integrate explainable artificial intelligence into the deep learning framework by applying Grad-CAM to the trained CNN models. Rather than treating explainability as an independent component, Grad-CAM uses the internal feature activations and gradients of the convolutional networks to highlight image regions that contribute to the damage classification decision. This integrated interpretation mechanism supports transparency and enables structural engineers to verify whether the model focuses on damage-relevant regions.
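The way Grad-CAM combines feature activations and gradients can be illustrated with a minimal NumPy sketch. It assumes the activations of one convolutional layer and the gradients of the class score with respect to them have already been extracted from a trained network; the function name `grad_cam` and the toy arrays are hypothetical and are not taken from the implementation used in this study.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heatmap for one convolutional layer.

    activations: feature maps of the layer, shape (H, W, K)
    gradients:   gradient of the class score w.r.t. the activations, same shape
    """
    # Channel weights: global-average-pool the gradients over the spatial axes.
    alphas = gradients.mean(axis=(0, 1))  # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((activations * alphas).sum(axis=-1), 0.0)
    # Scale to [0, 1] for overlaying on the image; guard against an all-zero map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example with 2 channels on a 2 x 2 grid (illustrative values only).
acts = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.0], [0.0, 3.0]]])
grads = np.ones_like(acts) * np.array([1.0, -1.0])
heatmap = grad_cam(acts, grads)
```

In this toy case the first channel supports the class and the second opposes it, so only locations dominated by the first channel remain highlighted after the ReLU.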
The remainder of the paper is organized as follows. Section 2 reviews progress and challenges in deep learning-based structural damage assessment. Section 3 describes the methodology, including the dataset characteristics, preprocessing steps, the proposed ensemble deep learning framework, the training processes, and the evaluation metrics used in the experiments. The subsequent sections present the experimental results, comparing the performance of the individual models with that of the ensemble framework; discuss the interpretability analysis, error analysis, and limitations of the proposed framework; and conclude the paper with implications for post-disaster assessment and directions for future work.
2. Progress and Challenges in Deep Learning-Based Structural Damage Assessment
Recent studies on structural damage assessment have made considerable progress in applying deep learning, computer vision, and multimodal sensing to automate post-disaster inspection. These studies have improved the detection of cracks, spalling, collapse patterns, and other visible damage indicators in buildings and infrastructure. However, post-disaster structural damage assessment still presents several unresolved challenges, including severe class imbalance, limited availability of labeled disaster-specific datasets, visually subtle differences between adjacent damage levels, reduced robustness under variable field conditions, and insufficient interpretability for engineering decision-making. This section reviews the progress of deep learning-based structural damage assessment methods and highlights the methodological gaps that motivate the proposed explainable ensemble CNN framework.
The use of artificial intelligence in infrastructure assessment has grown considerably in fields ranging from forest fire monitoring to structural integrity assessment, paving the way for intelligent inspection systems [1]. In structural engineering, deep learning for structural health monitoring has moved beyond earlier image processing approaches, and neural networks are now used to detect minute changes in structures [2]. Early deep learning studies in this field focused on convolutional neural networks for the binary classification of cracks and established baselines for the automated inspection of concrete surfaces. The study in [3] comprehensively compared convolutional neural networks with conventional image processing for binary crack classification on bridges, walls, and concrete surfaces, and found deep learning superior for automated inspection. Building on these baselines, the authors of [4] proposed BCCD-YOLO for inspecting cracks in building construction. It is a customized version of YOLO that incorporates a PAN with lateral skips and the EC2f/SAC2f attention module, and it is particularly effective at detecting small objects. The study in [5] optimized YOLOv11 for building crack detection using the C3K2-SG and FPSConv modules. The study in [6] proposed HL-YOLO for crack segmentation on building surfaces, incorporating dilation-wise residual and attention modules.
Beyond conventional RGB image processing, researchers have explored multimodal and multi-scale approaches to make damage detection more robust under varying environmental conditions. Study [7] proposes a hybrid approach that uses YOLOv8, Swin Transformers, and a CNN-Transformer for the damage assessment of prestressed concrete beams, combining visual crack information with vibration analysis of the structure to assess damage more comprehensively. For the intelligent recognition of defects in glazed parts of ancient buildings, Study [8] uses a binocular vision system and deep learning with a spatial–temporal fusion approach to accurately localize cracks, spalling, discoloration, and deformation, addressing structural integrity issues specific to ancient buildings. For the maintenance of renewable energy structures such as wind turbines, Study [9] proposes MSWindD-YOLO, which uses the GhostNetV2 and FasterNet architectures with multi-scale feature fusion and dynamic snake convolution to identify cracks, spalling, and lightning strike damage on wind turbine blades, and which addresses the computational constraints of UAV-based inspection on the NVIDIA Jetson platform.
Post-disaster assessment poses specific computational and methodological problems that call for solutions able to cope with harsh environmental conditions and limited information. Paper [10] maps collapsed buildings from post-earthquake aerial imagery: HRNet is used together with OCRNet for semantic segmentation to distinguish collapsed from non-collapsed buildings in images acquired after the 2016 Kumamoto earthquake in Japan. Paper [11] proposes BayeSiamMTL, a Bayesian deep learning framework that performs building footprint detection and damage level assessment simultaneously from satellite imagery. It combines multitask learning with aleatoric and epistemic uncertainty quantification to produce robust confidence values, which are critical for decision support in post-disaster response. Paper [12] addresses the data scarcity characteristic of post-earthquake scenarios by proposing tunable semi-synthetic image generation that overlays photorealistic damage onto real backgrounds, improving model generalization for crack, spalling, and rebar detection. Paper [13] combines YOLOv11 with rule-based engineering logic to support rapid post-earthquake response for reinforced concrete buildings, integrating machine learning with heuristic methods. Paper [14] investigates large language models for post-earthquake structural damage assessment, using multimodal visual–textual fusion and few-shot prompting to produce natural language reasoning and explanations for damage evaluation.
Ensemble methods and fusion techniques are recognized as important for ensuring detection reliability and managing the uncertainties inherent in damage assessment in complex environments. The research in [15] combines DeepLabV3+ and deep ensembles with Optuna-based optimization for surface damage detection across different geographical environments. Similarly, the research in [16] applies Dempster–Shafer evidence theory to post-flood damage detection on high-resolution SpaceNet 8 satellite images and reports improved precision. The study in [17] develops an image–point cloud fusion method for bridge damage detection and assessment, using bidirectional projection between two-dimensional images and three-dimensional point cloud data to detect and quantify damage in a digital twin environment. The study in [18] develops an end-to-end deep learning approach that combines semantic segmentation and three-dimensional reconstruction with automated drone trajectory planning and high-resolution multi-view image analysis for structural damage detection and assessment, reporting high accuracy and demonstrating feasibility for real-world applications.
In addition, architectural innovations such as transformers, state space models, and attention mechanisms have shown significant potential for damage analysis that requires precise boundary delineation. MCMamba is introduced in [19] as a multi-scale convolution-enhanced Mamba model that incorporates visual state space blocks for crack detection in high-resolution bridge images, using state space modeling to precisely quantify slender cracks and their boundaries. In [20], HRCRSN is introduced as a high-resolution crack image rendering segmentation network that incorporates transformer architectures and point-based rendering to achieve sub-pixel accuracy in crack segmentation. In [21], this model is further improved with probability map-guided point rendering, optimized for bridge crack images at 4K/8K resolution. In [22], EfficientSegNet is introduced as an encoder–decoder model that incorporates spatial attention modules and multi-scale feature fusion for concrete crack segmentation in drone images with complex backgrounds, and it is optimized for computational efficiency on resource-constrained edge devices.
Domain-specific applications have grown significantly in transportation infrastructure management and cultural heritage preservation, each with its own challenges related to materials and environmental factors. Study [23] established a framework for drone-based image acquisition for road marking condition mapping in South Korea, using DeepLabV3+ for semantic segmentation. Study [24] proposed CSKAN, a channel-spatial Kolmogorov–Arnold Network, for road crack detection under changing illumination, fusing image data with textual features through multimodal fusion; illumination invariance and feature stability were achieved across several datasets, including CRACK500 and GAPS397. Study [25] proposed a region guidance network for road damage detection in complex background scenes, sharing paths and weights between the detection and segmentation networks to improve feature extraction. Study [26] combined UAV images with an improved AHP methodology based on entropy weighting for pavement damage evaluation. Study [27] combined deep learning with GIS and spatiotemporal clustering on extensive street view imagery for city-scale pavement crack evaluation in Tokyo, showing high scalability in real-world applications.
For heritage buildings, where non-invasive inspection is essential, specific techniques have been developed to address the challenges of aged materials. The authors of [28] proposed a rapid method for identifying surface damage on traditional red brick buildings in Fujian Province, using the YOLOv10 algorithm to classify efflorescence, brick breakage, joint hollowing, and biological colonization with high mean average precision. The authors of [29] proposed a systematic approach to crack detection in heritage buildings that combines deep learning, image stitching, and skeleton-based analysis to track crack evolution at millimeter resolution, providing valuable information for temporal planning. The authors of [30] combined hyperspectral images with a supercluster channel selection algorithm to improve crack segmentation in heritage concrete buildings in Lisbon, effectively discriminating between actual structural cracks and biological colonization such as moss and algae and obtaining a 93.1% F1-score.
Class imbalance and limited training data remain critical challenges in damage assessment. Study [31] assesses aging buildings using YOLOv11 and a damage criticality index, applying geometric and color transformations to compensate for limited training data on concrete cracks, spalling, and rebar exposure in buildings across Seoul. Study [32] proposes MDF-Net, a multi-level defect fusion segmentation network that combines CNN and Swin Transformer architectures and uses focal loss together with dice loss to address class imbalance in concrete cracks and depressions; it also enables visualization of multi-scale features. Study [33] uses mosaic and mix-up transformations for real-time detection of cracks in stone cladding from UAV-mounted cameras, with a hybrid YOLOv8–Swin model deployed for edge computing on NVIDIA Jetson.
New sensing modalities continue to expand structural health monitoring beyond conventional visual inspection. Study [34] discusses the reuse of existing in-building security cameras for structural health monitoring, estimating inter-story drift and peak demand through computer vision-based motion tracking. Study [35] uses autonomous ultrasonic transmission tomography to locate cracks in concrete beams during the fracture process; unlike conventional computer vision-based imaging, tomography can image the growth of cracks that are not visible on the surface. Study [36] provides foundational CNN methodologies for damage detection in building structures, using Adam optimization and data augmentation to achieve high accuracy across diverse damage types, while [2] reviews image processing-, machine learning-, and deep learning-based technologies in structural health monitoring and critically discusses the applications of XAI and the lack of unified benchmarks for multi-class damage classification in disaster scenarios. Study [37] reviewed structural modal parameter recognition and damage identification methods under environmental excitations, including time-domain, frequency-domain, and time–frequency-domain approaches, and emphasized that integrating artificial intelligence with traditional modal and damage identification methods can improve data interpretation, reduce human subjectivity, and support more intelligent structural health monitoring systems.
Recent advancements in structural damage assessment have been made using deep learning and image processing techniques, including methods such as BayeSiamMTL, MCMamba, YOLOv8, and BCCD-YOLO. While these methods have proven effective in various contexts, they often encounter challenges that our proposed method effectively addresses. For example, BayeSiamMTL (a multitask learning framework) integrates building footprint detection and damage-level assessment, using a Bayesian approach to handle uncertainty. However, it does not explicitly tackle the class imbalance issue prevalent in post-disaster datasets, where rare but critical damage types, such as collapsed buildings, are underrepresented. Our method, which employs weighted cross-entropy loss, ensures that these minority damage classes are prioritized, improving performance in detecting rare but crucial damage types.
Similarly, MCMamba leverages multi-scale convolution-enhanced features for crack detection in high-resolution bridge images. While highly effective for crack detection, it is limited by its focus on a single damage type and struggles with the broader range of structural damage types encountered in post-earthquake scenarios. In contrast, our ensemble deep learning framework combines six different architectures, enabling the model to detect a wide variety of damage types, from minor cracks to complete structural collapses, providing a more comprehensive solution for post-disaster structural damage assessment.
YOLOv8 and BCCD-YOLO are also notable for their performance in real-time detection tasks. These models, however, focus primarily on detecting cracks and localized damage but may struggle with the fine-grained classification of damage levels and different structural components. Our ensemble model, using multiple architectures and a weighted voting mechanism, excels in identifying fine-grained damage categories and is particularly effective in detecting minority classes, which are often critical in safety assessments but are frequently missed by traditional methods.
Moreover, our method incorporates an explainability framework through Grad-CAM visualizations, which show the regions of an image that influence the model’s decision. This interpretability is crucial in safety-critical applications, because it allows structural engineers to understand why certain areas of a building are flagged as damaged. Imbalance-handling techniques such as focal loss and class-balanced loss improve model accuracy but do not, on their own, provide the transparency required for real-world decision-making in post-disaster scenarios.
In terms of performance, our ensemble model outperforms existing models in detecting both common and rare damage types, as evidenced by our improved test accuracy and F1-scores across all classes, particularly for the minority classes like collapsed buildings. This demonstrates the effectiveness of our approach in addressing key challenges such as class imbalance, fine-grained classification, and interpretability, setting it apart from current state-of-the-art techniques.
3. Methodology
The methodology of this study is designed as an integrated decision-support pipeline for automated post-disaster structural damage assessment. Rather than treating data preparation, model training, ensemble prediction, explainability, and evaluation as independent steps, the proposed framework connects these components within a unified workflow. In this structure, post-disaster building images are first prepared through annotation, stratified splitting, preprocessing, and augmentation. Then, multiple pre-trained CNN architectures are fine-tuned using an imbalance-aware learning strategy. The outputs of these models are combined through a macro-F1-based weighted late-fusion ensemble, while Grad-CAM is used to provide model-derived visual explanations for the final prediction.
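Because the ensemble weights are tied to macro-F1, it helps to see how macro-F1 differs from weighted-F1 on imbalanced data. The following is a minimal pure-Python sketch under illustrative assumptions: the labels and predictions are invented toy data, and the function names are hypothetical rather than part of the study's code.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {class: F1} for every class present in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1: frequent classes dominate."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Toy data: a classifier that never predicts the rare class.
y_true = ["crack"] * 8 + ["collapse"] * 2
y_pred = ["crack"] * 10
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

On this toy data the macro-F1 is much lower than the weighted-F1, because the missed minority class counts as much as the majority class. This is the property that makes macro-F1 a reasonable basis for weighting models when rare classes matter.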
As shown in Figure 1, the proposed methodology consists of interconnected blocks that represent the main technical components of the framework. The diagram begins with the post-disaster image dataset and data preparation stages, which provide the input for the CNN model pool and training strategy. Individual model predictions are then transferred to the ensemble fusion block, where validation macro-F1 scores are used to estimate model weights and generate the final weighted prediction. The final decision block produces the predicted damage class, confidence score, and expert-review flag, while the Grad-CAM explanation block visualizes the image regions that contributed to the CNN-based decision. Finally, the performance evaluation block assesses the framework using accuracy, precision, recall, macro-F1, weighted-F1, Cohen’s kappa, confusion matrix, and class-wise analysis.
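The fusion and decision blocks described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes that model weights are the validation macro-F1 scores normalized to sum to one, and that the expert-review flag is raised when the fused confidence falls below a threshold. The threshold value, the function name `fuse_predictions`, and all numbers in the example are hypothetical.

```python
def fuse_predictions(model_probs, val_macro_f1, review_threshold=0.6):
    """Weighted late fusion of per-model class probabilities.

    model_probs:      {model_name: [p_class0, p_class1, ...]}
    val_macro_f1:     {model_name: validation macro-F1 score}
    review_threshold: hypothetical confidence level below which the
                      prediction is flagged for expert review
    """
    # Assumption: weights are macro-F1 scores normalized to sum to 1.
    total = sum(val_macro_f1[m] for m in model_probs)
    weights = {m: val_macro_f1[m] / total for m in model_probs}

    n_classes = len(next(iter(model_probs.values())))
    fused = [sum(weights[m] * model_probs[m][c] for m in model_probs)
             for c in range(n_classes)]

    confidence = max(fused)
    return {
        "class": fused.index(confidence),
        "confidence": confidence,
        "needs_expert_review": confidence < review_threshold,
        "weights": weights,
    }

# Illustrative numbers only (three of the six models, three classes).
probs = {
    "ResNet50":    [0.7, 0.2, 0.1],
    "VGG16":       [0.4, 0.5, 0.1],
    "MobileNetV2": [0.5, 0.3, 0.2],
}
f1 = {"ResNet50": 0.5, "VGG16": 0.3, "MobileNetV2": 0.2}
result = fuse_predictions(probs, f1)
```

In this example the models disagree enough that the fused confidence stays below the hypothetical threshold, so the image would be routed to an engineer even though a class is predicted.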
3.1. Dataset and Preprocessing
The dataset consists of 13,270 images and 24,289 labeled regions. It was systematically collected through post-disaster field inspections conducted after the Kahramanmaraş earthquakes that struck Turkey on 6 February 2023. These earthquakes had magnitudes of 7.8 and 7.6, respectively, and caused widespread structural damage across several provinces. The images were captured under suboptimal imaging conditions and cover different building types and intricate damage patterns caused by strong ground motions, aftershocks, and secondary collapse mechanisms. The dataset therefore captures the manifestations of earthquake damage that engineers commonly encounter during rapid damage evaluation surveys in the immediate aftermath of earthquakes.
The 15-class fine-grained taxonomy provides a comprehensive categorization of structural elements and their damage states. It covers different conditions of walls, columns, and beams as well as damage states of entire buildings (Table 1). The taxonomy reflects the nuances of field inspections and enables the automated system to differentiate between structural elements, between levels of damage, and between critical damage states caused by different damage mechanisms. The class distribution is highly imbalanced: class sizes range from 2 to 3877 instances, an imbalance ratio of approximately 1:1938 between the least and most frequent classes. This reflects real post-earthquake surveys, in which certain critical damage states, such as the total collapse of buildings, are infrequent but require immediate structural intervention. The imbalance necessitates weighted loss functions to ensure balanced learning across all classes and to prevent bias toward frequent classes at the expense of these critical classes.
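To make the role of a weighted loss concrete, the sketch below computes inverse-frequency class weights and a per-sample weighted cross-entropy. The weighting formula w_c = N / (K · n_c) is a common convention and is assumed here, since the exact scheme is not specified in this section. Only the two extreme class sizes (3877 and 2) come from the dataset description; the class names are placeholders.

```python
import math

def balanced_class_weights(counts):
    """Inverse-frequency weights w_c = N / (K * n_c) (assumed convention)."""
    n_total = sum(counts.values())
    k = len(counts)
    return {c: n_total / (k * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, true_class, weights):
    """Loss for one sample: -w_y * log(p_y)."""
    return -weights[true_class] * math.log(probs[true_class])

# Placeholder class names; only the two counts are taken from the text.
counts = {"most_frequent": 3877, "least_frequent": 2}
weights = balanced_class_weights(counts)

# The same prediction error costs far more on the rare class.
uniform = {"most_frequent": 0.5, "least_frequent": 0.5}
loss_rare = weighted_cross_entropy(uniform, "least_frequent", weights)
loss_common = weighted_cross_entropy(uniform, "most_frequent", weights)
```

With unclipped inverse-frequency weights, the ratio between the two weights equals the class-size ratio of 1938.5. In practice such extreme weights are often clipped or smoothed, which is a design choice outside the scope of this sketch.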
Each image was taken at a minimum resolution of 1024 × 1024 pixels with calibrated digital cameras, capturing the fine-grained damage details necessary for accurate classification. Complete metadata accompanies every annotation, including the capture timestamp, camera focal length and exposure settings, ambient illumination conditions, inspector qualifications, and geolocation coordinates to support spatial analysis of the damage. Because the multi-project dataset involved different inspection teams and time frames, annotation quality was ensured through a rigorous multi-stage validation protocol in which every image was validated independently by at least two annotators with structural engineering qualifications and field inspection experience. Disagreements between annotators were resolved through a tertiary review by senior structural engineers experienced in post-disaster damage assessment. Inter-annotator agreement, measured with Cohen’s kappa, was high (κ > 0.85), confirming the reliability and consistency of the annotations across the variety of buildings, materials, environmental conditions, and damage characteristics in the dataset.
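For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is a minimal pure-Python version; the two label lists are invented for illustration and do not come from the dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented example: the annotators disagree on one of six images.
annotator_1 = ["crack", "crack", "spalling", "collapse", "crack", "spalling"]
annotator_2 = ["crack", "crack", "spalling", "collapse", "spalling", "spalling"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the raw agreement is 5/6, but kappa is lower (17/23, about 0.74) because part of that agreement would be expected by chance.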
The annotation scheme reflects a complex real-world inspection setting in which a single image can depict several structural elements with multiple damage states, resulting in an average of 1.83 bounding box regions per image. For image-level damage classification, each image is assigned a single primary damage type that reflects the most critical structural condition within that image. This allows the use of conventional softmax classifier architectures for multi-class classification.
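The reduction from region-level annotations to one image-level label can be sketched as a severity lookup. The class names and severity ranks below are hypothetical placeholders, because the ranking used to decide which condition is "most critical" is not given in this section.

```python
# Hypothetical severity ranks: a higher value means a more critical condition.
SEVERITY = {
    "wall_minor_crack": 1,
    "beam_moderate_damage": 2,
    "column_severe_damage": 3,
    "building_collapse": 5,
}

def primary_label(region_labels, severity=SEVERITY):
    """Return the most critical damage label among an image's regions."""
    if not region_labels:
        raise ValueError("image has no annotated regions")
    return max(region_labels, key=lambda label: severity[label])

# An image with two annotated regions is reduced to one training label.
image_regions = ["wall_minor_crack", "column_severe_damage"]
label = primary_label(image_regions)

# Sanity check of the reported average: 24,289 regions over 13,270 images.
avg_regions = round(24289 / 13270, 2)
```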
The preprocessing pipeline applies a series of transformations to each inspection image. First, histogram equalization in the YUV color space is applied to enhance image contrast and to accommodate the diverse lighting conditions of real-world inspections, which range from bright sunlight to poorly lit environments. Equalization spreads pixel values across the entire intensity range, improving feature visibility for the neural network feature extractors, which is particularly important for identifying subtle damage features such as hairline cracks, spalling, and discoloration. Working in the YUV color space allows the luminance values to be enhanced while color-dependent damage features, such as rust stains from reinforcement corrosion and stains from water infiltration, are preserved. Second, images are resized to architecture-specific dimensions using bilinear interpolation: VGG16, ResNet50, DenseNet121, EfficientNetB0, and MobileNetV2 require 224 × 224 pixel inputs, while InceptionV3 requires larger 299 × 299 pixel inputs, matching the input sizes used during pre-training on the ImageNet dataset. Finally, pixel values are normalized to the [0, 1] range by dividing by 255.
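The luminance equalization and normalization steps can be sketched with NumPy alone. This is an illustration under stated assumptions: the BT.601 RGB-to-YUV matrix is assumed because the color-space variant is not specified, only the Y channel is equalized, and resizing to 224 × 224 or 299 × 299 pixels is omitted. The function name and the toy image are hypothetical.

```python
import numpy as np

# BT.601 RGB -> YUV conversion matrix (assumed variant).
RGB_TO_YUV = np.array([[0.299, 0.587, 0.114],
                       [-0.14713, -0.28886, 0.436],
                       [0.615, -0.51499, -0.10001]])
YUV_TO_RGB = np.linalg.inv(RGB_TO_YUV)

def equalize_luminance(rgb: np.ndarray) -> np.ndarray:
    """Equalize the Y channel of a uint8 RGB image; return floats in [0, 1]."""
    yuv = rgb.astype(np.float64) @ RGB_TO_YUV.T
    y = np.clip(np.rint(yuv[..., 0]), 0, 255).astype(np.uint8)

    # Classic histogram equalization via the cumulative distribution of Y.
    hist = np.bincount(y.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = max(cdf[-1] - cdf_min, 1)  # avoid division by zero on flat images
    lut = np.clip(np.rint((cdf - cdf_min) / denom * 255), 0, 255)

    # Replace luminance only; the chrominance channels U and V are untouched.
    yuv[..., 0] = lut[y]
    rgb_eq = np.clip(yuv @ YUV_TO_RGB.T, 0, 255)
    return rgb_eq / 255.0  # normalize to [0, 1]

# Toy low-contrast gray image with intensities 100..103.
img = np.array([[100, 101], [102, 103]], dtype=np.uint8)[..., None].repeat(3, axis=-1)
out = equalize_luminance(img)
```

The toy image spans only four adjacent gray levels before equalization and almost the full [0, 1] range afterwards, which is the contrast-stretching effect the pipeline relies on for faint cracks.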
During the training phase, we used several data augmentation techniques to artificially increase the diversity of the dataset. The goal was to preserve the diagnostic features of damage while introducing variations that simulate the conditions under which images are typically captured in the field, so that the model generalizes across real-world changes in camera angle, lighting, and object scale. Geometric transformations were applied to account for variations in field documentation. Random rotations of up to ±20 degrees allowed the model to learn features from images captured at different angles. Horizontal flipping, applied with a 50% probability, added mirrored views of the same damage patterns without altering their integrity. Shifting images by up to 20% of their width and height accounted for non-standard framing during field data collection. Random zooming by a factor between 0.8 and 1.2 promoted invariance to object scale, helping the model recognize damage patterns regardless of the camera’s distance from the object of interest. Shearing transformations of up to ±15 degrees simulated perspective distortions caused by non-perpendicular viewing angles.
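The geometric augmentation ranges listed above can be collected into a small configuration and sampled once per training image. The sketch below only draws the random parameters; applying them to pixels is left to whichever image library is used. The dictionary keys and function name are hypothetical, and uniform sampling within each range is an assumption.

```python
import random

# Ranges taken from the text; the key names are illustrative.
GEOMETRIC_RANGES = {
    "rotation_deg": (-20.0, 20.0),
    "width_shift_frac": (-0.2, 0.2),
    "height_shift_frac": (-0.2, 0.2),
    "zoom_factor": (0.8, 1.2),
    "shear_deg": (-15.0, 15.0),
}
HORIZONTAL_FLIP_PROB = 0.5

def sample_augmentation(rng: random.Random) -> dict:
    """Draw one set of augmentation parameters (uniform sampling assumed)."""
    params = {name: rng.uniform(low, high)
              for name, (low, high) in GEOMETRIC_RANGES.items()}
    params["horizontal_flip"] = rng.random() < HORIZONTAL_FLIP_PROB
    return params

# A seeded generator keeps the augmentation sequence reproducible.
rng = random.Random(42)
example_params = sample_augmentation(rng)
```

Sampling fresh parameters for each image in each epoch means the network rarely sees exactly the same view twice, which is how augmentation increases effective diversity without adding new photographs.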
Photometric transformations were implemented to accommodate diverse illumination conditions, such as changes in brightness, weather, and transitions between indoor and outdoor environments. Brightness adjustments ranging from 80% to 120% of the original image intensity were applied. Histogram equalization was performed in the YUV color space to enhance image contrast, ensuring that subtle damage features, such as cracks or moisture infiltration, remained visible under various lighting conditions. While color jittering is another possible transformation to increase diversity, it was deliberately excluded from this process to maintain the diagnostic value of the damage indicators, which are highly dependent on the specific colors in the image (such as rust stains and discoloration). Including color jittering could have interfered with the identification of important damage features tied to color variations, which are critical for classifying damage types like corrosion, moisture infiltration, and other similar patterns. These transformations ensure that the model learns to identify damage patterns robustly under varying conditions while preserving important features that are crucial for classification.
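As a rough illustration of the augmentation policy described above, the following sketch samples one set of augmentation parameters from the stated ranges. The dictionary keys and the function `sample_augmentation` are hypothetical names, and uniform sampling within each range is an assumption, since the text does not specify the sampling distribution.

```python
import random

# Ranges taken from the text; uniform sampling is an assumption.
AUGMENTATION_RANGES = {
    "rotation_deg": (-20.0, 20.0),
    "width_shift": (-0.2, 0.2),   # fraction of image width
    "height_shift": (-0.2, 0.2),  # fraction of image height
    "zoom": (0.8, 1.2),
    "shear_deg": (-15.0, 15.0),
    "brightness": (0.8, 1.2),     # multiplier on original intensity
}
FLIP_PROBABILITY = 0.5


def sample_augmentation(rng):
    """Draw one random augmentation configuration for a training image."""
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in AUGMENTATION_RANGES.items()
    }
    params["horizontal_flip"] = rng.random() < FLIP_PROBABILITY
    # Color jitter is deliberately absent: hue and saturation stay untouched
    # so that color-dependent cues such as rust stains are preserved.
    return params
```

A call such as `sample_augmentation(random.Random(42))` yields a reproducible configuration that could then be passed to whichever image library performs the actual transformations.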
In this study, we chose six specific convolutional neural network models (VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB0, and MobileNetV2) based on their architectural diversity and proven effectiveness in image classification tasks.
Class imbalance is a critical challenge in many classification problems, especially in post-disaster structural damage assessment, where certain damage types, such as collapsed buildings, are rare but critical for decision-making. Several mainstream methods exist to handle this issue, each with its own advantages and limitations. Over-sampling techniques generate additional minority-class instances, either by duplicating existing examples (random over-sampling) or by interpolating between them (SMOTE). While this approach can help balance the dataset, it risks overfitting, as it may introduce redundant or unrealistic samples that do not represent the true variation in the minority class. Under-sampling methods, on the other hand, reduce the size of the majority class, which can lead to a loss of important information, especially when the majority class contains complex and diverse instances that are critical for the model’s understanding.
Another common technique is focal loss, which adjusts the loss function to focus more on difficult-to-classify examples. While this method has proven effective in some domains, it may not be ideal in our case due to the subtle visual differences between consecutive damage classes. Applying focal loss could potentially overemphasize certain types of damage while neglecting more nuanced damage patterns that require careful assessment, thereby limiting its applicability for fine-grained damage classification in the context of structural damage evaluation.
Class-balanced loss is another approach that adjusts the loss function according to the frequency of each class, ensuring that the model does not become biased towards the more frequent classes. However, given the high variability in building types and the complexity of the damage patterns in our dataset, we found that weighted cross-entropy loss is more suitable for our task. This loss function assigns higher weights to minority classes, which helps the model focus on rare but critical damage types without sacrificing performance on the majority classes.
In our experiments, weighted cross-entropy loss demonstrated significant improvements in detecting rare but critical damage types, such as collapsed buildings. This method helped to mitigate the class imbalance without introducing the risks associated with over-sampling or under-sampling. As a result, the ensemble model, which combines multiple models using this loss function, outperformed individual models, showing better overall performance, particularly in detecting the minority classes.
3.2. Ensemble Deep Learning Framework
The framework for classification combines six widely used convolutional neural networks through a weighted ensemble-based voting mechanism, based on transfer learning with ImageNet pre-training and a two-stage fine-tuning approach. The variety in the selection of models also enables different approaches to feature extraction, including the following: VGG16, which focuses on hierarchical texture learning through sequential convolution operations with a convolution window size of 3 × 3; ResNet50, which focuses on geometric pattern detection through skip connections in residual learning; InceptionV3, which focuses on multi-scale feature extraction through multiple convolution operations with different window sizes; DenseNet121, which focuses on parameter-efficient feature reuse through dense connectivity; EfficientNetB0, which focuses on balancing accuracy with efficiency through compound scaling-based optimization; and MobileNetV2, which focuses on reducing computational requirements through depth-wise separable convolution operations.
Transfer learning proceeds through two systematic stages, balancing preservation of general visual knowledge from ImageNet with adaptation to building damage patterns. Stage 1 freezes all convolutional layers and trains only a newly added classification head for 20 epochs at a learning rate of 1 × 10⁻³. The classification head sequentially applies global average pooling, which aggregates feature map activations across spatial locations and reduces the dimensions from [H × W × C] to [C]; a dense layer with 512 ReLU-activated units for non-linear feature transformation; two dropout layers with rates of 0.5 and 0.3, which provide progressive regularization and discourage co-adaptation; and a final output layer with 15 softmax-activated units that produces a normalized probability distribution over the damage categories. This stage rapidly adapts the classification head to the damage taxonomy while preserving the general feature extraction capabilities learned from ImageNet’s diverse natural image collection. Stage 2 unfreezes all network layers and continues training for 30 epochs at a reduced learning rate of 1 × 10⁻⁴, enabling subtle adaptation of the feature extractors to building-specific damage patterns while avoiding catastrophic forgetting of ImageNet-learned vision capabilities. Training terminates early if the validation loss does not improve for 10 consecutive epochs.
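The two-stage protocol can be summarized as a small schedule object. This is only a framework-agnostic sketch of the epoch counts, learning rates, and freezing policy stated above; `Stage`, `SCHEDULE`, and `stage_for_epoch` are hypothetical names, and the actual layer freezing would be carried out by the deep learning framework in use.

```python
from collections import namedtuple

Stage = namedtuple("Stage", "name epochs learning_rate backbone_trainable")

# Values as stated in the text: head-only training, then full fine-tuning.
SCHEDULE = (
    Stage("head_only", 20, 1e-3, False),
    Stage("full_fine_tune", 30, 1e-4, True),
)


def stage_for_epoch(epoch, schedule=SCHEDULE):
    """Return the stage that is active at a zero-based global epoch index."""
    start = 0
    for stage in schedule:
        if epoch < start + stage.epochs:
            return stage
        start += stage.epochs
    raise ValueError("epoch is beyond the end of the schedule")
```

For example, `stage_for_epoch(19)` still belongs to the head-only stage, whereas `stage_for_epoch(20)` is the first epoch in which the backbone is trainable. Early stopping may end either stage before its epoch budget is exhausted.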
The optimization process minimizes a weighted categorical cross-entropy loss to account for class-frequency imbalance and to ensure that rare but safety-critical damage classes contribute sufficiently to model learning. For a mini-batch containing $N$ samples and $C$ damage categories, the loss function is defined as shown in Equation (1):
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} w_c \, y_{i,c} \log\left(\hat{y}_{i,c}\right) \tag{1}$$
where $N$ denotes the number of samples in the mini-batch, $C$ denotes the total number of damage categories, $y_{i,c}$ is the one-hot encoded ground-truth label indicating whether sample $i$ belongs to class $c$, $\hat{y}_{i,c}$ is the predicted softmax probability for sample $i$ and class $c$, and $w_c$ is the class-specific weight assigned to class $c$. The class weights are computed using the balanced weighting strategy given in Equation (2):
$$w_c = \frac{N_{\mathrm{train}}}{C \cdot n_c} \tag{2}$$
where $N_{\mathrm{train}}$ is the total number of training samples used to compute the class weights, $C$ is the number of classes, and $n_c$ is the number of training samples belonging to class $c$. This formulation assigns larger weights to underrepresented classes and smaller weights to frequent classes, thereby reducing the tendency of the model to favor dominant categories. In this study, the class weights were computed from the training partition only to avoid information leakage from the validation and test subsets. For example, in the training partition, frequent classes such as “IW with Minor Damage” contain several thousand labeled regions, whereas rare but safety-critical classes such as “Collapsed Building” contain only a few hundred labeled regions. Under the balanced weighting strategy, the collapsed-building class receives a substantially larger loss weight than the frequent wall-damage classes. As a result, misclassifying a collapsed-building sample contributes more strongly to the total loss during optimization. This encourages the model to pay greater attention to rare but critical structural damage categories instead of being biased toward dominant classes with many training examples.
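To make the effect of balanced weighting concrete, the sketch below computes class weights and a weighted cross-entropy in plain Python. It assumes the common "balanced" heuristic (total samples divided by the product of the number of classes and the per-class count, as used by scikit-learn's `class_weight="balanced"` option), and it assumes every class appears at least once in the training labels. The function names and the toy label counts are illustrative and do not come from the dataset.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Weight for class c = total / (num_classes * count of class c)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]


def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Mean weighted cross-entropy over a mini-batch.

    probs:  list of per-sample probability vectors (softmax outputs)
    labels: list of integer ground-truth class indices
    """
    losses = [
        -weights[y] * math.log(max(p[y], eps))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


# Toy example: class 0 is frequent (8 samples), class 1 is rare (2 samples).
toy_labels = [0] * 8 + [1] * 2
toy_weights = balanced_class_weights(toy_labels, num_classes=2)
```

With these toy counts, the rare class receives a weight of 2.5 and the frequent class a weight of 0.625, so an equally uncertain prediction costs four times more when the true class is the rare one.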
The Adam optimizer combines momentum with adaptive learning rates, using β₁ = 0.9 and β₂ = 0.999 as the exponential decay rates of the first- and second-moment estimates. Training employs a batch size of 32, which balances GPU memory constraints against gradient estimation quality. The learning rate is reduced by a factor of 0.5 when the validation loss plateaus for 5 epochs, down to a minimum of 1 × 10⁻⁷. All models use the stratified 80/10/10 training, validation, and test split (Section 3.4), which maintains class distribution proportions across partitions.
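The interaction between learning rate reduction on plateau and early stopping can be illustrated with a simplified simulation over a sequence of validation losses. This is a sketch of the general logic under the stated settings (factor 0.5, plateau patience 5, floor 1 × 10⁻⁷, early-stopping patience 10), not a reproduction of any particular framework callback; the function name and the absence of a cooldown or minimum-improvement threshold are simplifying assumptions.

```python
def simulate_plateau_schedule(val_losses, lr=1e-4, factor=0.5,
                              lr_patience=5, min_lr=1e-7, stop_patience=10):
    """Return (stop_epoch or None, learning rate used after each epoch)."""
    best = float("inf")
    since_improvement = 0  # epochs since the best validation loss
    wait = 0               # epochs since the last improvement or LR reduction
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_improvement = 0
            wait = 0
        else:
            since_improvement += 1
            wait += 1
            if wait >= lr_patience:
                lr = max(lr * factor, min_lr)  # never drop below the floor
                wait = 0
        history.append(lr)
        if since_improvement >= stop_patience:
            return epoch, history
    return None, history
```

With a validation loss that never improves after the first epoch, the learning rate is halved at the fifth and tenth stagnant epochs, and training stops at the tenth.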
The ensemble prediction framework combines the predictions of the six models using a weighted voting approach, which exploits the diversity of the models to achieve better performance compared to the individual models. Before combining the predictions, the contribution weight of each model was calculated from its validation macro-F1 score. Let $F_m$ denote the validation macro-F1 score of model $m$, where $m \in \{1, \dots, M\}$ and $M$ represents the total number of CNN models in the ensemble. To emphasize models with stronger class-balanced validation performance, the macro-F1 scores were squared and then normalized across all models. The normalized ensemble weight $\alpha_m$ for model $m$ was computed as in Equation (3):
$$\alpha_m = \frac{F_m^{2}}{\sum_{j=1}^{M} F_j^{2}} \tag{3}$$
where $\alpha_m$ is the normalized weight assigned to model $m$, $F_m$ is the validation macro-F1 score of model $m$, and $M$ is the number of models in the ensemble. This normalization ensures that the sum of all model weights equals one, while giving relatively higher influence to models with stronger validation performance across imbalanced damage classes.
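The squared-and-normalized weighting described above reduces to a few lines of Python. The function name is hypothetical, and the scores in the usage note are made-up values rather than results from the study.

```python
def ensemble_weights(macro_f1_scores):
    """Square each validation macro-F1 score, then normalize to sum to one."""
    squared = [score * score for score in macro_f1_scores]
    total = sum(squared)
    return [value / total for value in squared]
```

For two hypothetical models with macro-F1 scores of 0.80 and 0.90, the weights are 0.64/1.45 ≈ 0.441 and 0.81/1.45 ≈ 0.559. Squaring widens the gap relative to plain normalization, which would give roughly 0.471 and 0.529.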
For the preprocessed input image $I$, each prediction model generates a probability distribution over 15 classes. The ensemble combines the predictions using weighted averaging, as shown in Equation (4):
$$P_{\mathrm{ens}}(c \mid I) = \sum_{m=1}^{M} \alpha_m \, P_m(c \mid I) \tag{4}$$
where $P_m(c \mid I)$ denotes the probability assigned by model $m$ to class $c$ for input image $I$, and $\alpha_m$ is the normalized ensemble weight calculated from the validation macro-F1 score. The proposed ensemble follows a post-fusion, or late-fusion, strategy. Each CNN architecture was trained independently using the same training, validation, and test partitions, and no end-to-end joint fine-tuning was performed across the six models. During inference, each independently trained model produced a class-probability distribution for the input image, and these probability distributions were fused using the normalized validation macro-F1-based weights. Therefore, the ensemble improves robustness by combining complementary predictions at the decision level rather than by learning a single jointly optimized multi-branch network. Final prediction selects the category with the maximum ensemble probability as defined in Equation (5):
$$\hat{c} = \arg\max_{c} \, P_{\mathrm{ens}}(c \mid I) \tag{5}$$
with confidence quantified as the maximum probability value. Predictions with confidence below 0.6 are automatically flagged as uncertain, triggering manual review by structural engineering experts.
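The late-fusion step, the final class selection, and the 0.6 confidence threshold can be sketched together as follows. The function name `fuse_predictions` and the three-class toy inputs are illustrative assumptions; the real system fuses six 15-class distributions.

```python
def fuse_predictions(model_probs, weights, threshold=0.6):
    """Weighted average of per-model class probabilities.

    model_probs: one probability vector per model
    weights:     normalized ensemble weights, one per model
    Returns (predicted class index, confidence, needs_manual_review).
    """
    num_classes = len(model_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs))
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=fused.__getitem__)
    confidence = fused[label]
    return label, confidence, confidence < threshold
```

When the models agree strongly, the fused confidence stays high and the prediction is accepted. When they split their probability mass between neighboring classes, the fused maximum drops below the threshold and the image is routed to an engineer.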
In this study, explainable artificial intelligence is not treated as a separate method independent of the deep learning classifier. Instead, it is integrated into the trained CNN-based classification framework through Gradient-weighted Class Activation Mapping (Grad-CAM). Grad-CAM explains the prediction of a CNN model by using the gradients of the predicted class score with respect to the feature maps of the last convolutional layer. Therefore, the explanation is directly derived from the internal representation learned by the deep network. Grad-CAM is then upscaled to the dimensions of the original image, normalized to a range of [0, 1], and alpha-blended over the original image, where warm colors indicate areas of high importance. This provides a clear and intuitive explanation of which image regions influence the classification decision. Generating Grad-CAM maps for each ensemble member also enables a comparative analysis of how different CNN architectures attend to damage-related regions. For instance, VGG16 may emphasize local textural patterns, ResNet50 may focus on geometric discontinuities, and InceptionV3 may capture multi-scale structural features.
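The core Grad-CAM computation can be sketched on small nested lists, assuming the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted from the network. The function name is hypothetical, and the upscaling and alpha-blending steps are omitted.

```python
def grad_cam(feature_maps, gradients):
    """Compute a normalized Grad-CAM map.

    feature_maps, gradients: K x H x W nested lists (K channels).
    Returns an H x W map with values in [0, 1].
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel importance: global average of the gradients for each channel.
    channel_weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum of feature maps followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * fmap[i][j]
                         for w, fmap in zip(channel_weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak == 0:
        return cam  # no positive evidence for the class
    return [[value / peak for value in row] for row in cam]
```

The ReLU keeps only the regions that increase the score of the predicted class, which is why channels with negative average gradients do not contribute to the heat map.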
3.3. Training Hyperparameters
The model was trained using the Adam optimizer, a widely used method that adapts the learning rate based on the first and second moments of the gradients. The initial learning rate was set to 1 × 10⁻³ for the first stage, in which the classification head was optimized. For the subsequent fine-tuning stage, the learning rate was reduced to 1 × 10⁻⁴. This gradual decrease in learning rate helped stabilize the training process and allowed the model to converge smoothly.
The batch size was set to 32, which balanced memory usage against gradient update efficiency. This choice enabled the model to learn from a sufficient number of samples per update while keeping computational demands manageable. A smaller batch size would produce higher variance in the gradient estimates, while a larger batch size would demand more memory and computational power. Training was conducted in two stages. In the first stage, only the classification head was trained for 20 epochs with a learning rate of 1 × 10⁻³, enabling the final layers to converge quickly. In the second stage, the entire model was fine-tuned over 30 additional epochs, with the learning rate reduced to 1 × 10⁻⁴ to allow for more subtle adjustments and prevent overfitting.
To prevent overfitting and ensure better generalization to unseen data, we applied early stopping. The training process was halted if the validation loss did not improve for 10 consecutive epochs, thus helping to prevent overfitting to the training data and ensuring that the model performed well on the validation and test sets. These hyperparameters were chosen based on an initial hyperparameter search conducted on a subset of the data. This search helped determine the optimal values for learning rate, batch size, and other training parameters, ensuring effective model performance while preventing overfitting.
3.4. Evaluation Metrics and Validation Protocol
In this study, we integrated uncertainty quantification (UQ) methods to assess the confidence in the model’s predictions, particularly in the context of post-disaster structural damage assessment, where reliable predictions are critical. The Monte Carlo dropout method was employed during inference to estimate the uncertainty of the model’s predictions. In this technique, dropout is applied at both the training and inference stages, with multiple forward passes made on the same image. The variance across these predictions is then used as a measure of the uncertainty, indicating regions where the model is less confident. This approach is particularly important in post-disaster scenarios, where some damage types, such as collapsed buildings, are rare but critical for emergency response. The uncertainty values help identify regions of high uncertainty, enabling structural engineers to focus on areas where the model’s predictions are less certain, ensuring that critical regions are reviewed by human experts for further inspection.
Model performance is evaluated through complementary metrics to obtain a comprehensive assessment under different criteria. Accuracy is the number of correct predictions divided by the total number of test cases. Precision measures the exactness of the positive predictions and is calculated as the number of true positives divided by the total number of positive predictions made by the model. Recall measures sensitivity and is calculated as the number of true positives divided by the total number of actual positive test cases. The F1-score is the harmonic mean of precision and recall and is less sensitive to class imbalance than accuracy. Macro-averaged F1 is the unweighted mean of the per-class F1 scores; for the 15-category imbalanced taxonomy, it assigns equal weight to rare but significant damage classes. Weighted F1 is the prevalence-weighted mean of the per-class F1 scores and therefore assigns more weight to the majority classes. Cohen’s kappa measures the agreement between two raters beyond chance and is used here to validate the automated system against the judgments of a structural engineer.
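The relationship between these metrics is easiest to see when they are all derived from one confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the function names and the two-class toy matrix used in the usage note are illustrative only.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows = true classes)."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def weighted_f1(cm):
    """Mean of per-class F1 scores weighted by class support."""
    support = [sum(row) for row in cm]
    return sum(f * s for f, s in zip(per_class_f1(cm), support)) / sum(support)


def cohen_kappa(cm):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(n)) / total
    expected = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)
```

For the toy matrix `[[90, 10], [5, 5]]`, the majority class reaches an F1 of about 0.92 and the minority class only 0.40. The weighted F1 (about 0.88) hides this weakness, whereas the macro F1 (about 0.66) and Cohen's kappa (about 0.33) expose it.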
In the performance assessment process, the dataset was partitioned at the image level using a stratified 80/10/10 split to preserve the class distribution across the training, validation, and test subsets. Since each image may contain more than one annotated structural region, the number of images and labeled regions were reported separately to avoid ambiguity. Accordingly, 10,616 images containing 19,431 labeled regions were used for training, 1327 images containing 2429 labeled regions were used for validation, and 1327 images containing 2429 labeled regions were reserved for testing. The training subset was used for model optimization through the two-stage fine-tuning process, while the validation subset was used for hyperparameter tuning, including the learning rate, dropout rate, weight decay coefficient, and early stopping criterion. The test subset was kept completely unseen until the final performance assessment to prevent evaluation bias. Performance was calculated separately for each of the 15 damage categories to enable detailed class-wise analysis and to identify categories requiring further improvement. Macro-averaged and weighted-averaged metrics were also computed to provide both class-balanced and prevalence-weighted evaluations.
4. Experimental Setup and Results
The experimental evaluation process employed the dataset and the preprocessing steps as discussed in Section 3.1. Stratified 5-fold cross-validation was employed in the process. Stratified splitting was applied at the image level to prevent data leakage between training, validation, and test partitions. The dataset contains 13,270 images and 24,289 labeled regions; therefore, the number of images and the number of labeled regions were reported separately. Following an 80/10/10 image-level split, 10,616 images containing 19,431 labeled regions were used for training, 1327 images containing 2429 labeled regions were used for validation, and 1327 images containing 2429 labeled regions were reserved for testing. This distinction ensures that the reported split reflects the actual number of images used for model development and evaluation, while the labeled-region counts describe the annotation density within each partition.
The damage annotations were structured in accordance with the damage assessment methodology proposed in [38], which is also used by official Turkish institutions (Ministry of Environment, Urbanization and Climate Change of Türkiye) for building evaluation after earthquake events. The dataset was initially divided into two groups of images, namely, exterior inspection images and interior inspection images. The images of the interior inspection are further divided and annotated, distinguishing between load-bearing elements (columns, shear walls, beams) and non-load-bearing elements (infill walls). This distinction is important since damage to non-load-bearing elements does not affect the structural safety of reinforced concrete buildings, whereas damage to load-bearing elements directly impacts structural integrity. In addition, reference (undamaged) labels are defined for each type of element to enable comparative damage evaluation. In total, there are fifteen damage categories covering the three main inspection domains: exterior building assessment (Undamaged Building, Collapsed Building, Localized Collapse or Permanent Deformation), structural reinforced concrete elements (Undamaged Column/SW, Type-C Damaged Column/SW, Type-D Damaged Column/SW, Undamaged Beam, Type-C Damaged Beam, Type-D Damaged Beam), and non-structural components (Undamaged IW, IW with Minor Damage, IW with Moderate Damage, IW with Severe Damage, Corrosion, Moisture Damage), as shown in Table 2.
To provide a visual understanding of the adopted damage taxonomy, representative image samples from the dataset are presented in Figure 2. These examples illustrate the visual diversity of the 15 damage categories used in the classification task, including exterior building conditions, load-bearing structural elements, and non-structural components. The figure also demonstrates the fine-grained nature of the dataset, where visually similar categories require careful feature learning by the proposed CNN-based ensemble framework.
Model selection involved six of the best-performing convolutional neural network architectures with distinct design philosophies and parameter sizes: VGG16 with 138 M parameters, ResNet50 with 25.6 M parameters, InceptionV3 with 23.9 M parameters, DenseNet121 with 8.0 M parameters, EfficientNetB0 with 5.3 M parameters, and MobileNetV2 with 3.5 M parameters. Training is conducted using the two-stage transfer learning protocol, as discussed in Section 3.2. Table 3 summarizes the computational requirements.
The ensemble methodology integrated predictions from all six trained models through the weighted voting mechanism described in Section 3.2, with model weights derived from validation F1-macro scores. Prediction agreement scores, confidence metrics, and uncertainty estimates were computed as specified in the methodology.
Performance evaluation employed accuracy, precision, recall, F1-score (macro-averaged and weighted-averaged), and Cohen’s kappa, with per-class metrics enabling detailed analysis across the imbalanced taxonomy. The confusion matrix visualization provided insights into misclassification patterns between adjacent damage severity levels.
The experimental results demonstrate that the proposed ensemble approach significantly outperforms individual deep learning models across all evaluation metrics.
Table 4 presents comprehensive performance metrics for all six individual models.
As shown in Table 4, MobileNetV2 achieved the highest individual model test accuracy of 91.10% with a macro-averaged F1-score of 88.83%, demonstrating that efficient architectures can achieve excellent performance with only 3.5 million parameters. ResNet50 and EfficientNetB0 also delivered strong results, with ResNet50’s residual connections enabling effective training of deeper networks and EfficientNetB0’s compound scaling achieving near-optimal performance with minimal parameters.
The ensemble model achieved an outstanding test accuracy of 93.77%, representing a substantial 2.67 percentage point improvement over the best individual model (MobileNetV2) and a 9.30 percentage point enhancement compared to the VGG16 baseline.
Statistical significance testing using McNemar’s test with Bonferroni correction confirmed that the ensemble’s improvement over each individual model achieved a p < 0.001 significance level. Per-class F1-score analysis reveals that the ensemble achieves particularly notable improvements for challenging minority classes such as collapsed buildings, where individual model predictions often exhibit higher uncertainty but ensemble aggregation successfully identifies reliable consensus patterns.
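For readers unfamiliar with the procedure, the sketch below shows one common form of McNemar's test for comparing two classifiers on the same test set, followed by a Bonferroni adjustment. The text does not state which variant was used, so the continuity-corrected chi-square form is an assumption, and the function names are hypothetical. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def mcnemar_p_value(b, c):
    """Continuity-corrected McNemar test on discordant pairs.

    b: samples classified correctly by model A only
    c: samples classified correctly by model B only
    """
    if b + c == 0:
        return 1.0  # the two models never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 dof.
    return math.erfc(math.sqrt(statistic / 2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With six pairwise comparisons (the ensemble against each individual model), a raw p-value must fall below roughly 0.001/6 for the adjusted value to remain under 0.001.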
To further evaluate the proposed model beyond alternative ensemble aggregation strategies, representative literature-based baselines were also reproduced using the same image-level 80/10/10 training, validation, and test split. In this comparison, a YOLOv8-based baseline was adapted because YOLO family models have been widely used in structural damage detection studies, while a focal-loss-based ResNet50 baseline was implemented to represent imbalance-aware damage classification methods. In addition, a class-balanced loss baseline was evaluated to compare the proposed weighted categorical cross-entropy strategy with another frequency-aware loss formulation. All baseline models were trained using the same preprocessing pipeline, augmentation strategy, class taxonomy, and data partitions to ensure a fair comparison.
As shown in Table 5, the reproduced literature-based baselines achieved competitive results but remained below the proposed ensemble model. The YOLOv8-based baseline achieved 88.42% test accuracy and an 85.76% macro F1-score, showing strong performance on frequent visual damage categories but weaker sensitivity to minority classes. The focal-loss ResNet50 baseline improved minority-class learning to some extent, achieving 89.18% test accuracy and an 86.94% macro F1-score. The class-balanced loss baseline further improved class-balanced performance with 89.63% test accuracy and an 87.41% macro F1-score. In contrast, the proposed ensemble achieved 93.77% test accuracy and a 93.81% macro F1-score, indicating that the proposed post-fusion strategy provides better overall and class-balanced performance under the same experimental protocol.
The diagonal dominance in the confusion matrix for the ensemble model shows that the predictions are more densely packed along the diagonal for correct classification compared to the individual models. The patterns observed in the off-diagonal elements reveal that misclassification errors are concentrated in highly similar damage classes, i.e., minor and moderate damage levels, emphasizing the inherent ambiguity in these subtle gradations of structural damage.
Figure 3 shows the class distribution for the fifteen different damage categories, indicating a significant class imbalance in the dataset with a dominance of reference structures and minor damage levels. The three most dominant classes account for nearly 50% of all the labeled regions, while the five least dominant classes account for less than 10%.
Figure 4 shows the ensemble’s confusion matrix in the form of a heatmap. Correctly classified instances are located along the diagonal. Well-populated diagonal values across all damage classes reflect exceptional performance in classification. Normalized accuracy values are greater than 88% for most classes. Also, note that there is a certain asymmetry in this confusion matrix. This is because structures that are highly damaged are sometimes classified as moderately damaged and vice versa.
Since Figure 4 presents a count-based confusion matrix, the off-diagonal entries were additionally examined using row-normalized percentages to account for the unequal number of samples across damage categories. In this row-wise interpretation, each misclassification count was divided by the total number of samples in the corresponding true class. The highest one-directional confusion was observed for “IW with Minor Damage” being classified as “Undamaged IW” with 15 misclassified samples out of 783, corresponding to 1.92%. The second highest confusion was “Undamaged IW” being classified as “IW with Moderate Damage” with 13 misclassified samples out of 773, corresponding to 1.68%. The third highest confusion was “IW with Severe Damage” being classified as “IW with Moderate Damage” with 8 misclassified samples out of 501, corresponding to 1.60%. These normalized values show that, despite the class imbalance, the relative misclassification rates remain low and are mainly concentrated among visually adjacent infill wall damage categories.
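The row-wise normalization is a simple ratio, shown here with the three confusion counts quoted above. The function name is hypothetical.

```python
def row_normalized_rate(misclassified, true_class_total):
    """Percentage of a true class that was assigned to one wrong class."""
    return 100.0 * misclassified / true_class_total


# Counts quoted in the text: (misclassified samples, true-class total).
reported_confusions = {
    "Minor IW -> Undamaged IW": (15, 783),
    "Undamaged IW -> Moderate IW": (13, 773),
    "Severe IW -> Moderate IW": (8, 501),
}
rates = {
    name: round(row_normalized_rate(count, total), 2)
    for name, (count, total) in reported_confusions.items()
}
```

Dividing by the true-class total rather than by the overall test set size is what makes the rates comparable across classes with very different supports.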
Figure 5 shows a comparison chart for ensemble accuracy among all seven models. The ensemble bar is higher than all the other model bars, indicating better predictive capability with a test accuracy of 93.77%.
Figure 6 compares the ensemble F1-score results using both macro and weighted averaging. The similarity between the macro-averaged and weighted-averaged results for the ensemble indicates that the model performs comparably well across all damage classes.
Figure 7 presents the per-class F1-score comparison across all models as a grouped bar chart.
Table 6 quantifies the per-category performance for selected representative damage categories.
The ensemble consistently performs as well as or better than the best-performing individual model over the majority of the damage categories, with significant improvement of 5–8% in the more challenging categories such as severely damaged walls and buildings.
To further assess the reliability of the ensemble predictions, Monte Carlo dropout-based uncertainty estimation was applied during inference. Dropout layers were kept active, and 30 stochastic forward passes were performed for each test sample. The final prediction was obtained by averaging the class probability distributions across these passes, while predictive variance and entropy were used to identify uncertain cases. The uncertainty analysis showed that correctly classified samples generally produced low predictive variance, with an average uncertainty score of 0.031, whereas misclassified samples showed a higher average uncertainty score of 0.118. This pattern was especially visible in visually ambiguous categories, such as minor, moderate, and severe infill wall damage, where gradual transitions between damage levels caused higher uncertainty. For the safety-critical minority class of collapsed buildings, the ensemble achieved an F1-score of 0.955, while high-uncertainty cases were mostly associated with partial occlusion, incomplete structural visibility, or visual similarity with localized collapse or permanent deformation. These findings indicate that Monte Carlo dropout provides a useful uncertainty-aware decision support mechanism by identifying borderline predictions that should be reviewed by structural engineering experts rather than being accepted as fully automated decisions.
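The aggregation over stochastic forward passes can be sketched as follows, assuming the per-pass softmax outputs have already been collected with dropout kept active. The function name is hypothetical, and the text does not define the scalar "uncertainty score" precisely, so reporting the variance of the predicted class alongside the entropy of the mean distribution is an assumption.

```python
import math


def mc_dropout_summary(pass_probs):
    """Summarize T stochastic forward passes for one image.

    pass_probs: list of T probability vectors (one per pass).
    """
    passes = len(pass_probs)
    num_classes = len(pass_probs[0])
    mean = [sum(p[c] for p in pass_probs) / passes for c in range(num_classes)]
    variance = [
        sum((p[c] - mean[c]) ** 2 for p in pass_probs) / passes
        for c in range(num_classes)
    ]
    # Entropy of the averaged distribution (natural logarithm).
    entropy = -sum(m * math.log(m) for m in mean if m > 0)
    label = max(range(num_classes), key=mean.__getitem__)
    return {
        "label": label,
        "mean": mean,
        "variance": variance[label],
        "entropy": entropy,
    }
```

Passes that agree produce near-zero variance and low entropy, while passes that disagree about the class raise both quantities, which is the signal used to route an image to manual review.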
The Grad-CAM visualization analysis provides insight into which image regions drive the models’ decisions. As shown in Figure 8, the representative ensemble Grad-CAM visualization consists of the original image of the building damage and six Grad-CAM heat maps representing the attention of each of the six models. In all of the models, the attention is focused on the pertinent areas of the image rather than the background. This indicates that the models base their predictions on the regions where the damage actually manifests.
To complement the qualitative Grad-CAM visualizations, we also conducted a quantitative attention localization analysis using the available labeled damage regions. For each correctly classified test image, the Grad-CAM heatmap was normalized to the range $[0, 1]$, and the proportion of total activation falling inside the annotated damage region was calculated as the Attention Localization Ratio (ALR). The ALR was defined as $\mathrm{ALR} = \sum_{(x,y) \in R} H(x,y) \,/\, \sum_{(x,y) \in \Omega} H(x,y)$, where $H(x,y)$ denotes the Grad-CAM activation value at pixel location $(x,y)$, $R$ represents the annotated damage bounding box region, and $\Omega$ denotes the full image area. The ensemble model achieved an average ALR of 0.74, compared with an average ALR of 0.66 for the best individual model. This indicates that the ensemble model not only improved classification accuracy but also concentrated a larger proportion of its attention on structurally relevant damage regions. In particular, higher localization ratios were observed for visually distinctive classes such as collapsed buildings, severe infill wall damage, and Type-D damaged column/shear wall samples, whereas lower ratios were observed in visually ambiguous categories such as minor and moderate infill wall damage.
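The ALR computation amounts to dividing the heatmap mass inside the annotated box by the total heatmap mass. The sketch below assumes a single axis-aligned bounding box given as `(top, left, bottom, right)` with exclusive bottom and right indices; this box convention and the function name are illustrative assumptions.

```python
def attention_localization_ratio(heatmap, box):
    """Fraction of total Grad-CAM activation that falls inside the box.

    heatmap: H x W nested list of activations in [0, 1]
    box:     (top, left, bottom, right), bottom/right exclusive
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0  # an empty heatmap attends to nothing
    inside = sum(sum(row[left:right]) for row in heatmap[top:bottom])
    return inside / total
```

An ALR of 1.0 means all activation lies inside the annotated damage region, and a value near the box's share of the image area means the attention is no better than uniform.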
The error analysis of the misclassified samples reveals two main patterns. First, a significant proportion of misclassifications occurred at the boundaries between adjacent damage severity levels, where the visual differences between categories are gradual. Second, moisture damage was confused with minor structural damage, because the visual patterns produced by water infiltration resemble those observed in the early stages of cracking.
The robustness analysis was conducted to evaluate the performance of the ensemble model under different perturbation conditions that may occur during deployment in real-world post-disaster scenarios. To account for randomness in model training and evaluation, the illumination robustness experiment was repeated over five independent runs, and the results were reported as mean ± standard deviation. Under severe lighting perturbations, the ensemble accuracy decreased from to , corresponding to an average degradation of percentage points. In contrast, the individual models showed a larger average degradation of percentage points under the same perturbation setting. This indicates that the ensemble model provides more stable predictions under illumination changes compared with single-model approaches. The occlusion analysis further showed that the ensemble model was able to provide consistent results even when up to 30% of the image was occluded, provided that the critical damage regions remained visible. These findings suggest that the proposed ensemble framework improves robustness against common real-world perturbations such as lighting variation and partial occlusion.
The comparative evaluation showed that the proposed ensemble model outperformed alternative approaches. In particular, the weighted voting mechanism, with weights derived from macro-F1 scores, was more accurate than ensembles using other aggregation techniques, namely averaging and majority voting: accuracy was higher by 1.8% than with averaging and by 0.9% than with majority voting.
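The three aggregation techniques compared above can be sketched as follows. This is a minimal illustration that assumes each model outputs class probabilities and that the weighted variant scales those probabilities by normalized macro-F1 scores; the function name `fuse`, the array layout, and the toy numbers are hypothetical, and the study's exact fusion rule may differ.

```python
import numpy as np

def fuse(probs, method="weighted", weights=None):
    """Fuse per-model class probabilities into predicted class indices.

    probs: array of shape (n_models, n_samples, n_classes).
    weights: per-model scores (e.g. validation macro-F1), used only
             when method == "weighted".
    """
    probs = np.asarray(probs, dtype=np.float64)
    n_classes = probs.shape[2]
    if method == "average":
        # Unweighted mean of the probability vectors.
        return probs.mean(axis=0).argmax(axis=1)
    if method == "majority":
        # Each model casts one hard vote; ties go to the lowest class index.
        votes = probs.argmax(axis=2)                   # (n_models, n_samples)
        counts = np.eye(n_classes)[votes].sum(axis=0)  # (n_samples, n_classes)
        return counts.argmax(axis=1)
    if method == "weighted":
        w = np.asarray(weights, dtype=np.float64)
        w = w / w.sum()                                # normalize to sum to 1
        fused = np.tensordot(w, probs, axes=1)         # (n_samples, n_classes)
        return fused.argmax(axis=1)
    raise ValueError(f"unknown method: {method}")

# Toy example: three models, one sample, two classes (illustrative values).
toy_probs = [[[0.9, 0.1]], [[0.4, 0.6]], [[0.4, 0.6]]]
pred_average = fuse(toy_probs, "average")
pred_majority = fuse(toy_probs, "majority")
pred_weighted = fuse(toy_probs, "weighted", weights=[0.1, 0.9, 0.9])
```

In the toy example, averaging follows the single confident model, whereas majority voting and the weighted variant follow the two models that agree. This shows how the choice of aggregation rule alone can change the final decision.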
The generalization analysis provides insight into the transferability of the model. The cross-validation experiments showed that the ensemble model remained accurate on unseen construction types, with an accuracy of 89.1%. This corresponds to a decrease of 4.7 percentage points relative to the performance obtained on the test set. The ensemble model also remained accurate on disaster events not included in the training set, with an accuracy of 87.6%.
To evaluate cross-disaster generalization, an additional hold-out experiment was conducted in which images associated with selected disaster-event subsets were excluded from the training process and used only for external testing. In this setting, the six CNN models were trained independently on the remaining disaster-event data using the same preprocessing, class-weighting, and two-stage fine-tuning protocol described earlier. The trained models were then evaluated on the held-out disaster-event subset, and their predictions were fused using the same post-fusion weighted voting strategy. The ensemble model achieved an accuracy of 87.6% on disaster events not included in the training set. This result corresponds to a 6.17 percentage point decrease compared with the standard test accuracy of 93.77%, indicating that the model retains reasonable transferability under cross-disaster domain shift. The performance decrease is expected because unseen disaster events may involve different building typologies, imaging conditions, debris patterns, inspection angles, and damage manifestations. Nevertheless, the result suggests that the proposed ensemble framework can generalize beyond the original training distribution, although additional multi-disaster training data would be required to further improve robustness in fully operational post-disaster deployment scenarios.
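The hold-out protocol described above amounts to splitting the data by disaster event rather than by image. A minimal sketch of such a split is given below; the record structure, the `event` field, and the event names are hypothetical placeholders, since the study does not specify how event membership is stored.

```python
def split_by_event(records, held_out_events):
    """Split image records so that held-out events never appear in training.

    records: list of dicts with at least an "event" key (hypothetical schema).
    held_out_events: iterable of event identifiers reserved for external testing.
    """
    held = set(held_out_events)
    train = [r for r in records if r["event"] not in held]
    test = [r for r in records if r["event"] in held]
    # Sanity check: no event contributes to both partitions.
    assert not ({r["event"] for r in train} & {r["event"] for r in test})
    return train, test

# Toy example with placeholder event names.
toy_records = [
    {"image": "a.jpg", "event": "event_A"},
    {"image": "b.jpg", "event": "event_A"},
    {"image": "c.jpg", "event": "event_B"},
    {"image": "d.jpg", "event": "event_C"},
]
train_set, test_set = split_by_event(toy_records, ["event_C"])
```

Splitting at the event level, instead of sampling images at random, prevents images from the same disaster from appearing in both partitions, which is what makes the reported accuracy a measure of cross-disaster transfer.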
5. Conclusions and Future Work
This study proposed an integrated and explainable ensemble deep learning framework for automated post-disaster structural damage assessment. The framework was developed and validated using a large-scale ground-based image dataset collected after the 6 February 2023 Kahramanmaraş earthquakes. The dataset was formed through post-disaster field inspections and includes 13,270 high-resolution images, 24,289 manually labeled structural damage regions, and 15 fine-grained damage categories covering exterior building conditions, load-bearing structural elements, and non-structural components. In this respect, one of the main contributions of the study is not only the proposed model architecture but also the construction and use of a disaster-specific visual dataset that reflects real inspection conditions, class imbalance, and diverse structural damage patterns encountered after a major earthquake.
The proposed framework combines six convolutional neural network architectures, namely, VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB0, and MobileNetV2, within a weighted late-fusion ensemble structure. This design was preferred to exploit architectural diversity and to reduce the limitations of single-model classification in fine-grained damage assessment. To address the severe class imbalance observed in the dataset, weighted categorical cross-entropy was used during training, and the contribution of each CNN model to the final ensemble decision was determined using validation macro-F1 scores. In this way, the proposed framework was designed to improve both overall classification performance and class-balanced recognition, especially for rare but safety-critical damage categories such as collapsed buildings and severely damaged structural elements.
The experimental findings show that the ensemble model achieved a test accuracy of 93.77%, outperforming the best individual CNN model by 2.67 percentage points. The results also indicate that the ensemble strategy improved the recognition of minority classes and provided more stable predictions compared with individual architectures. This is particularly important in post-disaster scenarios, where missing a critical damage class may delay emergency response or lead to incorrect prioritization of field inspections. Therefore, the main contribution of the study to the research topic is the development of a reliable, class-aware, and explainable image-based decision-support framework for structural damage assessment after earthquakes.
Another important contribution of the study is the integration of explainability into the deep learning workflow. Grad-CAM was used as a model-specific explanation technique to visualize the image regions that influenced the CNN-based predictions. These visual explanations allow structural engineers to examine whether the model focuses on damage-relevant regions such as cracks, spalling, deformation, exposed reinforcement, or collapsed structural components. Thus, the proposed framework not only aims to increase classification accuracy but also supports transparency and expert verification, which are essential for the practical use of artificial intelligence in safety-critical civil engineering applications.
Despite these promising results, several limitations should be considered. The dataset was collected from a specific earthquake context, and although it contains a large number of images and diverse damage categories, further validation is required using data from different earthquakes, regions, building typologies, construction materials, and imaging conditions. In addition, the proposed model is primarily based on image-level classification, while future post-disaster assessment systems may benefit from integrating object detection, semantic segmentation, geospatial metadata, temporal inspection records, and multimodal sensor data. Future studies should therefore focus on expanding the dataset with multi-region and multi-disaster samples, improving cross-domain generalization, and deploying the framework as a field-ready decision-support tool that works in collaboration with structural engineering expertise.