1. Introduction
The number of railway bridges is steadily increasing to support the expansion of transportation infrastructure. Consequently, the demand for maintenance and inspection of both newly constructed and existing railway bridges is also on the rise. Zhang [1] projected that bridge maintenance will evolve into a large-scale industry as the number of bridges increases over time. Cracking is a commonly observed issue in concrete bridge structures, and if left unmanaged, it may result in significant economic losses and safety hazards.
As railway bridges are critical infrastructure elements where safety is paramount, they require systematic and periodic inspection. However, the workforce available for maintenance operations has been in continuous decline. Traditionally, inspections are performed manually, and visual inspection remains the most widely used method for monitoring the condition of large structures such as bridges. Cardellicchio [2] emphasized the importance of systematic and periodic visual inspections for monitoring the health of aging bridge infrastructure and evaluated various deep learning techniques aimed at automating this process.
However, the growing number of structures requiring inspection each year makes efficient management increasingly difficult. Manual inspections often require specialized equipment such as aerial ladders, which increases the risk of safety incidents and demands significant time and financial resources. Various alternative approaches are being explored to address these limitations, and the application of unmanned aerial vehicles (UAVs) for inspecting bridges and concrete structures has gained substantial attention.
Fayyad [3] proposed a UAV-based structural health monitoring (SHM) system that utilizes mounted or embedded sensors to collect data and capture images from multiple angles, enabling early and accurate damage diagnosis. Huang [4] introduced a systematic workflow that integrates UAV flight with Building Information Modeling (BIM) to facilitate autonomous inspection in real environments. Ayele [5] highlighted the potential of UAVs as efficient inspection tools, demonstrating their capability to reduce the time and cost of bridge inspections while applying deep learning techniques for damage evaluation. Nasimi [6] suggested using UAV-mounted sensors for measuring bridge displacements, and Torres-Barriuso [7] advocated for UAV-based inspection combined with digital assessment methods to provide reliable and cost-effective data for the maintenance of industrial buildings.
UAVs are particularly useful for inspecting hard-to-reach areas where close-proximity human access is difficult or unsafe. Several studies have explored UAV-based inspection methods for bridges and concrete structures. Hong [8] addressed the limitations of manual inspection by proposing a deep-learning-based model, BDODC-F, for detecting bridge damage. Ma [9] introduced a “double detection + single segmentation” approach combining YOLOv5(x) and U-Net for detecting and segmenting cracks in real bridge environments. These methods aim to enhance the objectivity and efficiency of damage inspections that are traditionally reliant on human judgment.
High-resolution imagery captured by UAVs enables detailed visual analysis of bridge damage. Yu [10] proposed the Residual Linear Attention U-Net (RLAU-Net) to improve crack detection performance. Perera [11] introduced a trained graph neural network (GNN) framework capable of simulating crack propagation, adhesion, and corresponding stress distributions without additional post-processing. Han [12] pointed out the difficulty of simultaneously achieving high precision and high efficiency in bridge crack detection and proposed a new deep-learning-based method tailored to urban bridges. Paramanandham [13] evaluated the performance of four pre-trained networks (AlexNet, VGG16, VGG19, and ResNet-50) for deep-learning-based concrete crack classification.
Crack detection continues to evolve due to the abundance and accessibility of training data, which enhances performance in deep learning models. However, comprehensive structural safety assessment requires consideration of a wider variety of damage types. Accordingly, this study aims to detect not only line-type cracks but also surface-type damage such as concrete scaling/spalling, exposed rebar, water leakage, and paint peeling (damage or detachment of paint coating layers applied to steel components).
Na [14] proposed a multi-class damage detection method suitable for field conditions using U-Net, comparing its performance with Light-Head R-CNN, FPN, and PSP-Net for detecting cracks, concrete scaling/spalling, water leakage, and exposed rebar in railway bridges. In related research, Na [15] also proposed a crack detection method using Gabor filter banks applied to line-scan camera images for detecting damage in railway concrete slabs.
In the road infrastructure domain, Amhaz [16] presented a novel algorithm for automated crack detection in 2D pavement images, while Kyslytsyna [17] proposed the ICGA method, which enhances U-Net with dual attention gates and improves the generator’s segmentation capacity in pix2pix for road surface crack detection. Fan and Zou [18] developed a road crack detection approach based on deep dictionary learning and encoding networks (DDLCNs), introducing a new activation function (MeLU) and a novel differentiable computation method.
Numerous studies have shown that deep neural networks significantly outperform traditional pattern recognition methods in learning-based damage detection. Therefore, the goal of this study is to develop a deep-learning-based detection system capable of identifying multiple damage classes on structural surfaces, enabling a more objective and reliable alternative to conventional visual inspections and facilitating practical deployment in industrial field applications.
2. Data Collection
2.1. Definition of Multi-Class Damage
In this study, the types of damage commonly found in railway bridges were first defined. While most deep-learning-based feature prediction techniques focus primarily on crack detection, a broader range of damage types must be considered to effectively replace manual visual inspections. Hadinata et al. [19] investigated the use of U-Net and DeepLabv3+ to detect three major types of damage (cracks, spalling, and voids) on concrete surfaces. As the number of damage classes increases, appropriate detection methods tailored to each class become essential. Recent studies have proposed attention-based multi-class defect detection models capable of recognizing various types of damage simultaneously. Torres-Barriuso et al. [20] developed a lightweight CNN architecture combined with the CBAM attention mechanism to effectively detect diverse damage types. Such techniques, which enhance classification accuracy under complex background conditions, are essential for future expansion into various infrastructure and industrial domains. As shown in Table 1, railway bridge damage can be broadly categorized into concrete and steel structural components. For each structural type, relevant damage types requiring inspection were selected. Based on these classifications, this study explores a multi-class damage detection method that enables both classification and segmentation of diverse damage types using deep learning models.
In this study, we integrated the SE-ResNeXt50_32x4d encoder into the U-Net framework and compared three loss functions (Cross-Entropy, Focal, and IoU Loss) to enhance multi-class damage detection performance under complex and overlapping conditions. This combination has not been extensively evaluated in UAV-based bridge inspection tasks.
2.2. Damage Image Data Collection Using UAVs
To collect damage image data, a rotary-wing-type UAV was employed due to its ability to perform vertical take-off and landing, maneuver in tight spaces, and hover with stability. In order to capture damage on the underside of railway bridges, the camera was mounted on the top side of the UAV body. The UAV was equipped with two cameras simultaneously: a wide-angle camera for mapping the overall structure and a high-resolution camera for capturing detailed images of localized damage.

The spatial resolution of a captured image is determined by the camera’s image sensor size, focal length, and the distance from the camera to the subject. This relationship defines the Ground Sampling Distance (GSD), the real-world physical size corresponding to one pixel in the image, calculated as the distance between the centers of adjacent pixels projected onto the surface of the target object—in this case, the railway bridge. To ensure reliable detection of fine surface defects such as micro-cracks, it is essential to acquire images under conditions that yield a sufficiently small GSD. In this study, when the UAV-mounted camera was positioned approximately 4 m from the surface of the railway bridge, the resulting GSD was 0.277 mm. This resolution is adequate for detecting cracks as narrow as 0.3 mm, meeting the visual inspection standards typically required in railway infrastructure maintenance.

The UAV maintained a consistent flight path at an average distance of approximately 4 m from the bridge surface. High-resolution images were captured at a rate of one frame every 2 s, with a horizontal field of view (FOV) of 4091.4 mm and a vertical FOV of 2731.4 mm at that distance. To ensure sufficient coverage with approximately 10% overlap between images, the UAV flew at a speed of approximately 2.00 m per second. Each flight typically lasted around 20 min, with battery replacements conducted as needed to continue capturing images along the entire bridge length.
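To make these acquisition parameters concrete, the short sketch below reproduces the arithmetic in Python. The GSD, footprint, frame interval, and overlap values are taken from the text above; the speed relation is the standard survey-planning formula, not the authors’ flight-planning tool, and the assumption that the 4091.4 mm footprint lies along the flight track is ours.

```python
# Illustrative acquisition-geometry arithmetic using the values stated above.
gsd_mm = 0.277           # ground sampling distance at ~4 m stand-off
crack_width_mm = 0.3     # narrowest crack to resolve per inspection standards
print(crack_width_mm / gsd_mm)   # ~1.08 px: a 0.3 mm crack spans about one pixel

# Flight speed for a target forward overlap, given the along-track footprint
# and the 2 s frame interval (standard survey-planning relation).
footprint_mm = 4091.4    # horizontal footprint at 4 m (assumed along-track)
overlap = 0.10           # ~10% overlap between consecutive frames
interval_s = 2.0
speed = footprint_mm * (1.0 - overlap) / interval_s / 1000.0
print(f"required speed ~ {speed:.2f} m/s")   # ~1.84 m/s, near the reported 2 m/s
```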
2.3. Railway Bridge Damage Data Collection
A total of 17 railway bridges were surveyed using UAVs, resulting in the collection of approximately 14,155 images that captured a wide range of complex and diverse damage types. As shown in Figure 1, a dual-camera top-mounted UAV was deployed to capture damage images from actual in-service railway bridges. During the image acquisition process, overview shots of the target bridges were also captured from a distance to provide contextual information, as summarized in Table 2.
A key feature of this study is that all damage data were collected from operational railway bridges, ensuring the practical relevance and applicability of the model training and analysis.
Table 3 presents the number of images categorized by damage type. As shown in the data, there is an abundance of images for cracks, concrete scaling/spalling, and paint peeling, whereas fewer samples were available for water leakage and exposed rebar. Given the characteristics of deep-learning-based analysis, a larger volume of training data generally leads to improved classification and detection performance. However, due to the regular maintenance practices of railway bridge management authorities, who conduct ongoing inspections and repairs to ensure structural safety, it is inherently difficult to acquire large quantities of real-world damage data, especially for less frequently occurring types.
Nevertheless, for the developed model to be effective in practical applications, it is critical to train on data that reflect environmental conditions and factors that may affect detection performance or lead to false positives. To maximize data diversity and volume, this study focused on bridges within the Korean railway network known to exhibit the highest frequency of damage occurrences. As shown in Table 3, the collected images have a resolution of 42.4 megapixels (7952 × 5304 pixels). The dataset includes images with no visible damage, images with a single type of damage, and images containing multiple overlapping damage types. Processing 42.4 MP images posed significant computational challenges. To manage GPU memory efficiently, image patches of 320 × 320 pixels were extracted with overlap during training and inference. Additionally, inference time was optimized through parallel processing across multiple GPUs.
To address the class imbalance caused by the relatively small number of images for leakage and rebar exposure, an oversampling strategy was applied to these classes. Specifically, data augmentation techniques were employed, including random rotations and spatial relocation of damage patterns within the images. By synthetically increasing the quantity and diversity of training samples for these underrepresented classes, the model’s ability to recognize them in field conditions was significantly improved.
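A minimal version of this augmentation might look as follows, assuming image/mask pairs stored as NumPy arrays; the rotation set and translation range are illustrative, not the study’s exact settings.

```python
import random
import numpy as np

def augment_minority(image: np.ndarray, mask: np.ndarray):
    """Random 90-degree rotation plus a random translation of the damage
    pattern, applied jointly to the image and its label mask. A minimal
    stand-in for the oversampling described above."""
    k = random.randint(0, 3)                      # random quarter-turn
    image, mask = np.rot90(image, k).copy(), np.rot90(mask, k).copy()
    # Spatial relocation: shift the damage pattern to a new position.
    dy, dx = random.randint(-40, 40), random.randint(-40, 40)
    image = np.roll(image, (dy, dx), axis=(0, 1))
    mask = np.roll(mask, (dy, dx), axis=(0, 1))
    return image, mask

# Oversampling: repeat augmented leakage / exposed-rebar samples until their
# counts approach those of the majority classes.
```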
2.4. Training of Railway Bridge Damage Data
Image data representing various classes of damage observed in different structural components were collected and used to develop a deep-learning-based damage detection and prediction model. These datasets were utilized for supervised learning.
Table 4 shows sample ground-truth (GT) annotations used for training and evaluation, drawn from full-resolution UAV images. Cracks were annotated using line segments, while other damage types, such as concrete spalling, exposed rebar, and paint peeling, were labeled as polygons to capture their detailed spatial extent. All annotated regions were precisely labeled to reflect the true geometry and boundaries of each damage class. The collected images were manually annotated based on the Korean Railway Bridge Inspection Manual, and damage regions were labeled only when clearly visible defects were identified in the images. These examples reflect realistic conditions in which multiple damage types and background noise coexist, which helps improve robustness in practical inspections.
3. Proposed Model
The conventional method of bridge inspection relies on the subjective visual judgment of inspectors. This study aims to enhance the efficiency and objectivity of such inspections by introducing a deep-learning-based approach. High-resolution imagery acquired via UAV-mounted cameras was utilized, and a convolutional neural network (CNN) architecture was employed to perform automatic multi-class damage detection and classification tailored to the structural characteristics of Korean railway bridges. Specifically, different loss functions were applied and evaluated to determine their effectiveness in improving detection performance. A relevant study by Prasetyo [21] proposed an automated monitoring system for infrastructure using a U-Net framework integrated with a novel loss function called DBCE Loss. Infrastructure structures, which are exposed to natural environmental conditions over long periods, are often subject to surface contamination and discoloration. Moreover, UAV-acquired images frequently include complex backgrounds, such as trees, vegetation, and soil, which may hinder accurate detection. This study focuses on evaluating detection models that are robust enough to be deployed in real-world industrial environments, where such challenges are prevalent. Su [22] conducted an in-depth study on bridge crack detection and proposed a U-Net-based crack identification algorithm named CBAM-U-Net, which met inspection standards by measuring the maximum crack width and length with errors of only 1–6% and 1–8%, respectively. Similarly, Fan [23] introduced SA-U-Net, which integrates attention mechanisms with multi-scale feature extraction to effectively improve classification accuracy. These studies exemplify ongoing efforts to enhance defect detection performance through modifications and extensions of the U-Net architecture.
3.1. U-Net Architecture
The U-Net architecture is a fully convolutional network (FCN) originally developed for biomedical image segmentation. Proposed by Ronneberger et al. [24], U-Net is an end-to-end learning model designed to perform pixel-wise segmentation with high precision. The model extends the concepts introduced by Long and Shelhamer [25], incorporating contemporary classification networks such as AlexNet, VGGNet, and GoogLeNet into a symmetric encoder–decoder structure.
Table 5 presents the experimental environment, including the software and hardware configurations used in this study. Based on this environment, the U-Net-based deep learning model was trained and evaluated for damage segmentation and analysis.
As shown in Figure 2, the network is named “U-Net” due to its U-shaped architecture. The model consists of a contracting path (encoder) that captures contextual information and a symmetric expanding path (decoder) that enables precise localization. This structure allows U-Net to learn detailed spatial features, making it suitable for fine-grained segmentation tasks in complex environments such as damaged bridge surfaces.
The U-Net model follows a two-part structure comprising an encoder that captures essential features and a decoder that reconstructs spatial details. At the input stage, convolutional layers extract key representations from the image. Downsampling through pooling reduces data dimensionality while preserving crucial characteristics. The decoder then performs upsampling operations to progressively recover the original resolution, enabling precise localization and producing a pixel-wise prediction of the target classes, as sketched below.
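For readers unfamiliar with the architecture, the following is a minimal two-level U-Net in PyTorch illustrating the encoder, decoder, and skip connection described above. It is a didactic reduction, not the full network used in this study; the class count assumes five damage types plus background.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    """Two 3x3 convolutions with batch norm and ReLU, the basic U-Net unit."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    """Two-level U-Net: the encoder halves resolution to gather context, the
    decoder restores it, and the skip connection re-injects fine detail."""
    def __init__(self, n_classes=6):              # 5 damage classes + background
        super().__init__()
        self.enc1, self.enc2 = block(3, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)                 # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)   # pixel-wise class logits

    def forward(self, x):
        s1 = self.enc1(x)                         # full-resolution features
        s2 = self.enc2(self.pool(s1))             # context at 1/2 resolution
        d1 = self.dec1(torch.cat([self.up(s2), s1], dim=1))
        return self.head(d1)                      # (N, n_classes, H, W)
```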
3.2. Improving Recall Through the Application of Loss Functions
This study investigates appropriate analysis methods for detecting damage in structural environments such as railway bridges. In UAV-acquired imagery, the majority of the pixels often represent background rather than damage regions. Therefore, it is crucial to focus on adjusting the learning weights associated with background classes, which are typically classified as non-damage.
Conventional analysis models using loss functions such as Binary Cross-Entropy (BCE) distinguish between damage and non-damage by training on binary classifications. However, such methods are often limited by class imbalance, especially when the background is introduced as an additional class. To address this, the Focal Loss function was employed, which penalizes the model more heavily for misclassified damage regions while reducing the loss contribution from easily classified background samples. This mechanism down-weights well-classified examples and increases the emphasis on harder, misclassified instances.
Kalsotra [26] compared various U-Net-based approaches using Focal Loss, Dice Loss, and hybrid loss functions to improve feature extraction in complex backgrounds, concluding that hybrid loss functions are more suitable for foreground detection tasks. In this study, Focal Loss and Intersection over Union (IoU) Loss functions were applied and comparatively evaluated to determine the most effective approach for multi-class damage detection in railway bridge structures.
Conventional object detection models typically rely on binary cross-entropy (BCE) loss for classification. However, this approach is prone to performance degradation in scenarios involving significant class imbalance, as is often the case in structural damage datasets. In particular, easily classified negative samples (i.e., background) tend to dominate the training process, thereby diminishing the model’s sensitivity to rare or subtle damage classes. To mitigate this issue, Focal Loss was introduced by Lin et al. [27], incorporating a modulating factor that reduces the relative loss contribution of well-classified examples: FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The mechanism applies both a weighting factor (α) and a scaling exponent (γ), which allow the model to focus more effectively on hard-to-classify samples. In this study, we implemented Focal Loss in conjunction with a U-Net architecture to improve the model’s capability in detecting minority damage classes under challenging environmental conditions.
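A compact PyTorch rendering of this loss is given below. The α = 0.25 and γ = 2 values are the common defaults from Lin et al., not necessarily the values tuned in this study, and the scalar α is a simplification of the class-dependent α_t.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma so
    that well-classified pixels contribute little and hard, misclassified
    pixels dominate the gradient.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class labels.
    """
    ce = F.cross_entropy(logits, target, reduction="none")  # -log(p_t) per pixel
    p_t = torch.exp(-ce)                                    # recover p_t
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```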
In addition to classification loss optimization, we explored IoU (Intersection over Union) Loss, also referred to as Jaccard Loss, as a spatially aware optimization objective. IoU Loss measures the degree of overlap between predicted and ground-truth segmentation masks, offering a more holistic assessment of model performance in pixel-wise tasks [28]. Unlike traditional pixel-wise losses, IoU Loss does not require explicit balancing between foreground and background and instead operates on the geometric congruence of the predicted masks. This makes it particularly useful for semantic segmentation tasks involving multi-class surface defects. Both Focal Loss and IoU Loss were incorporated into our multi-class U-Net training pipeline, and comparative experiments showed that their integration substantially improved the recall and segmentation accuracy for low-frequency damage classes.
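A typical soft (differentiable) formulation of IoU Loss for multi-class masks is sketched below; the exact variant used in the paper may differ in detail.

```python
import torch
import torch.nn.functional as F

def iou_loss(logits, target, n_classes=6, eps=1e-6):
    """Soft IoU (Jaccard) loss: 1 - |P intersect G| / |P union G|, computed on
    softmax probabilities against one-hot ground truth and averaged over
    classes. eps guards against empty classes in a batch."""
    probs = torch.softmax(logits, dim=1)                          # (N, C, H, W)
    onehot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))                   # per-class overlap
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2, 3))  # soft union
    return (1.0 - (inter + eps) / (union + eps)).mean()
```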
3.3. Multi-Class Damage Detection Process Using U-Net
Traditional image processing techniques, such as pattern recognition and edge detection, have long been utilized for surface defect identification. However, for concrete structures, these methods often fall short in practical applications due to high false positive rates caused by surface contamination, discoloration, and environmental noise. To overcome these limitations, this study proposes a deep-learning-based segmentation pipeline centered on the U-Net architecture, as shown in Figure 3, which highlights the role of U-Net in the core prediction module.
The pipeline includes data preprocessing and normalization, followed by supervised training using manually labeled ground-truth (GT) data to define damage regions. The U-Net model serves as the backbone of the prediction stage, consisting of an encoder–decoder structure. The encoder utilizes SE-ResNeXt50_32x4d to enhance feature extraction capabilities by integrating squeeze-and-excitation (SE) blocks with grouped convolutions, enabling the network to effectively capture both local and global contextual features from UAV-acquired images. The decoder reconstructs full-resolution segmentation maps for precise pixel-wise classification.
This U-Net-based segmentation module is emphasized in the center of the system diagram (Figure 3), where detection, segmentation, and classification are performed. To further improve detection performance under class-imbalanced conditions, Focal Loss is employed to focus training on hard-to-classify samples, while IoU Loss enhances spatial accuracy by maximizing the overlap between predicted masks and the GT regions. Applying each of these loss functions results in significant improvements in the model’s recall and precision, particularly for underrepresented damage types in the multi-class segmentation framework.
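If implemented with the widely used segmentation_models_pytorch package, the model and loss configuration described here could be assembled as follows. This is an assumed construction consistent with the text, not the authors’ released code; the encoder name and class count follow the paper, while the remaining settings are library defaults.

```python
import segmentation_models_pytorch as smp

# U-Net decoder on top of the SE-ResNeXt50_32x4d encoder described above.
model = smp.Unet(
    encoder_name="se_resnext50_32x4d",   # SE blocks + grouped convolutions
    encoder_weights="imagenet",          # pretrained encoder initialization
    in_channels=3,
    classes=6,                           # 5 damage types + background (assumed)
)

# The two loss functions compared in this section, in their smp forms.
focal = smp.losses.FocalLoss(mode="multiclass")
jaccard = smp.losses.JaccardLoss(mode="multiclass")   # IoU Loss
```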
The U-Net-based process, as described, analyzes input images without the need for predefined class selection. The final detection results are overlaid onto the original input images, allowing for clear visualization of the segmented regions. Detected damages are both classified and segmented according to their respective damage classes.
4. Performance Evaluation and Validation
4.1. Evaluation Method
Figure 4 illustrates the evaluation methodology used to assess the performance of the proposed damage detection model. The overall process is divided into two main stages: model training and model validation. In the training phase, 80% of the annotated image dataset was used to generate ground-truth (GT) data for supervised learning. The remaining 20% of the images were reserved as validation data to test the model’s predictive performance. This split ensures that the model is evaluated on unseen data, providing an objective measure of its generalization capability.
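The image-level 80/20 partition can be reproduced in a few lines; the directory layout and random seed below are illustrative assumptions.

```python
import random
from pathlib import Path

random.seed(42)                                        # fixed seed for reproducibility
images = sorted(Path("dataset/images").glob("*.jpg"))  # directory layout is illustrative
random.shuffle(images)
cut = int(0.8 * len(images))                           # 80% for GT-based training
train_set, val_set = images[:cut], images[cut:]        # 20% held out for validation
```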
To validate the effectiveness of the proposed method, this study employed a confusion matrix, a widely used evaluation tool in deep-learning-based image analysis tasks [9,10,16,21]. A confusion matrix provides a comprehensive summary of the performance of a classification model by quantifying correct and incorrect predictions across different classes.
As shown in Table 6, the confusion matrix consists of four key components: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), which are used to compute the evaluation metrics.
The evaluation metrics derived from the confusion matrix in Table 6 are defined by the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Such evaluation methods require careful consideration when the dataset contains a disproportionately large amount of non-damage (background) data. Akshay [29] emphasized that although background regions may be well classified due to their prevalence in training data, this can result in artificially high accuracy scores while actual damage goes undetected. Consequently, this can lead to low detection sensitivity and reduced reliability in real-world applications.
To overcome the limitations of using accuracy as a sole performance metric, it is essential to incorporate additional evaluation indicators, such as the Precision, Recall, and F1-Score. Precision quantifies the proportion of correctly identified damage among all predicted damage regions. Recall measures the model’s ability to detect all actual damage instances. The F1-Score provides a balanced harmonic mean of the Precision and Recall, offering a more comprehensive assessment of the model’s effectiveness.
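For reference, all four indicators follow directly from the raw confusion-matrix counts, e.g.:

```python
def metrics(tp, fp, tn, fn):
    """Compute the four indicators above from confusion-matrix counts,
    guarding against division by zero for empty classes."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```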
In this study, the performance of the proposed method was evaluated using these metrics across five distinct damage classes. As illustrated in Figure 5a, ground-truth annotations were carefully defined for evaluation purposes. Each damage instance was manually labeled based on its type: cracks were annotated using line drawings, while concrete scaling/spalling, exposed rebar, water leakage, and paint peeling were annotated using polygonal regions to accurately delineate their spatial extent.
Figure 5b presents sample images with the predicted damage regions overlaid, as identified by the trained model using 80% of the dataset for ground-truth (GT)-based training. These predictions reflect the model’s ability to detect damage based on learned features.
Figure 5c illustrates the overlap between the predicted damage regions and the corresponding ground-truth annotations. This comparison visually demonstrates the agreement between the predicted outputs and the manually labeled actual damage areas.
As shown in Figure 6, the evaluation involved determining whether each type of damage was correctly detected by analyzing the overlapped images. The objective was to calculate the detection rate for each damage class. However, in the case of cracks, which often appear irregular, continuous, or fragmented, it is difficult to assess detection accuracy based solely on discrete object counts.
To address this limitation, the overlapped images were divided into a 10 × 8 grid (horizontal × vertical) for localized evaluation. Within each subdivided region, the presence or absence of damage was compared between the predicted output and the ground truth. If the predicted damage region overlapped with the actual damage in a sub-image, it was counted as a true positive (TP). If the model failed to detect damage in a region where ground truth was present, it was counted as a false negative (FN). Conversely, if the model predicted damage in a region where none was present, it was recorded as a false positive (FP), and if it correctly identified a non-damage region, it was counted as a true negative (TN).
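The grid-based scoring just described can be expressed compactly; the sketch below assumes per-class boolean masks for the prediction and the ground truth, which is our reading of the procedure rather than the authors’ exact code.

```python
import numpy as np

def grid_confusion(pred_mask: np.ndarray, gt_mask: np.ndarray,
                   cols: int = 10, rows: int = 8):
    """Split the overlap image into a 10 x 8 grid and compare damage
    presence per cell, accumulating TP/FP/TN/FN as described above."""
    h, w = gt_mask.shape
    tp = fp = tn = fn = 0
    for r in range(rows):
        for c in range(cols):
            cell = (slice(r * h // rows, (r + 1) * h // rows),
                    slice(c * w // cols, (c + 1) * w // cols))
            pred, gt = pred_mask[cell].any(), gt_mask[cell].any()
            if gt and pred:
                tp += 1        # damage present and detected
            elif gt:
                fn += 1        # damage present but missed
            elif pred:
                fp += 1        # damage predicted where none exists
            else:
                tn += 1        # non-damage region correctly ignored
    return tp, fp, tn, fn
```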
4.2. Damage Detection and Classification Results
Table 7 presents the prediction results obtained under various environmental conditions, including surface contamination and natural elements that may be misidentified as damage. These results were generated to evaluate the applicability of the proposed model in real-world railway bridge inspection scenarios. Based on a comparative evaluation of different loss functions within the proposed U-Net framework, the IoU Loss function was selected as the most appropriate optimization method. The model was trained and analyzed using this approach, which led to the results shown in Table 7.
The results demonstrate that the model applying IoU Loss consistently outperformed those using other loss functions in terms of detection reliability and classification robustness. In particular, it showed superior capability in distinguishing true damage from environmental noise and background elements, an essential feature for practical deployment in field conditions. This suggests that the proposed deep learning model, when optimized with IoU Loss, is highly effective for multi-class damage detection in complex real-world settings.
To evaluate the performance of the proposed IoU Loss function, it was compared with the results obtained using Cross-Entropy Loss and Focal Loss functions.
Table 8 summarizes the detection and classification performance metrics—Accuracy, Precision, Recall, and F1-Score—for each loss function.
In image datasets where background regions significantly outnumber damaged areas, evaluation metrics such as Accuracy and Precision tend to be overestimated. This was confirmed to be inadequate for assessing the performance of models intended for structural damage inspection. Specifically, these metrics may appear high even when actual damage is not detected, as the model correctly classifies most of the background as non-damage. While this indicates good background discrimination, it does not align with the primary objective of infrastructure maintenance, which is to accurately detect and assess damage. Therefore, Recall and the F1-Score are more appropriate metrics for performance comparison in this context.
As shown in Table 8, for damage types with relatively large amounts of training data, such as cracks, the differences across loss functions were relatively small (Recall ranging from 0.831 to 0.866, F1-Score from 0.776 to 0.869). However, for more complex and underrepresented classes, particularly concrete scaling/spalling and leakage, the proposed IoU Loss model demonstrated substantial improvements: for concrete spalling, the Recall increased from 0.355 (Cross-Entropy) to 0.950 (IoU Loss), a 167.6% relative improvement. For leakage, the Recall rose from 0.262 to 0.922, yielding a 251.5% increase, while the F1-Score more than doubled, from 0.374 to 0.929.
These results strongly validate the IoU Loss function’s ability to enhance the detection of minor or partially visible damage areas that conventional methods often fail to capture. While Focal Loss also led to notable gains in these categories (e.g., Recall of 0.807 for concrete spalling and 0.846 for leakage), limitations remained in detecting rare classes such as exposed rebar, which only improved marginally in Recall (from 0.740 to 0.760) under Focal Loss.
To address these residual shortcomings, the IoU Loss function, which directly optimizes the spatial overlap between predictions and the ground truth, proved effective, achieving the highest average Recall (0.914) and F1-Score (0.913) across all classes. Consequently, IoU Loss is proposed as a more robust optimization strategy for multi-class surface damage segmentation in real-world railway bridge inspection scenarios.
5. Conclusions
Periodic visual inspections are currently conducted to maintain the structural integrity of railway bridges. These inspections typically require human inspectors to visually assess damage using specialized equipment, such as aerial ladders. However, this traditional approach poses safety risks and requires substantial time and budget. Therefore, a more efficient and reliable visual inspection method utilizing unmanned aerial vehicles (UAVs) is necessary.
This study proposed and evaluated a damage detection model based on the U-Net deep learning framework, using UAV-captured imagery to replace manual inspections. The goal was to identify a method suitable for detecting and classifying multiple types of damage under real-world environmental conditions, even with limited training data. To improve performance, the study compared the proposed U-Net model with IoU Loss to models using Cross-Entropy Loss and Focal Loss, aiming to develop a robust multi-class damage detection approach specifically tailored to the needs of Korea’s infrastructure maintenance industry.
Five major types of surface damage—cracks, concrete scaling/spalling, water leakage, exposed rebar, and paint peeling—were used to generate ground-truth (GT) training data and evaluate detection performance. A field-applicable damage detection process was developed by implementing three different loss functions in a U-Net architecture to detect and classify multi-class damage on railway bridges. The key conclusions are summarized as follows:
The proposed model successfully detected and classified five key types of damage commonly assessed during visual inspections for railway bridge safety, even under complex and overlapping conditions.
The U-Net model trained with Focal Loss outperformed the one using Cross-Entropy Loss, particularly in detecting concrete spalling and water leakage. The Recall values of the Cross-Entropy-based model for cracks, concrete spalling, exposed rebar, leakage, and paint peeling were 0.776, 0.452, 0.851, 0.374, and 0.903, respectively. With Focal Loss, these improved to 0.834, 0.807, 0.760, 0.846, and 0.965, indicating substantial enhancement, especially in underrepresented classes. The F1-Scores were also noticeably improved.
The application of Focal Loss helped mitigate issues arising from class imbalances in training data. By adjusting class weights, the model proved suitable for field scenarios where data for certain damage types may be scarce.
Despite these improvements, challenges remained in effectively learning underrepresented damage classes. To address this, the IoU Loss function was applied to strengthen weighting on overlapping areas, resulting in further performance gains.
The final model, which applied the IoU Loss function to the U-Net architecture, achieved an average Recall of 0.914 and an F1-Score of 0.913, representing a 34.2% improvement in Recall and a 26.0% improvement in the F1-Score compared to the baseline U-Net model.
The proposed IoU-Loss-based U-Net model effectively detects and classifies complex, multi-class damage using UAV imagery under real-world environmental conditions, with strong resistance to false positives caused by noise or natural background features. This method provides a more objective and precise alternative to conventional manual inspections. As the railway infrastructure maintenance sector increasingly adopts UAV-based automation for structural inspections, the proposed approach is expected to contribute to improved accuracy, reduced inspection time, and cost savings, ultimately supporting the advancement of the industry.
While this study focused on five surface damage types, future work should expand to include efflorescence and corrosion of steel components, which are critical to structural safety. Continued validation and testing will be pursued with the goal of fully replacing manual visual inspections in the field.
To ensure practical applicability in industry-level maintenance practices, a large volume of labeled field data were collected from diverse environments and used to train the deep learning model. As a result, the proposed method achieved an average Recall of 0.914, demonstrating industry-level performance with high detection sensitivity. This result highlights the method’s potential as a practical and deployable solution for reliable multi-class damage detection in UAV-based railway bridge inspections.