Performance Comparison of Deep Learning Models for Damage Identification of Aging Bridges

Abstract: Currently, damage in aging bridges is assessed visually, which requires significant personnel, time, and cost. Moreover, the results depend on the subjective judgment of the inspector. Machine-learning-based approaches, such as deep learning, can address these problems. In particular, instance-segmentation models have been used to identify different types of bridge damage. However, the value of deep-learning-based damage identification may be reduced by insufficient training data, class imbalance, and model-reliability issues. To overcome these limitations, this study used photographic data from real bridge-management systems for the inspection and assessment of bridges as the training dataset. Six types of damage were considered. The performances of three representative deep learning models (Mask R-CNN, BlendMask, and SWIN) were compared in terms of loss-function values. SWIN showed the best performance, achieving a loss value of 0.000005 after 269,939 training iterations. This shows that bridge-damage-identification performance can be maximized by setting an appropriate learning rate and using a deep learning model with a minimal loss value.


Research Background
Bridges sustain structural damage and undergo wear and tear over time. Consequently, their strength and stability deteriorate, increasing the risk of accidents. As bridges are major road facilities used over a long period, extending their service life through proper management and maintenance is essential. Restricting traffic on an aging bridge (or closing the bridge entirely) because of safety concerns may have a significant and adverse societal and economic impact. Hence, mandatory periodic inspections and maintenance activities are conducted consistently. However, the current method, which relies on on-site inspection by human workers, has several drawbacks. First, there is a shortage of inspectors. Most bridges are large and complex; hence, field inspection and diagnosis require significant investments of personnel and time. If the number of qualified professionals available is limited or insufficient, it can be difficult to inspect all bridges accurately. Second, the method is fundamentally subjective. Because different inspectors may assess the condition of a given bridge differently, based on their experience and knowledge, the results may be inconsistent. Third, as bridges are frequently located in remote places, conducting personnel-based on-site inspections is time-consuming and expensive. Lastly, structural defects in aging bridges often occur in very hard-to-access areas of the bridge. The risky work environment during on-site inspections means that the safety of the inspectors is a major issue.
For all these reasons, there is a growing demand for methods that use new technologies to inspect aging bridges. For example, a system that remotely monitors the condition of bridges using sensors and Internet-of-Things (IoT) technologies, analyzing data and detecting anomalies in real time, can assist in providing accurate information and building early warning systems [1]. IoT refers to the technology and concept whereby everyday objects around us are connected to each other through the Internet and exchange information. Non-destructive examination technologies, such as ultrasound and X-rays, can quickly check for internal defects and damage within structures without causing further damage themselves; moreover, they enable precise analysis of data and prediction of defects [2]. Research is also underway on a defect recognition model that recognizes four typical defects in phased-array ultrasonic testing (PA-UT) images of electrofusion (EF) joints [3]. If big data and artificial intelligence (AI) technologies are used, it is possible to identify patterns and trends from large datasets and establish preventive maintenance plans [4][5][6]. One study used an end-to-end deep learning model based on a core-regression model to analyze and verify ground-penetrating radar (GPR) images for locating underground utilities (UU), and was able to optimize the model parameters [7]. Drones and robot-based systems can inspect and evaluate the condition of structures, even in low or hard-to-access areas. Drones capture photos and image data from the air, which are then used to assess the overall condition of structures; robots can enter narrow or hazardous areas to perform detailed defect detection and take appropriate corrective measures [8].
Studies are presently underway to identify and quantify bridge damage by analyzing image data captured by drones or by on-site inspectors [9]. Similar studies are being conducted in various fields; the goal is to identify the damage automatically and calculate its extent through the analysis of inspection images. Image-based crack detection and classification is an especially important research topic. Many studies have focused on methods to learn and detect crack patterns using convolutional neural networks (CNNs), which are a type of deep learning algorithm [10,11]. Research is also being conducted to quantify the size and area of defects using image processing and computer vision technology; this enables not only accurate calculation of the extent of damage but also trend analysis. Further, efforts are being made to use images for three-dimensional modeling of facilities and the detection of deformations in them; changes in the shape of structures and the extent of deformations can be tracked and assessed in this way. Combining visible-wavelength image data with sensor data for other wavelengths (e.g., radar and infrared) can yield more accurate and comprehensive damage assessments. In recent years, identifying and predicting facility damage using AI technologies, especially algorithms such as deep learning and reinforcement learning, has become a highly active field of research.
However, the use of deep learning technology is not without problems. Performance varies depending on the training images, and classification can be challenging when the features of the target classes are limited or indistinct. For example, when classifying types of bridge damage such as efflorescence, corrosion, cracks, concrete scaling, and concrete spalling, deep learning may perform well in distinguishing efflorescence, which appears as a white powdery residue on bridges, from other damage types. However, it may have more difficulty differentiating among cracks, concrete scaling, and spalling, which are all similar in appearance.
Another problem is that previous studies have predominantly focused on cracks, the most common type of damage in concrete structures. This has resulted in a scarcity of training data for other types of bridge damage, such as water leaks, efflorescence, concrete scaling, concrete spalling, and corrosion of reinforcing steel. The nature of deep learning technology dictates that classification accuracy improves with an increase in training data [12].

Scope and Methods of Research
In this study, we apply three representative deep learning models (Mask R-CNN, BlendMask, and SWIN) that can automatically identify six typical types of damage found in aging bridges: cracks, water leaks, efflorescence, concrete scaling, concrete spalling, and corrosion of reinforcing steel. We analyze the performance of these models by comparing the loss values obtained when training each model individually. The loss value measures the difference between the predicted results and the ground truth during training, and thus indicates how accurately the model segments objects and identifies instances. Commonly used loss functions include pixel-wise cross-entropy loss and Dice loss, both of which are computed from the pixel-level agreement between the predicted and actual masks. Dice loss ranges between 0 and 1, whereas cross-entropy loss is non-negative with no fixed upper bound; in both cases, a lower value indicates better performance. To minimize the loss value, the model is updated to decrease the difference between the mask predicted from the training data and the actual mask. Through this process, the model learns to segment object boundaries and internal regions accurately, training on features that allow individual object instances to be distinguished. The loss value mainly distinguishes between the background and foreground (objects), taking into account classification accuracy and pixel-level agreement for both regions. Moreover, weight adjustments for overlapping areas can be incorporated to address overlaps between adjacent objects. In short, the loss value in an instance-segmentation model is important for evaluating and optimizing the model's performance. By minimizing this value, we can increase the model's ability to segment and recognize objects with high accuracy and detail.
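The two mask losses named above can be illustrated with a minimal sketch. It assumes masks are given as nested lists of per-pixel foreground probabilities; the function names and the small example masks are hypothetical and are not taken from the models or data used in this study.

```python
import math

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(p*t) / (sum(p) + sum(t)), which lies between 0 and 1."""
    intersection = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_sum = sum(p for pr in pred for p in pr)
    target_sum = sum(t for tr in target for t in tr)
    return 1.0 - (2.0 * intersection + eps) / (pred_sum + target_sum + eps)

# Hypothetical 2x3 ground-truth mask and two predictions (illustrative values only).
ground_truth = [[0, 1, 1], [0, 0, 1]]
good_prediction = [[0.1, 0.9, 0.8], [0.2, 0.1, 0.9]]
poor_prediction = [[0.6, 0.4, 0.5], [0.7, 0.5, 0.3]]

for name, prediction in [("good", good_prediction), ("poor", poor_prediction)]:
    print(name, binary_cross_entropy(prediction, ground_truth), dice_loss(prediction, ground_truth))
```

In this sketch the prediction that agrees more closely with the ground-truth mask yields the lower value for both losses, which is the behavior the comparison in this paper relies on.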
The paper is organized as follows: Section 2 introduces studies relevant to this paper; Section 3 describes the proposed technology; Section 4 explains the experimental results and evaluates the performance of the proposed technology; and Section 5 presents the conclusions.

Bridge-Damage Identification Using Deep Learning
One previous study that attempted to develop an automated bridge-inspection system proposed using an unmanned aerial vehicle for bridge inspections and applying deep learning algorithms to the results [13]. Other studies used deep learning to detect cracks in concrete structures and used CNNs to develop a model for detecting damage in bridge structures, achieving high performance in identifying damaged areas from images of entire bridges [14,15]. Bukhsh et al. demonstrated that transfer learning could improve bridge-damage-detection models, even with a limited dataset [16]. They proposed a real-time bridge-damage-detection system using deep learning, employing Faster R-CNN to detect damage in bridge images and evaluate structural integrity instantly. Recent studies have also explored using generative adversarial networks for image generation and data augmentation to detect the extent of actual structural damage more accurately [17].
In addition, research focused on automatic crack detection in concrete using CNNs has led to the development of a deep learning-based method for detecting road defects from ground-penetrating radar images [18]. Avci et al. have discussed a method for identifying structural bridge damage using deep learning with frequency-response-function data [19]. Another proposed model, CrackNet, is a deep learning-based approach for crack detection in structural materials that has produced high-accuracy results for various crack types and defects [20].

Loss Values of Deep Learning Models
Attempts to reduce the loss values of deep learning models have been made in various fields. First, one study has been conducted on normalization techniques, which control the complexity of the model and enhance its generalization performance [21]. Both batch normalization and layer normalization can improve the stability and learning speed of neural networks, contributing to a reduction in loss values. In addition, data augmentation, a technique that artificially transforms training data to increase diversity, is used. By applying transformations such as rotation, translation, and scaling, the training dataset is expanded to improve the model's generalization performance. Regularization alleviates overfitting by imposing penalties on model complexity [22]. Methods such as L1 and L2 regularization and dropout are used to constrain the network's weights or reduce overfitting by randomly deactivating some units. It is also crucial to set the initial weights so as to facilitate efficient learning and exploration of optimal solutions, which aids loss value reduction. Optimizer algorithms, used when updating the model's parameters, leverage gradient-descent methods to minimize loss values. Adam and RMSprop are widely used optimizers because of their efficiency in updating parameters. The architecture of the neural network itself also affects the reduction of loss values. Innovative designs, such as CNNs, residual connections, and attention mechanisms, have generated considerable attention. In the present study, we aim to compare the performance of three models with different neural network architectures by comparing their loss values.
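As an illustration of how an optimizer such as Adam drives a loss value down, the following is a minimal single-parameter sketch of the standard Adam update rule. The function name, the toy loss (w - 3)^2, and the step size of 0.1 are assumptions chosen for illustration; they are not the settings used in this study.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-dimensional loss with Adam, given a function returning its gradient."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss (w - 3)^2 with gradient 2 * (w - 3); its minimum lies at w = 3.
final_w = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(final_w, (final_w - 3.0) ** 2)
```

Because each update is scaled by the running estimate of the squared gradient, the step size adapts per parameter, which is one reason such optimizers are often preferred over plain gradient descent.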

Proposed Framework
This paper proposes a deep learning framework for detecting damaged objects in bridges [12]. Initially, object-detection performance for each damage type is enhanced by using super-resolution (SR) techniques to improve and normalize resolution, thereby augmenting the dataset's diversity and consistency. The SR technique makes it possible to build a better learning set because it maintains the quality of the images used for learning. Thereafter, we construct an optimized detection model tailored for each damage type using the bridge-damage-identification deep learning combination-module, which is based on separate training for each type. These models are then integrated into a single model and presented as an optimized solution for detecting damaged objects in bridges. The framework is specifically designed for six types of bridge damage: efflorescence, concrete scaling, concrete spalling, cracks, corrosion, and water leaks. The architecture of this framework is shown in Figure 1. The deep learning models used in combination are Mask R-CNN, BlendMask, and SWIN.

Mask R-CNN
Mask R-CNN, akin in architecture to Faster R-CNN, can perform object detection and instance segmentation simultaneously. It uses RoIAlign to segment object boundaries within the region of interest (RoI) more accurately. However, this enhanced accuracy in pixel-level segmentation requires substantial GPU memory and computational capability [23]. Mask R-CNN consists of three main components: a Region Proposal Network (RPN), RoIAlign, and a mask head.

BlendMask
BlendMask is a model capable of performing semantic-segmentation and instance-segmentation tasks simultaneously. It features two branches: an instance-mask branch (IMB) and a semantic branch (SMB) [24]. It delivers high accuracy and speed, owing particularly to the IMB's use of a RoI transformer and soft proposal generator to extract precise object boundaries. It achieves state-of-the-art performance on the COCO dataset, outperforming other models in terms of accuracy and speed. BlendMask's pipeline is shown in Figure 3. Although Mask R-CNN demonstrates high performance, its high computational demands and GPU memory requirements pose challenges for real-time tasks. YOLO offers fast processing but is inferior in segmentation accuracy to BlendMask or Mask R-CNN. BlendMask is used in various fields because it ensures high accuracy and speed. Compared to other instance-segmentation models, BlendMask has several advantages:


High Accuracy: BlendMask's superior performance on the COCO dataset stems from its two-branch structure (IMB and SMB) and the integration of a RoI transformer and soft proposal generator, which together enhance the accuracy of segmentation boundaries.


High Speed: The use of a RoI transformer, instead of traditional RoI pooling, contributes to BlendMask's fast processing capabilities. Additionally, its ability to maintain accuracy with fewer parameters enables high performance in memory-constrained environments.


Scale Invariance: BlendMask's scale invariance ensures consistent performance across multiple image sizes, a feature influenced by the pre-trained models; this allows BlendMask to maintain high accuracy at various image sizes.


Multi-Object Segmentation: BlendMask's capability to segment multiple objects within a single image is a considerable advantage in both commercial processes and computer vision.


Small Training-Data Requirement: Typically, instance-segmentation models necessitate extensive labeling of training data, particularly in specific domains. BlendMask can achieve high performance with a limited dataset by using techniques that reduce the demand for a large amount of labeled data.
For these reasons, BlendMask is an outstanding instance-segmentation model.

SWIN
Swin Transformer (SWIN) is a model introduced in the paper "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", developed by researchers at Microsoft Research Asia [25]. As shown in Figure 4, SWIN extends the existing vision transformer architecture to effectively capture the spatial structure of the image, resulting in excellent performance in tasks such as object detection and instance segmentation. The key elements of SWIN include:


Window-Based Self-Attention: SWIN segments the input image into grids and performs self-attention within each, facilitating the integration of local context and global image structure.


Shifted Window: It employs a hierarchical strategy to enlarge each window incrementally through shifting operations, capturing a variety of resolutions and contextual information.

Experimental results reveal that SWIN markedly outperforms Faster R-CNN and Mask R-CNN in instance-segmentation tasks on the COCO dataset. For example, SWIN has achieved a mean average precision (mAP) approximately 4% higher than the ResNet-50-based Faster R-CNN while maintaining a relatively lightweight model size. By segmenting images into grid-shaped windows, performing hierarchical self-attention, and combining features at multiple scales, SWIN delivers good performance with a streamlined architecture. SWIN and the two preceding models may be compared as follows:


Mask R-CNN focuses on object detection and pixel-level segmentation for accurate and reliable results.


BlendMask builds on Cascade Mask R-CNN, utilizing an ensemble technique to combine multi-stage network outputs and mask predictions.


SWIN leverages a vision transformer architecture, combining window-based self-attention with hierarchical feature mapping.

Each model has its own concepts and architecture; selection among them should be based on specific experimental results and use requirements. In this study, these three models were applied to bridge-damage identification. To compare their performances, the same parameter values were set, and the resulting loss values were comparatively analyzed.
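The window partitioning and shifted-window steps described for SWIN can be sketched on a small grid of patch indices. This is a simplified illustration under stated assumptions: the function names are hypothetical, the grid is a plain nested list, and the attention computation itself (including the masking that the Swin Transformer applies to patches wrapped around the border) is omitted.

```python
def window_partition(grid, window_size):
    """Split an H x W grid (list of rows) into non-overlapping square windows."""
    height, width = len(grid), len(grid[0])
    assert height % window_size == 0 and width % window_size == 0
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([row[left:left + window_size]
                            for row in grid[top:top + window_size]])
    return windows

def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around the borders."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]

# A 4 x 4 grid of patch indices, window size 2, shift of half a window.
patch_grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular_windows = window_partition(patch_grid, 2)
shifted_windows = window_partition(cyclic_shift(patch_grid, 1), 2)

print(regular_windows[0])  # patches that attend to each other in a regular layer
print(shifted_windows[0])  # patches grouped together after the shift
```

In the shifted layout, the first window groups patches that belonged to four different regular windows, which is how information is exchanged across window borders between consecutive layers.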

Experimental Environment
To test the performance of the detection model, experiments were conducted using a system equipped with a GPU. For fast learning speed and detection performance analysis, the CPU and GPU were selected to have the highest specifications available at the time of the experiment. The programming language employed was Python. The equipment used in the experimental evaluations was as follows:

Hyperparameters
Table 1 shows the parameters adopted for all of the models. Fundamentally, the experiments were based on a ResNet model with a backbone depth of 50 layers.

Measurement Method
Loss values in deep learning models are primarily determined through a loss function, an indicator that evaluates model performance by calculating the difference between the model's predicted outputs and the actual values. Commonly used loss functions include:


Mean Squared Error: Employed in regression problems, it is calculated by squaring the differences between predicted and actual values; these squares are then averaged.


Cross-Entropy Loss: This function is mainly used in classification problems to measure the differences between the predicted-probability distribution and the actual class labels. Binary cross-entropy is typically used for binary classification, while categorical cross-entropy is suited for multi-class scenarios.


Log-Likelihood Loss: Often used in generative models, this function seeks to maximize the logarithm of the likelihood of the predicted probabilities matching the actual data.


Custom loss function for specific tasks: Such functions are tailored to specific problems. The composite loss function in Mask R-CNN is an example; it is designed for concurrent object classification, bounding box regression, and mask segmentation.
To calculate the loss value, both the input data and the model's output for that data are required. Optimization algorithms, such as gradient descent, are used to minimize the loss value during the training process. Deep learning frameworks and libraries are equipped with a range of loss functions and internally calculate and track loss values. In this paper, we compared the loss values generated across six types of bridge damage detection, over the course of 300,000 learning iterations.
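The loss functions listed above can be written out in a few lines. The sketch below is illustrative only: the function names and sample values are hypothetical, and the composite function simply sums three already-computed terms in the manner of the Mask R-CNN multi-task loss (classification + box regression + mask), without reproducing how each term is obtained in a real framework.

```python
import math

def mean_squared_error(predicted, actual):
    """Average of the squared differences between predicted and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def categorical_cross_entropy(probabilities, true_class, eps=1e-12):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(max(probabilities[true_class], eps))

def composite_loss(class_loss, box_loss, mask_loss):
    """Multi-task loss in the style of Mask R-CNN: L = L_cls + L_box + L_mask."""
    return class_loss + box_loss + mask_loss

# Hypothetical class probabilities over six damage types; index 3 is the true class.
class_probabilities = [0.05, 0.05, 0.10, 0.70, 0.05, 0.05]
cls_term = categorical_cross_entropy(class_probabilities, true_class=3)
box_term = mean_squared_error([10.0, 20.0, 50.0, 80.0], [12.0, 18.0, 50.0, 79.0])
print(cls_term, box_term, composite_loss(cls_term, box_term, mask_loss=0.2))
```

Note that the mean squared error stands in for the box-regression term only for simplicity; the original Mask R-CNN formulation uses a smooth L1 loss for that term.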

Comparison of Mask R-CNN and BlendMask
For a learning rate (LR) of 0.01, the distribution of loss values for Mask R-CNN is shown in Figure 5, and that for BlendMask is shown in Figure 6. The BlendMask results show that after 350,000 iterations, the loss values gradually converged to 0.0001 or lower. Moreover, these values are not only more consistent but also generally smaller than those obtained with Mask R-CNN. The rapid changes in loss value with the number of iterations seen in the graphs are caused by the learning rate. The learning rate determines how much the model adjusts the weights at each iteration. If the learning rate is too large, the model may overshoot the optimal point, and if it is too small, the model may take a long time to reach the optimal point. In these cases, the learning rate can be adjusted to stabilize the change in loss values.
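The effect of the learning rate described above can be reproduced on a toy problem. The sketch below applies plain gradient descent to the one-dimensional loss f(w) = w^2; the function name and the three learning rates are assumptions chosen to make the behavior visible, and they are unrelated to the learning rates used in the experiments.

```python
def gradient_descent(learning_rate, start=1.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
        losses.append(w * w)
    return losses

too_large = gradient_descent(1.1)    # each step multiplies w by -1.2: overshoots and diverges
moderate = gradient_descent(0.3)     # each step multiplies w by 0.4: converges quickly
too_small = gradient_descent(0.001)  # each step multiplies w by 0.998: converges very slowly

for name, losses in [("too large", too_large), ("moderate", moderate), ("too small", too_small)]:
    print(name, losses[0], losses[-1])
```

With the largest rate the loss grows at every step, with the smallest it has barely moved after 50 steps, and only the intermediate rate reaches a negligible loss, mirroring the trade-off noted in the text.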

Comparison of Mask R-CNN and SWIN
For an LR of 0.005, the distribution of loss values for Mask R-CNN is shown in Figure 7. While the loss values were under 0.0001 (lower than those at an LR of 0.01), they exhibited sharp fluctuations.
For an LR of 0.005, the distribution of loss values for SWIN is shown in Figure 8. They gradually converged to a point at or below 0.0001, requiring considerably fewer iterations than Mask R-CNN. This indicates that SWIN is capable of identifying objects more rapidly and with greater precision than Mask R-CNN.

Comparison of Mask R-CNN, BlendMask, and SWIN
For an LR of 0.0001, the distribution of loss values for Mask R-CNN is shown in Figure 9. The results show that loss values started at 0.00002 or lower, even at early iteration stages. With additional iterations, the loss values continued with little variation, indicating only marginal improvement.
Figure 10 presents the loss value results for BlendMask. Although high loss values were occasionally observed until about 200,000 iterations, the loss values generally followed a descending curve. This indicates that loss values decrease with continued training, enhancing the model's accuracy. However, with extremely small weight-update values, an oscillation (vibration) effect in the model parameters was observed. This oscillation can undermine training stability and consistency, making it difficult to enhance performance. The loss value graph for SWIN across training iterations (Figure 11) shows a consistently descending curve, distinguishing it from the other models. The loss values for SWIN approached 0.00001 after 200,000 iterations; BlendMask attained similar values, but only after exceeding 350,000 iterations. This result indicates that SWIN can produce accurate results faster. However, this observation applies when the learning rate is set to a very low level, specifically 0.0001. In deep learning models, an excessively low LR slows down overall learning because the model-parameter updates become very small; this increases the likelihood of getting stuck in a local minimum, complicating convergence to the global minimum and degrading the model's performance. Furthermore, it may lead to issues such as overfitting and parameter oscillations. Therefore, it is necessary to determine the optimal learning rate by experiment and verification within an appropriate range.
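A comparison of how quickly each model reaches a given loss level, as made above, can be automated from logged training losses. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a short moving average is used to damp isolated spikes, and the logged values are invented for illustration and are not read from Figures 9 to 11.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

def first_iteration_below(iterations, losses, threshold, window=3):
    """Return the first logged iteration whose smoothed loss is at or below the threshold."""
    for iteration, loss in zip(iterations, moving_average(losses, window)):
        if loss <= threshold:
            return iteration
    return None  # the threshold was never reached

# Hypothetical logged values, not taken from the study's figures.
logged_iterations = [50_000, 100_000, 150_000, 200_000, 250_000, 300_000]
fast_model_losses = [4e-4, 9e-5, 3e-5, 1e-5, 8e-6, 6e-6]
slow_model_losses = [6e-4, 5e-4, 9e-5, 2e-4, 6e-5, 3e-5]

print(first_iteration_below(logged_iterations, fast_model_losses, 1e-4))
print(first_iteration_below(logged_iterations, slow_model_losses, 1e-4))
```

Smoothing before thresholding is a design choice: it avoids declaring convergence on a single unusually low value when the curve is still fluctuating.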

Conclusions
To enhance the predictive performance of a deep learning model, it is essential to reduce the loss value. Besides ensuring a highly accurate model with good generalization capabilities, achieving low loss values can also reduce the number of iterations needed, enhance stability, and prevent overfitting. In this study, we compared and analyzed the loss values of three deep learning models for identifying damage in aging bridges. The experimental results showed that BlendMask outperformed Mask R-CNN at a learning rate of 0.01. For SWIN, learning occurred so quickly at this rate that the loss value trend was not clearly observed. At a learning rate of 0.005, SWIN exhibited both lower loss values and a faster learning speed. When the learning rate was set to 0.0001, it was possible to compare all three models, with both BlendMask and SWIN demonstrating decent loss values; however, SWIN excelled in learning speed. Overall, SWIN demonstrated the best performance; its performance can be maximized by adjusting the learning rate. These experimental results make it possible to determine which instance-segmentation model is best suited to identifying damage in old bridges and how parameter values such as the learning rate affect the use of various models. They also make it possible to apply the most suitable deep learning model for each type of bridge damage.
However, in this study, it was not possible to compare loss values for various parameter values, and the amount of learning data was very small (about 4000), making a more objective comparative analysis difficult. Also, the number of images used for experiment and verification was about 200, so the results were slightly different for each experiment. Lastly, there are many more deep learning models besides the three representative ones studied here, and these should also be considered.
In future research, we plan to investigate hyperparameter tuning to reduce loss values more effectively and to conduct rigorous experiments and verification of the results. Model combination algorithms should be investigated to further improve the accuracy of the proposed automated bridge-damage-identification model. Finally, we plan to develop an optimized mobile application capable of obtaining real-time images from bridges and identifying damage, thereby advancing the automation of bridge inspection and assessment tasks.

Figure 1. Object-detection framework for identifying bridge damage by integrating deep learning combination-modules [12].

1. Region Proposal Network (RPN): Like Faster R-CNN, Mask R-CNN begins with an RPN to suggest object regions in the image, predicting locations where objects are likely to be present.
2. RoIAlign: This accurately maps the RoI. It replaces RoIPool to improve pixel alignment accuracy, enabling accurate mask prediction during instance segmentation.
3. Mask Head: This component predicts the segmentation mask at the pixel level for the object within each RoI, along with the object's bounding box. It consists of a CNN-based fully convolutional network that generates the mask using the feature map within the given RoI.
The architecture of Mask R-CNN is shown in Figure 2. Mask R-CNN's capacity to detect objects and predict accurate segmentation masks for each instance makes it useful for various computer vision tasks, ranging from pedestrian detection and segmentation by autonomous vehicles to tumor analysis in medical imaging. Models pre-trained on large datasets such as Common Objects in COntext (COCO) are provided for Mask R-CNN, along with code in deep learning frameworks such as TensorFlow and PyTorch.
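The pixel-alignment idea behind RoIAlign can be illustrated with a simplified sketch. It assumes a single-channel feature map stored as nested lists, RoI coordinates already expressed in feature-map units, and one bilinear sample at the centre of each output bin (the published method averages several sample points per bin). The function names are hypothetical and this is not the implementation used in any particular framework.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a fractional (y, x) location by bilinear interpolation."""
    height, width = len(feature_map), len(feature_map[0])
    y0 = min(max(int(math.floor(y)), 0), height - 1)
    x0 = min(max(int(math.floor(x)), 0), width - 1)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, output_size=2):
    """Pool a RoI (y_min, x_min, y_max, x_max) to output_size x output_size bins.

    The RoI borders are kept as floating-point values instead of being rounded
    to integer cells, which is the key difference from RoIPool.
    """
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / output_size
    bin_w = (x_max - x_min) / output_size
    return [[bilinear_sample(feature_map,
                             y_min + (i + 0.5) * bin_h,
                             x_min + (j + 0.5) * bin_w)
             for j in range(output_size)]
            for i in range(output_size)]

# Toy 4 x 4 feature map whose value at (row, col) is 4 * row + col.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = roi_align(feature_map, roi=(0.5, 0.5, 2.5, 2.5))
print(pooled)
```

Because the RoI borders here fall between grid cells, a rounding-based pooling step would shift the region by up to half a cell, whereas the interpolated samples stay aligned with the fractional coordinates.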


Tokenization and Patch Embedding: SWIN converts the image into a series of patches, which undergo tokenization and embedding processes.


Hierarchical Feature Fusion: SWIN uses a technique for merging features across different scales hierarchically, attaining various resolutions and spatial diversities.

Author Contributions: Conceptualization, S.-W.C. and B.-K.K.; Software, S.-W.C. and S.-S.H.; Validation, S.-W.C. and S.-S.H.; Resources, S.-W.C.; Data curation, S.-W.C.; Writing (original draft), S.-W.C.; Writing (review and editing), S.-W.C., S.-S.H. and B.-K.K.; Supervision, B.-K.K.; Project administration, B.-K.K.; Funding acquisition, B.-K.K. All authors have read and agreed to the published version of the manuscript.
Funding: Research for this paper was carried out under the KICT Research Program (project no. 20230073-001, Development of DNA-based smart maintenance platform and application technologies for aging bridges) funded by the Ministry of Science and ICT.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.