Article
Peer-Review Record

Performance of Fine-Tuning Techniques for Multilabel Classification of Surface Defects in Reinforced Concrete Bridges

by Benyamin Pooraskarparast 1,*, Son N. Dang 1, Vikram Pakrashi 2 and José C. Matos 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Appl. Sci. 2025, 15(9), 4725; https://doi.org/10.3390/app15094725
Submission received: 31 March 2025 / Revised: 17 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

see below

Comments for author File: Comments.pdf

Author Response

Response to Reviewer 1:

Thank you very much for taking the time to review our manuscript and for providing thoughtful and constructive feedback. We have carefully addressed each of your comments, as detailed below. All corresponding revisions have been made in the manuscript and are highlighted using track changes in the revised version.

Comment 1: While the paper recognizes that defects can overlap (e.g., corrosion coinciding with cracks), the multi-label design largely treats each label independently in the final classification. It would be helpful to discuss the rationale for the discrete labeling scheme: Are certain types of overlap (e.g., “spalling + exposed bars”) explicitly handled or systematically annotated in the dataset? A more explicit explanation of how “multi-label classification” addresses partial overlaps would clarify the scope of the approach.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have added a clarification on how overlapping defects are represented in the CODEBRIM dataset and how our model handles them within the multi-label classification setting. This clarification has been added on Page 9, lines 286–291, in Section 3 (Dataset).

We now explain that the dataset is annotated using a multi-label scheme in which overlapping defects are labeled as independent binary classes. For example, if both spalling and corrosion are present in an image, both labels are assigned independently. Our model, through sigmoid activation, is capable of detecting these co-occurrences without needing predefined composite labels.

Updated text in the manuscript:

“The CODEBRIM dataset supports multi-label annotations, where multiple defects such as cracks, corrosion, and exposed bars may co-occur in a single image. While the dataset does not define composite classes explicitly, such overlaps are represented through simultaneous binary labels. Our model uses sigmoid activation to allow independent prediction of each defect, enabling robust detection even in overlapping scenarios (e.g., a spalled area with exposed reinforcement and corrosion stains).”
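For illustration, a minimal sketch (not taken from the manuscript) of how such co-occurring defects can be represented as simultaneous binary labels in a multi-hot target vector; the class ordering and the helper function are hypothetical and assume a PyTorch-style implementation:

```python
# Hypothetical illustration of multi-hot label encoding for co-occurring defects.
# The class order below is an assumption for this sketch, not the dataset's order.
import torch

CLASSES = ["crack", "spalling", "efflorescence",
           "exposed_bars", "corrosion_stain", "background"]

def encode_labels(present):
    """Return a multi-hot target vector for the defect types present in an image."""
    target = torch.zeros(len(CLASSES))
    for name in present:
        target[CLASSES.index(name)] = 1.0
    return target

# A spalled area with exposed reinforcement and corrosion stains:
print(encode_labels(["spalling", "exposed_bars", "corrosion_stain"]))
# tensor([0., 1., 0., 1., 1., 0.])
```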

 

Comment 2: CODEBRIM contains images from 30 bridges, but the text is relatively brief on how representative these structures are in terms of environment, age, or construction type. Since real-world defect distributions can vary significantly (e.g., urban vs. coastal settings), adding detail on the geographical or environmental conditions captured in CODEBRIM would help readers gauge the model’s broader applicability.

Response 2: Thank you for this insightful comment. We agree that discussing the environmental and visual diversity of the dataset is important for understanding the generalizability of the models. Accordingly, we have added a sentence on Page 8, paragraph 3, lines 244–248, in Section 3 (Dataset) that highlights the variety of visual and structural conditions represented in the images.

Updated text in the manuscript:

“The CODEBRIM dataset includes images collected under diverse environmental and structural conditions, reflected in variations in lighting, surface textures, and types of visible defects. This diversity suggests coverage of a broad range of real-world bridge settings, supporting the generalizability of the trained models to different environments.”

 

 Comment 3: The paper evaluates six architectures (ResNet, ResNeXt, and EfficientNet). However, the discussion of why these specific models were chosen is somewhat minimal—particularly whether aspects like grouped convolutions (ResNeXt) or compound scaling (EfficientNet) were specifically hypothesized to handle surface defect patterns more effectively. Linking these architectural traits to the challenges of defect classification (e.g., subtle texture differences, small-scale cracks) would enrich the methodology.

Response 3: Thank you for your valuable comment. We agree that further elaboration on the rationale behind choosing these specific architectures would enhance the clarity and depth of the methodology. Therefore, we have added text on Page 6, last paragraph, lines 178–185, in Section 2.2 (Fine-Tuning and Transfer Learning Strategies) explaining the architectural characteristics and how they relate to the nature of surface defect classification.

Updated text in the manuscript:

“The chosen models, ResNet, ResNeXt, and EfficientNet, represent three distinct architectural design philosophies in deep convolutional neural networks. ResNet models rely on residual connections to support the learning of deep features. ResNeXt, with the introduction of grouped convolutions, allows the model to process information along multiple parallel paths, which may improve the detection of fine-grained features or overlapping defect patterns. EfficientNet-B3 employs a compound scaling approach that uniformly scales the depth, width, and resolution of the network, allowing it to better capture fine texture variations and small-scale surface defect features.”

 

Comment 4: Horizontal flips and rotations were applied. Given that bridge surfaces can vary in lighting, angle, and environment, it might be beneficial to consider more domain-specific augmentations (e.g., random brightness changes or Gaussian noise to mimic different imaging conditions). The authors could elaborate on whether advanced transformations (color jitter, perspective distortion) were tested and why they were or were not adopted.

Response 4: Thank you for the suggestion. We agree that domain-specific data augmentation can contribute to model robustness. In this study, we focused on applying standard augmentations, namely horizontal flipping and rotation, which are widely used and effective in visual classification tasks. These were chosen due to their simplicity and relevance to the nature of bridge surface images. We have clarified this on Page 10, lines 334–337, in Section 4.1 (Training results).

Updated text in the manuscript:

“In this study, we applied standard augmentations, including horizontal flipping and random rotation, to increase dataset diversity. These simple yet effective transformations were selected due to their relevance to typical variations in bridge inspection imagery.”
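A minimal sketch of such an augmentation pipeline, assuming a torchvision-based implementation; the exact rotation range, image size, and normalization statistics are illustrative assumptions rather than the manuscript's settings:

```python
# Sketch of the standard augmentations described above (horizontal flip, rotation),
# assuming a torchvision pipeline; parameter values are illustrative assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flipping
    transforms.RandomRotation(degrees=15),    # random rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])
```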

 

Comment 5: Although the training curves and final accuracies are discussed, readers might also wonder about model inference speed, GPU memory requirements, or feasibility for real-time or near-real-time bridge inspections (e.g., using drones). A short section comparing the computational overhead among the tested architectures could help practical adoption decisions.

Response 5: Thank you for raising this important point. Due to the limitations of our university computing environment, we did not have access to a dedicated GPU system to comprehensively evaluate the inference time and memory usage of all models. However, based on well-known architectural characteristics and findings from related studies, we can provide general insights into the computational efficiency of the tested models.

Specifically, shallower networks such as ResNet-18 and ResNet-50 are known for their faster inference times and lower memory consumption, making them suitable for real-time or edge-based applications. In contrast, deeper models such as ResNeXt-101, while achieving higher accuracy, are generally more computationally expensive. EfficientNet-B3 offers a balanced trade-off between accuracy and efficiency, and is often used in lightweight deployment scenarios.

We have included a brief summary of these practical considerations on Page 16, paragraph 2, lines 516–521, in Section 4.3 (AUC-ROC analysis).

Updated text in the manuscript:

“From a practical standpoint, architectures such as ResNet-18 and EfficientNet-B3 are expected to perform well in real-time or resource-constrained environments due to their low computational overhead. In contrast, deeper models like ResNeXt-101 may offer higher accuracy but at the cost of slower inference and higher memory usage. These considerations are important for future deployment in field applications such as drone-assisted bridge inspections.”
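As a hedged illustration of how such computational overhead could be compared empirically, the following sketch measures parameter count and per-image CPU inference latency for one torchvision backbone; the batch size, device, and iteration counts are assumptions, not reported results:

```python
# Illustrative latency / size measurement for comparing backbones; swap in
# resnet50, resnext101_32x8d, efficientnet_b3, etc. Values here are not results.
import time
import torch
from torchvision import models

model = models.resnet18(weights=None)
model.eval()

n_params = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):            # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    latency_ms = (time.perf_counter() - start) / 50 * 1000

print(f"parameters: {n_params/1e6:.1f}M, latency: {latency_ms:.1f} ms/image")
```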

Reviewer 2 Report

Comments and Suggestions for Authors

This study employed six CNN architectures: ResNet-18, ResNet-50, ResNet-101, ResNeXt-50, ResNeXt-101 and EfficientNet-B3 for classifying concrete surface defects of the CODEBRIM dataset. This study aims to illustrate the effectiveness of fine-tuning parameters of existing CNN structures to improve classification performance. The results seem good and advanced compared with the previous publications. However, several issues should be addressed to improve this manuscript. (1) The authors emphasized the Fine-Tuning Techniques in the title and abstract, while there was no detailed pipeline or flowchart to introduce it in the methodology. As the author stated, more details should be added because this is the main contribution. (2) More explanations should be attempted to reveal why different ResNet models perform differently from the point of view of structures, parameters and fine-tuning processing rather than only listing the accuracy values and comparisons (Section 4.2). (3) AUC values equal to 1.0 are reported in this study, which is wrong because Figure 9 does not show a perfect ROC curve. There are still gaps between the ROC curve and the point (0, 1). The misleading results come from the numerical precision after rounding up to the nearest integer. More decimal places are suggested.

Therefore, I recommend major revisions of this manuscript. The specific comments are shown as follows:

  1. Line 14, Abstract, Line breaks are not usually common in abstracts.
  2. Line 19, are the CNN model structures modified or not? The structure seems unchanged while only retraining the parameters of all convolutional layers according to Line 180 to 181.
  3. Abstract, the background points out the main challenges are the complex surface features, overlapping defects, imbalance in the amount of data, and the visual similarity of defects. However, can fine-tuning model parameters solve the problems mentioned above?
  4. Introduction, the literature review does not correspond to the content of the study, so the study's fine-tuning strategies should be better illustrated. The introduction is suggested to follow a flow of background, problems, motivations, and paper structure.
  5. Figure 6. The images are suggested to be displayed in categories of defects.
  6. Line 254, The sigmoid function is not commonly used as an activation function in multicategorization, and softmax is more suitable for multicategorization, why is this used?
  7. Section 4.2, The results of the training set, validation set and test set are not reported separately.
  8. Line 331 to 373, Specific results have been given in Tables 5 and 6. The exact values are not important. What is important is why different network structures (18, 50, and 101) improve performance.
  9. Figure 9, the AUC = 1.0 in panels (a–f) is clearly an unreasonable result due to the fact that the data is approximated as 1. The actual value is certainly not 1. This is because the curve in the figure with AUC = 1.0 has a gap between it and the coordinate point (0, 1). It is recommended that a higher-precision characterization be used, e.g., four decimal places.
  10. The following literature is suggested to be referred to in the introduction. https://doi.org/10.1007/s11440-023-01871-y, https://doi.org/10.1016/j.tust.2024.106351.

Author Response

Response to Reviewer 2:

Thank you very much for taking the time to review our manuscript and for providing thoughtful and constructive feedback. We have carefully addressed each of your comments, as detailed below. All corresponding revisions have been made in the manuscript and are highlighted using track changes in the revised version.

Comment 1: Line 14, Abstract, Line breaks are not usually common in abstracts.

Response 1: Thank you for pointing this out. The line break in the abstract has been removed to improve consistency and adhere to journal formatting standards.

Updated text in the manuscript:

-

Comment 2: Line 19, are the CNN model structures modified or not? The structure seems unchanged while only retraining the parameters of all convolutional layers according to Line 180 to 181.

Response 2: Thank you for your question. The base CNN architectures (e.g., ResNet, ResNeXt, EfficientNet-B3) were not structurally modified. However, to adapt the models to the multi-label defect classification task, we replaced the original classification head with a new fully connected output layer consisting of six units. Each unit corresponds to one defect category, and sigmoid activation was applied to allow independent probability estimates. This new output layer was added on top of the average pooling layer of each model. The rest of the network remained structurally intact, but all layers, including early convolutional ones, were fine-tuned during training. This is described on Page 7, last paragraph, lines 218–224, in Section 2.2 (Fine-Tuning and Transfer Learning Strategies).

Updated text in the manuscript:

“The backbone architectures of all CNN models (ResNet, ResNeXt, and EfficientNet-B3) remain unchanged. However, to support multi-label classification, we replaced the original classification head of each model with a custom fully connected output layer consisting of six units, each corresponding to one defect class. A sigmoid activation function was applied to allow independent probability outputs. This new output layer was placed on top of the global average pooling layer. The rest of the network was kept structurally intact, but all layers were fine-tuned during training.”
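A minimal sketch of this head replacement and full fine-tuning setup, assuming a PyTorch/torchvision implementation; layer names follow torchvision conventions and are not taken verbatim from the manuscript:

```python
# Illustrative head replacement for multi-label defect classification (ResNet-50 shown).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 6  # crack, spalling, efflorescence, exposed bars, corrosion stain, background

model = models.resnet50(weights="IMAGENET1K_V2")          # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new six-unit output head

# Full fine-tuning: all layers, including early convolutions, remain trainable.
for param in model.parameters():
    param.requires_grad = True

# At inference, a sigmoid yields independent per-class probabilities.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(torch.randn(1, 3, 224, 224)))
```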

Comment 3: Abstract, the background points out the main challenges are the complex surface features, overlapping defects, imbalance in the amount of data, and the visual similarity of defects. However, can fine-tuning model parameters solve the problems mentioned above?

Response 3: Thank you for your thoughtful comment. We agree that while full fine-tuning of CNN architectures improves model adaptability to domain-specific challenges, it does not fully resolve all the issues mentioned in the Abstract (Page 1, lines 15–18), such as overlapping defects, surface complexity, and class imbalance.

Updated text in the manuscript:

“Although challenges such as overlapping defects, complex surface textures, and data imbalance remain difficult, full fine-tuning of deep learning models helps them better adapt to these conditions by updating all layers for domain-specific learning.”

Comment 4: The authors emphasized the Fine-Tuning Techniques in the title and abstract, while there was no detailed pipeline or flowchart to introduce it in the methodology. As the author stated, more details should be added because this is the main contribution.

Response 4: Thank you for this important comment. We agree that including a clear pipeline to represent the fine-tuning strategy improves the transparency of our methodology. In response, we have added a new figure (Figure 6) on Page 7, lines 199–207, in Section 2.2 (Fine-Tuning and Transfer Learning Strategies), which visually compares standard transfer learning and the full fine-tuning approach adopted in our study.

In transfer learning, only the final fully connected (FC) layers are updated, while the convolutional backbone remains frozen. In contrast, our approach involves updating all layers of the network, both convolutional and FC, allowing the model to fully adapt to domain-specific patterns such as concrete surface defects. This distinction is now clearly illustrated and described in the manuscript.

Updated text in the manuscript:

“Figure 6 illustrates the key differences between conventional transfer learning and the full fine-tuning approach employed in this study. While transfer learning typically updates only the final classification layers, full fine-tuning involves training the entire network, including all convolutional and fully connected layers. This strategy enables the model to adapt more effectively to the specific patterns and features present in the target domain.

Figure 6. Transfer Learning and Full Fine-Tuning Schematic”
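For clarity, a hedged sketch contrasting the two regimes illustrated in Figure 6, assuming a torchvision ResNet backbone; this is an illustrative reconstruction, not the authors' exact training code:

```python
# Illustrative contrast between conventional transfer learning (frozen backbone)
# and full fine-tuning (all layers trainable); model choice is an assumption.
import torch.nn as nn
from torchvision import models

def build(full_fine_tuning: bool, num_classes: int = 6):
    model = models.resnet18(weights="IMAGENET1K_V1")
    if not full_fine_tuning:
        # Conventional transfer learning: freeze the convolutional backbone.
        for param in model.parameters():
            param.requires_grad = False
    # The new classification head is always trainable.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

frozen = build(full_fine_tuning=False)   # only the final FC layer is updated
full = build(full_fine_tuning=True)      # all convolutional and FC layers are updated
```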

Comment 5: Figure 6. The images are suggested to be displayed in categories of defects.

Response 5: Thank you for your suggestion. We agree that displaying images grouped by defect category can provide additional clarity. However, in the current version, Figure 6 was designed to present the dataset across multiple defect types, as our focus was on overall dataset variability rather than class-wise visualization. We believe this placement is appropriate in the context of the methodology and ensures that the visual diversity relevant to multi-label classification is conveyed effectively.

Updated text in the manuscript:

-

Comment 6: Line 254, The sigmoid function is not commonly used as an activation function in multicategorization, and softmax is more suitable for multicategorization, why is this used?

Response 6: Thank you for this observation. We would like to clarify that the task addressed in this study is multi-label classification, not multi-class categorization. In multi-label settings, a single image can simultaneously contain multiple defect types (e.g., both spalling and corrosion stains). For this reason, we use sigmoid activation on each output node, allowing the model to generate independent probability scores for each class.

In contrast, softmax is generally used for multi-class problems, where only one class is active per sample. Therefore, sigmoid is the more appropriate choice for our dataset and classification objective. We have revised Page 10, lines 317–319, in Section 3 (Dataset) to clarify this distinction.

Updated text in the manuscript:

“The final output consists of multi-label predictions (crack, spalling, efflorescence, exposed bars, corrosion stains, and background), with a sigmoid activation applied to each output node to enable independent probabilities for co-occurring defect classes.”
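The distinction can be illustrated with a small, self-contained example (logit values are arbitrary and purely illustrative): sigmoid produces independent per-class probabilities, so several defects can exceed the decision threshold at once, whereas softmax normalizes the same logits into a single competing distribution:

```python
# Illustrative-only comparison of sigmoid vs. softmax on the same logits.
import torch

logits = torch.tensor([2.1, 1.8, -3.0, 0.9, 1.5, -2.0])  # one image, six classes

sigmoid_probs = torch.sigmoid(logits)          # independent per-class probabilities
softmax_probs = torch.softmax(logits, dim=0)   # probabilities forced to sum to 1

multi_label_pred = (sigmoid_probs > 0.5)       # several defects can be active at once
single_label_pred = softmax_probs.argmax()     # softmax selects exactly one class
```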

Comment 7: Section 4.2, The results of the training set, validation set and test set are not reported separately.

Response 7: Thank you for your comment. We would like to clarify that the dataset was split into training, validation, and test sets using class partitioning, and the class distributions for each subset are explicitly presented in Table 4. In addition, Figure 9 presents training accuracy and loss curves, which visually depict the learning behavior over epochs. Evaluation metrics for model performance are reported separately for the test set in Tables 5 and 6, as our primary aim was to assess generalization.

Therefore, the model behavior across these sets is clearly demonstrated through the learning curves. We believe this effectively communicates the convergence and performance trends during training.

Updated text in the manuscript:

-

Comment 8: Line 331 to 373, Specific results have been given in Tables 5 and 6. The exact values are not important. What is important is why different network structures (18, 50, and 101) improve performance.

Response 8: Thank you for raising this insightful point. We agree that a deeper explanation of the architectural and training differences between all models enhances the scientific rigor of the study. Accordingly, we have expanded the discussion on Page 13, lines 422–435, in Section 4.2 (Evaluation metrics), analyzing why these models yield varying performance.

Updated text in the manuscript:

“The performance differences among the six models can be explained through their architectural depth, parameter count, and how these factors interact with the full fine-tuning process. ResNet-18, being the shallowest with around 11M parameters, offers fast convergence and decent generalization under limited data, but its limited depth may hinder the learning of complex defect features. ResNet-50 and ResNet-101 progressively increase depth (with ~23M and ~45M parameters, respectively), allowing for more detailed feature extraction, but also introducing higher risk of overfitting. ResNeXt-50 and ResNeXt-101 introduce grouped convolutions (cardinality), which expand feature diversity without significantly increasing parameters, making them efficient for capturing subtle defect variations. EfficientNet-B3, with compound scaling of width, depth, and resolution, achieves strong performance due to its balanced design, though it is more sensitive to hyperparameter tuning and training dynamics. Under full fine-tuning, where all layers are updated, these structural differences play a more prominent role, as the model’s ability to adapt feature hierarchies becomes crucial for multi-label defect recognition.”
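A hedged sketch for checking the approximate parameter counts quoted above, using torchvision model constructors; the counts include the original ImageNet classification head, so they differ slightly from the fine-tuned six-class models:

```python
# Illustrative parameter-count comparison across the six evaluated backbones.
from torchvision import models

backbones = {
    "ResNet-18": models.resnet18,
    "ResNet-50": models.resnet50,
    "ResNet-101": models.resnet101,
    "ResNeXt-50": models.resnext50_32x4d,
    "ResNeXt-101": models.resnext101_32x8d,
    "EfficientNet-B3": models.efficientnet_b3,
}

for name, ctor in backbones.items():
    n_params = sum(p.numel() for p in ctor(weights=None).parameters())
    print(f"{name}: {n_params/1e6:.1f}M parameters")
```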

 

Comment 9: AUC values equal to 1.0 are reported in this study, which is wrong because Figure 9 does not show the perfect ROC curve. There are still gaps between the roc curve with point(0,1). The misleading results come from the numerical precision after rounding up to the nearest integer. More decimal places are suggested.

Response 9: Thank you for this helpful comment. We acknowledge that reporting AUC values as exactly 1.0 may be misleading, especially when the ROC curves do not reach the ideal point (0,1). The values were rounded for presentation purposes in the initial version.

In response, we have re-calculated and updated the AUC values with higher numerical precision (up to four decimal places), and have revised both the text and the labels in Figure 10 accordingly. This correction provides a more accurate and transparent representation of model performance.

Updated text in the manuscript:

See Page 15 (Figure 10), lines 478–486, in Section 4.3 (AUC-ROC analysis).
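As an illustration of the higher-precision reporting, a minimal sketch of per-class AUC computation with four decimal places, assuming scikit-learn; the ground-truth and score arrays here are random placeholders, not the study's results:

```python
# Illustrative per-class AUC reporting to four decimal places (placeholder data).
import numpy as np
from sklearn.metrics import roc_auc_score

CLASSES = ["crack", "spalling", "efflorescence",
           "exposed_bars", "corrosion_stain", "background"]

y_true = np.random.randint(0, 2, size=(200, 6))   # multi-hot ground truth (placeholder)
y_score = np.random.rand(200, 6)                  # sigmoid outputs (placeholder)

for i, name in enumerate(CLASSES):
    auc = roc_auc_score(y_true[:, i], y_score[:, i])
    print(f"{name}: AUC = {auc:.4f}")             # four decimals, avoids rounding to 1.0
```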

Comment 10: The following literature is suggested to be referred to in the introduction. https://doi.org/10.1007/s11440-023-01871-y, https://doi.org/10.1016/j.tust.2024.106351.

Response 10: Thank you for suggesting these relevant and insightful references. We have reviewed both papers and agree that they offer valuable perspectives on the application of machine learning, particularly CNN-based approaches, for structural condition assessment beyond the bridge domain. Accordingly, we have cited these works in the Introduction to emphasize the broader role of model tuning, data fusion, and learning strategies in civil infrastructure contexts, including tunneling and subsurface defect classification. The following paragraph has been added on Page 2, lines 66–72, in Section 1 (Introduction).

Updated text in the manuscript:

“Recent developments in machine learning have extended beyond bridge inspection to include subsurface and tunnel-related structural assessments. For instance, Yang et al. [11] employed a probabilistic CNN–RF hybrid model to predict rock mass classification in TBM operations, while Yang et al. [12] proposed a feature fusion framework combining TBM operation data and cutter wear indicators. These studies reinforce the broader relevance of tuning strategies, data fusion, and model interpretability across civil engineering domains.”

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a study on the performance of fine-tuning approaches for multiclass surface defects in RC bridges. The paper is interesting and potentially suitable for publication. Nevertheless, some aspects should be improved:

  • The introduction is well posed, but some aspects should be considered. First, 4 main families of CV algorithms exist, as reported in recent state-of-the-art review papers. Still, other methodologies different from CNNs for classification exist, such as the ones based on object detection. For all this information, check 10.1109/ACCESS.2025.3532832 and refs therein
  • Authors use ResNet and its versions for defining the fine-tuning and transfer learning. The unclear aspect is precisely the contribution in this phase (Section 2.2), since it seems that the authors did not propose improved versions of transfer learning with respect to the existing ones
  • Please, define the multiclass and above all, each defect considered. A clear definition should be proposed to avoid confusion between similar defects (e.g., exposed bars and corrosion)
  •  In addition, spallation should be spalling
  • About cracks, very different typologies of cracks can exist. What about this aspect?
  • Resolution of Figure 10 should be improved
  • In the end, the authors applied a simple training and test of the algorithms. Nevertheless, I don't see improvements with respect to the existing methods. Please, specify this aspect.
 

 

Author Response

Response to Reviewer 3:

Thank you very much for taking the time to review our manuscript and for providing thoughtful and constructive feedback. We have carefully addressed each of your comments, as detailed below. All corresponding revisions have been made in the manuscript and are highlighted using track changes in the revised version.

Comment 1: First, 4 main families of CV algorithms exist, as reported in recent state-of-the-art review papers. Still, other methodologies different from CNNs for classification exist, such as the ones based on object detection. For all this information, check 10.1109/ACCESS.2025.3532832 and refs therein.

Response 1: Thank you for this helpful suggestion. We agree that situating our classification-based approach within the broader context of computer vision methods for defect detection strengthens the introduction. Following your advice, we have added a short paragraph on Page 2, second paragraph, lines 50–53, in Section 1 (Introduction) that briefly outlines the four main categories of computer vision approaches: classification, object detection, segmentation, and anomaly detection, as described in 10.1109/ACCESS.2025.3532832.

Updated text in the manuscript:

“Computer vision algorithms for defect detection can be broadly categorized into four families: image classification, object detection, semantic segmentation, and anomaly detection [7]. The present study focuses on deep learning-based classification due to its simplicity and suitability for scenarios where image-level annotation is available.”

 

Comment 2: Authors use ResNet and its versions for defining the fine-tuning and transfer learning. The unclear aspect is precisely the contribution in this phase (Section 2.2), since it seems that the authors did not propose improved versions of transfer learning with respect to the existing ones.

Response 2: Thank you for your valuable comment. We agree that the contribution of our study lies not in proposing a novel transfer learning method, but in providing a systematic and comparative evaluation of full fine-tuning strategies across multiple modern CNN architectures.

We have clarified this explicitly on Page 8, first paragraph, lines 229–234, in Section 2.2 (Fine-Tuning and Transfer Learning Strategies) and also emphasized it again in the last paragraph of the Introduction. The aim of this study is to assess the behavior and performance of well-established models (ResNet, ResNeXt, EfficientNet-B3) when fully fine-tuned on a multi-label surface defect dataset. This controlled benchmarking helps identify architectures that offer a practical balance between accuracy and computational efficiency, which is especially important for field applications.

Updated text in the manuscript:

“It should be noted that the study performs a comparative evaluation of full fine-tuning strategies using well-known architectures on a multi-label surface defect dataset. By applying consistent training conditions across all models, the study aims to reveal performance trade-offs and practical deployment insights for bridge inspection tasks.”

 Comment 3: Please, define the multiclass and above all, each defect considered. A clear definition should be proposed to avoid confusion between similar defects (e.g., exposed bars and corrosion)

 In addition, spallation should be spalling

Response 3: Thank you for this important observation. We have added text on Page 8, lines 268–271 and 275–279, in Section 3 (Dataset) to clearly define the classification task as a multi-label problem, where each image may contain one or more co-occurring defects. We also added concise definitions for each of the five defect categories used in the CODEBRIM dataset, to minimize ambiguity between visually similar classes such as corrosion stain and exposed reinforcement. Also, the term “spallation” has been replaced with “spalling” throughout the entire manuscript.

Updated text in the manuscript:

“The classification task is formulated as multi-label, where each image can be associated with one or more defect types. The five annotated defect categories in the CODEBRIM dataset are defined as follows:

  • Crack – visible linear fracture or separation in the concrete surface
  • Spalling – surface flaking or detachment of concrete material
  • Efflorescence – white crystalline deposits resulting from salt leaching
  • Corrosion stain – discoloration due to rust formation, often near steel reinforcements
  • Exposed bars – visible steel reinforcement due to severe concrete loss

These classes are not mutually exclusive and may co-occur in the same image.”

 

Comment 4: About cracks, very different typologies of cracks can exist. What about this aspect?

Response 4: Thank you for this insightful comment. We agree that cracks can have different typologies (e.g., hairline, longitudinal, diagonal), each potentially requiring distinct treatment. However, the CODEBRIM dataset groups all crack types into a single “crack” label without further subclassification. Accordingly, our model treats all visible cracks as one category. We have clarified this limitation on Page 8, lines 271–274, in Section 3 (Dataset) and noted that subclassifying crack types is a promising direction for future work.

Updated text in the manuscript:

“Although cracks may exhibit various typologies such as hairline, vertical, or diagonal cracks, the CODEBRIM dataset annotates all crack types under a single ‘crack’ label. Consequently, our classification model is trained to recognize cracks as a general category.”

Comment 5:  Resolution of Figure 10 should be improved

Response 5: We thank the reviewer for pointing this out. Figure 10 has been replaced with a higher-resolution version in the revised manuscript to enhance visual clarity and ensure the features of the defect predictions are more easily interpretable (Page 16, Figure 11, lines 506–507, in Section 4.3, AUC-ROC analysis).

 

Comment 6: In the end, authors applied a simple training and test of the algorithms. Nevertheless, I don't see improvements with respect to the existing methods. Please, specify this aspect.

Response 6: Thank you for raising this important point. While we did not introduce a novel training algorithm, our contribution lies in systematically evaluating full fine-tuning strategies across multiple modern CNN architectures, specifically ResNet, ResNeXt, and EfficientNet, within a multi-label surface defect classification context.

This comparative evaluation was conducted on the CODEBRIM dataset using consistent training protocols and loss functions (e.g., focal loss), which differentiates our work from earlier studies that employed older models or limited fine-tuning (e.g., only retraining the last layer).
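For reference, a hedged sketch of a binary (multi-label) focal loss of the kind referred to above; the alpha and gamma values are common defaults and are assumptions, not the manuscript's reported hyperparameters:

```python
# Illustrative multi-label (binary) focal loss applied independently to each class.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over all samples and defect classes."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)          # probability of the true label
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example call with a batch of 8 images and 6 defect classes (placeholder data):
loss = focal_loss(torch.randn(8, 6), torch.randint(0, 2, (8, 6)).float())
```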

We have clarified this contribution at the end of the Introduction and again in the Discussion (Section 4.3) to better emphasize the novelty and practical value of our study.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript can be published as it is.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper can be accepted
