Leveraging Saliency in Single-Stage Multi-Label Concrete Defect Detection Using Unmanned Aerial Vehicle Imagery

Abstract: Visual inspection of concrete structures using Unmanned Aerial Vehicle (UAV) imagery is a challenging task due to the variability of defects' size and appearance. This paper proposes a high-performance model for automatic and fast detection of bridge concrete defects using UAV-acquired images. Our method, coined the Saliency-based Multi-label Defect Detector (SMDD-Net), combines pyramidal feature extraction and attention through a one-stage concrete defect detection model. The attention module extracts local and global saliency features, which are scaled and integrated with the pyramidal feature extraction module of the network using max-pooling, multiplication, and residual skip connection operations. This has the effect of enhancing the localisation of small and low-contrast defects, as well as the overall accuracy of detection at varying image acquisition ranges. Finally, a multi-label detection loss function is used to identify and localise overlapping defects. The experimental results on a standard dataset and real-world images demonstrate the performance of SMDD-Net with regard to state-of-the-art techniques. The accuracy and computational efficiency of SMDD-Net make it a suitable method for UAV-based bridge structure inspection.


Introduction
Visual inspection to detect surface defects is an important task for maintaining the structural reliability of bridges. Failing to do so can lead to disastrous consequences, as shown by the recent collapse of the Morandi bridge [1]. According to public data, out of 607,380 existing bridges in the U.S., nearly 67,000 are classified as structurally deficient, whereas approximately 85,000 are considered functionally obsolete. According to the National Research Council of Canada, one-third of Canada's highway bridges have some structural or functional deficiencies and a short remaining service life [2]. Currently, the inspection task is often conducted manually by inspectors, which can be a time-consuming and, sometimes, cumbersome and painstaking process. Recently, there has been a growing shift toward using Unmanned Aerial Vehicles (UAVs) to perform inspection tasks, specifically for bridges, due to enormous benefits such as the ability to access and rapidly inspect remote segments of the structure [3]. Drones can dramatically reduce the inspection time while ensuring the safety of hard-to-reach sites. Since they are efficient, fast, safe, and cost-effective, transportation authorities in several countries have started to apply UAV-based bridge inspection techniques [4]. However, concrete defect detection in UAV imagery remains more challenging than general defect detection, due to perspective and scale variation, changing lighting conditions, and overlapping of defects [5].
Early vision methods for defect detection used image processing techniques to design low-level features for defect description [6,7]. The common pipeline of these traditional methods is to use handcrafted features (e.g., Histograms of Oriented Gradients (HOGs), Gabor filters, Local Binary Patterns (LBPs)) to train an appropriate classifier (e.g., SVM, AdaBoost) and deploy the classifier in a sliding-window fashion or by generating region proposals on the input image [8]. Some of these methods have proven their efficiency for detecting defects such as cracks and corrosion [9]. However, their performance has been shown on simple images only. Moreover, they are poorly extendible to the detection of other defect types in a single framework.
The advent of deep learning methods in the last decade has enabled state-of-the-art performance for visual recognition problems such as image classification and object detection, with Convolutional Neural Networks (CNNs) becoming the dominant approach. Inspired by biological systems, CNNs (or ConvNets) have a unique capability to learn hierarchical and robust features directly from the training data by alternating convolution, pooling, and non-linear activation operations on the input image. With several training datasets such as ImageNet [10], strong CNN-based architectures have been proposed for extracting advanced features, which have drastically increased the performance of deep learning methods for visual recognition problems such as object classification and detection. These developments have unleashed huge opportunities for applications such as monitoring and visual inspection for anomaly detection [11,12]. This has also led to the publishing of several benchmark datasets with annotations, facilitating model training and testing. One of the most popular datasets is the COncrete DEfect BRidge IMage dataset (CODEBRIM) proposed by Mundt et al. [12], which exhibits multiple defects, including crack, spalling, exposed reinforcement bar, efflorescence, and corrosion (oxidation stains). Other datasets include MCDS [13] and SDNET (crack detection) [6].
The availability of annotated data has spurred research on concrete defect classification and detection using deep learning [14]. More specifically, two-stage object detection was proposed first for localising specific concrete defects such as cracks and spalling [3,15-17]. With the advent of single-stage and faster methods (e.g., SSD [18], YOLO [19]), several real-time defect detection methods have been proposed [16,20-23]. Most of the above models have targeted specific defects such as cracks or spalling, whereas their validation is generally performed on images containing single defects against uniform and defect-free backgrounds. Such images are generally obtained by preprocessing the original images to remove non-relevant parts of the background (e.g., bridge structure elements, paintings, artefacts, other defects, etc.). In typical UAV-based inspection scenarios, however, images can be acquired at different viewpoints, resolutions, and ranges to the camera, creating huge variability in defect appearance and size, which makes defects hard to detect. In addition, small and low-contrast defects generally occupy a tiny portion of the images against dominant and non-uniform backgrounds, which increases the difficulty of their localisation. Finally, defects of similar classes can overlap at the same location, which can cause defect mislabelling (e.g., oxidation often occurs with reinforcement corrosion and spalling) [24]. These challenges can drastically decrease the performance of previous detection methods since they are dedicated to simple scenarios [16,20-23].
The use of visual attention in deep learning models has enabled a substantial improvement of image classification [25]. Attention is a property of the human visual system, which processes a scene by selectively focusing on its salient parts, rather than processing it as a whole [26]. Recently, there have been several attempts to incorporate attention to improve the performance of CNNs in large-scale classification tasks. These methods use additional network modules to highlight discriminative local regions, which improves classification. Some models implement attention as residual blocks [27], channel-wise weighting across different network branches [28], or a combination of several modules [29]. Despite its potential, the use of attention for concrete defect recognition is limited. Recently, some methods have attempted to use attention to enhance concrete defect classification for cracks [30-33] or overlapping defects [34]. These methods have shown good success in recognising single or overlapping defects against a defect-free uniform background. However, their performance can be limited in UAV imagery, where images can contain several defects against highly cluttered backgrounds. Using multi-label defect detection in such images is more advantageous since it enables detecting multiple defects potentially characterised by an overlapping structure. Moreover, having a selective mechanism that allocates higher attention to defective regions can enable better detection of small and low-contrast defects surrounded by cluttered backgrounds.
In this paper, we investigated using attention to enhance concrete defect detection in the presence of the aforementioned challenging scenarios that typically arise in UAV-based inspection. We propose a fast and accurate model, coined the Saliency-based Multi-label Defect Detector (SMDD-Net), which leverages saliency and pyramidal feature extraction for the better detection and localisation of concrete defects. SMDD-Net integrates two modules in a single-stage pipeline: (1) the attention module, which extracts the global and local saliency to highlight regions of interest in the image, and (2) the detection module, which uses a Feature Pyramid Network (FPN) inspired by the RetinaNet model [35] to localise multiple and potentially overlapping defects. RetinaNet was chosen for its ability to detect defects at different scales and in the presence of imbalanced training datasets thanks to the use of the FPN and focal loss [36]. The two modules were integrated through residual skip connections to highlight regions of interest in the spatial representation of the pyramidal features. In other words, the attention module enables focusing the detection more on locally contrasted regions characterising general concrete defects. SMDD-Net has the ability to detect small and low-contrast defects, as well as localise defects in cluttered backgrounds, making it a suitable method for UAV-based inspection. Figure 1 (top) shows the pipeline of the SMDD-Net architecture, consisting of the attention and detection modules. The first module extracts the saliency features, whereas the second module optimises a multi-label loss function to detect overlapped defects. Our major contributions in this paper can be listed as follows:

• We propose the SMDD-Net architecture, which integrates attention in single-stage concrete defect detection. The attention module extracts global and local saliency maps, which highlight localised features for better detection of multiple defect classes in the presence of background clutter (e.g., artefacts, bridge structure elements, etc.). Contrary to detection methods that target single defect localisation against uniform backgrounds, SMDD-Net is capable of localising complex defects characterised by variable shapes, small size, low contrast, and overlap.

• We propose an attention module that is based on saliency extraction through gradient-based back-propagation of our feature extraction network. The back-propagation is performed via two paths: a global path, which highlights large-sized defect structures, and a local path, which highlights local image characteristics containing small and low-contrast defects. The two paths are fused using inter-channel max-pooling, and the output is added to the pyramidal features through residual skip connections.

• We demonstrate the performance of the SMDD-Net model on the well-known CODEBRIM dataset [12], which contains five classes of defects and several image examples with small, low-contrast, and overlapping defects. Our model leverages the benefits of the two detection paradigms: the high accuracy of two-stage detection and the high speed of one-stage detection. We also compared the performance of our model with state-of-the-art methods using several examples of real-world UAV images.
This paper is organised as follows: Section 2 discusses the related works. Section 3 presents the proposed method. Section 4 presents the experimental results for validation. Finally, the paper ends with a conclusion and future work perspectives.

Literature Review
This section focuses on recent papers on concrete defect detection that apply bounding boxes to the localisation of defects. Following the evolution of object detection based on deep learning, the methods can be roughly divided into two categories: two-stage and one-stage concrete defect detection.

Two-Stage Concrete Defect Detection
Two-stage detection came first in the literature, where the detection process is divided into two steps: (1) region proposals and (2) bounding box regression and classification. Kim et al. [3] used a two-stage method to detect cracks by combining region proposals with rich features extracted by CNNs. The region proposals were extracted from the input images by a selective search, and the features were extracted by a CNN after image cropping and bounding box regression. However, the method can miss low-contrast cracks and requires heavy computation for feature extraction, which makes it not easily applicable for real-time inspection. Haciefendioglu et al. [37] used Faster R-CNN to detect road cracks. Because the final prediction is made using a single deep-layer feature map, it is difficult to detect defects at different scales. Another limitation of this method is that it is proposed for single defect detection (cracks). Yao et al. [38] proposed a deep-learning-based method to detect bugholes. They used the inception module [39] to detect small-size defects and address the problem of a limited number of labelled examples in the training data. They also studied the effects of illumination and shadows on the detection accuracy. Wei et al. [40] developed a method based on Mask R-CNN for concrete surface bughole detection. However, most of these methods have been tested on cropped images, and they generally require a high computation time since they involve two separate stages in their detection pipeline.
Kang et al. [41] presented a technique for automatic detection, localisation, and quantification of cracks. Faster R-CNN was used to provide bounding boxes for cracks, which were then segmented to ensure pixel-level crack localisation. Finally, the segmented cracks were assessed for their thicknesses and lengths. Moreover, to increase the robustness of their method, the authors used a variety of complex backgrounds under varying environmental conditions. Mishra et al. [42] developed a two-stage automated method, based on YOLOv5, for the identification, localisation, and quantification of cracks on general concrete structures. In the first stage, cracks were localised using bounding boxes, whereas in the second stage, the length of the cracks, reflecting the damage severity, was determined. The main limitation of this work is that the method was tested on very simple cases involving cropped and close-range images, on which cracks can be easily located. Consequently, it cannot be easily used for real-time inspection applications.
Xu et al. [43] developed a modified method for the detection and localisation of multiple seismic damages of reinforced concrete structures (i.e., cracking, spalling, rebar exposure, and rebar buckling). The Region Proposal Network (RPN) uses CNN features to define initial bounding boxes for damage, which are then refined using Fast R-CNN. One limitation of this method is that the RPN is trained by extracting all region anchors in the mini-batch from a single image. Because all samples from a single image may be correlated, the network may take a long time to reach convergence. Li et al. [44] proposed a unified model for concrete defect detection and localisation developed using Faster R-CNN. The model takes an image and computes feature maps using the shared bottom layers and then applies the defect detection network at the top layers. However, combining networks in this way incurs a large computational cost and is not suitable for real-time applications. Wan et al. [45] presented a method based on vision transformers for concrete defect detection. However, it is computationally intensive, which makes it inapplicable to real-time applications.

One-Stage Concrete Defect Detection
Teng et al. [46] and Deng et al. [21] used one-stage YOLOv2, pre-trained on ImageNet, for crack detection. These works showed good performance for real-time crack detection on close-range and cropped images. However, they are not easily extendible to other types of defects. Cui et al. [20] improved the one-stage detector YOLOv3 [47] to detect erosion. As a pre-trained model, they used Darknet53, which uses the Mish activation function. However, the method has poor performance when images have different aspect ratios. In addition, the method is limited to single defect detection. Zhang et al. [48] also used YOLOv3 for detecting multiple concrete bridge defects, including cracks, pop-outs, spalling, and exposed bars. The model was pre-trained on the MS-COCO dataset. However, this approach has issues detecting defects at different scales.
Wu et al. [49] used the YOLOv4 model for crack detection. A pruning strategy was employed to overcome the issue of over-parameterised CNNs, as well as to increase the detection speed. Wang et al. [50] developed an automated one-stage concrete defect detection method composed of two parts: the EfficientNet-B0 backbone network and the detector. To increase the detection accuracy, the detector gathers feature information from three scales and merges low-level and high-level features through an up-sampling procedure. However, the method was validated only on small, cropped, close-range images and only for two types of defects (cracks and exposed bars). Jiang et al. [22] used an improved YOLOv3 for multiple defect detection in concrete bridges. They combined EfficientNet-B0 and MobileNetV3 pre-trained on MS-COCO as a baseline with depthwise separable convolutions. The method did not show high performance when dealing with different scales. Kumar et al. [51] used YOLOv3 for detecting spalling and cracks. The method was validated only on cropped and close-range images. Zou et al. [8] used YOLOv4 to detect defects such as cracks, spalling, and exposed/buckled rebar. They employed depthwise separable convolutions to decrease the computational costs, which boosted the detection speed.
All the previous methods achieved noticeable success when tackling single defect detection such as cracks, but they lose efficiency when dealing with other types of defects. Indeed, developing a single model that can deal with all defect classes is hard to achieve. Defect detection and classification, particularly in UAV-based inspection, pose several challenges. First, even if defects can be categorised into different classes, there is huge intra-class variability due to varying illumination conditions and viewpoint and scale changes [14]. Second, some defect classes can have a huge overlap (e.g., oxidation, exposed bar, spalling), which can cause defect mislabelling.

Proposed Method
The core idea of this paper is to improve defect detection and localisation by combining object detection with saliency. This is motivated by the fact that defects are usually characterised by their local contrast with regard to the defect-free background surface. Combining saliency and object detection is intuitively appealing: it focuses attention on the parts with the highest local contrast, where potential defects can be located. Saliency can also enhance the feature representation of low-contrast defects (e.g., cracks, efflorescence), which enhances their region proposal scores for detection. The pipeline of the SMDD-Net method is depicted in Figure 2 and is composed of two main modules: (a) the attention module and (b) the multi-label one-stage concrete defect detection module. The attention module was designed to enhance the feature representation by putting more emphasis on parts of the image containing local discontinuities. In other words, the saliency map pinpoints parts of the image to enable the generation of region proposals on the concrete surface with the highest defect potential. The second module scans the region proposals to ascertain the defects through bounding box regression and classification. The latter was fine-tuned for multi-label defect detection using the CODEBRIM dataset [12]. In what follows, we describe each module separately by giving an example of its implementation, as shown in Figure 2, where Grad-CAM [52] and RetinaNet [35] were used as the baselines for implementing our architecture.

Saliency for Defect Region Proposals
The purpose of saliency, which is based on cognitive studies of visual perception, is to identify regions that exhibit local contrast with regard to their surroundings. This problem is very much aligned with that of concrete defect detection, where defects usually exhibit some contrast with regard to the immediate defect-free concrete surface. While not every salient region constitutes a defect, most defects exhibit some degree of local saliency. Thus, computing saliency and using it to draw attention to defective regions can be very useful to reduce false negatives (e.g., due to small-size and low-contrast defects), which is a common problem in concrete defect detection [53].
Early methods for saliency detection were based on hand-crafted features such as local histograms, calculated on a region support such as superpixels, and used contrast between the features' local statistics [54] or graph ranking [55] to assign saliency scores to pixels. These methods have been successful at distinguishing compact salient regions without requiring extensive training, but they cannot be readily applied to distinguish defects. First, some defects such as cracks are hard to describe using local statistics. Second, defects do not always occupy compact regions, but rather come in different sizes and shapes and are sometimes even fragmented into different parts. By using deep learning, local contrast patterns characterising saliency can be detected by training an appropriate model. For example, Hou et al. [56] introduced short connections to a CNN architecture that have the role of extracting holistic local discontinuities. Selvaraju et al. [52] proposed the Gradient-weighted Class Activation Map (Grad-CAM) method, which is aimed at producing a coarse localisation map highlighting the important regions of the image causing the prediction of a concept. Grad-CAM can be applied to a wide variety of CNN model families and correctly identifies the general region of the object predicted by the model.
In order to search for regions in the input image that can potentially contain defects, the SMDD-Net method integrates a local version of Grad-CAM to highlight regions in the image that are responsible for defect class activations. For this purpose, the input image is subdivided into several overlapping parts, where a saliency map is calculated separately for each part (see Figure 3b for an illustration). In order to maximise the defect detection rate, the maximum saliency is retained in each part of the division to produce the final saliency map of the input image. More formally, let us have $K$ defect classes and an input image $I$, which can contain one or several instances of these defect classes. We first subdivided the image into $n$ overlapping parts $P_1, ..., P_n$, as shown in Figure 3b. The saliency induced on part $P_i$ of the $k$-th feature map $A_{ik}$ after activating defect class $y^c$ is given by

$$\alpha_{ik}^{c} = \frac{1}{Z} \sum_{p} \sum_{q} \frac{\partial y^{c}}{\partial A_{ik}(p, q)},$$

and the weighted combination of the forward activation maps is given by

$$S_i^{c} = ReLU\left( \sum_{k} \alpha_{ik}^{c} A_{ik} \right),$$

where $\alpha_{ik}^{c}$ are the neurons' importance weights and $Z$ is the spatial size of the feature map ($p$ and $q$ are the spatial coordinates of the feature maps). Moreover, to take into account correlation effects between defect classes (e.g., oxidation caused by an exposed bar, which is itself caused by spalling), we built for class label $c$ a subset of labels $Q_c$ that might occur as a consequence of $y^c$ (e.g., $c$ = spalling, then $Q_c$ = {exposed bar, oxidation}). The final activation map produced for part $P_i$ is then given by the following formula:

$$S_i = \max_{c' \in \{c\} \cup Q_c} S_i^{c'}.$$

Finally, we built the complete local saliency map $S_{local}$ for the whole image $I$ by stitching the local saliency maps $S_i$ generated for parts $P_i$ into their corresponding positions in image $I$. For the overlapped parts between two patches, say $P_i$ and $P_j$, we took the maximum saliency between $S_i$ and $S_j$ for the overlapped area. Furthermore, to enable a global saliency enhancement of the feature representation, another saliency map $S_{global}$ was generated by feeding the whole image to Grad-CAM. Figure 3 illustrates the saliency extraction module on an image containing several defects spreading across the entire image. While the global saliency identified the most-salient defect on the right (exposed bar + spalling), the local saliency identified the efflorescence occupying a large portion of the image and the spalling on the left.
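The stitching step described above, where overlapping patch saliencies are merged by retaining the maximum, can be sketched as follows (a minimal NumPy illustration; the function name, the (row, col) patch-position format, and the use of axis-aligned rectangular patches are our assumptions, not details from the paper):

```python
import numpy as np

def stitch_local_saliency(patch_maps, positions, image_shape):
    """Stitch per-patch saliency maps S_i into a full-image map S_local.

    For pixels covered by more than one overlapping patch, the maximum
    saliency is retained, as described in the text.

    patch_maps  : list of 2-D arrays S_i (one per patch P_i)
    positions   : list of (row, col) top-left corners of each patch
    image_shape : (H, W) of the full image
    """
    s_local = np.zeros(image_shape, dtype=np.float32)
    for s_i, (r, c) in zip(patch_maps, positions):
        h, w = s_i.shape
        region = s_local[r:r + h, c:c + w]
        # element-wise max resolves overlapping areas between patches
        s_local[r:r + h, c:c + w] = np.maximum(region, s_i)
    return s_local
```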

Multi-Label One-Stage Defect Detection
The concern of defect detection is not only to classify defects, but also to localise them inside bounding boxes. Note that one-stage detection models are mainly focused on computational efficiency, which enables fast detection at the expense of limited accuracy [57]. Here, we tried to gain the benefits of both worlds by narrowing the search area of the one-stage detector, thus enabling rapid and accurate defect detection at once. Using the computed saliency map, we generated region candidates around the salient parts to narrow the search space for defects, similar to two-stage object detection using deep learning [58].
In the majority of object detection methods such as RetinaNet, the bounding boxes are assigned a single label. For concrete defect detection, however, several defects can co-occur at the same location. For example, exposed bar is often linked to corrosion or spalling defects. Likewise, cracks can be linked to efflorescence defects. To take this reality into account, we revisited the RetinaNet focal loss to implement a multi-label version, enabling the assignment of more than one label to each bounding box. The original Focal Loss (FL) for RetinaNet is an enhancement of the Cross-Entropy (CE) loss, which suffers from an extreme foreground-background class imbalance problem due to the dense sampling of anchor boxes. Since there are hundreds of anchor boxes in each pyramid layer, only a few will be assigned to a ground-truth object, while the vast majority will be the background class, which can collectively overwhelm the model. To mitigate this problem, the FL reduces the loss contribution from easy examples and increases the importance of correcting misclassified examples. More formally, the FL for a given object is given by

$$FL(p_c) = -\alpha \, (1 - p_c)^{\gamma} \log(p_c),$$

where $p_c$ is the estimated probability for the ground-truth class $c$. The weight $\alpha$ is used to mitigate the class imbalance problem; it can be set by the inverse class frequency or treated as a hyper-parameter set by cross-validation. The exponent $\gamma$ modulates the factor $(1 - p_c)$ applied to the cross-entropy loss. The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. When $\gamma = 0$, the FL is equivalent to the CE, and as $\gamma$ increases, the effect of the modulating factor increases.
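The down-weighting behaviour of the focal loss can be illustrated with a small sketch (a hypothetical helper for illustration, not the authors' implementation; in practice the loss is evaluated inside the detection framework over all anchors):

```python
import numpy as np

def focal_loss(p_c, alpha=0.25, gamma=2.0):
    """Focal loss for a single anchor: -alpha * (1 - p_c)^gamma * log(p_c).

    p_c is the estimated probability for the ground-truth class.
    With gamma = 0 the loss reduces to alpha-weighted cross-entropy;
    larger gamma down-weights easy (high-confidence) examples.
    """
    return -alpha * (1.0 - p_c) ** gamma * np.log(p_c)
```

For instance, with the default gamma = 2, an easy example (p_c = 0.9) contributes far less to the loss than a hard one (p_c = 0.5), which is exactly the rebalancing effect described above.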
To take into account the multi-label aspect of defect detection, we augmented the ground truth by creating copies of the bounding boxes containing several defect labels. This enables regressing different candidates in the region proposals to target the same ground-truth bounding box. The final effect is to produce tightly overlapping bounding boxes that can be labelled differently from each other. More formally, the FL for a given bounding box location having a set $L = \{c_1, ..., c_n\}$ of labels is given by

$$FL(L) = \sum_{i=1}^{n} -\alpha_{c_i} \, (1 - p_{c_i})^{\gamma} \log(p_{c_i}).$$

In the experiments of this study, we set $\alpha_{c_i} = 0.25$ and $\gamma = 2$. For bounding box regression, we used the smooth $L_1$ loss defined in [59], which is less sensitive to outliers than the $L_2$ loss. The final training loss was the sum of the FL and the smooth $L_1$ loss.
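The ground-truth duplication described above can be sketched as follows (the annotation tuple format is a hypothetical choice for illustration; the paper does not specify its internal representation):

```python
def expand_multilabel_boxes(annotations):
    """Ground-truth augmentation for multi-label detection.

    Each annotated box carrying several defect labels is duplicated into
    one single-label copy per label, so that different anchor candidates
    can regress to the same ground-truth box under different classes.
    Assumed (hypothetical) annotation format: (x1, y1, x2, y2, [labels]).
    """
    expanded = []
    for x1, y1, x2, y2, labels in annotations:
        for label in labels:
            expanded.append((x1, y1, x2, y2, label))
    return expanded
```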
As illustrated in the implementation example of Figure 2, the detection module in the RetinaNet architecture uses features extracted from a ResNet50 [60] backbone fine-tuned on the CODEBRIM dataset [12]. From this backbone, three pyramidal features, $F_3$, $F_4$, and $F_5$, are extracted, corresponding to the 3rd, 4th, and 5th convolutional blocks of the ResNet50 backbone. These features are combined with the saliency maps using the following formula:

$$F'_i = F_i \oplus (F_i \otimes \uparrow S), \quad i \in \{3, 4, 5\},$$

where $S$ is the inter-channel max-pooling fusion of $S_{local}$ and $S_{global}$, $\uparrow$ designates an up-sampling operation, $\oplus$ designates a skip connection adding a residual, and $\otimes$ designates pointwise feature multiplication. The new features $F'_i$ are then fed to the FPN, followed by the bounding box regression and classification sub-networks.
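Under the reading that the fused saliency is resized to each pyramid level, multiplied pointwise with the features, and added back through a residual skip connection, the combination can be sketched in NumPy (nearest-neighbour resizing stands in for the paper's up-sampling operator; function and variable names are ours, not the paper's):

```python
import numpy as np

def fuse_saliency(f_i, s_local, s_global):
    """Combine a pyramidal feature map with the saliency maps:
    F'_i = F_i + F_i * resize(S), with S the element-wise max of the
    local and global saliency maps.

    f_i               : (C, H, W) pyramidal feature map
    s_local, s_global : (h, w) saliency maps of the input image
    """
    s = np.maximum(s_local, s_global)       # max-pooling fusion of the two paths
    H, W = f_i.shape[1:]
    # nearest-neighbour resize of S to the spatial size (H, W) of F_i
    rows = np.arange(H) * s.shape[0] // H
    cols = np.arange(W) * s.shape[1] // W
    s_up = s[np.ix_(rows, cols)]
    # residual skip connection: multiply, then add back the original features
    return f_i + f_i * s_up[None, :, :]
```

The residual addition protects the pyramidal features when the saliency map contains very small values, which pure multiplication would destroy (this is the effect observed in the ablation study of the Results section).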

Experiments
To evaluate our method, we conducted experiments on bridge defect detection and compared our results with previous methods. Here, we briefly present the dataset and metrics used for the quantitative evaluation, as well as some important implementation details of our method.
The most recent dataset that exists for bridge defect detection is the COncrete DEfect BRidge IMage dataset (CODEBRIM) [12], which is composed of six classes: background (2490) and five defect classes: crack (2507), spallation (1898), exposed bars (1507), efflorescence (833), and corrosion stain (1559). The images were acquired at high resolution using drones, then resized to fit the input resolution required by the RetinaNet baseline method. In order to make the model perform better, especially on small objects and under changing viewing conditions, we augmented this dataset using bounding box data augmentation techniques such as brightness, mosaic, and shear. Bounding-box-level augmentation generates new training data by only altering the content of a source image within the bounding boxes [61].
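As an illustration of bounding-box-level augmentation, the sketch below alters only the pixels inside annotated boxes (a minimal brightness example; the exact augmentation pipeline used in the paper is not specified at this level of detail, and the function name is ours):

```python
import numpy as np

def augment_bbox_brightness(image, boxes, factor=1.5):
    """Bounding-box-level brightness augmentation (a minimal sketch).

    Only the pixel content inside each annotated box is altered, leaving
    the rest of the source image untouched; the box coordinates
    themselves are unchanged.

    image : (H, W, C) uint8 array
    boxes : iterable of (x1, y1, x2, y2) pixel coordinates
    """
    out = image.astype(np.float32)          # work in float, then clip back
    for x1, y1, x2, y2 in boxes:
        out[y1:y2, x1:x2] *= factor
    return np.clip(out, 0, 255).astype(np.uint8)
```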

Implementation Details
For our experiments, we used Google Colab Pro+, which provides 52 GB of RAM alongside 8 CPU cores and priority access to a P100 GPU. The CODEBRIM dataset was split into training, validation, and test sets with an approximate ratio of 70%, 20%, and 10%. We trained our model for 30 epochs with 2370 iterations per epoch, a batch size of 4, and fp16 mixed-precision training. Using transfer learning, ResNet50 was first pretrained on the CODEBRIM dataset, then used as the backbone for both the saliency (Grad-CAM) and detection modules. To help reduce False Positives (FPs), we added 10% background images to the dataset, with no objects (labels). In the inference phase, to select the best bounding box from the multiple predicted bounding boxes, we used the Non-Maximal Suppression (NMS) technique with a threshold equal to 0.45.
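The greedy NMS step used at inference can be sketched as follows (a standard implementation with the paper's 0.45 threshold; in a multi-label setting it would typically be applied per class so that co-located boxes with different labels survive):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy Non-Maximal Suppression.

    boxes  : (N, 4) array-like of (x1, y1, x2, y2)
    scores : (N,) confidence scores
    Returns the indices of the boxes kept, highest score first.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores)[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        # discard boxes overlapping box i above the threshold
        order = rest[iou <= iou_thresh]
    return keep
```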

Evaluation Metrics
To evaluate our method's performance and compare it to the other models, we used the mean Average Precision (mAP) metric, representing the average of the AP over all classes. In brief, the AP is the enclosed area under the precision-recall curve drawn for a single class. The larger the area, the higher the AP value and the better the detection performance of the method. The precision and recall metrics are defined as follows:

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN},$$

where TP, FP, and FN represent the number of True Positives, False Positives, and False Negatives, respectively. To classify the detections as a TP or FP, the Intersection Over Union (IOU) between the predicted and ground-truth bounding boxes was used. The mAP@0.5:0.95 was used to select the best weights of the model on the validation set, and the mAP@0.5 was used to evaluate the method on the test set.
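The IOU used to classify detections as TP or FP can be computed as in the sketch below (standard formula; the (x1, y1, x2, y2) box format is assumed):

```python
def iou(box_a, box_b):
    """Intersection Over Union between two (x1, y1, x2, y2) boxes.

    A detection counts as a TP when its IOU with a ground-truth box
    exceeds the chosen threshold (e.g., 0.5 for mAP@0.5), and as an
    FP otherwise.
    """
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```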

Results Analysis
To show the importance of the attention module, we performed an ablation study using SMDD-Net for defect detection in four different scenarios: (1) SMDD-Net without the attention module, (2) SMDD-Net without global saliency, (3) SMDD-Net without local saliency, and (4) SMDD-Net without residual skip connections. This allowed us to see the advantages of employing saliency. The AP values obtained for each defect class, as well as the mAP value, in the first scenario are presented in Table 1. The results of the remaining scenarios are presented in Table 2. According to the results of the second and third scenarios, the advantages of local and global saliency were complementary. In other words, the two saliency paths contribute equally to detection. In the last scenario, when the residual skip connections were deleted, the accuracy drastically decreased since the combination was performed through feature multiplication only. The saliency map extracted from the attention module may contain very small values, which can destroy the pyramidal features through multiplication. This is reflected in the decreased accuracy for this scenario. In conclusion, all the components of the attention module are important to achieve the reported detection accuracy. Figure 4 shows some detection samples produced by SMDD-Net. It should be noted that, in certain cases, the detection had a very high score (score = 100%). Note also that, even though the saliency highlighted regions other than the defects, the detection module did not output any false detections at these regions (see Examples 1, 3, and 4). To better assess the merits of the SMDD-Net method, we compared its performance against seven other methods [35,62-67]. Patel et al. 
[65] used an improved Faster-RCNN method, adding a multi-label loss function for concrete defect detection. They pointed out various elements that affect the network's accuracy, such as the inability of bounding boxes to accurately depict the complex shapes of defect patches, as well as annotation inaccuracies in the CODEBRIM database. Xiong et al. [66] used a YOLO-v4 network pre-trained on the MS COCO dataset and fine-tuned the entire model on the CODEBRIM training dataset for concrete defect detection, in order to compare visual inspections with automated ones. The other methods were implementations of the baselines YOLOv5-l [67], YOLOv8-l [64], RetinaNet [35], YOLOX [62], and YOLOR [63]. Table 3 compares SMDD-Net with these methods. Clearly, our method outperformed the other methods in terms of detection accuracy. Notably, the baselines YOLOR and YOLOX ranked second and third behind SMDD-Net, respectively, but with a significant gap with regard to the latter (7.3% for YOLOX and 9.9% for YOLOR). These results demonstrate, among other things, the benefit of the attention module for boosting the local feature representation, which significantly enhanced the detection accuracy. Figure 5 compares the validation curves of the implemented methods, showing that SMDD-Net, RetinaNet, and YOLOX converged faster than the other methods. To qualitatively compare the methods reported in Table 3, Figure 6 presents some illustrative examples of concrete defect detection. We selected these examples on purpose because they are particularly challenging. In the first example, the image was acquired at close range and contained two defects (a long crack on the left and corrosion on the right). Only SMDD-Net and YOLOR detected the corrosion. Note also that, owing to the multi-label detection, the defect on the right was assigned a second bounding box depicting an exposed bar, which was a valid detection. The second to fifth examples were acquired at medium range. In the second example, the image contained several defects (two oxidation stains and a crack); SMDD-Net detected all of them, whereas the other methods missed one or more. In the third example, the image contained various small cracks that were detected by SMDD-Net, but mostly missed by the other methods. In the fourth example, the image contained a crack in the middle against a large background area; only SMDD-Net was able to detect it, along with some efflorescence. In the fifth example, the image contained an exposed bar and several small cracks that can barely be seen even with the naked eye; SMDD-Net detected all the defects, while the other methods missed most of the cracks. Finally, in the sixth example, the image contained corrosion within a spalling area, as well as a significant area of efflorescence; only SMDD-Net and YOLOv8 successfully detected the spalling and corrosion. These results demonstrate the efficiency of SMDD-Net in detecting defects in challenging scenarios.
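The ablation results hinge on how the saliency map is fused with the pyramidal features. The following is a minimal sketch of one plausible reading of that fusion (max-pooling the saliency to the feature resolution, multiplying it into the features, and adding the features back through a residual skip connection); the tensor shapes and pooling arrangement are illustrative assumptions, not the paper's exact architecture. It makes explicit why removing the residual term hurts: with multiplication alone, near-zero saliency values zero out the pyramidal features.

```python
import numpy as np


def fuse_saliency(feature, saliency):
    """Fuse a pyramidal feature map with a saliency map.

    feature:  (C, H, W) pyramidal features
    saliency: (h, w) saliency map at a finer resolution, values in [0, 1]

    Steps (illustrative, not the paper's exact design):
      1. max-pool the saliency map down to the feature resolution,
      2. multiply it into every feature channel,
      3. add the original features back (residual skip connection).
    """
    C, H, W = feature.shape
    h, w = saliency.shape
    ky, kx = h // H, w // W
    # 1. Max-pooling: split the saliency map into (ky, kx) blocks
    #    and take the maximum of each block.
    pooled = saliency[: H * ky, : W * kx].reshape(H, ky, W, kx).max(axis=(1, 3))
    # 2. + 3. Multiplicative attention with a residual skip connection,
    #    so the pyramidal features survive even where saliency is ~0.
    return feature * pooled[None, :, :] + feature


feat = np.ones((2, 2, 2))
# With an all-zero saliency map, the residual skip preserves the features;
# without it, the multiplication alone would erase them entirely.
fused = fuse_saliency(feat, np.zeros((4, 4)))
```

Dropping the `+ feature` term reproduces the fourth ablation scenario: the output collapses wherever the saliency map is small, which matches the drastic accuracy drop reported above.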

Conclusions
In this paper, a novel one-stage concrete defect detection method was proposed. This method leverages attention in the form of saliency to focus detection on the most important parts of the image. This had the benefit not only of boosting the overall detection accuracy, but also of detecting low-contrast and small defects that were missed by most one-stage and two-stage detection methods. Our attention module was implemented by fusing local and global saliency maps extracted using back-propagation. To improve the feature representation for defect detection, the saliency features were combined with the pyramidal features. The experimental results demonstrated that the proposed method outperformed the other methods in terms of detection accuracy, while being computationally efficient. It also enabled better identification and localisation of overlapping defects. Although the results were shown for concrete bridge defect detection, SMDD-Net can be readily applied without major modifications to other concrete structure defect detection tasks.

Figure 1. Pipeline of the proposed SMDD-Net method. Top: (a) saliency computation for the attention module; (b) the multi-label one-stage concrete defect detection module. Bottom: an example of our method depicting the benefits of using attention.

Figure 2. Pipeline of the proposed SMDD-Net method: (a) the attention module enhancing regional feature representation for defect detection; (b) the multi-label one-stage defect detection module, which uses the RetinaNet model.

Figure 3. Illustration of the saliency extraction module for concrete defect detection: (a) input image; (b) overlapping-block image subdivision; (c) global saliency extraction; (d) local saliency extraction.

Figure 5. Comparison of the validation curves.

Figure 6. Visualisation of the compared results on concrete surface images.

Table 1. Results of the SMDD-Net evaluation on the test set without using the attention module.

Table 3. Comparison of SMDD-Net with other methods.