1. Introduction
Hollow defects typically arise from partial detachment between the surface layer and the base layer in building structures or renovations, accompanied by slight bulging. Because hollowed areas match the color of their surroundings, most are difficult to identify by direct visual inspection. If not addressed promptly, the outer material may peel off; in particular, material falling from high areas threatens the life and property safety of passers-by. The detection of hollow defects therefore holds significant practical importance.
Traditional mainstream detection methods include the hammer tapping method, rebound method, and pull-out test method. However, these methods often have limited detection ranges, and their results are susceptible to subjective factors. Moreover, some methods involve contact-based or even destructive testing, potentially causing secondary damage to the building surface. In contrast, various non-destructive testing techniques are gradually being applied in the field of building exterior wall inspection. These techniques can accurately detect internal hollow defects without damaging the main building structure or its finish [
1]. Considering that hollow defects in buildings mostly occur on the surface layer and cover large areas, infrared thermography technology, with its non-contact nature and capability for rapid full-field scanning, is highly suitable for rapid screening in such large-scale, complex facade scenarios [
2]. Its core principle involves capturing infrared radiation from the object’s surface using an infrared detector and converting it into visualized thermal images. Due to the presence of air in hollow defect regions, their surface temperatures differ from the surrounding environment when irradiated by external heat sources. Thus, under appropriate ambient temperature differences, infrared thermography technology enables effective detection of hollow defects. Furthermore, when equipped with thermal imaging devices, unmanned aerial vehicles can perform rapid detection of high-altitude and large-area structures.
In practical detection scenarios, the volume of captured data is enormous. When processing such data with traditional image processing techniques, professional inspectors are required to manually inspect, adjust parameters, and annotate each image individually, manually delineating the contours of hollow areas. This process is highly labor-intensive, incurring substantial labor and time costs. It is also susceptible to interference from complex backgrounds such as building mortar joints, material aging, and uneven illumination, leading to insufficient defect recognition accuracy and difficulty in meeting large-scale detection demands. In contrast, deep learning technology based on convolutional neural networks possesses powerful end-to-end learning and automatic feature extraction capabilities. By training models on large-scale datasets, it enables automatic recognition and segmentation of hollow defects and has been widely adopted in intelligent industrial inspection [
3,
4,
5].
In recent years, the integrated application of infrared thermography technology and deep learning-based image segmentation algorithms has become a core research focus in the field of intelligent non-destructive testing for engineering. In civil engineering, studies have applied convolutional neural networks to identify and segment thermal anomaly regions such as bridge cracks, concrete spalling, and building wall fissures [
6,
7,
8]. Compared to traditional manual inspection and conventional image processing methods, this significantly enhances the automation level and recognition accuracy of hollow defect detection. Existing empirical research confirms that U-Net [
9], Mask R-CNN [
10], and recently emerged one-stage detection models [
11] all demonstrate excellent feature extraction and generalization capabilities in thermal image detection tasks. However, limited by the low contrast and blurred boundaries characteristic of infrared thermal images, existing models still struggle to achieve desired performance in detecting small-scale defects within complex environments. Addressing the practical needs of building exterior wall hollow detection, lightweight modifications and targeted optimizations of existing deep learning models can simultaneously improve detection accuracy and enhance adaptability to complex scenarios [
12], representing a key research focus in this field.
Addressing the intelligent detection needs of building exterior wall hollow defects using infrared thermography, numerous scholars have conducted research on deep learning-based defect recognition in recent years. Wang et al. trained a Res-UNet-based model on fused visible and infrared thermographic data, achieving efficient identification of hollow defects and spalling defects [
13]. However, that study only employed improved U-Net structures of different depths, leaving other model architectures unexplored. Pan et al. designed hollow tile specimens of various shapes and sizes, evaluating the performance of Mask R-CNN and YOLACT models in instance segmentation tasks. The mean Average Precision (mAP) of Mask R-CNN reached 0.855, significantly higher than YOLACT’s 0.771 [
14]. Nevertheless, the dataset in that experiment was derived from artificially designed hollow defect specimens with regular shapes and fixed sizes, whereas actual building exterior walls are often subject to interference such as jointing lines and material aging, and real hollow defects take complex and diverse forms, which limits the applicability of the models to a certain extent. Experimental results also show that these models generally exhibit relatively low detection accuracy for small hollow defect targets.
As a representative architecture for one-stage object detection and instance segmentation, the YOLOv8 model features cross-layer feature fusion and an anchor-free design, endowing it with strong multi-scale detection ability. It is therefore well suited to scenarios such as hollow defect detection, where stringent control of missed detections is required. However, research on applying YOLOv8 to infrared detection of building exterior wall hollow defects remains scarce, and improvements to the model’s capability in real-world scenarios require further exploration. To this end, this paper proposes a lightweight attention-enhanced improvement scheme based on YOLOv8, which replaces the C2f module in its backbone network with a C2fPSA module that integrates the Position-Sensitive Attention (PSA) mechanism. The design of this module draws on the multi-head self-attention mechanism and positional encoding idea of the Transformer [
15]. This improvement allows the model to focus on the feature representation of key regions and boundary transition zones of hollow defects without affecting the model’s running speed and original inference process, thereby reducing missed detections and improving segmentation accuracy.
In summary, to remedy the deficiencies of existing research in model comparison and adaptability to real scenarios, this paper combines infrared thermal imaging technology with deep learning-based instance segmentation. On the one hand, a dedicated dataset of hollow defects in building exterior walls, containing various real interference factors, is constructed, and systematic performance comparison experiments are carried out on several mainstream instance segmentation models to clarify their adaptability to real application scenarios. On the other hand, the Position-Sensitive Attention (PSA) mechanism is introduced to optimize the YOLOv8 model, enhancing its anti-interference capability and segmentation accuracy in complex real scenarios. Ultimately, an automated detection scheme that balances detection accuracy and inference efficiency is identified, providing tangible technical support and practical reference for large-scale non-destructive testing of hollow defects in building exterior walls.
2. Materials and Methods
To achieve efficient, accurate, and automated identification of hollow defects in building exterior walls, this section elaborates on the dataset construction process, the principles of the baseline instance segmentation models, the design of the improved model, the experimental environment configuration, and the performance evaluation metrics, providing complete and reproducible methodological support for subsequent model training and performance comparison.
2.1. Establishment of a Hollow Defect Dataset
The test data were collected from buildings in Minhou County, Fuzhou City. The infrared thermal imager used was the InfraRed Camera R500Pro, manufactured by Nippon Avionics Co., Ltd., Yokohama, Japan. Data collection was carried out on sunny spring and autumn mornings, periods of rapid temperature rise. A total of 409 thermal images were ultimately obtained.
According to the relevant technical specifications for infrared thermography detection [
16,
17], the thermal images were processed with a 2–3 color display mode. When the surface temperature difference between the red and green areas exceeded the quantitative threshold of 1 °C specified in the specifications, the red areas were judged as hollow defects. Combined with the color gradient distribution characteristics, the boundary of the yellow area was determined as the hollow defect boundary. This yellow area lies in the transition zone between the red hollow defect areas and the green intact areas. Further verification can be conducted using hardness testing. During the testing process, after estimating the extent of hollowing using an infrared thermal imaging camera, multiple survey lines were arranged within the estimated range on the exterior wall using a measuring scale. Hardness tests were then carried out along these survey lines with a hardness tester, and the points where abrupt changes in hardness values occurred were connected to define the hollow defect boundary. The testing process is shown in
Figure 1. The hollow defect boundaries obtained through multiple hardness tests were found to be in good agreement with the yellow temperature boundaries displayed by infrared thermal imaging, thereby validating the reliability of the preliminary infrared thermal imaging results.
In the data annotation stage, the infrared thermal images were annotated by an experienced operator with professional knowledge of building exterior wall defects and infrared thermography, using the Labelme image annotation software (version 5.6.1). After annotation, all images were uniformly converted to COCO format or YOLO format, and randomly divided into training, validation, and test sets at a ratio of 7:2:1. To enhance the generalization ability of the model, data augmentation techniques were applied to the training, validation, and test sets in proportion to their sizes, expanding the original sample set to 3200 samples. This approach not only ensured consistent data distribution across the sets and avoided data leakage but also effectively enriched sample diversity, providing reliable support for stable model training. The dataset construction process is shown in
Figure 2.
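The 7:2:1 random split described above can be sketched as follows. This is an illustrative helper (the function name and fixed seed are assumptions, and the augmentation step is omitted), not the authors' pipeline:

```python
import random

def split_dataset(image_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly split image IDs into train/val/test at a 7:2:1 ratio."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)          # fixed seed -> reproducible split
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 409 raw thermal images, as in this study
train, val, test = split_dataset(range(409))
```

Shuffling before splitting, with a fixed seed, keeps the subsets disjoint and the split reproducible, which is what prevents data leakage between the sets.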
2.2. Implementation of Instance Segmentation Algorithm in Hollow Defect Recognition
According to the model processing stages, instance segmentation technologies are classified into two-stage and one-stage instance segmentation methods [
18,
19]. Based on this classification method, this study selects the two-stage segmentation model Mask R-CNN, as well as the one-stage segmentation models YOLACT and YOLOv8 for training. Through the comparison of these three models, the most suitable model for the automatic recognition of hollow defects is explored.
2.2.1. Mask R-CNN Algorithm-Based Hollow Defect Recognition Method
Mask R-CNN is a two-stage model further applied to instance segmentation, which is based on the Faster R-CNN object detection model [
20,
21]. The flow of the hollow defect recognition algorithm based on Mask R-CNN is shown in
Figure 3, and the specific steps are as follows:
Step 1: Input the image into the feature extraction network. Through the Residual Network (ResNet) and Feature Pyramid Network (FPN), fuse and construct a feature map containing multi-scale information, such as tiny edge details and overall contours of hollow defects.
Step 2: Use the Region Proposal Network (RPN) to generate a series of rectangular anchor boxes on these feature maps. Perform binary classification of hollow defects and background, as well as bounding box regression on these anchor boxes.
Step 3: Obtain high-quality candidate regions of hollow defects through Non-Maximum Suppression (NMS) screening. This step reduces the interference of complex backgrounds on the building surface, such as decorative textures and stains.
Step 4: Introduce Region of Interest Align (ROI Align). Solve the misalignment problem between features and bounding boxes, which easily occurs in irregular hollow defects, through bilinear interpolation.
Step 5: On the basis of the object detection task branch (including object classification and bounding box regression), add a parallel instance segmentation branch. This branch predicts the pixel-level mask of each hollow defect, thus realizing the contour depiction of irregular hollow defects.
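The NMS screening in Step 3 can be illustrated with a minimal greedy implementation. This is a generic sketch of the algorithm, not the Mask R-CNN framework's own code:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping candidates.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # process boxes by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box with all remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # discard heavily overlapping boxes
    return keep
```

Candidate boxes that overlap a higher-scoring box beyond the IoU threshold are suppressed, which is how duplicate detections of the same hollow defect are removed.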
2.2.2. YOLACT Algorithm-Based Hollow Defect Recognition Method
YOLACT is the first one-stage model to achieve real-time instance segmentation [
22]. The flow of the hollow defect recognition algorithm based on YOLACT is shown in
Figure 4, with the specific steps as follows:
Step 1: Feed the input image into ResNet and FPN. Extract the edge structures and contextual semantic information of hollow defects on building exterior walls through multi-scale feature fusion.
Step 2: Send the generated multi-scale fused feature maps to two parallel task branches. One is the Protonet branch, which predicts a set of prototype masks from the fused features output by the FPN. These prototype masks serve as a shared basis for generating all instance masks; through linear combination they can adapt to hollow defects of different morphologies, avoiding the insufficient mask adaptability that differences in defect shape would otherwise cause. The other is the Prediction Head branch, which predicts class probabilities, bounding box offsets, and mask coefficients for multi-scale anchor boxes.
Step 3: Apply NMS to the prediction results of the Detection Branch to filter candidate boxes. This ensures that hollow defects of different scales and shapes can be accurately located.
Step 4: Dynamically generate the mask for each hollow defect instance through the linear combination of prototype masks and mask coefficients.
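The linear combination in Step 4 reduces to a tensor product of coefficients and prototypes followed by a sigmoid. The sketch below (function name and threshold are assumptions) shows the idea:

```python
import numpy as np

def assemble_masks(prototypes, coeffs, threshold=0.5):
    """Combine shared prototype masks with per-instance coefficients.
    prototypes: (H, W, K) prototype maps; coeffs: (N, K) per-instance weights.
    Returns N binary instance masks of shape (N, H, W)."""
    logits = np.tensordot(coeffs, prototypes, axes=([1], [2]))  # (N, H, W)
    probs = 1.0 / (1.0 + np.exp(-logits))                       # sigmoid
    return probs > threshold
```

Because the K prototypes are shared across all instances, each defect mask costs only a K-dimensional weighted sum, which is what makes YOLACT's mask assembly fast.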
2.2.3. YOLOv8 Algorithm-Based Hollow Defect Recognition Method
YOLOv8 is an important instance segmentation version in the YOLO series. The flow of the hollow defect recognition algorithm based on YOLOv8 is shown in
Figure 5, with the specific steps as follows:
Step 1: The model uses the backbone network to extract multi-level feature maps of hollow defects, such as edge textures and spatial shapes. Through the Path Aggregation Network (PANet) in the neck structure, it fuses the multi-scale features of tiny boundaries and overall contours of hollow defects in the image, strengthening the feature discrimination ability for hollow defects of different sizes.
Step 2: The fused feature maps are sent to the detection head and segmentation head. YOLOv8 adopts an anchor-free prediction mechanism, which breaks free from the limitations that fixed anchor boxes impose on irregularly shaped hollow defects and adapts to any defect contour. The detection head directly predicts target bounding boxes, class probabilities, and mask coefficients at each feature map position, while the segmentation head outputs prototype mask maps.
Step 3: The model removes overlapping detection boxes through post-processing operations such as NMS, obtaining a set of hollow defect localization and classification results.
Step 4: Pixel-level hollow defect masks are generated by combining mask coefficients and prototype mask maps.
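The anchor-free prediction in Step 2 can be illustrated by the distance-based box decoding it typically relies on: each feature-map point predicts its distances to the four box edges rather than offsets from a preset anchor. This is a simplified sketch of the idea (YOLOv8's actual head additionally uses distribution focal loss bins), not the library's code:

```python
import numpy as np

def decode_boxes(points, dists, stride):
    """Anchor-free decoding: each grid point predicts (left, top, right,
    bottom) distances to the box edges, scaled back to image space by the
    feature-map stride. points: (N, 2); dists: (N, 4) -> (N, 4) xyxy boxes."""
    x, y = points[:, 0], points[:, 1]
    l, t, r, b = dists.T
    return np.stack([(x - l) * stride, (y - t) * stride,
                     (x + r) * stride, (y + b) * stride], axis=1)
```

Because the distances are unconstrained, the decoded box can take any aspect ratio, which is why this scheme adapts to irregular defect contours better than fixed anchors.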
2.3. Construction of the Modified YOLOv8 Model
In practical engineering scenarios, the thermal images of hollow defects are characterized by blurred boundaries and continuous color transitions into intact wall surfaces. They are also subject to noise interference such as wall jointing lines and material aging, and the thermal-signal contrast of micro hollow defects is even lower, which places higher demands on the model’s feature-focusing and anti-interference capabilities. Although the C2f module in the YOLOv8 backbone network extracts features efficiently, it lacks targeted attention to the key regions of hollow defects, leaving the model susceptible to interference in complex backgrounds and with room for improvement in boundary fitting and micro-defect capture. To address this, this study proposes a lightweight improvement scheme: the C2f modules at key levels of the YOLOv8 backbone network are replaced with the C2fPSA module, which integrates the Position-Sensitive Attention (PSA) mechanism, so as to strengthen the model’s focus on the key features of hollow defects. The design principles and structural details of the C2fPSA module are elaborated below.
2.3.1. Design of the C2fPSA Module
The C2f module in the YOLOv8 backbone network adopts a dual-branch parallel processing structure: First, the module divides the input feature map into two branches. The first branch directly retains the original features, while the second branch performs feature transformation through multiple Bottleneck units, and finally outputs the fused features. The specific structure of the module is shown in
Figure 6a.
The core change in the C2fPSA module introduced in this study is to replace the Bottleneck units of the C2f module with Position-Sensitive Attention blocks (PSABlock), while maintaining the basic architecture of the C2f module. The specific structure of the module is shown in
Figure 6b. The specific structure of the PSABlock is shown in
Figure 6c.
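The core operation inside a PSA-style block is self-attention over spatial positions. The toy sketch below uses identity Q/K/V projections for brevity (a real PSABlock learns convolutional projections and adds a feed-forward branch), so it illustrates only the position-to-position re-weighting:

```python
import numpy as np

def spatial_self_attention(feat, num_heads=2):
    """Toy multi-head self-attention over spatial positions.
    feat: (C, H, W) feature map -> re-weighted map of the same shape.
    Projections are identity here; a learned block would project Q/K/V."""
    C, H, W = feat.shape
    x = feat.reshape(C, H * W).T                  # (HW, C): one token per position
    d = C // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):
        q = k = v = x[:, h * d:(h + 1) * d]       # identity projections per head
        attn = q @ k.T / np.sqrt(d)               # (HW, HW) position-pair scores
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)   # softmax over key positions
        out[:, h * d:(h + 1) * d] = attn @ v      # attention-weighted mixing
    return out.T.reshape(C, H, W)
```

Each output position becomes a weighted mixture of all spatial positions, which is the mechanism that lets the block emphasize defect key regions and boundary transition zones.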
2.3.2. Improved YOLOv8 Network Structure
Considering that the details and boundary information of hollow defects are mainly contained in the middle and high-level features, this study selects to replace the C2f modules at two scale levels (P3/8 and P4/16) in the YOLOv8 backbone network with C2fPSA modules, while keeping the rest of the structure unchanged. This design directly embeds the attention mechanism into the feature levels that are more sensitive to small hollow defects and boundary details, thereby enhancing the model’s ability to represent key regions. The schematic diagram of the improved YOLOv8 backbone network framework and the replacement positions is shown in
Figure 7.
2.4. Configuration of Model Training Parameters
This study was conducted on the Ubuntu 22.04 operating system, with hardware comprising an AMD EPYC 9754 128-core processor and an NVIDIA RTX 4090D graphics card (24 GB video memory). PyTorch 2.3.0 was adopted for model training, with CUDA 12.1 used to accelerate computation. The training batch size was set to 12. Mask R-CNN was trained for 100 epochs, while YOLACT and YOLOv8 (which converge more slowly) were trained for 200 epochs. All models used the Stochastic Gradient Descent (SGD) optimizer, with a momentum of 0.9 and a learning rate of 0.02. The input resolution was uniformly set to 640 × 480 (the default resolution of the infrared camera used for data acquisition). To ensure statistical significance, each experiment was repeated three times with different random seeds, and the average results are reported. Additionally, all models were initialized with the default pre-trained weights provided by their official frameworks, which are pre-trained on the COCO dataset.
2.5. Evaluation Metrics for Recognition Accuracy
In deep learning tasks, the main evaluation metrics are Precision (P) and Recall (R), and Average Precision (AP) is adopted as the core performance metric [
23].
Intersection over Union (IoU) is a metric used to measure the overlap degree between the predicted mask and the ground truth mask. Its calculation formula is as follows:

IoU = S1 / S2

where S1 is the area of the overlapping region between the predicted mask and the ground truth mask, and S2 is the total coverage area of the predicted mask and the ground truth mask.
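For binary masks, this definition reduces to a few array operations; a minimal sketch:

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU = S1 / S2: overlap area over total covered area of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    s1 = np.logical_and(pred, gt).sum()   # S1: overlapping region
    s2 = np.logical_or(pred, gt).sum()    # S2: total coverage (union)
    return s1 / s2 if s2 else 0.0
```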
Both P and R are calculated at an IoU threshold of 0.5, following the standard evaluation protocol for instance segmentation tasks (consistent with the COCO evaluation criteria). The calculation formulas are as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)

where TP is the number of correctly predicted target samples, FP is the number of incorrectly predicted target samples, and FN is the number of missed target samples.
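A minimal sketch of these two formulas:

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP); R = TP / (TP + FN), at a fixed IoU threshold."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r
```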
The AP value is defined as the Area Under the Curve (AUC) of the Precision-Recall (PR) curve. Its calculation formula is as follows:

AP = ∫₀¹ P(R) dR
Mean Average Precision (mAP) is the mean of the AP values across all categories at a given IoU threshold:

mAP = (1/N) Σᵢ APᵢ, i = 1, …, N

where N is the number of categories. mAP@50 refers to the mean of the AP values for all categories, obtained from the area under the PR curve under the condition IoU ≥ 50%. mAP@50-95 is the mean of the mAP values calculated over the IoU threshold interval [0.50, 0.95] with a step size of 0.05, and is used to represent the comprehensive performance of the model. Its calculation formula is as follows:

mAP@50-95 = (1/10) Σₜ mAP@t, t ∈ {0.50, 0.55, …, 0.95}
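A minimal sketch of the AP and mAP@50-95 computations, using a trapezoidal approximation of the area under the PR curve (the official COCO evaluation uses 101-point interpolation, a close variant):

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the PR curve (trapezoidal approximation)."""
    r, p = np.asarray(recall, float), np.asarray(precision, float)
    order = np.argsort(r)                        # sort points by recall
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def map_50_95(map_at_thresholds):
    """mAP@50-95: mean of mAP values at IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(map_at_thresholds) == 10          # 10 thresholds, step 0.05
    return float(np.mean(map_at_thresholds))
```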
The evaluation efficiency of the model is measured by the Frames Per Second (FPS) metric [
24], which represents the number of images processed per second. It is used to evaluate the computational efficiency of the model on a general-purpose GPU.
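FPS is typically measured by timing repeated inference calls after a short warm-up (so that one-time initialization costs are excluded); a minimal sketch, with the helper name an assumption:

```python
import time

def measure_fps(infer, frames, warmup=2):
    """FPS = frames processed / elapsed seconds, after a short warm-up."""
    for f in frames[:warmup]:
        infer(f)                          # warm-up calls, excluded from timing
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```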
3. Results
3.1. Comparison of Accuracy of Overall Recognition Results
The loss of different models on the training set, as well as the changes in the mask-based mean Average Precision (mAP@50 and mAP@50-95) on the validation set as a function of the number of training epochs, are shown in
Figure 8,
Figure 9 and
Figure 10; the specific performance of each segmentation model is presented in
Table 1; and the ablation experiment results for different placement strategies of the C2fPSA module are shown in
Table 2. From the figures and tables, the following can be observed:
(1) All three models converged effectively during training, with their performance metrics improving markedly in the first 20 epochs, though they differed in the number of epochs required to reach convergence. The metrics of Mask R-CNN showed no significant fluctuations once training stabilized. YOLACT completed convergence at approximately 100 epochs, while YOLOv8 had the longest convergence cycle, stabilizing only at around 175 epochs. After training, the metric curves of the modified YOLOv8 model fluctuated only mildly, indicating that the model still converged effectively with the C2fPSA module introduced; the overall training process deviated little from the original training curves, and no abnormal oscillations occurred.
(2) In terms of performance metrics, there were significant differences in the adaptability of the three models to hollow defects. Mask R-CNN achieved an mAP@50 of 0.966, yet its Precision (P), Recall (R) and mAP@50-95 were the lowest among the three models. YOLACT demonstrated the best overall performance, with its P, mAP@50 and mAP@50-95 all ranking the highest. The precision of YOLOv8 was slightly lower than that of YOLACT, yet its R topped the list at 0.869. All core performance metrics of the modified YOLOv8 model were significantly improved compared with the original model, which indicates that the C2fPSA module can effectively enhance the model’s ability to characterize the key regions of hollow defects.
(3) In terms of the computational efficiency of the models, there were significant differences in their actual inference speed. Mask R-CNN featured a low FPS and high computational overhead; YOLACT had a streamlined structure with minimal redundant computation; the efficiency of YOLOv8 showed little change before and after the modification, achieving a good balance between detection accuracy and processing efficiency.
(4) Synthesizing the performance characteristics of the three models and the actual requirements of hollow defect detection, there were distinct differences in their applicable scenarios: the YOLACT model exhibited comprehensively optimal performance in both precision metrics and inference efficiency, making it suitable for on-site rapid screening where both efficiency and precision are pursued; Mask R-CNN, constrained by its low frame rate, is more applicable to the refined post-processing of static images; YOLOv8 delivered the best performance in terms of recall rate, which makes it suitable for engineering sites with stringent requirements for missed detection control and real-time performance guarantees.
(5) The ablation experiments validate the effectiveness of the proposed C2fPSA module. After embedding the C2fPSA module into the YOLOv8 model, all evaluation metrics of the improved models improved significantly, directly confirming the module’s value for extracting hollow defect features. Specifically, the model with the C2fPSA module added only to the P3 layer achieved the highest recall (0.941) of all groups, demonstrating the strongest suppression of missed detections for tiny hollow defects. The model with the module added only to the P4 feature layer exhibited the best overall segmentation performance, ranking first in both mAP@50 and mAP@50-95. The finally adopted P3 + P4 dual-layer embedding strategy achieved an optimal performance balance without significant shortcomings, making it well suited to the comprehensive requirements of on-site hollow defect detection.
(6) The model complexity analysis further validates the lightweight nature of the improved scheme. Compared with the original YOLOv8 model (3.26 M parameters, 4.31 GFLOPs), the parameter count of the final improved model increased only to 3.54 M and the computation only to 4.83 GFLOPs. The added complexity is almost negligible, and the inference speed of the model did not decrease significantly.
Figure 8.
Variation Curve of the Total Loss Function of Different Models.
Figure 9.
mAP@50 Curve of Different Models.
Figure 10.
mAP@50-95 Curve of Different Models.
3.2. Comparison of Accuracy of Local Recognition Results
The segmentation performance of different models for hollow defects on building exterior walls is shown in
Figure 11 and
Figure 12. All three models can detect most hollow defects, yet there are certain differences in their visual segmentation performance.
It can be seen from the two types of scenarios that the three models exhibit distinct characteristics in local recognition performance: the core problems of Mask R-CNN are concentrated in mask over-segmentation and adhesion, with a relatively low matching degree between the segmented regions and actual hollow defects; YOLACT achieves accurate segmentation for scattered hollow defects but is prone to edge adhesion at complex boundaries; YOLOv8 preserves the finest details in complex boundary scenarios, with only slight false detections occurring in scattered scenarios. The predicted masks output by the modified YOLOv8 are more refined, and the model can effectively suppress the interference of background noise, distinguish between real hollow defects and intact wall surfaces, and reduce the false detection of small targets.
4. Discussion
There are fundamental differences among the three models in terms of core architecture design, mask generation methods, and feature extraction and fusion strategies. Moreover, the boundary between hollow defects and intact wall surfaces is not a clear line, but a transition zone with continuous color variation. These model differences and the characteristics of hollow defects lead to different detection results.
4.1. Analysis of Causes for Overall Performance Differences
In terms of overall performance, the models exhibit significant differences in convergence speed, accuracy upper limit, and inference efficiency.
The Region Proposal Network (RPN) of Mask R-CNN can quickly lock onto large-scale hollow defect regions with significant color changes in thermal images, demonstrating extremely high efficiency in learning features of highly salient targets. As a result, its loss function tends to stabilize after approximately 20 epochs, and it exhibits high detection confidence for large, high-contrast hollow defects, which is the core reason for its outstanding performance in P and mAP@50. Meanwhile, the RPN employs specific screening criteria when selecting targets. This mechanism easily omits small-scale hollow defects or blurred boundaries with minor color differences and low feature distinctiveness, directly leading to insufficient perception of such easily missed targets and a significantly lower recall rate. Furthermore, in the mask generation stage, the upsampling operation of RoI Align loses fine boundary localization information, limiting its performance on the more stringent mAP@50-95 metric. Additionally, the two-stage framework, which first generates candidate boxes and then performs classification, regression, and mask segmentation, incurs high computational overhead, failing to meet the frame rate requirements for real-time detection, resulting in the lowest FPS among the three models.
The core of YOLACT lies in its dual-branch design, with prototype mask generation and detection running in parallel. This structure can simultaneously learn the global irregular contour features of hollow defects, the pattern of boundary color gradients, and the localization and classification features of local targets, without the serial candidate-box refinement required by two-stage models. Consequently, it converges in only 100 epochs. The detection branch generates unique mask coefficients for each target, which are linearly combined with the prototype masks, accurately restoring diverse defect morphologies and achieving stronger discrimination between hollow defects and the background. As a result, YOLACT leads in all accuracy metrics. The parallel computation mode also significantly reduces computational overhead, enabling high-frame-rate inference and hence a high FPS.
The performance characteristics of YOLOv8 stem from its cross-layer feature fusion mechanism and anchor-free detection architecture. Its anchor-free design predicts targets directly, avoiding the localization deviations caused by mismatches between preset anchor boxes and irregular hollow defects, thereby enhancing perception of small-scale and irregularly shaped defects; this is why it achieves the highest recall rate. The Path Aggregation Network, through cross-layer feature fusion, establishes deep correlations between shallow detail features and deep semantic features, enabling simultaneous capture of subtle color differences in small targets and the overall contours of large targets, further strengthening detection across all defect scales. However, the complex feature fusion and adaptive label-assignment process also gives it the longest training convergence period, placing its accuracy and speed at a moderate level among the three models.
4.2. Analysis of Causes for Local Effect Differences
In terms of local detail processing, the models' differing adaptability to blurred boundaries and complex morphologies further reveals the underlying causes of their performance differences.
Mask R-CNN exhibits severe mask over-segmentation in both scenarios, with the lowest match between the segmented regions and the actual hollow defect areas. During the mask generation stage, when RoI Align crops candidate regions, its local receptive field struggles to distinguish the gradient transition zone of the hollow defect boundary from the background, and the subsequent upsampling further blurs fine boundary features, leaving the model with insufficient positioning accuracy at hollow defect edges. In addition, its region cropping and alignment mechanism, when confronted with hollow defect boundaries that lie extremely close together, fails to separate the adjacent boundaries, resulting in severe edge adhesion.
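The boundary blurring can be demonstrated with the bilinear interpolation used inside RoI Align: sampling exactly between a defect region and the background returns an averaged value rather than either side of the step edge, which is why sharp gradient transition zones are softened. The feature map below is a toy assumption with a hard 0/1 edge.

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Bilinear interpolation at a fractional location, as used inside RoI Align."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (fmap[y0, x0] * (1 - dy) * (1 - dx) + fmap[y0, x1] * (1 - dy) * dx
            + fmap[y1, x0] * dy * (1 - dx) + fmap[y1, x1] * dy * dx)

# A sharp defect/background step edge: columns 0-1 background (0), columns 2-3 defect (1).
fmap = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)
# Sampling on the boundary yields 0.5 -- neither background nor defect.
v = bilinear_sample(fmap, 1.5, 1.5)
```

Every sample that straddles the edge mixes the two sides, so after several such samples and an upsampling pass, the precise defect contour can no longer be recovered.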
YOLACT achieves the best overall segmentation performance among the three models, benefiting from the global feature learning of its prototype masks. The prototype masks pre-learn the global contours and boundary features of hollow defects and are then linearly combined with the instance-specific mask coefficients predicted for each defect, producing tightly fitted boundaries. This yields an extremely close fit for independent, scattered hollow defect boundaries and effectively avoids over-segmentation. However, in the transition zones between adjacent boundaries, multiple masks overlap at the same pixel positions, causing adhesion between the edges of different defects and preventing precise separation of densely adjacent defect boundaries.
YOLOv8 retains the finest edge details in complex boundary scenarios, achieving the best restoration of gradient defect boundaries. Its Path Aggregation Network makes it more sensitive to subtle features than the other two models, which occasionally produces false positives in which slight wall-surface textures or color differences are identified as small hollow defects. Unlike anchor-based models, which locate boundaries indirectly through anchor-box regression, YOLOv8's anchor-free design predicts target contours pixel by pixel. With no candidate-box cropping or upsampling, it better preserves the color transitions along contours and therefore restores complex boundaries best.
5. Conclusions
To address the low efficiency and heavy reliance on manual experience of traditional hollow defect detection methods, this study explores an automated detection approach based on infrared thermal imaging and deep learning. Taking buildings in Minhou County, Fuzhou City as the research object, a dedicated thermal imaging dataset of exterior wall hollow defects is constructed. Three mainstream instance segmentation models (Mask R-CNN, YOLACT, and YOLOv8) are trained and evaluated, and an improved YOLOv8 model is developed on this basis.
The training results show that, for building exterior wall hollow defect detection, YOLACT achieves the best overall balance of accuracy and inference speed, but its masks are prone to adhesion at complex boundaries, making it suited to on-site rapid screening. The improved YOLOv8 model has a distinct advantage in recall and is suited to scenarios that demand strict control of missed detections and involve complex boundary shapes. Mask R-CNN, with its low efficiency, limited detail accuracy, and mask over-segmentation, is better suited to refined analysis of static images.
This paper addresses the low accuracy and slow speed of traditional manual inspection of hollow defects on the surface layer of building exterior walls, and proposes replacing manual inspection with instance segmentation technology to provide an automated solution for hollow defect detection. Future research can further improve the model's robustness and practicality in complex field environments through model lightweighting and scenario-adaptation optimization.