Crack Detection in Images of Masonry Using CNNs

While there is a significant body of research on crack detection by computer vision methods in concrete and asphalt, less attention has been given to masonry. We train a convolutional neural network (CNN) on images of brick walls built in a laboratory environment and test its ability to detect cracks in images of brick-and-mortar structures both in the laboratory and in real-world images taken from the internet. We also compare the performance of the CNN to a variety of simpler classifiers operating on handcrafted features. We find that the CNN performed better than these simpler models on the domain adaptation task from laboratory to real-world images. However, we also find that performance is significantly better on the reverse domain adaptation task, in which the simple classifiers are trained on real-world images and tested on the laboratory images. This work demonstrates the ability to detect cracks in images of masonry using a variety of machine learning methods and provides guidance for improving the reliability of such models when performing domain adaptation for crack detection in masonry.


Introduction
Masonry construction is common in both historical and contemporary architecture [1][2][3]. Furthermore, there has been a surge of interest in using masonry for sustainable infrastructure in the future [4][5][6][7][8]. One of the reasons masonry construction achieves a long service lifetime is its ability to be incrementally repaired; the sacrificial mortar and the modular nature of individual masonry blocks mean it is less expensive to maintain than monolithic materials like concrete slabs. Nevertheless, there are a myriad of ways these structures can incur damage over their lifetime. Masonry structures are susceptible to cracking due to thermal stress from freezing and thawing cycles [9][10][11] or incompatible material adjacency [12], hygroscopic stress from precipitation or rising damp [13,14], as well as mechanical stress from settlement [15][16][17][18] or earthquakes [19]. According to the U.S. Federal Emergency Management Agency, unreinforced masonry is typically the building material most vulnerable to earthquake damage [20].
Traditionally, masonry facade inspections have been performed through a combination of techniques including: ground-level inspections [21]; tactile inspections using cherry pickers, scaffolding, or ropes [22]; and drone-based inspections [23,24]. These techniques are costly and time-consuming because of the mobilization requirements. Even approaches using drone-captured images can be time-consuming since presently a human must manually inspect them to ascertain the state of the structure. The goal of this research is to develop software to automate the crack classification process. We first identify cracks under optimal conditions in a narrow image domain (i.e., consistent materials, uniform lighting, and orthogonal photography [63]). Subsequently, we test the method against images of cracked masonry obtained from Google Image Search and measure its performance. A common element missing in prior works is the application of models to testing data drawn from a different domain than the one they were trained on; any model deployed under realistic circumstances would need to demonstrate a high degree of transferability between structures and conditions. In this work, domain adaptation is therefore explicitly considered in the development of the models. Critically, we find that the performance of machine learning models under domain adaptation is greatest when the training data encompass a broader range of scenarios (i.e., the real-world data), whereas training on a large number of images from a controlled environment results in overfitting.

Laboratory-Scale Masonry Walls
Small-scale experimental walls (such as shown in Figure 1) were built using miniature bricks which measured 3.4 × 1.7 × 1.7 cm and were cored with three evenly-spaced holes in the middle. Using small bricks made repeated building and cracking safer and faster. Each wall included 30 to 50 bricks, with heights and widths both varying from 6-8 bricks. A total of 100 different bricks were used in the construction of 53 walls. The bricks were joined together with a non-cementitious mortar so that they could be re-used. Mortar is typically made from a combination of water, sand, and cement. When only the cement was removed from the recipe, the particle size of the remaining sand was too large relative to the miniature bricks (comparable to placing river rocks between full-size bricks). Thus, the particle size needed to be scaled down along with the bricks. An alternative mortar recipe was designed with flour (as the binder), corn meal (as the aggregate), and water. Though not a strong binder, this mortar substitute looked visually similar to real mortar in photographs.
After construction, the walls were allowed to dry for at least two hours until the mortar was set. Walls were cracked manually in many arrangements by pushing, pulling, and twisting different sections of the wall until a crack emerged in the mortar. Photos of size 4608 × 3456 pixels were obtained using a Nikon D90 digital SLR camera mounted on a tripod, with a two-second shutter delay. Identifying cracks in later steps required zooming in closely on sections of each photo, so pixel-level sharpness was important; the tripod and shutter delay both reduced camera motion during the exposure and helped to visibly increase sharpness.
The shooting distance to the walls was varied so that it would not bias the results. Lighting conditions were likewise varied between the different walls to avoid further biasing the system. This is visible in Figure 1, where a blue tint is imbued by a blue tarp in the scene.

Real-World Images
The models were also tested for their ability to classify real-world images after being trained on laboratory images. Pictures of different masonry walls (cracked and uncracked) were obtained using Google Image Search. To ensure that the architecture was tested on a variety of wall typologies, walls with bricks of different sizes, colors, shapes, and levels of deterioration (burned, spalling, with efflorescence) were included. Additionally, variety in the mortar joint typology (convex, concave) was included to test the efficacy of the developed architecture on diverse wall constructions (see Figures 2 and 3).

Image Processing
The images were prepared for the CNN by splitting each 4608 × 3456-pixel image into 48 non-overlapping 512 × 512-pixel image patches, as shown in Figure 4. A total of 2542 image patches were labeled manually. The following classes were used:

1. Cracked: Image clearly shows a cracked section of a brick wall.

2. Uncracked: Image clearly shows an uncracked section of a brick wall.

3. Vague: It is unclear whether or not there is a crack in the image. There may be a very small crack, the crack might fall directly on the image edge, or the image might be out of focus.

4. Partial: These images include part of a brick in addition to some of the black background on which the wall was resting when photos were taken.

5. No-bricks: These images include only black background with no bricks.

Only the Cracked and Uncracked images were used for training. For many of the Vague images, the authors did not agree on whether the image was cracked or not. Similarly, some Partial images had marks which looked like they might be a crack at the edge, which resulted in split votes on whether a crack was present. This yielded a total of 1068 usable images, which were then split into training and testing sets. Additionally, some cracked images were removed from the training set to yield a 1:1 Cracked:Uncracked ratio, reducing the training set size to 598.
The 512 × 512-pixel crops were then scaled down to 100 × 100 pixels using bicubic interpolation over a 4 × 4 pixel neighborhood. Unlike the initial reduction to 512 × 512, the reduction from 512 × 512 to 100 × 100 did not involve changing the crop of the image, only a slight blurring. This blurring mitigates the distinction between images taken under laboratory conditions and real-world images. Based on previous works by the authors in which cracks were mapped manually on masonry structures [24,[64][65][66], 150 × 75 pixel patches are sufficient for documentation of individual bricks.
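The patching and downscaling steps above can be sketched as follows. This is a minimal illustration using NumPy and Pillow, not the authors' code; the cropping offsets are assumptions (the exact margins used to obtain 48 patches per image are not specified, so this sketch simply tiles from the top-left corner and discards remainders):

```python
import numpy as np
from PIL import Image

PATCH = 512    # patch size used for labeling
TARGET = 100   # input size for the CNN

def split_into_patches(img: np.ndarray, patch: int = PATCH) -> list:
    """Split an H x W greyscale image into non-overlapping patch x patch
    tiles, discarding any remainder at the right/bottom edges."""
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(img[y:y + patch, x:x + patch])
    return tiles

def downscale(tile: np.ndarray, size: int = TARGET) -> np.ndarray:
    """Resize a tile to size x size using bicubic interpolation,
    which samples a 4 x 4 pixel neighborhood per output pixel."""
    return np.asarray(Image.fromarray(tile).resize((size, size), Image.BICUBIC))
```

The non-integer scale factor (512/100 = 5.12) is handled by the interpolation itself, which is what introduces the slight blurring described above.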

CNN Model
Based on the results of previous research [44,[58][59][60]62], a CNN is used to classify the images presented in this work. The CNN model was implemented with the Keras deep learning library [67] using the TensorFlow backend. The architecture of Ref. [44] was used as a baseline due to its high success rate in identifying cracks in concrete. Two types of labels are used in the softmax layer to classify the images (Cracked and Uncracked). The overall architecture is summarized in Table 1.
Two modified architectures were considered to improve results. The first modification was designed around having a 15 × 15 filter at the input. This was based on the hypothesis that a filter large enough to see large swaths of cracked material surrounded on both sides by uncracked material would be beneficial. The sizes of the other filters were set to keep the height and width of the data large enough to pass through all 18 layers. The result is named Architecture B and is shown in Table 2. The second modification to the architecture from the Keras documentation was to scale everything up. Compared to the 32 × 32 pixel input of CIFAR, the 100 × 100 input needed for this model is almost exactly three times taller and wider; therefore, all filter sizes were scaled up by a factor of three. The result is named Architecture C and is shown in Table 3.
Comparing the three architectures, B and C are clearly deeper (they have more layers). However, although Architecture B has more layers than Architecture A, it does not have many more parameters. This is because many of the additional layers in B are activation layers and pooling layers, which do not involve parameters. For this reason, Architecture B was expected to do better than Architecture A. At the same time, Architecture C, though it does not have any filters as large as some in Architecture B, still has more than five times the number of parameters. The distribution of these parameters between layers is shown in Tables 1-3.
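The parameter accounting behind this comparison follows the standard formulas for convolutional and dense layers; as a small generic sketch (not tied to the specific rows of Tables 1-3):

```python
def conv2d_params(kh: int, kw: int, c_in: int, c_out: int) -> int:
    """Trainable parameters in a 2-D convolution layer: one kh x kw x c_in
    kernel plus one bias per output channel. Activation and pooling
    layers contribute zero trainable parameters."""
    return (kh * kw * c_in + 1) * c_out

def dense_params(n_in: int, n_out: int) -> int:
    """Trainable parameters in a fully-connected layer (weights + biases)."""
    return (n_in + 1) * n_out
```

For example, a 3 × 3 convolution taking 3 input channels to 32 output channels has (3 · 3 · 3 + 1) · 32 = 896 parameters, while the extra activation and pooling layers that make Architecture B deeper add none.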
The RMSProp optimizer was used for training all three architectures. A learning rate of 0.001 was used, along with a decay of 1.0 × 10⁻⁶ [62], as implemented in the Keras library [67]. The models were trained for 500 epochs on batches of size 20, which were shuffled between epochs.
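For reference, the per-batch update that RMSProp applies with these settings can be written in a few lines. This mirrors the textbook form of the optimizer with the stated learning rate and time-based decay; the exact internals of the Keras implementation (e.g., its default rho and epsilon values) may differ slightly:

```python
import numpy as np

def rmsprop_step(w, grad, cache, iteration,
                 lr=0.001, rho=0.9, eps=1e-7, decay=1e-6):
    """One RMSProp update. `cache` is the running average of squared
    gradients; the learning rate decays as lr / (1 + decay * iteration),
    matching the time-based decay exposed by older Keras optimizers."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    lr_t = lr / (1.0 + decay * iteration)
    w = w - lr_t * grad / (np.sqrt(cache) + eps)
    return w, cache
```

Dividing by the root of the gradient-magnitude cache normalizes the step size per parameter, which helps when the crack/no-crack loss surface is poorly scaled.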

Other Classifiers
To provide additional context for the performance of the deep CNNs, several simpler classifier models were considered. Without deep representation learning, these classifiers needed to be provided with appropriate features upon which to make the classification decision. In this case, a set of features was constructed based on expert knowledge, particularly that cracks induce dark shadows in the image patches. All image patches were first converted to greyscale. The first set of features was based on the premise that cracks appear at lower lightness than the brick face or mortar joints. Two thresholds T1 and T2 were selected manually based on our evaluation of the images. The first two features are simply the number of pixels below T1 and between T1 and T2. However, it is not just the number of dark pixels but their arrangement that matters for identifying a crack. Therefore, the standard deviations of the coordinates of the pixels within the selected thresholds (the x and y coordinates independently) were also used as features. Note that this may conflate small, disconnected regions of dark shadow with large single regions.
The thresholds described above define two sets of pixels:

S1: the set of pixels with brightness below T1;
S2: the set of pixels with brightness between T1 and T2.

This leads to six input features for the classifiers: the pixel counts |S1| and |S2|, and the standard deviations σ(x_Si) and σ(y_Si) of the x- and y-coordinates of the pixels in each set Si, where σ denotes the standard deviation function.
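A minimal NumPy sketch of this feature extraction is given below. The function name and threshold values are illustrative assumptions, and the feature ordering (counts first, then the coordinate spreads of S1 and then S2) is a reconstruction consistent with the feature numbering used in the importance analysis later in the paper:

```python
import numpy as np

def crack_features(grey: np.ndarray, t1: int, t2: int) -> np.ndarray:
    """Six handcrafted features from a greyscale image patch.

    Features: |S1|, |S2|, then sigma(x) and sigma(y) of the pixel
    coordinates in S1 and in S2, where
      S1 = pixels with brightness below t1,
      S2 = pixels with brightness between t1 and t2.
    """
    ys, xs = np.indices(grey.shape)   # row (y) and column (x) coordinates
    s1 = grey < t1
    s2 = (grey >= t1) & (grey < t2)
    feats = [s1.sum(), s2.sum()]
    for mask in (s1, s2):
        # std of an empty set is undefined; fall back to 0.0
        feats.append(float(xs[mask].std()) if mask.any() else 0.0)
        feats.append(float(ys[mask].std()) if mask.any() else 0.0)
    return np.array(feats, dtype=float)
```

A thin vertical crack, for instance, yields a near-zero x-spread but a large y-spread among its dark pixels, which is exactly the arrangement information the counts alone cannot capture.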
These features were provided to a set of standard classification models: linear Support Vector Machine (SVM), Random Forest (RF), Gaussian Process (GP), Multi-Layer Perceptron (MLP), Naive Bayes (NB), and Quadratic Discriminant Analysis (QDA). The scikit-learn implementation was used for all of the models [68]. For all models, the scikit-learn default parameters were used for simplicity; no hyperparameter tuning was performed. The GP model used a radial basis function kernel. Our use of these simple, "off the shelf" classifiers is intended to provide a baseline against which to compare our CNN models, rather than to achieve optimal performance.
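With default parameters throughout, the whole comparison reduces to a short loop; a sketch of this setup (the dictionary layout and helper name are ours, not the authors' code, but the model choices and defaults match the description above):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Default-parameter classifiers, as in the comparison described above.
CLASSIFIERS = {
    "SVM": SVC(kernel="linear"),
    "RF": RandomForestClassifier(),
    "GP": GaussianProcessClassifier(),   # RBF kernel by default
    "MLP": MLPClassifier(),
    "NB": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

def evaluate_all(X_train, y_train, X_test, y_test):
    """Fit each classifier on the 6-feature vectors and return test accuracy."""
    scores = {}
    for name, clf in CLASSIFIERS.items():
        clf.fit(X_train, y_train)
        scores[name] = clf.score(X_test, y_test)
    return scores
```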

CNN Training
Training the models took around six minutes using four 2.60 GHz Intel Skylake cores and a single NVIDIA K20m GPU on a high-performance computing cloud. The accuracy of each model during training is shown in Figure 5. The lines show replicas of the same model trained on different data folds, initialized with different random initial weights, and under the effect of stochastic dropout. Interestingly, based on the training accuracy, Models A and B appear to learn at about the same rate and eventually reach about 95% accuracy, while Model C appears to learn more quickly and reaches a higher accuracy of up to 99%. This is consistent with the greater number of trainable parameters in Model C, whereas Models A and B have about the same number. However, as shown in the bottom row of Figure 5, the performance of each model on the testing data plateaued at much lower values than on the training data, closer to 90% accuracy. The magnitude of this discrepancy reveals overfitting in cases where the model performed substantially better on training data than on testing data, which in turn indicates that the model will not generalize well to new images. Because the data set is small and there are so many trainable parameters in these models, overfitting is a major concern. From Figure 5, we observe that Models A and B had similar accuracy in training and testing and therefore do not suffer from significant overfitting. Model C, however, appears to have used its significantly greater number of parameters to overfit to the training data.

Testing on Laboratory Images
The "lab images" used for testing were from the same data set as the images used for training, but they were left out of training. Some standard metrics quantifying the performance of each model are shown in Table 4. Each result is the average of a 5-fold cross-validation test, which was used to increase confidence in the results given such a small data set. The training and testing sets were always mutually exclusive. In each run, one subset (composed of 160 images) was reserved for testing, and the model was trained on the other four (totaling 650 images). Thus, each run represents a model trained and tested on slightly different subsets of the data. Table 4. Performance of the three architectures tested on lab images not included in the training set. Reported values are the average and standard deviation over the five replicas shown in Figure 5. The best performer in each row is shown in bold. Model B performed best in every metric shown in Table 4, followed closely by Model A; Model C performed the worst in three of the four metrics. Based on a one-sided t-test with α = 0.05, Model B's performance was superior to Model A with statistical significance in accuracy and F1 score. Likewise, Model B's performance was superior to Model C in accuracy, precision, and F1 score. While the remaining comparisons had p > α, the metrics were still highest for Model B.
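The metrics reported in Table 4 all derive from the confusion-matrix counts; as a generic reference for how accuracy, precision, recall, and F1 relate to those counts (not the authors' evaluation code):

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix
    counts, treating 'Cracked' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Note that a bias towards predicting cracks (extra false positives) lowers precision while leaving recall untouched, which is the pattern discussed for Model C below.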

The confusion matrices for each model are shown in Figure 6. While the performance of each model varied, they showed similar qualitative behavior on the test data. In general, true positives were more likely than true negatives, and false positives were about twice as likely as false negatives, exposing a bias towards predicting that cracks were present in the images. Interestingly, Model C's inferior performance appears to come almost exclusively from these false positives, with approximately 50% higher false positive incidence compared to the other two architectures. We examine these failure modes of Model C in more detail through visual examples in Figure 7. Images that are easy to identify as cracked, such as those with deep, expansive cracks, are correctly labeled as such. Likewise, images with clean, ortho-rectified mortar joints are readily identified as uncracked. For the false negative case, there appear to be some uneven and messy mortar joints which confuse the classifier by obscuring small, hairline cracks. For the false positive case, several images also appear to have uneven joints which might create extra shadows that mislead the classifier. The blue tint of some images does not appear to create a systematic problem.

Figure 7. Example lab image patches from each case in the confusion matrix (true positive, false negative, false positive, true negative), as determined by Model C.

Domain Adaptation to Real-World Images
If these CNNs were to be deployed to study real structures, they would encounter a much greater variety of brick colors, sizes, textures, and other environmental conditions compared to those represented in the images of lab walls. We therefore studied the model performance on real-world images after training only on lab images, a process called domain adaptation. The models were tested against 90 different real-world images: 45 showing cracks and 45 with no visible cracks. Each wall had a different relative brick size within the image, a different ratio of brick size to mortar joint thickness, and in some cases non-uniform brick sizes within the same image (headers and stretchers). Furthermore, bricks and mortars of different colors were used, as well as different styles of mortar joints.
The relevant metrics when the CNNs are applied to these real-world images are presented in Table 5, and the associated confusion matrices in Figure 8. Interestingly, while the performance of all models was similar on the lab data (i.e., Figure 6), the model performance diverges significantly on the domain adaptation task. First, all models perform worse, which is to be expected since the real-world data set represents a much broader range of conditions. Somewhat surprisingly, Model A suffers about 50% greater deterioration in accuracy than Model B. This might be due to the shallower structure of the convolutional filters in Model A, which presumably are less capable of abstraction. Table 5. Performance of the three architectures tested on real-world images after training only on lab images. Reported values are the average and standard deviation over the five replicas. The best performer in each row is shown in bold.

Equally surprising is that while Model C was substantially overfitted on the lab data, it suffers about 50% lower loss of accuracy compared to Model B, resulting in the best performance on the real-world images despite being the worst performer on the lab data. This suggests that the additional trainable parameters do provide an advantage in domain adaptation even though the performance on the lab data seemed to indicate this was not the case. We suspect this is a result of the lab images being very similar, creating an artificially smaller sample space compared to what the number of images would imply. Conversely, on the real-world images, a smaller number of images provides a large amount of variation. We hypothesize that the redundant filters learned by Model C are therefore able to provide additional information to the classifier, while the filters from Models A and B are less successful in adapting to the broader range of new images. Similarly, the deeper architecture of Model B compared to Model A appears more successful at generalizing from the lab to the real world.
We again examine these failure modes of Model C in more detail through visual examples in Figure 9. As in the lab data, the deepest cracks are easy to identify, as are uncracked images with clean mortar joints with faint or nonexistent shadows. For the false negative case, thin, straight cracks that are aligned with the mortar joints appear the hardest to detect. For the false positive case, it seems that strong shadows and discoloration on the brick face are the strongest contributors. We suspect that the color variation that occurs within a single real-world image is the primary reason for the high false positive rate compared to the lab images.

Figure 9. A random sample of four real-world images from each case in the confusion matrix (true positive, false negative, false positive, true negative), as determined by Model C.

Comparison to Other Classifiers
We also explored the use of simpler classifiers which are both less expensive to compute and more readily interpretable. As already described, the choice of classifiers was based on "off-the-shelf" models from the scikit-learn Python package [68]. These models represent a variety of strategies including linear and non-linear methods, shallow neural networks and kernel methods, and ensemble methods.
Unsurprisingly, performance on the lab images was generally lower than that of the CNN models, as shown in Table 6. A notable exception is the high recall score of the MLP and QDA classifiers, which shows that those models produce very few false negatives. For three of the four metrics, the highest performer was the RF model, implying that it might be the best alternative to the CNN. Table 6. Performance of different ML models when trained on laboratory images and tested on unseen laboratory images. The best performer in each row is shown in bold. In addition to testing and training on lab images, we performed two additional cross-dataset validation studies. First, we trained on lab images and tested on real-world images, as we did with the CNNs. Second, we were also able to train the models on real-world images and then test on the lab images, since these models have many fewer trainable parameters and can be fit reasonably on our small real-world data set. The confusion matrices for each model under each of these three cases are shown in Figure 10.

As expected, the performance drops significantly on the domain adaptation task (Lab to Real). The models appear to segregate into two categories, with SVM, RF, GP, and MLP performing roughly equally, and NB and QDA performing notably worse. The advantage of RF over the other three in the former category appears to dissipate for domain adaptation, with only a 2% better true positive rate. This implies overfitting by the RF model in the Lab to Lab scenario. Overall, these models appear comparable, although we note that SVM and RF train significantly faster than GP and MLP.
On the reverse domain adaptation task (Real to Lab), we again find similar performance by models in each category, with the top four performing stronger than the bottom two. Notably, the false positive rate is much lower when training on real-world images and testing on lab images compared to the reverse case, while false negatives are higher. This implies that the variety of cracks is greater in the real-world data, such that the models learn the general features of a crack, whereas in the lab data there are subtle cracks which are more difficult to detect without training on them, especially in the presence of uncracked images with strong shadows and discoloration as shown in Figure 9.
An advantage of these simple models (i.e., models without built-in representation learning) is that we can evaluate how strongly they rely on each provided feature. Since the SVM and MLP had the highest accuracy in the Real to Lab case, we performed a permutation feature importance test using those classifiers. For each feature, the model was refit with that feature shuffled, such that it no longer corresponded to the other features in the feature vector. The model accuracy was then compared to the baseline to quantify the amount of information in that feature.
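The procedure just described (shuffle one training feature, refit, and measure the accuracy drop against the baseline) can be sketched as follows. The function name, the repeat-averaging, and the model-factory interface are our assumptions for illustration, not the authors' code:

```python
import numpy as np

def permutation_importance(make_model, X, y, X_test, y_test,
                           n_repeats=5, seed=0):
    """Mean drop in test accuracy when each feature column is shuffled
    before refitting. `make_model` is a zero-argument factory returning
    a fresh, unfit classifier. Larger positive values indicate the
    model relied more heavily on that feature."""
    rng = np.random.default_rng(seed)
    baseline = make_model().fit(X, y).score(X_test, y_test)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuf = X.copy()
            rng.shuffle(X_shuf[:, j])   # break this feature's association with y
            drops.append(baseline -
                         make_model().fit(X_shuf, y).score(X_test, y_test))
        importances[j] = np.mean(drops)
    return importances
```

Averaging over several shuffles reduces the noise from any single permutation; a drop near zero means the classifier can do just as well without that feature.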
This test was performed for both the lab and real-world data sets, with results shown in Figures 11 and 12, respectively. Comparing the results, we see that in the lab case both models relied mainly on Feature 5, the standard deviation of the x-coordinates of S2. SVM relies more heavily on the pixels in S1, whereas MLP uses S2; the remaining features are less significant. The positive values of the MLP probably indicate that the model is not fully converged (we used the default settings with no hyperparameter tuning). When considering the real-world case, both models use a completely different scheme, although Feature 5 remains prominent. The fact that the x-coordinate (Feature 5) is used more heavily than the y-coordinate (Feature 6) might reveal a bias in the data set towards cracks of a certain orientation. Likewise, the dominant use of only two features by the MLP in the lab training case is evidence of overfitting to a particular aspect of those data.

Conclusions
In this work, we produced a new dataset consisting of 2542 labeled image patches of masonry walls in a controlled laboratory environment. These data were used to train three different CNN architectures to classify image patches as cracked or uncracked, a challenging problem which has been the subject of several recent studies by other authors. The results show that the same CNN architecture which was sufficient for concrete and asphalt cracking in Ref. [36] is not sufficient for crack detection in masonry structures. Here, we were able to overcome the limitation of fixed-size sliding windows which have been necessary in past studies to handle the heterogeneity of masonry. We also showed that deeper networks provide superior results on test data.
Additionally, we have demonstrated the ability of the proposed architecture to perform well in domain adaptation to images found on the internet with different colors and relative sizes of materials compared to the training data. While we observed overfitting on the laboratory data for the CNN model with the most trainable parameters, these extra parameters provided superior performance on the domain adaptation task. The best model gave an accuracy of 81.0% under these circumstances, while the model which performed best on the lab images gave only 61.5% accuracy. This contrasts strongly with the 88.7% to 92.5% accuracy achieved on the lab images, and indicates that good model performance on curated, homogeneous data does not necessarily translate to real-world conditions.
The CNN model performance was also compared to that of six simple classifiers based on handcrafted features. The features were constructed from greyscale image patches which focused on dark regions indicative of cracking (i.e., deep shadows). Our results showed that four of the models, the SVM, RF, GP, and MLP, performed better than the remaining two, NB and QDA. While faster to train and run, these simple classifiers performed worse on the lab testing set as well as on the real-world testing set. However, we found that training these on real-world images provided superior domain adaptation performance when going in reverse, from only 90 real-world images to hundreds of lab images. We hypothesize that the greater variety in the real-world data produces models which are able to capture the narrow range of conditions in the lab as a subset, while the reverse results in the inability to generalize.
We conclude that successful domain adaptation is possible in both the CNN and simpler classifiers if trained on a wide range of masonry shapes, colors, and lighting conditions, and will lead to more accurate models than training on a more homogeneous data set (e.g., our controlled lab images). Some engineering firms already have large sets of facade images which they could leverage for this purpose. One outstanding question is that of visual clutter in the image patches, such as doors, windows, lights, and other background. While outside the scope of this work, future crack classification models meant for deployment in structural health monitoring contexts should consider how to separate such background clutter from structurally relevant foreground.