Automated Vision-Based Crack Detection on Concrete Surfaces Using Deep Learning

Abstract: Cracking in concrete structures affects performance and is a major durability problem. Cracks must be detected and repaired in time to maintain the reliability and performance of the structure. This study focuses on vision-based crack detection algorithms based on deep convolutional neural networks that use transfer learning to detect and classify cracks with high classification rates. The image dataset, consisting of two image classes (no-crack and crack), was used to train the AlexNet model. Transfer learning was applied to AlexNet, including fine-tuning the weights of the architecture, replacing the classification layer for two output classes (no-crack and crack), and augmenting image datasets with random rotation angles. The fine-tuned AlexNet model was trained by the stochastic gradient descent with momentum optimizer. The precision, recall, accuracy, and F1 metrics were used to evaluate the performance of the trained AlexNet model. The accuracy and loss obtained through the training process were 99.9% and 0.1% at a learning rate of 0.0001 and 6 epochs. The trained AlexNet model correctly predicted 1998/2000 validation images and 3998/4000 test images, a prediction accuracy of 99.9%. The trained model also achieved precision, recall, accuracy, and F1 scores of 0.99 each.


Introduction
Many of the existing concrete structures built during the 1960-70s are rapidly nearing the end of their service life [1]. It is estimated that nearly 10% of bridges built during this time have been repaired in the United States [2]. In Korea, the number of buildings over 30 years old was evaluated as 3.8% in 2014, reaching 13.8% by 2024, and 33.7% by 2029 [3,4]. Likewise, concrete structures are often exposed to aggressive environments, fatigue stresses, and cyclic loading that initiate cracks on the surfaces [5,6]. The cracks in structures have a significant impact on durability and make it easy for external aggressive substances to reach the reinforcement bars and cause corrosion [7,8]. In addition, cracks in the structures also reduce the local stiffness and cause material discontinuities [9,10]. Therefore, cracks must be detected and repaired in time in order to maintain the reliability and performance of the structure. Generally, crack detections were performed by non-destructive and destructive tests [11]. Visual inspections combined with surveying equipment were manually performed to detect cracks in the structures [12]. A PZT-based electro-mechanical admittance method combined with FEM analysis was enacted to quantitatively identify the damage caused by concrete cracking and steel yielding of flexural beams subjected to monotonic and cyclic loading [13]. Chalioris et al. [14] developed a wireless impedance/admittance monitoring system to identify the incipient damages caused by concrete cracking. In addition, non-destructive testing techniques such as infrared, thermal, ultrasonic, laser, and radiographic tests were also used to detect and analyze the crack development in concrete structures [15]. Although the above methods provide reliable crack detection results, they are difficult and time-consuming to perform because they require large instrumentation, and are expensive and labor intensive [16]. 
To overcome the shortcomings of the manual methods, several image processing methods were developed to provide automated crack detection and visualization in concrete structures. Most of the image processing methods used filtering, thresholding, and feature extraction techniques to identify and localize the cracks [17][18][19][20][21][22]. Furthermore, the crack regions were separated by fuzzy transforms and segmentation algorithms [21]. Although image processing methods were effective in detecting cracks, the real-time applicability in structures was limited due to the variations in external environmental factors such as light, shadows, and rough surfaces. To improve the performance of image processing techniques, machine learning algorithms were developed through pattern recognition and extraction [23]. Machine learning algorithms such as support vector machine (SVM) and artificial neural network (ANN) have also been explored to detect cracks in the concrete structures [24][25][26][27]. A local entropy-based thresholding algorithm was proposed that automatically detects spalled regions on the surface of the reinforced concrete columns [28]. In addition, the length and width of cracks were also measured using a local binary pattern (LBP) algorithm [29]. Machine learning algorithms consisting of feature extraction and classification were used to extract relevant crack features. The machine learning algorithm extracts only a few layers of features, and the algorithm might not provide accurate crack detection results if the extracted features do not reflect the cracks.
Deep learning algorithms such as convolutional neural networks (CNNs) have been used in many studies for crack detection and classification to improve the feature extraction process. CNN models can extract relevant features from the input data through multilayer neural networks, which are more advantageous than the existing limitations of image processing and machine learning methods. CNN-based crack detection was performed for the safety diagnosis and localization of damages in concrete structures in [30]. Similarly, Bayesian algorithms were used to identify cracks in nuclear power plants, and deep learning segmentation algorithms were used to identify cracks in the tunnels [31]. Furthermore, deep convolutional neural networks (DCNNs) were recently explored for crack detection and classification [32,33]. Most of the DCNNs focused on pixel-wise crack classification through semantic segmentation by associating each pixel [32,34]. Deep learning networks require a large amount of training data and time. These can be minimized by fine-tuned pretrained DCNNs that use small amounts of data and provide reliable results in minimal time. Fine-tuned pretrained DCNNs such as AlexNet, GoogleNet, ResNet, SqueezeNet, and VGGNet have recently been used to detect and classify cracks in concrete structures. A VGG19 pretrained model was applied to create pixel-level crack maps on concrete pavements and walls [35]. Crack segmentations were performed using SegNet, U-Net, and ResNet models [36][37][38][39]. A DenseNet-121-based fully convolutional network was studied to provide the pixel-level detection of multiple damages including cracks, spalling, efflorescence, and holes in concrete structures [40]. Furthermore, DCNNs based on VGG16 were also used for crack segmentation on the concrete surfaces [41].
All the aforementioned deep learning approaches have shown promising performance in the crack detection of structures. Since the performance of DCNNs depends on various factors, such as data, filters, the number of layers, the number of epochs, and the network depth, it is difficult to select an appropriate pre-trained DCNN for crack detection with high precision and accuracy. The advantage of selecting an appropriate DCNN is that it ensures better generalization and helps prevent overfitting. Among the many pre-trained DCNNs, AlexNet is one of the most influential CNNs widely applied to image classification, and it won the ImageNet LSVRC-2012 competition with a top-5 error rate of 15.3% [42]. The highlights of AlexNet are as follows: there are more filters in each layer; several convolutional layers are followed by pooling layers; it uses ReLU instead of the tanh, arctan, and logistic functions to add non-linearity, which speeds up training by up to 6x at the same accuracy; it uses a dropout layer instead of regularization to deal with overfitting; and it uses overlapping pooling layers to reduce the size of the network [43,44]. These characteristics motivated the use of AlexNet in this study for crack detection and classification.
This study used AlexNet, a pre-trained deep convolutional neural network, for automated vision-based crack detection and classification. The proposed method consists of three steps: (1) collecting a large number of images from an open-source image dataset and categorizing them into two classes (no-crack and crack images); (2) developing a DCNN model through transfer learning and data augmentation; and (3) automatically detecting and classifying the images using the trained deep learning model. Additionally, a cross-dataset study was performed to verify the ability of the trained AlexNet model. The precision, recall, accuracy, and F1 metrics were used to evaluate the performance of the trained AlexNet model. The accuracy of the trained AlexNet model was further compared to that of other pretrained DCNNs such as GoogleNet, ResNet101, InceptionResNetv2, and VGG19.

Scheme of the CNN Model
In this study, a pre-trained DCNN was used for automated crack detection and classification. Pre-trained DCNNs consist of convolutional layers for extracting features and classifying images. Pre-trained DCNNs have been widely used in many applications to classify images, and many are available (e.g., AlexNet, GoogleNet, ResNet, SqueezeNet, and VGGNet). This study used the AlexNet pre-trained model to detect and classify images in three stages: image database acquisition, CNN model and transfer learning process, and classification. Figure 1 shows the scheme of the crack detection and classification model. First, an image database consisting of thousands of images was acquired for two classes: crack images and no-crack images. Second, a CNN classifier model was developed to detect and classify images using AlexNet. Third, the trained DCNN detected cracks and classified the set of validation and test images. Then, the cross-dataset was used to verify the ability of the trained model.

Image Database Acquisition
An open-source dataset of concrete crack images was used for detection and classification [45]. The image dataset consists of 20,000 images, evenly divided into crack and no-crack classes, with an input image size of 227 × 227 × 3 pixels. The image dataset was divided into 70% for training, 10% for validation, and 20% for testing, giving 14,000, 2000, and 4000 images, respectively. The image dataset details are shown in Table 1. This study also used a second image dataset consisting of crack and no-crack images to perform a cross-dataset study [46]. This dataset consists of 16,285 images taken on bridge-decks, walls, and pavements. The details of the cross-image datasets are shown in Table 2.
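The split arithmetic above can be checked with a short sketch. It is written in Python for illustration only (the study itself used MATLAB), and the helper name `split_counts` is hypothetical.

```python
def split_counts(total, fractions):
    """Return integer image counts for each fraction of the dataset."""
    counts = [int(round(total * f)) for f in fractions]
    # Guard against rounding drift so the parts always sum to the total.
    counts[-1] = total - sum(counts[:-1])
    return counts


# 20,000 images split 70% / 10% / 20% as described in the text.
train, val, test = split_counts(20_000, (0.70, 0.10, 0.20))
print(train, val, test)
```

With the fractions stated in the text, the three counts reproduce the 14,000, 2000, and 4000 images used for training, validation, and testing.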

AlexNet CNN Model
A CNN consists of several hidden layers as well as input and output layers. The layers of a CNN generally consist of convolutional, ReLU, pooling, fully connected, and normalization layers. This study analyzed the image database by applying a DCNN classifier to classify the input images into two categories: no-crack and crack. The CNN was designed based on AlexNet for image classification. Figure 2 shows an overview of the CNN classifier based on AlexNet. AlexNet, a large neural network with 60 million parameters and 650,000 neurons, consists of 5 convolutional layers, some of which are followed by max-pooling layers, 3 fully connected layers, and a final 1000-way SoftMax layer. The pretrained AlexNet has been trained on more than a million images and can classify images into 1000 classes. Since the number of image classes in this study was two (no-crack and crack), the number of output classes was changed to two. AlexNet uses the rectified linear unit (ReLU) as its activation function to provide nonlinearity. Several traditional activation functions, including the logistic, tanh, and arctan functions, tend to cause vanishing gradient problems; ReLU was adopted to overcome this, and its definition is shown in Equation (1).
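In its standard form, the ReLU can be written as follows, where the symbol $x$ for the neuron input is assumed notation:

```latex
\begin{equation}
f(x) = \max(0, x) =
\begin{cases}
x, & x > 0 \\
0, & x \le 0
\end{cases}
\tag{1}
\end{equation}
```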
Appl. Sci. 2021, 11, 5229
Deep neural networks with ReLU as the activation function converge faster than those with tanh units. Dropout was employed in the fully connected layers to avoid overfitting by training only a portion of the neurons in each iteration. Dropout reduces the joint adaptation between neurons and improves generalization and robustness. Convolution was employed for automatic feature extraction and is defined as in Equation (2).
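A common way to write the discrete two-dimensional convolution is sketched below. The symbols $x$ (input feature map), $y$ (output feature map), and the indices $i, j, m, n$ are assumed notation; only the kernel $w$ is named in the text.

```latex
\begin{equation}
y(i, j) = (x * w)(i, j) = \sum_{m} \sum_{n} x(i - m,\, j - n)\, w(m, n)
\tag{2}
\end{equation}
```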
where w is the convolution kernel. Pooling was employed for automatic feature reduction, which considered a group of neighboring pixels in the feature map and generated a representation value. Cross-channel normalization was used to improve the generalization. In addition, fully connected layers were used for classification in which the neurons in fully connected layers were directly linked. The SoftMax activation function was expressed as in Equation (3). The SoftMax activates neurons by constraining the output in the range of (0, 1).
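The building blocks described above (ReLU, convolution, pooling, and SoftMax) can be illustrated with a minimal sketch in plain Python. This is not the implementation used in the study, which relied on MATLAB; the function names are hypothetical, and the images are assumed to be small single-channel lists of rows. The SoftMax here is the standard form, exp(z_i) divided by the sum of exp(z_j) over all classes.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)


def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution of a list-of-rows image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            s = 0.0
            for m in range(kh):
                for n in range(kw):
                    # The kernel is index-flipped, which gives a true
                    # convolution rather than a cross-correlation.
                    s += image[i + m][j + n] * kernel[kh - 1 - m][kw - 1 - n]
            out[i][j] = s
    return out


def max_pool(fmap, size, stride):
    """Keep the maximum of each size x size window of the feature map."""
    oh = (len(fmap) - size) // stride + 1
    ow = (len(fmap[0]) - size) // stride + 1
    return [
        [
            max(
                fmap[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def softmax(scores):
    """Map raw class scores to values in (0, 1) that sum to 1."""
    peak = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that most deep learning frameworks implement the convolutional layer as a cross-correlation (no kernel flip); since the kernel is learned, the two forms are equivalent for training purposes.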

Transfer Learning
The network analyzer was applied to display interactive visualizations of network architectures and detailed information about network layers. The network architecture is shown in Table 3. The first layer was an image input layer with an input image size of 227 × 227 × 3, where the 3 is the number of color channels. Additionally, the CNN consisted of convolution layers, pooling layers, fully connected layers, and the SoftMax layer. It also included other operations such as ReLU, cross-channel normalization, and dropout layers. The last three layers, fully connected, SoftMax, and the classification output layer of the pretrained network, were configured for 1000 classes. These three layers were fine-tuned by the transfer learning for the two classes (no-crack and crack) as shown in Table 4.
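To make the layer dimensions concrete, the following sketch traces the spatial size of a 227 × 227 input through the feature-extraction layers. The kernel sizes, strides, paddings, and filter counts are those of the commonly published AlexNet layout and are an assumption here; they are not copied from Table 3. The names `out_size`, `LAYERS`, and `trace_shapes` are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding, output channels); assumed standard AlexNet.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, 96),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, 256),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, 256),
]


def trace_shapes(input_size=227):
    """Return (name, height, width, channels) after each layer."""
    size, shapes = input_size, []
    for name, kernel, stride, padding, channels in LAYERS:
        size = out_size(size, kernel, stride, padding)
        shapes.append((name, size, size, channels))
    return shapes


shapes = trace_shapes()
for name, h, w, ch in shapes:
    print(f"{name}: {h} x {w} x {ch}")

# Flattened feature length entering the fully connected layers.
flat = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
# Fully connected widths; the final 1000-way layer is replaced by 2 outputs.
head = [flat, 4096, 4096, 2]
print(head)
```

Under these assumptions, only the last entry of `head` differs from the pretrained network, which is why transfer learning needs to reconfigure just the final layers for the two classes.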

Augmentation Process
The AlexNet network requires input images of size 227 × 227 × 3. Because the image size in the datastore may differ, image augmentation was used to automatically resize the training images. Additional augmentation operations applied to the training images included randomly flipping them along the vertical axis and randomly translating them by up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.
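The flip and translation operations can be sketched in plain Python as follows. This is an illustration rather than the MATLAB augmenter used in the study: the function names are hypothetical, images are treated as single-channel lists of rows, and the 50% flip probability and zero fill value are assumptions.

```python
import random


def flip_left_right(image):
    """Mirror an image (list of rows) about its vertical axis."""
    return [row[::-1] for row in image]


def translate(image, dx, dy, fill=0):
    """Shift an image by dx columns and dy rows, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


def augment(image, max_shift=30, rng=random):
    """Randomly flip, then randomly translate by up to max_shift pixels."""
    if rng.random() < 0.5:  # assumed flip probability
        image = flip_left_right(image)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(image, dx, dy)
```

Because the offsets are redrawn on every call, the network sees a slightly different version of each training image in each epoch, while the image size and label stay unchanged.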

Training and Classification by AlexNet CNN Model
MATLAB R2020b was used for image processing and data analysis. The fine-tuned AlexNet model was trained by the stochastic gradient descent with momentum (SGDM) optimizer. Two initial learning rates, 0.001 and 0.0001, were tested; the minibatch size was set to 15, and the maximum number of epochs was set to 6. After the training, the validation images and test images were classified using the fine-tuned network, and the images were displayed with their predicted labels. To quantify the accuracy of the trained model, the precision, recall, accuracy, and F1 scores were computed using Equations (4)-(7).
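These four metrics follow the standard confusion-matrix definitions, which can be written as below. The equation numbering is assumed to follow the order in which the metrics are listed.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{4} \\
\text{Recall} &= \frac{TP}{TP + FN} \tag{5} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{6} \\
F_1 &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}
\end{align}
```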
where TP, FP, FN, and TN represent true positive, false positive, false negative, and true negative, respectively.
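As a worked example, the sketch below computes the four metrics from confusion-matrix counts. The function name `classification_metrics` is hypothetical. The counts are taken from the validation results reported in this paper (all 1000 crack images and 998 of 1000 no-crack images predicted correctly); treating no-crack as the positive class is an assumption made here for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Assumed mapping: no-crack is the positive class.
# 998 no-crack images correct (TP), 2 no-crack images missed (FN),
# 1000 crack images correct (TN), no crack image mislabeled (FP).
metrics = classification_metrics(tp=998, fp=0, fn=2, tn=1000)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If the crack class is taken as positive instead, the precision and recall values swap roles, while the accuracy is unchanged.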

Performance of the Trained Network
The training progress included the accuracy and cross-entropy loss for each epoch of training and validation. To determine the appropriate learning rate, the network was trained with two learning rates, 0.0001 and 0.001, over 6 epochs, and the results were compared. During the training process, the maximum number of iterations was 5598, with 933 iterations per epoch. The training progress plot of accuracy (%) with different learning rates is shown in Figure 3. The accuracy obtained with the 0.0001 learning rate after 6 epochs was 99.9%, and the change in accuracy at this learning rate was minimal after 1 epoch. With the 0.001 learning rate, the obtained accuracy was 50%, which is no better than chance for two balanced classes. Of the two learning rates, the better performance was achieved with the learning rate of 0.0001 after 6 epochs. In the same way, the training progress plot of loss (%) with different learning rates is shown in Figure 4. The loss obtained with the 0.0001 learning rate was 0.1%. At the 0.001 learning rate, the training and validation loss was 50%. Of the two learning rates, the loss was lowest at 0.0001. Considering the above results, a learning rate of 0.0001 and 6 epochs were adopted for training in this study.
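The iteration counts follow from the training-set size, the minibatch size, and the number of epochs. A minimal check is sketched below; the function name is hypothetical, and discarding the incomplete final minibatch of each epoch is an assumption that is consistent with the reported numbers.

```python
def iteration_counts(num_train_images, minibatch_size, epochs):
    """Iterations per epoch and in total for minibatch training."""
    # Assumption: an incomplete final minibatch is dropped each epoch.
    per_epoch = num_train_images // minibatch_size
    return per_epoch, per_epoch * epochs


per_epoch, total = iteration_counts(14_000, 15, 6)
print(per_epoch, total)
```

With 14,000 training images, a minibatch size of 15, and 6 epochs, this gives 933 iterations per epoch and 5598 iterations in total, matching the values above.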

Classification Using the Trained Network
The pre-trained AlexNet DCNN was trained with the training set of 14,000 images and obtained 99.9% accuracy during the training process. After training, the trained model was validated before the test. The validation image dataset consisted of 10% of the total image dataset, i.e., 2000 images. During the validation process, the trained model classified the images into two classes: crack and no-crack. Sample images of predicted crack and no-crack classes in the validation images are shown in Figure 5. The confusion matrix for the validation images is shown in Figure 6. Of the 2000 images, 1000 contained cracks and 1000 contained no cracks. A total of 1998 of the 2000 images were correctly predicted, representing a 99.9% accuracy. All 1000 crack images and 998 of the 1000 no-crack images were correctly predicted, corresponding to per-class accuracies of 100% and 99.8%, respectively. In the validation images, 99.9% accuracy and 0.1% loss were obtained. The performance metrics were computed and are shown in Table 5. The precision, recall, and F1 scores obtained from the confusion matrix were 1, 0.99, and 0.99, respectively. The prediction accuracy was 0.99.
After the validation of the trained model, the test images were classified using the trained network. The test image dataset consists of 20% of the images from the original dataset, comprising 4000 crack and no-crack images. The predicted crack and no-crack classes of the test images are shown in Figure 7. In the test image dataset, there were 2000 images of each class (crack and no-crack). The confusion matrix of the test images is shown in Figure 8. The trained model correctly predicted 3998 of the 4000 test images; only two images were misclassified. A total of 1999 crack images and 1999 no-crack images were correctly predicted, representing a 99.95% accuracy for each class. Considering the total test images, the prediction accuracy was 99.9% with a 0.1% loss. The computed performance metrics are shown in Table 6. The precision, recall, and F1 scores obtained from the confusion matrix were all 0.99. In addition, the accuracy of the prediction in the test images was 0.99.
The accuracy of the trained model was further compared to that of other pretrained models, as shown in Table 7. The GoogleNet, ResNet101, InceptionResNetv2, and VGG19 DCNNs were compared to the trained AlexNet DCNN model. AlexNet, GoogleNet, and VGG19 obtained accuracies of 0.99. In addition, the ResNet101 and InceptionResNetv2 models obtained accuracies of 0.9833 and 0.95, respectively. While the other DCNN models also obtained high accuracies, AlexNet has fewer layers than the other DCNNs and can be trained in less time. The other DCNNs have more layers for feature extraction, which requires more time for training. Therefore, AlexNet was considered superior to the other pretrained DCNNs for crack detection and classification in this study.

Cross-Dataset Study of the Trained Network
To validate the ability of the trained AlexNet model, a cross-dataset was tested using different images that were not used for training. The dataset consists of crack and no-crack images taken on bridge-decks, walls, and pavements. Examples of the cross-image dataset are shown in Figure 9.
The trained AlexNet model was saved, and the cross-image dataset was tested using the trained model. The trained model predicted the images taken on bridge-decks with an accuracy of 84.5%. The trained model also predicted the images taken on pavements and walls with 89.3% and 81.9% accuracy, respectively. The loss obtained from the images taken on bridge-decks, pavements, and walls was 15.5%, 10.7%, and 18.2%, respectively. The confusion matrix of the cross-dataset test images is shown in Figure 10. To quantify the performance of the trained model, the precision, recall, accuracy, and F1 scores were computed and are presented in Table 8. For the bridge-deck images, the obtained precision, recall, and F1 scores were 0.89, 0.91, and 0.90, respectively. The precision, recall, and F1 scores obtained from the images taken on pavements were all 0.92. Similarly, the precision, recall, and F1 scores obtained from the images taken on walls were 0.88, 0.82, and 0.85, respectively. In addition, the prediction accuracies for the three categories were 0.84, 0.89, and 0.81, respectively. The prediction accuracy obtained from the original dataset was 0.99, while the accuracy on the cross-dataset decreased to 0.84, 0.89, and 0.81 for bridge-decks, pavements, and walls, respectively. These decreases in the prediction accuracy were due to the presence of a variety of obstructions including shadows, surface roughness, scaling, edges, holes, and background debris in the images [46]. These obstructions resulted in the loss of accurate prediction of the images.
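The reported F1 scores can be cross-checked against the reported precision and recall values, since F1 is their harmonic mean. The sketch below uses only the numbers quoted in the text; the helper name `f1_score` and the dictionary layout are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each cross-dataset category.
reported = {
    "bridge-decks": (0.89, 0.91, 0.90),
    "pavements": (0.92, 0.92, 0.92),
    "walls": (0.88, 0.82, 0.85),
}

for category, (p, r, f1_reported) in reported.items():
    f1_computed = round(f1_score(p, r), 2)
    print(category, f1_computed, f1_computed == f1_reported)
```

For all three categories, the recomputed value rounds to the reported F1 score, so the tabulated metrics are mutually consistent.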

Conclusions
This study investigated automated crack detection based on a CNN. An open-source image dataset with two subsequent classes (no-crack and crack) was used. The image datasets were divided into 70%, 10%, and 20% for training, validation, and testing, respectively. The CNN model was designed based on AlexNet for image classification. AlexNet consists of convolution layers, pooling layers, fully connected layers, and SoftMax layers, as well as other operations such as ReLU, cross-channel normalization, and dropout layers. The last three layers (fully connected, SoftMax, and classification output layer) of the pretrained network were fine-tuned by transfer learning for the two classes (no-crack and crack). Image augmentations were then used to automatically resize the training images. The fine-tuned AlexNet model was trained by stochastic gradient descent with momentum (SGDM) optimizer. After training, the validation images and test images were classified using the fine-tuned network.
The fine-tuned AlexNet model was trained, and the accuracy and cross-entropy loss were evaluated for each epoch. The accuracy obtained at the 0.0001 learning rate after 6 epochs was 99.9%, and the validation loss was 0.1%. The trained model was validated, and it correctly predicted 1998 of 2000 images. The accuracy obtained during the validation was 99.9%, and the loss of accurate prediction was 0.1%. After the validation, the test images were classified using the trained network. In the test images, the trained model correctly predicted 3998 of the total 4000 images. Considering the total test images, the prediction accuracy was 99.9% with 0.1% loss. This study confirmed that the CNN-based method demonstrates a high level of applicability to detect cracks, with a 99.9% accuracy. The performance of the trained model was quantified by the precision, recall, accuracy, and F1 metrics, which were all equal to 0.99. Furthermore, the accuracies were compared to those of other pretrained DCNNs. AlexNet showed an accuracy of 0.99, which is beneficial for detecting and classifying cracks with high precision. The trained AlexNet model was further tested with different cross-dataset images which contained several obstructions, including shadows, surface roughness, scaling, edges, holes, and background debris. The existence of these obstructions resulted in a loss of accurate predictions, with accuracies of around 0.81-0.89.
Data Availability Statement: This study did not report any data.