PCIer: Pavement Condition Evaluation Using Aerial Imagery and Deep Learning

Abstract: This paper aims to explore and evaluate aerial imagery and deep learning technology in pavement condition evaluation. A convolutional neural network (CNN) model, named PCIer, was designed to process aerial images and produce pavement condition index (PCI) estimations, which are classified into four grades: Good (PCI ≥ 70), Fair (50 ≤ PCI < 70), Poor (25 ≤ PCI < 50), and Very Poor (PCI < 25). In the experiment, the PCI datasets were retrieved from the pavement condition report published by the City of Sacramento, CA. Following the retrieved datasets, the authors also collected the corresponding aerial image datasets, containing 100 images for each PCI grade, from Google Earth. Eighty percent of the images were used for PCIer model training, and the remainder were used for testing. Comparisons showed that using a 128-channel heatmap layer in the proposed PCIer model and saving the PCIer model with the best validation accuracy yielded the best performance, with a testing accuracy of 0.97, and a weighted average precision, recall, and F1-score of 0.98, 0.97, and 0.97, respectively. Moreover, future research recommendations are provided in the discussion for improving the effectiveness of pavement evaluation via aerial imagery and deep learning.


Introduction
Pavement evaluations (e.g., visual condition surveys, non-destructive testing, destructive testing) are conducted to determine functional and structural conditions of a highway/street section, either for purposes of routine monitoring or planned corrective action [1][2][3][4]. The Pavement Condition Index (PCI) is a numerical value representing roads' and parking lots' pavement status [5]. In ASTM D6433, "Standard Practice for Roads and Parking Lots Pavement Condition Index Surveys," the PCI has a value from 0 to 100, which is rated based on visual inspection of pavement distress type, severity, and quantity [5]. The Flexible Pavement Visual Survey Condition categories include Rutting, Patching, Failures, Block Cracking, Alligator Cracking, Longitudinal Cracking, Transverse Cracking, Raveling, and Flushing [5].
Before deep learning-based pavement defect detection emerged in the field, researchers identified pavement defects, i.e., cracks, through various Digital Image Processing (DIP) methods, such as thresholding and edge detection, in four steps: preprocessing, image enhancement, image transformation, and image classification and analysis [6][7][8]. These procedures focused on standardizing a specific defect, extracting it as a feature, and performing a rule-based inspection to determine which features are included [9]. However, in the traditional DIP approach, a person must manually design all the filters that extract visual features, and the completed filters are then collected and stored as a filter bank.
Through deep learning, the convolutional operation turns the tedious DIP process into a more straightforward one. It automatically generates thousands of filters optimized for the targeted data, which previously could not be produced by human effort. In addition, a deeper network model can generate more powerful features since it will cover a wide range of trained datasets with a deeper understanding of complicated abstraction [10,11]. Typically, a Convolutional Neural Network (CNN) model starts with a convolutional layer, while its hidden layers contain multiple max-pooling layers, convolutional layers, and fully connected layers (FC or dense layers). To conduct classification tasks, the CNN ends with an FC layer with a SoftMax activation function to normalize the output of a network to a probability distribution over predicted output classes [12]. The effectiveness of CNN models in pavement distress detection and classification has been demonstrated in several studies and experiment results [12][13][14][15][16][17][18][19].

Deep Learning for Pavement Condition Evaluation
The pavement surface of a roadway section is a relatively flat plane, which makes it feasible to use 2D imagery (e.g., top-view and drone photogrammetric orthophotos [20]) to represent the pavement's spectral features (Red, Green, and Blue). In addition, 3D imagery (e.g., surface-height plot [21], depth map [22], and range image [16,23]) can represent the pavement's elevation features. Moreover, 2D and 3D images can be further aligned to the same pixel coordinates and merged as integrated features [20]. Based on those 2D/3D data sources, previous studies used the following machine learning and deep learning-based methods to achieve the pavement condition evaluation objectives.
Machine learning methods, such as Support Vector Machine (SVM) [20] and Random Forest (RF) [24], can output numerical values as classification results. In study [20], a set of spectral features (RGB and mean), textural features (contrast, correlation, energy, and homogeneity), and geometrical features (extent, eccentricity, minor axis length, major axis length, and orientation) were generated from the drone photogrammetric orthophoto. The results showed that using the combined features had an accuracy of 92% in crack/non-crack classification. Using only textural features had the lowest accuracy of 81%, as cracks are not significantly different from non-cracks in asphalt pavement. Using spectral and structural features separately had an accuracy of about 85%, because cracks differ from non-cracks in color and shape [20]. Moreover, CrackForest [24], an RF classifier, also used an integral channel feature (three color, two magnitude, and eight orientation channels) for road image crack detection.
Deep learning approaches, such as Artificial Neural Networks (ANNs), or Neural Networks (NNs), can also perform the classification task. NNs have an architecture of multiple layers, including an input layer, hidden layers, and an output layer. By using different hidden layers to connect the input and output layers, NNs can generate anything from numerical values to free-form elements such as images, texts, and sounds. Multilayer Perceptron (MLP) is a class of feedforward ANN which typically has 1D vector input data, such as a GPR trace with 128 samples [25] or 300 samples [26]. The hidden layers usually are fully connected layers (FC or dense layers), dropout layers, and activation functions. The output layer contains a SoftMax activation function to generate a 1D class vector for classification tasks, where the size of the output vector depends on the number of classes, such as 2 classes for normal signal and abnormal signal [25], and 300 classes for pavement thickness (equal to the number of samples) [26]. Then, an additional Argmax function is required to return the index of the maximum value in the class vector as the final numerical value (classification) output [12].
Beyond structured data, the most common type of input data for NNs in the reviewed studies is 2D imagery data, which results in CNNs and FCNs (Fully Convolutional Networks) being the most widely used data analysis methods for pavement evaluation. A CNN starts with a convolutional layer, while its hidden layers contain multiple max-pooling layers, convolutional layers, and FCs. A CNN typically ends with an FC with the SoftMax activation function for conducting classification tasks, which generates a numerical value (classification) output, the same as MLP [12]. That is the major difference from FCNs, because an FCN model typically does not contain an FC; instead, it uses a convolutional layer with a Sigmoid activation function as the network's end layer to generate output results of the same size as the input images [27]. Furthermore, CNNs can be used with the sliding window scheme (or overlapping small patches [12,14]) to perform crack and non-crack binary classification tasks [14][15][16][17] and pavement cracking category classification tasks [15] in each small patch of a large-resolution 2D/3D image. When the size of the window patches is very small, for example, 13 × 13 pixels [14], the CNN-based image patch classification results can properly annotate cracks on the large-resolution images [14,16]. Moreover, a previous study [18] also utilized the bilateral filter to smooth 227 × 227-pixel small patches with cracks, and implemented a k-means clustering-based image segmentation algorithm to achieve a pixel accuracy of 98.70%.
Therefore, considering the effectiveness of CNN models in pavement distress detection and classification in the previous studies and experiment results, a feasibility study of aerial imagery and CNN-based PCI estimation (rating PCI on a multi-level scale) was conducted in this research project. If successful, the proposed method can skip the time-consuming and labor-consuming pavement condition survey processes for pavement distress type classification, severity determination, and quantity measurement.
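The SoftMax and Argmax steps described above can be illustrated with a minimal sketch in plain Python. The raw scores below are made-up values for a hypothetical two-class case (normal and abnormal signal); they are not taken from the cited studies.

```python
import math

def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical raw scores for two classes (normal signal, abnormal signal).
probs = softmax([0.3, 2.1])
predicted_class = argmax(probs)  # index of the most probable class
```

The length of the list passed to `softmax` plays the role of the number of classes, so the same two functions would apply unchanged to a 300-class output.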
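As a rough illustration of the sliding window scheme, the sketch below enumerates the top-left corners of square patches over an image. The function name, the image size, and the stride are assumptions chosen for illustration; only the 13 × 13 patch size comes from the text.

```python
def sliding_window_patches(height, width, patch, stride):
    """Yield (top, left) corners of square patches that fit inside the image."""
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            yield top, left

# Non-overlapping 13 x 13 patches on a hypothetical 39 x 39 image.
corners = list(sliding_window_patches(height=39, width=39, patch=13, stride=13))

# A stride smaller than the patch size gives overlapping patches.
overlapping = list(sliding_window_patches(height=15, width=15, patch=13, stride=1))
```

Each patch would then be passed to the classifier, and the per-patch labels mapped back to the patch locations on the full image.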

CNN Model for Classification and Visualization
The proposed CNN model, named PCIer, is shown in Figure 1, with detailed model layers and parameters in Table 1. The collected large-dimension aerial images are first resized down to 256 × 256-pixel images as the model inputs. Then, the inputs are processed by four 2D convolutional layers and three max-pooling layers; then, the fifth convolutional layer (named "heatmap layer") uses the 1 × 1 convolutional operation to reduce the 512-channel feature maps to a smaller number of channels. In this paper, heatmap layers with 64 and 128 channels are compared. In addition, the heatmap layer has a size of 32 × 32 pixels, which is designed to generate the heatmap via Grad-CAM (Gradient-weighted Class Activation Mapping), a visualization technique for deep learning networks [28].
The generated heatmaps have the same size as convolutional outputs; thus, to make heatmaps' sizes close to 256 × 256 pixels as in the original CNN inputs, there is no pooling layer between the fourth and fifth convolutional layers (heatmap layer) in the proposed CNN (see Figure 1). The resized heatmaps indicate the regions of the image that contribute to the CNN's classification results. Moreover, after the heatmap layer, the flatten layer (operation) converts the 32 × 32 × 128 features to a 1D vector of 131,072 elements (or 32 × 32 × 64 features to a 1D vector of 65,536 elements). Then, the four dense layers reduce the dimension of the 1D vector to 1024, 128, 16, and 4 features, respectively. The proposed CNN model has four dropout layers before four dense layers, which are used to avoid model overfitting. The ReLU activation function is used in the CNN model's hidden layers (Feature Learning and Classification Blocks in Table 1), because ReLU is faster than other activation functions, such as Sigmoid [12,27]. The CNN model's output layer uses the SoftMax activation function to generate the probabilities of PCI grades, such as 1% for Green/Good/Very Good (PCI ≥ 70), 2% for Blue/Fair (50 ≤ PCI < 70), 3% for Yellow/Poor (25 ≤ PCI < 50), and 94% for Red/Very Poor/Failed (PCI < 25), as the example shows in Figure 1. Hence, the PCI image datasets of pavement in very poor/failed condition (PCI < 25), poor condition (25 ≤ PCI < 50), fair condition (50 ≤ PCI < 70), and good/very good condition (PCI ≥ 70) need to be prepared, in which images are set with the class label of 0, 1, 2, and 3, respectively.
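The layer sizes quoted above can be checked with a short shape-tracing sketch. It assumes that the convolutions preserve the spatial size and that each max-pooling layer halves the height and width; these are assumptions, since Table 1 is not reproduced here, and the function name is hypothetical.

```python
def pcier_shapes(input_size=256, heatmap_channels=128, num_pools=3):
    """Trace feature-map sizes through the layers described in the text.

    Assumes size-preserving convolutions and 2 x 2 max-pooling with
    stride 2 (each pooling layer halves the height and width).
    """
    size = input_size
    for _ in range(num_pools):
        size //= 2                          # one max-pooling layer
    flat = size * size * heatmap_channels   # length of the flattened vector
    dense_units = [1024, 128, 16, 4]        # dense-layer widths stated in the text
    return {"heatmap_hw": size, "flatten": flat, "dense": dense_units}

shapes_128 = pcier_shapes(heatmap_channels=128)
shapes_64 = pcier_shapes(heatmap_channels=64)
```

Under these assumptions the heatmap layer is 32 × 32 pixels, and the flattened vectors have 131,072 and 65,536 elements for the 128-channel and 64-channel variants, matching the figures given in the text.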

Data Augmentation
The proposed Data Augmentation (DA) strategies aim to obtain a well-trained CNN using a limited number of datasets. Figure 2 illustrates the proposed DA strategies, which integrate image transformations of scaling (in a range of 0.5 to 1.5), stretch (scaling in either width or height direction), rotation, flipping, and reflection. The ratios of scaling and stretch are randomly generated. When ratios are larger than one, only the central regions are kept. When ratios are less than one, the small-sized images are padded with reflection operations (alternative to constant padding). In addition, the image color adjustments are randomly determined in the adjustments of brightness, contrast, saturation, and sharpness [29].
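The crop-or-pad rule for scaled images can be sketched as follows. This is a minimal plain-Python illustration on nested lists, not the authors' implementation; the resampling step that produces the scaled image is omitted, the helper names are hypothetical, and the mirroring convention (edge pixels repeated) is an assumption.

```python
def reflect_index(i, n):
    """Map any integer index into [0, n) by mirroring at the borders."""
    period = 2 * n
    i %= period
    return i if i < n else period - 1 - i

def fit_to_size(img, out_h, out_w):
    """Center-crop a larger image or reflection-pad a smaller one.

    `img` is a list of rows; the output has out_h rows and out_w columns.
    """
    in_h, in_w = len(img), len(img[0])
    top = (in_h - out_h) // 2    # positive: crop offset, negative: pad amount
    left = (in_w - out_w) // 2
    return [
        [img[reflect_index(r + top, in_h)][reflect_index(c + left, in_w)]
         for c in range(out_w)]
        for r in range(out_h)
    ]

# Ratio > 1: a 4 x 4 "scaled" image is center-cropped back to 2 x 2.
big = [[r * 4 + c for c in range(4)] for r in range(4)]
cropped = fit_to_size(big, 2, 2)

# Ratio < 1: a 2 x 2 "scaled" image is reflection-padded up to 4 x 4.
padded = fit_to_size([[1, 2], [3, 4]], 4, 4)
```

Reflection padding fills the border with mirrored image content, so the padded region keeps a pavement-like texture instead of the uniform band that constant padding would produce.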

Evaluation Metrics
The following metrics are used to measure the classification performance of the proposed CNN model and DA strategies: Accuracy, Equation (1), is the ratio of the number of correct predictions to the total number of testing images.

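$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions Made}} \tag{1}$$

Precision, Equation (2), is the number of correct positive results divided by the number of positive results predicted by the CNN model.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall, Equation (3), is the number of correct positive results divided by the number of all relevant images (all images that should have been identified as positive).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

F1 Score, Equation (4), is the harmonic mean of precision and recall. The F1 Score ranges between 0 and 1 and indicates how precise the CNN model is (how many images it classifies correctly), as well as how robust it is (it does not miss a significant number of images).

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Here, for a given PCI grade, $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative predictions, respectively.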
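The four metrics can be computed from a confusion matrix as in the sketch below. The function name and the example matrix are hypothetical (the matrix is not taken from Figure 4); the weighted averages weight each class by its number of true images.

```python
def classification_report(confusion):
    """Compute accuracy and per-class precision, recall, and F1.

    confusion[i][j] = number of images of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall,
                          "f1": f1, "support": tp + fn})
    weighted = {
        m: sum(c[m] * c["support"] for c in per_class) / total
        for m in ("precision", "recall", "f1")
    }
    return {"accuracy": correct / total, "per_class": per_class,
            "weighted": weighted}

# Hypothetical two-class confusion matrix, for illustration only.
report = classification_report([[8, 2], [1, 9]])
```

Note that the support-weighted recall always equals the overall accuracy, because both reduce to the number of correct predictions divided by the total number of images.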

Dataset Preparation
The City of Sacramento (California, USA) rated and mapped the condition of the streets with the following standards: a PCI score of 70 to 100 is considered "Excellent/Good", 50 to 69 is "Fair", 25 to 49 is "Poor", and 0 to 24 is "Very Poor" [30]. One hundred images were collected from the Google Earth web version for each PCI grade via a Google Earth Screenshot Tool developed in previous research [11]. Table 2 lists the collected PCI images (and parameters) from five streets in Sacramento. An example of the collected PCI image is shown on the left of Figure 1. The prepared training and testing datasets can be accessed in [31]. For each PCI grade, 80 images were used for CNN model training, and the remaining 20 images were used for CNN model testing. By applying the proposed DA once, an original image would be transformed into eight styles, as shown in Figure 2. By running the random DA for ten rounds, the original image would be extended to 80 (=1 × 8 × 10) images. Thus, the collected 320 (=80 × 4) training images generated a training dataset with 25,600 (=80 × 320) images.
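The grading thresholds and the dataset arithmetic above can be reproduced with a short sketch. The function and constant names are hypothetical; the thresholds, class labels, and counts are those stated in the text.

```python
def pci_to_label(pci):
    """Map a PCI score (0-100) to the class label used for training."""
    if not 0 <= pci <= 100:
        raise ValueError("PCI must be between 0 and 100")
    if pci < 25:
        return 0   # Very Poor
    if pci < 50:
        return 1   # Poor
    if pci < 70:
        return 2   # Fair
    return 3       # Good

NUM_GRADES = 4
TRAIN_PER_GRADE = 80     # of the 100 images collected per grade
TEST_PER_GRADE = 20
STYLES_PER_PASS = 8      # one DA pass yields eight styles
DA_ROUNDS = 10

original_train = TRAIN_PER_GRADE * NUM_GRADES                   # 320
augmented_train = original_train * STYLES_PER_PASS * DA_ROUNDS  # 25,600
test_total = TEST_PER_GRADE * NUM_GRADES                        # 80
```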

Model Training
Since the expected output is the probabilities for the four PCI grades of Very Poor, Poor, Fair, and Good, the loss function "categorical_crossentropy" was used in CNN model training. In addition, the "validation_split" was set at 0.20, which means that 20% (5120) of the samples were used for validating the model, and the other 80% (20,480) were used for model training. The maximum number of training epochs (an epoch is one full cycle through the entire training dataset) was set at 50. To avoid model overfitting, the training process was stopped early when the validation accuracy (closer to one is better) had not improved over the previous five epochs.
The plots of accuracy and loss (closer to zero is better) are shown in Figure 3. They indicate that both models (one model has 64 channels in the heatmap layer and another model with a 128-channel heatmap layer) stopped training earlier than the maximum epoch due to activation of the early stopping as described previously. In detail, the 128-channel model stopped at the 14th epoch with a "final" validation accuracy of 0.9846 and the "best" validation accuracy of 0.9979 at the 9th epoch.
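The early stopping rule and the distinction between the "best" and "final" epochs can be sketched in plain Python. This is an illustration of the logic only, not the training code used in the study; the function name is hypothetical, and the accuracy values are made up except for 0.9979 (epoch 9) and 0.9846 (epoch 14), which are the values reported above.

```python
def early_stopping(val_accuracies, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation accuracy has not improved for
    `patience` consecutive epochs, or when the list is exhausted.
    """
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_accuracies)

# Made-up validation curve whose best value occurs at epoch 9.
curve = [0.90, 0.92, 0.93, 0.95, 0.96, 0.97, 0.98, 0.99, 0.9979,
         0.99, 0.99, 0.99, 0.99, 0.9846, 0.99, 0.99]
best_epoch, stop_epoch = early_stopping(curve, patience=5)

# Validation split of the augmented training set (integer arithmetic).
val_samples = 25600 * 20 // 100
train_samples = 25600 - val_samples
```

With a patience of five epochs, a best epoch of 9 implies a stop at epoch 14, which is consistent with the 128-channel model's training history.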

Model Testing
The "best" model (at the best validation accuracy epoch) and the "final" model (at the end epoch) were both saved and tested using the collected testing dataset (which has 20 images for each PCI grade and a total of 80 images) with the selected performance evaluation metrics.
The testing results are listed in Table 3 for the four CNN models (including the 128-channel final model, 128-channel best model, 64-channel final model, and 64-channel best model) and the four PCI grades of Very Poor, Poor, Fair, and Good. In addition, the confusion matrices in Figure 4 indicate the detailed classification results, where the label "Red 0" is the "Very Poor", "Yellow 1" is the "Poor", "Blue 2" is the "Fair", and "Green 3" is the "Good" PCI grade.

Performance Comparison
The training, validation, and testing results indicate that the proposed CNN model with a 128-channel heatmap layer (average testing accuracy of 0.95) performs better than the 64-channel model (average testing accuracy of 0.87), and that the "best" model (average testing accuracy of 0.925) performs better than the "final" model (average testing accuracy of 0.895). In addition, as shown in Table 3, the 128-channel "best" model has a testing accuracy of 0.97, and weighted average precision, recall, and F1-score of 0.98, 0.97, and 0.97, respectively. Thus, the PCIer with a 128-channel heatmap layer is recommended for PCI grade estimation in future applications. The detailed model layers and parameters are shown in Table 1. Moreover, saving the trained model at the epoch with the best validation accuracy can further improve PCIer performance.

Limitations
Examples of PCI grade prediction results are shown in Figure 5, where the predictions of the well-trained CNN model (PCIer, with a 128-channel heatmap layer) all match the ground truths. The Grad-CAM visualization results are shown in Figure 5 as well. For the Poor (Figure 5b) and Good (Figure 5d) PCI grades, the heatmap generated by Grad-CAM indicates that the street pavement contributes to those PCI grade classification results. In addition, a demonstration video of real-time PCI estimation and Grad-CAM visualization results for Power Inn Rd (Sacramento, CA) can be accessed in [32]. However, for the Very Poor (Figure 5a) and Fair (Figure 5c) PCI grades, the heatmap contains a vegetation zone that contributed to the PCI grade classification results (see the annotation in Figure 5a,c). Hence, for future applications, the effect of vegetation zones could be reduced by the following approaches: (1) Collect more images for CNN model training to reduce the impact of non-street object obstruction on the classification results. In this approach, additional convolutional layers (and channels) and dense layers may need to be added to the proposed CNN model for feature learning. The more complex model might then discard the vegetation zone. (2) Remove non-street (non-pavement) surfaces from the collected images. In this approach, the vegetation zone would be cropped out, and only the street surface would appear in the input images for the proposed CNN model.

Recommendation
The reviewed previous studies with CNN modeling use either spectral features (red, green, and blue; RGB imagery) or elevation features (3D imagery) as inputs [32]. Then, convolutional layers are used to generate complex feature maps based on the input images. In contrast, traditional methods, such as SVM [20] and RF classifiers, take structured combinations of features as inputs and reach high accuracy. Thus, for future applications of the proposed CNN models, concatenating the three RGB channels and one elevation channel to form a four-channel input image may give better performance in PCI estimation. Because elevations would be considered, the impacts of vegetation zones would be eliminated as well.
Additionally, this research assumed that the Google Earth images (or existing commercial high-resolution aerial imagery) are up-to-date, high-resolution aerial images of the target road project or network. With such images, the pavement condition evaluation would be more efficient, because the infrastructure management agency itself could skip the time-consuming and labor-consuming aerial imagery acquisition operations. Otherwise, another feasible approach would be to program drones to automatically capture top-view images of the targeted street or highway section, or to extract keyframes from a drone's video. Then, the photogrammetric orthophoto (which provides the spectral features) and point cloud (which provides the elevation features) would improve the PCIer performance.
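The four-channel input suggested above can be illustrated with a minimal sketch that appends an elevation value to every RGB pixel. It assumes the RGB image and the elevation map are already aligned to the same pixel grid; the function name and the sample values are hypothetical.

```python
def stack_rgb_elevation(rgb, elevation):
    """Append an elevation value to every RGB pixel.

    rgb:       H x W list of (r, g, b) tuples
    elevation: H x W list of floats, aligned to the same pixel grid
    returns:   H x W list of (r, g, b, z) tuples
    """
    if len(rgb) != len(elevation) or any(
        len(r_row) != len(z_row) for r_row, z_row in zip(rgb, elevation)
    ):
        raise ValueError("RGB image and elevation map must have the same size")
    return [
        [(*pixel, z) for pixel, z in zip(r_row, z_row)]
        for r_row, z_row in zip(rgb, elevation)
    ]

# Hypothetical 1 x 2 image: a pavement pixel and a taller vegetation pixel.
four_channel = stack_rgb_elevation(
    [[(90, 90, 95), (40, 120, 50)]],
    [[0.0, 3.5]],
)
```

A CNN taking such inputs would need its first convolutional layer to accept four channels instead of three.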

Conclusions
This paper developed a CNN and Google Earth-based PCI estimation and visualization method, named PCIer, and presented the feasibility study results. In the experimental evaluations, the ground truth PCI datasets, rated via the ASTM D6433 pavement distress protocols, were collected from the publicly available pavement condition report by the City of Sacramento, CA [30]. The aerial image datasets of five streets in Sacramento, CA were collected via the Google Earth Screenshot Tool [11]. The performance comparisons showed that using a 128-channel heatmap layer in the developed PCIer model and saving the model with the best validation accuracy gave the best performance, with a testing accuracy of 0.97, and weighted average precision, recall, and F1-score of 0.98, 0.97, and 0.97, respectively.
Compared to ASTM D6433, the developed PCIer can quickly generate PCI estimations and avoid the time-consuming and labor-consuming pavement condition survey processes for the classification of distress type, determination of distress severity, and measurement of distress quantity. Local infrastructure management agencies can use publicly accessible aerial images for the initial pavement condition evaluation and then send crews to check the likely poor-condition sections. In addition, the developed PCIer can also process the oblique images captured by vehicle-mounted cameras. Infrastructure management agencies can easily deploy the PCIer for their pavement evaluation projects with their own datasets of historical PCI data and the associated images.
Data Availability Statement: Training and Testing Datasets are available in [31].