Pavement Distress Detection with Deep Learning Using the Orthoframes Acquired by a Mobile Mapping System †

Abstract: The subject matter of this research article is automatic detection of pavement distress on highway roads using computer vision algorithms. Specifically, deep learning convolutional neural network models are employed towards the implementation of the detector. Source data for training the detector come in the form of orthoframes acquired by a mobile mapping system. Compared to our previous work, the orthoframes are generally of better quality, but more importantly, in this work, we introduce a manual preprocessing step: sets of orthoframes are carefully selected for training and manually digitized to ensure adequate performance of the detector. Pretrained convolutional neural networks are then fine-tuned for the problem of pavement distress detection. Corresponding experimental results are provided and analyzed and indicate a successful implementation of the detector.


Introduction
The condition of roads is easily one of the more important signs of economic standards and general well-being in a given country or region. Early detection and repair of pavement defects avoid further degradation and bring down the overall road maintenance cost. Efficient and timely road inspection is therefore one of the key elements of a successful pavement management system. Yet, periodical road surveys tend to be rather costly and time consuming if carried out in the traditional way, i.e., by human visual inspection of the road surface.
In recent years, automatic image based road distress evaluation has become an option [1]. Although it is still an open research problem and subject to environmental conditions such as illumination level, shadows cast by nearby objects, etc., great progress has been made in this area, and various methods ranging from filtering and thresholding to artificial neural networks have been employed to carry out the task.
Public infrastructure undergoes aging, as well as degradation due to weather conditions. The present research is motivated by the fact that in Estonia, the daily temperature can fluctuate around 0 °C for more than five months a year. Therefore, ice and snow melt during the day and freeze

Literature Review
In past decades, multiple research and development projects have addressed the problems arising from road pavement distress. This includes research on pavement distress prediction [3], association between pavement distress and risk of road accidents [4], and pavement distress prevention [5,6]. In addition, considerable research efforts have been focusing on pavement distress detection.
Pavement distress detection research can be categorized based on input data and methods of collecting input data (see Table 1). While images remain the most widely used input data type, (ground penetrating) radar and 3D data (laser or LiDAR scanning and stereo-imaging) are also quite commonly used, whereas acoustic and other types of input data are employed rarely.
Table 1. Research by input data and data collection.
Input data | References
Image | [2,
Radar | [35–37]
3D images or point clouds | [38–46]
Acoustic | [47,48]
In order to obtain better detection performance, many systems combine several approaches for data acquisition and measurement. For example, LiDAR technology allows acquiring a subsurface profile with elevation information in addition to discovering changes in the properties of material [49], while laser based systems provide the possibility of performing automatic analysis of surface characteristics such as evenness and skid resistance. Unfortunately, these otherwise excellent solutions have one important drawback: most such systems operate at relatively low speed, e.g., under 10 km/h. Not only does this increase the time and cost of data acquisition, but operation at such a low speed in daily traffic also reduces road traffic safety [50]. Good examples of such complex systems are ARAN 9000 developed by University of Catania and a mobile mapping system S.T.I.E.R. [12,51]. Both systems consist of several laser based measurement devices for texture analysis and range finding, as well as of several high speed cameras. It is worth mentioning that in all works, cameras are placed orthogonally downward, facing the road pavement [12,51–53].
Moreover, in most cases, surface cameras are synchronized with a high performance lighting unit that makes the system independent of exterior lighting conditions and shadows cast by various roadside objects and allows working with different types of pavement, from concrete to dark asphalt, one lane at a time.

Source Input Data and Data Collection
Additionally, pavement distress detection research can be categorized based on which defects are detected. While most research is aimed at detecting cracks (with or without other defects), some approaches, such as [40,41,54], focus solely on detecting potholes.
In this paper, we focus on image based crack detection methods (see Section 1.1 for a discussion on input data). Pavement distress detection research on image based input data has applied a variety of methods to enhance input data and to detect or classify defects.
Image-based pavement crack detection methods fall into the following main categories (see also Table 2): intensity thresholding, edge detection, graph theory, texture analysis, machine learning algorithms (e.g., support vector regression), and (deep) neural network based methods.
Thresholding algorithms are based on the assumption that cracks are represented by local intensity minima; thus, binarization of the images will distinguish image areas with cracks from non-crack areas. The well-known Otsu thresholding method has been widely used for pavement crack detection [55,56]. To reduce the influence of illumination variations and shadows, thresholding of localized areas has been applied [57,58]. In [59], automation of the threshold selection was proposed. For more complex cases, advanced image analysis methods such as Gabor filters [23] have been used. Edge detection techniques include the usage of Canny filters, the Sobel edge detector, and other morphological filters [60–62].
With the development of artificial intelligence methods, new automated techniques for pavement distress detection have been designed. Support vector machines are commonly used for classification problems in computer vision based applications [63–65]. However, with the advent of deep learning technology, Convolutional Neural Networks (ConvNets) have started to dominate the field of object detection and recognition in vision based areas [13,51,55,66], as those methods perform feature extraction without requiring a separate feature extraction system. Some auto-encoders and fuzzy logic based neural networks have been used as well (Table 3).
Table 2. Image-based pavement distress detection methods.
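As a rough illustration of the intensity thresholding approach surveyed above, the following minimal sketch implements Otsu's method on a flat list of 8-bit grayscale values and marks the darker class as crack candidates. The pixel values are made up for the example, and the function name `otsu_threshold` is hypothetical; this is not the implementation used in any of the cited works.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity t that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_dark, sum_dark = 0, 0  # statistics of pixels with intensity <= t
    for t in range(levels):
        count_dark += hist[t]
        sum_dark += t * hist[t]
        count_bright = total - count_dark
        if count_dark == 0 or count_bright == 0:
            continue  # one class is empty, variance is undefined
        mean_dark = sum_dark / count_dark
        mean_bright = (sum_all - sum_dark) / count_bright
        between_var = count_dark * count_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Made-up example: three dark (crack-like) pixels among bright pavement pixels.
pixels = [20, 25, 22, 200, 210, 205, 198, 202]
t = otsu_threshold(pixels)
crack_mask = [p <= t for p in pixels]  # True where the pixel falls in the dark class
```

A global threshold of this kind is exactly what breaks down under shadows and uneven illumination, which is why the localized variants mentioned above were introduced.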
Note that the training and testing datasets differ considerably from one research project to another. It is possible that some of the differences in results are due to the quality of input data. For example, the work in [7,8] used the publicly available CrackForest dataset with 117 images. The work in [14], on the other hand, used 3900 raw images captured by a NIKON digital camera with a resolution of 3456 × 4608 pixels, with the camera held at a distance of approximately 80 cm to 100 cm from the ground. Similarly, the work in [2,10,13] used custom datasets. In addition, the work in [9,11] used a low cost approach of obtaining images using mobile phones.

Contribution
The goal of this research work is to investigate whether the obtained orthoframes provide sufficient information to detect cracks and other pavement defects automatically and to develop such a method based on deep learning convolutional neural networks. This method should be able, in principle, to detect defects on multi-purpose datasets, such as images obtained from Google Street View. Note that using the data provided by Reach-U Ltd., the method enables defect detection with precise real-world coordinates. As a result of this research and development effort, a Python software package was developed for the company that can be used to prepare training data based on existing datasets and also to process arbitrary new road images to obtain pavement distress information.

Analysis and Preprocessing of Source Data
Closer observation of orthoframes that were collected with Ladybug 5+ in April 2019 revealed the following characteristics:
1. Inconsistent sharpness across the image. This stemmed from the horizontal placement of some of the Ladybug cameras. Due to the simple laws of optics, the road surface gradually loses detail as the distance from the original camera shooting location increases. See Figure 2.
2. Inconsistent brightness from image to image. This was related to the availability of light at the moment when an image was taken. Note that the situation has improved considerably not only because of the CMOS sensors of Ladybug 5+, but also because Reach-U Ltd. has instructed the MMS driver to adjust the shutter speed manually during the data collection if the lighting conditions change on the road to avoid under- or over-exposure.
3. Comparatively high number of shadows cast by various objects on the road, near the road, by the camera rig, or by the car itself doing the mapping. The intensity of shadows is directly correlated with the availability of light, and their extent is (among other things) dependent on the angle of the Sun's rays, which is illustrated in Figure 3.
In addition, there were various artifacts found in some images (vehicle fragments, people, etc.). However, there were only a few of these, so in general, they can be treated as statistically irrelevant and ignored.
Inconsistent quality of the images may mislead the ConvNet training [71], and although by forsaking the orthophoto format, we were able to avoid the "stitching seams" among individual orthoframes, the gradual loss of detail as we moved away from the position of the camera still presented a problem. We therefore focused on the sharper part of the orthoframe. The original orthoframe mask was multiplied with a filled circle with a radius of 1500 pixels. The resulting mask (Figure 4) was then used to extract the more detailed part of the orthoframe. As the consecutive orthoframes were overlapping, there was no loss of ground.
Note also that in the resulting image, the area that was not road surface became much smaller and was most often present on one side of the road only and could thus typically be separated with a single line (Figure 5). Therefore, the road extraction procedure that was automated in [2] was now delegated to the digitizers (a digitizer in the context of this work refers to a person engaged in digitization of defect information based on the provided orthoframes and initial defect data) who processed the orthoframes as part of the preprocessing step for obtaining more accurate training data for the automatic detector.
Out of 33,288 orthoframes, 20,318 that contained defects in the area of interest were algorithmically picked out for further consideration. Out of this selection, orthoframes with poor lighting conditions, poorly distinguishable defects, or other problems were abandoned after visual inspection. For actual digitization, 1572 orthoframes were used.
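The masking step described above, in which the original orthoframe mask is multiplied with a filled circle, can be sketched as follows. This is only an illustration on a tiny grid: the text gives the radius (1500 pixels) but not the circle center, so the center used here is an assumption, and the helper names `circular_mask` and `multiply_masks` are hypothetical.

```python
def circular_mask(height, width, center, radius):
    """Binary mask (nested lists) that is 1 inside a filled circle and 0 outside."""
    cy, cx = center
    return [
        [1 if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 else 0
         for x in range(width)]
        for y in range(height)
    ]


def multiply_masks(mask_a, mask_b):
    """Element-wise product of two binary masks of equal size."""
    return [
        [a * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


# Toy example: a 5 x 5 "orthoframe mask" whose column 3 is already masked out.
original_mask = [[0 if x == 3 else 1 for x in range(5)] for _ in range(5)]
# Assumption: the circle is centered on the image center.
circle = circular_mask(5, 5, center=(2, 2), radius=1)
combined = multiply_masks(original_mask, circle)
kept_pixels = sum(sum(row) for row in combined)
```

A pixel survives only if it is unmasked in both inputs, which is why the product rather than the sum of the masks is used.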

Pavement Distress Digitization
The defect layer provided by Reach-U Ltd. contained information about the pavement defects listed in Table 4 based on general polyline, polygon, or point defect types. The number of defects per defect class counted from the orthoframes is given in Table 5. The overall number of defects obtained this way (61,408) was considerably larger than the actual number of defects (25,771) on observed roads because usually, the same defect can be found in up to three consecutive orthoframes. From the distribution of defect classes, it appeared that the data were imbalanced, i.e., there were too few examples of defects of specific classes for training the convolutional neural network [72]. For this reason, all defects were lumped together, and no attempt was made to train the network to distinguish between individual defect classes, resulting in a binary classification problem.
To visualize pavement distress, e.g., cracks, clearly, it is customary to use very strong illumination while taking the shots with the camera [73]. This was not the case here, and defects could be found in shadowed regions with soft and hard shadows. Defects could be also found near other consistent visual features, e.g., road markings. This will inevitably reduce the accuracy of detection.
Most importantly, defect coordinates in the defect layer were often not very accurately determined (Figure 6, left). This is not critical in the application in which they were used originally and thus not the primary concern of the original digitizers. For machine learning purposes, however, it is highly important that the samples that are exploited for defect recognition depict actual defects and, conversely, that the samples that are supplied for defect-free pavement recognition not contain any defects. To provide this level of accuracy, 1572 selected orthoframes were redigitized yielding 12,728 training samples (6364 for each of two classes).
Figure 6. The part of the orthoframe removed from further analysis is marked with red color. Note that the roadside area on the right has been cut off by digitization. The annotated defects are highlighted with blue color. One can see the difference in annotation accuracy in these two images.

Image Partitioning
Since the original images were of high resolution, they were partitioned into smaller fragments that we refer to as segments throughout this paper. The idea was to study the contents of each segment and to determine whether it depicted a pavement defect or not. The total of all of these segments formed the basis for training the artificial neural network.
Segments were extracted automatically from the annotated images described in the previous section. The resulting dataset may also be augmented as needed, i.e., the number of images depicting the defects was artificially increased by applying various transformations to existing images such as translation and rotation; in theory, this should improve the efficiency of the ConvNet training.
The partitioning algorithm extracted the initial segments based on a simple grid, also capturing some redundant segments on the edges to ensure maximum coverage of the orthoframe area of interest. Only those segments that fell within the unmasked area were kept, though there was also the option to ignore segments that were partially masked. The segments were exported into a large number of PNG image files into two folders: defect_0 containing segments that depicted no defects and defect_1 containing segments with pavement defects. The procedure of division into these two classes was carried out based on the defect masks manually obtained via digitization during the preprocessing step, as discussed above.
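The grid partitioning described above can be sketched as follows. This is one possible reading of the procedure, not the code of the developed package: the helper names `grid_origins` and `select_segments` and the `keep_partial` switch are hypothetical, and the "redundant" edge segment is assumed to be a segment aligned with the far border whenever the image size is not a multiple of the segment size.

```python
def grid_origins(length, size):
    """Start offsets of a regular grid, plus one redundant segment at the far edge."""
    origins = list(range(0, length - size + 1, size))
    if origins and origins[-1] + size < length:
        origins.append(length - size)  # overlaps its neighbor, covers the border
    return origins


def select_segments(mask, size, keep_partial=False):
    """Return (row, col) origins of segments lying in the unmasked (mask == 1) area."""
    height, width = len(mask), len(mask[0])
    kept = []
    for y in grid_origins(height, size):
        for x in grid_origins(width, size):
            unmasked = sum(
                mask[yy][xx]
                for yy in range(y, y + size)
                for xx in range(x, x + size)
            )
            fully_unmasked = unmasked == size * size
            if fully_unmasked or (keep_partial and unmasked > 0):
                kept.append((y, x))
    return kept


# Toy 4 x 5 mask whose last column is masked out (0).
mask = [[1, 1, 1, 1, 0] for _ in range(4)]
strict = select_segments(mask, size=2)
lenient = select_segments(mask, size=2, keep_partial=True)
```

In the toy example, the redundant segments starting at column 3 touch the masked column and are therefore dropped in the strict mode but kept in the lenient one.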

Further Data Processing and Augmentation
Previously, it was observed that the neural networks were sensitive to different lighting conditions. Models trained on images in certain types of lighting conditions were unable to generalize well to make unbiased predictions for brighter or darker images. To combat this, we experimented with gamma correction and normalization methods. However, these correction methods might result in a loss of information by intensifying the noise present in the image, which is especially undesirable for inference.
Therefore, image preprocessing methods were abandoned. Instead, training data were augmented by applying a random amount of change to brightness and contrast values. For each new epoch, all the training samples were subject to up to a 35% increase or decrease to both brightness and contrast values. An added benefit of this method was the effective increase in different training samples. Additionally, training data were augmented by random horizontal and vertical flips, as well as random rotations up to ±180 degrees, where the missing pixels were filled by reflecting the border pixels. Various potential outputs of the transformation function can be seen in Figure 7.
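A minimal sketch of this kind of augmentation is given below for a grayscale image stored as nested lists with values in [0, 1]. The exact brightness and contrast definitions used by the authors' tooling are not stated in the text, so the formulas here (contrast as scaling around the image mean, brightness as a multiplicative gain) are assumptions, as is the function name `augment`. Random rotation with reflection padding is omitted for brevity.

```python
import random


def augment(image, rng, max_change=0.35):
    """Randomly perturb brightness/contrast by up to max_change and flip the image."""
    brightness = 1.0 + rng.uniform(-max_change, max_change)
    contrast = 1.0 + rng.uniform(-max_change, max_change)

    flat = [v for row in image for v in row]
    mean = sum(flat) / len(flat)

    # Contrast scales deviations from the mean; brightness scales the result.
    # Values are clipped back into the valid [0, 1] range.
    out = [
        [min(1.0, max(0.0, ((v - mean) * contrast + mean) * brightness)) for v in row]
        for row in image
    ]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1]                   # vertical flip
    return out


rng = random.Random(0)  # fixed seed only to make the example repeatable
image = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6]]
augmented = augment(image, rng)
```

Because new random factors are drawn on every call, applying the function once per sample per epoch yields a different variant of each training image in each epoch.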

Classification Performance Evaluation
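In this work, we were concerned with developing an accurate detector of pavement distress based on image data supplied to it. There were four possible outcomes concerning the judgments of road segments given by the classification system:
• True Negative (TN): a defect-free segment is classified as defect-free;
• True Positive (TP): a segment with a defect is classified as containing a defect;
• False Negative (FN): a segment with a defect is classified as defect-free;
• False Positive (FP): a defect-free segment is classified as containing a defect.
Based on this, it is possible to impose accuracy criteria for the system, where TN, TP, FN, and FP denote the total counts of the corresponding detection outcomes. First, we argue that the bare accuracy measure given as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
is not as meaningful as the recall and precision measures since it is critical to identify actual defects properly. The recall measure shows the percentage of how many actual defects were detected by the system and is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
and precision shows the percentage of how many of the detected defects were actual defects and is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
We also used the so-called Matthews correlation coefficient (MCC) metric, which is defined as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
because it provides a more reliable measure in the case of imbalanced data. Finally, since ConvNets return probabilities and not discrete values, one must use two threshold values: detection threshold $P_{\mathrm{det}}$ and suspicion threshold $P_{\mathrm{sus}}$, such that $P(\text{defect}) \geq P_{\mathrm{det}}$ ⇒ defect is detected and $P(\text{defect}) \geq P_{\mathrm{sus}}$ ⇒ defect is suspected.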
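The evaluation measures named above can be computed directly from the four outcome counts, as in the following sketch. The counts and the threshold values used in the example are made up for illustration and are not results or settings from this work; the function names are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and MCC from the four outcome counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "mcc": mcc}


def judge(p_defect, p_det=0.9, p_sus=0.6):
    """Map a defect probability to a verdict; both thresholds are hypothetical."""
    if p_defect >= p_det:
        return "detected"
    if p_defect >= p_sus:
        return "suspected"
    return "clear"


# Made-up, imbalanced example: 18 defect segments and 82 defect-free segments.
metrics = classification_metrics(tp=8, tn=80, fp=2, fn=10)
```

In this made-up case, the accuracy is 0.88 even though fewer than half of the defects are found (recall of about 0.44), which illustrates why accuracy alone is a poor criterion on imbalanced data.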

Deep Neural Networks' Setup
Convolutional neural networks are deep neural networks specifically tailored for analyzing visual imagery. The major advantage of ConvNets is that they require little preprocessing compared to other image classification algorithms. Three main types of layers that make up ConvNet architectures are convolutional layers, pooling layers, and fully connected layers. The main building block of ConvNets is the convolutional layer.
A convolution is the application of a filter to the layer input that results in a map of activations (feature map), indicating the locations and strength of a detected feature in an input [74]. The convolution is performed by sliding a K × K convolution filter (kernel) over the input image with a predetermined step size (stride). The innovation of using the convolution operation in a neural network is that the values of the filter are learned during the training of the network. Under stochastic gradient descent, the network is forced to learn to extract features from the input that are most useful for classifying images.
ConvNets usually learn multiple (32-512) filters in parallel for a given input. A filter must have the same number of channels (depth) as the input and can have specific filter values for each of the input channels. Regardless of the depth of the input and depth of the filter, each filter produces a 2D feature map because eventually, the channels are summed together to form one single channel (element-wise addition).
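To make the channel summation concrete, the sketch below applies a single multi-channel filter to a multi-channel input without padding and returns one 2D feature map. It is written with plain nested lists for readability rather than speed, and the function name `conv2d_single_filter` is hypothetical.

```python
def conv2d_single_filter(image, kernel, stride=1):
    """Slide one filter over a [C][H][W] input and return a 2D feature map.

    The kernel has shape [C][K][K]; the per-channel products are summed
    into a single output value at every position.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k = len(kernel[0])
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1

    feature_map = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            acc = 0.0
            for c in range(channels):
                for ky in range(k):
                    for kx in range(k):
                        acc += (kernel[c][ky][kx]
                                * image[c][oy * stride + ky][ox * stride + kx])
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Two-channel 3 x 3 input and a two-channel 2 x 2 filter of ones.
image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
kernel = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
feature_map = conv2d_single_filter(image, kernel)
```

A layer with several filters would simply repeat this computation once per filter and stack the resulting maps along the depth dimension.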
The Rectified Linear Unit (ReLU) is a supplementary step to the convolution operation. Its purpose is to increase the non-linearity in feature maps. The result of the convolution operation is passed through the ReLU activation function so the values in final feature maps are not just the sums, but the ReLU function applied to the sums. The ReLU activation function has rapidly become the default activation function for most types of neural networks. It outputs a true zero for negative inputs and acts like a linear function otherwise, but is actually a nonlinear function allowing complex relationships in the data to be learned. ReLU is also easy to implement, and networks trained with this activation function are less susceptible to the problem of vanishing gradients [75].
The depth of the output of a convolutional layer is determined by the number of filters because each of them creates a distinct feature map. The width and height of the output of a convolutional layer are, on the other hand, determined by the formula:
$$D_o = \frac{D_i - K + 2P}{S} + 1,$$
where $D_o$ and $D_i$ are the height/width of the output and input, $K$ is the filter size, $S$ is the stride, and $P$ is the width of the added border of zeros (zero-padding). Note that commonly, $K = 3$, $P = 1$, $S = 1$, and $D_o = D_i$.
Pooling layers do not affect the depth dimension, but perform a downsampling operation along the spatial dimensions (width, height) of the input for the next convolutional layer. The decrease in size leads to less computational overhead for the upcoming layers of the network, works against over-fitting, and improves local translation invariance. Much like the convolution operation, the pooling layer takes a sliding window that is moved in stride across the input and transforms its values into a more representative value by selecting, e.g., the maximum value from the window (max pooling). Contrary to the convolution operation, however, pooling has no trainable parameters, although window (kernel) size and stride must be specified. Commonly, $K = 2$, $S = 2$.
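The spatial size relation for convolutional and pooling layers, together with max pooling itself, can be checked with a few lines of code. In this sketch, the standard formula (D_i - K + 2P)/S + 1 is evaluated with integer (floor) division, which is an assumption for the cases where the division is not exact; the function names are hypothetical.

```python
def output_size(d_in, k, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (d_in - k + 2 * padding) // stride + 1


def max_pool(feature_map, k=2, stride=2):
    """Max pooling over a 2D feature map given as nested lists."""
    out_h = output_size(len(feature_map), k, stride)
    out_w = output_size(len(feature_map[0]), k, stride)
    return [
        [
            max(feature_map[y * stride + dy][x * stride + dx]
                for dy in range(k) for dx in range(k))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


same_size = output_size(224, k=3, stride=1, padding=1)  # common 3 x 3 convolution
halved = output_size(224, k=2, stride=2)                # common 2 x 2 pooling
pooled = max_pool([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
```

With the common settings, a 3 × 3 convolution with unit padding preserves a 224-pixel side, and a 2 × 2 pooling with stride 2 halves it to 112.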
Fully-connected layers are ordinary neural network layers that are fully connected with the output of the previous layer and are typically used in the last stages of the ConvNet. They are also used to construct the desired number of nodes in the output layer. A fully connected layer expects a 1D vector of numbers as its input so the 3D output of the final pooling or convolution layer must be flattened into a 1D vector of numbers before it becomes the input to the fully connected layer.
The most common form of a ConvNet architecture stacks a few convolutional layers (CONV), followed by an (optional) pooling layer (POOL), and repeats this pattern until the image has been reduced spatially to a small size. At this point, it is customary to introduce the fully connected layers (FC). The standard final layer for a multiclass classification problem is a fully connected layer with a number of nodes that corresponds to the number of classes and that uses the softmax function as its activation function that converts the numbers into probabilities. The ConvNet architecture thus appears as:
$$\text{INPUT} \rightarrow [[\text{CONV} \rightarrow \text{ReLU}] \times N \rightarrow \text{POOL?}] \times M \rightarrow [\text{FC} \rightarrow \text{ReLU}] \times L \rightarrow \text{FC},$$
where $N \in [1, 3)$, $M \geq 0$, $L \in [0, 3)$. Typically, ConvNets are trained with stochastic gradient descent, and their weights are updated using the backpropagation method. The objective function to be minimized (loss function) is defined as the cross-entropy between the training data and the network response.
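The softmax output layer and the cross-entropy loss mentioned above can be written down compactly. The following sketch handles a single sample with an integer class label; subtracting the maximum logit before exponentiation is a common numerical-stability measure added here and is not discussed in the text.

```python
import math


def softmax(logits):
    """Convert raw output-layer values into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_class):
    """Cross-entropy loss of one sample whose true class index is target_class."""
    return -math.log(probabilities[target_class])


# Two-class example (no defect / defect) with made-up logits.
probs = softmax([0.5, 2.0])
loss_if_defect = cross_entropy(probs, target_class=1)
loss_if_clean = cross_entropy(probs, target_class=0)
```

The loss is small when the network assigns a high probability to the true class and grows without bound as that probability approaches zero, which is what drives the weight updates during training.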
Deep neural networks frequently incorporate a regularization technique called dropout to prevent overfitting [76]. At each training iteration, a neuron is temporarily disabled with probability p (all the inputs and outputs to it will be disabled). The dropped out neurons are resampled with probability p at every training step, so a dropped out neuron at one step can be active at the next one. The hyperparameter p is called the dropout rate, and it is typically a number around 0.5, corresponding to 50% of the neurons being dropped out.
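A minimal sketch of dropout applied to a vector of activations is shown below. It uses the common "inverted dropout" convention, in which the surviving activations are divided by 1 - p during training so that no rescaling is needed at evaluation time; this scaling is an assumption of the sketch and is not described in the text above.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at evaluation time
    keep = 1.0 - p
    # A new random mask is drawn on every call, i.e., at every training step.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(1)  # fixed seed only to make the example repeatable
activations = [1.0] * 8
train_out = dropout(activations, p=0.5, rng=rng)
eval_out = dropout(activations, p=0.5, rng=rng, training=False)
```

Since the mask is resampled on each call, a neuron dropped in one training step can be active again in the next one.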
A ConvNet model can be thought of as a combination of two components: the feature extraction part and the classification part. The convolution and pooling layers perform feature extraction. The fully connected layers act as a classifier on top of the extracted features and assign a probability for the input image representing a class. The lower layers encode/detect simple structures (colors, edges, and simple shapes), and as we go deeper into the network, the layers build on top of each other and learn to encode more complex patterns.
One of the problems of using deep ConvNets is the requirement to have large annotated image datasets. For some domains, obtaining such data can be difficult, time consuming, and costly. To overcome those difficulties, transfer learning can be used by applying ConvNets (such as VGG-16, AlexNet [77], GoogLeNet [78], and ResNet) pretrained on large datasets to a new classification task. Networks with architectures that perform well on large scale classification tasks such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [79] have been found to be able to generalize to other tasks of image classification by retraining the fully connected layers that are near the output of the network while keeping the feature extraction part of the network with the pretrained weights [80–82].
In this work, we considered three architectures optimized for the ImageNet dataset for our task of pavement distress detection:
• VGG16 [83], which was the best performing classifier of ILSVRC in 2014 along with GoogLeNet. This architecture has 16 weight layers, 13 of which consist of 3 × 3 convolutional filters with a total of 4224 filters, followed by three fully connected layers of length 4096, 4096, and 1000, respectively. In total, it has 15,252,034 trainable parameters.
• ResNet34 and ResNet101 [84], which introduced residual blocks to the typical ConvNet architecture and won ILSVRC in 2015. The residual block allows connections from earlier convolutional layers, not only the immediately preceding one. This allows deeper models to be trained while also maintaining information only a shallower network would be able to capture [85]. As for the convolutional layers, ResNet follows the design of VGG16 with 3 × 3 convolutional filters, except for the first layer, which has 7 × 7 filters. In our work, ResNet34 had 33 convolutional layers and two fully connected layers of length 1024 and 512 and a total of 21,813,570 trainable parameters. Its deeper counterpart ResNet101 had 100 convolutional layers and two fully connected layers of length 4096 and 512, with a total of 44,608,066 trainable parameters.

Data Selection
The 1572 selected orthoframes were partitioned into segments each having dimensions of 224 × 224 pixels, which is the size many transfer learning architectures take as a default input [84,86]. Smaller and larger dimensions could also be considered, but there is a trade-off for both of these cases. Smaller segment sizes allowed us to capture more of the road, leaving fewer blind spots. However, it is more difficult to make predictions on smaller segments due to missing context. Conversely, larger segments provide more context for better accuracy, but at the cost of leaving more blind spots at the edge of the road (assuming non-overlapping segments).
In order to classify a segment as defect or not defect, we considered the percentage of defect pixels on the image. If more than 5% of the pixels on a segment were masked by the digitizer, it would be labeled as a defect segment. With this criterion, 15% of all segments would have a defect, and 85% would not. It is known that class imbalance during training reduces the performance of deep neural networks [87]. To balance our dataset, only N non-defect segments were sampled for each orthoframe containing N defect segments. This way, approximately 8 segments per orthoframe were sampled on average.
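The labeling rule and the per-orthoframe balancing can be sketched as follows. The functions operate on the binary defect mask of a segment (1 where the digitizer marked a defect) and on the list of segment labels of one orthoframe; the names `label_segment` and `balance_orthoframe` are hypothetical, and uniform random sampling of the non-defect segments is an assumption.

```python
import random


def label_segment(defect_mask, threshold=0.05):
    """Return 1 (defect) if more than `threshold` of the pixels are marked as defect."""
    flat = [v for row in defect_mask for v in row]
    return 1 if sum(flat) / len(flat) > threshold else 0


def balance_orthoframe(labels, rng):
    """Keep all N defect segments and sample N non-defect segments from one orthoframe."""
    defect = [i for i, label in enumerate(labels) if label == 1]
    clean = [i for i, label in enumerate(labels) if label == 0]
    sampled = rng.sample(clean, min(len(defect), len(clean)))
    return sorted(defect + sampled)


def toy_mask(defect_pixels, side=10):
    """Toy side x side mask with the given number of defect pixels in row-major order."""
    return [[1 if y * side + x < defect_pixels else 0 for x in range(side)]
            for y in range(side)]


labels = [label_segment(toy_mask(n)) for n in (6, 0, 5, 0, 20, 0, 0, 1, 0, 0)]
selected = balance_orthoframe(labels, random.Random(0))
```

In the toy orthoframe, the segments with 6 and 20 defect pixels out of 100 exceed the 5% threshold, the one with exactly 5 does not, and two of the eight remaining segments are drawn to match the two defect segments.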
For the purposes of neural network training, the obtained 12,728 segments with dimensions 224 × 224 pixels were split into training and validation sets, with the ratio of 0.85 and 0.15, respectively. As is typical for deep learning cases, the training set was used to optimize the parameters of the model with respect to the cross-entropy loss function, and the validation set was used to measure if the model was overfit to the training data.
In addition, a test set consisting of 55 new orthoframes from different roads was used to evaluate how well the model generalized to new conditions. As opposed to the training and validation set, for the test set, we extracted all of the possible segments, so a total of 1007 defect-free and 185 defect segments were obtained.

Deep Learning
Throughout the process, the Python library PyTorch [88] was used along with fastai [89], which provides a layer of abstraction upon PyTorch to simplify the experimentation process.
In the choice of hyperparameters, a "learning rate range test" was performed, as suggested by L.N. Smith [90,91]. The network was trained for an epoch with a linearly increasing learning rate, while the loss was measured after each processed batch. The maximum learning rate for the given model was then heuristically decided upon such that it was not in the region where the loss had a rising trend (refer to Figure 8). The learning rates chosen can be seen in Table 6.
For all model architectures, we used the pretrained weights optimized for the ImageNet dataset. Then, all of the layers except for the fully connected layers of length 4096 and 512 respectively were frozen, meaning we did not optimize the convolutional filters. In this configuration, the model was trained for two epochs. This selective freezing of the weights was done to speed up the training and ensure the earlier layers of the pretrained network were subject to less noise. After training for two epochs, all of the model parameters were unfrozen for fine-tuning purposes.
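The bookkeeping behind a learning rate range test can be sketched as follows: one learning rate per batch on a linear ramp, and a simple rule for picking a maximum learning rate that lies before the region of rising loss. The selection rule used here (the learning rate at the lowest recorded loss) is only one possible heuristic and is an assumption; the loss values are made up, and the function names are hypothetical.

```python
def linear_lr_schedule(lr_min, lr_max, num_batches):
    """One learning rate per batch, increasing linearly from lr_min to lr_max."""
    step = (lr_max - lr_min) / (num_batches - 1)
    return [lr_min + i * step for i in range(num_batches)]


def pick_max_lr(learning_rates, losses):
    """Learning rate at the lowest recorded loss, i.e., before the loss starts rising."""
    best_index = losses.index(min(losses))
    return learning_rates[best_index]


# Made-up per-batch losses: improving at first, then diverging at high rates.
learning_rates = linear_lr_schedule(0.0, 1.0, 11)
losses = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.5, 0.8, 1.5, 3.0]
max_lr = pick_max_lr(learning_rates, losses)
```

In practice, the recorded loss curve is noisy, so the choice is usually made by inspecting a plot such as Figure 8 rather than by a fixed rule.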
Discriminative fine-tuning was used for further training of the model [92]. The idea was to train the layers towards the output at higher learning rates than the earlier layers. In our case, we used logarithmically stepped learning rates $\eta(l)$, where $\eta(l)$ is the learning rate of the layer $l$, $L$ is the number of layers, and $N = \eta(L)/\eta(1)$, which we chose to be 10. Additionally, the learning rates were cyclical throughout the process, which was shown to speed up the training process [93]. For the optimizer algorithm, we chose to use Adam [94].
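One way to obtain logarithmically stepped per-layer learning rates is a geometric progression between the first and the last layer. The sketch below assumes the form η(l) = η(L) · N^((l − L)/(L − 1)), in which the last layer trains N times faster than the first; this particular expression and the function name `layer_learning_rates` are assumptions made for illustration and are not necessarily the exact formula used in this work.

```python
def layer_learning_rates(lr_last, num_layers, ratio=10.0):
    """Geometrically spaced learning rates for layers 1..num_layers.

    The last layer gets lr_last, and the first layer gets lr_last / ratio.
    """
    if num_layers == 1:
        return [lr_last]
    return [
        lr_last * ratio ** ((layer - num_layers) / (num_layers - 1))
        for layer in range(1, num_layers + 1)
    ]


# Three layer groups with a hypothetical maximum learning rate of 1e-3.
rates = layer_learning_rates(1e-3, num_layers=3)
```

On a logarithmic axis, the resulting rates are equally spaced, so each layer group is trained at a constant multiple of the rate of the group before it.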
The 25 epoch training process can be seen in Figure 9. It can be observed that throughout the initial epochs, the validation loss was actually lower than the training loss. This can be explained by the fact that the validation data were not subjected to the aggressive brightness and contrast augmentation. Additionally, the dropout layers were disabled while evaluating the model. After 25 epochs, which corresponded to 8300 training batches in Figure 9, the training loss fell below the validation loss; therefore, in order to prevent overfitting, training was stopped.

Results
From the tests (the results of which are presented in Table 7), it can be noted that the problem of crack detection benefited from the more sophisticated architecture of ConvNets as the 101 layer ResNet slightly outperformed the 34 layer ResNet. Further inspection of the obtained results revealed that many of the misclassifications were due to our labeling methodology. A manually masked defect at the corner of the image may not have passed the 5% threshold, therefore confusing the classifier (refer to Figure 10d). A few of the false positives were due to miscellaneous shapes on the road, such as tire marks, spills, etc. (refer to Figure 10b). Most of the other misclassifications included segments that were ambiguous due to low image quality and lack of context.
Any trained network can be employed to find and localize the defects from the whole orthoframe. Figure 11 depicts an orthoframe that can be considered highly problematic due to a number of sharp contoured shadows. Yet, the network was able to localize a definite crack on the left side of the image. It also suggested another damaged area on the right where the pavement was apparently problem free, however.

Software Solution
As a result of this research and development project, a fully-fledged Python software package was developed for Reach-U Ltd. that could be used to generate the data for training and inspect them; annotate (digitize) the images as needed for creating defect masks and also updating the initial image masks; train the deep learning ConvNets and apply them to arbitrary new images. The package comprised back-end functionality in two separate Python libraries and also had graphical user interfaces developed using PyQt5. The intended end-user application entitled nnapply included a graphical user interface and allowed processing arbitrary road images using the trained ConvNets, showing the detected and suspected defects. It also generated a report in Microsoft Excel format. Examples of both types of output from this application are depicted in Figures 11 and 12.
The implementation of the back-end described in [2] is now being updated to use the newly introduced deep learning libraries, but thanks to the proper separation of back- and front-end functionality, this is a relatively straightforward process.
Figure 11. Example image with defect location suggestions generated by the nnapply application. The highlighted area is unmasked, and therefore, only segments fully belonging to this area are considered during partitioning of the image. The segments are extracted at 75% overlap to provide more detail and are color coded in red, where the intensity of the color corresponds to the classifier's certainty of having discovered a defect. The regions of the orthoframe having a defect probability over 0.6 are displayed at a higher zoom level.
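The partitioning described in the caption of Figure 11 can be sketched as follows. This is a minimal illustration rather than the nnapply implementation: the function names, the 224-pixel segment size, and the example probabilities are hypothetical, and only the 75% overlap and the 0.6 probability cutoff come from the caption.

```python
def window_origins(length, size, overlap=0.75):
    """Start offsets of windows of `size` pixels along one image axis.

    A 75% overlap means consecutive windows are shifted by 25% of their size.
    """
    if length < size:
        return []
    stride = max(1, round(size * (1 - overlap)))
    return list(range(0, length - size + 1, stride))

def segment_grid(width, height, size, overlap=0.75):
    """Top-left (x, y) corners of all square segments lying fully in the image."""
    return [(x, y)
            for y in window_origins(height, size, overlap)
            for x in window_origins(width, size, overlap)]

def suspected(scores, cutoff=0.6):
    """Keep the segment origins whose defect probability exceeds the cutoff."""
    return [origin for origin, prob in scores.items() if prob > cutoff]

# Hypothetical 448x448 region split into 224x224 segments (stride of 56 px).
origins = segment_grid(448, 448, 224)

# Made-up classifier outputs for three segments of that grid.
example_scores = {(0, 0): 0.9, (56, 0): 0.6, (112, 0): 0.3}
flagged = suspected(example_scores)
```

With a 75% overlap, each interior pixel is covered by several segments, so the per-segment probabilities can be overlaid to give the finer-grained red shading visible in the figure.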

Discussion
In this paper, we presented a fully working prototype of a computer vision system designed to detect pavement distress based on orthoframes captured by a mobile mapping system. The prototype still has to be tested in the appropriate transportation system analysis environment; therefore, the presently claimed technology readiness level (TRL) is four (i.e., tested in a laboratory environment).
Further items for discussion are presented next:
• In [2], it was claimed that detection of shadow regions in the orthoframe is a critical component of the complete pavement distress detection system. However, our current tests did not completely confirm this as the system seemed to be robust to such visual artifacts. Hard shadows from tree branches still presented a problem, however, as they resembled pavement cracks.
• Ensemble classifiers were not introduced in this work as acceptable performance was obtained without complicating the system architecture. An attempt to make the detector context sensitive, i.e., use progressive zoom where a defect was suspected in the orthoframe, was considered as a possible next step in improving detection performance especially as a countermeasure for hard, fine-detail shadows.
• Data augmentation was updated to include orthoframe segment exposure variation and apparently led to improved generalization ability of the resulting ConvNet.
• Finally, the current classifier could only be regarded as a detector since the predictions about orthoframe segments were essentially binary, whether a defect was detected or not, with the additional possibility to consider suspected defects. In the future, a more advanced segmentation feature can be implemented whereby different types of defects will have related ground truth information provided by means of manual annotation for which the corresponding software package was also developed as part of this effort. In this case, however, as was shown previously, the issue of imbalanced data will have to be solved.
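The exposure variation mentioned in the list above could take a form similar to the following sketch. The exact augmentation used in this work is not specified here, so the multiplicative gain model, the function name `vary_exposure`, and the ±30% range are assumptions chosen purely for illustration.

```python
import numpy as np

def vary_exposure(segment, rng, max_shift=0.3):
    """Scale pixel intensities by a random gain in [1 - max_shift, 1 + max_shift].

    Assumption: exposure is modeled as a single multiplicative factor applied
    to all channels of an 8-bit image; results are clipped back to [0, 255].
    """
    factor = rng.uniform(1.0 - max_shift, 1.0 + max_shift)
    scaled = segment.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Hypothetical uniform mid-gray 4x4 RGB segment.
rng = np.random.default_rng(0)
segment = np.full((4, 4, 3), 128, dtype=np.uint8)
augmented = vary_exposure(segment, rng)
```

Applying a fresh random gain each time a segment is drawn for training exposes the network to the same pavement texture under different brightness levels, which is one plausible way to obtain the improved generalization noted above.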

Conclusions
In the present work, a deep learning convolutional neural network model based on several existing architectures of image classifiers was obtained using fine-tuning. The data for fine-tuning were carefully selected from thousands of existing orthoframes freshly provided by the company and having better image quality compared to the images used in [2].
The manual preprocessing step, which included digitizing the orthoframes (i.e., manually painting defect masks and updating the road mask by eliminating image areas with poor sharpness as well as areas outside the pavement), was time consuming and tedious, but proved critical for the successful implementation of the detector. In previous work, we used the data provided by the company for generating ground truth information. However, it must be taken into account that the internal purposes of digitization in the company were different, so pixel annotation accuracy was not the most important factor. Therefore, to ensure that the detection model was developed based on relevant information, a more accurate localization of defects had to be introduced. Due to this redundant approach in annotating images, the proposed solution, while not completely foolproof, should be fairly robust with respect to annotation mistakes, at least from the point of view of visual inspection.
Furthermore, data augmentation proved useful for coping with varying lighting conditions, which still presented a challenge when analyzing the images. The next step for data augmentation is the implementation of distortion tuning [71].
Instead of attempting to train convolutional neural networks from scratch, we only considered pretrained neural networks in this work. The reason for this was that significantly better results were obtained with pretrained networks compared to the results reported in [2], and therefore, using simpler network structures was not considered a benefit. Indeed, the precision and recall metrics increased from 0.22 and 0.35, respectively, to 0.90 and 0.87. This was a significant improvement and was very likely related to several factors, including better quality orthoframes, the use of only sharp image areas, and manual digitization of the images, i.e., annotating defects and updating pavement masks as needed. The latter also contributed to solving the problem that appeared in [2], where the majority of false positive detections were due to the classifier incorrectly identifying road edges as pavement distress.
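For reference, the precision and recall metrics quoted above follow the standard definitions based on true positives, false positives, and false negatives. The sketch below uses made-up counts chosen only to give values of a similar magnitude; they are not the confusion counts obtained in the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only: 90 defect segments found correctly,
# 10 false alarms, and 13 defect segments missed.
precision, recall = precision_recall(tp=90, fp=10, fn=13)
```

High precision means that few of the flagged segments are false alarms (such as road edges or tire marks), whereas high recall means that few actual defects are missed.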
Finally, the software package proposed to the company was updated and also included an efficient image annotation tool tailored to the specific purpose of preparing higher quality ground truth files for defect detection and pavement area extraction. Although a different deep learning backend was used (PyTorch and FastAI instead of TensorFlow and Keras), the software is easy to update and hence it will soon be ready for deployment, further testing, and its eventual application for improving highway road pavement conditions.