Detecting and Localizing Dents on Vehicle Bodies Using Region-Based Convolutional Neural Network

Abstract: Detection and localization of dents that occur on a vehicle body during manufacturing is critical to achieving the appearance quality of a new vehicle. This study proposes a region-based convolutional neural network (R-CNN) to detect and localize dents for vehicle body inspection. For better feature extraction, this study employed a lighting system that highlights dents in an image by projecting Mach bands (bright-dark stripes). The R-CNN was trained on the images highlighted by the Mach bands, and heat-maps were prepared from the classification scores estimated by the R-CNN to localize dents. This study applied the proposed R-CNN to the inspection of dents on the surface of a car body and quantitatively analyzed its performance. The detection accuracy was 98.5% for the testing data set, and the mean absolute error between the actual and estimated dent locations was 13.7 pixels, indicating that they were close to one another. The proposed R-CNN could be applied to detect and localize surface dents during the manufacture of vehicle bodies in the automobile industry.


Introduction
Customers who wish to buy a new car naturally expect that there will be no defects on the vehicle's exterior. If any defects exist on the surface of a vehicle body, customer loyalty to the automobile manufacturer is significantly reduced, and a customer's intention to purchase another car from the manufacturer may be withdrawn [1]. Therefore, the automobile industry rigorously performs inspections to repair (touch up) any exterior defects during the final stage of manufacturing [2,3].
Two limitations (low inspection accuracy and eye fatigue) have been raised in the automobile industry, since workers inspect vehicle exteriors for defects directly with their eyes. First, small defects on the vehicle body are hardly detected through visual inspection. For example, Armesto et al. [4] reported that 80% of minor defects on the vehicle body go undetected by visual inspection. In addition, Tolba et al. [5] stated that 25% to 40% of major defects also go undetected. Second, high illuminance in an inspection room may induce eye fatigue and discomfort in workers. A high illuminance of 2000 lux can improve the visual threshold, since visual acuity increases with illuminance [6][7][8][9][10][11]. Consequently, automobile assembly lines have generally employed a high level of illumination (e.g., 2000 lux) in their inspection rooms.

Dent Highlight in an Image with Mach Bands
A special lighting device with light-emitting diodes (LEDs) and a stripe cover, as shown in Figure 1a, was employed to highlight dents in an image using Mach bands [12]. The overall size of the lighting device used in this study was 20 cm (width) × 9 cm (height) × 3.4 cm (thickness); the sizes of the bright and dark stripes were 0.5 cm (width) × 6 cm (height) and 1 cm (width) × 6 cm (height), respectively. The intensity of the device could be adjusted to find the best setting for different defect types and ambient light conditions; the LEDs (1500 lumens) were dimmed with a rotary knob. The stripe cover created Mach bands (dark-bright linear stripes) by blocking light behind the stripes and passing light between them. The Mach bands create a contrast pattern on the vehicle surface of interest and are distorted around dents because of light diffusion, as illustrated in Figure 1b. From a smooth surface, light beams are reflected back at the same concentration; however, an irregular surface scatters light beams (called diffuse light), as shown in Figure 1c. This diffuse light distorts the Mach bands around dents.

R-CNN Topology
The R-CNN structure in this study consisted of an input layer, two hidden layers, and an output layer, as illustrated in Figure 2. The input layer accepted a preprocessed grayscale image through a single input node. The preprocessed image was represented by a matrix of pixel values ranging from 0 to 255 (0: black, 255: white). The region images (mini-patches) input to the network were 32 × 32 pixels, obtained by sliding a window over the preprocessed image from the top-left corner to the bottom-right corner with a stride of 5. The hidden layers consisted of two pairs of convolution and pooling layers. The first hidden layer convolved 58 filters of size 6 × 6 with a stride of 3 and padding of 2. A rectified linear unit (ReLU) was used as the activation function, and 3 × 3 max pooling was employed with a stride of 3 and padding of 2.
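The mini-patch extraction described above can be sketched as follows; the 120 × 68 pixel image size is taken from the Methods section, and the function name is illustrative:

```python
def extract_patches(image, patch=32, stride=5):
    """Slide a patch x patch window over a 2-D grayscale image
    (a list of rows) from the top-left to the bottom-right corner
    with the given stride, collecting mini-patches."""
    h, w = len(image), len(image[0])
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append([row[x:x + patch] for row in image[y:y + patch]])
    return patches

# A 120 (width) x 68 (height) image yields 18 x 8 = 144 mini-patches.
image = [[0] * 120 for _ in range(68)]
print(len(extract_patches(image)))  # 144
```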

The second hidden layer convolved 58 filters of size 6 × 6 with a stride of 3 and padding of 2. Likewise, ReLU was employed as the activation function, and 3 × 3 max pooling was used with a stride of 3 and padding of 2.
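Given these parameters, the feature-map sizes can be checked with the standard output-size formula floor((n + 2p − k)/s) + 1. The resulting 2 × 2 × 58 = 232 features feeding the fully connected layer are inferred from the stated parameters rather than quoted from the text:

```python
def out_size(n, k, stride, pad):
    # Standard conv/pool output size: floor((n + 2*pad - k) / stride) + 1
    return (n + 2 * pad - k) // stride + 1

n = 32                    # 32 x 32 input mini-patch
n = out_size(n, 6, 3, 2)  # conv1: 58 filters of 6 x 6, stride 3, pad 2 -> 11
n = out_size(n, 3, 3, 2)  # pool1: 3 x 3 max pool, stride 3, pad 2      -> 5
n = out_size(n, 6, 3, 2)  # conv2: same filter parameters               -> 2
n = out_size(n, 3, 3, 2)  # pool2: same pooling parameters              -> 2
print(n, 58 * n * n)      # 2 232 (232 features feed the 91 FC nodes)
```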
The classification stage consisted of a fully connected layer and an output layer. The fully connected layer included 91 neurons, which fed the final output layer. The output layer included two neurons, which were activated by a softmax function to classify each patch as normal or abnormal (dent).
Six parameters (kernel number, kernel size, padding size, pooling size, stride size, and number of fully connected nodes) of the R-CNN were determined with an adaptive genetic algorithm (AGA), following [22]. The AGA searched for a satisficing set of parameters by evolutionarily varying the parameters of the R-CNN. The gene for the AGA consisted of 20 binary digits (6 digits for the kernel number, 3 for the kernel size, 2 for the padding size, 2 for the pooling size, 2 for the stride size, and 5 for the number of fully connected nodes). Roulette wheel selection from the population (population size = 20) and one-point crossover were applied to generate the next offspring [23,24]. Mutations were performed on the offspring with a mutation probability (initial value = 5%) that was adaptively adjusted depending on the homogeneity of the offspring [25][26][27]. The AGA efficiently explored incumbent solutions for the parameters, as can be seen in Figure 3, and the incumbent optimum improved rapidly within 50 generations. The satisficing parameters for the R-CNN were 58 for the kernel number, 6 for the kernel size, 2 for the padding size, 3 for the pooling size, 3 for the stride, and 91 for the number of fully connected nodes, as listed in Table 1.
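The gene decoding might be sketched as follows. The bit-field widths are from the text, but the mapping from each bit field to a parameter value is an assumption; note that a plain unsigned-integer reading cannot hold for every field, since 5 bits span only 0-31 while the selected number of fully connected nodes is 91, so some scaling or offset must have been applied:

```python
def decode_gene(bits):
    """Split a 20-bit AGA gene into the six R-CNN hyper-parameter fields.
    Field widths (6/3/2/2/2/5 bits) follow the paper; reading each field
    as a raw unsigned integer is an illustrative assumption only."""
    assert len(bits) == 20
    widths = [("kernel_number", 6), ("kernel_size", 3), ("padding", 2),
              ("pooling", 2), ("stride", 2), ("fc_nodes", 5)]
    params, i = {}, 0
    for name, w in widths.items() if isinstance(widths, dict) else widths:
        params[name] = int(bits[i:i + w], 2)
        i += w
    return params

print(decode_gene("11101011010111101011"))
```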


Dent Localization Using Heat-Map
To localize dents in an image, a heat-map was prepared using the last convolution layer of the R-CNN established in this study. We cropped patches from the test image (120 × 68 pixels) by sliding a 32 × 32-pixel window and fed them to the R-CNN to obtain their classification scores at the last convolution layer. The estimated classification scores were used to form a heat map, and a bounding box containing a dent was formed, as shown in Figure 4. Lastly, the location of a dent was estimated as the center of the bounding box.
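The localization procedure can be sketched as follows. The thresholding of the heat-map scores and the stand-in `score_fn` are assumptions, since the paper does not detail how the bounding box is derived from the classification scores:

```python
def localize_dent(image, score_fn, patch=32, stride=5, thresh=0.5):
    """Score sliding windows to form a heat map, threshold it, and
    return the center of the bounding box around high-score windows.
    `score_fn` stands in for the R-CNN's dent classification score."""
    h, w = len(image), len(image[0])
    hits = [(x, y) for y in range(0, h - patch + 1, stride)
                   for x in range(0, w - patch + 1, stride)
                   if score_fn(image, x, y, patch) > thresh]
    if not hits:
        return None  # no dent detected in this image
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    # The bounding box spans all high-score windows; the dent location
    # is estimated as the center of that box.
    x0, x1 = min(xs), max(xs) + patch
    y0, y1 = min(ys), max(ys) + patch
    return ((x0 + x1) / 2, (y0 + y1) / 2)

# Toy scorer: flags any window that contains the pixel at (60, 30).
score = lambda img, x, y, p: 1.0 if (x <= 60 < x + p and y <= 30 < y + p) else 0.0
print(localize_dent([[0] * 120 for _ in range(68)], score))  # (61.0, 31.0)
```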

Methods and Material
We used a vehicle body fender (Figure 5a) with 25 artificial defects (diameter: 0.5-10 mm) to collect normal and abnormal images. The vehicle body part was fixed 80 cm high on top of a support fixture, as shown in Figure 5b. The illuminance of the room was controlled at around 500 lux, and a researcher seated in an office chair recorded videos. The lighting device employed in this study was placed 32.5 cm (range: 30-35 cm) away from the target inspection surface and was slowly moved over the surface to scan the entire inspection area. A general-purpose video camera captured videos while the target surface was scanned with the lighting device; in total, 8017 image frames were extracted from the recorded video clips. This study preprocessed the extracted images in three steps (grayscale conversion, image segmentation, and labeling). In the first step, the 8017 images (120 × 68 pixels) sampled from the video clips were converted to grayscale in order to generalize the evaluation results by eliminating the effect of car body color. In the second step, image patches (32 × 32 pixels) were prepared for each image by visiting all locations of the image, and all patches were manually labeled as either normal or abnormal. Then, we randomly selected equal numbers of normal (2200) and abnormal (2200) patches for training and testing the R-CNN.
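The grayscale conversion step might be sketched as follows; the BT.601 luma weights are a common default and an assumption here, since the paper does not specify its conversion formula:

```python
def to_grayscale(rgb_image):
    """Convert an RGB image (rows of (R, G, B) tuples, values 0-255)
    to grayscale to remove the effect of car body color. The BT.601
    luma weights are an assumption; the paper gives no formula."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

print(to_grayscale([[(255, 0, 0), (255, 255, 255)]]))  # [[76, 255]]
```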

The normal and abnormal patches were randomly divided into training (roughly 70%) and testing (roughly 30%) data sets. Thus, the R-CNN was trained with 3000 randomly selected patches (1500 normal, 1500 abnormal) out of the total of 4400. Next, the classification accuracy of the R-CNN was quantified with the remaining 1400 patches (700 normal, 700 abnormal), which were not used in training. Lastly, the localization accuracy of the R-CNN was calculated for each whole image; localization was judged correct when an actual dent fell inside a bounding box formed by the R-CNN.
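A balanced split along these lines can be sketched as follows. Note that an exact 70/30 split of 4400 patches yields 3080/1320; the paper's 3000/1400 figures correspond to an approximately 70/30 ratio:

```python
import random

def balanced_split(normal, abnormal, train_frac=0.7, seed=0):
    """Randomly split equal-sized normal/abnormal patch sets into
    training and testing subsets, keeping the classes balanced.
    Labels: 0 = normal, 1 = abnormal (dent)."""
    rng = random.Random(seed)
    train, test = [], []
    for patches, label in ((normal, 0), (abnormal, 1)):
        idx = list(range(len(patches)))
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)
        train += [(patches[i], label) for i in idx[:cut]]
        test += [(patches[i], label) for i in idx[cut:]]
    return train, test

train, test = balanced_split(list(range(2200)), list(range(2200)))
print(len(train), len(test))  # 3080 1320
```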

Results
The overall classification accuracy of the proposed R-CNN model was 100% for the training patches and 98.5% for the testing patches; no significant bias between the two sets was observed. Figure 6 shows examples of normal and abnormal patches identified by the R-CNN model, and Figure 7 shows an example of the training accuracy over iterations.
The overall classification accuracy of the proposed R-CNN model (98.5%) was superior to that of a plain R-CNN model (88.7%) on the testing patches. The plain R-CNN model, which we implemented for comparison, employed a generic set of hyper-parameters (kernel number: 10, kernel size: 3, pooling size: 2, stride: 2, number of fully connected nodes: 50). Since our model used the combination of hyper-parameters selected by the adaptive genetic algorithm, it achieved considerably better accuracy than the plain model.
The sensitivity (the percentage of patches with dents that were correctly identified) and specificity (the percentage of normal patches that were correctly identified) for the testing patches were 97.9% and 99.2%, respectively. The R-CNN model misclassified 2.1% of the abnormal patches as normal; however, most of these misclassified patches were not distinguishable by human vision either. In addition, 0.8% of the normal patches were misclassified as abnormal.
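These rates follow the standard definitions; the confusion-matrix counts below are illustrative reconstructions for the 700 + 700 testing patches, not figures reported in the paper:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts: ~2.1% of 700 abnormal patches missed (15 FN),
# ~0.8% of 700 normal patches flagged (6 FP). With integer counts out
# of 700, the specificity rounds to 99.1% rather than exactly 99.2%.
sens, spec = sens_spec(tp=685, fn=15, tn=694, fp=6)
print(f"{sens:.1%} {spec:.1%}")  # 97.9% 99.1%
```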
The mean absolute error (MAE) was 13.7 pixels (SD = 10.7 pixels). The MAE was calculated as the average Euclidean distance between the locations of the actual and predicted dents; the actual dent locations were identified manually by our research team. As shown in Figure 8, the predicted dent locations were all close to the actual ones.
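The MAE computation can be sketched as follows, with hypothetical dent coordinates:

```python
import math

def mae_pixels(actual, predicted):
    """Mean Euclidean distance (in pixels) between matched actual and
    predicted dent locations, matching the paper's definition of MAE."""
    assert len(actual) == len(predicted)
    dists = [math.hypot(ax - px, ay - py)
             for (ax, ay), (px, py) in zip(actual, predicted)]
    return sum(dists) / len(dists)

# Hypothetical example: two dents localized 5 and 13 pixels off.
print(mae_pixels([(10, 10), (50, 40)], [(13, 14), (55, 52)]))  # 9.0
```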

Conclusions
The present study proposed and applied a region-based convolutional neural network (R-CNN) to detect and localize dents on the surface of a vehicle body. An adaptive genetic algorithm (AGA) explored the optimal combination of hyper-parameters for the R-CNN through an evolutionary process to achieve the best classification accuracy. The R-CNN classified an input image patch as either normal or abnormal (dent) and localized the estimated position of a dent using the heat-map. The proposed method classified normal and abnormal patches with an accuracy of 98.5%, and its MAE was 13.7 pixels, indicating a very small discrepancy between the actual and estimated dent locations. The R-CNN proposed in this study could help detect and localize dents on a vehicle exterior for vehicle manufacturing and assembly companies.
Although the findings of this study are promising, there is still room for improvement in future studies. Three directions are suggested to improve the practical applicability of the R-CNN model. First, this study used images of a vehicle fender to train the model, so the model is specialized for detecting defects on the fender; it might nonetheless be applicable to other parts of a vehicle body. To validate this, further studies are needed to evaluate the classification performance of the method on other body parts. Second, this study demonstrated that the proposed model can classify an image as normal or abnormal with high accuracy; therefore, it could be useful in developing a real-time inspection system that consecutively records images of a vehicle body and detects defects. Lastly, this study used artificial defects on a vehicle body; future studies should develop and evaluate R-CNN models for detecting real defects that occur naturally during vehicle manufacturing and assembly.
