Data Augmentation Methods Applying Grayscale Images for Convolutional Neural Networks in Machine Vision

In smart factories, the demand for machine vision is rising as automated surface inspection is used to increase manufacturing productivity. Recently, convolutional neural networks (CNNs) have demonstrated outstanding performance and solved many problems in the field of computer vision. Accordingly, many machine vision systems adopt CNNs for surface defect inspection. In this study, we developed an effective data augmentation method for grayscale images in CNN-based machine vision with mono cameras. Our method can be applied to grayscale industrial images, and we demonstrated outstanding performance in image classification and object detection tasks. The main contributions of this study are as follows: (1) We propose a data augmentation method that can be performed when training CNNs with industrial images taken with mono cameras. (2) We demonstrate that image classification and object detection performance is better when training with the industrial image data augmented by the proposed method. With the proposed method, many machine vision problems involving mono cameras can be solved effectively by using CNNs.


Introduction
With the increasing demand for machine vision to automate the surface inspection of factories, the requirement for higher inspection speed and accuracy has also increased. Machine vision refers to any software or hardware that utilizes visual information of the inspection target to perform the inspection. Conventional machine vision [1][2][3] is capable of inspecting formalized defects through rule-based inspections. However, detecting non-formalized defects is challenging to conventional machine vision applications.
Most machine vision applications use mono cameras because they rely on the structural features of the inspection target. As shown in Figure 1, a color camera typically obtains a three-channel RGB image via interpolation after shooting through a Bayer pattern. A mono camera, however, does not require interpolation after acquisition. As a result, a mono camera has a better effective resolution than a color camera with the same number of pixels. Because most machine vision systems do not require color information for defect inspection, they use mono cameras to obtain grayscale images.
Regarding the application of CNN-based machine vision with mono cameras, Yang et al. [16] confirmed that transfer learning from the network that won the ILSVRC competition achieves higher classification accuracy than training from scratch. In image classification with grayscale images, Xie and Richmond [22] showed that transfer learning from a network pretrained on grayscale ILSVRC data yields better classification accuracy than transfer learning from a network pretrained on the original ILSVRC data.
Burduja et al. [23] performed intracranial hemorrhage detection by using color images merged from three grayscale images that extracted different features from one CT image.
To address the scarcity of data on defective products in machine vision, Yun et al. [17] performed data augmentation through a conditional convolutional variational autoencoder (CCVAE) for defect classification. However, if surface defect inspection is performed with object detection, the application of CCVAE-based data augmentation is limited.
In general computer vision, CNNs are trained by using large amounts of data, such as a million images for a thousand classes provided by ILSVRC or 110,000 images for 80 classes provided by COCO [24]. However, collecting a balanced dataset of that size for each machine vision application is rarely practical. Consequently, most CNN applications in machine vision are trained with small, imbalanced datasets. Therefore, reliable methods to train surface inspection networks with these small datasets must be devised.
In this study, we devised a data augmentation method that can be easily applied when preparing CNN-based machine vision systems, using mono cameras. Our proposed method does not leverage neural networks, so that it can perform data augmentation quickly. We also demonstrate that it can be applicable for imbalanced datasets. Experiments show that our proposed method is effective for both image classification and object detection processes. The data augmentation method developed in this study is based on the following methods: (1) imitating the various changes that can occur while acquiring images from mono cameras in machine vision systems; (2) extracting structural features of the images, which are the primary purpose of using the mono cameras; and (3) merging them into color images.

The NEU-DET Dataset
The NEU-DET dataset [3], which was collected by Northeastern University, is a dataset for detecting six types of defects on metal surfaces. Object annotations for defect detection are provided, but we used the images as a dataset for image classification in this paper. Each class has 240 images for training and 60 images for validation, and each image is 200 × 200 pixels. Figure 2 shows some samples of the NEU-DET dataset.

Brake Pad Dataset
The machine vision system structure is shown in Figure 3a, and the brake pad image for inspection is shown in Figure 3b. The brake pad image was obtained by using a mono camera with a 2.5-megapixel complementary metal-oxide-semiconductor (CMOS) sensor. The type of product inspected is shown in Figure 4.
The total number of original images was 545, of which 490 images were used for training and 55 for validation. Table 1 shows the number of each object to detect.

The primary defect types are shown in Figure 5. The following procedure can be applied to inspect them:

1. Inspecting the location of the protruding part of the product to check whether the product is loaded in the wrong location, as shown in Figure 5b.
2. Inspecting whether the metal sensor is located correctly in the specified location of the product, as shown in Figure 5a,c.
3. Inspecting whether the riveting is performed correctly to secure the sensor. Figure 5d shows an incorrectly riveted product.
Object detection via CNNs was performed to inspect these defects. The objects for detection are as follows: (1) protruding part, (2) unriveted sensor, and (3) riveted sensor.

Proposed Data Augmentation Method
The proposed data augmentation method was performed in two steps. First, we imitated the characteristics of the camera and extracted the structural features of the inspection target. In this step, all the images after augmentation were one-channel grayscale images. Then, we combined the corresponding images to generate several three-channel color images. Four types of data, including the original data, were prepared to validate the superiority of the proposed data augmentation method.
1. Original images (original).
2. Augmented one-channel grayscale images with original images (one-channel).
3. Images augmented by using the proposed method and then converted to grayscale, with original images (three-channel, gray).
4. Color images augmented by using the proposed method with original images (three-channel, color).
Then neural networks were trained, using each dataset, and their performances were compared.
We used OpenCV in Python for data augmentation, and the implementation proposed in this paper is available in a GitHub repository (github.com/jinfree/GrayscaleImageAugmentation) (accessed on 9 July 2021), under an AGPLv3 license.

One-Channel Augmentation
This section discusses the first of the two steps of data augmentation. We performed one-channel augmentation via four approaches: random pixel noise, brightness adjustment, blur, and edge extraction. Edge extraction is used to extract structural information of the inspection target. The other approaches imitate image changes that can occur when acquiring images from the CMOS camera.

Pixel Noise
As shown in Figure 6, there are two types of image sensors used in machine vision, namely charge-coupled device (CCD) and CMOS. A CCD is a sensor that accumulates and transmits charges generated by using light energy and eventually converts them into electrical signals. CMOS sensors immediately amplify and transmit the charges generated by using light energy into electrical signals. CMOS sensors outperform CCD sensors regarding the number of frames per second, resolution, and power consumption. As a result, the CMOS sensor is used for high-resolution machine vision cameras; however, it has the disadvantage of pixel-level noise, as shown in Figure 7. We performed data augmentation by imitating such pixel noise; the pseudo-code is shown in Algorithm 1.
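Algorithm 1 is not reproduced in this section, so the following is only a minimal sketch of what a pixel-noise augmentation could look like. It uses plain Python lists instead of OpenCV arrays to stay self-contained. The function name `add_pixel_noise`, the noise model (a uniform offset applied to randomly chosen pixels), and the parameters `noise_ratio` and `max_delta` are illustrative assumptions, not values taken from the paper.

```python
import random


def add_pixel_noise(image, noise_ratio=0.05, max_delta=30, seed=None):
    """Return a noisy copy of a grayscale image given as a list of rows.

    Assumptions (not from the paper): each pixel is perturbed with
    probability `noise_ratio` by a uniform offset in [-max_delta, max_delta].
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the input stays unchanged
    for row in noisy:
        for x, value in enumerate(row):
            if rng.random() < noise_ratio:
                delta = rng.randint(-max_delta, max_delta)
                # Clip to the valid 8-bit range.
                row[x] = min(255, max(0, value + delta))
    return noisy
```

Passing a fixed `seed` makes the augmentation reproducible, which can help when comparing training runs.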

Contrast Limited Adaptive Histogram Equalization (CLAHE)
Even if the optical system that inspects the product is configured to minimize the effect of external light sources, the brightness of the captured image sometimes varies because of externally reflected light. Neural networks can be made robust against brightness changes by adjusting the brightness distribution of the dataset. To equalize the brightness distribution, we used the CLAHE algorithm published by Pizer et al. [25]. Figure 8a,b shows the difference in brightness before and after the application of CLAHE, respectively.

Gaussian Blur
If the focus of the lens is not aligned, the image of the inspection target is blurred. To ensure that the CNNs are robust to image blurring resulting from an inexperienced operator's lens manipulation, blur was applied via the Gaussian kernel generated through Equation (1). We applied the GaussianBlur function of OpenCV Python, and ksize = (11, 11), sigmaX = 11, and sigmaY = 11 were used as input factors.
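Equation (1) is not reproduced in this section. The sketch below therefore assumes the standard separable Gaussian, with weights proportional to exp(-(x^2 / (2 sigmaX^2) + y^2 / (2 sigmaY^2))) and normalized to sum to one, and builds the 11 × 11 kernel for the parameters quoted above. The helper names are hypothetical. In OpenCV itself, the blur would be applied with a call such as `cv2.GaussianBlur(img, (11, 11), sigmaX=11, sigmaY=11)`.

```python
import math


def gaussian_kernel_1d(ksize, sigma):
    """Normalized 1D Gaussian weights for an odd kernel size."""
    half = ksize // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_kernel_2d(ksize, sigma_x, sigma_y):
    """Outer product of two 1D kernels (the Gaussian is separable)."""
    kx = gaussian_kernel_1d(ksize, sigma_x)
    ky = gaussian_kernel_1d(ksize, sigma_y)
    return [[wy * wx for wx in kx] for wy in ky]


# Kernel matching ksize = (11, 11), sigmaX = 11, sigmaY = 11.
kernel = gaussian_kernel_2d(11, 11.0, 11.0)
```

With sigma as large as the kernel width, the weights are nearly uniform across the window, so the result is a strong blur.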

Morphological Gradient
Edges are extracted as structural features of the inspection target. Generally, the Canny Edge algorithm [26] is used for edge detection. However, we performed morphological gradient operations to preserve the relative importance of the information while extracting all the structural information from the image under examination in the form of edges. We used the getStructuringElement function to obtain the kernel and morphologyEx to perform the morphological gradient operation. The input parameters of the getStructuringElement function were shape = cv2.MORPH_ELLIPSE and ksize = (11, 11). Additionally, op = cv2.MORPH_GRADIENT was used for the morphologyEx function. Figure 8c shows the results of the morphological gradient operation. Input and output images are both grayscale images.
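To illustrate what the morphological gradient computes, the following sketch implements it directly as dilation minus erosion, that is, the local maximum minus the local minimum in each neighborhood. It is a simplification under stated assumptions: it uses a square window rather than the elliptical 11 × 11 kernel described above, and it ignores out-of-image positions at the border, which may differ from OpenCV's border handling. The function name is hypothetical.

```python
def morphological_gradient(image, ksize=3):
    """Dilation minus erosion over a square ksize x ksize window.

    `image` is a grayscale image given as a list of rows of ints.
    """
    height, width = len(image), len(image[0])
    half = ksize // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Gather the neighborhood, skipping positions outside the image.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - half), min(height, y + half + 1))
                for nx in range(max(0, x - half), min(width, x + half + 1))
            ]
            # Dilation is the local max, erosion the local min.
            result[y][x] = max(window) - min(window)
    return result
```

Flat regions map to zero, while pixels near an intensity step receive the height of that step, which is why the output resembles an edge map that keeps edge strength.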

Three-Channel Augmentation
When images are read with OpenCV, the channel order is Blue-Green-Red. However, other image processing libraries read images in the order Red-Green-Blue. To make the data independent of the library that reads the images, we placed the images processed with the morphological gradient in the Green channel. The Red and Blue channels are filled with combinations of the remaining one-channel augmented images and the original images to create the three-channel images. Figure 9a depicts an example of the order of one-channel images entered into each channel while creating a three-channel image. Figure 9b is a color image combined according to the proposed method. In addition, we prepared the data converted into grayscale images, as shown in Figure 9c, to verify that the networks train well because the three channels carry different features and not merely because of the larger amount of data.
The method used to preprocess one channel and combine it when augmenting to three channels is shown in Table 2. The number of data of the NEU-DET dataset and the brake pad dataset after the augmentation is shown in Tables 3 and 4.

Table 3. Number of datasets after the augmentation, NEU-DET dataset.
Dataset Configuration                            # of Training Datasets   # of Validation Datasets
Original dataset                                 240                      60
Original dataset + one-channel mixed dataset     1200                     300
Original dataset + three-channel mixed dataset   1680                     420

Table 4. Number of datasets after the augmentation, brake pad dataset.
Dataset Configuration                            # of Training Datasets   # of Validation Datasets
Original dataset                                 490                      55
Original dataset + one-channel mixed dataset     2450                     275
Original dataset + three-channel mixed dataset   3430                     385
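The channel merging described in this section can be sketched as follows. Because the morphological gradient sits in the middle channel, it stays in place whether a library interprets the pixel as Blue-Green-Red or Red-Green-Blue. The function names are hypothetical, and which one-channel images go into the first and third channels is defined by Table 2, which is not reproduced here, so the caller's choice of inputs is an assumption.

```python
def merge_channels(first, gradient, third):
    """Stack three equally sized grayscale images into one 3-channel image.

    The morphological-gradient image is placed in the middle (index 1)
    channel; `first` and `third` are any two other one-channel images.
    """
    height, width = len(gradient), len(gradient[0])
    for channel in (first, third):
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("all channels must have the same size")
    return [
        [(first[y][x], gradient[y][x], third[y][x]) for x in range(width)]
        for y in range(height)
    ]


def swap_bgr_rgb(image):
    """Reverse the channel order of every pixel (BGR <-> RGB)."""
    return [[pixel[::-1] for pixel in row] for row in image]
```

Swapping the channel order exchanges the first and third channels only, so the structural information in the middle channel is read identically by either convention.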

Networks
Unlike the ordinary CNN-based computer vision tasks, the machine vision problem has relatively few classes required to be classified or detected. Owing to the small number of classes to be inspected, the accuracy of the relatively simple neural networks is not significantly lower than that of the complex neural networks. It is more economical to increase the inspection speed in the production process at the factory. As a result, we focused on inspection speed, and the neural networks are chosen based on the inference speed in this paper.
The networks were trained on each of the four dataset types by using the same hyperparameters.

Image Classification Networks
Image classification networks were trained by using the NEU-DET dataset. MobileNetV2 by Sandler et al. [27] and ResNet18 by He et al. [28] were fine-tuned via transfer learning on the prepared data. The framework used to train both networks is PyTorch, and the GPU is a GTX 1080Ti.
Hyperparameters used in the training of both neural networks are shown in Table 5.

Object Detection Networks
Object detection networks are of two types, both shown in Figure 10. Figure 10a shows the architecture of the two-stage detector, which involves the following steps: (1) image input, (2) feature extraction, (3) region proposal, and (4) object classification. Although its object detection accuracy is high, its inference speed is relatively slow. Figure 10b shows the structure of the one-stage detector, which goes through the steps of image input, feature extraction, and object detection. It is less accurate than the two-stage detector but faster in inference.
The neural networks trained for object detection are YOLOv4 [29] and YOLOv4-tiny. The framework used to train both networks is Darknet, and the GPU is a GTX 1060; the generalized architecture of YOLOv4 and YOLOv4-tiny is shown in Figure 11.
Hyperparameters used in the training of both neural networks are shown in Table 6.

Image Classification Metrics
Classification accuracy and F1 scores on the validation datasets were used for the evaluation of the trained networks. Classification accuracy is the ratio of correctly classified results to all classification results and is calculated by using Equation (2). The F1 score is the harmonic mean of precision and recall, an indicator that allows a more accurate evaluation of the networks when the data labels are imbalanced. The F1 score is calculated by using Equation (3). Precision and recall are calculated by using Equations (4) and (5), respectively. The definitions of TP, FP, FN, and TN are tabulated in Table 7.
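Equations (2) to (5) are not reproduced in this section. The sketch below assumes the standard definitions: accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). It works on the counts of a single class; how the paper aggregates the scores over the six NEU-DET classes is not stated here, so any averaging is left to the caller. The function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (an assumption
    made here to keep the sketch total).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with 1 true positive, 9 false negatives, and 90 true negatives, accuracy is 0.91 while F1 is only about 0.18, which shows why F1 is the more informative score on imbalanced labels.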

Object Detection Metrics
We use the mean average precision (mAP) as the evaluation metric for object detection neural networks. mAP metrics include mAP@0.5 and mAP@0.5:0.95. mAP@0.5 was used in the Pascal VOC competition by Everingham et al. [30], and mAP@0.5:0.95 is used in the COCO Object Detection competition [24]. Here, mAP@0.5 is the average of the class-wise average precision (AP) at an intersection over union (IoU) threshold of 0.5. Similarly, mAP@0.5:0.95 is the average of the 10 values obtained at IoU thresholds from 0.5 to 0.95 with an interval of 0.05. IoU is the overlap ratio between the object box predicted by the object detection neural network and the ground-truth object box, and is calculated as follows:
$$\mathrm{IoU} = \frac{\mathrm{area}(B_{p} \cap B_{gt})}{\mathrm{area}(B_{p} \cup B_{gt})}$$
where $B_{p}$ is the predicted box and $B_{gt}$ is the ground-truth box. AP is the area under the precision-recall curve. In the object detection task of machine vision, it is essential to locate the object accurately. Therefore, the value of mAP@0.5:0.95 is more important than the value of mAP@0.5 because mAP@0.5:0.95 also evaluates detections at high IoU thresholds, whereas mAP@0.5 does not.
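As a minimal sketch of the quantities described above, the following code computes the IoU of two axis-aligned boxes and averages ten per-threshold mAP values into mAP@0.5:0.95. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for illustration. Computing AP itself requires ranked detections and is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are zero when the boxes are disjoint.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]


def map_50_95(map_per_threshold):
    """Average the mAP values measured at the ten IoU thresholds."""
    if len(map_per_threshold) != len(IOU_THRESHOLDS):
        raise ValueError("expected one mAP value per IoU threshold")
    return sum(map_per_threshold) / len(map_per_threshold)
```

Two 2 × 2 boxes offset by one pixel in each direction overlap in a 1 × 1 square, giving an IoU of 1/7. Such a detection would count as a miss even at the 0.5 threshold.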

Quantitative Results
The image classification networks and object detection networks were trained using four prepared datasets for comparison, including the data augmentation method proposed in this study. The types of datasets are as follows: (1) original dataset, (2) one-channel dataset with the original dataset, (3) three-channel grayscale dataset with the original dataset, and (4) three-channel color dataset with the original dataset.
During the evaluation process, the trained network was validated (1) using the original validation data with the augmented validation data and (2) only using the original validation data.
The networks were trained ten times for each experimental condition to verify the reproducibility and repeatability of each metric. We then report the average, standard deviation, and boxplot of each metric.

Image Classification
We train two neural networks with the NEU-DET dataset to demonstrate that the proposed data augmentation method affects image classification tasks.

MobileNetV2 Results
With augmented validation data and original validation data, the average and standard deviation of the classification accuracy and the F1 score are shown in Table 8. Because the datasets are balanced, the accuracy and F1 score tend to be the same. Moreover, the networks trained with the proposed three-channel augmented color dataset have higher accuracy and lower standard deviation. In the validation results with the original dataset, the average accuracy of the networks trained with the one-channel augmented dataset is higher than that of the networks trained with the three-channel augmented color dataset. However, the networks trained with the three-channel augmented color dataset have a lower standard deviation. Figure 12 shows the boxplots of validation accuracy on each dataset. Figure 12a shows the validation results with the augmented dataset, and Figure 12b shows the validation results with the original dataset. The network trained with the three-channel augmented color dataset shows higher accuracy than the network trained with the original dataset in both validation results. The networks trained with the one-channel augmented dataset do not have higher accuracy in Figure 12a. In Figure 12b they have higher accuracy, but an outlier indicates that consistent performance cannot be assumed. The networks trained with the three-channel augmented grayscale dataset have an outlier in Figure 12a.
By training with the three-channel augmented color dataset, we can assume that the performance of the networks will not fall below the expected performance in both cases.

Resnet18 Results
With augmented validation data and original validation data, the average and standard deviation of the classification accuracy and the F1 score are shown in Table 9. The validation accuracy of ResNet18 is lower than that of MobileNetV2. ResNet18 has more parameters to train than MobileNetV2. Because the networks were trained for only two epochs, the parameters of ResNet18 are likely not yet optimized for classifying the NEU-DET dataset. However, the trend of the validation results across the training datasets can still be confirmed.
Similar to the validation results of MobileNetV2, accuracy and F1 score tend to be the same. Moreover, the average validation accuracy of the networks trained with the three-channel augmented color dataset is higher than other results. Figure 13 shows the boxplots of validation accuracy on each dataset. Figure 13a is the validation results with the augmented dataset, and Figure 13b is the validation results with the original dataset. The network trained by the three-channel augmented color dataset shows good accuracy than the network trained by the original dataset in both validation results.
validation results with the augmented dataset, and Figure 13b is the validation results with the original dataset. The network trained by the three-channel augmented color dataset shows good accuracy than the network trained by the original dataset in both validation results.
The validation results of ResNet18 also show that the networks trained with the three-channel augmented color dataset have high average accuracy.
As a result, the proposed data augmentation method was effective for the image classification task, which uses grayscale images captured by mono cameras for surface inspection.
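Both classification experiments note that accuracy and the F1 score tend to coincide on balanced data. The sketch below illustrates why, assuming the reported F1 score is macro-averaged over classes (the averaging mode is an assumption here); the labels are hypothetical toy data.

```python
def accuracy_and_macro_f1(y_true, y_pred, num_classes):
    """Compute accuracy and the unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / num_classes


# Balanced toy labels: four samples per class, one error per class.
balanced_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
balanced_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 0]
acc_bal, f1_bal = accuracy_and_macro_f1(balanced_true, balanced_pred, 3)

# Imbalanced toy labels: the minority class is never predicted.
imbal_true = [0] * 10 + [1] * 2
imbal_pred = [0] * 12
acc_imb, f1_imb = accuracy_and_macro_f1(imbal_true, imbal_pred, 2)

print(acc_bal, f1_bal)  # the two metrics agree on the balanced data
print(acc_imb, f1_imb)  # macro F1 falls well below accuracy
```

When the classes are balanced and the errors are spread evenly, the two metrics agree; with an imbalanced dataset they can diverge substantially.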

Object Detection
We trained two neural networks with the brake pad dataset to demonstrate that the proposed data augmentation method also benefits object detection tasks.

YOLOv4 Results
We tabulated the average and standard deviation of all mAPs of the trained YOLOv4 networks to determine how the mAP changes with the IoU threshold. Each mAP and the mAP@0.5:0.95 obtained with the augmented data are shown in Table 10, and the corresponding results obtained with the original data are shown in Table 11. All results show the same trend: the mAP decreases as the IoU threshold increases. Table 10 shows that the YOLOv4 networks trained with the three-channel augmented color dataset have high mAPs and mAP@0.5:0.95. In Table 11, however, the YOLOv4 networks trained with the three-channel augmented grayscale dataset have a higher average mAP at some IoU thresholds than the networks trained with the three-channel augmented color dataset. Even so, the YOLOv4 networks trained with the three-channel augmented color dataset have the highest mAP@0.5:0.95.
The boxplot of mAP@0.5:0.95 obtained from both datasets is shown in Figure 14. Figure 14 shows that the YOLOv4 networks achieve better inference performance when trained with the three-channel augmented color dataset than when trained with the other datasets.
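The relationship between the IoU threshold and the mAP can be illustrated with a short sketch. It assumes the common COCO-style convention in which mAP@0.5:0.95 is the mean of the mAPs at IoU thresholds 0.50, 0.55, ..., 0.95, and boxes given as (x1, y1, x2, y2); the function names and the IoU values are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds():
    """Thresholds 0.50, 0.55, ..., 0.95 (assumed COCO-style convention)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def true_positives_at(matched_ious, threshold):
    """Count detections whose IoU with the ground truth meets the threshold."""
    return sum(value >= threshold for value in matched_ious)


def map_50_95(map_by_threshold):
    """Average the per-threshold mAPs into a single mAP@0.5:0.95 value."""
    return sum(map_by_threshold[t] for t in iou_thresholds()) / 10


# Hypothetical IoUs of four detections matched to ground-truth boxes.
matched = [0.92, 0.78, 0.61, 0.55]
tp_counts = [true_positives_at(matched, t) for t in iou_thresholds()]
print(tp_counts)  # the counts never increase as the threshold rises
```

Because a detection accepted at a strict threshold is also accepted at every looser one, the number of true positives, and with it the mAP, cannot increase as the threshold rises.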

YOLOv4-Tiny Results
We also tabulated the average and standard deviation of all mAPs of the trained YOLOv4-tiny networks. Each mAP and the mAP@0.5:0.95 obtained with the augmented data are shown in Table 12, and the corresponding results obtained with the original data are shown in Table 13. Tables 12 and 13 show the same trend: the mAP decreases as the IoU threshold increases. In Table 12, the mAP@0.8 of the YOLOv4-tiny networks trained with the three-channel augmented color dataset is lower than that of the networks trained with the three-channel augmented grayscale dataset. In Table 13, the mAP@0.75 and mAP@0.8 of the YOLOv4-tiny networks trained with the three-channel augmented color dataset are lower than those of the networks trained with the three-channel augmented grayscale dataset. However, in most cases, the YOLOv4-tiny networks trained with the three-channel augmented color dataset have the highest mAP value.
The boxplot of mAP@0.5:0.95 obtained from both datasets is shown in Figure 15. Figure 15 also shows that the YOLOv4-tiny networks achieve outstanding inference performance when trained with the three-channel augmented color dataset.
As a result, the proposed data augmentation method is effective for object detection tasks in surface inspection that use grayscale image data captured by mono cameras.

Discussion
In the experiments performed in this study, the NEU-DET dataset was used to train MobileNetV2 and ResNet18 for the image classification task, and the brake pad dataset was used to train YOLOv4 and YOLOv4-tiny for the object detection task. The results of both tasks show that the proposed data augmentation method effectively trains CNNs for machine vision systems using mono cameras.
The results show that the CNNs trained with the proposed three-channel augmented color dataset perform better than the CNNs trained using the other methods. If the CNNs performed better only because of the larger amount of augmented data, there should be no difference in performance between the CNNs trained with the three-channel augmented color dataset and those trained with the three-channel augmented grayscale dataset. However, the CNNs trained with the three-channel augmented color dataset, in which each channel is preprocessed by a different method, performed better.
The reasons are as follows. (1) We imitate the variations that can occur in images captured with mono cameras: random oscillation of pixel values in CMOS sensors, brightness changes caused by the lighting conditions, and blurring caused by improper lens alignment. Furthermore, we extract the structural information needed to detect surface defects by extracting the edges. In most experiments, the validation results show that the network trained with the one-channel augmented dataset performs better than the network trained with the original dataset. These results imply that data augmentation based on the characteristics of machine vision is effective in training CNNs for surface defect inspection.
(2) When applying transfer learning to typical CNNs, the input of the CNN is assumed to be a color image, and a color image carries different information in each channel. However, machine vision systems using mono cameras produce grayscale images, and existing machine vision studies have trained CNNs by loading these images as color images, so that all three channels carry the same original grayscale information. Burduja et al. [23] trained a CNN with preprocessed color images merged from three grayscale images, each of which extracted different features from one CT image. Based on that work, we train the CNNs by synthesizing the various kinds of information used for inspection in grayscale machine vision images into color images.
With the proposed data augmentation method, the CNNs for surface inspection in machine vision systems using mono cameras can be trained with a small, unbalanced dataset.
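The difference between replicating one grayscale image into three channels and synthesizing a color image from differently preprocessed versions can be sketched as follows. This is only an illustration under stated assumptions: the filter choices, their parameters, and the assignment of operations to channels are hypothetical and are not the exact pipeline of the proposed method.

```python
import numpy as np


def add_noise(gray, rng, sigma=5.0):
    """Imitate random sensor fluctuation with Gaussian noise (assumed model)."""
    noisy = gray.astype(np.float32) + rng.normal(0.0, sigma, gray.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def change_brightness(gray, offset=20):
    """Imitate a lighting change with a constant intensity offset."""
    return np.clip(gray.astype(np.int16) + offset, 0, 255).astype(np.uint8)


def box_blur(gray):
    """Imitate defocus with a 3x3 mean filter and edge padding."""
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float32), 1, mode="edge")
    acc = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return np.clip(np.rint(acc / 9.0), 0, 255).astype(np.uint8)


def edge_magnitude(gray):
    """Extract structural information as a normalized gradient magnitude."""
    gy, gx = np.gradient(gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    peak = mag.max()
    if peak > 0:
        mag = mag * (255.0 / peak)
    return mag.astype(np.uint8)


def replicate_gray(gray):
    """Three-channel grayscale: every channel holds the same information."""
    return np.stack([gray, gray, gray], axis=-1)


def synthesize_color(gray, rng):
    """Three-channel color: each channel holds a different preprocessed view."""
    ch0 = change_brightness(add_noise(gray, rng))  # hypothetical channel assignment
    ch1 = box_blur(gray)
    ch2 = edge_magnitude(gray)
    return np.stack([ch0, ch1, ch2], axis=-1)


# Synthetic 32x32 test image: a bright square on a darker background.
gray = np.full((32, 32), 50, dtype=np.uint8)
gray[8:24, 8:24] = 200
rng = np.random.default_rng(0)
gray3 = replicate_gray(gray)
color = synthesize_color(gray, rng)
print(gray3.shape, color.shape)
```

Both outputs have the shape that a pretrained three-channel CNN expects, but only the synthesized image gives the network a different view of the inspection target in each channel.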

Conclusions
This study proposes a data augmentation method for training high-performance CNNs in machine vision applications using mono cameras. To our knowledge, no prior research has exploited the characteristics of images captured by industrial mono cameras when training CNNs. This work shows that CNN-based machine vision using mono cameras can perform well when trained with three-channel images combined from multiple variations of the original images.
Future work will apply the proposed data augmentation method to defect inspection via instance segmentation and anomaly detection in order to confirm its applicability to these tasks.