Vision-Based Surface Inspection System for Bearing Rollers Using Convolutional Neural Networks

Abstract: Bearings are commonly used machine elements and an important part of mechanical transmission. They are widely used in automobiles, airplanes, and various instruments and equipment. Bearing rollers are the most important components in a bearing and determine the performance, life, and stability of the bearing. In order to control the surface quality of the rollers, a machine vision system for bearing roller surface inspection is proposed. We briefly introduced the design of the machine vision system and then focused on the surface inspection algorithm. We proposed a multi-task convolutional neural network to detect defects. We extracted the features of the defects through a shared convolutional neural network, then classified the defects and calculated the position of the defects simultaneously. Finally, we determined if the bearing roller was qualified according to the position, category, and area of the defect. In addition, we explored various factors affecting performance and conducted a large number of experiments. We compared our method with the traditional methods and proved that our method had good stability and robustness.


Introduction
Bearings are commonly used mechanical components. A bearing's main function is to support mechanical rotation and reduce the friction coefficient during its movement. Since the roller is the most important part of the bearing, its surface quality has a significant impact on the performance and even the life of the bearing. Thus, the surface quality of the roller must be extremely high. Bearing rollers are shown in Figure 1 below.

The main defect categories are as follows:

1. Scratch, as shown in Figure 2a,b. A defect caused by a roller being scratched by other hard objects.
2. Damage, as shown in Figure 2c-f. We describe defects with large areas and irregular shapes as damage.
3. Corrosion, as shown in Figure 2g,h. The defects caused by corrosion.
4. Material lacking at the chamfer, as shown in Figure 2i,j. The roller is sunken at the chamfer, making the contour not a circle.
5. Grind lacking, as shown in Figure 2k. The defects caused by insufficient grinding.
6. Stamp lacking, as shown in Figure 2l. The defects caused by insufficient stamping.

Figure 2a,c,d,g,i,k,l are images of the end surfaces of the roller, and the rest are images of the cylindrical surface of the roller.

These defects have a great influence on the performance and stability of the bearing and must be detected. Visual inspection is a good solution because it greatly reduces the need for manual inspection. At present, visual inspection technology has been used in many scenarios, such as chip pin and circuit solder inspection [1], workpiece vision measurement [2], plastic bottle defect detection [3], metal product surface defect detection [4][5][6], equipment parts identification and classification [7], gear and bearing surface inspection and measurement [8], bearing defect inspection [9], optical character recognition [10], and agricultural product identification [11]. Although visual inspection is widely used, many problems remain in applying it to roller surface inspection. In the actual production process, roller inspection still relies mainly on manual work, and the inspection efficiency and level are relatively low.
Traditional methods used in manufacturing, such as edge detection [12,13], segmentation [14], and line detection [15,16], can hardly extract the internal structures and accurately classify each defect category. Generally, a defect is regarded as a target without distinction, and the difference in reflection between the target and the background is used to separate the two, and then judge whether the bearing roller is qualified according to the position and area of the target. Internal features of defects are not utilized at all. For this reason, it is easy to treat some textures, marks, oil stains, etc. as defects, resulting in a low accuracy and low recall rate of the detection process. Sometimes we need to know exactly how many defect categories exist and calculate the frequency of each defect category in order to properly adjust the production process. And this is not possible for the traditional surface inspection method that is used in manufacturing.
The appearance of deep learning makes up for the disadvantages of traditional algorithms. Since deep learning algorithms have shown state-of-the-art performance in classification and object detection tasks [17], deep neural networks can be utilized to learn the difference between different categories of defects, and to learn the commonality between the same category of defect, from a large amount of data, so that accurate classification can be achieved.
For example, Daniel Weimer et al. explored how convolutional neural network architecture and different hyper-parameter settings affect the feature extraction in industrial inspection [18]. Yiting Li et al. conducted research on the surface defect detection algorithm based on MobileNet-SSD, which proved that defect detection can be achieved using lightweight networks [19]. Xian Tao et al. designed a cascaded autoencoder architecture for segmenting and localizing defects [20], and showed that their method meets the robustness and accuracy requirements for metallic defect detection. Jinhua Lin et al. used a deep convolution neural network to detect defects on castings. They established a convolutional neural network to extract defect features from a suspicious area and, finally, the accuracy of detection was more than 96% [21]. S. Nahavand et al. used intelligent algorithms to detect defects on a metal surface [22]; Xian Tao et al. developed a machine vision device to detect defects on an electrical connector using convolutional neural networks, and they discussed the effects of data augmentation on defect recognition [23]; Yuan et al. used a modified segmentation method and deep neural networks to detect defects on the cover glass of mobile phones, and used GAN to generate new data in order to overcome the drawbacks presented when a huge amount of data is unavailable [24].
This paper introduces a real-time machine vision system for bearing roller surface inspection, which can classify and locate the major categories of defects occurring on the surface of a bearing roller, and determine whether each bearing roller is qualified based on the position, category, and area of the defect. In order to meet industrial requirements, we propose a multi-task convolutional neural network framework for classifying and locating defects simultaneously. The simplified pipeline lays the foundation for future industrial applications. The system can replace the manual inspection, and its performance is better than the traditional algorithms.
Compared with the existing surface inspection research that is based on deep learning, our method can achieve real-time performance because we use a multi-task learning strategy. The classification task is performed simultaneously with the localization task, making the process of the entire model simpler and more efficient. Our system is an entire surface inspection system for bearing roller defect detection and quality evaluation, which has industrial application value.
The rest of the paper is organized as follows: Section 2 introduces the design of the visual inspection system, including the hardware system and software system; Section 3 elaborates on the defect detection method based on the convolutional neural network; Section 4 gives the implementation and results of the experiment; and Section 5 summarizes the whole paper.

System Overview
The visual inspection system mainly consisted of two parts: a hardware system and a software system.
The electrical part of the system was mainly composed of a PLC (Programmable Logic Controller) and an industrial computer. The PLC implemented motion control and digital I/O control. The industrial computer mainly implemented image acquisition, image processing, image analysis, and output. The industrial computer was equipped with an Intel Core i7-6700K CPU, an NVIDIA GTX 1080 GPU, and 128 GB of RAM, and its operating system was Windows 10. The mechanical structure is shown in Figure 3 below. It mainly consisted of a feeding device, a feeding conveyor, a pushing mechanism, four cameras, four ring light sources, a strip light source, a cam, a receiving device, etc.
The bearing roller has two end surfaces and a cylindrical surface, so three workplaces were required for image acquisition. The conveyor conveyed the rollers to workplace 1, workplace 2, and workplace 3 in sequence, and triggered the corresponding image acquisition function. At these three workplaces, we used a total of four industrial cameras. At workplace 1 and workplace 2, the roller was stationary. We used two plane-array cameras, with a resolution of 2448 × 2050, to capture the two end surfaces of the roller. At workplace 3, the rollers began to roll under the action of the mechanism. We used two line-array cameras, with a resolution of 4K, to capture the cylindrical surface. As the cylindrical surface is the working surface of the bearing roller, we used two line-array cameras to prevent defects from being missed due to the rolling of the roller. The selection of the cameras was determined by the working distance and image definition requirements.
Visual inspection has strict requirements on illumination, and stable illumination can ensure the stability of the image quality. For defect features, it is important to choose a targeted light source. We set up two ring light sources at workplace 1: a high-angle light source and a low-angle light source. The two ring light sources were arranged one in front of the other. Since the end surface of the bearing roller contains planes and chamfers, it is not possible to illuminate both parts with only one light source, so we used two light sources to simultaneously illuminate the chamfer and the plane of the roller. The low-angle light source in front was responsible for illuminating the chamfer, and the high-angle light source behind was responsible for illuminating the plane. The light source setting at workplace 2 was the same as at workplace 1. At workplace 3, we used a strip light source to illuminate the cylindrical surface.
The software system was programmed in C# and C++. The user interface was written in C#, and the underlying algorithms were implemented in C++. The defect detection algorithm was developed using the PyTorch deep learning computing platform. Commonly used image processing algorithms, such as threshold segmentation and morphological processing, were implemented using OpenCV.

Surface Inspection Process
Bearing rollers have two end surfaces and a cylindrical surface. Since the cylindrical surface is the working surface of the bearing roller, a roller must be judged as unqualified if there are defects on it. If defects occur on the outer circumference of the end surfaces, such as material lacking at the chamfer and stamp lacking, they also affect the working surface, and the roller must also be judged as unqualified. For the defects inside the end surfaces, we can calculate the defect area to determine whether the roller is qualified.
Because the material lacking at the chamfer, represented by Figure 2i,j above, and the stamp lacking, represented by Figure 2l, can be first detected and excluded in the inspection process described below, our detection algorithm primarily detected and analyzed four categories of defects, which were damage, scratch, corrosion, and grind lacking. Details of these defect categories are shown in Figure 4 below. Defects other than those mentioned above are not discussed because of their low frequency of occurrence.
Image acquisition was performed at a suitable working distance. For each bearing roller, a total of two images of the end surfaces were captured, and each image was cropped to a resolution of 416 × 416. For the cylindrical surface of each bearing roller, two images were also captured, and their resolution was likewise 416 × 416 after cropping.
We note that, although the shapes of the same defect category are different, there are similarities in features that can be extracted and classified by convolutional neural networks. In this section, we will describe in detail the method for identifying various defects on bearing rollers. The completed process pipeline is shown in Figure 5 below.
The process consisted of the following three stages: First, contour detection. It is used to determine if the outer contour of the end surface is a standard circle and exclude the roller with a non-circular contour. Second, defect detection. It uses a multi-task learning convolutional neural network to classify and locate defects. Third, roller quality evaluation. It is used to determine whether the bearing roller is qualified according to the position, category, and area of the defect.

Contour Detection
In this part, we fitted the outer contour of the end surfaces of the roller by using the Hough transform [25]. The pipeline can be seen in Figure 6 below. We performed the Hough circle detection 10 times for each end surface, and then took the average of the radii and the average of the center coordinates as the actual radius and center coordinates of the outer contour of the end surface. Then we used the Canny algorithm to extract the outer contour and calculated the standard deviation of the distance between the actual center coordinates and all points on the contour. The formulas were as follows:

$$x_s = \frac{1}{10}\sum_{i=1}^{10} x_{si}, \qquad y_s = \frac{1}{10}\sum_{i=1}^{10} y_{si}, \qquad r_s = \frac{1}{10}\sum_{i=1}^{10} r_{si}$$

$$d_j = \sqrt{(x_j - x_s)^2 + (y_j - y_s)^2}$$

$$std = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(d_j - \bar{d}\right)^2}$$

where $(x_{si}, y_{si})$ and $r_{si}$ are the center coordinates and the radius of the $i$-th circle detected by Hough circle detection, $(x_s, y_s)$ and $r_s$ are the actual center coordinates and the actual radius of the outer contour of the end surface, $(x_j, y_j)$ is the $j$-th of the $n$ points on the contour, $d_j$ is the distance between that point and the coordinate $(x_s, y_s)$, $\bar{d}$ is the mean of the $d_j$, and $std$ is the standard deviation of $d$.
If the std was less than the set threshold (set to 0.4 by experiment), the outer contour of the current end surface was considered a circle, and the sample was sent into the shared convolutional neural network to extract a feature map for defect classification and localization. Otherwise, if there was a defect at the contour of the end surface and the outer contour was not a circle, the bearing roller was judged as unqualified.
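The circularity test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the Hough circles and the Canny contour points have already been obtained (for example with OpenCV), the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which normalization was used.

```python
import math
from statistics import mean, pstdev


def average_circle(circles):
    """Average the (x, y, r) results of repeated Hough circle detections."""
    xs, ys, rs = zip(*circles)
    return mean(xs), mean(ys), mean(rs)


def contour_distance_std(contour_points, center):
    """Standard deviation of the distances from contour points to the center.

    Assumption: population standard deviation (divide by n).
    """
    cx, cy = center
    distances = [math.hypot(px - cx, py - cy) for px, py in contour_points]
    return pstdev(distances)


def is_circular(circles, contour_points, threshold=0.4):
    """Return True when the contour is close enough to a circle.

    The default threshold of 0.4 is the value reported in the text.
    """
    x_s, y_s, _r_s = average_circle(circles)
    return contour_distance_std(contour_points, (x_s, y_s)) < threshold
```

For points sampled from an ideal circle the standard deviation is close to zero, so the end surface passes; a contour with a sunken segment produces a much larger spread of distances and is rejected.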

Features Extraction Using CNN
In this part, we designed a 26-layer convolutional neural network for feature extraction. The design of this network draws on VGG [26] and ResNet [27]. Firstly, we used small convolution kernels, instead of large convolution kernels, in order to reduce the computation and increase the network depth as well as the nonlinear mapping, so that the model's data-fitting ability would be stronger. Secondly, we used 1 × 1 convolution kernels to compress the output of the 3 × 3 convolution kernels and thereby reduce the computation of the network. Finally, following ResNet, we added shortcuts to the network in order to alleviate the vanishing gradient problem during training. The structure is shown in Table 1 below. We pre-trained the network on the ImageNet dataset [28] to improve its generalization capability.
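The savings from small kernels and 1 × 1 compression can be checked with simple arithmetic. The sketch below counts convolution weights only (biases ignored); the channel width of 256 is a hypothetical value chosen for illustration and is not taken from Table 1.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weight count of a 2-D convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 256  # hypothetical width, not taken from the paper

# Two stacked 3x3 layers cover the same 5x5 receptive field as one 5x5 layer,
# but with fewer weights and one extra nonlinearity.
single_5x5 = conv_params(5, channels, channels)
stacked_3x3 = 2 * conv_params(3, channels, channels)

# Compressing 2*channels feature maps back to channels with a 1x1 kernel
# costs one ninth of the weights a 3x3 kernel would need for the same mapping.
compress_1x1 = conv_params(1, 2 * channels, channels)
compress_3x3 = conv_params(3, 2 * channels, channels)

print(single_5x5, stacked_3x3, compress_1x1, compress_3x3)
```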

Defect Classification and Localization
We classified the defects and calculated the position of the defects based on the feature map extracted by the CNN. We used a multi-task CNN architecture to unify classification and localization in order to simplify the entire inspection process. The loss function of the entire CNN was a linearly weighted sum of the loss function of the classification task and the loss function of the localization task, as shown below:

$$L = L_{cls} + \alpha L_{loc}$$

where $L_{cls}$ is the loss function of the classification task, $L_{loc}$ is the loss function of the localization task, and $\alpha$ is the weight of $L_{loc}$.

Classification
The feature map was extracted by the convolutional neural network, and the dimension of the feature map was 13 × 13 × 1024. Each position of the 13 × 13 grid represented a specific area in the original image. We followed the Single Shot MultiBox Detector (SSD) [29] and Faster R-CNN [30] and associated 6 anchor boxes with each location of the feature map. Each anchor box was responsible for predicting whether there was a defect at the position or not. If there was a defect, it would then predict the defect category and calculate the probability of the defect belonging to a certain defect category. In this paper, there were four categories of defects. The loss function of the classification task was defined in terms of the following quantities: $N$ is the total number of anchor boxes, $i$ refers to the anchor box index, $j$ refers to the ground-truth box index, $p$ refers to the category index, and 0 represents the background. $x_{ij}^{p} = 1$ when the $i$-th anchor box and the $j$-th ground-truth box match for category $p$, otherwise $x_{ij}^{p} = 0$. $c_{i}^{p}$ indicates the predicted probability of category $p$ for the $i$-th anchor box.
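One form consistent with these definitions is the confidence loss of SSD [29], normalized by the number of anchor boxes. It is given here only as a plausible sketch; the exact expression used by the authors may differ. The sets $Pos$ and $Neg$ (anchor boxes matched to a defect and anchor boxes assigned to the background) are notation assumed for this sketch.

```latex
L_{cls} = -\frac{1}{N}\left( \sum_{i \in Pos} x_{ij}^{p} \log c_{i}^{p}
          + \sum_{i \in Neg} \log c_{i}^{0} \right)
```

With a 13 × 13 feature map and 6 anchor boxes per location, such a sum would run over 13 × 13 × 6 = 1014 anchor boxes per image.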

Localization
If there was a defect in the current position, we calculated the IoU (intersection over union) of each anchor box with the ground-truth box, and removed the anchor boxes whose IoU was smaller than the set threshold by non-maximum suppression, leaving the anchor boxes whose IoU was larger than the set threshold. The boxes left were our predicted boxes. IoU was defined as:

$$IoU = \frac{area(GT \cap PB)}{area(GT \cup PB)}$$

where $GT$ is the ground-truth box and $PB$ is the predicted box. Each predicted box contained four predicted values, which were the center coordinates (x, y) of the box, and the length and width of the box. Through continuous iteration, the loss was gradually reduced, and the position of the predicted box steadily approached that of the ground-truth box. The loss function of the localization task was defined in terms of the following quantities: $L_{reg}$ is the Smooth L1 loss, $N$ is the total number of anchor boxes, and $x_{ij}^{p} = 1$ when the $i$-th anchor box and the $j$-th ground-truth box match for category $p$, otherwise $x_{ij}^{p} = 0$. $t_i$ is a four-dimensional vector that represents the position of the predicted box, and $t_i^{*}$ is a four-dimensional vector that represents the position of the ground-truth box.
In the parameterization of these vectors, $x$ and $y$ denote the box's center coordinates, and $w$ and $h$ denote its width and height, respectively. Variables $x$, $x_a$, and $x^{*}$ are for the predicted box, anchor box, and ground-truth box, respectively (likewise for $y$, $w$, and $h$).
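The localization quantities described above can be illustrated with a short Python sketch. The IoU function follows the definition in the text. The offset encoding is the Faster R-CNN parameterization [30], which is assumed here because the text uses the same predicted, anchor, and ground-truth variables; the paper's exact formula may differ. The Smooth L1 function uses the common transition point of 1, which is also an assumption, and all function names are hypothetical.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def encode(box, anchor):
    """Offsets of a (cx, cy, w, h) box relative to a (cx, cy, w, h) anchor.

    Assumed Faster R-CNN style: center shifts scaled by the anchor size,
    width and height as log ratios.
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def smooth_l1(diff):
    """Smooth L1 penalty: quadratic near zero, linear for large errors."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def regression_loss(t_pred, t_true):
    """Sum of Smooth L1 penalties over the four box offsets."""
    return sum(smooth_l1(p - q) for p, q in zip(t_pred, t_true))
```

For example, two 2 × 2 boxes that overlap in a 1 × 1 square share 1 unit of area out of 7 in their union, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.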

Roller Quality Evaluation
For defects that occurred on the cylindrical surface, no matter which kind of defect it was and what the defect area was, the bearing roller was judged as unqualified. For defects that occurred on the end surfaces, step 3.1, described above, had already excluded defects, such as material lacking at the chamfer and stamp lacking, that caused the outer contour to not be circular in shape. For corrosion, scratch, damage, and grind lacking defects, the bearing roller was judged based on the defect area. The defects with bounding boxes were equivalent to the ROIs (Regions of Interest), and the ROIs were analyzed separately using the image processing method. Accordingly, we calculated the defect area on each end surface separately. The process is shown in Figure 7 below.
Different defects have different impacts on the performance of the roller. Damage has the greatest impact on the performance, followed by scratch, corrosion, and grind lacking. Our surface inspection system had different tolerances for different defects; therefore, we defined four coefficients for the four defects. When calculating the total defect area, it was necessary to multiply the area of the different defects by the corresponding coefficient. For damage, scratch, corrosion, and grind lacking, the coefficients were defined as 3, 1.5, 1, and 0.8, respectively. The coefficients were defined by multiple experiments based on the inspection effect, and different coefficients could be defined according to different situations.
After performing median filtering, Otsu thresholding [31], and morphological processing on the ROIs, defects were segmented from the background, and then we calculated the total area of all the defects. If the total defect area was greater than the set threshold, which was about 5% of the end surface area, the bearing roller would be judged as unqualified. The roller would be judged as qualified only when the defect area of each end surface was smaller than the threshold.
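The decision rule of this section can be summarized in a minimal Python sketch. The coefficients and the 5% ratio are the values reported in the text; the data layout and function names are hypothetical, and the areas are assumed to come from the segmentation step described above.

```python
# Coefficients reported in the text for the four defect categories.
COEFFICIENTS = {
    "damage": 3.0,
    "scratch": 1.5,
    "corrosion": 1.0,
    "grind_lacking": 0.8,
}


def weighted_defect_area(defects):
    """Total defect area, each area scaled by its category coefficient.

    defects: iterable of (category, segmented_area_in_pixels) pairs.
    """
    return sum(COEFFICIENTS[category] * area for category, area in defects)


def end_surface_ok(defects, surface_area, max_ratio=0.05):
    """An end surface passes when the weighted area stays below the threshold."""
    return weighted_defect_area(defects) < max_ratio * surface_area


def roller_qualified(end_surfaces, cylinder_defect_count):
    """Any cylindrical-surface defect fails the roller; both ends must pass.

    end_surfaces: list of (defects, surface_area) pairs, one per end surface.
    """
    if cylinder_defect_count > 0:
        return False
    return all(end_surface_ok(defects, area) for defects, area in end_surfaces)
```

With an end surface of 10,000 pixels, a 100-pixel scratch and a 50-pixel damage region give a weighted area of 300, below the 500-pixel threshold, so that surface passes; a 200-pixel damage region instead gives 750 and fails.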

Data Augmentation
Both classification and localization depend on the CNN model, and a deep CNN model over-fits easily because of its powerful fitting ability, especially when the amount of data is small. For bearing rollers, the probability of occurrence of defects is relatively low, and the amount of data that can be collected is relatively small, so it is necessary to appropriately augment the original data. We adopted several commonly used augmentation methods, including image rotation, image flipping, image cropping, adding blur, and adding noise. The augmentation results are shown in Figure 8 below.
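Because the network also predicts defect positions, geometric augmentations have to be applied to the bounding boxes as well as to the pixels. The sketch below shows this for horizontal flipping only, using plain nested lists in place of an image array. The box convention (x_min inclusive, x_max exclusive) and the function names are assumptions made for this example; the text does not describe how the labels were transformed.

```python
def hflip_image(image):
    """Mirror an image, given as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def hflip_box(box, image_width):
    """Mirror a box (x_min, y_min, x_max, y_max) with exclusive x_max."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


# A 2 x 4 image whose right half (value 1) stands in for a defect region.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
box = (2, 0, 4, 2)

flipped_image = hflip_image(image)
flipped_box = hflip_box(box, image_width=4)
```

After the flip the marked region sits in the left half of the image, and the transformed box still encloses exactly those pixels.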

Dataset Description
Our dataset was collected from bearing rollers of different sizes on the production line. There were 3200 images in the dataset, and each sample contained one or more defects. The specific quantities are shown in Table 2 below. The images were downsampled to match the input size of 416 × 416. We shuffled the data and then divided it into three parts: 70% as the training set, 15% as the validation set, and 15% as the test set. We shuffled the data before splitting so that all three parts followed the same data distribution. The training set was used for model training, the validation set was used for selecting the model hyper-parameters, and the test set was used for evaluating the model performance. All three sets were strictly labeled manually.
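The 70/15/15 split can be illustrated with a short sketch. The function name, the fixed random seed, and the use of integer indices in place of image files are assumptions made for this example only.

```python
import random


def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into train/validation/test parts.

    The test part receives whatever remains after the train and
    validation parts have been taken.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumption)
    n_train = int(round(train_frac * len(items)))
    n_val = int(round(val_frac * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 3200 image indices stand in for the 3200 images of the dataset.
train_set, val_set, test_set = split_dataset(range(3200))
```

For 3200 images, this yields 2240 training, 480 validation, and 480 test samples. Note that plain shuffling only balances the class proportions on average; a stratified split would be one way to enforce them exactly.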


Evaluation Indicators
In the following experiments, we quantitatively evaluated the performance of the defect detection algorithm and the performance of the entire surface inspection system. For the defect detection algorithm, we used mAP (Mean Average Precision) to evaluate its performance and detection time to evaluate its efficiency. We also compared the multi-class classification performance of our algorithm with that of the pattern recognition algorithms, using the micro F1 score. For the entire surface inspection system, we used the F1 score to evaluate its performance. The formulas for calculating the F1 score were as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP represents the number of positive samples that are judged to be positive samples, FP represents the number of negative samples that are judged to be positive samples, and FN represents the number of positive samples that are judged to be negative samples. The formulas for calculating the micro F1 score were as follows:
$$\mathrm{Precision}_{\mathrm{micro}} = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i}$$
$$\mathrm{Recall}_{\mathrm{micro}} = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i}$$
$$F_{1,\mathrm{micro}} = \frac{2 \times \mathrm{Precision}_{\mathrm{micro}} \times \mathrm{Recall}_{\mathrm{micro}}}{\mathrm{Precision}_{\mathrm{micro}} + \mathrm{Recall}_{\mathrm{micro}}}$$
where i represents the i-th category of defect, and TP_i, FP_i, and FN_i are the counts defined above for category i.
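As a worked illustration of these indicators, the sketch below computes the F1 score and the micro F1 score from TP, FP, and FN counts using their standard definitions. The function names and the convention of returning 0 when a denominator is zero are assumptions of this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    A ratio with a zero denominator is reported as 0.0 (assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_f1(per_class_counts):
    """Micro F1: pool TP, FP, and FN over all categories, then compute F1.

    per_class_counts: iterable of (tp_i, fp_i, fn_i), one tuple per category.
    """
    counts = list(per_class_counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return precision_recall_f1(tp, fp, fn)[2]
```

Because the counts are pooled before the ratios are formed, categories with many samples weigh more heavily in the micro F1 score than rare ones.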

Performance of the Defect Detection Algorithm under Different Settings
The defect detection results are shown in Figure 9 below. The red box belongs to damage, the green box belongs to scratch, the yellow box belongs to grind lacking, and the blue box belongs to corrosion. The category of the defect and the probability are displayed above the box.


Influence of Different α on Performance
We used cross-validation to select the appropriate α. Table 3 gives the results of the task under different α. It can be seen from the table that the best score was achieved when α = 1.05. The AP (Average Precision) of each category when α = 1.05 is shown in Table 4. The detection results for different α are shown in Figure 10 below, where the yellow boxes represent the ground-truth labels. The figure only shows the detection results for α = 0.8, α = 0.9, α = 1.05, and α = 1.2. It can be seen from the figure that when α = 0.8, α = 0.9, or α = 1.2, the detection results deviated from the ground-truth label, most noticeably when α = 0.8, whereas the result was more accurate when α = 1.05.



Influence of Data Augmentation on Performance
We tried a variety of data augmentation strategies and obtained the best results with the following settings: each sample had a 20% chance of being rotated by a specified angle (60°, 120°, 180°, 240°, or 300°), a 50% chance of being flipped, a 5% chance of having Gaussian noise added, a 5% chance of having blur added, and a 30% chance of being center cropped. The results are shown in Table 5 below. The APs for each defect category when using the best data augmentation method are shown in Table 6 below.
We also compared the influence of different resolutions on the detection performance. The results are shown in Table 7 below. It can be seen from the results that increasing the resolution had a significant impact on both the mAP and the detection time. As the resolution increased, the mAP increased, but the detection time also increased, because a higher resolution leads to more computation. Therefore, it is necessary to select an appropriate resolution according to actual needs.
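The augmentation probabilities listed above can be expressed as a small sampling policy. The sketch below only decides which operations to apply to a sample; the image operations themselves are left out. It assumes that the five operations are drawn independently and that the rotation angle is chosen uniformly, neither of which is stated in the text, and all names are hypothetical.

```python
import random

ROTATION_ANGLES = (60, 120, 180, 240, 300)  # degrees, as listed in the text


def sample_augmentations(rng):
    """Return the list of (operation, parameter) pairs to apply to one sample."""
    ops = []
    if rng.random() < 0.20:  # 20% chance of rotation by a specified angle
        ops.append(("rotate", rng.choice(ROTATION_ANGLES)))
    if rng.random() < 0.50:  # 50% chance of flipping
        ops.append(("flip", None))
    if rng.random() < 0.05:  # 5% chance of adding Gaussian noise
        ops.append(("gaussian_noise", None))
    if rng.random() < 0.05:  # 5% chance of adding blur
        ops.append(("blur", None))
    if rng.random() < 0.30:  # 30% chance of center cropping
        ops.append(("center_crop", None))
    return ops


rng = random.Random(0)  # seeded for reproducibility (assumption)
plans = [sample_augmentations(rng) for _ in range(10000)]
```

Each entry of `plans` is the augmentation recipe for one training sample; an empty list means the sample is used unchanged.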

Influence of Model Pre-Training on Performance
Inspired by transfer learning [32][33][34][35], we pre-trained our CNN model on the ImageNet dataset and compared it with the same model trained without pre-training. The results are shown in Table 8 below. It can be concluded from the results that the pre-trained model had a better generalization ability, which improved the mAP.

Influence of Different Base Networks on Performance
We compared our network with Resnet-50, VGG-19, and MobileNet [36]. The results are shown in Table 9 below, and the detection results are shown in Figure 11 below. As can be seen from the table, the best mAP was achieved using Resnet-50, but processing an image was more time consuming. VGG-19 achieved a mAP of 83.86% but took even longer to process a single image. MobileNet had a fairly high processing efficiency, but its mAP was the lowest among all the base networks. Our network achieved a better balance between the mAP and processing efficiency because it has fewer parameters and requires less computation. Our mAP was close to that of Resnet-50, and our detection time was much shorter than that of Resnet-50 and VGG-19.
It can be seen from Figure 11 below that the detection results using MobileNet deviated from the ground-truth label the most. Using our CNN model, the Resnet-50, or the VGG-19 as the feature extractor was more accurate.

Influence of Different Factors on Performance
We summarized all the influencing factors, as shown in Table 10 below. We got the best results when using more image augmentation, higher resolution, and the pre-trained model.


Comparison between Pattern Recognition Methods and Our Method
To evaluate the performance of the classification module of our method, we compared the accuracy of defect classification between our method and traditional methods whose code is publicly available.
(1) GLCM (Grey-Level Co-Occurrence Matrices) [37]: The GLCM feature is a common way of describing texture by studying the spatial correlation properties of the gray levels. The texture features used here are a combination of energy, contrast, entropy, and correlation.
(2) HOG (Histogram of Oriented Gradients) [38]: The HOG feature is a feature descriptor used for object detection in image processing. The algorithm first divides the image into small connected regions called cell units. It then collects the gradient or edge direction of each pixel in a cell unit to form a histogram. Finally, these histograms are combined to form the feature descriptor.
After obtaining the features described above, we used the SVM (Support Vector Machine) and the MLP (Multi-layer Perceptron) to classify them. The MLP was a two-layer neural network consisting of a hidden layer and an output layer. The hidden layer had 15 hidden units and the output layer had 4 output units. We evaluated the performance of the defect classifier quantitatively using the micro F1 score, which was introduced in Section 4.1.2.
The results are shown in Table 11. It can be seen from the table that the traditional methods could only achieve a micro F1 score of about 70, whereas our method achieved a score of over 90 in the classification task. That was because we used deep convolutional neural networks to learn the internal features of the defects, which had a positive impact on the classification task.
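To make the GLCM baseline more concrete, the sketch below computes a co-occurrence matrix and the four texture features named in this section for a small grayscale image. It is an illustration rather than the code used in the comparison: the single horizontal pixel offset, the symmetric normalization, the definition of energy as the sum of squared matrix entries, and the value returned for the correlation of a constant image are all assumptions of this example.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric grey-level co-occurrence matrix.

    image: 2D list of integer grey levels in [0, levels), at least 2 pixels wide.
    (dx, dy): pixel offset defining which pairs of pixels are counted.
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1  # count the pair in both directions
                counts[j][i] += 1  # so that the matrix is symmetric
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def glcm_features(p):
    """Energy, contrast, entropy, and correlation of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    var_i = sum((i - mu_i) ** 2 * v for i, _, v in cells)
    var_j = sum((j - mu_j) ** 2 * v for _, j, v in cells)
    if var_i > 0 and var_j > 0:
        cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
        correlation = cov / math.sqrt(var_i * var_j)
    else:
        correlation = 1.0  # constant image: value fixed by convention (assumption)
    return {"energy": energy, "contrast": contrast,
            "entropy": entropy, "correlation": correlation}


# A 2 x 2 checkerboard: horizontally adjacent pixels always differ.
features = glcm_features(glcm([[0, 1], [1, 0]], levels=2))
```

The resulting four values form the feature vector that a classifier such as the SVM or the MLP described above would take as input.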

Performance of the Surface Inspection System
As the detection of a defect does not mean that a bearing roller fails, it is necessary to determine whether the roller is qualified according to the category, position, and area of the defect. In the following experiments, we inspected bearing rollers of three different sizes. We used the F1 score to evaluate the performance of the entire bearing roller surface inspection system. We obtained 1800 bearing rollers from the production line by manual screening, 600 for each size, including 300 qualified products (positive) and 300 unqualified products (negative). We then used our surface inspection system to test these bearing rollers and measured the precision and recall rate to calculate the F1 score. The F1 score was introduced in Section 4.1.2.
To evaluate the actual performance of our surface inspection system, we compared our approach with the traditional method currently used on the production line. The traditional method captured the images and adjusted the resolution to 500 × 500, performed median filtering, divided the image into ROIs, and then applied threshold segmentation [39] and morphological processing within the ROIs to separate the defects from the background. Finally, it determined whether the bearing roller was qualified by checking whether the defect area exceeded the set threshold. The results of the comparison experiment are shown in Table 12 below. It can be seen from the results that the precision and recall rate of the traditional method were lower than those of our method, and the recall rate in particular was very low. The main reason is that traditional methods can easily misjudge non-defects (e.g., textures, oil stains, and marks) as defects, so that some qualified products are misjudged as unqualified, resulting in a low precision and a low recall rate. The precision and recall rate of our method were higher because our method classifies defects well.

Conclusions
In this paper, we proposed a machine vision system for bearing roller surface inspection. In order to control the quality of the product, a multi-task convolutional neural network was designed to detect the defects. The features of the defects were extracted through the shared convolutional neural network, and then the defects were classified and their positions were calculated simultaneously. Finally, we determined whether the bearing roller was qualified based on the position, category, and area of the defects. We conducted a large number of experiments and compared our method with the traditional surface inspection methods used in manufacturing. The quantitative experimental results showed that our method was superior in accuracy and robustness and met the requirements of industrial manufacturing.
The limitation of our proposed approach is that deep learning requires a large amount of labeled data and depends on the performance of the hardware. In the future, we will continue to optimize the algorithm and network structure to reduce the computational cost and thus allow the system to be widely used in industrial manufacturing. We will also try to use semi-supervised learning or GANs (Generative Adversarial Networks) to generate new data to address the problem of insufficient data.