Research on a Surface Defect Detection Algorithm Based on MobileNet-SSD

Abstract: This paper aims to achieve real-time and accurate detection of surface defects by using a deep learning method. For this purpose, the Single Shot MultiBox Detector (SSD) network was adopted as the meta structure and combined with the base convolution neural network (CNN) MobileNet into the MobileNet-SSD. Then, a detection method for surface defects was proposed based on the MobileNet-SSD. Specifically, the structure of the SSD was optimized without sacrificing its accuracy, and the network structure and parameters were adjusted to streamline the detection model. The proposed method was applied to the detection of typical defects like breaches, dents, burrs and abrasions on the sealing surface of a container in the filling line. The results show that our method can automatically detect surface defects more accurately and rapidly than lightweight network methods and traditional machine learning methods. The research results shed new light on defect detection in actual industrial scenarios.


Introduction
Intellisense and pattern recognition technologies have made progress in robotics [1][2][3], computer engineering [4,5], health-related issues [6], natural sciences [7] and industrial academic areas [8,9]. Among them, computer vision technology is developing particularly quickly. It mainly uses binary cameras, digital cameras, depth cameras and charge-coupled device (CCD) cameras to collect target images, extract features and establish corresponding mathematical models, and to complete the processing of target recognition, tracking and measurement. For example, Kamal et al. comprehensively consider the continuity and constraints of human motion: after contour extraction of the acquired depth image data, a Hidden Markov Model (HMM) is used to identify human activity. This system is highly accurate in recognition and can effectively deal with rotation and incompleteness of the body [10]. Jalal et al. use texture and shape vectors to reduce feature vectors and extract important features in facial recognition through density matching scores and boundary fixation, so as to manage the key processing steps of face activity (recognition accuracy, recognition speed and security) [11]. In [12], vehicle damage is classified by a deep learning method, and the recognition accuracy on a small dataset reached 89.5% through the introduction of transfer learning and an integrated learning method, providing a new way to automate the processing of vehicle insurance claims. Zhang et al. combine the four features of color, temporal motion, gradient norm and residual motion to identify the hand position in each frame of a video; the method uses a weighted linear combination to evaluate different combinations of these features and establishes a precise hand detector [13]. With the continuous improvement of computer hardware and the deepening of research on complex image classification, the application prospects of computer vision technology will become increasingly broad.

Image Pre-Processing
There are two purposes of image pre-processing: one is to enhance image quality to ensure sharp contrast and filter out noise, such as adopting histogram equalization and gray scale transformation to enhance contrast, or adopting methods such as median filtering and adaptive filtering to remove noise. The second is to segment the image to facilitate subsequent feature extraction, such as threshold segmentation, edge segmentation and region segmentation. In this paper, data enhancement and defect area planning were carried out in view of the features of a container mouth in a filling line, such as stability, monotonous shape and a limited number of images. The pre-processing flow is illustrated in Figure 1.

Data Enhancement
The defect images on the sealing surface of a container in the filling line were collected by a CCD camera. A total of 400 images were taken, covering defects like breaches, dents, burrs and abrasions. The images were converted to the size of 300 × 300 as the training inputs. Such a size reduces the computing load in training without losing too much information from the images. Then, data expansion was performed to increase the number of images and prepare the dataset for K-fold cross-validation. During data expansion, the following actions were targeted: rotation, horizontal migration, vertical migration, scaling, tangential transformation and horizontal flip. In this way, the CNN could learn more invariant image features and over-fitting could be prevented. The implementation method of data expansion is shown in Table 1.

Table 1. The implementation method of data expansion.

Data expansion mode | Changes
Rotation | Random rotation of images from 0° to 10°
Horizontal migration | Horizontal shift of images; the offset value is 10% of the image length
Vertical migration | Vertical shift of images; the offset value is 10% of the image length
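As a sketch of how the migration (shift) and flip operations in Table 1 can be realized, the following NumPy snippet implements zero-padded shifts and a horizontal flip. The function names and the example values are illustrative, not the authors' implementation; rotation would typically be done with a library routine.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-right."""
    return img[:, ::-1]

def shift(img, dy, dx):
    """Shift an image by (dy, dx) pixels, filling vacated areas with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src_y = slice(max(-dy, 0), h - max(dy, 0))
    src_x = slice(max(-dx, 0), w - max(dx, 0))
    dst_y = slice(max(dy, 0), h - max(-dy, 0))
    dst_x = slice(max(dx, 0), w - max(-dx, 0))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

# A 10% offset on a 300-pixel-wide image would be 30 pixels.
img = np.arange(9).reshape(3, 3)
shifted = shift(img, 0, 1)      # shift right by one column
flipped = horizontal_flip(img)
```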

The K-fold cross-validation method divides the expanded dataset C into K discrete subsets. During network training, one subset was selected as the test set, while the remaining (K − 1) subsets were combined into the training set. Each training run outputs the classification accuracy of the network model on the selected test set. The same process was repeated K times and the accuracies were averaged to obtain the true accuracy of the model.
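The K-fold split described above can be sketched with plain index arithmetic; the helper below is illustrative, not the authors' code.

```python
def kfold_indices(n, k):
    """Split range(n) into k folds and yield (train, test) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# With e.g. 400 images split into K = 5 folds, each run trains on 4/5
# of the data and tests on the remaining 1/5.
folds = list(kfold_indices(400, 5))
```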

Defect Area Planning
The defect samples contain lots of useless background information that may affect the recognition quality of the detection algorithms. Our defect samples carry the following feature: the defects concentrate on the sealing surface, whose round shape remains unchanged. In view of this, Hough circle detection was used in the pre-processing phase to locate the edge of the cover and mitigate the impact of useless background on the recognition accuracy.

The Hough circle transform begins with the extraction of edge location and direction with a Canny operator. The Canny operation involves six steps: RGB-to-gray conversion, Gaussian filtering, gradient and direction computation, non-maximum suppression (NMS), double threshold selection and edge detection. Among them, Gaussian filtering is realized by convolution with a two-dimensional (2D) Gaussian kernel:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

The purpose of Gaussian filtering is to smooth the image and remove as much noise as possible. Then, the Sobel operator was introduced to obtain the gradient amplitude and its direction, which can enhance the image and highlight the points with significant changes in neighboring pixels. The operator contains two groups of 3 × 3 kernels: one group of transverse detection kernels and one group of vertical detection kernels. The formula of the operator is as follows:

Gx = [[−1, 0, +1], [−2, 0, +2], [−1, 0, +1]] ∗ A
Gy = [[−1, −2, −1], [0, 0, 0], [+1, +2, +1]] ∗ A

where A is the smoothed image; Gx is the image after transverse gradient detection; and Gy is the image after vertical gradient detection. The gradient amplitude can be expressed as:

G = √(Gx² + Gy²)

The gradient direction can be expressed as:

θ = arctan(Gy / Gx)

During edge detection, NMS was adopted to find the maximum gradient of local pixels by comparing the gradients and gradient directions. After obtaining the binary edge image, the Hough circle transform was employed to detect the circles. Under the coordinate system (c1, c2, r), the formula of the Hough circle transform can be described as:

(x − c1)² + (y − c2)² = r²

where (c1, c2) is the coordinates of the circle center; and r is the radius.
The detection was realized through the following steps: first, the non-zero points in the image are traversed, and line segments along the gradient direction (radius direction) and the opposite direction are drawn. The intersection point of the segments is the circle center. Then, the maximum circle is obtained by setting the threshold value. After that, the rectangular regions are generated by the maximum radius. The size of the image was normalized to 300 × 300 × 3. Figure 2 shows the regional planning process.
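The center-finding step, voting along each edge point's gradient direction and its opposite, can be sketched as a simple accumulator. This is an illustrative NumPy reconstruction of the voting idea only; a production system would more likely call a library routine such as OpenCV's HoughCircles.

```python
import numpy as np

def vote_center(edge_points, gradients, shape, r_min=5, r_max=30):
    """Accumulate votes for circle centers along each edge point's
    gradient direction (and its opposite); the peak is the center."""
    acc = np.zeros(shape, dtype=int)
    for (y, x), (gy, gx) in zip(edge_points, gradients):
        norm = np.hypot(gy, gx)
        if norm == 0:
            continue
        dy, dx = gy / norm, gx / norm
        for r in range(r_min, r_max):
            for sign in (1, -1):   # both directions along the gradient
                cy = int(round(y + sign * dy * r))
                cx = int(round(x + sign * dx * r))
                if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                    acc[cy, cx] += 1
    return np.unravel_index(acc.argmax(), acc.shape)

# Synthetic check: edge points on a circle of radius 10 around (32, 32),
# with radial gradients, should vote the center back to (32, 32).
ts = np.linspace(0, 2 * np.pi, 64, endpoint=False)
pts = [(32 + 10 * np.sin(t), 32 + 10 * np.cos(t)) for t in ts]
grads = [(y - 32, x - 32) for y, x in pts]
center = vote_center(pts, grads, (64, 64))
```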

Principles of MobileNet Feature Extraction
The MobileNet network [27] was developed to improve the real-time performance of deep learning under limited hardware conditions. This network can reduce the number of parameters without sacrificing accuracy. Previous studies have shown that MobileNet needs only 1/33 of the parameters of Visual Geometry Group-16 (VGG-16) to achieve the same classification accuracy in ImageNet-1000 classification tasks. Figure 3 shows the basic convolution structure of MobileNet. Conv_Dw_Pw is a depthwise separable convolution structure composed of depth-wise layers (Dw) and point-wise layers (Pw). The Dw are depthwise convolutional layers using 3 × 3 kernels, while the Pw are common convolutional layers using 1 × 1 kernels. Each convolution result is treated by the batch normalization algorithm and the activation function rectified linear unit (ReLU).

In this paper, the activation function ReLU is replaced by ReLU6, and the normalization is carried out by the batch normalization (BN) algorithm, which supports the automatic adjustment of the data distribution. The ReLU6 activation function can be expressed as:

ReLU6(z) = min(max(0, z), 6)

where z is the value of each pixel in the feature map. The depthwise separable convolution structure enables MobileNet to speed up training and greatly reduces the amount of calculation.
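The ReLU6 clipping is simply min(max(0, z), 6); a one-line NumPy sketch:

```python
import numpy as np

def relu6(z):
    # ReLU6 clips activations to the range [0, 6]
    return np.minimum(np.maximum(z, 0.0), 6.0)

out = relu6(np.array([-2.0, 3.0, 8.0]))
```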
The reasons are as follows. The standard convolution structure can be expressed as:

G_N = Σ_M K_{M,N} ∗ F_M

where K_{M,N} is the filter; and M and N are respectively the number of input channels and output channels. F_M denotes the input images, including feature maps, which use zero padding. When the size and number of channels of the input images are respectively D_F × D_F and M, it is necessary to have N filters with M channels and the size of D_K × D_K before outputting N feature images of the size D_F × D_F. The computing cost is D_K · D_K · M · N · D_F · D_F.

By contrast, the Dw formula can be expressed as:

Ĝ_M = K_{1,M} ∗ F_M

where K_{1,M} is the single-channel filter applied to the M-th input channel, and F_M has the same meaning as above. When the step size is one, zero filling ensures that the size of the feature map is unchanged after applying the depthwise separable convolution structure. When the step size is two, zero filling ensures that the size of the resulting feature map becomes half of that of the input image/feature map; that is, a dimensional reduction is realized.

The depthwise separable convolution structure of MobileNet can obtain the same outputs as standard convolution from the same inputs. The Dw phase needs M filters with one channel and the size of D_K × D_K. The Pw phase needs N filters with M channels and the size of 1 × 1. In this case, the computing cost of the depthwise separable convolution structure is D_K · D_K · M · D_F · D_F + M · N · D_F · D_F, a reduction by a factor of 1/N + 1/D_K² compared with standard convolution.

Besides, the data distribution is changed by each convolution layer during network training. If the data fall on the saturated edge of the activation function, the gradient will vanish and the parameters will no longer be updated.
Similar to standardization to a normal distribution, the BN algorithm adjusts the data by setting two learnable parameters, which prevents gradient vanishing and reduces the need to tune complex parameters (e.g., learning rate and dropout ratio).
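The computing-cost comparison above (standard vs. depthwise separable convolution) can be checked numerically. The snippet below counts multiply-accumulates using the text's symbols; the concrete layer sizes are illustrative.

```python
def standard_cost(dk, m, n, df):
    # D_K * D_K * M * N * D_F * D_F multiply-accumulates for standard convolution
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    # depthwise (D_K * D_K * M * D_F * D_F) plus pointwise (M * N * D_F * D_F)
    return dk * dk * m * df * df + m * n * df * df

# Example: 3x3 kernels, 64 input channels, 128 output channels, 150x150 maps
dk, m, n, df = 3, 64, 128, 150
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
# ratio equals 1/N + 1/D_K^2, i.e. roughly an 8.4x reduction here
```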

SSD Meta Structure
The SSD network is a regression model that uses the features of different convolution layers to perform classification regression and bounding box regression. The model resolves the conflict between translation invariance and translation variability, and achieves good detection precision and speed.
In each selected feature map, there are k boxes that differ in size and width-to-height ratio. These boxes are called the default boxes. Figure 4 shows default boxes on feature maps of different convolutional layers. Each default box predicts B class scores and four position parameters. Hence, B · k · w · h class scores and 4 · k · w · h position parameters must be predicted for a w × h feature map. This requires (B + 4) · k · w · h convolution kernels of the size 3 × 3 to process the feature map. Then, the convolution results are taken as the final features for classification regression and bounding box regression. Here, B is set to four because there are four typical defects on the sealing surface of a container in the filling line. The scale of the default boxes for each feature map is computed as:

S_k = S_min + ((S_max − S_min) / (m − 1)) (k − 1), k ∈ [1, m]

where m is the number of feature maps; and S_max, S_min are parameters that can be set. In order to control the fairness of feature vectors in the training and test experiments, the same five width-to-height ratios a_r = {1, 2, 3, 0.5, 0.33} were used to generate default boxes. Then, each default box can be described as:

w_k^a = S_k √(a_r), h_k^a = S_k / √(a_r)

where w_k^a is the width of a default box; and h_k^a is its height. Next, an extra default box with the scale S'_k = √(S_k S_{k+1}) is added when the width-to-height ratio is one.
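The scale formula and the per-ratio box shapes can be sketched as follows. Note that S_min = 0.2 and S_max = 0.9 are the defaults from the original SSD paper, assumed here because the text does not state the values used.

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    # S_k = S_min + (S_max - S_min) * (k - 1) / (m - 1), k = 1..m
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def box_shape(s_k, ar):
    # width = S_k * sqrt(a_r), height = S_k / sqrt(a_r)
    return s_k * math.sqrt(ar), s_k / math.sqrt(ar)

scales = default_box_scales(6)        # one scale per pyramid layer
w, h = box_shape(scales[0], 2)        # ratio-2 box at the first layer
```

A ratio-a_r box keeps the area S_k² constant while stretching width against height, which is why w · h equals S_k² for every ratio.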
The intersection over union (IoU) between area A and area B can be calculated as:

IoU = |A ∩ B| / |A ∪ B|

If the IoU of a default box and a calibration box (ground-truth box) is greater than 0.5, the default box matches the calibration box of that category.
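For axis-aligned boxes, the IoU and the matching rule above reduce to a few lines; the helper below is illustrative, with boxes given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def matches(default_box, gt_box, threshold=0.5):
    # a default box matches a ground-truth box when IoU > 0.5
    return iou(default_box, gt_box) > threshold
```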
The SSD is an end-to-end training model. The overall loss function of the training contains the confidence loss L_conf(s, c) of the classification regression and the position loss L_loc(r, l, g) of the bounding box regression. This function can be described as:

L(s, r, c, l, g) = (1/N) [L_conf(s, c) + α L_loc(r, l, g)]

where α is a parameter balancing the confidence loss and the position loss; s and r are the eigenvectors of the confidence loss and position loss, respectively; c is the classification confidence; l is the offset of the predicted box, including the translation offset of the center coordinates and the scaling offset of the height and width; g is the calibration box of the target's actual position; and N is the number of default boxes that match the calibration boxes of the category.

Surface Defect Detection Algorithm Based on MobileNet-SSD Model
In the filling line, the sealing surface is easily damaged by friction, collision and extrusion during the recycling and transport of pressure vessels. The common defects include breaches, dents, burrs and abrasions on the sealing surface. In this paper, the MobileNet-SSD model can greatly reduce the number of parameters and achieve high accuracy under limited hardware conditions. The complete model contains four parts: the input layer for importing the target image, the MobileNet base net for extracting image features, the SSD for classification regression and bounding box regression, and the output layer for exporting the detection results.
This model supports fast and accurate detection because the structure of MobileNet reduces the complexity of computing. However, the structure of Pw changes the distribution of the output data of Dw, which may cause a loss of precision. To solve this problem, the fully-connected layers were abandoned, and eight standard convolutional layers were added to widen the receptive field of the feature image, adjust the data distribution and enhance the translation invariance of the classification task. To prevent the vanishing of the gradient, the BN layer and the activation function ReLU6 were introduced into each layer of the added structure.
In addition, the two-layer feature image in MobileNet and the four-layer feature image in the added standard convolutional layers constituted a feature image pyramid (Figure 5). Kernels of size 3 × 3 were adopted to convolve the selected feature maps, and the convolution results were taken as the final features for classification regression and bounding box regression. A 300 × 300 image was taken as the input. The six layers of the pyramid respectively contain 4, 6, 6, 6, 6 and 6 default boxes. Besides, different 3 × 3 kernels were adopted for classification and location with a step length of one. The numbers in brackets are the number of 3 × 3 filters applied around each location in the feature map: the number of default boxes × the number of categories (classification) and the number of default boxes × 4 (location), respectively. The general structure of MobileNet-SSD is shown in Table 2. Table 2. General structure of MobileNet-SSD. ※ = the feature image to be used in classification regression and bounding box regression.

In Table 2, Conv_BN_ReLU6 is a standard convolutional layer, while Conv1_Dw_Pw is a depthwise separable convolutional layer. Besides, the sign ※ represents the feature image to be used in classification regression and bounding box regression. Considering the small size of the target defects, the feature image of the shallow Conv7_Dw_Pw output was adopted for further analysis.

Experimental Results and Analysis
Our experiment targets an oil chili filling production line in China's Guizhou Province. During the detection, the image of the sealing surface was transmitted via the image acquisition unit to the host for image signal processing. Then, the corresponding features were extracted and the defects were detected and marked by the MobileNet-SSD network. Specifically, the MobileNet-SSD served as the training base net of the pre-processed database; then, the trained model was migrated to the detection network for boundary box regression and classification regression. A total of five width-to-height ratios were selected according to the defect size, namely 1, 2, 3, 0.5 and 0.33. In addition, the learning rate was an exponential decay learning rate initialized to 0.1, with randomly initialized weights and bias terms.
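The exponential-decay schedule with an initial rate of 0.1 can be sketched as below. The decay rate of 0.95 and decay interval of 1000 steps are illustrative assumptions, since the text does not state them.

```python
def exp_decay_lr(step, lr0=0.1, decay_rate=0.95, decay_steps=1000):
    # lr = lr0 * decay_rate ** (step / decay_steps)
    return lr0 * decay_rate ** (step / decay_steps)

lr_start = exp_decay_lr(0)       # 0.1 at step 0
lr_later = exp_decay_lr(1000)    # one decay interval later
```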

Image Processing Unit
As shown in Figure 6, the image acquisition unit consists of a transmission mechanism, a proximity switch sensor, an industrial CCD camera, a light-emitting diode (LED) arc light source and a lens. Before the acquisition, the LED arc light source was adjusted to calibrate the brightness, and the lens was mounted on the CCD camera. Then, the aperture and focal length were adjusted to ensure the imaging quality of the acquisition unit. Under the action of the transmission mechanism, the sensor detected the workpiece and produced pulse signals when the vessel maintained a constant speed and spacing. These signals triggered the CCD camera to take photos. To ensure that the container runs to the center of the field of view, the sensor needs to be accurately calibrated.

Experimental Results and Analysis
Our experiment targets an oil chili filling production line in China's Guizhou Provin the detection, the image of the sealing surface was transmitted via the image acquisition host for image signal processing. Then, the corresponding features were extracted and t were detected and marked by the MobileNet-SSD network. Specifically, the MobileNet-S as the training base net of the pre-processed database; then, the trained model was migra detection network for boundary box regression and classification regression. A total of five height ratios were selected according to the defect size, namely, 1, 2, 3, 0.5 and 0.33 respec was set to 0.95 and the to 0.2. The six layers of the pyramid respectively cont represents the feature image to be used in classification regression and bounded box regression. Considering the small size of the target defects, the feature image of the shallow Conv7_Dw_Pw output was adopted for further analysis.

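The parameter savings of the depthwise separable layers in Table 2 can be made concrete with a short calculation. This is a generic sketch of the standard counting formulas, with illustrative channel sizes rather than the actual MobileNet-SSD layer dimensions:

```python
# Parameter counts for a standard vs. a depthwise separable convolution.
# Channel sizes below are illustrative, not the actual network's.

def standard_conv_params(k, c_in, c_out):
    # A k x k kernel spans all input channels for every output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel;
    # pointwise: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 128, 128)        # 147456
dws = depthwise_separable_params(3, 128, 128)  # 1152 + 16384 = 17536
print(std, dws, round(std / dws, 1))           # roughly an 8x reduction
```
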
Experimental Results and Analysis
Our experiment targets an oil chili filling production line in China's Guizhou Province. During detection, the image of the sealing surface was transmitted via the image acquisition unit to the host for image signal processing. Then, the corresponding features were extracted and the defects were detected and marked by the MobileNet-SSD network. Specifically, the MobileNet-SSD served as the training base net of the pre-processed database; then, the trained model was migrated to the detection network for bounding box regression and classification regression. A total of five width-to-height ratios were selected according to the defect sizes, namely, 1, 2, 3, 0.5 and 0.33. The S_max was set to 0.95 and the S_min to 0.2. The six layers of the pyramid contain 4, 6, 6, 6, 6 and 6 default boxes, respectively. During training, the IoU of the positive samples fell in (0.5, 1), that of the negative samples fell in (0.2, 0.5), and that of the hard samples fell in (0, 0.2). In addition, an exponentially decaying learning rate initialized to 0.1 was adopted, and the weights and bias terms were randomly initialized.
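The S_max/S_min settings above follow the standard SSD default-box scheme, in which the scale of the k-th pyramid layer is interpolated linearly between S_min and S_max. A minimal sketch, assuming the standard SSD formula and the five aspect ratios listed above:

```python
import math

S_MIN, S_MAX, M = 0.2, 0.95, 6  # values used in this paper; M pyramid layers

def layer_scale(k):
    # Standard SSD scale for the k-th feature layer (k = 1..M).
    return S_MIN + (S_MAX - S_MIN) * (k - 1) / (M - 1)

def box_dims(scale, aspect_ratio):
    # Default-box width/height relative to the input image.
    w = scale * math.sqrt(aspect_ratio)
    h = scale / math.sqrt(aspect_ratio)
    return w, h

scales = [round(layer_scale(k), 2) for k in range(1, M + 1)]
print(scales)  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]
for ar in (1, 2, 3, 0.5, 1 / 3):  # the five aspect ratios used here
    print(ar, box_dims(scales[0], ar))
```
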

Image Processing Unit
As shown in Figure 6, the image acquisition unit consists of a transmission mechanism, a proximity switch sensor, an industrial CCD camera, a light-emitting diode (LED) arc light source and a lens. Before acquisition, the LED arc light source was adjusted to calibrate the brightness, and the lens was mounted on the CCD camera. Then, the aperture and focal length were adjusted to ensure the imaging quality of the acquisition unit. Under the action of the transmission mechanism, the sensor detected the workpiece and produced pulse signals while the vessels maintained a constant speed and spacing. These signals triggered the CCD camera to take photos. To ensure that each container runs to the center of the field of view, the sensor needs to be accurately calibrated. The detection network was trained on the following hardware: an Intel Core i7 7700K processor (Vietnam, 2017) with a main frequency of 4.2 GHz, 32 GB of memory and a GeForce TITAN X graphics processing unit (GPU). The software environment used the Ubuntu 14.04.2 operating system and the TensorFlow deep learning framework. Twenty percent of the samples in the pre-processed library were allocated to the test set and the other 80% to the training set.
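The 80/20 train/test split described above can be sketched in a few lines; the file names and random seed below are placeholders, not the actual dataset:

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=42):
    # Shuffle a copy so the original order is untouched, then cut off
    # the first test_fraction of samples as the test set.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

# Hypothetical file list standing in for the pre-processed library
train, test = split_dataset([f"img_{i:04d}.png" for i in range(500)])
print(len(train), len(test))  # 400 100
```
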

Comparison of Three Deep Learning Networks
The loss function and the accuracy of the proposed MobileNet-SSD surface defect detection algorithm on the test set (Figure 7) were compared to those of VGGNet [28], a top-performing detection network in the 2014 ImageNet challenge, and MobileNet. The three algorithms were trained via migration learning and data enhancement. The training parameters and results are shown in Table 3. In the training process, the detection networks were tested once after every two hundred iterations of the training set. The loss function and accuracy in Table 3 were mean values obtained from the 40th to 50th iterations of the test set. It is clear that the MobileNet-SSD detection algorithm achieved better accuracy than the other two networks with fewer network parameters.

Results of Defect Detection Network
The trained network parameters were adopted for the MobileNet-SSD defect detection network. The test set contained four different types of defect samples, each of which had 30 images obtained through resampling. Each sample involved one or more defects. The detection results of the trained MobileNet-SSD defect detection network on the four kinds of defect samples are shown in Table 4. It can be seen from Table 4 that the surface defect detection network completes the defect marking of the 120 defect samples with a 95.00% accuracy rate. There were missed and false detections among the dent and burr defects, and missed detections among the abrasion defects. This is because the breaches are more obvious than the other defects; the errors are also related to the image quality and the subjective judgment of human annotators.
When the filling line was in operation, the container passed the image acquisition device within a certain distance, triggering the CCD camera to take photos. The defect detection network then performed a forward operation. If there were defects in the image, a buzzer alarm sounded and the defect type and location were identified by the host (Figure 8). A single forward operation of the network ran at a rate of 0.12 s per image.

Degree of Defect Detection
The defects of the same type may differ in terms of severity. Here, the pre-processed datasets were divided into three categories based on defect severity: easy, medium and hard. The recognition result can serve as a yardstick of the network classification quality. Seventy percent of all samples were allocated to the training set and the remaining 30% to the test set. The detection results for the breaches are shown in Figure 9. Figure 10 shows the precision-recall (PR) curves on the experimental dataset. The advanced multi-task CNN (MTCNN) [29] and Faceness-Net [30] were contrasted with the proposed MobileNet-SSD algorithm ("MS" in the figure) on the experimental dataset. The experimental results show that the recall rates of the proposed algorithm were 93.11%, 92.18% and 82.97% on the easy, medium and hard subsets, respectively, and its performance was better than that of the contrastive algorithms.
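The PR curves in Figure 10 follow from the usual precision and recall definitions. A minimal sketch with hypothetical detection scores and labels (not the paper's data):

```python
def precision_recall_points(scores, labels):
    # Sweep the confidence threshold over detections sorted by score and
    # record (precision, recall) at each cut. labels: 1 = true defect.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))
    return points

# Hypothetical detection scores and ground-truth labels
pts = precision_recall_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
print(pts)  # precision stays at 1.0 until the first false positive
```
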

Contrast Experiment
Three comparative experiments were designed to further validate the proposed algorithm. In the first experiment, the proposed algorithm was compared to five lightweight feature extraction networks, including SqueezeNet [31], MobileNet, performance vs accuracy net (PVANet) [32], MTCNN and Faceness-Net. The feature extraction accuracy of each algorithm on the ImageNet classification task is displayed in Table 5. In the second experiment, the above five networks were contrasted with the proposed algorithm in defect detection on the filling line in terms of correct detection rate, training time and detection time per image (Table 6). As shown in the two tables, the MobileNet-SSD surface defect model is fast and stable, thanks to the improved SSD meta-structure of the feature pyramid. In general, the proposed algorithm outperformed the contrastive algorithms in detection rate, training time and detection time. The final detection time of our algorithm was merely 120 milliseconds per image, which meets the real-time requirements of the industrial filling line.
In the third contrast experiment, four traditional defect recognition methods, namely k-nearest neighbor (KNN) [33], HMM [34][35][36], SVM combined with HMM [37] and the back propagation neural network (BPNN) [38], were implemented and compared with the method in this paper. The KNN method selects the Euclidean distance as its distance function; the HMM model adopts a 5 × 4 sampling window and uses the discrete cosine transform (DCT) coefficients as the observation vectors of the HMM. The SVM and HMM method is the same as in the literature [37]. The number of hidden layer nodes of the BPNN is set to 30. The above models were also applied to detect the defects on the sealing surface of a container on the filling line. The statistical results are shown in Table 7. As can be seen from Table 7, compared with the other traditional defect detection methods, the MobileNet-SSD method has a higher positive detection rate. Under the same hardware conditions, MobileNet-SSD still maintains the optimal speed, despite the small differences among the five methods. In addition, the results of HMM and KNN are not ideal. The reason may be that the defects occupy a small proportion of the image, and the sealing surface of a container contains a lot of background information, while KNN and HMM do not extract specific image features before classifying. In contrast, both the BPNN and MobileNet-SSD are based on neural networks, which can automatically learn features, so the accuracy rates of these two methods are relatively high. MobileNet-SSD, owing to its unique depthwise separable convolutional structure, can learn deep features of defects with a larger receptive field, so it achieves a higher positive detection rate.
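As a rough illustration of the KNN baseline with Euclidean distance described above, using toy 2-D feature vectors in place of the real image features:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label); classify by majority vote
    # among the k nearest neighbours under Euclidean distance.
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D features for two defect classes (illustrative only)
train = [((0.1, 0.2), "breach"), ((0.15, 0.25), "breach"),
         ((0.9, 0.8), "burr"), ((0.85, 0.9), "burr")]
print(knn_predict(train, (0.2, 0.2)))  # breach
```
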

Conclusions
This paper proposes a surface defect detection method based on the MobileNet-SSD network, and applies it to identify the types and locations of surface defects. In the pre-processing phase, a regional planning method was presented to cut out the main body of the defect, reduce redundant parameters and improve detection speed and accuracy. Meanwhile, the robustness of the algorithm was elevated by data enhancement. The philosophy of MobileNet, a lightweight network, was introduced to enhance the detection accuracy, reduce the computing load and shorten the training time of the algorithm. The MobileNet and SSD were adjusted to detect surface defects, such that the proposed method could differentiate small defects from the background. The feasibility of the proposed method was verified by defect detection on the sealing surface of an oil chili filling production line in Guizhou, China. Specifically, an image acquisition device was established for the sealing surface and the deep learning framework was adopted to mark the defect positions. The results show that the proposed method can identify most defects in the production environment rapidly and accurately. However, the system also has its limitations. Deep learning models depend to a certain extent on the hardware platform because of their computationally intensive processes, and they are not suitable for embedded systems with limited performance. Future research will further improve the proposed method through integration with embedded chips and the Internet of Things, balance the classification accuracy against the number of parameters of the detection method, and expand the application scope of our method to complex defects in industrial processes.