Research on Automated Defect Classification Based on Visual Sensing and Convolutional Neural Network-Support Vector Machine for GTA-Assisted Droplet Deposition Manufacturing Process

Abstract: This paper proposes a novel metal additive manufacturing process that combines gas tungsten arc (GTA) welding and droplet deposition manufacturing (DDM). Because of the complex physical metallurgical processes involved, such as droplet impact, spreading and surface pre-melting, defects including lack of fusion, overflow and discontinuity of deposited layers often occur. To assure the quality of GTA-assisted DDM-ed parts, online monitoring based on visual sensing was implemented. This study also focuses on automated defect classification by means of a convolutional neural network-support vector machine (CNN-SVM) model, to avoid the low efficiency and bias of manual recognition. The best accuracy of 98.9%, with an execution time of about 12 milliseconds per image, shows that the model is suitable for real-time feedback control of the process.


Introduction
Additive manufacturing (AM) is revolutionary compared with traditional processing methods for creating complex 3D-shaped components. Among the different additive manufacturing techniques, wire and arc additive manufacturing (WAAM), which combines an electric arc as the heat source and wire as the feedstock material, is suitable for producing large metallic parts owing to deposition rates significantly higher than those of powder-fed processes [1]. Gas metal arc (GMA), gas tungsten arc (GTA) and plasma arc (PA) are the most used processes in WAAM. They all need external filler materials and a high energy-density arc heat source under an inert shielding gas [2]. In contrast to wire feedstock, the new metal additive manufacturing process developed in this paper uses fused droplets as feedstock combined with a variable-polarity GTA to form aluminum alloy parts. A solid cylinder of aluminum alloy is inductively heated in a graphite crucible to a molten state. At the same time, a certain argon pressure is applied to make the fused aluminum alloy droplets flow out of the nozzle and fall into the molten pool generated by the GTA. The deposited layer is formed after the liquid metal solidifies. Figure 1 shows the schematic diagrams of the two different processes: Figure 1a is the schematic diagram of WAAM, and Figure 1b is the schematic diagram of the process we present.
The GTA process involves a highly non-linear heat source, and there are several input parameters to consider, such as the welding voltage and current, the process speed, the shielding gas flow and the type of materials [3,4]. To keep the process relatively stable and restrain defects so as to form a good shape, non-destructive testing (NDT) plays a vital role in implementing online monitoring without changing or damaging the nature and structure of the parts. It is therefore economical at different levels of development and maintenance [5].
In the past few years, several typical NDT techniques such as computerized tomography (CT), radiographic testing (RT), ultrasonic testing (UT), magnetic particle inspection (MPI) and eddy current testing (ET) have been applied to the field of metal AM [6]. Chabot et al. [7] applied phased array ultrasonic testing (PAUT) to WAAM components. With the help of X-ray radiography, the PAUT method achieved defect size detection from 0.6 to 1 mm for aluminum alloy parts. Bento et al. [8] developed an eddy current testing (ECT) system whose customized ECT probes were able to locate artificial defects at depths up to 5 mm, with a thickness as small as 350 μm, and with the probe up to 5 mm away from the inspected sample surface. Wu et al. [9] used an infrared monochrome pyrometer (IMP) to accurately identify simulated cracks on the surface of a laser metal deposition (LMD) sample. To detect lack-of-fusion defects, Montazeri et al. [10] captured the dynamic phenomena around the melt pool region with a spectrometer and an optical camera during directed energy deposition (DED). Chang et al. [11] proposed a method based on the position information of electron beam speckle to realize the three-dimensional reconstruction of the surface of the deposited parts in the process of electron beam freeform fabrication (EBF3).
As a very important NDT method, visual sensing systems are widely used in online monitoring of metal AM. In addition, many image processing algorithms suited to different processes have been designed to improve the detection stability of such systems. Zhuang et al. [12] proposed k-nearest neighbor (KNN) classification algorithms, contour curve-KNN (CC-KNN) and locality preserving projection-KNN (LPP-KNN), which performed effectively in vision and spectral analysis. Yu et al. [13] established a visual sensing system to capture every frame of the molten pool images matched to the actual weld location in the GMA AM process. A back propagation (BP) neural network was used to extract the shape and location features of the molten pool. Xia et al. [14] developed a visual sensing system working with a robot and a cold metal transfer (CMT) welder. The adaptive Wiener filter and the Canny algorithm were utilized to obtain information from welding pool images. Aminzadeh et al. [15] developed and trained a statistical Bayesian classifier to classify the build quality and flag defective or unacceptable build layers during laser powder bed fusion (LPBF).
Deep learning (DL) algorithms have recently grabbed the attention of scientists due to their strong ability to learn high-level features from raw data; in most cases, they outperform traditional algorithms in terms of accuracy and robustness. Convolutional neural networks (CNN) are particularly widely used in computer vision tasks, including image classification, object detection, segmentation and so on [16][17][18][19]. However, restricted by computational performance and datasets, CNN fell out of use for several years until AlexNet was proposed by Krizhevsky for the ImageNet competition in 2012 [20,21]. Subsequently, VGGNet [22] and GoogleNet [23] were proposed, considering the width and depth of the network, respectively. ResNet [24] proved that the depth of the network can be increased considerably. With the rise of state-of-the-art CNN architectures, researchers have introduced them to the field of metal AM. Cui et al. [25] used the Missouri S&T dataset (optical microscope images of LMD parts) to train and investigate hyperparameters of their CNN model, including kernel size and the number of layers. Kwon et al. [26] applied a CNN to melt-pool images with respect to six laser power labels in selective laser melting (SLM); the classification failure rate was under 0.01. Yin et al. [27] adopted a CNN to analyze the welding process parameters and the weld dimensions from twin-wire CMT welding of 5083 aluminum alloy. Zhang et al. [28] presented the application of a deep learning framework for automated surface quality inspection, recognizing under-melt, beautiful-weld and over-melt categories in LPBF; the classification accuracy of the finally developed model on the UB-Moog dataset was 0.82 after optimizing hyperparameters. Wang et al. [29], based on previous work [13], developed a prediction network (PredNet) to predict the change of molten pool shape 140 ms in advance.
Through a regression network (SERes), the predicted results were regressed in advance to accurate weld reinforcement information of the deposited layer. Tomaz et al. [30] realized multi-objective optimization of the GTAW process with the help of an artificial neural network (ANN) and a genetic algorithm (GA). The optimal welding parameters, including welding current = 222 A, welding speed = 25 cm/min, nozzle deflection distance = 8 mm and travel angle = 25°, were determined; the determination coefficient (R²) and RMSE of all response parameters were satisfactory, with R² remaining higher than 0.65 for all data.
The "No Free Lunch" theorem states that no algorithm can perform well on all problems, so the objective of this work is to explore a good CNN-SVM-based model with the best possible optimizer function, a good learning rate and a varied number of epochs to identify the common defects in GTA-assisted DDM with the best accuracy. The results can be used for quasi-real-time (layer-wise) process control, further process decisions or corrective actions.
The remainder of this paper is organized as follows. In Section 2, we introduce the GTA-assisted DDM experiment platform and the CNN-SVM architecture in detail. In Section 3, the hyperparameters optimization is introduced in detail, including performance evaluation and the visualization of CNN features. The conclusion is summarized in Section 4.

Experiment Platform
The GTA-assisted DDM experiment platform combined a GTAW platform (Fronius, Pettenbach, Austria), melting-type high-frequency induction heating equipment (SPG50K-15AB, ShuangPing Power Technology Co., Ltd., ShenZhen, China) and a visual sensing system (Mikrotron GmbH, Unterschleissheim, Germany), which are shown in Figure 2. The GTA welding platform included a welding power supply (Fronius Magicwave 3000 Job G/F, Fronius, Pettenbach, Austria) and a TIG robot welding torch (TBi RT20, TBi Industries GmbH, Fernwald-Steinbach, Germany). The aluminum 2024 to be melted was machined into a cylinder with a diameter of 60 mm and a height of 80 mm and placed in the graphite crucible. The aluminum 2024 grade was selected as the work material because it is a high-strength duralumin extensively used in high-load parts such as skeletons and skins of aircraft. Its chemical composition is shown in Table 1. The size of the substrate was 220 mm × 220 mm × 10 mm, and a 65° inclination angle was formed between the welding torch and the substrate. The visual sensing system consisted of a CMOS camera (EoSens CL: CAMMC1362, Mikrotron GmbH, Unterschleissheim, Germany) with an optical lens (AT-X 100 mm F2.8, Kenko Tokina Co., Ltd., Tokyo, Japan), an image acquisition card (Xtium-CL MX4, Teledyne DALSA, Waterloo, ON, Canada) and other data storage devices. The high-speed camera was fixed on a tripod (Benro IF28+) and focused on the back edge of the deposition layer through the glass panel on the front of the glovebox. After focusing with the aperture at its maximum position, the aperture and the camera exposure time were adjusted together to darken the field of view on the monitor. During the experiment, the droplets falling into the molten pool allowed each frame of the image to be accurately matched to a specific welding position.

Experiment Methods
GTA-assisted DDM was mainly affected by parameters such as forming speed (Ts), forming current (Ip), forming flux (Qv), substrate temperature (Tb), etc. The overall dimensions of a good Al-2024 single-pass deposited layer were generally stable, and the surface of the deposited layer had fish-scale patterns, as shown in Figure 3c. It can be seen that the deposited layer spreads continuously and is metallurgically bonded well to the substrate. Figure 3a shows the cross section of the deposited layer. Table 2 lists the welding parameters as a standard baseline to achieve good processing conditions. Images of good Al-2024 single-pass deposited layers were captured and recorded, then several defects were introduced by altering process parameters one at a time. Finally, morphology images of defective deposited layers were also recorded.

Defects such as lack of fusion between the deposited layer and the substrate, overflow and discontinuity of the deposited layer are prone to appear when process parameters change. Lack of fusion is mainly caused by insufficient heat input: when Ip ≤ 200 A or Tb ≤ 220 °C, the deposited layer spreads with difficulty and bonds poorly with the substrate. Overflow is mainly due to excessive heat input and forming flux: when Ip ≥ 280 A or Qv ≥ 200 mm³/s, heat accumulation is severe and the molten droplets cannot be completely absorbed by the molten pool. Discontinuity is mainly caused by excessive forming speed: when Ts ≥ 30 mm/s, the droplets do not fall into the molten pool continuously, causing fluctuations in the outer dimensions of the deposited layer. Figure 4 shows the macroscopic morphology and raw captured images (1280 × 1024 pixels) for the above three common defects at different process parameters, which provide enough information about the defects.
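The parameter thresholds above can be collected into a simple rule-of-thumb check. The function below is an illustrative sketch only; its name and argument conventions are ours, not code from the study:

```python
def likely_defect(Ip, Tb, Qv, Ts):
    """Map process parameters to the defect each out-of-range value promotes,
    per the thresholds stated above (Ip in A, Tb in degrees C, Qv in mm^3/s,
    Ts in mm/s). Illustrative only, not code from the study."""
    defects = []
    if Ip <= 200 or Tb <= 220:   # insufficient heat input -> poor spreading/bonding
        defects.append("lack of fusion")
    if Ip >= 280 or Qv >= 200:   # excessive heat input / forming flux
        defects.append("overflow")
    if Ts >= 30:                 # excessive forming speed -> droplets miss the pool
        defects.append("discontinuity")
    return defects or ["good"]
```

For example, parameters near the Table 2 baseline return "good", while dropping the forming current below 200 A flags lack of fusion.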

Preprocessing
Each raw captured image (1280 × 1024 pixels) had an approximate size of 1.2 MB and contained a substantial number of black pixels surrounding the deposited layer and the welding arc, as seen in Figure 4. Because of hardware constraints during the training stage, a model receiving higher-resolution inputs requires significantly more GPU memory, so a subsampling operation is necessary before creating the deep learning dataset [31]. The reliability and speed of the algorithms are improved by region of interest (ROI) segmentation, histogram equalization and image filtering [32]. Figure 5 shows the four types of data after ROI segmentation. The segmented images, 326 × 495 pixels, were compressed to 155-160 kB. There were 582 "good", 641 "lack of fusion", 589 "overflow" and 588 "discontinuity" samples. Data at the beginning or end of the process that seriously distorted the samples were discarded to minimize the impact of sample imbalance on the performance of the algorithms.
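ROI segmentation followed by histogram equalization can be sketched in a few lines of NumPy. The function below is an illustrative approximation, not the authors' implementation; the `(top, left, height, width)` convention for `roi` is our assumption:

```python
import numpy as np

def preprocess(frame, roi):
    """Crop a region of interest from an 8-bit frame, then histogram-equalize it.
    `roi` = (top, left, height, width) is our convention, not the paper's code."""
    top, left, h, w = roi
    img = frame[top:top + h, left:left + w]
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Standard histogram-equalization lookup table spreading gray levels over 0-255
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255).astype(np.uint8)
    return lut[img]
```

Cropping to the 326 × 495 ROI alone already removes most of the uninformative black background around the arc.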


Data Augmentation
Generally speaking, the parameters of many deep CNN architectures number in the millions, and a large amount of data is required to train them properly. Relying entirely on newly collected data is impractical in AM because of time and economic costs [33]. Data augmentation is needed to improve the model's generalization ability given the high diversity of welding conditions [34]. In this paper, scaling, translation, rotation, flipping, adding "salt and pepper" noise and changing the lighting condition were applied to create more data and give the algorithms a better generalization effect. The result of the original "good" image processed by the above data augmentation methods is shown in Figure 6. Scaling was achieved with three scale factors of 0.9, 0.75 and 0.6; translation was done by moving 20% left, right, up and down, respectively; and rotation was performed in 6° steps from −30° to 30°. The number of images in each class after data augmentation is given in Table 3; they were split into two subsets, training and testing, with approximately 75% and 25% of the data, respectively.
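A minimal NumPy sketch of three of the augmentation operations (flipping, salt-and-pepper noise and a lighting change) might look as follows; the noise fraction and gain are illustrative defaults, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def augment(img, noise_frac=0.02, gain=1.1):
    """Return three simple augmented variants of an 8-bit image: a horizontal
    flip, a salt-and-pepper-noised copy, and a lighting change. Parameter
    values are illustrative, not those used in the paper."""
    flipped = img[:, ::-1]
    noisy = img.copy()
    n = int(noise_frac * img.size)
    ys = rng.integers(0, img.shape[0], n)
    xs = rng.integers(0, img.shape[1], n)
    noisy[ys, xs] = rng.choice(np.array([0, 255], dtype=img.dtype), n)
    brighter = np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
    return flipped, noisy, brighter
```

Scaling, translation and arbitrary-angle rotation would typically be delegated to an image library rather than written by hand.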


CNN+SVM Architecture
CNN consists of multiple repeating components stacked in layers: convolution, pooling, fully connected and classifier layers. Among them, the local receptive field, shared weights and pooling are the three most important concepts [19,20]. The model in this paper draws on the ideas of the classic AlexNet model and the linear SVM. The overall structure is composed of two parts, feature extraction and classification, as shown in Figure 7. The feature extraction consists of five convolutional layers named C1-C5, with corresponding filter sizes of 11 × 11, 5 × 5 and 3 × 3, respectively, and three maximum pooling layers named P1-P3. It can be described as follows:

x^l = f(w^l ∗ x^(l−1) + b^l), (1)

where ∗ denotes convolution with a filter of size (J, I), J being the height and I the width of the filter. b^l denotes the bias of the convolutional layer, x^(l−1) the output of the previous layer and w^l the weight of the convolutional layer. f(x) is the nonlinear activation function, rectified linear units (ReLU), shown as Equation (2):

f(x) = max(0, x). (2)

Pooling, shown as Equation (3), is a form of non-linear down-sampling used to replace the output at a certain location with a summary statistic of its neighborhood [34,35]:

y_(i,j) = max_((m,n) ∈ R_(i,j)) x_(m,n), (3)

where R_(i,j) is the pooling window associated with output location (i, j). Finally, batch normalization was used for centering and normalizing the feature maps and was applied before the fully connected layers instead of local response normalization (LRN), whose regularization effect is not obvious. In the classification stage, on the one hand, the output of P3 was flattened and passed to the fully connected layers with a dropout of 0.4 to prevent overfitting.
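The ReLU activation of Equation (2) and a max-pooling operation in the spirit of Equation (3) can be sketched in NumPy. The 2 × 2 non-overlapping window below is a simplification chosen for brevity; the actual P1-P3 windows follow the AlexNet design:

```python
import numpy as np

def relu(x):
    """Equation (2): f(x) = max(0, x)."""
    return np.maximum(0, x)

def max_pool2x2(x):
    """Non-overlapping 2 x 2 max pooling: each output value summarizes a
    2 x 2 neighborhood of the input (a simplified stand-in for P1-P3)."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]  # drop odd edge rows/columns
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

Pooling halves each spatial dimension while keeping the strongest activation in every neighborhood, which is what makes the representation robust to small translations.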
The model was constructed with the better choice of optimizer function between stochastic gradient descent (SGD) and adaptive moment estimation (Adam). Varied learning rates ranging from 1 × 10−3 to 1 × 10−5 were selected and optimized by minimizing the cross-entropy loss function, which can be described as:

H(p, q) = −Σ_x p(x) log q(x), (4)

where H is the cross entropy, which determines how close the actual output is to the expected output, p(x) is the expected output of sample x and q(x) is the actual output of sample x.
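The cross-entropy loss described above reduces to a one-line NumPy computation; the small `eps` guard against log(0) is our addition:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), for an expected distribution p and an
    actual (predicted) distribution q. The eps guard against log(0) is ours."""
    return float(-np.sum(p * np.log(q + eps)))
```

With a one-hot expected output, the loss collapses to the negative log-probability the model assigns to the true class, so it shrinks toward zero as that probability approaches one.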
On the other hand, the top layer is replaced by a linear SVM (L2-SVM), which minimizes the squared hinge loss shown as Equation (5). The output of the pooling layer P3 is given to the SVM algorithm, and the classification is performed.
min_w (1/2)‖w‖² + C Σ_n max(1 − t_n wᵀx_n, 0)², (5)

where w is the normal vector of the classification hyperplane, C is a regularization parameter, x_n is the feature vector and t_n is the label it belongs to.
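The squared hinge loss of Equation (5) can likewise be written directly in NumPy; the bias term is omitted for brevity, and labels are assumed to be in {−1, +1}:

```python
import numpy as np

def l2_svm_loss(w, X, t, C=1.0):
    """Squared hinge loss of a linear L2-SVM:
    0.5 * ||w||^2 + C * sum_n max(0, 1 - t_n * w.x_n)^2,
    with labels t_n in {-1, +1}. The bias term is omitted for brevity."""
    margins = 1.0 - t * (X @ w)
    return float(0.5 * (w @ w) + C * np.sum(np.maximum(0.0, margins) ** 2))
```

Samples classified correctly with a margin of at least 1 contribute nothing beyond the regularizer; only margin violations are penalized, quadratically.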
The overall training procedure was performed on a PC with an Intel Core i5-4570 CPU, RAM of 12 GB, a NVIDIA RTX 2080 Ti GPU, running Windows 10 professional with CUDA 9.0 libraries. The software environment was Python 3.6 using TensorFlow 1.13.1.

Evaluation Metrics
The commonly used evaluation metrics are described in Equations (6)-(9):

Accuracy = (TP + TN)/(TP + TN + FP + FN), (6)
Precision = TP/(TP + FP), (7)
Recall = TP/(TP + FN), (8)
F score = 2 × Precision × Recall/(Precision + Recall). (9)

Among them, accuracy is the proportion of correct predictions; precision, also called positive predictive value (PPV), refers to how many of the samples judged as positive are truly positive; recall measures how many of all actual positive samples are correctly identified; and the F score, the harmonic mean of precision and recall, indicates their overall performance. True positive (TP) is the number of samples whose true value is positive and which the model considers positive; true negative (TN) counts samples whose true value is negative and which the model considers negative; false positive (FP) counts samples whose true value is negative but which the model considers positive; false negative (FN) counts samples whose true value is positive but which the model considers negative.
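The four metrics follow directly from the TP/TN/FP/FN counts; a minimal sketch (the example counts in the usage below are hypothetical, not data from this study):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision (PPV), recall and F score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score
```

For instance, hypothetical counts of TP = 90, TN = 80, FP = 10, FN = 20 give a precision of 0.9 but a lower recall, and the F score sits between the two as their harmonic mean.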

Tuning of CNN Architecture
The classic AlexNet model has more than 60 million parameters and more than 650,000 neurons. If the parameters are not tuned well, the performance will suffer. Batch size, learning rate, epochs, weight initialization, dropout, the choice of optimizer and data normalization are the main indicators of concern. Figure 8 shows the effect of varied batch sizes on the stability of model convergence, presenting loss and accuracy after training for 500 epochs. A larger batch size is conducive to more stable model convergence but is constrained by hardware; hence, the maximum batch size used in our model is 128. It is particularly emphasized that the networks cannot converge unless the weight initialization of the fully connected layers is 1 × 10−4. The weights of the other layers can be initialized using a Gaussian distribution with a mean of 0 and a standard deviation of 0.01 or 0.1. The neuron bias of all layers is initialized to a constant 0.1. The choice of learning rate is often combined with the optimization function used in model training. This study compared the two optimization functions SGD and Adam [36,37] and then analyzed the impact of the learning rate on model convergence. SGD randomly selects a point for calculating the fastest descent direction instead of traversing the entire training data set, which greatly speeds up iteration while still taking the local optimal solution into account.
It can be described as:

θ := θ − α∇_θ J(θ), with J(θ) = (1/2m) Σ_j (h_θ(x^j) − y^j)²,

where J(θ) is the loss function that needs to be minimized, m is the sample batch size, h_θ(x^j) is the function of parameter θ fitted to sample x^j, y^j is the corresponding target, and α is the learning rate. The Adam algorithm calculates the step size to be updated from the first and second moment estimates of the gradient, i.e., the mean of the gradient and its decentralized variance, to design independent adaptive learning rates for different parameters. Figures 9 and 10 show the process of training models with different learning rates and optimization functions. It can be seen that the selection of the learning rate has a great influence on whether the model converges. When the learning rate is 1 × 10−4 and Adam is used, the model converges with the highest accuracy. If the learning rate is set too large, the large training step makes the parameters oscillate back and forth on both sides of the optimal solution. If the learning rate is too small, the convergence speed is greatly reduced and the final model does not reach the best accuracy.
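A toy SGD loop on a one-dimensional quadratic illustrates both the update rule and the learning-rate behavior described above; the objective, step counts and rates are illustrative only and have no connection to the paper's training runs:

```python
def sgd_step(theta, grad, alpha):
    """One SGD update: theta <- theta - alpha * grad(theta)."""
    return theta - alpha * grad(theta)

# Toy stand-in for the training objective: J(theta) = theta^2, gradient 2*theta.
theta = 5.0
for _ in range(500):
    theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)   # converges toward 0

# With a learning rate that is too large, the parameter overshoots the optimum
# on every step and oscillates with growing amplitude, as described in the text.
theta_big = 5.0
for _ in range(10):
    theta_big = sgd_step(theta_big, lambda t: 2 * t, alpha=1.1)
```

The same qualitative behavior (stable convergence at 1 × 10−4, oscillation at larger rates) is what Figures 9 and 10 report for the actual model.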


Performance Evaluation
The precision and recall of our model were evaluated on the test dataset using the metrics defined in Section 2.4, and the results are reported in Table 4. Overall, the F scores reach approximately 0.9, indicating good classification performance. In addition, a comparative experiment against KNN, SVM alone and CNN alone was carried out to highlight our model's recognition accuracy and efficiency. The results are listed in Table 5: our model achieves an accuracy of 98.9% and takes about 12 milliseconds to recognize an image, which is fast enough for real-time feedback control of our process.
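The per-class metrics behind this evaluation can be sketched as below; the confusion-matrix counts in the usage example are illustrative placeholders, not the data behind Table 4.

```python
# Precision, recall and F score per class from a multi-class confusion matrix.
# confusion[i][j] = number of samples of true class i predicted as class j.

def per_class_metrics(confusion):
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp   # predicted k, actually other
        fn = sum(confusion[k][j] for j in range(n)) - tp   # actually k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

# Illustrative 4-class matrix ("good", "lack of fusion", "overflow", "discontinuity"):
confusion = [[48, 1, 1, 0],
             [2, 46, 1, 1],
             [0, 1, 49, 0],
             [1, 0, 0, 49]]
metrics = per_class_metrics(confusion)   # every F score here exceeds 0.9
```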

The Visualization of CNN Features
Visualizing the feature maps after each convolution helps us understand what each convolutional layer has learned and shows how features are abstracted layer by layer. Figure 11 shows the feature maps of four samples representing "overflow", "good", "discontinuity" and "lack of fusion" after passing through different layers. Some feature maps focus on the image background, while others emphasize the image outlines. The 96 and 256 feature maps from the first two convolutional layers, C1 and C2, mostly capture edge, stripe and grayscale information, in which the shapes of the different classes are clearly visible. From C3 onward, a larger number of 3 × 3 convolution kernels are used. Compared with the previous convolutional layer, these kernels have a larger receptive field on that layer's output. This expansion of the receptive field lets the convolutional layers combine low-level features (lines, edges) into higher-level features (curves, textures), and this more abstract representation is harder to interpret visually.
Figure 11. The feature maps of four samples representing "overflow", "good", "discontinuity" and "lack of fusion" after having been learned by different layers.
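The receptive-field growth described above follows a standard recursion over the stacked layers; the sketch below uses illustrative AlexNet-like (kernel, stride) values, not necessarily the exact C1–P1–C2 configuration of our network.

```python
# Receptive field on the input image of one output unit after a stack of
# conv/pool layers, each given as (kernel_size, stride).

def receptive_field(layers):
    rf, jump = 1, 1          # jump = cumulative stride between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input jumps
        jump *= s
    return rf

# AlexNet-style front end (illustrative): C1 = 11x11 stride 4, P1 = 3x3 stride 2,
# C2 = 5x5 stride 1.
receptive_field([(11, 4)])                     # 11: one C1 unit sees 11x11 pixels
receptive_field([(11, 4), (3, 2), (5, 1)])     # 51: deeper units see far more
```

This is why later 3 × 3 layers, despite small kernels, aggregate information from much larger image regions and can encode curves and textures rather than raw edges.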


Conclusions
Quality monitoring based on visual sensing was applied to a novel metal additive manufacturing process, GTA-assisted DDM, in this paper.
(1) A large number of process experiments were conducted with parameters, including forming speed (Ts), forming current (Ip), forming flux (Qv) and substrate temperature (Tb), deviating from a standard baseline. Four kinds of morphology images of deposited layers were obtained: "good", "lack of fusion", "overflow" and "discontinuity".
(2) Translation, rotation, flipping, adding "Salt and Pepper" noise and other data augmentation methods were used to expand the original image dataset and reduce the cost of the process experiments.
(3) A CNN-SVM model based on AlexNet and a linear support vector machine was trained, with a batch size of 128 and a learning rate of 1 × 10−4 under the Adam optimizer determined as the optimal settings; the output of the trained P3 layer was transferred to the SVM as features for classification. The results showed that the F scores of our model mostly reached 0.9 and, compared with KNN, SVM alone or CNN alone, it achieved the best test-set accuracy of 98.9% with an execution time of about 12 milliseconds per image, which leaves sufficient control time in the GTA-assisted DDM process.
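One of the augmentation steps from conclusion (2) can be sketched as follows; the noise ratio and random seed are illustrative choices, not the settings used in our experiments.

```python
# "Salt and pepper" noise augmentation for a grayscale image, represented here
# as a nested list of 0-255 pixel values.
import random

def salt_and_pepper(image, ratio=0.05, seed=0):
    """Flip roughly `ratio` of the pixels to pure black (0) or white (255)."""
    rng = random.Random(seed)
    noisy = [row[:] for row in image]         # copy so the original is untouched
    h, w = len(image), len(image[0])
    for _ in range(int(ratio * h * w)):
        y, x = rng.randrange(h), rng.randrange(w)
        noisy[y][x] = rng.choice((0, 255))    # "pepper" or "salt"
    return noisy

flat = [[128] * 10 for _ in range(10)]        # uniform mid-gray test image
noisy = salt_and_pepper(flat, ratio=0.1)      # ~10 pixels become 0 or 255
```

Applied together with translation, rotation and flipping, such transforms multiply the dataset without additional deposition experiments.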

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
GTA	Gas tungsten arc
Ts	Forming speed
Ip	Forming current
Qv	Forming flux
Tb	Substrate temperature
C1–C5	Convolutional layers, with filter size n x × n y , stride t and padding (yes or no)
P1–P3	Pooling layers
ReLU	Rectified linear units, specified as f(x), for nonlinear activation of neurons
α	Learning rate used during training
EPOCHs	Number of passes over the dataset during training
DROPOUT	Dropout, specified as 0.4 for fully connected layers to prevent overfitting
KNN	k-nearest neighbor
SGD	Stochastic gradient descent
Adam	Adaptive moment estimation