1. Introduction
Several countries have selected Industry 4.0 as a strategic goal for their industrial development because of its huge potential and exciting possibilities. In industrial production, welding is the most common method for joining materials. Various welding processes can be accomplished very efficiently using automatic equipment, such as industrial robots [
1]. Once a robot is given the expected instructions, it can begin working immediately.
Although industrial robots themselves are highly intelligent, the welding process as a whole is not: online intelligent weld defect detection, in particular, remains an open problem [
2]. Online weld testing not only makes weld quality more reliable and stable, but also improves production efficiency and reduces production costs [
3]. GMAW (Gas Metal Arc Welding) is one of the most commonly used welding processes. However, high temperature, spatter, high-strength arc light, and complex welding environments make GMAW online weld detection difficult [
4].
The most popular method for finding weld defects online is to use visual images to obtain more understandable weld and weld pool information. By gathering an image of the weld center and by examining the geometric and spatial distribution properties of the weld, Ma suggested a weld defect detection method based on active visual sensing and machine learning [
5].
Zhang developed an image collection system to acquire image data in all welding directions. After that, a CNN (Convolutional Neural Network) is used to identify it. It is important to note that the system incorporates welding arc light as complementary information rather than weakening or eliminating arc light interference [
6]. Arc changes can be reflected in welding current and voltage. In order to determine the weld penetration, Cao used the arc voltage in the peak current period and the average arc voltage during this time and proposed a bilinear model to design a gain-scheduled model predictive controller [
7]. Ultrasonic testing has also been used for online testing as a conventional, non-destructive procedure. Bonikila established a non-destructive weld detection method based on ultrasonics utilizing a machine learning algorithm [
8].
The analysis of arc acoustic signals has always been a crucial factor in the online assessment of weld quality. Acoustic signals are well suited to the requirements of online weld detection. However, they are easily disturbed, and welding is a highly non-linear process, which makes acoustic signal extraction and analysis very challenging.
Through experimentation, Horvat examined the acoustic signal of GMAW welding, identified the two primary noise sources, and proposed a classification algorithm [
9]. Gao developed the notion that professional welders can detect welding process flaws through their auditory experience. By gathering significant crowd input, a subjective assessment model based on human hearing was presented [
10]. After extracting acoustic features using LPCCs (linear prediction cepstral coefficients), Lv proposed a BPANN (Back Propagation Artificial Neural Network)-based algorithm and constructed a controller with it. The controller accuracy in the experiment ranged from 80% to 90% [
11]. Cui studied the arc bias blowing phenomenon in K-TIG (Keyhole Tungsten Inert Gas) welding and subsequently suggested an ECOC-SVM-GSCV technique based on machine learning to categorize weld penetration flaws [
3]. Liu focused on cold metal transfer lap welding (CMT) and emphasized how air intake affects the caliber of the weld. An RNN-based BiLSTM-CTC method was suggested to determine the signal features by tracking the sound signal of air intake [
12]. Acoustic experimental hardware was upgraded by Chen, who also set up an array of several microphones and used the FastICA (Fast algorithm for Independent Component Analysis) to separate weld noise [
13]. Wang developed a central audit perception model, tested the model effectiveness in online weld detection, and used the model to examine the human hearing principle to identify defects in GMAW welding [
14].
The acoustic signal is a key parameter in welding site inspection. The real-time and accurate acoustic diagnosis of weld defects has long been a challenging subject because of the complicated on-site environment and highly non-linear welding process. In order to identify weld defects, this study suggests a system based on digital twin acoustic signals. Real-time, accurate defect identification was considered when defining the fundamental needs of the digital twin system.
This paper presents a digital twin system for weld sound detection, covering the experimental hardware setup and the creation of a virtual experimental environment. In
Section 2, the communication network of the virtual reality system and other components are introduced. In
Section 3, the preprocessing approach for sound signals is introduced. By examining the time-domain and frequency-domain properties of the signal, we then suggest an adaptive wavelet threshold denoising method. To extract signal features and to detect and classify signals based on the timing properties of acoustic signals and the requirements of real-time systems, an improved SeCNN-LSTM deep learning model is proposed in
Section 4. In this work, seven other models were tested and compared with the SeCNN-LSTM model, and a variety of metrics, such as precision and sample balance, were used for validation. This paper is summarized in
Section 5, which also examines its flaws and suggests a future line of inquiry.
4. Identification and Classification
4.1. Classification Model
The acoustic signal is a typical sequential signal with a very high degree of continuity. To identify and categorize the information carried by acoustic signals, in this work, we developed an SeCNN-LSTM deep learning model.
The discipline of speech recognition has extensively explored acoustic signal processing. However, although they can be used as a guide, related voice recognition technologies are not entirely relevant, because signals from human speech are the primary focus of speech recognition. For instance, the widely used MFCC (mel-frequency cepstral coefficient) feature classification approach enhances the human auditory frequency range while suppressing the high-frequency band [
26,
27]. Suppressing high-frequency signals results in a significant loss of information when industrial machinery is used as the research object.
There are currently three established techniques for identifying acoustic signals. The first is cepstral feature extraction and classification, represented by the MFCC approach [
28]. For example, for a 16 kHz signal, a frame of the cepstrum is usually intercepted, yielding a 39D channel output. Cepstral feature extraction also makes use of LPCCs (linear prediction cepstral coefficients), LSFs (line spectral frequencies), PLP (perceptual linear prediction), and other algorithms [
29,
30]. The second technique uses filter banks, with outputs of up to 80D channels, to extract features. The third is to sample and classify the signals directly; its benefit is that the output channel count can be adjusted, avoiding problems caused by dimensionality that is too high or too low.
Figure 9 displays the deep learning model created in this paper. Besides an effective recognition rate, the experimental requirements for the deep learning model include a lightweight design, because a certain level of portability is necessary for the creation of digital twin systems. We therefore built the one-dimensional convolution feature extraction process as a three-layer parallel expanded structure. Although it resembles the classic Squeeze network, here the parallel convolution kernels share the same size. On input, the data are folded for one-dimensional convolution, then expanded, and their time series are regressed. The output timing signal group has 256 channels. Finally, the last two layers are LSTM (Long Short-Term Memory) structures with 128 and 32 hidden units.
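As a rough illustration of the parallel one-dimensional convolution stage, the branch structure can be sketched in NumPy. The branch count matches the three-layer parallel structure described above, but the channel widths and kernel size here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def conv1d_valid(x, filters):
    """Apply a bank of 1-D filters to a single-channel signal ('valid' padding).

    x: (length,) signal; filters: (out_channels, kernel) weights.
    Returns an array of shape (out_channels, length - kernel + 1).
    """
    return np.stack([np.convolve(x, f[::-1], mode="valid") for f in filters])

def parallel_conv_block(x, n_branches=3, branch_channels=16, kernel=5, rng=None):
    """Three parallel branches with identically sized kernels, concatenated
    along the channel axis (SqueezeNet-like, but with equal kernel sizes)."""
    if rng is None:
        rng = np.random.default_rng(0)
    outputs = [conv1d_valid(x, rng.standard_normal((branch_channels, kernel)))
               for _ in range(n_branches)]
    # Concatenating branch outputs yields n_branches * branch_channels channels
    return np.concatenate(outputs, axis=0)

signal = np.random.default_rng(1).standard_normal(480)  # one 0.1 s group
features = parallel_conv_block(signal)                  # shape (48, 476)
```

The concatenated feature map would then feed the LSTM layers; in the actual model, the channel widths are chosen so that the output timing signal group has 256 channels.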
4.2. Model Training and Parameters
Dataset: The sampling frequency of the acoustic signal was 4.8 kHz, so 0.1 s signals were intercepted as a group, with each group containing 480 sampling points. A total of 90 groups were intercepted for each defect type. Because the burn-through signal was instantaneous and abrupt to some degree, only 40 burn-through groups were collected. The data for this experiment comprised 310 timing-signal groups, i.e., 148,800 sampling points; 80% were used as the training set and 20% as the testing set.
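The dataset bookkeeping above can be verified with a short script; the 310-group total is consistent with three 90-group defect categories plus the 40 burn-through groups (the per-category breakdown is inferred here, not stated explicitly in this passage):

```python
# Groups of 480 samples each (0.1 s windows at 4.8 kHz); three defect types
# with 90 groups each, plus 40 burn-through groups (instantaneous, so fewer).
sampling_rate_hz = 4800
window_s = 0.1
points_per_group = int(sampling_rate_hz * window_s)   # 480 sampling points

group_counts = [90, 90, 90, 40]                       # inferred breakdown
total_groups = sum(group_counts)                      # 310 timing-signal groups
total_points = total_groups * points_per_group        # 148,800 sampling points

train_groups = int(total_groups * 0.8)                # 80% training set
test_groups = total_groups - train_groups             # 20% testing set
```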
Loss function: Softmax cross-entropy, also known as multi-category categorical cross-entropy loss, is often applied to multi-category tasks:
L = −∑_{i=1}^{K} t_i log(ŷ_i),
where ŷ_i is the softmax output for class i and t_i is the one-hot label that corresponds to the target class.
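With a one-hot label, the categorical cross-entropy reduces to the negative log-probability of the target class, which can be sketched as:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(logits, target):
    """With a one-hot label t, L = -sum_i t_i * log(y_i) reduces to
    -log(y_target), the negative log-probability of the target class."""
    y = softmax(np.asarray(logits, dtype=float))
    return -np.log(y[target])

loss = categorical_cross_entropy([2.0, 1.0, 0.0], target=0)  # ~0.4076
```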
Activation function: In the one-dimensional convolution part of the model, Formula (
16) is used as the activation function.
LSTM performs well in current time series data classification and prediction tasks. The saturated activation function is more suitable for the requirements of LSTM. Formula (
17) is used as the gate activation function for the forgetting gate, input gate, and output gate. When generating candidate memory, the state activation function is Formula (
18).
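The paper's Formulas (17) and (18) are not reproduced in this excerpt; assuming the conventional choices of the logistic sigmoid for the gate activation and tanh for the state activation, a single LSTM cell step can be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W: (4H, X), U: (4H, H), b: (4H,) hold the stacked
    parameters for the input, forget, and output gates and candidate memory."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate   (saturating gate activation)
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate memory (state activation)
    c_new = f * c + i * g        # update cell state
    h_new = o * np.tanh(c_new)   # emit hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
X, H = 8, 4
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H),
                 rng.standard_normal((4 * H, X)) * 0.1,
                 rng.standard_normal((4 * H, H)) * 0.1,
                 np.zeros(4 * H))
```

Because the gate outputs saturate in (0, 1) and the state activation in (−1, 1), the hidden state stays bounded, which is why saturating activations suit LSTM gates.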
Weight initialization: During training, the weights have a significant impact on the network training result. The optimal scheme was obtained by comparing different weight initialization functions. The “Glorot” initializer independently samples weights from a uniform distribution with zero mean and variance 2/(n_in + n_out):
W ~ U(−√(6/(n_in + n_out)), √(6/(n_in + n_out))),
where U(a, b) is the uniform distribution on the interval (a, b) and n_in and n_out are the sizes of the previous and current layers. The “He” initializer independently samples weights from a uniform distribution with zero mean and variance 2/n_in:
W ~ U(−√(6/n_in), √(6/n_in)),
where n_in indicates the current input size. The “Orthogonal” initializer uses the orthogonal matrix obtained by decomposing a random matrix sampled from the normal distribution. “Narrow normal” refers to sampling from a normal distribution with zero mean and a standard deviation of 0.01.
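The four initializers can be sketched as follows; the uniform bounds ±√(6/n) follow from the stated zero-mean variances, since a uniform distribution on (−a, a) has variance a²/3:

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # Zero mean, variance 2/(n_in + n_out)
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

def he_uniform(n_in, n_out, rng):
    # Zero mean, variance 2/n_in
    a = np.sqrt(6.0 / n_in)
    return rng.uniform(-a, a, size=(n_out, n_in))

def narrow_normal(n_in, n_out, rng, std=0.01):
    # Zero mean, 0.01 standard deviation
    return rng.normal(0.0, std, size=(n_out, n_in))

def orthogonal(n, rng):
    # Orthogonal factor of a random normal matrix (QR decomposition)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q
```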
Figure 10 shows the training results under the influence of different initialization weight functions. As can be seen from the figure, the “He” initialization function gave the best training result.
The maximum number of iterations was 100, and the learning rate was 0.005.
The SeCNN-LSTM network was trained as shown in
Figure 11. The target class refers to the predicted classification item, while the output class denotes the true classification item. As noted above, the samples were unbalanced because of the experimental circumstances. Additionally, since the test set was drawn as a random sample from all samples, the samples in the confusion matrix were likewise unbalanced.
Acc = tr(C) / N,
where Acc refers to accuracy; tr(C) is the trace of the confusion matrix C, that is, the sum of its main diagonal elements; and N is the sum of all elements in the matrix. The test set confusion matrix verification accuracy reached 91.0%.
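The trace-over-sum computation can be sketched directly; the confusion matrix below is a hypothetical 3-class example, not the paper's data:

```python
import numpy as np

def confusion_accuracy(C):
    """Accuracy = trace(C) / sum(C) for a confusion matrix C."""
    C = np.asarray(C, dtype=float)
    return np.trace(C) / C.sum()

# Hypothetical 3-class confusion matrix: 26 correct out of 30 samples
acc = confusion_accuracy([[9, 1, 0],
                          [0, 8, 2],
                          [1, 0, 9]])
```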
4.3. Model Comparison
In order to verify the superiority of the model described in this paper, the SeCNN-LSTM model was compared with seven other models using multiple indicators: LSTM, BiLSTM (Bidirectional Long Short-Term Memory), CNN-LSTM, CNN-BiLSTM, GRU (Gated Recurrent Units), CNN-GRU, and CNN-SVM. LSTM and BiLSTM are widely applied, well-developed deep learning models for sequence classification. BiLSTM is composed of two LSTMs, one receiving inputs forwards and the other backwards. GRU is another type of RNN; it has fewer parameters than LSTM and better suits processing-power and time-cost constraints. The goal of this comparison was to demonstrate the benefits of the proposed model among one-dimensional sequential classification networks. The model in this paper has a CNN-RNN structure, and the majority of the compared models are also CNN-RNN structures.
The test set accuracy of each model is shown in
Figure 12. After 100 iterations, the test accuracy of the SeCNN-LSTM model used in this study was 90.99%, while those of CNN-SVM and CNN-BiLSTM were 83.78% and 78.28%, respectively. The test accuracy values of the remaining models were under 80%.
In binary classification models, precision and recall are commonly employed as important metrics to check the model classification outcomes. In the multi-classification model, this idea is expanded to cover the classification of each sample class separately. Precision describes the prediction outcomes: it is the probability that a predicted positive sample is a true positive. Recall describes the original samples: it is the probability that an actual positive sample is predicted to be positive.
As can be seen in
Table 3, the SeCNN-LSTM model classification precision was significantly higher than those of other classification models, with the first three samples being classified correctly with more than 90% precision and the fourth sample being classified correctly with over 85% precision.
Each sample class in the SeCNN-LSTM model had a recall rate of over 88%, significantly greater than those of the other classification models.
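Per-class precision and recall can be computed from the confusion matrix as follows. This sketch assumes rows are true classes and columns are predicted classes; transpose the matrix first if the opposite convention (as in the paper's target/output axes) is used:

```python
import numpy as np

def per_class_precision_recall(C):
    """Per-class precision and recall from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    C = np.asarray(C, dtype=float)
    tp = np.diag(C)                  # true positives per class
    precision = tp / C.sum(axis=0)   # TP / (TP + FP), per predicted class
    recall = tp / C.sum(axis=1)      # TP / (TP + FN), per true class
    return precision, recall

# Hypothetical 2-class confusion matrix, not the paper's data:
p, r = per_class_precision_recall([[8, 2],
                                   [1, 9]])
```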
In actual classification, sample imbalance is inevitable. The ROC (receiver operating characteristic) curve and AUC (area under the curve) indicators are insensitive to sample imbalance and can therefore be used to assess the classification and prediction abilities of a model.
ROC refers to the relationship between sensitivity (recall) and specificity under different classification thresholds. ROC is a concept based on a binary classification model. In a multivariate classification model, each sample needs to be considered separately. As shown in
Figure 13, the ROC curves of the eight classification models are compared. It can be observed that the ROC curve of the SeCNN-LSTM model is smoother and more concentrated toward the upper-left corner of the coordinate plane.
AUC refers to the area enclosed by the ROC curve and the coordinate axes. The closer the AUC value is to 1, the better the model. When the AUC value is less than 0.5, the model is invalid. It can be seen from
Table 4 that the AUC values of all samples of the SECNN-LSTM model are above 0.9, much higher than other models.
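For a single class treated one-vs-rest, the ROC curve and its AUC can be computed from classifier scores and binary labels; this is a minimal sketch that ignores tied scores:

```python
import numpy as np

def roc_curve(scores, labels):
    """One-vs-rest ROC: labels in {0, 1}; higher score = more positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    P = y.sum()                      # number of positives
    N = len(y) - P                   # number of negatives
    tpr = np.concatenate([[0.0], np.cumsum(y) / P])        # sensitivity
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / N])    # 1 - specificity
    return fpr, tpr

def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

fpr, tpr = roc_curve([0.9, 0.8, 0.35, 0.1], [1, 1, 0, 0])
area = auc(fpr, tpr)  # perfectly separated scores give AUC = 1.0
```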
As shown in
Figure 14, the P–R curve refers to the relationship between precision and recall under different classification thresholds. Precision and recall are a pair of conflicting indicators: in an actual sample set, the highest precision and the highest recall cannot be obtained at the same time, so only an ideal balance point can be reached. From the P–R curves of the eight models, it can be observed that the curves of the SeCNN-LSTM model are smoother and more concentrated toward the upper-right corner of the coordinate plane.
The area values enclosed by the P–R curves (P–R AUCs) are shown in
Table 5. The value of each sample of the SeCNN-LSTM model is above 0.83.
The F1-score indicates the best balance between precision and recall. As shown in
Table 6, comparing the F1-score of each sample of the eight models, the model described in this paper showed the best performance.
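The F1-score is the harmonic mean of precision and recall, which is why it captures the balance between the two:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Illustrative values, not the paper's reported figures:
f1 = f1_score(0.90, 0.88)
```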
In summary, we compared the eight models using accuracy, precision, recall, ROC, AUC, the F1-score, and other measures. The findings demonstrate that the model presented in this research outperforms the other one-dimensional sequence data classification models in terms of performance, stability, and sample balance.
5. Conclusions
This research established a digital twin system for robot weld field detection based on acoustic signal processing and analysis. The identification and classification of acoustic signals was performed using an enhanced SeCNN-LSTM deep learning model. The digital twin system incorporates both a deep learning model and a signal processing technique.
The digital twin system modeled in this paper still has limitations. First, although the modeling of rule models was covered at length, the modeling methods of the physical material model and the dynamic model in the digital twin system were not discussed in detail. Second, the deep learning model achieved 91% verification accuracy, which is not the best possible outcome, meaning that there is still room for improvement.
Weld field inspection is both a research focus and a challenge in the welding industry, due to the complicated on-site environment, influencing factors, non-linear welding conditions, and other considerations. Despite having solid timing, the acoustic signal is susceptible to interference. Therefore, future research could focus on multi-sensor or multi-modal weld information detection to reflect the welding conditions using several types of weld information.