Stamping Tool Conditions Diagnosis: A Deep Metric Learning Approach

Abstract: Stamping processes remain crucial in manufacturing; therefore, diagnosing the condition of stamping tools is critical. One of the challenges in diagnosing stamping tool conditions is that, traditionally, the tools must be visually checked, and production must therefore be halted. With the development of Industry 4.0, intelligent monitoring systems have been developed that use accelerometers and algorithms to diagnose the wear classification of stamping tools. Although several deep learning models, such as convolutional neural network (CNN), autoencoder (AE), and recurrent neural network (RNN) models, have demonstrated promising results for classifying complex signals, including accelerometer signals, the practicality of those methods is restricted by limited flexibility in adding new classes and by low accuracy when faced with low numbers of samples per class. In this study, we applied deep metric learning (DML) methods to overcome these problems. DML extracts meaningful features by using feature extraction modules to map inputs into embedding features. We compared the probability method, the contrastive method, and a triplet network to determine which method was most suitable for our case. The experimental results revealed that, compared with the other models, a triplet network can be trained more effectively with limited training data. The triplet network also demonstrated the best results of the compared methods on noised test data. Finally, when tested using an unseen class, the triplet network and the probability method demonstrated similar results.


Introduction
The metal stamping process remains one of the most common processes in manufacturing and is still used by major industries, including the automotive, aerospace, and consumer appliance industries [1]. Therefore, the stamping process must be monitored and diagnosed to ensure that every product meets the required quality standards. A crucial component requiring diagnosis is the tool die, the quality of which can greatly affect the outcome of a product. One of the challenges in diagnosing stamping tool conditions is that, traditionally, the tools must be visually checked, and production must therefore be halted. Following the trend of Industry 4.0, automation in stamping processes has triggered the use of online intelligent condition monitoring systems, which are crucial for improving the productivity and availability of production systems. Today's advanced sensor technology incorporates numerous mechanical properties, such as vibration, strain, and displacement, to monitor the conditions of manufacturing processes [2]. However, acquiring the mechanical properties from the tool is only the beginning of diagnosing its condition: these data must be analyzed and processed before the tool condition can be diagnosed. The advancement of Industry 4.0 has also accelerated research and development in machine learning, which is extremely helpful for analyzing the nonlinear data used to monitor the stamping process.
Traditional signal processing and conventional machine learning methods have been employed in several studies on stamping processes and tool diagnosis.
Training the mapping f under various loss functions and boundaries is the central purpose of metric learning. Because deep learning models can be trained to learn linear or nonlinear problems, they can be used to map data points into a feature space, and the weights and biases in deep learning architectures can be trained using various loss functions incorporating the distance metric (5). The two most common architectures used for DML applications are Siamese neural networks and triplet networks.

Siamese Neural Network
A Siamese neural network [30] uses a single FEM, but it is used to map two data inputs into a feature space. The term "Siamese" reflects the nature of the shared neural network. Figure 1 presents an architecture of a Siamese neural network, where two samples are fed into a network in which two identical CNNs act as an FEM; they are then transformed into a feature space. After a feature representation is created, several methods can be used to train the CNN, namely the probability method [38] and the contrastive method [33].
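The weight-sharing idea can be sketched in a few lines. This is a minimal illustration, not the paper's model: the FEM here is a hypothetical fixed linear map standing in for the CNN, purely to show that one parameter set serves both inputs before the distance metric (5) is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))  # ONE shared parameter set for the FEM

def fem(x):
    """Map a raw input vector into an 8-dimensional embedding."""
    return np.tanh(W @ x)

x1, x2 = rng.standard_normal(64), rng.standard_normal(64)
z1, z2 = fem(x1), fem(x2)           # same network applied to two inputs
distance = np.linalg.norm(z1 - z2)  # Euclidean distance metric, as in (5)
print(z1.shape, z2.shape, distance > 0)
```

Because the two branches are literally the same function, any gradient update moves both "copies" identically, which is what the term "Siamese" captures.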


Probability Method
Suppose we already have feature representations of the inputs (x_1, x_2) extracted using the two FEMs. If the FEM is denoted as a function f, then the distance metric can be obtained using function (5), thus yielding function (6).
The output from the distance metric is then converted into the probability that the two samples belong to the same class. This probability can be computed using the sigmoid function (7). Let t = y(x_1, x_2) be the binary label for inputs x_1 and x_2, with t = 1 if x_1 and x_2 are from the same class and t = 0 otherwise. Because the output is a probability, regularized cross-entropy is used as loss function (8).
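A sketch of the probability method under simplifying assumptions: here the distance in (5)/(6) is taken as the plain L1 distance between embeddings (in practice a learned weighting is typically used), passed through the sigmoid (7), with binary cross-entropy as the loss (8).

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def pair_probability(f1, f2):
    # Distance metric: L1 distance between embeddings (a simplification;
    # the learned component-wise weighting is omitted for brevity).
    d = np.sum(np.abs(f1 - f2))
    return sigmoid(-d)  # small distance -> probability near 1 (same class)

def binary_cross_entropy(p, t, eps=1e-12):
    # Loss (8): t = 1 for a same-class pair, t = 0 otherwise.
    return -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

f_same = (np.array([1.0, 2.0]), np.array([1.1, 1.9]))
f_diff = (np.array([1.0, 2.0]), np.array([5.0, -3.0]))
p_same = pair_probability(*f_same)
p_diff = pair_probability(*f_diff)
print(p_same > p_diff)  # the closer pair gets the higher same-class probability
```

Training then reduces to minimizing the cross-entropy over labeled pairs, which pushes p toward 1 for same-class pairs and toward 0 for different-class pairs.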

Contrastive Method
The contrastive method minimizes the metric distance between inputs of the same class and dissociates inputs of different classes. It still uses distance metric function (6), but instead of being used to activate another function, it is used directly in the loss function. The contrastive loss obtained using function (9) forces the distance of a positive pair toward zero and pushes a negative pair apart by a degree of margin α, where α is the margin applied when the inputs are from different classes.
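The two branches of the contrastive loss (9) can be written directly; this is a standard formulation under the assumption of a Euclidean distance and a squared hinge, which may differ in detail from the paper's equation.

```python
import numpy as np

def contrastive_loss(f1, f2, same_class, alpha=1.0):
    """Contrastive loss (9): pull same-class pairs toward zero distance,
    push different-class pairs beyond margin alpha."""
    d = np.linalg.norm(f1 - f2)
    if same_class:
        return d ** 2                # positive pair: minimize the distance
    return max(0.0, alpha - d) ** 2  # negative pair: hinge on the margin

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.1])  # close neighbour
c = np.array([3.0, 4.0])  # far neighbour
print(contrastive_loss(a, b, True))   # small: the pair is already close
print(contrastive_loss(a, c, False))  # zero: already beyond the margin
```

Note the asymmetry: positive pairs are penalized for any residual distance, while negative pairs contribute nothing once they are at least α apart.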

Triplet Network
Figure 2 presents a triplet network architecture. In this architecture, three identical CNNs are used as three FEMs; therefore, the weights, biases, and other parameters of the three CNNs are identical. A triplet datum X_t is used as an input, and the given datum contains three sets of samples, namely anchor samples x_a, positive samples x_p, and negative samples x_n. The x_a and x_p samples are from the same class, whereas the negative samples x_n are from a different class than the x_a samples. The purpose of triplet learning is to train the FEM (CNN) so that it maps to a pseudometric space in which positive pairs are close and negative pairs are far apart (Figure 3). The FEMs in the triplet training phase map a data input into the embedding f(x) ∈ R^n, which is a representation of a Euclidean space of n dimensions. With (5), a distance metric can be calculated for a positive pair, as in (10), and for a negative pair, as in (11).

According to [39], the loss function for a triplet network using the positive and negative pair distances is given in (12), where α is the margin added to the negative pair distance. This margin maintains a separation between the positive and negative groups, enabling the loss to push the negative group over the margin and away from the positive group.
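The triplet loss (12) combines the positive-pair distance (10) and negative-pair distance (11) under a hinge; the sketch below assumes squared Euclidean distances, as in [39].

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss (12): the positive-pair distance (10) should be smaller
    than the negative-pair distance (11) by at least the margin alpha."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared distance, positive pair (10)
    d_an = np.sum((f_a - f_n) ** 2)  # squared distance, negative pair (11)
    return max(0.0, d_ap - d_an + alpha)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negative = np.array([2.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0: constraint satisfied
```

When the negative sample already lies beyond the margin, the loss is exactly zero and the triplet contributes no gradient, which motivates the triplet selection discussed next.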

Triplet Selection
Schroff et al. [34] noted a problem with generating all possible triplets: many may easily satisfy function (12). If these "easy triplets" constitute most of the samples in the training data, the result is slower convergence; therefore, selecting appropriate triplet data is crucial.
One method for ensuring fast convergence is to select hard triplets, that is, those that violate the constraint in function (12). However, "easy" and "hard" triplets must first be defined.
An easy triplet (13) already fulfills the equation, and the model exerts less effort on learning. However, hard triplets (14) place the negative pair closer to the anchor than they place the positive pair, creating difficulty for the model in terms of learning.
Another type of triplet is the semihard triplet (15), in which the negative pair distance is not smaller than the positive pair distance but falls within the margin, that is, between the positive pair distance and the positive pair distance plus margin α. Figure 4 illustrates the differences between the types of triplets; therein, the triplets are compared in terms of the distance between the negative sample and the anchor. Each triplet type has a different effect on model training: if the training batch contains an excessive number of easy triplets, the model does not learn effectively, whereas an excessive number of hard triplets generates a high loss and assigns excessively high weights to mislabeled data.
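The three conditions (13)-(15) can be checked directly from the two distances; the sketch below assumes squared Euclidean distances and illustrates one triplet of each kind.

```python
import numpy as np

def triplet_category(f_a, f_p, f_n, alpha=0.2):
    """Classify a triplet per conditions (13)-(15) using squared distances."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    if d_an > d_ap + alpha:  # (13) margin already satisfied
        return "easy"
    if d_an < d_ap:          # (14) negative closer than positive
        return "hard"
    return "semihard"        # (15) between d_ap and d_ap + alpha

a, p = np.zeros(2), np.array([1.0, 0.0])              # d_ap = 1
print(triplet_category(a, p, np.array([2.0, 0.0])))   # d_an = 4      -> "easy"
print(triplet_category(a, p, np.array([0.5, 0.0])))   # d_an = 0.25   -> "hard"
print(triplet_category(a, p, np.array([1.05, 0.0])))  # d_an = 1.1025 -> "semihard"
```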
Schroff et al. [34] also described offline triplet mining, in which sets of triplets are generated before training. This method requires less effort, but it may generate only easy or only hard triplets, which would necessitate the time-consuming process of manual data processing. Online triplet mining instead feeds a batch of training data, generates triplets using all the samples in the batch, and then calculates the loss for every batch. This approach increases the number of easy, hard, and semihard triplets included in every training batch.
Figure 4. Easy, hard, and semihard triplet illustration: a represents an anchor sample and p represents a positive sample; a hard triplet selects a negative sample in the hard region, a semihard triplet selects a negative sample in the semihard (α) region, and an easy triplet selects a negative sample in the easy region.



Hard Triplet Soft Margin
Hermans et al. [40] proposed a soft margin to replace the hinge function [α + •]+ inside the triplet loss function (12). To avoid overcorrection, the hinge is replaced with the softplus function ln(1 + exp(•)), whose numerically stable practical implementation is log1p. They argue that continuing to pull together samples from the same class can be beneficial in their case: the softplus decays slowly (exponentially) rather than cutting off abruptly at margin α.
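The soft-margin substitution is a one-line change to the triplet loss; the sketch below assumes squared Euclidean distances and uses `log1p(exp(.))` for the softplus, as described above.

```python
import numpy as np

def soft_margin_triplet_loss(f_a, f_p, f_n):
    """Soft-margin variant [40]: the hinge [alpha + .]_+ in (12) is replaced
    by the softplus ln(1 + exp(.)), implemented stably via log1p."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return np.log1p(np.exp(d_ap - d_an))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
# Unlike the hinge, the loss never becomes exactly zero: it keeps shrinking
# the positive-pair distance, with an exponentially decaying gradient.
print(soft_margin_triplet_loss(a, p, n) > 0.0)
```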

Dataset
The data set used in the current study was extensively used in our previous study [29]. It contains progressive stamping die vibration signals acquired using an accelerometer with a sampling rate of 25.6 kHz and an axis parallel to the stroke direction of the stamping machine. The stamping machine used (LCP-60H, Ingyu Machinery, Taiwan) had a capacity of 60 tons and an automatic sheet metal feeder. The sheet material used was SPCC steel with a thickness of 1.5 mm.
Three locations on the tool die, as illustrated in Figure 5, were examined for two degrees of wear: mild and heavy. One set of healthy-condition samples was used as a reference.
In total, seven classes of wear were included in the stamping tool condition data set, as explained in Table 1. Data preprocessing was conducted for each vibration sample. First, each data sample was converted from the time domain to the frequency domain (up to 12.8 kHz). Second, each converted sample was normalized. The data transformation is presented in Figure 6.
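The two preprocessing steps can be sketched as follows. This is an illustrative reconstruction: the signal here is random stand-in data, `rfft` yields the one-sided spectrum up to the Nyquist frequency (12.8 kHz for the 25.6 kHz sampling rate), and min-max normalization is an assumption since the exact normalization is not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 25_600                       # accelerometer sampling rate, 25.6 kHz
signal = rng.standard_normal(fs)  # stand-in for one 1 s vibration sample

# Step 1: time domain -> frequency domain; the one-sided spectrum covers
# frequencies up to the Nyquist limit fs/2 = 12.8 kHz.
spectrum = np.abs(np.fft.rfft(signal))

# Step 2: normalize the converted sample (min-max scaling assumed here).
normalized = (spectrum - spectrum.min()) / (spectrum.max() - spectrum.min())
print(normalized.min(), normalized.max())  # 0.0 1.0
```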



One-Shot K-Way Testing
For every test data sample x_i ∈ X, a support set S consisting of K test samples was created, in which only one sample, x_j ∈ X, was from the same class as x_i. We then placed x_j randomly inside S. Because every DML configuration is different, the probability method was used with a Siamese neural network to calculate the accuracy of each DML, where y is a distinct label for each data sample in support set S. Subsequently, each test sample can be classified using probabilistic function (7), in which the highest value indicates the support sample most similar to test data sample x. The accuracy is then calculated for the given test data set X ∈ R^(N×D), where N is the size of the test data set and D is the dimension of data point x_i.
In the case of a Siamese neural network with contrastive loss and the triplet network, the smaller the distance between samples, the more similar they are. Therefore, substituting (5) into (17) yields (19).
The accuracy can then be calculated as (20).
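One trial of the distance-based variant of this test can be sketched as follows. The FEM, support samples, and class labels below are hypothetical; the prediction rule is the nearest support sample in embedding space, per (19), and accuracy (20) is the fraction of correct trials.

```python
import numpy as np

def one_shot_k_way_trial(fem, x_test, support_set, support_labels, true_label):
    """One trial of one-shot K-way testing with a distance-based model:
    the support sample closest in embedding space is the prediction (19)."""
    z = fem(x_test)
    dists = [np.linalg.norm(z - fem(s)) for s in support_set]
    predicted = support_labels[int(np.argmin(dists))]
    return predicted == true_label

# Hypothetical identity FEM and a 3-way support set, one sample per class.
fem = lambda x: x
support = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([-5.0, 2.0])]
labels = ["healthy", "mild_wear", "heavy_wear"]
hits = [one_shot_k_way_trial(fem, np.array([0.2, -0.1]), support, labels, "healthy")]
accuracy = np.mean(hits)  # (20): fraction of correct trials over the test set
print(accuracy)
```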

1D CNN Architecture
Zhang et al. [38] proposed a CNN with a wide first-layer kernel (WDCNN) to extract features from roller bearings. In their study, they used a Siamese network with a probability output. Their argument for using a wide first-layer kernel was that a small kernel could be disturbed by high-frequency noise. In this study, we use a normal-kernel 1D CNN instead of the WDCNN because our input is not time based. Figure 7 shows our proposed architecture.

Model Performance According to The Number of Training Samples
The five models were evaluated using different numbers of training samples to simulate the lack of training data observed in real-world stamping process scenarios. Each class was evaluated according to three sample sets, namely 100, 180, and 280 (all data) samples. These sets were then divided into training and test sets containing 60% and 40% of the samples, respectively. Each class sample set was randomly sampled five times, and each random sample was trained and tested four times. In total, every class sample set underwent 20 training processes, each of which generated a new model. This procedure was intended to mitigate randomness. The procedure is illustrated in Figure 9.
Figure 10 presents the performance of each loss function. The x-axis of Figure 10 represents the total number of samples for the training and test sets. One-shot ten-way testing was conducted to evaluate the test set. As illustrated in Figure 10, the triplet loss function yielded the most favorable results, with greater than 99% accuracy for the hard, semihard, and hard-soft-margin batches. The binary cross-entropy loss function yielded the second-best results, in which accuracy increased concurrently with an increasing number of training samples. The contrastive max-margin function yielded the least favorable results, with 95.56% accuracy when training was conducted with all available samples.
Figure 10 also presents the standard deviation of each calculation; the triplet loss function yielded the highest accuracy and exhibited the lowest standard deviation, which decreased as the number of training samples increased. All loss functions exhibited high standard deviations when trained using the lowest number of training samples, with the contrastive max-margin and binary cross-entropy functions exhibiting the highest standard deviations in testing accuracy.
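The evaluation protocol above (three sample-set sizes, a 60/40 split, five random draws, four trainings each) can be sketched as bookkeeping code; the training itself is elided, and the index array stands in for one class's samples.

```python
import numpy as np

rng = np.random.default_rng(2)
all_samples = np.arange(280)        # indices of one class's 280 samples

runs = []
for class_size in (100, 180, 280):  # the three evaluated sample sets
    for resample in range(5):       # 5 random draws per sample set
        chosen = rng.choice(all_samples, size=class_size, replace=False)
        split = int(0.6 * class_size)  # 60% training / 40% test split
        train, test = chosen[:split], chosen[split:]
        for repeat in range(4):     # each draw trained and tested 4 times
            runs.append((class_size, resample, repeat, len(train), len(test)))

# 5 draws x 4 trainings = 20 models per class sample set
counts = {s: sum(1 for r in runs if r[0] == s) for s in (100, 180, 280)}
print(counts)  # {100: 20, 180: 20, 280: 20}
```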
To determine the efficacy of each loss function to enable each feature extractor (FE) to distinguish between different classes, embedding projections were produced for every FE (Figure 11).
Each model was trained and tested using all available samples, and the results supported the previous results presented in Figure 10. Compared with untrained FEs, all models trained using the loss functions exhibited some degree of improvement, but the results varied somewhat. In particular, the max-margin loss function provided the least distinguished groupings for each class in comparison with the other loss functions; that is, the class groupings appeared scattered. In addition, the max-margin loss function was the least accurate when trained and tested with all available samples (Figure 10). However, the binary cross-entropy loss function provided much more distinguished embeddings than the max-margin loss function, grouping the samples clearly and exhibiting a 2.81% increase in accuracy over the max-margin loss function. The triplet loss function exhibited the most favorable results, with small variations in accuracy among the different batch strategies.


Model Performance under Noised Test Samples
In this experiment, we evaluated the robustness of each method to the ever-changing conditions of mechanical environments by adding Gaussian noise to the test sets. The signal-to-noise ratio (21) measures the power ratio of a signal relative to the noise applied to it; in our case, we applied a noise power higher than the signal power (−2 dB and −4 dB, respectively) to simulate high-noise conditions.
SNR_dB = 10 log10(P_signal / P_noise) (21)
As in the previous experiment, we used 100, 180, and 280 (all data) samples per class for the training and test sets (Figure 12).
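Injecting Gaussian noise at a target SNR follows directly from (21): solve for the noise power and scale unit-variance noise accordingly. The sinusoidal test signal below is an arbitrary stand-in for a vibration sample.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Add Gaussian noise so that SNR_dB = 10*log10(P_signal / P_noise), per (21).
    A negative snr_db (e.g. -2 or -4) makes the noise power exceed the signal power."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(signal.shape) * np.sqrt(p_noise)
    return signal + noise

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 25_600))
noisy = add_noise_at_snr(clean, snr_db=-2.0, rng=rng)
measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(measured, 1))  # close to -2.0
```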

The results (Figure 13) indicate the accuracy of each loss function. In general, for all loss functions, the accuracy increased and the standard deviation decreased as the number of training samples increased. The triplet loss function exhibited the most favorable results for the −2 dB signal-to-noise ratio (Figure 13a), with the semihard, hard, and hard-soft-margin batch strategies achieving accuracies of 96.0%, 95.94%, and 95.75% for the highest number of training samples and 93.86%, 94.60%, and 94.64% for the lowest number of training samples, respectively.
Figure 13. Accuracy of the max-margin, binary cross-entropy, and triplet semihard, hard, and hard-soft-margin losses when tested with the −2 dB (a) and −4 dB (b) signal-to-noise ratio test sets.
The binary cross-entropy function did not exhibit an increase in accuracy when tested using 100 and 180 samples per class for the training and test sets, and it exhibited a high standard deviation with the lowest number of training samples per class even though it achieved higher accuracy (81.03% for 60 samples) than with 80 samples per class, indicating that the model had low precision. The max-margin loss function achieved the lowest accuracy (72.14%), even with the highest number of training samples per class. However, with the lowest numbers of training samples per class (60 and 180), it did not exhibit a high standard deviation, despite the FEM not being able to extract the most meaningful features.
The triplet loss function exhibited a drop in accuracy of 9-10% for the −4 dB signal-to-noise ratio test set compared with its accuracy on the −2 dB test set (Figure 13a). The binary cross-entropy loss function exhibited a high standard deviation in accuracy when trained with 60 samples per class, and its accuracy dropped to 57.21%. The max-margin loss function yielded the lowest accuracy of all the loss functions, and at the low signal-to-noise ratio (−4 dB), it exhibited a high standard deviation in accuracy when tested 20 times for each training set.

Performance under New Classes
In this experiment, we evaluated a simulated scenario in which a new class could be recognized by the model without retraining. We evaluated all loss functions by using test sets combined with classes unseen by the model during training. The unseen classes were randomly chosen, and the percentages of unseen classes in the test sets were 20% and 40% of the total number of samples in each set, respectively. Additionally, no noise was added to the training or test sets, as illustrated in Figure 14.
Notably, when minibatches were generated for the samples in the test sets, the unseen class was used as an anchor and employed for the target samples. Essentially, the model compared more than 20% and 40% of seen and unseen samples per test set, respectively.
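One way an embedding-based system can flag samples from classes not seen during training is nearest-prototype matching with a distance threshold: a test embedding is assigned to the closest class prototype, or rejected as unseen when no prototype is close enough. The sketch below is our own illustration of this idea; the prototype dictionary, threshold, and function name are assumptions, not the paper's exact procedure.

```python
import numpy as np

def classify_with_rejection(embedding, class_prototypes, threshold):
    """Assign an embedding to the nearest class prototype, or flag it
    as "unseen" when no prototype lies within `threshold`.

    `class_prototypes` maps label -> mean embedding of that class's
    training samples.
    """
    best_label, best_dist = None, np.inf
    for label, proto in class_prototypes.items():
        d = np.linalg.norm(embedding - proto)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > threshold:
        return "unseen", best_dist
    return best_label, best_dist
```

Because classification reduces to distance comparisons, a new wear class can be added by computing one more prototype, without retraining the feature extractor.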

The results (Figure 15a) revealed the accuracy of each loss function tested using the 20% unseen class test set. The triplet loss and binary cross-entropy functions achieved similar accuracies, 80.44% and 80.31%, respectively. However, these accuracies were achieved with 60 and 108 training samples per class, not 168 samples. We suspect that the model was able to generalize the training samples but was not fully able to recognize the unseen class.
In addition, even when it was trained with a higher number of training samples, the FE still was unable to learn essential features; the high number of samples in the test class resulted in low accuracy for 280 samples per class because the model had to recognize more unseen classes. For the test set with 40% unseen classes (Figure 15b), the FE exhibited a lower ability to extract meaningful features when tested with a high number of test samples; moreover, even though the standard deviation decreased with increases in the number of training samples, the accuracy also decreased.

Figure 15. Accuracy of the max-margin, binary cross-entropy, and triplet semihard, hard, and hard-soft-margin loss functions with test sets containing 20% (a) and 40% (b) unseen classes.

Conclusions
This study presents a stamping tool condition diagnosis method based on DML. Several DML methods were compared to determine which one was the most suitable for stamping tool condition diagnosis. The probability method employs binary cross-entropy, the contrastive method employs contrastive max-margin loss, and the triplet network method employs three batch-generation strategies (semihard, hard, and hard-soft-margin). The main contributions of this study are as follows. First, we compared the methods using several types of evaluations. Second, we evaluated the methods by using various numbers of training samples, and the results revealed that the triplet network was the most accurate, followed by the probability and contrastive methods. Third, we evaluated the methods by using a noised test data set, and in this experiment the triplet network also demonstrated the most favorable results, followed by the probability and contrastive methods. Finally, we evaluated each method in terms of its ability to recognize new classes. The triplet and probability methods, which achieved similar results, exhibited the best performance, followed by the contrastive method.
In general, the triplet network provided the most favorable results overall, and was most suitable for stamping tool condition diagnosis. However, when subjected to new classes, triplet networks may not be able to provide sufficient accuracy when used with the number of data samples employed in the present study. This problem may be mitigated with additional data.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because further studies will be carried out using the same data.

Conflicts of Interest:
The authors declare no conflict of interest.