A Fast Accurate Attention-Enhanced ResNet Model for Fiber-Optic Distributed Acoustic Sensor (DAS) Signal Recognition in Complicated Urban Environments

Abstract: The fiber-optic distributed acoustic sensor (DAS), which utilizes existing communication cables as its sensing media, plays an important role in urban infrastructure monitoring and natural disaster prediction. To face the wide, dynamic environment of urban areas, a fast, accurate DAS signal recognition method is proposed with an end-to-end attention-enhanced ResNet model. In preprocessing, an objective evaluation method is used to compare the distinguishability of different input features via the Euclidean distance between the posterior probabilities of correctly and incorrectly classified samples; then, an end-to-end ResNet is optimized with the chosen time-frequency feature as input, and a convolutional block attention module (CBAM) is added, which quickly focuses on key information from different channels and specific signal structures and further improves recognition performance. The results show that the proposed ResNet+CBAM model has the best performance in recognition accuracy, convergence rate, generalization capability, and computational efficiency compared with the 1-D CNN, 2-D CNN, ResNet, and 2-D CNN+CBAM. An average accuracy above 99.014% is achieved in field testing; when dealing with multiple scenes and inconsistent laying or burying environments, the accuracy remains above 91.08%. The time cost is only 3.3 ms per signal sample, which makes the method well suited to online, long-distance, distributed monitoring applications.

Among these networks, CNN is a typical representative due to its outstanding local feature extraction ability with its convolution blocks [42]. However, it also has some common problems, such as gradient disappearance/explosion and local convergence in the training of deep networks. Thus, various CNN network structures have been developed, and a performance comparison based on the ImageNet data set is shown in Figure 1b. For example, the deep residual network, ResNet, was proposed by He et al. [43] in 2015 to eliminate redundant layers autonomously, which greatly alleviates the problem of gradient disappearance. In addition, it has the best performance in terms of top-5 error and top-1 accuracy (top-1 accuracy is the proportion of samples whose true label matches the class with the maximum output probability of the model; top-5 error is the proportion of samples whose true label is not among the five most probable predicted classes). Simultaneously, the importance of the attention mechanism has also been noticed by researchers [44][45][46]. In 2019, Chen et al. [40] proposed a long short-term memory (LSTM) model with an attention mechanism, called ALSTM, for signal recognition in DAS. However, its attention is realized by a weighted sum of different features, which cannot be updated automatically. Moreover, the feature extraction of ALSTM is time-consuming and laborious, resulting in poor real-time performance. At present, it is still challenging to find an accurate DAS recognition method with good generalization and high computational efficiency in the face of a wide, dynamic environment in urban areas. Therefore, a fast, accurate, attention-enhanced ResNet model for DAS signal recognition is proposed in this paper. The main contributions include: (1) An end-to-end ResNet network is proposed for DAS signal recognition. The recognition accuracy is further improved by 1.3%, and the time for each signal sample is reduced by 40% compared to the common 2-D CNN network, which shows that both the training and the online test processes speed up through the residual blocks in ResNet. Furthermore, the generalization capability is significantly improved in more challenging, atypical, and inconsistent signal recognition under multi-scenario conditions. (2) An attention module with a CBAM is added to the ResNet network, which enables the model to focus on the key features of the signal quickly and automatically through both local structure and channel attention. This highlights the difference between significant structural information and channel information, thus achieving a high recognition rate of 99.014% for four typical DAS events in urban communication cable monitoring. (3) The effectiveness of different methods of extracting deep features is evaluated via the Euclidean distance between the posterior probabilities classified correctly and incorrectly, calculated in a matrix. It assumes that when the Euclidean distance between the posterior probabilities classified correctly and incorrectly for one type of sample is larger, the feature distinguishability is stronger. In this way, different models' feature extraction capabilities can be measured by an objective parameter rather than only by classification accuracy.

Data Collection with DAS
The typical structure of DAS and its working principle in the safety monitoring of communication cables buried in urban areas are shown in Figure 2. It usually directly takes in a dark fiber from the communication cable laid under urban ground. The system hardware consists of three parts: the detection cable, the optical signal demodulation equipment, and the signal processing host. The optical signal demodulation equipment mainly includes optical devices and electrical devices. A continuous, coherent optical signal is generated by an ultra-narrow linewidth laser, and the optical pulse signal is modulated by an acousto-optic or electro-optic modulator. The optical pulse signal is amplified by an erbium-doped fiber amplifier (EDFA), and the amplified optical pulse signal is injected into the detection cable through an isolator and a circulator in turn. As the light pulse propagates along the fiber-optic cable, it produces Rayleigh scattering; the backscattered light travels back along the cable and is received by the circulator. Then, the phase change in the coherent Rayleigh backscattering light wave carrying the vibration information is linearly demodulated by an imbalanced Mach-Zehnder fiber interferometer (MZI) and a 3 × 3 coupler [48], as shown in Figure 2. Each point in the fiber is equivalent to a sensor node, and these distributed nodes cooperate to sense vibration signals along the whole line. Thus, the system returns a space-time matrix as

$$X = [x(t,s)]_{T \times S}, \qquad t = 1, 2, \ldots, T; \; s = 1, 2, \ldots, S,$$

where the row index, t, represents the time; T is the time length; the column index, s, denotes the spatial sampling node; and S is the spatial width. The spatial interval between every two nodes is ∆S, and the temporal interval is ∆T = 1/f_s, in which f_s is the sampling frequency. One data column represents the temporal signal collected at a sampling node.
In the cable monitoring task, four frequently encountered events are included: background noises, traffic interference, manual digging, and mechanical excavations, which are labeled as 0 to 3, respectively. In the recognition, a 1-D temporal signal with a certain length at the event's location is selected from the time-space matrix and built into the data sets. The flowchart of the proposed attention-enhanced ResNet model is illustrated in Figure 3, which includes three stages: data preprocessing, the construction of the attention-enhanced ResNet network, and offline training and online testing of the network.
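To make the sampling concrete, the following sketch (our own illustration; the function name and array layout are assumptions, not from the paper) pulls one labeled sample out of the space-time matrix X:

```python
import numpy as np

def extract_sample(space_time: np.ndarray, node: int, t_start: int,
                   fs: int = 500, duration_s: float = 10.0) -> np.ndarray:
    """Select the 1-D temporal signal at an event's spatial node.

    space_time: (T, S) matrix returned by the DAS system; column s holds
    the temporal signal collected at spatial sampling node s.
    """
    n_samples = int(duration_s * fs)  # e.g., 10 s at 500 Hz -> 5000 points
    return space_time[t_start:t_start + n_samples, node]
```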

Data Preprocessing
In the data preprocessing, each signal sample is converted into a time-frequency spectrogram with the short-time Fourier transform (STFT). In detail, the STFT is realized on the DAS signal, x(n), as

$$X(m,k) = \sum_{n=0}^{L-1} x(n + mR)\, w(n)\, e^{-j 2 \pi k n / n_{FFT}}, \qquad m = 1, 2, 3, \ldots,$$

where w(n) is a rectangular window of length L used to obtain the windowed data frame for the STFT, R is the hop size of the window, and mR (m = 1, 2, 3, ...) is the moving location of the windowed data, as the window "slides" or "hops" over time. The window length, L; the hop size, R; and the FFT points, n_FFT, are three critical parameters that need to be carefully chosen in applications, or they will influence the quality of the spectrogram and its time consumption.

When the database is ready, the STFT, the gray-image conversion, and the clipping of the spectrogram are carried out successively. In this application, the sampling rate is 500 Hz, and the sample length is 10 s. To ensure the resolution of the spectrogram in the STFT, a boxcar window of 95 data points (about 0.2 s) is chosen, the hop size is one sample (2 ms), and the FFT size is equal to the window length. The time-frequency matrix is built in a linear way, without the logarithmic operation. To alleviate the computational load of the following network, the obtained time-frequency spectrogram is converted into a gray image with gray levels of 0-255, clipped, and then downsampled into a smaller size of 50 (frequency axis) × 100 (time axis): the 50 rows cover the clipped frequency range of 0-125 Hz, and the 100 columns cover the 10 s sample length. The purpose of clipping is to reduce the image dimension; the resulting 2-D 50 × 100 time-frequency matrix is the input of the following ResNet network.
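As a minimal sketch of this preprocessing chain, assuming SciPy and NumPy (the function name, the linear-interpolation resize, and the per-image normalization are our choices, not specified in the paper):

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def spectrogram_50x100(x, fs=500, win_len=95, hop=1, f_clip=125.0,
                       out_shape=(50, 100)):
    """STFT -> linear magnitude -> 0-255 gray image -> clip -> 50 x 100."""
    f, t, Z = stft(x, fs=fs, window='boxcar', nperseg=win_len,
                   noverlap=win_len - hop, nfft=win_len, boundary=None)
    mag = np.abs(Z)                    # linear scale, no logarithm
    mag = mag[f <= f_clip, :]          # keep only the 0-125 Hz band
    gray = 255.0 * (mag - mag.min()) / (np.ptp(mag) + 1e-12)  # gray levels
    factors = (out_shape[0] / gray.shape[0], out_shape[1] / gray.shape[1])
    return zoom(gray, factors, order=1).astype(np.uint8)
```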

Attention-Enhanced ResNet Network Construction
To adapt to the time-frequency characteristics of the DAS signals, the basic residual block in ResNet is composed of convolution layers and rectified linear unit (ReLU) layers. Assuming the input of the l-th residual block is x_l, its output x_{l+1} is formalized in one of the following two ways. If x_l and F(x_l, W_l) have the same dimension,

$$x_{l+1} = f\left(x_l + F(x_l, W_l)\right);$$

otherwise,

$$x_{l+1} = f\left(W_s x_l + F(x_l, W_l)\right),$$

where F(x_l, W_l) is the residual function; W_l is the weight parameter of F(x_l, W_l); f(·) is the ReLU; and W_s is a linear mapping, which can be performed through a shortcut connection to match their dimensions. Further, in the residual block, a convolutional block attention module (CBAM) [49] is added, as shown in Figure 3, at the position selected from Table 1 to secure the highest recognition performance and training efficiency. CBAM deduces the attention map from the intermediate feature map, F, along the channel and the local structure dimensions in turn. Specifically, the attention map is multiplied by the input feature map for adaptive feature refinement, emphasizing the meaningful features along the two main dimensions of channel and local structure in the convolution process. The calculation process of the refined feature maps F′ and F″ is then

$$F' = M_C(F) \otimes F, \qquad F'' = M_S(F') \otimes F',$$

where F′ is the feature map output after the channel attention module; ⊗ represents element-wise multiplication; M_C is the channel attention map; and M_S is the local structure attention map. M_C and M_S are calculated specifically as

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big),$$

$$M_S(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')])\big),$$

where σ represents the Sigmoid function; MLP represents the multilayer perceptron; and f^{7×7} represents a convolution operation with a filter size of 7 × 7.
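In PyTorch, the residual block with a CBAM placed after its second convolution (the Case III layout discussed later) can be sketched as follows; the channel counts and 3 × 3 kernels are illustrative assumptions, since Table 3's exact structural parameters are not reproduced here:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_C: shared MLP over average- and max-pooled channel descriptors."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))       # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))        # MaxPool branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """M_S: 7 x 7 convolution over stacked channel-wise avg/max maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x          # F' = M_C(F) (x) F
        return self.sa(x) * x       # F'' = M_S(F') (x) F'

class ResidualCBAMBlock(nn.Module):
    """Two 3x3 convs + CBAM, with an identity or 1x1 (W_s) shortcut."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.cbam = CBAM(out_ch, reduction)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Identity() if in_ch == out_ch \
            else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        out = self.cbam(out)                     # attention after second conv
        return self.relu(out + self.shortcut(x))
```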

Training of the Network
Similar to CNN, ResNet+CBAM is generally trained through feedforward, backpropagation, and iterative parameter updating.
Initialization: Typical parameters of weight, W, and bias, b, are initialized using the Xavier initialization method, and the uniform distribution range is chosen as

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}},\; \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right],$$

where n_in is the number of input parameters and n_out is the output parameter number.

Convolution layer:
The convolution is calculated as

$$Y_j = \sum_i X_i * W_j, \qquad L_{out} = \frac{L - m + 2p}{s} + 1,$$

where X_i is the i-th input; W_j is the weight matrix after initialization of the j-th convolution kernel; m is the kernel size of the convolution; s is the stride; p is the padding; L is the input sequence length; and L_out is the output length. Then, the convolution output is activated by ReLU.

Pooling layer: The pooling is calculated as

$$Y(k) = \max_{0 \le u < m} X(k \cdot s + u - p),$$

where s is the stride and p is the padding, and the maximum is taken over each pooling window of size m.

Fully connected layer: The final classification output, y, is obtained as

$$y = \mathrm{Softmax}(W x + b),$$

where x is the input, W is the weight matrix, and b is the bias. Specifically, a dropout layer is added to avoid the overfitting phenomenon.

Loss function: Cross entropy is used as the loss function to train the whole ResNet+CBAM network:

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\big[y_n \ln a_n + (1 - y_n)\ln(1 - a_n)\big],$$

where N is the batch size, or the sample number in the data batch for training; y is the true label of the sample; and a is the predicted label of the sample.
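A compact training sketch consistent with the steps above, assuming PyTorch (the Adam optimizer, learning rate, and loader interface are our assumptions; note that nn.CrossEntropyLoss combines softmax with the cross-entropy loss):

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Xavier uniform initialization for convolution and fully connected layers
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def train(model, loader, epochs=10, lr=1e-3, device='cpu'):
    model.to(device).apply(init_weights)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()          # over the four event classes
    for _ in range(epochs):
        model.train()                        # enables the dropout layer
        for x, y in loader:                  # x: (N, 1, 50, 100) spectrograms
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)      # feedforward + cross entropy
            loss.backward()                  # backpropagation
            opt.step()                       # iterative parameter update
    return model
```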

Field Data Collection and Preprocessing
The data were collected by the DAS system shown in Figure 2 to monitor underground communication cables in two cities in China: one 34 km underground cable located in Wuhan, Hubei Province, and one 18 km underground cable located in Tongren, Guizhou Province; the event data from the two cities were mixed. The key data collection parameters for the two monitoring fields were the same. The selected cables were buried 0.8-1.5 m underground; the linewidth of the laser was 5 kHz; the pulse width was 200 ns, which corresponds to a spatial resolution (gauge length) of 20 m; the temporal sampling rate of the system was 500 Hz; and its spatial sampling interval was 5.16 m. These real event data were collected at different locations along the whole line, although the mechanical excavation and manual digging events occurred at only two or three selected locations along each line. The data were collected over a period of months, not in a single day: the test in Tongren lasted from February 2019 to March 2019, while the test in Wuhan lasted from April 2019 to May 2019. Over these months, the database accumulated data from different weather conditions, different times, and different optical fiber locations. This diversity ensures that the database has good generalization ability to support our results and conclusions.
Here, the four typical events stated above are chosen as the vibration targets to be identified, and the original signals and STFT time-frequency diagrams for these four events are illustrated in Figure 4.
In the preprocessing, using the training set in Table 2, the STFT and Mel-frequency cepstral coefficient (MFCC) methods are used to mine the time-frequency characteristics of the signals, and the Euclidean distance method is used to compare the distinguishability of the two time-frequency features. MFCC has a higher resolution in the lower-frequency part, which can help to distinguish the changing background, traffic, manual digging, and mechanical excavation, which are all concentrated in the low-frequency band of 0-150 Hz. In detail, the basic steps of MFCC include Mel frequency conversion and cepstral analysis. The most critical steps are as follows: 26 triangular filter banks are used to filter the power spectrum estimate, a logarithm is taken, and a discrete cosine transform (DCT) is performed to obtain the Mel-frequency cepstrum; coefficients 2 to 13 (12 numbers) are retained to obtain the MFCC features.
The distinguishability of the two types of time-frequency features is compared using the Euclidean distance index, defined as the distance between the posterior probabilities classified correctly and incorrectly over all the test samples, which is calculated as

$$D = \sqrt{\sum_{j=1}^{N_{test}} \left(p_i^j - p_m^j\right)^2} \qquad (14)$$

where N_test is the test sample number; j is the sample index; i and m represent the true and wrong labels, respectively; p_i^j is the posterior probability when sample j is correctly identified; and p_m^j is the posterior probability when it is incorrectly classified. In this way, the distances between the different posterior probability curves can be measured objectively. It is assumed that when the distance between the posterior probabilities classified correctly and incorrectly for one type of sample is larger, the feature distinguishability is stronger.
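One plausible implementation of this index, as a sketch assuming NumPy (the pairing of each sample's true-class posterior with its largest wrong-class posterior is our interpretation of i and m; the paper's exact Equation (14) aggregation may differ):

```python
import numpy as np

def euclidean_distinguishability(probs, labels):
    """Euclidean distance between correctly and incorrectly assigned
    posteriors over the test set.

    probs:  (N_test, n_classes) posterior probabilities (softmax outputs).
    labels: (N_test,) true class indices.
    """
    n = len(labels)
    p_true = probs[np.arange(n), labels]          # p_i^j, correct posterior
    masked = probs.copy()
    masked[np.arange(n), labels] = -np.inf
    p_wrong = masked.max(axis=1)                  # p_m^j, wrong posterior
    return float(np.sqrt(np.sum((p_true - p_wrong) ** 2)))
```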
According to Equation (14), the average Euclidean distances of the STFT and MFCC features can be calculated, along with the time cost of the time-frequency conversion for a single sample, as shown in Figure 5. The average distance of the STFT features is larger, indicating that STFT features have better distinguishability than MFCC features. Meanwhile, the time cost of MFCC feature extraction is three times longer than that of STFT. As a result, STFT is chosen for time-frequency feature extraction in this application, and the constructed database is detailed in Table 2. Each sample lasts 5 s. The training, validation, and test data are divided randomly according to a 6:2:2 ratio, with no duplication between them.

Realization and Optimization of the Proposed ResNet+CBAM Network
In this section, we use the training set and validation set in Table 2 to realize and optimize the proposed network model. For the residual network (ResNet), there are some commonly used structures, such as ResNet18, ResNet34, ResNet50, ResNet101, and so on, in which the number denotes the number of layers in the network. DAS signal recognition differs from multi-class (>10), multi-scene image recognition: its data scenes are relatively simple, but it requires higher real-time performance. Therefore, we do not consider the deeper networks and choose ResNet18, with its good real-time performance, as the basic architecture.
However, in the DAS signal recognition, to choose a proper number of residual blocks, different numbers of residual blocks for the training and validation are compared in Figure 6a,b with the same parameters. The training epoch is set to 10. The error bars for the means and standard deviations (3σ) of the training time and validation accuracy are obtained. It can be seen that, with the increase in the number of residual blocks, the training time and the validation accuracy both keep increasing, while the increasing rate of the validation accuracy decreases when the number exceeds two. Thus, in the proposed model in Figure 3, two residual blocks are used.

As shown in Figure 3, there are two convolution layers and two residual blocks, and each residual block itself contains two convolution layers. By definition, the CBAM only works when added after a convolution layer, so its optimal location must be found. Thus, three possible positions of the CBAM are compared: in Case I, the CBAM is outside the residual blocks, after Conv1 and Conv2, respectively; in Case II, it is located after the first Conv in each of the two residual blocks; and in Case III, it is located after the second Conv in each of the two residual blocks. These three cases can be roughly divided into two categories: adding a CBAM to a convolution layer outside the two residual blocks (Case I) or inside them (Cases II and III). The mean and standard deviation (3σ) error bars for the total training time and the validation accuracy of the three network structures are shown in Figure 6c,d. The attention module performs better inside the residual block than outside it, with higher validation accuracy and less training time. Further, the validation accuracy of 98.90% in Case III is higher than that in Case II, which suggests that the two residual blocks play different roles: the first residual block possibly mines more common features, while the second one mines more specific features.
Meanwhile, in Figure 6, the training time of ResNet without a CBAM is 40.29 ms, while ResNet with a CBAM takes about 44.22 ms, indicating that the added CBAM takes less than 4 ms. It means the CBAM is a lightweight module that has little influence on the time cost of the whole network. Thus, the structure of Case III is selected.
In addition, the reduction number is a hyperparameter of the CBAM, which sets the ratio of input to output channels in its MLP and needs to be chosen carefully through experimentation. In this paper, reduction numbers of 16, 8, 4, 2, and 1 are tested and compared on the validation set. The default is 16, which is suitable for data sets with large image and sample sizes. The validation accuracies with reduction numbers 16, 8, 4, 2, and 1 are 97.92%, 98.12%, 98.90%, 98.75%, and 98.36%, respectively, showing that the best number is 4, which is more suitable for the DAS data. Therefore, we set this hyperparameter to 4 in subsequent experiments.

Performance Evaluation of the Proposed ResNet+CBAM
In this section, five models are compared to evaluate the recognition performance: the 1-D CNN [33], the 2-D CNN, ResNet, 2-D CNN+CBAM, and the proposed ResNet+CBAM.

Firstly, the average loss curves of cross-entropy for the five models in the training process are comparatively obtained, as detailed in Figure 8a,b. In terms of convergence rate, ResNet is better than the 2-D CNN; there is little difference between the 2-D CNN and the 1-D CNN; the two networks with CBAMs are both better than those without them; and ResNet+CBAM is the best. Further, the training and validation processes are compared in Figure 8c,d. From the training and validation accuracies in Figure 8a,b, ResNet+CBAM has the best training accuracy, followed successively by 2-D CNN+CBAM, ResNet, 2-D CNN, and 1-D CNN; and the 2-D CNN is the first to overfit, followed by the 1-D CNN, 2-D CNN+CBAM, and ResNet, while ResNet+CBAM is the last. This highlights the better anti-overfitting ability of the residual blocks in ResNet and ResNet+CBAM.

Table 3. Structural parameters of the ResNet+CBAM network (Layers; Kernel Size/Stride/Padding; Input Size).

In order to optimize the structure and hyperparameters of the model and to evaluate the generalization ability of the five models, the 10-fold cross-validation method is used. The specific steps are as follows: firstly, the training and validation sets in Table 2 are pooled and then randomly divided into 10 equal subsets. The 10 subsets are traversed successively; each time, the current subset is taken as the validation set, and all the other subsets are taken as the training set to train and evaluate the model. Finally, the average of the 10 evaluation indexes is taken as the final evaluation index. The ten-fold cross-validation of the five models is illustrated in Figure 9, where ResNet+CBAM behaves the best, with an average accuracy of 99.014% for the four events in Table 2, indicating that the generalization ability of this model is superior to that of the other four models.
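This procedure maps onto a standard K-fold loop; a minimal sketch assuming scikit-learn's KFold (build_model, fit, and score are hypothetical wrappers around the training code above):

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(build_model, X, y, k=10):
    """Pool train+validation data, split into k equal subsets, train on
    k-1 and validate on the held-out one, then average the k scores."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])      # hypothetical wrapper
        scores.append(model.score(X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```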
After training and cross-validation, we selected the optimal model of each of the five models and their training parameters for the test phase to test the data in the test set. The confusion matrices, ROC curves, and the calculated performance indices are illustrated in Figures 10 and 11 and Table 5. It can be seen that ResNet+CBAM and 2-D CNN+CBAM are both better than ResNet and the 2-D CNN without the CBAMs, which means that attention plays an important role and can improve the performance of the network further. With a CBAM, the average recognition accuracy of the 2-D CNN increases from 96.98% to 98.79%, an increase of 1.81%, and the convergence rate is improved according to Figure 8. Similarly, with a CBAM, the average recognition accuracy of ResNet increases from 97.11% to 98.89%, by 1.78%, and the convergence rate also improves. The performance of the two networks with the attention mechanism is significantly better than that of those without it. Finally, with the same epoch number and reduction number, the average recognition accuracy of ResNet+CBAM reaches 98.89%, which is better than the 98.79% of 2-D CNN+CBAM, while its convergence rate is much better than that of 2-D CNN+CBAM. It can also be seen from the ROC curves of the test phase that ResNet+CBAM has the largest AUC of 0.9870, which is better than the other four models: 2-D CNN+CBAM, ResNet, 2-D CNN, and 1-D CNN.
In addition, ResNet is slightly better than the 2-D CNN, and then the 1-D CNN, which also indicates that 2-D CNN features extracted from the time-frequency spectrograms are more comprehensive than the 1-D CNN features only extracted in the time domain.
In addition, it can be seen from Figure 10 that traffic interference (Label 1) and mechanical excavation (Label 3) are easily confused. The reason may be that the vibration sources of traffic interference and mechanical excavators are both time-varying, and in some time periods their signals are difficult to distinguish. As such, the recognition of these two types of events is more challenging.

The Computation Efficiency of the Proposed Method
The time cost of the five models for one test sample on average is compared in Figure 12. It shows that ResNet is the fastest, followed successively by the 1-D CNN, the 2-D CNN, ResNet+CBAM, and 2-D CNN+CBAM. In more detail, the 1-D CNN is faster than the 2-D CNN because of its simpler structure, and the recognition time of ResNet is 1.5 ms, which is faster than the 2.5 ms of the 2-D CNN for one sample. Furthermore, due to the addition of the attention mechanism module (CBAM), the recognition times of 2-D CNN+CBAM and ResNet+CBAM are 18.5 ms and 3.3 ms, respectively, which are longer than those of the 2-D CNN and ResNet. Overall, the proposed ResNet+CBAM has obvious advantages in online real-time processing, which are important in practical applications.
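Per-sample timings like those above can be reproduced roughly as follows (PyTorch assumed; the warm-up and averaging protocol are our own, not the paper's):

```python
import time
import torch

@torch.no_grad()
def avg_inference_ms(model: torch.nn.Module, n_runs: int = 100) -> float:
    """Average per-sample CPU inference time in milliseconds."""
    model.eval()
    x = torch.randn(1, 1, 50, 100)     # one 50 x 100 spectrogram sample
    model(x)                           # warm-up pass
    t0 = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    return 1000.0 * (time.perf_counter() - t0) / n_runs
```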

The Challenging Test Case in Fields
When dealing with different scenes and inconsistent laying or burying environments, a network's generalization and robustness are important. Thus, two challenging data sets are further constructed, in which only the inconsistent and atypical samples in the database are selected and tested. Since it has the worst recognition performance and its one-dimensional time series contains less feature information, the 1-D CNN is excluded from this test. The other models are compared with the data presented in Table 6. Here, the "typical and inconsistent" data set is composed of balanced, hand-selected samples that are typical but inconsistent, i.e., inconsistent samples of the same type of test events. The inconsistent signals mainly refer to signals whose amplitudes are inconsistent due to different acting forces or varying burial conditions, but whose time-domain evolution is basically the same, or not distorted. Furthermore, the "atypical and inconsistent" data set contains only atypical and inconsistent samples, which mainly refer to signals that have a distorted shape or an inconsistent evolution law to the human eye, due to a different acting period or other unpredictable interfering factors for the same type of event. The two inconsistent data sets stand for extremely challenging cases.
The test results for the two inconsistent data sets are compared in Figure 13. ResNet's average recognition performance is superior to that of the 2-D CNN on both the typical and inconsistent data set and the atypical and inconsistent data set, and the two networks with attention modules are better than those without them; all four performance indices remain above 91.08% for ResNet+CBAM, which is better than ResNet, then 2-D CNN+CBAM, and then the 2-D CNN. On the whole, the results on the atypical and inconsistent data set are worse than those on the typical and inconsistent data set for all the models, which is consistent with the field complexity of the two actual cases. This shows that the proposed ResNet+CBAM has the best generalization capability in this challenging case.

Conclusions
In this paper, a novel recognition method is proposed for DAS by using an end-to-end, attention-enhanced ResNet model. The effectiveness of the STFT and MFCC time-frequency features is compared, and the better one is chosen as the input of the network. The field test results show that the proposed ResNet+CBAM model performs best in recognition accuracy, convergence rate, generalization capability, and computational efficiency among the five models, namely the 1-D CNN, 2-D CNN, ResNet, 2-D CNN+CBAM, and ResNet+CBAM. In particular, its generalization ability and time efficiency are remarkable, which makes it very promising for online, long-distance, distributed monitoring applications. In the future, small training samples and imbalanced data sets will be tested, which are also challenging in practice.