Article

An Event Recognition Method for a Φ-OTDR System Based on CNN-BiGRU Network Model with Attention

by Changli Li 1,*, Xiaoyu Chen 1 and Yi Shi 2
1 School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 College of Engineering, Shantou University, Shantou 515063, China
* Author to whom correspondence should be addressed.
Photonics 2025, 12(4), 313; https://doi.org/10.3390/photonics12040313
Submission received: 13 March 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025
(This article belongs to the Special Issue Distributed Optical Fiber Sensing Technology)

Abstract:
The phase-sensitive optical time domain reflectometry (Φ-OTDR) technique offers a method for distributed acoustic sensing (DAS) systems to detect external acoustic fluctuations and mechanical vibrations. By accurately identifying vibration events, DAS systems provide a non-invasive solution for security monitoring. However, limitations in temporal signal analysis and the lack of spatial features significantly impact classification accuracy in event recognition. To address these challenges, this paper proposes a network model for vibration-event recognition that integrates convolutional neural networks (CNNs), bidirectional gated recurrent units (BiGRUs), and attention mechanisms, referred to as CNN-BiGRU-Attention (CBA). First, the CBA model processes spatiotemporal matrices converted from raw signals, extracting low-level features through convolution and pooling. Subsequently, features are further extracted and separated along both the temporal and spatial dimensions. In the spatial-dimension branch, horizontal convolution and pooling generate enhanced spatial feature maps. In the temporal-dimension branch, vertical convolution and pooling are followed by BiGRU processing to capture dynamic changes in vibration events from both past and future contexts. Additionally, the attention mechanism focuses on extracted features in both dimensions. The features from the two dimensions are then fused using two cross-attention mechanisms. Finally, classification probabilities are output through a fully connected layer and a softmax activation function. In the experimental simulation section, the model is validated using real-world data. A comparison with four other typical models demonstrates that the proposed CBA model offers significant advantages in both recognition accuracy and robustness.

1. Introduction

Distributed acoustic sensing (DAS) systems can transform pre-installed optical fibers into continuously distributed sensors [1,2]. One key implementation is the phase-sensitive optical time domain reflectometer (Φ-OTDR), which injects narrow-pulse laser light into optical fibers and monitors the backscattered Rayleigh signals returning continuously along the fiber. Any disturbance along the fiber, such as vibrations, pressure, or temperature changes, alters the phase and intensity of the scattered signals, creating disturbance data reflected in time-series patterns [3,4,5]. DAS systems based on Φ-OTDR achieve high sensitivity and accuracy in detecting vibration events [6,7]. These systems boast advantages such as strong real-time capabilities, high spatial resolution, and extensive coverage [8,9], enabling the effective monitoring of vibrations, stress, and temperature changes over long distances [10,11]. They play a significant role in applications like security monitoring [12,13], pipeline protection [14,15], and seismic hazard analysis [16,17].
With the widespread adoption of such systems, their hardware has become increasingly advanced, and the quality of collected data has significantly improved. The current research focuses on enhancing detection and recognition capabilities, particularly through optimizing data post-processing and recognition algorithms for DAS. Among post-processing techniques, event recognition algorithms are crucial. By analyzing the characteristics of variations within the fiber data, these algorithms can identify and locate different disturbance events, ensuring precise perimeter security along the fiber. In recent years, numerous recognition algorithms based on distributed fiber-optic sensing have emerged, achieving innovative progress and providing essential methodological references for signal-feature extraction and analysis. However, the limitations of traditional machine learning algorithms [18,19,20] in terms of adaptability and feature utilization have gradually become apparent, with recognition rates and processing times approaching bottlenecks. To overcome these challenges, researchers have begun incorporating deep learning models. For example, one-dimensional (1D) bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN) models have been combined to effectively extract the contextual features of 1D signals, achieving high recognition accuracy [21]. Long short-term memory (LSTM) with attention mechanisms has been employed to focus on extracting temporal features of 1D data, demonstrating an advantage in identifying critical features [22].
In two-dimensional (2D) feature extraction, spatiotemporal matrices and 2D feature extraction strategies have been introduced, incorporating temporal and spatial dimensions into modeling and thus enhancing the handling of complex vibration signals [23,24]. One-dimensional signals have been reconstructed into two-dimensional images and analyzed for spatiotemporal correlation features using CNNs, achieving notable improvements in recognition accuracy and noise handling [25,26]. CNNs and LSTM have been further combined to balance efficiency and accuracy in the spatiotemporal analysis of 2D images [27,28]. The potential of three-dimensional (3D) modeling across the temporal, spatial, and frequency domains has been explored [29].
These studies have analyzed 1D signals in depth and explored classification in the temporal, spatial, and frequency domains. They have also transformed signals into 2D images for spatiotemporal modeling and even ventured into extracting features from 3D information. However, these methods often face performance limitations due to insufficient feature extraction or interference from excessive redundant information. On the one hand, algorithms based on 1D signals [21,22] make limited use of spatial information, making it difficult to comprehensively capture spatiotemporal features. On the other hand, studies based on 2D signals [23,24,25,26,27,28] have enhanced spatiotemporal modeling capabilities but still struggle to capture the correlation between temporal and spatial features in depth. Moreover, 3D modeling approaches [29], while expressing richer features, face application challenges due to data redundancy and high computational complexity. Room for optimization therefore remains in improving recognition accuracy and model efficiency, as well as in handling background noise and similar disturbance events.
This paper integrates CNNs and bidirectional gated recurrent units (BiGRU) to extract both temporal and spatial features from distributed fiber vibration signals, fully leveraging the temporal evolution patterns and spatial distribution characteristics of the signals. Attention mechanisms are employed to focus on critical features, coupling spatiotemporal features to further enhance the model’s robustness and adaptability.

2. Experimental Setup

2.1. The Principle of Φ-OTDR

DAS systems based on Φ-OTDR detect vibration events along the fiber’s perimeter by measuring phase changes in light waves propagating through the fiber. Φ-OTDR continuously emits laser pulses into the connected optical fiber through a laser source. Due to the presence of inhomogeneous scatterers within the fiber, a portion of the incident laser pulses undergo scattering, with the backpropagated Rayleigh-scattering light forming the basis of DAS measurements. When a specific point along the fiber experiences external disturbances, such as strain, the refractive index of the fiber at that point changes, resulting in a corresponding phase shift in the backscattered light. By analyzing the phase information of the coherent Rayleigh-scattered light from various points along the fiber, the demodulator extracts strain or strain rate information. Measuring these phase changes enables the identification of vibration events along the fiber [3,4,5]. The vibration response of the fiber at position z and time t can be expressed as follows:
\Delta\Phi(z,t) = \alpha \cdot A_v(z,t)
where A_v(z,t) represents the amplitude caused by external vibrations or acoustic waves, and α is the sensitivity coefficient of the system. Generally, the scattered echo of light is extremely weak, but Φ-OTDR can precisely measure the echo signal through phase changes:
E_r(z,t) = E_0(z) \cdot e^{\,j(\omega t - kz + \Delta\phi(z,t))}
Here, E_0(z) is the baseline amplitude of the scattered light within the fiber, while ω and k denote the angular frequency and wavenumber (propagation constant) of the light wave, respectively.

2.2. DAS System Construction

We developed a DAS system for vibration-event recognition, as shown in Figure 1. The system consists of modules for optical signal generation, transmission, sensing, signal detection, and data analysis. These components work in unison to enable the efficient sensing of vibration signals. First, a highly coherent continuous laser signal is generated via a narrow linewidth laser (NLL) with a bandwidth of 3 kHz. An acousto-optic modulator (AOM) modulates the signal into high-temporal-resolution optical pulses and controls the repetition rate. The pulses are then amplified using an erbium-doped fiber amplifier (EDFA) to enhance transmission capacity. The modulated optical pulses are injected into a 1 km-long sensing fiber via a circulator. The sensing fiber is composed of G652 single-mode fiber reinforced with protective materials. As the optical pulses propagate through the fiber, they interact with microscopic inhomogeneities within the fiber core, resulting in backscattered Rayleigh signals. The circulator’s optical isolator directs these backscattered signals to a photodetector (PD) for high-speed acquisition. The signals are recorded via a data acquisition card (DAC) with a sampling rate of 50 MHz and are subsequently transferred to a PC for further analysis and processing. We further collected six types of vibration signals using buried optical fibers to simulate common perimeter events. These events include clear weather background noise, rainy weather background noise, walking, jumping, digging with a shovel, and striking with a shovel. The six events were simulated at the same location along the sensing fiber to test the proposed network model’s ability to recognize vibration events. The details of the simulations are as follows: background noise was collected under clear and rainy conditions without external disturbances (No.I and No.II); walking was simulated using an operator moving at 1.2 m/s along the fiber (No.III); jumping was simulated at a frequency of one jump per second (No.IV); digging was simulated by an operator using a shovel near the sensing fiber (No.V); and striking was simulated by intermittently hitting the area near the sensing fiber with a shovel (No.VI). Some events, such as clear- and rainy-weather background noises or walking and jumping, exhibit similar characteristics, adding complexity to event classification.

2.3. Data Preprocessing

Traditional 1D signal-processing methods involve limitations in extracting features across the temporal and spatial dimensions. Recent advancements in 2D image classification algorithms offer new possibilities for vibration-event detection. By transforming signals into 2D images, the horizontal direction represents spatial-dimension information, while the vertical direction corresponds to temporal-dimension information. This joint representation of temporal and spatial data allows advanced 2D classification algorithms to analyze Φ-OTDR signals, significantly improving the accuracy and efficiency of vibration-event detection [25,26,28,30,31].
In the DAS system that was constructed, detection pulses with a repetition rate of 20 kHz and a pulse width of 100 ns are used for signal acquisition. The temporal waveforms of the vibration signals are shown in Figure 2.
The raw waveform data reflect the Rayleigh backscattering trace (RBT), which encapsulates the Rayleigh backscattering intensity at different positions along the fiber. The raw waveform is segmented into equal time intervals, with each segment forming a row of the spatiotemporal matrix. Within each row, the entries correspond to different positions along the sensing fiber, while moving down a column steps through successive time points. Specifically, the column vectors in the matrix indicate the variation of Rayleigh-backscattered intensity over time at specific spatial locations.
To reduce noise and enhance the recognizability of vibration signals, a band-pass filter with a range of 50 Hz to 5 kHz is applied to the spatiotemporal data. Subsequently, vibration-related data from a 32-m spatial region surrounding the event, collected within 1 s, are extracted and normalized into grayscale images, yielding image samples of vibration events, as shown in the first row of Figure 3.
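For illustration, the band-pass filtering, cropping, and grayscale normalization steps described above can be sketched in Python as follows. The sampling constants, the assumed spatial sampling interval, and the function name to_grayscale_sample are illustrative placeholders, not the authors' implementation.

```python
# Minimal preprocessing sketch (assumptions: 20 kHz pulse repetition rate,
# 50 Hz-5 kHz band-pass along the time axis, and a hypothetical 1 m spatial
# sampling step; not the authors' code).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 20_000           # temporal sampling rate = pulse repetition rate (Hz)
SPATIAL_STEP_M = 1.0  # assumed spatial sampling interval (m)

def to_grayscale_sample(st_matrix, event_col, window_m=32, duration_s=1.0):
    """st_matrix: spatiotemporal matrix, rows = time (pulses), columns = fiber positions."""
    # 1) band-pass filter each spatial channel along the time axis
    b, a = butter(4, [50 / (FS / 2), 5_000 / (FS / 2)], btype="band")
    filtered = filtfilt(b, a, st_matrix, axis=0)

    # 2) crop a 1 s × 32 m patch centred on the event position
    n_rows = int(duration_s * FS)
    half_cols = int(window_m / SPATIAL_STEP_M) // 2
    patch = filtered[:n_rows, event_col - half_cols:event_col + half_cols]

    # 3) normalise to an 8-bit grayscale image
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-12)
    return (patch * 255).astype(np.uint8)
```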

2.4. Data Augmentation

In this study, to enhance the diversity of the optical fiber-vibration dataset and improve the accuracy of event recognition, we adopted a method combining additive noise and weighted averaging for data augmentation. Given that operations such as cropping and rotating could disrupt the spatiotemporal integrity of the data, we opted for a more conservative approach to simulate random environmental disturbances. The specific procedure involves creating new images through weighted averaging (randomly selecting two images from the original dataset) and adding Gaussian noise. The formula for generating enhanced images follows:
I = \alpha \times I_1 + (1 - \alpha) \times I_2 + N(0, \sigma^2)
where I represents the enhanced image, I_1 and I_2 are the two input images, α is the weighting coefficient controlling the fusion ratio of the two images, and N(0, σ^2) represents Gaussian noise with a mean of 0 and variance σ^2, used to simulate random environmental interference. In the technical implementation, we merged two images from the original dataset with a weighted average and added Gaussian noise with an intensity of 0.05 to mimic potential noise disturbances encountered in real-world applications.
This method, by overlaying multiple images, helps offset random errors, making the generated images visually smoother and more realistic. Using this technique, we expanded the original set of 600 images to 1800 images, as partially exemplified in the second row of Figure 3. This data augmentation strategy not only increases the dataset’s diversity but also enhances the model’s adaptability to environmental noise, thereby effectively improving recognition performance.
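A minimal sketch of this augmentation rule is given below, assuming a uniformly sampled fusion ratio α (the paper does not specify its distribution) and the stated noise intensity of 0.05.

```python
# Data augmentation sketch following Equation (3): weighted blend of two
# samples plus additive Gaussian noise; the alpha range is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def augment(img1, img2, sigma=0.05):
    img1 = img1.astype(np.float32) / 255.0
    img2 = img2.astype(np.float32) / 255.0
    alpha = rng.uniform(0.3, 0.7)                         # assumed fusion-ratio range
    blended = alpha * img1 + (1 - alpha) * img2           # weighted average of the two images
    noisy = blended + rng.normal(0.0, sigma, img1.shape)  # N(0, sigma^2) noise
    return np.clip(noisy * 255.0, 0, 255).astype(np.uint8)
```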

3. Fundamental Theory of Neural Network Architecture

3.1. CNN

For optical-fiber sensing data presented as 2D images, using a convolutional neural network (CNN) is ideal. CNNs effectively capture both local details and global features, enhancing the model’s ability to distinguish between different vibration events. The convolutional and pooling layers reduce dimensionality and extract features, revealing spatial differences in activities such as walking, jumping, or digging. Combined with fully connected layers for classification, CNNs provide an efficient solution for feature extraction and pattern recognition in fiber-vibration signals.
The convolutional layer applies a filter (kernel) to the input data. For a 2D image X and a kernel K, the output is given by the following:
Y = X * K = \sum_{i,j} X(i,j) \cdot K(i,j)
Here, the kernel size defines the filter's local receptive field, while the stride controls the step by which the filter slides across the input and thus how quickly the receptive field expands in deeper layers. By tuning both, the network can better capture local or global features.
The pooling layer downsamples the feature map to reduce computation and overfitting while preserving key features. Common methods include max and average pooling. For a window size of w × w , max pooling is defined as follows:
Y_{i,j} = \max_{k,l} X_{i+k,\, j+l}
where (i, j) denotes the starting position of the pooling window on the feature map, and k, l are the indices within the window. Average pooling is defined as follows:
Y_{i,j} = \frac{1}{w \times w} \sum_{k=0}^{w-1} \sum_{l=0}^{w-1} X_{i+k,\, j+l}
The fully connected layer integrates the features from previous layers to produce the final output. Given an input, X, a weight matrix, W, bias, b, and the activation function, f, its output is as follows:
Y = f(WX + b)
For multi-class classification, the Softmax activation function converts raw scores into a probability distribution:
\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
where z_i is the score for the i-th class, and n is the total number of classes. The model assigns the class with the highest probability as the final classification result.
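Purely for illustration, the following Keras snippet stacks the operations just described (convolution, max and average pooling, a fully connected layer, and a softmax output); the layer sizes are placeholders and do not correspond to the CBA configuration.

```python
# Illustrative CNN stack: convolution, pooling, fully connected layer, softmax.
from tensorflow.keras import layers, models

def tiny_cnn(input_shape=(64, 64, 1), n_classes=6):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, kernel_size=3, activation="relu"),   # convolution with a learned kernel
        layers.MaxPooling2D(pool_size=2),                      # max pooling
        layers.Conv2D(32, kernel_size=3, activation="relu"),
        layers.AveragePooling2D(pool_size=2),                  # average pooling
        layers.Flatten(),
        layers.Dense(64, activation="relu"),                   # fully connected layer: f(WX + b)
        layers.Dense(n_classes, activation="softmax"),         # softmax over class scores
    ])
```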

3.2. BiGRU

In optical-fiber sensing and similar applications, BiGRU demonstrates clear advantages due to its simple structure, high computational efficiency, fewer parameters, strong bidirectional information integration, and good model stability [32]. Compared to models such as BiLSTM and transformers [33,34], BiGRU is more suitable for handling medium- to short-length sequences and small datasets. It ensures the accurate extraction of temporal features while reducing computational overhead and improving training and inference efficiency. Therefore, BiGRU is particularly well suited for modeling complex time series data like optical-fiber sensing.
A BiGRU is a neural network architecture specifically designed for sequential data. Its constituent GRUs employ reset and update gates together with a candidate hidden state, and the bidirectional arrangement enables the network to incorporate past information while anticipating future trends. This allows for a more comprehensive modeling of temporal dynamics. The structure of the GRU is illustrated in Figure 4.
The reset gate r_t in a gated recurrent unit (GRU) adjusts the contribution of the previous hidden state h_{t-1} at the current time step, enabling the model to decide whether to ignore historical information based on the relevance of the current input x_t. The output of the reset gate is given by the following:
r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
where t denotes the time step in the input sequence, W_r is the weight matrix of the reset gate, b_r is the bias term, and σ is the sigmoid function, which keeps the output of the reset gate between 0 and 1:
\sigma(x) = \frac{1}{1 + e^{-x}}
The update gate z_t in the GRU determines how much of the previous state information should be retained. It helps the model balance between remembering and forgetting. The output of the update gate is given by the following:
z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
The candidate hidden state \tilde{h}_t combines the current input with the historical information adjusted by the reset gate, ensuring that the model takes the updated history into account. The output is as follows:
\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)
The final hidden state h_t is updated by combining the results of the update gate and the candidate hidden state, which allows the GRU to dynamically update the hidden state based on the current input while maintaining the necessary historical information and integrating new observations. The output is as follows:
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
For each time step of the input, the outputs of the forward GRU and the backward GRU are concatenated to form the bidirectional GRU (BiGRU), as shown in Figure 5. Unlike the unidirectional GRU, the BiGRU considers both past and future information at each time step of the time series, enabling bidirectional information integration. This bidirectional structure allows the BiGRU to capture a more comprehensive understanding of the global context in the sequence data, enhancing the model’s ability to comprehend time-dependent patterns.
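A sketch of a single GRU step written directly from the gate equations above, together with the Keras bidirectional wrapper that concatenates the forward and backward outputs, is shown below; weight shapes and unit counts are illustrative assumptions.

```python
# One GRU step from the reset/update-gate equations, plus a Keras BiGRU layer.
import numpy as np
import tensorflow as tf

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, br, Wz, bz, W, b):
    xh = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    r_t = sigmoid(Wr @ xh + br)                        # reset gate
    z_t = sigmoid(Wz @ xh + bz)                        # update gate
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]) + b)  # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_cand         # new hidden state h_t

# BiGRU: forward and backward GRU outputs are concatenated at every time step
bigru = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(64, return_sequences=True), merge_mode="concat")
```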

3.3. Attention Mechanism

In recent years, self-attention and cross-attention mechanisms have gained widespread application in neural network models. In fiber-optic sensing applications, the self-attention mechanism effectively processes and focuses on the time dependencies and spatial relationships within sequence data, making it an ideal choice for handling high-dimensional and complex data. Additionally, the cross-attention mechanism significantly enhances the model’s ability to capture complex relationships through the interaction and integration of information across data streams. This approach enriches feature representation by incorporating information from diverse data sources, enabling the model to not only handle single data types but also adapt to multiple sources. This strategy greatly improves the model’s versatility and flexibility, allowing it to comprehensively understand and utilize all available data, thereby achieving more accurate monitoring and analysis in practical applications.
In the attention mechanism, the query represents the focus of the current task, the key represents the identity of the input information, and the value represents the actual content. Through the application of linear transformations to the input sequence, X, the query, key, and value vectors can be obtained, respectively:
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
Then, the attention score is computed by calculating the similarity between the query vector, Q, and the key vector, K:
\mathrm{AttentionScore} = Q \cdot K^T
The attention score is normalized using the Softmax function and applied to the value vector V to obtain the final output:
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) \cdot V
where d_k represents the dimensionality of the key vector. The result of QK^T is scaled during attention-score calculation to stabilize the gradient. The specific details of the attention mechanism are shown in Figure 6.
The cross-attention mechanism is a variant of the attention mechanism, commonly used in multimodal tasks. The key idea is to interact elements from one sequence with elements from another sequence when calculating attention weights. The query sequence Q comes from one input source, while the key, K, and value, V, come from another input source. Their outputs are as follows:
Q = X W_Q, \quad K = Y W_K, \quad V = Y W_V
Subsequent calculations follow the same process as the attention mechanism described above.
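A compact TensorFlow sketch of scaled dot-product self-attention and of a cross-attention layer whose query comes from one sequence and whose key and value come from another is given below; the projection size d_model is a placeholder.

```python
# Scaled dot-product attention and a simple cross-attention layer.
import tensorflow as tf

def scaled_dot_attention(Q, K, V):
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)  # Q K^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, V)

class CrossAttention(tf.keras.layers.Layer):
    """Query projected from sequence X; key and value projected from sequence Y."""
    def __init__(self, d_model=64):
        super().__init__()
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

    def call(self, x, y):
        return scaled_dot_attention(self.wq(x), self.wk(y), self.wv(y))
```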

4. The Proposed CBA Model

4.1. Overall Architecture

The network architecture consists of the spatial feature extraction module, the temporal feature extraction module, the spatiotemporal cross-attention module, the temporal–spatial cross-attention module, and the self-attention module. These modules work collaboratively to enhance the model’s performance in vibration-event classification. An overview of the proposed CBA model architecture is shown in Figure 7.
The grayscale images obtained from preprocessing are fed into the two convolution layers (Conv) and max-pooling layer (MaxPool) of the CNN to extract low-level features and perform dimensionality reduction. The outputs are then sent to the spatial feature extraction module and the temporal feature extraction module for feature extraction in the spatial and temporal dimensions. Subsequently, their outputs pass through the spatiotemporal cross-attention module and the temporal–spatial cross-attention module. The output features are concatenated and fused through a fully connected layer, with the final classification result determined by the Softmax function, which selects the category corresponding to the maximum component.

4.2. Space-Domain Feature Extraction Module

The design of the spatial feature extraction module aims to extract critical spatial feature information from the dimension-reduced input data. This module combines convolutional layers, pooling layers, and self-attention mechanisms to progressively extract spatial features from local to global levels, with a focus on key regions. Ultimately, it generates high-information-density and highly expressive spatial feature representations.
In the preliminary processing stage, the model uses specifically designed pooling layers to effectively reduce the data dimensions while preserving spatial integrity in the horizontal direction. The key innovation is the use of asymmetric pooling (2 × 1), which focuses on compressing information along the vertical direction (similar to the time axis) and avoids destroying the resolution of the horizontal direction (spatial features). This design considers the importance of preserving horizontal spatial distribution for spatial feature extraction. By compressing only the vertical direction, the model efficiently reduces computational costs while maintaining the integrity of spatial distribution. In the later stage of feature extraction, the model applies specially designed convolution operations (2 × 3) to further explore spatial features in the horizontal direction. This convolution operation uses a 2 × 3 window that slides horizontally to capture higher-order relationships between local spatial regions in the feature map. Compared to traditional symmetric convolutions, asymmetric convolutions provide a more precise way to capture the interactions and aggregations of features in the horizontal direction. The operation is illustrated in Figure 8a. To further reduce data dimensionality and extract higher-level features, the model introduces a pooling layer (2, 2) for global downsampling in the subsequent stages. By alternating between convolution and pooling layers, the model gradually transitions from low-level features to high-level features. In the initial layers, the model mainly captures simple edge, texture and basic features, while deeper layers learn more complex spatial relationships and semantic information. Finally, through high-level feature representations, the model not only understands the global spatial layout but also integrates local detail features, achieving a comprehensive understanding of the spatial structure of the data.
To enhance the feature representation ability, the model introduces a self-attention mechanism to the extracted spatial features. The attention mechanism calculates the correlation between spatial features and assigns higher weights to important areas, highlighting key information. Specifically, the input features are mapped to query, key, and value spaces. Attention scores are generated by calculating the similarity between the query and the key, and these scores are used for normalization to weight the value vectors, forming enhanced feature representations and improving classification performance.
This design, which combines convolutionally extracted local structural information with global features discovered by the attention mechanism, empowers the spatial feature extraction module with stronger spatial feature extraction capabilities. Convolution layers capture subtle differences in the data, while the self-attention mechanism ensures that the model focuses on the most informative areas in the spatial dimension. Therefore, this module effectively captures the complex patterns in vibration signals and enhances the model’s ability to interpret and understand the spatial feature distribution, thus improving overall classification performance.
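Under the assumptions stated above, the spatial branch could be sketched in Keras as follows; the filter counts, the token layout (one token per fiber position), and the attention configuration are illustrative rather than the authors' exact settings.

```python
# Spatial-branch sketch: (2, 1) pooling compresses the time (vertical) axis,
# a (2, 3) convolution emphasises the space (horizontal) axis, (2, 2) pooling
# downsamples globally, and self-attention reweights the spatial features.
import tensorflow as tf
from tensorflow.keras import layers

def spatial_branch(x):
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)           # compress time axis only
    x = layers.Conv2D(32, kernel_size=(2, 3), padding="same",
                      activation="relu")(x)                 # asymmetric, space-oriented
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)            # global downsampling
    _, h, w, c = x.shape                                    # assumes a static input shape
    x = layers.Permute((2, 1, 3))(x)                        # (time, space, ch) -> (space, time, ch)
    seq = layers.Reshape((w, h * c))(x)                     # one token per fiber position
    return layers.MultiHeadAttention(num_heads=2, key_dim=32)(seq, seq)  # self-attention

inputs = tf.keras.Input(shape=(64, 64, 1))                  # assumed sample size
spatial_feats = spatial_branch(inputs)
```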

4.3. Time-Domain Feature Extraction Module

Similarly, the time-domain feature extraction module focuses on extracting time-related features from the input data. This module relies on the feature extraction and dimensionality reduction capabilities of convolutional and pooling layers, combined with the bidirectional dynamic learning mechanism of the BiGRU and the key time-step focusing function of the self-attention mechanism. The model progressively analyzes and captures the dependencies in the time series, extracting efficient time-domain feature representations, from fine-grained local changes to global temporal patterns.
In the preliminary processing stage for time-domain features, the model first downsamples the input data using a pooling operation (1, 4) to reduce redundant information while preserving the integrity of the time dimension. This operation effectively reduces the computational load while ensuring that the dynamic changes in time-related features are adequately preserved. Next, the model applies asymmetric convolutions (3 × 2), as shown in Figure 8b, further enhancing the ability to capture changes in the vertical direction (time axis) of the signal. This convolution kernel has a large receptive field in the vertical direction and is capable of identifying details and change patterns in the time series. The asymmetric convolution design enables a deep exploration of complex temporal features.
After the initial feature extraction, the model continues to reduce the spatial dimensions through a pooling operation (1, 2). This phase of pooling not only helps downsample the data but also further enhances the model’s performance in learning time-series features. Finally, the data are divided along the vertical direction into multiple time segments, each corresponding to the signal variations at different moments, providing a more accurate time-series representation for subsequent temporal analysis. The resulting temporal sequence is then input into the BiGRU, capturing the time dependencies within the sequence. The bidirectional structure of BiGRU strengthens the model’s ability to capture long-term dependencies, enabling the network to fully understand the temporal dynamics of the sequence. Additionally, the self-attention mechanism is introduced to optimize the BiGRU output, enhancing its ability to represent temporal features. By combining BiGRU with the self-attention mechanism, the module can capture both local features and global contextual information, improving its understanding and processing capabilities for time-series data.
Through the integration of multi-layer structures and attention mechanisms, the module can thoroughly analyze both local patterns and global dependencies within the time series. The hierarchical, multi-stage processing ensures that the model can extract crucial information from complex time-series data, thereby improving classification and prediction performance.
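A corresponding sketch of the temporal branch, again with illustrative dimensions, is given below: (1, 4) and (1, 2) pooling compress the space axis while keeping every time row, a (3, 2) convolution emphasises the time axis, the feature map is split into a sequence of time steps, and a BiGRU followed by self-attention models the temporal dynamics.

```python
# Temporal-branch sketch: space-axis pooling, time-oriented convolution,
# sequence formation along the time axis, BiGRU, and self-attention.
import tensorflow as tf
from tensorflow.keras import layers

def temporal_branch(x):
    x = layers.MaxPooling2D(pool_size=(1, 4))(x)            # compress space axis only
    x = layers.Conv2D(32, kernel_size=(3, 2), padding="same",
                      activation="relu")(x)                  # asymmetric, time-oriented
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)             # further spatial reduction
    _, h, w, c = x.shape                                     # assumes a static input shape
    seq = layers.Reshape((h, w * c))(x)                      # one token per time step
    seq = layers.Bidirectional(layers.GRU(64, return_sequences=True))(seq)
    return layers.MultiHeadAttention(num_heads=2, key_dim=32)(seq, seq)  # self-attention

inputs = tf.keras.Input(shape=(64, 64, 1))                   # assumed sample size
temporal_feats = temporal_branch(inputs)
```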

4.4. Cross-Attention Module

The proposed CBA model incorporates a dual cross-attention mechanism, consisting of the spatiotemporal cross-attention module and the temporal–spatial cross-attention module. Both modules share a common cross-attention structure (Figure 9), with the difference lying in the positioning of the inputs, which enhances the model’s ability to effectively process both temporal and spatial features. These mechanisms significantly improve feature expressiveness by attending to the interdependencies between space and time. Specifically, the spatiotemporal cross-attention module uses spatial features as the query vector and temporal features as the key and value for execution. This allows the module to query and weight the temporal features in the spatial dimension, precisely capturing the key temporal behavior patterns at specific spatial locations. In this process, the output from the spatial feature extraction module is used as input X to this module, for which the query vector is computed via Equation (16). The output from the temporal feature extraction module is used as input Y, with the key and value vectors computed via Equation (16).
Similarly, the temporal–spatial cross-attention module queries spatial information along the temporal dimension, using time features as the query vector and spatial features as the key and value. This method highlights spatial regions relevant to the temporal dynamics, optimizing the model’s information-processing ability in the spatial dimension. In this process, the output from the temporal feature extraction module is used as input X to this module, where the query vector is computed via Equation (16). The output from the spatial feature extraction module is used as input Y, with the key and value vectors computed by Equation (16). The combination of these two cross-attention modules not only enhances the model’s understanding of complex spatiotemporal data but also improves its accuracy in recognizing and classifying complex events by carefully analyzing and utilizing the interdependencies between space and time.
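The dual cross-attention fusion can be sketched with Keras multi-head attention layers as follows; the head count and key dimension are assumptions, and spatial_feats/temporal_feats stand for the outputs of the two branches.

```python
# Dual cross-attention sketch: one block queries the temporal features with the
# spatial features (spatiotemporal), the other queries the spatial features
# with the temporal features (temporal-spatial).
from tensorflow.keras import layers

def dual_cross_attention(spatial_feats, temporal_feats, d_model=64):
    st = layers.MultiHeadAttention(num_heads=2, key_dim=d_model)(
        query=spatial_feats, value=temporal_feats, key=temporal_feats)
    ts = layers.MultiHeadAttention(num_heads=2, key_dim=d_model)(
        query=temporal_feats, value=spatial_feats, key=spatial_feats)
    return st, ts
```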

4.5. FC and Softmax

In the final stage of the model, the features output via the spatiotemporal and temporal–spatial cross-attention modules are concatenated to fuse the temporal and spatial information. Combining these with the features extracted via earlier layers of the network, a fully connected layer further processes the features and maps them to a new space, resulting in a score vector corresponding to the number of output categories. This layer learns the deep relationships between the input features, providing a foundation for decision-making. The output of the fully connected layer is then passed through the Softmax activation function, which normalizes the model’s linear responses into a probability distribution, intuitively representing the likelihood of each event occurring. After the Softmax layer is applied, each category corresponds to a probability value, indicating the model’s confidence in predicting that event type. The prediction selects the category with the highest probability. This configuration, combining the fully connected layer and the Softmax layer, enables the model to not only integrate and transform features but also directly output the predicted probabilities for each category, facilitating the execution and evaluation of classification tasks.
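A minimal sketch of this fusion head is shown below: the two cross-attention outputs are flattened, concatenated, passed through a fully connected layer, and mapped to six class probabilities by softmax; the hidden width of 128 is a placeholder.

```python
# Fusion head sketch: concatenation of cross-attention outputs, a fully
# connected layer, and a softmax output over the six event classes.
from tensorflow.keras import layers

def fusion_head(st_cross, ts_cross, n_classes=6):
    st = layers.Flatten()(st_cross)
    ts = layers.Flatten()(ts_cross)
    fused = layers.Concatenate()([st, ts])
    fused = layers.Dense(128, activation="relu")(fused)
    return layers.Dense(n_classes, activation="softmax")(fused)
```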

5. Experimental Results and Discussion

5.1. Details for Experiments

The experimental team collected a dataset of 600 sample images labeled with six types of vibration events using a DAS system built in the laboratory. The dataset was then expanded using data augmentation techniques, resulting in a dataset of 1800 samples for model validation and comparative analysis. The dataset was randomly split, with 70% of the samples used for training, 15% for validation, and 15% for testing. The model code was deployed on a single machine in the laboratory. The machine was equipped with an NVIDIA 3070Ti GPU with 8GB of memory. The system ran the TensorFlow framework and supported GPU acceleration. Figure 10 shows each submodule of the model, and the hyperparameter settings of the model are shown in Table 1.
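The 70/15/15 split could be reproduced as in the following sketch; the use of scikit-learn and of stratified sampling is an assumption, and the arrays are placeholders for the 1800 augmented samples.

```python
# Hypothetical 70/15/15 train/validation/test split of the augmented dataset.
import numpy as np
from sklearn.model_selection import train_test_split

images = np.zeros((1800, 64, 64, 1), dtype=np.uint8)   # placeholder samples
labels = np.repeat(np.arange(6), 300)                   # placeholder labels, 6 classes

X_train, X_tmp, y_train, y_tmp = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
```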

5.2. Index

In vibration-event classification, common evaluation metrics include accuracy, precision, recall, F1-score, and the NAR, which together comprehensively assess model performance. A confusion matrix provides a detailed view of prediction results, aiding in analyzing classifier performance and facilitating the calculation of various metrics. In the matrix, TP (true positive) indicates correctly predicted positive samples, TN (true negative) denotes correctly predicted negative samples, FP (false positive) represents negative samples incorrectly predicted as positive, and FN (false negative) indicates positive samples incorrectly predicted as negative. Accuracy is the proportion of samples correctly classified by the classifier among the total number of samples:
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Precision represents the proportion of truly positive samples among those predicted as positive by the model. It measures the “purity” of the model’s positive predictions; high precision indicates fewer false positives (FP). The formula for precision is as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
Recall represents the proportion of true positive samples that were correctly identified by the model. Recall measures the “coverage” of the model’s positive predictions. A high recall indicates fewer false negatives ( F N ). The formula for recall is as follows:
\mathrm{Recall} = \frac{TP}{TP + FN}
The F1-score is a key metric for evaluating the performance of classification models. It is the harmonic mean of precision and recall, used to assess the balance between the model’s ability to correctly predict positive samples and its ability to cover all actual positive samples. The calculation formula is as follows:
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
The NAR (false alarm rate) represents the proportion of noise incorrectly detected as target events among all negative samples. It measures the probability of false alarms, and its formula is as follows:
\mathrm{NAR} = \frac{FP}{FP + TN}
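The metrics above can be computed from a multi-class confusion matrix as in the following sketch, which uses a one-vs-rest, macro-averaged convention; this aggregation choice is an assumption, since the paper does not state it explicitly.

```python
# Macro-averaged accuracy, precision, recall, F1, and NAR from a confusion matrix.
import numpy as np

def macro_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    nar = fp / (fp + tn)                      # false (nuisance) alarm rate
    accuracy = tp.sum() / cm.sum()
    return {"accuracy": accuracy, "precision": precision.mean(),
            "recall": recall.mean(), "f1": f1.mean(), "nar": nar.mean()}
```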

5.3. Results and Discussion

To validate the effectiveness of the CBA model, it was compared with four typical models: CNN [25], TCN, CNN-LSTM [27], and CNN-BiLSTM [21], all given the same preprocessed grayscale images as input and run in the same operating environment. Among them, the CNN is an AlexNet without pre-training, and the TCN is a temporal convolutional network, an efficient architecture specifically designed for processing time-series data. In the comparative experiments, model performance was assessed through four key metrics: precision, recall, F1-score, and the NAR (false alarm rate), as shown in Table 2. Table 3 further details the performance of the CBA model in recognizing each category, and Figure 11 shows the confusion matrix for the CBA model, in which the horizontal and vertical axes correspond to the six categories of vibration events.
Although the CNN model has the fewest parameters and the shortest training time, giving it an advantage in computational efficiency, its F1-score is only 85.56%, with a high false alarm rate of 14.2%. This indicates significant shortcomings in both accurately identifying vibration events and ensuring comprehensive coverage of positive samples. In contrast, while the TCN model requires slightly more training time (15 min), it achieves a higher precision of 88.47% and an improved F1-score of 88.28%, demonstrating stronger feature representation and more stable recognition performance in time-series modeling.
The LSTM-ATTENTION model improves performance by introducing an attention mechanism for time-series modeling while also accelerating convergence. Although it increases the parameter count and training time, the number of training epochs is significantly reduced, resulting in higher overall efficiency. The model maintains stable precision (90.90%) and recall (90.03%), with an F1-score of 90.46%, indicating a balanced classification capability. The NAR also drops to 9.3%, making the model well-suited for tasks with high temporal analysis requirements. Among all baseline models, the CNN-BiLSTM model achieved the best performance. Although it involved a higher number of parameters and a longer training time, it converged faster, maintaining overall training efficiency. With both precision and recall exceeding 92% and the lowest NAR at just 8.5%, it demonstrated strong reliability. Furthermore, its F1-score reached 92.15%, reflecting a well-balanced performance between accuracy and coverage. This indicates the model’s solid capability and robustness in handling complex time-series data.
In the comparative experiments, the proposed CBA model demonstrated the best overall performance, achieving a precision of 95.13%, a recall of 95.00%, and an F1-score of 95.06%, with the lowest NAR of only 6.1%. Although the model has a relatively large number of parameters and requires a longer training time, it converges very quickly, reaching optimal performance within just 32 training epochs. The high F1-score indicates that the model maintains a strong balance between precision and recall, reflecting its stable capability to identify various types of vibration events. These results confirm that the model can consistently perform well under diverse environmental and activity conditions, demonstrating excellent adaptability and generalization. The confusion matrix shown in Figure 11 and Table 3 further validates its ability to accurately distinguish activities such as walking, jumping, and digging, highlighting its effectiveness and robustness in identifying different vibration patterns.
The model performs well under clear conditions, but the NAR is relatively high at 12.3%, possibly because the background-noise characteristics on clear days are indistinct and difficult to differentiate. It performs even better under rainy conditions, with improved precision and recall and a significantly reduced NAR of 5.2%, demonstrating excellent adaptability and robustness to high-frequency noise. In recognizing specific activities such as walking and jumping, the model demonstrated exceptionally high accuracy. It performed particularly well in activities involving the use of a shovel, such as “Spade-shovel” and “Spade-pat”. Notably, the “Spade-pat” activity achieved the lowest false alarm rate of just 3.1% and the highest F1-score of 97.05%, indicating the model’s high sensitivity and precision in identifying scenarios involving specific tool usage.
To further validate the stability of the proposed CBA model and assess the contribution of each individual module, we designed and conducted an ablation study. Under the same overall architecture, several model variants were evaluated, including CNN, BiGRU, CNN-BiGRU, CNN-BiGRU-AT, CNN-BiGRU-CrossAT, and the complete CBA model. To comprehensively assess whether the performance improvements achieved using each module were statistically significant, we additionally applied t-tests to evaluate the significance of differences in F1-scores among the models.
In addition, to evaluate the model’s stability and robustness while maintaining a fixed data split ratio (70% for training, 15% for validation, and 15% for testing), we adopted a multi-round resampling strategy inspired by cross-validation. Specifically, five independent experiments were conducted, each with a different random split of the dataset using the same 70/15/15 ratio. In each round, the model was retrained and evaluated, and we recorded key performance metrics: precision, recall, F1-score, and NAR. Figure 12 illustrates the convergence speeds and training durations of each model in the ablation study. The overall experimental results are summarized in Table 4, while Table 5 provides a detailed breakdown of the evaluation experiments.
Firstly, according to the ablation study results, the standalone CNN and BiGRU models did not perform very well. Specifically, the CNN model achieved precision and recall of approximately 85%, with a high NAR of 14.2%, indicating its unsuitability for practical deployment. As the model structure evolved into a CNN-BiGRU integration, precision and recall improved to around 90%, and the NAR decreased to 9.6%. The further inclusion of the attention mechanism (CNN-BiGRU-AT) raised the precision and recall to about 92%. Most notably, the introduction of the cross-attention mechanism (CNN-BiGRU-CrossAT) resulted in the most significant performance improvement, with precision and recall reaching 94.20% and 93.70% respectively, and the NAR dropping to 6.8%. We conducted pairwise t-tests to evaluate the statistical significance of performance improvements across model variants. The results showed that the CBA model outperformed all other models with statistically significant differences in F1-score ( p < 0.05 ). In particular, the performance gap was most pronounced when compared to the baseline CNN ( p = 7.7 × 10 8 ) and BiGRU ( p = 3.3 × 10 7 ), indicating a substantial improvement over the basic models. As attention mechanisms were incrementally introduced, the F1-score increased from 90.12 (CNN-BiGRU) to 91.95 (CNN-BiGRU-AT, p = 5.5 × 10 5 ), and then to 93.94 (CNN-BiGRU-CrossAT, p = 8.1 × 10 5 ), and it finally reached 95.06 in the complete CBA model ( p = 2.5 × 10 4 ). These results confirm that the inclusion of attention, especially cross-attention, plays a key role in enhancing the model’s ability to capture complex spatiotemporal features and significantly boosts its recognition performance.
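A hedged sketch of such a significance test is shown below; the per-run F1-scores are placeholders, not the reported values, and the choice of an independent two-sample t-test is an assumption.

```python
# Comparing per-run F1-scores of two model variants with a t-test (SciPy).
from scipy import stats

f1_cba       = [95.1, 94.8, 95.2, 94.9, 95.0]   # hypothetical per-run F1 (%)
f1_cnn_bigru = [90.0, 90.3, 89.8, 90.2, 90.1]   # hypothetical per-run F1 (%)

t_stat, p_value = stats.ttest_ind(f1_cba, f1_cnn_bigru)
print(f"t = {t_stat:.2f}, p = {p_value:.1e}")    # p < 0.05 indicates a significant difference
```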
The complete CBA model ultimately achieved optimal results. These outcomes clearly demonstrate that the model, by effectively integrating BiGRU and attention mechanisms, significantly optimized its ability to analyze time-series data. By leveraging cross-perception of spatiotemporal features, it substantially enhanced performance in dynamic environments, with attention mechanisms, especially cross-attention, playing a key role in enhancing model performance.
Secondly, the results from five independent experiments show that the CBA model maintained stable performance across different data splits. The precision ranged from 94.67% to 95.23%, the recall from 93.81% to 94.95%, the F1-score from 94.21% to 95.15%, and the NAR from 6.0% to 6.6%. The average performance across these experiments was only slightly lower than that of the complete CBA model, indicating strong generalization and robustness under varying data distributions.
In conclusion, although other models demonstrated some effectiveness in comparative experiments, the CBA model exhibited the best overall performance across all evaluation metrics. The ablation study results highlight the key roles of BiGRU and cross-attention mechanisms in enhancing recognition accuracy and reducing false alarms. Moreover, the consistency of results from five independent runs further confirms the stability and reliability of the CBA model in practical applications. By integrating BiGRU and attention mechanisms, the CBA model effectively identifies and classifies various dynamic events in complex environments, showcasing its potential and robustness in real-world scenarios.

6. Conclusions

In order to improve the accuracy of vibration-event recognition using fiber-optic sensing, this study has proposed an efficient model combining CNN, BiGRU, and attention mechanisms. The model takes into full consideration the characteristics of fiber-optic-vibration data, extracting spatial features of the signal through CNN, capturing the dynamic evolution of time-domain features via BiGRU, and further focusing on spatial or temporal features through a self-attention mechanism. The cross-attention mechanism strengthens the correlation between spatiotemporal features. The superiority of the proposed method was demonstrated through experiments with real-world data collected via buried fiber-optic vibration sensors. This work provides a reliable solution for vibration-event recognition in fiber-optic sensing.
Although the proposed CBA model performs well in vibration-event recognition with high accuracy and stability, there are still areas for improvement. First, the model shows relatively low accuracy in recognizing background-noise events (e.g., under clear or rainy conditions) due to their weak and ambiguous features. Enhancing the model’s ability to distinguish between such subtle signals remains a challenge. Second, the current dataset is limited in scale and diversity, especially for rare or low-frequency events. This may affect the model’s generalization in more complex scenarios. While data augmentation and repeated evaluations help validate stability, they cannot fully simulate real-world factors like terrain, fiber layout, or device variability.
Future research may focus on enhancing the adaptability and generalization capabilities of the model in complex real-world environments. On one hand, building larger-scale fiber-optic vibration-event datasets that cover a wider range of application scenarios could be a key direction for practical deployment. Such datasets may help in recognizing rare or low-frequency events and are likely to improve the model’s robustness and stability in field applications. On the other hand, from an algorithmic perspective, the introduction of domain adaptation and transfer learning strategies could address the challenges posed by discrepancies between training and deployment data distributions, thereby improving cross-domain performance. In addition, incorporating emerging methods such as self-supervised learning and continual learning offers promising avenues for reducing the reliance on labeled data and handling continuously evolving data distributions. These directions collectively hold the potential for developing more intelligent and adaptable fiber-optic sensing recognition systems.
In summary, while the model achieves promising results, further efforts are needed to enhance its noise robustness, generalization, and real-world deployment readiness.

Author Contributions

Conceptualization, C.L.; methodology, C.L. and X.C.; validation, X.C.; investigation, X.C.; resources, Y.S.; data curation, Y.S.; writing—original draft preparation, X.C.; writing—review and editing, C.L. and X.C.; visualization, X.C.; supervision, C.L.; project administration, C.L.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant 61801283.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Abufana, S.; Dalveren, Y.; Aghnaiya, A.; Kara, A. Variational mode decomposition-based threat classification for fiber optic distributed acoustic sensing. IEEE Access 2020, 8, 100152–100158. [Google Scholar]
  2. Shen, X.; Wu, H.; Zhu, K.; Liu, H.; Li, Y.; Zheng, H.; Li, J.; Shao, L.; Shum, P.P.; Lu, C. Fast and storage-optimized compressed domain vibration detection and classification for distributed acoustic sensing. J. Light. Technol. 2024, 42, 493–499. [Google Scholar]
  3. Zhang, X.; Ding, Z.; Hong, R.; Chen, X.; Liang, L.; Zhang, C.; Wang, F.; Zou, N.; Zhang, Y. Phase-sensitive optical time-domain reflective distributed optical fiber sensing technology. Acta Opt. Sin. 2021, 41, 0106004. [Google Scholar]
  4. Rao, Y.; Wang, Z.; Wu, H.; Ran, Z.; Han, B. Recent advances in phase-sensitive optical time domain reflectometry (Φ-OTDR). Photonic Sens. 2021, 11, 1–30. [Google Scholar] [CrossRef]
  5. Verma, S.; Mathew, J.; Gupta, S. Feature extraction based acoustic signal detection in a cost effective Φ-OTDR system. In Proceedings of the 2023 International Conference on Microwave, Optical, and Communication Engineering (ICMOCE), Bhubaneswar, India, 26–28 May 2023. [Google Scholar]
  6. Sun, Z.; Guo, Z. Intelligent intrusion detection for optical fiber perimeter security system based on an improved high efficiency feature extraction technique. Meas. Sci. Technol. 2024, 35, 045107. [Google Scholar]
  7. He, T.; Sun, Q.; Zhang, S.; Li, H.; Yan, B.; Fan, C.; Yan, Z.; Liu, D. A dual-stage-recognition network for distributed optical fiber sensing perimeter security system. J. Light. Technol. 2023, 41, 4331–4340. [Google Scholar] [CrossRef]
  8. Xiao, C.; Long, J.; Jiang, L.; Yan, G.; Rao, Y. Review of sensitivity-enhanced optical fiber and cable used in distributed acoustic fiber sensing. In Proceedings of the 2022 Asia Communications and Photonics Conference (ACP), Shenzhen, China, 5–8 November 2022. [Google Scholar]
  9. Liu, S.; Yu, F.; Hong, R.; Xu, W.; Shao, L.; Wang, F. Advances in phase-sensitive optical time-domain reflectometry. Opto-Electron. Adv. 2022, 5, 1–28. [Google Scholar]
  10. Yan, Y.; Khan, F.N.; Zhou, B.; Lau, A.P.T.; Lu, C.; Guo, C. Forward transmission-based ultra-long distributed vibration sensing with wide frequency response. J. Light. Technol. 2021, 39, 2241–2249. [Google Scholar]
  11. Yan, Y.; Zheng, H.; Zhao, Z.; Guo, C.; Wu, X.; Hu, J.; Lau, A.P.; Lu, C. Distributed optical fiber sensing assisted by optical communication techniques. J. Light. Technol. 2021, 39, 3654–3670. [Google Scholar]
  12. Lyu, C.; Huo, Z.; Cheng, X.; Jiang, J.; Liu, H. Distributed optical fiber sensing intrusion pattern recognition based on GAF and CNN. J. Light. Technol. 2020, 38, 4174–4182. [Google Scholar]
  13. Chen, J.; Li, H.; Shi, Z.; Xiao, X.; Fan, C.; Yan, Z.; Sun, Q. Low-altitude unmanned aerial vehicle detection and localization based on distributed acoustic sensing. In Proceedings of the 2023 Conference on Lasers and Electro-Optics (CLEO), San Jose, CA, USA, 7–12 May 2023. [Google Scholar]
  14. Wu, H.; Chen, J.; Liu, X.; Xiao, Y.; Wang, M.; Zheng, Y.; Rao, Y. One-dimensional CNN-based intelligent recognition of vibrations in pipeline monitoring with DAS. J. Light. Technol. 2019, 37, 4359–4366. [Google Scholar] [CrossRef]
  15. Pen, Z.; Jian, J.; Wen, H.; Gribok, A.; Chen, K.P. Distributed fiber sensor and machine learning data analytics for pipeline protection against extrinsic intrusions and intrinsic corrosions. Opt. Express 2020, 28, 27277–27292. [Google Scholar]
  16. Lior, I.; Rivet, D.; Ampuero, J.P.; Sladen, A.; Barrientos, S.; Sánchez-Olavarría, R.; Villarroel Opazo, G.A.; Bustamante Prado, J.A. Magnitude estimation and ground motion prediction to harness fiber optic distributed acoustic sensing for earthquake early warning. Sci. Rep. 2023, 13, 424. [Google Scholar] [CrossRef]
  17. Hernández, P.; Ramírez, J.; Soto, M. Deep-learning-based earthquake detection for fiber-optic distributed acoustic sensing. J. Light. Technol. 2022, 40, 2639–2650. [Google Scholar] [CrossRef]
  18. Ma, P.; Liu, K.; Jiang, J.; Li, Z.; Liu, T. Probabilistic event discrimination algorithm for fiber optic perimeter security systems. J. Light. Technol. 2018, 36, 2069–2075. [Google Scholar] [CrossRef]
  19. Wada, M.; Maeda, Y.; Shimabara, H.; Aihara, T. Manhole locating technique using distributed vibration sensing and machine learning. In Proceedings of the 2021 Optical Fiber Communications Conference and Exhibition (OFC), San Francisco, CA, USA, 6–11 June 2021. [Google Scholar]
  20. Pranay, Y.S.; Tabjula, J.; Kanakambaran, S. Classification studies on vibrational patterns of distributed fiber sensors using machine learning. In Proceedings of the 2022 IEEE Bombay Section Signature Conference (IBSSC), Mumbai, India, 8–10 December 2022. [Google Scholar]
  21. Wu, H.; Yang, M.; Yang, S.; Lu, H.; Wang, C.; Rao, Y. A Novel DAS Signal Recognition Method Based on Spatiotemporal Information Extraction With 1DCNNs-BiLSTM Network. IEEE Access 2020, 8, 119448–119457. [Google Scholar] [CrossRef]
  22. Chen, X.; Xu, C. Disturbance Pattern Recognition Based on an ALSTM in a Long-distance Φ-OTDR Sensing System. Microw. Opt. Technol. Lett. 2020, 62, 168–175. [Google Scholar] [CrossRef]
  23. Li, S.; Peng, R.; Liu, Z. A Surveillance System for Urban Buried Pipeline Subject to Third-Party Threats Based on Fiber Optic Sensing and Convolutional Neural Network. Struct. Health Monit. 2021, 20, 1704–1715. [Google Scholar] [CrossRef]
  24. Sun, Q.; Li, Q.; Chen, L.; Quan, J.; Li, L. Pattern Recognition Based on Pulse Scanning Imaging and Convolutional Neural Network for Vibrational Events in Φ-OTDR. J. Light Electron Opt. 2020, 219, 165205. [Google Scholar] [CrossRef]
  25. Shi, Y.; Li, Y.; Zhang, Y.; Zhuang, Z.; Jiang, T. An Easy Access Method for Event Recognition of Φ-OTDR Sensing System Based on Transfer Learning. J. Light. Technol. 2021, 39, 4548–4555. [Google Scholar]
  26. Shi, Y.; Dai, S.; Jiang, T.; Fan, Z. A Recognition Method for Multi-Radial-Distance Event of Φ-OTDR System Based on CNN. IEEE Access 2021, 9, 143473–143480. [Google Scholar] [CrossRef]
  27. Li, Z.; Wang, M.; Zhong, Y.; Zhang, J.; Peng, F. Fiber Distributed Acoustic Sensing Using Convolutional Long Short-Term Memory Network: A Field Test on High-Speed Railway Intrusion Detection. Opt. Express 2020, 28, 2925–2938. [Google Scholar] [PubMed]
  28. Li, Y.; Zeng, X.; Shi, Y. A Spatial and Temporal Signal Fusion Based Intelligent Event Recognition Method for Buried Fiber Distributed Sensing System. Opt. Laser Technol. 2023, 166, 109658. [Google Scholar] [CrossRef]
  29. Wu, H.; Liu, X.; Wang, X.; Wu, Y.; Liu, Y.; Wang, Y. Multi-Dimensional Information Extraction and Utilization in Smart Fiber-Optic Distributed Acoustic Sensor (sDAS). J. Light. Technol. 2024, 42, 6967–6980. [Google Scholar]
  30. Zhao, X.; Shan, G.; Kuang, Y.; Zhu, M. Development and Test of Smart Fiber Optic Vibration Sensors for Railway Track Monitoring. Opt. Express 2022, 30, 31123–31134. [Google Scholar]
  31. Hu, S.; Hu, X.; Li, J.; He, Y.; Qin, H.; Li, S.; Liu, M.; Liu, C.; Zhao, C.; Chen, W. Enhancing Vibration Detection in Φ-OTDR Through Image Coding and Deep Learning-Driven Feature Recognition. IEEE Sens. J. 2024, 24, 38344–38351. [Google Scholar]
  32. Wu, G.; Wang, L.; Hu, X.; Luo, Q.; Guo, D. BiGRU-DA: Based on Improved BiGRU Multi-Target Data Association Method. In Proceedings of the 2023 8th International Conference on Computer and Communication Systems (ICCCS), Guangzhou, China, 21–23 April 2023; pp. 1125–1130. [Google Scholar]
  33. Binlu, Y.; Guangqi, Q.; Jin, L. RUL Prediction of Rolling Bearings Based on Crested Porcupine Optimization Algorithm Optimized CNN-BiGRU-Attention Neural Network. In Proceedings of the 2024 Global Reliability and Prognostics and Health Management Conference (PHM-Beijing), Beijing, China, 21–23 April 2024; pp. 1–6. [Google Scholar]
  34. Murray, C.; Chaurasia, P.; Hollywood, L.; Coyle, D. A Comparative Analysis of State-of-the-Art Time Series Forecasting Algorithms. In Proceedings of the 2022 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 14–16 December 2022; pp. 89–95. [Google Scholar]
Figure 1. DAS system for vibration detection.
Figure 2. Temporal waveforms of six types of vibration signals. These activities include: (I) clear conditions, (II) rainy conditions, (III) walking, (IV) jumping, (V) digging, and (VI) striking.
Figure 3. Grayscale images of vibration events converted from the spatiotemporal matrix after band-pass filtering. The first row contains six types of images from the original dataset, while the second row shows the data-augmented images. These activities include: (I) clear conditions, (II) rainy conditions, (III) walking, (IV) jumping, (V) digging, and (VI) striking.
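For readers who want to reproduce this preprocessing step, the following is a minimal sketch, assuming SciPy and NumPy, of how a spatiotemporal matrix might be band-pass filtered and rescaled into a 64 × 64 grayscale image. It is not the authors' exact pipeline: the sampling rate, filter order, band edges, and nearest-index downsampling are illustrative assumptions only.

```python
# Illustrative preprocessing sketch; parameters are assumptions, not the paper's values.
import numpy as np
from scipy import signal

def to_grayscale_image(spatiotemporal, fs=1000.0, band=(10.0, 300.0), size=64):
    """spatiotemporal: 2-D array of shape (time samples, fiber positions)."""
    # Band-pass filter every spatial channel along the time axis.
    b, a = signal.butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    filtered = signal.filtfilt(b, a, spatiotemporal, axis=0)

    # Map amplitudes to the 0-255 grayscale range.
    mag = np.abs(filtered)
    gray = 255.0 * (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)

    # Downsample to a fixed size x size network input.
    t_idx = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    s_idx = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(t_idx, s_idx)].astype(np.uint8)

# Example with a synthetic 2000-sample x 200-position matrix.
image = to_grayscale_image(np.random.randn(2000, 200))
print(image.shape)  # (64, 64)
```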
Figure 4. Gated recurrent unit.
Figure 5. Bidirectional gated recurrent unit.
Figure 6. Attention mechanism.
Figure 7. Overview of the proposed CBA model architecture. The network consists of spatial and temporal feature extraction modules, a cross-attention module, a self-attention module, and fully connected layers. The final prediction is for six types of vibration events.
Figure 8. Asymmetric convolution operations for feature extraction: (a) enhances spatial feature detection to optimize shape and boundary recognition; (b) extracts temporal features to analyze patterns and movements over time. Green indicates the convolution kernels, and light blue the feature maps to be convolved.
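The two kernel orientations in Figure 8 can be expressed directly as asymmetric 2-D convolutions. The PyTorch sketch below is illustrative only; the kernel length, channel counts, and input size are assumptions rather than the paper's exact settings.

```python
# Illustrative asymmetric convolutions in the spirit of Figure 8 (assumed sizes).
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)  # (batch, channels, time, space)

spatial_conv = nn.Conv2d(16, 32, kernel_size=(1, 7), padding=(0, 3))   # Figure 8a: slides along the fiber axis
temporal_conv = nn.Conv2d(16, 32, kernel_size=(7, 1), padding=(3, 0))  # Figure 8b: slides along the time axis

spatial_feat = spatial_conv(x)    # emphasizes shapes and boundaries along the fiber
temporal_feat = temporal_conv(x)  # emphasizes how the vibration evolves over time
print(spatial_feat.shape, temporal_feat.shape)  # both torch.Size([1, 32, 64, 64])
```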
Figure 9. Cross-attention mechanism.
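As a rough illustration of the cross-attention step sketched in Figure 9, the snippet below uses PyTorch's MultiheadAttention with one branch supplying the queries and the other supplying the keys and values. The embedding size, head count, and the reuse of a single attention module for both directions are simplifying assumptions, not the paper's implementation.

```python
# Illustrative cross-attention between temporal and spatial feature sequences.
import torch
import torch.nn as nn

embed_dim, heads = 128, 4
cross_attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

temporal_feat = torch.randn(32, 64, embed_dim)  # (batch, time steps, features)
spatial_feat = torch.randn(32, 64, embed_dim)   # (batch, fiber positions, features)

# Temporal features query the spatial features, and vice versa.
fused_t, _ = cross_attn(query=temporal_feat, key=spatial_feat, value=spatial_feat)
fused_s, _ = cross_attn(query=spatial_feat, key=temporal_feat, value=temporal_feat)
print(fused_t.shape, fused_s.shape)  # torch.Size([32, 64, 128]) each
```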
Figure 10. Details for each layer of the model, illustrating how they work together to process both spatial and temporal features.
Figure 11. Confusion matrix of the CBA model, evaluating its performance in distinguishing six vibration events: sunny noise, rainy noise, walk, jump, spade-shovel, and spade-pat.
Figure 12. Training curves showing convergence behavior in the ablation study.
Table 1. Hyperparameter settings of the CBA model.

Hyperparameter             Value
Optimizer                  Adam
Learning rate              0.001
Loss function              Cross-entropy loss
Dropout rate               0.5
GRU units                  128 (bi-directional)
FC layer size              512
Batch size                 32
Epochs (early stopping)    32
Input image size           64 × 64 × 1
Number of classes          6
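The settings in Table 1 map directly onto a standard PyTorch training configuration. The sketch below is a deliberately simplified stand-in for the full CBA network (a single convolution block feeding a BiGRU and a fully connected head); only the optimizer, learning rate, loss, dropout, GRU and FC sizes, batch size, input size, and class count are taken from the table, and all other structural choices are assumptions.

```python
# Minimal training sketch consistent with Table 1; NOT the full CBA architecture.
import torch
import torch.nn as nn

class SimplifiedCBA(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # 64x64 -> 32x32
        self.bigru = nn.GRU(input_size=32 * 32, hidden_size=128,          # 128 bi-directional GRU units
                            batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * 128, 512), nn.ReLU(), nn.Dropout(0.5),          # 512-unit FC layer, dropout 0.5
            nn.Linear(512, num_classes))

    def forward(self, x):                      # x: (batch, 1, 64, 64)
        f = self.conv(x)                       # (batch, 32, 32, 32)
        f = f.permute(0, 2, 1, 3).flatten(2)   # (batch, 32 time steps, 32*32 features)
        out, _ = self.bigru(f)                 # (batch, 32, 256)
        return self.head(out[:, -1])           # logits for 6 classes

model = SimplifiedCBA()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss

images = torch.randn(32, 1, 64, 64)            # one batch of size 32
labels = torch.randint(0, 6, (32,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```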
Table 2. Performance of different methods.

Method            Precision   Recall   F1-Score   NAR    Training Time (min)   Epochs   Param Count (M)
CNN               85.52       85.60    85.56      14.2   12                    48       0.45
TCN               88.47       88.10    88.28      10.9   15                    45       0.60
LSTM-ATTENTION    90.90       90.46    90.46      9.7    20                    39       1.40
CNN-BiLSTM        92.20       92.10    92.15      8.5    24                    36       1.80
CBA model         95.13       95.00    95.06      6.1    27                    32       2.10
Table 3. Performance of CBA for different event types.

Event Type      Precision   Recall   F1-Score   NAR
Sunny noise     91.10       91.08    91.09      12.3
Rainy noise     93.89       93.19    93.54      7.8
Walk            96.51       97.52    97.01      6.1
Jump            94.92       94.91    94.91      5.2
Spade-shovel    96.40       96.00    96.2       6.9
Spade-pat       97.00       97.10    97.05      3.1
Table 4. CBA ablation study.

Model                  Precision   Recall   F1-Score   NAR    p-Value vs. CBA
CNN                    85.60       84.90    85.20      14.2   7.7 × 10⁻¹⁶
BiGRU                  87.20       86.50    86.90      12.7   1.9 × 10⁻¹⁵
CNN-BiGRU              90.35       89.90    90.12      9.6    1.9 × 10⁻¹³
CNN-BiGRU-AT           92.10       91.80    91.95      8.1    6.9 × 10⁻¹²
CNN-BiGRU-CrossAT      94.20       93.70    93.94      6.8    1.0 × 10⁻⁸
CBA (Full model)       95.13       95.00    95.06      6.1    —
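The p-values in Table 4 compare each ablated variant against the full CBA model. This excerpt does not state which statistical test was used, so the sketch below assumes an exact McNemar test on per-sample correctness vectors; a paired t-test over repeated training runs would be an equally common alternative.

```python
# Illustrative significance test (assumed McNemar test, not confirmed by the paper).
import numpy as np
from scipy.stats import binomtest

def mcnemar_p(correct_a, correct_b):
    """Exact two-sided McNemar p-value from boolean per-sample correctness arrays."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(correct_a & ~correct_b))   # model A right, model B wrong
    c = int(np.sum(~correct_a & correct_b))   # model A wrong, model B right
    if b + c == 0:
        return 1.0
    return binomtest(b, n=b + c, p=0.5, alternative="two-sided").pvalue

# Example with synthetic correctness vectors: full model vs. a weaker ablated variant.
rng = np.random.default_rng(0)
cba_correct = rng.random(2000) < 0.95       # ~95% accuracy
ablated_correct = rng.random(2000) < 0.90   # ~90% accuracy
print(mcnemar_p(cba_correct, ablated_correct))
```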
Table 5. Stability evaluation of the CBA model.

Experiment   Precision   Recall   F1-Score   NAR
Exp 1        95.23       94.95    94.98      6.0
Exp 2        94.95       93.81    95.15      6.3
Exp 3        94.67       94.20    94.21      6.4
Exp 4        94.68       94.74    94.58      6.6
Exp 5        95.00       94.63    95.02      6.2
Average      94.91       94.47    94.79      6.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
