Change Point Detection for Fine-Grained MFR Work Modes with Multi-Head Attention-Based Bi-LSTM Network

Detection of the changes in Multi-Functional Radar (MFR) work modes is a critical situation assessment task for Electronic Support Measure (ESM) systems. There are two major challenges that must be addressed: (i) The received radar pulse stream may contain multiple work mode segments of unknown number and duration, which makes the Change Point Detection (CPD) difficult. (ii) Modern MFRs can produce a variety of parameter-level (fine-grained) work modes with complex and flexible patterns, which are challenging to detect through traditional statistical methods and basic learning models. To address the challenges, a deep learning framework is proposed for fine-grained work mode CPD in this paper. First, the fine-grained MFR work mode model is established. Then, a multi-head attention-based bi-directional long short-term memory network is introduced to abstract high-order relationships between successive pulses. Finally, temporal features are adopted to predict the probability of each pulse being a change point. The framework further improves the label configuration and the loss function of training to mitigate the label sparsity problem effectively. The simulation results showed that compared with existing methods, the proposed framework effectively improves the CPD performance at parameter-level. Moreover, the F1-score was increased by 4.15% under hybrid non-ideal conditions.


Introduction
Modern Multi-Functional Radars (MFRs) can perform multiple tasks at the same time, including surveillance, tracking and target recognition through the adaptive selection of various complex signal modulation types and parameters. Electronic Support Measure (ESM) systems analyze intercepted radar signals to detect the change points of the work modes to assess threats [1]. These radar signals may contain multiple work mode segments of unknown number and duration. The mode changes are more difficult to detect when the signal features are compromised in non-ideal conditions, such as when pulses are lost or spurious. Therefore, the precise detection of the change point of each work mode is a vital challenge for ESM.
Change Point Detection (CPD) refers to a set of feasible methods for identifying abrupt changes in a time series and can be applied in various fields to extract work mode information [2][3][4][5][6][7]. The objective of this considered task is to detect all points at which a change in trends occurs, which are defined as change points. In particular, CPD has been proven to be effective in radar signal processing. Early studies for radar work mode CPD were typically based on statistical methods, such as probabilistic analysis [8,9], circular binary segmentation algorithms [10], cumulative sum algorithms [11][12][13], and manually-designed feature extraction [14]. Nevertheless, heavy reliance on these hypothetical statistical models and manually-designed features usually results in high complexity and poor performance under non-ideal conditions in real applications. Deep Learning (DL) has provided a promising approach to this problem through automatic learning of complex patterns from large datasets without the requirement for feature and metric engineering [15][16][17]. Most related studies solved the work mode CPD through binary classification or work mode recognition and mode change boundary detection methods. In Ref. [18,19], an Autoencoder was used to remove noise from the original pulse stream before identification. Several studies [20][21][22] have used Convolutional Neural Networks (CNNs) to recognize the modulation type of the Pulse Repetition Interval (PRI).Compared to traditional manual feature extraction, CNNs achieved significant improvements in recognition performance. These CNN-based methods improved MFR system performance and are robust to lost and spurious pulses but have the limitation of requiring inputs of fixed length. To overcome this limitation, in Ref. [23][24][25], Recurrent Neural Networks (RNNs), including their improved form, i.e., The Long Short-Term Memory (LSTM) network, were utilized to classify pulse sequences of different work modes. Moreover, RNNs introduced the concept of timing into network architecture design to achieve better adaptability in time series data analysis. However, due to the vanishing gradient problem, RNNs have a limited memory capacity, which can be problematic for tasks that require the processing of very long sequences. Furthermore, through Ref. [26], attention mechanisms with RNNs made their way into modulation recognition, resulting in an improvement in capturing dependencies at long intervals. These existing DL methods have achieved automatic identification of multiple work modes with different modulation types under non-ideal conditions, which can be understood as detecting change points of work modes at the modulation type level. Considering that within the same modulation type, there may exist consecutive multiple work modes with different modulation parameters in the MFR pulse sequences. These fine-grained work modes exhibit similar trends under the same modulation type, making it difficult for existing methods to extract discernible features. Therefore, it is necessary to further study CPD algorithms at the modulation parameter level to cope with the rapid and flexible work mode changes of advanced MFRs. There are two problems remain to be solved: first, CPD tasks are often treated as binary classification problems in deep learning, which creates label imbalance issues and makes model fitting difficult. Second, the feature extraction module for detecting fine-grained work mode changes needs to be carefully designed.
To address these challenges, this paper propose a processing framework to accommodate fine-grained MFR work modes with flexible modulation types and parameter ranges under non-ideal conditions. The proposed framework consists of three modules: data normalization, label configuration, and a Multi-Head Attention-based Bi-LSTM (MHAB) network. Data normalization ensures that the input radar sequence values are within a unified range. Then label configuration converts sequences with change point labels into weighted change probabilities. The MHAB network identifies all trends with different dynamics, extracts important features and generates the probabilities of change points. The training process is optimized utilizing weighted Binary Cross Entropy (wBCE) as the loss function, which can highlight the minority target labels during training. Extensive simulations with PRI-defined MFR work modes prove the effectiveness and superiority of the proposed method. To our knowledge, this paper is the first to introduce the DL-based CPD methods into fine-grained MFR work mode detection. The main contributions of this paper are as follows: 1.
The Bi-LSTM learns the long-term dependencies between past and future, and the Multi-Head Attention (MHA) mechanism helps the model to focus on multiple aspects of the most informative features to reduce the impact of non-ideal observations and other useless features. Thus, the performance of the proposed method on CPD is also improved.

2.
This paper design a new label configuration and utilize a weighted binary cross entropy (wBCE) loss function for training, which effectively addresses the problem of sparse change point labels.

3.
Simulation results show that the proposed CPD method is superior to other methods for fine-grained work modes, and has robustness under non-ideal conditions.
The remainder of this article is organized as follows: in Section 2, the CPD problem is formulated. In Section 3, the proposed processing framework is presented. The simulations are discussed in Section 4. Finally, Section 5 is the conclusion.

Fine-Grained MFR Work Modes
MFR work modes are defined via radar pulses with multiple parameters [27]. Mathematically, it can be represented on two levels from the implementation perspective: modulation type level (modulation-level) and modulation parameter level (parameter-level) [28].

Definition 1.
A radar work mode is described via a dedicated arrangement of a finite number of ordered pulses P = (p 1 , p 2 , · · · , p T ) ∈ R H×T that serves a certain radar function. A radar pulse p ∈ R H in the work mode P using H mode definition parameters is represented p = (p 1 , p 2 , · · · , p H ).

Definition 2.
A mode definition parameter describes a certain feature of a radar work mode. From radar's perspective, the mode definition parameters can refer to some of the work mode related parameters in a Pulse Descriptive Word (PDW), such as PRI, Radio Frequency(RF), Pulse Width(PW) and intra-pulse modulations.

Definition 3.
A modulation type refers to the modulation pattern of a mode definition parameter. Typical PRI modulation types include constant, jittered and stagger. The corresponding modulation type varies with different mode definition parameters.

Definition 4.
A mode modulation parameter refers to a specific parameter defining a modulation type (e.g., The mean and variance for a jittered modulation type).
PRI is the most important parameter in the PDW, due to its inherent temporal characteristics and that it largely reflects the characteristics of the intercepted MFR pulse stream. Therefore, to simplify the analysis, in this paper, only the PRI mode definition parameter is considered. A radar work mode can either be defined on a coarse-grained level through different modulation type [24] or a fine-grained level through its modulation type and parameters [9,27,29]. Consequently, two levels of representation can be defined for each MFR work mode.
The coarse-grained representations of the modulation-level work modes underlie the flexibility of MFRs to adapt their transmission modulation types. Specifically, pulse sequences with the same PRI modulation types but different modulation parameters are regarded as the same work mode at the coarse-grained analysis level.
The fine-grained representation of a parameter-level work mode can be understood as an unrestricted choice of the modulation parameters regulated by the relevant modulation type. In particular, since the MFR can select parameter values in the corresponding space, each resulting combination can be a parameter-level work mode. Taking Figure 1 as an example, the PRI sequence includes the four work modes A-D, which are defined as having the same modulation type as jittered but with different parameters in the {mean, variance} space.  Figure 1. Illustration of PRI sequence with fine-grained MFR work modes conversion.

Basic PRI Models under Ideal Conditions
The PRI data are inherently timing data and have more complicated temporal correlations than other mode definition parameters, such as the RF. To enhance the applicability of the method, in [30] four statistical PRI models were formulated, including jittered, sliding, stagger and periodic, to fit all modulation types. Other types of modulation can be extrapolated from variations of these basic modulation types. For instance, constant modulation with noise can be considered as jittered modulation with small variance, while dwell-and-switch modulation can be treated as multiple consecutive constant modulations.

Non-Ideal Observation of PRI Sequence
As real electromagnetic environments are very complex, the received signals are inevitably non-ideal, and the interference usually takes the form of measurement noise, lost or spurious pulses. In practical applications, instead of directly observing the PRI values, the Time Of Arrival (TOA) of each pulse is recorded by the ESM receiver. Many factors affect the measurement accuracy of TOA, the most dominant of which is the noise in the receiver circuit. In this study, to simulate measurement noise, Gaussian White Noise (GWN) is added to the original TOA pulse sequence. The crowded electromagnetic environments and reception errors are the main causes of lost and spurious pulses.
x = {x 1 , x 2 , . . . , x L } denotes an original PRI sequence of length L. The TOA of the t-th pulse can be described as TOA t+1 = TOA t + x t , 1 ≤ t ≤ L. A non-ideal PRI sequence where j pulses are lost from the i-th pulse can be expressed as [26]: For spurious pulses, a non-ideal PRI sequence with j pulses train insertion after the i-th pulse can be expressed as: where s 0 is the difference between the original i-th pulse and the first spurious pulse; s j is the difference between the last spurious pulse and the original (i + 1)-th pulse; and s 1 to s j−1 is the differences between the spurious pulses themselves. An example of lost and spurious pulse conditions is shown in Figure 2. TOA1  TOA2  TOA3  TOA4  TOA5  TOA6 x 1 x 2 x 3 x 4 x 5 x 3   TOA1  TOA2  TOA3   TOA1  TOA2  TOA3  TOA6  TOA7  TOA8 x 1 x 2

Change Point Definition
Assuming that the PRI values of a received MFR pulse sequence are denoted using x t , where t is the discrete time index. In a certain work mode, {x t } t≥1 follows a certain probability density function PDF(x t ), with parameter θ.At an unknown time t 0 , the MFR changes the radar work modes with the corresponding parameter changing from θ 0 to θ 1 after t 0 . The pre-change and post-change distributions PDF(x t ) are represented respectively as: In this paper, the change point is defined as the first point of the θ 1 distribution, denoted by the time index t 0 in Figure 3. As there are often multiple change points in a sequence, the set of PRI sequence with K change points is formulated as T = {T 1 , T 2 , . . . , T K }.

Fine-Grained MFR Work Mode CPD Task
In fine-grained work mode, MFRs can emit specified pulse sequences with different PRI modulation types and corresponding modulation parameters. The CPD task's objective is to accurately identify these work mode change boundaries in the PRI sequence.
An input PRI sequence x of length L input is denoted as x = (x 1 , x 2 , · · · , x T ) ∈ X , where {x t } 1≤t≤L is the PRI value of the t-th pulse. The corresponding true label is denoted as y = (y 1 , y 2 , · · · , y T ) ∈ Y , which expresses the probabilities that the corresponding pulses are the change points. Where X and Y denote the input and the target spaces, respectively. The main task is to learn the model function f : X → Y, from a training dataset D = {(x i , y i ), 1 ≤ i ≤ N}. For each training sample (x i , y i ), x i ∈ X is the input PRI sequence, and y i ∈ Y is the corresponding output associated with x i . Given a test sample x ∈ X , the trained function f (·) will outputŷ = f (x) as the CPD result,ŷ t is the predicted probability that the point corresponding to index t is a change point. As mentioned above, CPD is generally regarded in DL as a problem of binary classification, that is, the probability of judging whether the current point is a change point based on existing information. For this task, a special loss function and a unique neural network structure parameterized using a vector of weights need to be designed.

The Proposed Framework for CPD
In this paper, a processing framework using a MHAB network for CPD is proposed. As shown in Figure 4, the framework consists of three modules. First, the values of the input PRI sequence are normalized into a unified range. Then, the label configuration module converts the corresponding change points set into a sequence of weighted change probabilities. Next, the MHAB network extracts features and finally generates the target probability sequence. In addition, a wBCE loss function is adopted to improve the fitting of the proposed network. The details of each of these processing steps are described in the following sections.

Data Normalization
PRI sequences are usually represented via different modulation types and parameters with different value ranges. Normalization generally limits data to a specific range, such as [0, 1], thus eliminating the impact of data dimensions on modeling. Data normalization can result in accelerated training speeds and improved network convergence. In this paper, the input sequences are normalized using maximum-minimum normalization according to the following equation: where min and max denote the minimum and maximum values of the sample sequence, respectively.

Label Configuration
As mentioned above, CPD is a problem of determining whether a given time step is a point of change based on existing information. In DL, these types of problems are typically approached through binary classification. However, the number of change points in a time series is often small compared to the total number of samples; for example, there may be 1 to 4 change points in a sequence with a length of 800 samples. This leads to a highly imbalanced classification problem that can be difficult to model with deep networks. Furthermore, pulse merging and imperfect receiver circuits are factors that often affect real-world radar data, which can cause confusion at mode changes and make it difficult to accurately identify change points. This is particularly true in cases of slow-trend changes. To balance the proportion of samples in the training set and improve the model's performance, in this paper a method of linear smoothing of the training labels at change points is proposed to improve detection accuracy. As shown in Figure 5, if the change point's position is T i and a smoothing margin M is used, then the label of each sample is: In this new label configuration, model parameters are optimized to solve a regression problem.

Feature Extraction Module
The normalized PRI sequence is fed into the two Bi-LSTM layers with rectified linear unit activations. This part regards the entire PRI sequence as a whole and extracts the temporal features for CPD task. The Bi-LSTM iterates over pulse sequences x from opposite directions of forward (superscript f w ) and backward (superscript bw), generating hidden state vectors Here x t is the input pulse of the LSTM at time index t, and h t is the hidden state. LSTM [31,32] denotes the LSTM cell function, which is implemented through the following functions: where f t , i t , o t variables represent the forget gate vector, input gate vector and output gate vector;C t is the new candidate value for the cell state, C t and C t−1 are the new and the old cell state; W and b with corresponding gate subscripts denote the learnable weights and the corresponding biases; σ and are the sigmoid function and the Hadamard product, respectively. Then the Bi-LSTM output's hidden state h t concatenates the h f w t and h bw t , which can effectively capture the contextual information and dependencies in the input pulse x t .

Multi-Head Attention Module
Lost and spurious pulses can cause incorrect PRI values that interfere with detection, yet these values also contain some original information. However, the LSTM lacks the ability to focus on certain areas adaptively, potentially leading to redundancy or lost information during learning process. To address this issue, the MHA mechanism is employed to enable the LSTM to concentrate on the critical feature vectors. This approach enhances the model's ability to capture the most relevant information from the input data, leading to more accurate and efficient detection outcomes.
The attention mechanism is simulating the data retrieval process in the data management system. To retrieve data, a query needs to be addressed to the data management system. If the query matches a key, the value associated with the key will be retrieved. Equation (14) shows the formulation process of queries Q, keys K, and values V.
where W Q , W K , W V are trainable weight matrices. To retrieve the most relevant part of the values V, the scaled dot-product method uses to calculate attention [33]: where d k is the model dimension; QK T is used to calculate the self-attention score which is divided each by √ d k ; and a softmax function is employed to obtain the weights of the values. As shown in Figure 6, the MHA mechanism is a projection of Q, K, and V through different linear transformations for h times, which corresponds to perform the attention function on different projected versions in parallel. These are concatenated and once again projected, resulting in the final values. This structure allows the model to focus on different aspects than the single self-attention.The equations are as follows.
where W 0 are the trainable weight matrices of the final projection.
Hidden units

Classifier
Finally, the classifier outputs the probability of change point at time t from the Fully Connected (FC) layer through the sigmoid activation function σ. Given the features a t extracted by the MHA module at time step t, the classifier provides the predicted output y t ∈ [0, 1], which is the probability of input sample x t being a change point.
where W FC and b FC are the weight and bias of the FC layer, respectively. Another solution to the problem of unbalanced data that we adopted is to reweight the losses to give higher weight to the error relative to the change points. Thus, the wBCE loss function is obtained. By adjusting the weight of the loss function, the model can be forced to pay more attention to the minority target label during training, which improves the accuracy of detecting minority targets.
In this paper, w c is set to w c = (N N /N CP ), with N N and N CP being the total numbers of normal points and change points in the sequence, respectively. The use of this loss function for updating results in better model performance.

Simulations and Analysis
In this section, four non-ideal scenes are simulated to evaluate the ability of the proposed framework for fine-grained MFR work mode CPD with different PRI sequences.

Dataset Description
In this paper, the MHAB network is evaluated on a simulated PRI sequence set with change points. Four PRI modulation types were adopted in the sample sequences including jittered, periodic, sliding and stagger. The simulation data were generated randomly based on the modulation types and the modulation parameter values were generated uniformly from specific ranges. The corresponding value ranges of the modulation types and parameters during training and testing are listed in Table 1. An L length PRI sequence with K change points was utilized as a sample, where L = 800, K = U (1,4). The interval between adjacent change points was greater than 100. In order to evaluate the robustness of the proposed method in non-ideal conditions, four scenarios were simulated, including measurement noise, lost pulses, spurious pulses and the hybrid scenes combinations of these three. The three basic non-ideal condition scenarios are denoted as Measurement Noise Only (MNO), Lost Pulse Only (LPO), and Spurious Pulse Only (SPO). In the MNO scenario, GWN was introduced to the TOA observation at seven noise levels with a variance σ ranging from 0 µs to 3 µs with a step of 0.5 µs. Both LPO and SPO involved pulse sequences with a proportion of lost or spurious pulses at five levels ranging from 0 to 20% in 5% steps. The hybrid scenario combined the three basic non-ideal conditions, the parameter setup as shown in Table 2. Training and test sets were generated to verify the robustness of the model under these non-ideal scenarios. Taking lost pulse only scenarios as an example, 4000 samples per lost pulse ratio level were generated, i.e., 20,000 in total for five levels. The samples are divided into training and test sets with a proportion of 75% and 25%, respectively. Then the model was trained using the 15,000 samples in training sets together and separately tested with 1000 samples at each level. The dataset settings were the same for other scenarios. Simulations were also conducted to verify the extent to which the model trained on MNO was affected by the modulation types and the number of change points.

Evaluation Metrics
The ability to detect change points within a specific margin of error is an important aspect of evaluating CPD methods [2]. F1-score can represent the detection accuracy within a certain error range. Therefore, in this study, the F1-score is used to evaluate the performance of the models. A change point estimation was considered to be a True Positive (TP) if it fell within an uncertain detection margin M of the ground truth change point. In cases where multiple change point estimates fell within the error margin of the ground truth change point, only the closest estimate was considered a true positive, and the rest were considered False Positives (FP). Ground truth change points without any estimates within the specified error margin were considered False Negatives (FN). In this simulation, the detection margin M and the target region of label configuration were set to 5.

Simulation Implementation
The Adam optimizer [34] was adopted with a learning rate l r of 0.00005 (5 × 10 −5 ). For the network layers, the network weights were updated using the following rule: The number of hidden layer nodes N h of the network was set to 256, and the size of the input batch was 128. In addition, in order to mitigate overfitting, the dropout strategy was employed for the Bi-LSTM layer. After multiple rounds of simulations, when the dropout rate was 0.25, the model achieved a better performance. Following [33], the number of attention heads h was set to 8. Early stopping was been employed in the simulations to avoid overfitting during training. The programing environment of the proposed model was Python 3.7.0 and the deep learning library Pytorch 2.6.0 was used.

Validation of Basic CPD Performance
To demonstrate the effectiveness in CPD, this section compared the performance of the proposed method with three methods: CNN [2,18,19], CNN-LSTM [2] and Bi-LSTM [20,21,31] as baselines. Ref. [18] extended PRI modulation recognition to PRI CPD on the basis of CNN. The hybrid CNN-LSTM model [2] combined CNN and LSTM which extracting complex features effectively with spatial and temporal information. With its inspiration, this paper applied CNN-LSTM from anomaly detection in our MFR problem. The Bi-LSTM for radar pulse sequence analysis [31] and a version of Bi-LSTM involving weighted BCE loss function (Bi-LSTMw) [35] were both adopted as baselines.
In order to verify the basic performance of CPD for fine-grained MFR work modes, 1000 test samples with different parameter were generated at each modulation type. The model were trained by the MNO scenario at all variance levels and tested at the σ = 1.5 µs level. Due to the sparsity of change points in the samples, the baseline methods could hardly reach convergence in our simulation scenario with the normal binary labels, so label configuration was added to all baseline methods. In the methods using wBCE loss function, predictions were obtained by retaining the peaks in the output probability distribution with threshold values greater than 0.8, while for the other methods, the threshold was set to 0.3.
The different methods' CPD performance on PRI sequences of different modulation types is presented in Table 3. It can be seen from Table 3 that the MHAB achieved the best effect under all the modulation types. Because the LSTM can learn long time sequence information, it showed a significant improvement compared with CNN. CNN-LSTM combines local spatial information and long temporal information, thus reducing the error rate. The Bi-LSTM allowed the model to learn the dependencies between the past and future of the data to better capture two-directional features, so its detection effect was also improved. The weighted BCE loss function constrained the acceleration of the network's convergence and reduced the possibility of the output being drowned in noise, while the MHA mechanism further enhanced this effect. Since the constraint of wBCE loss function made the probabilities obtained by the network closer to 0 and 1, it can be seen from Figures 7 and 8e that the predicted values of MHAB and Bi-LSTMw at the change point were also the highest, reducing the detection error caused by the peak threshold setting.  Compared with the randomly jittered PRI, the modulation characteristics of more regular sequences were more easily learned by the network. Therefore, the detection effect for periodic, sliding and stagger PRI sequences was better.

Performance under Non-Ideal Conditions
Electromagnetic environments with highly non-ideal conditions often result in contaminated pulse sequences. Effective CPD methods should be robust enough to correctly recognize change points in corrupted pulse sequences. The models' performance was therefore evaluated in four non-ideal scenarios, including MNO, LPO, SPO and hybrid scenario at 7, 5, 5 and 4 levels of interference intensity, respectively. In each scenario, the model was trained using 3000 samples per non-ideal condition intensity level together. And 1000 samples were tested on each level. The F1-score was used to evaluate the performance of the different methods, and the corresponding simulation results are shown in Figure 8. The MHAB significantly outperformed the other baseline methods in all non-ideal conditions. Figure 8a shows that the performance of all methods was almost unaffected by the increase in measurement noise variance. Figure 8b,c show that the total F1-score of MHAB was almost unaffected by the proportion of lost and spurious pulses, but the total F1-score of CNN,CNN-LSTM and Bi-LSTM decreases as the conditions become worse. In Figure 8d, as the hybrid conditions became more complex, the total F1-score of all methods decreased, inevitably. But the proposed method proved to be more stable. Because it lacks the ability to capture timing relationships, CNN exhibited the worst performance among all methods. Compared with CNN-LSTM and Bi-LSTM, Bi-LSTMw achieved better performance, thanks to the use of wBCE in training. However, the effective improvement of Bi-LSTMw due to the utilization of wBCE loss became less obvious as the percentage of spurious pulses increased. This was primarily because the original sequence features were not effectively extracted, owing to the overwhelming presence of noise. To compensate for this shortcoming, the MHA mechanism was adopted in the proposed method. As shown in Figure 8e, the stagger PRI sequence exhibited five work modes and corresponding four change points in the hybrid 4 scenario. However, the transition between the fourth and fifth work modes was not clearly visible in the noise. Including Bi-LSTMw, none of the four baselines were able to detect this change point. With the help of MHA, MHAB was able to focus on this transition well. Although hybrid scenes posed a higher challenge to feature extraction, the proposed method still achieved a 93.83% CPD accuracy with 15% spurious pulse, 15% lost pulse and 3 µs measurement noise.

Influence of Label Configuration
This section examined the influence of the designed label configuration. F1-score was used to present the results. In Section 3.3, the smoothing label sequence is designed to address the pulse confusion problem that arises at change points in practical situations. It also solves the difficulty of model fitting caused by sparse samples to obtain better performance. Figure 9a compared the performance of the original binary label and designed label. With different modulation types, the F1-score for designed label was considerably higher than original label corresponding to all the four types. The maximum gap was 8.08%. Summarizing, the superiority of label configuration in this study was verified.

Influence of WBCE Loss Function
As discussed in Section 3.4.3, another solution to the problem of unbalanced data that we adopted is wBCE loss function. This section compared the performance of the MHAB network trained by binary cross entropy (BCE) loss function and wBCE loss function. The result was presented by F1-score.
The periodic and sliding modulation types exhibit obvious trends, making the ordinary BCE loss function sufficient for the model to learn these features. However, the weak features of random jittered and stagger modulation in MFR sequences submerge the noise, which pose a great challenge to BCE loss function. As shown in Figure 9b, the effect enhancement of the network using wBCE loss function was more significant for jittered and stagger types. The F1-scores of the former methods increased from 87.83% to 97.34% and from 91.13% to 99.72%.

Influence of Change Point Numbers
The variation of the number of change points causes various detection difficulties. Therefore, in this simulation the network was trained uniformly through the introduction of K = 1 to 4 change points in MNO scenario and tested at the σ = 1.5 µs level. 1000 test samples were generated at each level. The FN and F1-score were measured respectively under different values of K. The corresponding results are shown in Figure 10. Intuitively, with the increase of the change point number, the number of false positives increased. On the whole, the FN was more sensitive to K than the total F1-score, which means that FN should be monitored more closely in CPD tasks. Compared with the other baseline methods, the FN increase rate as K increased was reduced. This showed that the proposed model can adapt to more complex samples, and the FN for samples containing four change points was only 59 points out of 4000 total predicted points.

Conclusions
In this paper, an MFR work modes CPD framework using the MHAB network is proposed. In this framework, the Bi-LSTM learns the dependencies of contextual data in the pulse stream, while the attention mechanism emphasizes the informative features and ignores the noisy parts. Meanwhile, a new label configuration is adapted and an improved wBCE loss function is utilized for training, which solves the problem of change point label sparsity effectively. The simulation results demonstrate that in each modulation type, the proposed framework effectively extracts the temporal features between pulses of different modulation parameters and can be used to detect changes in work modes in different non-ideal environments. In future work, we will perform the detection using real data and conduct this research in unsupervised methods.
Author Contributions: Data curation, methodology, writing-original draft., Y.F.; supervision, writing-review and editing, Q.Z. and Z.Z.; conceptualization, resources and supervision, J.Y. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.