Convolutional Neural Networks for Local Component Number Estimation from Time–Frequency Distributions of Multicomponent Nonstationary Signals

Abstract: Frequency-modulated (FM) signals, prevalent across various applied disciplines, exhibit time-dependent frequencies and a multicomponent nature necessitating the utilization of time-frequency methods. Accurately determining the number of components in such signals is crucial for the various applications reliant on this metric. However, this poses a challenge, particularly amidst interfering components of varying amplitudes in noisy environments. While the localized Rényi entropy (LRE) method is effective for component counting, its accuracy significantly diminishes when analyzing signals with intersecting components, components that deviate from the time axis, and components with different amplitudes. This paper addresses these limitations and proposes a convolutional neural network (CNN)-based approach for determining the local number of components using a time–frequency distribution of a signal as input. A comprehensive training set comprising single and multicomponent linear and quadratic FM components with diverse time and frequency supports has been constructed, emphasizing special cases of noisy signals with intersecting components and differing amplitudes. The results demonstrate that the estimated component numbers outperform those obtained using the LRE method for the considered noisy multicomponent synthetic signals. Furthermore, we validate the efficacy of the proposed CNN approach on real-world gravitational and electroencephalogram signals, underscoring its robustness and applicability across different signal types and conditions.


Introduction
In various scientific domains, such as gravitational wave detection [1], radar systems [2,3], tire sensor signal processing [4], and biomedical signal processing [5,6], signals frequently exhibit nonlinear frequency modulation (FM), characterized by time-dependent frequencies known as instantaneous frequencies (IFs). Time-frequency distributions (TFDs) are essential tools for representing signal energy in the joint time-frequency (TF) domain [7]. Quadratic TFDs (QTFDs), widely utilized in practical contexts, frequently yield undesirable oscillatory artifacts, referred to as cross-terms, notably evident in signals comprising multiple components. While methods such as two-dimensional (2D) low-pass filters in the ambiguity function (AF) domain can suppress cross-terms, they may compromise the quality of useful components, known as auto-terms. Given the diversity of signals, various TFD methods have emerged over time [7–10].
Many methods in time-frequency signal analysis, such as IF estimation methods, signal decomposition, and TFD reconstruction methods, necessitate prior knowledge of the local number of signal components, i.e., signal complexity [7,11–14]. Related work has also applied pre-defined CNNs, such as VGG16, to time-frequency images to assess the health of milling tool inserts, achieving accuracies of approximately 98%. A similar approach is used by Jung et al. [31], who apply CNNs with transfer learning to spectrograms of sound data recorded from an engine to detect rotor faults, obtaining models with accuracy above 99%. These examples demonstrate the high applicability of CNNs to images generated with time-frequency analysis.
In this research, our goal is to demonstrate that training datasets for the proposed CNN can be constructed from synthetic single and multicomponent signals, encompassing both linear FM (LFM) and quadratic FM (QFM) components, to achieve improvements over the LRE method and robust performance across a wide spectrum of signals. Special emphasis is placed on signal scenarios featuring diverse time and frequency supports, closely spaced and intersecting components, and variations in component amplitudes, where existing methods have exhibited significantly diminished accuracy. We also show that incorporating additive white Gaussian noise (AWGN) into the training process reduces the CNN's sensitivity to noise. This approach allows users to simply provide the signal's TFD, eliminating the need for additional parameter tuning or the use of additional LRE variants and component separation techniques. In addition to the synthetic signal examples, our study demonstrates the enhanced accuracy of estimated local numbers of components for real-world electroencephalogram (EEG) and gravitational signal examples, previously unseen by the CNN. The key contributions of this study are summarized as follows:

1. Development of a novel CNN-based framework for accurate local component counting in TFDs, overcoming the limitations of traditional LRE methods.

2. Introduction of a comprehensive training dataset comprising synthetic signals with LFM and QFM components, facilitating the robust generalization of the CNN across a wide range of synthetic and real-world signal types and complexities.

3. Incorporation of AWGN into the training process, improving the CNN's robustness to noise.

4. Simplification of the local component counting process by eliminating the need for additional parameter tuning within the LRE method and the use of additional iterative LRE and NBRE approaches, streamlining the application for end-users.
The remainder of the paper is structured as follows. Section 2.1 introduces adaptive time-frequency distributions. Section 2.2 defines the LRE method and highlights its limitations. The methodology of the proposed CNN is outlined in Section 2.3. Section 3 presents the experimental simulation results and provides a discussion. Finally, Section 4 presents the conclusions of the paper.

Adaptive Time-Frequency Distributions
The representation of a nonstationary signal comprising NC components, denoted as z(t), is established as the analytic counterpart of a real signal, delineated as [7]

$$z(t) = \sum_{k=1}^{NC} z_k(t) = \sum_{k=1}^{NC} A_k(t)\, e^{j \varphi_k(t)},$$

where z_k(t) is the k-th signal component, while A_k(t) and φ_k(t) denote the k-th signal component's instantaneous amplitude and phase, respectively. The ideal TFD, denoted as Υ(t, f), assumes the form of a unit delta function that traces the crests of ridges representing the IF, f_{0_k}(t), for each component. This concept is articulated as [7]

$$\Upsilon_{\mathrm{ideal}}(t, f) = \sum_{k=1}^{NC} A_k^2(t)\, \delta\big(f - f_{0_k}(t)\big),$$

where f_{0_k}(t) signifies the dominant frequency of the k-th component at a specific time. However, achieving the ideal TFD is often unattainable due to the imprecise localization and potential influence of cross-terms in practical scenarios, as acknowledged in the literature [7].
The Wigner–Ville distribution (WVD) stands as a core TFD method, offering a near-perfect estimate of the IF for signals dominated by a single LFM component in the TF plane [7]. The WVD is defined as the Fourier transform (FT) of the instantaneous auto-correlation function:

$$W_z(t, f) = \int_{-\infty}^{\infty} z\!\left(t + \frac{\tau}{2}\right) z^{*}\!\left(t - \frac{\tau}{2}\right) e^{-j 2 \pi f \tau}\, d\tau.$$

Despite its efficacy, the susceptibility to cross-terms in multicomponent signals necessitates effective suppression techniques. Utilizing the AF, denoted as

$$A_z(\nu, \tau) = \int_{-\infty}^{\infty} z\!\left(t + \frac{\tau}{2}\right) z^{*}\!\left(t - \frac{\tau}{2}\right) e^{-j 2 \pi \nu t}\, dt,$$

and a 2D low-pass filter, denoted as w(ν, τ), offers a pathway to define a class of TFDs known as QTFDs, represented by Υ(t, f) [7].
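For concreteness, the discrete WVD above can be sketched in a few lines of numpy; the function name `wvd`, the loop-based formulation, and the zero-padding of out-of-range lags are our own illustrative choices, not specified in the paper.

```python
import numpy as np

def wvd(z):
    """Minimal discrete Wigner-Ville distribution of an analytic signal z.

    For each time instant n, the instantaneous auto-correlation
    z[n+m] * conj(z[n-m]) is formed over all admissible lags m, and an
    FFT over the lag axis yields one row of the time-frequency matrix.
    """
    N = len(z)
    W = np.zeros((N, N), dtype=complex)
    for n in range(N):
        m_max = min(n, N - 1 - n)          # largest lag that stays in-bounds
        kernel = np.zeros(N, dtype=complex)
        for m in range(-m_max, m_max + 1):
            kernel[m % N] = z[n + m] * np.conj(z[n - m])
        W[n] = np.fft.fft(kernel)
    return np.real(W)                      # WVD of an analytic signal is real
```

Note the well-known frequency doubling of this lag convention: a tone at normalized frequency f0 peaks at FFT bin 2·f0·N, which a practical implementation rescales or interpolates away.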
Usually, simple multiplication in the AF domain is computationally less demanding than using double convolution, denoted with the double asterisk ∗∗, in the TF domain as [7]

$$\Upsilon(t, f) = W_z(t, f) \ast\!\!\ast\, \gamma(t, f),$$

where γ(t, f) denotes the separable kernel for smoothing the WVD in the TF domain.
One such TFD using independent smoothing kernels is the smoothed pseudo Wigner–Ville distribution (SPWV), defined as [7]

$$\mathrm{SPWV}(t, f) = \int_{-\infty}^{\infty} h(\tau) \int_{-\infty}^{\infty} g(u - t)\, z\!\left(u + \frac{\tau}{2}\right) z^{*}\!\left(u - \frac{\tau}{2}\right) du\; e^{-j 2 \pi f \tau}\, d\tau,$$

where g(t) and h(τ) are the time- and lag-smoothing windows, respectively. QTFDs aim to strike a balance between concentrating auto-terms and attenuating cross-terms, a challenge well documented in the literature [7].
To address the limitations of conventional TFDs, an approach called the adaptive directional TFD (ADTFD) was introduced [14,32]. For each point in the TF plane, this technique adapts the direction of the kernel γ_θ(t, f), expressed as

$$\Upsilon_{\mathrm{ADTFD}}(t, f) = \Upsilon(t, f) \ast\!\!\ast\, \gamma_{\theta}(t, f),$$

where γ_θ(t, f) represents the smoothing kernel controlled by the orientation angle θ [14,32]. In our research, we utilized the extended modified B distribution (EMB) as the basis QTFD, with its separable kernel w(ν, τ) defined according to the established parameters [7]

$$w(\nu, \tau) = \frac{\left| \Gamma(\beta_{\mathrm{EMB}} + j \pi \nu) \right|^{2}}{\Gamma^{2}(\beta_{\mathrm{EMB}})} \cdot \frac{\left| \Gamma(\alpha_{\mathrm{EMB}} + j \pi \tau) \right|^{2}}{\Gamma^{2}(\alpha_{\mathrm{EMB}})},$$

where α_EMB = β_EMB = 0.25 serve as the smoothing parameters in time and frequency, as in [14,32–34]. The chosen smoothing kernel, γ_θ(t, f), is the double-derivative directional Gaussian filter (DGF), as introduced in previous studies [14,32–34]:

$$\gamma_{\theta}(t, f) = \frac{ab}{2\pi}\, \frac{\partial^{2}}{\partial f_{\theta}^{2}}\, e^{-a^{2} t_{\theta}^{2} - b^{2} f_{\theta}^{2}}, \qquad t_{\theta} = t \cos\theta + f \sin\theta, \qquad f_{\theta} = -t \sin\theta + f \cos\theta.$$

The degree of smoothing along the time and frequency axes is regulated by the parameters a and b, respectively [14]. For each TF point, the orientation angle of γ_θ(t, f) is adjusted locally by maximizing the correlation with TF ridges as

$$\theta(t, f) = \arg\max_{\theta} \left| \Upsilon(t, f) \ast\!\!\ast\, \gamma_{\theta}(t, f) \right|.$$

The optimization of smoothing kernel parameters and shape is crucial for optimal performance and depends on the signal under analysis. Previous studies have suggested ranges for parameters a and b that balance the intensity of smoothing and prevent component merging. Additionally, the window length W_L affects the performance of the ADTFD: smaller values fail to resolve close components but preserve short-duration signals' energy, while larger values achieve the opposite effect.
To automate parameter optimization, we employed the locally adaptive ADTFD (LOADTFD) method proposed in [33], which selects TF points with minimal values from a set of ADTFDs and their respective parameters. This results in a LOADTFD that effectively preserves short-duration signals' energy while enhancing resolution and suppressing interference. In our study, we selected parameter values (a, b) from a predefined set, while W_L was tuned for each combination of (a, b) using the energy concentration measure [35]

$$M = \left( \sum_{n=1}^{N_t} \sum_{k=1}^{N_f} \left| \Upsilon(n, k) \right|^{1/2} \right)^{2},$$

where N_t and N_f represent the numbers of time samples and frequency bins, respectively. For the pairs (3, 6) and (2, 20), the W_L range is set to [N_t/8 : 4 : N_t/4], while for the pairs (3, 8) and (2, 30), it is set to [N_t/4 : 4 : 3N_t/8], as suggested in prior work [33].
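The concentration measure used for tuning W_L can be sketched as follows; the energy normalization step and the function name are our assumptions, since the measure from [35] only prescribes the squared sum of square-rooted magnitudes.

```python
import numpy as np

def concentration_measure(tfd):
    """Energy concentration measure in the spirit of [35]: the squared sum
    of the square roots of the (energy-normalized) TFD magnitudes.
    Lower values indicate a more concentrated distribution."""
    p = np.abs(tfd) / np.abs(tfd).sum()   # energy normalization (assumption)
    return np.sqrt(p).sum() ** 2
```

A single-peak TFD scores lower than a perfectly flat one of the same total energy, which is what makes the measure usable for selecting the window length that best concentrates auto-terms.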

The Localized Rényi Entropy
Considering that a TFD represents a pseudo-energy density in the TF domain, the Rényi entropy, represented as H(Υ(t, f)), serves as a comprehensive measure of signal complexity in the TF domain [36–38]. It is defined by the expression

$$H(\Upsilon(t, f)) = \frac{1}{1 - \alpha_R} \log_2 \iint \left( \frac{\Upsilon(t, f)}{\iint \Upsilon(u, v)\, du\, dv} \right)^{\alpha_R} dt\, df,$$

where α_R > 2 is set as an odd integer to integrate out the cross-terms. This is a global version of the measure, meaning that it is applied to the whole TFD. The primary constraint of the global Rényi entropy estimation approach is its applicability solely to multicomponent signals composed of shifted replicas of a single fundamental component. It proves ineffective when confronted with multicomponent signals featuring components with varying time/frequency supports and frequency modulations [15,16].
To address this challenge, the local number of signal components can be estimated by leveraging the counting property of the Rényi entropy using a short-time moving window (controlled by the parameter Θ_t) [15,16]. Termed the LRE or short-term Rényi entropy (STRE), this complexity measure enables the continuous estimation of the number of components within the moving time window, given by [15,16]

$$NC(t_0) = 2^{\, H\left( \Upsilon^{\Delta t_0}(t, f) \right) - H\left( \Upsilon_{\mathrm{ref}}^{\Delta t_0}(t, f) \right)},$$

where (·)^{Δt_0} denotes that only a segment of the TFD in the vicinity of t_0 is considered. Note that Υ_ref(t, f) denotes the TFD of a reference signal, chosen as a cosine signal with an amplitude of 1 and a constant normalized frequency of 0.1 [15,16]. Even though time and frequency marginals are not preserved over short-term estimation intervals, it has been demonstrated that the counting property of the Rényi entropy remains valid under the assumption of a positive TFD or a TFD with reduced interference [15,16].
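The counting property behind the LRE can be sketched directly on discretized TFD matrices; the edge handling of the moving window and the row-major time layout below are simplifying assumptions of ours, not the paper's implementation.

```python
import numpy as np

def renyi_entropy(tfd, alpha=3):
    """Rényi entropy of a non-negative TFD segment, energy-normalized."""
    p = tfd / np.sum(tfd)
    return np.log2(np.sum(p ** alpha)) / (1 - alpha)

def local_component_count(tfd, tfd_ref, theta_t=11):
    """LRE sketch: within a moving time window of width theta_t centered at
    t0, NC(t0) = 2**(H(segment) - H(reference segment)).
    Rows of `tfd` are assumed to index time, columns frequency."""
    n_t = tfd.shape[0]
    half = theta_t // 2
    nc = np.zeros(n_t)
    for t0 in range(n_t):
        lo, hi = max(0, t0 - half), min(n_t, t0 + half + 1)
        nc[t0] = 2 ** (renyi_entropy(tfd[lo:hi])
                       - renyi_entropy(tfd_ref[lo:hi]))
    return nc
```

On an idealized TFD holding two disjoint, equal-energy copies of the reference ridge, the estimate evaluates to exactly two components, which is the counting property in action.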
Driven by the constraints of entropy-based estimation methods for signals with components of equal amplitudes, an iterative method for estimating the local number of components with varying spectral amplitudes was delineated in [27]. In this strategy, the predominant spectral component is filtered out in each iteration, which consequently emphasizes weaker spectral components. At each iteration j of the algorithm, the strongest remaining component is removed and the iterative local count, NC_t^iter, is updated accordingly [17,27].

Limitations
The limitations of both the original and iterative LRE methods are illustrated in Figure 1 using three synthetic signal instances featuring LFM components. The first signal example (see Figure 1a) portrays two LFM components with distinct slopes relative to the time axis. In Figure 1d, it is evident that, as a signal component deviates considerably from the reference component of the LRE, the NC_t experiences an artificial increase, leading to significantly inaccurate estimates compared to those for components more parallel to the time axis, which are considerably closer to the ideal NC_t. This discrepancy arises from the calculation of the LRE, where the Rényi entropy of the segmented single component becomes substantially larger than that of the reference, i.e., H(Υ^{Δt_0}(t, f)) ≫ H(Υ_ref^{Δt_0}(t, f)). The second signal example, depicted in Figure 1b, showcases two intersecting LFM components. As depicted in Figure 1e, both the original NC_t and the iterative NC_t^iter exhibit a decline when two components intersect. The limitation of the iterative method lies in the inadvertent removal of the strongest component, resulting in the unwanted deletion of the second component precisely at and near the intersection point. Consequently, this leads to an even larger decrease in the number of components than observed with the original LRE method. The third signal instance, composed of three LFM components with different amplitudes (see Figure 1c), was embedded in strong AWGN with a signal-to-noise ratio (SNR) of 2 dB, demonstrating a limitation of the iterative LRE method wherein noise samples are erroneously recognized as auto-terms.
It is apparent that the iterative LRE approach does not effectively address the aforementioned limitations of the original approach, except in specific cases where components exhibit different amplitudes in a noise-free environment. Additionally, research in [6] demonstrates a significant dependence of NC_t^iter on the initial TFD threshold level when considering real-world EEG signals. The utility of the iterative LRE is better suited for purposes such as extracting the strongest component, as demonstrated in [6], and for optimization purposes where cross-terms need to be detected, interpreting NC_t^iter as the number of total energy regions rather than the number of auto-terms [17]. Therefore, in this study, we opt to consider the original and more robust LRE method for comparison in Section 3, as it has been widely utilized across various applications [11,18–22,39]. (Figure 1f shows NC_t, ⌊NC_t^iter⌉, and ⌊NC_t⌉ corresponding to the LOADTFD in Figure 1c; NC_t and ⌊NC_t⌉ were obtained using the original LRE method in [15], while ⌊NC_t^iter⌉ was obtained using the iterative LRE method in [27].)

The Proposed CNN-Based Approach
Convolutional neural networks are based on the application of filters that are convolved with the input of the network. In comparison to classic artificial neural networks (ANNs), the network parameters are stored as the values within the filter tensors instead of within the connection weights between neurons [40]. The training process is the same as with classic artificial neural networks: the error is determined in the forward propagation stage, and the filter values are adjusted in the backward propagation stage, based on the error gradient [41].
When working with matrix-shaped two-dimensional data, the models commonly applied are two-dimensional convolutional networks, built of layers that perform the two-dimensional convolution operation

$$S(m, n) = (I \ast h)(m, n) = \sum_{u} \sum_{v} I(m - u, n - v)\, h(u, v),$$

with I being the input matrix, h being the applied filter, m and n being the indices of the feature map, and u and v being the indices of the filter. In addition to the convolutional layers, the max pooling layer is also applied, which reduces the dimensionality of the input by selecting the maximum value from a subset of the input matrix. The pooling operation is defined as

$$P(m, n) = \max_{(u, v) \in R_{m,n}} I(u, v),$$

where R_{m,n} is the pooling region associated with the output position (m, n). We define three different CNN architectures, with the main differentiation being the depth of the network, in other words, the number of convolutional and pooling layers applied. To simulate the application of the previous methods and to generate the expected output of the number of components present in time, all the models are tuned to output a vector of 256 values, which is compared to the expected output, as described previously. Each convolutional layer is defined by its filter width and height. Because these are the same for each example in this study (a choice made given that the input is square as well), the width and height are defined by a single value, m. Additionally, the number of filters k is another parameter; each filter is applied to the input, allowing for the creation of more feature maps, or in other words, a greater number of network parameters to be tuned. That way, each convolutional layer can be defined as (k, (m, m)), with the max pooling layer defined as (s, s), with s being the stride of the pooling operation. With that, the change in the dimensionality of an input of size N can be calculated as, in the case of the convolutional layer [42],

$$N - m + 1,$$

and in the case of the pooling layer,

$$\left\lfloor \frac{N}{s} \right\rfloor.$$

The used models are given in Figure 2.
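The two size rules above compose layer by layer, which makes it easy to check that a stack of layers ends at the desired output size. The following helper is a sketch of that bookkeeping; the ('conv', m) / ('pool', s) layer specification is our own shorthand, not the paper's notation.

```python
def output_size(n, layers):
    """Trace the spatial size of a square input of size n through a layer
    stack, applying the valid-convolution rule N - m + 1 for ('conv', m)
    entries and the pooling rule floor(N / s) for ('pool', s) entries."""
    for kind, p in layers:
        n = n - p + 1 if kind == 'conv' else n // p
    return n
```

For example, a 256-sample-wide input passed through a 5x5 convolution and a stride-2 pooling layer shrinks to 252 and then 126; repeating the pair yields 61.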
In the rest of the paper, the first model from the left, with the smallest number of layers, will be referred to as model 1; the medium-sized model will be referred to as model 2; and the largest model as model 3.
It can be seen that different layer configurations are used. The smallest model in Figure 2a uses only two convolutional and pooling layers, increased to a total of five in the case of the second model in Figure 2b. The last model is designed to utilize convolutional layers until the final desired output size of 256 values is reached through convolution and max pooling operations, as shown in Figure 2c. All of the networks end with a flatten layer, which converts the two-dimensional output of the convolutional and pooling layers into a one-dimensional vector, which is then used as the input to the fully connected layer. The fully connected layer maps the input to the desired output size of 256 values, which is then compared to the expected output [43,44]. All of the networks are trained with batch sizes of 8, 16, and 32, for 5, 10, 25, and 50 epochs. The learning rate used for training is 0.001, and the Adam optimizer is used for model training. These values are tested in a grid search scheme, in which all possible combinations of the discrete values given for the batch sizes and epochs are tested. The network is trained anew for each combination and the score is noted, looking for the best possible combination of the given hyperparameters. The loss function used is the mean absolute error (MAE), which compares the output of the model, the vector Ŷ of size 256, to the expected output, the vector Y of the same size. The loss function is defined as

$$\mathrm{MAE} = \frac{1}{256} \sum_{i=1}^{256} \left| \hat{Y}_i - Y_i \right|,$$

and will also be used in the evaluation of model performance. The evaluation is done using a cross-validation schema. The dataset is first randomly split into two parts, with 90% of the set falling into the training set and the remaining 10% into the test set. The training set is then used following the cross-validation principle. Repeating the process five times, the dataset is split into five parts: four parts are combined into the training set, and the remaining part is used as the validation set. The model is trained on the training set, and the validation set is used to evaluate the model's performance during training. The evaluation is finally completed by testing on the separate, unseen test set, again repeated five times, for each of the dataset folds [45].
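The split-then-fold protocol described above can be sketched as a model-agnostic index generator; the function name, the seed handling, and the use of `numpy.array_split` are our own choices under the stated 90/10 and five-fold assumptions.

```python
import numpy as np

def cv_splits(n_samples, n_folds=5, test_frac=0.1, seed=0):
    """Yield (train_idx, val_idx, test_idx) tuples implementing a 90/10
    train/test split followed by n_folds-fold cross-validation on the
    training portion. The same held-out test set is reused per fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(test_frac * n_samples)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = np.array_split(train_idx, n_folds)
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield tr, val, test_idx
```

Each tuple indexes into the full dataset, so the same splitter works for any model and for inputs of any shape.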

Training Set
We have constructed a dataset comprising input–output pairs {(Υ^{(j)}(t, f), NC_t^{(j)})}_{j=1}^{8000}, where the j-th input and output tensors are represented as Υ^{(j)}(t, f) ∈ R^{256×256} and NC_t^{(j)} ∈ R^{1×256}, respectively. The prediction process of the CNN with parameters χ, viewed as a function f_χ applied to the input Υ(t, f) and yielding the predicted output, can be written as

$$NC_t^{\mathrm{CNN}} = f_{\chi}(\Upsilon(t, f)).$$

For the dataset, we have generated synthetic signals, both single and multicomponent, expressed as a summation of NC finite-duration signals as follows:

$$z(t) = \sum_{k=1}^{NC} z_k(t)\, \Pi_k(t),$$

where t_{0_k}, t_{f_k}, and T_k denote the starting time, ending time, and duration of the k-th signal component, respectively. Here, Π_k(t) is a rectangular function defined as

$$\Pi_k(t) = \begin{cases} 1, & t_{0_k} \leq t \leq t_{f_k}, \\ 0, & \text{otherwise}. \end{cases}$$

Given the prevalent occurrence of LFM or QFM behavior in real-world signals [7,46], each signal z_k(t) in our dataset embodies either an LFM or QFM component of the polynomial-phase form z_k(t) = A_k e^{jφ_k(t)}, where f_{0_k} and A_k denote the starting normalized frequency and amplitude, respectively, while a_k, b_k, and c_k represent the frequency modulation rates. These parameters were randomly generated to encapsulate diverse variations of signal components, including instances of multiple intersections with varying amplitudes. For our investigation, we constrained the randomly drawn values of A_k, f_{0_k}, t_{0_k}, t_{f_k}, and the corresponding rates a_k, b_k, and c_k to facilitate diverse signal time and frequency supports across the entirety of the TFD. Finally, we embedded randomly selected signals into AWGN with SNRs down to 0 dB. We selected the EMB, SPWV, LOADTFD, and WVD as the training data TFDs, primarily for their widespread application in the LRE method, as documented in previous studies [11,16,18–22,39]. The inclusion of these TFDs facilitates a comprehensive comparison with the LRE method. Additionally, we incorporated the WVD to assess the CNN's efficacy when confronted with TFDs containing cross-terms.
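A generator in the spirit of this training set can be sketched as follows. The parameter ranges, the seed handling, and the restriction to a linear chirp rate are illustrative assumptions; the paper's exact sampling ranges for A_k, f_{0_k}, and the rates a_k, b_k, c_k are not reproduced here.

```python
import numpy as np

def synth_signal(n_t=256, n_components=2, snr_db=None, seed=0):
    """Illustrative multicomponent FM signal: each component is a
    finite-duration tone A_k * Pi_k(t) * exp(j*phi_k(t)) with a randomly
    drawn polynomial phase, optionally embedded in complex AWGN."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_t)
    z = np.zeros(n_t, dtype=complex)
    for _ in range(n_components):
        t0, t1 = sorted(rng.integers(0, n_t, size=2))
        gate = (t >= t0) & (t <= t1)            # rectangular support Pi_k(t)
        amp = rng.uniform(0.5, 1.5)             # amplitude A_k (assumed range)
        f0 = rng.uniform(0.05, 0.45)            # starting normalized frequency
        rate = rng.uniform(-1e-3, 1e-3)         # FM rate (assumed range)
        phase = 2 * np.pi * (f0 * t + 0.5 * rate * t ** 2)
        z += amp * gate * np.exp(1j * phase)
    if snr_db is not None:                      # embed in AWGN at a target SNR
        p_sig = np.mean(np.abs(z) ** 2)
        p_noise = p_sig / 10 ** (snr_db / 10)
        z += np.sqrt(p_noise / 2) * (rng.standard_normal(n_t)
                                     + 1j * rng.standard_normal(n_t))
    return z
```

A full pipeline would then pair the TFD of each generated signal with the known per-sample component count implied by the overlapping gates.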

Summary of the Proposed Approach
The block diagram depicted in Figure 4 illustrates the fundamental steps of the proposed methodology. Initially, the signals provided by end-users are typically in the time domain. The first task is to transform these signals into their corresponding analytic representations, z(t), particularly when dealing with real-valued signals. This transformation is achieved through the application of the Hilbert transform [7]. Subsequently, the TFD of the signal z(t) needs to be computed, as detailed in Section 2.1. This computation involves selecting an appropriate TFD method from options such as the WVD, the SPWV, the EMB, or the LOADTFD. The chosen TFD then serves as input to the proposed CNN corresponding to the selected TFD method. It is worth noting that, of the three CNN models illustrated in Figure 2, the optimal model will be determined in the following section. Finally, the proposed CNN model produces a vector that represents the local number of components of the signal's TFD. Depending on the application's requirements, this vector may undergo rounding to the nearest integer value.
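The first pipeline step, obtaining the analytic signal from a real-valued one, can be sketched with the standard frequency-domain Hilbert-transform construction; the even-length handling below is a simplifying assumption of this sketch.

```python
import numpy as np

def analytic(x):
    """Analytic signal z(t) of a real signal x: zero the negative
    frequencies and double the positive ones in the FFT domain
    (even-length case shown; DC and Nyquist bins are kept unscaled)."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1
    h[1:N // 2] = 2
    h[N // 2] = 1          # Nyquist bin for even N
    return np.fft.ifft(X * h)
```

For a pure cosine, the result is the corresponding complex exponential, whose real part recovers the original signal exactly.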

Results and Discussion
The performance evaluation of the CNN-based local component estimation method was conducted on four synthetic signals, each comprising N_t = 256 samples. The first signal, labeled z_S1(t) and also used in prior works [19,39], consists of four LFM components with different amplitudes. Real-world signals, namely a gravitational-wave signal (z_G(t)) [1,11,39,47] and one representative of EEG seizure activity (the data and relevant code are publicly available at https://github.com/nabeelalikhan1/EEG-Classification-IF-and-GD-features (accessed on 14 May 2024)) (z_EEG(t)) [11,14,22,33,34,48–50], were also employed for validation purposes. The preprocessing of the real-life signals involved established whitening and filtering techniques to enhance signal detection, as detailed in Table 2; the EEG signal was downsampled to 256 Hz and further to 32 Hz (N_t = 256 samples), with spike signatures enhanced using a differentiator filter [49–52]. Notably, to ensure the integrity of the evaluation, none of the synthetic or real-world signals were included in the training or validation datasets of the proposed CNN model. Note that the calculation of the LRE involved the following parameters: α_R = 3 and Θ_t = 11, as recommended in [11,16,24]. We utilized several metrics to assess the error between the obtained local number of components and the reference (or ideal) values. These metrics include the mean squared error (MSE), MAE, and maximum absolute error (MAX), defined as follows:

$$\mathrm{MSE} = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( NC_i - NC_i^{\mathrm{ideal}} \right)^2, \qquad \mathrm{MAE} = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| NC_i - NC_i^{\mathrm{ideal}} \right|, \qquad \mathrm{MAX} = \max_i \left| NC_i - NC_i^{\mathrm{ideal}} \right|.$$

Larger values of MSE, MAE, or MAX indicate greater disparities between the calculated and ideal local component counts; thus, lower values are preferable.
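The three error metrics translate directly into a small helper; the function name and the tuple return are our own conventions.

```python
import numpy as np

def component_count_errors(nc_est, nc_ideal):
    """Return (MSE, MAE, MAX) between estimated and ideal local component
    counts, following the definitions in the text."""
    e = np.asarray(nc_est, dtype=float) - np.asarray(nc_ideal, dtype=float)
    return np.mean(e ** 2), np.mean(np.abs(e)), np.max(np.abs(e))
```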
To quantitatively evaluate the smoothness of the obtained NC_t curve, we introduced a metric that considers successive changes from positive to negative. To begin with, we compute the differential vector, denoted as ∆NC, which tracks changes in NC_t as follows:

$$\Delta NC_i = NC_{i+1} - NC_i.$$

To discern alterations in the differential vector indicative of successive transitions from positive to negative, we are interested in instances where ∆NC changes its sign from positive to negative. These transitions are identified by the set PTN, defined as

$$\mathrm{PTN} = \left\{ i \,:\, \Delta NC_i > 0 \ \text{and} \ \Delta NC_{i+1} < 0 \right\}.$$

Subsequently, we compute the magnitudes of these transitions, |∆NC_i|, for each i ∈ PTN, and aggregate them to derive the total magnitude of changes, denoted as TM:

$$\mathrm{TM} = \sum_{i \in \mathrm{PTN}} \left| \Delta NC_i \right|.$$
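The TM smoothness metric can be sketched as follows; our reading of the definition (a transition counted when a strictly positive difference is immediately followed by a strictly negative one) is stated in the docstring.

```python
import numpy as np

def total_magnitude(nc):
    """Smoothness metric TM: form the differential vector dNC, find the
    set PTN of indices where a positive difference is immediately
    followed by a negative one, and sum the magnitudes |dNC_i| there."""
    d = np.diff(np.asarray(nc, dtype=float))
    ptn = [i for i in range(len(d) - 1) if d[i] > 0 and d[i + 1] < 0]
    return sum(abs(d[i]) for i in ptn)
```

A monotone or plateaued curve scores zero, while each up-then-down spike contributes its rising magnitude, so fluctuating estimates score higher.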

Model Evaluation
The training performance of the three developed networks can be seen in Table 3. The results are given as the mean and standard deviation, denoted as σ, across different input methods.
For the first neural network, the best results are achieved using the EMB as input, with an MAE of 0.24. The hyperparameters used were a batch size of 16 with 10 training epochs. In fact, all of the best-performing models used 10 training epochs, indicating that the created networks exhibit the overfitting issue relatively quickly. The comparison between the performance for different inputs is given in Figure 5. Clearly, the model based on the EMB shows the best performance on the test set, across all folds, with the slightest variation, performing better than even the best-performing folds of the other models. The second model shows the best results on the EMB input, with an MAE of 0.17, using a batch size of 32. The comparison between the performance for different inputs is given in Figure 6. Again, the EMB is shown to grant the best performance when used as the input to the model. Even the upper bound of the model error across folds with the EMB is lower than the best-performing folds of the other models, as was the case in the previous example. The third model achieves the best results, with an MAE of 0.22, using a batch size of 16 and 10 training epochs, on the SPWV input. Observing Figure 7, the scores between inputs are much more balanced, with only the WVD showing a significantly higher error on the box plot. While the EMB does show a remarkably low error of almost zero in the extreme case on a single fold, its median value is much closer to the remaining values. The computational complexity of the CNN approach depends on the size of the network applied to the problem, determined by the number and size of the filters within it. The smallest network used comprises 4,777,988 multiplication operations, while the largest comprises 8,958,424. Still, since these are linear operations, the time complexity simplifies to O(n). The measured time necessary to infer the number of components from the input is 0.35 ± 0.05 s for the smallest network, 0.38 ± 0.04 s for the medium-sized network, and 0.42 ± 0.06 s for the largest network when using an Intel(R) Core(TM) i5-9400 CPU @ 2.90 GHz to perform inference. It should be noted that this time could be significantly lowered using a tensor- or GPU-based machine for inference. Since the second model shows significantly better results compared to the other two models, it will be used for the further evaluation of different signals, the results of which are presented in the continuation of this section.

Results: Synthetic Signals
Figure 8 illustrates the WVDs, EMBs, SPWVs, and LOADTFDs of the considered synthetic signals z_S1(t), z_S2(t), z_S3(t), and z_S4(t). These TFDs served as inputs to the proposed CNN and represent multicomponent signals with closely spaced or intersecting components exhibiting randomized amplitudes.
The estimated local numbers of components (NCs) obtained using the LRE method with the EMB, SPWV, or LOADTFD as the underlying TFD are depicted in Figure 9. Across all three considered TFDs, the LRE method demonstrates notable limitations. Particularly, the NCs for components deviating from the time axis exhibit a pronounced artificial increase when compared to the ideal scenario, as consistently observed across all signal examples. Moreover, Figure 9c,d highlight that the LRE-based estimates diminish when components intersect, and the LRE method inadequately captures components of low amplitude. The quantitative metrics computed for the estimates in Figures 9–11 complement observations from visual inspection and substantiate the superiority of the proposed CNN approach over the LRE method using any of the considered advanced TFDs: EMB, SPWV, or LOADTFD. In terms of MSE performance, the improvement compared to the original LRE method spans from 55.52% (EMB, z_S4(t)) to 98.63% (SPWV, z_S2(t)). In terms of MAE performance, the proposed CNN-based estimations achieve improvements spanning from 49.33% (EMB, z_S4(t)) to 96.16% (LOADTFD, z_S3(t)). The MAX metric showed that the CNN-based NCs using the EMB, SPWV, or LOADTFD have significantly lower maximum errors when compared to the LRE method's NCs. Notably, the proposed CNN based on the WVD demonstrates inferior performance compared to the other considered TFDs but still outperforms the LRE method, particularly evident for signals z_S1(t), z_S2(t), and z_S3(t). In Table 5, we present the analysis of the TM from consecutive positive-to-negative transitions of the raw NCs for the considered synthetic signals. Across all synthetic signal examples (z_S1(t), z_S2(t), z_S3(t), and z_S4(t)), the TM for the LRE method yields lower values compared to the CNN approach. Specifically, the NCs obtained using the LRE method and the EMB consistently yield the lowest TM, while the NCs obtained using the CNN approach based on the WVD exhibit the highest volatility, thus highlighting it as the most fluctuant curve.

Notably, both spikes and segments of hyperbolic FM components deviate from the time axis, rendering them unsuitable for analysis using the original LRE method, as evident in Figure 13. Specifically, spikes introduce significant errors in the estimated local number of components (NCs), causing an increase even up to seven for the SPWV example. A similar phenomenon is observed for segments of the gravitational signal component, where the expected value of NCs should remain constant at one throughout the signal's duration. Figure 14 portrays the local numbers of components of z_EEG(t) and z_G(t) obtained using the proposed CNN. Upon visual inspection, these results demonstrate improvements over the local number of components estimated using the LRE method, reinforcing observations made for the synthetic signals. For z_EEG(t), the CNN-based NCs, particularly when utilizing the LOADTFD and EMB, accurately identify spikes and indicate the presence of two components during spike occurrences. Furthermore, the NCs based on the LOADTFD notably preserve the tone component. Regarding z_G(t), the CNN-based NCs derived from the LOADTFD, EMB, and SPWV exhibit reduced errors compared to those resulting from the LRE method. These CNN-based estimates consistently indicate the presence of a single component for the majority of time instances. However, it is noteworthy that the NCs using the CNN based on the WVD consistently demonstrate drops to zero, indicating a limitation in capturing components using this particular method.

Noise Sensitivity Analysis
To assess the impact of noise on the estimated local number of components (NCs) using the proposed CNN, a comprehensive comparative analysis was conducted using MSE values in decibels (dB). These values are calculated between the estimated NCs for noise-free and noisy synthetic signals. The signals are deliberately embedded in AWGN at four SNR levels ranging from 9 dB down to 0 dB. It is essential to note that the reported results are based on 1000 independent noise realizations. The findings, as presented in Table 6, demonstrate the superiority of the proposed CNN over the LRE method across all synthetic signal examples and SNR levels. This conclusion is substantiated by the CNN consistently achieving lower MSE values, indicating enhanced accuracy in estimating the local number of components amidst varying levels of noise interference. Table 6 illustrates that the CNN-based estimates exhibit significantly lower sensitivity to AWGN across the considered SNR levels compared to the LRE method. While the superior smoothing capabilities of the LOADTFD benefit the LRE at lower SNRs, the proposed CNN, particularly when using the EMB, LOADTFD, and SPWV, demonstrates competitive performance across all SNR levels. Notably, the CNN's performance with the WVD input surpasses that of the other TFDs for most considered signals, attributed to training with interference-corrupted TFDs such as the WVD, rendering noise inclusion less apparent compared to other TFDs.
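The sensitivity figure of merit used here is a one-liner; the 10·log10 convention for expressing the MSE in dB is our assumption, as the text does not spell out the scaling.

```python
import numpy as np

def mse_db(nc_noisy, nc_noise_free):
    """MSE in decibels between NC estimates obtained for a noisy signal
    and for its noise-free counterpart (10*log10 convention assumed)."""
    e = (np.asarray(nc_noisy, dtype=float)
         - np.asarray(nc_noise_free, dtype=float))
    return 10.0 * np.log10(np.mean(e ** 2))
```

Lower (more negative) values mean the estimate barely moved when noise was added, which is the desired behavior across the averaged noise realizations.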
An added advantage of the proposed CNN methodology is that it circumvents the need for additional parameter tuning, as required, for example, in the LRE method. Furthermore, users are freed from the responsibility of identifying and separating components deemed more suitable for localization via frequency slices, as typically entailed in the NBRE version of the LRE method.
The enhancement in estimated local component numbers holds the potential to improve the performance of numerous LRE-based applications. For instance, a more precise estimation of local component numbers could notably refine the IF estimation techniques employed in algorithms such as those detailed in [6,53]. Additionally, the iterative thresholding algorithm for sparse TFD reconstruction proposed in [19], which currently relies on a balance between the outputs of the STRE or NBRE depending on their correctness for a given signal, could potentially undergo computational simplification by exclusively utilizing the proposed CNN-based estimate.

Limitations of the Proposed Approach
Similar to the LRE method, the results showed that the CNN's performance is affected by the clarity of its underlying TFD. Given the computational simplicity of the WVD over other QTFDs, requiring only the WVD from a user would be preferable. However, the CNN based on the WVD exhibited volatility and notable estimation drops, thus indicating lower performance. Consequently, it is advisable to use the CNN based on the EMB, SPWV, or LOADTFD instead.
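For reference, a minimal direct implementation of the discrete WVD, the computationally simplest of these TFDs, can be sketched as follows (an O(N^2) sketch for an analytic input; the function name and conventions are ours, not taken from a particular toolbox):

```python
import numpy as np

def wvd(z):
    """Discrete Wigner-Ville distribution of an analytic signal z.
    Returns an N x N real array: rows index time, columns frequency."""
    z = np.asarray(z, dtype=complex)
    N = len(z)
    W = np.zeros((N, N))
    for n in range(N):
        # largest lag m keeping both z[n+m] and z[n-m] in range
        mmax = min(n, N - 1 - n)
        kernel = np.zeros(N, dtype=complex)
        for m in range(-mmax, mmax + 1):
            # instantaneous autocorrelation at time n and lag m
            kernel[m % N] = z[n + m] * np.conj(z[n - m])
        # DFT over the lag variable yields the frequency axis
        W[n, :] = np.real(np.fft.fft(kernel))
    return W
```

For a pure tone at normalized frequency f0, each time slice of this sketch peaks at the DFT bin nearest 2*f0*N (the factor of two comes from the lag doubling in the autocorrelation), which is why the WVD frequency axis is conventionally rescaled.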
The results presented in Table 5 suggest that the unrounded curves representing local component numbers exhibit smoother transitions with the LRE method than with the proposed CNN. This smoother behavior of the LRE can be attributed to its use of sliding windows, where the window size influences the curve's smoothness and dynamics. While larger window sizes result in smoother curves, they may compromise prompt component detection. Conversely, smaller window sizes may emphasize unresolved cross-terms or noise samples, leading to reduced accuracy in the estimated numbers and poorer curve smoothness [24]. Although local component numbers are typically rounded for practical applications [18-21,39,54], the raw output of the proposed CNN may limit its usage in cases where the identification of local maxima is required. Therefore, our future research will explore refining the CNN's raw output through a careful analysis of smoothing filters and methods while preserving its dynamic capabilities. It is anticipated that smoothed curves will reduce the inaccurate spikes that may appear in CNN-based estimates when rounding to the nearest integer.
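As one hypothetical illustration of such post-processing (the moving-average filter and window length are our assumptions, not a design from this study), smoothing the raw curve before rounding suppresses isolated spikes:

```python
import numpy as np

def smooth_and_round(nc_raw, win=9):
    """Moving-average smoothing of a raw local-component-number curve,
    followed by rounding to the nearest integer. `win` (odd) is an
    assumed window length controlling the smoothness/dynamics trade-off."""
    nc_raw = np.asarray(nc_raw, dtype=float)
    kernel = np.ones(win) / win
    # 'same' keeps the curve length; near the edges the implicit zero
    # padding biases the estimate, so edge samples need separate care
    smoothed = np.convolve(nc_raw, kernel, mode="same")
    return np.rint(smoothed).astype(int)
```

A single-sample spike of height 5 in an otherwise constant curve of 2, for example, is averaged down to about 2.3 and rounds back to 2, whereas rounding the raw curve would have reported 5 at that instant.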
Regarding the CNN part of the study, it should be noted that a more detailed exploration of possible hyperparameter influences, such as model shapes or training epochs, could always be performed. Still, since the presented architectures achieved satisfying scores, we deemed a more precise exploration unnecessary.
While this study aimed to propose a versatile approach applicable to diverse signal examples, it is recognized that certain applications may benefit from constructing training datasets tailored to signals exhibiting specific characteristics within specific noise environments. This may entail the integration of large datasets comprising real-world signals specific to an application, either as a supplement or a substitute for synthetic datasets. The observed enhancements of our approach, which is not tailored to any specific application, underscore the potential utility of developing application-specific datasets. Such datasets are anticipated to further enhance estimation performance in targeted applications by aligning more closely with the characteristics and challenges present in those contexts.

Conclusions
Our study presents evidence supporting the superiority of the CNN over the traditional LRE method for estimating local numbers of components in signal processing tasks. Through extensive analysis across diverse synthetic and real-world signal examples, the CNN-based estimations consistently outperform the LRE, underscoring the robustness and generalization capability of the CNN approach. We observed that among the TFDs utilized for CNN training, the EMB, SPWV, and LOADTFD each emerged as a competitive option. Despite the smoother transitions seen with the LRE, the CNN offers dynamic capabilities and a higher resilience to noise, making it well-suited for applications in noisy environments.
Our findings also extend to real-world signals, such as the EEG seizure and gravitational signals, where the CNN-based estimations showcase promising results even without specific training for these signals. This highlights the adaptability and effectiveness of CNNs in practical scenarios.
In conclusion, our study provides strong support for the efficacy and versatility of CNNs in signal processing applications. These findings suggest opportunities for the further refinement of CNN models, the exploration of additional training datasets tailored to specific applications, and the investigation of advanced smoothing techniques to enhance accuracy and reliability in real-world scenarios.

Figure 1 .
Figure 1. (a) LOADTFD of a signal with two distinct components; (b) LOADTFD of a signal with two components intersecting at t = 128; (c) LOADTFD of a signal with three components with different amplitudes embedded in AWGN with SNR = 2 dB; (d) NC_t, NC_t, and ⌊NC_t⌉ corresponding to the LOADTFD in (a); (e) NC_t, NC_t, and ⌊NC_t⌉ corresponding to the LOADTFD in (b); (f) NC_t, ⌊NC_t^iter⌉, and ⌊NC_t⌉ corresponding to the LOADTFD in (c). NC_t and ⌊NC_t⌉ were obtained using the original LRE method in [15], while ⌊NC_t^iter⌉ was obtained using the iterative LRE method in [27].

Figure 3
Figure 3 showcases three randomly selected examples of multicomponent signals, where the WVD, EMB, SPWV, and LOADTFD were employed as inputs for the proposed CNN, with the corresponding ideal local number of components, NC_t, serving as the output.

Figure 3 .
Figure 3. For three random signal examples of the CNN training set: (a) WVD; (b) SPWV; (c) EMB; (d) NC_t corresponding to the WVD in (a); (e) NC_t corresponding to the SPWV in (b); (f) NC_t corresponding to the EMB in (c). TFDs in (a-c) represent inputs to the CNN, while NC_t in (d-f) represent the desired outputs.

Figure 4 .
Figure 4. Block diagram of the proposed approach.

Figure 5 .
Figure 5. The performance of the first model for different input TFDs. Lower values indicate better performance.

Figure 6 .
Figure 6. The performance of the second model for different input TFDs. Lower values indicate better performance.

Figure 7 .
Figure 7. The performance of the third model for different input TFDs. Lower values indicate better performance.

Figures 10 and 11
Figures 10 and 11 present the NCs obtained using the proposed CNN based on the EMB, SPWV, WVD, and LOADTFD. Visual inspection reveals that these CNN-based estimates mitigate the limitations of the LRE method and exhibit closer adherence to the ideal scenario. An exception is observed with the estimate NC_t^CNN derived from the WVD, which displays noticeable drops to zero throughout the entire time support of the signal components.

Figure 10 .
Figure 10. Local numbers of components obtained using the proposed CNN and different TFDs versus the ideal NC_t for the considered synthetic signals: (a) z_S1(t); and (b) z_S2(t).

Figure 11 .
Figure 11. Local numbers of components obtained using the proposed CNN and different TFDs versus the ideal NC_t for the considered synthetic signals: (a) z_S3(t); and (b) z_S4(t).
The preceding figure presents the LRE-based raw curves, while Figures 10 and 11 depict the CNN-based curves.

Figure 12 .
Figure 12 presents the WVDs, EMBs, SPWVs, and LOADTFDs of the considered real-world signals z_EEG(t) and z_G(t). The TFDs of z_EEG(t) reveal the presence of a single tone component alongside several spikes, while those of z_G(t) exhibit a hyperbolic behavior in the signal composition.

Figure 13 .
Figure 13. Local numbers of components obtained using the LRE method and different TFDs for the considered real-world signals: (a) z_EEG(t); and (b) z_G(t).

Figure 14 .
Figure 14. Local numbers of components obtained using the proposed CNN and different TFDs versus the ideal NC_t for the considered real-world signals: (a) z_EEG(t); and (b) z_G(t).
The remaining three signals, designated as z_S2(t), z_S3(t), and z_S4(t), were randomly generated and comprise multiple LFM and QFM components, each with distinct amplitudes. Their analytical forms are described as follows: z_S2(t), z_S3(t), and z_S4(t) were embedded in AWGN with SNRs equal to 12, 16, and 9 dB, respectively. Real-world signals, namely the gravitational signal z_G(t) (This research has made use of data, software, and/or web tools obtained from the LIGO Open Science Center (https://losc.ligo.org), a service of LIGO Laboratory and the LIGO Scientific Collaboration. LIGO is funded by the U.S. National Science Foundation.)

Table 2 .
Summary of real-world dataset information.

Table 3 .
Results of the three models trained on different input TFDs. Lower values indicate better performance. The best results per model are bolded.

Table 4
Table 4 presents the numerical results for the LRE- and CNN-based NCs depicted in the preceding figures.

Table 4 .
Performance metrics of the proposed CNN-based versus LRE-based ⌊NC_t⌉ for the considered synthetic signals. The best results per metric are bolded.

Table 5 .
Total magnitude metric calculated from the local numbers of components obtained using the proposed CNN versus the LRE method for the considered synthetic signals. The best results per signal are bolded.

Table 6 .
MSE in dB between the estimated local numbers of components for noise-free and noisy synthetic signals embedded in AWGN with SNR = {0, 3, 6, 9} dB. Values are averaged over 1000 simulations of the signals with different noise realizations.

As shown in Table 4, the CNN-based estimations of local component numbers surpass those obtained through the LRE method for all synthetic signal examples, which were unseen by the CNN during training. Additionally, the EMB, SPWV, and LOADTFD emerged as competitive choices of TFDs for CNN training. The findings and conclusions drawn from the synthetic signals also extend to the real-world EEG seizure and gravitational signals, as depicted in Figures 13 and 14. Our research aimed to demonstrate that utilizing a diverse training set not specifically tailored to the application can enhance estimation results for real-world signals showcasing LFM and QFM components.