Comparison of Information Criteria for Detection of Useful Signals in Noisy Environments

This paper considers the detection of indications of useful acoustic signals in a signal/noise mixture. Various information characteristics (information entropy, Jensen–Shannon divergence, spectral information divergence and statistical complexity) are investigated in the context of solving this problem. Both time and frequency domains are studied for the calculation of information entropy. The effectiveness of statistical complexity is shown in comparison with other information metrics for different signal-to-noise ratios. Two different approaches for statistical complexity calculations are also compared. In addition, analytical formulas for complexity and disequilibrium are obtained using entropy variation in the case of signal spectral distribution. The connection between the statistical complexity criterion and the Neyman–Pearson approach for hypothesis testing is discussed. The effectiveness of the proposed approach is shown for different types of acoustic signals and noise models, including colored noise, and different signal-to-noise ratios, especially when the estimation of additional noise characteristics is impossible.


Introduction
Since Shannon [1] introduced information and information entropy, these concepts have attracted significant attention from scientists, as evidenced by the large number of articles devoted to the development of information theory in relation to various theoretical and practical aspects. Many different information criteria, metrics, and methods for their calculation that are based, one way or another, on the concepts of Shannon entropy have been proposed and investigated [2]. These metrics can be used quite successfully in signal processing, which eventually led to the emergence of a separate section of this scientific area, called entropic signal analysis [3].
For signals described by time series, the information entropy can be calculated on the basis of both signal representation in the time domain [4] and its representation in the frequency domain [5], i.e., using the signal spectrum. The convenience of the second approach comes from the fact that white noise, which is usually used to model background noise in these problem statements, has a uniform frequency distribution. This allows us to simplify its mathematical description and separate useful signals more effectively.
Decision theory considers change-point detection problems, which are closely related to the problems discussed above: often, in such problems, the moment of change in the parameters of a random process registered in discrete time must be determined. In [6,7], many probabilistic-statistical methods of solving such problems are considered. The Neyman-Pearson approach to this problem was used in [8]. Additionally, one cannot ignore the so-called Anomaly Detection Problems, where detection of anomalies in time series is required [9][10][11], i.e., the moment at which the behavior of the system begins to qualitatively differ from normal for various reasons, in particular, due to unwanted external interference. The electrocardiogram (ECG) is one example of such a time series, and ECGs have been analyzed in a large number of articles, for example, [12]. The presence of an anomaly in this case can indicate health problems, and detection at an early stage may save the life of the patient.
Of particular interest is the processing of acoustic signals, which can be useful, for example, in Voice Activity Detection (VAD) problems [13] related to voice assistants. The task is usually to separate speech segments from background environmental noise. Related articles [5,14,15] present a method for endpoint detection, i.e., the determination of the limits of a speech signal in a mixture of this signal and background noise based on the calculation of the spectral entropy. The general idea of methods based on information criteria is that their values experience a sharp jump when a useful signal appears in the noise.
In a series of articles [16][17][18][19][20][21], researchers introduced the concept of a statistical measure of signal complexity, which they called statistical complexity. In [22,23], statistical complexity and information entropy were used to classify various underwater objects of animate and inanimate nature from recorded sound. In the present article, we use this measure to indicate the appearance of a useful acoustic signal in a highly noisy mixture. It should be noted that the positive side of the proposed method is that it does not require any a priori knowledge about the signal to be detected. However, if a priori information, such as the approximate frequency range of the signal, is known, its detection will be even more accurate.
The structure of the paper is as follows. Section 2 provides a brief theoretical summary of the information criteria used in various known signal detection methods. In Section 3, entropy variation is investigated and statistical complexity is introduced. In Section 4, the connection between statistical complexity and the Neyman-Pearson criterion for hypothesis testing is discussed to justify the proposed approach. Section 5 provides a variety of examples, gives a comparison of different information criteria, and discusses the results, which allows us to make an educated choice about a suitable rule for the detection of signals in a noisy mixture. Section 6 summarizes the conducted research and provides directions for future work.

Information Entropy and Other Information Criteria
In information theory, the entropy of a random variable is the average level of "surprise" or "uncertainty" inherent in the possible outcomes of the variable. For a discrete random variable X that takes values in the alphabet $\mathcal{X}$ and has a probability mass function $p : \mathcal{X} \to [0, 1]$, the entropy, according to Shannon [1], is defined as
$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \quad (1)$$
where the sum is taken over all possible values of the variable. When computing the sum (1), it is agreed that $0 \log 0 = 0$, and this convention holds for all subsequent equations. From Formula (1), it follows that entropy reaches its maximum value when all states of the system are equally probable.
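A minimal sketch of (1) in Python, using the $0 \log 0 = 0$ convention (the function name is our own):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log2 p_i (in bits), with 0*log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]              # drop zero-probability states: 0*log 0 = 0
    return -np.sum(p * np.log2(p))

# Entropy is maximal for the uniform distribution:
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits for 4 equal states
```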
There are several definitions of information divergences, i.e., statistical distances between two distributions. The Kullback-Leibler divergence (or relative entropy) between two discrete probability distributions p(x) and q(x) on an event set $\mathcal{X}$ is defined as
$$D_{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \quad (2)$$
This measure is a statistical distance that distinguishes statistical processes by indicating how much p(x) differs from q(x) under the maximum likelihood hypothesis test when the actual data obey the distribution p(x). It is easy to see that
$$D_{KL}(p \| q) = H(p, q) - H(p), \quad (3)$$
where H(p, q) is the cross-entropy between p and q:
$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) = E_p[-\log q], \quad (4)$$
where $E_p[\cdot]$ is the operator of mathematical expectation relative to the distribution p.
The symmetrized Kullback-Leibler distance [13] is often used in studies:
$$D_{sym}(p, q) = D_{KL}(p \| q) + D_{KL}(q \| p). \quad (5)$$
However, the Jensen-Shannon divergence, which also symmetrizes the Kullback-Leibler divergence and is often a more convenient information measure for practical applications, is used more frequently:
$$JSD(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m), \quad m = \frac{1}{2}(p + q). \quad (6)$$
It is symmetric and always has a finite value. The square root of the Jensen-Shannon divergence is a metric that is often called the Jensen-Shannon distance.
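A minimal sketch of the Jensen-Shannon divergence in base-2 logarithms (the helper names are ours):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits; assumes q > 0 where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

def jsd(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q)/2; bounded by 1 bit."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the raw Kullback-Leibler divergence, `jsd` is symmetric and finite even when the supports of `p` and `q` differ, since the mixture `m` is positive wherever either distribution is.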
It is easy to see that
$$JSD(p \| q) = H\left(\frac{p + q}{2}\right) - \frac{1}{2} H(p) - \frac{1}{2} H(q). \quad (7)$$
Another quantity related to the complexity of the system is the "disequilibrium," denoted by D, which shows the deviation of a given probability distribution from the uniform one. The concept of the statistical complexity of a system can be considered a development of the concept of entropy. In [16][17][18][19][20][21], it is defined as
$$C = H \cdot D, \quad (8)$$
where C is the statistical complexity, H is the information entropy, and D is a measure of the disequilibrium of the distribution relative to the uniform one. The measure of statistical complexity reflects the relationship between the amount of information in the system and its disequilibrium. As the parameter D, according to the authors of [16], one can choose any metric that determines the difference between the maximum entropy and the entropy of the studied signal. The simplest example of disequilibrium is the square of the Euclidean distance in $\mathbb{R}^N$ between the original distribution and the uniform one, but the Jensen-Shannon divergence [22,23] is also often used.

Time Entropy
Now let us consider the information characteristics mentioned above in relation to time series. The Shannon entropy for systems with unequal probability states is defined as follows. Let the i-th state of the system have a probability of $p_i = N_i/N$, where N is the sample volume and $N_i$ is the amount of filling at the i-th level. Then, the entropy H(p), according to Formula (1), equals
$$H(p) = -\sum_{i=1}^{n} p_i \log p_i. \quad (9)$$
From here on, we consider discrete probability distributions $p_i$ with the following properties:
$$p_i \geq 0, \quad \sum_{i=1}^{n} p_i = 1. \quad (10)$$
There are different ways to calculate the probabilities $p_i$ from a time series. The simplest one is as follows. First, the maximum $x_{max}$ and minimum $x_{min}$ values are found for the considered time series x(t) with N data points. Then, the interval $(x_{max} - x_{min})$ is divided into n subintervals (levels) so that the length of the subinterval $\Delta x$ is not less than the confidence interval of the observations. The resulting sample is treated as a "message", and the n subintervals are treated as an "alphabet". Then, we find the number $\Delta N_i$ of sample values $x_k$ that fall into each of the subintervals and determine the relative population level $p^t_i$ (the probability of a value from the sample falling into subinterval i, that is, the relative frequency of occurrence of the "letter" in the "message"):
$$p^t_i = \frac{\Delta N_i}{N}. \quad (11)$$
The elementary entropy of the sampling is defined as the Shannon entropy (9) on the set $p^t_i$ and is normalized by the total number of states n so that its values belong to the interval [0, 1]:
$$H(p^t) = -\frac{1}{\log n} \sum_{i=1}^{n} p^t_i \log p^t_i. \quad (12)$$
This approach is known as the first sampling entropy [4] and is used, for example, in [24] to detect the hydroacoustic signals emitted by an underwater source.
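The binning procedure behind (11) and (12) can be sketched as follows (the bin count is an illustrative choice):

```python
import numpy as np

def first_sampling_entropy(x, n_levels=32):
    """Normalized time entropy, cf. (12): split the amplitude range into
    n_levels subintervals and estimate p_i as relative bin populations (11)."""
    counts, _ = np.histogram(x, bins=n_levels)
    p = counts / counts.sum()
    p = p[p > 0]              # 0*log 0 = 0 convention
    return -np.sum(p * np.log2(p)) / np.log2(n_levels)
```

A constant signal occupies one amplitude level and yields zero entropy; a signal spread evenly over all levels yields the maximum value of 1.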
On the other hand, the second sampling entropy can be defined on the distribution $p^0$ given by (13). In this case, the signal samples themselves are considered "letters", which are distributed along the time axis, in contrast to the amplitude axis used in (12), and the "alphabet" is the whole set of amplitudes.

Spectral Entropy
In addition to the time domain, the entropy can be calculated from the representation of the signal in the frequency domain, i.e., the $p_i$ can be calculated from the spectrum of the signal. Spectral entropy is a quantitative assessment of the spectral complexity of the signal in the frequency domain from an energy point of view.
Consider the time series x(t) and its spectral decomposition in the frequency domain $X(f_i)$ with $N_{fft}$ frequency components, obtained using the Fast Fourier Transform (FFT). The spectral power density is estimated as follows:
$$s(f_i) = |X(f_i)|^2. \quad (14)$$
Then, the probability distribution of the spectral power density $p^s = \{p_1, p_2, ..., p_{N_{fft}}\}$ can be written in the form
$$p^s_i = \frac{s(f_i)}{\sum_{j=1}^{N_{fft}} s(f_j)}, \quad (15)$$
where $s(f_i)$ is the spectral energy of the spectral component with frequency $f_i$, $p^s_i$ is the corresponding probability, $N_{fft}$ is the number of spectral components in the FFT, and the upper index s shows that the distribution refers to the signal spectrum. The resulting function is a spectral distribution density function.
Finally, the spectral entropy can be determined with Equation (9) and normalized by the size of the spectrum:
$$H(p^s) = -\frac{1}{\log N_{fft}} \sum_{i=1}^{N_{fft}} p^s_i \log p^s_i. \quad (16)$$
The Spectral Information Divergence (SID) method [25], recently added to the Matlab mathematical package, evaluates the similarity of two signals based on the divergence between the probability distributions of their spectra, in accordance with Formula (5):
$$SID(r, t) = D_{KL}(p \| q) + D_{KL}(q \| p), \quad (17)$$
where r and t are the reference and test spectra, respectively, and the values of the probability distributions $p_i$ and $q_i$ for these spectra are determined according to (15).
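A sketch of these spectral criteria, assuming the power spectrum is taken as $|X(f_i)|^2$ (function names and the zero-guard are our choices):

```python
import numpy as np

def spectral_distribution(x):
    """Discrete spectral distribution p^s_i = s(f_i) / sum_j s(f_j), cf. (15)."""
    s = np.abs(np.fft.rfft(x)) ** 2   # spectral power density, cf. (14)
    return s / s.sum()

def spectral_entropy(x):
    """Spectral entropy (9) normalized by the spectrum size, cf. (16)."""
    p = spectral_distribution(x)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz)) / np.log2(len(p))

def sid(r, t):
    """Spectral Information Divergence, cf. (17): symmetrized KL of the spectra."""
    p, q = spectral_distribution(r), spectral_distribution(t)
    eps = 1e-12                       # guard against empty spectral bins
    return np.sum((p - q) * np.log2((p + eps) / (q + eps)))
```

A pure tone concentrates its power in one spectral bin and gives near-zero spectral entropy, while white noise spreads power over all bins and gives a value near 1.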

Entropy Variation and Related Information Criteria
The purpose of this section is to determine the most appropriate formulas for calculating the information criteria that are responsible for the differences between distributions. Let us consider the entropy variation with respect to the variation in the probability distribution. The following lemma is valid.

Lemma 1. Let the distributions p and q satisfy (10), and let $\delta q_i = q_i - p_i$. Then, in the case of series convergence, the decomposition of the entropy variation $\delta H$ by powers of $\delta q_i$ has the form (18). The first summand of the entropy variation decomposition $\delta H$ is the difference between the cross-entropy and the entropy,
$$LH(p, q) = H(p, q) - H(p), \quad (19)$$
and the second, the disequilibrium D(p, q) (20), depends on the weighted squares of the variation of the distribution.

The proof of Lemma 1 is given in Appendix A.

Remark 1.
If q is the uniform distribution, i.e., $q_i = 1/N$ for $i = 1, ..., N$, then the disequilibrium D = D(p, q) is proportional to the variance of the distribution p relative to the uniform one and is equal to
$$D_{SQ}(p) = \sum_{i=1}^{N} \left(p_i - \frac{1}{N}\right)^2. \quad (21)$$
Equation (21) coincides with the disequilibrium definition from [16]. According to Lemma 1 and Remark 1, we can introduce a new definition.

Definition 1.
In the case where q is the uniform distribution, the statistical complexity, as defined in [16], is proportional to the first nonzero term of the series expansion of the squared entropy variation, namely,
$$C_{SQ} = H(p) \cdot D_{SQ}(p). \quad (22)$$

Remark 2. In the general case, the statistical complexity is defined as
$$C = \frac{H}{H_{max}} \cdot D, \quad (23)$$
where $H_{max}$ is the entropy maximum.
It follows from Remark 1 that the disequilibrium (21) and the complexity (22) concepts must be applied when evaluating and comparing signals with background noise that has a spectral distribution close to uniform.
The formula for disequilibrium (21) proposed in [16] is derived from the entropy variation, but most papers use the Jensen-Shannon divergence [22] for the disequilibrium:
$$D_{JSD}(p) = JSD(p \| q), \quad (24)$$
where $q_i = 1/N$. The statistical complexity, correspondingly, is expressed as
$$C_{JSD} = H(p) \cdot D_{JSD}(p). \quad (25)$$
Further on in the article, a comparison of the complexity graphs calculated for these two values, $D_{SQ}$ and $D_{JSD}$, is presented.
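A sketch of the two complexity variants, with the quadratic disequilibrium (21) and the Jensen-Shannon disequilibrium (24) against the uniform distribution (helper names ours; logarithms base 2):

```python
import numpy as np

def norm_entropy(p):
    """Normalized Shannon entropy H, with values in [0, 1]."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz)) / np.log2(len(p))

def complexity_sq(p):
    """C_SQ = H * D_SQ with the quadratic disequilibrium (21)."""
    p = np.asarray(p, dtype=float)
    d_sq = np.sum((p - 1.0 / len(p)) ** 2)
    return norm_entropy(p) * d_sq

def complexity_jsd(p):
    """C_JSD = H * D_JSD with the Jensen-Shannon disequilibrium (24)."""
    p = np.asarray(p, dtype=float)
    u = np.full(len(p), 1.0 / len(p))     # uniform reference distribution
    m = 0.5 * (p + u)
    nz = p > 0
    d_jsd = 0.5 * np.sum(p[nz] * np.log2(p[nz] / m[nz])) \
          + 0.5 * np.sum(u * np.log2(u / m))
    return norm_entropy(p) * d_jsd
```

Both complexities vanish at the two extremes: for the uniform distribution (D = 0) and for a deterministic one-point distribution (H = 0), and are positive in between.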
Considering the signal distribution in the frequency and time domains, it can be observed that the spectral distribution does not require any additional estimation of the signal variance, whereas the calculation of entropy in the time domain does require a variance estimate, since for white noise (Gaussian distribution), the following formula is valid:
$$H(p) = \log\left(\sigma\sqrt{2\pi e}\right). \quad (26)$$

Remark 3. For the case of two continuous distributions (p, q), the disequilibrium D(p, q) is equivalent to the f-divergence [26] with the quadratic function f:
$$D(p, q) = \int \frac{(p(x) - q(x))^2}{2q(x)}\, dx, \quad (27)$$
if the integral exists.

Remark 4.
In the case of two Gaussian distributions with the parameters $(\mu_p, \sigma_p)$ and $(\mu_q, \sigma_q)$, Formulas (19) and (20) from Lemma 1 take the forms (28) and (29). The Formulas (28) and (29) are obtained by calculating the integrals (27) and (26) for continuous distributions; they are also applicable for discrete ones.
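Since Section 5 evaluates LH by Formula (29) for $\mu_p = \mu_q$, (29) is presumably the standard Gaussian Kullback-Leibler expression corresponding to (19), and (28) the Gaussian evaluation of the integral (27); under this assumption, the pair can be written as:

```latex
% LH (19) for two Gaussians is the Gaussian Kullback-Leibler divergence:
LH(p,q) = \ln\frac{\sigma_q}{\sigma_p}
        + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}

% D (20) for two Gaussians, evaluated via (27), valid for 2\sigma_q^2 > \sigma_p^2:
D(p,q) = \frac{1}{2}\left[
           \frac{\sigma_q^2}{\sigma_p\sqrt{2\sigma_q^2 - \sigma_p^2}}
           \exp\!\left(\frac{(\mu_p - \mu_q)^2}{2\sigma_q^2 - \sigma_p^2}\right) - 1
         \right]
```

For $\mu_p = \mu_q$, the first expression reduces to $\ln(\sigma_q/\sigma_p) + \sigma_p^2/(2\sigma_q^2) - 1/2$, the form used later in the comparison of the time criteria.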
The problem addressed here is the determination of the most informative methods for calculating entropy and other information criteria. In our opinion, the answer can only be obtained in the presence of additional knowledge about the phenomenon under study. Indeed, when measuring the amplitude of the signal x(t), initially there is only knowledge of the time series samples x(t i ), i.e., we know the values of the amplitudes in the increasing sequence of time samples t i , i = 1, ..., N. Specifying the distribution density by Formula (11) in itself allows a certain random variable to be defined.
Let us calculate the entropy H(p 0 ) by applying the grouping (11) and using the fact that the entropy does not change when the summation order changes. The resulting chain of equations (30)-(32) defines the distributions p 1 (31) and p t (32), so that
$$H(p^0) = \log_2 N - D_{KL}(p^1 \| p^t). \quad (33)$$
If p t is the uniform distribution, then (33) takes the form H(p 0 ) = log 2 N − log 2 n + H(p 1 ).
The Formula (33) shows the relationship between the entropy calculated from the time samples and the Kullback-Leibler distance between the distributions obtained by alphabetical grouping along the amplitude and time coordinate axes. The following observation concerning the distribution p 1 is therefore valid.

Remark 5.
If the sequence of samples x(t i ) has the property of ergodicity and the signal is represented by white noise, then if the number of letters of the alphabet from (31) is large enough, the density p 1 will be close to Gaussian.
Remark 5 allows us to estimate H(p 0 ) using Equation (26) for the Gaussian p 1 . Now, there are four distribution densities at our disposal: three of them, p 0 , p 1 , and p t , are related to the time domain and are determined by the Formulas (13), (31), and (32), and one, p s , is related to the signal spectrum, calculated by the Formula (15).
With the presence of four distribution densities, the following criteria are considered simultaneously: the normalized information entropy H, defined by Formulas (12) and (16); the statistical complexity C, computed by (22) with the disequilibrium D (21); the Jensen-Shannon divergence JSD (6); the spectral information divergence SID (17); and the difference between cross-entropy and entropy, LH (19).
Since the spectral density is used to compare the signal/noise mixture with white noise, i.e., with a uniform distribution, all of the proposed criteria are applicable for this density. In the case of temporal distributions, the normalized information entropy H, which depends only on the distribution under study, and the difference of cross-entropy and entropy LH, calculated explicitly by the Formula (29) for µ p = µ q , are estimated.
On the basis of the numerical experiments performed, a conclusion is made about the quality of the criteria used and the limits of their applicability in the presence of the noise component of the signal.

Hypothesis Testing
The classical probabilistic approach to the study of the considered problem of the detection of useful signals against background noise is called binary hypothesis testing. The binary problem associated with the decision to receive only noise (hypothesis Γ 0 ) or to receive a mixture of a useful signal and noise (hypothesis Γ 1 ) is solved [8].
In the statistical decision theory [6], it is shown that, in signal detection in the presence of noise, the optimal decisive rule is based on a comparison of the likelihood ratio with some threshold. The Neyman-Pearson criterion is used to select the threshold in the absence of a priori probabilities of the presence and absence of a useful signal. The efficiency of the detection procedure using the Neyman-Pearson criterion is characterized by the probability of correct detection with a fixed probability of false alarms.
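As a minimal illustration of the Neyman-Pearson rule in the known-signal setting (a textbook setup of our own for clarity, not the paper's detector: the likelihood ratio for a known signal in white Gaussian noise reduces to a correlation statistic):

```python
import numpy as np
from math import erf, sqrt

def gauss_quantile(p):
    """Inverse standard normal CDF via bisection (avoids external dependencies)."""
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + erf(mid / sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
N, sigma, alpha = 256, 1.0, 0.01
s = np.sin(2 * np.pi * 8 * np.arange(N) / N)   # known signal template

# Under H0 (noise only), T = <x, s> ~ N(0, sigma^2 ||s||^2); the threshold is
# chosen so that the false-alarm probability equals alpha.
threshold = sigma * np.linalg.norm(s) * gauss_quantile(1.0 - alpha)

x = s + sigma * rng.standard_normal(N)         # H1: signal plus noise
detected = (x @ s) > threshold
```

This fixes the false-alarm probability at `alpha` and maximizes the detection probability, which is the sense in which the likelihood-ratio threshold rule is optimal.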
The solution to the problem of distinguishing between two hypotheses can be derived from the following variant of the Neyman-Pearson lemma.
In the problem of the detection of a useful signal, α is known as the probability of a false alarm occurring, and β is known as the probability of missing a useful signal.
The total error probability can be calculated precisely through the variation of the measure (with a sign) by the following formula from [6]:
$$\alpha + \beta = 1 - TV(P_0, P_1),$$
where $\|P_1 - P_0\| = 2\,TV(P_0, P_1)$. For close hypotheses, $TV(P_0, P_1) \approx 0$, and the possibility of a reasonable estimate of this value is of interest.
Known bounds relate TV(P, Q) to the estimate JSD(P‖Q), which is used to compute the statistical complexity C JSD . Both TV and √JSD are metrics on the space of probability distributions, while in the Euclidean space, D SQ (21) serves as such a metric. Since the problem of detecting a deterministic signal in the presence of background noise is considered, it is reasonable to additionally take this "determinism" into account by multiplying D SQ by the entropy H, which motivates the introduction and use of statistical complexity in the forms (22) and (25).

The Calculation Algorithm and Presentation of the Simulation Results
In all experiments, graphs of the information characteristics are presented as functions of time. The characteristics are calculated from the signal according to the following algorithm:

1. After being digitized with the sampling rate F, the audio signal is divided into short segments containing W digital samples.
2. The discrete densities p i (15) are calculated from the time or frequency domain.
3. The information criterion is calculated using p i .
4. The sequence of values is displayed together with the signal on the time axis (each of the obtained values is extended over W samples).
When a certain threshold of the information criterion is exceeded, this indicates the appearance of a useful signal in the mixture.
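The four steps above can be sketched as follows (a minimal illustration using the spectral complexity C SQ; the window size and criterion choice are ours):

```python
import numpy as np

def criterion_track(x, W=1024):
    """Steps 1-4 sketched for the spectral complexity C_SQ: split the signal
    into windows of W samples, build the spectral distribution (15) per
    window, evaluate C_SQ, and extend each value over its W samples."""
    values = []
    for start in range(0, len(x) - W + 1, W):
        seg = x[start:start + W]
        s = np.abs(np.fft.rfft(seg)) ** 2    # per-window power spectrum
        p = s / s.sum()
        nz = p[p > 0]
        H = -np.sum(nz * np.log2(nz)) / np.log2(len(p))
        D = np.sum((p - 1.0 / len(p)) ** 2)  # quadratic disequilibrium (21)
        values.append(H * D)
    return np.repeat(values, W)              # step 4: extend over W samples
```

Comparing the resulting track against a threshold then flags the windows where a useful signal is present.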
The signal processing results according to this algorithm are presented below. For different acoustic signals, a comparison of the quality of indication of the appearance of a useful signal by different information criteria at different levels of added white noise is demonstrated. In addition, Section 5.5 shows a comparison of two methods for calculating the statistical complexity and draws conclusions about the usefulness of both.
The first acoustic signal chosen was an audio recording of a humpback whale song recorded underwater. A large set of such recordings is available from the Watkins Marine Mammal Sound Database collected by Woods Hole Oceanographic Institution and the New Bedford Whaling Museum. The ability to separate such signals from strong sea noise may be useful for research biologists for further classification and study. In addition, these signals are similar in structure to the human voice with separate words, the extraction of which could be useful, for example, in tasks of voice activity detection and speech recognition.
In all of the graphs presented below, the signal is marked with a blue line, and the corresponding information metric is marked with a red line. The left vertical axis corresponds to the values of the signal amplitude, and the right vertical axis corresponds to the values of the information metric. All horizontal axes represent the timeline in seconds. The signal is shown without added noise for better comprehension, but the variable parameter of the standard deviation σ N of the white noise is marked with a dashed line. All information metrics are normalized for the convenience of presentation. All calculations and visualizations were performed using Python. White, brown, and pink noises, which were artificially added to audio recordings, were also generated numerically.
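The colored-noise generation mentioned above can be sketched by spectral shaping of white noise (the function name and unit-variance normalization are our choices):

```python
import numpy as np

def colored_noise(n, exponent, rng=None):
    """Noise with power spectral density ~ 1/f**exponent:
    exponent 0 -> white, 1 -> pink, 2 -> brown (red) noise."""
    if rng is None:
        rng = np.random.default_rng()
    white = rng.standard_normal(n)
    spec = np.fft.rfft(white)
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                        # avoid division by zero at DC
    spec *= f ** (-exponent / 2.0)     # amplitude ~ f^(-exp/2) => power ~ f^(-exp)
    x = np.fft.irfft(spec, n)
    return x / np.std(x)               # normalize to unit variance
```

With `exponent=0` the spectrum is left untouched and the output is ordinary white noise, while larger exponents shift progressively more power toward low frequencies.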

Time Information Criteria
First, we consider the behavior of the information entropies H(p t ) and H(p 0 ), calculated from the time samples of the signal x(t). Figure 1 shows that, as noise increases, there is serious degradation of the time entropy graph for both calculation methods, so that for σ N = 2000 (SNR ≈ 1.5 dB), these information criteria can no longer serve as reliable indicators of the appearance of a useful signal in the mixture. We note an interesting feature of the behavior of H(p 0 ) and H(p t ): the value of the first characteristic is maximal for the uniform distribution and decreases when a useful signal appears in the mixture, while, in contrast, the value of the second is minimal in the absence of a signal and increases with its appearance. This follows directly from the formulas for calculating the distributions and entropies (11), (13), and (12). The information characteristic LH(p, q) compares favorably with the time entropies, as demonstrated below.
In Figure 2, the LH for the noise level σ N = 2000 shows the appearance of a useful signal and works sufficiently well, even for double the noise value. However, it should be noted that this is true only for stationary noise, whose average value does not change over time. Otherwise, this metric will react to changes in the noise as well, which follows from the formula (29). Moreover, an initial estimate of σ N is required for the correct functioning of this criterion.

Time Entropy H(p 1 )
The time entropy H(p 1 ) associated with another grouping of the "alphabet" derived from the signal samples is considered separately, and the graphs for different numbers of letters are shown in Figures 3 and 4. Changing the alphabet partitioning negatively affects the effectiveness of entropy in this representation.

Spectral Information Criteria
Information criteria based on the spectral distribution p s are free of the disadvantages of the time criteria. Figure 5 shows the dependence of the spectral entropy on time. We can see a significant improvement in the maximum allowable noise level, at which the indication of the appearance of a useful signal is still possible, with respect to the graphs presented in Figure 2. The point that we want to make is that white noise in the spectral representation of a signal has quite a definite uniform probability distribution, which greatly facilitates the calculation of entropy and saves us from the necessity of estimating the variance of this noise. Moreover, even if the noise is not stationary, i.e., its parameters change over time, within a small window W, it can still be considered white, and the above statement remains true.
The distribution p s can be used as the basis for a number of information divergences (17), (22), (24), and (25). Figure 6 shows that the separability of the information metrics decreases along with the signal-to-noise ratio (SNR). However, the statistical complexity C SQ performs better than all of the other criteria: it still allows a useful signal to be distinguished when the other metrics behave irregularly and no longer show significantly elevated levels compared to areas without a signal. Thus, it is, in our opinion, the most promising characteristic.

Comparison of Different Ways of Calculating the Statistical Complexity
Of separate interest is the comparison of the behavior of the statistical complexities C SQ and C JSD , which essentially correspond to different methods of calculating the same value of statistical complexity. Figure 7 illustrates this comparison. We can see that C SQ shows a better result when used as an indicator of the appearance of a useful signal in white noise compared to the Jensen-Shannon divergence.

Hydroacoustic Signal Model of an Underwater Marine Object
The second signal is a modelled hydroacoustic signal of an underwater marine object. The study of such signals is important in military and civilian applications, because it can automate the process of analyzing the hydroacoustic scene and identifying potential threats. In Figure 8, spectral entropy dependencies for different levels of added noise are shown.  Figure 9 shows the dependencies of statistical complexity for a given signal. It is worth noting that the selected information metric shows the presence of a useful signal, even for a very small SNR (≈ −17 dB) in the last example. It can be observed that, in comparison with all other information metrics, the statistical complexity shows the best result in terms of indicating the presence of a useful signal in the mixture, because it remains effective for small SNRs, while all other characteristics can no longer detect a useful signal in noisy receiving channels.

Hydroacoustic Signal Model with Pink Noise
Now let us change the additive noise model and use pink noise instead of white noise. As can be observed in Figure 10, the spectral entropy shows an unsatisfactory result for the chosen low SNR. Figure 11 shows that, along with the spectral entropy, the statistical complexity C SQ performs poorly, but C JSD confidently shows the presence of a signal.
Figure 11. Statistical complexities C SQ and C JSD for the pink noise model.

Hydroacoustic Signal Model with Brown Noise
In this example, brown noise is used as the noise model. As in the previous subsection, spectral entropy fails in the task of signal extraction, as shown in Figure 12. However, Figure 13 shows that the statistical complexity with the Jensen-Shannon disequilibrium exhibits a satisfactory performance.
The results are summarized in Table 1. A checkmark indicates the possibility of confident indication of the useful signal, and a cross indicates the lack of it.

Conclusions
The article proposed a method for indicating the appearance of a useful signal in a heavily noisy mixture based on the statistical complexity. The analytical formulas used to determine the disequilibrium and statistical complexity were obtained using entropy variation. The effectiveness of the proposed approach for two types of acoustic signals in comparison with other information metrics was shown for different models of added noise. For white noise, the appearance of a deterministic signal was shown to be reliably detected for a very small SNR (≈ −15 dB) when the statistical complexity based on the spectral distribution variance was used as the criterion. However, for more complex noise models, the use of the statistical complexity with the Jensen-Shannon disequilibrium was shown to have better efficiency. Both the time and frequency domains were considered for the entropy calculation. The criteria for signal detection in a heavy noise mixture based on time distributions were shown to be less informative than those based on the spectral distribution. The connection between the statistical complexity criterion and the Neyman-Pearson approach for hypothesis testing was also discussed. Future work will be devoted to research on information criteria based on two- and multidimensional distributions, and acoustic signals with realistic background noise will be considered.

Data Availability Statement: The humpback whale song example was downloaded from https://cis.whoi.edu/science/B/whalesounds/bestOf.cfm?code=AC2A (accessed on 25 September 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

FFT Fast Fourier Transform
SNR Signal-to-noise ratio

Appendix A

Proof of Lemma 1. The difference between the entropies for the distributions p i and q i gives the entropy variation δH. The property of the logarithm of a product and the regrouping of the summands allow the chain of equations to continue: the first sum is equal to the difference between the cross-entropy and the entropy. Next, the logarithm is decomposed into an infinite series, and the resulting sum is divided into two parts: the summand corresponding to n = 1 is removed from the first sum, and the summation continues from n = 2. Another shift of the summation index leads to Equation (18), which completes the proof of the Lemma.