A Feature-Reduction Scheme Based on a Two-Sample t-Test to Eliminate Useless Spectrogram Frequency Bands in Acoustic Event Detection Systems

Abstract: Acoustic event detection (AED) systems, combined with video surveillance systems, can enhance urban security and safety by automatically detecting incidents, supporting the smart city concept. AED systems mostly use mel spectrograms as a well-known effective acoustic feature. The spectrogram is a combination of frequency bands. A big challenge is that some of the spectrogram bands may be similar in different events and be useless in AED. Removing useless bands reduces the input feature dimension and is highly desirable. This article proposes a mathematical feature analysis method to identify and eliminate ineffective spectrogram bands and improve AED systems’ efficiency. The proposed approach uses a Student’s t-test to compare frequency bands of the spectrogram from different acoustic events. The similarity between each frequency band among events is calculated using a two-sample t-test, allowing the identification of distinct and similar frequency bands. Removing the similar bands accelerates the training of the classifier by reducing the number of features, and also enhances the system’s accuracy and efficiency. Based on the obtained results, the proposed method reduces the spectrogram bands by 26.3%. The results showed an average difference of 7.77% in the Jaccard, 4.07% in the Dice, and 5.7% in the Hamming distance between selected bands using train and test datasets. These small values underscore the validity of the obtained results for the test dataset.


Introduction
In addition to the sense of sight, which is an important tool for interaction between humans and their environment, understanding the environment's acoustic signals is also crucial for survival. Using acoustic signals, one can often prejudge a scene or event without actually observing it. With the advancement of processors and AI systems, the proximity of machine learning (ML) algorithms to human perception, and the emergence of the smart city concept, urban smart monitoring systems based on audio and video have become increasingly popular. Considering the smart city concept, the automatic processing of audio and video data from urban areas enables city authorities to respond to incidents and quickly improve service quality. In congested urban areas, image-based event detection systems have long been used to monitor vehicle traffic or automatically detect events. These systems have the advantage of never getting tired, rarely making mistakes, and providing comprehensive documentation of crimes and violations. Additionally, this topic is occasionally used in security systems. Under several circumstances, cameras cannot cover a scene; therefore, adding an acoustic signal can complement the monitoring system and enhance the system's accuracy and efficiency. So far, numerous methods have been proposed for processing and classifying urban events using the acoustic signal spectrogram, each with strengths and weaknesses. One aspect overlooked in spectrogram-based AED systems is the separate analysis of acoustic events based on the similarities and differences of the spectrogram bands related to different events. Considering this analysis, the trained system can use the spectrogram bands that differ between events to enhance its efficiency. Due to the lack of comprehensive research investigating the similarities and differences of spectrogram bands for different events, this article introduces a method to identify similar and dissimilar spectrogram bands for urban events from a mathematical and probabilistic perspective. In addition, the proposed method can be effectively used to select compelling features in the design and implementation of ML algorithms. The main advantages of the proposed method are as follows:
• By analyzing the potential similarities and differences in the spectrogram bands of each urban event, one can focus on the similarities and dissimilarities within and outside the group of events, thus aiding in event classification.
• Since the proposed method uses a probabilistic model and confidence interval to evaluate the similarity and difference of spectrogram bands across different events, there is no need for heuristic methods. Also, mathematical analysis is more reliable than heuristic methods in ensuring accurate results.
• The proposed method can be used to identify irrelevant features in AED systems, and in cases where there is a high degree of similarity between two or more events, a secondary classifier can be designed using the outcomes of the proposed method to minimize errors.
• The proposed method can be used to identify useless bands in acoustic-based systems that use the spectrogram as a feature. Normality should be checked in each case to ensure the accuracy of the result.
The article is structured as follows: In Section 2, recent methods for AED and feature selection based on mathematical models are presented. Section 3 describes the proposed method. Section 4 is devoted to the results obtained by the proposed method and its comparison against other existing methods. Finally, a summary of the study is presented in the conclusion.

Literature Review
In recent years, multiresolution analyses, such as spectrograms, mel frequency cepstral coefficients (MFCCs), and wavelets, have been widely used in signal analysis and AED because of their suitability for finding patterns in time-varying signals.Hajihashemi et al. [1,2] used MFCC and wavelets for sound analysis in AED and acoustic scene classification.The authors also used wavelet scattering as another spectral feature in [1].Roy et al. [3] used the spectrogram as a time-frequency expression of arterial Doppler signals to predict blood clots and microemboli.Several features were extracted from the spectrogram, such as the root mean of the local power spectrum and the modal frequency.Ibs-von Seht [4] aimed to provide an overview of volcanic activity using the spectrogram of seismic signals.Hafez et al. [5] predicted the timing of an earthquake using the spectrogram of signals obtained from the ground.Broussard and Givens [6] analyzed the oscillations of the posterior parietal cortex in rats and the impact of different acoustic signals on them using the spectrogram.
Liu et al. [7] employed the spectrogram in conjunction with the Hilbert-Huang Transform (HHT) for sleep apnea detection.Dennis et al. [8] used spectrogram and Hough transform features to detect acoustic events.Towsey et al. [9] used features extracted from the spectrogram to estimate the number of birds in a natural environment.Vales et al. [10] predicted earthquakes by analyzing data collected from the spectrogram of low-frequency terms of terrestrial signals.Oliveira et al. [11] proposed an efficient method to detect bird activity using a spectrogram-based filter.The spectrogram separated the background sound from the bird's voice in this method.
Ghosh et al. [12] applied the spectrogram and Wigner-Ville transform of the vibration signal for vehicle detection.Xie et al. [13] introduced an AED system that used features extracted from spectrograms.Additionally, Xie et al. [14] used spectrograms, linear predictive coding, and MFCCs to estimate the number of frogs based on ambient sound.Using ambient sound, Sánchez-Gendriz and Padovese [15] analyzed biological choruses.In this study, features were extracted using the spectrogram, and an effective graphical expression for biological choruses was presented using the amplitude of the spectrogram in some frequency bands.Zhaoa et al. [16] separated the sounds of different bird species using MFCC, spectrogram, and an autoregressive model.Shervegar et al. [17] proposed a phonocardiogram spectrogram-based system for heart disease classification.
Nobre et al. [18] measured the biological parameters of caged domestic animals using an electric field.This research relied on the spectrogram to determine the frequency characteristics of long recordings.Ye et al. [19] used a combination of local and global features, including spectrogram entropy, to detect urban events.Goenka et al. [20] proposed a method for detecting seizures using quantitative electroencephalogram spectrograms.Hoyos-Barcelóa et al. [21] used local features of the acoustic signal spectrogram to detect coughs.The proposed method was implemented in a smartphone application and showed promising results.Waldman et al. [22] detected high-frequency oscillations within the human skull using electroencephalographic (EEG) signal spectroscopy.Yan et al. [23] used spectrograms to diagnose seizures based on a convolutional neural network (CNN) classifier whose input was spectral images.
Oliveira et al. [24] relied on the capabilities of spectrograms and ML methods, such as neural networks (NN) and support vector machines (SVM), to classify EEG signals and diagnose epilepsy.In addition to the spectrogram, cross-correlation and discrete Fourier transform were used in this study.Zhang et al. [25] employed acoustic sensors and a phase-sensitive optical time-domain reflectometer to distinguish five different acoustics.The authors extracted features using the spectrogram.Sahai et al. [26] considered spectrogram-related features for musical font separation.This method applied the spectrogram image as the input to the VGG network.Lin et al. [27] applied spectrogram features as the input to a deep neural network (DNN) and trained a semi-supervised CNN as an AED system in an urban area.
Spadini et al. [28] evaluated several acoustic features in detecting urban events, and the spectrogram was among them. Su et al. [29] used a two-stage CNN network to classify environmental acoustics considering features such as a log-mel spectrogram and MFCC-based features. Gloaguen et al. [30] proposed spectrogram features and non-negative matrix factorization to estimate road traffic levels. Satar et al. [31] proposed an AED method based on the spectrogram of data collected by the hydrophone. The continuous wavelet transform and spectrogram were used by Lapins et al. [32] to analyze seismic signals caused by volcanic activity. The audio spectrogram was among the acoustic features suggested by Vafeiadis et al. [33] for a smart home AED system. For environmental monitoring and counting of low detectable species, Znidersic et al. [34] used the spectrogram of an acoustic signal. Robinet et al. [35] used the spectrogram to extract transient noise characteristics in gravitational wave detectors.
Azab and Khasawneh [36] used the spectrogram to detect malware files.Kachaa et al. [37] analyzed the different conditions of dysarthric speech, which is a speech disorder related to muscle weakness, using the spectrogram of voice signals to interpret the different states of this disorder.Zeng et al. [38] extracted the spectrogram of arm movements and used this feature to classify the movements.
Franzoni et al. [39] proposed an emotion recognition system using a human voice spectrogram and a CNN-based classifier.Sinha et al. [40] extracted the audio spectrogram, and converted it to an image that was inputted into a CNN for audio classification.
Luz et al. [41] relied on different acoustic features, such as the spectrogram, to detect events in an urban space based on a CNN-based classifier.In analyzing the heart's electrocardiogram (ECG) signals, Gupta et al. [42] used the spectrogram.Manhertz and Bereczky [43] used the short-time Fourier transform (STFT) spectrogram to analyze vibration in a rotating electric machine and identify faults in early stages.In a study by Lara et al. [44], seismic and volcanic events were detected using a spectrogram and deep learning.Pham et al. [45] used a spectrogram-based method to classify scenes based on a CNN-DNN architecture.
Liu et al. [46] combined convolutional recurrent neural networks and mel spectrogram, delta, and delta-delta features for underwater target recognition. Kadyan and Bawa [47] proposed a two-level augmentation scheme via the spectrogram of speech signals using transfer learning techniques for an automatic speech recognition system. Pahuja and Avijeet [48] proposed a bird sound-based recognition system for classifying eight species of birds using the STFT spectrogram for feature extraction. Zhang et al. [49] used gradient-weighted class activation mapping with a CNN and the mel spectrogram as a feature for acoustic scene classification. Cheng et al. [50] suggested a spectrogram-based sound recognition system using AlexNet, which identified passing vehicles with modified loud exhausts. Wang et al. [51] used the spectrogram of underwater signals for time-frequency tracking and enhancement of whistle signals, which are used in research on cetaceans. You et al. [52] used audio spectrogram transformers and a CNN to generate embeddings for the few-shot learning of bioacoustic AED. Bhangale et al. [53] combined mel spectrogram and other acoustic features and used them as input for a parallel emotion network for speech emotion recognition. Özseven [54] discussed the effectiveness of the spectrogram as a time-frequency domain image in urban sound classification. Latif et al. [55], Shafik et al. [56], and Mushtaq et al. [57] used the spectrogram as an effective acoustic feature in speech emotion recognition, speaker identification, and environmental sound classification, respectively. All of these approaches were based on deep learning.
The spectrogram has also been used in many medical applications, such as cough detection [58], detection of cardiovascular disease and epilepsy using ECG and EEG signals [24,59], sleep spindles [60], and scalp peak ripples [61] detection using EEG signals.
It has also been used for sleep apnea-hypopnea syndrome diagnosis based on nasal airflow signal [62] and snoring detection system using voice [63].In industrial applications, the spectrogram has been used for fault detection using vibration signals [64], fault detection in gearboxes based on sound [65], and fault detection in rotary systems using data from various sensors [66].Verification of bird diversity [67] and the detection of shoots in forests [68] using ambient sound, seismo-acoustic event prediction using vibration signals and ground waves [69], and an AED system [70] are other state-of-the-art applications that use the spectrogram as a feature extraction method.
In most of the studies mentioned above, the classifier employed was a deep learning (DL) network. According to the reviewed studies, the following findings were observed:

• In some methods, the spectrogram image was used as the input to a two-dimensional (2D) DNN;
• In others, various features extracted from the spectrogram were used as the input;
• In some cases, experts analyzed spectrogram images to distinguish between different conditions;
• To the best of our knowledge, no quantitative methods have been proposed to determine which frequency bands of the spectrogram are most effective for the application under study.
In the current study, an efficient method to separate the useful from useless bands of the spectrogram regardless of the used classifier was developed based on statistical tests. The proposed method can be used to determine the similarities and differences in the frequency bands of the spectrogram considering different classes. Using the results of the proposed method, the noise is reduced by removing similar frequency bands from the spectrogram, and the accuracy and learning speed are increased. Therefore, in an AED system, the proposed method can provide insights into the different events associated with the spectrogram frequency bands relative to the background and increase the system's accuracy and speed.

Materials and Methods
This section provides an overview of the dataset used in this study and explains the architecture of the proposed system. In addition, it gives the theoretical background for each step of the proposed method.

Dataset
As one of the most popular datasets used in AED studies, the well-known public URBAN-SED dataset was used in the current study. This dataset contains ten sound events, as shown in Figure 1: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. In addition, there is an 11th class, defined as the background, which does not include any of the mentioned events. The URBAN-SED dataset was generated using the Scaper library for synthesizing and augmenting acoustic scenes and developed by incorporating background noise into the original sounds sourced from the UrbanSound8K dataset. The UrbanSound8K dataset contains 8732 trimmed acoustic clips recorded in an urban environment, with the longest clip lasting four seconds. To generate the URBAN-SED dataset, background noise was added to the original sounds, and the duration of each sample was set to ten seconds. Since the UrbanSound8K sounds were recorded in a natural environment, several events can co-occur. To standardize the comparison between the different studies tested on the URBAN-SED, the data were divided into three subsets: training, testing, and validation. The training set consists of 6000 samples, whereas the remaining two contain 2000 samples each.

Analysis of Spectrograms
The current study aims to separate the useful bands from the useless ones of the spectrogram in a spectrogram-based classification system. To achieve this objective, the spectrograms of acoustic events are analyzed to identify frequency bands that vary among different events. These bands can be used as indicators of acoustic events. Figure 2 provides an overview of the proposed methodology, and its pseudocode is outlined in Algorithm 1.
Assume an audio sample is divided into N frames of equal length, and $X_M$ is the spectrogram of each frame, with M frequency bands, so $X_{M \times N}$ is the spectrogram matrix of the sample. Each column of X corresponds to a frame, i.e., a time interval, while each row represents a frequency band. Suppose that the only difference between different time intervals, i.e., columns of the spectrogram matrix, is the presence of an acoustic event in that column. In this case, the columns can be divided into two non-overlapping groups: one with and one without an acoustic event. To avoid ambiguity, overlapping periods with more than one event were removed from the input data (Figure 3). In statistics, the two-sample t-test checks the equality of means between two populations, i.e., two groups, that follow a normal distribution.
In this study, the values of each frequency band within each group were treated as a population. The normality condition was checked and validated in each frequency band to ensure the validity of the final result. Thus, if the equality of means between the two populations is rejected, it is reasonable to mark the frequency band as usable in the AED. In contrast, if the equality of means is not rejected, the frequency band is considered useless due to similarity. To mitigate the impact of transient and short-term noise on the results, all training samples were used for analysis, and the results were averaged. Performing this analysis for all acoustic events in the dataset and then averaging the results yields a comprehensive model of the similarities and differences between acoustic events across the entire frequency range.
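The grouping step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_band_populations`, the list-of-rows representation of the spectrogram, and the toy values are assumptions made for the example, not part of the original implementation.

```python
def split_band_populations(spectrogram, event_frames):
    """Split every frequency band (row) into two populations.

    spectrogram  -- list of M rows (frequency bands), each with N frame values
    event_frames -- list of N booleans, True where the event is active
    Returns a list of (with_event, without_event) pairs, one per band.
    """
    populations = []
    for band in spectrogram:
        with_event = [v for v, e in zip(band, event_frames) if e]
        without_event = [v for v, e in zip(band, event_frames) if not e]
        populations.append((with_event, without_event))
    return populations


# Toy 2-band, 6-frame spectrogram (made-up values, for illustration only).
toy = [
    [0.1, 0.2, 0.9, 1.0, 0.8, 0.1],  # band 0 changes while the event is active
    [0.5, 0.5, 0.5, 0.6, 0.5, 0.5],  # band 1 barely changes
]
mask = [False, False, True, True, True, False]
pops = split_band_populations(toy, mask)
print(pops[0])  # ([0.9, 1.0, 0.8], [0.1, 0.2, 0.1])
```

Each pair returned for a band is then the input to the two-sample test of means for that band.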

Mel Spectrogram
Many studies on audio analysis and AED used the mel spectrogram and MFCC as features. The current study analyzed mel spectrograms to identify the frequency bands that are effective for AED systems. Knowing the effective frequency bands for each acoustic event in the spectrogram makes it possible to annotate the presence and even the start and end times of the acoustic event with greater accuracy than considering all frequency bands. Moreover, identifying the dominant frequency bands for each event allows for separating similar but not identical events, thereby reducing the possibility of errors. Several parameters need to be set in the mel spectrogram, including the number of filters in the filter bank, the duration of the time interval, the type of window used in the time domain, the number of points in the Fourier transform, and the amount of overlap between time intervals. These parameters affect the system's accuracy. Figure 4 shows a mel spectrogram filter bank.
It is well known that the bandwidth of mel filters is smaller at low frequencies and increases at higher frequencies. In the analysis performed in this study, an audio sampling frequency of 44,100 Hz was used, and the number of points in the time window was set to 2048, with an overlap of 1024 points. This means that each time window overlaps adjacent time slots by half of the window size. Also, the number of mel filters was set to 173. Different windows in the time domain have two important but opposite characteristics: a narrow "main lobe width" and a high attenuation of the side lobes. Compared to Hanning and Hamming, the Blackman-Harris window has a wider main lobe width, which is a disadvantage but without effect in the current application; however, it has a stronger sidelobe attenuation than other windows, which is highly desirable, so Blackman-Harris was used in this study [71]. Based on the fact that all clips in the dataset have a duration of ten seconds, the final mel spectrogram matrix is a 173 × 429 matrix, where 173 frequency bands correspond to rows and 429 columns correspond to time intervals. In this step, based on the dataset labeling, intervals where only one audio event is present can be separated from intervals without events. To avoid ambiguity, the intervals where the acoustic event started or ended were excluded from the analysis.
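As a quick check of the reported matrix size, the number of time intervals can be recomputed from the stated parameters. The sketch below assumes that only full windows are counted, with no padding at the clip boundaries; the helper name `frame_count` is introduced here for illustration.

```python
def frame_count(n_samples, window, hop):
    """Number of full analysis windows that fit in the signal (no padding)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


sample_rate = 44100            # Hz, as stated in the text
clip_seconds = 10              # all clips last ten seconds
window = 2048                  # points per time window
hop = window - 1024            # an overlap of 1024 points gives a hop of 1024

n_frames = frame_count(sample_rate * clip_seconds, window, hop)
print(n_frames)                # 429
```

Under this assumption, the 441,000 samples of a ten-second clip yield 429 columns, which agrees with the 173 × 429 matrix described above.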

Student's t-Test Analysis
To verify the effectiveness of each frequency band in detecting audio events, a two-sample Student's t-test is performed, as previously mentioned. It is assumed that the spectrogram values within the frequency bands at different time intervals follow a normal distribution (details of the normality test are discussed in the following section). The time intervals of each frequency band can be divided into two parts, with and without events, and the means of these two parts, as two populations, can be compared using the Student's two-sample t-test, whose form depends on the variances of the populations. If the variances of the two populations are known, the test statistic is given by [72]:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \tag{1}$$
where $\bar{x}_1$, $\bar{x}_2$ are the sample means, $\sigma_1$, $\sigma_2$ are the standard deviations, and $n_1$, $n_2$ are the numbers of samples, i.e., the numbers of time intervals, in each population. In cases where the variances are unknown but assumed to be equal, the test statistic is [72]:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \tag{2}$$
where $S_p^2$ is the pooled variance, which can be calculated as [72]:
$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \tag{3}$$
where $S_1^2$ and $S_2^2$ are the variances calculated from the two populations and $n_1$, $n_2$ are the numbers of samples as in Equation (1). If the variances are unequal, the following equation [73] can be used:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \tag{4}$$
where $S_1^2$ and $S_2^2$ are the variances of the two populations, and $n_1$, $n_2$ are the numbers of samples. The equality of variances was checked for the populations of all frequency bands. The results indicated that, in most cases, the variances were not equal. Therefore, Equation (4) was applied to perform the t-tests. Finally, if the hypothesis of equal means of the two populations, one with an event and another without an event, is not rejected in a frequency band, this frequency band is ineffective for detecting this event because the values of this frequency band remain the same with and without the event. In contrast, rejecting the hypothesis shows that this frequency band differs with and without events, which makes it helpful in detecting the event. In cases where the hypothesis is accepted for some clips and rejected for others, this frequency band can be used to indicate the event; however, it is not a strong indicator. This statistic is highly efficient in feature selection because it can determine the usefulness of a feature in a binary or multi-event classification system, regardless of the classification method. To the best of our knowledge, there have been no previous studies on the functionality of the frequency bands of the spectrogram in AED systems, and the current study is the first attempt to explore the effects of each spectrogram frequency band on AED systems.
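The unequal-variance statistic used for the tests can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `welch_t` and the toy samples are assumptions, and the degrees-of-freedom expression is the standard Welch-Satterthwaite approximation, which is not stated in the text and is included here only because it is normally needed to look up the critical value.

```python
import math
from statistics import mean, variance


def welch_t(sample_1, sample_2):
    """Unequal-variance two-sample t statistic and approximate degrees of freedom.

    The degrees of freedom use the Welch-Satterthwaite approximation
    (an assumption of this sketch, not taken from the article).
    """
    n1, n2 = len(sample_1), len(sample_2)
    v1, v2 = variance(sample_1), variance(sample_2)  # unbiased sample variances
    se2 = v1 / n1 + v2 / n2                          # squared standard error
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Toy populations for one band: frames with and without the event.
with_event = [1.0, 2.0, 3.0, 4.0, 5.0]
without_event = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(with_event, without_event)
print(round(t_stat, 4), round(dof, 4))  # -1.8974 5.8824
```

In practice, a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value, which can be compared against the 5% significance level.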

Normality Tests
The normality of two populations is a necessary assumption in a test of the equality of two means. Various tests can be used to assess data normality. Five tests were used in this study: Kolmogorov-Smirnov, Lilliefors, Anderson-Darling, Jarque-Bera, and Shapiro-Wilk. The Kolmogorov-Smirnov test is a nonparametric test that examines the fit of a given probability distribution to a set of samples. This test first transforms the data into the standard normal form: zero mean, unit variance. Subsequently, the cumulative distribution function of the data is compared to a standard normal cumulative distribution function. The normality of the data can be accepted or rejected based on the differences between the two graphs. With some modifications, this test is also used to check the goodness of fit. The Lilliefors test is similar to the Kolmogorov-Smirnov test in its initial stage. The difference between the two tests is how the cumulative distribution function is calculated. In the Lilliefors test, the data is not transformed into standard form, and the cumulative distribution is calculated directly. Normality is accepted or rejected based on the maximum discrepancy between the ideal normal cumulative distribution and the empirical cumulative distribution function of the data. One challenge of this test is determining the significance of the difference between the data distribution function and the ideal form. Because the test function is calculated based on the mean and variance of the data, it appears to be similar to the normal function, which can be considered a weakness. Nevertheless, this test can yield better results in some cases than the Kolmogorov-Smirnov test. The third test is the Anderson-Darling. In the general form, the Anderson-Darling test compares any population to any possible distribution, including the normal distribution. Similar to the Kolmogorov-Smirnov test, this test involves comparing the empirical distribution function of the data to the ideal normal distribution function. However, the initial assumptions of the Anderson-Darling test differ. The Anderson-Darling test has four different modes for testing the normality of data, which are as follows:
1. The mean and variance of the data are both known;
2. The data variance is known, but the mean is unknown;
3. The mean of the data is known, but the variance is unknown;
4. Both the mean and variance of the data are unknown.
In the current study, the mode where the data's mean and variance were unknown was used. In such cases, the mean and variance of the data are first estimated using statistical relationships. The data are then transformed into a standard form according to the following relationship [73]:
$$y_i = \frac{x_i - \bar{x}}{s}$$
where $\bar{x}$ and $s$ are the estimated mean and standard deviation of the data. With $\Phi$ denoting the standard normal cumulative distribution function and $y_{(1)} \le \dots \le y_{(n)}$ the sorted standardized data, the Anderson-Darling statistic is computed as [73]:
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln \Phi\left(y_{(i)}\right) + \ln\left(1 - \Phi\left(y_{(n+1-i)}\right)\right) \right]$$
Based on this statistic, an adjusted statistic, $A^{*2}$, which accounts for the number of samples, is then computed [73]. It is important to note that this adjustment is valid when the mean and variance are unknown and are estimated based on the data. If either the $A^2$ or the $A^{*2}$ value exceeds the value given in the Anderson-Darling distribution table, the assumption of data normality is rejected. The fourth test performed to ensure data normality is the Jarque-Bera test. Unlike previous tests, this test compares the data probability distribution with a standard normal distribution based on skewness and kurtosis. Deviations of skewness and kurtosis from the normal distribution values lead to the rejection of normality. If the mean and variance of the data are not known, skewness and kurtosis can be calculated using the following equations [74]:
$$S = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}, \qquad K = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}}$$
After calculating skewness and kurtosis from the third and fourth central moments of the data, the Jarque-Bera statistic is calculated as [75]:
$$JB = \frac{n}{6} \left( S^2 + \frac{(K - 3)^2}{4} \right)$$
where n is the number of samples. To accept or reject normality, the Jarque-Bera statistic is compared with the Jarque-Bera table obtained by the Monte Carlo method or chi-square approximation. Here, the Monte Carlo table was used based on the number of samples in the two populations. According to [76,77], the Shapiro-Wilk test is the most appropriate normality test for data with a sample size of less than 50:
$$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $x_i$ are the samples, $x_{(i)}$ is the i-th smallest sample, $\bar{x}$ is the mean, and the $a_i$ coefficients are normalized best linear unbiased estimators that can be computed using methods such as the Monte Carlo method [78,79]. Because of the considerable variation in population size and the dependence of the normality test accuracy on the number of samples, the types of tests in this study were selected based on the number of samples [80][81][82].
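Of the tests above, the Jarque-Bera statistic is the simplest to compute directly from its definition. The sketch below uses the biased (divide-by-n) central moments, which is the usual convention for this statistic but is an assumption here; the function name `jarque_bera` and the toy data are illustrative only.

```python
def jarque_bera(data):
    """Return (skewness, kurtosis, JB statistic) from the central moments.

    Moments are divided by n (biased estimators), which is the usual
    convention for this statistic and an assumption of this sketch.
    """
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return skew, kurt, jb


skew, kurt, jb = jarque_bera([1.0, 2.0, 3.0, 4.0, 5.0])
print(round(skew, 4), round(kurt, 4), round(jb, 4))  # 0.0 1.7 0.3521
```

The resulting value would then be compared with the critical value from the Monte Carlo table for the given number of samples.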

Validation Scheme
Figure 5 illustrates the scheme used to validate the results. In the first step, effective bands were determined between events using the train and test data separately. Then, the values 1 (one) and 0 (zero) were assigned to effective and excluded bands, respectively, and a binarized vector with 173 elements for each pair of events was created. In the second step, the Dice coefficient, Hamming distance, and Jaccard distance were used. Among these metrics, the Dice coefficient measures similarity, while the Hamming and Jaccard distances measure differences. In the proposed scheme, one minus the Dice coefficient was used as the Dice distance to measure differences. In an ideal scenario, the results of the training and testing data are perfectly similar, so the Dice, Hamming, and Jaccard distances should all be zero. Given two binarized vectors, $R_{train}$ and $R_{test}$, each with n binary elements, the Jaccard distance measures the missed overlap between $R_{train}$ and $R_{test}$ relative to the number of bands selected in at least one of the two vectors, i.e., disregarding the bands excluded in both. First, the following parameters were defined:
• $E_{11}$: number of elements where both $R_{train}$ and $R_{test}$ are equal to 1 (one);
• $E_{01}$: number of elements where $R_{train}$ is equal to 0 (zero) and $R_{test}$ to 1 (one);
• $E_{10}$: number of elements where $R_{train}$ is equal to 1 (one) and $R_{test}$ to 0 (zero);
• $E_{00}$: number of elements where both $R_{train}$ and $R_{test}$ are equal to 0 (zero).
Each binary element must fall into exactly one of these four categories, meaning that:
$$E_{11} + E_{01} + E_{10} + E_{00} = \text{Total spectrogram bands}$$
where the total number of spectrogram bands is equal to 173 in the current study. The Jaccard distance, $d_J$, is given by [83,84]:
$$d_J = \frac{E_{01} + E_{10}}{E_{11} + E_{01} + E_{10}}$$
The Hamming distance, $d_H$, measures the missed overlap between $R_{train}$ and $R_{test}$ relative to the total number of bands and is given by [84,85]:
$$d_H = \frac{E_{01} + E_{10}}{E_{11} + E_{01} + E_{10} + E_{00}}$$
Finally, the Dice distance, $d_D$, is defined as [83,84,86]:
$$d_D = \frac{E_{01} + E_{10}}{2 E_{11} + E_{01} + E_{10}}$$
Since the Dice distance does not satisfy the triangle inequality, it can be considered a semi-metric version of the Jaccard distance. All metrics are reported here as percentages.
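The three distances can be computed directly from the four counts defined above. The following sketch is illustrative: the function name `band_distances`, the returned dictionary, and the five-element toy vectors are assumptions (the study uses vectors with 173 elements), and the convention of returning zero when no band is selected in either vector is also an assumption.

```python
def band_distances(r_train, r_test):
    """Jaccard, Hamming, and Dice distances (in percent) between two binary vectors."""
    pairs = list(zip(r_train, r_test))
    e11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    e10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    e01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    e00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    mismatch = e01 + e10
    # Assumed convention: distances are zero when no band is selected at all.
    jaccard = 100.0 * mismatch / (e11 + mismatch) if e11 + mismatch else 0.0
    hamming = 100.0 * mismatch / (e11 + mismatch + e00)
    dice = 100.0 * mismatch / (2 * e11 + mismatch) if e11 + mismatch else 0.0
    return {"jaccard": jaccard, "hamming": hamming, "dice": dice}


# Toy vectors: 1 marks an effective band, 0 an excluded band.
r_train = [1, 1, 0, 0, 1]
r_test = [1, 0, 1, 0, 1]
d = band_distances(r_train, r_test)
print(d)  # jaccard 50.0, hamming 40.0, dice about 33.33
```

For identical vectors, all three distances are zero, which corresponds to the ideal scenario described above.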

Results and Analysis
In this study, when the number of samples in a population was less than 50, the Shapiro-Wilk test was used to test normality. When the number of samples exceeds 50, alternative tests are recommended to verify normality [87]. In these cases, the dominant response of the Lilliefors, Anderson-Darling, and Jarque-Bera tests was used. Almost all the statistical tests had a reasonable response when the number of samples exceeded 300. In this case, which is the typical case here, the dominant response of the four tests, Lilliefors, Anderson-Darling, Jarque-Bera, and Kolmogorov-Smirnov, was chosen as the result. Table 1 presents the results of the four normality tests for the most common case in the current study. The selected normality test results are given in the hybrid column of Table 1. The results in Table 1 confirm the validity of the assumption of normality.

Mean Equality Test
To perform the two-sample test of means, the following assumptions were considered:
• The spectrogram examined in this study had 173 frequency bands; the mean equality test was performed separately for each frequency band;
• There was only one event in the populations selected for the test;
• The minimum number of samples in each population was equal to nine;
• The populations had an unequal number of samples;
• Each population, which consisted of consecutive samples belonging to an event, was compared with events from the same audio file to minimize the effects of background noise;
• The percentage of rejections in the "mean equality test" was calculated separately for each audio event compared to other events and background, i.e., no event, using all training samples (6000 samples);
• The assumed confidence level for all tests was equal to 95%;
• If a population failed in the normality test and its skewness and kurtosis deviated strongly from the normal distribution, it was excluded from the test;
• The higher values in Figures 6-8 indicate frequency bands with a higher probability of a mathematical difference between two acoustic events, as indicated by a higher percentage of rejections in the mean equality test.
In Figure 6, four events are compared against the background: gunshot, jackhammer, siren, and street music. It can be seen that the jackhammer spectrogram differs from the background in the bands between 10 and 150 in at least 80% of the clips. However, this situation is not observed for the other three events. Among the four events, siren differs from the background only in a relatively narrow range of frequency bands.
Figure 7 depicts the result of the mean equality test between the gunshot event and the other events. It can be seen that the importance of different frequency bands in distinguishing the gunshot event from other events varies depending on the type of the second event. In all the reported results, frequency bands with a higher rejection percentage of the mean equality test (the value on the vertical axis of the graphs depicted in the figures) are more suitable for classification. Based on the results depicted in Figure 7, dog bark, siren, and street music have a smaller area under the curve than the other events when compared with gunshot, which indicates a higher probability of classification error between these three events and gunshot when classified by the spectrogram. Conversely, air conditioner, engine idling, and jackhammer exhibited the most significant differences. Thus, if an AED system's confusion matrix shows significant errors between the gunshot and dog bark classes, a new classifier can be developed using the most appropriate bands from the spectrogram, as shown in Figure 7. This approach enhances the AED efficiency and reduces errors. Similar analyses can be performed for other events. For example, Figure 8 shows the results for dog bark against the other events, which differ from those for gunshot.
As an implicit rule, when the rejection percentage of the mean equality test is less than 75%, the frequency bands are considered ineffective. This statistical criterion can be used as an efficient method for feature selection based on statistical patterns without the need for evolutionary or iterative techniques. The only limitation of the proposed method is the requirement for many samples. The ratios between the effective frequency bands, i.e., those with a rejection percentage greater than 75%, and the total frequency bands are indicated separately in Table 2. A higher ratio indicates that more spectrogram bands can be helpful. In contrast, a smaller ratio indicates that more spectrogram bands can be removed when developing an AED system. The weakest result in Table 2 is 17.9% (between siren and dog bark), indicating that only 31 of the 173 spectrogram frequency bands effectively distinguish between the dog bark and siren classes. Regarding the weak features, the dog bark has more in common with other events, showing that for this event, many spectrogram bands can be removed during the classifier design without reducing the efficiency. According to the results in Table 2, many spectrogram bands (approximately 26.3%) can be omitted during the AED design. Thus, in addition to noise, complexity, and training time, the number of samples required to train the system is also reduced.
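The 75% rule and the effective-band ratio reported in Table 2 can be illustrated with a short sketch. The function names are hypothetical, the rejection percentages below are made-up toy values rather than the paper's data, and treating a band at exactly 75% as ineffective is an assumption.

```python
def effective_bands(rejection_pct, threshold=75.0):
    """Indices of bands whose rejection percentage exceeds the threshold."""
    return [i for i, rp in enumerate(rejection_pct) if rp > threshold]


def effective_ratio(rejection_pct, threshold=75.0):
    """Effective bands as a percentage of all bands."""
    selected = effective_bands(rejection_pct, threshold)
    return 100.0 * len(selected) / len(rejection_pct)


# Toy example: 173 bands, of which 31 exceed the 75% threshold.
toy_rejection_pct = [90.0] * 31 + [40.0] * 142
ratio = effective_ratio(toy_rejection_pct)  # about 17.9
```

With 31 effective bands out of 173, the ratio is about 17.9%, which matches the arithmetic behind the weakest entry quoted from Table 2.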
Table 2. Percentage of effective bands to total spectrogram bands in the train set of the URBAN-SED dataset (the listed events are: A = air conditioner, B = car horn, C = children playing, D = dog bark, E = drilling, F = engine idling, G = gun shot, H = jackhammer, I = siren, J = street music).

Validation
Figure 5 illustrates the scheme used to validate the results in Table 2. If the results of Table 2 are valid, the bands selected from the test samples should be relatively similar to those selected from the training samples.
Based on the results of Table 3, the average difference in the Jaccard metric over all events is 7.77%, and the greatest change occurred for the dog bark and siren events. This large difference (25%) is due to the low number of effective bands between the dog bark and siren events (Table 2), i.e., a small E11. Because E11 appears in the denominator of Equation (13), such a large value does not necessarily indicate a high mismatch. Owing to the denominators of Equations (13) and (15), both the Jaccard and Dice metrics may show a high value despite a low mismatch when the number of effective bands between two events, E11, is small. According to the results of Table 2, only 17.34% of the bands (equivalent to 30 bands) between these two events were effective. In this situation, a slight mismatch of ten bands (out of 173 bands) between the bands selected from the training and testing data produced a 25% mismatch in the Jaccard metric. In such cases, bands with values close to but less than the specified rejection percentage of the mean equality test can also be selected as effective bands.
Table 3. Jaccard distance (%) between the results of train and test datasets (the listed events are: A = air conditioner, B = car horn, C = children playing, D = dog bark, E = drilling, F = engine idling, G = gun shot, H = jackhammer, I = siren, J = street music).

In the Hamming metric, the total length of the vector is taken as the denominator (Equation (14)), so a small number of effective bands between two events, E11, does not affect the response. Based on the results of Table 4, the maximum mismatch between train and test results is 8.1%, which occurred for the drilling and gunshot events. The average difference over all events is 5.7%, which shows that the change in the selected bands between train and test data is very slight: only 10 bands (out of 173 bands) differ.

The average Dice difference between the training and testing samples is 4.07% (Table 5), which reflects the good alignment between the bands selected using the training data and those selected using the testing data. The maximum difference in this metric is 14.3%, which occurred between the dog bark and siren events. As with the Jaccard metric, the reason for this high difference is the low number of effective bands in this case. According to Equation (15), if only a few effective bands are available, E11 is small, and a small mismatch between the two vectors causes a large difference. To solve this problem, it is sufficient to increase the number of effective bands by reducing the specified rejection percentage of the mean equality test. It can be concluded that effective frequency bands selected using the training data (Table 2) also achieve good results on the test data, demonstrating the effectiveness of the proposed frequency band selection method. Using these tables makes it possible to select effective spectrogram bands for AED systems. Therefore, the proposed method can be considered a suitable scheme for feature selection in AED or classification systems because the rejection percentage is not affected by the feature type.

Conclusions and Future Work
In this article, a statistical method for feature analysis was proposed. The proposed method considers each feature value as a statistical population. The samples of each feature are divided into two populations according to whether or not they belong to a particular class. The means of these two populations are compared using the two-sample t-test. The feature is considered useful if the rejection percentage of the mean equality test for these two populations is sufficiently large, and useless otherwise. To demonstrate the efficiency of this approach, different frequency bands of the acoustic signal spectrogram were analyzed in an AED system. Since the populations in the two-sample t-test must be normally distributed, various normality tests were performed, and the normality of the spectrogram features was validated. After the normality test, the two-sample t-test was used to analyze the mean equality of each frequency band of the spectrogram for every pair of acoustic events. According to the results, many spectrogram features (approximately 26.3%) could be omitted during the AED design. In this way, in addition to noise, complexity, and training time, the number of samples required to train the system is also reduced. Moreover, the training and testing sets were analyzed separately, and the results showed an average difference of 7.77% in the Jaccard, 4.07% in the Dice, and 5.7% in the Hamming metrics. These small values indicate the validity of the obtained results for the test set.
The assumption of normality in the input data is the only limitation of the proposed method. As future work, the proposed method can be applied to different AED systems, and its efficiency can be evaluated. Further analysis is needed to show that the selected frequency bands are as effective as all frequency bands in machine learning or deep learning models. To this end, the proposed approach should be applied to state-of-the-art AED systems, and the accuracy of the system with the two inputs, i.e., selected bands and all bands, should be compared.

Funding:
The first author would like to thank "Fundação para a Ciência e a Tecnologia" (FCT) for his Ph.D. grant with reference 2021.08660.BD. This article partially results from the project "Sensitive Industry", co-funded by the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020) under the PORTUGAL 2020 Partnership Agreement.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Figure 2. Overview of the proposed method.

Figure 3. Process of removing time intervals containing multiple events and preparing populations.

Figure 4. Example of a mel spectrogram filter bank (each line represents one of the filters used).

Figure 5. Scheme used for the validation of the results.

Figure 6. Rejection percentage (RP) of mean equality test for four events according to the background.

Figure 7. Rejection percentage (RP) for mean equality test between gunshot and other events.

Figure 8. Rejection percentage (RP) for mean equality test between dog bark and other events.

Table 4. Hamming distance (%) between the results of train and test data (the listed events are: A = air conditioner, B = car horn, C = children playing, D = dog bark, E = drilling, F = engine idling, G = gun shot, H = jackhammer, I = siren, J = street music).

Author Contributions:
Conceptualization, funding acquisition, and supervision by J.M.R.S.T.; investigation, data collection, and code implementation by V.H. and A.A.G.; formal analysis and original draft preparation by V.H., A.A.G., N.H. and M.Z.; writing, review, and editing by J.J.M.M. and J.M.R.S.T. All authors have read and agreed to the version of the manuscript.

Table 5. Dice distance (%) between the results of train and test data (the listed events are: A = air conditioner, B = car horn, C = children playing, D = dog bark, E = drilling, F = engine idling, G = gun shot, H = jackhammer, I = siren, J = street music).