Measurements, Analysis, Classification, and Detection of Gunshot and Gunshot-like Sounds

Gun violence has been on the rise in recent years. To help curb this negative influence on communities, machine learning strategies for gunshot detection can be developed and deployed. After outlining the procedure by which a typical gunshot-like sound was measured, this paper focuses on the analysis of feature importance pertaining to gunshot and gunshot-like sounds. The random forest mean decrease in impurity and the SHapley Additive exPlanations feature importance analyses were employed for this task, and feature reduction was then carried out on the basis of the feature importance analysis. Features extracted from 1-s audio clips via the Mel-frequency cepstral coefficient process were reduced to a more manageable quantity using the above-mentioned feature reduction processes and sent to a random forest classifier. The SHapley Additive exPlanations feature importance output was compared to that of the mean decrease in impurity feature importance. The results show which Mel-frequency cepstral coefficient features are important in discriminating gunshot sounds from various gunshot-like sounds. Together with the feature importance/reduction processes, the recent uniform manifold approximation and projection method was used to compare the closeness of various gunshot-like sounds to gunshot sounds in the feature space. Finally, the approach presented in this paper provides a viable means of making gunshot sounds more discernible from other sounds.


Introduction
Considering the recent uptick in senseless shootings in otherwise quiet and relatively safe environments, there is a need, now more than ever, to deter these incidents. Although the first barrier to diminishing gun violence involves the push for implementing gun control laws, artificial intelligence (AI) can play a significant role in helping deter individuals who might slip through the first barrier. The installation of sensors can assist in the proper surveillance of surroundings tied to public safety, which is the first step toward AI-driven surveillance. With the increase in popularity of machine learning (ML) processes, systems are being developed and optimized to assist personnel in highly dangerous situations. Beyond saving innocent lives, AI algorithms hosted in acoustic gunshot detection systems (AGDSs) can also help capture the responsible criminals.
Researchers have also been very active in seeking effective tools to combat gun violence. Population health technology (PHT), defined as the application of emerging technology to improve the health of populations [1], includes the use of sensors to detect the acoustics of gunshots. It is common practice to use video for criminal surveillance and monitoring; however, this practice has its limitations. For example, the strength of video-only surveillance is inhibited when the field of view (FOV) is occluded or when proper lighting is lacking [2][3][4][5]. In these instances, videos fail to reliably detect and account for sound events with abrupt changes in energy, such as door slams, claps, and firecrackers [10][11][12][13][14][15].

Gunshot-like Sound Measurement
Together with procuring the gunshot and gunshot-like sounds from both the open and paid resources listed above, the authors generated their own database for the plastic bag pop (class 8 in Table 1) gunshot-like sound.
The plastic bag pop sounds were recorded in likely environments where gunshots can be fired, some of which are listed here (see Figure 1 below): (a) inside a building along a corridor, (b) inside a personal dwelling, (c) outdoors between two buildings, (d) outdoors on the side of a building, and (e) outdoors in an open field. Together with these likely environments, a controlled set of data was taken in an anechoic chamber (see Figure 1f below).
As noted in Figure 1 above, various microphones were used in the data collection process, together with the various environments. The various microphones, with their associated frequency responses, enhanced the audio variety of the collected data.
Together with the two variables listed above, that is, the environment and the microphones, various sizes of plastic bags and distances from the microphones were also included during the collection process. The variety of audio files collected in this procedure lends itself to a very robust dataset. In the experiments, the Tascam DR-05X, Zoom H4nPro, Brüel & Kjær, Blue Yeti USB, JLab TALK GO, iPad mini, and a Samsung S9 were the recording devices used. The Samsung S9 was the only device that automatically adjusted its recording level, while the others had to be adjusted manually. These different recording devices, with their varying sensitivities across the frequency range, allowed for the capture of plastic bag pop sounds with various audio responses and audio levels. Much more granular detail on the data collection process can be found in Ref. [16].

Plastic bag pop recording configurations:
0.05 m (2 in) from the Yeti mic using 1.89 L (0.5 gal) bags, outdoor park
0.30 m (1 ft) from the Yeti mic using 1.89 L (0.5 gal) bags, side of building
0.91 m (3 ft) from the Tascam mic using 9.08 L (2.4 gal) bags, between buildings
1.52 m (5 ft) from the JLab mic using 1.89 L (0.5 gal) bags, inside lab with curtains
3.04 m (10 ft) from the Brüel & Kjær mic using 15.14 L (4 gal) bags, inside home
4.57 m (15 ft) from the JLab mic using 3.02 L (0.8 gal) bags, inside lab with glass walls
6.10 m (20 ft) from the Zoom mic using 1.89 L (0.5 gal) bags, inside home
6.71 m (22 ft) from the JLab mic using 9.08 L (2.4 gal) bags, inside home
7.32 m (24 ft) from the iPad mini using 1.89 L (0.5 gal) bags, outdoor park
7.32 m (24 ft) from the Samsung S9 phone using 1.89 L (0.5 gal) bags, outdoor park

Thunderstorm clip descriptions: deep rumble; long and slow rolling bursts; long thunderstorm with hard rain; rain and thunder approaching; rolling thunderstorm; storm with strong thunder; strong thunderstorm in city; thunder rumble with constant rain; thunderstorm in closed car.

Audio Classes
As previously mentioned, audio samples of gunshot-like sounds were procured. Table 1 below shows the order of the audio classes together with a partial list of the groups of audio clips in each class used in the analysis.
For the majority of the gunshot clips in the database, the actual gunshot or gunshot-like sound lasted for approximately 0.3 s. With this in mind, the authors decided to standardize, to 1 s, the duration of all the audio clips to be fed to the feature extraction process discussed next.
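The 1-s standardization step can be sketched as follows. This is an illustrative implementation, not the authors' code; the paper does not state whether short clips were zero-padded or looped, so zero-padding is assumed here.

```python
import numpy as np

def fix_to_one_second(y, sr):
    """Pad with trailing zeros or truncate so the clip lasts exactly 1 s.

    y  : 1-D array of audio samples
    sr : sample rate in Hz (1 s of audio = sr samples)
    """
    target = sr  # number of samples in one second
    if len(y) < target:
        return np.pad(y, (0, target - len(y)))  # zero-pad at the end
    return y[:target]                           # truncate long clips
```

A clip loaded with, e.g., librosa.load would then be passed through this helper before feature extraction.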

Feature Extraction
There is a wide variety of feature extraction methods to choose from. The feature extraction method chosen for this analysis is the MFCC. Although the MFCC was initially implemented for speech recognition [28], it has found its way into the realm of sound classification [29][30][31]. Figure 2 below shows the normalized Mel-filter bank that is typically implemented in a 40-MFCC feature extraction process for a 96 kHz sample rate audio file (note that although a sample rate of 48 kHz is more than sufficient for the analysis that follows, the 96 kHz example is used as an illustration, since some of the audio files from the secondary databases were sampled at 96 kHz). In a conventional application, as done in librosa, the end frequency of the filter bank is set to sample rate/2. From the minimum to the maximum filter bank frequencies, 40 triangular filters are produced and resized according to the Mel scale. An email (B. McFee, personal communication, 22 July 2022) conveyed that adding the delta features accounts for local, short-term temporal patterns. Using only the MFCC features makes the analysis sensitive to exact timing alignment, which should be avoided, especially when the dataset is limited. Figure 3 shows the block diagram for MFCC feature extraction together with its delta coefficients (∆, the differentiation of the MFCCs, called the velocity features) and delta-delta coefficients (∆∆, the double differentiation of the MFCCs, called the acceleration features). Stated another way, the ∆ and ∆∆ are approximations of the first and second temporal derivatives of the MFCCs.
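The Mel-spaced band edges underlying the filter bank in Figure 2 can be reproduced in a few lines. The sketch below uses the HTK Mel formula; librosa's default filter bank uses the Slaney variant, so the exact edge frequencies may differ slightly from those in the paper's tables.

```python
import numpy as np

def hz_to_mel(f):
    """HTK Mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(sr, n_mels=40):
    """Edge frequencies (Hz) of n_mels triangular filters spanning 0..sr/2.

    n_mels filters need n_mels + 2 edge points (each triangle shares its
    edges with its neighbours), spaced uniformly on the Mel scale.
    """
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    return mel_to_hz(mels)
```

For a 96 kHz file with 40 filters, the edges start densely packed at low frequencies and widen toward 48 kHz, which is exactly the low-frequency emphasis discussed in the text.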
y_k = σ(∑_j w_ij x_j + b_k)

where σ is the activation function, w_ij is the weight, x_j is the feature vector, and b_k is the bias term. Note that one of the major pitfalls of using the ∆ features is that differentiators tend to amplify noise, causing the output to be noisier than the original signal. Applying the differentiation twice makes the features even more unstable.
In our analysis, we extracted 40 features each for the MFCC, ∆, and ∆∆. We then took the mean and standard deviation of each extracted feature and stacked the vectors horizontally per sample, obtaining 240 features in total (see Figure 4a-c later in the paper). The mean and standard deviation features were used to increase the feature set for the feature importance/feature reduction analysis.
As we take a closer look at the extracted MFCC features, we will see later in this paper (refer to Section 3 below) some of the important features sorted by the relevant feature importance algorithm. Table 2 below shows the feature names together with their respective meanings. For example, feature 0 = MFCC_MEAN_FLTR0, which means the MFCC mean coefficient from Mel band 0 (0 Hz-160.97 Hz). Another example: feature 200 = DELTA2_STDDEV_FLTR0, which means the delta-delta standard deviation coefficient from Mel band 0. It is worth noting that the Mel bands change depending on the sample rate of the input audio file and the number of MFCCs chosen. Table 3 below lists the Mel triangular frequency bands used for 96 kHz sample rate audio files; the number of Mel coefficients is also used to generate this frequency table. As mentioned earlier, 40 Mel coefficients/filters were implemented in our analysis. When the MFCC is used as the feature extraction process, good resolution is attained at low frequencies, whereas at higher frequencies, broad ranges get lumped into one band. This is because the Mel-scaled bank is designed to mimic the human auditory system: human perception of pitch, which can be approximately described as logarithmic, translates into better deciphering of low frequencies as compared to high frequencies. The MFCC feature vector represents the spectral envelope of a single frame, in the sense that two signals with a similar spectral envelope will have a similar sequence of MFCCs.
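The 240-feature construction can be sketched as below. The index layout (e.g., feature 80 = DELTA_MEAN_FLTR0, feature 200 = DELTA2_STDDEV_FLTR0) is inferred from the feature names in Table 2, so it should be treated as an assumption rather than the authors' exact code.

```python
import numpy as np

def summarize_features(mfcc, delta, delta2):
    """Stack per-band mean and standard deviation into a 240-dim vector.

    Each input is a (40, n_frames) array. The output ordering matches the
    inferred layout: indices 0-39 MFCC means, 40-79 MFCC stds,
    80-119 delta means, 120-159 delta stds, 160-199 delta-delta means,
    200-239 delta-delta stds.
    """
    parts = []
    for block in (mfcc, delta, delta2):
        parts.append(block.mean(axis=1))  # per-band temporal mean
        parts.append(block.std(axis=1))   # per-band temporal std
    return np.hstack(parts)
```

With librosa, the three inputs would come from librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), librosa.feature.delta(mfcc), and librosa.feature.delta(mfcc, order=2).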
As can be observed from the zoomed-in spectrogram plots of the plastic bag pop and the gunshot sound in Figure 5 below (produced using Adobe Audition), much of the energy content of the gunshot sound is concentrated in the lower end of the frequency spectrum. We will show later, in Section 3, how this concentration of low-frequency content leads to MFCC feature importances that are also concentrated in the low end of the frequency band. Figure 6 shows the spectrogram, together with the amplitude over time (below the spectrogram) and the power spectral density (PSD) (to the left of the spectrogram). In this view, noting that all the frequency content is in the power spectrum, the spectrogram tells us where in time those frequencies occurred. In addition, we observe that the power spectrum is the cumulative average of the spectrogram over time, so we can physically see where in time the majority of the frequency content is concentrated. Using Figure 6g (the gunshot sound) as an example, we can conclude that the plastic_pop sound (Figure 6i) looks very similar from the PSD perspective; that is, a lot of energy is concentrated in the 0-10 kHz region and then quickly decays to about half its original spectral density. Figure 7 provides a physical (waterfall plot) view of the same samples given in Figure 6 above for each of the various classes. Each plot shows the PSD over time and frequency. In Figure 7g, there are two gunshot sounds in quick succession at the beginning of the audio clip. If we take just a single shot and compare its power spectral density to that of the plastic_pop sound (Figure 7i), we see some similarities, confirming our observation from Figure 6g,i above.

Analysis Tools
Coupled with the low-level feature engineering process (i.e., simply generating the mean and standard deviation of the MFCCs and their derivatives), feature importance is also analyzed. Based on the feature importance analysis, we then employ feature selection.
As a general rule of thumb, if one has more features than samples, one runs the risk that the observations will be harder to cluster. According to the Hughes phenomenon [32], as the number of features increases, the classifier's performance increases until the optimal number of features is attained; adding features beyond this point, relative to the size of the training set, then degrades the classifier's performance. To overcome this curse of dimensionality, we apply two popular feature selection/reduction techniques, mean decrease in impurity and SHapley Additive exPlanations, to the analysis of gunshot and gunshot-like sounds.

Mean Decrease in Impurity
RF is an ensemble-trees model used mostly for classification. Ensemble methods combine several decision trees to produce a better predictive performance than that of a single decision tree.
The Gini impurity measure, one of the criteria used in decision tree algorithms, determines the optimal split at the root node and its subsequent splits. For a two-class problem, the Gini impurity of a dataset is a number between 0 and 0.5; it denotes the probability of misclassifying an observation. The lower the Gini impurity, the better the split and the lower the likelihood of misclassification. A value of 0 is attained when all cases in the node fall into a single target category. The mean decrease in impurity (MDI) is the sum of the weighted impurity decreases over all nodes, averaged over all trees.
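As a concrete illustration (not from the paper), the per-node Gini impurity can be computed directly from the class labels falling into a node:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2.

    Returns 0 for a pure node; 0.5 is the maximum for a two-class node.
    """
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()          # empirical class probabilities
    return float(1.0 - np.sum(p ** 2))
```

A pure node ([1, 1, 1, 1]) yields 0, while an evenly mixed two-class node ([0, 1]) yields the two-class maximum of 0.5.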
Figure 4a-c above shows the mean and standard deviation of the MFCC features extracted via the librosa library [33] and ranked via the sklearn RF MDI. The means of the delta and delta-delta features do not add much to the overall feature importance, although their standard deviations do show some impact in the early part of the feature set. As a refresher, designations of feat_"X" (or feature "X") refer to the "X"th Mel triangular filter or band (see Table 3 above) that is processed during the MFCC calculations. Figure 4d above shows the 20 most important features sorted by the RF feature importance based on MDI. Note that Figure 4d displays no features beyond the frequency band of 3807.86 to 4844.28 Hz (see Tables 2 and 3 above).
The plot in Figure 8 below shows the distribution of the data using the box-and-whisker approach for the 20 most important MDI features. The data in Figure 8 are displayed in quartiles, which also include the outliers. Features 2 and 3 show approximately the same dispersion of data as well as interquartile range (the length of the box). Additionally, the overall spread of features 2 and 1 is a bit larger than that of the other features (the length of the whiskers). The model's prediction toward the uncertainty region is guided by the outliers; an increase in outliers results in greater uncertainty of the model. Finally, a positive or right skew of the box plot indicates that higher values of the feature occur more often. Considering the three most important features (features 2, 3, and 40, as determined by the RF MDI feature importance) for Figure 9 below, in both the 2D and 3D space, the features for gunshot, jackhammer, doorknock_slam, and glassbreak_bulb-burst seem to be fairly dispersed. The remainder of the classes appear to be fairly well defined even down to a 2D space. The gunshot sounds appear to be closely intertwined with the plastic_pop sounds. Taking the centroids of each class (see Figure 9a above) and comparing the relative distances from the gunshot sounds, we arrive at the results shown in Table 4 below, where it can be observed that the plastic bag pop sound is closest to the gunshot sound [16] (note that "feat2" and "feat3" are the x and y coordinates, respectively, of the centroids of each class in Figure 9a). Also shown in Table 4 is a comparative analysis indicating which sounds are closest to the gunshot sound, in order of closeness: plastic bag pop, door knock, door slam, and car backfire. Other researchers have also stated that the door knock, door slam, and car backfire sounds are close to the actual gunshot sound [23][24][25][34][35][36][37][38].
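The centroid distance comparison behind Table 4 can be reproduced with a small helper. This is an illustrative sketch operating on any 2-D feature pair (such as feat2/feat3), not the authors' code:

```python
import numpy as np

def centroid_distances(X2d, labels, ref_class):
    """Per-class centroids in a 2-D feature space and their Euclidean
    distance to the centroid of ref_class (e.g. the gunshot class)."""
    X2d = np.asarray(X2d, dtype=float)
    labels = np.asarray(labels)
    cents = {c: X2d[labels == c].mean(axis=0) for c in np.unique(labels)}
    ref = cents[ref_class]
    return {c: float(np.linalg.norm(cents[c] - ref)) for c in cents}
```

Sorting the returned dictionary by value then gives the closeness ranking reported in Table 4.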

SHapley Additive exPlanations
SHapley Additive exPlanations (SHAP) is a methodology that can be used to interpret a model. Named after Lloyd S. Shapley, the SHAP concept was originally developed to estimate the importance of an individual player on a collaborative team. This concept was geared toward distributing the total gain or payoff among players, depending on the relative importance of their contributions to the outcome of a game [39].
Application of SHAP in ML includes the "total gain" or "payoff" as the model prediction ( f (x)) for a single instance of the dataset, and the "players" as the features of the instance that collaborate to receive a gain (predicted value). The SHAP values are the averaged marginal contribution of a feature value across all possible coalitions.
Although beyond the scope of this paper, it is worth noting that in addition to giving us the ability to extract important features, SHAP falls into the category of interpretable machine learning (IML). IML aims to build models that can be understood by humans.

SHAP feature importance is model-agnostic, in contrast to the model-specific MDI feature importance carried out above. Model-agnostic feature importance can, in principle, be used with any model: the algorithm is treated as a black box that can be swapped out for any other model. Model-agnostic evaluation therefore provides flexibility in model selection, since different models can employ the same evaluation framework and thus be compared using the same metrics. Maintaining a consistent framework allows for a much more robust comparison between models.
Another important distinction between the SHAP process and the MDI is the availability of local and global feature importance. Local feature importance focuses on the contribution of features to a specific prediction, whereas global feature importance takes all the predictions into account. Figure 10 below shows the sorted feature importance as per the SHAP calculations. Each bar shows the contribution each class makes to the model's output. Focusing on class 6 (colored light blue for emphasis), we see the contribution that the gunshot sound makes to each of the 20 most important features shown in the figure. Note in Figure 10 that for feature 80 (the mean of the first delta coefficient, DELTA_MEAN_FLTR0), the gunshot sound is the most dominant contributor. Table 5 compares the 20 most important features as calculated by the RF MDI and SHAP processes. The red-colored cells show the features that differ between the two processes. Although the MDI and SHAP rank the features in different orders of importance, the first 11 features are common to both, albeit in a different order. We will see later, in Figure 11 in Section 5.3, that by using only these 20 features we can achieve an accuracy of approximately 92%.
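The game-theoretic attribution described above can be made concrete with a tiny exact Shapley computation on a toy cooperative game (not the paper's model; in practice, the shap package's TreeExplainer approximates these values efficiently for tree ensembles):

```python
from itertools import permutations

def shapley_values(players, v):
    """Exact Shapley values: each player's marginal contribution to the
    characteristic function v, averaged over every ordering in which the
    grand coalition can form."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = v(frozenset(coalition))
            coalition.add(p)
            phi[p] += v(frozenset(coalition)) - before  # marginal gain
    return {p: total / len(perms) for p, total in phi.items()}
```

For an additive game where v(S) is the sum of fixed per-player payoffs, each player's Shapley value equals its own payoff, and the values always sum to the total payoff of the grand coalition (the "efficiency" property SHAP inherits).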

Gunshot Proximity Analysis
Before delving into the classification phase of this project, let us investigate the proximity of the gunshot-like sounds to the gunshot sounds. To assist in our analysis, we employ the uniform manifold approximation and projection (UMAP) method.

UMAP
UMAP is a novel manifold learning technique for dimension reduction. It is constructed from a theoretical framework based on Riemannian geometry and algebraic topology [40]. Although it has a rigorous mathematical foundation, it is easy to use via the scikit-learn compatible API.
In its simplest sense, the UMAP algorithm consists of two steps: (1) construction of a graph in high dimensions and (2) optimization to find the most similar graph in lower dimensions.
UMAP is among the fastest manifold learning implementations available and is significantly faster than most t-distributed stochastic neighbor embedding (t-SNE) implementations. It is very good at preserving the global structure in the final projection.
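Step (1) of the two-step recipe above can be sketched with scikit-learn alone: build the nearest-neighbour graph in the high-dimensional feature space (real UMAP then fuzzifies this graph and optimizes a low-dimensional layout to match it). This is an illustrative sketch, not the UMAP implementation itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=3):
    """Indices of each point's k nearest neighbours (self excluded)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]  # column 0 is the point itself, so drop it
```

With umap-learn installed, umap.UMAP(n_components=2).fit_transform(X) performs both steps and yields the 2-D embedding shown in Figure 12.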
As noted in Figure 12 below, there is significant connectivity between the gunshot and plastic_pop sounds [16]. In addition, the doorknock_slam shows a significant amount of connectivity to the gunshot sound [23,24,29,35,41].
Note that the distances between the clusters have no absolute meaning, as the scale is arbitrary and induced by the optimization approach. The main interpretation we can draw from Figure 12 below is that the plastic_pop sound is much more similar to the gunshot sound than to the thunderstorm sound, whose cluster is much farther away. From a global perspective, clusters that are closer together are more similar than those that are farther apart.

Classification
For the classification of gunshot and gunshot-like sounds, the random forest (RF) classifier was chosen because of its easy setup process and its effectiveness as compared to other methods. We will show later in this section that one obtains impressive classification results by only using a small subset of features selected with the MDI and SHAP procedures.

Model for Sound Analysis
RFs are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [42]. An RF consists of a large number of individual decision trees. The trees operate as an ensemble, a method whereby a concrete, finite set of alternative models is combined to obtain better predictive performance than could be obtained from any constituent learning algorithm alone. Each tree in the RF outputs a class prediction, and the class with the most votes becomes the model's prediction. The low correlation between models (trees) is the key to the success of the RF classifier.
When the RF model is generated, a GridSearchCV is carried out via the scikit-learn library [43]. That is, the hyperparameters of the RF are tuned via an exhaustive search over specified parameter values for the estimator. The parameters of the estimator are optimized by a cross-validated grid search over a parameter grid. For the train dataset in this research, the approximate time taken for a solution to converge was around 10 min for 2000 samples, using the parameter grid {'n_estimators': [10, 50, 100], 'criterion': ['gini', 'entropy'], 'max_depth': np.linspace(10, 50, 11)}. Here, 'n_estimators' is the number of trees in the forest, 'criterion' is the function that measures the quality of a split, and 'max_depth' is the maximum depth of the tree. After running the GridSearchCV, the best parameters converged to: 'n_estimators' = 100, 'criterion' = 'gini', and 'max_depth' = 50.
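The grid search described above can be set up as follows. The grid values are taken from the text; note that np.linspace(10, 50, 11) yields floats, which are cast to integers here since max_depth expects whole numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_space = {
    "n_estimators": [10, 50, 100],
    "criterion": ["gini", "entropy"],
    # 11 evenly spaced depths from 10 to 50, cast to int for max_depth
    "max_depth": np.linspace(10, 50, 11).astype(int).tolist(),
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_space, cv=5)
# search.fit(X_train, y_train) would then run the exhaustive search,
# after which search.best_params_ holds the winning combination.
```

The exhaustive search fits 3 × 2 × 11 = 66 parameter combinations, each cross-validated, which explains the roughly 10-minute runtime reported for 2000 samples.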

Data Leakage Avoidance
To help avoid the serious and widespread problem of machine learning data leakage (a process whereby the effects of the train data are transferred to the test data), the train and test data were split prior to any scaling and post-processing steps. For example, the StandardScaler() [43] function (a scaling technique that normalizes the features individually, or column-wise, to mean = 0 and standard deviation = 1) was fit to the train data and to the test data separately, prior to the generation of the RF model.
Note that it is not necessary to normalize the dataset, as any algorithm based on recursive partitioning, such as decision trees and regression trees, is invariant to monotonic transformations of the features. In RF, the trees see only ranks in the features based on a collection of partition rules. As a result, there should be no change with scaling. Experiments were conducted using both the raw data and normalized data. As expected, no difference in accuracy was discovered in this analysis. Throughout the evaluations discussed here, the X_train and X_test data were scaled independently.
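The split-before-scale order described above can be sketched as follows. Per the text, the train and test splits are scaled independently here; as the text notes, this is harmless for RF, which is invariant to monotonic feature transformations. The data below are a synthetic stand-in for the 240-feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(100, 5) * 10.0 + 3.0   # stand-in feature matrix
y = rng.randint(0, 2, 100)

# Split FIRST, so no test-set statistics can leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Independent scaling of each split, as described in the text.
X_train_s = StandardScaler().fit_transform(X_train)
X_test_s = StandardScaler().fit_transform(X_test)
```

Both scaled splits end up with per-column mean 0 and standard deviation 1, and the RF model is then trained on X_train_s only.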

Feature Reduction
For this configuration, using the mean and standard deviation of the MFCC, delta, and delta-delta features and running the RF on the full feature set, we obtain the CM shown in Figure 13a below. Taking only the important features according to the MDI feature importance and applying the scikit-learn SelectFromModel, we arrived at 69 features. After the RF classifier was run again with the reduced set of 69 features, the CM and classification report (Figure 13b) and the receiver operating characteristic (ROC) with its associated area under the curve (AUC) (Figure 13c) remained unchanged, with a reduction of about 71% in features. The first 20 features derived from the SHAP feature importance analysis appear to perform marginally better than those derived from the MDI feature importance analysis. We also note from the CM that gunshot detection is marginally better with the SHAP feature importance analysis than with the MDI feature importance analysis.
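The SelectFromModel reduction step can be sketched as below on synthetic data. scikit-learn's default threshold keeps features whose MDI importance exceeds the mean importance, which is how a 240-feature set can shrink to 69 without retuning anything.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.randn(300, 20)
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)   # only features 0 and 1 matter

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# prefit=True reuses the already-fitted forest; the default threshold is
# the mean of clf.feature_importances_.
selector = SelectFromModel(clf, prefit=True)
X_reduced = selector.transform(X)
```

The classifier is then retrained on X_reduced, and selector.get_support() reports which original columns survived the cut.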

The list of features from the SHAP analysis that differ from those of the MDI feature importance analysis is highlighted in red in Table 5 above. In the SHAP analysis, most of the features are confined to within about the 10th filter band, whereas the MDI analysis extends into the 18th filter band. In both cases, this shows that the energies in these low Mel-frequency bands are sufficient to generate the model.

Detection
In this section, we combine the gunshot-like sounds into one class, using 200 samples of each gunshot-like sound. This new dataset contains 1800 gunshot-like sounds and 1800 gunshot sounds. As above, attention was given to ensuring that an equal proportion of classes was available for the 80/20 train/test split. The RF algorithm is again used for its simplicity and effectiveness. Using GridSearchCV with the same parameters as listed in Section 5.1 above, the best parameters converged to: 'n_estimators' = 100, 'criterion' = 'gini', and 'max_depth' = 34.
Using the important features found in Table 5 above, we generate the confusion matrices for the RF MDI and SHAP analyses (see Figure 15 below). Figure 15 also includes the CM obtained using the full dataset (Figure 15a) and using only the 16 features common to both feature importance/feature selection analyses (Figure 15b).
We note that Figure 15a,b generate about the same true positive (TP) rate for the gunshot sounds, whereas the RF MDI and SHAP feature sets generate a slightly better TP rate. This can be accounted for by the fact that the concentrated feature set (the RF MDI and SHAP top 20 features) lends itself to a marginally better model. Table 6 below compares the accuracy and false positive rate (FPR) (with respect to the gunshot sound) for the various datasets. As can be seen from Table 6, the SHAP top 20 feature set has the best FPR performance and also the best accuracy.

Conclusions
So far, we have reported the findings of data procurement, analysis, classification, and detection of gunshot and gunshot-like sounds. After an outline of the data procurement procedure, we provided the results of two different feature importance/reduction techniques using MFCC features, i.e., (1) RF MDI and (2) SHAP on gunshot and gunshot-like audio. We also demonstrated the closeness of various gunshot-like sounds to gunshot sounds in the drastically reduced feature space using the UMAP technique. Finally, we presented the classification and detection results using reduced sets of features, in comparison with those obtained with all features.
We showed that the SHAP feature importance process produced a marginal improvement over the RF MDI feature importance process. The SHAP analysis yielded an accuracy of about 97.78% with an FPR of 0.025 for gunshot detection; the RF MDI produced an accuracy of 97.22% with an FPR of 0.033. Further analysis showed that, of the 240 features in total, the 20 leading features were sufficient to maintain good accuracy and FPR, a reduction of 92% in the number of features used for gunshot detection. However, both the delta and delta-delta derivatives had to be added to the standard MFCC coefficients to make this result possible. It is interesting to note that the two sets of 20 leading features identified by the SHAP and MDI algorithms shared 16 common features.
From a physical perspective, the most dominant feature in the MDI feature importance was feature 2, which is concentrated around 160 to 359 Hz. The SHAP's most dominant feature was feature 0, which is concentrated around 0 to 161 Hz. This indicates that much of the information that leads the model to its decision is based in the very low end of the frequency spectrum.
Although not discussed in this paper, a potential avenue toward better detection is the pitch shifting of the recorded gunshot events to a higher frequency range. As the MFCC follows human hearing sensitivity, the higher pitches can become more discernible and, in turn, lead to a more highly optimized model.
To study how the gunshot-like sounds relate to the gunshot sounds from a global perspective, we employed the UMAP feature reduction technique, which reduced the feature space down to two dimensions. Viewing the relation of the sounds in this 2-dimensional space gives one a perspective on which sounds are audibly close to each other. Further analysis utilizing various methods to isolate these close sounds leads us to conclude that the plastic bag popping sounds most closely resembled the gunshot sounds.
The experimental results of classification and detection further illustrated that employing proper feature importance/reduction techniques can increase the efficiency and improve the overall performance of a gunshot detection model.

Acknowledgments: The authors would like to thank Ali Ibrahim for his insightful discussions and feedback during the development and optimization of the data collection and analysis process; Charles Cooper for his advice on the proper analysis of the gunshot sounds and the overall data capture process; and Jeet Kiran Pawani for his Python scripting help and corrections during the programming phase of this project.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: