A Study of Features and Deep Neural Network Architectures and Hyper-Parameters for Domestic Audio Classification

Featured Application: The algorithms explored in this research can be used for any multi-level classification applications.
Abstract: Recent methodologies for audio classification frequently involve cepstral and spectral features, applied to single-channel recordings of acoustic scenes and events. Further, the concept of transfer learning has been widely used over the years, and has proven to provide an efficient alternative to training neural networks from scratch. The lower time and resource requirements when using pre-trained models allow for more versatility in developing system classification approaches. However, information on classification performance when using different features for multi-channel recordings is often limited. Furthermore, pre-trained networks are initially trained on bigger databases and are often unnecessarily large. This poses a challenge when developing systems for devices with limited computational resources, such as mobile or embedded devices. This paper presents a detailed study of the most prominent and widely used cepstral and spectral features for multi-channel audio applications. Accordingly, we propose the use of spectro-temporal features. Additionally, the paper details the development of a compact version of the AlexNet model for computationally limited platforms through studies of performance against various architectural and parameter modifications of the original network. The aim is to minimize the network size while maintaining the series network architecture and preserving the classification accuracy. Considering that other state-of-the-art compact networks present complex directed acyclic graphs, a series architecture offers an advantage in customizability. Experimentation was carried out in MATLAB, using a database that we have generated for this task, which is composed of four-channel synthetic recordings of both sound events and scenes.
The top performing methodology resulted in a weighted F1-score of 87.92% for scalogram features classiﬁed via the modiﬁed AlexNet-33 network, which has a size of 14.33 MB. The AlexNet network returned 86.24% at a size of 222.71 MB.


Introduction
The continuous research advances in the field of single and multi-channel audio classification suggest its importance and relevance in a broad range of real-world applications. In this work, we focus on domestic multi-channel audio classification, which can be applied to monitoring systems and assistive technology [1,2].
The majority of existing works in this area are based on the classification of sound events in single-channel audio [3,4], rather than on multi-channel audio signals containing acoustic scenes, which is required to understand the continuous nature of daily domestic activities. Acoustic scenes refer to recordings of a certain activity over time, while sound events refer to more specific sound classes that occur over short periods of time within a recording [5]. The detection of multi-channel audio was also found to be 10% more accurate than that of single-channel audio when considering overlapping sounds, which commonly occur in real life [6]. Such overlapping sounds may be better detected through joint processing of the different channels, which reduces the effects of background noise and other interference. A similar concept to this work is the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 5 challenge, which focuses on domestic multi-channel acoustic scene classification [7]. In this challenge, top performing methods often involve the use of Log-Mel energies and Mel-frequency Cepstral Coefficients (MFCC), while VGG-16 and VGG-ish pre-trained models are common choices for classification. Log-Mel energies continue to be a popular choice of features in the top performing methods of the DCASE 2019 and 2020 Task 4 challenges on sound event detection and classification [7]. Nonetheless, the utilization of spectro-temporal scalograms for multi-channel classification has not yet been thoroughly explored.
Log-Mel energies are a subset of spectral features, which consider the frequency components of a signal [8]. On the other hand, MFCCs are based on the cepstral representation of a signal, which results from the Inverse Fourier Transform (IFT) of the spectral components of the signal [8]. Although these algorithms are commonly used and are popular for noise-free environments, they face several challenges when applied in noisy acoustic environments [8,9].
Hence, this work aims to determine the optimum feature for domestic multi-channel acoustic scene classification, taking into account real-life scenarios such as the presence of different types of background noise. Although the DCASE 2018 Task 5 challenge had real recordings in real environments, the specific characteristics of the noise and reverberation were unknown. Hence, here we conduct a controlled study on these effects using a new database with known characteristics. Experimentation is done by conducting a thorough analysis and comparison of the classification performance and processing time of cepstral and spectral features for several pre-trained neural network and compact neural network models, using weight-sensitive metrics. Weight-sensitive metrics are used in order to take into account the bias that may be caused by imbalanced datasets. Further, the effects of architectural and hyper-parameter modifications on the optimum pre-trained network are also studied, in order to reduce the size of the network while maintaining its performance. In turn, we propose the use of spectro-temporal features in the form of scalograms, which are computed through a fast Fourier transform (FFT)-based continuous wavelet transform (CWT) [10]. These features possess excellent time and frequency localization, allowing a thorough representation of continuous signals with minimal loss of information [10]. This is coupled with a modified AlexNet model, which consists of 33 layers instead of 25, and utilizes a leaky rectified linear unit (ReLU) activation function instead of a traditional ReLU function. Finally, we also synthesize an original database, which aims to recreate scenarios that could occur in real life, in order to test and verify the overall robustness of the system.
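To illustrate why a weight-sensitive metric matters for imbalanced data, the following is a minimal sketch of a support-weighted F1-score next to an unweighted (macro) average. The function names and the toy labels are illustrative assumptions and are not taken from the experiments in this work.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1-score of a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    n_pred = sum(1 for p in y_pred if p == label)
    n_true = sum(1 for t in y_true if t == label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1-scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        (count / total) * per_class_f1(y_true, y_pred, label)
        for label, count in support.items()
    )


def macro_f1(y_true, y_pred):
    """Plain average of per-class F1-scores, ignoring class sizes."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Toy imbalanced example (hypothetical labels): 8 'speech' clips, 2 'alarm' clips.
y_true = ["speech"] * 8 + ["alarm"] * 2
y_pred = ["speech"] * 8 + ["speech", "alarm"]
scores = (weighted_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

In this toy case the weighted score is higher than the macro score, because the well-classified majority class contributes in proportion to its size.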
In summary, the contributions described in this article include:
• A detailed performance comparison between different cepstral, spectral, and spectro-temporal features for audio classification.
• A direct performance comparison of pre-trained models and a detailed study of the effects of network modification on the optimum model.
• The development of a modified, compact AlexNet model that maintains the model's accuracy while reducing the network size by over 90%, allowing compatibility with mobile devices and applications.
• The development of a multi-channel synthetic domestic acoustic scene and event database to test the overall system robustness.
In this work, we focus on the classification and labelling of sound events and scenes, which are relevant for dementia patient monitoring systems. However, applications of the techniques explored in this work are not limited to acoustic scene classification and can be extended to other domains. For example, the compact network and the features examined can be modified to fit any image classification problem, such as emotion detection systems [11] and image-based diagnosis for healthcare applications [12]. Further, the features explored in this work, as well as their combination, can also be used for regression problems, such as the estimation of characteristics of seismic waves [13], which is based on STFT features combined with a CNN.
It is important to note that the compact neural network development is not a step towards an actual deployment in any specific resource-limited system. Rather, we explore the extent to which the system can be scaled down while maintaining high performance.

Audio Signal Features
Audio classification is typically achieved by extracting discriminative features that represent the underlying common characteristics of audio signals belonging to the same class. Similar to the DCASE challenge, it is assumed that the audio signals are recorded by microphone arrays placed at different locations (nodes) within a room. The recorded audio signals can then be represented as:
$$y_m(t) = \sum_{i=1}^{K} h_{m,i}(t) * S_i(t) + v_m(t) \qquad (1)$$
where * denotes convolution, y_m(t) is the signal recorded at time t by microphone m in the array at each node, S_i(t) is the i-th sound source signal (where K is the total number of sounds), h_{m,i}(t) is the room impulse response (RIR) from source i to microphone m, and v_m(t) is additive background noise at microphone m. The audio recordings used in this work are four-channel and are time-aligned.
This section discusses several top performing features considered for multi-channel acoustic scenes and evaluates them in terms of their advantages and drawbacks according to the requirements of the system. The following subsections evaluate the possible features according to their relevant categories within the feature engineering process [8], as shown in Figure 1.
As observed, features are sub-divided into three main categories, namely: temporal features, spectral features, and cepstral features. Temporal features are computed in the time domain and have the least computational complexity [8]. Spectral features, on the other hand, are extracted starting from the frequency representation of the signal [8]. Cepstral features then represent the rate of change within the different spectrum bands [8].
Appl. Sci. 2021, 11, 4880 4 of 23 Finally, the fusion between spectral and temporal features results in spectro-temporal features, which combine both time and frequency attributes of a signal [8].
Since temporal features are directly extracted from the audio signal, they often fail to provide reliable descriptors for multi-channel audio classification, as they do not contain information about the frequency. Hence, in this work, we examine cepstral and spectral features only. Along with this, we also examine spectro-temporal features, which are a combination of temporal and spectral features.
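The recording model described at the start of this section, in which each microphone signal is the sum of the sources convolved with their room impulse responses plus additive noise, can be sketched as follows. This is a minimal illustration in plain Python; the function names and the toy signals are assumptions made for the example and do not reproduce the simulation used in this work.

```python
def convolve(x, h):
    """Full discrete convolution of a signal x with an impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y


def simulate_microphone(sources, rirs, noise):
    """Signal at one microphone m: sum_i (h_{m,i} * S_i)(t) + v_m(t).

    sources: list of K source signals S_i
    rirs:    list of K impulse responses h_{m,i} for this microphone
    noise:   additive noise v_m; its length sets the output length
    """
    y = list(noise)
    for s_i, h_mi in zip(sources, rirs):
        wet = convolve(s_i, h_mi)
        for t in range(min(len(y), len(wet))):
            y[t] += wet[t]
    return y


# Toy example with K = 2 sources and a silent noise floor (hypothetical values).
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
rirs = [[1.0, 0.5], [0.25]]
y_m = simulate_microphone(sources, rirs, noise=[0.0] * 4)
```

Repeating the call with a different set of impulse responses and noise for each microphone yields the multi-channel recording.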

Cepstral Features
Cepstral features represent the cepstrum, a depiction of acoustic signals that is commonly utilized in homomorphic signal processing, and is often characterized by the conversion of signals combined through convolution into the sums of their specific cepstra [14]. Cepstral coefficients were found to be among the most commonly utilized features for the classification of acoustic scenes and events.
The mel-frequency cepstral coefficients (MFCC) are the most widely used, and are based on a filter that models the behaviour of the human auditory system [14], making them advantageous in terms of sound identification. The MFCCs are acquired by first taking the log of the mel spectrum. Following this, the discrete cosine transform (DCT) of the log spectrum is obtained, with the MFCCs being the amplitudes of the resulting DCT [15].
Calculation of the MFCC coefficients starts by dividing the time-aligned four-channel averaged audio signal y_avg(t) into multiple segments. Windowing is then applied to each of these segments prior to being subject to the discrete Fourier transform (DFT), resulting in the short-term power spectrum P(f) [16].
The power spectrum P(f) is then warped along the frequency axis f onto the mel-frequency axis M, resulting in a warped power spectrum P(M). The warped power spectrum is then discretely convolved with a bank of K triangular bandpass filters, resulting in θ(M_k) [16]. The MFCC coefficients are calculated according to Equation (2) [16],
where X_k = ln(θ(M_k)), and D << K due to the compression ability of the MFCC [16]. Nonetheless, MFCCs were also found to be prone to substantial loss of information due to their sensitivity to noise [17]. Similarly, their performance can be affected by the shape and spacing of the filters and the warping of the power spectrum [16]. Nevertheless, the MFCC approach has several advantages due to its simple computation and its flexibility with regards to integration with several other features [16].
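As a rough illustration of the final step above, the sketch below applies a DCT to the log filter-bank outputs X_k = ln(θ(M_k)) to obtain D coefficients. It assumes the commonly used cosine form sum_k X_k cos(d (k - 1/2) π / K); whether this matches Equation (2) term by term is an assumption, and the function name is hypothetical.

```python
import math


def mfcc_from_filterbank(theta, num_coeffs):
    """Cepstral coefficients c_1..c_D from K filter-bank energies.

    theta:      K positive filter-bank outputs, theta(M_k)
    num_coeffs: D, the number of coefficients kept (D << K)
    """
    k_total = len(theta)
    x = [math.log(energy) for energy in theta]  # X_k = ln(theta(M_k))
    coeffs = []
    for d in range(1, num_coeffs + 1):
        # Zero-based k, so (k + 0.5) equals (k - 1/2) in one-based indexing.
        c_d = sum(
            x[k] * math.cos(d * (k + 0.5) * math.pi / k_total)
            for k in range(k_total)
        )
        coeffs.append(c_d)
    return coeffs


# A flat filter-bank output carries no spectral shape, so c_1..c_D vanish.
flat = mfcc_from_filterbank([2.0] * 20, num_coeffs=12)
# A tilted output (energy decaying with k) gives a non-zero first coefficient.
tilted = mfcc_from_filterbank([math.exp(-0.1 * k) for k in range(20)], 12)
```

Keeping only the first D coefficients is what provides the compression noted in the text, since the higher-order terms describe fine detail of the log spectrum.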

Spectral Features
Spectral features are computed from the frequency components of the audio signal. The two-dimensional representation of the frequency components of an audio signal is called a spectrogram, which often results from the application of the short-time Fourier transform (STFT) to constantly compare the input signal with a sinusoidal analysis function [18]. Although this representation is known to work well with neural networks [19], the signal processing techniques used to display the representation can cause inconsistency within the structure of the spectrogram [18]. Further, the majority of works concerning the spectrogram make use of only the magnitude component of the audio signal, omitting the phase information [20].
Although spectral features have several advantages, the information yielded may not be sufficient for the characterization of multi-channel audio scene acoustics. Often, they are combined with other features in order to produce a considerable representation of the signal magnitude [8]. However, since different audio scenes have different requirements in terms of temporal and frequency resolutions [21], the combination of several spectral features does not necessarily improve the accuracy of the classifier. Chu et al. [22] showed that combining several spectral features, including centroid, bandwidth, flatness, and asymmetry, for sound classification does not improve the accuracy. Instead, an increase in computational complexity is observed, since each of the combined features has to be computed individually.
Nonetheless, log-Mel energy features are deemed beneficial for multi-channel acoustic scene classification and were utilized in notable related works mentioned in this research [23,24]. Log-Mel energy features have also been a well-received choice of features for DCASE challenge entries, as per the review of Mesaros et al. [25], due to the two-dimensional matrix output that they yield, which is a suitable input for a CNN classifier. Log-Mel features are extracted by applying an STFT to Hamming-windowed audio segments [9]. A Mel-scale filter bank is then applied after taking the square of the absolute value per bin, and the outputs are then processed to fit the requirements of the system [9].
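The log-Mel extraction steps described above (Hamming-windowed STFT, squared magnitude per bin, Mel-scale filter bank, logarithm) can be sketched with NumPy as follows. The hop size, the triangular filter shape, the mel-scale formula, and all function names are assumptions made for illustration; the default sampling rate, FFT size, number of filters, and band limits are the values quoted elsewhere in this paper.

```python
import numpy as np


def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_filters, f_low, f_high):
    """Edge and centre frequencies (Hz), equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    return mel_to_hz(mel_points)


def mel_filterbank(num_filters, n_fft, fs, f_low, f_high):
    """Triangular filters, one row per filter, over the rfft bins."""
    edges = mel_band_edges(num_filters, f_low, f_high)
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((num_filters, n_fft // 2 + 1))
    for k in range(num_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_energies(y, fs=16000, n_fft=1024, hop=512,
                     num_filters=20, f_low=300.0, f_high=3700.0):
    """Log-Mel energy matrix of shape (num_filters, num_frames)."""
    y = np.asarray(y, dtype=float)
    window = np.hamming(n_fft)
    num_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop: i * hop + n_fft] * window for i in range(num_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # |STFT|^2 per bin
    mel_energy = power @ mel_filterbank(num_filters, n_fft, fs, f_low, f_high).T
    return np.log(mel_energy + 1e-10).T  # small offset avoids log(0)


# One second of a 1 kHz tone as a stand-in for an averaged recording.
fs = 16000
tone = np.sin(2 * np.pi * 1000.0 * np.arange(fs) / fs)
features = log_mel_energies(tone, fs=fs)
```

The resulting two-dimensional matrix is the kind of image-like input that is passed to the CNN classifier after resizing.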

Spectro-Temporal Features
Spectro-temporal features stem from the fusion of temporal and spectral features. Although not widely explored in the field of multi-channel audio classification, several works have devised algorithms that integrate the use of both temporal and spectral features for acoustic event detection [26,27]. Cotton et al. proposed the use of a non-negative matrix factorization algorithm to detect a set of patches containing the relevant spectral and temporal information that best describes the data [27]. The results achieved in their experiment suggest that their features provide more robustness in noisy environments than MFCCs as sole features. Schroder et al. [26], on the other hand, devised a spectro-temporal feature extraction algorithm based on two-dimensional Gabor functions for robust classification.
Nevertheless, these algorithms were tested solely on acoustic events as opposed to acoustic scenes. Similarly, the applicability of these algorithms to multi-channel audio scenes remains open to question; aside from not being widely utilized, comparisons against other top performing feature combinations for the same application are not available.
However, one of the most notable spectro-temporal features is the scalogram, which is computed through the continuous wavelet transform (CWT) [28]. Such methods consider both the time and frequency components of a signal. The time components represent the motion of the signal, and the frequency components symbolize the pixel positions in an image [28]. Taking a computer vision approach, the velocity vectors are first calculated through multi-scale wavelets, which are localized in time [29]. The CWT of a continuous signal is defined by Equation (3) [29],
where ψ * refers to the complex conjugate of the mother wavelet, t refers to the time domain, u signifies the signal segment, and s refers to the scale, which is a function of the frequency [29]. Separation of the audio channels is then performed via the low-dimensional models that reverberated from the firmness of the harmonic template models [28]. Such a process is beneficial for multi-channel audio classification due to its ability to separate mixed audio sources, which allows a thorough analysis for individual audio channels.
The scalogram is a visual representation of the absolute value of the CWT coefficients, as given by Equation (4) [30]. Nonetheless, despite its advantages, the computation of CWT coefficients is often extensive and requires long computation times [31]. Wavelets are computed by comparing and inverting the DFT of the signal against the DFT of the wavelet, which can be computationally expensive. Thus, the integration of other techniques to reduce this complexity must also be examined.
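For reference, the CWT and the scalogram are conventionally written as below. The notation is an assumption chosen to match the symbols defined in the text (ψ* for the complex conjugate of the mother wavelet, t for time, u for the position along the signal, and s for the scale); the names W_y and SC are illustrative, and the exact normalization used in Equations (3) and (4) may differ.

```latex
% Continuous wavelet transform of a signal y(t) at position u and scale s
W_y(u, s) = \frac{1}{\sqrt{s}} \int_{-\infty}^{\infty} y(t)\,
            \psi^{*}\!\left(\frac{t - u}{s}\right) \mathrm{d}t

% Scalogram: absolute value of the CWT coefficients
\mathrm{SC}(u, s) = \left| W_y(u, s) \right|
```

Small scales s correspond to compressed wavelets and therefore to high frequencies, which is why the scale axis of a scalogram can be read as a frequency axis.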

Pre-Trained Networks
Convolutional neural networks (CNN) have been commonly used for multi-channel sound scene classification in recent years. CNNs are a sub-type of neural networks that utilize multiple convolution stages for classification [32]. Similar to the traditional neural network, CNNs are composed of three types of layers, namely: the convolutional layer, the pooling layer, and the fully connected layer [33]. Nonetheless, instead of a traditional fully connected layer, only a subset of the previous layer neurons is connected to the next ones. This suggests improvements in run time, computational complexity, and memory requirements.
There are various pre-trained convolutional neural network models for classification. These are used through transfer learning, which allows the reuse of a previously trained network's weights to train a new network model [34], typically using new training data representing new classes. The advantages of transfer learning include improved efficiency in the model building process, the training time, and the overall learning workflow [35]. Further, several research works also report improved results when using transfer learning on pre-trained networks as opposed to training a network from scratch [36].
Examples of pre-trained CNN models include AlexNet [37], GoogleNet [38], ResNet [39], Inception-ResNet [40], Xception [41], SqueezeNet [42], VGGNet [43], and LeNet [44]. These networks are trained with large datasets, and the weights are saved in order to be re-used for transfer learning. Table 1 provides a summary of the comparison between these pre-trained networks in terms of their basic characteristics, including the year of introduction, network size in MB, image input size, number of layers, number of parameters, and the top-5 error rate. Nonetheless, as per our previous works, the AlexNet model returns the highest accuracy for domestic audio classification applications [45,46].

Experimental Methodology
Based on the above discussion on the advantages and disadvantages of different feature and classification techniques, this section starts by explaining the dataset utilized and details the methodology and process we used to carry out this study.

Synthetic Domestic Acoustic Database
Synthesizing our own database allows the production of data that address issues commonly faced in a certain environment and recreates scenarios that could occur in real life. This includes noisy environments, as well as various source-to-receiver distances. Furthermore, this also provides the exact locations of the sound sources.
For this work, the generation of the synthetic database was done based on a 92.81 m 2 one-bedroom apartment modelled after the Hebrew Senior Life Facility [47], illustrated in Figure 2. We assumed a 3 m height for the ceiling. Multi-channel recordings were aimed for; hence, microphone arrays were placed on each of the four corners of the six rooms at 0.2 m below the ceiling. This produced four recordings, one from each of the receiver nodes.
Accordingly, the microphone arrays were composed of four linearly arranged omnidirectional microphones with 5 cm inter-microphone spacing (n), as per the geometry provided in Figure 3, where d refers to the distance from the sound source to the microphones.
Dry samples are taken from Freesound (FSD50K) [48], Kaggle [49], DESED Synthetic Soundscapes [50], and Open SLR [51], depending on the audio class. Due to the variations in sampling frequency, some of the audio signals were downsampled to 16 kHz for uniformity. The room dimensions, source and receiver locations, wall reflectance, and other relevant information were then used to calculate the impulse response for each room using the image method, incorporating source directivity [52]. This was then convolved with the sounds, specifying their location, in order to create the synthetic data. The data generated included clean signals, as well as signals with different types of noise, namely children playing, air conditioner, and street music, added at three different SNR levels: 15 dB, 20 dB, and 25 dB. The duration of each audio signal was uniformly kept at 5 s, as this was found to provide satisfactory time resolution for the sound scenes and events detected in this work. Table 2 describes this dataset.
This data was curated such that the testing data consisted of one noise level for each node. Any instances of the data contained in the test set were then removed from the training data. The testing set content is summarized for a specific sound being recorded at four nodes. This design ensures that, even when the same sound is recorded by all four nodes, the chances of biasing are reduced through the addition of different types of noise at different SNR levels. Further, this was also designed to reflect real-life recordings, where the sound from different microphones may differ based on their distance to the source and other sounds present in their surroundings.
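Adding a noise recording to a clean signal at a prescribed SNR, as done above for the 15 dB, 20 dB, and 25 dB conditions, amounts to scaling the noise so that the ratio of average powers matches the target. The sketch below shows one way to do this; the function names and the toy signals are assumptions for illustration and are not taken from the database generation scripts.

```python
import math


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(sample * sample for sample in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Return signal + g * noise, with g chosen to reach the target SNR in dB."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * v for s, v in zip(signal, noise)]


def measured_snr_db(signal, mixed):
    """SNR of a mixture, given the clean signal it was built from."""
    residual = [m - s for m, s in zip(mixed, signal)]
    return 10.0 * math.log10(mean_power(signal) / mean_power(residual))


# Toy stand-ins for a dry, reverberated sound and a background noise clip.
clean = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(1600)]
background = [((-1) ** n) * (0.5 + (n % 7) / 7.0) for n in range(1600)]
mixtures = {snr: mix_at_snr(clean, background, snr) for snr in (15, 20, 25)}
```

A lower SNR value corresponds to a larger noise gain, so the 15 dB condition is the noisiest of the three.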
As observed, the audio classes used in the generation of this database focus on sound events and scenes that often occur, or require an urgent response, in a dementia patient's environment. Further, the database was also generated using the room impulse responses of the Hebrew Senior Life Facility [47], in order to reflect a realistic patient environment. This is because assistance monitoring systems are real-world applications of deep-learning audio classifiers, such as the work presented in this paper. Nonetheless, this can also be extended to other application domains, as previously discussed.

Feature Extraction Using Fast CWT Scalograms
The CWT is similar to the Fourier transform in that it utilizes inner products to compute the similarity between the signal and an analysing function [53]. However, in the case of the CWT, the analysing function is a wavelet, and the coefficients are the results of comparing the signal against shifted, scaled, and dilated versions of the wavelet, which are called constituent wavelets [53]. Compared with the STFT, wavelets provide better time localization [30] and are more beneficial for non-stationary signals [53].
However, in order to reduce the computational requirements of deriving scalograms, this work proposes the use of the fast Fourier transform (FFT) algorithm for computing the CWT coefficients [30]. With the mother wavelet Ψ defined as in Equation (5) [30], where t refers to continuous time, Equation (3) can be rewritten as Equation (6) [30], where y_avg refers to the average of the four channels of the audio signal. This shows that the CWT coefficients can be expressed as the convolution of the wavelets and the signal, which can therefore be written in the Fourier domain, resulting in Equation (7) [30]. In Equation (7), ψ*_{s,t}(ω) specifies the Fourier transform of the mother wavelet at scale t, as per Equation (8), and y_avg(ω) denotes the Fourier transform of the analysed signal y_avg(t), as per Equation (9). The discrete versions of the convolutions can then be represented as per Equation (10), where n is the discrete time index. From the sum in Equation (10), we can observe that the CWT coefficients can be derived from the repetitive computation of the convolution of the signal with the wavelets at every value of the scale per location [30]. This work follows this process in order to extract the DFT of the CWT coefficients at a faster rate compared to the traditional method.
In summary, CWT coefficients are calculated through obtaining both the DFT of the signal, as per Equation (9), and the Morlet analysing function, as per Equation (8), via the FFT. The products of these are then derived and integrated, as per Equation (6), in order to extract the wavelet coefficients. Accordingly, the discrete version of the integration can be represented as a summation, which is observed in Equation (10).
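The procedure summarized above (FFT of the signal, frequency-domain Morlet wavelet at each scale, product, inverse FFT) can be sketched with NumPy as follows. This is a minimal illustration and not the implementation used in this work: the analytic Morlet form, its normalization, the centre frequency parameter omega0 = 6, and the function names are all assumptions.

```python
import numpy as np


def fft_cwt_morlet(y, fs, centre_freqs, omega0=6.0):
    """CWT coefficients via the FFT, one row per requested centre frequency.

    y:            real-valued signal (for example, the channel average)
    fs:           sampling rate in Hz
    centre_freqs: frequencies (Hz) at which the wavelet response should peak
    omega0:       Morlet centre frequency parameter (assumed value)
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)  # rad/s per FFT bin
    y_hat = np.fft.fft(y)                                 # DFT of the signal
    positive = omega > 0
    coeffs = np.empty((len(centre_freqs), n), dtype=complex)
    for row, f_c in enumerate(centre_freqs):
        s = omega0 / (2.0 * np.pi * f_c)  # scale whose response peaks at f_c
        psi_hat = np.zeros(n)
        # Frequency-domain Morlet wavelet at scale s (analytic: zero for omega <= 0).
        psi_hat[positive] = (
            np.pi ** -0.25
            * np.sqrt(2.0 * np.pi * s * fs)
            * np.exp(-0.5 * (s * omega[positive] - omega0) ** 2)
        )
        # Product in the Fourier domain, then inverse FFT back to time.
        coeffs[row] = np.fft.ifft(y_hat * np.conj(psi_hat))
    return coeffs


def scalogram(coeffs):
    """Absolute value of the CWT coefficients."""
    return np.abs(coeffs)


# A 1 kHz tone should respond most strongly at the 1 kHz scale.
fs = 16000
tone = np.sin(2.0 * np.pi * 1000.0 * np.arange(4000) / fs)
freqs = [250.0, 500.0, 1000.0, 2000.0, 4000.0]
sc = scalogram(fft_cwt_morlet(tone, fs, freqs))
```

Each scale costs one element-wise product and one inverse FFT, which is the source of the speed-up over evaluating the convolution directly in the time domain.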

Feature Representation
Feature computation is carried out in MATLAB, exploiting functionalities provided in the Audio System and Data Communications toolboxes. A total of 20 filter bank channels with 12 cepstral coefficients are used for the cepstral feature extraction, as per the standard after DCT application [54]. An FFT size of 1024 is utilized, while the lower and upper filter bank frequency limits are set to 300 Hz and 3700 Hz. This frequency range includes the main components of speech signals (specifically, narrowband speech), while filtering out the humming sounds from the alternating current power, as well as high frequency noise [55]. Further, this range is relevant to the sound classes of speech and scream, and was found to also include the main components of the other classes. While larger frequency ranges could also be considered, this would require much larger FFT sizes to maintain the same frequency resolution, which in turn would increase the computational requirements. The extraction of the feature vectors is carried out by computing the average of the four time-aligned channels in the time domain, y_avg(t). The coefficients are then extracted accordingly, from which single feature matrices are generated. The feature images are resized into 227 × 227 matrices using a bi-cubic interpolation algorithm with antialiasing [56], in order to match the input dimensionality of the AlexNet neural network model. Figure 4 shows samples of feature images for each of the three features compared, using the 'Speech' and 'Kitchen sound' classes.
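Two of the quantities above lend themselves to a short worked example: the time-domain average of the four channels, and the frequency resolution implied by a 1024-point FFT. The sketch below assumes the 16 kHz sampling rate used for the database; the function names are hypothetical.

```python
def average_channels(channels):
    """Sample-wise mean of time-aligned channels, giving y_avg(t)."""
    count = len(channels)
    return [sum(samples) / count for samples in zip(*channels)]


def band_bins(fs, n_fft, f_low, f_high):
    """FFT bin spacing (Hz) and the first and last bins inside [f_low, f_high]."""
    resolution = fs / n_fft
    inside = [
        k for k in range(n_fft // 2 + 1)
        if f_low <= k * resolution <= f_high
    ]
    return resolution, inside[0], inside[-1]


# Four short, time-aligned channels (toy values).
y_avg = average_channels([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Bin spacing and band limits for fs = 16 kHz, FFT size 1024, 300-3700 Hz.
resolution_hz, first_bin, last_bin = band_bins(16000, 1024, 300.0, 3700.0)
```

With these settings each bin spans 15.625 Hz, so doubling the analysed bandwidth at the same resolution would require doubling both the sampling rate and the FFT size, which is the trade-off mentioned in the text.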

Modified AlexNet Network Model
Domestic multi-channel acoustic scenes consist of several signals that are captured with microphone arrays of different sizes and geometrical configurations. As discussed previously, CNNs have been widely popular for their efficiency when used with data that exhibit spatial behaviour [57]. Thus, the experimentation part of this work compares different pre-trained network models for transfer learning. Modifications are then made to the hyper-parameters of the best performing network, with the response observed in three ways:
1. Effects of changing the network activation function.
2. Effects of fine-tuning the weight and bias factors, and parameter variation.
3. Effects of modifications in the network architecture.
Activation functions in neural networks are a very important aspect of deep learning. These functions heavily influence the performance and computational complexity of the deep learning model [58]. Further, such functions also affect the network in terms of its convergence speed and ability to perform the task. Aside from exploring different activation functions, we also look at fine-tuning the weights and bias factors of the convolutional layers, as well as investigating how the presence of convolutional layers affects performance.

For the modified AlexNet model, we examine the traditional Rectified Linear Unit (ReLU) activation function, along with three of its variations. The ReLU offers advantages in addressing the vanishing gradient problem [59], which is common with the traditional sigmoid and tanh activation functions. The gradients of neural networks are computed through backpropagation, which calculates the derivatives of the network through every layer. Hence, for activation functions such as the sigmoid, whose derivative never exceeds 0.25, the multiplication of several small derivatives yields a very small gradient value. This, in turn, negatively affects the update of weights and biases during training [59]. Given that the ReLU function has a fixed gradient of either 1 or 0, aside from mitigating the vanishing gradient problem and overfitting, it also results in lower computational complexity, and therefore significantly faster training. Another benefit of ReLUs is the sparse representation, which arises because both the output and the gradient are 0 for negative inputs [60]. Sparse representations have been reported to be more beneficial than dense representations [61].
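The effect of multiplying many small derivatives can be illustrated numerically. The sketch below is a simplification: it multiplies only the activation derivatives along a single backward path and ignores the weight terms that also appear in the chain rule. The ten-layer depth and the pre-activation value of 0.5 are hypothetical values chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of the ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    """Product of the local activation derivatives along one backward path."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Hypothetical pre-activation values for a 10-layer path (illustration only).
path = [0.5] * 10
sigmoid_path = chained_gradient(sigmoid_grad, path)
relu_path = chained_gradient(relu_grad, path)
print(f"sigmoid: {sigmoid_path:.3e}, ReLU: {relu_path:.1f}")
```

For active (positive) units, the ReLU factor stays at 1 regardless of depth, whereas the sigmoid factor shrinks geometrically with every additional layer.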
Nonetheless, despite the numerous advantages of the ReLU activation function, there are still a number of disadvantages. Because the ReLU function only passes positive components, its gradient is 0 for negative inputs, so the weights feeding a neuron are not adjusted during gradient descent while its activation lies in that region. Neurons that enter this state stop responding to variations in the input or the error and effectively die, which can leave a substantial part of the network passive. This phenomenon is called the dying ReLU problem [62]. Another disadvantage of the ReLU activation function is that its values may range from zero to infinity. This implies that the activation may continuously increase to a very large value, which is not an ideal condition for the network [63]. The following activation functions attempt to mitigate the disadvantages of the traditional ReLU function through modifications and will be explored in this work:
a. Leaky ReLU: The leaky ReLU is a variation of the traditional ReLU function that attempts to fix the dying ReLU problem by adding an alpha parameter, which creates a small non-zero slope when x is less than zero [64].
b. Clipped ReLU: The clipped ReLU activation function attempts to prevent the activation from continuously increasing to a large value. This is achieved by clipping the output at a pre-defined ceiling value [63].
c. eLU: The exponential linear unit (eLU) is an activation function similar to the ReLU. However, instead of dropping sharply to zero for negative inputs, the eLU decreases smoothly until the output saturates at the negative of the specified alpha value [65].
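The four activation functions discussed above can be written as scalar functions, as in the following minimal sketch. The default values are assumptions for illustration: an alpha of 0.01 is a common choice for the leaky ReLU, while the ceiling of 6.0 and the eLU alpha of 1.0 are hypothetical and are not taken from the experiments in this work.

```python
import math

def relu(x):
    """Traditional ReLU: passes positive inputs, zeroes the rest."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs."""
    return x if x >= 0 else alpha * x

def clipped_relu(x, ceiling=6.0):
    """Clipped ReLU: output is limited to the range [0, ceiling]."""
    return min(max(0.0, x), ceiling)

def elu(x, alpha=1.0):
    """eLU: smooth for negative inputs, saturating at -alpha."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

for value in (-3.0, -0.5, 0.0, 2.0, 10.0):
    print(value, relu(value), leaky_relu(value), clipped_relu(value), elu(value))
```

Only the leaky ReLU and the eLU return non-zero outputs for negative inputs, which is what allows gradient to keep flowing through otherwise inactive neurons.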
Aside from activation functions, variations in the convolutional and fully connected layers will also be examined. The study will be done in terms of both the number of parameters and the number of existing layers within the network.
For parameter modification, we explore the reduction of output variables in the fully connected layers. This method substantially reduces the overall network size [66]. However, it is important to note that recent works reduce the number of parameters only in the first two fully connected layers. Hence, here we introduce the concept of uniform scaling, which is achieved by dividing the output parameters of fully connected layers by a common integer, based on the subsequent values.
Modification of the network architecture is also considered by examining the model's performance when the number of layers within the network is varied. These layers may include convolutional, fully connected, and activation function layers. Throughout the layer variation process, however, the model architecture is kept as a series network. A series network contains layers that are arranged one after another, with a single input layer and a single output layer. Directed Acyclic Graph (DAG) networks, on the other hand, have a more complex architecture, in which layers may take inputs from several layers and may feed their outputs to multiple layers [67]. The higher number of hidden neurons and weights that is typical of DAG networks could increase the risk of overfitting. Hence, maintaining a series architecture allows for a more customizable and robust network. Further, as per the state-of-the-art, all other compact networks that currently exist present a DAG architecture. Thus, the development of a compact network with a more customizable format, using fewer layers, offers advantages in designing robust custom networks.
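The distinction between series and DAG networks can be made concrete with a small check on the layer connections. The sketch below is illustrative only: the function name, the layer names, and the representation of a network as a list of (source, destination) pairs are all assumptions, not part of any toolbox API.

```python
from collections import Counter

def is_series_network(layers, connections):
    """True if the layers form one chain: one input, one output, no branches."""
    fan_in = Counter(dst for _, dst in connections)
    fan_out = Counter(src for src, _ in connections)
    sources = [name for name in layers if fan_in[name] == 0]
    sinks = [name for name in layers if fan_out[name] == 0]
    no_branching = all(fan_in[n] <= 1 and fan_out[n] <= 1 for n in layers)
    if not (no_branching and len(sources) == 1 and len(sinks) == 1):
        return False
    # Walk the chain from the input to confirm that every layer is reached.
    successor = dict(connections)
    current, visited = sources[0], 1
    while current in successor:
        current = successor[current]
        visited += 1
    return visited == len(layers)

# Hypothetical layer names, for illustration only.
layers = ["input", "conv1", "relu1", "fc", "softmax"]
chain = list(zip(layers, layers[1:]))
with_skip = chain + [("conv1", "fc")]  # extra branch, as found in DAG networks

print(is_series_network(layers, chain))      # series network
print(is_series_network(layers, with_skip))  # DAG network
```

In a series network, removing or inserting a layer only requires reconnecting its two neighbours, which is the customizability advantage referred to above.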

Performance Evaluation Metrics
To evaluate the performance of the proposed systems, the following aspects are investigated:

1. Per-class and overall comparison of different cepstral, temporal, and spectro-temporal features classified using various pre-trained neural network and machine learning models.
2. Effects of balancing the dataset.
Aside from the standard accuracy, the performances of the different techniques were also compared and measured in terms of their F1-scores. The F1-score is a measure that takes into consideration both the recall and the precision, which are derived from the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) [68]; these can be extracted from confusion matrices.
The databases used for this research consist of unequal numbers of audio files per category. To account for the data imbalance, two different techniques are used:
1. Balancing the Dataset: Used particularly for the initial development and experiments conducted for this work, this technique equalizes the dataset across all levels. This is done in order to avoid biasing in favour of specific categories with more samples. It is achieved by reducing the amount of data per level to match the minimum amount of data amongst the categories. The data were selected randomly throughout the experiments.
2. Using Weight-sensitive Performance Metrics: Given that the F1-score serves as the main performance metric for the experiments conducted, it is crucial to ensure that this metric is robust and unbiased, especially for multi-class classification. When taking the average F1-score for an unbalanced dataset, the amount of data per level may skew the mean F1-score in favour of the classes with the most data. Therefore, we consider three different ways of calculating the mean F1-score, namely the Weighted, Micro, and Macro F1-scores, in order to account for the dataset imbalance [69].
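The three averaging schemes can be sketched from a confusion matrix as follows. This is a minimal illustration using the usual definitions; the function name, the convention that rows are true classes and columns are predictions, and the two-class toy matrix are assumptions made for the example.

```python
def f1_averages(confusion):
    """Macro, weighted, and micro F1 from a square confusion matrix.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class, supports = [], []
    tp_sum = fp_sum = fn_sum = 0
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return {
        "per_class": per_class,
        # Macro: every class counts equally, regardless of its size.
        "macro": sum(per_class) / n,
        # Weighted: each class counts in proportion to its number of samples.
        "weighted": sum(f * s for f, s in zip(per_class, supports)) / total,
        # Micro: pool TP, FP, and FN over all classes before computing F1.
        "micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
    }

# Imbalanced toy example: 100 samples of class 0, 10 samples of class 1.
scores = f1_averages([[90, 10], [5, 5]])
print(scores)
```

In single-label classification, every misclassified sample is a false negative for one class and a false positive for another, so the pooled FP and FN totals are equal and the micro-averaged precision, recall, and F1-score coincide.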

Comparison of Cepstral, Spectral, and Spectro-Temporal Features
Per-level and average comparisons of the MFCC and Log-Mel spectrogram features against the proposed CWTFT scalogram method are shown in Table 3, as an average of three training trials. As observed, F1-score averaging is done using three different methods (Micro, Macro, and Weighted) in order to take into account the bias that may be caused by the data imbalance. Further, the table also compares the system performance between imbalanced and balanced data. To achieve a balanced dataset, the size of the dataset is reduced to match the smallest category in both the training and testing sets. As per Table 2, for each category, this turns out to be 1565 files for training, based on the "Slam" category, and 260 files for testing, based on the "Alarm" category. This adds up to a total of 17,215 training files and 2860 testing files.
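The balancing step described above amounts to random undersampling of every category down to the size of the smallest one. A minimal sketch is given below; the function name, the fixed seed, and the toy file names are hypothetical, and the actual selection procedure used in the experiments may differ in detail.

```python
import random

def balance_by_undersampling(files_by_class, seed=0):
    """Randomly keep as many files per class as the smallest class has."""
    rng = random.Random(seed)
    target = min(len(files) for files in files_by_class.values())
    return {
        label: rng.sample(files, target)
        for label, files in files_by_class.items()
    }

# Toy dataset with hypothetical file names.
dataset = {
    "Slam": [f"slam_{i}.wav" for i in range(3)],
    "Speech": [f"speech_{i}.wav" for i in range(10)],
}
balanced = balance_by_undersampling(dataset)
print({label: len(files) for label, files in balanced.items()})

# With 11 categories, the per-category sizes quoted above give these totals.
print(11 * 1565, 11 * 260)
```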
The following results are achieved using the traditional AlexNet network, given that it gave the highest results in our previous works [45,46]. Training for the imbalanced data is carried out for 10 epochs with 1016 iterations per epoch. However, it is important to note that the number of epochs for the balanced data is 75, as it has fewer iterations per epoch due to the lower amount of data per category. Hence, it requires more epochs in order to reach stability. As observed, the CWTFT scalograms have consistently achieved the highest F1-score across all categories, exceeding the performance of the MFCC features by over 10%. As mentioned earlier, this can be explained by the spectro-temporal properties of wavelets, which allow excellent time and frequency localization. The Log-Mel spectrograms yield the lowest F1-score out of the three features. In terms of the data imbalance, it is observed that once the data are even across all categories, the performance of the smaller categories improves. Nonetheless, the trade-off is a reduced F1-score for the categories that initially had more data. It is also evident that performances for classes referring to acoustic scenes are higher than those for sound events. This is because sound events occur sporadically and at different instances throughout the 5-s intervals, whereas sound scenes are continuously present throughout the duration. Overall, the imbalanced dataset returns higher performance. Figure 5 accordingly shows the relevant confusion matrices for imbalanced and balanced datasets.
Aside from the accuracy, the inference execution time and resource requirements are other important considerations when selecting features. Table 4 details the execution time information for the three features compared, in terms of extracting the relevant features and translating them into a 227 × 227 image.
Execution times were recorded on a machine with an Intel Core i7-9850H CPU @ 2.60 GHz, operated in single-core mode. The reported execution times are in seconds and are an average of 100 different readings. As observed, scalograms also returned the shortest overall duration of the three features compared. The numerous processes involved in computing the MFCC and Log-Mel features explain their longer extraction time.
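A measurement of this kind can be sketched as below, averaging repeated wall-clock readings of a single call. This is an illustrative Python analogue rather than the MATLAB timing code used for Table 4; the function name and the placeholder workload are hypothetical.

```python
import time

def mean_execution_time(func, repeats=100):
    """Average wall-clock duration of func() in seconds over several readings."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

def placeholder_feature_extraction():
    # Stand-in workload; a real measurement would call the feature extractor.
    return sum(i * i for i in range(10_000))

average_seconds = mean_execution_time(placeholder_feature_extraction, repeats=100)
print(f"mean over 100 readings: {average_seconds:.6f} s")
```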
CWTFT coefficients are derived by taking the product between the DFT of the signal and that of the analyzing function, computed through the FFT, and inverting this product in order to extract the wavelet coefficients. On the other hand, both MFCC and Log-Mel are based on the Mel-scale filter bank. These rely on short-term analysis, in which vectors are computed per frame. Further, windowing is performed to remove discontinuities, prior to utilizing the DFT to generate the Mel filter bank. Further processes, such as the use of triangular filters and warping, are also necessary prior to the application of the IDFT and transformation.
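The Fourier-domain computation of wavelet coefficients described above can be sketched in a few lines. The example below is a simplified illustration under several assumptions: it uses an analytic Morlet wavelet with a centre frequency of 6 rad as the analyzing function (the wavelet used in this work is not stated in this section), a square-root-of-scale normalization, and a naive DFT in place of the FFT so that it needs no external libraries.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) DFT; stands in for an FFT to avoid dependencies."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def morlet_hat(omega, omega0=6.0):
    """Fourier transform of an analytic Morlet wavelet (zero for omega <= 0)."""
    if omega <= 0.0:
        return 0.0
    return math.pi ** -0.25 * math.exp(-0.5 * (omega - omega0) ** 2)

def cwt_fourier(signal, scales, omega0=6.0):
    """Wavelet coefficients, one row per scale (scales given in samples)."""
    n = len(signal)
    spectrum = dft(signal)
    # Angular frequency of each DFT bin in rad/sample, negative above Nyquist.
    omegas = [2.0 * math.pi * (k if k <= n // 2 else k - n) / n for k in range(n)]
    rows = []
    for s in scales:
        # Multiply the signal spectrum by the scaled wavelet spectrum ...
        product = [x * math.sqrt(s) * morlet_hat(s * w, omega0)
                   for x, w in zip(spectrum, omegas)]
        # ... and invert the product to obtain the coefficients at this scale.
        rows.append(dft(product, inverse=True))
    return rows

def scalogram(signal, scales, omega0=6.0):
    """Magnitude of the wavelet coefficients (scale by time)."""
    return [[abs(c) for c in row] for row in cwt_fourier(signal, scales, omega0)]

# Toy example: a pure tone occupying DFT bin 8 of a 64-sample frame.
n = 64
tone = [math.cos(2 * math.pi * 8 * t / n) for t in range(n)]
matched_scale = 6.0 / (2 * math.pi * 8 / n)  # scale whose centre matches the tone
rows = scalogram(tone, [matched_scale / 2, matched_scale, 2 * matched_scale])
```

Each scale requires only one element-wise product and one inverse transform, with no per-frame windowing or filter bank stage, which is consistent with the shorter extraction time discussed above.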
It is important to note that in terms of memory usage, there are negligible differences between the three features compared. This is because the features are resized and translated into a 227 × 227 image through bi-cubic interpolation, in order to fit the classifier. Each image translation occupies between 4 and 12 KB of memory, depending on the sound class.
In our previous works, we examined the response of the system performance to concatenating the cepstra from individual channels [45,46]. For the case of cepstral coefficients, this yielded a slightly better performance than using a single cepstrum after averaging the four time-aligned channels. Extracting cepstral coefficients for each channel allows a thorough consideration of all distinctive properties of the signal, which minimizes the loss of information. However, per-channel feature extraction did not improve results with scalogram features, yielding 90.72% as opposed to 92.33% when averaging the channels, as audio sources are already separated within the wavelet computation process.

Architecture of Modified AlexNet-33 (MAlexNet-33)
This section discusses the results achieved through the detailed study of the effects of modifying the traditional AlexNet architecture. The AlexNet model was found to result in the highest F1-scores in the experiments of our previous work [45,46]. In this work, we aim to improve this network by decreasing the overall network size while maintaining its performance. To begin with, the original layer structure of the AlexNet network is presented in Figure 6. As observed, it contains 25 layers, with 2 regular convolution layers, 3 grouped convolution layers, and 3 fully connected layers.

Exploring Variations of the Rectified Linear Unit and the Number of Layers
For this experiment, the response of the system to reducing the number of layers is investigated. Further, different variations of the ReLU activation function are also examined. Table 5 displays the different combinations tested for this experiment with regard to decreasing the number of layers and changing the activation function, presented as an average across the 11 classes. Hence, throughout the results, the micro-averaged results of the four measures are the same, and some of the other measures are closely similar. This is because the total number of false negatives equals the total number of false positives. More distinct differences between the classes can be seen in a per-level comparison, such as that of Table 3. In Table 5, AlexNet-20 was obtained by removing one grouped convolutional layer, two ReLU layers, one fully connected layer, and one 50% dropout layer from the original network. It is observed that removing convolutional and fully connected layers from the network also reduces its performance.
However, it is also apparent that using other activation functions improves the performance. For instance, using a Leaky ReLU with a 0.01 parameter in place of the ReLU activation function increased the weighted F1-score to 85.58%, which is within 1% of the original network's performance. Such improvement is attributed to the Leaky ReLU's added parameter, which addresses the dying ReLU problem. Due to having fewer layers in the system, a reduction of about 30% from the original size was also achieved: MAlexNet-20 with a Leaky ReLU activation function has a network size of about 150 MB, compared against AlexNet's 220 MB.
Subsequent to this, the concept of successive activation functions was also examined. For this, two activation function layers were placed successively throughout the network. As per Table 6, using two successive activation functions does not necessarily improve the overall system performance. It is also apparent, however, that using more than one activation function does not affect the overall size of the network.
The AlexNet contains three fully connected layers with parameter values of 9216, 4096, and 4096 for the inputs, and 4096, 4096, and 1000 for the outputs. In this experiment, we reduce the output parameters across the first two fully connected layers within the network through scaling. The results achieved from this experiment are reported in Table 7. Here, FC6 refers to the output of the first fully connected layer, and FC7 refers to the output of the second fully connected layer. It is important to note that the output of the last fully connected layer corresponds to the number of classes the system aims to identify and is not determined by parameter scaling.
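To illustrate why scaling the fully connected outputs shrinks the network so strongly, the following sketch counts the weights and biases of the three fully connected layers before and after dividing the first two output sizes by a common integer. The function names are hypothetical, and the divisor of 8 is chosen purely for illustration; it is not the value used in the experiments reported in Table 7.

```python
def fc_parameter_count(input_size, output_sizes):
    """Weights plus biases of a chain of fully connected layers."""
    total, fan_in = 0, input_size
    for fan_out in output_sizes:
        total += fan_in * fan_out + fan_out
        fan_in = fan_out
    return total

def scale_hidden_outputs(output_sizes, divisor, num_classes):
    """Divide every output size except the last by a common integer.

    The last layer is set to the number of classes instead of being scaled.
    """
    scaled = [size // divisor for size in output_sizes[:-1]]
    return scaled + [num_classes]

# Original AlexNet fully connected sizes, as quoted in the text.
original_count = fc_parameter_count(9216, [4096, 4096, 1000])

DIVISOR = 8  # hypothetical value, for illustration only
scaled_sizes = scale_hidden_outputs([4096, 4096, 1000], DIVISOR, num_classes=11)
scaled_count = fc_parameter_count(9216, scaled_sizes)

print(original_count, scaled_sizes, scaled_count)
```

Because the weight count of each layer is the product of its input and output sizes, reducing the output of one layer also reduces the input of the next, so the savings compound across the fully connected stack.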
As observed from Table 7, a notable improvement is achieved by scaling the output parameters of the fully connected layers through a division of 24 (from the input parameter and fully connected sizes of the original network), which provided a slightly higher F1-score compared to the original AlexNet. Further, this results in an almost 90% reduction in the size of the network compared to the original (23.82 MB as opposed to 221.4 MB). Uniform scaling also returns better performance compared to keeping an equal number of parameters across all fully connected layers. It also achieved a higher weighted F1-score than the combination used by previous recent studies, for which the exact parameters used are represented by the last entry in Table 7 [66]. It is important to note that the input size for FC6 is automatically calculated for the modified networks. After the convolution stages, this is found to be 4608 parameters. These results imply that the output parameters of all fully connected layers preceding the last fully connected layer can be scaled down extensively, depending on the number of classes that the model is designed to predict, provided that the fully connected output parameters remain higher than the number of possible predictions.
The number of epochs required is determined from the training accuracy and loss curves. Generally, a lower number of output parameters slows down the training, requiring more epochs in order to reach a well-learned network. Figure 7 displays the difference between a traditional AlexNet and a version with lower numbers of output parameters in the fully connected layers. The comparison was done for 10 epochs.

The Combination of Layer and Parameter Modification
Given that uniformly scaling the fully connected layer parameters has proven beneficial, in this section, we combine this technique with the advantages of modifying the number of layers. This is done in two ways, the results for which are presented in Table 8.
As per Table 8, the top performing algorithm is the MAlexNet-33, which combines fully connected parameter scaling with the addition of two new grouped convolutional layers, with learnable bias sizes of 1 × 1 × 64 × 2 and 1 × 1 × 32 × 2, and the relevant activation layers. This provided a weighted F1-score of 87.96%, exceeding the performance of the AlexNet, with a network size of 14.33 MB. Relative to the 221.4 MB of the original model, this corresponds to a decrease of roughly 93% in the size of the resource requirements. When compared to [66], this also improved both the performance and the network size, exceeding the performance by around 2.16% and decreasing the network size by over 40%. Aside from the improvement in resource requirements, decreasing the network size also returned a notable improvement in the inference execution time, given that the two factors are linearly related to one another.

Comparison with Other Compact Networks
In this section, a comparison of the proposed architecture to currently existing compact networks is presented. For this work, several compact pre-trained models including SqueezeNet [42], MobileNet-v2 [70], NasNet Mobile [71], and ShuffleNet [72], are considered. A summary of the comparison is seen in Table 9, in terms of the total number of layers, depth, type, network size in MB, the activation function used, the weighted F1-score, the training time for 30 epochs, the network loading time, and the execution inference time average. The network loading time is an average of five trials, while the execution time is measured in 100 trials. Throughout the comparison, it is important to note that, while MAlexNet-33 is a series network, all other compact networks are DAG networks, which have a complex architecture and a significantly larger number of layers.
As observed, our proposed network consistently provided the highest weighted F1-score in comparison to the other compact networks. Despite its 14.33 MB network size, it showed negligible differences in network loading time (about 0.08 s against SqueezeNet). Further, it also has the lowest training and execution times compared to the other networks.
It is also apparent that other compact networks have a higher loading time despite their smaller network size, which is caused by the DAG network configuration and the multiple layers within the architecture. Given that the MAlexNet-33 has the fewest layers, it offers a highly customizable network architecture. Adding more layers of neurons increases the complexity of a neural network. Although hidden layers are crucial for extracting the relevant features, having too many hidden layers may cause overfitting, in which case the network would be limited in its generalization abilities. In order to avoid this effect, this work focuses on designing a smaller network with fewer neurons and weights than a traditional compact neural network.

Discussion
Interpreting the presented results, we conclude that the use of CWTFT scalograms returns the best results for audio scene and event classification applications. This is supported both by our previous experiments, which were performed using the SINS database [45,46], and by the experiments conducted in this work. This can be explained by the fact that scalograms possess excellent time and frequency localization. A further advantage is that audio sources are separated during the wavelet computation process. Using an FFT-based wavelet transform also yields favourable computation times, which were shorter than those of the cepstral and spectral features.