An Interpretable Deep Learning Model for Automatic Sound Classification

Abstract: Deep learning models have improved cutting-edge technologies in many research areas, but their black-box structure makes it difficult to understand their inner workings and the rationale behind their predictions. This may lead to unintended effects, such as susceptibility to adversarial attacks or the reinforcement of biases. Despite the increasing interest in developing deep learning models that provide explanations of their decisions, there is still a lack of research in the audio domain. To reduce this gap, we propose a novel interpretable deep learning model for automatic sound classification, which explains its predictions based on the similarity of the input to a set of learned prototypes in a latent space. We leverage domain knowledge by designing a frequency-dependent similarity measure and by considering different time-frequency resolutions in the feature space. The proposed model achieves results that are comparable to those of state-of-the-art methods in three different sound classification tasks involving speech, music, and environmental audio. In addition, we present two automatic methods to prune the proposed model that exploit its interpretability. Our system is open source and is accompanied by a web application for the manual editing of the model, which allows for a human-in-the-loop debugging approach.


Introduction
The popularization of deep learning has led to significant advances in a wide range of scientific fields and practical problems [1,2], most notably in computer vision [3] and natural language processing [4], but also in the audio domain, improving the state of the art of several tasks, such as speech recognition [5] and music recommendation [6]. Despite the recent progress, it is often hard to provide insights into the decision-making process of deep neural networks (DNNs). Their deep recursive structure and non-linear transformations allow DNNs to learn useful representations of the data, but, at the same time, make it difficult to trace which aspects of the input drive their decisions. This black-box nature of DNNs may lead to unintended effects, such as reinforcing inequality and bias [7][8][9], and makes it difficult to extract the knowledge captured by the model about the problem in a way that humans can understand [10]. Particularly, in applications that impact human lives-such as in healthcare-the lack of transparency and accountability can have serious consequences [11]. It can be argued that the difficulty of understanding these models is not problematic in many successful applications of DNNs [12,13]. However, high prediction accuracy does not guarantee that the model will always behave as expected. For instance, it has been shown that DNNs can be fooled with relative ease in classification tasks by adversarial attacks [14]-see [15,16] for examples in speech.
The integration of such algorithms into our daily lives requires wide social acceptance, but the unforeseen malfunctioning and side-effects mentioned above undermine trustworthiness. Such concerns emerge in the new artificial intelligence (AI) regulations, like the legal notion of a right to explanation in the European Union's General Data Protection Regulation.

For already trained models, post-hoc explanations may be the only option. However, a model that is inherently interpretable provides explanations that are faithful to what the model actually computes [11]. Several different approaches may be taken for designing interpretable or explanation-producing neural networks [18]. For instance, DNNs can be trained to explicitly learn disentangled latent representations [32] or to create generative explanations. Besides, attention mechanisms, which have proved very effective in natural language processing and computer vision, provide a distribution over input units indicating their relative importance, which can be considered a form of explanation. However, more research is needed to assess whether attention mechanisms can provide reliable explanations [33].
Recent approaches for designing interpretable neural network architectures based on explanations through prototypes and concepts, which have been mainly applied to the classification of images, are of particular relevance to our work [21][22][23][24][25]. Prototype classification is a classical form of case-based reasoning, and it is an appropriate approach to interpretability because it is grounded in how domain experts explain the reasoning processes behind complicated classification tasks in the visual domain. The architecture proposed in [21] appends a special prototype layer to the network. The prototypes are learned during training, and the predictions of the network are based on the similarity of the input to them, computed as a distance in the latent space. Thus, the explanations given by the network are the prototypes and the corresponding distances. Subsequently, in [23], this is extended to use hierarchically organized prototypes to classify objects in a predefined taxonomy, and in [24], to learn parts of the training images as prototypes of each class. Finally, the approaches in [22,25] can be regarded as generalizations beyond similarities to prototypes into more general interpretable concepts.

Contributions and Description of This Work
In this work, we present a novel explanation-producing neural network for sound classification, with the aim of contributing to the development of interpretable models in the audio domain.
Automatic sound classification technologies are becoming widely applied in monitoring systems, smart safety devices, and voice assistants, in different real-world environments such as industrial, domestic, urban, and natural ones [34][35][36]. Consequently, there are various application scenarios in which a human needs to make actionable decisions based on an automatic sound classification system-like surveillance or machine fault monitoring. In such scenarios, humans would benefit from access to explanations of the outputs of the sound classification system. Traditional sound classification systems are based on acoustic features, such as mel-frequency cepstral coefficients (MFCCs), and shallow classifiers, such as Gaussian mixture models (GMMs) [37,38] and decision trees [39]. These are comparatively easier to interpret than classifiers based on deep neural networks. However, deep-learning techniques have brought new state-of-the-art results to several sound classification tasks [40][41][42], so new research addressing the issue of making these classifiers more interpretable is needed.
In addition, self-explaining networks can be more easily inspected, which brings the advantages noted above.
The proposed network architecture is based on the interpretable model for image classification introduced in [21]. The input in our case is a time-frequency representation of the audio signal. The network learns a latent space-by means of an autoencoder-and a small set of prototypes in this latent space. The ability to learn a latent space is the most powerful trait of DNNs and it proved to be a key factor for achieving high performance in the different sound classification tasks addressed. The predictions of the network are based on the similarity of the input to the prototypes in the latent space. We leverage audio domain knowledge to design a frequency-dependent similarity measure and consider different time-frequency resolutions in the feature space, both of which contribute to better discrimination of the sound classes. By applying the decoder function of the autoencoder, a prototype can be mapped from the latent space to the time-frequency input representation and then to an audio signal. The model is constrained to produce good quality audio from the time-frequency representation. This allows for the aural inspection of the learned prototypes and their comparison to the audio input. It is this approach that renders the explainability of the proposed model: the explanation is the set of prototypes-mapped to the audio domain-and the similarity of the input to them.
The conducted experiments show competitive results when compared to those of state-of-the-art methods in automatic sound classification for three different application scenarios involving speech, music, and environmental audio. Moreover, the ability to inspect the network allows for evaluating its performance beyond the typical accuracy measure. For this reason, some experiments are devised to explore the advantages of the interpretability of the model. To do so, we propose two automatic methods for network refinement that allow for reducing some redundancies, and we suggest how the model could be debugged using a human-in-the-loop strategy.
The implemented model and the software needed for reproducing the experiments are available to the community under an open source license from: https://github.com/pzinemanas/APNet.
Our main contributions can be summarized as follows.

1. We propose a novel interpretable deep neural network for automatic sound classification-based on an existing image classification model [21]-that provides explanations of its decisions in the form of a similarity measure between the input and a set of learned prototypes in a latent space.

2. We exploit audio domain knowledge to improve the discrimination of the sound classes by designing a frequency-dependent similarity measure and by considering different time-frequency resolutions in the feature space.

3. We rigorously evaluate the proposed model in the context of three different application scenarios involving speech, music, and environmental audio, showing that it achieves comparable results to those of state-of-the-art opaque algorithms.

4. We show that interpretable architectures, such as the one proposed, allow for the inspection, debugging, and refinement of the model. To do that, we present two methods for reducing the number of parameters at no loss in performance, and suggest a human-in-the-loop strategy for model debugging.
The rest of this document is organized as follows. Section 2 reviews previous work in relation to our proposed approach, which is presented in Section 3. Section 4 details the datasets and baseline methods that are used in the experiments reported in Section 5. Section 6 finalizes the paper with the main conclusions and ideas for future work.

Relation with Previous Work
In this section, we review previous research in explainable deep neural models for audio in relation to our work. Although the research in explainable machine learning is quickly expanding, the existing work that focuses on the audio domain is quite limited. Most of this research follows a post-hoc approach for explaining black-box models, and only a few works deal with intrinsically interpretable network architectures.
Regarding visualization methods for explainability, saliency maps were applied in [43] to a convolutional-recurrent network trained for polyphonic sound event detection. The authors show that convolutional filters recognize specific patterns in the input and that the complexity of these patterns increases in the upper layers. A gradient-based approach was proposed in [44] for visualizing heat maps in raw-waveform convolutional neural networks. It is also possible to propagate the model's prediction backward using rules about relevance [45] or deep Taylor decomposition [46]. These two methods have been applied to audio classification problems in [47,48], respectively. In contrast to the visualization methods, our explanations come in the form of examples, which are easier to interpret by end-users of the model who may have no machine learning knowledge.
With regards to post-hoc methods that create proxy models, a variation of the LIME algorithm for audio content analysis-called SLIME-was proposed in [49,50]. It can generate explanations in the form of temporal, frequency, and time-frequency segmentations, and it was applied to singing voice detection. In contrast, our model is designed to be interpretable, and we do not generate explanations for black-box models. The time-frequency representations used as input for these models cannot be interpreted as simple images. We argue that example-based explanations are a better fit for these audio representations than local pixel areas, which do not always define sound objects.
The model internals-weights, neurons, and layers-may also be explained. In the audio domain, such an approach using transfer learning was applied to study the capability of the layers of a pre-trained network to extract meaningful information for music classification and regression [51]. The role of each layer in end-to-end speech recognition systems has been studied in [52]. The main idea is to synthesize speech signals from the hidden representations of each layer. The results show that specific characteristics of the speaker are gradually discarded in each layer, along with the ambient noise. Similar to these approaches, our network architecture is designed to generate prototypes by reconstructing representations learned in a latent space.
In terms of interpretable network architectures, several parts of the network may be designed to target particular tasks, such as feature computation or signal decomposition. For example, it has been shown that the first layers of end-to-end convolutional neural networks that learn representations from raw audio data extract features that are similar to the spectrogram or energies in mel-frequency bands [53][54][55]. Additionally, some works have addressed the design of the first layers of these networks to tailor the feature extraction stage using parametric filters [56][57][58] or trainable hand-crafted kernels [59,60]. Attention mechanisms have been used to bring interpretability to neural networks in speech and music emotion recognition [61,62] and in music auto-tagging [63]. Recently, a visualization tool was proposed in [64] for understanding the attention mechanisms in self-supervised audio transformers.
Regarding intrinsic interpretable network architectures, learning a disentangled latent space has been applied to learn interpretable latent factors that are related to pitch and timbre [65]-in the context of musical instrument recognition-and related to chord and texture [66]-in the context of generative models of polyphonic music. In addition, mid-level features [67] and source separation [68] have been used to improve the model interpretability for the problem of music emotion recognition. In [69], invertible networks [70] have been applied for interpretable automatic polyphonic transcription. Similar to [71], we use domain knowledge to design several parts of the network to learn interpretable representations for speech and audio signals.
Neural network pruning has been shown to reduce training time and network complexity [72]. To that end, we refine the proposed model by eliminating redundant prototypes and channels. Thus, we show that our system is able to achieve similar accuracy with a reduced number of parameters.

Proposed Model
The proposed model-called Audio Prototype Network (APNet)-has two main components: an autoencoder and a classifier. The input to the model is a time-frequency representation of the audio signal. The purpose of the autoencoder is to represent the input in a latent space of useful features that are learned during training. The encoded input is then used by the classifier to make a prediction.
The diagram shown in Figure 1 represents the network architecture of APNet. The classifier consists of a prototype layer, a weighted sum layer, and a fully-connected layer. The prediction of the classifier is based on the similarity-computed in the latent space-between the encoded input and a set of prototypes, which are learned during training to be representatives of each class. The weighted sum layer controls the contribution of each frequency bin-in the latent space-to the similarity measure. Finally, the fully-connected layer takes the weighted similarity measure as input and produces the output prediction. The decoder function of the autoencoder is used to map the prototypes from the latent space to the time-frequency input representation-and then to the audio domain using signal processing. In the following, the main network components are thoroughly described.

Input Representation
The log-scale mel-spectrogram is used as the time-frequency representation of the audio input. This representation is widely used in sound classification, as well as other audio-related problems.
We define the $i$th input as $X_i \in \mathbb{R}^{T \times F}$, where $T$ and $F$ are the number of time hops and frequency bins, respectively. Hence, let $\{(X_i, Y_i)\}_{i=1}^{N}$ be the training set, where $Y_i \in \mathbb{R}^{K}$ are the one-hot encoded labels, and $N$ and $K$ are the number of instances and classes, respectively.
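As an illustration of this input representation, the following self-contained numpy sketch computes a log-scaled mel-spectrogram. It is a simplified version of what a library implementation provides (in practice, a tool such as librosa would be used); the function and parameter names are ours, and the defaults are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr, n_fft=4096, hop=1024, n_mels=128, fmax=None):
    """Return a (T, F) log-scaled mel-spectrogram of waveform x."""
    fmax = fmax or sr / 2
    # frame the signal, window it, and take the power spectrum
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (T, n_fft//2 + 1)
    # build a triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l:
            fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return np.log(power @ fb.T + 1e-10)                  # (T, n_mels)
```

With a window length of 4096 samples and a hop of 1024 at 22,050 Hz, a 4-second slice yields on the order of 84 time hops (the exact count depends on padding conventions), matching the input shapes discussed later.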

Autoencoder
The encoder function $f(\cdot)$ is used to extract meaningful features from the input. Therefore, the encoder transforms the input into its representation in the latent space, $Z_i = f(X_i)$, where $Z_i$ is a three-dimensional tensor of shape $(\bar{T}, \bar{F}, C)$; $\bar{T}$ and $\bar{F}$ are the time and frequency dimensions after the encoding operations, and $C$ is the number of channels used in the encoder's last layer. On the other hand, the decoder function, $g(\cdot)$, is devised to reconstruct the input from the latent space, $\tilde{X}_i = g(Z_i)$.

The autoencoder that is proposed in [21] is devised for image processing. We designed the autoencoder of our model to deal with a time-frequency representation as input. In particular, the encoder is suitable for audio feature extraction, and the decoder provides good audio quality in the reconstruction. Figure 2 shows a diagram of the proposed autoencoder. It has three convolutional layers in the encoder and three transpose convolutional layers in the decoder. Besides, we apply max-pooling layers after the first two convolutional layers in order to capture features at different time-frequency resolutions. Note that max-pooling is a non-invertible operation and, thus, there is no upsampling operation that is suitable for the decoder stage. To overcome this problem, we use the solution proposed by Badrinarayanan et al. [73]: the indexes used in the max-pooling operations are saved in the form of masks and then used in the decoder. The mask is set to 1 at the position that maximizes the max-pooling and to 0 otherwise. In the decoder, the mask is applied to the result of a linear upsampling, and the next transpose convolutional layer learns how to upsample the masked input. A leaky ReLU is the activation after each convolutional layer [74], except for the last one, which is a hyperbolic tangent in order to limit the reconstruction to the range $[-1, 1]$.
Note that we use padding for all convolutional layers, so that the output has the same shape as the input. Besides, max-pooling operations use a $2 \times 2$ window and, therefore, the shape of the encoder's last layer (i.e., the dimension of the latent space) is $(T/4, F/4, C)$.

For the decoding process to have enough audio quality, it is necessary to optimize the autoencoder by minimizing its reconstruction error. To accomplish this, we use an L2 mean squared loss function over its inputs and outputs:

$$ l_r = \frac{1}{N} \sum_{i=1}^{N} \left\| X_i - g(f(X_i)) \right\|_2^2. $$
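The index-mask mechanism can be sketched in numpy as follows. This is a toy version with our own naming: the pooling records argmax positions as a binary mask, and the decoder path applies the mask to an upsampled map (here a nearest-neighbor repetition stands in for the linear upsampling used in the model):

```python
import numpy as np

def max_pool_2x2_with_mask(x):
    """2x2 max-pooling over a (T, F) map; also returns the argmax mask.
    Note: ties within a window would mark more than one position."""
    T, F = x.shape
    pooled = x.reshape(T // 2, 2, F // 2, 2).max(axis=(1, 3))
    # mask is 1 wherever the input equals its pooled value within each window
    up = np.kron(pooled, np.ones((2, 2)))
    mask = (x == up).astype(float)
    return pooled, mask

def masked_upsample(pooled, mask):
    """Upsample (here by repetition) and gate with the saved mask."""
    return np.kron(pooled, np.ones((2, 2))) * mask
```

In the actual network, a transpose convolutional layer then learns how to refine this masked, upsampled map.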

Prototype Layer
The prototype layer stores a set of $M$ prototypes that we want to be representatives of each class: $\{P_j\}_{j=1}^{M}$. These prototypes are learned in the latent space and, thus, the shape of each $P_j$ is also $(\bar{T}, \bar{F}, C)$.
In order to learn the prototypes, we use the same loss function as in the original model [21]. First, we calculate $D \in \mathbb{R}^{N \times M}$, whose values are the squared L2 distances from each data instance to each prototype in the latent space:

$$ D_{ij} = \left\| Z_i - P_j \right\|_2^2, $$

and then we calculate the following cost function:

$$ l_p = \frac{1}{M} \sum_{j=1}^{M} \min_{i} D_{ij} + \frac{1}{N} \sum_{i=1}^{N} \min_{j} D_{ij}. $$

The minimization of this loss function requires each learned prototype to be similar to at least one of the training examples in the latent space and, vice versa, every training example to be similar to one prototype [21]. Therefore, training examples will cluster around prototypes in the latent space. Besides, if we choose the decoder to be a continuous function, we should expect two close instances in the latent space to be decoded as similar instances in the input space, as noted in [21]. Consequently, we should expect the prototypes to have meaningful decodings in the input space.
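A minimal sketch of the distance matrix $D$ and the two-term prototype cost, with the latent tensors flattened to vectors (function names are ours, for illustration only):

```python
import numpy as np

def prototype_losses(Z, P):
    """Z: (N, d) encoded instances; P: (M, d) prototypes, both flattened.
    Returns D (N, M) of squared L2 distances and the prototype loss l_p."""
    D = ((Z[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
    # each prototype close to some example + each example close to some prototype
    l_p = D.min(axis=0).mean() + D.min(axis=1).mean()
    return D, l_p
```

Minimizing the first term pulls every prototype toward its nearest training example; the second pulls every example toward its nearest prototype, producing the clustering behavior described above.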
The output of this layer is a similarity measure that is based on the distance from each encoded data instance Z i to each prototype P j , as described in the next section.

Similarity Measure and Weighted Sum Layer
Unlike images, the dimensions of the time-frequency representation have different meanings and they should be treated differently [75]. For this reason, we propose a frequency-dependent similarity that assigns a different weight to each frequency bin in the latent space. This allows for the comparison of inputs and prototypes to be based on those frequency bins that are more relevant, for instance, where the energy is concentrated.
The similarity measure computation has two steps: (1) the calculation of a frequency-dependent similarity; and (2) the integration of the frequency dimension by means of a learnable weighted sum.
The frequency-dependent similarity is calculated as a squared L2 distance followed by a Gaussian function. Therefore, the output of the prototype layer, $S$, is obtained by:

$$ S_{ij}[f] = \exp\left( - \sum_{t,c} \left( Z_i[t, f, c] - P_j[t, f, c] \right)^2 \right). $$

Note that $S$ is a three-dimensional tensor whose shape is $(N, M, \bar{F})$, and that we use both sub-indexes and brackets to denote tensor dimensions: sub-indexes denote the $i$-th data instance and the $j$-th prototype, and brackets denote the time, frequency, and channel dimensions.
Subsequently, the frequency dimension is integrated to obtain the matrix $\hat{S} \in \mathbb{R}^{N \times M}$, using the following weighted sum:

$$ \hat{S}_{ij} = \sum_{f} H_j[f] \, S_{ij}[f], $$

where $H = \{H_j[f]\} \in \mathbb{R}^{M \times \bar{F}}$ is a trainable kernel. Note that this is similar to a dot product with a vector of length $\bar{F}$ for each prototype. We initialize $H$ with all values equal to $1/\bar{F}$ (equal weight, i.e., a mean operation), but we let the network learn the best way to weight each frequency bin for each prototype. The kernel is particularly useful to discriminate between overlapping sound classes by focusing on the most relevant frequency bins for each prototype.
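Both steps can be sketched in numpy as follows (our own naming; shapes follow the definitions above, with the frequency axis of the latent space kept explicit):

```python
import numpy as np

def frequency_similarity(Z, P):
    """Z: (N, T, F, C) encoded instances; P: (M, T, F, C) prototypes.
    Gaussian of the squared L2 distance, keeping the frequency axis:
    the result S has shape (N, M, F)."""
    diff2 = (Z[:, None] - P[None]) ** 2        # (N, M, T, F, C)
    return np.exp(-diff2.sum(axis=(2, 4)))     # sum over time and channels

def weighted_sum(S, H):
    """Integrate the frequency axis with a per-prototype kernel H: (M, F).
    H would be initialized to 1/F and learned; returns (N, M)."""
    return np.einsum('nmf,mf->nm', S, H)
```

An input identical to a prototype yields a similarity of 1 at every frequency bin, and the uniform kernel then reduces to a plain mean.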

Fully-Connected Layer
The fully-connected layer is devised to learn how to transform the similarity measure $\hat{S}$ into the predictions. Given that the network is intended for a classification task, we use softmax as the activation of this layer. Therefore, the predictions are calculated as:

$$ \tilde{Y}_{ik} = \mathrm{softmax}\left( \sum_{j} \hat{S}_{ij} \, W_{jk} \right), $$

where $W \in \mathbb{R}^{M \times K}$ is the kernel of the layer and $\tilde{Y} = \{\tilde{Y}_{ik}\} \in \mathbb{R}^{N \times K}$. Note that we do not use a bias in order to obtain more interpretable kernel weights. We expect that, for a given output class, the network gives more weight to the prototypes related to that class. For instance, in a problem with $K$ classes and one prototype per class ($M = K$), we expect to learn a fully-connected layer with a kernel close to the identity matrix, $W = I_K$ [21]. The loss function to train this layer is a categorical cross-entropy:

$$ l_c = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} Y_{ik} \log \tilde{Y}_{ik}. $$

Finally, note that, since the prototypes can be converted from the latent space to the time-frequency input representation by applying the decoder function, $g(P_j) \in \mathbb{R}^{T \times F}$ is the mel-spectrogram representation of the $j$-th prototype. Hence, we can illustrate the prediction process using the mel-spectrograms of data instances and prototypes, even though it is actually performed in the latent space. Figure 3 shows an example of this illustration in the context of a classification task with three classes (siren, air conditioner, and car horn) and one prototype per class ($M = K = 3$): the input $X_i$ is compared to the three prototypes to get the frequency-dependent similarity $S_{ij}[f]$; this similarity is integrated along the frequency dimension using the weighted sum layer to obtain $\hat{S}_{ij}$; and the final step of the reasoning process is to calculate the prediction $\tilde{Y}_{ik}$ by projecting the similarity through the fully-connected layer. The gray scale of the fully-connected arrows in the figure denotes the strength of the connections.
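A sketch of this bias-free prediction step and its cross-entropy loss (illustrative naming; in the model this layer is trained jointly with the rest of the network):

```python
import numpy as np

def predict(S_hat, W):
    """S_hat: (N, M) weighted similarities; W: (M, K) kernel, no bias.
    Row-wise, numerically stable softmax of S_hat @ W."""
    logits = S_hat @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Y, Y_pred, eps=1e-12):
    """Categorical cross-entropy between one-hot labels and predictions."""
    return -np.mean(np.sum(Y * np.log(Y_pred + eps), axis=1))
```

With $W = I_K$ (one prototype per class), a high similarity to the $j$-th prototype directly translates into a high probability for class $j$.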

Materials and Methods
In this section, we describe the sound classification tasks addressed, the publicly available datasets, the evaluation methodology, and the baselines that were used for performance comparison.

Sound Classification Tasks
We aim to study the performance of APNet when considering sounds of different nature, such as speech, music, and environmental sounds. To do that, we address three different audio classification tasks: urban sound classification, musical instrument recognition, and keyword spotting in speech.
The basic premise of sound classification is that sounds produced by the same source or physical process can be grouped into a category. In this work, we refer to sound classification as the task of assigning a sound category to an audio segment from a previously defined set of options. We differentiate this from sound detection, which also involves locating the sound event within the audio in terms of onset and offset times; and from audio tagging, in which multiple labels can be assigned to an audio segment, indicating the presence of instances from several sound classes [37].
Simple application scenarios have a single sound event per audio segment. This is the case of the keyword spotting task in speech addressed in this work, in which each audio segment has only one short word. Subsequently, a more complex scenario is the presence of a sequence of non-overlapping sound events of the same kind, such as the stem of a single instrument in a multi-track music recording session. The musical instrument recognition task is an example of this kind, despite the fact that some of the instruments (e.g., piano, violin) can simultaneously produce several sounds. The environmental sound scenarios are one of the most complex settings because they typically involve multiple temporally overlapping sound events of different types, and the sounds present often have considerable diversity. This is the kind of problem that is addressed in the urban sound classification task.
Further information regarding the tasks is provided in the following description of the datasets.

UrbanSound8k
For the urban sound classification task we use the UrbanSound8K dataset [76]. It has more than 8000 audio slices that are extracted from audio recordings from Freesound [77]. Each audio slice is tagged with one of the following ten labels: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. The length of the audio slices is variable, with a maximum value of 4 s. The audio format is the same as the one originally uploaded to Freesound.
We use the 10-fold cross-validation scheme that was provided by the dataset complying with the following methodology: (1) select a test fold (e.g., fold 1); (2) select a validation set as the next fold (e.g., fold 2); (3) train the model on the rest of the folds (e.g., folds 3 to 10) using the validation set for model selection; and, (4) evaluate the model on the test set. Finally, repeat for the ten folds and average the results.
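The fold rotation described above can be sketched as follows (a hypothetical helper of our own, not part of the released code):

```python
def fold_splits(n_folds=10):
    """Yield (test_fold, validation_fold, training_folds) tuples following
    the rotating scheme: validation is the fold after the test fold."""
    folds = list(range(1, n_folds + 1))
    for i in range(n_folds):
        test = folds[i]
        val = folds[(i + 1) % n_folds]
        train = [f for f in folds if f not in (test, val)]
        yield test, val, train
```

For example, the first iteration tests on fold 1, validates on fold 2, and trains on folds 3 to 10; the last tests on fold 10 and validates on fold 1.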

Medley-Solos-DB
For the music instrument recognition task, we use the Medley-solos-DB dataset [78]. It contains single-instrument samples of eight instruments: clarinet, distorted electric guitar, female singer, flute, piano, tenor saxophone, trumpet, and violin. Audio clips are three seconds long, sampled at 44,100 Hz.
It has a training set extracted from the MedleyDB dataset [79] and a test set from the solosDB dataset [80]. The training set also includes a split for model validation.

Google Speech Commands
For the keyword spotting task in speech we use the Google Speech Commands V2 dataset [81]. It consists of more than 100,000 one-second audio files, sampled at 16,000 Hz, each containing a single short word. Although the dataset is devised for the detection of a small set of commands (e.g., up, down, left, right) and to distinguish them from other short words, we use the complete set of 35 words: backward, bed, bird, cat, dog, down, eight, five, follow, forward, four, go, happy, house, learn, left, marvin, nine, no, off, on, one, right, seven, sheila, six, stop, three, tree, two, up, visual, wow, yes, and zero. The audio files are organized in three folds for model training, validation, and testing.

Baselines
We compare the performance of APNet to that of three different state-of-the-art opaque models: a convolutional neural network designed for environmental sound recognition in urban environments (SB-CNN) [41]; a convolutional recurrent neural network with an attention mechanism devised for speech classification (Att-CRNN) [42]; and, a pretrained embedding extractor model based on self-supervised learning of audio-visual data (Openl3) [40].
The SB-CNN model is a neural network composed of three convolutional layers followed by two dense layers. The Att-CRNN model consists of two horizontal convolutional layers, two bidirectional Long Short-Term Memory (LSTM) recurrent layers, an attention mechanism to integrate the temporal dimension, and three dense layers.
Openl3 is a pre-trained embedding extractor model that is based on the self-supervised learning of audio-visual data. The parameters used to calculate the input representation are fixed, and different flavours of the model are available based on the embedding space dimension, the number of mel bands, and the type of data used for training (music or environmental). We use an embedding space of 512 dimensions and 256 mel bands for the three datasets. For the urban sound classification task, we select the model trained on environmental data, and the model trained on music for the other two tasks. We use Openl3 as a feature extractor and train a multi-layer perceptron (MLP) for each classification task. This network has two hidden layers of 512 and 128 units, respectively, with ReLU activation and dropout with a rate of 0.5. The dimension of the last layer corresponds to the number of classes in each case.
The three baseline models and APNet use log-scaled mel-spectrogram representations as input, but with different sets of parameters. These parameters are also dataset dependent and, therefore, we summarize all combinations in Table 1. For instance, in the case of APNet trained on UrbanSound8K, we extract the time-frequency representation with $F = 128$ bands from 0 to 11,025 Hz, using a spectrogram calculated with a window length of 4096 samples and a hop size of 1024 samples. Each input corresponds to a 4-second slice of the audio signal; hence, the number of time hops is $T = 84$.

Table 1. Mel-spectrogram parameters for each model and dataset: sampling rate in kHz ($f_s$); window length ($w$) and hop size ($h$) in samples; number of mel-bands ($m$); and audio slice length in seconds ($l$).

Training
We train APNet and the baseline models from scratch, without the use of pre-trained weights, except for the feature extraction in Openl3. For the three baselines, we optimize a categorical cross-entropy loss over the predictions and the ground-truth, as shown in Equation (9). When training APNet, we optimize the weighted sum of all the loss functions defined previously, Equations (3), (5) and (9):

$$ l = \alpha l_c + \beta l_p + \gamma l_r \qquad (10) $$

where the weights ($\alpha$, $\beta$, and $\gamma$) are real-valued hyperparameters that adjust the ratios between the terms. Other important hyperparameters are the number of channels used in the convolutional layers, $C$; the number of prototypes, $M$; and the batch size, $B$. See Table 2 for the values of the hyperparameters for each dataset. We train APNet and the baseline models using an Adam optimizer with a learning rate of 0.001 for 200 epochs. We select the network weights that maximize the accuracy on the validation set.
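The composite objective of Equation (10) is then simply a weighted sum of the three terms (the default weights below are placeholders for illustration, not the values from Table 2):

```python
def total_loss(l_c, l_p, l_r, alpha=1.0, beta=0.1, gamma=0.1):
    """Weighted combination of the classification (l_c), prototype (l_p),
    and reconstruction (l_r) losses, as in Equation (10)."""
    return alpha * l_c + beta * l_p + gamma * l_r
```

In a framework such as Keras or PyTorch, this scalar would be the quantity backpropagated through the whole network, so the autoencoder, prototype layer, and classifier are trained jointly.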

Experiments and Results
First, we evaluate the classification accuracy of the proposed model in the three tasks considered, and then compare it with that of the baseline models. All of the experiments were conducted using the DCASE-models library [82], which simplifies reproducing the experiments as well as developing the proposed model. After training and validating all of the models, we evaluate them on the test set of each dataset. Table 3 shows the performance results and the number of parameters of APNet and the three baselines for the three sound classification tasks. These results show that APNet is a very competitive algorithm in all of the tasks, with accuracy values that are comparable to those of the baseline models. However, unlike the baseline models, APNet is designed to be an interpretable deep neural network. This allows us to carry out further analysis to provide insight into the inner workings of the network. In particular, the following sections show how to inspect the network (Section 5.1) and how to refine it based on the insights provided by the inspection (Section 5.2).

Network Inspection
In this section, we examine the functioning of the autoencoder (Section 5.1.1) and inspect the learned weights from the prototype (Section 5.1.2), fully-connected (Section 5.1.3), and weighted sum (Section 5.1.4) layers.

Autoencoder
The autoencoder is devised to extract meaningful features from the input (encoding process) and to transform the prototypes from the latent space back to the mel-spectrogram representation (decoding process). We qualitatively assess the reconstruction of some data instances in order to examine the functioning of the autoencoder: if this reconstruction is not appropriate, we cannot guarantee that the learned prototypes can be transformed back to the input space in a meaningful way. Figure 4 shows the reconstruction of five random signals from one of the validation sets of the UrbanSound8K dataset. The visual comparison of the original and reconstructed mel-spectrogram representations indicates that the decoding process gives adequate results. This is also confirmed by listening to the audio signal obtained from transforming the time-frequency representation to the waveform domain. The mapping to the audio domain is done by first converting the mel-spectrogram to an approximate linear-frequency spectrogram, followed by the Griffin-Lim method [83]. This process is implemented using librosa [84]. The audio signals of the examples of Figure 4 are available at https://pzinemanas.github.io/APNet/.
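librosa bundles this pipeline (e.g., `librosa.feature.inverse.mel_to_audio`); the phase-recovery core it relies on is the Griffin-Lim iteration, sketched here in plain NumPy on a linear-frequency magnitude spectrogram. Window and hop values are illustrative, not the paper's settings:

```python
import numpy as np

def stft(x, win, hop):
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, win, hop):
    w = np.hanning(win)
    n = (S.shape[0] - 1) * hop + win
    x, norm = np.zeros(n), np.zeros(n)
    for i, frame in enumerate(np.fft.irfft(S, n=win, axis=1)):
        x[i * hop:i * hop + win] += frame * w      # overlap-add
        norm[i * hop:i * hop + win] += w ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(mag, win=512, hop=128, n_iter=30, seed=0):
    """Recover a waveform from a magnitude spectrogram by alternating
    between the time domain and the STFT domain, keeping the target
    magnitude and the estimated phase at each iteration."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, win, hop)
        phase = np.exp(1j * np.angle(stft(x, win, hop)))
    return istft(mag * phase, win, hop)
```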

Prototypes
We take advantage of the decoder to obtain the reconstructed mel-spectrogram of the prototypes, g(P_j). Note that, to do so, we need the masks of the max-pooling operations from the encoder function f(·). These masks are not available, since the prototypes were not transformed by the encoder. However, we can use the masks produced by the instances from the training data that minimize the distance to each prototype. Given that the prototypes are learned to be similar to instances from the training data, we assume that these masks are also suitable for the prototypes. Subsequently, we input these reconstructed mel-spectrograms to the model and obtain the class prediction for each prototype, in order to associate each prototype with a sound class. Figure 5 shows one prototype of each class for the UrbanSound8K dataset. Note that the mel-spectrograms exhibit the typical traits of the sound classes that they represent.
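The mask-borrowing step can be sketched as follows; the shapes and names are illustrative (flattened latent vectors for prototypes and training instances, and per-instance recorded pooling masks):

```python
import numpy as np

def borrow_masks(prototypes, latents, masks):
    """For each prototype, pick the max-pooling masks of the closest
    training instance in the latent space (squared L2 distance)."""
    d = ((prototypes[:, None, :] - latents[None, :, :]) ** 2).sum(axis=-1)
    nearest = d.argmin(axis=1)   # index of the closest training instance
    return masks[nearest], nearest
```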
We also extract the audio signal from each prototype. By listening to the reconstructed audio, one can confirm that the prototypes actually represent their corresponding sound classes. The audio signals of some of these prototypes can be found at https://pzinemanas.github.io/APNet/.

Fully-Connected Layer
The fully-connected layer transforms the similarity measure Ŝ into the network predictions. Recall that the weight matrix W of this layer is designed to be learnable. The analysis of the learned weights of W contributes to the interpretability of the network: from the value of the learned weights, we can tell which prototypes are most representative of which class. Figure 6 shows the transposed weight matrix, W^T, obtained for the UrbanSound8K dataset. The prototypes are sorted by the prediction given by the classifier. Note that most of the prototypes are connected to their corresponding class, with only two exceptions that are related to acoustic similarities. For instance, the last prototype of air conditioner is connected to engine idling. Besides, there is a strong connection between the first prototype of dog bark and the output of children playing.
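The sorting used in Figure 6 can be sketched as follows; the toy W with M = 4 prototypes and 2 classes is an assumption for the example:

```python
import numpy as np

def prototype_classes(W):
    """Assign each prototype to the class receiving its largest weight.
    W has shape (M, n_classes): rows map prototypes to class outputs."""
    return W.argmax(axis=1)

def sort_prototypes_by_class(W):
    """Row order that groups prototypes of the same class together."""
    order = np.argsort(prototype_classes(W), kind="stable")
    return order, W[order]
```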

Weighted Sum Layer
The weighted sum layer allows the network to learn the best way to weight each frequency bin in the latent space for each prototype. It can be useful to focus on the bins where the energy is concentrated, or to better discriminate between overlapping sound classes. Figure 7 shows three examples of the learned weights from UrbanSound8K. For instance, in the case of the siren and engine idling prototypes, the weights are strongly correlated with the energy envelope; the layer gives more importance to the low frequencies for engine idling and to the middle frequencies for siren. However, in the case of jackhammer, this layer gives greater significance to the high frequencies, even though the low frequencies are energy predominant. This can be explained as a way to distinguish this class from others with high energy in the low frequencies, such as engine idling or air conditioner. To show the importance of the trainable weighted sum layer, we undertake the following experiment. We select a data instance that is difficult to classify because it contains a mix of sound sources (a siren, a woman yelling, and an engine) and the only tagged class (siren) is in the background. We extract the closest prototypes in two cases: (1) using the learned weighted sum; and (2) replacing the weighted sum layer with a fixed mean operation (H_j[f] = 1/F, ∀ j, f). Figure 8 shows the mel-spectrogram of the data instance (X_i) to the left, while, to the right, the three closest prototypes using the mean operation are depicted in the top row, and the three closest prototypes using the trainable weighted sum are depicted in the bottom row. Note that the first two prototypes of the top row correspond to children playing-because the predominant source is a woman yelling-and only the third prototype corresponds to the correct class.
On the other hand, when the trainable weighted sum is used, the similarity measure is able to capture the source in the background by giving more weight to the frequency bins where its energy is concentrated. Note that the first two prototypes in the bottom row correspond to the correct class. Figure 8. Example of the importance of the weighted sum layer. To the left, the mel-spectrogram, X_i, of an audio slice that includes several sources: a siren in the background along with a woman yelling and an engine in the foreground. Using a mean operation instead of the weighted sum, the two closest prototypes are from children playing (top row). However, applying the weighted sum, the two closest prototypes are from the correct class, siren (bottom row).
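The effect of replacing the mean with learned frequency weights can be sketched as follows. Channels are omitted and the shapes are illustrative; H holds one weight per frequency bin, and H[f] = 1/F recovers the plain mean:

```python
import numpy as np

def weighted_distance(X, P, H):
    """Frequency-weighted squared distance between a latent input X and
    a prototype P, both of shape (F, T); H has one weight per bin."""
    per_freq = ((X - P) ** 2).mean(axis=1)   # aggregate over time first
    return float((H * per_freq).sum())
```

With uniform H the distance is dominated by the most energetic bins, while a trained H can emphasize the bins where a background source concentrates its energy.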

Network Refinement
The architecture of APNet allows designers to refine, debug, and improve the model. In the following, we propose two automatic methods to refine the network by reducing redundancy in the prototypes (Section 5.2.1) and the channels in the encoder's last layer (Section 5.2.2). Besides, we present a web application that is devised for the manual editing of the model (Section 5.2.3).

Prototype Redundancy
APNet does not include a constraint on the diversity of the prototypes. As a result, some of the prototypes can be very similar, producing a certain redundancy. To evaluate this, we calculate the distance matrix of the prototypes, D ∈ R^{M×M}, using the squared L2 distance:

D_jl = ||P_j − P_l||_2^2 (11)

Figure 9 shows this distance matrix in the case of the Medley-solos-DB dataset. Note that some of the prototypes of the same class are very similar. To reduce this redundancy and, consequently, make the network smaller, we remove prototypes that are very close to each other. Formally, we eliminate one prototype from each pair (j, l) that meets:

D_jl < min{D_jl} + mean{D_jl}/2, ∀ j, l ∈ [1, . . . , M] : j ≠ l (12)

Note that eliminating these prototypes also implies removing the rows of W associated with them. For instance, if P_j is deleted, then row j of W should also be removed.
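A minimal sketch of the pruning rule of Equations (11) and (12), operating on flattened prototypes; the threshold is taken over the off-diagonal entries of D, and the shapes are illustrative:

```python
import numpy as np

def prune_prototypes(P, W):
    """Drop one prototype from each pair whose squared L2 distance falls
    below min(D) + mean(D)/2, and remove the matching rows of W."""
    M = P.shape[0]
    D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
    off_diag = D[~np.eye(M, dtype=bool)]
    threshold = off_diag.min() + off_diag.mean() / 2
    keep = np.ones(M, dtype=bool)
    for j in range(M):
        for l in range(j + 1, M):
            if keep[j] and keep[l] and D[j, l] < threshold:
                keep[l] = False   # remove one prototype of the pair
    return P[keep], W[keep], keep
```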
After eliminating the redundant prototypes, we train the network for 20 more epochs using the same parameters as in Section 5. After this pruning process, the classification accuracy improves from 65.8% to 68.2%, while the number of parameters is reduced.

Channel Redundancy
The number of channels of the convolutional layers within the autoencoder is usually selected by grid search or by relying on values used in previous research. Too many channels can produce overfitting or add noisy information to the feature space. We take advantage of the prototype architecture to check whether some channels are redundant. Because the output of the last convolutional layer of the encoder yields the latent space, we can use the prototypes to analyze the channel dimension. Therefore, if there is redundant or noisy information in the prototypes, it can be related to the filters in the last layer of the encoder.
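One natural form of such a channel-redundancy measure can be sketched by accumulating, over all prototypes, the squared differences between every pair of channel slices; the prototype tensor layout (M, F, T, C) is an assumption for the example:

```python
import numpy as np

def channel_distance_matrix(P):
    """Accumulated squared distance between every pair of channels,
    summed over all prototypes, frequencies, and time steps.
    P is assumed to have shape (M, F, T, C)."""
    C = P.shape[-1]
    D = np.zeros((C, C))
    for k in range(C):
        for q in range(C):
            D[k, q] = ((P[..., k] - P[..., q]) ** 2).sum()
    return D
```

Small off-diagonal entries flag channel pairs carrying nearly the same information; one channel of each such pair is a candidate for removal.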
To study this, we undertake the following experiment in the instrument recognition task. We calculate the accumulated distance matrix of the prototypes, C ∈ R^{C×C}, by accumulating, over all prototypes, the squared differences between each pair of channels in the latent space. Subsequently, we find the eight minimum values of C outside the diagonal and delete one channel from each pair (k, q). We delete all of the weights related to these channels and, as a result, reduce the number of filters of the last layer from 48 to 40. We re-train the network for 20 more epochs, because the decoder part has to be trained again. Table 4 shows the results obtained for the Medley-solos-DB dataset after both the prototype and channel refinement processes, compared to Openl3 as the most competitive baseline. After the two refinements, the accuracy increases by 3.3%, while the number of parameters is reduced by 2.6 M. Besides, this result improves on the Openl3 performance by 1.8%.

Manual Editing

In the previous sections, we described two automatic methods to refine APNet that lead to better performance and smaller networks. However, when working with networks designed for interpretability, like APNet, it is also possible to visualize the models and allow users to refine them manually. To that end, we designed a web application that allows users to interact with the model, refine it, and re-train it to obtain an updated model. Figure 10 shows a screenshot of the web application. The user can navigate a two-dimensional (2D) representation of the prototypes and the training data. It is also possible to see the mel-spectrograms of the prototypes and the data instances and to play the audio signals. The user can remove selected prototypes and convert instances to prototypes. Once the prototypes are changed, the model can be re-trained and evaluated.
Using this tool, manual debugging of the model is also possible. Furthermore, different training schemes, such as human-in-the-loop strategies, become possible. For instance, the user can periodically check the training and change the model after a few epochs. The tool is also a good starting point for designing interfaces that explain the inner workings of deep neural networks to end-users. Figure 10. Screenshot of the web application designed for the manual editing of APNet. In this tab, the user can navigate through the training set and the prototypes in a two-dimensional (2D) space, listen to the audio files, and explore the mel-spectrograms. It is also possible to delete prototypes and convert data instances into prototypes. Other tabs include functions for training and evaluating the edited model.

Conclusions
In this work, we present a novel interpretable deep neural network for sound classification-based on an existing model devised for image classification [21]-which provides explanations of its decisions in terms of a set of learned prototypes in a latent space and the similarity of the input to them.
We leverage domain knowledge to tailor our model to audio-related problems. In particular, we propose a similarity measure that is based on a trainable weighted sum of a frequency-dependent distance in a latent space. Our experiments show that including the trainable weighted sum effectively improves the model, in particular when classifying input data containing mixed sound sources.
The proposed model achieves accuracy results that are comparable to those of state-of-the-art baseline systems in three different sound classification tasks: urban sound classification, musical instrument recognition, and keyword spotting in speech.
In addition, the ability to inspect the network allows us to evaluate its performance beyond the typical accuracy measure and provides useful insights into the inner workings of the model. We argue that the interpretability of the model and its reliable explanations-in the form of a set of prototypes and the similarity of the input to them-increase its trustworthiness. This is important for end-users relying on the network output for actionable decisions, even in low-risk applications.
The interpretable architecture of APNet allows designers to refine, debug, and improve the model. In this regard, we propose two automatic methods for network refinement that eliminate redundant prototypes and channels. We show that, after these refinement processes, the model improves its results, even outperforming the most competitive baseline in one of the tasks. Our results exemplify that interpretability may also help to design better models, in contrast with the widespread assumption that there is an unavoidable trade-off between interpretability and accuracy.
Future work includes adapting the proposed model to other sound recognition problems, such as sound event detection and audio tagging. Along with this, we seek to incorporate audio-domain knowledge into the development of other intrinsically interpretable neural network models based on prototypes [23,24] and concepts [22,25]. Besides, we want to use the visualization tool for manual editing to study different ways of training the network with a human-in-the-loop approach, and to create tools for explaining the inner functionality of the network to end-users.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: