Training a Filter-Based Model of the Cochlea in the Context of Pre-Trained Acoustic Models

: Auditory research aims in general to lead to understanding of physiological processes. By contrast, the state of the art in automatic speech processing (notably recognition) is dominated by large pre-trained models that are meant to be used as black-boxes. In this work, we integrate a physiologically plausible (albeit simple filter-based) model of the cochlea into a much larger pre-trained acoustic model for speech recognition. We show that the hybrid system can be trained and evaluated with various combinations of fine-tuning and self-supervision. The results broadly show that the system automatically yields structures that are known to work well. Moreover, these structures lack artifacts that were apparent in (our) previous work using less sophisticated neural models. We conclude that the hybrid structure is an appropriate way to proceed in auditory research, more generally allowing the work to take advantage of larger models and databases from which it would not otherwise benefit.


Introduction
Since the advent of deep learning, the general field of speech technology has advanced to a point that would be unrecognizable to proponents working just a decade ago [1,2].Perhaps the main difference is that the technology is now driven by the machine learning community rather than the speech processing community.One visible effect of this change is that physiologically inspired approaches that guided the topology of hidden Markov models (HMMs, the previous state of the art) have largely disappeared; they have been replaced by "end-to-end" (E2E) approaches.These in turn learn the representation that leads to the best performance on the available data, irrespective of physiological plausibility.For example, HMMs were typically defined at a phoneme level, but deep architectures have preferred byte-pair encodings [3] or often just (orthographic) letters.Spectral features have been replaced by generic outputs from convolutional layers.The exemplar in these cases is perhaps the "wav2letter" of Collobert et al. [4].
Self-supervision in particular has been a key recent advance in deep learning, in computer vision as well as audio processing [5].The "self" supervision arises from a loss-function that is designed to reveal discriminability at a given granularity of interest.Crucially, this removes the need for labeled training data, resulting in systems that can be trained on vast amounts of data.To put this into perspective, the LibriSpeech database of Panayotov et al. [6] is one of the largest commonly available labeled speech databases at around 1000 h.By contrast, self supervision is associated with datasets of tens of thousands of hours; the million hours of Parthasarathi and Strom [7] is roughly a century.The commercial side of the community has made pre-trained models available [8,9].Note that this pre-training implies almost fully trained; it is distinct from the pre-training that used to be applied to some networks to enable them to respond to conventional training [10].However, subsequent training of the recent models is common and referred to as fine-tuning.
Given that these recent systems work well, the question arises: "what have they learned?".This is difficult to answer because their component parts cannot readily be mapped to biological ones.A recent study has shown that analogies can be drawn between layers of such systems and brain function [11], suggesting that the two fields are in fact converging.In general, however, it seems reasonable that in order to make inferences about biological function, it is necessary to build the deep networks using components that are themselves interpretable.Until quite recently, embedding such components into a deep-learning framework might have been onerous owing to the need for their bein differentiable.
Fortunately, current automatic gradient packages allow networks of arbitrary operations.They also allow arbitrary parts of a network to be trained.This automatic gradient method is a technique that automates the calculation of gradients for the model parameters, enabling stochastic gradient descent computation.It alleviates the need for manual derivation while enhancing the efficiency and scalability of these gradient-based algorithms.
The innovation of this work lies in the transformative impact that a self-supervision model brings to the auditory model.Therefore, we combine a current state-of-the-art pretrained model with a trainable filter front-end to infer a physiologically plausible function of the human auditory apparatus.More specifically, we build upon a previous study [12] where we showed that attempting to do this with a smaller model led to unexpected results.We show that it is possible to use a state-of-the-art model, and that doing so mitigates the effects of the smaller model.
Section 2 first describes the state of the art in both E2E modeling as well as cochlear modeling for speech technology.Section 3 further details how a simple auditory model can be trained from scratch in the context of a larger model that has been pre-trained.Section 4 contains a logical sequence of experiments showing that:

•
Trainable filters can replace the encoder CNN in an already pre-trained model for fine-tuning.

•
A physiologically adaptable front-end performs as well as a CNN in a pre-trained model.

•
Trainable filters can be incorporated during self-supervision.

•
When trained with a large transformer model, SincNet filters do not tend to learn wide-band filters as they do with a smaller MLP model.

Background
2.1.Self-Supervised Models 2.1.1.Foundations Self-supervised learning arose around 2008 in the natural language processing (NLP) field with the model of Collobert and Weston [13].Self-supervision differs from supervised learning in the way the loss function is computed.Supervised learning necessarily requires labeled data; the output of the network is directly compared to the labels and the difference is back-propagated through gradients.Assigning labels can be onerous.By contrast, selfsupervised learning has the advantage of needing no labels to train, meaning it can make use of huge amounts of data.The model is typically trained either as an auto-encoder (the data are the labels) or by contrastive loss (see below).
Self-supervised models rely on a two-stage training procedure: pre-training and fine-tuning.These map onto what might be traditionally called training and adaptation respectively.Pre-training is resource-intensive and the resulting models can be large.A significant recent trend has been for proponents to make such models available online [8].A core goal of this study is to investigate what can be done in the acoustic research field starting from those available pre-trained networks.

Wav2vec
Wav2vec [5] was the first attempt at applying self-supervision in the sense of BERT to the automatic speech recognition (ASR) field.It was inspired by wav2letter [4] and unsupervised machine translation [14], a model that translates between languages based on two unlabeled datasets.The structure of wav2vec [5] comprises two main blocks: the encoder network and the context network.The encoder consists of a 5-layer convolutional network (CNN) down-sampling the input waveform from 16 kHz to a 100 Hz signal.The context network in wav2vec is also a CNN with kernel sizes of three and strides of one.In vq-wav2vec, Baevski et al. [15] proposed to integrate quantization between the encoder and the context layers.The recent Wav2vec2 [16] parallelizes the quantization and an enhanced version of the wav2vec structure; the encoder CNN is now seven layers deep and down-samples the signal to 50 Hz and the CNN-based context layer is replaced by a transformer.Of those two networks, the transformer stack accounts for 94.4% of the total number of trainable parameters.
Wav2vec models use contrastive loss as an objective during training, and the connectionist temporal classification (CTC) of Graves et al. [17] for decoding.The original contrastive loss of wav2vec was formulated to favor predictability for adjacent observations, or "predictors", with the opposite for more distant ones, hence "distractors".Oord et al. [18] proposed another approach to the contrastive loss computation, this time based on the cross entropy loss.The loss equation in wav2vec2 adopts this more recent approach, being where sim is the cosine similarity, sim(a, b) = a T b/|a||b|, κ is a constant that regulates the entropy of the cosine similarity, preserving the relative ranks of events.In the machine learning field, the analogous parameter controlling the smoothness of probability distributions is commonly referred to as 'temperature'.The parameter c t is the context representation and q ∈ Q t are vectorized samples of other parts of input waveform in the latent space.The goal of Equation ( 1) is to find a quantized representation for speech in the latent space, training towards orthogonal representations of the quantizers in that latent space.The model outputs a context representation c t that is able to guess the true q t vector quantization (the predictor) of the latent space out of K + 1 candidates (with K distractor quantizers and one closely related target; in wav2vec2, 100 distractors are sampled out of the same utterance).

Cochlear Models 2.2.1. Filterbanks
Cochlear models have been studied for many years, with our current understanding perhaps going back to the work of Von Békésy [19].Simplistic (but functional) approaches consider the cochlea as a natural filterbank [20,21].
Comparatively recent studies have continued to seek biological plausibility.Lyon [22,23,24], for instance, proposed a model of the complete auditory path where the cochlea is modeled with a cascade of resonators.This model has since been implemented on an FPGA [25] and used in applications such as speaker identification [26] and sound localization [27].
Of particular interest for speech processing is the filterbank scaling.Probably the best known frequency distribution is the mel scale [28], based on frequencies judged to be equally spaced in human perceptual tests.This has been ubiquitous in feature extraction algorithms for ASR, especially in its guise as mel frequency cepstral coefficients (MFCCs).The perceptual linear prediction (PLP) of Hermansky [29] favored the Bark scale, based on noise bandwidths required to mask tones [30][31][32].Physical measurements are also possible; the greenwood [33] and ERB (equivalent rectangular bandwidth) [32] scalings lead to more extreme warping than mel.

Current Understanding
Current explanations of the workings of the cochlea are as processes of wave propagation through an active oscillator system [34].According to our current best understanding, the cochlea works as an array of Hopf oscillators [35,36].These active oscillators incorporate the interaction between the inner and outer hair-cells with the tectoral and basilar membrane, and explain oto-acoustic emissions [37,38] as arising from the outer hair cells.Several works have implemented those Hopf oscillators as models for the cochlea [39][40][41].
Notwithstanding, in the present study we limit ourselves to filterbanks, with active systems being a goal for further research work.

ASR with Trainable Filters 2.3.1. From Cochlear Models to E2E ASR
For many years, ASR front-ends such as MFCC [42] and PLP [29] were loosely based on models of the cochlea.However, given the large number of choices within such models, and parallel success with raw images as input [43], E2E approaches have been investigated for audio processing.Early work involved trainable convolution layers on raw audio inputs [2,44,45].Although they can perform well, such black box models lack interpretability, explainability, comprehensibility and transparency [46,47].Retaining an explicit filterbank structure can alleviate these issues.Candidates for the filterbank have included Gamma-tone [48,49], Gabor [50][51][52], SincNet [53,54], and Spline filters [55].Zeghidour et al. [50] in particular showed that using trainable filters consistently increased the performance of speech recognition compared to a model where MFCCs were used.Since we do not have a hypothesis that some trainable filter models would perform better, that this work builds on a previous work using SincNet filters in a previous study, and that there are computational advantages of doing so (see Section 2.3.2),we take the SincNet model of Ravanelli and Bengio [53] as representative of the above models for the present study.

SincNet
SincNet is characterized by a convolutional filter as the first layer in a larger convolutional front-end.In the initial SincNet implementation, this convolutional front-end was followed by a five-layer MLP, whereas in this work, the main idea is to combine this encoder with pre-trained transformers.The implementation of the filter layer uses the Fourier transform property: a convolution in the temporal domain corresponds to a multiplication in the frequency domain.Thus, using a sinc filter as kernel in a convolutional layer gives the filtered signal as output.The name arises because the Fourier transform of a rectangular filter in the frequency domain corresponds to sinc(x) = sin(x) x in the temporal domain.SincNet is implemented as a combination of two low-pass filters F 1 − F 2 with cut-off frequencies f 1 > f 2 , giving a band-pass filter of bandwidth f 1 − f 2 and a central frequency of The central frequency and bandwidth are the two trainable parameters.In the temporal domain, the filter is then given by: Ravanelli and Bengio [53] showed that SincNet outperformed MFCCs.They also showed that using SincNet filters gave better results than using only CNN layers.A second study by the same authors showed that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs [54].They showed that such trainable filters could map onto specific speech-related features like formants, while standard CNNs did not.SincNet was subsequently incorporated into a joint CTC-attention training scheme by Parcollet et al. [56].The authors showed that their approach outperforms previously investigated E2E systems.
Based on these examples in the literature that show the reliability of SincNet as a front-end for ASR, we use this trainable filterbank in this work.

Speech Features
We are particularly interested in the interpretability of the learned filterbanks.Pitch and formants are well known important speech features.For human speech, pitch typically lies between 85 Hz and 300 Hz.Formants are resonances of the vocal tract and define vowel sounds.The first formant is generally situated between 200 Hz and 800 Hz; the second formant lies between 500 Hz and 2500 Hz [57].

Wide-Band Structures
In a previous study [12], we analyzed the behavior of SincNet filters in a simple neural network structure.Although the literature above had shown that E2E trained filters led to improved ASR performance, we were interested in making inference about filter spacing-whether ASR favored mel, Bark, ERB or greenwood scales.The network we used for this study was the one proposed in the original paper [53] within the pytorch-kaldi framework [58].
Unexpectedly, we observed two types of filters appearing: narrow-band filters and wide-band filters; these are depicted in Figure 1.For narrow-band filters, we identified those that more or less followed the conventional expected frequency bands and covered the whole frequency range.These narrow-band filters showed characteristics of a typical ASR filterbank: it was composed of 30 to 40 filters and the scaling the central frequencies tended to learn is the mel scale.By contrast, the wide-band filters were filters learning more unexpected wide-band structures.The reason for this kind of structure was not clear, especially given that, theoretically, wide-band structures could clearly be reconstructed by linear combinations of the other filters.However, literature revealed that some wide-band structures do appear in the human auditory path [59,60].
Additionally, two functions of the filter layer were observed to add artifacts in the frequency domain: the maxpooling and ReLU functions.Maxpooling does not have a physiological equivalent, it is commonly used after convolution operations to reduce the (channel) dimension.ReLU acts as a half-wave rectifier, as do the inner haircels in the cochlea, making the use of this function physiologically relevant.We showed that removing a maxpooling function from the internal structure and keeping the ReLU activation function had a direct impact on the performance of the model, but did not prevent wide-band filters from appearing.One of the goals of this work is to analyze if those wide-band structures still appear when SincNet filters are combined with a large self-supervised network.

Overall Hypothesis
In our previous work (Section 2.5), we showed that a model of the cochlea trained in an E2E manner behaves unexpectedly.The model learns a combination of expected narrowband structures representative of the cochlea.However, it also learns wider-band structures that are more representative of the higher level auditory pathway.The network used was a combination of a SincNet front-end with a simple MLP as context network and trained in a supervised manner.The state-of-the-art in neural models for acoustic processing of speech is now associated with pre-trained stacks of transformers.Such stacks have been shown to exhibit behavior quite similar to that of the human brain using fMRI [11].
In the following experiments, we aim to show that a plausible model of the cochlea can be combined with an otherwise black-box pre-trained model to yield a model much closer to the human auditory pathway.If such a model continues to behave in the same way as our simplistic one, we could further conclude that something is wrong with our cochlear model; certainly we have no biological evidence for wide-band structure at such a low level.By contrast, if the new model behaves differently, we could conclude that our simplistic model is too simplistic, and that a larger (implying pre-trained) model is necessary for such low level auditory investigation.

Pre-Trained Model
The model used in this work is based on wav2vec2 [16], presented in Section 2.1.For this study, the encoder CNN of wav2vec2 is replaced with SincNet.The encoder layer is composed of SincNet filters followed by a classical CNN of 3 layers as in the SincNet baseline; this CNN down-samples the input signal to 50 Hz, which is then compatible with the further transformer layers.It is illustrated in Figure 2. The training follows the wav2vec2 method, but is customized to take the SincNet filters into account without starting the pre-training from scratch.The framework used is fairseq [8], (which stands for Facebook AI Research in sequence modeling), a framework developed and used by Meta, which has an implementation of wav2vec that can be easily modified.The modifications are described in following sections.Moreover, the code is adapted to run on several GPUs in parallel.

Dataset
We use the LibriSpeech dataset [6] for all experiments.The amount of data in every subset is given in Table 1 (h: hours, spk: speaker, f: female, m: male, exp: used for given part of experiment, SS: self-supervision, FT: fine-tuning).Our first hypothesis posits that the CNN encoder block acquires information learnable through trainable filters.To validate this, we replace the front-end with a SincNet encoder (see Figure 2) and compare post-training ASR performance to the baseline.Secondly, by incorporating modifications from our previous work, we expect improved results compared to the original SincNet implementation.Third, the trainable filters in the initial encoder layer should mirror patterns learned by the pre-trained encoder.

Choice of Global Parameters
In order to test this hypothesis, several implementation parameters are adapted: the learning rate scheduler, the structure of the new encoder, the kernel size of the filter layer and the number of updates of the model.

Learning Rate Schedulers
If a pre-trained model is used in, say, ASR, there is normally an adapter layer at the output of the pre-trained model.This layer can be trained whilst keeping the pre-trained model fixed by simply not back-propagating gradients further than the adapter.By contrast, the SincNet that we wish to train is an the input of the pre-trained model; gradients must be back-propagated through the pre-trained model.The new weight value is given by the update parameter learning rule in Equation (3).
where θ t is the weight value one time-step before the updated value.∆θ t is dictated by the equations of optimizer update-in the case of this work, Adam [61]. where: • mt and vt are bias-corrected estimates of the linear combination of the gradient with first and second moment estimates.• ϵ is a small constant preventing a division by 0. • η is the learning rate.
To freeze the model weights, we set the learning rate to zero.According to the gradient descent update Equation (4), if the learning rate η is set to zero, the parameters to which this 0 learning rate is applied are frozen to their initial value, which is a common machine learning practice.
In this work, we adopt a learning rate scheduler integrating a freezing period.During this freezing period, the encoder network is able to learn what the previous CNN layers had learned during the pre-training phase and the context network stays frozen.
The original wav2vec2 implementation uses two types of learning rate schedulers: the polynomial decay and tri-stage learning rate scheduler.The polynomial decay scheduler is used for non-pre-trained network chunks; in the wav2vec framework, this corresponds to the self-supervision phase while the tri-stage learning rate scheduler is used for fine-tuning.For this experiment, the polynomial decay learning rate was assigned to the encoder network (Figure 3c), since this part of the network is replaced by a SincNet encoder and consequently trained from scratch in the experiment.For the context network, we created a hybrid version of the tri-stage learning rate scheduler incorporating a freezing period at the beginning (Figure 3d).The whole training scheme is schematically summarized in Figure 4, clearly depicting the two training phases.Freezing the transformers and projection layer in the first place avoids catastrophic forgetting of the transformer weights.Fairseq's composite optimizer enables distinct learning rate schedulers for different model parts, utilizing a pass-through general optimizer to automatically associate each part with its specific scheduler and optimizer.

SincNet Modifications
In the previous work (Section 2.5), we showed that removing the maxpooling layer after the filters was physiologically relevant and improved the performance.In this experiment, we compare the performance of the initial SincNet structure and the adapted structure.

Kernel Size
With the larger model, we observed a saturation effect at lower frequencies in a draft experiment, linked to the maximum filter precision dictated by the kernel size.We replaced the initial kernel size of 129 by 400 in the sinc filter initialization; this both corrected the above issue, and corresponded to the 25 ms used as window size in classical MFCC computations.

Number of Updates
An update corresponds to a back-propagation of gradients in the model, which updates the parameters, while an epoch corresponds to passing the whole dataset through the model.For most of the experiments, the number of updates is fixed to 10 k, which corresponds to 38 epochs with the training set train-clean-100 for the fine-tuning part.For completeness and comparison, we also report one experiment where the model was able to train for 100 k updates, with 40 filters and a kernel size of 400.

Results
The results of this first experiment are summarized in Table 2 and Figure 3. Concerning the performance, the WER results of Table 2 are all below 4%.This means that a SincNet encoder is capable of replacing the original wav2vec2 encoder when keeping the context network fixed.
Table 2. Comparison between a short and long run using no maxpooling after the filter layer in the first 2 experiments and with a maxpooling that has a kernelsize of 3 in the third experiment.Thus, the third experiment has a downsampling of a factor 3 just after the sinc filters via a maxpooling operation in the time domain.The encoding and impact of the learning rate scheduler on the WER and loss curve are illustrated in Figure 3.During the first half of the training updates, the context network is frozen through the zeroing of learning rate, while the encoder network is able to train.In the second half of the experiment, the training of the context network is enabled; the transformer can slightly adapt itself to the trained encoder.In the WER and loss curves, we can see a clear impact of the transformers when they obtain the ability to train around 40 k updates.However, when only enabling the SincNet encoder structure to train, the network already shows a decent performance during the first part of the training.
As summarized in Table 2, three different experiments were performed to address this first hypothesis: two short experiments to analyze the impact of removing the maxpooling layer and a third experiment where the training time was much longer.As in the previous work (Section 2.5), removing the maxpooling function improves the global performance of the ASR.Further, letting the experiment train for a longer period makes the performance increase.
The shape of the filter distribution for the different runs of Table 2 are illustrated in Figure 5.When training the network for 100 k updates instead of 10 k, the WER continues decreasing, but the spatial distribution of the filters does not move much anymore.Changing the network structure by considering the maxpooling layer, however, has a small impact: for lower frequencies, some wider filters are appearing.This means that using maxpooling precludes the wide-band filters to be learned higher up in the network; nevertheless, the proportion of wide-band filters are much lower than what we obtained in the previous work, and the size of the bandwidth is smaller.
A consistent pattern emerges in the filter distribution shape across various runs within the spectral content, up to approximately 4 kHz.Below 1 kHz, a bump of very small (and consequently precise) filters occurs in the filterbank between 200 Hz and 800 Hz; this frequency range corresponds to the first formant (see Section 2.4).From 1 kHz to 5 kHz, the bandwidths of the filters are all around 400 Hz.Above 5 kHz, the filters seem to learn something different in every run.Ravanelli and Bengio [54] showed that using the SincNet model instead of simple CNN converged faster, performed better, and was more interpretable.We hypothesize that if we train a SincNet model and a CNN model with the same number of layers from scratch, combined with a pre-trained context network, the SincNet model would also show those characteristics.
This can be tested by performing and comparing four experiments: The first experiment retrains the wav2vec2 encoder network from scratch to gauge its performance without the pre-training information, setting a baseline for encoder relearning while keeping the pretraining information of the context network.Second, we use the baseline CNN with a layer of the same size as the trainable filters to be comparable with Ravanelli and Bengio [54], but this time combined with a pre-trained context network.The third experiment consists of training a large SincNet encoder.The baseline CNN structure is combined with a layer of trainable filters and trained with the pre-trained transformers.Finally, we compare these experiments with a SincNet structure containing fewer CNN layers (than the structure used in the experiments of Section 4.1).

Results
The convergence speed can be analyzed in the training curve.Figure 6 shows that in terms of training, the SincNet structure converges faster towards an equilibrium compared to a pure CNN structure.Concerning the validation, this is true in the very beginning, especially concerning the CNN with the same shape as SincNet (red curve) up to 2000 updates.We based our analysis of the convergence speed on the update metric to align the loss on the amount of data that go through a forward-backward pass.However, to be complete, Table 3 details the time that every experiment takes and the number of parameters that are contained in the front-end for purposes of comparison.The performance of the different experiments is summarized in Table 4. Overall, the 95% confidence interval of the large SincNet structure performance overlaps widely with the baseline experiment; this means that the performance with SincNet is not significantly better than the baseline CNN.However, between the SincNet shape CNN and the SincNet implementation, which correspond to the experiments performed in [54], the overlap is less important and SincNet performs slightly better.Finally, concerning the interpretability, Figure 7 illustrates the normalized sum of the filters in the CNN implementation and SincNet after training.In both cases, "bumps" occur around 700 Hz and 1.4 kHz, which correspond to typical first-and second-formant frequencies.In [54], those interpretable elements in the normalized filter sum were more distinguishable in the SincNet structure than in the CNN structure.Further, for frequencies above 4 kHz, the SincNet filters more efficiently reduce the amount of information recorded in that area than the CNN structure.
To summarize, in the context of a self-supervised model with a pre-trained context network, using a SincNet encoder converges faster and performs as well as the baseline CNN trained from scratch and better than the similar shape CNN.Concerning the interpretability, despite being more interpretable by design, the SincNet structure does not seem to better emphasize interpretable signal components than a CNN using 40 filters.

Can Trainable Filters Be Incorporated during Self-Supervision?
Another way to train the filters using wav2vec2 is to incorporate the filters into the model in the self-supervision phase.

Hypothesis
Although more computationally demanding, using the self-supervision training path is more faithful to the original wav2vec2 training path and it enables the filters to train on two successive tasks.We hypothesize that continuing the self-supervision task while keeping the context network frozen would lead the filters to learn a similar distribution to the first experiment results.Moreover during self-supervised learning we can either let the filters train or keep them frozen and let them only train during fine-tuning.In this case, we expect the rest of the network to adapt itself to the frozen filter distribution and we do not expect the filters to differ much from what they previously learned during further fine-tuning.However, we expect the free filters to show slightly better performance since they are able to adapt to the fine structure in the frequency range.

Self-Supervised Learning
The first step of this experiment consists of further training the model through the contrastive loss task (see Equation ( 1)).To be consistant with the original self-supervision dataset we included the two other training subsets of LibriSpeech (see Table 1).
An important parameter to fix is the number of updates needed to obtain a comparable performance as the baseline pre-trained model.Since self-supervision implies a performance measure that is not based on labels, the best comparison measures are the loss and the accuracy.
Table 5 compares the accuracy and loss of the training and validation curves of the baseline and the wav2vec2 model integrating the trainable filters.Using 10 k updates, the accuracy and loss differ from 10-20% from the baseline loss and accuracy.With 100 k updates, the final loss and accuracy match the baseline results up to 1% of error.
Table 5.Comparison of the loss and accuracy of the baseline pre-trained model and the model after a further pre-training for both frozen and trainable filters.After 10 k updates, the model already performs well but not as good as the baseline, while after 100 k, the model reaches the performance of the baseline in terms of loss and accuracy.

Number of Updates
Accuracy

Fine-Tuning
The second step consists of fine-tuning this self-supervised model.In this fine-tuning experiment, the learning rates do not have to be disentangled as we did in the first type of training path, since the whole model has been pre-trained.

Results
The final performance is given in Table 6.For a fine-tuning of 10 k updates, the performance is best when the filters have been through a self-supervision task before fine-tuning (see Table 2).Since during self-supervised training, the filters are able to train, they can be visualized both after pre-training and fine-tuning (Figure 8).For the trainable filters, below 1 kHz, a couple of filters become very narrow-band, then around 1 kHz, 2 kHz and 3 kHz, the filters show a narrow-band bump in their structure as if they sought slightly more finegrained structures at those specific frequencies.Filters do barely move between during self-supervised training and fine-tuning.This means that the rest of the model is probably more inclined to move towards a better suited equilibrium to minimize the loss.Moreover, globally the shape of the filter distribution shows similar behavior between the two different training paths.
In summary, using only supervised fine-tuning gives a broader flexibility for experiments where several parameters have to be changed, since only a fine-tuning run has to be adapted.Training through both self-supervision pre-training and supervised fine-tuning is more time and resource consuming, but better corresponds to the general idea of using both self-supervision and fine-tuning for training a model.Overall, the filters, when able to train before the transformer adaptation, tend to learn similar patterns.An observation in conflict with our previous work (Section 2.5) is that no wide-band filters appear within the current implementation.Figure 1 showed that some filters learned a very broad-band structure, while in Figure 5, for example, no such wide-band structures appear.Two hypothesis were apparent to explain why those filters could potentially not appear: the number of filters could be too low within the context of a self-supervised model and the freezing of the transformers by decoupling the learning rate scheduler could possibly preclude those filters from appearing.

The Hypothesis of Too Few Filters Hypothesis
The first hypothesis suggests that a limited number of filters may hinder the emergence of wide-band structures.Additionally, the absence of maxpooling precludes the appearance of wide-band filters.This hypothesis can be tested by increasing the number of filters.If the number of filters is insufficient for the manifestation of wide-band filters, a larger number of them should lead to the emergence of these filters.

Results
Figure 9 illustrates the structure that the filters learn when initialized to 100, 80, 60, and 40 filters.Globally, below 1 kHz, filters learn a very narrow-band frequencyspecific filters, above 1 kHz, filters have a bandwidth around 400-500 Hz for all the different numbers of initialized filters.Compared to the previous work, below 1 kHz, the number of very narrow-band filters is much higher in this experiment.
A few wide-band filters do appear when the model is initialized with a high number of filters (80-100), For a smaller number of filters, (40, for example), the wide-band structures as we had in our previous work (Section 2.5) do not appear anymore.Concerning the convenient number of filters to use based on this filter distribution analysis, 40 filters seem to be a convenient number to describe the whole frequency range.This corresponds with the conclusions of our previous paper and it is consistent with choices made in the literature, e.g., [50].
To be complete, we also computed the performance for this model with the different numbers of filters, summarized in Table 7. Similarly to the observations made in our previous work [12], the performance increases slightly with the number of filters we initialize.

The Transformers Precluding the Filters to Learn Wide-Band Hypothesis
The second hypothesis for why the wide-band filters do not appear is linked to the training path: the transformers are first frozen before being able to train in parallel to the filters, while in the previous work, the whole network trained together.Until now, all experiments have been conducted by first keeping the transformers frozen for a given amount of time in order to train only the encoder network while keeping the transformers fixed.In order to verify this hypothesis, we perform an experiment where all learning rate schedulers start the warm-up period at the same time.

Results
When filters and transformers are free to train jointly from the pre-trained transformer version, the filter distribution does not learn wide-band structures and, moreover, it shows a similar distribution to previous experiments (Figure 10).This means that, used with a pre-trained transformer, the filters do not tend to get a wide-band structure as we had in (Section 2.5).
In summary, when incorporating SincNet into a self-supervised model, the obtained filter distribution does not correspond to the results of our previous work.Only a very small number of wide-band filters appear when enlarging the total number of filters.Besides, the experiments confirmed that the number of filters needed to cover the frequency spectrum in ASR tends to 40.We conclude that a pre-trained context network probably encodes the combination of those wide-band structures, precluding those structures from appearing on the trainable filter layer.

Conclusions
This study proposes an integration of two physiologically grounded concepts: trainable filters and self-supervision.It begins by delineating these concepts in Section 2 and subsequently elaborates on their joint training using a self-supervised context network while concurrently training a new front-end in Section 3. Section 4 outlines a logical sequence of experiments aimed at exploring the behavior of the trainable filters within this framework.
In the first experiment, we demonstrate that a new encoder can be trained while retaining the information acquired during pre-training in the transformer, avoiding a catastrophic forgetting of weights and resulting in performance close to the state of the art.Additionally, substituting the CNN encoder with a SincNet encoder sheds light on the information expectation of the transformer from the CNN, emphasizing the interest in frequency information at the trainable filter layer.
The second experiment illustrates that, when trained from scratch, the SincNet encoder converges more rapidly than the CNN.However, there is no significant performance improvement, and interpretability across the entire frequency spectrum does not reveal more speech artifacts compared to a CNN.
In the third experiment, we observe that the sole fine-tuning training path offers greater flexibility and is better suited for conducting experiments.While the pre-train-finetune training path aligns more with the original concept of self-supervision, it necessitates a larger dataset and substantially more time for the pre-training phase.
Lastly, in the fourth experiment, we investigate why wide-band filters cease to emerge.Through several experiments analyzing behavior with additional filters and employing different training paths, it appears that the pre-trained context network inhibits the emergence of wide-band filters in the initial layer.
Overall, this study demonstrates the feasibility of integrating and fine-tuning stateof-the-art networks with physiologically plausible models.Furthermore, the utilization of decoupled learning rate schedulers enables the fine-tuning of specific parts of the model.We posit that implementing more intricate physiological models and leveraging this approach can facilitate a deeper understanding of how physiological mechanisms may evolve to better interpret external inputs in the encoder network.

Figure 1 .
Figure 1.The behavior of filters when training with 60 filters initialized to the mel distribution.The gray bars illustrate the filter initialization with the dots representing the central frequency (on the x-axis) and the bar representing the bandwidth.The blue bars are the filters after training.We observe that two structures are formed: a narrow-band structure, close to the gray filters and covering the whole frequency domain, and a wide-band structure.

Figure 2 .
Figure 2. A schematic overview of the original SincNet implementation model used as baseline in our previous work, the wav2vec2 fine-tuning path and the proposed fine-tuning path in this work.Based on compositionality capacity of networks, we combined the feature extractor of the original SincNet model with the pre-trained transformer of wav2vec2.

Figure 3 .FirstFigure 4 .
Figure 3. Training curve of the long-run training of 100 k updates.(a) depicts the WER of the validation set (dev-other), (b) contains the negative log likelihood of the loss,and (c,d) depict the learning rate evolution of the feature extractor and context network, respectively.

Figure 5 .
Figure 5.Comparison of filter distribution for different run lengths and internal SincNet structure.Using maxpooling within the filter structure makes some wide-band filters appear in low frequency ranges.4.2.Does a Physiologically Adapted Front-End Perform as Well as a CNN in a Pre-Trained Model?4.2.1.Hypothesis

Figure 7 .
Figure 7.Comparison of the content of the normalized filter sum using SincNet filters and using a same-size generic CNN.

Figure 8 .
Figure 8. Filter distribution after the self-supervised (ss) training and fine-tuning (ft) of the wav2vec2 model.Globally, filters do not tend to move significantly during fine-tuning when they have been incorporated during pre-training.4.4.Do Wide-Band Filters Appear in Some Other Training or Model Configurations?

Figure 9 .
Figure 9. Distribution of the different filters in the frequency domain.The dots represent the central frequencies and the lines represent the bandwidths, similarly to Figure 1.The number of initial filters are (a) 100, (b) 80, (c) 60, and (d) 40.

Table 1 .
Summary of LibriSpeech dataset Can Trainable Filters Replace the Encoder CNN in an Already Pre-Trained Model?

Table 3 .
Table summarizing the time of each experiment on 4 parallel GPUs and the number of parameters contained in the feature extractor.Note that this number of parameters is quite small compared to the rest of the network (90 M parameters).

Table 4 .
Performance of the different experiments on the capacity of SincNet (SN) to replace the initial CNN structure with a 95% confidence interval, assuming the result is beta distributed.

Table 6 .
Performance of model using trainable filters in pre-training (100 k updates) and fine-tuning (10 k updates).

Table 7 .
Performance on the dev-clean subset of LibriSpeech for different numbers of filters.