Audlet Filter Banks: A Versatile Analysis/Synthesis Framework Using Auditory Frequency Scales

Featured Application: The proposed framework is highly suitable for audio applications that require analysis–synthesis systems with the following properties: stability, perfect reconstruction, and a flexible choice of redundancy. Abstract: Many audio applications rely on ﬁlter banks (FBs) to analyze, process, and re-synthesize sounds. For these applications, an important property of the analysis–synthesis system is the reconstruction error; it has to be minimized to avoid audible artifacts. Other advantageous properties include stability and low redundancy. To exploit some aspects of auditory perception in the signal chain, some applications rely on FBs that approximate the frequency analysis performed in the auditory periphery, the gammatone FB being a popular example. However, current gammatone FBs only allow partial reconstruction and stability at high redundancies. In this article, we construct an analysis–synthesis system for audio applications. The proposed system, referred to as Audlet , is an oversampled FB with ﬁlters distributed on auditory frequency scales. It allows perfect reconstruction for a wide range of FB settings (e.g., the shape and density of ﬁlters), efﬁcient FB design, and adaptable redundancy. In particular, we show how to construct a gammatone FB with perfect reconstruction. Experiments demonstrate performance improvements of the proposed gammatone FB when compared to current gammatone FBs in terms of reconstruction error and stability, especially at low redundancies. An application of the framework to audio source separation illustrates its utility for audio processing.


Introduction
Time-frequency (TF) transforms like the short-time Fourier or wavelet transforms play a major role in audio signal processing.They allow any signal to be decomposed into a set of elementary functions with good TF localization and perfect reconstruction is achieved if the transform parameters are chosen appropriately (e.g., [1,2]).The result of a signal analysis is a set of TF coefficients, sometimes called sub-band components, that quantifies the degree of similarity between the input signal and the elementary functions.In applications, TF transforms are used to perform sub-band processing, that is, to modify the sub-band components and synthesize an output signal.De-noising techniques [3,4], for instance, analyze the noisy signal, estimate the TF coefficients associated with noise, delete them from the set of TF coefficients, and synthesize a clean signal from the set of remaining TF coefficients.Lossy audio codecs like MPEG-2 Layer III, known as MP3 [5], or advanced audio coding (AAC) [6,7] quantize the sub-bands with a variable precision in order to reduce the digital size of audio files.In audio transformations like time-stretching or pitch-shifting [8,9], the phases of sub-band components are processed to ensure a proper phase coherence.As a last example, applications of audio source separation [10][11][12] or polyphonic transcriptions of music [13] rely on the non-negative matrix factorization scheme: the set of TF coefficients is factorized into several matrices that correspond to various sources present in the original signal.Each source can then be synthesized from its matrix representation.In these applications, the short-time Fourier transform (STFT) is mostly used, although modified discrete cosine transforms (MDCTs) are usually preferred in audio codecs.
Because sub-band processing may introduce audible distortions in the reconstructed signal, important properties of the analysis-synthesis system include stability (i.e., the coefficients are bounded if and only if the input signal is bounded), perfect reconstruction (i.e., the reconstruction error is only limited by numerical precision when no sub-channel processing is performed), resistance to noise, and aliasing suppression in each sub-band (e.g., [14,15] Chap.10).Furthermore, in all applications, a low redundancy (i.e., a redundancy between 1 and 2) lowers the computational costs.
TF transforms are usually implemented as filter banks (FBs) where the set of analysis filters defines the elementary functions and the set of synthesis filters allows signal reconstruction.The TF concentration of the filters together with the downsampling factors in the sub-bands define the TF resolution and redundancy of the transform.FBs come in various flavors and have been extensively treated in the literature (e.g., [16][17][18][19]).The mathematical theory of frames constitutes an interesting alternative background for the interpretation and implementation of FBs (e.g., [20][21][22]).Gabor frames (sampled STFT [2,23]), for instance, are widespread in audio signal processing.
For certain applications, such as audio coding [5][6][7], audio equalizers [24], speech processing [25], perceptual sparsity [26,27], or source separation [11,12,28,29], exploiting some aspects of human auditory perception in the signal chain constitutes an advantage.One of the most exploited aspects of the auditory system is the auditory frequency scale, which is a simple means to approximate the frequency analysis performed in the auditory system [30].Generally, the auditory system is a complex and in many aspects nonlinear system (for a review see, e.g., [31]).Its description ranges from simple collections of linear symmetric bandpass filters [32] through collections of asymmetric and compressive filters [33] to sophisticated models of nonlinear wave propagation in the cochlea [34].Because nonlinear systems may complicate the inversion of the signal processing chain (e.g., [35,36]), linear approximations of the auditory system are often preferred in audio applications.In particular, gammatone filters approximate well the auditory periphery at low to moderate sound pressure levels [37,38] and are easy to implement as FIR or IIR filters [32,[39][40][41][42][43].
Various analysis-synthesis systems based on gammatone FBs have been proposed for the purpose of audio applications (e.g., [35,39,40,44]).However, these systems do not satisfy all requirements of audio applications as, even at high redundancies, they only achieve a reconstruction error described as "barely audible".This error becomes clearly audible at low redundancies.In other words, these systems do not achieve perfect reconstruction.To our knowledge, a general recipe for constructing a gammatone FB with perfect reconstruction at redundancies close to and higher than one has not been published yet.
In this article, we describe a general recipe for constructing an analysis-synthesis system using a non-uniform oversampled FB with filters distributed on an arbitrary auditory frequency scale, enabling perfect reconstruction at arbitrary redundancies.The resulting framework is named "Audlet" for audio processing and auditory motivation.The proposed approach follows the theoretical foundation of non-stationary Gabor frames [20,45] and their application to TF transforms with a variable TF resolution [46][47][48].This report extends the work reported in [20] (Section 5.1) by providing a full theoretical and practical development of the Audlet.
The manuscript is organized as follows.The next section briefly recalls the basics of non-uniform FBs, frames, and auditory frequency scales.Section 3 describes the theoretical construction of the Audlet framework.The practical implementation issues are discussed in Section 4 and Section 5 evaluates important properties and capabilities of the framework.

Notations and Definition
In the following, we consider signals in 2 (Z) as samples of a continuous signal with sampling frequency f s , with the Nyquist frequency of f N = f s /2.We denote the normalized frequency by ξ = f / f s , i.e., the interval [0, and the energy of a signal is defined from the inner product as ||x|| = x, x .The floor, ceiling, and rounding operators are • , • , and • , respectively.We denote the z-transform by Z: x[n] → X(z).By setting z = e 2iπξ for ξ ∈ (−1/2, 1/2], the z-transform equals the discrete-time Fourier transform (DTFT).Note that the frequency domain associated to the DTFT is circular and therefore, the interval Since we exclusively consider real-valued signals we deal with symmetric DTFTs, which allows us to process only the positive-frequency range.Finally, we denote the complex conjugation by an overbar, e.g., H.

Filter Banks and Frames
The general structure of a non-uniform analysis FB is presented in Figure 1 (e.g., [17]).It is a collection of K + 1 analysis filters H k (z), where H k (z) is the z-transform of the impulse response h k [n] of the filter, and downsampling factors d k , k ∈ {0 . . .K}, that divides a signal x into a set of K + 1 sub-band components y k , where The special case where all downsampling factors are identical, i.e., d k = D ∀ k ∈ {0 . . .K}, is referred to as a uniform FB.By analogy, a synthesis FB is a collection of K + 1 upsampling factors d k and synthesis filters G k (z) (see Figure 2) that recombines the sub-band components y k into an output signal x according to where , denoting the real part, and the factor of 2 are a consequence of considering the positive frequency range only.
A synthesis FB can be generalized to a synthesis system (shown in Figure 3), which is a linear operator S that takes as an input sub-band components y k and yields an output sequence x.For the synthesis operation, we use the notation S(•, (G k , d k ) k ), where (G k , d k ) k is the synthesis FB.An analysis FB is invertible or allows for perfect reconstruction if there exists a synthesis system S that recovers x from the sub-band components y k without error, i.e., x = x for all x ∈ 2 (Z).In other terms, the analysis-synthesis system ((H k , d k ) k , S) has the perfect reconstruction property.In practice, the implementation of that operation might introduce errors of the order of numerical precision.We use the mathematical theory of frames in order to analyze and design perfect reconstruction FBs (e.g., [20][21][22]).A frame over the space of finite energy signals 2 (Z) is a set of functions spanning the space in a stable fashion.Consider a signal x and an analysis FB (H k , d k ) k yielding y k .Then, an FB constitutes a frame if and only if 0 where A and B are called the lower and upper frame bounds of the system, respectively.The existence of A and B guarantees the invertibility of the FB.Several numerical properties of an FB can be derived from the frame bounds.In particular, the ratio √ B/A corresponds to the condition number [49] of the FB, i.e., it determines the stability and reconstruction error of the system.Furthermore, the ratio B/A characterizes the overall frequency response of the FB.A ratio B/A = 1, for instance, means a perfectly flat frequency response.This is often desired in signal processing because, in that particular case, the analysis and synthesis FB are the same.Specifically, the synthesis filters are obtained by time-reversing the analysis filters, i.e., G k (z) = H k (z).
The frame bounds A and B correspond to the infinimum and supremum, respectively, of the eigenvalues of the operator S(A(•, In practice, these eigenvalues can be computed using iterative methods (see Sections 3.2 and 3.3).

Auditory Frequency Scales
An important aspect of the auditory system to consider in auditory-motivated analysis is the frequency-to-place transformation that occurs in the cochlea.Briefly, when a sound reaches the ear it produces a vibration pattern on the basilar membrane.The position and width of this pattern along the membrane depend on the spectral content of the sound; high-frequency sounds produce maximum excitation at the base of the membrane, while low-frequency sounds produce maximum excitation at the apex of the membrane.This property of the auditory system can be modeled in a first approximation as a bank of bandpass filters, named "critical bands" or "auditory filters", whose center frequencies and bandwidths respectively approximate the place and width of excitation on the basilar membrane.The frequency and bandwidth of the auditory filters are nonlinear functions of frequency.These functions, called auditory frequency scales, are derived from psychoacoustic experiments (see e.g., [50], Chapter 3 for a review).The Bark, the equivalent rectangular bandwidth (ERB), and Mel scales are commonly used in hearing science and audio signal processing [30].To refer to the different frequency mappings we introduce the function F: f → Scale where f is frequency in Hz and Scale is an auditory unit that depends on the scale.The ERB rate, for instance, is [30] F ERB ( f ) = 9.265 ln 1 + f 228.8455 (4) and its inverse is The ERB (in Hz) of the auditory filter centered at frequency f is Expressions for the Bark and Mel scales are respectively provided in [51,52].For scales that do not specify a bandwidth function, like the Mel scale, we propose the following function: . This ensures a proper overlap between the filters' passband.

The Proposed Approach
This section describes the analysis FB and synthesis stage of the Audlet FB.The FB is entirely designed in the frequency domain, which simplifies the assessment of properties such as invertibility and the amount of aliasing.Note that the purpose of this section is to provide a mathematical framework for general FB regardless of the practical implications.The implementation of the Audlet framework is addressed separately in Section 4.

Analysis Filter Bank
The analysis FB consists of Audlet filters H k , k ∈ {1, . . ., K − 1}, a low-pass filter H 0 , and a high-pass filter H K .In total, it consists of K + 1 filters.The Audlet filters are defined by where w(ξ) is a prototype filter's shape centered at frequency 0. Any symmetric or asymmetric window is an eligible w.The main condition on w is that its frequency response must decay away from 0 on both sides.The parameters Γ k = βB scale ( f k ) and f k control the bandwidth and center frequency, respectively, of the filter H k .The parameter β allows for the filter bandwidths to be compressed/expanded.Note that when β = 1, the bandwidth of the filters H k deviates from the human auditory filters' bandwidth.
To determine K and construct the sets { f k } and {Γ k }, the first step consists in choosing an essential frequency range [ f min , f max ] ⊆ [0, f N ], a frequency mapping F Scale , and a filter density V ∈ R + of filters per Scale unit.The set {d k } is considered arbitrary for now.An optimal choice of downsampling factors d k is provided in Section 3.1.3.

Construction of the Set { f k }
The center frequency f 1 is given by and the subsequent f k 's are obtained iteratively by The iteration is processed as long as Note that f max should be slightly higher than the highest frequency of interest in the analyzed signals.Finally, f 0 = 0 and f K = f N .At this stage, the "restricted" frequency response of the FB (i.e, restricted to the filters H 1 , . . ., H K−1 ) is given by To obtain a perfect reconstruction system, the frequency response of the system should optimally cover the frequency range [0, f N ].However, this may not be the case for H 0 (r) (ξ) because the amplitude of the filter H 1 (and/or H K−1 ) may vanish at frequencies between 0 and f 1 (resp., between f K−1 and f N ).
To circumvent this problem, a low-pass filter H 0 and high-pass filter H K are included.

Construction of H 0 and H K
The purpose of the filters H 0 and H K is to stabilize the FB response H 0 by compensating for the potentially low amplitude of H (r) While the content in the frequency bands 0 and K might carry some perceptually relevant information, most applications will not modify the corresponding coefficients.Consequently, it is crucial that H 0 and H K are mostly concentrated outside [ f 1 , f K−1 ], but their time domain behavior is only of secondary importance.Nonetheless, we propose a construction that retains some smoothness in frequency and thus, by Fourier duality, h 0 and h K have appropriate decay.
There is no canonical method that provides optimal compensation and time localization for any valid set of Audlet parameters.In [46], for instance, plateau functions with raised cosine flanks were proposed.This method might result in additional ripples in H 0 if w is not a raised-cosine window.Alternatively, in [47], H 0 and H K were constructed from a set of virtual filters extending the FB beyond [ f 1 , f K−1 ].An adaptation of this method to the Audlet framework is unnecessarily complex and unintuitive.Instead, we propose the following.We define inv is nonnegative and has at least the same differentiability as w (taking the positive part (•) + is only necessary in the special cases considered in the remark below).However, any ripples in H (r) inv .To reduce this rippling effect and introduce strict band-limitation of H 0 and H K , we multiply H (r) inv with appropriately localized plateau functions P 0 and P K .Assume that 0 elsewhere, and The frequency f − p,s (resp.f + p,s ) defines the width of the plateau in P 0 (resp.P K ).The region ) defines the transition area of P 0 (P K ) (see Figure 4).The filters H 0 and H K are finally defined by their DTFTs as inv , and We propose selecting 0 ).This choice ensures that P 2 0 + P 2 K ≤ 1, preventing overcompensation, and is properly adapted to the scale used.By default, we set The intuition here is that from f 4 (resp.f K−4 ) onward, the restricted FB response H (r) 0 is expected to be stable already, and that the size of the transition area ensures a sufficiently smooth roll-off.It should be noted that, although the filters proposed above are chosen to be strictly band-limited, a similar construction with time-limited, but only approximately band-limited, filters is also conceivable, by smoothly truncating h 0 , h K instead of H 0 , H K .
Remark 1.The choice of raised cosine transition areas provides continuously differentiable P 0 , P K .If additional decay of h 0 , h K is desired, the construction of a compactly supported plateau function of arbitrary differentiability is standard, e.g., through convolution of a characteristic function with a smooth function.There are some corner cases in which one or both of the compensation filters h 0 , h K are unnecessary, namely if f 1 is very close to 0 (resp.f K−1 to f N ).In that case the maximum M should be computed over the interval 1, then the low-pass filter H 0 is not required.An analogous argument is valid for H K .
The total frequency response of the analysis FB (i.e., including the K + 1 filters) is then H 0 (ξ and the redundancy of the FB is The factor of 2 stems from the fact that coefficients in the 1-st to (K − 1)-th sub-bands may be complex valued.
, also called the alias domain representation [16,17].For ξ ∈ (−1/2; 1/2], X(z) reduces to the following ([20] Section 4) where for all j ∈ {0, . . ., D − 1}.The term H 0 in (14) represents the frequency response of the FB, while the terms H j , j = 0, represent the alias components.Thus, an alias-free system is obtained when H 0 = C > 0 and H j = 0, ∀j = 0.While this is not always achievable, choosing d k 's to be inversely proportional to the filters' bandwidth yields a close-to-optimal solution [19], i.e., For a targeted redundancy R t , combining (15) and ( 12) while disregarding the floor operator • leads to Since the H k values are strictly decaying away from f k with a bandwidth of Γ k , choosing d k 's according to (15) ensures an even distribution of the overall aliasing across channels.
Using (15) to derive d 0 and d K may result in a large amount of aliasing because H 0 and H K may feature large plateaus depending on f min and f max .We propose instead choosing d 0 and d K according to Note that R t controls the d k only for k = 1, . . ., K − 1, while the actual redundancy R depends on all d k , i.e., including d 0 and d K .As a result, the value of R is slightly larger than R t .

Invertibility Test
Overall, the design of an Audlet analysis FB involves a set of seven parameters: the perceptual scale, frequency range [ f min , f max ], filter shape w, filter density V and bandwidth factor β, and a target redundancy R t .To check that a given parameter set results in a stable and invertible system, three methods exist: 1.
An eigenvalue analysis of the linear operator corresponding to analysis with (H k , d k ) k followed by FB synthesis with (H k , d k ) k .The frame bounds A and B correspond to the smallest (infinimum) and largest (supremum) eigenvalues of the resulting operator, respectively.The largest eigenvalue can be estimated by numerical methods with reasonable efficiency but estimating the smallest eigenvalue directly is highly computationally expensive.In the next section we discuss an alternative method that consists in approximating the inverse operator and estimating its largest eigenvalue, the reciprocal of which is the desired lower frame bound A (see also Section 5 for an example frame bounds analysis).
While method 1 above can always be applied, the applicability of methods 2 and 3 depends on w.
, and , then the alias terms H j , j ∈ {1, . . ., D − 1} = 0.This setting corresponds to the painless case [53].This is the only case when method 2 can be applied.If w has no compact support but is mostly concentrated on [a, b] and decays outside, the alias terms H j , j ∈ {1, . . ., D − 1} exist but may be small compared to H 0 .In that case, method 3 can be applied.
In terms of computational costs, method 1 is by far the most demanding of the three.Still, if a certain parameters set is used over a large number of analyses, this one-time investment to determine invertibility easily pays off.However, the user must still be aware of the potential inaccuracies induced by numerical eigenvalue computation.

Synthesis Stage
The synthesis stage consists of a linear operator S((y k ) k ) mapping the sub-band signals y k to the output signal x (see Figure 3) such that the input signal x is recovered.For uniform analysis FBs, i.e., d k = D∀k, the operator S can be structured as in Figure 2. In that case, exact dual filter G k 's can be computed [54] with a factorization algorithm that generalizes [23].The synthesis is then performed by computing S((y k ) k , (G k , D) k ).
For non-uniform analysis FBs, we implement S using a conjugate gradient (CG) iteration [49,55,56].This is a very efficient iterative algorithm that is guaranteed to converge when (H k , d k ) k forms a frame, i.e., whenever stable perfect reconstruction is possible.Given the Hermitian operator S(A(x, (H k , d k ) k ), (H k , d k ) k ), the CG approximates the action of the inverse operator.For Hermitian operators, the CG converges monotonously to 0. In addition, for problems of size P, the CG is guaranteed to converge within P steps.In practice, convergence speed depends solely on the (potentially unknown) condition number of the linear problem at hand, which, in this case, equals √ B/A.Often, it is beneficial to use a preconditioning step to improve the condition number.We propose the operator F −1 diag(1/H 0 )F as preconditioner (see also [48,57,58]).A robust implementation of the appropriate preconditioned CG (PCG) algorithm is conceptually straightforward and was provided in [48].
In the following, we describe a heuristic variant of this PCG algorithm with asymmetric preconditioning, that enables efficient implementation even if the intermediate solutions (denoted by x j below) are only given in the time domain.Experimentally, Algorithm 1 was observed to converge in the same number of iterations as the robust implementation from [48] (divergence was observed only if the filters H k were set to uniformly distributed random noise).We denote the analysis of x with respect to the analysis FB (H thus represents analysis followed by synthesis.

Algorithm 1 Synthesis by means of conjugate gradients
To speed up convergence we use approximate dual filters as an initial choice for G k 's, We interpret G k 's as approximate dual filters because in the absence of aliasing (i.e., if H j = 0, ∀j = 0), the application of G k exactly cancels all ripples in the frequency response H 0 .
Note that in the painless case, evoked in Section 3.2, the operator S(A(x, (H k , d k ) k ),(G k , d k ) k ) equals the identity and thus, synthesis is performed simply by applying S(( Although this is not apparent from the iterative inversion scheme described above, the proposed synthesis stage acts in a similar fashion to an FB.More specifically, if D = lcm({d k } k ) and ( Hj , D) j is the equivalent uniform FB associated with (H k , d k ) k [16,20], then iterating the CG algorithm until convergence is equivalent to computing the FB synthesis with respect to the canonical dual FB of ( Hj , D) j , which is of the form ( Gj , D) j , for some sequences of filters ( Gj ) j (see [21]).Since convergence is achieved within numerical precision in a small number of CG steps we can assume that the proposed synthesis system is characterized by the properties of the filters ( Gj ) j .We cannot easily compute those filters, but it is well known that the ratio of the optimal frame bounds B/A (see Section 3.2) is closely related to the similarity of a system and its canonical dual [59].If B/A ≈ 1, then we can expect Gj ≈ Hj , for all j.Since each Hj is just a delayed version of some H k , the time-and frequency-domain localization of the synthesis system matches that of the analysis system.
For larger values of B/A, the duality of ( Hj , D) j and ( Gj , D) j implies that ( Gj , D) j has to account for the discrepancies of ( Hj , D) j [59].These considerations apply to any dual FB pair, Audlet or not.The Audlet FB is constructed in such a way that, given the prototype filter w and filter density V, the frequency response of (H k , d k ) k is as flat as possible, such that the B/A depends mostly on the presence of aliasing.The required aliasing compensation often implies a widening of the dual filters' essential support and essential passband, proportional to the amount of aliasing present.

Practical Issues
The general mathematical framework described in the previous section is valid for band-limited filters and more classical FIR filters.Although the impulse responses of band-limited filters are theoretically infinite, their decay can be controlled by design such that they can be truncated with a minor loss of precision.In our implementation, we instead choose an alternative approach similar to "fast Fourier transform (FFT) filter banks" proposed by Smith [60].We start by considering the input signal as a finite-length vector in R L , L ∈ N. In an overlap-add block-processing scheme like the one proposed in [46,60], such a sequence would be a single windowed block possibly zero-padded on both ends.In the offline setting assumed in this paper, the sequence represents the entire input signal.We discretize the continuous frequency ξ by assuming the sequence is one period of an L-periodic signal.This introduces circular boundary effects that can be diminished by zero padding (increasing L), provided the filters' impulse responses decay rapidly.Increasing L preserves the perfect reconstruction property.Such assumptions allow implementing the filtering, downsampling, and upsampling directly in the frequency domain using sampled frequency responses of analysis and synthesis filters H k and G k , respectively.The filtering with an analysis filter followed by downsampling is done using the standard point-wise product of the L-point FFT of the signal with a sampled frequency response, while the downsampling is achieved by folding the result to a sequence of length L/d k (manual aliasing) and performing L/d k -point inverse FFT (IFFT).Performing downsampling this way is exactly equivalent to time-domain downsampling by a factor of d k .Upsampling and filtering is achieved by taking a L/d k -point FFT of the sub-band, periodizing the result to length L followed by a point-wise product with the sampled frequency response of a synthesis filter.A final L-point IFFT brings the result back to the time domain.In this framework, working with strictly band-limited filters is even advantageous for two reasons.First, the frequency domain point-wise product can be restricted to the filter bandwidth and second, for band-limited filters, the parameters can be chosen such that the system is painless [53] (no aliasing is introduced by downsampling), for which the approximate dual filters from (20) are exact and thus achieve perfect reconstruction.

Code
We provide code for performing an Audlet analysis/synthesis as part of the Matlab/Octave "large time-frequency analysis toolbox (LTFAT)" toolbox [61,62] available at http://ltfat.github.io/.The analysis filters are generated by the function audfilters.The function allows to construct at will uniform or non-uniform Audlet FBs with integer or rational downsampling factors, thus offering flexibility in FB design.Rational downsampling factors can be achieved in the time domain by properly combining upsamplers and downsamplers (e.g., [19]).In LTFAT the sampling rate changes are directly performed in the frequency domain by periodizing and folding the Y k (z)'s, then performing an inverse DFT [63].This technique allows to achieve rational downsampling factors at low computational costs.The desired number of channels in the frequency range [ f min , f max ] can be set by specifying either K or V.The function audfilters also accepts parameters Scale, β, w, and R t .Currently, three scales (ERB-the default-as well as Bark and Mel) are available.Possible choices of w include (but are not limited to) Hann (default), Blackman, Nuttall, gammatone, or Gaussian.If R t is specified, c bw is inferred from R t according to ( 15)- (18).Otherwise c bw = 1.The analysis of a signal is performed by filterbank.The synthesis is performed by ifilterbankiter that implements Algorithm 1.In the painless case, the more computationally efficient synthesis can be achieved by first computing the exact synthesis FB with filterbankdual and then synthesizing the signal with ifilterbank.The function filterbankdual can also be used to check whether a given analysis FB qualifies for the painless case.
Example scripts to perform Audlet analyses/syntheses in various FB settings are provided as Supplementary Material (see Archive S1).The supplementary material also demonstrates the realization of iterative reconstruction.
Note that for real-time implementations using macro blocks like in [46], the overall redundancy depends also on the overlap between the blocks.For analysis or processing purposes, the sub-bands can be combined in an overlap-add manner closely approximating the true non-blocked sub-bands.The perfect reconstruction property within the blocks is preserved.

Computational Complexity
In [45,64] it was shown that the frequency-domain computation of an FB analysis In the painless case, the same analysis applies to the dual FB (G k , d k ) k .In general, every iteration of the CG has the complexity of FB analysis with (H k , d k ) followed by FB synthesis with (G k , d k ).For a given analysis system (H k , d k ), the number of iterations required for numerical convergence relies only on the frame bound ratio B/A and is completely independent of the signal under scrutiny (see also [48] for a visualization of convergence in various settings).

Evaluation
In this section we evaluate three important properties of the Audlet, namely its simple and versatile FB design, perfect reconstruction, and utility for audio applications that perform sub-channel processing.This evaluation comprises two parts: 1.
The construction of uniform and non-uniform gammatone FBs and examination of their stability and reconstruction property at low and high redundancies.For this purpose we replicated the simulations described in [44] (Section IV), which we consider as state of the art.

2.
The construction of various analysis-synthesis systems and use to perform sub-band processing.
For this purpose we considered the example application of audio source separation because it is intuitive, clear, and it easily demonstrates the behavior of the system when attempting modification of an audio signal.In this application we assess the effects of perfect reconstruction, bandwidth and shape of the filters, and auditory scale on the quality of sub-channel processing.
Scripts to reproduce the results of these evaluations are provided as Supplementary Material (see Archive S1).

Method
To construct a gammatone FB we use the prototype filter shape in the frequency domain of a complex gammatone filter of order γ centered at zero [42,43] An order γ = 4 and bandwidth factor α = 1.019 are usually chosen for emulating the human auditory filters [38].Because H GT,γ,α has an infinite support in the frequency domain, it can be truncated to become a compactly-supported gammatone filter shape by where is a threshold that allows to trade accuracy for computational efficiency.Once an essential frequency range and a filter density are chosen, the set of gammatone filters is generated according to (7) using w(ξ) = H GT,γ,α (e 2iπξ ) (or w = w csGT,γ,α if a painless system is desired) and β = 1.In Figure S2 in supplementary material, the frequency response and impulse response of two gammatone filters computed using ( 7) and ( 21) with center frequencies f k = 258 and 4000 Hz are displayed.
To examine the stability and reconstruction property of the proposed gammatone construction, we replicated the two simulations described in [44] (Section IV).The first simulation considers uniform FBs and the second simulation considers non-uniform FBs.The uniform FBs were evaluated by two measures: the ratio B/A and reconstruction error in terms of signal-to-noise ratio (SNR).The non-uniform FBs were evaluated only by the SNR.We compared our results to those from Strahl and Mertins (S-M) [44] where available.
The FB settings were as follows.The sampling rate was f s = 44.1 kHz, the essential frequency range was [ f min = 20 Hz, f max = 20000 Hz], and the scale was ERB.The gammatone filters in [44] were implemented as FIR filters, that is, the H k 's had an infinite frequency response.Thus, in the following simulations we used w(ξ) = H GT,4,1.019 (e 2iπξ ).In the uniform case, the downsampling factors d k 's, k ∈ {1, . . ., K − 1}, were set to a constant D; d 0 and d K were chosen according to (17) and (18), respectively.The evaluation was performed for all combinations of D ∈ {1, 2, 4, 6, 8} and K ∈ {51, 76, 101, 151} (our K corresponds to M + 1 in [44]).For the synthesis stage, Algorithm 1 was used with an error tolerance ε = 10 −9 .The ratio B/A was calculated for the full frequency range (i.e., from 0 to f N ) by iteratively computing the eigenvalues of the operator S associated with the system (H k , d k ) k [65].The SNR was calculated as ||x|| 2 /||x − x|| 2 in dB for x being a Gaussian white noise with a length of 30,000 samples.
In the non-uniform case, K was fixed to 51 and the FBs were evaluated for various values of R. We considered the oversampling factors O ∈ {1, 2, 4, 6, 8} used in [44].The relationship between [44], O was ∑ K−1 k=1 d −1 k , which considers only the real part of the coefficients, and h 0 and h K were not included.For simplicity, our FBs were designed for R t ∈ {2, 4, 8, 12, 16}.Similar to [44], two sets of d k were used to achieve the various R t 's.The first set consisted of d k 's that were inversely proportional to the filters' bandwidth according to (15).The second set was exactly that mentioned in [44] (Appendix B).For each set, d 0 and d k were chosen according to (17) and (18), respectively.All other FB and signal parameters were as in the uniform case.

Results and Discussion
The ratios B/A computed for the uniform gammatone FBs for various combinations of D and K and those reported in [44] (Figure 5) are listed in Table 1.For K = 51-101, our ratios B/A decreased with increasing K.This is a consequence of the increasing overlap between filters with increasing K, which in turn yields a flatter FB response.Increasing K to 151 did not result in smaller ratios.This can be attributed to the steep flank of H K in that setting.This can be counteracted by increasing the values of κ 1 and κ 2 when very small filter spacing V (equivalently, large K) is used.Our framework generally achieved comparable or smaller ratios than those from [44].Note that in [44], B/A was calculated for the frequency range from 0.06 to 17 kHz.These ratios, when calculated for the full frequency range, might have been larger than those listed in Table 1.Consequently, the actual difference between Audlet and S-M ratios might be larger than that reflected in Table 1.
Table 1.Ratios B/A for various combinations of D and K obtained for the proposed Audlet framework and reported in [44] (S-M).The SNRs achieved with our framework were 180 dB (or larger) for all tested combinations of D and K.The limit of 180 dB is the consequence of the error tolerance of 10 −9 in the PCG algorithm.In comparison, SNRs reported in [44] for D = 1 ranged between 30 and 72 dB and increased with increasing K (SNRs for other D's were not reported).
The SNRs computed for the non-uniform gammatone FBs for various R are listed in Table 2 together with those reported in [44].In all conditions, our framework achieved SNRs of at least 170 dB.In contrast, the system from [44] offered decent reconstruction (SNR ≥ 15 dB) only in configurations involving small downsampling factors (i.e., at large R).
Table 2. Signal-to-noise ratios (SNRs; in dB) obtained for the Audlet framework and reported in [44] (Figure 10) (S-M).Overall, we conclude that the reconstruction quality of currently available gammatone FB implementations deteriorates at low redundancies.This may hinder the quality of sub-channel processing in audio applications but, as it seems, the reconstruction quality can be improved by using the Audlet framework.
It might appear intriguing that we obtained larger SNRs than in [44] even in conditions with similar ratios B/A (compare the condition with K = 151 and D = 1 in Table 1).The good performance achieved by our framework can mostly be explained by the design of our synthesis stage.In contrast, most analysis-synthesis systems based on gammatone filters, such as [44], use synthesis filters that are time-reversed versions of the analysis filters, i.e., G k (e 2iπξ ) = H k (e 2iπξ ) that translates to g k [n] = h k [−n] in the discrete-time domain (e.g., [35,39,40]).Such a synthesis stage provides perfect reconstruction if and only if the frame bound ratio is equal to one [20].

Method
This experiment is an example application of the Audlet framework to audio source separation.Given a mixture of instrumental music and voice, we constructed various analysis-synthesis systems and separated the voice from the music.The systems were designed so as to assess the effects of perfect reconstruction, shape and bandwidth of the filter, and auditory scale on the quality of sub-channel processing at low, mid, and high redundancies.Four systems were implemented: trev_gfb: a state-of-the-art gammatone FB with approximate reconstruction (the acronym trev stands for "time reversal").The H k 's followed (7) with w(ξ) = w csGT,4,1.019(ξ) (22) with a threshold = 10 −5 .The synthesis filters G k (e 2iπξ ) = H k (e 2iπξ ).This corresponds to the baseline system used in audio applications like [11,28,29].Audlet_gfb: an Audlet FB with a gammatone prototype.The H k 's were computed as in trev_gfb but the synthesis stage was Algorithm 1.This system aims to compare to the baseline system and assess the effect of perfect reconstruction.Audlet_hann: an Audlet FB with a Hann prototype.This system aims to assess the effect of filter shape.STFT_hann: an STFT using a 1024-point Hann window.Synthesis was achieved by the dual window [2].
The time step was then adapted to match the desired redundancy R t .This corresponds to the baseline system used in most audio applications (e.g., [10,66]).This system aims to assess the use of an auditory frequency scale.
The effect of filter bandwidth was assessed by varying parameter β.Specifically, two values were tested: β ∈ {1, 1/6}.Using a value of β = 1 means a clear departure from auditory perception but may help better resolve spectral components, particularly at high frequencies where the auditory filters become really broad (see (6)).Accordingly, many audio applications that rely on constant-Q or wavelet transforms use 12 or more bins per octave (e.g., [46,63]).
The performance of all systems were evaluated at three redundancies: R t ∈ {1.1, 1.5, 4}.To this end, (15) was used with c bw adjusted such that R t was achieved.The quality of the separation was assessed by computing energy ratio-and perceptually-based objective measures according to [67].Energy ratio measures include the signal-to-distortion ratio (SDR) and signal-to-artifact ratio (SAR).Perceptual measures include the overall perceptual score (OPS) and target perceptual score (TPS).OPS assesses the general audio quality of the separation, while TPS assesses the preservation of the target.All measures were computed using the PEASS toolbox [67].
The signal mixture, shown in Figure 5a, was created by adding an instrumental music signal to a singing voice signal (target), shown in Figure 5b.The separation was performed by analyzing the mixture with the analysis FB, applying a binary TF mask to the sub-band components by point-wise multiplication, and computing the output signal from the modified sub-band components using the synthesis stage.This operation corresponds to the application of a frame multiplier in signal processing [9,68].In order to create the binary masks, the target signal was analyzed by the FB and the magnitude of the coefficients was hard thresholded with a threshold of -25 dB.The threshold value was varied between −40 and −20 dB in 5-dB steps.While the threshold value did affect the separation performance, all configurations were affected equally.The value of -25 dB was selected because it yielded good separation results for both the gammatone and Hann prototypes.Four masks were created in total, one for each analysis filter's shape and each β.The two masks for β = 1/6 are displayed in Figure 5c,d.Because the frequency resolution of the STFT does not match those of other FBs, an additional mask was computed for the STFT.

Results and Discussion
Figure 5e,f show the voice signal separated using Audlet_hann and Audlet_gfb, respectively, for R t = 4 and β = 1/6.The objective quality measures are listed in Table 3. Audio files are available on the companion webpage: http://www.kfs.oeaw.ac.at/audletFB.The following observations can be made.
First, system Audlet_gfb outperformed trev_gfb in most conditions.This demonstrates the role of perfect reconstruction in the quality of sub-channel processing.In other words, using the Audlet framework can improve the reconstruction quality.Note that for β = 1/6, the performance of trev_gfb improved with increasing R t and tended towards the performance of Audlet_gfb.This is due to the decrease in the amount of aliasing with increasing R t .For trev_gfb and R t = 4, very little aliasing was present and a good performance was achieved despite the approximate reconstruction of trev_gfb.
Second, the performance of Audlet_hann was comparable to that of Audlet_gfb in almost every measure.Although the filter shape did not play a major role in this particular example, it may have a larger impact in other applications.
Third, for all configurations, reducing β from 1 to 1/6 generally improved all quality measures.This suggests that, depending on the application, a departure from the human auditory perception may improve signal processing performance.In the present application, for instance, finely tuned filters are required to resolve all harmonics and therefore properly separate the signals.
Finally, while STFT_hann performed comparably to Audlet_hann at the highest R, the performance of STFT_hann dropped at mid and low redundancies.This suggests that using an auditory frequency scale may improve signal processing performance at low redundancies.Table 3. Objective quality measures for the separated voice signal.The signal-to-distortion ratio (SDR) and signal-to-artifact ratio (SAR) are in dB; the larger the ratio, the better the separation result.Overall perceptual score (OPS) and target perceptual score (TPS) are without unit; they indicate scores between 0 (bad quality) and 1 (excellent quality).The corresponding audio files are available on the companion webpage.STFT: short-time Fourier transform.

Conclusions
A framework for the construction of oversampled perfect-reconstruction FBs with filters distributed on auditory frequency scales has been presented.This framework was motivated by auditory perception and targeted at audio signal processing; it has thus been named "Audlet".The proposed approach has its foundation in the mathematical theory of frames.The analysis FB design is directly performed in the frequency domain and allows for various filter shapes, and uniform or non-uniform settings with low redundancies.The synthesis is achieved using a (heuristic) preconditioned conjugate-gradient iterative algorithm.The convergence of the algorithm has been observed for Audlet FBs that constitute a frame.This is possible even for redundancies close to 1.For higher redundancies and filters with a compact support in the frequency domain, a so-called "painless" system can be achieved.In this case the exact dual FB can be calculated, which in turn results in a computationally more efficient synthesis.
We showed how to construct a gammatone FB with perfect reconstruction.The proposed gammatone FB was compared to widely used state-of-the-art implementations of gammatone FB with approximate reconstruction.The results showed the better performance of the proposed approach in terms of reconstruction error and stability, especially at low redundancies.An example application of the framework to the task of audio source separation demonstrated its utility for audio processing.
Overall, the Audlet framework provides a versatile and efficient FB design that is highly suitable for audio applications requiring stability, perfect reconstruction, and a flexible choice of redundancy.The framework is implemented in the free Matlab/Octave toolbox LTFAT [61,62].

Figure 1 .
Figure 1.General structure of a non-uniform analysis filter bank (FB) (H k , d k ) k with H k being the z-transform of the impulse response h k [n] of the filter, also denoted as A(•, (H k , d k ) k ).

Figure 2 .
Figure 2. General structure of a non-uniform synthesis FB (G k , d k ) k , also denoted by S(•, (G k , d k ) k ).

Figure 3 .
Figure 3. General structure of a synthesis system.S is a linear operator that maps the sub-band components y k to an output signal x.

Figure 4 .
Figure 4. Illustration of the frequency allocations of the filters H 0 (red line) and H K (green line) given the restricted frequency response H (r) 0 (ξ) (dashed line) of an FB.
obtained as the sum of: (1) an L-point FFT (O(L log L)); (2) point-wise multiplication with the filter frequency responses (∑ k L k ); and (3) an L/d k -point IFFT (O(∑ k L/d k log L/d k )) for each filter, and similarly for FB synthesis with respect to (H k , d k ).Here, L k denotes the bandwidth of H k in samples.

Figure 5 .
Figure 5. Source separation for R t = 4 and β = 1/6 displayed as time-frequency (TF) plots: the magnitude of each sub-band component (in dB) as a function of time (in s).(a) Shows the mixture analyzed by a gammatone FB; (b) Shows the target (voice) analyzed by a gammatone FB; (c) Shows the binary mask obtained for Audlet_hann; (d) Shows the binary mask obtained for trev_gfb and Audlet_gfb-the black and white dots in the masks represent '1' and '0' entries, respectively; (e,f) Show the target separated by Audlet_hann and Audlet_gfb, respectively.