## 1. Introduction

Time-frequency (TF) transforms like the short-time Fourier or wavelet transforms play a major role in audio signal processing. They allow any signal to be decomposed into a set of elementary functions with good TF localization, and perfect reconstruction is achieved if the transform parameters are chosen appropriately (e.g., [1,2]). The result of a signal analysis is a set of TF coefficients, sometimes called sub-band components, that quantifies the degree of similarity between the input signal and the elementary functions. In applications, TF transforms are used to perform sub-band processing, that is, to modify the sub-band components and synthesize an output signal. De-noising techniques [3,4], for instance, analyze the noisy signal, estimate the TF coefficients associated with noise, delete them from the set of TF coefficients, and synthesize a clean signal from the remaining coefficients. Lossy audio codecs like MPEG-2 Layer III, known as MP3 [5], or advanced audio coding (AAC) [6,7] quantize the sub-bands with a variable precision in order to reduce the digital size of audio files. In audio transformations like time-stretching or pitch-shifting [8,9], the phases of sub-band components are processed to ensure proper phase coherence. As a last example, applications of audio source separation [10,11,12] or polyphonic music transcription [13] rely on the non-negative matrix factorization scheme: the set of TF coefficients is factorized into several matrices that correspond to the various sources present in the original signal. Each source can then be synthesized from its matrix representation. In these applications, the short-time Fourier transform (STFT) is mostly used, although modified discrete cosine transforms (MDCTs) are usually preferred in audio codecs.

Because sub-band processing may introduce audible distortions in the reconstructed signal, important properties of the analysis–synthesis system include stability (i.e., the coefficients are bounded if and only if the input signal is bounded), perfect reconstruction (i.e., the reconstruction error is only limited by numerical precision when no sub-band processing is performed), resistance to noise, and aliasing suppression in each sub-band (e.g., [14,15], Chap. 10). Furthermore, in all applications, a low redundancy (i.e., a redundancy between 1 and 2) lowers the computational costs.

TF transforms are usually implemented as filter banks (FBs) where the set of analysis filters defines the elementary functions and the set of synthesis filters allows signal reconstruction. The TF concentration of the filters together with the downsampling factors in the sub-bands define the TF resolution and redundancy of the transform. FBs come in various flavors and have been extensively treated in the literature (e.g., [16,17,18,19]). The mathematical theory of frames constitutes an interesting alternative background for the interpretation and implementation of FBs (e.g., [20,21,22]). Gabor frames (sampled STFTs [2,23]), for instance, are widespread in audio signal processing.

For certain applications, such as audio coding [5,6,7], audio equalizers [24], speech processing [25], perceptual sparsity [26,27], or source separation [11,12,28,29], exploiting some aspects of human auditory perception in the signal chain constitutes an advantage. One of the most exploited aspects of the auditory system is the auditory frequency scale, which is a simple means to approximate the frequency analysis performed in the auditory system [30]. Generally, the auditory system is a complex and, in many aspects, nonlinear system (for a review see, e.g., [31]). Its description ranges from simple collections of linear symmetric bandpass filters [32], through collections of asymmetric and compressive filters [33], to sophisticated models of nonlinear wave propagation in the cochlea [34]. Because nonlinear systems may complicate the inversion of the signal processing chain (e.g., [35,36]), linear approximations of the auditory system are often preferred in audio applications. In particular, gammatone filters approximate the auditory periphery well at low to moderate sound pressure levels [37,38] and are easy to implement as FIR or IIR filters [32,39,40,41,42,43].
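As a toy illustration of that last point, a fourth-order gammatone filter can be realized as an FIR filter by directly sampling the continuous impulse response $g(t) = t^{3}\,e^{-2\pi b t}\cos(2\pi f_c t)$, where the bandwidth parameter $b$ is commonly tied to the ERB at the center frequency $f_c$. The sketch below (NumPy; all parameter values are illustrative choices, not a design from this article) builds one such filter and locates the peak of its magnitude response:

```python
import numpy as np

def gammatone_fir(fc, fs, n_taps=1024, order=4):
    """FIR gammatone filter obtained by sampling the impulse response
    g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t),
    with b = 1.019 * ERB(fc), a common bandwidth choice."""
    t = np.arange(n_taps) / fs
    erb = 24.7 * (1 + 4.37 * fc / 1000)   # ERB in Hz at fc (Glasberg-Moore)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(np.fft.rfft(g)))  # normalize peak gain to 1

fs, fc = 16000, 1000.0
h = gammatone_fir(fc, fs)
freqs = np.fft.rfftfreq(len(h), 1 / fs)
peak_freq = freqs[np.argmax(np.abs(np.fft.rfft(h)))]
# The magnitude response peaks close to the center frequency fc.
```

The filter is bandpass with its maximum gain near $f_c$; IIR approximations (as in the implementations cited above) achieve the same response at much lower cost.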

Various analysis–synthesis systems based on gammatone FBs have been proposed for the purpose of audio applications (e.g., [35,39,40,44]). However, these systems do not satisfy all requirements of audio applications as, even at high redundancies, they only achieve a reconstruction error described as “barely audible”. This error becomes clearly audible at low redundancies. In other words, these systems do not achieve perfect reconstruction. To our knowledge, a general recipe for constructing a gammatone FB with perfect reconstruction at redundancies close to and higher than one has not been published yet.

In this article, we describe a general recipe for constructing an analysis–synthesis system using a non-uniform oversampled FB with filters distributed on an arbitrary auditory frequency scale, enabling perfect reconstruction at arbitrary redundancies. The resulting framework is named “Audlet”, for audio processing and auditory motivation. The proposed approach follows the theoretical foundation of non-stationary Gabor frames [20,45] and their application to TF transforms with a variable TF resolution [46,47,48]. This report extends the work reported in [20] (Section 5.1) by providing a full theoretical and practical development of the Audlet.

The manuscript is organized as follows. The next section briefly recalls the basics of non-uniform FBs, frames, and auditory frequency scales. Section 3 describes the theoretical construction of the Audlet framework. Practical implementation issues are discussed in Section 4, and Section 5 evaluates important properties and capabilities of the framework.

## 2. Preliminaries

#### 2.1. Notations and Definition

In the following, we consider signals in ${\ell}_{2}(\mathbb{Z})$ as samples of a continuous signal with sampling frequency ${f}_{s}$ and Nyquist frequency ${f}_{N}={f}_{s}/2$. We denote the normalized frequency by $\xi = f/{f}_{s}$, i.e., the interval $[0,{f}_{N}]$ corresponds to $[0,1/2]$. The inner product of two signals $x,y$ is $\langle x,y\rangle = \sum_{n} x[n]\,\overline{y[n]}$ and the energy of a signal is defined from the inner product as $||x||^{2} = \langle x,x\rangle$. The floor, ceiling, and rounding operators are $\lfloor \cdot \rfloor$, $\lceil \cdot \rceil$, and $\lfloor \cdot \rceil$, respectively. We denote the z-transform by $\mathcal{Z}$: $x[n]\mapsto X(z)$. By setting $z = e^{2i\pi \xi}$ for $\xi \in (-1/2,1/2]$, the z-transform equals the discrete-time Fourier transform (DTFT). Note that the frequency domain associated with the DTFT is circular and, therefore, the interval $(-1/2,1/2]$ is considered circularly, i.e., $\xi \in \mathbb{R}$ is identified with $\xi - \lfloor \xi \rceil \in (-1/2,1/2]$. The same applies to $(-{f}_{N},{f}_{N}]$. Since we exclusively consider real-valued signals, we deal with symmetric DTFTs, which allows us to process only the positive-frequency range. Finally, we denote complex conjugation by an overbar, e.g., $\overline{H}$.

#### 2.2. Filter Banks and Frames

The general structure of a non-uniform analysis FB is presented in Figure 1 (e.g., [17]). It is a collection of $K+1$ analysis filters ${H}_{k}(z)$, where ${H}_{k}(z)$ is the z-transform of the impulse response ${h}_{k}[n]$ of the filter, and downsampling factors ${d}_{k},\; k\in \{0\dots K\}$, that divide a signal $x$ into a set of $K+1$ sub-band components ${y}_{k}$, where

$${y}_{k}[m] = (x \ast {h}_{k})[m{d}_{k}] = \sum_{n} x[n]\,{h}_{k}[m{d}_{k}-n].$$

The special case where all downsampling factors are identical, i.e., ${d}_{k}=D\phantom{\rule{0.166667em}{0ex}}\forall \phantom{\rule{0.166667em}{0ex}}k\in \left\{0\dots K\right\}$, is referred to as a uniform FB.
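In a discrete toy setting, the analysis step just described (filter, then keep every ${d}_{k}$-th sample) can be sketched as follows; the two FIR filters and the downsampling factors are arbitrary placeholders chosen for illustration, not a proposed design:

```python
import numpy as np

def analyze(x, filters, ds_factors):
    """Non-uniform analysis FB: convolve x with each h_k, then downsample by d_k."""
    return [np.convolve(x, h)[::d] for h, d in zip(filters, ds_factors)]

# Toy example: two crude low/high-pass filters with downsampling factor 2.
x = np.random.default_rng(0).standard_normal(64)
filters = [np.array([0.5, 0.5]), np.array([0.5, -0.5])]
ds = [2, 2]
y = analyze(x, filters, ds)
# y[k][m] corresponds to (x * h_k)[m * d_k].
```

Each sub-band `y[k]` is shorter than `x` by roughly the factor `d_k`, which is what makes low-redundancy settings attractive computationally.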

By analogy, a synthesis FB is a collection of $K+1$ upsampling factors ${d}_{k}$ and synthesis filters ${G}_{k}(z)$ (see Figure 2) that recombines the sub-band components ${y}_{k}$ into an output signal $\tilde{x}$ according to

$$\tilde{x}[n] = 2\,\Re\left(\sum_{k=0}^{K}\sum_{m} {y}_{k}[m]\,{g}_{k}[n-m{d}_{k}]\right),$$

where $\Re$ denotes the real part; taking the real part and the factor of 2 are a consequence of considering the positive frequency range only.

A synthesis FB can be generalized to a synthesis system (shown in Figure 3), which is a linear operator $\mathcal{S}$ that takes as input sub-band components ${y}_{k}$ and yields an output sequence $\tilde{x}$. For the synthesis operation, we use the notation $\tilde{\mathcal{S}}(\cdot ,{({G}_{k},{d}_{k})}_{k})$, where ${({G}_{k},{d}_{k})}_{k}$ is the synthesis FB. An analysis FB is invertible or allows for perfect reconstruction if there exists a synthesis system $\mathcal{S}$ that recovers $x$ from the sub-band components ${y}_{k}$ without error, i.e., $\tilde{x}=x$ for all $x\in {\ell}_{2}(\mathbb{Z})$. In other terms, the analysis–synthesis system $\left({({H}_{k},{d}_{k})}_{k},\mathcal{S}\right)$ has the perfect reconstruction property. In practice, the implementation of that operation might introduce errors of the order of numerical precision.
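As a minimal, self-contained illustration of perfect reconstruction, consider a finite-length (circular) toy case with no downsampling (${d}_{k}=1$) and filters covering the full frequency range, so that no factor of 2 is needed. The sketch below analyzes a signal with a bank of frequency-domain filters and reconstructs it exactly with the dual filters $\overline{H_k}/\sum_j |H_j|^2$; the Gaussian filter shapes and all sizes are arbitrary choices for the demo:

```python
import numpy as np

L, K = 128, 8                         # signal length and number of filters
rng = np.random.default_rng(1)
x = rng.standard_normal(L)

# Arbitrary smooth (Gaussian) filters on the DFT grid, covering all frequencies.
grid = np.arange(L)
cents = np.arange(K) * L / K
dist = np.minimum((grid[None, :] - cents[:, None]) % L,
                  (cents[:, None] - grid[None, :]) % L)   # circular distance
H = np.exp(-0.5 * (dist / (L / K)) ** 2)                  # shape (K, L)

total = np.sum(np.abs(H) ** 2, axis=0)    # must be > 0 at every frequency bin
X = np.fft.fft(x)
Y = [np.fft.ifft(X * Hk) for Hk in H]     # analysis: one sub-band per filter (d_k = 1)

G = np.conj(H) / total                    # dual filters: conj(H_k) / sum_j |H_j|^2
x_rec = np.fft.ifft(sum(np.fft.fft(yk) * Gk for yk, Gk in zip(Y, G))).real
```

Here `x_rec` matches `x` up to numerical precision, because $\sum_k H_k \overline{H_k} / \sum_j |H_j|^2 = 1$ at every frequency bin; with downsampling, the same idea requires the frame machinery described next.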

We use the mathematical theory of frames in order to analyze and design perfect-reconstruction FBs (e.g., [20,21,22]). A frame over the space of finite-energy signals ${\ell}_{2}(\mathbb{Z})$ is a set of functions spanning the space in a stable fashion. Consider a signal $x$ and an analysis FB ${({H}_{k},{d}_{k})}_{k}$ yielding ${y}_{k}$. Then, the FB constitutes a frame if and only if there exist $0<A\le B<\infty$ such that

$$A\,||x||^{2} \;\le\; \sum_{k=0}^{K}\sum_{m} |{y}_{k}[m]|^{2} \;\le\; B\,||x||^{2},$$

where $A$ and $B$ are called the lower and upper frame bounds of the system, respectively. The existence of $A$ and $B$ guarantees the invertibility of the FB. Several numerical properties of an FB can be derived from the frame bounds. In particular, the ratio $\sqrt{B/A}$ corresponds to the condition number [49] of the FB, i.e., it determines the stability and reconstruction error of the system. Furthermore, the ratio $B/A$ characterizes the overall frequency response of the FB. A ratio $B/A=1$, for instance, means a perfectly flat frequency response. This is often desired in signal processing because, in that particular case, the analysis and synthesis FBs are the same. Specifically, the synthesis filters are obtained by time-reversing the analysis filters, i.e., ${G}_{k}(z)={\overline{H}}_{k}(z)$.

The frame bounds $A$ and $B$ correspond to the infimum and supremum, respectively, of the eigenvalues of the operator $\tilde{\mathcal{S}}(\mathcal{A}(\cdot ,{({H}_{k},{d}_{k})}_{k}),{({H}_{k},{d}_{k})}_{k})$ associated with the system ${({H}_{k},{d}_{k})}_{k}$. In practice, these eigenvalues can be computed using iterative methods (see Section 3.2 and Section 3.3).
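For intuition, in a finite (circular) toy setting the frame operator can be formed explicitly as a matrix and its eigenvalues computed directly, rather than iteratively. The sketch below does this for a small two-channel FB (filters, lengths, and factors are illustrative placeholders); this particular toy FB happens to be tight, so the minimum and maximum eigenvalues coincide and $B/A=1$:

```python
import numpy as np

L = 32
filters = [np.array([0.5, 0.5]), np.array([0.5, -0.5])]   # toy 2-channel FB
ds = [2, 2]

# Analysis matrix: rows are the d_k-translates of each (circularized) filter.
rows = []
for h, d in zip(filters, ds):
    hc = np.zeros(L)
    hc[:len(h)] = h
    for m in range(L // d):
        rows.append(np.roll(hc, m * d))
Amat = np.array(rows)

# Frame operator S = A^H A; the frame bounds are its extreme eigenvalues.
S = Amat.conj().T @ Amat
eigs = np.linalg.eigvalsh(S)
A_bound, B_bound = eigs.min(), eigs.max()
```

A positive `A_bound` certifies invertibility; the ratio of the two bounds yields the condition number governing the accuracy of iterative reconstruction.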

#### 2.3. Auditory Frequency Scales

An important aspect of the auditory system to consider in auditory-motivated analysis is the frequency-to-place transformation that occurs in the cochlea. Briefly, when a sound reaches the ear, it produces a vibration pattern on the basilar membrane. The position and width of this pattern along the membrane depend on the spectral content of the sound: high-frequency sounds produce maximum excitation at the base of the membrane, while low-frequency sounds produce maximum excitation at the apex. This property of the auditory system can be modeled, in a first approximation, as a bank of bandpass filters, named “critical bands” or “auditory filters”, whose center frequencies and bandwidths respectively approximate the place and width of excitation on the basilar membrane. The center frequency and bandwidth of the auditory filters are nonlinear functions of frequency. These functions, called auditory frequency scales, are derived from psychoacoustic experiments (see, e.g., [50], Chapter 3 for a review). The Bark, equivalent rectangular bandwidth (ERB), and Mel scales are commonly used in hearing science and audio signal processing [30]. To refer to the different frequency mappings, we introduce the function $F$: $f\to \mathrm{Scale}$, where $f$ is frequency in Hz and $\mathrm{Scale}$ is an auditory unit that depends on the scale. The ERB rate, for instance, is [30]

$${F}_{\mathrm{ERB}}(f) = 21.4\,{\log}_{10}\left(1+\frac{4.37\,f}{1000}\right)$$

and its inverse is

$${F}_{\mathrm{ERB}}^{-1}(E) = \frac{1000}{4.37}\left({10}^{E/21.4}-1\right).$$

The ERB (in Hz) of the auditory filter centered at frequency $f$ is

$$\mathrm{ERB}(f) = 24.7\left(1+\frac{4.37\,f}{1000}\right).$$

Expressions for the Bark and Mel scales are provided in [51,52], respectively. For scales that do not specify a bandwidth function, like the Mel scale, we propose the following function: ${B}_{\mathrm{scale}}(f)=\frac{\partial \left({F}_{\mathrm{scale}}^{-1}\right)}{\partial f}({F}_{\mathrm{scale}}(f))$. This ensures a proper overlap between the filters’ passbands.
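These scale conversions are straightforward to implement. The sketch below transcribes the standard Glasberg–Moore ERB-rate and bandwidth formulas from the hearing literature and uses them to place filter center frequencies uniformly on the ERB scale (the sampling rate and filter count are illustrative choices):

```python
import numpy as np

def erb_rate(f):
    """ERB rate (in ERB units) at frequency f in Hz."""
    return 21.4 * np.log10(1 + 4.37 * f / 1000)

def erb_rate_inv(E):
    """Inverse mapping: ERB units back to frequency in Hz."""
    return (10 ** (E / 21.4) - 1) * 1000 / 4.37

def erb_bandwidth(f):
    """ERB (in Hz) of the auditory filter centered at f in Hz."""
    return 24.7 * (1 + 4.37 * f / 1000)

# Example: place K center frequencies uniformly on the ERB scale up to Nyquist.
fs, K = 16000, 24
centers = erb_rate_inv(np.linspace(erb_rate(0), erb_rate(fs / 2), K))
```

Spacing the centers uniformly in ERB units makes the resulting bank denser at low frequencies and sparser at high frequencies, mimicking cochlear frequency resolution.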

## 6. Conclusions

A framework for the construction of oversampled perfect-reconstruction FBs with filters distributed on auditory frequency scales has been presented. This framework was motivated by auditory perception and targeted at audio signal processing; it has thus been named “Audlet”. The proposed approach has its foundation in the mathematical theory of frames. The analysis FB design is directly performed in the frequency domain and allows for various filter shapes, and uniform or non-uniform settings with low redundancies. The synthesis is achieved using a (heuristic) preconditioned conjugate-gradient iterative algorithm. The convergence of the algorithm has been observed for Audlet FBs that constitute a frame. This is possible even for redundancies close to 1. For higher redundancies and filters with a compact support in the frequency domain, a so-called “painless” system can be achieved. In this case the exact dual FB can be calculated, which in turn results in a computationally more efficient synthesis.

We showed how to construct a gammatone FB with perfect reconstruction. The proposed gammatone FB was compared to widely used state-of-the-art implementations of gammatone FBs with approximate reconstruction. The results showed the better performance of the proposed approach in terms of reconstruction error and stability, especially at low redundancies. An example application of the framework to the task of audio source separation demonstrated its utility for audio processing.

Overall, the Audlet framework provides a versatile and efficient FB design that is highly suitable for audio applications requiring stability, perfect reconstruction, and a flexible choice of redundancy. The framework is implemented in the free Matlab/Octave toolbox LTFAT [61,62].