Gabor frames and deep scattering networks in audio processing

In this paper a feature extractor based on Gabor frames and Mallat's scattering transform, called Gabor scattering, is introduced. This feature extractor is applied to a simple signal model for audio signals, i.e. a class of tones consisting of a fundamental frequency, its multiples and a corresponding envelope. Different layers exhibit invariances to different signal features. We give a mathematical explanation for the first and the second layer, illustrated by numerical examples. Deformation stability of this feature extractor is shown by using a decoupling technique, previously suggested for the scattering transform of cartoon functions. Here it is used to show that the feature extractor is robust to changes in spectral shape and frequency modulation.


Introduction
Popular machine learning techniques such as (deep) convolutional neural networks (CNNs) have led to impressive results in several applications, e.g. classification tasks. In audio, data usually undergo some pre-processing, often called feature extraction, before being fed into the trainable machine. CNNs as feature extractors have been given a rigorous mathematical analysis by Mallat [6], leading to the so-called scattering transform, based on a wavelet transform in each network layer. Invariance and deformation stability properties of the resulting feature extractors were investigated in [6,2,1]. Wiatowski and Bölcskei extended the scattering transform by allowing different semi-discrete frames in each network layer, general Lipschitz-continuous non-linearities and pooling operators, see [9,8]. In this general network, the authors proved vertical translation invariance and deformation stability for band-limited functions.
In the current contribution we introduce a feature extractor called Gabor scattering, previously presented in the conference contribution [3], which is a scattering transform based on Gabor frames in each layer. Gabor frames are closely related to the short-time Fourier transform (STFT), see Section 2.2, which is commonly used in practice to analyse audio data. We then study invariance properties of Gabor scattering when applied to a common signal model for audio signals (Section 3) and use the same signal model to establish deformation bounds for a feature extractor based on Gabor scattering (Section 4). This approach is motivated by the observation that assuming more specific restrictions than band-limitation leads to more precise deformation estimates when dealing with signals with certain known properties. Thereby, we follow the idea presented in Grohs et al. [5] and use a decoupling technique, which proves deformation stability by combining the stability of the signal class introduced in Section 2.3 with a general property, namely contractivity, which is entirely due to the network architecture.

Deep Convolutional Neural Networks (DCNNs) and Invariance
In order to understand the mathematical construction used within this paper, we briefly introduce the principal idea and structure of a DCNN. DCNNs enable the computer to learn from observation data. Their structure is inspired by neurones and the human process of learning. Usually a network consists of several layers, namely an input layer, several hidden layers (since we consider deep CNNs, the number of hidden layers is supposed to be ≥ 2) and one output layer. As input we use the data that should be classified or, more generally, whose properties should be learnt. A hidden layer consists of several ingredients: first, the convolution of the data with a small weighting matrix, which can be interpreted as localization of certain properties of the input data. Similarly, neurones in our brain act only in a local environment when a certain stimulation appears. The next ingredient of the hidden layer is the application of a non-linearity function, also called activation function, which signals whether the information of this neurone is relevant to be transmitted. Furthermore, in order to reduce redundancy and increase invariance, pooling is applied. Due to these ingredients of the hidden layer, i.e. convolution with a weighting matrix, non-linearity function and pooling, certain invariances are generated [7].
In order to make concrete statements about the invariances generated by certain layers for a certain class of input data, one needs to develop a signal model, as we do in Section 2.3. Within our simple model for tones, convolution is performed with a low-pass filter and thus, in combination with pooling, temporal fine structure is averaged out, while frequency information is maintained. Hence, in our case, we have some invariance w.r.t. the envelope in the first layer. Moreover, after further convolutions, more temporal fine structure is removed and information on large scales is captured in higher layers. In our model the second layer is invariant w.r.t. the pitch and information about the envelopes is visible, cf. Sections 3 and 5.
Note that in a neural network, in particular in CNNs, the output, e.g. classification labels, is obtained after several concatenated hidden layers. In the case of a scattering network, the outputs of each layer are stacked together into a feature vector and further processing is necessary to obtain the desired result. Usually, after some kind of dimensionality reduction, cf. [10], this vector can be fed into a support vector machine or a general NN, which performs the classification task.
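The three ingredients of a hidden layer can be illustrated with a minimal sketch. The function name `hidden_layer`, the averaging kernel and the toy input are assumptions made for illustration only; the modulus is used as non-linearity because it is the one used later in this paper.

```python
def hidden_layer(signal, kernel, pool):
    """One toy hidden layer: convolution, modulus non-linearity, average pooling."""
    n, m = len(signal), len(kernel)
    # 'valid' convolution: only positions where the kernel fully overlaps the signal
    conv = [sum(signal[i + j] * kernel[m - 1 - j] for j in range(m))
            for i in range(n - m + 1)]
    # modulus as non-linearity (Lipschitz constant 1)
    activated = [abs(v) for v in conv]
    # non-overlapping average pooling with factor `pool`
    return [sum(activated[i:i + pool]) / pool
            for i in range(0, len(activated) - pool + 1, pool)]


# Hypothetical toy input: an oscillation whose sign changes are removed by the layer.
x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
averaging_kernel = [0.5, 0.5]
out = hidden_layer(x, averaging_kernel, pool=2)
```

In this toy case the oscillating input is mapped to a constant output, which is the kind of averaging of temporal fine structure described above.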

Gabor Scattering
Since Wiatowski and Bölcskei used general semi-discrete frames to obtain a wider class of window functions for the scattering transform (cf. [9,8]), it seems natural to consider a specific model used for audio data, to analyse Gabor frames for the scattering transform and to study the corresponding properties. We next introduce the basics of Gabor frames and refer to [4] for more details. A sequence $(g_k)_{k=1}^{\infty}$ of elements in a Hilbert space $H$ is called a frame if there exist positive frame bounds $A, B > 0$ such that for all $f \in H$

$$A\|f\|^2 \le \sum_{k=1}^{\infty} |\langle f, g_k\rangle|^2 \le B\|f\|^2. \quad (1)$$

If $A = B$, then we call $(g_k)_{k=1}^{\infty}$ a tight frame. In order to define Gabor frames we need to introduce two operators, i.e. the translation and the modulation operator of a function $f \in L^2(\mathbb{R})$:
• The translation (time shift) operator is defined as $T_x f(t) = f(t - x)$ for all $x \in \mathbb{R}$.
• The modulation (frequency shift) operator is defined as $M_\omega f(t) = e^{2\pi i t \omega} f(t)$ for all $\omega \in \mathbb{R}$.
Moreover, we can use these operators to express the short-time Fourier transform (STFT) of a function $f \in L^2(\mathbb{R})$ with respect to a given window function $g \in L^2(\mathbb{R})$ as

$$V_g f(x, \omega) = \langle f, M_\omega T_x g\rangle.$$

In order to reduce redundancy, we sample $V_g f$ on a separable lattice $\Lambda = \alpha\mathbb{Z} \times \beta\mathbb{Z}$, in time by $\alpha > 0$ and in frequency by $\beta > 0$. The resulting samples correspond to the coefficients of $f$ w.r.t. a Gabor system.

Definition 1. (Gabor System)
Given a window function $0 \neq g \in H$, where the Hilbert space $H$ is either $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$, and lattice parameters $\alpha, \beta > 0$, the set of time-frequency shifted versions of $g$

$$G(g, \alpha, \beta) = \{M_{\beta j} T_{\alpha k} g : j, k \in \mathbb{Z}\}$$

is called a Gabor system.
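As an illustration of Definition 1, the following sketch computes the coefficients of a finite discrete signal with respect to a Gabor system. It is a simplified finite analogue with circular indexing, not the setting of the paper; the names `gaussian_window`, `gabor_coefficients`, `hop` and `bins`, as well as the Gaussian window itself, are assumptions of this sketch.

```python
import cmath
import math


def gaussian_window(length, width):
    """Real Gaussian window centred in `length` samples (illustrative choice)."""
    centre = (length - 1) / 2.0
    return [math.exp(-math.pi * ((t - centre) / width) ** 2) for t in range(length)]


def gabor_coefficients(f, g, hop, bins):
    """Return c[j][k] = <f, M_{j/bins} T_{hop*k} g> with circular time indexing.

    `hop` plays the role of alpha (in samples) and 1/bins the role of beta
    (in cycles per sample); both names are assumptions of this sketch.
    """
    n = len(f)
    coeffs = []
    for j in range(bins):
        row = []
        for k in range(n // hop):
            total = 0j
            for t, g_t in enumerate(g):
                idx = (t + hop * k) % n  # position of window sample t after translation
                # conjugate of the atom: g is real, so only the modulation is conjugated
                total += f[idx] * g_t * cmath.exp(-2j * math.pi * j * idx / bins)
            row.append(total)
        coeffs.append(row)
    return coeffs


# Toy input: a complex exponential at 0.25 cycles per sample, i.e. bin 4 of 16.
signal = [cmath.exp(2j * math.pi * 0.25 * t) for t in range(64)]
c = gabor_coefficients(signal, gaussian_window(32, 8.0), hop=8, bins=16)
peak_bins = [max(range(16), key=lambda j: abs(c[j][k])) for k in range(8)]
```

For this pure exponential, the coefficient of largest modulus sits in the frequency bin of the exponential at every time position, which is the behaviour the sampled STFT is expected to show.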
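We proceed to introduce a scattering transform for Gabor frames. We base our considerations on [9] by using a triplet-sequence $\Omega = ((\Psi_\ell, \sigma_\ell, S_\ell))_{\ell \in \mathbb{N}}$, where $\ell$ is associated to the $\ell$-th layer of the network. Note that in this contribution we will deal with Hilbert spaces $L^2(\mathbb{R})$ or $\ell^2(\mathbb{Z})$; more precisely, in the input layer, i.e. the $0$-th layer, we have $H_0 = L^2(\mathbb{R})$ and, due to the discretization inherent in the Gabor transform, $H_\ell = \ell^2(\mathbb{Z})$ for $\ell > 0$. We briefly review the elements of the triplet:
• $\Psi_\ell$ is the Gabor frame $G(g_\ell, \alpha_\ell, \beta_\ell)$ of the $\ell$-th layer, whose atoms $g_\lambda = M_{\beta_\ell j} T_{\alpha_\ell k} g_\ell$ are indexed by the lattice $\Lambda_\ell = \alpha_\ell\mathbb{Z} \times \beta_\ell\mathbb{Z}$.
• A non-linearity function (e.g. rectified linear units, modulus function, see [9]) $\sigma_\ell : \mathbb{C} \to \mathbb{C}$ is applied pointwise and is chosen to be Lipschitz-continuous, i.e. $|\sigma_\ell(x) - \sigma_\ell(y)| \le L_\ell |x - y|$ for all $x, y \in \mathbb{C}$. In this paper we only use the modulus function with Lipschitz constant $L_\ell = 1$ for all $\ell \in \mathbb{N}$.
• Pooling depends on a pooling factor $S_\ell > 0$, which leads to dimensionality reduction. Mostly used are max- or average-pooling; some more examples can be found in [9]. In our context, pooling is covered by choosing specific lattices $\Lambda_\ell$ in each layer.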
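In order to explain the interpretation of Gabor scattering as a CNN, we write $I(g)(t) = \overline{g(-t)}$ and have $|\langle f, M_{\beta j} T_{\alpha k} g\rangle| = |f * I(M_{\beta j} g)|(\alpha k)$. Thus the Gabor coefficients can be interpreted as the samples of a convolution.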
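The interpretation of Gabor coefficients as samples of a convolution can be verified in a few lines. The derivation below is a sketch that assumes the involution $I(g)(t) := \overline{g(-t)}$ and uses only the definitions of $M_\omega$ and $T_x$; it shows that the convolution sample and the Gabor coefficient differ by a unimodular phase factor, so that their moduli coincide.

```latex
\begin{aligned}
\bigl(f * I(M_{\beta j} g)\bigr)(\alpha k)
  &= \int_{\mathbb{R}} f(t)\, \overline{(M_{\beta j} g)(t - \alpha k)}\, dt \\
  &= \int_{\mathbb{R}} f(t)\, \overline{e^{2\pi i \beta j (t - \alpha k)}\, g(t - \alpha k)}\, dt \\
  &= e^{2\pi i \alpha\beta jk} \int_{\mathbb{R}} f(t)\, \overline{(M_{\beta j} T_{\alpha k} g)(t)}\, dt
   = e^{2\pi i \alpha\beta jk}\, \langle f, M_{\beta j} T_{\alpha k} g \rangle .
\end{aligned}
```

Since the modulus is the only non-linearity used in this paper, the phase factor has no influence on the layer outputs.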

Definition 2. (Gabor Scattering)
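Let $\Omega = ((\Psi_\ell, \sigma_\ell, \Lambda_\ell))_{\ell \in \mathbb{N}}$ be a triplet-sequence, with ingredients explained above. Then the $\ell$-th layer of the Gabor scattering transform is defined as the output of the operator $f_{\ell-1} \mapsto f_\ell$ given by

$$f_\ell(k) := \sigma_\ell\bigl(\langle f_{\ell-1}, M_{\beta_\ell j} T_{\alpha_\ell k} g_\ell\rangle_{H_{\ell-1}}\bigr), \quad j, k \in \mathbb{Z}, \quad (2)$$

where $f_{\ell-1}$ is the output-vector of the previous layer and $f_\ell \in H_\ell$ $\forall \ell$.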
Taking the calculation steps of the previous layers into account, we can extend (2) to paths on index sets, $q := (q_1, ..., q_\ell) \in B^\ell$. Similar to [9], for each layer we use one atom of the Gabor frame in the subsequent layer as output-generating atom, i.e. $\phi_{\ell-1} := g_{\lambda_\ell^*}$, $\lambda_\ell^* \in \Lambda_\ell$. Since this element is the $\ell$-th convolution, it is an element of the $\ell$-th frame, but because it belongs to the $(\ell-1)$-th layer, its index is $(\ell - 1)$. We also want to introduce a countable set $Q := \bigcup_{\ell=0}^{\infty} B^\ell$ and the space $(\ell^2(\mathbb{Z}))^Q$ of sets $s := \{s_q\}_{q \in Q}$, $s_q \in \ell^2(\mathbb{Z})$ for all $q \in Q$. Now we can define the feature extractor $\Phi_\Omega(f)$ of a signal $f \in L^2(\mathbb{R})$ as in [9, Def. 3].

Definition 3. (Feature Extractor)
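Let $\Omega = ((\Psi_\ell, \sigma_\ell, \Lambda_\ell))_{\ell \in \mathbb{N}}$ be a triplet-sequence and $\phi_\ell$ the output generating atom for each layer. Then the feature extractor $\Phi_\Omega : L^2(\mathbb{R}) \to (\ell^2(\mathbb{Z}))^Q$ collects, over all layers and all paths $q \in Q$, the layer outputs convolved with the corresponding output generating atom, as in [9, Def. 3].
In the following section we introduce the signal model which we consider in this paper.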
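The layer structure of Definitions 2 and 3 can be illustrated with a small discrete sketch of two scattering layers. It is a simplified finite analogue, not the construction of the paper: indexing is circular, all window lengths and lattice parameters are arbitrary toy values, and the output generating atom is replaced by a fixed normalised Gaussian `phi` instead of an atom of the next layer's Gabor frame. All function and parameter names are assumptions of this sketch.

```python
import cmath
import math


def gauss(length):
    """Real Gaussian window of `length` samples, normalised to unit sum."""
    centre = (length - 1) / 2.0
    w = [math.exp(-math.pi * ((t - centre) / (length / 4.0)) ** 2) for t in range(length)]
    total = sum(w)
    return [v / total for v in w]


def gabor_modulus(f, g, hop, bins):
    """rows[j][k] = |<f, M_{j/bins} T_{hop*k} g>| with circular indexing (g real)."""
    n = len(f)
    rows = []
    for j in range(bins):
        row = []
        for k in range(n // hop):
            s = 0j
            for t, g_t in enumerate(g):
                idx = (t + hop * k) % n
                s += f[idx] * g_t * cmath.exp(-2j * math.pi * j * idx / bins)
            row.append(abs(s))  # modulus non-linearity
        rows.append(row)
    return rows


def smooth(row, phi):
    """Circular convolution of one channel with the low-pass atom phi."""
    n = len(row)
    return [sum(row[(k - t) % n] * p for t, p in enumerate(phi)) for k in range(n)]


def gabor_scattering(f, hop1=4, bins1=16, hop2=2, bins2=8):
    """Return the smoothed first- and second-layer outputs of a toy Gabor scattering."""
    g1, g2, phi = gauss(16), gauss(8), gauss(4)
    layer1 = gabor_modulus(f, g1, hop1, bins1)                    # one channel per frequency bin
    out1 = [smooth(row, phi) for row in layer1]                   # first-layer features
    layer2 = [gabor_modulus(row, g2, hop2, bins2) for row in layer1]
    out2 = [[smooth(r, phi) for r in channel] for channel in layer2]
    return out1, out2


# Toy input: a complex exponential at 0.25 cycles per sample (bin 4 of 16).
tone = [cmath.exp(2j * math.pi * 0.25 * t) for t in range(64)]
out1, out2 = gabor_scattering(tone)
```

For this stationary input, the first-layer output is concentrated in the channel of the exponential, and the second layer applied to that channel is concentrated at frequency zero, since the channel is constant in time.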

Musical Signal Model
Tones are among the smallest units and simplest models of an audio signal, consisting of one fundamental frequency $\xi_0$, corresponding harmonics $n\xi_0$ and a shaping envelope $A_n$ for each harmonic, providing a specific timbre. Further, since our ears are limited to frequencies below 20 kHz, we develop our model over finitely many harmonics, i.e. $\{1, ..., N\} \subset \mathbb{N}$.
The general model has the following form

$$f(t) = \sum_{n=1}^{N} A_n(t)\, e^{2\pi i \eta_n(t)},$$

where $A_n(t) \ge 0$ $\forall n \in \{1, ..., N\}$ and $\forall t$. For one single tone we choose $\eta_n(t) = n\xi_0 t$. Moreover, we can create a space of tones, denoted by $T$.

Gabor Scattering of Music Signals
In [2] it was already stated that, due to the structure of the scattering transform, the energy of the signal is pushed towards low frequencies, where it is then captured by a low-pass filter as output generating atom. In the current section we explain how Gabor scattering separates relevant structures of signals described by our signal model $T$. Due to the smoothing action of the output generating atom, each layer expresses certain invariances, which will be illustrated by numerical examples in Section 5. In Proposition 1, inspired by [2], we add some assumptions on the analysis window in the first layer $g_1$: where $\chi$ is the indicator function.
Equation (5) shows that for slowly varying amplitude functions A n , the first layer mainly captures the contributions near the frequencies of the tone's harmonics.Obviously, for time-sections during which the envelopes A n undergo faster changes, such as during a tone's onset, energy will also be found outside a small interval around the harmonics' frequencies and thus the error estimate (6) becomes less stringent.
Proof. Step 1 - Using the signal model for tones as input, interchanging the finite sum with the integral and performing the substitution $t' = t - \alpha k$, we obtain

$$\sum_{n=1}^{N} \int_{\mathbb{R}} A_n(t' + \alpha k)\, g(t')\, e^{2\pi i (n\xi_0 - \beta j)(t' + \alpha k)}\, dt'.$$

After performing a Taylor series expansion for
Hence we choose $n_0 = \operatorname{argmin}$ and split the sum to obtain
Step 2 - We now bound (7):
For the second bound, i.e. the bound of Equation (8), we use the decay condition on $\hat{g}$, thus
Next we split the sum into $n > n_0$ and $n < n_0$. We estimate the error term for $n > n_0$:
Since
Due to symmetry we get
Summing up the error terms, we obtain (6).
Remark 1. The error bound of Equation (10) gets bigger for lower frequencies. This makes sense, since the separation of the fundamental frequency and the corresponding harmonics by the analysis window deteriorates. For higher frequencies the separation improves and hence the error term gets smaller.
We now introduce two more operators, first the sampling operator $S_\alpha(f(x)) = f(\alpha x)$ $\forall x \in \mathbb{R}$ and second the periodization operator $P_{\frac{1}{\alpha}}(\hat{f}(\omega)) = \sum_{k \in \mathbb{Z}} \hat{f}(\omega - \frac{k}{\alpha})$ $\forall \omega \in \mathbb{R}$. These operators have the following relation: $\mathcal{F}(S_\alpha(f))(\omega) = P_{\frac{1}{\alpha}}(\hat{f}(\omega))$. In order to see how the second layer captures relevant signal structures, depending on the first layer, we propose the following Corollary 1. Recall that $g_\ell \in H_\ell$ $\forall \ell \in \mathbb{N}$.
Then the elements of the second layer can be expressed as
where
Proof. Using the outcome of Proposition 1 we obtain
For the error $E_1(k)$ we use the global estimate $|\langle E_1, M_{\beta_2 h} T_{\alpha_2 m} g_2\rangle_{\ell^2(\mathbb{Z})}| \le \|E_1\|_\infty \cdot \|g_2\|_1$ and, using the notation above, we proceed as follows:
Since the values are maximal in a neighborhood of the center frequency $\beta_2 h$, we consider the case $k = 0$ separately and obtain
It remains to bound the sum, i.e. the second term of Equation (12):
In the last Equation (13) we applied the triangle inequality twice, and the modulation term can be ignored because of the modulus. Now we can use our assumption $\sum_{k \in \mathbb{Z}\setminus\{0\}} |\hat{A}_{n_0}(\cdot - \frac{k}{\alpha_1})| \le \varepsilon_{\alpha_1}$ and also the assumption on the Fourier transform of the analysis window $g_2$:
We rewrite the first term in (12):
The last Equation (15) uses Plancherel's theorem. Rewriting the last term we obtain
Remark 2. The sum $\sum_r (1 + |\beta_2 h - r|^s)^{-1}$ decreases very fast, i.e. taking $s = 3$ the summand is already smaller than $10^{-5}$ for $r = 48$.
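The decay claimed in Remark 2 is easy to check numerically. The sketch below reads the summand as $(1 + |\beta_2 h - r|^s)^{-1}$ and writes `distance` for $|\beta_2 h - r|$; taking the distance equal to $r = 48$ (i.e. $\beta_2 h = 0$) is an assumption made for the check, and the function name `summand` is hypothetical.

```python
def summand(distance, s):
    """Decay term (1 + |distance|**s)**(-1) of the sum in Remark 2."""
    return 1.0 / (1.0 + abs(distance) ** s)


# s = 3 and distance 48, the values quoted in Remark 2
value = summand(48, 3)
is_below_threshold = value < 1e-5
```

With these values the summand equals $1/110593$, which is indeed below $10^{-5}$.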
Remark 3. Note that, since the envelopes A n are expected to change slowly except around transients, their Fourier transforms concentrate their energy in the low frequency range.In Section 5 it will be shown by means of the analysis of example signals, how the second layer distinguishes tones which have a smooth onset (transient) from those which have a sharp attack, which leads to broadband characteristics of A n around this attack.Similarly, if A n undergoes an amplitude modulation, the frequency of this modulation can be clearly discerned, cf. Figure 1 and the corresponding example.This observation is clearly reflected in expression (12).Since ĝ2 decays fast, the aliasing terms in (12), for k > 0 are sufficiently small.
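The relation between sampling and periodization, which produces the aliasing terms mentioned in Remark 3, can be checked in a finite discrete setting. The sketch below uses the discrete Fourier transform as a stand-in for $\mathcal{F}$; the factor $1/a$ comes from the DFT normalisation, and the test signal, the length `n` and the sampling step `a` are arbitrary choices made for illustration.

```python
import cmath
import math


def dft(x):
    """Plain discrete Fourier transform of a finite sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * m * t / n) for t in range(n))
            for m in range(n)]


n, a = 16, 4                                     # signal length and sampling step
x = [math.exp(-0.1 * (t - 8) ** 2) for t in range(n)]

X = dft(x)                                       # spectrum of the full signal
Y = dft(x[::a])                                  # spectrum of the subsampled signal
m_len = n // a
# periodization: sum the a shifted copies of the spectrum, scaled by 1/a
Y_periodized = [sum(X[m + r * m_len] for r in range(a)) / a for m in range(m_len)]

max_error = max(abs(u - v) for u, v in zip(Y, Y_periodized))
```

The two spectra agree up to rounding error; when the spectrum of `x` decays fast, the shifted copies with `r > 0` contribute little, which is the discrete counterpart of small aliasing terms.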
To obtain the Gabor scattering coefficients, we need to apply the output generating atom as in (3).
Corollary 2. Let $\phi_1 \in \Psi_2$, then the output of the first layer is
where
and with $\phi_2 \in \Psi_3$ the second layer output is
Remark 4. Note that the convolution is a low-pass filter for sufficient smoothness of $\phi_1$. Hence, depending on the pooling factor $\alpha_1$, the temporal fine structure of $A_{n_0}$ is averaged out. In the second layer, applying the output generating atom $\phi_2 \in \Psi_3$ removes the fine temporal structure and thus the second layer reveals information contained in the envelopes $A_n$.
Proof. We show the calculations for the first layer; for the subsequent layer they are analogous:
where
The factor $\phi_1$ will be absorbed by the constants in the error terms, i.e.
Calculations are similar for the second layer.

Deformation stability
In this section we study to which extent Gabor scattering is stable with respect to certain deformations. We consider changes in spectral shape as well as frequency modulations. The method we apply is inspired by [5] and uses the decoupling technique: in order to prove stability of the feature extractor, we first take the structural properties of the signal class into account and search for an error bound on deformations of the signals in $T$. In combination with the contractivity property of $\Phi_\Omega$, which follows from $B_\ell \le 1$ $\forall \ell \in \mathbb{N}$, where $B_\ell$ is the upper frame bound of the Gabor frame $G(g_\ell, \alpha_\ell, \beta_\ell)$, this yields deformation stability of the feature extractor.
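Written out, the decoupling argument is a chain of two inequalities. In the sketch below, $F_\tau$ is a placeholder for a deformation acting on the signal class and $C(\tau)$ for the resulting signal-class error bound; both symbols are notation assumed here for illustration and do not appear in the text.

```latex
\|\Phi_\Omega(f) - \Phi_\Omega(F_\tau f)\|
  \;\le\; \|f - F_\tau f\|_2     % contractivity of the network (B_\ell \le 1)
  \;\le\; C(\tau)                % stability of the signal class T under F_\tau
```

The first inequality depends only on the network architecture, the second only on the signal class, which is why the two parts can be treated separately.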

Envelope Changes
Simply deforming a tone would correspond to deformations of the envelope $A_n$, $n = 1, ..., N$. This corresponds to a change in timbre, for example by playing a note on a different instrument. Mathematically this can be expressed as:
We apply the mean value theorem for a continuous function $A_n(t)$ and get $|A_n(y)|$.
Applying the 2-norm on $h_n(t)$ and the assumption on $A_n(t)$, we obtain:
Splitting the integral into $B_1(0)$ and $\mathbb{R}\setminus B_1(0)$ and using the monotonicity of $(1 + |t|^s)^{-1}$ we have
In Equation (16) we performed a change of variables, i.e. $x = (1 - \|\tau\|_\infty) y$. Setting
and summing up we obtain
Remark 5. Harmonics' energy decreases with increasing frequency, hence the constants $C_n$ decrease with $n$.

Frequency modulation
Another kind of sound deformation results from frequency modulation of $f(t) \in T$. This corresponds to, for example, playing a higher or lower pitch, or producing a vibrato. This can be formulated as:
Proof. We have
Now we compute the 2-norm of $h_n(t)$:
Using the assumptions of our signal model on the envelopes, i.e. $\|A_n\|_\infty < \frac{1}{n}$, we obtain

Numerical Examples
In this section we present numerical examples, computed in Matlab, that illustrate the mathematical theory of the Gabor scattering transform developed above. As input we produced single tones following the signal model from Section 2.3. The first example, Fig. 1, shows two tones, played sequentially, having the same fundamental frequency $\xi_0 = 800$ Hz and 15 harmonics, but different envelopes: the first tone has a sharp attack, is sustained and then decays softly to zero, while the second starts with a soft attack and has some amplitude modulation. An amplitude modulated signal would for example correspond to $f(t) = \sum_{n=1}^{N} \sin(2\pi 20 t)\, e^{2\pi i n \xi_0 t}$; here the signal is modulated by 20 Hz. We now explain what is visible in Figure 1:
• First layer: The first layer is invariant w.r.t. the envelope of the signals. This is due to the output generating atom and the subsampling, which remove temporal information of the envelope. In the spectrogram corresponding to the first layer, no information about the envelope is visible, hence the spectrograms of the different signals look almost the same.
• Second layer: For the second layer we took as input a time vector at a fixed frequency of the first layer. Here we fixed the fundamental frequency. The second layer is invariant w.r.t. the pitch, but differences on a larger scale are captured. Within this layer we are able to distinguish the different envelopes of the signals. We first see the sharp attack of the first tone and then we can distinguish the modulation, where a second frequency is visible.
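Input signals of the kind used in the first example can be synthesised directly from the signal model. The following is a minimal sketch, not the Matlab code used for the figures: the sampling rate, the duration, the two envelope shapes and all function names are assumptions, and a single envelope is shared by all harmonics. The modulus of the sine is taken so that the envelope stays non-negative, as the signal model requires.

```python
import cmath
import math


def tone(envelope, xi0, n_harmonics, n_samples, rate):
    """Samples of f(t) = sum_{n=1}^{N} A(t) * exp(2*pi*i*n*xi0*t), one shared envelope A."""
    samples = []
    for i in range(n_samples):
        t = i / rate
        carrier = sum(cmath.exp(2j * math.pi * n * xi0 * t)
                      for n in range(1, n_harmonics + 1))
        samples.append(envelope(t) * carrier)
    return samples


def sharp_attack(t):
    """Starts at full amplitude and decays smoothly (hypothetical shape)."""
    return math.exp(-30.0 * t)


def soft_attack_modulated(t):
    """Linear fade-in over 20 ms multiplied by a 20 Hz amplitude modulation (hypothetical shape)."""
    return min(1.0, t / 0.02) * abs(math.sin(2 * math.pi * 20 * t))


RATE = 32000  # samples per second, an assumption of this sketch
first = tone(sharp_attack, 800.0, 15, 3200, RATE)
second = tone(soft_attack_modulated, 800.0, 15, 3200, RATE)
```

Both tones share the fundamental frequency of 800 Hz and 15 harmonics and differ only in their envelopes, so any difference between their scattering outputs is caused by the envelope.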
The second example, Figs. 2 and 3, shows two tones, both having a smooth envelope, but different fundamental frequencies and numbers of harmonics. The first tone has fundamental frequency $\xi_0 = 800$ Hz and 15 harmonics and the second tone has fundamental frequency $\xi_0 = 1060$ Hz and 10 harmonics.
In the following we explain what is displayed in Figures 2 and 3:
• Gabor transform: The first spectrogram of Figure 2 shows the Gabor transform. We see the different fundamental frequencies of the two tones and also that tone one has more harmonics than tone two.
• First layer: The second spectrogram of Figure 2 shows the first layer, which is the layer invariant w.r.t. the envelope of the signals. Since both tones have the same envelope, this layer shows no envelope-related difference between them. We can, however, compare this layer with the first layer of Figure 1 and see that they look similar. The difference in the fundamental frequency is still visible.
• Second layer: For the second layer we prepared several outputs, displayed in Figure 3. As input for the first spectrogram, we took a time vector at the fixed fundamental frequency of the first tone, i.e. $\xi_0 = 800$ Hz. Since the second tone does not contribute to this frequency, we do not see anything of the second tone. As input for the second spectrogram, we took a time vector at the fixed fundamental frequency of the second tone, i.e. $\xi_0 = 1060$ Hz. We can see that the first tone does not contribute at this frequency. As input for the third spectrogram, we took a frequency that both tones share within their harmonics, i.e. about $\xi = 9550$ Hz. Here we can see that both tones contribute to the second layer output. Although we use different fixed frequency vectors from the first layer as input, the resulting outputs do not differ visibly, since the output bumps all look the same. Hence this layer is invariant to frequency.

Conclusion and Perspectives
Within this paper we introduced a method that combines Gabor frames and the scattering transform of Mallat, called Gabor scattering. This method has been applied to a simple signal model, consisting of the class of tones. We showed that different layers of the Gabor scattering, here in particular the first and the second, are invariant w.r.t. certain signal features, like pitch and envelope of an audio signal. Moreover, we proved that this feature extractor is robust to changes in the spectral shape and to frequency modulation.
As a next step we will use the output of the Gabor scattering transform as input to various classifiers, e.g. an SVM (support vector machine), but also a simple (C)NN, and see whether it improves the classification performance. Moreover, an extension of this work will be the investigation of the impact of 2-dimensional window functions in higher layers, as already introduced in [1]. In audio tasks this models standard CNNs more closely and is expected to capture complex time-frequency structures.

Figure 1: Gabor transform, first layer and second layer of the signal having a sharp attack and afterwards some modulation.

Figure 2: Gabor transform and first layer of two tones having different fundamental frequencies.

Figure 3: Second layer of two tones having different fundamental frequencies, at different fixed frequencies of the first layer.