Stochastic Restoration of Heavily Compressed Musical Audio using Generative Adversarial Networks

Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals with 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.


Introduction
The introduction of MP3 (i.e., MPEG-1 Layer 3 [1]) was transformative in how music was stored, transmitted, and shared on digital devices and on the internet. MP3 players, sharing platforms, and streaming resulted directly from the possibility of considerably compressing audio data without noticeable perceptual compromises. Compared to lossless audio coding formats, which allow for a perfect reconstruction of the original PCM audio signal, lossy formats (like MP3) typically achieve better compression by discarding the parts of the signal to which humans are less sensitive. This process is also called perceptual coding; it takes into account the physiological and psychological characteristics of human auditory perception, as captured in so-called psychoacoustic models [2].
While several different lossy audio codecs exist (e.g., AAC, Opus, Vorbis, AMR), MP3 is undoubtedly the most commonly used. It is built upon an analysis filter bank and a subsequent computation of the modified discrete cosine transform (MDCT). In parallel, the signal is analyzed by a perceptual model that exploits the psychoacoustic phenomenon of auditory masking to determine sound events in the audio signal that are considered to be beyond human hearing capabilities. Based on this information, the spectral components are quantized with specific resolution and coded with variable bit allocation while keeping the noise introduced in this process below the masking thresholds [3]. This process may introduce a variety of deficiencies when the codec is configured with incorrect or very extreme parameters. For example, under high compression rates, high-frequency content is susceptible to being removed, resulting in bandwidth loss. Pre-echoes can occur when decoding very sudden sound events, for which the quantization noise spreads out over the synthesis window and consequently precedes the event causing the noise. Other common artifacts are so-called swirlies [4], characterized by fast energy fluctuations in the low-level frequency content of the sound. Furthermore, there are other problems related to MP3 compression, such as double-speak, as well as a general loss of transient definition, transparency, and detail clarity, among others [4].
Many works tackle the problem of audio enhancement, including the removal of compression artifacts. The most common recent methods for these types of problems are based on deep learning. Typically, they focus on specific types of impairments present in the audio signals (e.g., reverberation [5], bandwidth loss [6], or audio codec artifacts [7][8][9][10][11]). Different types of neural network architectures have also been studied for these tasks, for example, Convolutional Neural Networks (CNNs) [12], WaveNet-like architectures [8,13], and U-Nets [14,15]. However, most of the works in this line of research tackle the enhancement of speech signals [7][8][9][10][12][13][14][15][16][17][18], and only a few publications exist on musical audio restoration [11,[19][20][21]. This focus on speech is understandable, given the wide range of applications of speech enhancement in telephony, automatic speech recognition, and hearing aids. Also, compared to musical audio signals, speech signals are easier to study, as they are more homogeneous, narrow-band, and usually monophonic. In contrast, musical audio signals, particularly in the popular music genre, are highly varied. They typically consist of multiple superimposed sources, which can be of any type, including (polyphonic) tonal instruments, percussion, (singing) voice, and various sound effects. In addition, music is typically broad-band, containing frequencies spanning the entire human hearing range.
Given that studies on deep learning-driven audio codec artifact removal for musical audio data are underrepresented in audio enhancement research, in this work we attempt to provide further insight into this task. We investigate the limits of a generative neural network model when dealing with a general popular music corpus comprising music released over the last seven decades. In particular, we are interested in the ability of the model to regenerate lost information of heavily compressed musical audio signals using a stochastic generator (which is not very common in audio enhancement, with [10,22] being some exceptions). This work is not only relevant for the restoration of MP3 data in existing (older) music collections. In light of current developments in musical audio generation, where full songs can already be generated from scratch [23], musical audio enhancement may soon take on a much more generative character. It has already been shown that strong generative models can enhance heavily corrupted speech through resynthesis with neural vocoders [22]. Along these lines, examining a generative (i.e., stochastic) decoder for heavily compressed audio signals may contribute insights towards more efficient musical data storage and transmission. Today, music streaming is increasingly common, which poses issues regarding energy consumption and, consequently, environmental sustainability. When accepting deviations from the original recording, higher compression rates could be reached with a generative decoder without perceptual compromises in the listening experience. Moreover, for heavily compressed audio signals, there is no single best solution for recovering the original version. Therefore, it may be interesting for users to generate multiple recoveries and pick the one they like most.
We introduce a Generative Adversarial Network (GAN) [24] architecture for the restoration of MP3-encoded musical audio signals. We train different stochastic and deterministic generators on MP3s with different compression rates. Using these models, we investigate 1) whether the restorations of the models considerably improve over the MP3 versions, 2) whether we can systematically pick samples among the outputs of the stochastic generators that are closer to the original than those of the deterministic generators, and 3) whether the stochastic generators generally output higher-quality restorations than the deterministic generators. To that end, we perform an extensive evaluation of the different experiment setups utilizing objective metrics and listening tests. We find that the models are successful with respect to points 1 and 2, but that random outputs of the stochastic generators are approximately on a par with those of the deterministic models in overall quality, i.e., they do not improve upon them (point 3).
The proposed GAN architecture is based on dilated convolutions with skip connections, combined with a novel concept which we call Frequency Aggregation Filters. These are convolutional filters spanning the whole frequency range, which contribute to the stability of the training and constitute a consistent approach to the problem of non-local correlations in the frequency spectrum (see Section 3.1.3). We also find that using so-called self-gating considerably reduces the memory requirements of the architecture by halving the number of input maps to each layer without degrading the results (see Section 3.1.2). In order to prevent mode collapse, we propose a regularization that enforces a correlation between differences in the noise input and differences in the model output (see Section 3.2.1). As opposed to most other works (but in line with a few other approaches using GANs [25] and U-Net-based architectures [14,15]), we input (and output) directly the (non-linearly scaled) complex-valued spectrum to the generator, eliminating the need to deal with phase information separately.
The rest of this paper is organized as follows. In Section 2 we review previous works on bandwidth extension and audio enhancement. In Section 3 we describe in depth the proposed GAN architecture (Section 3.1), the training procedure (Section 3.2), the dataset (Section 3.3), and the evaluation methods (Section 3.4). Finally, in Section 4 we present and discuss the results, and we conclude with suggestions for future work in Section 5. Audio examples of the work are provided on the accompanying website.¹

Related Work
In this work, Generative Adversarial Networks (GANs) are employed to restore MP3-compressed musical audio signals to their original high-quality versions. This task falls at the intersection of audio enhancement and bandwidth extension. Therefore, we review works in both of these domains.

Bandwidth Extension
Low-resolution audio data (i.e., audio signals with a sample rate lower than 44.1kHz) is generally preferable for storage or transmission over band-limited channels, like streaming music over the internet. Also, lossy audio encoders can significantly reduce the amount of information by removing high-frequency content, but at the expense of potentially hampering the perceived audio quality. In order to restore the quality of such truncated audio signals, bandwidth extension (BWE) methods aim to reconstruct the missing high-frequency content of an audio signal given its low-frequency content as input [26]. BWE is alternatively referred to as audio re-sampling or sample-rate conversion in the field of Digital Signal Processing (DSP), or as audio super-resolution in the Machine Learning (ML) literature. Methods for BWE have been extensively studied in areas like audio streaming and restoration, mainly for legacy speech telephony communication systems [13,16,17,27] or, less commonly, for degraded musical material [19,20].
Pioneering works on speech BWE were originally algorithmic and operated based on a source-filter model. There, the problem of regenerating a wide-band signal is divided into finding an upper-band source and the corresponding spectral envelope, or filter, for that upper band. While methods for source generation were based on simple modulation techniques, such as spectral folding and translation of a so-called low-resolution baseband [28], the efforts focused on estimating the filter or spectral envelope [29]. These works introduced the so-called spectral band replication (SBR) method, where the lower frequencies of the magnitude spectra are duplicated, transposed, and adjusted to fit the high-frequency content. Because in most use cases for speech BWE the full transmission stack is controlled, most of these algorithmic methods rely on side information about the spectral envelope, obtained at the encoder from the full wide-band signal and then transmitted within the bitstream for subsequent reconstruction at the decoder.
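As a toy illustration of the source-generation step, spectral folding can be sketched as mirroring the occupied low band into the empty high band. This is a simplified stand-in for the methods in [28]: the fixed gain here replaces the envelope adjustment that SBR would estimate, and the function name is ours.

```python
import numpy as np

def spectral_fold(mag, gain=0.5):
    """Naive spectral folding: mirror the low band into the (empty) high band.

    mag: (bins, frames) magnitude spectrogram whose upper half is assumed empty.
    gain: crude attenuation of the folded copy (stands in for envelope fitting).
    """
    bins = mag.shape[0]
    low = mag[: bins // 2]          # occupied lower half of the spectrum
    folded = low[::-1] * gain       # mirror it around the crossover bin
    out = mag.copy()
    out[bins // 2: bins // 2 + folded.shape[0]] = folded
    return out
```

With a band-limited input, the folded copy fills the upper bins while the lower bins stay untouched.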
Learning-based approaches to speech BWE rely on large models to learn dependencies across the lower and higher end of the frequency spectrum. Methods based on Non-negative Matrix Factorization (NMF) treat the spectrogram as a fixed set of non-negative bases learned from wide-band signals [27]. These bases are fixed at test time and used to estimate the activation coefficients that best explain the narrow-band signal. The wide-band signal is then reconstructed by a linear combination of the base vectors weighted by the activations. These methods up-sample speech audio signals up to 22.05kHz efficiently but are sensitive to non-linear distortions due to the linear-mixing assumption. Dictionary-based methods can significantly improve the speech quality over the NMF approach by reconstructing the high-resolution audio signals as a non-linear combination of units from a pre-defined clean dictionary [30], or by casting the problem as an l1-optimization of an analysis dictionary learned from wide-band data [31].
Early works on speech BWE using neural networks inherited the source-filter methodology found in previous works. Employing spectral folding to regenerate the wide-band signal, a simple neural network is used to adjust the spectral envelope of the generated upper band [16]. Direct estimation of the missing high-frequency spectrum was not extensively studied until the introduction of deeper architectures [17]. Advances in computer vision [32,33] inspired the use of highly expressive models for audio BWE, leading to significant improvements in the up-sampling ratio and the quality of the reconstructed audio signal. Different approaches followed: generating the missing time-domain samples in a process analogous to image super-resolution [34], inpainting the missing content in a time-frequency representation [20], or combining information from both domains, preserving the phase information [35]. Powerful auto-regressive methods for raw audio signals based on SampleRNN [36] or WaveNet [13] are able to increase the maximum resolution to 16 kHz and 24 kHz sample rates, respectively, without neglecting phase information, as is the case in most works operating in the frequency domain [6,17,19,20,27]. The most recent techniques, using sophisticated transformer-based GANs, can up-sample speech to full-resolution audio at a 44.1 kHz sample rate [6].

Audio Enhancement
Audio signals may suffer from a wide variety of environmental adversities: e.g., sound recordings made with low-fidelity devices or in noisy and reverberant spaces, degraded speech in mobile or legacy telephone communication systems, musical material from old recordings, or heavily compressed audio signals for streaming services. Audio enhancement aims to improve the quality of corrupted audio signals by removing noisy additive components and restoring distorted or missing content to recover the original audio signal. The field was first introduced for applications in noisy communication systems to improve the quality and intelligibility of speech signals [37]. Many studies have been carried out on speech audio enhancement, e.g., for speech recognition, speaker identification and verification [38][39][40], hearing assistance devices [41,42], de-reverberation [5], and so on. In the specific case of audio codec restoration, many different techniques exist for the improvement of speech signals [7][8][9][10], yet only a few works attempt the restoration of heavily compressed musical audio signals [11,21].
Classic speech enhancement methods follow multiple approaches, primarily based on analysis, modification, and synthesis of the noisy signal's magnitude spectrum and often omitting phase information. Popular strategies are categorized into spectral subtraction methods [43], Wiener-type filtering [44], statistical model-based [45] and subspace methods [46]. These approaches have proven successful when the additive noise is stationary. However, under highly non-stationary noise or reduced signal-to-noise ratios (SNR), they introduce artificial residual noise.
Recent deep learning approaches to speech enhancement outperform previous methods in terms of perceived audio quality, effectively reducing both stationary and non-stationary noise components. Popular methods learn non-linear mapping functions from noisy to clean spectrograms [18] or learn masks in a time-frequency representation [5,14,47]. Many architectures have been proposed: basic feed-forward DNNs [18], CNN-based [12], RNN-based [48], and more sophisticated architectures based on WaveNet [8] or U-Net [14]. GANs are also increasingly popular in speech enhancement [49][50][51][52]. Pioneering works using GANs operated either in the waveform domain [49] or on the magnitude STFT [53]. Subsequent works mainly focused on the latter representation due to its reduced complexity compared to time-domain audio signals [51,52,54]. Recent works operating directly on the raw waveform were able to consider a broader range of signal distortions [50] and to improve the reduction of artifacts over previous works [55]. Successive efforts further reduced artifacts by, for example, taking human perception into consideration. Some works directly optimize over differentiable approximations of objective metrics such as PESQ [54]. However, these metrics correlate poorly with human perception, so some works instead defined the objective in embedding spaces from related tasks [56] or by matching deep features of real and fake batches in the critic's embedding space [57].
The vast majority of the speech audio enhancement approaches mentioned above operate on the magnitude spectrum and ignore the phase information [20,21,51,52]. At synthesis, researchers often reuse the phase spectrum of the noisy signal, introducing audible artifacts that would be particularly annoying in musical audio signals. To address this, phase-aware models for speech enhancement use a complex ratio mask [47] or, as we have seen, operate directly in the waveform domain [50,55]. Inspired by a recent work demonstrating that DNNs implementing complex operators [58] may outperform previous architectures in many audio-related tasks, new state-of-the-art performances were achieved in speech enhancement using complex representations of audio data [14,15]. Recent work further improved these approaches by introducing a complex convolutional block attention module (CCBAM) and a mixed loss function [59].

Materials and Methods
In the following, we describe the experiment setup. This includes the model architecture (see Section 3.1) and the training procedure (see Section 3.2). Furthermore, the data used and the data representation are presented in Section 3.3, and the objective and subjective evaluation methods are discussed in Section 3.4.

Model Architecture
The model employed in this work is a Generative Adversarial Network (GAN), conditioned on spectrogram representations of MP3-compressed audio files (see Figure 1 for an overview of the architecture and training). As is common in GANs, there are two separate models, the generator G and the critic D. G receives as input an excerpt of an MP3-compressed musical audio signal in spectrogram representation y (i.e., non-linearly scaled complex STFT components, see Section 3.3.1) and learns to output a restored version x̂ of that excerpt (i.e., the fake data), approximating the original, high-quality signal x. D learns to distinguish between such restorations x̂ and original high-quality versions of the signal x (i.e., the true data). In addition to the true/fake data, D also receives the MP3 versions of the respective excerpts. That way, it is ensured that the information present in the MP3 data is faithfully preserved in the output of G. We test stochastic and deterministic generators in our experiments. For the stochastic models, we also provide a noise input z ∼ N(0, I), resulting in different restorations for a given MP3 input.
As the training criterion, we use the GAN Wasserstein loss [60],

Γ(D, G) = (1/m) Σ_{i=1}^{m} [ D(y_i, x_i) − D(y_i, G(y_i, z_i)) ],    (1)

and we are interested in min_G max_D Γ(D, G), meaning the parameters of G are optimized to minimize this loss, and the parameters of D are optimized to maximize it. Note that the optimization of G only affects the second term of Equation 1, resulting in a maximization of D(y_i, G(y_i, z_i)).
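Numerically, the criterion reduces to a gap between batch-averaged critic outputs on real and generated data. A minimal sketch (the function name is ours, not from the paper's implementation):

```python
import numpy as np

def wasserstein_gap(d_real, d_fake):
    """Empirical Wasserstein criterion over a batch of critic outputs.

    d_real: critic scores D(y_i, x_i) on real pairs.
    d_fake: critic scores D(y_i, G(y_i, z_i)) on generated pairs.
    The critic maximizes this gap; the generator minimizes it, and only
    the second term depends on the generator's parameters.
    """
    return np.mean(d_real) - np.mean(d_fake)
```

In a training loop, the critic's loss would be the negative of this gap, and the generator's loss would be -np.mean(d_fake).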

Architecture Details
For details on the implemented architecture, please refer to Table 1. We implement both the generator G and the critic D to be convolutional in time. This allows us to use less overlap (i.e., 50%) when chopping up the training data, as the convolutional architectures obtain differently shifted versions of the input by design. At test time, G is applied to variable-length and potentially rather long input sequences (e.g., full songs). In such a setting, G does not perform very well if trained on short excerpts with zero-padding in the time dimension. Therefore, we do not use zero-padding for G during training.
The critic D is convolutional in time, too, resulting in as many loss outputs as there are spectrogram frames in the input (the individual costs are simply averaged for computing the final Wasserstein loss). We use two convolutional groups throughout the critic stack, which amounts to two independent critics. Only in the output layer are those two groups joined again. This provides D with two different views on the input: (1) the signed square-root of the complex STFT components (see Section 3.3.1) and (2) the magnitude spectrum of the generator output (we found empirically that this resulted in more stable training than using the log-magnitude spectrogram).

Table 1: Architecture details of generator G and critic D for 4-second-long excerpts (i.e., 336 spectrogram frames), where information in (·)-brackets applies only to G, and information in [·]-brackets applies only to D. During training, no padding is used in the time dimension for G, resulting in a shrinking of its output to 212 time steps.

Many convolutional architectures with the same output and input size used for data restoration and in-painting employ the symmetrical U-Net paradigm (first introduced in [61]) with bottleneck layers and skip connections. In contrast, the architecture proposed in this work is non-symmetrical, mainly facilitating dilated convolutions to increase receptive fields, and the main parts of the architectures of G and D are identical (see Table 1). Only the top parts of the stacks differ: in G, the aggregated information is fed into deconvolution layers, while in D it is used to compute the Wasserstein distance.
We use Parametric Rectified Linear Units (PReLUs) [62] for all layers, and skip connections for D in the convolutional layers Conv6 -Conv14 in Table 1. The noise input z (to the generator G) is simply repeated in the two convolutional dimensions and concatenated to layer Conv5.

Gated Convolutions
In order to increase the architecture size under limited resources, a handy modification of common convolutions is the self-gating convolutional layer. This idea was also proposed in [63], but we use PReLU activations instead of linear units for the gated output units (linear units resulted in unstable training). The characteristic of (self-)gating convolutions is that half the output maps of each convolutional layer are used to element-wise gate the other half of the output maps (where we use sigmoid non-linearities on the gating units and PReLUs on the gated units). We found that with self-gating layers, the network's performance does not degrade, even though the operation effectively halves the number of output maps of each layer. The advantage of self-gating is a considerable reduction in memory, as the successive layer receives only half the input maps compared to non-self-gating layers. In practice, we use only one layer with twice as many output maps as input maps, which are then used for self-gating. Formally, we describe the operation with two different weight matrices as

y_l = PReLU(W_l ∗ x_l + a_l) ⊙ σ(V_l ∗ x_l + b_l),    (2)

where x_l ∈ R^{R×P×T} is the input to layer l with R convolutional input maps, y_l ∈ R^{S×P×T} is the resulting self-gated output of layer l with S convolutional output maps (where S = R in our architecture), W_l, V_l ∈ R^{R×S×K×K} are two weight matrices with quadratic kernels of size K × K, and a_l, b_l ∈ R^S are bias vectors. PReLU is the parametric ReLU activation [62], σ is the sigmoid non-linearity, ∗ is the convolution operator, and ⊙ is the Hadamard product. This operation is applied to all convolutional layers Conv6 - Conv14 in Table 1.
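The gating operation itself can be sketched without the convolutions. In this NumPy stand-in, the doubled feature maps are taken as given (in the architecture they would be produced by the convolutions with W_l and V_l):

```python
import numpy as np

def prelu(x, alpha=0.25):
    # Parametric ReLU with a single shared slope (a simplification).
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_gate(feature_maps):
    """Split 2S feature maps into S gated (PReLU) and S gating (sigmoid) halves.

    feature_maps: (2S, H, W) array, e.g. the output of one convolution that
    produced twice as many maps as the next layer expects.
    Returns the (S, H, W) Hadamard product, halving the map count.
    """
    s = feature_maps.shape[0] // 2
    gated = prelu(feature_maps[:s])
    gates = sigmoid(feature_maps[s:])
    return gated * gates
```

The output has half as many maps as the input, which is exactly the memory saving described above.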

Frequency Aggregation Filters
The neural network architectures used in audio processing are often derived from the visual domain. Convolutional neural networks are particularly well-suited for image processing because, in natural images, nearby pixels are usually more highly correlated than pixels further apart. Using stacks of convolutional layers, the filter kernels in the lower layers can learn highly correlated information, and filters in the higher layers can learn more complex combinations of the filter responses of the lower layers. That way, as a rule of thumb, the higher up in the convolutional hierarchy, the less correlated the represented information, which results in a hierarchical aggregation of pixel information that is well-suited for natural images.
Such a correlation assumption may also hold for the time dimension when working with musical audio data in spectrogram representation. However, it does not hold in the frequency dimension, where highly correlated spectral energy is potentially spread over the whole frequency range. In order to comply with this characteristic, it is common to employ non-rectangular filters in the input layer of a convolutional network stack, for example, kernels of shape [1, 32] in [64]. Still, when considering harmonics of tonal instruments or percussive sounds, correlated information may be so distant on the frequency axis that, even with vertical filter kernels, a complete acoustic source may only be fully represented in the highest layer of the hierarchy. This contradicts the useful characteristic of convolutional network stacks of efficiently aggregating information so that the least correlated information (i.e., the most complex patterns) is represented in the highest layers of the hierarchy.

[Figure 2: Frequency Aggregation Filters (Conv4 in Table 1). By reshaping, the responses are separated into 32 groups (of size 128 each) and re-combined again through a stack of dilated convolutions (Conv5 - Conv14 in Table 1).]
In order to tackle the problem of highly correlated frequency bins very distant in the frequency dimension, we take a novel approach. As it is not obvious before training which frequency bins are most correlated in the training data, and it is therefore not clear how to best design the architecture, we allow the network to learn a useful hierarchy of frequency aggregation during training (see Figure 2). To that end, in Layer Conv4 (see Table 1), we use 4096 filter kernels that span the whole frequency dimension and only convolve in time (i.e., no padding in the frequency dimension). Then, we reshape the output maps (see Reshape1 in Table 1) so that we again obtain a 2D convolutional architecture and let the network learn which filter kernels are most correlated, i.e., in what layer of the hierarchy which filter responses need to be brought together (throughout layers Conv5 -Conv14). In the generator G, Reshape2 reshapes back to 4096 feature maps and DeConv4 inverts the frequency aggregation.
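A shape-level sketch of this aggregation and reshaping step (random weights stand in for the learned kernels; the 4096-filter and 32-group sizes follow the text, while the single-frame kernels are a simplification of the time convolution):

```python
import numpy as np

def frequency_aggregate(spec, n_filters=4096, groups=32, seed=0):
    """Sketch of frequency aggregation filters.

    Each filter spans the whole frequency axis (no frequency padding), so its
    response collapses the frequency dimension to size 1. Reshaping the
    n_filters responses into `groups` groups restores a 2-D layout for the
    subsequent dilated convolutions.

    spec: (bins, frames) spectrogram excerpt.
    Returns a (groups, n_filters // groups, frames) tensor.
    """
    rng = np.random.default_rng(seed)
    bins, frames = spec.shape
    w = rng.standard_normal((n_filters, bins))  # one full-height kernel per filter
    responses = w @ spec                        # (n_filters, frames)
    return responses.reshape(groups, n_filters // groups, frames)
```

In the actual architecture, the network then learns in which layer of the dilated stack correlated filter responses are brought together, rather than having the hierarchy fixed by kernel geometry.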

Training Procedure
Each model is trained for 40k iterations with a batch size of 12, which takes about 2 days on two NVIDIA Titan RTX GPUs with 24 GB of memory each. We use the ADAM optimizer [65] with a learning rate of 1e-3 and a gradient penalty loss to constrain D to be 1-Lipschitz [66]. We also use a loss term that penalizes the magnitudes of the outputs of D for real input data, preventing the loss from drifting. Furthermore, He initialization is used for all layers in the architecture [62].
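The two critic regularizers described above can be sketched as follows. The weights gp_weight and drift_weight are assumptions for illustration, not values reported here:

```python
import numpy as np

def critic_penalties(grad_norms, d_real, gp_weight=10.0, drift_weight=1e-3):
    """Sketch of the two critic-side regularizers.

    grad_norms: norms of the critic's input gradients (per sample); the
    gradient penalty pushes them towards 1 to enforce 1-Lipschitzness [66].
    d_real: critic outputs on real data; the drift penalty keeps their
    magnitude small so the loss does not drift.
    """
    gp = gp_weight * np.mean((np.asarray(grad_norms) - 1.0) ** 2)
    drift = drift_weight * np.mean(np.asarray(d_real) ** 2)
    return gp + drift
```

In a real implementation the gradient norms would come from automatic differentiation at interpolates between real and fake samples; here they are simply passed in.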

Preventing Mode Collapse
We found that the generator G tends to ignore the input noise z. This may be because G is densely conditioned on y, and the output variability in the data is limited given an input with a specific characteristic. In order to prevent such a mode collapse, for updating G during training, we define an additional cost term which is maximal when the noise input to G does not influence the output of G. To that end, for a fixed y_i, we compute the ratio between the Euclidean distance of two arbitrary input noise vectors {z_i, z_j} and the distances between the corresponding frequency profiles (summing over the time axis) and rhythm profiles (summing over the frequency axis) of the output magnitude spectrograms of G, resulting in the losses L_freq and L_rhyt, respectively:

L_freq/rhyt = ϑ · ||z_i − z_j||_2 / || P^{∘1/2}_{z_i} d − P^{∘1/2}_{z_j} d ||_p ,    (3)

where P^{∘1/2}_{z_i} is the Hadamard (element-wise) square-root of the power spectrum of the output of G for input noise vector z_i, and d is a column vector of 1s for the frequency profiles (resulting in the loss L_freq) and a row vector of 1s for the rhythm profiles (resulting in L_rhyt). The scalar ϑ controls the strength of the regularization; we use p = 1.3 for the frequency profiles and p = 1.6 for the rhythm profiles in our experiments.
In practice, for each conditional input y_i to G (i.e., each instance with index i in a batch), we compute two outputs G(y_i, z_i) and G(y_i, z_j) using randomly sampled z_i, z_j ∼ N(0, I), and use those outputs to compute L_profile (comprising L_freq and L_rhyt), as well as the common gradient update of G. Note that in order to minimize L_profile, G could simply learn to introduce huge changes in its output when the input noise z changes. However, in practice, this is prevented by the Wasserstein loss, which introduces a strong bias towards plausible outputs of G (i.e., outputs obeying the data distribution). Therefore, L_profile is effective in pushing the generations of G away from deterministic outputs while the overall training process remains stable.
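Under these definitions, the profile regularizer can be sketched as follows. This is our own NumPy rendering; the exact normalization of the loss in the paper's implementation may differ, and the small eps guarding the denominator is an assumption:

```python
import numpy as np

def profile_loss(power_a, power_b, z_a, z_b, p=1.3, axis=1, theta=1.0, eps=1e-8):
    """Ratio of noise distance to output-profile distance.

    power_a, power_b: power spectrograms (freq, time) of two generator outputs
    for the same conditioning but different noise vectors z_a, z_b.
    axis=1 sums over time (frequency profiles, L_freq); axis=0 sums over
    frequency (rhythm profiles, L_rhyt). The loss is large when the two
    outputs ignore the change in the noise input.
    """
    prof_a = np.sqrt(power_a).sum(axis=axis)   # element-wise root, then profile
    prof_b = np.sqrt(power_b).sum(axis=axis)
    z_dist = np.linalg.norm(z_a - z_b)
    p_dist = np.sum(np.abs(prof_a - prof_b) ** p) ** (1.0 / p)
    return theta * z_dist / (p_dist + eps)
```

Identical outputs for different noise vectors yield a very large loss, while outputs that respond to the noise reduce it.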

Data
The model is trained on pairs of audio data, where one part is the MP3 version and the other part is a high-quality (44.1 kHz) version of the signal. We use a dataset of approximately 64 hours of number-one songs from the US charts between 1950 and 2020. The high-quality data is compressed to 16 kbit/s, 32 kbit/s, and 64 kbit/s mono MP3 using the LAME MP3 codec, version 3.100.² The total set of songs is first divided into train, validation, and test sub-sets with ratios of 80%, 10%, and 10%, respectively. We then split each of the songs into 4-second-long segments with 50% overlap for training and validation. For the subjective evaluation (see Section 3.4.5), we split the songs into segments of 8 seconds.
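The segmentation step can be sketched as fixed-length windows with fractional overlap (a straightforward implementation; the function name is ours):

```python
import numpy as np

def segment(signal, sr=44100, seconds=4.0, overlap=0.5):
    """Chop a 1-D signal into fixed-length segments with fractional overlap.

    With seconds=4.0 and overlap=0.5, consecutive segments share half their
    samples, matching the 50% overlap used for training and validation.
    """
    win = int(sr * seconds)
    hop = int(win * (1.0 - overlap))
    n = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop: i * hop + win] for i in range(n)])
```

For an 8-second signal at 44.1 kHz this yields three 4-second segments starting at 0, 2, and 4 seconds.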

Data Representation
The main representation used in the proposed method is the complex STFT of the audio data, with coefficients h_{j,k} ∈ C for J frequency bins and K time frames, as it has been shown that this representation works well for audio generation with GANs [67]. The STFT is computed with a window size of 2048 and a hop size of 512. In addition, we apply a non-linear scaling to all complex components in order to obtain a scaling that is closer to human perception than using the STFT components directly. That is, we transform each complex STFT coefficient h_{j,k} = a_{j,k} + i b_{j,k} by taking the signed square-root of each of its components, h^σ_{j,k} = σ(a_{j,k}) + i σ(b_{j,k}), where the signed square-root is defined as

σ(r) = sign(r) √|r|.    (4)
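This scaling is straightforward to express in code; a sketch applying Equation 4 independently to the real and imaginary parts:

```python
import numpy as np

def signed_sqrt_spec(stft):
    """Non-linear scaling of complex STFT components: the signed square-root
    sigma(r) = sign(r) * sqrt(|r|) applied independently to the real and
    imaginary parts of each coefficient."""
    s = lambda r: np.sign(r) * np.sqrt(np.abs(r))
    return s(stft.real) + 1j * s(stft.imag)
```

The transform compresses large coefficients while preserving their signs, and it is trivially invertible by squaring with the sign restored.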

Evaluation
We perform objective and subjective evaluations for the proposed method. The main goal of the evaluation is to assess the similarity between the reference signals (i.e., the high-quality signals) and the signal approximations (i.e., MP3 versions of the audio excerpts or outputs of the proposed model). The objective metrics used include Log-Spectral Distance (LSD), Mean Squared Error (MSE), Signal-to-Noise Ratio (SNR), Objective Difference Grade (ODG), and Distortion Index (DI). We also perform a subjective evaluation in the form of the Mean Opinion Score (MOS), which is described in Section 3.4.5.

Objective Difference Grade and Distortion Index
The Objective Difference Grade (ODG) is a computational approximation to subjective evaluations (i.e., the subjective difference grade) of users when comparing two signals. It ranges from 0 to −4, where lower values denote a larger perceptible difference between the signals. The Distortion Index (DI) is a metric that is scaled differently but correlated with the ODG and can be seen as the amount of distortion between two signals. Both the ODG and the DI are based on a highly non-linear psychoacoustic model, including filtering and masking, to approximate human auditory perception. They are part of the Perceptual Evaluation of Audio Quality (PEAQ) ITU-R recommendation (BS.1387-1, last updated 2001) [68]. We use an openly available implementation of the basic version of PEAQ (as defined in the ITU recommendation), including both the ODG and the DI. Even though PEAQ was initially designed for evaluating audio codecs with minimal coding artifacts, we found that the results correlate well with our perception.

Log-Spectral Distance
The log-spectral distance (LSD) is the Euclidean distance between the log-spectra of two signals and is invariant to phase information. Here, we calculate the LSD between the spectrogram of the reference signal and that of the signal approximation:

LSD = (1/L) Σ_{l=1}^{L} √( (1/W) Σ_{w=1}^{W} ( log₁₀ P(l,w) − log₁₀ P̂(l,w) )² ),

where P and P̂ are the power spectra of x and x̂, respectively, L is the total number of frames, and W is the total number of frequency bins.
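A minimal NumPy sketch of this computation follows; the `eps` floor for numerical stability is our addition, not part of the paper's definition:

```python
import numpy as np

def log_spectral_distance(P_ref, P_est, eps=1e-10):
    """LSD between two power spectrograms of shape (L, W):
    the RMS over frequency of the log-spectra difference,
    averaged over all frames."""
    diff = np.log10(P_ref + eps) - np.log10(P_est + eps)
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))
```

Note that identical spectrograms yield an LSD of zero, and scaling one spectrogram by a factor of 10 shifts every log-spectrum bin by one, giving an LSD of 1.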

Mean Squared Error
The LSD described above (see Section 3.4.2) is particularly high when comparing MP3 data with high-quality audio data. This is because it is standard practice in many MP3 encoders (including the one we use) to perform a high-cut, removing most frequencies above a specific cut-off frequency. For values close to zero, log-scaling produces negative numbers of very high magnitude. Therefore, when comparing log-scaled power spectra of MP3 and PCM, we obtain particularly high distances. This generally favors algorithms that add frequencies in the upper range (like the proposed method). In this regard, a fairer comparison is the Mean Squared Error (MSE) between the square-roots of the power spectra of the two signals:

MSE = (1/(LW)) Σ_{l=1}^{L} Σ_{w=1}^{W} ( √P(l,w) − √P̂(l,w) )².
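In code, this amounts to an element-wise MSE over the magnitude spectrograms (the function name is ours):

```python
import numpy as np

def sqrt_spec_mse(P_ref, P_est):
    """MSE between the magnitude spectra, i.e., the square-roots
    of the two power spectrograms of shape (L, W)."""
    return np.mean((np.sqrt(P_ref) - np.sqrt(P_est)) ** 2)
```

Unlike the LSD, missing high-frequency content contributes only its (small) linear magnitude here, so the metric does not explode for band-limited MP3 signals.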

Signal-to-Noise Ratio
The signal-to-noise ratio (SNR) measures the ratio between the energy of a reference signal and that of the approximation residual. As it is computed in the time domain, it is highly sensitive to phase information. The SNR is calculated as

SNR = 10 log₁₀( ‖s‖² / ‖s − ŝ‖² ),

where s is the reference signal and ŝ is the signal approximation.
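A sketch of the time-domain SNR in NumPy:

```python
import numpy as np

def snr_db(s, s_hat):
    """Time-domain SNR in dB between a reference signal s and an
    approximation s_hat: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    return 10.0 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))
```

For example, halving the amplitude of the reference leaves a residual of half the signal, giving 10 log₁₀(4) ≈ 6 dB.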

Mean Opinion Score
We ask 15 participants (mostly expert listeners) to provide absolute ratings (i.e., without reference audio excerpts) of the perceptual quality of isolated musical excerpts. The listening test is performed with random, 8-second-long audio excerpts from the test set. We present to the listeners 5 high-quality audio excerpts, 15 MP3s (5 × 16kbit/s, 5 × 32kbit/s, and 5 × 64kbit/s), and 50 restored versions (25 stochastic restorations with random noise z and 25 deterministic restorations). Among the 25 restorations per model, 10 are restored from 16kbit/s, 10 from 32kbit/s, and 5 from 64kbit/s MP3s. Altogether, this results in 70 ratings per participant.
The participants were asked to give an overall quality score and were instructed to consider both the extent of the audible frequency range and noticeable, annoying artifacts. They provided their ratings using a Likert-scale slider with 5 quality levels: (1) very bad, (2) poor, (3) fair, (4) good, and (5) excellent. From these results, we compute the Mean Opinion Score (MOS) [69].

Results and Discussion
In the following, we present the results of the performed evaluations. In Section 4.1 we discuss the results of the objective metrics, and in Section 4.2 we discuss the subjective evaluation (i.e., the MOS). We also inspect the model output by comparing the spectrograms of some high-quality audio segments, the corresponding MP3 versions, and some restorations.

Objective Evaluation
We test the method for three different MP3 compression rates (16kbit/s, 32kbit/s, and 64kbit/s) as input to the generator. Moreover, as stated above, we assume that there are multiple valid solutions when restoring an MP3 with a very high compression rate. This would also mean that when using a stochastic generator, some of the possible samples should be closer to the original than the output of a deterministic generator. In order to test this hypothesis, for each compression rate, we train a stochastic generator (with noise input z) and a deterministic generator (without noise input). Then, for any input y taken from the test set, we sample 20 times with the corresponding generator using z_i ∼ N(0, I), and for each objective metric, we take the best value of that set. Note that all objective metrics are computed by comparing the restored data with the original versions. Therefore, when picking samples to optimize a specific metric, we do not pick the sample with the best "quality", but rather the restoration that best approximates the original.

Table 2 and Figure 4 show the results (i.e., the comparison to the high-quality data) for the stochastic and deterministic models and the respective MP3 baselines. For high compression rates (i.e., 16kbit/s and 32kbit/s), the best reconstructions of the stochastic models generally outperform the baseline MP3s in most metrics and improve over the outputs of the deterministic models. This indicates that using a stochastic generator is indeed beneficial for restoration tasks. For some metrics (excluding the LSD), the deterministic models perform on a par with the MP3 baselines. This is reasonable, as there are many different ways to restore the original version, and it is unlikely that a deterministic model outputs a close approximation. In Figure 4, the pronounced violin shapes indicate that the restorations form two groups in the ODG and DI metrics.
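The best-of-20 selection can be sketched as follows. Here, `generator`, `metric`, and `zdim` are placeholders for the trained stochastic generator, an objective distance to the original signal (lower is better), and the noise dimensionality; none of these are specified by the text:

```python
import numpy as np

def best_of_n(generator, y, metric, n=20, zdim=128, seed=0):
    """Draw n restorations of the input y with z ~ N(0, I) and keep
    the one that best approximates the original under `metric`."""
    rng = np.random.default_rng(seed)
    best_out, best_val = None, np.inf
    for _ in range(n):
        z = rng.standard_normal(zdim)       # z_i ~ N(0, I)
        out = generator(y, z)               # stochastic restoration
        val = metric(out)                   # distance to the original
        if val < best_val:
            best_out, best_val = out, val
    return best_out, best_val
```

Note that this selection requires access to the original signal through `metric`; it is an evaluation protocol for testing the hypothesis, not a deployable restoration procedure.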
From visual inspection of the respective data, it becomes clear that those excerpts in the lower (worse) groups are such without percussion instruments, indicating that the models cannot add meaningful high-frequency content for, e.g., singing voice or tonal instruments. The SNR is always worse for the restorations (compared to the MP3 baselines), which shows that the phase information is not faithfully regenerated. Given the high variety of possible phase information in the high frequency range, particularly for percussive sounds, this is not surprising, but also does not hamper the perceived audio quality.
For the 64kbit/s MP3s, we see that the reconstructions are worse than the MP3s themselves, except in the LSD metric. Note that 64kbit/s mono MP3s are already close to the original. The fact that the generator performs worse on these data indicates that, in addition to adding high-frequency content (which is mostly advantageous, as seen in the LSD results), it also introduces some undesirable artifacts in the reconstruction of the MP3 information.

Frequency Profiles
In order to test the influence of the input noise z on the generator output, we input random MP3 examples and restore them while keeping the noise input fixed. Then, we calculate the frequency profiles of the resulting outputs by taking the mean over the time dimension. Figure 5 shows examples of this experiment, making clear that a specific z causes a characteristic frequency profile consistently across different examples. This is advantageous when z is chosen manually to control the restoration of an entire song, where a consistent characteristic is desired throughout.
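The frequency profile of a restoration can be computed as sketched below (the dB conversion and the `eps` floor are our additions for plotting convenience):

```python
import numpy as np

def frequency_profile(S, eps=1e-10):
    """Average magnitude per frequency bin, i.e., the mean over the
    time axis of a spectrogram S of shape (n_bins, n_frames),
    returned in dB."""
    profile = np.mean(np.abs(S), axis=1)
    return 20.0 * np.log10(profile + eps)
```

Comparing such profiles for different inputs restored with the same fixed z makes the characteristic imprint of that z visible.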

Subjective Evaluation
In this section, we describe our own assessment when listening to the restored audio excerpts (Section 4.2.1), and then we provide results of the Mean Opinion Score (MOS), where we evaluate the restorations in a listening test with expert listeners.

Informal Listening
For sound examples of the proposed method, please refer to the accompanying website. When listening to the restored audio excerpts compared to the MP3 versions, the overall impression is a richer, higher-bandwidth sound that could be described as "opening up". We also notice that the model is able to remove some MP3 artifacts, particularly "swirlies", as described in the introduction (see also [4]). It is clearly audible that the model adds frequency content that was lost in the MP3 compression. When comparing the restorations directly to the high-quality versions, it is noticeable that the level of detail in the high frequencies is considerably lower in the restorations. On closer inspection, we hear that the model performs particularly well for specific sound events (i.e., adds convincing high-frequency content and removes specific compression artifacts), while other sources do not undergo a considerable improvement, and some events tend to cause undesired, audible artifacts.
Among the sound events that are generally improved very well are percussive elements like snare, crash, hi-hat, and cymbal sounds, but also other onsets with steep transients and non-harmonic high-frequency content, like the strumming of acoustic guitars or sibilants and plosives ('s' and 't') in a singing voice. Sustained electric guitars also undergo considerable improvement. Note that none of these sound types possesses harmonics; instead, they require the addition of high-frequency noise in the restoration process. Considering the nature of percussive sounds and the wide variety of sources in the training data, this is a reasonable outcome. On the one hand, percussive sounds dominate other sources in the higher frequency range, which constitutes the main difference between the MP3 and high-quality versions of the audio excerpts. On the other hand, harmonic sources are extremely varied, and their harmonics have different characteristics. In addition, harmonics are rarely found above 10kHz, which is the range in which the critic can best discriminate between MP3 and high-quality audio signals.
Sometimes, the generator adds undesired, sustained noise, mainly when the audio input is very compressed or when there are rather loud, single tonal instruments or a singing voice.

Table 3: Mean Opinion Score (MOS) of absolute ratings for different compression rates. We compare the stochastic (sto) versions against the deterministic baselines (det), the MP3-encoded lower anchors (mp3), and the original high-quality audio excerpts.
Other undesired artifacts added by the generator are mainly "phantom percussions", like hi-hats without meaningful rhythmic positions, triggered by events in the MP3 input that are confused with percussive sources. Also, the generator sometimes overemphasizes 's' or 't' phonemes of a singing voice. However, in some cases, percussive sounds not present in the original audio signals are added that are rhythmically meaningful. In general, the overall characteristics of the percussion instruments often differ between the restorations and the high-quality versions. This is reasonable, as the lower frequencies present in the MP3 provide no information about their characteristics in the higher frequency range, which therefore have to be regenerated by the model (dependent on the input noise z).

Table 3 shows the results of the listening test (i.e., the MOS ratings). Overall, the original and the 64kbit/s MP3s (mp3_64k) obtain the highest ratings, and the restored 64kbit/s MP3s (det_64k and sto_64k) perform slightly worse. The ratings for the restored 16kbit/s and 32kbit/s versions (det_16k, sto_16k, det_32k, and sto_32k) are considerably better than those of the MP3 versions (mp3_16k and mp3_32k). This shows that the proposed restoration process indeed results in better perceived audio quality. However, the random samples from the stochastic generators are not rated better than the outputs of the deterministic generators (the differences are not significant, as detailed below). We note that for the high compression rates, the restorations reach only about half the average rating of the high-quality versions (but about double the rating of the MP3 versions). While a restored MP3 version possesses a broader frequency range overall, weak ratings may result from off-putting artifacts, like the above-mentioned "phantom percussions". In an 8-second-long excerpt, a single irritating artifact can already lead to a relatively weak rating for the whole example.

Formal Listening
As the variance of the ratings is rather high, we also compute t-tests for statistical significance, comparing the responses to the different stimuli. We obtain p-values < 0.05 (in fact < 10^−5) when comparing det and sto to mp3 for compression rates below 64kbit/s. Conversely, we observe no statistically significant differences between the ratings of det and sto for any compression rate (p-values > 0.15). Responses to original and mp3_64k also show no statistically significant difference (p-value = 0.49). Similarly, we observe no statistically significant difference between responses to mp3_64k and det_64k (p-value = 0.06), whereas there is a significant difference between the ratings of sto_64k and mp3_64k (p-value = 0.04).
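The text does not state which t-test variant was used; as one plausible sketch, Welch's unequal-variance t statistic and degrees of freedom can be computed from two independent rating samples as below, from which a p-value would follow via the t-distribution CDF (e.g., `scipy.stats`):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent
    samples a and b (e.g., lists of MOS ratings per stimulus)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                        # squared std. error
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Identical samples yield t = 0, i.e., no evidence of a difference between the two conditions.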

Conclusion and Future Work
We presented a Generative Adversarial Network (GAN) architecture for the stochastic restoration of high-quality musical audio signals from highly compressed MP3 versions. We tested 1) if the output of the proposed model improves the quality of the MP3 inputs, 2) if a stochastic generator improves over a deterministic generator (i.e., can generate samples closer to the original), and 3) if the outputs of the stochastic variants are generally of higher quality than those of the deterministic baseline models.
Results show that the restorations of the highly compressed MP3 versions (16kbit/s and 32kbit/s) are generally better than the MP3 versions themselves, which is reflected in a thorough objective evaluation and confirmed in perceptual tests by human experts. We also tested a weaker compression rate (64kbit/s mono), where we found that the proposed architecture yields slightly worse results than the MP3 baseline. We could also show in the objective metrics that a stochastic generator can indeed output samples that are closer to the original than those of a deterministic generator. However, the perceptual tests indicate that random samples drawn from the stochastic generator are not rated significantly better than the results of the deterministic generator.
Due to the wide variety of popular music, the task of generating missing content is very challenging. However, the proposed models succeeded in adding high-frequency content for particular sources resulting in an overall improved perceived quality of the music. Examples for sources where the model clearly learned to generate meaningful high-frequency content are percussive elements (i.e., snare, crash, hi-hat and cymbal sounds), sibilants or plosives ('s' and 't') in singing voice, strummed acoustic guitars and (sustained) electric guitars.
We expect future improvements when limiting the style of the training data to particular genres or time periods of production. Also, as we use the complex spectrum directly, adapting the architecture to Complex Networks [70] could further improve the results. In order to tackle the problem of "phantom percussions" (as described in Section 4.2.1), a beat-detection algorithm could provide additional information to the generator so that it is better informed about the rhythmic structure of the input. To improve the restoration of the harmonics of tonal sources, other representations (e.g., Magnitude and Instantaneous Frequencies (Mag-IF) [71]) or a different scaling (e.g., Mel-scaled spectrograms) could be tested for the input and output of the generator.