Audio Signal Processing Using Fractional Linear Prediction

Abstract: Fractional linear prediction (FLP), as a generalization of conventional linear prediction (LP), was recently successfully applied in different fields of research and engineering, such as biomedical signal processing, speech modeling and image processing. The FLP model has a design similar to that of the conventional LP model, i.e., it uses a linear combination of "fractional terms" with different orders of fractional derivative. Assuming only one "fractional term" and using a limited number of previous samples for prediction, an FLP model with "restricted memory" is presented in this paper and the closed-form expressions for calculation of the FLP coefficients are derived. This FLP model is fully comparable with the widely used low-order LP, as it uses the same number of previous samples but fewer predictor coefficients, making it more efficient. Two different datasets, MIDI Aligned Piano Sounds (MAPS) and Orchset, were used for the experiments. Triads representing the chords composed of three randomly chosen notes and usual Western musical chords (both of them from the MAPS dataset) served as the test signals, while the piano recordings from the MAPS dataset and orchestra recordings from the Orchset dataset served as the musical signals. The results show an enhancement of FLP over LP in terms of model complexity, whereas the performance is comparable.


Introduction
The sinusoidal model is widely used for representation of pseudo-stationary signals, especially in audio coding [1] and musical signal processing [2]. Parameters of the sinusoidal model are determined frame-wise from the input audio/musical signal, and a sound is synthesized using the extracted parameters [3]. A pure tone can be represented as a single sine wave, whereas musical chords are produced by combining three or more sine waves with different frequencies. In fact, any musical tone can be described as a combination of sine waves, or partials, each with its own amplitude, phase and frequency of vibration [4]. A sine wave can be fully described using three parameters: amplitude, phase and frequency. Such a signal is clearly redundant; hence, there is no need to encode and transmit each signal sample.
Linear prediction (LP) can be used to remove redundancy by predicting the current signal sample from the signal history, as a weighted linear combination of past samples. In that case, only the coefficients of the predictor need to be transmitted, not the signal samples themselves. While LP is extensively used for modeling speech signals [5][6][7], it did not prove to be the best choice for modeling audio signals. This is unexpected, since a signal represented by a combination of sine waves should be perfectly predicted using an LP model with an order equal to twice the number of sinusoids. The problem might be that LP models well those signals whose tonal components are equally distributed in the Nyquist interval, which is not the case with audio, where tonal components are concentrated in a substantially smaller frequency region in comparison to the signal bandwidth [8]. This happens because audio signals are usually sampled at a much higher frequency than the frequency of their tonal components. Nevertheless, there are applications of LP in audio coding algorithms using the so-called frequency-warped LP [9,10], where the unit delays are replaced by first-order all-pass filter elements to adjust the frequency resolution of the spectral estimate so that it closely approximates the frequency resolution of human hearing [9]. LP is also used in acoustic echo cancelation [11], music dereverberation [12], audio signal classification [13] and audio/music onset detection [14,15].
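The statement that sinusoids are perfectly predictable can be checked numerically for a single sine wave, using the trigonometric identity x[n] = 2cos(ω)x[n−1] − x[n−2]. The following is a minimal sketch; the sampling rate, tone frequency and phase are illustrative values chosen here, not taken from the experiments.

```python
import numpy as np

fs = 44100.0                      # sampling rate in Hz (illustrative)
f0 = 440.0                        # tone frequency in Hz (illustrative)
omega = 2.0 * np.pi * f0 / fs     # normalized angular frequency

n = np.arange(2048)
x = np.sin(omega * n + 0.3)       # single sine wave with an arbitrary phase

# Second-order predictor: x[n] = 2*cos(omega)*x[n-1] - x[n-2]
a1, a2 = 2.0 * np.cos(omega), -1.0
x_hat = a1 * x[1:-1] + a2 * x[:-2]        # predictions for n = 2 .. len(x)-1
max_abs_residual = np.max(np.abs(x[2:] - x_hat))
```

The residual is zero up to floating-point rounding. Note that for a tone far below the Nyquist frequency the coefficient a1 is very close to 2, which is consistent with the remark above that tonal components of audio occupy only a small part of the sampled bandwidth.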
The idea of using the signal history is fundamentally rooted in fractional calculus [16]. Fractional linear prediction (FLP), as a generalization of LP for fractional (arbitrary real) order derivatives, was recently used in electroencephalogram (EEG) [16,17] and electrocardiogram (ECG) signal modeling [18], as well as in speech coding [16,19-21]. While in [17-19] the full signal history is used for predicting the current signal sample, which is impractical from the implementation point of view, a model with restricted signal memory that uses only the most recent signal samples was proposed, together with its applications, in [21,22]. However, to the best of our knowledge, there are no applications of FLP in audio/musical signal processing. In this paper, we present FLP with memory restricted to a maximum of four previous samples and apply it to the prediction of randomly generated test chords, usual chords in Western music and piano pieces extracted from the MIDI Aligned Piano Sounds dataset, as well as musical excerpts from the Orchset dataset, which were extracted from symphonies, ballets and other classical musical forms and interpreted by symphonic orchestras.
The paper is organized as follows. Section 2 presents an overview of conventional LP and the FLP with "restricted memory". Datasets used for experiments are described in Section 3. The numerical results using the test chords, piano and orchestra musical parts are discussed in Section 4, followed by concluding remarks in Section 5.

Conventional Linear Prediction
Let the signal $x(t)$ represent a linear and stationary stochastic process, where $x[n] = x(nT)$ is the $n$th signal sample and $T$ is the sampling period. The signal $x(t)$ at time instant $t = nT$ is modeled as a linear combination of $p$ previous signal samples:
$$\hat{x}[n] = \sum_{i=1}^{p} a_i\, x[n-i], \tag{1}$$
where $\hat{x}[n]$ denotes the predicted signal sample and $a_i$ are the linear predictor coefficients. The order of a linear predictor denotes the number of linear predictor coefficients, which is equal to the number of samples used for prediction.
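The prediction error $e[n]$ is defined as the deviation of the predicted signal $\hat{x}[n]$ from the original signal $x[n]$, and the mean-squared prediction error is equal to:
$$J = E\left[e^{2}[n]\right] = E\left[\left(x[n] - \hat{x}[n]\right)^{2}\right], \tag{2}$$
where $E[\cdot]$ is the mathematical expectation. The optimal predictor coefficients $a_i$ can be determined by equating the first derivative of $J$, with respect to $a_i$, to zero. After some manipulation, we obtain:
$$\sum_{i=1}^{p} a_i\, R_{xx}(|i-k|) = R_{xx}(k), \quad k = 1, \ldots, p, \tag{3}$$
where
$$R_{xx}(k) = E\left[x[n]\, x[n-k]\right]$$
denotes the autocorrelation function at lag $k$. Equation (3) is known as the Yule-Walker equation [7] and can be rewritten in matrix form as:
$$\mathbf{R}_{xx}\, \mathbf{a} = \mathbf{r}_{xx}, \tag{4}$$
where $\mathbf{R}_{xx}$ is the $p \times p$ Toeplitz matrix with entries $R_{xx}(|i-k|)$, $\mathbf{a} = [a_1, \ldots, a_p]^{T}$ and $\mathbf{r}_{xx} = [R_{xx}(1), \ldots, R_{xx}(p)]^{T}$. The optimal linear predictor coefficients $\mathbf{a}$ can be found from:
$$\mathbf{a} = \mathbf{R}_{xx}^{-1}\, \mathbf{r}_{xx}. \tag{5}$$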
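As an illustration, the Yule-Walker system can be solved numerically for a single frame. The sketch below assumes a biased sample estimate of the autocorrelation function; the function names `autocorr` and `lp_coefficients` are hypothetical helpers introduced here, not part of the original work.

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased sample autocorrelation R_xx(k) for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])

def lp_coefficients(x, p):
    """Solve the Yule-Walker system R a = r for an order-p linear predictor."""
    r = autocorr(x, p)
    # Toeplitz autocorrelation matrix with entries R_xx(|i - k|)
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = r[lags]
    return np.linalg.solve(R, r[1:])
```

For p = 1 the system reduces to the single equation a = R_xx(1)/R_xx(0). Solving the linear system directly, rather than forming the matrix inverse, is a numerical design choice of this sketch.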

Fractional Linear Prediction with "Restricted Memory"
FLP is a generalization of LP using fractional-order derivatives. By analogy with LP, the $n$th signal sample can be represented as a linear combination of $q$ "fractional terms", and can be written as [16]:
$$\hat{x}[n] = \sum_{i=1}^{q} a_i\, D^{\alpha_i} x[n-1], \tag{6}$$
where $\hat{x}[n]$ is the estimate of the $n$th signal sample, $q$ is the number of "fractional terms" used for the prediction, $a_i$ are the FLP coefficients, and $D^{\alpha_i} x[n-1]$ are the fractional derivatives of order $\alpha_i$ of the time-delayed signal, where $\alpha_i \in \mathbb{R}$.
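The fractional derivative $D^{\alpha}$ of a function $x(t)$ at time instant $t$ can be approximated using the Grünwald-Letnikov (GL) definition [23]:
$${}_{a}D_{t}^{\alpha}\, x(t) = \lim_{h \to 0} \frac{1}{h^{\alpha}} \sum_{j=0}^{\left[\frac{t-a}{h}\right]} (-1)^{j} \binom{\alpha}{j}\, x(t - jh), \tag{7}$$
where $h$ is the sampling period, $a$ and $t$ are the lower and upper limits of differentiation, and $\alpha \in \mathbb{R}$ is the order of fractional differentiation. Note that the upper limit of summation tends to infinity. Accounting only for the recent history of the signal, i.e., replacing the lower limit $a$ by the moving lower limit $t - L$ ($L$ is the memory length), the "short memory" principle [23] is employed. Due to this approximation, the number of addends in Equation (7) is not greater than $K = [L/h]$. For $t = nh$, Equation (7) becomes:
$$D^{\alpha} x(nh) \approx \frac{1}{h^{\alpha}} \sum_{j=0}^{K} (-1)^{j} \binom{\alpha}{j}\, x(nh - jh). \tag{8}$$
Replacing $x(nh)$ with $x[n]$, and assuming that only the past samples are used for the estimation of the predicted signal sample, without including the current sample, i.e., introducing a time delay of one sample in Equation (8), one gets:
$$D^{\alpha} x[n-1] \approx \frac{1}{h^{\alpha}} \sum_{j=0}^{K} (-1)^{j} \binom{\alpha}{j}\, x[n-1-j]. \tag{9}$$
Taking into account only one "fractional term" from Equation (6), i.e., when $q = 1$, one obtains [21,22]:
$$\hat{x}[n] = a\, D^{\alpha} x[n-1]. \tag{10}$$
Considering an integer $K$ as the upper limit of the summation in Equation (9), i.e., for $K = 1$:
$$D^{\alpha} x[n-1] \approx \frac{1}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] \right), \tag{11}$$
$K = 2$:
$$D^{\alpha} x[n-1] \approx \frac{1}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] - \frac{\alpha(1-\alpha)}{2}\, x[n-3] \right), \tag{12}$$
and $K = 3$:
$$D^{\alpha} x[n-1] \approx \frac{1}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] - \frac{\alpha(1-\alpha)}{2}\, x[n-3] - \frac{\alpha(1-\alpha)(2-\alpha)}{6}\, x[n-4] \right), \tag{13}$$
we get three modifications of the FLP model with "restricted memory" (Equation (10)), which use the memory ($M$) of two, three, and four samples, respectively.
Employing the memory of two samples, i.e., substituting $D^{\alpha} x[n-1]$ from Equation (11) into Equation (10), the two-sample FLP model is defined as:
$$\hat{x}[n] = \frac{a}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] \right), \tag{14}$$
and the prediction error is evaluated as:
$$e[n] = x[n] - \hat{x}[n] = x[n] - \frac{a}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] \right). \tag{15}$$
Minimizing the mean squared prediction error $J = E\left[e^{2}[n]\right]$ and substituting the autocorrelation function, the optimal coefficient $a$ can be found. After some manipulation, the optimal FLP parameter can be written as:
$$a = h^{\alpha}\, \frac{R_{xx}(1) - \alpha R_{xx}(2)}{(1 + \alpha^{2}) R_{xx}(0) - 2\alpha R_{xx}(1)}. \tag{16}$$
In case the order of fractional derivative $\alpha$ tends to zero, we get:
$$a = \frac{R_{xx}(1)}{R_{xx}(0)},$$
i.e., the optimal first-order linear predictor is only a special case of the proposed FLP model with "restricted memory" using the memory of two previous samples.
Considering the FLP model with "restricted memory" of three samples, where $D^{\alpha} x[n-1]$ is estimated using Equation (12), the predicted sample becomes:
$$\hat{x}[n] = \frac{a}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] - \frac{\alpha(1-\alpha)}{2}\, x[n-3] \right). \tag{17}$$
Minimizing the mean squared prediction error $J = E\left[e^{2}[n]\right]$, the optimal coefficient $a$ can be found as:
$$a = h^{\alpha}\, \frac{R_{xx}(1) - \alpha R_{xx}(2) - \frac{\alpha(1-\alpha)}{2} R_{xx}(3)}{\left( 1 + \alpha^{2} + \frac{\alpha^{2}(1-\alpha)^{2}}{4} \right) R_{xx}(0) - 2\alpha \left( 1 - \frac{\alpha(1-\alpha)}{2} \right) R_{xx}(1) - \alpha(1-\alpha) R_{xx}(2)}. \tag{18}$$
As in the case of the FLP model with two-sample memory, when the order of fractional derivative $\alpha$ tends to zero, the computation of the FLP coefficient $a$ reduces to $a = R_{xx}(1)/R_{xx}(0)$, meaning that the first-order LP is a special case of the FLP model with "restricted memory" using the memory of three previous samples.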
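The Grünwald-Letnikov weights $(-1)^{j}\binom{\alpha}{j}$ that appear in the truncated sums can be generated with a simple recurrence, which avoids evaluating generalized binomial coefficients directly. The sketch below is illustrative only: the function names are hypothetical, and the default `h = 1.0` is an assumption made for convenience.

```python
import numpy as np

def gl_weights(alpha, K):
    """Weights w_j = (-1)^j * binom(alpha, j) for j = 0..K."""
    w = np.empty(K + 1)
    w[0] = 1.0
    for j in range(1, K + 1):
        # Recurrence: w_j = w_{j-1} * (1 - (alpha + 1) / j)
        w[j] = w[j - 1] * (1.0 - (alpha + 1.0) / j)
    return w

def delayed_fractional_derivative(x, alpha, K, h=1.0):
    """Approximate D^alpha x[n-1] for n = K+1 .. len(x)-1 using K+1 past samples."""
    x = np.asarray(x, dtype=float)
    w = gl_weights(alpha, K)
    # 'valid' convolution gives sum_j w_j * x[m - j] for m = K .. len(x)-1
    d = np.convolve(x, w, mode="valid") / h**alpha
    # Drop the last value so that entry i corresponds to the target sample n = K+1+i
    return d[:-1]
```

For K = 3 the weights are 1, −α, −α(1−α)/2 and −α(1−α)(2−α)/6, and for α = 0 all weights except the first vanish, so the approximation reduces to the delayed sample x[n−1].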
The last modification of the presented FLP model with "restricted memory" (Equation (10)) takes into account the memory of four previous samples, i.e., $D^{\alpha} x[n-1]$ is estimated using Equation (13):
$$\hat{x}[n] = \frac{a}{h^{\alpha}} \left( x[n-1] - \alpha\, x[n-2] - \frac{\alpha(1-\alpha)}{2}\, x[n-3] - \frac{\alpha(1-\alpha)(2-\alpha)}{6}\, x[n-4] \right). \tag{19}$$
Computing the prediction error $e[n]$ and minimizing the mean squared prediction error $J = E\left[e^{2}[n]\right]$ by finding the first derivative of $J$ with respect to $a$ and equating it to zero, the optimal coefficient $a$ is obtained in the form:
$$a = h^{\alpha}\, \frac{R_{xx}(1) - \alpha R_{xx}(2) - c_2 R_{xx}(3) - c_3 R_{xx}(4)}{\left( 1 + \alpha^{2} + c_2^{2} + c_3^{2} \right) R_{xx}(0) - 2\left( \alpha - \alpha c_2 - c_2 c_3 \right) R_{xx}(1) - 2\left( c_2 - \alpha c_3 \right) R_{xx}(2) - 2 c_3 R_{xx}(3)}, \tag{20}$$
where
$$c_2 = \frac{\alpha(1-\alpha)}{2}, \qquad c_3 = \frac{\alpha(1-\alpha)(2-\alpha)}{6}.$$
Again, as in the case of the FLP models with two-sample and three-sample memory, when the memory of four samples is used and the order of fractional derivative $\alpha$ tends to zero, the computation of the FLP coefficient $a$ reduces to $a = R_{xx}(1)/R_{xx}(0)$. This confirms that the proposed FLP models with "restricted memory" are generalizations of the low-order LP, i.e., the first-order LP is only a special case of the presented FLP model.
It was proven in [21,22] that the parameter $\alpha$ of the FLP model with "restricted memory" can be estimated as the inverse of the number of samples used by the FLP model, i.e., $\alpha = 1/M$. Thus, the order of fractional differentiation is in this paper assumed fixed, with the values $\alpha = 0.5$ for the FLP model with two-sample memory, $\alpha = 0.33$ for the FLP model with three-sample memory, and $\alpha = 0.25$ for the FLP model with four-sample memory. It follows that the FLP model with "restricted memory" practically uses only one predictor coefficient, which has to be encoded and transmitted, regardless of the number of previous samples used for prediction.
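The closed-form coefficients for the two-, three- and four-sample memory share a common structure: with u[n] denoting the weighted sum of the M past samples, minimizing J gives a = h^α E[x[n]u[n]] / E[u²[n]]. The sketch below evaluates this ratio for a general memory M under several assumptions made here for illustration: a biased sample autocorrelation, α set to exactly 1/M (so 1/3 rather than the rounded 0.33), h = 1 by default, and hypothetical function names.

```python
import numpy as np

def flp_coefficient(x, M, h=1.0):
    """FLP coefficient a for memory M with alpha fixed to 1/M.

    Returns (a, alpha, w), where w holds the weights (-1)^j * binom(alpha, j).
    """
    x = np.asarray(x, dtype=float)
    alpha = 1.0 / M
    w = np.ones(M)
    for j in range(1, M):
        w[j] = w[j - 1] * (1.0 - (alpha + 1.0) / j)
    # Biased autocorrelation estimates R_xx(0..M)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(M + 1)])
    lags = np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
    num = w @ r[1:]           # E[x[n] u[n]] with u[n] = sum_j w_j x[n-1-j]
    den = w @ r[lags] @ w     # E[u[n]^2]
    return h**alpha * num / den, alpha, w

def flp_predict(x, a, alpha, w, h=1.0):
    """Predicted samples x_hat[n] for n = M .. len(x)-1."""
    x = np.asarray(x, dtype=float)
    u = np.convolve(x, w, mode="valid")[:-1]
    return a * u / h**alpha
```

A typical use on one frame would be `a, alpha, w = flp_coefficient(frame, M=2)` followed by `e = frame[2:] - flp_predict(frame, a, alpha, w)`. Only `a` would need to be transmitted, since α and the weights follow from M.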

MAPS Dataset
The MIDI Aligned Piano Sounds (MAPS) dataset contains 65 h of stereo audio recordings sampled at 44.1 kHz with 16 bit resolution (CD quality), recorded using either software-based sound generation or a Disklavier piano [24,25]. The dataset contains four subsets: isolated notes (ISOL); chords composed of randomly chosen notes (RAND); usual chords in Western music (UCHO); and piano classical music pieces (MUS). The audio samples were recorded in different recording conditions (e.g., studio, jazz club, church, and concert hall). The RAND, UCHO and MUS subsets were used in the experiments, using all four recording conditions.

Orchset Dataset
The Orchset dataset contains 64 mono and stereo audio recordings, sampled at 44.1 kHz, extracted from symphonies, ballets and other classical musical forms and interpreted by symphonic orchestras [26]. The lengths of the recordings are 10-32 s (mean 22.1 s, standard deviation 6.1 s), and the number of recordings per composer is 1-13, with 15 composers in total. Music excerpts were selected to have a dominant melody, maximizing the existence of voiced segments per excerpt. In all excerpts, the melody was played using more than one instrument from the instrument section, except for one excerpt where only an oboe was used (with orchestral accompaniment).

Signal Preprocessing
In signal processing applications, e.g., when processing speech or audio signals, which are non-stationary, the signal is usually divided into short-time windows, denoted as frames, where the signal is approximately stationary. In the case of audio signals, the frame length is typically 10-120 ms [27,28]. In this study, the experiments were performed using three different frame sizes, equal to 10 ms, 60 ms and 120 ms.
The audio signal may contain silent periods, usually at the beginning or at the end of a signal. This was especially evident in the RAND and UCHO subsets of the MAPS dataset, where the silent periods were even longer than the signal itself. Modeling silent frames is unnecessary, since resources are spent on parts of the signal that do not contribute to signal reconstruction. Therefore, the silent frames were removed before further processing. Furthermore, the DC offset was removed from the audio signal, as signal compression, or any other processing that depends on the absolute signal levels, may otherwise lead to distortions and other undesirable results. Finally, all stereo recordings were converted to mono by combining the left and right channels prior to further processing.
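The preprocessing steps described above can be sketched as follows. Several details are assumptions made for illustration, since they are not specified in the text: the channels are combined by averaging, the frames do not overlap, silence is detected with an energy threshold relative to the loudest frame (the hypothetical parameter `silence_db`), and the steps are applied in the order shown in the code.

```python
import numpy as np

def preprocess(audio, fs, frame_ms=60.0, silence_db=-60.0):
    """Down-mix to mono, remove DC offset, split into frames, drop silent frames.

    audio: array of shape (N,) or (N, channels); fs: sampling rate in Hz.
    Returns an array of shape (n_kept_frames, frame_len).
    """
    x = np.asarray(audio, dtype=float)
    if x.ndim == 2:
        x = x.mean(axis=1)                 # average the channels (assumption)
    x = x - x.mean()                       # remove the DC offset
    frame_len = int(round(fs * frame_ms / 1000.0))
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    if n_frames == 0:
        return frames
    energy = np.mean(frames**2, axis=1)
    # Keep frames whose energy exceeds a threshold relative to the loudest frame
    threshold = energy.max() * 10.0 ** (silence_db / 10.0)
    return frames[energy > threshold]
```

With `fs = 44100`, the three frame sizes used in this study (10 ms, 60 ms and 120 ms) correspond to 441, 2646 and 5292 samples per frame.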

Numerical Results and Discussion
The proposed FLP with "restricted memory" given in Equation (10) with the memory of two (Equation (14)), three (Equation (17)) and four samples (Equation (19)) was compared to conventional low-order LP using the same signal history. Experiments were performed using two test signals, namely three-note chords composed of randomly chosen notes (MAPS-RAND subset) and usual three-note Western musical chords (MAPS-UCHO subset), and two musical signals, namely piano recordings (MAPS-MUS subset) and orchestra recordings (Orchset). The signals belonging to one recording condition (studio, jazz club, church, or concert hall) of the particular dataset were concatenated into one signal prior to applying either LP or FLP.
The prediction gain (PG) served as the predictor performance measure, defined as the ratio between the variance of the input signal and the variance of the prediction error, measured in decibels:
$$\mathrm{PG} = 10 \log_{10} \frac{\sigma_{x}^{2}}{\sigma_{e}^{2}} \; [\mathrm{dB}].$$
The smaller the error generated by the predictor, the higher the gain [29].
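The prediction gain follows directly from its definition. The function below is a minimal sketch with a hypothetical name; it takes the original samples and the corresponding prediction errors.

```python
import numpy as np

def prediction_gain_db(x, e):
    """Prediction gain in dB: 10 * log10(var(x) / var(e))."""
    return 10.0 * np.log10(np.var(x) / np.var(e))
```

For example, a prediction error whose amplitude is one tenth of the signal amplitude has one hundredth of the signal variance, which corresponds to a prediction gain of 20 dB.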

Experiments
The results for the randomly generated chords (MAPS-RAND subset) for different recording conditions (studio, jazz club, church, and concert hall) using four low-order LP models (first-order, second-order, third-order and fourth-order) and FLP models with two-sample, three-sample and four-sample memory are presented in Table 1. The results show that the first-order LP is inappropriate; however, increasing the prediction order beyond the second-order LP is not necessary, as it does not bring significant improvement. Similar behavior can be observed for the FLP models, where the best performing model is the one with the two-sample memory. For frames of 120 ms length, its performance is only slightly lower than the performance of the second-order LP, albeit obtained using only one predictor coefficient (note that the second-order LP, which also uses the memory of two samples, requires the optimization of two predictor coefficients). By decreasing the frame length, the performance of both LP and FLP decreases, but with FLP approaching LP for the memory of three and four samples. Note that the results for FLP with the memory of three and four samples were obtained using two and three fewer predictor coefficients, respectively, than in the case of the third-order and fourth-order LP.
The prediction results for the chords composed of three randomly chosen notes from the MAPS-RAND subset are also presented in Figure 1, where the prediction error using the second-order, third-order and fourth-order LP (black solid line) is compared to the prediction error obtained using the FLP model with two-sample, three-sample and four-sample memory (red dot-dashed line). Ten characteristic frames with the length of 60 ms are shown in the figure. The results confirm that the performance of the second-order LP and the FLP with two-sample memory is comparable for the signals recorded under different conditions (studio, jazz club, church, and concert hall), and that the difference between the prediction errors of the LP and FLP models generally increases with the length of the used memory.
Behavior similar to that of the randomly generated chords can be observed when using usual three-note Western musical chords (MAPS-UCHO subset). Again, the performance of FLP with two-sample memory is comparable to the second-order LP for all frames, although FLP uses one coefficient fewer (see Table 2).
Ten characteristic frames with the length of 60 ms are shown in Figure 2 for the MAPS-UCHO subset, where the prediction error using the second-order, third-order and fourth-order LP (black solid line) is compared to the prediction error obtained using the FLP model with two-sample, three-sample and four-sample memory (red dot-dashed line). The results confirm that the performance of the second-order LP and the FLP with two-sample memory is comparable for the signals recorded under different conditions (studio, jazz club, church, and concert hall), and also that the difference between the prediction errors of the LP and FLP models increases with the length of the used memory.
The results for the piano music excerpts using the MAPS-MUS subset are also presented for three different frame sizes, i.e., 10 ms, 60 ms and 120 ms (see Table 3). For shorter frames (10 ms), the performance of FLP is always comparable to the performance of the corresponding LP that uses the same signal memory. For longer frames, the PG of FLP is comparable to the PG of the corresponding LP for jazz club and church recording conditions, while the performance deteriorates by 1-2 dB only for FLP with the memory of three and four samples for studio and concert hall recording conditions, suggesting that FLP is better suited for signals recorded in reverberant or non-ideal acoustical conditions. Note that FLP always uses only one predictor coefficient, regardless of the signal memory used for prediction. For example, FLP with the four-sample memory achieves performance comparable to the corresponding fourth-order LP, but with three fewer predictor coefficients that need to be optimized. This can lead to substantial savings in bit rate, as predictor coefficients need to be encoded and transferred to the receiver end. Furthermore, note that better performance is obtained using longer frames for both LP and FLP; hence, more frequent coefficient updates do not bring any improvement.
The last experiment was performed using the orchestra music excerpts from the Orchset dataset. Since LP models are, in general, known to perform well on piano music, we tested the performance of our model on a more challenging music signal played by the orchestra (see Table 3). The performance of FLP in comparison to LP is lower than for piano music; however, the model with two-sample memory is still comparable to the corresponding second-order LP for all frame lengths. Third- and fourth-order LP models perform better than FLP at the expense of two and three additional coefficients, respectively.
When evaluating the prediction error for the musical signals from the MAPS-MUS subset (see Figure 3) under the same recording conditions as in the previous experiments (studio, jazz club, church, and concert hall), an interesting observation can be made: the difference between the prediction errors of the LP and FLP models does not increase as significantly with the length of the used memory (especially for the jazz club and church recording conditions) as it did for the signals representing chords. Furthermore, the second-order LP and the FLP with two-sample memory perform at the same level on the shown signals for all four recording conditions. Similar behavior is present in the case of the orchestra music excerpts from the Orchset dataset (see Figure 4). Please note that, in Figures 3 and 4, ten characteristic frames with the length of 60 ms are again shown, and that the prediction error using the second-order, third-order and fourth-order LP (black solid line) is compared to the prediction error obtained using the FLP model with two-sample, three-sample and four-sample memory (red dot-dashed line).
Here, it should be emphasized that the LP and FLP models always use the same number of previous samples (two, three and four), which allows a fair comparison. Furthermore, it is important to emphasize that all FLP models show performance comparable to the LP models, even though they use only two parameters, i.e., one predictor coefficient a and one order of fractional derivative α, in comparison to the LP models, which use two, three and four predictor coefficients (based on the order of the LP predictor). Moreover, the order of fractional differentiation α does not have to be computed or optimized. It can be estimated as the inverse of the predictor memory, as previously shown in [21,22], resulting in only one FLP coefficient that has to be encoded and transmitted. This makes the proposed FLP significantly more efficient than LP.
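The saving in side information can be illustrated with simple arithmetic. The sketch below counts predictor coefficients transmitted per second of audio, assuming non-overlapping frames and one coefficient update per frame; both are assumptions made here for illustration, and the number of bits per coefficient is deliberately left out because it is not specified in the text.

```python
def coefficients_per_second(frame_ms, coeffs_per_frame):
    """Number of predictor coefficients transmitted per second of audio."""
    return coeffs_per_frame * 1000.0 / frame_ms

for frame_ms in (10, 60, 120):
    lp4 = coefficients_per_second(frame_ms, 4)   # fourth-order LP
    flp = coefficients_per_second(frame_ms, 1)   # FLP with alpha fixed to 1/M
    print(f"{frame_ms:>3} ms frames: LP-4 {lp4:.1f}/s, FLP {flp:.1f}/s")
```

Under these assumptions, the FLP model with four-sample memory transmits one quarter of the coefficients of the fourth-order LP at every frame size.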

Conclusions
Fractional linear prediction with "restricted memory" that uses two, three, and four previous samples, respectively, for audio signal prediction is discussed in this work, and the closed-form expressions for the FLP predictor coefficient are derived. Two datasets were used in the experiments to test the performance of the model and compare it to linear prediction: the MAPS dataset, which contains chords composed of randomly chosen notes, usual chords in Western music, and piano music excerpts; and the Orchset dataset, which contains music excerpts extracted from symphonies, ballets and other classical musical forms, and interpreted by symphonic orchestras.
Using the same number of previous samples for prediction, the results show that FLP is better suited for the prediction of audio signals than the conventional low-order LP models, since it provides comparable performance even though it uses fewer parameters (one predictor coefficient and one order of fractional derivative). Furthermore, the order of fractional derivative does not have to be optimized and can be taken as the inverse of the memory length of the FLP model, making it even more efficient in comparison to the LP model, where the number of predictor coefficients is always equal to the predictor order. For example, FLP with the memory of four samples requires only one predictor coefficient, whereas the corresponding fourth-order LP requires four predictor coefficients, at similar performance. Therefore, substantial savings in transmission costs are possible.

Figure 1. The prediction error results for the random chords (MAPS-RAND) for second-order, third-order and fourth-order LP and the FLP with the two-sample, three-sample, and four-sample memory: (a) studio recording; (b) jazz club recording; (c) church recording; and (d) concert hall recording.

Figure 2. The prediction error results for the three-note chords (MAPS-UCHO) for second-order, third-order and fourth-order LP and the FLP with the two-sample, three-sample, and four-sample memory: (a) studio recording; (b) jazz club recording; (c) church recording; and (d) concert hall recording.

Figure 3. The prediction error results for the musical signals (MAPS-MUS) for second-order, third-order and fourth-order LP and the FLP with the two-sample, three-sample, and four-sample memory: (a) studio recording; (b) jazz club recording; (c) church recording; and (d) concert hall recording.

Figure 4. The prediction error results for the musical signal (Strauss-BlueDanube-ex1, from the Orchset dataset) for second-order, third-order and fourth-order LP and the FLP with the two-sample, three-sample, and four-sample memory.

Table 1. Prediction gain (dB) for the chords composed of three randomly chosen notes (MAPS-RAND subset).

Table 2. Prediction gain (dB) for the usual Western music three-note chords (MAPS-UCHO subset).

Table 3. Prediction gain (dB) for the musical signals of classical music pieces played by piano (MAPS-MUS subset) and classical music pieces performed by orchestra (Orchset dataset).