Mutual Information, the Linear Prediction Model, and CELP Voice Codecs

We write the mutual information between an input speech utterance and its reconstruction by a Code-Excited Linear Prediction (CELP) codec in terms of the mutual information between the input speech and the contributions due to the short term predictor, the adaptive codebook, and the fixed codebook. We then show that a recently introduced quantity, the log ratio of entropy powers, can be used to estimate these mutual informations in terms of bits/sample. A key result is that for many common distributions and for Gaussian autoregressive processes, the entropy powers in the ratio can be replaced by the corresponding minimum mean squared errors. We provide examples of estimating CELP codec performance using the new results and compare the results to the performance of the AMR codec and other CELP codecs. Similar to rate distortion theory, this method only needs the input source model and the appropriate distortion measure.

Sec. 2 provides the background on Code-Excited Linear Prediction (CELP) needed for the remainder of the paper, while Sec. 3 develops the decomposition of the mutual information between the input speech and the speech reconstructed by the CELP codec. The concept of entropy power as defined by Shannon [7] is presented in Sec. 4, and the ordering of mutual information as a signal is passed through a cascaded signal processing system is stated in Sec. 5. The recently proposed quantity, the log ratio of entropy powers, is given in Sec. 6, where it is shown that the mean squared estimation errors can be substituted for the entropy powers in the ratio for an important set of probability densities [5,8,9]. The mutual information between the input speech and the short term prediction sequence is discussed in Sec. 7, and the mutual information provided by the adaptive and fixed codebooks about the input speech is developed in Sec. 8. The promised analysis of CELP codecs using these prior mutual information results, based on only the input speech model and the distortion measure, is presented in Sec. 9. The final section contains conclusions drawn from the results in the paper.

Block diagrams of the CELP encoder and decoder are shown in Figs. 1 and 2 [1,2]. We provide a brief description of the various blocks in Figs. 1 and 2 to begin. The CELP encoder is an implementation of the Analysis-by-Synthesis (AbS) paradigm [1]. CELP, like most speech codecs in the last 45 years, is based on the linear prediction model for speech, wherein the speech is modeled as

$$s(k) = \sum_{i=1}^{N} a_i s(k-i) + e(k), \qquad (1)$$

where we see that the current speech sample at time instant $k$ is represented as a weighted linear combination of the past $N$ speech samples plus an excitation term. The encoder selects and encodes the short term predictor coefficients, the adaptive and fixed codebook entries, and the corresponding gains. These operations are represented by the Parameter Encoding block in Fig. 1 [1,10].

The CELP Decoder uses these parameters to synthesize the block of M reconstructed speech samples presented to the listener as shown in Fig. 2. There is also Post-Processing not shown in the figure.

The quality and intelligibility of the synthesized speech are often determined by listening tests that produce Mean Opinion Scores (MOS), which for narrowband speech vary from 1 up to 5 [11,12]. A well-known codec such as G.711 is usually included to provide an anchor score value with respect to which other narrowband codecs can be evaluated [13,14,15].

It would be helpful to be able to associate a separate contribution to the overall performance with each of the main components in Fig. 2. The signal-to-noise ratio (SNR) can play this role for waveform-following codecs, but CELP does not attempt to follow the speech waveform, so SNR is not applicable. One alternative is to decompose the mutual information between the input and reconstructed speech as

$$I(X; X_R) = I(X; X_N) + I(X; X_C \mid X_N).$$

This expression states that the mutual information between the original speech $X$ and the reconstructed speech $X_R$ equals the mutual information between $X$ and $X_N$, the $N$th order linear prediction of $X$, plus the mutual information between $X$ and the combined codebook excitations $X_C$ conditioned on $X_N$. Thus, to achieve or maintain a specified mutual information between the original speech and the reconstructed speech, any change in $X_N$ must be offset by an adjustment of $X_C$. This fits what is known experimentally and what was alluded to earlier. If we define $X_A$ to represent the Adaptive codebook contribution and $X_F$ to represent the Fixed codebook contribution, we can further decompose $I(X; X_C \mid X_N)$ as

$$I(X; X_C \mid X_N) = I(X; X_A \mid X_N) + I(X; X_F \mid X_N, X_A),$$

where we have used the chain rule for mutual information [16].

While these expressions are interesting, the challenge that remains is to characterize each of these mutual informations without actually having to calculate them directly from data, which is a difficult problem in and of itself [17].

An interesting quantity introduced and analyzed by Gibson in a series of papers is the log ratio of entropy powers [5,8,9]. Specifically, the log ratio of entropy powers is related to the difference in mutual information, and further, in many cases, the entropy powers can be replaced with the minimum mean squared prediction error (MMSPE) in the ratio. Using the MMSPE, the difference in mutual informations can be easily calculated. The following sections develop these concepts before we apply them to an analysis of the CELP structure.

Given a random variable $X$ with probability density function $p(x)$, we can write the differential entropy

$$h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx,$$

where the variance $\mathrm{var}(X) = \sigma^2$. Since the Gaussian distribution has the maximum differential entropy of any distribution with mean zero and variance $\sigma^2$ [16],

$$h(X) \le \frac{1}{2} \log 2\pi e \sigma^2, \qquad (4)$$

from which we obtain

$$Q = \frac{1}{2\pi e}\, e^{2h(X)} \le \sigma^2, \qquad (5)$$

where $Q$ was defined by Shannon to be the entropy power associated with the differential entropy of the original random variable [7]. In addition to defining entropy power, this equation shows that the entropy power is the minimum variance that can be associated with the not-necessarily-Gaussian differential entropy $h(X)$.
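As a quick numerical illustration (ours, not from the original paper), the following Python sketch evaluates Eq. (5) for a Gaussian and for a uniform density, confirming that the entropy power equals the variance only in the Gaussian case and is strictly smaller otherwise.

```python
import math

def entropy_power(h_nats: float) -> float:
    """Entropy power Q = exp(2h) / (2*pi*e) of Eq. (5), with h in nats."""
    return math.exp(2.0 * h_nats) / (2.0 * math.pi * math.e)

# Gaussian with variance sigma2: h = 0.5*ln(2*pi*e*sigma2), so Q = sigma2 exactly.
sigma2 = 2.0
h_gauss = 0.5 * math.log(2.0 * math.pi * math.e * sigma2)
print(entropy_power(h_gauss), sigma2)      # 2.0  2.0

# Uniform on [0, 1]: h = 0 nats and variance 1/12, so Q < variance.
print(entropy_power(0.0), 1.0 / 12.0)      # 0.0585...  0.0833...
```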

Cascaded Signal Processing
Figure 3 shows a cascade of $N$ signal processing operations with the Estimator blocks at the output of each stage as studied by Messerschmitt [18]. He used the conditional mean at each stage and the corresponding conditional mean squared errors to obtain a representation of the distortion contributed by each stage. We analyze the cascade connection in terms of information theoretic quantities, such as mutual information, differential entropy, and entropy rate power. Similar to Messerschmitt, we consider systems that have no hidden connections between stages other than those explicitly shown. Therefore, we conclude directly from the Data Processing Inequality [16] that

$$I(X; Y_1) \ge I(X; Y_2) \ge \cdots \ge I(X; Y_N).$$

For the optimal estimators at each stage, the basic Data Processing Inequality also yields $I(X; Y_n) \ge I(X; \hat{X}_n)$ and thus $h(X \mid Y_n) \le h(X \mid \hat{X}_n)$. These are the fundamental results that additional processing cannot increase the information available about the input $X$. We can also write that

$$Q_{X|Y_1} \le Q_{X|Y_2} \le \cdots \le Q_{X|Y_N}, \qquad Q_{X|Y_n} \le Q_{X|\hat{X}_n}. \qquad (8)$$

In the context of Eq. (8), the notation $Q_{X|Y_n}$ denotes the minimum variance when reconstructing an approximation to $X$ given the sequence at the output of stage $n$ in the chain [8].
We can use the definition of the entropy power in Eq. (5) to express the logarithm of the ratio of two entropy powers in terms of their respective differential entropies as [8]

$$\log \frac{Q_X}{Q_Y} = 2\left[h(X) - h(Y)\right]. \qquad (10)$$

We can write a conditional version of Eq. (5) as

$$Q_{X|Y_n} = \frac{1}{2\pi e}\, e^{2 h(X|Y_n)},$$

from which we can express Eq. (10) in terms of the entropy powers at successive stages in the signal processing chain, Fig. 3, as

$$\log \frac{Q_{X|Y_{n+1}}}{Q_{X|Y_n}} = 2\left[h(X \mid Y_{n+1}) - h(X \mid Y_n)\right]. \qquad (12)$$

If we add and subtract $h(X)$ on the right hand side of Eq. (12), we then obtain an expression in terms of the difference in mutual information between the two stages as

$$\log \frac{Q_{X|Y_{n+1}}}{Q_{X|Y_n}} = 2\left[I(X; Y_n) - I(X; Y_{n+1})\right]. \qquad (16)$$

From the series of inequalities on the entropy power in Eq. (8), we know that both expressions in Eqs. (12) and (16) are greater than or equal to zero.

These results are from [8] and extend the Data Processing Inequality by providing a new characterization of the information loss between stages in terms of the entropy powers of the two stages. Since differential entropies are difficult to calculate, it would be particularly useful if we could obtain expressions for the entropy power at two stages and then use Eqs. (12) and (16) to find the difference in differential entropy and mutual information between these stages.
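As a sanity check on Eq. (16) (our own example, not from [8]), consider a two-stage cascade in which each stage adds independent Gaussian noise. For jointly Gaussian variables the conditional entropy power equals the conditional (MMSE) variance, so both sides of Eq. (16) can be computed in closed form:

```python
import math

# Two-stage Gaussian cascade: Y1 = X + N1, Y2 = Y1 + N2, all independent, zero mean.
sx2, s1, s2 = 1.0, 0.5, 0.8                  # hypothetical variances

# For jointly Gaussian variables, Q_{X|Y} equals the MMSE variance var(X|Y).
q1 = sx2 * s1 / (sx2 + s1)                   # var(X | Y1)
q2 = sx2 * (s1 + s2) / (sx2 + s1 + s2)       # var(X | Y2), larger as Eq. (8) requires

i1 = 0.5 * math.log(sx2 / q1)                # I(X; Y1) in nats
i2 = 0.5 * math.log(sx2 / q2)                # I(X; Y2) in nats

# Eq. (16): log(Q_{X|Y2} / Q_{X|Y1}) = 2 [ I(X;Y1) - I(X;Y2) ] >= 0
print(math.log(q2 / q1), 2.0 * (i1 - i2))    # both ~0.528
```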

We are interested in studying the change in the differential entropy and mutual information brought on by different signal processing operations by investigating the log ratio of entropy powers.

In the following we highlight several cases where Eq. (10) holds with equality when the entropy powers are replaced by the corresponding variances. The Gaussian and Laplacian distributions often appear in studies of speech processing and other signal processing applications [10,15,19], so we show that substituting the variances for entropy powers in the log ratio of entropy powers for these distributions satisfies Eq. (10) exactly. For two i.i.d. Gaussian distributions with zero mean and variances $\sigma_X^2$ and $\sigma_Y^2$, the entropy powers are $Q_X = \sigma_X^2$ and $Q_Y = \sigma_Y^2$, so we form

$$\log \frac{Q_X}{Q_Y} = \log \frac{\sigma_X^2}{\sigma_Y^2} = 2\left[h(X) - h(Y)\right],$$

which satisfies Eq. (10) exactly. Of course, since the Gaussian distribution is the basis for the definition of entropy power, this result is not surprising.

For two i.i.d. Laplacian distributions with parameters $\lambda_X$ and $\lambda_Y$ (variances $2\lambda_X^2$ and $2\lambda_Y^2$) [20], the corresponding entropy powers are $Q_X = 2e\lambda_X^2/\pi$ and $Q_Y = 2e\lambda_Y^2/\pi$, respectively, so we form

$$\log \frac{Q_X}{Q_Y} = \log \frac{\lambda_X^2}{\lambda_Y^2} = 2\left[h(X) - h(Y)\right].$$

Thus, we can substitute the variances for the entropy powers in Eq. (10) and the result is the difference in differential entropies.
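A short numerical check (ours, with hypothetical scale parameters) confirms that for Laplacian densities the entropy power is a fixed fraction $e/\pi$ of the variance, so the variance ratio reproduces the entropy power ratio in Eq. (10):

```python
import math

def laplace_entropy(b: float) -> float:
    """Differential entropy of a Laplacian with scale b, in nats: h = 1 + ln(2b)."""
    return 1.0 + math.log(2.0 * b)

def entropy_power(h: float) -> float:
    return math.exp(2.0 * h) / (2.0 * math.pi * math.e)

bx, by = 0.7, 1.9                            # hypothetical scale parameters
qx, qy = entropy_power(laplace_entropy(bx)), entropy_power(laplace_entropy(by))
vx, vy = 2.0 * bx**2, 2.0 * by**2            # Laplacian variance is 2*b^2

print(qx / vx, qy / vy)                      # both e/pi ~ 0.8652
print(qx / qy, vx / vy)                      # identical ratios, as Eq. (10) requires
```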

Interestingly, using mean squared errors or variances in Eq. (10) also produces the difference in differential entropies exactly for a number of other common distributions, and for Gaussian autoregressive processes the entropy powers can be replaced by the corresponding minimum mean squared errors [5,8,9]. Therefore, the satisfaction of Eq. (10) with equality occurs in more than one or two special cases.

The key points are, first, that the entropy power is the smallest variance that can be associated with a given differential entropy, so the entropy power is some fraction of the mean squared error for a given differential entropy. Second, we can use Eq. (10) as in [5] to associate a change in mutual information with a change in the predictor order.

Figure 4 (bottom) shows 160 time domain samples from a narrowband (200 to 3400 Hz) speech sentence sampled at 8,000 samples/sec, and the top plot is the magnitude of the spectral envelope calculated from the linear prediction model using Eq. (1). We show the MMSPE and the corresponding change in mutual information for predictor orders N = 1, 2, ..., 10, in Table 1. We see that the mutual information between the input speech frame and a 10th order predictor is 1.52 bits/sample. We can examine the mutual information between the input speech and a 10th order linear predictor for other frames in the same speech utterance.
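The MMSPE values in Table 1 come from standard linear prediction analysis. The sketch below is our illustration (the random frame is a stand-in for the actual speech samples of Fig. 4); it uses the Levinson-Durbin recursion to compute the MMSPE for orders 1 through 10 and converts each to a mutual information estimate via Eq. (10) with MMSPE substituted for entropy power.

```python
import numpy as np

def mmspe_by_order(x: np.ndarray, nmax: int = 10):
    """Levinson-Durbin: MMSPE for predictor orders 1..nmax, from one frame x."""
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(nmax + 1)])
    r = r / r[0]                             # normalize so MSE(X) = 1.0
    err, a, out = 1.0, np.zeros(nmax + 1), []
    for n in range(1, nmax + 1):
        k = (r[n] - np.dot(a[1:n], r[n - 1:0:-1])) / err   # reflection coefficient
        a[1:n] = a[1:n] - k * a[n - 1:0:-1]                # coefficient update
        a[n] = k
        err *= (1.0 - k * k)                               # MMSPE at order n
        out.append(err)
    return out

frame = np.random.randn(160)                 # stand-in for a 20 ms speech frame
for n, e in enumerate(mmspe_by_order(frame), start=1):
    # I(X; X_n) ~ 0.5 * log2( MSE / MMSPE ) bits/sample
    print(n, round(e, 4), round(0.5 * np.log2(1.0 / e), 3))
```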

To categorize the differences in the speech frames for easy reference, we use a common indicator of predictor performance, the Signal-to-Prediction Error (SPER) in dB [21], also called the Prediction Gain [15], defined as

$$\mathrm{SPER(dB)} = 10 \log_{10} \frac{\mathrm{MSE}(X)}{\mathrm{MMSPE}(X, X_{10})}, \qquad (17)$$

where $\mathrm{MSE}(X)$ is the average energy in the utterance and $\mathrm{MMSPE}(X, X_{10})$ is the minimum mean squared prediction error achieved by a 10th order predictor. The SPER can be calculated for any predictor order, but we choose N = 10, a common choice in narrowband speech codecs and the predictor order that, on average, captures most of the possible reduction in mean squared prediction error without including long term pitch prediction. For a normalized $\mathrm{MSE}(X) = 1.0$, we see that for the speech frame in Fig. 4, the SPER = 9.15 dB.
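Since the mutual information estimate and the SPER are logarithms of the same MMSPE ratio, one converts to the other through a constant factor; for the frame in Fig. 4 this reproduces the 1.52 bits/sample found above:

$$I(X; X_{10}) = \frac{1}{2}\log_2 \frac{\mathrm{MSE}(X)}{\mathrm{MMSPE}(X, X_{10})} = \frac{\mathrm{SPER(dB)}}{20\log_{10} 2} \approx \frac{9.15}{6.02} \approx 1.52 \text{ bits/sample}.$$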

Several other speech frames by the same speaker are analyzed in [5] and the results for these frames are tabulated in Table 5. From this table, it is evident that the mutual information between the input speech and a 10th order linear predictor can change quite dramatically across frames, even with the same speaker. We observe that the change in mutual information in some sense mirrors the frame-to-frame variation in prediction gain.

The long term predictor that makes up the adaptive codebook in CELP may have 1 or 3 taps and is of the form

$$\hat{x}(k) = \sum_{i=-1}^{1} \beta_i\, x(k - P + i), \qquad (18)$$

where $\beta_i$, $i = -1, 0, 1$, are the predictor coefficients that are updated on a frame by frame basis and $P$ is the lag of the periodic correlation that captures the pitch excitation of the vocal tract. A 3 tap predictor as shown in Eq. (18) requires a stability check to guarantee that this component does not cause divergence [22]. The single tap form given by

$$\hat{x}(k) = \beta\, x(k - P)$$

only needs the stability check that $|\beta| < 1$, which should hold for any normalized autocorrelation. The 3 tap form can often provide improved performance over the single tap predictor, but at increased complexity.
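A minimal open-loop sketch of the single tap form follows (our illustration; real CELP codecs run this search on the short term prediction residual, often in closed loop). It estimates the lag $P$ and gain $\beta$ by least squares over a typical narrowband lag range, which is an assumption here:

```python
import numpy as np

def single_tap_pitch(x: np.ndarray, pmin: int = 20, pmax: int = 147):
    """Estimate lag P and gain beta for the single tap predictor beta * x[k - P]."""
    best_red, best_p, best_beta = -1.0, pmin, 0.0
    for p in range(pmin, min(pmax, len(x) - 1) + 1):
        num = float(np.dot(x[p:], x[:-p]))   # sum_k x[k] * x[k - p]
        den = float(np.dot(x[:-p], x[:-p]))  # sum_k x[k - p]^2
        if den <= 0.0:
            continue
        reduction = num * num / den          # squared-error reduction at lag p
        if reduction > best_red:
            best_red, best_p, best_beta = reduction, p, num / den
    return best_beta, best_p                 # stability requires |beta| < 1

x = np.random.randn(320)                     # stand-in for a residual segment
beta, P = single_tap_pitch(x)
print(beta, P, abs(beta) < 1.0)
```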

We denote the long term predicted value as $X_{PP}$ to distinguish it from the short term predictor of order $N$, $X_N$, which contains all terms up to and including $N$. Thus, we can write the MMSPE between $X$ and $X_{PP}$ as $\mathrm{MMSPE}(X, X_{PP})$.

The prediction gain in dB is often used as a performance indicator for a pitch predictor. The long term prediction gain has the form of Eq. (17) but with $\mathrm{MMSPE}(X, X_{10})$ replaced by $\mathrm{MMSPE}(X, X_{PP})$, where $X_{PP}$ is the long term predicted value for a pitch lag of $P$ samples, and where usually $P$ ranges from 20 samples up to well over 100 samples. Table 3 shows the mutual information in bits/letter that can be associated with prediction gains of 1, 2, 3, 4, and 6 dB.
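Under the same MMSPE-for-entropy-power substitution, a long term prediction gain in dB maps to bits/sample through the constant factor $1/(20\log_{10} 2) \approx 1/6.02$; the loop below (ours) shows how entries like those in Table 3 can be generated:

```python
import math

# delta-I (bits/sample) associated with a long term prediction gain of PG dB,
# via Eq. (10) with MMSPE substituted for the entropy powers.
for pg_db in (1, 2, 3, 4, 6):
    print(pg_db, "dB ->", round(pg_db / (20.0 * math.log10(2.0)), 3), "bits/sample")
# 1 dB -> 0.166, 2 dB -> 0.332, 3 dB -> 0.498, 4 dB -> 0.664, 6 dB -> 0.997
```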

It is interesting to consult the standardized speech codecs to see how many bits/sample are allocated to coding the pitch gain (coefficient or coefficients) and the pitch delay (lag $P$). From the bit allocations of the AMR codec, at the lower rates the fixed codebook gets about 0.5 bits/sample, and at 12.2 kbits/sec, about 1 bit/sample.
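These per-component figures follow from dividing the bits a codec assigns to each component per frame by the number of samples per frame. The allocation below is illustrative only (a rough AMR-like layout, not the exact standardized bit allocation), assuming 20 ms frames at 8,000 samples/sec:

```python
# Convert a per-frame bit allocation to bits/sample: 20 ms at 8000 samples/sec
# gives 160 samples/frame. Numbers are illustrative, not the exact AMR layout.
FRAME_SAMPLES = 160
allocation = {"lp_coefficients": 38, "adaptive_codebook": 46, "fixed_codebook": 140}

for component, bits in allocation.items():
    print(component, round(bits / FRAME_SAMPLES, 3), "bits/sample")
print("total:", sum(allocation.values()) * 50 / 1000.0, "kbits/sec")  # 50 frames/sec
```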

We now have estimates of the bits/sample devoted to the short term predictor, adaptive codebook, and fixed codebook for a CELP codec operating at different bit rates. We show how to exploit these estimates in the following to predict the rate of CELP codecs for different speech sources.

The analyses determining the mutual information in bits/sample between the input speech and the short term linear prediction, the adaptive codebook, and the fixed codebook individually are entirely new and provide new ways to analyze CELP codec performance by only analyzing the input source. In this section we estimate the performance of a CELP codec by analyzing the input speech source to find the mutual information provided by each CELP component about the input speech, and then subtracting the three mutual informations from a reference codec rate in bits/sample for a chosen MOS value to get the final estimate of the rate required in bits/sample to achieve the target MOS.
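The procedure reduces to a single subtraction once the per-component mutual informations and their relative frequencies are in hand. The sketch below (ours) reproduces the arithmetic of the worked example that follows; the frame-class labels in the comments are our reading of Table 4, and the 4.0 bits/sample reference rate for MOS = 4.0 is taken from the text:

```python
# Estimated CELP rate = reference rate at the target MOS minus the mutual
# information supplied by the short term predictor, adaptive codebook, and
# fixed codebook, weighted by relative frequency of occurrence (Table 4).
reference_rate = 4.0                         # bits/sample at MOS = 4.0

contributions = [
    (0.5265, 1.96 + 0.5),                    # short term + adaptive codebook frames
    (1.0,    1.2),                           # fixed codebook contribution
    (0.5,    0.0771),                        # remaining frame class (Table 4)
]

total_mi = sum(freq * mi for freq, mi in contributions)
print("total mutual information:", round(total_mi, 2), "bits/sample")      # ~2.5
print("estimated rate:", round(reference_rate - total_mi, 2), "bits/sample")  # ~1.5
```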

For waveform speech coding, such as differential pulse code modulation (DPCM), for a particular utterance, we can study the rate in bits/sample versus the mean squared reconstruction error or SNR to obtain some idea of the codec performance for this input speech segment [14,15]. The utterances analyzed here are "We were away a year ago," spoken by a male, and "A lathe is a big tool. Grab every dish of sugar," spoken by a female.

If we now combine all of these estimates with their associated relative frequencies of occurrence as indicated in Table 4, we obtain a total mutual information of $0.5265(1.96 + 0.5) + 1.2 + 0.5(0.0771) \approx 2.5$ bits/sample. Subtracting this from 4 bits/sample, we estimate that the CELP codec rate for an MOS = 4.0 would be 1.5 bits/sample.