1. Introduction
Several speech processing methods such as speech coding, speaker/speech recognition, speech synthesis, speech enhancement, voice activity detection (VAD), independent component analysis (ICA), speaker diarization, etc. require the statistical modeling of speech signals. The performance of these speech processing applications depends on the accurate modeling of the probability distribution of speech signals [1]. Many results available in the literature model speech with a Gamma distribution (γ-D), Gaussian distribution (GD), generalized Gaussian distribution (GGD), or Laplacian distribution (LD), as well as with multivariate Gaussian and Laplacian distributions [1,2,3,4,5,6,7,8]. The probability distribution that best models speech samples depends on various factors such as the domain of speech representation (time domain or frequency domain), segment length, silence periods in the speech, and noise. Important findings available in the literature are discussed below to highlight the fact that several probability distributions can be used to model speech signals under different conditions.
The following findings have been reported in [1]: The time domain representation of speech signals follows LD when the silence intervals (pauses in speech) are omitted. Furthermore, the de-correlated representation of speech, where the correlation between successive samples of speech is removed using the Karhunen–Loève transform, has a multivariate LD. In the presence of silence intervals along with voice activity, γ-D is a good fit. Short segments (<2.5 ms) of speech in the time domain have been shown to follow a GD, whereas for longer time domain segments (>5 ms), LD is shown to provide a better fit. For time domain speech segment lengths of the order of several seconds, γ-D or GGD is preferred.
De-correlated speech segments are also shown to follow LD with different distribution parameters [2]. In [3,4], it was concluded that the voiced component of speech followed LD and the unvoiced component followed GD. The generalized γ-D (Gγ-D) has also been proposed as a more efficient and accurate parametric model of speech probability distribution [6,7].
In the spectral domain, the power spectral density of speech, obtained using linear predictive coding (LPC), has been shown to be best described by a Rayleigh distribution for frequency components around formant frequencies and by a Gaussian distribution for frequency components away from formant frequencies [9]. Formant frequencies are the frequency components around which acoustic energy is concentrated; frequency components that are not near formant frequencies have low concentrations of acoustic energy. Time domain segments of lengths 20–200 ms have been shown to fit LD, but for segments shorter than 20 ms, GD is considered better. The distribution of discrete Fourier transform (DFT) coefficients of speech segments around 30 ms has been shown to fit a GD [10], which is in contradiction to [11,12], where the authors suggest that an LD or γ-D provides a better fit for the DFT coefficients. The authors of [10] attribute this contradiction to the longer segment lengths and the multiple speaker samples used in [11,12].
The design of digital hearing aids requires an understanding of the probabilistic model of long segments of speech in the spectral domain, i.e., with high spectral resolution. A type of hearing aid called a cochlear implant provides the sense of sound to people with profound hearing loss. Early versions of cochlear implants discarded the fine spectral details and used only the coarse details. The performance of cochlear implants improves progressively as more spectral details are included [13]. Hence, there is a need to model speech spectral parameters with high spectral resolution. While results are available in the literature for representing long segments of speech in the time domain, no results are available that provide a statistical model of the spectral representation of long segments of speech. In this paper, we show that the distribution of short-time Fourier transform (STFT) coefficients (spectral domain) of speech segments over durations greater than 500 ms can be modeled accurately by LD, with a small RMS error for the estimated LD parameters, which validates the estimated parameters. The Cramer–Rao bound (CRB) for the estimated parameter was also evaluated and the RMS error, which represents the variance of the estimated parameter, was found to be close to the CRB, indicating that maximum likelihood (ML) estimation is an efficient algorithm for the proposed distribution parameter. It was also found that for low spectral resolution, neither LD nor GD was an accurate distribution model.
The rest of the paper is organized as follows. In Section 2, the ML estimation of LD and GD parameters and the error bound, which represents estimation optimality, are presented. Removal of direct current (DC) bias and VAD to remove silence intervals are discussed in Section 3. In Section 4, experimental details and results corresponding to fitting LD and GD are presented along with their analysis and interpretation. Conclusions and future work are presented in Section 5.
2. Maximum Likelihood Estimation of Laplacian Distribution (LD) and Gaussian Distribution (GD) Parameters
In ML estimation, the distribution parameters that maximize the likelihood function are estimated from the observed STFT coefficients of speech, which are assumed to be independently and identically distributed (IID). It is a standard assumption in the literature that DFT coefficients of speech are IID [10,11,12]. Since STFT coefficients are computed as the DFT of windowed segments, the IID assumption is extended to STFT coefficients as well. ML estimation is chosen due to its low implementation complexity and because it approaches a minimum variance unbiased estimator (MVUE) as the observation sample length increases [14]. The Laplace distribution has a probability density function (PDF) defined as

$$f(x \mid \mu, b) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right)$$

with parameters $\mu$ and $b$, which are to be estimated. The likelihood function for the Laplace model parameters $\mu$ and $b$, given $N$ observed samples $x_1, \ldots, x_N$, is defined as

$$L(\mu, b) = \prod_{i=1}^{N}\frac{1}{2b}\exp\left(-\frac{|x_i-\mu|}{b}\right)$$

and the corresponding log-likelihood is defined as

$$\ln L(\mu, b) = -N\ln(2b) - \frac{1}{b}\sum_{i=1}^{N}|x_i-\mu|$$
The ML estimate of $\mu$ is given as the median of the observed samples, i.e., an efficient estimator $\hat{\mu}$ for the parameter $\mu$ is the median of the observed samples [15]. Due to the symmetric nature of LD, the median is equal to the mean of the $N$ observed samples $x_i$:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i$$
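The ML estimate of $b$ is obtained by differentiating the log-likelihood function with respect to $b$ and equating the derivative to zero,

$$\frac{\partial \ln L(\mu, b)}{\partial b} = 0,$$

i.e.,

$$-\frac{N}{b} + \frac{1}{b^{2}}\sum_{i=1}^{N}|x_i-\hat{\mu}| = 0.$$

Solving for $b$ gives an efficient estimator $\hat{b}$ for the parameter $b$ [15]:

$$\hat{b} = \frac{1}{N}\sum_{i=1}^{N}|x_i-\hat{\mu}|$$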
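The Laplace estimators described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `laplace_mle` and `draw_laplace`, the synthetic data, and the chosen true parameters are assumptions made for the example. It uses the sample median as the location estimate and the mean absolute deviation about it as the scale estimate.

```python
import random
import statistics


def laplace_mle(samples):
    """Return (mu_hat, b_hat) for IID samples under a Laplace model."""
    # Location estimate: the sample median.
    mu_hat = statistics.median(samples)
    # Scale estimate: mean absolute deviation about the location estimate.
    b_hat = sum(abs(x - mu_hat) for x in samples) / len(samples)
    return mu_hat, b_hat


def draw_laplace(rng, mu, b):
    """Draw one Laplace(mu, b) variate as a scaled difference of two unit exponentials."""
    return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))


# Synthetic stand-in for STFT coefficients (hypothetical values, not speech data).
rng = random.Random(0)
data = [draw_laplace(rng, 0.5, 2.0) for _ in range(20000)]
mu_hat, b_hat = laplace_mle(data)
print(mu_hat, b_hat)  # both should land close to the true values 0.5 and 2.0
```

With real data, `data` would hold the real or imaginary parts of the STFT coefficients after silence removal.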
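Similarly, the Gaussian distribution has a PDF defined as

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of GD. The corresponding efficient estimators $\hat{\mu}$ and $\hat{\sigma}$ are obtained by following similar procedures as the estimation of LD parameters and are given, for $N$ observed samples $x_i$, as

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i, \qquad \hat{\sigma} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{\mu})^{2}}$$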
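For comparison, a minimal Python sketch of the Gaussian ML estimators follows. The names `gaussian_mle` and `gaussian_pdf` and the synthetic data are assumptions for illustration only. Note that the ML variance estimate divides by the number of samples rather than by one less than that number.

```python
import math
import random


def gaussian_mle(samples):
    """Return (mu_hat, sigma_hat) for IID samples under a Gaussian model."""
    n = len(samples)
    mu_hat = sum(samples) / n
    # ML variance estimate: divide by n, not n - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, math.sqrt(var_hat)


def gaussian_pdf(x, mu, sigma):
    """Gaussian density evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


# Synthetic data with hypothetical true parameters mu = -1, sigma = 3.
rng = random.Random(1)
data = [rng.gauss(-1.0, 3.0) for _ in range(20000)]
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat, sigma_hat)
```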
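CRB provides a limit to the performance optimality of unbiased estimators and is used to measure the performance of estimators [16]. CRB gives the lower bound on the variance of any unbiased estimator. An unbiased estimator that achieves this bound is said to be fully efficient. In the context of this work, the CRB for the estimated LD parameter $b$ is defined as

$$\mathrm{CRB}(b) = \frac{1}{I(b)}$$

where, for $N$ observed samples,

$$I(b) = -E\left[\frac{\partial^{2}\ln L(\mu, b)}{\partial b^{2}}\right] = \frac{N}{b^{2}}$$

The corresponding CRB values for $\mu$ and $\sigma^{2}$ of GD are defined as

$$\mathrm{CRB}(\mu) = \frac{1}{I(\mu)} = \frac{\sigma^{2}}{N}, \qquad \mathrm{CRB}(\sigma^{2}) = \frac{1}{I(\sigma^{2})} = \frac{2\sigma^{4}}{N}$$

where $I(b)$, $I(\mu)$, and $I(\sigma^{2})$ denote the Fisher information associated with the estimated parameters [17].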
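A small Monte Carlo experiment can illustrate how the mean squared error of the ML scale estimate compares with the bound. The sketch below assumes the standard closed form for the Laplace scale parameter, CRB(b) = b²/N for N samples. The helper names, the number of trials, and the synthetic data are illustrative assumptions, not the experimental setup of this paper.

```python
import random
import statistics


def laplace_scale_crb(b, n):
    """Cramer-Rao bound b^2 / n for the Laplace scale parameter from n IID samples."""
    return b * b / n


def laplace_scale_mle(samples):
    """ML estimate of the Laplace scale: mean absolute deviation about the median."""
    mu_hat = statistics.median(samples)
    return sum(abs(x - mu_hat) for x in samples) / len(samples)


rng = random.Random(2)
true_b, n, trials = 1.5, 500, 400  # hypothetical values for the experiment

estimates = []
for _ in range(trials):
    # Zero-mean Laplace variates as a scaled difference of two unit exponentials.
    frame = [true_b * (rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(n)]
    estimates.append(laplace_scale_mle(frame))

mse = sum((e - true_b) ** 2 for e in estimates) / trials
crb = laplace_scale_crb(true_b, n)
print(mse, crb)  # the empirical error should be of the same order as the bound
```

An empirical mean squared error close to the bound is what one would expect from an efficient estimator.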
4. Experimental Procedure and Discussion of Results
Speech samples were recorded in typical office and residential environments rather than in a controlled environment such as an anechoic chamber, using a single microphone at a sample rate of 16,000 samples per second. This sample rate is standard in speech processing research and corresponds to a wideband representation of speech that faithfully retains all frequency components of the speech signal. A lower sample rate can cause a loss of information in the fricative components of the speech signal. The DC offset is first removed, followed by VAD to remove silence intervals. Since the spectral components of speech are time varying, computing the STFT depicts the variation of spectral components with time. The STFT of speech containing only voice activity is computed and ML estimation is used to estimate the parameters of the fitted distribution. The accuracy of the fitted distribution is determined by computing the RMS error, and the effectiveness of the fitting algorithm, i.e., MLE, is determined by computing the CRB on the estimation error.
We computed a narrowband STFT using a wide Hamming window, providing a frequency resolution below 5 Hz. Speech processing with high frequency resolution is useful in several applications such as speech enhancement, improved VAD, speech de-noising, analysis of speech intonation to detect emotion, speech recognition, and automatic speech restoration (ASR) to improve speech intelligibility, as well as in the design of hearing aids and cochlear implants for people with a hearing disability [13,20,21,22,23]. A less obvious but important application is to model the perceptual stability of human listening capability, which allows humans to perceive speech reliably under a wide range of acoustic conditions [24]. While speech is generally considered a non-stationary signal over segment lengths longer than 20–30 ms, the perceptual stability of the human auditory system motivates the need for an invariant representation of speech over long segment lengths, which has application in the design of digital hearing aids.
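The relation between window length and spectral resolution can be made concrete with a short sketch. It assumes the nominal DFT bin spacing, i.e., the sample rate divided by the number of samples in the window; the effective resolution of a Hamming-windowed frame is somewhat coarser because of the main-lobe width. The function names and the 1 kHz test tone are hypothetical choices for illustration.

```python
import cmath
import math

FS = 16000  # samples per second, as stated in the text


def bin_spacing_hz(fs, window_seconds):
    """Nominal DFT bin spacing for a window of the given duration."""
    n = round(fs * window_seconds)
    return fs / n


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_dft(frame):
    """One STFT frame: Hamming window followed by a direct DFT (non-negative frequencies)."""
    n = len(frame)
    xw = [a * w for a, w in zip(frame, hamming(n))]
    return [
        sum(xw[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


print(bin_spacing_hz(FS, 0.008))  # 125.0 Hz for an 8 ms window
print(bin_spacing_hz(FS, 0.5))    # 2.0 Hz for a 500 ms window

# Toy frame: a 1 kHz tone over 8 ms; its energy should peak at bin 1000 / 125 = 8.
tone = [math.sin(2.0 * math.pi * 1000.0 * t / FS) for t in range(128)]
spectrum = windowed_dft(tone)
peak_bin = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
print(peak_bin)
```

The direct DFT is used only to keep the sketch dependency-free; for frames of several thousand samples an FFT routine would be the practical choice.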
In this paper, we demonstrate through computer simulation that STFT coefficients of speech with high spectral resolution fit LD reasonably accurately, as shown in Figures 2–5 for arbitrary speech segments of different individuals. The PDFs were obtained by plotting the envelope of histograms of STFT coefficients, with their entire range divided into 100 equally spaced bins. The envelope was obtained by joining the peak of each bin in the histogram. In all the figures that follow, the red (thick) curve represents the fitted distribution based on MLE of distribution parameters and the blue (thin) curve is the actual distribution of the corresponding STFT coefficients. The RMS error of the estimated LD parameter as well as the CRB were small, indicating that MLE is an efficient estimator for the LD scale parameter b. It is therefore argued that STFT coefficients of long speech segments (>500 ms) where silence intervals have been removed are modeled accurately by LD. This conclusion was found to be valid for ‘female’ (Figure 2) and ‘male’ (Figure 3) speech as well as for both the ‘real’ (Figure 4) and ‘imaginary’ (Figure 5) parts of the STFT coefficients.
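The histogram-based comparison described above can be sketched as follows. This is a sketch under stated assumptions: the RMS error is taken here as the root mean square difference between the empirical density at the bin centers and the fitted Laplace PDF, which may differ from the exact definition used for the reported results. The function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def histogram_density(samples, bins=100):
    """Empirical density over equally spaced bins spanning the full sample range."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(samples) * width) for c in counts]
    return centers, density


def laplace_pdf(x, mu, b):
    """Laplace density evaluated at x."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def rms_error(centers, density, mu, b):
    """RMS difference between the empirical density and the fitted Laplace PDF (assumed definition)."""
    sq = [(d - laplace_pdf(c, mu, b)) ** 2 for c, d in zip(centers, density)]
    return math.sqrt(sum(sq) / len(sq))


# Synthetic unit-scale Laplace data standing in for STFT coefficients.
rng = random.Random(3)
data = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(50000)]
mu_hat = statistics.median(data)
b_hat = sum(abs(x - mu_hat) for x in data) / len(data)

centers, density = histogram_density(data)
err = rms_error(centers, density, mu_hat, b_hat)
print(err)
```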
For the purpose of comparison, GD was also fitted to STFT coefficients as shown in Figures 6–9, from which it is clear that the STFT coefficients of speech were not accurately modeled by GD. The RMS error of the fitted GD was small (but larger than that of the fitted LD) and the MLE of the mean $\mu$ of GD was also reasonably accurate, but the estimated value of variance $\hat{\sigma}^{2}$, using MLE, deviated quite significantly from the actual variance $\sigma^{2}$ of STFT coefficients, with $\hat{\sigma}^{2} \gg \sigma^{2}$. It is therefore concluded that speech STFT coefficients with a high spectral resolution are not accurately represented by GD and that LD provides a better fit than GD. This was found to be the case for ‘female’ (Figure 6) and ‘male’ (Figure 7) speech as well as for ‘real’ (Figure 8) and ‘imaginary’ (Figure 9) parts of STFT coefficients.
For a short segment length of 8 ms corresponding to low spectral resolution (125 Hz), neither LD nor GD provided an accurate model for the distribution of STFT coefficients of speech. The plots in Figure 10 and Figure 11 depict the fitted distributions corresponding to LD and the plots in Figure 12 and Figure 13 depict the fitted distributions corresponding to GD for STFT coefficients with low spectral resolution. It should be noted that the ordinates in Figures 10–13 represent the actual count due to the short segment lengths, rather than the probabilities as in Figures 2–9. The abscissa of all plots in Figures 2–13 denotes the actual values (real/imaginary parts) of the STFT coefficients.
For low spectral resolution STFT coefficients of both male and female speech, LD and GD fitting based on MLE yielded a significantly large RMS error. The ML estimates of the LD scale parameter and of the corresponding GD scale parameter were large in comparison to the actual scale parameters $b$ and $\sigma$, respectively, resulting in the RMS error being too large. Hence, LD and GD cannot be considered favorable probabilistic models to represent speech STFT coefficients with low spectral resolution. Further investigation is required to identify the probabilistic models that best represent short segments of speech as STFT coefficients.
In Figures 2–5, both the RMS error and CRB values were small, indicating that LD was an appropriate fit and MLE was an efficient estimation algorithm. In Figures 6–9, although the RMS error was small, the CRB values corresponding to the estimation of $\sigma^{2}$ were large. This indicates that the choice of distribution is appropriate, but the ML estimation did not yield accurate parameters to represent the distribution, based on the observed data. Hence, a different estimation algorithm needs to be chosen. Since the CRB values in Figures 10–13 were small, ML estimation is an efficient algorithm to fit the model parameters. However, the large value of the RMS error indicates that the distributions hypothesized as potential fits, i.e., LD and GD, are not suitable. These observations are summarized in Table 1. As CRB represents the lower bound on the estimation error of an estimation algorithm, a small value of CRB indicates better efficiency of the estimation algorithm.
There are applications that require the appropriate probabilistic modeling of speech signals in the spectral domain with high spectral resolution. The STFT transforms signals from the time domain to the frequency domain with the flexibility to control resolution depth. In order to find the best probabilistic model that represents speech STFT coefficients, it is important to hypothesize the correct distribution and also to choose an appropriate estimation algorithm. Speech STFT coefficients with high spectral resolution can be accurately represented by LD using MLE to estimate LD parameters. While GD is a valid hypothesis, MLE does not accurately estimate the parameters of the fitted GD. It should still be noted that LD is a better fit than GD. For STFT coefficients with a low spectral resolution, both LD and GD are invalid hypotheses, even though MLE is an accurate estimation algorithm. It is therefore necessary to include other probability distributions in our hypothesis and also to investigate alternative estimation algorithms to MLE in our pursuit to model speech by a representation that is relatively invariable, as needed by several applications.
The work presented in this article used speech samples in English, spoken by healthy individuals of Asian origin in the age group of 20–40 years. Certain features of speech such as ‘pitch’, mel frequency cepstral coefficients (MFCC), and perceptual linear predictive (PLP) features depend on language, age, and race. For example, the variation of pitch for different languages for both male and female speakers of various ethnic origins is presented in [25], which concludes that certain languages are inherently high pitched while certain others are low pitched. While there is a variation of pitch between males and females, the amount of variation is different for different languages. It is also hypothesized that the nativity of the speaker affects the pitch frequency. MFCC has been used in published literature for language identification [26,27]. In [28], both MFCC and PLP were used for language identification. Speech recognition for four different languages (German, Danish, Finnish, and Spanish) based on adapted versions of MFCC and PLP was discussed in [29] and MFCC was used for Arabic speech recognition in [30], indicating that MFCC and PLP features depend on language. MFCC was also used in [31,32] for age classification to classify speakers as adults or kids. In [33], statistical modeling of speech spectral coefficients was used to discriminate the speech of patients with Parkinson’s disease from that of healthy individuals. The effect of language, race, and age on the statistical distribution of speech STFT coefficients is not available in the literature and needs to be investigated.
5. Conclusions
Several probabilistic distributions have been proposed in the literature to model speech in different domains and with different segment lengths, typically below 200 ms. It is evident from the existing literature that different probabilistic models are applicable to speech under different conditions and each has utility in different applications. Some applications such as the design of digital hearing aids require the stable modeling of speech over longer durations due to the perceptual stability of human listening capability. In particular, it is required to find a probabilistic model for speech signals in the spectral domain with high spectral resolution. In this work, a probabilistic model for the frequency domain representation of speech, where silence intervals were removed, was proposed. It was shown that STFT coefficients of speech segments greater than 500 ms, corresponding to a high spectral resolution, were accurately modeled by LD, for which the distribution scale parameter was obtained using MLE. Fitting a GD also yielded a small RMS error, but the MLE of variance of the fitted GD was much larger than the actual variance, leading to the conclusion that LD provides a better fit than GD to accurately model speech STFT coefficients with a high spectral resolution. In the case of STFT coefficients with low spectral resolution (short segments), neither LD nor GD provided an accurate representation as the RMS error was too large. These conclusions are valid for both the male and female speech samples as well as for both the real and imaginary parts of STFT coefficients. In order to find the best distribution, it is important to hypothesize the correct distribution and also to use an appropriate estimation algorithm to estimate the distribution parameters.
The conclusion that speech STFT coefficients with high spectral resolution are modeled by LD is useful in improving the design of digital hearing aids, making their performance better and more stable under a wide range of ambient conditions, as the inclusion of fine spectral details is necessary to improve their performance. Future work shall investigate more probabilistic distributions as well as other estimation algorithms to obtain the distribution and parameters that can provide accurate models for speech under different conditions. The presented results are for speech samples in English, spoken by individuals of Asian origin in the age group of 20–40 years. The effect of speaker language, speaker race, noise, and resolution of the transducer shall also be investigated in the future.