1. Introduction
Since Alexander Graham Bell’s invention of the telephone in 1876, researchers in audio processing have continuously worked to improve the quality of speech systems, with the ultimate objective of achieving speech quality comparable to face-to-face communication [1]. Evaluating speech systems such as microphone arrays in terms of the perceptual quality of the speech they produce is therefore important for verifying that they approach this objective [2].
To quantify perceptual quality efficiently, several objective metrics are available: perceptual evaluation of speech quality (PESQ), used in [3,4]; short-time objective intelligibility (STOI), used in [4]; the UTokyo-SaruLab system for VoiceMOS (UTMOS) [5]; non-intrusive speech quality assessment (NISQA) [1]; and perceptual objective listening quality analysis (POLQA) [1]. These objective measures are used in place of subjective listening tests, which are time-intensive and require an acoustic chamber, and they therefore allow a more rapid and consistent assessment of speech signals.
A wide range of acoustic features that describe sound structure in detail, such as the temporal and spectral entropies, are extensively used in audio analysis. These features provide statistical representations of sound structure without relying on sound pressure level, SNR, or prior knowledge of the sound. Entropy quantifies the information content of a signal by assessing its predictability. It is calculated as the probability-weighted sum of the negative logarithms of the individual probabilities of the acoustic parameter of interest, yielding a single value that can be interpreted in terms of the probability distribution of that parameter. For instance, a noise signal with equal power across all frequency bins exhibits a uniform probability density in the frequency domain and, consequently, a higher spectral entropy, since noise introduces uncertainty and entropy serves as a measure of that uncertainty [6]. The authors in [6] extended these findings by measuring entropy in both the time and frequency domains of real-world noise and evaluating its effect on speech perception. Their experimental results showed that higher entropy (lower variance and a higher mean) corresponds to poorer speech quality perception. In the frequency domain, a more uniform power distribution across frequency bins leads to a higher entropy. This result demonstrated that a dimension of acoustic complexity in real-world noise can be quantified using a simple acoustic feature to predict speech perception, even in the absence of additional information about the noise [6].
In view of this, it is of interest to study whether one could define a spectral entropy-based metric as an objective metric to measure the speech perceptual quality, and to cross-validate such an entropy metric with other objective perceptual quality measures.
Speech perceptual quality is inherently multi-dimensional, encompassing intelligibility, naturalness, distortion and clarity, each reflecting different listener priorities and acoustic correlations. Speech naturalness itself arises from a combination of spectral consistency, prosody (rhythm and intonation), and temporal dynamics, where listeners integrate smooth formant trajectories, appropriate pitch and amplitude modulation, and predictable temporal structure to judge a voice as “human-like” [
7]. When speech is natural, listeners can focus on the message’s meaning rather than deciphering the speech signal. In other words, high naturalness frees up cognitive resources for understanding content. Cognitive load refers to the mental effort required to process and understand speech, which can be influenced by the quality, intelligibility, and spectral–temporal consistency of the speech signal.
Perceptually, voice naturalness is tightly linked to whether the speech signal behaves like a plausible output of the human speech production system, that is, whether its spectrotemporal patterns follow the smooth, predictable trajectories produced by articulatory dynamics. Nussbaum et al. emphasize that listeners form rapid naturalness impressions from multiple acoustic cues (including pitch contour, temporal structure, and spectral composition) and that naturalness judgments reflect an integration of such cues against an internal human-voice reference (a deviation- or human-likeness-based framework). From this perspective, spectral consistency, the presence of a smoothly varying formant structure and coherent harmonic envelopes across time, supports the perceptual inference that a stimulus is a plausible human voice and, therefore, raises perceived naturalness [
7].
Moore and Tan [
8] have shown that systematic perturbations of the long-term spectral shape reduce perceived naturalness in predictable ways. They have shown that introducing spectral “ripples” (periodic peaks and valleys), broad spectral tilts, or severe band-limiting produces large, reliable drops in naturalness ratings for speech; the degree of degradation scales with ripple depth/density, tilt magnitude, and the extent of the frequency range affected, with mid-frequencies being especially important for speech naturalness. Their results demonstrate that an inconsistent or non-human-like spectral structure—whether an irregular fine structure, gross tilt, or missing bands—is perceived as noticeably less natural [
8]. Together, these conceptual and empirical strands show that maintaining spectrotemporal coherence, neither unnaturally flat nor irregularly distorted, is central to producing speech that listeners judge as natural. The works in [
7,
8] show that deviations from expected spectral patterns reduce perceived naturalness. In [
9], the authors have shown that when the spectral content of speech is well-preserved or “coherent,” listeners experience more speech naturalness. In this work, we use the term ‘spectral coherence’ to refer to the preservation of the structured, predictable spectral energy distribution of natural speech, which we quantify via spectral entropy. Therefore, for the evaluation of a speech system, such as a microphone array, in terms of the naturalness of the output speech, spectral consistency can be quantified using spectral coherence.
The objective of this paper is to define an objective metric using entropy to evaluate the performance of a speech system in terms of the perceptual quality of the output it produces. To validate this metric in the context of perceptual quality, we choose a first-order differential microphone array (DMA) (non-steerable and steerable) as a representative speech system and measure its output perceptual quality by means of the objective entropy-based metric. The first-order DMA is chosen because its output covers a wide range of perceptual qualities, making it a suitable speech system for evaluating the entropy-based metric. Such an evaluation shows how well the entropy-based metric and other objective perceptual quality measures capture the naturalness of the output speech.
While metrics such as PESQ and STOI target overall listening quality and intelligibility, respectively, naturalness depends on spectral consistency, that is, the preservation of smooth formant trajectories and coherent harmonic envelopes. A unique contribution of this research is the introduction of a perceptual quality metric that functions as a specific diagnostic tool for spectral stability. By quantifying the distribution of spectral energy, a spectral entropy-based metric identifies processing artifacts and unnatural distortions that are often ignored by measures prioritized for intelligibility. This allows for a more precise evaluation of how ‘human-like’ a speech signal remains after processing by speech systems like DMAs.
We also use this spectral entropy-based metric to compute its sensitivity with respect to spectral coherence, treated as a focused aspect of speech naturalness. The experiments show that this measure is strongly tied to spectral coherence. Such a spectral entropy-based metric quantifies the distribution of spectral energy, which directly impacts perceived naturalness [9]. This entropy-based metric can be used to diagnose processing artifacts and unnatural distortions in speech systems.
Unlike other metrics such as PESQ or UTMOS, which follow a ‘higher-is-better’ scale to represent overall quality, the spectral entropy-based metric is fundamentally a degradation-based metric. In this framework, a higher value of the spectral entropy metric signifies increased spectral uncertainty and a departure from the coherent structures of natural speech. Consequently, an inverse correlation with other quality scores is expected and serves as a primary indicator of the metric’s construct validity, confirming its ability to accurately identify signal degradation.
This paper is structured as follows. Section 2 provides a brief overview of the first-order fixed DMA as well as a steerable first-order DMA using four microphones. Section 3 presents the spectral entropy-based measure and its application in the spectral coherence evaluation of the output speech signal of first-order DMAs. The experimental results are presented in Section 4, and the conclusions in Section 5.
3. Entropy-Based Measure
The entropy of a random variable $X$ with $N$ states or symbol probabilities $[p_1, p_2, \ldots, p_N]$, where $\sum_{i=1}^{N} p_i = 1$, is given by

$$H(X) = -\sum_{i=1}^{N} p_i \log p_i \qquad (1)$$

where $H$ is the Shannon entropy. To compute the entropy of a spectrum, the authors of [16] converted the spectrum into a probability mass function (PMF)-like function by normalizing it over the sum of the energies of the frequency components of the short-time frame. With this normalization, the normalized spectrum sums to unity over the full band. The authors in [16] suggested computing the entropy from the full-band normalized spectrum. The following equation is used for the full-band normalization:

$$p_i = \frac{E_i}{\sum_{j=1}^{N} E_j}, \quad i = 1, \ldots, N \qquad (2)$$

where $E_i$ is the energy of the $i$th frequency component of the spectrum, $p = (p_1, \ldots, p_N)$ is the PMF of the spectrum and $N$ is the number of points in the spectrum (order of short-time Fourier transform (STFT)/number of discrete Fourier transform (DFT) points). It was found in [16] that the entropy can be used to capture the peak shapes of a PMF. A PMF with a sharp peak will have low entropy, while a PMF with a flat distribution will have high entropy. In the case of STFT spectra of speech, the authors observed distinct spectral peaks, with their positions varying based on the phoneme being analyzed. The importance of formants is well established, and in [17], the authors explored the use of spectral peak location as an additional feature for automatic speech recognition (ASR).
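The relationship between peak shape and entropy described above can be illustrated with a minimal sketch. The two eight-bin spectra are hypothetical, the function names are introduced here for illustration only, and the base-2 logarithm is an assumption (any base gives the same ordering).

```python
import math


def spectral_pmf(energies):
    """Normalize non-negative spectral energies so that they sum to one."""
    total = sum(energies)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return [e / total for e in energies]


def shannon_entropy(pmf):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


# Hypothetical 8-bin spectra: one dominated by a single peak, one flat.
peaky_spectrum = [0.1, 0.1, 10.0, 0.1, 0.1, 0.1, 0.1, 0.1]
flat_spectrum = [1.0] * 8

h_peaky = shannon_entropy(spectral_pmf(peaky_spectrum))
h_flat = shannon_entropy(spectral_pmf(flat_spectrum))

# The flat spectrum reaches the maximum possible entropy, log2(8) bits,
# while the peaky spectrum stays well below it.
print(f"peaky: {h_peaky:.3f} bits, flat: {h_flat:.3f} bits")
```

A flat spectrum attains the upper bound of log2(N) bits for N bins, which is why a shift toward a more uniform power distribution always raises the entropy.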
As mentioned before, noise introduces additional entropy into a system by increasing uncertainty. The entropy of a noisy speech signal computed in the time domain is consistently higher than that of the clean signal [18], confirming that noise increases entropy by making the signal less predictable. In the case of white Gaussian noise (WGN), entropy and variance are directly related; an increase in one leads to an increase in the other [19]. For white noise with non-Gaussian distributions, such as multimodal or uniform distributions [
20], variance fails to fully capture the uncertainty or unpredictability. In such cases, entropy is a more effective metric for quantifying uncertainty. The spectral entropy of a speech signal captures information embedded in its various frequency components, as represented in the short-time Fourier transform (STFT). The choice of STFT over wavelet transformation is motivated by two key factors. First, STFT ensures consistent spectral resolution, reducing the risk of entropy variations that stem from wavelet decomposition choices (e.g., basis function selection and decomposition level). While wavelets provide multi-resolution analysis, they utilize a non-linear frequency tiling that under-resolves high-frequency components while over-resolving lower scales. For the purpose of the proposed spectral entropy-based metric, such a variable resolution would introduce scale-dependent entropy biases, potentially obscuring the ‘peaky’ structures of high-frequency harmonics that are also vital for perceived naturalness. In contrast, the constant bin width of the STFT ensures that every frequency component is treated with equal statistical weight during the PMF normalization process. This linear resolution is crucial for the diagnostic accuracy of the spectral entropy-based metric, as it allows for a direct and consistent evaluation of formant-region integrity and spectral ripples across the full spectrum. Second, unlike wavelets, which redistribute energy across scales, STFT retains a direct frequency-to-entropy relationship across the entire speech bandwidth [
16], crucial for interpreting speech signal degradation. Defining spectral entropy for the power spectral density (PSD) of each STFT window enables us to evaluate the contribution of perceptually important frequency components, such as those in the formant region [
21,
22], which cannot be adequately assessed using temporal entropy.
Moreover, while temporal entropy requires a histogram of the samples to derive the probability mass function (PMF) [23], spectral entropy has the advantage that the PMF can be determined directly by normalizing the STFT power spectrum [16]. Additionally, the spectrum of white noise, interpreted as a uniform distribution over a frequency range, has maximum entropy [24]. Since this flat noise distribution overlaps with the speech spectrum within the same frequency range, it increases the spectral entropy due to the uncertainty introduced by the noise [25,26]. Our proposed metric is therefore capable of evaluating the contribution of frequency components that are perceptually important for naturalness, such as those in the formant region.
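In order to evaluate the performance of a speech system in terms of the perceptual quality of the output it produces, we now define a spectral entropy-based measure, SEM, by

$$\mathrm{SEM} = \frac{H_{\text{out}}}{H_{\text{in}}} \qquad (3)$$

where $H_{\text{in}}$ and $H_{\text{out}}$ are the spectral entropies of the clean input speech and of the output speech of the system, respectively. For every STFT frame of a speech signal, using Equation (1), the spectral entropy of the $m$th frame is

$$H_m = -\sum_{i=1}^{N} p_{m,i} \log p_{m,i} \qquad (4)$$

where $p_{m,i}$ is the $i$th component of the PMF of the $m$th frame, and $N$ is the number of STFT points. The spectral entropy of the speech signal is given by

$$H = \frac{1}{M} \sum_{m=1}^{M} H_m \qquad (5)$$

where $M$ is the number of frames.
Substituting Equation (5) into Equation (3), we can calculate the value of SEM. Based on the value of SEM, two perceptual states are defined to characterize the output of the speech system: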
The condition $\mathrm{SEM} > 1$ indicates an increase in spectral entropy, signifying a shift toward a more uniform power distribution across frequency bins. While this often results from additive noise, it also captures spectral distortions introduced by array processing, such as filtering artifacts that may flatten the distinct spectral peaks or formants necessary for naturalness. The larger the value of SEM, the worse the degradation of spectral coherence and the lower the naturalness. Conversely, $\mathrm{SEM} < 1$ would indicate an unnatural sharpening of the spectrum or a loss of information content. Thus, the metric serves as a broad indicator of spectral structural integrity.
The framework treats the speech spectrum as a probability distribution, where entropy serves as a direct measure of spectral uncertainty. An increase in SEM reflects a loss of spectral structure, often occurring when additive noise or improper beamforming flattens the distinct peaks of the voice. Conversely, while not commonly observed in standard linear processing, $\mathrm{SEM} < 1$ is theoretically possible and would indicate excessive spectral sharpening. This condition implies that processing has unnaturally narrowed the spectral peaks, creating a signal that is ‘peakier’ than the human-voice reference, which equally degrades perceived naturalness by introducing metallic or robotic artifacts.
Mathematically, since entropy captures the predictability of a signal’s power distribution, a higher SEM signifies that the output has shifted toward a more uniform (and, thus, more uncertain) state. This shift represents a degradation of spectral coherence, where the system fails to preserve the harmonic relationships of the input. In the theoretical case of $\mathrm{SEM} < 1$, the output exhibits reduced spectral uncertainty compared with the input. This would imply that the speech system has performed an aggressive non-linear reduction in the spectral width of formants, a process known as excessive sharpening. In both the $\mathrm{SEM} > 1$ and $\mathrm{SEM} < 1$ scenarios, the metric identifies a departure from the spectral coherence required for natural speech.
In order to calculate the SEM of the output of a speech system, the following procedure is used:
Step 1: Divide the input clean speech signal into frames;
Step 2: Compute the STFT for each frame of the clean speech signal;
Step 3: Calculate the PMF for each frame’s STFT of the clean speech signal;
Step 4: Compute the spectral entropy of the clean speech signal, $H_{\text{in}}$, using (5);
Step 5: Feed the clean speech into the speech system and obtain the output of the speech system;
Step 6: Perform steps 1 to 4 using the output of the speech system instead of the clean speech to obtain the spectral entropy of the output speech, $H_{\text{out}}$;
Step 7: Obtain SEM as the ratio of $H_{\text{out}}$ to $H_{\text{in}}$, that is,

$$\mathrm{SEM} = \frac{H_{\text{out}}}{H_{\text{in}}}$$
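The seven steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the non-overlapping framing, the Hann window, the one-sided spectrum, the base-2 logarithm, and the names `spectral_entropy` and `sem` are all choices made here. The frame length and number of DFT points (320 samples, i.e., 20 ms at 16 kHz) follow the experimental setup, and the test signals are synthetic stand-ins for the clean input and the system output.

```python
import numpy as np

FRAME_LEN = 320  # 20 ms at fs = 16 kHz
N_FFT = 320      # number of DFT points per frame


def spectral_entropy(signal, frame_len=FRAME_LEN, n_fft=N_FFT):
    """Mean per-frame spectral entropy (bits) of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    # Step 1: non-overlapping frames (assumption), incomplete tail dropped.
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Step 2: power spectrum of each Hann-windowed frame (window is assumed).
    power = np.abs(np.fft.rfft(frames * np.hanning(frame_len), n=n_fft, axis=1)) ** 2
    # Step 3: normalize each frame's spectrum into a PMF; skip silent frames.
    totals = power.sum(axis=1, keepdims=True)
    keep = totals[:, 0] > 0
    pmf = power[keep] / totals[keep]
    # Step 4: per-frame entropy, with 0 * log(0) taken as 0, then the mean.
    log_pmf = np.log2(pmf, where=pmf > 0, out=np.zeros_like(pmf))
    return float((-(pmf * log_pmf).sum(axis=1)).mean())


def sem(clean, output, **kwargs):
    """Steps 5-7: ratio of output entropy to clean-input entropy."""
    return spectral_entropy(output, **kwargs) / spectral_entropy(clean, **kwargs)


# Synthetic stand-ins: a harmonic "clean" signal and a noisy "output".
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
clean = sum(np.sin(2 * np.pi * f * t) / k
            for k, f in enumerate((200, 400, 600, 800), start=1))
noisy = clean + 0.5 * rng.standard_normal(fs)

print("SEM, output identical to input:", sem(clean, clean))
print("SEM, output with added noise:  ", sem(clean, noisy))
```

Because SEM is a ratio of two entropies computed with the same settings, the choice of logarithm base cancels out; the framing and windowing choices, however, do affect the absolute entropy values.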
The use of the ratio in the definition of SEM serves a critical normalization function. Because different speakers and phonetic contents naturally possess different baseline spectral entropies, using the clean input ($H_{\text{in}}$) as a reference ensures that the metric isolates the processing artifacts of the system rather than the characteristics of the speaker’s voice. This formulation allows SEM to function as a system-dependent diagnostic tool.
It is important to distinguish the causes of entropy variation. Natural speech is characterized by a distinct ‘peaky’ structure in the frequency domain, particularly in the formant regions. When a system like a DMA processes a signal, any filtering artifacts that smooth these peaks or flatten the spectral envelope will result in a higher SEM value, regardless of whether external noise is present. By utilizing the direct frequency-to-entropy relationship of the STFT, SEM identifies these spectral inconsistencies as a deviation from the human-voice reference.
While SEM and the other objective metrics differ in the way they are computed, their purpose remains the same, namely, evaluating the perceptual quality of the output of a speech system. Given this shared goal, it is meaningful to compute the correlation between SEM and the other objective metrics.
While objective metrics such as PESQ [3,4], STOI [4], and UTMOS [5] are designed to estimate overall listening quality or intelligibility, they do not explicitly take into consideration the spectral consistency of the speech signal, which is critical in evaluating speech naturalness [9]. In contrast, SEM, as an entropy-based metric, quantifies the distribution of spectral energy, which directly impacts perceived naturalness [9].
SEM should therefore not be judged only by how well it correlates with other objective metrics; its specific role in evaluating the naturalness of the output speech should also be taken into account.
For spectral coherence tests, we use spectral coherence sensitivity, a measure that shows how much each perceptual objective metric changes when speech is distorted.
To ensure a fair comparison of how each metric responds to spectral coherence changes independent of its original scale, we first apply a linear normalization to map each metric ($Q$) to a common range of 0 to 1, as shown by the following equation:

$$Q_{\text{norm}} = \frac{Q - Q_{\min}}{Q_{\max} - Q_{\min}}$$

where $Q_{\min}$ and $Q_{\max}$ are the theoretical minimum and maximum values of each metric. The normalized spectral coherence sensitivity ($S_{\text{norm}}$) is then calculated as the change in the normalized metric relative to the change in spectral coherence ($\Delta C$) using the following equation:

$$S_{\text{norm}} = \frac{\Delta Q_{\text{norm}}}{\Delta C}$$

where $\Delta Q_{\text{norm}}$ is the corresponding change in the normalized metric. In this framework, $C$ is the spectral coherence between the clean and output signals. This approach allows us to quantify the intrinsic responsiveness of each metric to the coherence loss independent of its original scale.
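As a rough illustration of the normalization and sensitivity computation, the sketch below uses hypothetical numbers. The function names, the sign convention for the coherence change (reference minus distorted), and the example scores are assumptions made here; the range of −0.5 to 4.5 is the commonly quoted raw PESQ score range and is used only as an example of a theoretical range.

```python
def normalize(value, lo, hi):
    """Linearly map a metric value from its range [lo, hi] onto [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


def coherence_sensitivity(metric_ref, metric_dist, coh_ref, coh_dist, lo, hi):
    """Change in the normalized metric per unit loss of spectral coherence.

    The coherence change is taken as reference minus distorted, which is
    an assumed sign convention.
    """
    delta_coherence = coh_ref - coh_dist
    if delta_coherence == 0:
        raise ValueError("spectral coherence did not change")
    delta_metric = normalize(metric_dist, lo, hi) - normalize(metric_ref, lo, hi)
    return delta_metric / delta_coherence


# Hypothetical example: a PESQ-like score falls from 4.0 to 2.5 while the
# spectral coherence falls from 1.0 to 0.7.
sensitivity = coherence_sensitivity(4.0, 2.5, 1.0, 0.7, lo=-0.5, hi=4.5)
print(f"normalized sensitivity: {sensitivity:.3f}")
```

Normalizing before differencing keeps metrics with wide native ranges from appearing more sensitive simply because of their scale.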
The introduction of the spectral coherence sensitivity framework represents a novel analytical contribution of this work. This measure quantifies the reactivity of an objective metric to specific spectral distortions. Unlike standard MOS-based metrics designed for broad quality assessment, this framework highlights the high responsiveness of SEM to the breakdown of harmonic structures. By isolating spectral coherence as a focused aspect of speech naturalness, we demonstrate that SEM provides a level of diagnostic granularity that is currently absent in less sensitive, intelligibility-focused industry standards.
The design of SEM as a ratio of output-to-input entropy means it specifically scales with spectral complexity and distortion. Because noise and processing artifacts introduce more uniform power distributions and, thus, higher entropy, higher SEM values represent worse naturalness. This inherent orientation as a measure of degradation distinguishes it from MOS-based metrics, which are designed to measure perceptual excellence. This distinction is crucial for interpreting cross-validation results, as a strong negative correlation indicates that SEM is capturing the same perceptual phenomena as established quality metrics, but from the perspective of signal breakdown.
Comparison of Computational Complexities
The computational complexity of the spectral entropy-based measure is where is the number of frames of the signal, is the length of the , and is number of microphones.
It should be noted that the other metrics, namely, PESQ, POLQA, and STOI, all have the same computational complexity, namely, .
Since SEM relies on standard STFT operations, its runtime behavior is deterministic and highly efficient on modern hardware with FFT acceleration. Unlike UTMOS, which requires significant memory for model weights and inference, SEM is a low-power, purely statistical metric, making it viable for integration into real-time speech-enhancement diagnostic tools in edge devices such as hearing aids and mobile communication systems.
To illustrate the usefulness of the proposed entropy-based metric, SEM, we now consider the performance evaluation of a non-steerable as well as a steerable first-order DMA as examples.
4. Experimental Results
In this section, we evaluate the perceptual quality of both the fixed-beam and the steerable-beam first-order DMAs in the presence of diffuse noise and an interfering speaker. The fixed-beam first-order DMA, composed of two omnidirectional microphones, is configured with its main lobe fixed at 0°, while the steerable-beam first-order DMA utilizes four omnidirectional microphones arranged in a square geometry.
The fixed-beam and steerable-beam first-order DMAs were simulated in the MATLAB R2015a environment. We chose 55 utterances spoken by male speakers and 45 utterances spoken by female speakers from the TIMIT database, which includes phonetically balanced sentences [27], with a sampling rate of $f_s$ = 16 kHz, and fed them as input to the simulated speech systems. The interfering speaker was modeled by an audio sample from the TIMIT database located at one of the four different null angles. The interfering speaker was active simultaneously with the desired speaker at a signal-to-interference ratio (SIR) of 0 dB. The additive diffuse noise was also modeled in MATLAB R2015a so that the resulting signal had an SNR of 10 dB.
We followed the procedure given in Section 3 to calculate the SEM of the system, using each of the utterances as the input and obtaining the corresponding output of the system. For this purpose, we set the frame length to 20 ms and the number of DFT points to 320 for each time frame. SEM was compared against other objective perceptual quality metrics, namely, PESQ, POLQA, STOI, NISQA, and the deep learning-based UTMOS [5]. These metrics were therefore also computed for the experimental results, and their values were compared with those of SEM to validate the effectiveness of SEM and the distinct information it provides.
4.1. Results for the Fixed-Beam First-Order DMA
Consider a fixed-beam first-order DMA whose microphone inter-element distance δ is 0.5 cm. We now obtain the values of the spectral entropy measure for four different azimuth angles and null angles for the fixed-beam first-order DMA. For this purpose, we consider a particular sound file, namely, “sa1.wav”, that consists of the utterance “She had your dark suit in greasy wash water all year” by an adult female, as the desired sound source for four different azimuth angle locations. In addition, we assume that there is an undesired speaker located at one of the four different null angles, represented by the file “sx178.wav”, which consists of the utterance “She encouraged her children to make their own Halloween costume” by an adult male speaker.
The directional pattern of the fixed-beam first-order DMA and its main-lobe beam orientation for the four azimuth and null angle pairs are shown in Figure 3. It can be seen from this figure that the main-lobe beam is fixed at 0°, while the speaker’s azimuth angle and the null angle vary. As the speaker’s location changes, the DMA cannot steer its beam towards the new location and, therefore, can capture the desired speech only when the speaker is located at 0°. In other words, for all the other cases, where the speaker is located at an azimuth angle other than 0°, the fixed-beam first-order DMA cannot properly reproduce the desired speech.
Figure 4 illustrates boxplots for the fixed-beam first-order DMA; each subplot shows how SEM, PESQ, POLQA, STOI, NISQA and UTMOS are distributed across the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). A clear pattern emerges in these angular pairs. For SEM, the angular pair (0°, 90°) has a very tight box (low variance) at ≈1.00, with minimal outliers. At this pair, PESQ, STOI, and UTMOS display significantly higher medians (e.g., PESQ near 4.2–4.3 and UTMOS near 4.3–4.4) than at the other angles, indicating the best perceived quality and intelligibility. In the remaining angular pairs, the median of SEM is 1.3–1.4 on average, with broader interquartile ranges and more outliers, while PESQ, STOI, and UTMOS exhibit noticeably lower medians and more outliers. The boxplot whiskers for the pairs (45°, 135°), (270°, 180°), and (315°, 225°) show larger spreads and more outliers, matching the higher standard deviations observed numerically. As expected, the fixed beam best serves the angular pair (0°, 90°), as reflected by the unity value of SEM, which agrees with the highest values of the perceptual metrics PESQ, POLQA, STOI, NISQA and UTMOS.
4.1.1. Descriptive Statistics
Mean and standard deviation values for SEM, PESQ, POLQA, STOI, NISQA and UTMOS for the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°) are given for the fixed-beam first-order DMA in Table 1. It is observed from this table that the mean of SEM is lowest (≈1.00) at (0°, 90°), but rises to around 1.39–1.42 for the other angular pairs. Meanwhile, the mean values of PESQ and UTMOS, which track the perceived quality, are highest at (0°, 90°) and drop significantly at (45°, 135°), (270°, 180°), and (315°, 225°). STOI follows a similar trend, reflecting better performance near the (0°, 90°) angular pair. It is noted that the STOI values observed in Table 1 are relatively low (ranging from approximately 0.31 to 0.42). This can be attributed to the stringent test conditions involving a strong interfering speaker set at 0 dB SIR, combined with diffuse noise at a 10 dB SNR. Under such high-interference scenarios, the intelligibility metric, STOI, exhibits lower absolute values, further emphasizing the need for supplementary diagnostic metrics like the proposed SEM. All these metrics exhibit angular pair dependencies, with significant differences in the values of both SEM and the other perceptual objective metrics across the tested angular pairs.
The rise in mean SEM values to approximately 1.39–1.42 for ‘off’ angular pairs indicates a significant loss of spectral definition. This increase is not solely a reflection of the 10 dB diffuse noise, but also indicates array-induced spectral artifacts. At these angles, the DMA’s inability to steer its beam results in a transfer function that flattens the harmonic structure of the desired speaker. The higher entropy, thus, quantifies the perceptual degradation caused by the system’s failure to preserve the spectral peaks essential for a ‘human-like’ sound.
In terms of standard deviation (SD), the variability of SEM is negligible (≈0.002) for (0°, 90°), but grows significantly for (45°, 135°), (270°, 180°), and (315°, 225°). NISQA shows the largest standard deviations at angles other than (0°, 90°). STOI remains moderately stable, though it also exhibits higher SD at the ‘off’ angles.
Linear regression analysis is presented for the fixed-beam first-order DMA as scatter plots with regression lines in Figure 5 and Figure 6. In these figures, the x-axis represents SEM and the y-axis represents the other objective metrics. The regression lines have a negative slope, indicating that higher SEM values are associated with lower perceptual quality scores.
4.1.2. Analysis of Variance (ANOVA): Statistical Significance of Differences Between Conditions
Table 2 provides the one-way ANOVA across the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). The F-statistic and p-value for each objective perceptual metric, obtained from the fixed-beam first-order DMA output speech, are given in this table. SEM yields the highest F-statistic, indicating massive variation in SEM across angles, consistent with the large jump in the mean value from ≈1.00 to ≈1.40. It is also observed that SEM has the lowest p-value, indicating that there exists a statistically significant difference between the mean value of the angular pair of (0°, 90°) and the mean values of the other three angular pairs.
The ANOVA findings confirm that there is a significant difference among the angular pair mean values and at least one angular pair mean value is different from the others, which, in our case, is the angular pair of (0°, 90°). These results are obtained for the four angular pairs, irrespective of the metric used in determining the speech quality.
The massive variation in SEM across different angular pairs (F-statistic of 190) highlights its sensitivity to increased spectral uncertainty caused by the DMA’s spatial filtering. At ‘off’ angles, the improper alignment of the beam pattern with the speaker’s location acts as a spectral disruptor, increasing the entropy of the output. These results confirm that SEM effectively monitors the structural integrity of the speech spectrum, flagging any processing configuration that leads to a loss of the clear, predictable spectral peaks characteristic of a clean human voice.
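For readers who wish to reproduce this kind of analysis, the F-statistic of a one-way ANOVA can be sketched as the ratio of the between-group to the within-group mean squares. The SEM values below are hypothetical placeholders, not the data behind Table 2, and the function name is introduced here for illustration. The p-value additionally requires the F-distribution and is not computed in this sketch.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F-statistic of a one-way ANOVA for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    group_means = [fmean(g) for g in groups]
    grand_mean = fmean(v for g in groups for v in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the samples around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical SEM values for four angular pairs (illustrative only).
sem_by_pair = [
    [1.00, 1.00, 1.01, 0.99],  # e.g., an on-axis pair
    [1.35, 1.42, 1.40, 1.44],
    [1.38, 1.41, 1.36, 1.45],
    [1.37, 1.43, 1.39, 1.46],
]
print(f"F = {one_way_anova_f(sem_by_pair):.1f}")
```

A large F-statistic arises when the group means differ by much more than the spread within each group, which is the situation described above for the on-axis pair versus the ‘off’ pairs.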
4.1.3. Cross-Validation of SEM vs. Other Objective Perceptual Metrics
A five-fold cross-validation is performed to assess the applicability of SEM as a metric for measuring perceptual quality relative to the objective metrics (PESQ, POLQA, STOI, NISQA and UTMOS) for each of the four angular bins (each angular pair is treated as an angular bin). For this purpose, the dataset of 100 audio files is randomly partitioned into five folds for each of the four angular bins. In each of the five iterations, four folds are used as the reference set, while the remaining fold serves as the test set. For each angular bin, we first compute the Pearson correlation coefficient between SEM and each of the other objective measures and average the results across all five folds. Thus, for each objective measure, there is one Pearson correlation coefficient value per angular bin. The value of the Pearson correlation coefficient shown in Table 3 is the average of these four coefficients, and the standard deviation shown in Table 3 is the SD across the four angular bins.
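The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the correlation is computed on the held-out fold of each iteration, the SD across bins is the sample SD, and the data are synthetic stand-ins rather than the paper's scores. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fold_averaged_r(x, y, n_folds=5, seed=0):
    """Randomly split the files into folds and average r over the held-out folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    per_fold = [pearson([x[i] for i in f], [y[i] for i in f]) for f in folds]
    return statistics.fmean(per_fold)


def summarise_bins(bins):
    """bins: one (entropy scores, quality scores) pair per angular bin."""
    r_per_bin = [fold_averaged_r(x, y) for x, y in bins]
    return statistics.fmean(r_per_bin), statistics.stdev(r_per_bin)


# Synthetic stand-in data: 4 angular bins x 100 files (NOT the paper's data).
rng = random.Random(1)
bins = []
for _ in range(4):
    entropy_score = [1.0 + 0.4 * rng.random() for _ in range(100)]
    quality_score = [4.0 - 2.0 * s + rng.gauss(0.0, 0.2) for s in entropy_score]
    bins.append((entropy_score, quality_score))

mean_r, sd_r = summarise_bins(bins)
print(f"mean r = {mean_r:.3f}, SD across bins = {sd_r:.3f}")
```

The two returned numbers correspond to the kind of entry reported per metric pair: one mean coefficient and one SD taken across the four angular bins.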
It can be observed that all the correlations are negative and moderate in absolute value (Pearson correlation ≈ −0.634 to −0.733). The negative sign indicates that as the value of SEM rises, indicating a loss of spectral structural integrity, the values of PESQ, POLQA, STOI, NISQA, and UTMOS tend to decrease. This consistently negative correlation supports the inverse relationship between SEM and perceptual quality: SEM functions as a degradation-based metric, where an increase in spectral uncertainty, as measured by entropy, corresponds to a drop in established quality scores. The absolute value of the Pearson correlation coefficient between SEM and UTMOS is the highest (0.713), while the standard deviation is the lowest (±0.061), indicating a stable inverse relationship. It is noted that UTMOS has the highest correlation with actual MOS, as mentioned in [28].
To evaluate the monotonic relationship between SEM and the other objective perceptual metrics, we calculate the mean Spearman rank correlation coefficient using the same five-fold cross-validation procedure employed for the Pearson correlation coefficient. While the Pearson correlation measures linear alignment, the Spearman correlation checks the consistency of the rank order, which is critical for verifying that an increase in the spectral entropy ratio corresponds to a decrease in perceived quality. As seen from Table 3, the mean Spearman correlation for the SEM-UTMOS relationship is −0.702. This value closely mirrors the Pearson mean (−0.713), indicating that SEM maintains a high degree of monotonicity even when the processing artifacts introduce non-linear degradations. The standard deviation of the Spearman correlation is notably low, reinforcing the stability of the SEM metric across different phonetic contents and speakers.
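The distinction between linear alignment and rank-order consistency can be made concrete with a short sketch. The Spearman coefficient is computed here as the Pearson coefficient of the ranks (with tied values sharing their average rank); the five data points are hypothetical and chosen only so that quality falls monotonically but non-linearly as the entropy score rises.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores: quality decays exponentially with the entropy score.
entropy_score = [1.00, 1.05, 1.10, 1.20, 1.40]
quality_score = [math.exp(-8.0 * s) for s in entropy_score]

print(f"Pearson  = {pearson(entropy_score, quality_score):.3f}")
print(f"Spearman = {spearman(entropy_score, quality_score):.3f}")
```

Because the relationship is perfectly monotonic, the rank correlation reaches −1 while the linear coefficient does not, which is why agreement between the two coefficients is informative about non-linear degradations.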
The statistical reliability of these results is underscored by the 95% confidence intervals (CIs) for both the Pearson and Spearman correlation coefficients, as shown in Table 3. For the SEM-UTMOS pair, the CIs are [−0.81, −0.62] and [−0.80, −0.60], respectively, which are the narrowest among all the tested metric pairs. This indicates that the inverse relationship between SEM and perceptual quality is statistically significant and highly repeatable for fixed-beam first-order DMA systems. In conclusion, the agreement of the Pearson and Spearman coefficients for the SEM-UTMOS pair indicates that SEM is a robust, high-confidence indicator of speech quality, particularly since UTMOS is known to have the highest correlation with actual human mean opinion scores (MOSs) [28].
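The text does not state how the confidence intervals were obtained. One reading that reproduces the reported Pearson bounds is a two-sided Student-t interval over the four per-bin coefficients, using the reported mean (−0.713) and SD (±0.061) with three degrees of freedom. The sketch below is a check of that arithmetic under this assumption, not a statement of the authors' procedure; the critical value is hard-coded and the function name is hypothetical.

```python
import math

# Two-sided 95% Student-t critical value for 3 degrees of freedom (4 bins - 1).
T_CRIT_975_DF3 = 3.182446


def t_interval(mean, sd, n, t_crit):
    """Confidence interval mean +/- t_crit * sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Reported fixed-beam Pearson mean and SD for the pair with UTMOS.
low, high = t_interval(mean=-0.713, sd=0.061, n=4, t_crit=T_CRIT_975_DF3)
print(round(low, 2), round(high, 2))  # -0.81 -0.62
```

Under this reading, the interval width scales directly with the SD across bins, so the pair with the lowest SD necessarily has the narrowest interval.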
4.1.4. Spectral Coherence Sensitivity and Mean Temporal Variability
The evaluation of the mean spectral coherence sensitivity for the fixed-beam first-order DMA is given in
Table 4. Because the ranges of objective measure values are normalized, spectral coherence sensitivity directly compares how “reactive” each metric is to spectral misalignment.
It is seen from this table that SEM exhibits the highest absolute value of mean sensitivity (2.3691). This indicates that SEM is highly responsive to variations in spectral coherence, suggesting that it effectively captures how well the spectral structure is preserved in the enhanced output speech. In contrast, the MOS-based metrics, namely, PESQ (1.2690), STOI (0.4953) and UTMOS (1.1473), exhibit far less sensitivity to spectral coherence changes, implying that they do not explicitly account for spectral consistency but rather focus on the overall perceptual quality and intelligibility.
The negative sensitivity of SEM means that this metric increases as spectral coherence decreases, and vice versa.
4.2. Results for the Steerable-Beam First-Order DMA
We now consider a steerable-beam first-order DMA with four microphones arranged in a square geometry [
13] (see
Figure 2), with an inter-element distance
δ of 0.5 cm between its two adjacent microphones. Just as in the case of fixed DMA, we now consider the particular sound file, “sa1.wav”, that consists of the utterance “She had your dark suit in greasy wash water all year” by an adult female, as the desired sound source for the same four azimuth angle locations, as in the case of the fixed-beam first-order DMA. In addition, we assume that there is an undesired speaker at one of the null angles, just as in the case of fixed-beam first-order DMA with the sound file “sx178.wav”, which consists of the utterance “She encouraged her children to make their own Halloween costume” by an adult male speaker.
Evaluation of the perceptual quality of the output of the steerable-beam first-order DMA is conducted in a manner similar to that for the fixed-beam version.
The directional pattern of the steerable-beam first-order DMA and its main-lobe beam orientation for the four azimuth and null angle pairs are shown in
Figure 7. It is seen from this figure that the main lobe of the steerable-beam first-order DMA can be dynamically steered towards any desired azimuth angle
, effectively aligning with the changing position of the speaker. In view of this steering capability, the steerable-beam DMA can capture the desired speech reasonably well across all four angle pairs, as the speaker moves to different locations. In other words, unlike the fixed-beam counterpart, the steerable-beam DMA can reproduce with a reasonably good perceptual quality at its output, even when the speaker is located at an azimuth angle other than
.
Figure 8 illustrates boxplots for the steerable-beam first-order DMA; each subplot shows how SEM, PESQ, POLQA, STOI, NISQA and UTMOS are distributed across the four angular pairs, (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). A clear pattern emerges in these angular pairs. For the angular pairs (0°, 90°) and (270°, 180°), SEM remains close to unity (around 1.00–1.05), with narrower boxes and fewer extreme outliers, while PESQ, STOI, and UTMOS show high medians, reflecting better quality and intelligibility. At (45°, 135°) and (315°, 225°), SEM jumps to ~1.3–1.4, with more outliers. PESQ, STOI, and UTMOS drop by 1+ point on average, signifying lower perceived quality. The interquartile ranges are broader, especially for UTMOS, indicating higher variance at these angles. The box medians and whisker ranges underscore the role of beam steering: both (0°, 90°) and (270°, 180°) yield the best perceptual results, while the other two angular pairs degrade them.
Across both the fixed-beam and steerable-beam configurations, the boxplot data highlight significant differences in SEM, PESQ, POLQA, STOI, NISQA and UTMOS across the angular pairs. In both speech systems and across all angular pairs, SEM correlates inversely with these other perceptual metrics.
4.2.1. Descriptive Statistics
The mean and standard deviation values for SEM, PESQ, POLQA, STOI, NISQA and UTMOS for the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°) are given for the steerable-beam first-order DMA in Table 5. It is observed from this table that the mean of SEM is lowest (≈1.00) at (0°, 90°) and (270°, 180°), but rises to around 1.28–1.30 for the other two angular pairs. Meanwhile, the mean values of PESQ, STOI, and UTMOS are higher at (0°, 90°) and (270°, 180°) but lower at (45°, 135°) and (315°, 225°). It is noted that the STOI values observed in Table 5 are relatively low (ranging from approximately 0.34 to 0.42). This is attributed to the stringent test conditions involving a strong interfering speaker set at 0 dB SIR combined with diffuse noise at a 10 dB SNR. Under such high-interference scenarios, the intelligibility metric, STOI, exhibits lower absolute values, further emphasizing the need for supplementary diagnostic metrics like the proposed SEM.
In terms of standard deviation (SD), the variability of SEM is negligible (≈0.002) for (0°, 90°) and (270°, 180°), but much higher (≈0.17) at the other two angular pairs. PESQ and UTMOS exhibit the largest SDs (0.607–0.706 for PESQ and 0.935–1.065 for UTMOS) at (45°, 135°) and (315°, 225°). Just as in the case of the fixed-beam first-order DMA, this indicates that the system’s perceptual performance degrades at certain estimated azimuth and null angles.
The descriptive statistics demonstrate that the angular pairs (45°, 135°) and (315°, 225°) lead to a higher mean and SD for SEM but lower means for PESQ, STOI, and UTMOS, indicating that SEM is consistent with these metrics in predicting the perceptual quality of speech, irrespective of the beamforming effects at the estimated azimuth and null angles.
To assess the scalability of SEM as an objective metric, we conduct an analysis using five different SNR levels: −5 dB, 0 dB, 5 dB, 10 dB and 15 dB. Figure 9 shows the average value of SEM over 400 output audio files (100 audio files for each of the four angular bins) as a function of the input SNR level. It is seen from Figure 9 that at very low SNR, the high spectral uncertainty leads to a mean SEM value of approximately 1.5 for the fixed-beam and 1.35 for the steerable-beam first-order DMA, while at 15 dB, the mean SEM approaches 1.15 and 1.05, respectively, signifying a nearly faithful reproduction of the input speech. This trend across the SNR range indicates that SEM is a stable indicator of perceptual quality regardless of the noise intensity.
Linear regression analysis for the steerable-beam first-order DMA is presented as scatter plots with regression lines in Figure 10 and Figure 11. In these figures, the x-axis represents SEM and the y-axis represents the other objective metrics. The regression line has a negative slope, indicating, again, that a given value of SEM can be used to predict the perceptual quality score.
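How a fitted line turns an entropy-based score into a predicted quality score can be sketched with ordinary least squares. The six data points below are hypothetical and serve only to show the mechanics; they are not read from the figures, and the function name is illustrative.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


# Hypothetical (entropy-based score, quality score) pairs.
entropy_score = [1.00, 1.02, 1.10, 1.25, 1.30, 1.40]
quality_score = [3.9, 3.8, 3.4, 2.9, 2.6, 2.3]

slope, intercept = fit_line(entropy_score, quality_score)
predicted = slope * 1.15 + intercept  # predicted quality at a score of 1.15

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted quality at 1.15 = {predicted:.2f}")
```

A negative slope means each increase in the entropy-based score maps to a lower predicted quality score; how reliable that prediction is depends on the scatter around the line, which the correlation coefficients quantify.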
4.2.2. Analysis of Variance (ANOVA): Statistical Significance of Differences Between Conditions
Table 6 provides the one-way ANOVA across the four angular pairs, (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). The F-statistic and p-value for each objective perceptual metric obtained from the steerable-beam first-order DMA output speech are given in this table. SEM yields the highest F-statistic (185), indicating a large variation in SEM across angles, consistent with the jump in the mean value from ≈1.00 to ≈1.30. It is also observed that SEM has the lowest p-value, indicating a statistically significant difference between the mean value of at least one angular pair and those of the other angular pairs. In our case, there are two such angular pairs, namely (0°, 90°) and (270°, 180°), whose mean values differ from those of the other two angular pairs.
Similarly, the F-statistic values for PESQ, STOI and UTMOS indicate the variations across the angular pairs. This implies that the performance of the steerable-beam first-order DMA depends strongly on the angular pair under consideration, due to the directivity patterns and the noise interference for certain orientations. The very small p-values (<<0.05) for these perceptual quality metrics confirm that the angular pair has a statistically significant impact on all metrics.
The ANOVA findings confirm that there is a significant difference among the mean values of the angular pairs. In our case, there are two angular pairs, namely, (0°, 90°) and (270°, 180°), whose mean values are different from those of the other two angular pairs. These results are obtained for the four angular pairs, irrespective of the metric used in determining the speech quality, just as in the case of fixed-beam first-order DMA.
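For readers who want to see what the F-statistic measures, the following sketch computes it as the ratio of the between-group mean square to the within-group mean square. The four groups are synthetic samples drawn with means and SDs roughly like those quoted for the entropy-based measure (≈1.00 with SD ≈0.002, and ≈1.29 with SD ≈0.17); they are not the paper's data. The p-value is omitted because it requires the F distribution, which a statistics package such as SciPy's `scipy.stats.f_oneway` provides.

```python
import random
import statistics


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [statistics.fmean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Synthetic stand-ins for the four angular pairs, 100 files each.
rng = random.Random(0)
groups = [
    [rng.gauss(1.00, 0.002) for _ in range(100)],  # e.g. (0°, 90°)
    [rng.gauss(1.29, 0.17) for _ in range(100)],   # e.g. (45°, 135°)
    [rng.gauss(1.00, 0.002) for _ in range(100)],  # e.g. (270°, 180°)
    [rng.gauss(1.29, 0.17) for _ in range(100)],   # e.g. (315°, 225°)
]

f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.1f}")
```

A large F arises when the gap between group means is large relative to the spread within each group, which matches the pattern of two tightly clustered angular pairs and two shifted, more variable ones.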
To analyze whether SEM is signal-dependent, we evaluate its stability across a phonetically balanced TIMIT dataset of 100 utterances from both male and female speakers. The results in Table 1 and Table 5 show that at optimal beamforming angles, the standard deviation of SEM is negligible (≈0.002), indicating that the metric is not sensitive to variations in gender or phonetic content. Furthermore, the high ANOVA F-statistics (190 for the fixed-beam first-order DMA and 185 for the steerable-beam first-order DMA) indicate that the variance in the metric is tied to the system’s performance rather than to the characteristics of the input signal.
4.2.3. Cross-Validation of SEM vs. Other Objective Perceptual Metrics
The five-fold cross-validation procedure utilized in the case of fixed-beam first-order DMA to compute the mean Pearson correlation coefficient and its standard deviation is used to compute these two quantities for the case of the steerable-beam first-order DMA. The corresponding results for the steerable-beam case are given in
Table 7.
It is observed that all the correlations are negative and moderate in absolute value (Pearson correlation ≈ −0.510 to −0.597), just as in the case of the fixed-beam DMA. This implies that as the value of SEM rises, the values of PESQ, POLQA, STOI, NISQA, and UTMOS tend to decrease. The stable recurrence of these negative coefficients across the steerable configurations further supports the inverse relationship between entropy-based measures of signal degradation and traditional perceptual benchmarks, and reinforces the construct validity of SEM: its sensitivity to spectral coherence is aligned with the perceptual quality drops captured by the other objective metrics. The absolute value of the Pearson correlation coefficient between SEM and UTMOS is the highest (0.597), while the standard deviation is the lowest (±0.163), just as in the case of the fixed-beam first-order DMA.
The consistently negative Pearson correlation coefficients (averaging −0.63 to −0.73 for the fixed DMA and −0.51 to −0.60 for the steerable DMA) reinforce the construct validity of SEM. The fact that SEM rises when UTMOS and PESQ scores fall indicates that the metric identifies the perceptual degradation caused by angular-dependent distortions. The stability of this inverse relationship, particularly with UTMOS, which has the highest known correlation with actual human scores, suggests that SEM is a robust and valid proxy for naturalness, mapping spectral entropy to perceived quality drops.
In the more complex case of the steerable-beam configuration, the Spearman rank correlation is used to check whether the SEM metric remains a reliable rank-order indicator despite the dynamic nature of beam steering. In Table 7, the mean Spearman coefficient for the SEM-UTMOS relationship is −0.585, following the same moderate-to-strong inverse trend observed in the Pearson analysis. This consistency between the Pearson and Spearman coefficients indicates that as SEM increases, the rank order of speech quality consistently decreases across all steerable angular pairs. While the absolute values are slightly lower than those in the fixed-beam case due to steerable filtering effects, the Spearman standard deviation of the SEM-UTMOS pair remains the lowest among all the pairs.
The statistical reliability of these results is underscored by the 95% confidence intervals (CIs) for both the Pearson and Spearman correlation coefficients, as shown in Table 7. For the SEM-UTMOS pair, the CIs are [−0.86, −0.34] and [−0.85, −0.32], respectively, which are the narrowest among all the tested metric pairs. Since both intervals exclude zero, these ranges show that even with the added variability of beamformer steering, SEM maintains a statistically significant relationship with perceptual quality. Thus, just as in the case of the fixed-beam DMA, the inverse relationship between SEM and perceptual quality is statistically significant and repeatable for steerable-beam first-order DMA systems. In conclusion, the agreement of the Pearson and Spearman coefficients for the SEM-UTMOS pair indicates that SEM is a robust, high-confidence indicator of speech quality, particularly since UTMOS is known to have the highest correlation with actual human mean opinion scores (MOSs) [28].
4.2.4. Spectral Coherence Sensitivity and Mean Temporal Variability
The evaluation of the mean spectral coherence sensitivity for the steerable-beam first-order DMA is given in
Table 8. Because the ranges of objective measure values are normalized, spectral coherence sensitivity directly compares how “reactive” each metric is to spectral misalignment.
It is seen from this table that SEM exhibits the highest absolute value of mean sensitivity (2.3897). This indicates that SEM is highly responsive to variations in spectral coherence, suggesting that it effectively captures how well the spectral structure is preserved in the enhanced output speech. In contrast, the MOS-based metrics, namely, PESQ (1.3460), STOI (0.5742), and UTMOS (1.2198), exhibit far less sensitivity to spectral coherence changes, implying that they do not explicitly account for spectral consistency but rather focus on the overall perceptual quality and intelligibility.
The absolute mean spectral coherence sensitivity of SEM is slightly greater than that of its fixed-beam counterpart, reinforcing that it remains highly sensitive to changes in spectral integrity. This suggests that the steerable-beam processing might introduce fewer distortions or filtering effects than the fixed-beam processing.
The sensitivity of SEM to spectral coherence (~−2.39) further supports its role in detecting non-humanlike spectral structures. Because SEM specifically tracks the distribution of spectral energy, it is well suited to diagnosing when steerable-beam processing introduces unnatural spectral tilts or ripples. These results indicate that SEM is a robust proxy for naturalness, as it remains highly reactive to any processing effect that disrupts the coherent harmonic envelopes of the original speech.
The experimental results provided in Table 4 and Table 8 demonstrate the advantage of SEM in capturing subtle spectral degradations. In both the fixed and steerable DMA configurations, SEM exhibits an absolute mean sensitivity that is nearly double that of PESQ and more than four times that of STOI. These results indicate that SEM is significantly more effective at resolving distortions in the formant regions. Furthermore, SEM yields the highest F-statistics in the ANOVA (e.g., 190 for the fixed DMA), marking it as the metric that most strongly separates angular-dependent signal variations among all the tested measures.
5. Conclusions
Subjective evaluation of the speech quality is resource-intensive and time-consuming, as it requires human listeners and a test setup in a noise-free environment. Hence, objective metrics have been proposed in the literature to measure the quality of the output of a speech system without having to resort to time-consuming listening tests.
In this paper, we defined and employed a spectral entropy-based measure for evaluation of the perceptual quality of the speech produced by a speech system. This measure quantifies the spectral structural integrity of a speech system, which, in turn, assesses the system’s ability to preserve spectral coherence, an important indicator of the naturalness of the speech. Such a metric was cross-validated against other objective perceptual quality metrics. Unlike other objective metrics, such as PESQ, STOI, and POLQA, which primarily target intelligibility, the proposed spectral entropy-based measure provides a unique diagnostic focus on spectral structural integrity. By quantifying spectral coherence, this measure identifies the specific harmonic degradations that dictate perceived speech naturalness.
The importance of this work lies in the proposed measure’s ability to detect ‘unnatural’ artifacts that other metrics often overlook.
To illustrate the feasibility of employing the proposed metric in real-life applications, we employed the proposed measure to evaluate the performance of first-order differential microphone arrays, steerable as well as non-steerable, from the point of view of the perceptual quality of the speech they produce.
Based on the experimental results, it has been shown that the proposed metric highly correlates with the UTMOS metric, which is known to have the highest correlation with the actual mean opinion score. Further, the proposed metric has been shown to be a more sensitive indicator of the spectral coherence compared with the other objective metrics, making it a good measure for the assessment of the naturalness of the output speech signal, irrespective of the speech system used.
Thus, this study shows that SEM is not only a measure of quality, but also a valuable indicator of spectral integrity. Its inverse correlation with UTMOS, the metric that has been shown to have the highest correlation with actual human scores, supports its perceptual relevance. Given its high sensitivity to spectral coherence and its computational efficiency, SEM can be considered a useful tool for researchers aiming to preserve speech naturalness in the next generation of audio-processing technologies.
Despite the strong statistical performance of SEM in this study, it is to be noted that SEM is, in its present form, an intrusive metric. Its calculation assumes access to the spectral entropy of a clean reference signal, which may limit its application in real-time or real-world monitoring scenarios where a reference is not readily available. Moreover, since SEM is calculated from the PSD of STFT windows, it primarily tracks the distribution of the spectrum. Consequently, it may exhibit lower sensitivity to certain non-linear distortions, such as signal clipping or pure phase-shift distortions, which may degrade the perceived quality without significantly altering the global entropy of the spectral distribution. In view of these limitations, future work could investigate non-intrusive entropy approximations of SEM and integrate them into a holistic evaluation framework along with other perceptual metrics, potentially drawing on features used in non-intrusive benchmarks like NISQA, to enable reference-free quality monitoring. Future work could also explore phase-aware entropy features to enhance the metric’s diagnostic precision in non-linear environments.
Future work could also investigate the performance of SEM in higher-order arrays to test its robustness against more complex processing artifacts. Finally, while this study focused on microphone arrays, a logical next step is to validate SEM across a wider variety of speech systems, such as audio codecs and modern neural enhancement networks, to assess its diagnostic precision across different types of spectral distortions.
While the TIMIT corpus provided a phonetically balanced baseline for this study, the metric is designed to be language-independent. Since SEM quantifies the spectral coherence, a universal indicator of the ‘human-voice reference’, its application can be naturally extended to diverse languages and speaking styles without requiring language-specific training. Further, since SEM operates as a ratio of spectral entropy values, it is expected to be robust to variations in the input signal’s spectral shape, making it a promising tool for evaluating pathological speech or speech in high-noise environments. As demonstrated in our SNR evaluation, the metric tracks the speech quality degradation across five different SNR levels (−5, 0, 5, 10, and 15 dB), indicating its stability across varying noise intensities.