A Spectral Entropy-Based Metric for Evaluating Speech Perceptual Quality with Emphasis on Spectral Coherence

Sarafnia, Ali; Ahmad, M. Omair; Swamy, M.N.S.

doi:10.3390/signals7020027


by Ali Sarafnia, M. Omair Ahmad * and M.N.S. Swamy

Electrical and Computer Engineering Department, Concordia University, Montreal, QC H3G 1M8, Canada

* Author to whom correspondence should be addressed.

Signals 2026, 7(2), 27; https://doi.org/10.3390/signals7020027

Submission received: 13 December 2025 / Revised: 12 February 2026 / Accepted: 4 March 2026 / Published: 16 March 2026

(This article belongs to the Topic Image Processing, Signal Processing and Their Applications)


Abstract

Distortion of speech in real-life communication is inevitable and degrades its quality. Conventionally, the effectiveness of a speech system in terms of the perceptual quality of the speech it produces has been assessed using a time-consuming subjective metric, the mean opinion score. A number of objective metrics can be used instead of the mean opinion score to assess the perceptual quality of the speech signal. The objective of this paper is to propose and validate a new objective metric, the spectral entropy-based metric (SEM), designed to evaluate the perceptual quality and naturalness of speech by quantifying spectral coherence. While other metrics focus on intelligibility, this study aims to fill a gap in naturalness assessment. The core novelty of this work lies in offering a diagnostic perspective on spectral coherence, an indicator of speech naturalness that is often not explicitly addressed by other metrics. To demonstrate the effectiveness of the proposed metric in evaluating the perceptual quality, we consider fixed-beam and steerable-beam first-order differential microphone arrays. The proposed SEM is shown to be more sensitive than other objective metrics to spectral coherence, a predominant indicator of the naturalness of the output speech signal of a speech system.

Keywords: entropy-based measure; differential microphone array; steerable microphone array; beampattern

1. Introduction

Since Alexander Graham Bell’s invention of the telephone in 1876, researchers in the field of audio processing have continuously developed innovations to enhance the speech quality of speech systems, with the ultimate objective of achieving speech quality comparable to face-to-face communication [1]. Therefore, evaluating the performance of speech systems such as microphone arrays in terms of the perceptual quality of the speech they produce is important to ensure they meet these historical standards of excellence [2].

To quantify perceptual quality efficiently, there exist objective metrics such as perceptual evaluation of speech quality (PESQ) used in [3,4], short-time objective intelligibility (STOI) used in [4], the UTokyo-SaruLab system for voiceMOS (UTMOS) [5], non-intrusive speech quality assessment (NISQA) [1] and perceptual objective listening quality analysis (POLQA) [1]. These objective measures are used in place of subjective tests, which rely on time-intensive listening sessions and require an acoustic chamber, thereby allowing for more rapid and consistent assessment of speech signals.

A wide range of acoustic features, such as the temporal and spectral entropies, which describe sound structures in detail, are extensively used in audio analysis. These features provide statistical representations of sound structures without relying on sound pressure level, SNR, or prior knowledge of the sound. Entropy quantifies the information content of a signal by assessing its predictability. It is calculated by summing the negative logarithms of the individual probabilities of the acoustic parameter of interest, each weighted by its probability, yielding a single value that represents the information content of the signal. Entropy can be interpreted in terms of the probability distribution of a signal parameter. For instance, a noise signal with equal power across all frequency bins would exhibit a uniform probability density in the frequency domain, and, consequently, a higher spectral entropy, since noise introduces uncertainty, and entropy serves as a measure of that uncertainty [6]. The authors in [6] extended these findings by measuring entropy in both the time and frequency domains of real-world noise and evaluating its effect on speech perception. Their experimental results have shown that higher entropy (lower variance and a higher mean) corresponds to poorer speech quality perception. In the frequency domain, a more uniform power distribution across frequency bins leads to a higher entropy. This result has demonstrated that a dimension of acoustic complexity in real-world noise could be quantified using a simple acoustic feature to predict speech perception, even in the absence of additional information about the noise [6].

In view of this, it is of interest to study whether one could define a spectral entropy-based metric as an objective metric to measure the speech perceptual quality, and to cross-validate such an entropy metric with other objective perceptual quality measures.

Speech perceptual quality is inherently multi-dimensional, encompassing intelligibility, naturalness, distortion and clarity, each reflecting different listener priorities and acoustic correlations. Speech naturalness itself arises from a combination of spectral consistency, prosody (rhythm and intonation), and temporal dynamics, where listeners integrate smooth formant trajectories, appropriate pitch and amplitude modulation, and predictable temporal structure to judge a voice as “human-like” [7]. When speech is natural, listeners can focus on the message’s meaning rather than deciphering the speech signal. In other words, high naturalness frees up cognitive resources for understanding content. Cognitive load refers to the mental effort required to process and understand speech, which can be influenced by the quality, intelligibility, and spectral–temporal consistency of the speech signal.

Perceptually, voice naturalness is tightly linked to whether the speech signal behaves like a plausible output of the human speech production system, that is, whether its spectrotemporal patterns follow the smooth, predictable trajectories produced by articulatory dynamics. Nussbaum et al. emphasize that listeners form rapid naturalness impressions from multiple acoustic cues (including pitch contour, temporal structure, and spectral composition) and that naturalness judgments reflect an integration of such cues against an internal human-voice reference (a deviation- or human-likeness-based framework). From this perspective, spectral consistency, the presence of a smoothly varying formant structure and coherent harmonic envelopes across time, supports the perceptual inference that a stimulus is a plausible human voice and, therefore, raises perceived naturalness [7].

Moore and Tan [8] have shown that systematic perturbations of the long-term spectral shape reduce perceived naturalness in predictable ways. They have shown that introducing spectral “ripples” (periodic peaks and valleys), broad spectral tilts, or severe band-limiting produces large, reliable drops in naturalness ratings for speech; the degree of degradation scales with ripple depth/density, tilt magnitude, and the extent of the frequency range affected, with mid-frequencies being especially important for speech naturalness. Their results demonstrate that an inconsistent or non-human-like spectral structure—whether an irregular fine structure, gross tilt, or missing bands—is perceived as noticeably less natural [8]. Together, these conceptual and empirical strands show that maintaining spectrotemporal coherence, neither unnaturally flat nor irregularly distorted, is central to producing speech that listeners judge as natural. The works in [7,8] show that deviations from expected spectral patterns reduce perceived naturalness. In [9], the authors have shown that when the spectral content of speech is well-preserved or “coherent,” listeners experience more speech naturalness. In this work, we use the term ‘spectral coherence’ to refer to the preservation of the structured, predictable spectral energy distribution of natural speech, which we quantify via spectral entropy. Therefore, for the evaluation of a speech system, such as a microphone array, in terms of the naturalness of the output speech, spectral consistency can be quantified using spectral coherence.

The objective of this paper is to define an objective metric using entropy to evaluate the performance of a speech system in terms of the perceptual quality of the output it produces. To validate this metric in the context of perceptual quality, we choose a first-order differential microphone array (DMA), both non-steerable and steerable, as a representative model for speech systems and measure its output perceptual quality by means of the objective entropy-based metric. The choice of the first-order DMA is motivated by the wide range of perceptual qualities associated with its output, making it an ideal speech system to be evaluated using the objective entropy-based metric. Such an evaluation will help to determine how well the entropy-based metric and other objective perceptual quality measures capture the naturalness of the produced output speech.

While other metrics like PESQ and STOI focus on intelligibility, naturalness is dependent on spectral consistency, the preservation of smooth formant trajectories and coherent harmonic envelopes. A unique contribution of this research is the introduction of a perceptual quality metric that functions as a specific diagnostic tool for spectral stability. By quantifying the distribution of spectral energy, a spectral entropy-based metric identifies processing artifacts and unnatural distortions that are often ignored by measures prioritized for intelligibility. This allows for a more precise evaluation of how ‘human-like’ a speech signal remains after processing by speech systems like DMAs.

We also utilize this spectral entropy-based metric to compute its sensitivity with respect to the spectral coherence as a focused subsection of the speech naturalness. It is shown by experimentation that this measure is strongly tied to spectral coherence. Such a spectral entropy-based metric quantifies the distribution of spectral energy, which directly impacts perceived naturalness [9]. This entropy-based metric can be used to diagnose processing artifacts and unnatural distortions in speech systems.

Unlike other metrics such as PESQ or UTMOS, which follow a ‘higher-is-better’ scale to represent overall quality, the spectral entropy-based metric is fundamentally a degradation-based metric. In this framework, a higher value of the spectral entropy metric signifies increased spectral uncertainty and a departure from the coherent structures of natural speech. Consequently, an inverse correlation with other quality scores is expected and serves as a primary indicator of the metric’s construct validity, confirming its ability to accurately identify signal degradation.

This paper is structured as follows. Section 2 provides a brief overview of first-order fixed DMA as well as a steerable first-order DMA using four microphones. Section 3 presents the spectral entropy-based measure and its application in the spectral coherence evaluation of the output speech signal of first-order DMAs. The experimental results are presented in Section 4, and the conclusions in Section 5.

2. Background Material

As mentioned in the Introduction, fixed and steerable first-order DMAs are used as illustrative speech systems: the output of each DMA is evaluated with the spectral entropy-based metric introduced in this paper, and the result is compared with other objective measures of perceptual quality. For this purpose, we first briefly review the two types of first-order DMA, namely, the fixed-beam and the steerable-beam DMAs.

2.1. A Fixed-Beam First-Order DMA

A fixed-beam first-order DMA has a beam direction aimed at a desired angle. A fixed-beam DMA for which the main lobe is preset at 0° has been presented in [10].

A fixed-beam first-order DMA requires two microphones, as shown in Figure 1, where $\delta$ represents the distance between the microphones, $c$ represents the speed of sound, $\omega$ is the angular frequency and $\alpha = \cos\beta$ represents the design parameter, with $\beta$ being the null angle in the beampattern.

Assuming a plane wave impinging on the first-order DMA, the time delay between the received signals at the two microphones is $\tau_0 = \delta / c$ when the azimuth angle $\varphi$ is 0°. Further, the speakers are assumed to be in the far-field.
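As a quick numerical illustration of the delay $\tau_0 = \delta / c$, the following sketch evaluates it for the inter-microphone distance of 0.5 cm used later in Section 4.1. The speed of sound of 343 m/s (air at roughly 20 °C) is an assumption, since the text does not state a value, and the function name is hypothetical.

```python
# Illustrative values: delta is taken from Section 4.1; c is an assumed value
# (the article does not specify the speed of sound).
DELTA_M = 0.005      # inter-microphone distance: 0.5 cm, in metres
C_M_PER_S = 343.0    # assumed speed of sound in air, in metres per second


def endfire_delay(delta_m: float, c_m_per_s: float) -> float:
    """Return tau_0 = delta / c in seconds for a plane wave arriving at 0 degrees."""
    if c_m_per_s <= 0:
        raise ValueError("speed of sound must be positive")
    return delta_m / c_m_per_s


tau_0 = endfire_delay(DELTA_M, C_M_PER_S)
print(f"tau_0 = {tau_0 * 1e6:.2f} microseconds")
```

Under these assumptions the delay is about 14.6 µs, which is well below one sample period at the 16 kHz sampling rate used in the experiments (62.5 µs).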

It is noted that when the speaker relocates, that is, φ ≠ 0°, the fixed beam direction of the DMA prevents it from capturing the desired speech signal, thus limiting its application to only those scenarios where the signal source remains stationary [11]. Therefore, optimal performance in terms of the directivity factor occurs only at the end-fire direction, which is the direction along the line that connects two microphones [12]. While this assumption holds in applications such as hearing aids and Bluetooth headsets, where steering is unnecessary [13], it is not suitable for environments like conference rooms, where the desired speaker may move within the room [13].

2.2. A Steerable-Beam First-Order DMA

In applications where the speaker may be moving, it is necessary to make the first-order DMA steerable, so that its main beam is steered towards the speaker [11]. In this context, steerability refers to the ability of the DMA to electronically adjust the direction of the main beam towards the speaker without requiring any mechanical movement of the array [11]. In this paper, we consider a first-order steerable differential array (FOSDA) that employs a four-element square array configuration ($M = 4$), as shown in Figure 2 [14]. In real-world environments, the estimation of the location of the desired speaker may get degraded due to the presence of reverberation, background noise, or an undesired speaker [15]. In this paper, we consider such a real-world scenario where diffuse noise is present, the desired speaker is moving, and an undesired speaker is talking at the same time. The steerability ensures that ambient noise is suppressed independently of the sound source location, which can significantly improve the perceptual quality of the output speech of the FOSDA.

3. Entropy-Based Measure

The entropy of a random variable $X$ with $N$ states or symbol probabilities $[p_1, \dots, p_N]$, where $P_X(x_i) = p_i$, is given by

$$H(X) = -\sum_{i=1}^{N} P_X(x_i) \log_2 P_X(x_i) \tag{1}$$

where $H$ is the Shannon entropy. To compute the entropy of a spectrum, the authors of [16] converted the spectrum into a probability mass function (PMF)-like function by normalizing each frequency component by the total energy of the frequency components of the short-time frame. After this normalization, the values of the normalized full-band spectrum sum to unity. The authors in [16] suggested computing the entropy from the full-band normalized spectrum. The following equation is used for the full-band normalization:

$$x_i = \frac{X_i}{\sum_{k=1}^{N} X_k}, \quad i = 1, \dots, N \tag{2}$$

where $X_i$ is the energy of the $i$th frequency component of the spectrum, $x = (x_1, \dots, x_N)$ is the PMF of the spectrum and $N$ is the number of points in the spectrum (order of short-time Fourier transform (STFT)/number of discrete Fourier transform (DFT) points). It was found in [16] that the entropy can be used to capture the peak shapes of a PMF. A PMF with a sharp peak will have low entropy, while a PMF with a flat distribution will have high entropy. In the case of STFT spectra of speech, the authors observed distinct spectral peaks, with their positions varying based on the phoneme being analyzed. The importance of formants is well established, and in [17], the authors explored the use of spectral peak location as an additional feature for automatic speech recognition (ASR).
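The relationship between peak shape and entropy can be checked with a small sketch of Equations (1) and (2). The function names and the two eight-bin toy spectra are hypothetical and chosen only for illustration; they are not taken from [16] or from the experiments in this paper.

```python
import math


def spectrum_to_pmf(energies):
    """Normalize per-bin energies X_i so that they sum to one (Equation (2))."""
    total = sum(energies)
    if total <= 0:
        raise ValueError("spectrum has no energy to normalize")
    return [e / total for e in energies]


def shannon_entropy(pmf):
    """Shannon entropy in bits (Equation (1)); zero-probability bins add nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


N = 8
flat = spectrum_to_pmf([1.0] * N)                     # equal energy in every bin
peaky = spectrum_to_pmf([100.0] + [1.0] * (N - 1))    # one dominant bin

h_flat = shannon_entropy(flat)
h_peaky = shannon_entropy(peaky)
print(f"flat: {h_flat:.3f} bits, peaky: {h_peaky:.3f} bits")
```

The flat spectrum reaches the maximum of $\log_2 N = 3$ bits for $N = 8$, while the spectrum with one dominant bin yields a much lower value, in line with the observation above.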

As mentioned before, noise introduces additional entropy into a system by increasing uncertainty. Calculating the entropy of a noisy speech signal in the time domain consistently shows higher entropy compared with that of a clean signal [18], confirming that noise increases entropy by reducing the information content. In the case of white Gaussian noise (WGN), entropy and variance are directly related; an increase in one leads to an increase in the other [19]. For white noise with non-Gaussian distributions, such as multimodal or uniform distributions [20], variance fails to fully capture the uncertainty or unpredictability. In such cases, entropy is a more effective metric for quantifying uncertainty. The spectral entropy of a speech signal captures information embedded in its various frequency components, as represented in the short-time Fourier transform (STFT). The choice of STFT over wavelet transformation is motivated by two key factors. First, STFT ensures consistent spectral resolution, reducing the risk of entropy variations that stem from wavelet decomposition choices (e.g., basis function selection and decomposition level). While wavelets provide multi-resolution analysis, they utilize a non-linear frequency tiling that under-resolves high-frequency components while over-resolving lower scales. For the purpose of the proposed spectral entropy-based metric, such a variable resolution would introduce scale-dependent entropy biases, potentially obscuring the ‘peaky’ structures of high-frequency harmonics that are also vital for perceived naturalness. In contrast, the constant bin width of the STFT ensures that every frequency component is treated with equal statistical weight during the PMF normalization process. This linear resolution is crucial for the diagnostic accuracy of the spectral entropy-based metric, as it allows for a direct and consistent evaluation of formant-region integrity and spectral ripples across the full spectrum. 
Second, unlike wavelets, which redistribute energy across scales, STFT retains a direct frequency-to-entropy relationship across the entire speech bandwidth [16], crucial for interpreting speech signal degradation. Defining spectral entropy for the power spectral density (PSD) of each STFT window enables us to evaluate the contribution of perceptually important frequency components, such as those in the formant region [21,22], which cannot be adequately assessed using temporal entropy.

Moreover, while temporal entropy requires a histogram of samples to derive the probability mass function (PMF) [23], spectral entropy has the advantage that the PMF can be determined directly by normalizing the STFT power spectrum [16]. Additionally, the spectrum of white noise, interpreted as a uniform distribution over a frequency range, has maximum entropy [24]. Since this flat noise distribution overlaps with the speech spectrum within the same frequency range, it increases the spectral entropy due to the uncertainty introduced by the noise [25,26]. Our proposed metric is capable of evaluating the contribution of perceptually important frequency components, such as those in the formant region.

In order to evaluate the performance of a speech system in terms of the perceptual quality of the output it produces, we now define a spectral entropy-based measure by

$$\mathrm{SEM} = \frac{\text{Spectral entropy of the output speech signal}}{\text{Spectral entropy of the input speech signal}} \tag{3}$$

For every STFT frame of a speech signal, using Equation (1), the spectral entropy of the frame is

$$-\sum_{i=1}^{N} x_i \log_2 x_i \tag{4}$$

where $x_i$ is the $i$th component of the PMF, and $N$ is the number of STFT points. The spectral entropy of the speech signal is given by

$$SE = -\sum_{j=1}^{L} \sum_{i=1}^{N} x_{j,i} \log_2 x_{j,i} \tag{5}$$

where $x_{j,i}$ is the $i$th component of the PMF of frame $j$ and $L$ is the number of frames.

Substituting Equation (5) into Equation (3), we can calculate the value of SEM. Based on the value of SEM, two perceptual states are defined to characterize the output of the speech system:

$$\begin{cases} \mathrm{SEM} = 1: & \text{output speech is a faithful reproduction of the input speech} \\ \mathrm{SEM} \neq 1: & \text{output speech is a spectrally distorted or degraded version} \end{cases} \tag{6}$$

The condition $\mathrm{SEM} > 1$ indicates an increase in spectral entropy, signifying a shift toward a more uniform power distribution across frequency bins. While this often results from additive noise, it also captures spectral distortions introduced by array processing, such as filtering artifacts that may flatten the distinct spectral peaks or formants necessary for naturalness. The larger the value of SEM, the worse the degradation of spectral coherence and the lower the naturalness. Conversely, $\mathrm{SEM} < 1$ would indicate an unnatural sharpening of the spectrum or a loss of information content. Thus, the metric serves as a broad indicator of spectral structural integrity.

The SEM framework treats the speech spectrum as a probability distribution, where entropy serves as a direct measure of spectral uncertainty. An increase in SEM reflects a loss of spectral structure, often occurring when additive noise or improper beamforming flattens the distinct peaks of the voice. Conversely, while not commonly observed in standard linear processing, $\mathrm{SEM} < 1$ is theoretically possible and would indicate excessive spectral sharpening. This condition implies that processing has unnaturally narrowed the spectral peaks, creating a signal that is ‘peakier’ than the human-voice reference, which equally degrades perceived naturalness by introducing metallic or robotic artifacts.

Mathematically, since entropy captures the predictability of a signal’s power distribution, a higher SEM signifies that the output has shifted toward a more uniform (and, thus, more uncertain) state. This shift represents a degradation of spectral coherence, where the system fails to preserve the harmonic relationships of the input. In the theoretical case of $\mathrm{SEM} < 1$, the output exhibits reduced spectral uncertainty compared with the input. This would imply that the speech system has performed an aggressive non-linear reduction in the spectral width of formants, a process known as excessive sharpening. In both the $\mathrm{SEM} > 1$ and $\mathrm{SEM} < 1$ scenarios, the metric successfully identifies a departure from the spectral coherence required for natural speech.
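The two regimes can be illustrated at the level of a single PMF. The sketch below is not the authors' experiment: the reference PMF is a made-up example with two formant-like peaks, flattening is modeled by blending it with a uniform PMF (roughly what additive white noise does), and sharpening is modeled by squaring each bin and renormalizing. All names are hypothetical, and the ratio shown is a single-frame analogue of SEM.

```python
import math


def entropy_bits(pmf):
    """Shannon entropy in bits of a PMF given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


# Hypothetical "input" frame: two formant-like peaks over a weak floor.
reference = normalize([1, 2, 30, 4, 1, 1, 20, 3, 1, 1])
n_bins = len(reference)

# Flattening: an equal-weight blend with the uniform PMF.
flattened = normalize([0.5 * p + 0.5 / n_bins for p in reference])

# Sharpening: squaring exaggerates the peaks relative to the floor.
sharpened = normalize([p ** 2 for p in reference])

h_ref = entropy_bits(reference)
ratio_flat = entropy_bits(flattened) / h_ref    # analogue of SEM > 1
ratio_sharp = entropy_bits(sharpened) / h_ref   # analogue of SEM < 1
print(f"flattened: {ratio_flat:.3f}, sharpened: {ratio_sharp:.3f}")
```

In this toy setting the flattened PMF gives a ratio above one and the sharpened PMF a ratio below one, so both departures from the reference move the ratio away from unity, but in opposite directions.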

In order to calculate the SEM of the output of a speech system, the following procedure is used:

Step 1: Divide the input clean speech signal into frames;

Step 2: Compute the STFT for each frame of the clean speech signal;

Step 3: Calculate the PMF for each frame’s STFT of the clean speech signal;

Step 4: Compute the spectral entropy of the clean speech signal, $SE_C$, using (5);

Step 5: Feed the clean speech into the speech system and obtain the output of the speech system;

Step 6: Perform steps 1 to 4 using the output of the speech system instead of the clean speech to obtain the spectral entropy of the output speech, $SE_O$;

Step 7: Obtain SEM as the ratio of $SE_O$ to $SE_C$, that is,

$$\mathrm{SEM} = \frac{SE_O}{SE_C}. \tag{7}$$
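The seven steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: frames are non-overlapping with a rectangular window (the text does not specify a window or overlap), the DFT is a naive $O(N^2)$ version so that the block needs only the standard library, silent frames are treated as contributing zero entropy, and the "speech system" of Step 5 is replaced by simply adding pseudo-random noise to a toy two-tone signal. All function names are hypothetical.

```python
import cmath
import math
import random


def frame_signal(samples, frame_len):
    """Step 1: split into non-overlapping frames, dropping a trailing partial frame."""
    count = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(count)]


def power_spectrum(frame):
    """Step 2: squared magnitude of an N-point DFT of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(n)
    ]


def frame_entropy(energies):
    """Step 3 and Eq. (4): normalize energies to a PMF and return its entropy in bits."""
    total = sum(energies)
    if total <= 0.0:
        return 0.0  # assumption: a silent frame contributes no entropy
    return -sum((e / total) * math.log2(e / total) for e in energies if e > 0.0)


def spectral_entropy(samples, frame_len):
    """Step 4 and Eq. (5): sum of the per-frame spectral entropies."""
    return sum(frame_entropy(power_spectrum(f)) for f in frame_signal(samples, frame_len))


def sem(clean, output, frame_len):
    """Step 7 and Eq. (7): ratio SE_O / SE_C."""
    se_clean = spectral_entropy(clean, frame_len)
    if se_clean == 0.0:
        raise ValueError("clean signal has zero spectral entropy")
    return spectral_entropy(output, frame_len) / se_clean


FS = 16000       # sampling rate used in Section 4
FRAME_LEN = 64   # short frames keep the naive DFT fast; the article uses 320 points

rng = random.Random(0)  # fixed seed so that the example is reproducible
clean = [
    math.sin(2 * math.pi * 500 * t / FS) + 0.5 * math.sin(2 * math.pi * 1500 * t / FS)
    for t in range(4 * FRAME_LEN)
]
# Stand-in for Steps 5 and 6: the "system output" is the clean signal plus noise.
noisy = [s + rng.gauss(0.0, 0.3) for s in clean]

print(sem(clean, clean, FRAME_LEN))  # identical signals give a ratio of 1.0
print(sem(clean, noisy, FRAME_LEN))  # noise flattens the spectrum: ratio above 1
```

For real recordings, an FFT routine and the frame settings of Section 4 (20 ms frames, 320 DFT points) would replace the naive DFT and the short toy frames used here.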

The use of the ratio in the definition of SEM serves a critical normalization function. Because different speakers and phonetic contents naturally possess different baseline spectral entropies, using the clean input ($SE_C$) as a reference ensures that the metric isolates the processing artifacts of the system rather than the characteristics of the speaker’s voice. This formulation allows SEM to function as a system-dependent diagnostic tool.

It is important to distinguish the causes of entropy variation. Natural speech is characterized by a distinct ‘peaky’ structure in the frequency domain, particularly in the formant regions. When a system like a DMA processes a signal, any filtering artifacts that smooth these peaks or flatten the spectral envelope will result in a higher SEM value, regardless of whether external noise is present. By utilizing the direct frequency-to-entropy relationship of the STFT, SEM identifies these spectral inconsistencies as a deviation from the human-voice reference.

While SEM and the other objective metrics differ in the way they are computed, their purpose remains the same, namely, evaluating the perceptual quality of the output of a speech system. Given this shared goal, a correlation between SEM and the other objective metrics can be calculated.

While objective metrics such as PESQ [3,4], STOI [4], and UTMOS [5] focus primarily on intelligibility, they do not explicitly take into consideration the spectral consistency of the speech signal, which is critical in evaluating speech naturalness [9]. In contrast, SEM, as an entropy-based metric, quantifies the distribution of spectral energy, which directly impacts perceived naturalness [9].

The metric SEM should not be judged only by how well it correlates with other objective metrics. Its unique role in the evaluation of the output speech naturalness should be emphasized.

For spectral coherence tests, we use spectral coherence sensitivity, a measure that shows how much each perceptual objective metric changes when speech is distorted.

To ensure a fair comparison of how each metric responds to the spectral coherence changes independent of their original scales, we first apply a linear normalization to map each metric ($M$) to a common range of 0 to 1, as shown by the following equation:

$$M_{\text{norm}} = \frac{M - M_{\min}}{M_{\max} - M_{\min}} \tag{8}$$

where $M_{\min}$ and $M_{\max}$ are the theoretical minimum and maximum values of each metric. The normalized spectral coherence sensitivity ($S$) is then calculated as the change in the normalized metric relative to the change in spectral coherence ($C$) using the following equation:

$$S = \frac{M_{\text{norm,clean}} - M_{\text{norm,output}}}{\Delta C} \tag{9}$$

where $\Delta C = C_{\text{clean \& clean}} - C_{\text{clean \& output}}$. In this framework, $C$ is the spectral coherence between the clean and output signals. This approach allows us to quantify the intrinsic responsiveness of each metric to the coherence loss independent of its original scale.
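A minimal sketch of Equations (8) and (9) is given below. The numerical values are made up for illustration only: a hypothetical metric with a theoretical range of 1 to 5, scores of 4.5 (clean) and 3.5 (output), and coherence values of 1.0 and 0.8. How the spectral coherence $C$ itself is estimated is not covered here, so it is passed in as a precomputed number, and the function names are hypothetical.

```python
def normalize_metric(value, metric_min, metric_max):
    """Eq. (8): map a metric value onto [0, 1] using its theoretical range."""
    if metric_max <= metric_min:
        raise ValueError("metric_max must be greater than metric_min")
    return (value - metric_min) / (metric_max - metric_min)


def coherence_sensitivity(m_clean, m_output, metric_min, metric_max,
                          c_clean_clean, c_clean_output):
    """Eq. (9): change in the normalized metric per unit loss of spectral coherence."""
    delta_c = c_clean_clean - c_clean_output
    if delta_c == 0:
        raise ValueError("no coherence change: sensitivity is undefined")
    change = (normalize_metric(m_clean, metric_min, metric_max)
              - normalize_metric(m_output, metric_min, metric_max))
    return change / delta_c


# Hypothetical example values, not results from the article.
s_example = coherence_sensitivity(
    m_clean=4.5, m_output=3.5, metric_min=1.0, metric_max=5.0,
    c_clean_clean=1.0, c_clean_output=0.8,
)
print(f"S = {s_example:.3f}")
```

With Equation (9) as written, a metric that increases under degradation produces a negative numerator, so comparisons of responsiveness across metrics would presumably rest on the magnitude of $S$.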

The introduction of the spectral coherence sensitivity framework represents a novel analytical contribution of this work. This measure quantifies the reactivity of an objective metric to specific spectral distortions. Unlike standard MOS-based metrics designed for broad quality assessment, this framework highlights the high responsiveness of SEM to the breakdown of harmonic structures. By isolating spectral coherence as a focused subsection of speech naturalness, we demonstrate that SEM provides a level of diagnostic granularity that is currently absent in less sensitive, intelligibility-focused industry standards.

The design of SEM as a ratio of output-to-input entropy means it specifically scales with spectral complexity and distortion. Because noise and processing artifacts introduce more uniform power distributions and, thus, higher entropy, higher SEM values indicate poorer naturalness. This inherent orientation as a measure of degradation distinguishes it from MOS-based metrics, which are designed to measure perceptual excellence. This distinction is crucial for interpreting cross-validation results, as a strong negative correlation indicates that SEM is successfully capturing the same perceptual phenomena as established quality metrics, but from the perspective of signal breakdown.

Comparison of Computational Complexities

The computational complexity of the spectral entropy-based measure SEM is $O(L\lambda \log L\lambda)$, where $L$ is the number of frames of the signal, $\lambda$ is the length of the FFT (DFT), and $M$ is the number of microphones.

It should be noted that the other metrics, namely, PESQ, POLQA, and STOI, all have the same computational complexity, namely, $O(L\lambda \log L\lambda)$.

Since SEM relies on standard STFT operations, its runtime behavior is deterministic and highly efficient on modern hardware with FFT acceleration. Unlike UTMOS, which requires significant memory for model weights and inference, SEM is a low-power, purely statistical metric, making it viable for integration into real-time speech-enhancement diagnostic tools in edge devices such as hearing aids and mobile communication systems.

To illustrate the usefulness of the proposed entropy-based metric, SEM, we now consider the performance evaluation of a non-steerable as well as a steerable first-order DMA as examples.

4. Experimental Results

In this section, we evaluate the perceptual quality of both the fixed-beam and the steerable-beam first-order DMAs in the presence of diffuse noise and an interfering speaker. The fixed-beam first-order DMA, composed of two omnidirectional microphones, is configured with its main lobe fixed at $\varphi = 0^{\circ}$, while the steerable-beam first-order DMA utilizes four omnidirectional microphones arranged in a square geometry.

The fixed-beam and steerable-beam first-order DMAs were simulated in the MATLAB R2015a environment. We chose 55 utterances spoken by male speakers and 45 utterances spoken by female speakers from the TIMIT database, which includes phonetically balanced sentences [27] with a sampling rate of $f_s = 16$ kHz, and fed them as input to the simulated speech systems. The interfering speaker was modeled by an audio sample from the TIMIT database located at one of the four different null angles. This speaker was active at the same time as the desired speaker, with a relative signal-to-interference ratio (SIR) of 0 dB. The additive diffuse noise was also modeled in MATLAB R2015a so that the resulting signal had an SNR of 10 dB.
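One common way to reach a prescribed SIR or SNR is to scale the disturbing signal relative to the mean power of the desired signal. The sketch below illustrates that idea on single-channel toy signals. It is an assumption about the general technique, not the authors' MATLAB code: the article does not state how the ratios were measured (for example, at which microphone), diffuse noise is not modeled spatially here, and all names are hypothetical.

```python
import math
import random


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def gain_for_ratio_db(target, disturbance, ratio_db):
    """Gain g such that 10*log10(P_target / P_(g*disturbance)) equals ratio_db."""
    p_disturbance = mean_power(disturbance)
    if p_disturbance == 0:
        raise ValueError("disturbance has zero power")
    return math.sqrt(mean_power(target) / (p_disturbance * 10 ** (ratio_db / 10)))


def mix(target, interferer, noise, sir_db, snr_db):
    """Add a scaled interferer and scaled noise to the target signal."""
    g_i = gain_for_ratio_db(target, interferer, sir_db)
    g_n = gain_for_ratio_db(target, noise, snr_db)
    return [t + g_i * i + g_n * n for t, i, n in zip(target, interferer, noise)]


# Toy signals standing in for the desired speech, the interferer, and the noise.
rng = random.Random(1)  # fixed seed for reproducibility
length = 1600
target = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(length)]
interferer = [0.3 * math.sin(2 * math.pi * 330 * t / 16000) for t in range(length)]
noise = [rng.gauss(0.0, 1.0) for _ in range(length)]

mixture = mix(target, interferer, noise, sir_db=0.0, snr_db=10.0)

# Check the achieved SNR of the scaled noise against the requested 10 dB.
g_noise = gain_for_ratio_db(target, noise, 10.0)
achieved_snr_db = 10 * math.log10(
    mean_power(target) / mean_power([g_noise * n for n in noise])
)
print(f"achieved SNR = {achieved_snr_db:.2f} dB")
```

The ratios are defined here with respect to the desired signal alone, which is one of several possible conventions.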

We followed the procedure given in Section 3 to calculate the SEM of the system, using each of the utterances as the input and obtaining the corresponding output of the system. For this purpose, we set the frame length to 20 ms and the number of DFT points to 320 for each time frame. SEM was compared against other objective perceptual quality metrics, namely, PESQ, POLQA, STOI, NISQA, and the deep learning-based UTMOS [5]. These metrics were also used to evaluate the experimental results, and their values were compared with those of SEM to validate the effectiveness of SEM and to show that it cannot be replaced by them.

4.1. Results for the Fixed-Beam First-Order DMA

Consider a fixed-beam first-order DMA whose microphone inter-element distance $\delta$ is 0.5 cm. We now obtain the values of the spectral entropy measure SEM for four different azimuth angles and null angles for the fixed-beam first-order DMA. For this purpose, we consider a particular sound file, namely, “sa1.wav”, that consists of the utterance “She had your dark suit in greasy wash water all year” by an adult female, as the desired sound source for four different azimuth angle locations. In addition, we assume that there is an undesired speaker located at one of the four different null angles, represented by the file “sx178.wav”, which consists of the utterance “She encouraged her children to make their own Halloween costume” by an adult male speaker.

The directional pattern of the fixed-beam first-order DMA and its main-lobe beam orientation for the four azimuth and null angle pairs are shown in Figure 3. It can be seen from this figure that the main-lobe beam is fixed at φ = 0°, while the speaker’s azimuth angle and the null angle vary. As the speaker’s location changes, the DMA cannot steer its beam towards the new location of the speaker and, therefore, can properly capture the desired speech only when the speaker is located at φ = 0°. In other words, for all the other cases, where the speaker is located at an azimuth angle other than φ = 0°, the fixed-beam first-order DMA cannot properly reproduce the desired speech.
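The effect of a fixed main lobe can be illustrated with the textbook frequency-independent first-order pattern B(θ) = a + (1 − a)cos(θ − θ_s), where θ_s is the steering direction and a is chosen to place the null. This pattern is an assumption introduced here for illustration; it is not taken from the article's array design, and the function name is hypothetical.

```python
import math

def first_order_response(theta_deg, steer_deg, null_deg):
    """Idealized first-order pattern a + (1 - a)*cos(theta - steer) with a null at null_deg."""
    c = math.cos(math.radians(null_deg - steer_deg))
    if math.isclose(c, 1.0):
        raise ValueError("null direction must differ from the steering direction")
    a = -c / (1.0 - c)   # solves a + (1 - a) * c = 0
    return a + (1.0 - a) * math.cos(math.radians(theta_deg - steer_deg))

# (speaker azimuth, null angle) pairs used in the experiments.
pairs = [(0, 90), (45, 135), (270, 180), (315, 225)]

# Fixed beam: main lobe stays at 0 degrees, only the null moves.
fixed = {p: first_order_response(p[0], 0, p[1]) for p in pairs}
# Steered beam: main lobe follows the speaker.
steered = {p: first_order_response(p[0], p[0], p[1]) for p in pairs}
```

In this idealized model the fixed beam has unit gain only for the pair (0°, 90°) and a reduced gain towards the speaker for the other three pairs, whereas the steered beam keeps unit gain for all four. The model ignores frequency dependence and noise, so it does not account for the measured differences among the steered cases.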

Figure 4 illustrates boxplots for the fixed-beam first-order DMA; each subplot shows how SEM, PESQ, POLQA, STOI, NISQA and UTMOS are distributed across the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). A clear pattern emerges in these angular pairs. For SEM, the angular pair (0°, 90°) has a very tight box (low variance) at ≈1.00, with minimal outliers. At this pair, PESQ, STOI, and UTMOS display significantly higher medians (e.g., PESQ near 4.2–4.3 and UTMOS near 4.3–4.4) than at the other angles, indicating the best perceived quality and intelligibility. In the remaining angular pairs, the median of SEM is 1.3–1.4 on average, with broader interquartile ranges and more outliers, while PESQ, STOI, and UTMOS exhibit noticeably lower medians and more outliers. The boxplot whiskers for (45°, 135°), (270°, 180°), and (315°, 225°) show larger spreads and more outliers, matching the higher standard deviations observed numerically. As expected, the fixed beam best serves the angular pair (0°, 90°): the unity value of SEM at this pair coincides with the highest values of the perceptual metrics PESQ, POLQA, STOI, NISQA and UTMOS.

4.1.1. Descriptive Statistics

Mean and standard deviation values for SEM, PESQ, POLQA, STOI, NISQA and UTMOS for the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°) are given for the fixed-beam first-order DMA in Table 1. It is observed from this table that the mean of SEM is lowest (≈1.00) at (0°, 90°), but rises to around 1.39–1.42 for the other angular pairs. Meanwhile, the mean values of PESQ and UTMOS, which track the perceived quality, are highest at (0°, 90°) and drop significantly at (45°, 135°), (270°, 180°), and (315°, 225°). STOI follows a similar trend, reflecting better performance near the (0°, 90°) angular pair. It is noted that the STOI values observed in Table 1 are relatively low (ranging from approximately 0.31 to 0.42). This can be attributed to the stringent test conditions involving a strong interfering speaker set at 0 dB SIR, combined with diffuse noise at a 10 dB SNR. Under such high-interference scenarios, the intelligibility metric, STOI, exhibits lower absolute values, further emphasizing the need for supplementary diagnostic metrics like the proposed SEM. All these metrics exhibit angular pair dependencies, demonstrating significant differences in values for both SEM and the other perceptual objective metrics for each tested angular pair.

The rise in SEM mean values to approximately 1.39–1.42 for ‘off’ angular pairs indicates a significant loss of spectral definition. This increase is not solely a reflection of the 10 dB diffuse noise, but also indicates array-induced spectral artifacts. At these angles, the DMA’s inability to steer its beam results in a transfer function that flattens the harmonic structure of the desired speaker. The higher entropy, thus, quantifies the perceptual degradation caused by the system’s failure to preserve the spectral peaks essential for a ‘human-like’ sound.

In terms of standard deviation (SD), SEM’s variability is negligible (≈0.002) for (0°, 90°), but grows significantly for (45°, 135°), (270°, 180°), and (315°, 225°). NISQA shows the largest standard deviations at angles other than (0°, 90°). STOI remains moderately stable, though it also exhibits higher SD at the “off” angles.
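The per-pair means and standard deviations reported in such a table can be computed as in the short sketch below. The scores are synthetic placeholders drawn from arbitrary distributions, not the values behind Table 1, and the dictionary layout is a hypothetical choice.

```python
import random
import statistics

random.seed(0)
# Synthetic placeholder scores (NOT the article's data): 100 utterances per pair.
scores = {
    "(0, 90)": [random.gauss(1.00, 0.002) for _ in range(100)],
    "(45, 135)": [random.gauss(1.40, 0.15) for _ in range(100)],
}

# Mean and sample standard deviation for each angular pair.
summary = {pair: (statistics.mean(v), statistics.stdev(v)) for pair, v in scores.items()}

for pair, (mean, sd) in summary.items():
    print(f"{pair}: mean={mean:.3f}, SD={sd:.3f}")
```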

Linear regression analysis for the fixed-beam first-order DMA is presented as scatter plots with regression lines in Figure 5 and Figure 6. In these figures, the x-axis represents SEM and the y-axis represents the other objective metrics. Each regression line has a negative slope, so that a higher value of SEM corresponds to a lower predicted score and, for a given value of SEM, the perceptual quality score can be estimated from the line.
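A regression line of this kind can be obtained by ordinary least squares, as in the following minimal sketch. The function name is hypothetical and the three data points are an exact toy example, not measured scores.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy example: points lying exactly on y = 7 - 2x.
slope, intercept = fit_line([1.0, 2.0, 3.0], [5.0, 3.0, 1.0])

# Predicted score for a given metric value on the fitted line.
predicted = slope * 1.5 + intercept
```

How reliable such a prediction is depends on the scatter around the line, which the slope alone does not convey; the correlation coefficients reported later in this section address that.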

4.1.2. Analysis of Variance (ANOVA): Statistical Significance of Differences Between Conditions

Table 2 provides the one-way ANOVA across the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). The F-statistic and p-value for each objective perceptual metric, obtained from the fixed-beam first-order DMA output speech, are given in this table. SEM yields the highest F-statistic, indicating massive variation in SEM across angles, consistent with the large jump in the mean value from ≈1.00 to ≈1.40. It is also observed that SEM has the lowest p-value, indicating a statistically significant difference among the angular-pair means, which, given the mean values in Table 1, lies between the angular pair (0°, 90°) and the other three angular pairs.

The ANOVA findings confirm that there is a significant difference among the angular pair mean values and at least one angular pair mean value is different from the others, which, in our case, is the angular pair of (0°, 90°). These results are obtained for the four angular pairs, irrespective of the metric used in determining the speech quality.
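For reference, the one-way ANOVA F-statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it directly; the function name is hypothetical and the three small groups are a toy example rather than the angular-pair data.

```python
def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups of N total samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy groups with means 2, 3 and 6 and identical within-group spread.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F-statistic means that the group means differ by much more than the spread within each group would explain. The associated p-value requires the F distribution, which library routines such as `scipy.stats.f_oneway` provide.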

The massive variation in SEM across different angular pairs (F-statistic of 190) highlights its sensitivity to increased spectral uncertainty caused by the DMA’s spatial filtering. At ‘off’ angles, the improper alignment of the beam pattern with the speaker’s location acts as a spectral disruptor, increasing the entropy of the output. These results confirm that SEM effectively monitors the structural integrity of the speech spectrum, flagging any processing configuration that leads to a loss of the clear, predictable spectral peaks characteristic of a clean human voice.

4.1.3. Cross-Validation of SEM vs. Other Objective Perceptual Metrics

A five-fold cross-validation is performed to assess the applicability of SEM as a metric to measure the perceptual quality relative to the objective metrics (PESQ, POLQA, STOI, NISQA and UTMOS) for each of the four angular bins (each angular pair is considered as an angular bin). For this purpose, the dataset of 100 audio files is randomly partitioned into five folds for each of the four angular bins. For each of the five iterations, four folds are used as the reference set, while the remaining fold serves as the test set. For each of the four angular bins, we first compute the Pearson correlation coefficient between SEM and each of the other objective measures and find the average of the results across all five folds. Thus, there is one Pearson correlation coefficient value for each of the four angular bins. The value of the Pearson correlation coefficient, shown in Table 3, is the average of all four Pearson correlation coefficients. The standard deviation shown in Table 3 is the SD between the Pearson correlation coefficient values of these four angular bins.
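The fold-and-bin averaging described above can be sketched as follows. Several details are assumptions made for illustration: the correlation is computed on the held-out fold of each iteration, the data are synthetic placeholders with an arbitrary negative trend, and the function names are hypothetical.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fold_averaged_r(metric, other, n_folds=5, seed=0):
    """Randomly split into folds, correlate on each held-out fold, average over folds."""
    idx = list(range(len(metric)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    per_fold = [pearson([metric[i] for i in f], [other[i] for i in f]) for f in folds]
    return statistics.fmean(per_fold)

rng = random.Random(1)
bin_r = []
for _ in range(4):  # one entry per angular bin
    # Synthetic placeholder scores for 100 files (NOT the article's data).
    sem = [rng.gauss(1.3, 0.15) for _ in range(100)]
    other = [5.0 - 2.0 * s + rng.gauss(0.0, 0.3) for s in sem]
    bin_r.append(fold_averaged_r(sem, other))

mean_r = statistics.fmean(bin_r)   # value reported per metric pair
sd_r = statistics.stdev(bin_r)     # SD across the four angular bins
```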

It can be observed that all the correlations are negative and moderate in absolute value (Pearson correlation ≈ −0.634 to −0.733). The negative sign indicates that as the value of SEM rises, reflecting a loss of spectral structural integrity, the values of PESQ, POLQA, STOI, NISQA, and UTMOS tend to decrease. This consistently negative correlation supports the inverse relationship of SEM with perceptual quality: SEM functions as a degradation-based metric, where an increase in spectral uncertainty, as measured by entropy, corresponds to a drop in established quality scores. In other words, the SEM metric tracks perceptual improvements and degradations in the same direction as the other metrics. The absolute value of the Pearson correlation coefficient (|r|) between SEM and UTMOS is the highest (0.713), while the standard deviation is the lowest (±0.061), indicating a stable inverse relationship. It is noted that UTMOS has the highest correlation with actual MOS, as mentioned in [28].

In order to evaluate the monotonic relationship between SEM and other objective perceptual metrics, we calculate the mean Spearman rank correlation coefficient (ρ) using the same five-fold cross-validation procedure that we employed to obtain the Pearson correlation coefficient. While the Pearson correlation measures linear alignment, Spearman correlation confirms the consistency of the rank-order, which is critical for verifying that any increase in the spectral entropy ratio (SEM ≠ 1) corresponds to a decrease in perceived quality. As seen from Table 3, the mean Spearman correlation for the SEM-UTMOS relationship is −0.702. This value closely mirrors the Pearson mean (−0.713), demonstrating that SEM maintains a high degree of monotonicity even when the processing artifacts introduce non-linear degradations. The standard deviation for Spearman correlation is notably low, reinforcing the stability of the SEM metric across different phonetic contents and speakers.
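The distinction between the two coefficients can be made concrete with a short sketch. Spearman's ρ is the Pearson coefficient of the ranks, so a relationship that is monotonic but not linear yields |ρ| = 1 while |r| < 1. The function names are hypothetical, ties are given average ranks, and the data are a toy example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Monotonically decreasing but non-linear toy relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0 / v ** 3 for v in x]
rho = spearman(x, y)
r = pearson(x, y)
```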

The statistical reliability of these results is underscored by the 95% confidence intervals (CIs) for both Pearson and Spearman correlation coefficients, as shown in Table 3. For the SEM-UTMOS pair, the CIs are [−0.81, −0.62] and [−0.80, −0.60], respectively, which are the narrowest among all the tested metric pairs. This indicates that the inverse relationship between SEM and perceptual quality is statistically significant and highly repeatable for fixed-beam first-order DMA systems. In conclusion, the alignment of Pearson and Spearman coefficients for the SEM-UTMOS pair confirms that SEM is a robust, high-confidence indicator of speech quality. This conclusion is strengthened by the fact that UTMOS is known to have the highest correlation with actual human mean opinion scores (MOSs) [28].

4.1.4. Spectral Coherence Sensitivity and Mean Temporal Variability

The evaluation of the mean spectral coherence sensitivity for the fixed-beam first-order DMA is given in Table 4. Because the ranges of objective measure values are normalized, spectral coherence sensitivity directly compares how “reactive” each metric is to spectral misalignment.

It is seen from this table that SEM exhibits the highest absolute value of mean sensitivity (2.3691). This indicates that SEM is highly responsive to variations in spectral coherence, suggesting that it effectively captures how well the spectral structure is preserved in the enhanced output speech. In contrast, the MOS-based metrics, namely, PESQ (1.2690), STOI (0.4953) and UTMOS (1.1473), exhibit far less sensitivity to spectral coherence changes, implying that they do not explicitly account for spectral consistency, but rather focus on the overall perceptual quality and intelligibility.

Negative sensitivity for SEM means that this metric increases as spectral coherence decreases and vice versa.

4.2. Results for the Steerable-Beam First-Order DMA

We now consider a steerable-beam first-order DMA with four microphones arranged in a square geometry [13] (see Figure 2), with an inter-element distance δ of 0.5 cm between its two adjacent microphones. Just as in the case of fixed DMA, we now consider the particular sound file, “sa1.wav”, that consists of the utterance “She had your dark suit in greasy wash water all year” by an adult female, as the desired sound source for the same four azimuth angle locations, as in the case of the fixed-beam first-order DMA. In addition, we assume that there is an undesired speaker at one of the null angles, just as in the case of fixed-beam first-order DMA with the sound file “sx178.wav”, which consists of the utterance “She encouraged her children to make their own Halloween costume” by an adult male speaker.

Evaluation of the perceptual quality of the output of the steerable-beam first-order DMA is conducted in a manner similar to that for the fixed-beam version.

The directional pattern of the steerable-beam first-order DMA and its main-lobe beam orientation for the four azimuth and null angle pairs are shown in Figure 7. It is seen from this figure that the main lobe of the steerable-beam first-order DMA can be dynamically steered towards any desired azimuth angle φ, effectively aligning with the changing position of the speaker. In view of this steering capability, the steerable-beam DMA can capture the desired speech reasonably well across all four angle pairs, as the speaker moves to different locations. In other words, unlike its fixed-beam counterpart, the steerable-beam DMA can reproduce the desired speech with a reasonably good perceptual quality at its output, even when the speaker is located at an azimuth angle other than φ = 0°.

Figure 8 illustrates boxplots for the steerable-beam first-order DMA; each subplot shows how SEM, PESQ, POLQA, STOI, NISQA and UTMOS are distributed across the four angular pairs, (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). A clear pattern emerges in these angular pairs. For the angular pairs (0°, 90°) and (270°, 180°), SEM remains close to unity (around 1.00–1.05), with narrower boxes and fewer extreme outliers, and PESQ, STOI, and UTMOS show high medians, reflecting better quality and intelligibility. At (45°, 135°) and (315°, 225°), SEM jumps to ~1.3–1.4, with more outliers. PESQ and UTMOS drop by more than one point on average, and STOI also decreases, signifying lower perceived quality. The interquartile ranges are broader, especially for UTMOS, indicating higher variance at these angles. The box medians and whisker ranges underscore the role of beam steering: both (0°, 90°) and (270°, 180°) yield the best perceptual results, while the two other angular pairs yield degraded perceptual results.

Across both fixed-beam and steerable configurations, the boxplot data highlight significant differences in SEM, PESQ, POLQA, STOI, NISQA and UTMOS across the angular pairs. In both speech systems and across all angular pairs, SEM correlates inversely with these other perceptual metrics.

4.2.1. Descriptive Statistics

The mean and standard deviation values for SEM, PESQ, POLQA, STOI, NISQA and UTMOS for the four angular pairs (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°) are given for the steerable-beam first-order DMA in Table 5. It is observed from this table that the mean of SEM is lowest (≈1.00) at (0°, 90°) and (270°, 180°), but rises to around 1.28–1.30 for the other two angular pairs. Meanwhile, the mean values of PESQ, STOI, and UTMOS are higher at (0°, 90°) and (270°, 180°) but lower at (45°, 135°) and (315°, 225°). It is noted that the STOI values observed in Table 5 are relatively low (ranging from approximately 0.34 to 0.42). This is attributed to the stringent test conditions involving a strong interfering speaker set at 0 dB SIR combined with diffuse noise at a 10 dB SNR. Under such high-interference scenarios, the intelligibility metric, STOI, exhibits lower absolute values, further emphasizing the need for supplementary diagnostic metrics like the proposed SEM.

In terms of standard deviation (SD), SEM’s variability is negligible (≈0.002) for (0°, 90°) and (270°, 180°), but much higher (≈0.17) at the other two angular pairs. PESQ and UTMOS exhibit the largest SD values (0.607–0.706 for PESQ and 0.935–1.065 for UTMOS) at (45°, 135°) and (315°, 225°). Just as in the case of the fixed-beam first-order DMA, this indicates that the system’s perceptual performance is degraded at certain estimated azimuth and null angles.

The descriptive statistics demonstrate that angular pair dependencies at (45°, 135°) and (315°, 225°) lead to a higher mean and SD for SEM, but lower means for PESQ, STOI, and UTMOS, indicating consistency in predicting the speech perceptual quality irrespective of the beam-forming effects at the estimated azimuth and null angles.

To assess the scalability of SEM as an objective metric, we conduct an analysis using five different SNR levels: −5 dB, 0 dB, 5 dB, 10 dB and 15 dB. Figure 9 shows the average value of SEM over 400 output audio files (100 audio files for each of the four angular bins) as a function of the input SNR level. It is seen from Figure 9 that, for the fixed-beam and steerable-beam first-order DMAs, the high spectral uncertainty at very low SNR leads to a mean value of SEM of approximately 1.5 and 1.35, respectively, while at 15 dB, the mean of SEM approaches 1.15 and 1.05, respectively, signifying a nearly faithful reproduction of the input speech. This trend across the SNR range confirms that SEM is a stable indicator of perceptual quality regardless of the noise intensity.

Linear regression analysis for the steerable-beam first-order DMA is presented as scatter plots with regression lines in Figure 10 and Figure 11. In these figures, the x-axis represents SEM and the y-axis represents the other objective metrics. Each regression line again has a negative slope, so that, for a given value of SEM, the perceptual quality score can be estimated from the line.

4.2.2. Analysis of Variance (ANOVA): Statistical Significance of Differences Between Conditions

Table 6 provides the one-way ANOVA across the four angular pairs, (0°, 90°), (45°, 135°), (270°, 180°), and (315°, 225°). The F-statistic and p-value for each objective perceptual metric obtained from the steerable-beam first-order DMA output speech are given in this table. SEM yields the highest F-statistic (185), indicating massive variation in SEM across angles, consistent with the large jump in the mean value from ≈1.00 to ≈1.30. It is also observed that SEM has the lowest p-value, indicating that there exists a statistically significant difference between the mean value of at least one angular pair and those of the other angular pairs. In our case, the difference lies between the mean values of two angular pairs, namely, (0°, 90°) and (270°, 180°), and the mean values of the other two angular pairs.

Similarly, the F-statistic values for PESQ, STOI and UTMOS indicate the variations across the angular pairs. This implies that the performance of the steerable-beam first-order DMA depends strongly on the angular pair under consideration, due to the directivity patterns and the noise interference for certain orientations. The very small p-values (<<0.05) for these perceptual quality metrics confirm that the angular pair has a statistically significant impact on all metrics.

The ANOVA findings confirm that there is a significant difference among the mean values of the angular pairs. In our case, there are two angular pairs, namely, (0°, 90°) and (270°, 180°), whose mean values are different from those of the other two angular pairs. These results are obtained for the four angular pairs, irrespective of the metric used in determining the speech quality, just as in the case of fixed-beam first-order DMA.

To analyze whether SEM is signal-dependent, we evaluate its stability across a phonetically balanced TIMIT dataset of 100 utterances from both male and female speakers. The results in Table 1 and Table 5 show that at optimal beamforming angles, the standard deviation of SEM is negligible (≈0.002), indicating that the metric is not sensitive to variations in gender or phonetic content. Furthermore, the high ANOVA F-statistics (reaching 190 for the fixed-beam first-order DMA and 185 for the steerable-beam first-order DMA) confirm that the variance in the metric is statistically tied to the system’s performance rather than to the characteristics of the input signal.

4.2.3. Cross-Validation of SEM vs. Other Objective Perceptual Metrics

The five-fold cross-validation procedure utilized in the case of fixed-beam first-order DMA to compute the mean Pearson correlation coefficient and its standard deviation is used to compute these two quantities for the case of the steerable-beam first-order DMA. The corresponding results for the steerable-beam case are given in Table 7.

It is observed that all the correlations are negative and moderate in absolute value (Pearson correlation ≈ −0.510 to −0.597), just as in the case of the fixed-beam DMA. This implies that as the value of SEM rises, the values of PESQ, POLQA, STOI, NISQA, and UTMOS tend to decrease. The stable recurrence of these negative coefficients across the steerable configurations further validates the inverse relationship between entropy-based measures of signal degradation and traditional perceptual benchmarks. This reinforces the construct validity of SEM, demonstrating that its sensitivity to spectral coherence is aligned with the perceptual quality drops captured by other objective metrics. The absolute value of the Pearson correlation coefficient (|r|) between SEM and UTMOS is the highest (0.597), while the standard deviation is the lowest (±0.163), just as in the case of the fixed-beam first-order DMA.

The consistent negative Pearson correlation coefficients (averaging −0.63 to −0.73 for the fixed DMA and −0.51 to −0.60 for the steerable DMA) reinforce the construct validity of SEM. The fact that SEM rises when UTMOS and PESQ scores fall confirms that the metric is correctly identifying the perceptual degradation caused by angular-dependent distortions. The stability of this inverse relationship, particularly with UTMOS, which has the highest known correlation with actual human scores, demonstrates that SEM is a robust and valid proxy for naturalness, effectively mapping spectral entropy to perceived quality drops.

In the more complex case of the steerable-beam configuration, the Spearman rank correlation (ρ) is utilized to see if the SEM metric remains a reliable rank-order indicator despite the dynamic nature of beam steering. In Table 7, the mean Spearman coefficient for the SEM-UTMOS relationship is −0.585, following the same moderate-to-strong inverse trend observed in the Pearson analysis. This consistency between r and ρ indicates that as SEM increases, the rank-order of speech quality consistently decreases across all steerable angular pairs. While the absolute values are slightly lower than those in the fixed-beam case due to steerable filtering effects, the Spearman standard deviation of the SEM-UTMOS pair remains the lowest with respect to that of the other pairs.

The statistical reliability of these results is underscored by the 95% confidence intervals (CIs) for both Pearson and Spearman correlation coefficients, as shown in Table 7. For the SEM-UTMOS pair, the CIs are [−0.86, −0.34] and [−0.85, −0.32], respectively, which are the narrowest among all the tested metric pairs. These ranges demonstrate that, even with the added variability of beamformer steering, the inverse relationship between SEM and perceptual quality is statistically significant and repeatable for the steerable-beam first-order DMA systems, just as in the case of the fixed-beam DMA. In conclusion, the alignment of Pearson and Spearman coefficients for the SEM-UTMOS pair confirms that SEM is a robust, high-confidence indicator of speech quality. This conclusion is strengthened by the fact that UTMOS is known to have the highest correlation with actual human mean opinion scores (MOSs) [28].

4.2.4. Spectral Coherence Sensitivity and Mean Temporal Variability

The evaluation of the mean spectral coherence sensitivity for the steerable-beam first-order DMA is given in Table 8. Because the ranges of objective measure values are normalized, spectral coherence sensitivity directly compares how “reactive” each metric is to spectral misalignment.

It is seen from this table that SEM exhibits the highest absolute value of mean sensitivity (2.3897). This indicates that SEM is highly responsive to variations in spectral coherence, suggesting that it effectively captures how well the spectral structure is preserved in the enhanced output speech. In contrast, the MOS-based metrics, namely, PESQ (1.3460), STOI (0.5742), and UTMOS (1.2198), exhibit far less sensitivity to spectral coherence changes, implying that they do not explicitly account for spectral consistency, but rather focus on the overall perceptual quality and intelligibility.

The absolute value of the mean spectral coherence sensitivity of SEM is slightly greater than that of its fixed-beam counterpart, reinforcing that it remains highly sensitive to spectral integrity changes. This suggests that the steerable-beam processing might introduce fewer distortions or filtering effects than the fixed-beam processing.

The sensitivity of SEM to spectral coherence (~−2.39) further validates its role in detecting non-humanlike spectral structures. Because SEM specifically tracks the distribution of spectral energy, it is uniquely equipped to diagnose when steerable-beam processing introduces unnatural spectral tilts or ripples. These results confirm that SEM is a robust proxy for naturalness, as it remains highly reactive to any processing effect that disrupts the coherent harmonic envelopes of the original speech.

The experimental results provided in Table 4 and Table 8 demonstrate the superiority of SEM in capturing subtle spectral degradations. In both the fixed and steerable DMA configurations, SEM exhibits an absolute mean sensitivity that is nearly double that of PESQ and more than four times that of STOI. These results indicate that SEM is significantly more effective at resolving distortions in the formant regions. Furthermore, SEM yields the highest F-statistics in the ANOVA (e.g., 190 for the fixed DMA), confirming it as the most statistically significant metric for detecting angular-dependent signal variations compared with all other tested measures.
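The sensitivity comparison can be checked with simple arithmetic on the absolute mean sensitivities quoted in the text for Table 4 and Table 8. The dictionary layout below is a hypothetical convenience; the numbers are those stated above.

```python
# Absolute mean spectral coherence sensitivities quoted in the text.
sensitivity = {
    "fixed": {"SEM": 2.3691, "PESQ": 1.2690, "STOI": 0.4953, "UTMOS": 1.1473},
    "steerable": {"SEM": 2.3897, "PESQ": 1.3460, "STOI": 0.5742, "UTMOS": 1.2198},
}

# Ratio of the SEM sensitivity to that of each other metric.
ratios = {
    cfg: {m: vals["SEM"] / v for m, v in vals.items() if m != "SEM"}
    for cfg, vals in sensitivity.items()
}

for cfg, r in ratios.items():
    print(cfg, {m: round(v, 2) for m, v in r.items()})
```

The ratios come out at roughly 1.8 to 1.9 relative to PESQ and roughly 4.2 to 4.8 relative to STOI.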

5. Conclusions

Subjective evaluation of the speech quality is resource-intensive and time-consuming, as it requires human listeners and a test setup in a noise-free environment. Hence, objective metrics have been proposed in the literature to measure the quality of the output of a speech system without having to resort to time-consuming listening tests.

In this paper, we defined and employed a spectral entropy-based measure for evaluation of the perceptual quality of the speech produced by a speech system. This measure quantifies the spectral structural integrity of a speech system, which, in turn, assesses the system’s ability to preserve spectral coherence, an important indicator of the naturalness of the speech. Such a metric was cross-validated against other objective perceptual quality metrics. Unlike other objective metrics, such as PESQ, STOI, and POLQA, which primarily target intelligibility, the proposed spectral entropy-based measure provides a unique diagnostic focus on spectral structural integrity. By quantifying spectral coherence, this measure identifies the specific harmonic degradations that dictate perceived speech naturalness.

The importance of this work lies in the proposed measure’s ability to detect ‘unnatural’ artifacts that other metrics often overlook.

To illustrate the feasibility of employing the proposed metric in real-life applications, we employed the proposed measure to evaluate the performance of first-order differential microphone arrays, steerable as well as non-steerable, from the point of view of the perceptual quality of the speech they produce.

Based on the experimental results, it has been shown that the proposed metric highly correlates with the UTMOS metric, which is known to have the highest correlation with the actual mean opinion score. Further, the proposed metric has been shown to be a more sensitive indicator of the spectral coherence compared with the other objective metrics, making it a good measure for the assessment of the naturalness of the output speech signal, irrespective of the speech system used.

Thus, this study confirms that SEM is not only a quality measure of performance, but also an irreplaceable indicator of spectral integrity. Its inverse correlation with UTMOS, the metric that has been shown to have the highest correlation with actual human scores, validates its perceptual relevance. Given its superior sensitivity to spectral coherence and its computational efficiency, SEM can be considered as a vital tool for researchers aiming to preserve speech naturalness in the next generation of audio-processing technologies.

Despite the strong statistical performance of SEM in this study, it is to be noted that SEM is, in its present form, an intrusive metric. The calculation of SEM assumes access to the spectral entropy of a clean reference signal (SE_C), which may not always be available in real-time applications. This requirement may limit its application in real-world monitoring scenarios where a reference is not readily available. Moreover, since SEM is calculated based on the PSD of STFT windows, it primarily tracks the distribution of the spectrum. Consequently, it may exhibit lower sensitivity to certain non-linear distortions, such as signal clipping or pure phase-shift distortions, which may degrade the perceived quality without significantly altering the global entropy of the spectral distribution. In view of these limitations, future work could investigate non-intrusive entropy approximations of SEM and integrate them into a holistic evaluation framework along with other perceptual metrics, potentially drawing on features used in non-intrusive benchmarks like NISQA, to enable reference-free quality monitoring. It could also explore phase-aware entropy features to enhance the metric’s diagnostic precision in non-linear environments.

Future work could also investigate the performance of SEM in higher-order arrays to test its robustness against more complex processing artifacts. Finally, while this study focused on microphone arrays, a logical next step is to validate SEM across a wider variety of speech systems, such as audio codecs and modern neural enhancement networks, to ensure its diagnostic precision across different types of spectral distortions.

While the TIMIT corpus provided a phonetically balanced baseline for this study, the SEM metric is designed to be language-independent. Since SEM quantifies the spectral coherence, a universal indicator of the ‘human-voice reference’, its application can be naturally extended to diverse languages and speaking styles without requiring language-specific training. Further, since SEM operates as a ratio of spectral entropy values, it is inherently robust to variations in the input signal’s spectral shape, making it a promising tool for evaluating pathological speech or speech in high-noise environments. As demonstrated in our SNR evaluation, the metric reliably tracks the speech quality degradation across five different SNR levels (−5, 0, 5, 10, and 15 dB), confirming its stability across varying noise intensities.

Author Contributions

Conceptualization, A.S., M.O.A. and M.N.S.S.; Methodology, A.S., M.O.A. and M.N.S.S.; Validation, A.S. and M.N.S.S.; Formal Analysis, A.S. and M.N.S.S.; Investigation, A.S., M.O.A. and M.N.S.S.; Resources, M.O.A. and M.N.S.S.; Writing—Original Draft, A.S.; Writing—Review and Editing, M.O.A. and M.N.S.S.; Visualization, A.S.; Supervision, M.O.A. and M.N.S.S.; Project Administration, M.O.A. and M.N.S.S.; Funding Acquisition, M.O.A. and M.N.S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PESQ	Perceptual Evaluation of Speech Quality
STOI	Short-Time Objective Intelligibility
UTMOS	UTokyo-SaruLab System for VoiceMOS
NISQA	Non-Intrusive Speech Quality Assessment
POLQA	Perceptual Objective Listening Quality Analysis
DMA	Differential Microphone Array
FOSDA	First-Order Steerable Differential Array
PMF	Probability Mass Function
STFT	Short-Time Fourier Transform
DFT	Discrete Fourier Transform
ASR	Automatic Speech Recognition
WGN	White Gaussian Noise
PSD	Power Spectral Density
ANOVA	Analysis of Variance
SD	Standard Deviation

Figure 1. Illustration of a first-order differential microphone array.

Figure 2. A steerable first-order DMA using four omnidirectional microphones in square geometry.

Figure 3. Beampatterns of the fixed-beam first-order DMA for four azimuth and null angle pairs: (a) (0°, 90°), (b) (45°, 135°), (c) (270°, 180°), and (d) (315°, 225°).

Figure 4. Boxplots for objective perceptual metrics for four angular pairs of azimuth and null angles for the fixed-beam first-order DMA: (a) SEM, (b) PESQ, (c) POLQA, (d) STOI, (e) NISQA, and (f) UTMOS.

Figure 5. Scatter plot of SEM with respect to (a) PESQ, (b) POLQA, (c) STOI, (d) NISQA, and (e) UTMOS, and linear regression for four angular pairs of azimuth and null angles for the fixed-beam first-order DMA.

Figure 6. Flattened data scatter plot of SEM with respect to (a) PESQ, (b) POLQA, (c) STOI, (d) NISQA, and (e) UTMOS, and linear regression for the fixed-beam first-order DMA.

Figure 7. Beampatterns of the steerable-beam first-order DMA for four pairs of estimated azimuth angles and null angles: (a) (0°, 90°), (b) (45°, 135°), (c) (270°, 180°), and (d) (315°, 225°).

Figure 8. Boxplots for objective perceptual metrics for four angular pairs of azimuth and null angles for the steerable-beam first-order DMA: (a) SEM, (b) PESQ, (c) POLQA, (d) STOI, (e) NISQA, and (f) UTMOS.

Figure 9. Average values of SEM taken over 400 output audio files (100 audio files for each of the four angular bins) as a function of input SNR level for the fixed-beam and steerable-beam first-order DMAs.

Figure 10. Scatter plot of SEM with respect to (a) PESQ, (b) POLQA, (c) STOI, (d) NISQA, and (e) UTMOS, and linear regression for four angular pairs of azimuth and null angles for the steerable-beam first-order DMA.

Figure 11. Flattened data scatter plot of SEM with respect to (a) PESQ, (b) POLQA, (c) STOI, (d) NISQA, and (e) UTMOS, and linear regression for the steerable-beam first-order DMA.

Table 1. Mean and standard deviation values (listed as mean, SD) of six objective perceptual metrics for four azimuth and null angle pairs for the fixed-beam first-order DMA.

Desired Speaker’s Azimuth	Null Angle	SEM	PESQ	POLQA	STOI	NISQA	UTMOS
0°	90°	1.003, 0.002	4.232, 0.369	4.141, 0.800	0.419, 0.038	4.236, 0.835	4.317, 0.662
45°	135°	1.393, 0.147	3.125, 0.551	2.124, 1.003	0.316, 0.053	2.215, 1.014	2.462, 0.844
270°	180°	1.407, 0.160	3.167, 0.630	2.283, 1.028	0.327, 0.054	2.157, 1.020	2.422, 0.879
315°	225°	1.417, 0.164	3.090, 0.623	2.240, 1.009	0.317, 0.051	2.212, 1.001	2.491, 0.890

Table 2. Statistical significance: F-statistics and p-values for all objective perceptual metrics for the fixed-beam first-order DMA.

Metric	F-Statistic	p-Value
SEM	190	6.07 × 10⁻⁸⁴
PESQ	100	3.49 × 10⁻⁴⁸
POLQA	102	3.24 × 10⁻⁴⁸
STOI	99.6	4.78 × 10⁻⁴⁸
NISQA	111	4.69 × 10⁻⁵²
UTMOS	127	9 × 10⁻⁵⁸
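For readers who wish to run a comparable test on their own scores, the F-statistic of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch on synthetic data only: the group means and standard deviations are hypothetical and are not the experimental scores, the function name `one_way_anova_f` is illustrative, and the group size of 100 is an assumption that mirrors the 100 files per angular pair mentioned in the Figure 9 caption. The p-value is omitted because it requires the F-distribution CDF, which the Python standard library does not provide.

```python
import random


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares: spread of samples around their group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


random.seed(1)
# Hypothetical (mean, SD) pairs for four conditions; synthetic, not measured.
group_params = [(1.0, 0.10), (1.4, 0.15), (1.4, 0.15), (1.4, 0.15)]
groups = [[random.gauss(mu, sd) for _ in range(100)]
          for mu, sd in group_params]

f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

With four groups of 100 samples the statistic has (3, 396) degrees of freedom; a large F indicates that the variation between conditions dominates the variation within them.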

Table 3. Five-fold cross-validation (SEM vs. other objective metrics) for the fixed-beam first-order DMA.

Objective Metric	Mean Pearson Correlation Coefficient (r ± SD)	95% Confidence Interval (CI) (Pearson)	Mean Spearman Correlation (ρ ± SD)	95% Confidence Interval (CI) (Spearman)
PESQ	−0.652 ± 0.122	[−0.85, −0.46]	−0.640 ± 0.125	[−0.84, −0.44]
POLQA	−0.634 ± 0.168	[−0.90, −0.37]	−0.621 ± 0.170	[−0.89, −0.35]
STOI	−0.669 ± 0.156	[−0.92, −0.42]	−0.655 ± 0.159	[−0.91, −0.40]
NISQA	−0.686 ± 0.164	[−0.95, −0.43]	−0.671 ± 0.167	[−0.94, −0.41]
UTMOS	−0.713 ± 0.061	[−0.81, −0.62]	−0.702 ± 0.064	[−0.80, −0.60]

Table 4. Mean of spectral coherence sensitivity for objective perceptual metrics for the fixed-beam first-order DMA.

Metric	Mean Spectral Coherence Sensitivity
SEM	−2.3691
POLQA	1.1638
PESQ	1.2690
STOI	0.4953
NISQA	1.1074
UTMOS	1.1473

Table 5. Mean and standard deviation values (listed as mean, SD) of six objective perceptual metrics for four azimuth and null angle pairs for the steerable-beam first-order DMA.

Desired Speaker’s Azimuth	Null Angle	SEM	PESQ	POLQA	STOI	NISQA	UTMOS
0°	90°	1.003, 0.002	4.210, 0.420	4.227, 0.808	0.415, 0.042	4.064, 0.931	4.243, 0.755
45°	135°	1.277, 0.169	3.504, 0.607	2.743, 1.123	0.345, 0.056	2.733, 1.121	2.963, 0.935
270°	180°	1.003, 0.002	4.225, 0.408	4.176, 0.899	0.419, 0.047	4.167, 0.934	4.223, 0.722
315°	225°	1.296, 0.172	3.376, 0.706	2.547, 1.190	0.340, 0.057	2.457, 1.216	2.922, 1.065

Table 6. Statistical significance: F-statistics and p-values for all objective perceptual metrics for the steerable-beam first-order DMA.

Metric	F-Statistic	p-Value
SEM	185	6.1 × 10⁻⁷⁵
PESQ	67.6	2.59 × 10⁻³⁵
POLQA	78.8	5.68 × 10⁻⁴⁰
STOI	71.4	4.76 × 10⁻³⁷
NISQA	70.2	2.01 × 10⁻³⁶
UTMOS	71.7	4.69 × 10⁻³⁷

Table 7. Five-fold cross-validation (SEM vs. other objective metrics) for the steerable-beam first-order DMA.

Objective Metric	Mean Pearson Correlation Coefficient (r ± SD)	95% Confidence Interval (CI) (Pearson)	Mean Spearman Correlation (ρ ± SD)	95% Confidence Interval (CI) (Spearman)
PESQ	−0.536 ± 0.237	[−0.91, −0.16]	−0.521 ± 0.240	[−0.90, −0.14]
POLQA	−0.522 ± 0.214	[−0.86, −0.18]	−0.511 ± 0.218	[−0.86, −0.16]
STOI	−0.593 ± 0.169	[−0.86, −0.32]	−0.580 ± 0.172	[−0.85, −0.31]
NISQA	−0.510 ± 0.176	[−0.79, −0.23]	−0.496 ± 0.183	[−0.79, −0.20]
UTMOS	−0.597 ± 0.163	[−0.86, −0.34]	−0.585 ± 0.165	[−0.85, −0.32]

Table 8. Mean of spectral coherence sensitivity for objective perceptual metrics for the steerable-beam first-order DMA.

Metric	Mean Spectral Coherence Sensitivity
SEM	−2.3897
POLQA	1.3460
PESQ	1.2693
STOI	0.5742
NISQA	1.1547
UTMOS	1.2198

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sarafnia, A.; Ahmad, M.O.; Swamy, M.N.S. A Spectral Entropy-Based Metric for Evaluating Speech Perceptual Quality with Emphasis on Spectral Coherence. Signals 2026, 7, 27. https://doi.org/10.3390/signals7020027
