Objective measurements of the real world require appropriate technology and, at the same time, promote and accelerate development, the sharing of new principles, testing, and technology improvement. They therefore fit into the circle of innovation in technology as defined by Phillips [1
]. This circle brings about new technology and, at the same time, new ways of using technology and services, and the feedback between the two reinforces innovation. A practical example of this paradigm is the development of objective speech intelligibility measures for architectural acoustics, public address system design, occupant warning systems, and sound systems for emergency purposes. The speech transmission index (STI) and its newer derivative measures, such as rapid STI (RASTI), STIPA for public address, ETSI for telephone networks [2
], or STITEL for telecommunication systems [3
], along with measuring devices, allow for validating compliance with standardized audibility and intelligibility requirements. All the measures mentioned above are intended for objective measurement of speech intelligibility (SI), carried out as an automated process that does not require the laborious and costly procedure of subjective listening tests. However, it should be remembered that the best judge of speech intelligibility is the human ear. Therefore, there have been many attempts to find a correlation between objective measurement results and subjective evaluation [5
]. In general, the STI value can be determined in two ways, i.e., by the direct method based on modulated signals or by the indirect method based on the impulse response, according to the IEC 60268-16 standard [10
]. Using an STI meter, which is often a compact, hand-held device, is a very compelling way of obtaining fast and repeatable measurements of speech intelligibility. However, STI, STIPA, and related measures have a substantial drawback that disqualifies them from, e.g., evaluating the performance of algorithms designed to improve speech intelligibility in demanding conditions that require non-standard methods of acoustic treatment [11
]. This concerns especially locations such as multi-story parking lots or train stations, which are strongly affected by environmental conditions influencing the acoustics of such sites. In the works of Gomez-Agustina et al. [12
], Tronchin [13
], and Yang and Moon [14
], it was shown that air parameters, i.e., temperature or relative humidity, might significantly influence measures such as reverberation time (RT), the STI score, or even the subjective annoyance caused by environmental noise. Therefore, in some cases, a more sophisticated algorithm has to be applied to dynamically improve acoustic conditions in such spaces. An example of such an algorithm is provided in the authors’ earlier paper [11
]. It is based on slowing down the speech rate, which improves subjectively assessed speech intelligibility. Unfortunately, such an operation is a nonlinear transformation of the speech signal. Consequently, STI cannot be used to measure the effect of such an algorithm on speech intelligibility, as the standards defining the STI and STI-derived measures are designed for signals that are processed only linearly [10
]. Hence, the STI measure is not applicable to evaluating a nonlinear audio-processing algorithm. To mitigate this problem, we propose a modification of the STI measure that allows it to be used with nonlinear operations that alter speech to increase its intelligibility.
Therefore, the aim of this paper is to propose a modified STI measure and check its credibility with regard to signals altered by the environment or speech changed nonlinearly by a DSP algorithm. To achieve this goal, we introduce the notion of a broadband STI, called bSTI, derived from comparing the cumulated energy of the transmitted envelope modulation with that of the received modulation. To assess the validity of the proposed measure, we carried out a comparative analysis of ten selected impulse responses for which a baseline value of STI was known. They were measured in three types of spaces:
The repeatability of the bSTI-based measurements and Pearson’s correlation with the STI measure are further investigated.
1.1. State-of-the-Art of the Speech Intelligibility Assessment Methods
Speech is characterized by redundancy in many ways: acoustic, phonetic, and lexical. However, noise, distortions, interfering sounds, or reverberation negatively affect speech intelligibility or acoustic measures related to linguistically contrasting units [9
]. As a result, speech may be audible but not intelligible. The issues related to the influence of various factors on speech intelligibility are discussed in detail by Assmann and Summerfield [15
]. Among other observations, the authors address the amplitude-time dependence of the speech signal, which they refer to as “temporal envelope modulations”; for example, reverberation disrupts natural variations in signal amplitude by filling in sections of silence and pauses. Analysis of the modulation spectrum provides a means of assessing the influence that various disturbances have on speech intelligibility. This approach has become the basis for developing an objective measure of speech intelligibility: the speech transmission index (STI) [16
], as well as the articulation index (AI), introduced much earlier [19].
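As a rough numerical illustration of how reverberation flattens the temporal envelope, the sketch below computes a modulation transfer function from an impulse response using Schroeder's relation and maps it to a single index between 0 and 1. It is a minimal sketch under stated assumptions, not the IEC 60268-16 procedure: the impulse response is an idealized, noiseless exponential decay, and octave-band filtering, band weighting, and masking corrections are omitted. The function names `exponential_ir`, `mtf_from_impulse_response`, and `simplified_index` are hypothetical.

```python
import numpy as np

# Modulation frequencies in Hz (one-third-octave steps from 0.63 to 12.5 Hz).
MOD_FREQS = np.array([0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                      3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5])


def exponential_ir(rt60, fs, duration):
    """Idealized impulse response (assumption): amplitude decays by 60 dB in rt60 s."""
    t = np.arange(int(duration * fs)) / fs
    return np.exp(-3.0 * np.log(10.0) * t / rt60)


def mtf_from_impulse_response(h, fs, mod_freqs=MOD_FREQS):
    """Schroeder's relation: m(F) = |sum h^2(t) exp(-j 2 pi F t)| / sum h^2(t)."""
    energy = np.asarray(h, dtype=float) ** 2
    t = np.arange(energy.size) / fs
    kernel = np.exp(-2j * np.pi * np.outer(mod_freqs, t))
    return np.abs(kernel @ energy) / energy.sum()


def simplified_index(m):
    """Map modulation indices to a 0..1 index (single band, unweighted mean)."""
    m = np.clip(m, 1e-6, 1.0 - 1e-6)  # keep the logarithm finite
    snr = np.clip(10.0 * np.log10(m / (1.0 - m)), -15.0, 15.0)
    return float(np.mean((snr + 15.0) / 30.0))


fs = 8000  # sampling rate in Hz (illustrative choice)
for rt60 in (0.5, 2.0):
    h = exponential_ir(rt60, fs, duration=3.0 * rt60)
    m = mtf_from_impulse_response(h, fs)
    print(f"RT60 = {rt60} s: m(0.63 Hz) = {m[0]:.2f}, "
          f"m(12.5 Hz) = {m[-1]:.2f}, index = {simplified_index(m):.2f}")
```

In this idealized case, a longer reverberation time lowers the modulation index at every modulation frequency, and the loss grows with the modulation frequency, which is the "filling in" of pauses expressed in the modulation domain.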
It should, however, be remembered that STI, introduced originally by Houtgast and Steeneken [16
], addresses both background noise and reverberation [16
]. Overall, it is said that there are three methods underlying speech intelligibility evaluation, i.e., Speech Intelligibility Index (SII), Speech Transmission Index (STI), and Articulation Index (AI) [22
]. A thorough review of subsequent work on predicting speech intelligibility was provided by Ma et al. [3
], who pointed out some limitations with regard to the usage of such measures. They indicated that one of the most important factors impairing SI or AI is fluctuating noise, especially speech embedded in fluctuating maskers, e.g., competing talkers [3
]. The Lombard effect is another factor impeding SI or speech quality evaluation [25
]. Further limiting factors were identified in the context of hearing aids, i.e., peak-clipping and center-clipping distortions in the speech signal [26
], cochlear implants, i.e., STOI (Short-Time Objective Intelligibility) [27
], or noise suppression algorithms [28
]. Creating SI-based metrics derived from time-domain approaches such as Envelope Regression (ER) was found to be promising in acoustically degraded environments with multiple talkers and speaking styles [4
]. ER is a time-domain STI method that works as a function of the window length [4
]. Moreover, Payton and Shrestha [4
] concluded that short-term windows might be more appropriate than a long-term analysis so that distortions during gain transitions do not necessarily distort predicted intelligibility during steady-state intervals.
While discussing speech intelligibility, approaches to speech quality assessment based on perceptual principles should also be considered [29
]. Among them, Perceptual Evaluation of Speech Quality (PESQ) [30
], defined in ITU-T P.862 [32
], utilized for narrowband speech with minor impairments, should be mentioned [33
]. Another method, i.e., ITU-T P.563 [34
], allows for dealing with narrowband speech quality.
Moreover, in recent years, a new approach to speech assessment based on machine learning has emerged [35
]. Most of these approaches are designed for automatic speech quality evaluation. However, it may be assumed that the same paradigm, based on deep learning, will increasingly be used for speech intelligibility assessment in the future [35].
1.2. STI as an Objective Speech Intelligibility Measure
Before the first attempts were made to use objective measures of intelligibility, subjective methods were used, demanding the involvement of many individuals in the process and thorough statistical analysis. The choice of the test material was also a problem. One of the first attempts to formalize subjective tests of speech intelligibility was an IEEE recommendation published in 1969 [39
]. It defines three types of tests (Isopreference, Relative Preference, and Category-Judgment), describes requirements for speakers, listeners, and speech loudness, and provides a list of phonetically balanced sentences (in English) that could serve as test material. Even recent studies use this list (e.g., [40
]). However, a literature review reveals a wide variety of subjective tests, mainly in the selection of test material. Often nonsense words or syllables are used [40
]. This group also includes the so-called logatoms, i.e., short pseudo-words [43
]. In the case of sentences, the test material is usually constructed to make it difficult to predict the successive words. Such sentences are grammatically correct but do not carry meaningful, semantic information. Their length is limited to a few words. This type of list is described, for example, by Lavandier [45
] or Ozimek [46
]. Some studies are based on several tests, containing both words and sentences, to investigate speech intelligibility in diverse conditions (e.g., [47]).
Subjective tests are definitely more time-consuming than objective tests, and the results obtained are difficult to compare with one another, for example, due to the different participants. However, this last disadvantage may prove to be an advantage in some situations. The selection of an appropriate group of participants provides an opportunity to assess the intelligibility of speech, e.g., for people with hearing impairments.
Subjective tests are also capable of considering features typical for a given language. Unfortunately, some objective measures, such as the STI, provide inconsistent results in this situation. The study of Kitapci and Galbrun [47] is a good example: the results show that changes in STI values affect intelligibility differently depending on the language.
This behavior can be regarded as intentional in the design of the STI: it primarily enables evaluation of the transmission channel quality, leaving aside issues related to speakers or listeners. In contrast to the STI, the speech intelligibility index (SII) can account for listeners’ hearing impairments, e.g., by including a subject’s audiogram in the calculations. In a simplified way, it is also possible to simulate specific frequency dependencies existing in a given speech or language [49
]. The disadvantage of SII is its inability to consider, e.g., reverberation; the method focuses on assessing the effect of stationary noise on speech intelligibility. Rhebergen et al. [50
] proposed an extension of SII that takes into account the presence of nonstationary noise (called ESII), and George et al. [49
] attempted to combine the properties of SII and STI.
Binaural hearing is another factor that should be considered in this type of analysis, as it improves speech intelligibility in complex environments (e.g., reverberant conditions). Typically, single microphones with omnidirectional characteristics are used for STI measurements. As a result, the measured values may underestimate the intelligibility experienced by human listeners. However, research is underway on a version of STI that would consider the binaural properties of human hearing (e.g., [51]).
Other noteworthy modifications of the STI measure are its simplifications, of which STIPA and RASTI are two examples. STIPA (STI for public address systems) is a simplified version of the STI measure (e.g., using only two modulation frequencies per octave band instead of the original 14) designed to be used in portable speech intelligibility measurement devices [52
]. RASTI (rapid speech transmission index) is computed based on only two octave bands (500 Hz and 2 kHz), which makes the method faster to perform but sometimes too inaccurate for more complex conditions. Therefore, it was mainly employed for signals limited to the telephone acoustic frequency band [53].
1.3. Limitations of STI as a Way of Assessing Speech Intelligibility
STI is an objective measure that correlates with the degree of intelligibility of human speech at the point in space where it was measured. One of the assumptions of the measurement methodology is the use of a known excitation signal in the form of a noise signal that is the sum of narrowband noise signals, each modulated by a pair of modulating frequencies. The center frequencies of the carrier bands and the modulating frequencies are constant in time, and their values are defined in BS EN 60268-16. The algorithm for calculating the STI measure assumes that the component frequencies of the measurement signal do not change during the measurement. This assumption makes the STI measure unsuitable for assessing the effect on speech intelligibility of nonlinear transformations applied in the audio path. Examples of such transformations are slowing down speech or compressing the dynamics of a speech signal. These transformations can shift the modulation frequencies present in the STIPA signal. Such detuning causes the algorithm that computes the STI to report a greater modulation loss than actually occurs, because the energy of the modulation components is shifted from the frequencies taken into account by the algorithm to frequency ranges that it ignores. The STI measure enables the analysis of:
The impact of linear transformations on the intelligibility of a speech signal (e.g., filtration, adaptive filtering, signal amplification),
The effect of the appearance of additive disturbances (the disturbances being signals containing modulation signals, e.g., a real speech signal, are excluded).
For this reason, it is not possible to study the effect of slowing down speech on its intelligibility using the STI signal, because slowing down is a nonlinear transformation that changes the frequency structure of the signal, e.g., by changing the frequency of the carrier signals in the excitation signal. Therefore, to obtain a measure for objectively analyzing the effect of slowing down the speech signal on speech intelligibility, the methodology for computing the STI has to be modified so that the nature of the slowdown transformation is taken into account in the analysis.
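The detuning effect described above can be sketched with a toy experiment. The example below is a minimal sketch under simplifying assumptions, not the STI algorithm: it models only the intensity envelope of a single band with one modulation frequency, and it approximates the slowdown by stretching that envelope with linear interpolation. The names `modulation_depth` and `slow_down`, the 4 Hz modulation frequency, the modulation depth of 0.8, and the slowdown factor of 1.25 are illustrative choices.

```python
import numpy as np


def modulation_depth(intensity, fs, f_mod):
    """Modulation depth of an intensity envelope at one probe frequency f_mod."""
    t = np.arange(intensity.size) / fs
    component = np.sum(intensity * np.exp(-2j * np.pi * f_mod * t))
    return 2.0 * np.abs(component) / np.sum(intensity)


def slow_down(envelope, factor):
    """Stretch an envelope in time by `factor` (> 1 slows it down)."""
    n_out = int(round(envelope.size * factor))
    source_positions = np.arange(n_out) / factor
    return np.interp(source_positions, np.arange(envelope.size), envelope)


fs = 1000        # envelope sampling rate in Hz (assumed)
f_mod = 4.0      # nominal modulation frequency in Hz (assumed)
depth = 0.8      # transmitted modulation depth (assumed)
factor = 1.25    # slowdown factor (assumed)

t = np.arange(10 * fs) / fs
sent = 1.0 + depth * np.cos(2.0 * np.pi * f_mod * t)

# Analyse the same 10 s window before and after the slowdown.
received = slow_down(sent, factor)[: sent.size]

m_sent = modulation_depth(sent, fs, f_mod)
m_nominal = modulation_depth(received, fs, f_mod)           # probe at 4 Hz
m_shifted = modulation_depth(received, fs, f_mod / factor)  # probe at 3.2 Hz

print(f"sent, probed at {f_mod} Hz:          {m_sent:.3f}")
print(f"slowed, probed at {f_mod} Hz:        {m_nominal:.3f}")
print(f"slowed, probed at {f_mod / factor} Hz: {m_shifted:.3f}")
```

With these settings, the stretched envelope carries its modulation at 3.2 Hz, so an analysis that probes only the nominal 4 Hz reports almost no modulation even though the modulation depth itself is unchanged.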
Possible extensions could apply such an approach to TIPA, STIPA, and other MTF-based methods if they fail when dealing with nonlinear enhancement of speech intelligibility.