A Speech Preprocessing Method Based on Perceptually Optimized Envelope Processing to Increase Intelligibility in Reverberant Environments

Abstract: Speech intelligibility in public places can be degraded by environmental noise and reverberation. In this study, a new near-end listening enhancement (NELE) approach is proposed in which a time-varying filter jointly enhances the onsets and reduces the overlap masking. The optimization requires some look-ahead in the clean speech and prior knowledge of the room impulse response (RIR). By optimizing a defined cost function, the spectro-temporal envelope of the reverberant speech is made as close as possible to that of the clean speech. In this cost function, speech onsets are given an increased weight. This approach differs from the overlap-masking ratio (OMR) and onset enhancement (OE) approaches (Grosse, van de Par, 2017, J. Audio Eng. Soc., Vol. 65 (1/2), pp. 31–41), which consider only previous frames in each time slot when determining the time-variant filtering. The speech reception threshold (SRT) measurements show that the new optimization framework enhances speech intelligibility by up to 2 dB more than OE.


Introduction
In conventional speech enhancement methods, the speech signal is recovered from a mixture of reverberation and noise. This type of processing can be used at the receiver side, for example in hearing aids. In public places such as airports and train stations, speech intelligibility is degraded by reverberation and noise. Because one or multiple independent loudspeakers are used and no further processing is available at the listener side, speech modification is only possible at the source side, on the clean speech before playback. In the literature, this type of clean speech modification is called near-end listening enhancement (NELE) [1] and is typically evaluated under an equal-level constraint. The modified signal must be more intelligible in the presence of reverberation and noise and must also be robust to different listener positions across a wide area. NELE algorithms can be divided into three categories: rule-based, noise-dependent, and reverberation-dependent.
In the rule-based approaches, knowledge about psychoacoustics and speech perception is used to produce more intelligible speech, preferably with few audible processing artifacts. However, these methods do not optimize a specific criterion, and for this reason the modification is only sub-optimal in terms of speech intelligibility [2]. Generally, in NELE algorithms, the preprocessing of clean speech is performed on a time-frequency signal representation. A well-known rule-based approach is the Spectral-Shaping Dynamic Range Compression (SSDRC) method [2,3], in which perceptually important acoustic cues are enhanced. In the time domain, a non-linear Dynamic Range Compression (DRC) amplifies lower-energy parts of speech, such as consonants, which are known to be more susceptible to reverberation and noise [4]. In addition, in the frequency domain, intelligibility is improved by spectral-tilt flattening and formant shifting. Based on SSDRC, another successful method named Automatic Sound Engineer (ASE) was proposed, in which equalization and broadband compression are used to maximize speech intelligibility while maintaining good sound quality [5].
In the second type of NELE methods, only the presence of noise is taken into account in the development of the enhancement algorithm. Speech modifications in these methods are usually based on an objective speech intelligibility measure, e.g., the speech intelligibility index (SII) [5] or the glimpse proportion (GP) [6], which is used as an optimization target. Similar to SSDRC, most of the successful noise-dependent algorithms use DRC in the time domain to enhance consonant intelligibility. In adaptDRC [6,7], spectral modification and dynamic range compression are performed to improve the SII in the presence of additive noise. Although the impact of reverberation is not explicitly considered in this enhancement procedure, a considerable intelligibility improvement has been reported in the presence of reverberation [8]. Another noise-dependent NELE approach is based on improving the short-time objective intelligibility (STOI) score [9]. In some noise-dependent methods, deep neural networks (DNNs) are used to modify the speech energy. In a method called iMetricGAN [10], the enhancement is performed by repeatedly predicting the intelligibility score of the modified speech and producing scale factors that are multiplied with the unmodified spectrogram. The intelligibility-improving signal processing approach (IISPA) [11] is another DNN-based method that uses an automatic-speech-recognition-based model of speech perception to optimize parameters such as band-pass edge frequencies, spectral slope and curvature, and spectral modulation compression or expansion. Note that in these noise-dependent methods, the quality of speech can degrade strongly, specifically in the presence of non-stationary noise.
In the third category of NELE methods, the reverberation-dependent methods, room impulse response (RIR) data are explicitly considered in the modification procedure [12–15]. Grosse and van de Par (2017) [16] proposed two methods, Onset Enhancement (OE) and Overlap Masking Ratio (OMR), which were inspired by previous studies [17,18]. In these two approaches, given access to an RIR, time-varying gains are calculated for each frame based on the energy of the current frame and that of the previous frames of speech.
In the current study, a reverberation-dependent approach is developed that optimizes the spectro-temporal envelope of reverberated speech by enhancing onsets and by reducing the amount of overlap masking. In contrast to the OMR and OE methods proposed in [16], future frames are also considered when determining the weight of the current frame. Taking into account how the reverberant extension of the current frame overlaps with the upcoming frames, an explicit cost function is defined and optimized such that the spectro-temporal envelope of the reverberated speech is as close as possible to that of the clean speech signal.
The paper is organized as follows. Section 2 provides the structure of the proposed NELE algorithm. In this section, the cost function definition and the optimization procedure are described. Section 3 presents the results of simulations and measurements. Finally, Section 4 concludes the article with the discussion.

Proposed NELE Method
The main objective of the proposed algorithm is to apply time-varying filtering to a clean speech signal such that, after reproduction in a reverberant environment, the temporal envelope in each frequency band is as similar as possible to that of the clean speech signal. For this purpose, a cost function is defined and optimized. In the cost function, onsets are weighted more strongly to ensure their accurate reproduction. The time-variant filtering is updated on a frame-by-frame basis. For this reason, the signal analysis, the modelling of the effect of reverberation, and the cost function optimization are done on a frame-by-frame basis. The block diagram of the proposed approach is depicted in Figure 1. It consists of signal windowing and convolution with the RIR, preprocessing using the FFT, onset detection, cost function definition, an optimization unit, and finally a signal modification unit that includes Overlap-Add (OLA) and summation over frequency bands. After windowing and convolution with the RIR, the speech signal is separated into one-third octave bands in the FFT processing unit. To enhance the signal, a frame-based time-varying filter is designed to improve an intelligibility criterion. For each one-third octave band, a cost function is defined and then optimized to obtain the filter weights. These time-varying weights are used to synthesize a new speech signal. The parts of this block diagram are explained in detail in the following.

Figure 1.
Block diagram of the proposed NELE approach. The preprocessing includes signal windowing, convolution of the short frames with the RIR to construct extended frames, FFT processing and bin separation to divide the speech into one-third octave bands, and onset detection. In the cost function block, the extended frames and the output of the onset detection unit are used to construct a cost function for each short frame. The constructed cost functions are iteratively optimized using the output of the onset detection unit to compute the gains of the short frames. Finally, the enhanced signal is reconstructed from the obtained weights by OLA and summation over the one-third octave bands.

Preprocessing
As a first step, the time-domain signal is transformed to the frequency domain on a frame-by-frame basis. The main goal of this preprocessing unit is the separation of the speech signal into one-third octave bands. This frame-wise frequency analysis is needed for the optimization and also for onset detection. First, the clean speech $x(t)$ is framed using $\tau = 30$ ms Hann windows with 50% overlap to construct $N$ overlapping frames $x_n(t)$:
$$x(t) = \sum_{n=1}^{N} x_n(t), \tag{1}$$
in which $N$ is determined according to the length of the signal. Then, each frame is convolved with the RIR $h(t)$:
$$y_n(t) = x_n(t) * h(t). \tag{2}$$
Here, $y_n(t)$ is called "extended frame" number $n$. An extended frame can mask the upcoming frames, depending on the reverberation time $T_{60}$. Each extended frame is analyzed using a Fast Fourier Transform (FFT); subsequently, the frequency bins are separated according to the one-third octave bands and finally synthesized using the inverse FFT (IFFT):
$$y_{n,f}(t) = \mathrm{IFFT}\left\{ G_f(\omega)\, \mathrm{FFT}\{ y_n(t) \} \right\}, \tag{3}$$
where $G_f(\omega)$ is one for the frequency bins belonging to one-third octave band number $f$ and zero otherwise, and $y_{n,f}(t)$ is the synthesized signal in that band. The signal $y_{n,f}(t)$ can be considered as the convolution of a short frame $x_{n,f}(t)$ with the RIR, filtered by one-third octave band number $f$. The signal $x_{n,f}(t)$ is named a "short frame". The signal $y_{n,f}(t)$ can also be described by smaller frames called "sub-frames", such that each sub-frame has the same length as a short frame and is temporally aligned with one. The length of an extended frame is $M$ times the length of a short frame:
$$M = \left\lceil \frac{l_w + l_h - 1}{l_w} \right\rceil, \tag{4}$$
where $l_w$ and $l_h$ are the lengths of a short frame and the RIR, respectively. A sub-frame $y_{n,m,f}(t)$ is defined as frame number $m$ of an extended frame $y_{n,f}(t)$:
$$y_{n,m,f}(t) = \begin{cases} y_{n,f}(t), & (m-1)\, l_w \le t < m\, l_w \\ 0, & \text{otherwise} \end{cases} \qquad m = 1, \dots, M, \tag{5}$$
where $t$ is measured from the start of extended frame $n$. Splitting the extended frame into its sub-frames allows one to calculate the total signal power that can be observed in the reverberant environment within a particular time frame and frequency band.
Since in this approach a time-variant filtering is applied to the input signal $x$ on a frame-by-frame basis, the effect of the filtering can be evaluated from weighted summations of the sub-frames $y_{n,m,f}(t)$ of all extended frames that contribute to a short frame interval. The resulting summation is considered in a cost function. The weight of the short frame $x_{n,f}(t)$, and consequently of the extended frame $y_{n,f}(t)$, within a one-third octave band is denoted by $\alpha_{n,f}$. The short frames and their respective weights, an extended frame, and a sub-frame are schematically illustrated in Figure 2.

Figure 2.
Short frames and their respective weights. The extended frame $y_{n,f}$ and one of its sub-frames $y_{n,m,f}$ are also shown.
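The preprocessing chain described above (framing, convolution with the RIR, band separation, and splitting into sub-frames) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a symmetric Hann window from `np.hanning` is used, the band is selected by zeroing FFT bins outside hypothetical edge frequencies `f_lo` and `f_hi`, and the last sub-frame is zero-padded.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into Hann-windowed frames x_n (signal is zero-padded at the end)."""
    window = np.hanning(frame_len)
    n_frames = 1 + int(np.ceil(max(len(x) - frame_len, 0) / hop))
    padded = np.zeros((n_frames - 1) * hop + frame_len)
    padded[:len(x)] = x
    return np.stack([padded[n * hop:n * hop + frame_len] * window
                     for n in range(n_frames)])

def extended_frames(frames, rir):
    """Convolve every short frame with the RIR: y_n = x_n * h."""
    return np.stack([np.convolve(fr, rir) for fr in frames])

def band_limit(y, fs, f_lo, f_hi):
    """Keep only the FFT bins inside [f_lo, f_hi) and return to the time domain."""
    spec = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    spec[(freqs < f_lo) | (freqs >= f_hi)] = 0.0
    return np.fft.irfft(spec, n=len(y))

def sub_frames(y_ext, frame_len):
    """Cut one extended frame into M sub-frames of short-frame length."""
    m = int(np.ceil(len(y_ext) / frame_len))
    padded = np.zeros(m * frame_len)
    padded[:len(y_ext)] = y_ext
    return padded.reshape(m, frame_len)
```

With a 16 kHz sampling rate, 30 ms frames correspond to `frame_len=480` and `hop=240`; for a 1000-sample RIR each extended frame then has 1479 samples and is split into four sub-frames.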

Construction of a Cost Function for a Frame
According to Figure 2 and Equation (4), the first sub-frame of $y_{k,f}(t)$, denoted by $y_{k,1,f}(t)$, overlaps with the previous $M$ extended frames. All of these $M$ extended frames have sub-frames that overlap with $y_{k,1,f}(t)$. A "summed-reverbed-short frame" number $k$, which includes all sub-frames within short window number $k$, can be constructed by summing the current sub-frame $y_{k,1,f}(t)$ and the sub-frames of the previous extended frames that overlap in frame $k$:
$$s_{k,f}(t) = \sum_{(n,m) \in \mathcal{O}_k} y_{n,m,f}(t), \tag{6}$$
where $\mathcal{O}_k$ denotes the set of index pairs $(n, m)$ whose sub-frame $y_{n,m,f}(t)$ falls within short window number $k$, including the pair $(k, 1)$. For the speech enhancement, a time-varying weight $\alpha_{n,f}$ (see Figure 2) is multiplied with each short frame and consequently with its convolution with the RIR, which constructs the extended frame $y_{n,f}(t)$. This weight is therefore applied to all sub-frames $y_{n,m,f}(t)$ that construct $y_{n,f}(t)$. With these weights, summed-reverbed-short frame number $k$ becomes a "weighted-summed-reverbed-short frame":
$$ws_{k,f}(t) = \sum_{(n,m) \in \mathcal{O}_k} \alpha_{n,f}\, y_{n,m,f}(t). \tag{7}$$
In Equation (7), the weights are used to reduce the amount of overlap masking in a reverberant condition. Following the STOI [19], the temporal envelope of this weighted signal is considered for defining a cost function. A time-frequency unit norm (TF-unit) of the weighted-summed-reverbed-short frame in Equation (7) is calculated by non-coherent summation of the power spectral density values of its discrete Fourier transform (DFT) within a one-third octave band:
$$WS_{k,f} = \sqrt{\sum_{\omega \in \Omega_f} \left| \widehat{ws}_{k,f}(\omega) \right|^2}, \tag{8}$$
where $\widehat{ws}_{k,f}(\omega)$ is the DFT of $ws_{k,f}(t)$ and $\Omega_f$ is the set of DFT bins in one-third octave band $f$. The target signal is obtained by processing the clean speech similarly to the calculation done for the extended frames in Equation (8). However, only the direct part of the RIR, $h_d(t)$, is now used for the computation of the target signal:
$$e_{k,f}(t) = x_{k,f}(t) * h_d(t). \tag{9}$$
The TF-unit of this signal in a one-third octave band for frame $k$ is calculated similarly to $WS_{k,f}$:
$$E_{k,f} = \sqrt{\sum_{\omega \in \Omega_f} \left| \hat{e}_{k,f}(\omega) \right|^2}, \tag{10}$$
where $\hat{e}_{k,f}(\omega)$ is the DFT of $e_{k,f}(t)$. The cost function for frame $k$ is now defined as the squared error between the TF-unit of the target ($E_{k,f}$) and the TF-unit of the weighted frames ($WS_{k,f}$):
$$CF_k(f) = \left( E_{k,f} - WS_{k,f} \right)^2. \tag{11}$$
This definition of the cost function is comparable with the criterion used in the STOI, which uses the correlation between the temporal envelopes of the clean and degraded speech as an intelligibility score. The temporal envelope in STOI is a vector of TF-units covering a 384 ms time interval. Because a smaller squared error between normalized envelopes corresponds to a higher correlation, minimizing $CF_k(f)$ for consecutive frames is expected to increase the correlation and subsequently the STOI score. In addition, defining the cost function in the form of Equation (11) makes the optimization easier to handle.
An additional factor considered in the cost function is the importance of a frame, denoted by $\beta_k$. The importance of a frame is determined by an onset detector. If a frame is detected to be an onset, $CF_k(f)$ is multiplied by a higher $\beta_k$. The role of this additional weight is clarified in the following section, in which the optimization procedure is explained.
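The per-frame cost described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `tf_unit` and `frame_cost` are hypothetical, the TF-unit is taken as the square root of the summed squared DFT magnitudes in the band (an assumption in the spirit of STOI), and the sub-frames that overlap frame k are assumed to be already collected as rows of one array.

```python
import numpy as np

def tf_unit(frame, band_bins, n_fft=512):
    """TF-unit norm: root of the summed power of the DFT bins in one band."""
    spec = np.fft.rfft(frame, n=n_fft)
    return np.sqrt(np.sum(np.abs(spec[band_bins]) ** 2))

def frame_cost(alphas, overlapping_sub_frames, target_frame, band_bins, beta=1.0):
    """Onset-weighted squared error between target and weighted reverberant TF-units.

    alphas: gains of the frames whose sub-frames fall into frame k
    overlapping_sub_frames: array (n_contrib, frame_len), one row per contributor
    target_frame: clean frame convolved with the direct part of the RIR
    """
    # Weighted sum of all sub-frames that are audible in frame k.
    weighted_sum = np.tensordot(alphas, overlapping_sub_frames, axes=1)
    ws = tf_unit(weighted_sum, band_bins)
    e = tf_unit(target_frame, band_bins)
    return beta * (e - ws) ** 2
```

If the only contributor is the target itself with unit gain, the cost is zero; any reverberant tail from earlier frames that is left unattenuated raises it, which is what the optimization counteracts by lowering the corresponding gains.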

Optimization Unit
The summed cost function (SCF) for frame number $k$, $SCF_k(f)$, is defined as the weighted sum of the cost function of that frame, $CF_k(f)$, and those of the upcoming frames, $CF_i(f)$:
$$SCF_k(f) = \sum_{i=k}^{k+P-1} \beta_i\, CF_i(f). \tag{12}$$
Each one-third octave band $f$ is optimized independently. In Equation (12), the number of frames influenced by the gain of a frame is denoted by $P$ instead of $M$. According to Equation (4) and considering the $T_{60}$, $M$ short frames are overlapped by an extended frame. However, to reduce the computational load, a smaller number of frames may be considered, because it is reasonable to neglect the effect of the reverberation after $P \le M$ frames, which together are shorter than $T_{60}$. Therefore, this implementation does not require the full-length signal to be available when optimizing a given frame; only a limited look-ahead into the future is needed. The value of $P$ is set empirically as the point at which the energy decay curve (EDC) of the RIR has dropped by 25 dB.
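The choice of P from the energy decay curve can be illustrated with a short sketch. It assumes that the EDC is obtained by Schroeder backward integration of the squared RIR and that P is counted in steps of a hypothetical parameter `hop` (in samples); neither detail is specified in the text, and the function name is illustrative.

```python
import numpy as np

def look_ahead_frames(rir, hop, drop_db=25.0):
    """Number of frames P until the energy decay curve has dropped by drop_db."""
    # Schroeder backward integration of the squared impulse response.
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy[0])
    below = np.nonzero(edc_db <= -drop_db)[0]
    # Fall back to the full RIR length if the curve never drops that far.
    n_samples = below[0] if below.size else len(rir)
    return max(1, int(np.ceil(n_samples / hop)))
```

For a longer reverberation time the decay curve reaches the 25 dB point later, so more look-ahead frames are returned.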
For the onset detection, an energy-based method described in [20,21] is used. Here, a parameter called the high frequency content (HFC) is constructed from a weighted sum of the spectral powers of each frame. A detection function (DF), which is the ratio of the HFC of two consecutive frames, is calculated. After obtaining the DF for the full signal, its values are normalized to its maximum across frames. These normalized values $0 \le \beta_i \le 1$, which are the outputs of the onset detection unit, are used for weighting the cost functions in Equation (12). If frame number $k$ is detected to be an onset, a higher $\beta_k$ is assigned to its cost function.
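A minimal sketch of such an HFC-based detection function is given below. The bin-index weighting of the spectral power, the relative floor `rel_floor` that guards against division by near-silent frames, and the `scale` factor of the mean-based threshold are assumptions made for illustration; the exact weighting and threshold rule of [20,21] may differ.

```python
import numpy as np

def onset_weights(frames, n_fft=512, rel_floor=1e-3):
    """Return one weight beta in [0, 1] per frame from the HFC ratio."""
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    hfc = power @ np.arange(power.shape[1])       # bin-weighted spectral power
    floor = max(rel_floor * hfc.max(), np.finfo(float).tiny)
    df = np.zeros(len(hfc))
    df[1:] = hfc[1:] / (hfc[:-1] + floor)         # ratio of consecutive frames
    peak = df.max()
    return df / peak if peak > 0 else df          # normalize to the maximum

def is_onset(beta, scale=1.0):
    """Flag frames whose weight exceeds a threshold tied to the mean weight."""
    return beta > scale * beta.mean()
```

A frame that follows a quiet frame and carries much more high-frequency energy receives a weight close to one and is flagged as an onset.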
To prevent strong fluctuations of the weights in the optimization, lower and upper bounds are set for each weight $\alpha_{i,f}$. The lower bound, i.e., the minimum gain, is set to −40 dB. The output of the onset detection unit is also used in the optimization routine. A frame is recognized as an onset if its $\beta_i$ is above a threshold, which is computed adaptively from the average value of $\beta_i$ across all frames. If a frame is detected to be an onset, the maximum possible gain (upper bound) is set to 20 dB; otherwise, a maximum of 0 dB is allowed. In the optimization procedure, only positive weights are accepted because the quantity being controlled is the energy that leaks from each frame into the upcoming frames. Since the optimization targets energy, negative weights would be meaningless. The constrained optimization problem is summarized as follows:
$$\min_{\alpha_{k,f}, \dots, \alpha_{k+P-1,f}} SCF_k(f) \quad \text{subject to} \quad \alpha_{\min} \le \alpha_{i,f} \le \alpha_{\max,i}, \quad i = k, \dots, k+P-1, \tag{13}$$
where $\alpha_{\min}$ corresponds to −40 dB and $\alpha_{\max,i}$ corresponds to 20 dB if frame $i$ is an onset and to 0 dB otherwise.
To find the minimum of the constrained non-linear multivariable function in Equation (13), a state-of-the-art method called Sequential Quadratic Programming (SQP) [22,23] is used. In this numerical optimization method, the Hessian of the Lagrangian function is estimated using a quasi-Newton updating method. The algorithm runs independently for each frequency band and starts by calculating the coefficient of the first frame, $\alpha_{1,f}$. The effect of the first frame is considered for up to $P$ frames. Therefore, $SCF_1(f)$ is a function of $\alpha_{1,f}, \alpha_{2,f}, \dots, \alpha_{P,f}$. All of these $P$ weights are determined in the first optimization routine. However, only $\alpha_{1,f}$ is accepted as the final value, because the other weights also affect the upcoming SCFs and are re-optimized there. In the next optimization, $\alpha_{2,f}$ is determined and fixed, and so on. Although many weights related to the current and future frames are computed in every optimization cycle, only the weight obtained for the current frame is fixed; the other weights are updated in the next optimization cycle. After a weight has been determined in an optimization, its value is substituted in all CFs. The routine for the determination of the weights and the updating of the CFs is explained in Algorithm 1.
In the Onset Enhancement (OE) method [16], the time-varying weight of a frame is determined according to the power spectrum of that frame and those of the previous frames, which have an influence because their reverberation tails overlap with the current frame. In the proposed approach, the weight of the current frame is determined by optimizing a cost function based on the power spectrum of the current frame and those of the future frames. Considering the future frames in the cost function optimization implies that the effect of the filtering of the current frame on future frames is also taken into account; this is not the case in the OE method. Note that the weights of the previous frames have already been fixed, but their effect is still considered in Equation (12).
Algorithm 1. Determination of weights $\alpha_{1,f}, \alpha_{2,f}, \dots, \alpha_{k,f}, \dots, \alpha_{N,f}$ in a one-third octave band ($f$).
Step 1: $k = 1$.
Step 2: The weight of frame number $k$ is determined. All upcoming CFs that are influenced by $\alpha_{k,f}$ construct $SCF_k(f)$ according to Equation (12). For the construction of $SCF_k(f)$, the previously determined weights in the CFs are used, and the weights $\alpha_{k,f}, \dots, \alpha_{k+P-1,f}$ are optimized. Only $\alpha_{k,f}$ is accepted as the final value.
Step 3: All CFs that are influenced by $\alpha_{k,f}$ are updated with its numerical value.
Step 4: If $k < N$: set $k = k + 1$ and go to Step 2; else finish.
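The sequential procedure of Algorithm 1 can be sketched with SciPy's SLSQP solver, which is one available SQP implementation and is not necessarily the solver used by the authors. Several simplifications are assumptions made for this sketch: the band spectra of the sub-frames are precomputed in a hypothetical array `Y`, sub-frame `m` of extended frame `n` is assumed to fall into frame `n + m` (zero-based), gains are linear amplitudes, and results of the look-ahead window are reused as a warm start.

```python
import numpy as np
from scipy.optimize import minimize

def sequential_gains(Y, E, beta, P, lower, upper):
    """Fix one gain per step while re-optimizing a look-ahead window of P gains.

    Y: complex array (N, M, B), band spectrum of sub-frame m of extended frame n
    E: target TF-units, shape (N,); beta: onset weights, shape (N,)
    lower: scalar lower bound; upper: per-frame upper bounds, shape (N,)
    """
    N, M, _ = Y.shape
    alpha = np.ones(N)

    def tf_unit(k, gains):
        # Sum the sub-frames of all extended frames that overlap frame k.
        total = sum(gains[k - m] * Y[k - m, m] for m in range(min(M, k + 1)))
        return np.linalg.norm(total)

    def summed_cost(free, k, last):
        gains = alpha.copy()
        gains[k:last] = free                     # earlier gains stay fixed
        return sum(beta[i] * (E[i] - tf_unit(i, gains)) ** 2
                   for i in range(k, last))

    for k in range(N):
        last = min(k + P, N)
        bounds = [(lower, upper[i]) for i in range(k, last)]
        res = minimize(summed_cost, alpha[k:last].copy(), args=(k, last),
                       method="SLSQP", bounds=bounds)
        # Only alpha[k] is final; the later entries serve as a warm start
        # and are re-optimized in the following cycles.
        alpha[k:last] = res.x
    return alpha
```

In linear amplitude, the bounds named in the text would correspond to `lower = 0.01` (−40 dB) and `upper[i] = 10.0` for onset frames (20 dB) or `1.0` otherwise (0 dB), assuming the gains act on amplitude rather than power.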

Stimuli
The Oldenburg Sentence Test (OLSA) [24] corpus with the male speaker is used as the speech material to evaluate the proposed algorithm. The OLSA corpus consists of 120 German sentences for each speaker. A sentence consists of a name, verb, number, adjective, and noun, and there are 10 alternatives for each word. The corpus is downsampled to 16 kHz. Speech-shaped noise (SSN) and pink noise (PN) are used as the interferers, which are convolved with a binaural room impulse response (BRIR) and presented at an average level of 65 dB SPL for the left and right ears. The SSN was generated by summing all 120 OLSA sentences, followed by phase randomization, which creates noise with a long-term spectrum similar to that of the speech corpus. In addition to SSN, pink noise is used, which has an energy distribution similar to environmental noise. It covers the frequency range from 100 Hz to 8 kHz, approximately corresponding to the spectral range of the speech material.
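The SSN construction described above (summation followed by phase randomization) can be sketched as follows. The function name is hypothetical, shorter sentences are assumed to be zero-padded to the longest one, and the DC and Nyquist bins are kept real so that the inverse transform preserves the magnitude spectrum; the corpus processing actually used may differ in such details.

```python
import numpy as np

def speech_shaped_noise(sentences, rng=None):
    """Sum zero-padded sentences and randomize the phase of the sum."""
    rng = np.random.default_rng() if rng is None else rng
    length = max(len(s) for s in sentences)
    total = np.zeros(length)
    for s in sentences:
        total[:len(s)] += s
    spectrum = np.fft.rfft(total)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    randomized = np.abs(spectrum) * np.exp(1j * phase)
    randomized[0] = np.abs(spectrum[0])          # keep the DC bin real
    if length % 2 == 0:
        randomized[-1] = np.abs(spectrum[-1])    # keep the Nyquist bin real
    return np.fft.irfft(randomized, n=length)
```

The output has the same magnitude spectrum as the summed sentences, so its long-term spectrum matches the speech material while the temporal structure is destroyed.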

Binaural Room Impulse Responses
In this section, the BRIRs used in this paper are described. For the optimization and subjective evaluation, four rooms were used, and from each of these rooms three recorded BRIRs with the same receiver position were selected. These rooms are described in Table 1 and differ in terms of geometrical dimensions and reverberation time $T_{60}$. The main BRIR in Table 1 is convolved with a speech source, and the resulting left-ear signal is used in the optimization to find the weights; it is also used for rendering and evaluating the preprocessed speech signal. The second BRIR in this table is used to evaluate the robustness of the algorithm for a listener position that differs from the one for which the weights were obtained with the main BRIR. Thus, in this scenario the preprocessed speech is rendered with a different IR than the one used in the optimization. Finally, the third BRIR is convolved with a noise signal to create binaural noise for both the main and the robustness evaluation scenarios. The first room (R1), with a relatively short $T_{60}$ of 0.6 s, is selected from a set of BRIR measurements made at our university in Oldenburg. The second room (R2) is a music hall selected from the BRAS database [25], with a $T_{60}$ of 1.1 s. The BRIRs recorded in R2 have a relatively long distance between source and receiver, and therefore the direct-to-diffuse ratio is low. The third room (R3), again selected from the BRAS database, is a seminar room with a $T_{60}$ of 1.5 s. This room is critical in terms of $T_{60}$ and speech intelligibility because its long reverberant tail causes considerable temporal smearing of the source speech signal. The fourth room (R4) is a church selected from the AIR database [24]. Its $T_{60}$ is very long (about 5 s) because of the large dimensions of the room and the low degree of damping. Because of the relatively small source-receiver distance (3 m), however, the selected BRIRs in the church have a high direct-to-diffuse ratio.
The distance from the speech and noise sources to the listener positions of the main and robustness evaluation scenarios is held almost constant within a room to avoid differences in the direct-to-diffuse ratio. All room-acoustical scenarios, reverberation times, selected BRIRs, and the lengths of P frames are listed in Table 1. P is the number of future frames used to construct the summed cost function (SCF) in Equation (12) for the optimization. As previously explained, the length of P frames is determined according to the 25 dB drop of the EDC. For a larger $T_{60}$, more future frames are needed for the optimization.

Signal Processing Details
The corpus and noises are downsampled to 16 kHz. The length of the analysis and synthesis frames is 30 ms with 50% overlap. A square-root Hann window is used for the signal framing in both the analysis and the synthesis to avoid audible artifacts caused by cyclic convolution. To synthesize the signal, the overlap-add (OLA) method is used. The frequency resolution used for separating an extended frame into one-third octave bands in Equation (3) is limited by the length of P frames according to Table 1. Given the length of P frames and the sampling rate of 16 kHz, the FFT length is at least 4096 points for the shortest room impulse response ($T_{60}$ = 0.6 s) and 16,384 points for the longest room impulse response ($T_{60}$ = 5 s). The bins are grouped into 17 one-third octave bands. Similar to the STOI [19], the lowest center frequency is set to 150 Hz, and the highest one-third octave band has a center frequency of approximately 6 kHz. The frequency resolution used in Equation (8) for the analysis of the weighted sub-frames in Equation (5) is determined by the window length and the sampling rate and is equal to 512 bins. The RMS value of the processed signal is adjusted to that of the unprocessed signal to keep the levels equal between the output of the NELE algorithms and the unprocessed signal.
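The numbers in this section can be checked with a few lines of Python. The sketch below assumes exact base-two one-third octave spacing starting at 150 Hz and a power-of-two FFT length; the function names are illustrative only.

```python
import numpy as np

def third_octave_centers(f_first=150.0, n_bands=17):
    """Center frequencies spaced one third of an octave apart."""
    return f_first * 2.0 ** (np.arange(n_bands) / 3.0)

def next_pow2(n):
    """Smallest power of two that is not below n."""
    return 1 << (n - 1).bit_length()

def match_rms(processed, reference):
    """Scale processed so that its RMS equals that of reference."""
    rms_ref = np.sqrt(np.mean(reference ** 2))
    rms_proc = np.sqrt(np.mean(processed ** 2))
    return processed * (rms_ref / rms_proc)
```

With these assumptions, the 17th center frequency is 150 · 2^(16/3) ≈ 6048 Hz, consistent with the stated value of approximately 6 kHz, and a 30 ms frame at 16 kHz (480 samples) leads to a 512-point DFT.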

Effect of the Algorithm on Signal
The cochleagrams of a clean speech sentence from the OLSA corpus (first row) and of two preprocessed versions, one preprocessed with the OE algorithm [16] (second row) and the other with the proposed algorithm (third row), are depicted in Figure 3a. The weights are calculated for room R4. It can be seen that both OE and the proposed algorithm have a high-pass filter effect on the speech. This occurs because both enhancement algorithms reduce the amplitude of speech portions that have a high and constant energy over time or that are exposed to a longer $T_{60}$ (yellow ovals in the left panels). This attenuation is stronger for the proposed method than for OE. Because of the importance of onsets for intelligibility, and in accordance with Equation (12), onsets are weighted more strongly in the proposed method. The steady portions, on the other hand, are allowed to be attenuated more in order to minimize the defined cost function.
The signals of Figure 3a are now convolved with the left-ear channel of the BRIR in room R4, and their cochleagrams are plotted in Figure 3b. The cochleagrams of the reverberated unprocessed signal, the reverberated signal preprocessed by OE, and the reverberated signal preprocessed by the proposed algorithm are shown in the first, second, and third panels of Figure 3b, respectively. Besides the onset enhancement and high-pass filtering effect of the proposed approach, the effect of the frame attenuations can be compared with the two other reverberated signals. The effect can be seen in the longer silent gap of the clean speech around second 1 (yellow rectangles in the left panels). In the third panel, belonging to the proposed method, less energy leaks from previous frames into the silent gap than in the unprocessed and OE-preprocessed reverberated signals. A similar effect can be seen at second 1.6. This effect can potentially contribute to higher speech intelligibility due to the reduced overlap masking by preceding speech segments.

Figure 3.
(a) Cochleagrams of clean speech and two preprocessed speech signals. The first-row panel shows the unprocessed speech sentence, the second-row panel the sentence preprocessed with the OE algorithm, and the third-row panel the sentence preprocessed with the proposed algorithm, with weights obtained for room R4. The yellow ovals mark the differences between the unprocessed signal and the two enhanced signals. The preprocessing of OE and of the proposed algorithm has a high-pass filter effect on the speech. (b) Cochleagrams of the three reverberated signals in room R4. The first-row panel shows the reverberated unprocessed sentence. The second-row and third-row panels show the reverberated sentence preprocessed with the OE algorithm and the proposed algorithm, respectively. Yellow rectangles show the effect of the frame attenuations in the OE and proposed approaches. In the third panel, belonging to the proposed method, less energy leaks from previous frames into the silent gap than in the unprocessed and OE-preprocessed reverberated signals.

Objective Evaluation of the Algorithm Using Two Intelligibility Models
For the objective evaluation of the proposed algorithm, the left-ear channels of the above-mentioned BRIRs were used for the weight computations of the OE and the proposed algorithm. The BRIRs were then convolved with OLSA speech material that was either unprocessed, OE-preprocessed, or preprocessed with the proposed method. Two intelligibility models, the STOI [19] and the multi-resolution generalized power-spectrum model (mr-GPSM) [26], were used. In the STOI [19], the clipped temporal envelope of the noisy or manipulated speech is compared with that of the clean speech using a correlation over 384 ms time intervals, which gives an intermediate intelligibility measure in each one-third octave band. The average of the intermediate intelligibility scores across all time intervals and bands is the STOI intelligibility score, which lies between zero and one. In the mr-GPSM method [26], using "speech+noise" and "noise" signals, the Hilbert envelope in each auditory channel is calculated and then low-pass filtered. The signal for each auditory channel is then separated into two independent pathways, in which the outputs of the envelope power SNR model (EPSM) and the power SNR model (PSM) are calculated. The envelope power SNRs across auditory and modulation channels and the power SNRs across auditory channels are first combined, and each combination is then multiplied by its empirical correction factor. The final mr-GPSM score is the maximum of the weighted combined envelope power SNR and the weighted combined power SNR. For both intelligibility models, the intelligibility scores were averaged across 120 sentences. For the STOI, the unprocessed and preprocessed reverbed signals without additive noise are compared with clean speech. STOI was selected because its metric is very similar to the cost function defined for the optimization method.
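The envelope-correlation idea behind the STOI description above can be illustrated with a short Python sketch. This is a simplified illustration, not the reference STOI implementation: the function names, the interval length `seg_len` (in frames), the clipping bound `clip_db`, and the synthetic envelopes are all assumptions made for the example, and the band envelopes are taken as given rather than computed from audio.

```python
import math


def pearson(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0.0 else 0.0


def normalize_and_clip(clean_seg, proc_seg, clip_db=15.0):
    """Scale the processed segment to the clean segment's energy, then clip it.

    clip_db is an assumed bound that limits how far the processed envelope
    may exceed the clean one.
    """
    norm_clean = math.sqrt(sum(x * x for x in clean_seg))
    norm_proc = math.sqrt(sum(y * y for y in proc_seg))
    scale = norm_clean / norm_proc if norm_proc > 0.0 else 0.0
    bound = 1.0 + 10.0 ** (clip_db / 20.0)
    return [min(scale * y, bound * x) for x, y in zip(clean_seg, proc_seg)]


def stoi_like_score(clean_bands, proc_bands, seg_len=30):
    """Average the per-band, per-interval correlations into one score."""
    scores = []
    for clean_env, proc_env in zip(clean_bands, proc_bands):
        for start in range(len(clean_env) - seg_len + 1):
            x = clean_env[start:start + seg_len]
            y = proc_env[start:start + seg_len]
            scores.append(pearson(x, normalize_and_clip(x, y)))
    if not scores:
        raise ValueError("envelopes are shorter than one analysis interval")
    return sum(scores) / len(scores)


# Synthetic single-band envelope and a temporally smeared copy of it.
clean = [[abs(math.sin(0.3 * n)) + 0.1 for n in range(60)]]
smeared = [[0.5 * band[n] + 0.5 * band[max(n - 3, 0)] for n in range(len(band))]
           for band in clean]
score_identical = stoi_like_score(clean, clean)
score_smeared = stoi_like_score(clean, smeared)
```

In this toy setting, the smeared envelope correlates less well with the clean one than the clean envelope does with itself, which is the kind of degradation that reverberation produces and that the intermediate measure is meant to capture.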
It is expected that by minimizing the cost function in Equation (11), which is based on the squared error, the correlation-based STOI score is also improved. For the evaluation with mr-GPSM, the SSN and PN, without convolution with the BRIRs, are added to the reverbed speech materials at three SNR values of −15, −10, and 0 dB. This is done to keep the influence of the background noise identical across conditions. For each reverbed sentence, different samples of the full noise token are added. To average well across noise samples, five additional runs are performed for each set of reverbed speech materials. Figure 4 shows the STOI score of the main scenario in each room for the reverbed unprocessed signal, the reverbed OE-enhanced signal, and the reverbed signal enhanced with the proposed algorithm. The OE does not show much difference compared to the unprocessed case, but the improvement for the proposed approach is considerable. The STOI improvement for rooms R1, R2, and R3 is about 0.05, but a lower improvement is seen for room R4 with its very high reverberation time.
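The noise-mixing step used for the mr-GPSM evaluation can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the SNR is defined on mean power over the whole sentence, the noise (rather than the speech) is scaled, and a sine wave and Gaussian noise stand in for the reverbed speech and the noise token.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise_token, snr_db, rng):
    """Add a randomly positioned noise segment to the speech at a target SNR.

    Assumption: the noise segment is scaled and the speech level is kept fixed.
    Returns the mixture and the scaled noise segment.
    """
    if len(noise_token) < len(speech):
        raise ValueError("noise token is shorter than the speech signal")
    start = rng.randrange(len(noise_token) - len(speech) + 1)
    segment = noise_token[start:start + len(speech)]
    noise_power = mean_power(segment)
    if noise_power == 0.0:
        raise ValueError("selected noise segment is silent")
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled = [gain * v for v in segment]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


rng = random.Random(0)
speech = [math.sin(0.05 * n) for n in range(2000)]          # stand-in for reverbed speech
noise_token = [rng.gauss(0.0, 1.0) for _ in range(10000)]   # stand-in for SSN or PN
mixtures = {snr: mix_at_snr(speech, noise_token, snr, rng)
            for snr in (-15.0, -10.0, 0.0)}
```

Drawing a new start position for every sentence and repeating the whole run several times corresponds to the averaging across noise samples described above.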
Figure 5a shows the predictions of mr-GPSM for SSN. The SNR-mr-GPSM is the intelligibility score of this model, which is a summation of envelope-SNR and DC-SNR [26]. The speech reception threshold (SRT) is an increasing monotonic function of SNR-mr-GPSM. A comparison of the three signals in each panel shows that both OE and the proposed approach enhance the speech intelligibility. However, the improvement for the proposed algorithm is considerably larger than that for the OE method.
Specifically, for the low SNR of −15 dB, this improvement is about 7 SNR-mr-GPSM for rooms R2 and R3. Generally, the model shows less improvement for rooms R1 and R4. The intelligibility score of R1 is higher than the others because of its lower T60, and this score may therefore approach its maximum possible value, such that further improvement is not possible, specifically for high SNRs near 0 dB (third-row panel in Figure 5a). For room R4 with its long reverberation tail, the model does not show much improved scores because of the large amount of time smearing, specifically for the noisier conditions with SNR = −10 and −15 dB. The same evaluation using mr-GPSM is shown in Figure 5b, now using PN. The model shows lower scores for the PN than for SSN and also lower improvements caused by the preprocessing algorithms. Generally, the model predictions show less than 1 dB improvement for OE and about 2 dB improvement for the proposed method. Despite the smaller improvements, the curves of Figure 5b show a consistent increase of speech intelligibility for the OE and the proposed approaches.
The intelligibility prediction models are also applied to the robustness evaluation scenarios. In Figure 6, the STOI scores, similar to those in Figure 4, show an improvement of intelligibility for the proposed method in comparison to the unprocessed and OE-enhanced signals.
Figure 6. STOI scores for the robustness evaluation scenarios in each room for the reverbed unprocessed signal, the reverbed OE-enhanced signal, and the reverbed signal enhanced with the proposed algorithm. Similar to the main scenarios in Figure 4, the proposed algorithm improves the STOI score in comparison to the reverbed unprocessed and OE-enhanced signals.
In Figure 7, the SNR-mr-GPSM scores for evaluation of robustness of the proposed algorithm are compared with the unprocessed and OE-enhanced signals. The improvements are in the range obtained for the main scenarios in Figure 5.
In Figure 7a, the maximum improvement predicted by the mr-GPSM in the presence of SSN is 3 dB for the OE-preprocessed signal and about 9 dB for the proposed approach. For the PN, a similar tendency can be seen in the three panels of Figure 7b. The data show an overall improvement of 3 to 4 dB for the proposed method, depending on the scenario. This improvement is sometimes larger than in the main scenario for SSN and PN. This underlines that the proposed algorithm is very robust against changes in listener position and that detailed knowledge of the IRs is not essential. The intelligibility score depends more on the listening scenario; it was sometimes better than in the main scenario for which the weights were calculated, because of a larger binaural advantage caused by a larger azimuthal separation of the target and noise sources.

Subjective Evaluation
The 50% speech reception thresholds (SRT50) were measured using the AFC toolbox in MATLAB [27]. For each scenario, a different list of 20 sentences was played to the listener. For each played sentence in the list, there were ten alternatives. After an audio file was played, the listener was asked to select a word from the alternatives. The speech level of the next sentence was adaptively adjusted to measure the SRT50. The step size of each level change depended on the number of correctly selected words in the previous sentence.
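An adaptive level track of this kind can be sketched in Python as follows. The update rule below is a hypothetical one chosen for illustration, since the exact step-size rule of the measurement is not given here: the level moves down when more than half of the words were correct and up otherwise, with a step proportional to the distance from the 50% target. The function names, the five words per sentence, the 2 dB maximum step, and the deterministic toy listener are all assumptions.

```python
def next_level(level_db, n_correct, n_words=5, max_step_db=2.0):
    """Return the speech level for the next sentence (hypothetical rule).

    The change is -max_step_db when all words were correct, +max_step_db when
    none were, and zero at exactly 50% correct.
    """
    if not 0 <= n_correct <= n_words:
        raise ValueError("n_correct must lie between 0 and n_words")
    fraction = n_correct / n_words
    return level_db - 2.0 * max_step_db * (fraction - 0.5)


def run_track(listener, start_level_db=0.0, n_sentences=20):
    """Present n_sentences adaptively and return all visited levels."""
    levels = [start_level_db]
    for _ in range(n_sentences):
        n_correct = listener(levels[-1])
        levels.append(next_level(levels[-1], n_correct))
    return levels


def toy_listener(level_db, threshold_db=-7.0):
    """Deterministic stand-in for a listener: all words right above threshold."""
    return 5 if level_db > threshold_db else 0


levels = run_track(toy_listener)
```

With this toy listener the track descends from the start level and then oscillates around the listener's threshold, so the late levels bracket the 50% point; a real measurement would estimate the SRT50 from such late-track levels.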
Figure 8 shows the SRT50 for eight subjects obtained with the OLSA matrix test for all four room-acoustical scenarios and SSN for the unprocessed, OE, and proposed approaches. In the left panel, median values are shown together with the 25% and 75% quantiles and outliers across the eight subjects. The right panel shows the mean values and the standard error across the subjects' mean values; the standard error at the far right-hand side is calculated across all subjects and rooms. Considering all rooms, it can be seen in Figure 8b that the intelligibility is enhanced by up to 3.5 dB for OE and 5 dB for the proposed approach compared to the unprocessed speech. In rooms R2 and R3, the proposed approach has a slightly larger effect on intelligibility, which may be caused by the midrange values of T60. Data obtained in rooms R2 and R3 for SSN show an improvement of about 1.5 dB and 1 dB, respectively, for the proposed method in comparison to the OE. The proposed approach shows only a small improvement in rooms R1 and R4, which is due to the low and very high reverberation times that make further improvements difficult. Room R4, which is a church with a reverberation time of T60 = 5 s, shows no large difference between the proposed and OE approaches. For this room, in spite of the higher reverberation time, the SRTs are lower than those of rooms R2 and R3.
This is mainly because of the high direct-to-reverberant ratio in the BRIRs of the church and also the larger azimuth angle difference between the source and the noise for the BRIRs in the church. In Figure 9, the same plots are depicted, but this time for the PN interferer. Both the OE and the proposed approach show fairly good improvement in comparison to the unprocessed signal. However, altogether only a small improvement of about 1 dB is seen for the proposed approach in comparison to the OE. Only for room R2 is there more than 1 dB improvement.
The results of the SRT50 measurements for the robustness evaluation scenarios are shown in Figures 10 and 11 for SSN and PN, respectively. These figures show the SRTs for weights optimized on the main scenario and applied to the robustness evaluation positions. For the SSN in Figure 10, a comparison between the three signals shows an improvement of up to 3.5 dB for the OE and 5.5 dB for the proposed method relative to the unprocessed signal. A similar tendency can be seen in Figure 11 for PN. The data in Figure 11 show an overall improvement of 1.5 to 3 dB for OE and 2 to 3.5 dB for the proposed approach, depending on the room. In general, comparing the thresholds of the robustness evaluation scenarios with those of the main scenarios, it can be seen that, similar to the predictions of the intelligibility models, both the OE and the proposed methods are very robust against changes in position, and a detailed knowledge of the IR is not necessary.

Discussion
In this study, a new reverb-based NELE approach based on the optimization of a cost function was proposed. It reduces the time-smearing effect of reverberation on speech and, similar to the OE, amplifies the onsets and has high-pass filter characteristics. Its main advantage over the OE is that future frames are considered in finding the filtering weights, which results in a greater reduction of the overlap masking caused by the reverberation tail. The amount of overlap masking is considered in the defined cost function and is used to control the weights applied to those segments of the original speech signal that would make the upcoming frames inaudible. A higher importance is assigned to the onset segments in the cost function to avoid attenuating onsets, which would decrease intelligibility. Both the model predictions and the listening-test results showed improvements in SRTs. The model predictions demonstrated that the proposed algorithm is better able to compensate for the detrimental effects of reverberation than the OE method. The subjective evaluation showed that, depending on the scenario, there is an improvement of 0.5 dB up to 2 dB in comparison to the OE. The mr-GPSM model predicts a larger improvement for the proposed method over the OE method than what is actually observed in the listening tests. One reason could be the stronger artifacts created by the proposed algorithm compared to the OE method. A possible modification of the method could be to use a quality-assessment criterion in the optimization procedure. The algorithm evaluation using both the intelligibility models and the listening test showed that improvements in speech intelligibility did not depend on having an exact match between the positions of the source and the listener used for obtaining the optimal weights. Similar to the OE approach, this underlines the robustness of this algorithm to errors in the estimation of room impulse responses.
For the proposed algorithm, only a coarse spectro-temporal representation of the room impulse response is used, and the exact magnitudes and phases of the transfer function are not needed. Therefore, the robustness problem that exists in the inverse-filtering approaches is avoided in the proposed method.
Another important point about the proposed approach is the fixed parameters used in the construction of the cost function and in the optimization. The same fixed set of parameters is used for all scenarios, and the performance of the algorithm depends on how they are set. The first parameter is the number of future frames (P) considered in the summed cost function (SCF) of Equation (12), which is based on the reverberation time. To reduce the computational load, it could be set to much less than the number of frames covered by the T60. Informal listening tests showed that beyond a specific number of future frames, the signal is not improved much further. Empirically, P was determined from the EDC of the RIR as the point at which it has dropped by 25 dB. Other parameters that are fixed empirically are the weights (βi) assigned to onsets in the SCF of Equation (12) and the lower and upper bounds for the gains in Equation (13). In future work, these parameters could be selected according to an intelligibility model.
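One way to derive a look-ahead length from the EDC is sketched below in Python. It is an illustration under assumptions rather than the implementation used in the study: the EDC is computed by Schroeder backward integration of the squared RIR, the frame hop `hop_samples` and the function names are hypothetical, and the RIR is a synthetic exponential decay.

```python
import math


def energy_decay_curve_db(rir):
    """Schroeder backward integration of the squared RIR, in dB re total energy."""
    tail = 0.0
    remaining = []
    for sample in reversed(rir):
        tail += sample * sample
        remaining.append(tail)
    remaining.reverse()
    total = remaining[0] if remaining else 0.0
    if total <= 0.0:
        raise ValueError("RIR must contain energy")
    return [10.0 * math.log10(e / total) if e > 0.0 else -math.inf
            for e in remaining]


def look_ahead_frames(rir, hop_samples, drop_db=25.0):
    """Number of frames until the EDC has dropped by drop_db (rounded up)."""
    edc = energy_decay_curve_db(rir)
    for index, level in enumerate(edc):
        if level <= -drop_db:
            return -(-index // hop_samples)  # ceiling division
    # The decay never reaches drop_db: fall back to the full RIR length.
    return -(-len(rir) // hop_samples)


# Synthetic RIR whose energy decays by 1 dB every 100 samples.
rir = [10.0 ** (-n / 2000.0) for n in range(10000)]
P = look_ahead_frames(rir, hop_samples=100)
```

Because the 25 dB point is reached well before the 60 dB point that defines T60, a look-ahead chosen this way covers considerably fewer frames than the full reverberation time.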
Regarding the computational load, there is not much difference between the preprocessing of the proposed approach and that of the OE. In both methods, the signal is separated into frequency bands (Gammatone-based in OE and one-third octave band bin separation in the proposed approach), and the power spectra of the signals are obtained using FFT processing. However, in the proposed approach, the optimization part adds a significant computational load. The optimization is performed with the Sequential Quadratic Programming (SQP) algorithm implemented in MATLAB by running the "fmincon" function. Because a symbolic function is used in MATLAB and the optimization algorithm is complicated, the running time is high. It depends on the T60 and the length of the signal. The time needed to calculate the weights of 17 independent bands for a 3 s speech audio file and T60 = 1 s is about 40 min. The running time of the OE algorithm for the same file and a similar room condition is very low, below 10 s. Therefore, the proposed algorithm with the current optimization approach cannot be used in a real-time scenario. Note that this study focused on the design of the algorithm; the reduction of the complexity and computational load will be considered in a future study. We intend to replace the symbolic optimization in MATLAB with a faster algorithm such as a modified version of [28].
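To give a feel for the kind of bound-constrained envelope-matching problem involved, the following Python sketch optimizes per-frame gains in a single band. It is a toy model and not the method of this study: the cost is a simplified, hypothetical stand-in for Equations (11) to (13), the reverbed envelope is modeled as a convolution of the gained clean frame powers with a coarse RIR energy envelope, and plain projected gradient descent is swapped in for the SQP solver. All names and numerical values are assumptions.

```python
def reverbed_envelope(gains, clean, rir_env):
    """Frame powers after applying gains and smearing with the RIR envelope."""
    n = len(clean)
    out = [0.0] * n
    for i in range(n):
        for k, h in enumerate(rir_env):
            if k > i:
                break
            out[i] += h * gains[i - k] * clean[i - k]
    return out


def cost(gains, clean, rir_env, beta):
    """Weighted squared error between reverbed and clean envelopes."""
    reverbed = reverbed_envelope(gains, clean, rir_env)
    return sum(b * (y - x) ** 2 for b, y, x in zip(beta, reverbed, clean))


def gradient(gains, clean, rir_env, beta):
    """Analytic gradient of the cost with respect to each frame gain."""
    reverbed = reverbed_envelope(gains, clean, rir_env)
    err = [2.0 * b * (y - x) for b, y, x in zip(beta, reverbed, clean)]
    n = len(clean)
    grad = [0.0] * n
    for j in range(n):
        for k, h in enumerate(rir_env):
            if j + k >= n:
                break
            # The gain of frame j also affects the later frames j + k.
            grad[j] += err[j + k] * h * clean[j]
    return grad


def optimize_gains(clean, rir_env, beta, lower=0.1, upper=2.0,
                   step=0.02, iterations=500):
    """Projected gradient descent with gains clipped to [lower, upper]."""
    gains = [1.0] * len(clean)
    for _ in range(iterations):
        grad = gradient(gains, clean, rir_env, beta)
        gains = [min(upper, max(lower, g - step * d))
                 for g, d in zip(gains, grad)]
    return gains


clean = [1.0, 0.8, 0.0, 0.0, 0.9, 0.5, 0.0, 0.0]   # two bursts separated by gaps
rir_env = [1.0, 0.5, 0.25]                          # coarse decaying RIR envelope
beta = [3.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]    # larger weight on onset frames
gains = optimize_gains(clean, rir_env, beta)
cost_before = cost([1.0] * len(clean), clean, rir_env, beta)
cost_after = cost(gains, clean, rir_env, beta)
```

Even in this small example, each gain influences the error in the frames that follow it, which is why the number of future frames and the signal length drive the running time of the full optimization.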
The proposed method belongs to the category of reverb-based NELE algorithms. According to the literature, algorithms that use a priori knowledge of the maskers and RIRs do not perform better than noise-independent algorithms. The ASE and SSDRC approaches, which do not use the characteristics of the playback environment, outperformed other methods in the NELE challenge [29]. It is surprising that, until now, enhancement algorithms designed to counteract the effects of noise and reverberation on speech have not performed well. A promising approach could be a combination of the three categories of NELE algorithms (rule-based, noise-dependent, and reverberation-dependent) to benefit from the advantages of the separate methods. For example, in the Adaptive Compressive Onset-Enhancement (ACO) method [30], a sequential and independent combination of a modified version of the AdaptDRC [6] and the OE [16] is used to enhance the speech in a reverberant and noisy room with knowledge of the statistics of the additive noise and of the RIR, respectively.