Residual Echo Suppression Considering Harmonic Distortion and Temporal Correlation

In acoustic echo cancellation, a certain level of residual echo remains in the output of the linear echo canceller because of the nonlinearity of the power amplifier, loudspeaker, and acoustic transfer function, in addition to the estimation error of the linear echo canceller. The residual echo in the current frame is correlated not only with the linear echo estimates in the harmonically related frequency bins of the current frame, but also with the linear echo estimates, residual echo estimates, and microphone signals in adjacent frames. In this paper, we propose a residual echo suppression scheme that considers harmonic distortion and temporal correlation in the short-time Fourier transform domain. To exploit residual echo estimates and microphone signals in past frames without the adverse effect of near-end speech and noise, we adopt a double-talk detector tuned to have a low false rejection rate for double-talk. Experimental results show that the proposed method outperformed the conventional approach in terms of echo return loss enhancement during single-talk periods and perceptual evaluation of speech quality scores during double-talk periods.


Introduction
Acoustic echo caused by acoustic coupling between loudspeakers and microphones is one of the most important issues in many audio and speech signal processing applications such as speech communication [1], speech recognition [2], and hearing aids [3,4]. The coupling allows the signal emitted by the loudspeaker to be captured by the microphone, which degrades the quality of speech communication or the speech recognition rate. Acoustic echo cancellation (AEC) and acoustic echo suppression (AES) have been introduced to remove acoustic echoes [5–20]. Many AEC approaches employ linear adaptive filters in the time, frequency, or subband domain to predict and cancel acoustic echoes based on far-end signals [5,9,13]. In the time domain, the most widely used method for linear AEC may be the normalized least mean square (NLMS) algorithm, possibly with step-size control [5] or a double-talk detector (DTD) [6,7], which provides a good balance between fast convergence and low misalignment. However, the length of the time-domain adaptive filter should be long enough to accommodate possible impulse responses of the echo path, which can require substantial computation. As alternative solutions, frequency-domain and subband-domain approaches have been proposed [8–13], which can reduce the computational complexity and increase the convergence speed simultaneously. In [9], AEC based on a frequency-domain Kalman filter with a shadow-filter approach employing an echo path change detector was proposed together with a reconvergence analysis. Ni et al. [13] proposed a combination of two subband adaptive filters with different step sizes that does not require estimating the system noise power, which showed fast convergence and a small steady-state mean square error.
The linear filtering approach cannot completely remove the acoustic echo, as the echo is not a linear function of the far-end signal. The nonlinearity arises mainly from the nonlinear response of the power amplifier and loudspeaker, as well as from the nonlinear acoustic transfer function and the misalignment of the linear echo canceller. To suppress nonlinear echo components, AEC based on nonlinear adaptive filters has been proposed [14–17]. Volterra filter-based methods [14–16] model the nonlinear relationship between the far-end and acoustic echo signals with polynomial models. Unfortunately, these methods exhibited slow convergence rates [21]. Park et al. [17] also employed a polynomial model in a Kalman filter-based framework when multiple microphone signals were available. AES algorithms analogous to speech enhancement techniques that estimate spectral gain functions, such as Wiener filtering, have also been proposed and demonstrated impressive performance with low computational complexity [18–20].
Another class of approaches places a separate module after the linear echo canceller to clean up the residual echo left by the linear AEC or AES, which is called residual echo suppression (RES) [22–28]. Hoshuyama et al. [22] suggested a spectral subtraction scheme that removes the residual echo by assuming that the spectral magnitude of the residual echo is proportional to that of the linear echo estimate. Lee and Kim [23] proposed a statistical model-based RES incorporating four hypotheses according to the existence of near-end speech and residual echo. Schwarz et al. [24] proposed an RES that estimates the residual echo from the far-end signal using an artificial neural network. In [25,26], RES methods based on deep neural networks have been proposed and have shown good performance, while requiring heavy computation. Another class of approaches is based on adaptive filters [27,28], which showed decent performance at a reasonable computational cost [21]. Bendersky et al. [28] proposed harmonic distortion RES (HDRES) in the short-time Fourier transform (STFT) domain, which models the residual echo as a linear function of the linear echo estimates in the frequency bins that can produce harmonic distortion at the current frequency bin.
In this paper, we extend the HDRES in [28] by modeling the residual echo in the current time-frequency bin as a function of the linear echo estimates, residual echo estimates, and microphone signals in adjacent frames, as well as the linear echo estimates in the harmonically related frequency bins of the current frame. A DTD is adopted to take into account the residual echo estimates and microphone signals in past frames without the adverse effect of near-end speech and noise.

Problem Formulation
Let x(t) denote the far-end signal and let s(t) and n(t) denote the near-end speech and background noise at time index t. The microphone signal y(t) and the AEC output signal e(t) are expressed as follows:

y(t) = d(t) + s(t) + n(t),
e(t) = y(t) − d̂(t) = d_r(t) + s(t) + n(t),

in which d(t) is the echo signal, d̂(t) is the linear echo estimate produced by the AEC filter, and d_r(t) = d(t) − d̂(t) is the residual echo. A residual echo is always left in the output signal of the AEC, due to the nonlinearity arising from the power amplifier, loudspeaker, and nonlinear echo path, as well as the imperfect AEC.
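As a small numerical illustration of this decomposition (toy signals and a toy echo path, not the paper's setup), the following sketch verifies that the linear AEC output equals the residual echo plus the near-end components:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1600)

# Hypothetical stand-ins for the signals in the text.
x = rng.standard_normal(1600)                    # far-end signal x(t)
s = 0.5 * np.sin(2 * np.pi * 440 * t / 16000)    # near-end speech s(t)
n = 0.01 * rng.standard_normal(1600)             # background noise n(t)

h = np.array([0.6, 0.3, 0.1])                    # toy linear echo path
d_lin = np.convolve(x, h)[:1600]                 # linear part of the echo
d = d_lin + 0.05 * d_lin**2                      # echo d(t) with a nonlinear term

y = d + s + n                                    # microphone signal y(t)
d_hat = d_lin                                    # ideal linear echo estimate
e = y - d_hat                                    # AEC output e(t)
d_r = d - d_hat                                  # residual echo d_r(t)

# e(t) = d_r(t) + s(t) + n(t): the nonlinear residue survives the linear AEC.
assert np.allclose(e, d_r + s + n)
```

Even with a perfect linear echo estimate, the quadratic term leaves a nonzero d_r(t), which is exactly what the RES stage must remove.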
In the frequency domain, these equations become

Y(m, f) = D(m, f) + S(m, f) + N(m, f),
E(m, f) = Y(m, f) − D̂(m, f) = D_r(m, f) + S(m, f) + N(m, f),

where Y(m, f), E(m, f), D(m, f), D̂(m, f), D_r(m, f), S(m, f), and N(m, f) are the STFT coefficients of y(t), e(t), d(t), d̂(t), d_r(t), s(t), and n(t) in frame m and frequency bin f, respectively. The goal of the RES is to estimate and remove the residual echo, D_r(m, f), using the available signals, such as the far-end reference signal, X(m, f), the linear echo estimate, D̂(m, f), and the microphone signal, Y(m, f), in the past and current frames in all frequency bins. A block diagram of an AEC system with an RES module is shown in Figure 1.

Harmonic Distortion Residual Echo Suppression
HDRES [28] estimates the magnitude of the residual echo as a linear function of the linear echo estimates in the frequency bins that can affect the current frequency bin through harmonic distortion, i.e., the frequencies that are quotients of the current frequency and an integer. With δ(·), a selection function that is 1 only for the frequency bins harmonically related to the current frequency and a few nearby bins, the estimate of the magnitude of the residual echo is a weighted sum over those bins, where M is the number of frequency bins, H is the number of harmonics considered, K is the harmonic search window that accommodates nearby frequency bins to deal with the limited frequency resolution of the STFT, W_H(i, j, k)'s are the weights of the linear combination, and X(m, f) is the weighted linear echo estimate defined in [28] with a normalized weighting factor L(t, f) derived from W_L, the weight of the frequency-domain linear adaptive AEC filter [8,28], over S frames. The function δ in (6) is constructed to deal with harmonic distortion, as the frequency content falling in the i-th frequency bin affects the bins whose frequencies are integer multiples of that frequency. The affected bins are centered on bin i × j but may also include the 2K nearby frequency bins, as the frequency resolution of the STFT is limited. The parameters of the residual echo suppressor, W_H(i, j, k)'s, are estimated by an NLMS algorithm.
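To make the bin bookkeeping concrete, the following sketch collects the candidate source bins for one target bin; the exact indexing of δ(·) in [28] may differ from this illustrative choice:

```python
def harmonic_bins(f, H, K, M):
    """Collect source bins i whose j-th harmonic (up to a +-K search offset)
    can land on target bin f. Illustrative indexing, not the exact delta(.)
    of the HDRES paper."""
    bins = set()
    for j in range(1, H + 1):          # harmonic order j = 1, ..., H
        base = f // j                  # quotient of current frequency and j
        for k in range(-K, K + 1):     # +-K nearby bins for STFT leakage
            i = base + k
            if 0 <= i < M:
                bins.add(i)
    return sorted(bins)

# Bins that can contribute harmonic distortion to bin 120 (M = 257, H = 5, K = 1).
bins = harmonic_bins(120, H=5, K=1, M=257)
```

With H = 5 and K = 1, fifteen source bins (three around each of f, f/2, f/3, f/4, and f/5) are considered for each target bin, which matches the (2K + 1)H terms in the complexity count of Section "Computational Complexity".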

Residual Echo Suppression Considering Harmonic Distortion and Temporal Correlation
In this paper, we propose an extension of HDRES [28] that considers not only the harmonic distortion but also the temporal correlation of the relevant signals. The residual echo in the current time-frequency bin is modeled as a linear function of not only the linear echo estimates in the harmonically related frequency bins of the current frame but also the linear echo estimates, residual echo estimates, and microphone signals in adjacent frames. Speech signals exhibit strong temporal correlation, which is beneficial to exploit across many areas of speech signal processing. In addition to the temporal correlation of the far-end speech signal, or of its filtered version, which can help the estimation of the residual echo, the nonlinear part of the echo signal may have its own temporal correlation. Moreover, the microphone signal bears raw information about the captured signal that may be complementary to that in the estimated signals, which can also help the estimation of the residual echo in the current frame during far-end single-talk periods.
To utilize all the relevant signals considering spectral and temporal correlation, two estimates of the spectral magnitude of the residual echo are maintained, one for far-end single-talk periods and one for double-talk periods. The residual echo for a far-end single-talk period in frame m and frequency f is modeled as a linear combination in which T is the number of previous frames considered, W_Hs(i, j, k) is the weight for the linear echo estimates in the harmonically related frequencies, and W_Ts1(p, f), W_Ts2(p, f), and W_Ts3(p, f) are the weights for the linear echo estimates, residual echo estimates, and microphone signals in the previous frames, respectively. It is noted that the summation index for Y starts at 0 to include the effect of the microphone signal in the current frame. On the other hand, the residual echo in a double-talk period is estimated without the microphone signal to avoid the adverse effect of the near-end signals, in which W_Hd(i, j, k) is the weight for the linear echo estimates in the harmonically related frequencies and W_Td1(p, f) and W_Td2(p, f) are the weights for the linear echo estimates and residual echo estimates in the previous frames, respectively. With these two estimates, the estimate of the magnitude of the residual echo is selected depending on the result of the double-talk detection. The weights in Equations (9) and (10) are updated during far-end single-talk periods using an NLMS algorithm that minimizes the mean square error between the AEC output signal and the residual echo estimate. The weights W_Hs(i, j, k), W_Ts1(p, f), W_Ts2(p, f), and W_Ts3(p, f) used for the single-talk periods are adapted with an error term ξ_ST(m, f), a step size µ, and smoothed powers P̄_X, P̄_D, and P̄_Y obtained by recursive averaging with smoothing parameter ρ.
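Since the paper's update equations are not reproduced above, the following sketch shows a generic NLMS step of the kind described, with the regressor power smoothed by ρ; the regressor layout and normalization details are illustrative assumptions, not the paper's exact equations:

```python
import numpy as np

def nlms_step(w, phi, target, mu=0.01, p_bar=1.0, rho=0.9, eps=1e-8):
    """One normalized-LMS step for a linear residual-echo magnitude model.
    w: weight vector; phi: regressor stacking the |X|, |D_r|, and |Y|
    magnitude terms from the current and previous frames; target: |E(m, f)|.
    p_bar: recursively smoothed regressor power with smoothing rho."""
    err = target - w @ phi                          # estimation error xi(m, f)
    p_bar = rho * p_bar + (1 - rho) * (phi @ phi)   # smoothed power
    w = w + mu * err * phi / (p_bar + eps)          # normalized weight update
    return w, err, p_bar

# Toy convergence check on a fixed regressor/target pair.
w, p_bar = np.zeros(3), 1.0
phi, target = np.array([1.0, 0.5, 0.2]), 0.7
for _ in range(500):
    w, err, p_bar = nlms_step(w, phi, target, mu=0.1, p_bar=p_bar)
```

Normalizing by the smoothed power rather than the instantaneous power keeps the effective step size stable when the echo magnitudes fluctuate from frame to frame.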
The parameters used to estimate the residual echo during double-talk periods, W_Hd(i, j, k), W_Td1(p, f), and W_Td2(p, f), are updated with Equations (20)–(22). It is noted that the weights used to estimate the residual echo magnitude in double-talk periods are updated during far-end single-talk periods, not during double-talk periods. This is because the weights W_Hd, W_Td1, and W_Td2 try to capture the effect of the linear echo in the harmonically related frequencies, the linear echo in the previous frames, and the residual echo in the previous frames on the residual echo in the current frame, whereas |X| and |D_r| in double-talk periods contain a significant amount of near-end speech, which disrupts the estimation of W_Hd, W_Td1, and W_Td2.
With the estimated magnitude of the residual echo in Equation (11), |D̂_r(m, f)|, the magnitude of the linear echo canceller output, |E(m, f)|, and the estimate of the magnitude of the noise spectrum obtained by the minimum statistics approach [29], |N̂(m, f)|, the real-valued gain function of the residual echo suppressor, G(m, f), is constructed as a spectral subtraction with a noise floor, as in [27,28],
where β is a parameter that controls the aggressiveness of the RES, max(·, ·) returns the larger of its two arguments, and D̄_r, Ē, and N̄ are smoothed versions of |D̂_r|, |E|, and |N̂| obtained by

D̄_r(m, f) = α D̄_r(m − 1, f) + (1 − α)|D̂_r(m, f)|,
Ē(m, f) = α Ē(m − 1, f) + (1 − α)|E(m, f)|,
N̄(m, f) = α N̄(m − 1, f) + (1 − α)|N̂(m, f)|,

in which α is a smoothing factor. The final output of the RES is obtained by

Ŝ(m, f) = G(m, f)|E(m, f)| exp(jθ_E(m, f)),

where exp(·) is the exponential function, j is the imaginary unit, and θ_E(m, f) is the phase of E(m, f). The time-domain signal ŝ(t) is computed by the inverse short-time Fourier transform of Ŝ(m, f).
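The exact expression for G(m, f) is not reproduced above; the sketch below is one plausible realization of a spectral-subtraction gain with a noise floor consistent with that description. The flooring at the noise estimate and the minimum gain g_min are our assumptions:

```python
import numpy as np

def res_gain(E_bar, Dr_bar, N_bar, beta=2.0, g_min=0.05, eps=1e-12):
    """Spectral-subtraction gain with a noise floor (illustrative form):
    subtract the beta-weighted residual-echo magnitude from the AEC output
    magnitude, never going below the noise estimate or a minimum gain."""
    numer = np.maximum(E_bar - beta * Dr_bar, N_bar)  # floor at noise estimate
    return np.maximum(numer / (E_bar + eps), g_min)

def smooth(prev, current, alpha=0.7):
    """First-order recursive smoothing used for the D_r, E, and N magnitudes."""
    return alpha * prev + (1 - alpha) * current

# Three example bins: echo-dominated, mixed, and near the noise floor.
E_bar = np.array([1.0, 0.4, 0.2])
Dr_bar = np.array([0.3, 0.15, 0.09])
N_bar = np.array([0.05, 0.05, 0.05])
g = res_gain(E_bar, Dr_bar, N_bar)
```

In the third bin the subtraction would dip below the noise estimate, so the gain is clamped at the floor rather than over-suppressing the background noise, which would otherwise cause audible musical noise.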

Experiments
To demonstrate the performance of the proposed RES, we compared the echo return loss enhancements (ERLEs) in far-end single-talk periods and the perceptual evaluation of speech quality (PESQ) scores in double-talk periods for the linear AEC output without RES, with the power filter-based RES (PFRES) [27], with the HDRES [28], and with the proposed RES. As the linear AEC, we adopted the frequency-domain NLMS-based echo canceller proposed in [8]. The DTD module used in the experiments was the one proposed in [30]. The ERLE is defined as the ratio of the microphone signal power to the output signal power in decibels, measured during far-end single-talk periods. The parameter values for the proposed RES were M = 257, S = 3, H = 5, K = 1, T = 1, µ = 0.01, ρ = 0.9, α = 0.7, and β = 2. The parameters H, T, and K are related to the nonlinearity of the acoustic echo, so their values would depend on the device configuration. The smoothing factors ρ and α were set to rather high values, as the echo signal is highly nonstationary. The step size µ was set to control the trade-off between convergence rate and misalignment. We performed experiments on both simulated and real-recorded data with various noise types, echo-to-noise ratios (ENRs), near-end signal-to-noise ratios (SNRs), and near-end signal-to-echo ratios (SERs).
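In code, the standard ERLE computation reads as follows (our implementation of the usual definition, not code from the paper):

```python
import numpy as np

def erle_db(y, out, eps=1e-12):
    """ERLE in dB: power of the microphone signal over power of the
    processed output, measured during far-end single-talk."""
    return 10.0 * np.log10((np.mean(y**2) + eps) / (np.mean(out**2) + eps))

# Toy example: a suppressor that attenuates the echo-dominated mic signal
# by a factor of 10 in amplitude yields an ERLE of 20 dB.
rng = np.random.default_rng(1)
y = rng.standard_normal(16000)   # mic signal dominated by echo
out = 0.1 * y                    # hypothetical suppressor output
```

Higher ERLE means stronger echo attenuation; it is meaningful only in far-end single-talk segments, since during double-talk the near-end speech in the output would be (correctly) counted against the score.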

Experiments with Simulated Data
First, the performance of the RES algorithms was evaluated with simulated data. A total of 20 utterances spoken by 10 male and 10 female speakers were selected from the TIMIT database [31] as the far-end speech, while another 20 utterances from the same database were used as the near-end speech for the double-talk scenario. The background noises were the Babble, White, and Factory noises from the NOISEX-92 database [32]. The sampling rate was 16 kHz. The ENR for the single-talk periods was set to 10, 15, and 20 dB for each type of noise, in addition to a noise-free condition, so the total number of test signals for the far-end single-talk scenario was 20 × (3 × 3 + 1) = 200. As for the double-talk data, 20 pairs of utterances were used as the near-end and far-end speech signals, mixed at SERs of −5, 0, and 5 dB. The SNR was set to 10, 15, and 20 dB for each noise type, in addition to the noise-free condition, making the total number of test signals 20 × 3 × (3 × 3 + 1) = 600.
To simulate the nonlinearity arising from power amplifiers and loudspeakers, we adopted a clipping function [33] and a memoryless sigmoidal function [34]. The hard clipping function [33] is defined as

x_hard(t) = max(−x_max, min(x(t), x_max)),

where x_max was set to 80% of the maximum volume of x(t). To model the nonlinear characteristics of the loudspeakers, the memoryless sigmoidal function [34] was used, where the sigmoid gain parameter γ was set to 2 and the sigmoid slope was set to a = 4 if b(t) > 0 and a = 0.5 otherwise. The echo path after the nonlinearity was modeled with the image method [35], simulating a 4 × 4 × 3 m small office with a reverberation time of T60 = 200 ms. Tables 1 and 2 show the average ERLEs in the far-end single-talk periods and the average PESQ scores in the double-talk periods for the simulated data. For all noise types and echo-to-noise ratios, the HDRES [28] improved the ERLE of the linear AEC while keeping the PESQ scores the same, and the proposed RES outperformed the HDRES in both the ERLE and the PESQ scores. The PESQ scores for the PFRES were similar to those for the HDRES, while the ERLE for the PFRES lay between that of the HDRES and that of the proposed RES. On average, the ERLE was improved by 5.37 dB compared with the linear AEC output, by 0.74 dB compared with the PFRES, and by 4.43 dB compared with the HDRES, while the PESQ score was improved by 0.064 compared with the linear AEC output, by 0.042 compared with the PFRES, and by 0.051 compared with the HDRES. We can conclude that exploiting the temporal correlation of the relevant signals, in addition to the harmonic relation between frequency bins, was effective for estimating the residual echo.
Table 1.
Echo return loss enhancements (ERLEs) in various echo-to-noise ratio (ENR) conditions for the acoustic echo canceller (AEC) without residual echo suppression (RES), with the power filter-based RES (PFRES) [27], with the harmonic distortion RES (HDRES) [28], and with the proposed RES during the far-end single-talk periods for the simulated data. The numbers in boldface indicate the best performance.

Table 2. Average perceptual evaluation of speech quality (PESQ) scores in various signal-to-noise ratio (SNR) and signal-to-echo ratio (SER) conditions for the AEC without RES, with the PFRES [27], with the HDRES [28], and with the proposed RES during the double-talk periods for the simulated data. The numbers in boldface indicate the best performance.
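The two distortion models described above can be sketched as follows. The hard-clipping rule follows directly from the text; the preprocessing b(t) = 1.5x(t) − 0.3x²(t) inside the sigmoidal model is the form commonly used in this literature and is our assumption, as the exact expression from [34] is not reproduced here:

```python
import numpy as np

def hard_clip(x, x_max):
    """Hard clipping at +-x_max (x_max = 80% of the maximum amplitude)."""
    return np.clip(x, -x_max, x_max)

def sigmoid_distortion(x, gamma=2.0):
    """Memoryless sigmoidal nonlinearity with asymmetric slope a.
    b(t) = 1.5x - 0.3x^2 is an assumed preprocessing, not quoted from [34]."""
    b = 1.5 * x - 0.3 * x**2
    a = np.where(b > 0, 4.0, 0.5)   # steeper compression for positive excursions
    return gamma * (2.0 / (1.0 + np.exp(-a * b)) - 1.0)

x = np.linspace(-1.0, 1.0, 5)
clipped = hard_clip(x, 0.8)
distorted = sigmoid_distortion(x)
```

Both nonlinearities generate the odd and even harmonics that the harmonic-distortion model of the RES is designed to track.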

Experiments with Real-Recorded Data
Since the nonlinear echo may not be simulated well enough by the clipping function and memoryless sigmoidal function, we additionally performed experiments on real recordings. We used 28 far-end speech signals and a near-end speech signal recorded with a commercial mobile phone, a Samsung Galaxy S8, in hand-held hands-free mode [36]. The raw microphone signals and the far-end signal were obtained by an internal development program at Samsung. Each recording is about 65 s long with a sampling rate of 16 kHz. Three types of noise, the Pub, Road, and Call center noises from the European Telecommunications Standards Institute (ETSI) EG 202 396-1 background noise database [37], were replayed from loudspeakers in the recording room to simulate background noise. With the same ENR, SNR, and SER conditions as for the simulated data, the total numbers of real-recorded test signals for the far-end single-talk and double-talk periods were 28 × (3 × 3 + 1) = 280 and 28 × 3 × (3 × 3 + 1) = 840, respectively. Table 3 shows the average ERLEs on the real-recorded data for the far-end single-talk periods. The average ERLE for the proposed RES was 5.23 dB higher than that of the linear AEC output, 1.27 dB higher than that of the PFRES, and 1.53 dB higher than that of the HDRES. Utilizing the correlation with the microphone signal and with the previous frames of the linear echo estimates, microphone signals, and residual echo estimates was shown to be effective for real recordings with a commercial smartphone as well. Figure 2 shows an example of the ERLEs over time for the AEC system without RES, with the PFRES, with the HDRES, and with the proposed RES, along with the microphone signal waveform. In Table 4, the PESQ scores for the real recordings are shown. Again, we can confirm that the proposed RES achieved better speech quality in the double-talk periods under various background noise and echo conditions for the real-recorded data.
Table 3.
ERLEs in various ENR conditions for the AEC without RES, with the PFRES [27], with the HDRES [28], and with the proposed RES during the far-end single-talk periods for the real-recorded data. The numbers in boldface indicate the best performance.

Computational Complexity of the Proposed RES Algorithm
We investigated the computational complexity of the proposed RES algorithm. In our experiments, the proposed RES algorithm requires {2(2K + 1)H + 5T + S + 46}M real-valued multiplications per frame, while the HDRES algorithm [28] requires {(2K + 1)H + S + 22}M real-valued multiplications and the PFRES algorithm [27] requires (6p² + 13p + 3)M real-valued multiplications, where p denotes the order of the polynomial model of the nonlinear echo path. For the parameter values used in the experiments (M = 257, T = 1, S = 3, H = 5, K = 1, and p = 3), these counts are 84M, 40M, and 96M, respectively, so the computational complexity of the proposed method is approximately twice that of the HDRES [28] and about 12% less than that of the PFRES [27]. We can see that the proposed method has reasonable computational complexity considering the performance improvements shown in Tables 1–4.
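These counts can be checked with a few lines of arithmetic, using the formulas and parameter values stated above:

```python
# Per-frame real-multiplication counts from the formulas in the text,
# evaluated at the experimental parameter values.
M, S, H, K, T, p = 257, 3, 5, 1, 1, 3

proposed = (2 * (2 * K + 1) * H + 5 * T + S + 46) * M  # {2(2K+1)H + 5T + S + 46}M
hdres = ((2 * K + 1) * H + S + 22) * M                 # {(2K+1)H + S + 22}M
pfres = (6 * p**2 + 13 * p + 3) * M                    # (6p^2 + 13p + 3)M

ratio_hdres = proposed / hdres   # ~2.1: roughly twice the HDRES cost
ratio_pfres = proposed / pfres   # ~0.875: about 12% below the PFRES cost
```

The dominant extra cost over HDRES comes from maintaining a second set of harmonic weights (the factor of 2 in the first term) plus the 5T temporal-correlation terms per bin.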

Conclusions
In this paper, we proposed a method for residual echo suppression considering harmonic distortion and temporal correlation. The proposed method estimates the residual echo taking into account not only the linear echo estimates in the harmonically related bins but also the linear echo estimates, residual echo estimates, and microphone signals in adjacent frames. To utilize the microphone signal without the adverse effect of the near-end signals, a DTD module is employed. Experimental results showed that the proposed method improved the ERLE of the HDRES in far-end single-talk periods by 4.43 dB for the simulated data and by 1.53 dB for the real-recorded data, and improved the PESQ scores of the HDRES in double-talk periods by 0.051 for the simulated data and by 0.049 for the real-recorded data, which may justify the increase in computational complexity.

Conflicts of Interest:
The authors declare no conflict of interest.