Crossband Filtering for Weighted Prediction Error-Based Speech Dereverberation

Weighted prediction error (WPE) is a linear prediction-based method extensively used to predict and attenuate the late reverberation component of an observed speech signal. This paper introduces an extended version of the WPE method that improves the modeling accuracy in the time–frequency domain by incorporating crossband filters. Two approaches to extending WPE with crossband filters are proposed and investigated. The first approach improves the model's accuracy at the cost of increased computational complexity. The second approach keeps the computational complexity of the conventional WPE while still improving accuracy and performing comparably to the first approach. Extensive simulations are conducted to validate the proposed methods. The experimental results demonstrate that both methods outperform the conventional WPE in dereverberation performance. These findings highlight the potential of crossband filters to improve the accuracy and efficacy of the WPE method for dereverberation tasks.


Introduction
When a distant microphone captures a speech signal in a room, the signal is inevitably subjected to adverse acoustic effects, including background noise and reverberation. These effects can harm the quality of the observed speech signal, significantly degrading the performance of crucial applications like automatic speech recognition (ASR). To address this issue, extensive research has been conducted on speech dereverberation. The primary objective of speech dereverberation is to eliminate or reduce late reflections in the observed speech signal. It is well known that the early reflections are not harmful and, in some cases, might even improve speech intelligibility [1][2][3], whereas the late reflections are a significant contributing factor to the degradation in speech quality and intelligibility [4][5][6]. By mitigating the effects of reverberation, dereverberation techniques aim to restore the clarity and intelligibility of the captured speech, ultimately enhancing the performance of various speech processing applications. As a result, developing efficient and reliable dereverberation methods plays a vital role in advancing the field of speech processing and facilitating the deployment of robust speech-based applications in diverse settings.
Over the years, numerous dereverberation methods have been developed, employing different approaches to address the challenge of reverberation in speech signals [7][8][9][10]. One prominent approach is beamforming, which leverages an array of microphones to enhance the desired speech signal while suppressing unwanted background noise and reverberation. Notable beamforming methods include the Minimum Variance Distortionless Response (MVDR) beamformer [11] and its two-stage variant [12]. These methods estimate the optimal weights for the microphone array to enhance the desired speech source while attenuating the reverberant and noise components. Spectral enhancement methods have also been widely explored in dereverberation research. These methods operate in the frequency domain and aim to enhance the desired speech signal by modifying its spectral characteristics. One example is the spectral subtraction method [13,14], which estimates the noise and reverberation spectra and subtracts them from the observed signal to enhance the speech component. Other spectral enhancement approaches employ advanced signal processing methods such as Wiener filtering and statistical modeling to separate the desired speech from the interfering components [15].
Another category of dereverberation methods focuses on estimating an inverse filter to predict and suppress the late reflections present in the observed speech signal [16][17][18][19][20]. One notable method in this category that has gained significant attention in the field of speech processing is the weighted prediction error (WPE) method [19,20]. WPE is based on linear prediction (LP), utilizing an inverse filter to estimate and suppress the late reflections in the observed speech signal. By exploiting the statistical properties of the reverberation, WPE effectively separates the desired speech component from the reverberant component. The method estimates the optimal filter coefficients by minimizing the prediction error between the observed signal and its predicted version. WPE has proven highly effective in various applications [21,22]. Due to its effectiveness, WPE has been extensively studied, leading to the proposal of numerous extensions, generalizations, and variants. For instance, in [20], the model formulation, which assumed a single speech source, was generalized to an arbitrary number of sources. Other extensions and generalizations include the employment of deep neural networks [23,24], switching mechanisms [25], and Kronecker product filtering to reduce the computational complexity [26].
In most applications, the WPE method is implemented in the short-time Fourier transform (STFT) domain, which provides a suitable framework for the analysis and processing of the observed speech signal. By decomposing the time-domain observed signal into subbands, WPE operates on these subbands individually in a frequency-band-wise manner. In the time domain, WPE models the observed speech signal as the result of a linear convolution between the clean speech signal and an unknown room impulse response (RIR). However, when transitioning to the STFT domain, the relationship between the observed and clean signals becomes more intricate. In the STFT domain, the observed subbands are influenced not only by their corresponding clean subbands but also by the adjacent subbands through the crossband filters [27,28]. The exact relation between each observed subband and the clean signal results from convolutions between all clean subbands and their corresponding crossband filters. These filters capture the interdependencies and interactions between different frequency bands. Considering the crossband filters therefore accounts for the influence and information exchange between adjacent frequency components, enabling more comprehensive modeling and processing of the observed speech signal.
However, the conventional WPE method in the STFT domain neglects the influence of crossband filters. It assumes that each observed subband is solely the result of a convolution between the corresponding clean subband and a convolutive transfer function (CTF), often referred to as the "band-to-band filter" [28]. In other words, the CTF approach approximates each observed subband as solely dependent on its corresponding clean subband. This simplification introduces an inherent error in the WPE model when operating in the STFT domain [19]. The approximation error resulting from neglecting the crossband filters can significantly impact the performance of speech processing methods in the STFT domain. Previous studies have explored the effect of this approximation error in the context of system identification methods [28,29]. For WPE-based speech dereverberation, however, only a preliminary study was conducted in [30]. This study had limited scope and did not provide a comprehensive analysis of the effectiveness of the WPE-based approach incorporating crossband filtering in various real-world scenarios.
Given the importance of accurately modeling the crossband filters, further investigation is necessary to understand their impact on the performance of WPE-based dereverberation. A comprehensive analysis of the effectiveness of the WPE approach with crossband filtering in diverse real-world scenarios is essential to shed light on the potential benefits and limitations of incorporating crossband filters into the dereverberation process. Such research will contribute to advancing the understanding and practical implementation of WPE-based methods, facilitating their optimization and broader application in real-world speech processing scenarios.
In this paper, we present an extension to the conventional WPE method that enhances the accuracy of the model approximation in the STFT domain by incorporating crossband filters. Our approach aims to capture the interdependencies between adjacent subbands and refine the estimation of the late reflections in the observed speech signal. By considering samples from neighboring subbands, we redefine the WPE observation vector and modify the prediction inverse filter to include both crossband and traditional band-to-band components. We explore two versions of the proposed method to investigate their effectiveness. The first version prioritizes accuracy by improving the model approximation, albeit at the expense of increased computational complexity. In contrast, the second version maintains the same computational complexity as the conventional WPE while enhancing the accuracy. Surprisingly, the second version demonstrates competitive performance compared to the first.
We conduct a series of experiments to validate the performance of our proposed versions. The results confirm that both versions surpass the conventional WPE in dereverberation performance. This highlights the significance of incorporating information from neighboring subbands in the STFT domain in improving dereverberation outcomes. Furthermore, our findings suggest that the early samples of the crossband components might offer greater efficacy in mitigating reverberation than the late samples of the band-to-band component. By presenting these experimental results, we provide empirical evidence supporting the effectiveness of our proposed extensions to the WPE method, offering valuable insights into the potential benefits of considering crossband filters for dereverberation tasks in the STFT domain.
The remainder of this paper is organized as follows. Section 2 presents the model and the problem. Section 3 describes the proposed method. Section 4 details the experimental setup and results. Section 5 concludes this work.

Signal Model and Crossband Filters
Our study considers an arbitrary room with a single speech source. Let $x(n) \in \mathbb{R}$ be the time-domain clean speech signal, and let $x_{f,t} \in \mathbb{C}$ be the STFT representation of $x(n)$, where $f = 0, \ldots, F-1$ and $t = 0, \ldots, T-1$ are the frequency and time bins, respectively. The speech signal is captured by an array of $M$ microphones. In this work, we assume that the background noise is negligible. Hence, the observed signal in the $m$-th microphone, $y^{(m)}(n)$, is given by

$$y^{(m)}(n) = \sum_{l} h^{(m)}(l)\, x(n-l), \tag{1}$$

where $n$ is the discrete-time index, and $h^{(m)}(n)$ is the RIR from the source $x(n)$ to the $m$-th microphone. Based on the analysis in [28], the relation between the clean signal $x_{f,t}$ and the observed signal $y^{(m)}_{f,t}$ in the STFT domain is given by

$$y^{(m)}_{f,t} = \sum_{f'=0}^{F-1} h^{(m)}_{f;f',t} * x_{f',t}, \tag{2}$$

where $*$ denotes a linear convolution with respect to the time index $t$, and the coefficients $h^{(m)}_{f;f',t} \in \mathbb{C}$ are derived from the time-domain RIR $h^{(m)}(n)$ and from the analysis and synthesis filters that are used to transform the signals from the time domain to the STFT domain and vice versa. Given a single frequency bin $f$, we consider the time sequence $h^{(m)}_{f;f,t}$ as the band-to-band filter, while the set of time sequences $\{h^{(m)}_{f;f',t}\}_{f' \neq f}$ are considered as the crossband filters associated with $f$.

Problem Formulation
In the conventional STFT-domain dereverberation problem, the CTF approximation is employed, i.e., the contribution of the crossband filters is neglected, and the observed signal $y^{(m)}_{f,t}$ is approximated as

$$y^{(m)}_{f,t} \approx \sum_{l=0}^{L-1} h^{(m)}_{f;f,l}\, x_{f,t-l}, \tag{3}$$

where we assume a length-$L$ band-to-band filter (or CTF) $h^{(m)}_{f;f,t}$. In the context of this paper, the goal of the dereverberation is to predict the late reflection component and subtract it from the observed signal, resulting in an enhanced signal $z^{(m)}_{f,t}$ that consists of the direct sound and the early reflections:

$$z^{(m)}_{f,t} = y^{(m)}_{f,t} - \sum_{l=D}^{L-1} h^{(m)}_{f;f,l}\, x_{f,t-l}, \tag{4}$$

where $0 < D < L$ is a predefined parameter that enables separation between early and late reflections.
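The split between early and late reflections under the CTF approximation can be illustrated for a single frequency bin with a short Python sketch. The function name `split_early_late` and the toy filter values are hypothetical and chosen only for illustration; frames before $t = 0$ are assumed to be zero.

```python
import numpy as np

def split_early_late(h, x, D):
    """Split the CTF output of one bin into early and late parts.

    h: band-to-band filter taps (length L), x: clean subband frames (length T),
    D: number of taps treated as direct sound and early reflections.
    """
    T = len(x)
    h_early = h.copy()
    h_early[D:] = 0                      # keep taps 0 .. D-1
    h_late = h - h_early                 # keep taps D .. L-1
    early = np.convolve(h_early, x)[:T]  # desired part of the observation
    late = np.convolve(h_late, x)[:T]    # part that dereverberation should remove
    return early, late

# Toy example: an impulse as the clean signal exposes the filter taps directly.
h = np.array([1.0, 0.5, 0.25, 0.125])
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
early, late = split_early_late(h, x, D=2)
```

By construction, `early + late` equals the full CTF output, so subtracting a perfect prediction of `late` from the observation leaves only the early part.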

Extension to Crossband Filters
The model described in (4) can be extended by employing the accurate model in (2). The enhanced signal is then given by

$$z^{(m)}_{f,t} = y^{(m)}_{f,t} - \sum_{f'=0}^{F-1} \sum_{l=D}^{L_{f'}-1} h^{(m)}_{f;f',l}\, x_{f',t-l}, \tag{5}$$

where $L_{f'}$ is the length of the crossband filter corresponding to frequency bin $f'$. In terms of computational complexity, the accurate model in (5) is expensive since it increases the complexity by a factor of $F$ compared to the CTF approximation in (4). The analysis in [28] shows that, in terms of energy, the band-to-band filter is more significant than the crossband filters, and the energy of $h_{f;f',t}$ decreases when $|f - f'|$ increases. Based on this observation, and to improve the accuracy of the CTF model in (3) at a relatively small cost in model complexity, we consider the contribution of the two nearest crossband filters, i.e., we approximate the observed and the enhanced signals as

$$y^{(m)}_{f,t} \approx \sum_{f'=f-1}^{f+1} \sum_{l=0}^{L_{|f-f'|}-1} h^{(m)}_{f;f',l}\, x_{f',t-l}, \tag{6}$$

$$z^{(m)}_{f,t} = y^{(m)}_{f,t} - \sum_{f'=f-1}^{f+1} \sum_{l=D}^{L_{|f-f'|}-1} h^{(m)}_{f;f',l}\, x_{f',t-l}, \tag{7}$$

where $L_0 := L_{bb}$ and $L_1 := L_{cb}$ are the lengths of the band-to-band filter and the crossband filter, respectively.

Conventional WPE
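The conventional WPE for multichannel input predicts the components of the late reflections based on the LP of an inverse filter [19]. To leverage the spatial information to improve the estimation of the late reflections, the inverse filter predicts the late reflections based on observations from all channels. More specifically, let $y^{(m)}_{f,t}$ be an observed signal in the STFT domain captured by the $m$-th microphone of an $M$-length microphone array, and let $g^{(m)}_f \in \mathbb{C}^{LM}$ be an $LM$-order prediction filter. The enhanced signal $z^{(m)}_{f,t}$ is obtained as

$$z^{(m)}_{f,t} = y^{(m)}_{f,t} - \big(g^{(m)}_f\big)^H y_{f,t;L}, \tag{8}$$

$$y_{f,t;L} = \big[\, y^{(1)}_{f,t-D}, \ldots, y^{(1)}_{f,t-D-L+1}, \ldots, y^{(M)}_{f,t-D}, \ldots, y^{(M)}_{f,t-D-L+1} \,\big]^T \in \mathbb{C}^{LM}, \tag{9}$$

where $(\cdot)^T$ and $(\cdot)^H$ represent the transpose and Hermitian transpose, respectively, and $D > 0$ is the predefined prediction delay. Note that given a frequency bin $f$, based on the definition of $y_{f,t;L}$ in (9), the enhanced signal is obtained using only information from samples of the observed signal in the band-to-band frequency bin, ignoring samples from the crossband frequency bins. For simplicity, we select an arbitrary value for $m$ and omit the microphone designation from this point onward.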
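To make the structure of the multichannel observation vector concrete, the following NumPy sketch stacks delayed frames of every channel for one frequency bin. The function name `delayed_observation`, the channel-by-channel ordering, and the zero-padding of frames before $t = 0$ are assumptions made for illustration.

```python
import numpy as np

def delayed_observation(Y_f, t, delay, num_taps):
    """Stack num_taps delayed frames of every channel for one frequency bin.

    Y_f has shape (M, T): M channels and T frames. The newest frame used is
    t - delay; frames with a negative index are treated as zero.
    Returns a vector of length M * num_taps, ordered channel by channel.
    """
    M, T = Y_f.shape
    out = np.zeros((M, num_taps), dtype=complex)
    for k in range(num_taps):
        idx = t - delay - k
        if 0 <= idx < T:
            out[:, k] = Y_f[:, idx]
    return out.reshape(-1)

# Toy example: two channels, six frames, with frame values 0..5 and 6..11.
Y_demo = np.arange(12, dtype=complex).reshape(2, 6)
v = delayed_observation(Y_demo, t=5, delay=3, num_taps=2)
```

In this toy case `v` contains frames 2 and 1 of the first channel followed by frames 2 and 1 of the second channel, so the current frame and the `delay - 1` frames before it never enter the prediction.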

Filter Estimation
The filter $g_f$ is estimated in a frequency-wise manner based on the maximum likelihood (ML) criterion, assuming that the signal $z_f := \{z_{f,t}\}_t$ follows a complex Gaussian distribution with zero mean and time-dependent variances $\lambda_f := \{\lambda_{f,t}\}_t$. The filter coefficients $g_f$ and the variances $\lambda_f$ are alternately estimated to minimize the following objective:

$$\mathcal{L}(g_f, \lambda_f) = \sum_{t=0}^{T-1} \left( \frac{\big| y_{f,t} - g_f^H y_{f,t;L} \big|^2}{\lambda_{f,t}} + \log \lambda_{f,t} \right). \tag{11}$$

The entire estimation process for the first channel (i.e., $m = 1$) is described in Algorithm 1. The extension to other channels is straightforward.

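Algorithm 1. Conventional WPE (first channel, $m = 1$).
Input: Observed multichannel signal in the STFT domain, $y_{f,t}$.
for each frequency bin $f$ do
  1. Initialize the variances $\lambda_{f,t}$.
  2. for $n = 1, \ldots, N$ do
       Compute: the weighted least-squares quantities of the objective for the current variances $\lambda_{f,t}$.
       Update: the filter $g_f$, and then the variances $\lambda_{f,t}$ from the resulting enhanced signal.
  3. end for
end for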
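The alternating estimation can be sketched for a single frequency bin as follows. This is a minimal NumPy illustration of the commonly used WPE iteration (variance-weighted correlation matrix and vector, followed by a linear solve), not a reproduction of Algorithm 1. The function name `wpe_one_bin`, the initialization of the variances from the observed power of the reference channel, the variance floor `eps`, and the small diagonal loading are all assumptions.

```python
import numpy as np

def wpe_one_bin(Y_f, num_taps=10, delay=3, num_iters=3, ref=0, eps=1e-8):
    """Dereverberate one frequency bin with alternating WPE updates.

    Y_f: complex array of shape (M, T), M channels and T frames of one bin.
    Returns (z, g): enhanced frames of channel `ref` and the prediction filter.
    """
    M, T = Y_f.shape
    K = M * num_taps
    # Row t of X is the delayed observation vector (frames before 0 are zero).
    X = np.zeros((T, K), dtype=complex)
    for m in range(M):
        for k in range(num_taps):
            shift = delay + k
            if shift < T:
                X[shift:, m * num_taps + k] = Y_f[m, :T - shift]
    y = Y_f[ref].astype(complex)
    z = y.copy()                                  # assumed start: z = y
    g = np.zeros(K, dtype=complex)
    for _ in range(num_iters):
        lam = np.maximum(np.abs(z) ** 2, eps)     # time-dependent variances
        Xw = X.T / lam                            # each frame scaled by 1 / lam
        R = Xw @ X.conj() + eps * np.eye(K)       # weighted correlation matrix
        p = Xw @ y.conj()                         # weighted correlation vector
        g = np.linalg.solve(R, p)                 # filter update
        z = y - X @ g.conj()                      # z_t = y_t - g^H x_t
    return z, g
```

A call such as `z, g = wpe_one_bin(Y_f, num_taps=15, delay=3, num_iters=3)` would be repeated independently for every frequency bin. The first `delay` frames of `z` equal the observation because no past frames are available for prediction there.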

WPE with Crossband Filtering
To consider the contribution of the two nearest crossband filters, we define a prediction filter $\tilde{g}_f \in \mathbb{C}^{\tilde{L}M}$, where $\tilde{L} = L_{bb} + 2L_{cb}$, and an observation vector:

$$\tilde{y}_{f,t;\tilde{L}} = \big[\, y^T_{f,t;L_{bb}},\; y^T_{f-1,t;L_{cb}},\; y^T_{f+1,t;L_{cb}} \,\big]^T, \tag{12}$$

where we consider $y_{f,t;L_{bb}}$ and $y_{f\pm1,t;L_{cb}}$ as the "band-to-band component" and "crossband components", respectively. The enhanced signal is now obtained as

$$z_{f,t} = y_{f,t} - \tilde{g}_f^H\, \tilde{y}_{f,t;\tilde{L}}.$$

The definition of $\tilde{y}_{f,t;\tilde{L}}$ in (12) forces the enhanced signal to take into account samples from the two nearest crossband frequency bins in addition to the samples from the band-to-band frequency bin. The filter coefficients $\tilde{g}_f$ are estimated according to the method described in Section 3.1.1 and in Algorithm 1. Here, the term $g_f^H y_{f,t;L}$ in (11) is substituted with the term $\tilde{g}_f^H \tilde{y}_{f,t;\tilde{L}}$.

We consider two versions of WPE with crossband filtering. In the first version, we fix $L_{bb} = L$. The parameter $L_{cb}$ then controls the tradeoff between model complexity and model accuracy. When $L_{cb} = 0$, the proposed method is equivalent to the conventional WPE, and the length of $\tilde{g}_f$ is equal to the length of $g_f$. When $L_{cb}$ increases, the accuracy of the model approximation increases, but so does the length of $\tilde{g}_f$, resulting in larger computational complexity.

In the second version, we fix $\tilde{L} = L$. Here, $L_{bb}$ decreases when $L_{cb}$ increases, meaning that early samples from crossband components are taken into account instead of late samples from the band-to-band component. This setup reduces the accuracy of the band-to-band model approximation but maintains fixed computational complexity.
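The extended observation vector can be assembled from the band-to-band bin and its two neighbors, as in the following NumPy sketch. The names `taps` and `crossband_vector`, the stacking order, and the treatment of the edge bins (a missing neighbor contributes zeros) are assumptions, since the handling of $f = 0$ and $f = F - 1$ is not specified here.

```python
import numpy as np

def taps(Y, f, t, delay, n):
    """n delayed frames of every channel in bin f; Y has shape (F, M, T)."""
    F, M, T = Y.shape
    out = np.zeros((M, n), dtype=complex)
    if 0 <= f < F:                       # assumed: bins outside the range give zeros
        for k in range(n):
            idx = t - delay - k
            if 0 <= idx < T:
                out[:, k] = Y[f, :, idx]
    return out.reshape(-1)

def crossband_vector(Y, f, t, delay, l_bb, l_cb):
    """Band-to-band taps of bin f followed by crossband taps of bins f-1 and f+1."""
    return np.concatenate([
        taps(Y, f, t, delay, l_bb),      # band-to-band component
        taps(Y, f - 1, t, delay, l_cb),  # lower crossband component
        taps(Y, f + 1, t, delay, l_cb),  # upper crossband component
    ])

# Toy example: 3 bins, 2 channels, 8 frames.
Y_demo = np.arange(48, dtype=complex).reshape(3, 2, 8)
v = crossband_vector(Y_demo, f=1, t=7, delay=3, l_bb=2, l_cb=1)
```

The resulting vector has length $M(L_{bb} + 2L_{cb})$, and with `l_cb=0` it reduces to the conventional band-to-band observation vector, so the same filter estimation routine can be reused unchanged.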

Data and Setup
To validate the performance of the proposed method, we collected a dataset of 10 clean speech signals from the Deep Noise Suppression (DNS) challenge dataset [31]. To emulate realistic acoustic conditions, we generated acoustic channel RIRs using the image model method [32]. The reverberation levels were controlled by adjusting the wall reflection coefficient parameter. The experimental setup consisted of a uniform linear array with four microphones positioned in a room measuring 6 m by 8 m by 3 m. The speaker was located at coordinates (5, 4, 1.7), while the microphones were placed along the x-axis at (x, 2, 1.6), with x uniformly spaced from 2.936 to 2.999. A reverberation time (T60) of approximately 300 ms was attained by configuring the absorption coefficient of the room's walls. This choice was informed by our observation that lower values of T60 resulted in a relatively inconspicuous reverberation effect, accompanied by an insubstantial enhancement through the proposed method. Furthermore, various room configurations were explored, including changes in the microphone array's spatial arrangement, the inter-microphone spacing (from 1 cm to 4 cm), and offsets of both the microphone array and the speaker's position along the x-axis and y-axis. Despite these deliberate modifications, the experimental outcomes were consistent across configurations. Given this consistency, we present the outcomes from a representative configuration for brevity and clarity. The clean speech signals and RIRs were sampled at 16 kHz. The multichannel observed signals were generated by convolving the RIRs with the single-channel clean speech signals. Spectral analysis was performed using STFTs with a 512-length Blackman window and a shift of 128 samples between frames. With this experimental setup, which includes diverse speech signals, accurate RIR generation, and careful control of reverberation levels, we aimed to establish a reliable basis for evaluating the proposed method's performance.

When performing WPE, we set the predefined prediction delay to D = 3 to maintain consistency across the experiments. Using Algorithm 1, we empirically determined that N = 3 iterations are adequate to achieve the convergence of the filter coefficients and the variances. Based on the definition of the prediction filters of order $L$ and $\tilde{L}$ in Section 3, we conducted two types of experiments.
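The analysis stage described above can be approximated with a few lines of NumPy. The sketch below is an assumption-laden illustration: the function name `stft` is hypothetical, no padding is applied at the signal edges, and only the non-negative frequencies are kept, none of which is specified in the text.

```python
import numpy as np

def stft(x, win_len=512, hop=128):
    """STFT with a Blackman window; returns an array of shape (F, T).

    Only frames that fit entirely inside x are used (no edge padding), and
    only non-negative frequencies are kept, so F = win_len // 2 + 1.
    """
    window = np.blackman(win_len)
    num_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(num_frames)])
    return np.fft.rfft(frames, axis=1).T

# One second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
S = stft(tone)
```

With these settings each frequency bin spans 16000 / 512 = 31.25 Hz, so the 1 kHz tone falls exactly on bin 32. A reverberant test signal for one microphone could then be formed as `np.convolve(x, h)` for a clean signal `x` and an RIR `h` before applying `stft`.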

1. Length extension (Ext.): In this first version of the proposed method, denoted as the "length extension", we set the parameter $L_{bb}$ equal to $L$. With this choice, the samples from the crossband components are added on top of the full band-to-band component, which introduces additional computational complexity. This experiment aimed to demonstrate the significance of the information contained in the crossband components in enhancing the dereverberation performance. While it increased the computational complexity, this approach allowed us to assess the full potential of the proposed method by leveraging the information-rich crossband components.

2. Length preservation (Pres.): In the second version of the proposed method, denoted as "length preservation", we set $\tilde{L}$ equal to $L$. To maintain the same computational complexity as the conventional WPE method, we discarded the two latest (most delayed) samples from the band-to-band component for every unit increase in $L_{cb}$, i.e., for every sample added to each of the two crossband components. This adjustment balances computational efficiency against the tradeoff between the temporal characteristics of the crossband and band-to-band components. The experiment thereby enabled us to assess the relative significance of early samples from the crossband components and late samples from the band-to-band component in the dereverberation process.
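The two experiment types differ only in how the band-to-band and crossband lengths are chosen for a given conventional filter length $L$. The following Python sketch makes the bookkeeping explicit; the function name `filter_lengths` and the mode labels `"ext"` and `"pres"` are hypothetical.

```python
def filter_lengths(L, l_cb, mode):
    """Return (l_bb, l_cb, total) per channel for a conventional length L.

    mode "ext":  length extension, l_bb stays at L and the total grows.
    mode "pres": length preservation, the total stays at L and l_bb shrinks.
    """
    if mode == "ext":
        l_bb = L
    elif mode == "pres":
        l_bb = L - 2 * l_cb          # two band-to-band taps dropped per crossband tap
        if l_bb < 0:
            raise ValueError("l_cb is too large for length preservation")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return l_bb, l_cb, l_bb + 2 * l_cb

ext = filter_lengths(15, 2, "ext")
pres = filter_lengths(20, 5, "pres")
```

For example, `filter_lengths(20, 5, "pres")` gives a band-to-band length of 10 with a total of 20 taps per channel, whereas `filter_lengths(15, 2, "ext")` keeps 15 band-to-band taps and grows the total to 19.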

Performance Measure
We varied the length of the crossband components $L_{cb}$ to examine how it affected the performance of the proposed method. We examined the performance in terms of three widely used measures for speech dereverberation [21,33]: the frequency-weighted segmental SNR (FWSegSNR) [34,35], the cepstral distance (CD) [36], and the perceptual evaluation of speech quality (PESQ) [37]. Given a clean ground-truth signal in the STFT domain $x_{f,t}$ and the corresponding enhanced signal $\hat{x}_{f,t}$, FWSegSNR was computed as follows:

$$\mathrm{FWSegSNR} = \frac{10}{T} \sum_{t=0}^{T-1} \frac{\sum_{f=0}^{F-1} w_{f,t} \log_{10} \dfrac{|x_{f,t}|^2}{\big(|x_{f,t}| - |\hat{x}_{f,t}|\big)^2}}{\sum_{f=0}^{F-1} w_{f,t}},$$

where $F$ and $T$ are the numbers of frequency bands and time frames, respectively, and $w_{f,t}$ is the weight assigned to the $f$-th frequency at the $t$-th frame. We set the weights $w_{f,t}$ according to the standard AI weights [38]. The CD measure is defined as

$$\mathrm{CD} = \frac{1}{T} \sum_{t=0}^{T-1} \sqrt{\sum_{m} \big( C_x(m,t) - C_{\hat{x}}(m,t) \big)^2},$$

where $C_x(m,t)$ is the cepstral coefficient of the $m$-th Mel band of $x_{f,t}$ [36]. It is worth noting that a universally accepted suite of objective quality measures has yet to be fully established within the dereverberation landscape [33]. Given this ongoing evolution, our choice of performance measures aimed to shed light on the relative strengths and limitations of various approaches. For FWSegSNR and PESQ, larger values indicate better dereverberation performance. For CD, smaller values indicate better performance. To highlight the effectiveness of the method, we considered the "gain" with respect to the observed signal, i.e., instead of presenting the absolute values of the measures, we report the following ratios:

$$\frac{\mathrm{FWSegSNR}}{\mathrm{FWSegSNR}_{\mathrm{observed}}}, \qquad \frac{\mathrm{CD}_{\mathrm{observed}}}{\mathrm{CD}}, \qquad \frac{\mathrm{PESQ}}{\mathrm{PESQ}_{\mathrm{observed}}},$$

where $(\cdot)_{\mathrm{observed}}$ is the performance measure when considering the observed signal instead of the enhanced signal. Based on this definition, larger values indicate better dereverberation performance across all measures. Values smaller than 1 indicate a degradation in performance. The performance gain was computed individually for each of the 10 speakers. The scores depict the mean improvement across these 10 speakers and the corresponding standard deviation.
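The gain computation and its aggregation across speakers can be sketched in a few lines of Python. The function names `gain` and `summarize` are hypothetical, the numbers in the example are made up purely for illustration, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

def gain(enhanced, observed, lower_is_better=False):
    """Ratio-type gain; values above 1 mean the enhanced signal scores better."""
    if lower_is_better:               # e.g., CD, where smaller values are better
        return observed / enhanced
    return enhanced / observed        # e.g., FWSegSNR and PESQ

def summarize(gains):
    """Mean and sample standard deviation of per-speaker gains."""
    return mean(gains), stdev(gains)

# Made-up per-speaker PESQ scores, for illustration only.
pesq_observed = [2.0, 2.5, 2.0]
pesq_enhanced = [2.0, 5.0, 6.0]
per_speaker = [gain(e, o) for e, o in zip(pesq_enhanced, pesq_observed)]
avg, spread = summarize(per_speaker)
```

Inverting the ratio for CD keeps the convention that a gain above 1 always indicates an improvement, which allows the three measures to be reported on a common footing.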

Optimal Band-to-Band Length
To begin our investigation, we performed a series of simulated experiments on the conventional WPE method to identify the optimal filter length $L_{bb}$ within the specific room configuration under consideration. In this set of experiments, we systematically varied the value of $L_{bb}$ in the range of 5 to 25 while keeping the crossband filter length $L_{cb}$ fixed at 0.
The results of these experiments are presented in Figure 1, which shows the scores obtained for each measure across the range of $L_{bb}$. The optimal filter length varies across performance measures. Specifically, the FWSegSNR measure attains its peak performance with $L_{bb} = 15$, while the CD measure achieves its optimal result at $L_{bb} = 14$. On the other hand, the PESQ measure demonstrates its best performance when $L_{bb}$ is set to 20.
Building upon these findings, we conducted further experiments, focusing exclusively on the optimal values of $L_{bb}$. Consequently, we set $L_{bb}$ to 14, 15, and 20, allowing us to thoroughly compare the performance of the proposed and conventional WPE methods under these specific settings.

Crossband Filtering-Length Extension
To thoroughly explore the impact of different crossband filter lengths ($L_{cb}$) in conjunction with various choices of $L_{bb}$, we systematically varied $L_{cb}$ within the range of 0 to $L_{bb}$, while keeping $L_{bb}$ fixed for each specific experiment. Results show that early crossband samples indeed improve the dereverberation performance, and, in all cases (i.e., for each measure and each choice of $L_{bb}$), the optimal performance is achieved for $L_{cb} > 0$. To our surprise, introducing late samples from the crossband components leads to a decrease in performance, even when the length of the band-to-band component remains fixed. This finding suggests that late samples of the crossband components may have a detrimental effect on the overall performance, even compared to the conventional WPE. Drawing from this observation and the outcomes presented in Figure 1, we suggest that both the conventional and proposed WPE methods attain optimal performance at a specific length choice. Beyond this optimal value, the performance deteriorates, despite the availability of additional information for dereverberation.

Optimal FWSegSNR ($L_{bb} = 15$)
Figure 2a,c shows the impact of incorporating the first crossband component on the dereverberation performance in terms of FWSegSNR and PESQ. Surprisingly, introducing the first sample of the crossband component leads to a decrease in performance compared to the conventional WPE method. Conversely, when considering the CD measure (as illustrated in Figure 2b), adding the first sample of the crossband component improves the dereverberation performance. Further analysis reveals that the optimal performance, in terms of FWSegSNR and CD, is achieved when $L_{cb}$ is set to 2, as shown in Table 1. On the other hand, the optimal performance in terms of PESQ is attained when $L_{cb}$ is set to 3. It is worth noting that for $L_{cb} > 6$, the performance declines below the level achieved by the conventional WPE method. These findings shed light on the intricate relationship between different choices of $L_{cb}$ and their impact on the overall dereverberation performance.

Optimal CD ($L_{bb} = 14$)
The obtained results are shown in Figure 3a-c. Notably, a remarkable improvement in performance is observed from the first sample of the crossband component, as evidenced by the enhancement in all measured metrics. Furthermore, as indicated in Table 1, the optimal performance, both in terms of FWSegSNR and CD, is achieved when $L_{cb}$ is set to 2. Similarly, for optimal performance in terms of PESQ, a value of $L_{cb} = 3$ is identified. It is worth highlighting that for values of $L_{cb}$ exceeding 7, a noticeable decline in performance is observed compared to the conventional WPE method. This observation further underscores the importance of carefully selecting an appropriate value for $L_{cb}$ to achieve optimal dereverberation results. The presented findings shed light on the effectiveness of integrating the first crossband component and its significant impact in improving the dereverberation performance across various evaluation metrics.

Optimal PESQ ($L_{bb} = 20$)
The obtained results are depicted in Figure 4a-c, providing insights into the performance characteristics when considering $L_{bb} = 20$. The observed behavior closely resembles the findings discussed in Section 4.4.1, where the inclusion of the first sample of the crossband component initially leads to a degradation in performance. However, a performance improvement becomes evident from the second sample onward. Table 1 reveals that the optimal gain in performance coincides with the choices identified in Section 4.4.1. However, the optimal gain values are slightly lower for the cases of FWSegSNR and CD. These findings underscore the consistent impact of the crossband component and its potential to enhance the dereverberation performance, albeit with some variation in the optimal gain values across different evaluation metrics.

Crossband Filtering-Length Preservation
To evaluate the performance, we systematically varied the value of $L_{cb}$ for each chosen $L_{bb}$. In this setup, we discarded two late samples from the band-to-band component for each increment in the crossband samples in order to maintain fixed computational complexity. Specifically, we explored the range of $L_{cb}$ from 0 to $\lfloor L_{bb}/3 \rfloor$, where $\lfloor \cdot \rfloor$ denotes the floor function, ensuring that the band-to-band component retained its significance relative to the crossband components. Notably, introducing early crossband samples led to an overall improvement in dereverberation performance across all measured criteria. This finding underscores the effectiveness of incorporating the information within the crossband components to enhance the quality of the dereverberated speech signals.

Optimal FWSegSNR ($L_{bb} = 15$)
The obtained results are illustrated in Figure 5a-c. The optimal gains and corresponding values of $L_{cb}$ are summarized in Table 2. Surprisingly, the observed performance is highly competitive with the outcomes presented in Section 4.4.1, despite using less data for the dereverberation process. Notably, the method even exhibits an improvement in terms of CD compared to the length extension approach. Optimal performance, in terms of FWSegSNR, is achieved when $L_{cb} = 2$, while, for CD and PESQ, the optimal values are obtained with $L_{cb} = 3$. These findings highlight the efficacy of incorporating early crossband samples, demonstrating their valuable contribution in improving dereverberation outcomes.

Optimal CD ($L_{bb} = 14$)
The obtained results are presented in Figure 6a-c and are summarized in Table 2. Interestingly, for all measures, the optimal gain is achieved when $L_{cb} = 2$. In this particular setup, the length preservation method outperforms the length extension method in terms of FWSegSNR and CD. However, regarding PESQ, the length extension method performs better. These findings emphasize the significance of considering the specific setup and context when evaluating the performance of different dereverberation methods, as their effectiveness may vary depending on the chosen measures and objectives.

Optimal PESQ ($L_{bb} = 20$)
Results are presented in Figure 7a-c and are summarized in Table 2. Notably, in Figure 7a, the optimal performance in terms of FWSegSNR is achieved when $L_{cb} = 5$, which corresponds to $L_{bb} = 10$. This finding is further supported by Figure 1a, which indicates that the performance in terms of FWSegSNR remains relatively stable for $L_{bb}$ in the range of 10 to 16. This suggests that the optimal performance in terms of FWSegSNR, when utilizing the crossband components, is achieved when $L_{bb}$ is in proximity to the optimal value obtained with the conventional WPE. A similar observation can be made for the CD metric, as depicted in Figures 1b and 7b. Furthermore, this experiment yielded the best overall performance in terms of PESQ, as shown in Table 2. These findings highlight the importance of carefully selecting the parameters and considering the specific objectives when evaluating the performance of dereverberation methods.

Discussion
The experiments with the length extension and length preservation methods have provided valuable insights into the performance of the WPE-based dereverberation approach. The results, as summarized in Tables 1 and 2, demonstrate the effectiveness of both methods in improving the performance compared to the conventional WPE while considering different aspects of the evaluation metrics. The length preservation method, which incorporates the early samples of crossband components, has shown competitive performance compared to the length extension method. Remarkably, the length preservation method achieves comparable or superior results across various evaluation metrics, including FWSegSNR, CD, and PESQ. This indicates that by utilizing the crossband components in an optimized manner, the length preservation approach offers an attractive alternative for dereverberation tasks. Notably, the length preservation method achieves these performance gains while maintaining the same computational complexity as the conventional WPE. This is a significant advantage, as it allows for efficient real-time implementation without sacrificing the quality of dereverberation results.
Overall, the findings highlight the importance of considering different approaches and parameters in the WPE-based dereverberation framework. The length preservation method presents a promising avenue for further exploration, offering competitive performance with improved computational efficiency. Further research can investigate the method's robustness across various real-world scenarios and explore potential optimizations to enhance its effectiveness in different reverberant environments.

Conclusions
Our investigation focused on exploring the impact of crossband filters in the STFT domain on WPE-based speech dereverberation. We introduced two extensions to the conventional WPE that specifically accounted for crossband filtering and demonstrated their effectiveness in enhancing the dereverberation performance. As expected, the first extension, which increased the model's complexity, improved the performance. The second extension, however, maintained the same model complexity as the conventional WPE and still exhibited notable performance improvements. This observation suggests that the early samples of the crossband components play a crucial role in dereverberation, surpassing the significance of the late samples from the band-to-band components. Surprisingly, the late samples of the crossband components had a detrimental effect on the dereverberation performance. To further advance this research area, future investigations can explore the impact of crossband filtering in more complex models, such as scenarios involving speaker switching or time-varying RIRs. Additionally, combining the proposed concept with other extensions of WPE, such as the Kronecker filtering extension, holds promise [26]. The combination of crossband and Kronecker filtering for WPE has the potential to reduce the computational complexity while simultaneously improving the performance, as demonstrated by the recent work on Kronecker filtering for WPE.

Table 1. Summary of experimental results (length extension).

Table 2. Summary of experimental results (length preservation).