1. Introduction
Individuals with hearing impairment often have trouble understanding the specific meaning of speech in their everyday lives. Researchers have attempted to solve this issue by developing wearable hearing aid devices, which are commonly used to balance the dynamic range to compensate for hearing loss [1]. However, many individuals find the functioning of hearing aids to be inadequate, mostly owing to the interference of noise with the speech signal entering the ear. In particular, only 23% of hearing-impaired (HI) (All the abbreviations used in this paper are listed in the Abbreviations) individuals use hearing aid devices [2,3]. The limitations associated with noisy speech in the context of hearing aids were reported more than 35 years ago [4] and have not yet been effectively addressed.
A potential solution is to use multiple microphones, which can improve the signal-to-noise ratio (SNR); however, this improvement is limited by several factors. In real-life situations, hearing aids cannot function adequately in environments involving multiple noise sources and high reverberation [5]. Moreover, the size of modern hearing aids is continually decreasing, so that only one or two microphones can be installed. Consequently, single-channel noise reduction algorithms have been developed to support speech perception for hearing aid users in complex listening conditions.
Speech enhancement algorithms improve the quality of the target speech signal by reducing the effects of background noise, thereby notably reducing the listening effort and fatigue that background noise imposes on the listener. Typically, this enhancement is realized by retaining the characteristics of speech through a short-term spectral amplitude (STSA) analysis, implemented using statistical techniques such as Wiener filters, minimum mean square error (MMSE) estimation, and MMSE log-spectral amplitude estimation, to improve the quality of noisy speech by suppressing additive background noise [5,6]. These techniques rely on an accurate estimate of the a priori SNR, obtained by estimating the noise power spectral densities (PSDs). Such statistical techniques can reduce the noise in speech under stationary noise conditions, using techniques such as the decision-directed (DD) approach [7,8]. However, it is difficult to reliably estimate the noise PSDs when speech is corrupted by non-stationary noise such as babble noise; in such cases, the a priori SNR estimates are often inaccurate, and noise components remain in the enhanced speech spectrum even after processing through the Wiener filter [9,10].
In recent decades, in order to find more effective solutions for speech enhancement, the research focus has shifted from the aforementioned STSA approaches to deep neural network (DNN) techniques [11,12]. DNN-based speech enhancement methods can notably outperform statistical approaches, although a more complex neural network architecture is required to achieve a higher performance in speech enhancement tasks. Nevertheless, such approaches involve prohibitively high computing costs and thus cannot be easily applied in portable hearing devices (e.g., hearing aids) that require low computational complexity for real-world implementation.
An alternative approach to extract clean speech from a noisy speech signal is to use a non-negative matrix factorization (NMF)-based signal decomposition technique [13,14]. In the NMF framework, the speech and noise spectral components are approximately reconstructed using non-negative speech and noise bases, respectively [15,16]. Nevertheless, even if a sufficiently large, universally representative speech database (DB) can be collected to construct the non-negative speech basis dictionary, it remains challenging to collect noise recordings that cover all real-world noise environments for the noise bases [16,17]. To address this problem, the framework must be trained online using noise bases derived from the currently observed noise, an approach that has been actively studied [16,17]. However, the performance of online training depends significantly on the accuracy of the estimation of noise intervals from the observed noisy speech, which represents another critical challenge [8].
In this context, it is desirable to develop an NMF-based approach that employs only the speech bases instead of explicitly estimating the noise bases. The NMF approach introduced in [18,19] reconstructed the clean speech spectral components from noisy speech by using only the bases from a speech dictionary with sparse activation. This non-negative sparse coding (NNSC) strategy was shown to be efficient for speech enhancement in a cochlear implant and improved the enhancement performance, especially in noisy environments [18]. However, although this method could be applied to cochlear implants, it does not necessarily improve the speech quality for other hearing device applications, as described further in Section 5. In general, speech consists of harmonic and non-harmonic components across a wide range of frequencies, and the noise spectrum is also spread relatively widely over frequency. Thus, part of the noise spectrum can be represented by the speech dictionary (or basis), which leaves residual noise in the estimated target speech.
Considering these aspects, this paper proposes an improved NNSC-based speech enhancement algorithm that reduces the residual noise, based on the principle that the residual noise components remaining after processing through the DD-based Wiener filter tend to be whitened [9,10]. Furthermore, as a statistical approach, NNSC identifies the basis and activation components under an additive Gaussian independent and identically distributed (i.i.d.) noise assumption [18]. In this context, the NNSC can reconstruct the target speech spectra of the Wiener filter output by using a pre-trained dictionary while removing the whitened residual noise components. In other words, the proposed method applies NNSC to the spectrum enhanced through Wiener filtering, thereby reducing the residual noise and minimizing the speech distortion relative to using only a Wiener filter or only the NNSC strategy. The objective is to enhance the speech quality rather than the speech intelligibility; in general, noise reduction strategies are highly correlated with improved speech quality, although they may not always lead to improved intelligibility [8].
Furthermore, the objective is to implement the speech enhancement algorithm on an auditory hearing device filter bank that satisfies unique conditions in terms of signal quality, computational complexity, and signal delay. In particular, the latency of auditory processing algorithms should not exceed 10 ms, to prevent a deterioration in the subjective listening experience [20,21,22,23,24]. Moreover, the approaches should have a low computational complexity due to the limited processing capacity and battery power of real-world portable devices [22,25,26]. To this end, many researchers employ a discrete Fourier transform (DFT)-based uniform polyphase filter bank, as it enables perfect reconstruction with low latency and can be expanded into non-uniform filter banks [20,21,26]. Furthermore, such filter banks can be implemented through a short-term Fourier transform (STFT), thereby allowing the integration of single-microphone noise reduction algorithms based on a fast Fourier transform (FFT) [22,26]. However, the literature on STFT-based single-channel noise reduction algorithms such as STSA, NNSC, and DNN methods for a uniform polyphase DFT filter bank is limited.
The remainder of this paper is structured as follows. Section 2 describes the uniform polyphase DFT filter bank used to implement the proposed wearable hearing device spectral gain enhancement method. Section 3 reviews a conventional spectral gain estimation method based on a Wiener filter with the DD approach. Section 4 describes the NNSC–Wiener filter for speech enhancement. Section 5 validates the efficiency of the proposed approach through the perceptual evaluation of speech quality (PESQ) [27] and comparisons with the NNSC [18], a two-stage Mel-warped Wiener filter [9], and a model-based Wiener filter [10]. Section 6 presents the concluding remarks.
2. Hearing Device Spectral Enhancement
An auditory filter bank must have equally spaced narrow frequency bands and at least 60 dB of stopband attenuation (a higher value is ideal) [21], as mentioned in Section 1. Furthermore, a filter bank must exhibit low computational complexity and a small time delay of less than 10 ms. These constraints can be satisfied using a uniform polyphase DFT filter bank implemented through the FFT. We introduce a filter bank of 32 channels with a time delay of 8 ms at a sampling rate of 16 kHz [21,28,29].
As shown in Figure 1, the filter bank is implemented by setting the number of channels, the downsample factor, and the FFT size so as to satisfy the oversampled perfect reconstruction condition with a time delay of 8 ms. The input frame signal is generated by buffering the input time-discrete signal into a column vector, where $(\cdot)^{T}$ denotes the transpose operator. By applying the prototype low-pass filter (LPF) and an FFT, each buffered frame is converted into complex-spectral values at each frequency bin and frame index. The prototype LPF is designed using the method described in [3]. The 128-sample prototype sequence and the form of its frequency-domain magnitude are illustrated in the upper and lower panels of Figure 2, respectively.
The enhanced spectral value is obtained by applying a spectral enhancement algorithm to the noisy spectral value in each frequency bin. Subsequently, the 16 down-sampled, denoised band signals can be extracted from the real part of the enhanced complex values; these signals are then utilized to obtain the power envelope of each band. The corresponding spectral output of the hearing aid algorithms, such as a dynamic range compressor and a feedback cancellation algorithm, applied to the enhanced spectral values can be converted into a time-domain frame signal through an inverse FFT [20,21,22]. Finally, the filter-bank-synthesized output signal is derived from the overlap-and-add operation of the LPF-applied frame signals.
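To make the analysis–synthesis structure concrete, the following Python sketch outlines a weighted overlap-add (WOLA) realization of a uniform DFT filter bank. The channel count (32), prototype length (128 taps), and sampling rate (16 kHz) follow the description above, whereas the hop factor, the equality of FFT size and channel count, and the windowed-sinc prototype are illustrative assumptions rather than the authors' exact design, so the reconstruction here is only approximate.

```python
import numpy as np

FS = 16000   # sampling rate (Hz)
K = 32       # number of channels; FFT size assumed equal to K here
L = 128      # prototype low-pass filter length (taps)
R = 16       # decimation (hop) factor, assumed for 2x oversampling

# Stand-in prototype LPF (the paper designs its own prototype); a
# Hann-windowed sinc is used here only for illustration.
n = np.arange(L) - (L - 1) / 2.0
proto = np.sinc(n / K) * np.hanning(L)
proto /= proto.sum()

def wola_analysis(frame_buf):
    """One analysis step: frame_buf holds the most recent L input samples."""
    windowed = frame_buf * proto                    # apply prototype LPF
    folded = windowed.reshape(L // K, K).sum(axis=0)  # time-alias to FFT length
    return np.fft.fft(folded)                       # K complex sub-band values

def wola_synthesis(spectrum):
    """One synthesis step: returns an L-sample segment for overlap-add."""
    block = np.real(np.fft.ifft(spectrum))
    return np.tile(block, L // K) * proto           # periodic extension + window

def process(x, gain_fn=lambda Y: Y):
    """Streaming skeleton: shift in R new samples per frame, overlap-add output.
    Output scaling/normalization is omitted for brevity."""
    buf = np.zeros(L)
    out = np.zeros(len(x) + L)
    for i in range(0, len(x) - R + 1, R):
        buf = np.roll(buf, -R)
        buf[-R:] = x[i:i + R]
        Y = wola_analysis(buf)                      # sub-band spectra
        out[i:i + L] += wola_synthesis(gain_fn(Y))  # apply spectral gain, resynthesize
    return out[:len(x)]
```

A spectral enhancement method such as the one proposed in Section 4 would be passed in as `gain_fn`, operating on the per-frame sub-band spectra.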
3. Conventional Spectral Gain Estimation
This section provides a review of a conventional spectral gain estimation method based on the STFT. When the target speech $s(n)$ is deteriorated by additive noise $d(n)$, the noisy speech $y(n)$ is related to them in the frequency domain as
$$Y(k,l) = S(k,l) + D(k,l),$$
where $S(k,l)$, $D(k,l)$, and $Y(k,l)$ are the spectral components of $s(n)$, $d(n)$, and $y(n)$, respectively, at frequency bin $k$ and frame $l$.
As shown in Figure 3, the spectral gain for speech enhancement, $G_{\mathrm{DD}}(k,l)$, attempts to estimate $S(k,l)$ in the form $\hat{S}(k,l) = G_{\mathrm{DD}}(k,l)\,Y(k,l)$. Here, $G_{\mathrm{DD}}(k,l)$ can be represented in the form of the following Wiener filter [8]:
$$G_{\mathrm{DD}}(k,l) = \frac{\hat{\xi}(k,l)}{1 + \hat{\xi}(k,l)}, \qquad (1)$$
where $\hat{\xi}(k,l)$ is the a priori SNR estimate, computed according to the DD approach [8,9,10]:
$$\hat{\xi}(k,l) = \max\!\left(\alpha\,\frac{|\hat{S}(k,l-1)|^{2}}{\hat{\lambda}_{d}(k,l)} + (1-\alpha)\bigl(\gamma(k,l)-1\bigr),\ \xi_{\min}\right), \qquad (2)$$
where $\xi_{\min}$ and $\hat{\lambda}_{d}(k,l)$ denote the prefixed minimal threshold value and the noise variance estimate, respectively. Moreover, the a posteriori SNR is given by
$$\gamma(k,l) = \frac{|Y(k,l)|^{2}}{\hat{\lambda}_{d}(k,l)}, \qquad (3)$$
and $\alpha$ in (2) is a smoothing parameter used to avoid sudden adjustments of the SNR estimate.
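To make Equations (1)–(3) concrete, a minimal per-frame Python sketch of the DD-based Wiener gain is given below; the smoothing parameter, the SNR floor, and the assumption that the noise PSD estimate is supplied externally are illustrative choices, not values taken from the paper.

```python
import numpy as np

ALPHA = 0.98              # DD smoothing parameter (illustrative value)
XI_MIN = 10 ** (-25 / 10)  # a priori SNR floor (illustrative value)

def dd_wiener_gain(Y, S_prev, noise_psd):
    """Y: noisy spectrum of the current frame, S_prev: enhanced spectrum of the
    previous frame, noise_psd: current noise variance estimate per bin."""
    gamma = np.abs(Y) ** 2 / noise_psd                    # a posteriori SNR, Eq. (3)
    xi = ALPHA * np.abs(S_prev) ** 2 / noise_psd \
         + (1.0 - ALPHA) * np.maximum(gamma - 1.0, 0.0)   # DD a priori SNR, Eq. (2)
    xi = np.maximum(xi, XI_MIN)                           # apply the minimal threshold
    G_dd = xi / (1.0 + xi)                                # Wiener gain, Eq. (1)
    return G_dd, G_dd * Y                                 # gain and enhanced spectrum
```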
Due to its simplicity and effectiveness, the DD estimate in (2) is commonly used to suppress noise components; however, because this value is obtained directly from the noisy speech $Y(k,l)$, it may be inaccurate in severely noisy environments [2]. Moreover, because the accuracy of $\hat{\xi}(k,l)$ is affected by the previous target speech estimate $\hat{S}(k,l-1)$, the error in estimating the target speech may propagate to the estimation error of the spectral gain $G_{\mathrm{DD}}(k,l)$, resulting in distortion of the estimated target speech. To address this problem, we integrate the DD method with the NNSC strategy to develop a novel spectral gain enhancement stage.
4. Proposed Spectral Gain Enhancement Driven by NNSC-Based Residual Noise Reduction
The proposed method is aimed at minimizing the residual noise remaining after DD-based Wiener filter processing. In particular, the NNSC approach, which assumes Gaussian independent and identically distributed noise, is used to remove the whitened residual noise. Subsequently, the NNSC reconstructs the target speech spectra of the Wiener filter output by using a pre-trained dictionary and removing the whitened residual noise components. In other words, the proposed method enhances the DD-based spectral gain in (1) through an NNSC technique to address the residual noise components that remain after applying $G_{\mathrm{DD}}(k,l)$.
As shown in Figure 4, in the first stage, the DD-based a priori SNR and spectral enhancement gain are estimated based on the Wiener filter described in Equation (1). In the second stage, the NNSC strategy is applied to enhance the spectral gain $G_{\mathrm{DD}}(k,l)$ estimated in the first stage.
To enhance $G_{\mathrm{DD}}(k,l)$ in (1) through the NNSC technique, the magnitudes of the Wiener filter output are first accumulated over all the frequency bins as a vector $\mathbf{x}_{l} = [\,|\hat{S}(0,l)|, \ldots, |\hat{S}(K-1,l)|\,]^{T}$, where $(\cdot)^{T}$ is the transpose operator and $K$ is the number of frequency bins. Subsequently, $\mathbf{x}_{l}$ can be expressed in terms of the pre-trained basis matrix $\mathbf{W}$ and an activation vector $\mathbf{h}_{l}$ as
$$\mathbf{x}_{l} = \mathbf{W}\mathbf{h}_{l} + \mathbf{r}_{l}, \qquad (4)$$
where $\mathbf{r}_{l}$ is the vector consisting of the residual noise components remaining after applying $G_{\mathrm{DD}}$ over all frequency bins, and the dimensions of $\mathbf{W}$ and $\mathbf{h}_{l}$ are determined by the numbers of frequency bins and basis vectors. $\mathbf{W}$ is trained from a universal speech DB under the assumption that it can reconstruct any clean target speech. The main task in the NNSC framework is to determine the activation $\mathbf{h}_{l}$ that minimizes the reconstruction error by minimizing the cost function with an L1 sparsity constraint [18,19]:
$$C(\mathbf{h}_{l}) = D(\mathbf{x}_{l}\,\|\,\mathbf{W}\mathbf{h}_{l}) + \lambda \|\mathbf{h}_{l}\|_{1}, \qquad (5)$$
where $D(\cdot\|\cdot)$ is either the Euclidean (EU) distance or the Kullback–Leibler (KL) divergence, and $\lambda$ is a sparseness control parameter. The term $\mathbf{h}_{l}$ in Equation (5) is estimated via random initialization and iterative multiplicative updating over a number of iterations (iter), in which the multiplication and division operations of Equation (6) are applied element-wise, until the error is minimized and converges. Finally, the NNSC-based spectral gain that attenuates the residual noise remaining after applying the Wiener filter is obtained in Equation (7) from the reconstruction $\mathbf{W}\mathbf{h}_{l}$, where $\varepsilon$ is a minimum value introduced to avoid a zero value in the numerator.
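As a concrete illustration of the activation estimation and the gain derivation, a minimal Python sketch follows. The multiplicative update shown is the standard one for the EU distance with an L1 penalty; the convergence test, the maximum iteration count, and the ratio-based form assumed for the NNSC gain of Equation (7) are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def nnsc_activation(x, W, lam=0.2, max_iter=50, tol=1e-4, rng=None):
    """Estimate a sparse activation h such that W @ h approximates x (Eq. (4)),
    by minimizing the EU cost with L1 sparsity (Eq. (5)) via multiplicative updates."""
    rng = np.random.default_rng() if rng is None else rng
    h = rng.random(W.shape[1]) + 1e-6                    # random initialization
    prev_cost = np.inf
    for _ in range(max_iter):
        # Element-wise multiplicative update (standard EU + L1 form)
        h *= (W.T @ x) / (W.T @ (W @ h) + lam + 1e-12)
        cost = 0.5 * np.sum((x - W @ h) ** 2) + lam * np.sum(h)
        if abs(prev_cost - cost) < tol * max(prev_cost, 1e-12):
            break                                        # stop once the cost converges
        prev_cost = cost
    return h

def nnsc_gain(x, W, h, eps=1e-3):
    """Assumed ratio form for the NNSC-based residual-noise gain of Eq. (7)."""
    recon = W @ h                                        # speech-only reconstruction
    return np.minimum((recon + eps) / (x + eps), 1.0)    # eps avoids a zero numerator
```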
Next, the gain obtained using Equation (7) is used to improve $G_{\mathrm{DD}}(k,l)$ in two ways. In the first approach, the spectral gain is redefined by applying the NNSC-based gain as a weight to the a priori SNR estimate $\hat{\xi}(k,l)$, as in Equation (9). In the second approach, the NNSC-based gain and $G_{\mathrm{DD}}(k,l)$ in Equation (1) are multiplicatively combined, as in Equation (10). According to the results of a preliminary speech enhancement experiment performed with Equations (9) and (10), the quality of the speech enhanced using one combination was comparable to or slightly higher than that obtained using the other for all the SNRs; the better-performing combination was therefore used for the performance evaluation described in Section 5.
Finally, the enhanced speech spectral estimate is calculated by applying the combined spectral gain to the noisy spectrum $Y(k,l)$; this estimate is subsequently used in Equations (2) and (3) to estimate the a priori SNR recursively.
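Combining the two stages, one possible per-frame flow is sketched below, reusing the hypothetical helpers from the previous sketches; the multiplicative combination of the two gains is assumed here, since the exact forms of Equations (9) and (10) are not reproduced above.

```python
import numpy as np

def enhance_frame(Y, S_prev, noise_psd, W_speech, lam=0.2):
    # Stage 1: DD-based Wiener gain and filtered spectrum (Eqs. (1)-(3))
    G_dd, S_wiener = dd_wiener_gain(Y, S_prev, noise_psd)

    # Stage 2: NNSC on the magnitudes of the Wiener filter output
    x = np.abs(S_wiener)
    h = nnsc_activation(x, W_speech, lam=lam)
    G_nnsc = nnsc_gain(x, W_speech, h)

    # Assumed multiplicative combination of the two gains
    G = G_dd * G_nnsc
    return G * Y          # enhanced spectral estimate, fed back for the next frame
```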
5. Performance Evaluation
The performance of the proposed hearing device spectral gain enhancement algorithm based on NNSC and the Wiener filter was evaluated by measuring the PESQ scores [27]. The test set comprised 240 speech utterances from the TIMIT DB [30] and four types of noise sources (Gaussian, babble, factory, and car) from the NOISEX-92 DB [31]. The noise signals were mixed with the target speech at SNRs ranging from −5 to 20 dB in steps of 5 dB. Each signal was sampled at 16 kHz and was segmented using the 128-point LPF shown in Figure 2; each segment had an overlap of one-eighth with the previous segment.
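One way to script this evaluation is sketched below; the pesq PyPI package (an ITU-T P.862 implementation) and the SNR-mixing helper are tooling assumptions, not the authors' scripts.

```python
import numpy as np
from pesq import pesq   # ITU-T P.862 wrapper (pip install pesq)

FS = 16000

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR."""
    noise = noise[:len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + noise

def evaluate(clean_list, noise, enhance_fn, snrs=(-5, 0, 5, 10, 15, 20)):
    """Average wideband PESQ score per SNR for a given enhancement function."""
    scores = {}
    for snr in snrs:
        vals = [pesq(FS, clean, enhance_fn(mix_at_snr(clean, noise, snr)), 'wb')
                for clean in clean_list]
        scores[snr] = float(np.mean(vals))
    return scores
```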
To implement the NNSC technique, 100 speech basis vectors were trained using speech utterances from the TIMIT DB; the training data included the speech of sixteen male and eight female speakers, with a duration of approximately 560 s. The number of speech basis vectors (100) was selected with reference to [16]. The speakers in the basis training set did not overlap with those in the test set. The noise variance $\hat{\lambda}_{d}(k,l)$ in Equations (2) and (3) was updated in the noise-only intervals, as in [8].
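A possible way to obtain the 100-vector speech basis is a standard NMF decomposition of clean-speech magnitude spectra, sketched below with scikit-learn; this choice of training algorithm is an assumption, as the paper does not specify one.

```python
from sklearn.decomposition import NMF

def train_speech_basis(clean_magnitudes, n_basis=100, seed=0):
    """clean_magnitudes: array of shape (n_frames, n_bins) with non-negative
    magnitude spectra of clean training speech (e.g., from the filter bank)."""
    model = NMF(n_components=n_basis, init='random', max_iter=500,
                random_state=seed)
    model.fit(clean_magnitudes)
    # components_ has shape (n_basis, n_bins); transpose to (n_bins, n_basis)
    return model.components_.T    # pre-trained basis matrix W_speech
```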
First, to determine the value of the sparseness parameter $\lambda$ in Equations (5) and (6) that optimizes the speech quality, the PESQ scores of the speech signals produced using the proposed method were measured while varying $\lambda$ from 0 to 1, as shown in Figure 5. The evaluation was performed using the training data, and the proposed method was implemented using the EU distance. According to the results, the proposed method achieved the highest PESQ scores averaged over all the SNRs when $\lambda$ was set to 0.2. Therefore, $\lambda$ was set to 0.2 in the subsequent experiments.
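The sweep of Figure 5 can be scripted along the following lines, reusing the hypothetical evaluate helper above; make_enhancer is an assumed factory that builds the proposed enhancer for a given sparseness value, and the grid step is illustrative.

```python
import numpy as np

def sweep_sparseness(clean_list, noise, make_enhancer,
                     lams=np.arange(0.0, 1.01, 0.1)):
    """Pick the sparseness value maximizing the PESQ score averaged over SNRs."""
    best_lam, best_score = None, -np.inf
    for lam in lams:
        scores = evaluate(clean_list, noise, make_enhancer(lam))
        avg = float(np.mean(list(scores.values())))
        if avg > best_score:
            best_lam, best_score = lam, avg
    return best_lam, best_score
```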
Second, the effect of the two different cost functions in Equation (5) on the speech enhancement performance was investigated. Table 1 presents a comparison of the PESQ scores of the conventional Wiener filter and the proposed method, with scores reported for the EU distance and the KL divergence in the proposed method. Both the KL-NNSC and EU-NNSC achieved higher PESQ scores than the Wiener filter for all the SNRs. At a high SNR (20 dB), the PESQ scores of the EU-NNSC and KL-NNSC were comparable. However, the EU-NNSC scores were significantly higher than those of the KL-NNSC at lower SNRs (from −5 to 15 dB). This finding indicates that the EU-NNSC is likely a more appropriate form for the proposed NNSC–Wiener filter method than the KL-NNSC form. In particular, the residual noise remaining after the application of the Wiener filter may be Gaussian-distributed [9,10], and the EU-NNSC identifies the basis and activation components under a Gaussian independent and identically distributed noise assumption [18,19].
Third, as described previously, the proposed method was implemented in an online auditory device filter bank framework. Thus, it was necessary to examine the difference in performance between the online and offline implementations of the proposed method. In the offline implementation, the NNSC strategy described in Section 4 was applied to each utterance instead of to each frame. The PESQ scores of the offline implementation were evaluated under the babble and Gaussian noise conditions. Table 2 presents a comparison of the PESQ scores for the online and offline implementations. The PESQ scores for the offline implementation were slightly higher than those for the online implementation under both noise conditions, because the offline implementation could obtain more accurate estimates of the activation vectors. Nevertheless, the performance difference between the online and offline implementations was minimal, indicating that the block size of the signal accumulated for the NNSC did not significantly affect the speech enhancement performance of the proposed method.
Fourth, the effectiveness of the proposed method in reducing the residual noise remaining after the application of the Wiener filter is demonstrated in Figure 6. Each black area represents the spectral magnitude of the babble or Gaussian noise of one frame, which was mixed with the clean speech at an SNR of 5 dB. The noise reduction attained by the Wiener filter and by the proposed method is indicated by the dark gray and light gray areas, respectively. The proposed method achieved a lower residual noise spectral magnitude than when only the Wiener filter was used.
Fifth, the spectrograms of the speech signal enhanced using the proposed and existing methods were compared, as shown in Figure 7. Figure 7a,b illustrates the spectrograms of the desired clean speech and its noise-contaminated version, respectively, at an SNR of 0 dB under the babble noise condition. Figure 7c,d shows the results obtained by applying the Wiener filter in (1) and the NNSC, respectively, to the signal in Figure 7b. Figure 7e shows the result obtained using the proposed NNSC–Wiener filter. The proposed method effectively suppressed the babble noise components. As shown in Figure 7c, some residual noise remained after processing through the Wiener filter. However, the proposed method successfully suppressed most of the noise spectra, as shown in Figure 7e.
Table 3 compares the PESQ scores [27] corresponding to the spectrograms presented in Figure 7. Although it was expected that the speech enhancement performance of the Wiener filter (Figure 7c) would be superior to that of the NNSC (Figure 7d), the PESQ scores for the two approaches were comparable. This is likely because the Wiener filter introduced target speech distortion in addition to the residual noise problem. The PESQ scores for the proposed method, in which both the NNSC strategy and the Wiener filter were applied, were considerably higher than those for the Wiener filter or the NNSC-based speech enhancement method alone. This finding demonstrates that the combination of the NNSC and the Wiener filter can enhance the performance by addressing both the target speech distortion and residual noise problems.
Finally, the speech enhancement performance of the proposed approach was assessed and compared with those of the original noisy speech (No), the NNSC technique (R1) [18], the two-stage Mel-warped Wiener filter (R2) [9], and the model-based Wiener filter (R3) [10]. In this experiment, four distinct types of noise were considered, and a statistical analysis was conducted using a Games–Howell criterion-based multiple-pair test. Table 4 presents a comparison of the PESQ scores averaged over the SNRs ranging from −5 to 10 dB. Among the reference methods, R2 exhibited the highest performance in Gaussian noise environments, and R3 exhibited the highest performance under the babble, factory, and car noise conditions. The proposed method outperformed all the reference approaches under all the noise conditions, except under the car noise condition, in which case its performance was statistically comparable to that of R3.
The number of iterations in Equation (6) required to estimate the activation vector is a crucial parameter in terms of the computational complexity for real-world applications. In our experiments, the number of iterations averaged over all frames was approximately 13 (minimum 3, maximum 39).
6. Conclusions
The proposed method was aimed at enhancing the conventional spectral Wiener filter approach in terms of the hearing device speech enhancement performance by introducing the NNSC approach to reduce the residual noise remaining after the application of the Wiener filter. To this end, the NNSC technique was combined with the a priori SNR estimate to enhance the gain attenuator of the Wiener filter. The spectral gain enhancement was implemented on a uniform polyphase DFT filter bank to fulfill the low computational complexity and processing delay criteria for hearing devices.
Subsequently, the performance of the proposed speech enhancement method was compared to those of the conventional Wiener filter, two-stage Mel-warped Wiener filter, and conventional NNSC method in terms of the PESQ scores and spectrograms. The results indicated that the proposed method produced significantly higher PESQ scores than the other methods for all the SNRs under four different noise conditions.
Moreover, given the small average number of NNSC iterations, it could be concluded that the proposed algorithm does not notably increase the computational costs. To further decrease the computational costs for a given target hearing device system, the NNSC stage could be applied selectively according to the noise environment. Nevertheless, the development of such an NNSC controller is beyond the scope of this work and should be addressed in future studies.
The current research on speech processing has been focused on DNN techniques. Notably, the DNN-based speech enhancement method can outperform the existing methods. Nevertheless, this approach involves substantially higher computing costs. Thus, it is difficult to implement DNN-based approaches in portable hearing devices that require a low computing complexity for real-world implementations. In this regard, it may be desirable to combine the existing Wiener filter and DNN approaches to enhance the performance in terms of both the speech quality and intelligibility.
Effective wearable hearing devices are also expected to be of notable value as a natural connection to other devices. Specifically, from the viewpoint of sustainable internet-of-things wearables, hearing wearables represent an essential element in recognizing user contexts to construct human-oriented environments [32,33,34]. This study was aimed at improving the speech quality enhancement performance of the existing Wiener filter, implemented through hearing wearable device filter bank algorithms. However, the relevant literature on NNSC- or DNN-based speech enhancement algorithms for such filter banks is limited. The presented findings may provide guidance toward achieving a satisfactory speech processing performance for successful wearable hearing devices.