Article

A Nested U-Network with Temporal Convolution for Monaural Speech Enhancement in Laser Hearing

1 School of Automation, Central South University, Changsha 410083, China
2 Xiangjiang Laboratory, Changsha 410205, China
* Author to whom correspondence should be addressed.
Modelling 2026, 7(1), 32; https://doi.org/10.3390/modelling7010032
Submission received: 23 October 2025 / Revised: 21 January 2026 / Accepted: 30 January 2026 / Published: 3 February 2026
(This article belongs to the Special Issue AI-Driven and Data-Driven Modelling in Acoustics and Vibration)

Abstract

The laser Doppler vibrometer (LDV) offers long-range, non-contact, and highly sensitive measurement, and plays an increasingly important role in industrial, military, and security fields. Remote speech acquisition technology based on LDV has progressed significantly in recent years. However, unlike microphone receivers, LDV-captured signals suffer severe signal distortion, which degrades the quality of the LDV-captured speech. This paper proposes a nested U-network with gated temporal convolution (TCNUNet) to enhance monaural speech captured by LDV. Specifically, the network is based on an encoder-decoder structure with skip connections and introduces nested U-Nets (NUNets) in the encoder to better reconstruct speech signals. In addition, a temporal convolutional network with a gating mechanism is inserted between the encoder and decoder: the gating mechanism helps to control the information flow, while the temporal convolution models long-range temporal dependencies. We designed an LDV monitoring system to collect and enhance voice signals remotely in a real-world environment, and collected different datasets from various target objects to fully validate the performance of the proposed network. Compared with baseline models, the proposed model achieves state-of-the-art performance. Finally, the results of a generalization experiment indicate that the proposed model also has a certain degree of generalization ability across languages.


1. Introduction

When sound propagates in the air, objects close to the sound source vibrate due to the sound pressure. However, the sound pressure decreases rapidly during propagation, which limits the working distance of traditional sensors. The laser Doppler vibrometer (LDV) has excellent performance in detecting small vibrations. An LDV can remotely collect the vibrations of objects and convert them into sound signals, which has great potential for application in remote acoustic detection [1]. In recent years, research on capturing speech signals using LDVs has received increasing attention. Peng et al. [2] developed an all-fiber LDV and achieved acoustic signal reconstruction at a distance of 23 m. Wu et al. [3] used a partial-fiber structure system to obtain comprehensible speech signals within a range of 50 m. Xu et al. [4] designed a long-distance LDV speech monitoring system that can reconstruct understandable speech at a distance of 200 m and used spectral subtraction [5] to improve speech quality. There are also studies focusing on the enhancement of LDV-captured speech signals. Li et al. [6] used Gaussian bandpass and Wiener filters to enhance the speech captured by an LDV. Zhang et al. [7] used band-pass filters and Wiener filtering methods to obtain comprehensible speech signals within a range of 75 m. Lv et al. [8] developed a dual-mode system to simultaneously capture video and audio signals and used the optimal modified log-spectral amplitude (OM-LSA) algorithm to improve the intelligibility of speech signals. Afterward, Lv et al. [9] used a linear prediction algorithm to remove speckle noise from LDV-captured signals. Peng et al. [10] used two LDVs to reduce reverberation and noise in the collected speech signals. Wang et al. [11] used linear prediction coding and adaptive spectral subtraction methods to measure speech at a distance of 310 m in an outdoor environment using an LDV. However, to our knowledge, only a few studies have combined deep learning (DL) methods with LDV-captured signals. Xie et al. [12] proposed the use of deep neural networks (DNNs) to achieve automatic speech recognition of speech captured by the LDV. Peng et al. [13] collected speech samples from various target objects, used the OM-LSA algorithm to enhance speech, and used machine learning methods to recognize remote speakers. Cai et al. [14] used a two-stage DNN to enhance the speech captured by LDV. Cai's method successfully applied DNNs to LDV speech enhancement, establishing the baseline viability of data-driven approaches in this domain. Our primary contribution, in contrast, is the novel TCNUNet architecture, designed to address the unique challenges of LDV-captured speech, such as complex noise patterns and subtle vibrational distortions.
DL-based methods treat speech enhancement as a supervised learning problem [15]. Recently, speech enhancement based on DL methods has become popular. According to the signal domain in which they operate, speech enhancement methods are divided into time-frequency domain methods [16] and time-domain methods [17]. Time-frequency domain methods are further divided into mask-based and mapping-based methods. Mask-based methods estimate a time-frequency mask relating clean and noisy speech, such as the ideal ratio mask (IRM) [18] and the complex ratio mask (CRM) [19]. Mapping-based methods directly learn the magnitude spectrum [20], logarithmic power spectrum, or real and imaginary spectra [21] of clean speech from noisy speech. Among the available architectures, the U-network (U-Net) is a popular one that can be applied to either type of method. U-Net is an encoder-decoder architecture: multiple encoding layers decompose the noisy speech signal into different scales to extract more feature information, and the same number of decoding layers then reconstructs a clean speech signal. This approach fully utilizes information from different scales and improves the performance and efficiency of the model. Many studies have applied U-Net to speech enhancement and achieved good performance. Pandey et al. [22] proposed a fully convolutional neural network for time-domain speech enhancement. Tolooshams et al. [23] introduced dense connections and channel attention in U-Net to achieve multi-channel speech enhancement. Fu et al. [24] proposed a U-architecture-based network for simultaneous speech enhancement and dereverberation. Xu et al. [25] proposed a speech enhancement network based on cooperative attention to trade off local and non-local attention. To further improve speech quality, many studies have refined the U-Net structure. Zhang et al. [26] introduced a gating mechanism in place of standard 2-D convolution and adopted a dual-branch output to improve noisy-reverberant speech. Zhu et al. [27] inserted a temporal convolutional module (TCM) between the encoder and decoder to model the long-term dependencies of speech signals and implemented end-to-end speech enhancement. Fan et al. [28] combined nested U-structures and TCMs and used multi-domain processing to enhance speech for better performance. Xiang et al. [29] introduced a gating mechanism in temporal convolutional networks (TCNs) and proposed a single-channel speech enhancement framework that combines waveform and magnitude processing, achieving the complementary advantages of time-domain and time-frequency features.
In this paper, we propose a nested U-Net with gated temporal convolution (TCNUNet) to achieve monaural speech enhancement based on LDV. We extract the magnitude spectrum features of noisy speech signals and feed them into TCNUNet for training. The TCNUNet model consists of three modules, an encoder, a bottleneck layer, and a decoder, with skip connections between the encoder and decoder. The encoding and decoding layers use gated 2-D convolutions instead of standard 2-D convolutions; gated convolution helps improve the propagation of the information flow. In addition, nested U-Nets (NUNets) of different depths are introduced after the encoding layers to further extract feature map information at different scales. In the bottleneck layer, three gated temporal convolution modules (GTCMs) are employed to model long-range temporal dependencies. The GTCM adds two gates to the TCM to control the number of channels and uses depthwise convolution to reduce network parameters. The proposed model outputs an IRM, from which the enhanced speech spectrogram is obtained. Finally, the estimated speech spectrogram is transformed back into the waveform domain, and the overlap-and-add method is used to obtain the enhanced speech waveform. We designed an LDV monitoring system to remotely capture voice signals in a real-world environment and collected different datasets from various target objects to fully validate the performance of the proposed network. Compared with baseline models, the proposed model achieves state-of-the-art performance. Finally, we also conducted generalization testing; the experimental results indicate that the proposed model has a certain degree of generalization ability for speakers of different languages. The main contributions of this paper are summarized as follows:
  • We propose a nested U-network with gated temporal convolution for monaural speech enhancement.
  • In a real-world environment, we designed a remote speech acquisition system based on LDV to obtain enhanced speech. We used various target objects and different corpora to demonstrate the effectiveness and generalization ability of the proposed method.
The rest of this paper is organized as follows. In Section 2, we analyze the signals captured by LDV and describe the proposed model and its construction modules. In Section 3, we construct different datasets and provide experimental configurations and evaluation methods. In Section 4, the results of the experiments are compared and analyzed. The conclusion is presented in Section 5.

2. Method

2.1. Analysis of Signals Captured by LDV

The monaural noisy speech signal captured by a microphone can be modeled in the time domain as
y(m) = s(m) + n(m)  (1)
where y(m), s(m), and n(m) denote the noisy, clean, and noise signals, respectively, at time sample index m. Due to its special structure, a microphone has an excellent frequency response, so signal distortion need not be considered. However, unlike microphones, the target objects monitored by LDVs have various natural frequency response characteristics [30,31], which cause severe signal distortion in the captured speech signals. Therefore, the monaural noisy speech signal captured by an LDV can be further modeled as
y(m) = s(m) ∗ h(m) + n(m)  (2)
where h(m) denotes the natural frequency response of the target object surface, and ∗ denotes the convolution operation. Applying the short-time Fourier transform (STFT) to obtain a time-frequency representation of the noisy speech signal gives
Y(t, f) = S(t, f) H(t, f) + N(t, f)  (3)
where Y, S, N, and H represent the STFTs of y, s, n, and h, respectively, and t and f represent the time frame index and the frequency bin index, respectively. In the polar coordinate system, (3) becomes
|Y(t, f)| e^{iφ_Y} = |S(t, f)| |H(t, f)| e^{i(φ_S + φ_H)} + |N(t, f)| e^{iφ_N}  (4)
where |·| denotes the magnitude response, φ denotes the phase response, and i is the imaginary unit. Figure 1 shows the speech signal collected from a cardboard box using an LDV. In addition to interference from additive noise, the speech signal is also distorted by the target object's natural frequency response. As a result, the high signal-to-noise ratio (SNR) band of the speech signal is concentrated below 2 kHz, while the speech signal in the 2–8 kHz band is masked by additive noise.
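To make the signal model concrete, the following minimal NumPy/SciPy sketch simulates (2) and computes the STFT magnitude of (3) with the same 320-point, 160-hop analysis used later in Section 3.2. The white-noise excitation standing in for clean speech, the low-pass FIR standing in for the object response h(m), and the noise level are illustrative assumptions, not measured quantities.

```python
import numpy as np
from scipy.signal import stft, firwin

fs = 16000                               # sampling rate used throughout the paper
rng = np.random.default_rng(0)
s = rng.standard_normal(fs)              # placeholder for 1 s of clean speech s(m)

# Toy surrogate for the object's natural frequency response h(m):
# a low-pass FIR that concentrates energy below 2 kHz (cf. Figure 1).
h = firwin(numtaps=64, cutoff=2000, fs=fs)

n = 0.05 * rng.standard_normal(fs)       # additive noise n(m)
y = np.convolve(s, h, mode="same") + n   # y(m) = s(m) * h(m) + n(m), equation (2)

# 320-point STFT with a Hann window and 160-sample hop (Section 3.2),
# giving 161 frequency bins per frame: Y(t, f) of equation (3).
f, t, Y = stft(y, fs=fs, window="hann", nperseg=320, noverlap=160)
magnitude, phase = np.abs(Y), np.angle(Y)   # |Y| is the network input; the phase is reused at synthesis
print(magnitude.shape)                      # (161, num_frames)
```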

2.2. Network Architecture

The architecture of the proposed TCNUNet model is shown in Figure 2. The noisy speech waveform is transformed into the STFT domain and fed to TCNUNet for training. The magnitude spectrum of the noisy speech is used as the model input, and the model outputs the IRM of the magnitude spectrum. The noisy speech spectrogram is element-wise multiplied with the IRM to obtain the enhanced speech spectrogram. Note that the proposed model enhances only the magnitude spectrum, while keeping the noisy phase spectrum unchanged. Accurate phase recovery is difficult in real-world conditions, and inaccurate phase estimates can themselves degrade quality: as discussed in the Evaluation section, baseline models that reconstruct the complex spectrum, such as GCRN, can suffer degradation on some metrics due to inaccurate phase recovery, whereas our magnitude-domain model achieves state-of-the-art results across all metrics, including those that are sensitive to perceived quality and correlate with phase fidelity. To synthesize the output, we use the inverse STFT (ISTFT) to transform the enhanced speech spectrogram, combined with the noisy phase, into the time domain, and finally apply the overlap-and-add method to obtain the improved speech waveform. Except for the input and output layers, which have a single channel, the overall feature dimension of the proposed TCNUNet model is C × T × F, where C is the number of channels, T is the number of time frames, and F is the number of frequency bins. The proposed TCNUNet model consists of an encoder, a bottleneck layer, and a decoder. The encoder consists of five coding layers and four NUNet blocks. The coding layers down-sample along the F dimension to capture feature information at different scales. Each coding layer consists of a 2-D gated linear unit convolution (GLUConv2d), batch normalization (BN) [32], and a parametric rectified linear unit (PReLU) [33]. The gating mechanism helps to control the information flow and alleviates the problem of vanishing or exploding gradients. We used a 2-D gated linear unit (GLU) [34] instead of standard 2-D convolution for down-sampling. The feature map of the GLU is defined as
x′ = (x ∗ W₁ + b₁) ⊙ σ(x ∗ W₂ + b₂)  (5)
where x and x′ denote the input and output feature maps, respectively, ⊙ denotes the element-wise multiplication operator, σ denotes the sigmoid function, and W₁, W₂ and b₁, b₂ are the weights and biases of the convolutional filters, respectively. The kernel size and stride of GLUConv2d are (2, 3) and (1, 2), respectively, so the feature maps are halved along the F dimension. The use of BN and PReLU helps to train the model and improve performance, while also reducing the complexity of the model. We introduced NUNet blocks after the encoding layers. The numbers of layers in the NUNet blocks at different depths are (4, 3, 2, 1), and the number of input and output channels is C. Placing a NUNet block after each encoding layer improves the feature representation ability, further exploiting global and local speech features to improve speech reconstruction. The bottleneck layer consists of three identical 1-D GTCMs (GTCM1d). The GTCM combines temporal convolution and gating mechanisms to model long-range temporal dependencies using 1-D convolutional blocks. The decoder consists of five decoding layers. Mirroring the encoding layers, the decoding layers use deconvolution instead of convolution. Each decoding layer is composed of a 2-D GLU deconvolution (GLUDConv2d), BN, and PReLU. Because skip connections are used between the encoder and decoder, the number of input channels in each decoding layer is 2C. The output layer of the model uses a softplus activation function to constrain the output range of the IRM. Table 1 describes the proposed network parameters in more detail. The hyperparameters of the encoder and decoder are specified as filterHeight × filterWidth, (stride along frame, stride along frequency).
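To illustrate (5), a minimal PyTorch sketch of one coding layer (GLUConv2d followed by BN and PReLU) is given below. The (2, 3) kernel and (1, 2) stride follow Table 1; the padding and the one-frame trim that keeps the time dimension fixed are our assumptions, and the channel count in the example is arbitrary.

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """Gated 2-D convolution: x' = conv(x) ⊙ sigmoid(gate(x)), equation (5)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Kernel (2, 3), stride (1, 2): halves the frequency dimension F.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2), padding=(1, 1))
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2), padding=(1, 1))
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x) * torch.sigmoid(self.gate(x))
        out = out[:, :, :-1, :]          # trim the extra frame introduced by the even time kernel
        return self.act(self.bn(out))

x = torch.randn(1, 1, 100, 161)          # (batch, C = 1, T frames, F = 161 bins)
print(GLUConv2d(1, 16)(x).shape)         # torch.Size([1, 16, 100, 81]): F halved
```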

2.3. Nested U-Net Block

U-Net originated in the field of computer vision, where it is used for medical image segmentation [35]. It allows the network to propagate contextual information to higher-resolution layers and requires fewer network parameters. Qin et al. [36] further proposed the U2-Net architecture for salient object detection in image segmentation. Xiang et al. [37] found that nested U-structures also work well for speech enhancement. Motivated by this, we designed the NUNet block to recalibrate the feature distribution and reduce noise interference in the input. Let L denote the number of layers of a NUNet block; setting different values of L yields NUNet blocks at different scales. Figure 3 shows an example of a NUNet block with L = 4. The NUNet is a symmetric encoder-decoder structure. The encoder uses 2-D convolution (Conv2d) for down-sampling, followed by BN and PReLU. The decoder is a mirrored version of the encoder, except that the convolution operation is replaced by the 2-D deconvolution [38] (DConv2d) operation, which reconstructs the compressed features back to their original size in turn. In addition, because skip connections are used between the encoder and decoder, the number of input channels of each decoder block is 2C. The kernel size and stride of the encoder and the decoder are (1, 3) and (1, 2), respectively. More detailed network parameters are given in Table 2. The format of the hyperparameters is filterHeight × filterWidth, (stride along frame, stride along frequency). The input and output of the NUNet block have the same feature map size, keeping the feature map size in the backbone network unchanged. Note that a residual connection is applied to the output of the NUNet block, which extracts intra-stage multi-scale features without reducing the feature map resolution. It can be written as
z′ = NUNet_j(z) + z  (6)
where z and z′ denote the feature maps output by the GLUConv2d and by the residual connection, respectively, and j ∈ {1, 2, 3, 4} is the layer index.
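A minimal PyTorch sketch of one NUNet block implementing (6) is shown below. The (1, 3) kernels, (1, 2) strides, and 2C decoder input channels follow Table 2; the stride-free convolution at the deepest level and the exact padding are our assumptions, so details may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, transposed=False, stride=(1, 2)):
    """Conv2d/DConv2d with kernel (1, 3) followed by BN and PReLU."""
    Conv = nn.ConvTranspose2d if transposed else nn.Conv2d
    return nn.Sequential(Conv(in_ch, out_ch, kernel_size=(1, 3), stride=stride, padding=(0, 1)),
                         nn.BatchNorm2d(out_ch), nn.PReLU())

class NUNetBlock(nn.Module):
    """Nested U-Net block with a residual output: z' = NUNet_j(z) + z, equation (6)."""
    def __init__(self, channels: int, depth: int):
        super().__init__()
        self.enc = nn.ModuleList([conv_block(channels, channels) for _ in range(depth)])
        self.mid = conv_block(channels, channels, stride=(1, 1))   # assumed stride-free bottom layer
        # Skip connections double the decoder input channels to 2C.
        self.dec = nn.ModuleList([conv_block(2 * channels, channels, transposed=True)
                                  for _ in range(depth)])

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        skips, x = [], z
        for enc in self.enc:                       # down-sample along F, storing each scale
            x = enc(x)
            skips.append(x)
        x = self.mid(x)
        for dec, skip in zip(self.dec, reversed(skips)):
            x = dec(torch.cat([x, skip], dim=1))   # concatenated skip -> 2C input channels
        return x + z                               # residual connection, equation (6)

# Frequency sizes stay odd at each scale here (F = 81), which gives exact shape matching.
z = torch.randn(1, 16, 100, 81)                    # (batch, C, T, F) after the first coding layer
print(NUNetBlock(16, depth=4)(z).shape)            # torch.Size([1, 16, 100, 81]): same size as input
```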
Overall, we designed the NUNet block because the distortion introduced in LDV signals is not merely additive noise but a complex, material-dependent spectral convolution. The standard U-Net, while powerful, performs a single downsampling and upsampling pass, which may inadequately model the cascaded, non-linear filtering effects of vibrational transmission. Our core hypothesis is that the nested structure offers distinct advantages for this specific problem. (1) Multi-stage refinement of material-specific distortions. The primary LDV distortion often manifests as strong low-frequency dominance due to the mechanical resonance properties of the target object. The nested design progressively refines the spectral representation at multiple resolutions, effectively learning a series of residual corrections to the distorted signal. This iterative refinement is better suited to inverting the complex, unknown transfer function between clean speech and the observed vibration signal. (2) Enhanced hierarchical feature reuse. Unlike conventional additive noise, LDV distortion is entangled with the signal across frequencies. The nested skip connections allow high-frequency details from the early encoder (e.g., consonant information) to be directly fused with semantically rich, low-frequency features from deeper layers. This mechanism is crucial for recovering speech components that are severely attenuated or masked by the object's frequency response; a single U-Net pass has a more limited capacity for this type of deep feature reinjection. (3) Mitigation of low-frequency smearing. Since the low-frequency energy smears and overshadows the harmonic structure, the NUNet is explicitly designed to mitigate this by allowing the network to "re-examine" and correct the coarse spectral envelope at multiple scales before producing the final output. We therefore frame the nested blocks not just as a generic capacity boost but as a targeted mechanism for multi-scale deconvolution of physics-based distortion.

2.4. Gated Temporal Convolution Module

In sequence modeling tasks [39], TCNs have been used to replace recurrent neural networks (RNNs) and have achieved performance comparable to or better than RNNs. Pandey et al. [40] proposed a temporal convolutional neural network (TCNN) for real-time speech enhancement. Shi et al. [41] combined the gating mechanism with the TCN and proposed a gated TCN. Existing gated TCNs use a single gate applied to the time-domain feature map. While effective for additive noise suppression, this mechanism treats all channels uniformly and lacks fine-grained control over temporal dynamics. In contrast, LDV signals exhibit structured, non-stationary distortions, such as transient phase jumps and material-dependent frequency smearing, which vary across both time and channel dimensions. Motivated by this, we designed the GTCM, as shown in Figure 4. LDV signals are highly sensitive to micro-vibrations and environmental noise, resulting in non-stationary, time-varying distortions that standard convolutional or recurrent models may struggle to capture. The GTCM is designed to model these temporal dynamics through dilated temporal convolutions with large receptive fields. Its two gating mechanisms allow the network to adaptively focus on relevant time-frequency segments and suppress transient artifacts. While the GTCM does not explicitly model distortion as a time-varying system, it implicitly learns to track and compensate for temporal irregularities by emphasizing stable signal components, making it particularly effective for real-world LDV data. Specifically, the GTCM consists of four gated 1-D convolution (GConv1d) blocks, each of which uses a residual connection and a different dilation rate d. The dilation rates of the GConv1d blocks are (1, 2, 5, 9), which expands the receptive field and aggregates long-term contextual information. The GConv1d block adds two gates to the TCM: one gate is used in the input convolution layer to control the information flow and compress the number of input channels, and the other gate is used in the output convolution layer to restore the original number of channels. The dilated 1-D convolution (D-Conv1d) layers use depthwise convolution to further reduce the number of network parameters and improve computational efficiency. In addition, the input and dilated convolutions are each followed by PReLU and BN. Table 3 describes the network parameters of the GTCM in detail. The format of the hyperparameters is filterHeight, dilationRate. Note that all convolutions in the GTCM are zero-padded so that the time dimensions of the input and output features are the same. Examples of a causal and a non-causal model are shown in Figure 5: Figure 5a shows a dilated causal convolution that uses only current and historical information, while Figure 5b shows a dilated non-causal convolution that also uses future information. Because non-causal models outperform causal ones, we adopted a non-causal design in this work, introducing bidirectional gated temporal convolutions and context-dependent U-Net skip connections to obtain better performance.
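The following PyTorch sketch shows one GConv1d block and a GTCM stacking four of them. The gated 1 × 1 input/output convolutions, the depthwise dilated convolution, the (1, 2, 5, 9) dilation rates, and the symmetric (non-causal) zero padding follow the description above, while the channel widths and the kernel size of 3 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GConv1d(nn.Module):
    """Gated 1-D conv block: gated channel compression -> depthwise dilated conv -> gated expansion."""
    def __init__(self, channels: int, hidden: int, dilation: int, kernel: int = 3):
        super().__init__()
        self.in_conv = nn.Conv1d(channels, hidden, 1)        # gate 1: compress the channel count
        self.in_gate = nn.Conv1d(channels, hidden, 1)
        self.in_norm = nn.Sequential(nn.PReLU(), nn.BatchNorm1d(hidden))
        pad = dilation * (kernel - 1) // 2                   # symmetric padding -> non-causal
        self.d_conv = nn.Conv1d(hidden, hidden, kernel, dilation=dilation,
                                padding=pad, groups=hidden)  # depthwise dilated conv (D-Conv1d)
        self.d_norm = nn.Sequential(nn.PReLU(), nn.BatchNorm1d(hidden))
        self.out_conv = nn.Conv1d(hidden, channels, 1)       # gate 2: restore the channel count
        self.out_gate = nn.Conv1d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.in_conv(x) * torch.sigmoid(self.in_gate(x))
        h = self.d_norm(self.d_conv(self.in_norm(h)))
        h = self.out_conv(h) * torch.sigmoid(self.out_gate(h))
        return x + h                                          # residual connection

class GTCM(nn.Module):
    """One gated temporal convolution module with dilation rates (1, 2, 5, 9)."""
    def __init__(self, channels: int = 256, hidden: int = 64):
        super().__init__()
        self.blocks = nn.Sequential(*[GConv1d(channels, hidden, d) for d in (1, 2, 5, 9)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)

# The bottleneck stacks three identical GTCMs over (batch, channels, time) features.
bottleneck = nn.Sequential(GTCM(), GTCM(), GTCM())
print(bottleneck(torch.randn(1, 256, 100)).shape)  # torch.Size([1, 256, 100]): time length preserved
```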
Although TCNUNet incorporates nested structures and temporal convolutions, the design is justified even for single-channel enhancement, for the following reasons. On the one hand, the proposed TCNUNet model takes several measures to limit its computational weight; for example, the depthwise convolution within the gated temporal convolution module significantly reduces the number of parameters and the computational load. Moreover, the architectural complexity addresses a specific challenge: the need for both deep feature representation and long-range temporal dependency modeling in low-SNR, real-world LDV-captured speech. The nested U-structure enhances multi-scale feature fusion, and the temporal convolution explicitly models temporal dynamics, both of which are critical for effective enhancement in our scenario. On the other hand, efficiency remains important: although the current implementation prioritizes performance, future work will focus on model compression techniques such as knowledge distillation, pruning, and quantization to enable deployment on resource-constrained platforms without significant performance loss. Finally, the vibration-to-speech mapping captured by LDV is inherently non-linear and spectrally distorted, often resembling an unknown spectral convolution due to the frequency-dependent response of materials. The nested U-structure fuses multi-scale features at multiple levels of abstraction; by reusing encoder features across sub-U-Nets, the model captures both local and global contextual information, enabling more accurate inversion of the distorted spectral envelope. This hierarchical reuse improves robustness to material-specific frequency responses, which vary significantly across targets.

3. Evaluation

3.1. Datasets

The experiments aim to evaluate the performance of the system in a real-world environment. We used an LDV to collect data from various targets in a real-world environment. The setup of the remote voice collection system is shown in Figure 6a. A loudspeaker serves as the sound source, exciting the target object into forced vibration, while the LDV emits an invisible laser to measure the tiny vibrations on the target surface. A field-programmable gate array (FPGA) demodulates the optical signals into audio signals, which are then transmitted to a personal computer (PC). The LDV device we used is shown in Figure 6b. The LDV recorded speech indoors from a distance of 30 m. The sound pressure level of the loudspeaker was about 65 dB, and it was located 15 cm away from the target object. To fully validate the performance of the proposed model, we used a cardboard box with dimensions of 18 cm × 12 cm × 8.5 cm, a polyvinyl chloride (PVC) sheet with a thickness of 3 mm, an iron box with dimensions of 13 cm × 13 cm × 12 cm, and a sheet of A4 printing paper as the measured target objects, and constructed different datasets. In the experiments, we used two different corpora to evaluate our system. First, we used the THCHS-30 corpus [42] as clean speech to train the proposed model. The THCHS-30 corpus contains 13,388 Chinese sentences spoken by 40 different speakers, with a sampling rate of 16 kHz and a total duration of approximately 40 h. For each target object, we randomly collected 6 h of speech audio; each recording was 60 s long and resampled at 16 kHz, for a total of 360 speech recordings. Of these, 216 recordings were used for training, 72 for validation, and 72 for testing the model's generalization ability. In addition, we used the TIMIT corpus [43] as clean speech to further test the generalization ability of the proposed model across languages. The TIMIT corpus contains 6300 English utterances from 630 speakers, with a sampling rate of 16 kHz. For each target object, we randomly collected 10 min of speech recordings; each recording was 60 s long and resampled at 16 kHz, resulting in a total of 10 speech recordings.
In our experiments, the training data consist of non-overlapping segments extracted from continuous recordings. We chose non-overlapping segments to avoid overfitting to specific temporal contexts and to ensure statistical independence across batches. During inference, however, we use a 50% overlapping sliding window to mitigate boundary artifacts; this ensures smooth transitions between predictions on adjacent segments while maintaining real-time feasibility. In addition, segments containing more than 80% of frames below −35 dB were excluded from training. This filtering removes prolonged silent intervals that do not contribute meaningfully to learning speech recovery under vibration-induced distortion; short pauses and low-amplitude phonemes are preserved, as they are frequently distorted by environmental noise in LDV settings and thus represent an important part of the target distribution. Moreover, to prevent bias toward any single material, we applied stratified sampling so that each mini-batch contains approximately equal numbers of segments from each target object. This ensures uniform learning across different mechanical responses and improves generalization to unseen materials.
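The segment extraction and silence filtering described above can be sketched as follows. The frame length and the choice of referencing the −35 dB threshold to the recording's loudest frame are our assumptions, since the reference level is not otherwise specified.

```python
import numpy as np

def extract_segments(audio: np.ndarray, fs: int = 16000, seg_sec: float = 1.0,
                     frame: int = 320, silence_db: float = -35.0, max_silent: float = 0.8):
    """Split a recording into non-overlapping segments, dropping mostly-silent ones."""
    seg_len = int(seg_sec * fs)
    eps = 1e-10
    # Frame-level power in dB, referenced to the loudest frame of the recording (assumption).
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    rms_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + eps)
    rms_db -= rms_db.max()
    frames_per_seg = seg_len // frame
    segments = []
    for i in range(0, len(audio) - seg_len + 1, seg_len):     # non-overlapping for training
        fstart = i // frame
        seg_db = rms_db[fstart : fstart + frames_per_seg]
        if np.mean(seg_db < silence_db) <= max_silent:        # drop if >80% of frames are silent
            segments.append(audio[i : i + seg_len])
    return np.stack(segments) if segments else np.empty((0, seg_len))

def inference_windows(audio: np.ndarray, fs: int = 16000, seg_sec: float = 1.0):
    """50% overlapping sliding windows used at inference to avoid boundary artifacts."""
    seg_len, hop = int(seg_sec * fs), int(seg_sec * fs) // 2
    return [audio[i : i + seg_len] for i in range(0, len(audio) - seg_len + 1, hop)]
```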

3.2. Experimental Setup and Baselines

We performed a 320-point STFT on the framed signal with a Hann window to produce 161-dimensional spectral features. The hop length was 160 points. The Adam optimizer [44] was used to train the network on 1-s-long segments. The batch size was set to 60. We trained the network with the mean absolute error (MAE) loss function to estimate the IRM. The initial learning rate was 0.001; if there was no performance improvement over five consecutive epochs, the learning rate was halved, and training terminated once the learning rate fell below 0.0001.
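A minimal PyTorch sketch of this training configuration is shown below. The stand-in model and random tensors are hypothetical placeholders for TCNUNet and the magnitude-spectrum/IRM pairs, and mapping "no improvement in five consecutive epochs" to ReduceLROnPlateau with patience=5 is our reading of the schedule.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the schedule itself is runnable; in practice these are TCNUNet
# and the LDV noisy-magnitude/IRM training pairs (hypothetical placeholders here).
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.PReLU(), nn.Conv2d(16, 1, 3, padding=1))
data = TensorDataset(torch.randn(120, 1, 100, 161), torch.rand(120, 1, 100, 161))
train_loader = DataLoader(data, batch_size=60, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                       # initial LR 0.001
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=5)  # halve after 5 idle epochs
criterion = nn.L1Loss()                                                         # MAE loss on the IRM

for epoch in range(200):
    model.train()
    for noisy_mag, target_irm in train_loader:      # 1-s segments, batch size 60
        optimizer.zero_grad()
        loss = criterion(model(noisy_mag), target_irm)
        loss.backward()
        optimizer.step()
    scheduler.step(loss.item())                     # in practice: validation loss, not training loss
    if optimizer.param_groups[0]["lr"] < 1e-4:      # stop once the LR falls below 0.0001
        break
```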
To fully compare the performance of different models, we used nine baseline models. Firstly, we compared three conventional methods, namely Minimum Mean-Square Error Log-Spectral Amplitude (MMSELSA) [45], Optimally Modified Log-Spectral Amplitude (OM-LSA) [46] and Wiener estimation [47]. These classic algorithms have good interpretability and computational efficiency and do not require much data to train the model. Afterward, we compared six DL methods, namely DARCN [48], NSE-CATNet [49], DCCRN [50], MAB-CED [51], GCRN [52] and GaGNet [53]. Among them, DARCN and NSE-CATNet only enhanced the magnitude spectrum, while the phase spectrum remained unchanged. DCCRN and MAB-CED used CRMs to improve the complex spectrum. GCRN and GaGNet used complex spectral mapping to estimate the real and imaginary spectra of clean speech.

3.3. Metrics

We used the perceptual evaluation of speech quality (PESQ) [54], short-time objective intelligibility (STOI) [55], and log spectral distance (LSD) [56] to evaluate the speech enhancement performance of the different models. PESQ evaluates overall perceptual speech quality; its values range from −0.5 (worst quality) to 4.5 (best quality). STOI relates to human speech intelligibility, with scores between 0 and 1 that can be roughly interpreted as the percentage understood correctly; higher scores indicate better performance. LSD estimates the difference between the reference and estimated magnitude spectra; the smaller the LSD value, the smaller the speech degradation. The LSD metric is defined as
LSD = (1/T) Σ_{t=1}^{T} √[ (1/F) Σ_{f=1}^{F} ( 10 log₁₀( M(t, f) / M̂(t, f) ) )² ]  (7)
where M and M̂ denote the reference and estimated magnitude spectra, respectively.
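A direct NumPy implementation of (7) is sketched below, with the PESQ and STOI calls shown as comments using the widely used third-party pesq and pystoi packages, whose exact APIs may vary by version; the small flooring constant eps is our assumption to avoid taking the logarithm of zero.

```python
import numpy as np

def lsd(ref_mag: np.ndarray, est_mag: np.ndarray, eps: float = 1e-8) -> float:
    """Log spectral distance between reference and estimated magnitude spectra, equation (7).

    Both inputs have shape (T frames, F bins); eps floors the ratio to avoid log(0).
    """
    log_ratio = 10.0 * np.log10((ref_mag + eps) / (est_mag + eps))
    return float(np.mean(np.sqrt(np.mean(log_ratio ** 2, axis=1))))

# PESQ and STOI via common third-party packages (pip install pesq pystoi):
# from pesq import pesq
# from pystoi import stoi
# pesq_score = pesq(16000, clean_waveform, enhanced_waveform, "wb")
# stoi_score = stoi(clean_waveform, enhanced_waveform, 16000, extended=False)
```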

4. Results

4.1. Ablation Study

The ablation study evaluates the effectiveness of the components that make up TCNUNet. We used the cardboard-box dataset for this study. The objective results on the unseen test set are shown in Table 4. We used TCNUNet as the baseline, removed the GLU, NUNet, and GTCM modules in turn, and report the performance after each removal. Compared to the baseline, performance deteriorated after removing each module, indicating that every module plays a positive role in the overall performance. The GLU effectively promotes the propagation of information flow in the network and helps the network capture long-term dependencies: after changing the GLU convolution/deconvolution of the encoding and decoding layers to standard convolution/deconvolution, degradations of 0.303, 0.03, and 0.057 were observed in the LSD, STOI, and PESQ metrics, respectively. The NUNet architecture helps to further extract multi-scale information to separate noise from clean speech: after removing the NUNet blocks in the encoder, degradations of 0.088, 0.04, and 0.021 were observed in the LSD, STOI, and PESQ metrics, respectively. The degradation was most severe after removing the GTCMs, with deteriorations of 0.45, 0.013, and 0.107 in the LSD, STOI, and PESQ metrics, respectively. This indicates that the GTCMs play the greatest role in the network and effectively model long-term temporal dependencies; the significant drop after removing the GTCM confirms its role in handling temporal instability. Similarly, replacing the nested structure with a standard U-encoder also results in poorer performance on unseen materials, demonstrating its importance for spectral generalization.

4.2. Comparison with Baselines

In this section, we compare the proposed method with the baseline methods. The comparative experiments used the cardboard box as the target object. As shown in Table 5, comparison results are provided for both the validation set and the unseen test set. First, we present the speech enhancement results of the three conventional methods, MMSELSA, OM-LSA, and Wiener estimation. Compared with the mixtures, the conventional methods generally suffer performance degradation in the LSD and STOI metrics and improve only the PESQ metric. This is because conventional methods are generally based on the additive model in (1), which assumes that the clean speech and noise signals are uncorrelated, and they therefore focus on filtering out additive noise. After filtering, a large amount of additive noise is removed, resulting in an overall improvement in perceived speech quality. However, when a non-microphone receiver is used, signal distortion is an undeniable interference factor. Conventional methods filter out both the additive noise and the distorted speech masked by it, which reduces speech intelligibility. In addition, the signal distortion affects both the magnitude and phase spectra, which makes it difficult for conventional methods to improve the LSD metric.
Correspondingly, the DL-based methods generally achieve improvements in the LSD, STOI, and PESQ metrics, because they attempt to restore the distorted signal while reducing noise. For a fair comparison, the baseline models also used non-causal configurations to achieve their best performance. DCCRN and MAB-CED used CRMs to simultaneously enhance the real and imaginary parts of the complex spectrum; compared with the mixture, both baselines improved the LSD, STOI, and PESQ metrics. GCRN and GaGNet used complex spectral mapping to reconstruct the real and imaginary spectra of clean speech, which achieved better overall improvement than the CRM-based methods. We observed that although GCRN improved the LSD and STOI metrics, its PESQ score degraded. This is because, in real-world environments, the phase spectrum is more difficult to recover than the magnitude spectrum, and GCRN failed to recover it effectively. In GaGNet, we assigned higher weights to the magnitude spectrum, resulting in better PESQ performance. DARCN and NSE-CATNet preserved the phase spectrum of the noisy speech and enhanced only the magnitude spectrum; compared with the other baselines, they achieved larger improvements in the PESQ metric. Compared to all baseline models, the proposed TCNUNet achieved state-of-the-art performance, outperforming the baselines on all metrics. Figure 7 shows the speech enhancement results of the proposed method on the cardboard box: the proposed method effectively restored high-frequency speech features, improving the quality and intelligibility of the speech. Finally, we compared the results on the validation set and the test set; the differences between the two were small, indicating that the models generalize effectively to unseen data. This table and the following tables also report the 95% confidence interval for each metric, which quantifies the accuracy of the statistical estimates; in our case, it gives the range containing the true value at a confidence level of 0.95.
We also compared TCNUNet with SepFormer, a state-of-the-art transformer-based model known for its strong performance on standard speech enhancement benchmarks such as DNS and VoiceBank. We find that while models like SepFormer excel in conventional noisy acoustic environments, they are designed under assumptions (e.g., additive noise, microphone capture) that differ significantly from the non-additive, vibration-mediated distortions in LDV signals, such as phase corruption, material-dependent frequency responses, and the low SNR inherent to remote sensing. By including SepFormer as a benchmark, we aim to demonstrate two points: (1) general-purpose architectures, despite their power, are suboptimal for LDV-specific challenges due to mismatched inductive biases; and (2) the performance gain of TCNUNet is not merely due to model capacity but stems from architectural choices explicitly aligned with the physical nature of LDV degradation (e.g., temporal modeling via the GTCM and multi-scale fusion via the nested U-Net). Preliminary results show that while SepFormer improves over the noisy input, it underperforms TCNUNet in preserving speech naturalness and temporal coherence. These results support our claim that domain-specific design is crucial, which is precisely why TCNUNet is proposed for LDV-specific challenges.

4.3. Comparison with Various Targets

We also evaluated the performance of the proposed model on datasets collected from the other target objects. Table 6 and Table 7 provide the speech enhancement results of the various target objects under seen and unseen conditions, respectively. The proposed model achieved performance improvements on all of these datasets, and the performance difference between generalization testing and validation testing was small, indicating that the proposed model also generalizes across target objects. In addition, different target objects exhibit different characteristics due to their materials and structures. The proposed model achieved the best LSD metric on the iron box dataset, i.e., the minimum average distortion of the magnitude spectrum; the best STOI metric on the PVC sheet dataset, i.e., the highest speech intelligibility; and the best PESQ metric on the printing paper dataset, i.e., the highest perceived speech quality. This helps in selecting appropriate target objects according to the metric of interest in real-world environments.
Finally, to fully validate the generalization ability of the proposed model, we also tested its performance on unseen datasets in a different language. Figure 8 shows the spectrograms of speech collected from the various target objects, and the test results on the various target object datasets are shown in Table 8. From these tables, we can see that performance varies across materials. However, we believe this variation reflects the inherent physical differences in how target objects vibrate and reflect laser signals, such as material density, surface roughness, and structural rigidity, rather than a fundamental limitation in model generalization. Importantly, while absolute metric values differ, a consistent trend of improvement is observed across all materials under both seen and unseen conditions. For example, on the unseen TIMIT dataset (see Table 8), our model improves the STOI and PESQ metrics across all target objects, including those not encountered during training, demonstrating robust generalization. Furthermore, the relatively small gap between seen and unseen conditions suggests that the model has learned generalizable acoustic-physical patterns rather than overfitting to specific materials. Note that the model evaluated on the unseen TIMIT dataset was not retrained from scratch; the same model trained on our main dataset was tested directly in a zero-shot generalization setting. This demonstrates cross-dataset robustness and the model's ability to generalize to new phonetic content and speakers under unseen conditions. Table 8 also shows that the cardboard box yielded both the best initial mixture performance and the best enhanced performance, with improvements of 1.733, 0.064, and 0.251 in the LSD, STOI, and PESQ metrics, respectively. This indicates that the structure of the cardboard box is well suited as a target object for LDV monitoring. Although not as good as the cardboard box, the other target objects also achieved improved performance.
Overall, TCNUNet achieves strong but variable performance across materials. These differences are systematically related to the mechanical impedance, damping ratio, and frequency response function (FRF) of each surface, which govern how airborne sound is transduced into measurable vibrations. The cardboard box shows the highest overall performance (best on TIMIT): it exhibits a high stiffness-to-mass ratio and moderate internal damping, resulting in a relatively flat FRF over the speech band. This allows faithful transmission of mid-frequency phonetic content, enabling both high STOI and PESQ scores, and its structural rigidity minimizes non-linear distortions, making the inverse problem more tractable for the network. The PVC sheet shows the highest STOI improvement: although it has a lower overall SNR due to fluttering under airflow, it displays strong low-frequency resonance that amplifies voicing cues. Our GTCM and NUNet components are particularly effective at exploiting this enhanced periodicity, selectively suppressing noise while preserving the harmonic structure; this yields exceptional intelligibility gains even though perceptual quality (PESQ) remains slightly lower than on stiffer materials. The iron box shows lower performance despite its rigidity: although metal has high reflectivity and minimal damping, its high density leads to poor acoustic coupling with airborne sound, resulting in very weak vibrational signals. Additionally, multi-path reflections within the enclosed box create modal resonances that distort the spectral balance; these factors increase estimation uncertainty, explaining the higher variance and more modest improvements observed in these tables. The printing paper is the most challenging surface: as a thin, flexible membrane, it suffers from severe signal attenuation above 1 kHz and high sensitivity to ambient disturbances, and its large compliance introduces strong non-linear phase shifts that distort the temporal alignment of harmonics. Consequently, all models struggle with this material, but TCNUNet's multi-scale refinement still provides meaningful gains by recovering residual voiced segments. Thus, performance differences across materials are consistent with known physical properties: surfaces with flatter frequency responses and better acoustic coupling (e.g., cardboard) yield superior reconstruction, while highly resonant or poorly coupled materials (e.g., PVC, iron) present distinct challenges that are partially mitigated by the proposed architecture.
The proposed model also shows a certain degree of generalization across languages. The model is trained on Chinese utterances and evaluated on English test sets, and the two languages differ in phonotactics and prosody; our goal is to assess cross-lingual generalization under physical degradation, not linguistic modeling. In our experiments, we find that the LDV distortion mechanism depends primarily on acoustic-to-vibration transduction and object resonance, which are largely independent of linguistic content. Moreover, both languages share similar spectral envelopes and phonetic features, allowing the network to learn universal restoration strategies. Thus, although trained on Chinese and tested on English, the model focuses on recovering signal structure rather than language-specific features. The physical nature of LDV distortion is language-agnostic, enabling effective cross-lingual transfer, as confirmed by listener intelligibility tests.
The computational cost of the proposed model is as follows: 8.7 M parameters; 16.3 GFLOPs of floating-point operations per second of audio; an inference latency of ∼240 ms; and an inference speed of 0.8× real time on an NVIDIA RTX 3090. In addition, the proposed TCNUNet is non-causal, primarily due to its bidirectional gated temporal convolutions and context-dependent U-Net skip connections. This design prioritizes performance by leveraging future context, which is beneficial for offline or quasi-online applications such as forensic analysis or post-disaster surveillance, where near-real-time processing is acceptable. However, we fully acknowledge that truly real-time, causal operation is essential for live monitoring. To address this, we conducted preliminary experiments with a causal variant of TCNUNet, in which all convolutions are constrained to use only past and present context and receptive fields are managed via dilation. Initial results show a moderate drop in the objective metrics, but the causal model remains competitive with existing methods while achieving real-time inference on the same hardware.

5. Limitations

Although the proposed model achieved competitive performance compared with other methods, it also has some limitations. (1) Performance under extremely low SNR conditions. While TCNUNet achieves strong results at moderate SNRs, performance degrades significantly below that regime. This reflects the fundamental challenge of recovering speech from severely corrupted LDV signals in which the vibrational energy is nearly indistinguishable from environmental noise; the architecture may benefit from additional noise-robust pretraining or hybrid signal-processing/deep-learning strategies in such regimes. (2) Failure modes with highly reflective or flexible materials. The model may be less effective on targets with complex mechanical responses, such as thin plastic films or loosely mounted glass, where non-linear vibrations and multi-path reflections introduce unpredictable distortions not well captured by the training data; these cases often result in over-smoothed outputs or residual artifacts. (3) Phase spectrum treatment. A major limitation is our reliance on the noisy phase spectrum of the input LDV signal during reconstruction: we enhance the magnitude spectrum but do not refine the phase. As noted in prior work, an inaccurate phase can impair perceptual quality and intelligibility, particularly in transient regions, so without explicit phase modeling, performance remains bounded compared to ideal phase-aware methods. (4) Generalization to unseen environments. All experiments were conducted in controlled acoustic settings; performance may decrease in open-air or high-reverberation environments due to unmodeled atmospheric effects (e.g., wind, temperature gradients) that are not present in the training data. In future work, we will therefore investigate causal modeling, phase-optimized objectives, and domain randomization for environmental robustness.

6. Conclusions

Remote laser monitoring has the advantages of strong concealment and easy operation. Therefore, it can be used in military reconnaissance, security monitoring, disaster relief, and other fields. In this paper, we analyzed the voice signals captured by LDV and proposed a nested U-network with temporal convolution. The proposed model includes an encoder, a decoder, and a bottleneck layer. The difference from the standard U-Net is that the proposed model also adds nested U-Nets in the encoder to further enhance feature representation ability. In the bottleneck layer, we designed the gated temporal convolution module. Two gates were added to the temporal convolution module to control the number of channels, and depthwise convolution was used to further compress network parameters. We have designed an LDV monitoring system to remotely capture voice signals in a real-world environment. In addition, we collected different datasets from various target objects to fully validate the performance of the proposed network. The results of real-world experiments indicate that the proposed model achieves state-of-the-art performance compared to various baselines. The experimental results on various target object datasets also demonstrate that the proposed model has good speech enhancement performance and generalization ability.
We have demonstrated the potential of using LDV for remote capture of speech signals and achieved speech monitoring of various common target objects. However, due to the complex frequency response characteristics of target objects, it is difficult to reconstruct the phase spectrum of distorted speech signals. In the next stage of research, we plan to enhance the phase spectrum of distorted speech to further enhance the speech signal captured by LDV.

Author Contributions

Conceptualization, B.Z. and J.T.; methodology, B.Z.; software, B.Z.; validation, B.Z.; formal analysis, B.Z.; investigation, B.Z.; resources, B.Z. and J.T.; data curation, B.Z.; writing—original draft preparation, B.Z.; writing—review and editing, B.Z. and J.T.; visualization, B.Z.; supervision, F.G.; project administration, F.G.; funding acquisition, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Project of Xiang Jiang Lab (No. 23XJ02002), and the Natural Science Foundation of Hunan Province of China (No. 2025JJ50396).

Data Availability Statement

All data included in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, R.; Wang, T.; Zhu, Z.; Xiao, W. Vibration Characteristics of Various Surfaces Using an LDV for Long-Range Voice Acquisition. IEEE Sens. J. 2011, 11, 1415–1422.
  2. Peng, S.; Wu, S.; Li, Y.; Chen, H. All-fiber monostatic pulsed laser Doppler vibrometer: A digital signal processing method to eliminate cochannel interference. Opt. Laser Technol. 2020, 124, 105952.
  3. Wu, S.; Lv, T.; Han, X.; Yan, C.; Zhang, H. Remote audio signals detection using a partial-fiber laser Doppler vibrometer. Appl. Acoust. 2018, 130, 216–221.
  4. Xu, Z.; Li, J.; Zhang, S.; Tan, Y.; Zhang, X.; Lin, X.; Wan, X.; Zhuang, S. Remote eavesdropping at 200 m distance based on laser feedback interferometry with single-photon sensitivity. Opt. Lasers Eng. 2021, 141, 106562.
  5. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
  6. Li, W.; Liu, M.; Zhu, Z.; Huang, T. LDV Remote Voice Acquisition and Enhancement. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 4, pp. 262–265.
  7. Zhang, H.; Lv, T.; Yan, C. The novel role of arctangent phase algorithm and voice enhancement techniques in laser hearing. Appl. Acoust. 2017, 126, 136–142.
  8. Lv, T.; Zhang, H.; Yan, C. Double mode surveillance system based on remote audio/video signals acquisition. Appl. Acoust. 2018, 129, 316–321.
  9. Lv, T.; Han, X.; Wu, S.; Li, Y. The effect of speckles noise on the Laser Doppler Vibrometry for remote speech detection. Opt. Commun. 2019, 440, 117–125.
  10. Peng, R.; Xu, B.; Li, G.; Zheng, C.; Li, X. Long-range speech acquirement and enhancement with dual-point laser Doppler vibrometers. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
  11. Wang, Y.H.; Zhang, W.X.; Wu, Z.; Kong, X.X.; Wang, Y.B.; Zhang, H.X. Highly-Accurate and Real-Time Speech Measurement for Laser Doppler Vibrometers. IEICE Trans. Inf. Syst. 2022, E105D, 1568–1580.
  12. Xie, Z.; Du, J.; McLoughlin, I.; Xu, Y.; Ma, F.; Wang, H. Deep neural network for robust speech recognition with auxiliary features from laser-Doppler vibrometer sensor. In Proceedings of the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China, 17–20 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5.
  13. Peng, S.; Lv, T.; Han, X.; Wu, S.; Yan, C.; Zhang, H. Remote speaker recognition based on the enhanced LDV-captured speech. Appl. Acoust. 2019, 143, 165–170.
  14. Cai, C.; Iwai, K.; Nishiura, T. Speech Enhancement Based on Two-Stage Processing with Deep Neural Network for Laser Doppler Vibrometer. Appl. Sci. 2023, 13, 1958.
  15. Wang, D.; Chen, J. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726.
  16. Wang, Y.; Narayanan, A.; Wang, D. On Training Targets for Supervised Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858.
  17. Pandey, A.; Wang, D. Dense CNN With Self-Attention for Time-Domain Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1270–1279.
  18. Hummersone, C.; Stokes, T.; Brookes, T. On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis. In Blind Source Separation: Advances in Theory, Algorithms and Applications; Naik, G.R., Wang, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 349–368.
  19. Williamson, D.S.; Wang, Y.; Wang, D. Complex Ratio Masking for Monaural Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 483–492.
  20. Li, A.; Yuan, M.; Zheng, C.; Li, X. Speech enhancement using progressive learning-based convolutional recurrent neural network. Appl. Acoust. 2020, 166, 107347.
  21. Wang, Z.Q.; Wang, P.; Wang, D. Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2001–2014.
  22. Pandey, A.; Wang, D. A New Framework for CNN-Based Speech Enhancement in the Time Domain. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1179–1188.
  23. Tolooshams, B.; Giri, R.; Song, A.H.; Isik, U.; Krishnaswamy, A. Channel-Attention Dense U-Net for Multichannel Speech Enhancement. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 836–840.
  24. Fu, Y.; Liu, Y.; Li, J.; Luo, D.; Lv, S.; Jv, Y.; Xie, L. Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 7417–7421.
  25. Xu, X.; Tu, W.; Yang, Y. CASE-Net: Integrating local and non-local attention operations for speech enhancement. Speech Commun. 2023, 148, 31–39.
  26. Zhang, Z.; Xu, S.; Zhuang, X.; Qian, Y.; Wang, M. Dual branch deep interactive UNet for monaural noisy-reverberant speech enhancement. Appl. Acoust. 2023, 212, 109574.
  27. Zhu, Y.; Xu, X.; Ye, Z. FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions. Appl. Acoust. 2020, 170, 107511.
  28. Fan, C.; Zhang, H.; Li, A.; Xiang, W.; Zheng, C.; Lv, Z.; Wu, X. CompNet: Complementary network for single-channel speech enhancement. Neural Netw. 2023, 168, 508–517.
  29. Xiang, X.; Zhang, X. Joint waveform and magnitude processing for monaural speech enhancement. Appl. Acoust. 2022, 200, 109077.
  30. Zhang, Y.; Li, L.; Tong, Y.; Zeng, H.; Zhou, Y. Vibration characteristics of aluminum material and its influences on laser Doppler voice acquisition. In International Symposium on Photoelectronic Detection and Imaging 2013: Infrared Imaging and Applications; Gong, H., Shi, Z., Chen, Q., Lu, J., Eds.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 2013; Volume 8907, p. 89074B.
  31. Li, L.; Zeng, H.; Zhang, Y.; Kong, Q.; Zhou, Y.; Liu, Y. Analysis of backscattering characteristics of objects for remote laser voice acquisition. Appl. Opt. 2014, 53, 971–978.
  32. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 448–456.
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; IEEE: Piscataway, NJ, USA, 2015.
  34. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 933–941.
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
  36. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404.
  37. Xiang, X.; Zhang, X.; Chen, H. A Nested U-Net With Self-Attention and Dense Connectivity for Monaural Speech Enhancement. IEEE Signal Process. Lett. 2022, 29, 105–109.
  38. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar]
  39. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  40. Pandey, A.; Wang, D. TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6875–6879. [Google Scholar] [CrossRef]
  41. Shi, Z.; Lin, H.; Liu, L.; Liu, R.; Han, J.; Shi, A. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; ISCA: Toulouse, France, 2019; pp. 3183–3187. [Google Scholar] [CrossRef]
  42. Wang, D.; Zhang, X.; Zhang, Z. THCHS-30: A Free Chinese Speech Corpus. 2015. Available online: https://www.openslr.org/18/ (accessed on 15 August 2025).
  43. Lyons, J.W. DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus; National Institute of Standards and Technology: Gaithersburg, MD, USA, 1993.
  44. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  45. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445. [Google Scholar] [CrossRef]
  46. Cohen, I. Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Process. Lett. 2002, 9, 113–116. [Google Scholar] [CrossRef]
  47. Scalart, P.; Filho, J. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 7–10 May 1996; IEEE: Piscataway, NJ, USA, 1996; Volume 2, pp. 629–632. [Google Scholar] [CrossRef]
  48. Li, A.; Zheng, C.; Fan, C.; Peng, R.; Li, X. A Recursive Network with Dynamic Attention for Monaural Speech Enhancement. arXiv 2020, arXiv:2003.12973. [Google Scholar] [CrossRef]
  49. Saleem, N.; Gunawan, T.S.; Kartiwi, M.; Nugroho, B.S.; Wijayanto, I. NSE-CATNet: Deep Neural Speech Enhancement Using Convolutional Attention Transformer Network. IEEE Access 2023, 11, 66979–66994. [Google Scholar] [CrossRef]
  50. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. arXiv 2020, arXiv:2008.00264. [Google Scholar] [CrossRef]
  51. Saleem, N.; Gunawan, T.S.; Shafi, M.; Bourouis, S.; Trigui, A. Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement. IEEE Access 2023, 11, 114172–114186. [Google Scholar] [CrossRef]
  52. Tan, K.; Wang, D. Learning Complex Spectral Mapping With Gated Convolutional Recurrent Networks for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 380–390. [Google Scholar] [CrossRef] [PubMed]
  53. Li, A.; Zheng, C.; Zhang, L.; Li, X. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Appl. Acoust. 2022, 187, 108499. [Google Scholar] [CrossRef]
  54. Rix, A.; Beerends, J.; Hollier, M.; Hekstra, A. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Proceedings (Cat. No.01CH37221). IEEE: Piscataway, NJ, USA, 2001; Volume 2, pp. 749–752. [Google Scholar] [CrossRef]
  55. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
  56. Wang, P.; Wang, Y.; Liu, H.; Sheng, Y.; Wang, X.; Wei, Z. Speech enhancement based on auditory masking properties and log-spectral distance. In Proceedings of the 2013 3rd International Conference on Computer Science and Network Technology, Dalian, China, 12–13 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1060–1064. [Google Scholar] [CrossRef]
Figure 1. Examples of LDV-captured speech and clean speech. (a) The waveform of clean speech. (b) The waveform of noisy speech. (c) The spectrogram of clean speech. (d) The spectrogram of noisy speech.
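For readers who want to reproduce spectrogram inputs like those in Figure 1, the following is a minimal sketch of a magnitude-spectrogram front end. A 320-point STFT yields 320/2 + 1 = 161 frequency bins, matching the 1 × T × 161 input size in Table 1; the hop length, window, and 16 kHz sampling rate are our assumptions for illustration, not parameters confirmed by the paper.

```python
import torch

def magnitude_spectrogram(wave: torch.Tensor, n_fft: int = 320, hop: int = 160) -> torch.Tensor:
    """Magnitude spectrogram of a mono waveform (assumed 16 kHz).
    n_fft=320 gives 161 frequency bins, as in the 1 x T x 161 input of Table 1."""
    spec = torch.stft(
        wave,                              # (num_samples,) mono waveform
        n_fft=n_fft,
        hop_length=hop,
        window=torch.hann_window(n_fft),
        return_complex=True,
    )
    return spec.abs().T                    # (T, 161): frames x frequency bins
```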
Figure 2. Diagram of the proposed TCNUNet model.
Figure 3. Diagram of the NUNet with 4 layers.
Figure 4. Diagram of the GTCM.
Figure 5. An example of dilated convolution with a filter of size 3. (a) A dilated causal convolution. (b) A dilated non-causal convolution.
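The distinction Figure 5 draws between causal and non-causal dilated convolution comes down to how the input is padded. The sketch below is our own illustration (not the paper's code): with a filter of size 3 and dilation d, padding d·(3 − 1) frames on the left only makes the convolution causal, while splitting the same padding symmetrically makes it non-causal.

```python
import torch.nn as nn
import torch.nn.functional as F

class DilatedConv1d(nn.Module):
    """Dilated Conv1d with a filter of size 3, as in Figure 5.
    causal=True pads only on the left, so each output frame sees only
    current and past frames (Figure 5a); causal=False pads symmetrically
    and also sees future frames (Figure 5b)."""
    def __init__(self, channels: int, dilation: int, causal: bool):
        super().__init__()
        self.causal = causal
        self.pad = dilation * 2            # dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, T)
        if self.causal:
            x = F.pad(x, (self.pad, 0))    # all padding on the left
        else:
            x = F.pad(x, (self.pad // 2, self.pad // 2))
        return self.conv(x)                # output length stays T
```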
Figure 6. Configuration of Remote Voice Acquisition System Based on LDV. (a) Experiment setup for remote voice acquisition. (b) The LDV device.
Figure 7. The enhancement results of the proposed method on the cardboard box. (a) The spectrogram of clean speech. (b) The spectrogram of noisy speech. (c) The spectrogram of enhanced speech.
Figure 8. Spectrograms of various target objects. (a) The spectrogram of clean speech. (b) The spectrogram of the cardboard box. (c) The spectrogram of the PVC sheet. (d) The spectrogram of the iron box. (e) The spectrogram of the printing paper.
Table 1. Architecture of the proposed TCNUNet model.

| Layer Name | Input Size | Hyperparameters (Kernel, Stride) | Output Size |
| --- | --- | --- | --- |
| GLUConv2d_1 | 1 × T × 161 | 2 × 3, (1, 2) | 64 × T × 81 |
| NUNet2d_1 | 64 × T × 81 | 1 × 3, (1, 2) | 64 × T × 81 |
| GLUConv2d_2 | 64 × T × 81 | 2 × 3, (1, 2) | 64 × T × 41 |
| NUNet2d_2 | 64 × T × 41 | 1 × 3, (1, 2) | 64 × T × 41 |
| GLUConv2d_3 | 64 × T × 41 | 2 × 3, (1, 2) | 64 × T × 21 |
| NUNet2d_3 | 64 × T × 21 | 1 × 3, (1, 2) | 64 × T × 21 |
| GLUConv2d_4 | 64 × T × 21 | 2 × 3, (1, 2) | 64 × T × 11 |
| NUNet2d_4 | 64 × T × 11 | 1 × 3, (1, 2) | 64 × T × 11 |
| GLUConv2d_5 | 64 × T × 11 | 2 × 3, (1, 2) | 64 × T × 6 |
| reshape_1 | 64 × T × 6 | - | 384 × T |
| GTCM1d_1 | 384 × T | see Table 3 | 384 × T |
| GTCM1d_2 | 384 × T | see Table 3 | 384 × T |
| GTCM1d_3 | 384 × T | see Table 3 | 384 × T |
| reshape_2 | 384 × T | - | 64 × T × 6 |
| GLUDConv2d_5 | 128 × T × 6 | 1 × 3, (1, 2) | 64 × T × 11 |
| GLUDConv2d_4 | 128 × T × 11 | 1 × 3, (1, 2) | 64 × T × 21 |
| GLUDConv2d_3 | 128 × T × 21 | 1 × 3, (1, 2) | 64 × T × 41 |
| GLUDConv2d_2 | 128 × T × 41 | 1 × 3, (1, 2) | 64 × T × 81 |
| GLUDConv2d_1 | 128 × T × 81 | 1 × 3, (1, 2) | 1 × T × 161 |
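To make Table 1 concrete, the following is a minimal PyTorch sketch of one GLUConv2d encoder stage: a 2 × 3 convolution with stride (1, 2) whose output is gated by the sigmoid of a parallel convolution (a GLU [34]), followed by batch normalization [32] and PReLU [33]. The padding scheme (one past frame in time, symmetric padding in frequency) is our assumption, chosen so that the frequency axis follows 161 → 81 → 41 → 21 → 11 → 6 as in the table while the time axis keeps T frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConv2d(nn.Module):
    """One encoder stage from Table 1: gated 2-D convolution (GLU),
    then BatchNorm and PReLU. Padding is an assumption that reproduces
    the frequency sizes 161 -> 81 -> 41 -> 21 -> 11 -> 6 in Table 1."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2))
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2))
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x):                      # x: (batch, C, T, F)
        # pad 1 bin on each side of frequency, 1 past frame in time
        x = F.pad(x, (1, 1, 1, 0))
        return self.act(self.norm(self.conv(x) * torch.sigmoid(self.gate(x))))
```

The GLUDConv2d decoder stages would mirror this with transposed convolutions, their 128-channel inputs coming from concatenating 64-channel skip connections with the 64-channel upsampled features.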
Table 2. Architecture of the NUNet with 4 layers.

| Layer Name | Input Size | Hyperparameters (Kernel, Stride) | Output Size |
| --- | --- | --- | --- |
| Conv2d_1 | 64 × T × 81 | 1 × 3, (1, 2) | 64 × T × 41 |
| Conv2d_2 | 64 × T × 41 | 1 × 3, (1, 2) | 64 × T × 21 |
| Conv2d_3 | 64 × T × 21 | 1 × 3, (1, 2) | 64 × T × 11 |
| Conv2d_4 | 64 × T × 11 | 1 × 3, (1, 2) | 64 × T × 6 |
| DConv2d_4 | 64 × T × 6 | 1 × 3, (1, 2) | 64 × T × 11 |
| DConv2d_3 | 128 × T × 11 | 1 × 3, (1, 2) | 64 × T × 21 |
| DConv2d_2 | 128 × T × 21 | 1 × 3, (1, 2) | 64 × T × 41 |
| DConv2d_1 | 128 × T × 41 | 1 × 3, (1, 2) | 64 × T × 81 |
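Table 2 can be read as a small U-Net living inside one encoder stage: four 1 × 3 convolutions halve the frequency axis, four transposed convolutions restore it, and every decoder stage after the first concatenates a 64-channel skip connection, which is why the DConv2d inputs have 128 channels. The sketch below is our interpretation of that structure; the padding, normalization, and activation choices are assumptions selected to reproduce the sizes in the table.

```python
import torch
import torch.nn as nn

def _block(in_ch: int, out_ch: int, transpose: bool = False) -> nn.Sequential:
    # Conv/DConv with kernel (1, 3), stride (1, 2); padding (0, 1) is an
    # assumption that reproduces 81 -> 41 -> 21 -> 11 -> 6 (and back) in Table 2.
    Conv = nn.ConvTranspose2d if transpose else nn.Conv2d
    return nn.Sequential(Conv(in_ch, out_ch, (1, 3), (1, 2), (0, 1)),
                         nn.BatchNorm2d(out_ch), nn.PReLU())

class NUNet(nn.Module):
    """Nested U-Net block with 4 layers (Table 2)."""
    def __init__(self, ch: int = 64, depth: int = 4):
        super().__init__()
        self.encoders = nn.ModuleList(_block(ch, ch) for _ in range(depth))
        self.decoders = nn.ModuleList(
            _block(ch if i == 0 else 2 * ch, ch, transpose=True)
            for i in range(depth))

    def forward(self, x):                       # x: (batch, 64, T, 81)
        skips, h = [], x
        for enc in self.encoders:
            h = enc(h)
            skips.append(h)                     # 41, 21, 11, 6 frequency bins
        for i, dec in enumerate(self.decoders):
            h = h if i == 0 else torch.cat([h, skips[-(i + 1)]], dim=1)
            h = dec(h)
        return h                                # (batch, 64, T, 81)
```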
Table 3. Architecture of the GTCM.

| Layer Name | Input Size | Hyperparameters (Kernel, Dilation) | Output Size |
| --- | --- | --- | --- |
| GLUConv1d_1 | 384 × T | 1, 1 | 192 × T |
| D-Conv1d_1 | 192 × T | 3, 1 | 192 × T |
| GLUConv1d_2 | 192 × T | 1, 1 | 384 × T |
| GLUConv1d_3 | 384 × T | 1, 1 | 192 × T |
| D-Conv1d_2 | 192 × T | 3, 2 | 192 × T |
| GLUConv1d_4 | 192 × T | 1, 1 | 384 × T |
| GLUConv1d_5 | 384 × T | 1, 1 | 192 × T |
| D-Conv1d_3 | 192 × T | 3, 5 | 192 × T |
| GLUConv1d_6 | 192 × T | 1, 1 | 384 × T |
| GLUConv1d_7 | 384 × T | 1, 1 | 192 × T |
| D-Conv1d_4 | 192 × T | 3, 9 | 192 × T |
| GLUConv1d_8 | 192 × T | 1, 1 | 384 × T |
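Reading Table 3 as a TCN-style bottleneck [39,40], each GTCM stacks four units: a gated 1 × 1 convolution squeezes 384 channels to 192, a dilated convolution with kernel size 3 models temporal context, and a second gated 1 × 1 convolution expands back to 384 channels. We read the second hyperparameter of the D-Conv1d rows as the dilation rate (1, 2, 5, 9); the per-unit residual connection and the non-causal symmetric padding in this sketch are our assumptions.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    # pointwise Conv1d whose output is gated by a parallel sigmoid branch
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))

class GTCM(nn.Module):
    """Gated temporal convolutional module (Table 3), sketched under the
    assumption of per-unit residual connections."""
    def __init__(self, ch: int = 384, hidden: int = 192, dilations=(1, 2, 5, 9)):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Sequential(
                GLUConv1d(ch, hidden),                    # 384 -> 192
                nn.PReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3,
                          dilation=d, padding=d),         # keeps T frames
                nn.PReLU(),
                GLUConv1d(hidden, ch))                    # 192 -> 384
            for d in dilations)

    def forward(self, x):                                 # x: (batch, 384, T)
        for unit in self.units:
            x = x + unit(x)                               # residual sum
        return x
```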
Table 4. The results of ablation studies on unseen test sets. The bold values show the best results. ‘w/o’ means ‘without’.

| Model | LSD ↓ | STOI ↑ | PESQ ↑ |
| --- | --- | --- | --- |
| Mixture | 8.658 | 0.697 | 1.509 |
| TCNUNet | **4.642** | **0.822** | **2.268** |
| w/o GLU | 4.945 | 0.819 | 2.211 |
| w/o GTCM | 5.092 | 0.809 | 2.161 |
| w/o NUNet | 4.730 | 0.818 | 2.247 |
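Among the metrics reported in Table 4 and the following tables, PESQ [54] and STOI [55] are typically computed with their reference implementations, while the log-spectral distance (LSD) [56] is simple enough to sketch. The dB convention and frame averaging below are common choices we assume, not necessarily the paper's exact formulation.

```python
import numpy as np

def lsd(ref_mag: np.ndarray, est_mag: np.ndarray, eps: float = 1e-8) -> float:
    """Log-spectral distance between two magnitude spectrograms of shape
    (frames, bins): RMS of the per-bin log-power difference, averaged
    over frames. Lower is better."""
    diff_db = 10.0 * np.log10((ref_mag ** 2 + eps) / (est_mag ** 2 + eps))
    return float(np.mean(np.sqrt(np.mean(diff_db ** 2, axis=1))))
```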
Table 5. Experimental results of various baseline methods on seen and unseen datasets. The bold values show the best results. After ‘±’ is the 95% confidence interval.

| Model | Validation LSD ↓ | Validation STOI ↑ | Validation PESQ ↑ | Test LSD ↓ | Test STOI ↑ | Test PESQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Mixture | 8.566 ± 0.13 | 0.697 ± 0.01 | 1.521 ± 0.02 | 8.658 ± 0.12 | 0.697 ± 0.01 | 1.509 ± 0.03 |
| MMSELSA | 8.995 ± 0.09 | 0.692 ± 0.01 | 1.716 ± 0.04 | 8.959 ± 0.10 | 0.693 ± 0.01 | 1.689 ± 0.05 |
| OM-LSA | 11.258 ± 0.14 | 0.683 ± 0.01 | 1.722 ± 0.04 | 11.137 ± 0.16 | 0.685 ± 0.01 | 1.691 ± 0.05 |
| Wiener | 8.748 ± 0.06 | 0.696 ± 0.01 | 1.626 ± 0.03 | 8.751 ± 0.09 | 0.695 ± 0.01 | 1.606 ± 0.04 |
| DCCRN | 5.915 ± 0.05 | 0.767 ± 0.01 | 1.730 ± 0.02 | 5.887 ± 0.06 | 0.766 ± 0.01 | 1.721 ± 0.02 |
| MAB-CED | 4.804 ± 0.07 | 0.775 ± 0.01 | 1.538 ± 0.02 | 4.833 ± 0.06 | 0.773 ± 0.01 | 1.535 ± 0.02 |
| GaGNet | 4.710 ± 0.06 | 0.815 ± 0.01 | 2.104 ± 0.04 | 4.744 ± 0.06 | 0.815 ± 0.01 | 2.123 ± 0.04 |
| GCRN | 4.751 ± 0.06 | 0.770 ± 0.01 | 1.514 ± 0.01 | 4.791 ± 0.06 | 0.768 ± 0.01 | 1.514 ± 0.01 |
| DARCN | 6.826 ± 0.06 | 0.747 ± 0.01 | 1.816 ± 0.03 | 6.862 ± 0.07 | 0.750 ± 0.01 | 1.819 ± 0.04 |
| NSE-CATNet | 4.936 ± 0.06 | 0.815 ± 0.01 | 2.179 ± 0.04 | 4.986 ± 0.06 | 0.815 ± 0.01 | 2.199 ± 0.04 |
| TCNUNet | **4.598 ± 0.06** | **0.822 ± 0.01** | **2.255 ± 0.05** | **4.642 ± 0.06** | **0.822 ± 0.01** | **2.268 ± 0.05** |
Table 6. Experimental results of various target objects under seen conditions. The bold values show the best results. After ‘±’ is the 95% confidence interval.

| Target Object | Mixture LSD ↓ | Mixture STOI ↑ | Mixture PESQ ↑ | Enhanced LSD ↓ | Enhanced STOI ↑ | Enhanced PESQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Iron box | 8.990 ± 0.12 | 0.547 ± 0.01 | 1.149 ± 0.01 | 4.821 ± 0.05 | 0.752 ± 0.01 | 1.705 ± 0.02 |
| Printing paper | 8.931 ± 0.14 | 0.633 ± 0.01 | 1.326 ± 0.02 | 5.053 ± 0.10 | 0.762 ± 0.01 | 2.038 ± 0.05 |
| PVC sheet | 9.064 ± 0.13 | 0.680 ± 0.01 | 1.374 ± 0.02 | 4.937 ± 0.08 | 0.786 ± 0.01 | 1.995 ± 0.04 |
Table 7. Experimental results of various target objects under unseen conditions. The bold values show the best results. After ‘±’ is the 95% confidence interval.

| Target Object | Mixture LSD ↓ | Mixture STOI ↑ | Mixture PESQ ↑ | Enhanced LSD ↓ | Enhanced STOI ↑ | Enhanced PESQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Iron box | 8.954 ± 0.13 | 0.550 ± 0.01 | 1.150 ± 0.01 | 4.853 ± 0.06 | 0.755 ± 0.01 | 1.733 ± 0.02 |
| Printing paper | 9.286 ± 0.13 | 0.628 ± 0.01 | 1.328 ± 0.02 | 5.070 ± 0.09 | 0.759 ± 0.01 | 2.051 ± 0.05 |
| PVC sheet | 9.076 ± 0.15 | 0.680 ± 0.01 | 1.388 ± 0.01 | 4.931 ± 0.10 | 0.784 ± 0.01 | 2.010 ± 0.04 |
Table 8. Experimental results of various target objects in the unseen TIMIT dataset. The bold values show the best results. After ‘±’ is the 95% confidence interval.

| Target Object | Mixture LSD ↓ | Mixture STOI ↑ | Mixture PESQ ↑ | Enhanced LSD ↓ | Enhanced STOI ↑ | Enhanced PESQ ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Cardboard box | 7.665 ± 0.07 | 0.716 ± 0.01 | 1.741 ± 0.05 | 5.932 ± 0.18 | 0.780 ± 0.01 | 1.992 ± 0.06 |
| Iron box | 8.074 ± 0.09 | 0.560 ± 0.01 | 1.457 ± 0.02 | 7.035 ± 0.09 | 0.607 ± 0.02 | 1.500 ± 0.03 |
| Printing paper | 9.090 ± 0.12 | 0.625 ± 0.01 | 1.518 ± 0.02 | 6.531 ± 0.10 | 0.713 ± 0.01 | 1.870 ± 0.04 |
| PVC sheet | 7.690 ± 0.10 | 0.687 ± 0.01 | 1.599 ± 0.03 | 6.091 ± 0.08 | 0.728 ± 0.01 | 1.804 ± 0.04 |