1. Introduction
With the continuous advancement of global trade, the volume of international timber circulation has significantly increased, leading to a growing risk of the cross-border transmission of wood-boring pests. Such pests are highly prone to spreading with cargo flows during timber transportation. By boring into the internal structure of timber, they not only severely reduce its economic value but also introduce alien harmful species, posing a potential threat to agricultural and forestry ecosystems. Currently, customs mainly rely on manual visual inspection for timber pest detection: professional personnel observe signs of insect holes on the timber surface and make judgments by combining means such as splitting the timber. These wood-boring insects are typically in the larval stage, producing only weak signs of activity, and are therefore difficult to detect, as shown in
Figure 1. However, this method suffers from long detection times, high labor costs, low efficiency, and susceptibility to subjective experience, making it difficult to meet the needs of rapid quarantine for large-scale, high-throughput import and export timber. Therefore, achieving rapid, accurate, and intelligent detection of wood-boring pests inside timber has become a research hotspot in the field of pest quarantine [
1]. In recent years, the academic community has carried out extensive explorations on detection technologies for wood-boring pests, particularly focusing on the identification of weak vibration signals generated by their feeding or peristalsis. The research path has roughly gone through three stages: (1) detection by analyzing the pulse interval range corresponding to signal characteristic spectrograms or signal-based algorithms; (2) detection by combining signal characteristic spectrograms with machine learning methods; and (3) detection by combining signal characteristic spectrograms with deep learning methods.
In the field of wood-boring pest detection, early studies predominantly employed methods based on signal characteristic spectrogram analysis and traditional signal processing algorithms, achieving the preliminary identification of pest infestations by extracting features from weak vibration signals generated by insect activities. For instance, Guo et al. utilized pickups and recorders to collect crawling sound signals of stored grain pests (
Tribolium confusum and
Oryzaephilus surinamensis) and digitally processed the collected signals via MATLAB software [
2,
3]. Their approach involved low-pass filtering to eliminate background noise, followed by extracting acoustic functional spectrum features of crawling sounds and identifying pest species by comparing energy value differences in power spectrum pulses among different species. Geng et al. used a microphone to collect adult crawling sound signals of
Tribolium castaneum and
Sitophilus oryzae [
4,
5]. After low-pass filtering and wavelet threshold denoising, they analyzed the main frequency and sub-frequency characteristics in a wheat medium. The study revealed that although the main frequencies of the power spectra of different pests were close, the sub-frequency distributions exhibited significant differences, serving as effective distinguishing criteria. In foreign studies, Mankin et al. conducted mean spectral energy distribution analysis on the crawling and feeding sounds of larvae such as
Rhynchophorus ferrugineus,
Monochamus alternatus, and
Buprestidae, primarily using characteristic parameters such as high-frequency pulse energy, pulse burst frequency, and pulse interval for species differentiation [
6]. Dingfeng Lou collected feeding sounds of six pest species through a microphone in an anechoic chamber, plotted their power spectral density diagrams, and determined the main energy distribution frequency bands for species identification [
7]. At the signal algorithm level, Mingzhen Zhang recorded crawling and turning vibration signals of
Sitophilus zeamais and
Tribolium castaneum on a film in a soundproof room [
8]. He used low-pass filtering and wavelet thresholding for denoising and applied the FastICA algorithm for signal centering, whitening, and independent component extraction, thereby separating the mixed signals of different pest species. For the vibration signals of three species of longhorn beetle larvae, Shenghuang Liu first applied Variational Mode Decomposition (VMD) for denoising, then extracted the energy proportion of each node through three-layer wavelet packet decomposition, and achieved species identification by combining the fluctuation duration of the time-domain signals with the main frequency components of the frequency domain [
9]. The above studies indicate that in relatively quiet or ideal experimental environments, methods based on spectrograms and traditional signal processing algorithms exhibit certain effectiveness in identifying wood-boring pests. However, in practical applications, due to the complex and changeable environmental noise, the above methods have limited ability to suppress real background noise, often leading to a decline in recognition accuracy or even failure. Therefore, such methods still have significant limitations in terms of noise robustness, environmental adaptability, and generalization ability in real-world scenarios, necessitating further optimization and improvement.
Subsequently, studies introducing machine learning methods based on feature spectrograms for pest identification gradually attracted attention in the academic community. Some scholars attempted to combine pest activity signal features with statistical learning models to enhance the automation and accuracy of identification. For example, Min Guo collected crawling and turning sound signals of two stored grain pests,
Sitophilus zeamais and
Tribolium castaneum, in a soundproof environment [
10]. She first used Mel-frequency cepstral coefficients (MFCCs) to extract frequency-domain feature parameters, then estimated the model parameters with a Gaussian Mixture Model (GMM) fitted via the Expectation–Maximization (EM) algorithm, and used a clustering algorithm to classify and identify the pest sound signals. For four types of activity sound signals of two stored grain pests recorded in a soundproof box, Mingzhen Zhang used the Isometric Feature Mapping (ISOMAP) method for manifold dimensionality reduction to extract streamline features from the pest sound signals [
11]. Subsequently, a Support Vector Machine (SVM) with a heavy-tailed radial basis function kernel was used to construct a classification model, which was trained and tested on the streamline feature data to achieve effective differentiation of pest species. From the perspective of the time and frequency domains, Yufei Bu extracted features such as pulse duration and energy distribution range for seven pest species and used sum-of-squared-deviations clustering on the time-domain data to achieve pest species identification and classification [
12]. Additionally, Ping Han proposed an automatic parameter selection method for Support Vector Machines (SVMs) based on chaotic optimization, aiming at the parameter selection problem of SVM models in the sound signal identification of stored grain pests [
13]. This method guides the search over the penalty parameter C and the kernel width by generating chaotic sequences from logistic and circular maps, and extends the chaotic variables into the parameter space through “carrier mapping” to achieve global optimization of the SVM parameter combination. Experimental results show that this method not only improves recognition accuracy but also effectively reduces the number of candidate models and improves overall computational efficiency. Although the above methods have achieved certain recognition effects in controlled environments, some key problems remain in practical applications. Firstly, the assumptions about the distribution of pest signals made during modeling in some methods deviate from the real data, leading to insufficient model generalization ability. Secondly, most algorithms are sensitive to hyperparameters, and improper parameter selection may significantly degrade classification performance, especially when facing new pest species not included in training or complex environmental interference. Therefore, constructing an identification model that is robust across diverse pest species and adaptable to noise and feature disturbances remains a key challenge in current research.
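The chaotic parameter search described above can be sketched in a few lines. The logistic map is the standard choice for generating such sequences; the seeds, parameter ranges, and function names below are illustrative assumptions, not values from the cited work.

```python
def logistic_sequence(x0: float, n: int, mu: float = 4.0):
    """Chaotic sequence from the logistic map x_{k+1} = mu * x_k * (1 - x_k)."""
    xs, x = [], x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        xs.append(x)
    return xs

def carrier_map(x: float, lo: float, hi: float) -> float:
    """'Carrier mapping': stretch a chaotic variable from (0, 1) onto [lo, hi]."""
    return lo + (hi - lo) * x

# Illustrative candidate (C, kernel width) pairs for an SVM parameter search
cs = [carrier_map(x, 0.1, 100.0) for x in logistic_sequence(0.23, 50)]
widths = [carrier_map(x, 1e-4, 1.0) for x in logistic_sequence(0.61, 50)]
candidates = list(zip(cs, widths))
```

Each candidate pair would then be scored by cross-validated SVM accuracy, with the chaotic sequence's ergodicity standing in for a random or grid search.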
In research on the integration of signal feature spectrograms and deep learning methods, a series of breakthroughs have been made in recent years. Some scholars have attempted to introduce neural network architectures to enhance the robustness and accuracy of pest sound signal identification under complex backgrounds. For example, Ping Han used the adaptive neural network noise reduction method (Madaline) for signal filtering and combined it with an adaptive noise canceller to effectively suppress environmental noise interference for three typical stored grain pests:
Sitophilus oryzae,
Sitophilus zeamais, and
Tribolium castaneum [
14]. The experiment set up two groups of control signals: one group was a mixture of pest sounds and noise, and the other was pure noise signals. In the study, pest spectrograms were input into a Backpropagation neural network (BP neural network) for training, and pest species classification and identification were achieved through labeled samples. Xiaoqian Tuo constructed a noise reduction neural network (Enhance) based on a dilated convolution structure for processing insect sound data under three types of noise conditions [
15]. Subsequently, four recognition models (InsectFrames) with different output dimensions of convolutional layers were designed to evaluate the differences in recognition accuracy under different feature expression capabilities. Juhu Li collected four types of signals:
Agrilus planipennis,
Cryptorhynchus lapathi, their mixed sounds, and environmental noise [
16]. He extracted cepstral coefficient spectrograms through the MFCC method and input them into the self-designed deep neural network model BoreNet for classification and identification. To improve the model’s generalization ability, noise-free insect sound fragments were used in the training stage to capture universal signal features, and noisy insect sound fragments were introduced in the test stage to simulate practical application scenarios. Weizheng Jiang used sensors to collect signals of the emerald ash borer,
Holcocerus insularis, their mixed sounds, and noise, and also extracted MFCC spectrograms as model inputs [
17]. He proposed a novel convolutional neural network architecture called the Residual Mixed-domain Attention Module Network (RMAMNet), which integrates channel attention and temporal attention mechanisms to enhance the model’s ability to learn key features, demonstrating good recognition stability in multi-source insect sound mixed backgrounds. Haopeng Shi proposed a compact and excellent-performance vibration-enhanced neural network for the larvae of the wood-boring pest
Agrilus planipennis [
18]. The network combines frequency-domain enhancement and time-domain enhancement modules in a stacked framework. Experimental results show that the enhanced network significantly improves the recognition accuracy under noisy backgrounds, exhibiting obvious performance advantages compared to the undenoised model. Overall, the above studies demonstrate the potential of combining deep learning with signal feature spectrograms in the identification of wood-boring pests. However, these methods still have problems such as high model complexity, long inference time, and limited generalization ability. Especially under unstructured noise and complex on-site background conditions, they still face challenges such as accuracy degradation and insufficient robustness. Therefore, constructing an efficient insect sound detection model with lightweight structure, strong feature perception ability, and adaptability to complex environments remains the core problem to be solved in this field.
Detection methods are mainly divided into acoustic signal-based and vibration-based approaches. In terms of acoustic detection, Chunfeng Dou used an NI acquisition card combined with an SR 150 N acoustic emission sensor to collect acoustic signals [
19] and then used wavelet packets to reconstruct the time–frequency-domain signals of pests. The effect of larval number on the number of pulses, duration, and amplitude of the signal was studied. Yufei Bu used the AED-2010L sound detector (built-in sound sensor) combined with the SP-1L probe to collect four types of acoustic signals of two types of longhorn larvae [
20] and distinguished them by the amplitude, waveform, pulses, and energy of the time-domain waveform and its spectrogram. Senlin Geng used microphones and sound capture cards to collect the sounds of two types of grain storage pests in a soundproof room [
5] and distinguished them by their power spectra. In terms of vibration signal detection, Piotr Bilski used a CCLD accelerometer and an acquisition card to collect vibration signals and distinguished pest signals from background noise with a Support Vector Machine [
21]. Xing Zhang used the SP-1L piezoelectric sensor probe combined with a self-developed vibration sensor to collect vibration signals and designed TrunkNet to identify pest vibration signals [
22].
Although current methods based on signal feature spectrograms and deep learning have achieved high accuracy in pest sound signal identification, two prominent issues remain. On the one hand, existing denoising networks perform insufficiently when handling non-uniform complex noise, limiting the noise reduction effect; on the other hand, most methods decouple the denoising and classification processes and lack a unified integrated modeling framework, which affects the robustness and efficiency of the overall identification system. To address these issues, this paper proposes an integrated denoising and classification multi-attention recognition network, the Residual Denoising Vision Network (RDVNet). The model first performs deep denoising of pest sound signals through two groups of residual structures (each composed of four residual blocks) and then feeds the denoised results into a lightweight classification network with a sandwich structure to complete the end-to-end recognition task. On the collected real-world insect sound dataset, RDVNet achieved excellent noise reduction performance and classification accuracy, verifying the effectiveness and practicality of the model design. The main contributions of this paper are as follows: (1) We propose an integrated multi-attention recognition network that effectively fuses the denoising and classification modules. It significantly improves the model's adaptability to non-uniform noise environments and its recognition accuracy, and enhances the network's perception of key signal features under complex backgrounds. (2) We introduce a dedicated denoising module into the recognition model and conduct comparative experiments with three types of cepstral coefficient spectrograms (MFCC, RASTA-PLP, and PNCC), verifying that PNCC features are the most expressive in noisy environments. RDVNet achieves the best denoising performance among all comparative models.
(3) We develop a PyQt-based visual recognition system, realizing full-process integration from signal collection, preprocessing, and feature extraction to recognition result display. It significantly improves data processing efficiency and visual interaction performance and has good practicality and promotion potential.
4. Results and Discussion
4.1. Denoising Comparison
In this study, the collected pest vibration signals were respectively converted into three types of cepstral coefficient spectrograms, MFCC, PNCC, and RASTA-PLP, which were used to characterize pure insect peristaltic signals and insect peristaltic signals mixed with noise. To simulate various complex noise backgrounds, datasets under five different signal-to-noise ratio (SNR) conditions were constructed [
41], ranging from −10 dB to 0 dB, and proportionally divided into a training set, validation set, and test set. In the model evaluation experiment, the proposed denoising network de-RDVNet was systematically compared with multiple mainstream denoising models, including VDNNet, RIDNet, CBDNet, and DeamNet. These networks are representative of signal denoising tasks and are often used as benchmarks. We used the same AdamW optimizer, cosine learning rate schedule, batch size, and loss function, as well as the same input feature map parameters (MFCC, RASTA-PLP, and PNCC), for all models. The comparison covered the denoising performance of each model under each of the above cepstral spectrogram inputs and the quality of spectrogram feature restoration. To comprehensively measure the denoising effect of each model, this paper uses two common image quality evaluation indicators. Peak Signal-to-Noise Ratio (PSNR): used to measure the difference between the restored image and the original image, it is commonly applied to image restoration tasks such as compression and denoising. The larger the PSNR value, the closer the restored image is to the original image and the better the noise suppression. Its calculation formula is shown in Equation (
14). Structural Similarity Index Measure (SSIM): This is based on the joint modeling of brightness, contrast, and structural information between images [
42]. It is used to measure the structural fidelity between the restored image and the reference image. The range of SSIM values is [0, 1]. When SSIM approaches 1, it indicates that the structures of the two images are more consistent and the visual similarity is higher. Its definition is as shown in Equation (
10). Through the above two indicators, the image reconstruction quality and detail restoration ability of each model under different SNR and feature map input conditions can be quantitatively evaluated, providing a reliable basis for model performance analysis.
where the numerator represents the maximum possible pixel value of the image, m and n denote the image dimensions, I and K correspond to the denoised image and the original reference image, respectively, and B stands for the bit depth of the pixel values. In this study, the spectrograms used for evaluation were saved as 8-bit integers, so B = 8.
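Equations (14) and (10) are not reproduced in this excerpt; for reference, the standard definitions consistent with the surrounding description (image size m × n, maximum pixel value determined by bit depth B, and the usual SSIM stabilization constants) are:

```latex
\mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\bigl(I(i,j) - K(i,j)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\,\log_{10}\frac{\left(2^{B}-1\right)^{2}}{\mathrm{MSE}},
\qquad
\mathrm{SSIM}(x,y) = \frac{\left(2\mu_{x}\mu_{y}+c_{1}\right)\left(2\sigma_{xy}+c_{2}\right)}
{\left(\mu_{x}^{2}+\mu_{y}^{2}+c_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2}\right)}
```

Here μ and σ² denote local means and variances, σ_xy the covariance between the two images, and c₁, c₂ small constants that stabilize the division.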
Figure 13 shows the training and testing loss curves of the proposed de-RDVNet model under a −10 dB signal-to-noise ratio (SNR) with PNCC feature maps as input. As training progresses, both the training loss and the testing loss continuously decrease and stabilize in the later stages, indicating that the network converges well. The final training loss is 0.405 and the final testing loss is 0.402; the two curves are highly consistent in the later stages, indicating that the model fits well without obvious overfitting.
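Fixed-SNR conditions such as the −10 dB case above are typically synthesized by scaling the noise to a target power before mixing. A minimal sketch follows, with illustrative signals (a pure tone and Gaussian noise stand in for insect and background recordings):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power: P_noise' = P_clean / 10**(SNR/10)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000, endpoint=False))
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, -10.0)  # heavily noise-dominated mixture
```

At −10 dB the noise carries ten times the power of the signal, which is what makes this the hardest of the five evaluation conditions.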
Table 1 summarizes the PSNR and SSIM performances of five comparative networks (including de-RDVNet, VDNNet, RIDNet, CBDNet, and DeamNet) under five signal-to-noise ratios (from −10 dB to 0 dB) with three types of feature spectrogram inputs (MFCC, PNCC, and RASTA-PLP). It can be observed from
Table 1 that as the SNR increases (i.e., noise interference weakens), the PSNR and SSIM values of each model on the three types of feature maps show a gradual upward trend, consistent with the expectation that lower noise levels yield higher reconstruction quality.
To further compare the overall performance of different networks and feature spectrograms across the full SNR range,
Table 2 calculates the average PSNR and SSIM values of each network in
Table 1 under the five SNR conditions. The results show the following: (1) Among all comparative networks, de-RDVNet achieves an average PSNR of 29.8 and an average SSIM of 0.820 with PNCC feature spectrograms, significantly outperforming the other methods and indicating superior denoising capability in strong noise backgrounds; DeamNet ranks second. (2) From the perspective of feature spectrograms, PNCC generally obtains the highest PSNR and SSIM values across all networks. This indicates that PNCC features have the strongest suppression ability for non-stationary noise in this task, with optimal feature robustness and representation capability. In summary, de-RDVNet has clear advantages in both structural design and feature adaptation, and when combined with PNCC spectrograms in particular, it achieves the best pest signal denoising and reconstruction performance.
4.2. Comparison of Classification Models
Based on the aforementioned denoising experiments, PNCC spectrograms are further selected as feature inputs, and DeamNet, which ranks second in denoising performance, is used as the pre-denoising module. Comparisons are conducted among five lightweight classification models: ShuffleNet, Swin Transformer, ConvNeXt, MobileViT, and the proposed RDVNet. These methods cover both convolutional and Transformer architectures, enabling a comprehensive evaluation of the relative performance of RDVNet. We used the same AdamW optimizer, batch size, loss function, and input feature map settings for all models. The aim is to verify the differences in recognition performance of the different classification backbones under the same feature map input and pre-denoising conditions.
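The shared optimizer settings include a cosine learning rate schedule (as stated in the denoising comparison above). A minimal sketch of such a schedule follows; the 50-epoch horizon matches the training length reported below, while the peak learning rate is an illustrative assumption:

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: decay the learning rate from lr_max to lr_min over total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Learning rate over a 50-epoch run, starting from an assumed peak of 1e-3
lrs = [cosine_lr(s, 50, 1e-3) for s in range(51)]
```

The smooth decay avoids the abrupt drops of step schedules, which tends to stabilize the late-stage convergence visible in the loss curves.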
As shown in
Figure 14, the training loss and testing loss curves of RDVNet during the classification training process are highly consistent and continue to decline with the increase in the number of training epochs, indicating that the model has good convergence and generalization capabilities. After 50 training epochs, the training set loss decreased to 0.198, and the testing set loss was 0.197. The two are almost the same, indicating that the network achieves an optimal fitting state without obvious overfitting. In addition,
Figure 11b and
Figure 15a show the changing trends of the classification accuracy and F1-score of the five comparative models under the five signal-to-noise ratio conditions. It can be observed that as the signal-to-noise ratio increases, the accuracy and F1-score of each model gradually increase overall, indicating that the SNR affects the recognition ability of the models. Under all signal-to-noise ratio conditions, RDVNet maintains the highest classification accuracy and F1-score, with accuracy stable above 90.0%, demonstrating excellent anti-noise recognition capability. Averaging the classification performance over the five signal-to-noise ratios gives an average F1-score of 0.878 and an average classification accuracy of 92.8%. In summary, the results show that RDVNet not only performs excellently in denoising tasks but also outperforms current mainstream lightweight models in classification performance when combined with highly robust feature map inputs, demonstrating good practicality and promotion value.
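The accuracy and F1-score reported above follow their standard definitions; a minimal sketch on illustrative binary labels (positive class = "with pest"):

```python
def binary_f1_and_accuracy(y_true, y_pred):
    """Accuracy and F1-score (positive class = 1) from two parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Illustrative labels: one missed pest and one false alarm out of six samples
acc, f1 = binary_f1_and_accuracy([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

Because F1 balances precision and recall, it penalizes the missed-pest case that raw accuracy can hide, which is why both metrics are reported.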
4.3. Discussion
The comprehensive experimental results show that the integrated denoising and recognition model proposed in this paper exhibits superior performance in multi-signal-to-noise ratio (SNR) environments as follows: (1) In terms of denoising performance, under five SNR conditions from −10 dB to 0 dB, the proposed denoising sub-network de-RDVNet outperforms the four comparative mainstream networks (VDNNet, RID-Net, CBDNet, and DeamNet) in all indicators. Its average peak signal-to-noise ratio (PSNR) is 29.8 and average Structural Similarity (SSIM) is 0.820 under all conditions, ranking first among the five types of models. This fully demonstrates that this method has strong comprehensive capabilities in maintaining image structure and suppressing noise, and the restored image has the highest similarity to the original clean image. (2) In terms of feature spectrogram comparison, among the three types of cepstral spectrograms (MFCC, PNCC, and RASTA-PLP), PNCC shows the highest average PSNR and SSIM in all networks. This indicates that it has stronger suppression ability and feature fidelity for non-stationary background noise and is suitable as a robust feature representation for insect peristaltic sounds. (3) In terms of classification performance: Under the five SNR test conditions, RDVNet’s accuracy and F1-score comprehensively outperform the comparative networks. The classification accuracy under all SNRs exceeds 90.0%, with an average accuracy of 92.8% and an average F1-score of 0.878, verifying its strong classification ability under noisy conditions.
RDVNet achieves optimal results in both denoising and classification tasks, mainly attributed to the following two structural design advantages: (1) The dual-branch structure design of de-RDVNet combines Residual Attention Blocks (RABs) and Hybrid Dilated Residual Attention Blocks (HDRABs), achieving the deep fusion of local and global information through multi-level skip connections. Meanwhile, dilated convolution and downsampling operations jointly expand the receptive field, realizing effective differentiation between insect peristaltic signals and background vibration noise in the feature space, providing a clearer feature foundation for the classification module from the source. (2) The cascaded group attention mechanism in the main classification module explicitly introduces information interaction between multiple heads, avoiding the problem of independent calculation between attention heads in traditional Transformers. By sharing and fusing the contextual information of different attention heads, it improves the ability to capture the overall semantics of features while reducing redundant feature calculations and enhancing efficiency and generalization ability. In addition, the robustness of PNCC feature spectrograms was further verified in the experiment. Due to its use of a nonlinear compression mechanism to suppress non-stationary noise, it has stronger anti-interference ability compared to other feature spectrograms in this research scenario, thereby significantly improving classification accuracy. It should be noted that this paper currently only models the binary classification problem of “with pest/without pest”, mainly focusing on the global distinction between insect peristaltic sounds and external noise vibration sounds in feature frequency bands. 
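The claim in (1) that dilated convolutions expand the receptive field can be checked with standard receptive-field arithmetic; the layer configurations below are illustrative, not the actual RDVNet layers.

```python
def receptive_field(layers):
    """Receptive field of one output pixel for a stack of conv layers.
    Each layer is (kernel_size, stride, dilation); the effective kernel of a
    dilated conv is d*(k-1)+1, and each layer grows the receptive field by
    (effective_kernel - 1) times the product of all preceding strides."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

plain = [(3, 1, 1)] * 4                                 # four plain 3x3 convs -> RF 9
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]  # hybrid dilation rates -> RF 31
```

With the same depth and parameter count, the hybrid-dilation stack sees a 31-sample context versus 9 for plain convolutions, which is the mechanism behind the wider noise context available to the HDRAB branch.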
Future research can further refine pest categories and perform recognition modeling under more complex classification systems to improve the model’s practicality and versatility in multi-pest detection scenarios. It is worth noting that in extreme noise environments (such as −10 dB), de-RDVNet maintains good image reconstruction quality (PSNR ⩾ 23.0, SSIM ⩾ 0.700) for various feature map inputs, while RDVNet achieves a classification accuracy of 90.0% and an F1-score of 0.810, demonstrating its good anti-noise robustness. Future work will further explore the model’s performance in a wider SNR range and real field scenarios and consider introducing an adaptive SNR discrimination mechanism to enhance model adaptability.
Although this study has achieved remarkable results, there are still the following areas for improvement, and subsequent research will focus on the following directions: (1) Improvement in hardware automatic collection: The current system still relies on manual drilling in the insect hole positioning stage, resulting in limited efficiency for large quantities of timber. In the future, semi-automatic or fully automatic drilling and sensor placement platforms can be developed in combination with mechanical arms or automatic positioning devices to enhance the practical application capability of the system. (2) Lightweight model deployment: The current model deployment is based on desktop GPU devices, which is not suitable for field environments. In the future, model pruning and inference deployment on embedded platforms such as Jetson and Raspberry Pi can be explored to achieve the portability and real-time performance of pest detection devices. (3) System cloud extension: Currently, the PyQt system only supports local interactive operations. In the future, it can be deployed to cloud platforms through Web front-ends or WeChat mini-programs to achieve remote data upload, recognition inference, and result return, constructing an intelligent cloud-based pest identification platform. In summary, the RDVNet model and integrated system proposed in this paper provide a new idea and tool for the efficient identification of wood-boring pests, with good application prospects. Future research will further deepen and optimize in the directions of fine pest classification, lightweight model deployment, and multi-terminal system interaction.