1. Introduction
Lung sounds are important physiological signals that reflect the health status of the respiratory system. Through auscultation, abnormal respiratory sounds such as rhonchi and pleural friction rub can be identified, providing important clinical evidence for the diagnosis of respiratory diseases [1,2,3]. However, traditional auscultation relies heavily on veterinarians’ experience, making it highly subjective, and its diagnostic accuracy is limited under complex noise conditions or in cases of mild pathological changes [4].
With the development of electronic stethoscopes and signal processing techniques, computer-aided lung sound recognition has achieved considerable progress in human healthcare [5,6]. However, studies focusing on ruminants, particularly goats, remain limited. In large-scale farming environments, respiratory diseases spread rapidly and are difficult to detect early, which may lead to significant economic losses. Therefore, developing automatic recognition methods for abnormal goat lung sounds is of great practical importance. Computer-aided diagnosis of lung sounds in ruminants is still at an early stage [7], while advances in human lung sound analysis provide valuable references for applying related techniques to livestock scenarios.
Effective feature extraction is a key step in automatic lung sound recognition. Early studies on human lung sound abnormality detection mainly extracted time-domain features such as short-time energy, short-time zero-crossing rate, amplitude, envelope, mean, and standard deviation [8,9,10]. However, purely time-domain features are insufficient to characterize the non-stationary and nonlinear properties of lung sound signals, so frequency-domain analysis methods were gradually introduced. Demir et al. [11] converted lung sounds into spectrograms and short-time Fourier transform representations, achieving high classification accuracy. Jayalakshmy et al. [12] used Gammatone Cepstral Coefficients (GTCC) as acoustic features and achieved promising classification results. With further development, joint time–frequency representations became mainstream. Naqvi et al. [13] combined time-domain, frequency-domain, and time–frequency-domain features with multiple classifiers for recognition. Mang et al. [14] classified lung sounds by extracting spectrograms, Mel spectrograms, and constant-Q transform (CQT) spectrograms. Gupta et al. [15] extracted gammatone spectrograms using a filter bank that simulates human auditory frequency analysis, enhancing the representation of abnormal lung sounds.
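To make the time–frequency representations surveyed above concrete, the following numpy sketch computes a log-Mel spectrogram from a raw signal. It is a minimal illustration, not the pipeline of any cited study; the sampling rate (4 kHz), FFT size, hop length, and band limits are illustrative assumptions chosen for lung sound analysis, where most diagnostic energy lies below about 2 kHz.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr, fmin=50.0, fmax=2000.0):
    """Triangular Mel filters; band limits here are assumed values."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def log_mel_spectrogram(x, sr=4000, n_fft=256, hop=128, n_mels=64):
    """Frame the signal, take |STFT|^2, apply Mel filters, log-compress."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (frames, bins)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T      # (frames, n_mels)
    return np.log(mel + 1e-10)

# Example: 2 s of synthetic signal at the assumed 4 kHz sampling rate
x = np.random.randn(8000)
S = log_mel_spectrogram(x)
print(S.shape)  # (61, 64): 61 time frames x 64 Mel bands
```

The resulting 2D time–frequency matrix is what CNN- and Transformer-based classifiers in the cited literature consume as an image-like input.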
In terms of modeling, early studies mainly relied on traditional machine learning methods such as support vector machines (SVM), k-nearest neighbors (k-NN), and random forests [16,17,18,19]. These approaches often suffer from low accuracy and instability, with limited generalization under complex noise conditions or high inter-class similarity. With the advent of deep learning, a new paradigm for lung sound diagnosis emerged. Unlike traditional methods that depend on handcrafted features, deep learning enables automatic feature learning through deep neural networks and has demonstrated superior performance across various tasks. Aykanat et al. [20] proposed a two-layer convolutional neural network (CNN) using MFCC features, outperforming traditional SVM models. CNNs have also been employed to model lung sound spectrograms, achieving effective discrimination among multiple respiratory diseases [21,22,23,24]. Pham et al. [25] introduced a teacher–student CNN expert network based on Mel and gammatone spectrograms, achieving high recognition accuracy. Choi et al. [26] incorporated attention mechanisms into deep learning models, further improving classification accuracy by emphasizing critical features. To model temporal dependencies in lung sounds, recurrent neural networks (RNNs) were introduced; Kochetov et al. proposed an end-to-end RNN-based framework for abnormal lung sound detection [27,28]. Hybrid CNN-RNN/LSTM architectures were also proposed to leverage both local feature extraction and sequential modeling capabilities [29,30,31]. In recent years, inspired by their success in computer vision, Transformer architectures and self-attention mechanisms have attracted increasing attention in lung sound analysis and agricultural artificial intelligence [32,33,34,35,36].
Building on established research in human lung sound abnormality recognition, this study focuses on the intelligent identification of abnormal goat lung sounds. Livestock acoustic recognition faces persistent challenges, including heavily intertwined background noise, high inter-class similarity, and significant intra-class heterogeneity. While traditional CNNs are constrained by their local receptive fields, Transformer-based models overcome these local spatial constraints by dynamically establishing global spatio-temporal correlations. This capability allows a model to focus precisely on critical pathological frequency bands and temporal segments while suppressing interference from non-stationary environmental noise, offering a robust solution for acoustic signal recognition in agricultural scenarios [37,38]. To this end, this study adopts the Swin Transformer, which possesses strong hierarchical local–global modeling capability, as the backbone network. On this basis, we further optimize the architecture from multiple perspectives, including multi-scale attention mechanisms, spatial feature aggregation strategies, and frequency-band adaptive modeling, and propose an improved model, AAF-SwinT. The model aims to enhance the network’s ability to capture and represent hard-to-learn abnormal sample features, thereby achieving accurate and robust recognition of abnormal goat lung sounds in complex environments.
The main contributions of this work are summarized as follows:
- (1) An axial decomposed attention (ADA) module is proposed to enhance lung sound feature representation by decomposing time–frequency attention modeling, thereby alleviating the feature-similarity problem among different lung sound categories.
- (2) An Adaptive Spatial Aggregation for Patch Merging (ASAP) module is designed to adaptively weight and aggregate features from important regions, reducing intra-class feature variability caused by noise and individual differences.
- (3) A Frequency-Aware Multi-Layer Perceptron (FAM) is proposed to improve the extraction of discriminative spectral information by applying differentiated feature transformation strategies across frequency bands.
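The core idea behind contribution (1), attention decomposed along the temporal and frequency axes, can be sketched as follows. The paper's Section 2 holds the actual ADA design; this numpy sketch shows only one plausible reading of "axial decomposition", with all weight shapes, the residual fusion, and the single-head formulation being illustrative assumptions. Its benefit is complexity: full attention over a T×F map costs O((TF)²), while two axial passes cost O(T²F + F²T).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(X, W_q, W_k, W_v, axis):
    """Single-head self-attention restricted to one axis of a (T, F, C) map."""
    if axis == 0:                       # temporal axis: attend across T per band
        X = X.transpose(1, 0, 2)        # (F, T, C)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1]))
    out = A @ V
    if axis == 0:
        out = out.transpose(1, 0, 2)    # back to (T, F, C)
    return out

def axial_decomposed_attention(X, params):
    """Hypothetical ADA sketch: temporal and frequency attention, residual-fused."""
    t = axial_attention(X, *params["time"], axis=0)   # along time
    f = axial_attention(X, *params["freq"], axis=1)   # along frequency
    return X + t + f                    # additive fusion (one plausible choice)

rng = np.random.default_rng(0)
T, F, C = 8, 16, 32                     # toy time/frequency/channel sizes
X = rng.standard_normal((T, F, C))
params = {k: [rng.standard_normal((C, C)) * 0.05 for _ in range(3)]
          for k in ("time", "freq")}
Y = axial_decomposed_attention(X, params)
print(Y.shape)  # (8, 16, 32)
```

Each axial pass shares weights across the other axis, so the temporal branch sees every frequency band with the same attention parameters, which is what lets the two branches specialize in temporal versus spectral dependencies.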
The remainder of this paper is structured as follows. Section 2 provides the materials and methods. Section 3 presents the experimental results, which are discussed in Section 4. Finally, Section 5 concludes the paper.
4. Discussion
This study focuses on abnormal goat lung sound recognition and proposes an improved Swin Transformer-based model, termed AAF-SwinT, for deep feature extraction and classification of goat lung sounds. Experimental analysis verifies the effectiveness of AAF-SwinT. In Section 3.6, we compared the proposed model with multiple mainstream Transformer models, and the results demonstrate that AAF-SwinT outperforms the comparison models on the goat lung sound dataset. It achieves an Accuracy of 88.21%, a 2.68% improvement over the baseline Swin Transformer, while maintaining a favorable balance between Sensitivity (86.99%) and Specificity (89.14%). The Swin Transformer, as a classical hierarchical vision Transformer, has demonstrated strong global feature modeling capability in previous human lung sound recognition studies [43]. This study further verifies the extensibility of Transformer-based models to livestock lung sound recognition tasks.
Regarding attention mechanisms, the experimental results in Section 3.7 show that general-purpose attention modules such as SE, ECA, and CBAM provide limited performance improvement. Similar observations have been reported in related studies, where generic attention mechanisms fail to effectively capture task-specific structural dependencies in acoustic signals [44]. This is mainly because these methods operate on global channel or spatial dimensions without explicitly modeling the interaction between the temporal and frequency dimensions inherent in lung sound signals. In contrast, the proposed ADA module computes attention independently along the temporal and frequency axes and then fuses the results, effectively strengthening time–frequency correlations and reducing feature confusion between rhonchi and normal lung sounds. The misclassification rate of rhonchi as normal sounds is reduced from 12.0% to 10.7%.
The introduction of the FAM further enhances the model’s ability to represent frequency-dependent information, yielding an additional 1.57% accuracy improvement over the baseline model [41]. Previous studies have shown that lung sound signals exhibit significant differences in information distribution across frequency bands [45]. Normal lung sounds, rhonchi, and tachypnea are mainly concentrated in the low- and mid-frequency ranges, whereas high-frequency regions often contain abnormal sounds or noise. By modeling different frequency bands with tailored network capacities, FAM enables the model to focus on diagnostically relevant spectral regions while suppressing irrelevant interference.
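The band-wise capacity allocation described above can be sketched in a few lines. This is not the FAM implementation from Section 2 but a minimal numpy illustration of the principle: the frequency axis is partitioned into bands, and each band is processed by its own MLP, with wider hidden layers assumed for the diagnostically richer low and mid bands. The band edges and expansion ratios here are invented for the example.

```python
import numpy as np

def band_mlp(X, W1, W2):
    """Two-layer MLP with a tanh-approximated GELU, applied per position."""
    h = X @ W1
    h = 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h ** 3)))
    return h @ W2

def frequency_aware_mlp(X, band_edges, weights):
    """Hypothetical FAM sketch: each frequency band of a (T, F, C) map
    gets its own MLP; capacity differs across bands."""
    out = np.empty_like(X)
    for (lo, hi), (W1, W2) in zip(band_edges, weights):
        out[:, lo:hi] = band_mlp(X[:, lo:hi], W1, W2)
    return X + out                      # residual connection

rng = np.random.default_rng(1)
T, F, C = 8, 16, 32
X = rng.standard_normal((T, F, C))
# Assumed split: low/mid bands get a 4x hidden expansion, the high band 2x
band_edges = [(0, 6), (6, 12), (12, 16)]
ratios = [4, 4, 2]
weights = [(rng.standard_normal((C, C * r)) * 0.05,
            rng.standard_normal((C * r, C)) * 0.05)
           for r in ratios]
Y = frequency_aware_mlp(X, band_edges, weights)
print(Y.shape)  # (8, 16, 32)
```

Compared with a shared MLP across all frequencies, this design spends parameters where the discussion above says the diagnostic information lies, and keeps the high-frequency branch deliberately small.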
The performance improvement of AAF-SwinT is accompanied by a moderate increase in computational complexity. Compared with the baseline model [46], floating-point operations increase by 29.06% and the number of parameters increases by a factor of 2.68. This trade-off between performance and complexity is acceptable in complex livestock farming environments. Lightweight Transformer models such as MobileViT [42] have been widely used in human lung sound recognition to balance performance and computational efficiency, and the findings of this study provide useful references for the lightweight optimization of goat lung sound recognition models. Notably, there are currently no publicly available goat lung sound datasets or related recognition studies; this work represents an initial attempt to apply Transformer-based models to goat lung sound recognition, thereby enriching research on abnormal lung sound analysis in ruminants.
Although this study has achieved significant results, several limitations remain. First, the dataset was collected from a single goat farm, so the generalization ability of the model across different environments and animal types requires further validation. Second, although data augmentation was applied, the relatively small dataset size may still affect training stability. Finally, compared with lightweight models, AAF-SwinT still has room for optimization in computational efficiency [47].
At the engineering level, developing an end-to-end intelligent auscultation system requires considering the efficiency of the overall processing pipeline. System latency is determined jointly by the processing time of multiple modules, including preprocessing, feature extraction, and model inference, and the computational load of any stage directly affects overall system performance. In particular, the multi-step filtering and spectral transformation operations in the preprocessing and feature extraction phases incur non-trivial computational costs. Related studies indicate that these processes are among the main computational burdens in resource-constrained embedded systems [48].
Future efforts will focus on diversifying the dataset by incorporating lung sound recordings from various breeds, geographical locations, and management systems, specifically including dairy goats across different age groups and parities, to improve model robustness. To facilitate deployment, we will continue optimizing the model through lightweight techniques such as pruning and knowledge distillation, aiming to reduce computational costs without compromising performance. Furthermore, advanced preprocessing algorithms and hardware acceleration strategies will be explored to enhance real-time efficiency. The ultimate goal is a comprehensive end-to-end diagnostic platform that integrates electronic stethoscopes, real-time transmission, and optimized deep learning models, enabling fully automated “acquisition-to-diagnosis” functionality.
5. Conclusions
For the task of abnormal goat lung sound recognition, this study proposes an improved Swin Transformer-based model, termed AAF-SwinT. By incorporating an axial decomposed attention module, an adaptive spatial aggregation module, and a frequency-aware module, the proposed model is able to more effectively capture the correlations of lung sound signals across temporal and frequency dimensions, highlight salient time–frequency information, and adapt to the characteristic differences among frequency bands, thereby significantly enhancing the recognition capability for abnormal lung sounds. Experimental results on a self-constructed goat lung sound dataset demonstrate that the proposed model achieves an Accuracy of 88.21% and a Score of 88.07%, significantly outperforming several mainstream models. Further comparative experiments on attention mechanisms and comprehensive ablation studies verify the rationality and effectiveness of each proposed module. By jointly considering recognition performance and computational complexity, AAF-SwinT demonstrates strong practical applicability while maintaining high recognition accuracy. The proposed method provides a feasible and effective technical solution for intelligent identification of goat lung diseases and offers valuable support for livestock health monitoring and precision farming. In future work, we will focus on further improving model performance, expanding the scale and diversity of datasets, and exploring the applicability of the proposed method to other livestock species.