1. Introduction
Currently, wearable devices and continuous analysis of monitoring data are key to achieving optimal athletic performance in sports. The respiratory rate and duration of the inhalation and exhalation phases are valuable monitoring parameters, as they can reflect several processes occurring in the body. First, the respiratory rate can indicate the psychological and physical state [1]. Second, certain breathing patterns can increase the risk of musculoskeletal injuries [2]. Some studies have indicated that, statistically, people tend to breathe incorrectly. Therefore, having information about breathing can enable athletes to identify areas for improvement [2]. There are many different methods for determining the respiratory rate; however, difficulties arise with many of them during activity. For example, photoplethysmography (PPG)-based respiratory rate estimators work well at rest but often fail under motion due to strong artifacts [3].
An alternative solution is the use of belts that measure chest wall deformation; however, similar to PPG, it may be difficult to distinguish between inhalation and exhalation phases and to determine the breathing rate during sports activities because of chest deformations unrelated to breathing. A precise gold-standard technique for such measurements is face-mask spirometry, as it allows an accurate evaluation of the duration of breathing phases and even the volume of air used for breathing [4]. These systems are frequently employed as clinical reference tools for validating novel respiratory monitoring approaches, where they serve as ground-truth sources for respiratory rate and breathing cycle determination, enabling computation of breath count error, systematic bias, limits of agreement and other benchmark performance metrics across both rest and exercise conditions [5,6,7].
However, this approach can be difficult and uncomfortable during sports activities. A promising direction for monitoring breathing patterns is the extraction of this information from sound because breathing during physical activity can be heard by the human ear. At the same time, many athletes already wear headphones to listen to music, and such devices can potentially capture audio signals for further processing. This approach can provide valuable information without the need for additional equipment during daily training sessions.
Several studies have aimed to extract breathing information from audio signals [5,8,9,10]. Some approaches combine microphones with additional sensors and employ mathematical signal filtering to improve respiratory event detection [10]. Other systems integrate microphones within face masks used during spirometric assessment, where the closed mask environment can reduce ambient noise and enhance respiratory signals [5]. Although beneficial for respiratory rate estimation, such acquisition conditions differ substantially from the external microphone recording scenario encountered in sports activities and may result in different acoustic features of breathing sounds. There are also contributions that rely solely on audio signals, including the mathematical postprocessing of spectrograms [11], random forest models applied directly to audio [12] and deep learning approaches operating on spectrogram representations [8,9]. However, these methods often depend on signal filtering or controlled recording conditions, or are evaluated primarily in terms of the accuracy of respiratory rate estimation. Furthermore, their performance may degrade in the noisy environments typical of sports activities, even when the respiratory patterns remain visible within the spectrogram representations. These limitations motivate the development of methods capable of robustly detecting breathing events from external microphone recordings acquired under realistic conditions.
Computer-assisted analysis of respiratory sounds has received increasing attention as a complementary tool to conventional medical acoustic examinations. A broad review of automated respiratory sound analysis reported that objective methods can achieve high agreement with expert assessment under controlled conditions [13]. More recent work has highlighted the rapid development of machine learning approaches for respiratory audio processing and noted that time-frequency representations are a core component of many relevant analysis pipelines [14]. Accordingly, spectrogram-based representations have become central to respiratory sound research and enable the application of computer vision techniques to breathing-related tasks. Consistent with these observations, convolutional neural network approaches operating on spectrograms have demonstrated promising performance in respiratory sound classification problems [15].
In this direction, several authors have solved object detection and classification tasks on spectrograms [8,9,16,17]. However, some of them evaluated the effectiveness of their solutions only in terms of respiratory rate estimation accuracy or computer vision task metrics. This does not allow for the assessment of how accurately these methods determine the duration of signals corresponding to different breathing phases.
In addition, several authors have highlighted the negative effects of noise. In existing solutions, the computer vision model alone handles noise; to achieve higher accuracy, the spectrograms themselves could potentially include trainable parameters to provide a better signal representation for the computer vision model.
A key practical constraint is the availability of suitable, annotated training data. Although several public respiratory audio datasets exist, most are designed for clinical anomaly detection (e.g., crackles, wheezes, or sleep apnea) and are commonly recorded using digital stethoscopes [18,19]. Such recordings differ substantially from external microphone measurements in sports settings, both in terms of sensor characteristics and dominant noise sources (e.g., friction, motion and ambient sound). Moreover, many collections provide labels for full respiratory cycles instead of independent timestamped breathing phase events, which limits their applicability to multiclass breathing-phase detection. To our knowledge, no publicly available dataset simultaneously uses external microphone recordings under rest and exercise-like conditions, provides time annotations for inhalation and exhalation as separate events and excludes respiratory anomalies to reflect a healthy sports-monitoring scenario. This lack of directly suitable data motivated the creation of a breathing phase time-annotated dataset from respirations at rest and under load.
We present a neural network architecture for estimating the respiratory rate from audio recordings collected during sports activities. The proposed approach integrates adaptive spectrogram learning with an image object detection-based framework, combining optimization for feature extraction and event detection.
Systematic optimization is also increasingly important in respiratory engineering applications, where performance can depend strongly on parameter selection. For example, Taguchi-based optimization has been shown to improve the operational stability and efficiency of a pulmonary ventilator while reducing its sensitivity to parameter variability [20]. This practical example highlights the broader role of statistically grounded parameter optimization in respiratory technologies. Similar considerations apply to respiratory sound analysis systems intended for use outside controlled environments, where the surrounding noise, sensor placement and subject-specific variability can substantially affect performance.
To achieve this goal, Section 2 discusses the techniques for the automatic adjustment of spectrograms and their integration with computer vision object detection models. We also explain how the outputs of computer vision models can be used to determine the respiratory rate and the durations of the inhalation and exhalation phases. Subsequently, Section 3 describes the methods and conditions used to generate data for training and validating the proposed approach, as well as the training procedure, prediction thresholding approach and hyperparameter selection. Finally, Section 4 discusses the visual changes in the trainable audio front-end spectrograms, the breathing phase detection accuracy and the statistical analysis of the improvement.
2. Materials and Methods
The overall operation of the model is illustrated in Figure 1 and is divided into four parts.
2.1. Audio Recording
The model takes as input a raw acoustic signal recorded with headphones. Considering availability, standard and publicly accessible call center and gaming headphones were used throughout the dataset generation and performance validation. No additional sensor front-end was introduced. Each recording was captured at a 44.1 kHz sampling rate and sliced into 15 s segments, with the last segment padded if necessary. These segments were then passed to the model for spectrogram transformation, breathing phase detection and further respiratory rate processing.
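The slicing step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `slice_into_segments` is hypothetical, and zero padding of the last segment is an assumption, since the text does not state which padding value is used.

```python
SAMPLE_RATE = 44_100   # Hz, as stated in the text
SEGMENT_SECONDS = 15   # s, as stated in the text


def slice_into_segments(samples, sample_rate=SAMPLE_RATE,
                        seconds=SEGMENT_SECONDS, pad_value=0.0):
    """Split a 1-D sequence of samples into equal-length segments.

    The last segment is padded with `pad_value` (assumed to be zero here).
    """
    seg_len = sample_rate * seconds
    segments = []
    for start in range(0, len(samples), seg_len):
        chunk = list(samples[start:start + seg_len])
        chunk.extend([pad_value] * (seg_len - len(chunk)))
        segments.append(chunk)
    return segments


# Toy usage with a tiny "sample rate" so the result is easy to inspect.
toy_segments = slice_into_segments(list(range(10)), sample_rate=2, seconds=2)
```

With the real settings, every segment holds 15 × 44,100 = 661,500 samples.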
2.2. Trainable Spectrograms
In this study, spectrograms make it possible to visualize these audio signal fragments, which can then be tackled as a computer vision task. It is important to note that there are many different methods for creating spectrograms and the choice depends on the specific objectives. For example, Mel spectrograms and Mel-frequency cepstral coefficients are suitable for models focused on speech-related tasks because they adapt the representation to frequencies that humans can distinguish well. In contrast, the constant-Q transform uses a logarithmic frequency layout from the outset, where each octave contains the same number of frequency bins. This makes it more effective for representing signals in musical applications [16].
In this study, Mel spectrograms were chosen to transform the audio information because human breathing can be clearly heard and because, compared to the constant-Q transform, this algorithm requires fewer resources, which can be useful in practical applications. For faster and more resource-efficient training and interaction, the spectrogram was integrated directly into the computer vision model architecture using the nnAudio library [21].
Notably, nnAudio reformulates fixed spectrograms using a set of neural network layers, enabling seamless integration with the training pipeline. The short-time Fourier transform (STFT) is implemented using one-dimensional convolutional kernels that mimic sinusoidal basis functions, whereas the Mel spectrogram is obtained through a linear transformation corresponding to the Mel filterbanks. Both components can be made trainable, enabling the model to adapt these bases during training and to emphasize the frequency regions of interest.
2.3. Object Detection
Various computer vision tasks can be applied to determine the breathing rate (denoted as BPM or Breaths Per Minute) from spectrograms. For example, it can be addressed as a classification problem by dividing the spectrogram into small segments and processing them further [8]. It is worth noting that with this approach, the measurement accuracy may be affected by the duration of the signal being classified. If the window is too large and the breathing rate is high, especially when it is necessary to distinguish between the inhalation and exhalation phases, this parameter will have an even greater impact, since these phases are even shorter and multiple events may fall within a single classification window. On the other hand, using very small windows is statistically inefficient because even if the model itself has high classification accuracy, the large number of predictions will lead to low accuracy in estimating the breathing rate or the duration of the phases.
An alternative solution used in our model is to treat the task as an object detection task. By knowing the positions of the breathing-phase sound signals on the spectrogram, one can determine their duration and the time intervals between them. It should be noted that different strategies can be used to determine the breathing rate based on the results of object detection models.
For example, one can calculate the period between the inhalation or exhalation phases by tracking only one of the respiratory sound events. Another possibility is to evaluate the period between complete breathing cycles by coupling together the recognized consecutive inhale and exhale phases. The choice of a specific strategy to achieve the most accurate results depends on how well the model can detect different types of classes. Differences in accuracy may be caused by the physical limitations of signal measurement (e.g., the microphone’s inability to provide sufficient information). Additionally, the inhale–exhale cycle tracking strategy can be abandoned, as detecting twice as many events for one respiratory rate measurement can lead to degraded performance. Therefore, in this study, different approaches to inhalation and exhalation-tracked sound events were compared.
For our goal, there are many models for solving object detection tasks to choose from, such as Faster R-CNN and Detection Transformer [22,23]. Among them, the YOLO model provides a good balance between speed and accuracy [24]. In its original form, this model works with RGB images; however, spectrograms produce single-channel matrices. Therefore, to use this architecture in our model, the input layers were modified to match the required dimensions.
The choice of YOLOv11 was primarily task-driven and was used here to bridge the object detection methodology with sound event recognition and segmentation in the spectrogram domain. In this study, the downstream target is not only event presence, but also accurate temporal localization of breathing-phase boundaries on the spectrogram time axis, as these boundaries are directly transformed into BPM and phase-duration estimates. Therefore, a single-stage detector is a practical fit for this localization-oriented pipeline while preserving low-latency inference for segment-wise processing. The methodological scope of this study is the reconfiguration of an object-detection pipeline for respiratory sound segmentation with trainable spectrogram front-ends, rather than a full cross-family benchmark against alternative audio architectures.
2.4. BPM Calculation
A brief explanation of the BPM calculation is provided in the previous section, but some aspects of result processing must be considered. It is common to apply rolling functions, such as median filters, over time series data to mitigate any existing noise [8,11]. Although this could increase the accuracy of the predicted BPM, the aim of this study is to improve the respiratory rate calculation pipeline by learning and improving the spectrograms. Hence, no additional smoothing filters were used.
In a similar manner, choosing which predictions to keep or ignore is not straightforward. The YOLO architecture is designed for fast and accurate physical object detection; however, in the case of recognizing frequency patterns along the time axis, it may introduce some unwanted deviations. Mitigating this noise is related to the filtration strategies of choice, which will be analyzed in the following sections.
3. Experiment
This section describes the experimental setup used to evaluate the respiratory rate estimation from audio using the proposed pipeline with both fixed and learnable spectrogram front-ends. We outline the dataset preparation and annotation format, the practical implementation details of the nnAudio waveform-to-spectrogram transformer and the updated object detection architecture that ensures stable optimization with the nnAudio front-end. Finally, we define the post-processing and confidence thresholding strategies used to convert per-segment inhale and exhale detections into a global event timeline and to quantify the performance in terms of both detection quality and respiratory rate error.
3.1. Dataset Generation
To train such an architecture, it is important to have data with precise labeling of the start and end times of the inhalation and exhalation phases. In addition, it is important that the audio signal is not distorted and has properties similar to those of signals used in the real world.
Therefore, generating a dataset using, for example, a face mask is not feasible, because a face mask significantly distorts the audio signal. Chest belts that measure thoracic deformation are also poorly suited because it is difficult to determine the exact start and end of breathing phases from data recorded during sports activities.
Because the position of the microphone plays a major role in this study, the microphone must be located in close proximity to the airways.
To generate the training datasets, a graphical application for supervised breathing was developed that automatically produces both labels and audio files. In the application, depicted in Figure 2, it is possible to set pauses between breathing phases and their corresponding durations. After the application is launched, an indicator similar to a clock hand shows the user which phase they are in. Thus, if the person followed the application’s instructions, the labeling automatically matched the recorded audio signal.
For the training dataset, approximately 1 h of audio was recorded from two individuals, yielding approximately 3000 annotated respiratory phases. Every annotated file was checked both visually and by listening to ensure that all the labels matched the corresponding sound events. Any misaligned events or mistakes in the recording process were removed from the final dataset.
All audio tracks were recorded under controlled conditions and featured varying breathing rates ranging from 10.0 to 45.0 breaths per minute, although the majority of samples fell within the 20–35 BPM interval. The breathing cycle count over the BPM ranges is presented in Table 1.
To better represent light physical exhaustion, which may arise from light running, more than 500 cycles from five recordings were represented in the 20–25 BPM range. Comparable coverage was also ensured for the higher-frequency 25–30 and 30–35 BPM intervals, both contributing more than 360 cycles each. Although the upper bound of the dataset was limited to approximately 40 BPM, the highest breathing rate interval still included approximately 100 annotated cycles, providing a sufficient representation of elevated respiratory dynamics. At the lower end, approximately 150 cycles corresponded to the resting conditions.
The training dataset followed the conventional YOLO image dataset format but replaced images with 15 s audio clips sampled at 44.1 kHz for a higher temporal feature resolution. The 15 s duration was chosen to increase the probability of capturing multiple respiratory cycles within a single sample and to expose the model to a wide variety of breathing phase transitions, while remaining computationally tractable compared to longer durations.
To increase the effective number of training examples while preserving the original acoustic characteristics, a deterministic time-shift augmentation was applied on the time axis by shifting each sample by one quarter of a respiratory cycle and repeating this offset two additional times, resulting in a fourfold increase in the number of annotated breathing phase events (approximately 12,000 events in total). Time-stretching and pitch-shifting augmentations were deliberately excluded because these operations alter the time-frequency features of breathing sounds and may distort the phase-specific spectral patterns that the detector uses for localization. Synthetic noise was not added because the training recordings already included both clean laboratory audio and noisy gym audio, providing exposure to realistic background conditions without introducing artificial noise distributions.
After augmentation, the dataset was split into training and validation sets with an 85–15% ratio (903 and 144 files, respectively). Because the training data originate from only two participants, both are present in both splits; however, to avoid cross-contamination through augmentation, each original audio clip and all of its time-shifted augmented variants are assigned exclusively to either the training or the validation split. Given the limited availability of phase-annotated sports recordings, this dataset prioritizes label fidelity and exercise-relevant BPM coverage, but it does not provide broad demographic or physiological representativeness.
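The leakage-free assignment described above, in which an original clip and all of its time-shifted variants land in the same split, can be sketched as a group-wise split. The function name, the file-naming scheme and the random seed below are hypothetical and serve only as an illustration; the 15% validation fraction follows the text.

```python
import random
from collections import defaultdict


def split_by_source(files, val_fraction=0.15, seed=0):
    """Split (file_name, source_clip_id) pairs so that all variants of one
    source clip end up in exactly one of the two splits."""
    groups = defaultdict(list)
    for name, source in files:
        groups[source].append(name)

    sources = sorted(groups)
    random.Random(seed).shuffle(sources)          # reproducible shuffle
    n_val = max(1, round(len(sources) * val_fraction))
    val_sources = set(sources[:n_val])

    train = [n for s in sources if s not in val_sources for n in groups[s]]
    val = [n for s in sources if s in val_sources for n in groups[s]]
    return train, val


# Toy usage: 20 source clips, each with 4 time-shifted variants.
files = [(f"clip{c}_shift{k}.wav", c) for c in range(20) for k in range(4)]
train, val = split_by_source(files)
```

Splitting on the source clip rather than on individual files keeps near-duplicate audio out of the validation set, at the cost of a split ratio that is only approximately 85–15%.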
To assess the performance of the models trained on this dataset, a separate test dataset was collected under various environmental and usage conditions. Recordings were acquired in both a quiet laboratory environment and a noisy gym setting, reflecting realistic usage scenarios. This test set comprised approximately 34 min of audio from six participants and spanned a substantially broader respiratory rate range of 13–60 BPM, including two 7 min recordings of high-intensity running in a noisy gym environment and four 5 min recordings of normal breathing. Running recordings in the gym contain mixed background interference, including music playback, nearby conversations, additional runners and exercise equipment noise. The normal-breathing recordings included both quieter and busier surroundings. All samples were manually annotated in Audacity by inspecting both spectrograms and waveforms. Importantly, none of the test files appeared in the training or validation data, enabling a fully subject-independent evaluation and allowing the assessment of whether the model relied on speaker-dependent cues rather than generalizable respiratory patterns. With it, we aim to gather preliminary transfer results across unseen participants and conditions, but it remains insufficient for broad real-world robustness claims; a large-scale validation with more participants, exercise conditions and microphones is required.
The train/validation split was used for model training and for early stopping, best-weight selection and extraction of both confidence candidates (PR-intersection and min-MAE). The independent test set was used only for the final performance analysis after the training and threshold selection were fixed.
Acknowledging the limitation of publicly available annotated respiratory phase audio files, the training dataset used in this project was made public through Kaggle [25].
3.2. Practical Implementation of Architecture
In essence, the nnAudio component serves as both the input stage and feature transformation module for the modified YOLO architecture. The model receives raw audio tensors of shape [Batch, $15\ \text{s} \cdot 44.1 \times 10^{3}\ \text{Hz}$], i.e., 661,500 samples per segment. The STFT was computed using a Hann window with a relatively large FFT size of 2048 points to preserve the frequency resolution. The hop length was automatically determined to match a fixed time resolution of 640 frames by dividing the total number of samples in each segment by the number of frames. In our configuration, this corresponds to approximately 1033 samples per hop or a frame spacing of approximately 23.4 ms. The resulting STFT power spectrum was then converted to a logarithmic decibel scale to enhance lower-intensity components, after which 128 trainable Mel-scale filterbanks were applied.
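The hop length and frame spacing quoted above follow from simple arithmetic, reproduced in the short check below. The helper name `hop_for_fixed_frames` is hypothetical, and rounding down to a whole number of samples is an assumption about how the hop length is made integer.

```python
SAMPLE_RATE = 44_100     # Hz
SEGMENT_SECONDS = 15     # s
TARGET_FRAMES = 640      # time frames per spectrogram


def hop_for_fixed_frames(n_samples, n_frames):
    """Integer hop length that spreads n_samples over n_frames (floor)."""
    return n_samples // n_frames


n_samples = SAMPLE_RATE * SEGMENT_SECONDS                 # 661,500 samples
hop_length = hop_for_fixed_frames(n_samples, TARGET_FRAMES)
frame_spacing_ms = 1000 * hop_length / SAMPLE_RATE        # about 23.4 ms
```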
To match the YOLO 640 × 640 input while preserving the feature resolution, a series of spatial transformations were performed. The shallow [Batch, 128, 640] Mel spectrogram leaves only a small area on which the convolution filters of the detection module can be trained; hence, it must be enlarged. Choosing an excessively large number of raw Mel filterbanks tends to cause instability and to increase the computation cost; therefore, the Mel axis is bilinearly interpolated by a factor of 3 to 384 pixels. The enlarged image is then symmetrically zero-padded along the frequency axis to reach the desired resolution as a single-channel image.
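The geometry of this step (128 Mel rows enlarged to 384 and padded to 640) can be illustrated with plain Python lists. This is only a sketch under stated assumptions: both function names are hypothetical, the interpolation keeps the first and last rows fixed (one of several possible alignment conventions) and it acts on the Mel axis only, so it reduces to linear interpolation between neighbouring rows.

```python
def upsample_rows_linear(image, factor):
    """Linearly interpolate along the row (Mel) axis, keeping end rows fixed."""
    n_in = len(image)
    n_out = n_in * factor
    out = []
    for r in range(n_out):
        pos = r * (n_in - 1) / (n_out - 1)    # position in input-row units
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo                          # weight of the upper neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(image[lo], image[hi])])
    return out


def pad_rows_symmetric(image, target_rows):
    """Zero-pad equally above and below until the image has target_rows rows."""
    total = target_rows - len(image)
    top = total // 2
    bottom = total - top
    width = len(image[0])
    zero_row = [0.0] * width
    return ([zero_row[:] for _ in range(top)] + image
            + [zero_row[:] for _ in range(bottom)])


# Toy usage: 128 Mel rows, 4 time frames (the real width is 640).
mel = [[float(r)] * 4 for r in range(128)]
enlarged = upsample_rows_linear(mel, 3)       # 384 rows
padded = pad_rows_symmetric(enlarged, 640)    # 128 zero rows on each side
```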
After temporal and intensity scaling, each sample was independently min-max scaled to the [0, 1] interval to accommodate wide variations in amplitude scaling over different recordings. The resulting single-channel image is directly fed into a modified YOLOv11 backbone, whose first convolution accepts grayscale images instead of the standard three-channel RGB tensor.
The novelty of the proposed architecture lies in the combination and task reconfiguration aspects. First, an image object detector is reconfigured for quasi-periodic respiratory sound event localization on spectrograms and these localized events are then used as midpoint measurements for downstream BPM and phase-duration calculations. Second, the nnAudio front-end is made trainable so that spectrogram bases are iteratively adapted through backpropagation, reducing the reliance on manual feature engineering. Together, this bridges object detection and sound event recognition in a single localization-oriented pipeline, where representation learning and event segmentation are jointly optimized. We refer to this modified object detection architecture as RENI (Representation Enhancement for Neural Imaging). The complete implementation of RENI, including spectrogram transformer integration, training configuration and inference pipeline, is publicly available in the project repository [26].
3.3. Model Training
During training, a grid search over the hyperparameters was performed to achieve the highest possible accuracy. It became clear that the batch size has a significant impact on model accuracy; therefore, batch sizes of 8, 16, 32, 48 and 64 were evaluated. Three different configurations were tested to evaluate the effect of trainable spectrograms on performance: training the STFT kernels, training the Mel filterbanks and using fixed, non-trainable spectrograms as the baseline for comparison.
We used separate parameter groups for the YOLO and nnAudio modules and no learning rate scheduler. Adam’s adaptive per-parameter learning rate provided effective convergence across hyperparameters, which made a scheduler unnecessary. The learning rates were selected per module: object detection frequently benefits from higher learning rates, so the detector was given the default Ultralytics learning rate of $10^{-3}$. In contrast, nnAudio operates directly on low-level spectral representations, where even small parameter updates can cause instability; hence, a low learning rate of $10^{-6}$ was chosen.
A two-stage training strategy was adopted for stable optimization. The object detection component exhibits steep initial gradients that can destabilize the more sensitive spectrogram transformation layers. Hence, during the initial training phase, the nnAudio module was frozen and operated solely as a fixed waveform-to-spectrogram converter. It was determined that freezing the audio front-end for the first 20 epochs was sufficient for the backbone, neck and detection head to reach stable convergence. After this stage, the spectrogram layers were unfrozen and fine-tuned using backpropagation.
The standard YOLO implementation utilizes both AdamW and SGD as optimizers based on the dataset size. For our implementation, AdamW and Adam were evaluated; however, AdamW consistently resulted in lower validation performance. Therefore, all experiments were conducted using the Adam optimizer with no learning rate scheduler.
Model checkpoint selection follows the Ultralytics YOLO implementation of the lowest total validation loss, based on YOLO’s object detection loss metrics. Additionally, early stopping is applied with a patience of 50 epochs, counted from the point at which the nnAudio module is unfrozen.
To ensure the robustness of the results, each model configuration was trained five times. Considering five batch sizes, three spectrogram configurations (trainable STFT, trainable Mel and static), and five independent runs per setting, 75 models were trained and evaluated.
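The size of this experimental grid can be enumerated directly. The configuration labels below are hypothetical names chosen for illustration; the counts follow from the batch sizes, front-end variants and repetitions given in the text.

```python
from itertools import product

BATCH_SIZES = (8, 16, 32, 48, 64)
FRONT_ENDS = ("trainable_stft", "trainable_mel", "static")  # hypothetical labels
N_RUNS = 5

# 5 batch sizes x 3 front-ends = 15 model configurations
configurations = list(product(BATCH_SIZES, FRONT_ENDS))

# each configuration is retrained 5 times = 75 training runs
training_runs = [(batch, front_end, run)
                 for batch, front_end in configurations
                 for run in range(N_RUNS)]
```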
3.4. Object Detection Post-Processing
Each processed audio sample corresponded to a 15 s segment from a continuous recording. For every segment, the model predicts the inhalation and exhalation events in the form of bounding boxes. These predicted areas lay within the 640 px width, which corresponds to the time axis. Knowing the pixel-to-time conversion ratio, the left and right boundaries of each bounding box were converted into start and end timestamps in the time window.
To obtain the global annotated timeline of all sound events, these local timestamps are mapped to the full recording by accounting for the audio segment index $i$ (counted from zero) and the elapsed time. Therefore, each detection was processed as follows:

$$t = \left(i + \frac{x}{640}\right) \cdot 15\ \text{s}$$

where $t$ is the global start or end time, $x$ is the pixel position of the corresponding left or right side of the bounding box and 15 s is the segment duration, which, together with the 640 px width of the spectrogram tensor, forms the time-pixel conversion ratio.
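The pixel-to-time mapping described above can be written as a one-line helper. The function name is hypothetical, and a zero-based segment index is assumed, so that the first segment starts at 0 s.

```python
SEGMENT_SECONDS = 15.0   # duration of one audio segment
IMAGE_WIDTH_PX = 640     # spectrogram width along the time axis


def pixel_to_global_time(segment_index, x_px,
                         segment_seconds=SEGMENT_SECONDS,
                         width_px=IMAGE_WIDTH_PX):
    """Map a bounding-box edge (pixels) in a given segment to seconds
    from the start of the full recording (segment_index starts at 0)."""
    return (segment_index + x_px / width_px) * segment_seconds


# A box edge in the middle of the first segment lies at 7.5 s.
mid_first = pixel_to_global_time(0, 320)
# The right edge of the third segment and the left edge of the fourth coincide.
edge_a = pixel_to_global_time(2, 640)
edge_b = pixel_to_global_time(3, 0)
```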
Additional filtration steps are taken to refine the detected breathing phase consistency. First, same-class predictions crossing the segment boundaries were merged. Second, overlapping boxes are resolved by retaining only the prediction with the highest confidence score.
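These two filtration steps can be sketched as follows. The sketch is an illustration under several assumptions that the text does not specify: the `Event` container and both function names are hypothetical, a 50 ms tolerance decides whether a box touches a segment boundary, a merged event keeps the higher of the two confidences, and overlaps are resolved greedily across both classes.

```python
from dataclasses import dataclass


@dataclass
class Event:
    start: float   # global start time, s
    end: float     # global end time, s
    label: str     # "inhale" or "exhale"
    conf: float    # detector confidence


def merge_across_boundaries(events, segment_seconds=15.0, tol=0.05):
    """Join same-class events that touch the same segment boundary."""
    merged = []
    for ev in sorted(events, key=lambda e: e.start):
        prev = next((m for m in reversed(merged) if m.label == ev.label), None)
        boundary = round(ev.start / segment_seconds) * segment_seconds
        touches = (prev is not None and boundary > 0
                   and abs(prev.end - boundary) <= tol
                   and abs(ev.start - boundary) <= tol)
        if touches:
            prev.end = max(prev.end, ev.end)
            prev.conf = max(prev.conf, ev.conf)
        else:
            merged.append(Event(ev.start, ev.end, ev.label, ev.conf))
    return merged


def resolve_overlaps(events):
    """Keep the most confident events; drop any that overlap a kept one."""
    kept = []
    for ev in sorted(events, key=lambda e: e.conf, reverse=True):
        if all(ev.end <= k.start or ev.start >= k.end for k in kept):
            kept.append(ev)
    return sorted(kept, key=lambda e: e.start)


detections = [
    Event(13.0, 15.0, "exhale", 0.8),   # cut off at the 15 s boundary
    Event(15.0, 16.2, "exhale", 0.6),   # continuation in the next segment
    Event(16.0, 17.5, "inhale", 0.5),   # overlaps the merged exhale
    Event(20.0, 21.0, "exhale", 0.7),
]
merged = merge_across_boundaries(detections)
timeline = resolve_overlaps(merged)
```

In this toy input the two exhale fragments are joined into one event from 13.0 s to 16.2 s, and the weaker overlapping inhale is discarded.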
The resulting global event sequence was then used to estimate the breathing rate. This is computed as the number of Breaths Per Minute (BPM), defined as 60 s divided by the time interval between consecutive events of the same class (inhalation or exhalation).
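As a small worked example of this definition, the helper below turns a list of same-class event times into instantaneous BPM values. The function name is hypothetical, and which time point represents an event (for instance, its midpoint) is left to the caller.

```python
def bpm_series(event_times):
    """Instantaneous breaths per minute from consecutive same-class events.

    event_times: one time stamp in seconds per detected event of a single
    class (all inhalations or all exhalations).
    """
    times = sorted(event_times)
    return [60.0 / (b - a) for a, b in zip(times, times[1:]) if b > a]


# Events 2 s apart give 30 BPM; a 3 s gap gives 20 BPM.
example_bpm = bpm_series([0.0, 2.0, 4.0, 7.0])
```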
The resulting BPM over time, derived from the predicted bounding box positions, is shown in Figure 3. To assess the deviation of the estimated respiratory rate from the ground truth, the Mean Absolute Error (MAE) was computed over the full duration of each recording and for inhalation and exhalation tracked events separately. This is then repeated for each retrained configuration and the lowest MAE (denoted as min-MAE) is chosen to mitigate the effects of poor weight initialization.
For the test set evaluation, each model was run over the entire 34 min test set, which formed 136 windows of 15 s each. These segment-level predictions were merged to form the corresponding audio timeline before the BPM and MAE were computed. Finally, to summarize the model deviation over this audio set, an inverse-duration weighted mean was applied to all the MAE values.
3.5. Prediction Thresholding
The accuracy of the estimated BPM is influenced not only by the positioning in time of the detected breathing phases but also by the presence of false-positive and false-negative predictions, which are governed by the selected confidence threshold (see Figure 3). A threshold that is too low may introduce unwanted detections, leading to an overestimation of the respiratory rate, whereas an overly strict threshold may suppress valid events, resulting in an underestimation.
In this section, two confidence thresholding strategies are evaluated with respect to both the bounding box detection performance and the resulting respiratory rate estimation. Subsequently, the impact of different spectrogram representations, both fixed and trainable, on the model performance was analyzed.
The Ultralytics YOLO framework evaluates model performance with the Precision and Recall metrics, which are based on the overlap between the predicted and ground-truth bounding boxes, with the Intersection over Union (IoU) threshold set to 0.5 by default. In the case of sound event segmentation, a modified IoU was used, computed only along the time axis and ignoring the box height along the Mel axis. Although this approach is efficient for recognizing breathing phases, the positioning of the bounding boxes is crucial, as small deviations along the time axis of the spectrogram image can lead to significant deviations in the breathing rate calculation.
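A time-axis-only IoU of the kind described above can be sketched in a few lines. Both function names are hypothetical, and the greedy one-to-one matching used to count true positives is a simplification for illustration, not a reproduction of the Ultralytics matching procedure.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds; box height is ignored."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_thr=0.5):
    """Greedy matching: each ground-truth interval is used at most once."""
    unmatched = list(truths)
    true_pos = 0
    for p in preds:
        best = max(unmatched, key=lambda t: temporal_iou(p, t), default=None)
        if best is not None and temporal_iou(p, best) >= iou_thr:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall


# Two of three predictions match the two annotated events.
predictions = [(0.1, 2.0), (5.0, 6.0), (9.0, 9.5)]
annotations = [(0.0, 2.0), (5.2, 6.1)]
precision, recall = precision_recall(predictions, annotations)
```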
Hence, there is a need to evaluate the confidence threshold strategy. One approach is based on the aforementioned YOLO standard of balancing precision and recall, while another method is goal-oriented thresholding by minimizing the respiratory rate error.
The optimal confidence boundary is determined by the response of the metric to a confidence sweep. This procedure gathers all predictions at a small initial confidence threshold (typically 0.001) and then filters them through a confidence range of 0.01–1.00. At each step, Precision and Recall are evaluated. The confidence at which the precision and recall (PR) curves intersect is taken as the threshold that best balances false-positive and false-negative detections. However, a high count of true detections does not guarantee accurate temporal placement or a precise resulting BPM.
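The sweep can be sketched as follows. This is a simplified illustration, not the Ultralytics routine: it assumes that each low-threshold prediction has already been labelled as a true or false positive (`is_tp`), that this labelling stays fixed during the sweep, and that the intersection is approximated by the grid point with the smallest gap between Precision and Recall. All function names are hypothetical.

```python
import numpy as np


def sweep_precision_recall(conf, is_tp, n_truth, grid):
    """Precision and Recall at each confidence threshold in `grid`."""
    conf = np.asarray(conf, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    precision, recall = [], []
    for c in grid:
        kept = conf >= c
        tp = int(np.sum(is_tp & kept))
        n_kept = int(np.sum(kept))
        # Convention (assumed): precision is 1.0 when nothing is kept.
        precision.append(tp / n_kept if n_kept else 1.0)
        recall.append(tp / n_truth if n_truth else 0.0)
    return np.array(precision), np.array(recall)


def pr_intersection_threshold(grid, precision, recall):
    """Grid confidence at which |Precision - Recall| is smallest."""
    gap = np.abs(np.asarray(precision) - np.asarray(recall))
    return float(np.asarray(grid)[int(np.argmin(gap))])


# Confidence grid 0.01, 0.02, ..., 1.00, as in the sweep described above.
grid = np.round(np.arange(0.01, 1.0001, 0.01), 2)
```

When several grid points tie, `np.argmin` returns the lowest confidence among them; a different tie-breaking rule would be an equally valid design choice.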
Hence, we introduced an additional thresholding criterion based on minimizing the objective error. It is determined over the same confidence sweep by calculating the MAE of the respiratory rate for the given breathing phase at each confidence step. The confidence value that yielded the lowest MAE was selected.
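A minimal sketch of this goal-oriented selection is given below. It assumes, purely for illustration, that the respiratory rate of a window is obtained by counting the retained events of one class and scaling to one minute; the BPM computation used in this study may differ. The names `bpm_from_count` and `min_mae_threshold` are hypothetical.

```python
import numpy as np


def bpm_from_count(n_events, window_s):
    """Breaths per minute from an event count (assumed estimator)."""
    return 60.0 * n_events / window_s


def min_mae_threshold(windows, grid, window_s=15.0):
    """Return (best_confidence, best_mae) for one breathing-phase class.

    `windows` is a list of (confidences, reference_bpm) pairs, one per
    audio window, where `confidences` holds the detection scores of
    that class inside the window.
    """
    maes = []
    for c in grid:
        errors = []
        for confidences, reference_bpm in windows:
            n_kept = int(np.sum(np.asarray(confidences, dtype=float) >= c))
            errors.append(abs(bpm_from_count(n_kept, window_s) - reference_bpm))
        maes.append(float(np.mean(errors)))
    best = int(np.argmin(maes))
    return float(grid[best]), maes[best]
```

Unlike the PR-intersection criterion, this selection never looks at box overlap: a threshold is preferred only if it brings the estimated rate closer to the reference.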
Furthermore, unlike the standard YOLO implementation, where a single global confidence threshold is used for all classes, thresholding is performed separately for each of the two respiratory events. Inhalations and exhalations differ in both acoustic patterns and their spectrogram representations, as exhales are typically more pronounced and easier to distinguish than weaker inhalations. Consequently, the application of class-specific thresholds allows for more targeted filtering and comparison of per-phase BPM calculations.
Two candidate confidence thresholds were determined from the validation dataset for each of the 15 model configurations, as illustrated in
Figure 4. These thresholds were subsequently evaluated on the test set by applying each value independently per class and computing the resulting respiratory rate error for both sound events. The selected thresholds were fixed before the test time inference, with no further threshold sweep, threshold tuning, or model re-selection based on the test outcomes. Class-specific confidence calibration is used to reduce bias caused by different inhale–exhale detectability, not to introduce extra test time optimization freedom.
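The frozen, class-specific filtering step can be sketched as below. The detection record layout (a dictionary with `cls` and `conf` keys) and the threshold values are placeholders chosen for illustration and are not taken from this study.

```python
def filter_by_class_threshold(detections, thresholds):
    """Keep detections whose confidence meets the threshold of their class.

    `detections`: list of dicts with at least 'cls' and 'conf' keys.
    `thresholds`: dict mapping class name to a confidence threshold that
    was fixed on the validation set before test-time inference.
    """
    return [d for d in detections if d["conf"] >= thresholds[d["cls"]]]


# Hypothetical thresholds, fixed before any test-set inference.
FROZEN_THRESHOLDS = {"inhale": 0.25, "exhale": 0.40}
```

Keeping the thresholds in a constant that is defined before the test data are loaded makes it harder to tune them on test outcomes by accident.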
4. Results
To demonstrate how trainable spectrogram front-ends affect image-based respiratory rate estimation, the model configuration detection efficiency is reported in the following sections.
4.1. Spectrogram Representation Analysis
Figure 5A depicts the spectrograms produced with nnAudio front-end basis function training enabled. For visualization purposes, these image tensors are colored, but grayscale-normalized spectrograms are passed to the detector. Visually, the static and trainable STFT spectrograms remain relatively similar, whereas the trainable Mel representation exhibits stronger structured changes, with several Mel bins enhanced and others attenuated, producing pronounced dark streaks throughout the respiratory events.
To better understand the changes within the configurations, Figure 5B shows the difference maps relative to the static spectrogram. For trainable Mel, multiple intensity changes are visible (up to approximately ±0.75 on the normalized intensity scale), which diminish most of the Mel filterbank bins. However, multiple event-related patterns can be observed. During inhalations, some Mel bins around 64 remain unchanged, but the same bins are less intense in exhale-correlated time intervals. In contrast, exhalations show higher intensities around the highest frequency bins. Notably, there are no noticeable changes in the intervals between breathing events. This pattern suggests that the trainable Mel front-end adapts features mainly around respiratory events rather than uniformly over time.
For the trainable STFT case, the observed difference range is much smaller (approximately ±0.03), which explains why this configuration remains close to the static baseline. Notably, the dominant frequency patterns change only slightly, with their relative intensity changing by up to approximately −0.02, whereas the corresponding higher frequency bins show the largest intensity deviations. Unlike the Mel case, the background pause signal between events was slightly enhanced. Although these changes are minimal, the trainable audio front-end enables feature enhancement, which can improve detection accuracy and event localization.
4.2. Event Detection and BPM Error Analysis
Regarding the choice of breathing phase to use for respiratory rate estimation, the most accurate BPM results are expected from the class with the stronger detection quality. For this purpose, the F1 score, defined as the harmonic mean of Precision and Recall, was used as a balanced event-detection metric. For each of the 75 trained models (15 configurations across five training iterations), F1 scores were computed for both sound events under each confidence-threshold strategy, yielding 150 F1 measurements per strategy. Taking the median across all configurations provides a robust estimate that is less sensitive to outliers from unfavorable initializations. The results in Table 2 summarize this per-class behavior.
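For reference, the per-class summary statistic can be sketched in a few lines. The helper names are hypothetical, and the zero-division convention (F1 = 0 when both Precision and Recall are zero) is an assumption.

```python
from statistics import median


def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def median_f1(pr_pairs):
    """Median F1 over (precision, recall) pairs, e.g. one pair per trained model."""
    return median(f1_score(p, r) for p, r in pr_pairs)
```

The median is used instead of the mean so that a single poorly converged run does not dominate the per-class summary.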
Notably, the exhalation class had substantially higher F1 scores than the inhalation class, with differences of 0.2634 and 0.3069 for PR-intersection and min-MAE thresholding, respectively. This indicates a strong class asymmetry in the recognition of events. This trend is consistent with the spectral characteristics discussed earlier, where exhalations are typically more pronounced and visually distinguishable than the inhalations.
Therefore, the respiratory rate estimation is based primarily on exhale detections in the spectrogram domain.
Table 2 also indicates that the min-MAE confidence strategy provides a small exhale-side advantage (approximately 0.04 in F1) compared with the PR-intersection, while the inhale F1 remains near 0.61 in both strategies.
The effect of the confidence-filtering strategy on inhale- and exhale-tracked BPM is illustrated in
Figure 6. The plot shows the lowest MAE models across all trainable and static spectrogram settings and batch sizes. As expected from the detection asymmetry, the inhale-tracked BPM contained more missed events and, therefore, a larger underestimation error. Across all spectrogram types and thresholding strategies, the inhale-tracked MAE remained above 5.88 BPM and exceeded 7 BPM for trainable Mel configurations.
Significant improvements were observed in the exhale-tracked BPM. Under standard PR-intersection filtering, the MAE remained below 2 BPM for all three spectrogram configurations. The trainable STFT provides a limited change (1.93 BPM) relative to the static front-end (1.98 BPM), whereas the trainable Mel reduces the error by approximately half a BPM. Because these comparisons were evaluated on independent test recordings from unseen participants in both laboratory and gym settings, the observed MAE reduction suggests preliminary cross-subject transfer rather than only in-split optimization.
Under min-MAE-guided filtering, the effect of the trainable front-end is more pronounced. Although the static baseline yields a higher MAE (2.6 BPM) than under PR-intersection (1.98 BPM), the trainable STFT and trainable Mel reduce this value to 1.69 and 1.15 BPM, respectively.
To further analyze the findings reported in Figure 6, Table 3 reports condition-specific exhale-tracked weighted MAE values for recordings taken at rest and while running. Across both thresholding approaches and all spectrogram configurations, the at-rest recordings consistently yielded lower MAE than the running recordings. Under PR-intersection, static and trainable STFT remain relatively close across the analyzed tasks (differences of 0.38 and 0.22 BPM, respectively), whereas trainable Mel shows a larger gap (1.01 vs. 2.24 BPM). Under min-MAE thresholding, the static front-end degrades most strongly during running (3.43 BPM), whereas the trainable STFT improves at rest (1.49 BPM) and remains close to its PR-intersection running value (2.20 BPM). The trainable Mel front-end again provided the lowest at-rest deviation (0.77 BPM) but still showed a larger deviation within the running subgroup (2.15 BPM), corresponding to a difference of 1.38 BPM.
Taken together, these findings indicate that confidence calibration and spectrogram adaptation interact, and the strongest practical gains appear when trainable representations are paired with a task-oriented min-MAE threshold criterion. After establishing the main performance trends, we quantified the stability of these effects across repeated retraining.
4.3. Statistical Consistency Across Repeated Trainings
As noted earlier, all audio front-end combinations were retrained multiple times to reduce the convergence-related variability. To quantify the stability of the trainable spectrogram effects on BPM accuracy, all five retrainings per configuration were analyzed. Because exhale events provide the most accurate detection metrics, inhale-specific results are omitted from this analysis.
For each comparison, the accuracy difference between the trainable spectrogram and the corresponding static front-end configuration is defined as
$$\Delta = \mathrm{MAE}_{\mathrm{trainable}} - \mathrm{MAE}_{\mathrm{static}},$$
where $\Delta < 0$ indicates a lower error for the trainable frontend. Uncertainty was estimated using non-parametric bootstrap confidence intervals (95% CI,
). In parallel, one-sided permutation testing was applied to assess whether the observed differences were statistically significant.
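A percentile bootstrap for the trainable-minus-static MAE difference could be sketched as follows. This is an illustrative sketch under stated assumptions: the number of resamples (`n_boot`), the use of the group mean inside each resample, and the function name `bootstrap_delta_ci` are all assumptions and are not taken from this study.

```python
import numpy as np


def bootstrap_delta_ci(mae_trainable, mae_static, n_boot=10_000, alpha=0.05, seed=0):
    """Median and percentile CI of mean(MAE_trainable) - mean(MAE_static).

    Both groups are resampled independently with replacement.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(mae_trainable, dtype=float)
    b = np.asarray(mae_static, dtype=float)
    boot_a = rng.choice(a, size=(n_boot, a.size), replace=True).mean(axis=1)
    boot_b = rng.choice(b, size=(n_boot, b.size), replace=True).mean(axis=1)
    deltas = boot_a - boot_b
    lo, hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    return float(np.median(deltas)), float(lo), float(hi)
```

An interval whose upper bound is below zero corresponds to a consistently lower MAE for the trainable front-end; with only five retrainings per group, such intervals are necessarily coarse.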
The analysis was conducted at two levels to separate the local batch-specific behavior from the global tendency of each frontend-threshold pair. First, at the per-batch level, the difference is estimated from independently resampled trainable and static MAE values, and each contrast is evaluated using an unpaired permutation test. Second, for an overall estimate across model settings, the five batch-level deltas are pooled and tested with an exact blocked sign-flip procedure over all $2^5 = 32$ sign assignments.
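Both tests can be sketched as follows. The sketch assumes the mean as the test statistic, a one-sided alternative in which the trainable front-end has the lower MAE, and a Monte Carlo approximation for the unpaired test; `n_perm` and the function names are hypothetical.

```python
from itertools import product

import numpy as np


def unpaired_permutation_pvalue(mae_trainable, mae_static, n_perm=5000, seed=0):
    """One-sided Monte Carlo p-value for mean(trainable) < mean(static)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(mae_trainable, dtype=float)
    b = np.asarray(mae_static, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[: a.size].mean() - perm[a.size:].mean()
        if diff <= observed + 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def sign_flip_pvalue(deltas):
    """Exact one-sided p-value over all 2^n sign assignments of the deltas."""
    d = np.asarray(deltas, dtype=float)
    observed = d.mean()
    count = 0
    total = 0
    for signs in product((1.0, -1.0), repeat=d.size):
        total += 1
        if (d * np.array(signs)).mean() <= observed + 1e-12:
            count += 1
    return count / total
```

With five batch-level deltas, the exact sign-flip test enumerates 32 assignments, so its smallest attainable one-sided p-value is 1/32 (about 0.031). This limited resolution is worth keeping in mind when reading pooled p-values that sit close to 0.05.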
Figure 7 shows that most per-batch median effects are negative, indicating a lower MAE for the trainable frontends in many settings. At the same time, the CIs vary substantially across batch sizes. The smallest batch (8) exhibited the widest intervals, consistent with a higher per-run variance, whereas at batch size 32, both trainable frontends under both thresholding strategies moved toward positive or near-zero shifts, indicating weaker performance relative to static spectrograms.
The position of the CI relative to zero provides a more interpretable and reliable indication of the front-end effect. A fully negative interval indicates a trainable-frontend advantage, a fully positive interval indicates degradation, and an interval crossing zero indicates an inconclusive effect.
When analyzing CIs, a clear contrast appears between the thresholding methods. Under the PR-intersection, all per-batch intervals cross the zero mark. Under min-MAE thresholding, fully negative intervals are obtained for trainable Mel at batch sizes 16 (CI: ) and 64 (CI: ) and for trainable STFT at batch size 64 (CI: ).
When pooling all batch-size iteration results and applying the blocked bootstrap protocol, the same pattern is preserved, as shown in
Figure 8. Only trainable Mel with min-MAE provides a median improvement of 3.06 BPM relative to static spectrograms while keeping its 95% CI fully below zero (CI:
). Simultaneously, the pooled PR-intersection effects and pooled trainable STFT effects still cross zero.
These bootstrap findings are reinforced by the permutation
p-values in
Figure 9. For PR-intersection, the
p-values remain comparatively high, especially at a batch size of 32, where both trainable frontends are close to 1.0. Similarly, the trainable STFT at batch sizes 48 and 8 yielded
and
, respectively. The closest PR-intersection case to significance is the trainable Mel at a batch size of 48 with
.
Stronger evidence was observed under the min-MAE thresholding. Although both frontends remain above 0.8 at a batch size of 32, the trainable STFT and trainable Mel at a batch size of 64 are below 0.05 ( and , respectively). In addition, the trainable Mel at a batch size of 16 was significant ().
At the pooled level (
Figure 10), none of the contrasts reached $p < 0.05$. However, the min-MAE with trainable Mel remained close to this boundary (
), whereas trainable STFT produced the highest pooled
p-values, most notably under PR-intersection (
).
5. Discussion
Based on the findings in the previous section, the spectrogram representation analysis indicates that trainable Mel front-ends produce stronger and more structured feature adaptation than trainable STFT, while STFT remains closer to the static baseline. This pattern is consistent with the observed performance trend and suggests that Mel filterbank reweighting is the main contributor to the strongest exhale-tracked improvements in this dataset.
The event detection and BPM results showed a consistent inhale–exhale asymmetry. Exhalations were detected more reliably, yielding the higher F1 score of the two events, and exhale-tracked BPM remained the most accurate operating mode, providing substantially lower MAE values than inhale-tracked BPM. This class-specific bias is likely because inhalations have higher acoustic variance than exhalations at different breathing rates and loads. Condition-specific analysis further shows that running segments are harder than at-rest segments across all front-ends, while trainable Mel and trainable STFT still preserve competitive accuracy under the min-MAE thresholding strategy.
Repeated-training statistics indicate that the accuracy gains over the static spectrogram counterparts are not uniform across all configurations. The most consistent evidence appears for trainable Mel and trainable STFT configurations under min-MAE thresholding, where at a batch size of 64, the 95% confidence interval is fully negative and the p-values are below the 0.05 significance threshold, further supporting the advantages of a differentiable audio front-end. At the pooled level, the effects remain directionally favorable, especially with the Mel scale; however, given the high variance across batch sizes owing to random weight initialization, these conclusions are best interpreted as robust trends rather than universal improvements across all configurations.
However, several limitations should be considered. The gathered training and testing content is still limited in size and demographics; therefore, transferability beyond the studied conditions remains preliminary. Robustness is currently supported by mixed-environment results but not yet by controlled SNR benchmarking with explicit noise profiling. In addition, current evidence comes from a single acquisition workflow, even though the subjects are disjoint; therefore, validation on independent external cohorts with frozen thresholds is still needed. Baseline diversity is also limited because this study compares static and trainable front-ends within one fixed object-detection backbone, rather than across multiple audio-model families.
From a deployment perspective, the current results indicate the feasibility of real-time exhale-based tracking; however, this has not yet been validated through an end-to-end embedded real-time benchmark. Therefore, future work should extend the validation to larger multi-subject respiratory datasets, quantify the robustness under controlled testing protocols, and evaluate the stability under realistic motion conditions with deployment-oriented hardware constraints.
6. Conclusions
In this study, we present a respiratory rate estimation framework that adapts an image-based object detection head (YOLO) for sound event detection by treating breathing phases as time-localized spectrogram objects. By localizing the inhale and exhale events on the spectrogram time axis, this method enables the direct estimation of time intervals and phase durations, avoiding the limitations of window-based classification under variable breathing rates and short event durations. In addition, integrating raw waveform-to-spectrogram tensor conversion into the model via nnAudio enables an end-to-end trainable front-end in which STFT kernels or Mel filterbanks can be optimized through backpropagation. This emphasizes breathing-relevant frequency regions while reducing the need for manual spectrogram hyperparameter tuning. The model architecture, along with the custom-annotated audio dataset, is publicly available on GitLab [26] and Kaggle [25], respectively.
Across the evaluated front-ends, the trainable Mel representation provided the most accurate respiratory rate estimates, outperforming both static and trainable STFT configurations and achieving a mean absolute error of 1.15 breaths per minute. This result is consistent with the spectrogram representation analysis, where Mel adaptation showed stronger structured feature reweighting than the STFT in the current dataset.
The BPM accuracy in this pipeline depends mainly on temporal localization along the spectrogram time axis rather than on generic detection metrics alone. In line with this objective, class-specific confidence selection based on validation min-MAE yielded a stronger test set performance than PR-intersection thresholding. The resulting conclusions are primarily supported for exhale-based respiratory rate estimation, while inhale detection remains less robust and running segments remain more challenging than at-rest recordings.
Under the studied conditions, these findings support the feasibility of accurate exhale-based respiratory tracking from a simple audio acquisition setup; however, broader deployment claims still require larger and more diverse breathing audio test datasets.
Statistical analysis supports this conclusion. Compared with PR-intersection thresholding, the min-MAE strategy provides the most consistent evidence of trainable-frontend benefits, including fully negative batch-level confidence intervals and significant one-sided permutation results in multiple batch settings. At the pooled level, the effects remained directionally favorable for trainable Mel but did not cross the strict 0.05 significance threshold, indicating robust practical trends with moderate residual uncertainty.